TensorCat committed · Commit 683bf13 · verified · 1 Parent(s): 8a70417

Update README.md

  1. README.md +680 -81
README.md CHANGED
@@ -27,6 +27,34 @@ tags:
27
  - education
28
  - tensor-talk
29
  ---
30
 
31
  # TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering
32
 
@@ -612,7 +640,7 @@ It shows that retrieval grounding dramatically improves answer quality compared
612
 
613
  ## 6.1 Why an Agent Is Needed
614
 
615
- The handbook is reliable for stable academic rules, but some questions may require official web information.
616
 
617
  Examples:
618
 
@@ -621,26 +649,87 @@ Who is the current dean?
621
  Where can students find residential college information?
622
  What official page mentions PEKOM?
623
  Where is the official SPeCTRUM page?
624
  ```
625
 
626
- For these cases, the system needs a controlled web agent.
627
 
628
- However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent.
629
 
630
  ---
631
 
632
- ## 6.2 Official UM / FSKTM Web Agent
633
 
634
- The web agent is constrained to official UM / FSKTM domains.
635
 
636
- Priority domains:
637
 
638
  ```text
639
  fsktm.um.edu.my
640
  www.um.edu.my
 
641
  ```
642
 
643
- Auxiliary official domains include UM-related systems such as:
 
 
644
 
645
  ```text
646
  aasd.um.edu.my
@@ -653,41 +742,202 @@ intra.fsktm.um.edu.my
653
  gallery.fsktm.um.edu.my
654
  ```
655
 
656
- The agent performs:
657
 
658
  ```text
659
- query planning
660
- official web discovery
661
- URL filtering
662
- page fetching
663
- evidence extraction
664
- evidence scoring
665
- Qwen-based evidence judging
666
- retry if weak
667
- fallback to handbook RAG if needed
668
  ```
669
 
 
 
 
 
 
 
 
 
 
670
  ---
671
 
672
- ## 6.3 Agent Is Not Fully Autonomous by Design
673
 
674
- This project does not use a completely unrestricted autonomous agent.
675
 
676
- That is intentional.
677
 
678
- For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be:
 
 
 
 
679
 
680
  ```text
681
- safe
682
- source-constrained
683
- traceable
684
- fallback-aware
685
- grounded
 
 
686
  ```
687
 
688
- So the agent is better described as:
689
 
690
- > A constrained official-source web agent controlled by Harness Engineering.
691
 
692
  ---
693
 
@@ -695,35 +945,175 @@ So the agent is better described as:
695
 
696
  ## 7.1 What Harness Engineering Means Here
697
 
698
- Harness Engineering is the guardrail system around the model and agent.
699
 
700
  A simple analogy:
701
 
702
  ```text
703
  The LLM/agent is the car.
704
- Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard.
705
  ```
706
 
707
  The model can generate fluent answers, but the harness controls:
708
 
709
- - what it is allowed to search
710
- - what sources it can trust
711
- - whether a URL is fake
712
- - whether evidence is useful
713
- - whether the answer is grounded
714
- - whether the system should retry
715
- - whether it should fall back to local handbook RAG
716
- - what trace should be shown to the user
 
 
 
 
 
717
 
718
  ---
719
 
720
- ## 7.2 Harness Pipeline
 
 
721
 
722
- The standardized TensorTalk Harness Core follows this structure:
723
 
724
  ```text
725
  User Question
726
 
 
 
727
  Local Handbook RAG
728
 
729
  Official Web Discovery
@@ -742,96 +1132,271 @@ Entity-aware Retry
742
 
743
  Weak Evidence Fallback
744
 
745
- Answer Generator
746
 
747
  Answer Grounding Judge
748
 
749
  Completeness Guard
750
 
 
 
751
  UI Trace
752
  ```
753
 
 
 
754
  ---
755
 
756
- ## 7.3 Harness Components
757
 
758
- The notebooks include several engineering patches and layers:
759
 
760
- ### V14 WAF-aware Harness
761
 
762
- Handles web pages blocked by WAF or browser failures.
763
 
764
- Functions:
765
 
766
- - detect WAF block pages
767
- - exclude blocked pages from evidence
768
- - provide diagnostics
769
- - use safe static fallback if browser click fails
770
- reject query-fabricated URLs before evidence building
771
 
772
  ---
773
 
774
- ### V15 Qwen Evidence Judge Loop
775
 
776
- Adds an LLM-based evidence judge.
777
 
778
- Flow:
779
 
780
  ```text
781
- Planner
782
- → Search / Fetch
783
- → Evidence Filter
784
- → Qwen Judge
785
- → Retry
786
- → Final Evidence
787
  ```
788
 
789
- The purpose is to avoid trusting weak web snippets blindly.
790
 
791
  ---
792
 
793
- ### V16 Local-aware Judge Repair
794
 
795
- Improves routing and fallback.
796
 
797
- It handles:
798
 
799
  ```text
800
- PEKOM routing
801
- CCNA Lab routing
802
- residential college routing
803
- local RAG fallback
804
- entity-aware retry
805
- fake URL rejection
806
  ```
807
 
 
 
808
  ---
809
 
810
- ### V17 Strict Entity Judge and UI Polish
811
 
812
- Adds stricter entity matching and improves trace display.
813
 
814
- This helps avoid cases where a query about one entity is answered with another related but wrong page.
 
 
 
 
 
 
 
 
 
815
 
816
  ---
817
 
818
- ### V18 Balanced Official Reference Fallback
819
 
820
- Allows the system to still provide official references when strong web evidence is not enough, while avoiding over-trusting weak pages.
821
 
822
  ---
823
 
824
- ### V19 Answer Grounding Judge
 
 
825
 
826
- Checks whether the final generated answer is actually supported by evidence.
 
 
 
 
 
 
827
 
828
- This is important because even if retrieval is correct, the model may still introduce unsupported details.
829
 
830
  ---
831
 
832
- ### Completeness Guard
833
 
834
- Checks whether the answer is too incomplete and whether a rewrite or fallback should be triggered.
835
 
836
  ---
837
 
@@ -1371,7 +1936,41 @@ Non-PPO fallback is forbidden in the final Improved Model demo.
1371
 
1372
  ---
1373
 
1374
- # 18. Summary
1375
 
1376
  TensorTalk demonstrates a staged LLM system development workflow:
1377
 
 
27
  - education
28
  - tensor-talk
29
  ---
30
+ ---
31
+ license: other
32
+ language:
33
+ - en
34
+ - zh
35
+ tags:
36
+ - qwen3
37
+ - qwen3-8b
38
+ - lora
39
+ - qlora
40
+ - sft
41
+ - rag
42
+ - faiss
43
+ - dense-retrieval
44
+ - agent
45
+ - ppo
46
+ - rlhf
47
+ - rule-reward
48
+ - harness-engineering
49
+ - um-handbook
50
+ - question-answering
51
+ - chatbot
52
+ - education
53
+ - tensor-talk
54
+ pipeline_tag: text-generation
55
+ base_model: Qwen/Qwen3-8B
56
+ library_name: transformers
57
+ ---
58
 
59
  # TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering
60
 
 
640
 
641
  ## 6.1 Why an Agent Is Needed
642
 
643
+ The handbook is reliable for stable academic rules, but a practical university assistant cannot depend only on static handbook text. Some user questions naturally require **official web discovery**, **source checking**, or **routing decisions**.
644
 
645
  Examples:
646
 
 
649
  Where can students find residential college information?
650
  What official page mentions PEKOM?
651
  Where is the official SPeCTRUM page?
652
+ What facilities are associated with a specific lab?
653
+ ```
654
+
655
+ For these cases, TensorTalk uses an official-source web agent. The agent is not designed as an unrestricted autonomous browser. It is deliberately designed as a **constrained agent** because the domain is academic handbook QA, where factual trustworthiness is more important than open-ended exploration.
656
+
657
+ In practical terms, the agent layer answers this question:
658
+
659
+ > If local handbook RAG is not enough, can the system search official UM/FSKTM sources, reject weak or fake sources, and return evidence safely?
660
+
661
+ ---
662
+
663
+ ## 6.2 How This Project Relates to LangChain and LangGraph
664
+
665
+ TensorTalk does **not** use LangChain, LangGraph, or LangSmith as the runtime framework. The agent and harness were implemented from scratch in the notebook.
666
+
667
+ However, the design is conceptually aligned with ideas from the official LangChain ecosystem:
668
+
669
+ - LangChain describes agents as systems that combine language models with tools so they can reason about tasks, decide which tools to use, and iteratively work toward a result.
670
+ - LangGraph describes agent workflows using state, nodes, and edges, where nodes perform computation and edges determine the next transition.
671
+ - LangSmith describes evaluation as a workflow involving datasets, evaluators, and experiments to compare application versions.
672
+ - LangChain/LangGraph documentation also distinguishes between predetermined workflows and dynamic agents; TensorTalk intentionally uses a hybrid design because the handbook QA task needs both predictable guardrails and dynamic retrieval decisions.
673
+
674
+ Therefore, this project is best described as:
675
+
676
+ > A from-scratch implementation of a LangChain/LangGraph-inspired agentic RAG harness, not a project built by directly calling LangChain’s prebuilt agent framework.
677
+
678
+ This distinction is important. TensorTalk does not simply wrap a LangChain agent. Instead, it manually implements the major control ideas:
679
+
680
+ ```text
681
+ State tracking
682
+ → planning
683
+ → retrieval/tool routing
684
+ → source filtering
685
+ → evidence normalization
686
+ → evaluation/judging
687
+ → retry/fallback
688
+ → final generation
689
+ → trace output
690
  ```
691
 
692
+ This gives the project more transparency because each part of the agent loop is visible in the notebook and UI trace.
693
+
694
+ ---
695
+
696
+ ## 6.3 Agent Design Philosophy
697
+
698
+ The TensorTalk agent is built around four principles.
699
+
700
+ ### 1. Source-constrained autonomy
701
 
702
+ The agent can search and fetch information, but only from allowed official sources. It is not free to trust arbitrary search results.
703
+
704
+ ### 2. Evidence-first generation
705
+
706
+ The model should not directly answer a web-sensitive question before evidence is collected and judged.
707
+
708
+ ### 3. Retry and fallback
709
+
710
+ If official web evidence is weak, blocked, irrelevant, or unsafe, the system can retry with entity-aware search terms or fall back to local handbook RAG.
711
+
712
+ ### 4. Traceable decisions
713
+
714
+ The agent does not hide its routing decisions. It records URL candidates, accepted evidence, rejected evidence, grounding decisions, and fallback decisions in trace panels.
715
 
716
  ---
717
 
718
+ ## 6.4 Official UM / FSKTM Web Agent
719
 
720
+ The web agent is constrained to official UM / FSKTM domains. This is one of the most important safety and reliability choices in the project.
721
 
722
+ ### Priority official domains
723
 
724
  ```text
725
  fsktm.um.edu.my
726
  www.um.edu.my
727
+ um.edu.my
728
  ```
729
 
730
+ ### Auxiliary official UM-related domains
731
+
732
+ The project also recognizes selected UM-related service domains when they are relevant to student services, academic systems, library resources, research, career portals, or internal faculty resources:
733
 
734
  ```text
735
  aasd.um.edu.my
 
742
  gallery.fsktm.um.edu.my
743
  ```
744
 
745
+ The whitelist is not meant to enable searching the whole internet. It constrains the agent to sources that are likely to be officially controlled by UM or FSKTM.
746
+
747
+ ---
748
+
749
+ ## 6.5 Domain Whitelist Design
750
+
751
+ The whitelist is used as a **domain guard** before a page can become trusted evidence.
752
+
753
+ The system sorts URLs into three broad categories:
754
+
755
+ | URL type | Handling |
756
+ |---|---|
757
+ | Official UM/FSKTM URL | Can be considered as candidate evidence |
758
+ | UM-related service URL | Can be considered if relevant to the question type |
759
+ | Non-official or synthetic URL | Rejected or downgraded |
760
+
761
+ The whitelist helps prevent common LLM-agent failure cases:
762
 
763
  ```text
764
+ hallucinated programme pages
765
+ invented lab pages
766
+ fake student service URLs
767
+ misrouted search results
768
+ random third-party pages
769
+ SEO or unrelated pages
 
 
 
770
  ```
771
 
772
+ For example, the project specifically tests that the agent does not invent or accept URL slugs such as:
773
+
774
+ ```text
775
+ programme-ccna-lab-more-detailedly
776
+ bachelor-of-computer-science-artificial-intelligence
777
+ ```
778
+
779
+ when those pages are not the correct evidence for the user’s question.
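The domain guard described in this section can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `classify_url`, the exact-host matching rule, and the partial auxiliary list are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Domains named in this README; matching on the exact hostname is an assumption.
PRIORITY_DOMAINS = {"fsktm.um.edu.my", "www.um.edu.my", "um.edu.my"}
# Partial list for illustration; the README names more auxiliary domains.
AUXILIARY_DOMAINS = {"aasd.um.edu.my", "intra.fsktm.um.edu.my", "gallery.fsktm.um.edu.my"}


def classify_url(url: str) -> str:
    """Return 'official', 'auxiliary', or 'rejected' for a candidate URL."""
    parts = urlsplit(url.strip())
    if parts.scheme not in {"http", "https"}:
        return "rejected"
    host = (parts.hostname or "").lower()
    if host in PRIORITY_DOMAINS:
        return "official"
    if host in AUXILIARY_DOMAINS:
        return "auxiliary"
    # Anything else, including look-alike hosts, is not trusted.
    return "rejected"
```

Comparing the full hostname, rather than checking whether the URL merely contains `um.edu.my`, means a look-alike host such as `fsktm.um.edu.my.evil.example` is rejected.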
780
+
781
  ---
782
 
783
+ ## 6.6 Web Agent Workflow
784
 
785
+ The official-source web agent follows a controlled workflow.
786
 
787
+ ```text
788
+ User question
789
+
790
+ Intent and entity detection
791
+
792
+ Official-search query construction
793
+
794
+ Candidate URL discovery
795
+
796
+ Domain whitelist filtering
797
+
798
+ Synthetic/fake URL rejection
799
+
800
+ Fetch or static page fallback
801
+
802
+ WAF/block detection
803
+
804
+ Text extraction and normalization
805
+
806
+ Evidence scoring
807
+
808
+ Qwen evidence judge
809
+
810
+ Accept evidence, retry, or fall back to handbook RAG
811
+ ```
812
+
813
+ This means the agent is not only a web search function. It is a guarded evidence acquisition pipeline.
814
+
815
+ ---
816
 
817
+ ## 6.7 Planning Inside the Agent
818
+
819
+ Planning is a visible part of the TensorTalk system.
820
+
821
+ The planning layer is responsible for deciding:
822
 
823
  ```text
824
+ Is this a stable handbook question?
825
+ Is this a latest/current official-web question?
826
+ Should local RAG be used first?
827
+ Should official web discovery be attempted?
828
+ Which entity should be searched?
829
+ Which scope should be preferred: undergraduate, postgraduate, general, faculty, or university?
830
+ What evidence type is expected: handbook chunk, official page, contact page, facility page, announcement, policy page?
831
  ```
832
 
833
+ This planning step is aligned with the idea that agentic systems should not directly jump from user question to final answer. They need a control stage that decides which tools and evidence paths are appropriate.
834
 
835
+ TensorTalk’s planning is not a free-form hidden chain-of-thought that users must trust blindly. It is operationalized through explicit routing variables, trace objects, search decisions, and UI panels.
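As an illustration of planning through explicit routing variables, the sketch below derives a small plan object from the question text. The cue words, the `Plan` fields, and the `plan_query` name are hypothetical; the notebook's real routing rules are not shown in this README.

```python
from dataclasses import dataclass

# Hypothetical cue words that suggest a current/official-web question.
CURRENT_CUES = ("current", "latest", "official page", "who is", "where is")
SCOPES = ("undergraduate", "postgraduate")


@dataclass
class Plan:
    use_local_rag: bool
    use_official_web: bool
    scope: str


def plan_query(question: str) -> Plan:
    """Derive explicit, inspectable routing variables from the question."""
    q = question.lower()
    needs_web = any(cue in q for cue in CURRENT_CUES)
    scope = next((s for s in SCOPES if s in q), "general")
    # Local RAG is always consulted; web discovery is enabled only on a cue.
    return Plan(use_local_rag=True, use_official_web=needs_web, scope=scope)
```

Because the plan is a plain data object, it can be written into the trace and shown in the UI instead of being left as hidden model reasoning.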
836
+
837
+ ---
838
+
839
+ ## 6.8 Generation Inside the Agent
840
+
841
+ Generation is the stage where the Qwen3-8B model produces an answer.
842
+
843
+ However, generation is not allowed to operate alone. The answer generator receives controlled context:
844
+
845
+ ```text
846
+ user question
847
+ retrieved local handbook evidence
848
+ accepted official web evidence
849
+ scope hints
850
+ source metadata
851
+ harness instructions
852
+ answer style constraints
853
+ ```
854
+
855
+ The generator is expected to:
856
+
857
+ ```text
858
+ answer directly
859
+ avoid unsupported claims
860
+ avoid fake URLs
861
+ avoid exposing internal reasoning
862
+ use local handbook evidence when web evidence is weak
863
+ prefer official web evidence only when it is relevant and trusted
864
+ ```
865
+
866
+ In the final stage, the generator is the PPO-trained Qwen3 actor, but it is still wrapped by the same RAG and Harness Engineering control layer.
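The controlled context listed in this section could be assembled into a prompt along these lines. This is only a sketch: `build_prompt`, the `[H1]`/`[W1]` labels, and the `url`/`text` keys on web evidence items are assumptions for the example, not the project's prompt template.

```python
def build_prompt(question, handbook_chunks, web_evidence, scope="general"):
    """Assemble the controlled context handed to the answer generator."""
    lines = [
        "Answer using only the evidence below. Do not invent URLs.",
        f"Scope: {scope}",
        "Handbook evidence:",
    ]
    lines += [f"[H{i}] {chunk}" for i, chunk in enumerate(handbook_chunks, 1)]
    # Web evidence is included only if something passed the evidence judge.
    if web_evidence:
        lines.append("Official web evidence:")
        lines += [
            f"[W{i}] {item['url']}: {item['text']}"
            for i, item in enumerate(web_evidence, 1)
        ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```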
867
+
868
+ ---
869
+
870
+ ## 6.9 Evaluation Inside the Agent
871
+
872
+ Evaluation is the other core part of the agent loop. TensorTalk evaluates both evidence and answers.
873
+
874
+ ### Evidence evaluation
875
+
876
+ The system checks:
877
+
878
+ ```text
879
+ Is the source official?
880
+ Is the URL synthetic or fake?
881
+ Is the page blocked by WAF?
882
+ Is the evidence relevant to the user question?
883
+ Does the evidence mention the right entity?
884
+ Does the evidence match the expected scope?
885
+ ```
886
+
887
+ ### Answer evaluation
888
+
889
+ The system checks:
890
+
891
+ ```text
892
+ Is the final answer grounded in accepted evidence?
893
+ Does it answer the user’s actual question?
894
+ Does it leak internal thinking?
895
+ Does it invent URLs?
896
+ Is it too vague?
897
+ Is it incomplete enough to require fallback or rewrite?
898
+ ```
899
+
900
+ This creates a full agentic loop:
901
+
902
+ ```text
903
+ Planning
904
+ → Retrieval / tool use
905
+ → Generation
906
+ → Evaluation
907
+ → Retry or fallback
908
+ → Final answer
909
+ ```
910
+
911
+ ---
912
+
913
+ ## 6.10 Why the Agent Is Not Fully Autonomous
914
+
915
+ The agent is intentionally not fully autonomous.
916
+
917
+ A fully autonomous browsing agent may:
918
+
919
+ ```text
920
+ search too broadly
921
+ trust wrong sources
922
+ follow irrelevant pages
923
+ invent missing pages
924
+ overuse web search
925
+ ignore handbook evidence
926
+ produce unsupported answers
927
+ ```
928
+
929
+ TensorTalk instead uses a constrained model:
930
+
931
+ ```text
932
+ Dynamic when needed
933
+ Guarded by default
934
+ Official-source-only for web evidence
935
+ RAG-first for handbook-stable questions
936
+ Fallback-aware when web evidence is weak
937
+ Traceable for debugging and demonstration
938
+ ```
939
+
940
+ This is more appropriate for a university handbook assistant.
941
 
942
  ---
943
 
 
945
 
946
  ## 7.1 What Harness Engineering Means Here
947
 
948
+ Harness Engineering is the external control system around the LLM, RAG, and agent.
949
 
950
  A simple analogy:
951
 
952
  ```text
953
  The LLM/agent is the car.
954
+ Harness Engineering is the guardrails, traffic rules, checkpoints, fallback routes, dashboard, and driving examiner.
955
  ```
956
 
957
  The model can generate fluent answers, but the harness controls:
958
 
959
+ ```text
960
+ what it can search
961
+ which domains are trusted
962
+ which URLs are rejected
963
+ which evidence is useful
964
+ when to retry
965
+ when to fall back
966
+ whether the answer is grounded
967
+ whether the UI should show warning traces
968
+ whether the final response is safe enough to display
969
+ ```
970
+
971
+ In TensorTalk, Harness Engineering is not just prompt engineering. Prompt engineering tells the model what to do. Harness Engineering builds the surrounding execution system that checks whether the model actually did it correctly.
972
 
973
  ---
974
 
975
+ ## 7.2 From Prompt Engineering to Harness Engineering
976
+
977
+ Prompt engineering is like telling a driver:
978
 
979
+ ```text
980
+ Please drive carefully.
981
+ ```
982
+
983
+ Harness Engineering is like building:
984
+
985
+ ```text
986
+ lane barriers
987
+ speed checks
988
+ traffic rules
989
+ navigation checkpoints
990
+ fallback routes
991
+ dashboards
992
+ incident logs
993
+ ```
994
+
995
+ In this project, prompt engineering alone is not enough because the model may still:
996
+
997
+ ```text
998
+ invent fake URLs
999
+ mix undergraduate and postgraduate rules
1000
+ leak internal reasoning
1001
+ trust weak web snippets
1002
+ answer without evidence
1003
+ overuse web search
1004
+ ignore local handbook RAG
1005
+ ```
1006
+
1007
+ The harness prevents or detects these failures through code-level controls, not only prompt instructions.
1008
+
1009
+ ---
1010
+
1011
+ ## 7.3 From-scratch LangChain-style Harness
1012
+
1013
+ TensorTalk’s harness was built manually rather than by importing a prebuilt LangChain/LangGraph agent.
1014
+
1015
+ The implementation follows the same conceptual loop used in many modern agent frameworks:
1016
+
1017
+ ```text
1018
+ Planning
1019
+ → Tool / Retrieval Action
1020
+ → Generation
1021
+ → Evaluation
1022
+ → Retry / Fallback
1023
+ → Finalization
1024
+ ```
1025
+
1026
+ But each component is implemented explicitly:
1027
+
1028
+ | Conceptual framework idea | TensorTalk from-scratch implementation |
1029
+ |---|---|
1030
+ | Agent state | Trace dictionaries, evidence bundles, routing flags, runtime status |
1031
+ | Tools | Local RAG retriever, official web search/fetch, URL validator, evidence judge |
1032
+ | Nodes | Planning, retrieval, web discovery, evidence filtering, judging, generation, grounding |
1033
+ | Edges / transitions | Conditional retry, weak-evidence fallback, RAG fallback, final answer route |
1034
+ | Evaluation | Qwen evidence judge, rule checks, answer grounding judge, smoke tests |
1035
+ | Observability | Collapsed UI trace panels and printed diagnostic outputs |
1036
+
1037
+ This makes the system easier to inspect in an academic notebook because the control logic is visible.
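The state / nodes / edges mapping in the table above can be illustrated with plain Python, without any agent framework. In this sketch every node is a function that updates a shared state dictionary and returns the name of the next node. The node names, the stand-in retriever, and the retry budget are all hypothetical.

```python
def plan(state):
    state["query"] = state["question"]
    return "retrieve"


def retrieve(state):
    # Stand-in retriever: evidence is only found on the second attempt.
    state["attempts"] += 1
    state["evidence"] = ["handbook chunk"] if state["attempts"] > 1 else []
    return "judge"


def judge(state):
    if state["evidence"]:
        return "generate"
    # Conditional edge: retry while budget remains, otherwise fall back.
    return "retrieve" if state["attempts"] < state["max_attempts"] else "fallback"


def generate(state):
    state["answer"] = f"Based on {len(state['evidence'])} evidence item(s)."
    return None


def fallback(state):
    state["answer"] = "Not enough evidence to answer."
    return None


NODES = {"plan": plan, "retrieve": retrieve, "judge": judge,
         "generate": generate, "fallback": fallback}


def run(question, max_attempts=2):
    state = {"question": question, "attempts": 0,
             "max_attempts": max_attempts, "evidence": [], "route": []}
    node = "plan"
    while node is not None:
        state["route"].append(node)  # record the visited node for the trace
        node = NODES[node](state)
    return state
```

The `route` list doubles as a minimal trace: it records which transitions were taken, including any retry or fallback.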
1038
+
1039
+ ---
1040
+
1041
+ ## 7.4 Planning → Generation → Evaluation Closed Loop
1042
+
1043
+ The most important Harness Engineering contribution in TensorTalk is the closed loop:
1044
+
1045
+ ```text
1046
+ Planning
1047
+
1048
+ Generation
1049
+
1050
+ Evaluation
1051
+
1052
+ Retry / Fallback / Finalization
1053
+ ```
1054
+
1055
+ ### Planning
1056
+
1057
+ The planning layer decides how to handle the query.
1058
+
1059
+ It considers:
1060
+
1061
+ ```text
1062
+ question type
1063
+ scope
1064
+ entity
1065
+ whether local RAG is enough
1066
+ whether official web is needed
1067
+ whether the query is dynamic/current
1068
+ whether the answer should be handbook-grounded or web-grounded
1069
+ ```
1070
+
1071
+ ### Generation
1072
+
1073
+ The generation layer produces an answer using controlled evidence.
1074
+
1075
+ It receives:
1076
+
1077
+ ```text
1078
+ local handbook chunks
1079
+ official web evidence
1080
+ scope hints
1081
+ source metadata
1082
+ answer constraints
1083
+ ```
1084
+
1085
+ ### Evaluation
1086
+
1087
+ The evaluation layer checks the result.
1088
+
1089
+ It evaluates:
1090
+
1091
+ ```text
1092
+ source trust
1093
+ URL validity
1094
+ evidence relevance
1095
+ answer grounding
1096
+ completeness
1097
+ process leakage
1098
+ fake URLs
1099
+ fallback need
1100
+ ```
1101
+
1102
+ If evaluation fails, the system can retry, reroute, or fall back.
1103
+
1104
+ This is the engineering loop that makes TensorTalk more than a simple RAG chatbot.
1105
+
1106
+ ---
1107
+
1108
+ ## 7.5 Standardized Harness Core Pipeline
1109
+
1110
+ The final standardized TensorTalk Harness Core follows this pipeline:
1111
 
1112
  ```text
1113
  User Question
1114
 
1115
+ Planning Layer
1116
+
1117
  Local Handbook RAG
1118
 
1119
  Official Web Discovery
 
1132
 
1133
  Weak Evidence Fallback
1134
 
1135
+ PPO/SFT Answer Generator
1136
 
1137
  Answer Grounding Judge
1138
 
1139
  Completeness Guard
1140
 
1141
+ Final Answer
1142
+
1143
  UI Trace
1144
  ```
1145
 
1146
+ This pipeline is intentionally explicit. Each part has a clear job.
1147
+
1148
  ---
1149
 
1150
+ ## 7.6 Harness State and Trace Objects
1151
 
1152
+ The harness keeps structured trace data so that every answer can be inspected.
1153
 
1154
+ Typical trace information includes:
1155
 
1156
+ ```text
1157
+ retrieved local RAG chunks
1158
+ candidate web URLs
1159
+ accepted official URLs
1160
+ rejected URLs
1161
+ web evidence bundle
1162
+ harness core route
1163
+ evidence judge result
1164
+ answer grounding result
1165
+ fallback reason
1166
+ final answer preview
1167
+ ```
1168
 
1169
+ This is similar in spirit to observability and tracing in agent platforms, but implemented directly in the notebook and UI.
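A trace object of this kind might look like the following sketch. The class name `HarnessTrace` and its fields are assumptions that loosely follow the list above; the notebook's real trace dictionaries may be organised differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HarnessTrace:
    question: str
    rag_chunks: list = field(default_factory=list)
    candidate_urls: list = field(default_factory=list)
    accepted_urls: list = field(default_factory=list)
    rejected_urls: list = field(default_factory=list)  # (url, reason) pairs
    route: str = "local_rag"
    fallback_reason: str = ""

    def reject(self, url, reason):
        """Record a rejected URL together with the reason for rejection."""
        self.rejected_urls.append((url, reason))

    def to_json(self):
        """Serialize the trace so it can be printed or shown in a UI panel."""
        return json.dumps(asdict(self), indent=2)
```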
1170
 
1171
+ ---
1172
+
1173
+ ## 7.7 Domain Guard
1174
+
1175
+ The domain guard checks whether a candidate source belongs to the allowed official domain set.
1176
+
1177
+ It protects against:
1178
+
1179
+ ```text
1180
+ random third-party websites
1181
+ unofficial mirrors
1182
+ search result noise
1183
+ LLM-fabricated domains
1184
+ wrong university pages
1185
+ ```
1186
+
1187
+ It also makes the system explainable. If the agent rejects a page, the trace can show why.
1188
 
1189
  ---
1190
 
1191
+ ## 7.8 Fake URL Guard
1192
 
1193
+ The fake URL guard is one of the most important parts of the project because raw LLM generations can invent plausible-looking URLs.
1194
 
1195
+ Examples of risky synthetic URLs include:
1196
 
1197
  ```text
1198
+ https://spectrum.umlms
1199
+ http://spectrux.medicum
1200
+ programme-ccna-lab-more-detailedly
1201
+ https://aasd um edu my/studetn
 
 
1202
  ```
1203
 
1204
+ The guard checks and rejects URLs that:
1205
+
1206
+ ```text
1207
+ are malformed
1208
+ look fabricated
1209
+ contain suspicious path patterns
1210
+ do not belong to allowed domains
1211
+ are query-fabricated rather than discovered from official search/fetch
1212
+ ```
1213
+
1214
+ The PPO reward function also penalizes hallucinated URLs, but the harness is still necessary because reward shaping does not guarantee perfect URL behavior.
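A fake URL guard could combine a well-formedness check with a provenance check, as in this sketch. The function name `fake_url_reason`, the hostname pattern, and the idea of passing in the set of URLs actually returned by search/fetch are assumptions for illustration, not the project's implementation.

```python
import re
from urllib.parse import urlsplit

# A simple hostname shape: dot-separated labels ending in a letters-only TLD.
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")


def fake_url_reason(url, discovered_urls):
    """Return a rejection reason, or None if the URL passes both checks."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in {"http", "https"} or not HOST_RE.match(host):
        return "malformed"
    # Provenance: the URL must come from search/fetch, not from generation.
    if url not in discovered_urls:
        return "not discovered by search/fetch"
    return None
```

The provenance check matters because a fabricated URL such as `https://spectrum.umlms` is syntactically plausible and would pass a purely syntactic test. A domain whitelist check would normally run alongside this guard.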
1215
 
1216
  ---
1217
 
1218
+ ## 7.9 WAF Detection
1219
 
1220
+ Some official pages can be blocked, partially loaded, or protected by web application firewalls.
1221
 
1222
+ The WAF-aware harness detects cases where:
1223
 
1224
  ```text
1225
+ the page cannot be fetched normally
1226
+ the content is a block page instead of the real page
1227
+ the browser click fails
1228
+ the official site returns insufficient text
1229
+ ```
1230
+
1231
+ When this happens, the system avoids treating the blocked page as strong evidence. It can use diagnostics, retry, static fallback, or local RAG fallback.
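One simple way to approximate WAF/block detection is a heuristic over the HTTP status code and the fetched text. The marker phrases, status codes, and length threshold below are hypothetical values chosen for the sketch, not the ones used in the notebook.

```python
# Hypothetical phrases that commonly appear on block pages.
BLOCK_MARKERS = ("access denied", "request blocked", "captcha",
                 "web application firewall")


def block_reason(status_code, text, min_chars=200):
    """Return why a fetched page should not count as evidence, or None."""
    if status_code in (403, 429, 503):
        return "blocked status"
    lowered = text.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "block page text"
    if len(text.strip()) < min_chars:
        return "insufficient text"
    return None
```

Returning a reason string rather than a bare boolean lets the diagnostic appear directly in the trace.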
1232
+
1233
+ ---
1234
+
1235
+ ## 7.10 Evidence Normalizer
1236
+
1237
+ Fetched web pages and handbook chunks may be noisy.
1238
+
1239
+ The evidence normalizer attempts to convert them into a consistent evidence structure:
1240
+
1241
+ ```text
1242
+ title
1243
+ url
1244
+ source type
1245
+ domain
1246
+ text snippet
1247
+ score
1248
+ scope
1249
+ entity
1250
+ reason
1251
+ ```
1252
+
1253
+ This makes later judging and UI display easier.
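A normalizer along these lines would map raw handbook chunks and fetched pages onto one shape. The `Evidence` class, the raw dictionary keys (`title`, `url`, `text`, `score`), and the snippet length are assumptions for this sketch and cover only part of the field list above.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Evidence:
    title: str
    url: str
    source_type: str
    domain: str
    snippet: str
    score: float = 0.0


def normalize(raw, source_type, max_chars=300):
    """Convert a raw retrieval result into a consistent Evidence record."""
    # Collapse runs of whitespace left over from HTML or PDF extraction.
    text = re.sub(r"\s+", " ", raw.get("text", "")).strip()
    url = raw.get("url", "")
    return Evidence(
        title=raw.get("title", "").strip() or "(untitled)",
        url=url,
        source_type=source_type,
        domain=(urlsplit(url).hostname or "").lower(),
        snippet=text[:max_chars],
        score=float(raw.get("score", 0.0)),
    )
```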
1254
+
1255
+ ---
1256
+
1257
+ ## 7.11 Qwen Evidence Judge
1258
+
1259
+ The Qwen evidence judge is used to decide whether retrieved evidence actually helps answer the user’s question.
1260
+
1261
+ It checks:
1262
+
1263
+ ```text
1264
+ Does the evidence mention the right entity?
1265
+ Does it answer the question directly?
1266
+ Is it only loosely related?
1267
+ Is it a wrong programme/page?
1268
+ Is it official but irrelevant?
1269
  ```
1270
 
1271
+ This is important because official sources can still be irrelevant. A page can be official and still be the wrong evidence.
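An LLM-based evidence judge can be sketched as a prompt plus a strict verdict parser. Here `llm` is a hypothetical callable that takes a prompt string and returns the model's text (for example, a thin wrapper around the Qwen3-8B generation call); the prompt wording and the ACCEPT/REJECT protocol are assumptions, not the notebook's judge prompt.

```python
def judge_evidence(question, snippet, llm):
    """Ask an LLM callable for a verdict; anything unclear counts as a reject."""
    prompt = (
        "Does the evidence directly answer the question about the same entity?\n"
        f"Question: {question}\n"
        f"Evidence: {snippet}\n"
        "Reply with ACCEPT or REJECT."
    )
    reply = llm(prompt).strip().upper()
    # Fail closed: only an explicit ACCEPT lets the evidence through.
    return reply.startswith("ACCEPT")
```

Treating every ambiguous reply as a rejection keeps a rambling or off-format judge output from promoting weak evidence.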
1272
+
1273
  ---
1274
 
1275
+ ## 7.12 Entity-aware Retry
1276
 
1277
+ If the first web discovery result is weak or misrouted, the harness can retry with better query terms.
1278
 
1279
+ For example, if a question about PEKOM gets routed toward an AI bachelor programme page, the harness should retry using terms related to:
1280
+
1281
+ ```text
1282
+ PEKOM
1283
+ Persatuan Komputer UM
1284
+ student society
1285
+ FSKTM student association
1286
+ ```
1287
+
1288
+ This prevents the agent from accepting the first official-looking but semantically wrong page.
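Entity-aware retry can be approximated by expanding a detected entity into alias queries restricted to an official domain. The alias table, the function name, and the `site:` search-operator syntax are assumptions for this sketch; the actual search backend may need a different query format.

```python
# Hypothetical alias table, using the PEKOM terms listed above.
ENTITY_ALIASES = {
    "pekom": ["PEKOM", "Persatuan Komputer UM", "student society",
              "FSKTM student association"],
}


def retry_queries(question, site="fsktm.um.edu.my"):
    """Build retry queries that pin the search to the entity in the question."""
    q = question.lower()
    queries = []
    for key, aliases in ENTITY_ALIASES.items():
        if key in q:
            queries += [f'site:{site} "{alias}"' for alias in aliases]
    # No known entity: fall back to a single site-restricted query.
    return queries or [f"site:{site} {question}"]
```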
1289
 
1290
  ---
1291
 
1292
+ ## 7.13 Weak Evidence Fallback
1293
+
1294
+ If the official web evidence is weak, TensorTalk can fall back to local handbook RAG.
1295
+
1296
+ This prevents a common agent failure:
1297
+
1298
+ ```text
1299
+ The system found a web page, so it trusts it even though it does not answer the question.
1300
+ ```
1301
+
1302
+ Instead, TensorTalk uses:
1303
+
1304
+ ```text
1305
+ web evidence if strong
1306
+ local handbook RAG if web evidence is weak
1307
+ hybrid answer if both are useful
1308
+ refusal/uncertainty if neither is sufficient
1309
+ ```
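The four outcomes above amount to a small decision function. The sketch below assumes that both evidence sources have already been scored on a 0 to 1 scale and that a single threshold separates strong from weak evidence; both assumptions, and the route names, are illustrative.

```python
def choose_route(web_score, rag_score, threshold=0.5):
    """Pick the evidence route from two evidence-strength scores."""
    web_ok = web_score >= threshold
    rag_ok = rag_score >= threshold
    if web_ok and rag_ok:
        return "hybrid"
    if web_ok:
        return "web"
    if rag_ok:
        return "local_rag"
    # Neither source is strong enough: answer with explicit uncertainty.
    return "uncertain"
```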
1310
+
1311
+ ---
1312
+
1313
+ ## 7.14 Answer Grounding Judge
1314
+
1315
+ After answer generation, the answer grounding judge checks whether the final answer is supported by the accepted evidence.
1316
+
1317
+ It helps catch cases where:
1318
+
1319
+ ```text
1320
+ retrieval was correct but generation added unsupported claims
1321
+ the model invented a URL
1322
+ the model mixed evidence from different scopes
1323
+ the answer contains a statement that does not appear in evidence
1324
+ ```
1325
+
1326
+ This is the evaluation part of the Planning → Generation → Evaluation loop.
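As a rough, rule-based stand-in for the grounding judge, the sketch below flags URLs that do not appear in the accepted evidence and sentences with little lexical overlap with it. This is a deliberately simple substitute, not the project's judge: the 0.5 overlap threshold, the three-word minimum, and the function name are assumptions, and word overlap is a much weaker signal than an LLM-based check.

```python
import re

URL_RE = re.compile(r"https?://[^\s)>\]]+")
WORD_RE = re.compile(r"[a-z0-9]+")


def grounding_report(answer, evidence_texts, evidence_urls, min_overlap=0.5):
    """Flag invented URLs and sentences poorly supported by the evidence."""
    urls = [u.rstrip(".,") for u in URL_RE.findall(answer)]
    invented = [u for u in urls if u not in evidence_urls]
    evidence_words = set(WORD_RE.findall(" ".join(evidence_texts).lower()))
    unsupported = []
    text_only = URL_RE.sub("", answer).strip()
    for sentence in re.split(r"(?<=[.!?])\s+", text_only):
        words = set(WORD_RE.findall(sentence.lower()))
        # Very short fragments carry too little signal to judge.
        if len(words) >= 3 and len(words & evidence_words) / len(words) < min_overlap:
            unsupported.append(sentence)
    return {
        "grounded": not invented and not unsupported,
        "invented_urls": invented,
        "unsupported": unsupported,
    }
```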
1327
+
1328
+ ---
1329
+
1330
+ ## 7.15 Completeness Guard
1331
+
1332
+ The completeness guard checks whether the answer is too short, vague, or incomplete.
1333
 
1334
+ It can identify cases where the answer:
1335
+
1336
+ ```text
1337
+ only repeats the question
1338
+ does not include required details
1339
+ misses key fields
1340
+ does not answer the requested scope
1341
+ cuts off early
1342
+ ```
1343
+
1344
+ Depending on runtime settings, this can trigger a rewrite or fallback.
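The checks listed above can be approximated with a few heuristics. In this sketch the word-count threshold, the `required_terms` parameter, and the end-of-answer punctuation test are hypothetical choices; the real guard may use different signals.

```python
def completeness_issues(question, answer, required_terms=(), min_words=8):
    """Return a list of completeness problems; an empty list means it passes."""
    issues = []
    a = answer.strip()
    if len(a.split()) < min_words:
        issues.append("too short")
    if a.lower().rstrip("?.! ") == question.lower().rstrip("?.! "):
        issues.append("repeats question")
    missing = [t for t in required_terms if t.lower() not in a.lower()]
    if missing:
        issues.append("missing: " + ", ".join(missing))
    # An answer that does not end in closing punctuation may be truncated.
    if a and a[-1] not in ".!?)":
        issues.append("possibly cut off")
    return issues
```

A non-empty result is the signal a harness could use to trigger a rewrite or fallback.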
1345
 
1346
  ---
1347
 
1348
+ ## 7.16 Smoke Tests as Harness Unit Checks
1349
+
1350
+ The smoke tests are lightweight checks that make sure the harness pipeline still works after model or code changes.
1351
 
1352
+ Examples:
1353
+
1354
+ ```text
1355
+ PEKOM should not be routed to the AI bachelor page.
1356
+ Residential college should prefer the student-affairs residential page.
1357
+ CCNA Lab should not invent synthetic URLs.
1358
+ ```
1359
 
1360
+ These tests check:
1361
+
1362
+ ```text
1363
+ routing
1364
+ URL filtering
1365
+ official page preference
1366
+ fake URL rejection
1367
+ answer grounding trace
1368
+ harness core route
1369
+ ```
1370
+
1371
+ They are not a full benchmark. They are fast sanity checks that the system still runs through the expected pipeline.
1372
 
1373
  ---
1374
 
1375
+ ## 7.17 Why Harness Engineering Is Central to This Project
1376
 
1377
+ The final system does not rely on only one technique.
1378
+
1379
+ ```text
1380
+ SFT gives domain answer style.
1381
+ RAG gives handbook evidence.
1382
+ The web agent gives official external evidence.
1383
+ PPO improves answer behavior.
1384
+ Harness Engineering controls the whole system.
1385
+ ```
1386
+
1387
+ Without the harness, the system would still be vulnerable to:
1388
+
1389
+ ```text
1390
+ wrong source selection
1391
+ fake URLs
1392
+ weak web evidence
1393
+ scope confusion
1394
+ process leakage
1395
+ unsupported final answers
1396
+ stale artifact loading
1397
+ ```
1398
+
1399
+ Therefore, Harness Engineering is the system-level contribution that connects SFT, RAG, Agent, and PPO into one controlled workflow.
1400
 
1401
  ---
1402
 
 
1936
 
1937
  ---
1938
 
1939
+ # 18. Relation to LangChain / LangGraph / LangSmith Concepts
1940
+
1941
+ This project does not claim to be a LangChain implementation. Instead, it uses a from-scratch notebook implementation that follows similar engineering ideas.
1942
+
1943
+ Official LangChain ecosystem references that influenced the design include:
1944
+
1945
+ - [LangChain Agents documentation](https://docs.langchain.com/oss/javascript/langchain/agents): agents combine language models with tools and can iteratively work toward a goal.
1946
+ - [LangGraph Overview](https://docs.langchain.com/oss/python/langgraph/overview): LangGraph focuses on durable execution, streaming, human-in-the-loop, persistence, and orchestration for agent workflows.
1947
+ - [LangGraph Graph API](https://docs.langchain.com/oss/python/langgraph/graph-api): agent workflows can be modeled through state, nodes, and edges.
1948
+ - [LangGraph Workflows and Agents](https://docs.langchain.com/oss/python/langgraph/workflows-agents): workflows use predetermined code paths, while agents are more dynamic in tool usage and process control.
1949
+ - [LangSmith Evaluation](https://docs.langchain.com/langsmith/evaluation): evaluation can be structured around datasets, evaluators, and experiments.
1950
+ - [LangSmith Evaluation Types](https://docs.langchain.com/langsmith/evaluation-types): evaluation may include benchmarking, unit tests, regression tests, LLM-as-judge evaluators, code evaluators, and online monitoring.
1951
+ - [LangSmith Application-specific Evaluation Approaches](https://docs.langchain.com/langsmith/evaluation-approaches): autonomous agents are commonly discussed in terms of tool calling, memory, and planning.
1952
+
1953
+ TensorTalk maps these ideas into a custom system:
1954
+
1955
+ | LangChain ecosystem idea | TensorTalk implementation |
1956
+ |---|---|
1957
+ | Agent uses model + tools | Qwen3 model + local RAG + official web search + URL validator + evidence judge |
1958
+ | State | Trace dictionaries, evidence bundles, routing flags, model/backend status |
1959
+ | Nodes | Planning, retrieval, web discovery, filtering, judging, generation, grounding, completeness checking |
1960
+ | Edges | Conditional retry, official-web route, local-RAG fallback, weak-evidence fallback, final-answer route |
1961
+ | Planning | Query classification, scope detection, entity-aware routing, web/RAG decision |
1962
+ | Generation | SFT/PPO Qwen3 actor generates with accepted evidence |
1963
+ | Evaluation | Evidence judge, answer grounding judge, completeness guard, fake URL checks, smoke tests |
1964
+ | Observability | TensorTalk collapsed trace panels and diagnostic logs |
1965
+ | Regression/smoke testing | PEKOM route test, residential-college URL test, CCNA synthetic URL test |
1966
+
1967
+ This is why the project can be described as:
1968
+
1969
+ > A from-scratch LangChain/LangGraph-inspired RAG agent harness for UM Handbook QA, with a Planning → Generation → Evaluation control loop.
1970
+
1971
+ ---
1972
+
1973
+ # 19. Summary
1974
 
1975
  TensorTalk demonstrates a staged LLM system development workflow:
1976