---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3-8B
pipeline_tag: text-generation
library_name: transformers
tags:
- qwen3
- qwen3-8b
- lora
- qlora
- sft
- rag
- faiss
- dense-retrieval
- agent
- ppo
- rlhf
- rule-reward
- harness-engineering
- um-handbook
- question-answering
- chatbot
- education
- tensor-talk
---

# TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering

TensorTalk is a staged LLM engineering project built for **Universiti Malaya Faculty of Computer Science and Information Technology handbook question answering**.

The system is designed to answer undergraduate, postgraduate, and general faculty handbook questions using a controlled progression of three experimental stages:

1. **Baseline 1: Closed-book SFT Qwen3-8B**
2. **Baseline 2: SFT Qwen3-8B + Metadata-aware RAG + Official Web Agent + Harness Engineering**
3. **Improved Model: Rule-reward PPO post-training + RAG + Agent + Harness Engineering**

The project is not just a simple chatbot. It is a controlled comparison of how an LLM system improves when moving from memorized supervised fine-tuning, to retrieval-grounded answering, and finally to rule-reward post-training with a guarded agentic runtime.

The main idea is:

> Baseline 1 tests whether a fine-tuned model can answer handbook questions from parameters alone.
>
> Baseline 2 keeps the same base model family but adds retrieval and harnessed evidence control.
>
> The Improved Model keeps the RAG + Agent + Harness runtime and further adds PPO post-training to make the model better aligned with the desired answer behavior.

---

# 1. Project Goal

The goal of this project is to build a reliable and traceable UM Handbook assistant that can answer questions about:

- Faculty objectives, vision, mission, history, facilities, and academic calendar
- Undergraduate programme details
- Postgraduate programme details
- Candidature requirements
- Grading and academic rules
- Industrial training
- Academic project requirements
- Supervision policy
- Thesis/dissertation requirements
- Academic integrity and plagiarism
- Facilities and labs
- Official UM/FSKTM web information when handbook knowledge is insufficient or time-sensitive

The project also aims to demonstrate a complete LLM system development path:

```text
Closed-book SFT
→ RAG-augmented SFT
→ Metadata-aware retrieval
→ Official-source web agent
→ Harness Engineering guardrails
→ PPO rule-reward post-training
→ Strict artifact verification
→ Traceable TensorTalk UI
```

---

# 2. High-level System Overview

The final TensorTalk system contains several layers.

```text
User Question
    ↓
TensorTalk UI
    ↓
Planning / Thinking Display Layer
    ↓
Local Handbook RAG
    ↓
Official UM / FSKTM Web Agent
    ↓
Harness Engineering Guardrails
    ↓
Evidence Judge / Retry / Fallback
    ↓
PPO-trained Qwen3-8B Actor
    ↓
Answer Grounding Judge
    ↓
Completeness Guard
    ↓
Final Answer + Trace Panels
```

The final model is not used alone. It is wrapped inside a runtime harness that controls:

- where the system can search
- which sources it can trust
- whether web evidence is useful
- whether retrieved evidence supports the answer
- whether the model produced fake URLs
- whether the answer leaked internal reasoning
- whether fallback to local handbook RAG is needed
- whether the final answer is grounded enough to show

This is why the final stage is better described as:

> **A PPO-aligned RAG agent system with Harness Engineering**, rather than only a fine-tuned model.

---

# 3. Dataset Design

## 3.1 Source Domain

The dataset is built around UM FSKTM undergraduate and postgraduate handbook content.

The data is organized into:

- SFT question-answer dataset
- hidden metadata
- RAG knowledge base
- RAG evaluation dataset
- PPO preference dataset

The project separates **model-visible training text** from **metadata used for retrieval, evaluation, and analysis**.

This distinction is important:

- Baseline 1 intentionally trains on question-answer text without forcing explicit metadata labels into the model-visible answer.
- Baseline 2 uses metadata-aware retrieval to reduce scope confusion.
- Stage 3 PPO uses preference pairs and reward functions to shape answer behavior.

---

## 3.2 Baseline 1 SFT Dataset

Baseline 1 uses:

```text
SFT_QA_Training_Ready.jsonl
```

The notebook validates:

```text
Total examples: 1000
Train examples: 800
Validation examples: 100
Test examples: 100
Split ratio: 8:1:1
Duplicate question groups: 0
Duplicate question rows: 0
```

Each example follows a supervised chat-style format:

```json
{
  "prompt": [
    {
      "role": "system",
      "content": "You are an academic assistant for the Faculty of Computer Science and Information Technology, Universiti Malaya..."
    },
    {
      "role": "user",
      "content": "What are the faculty objectives?"
    }
  ],
  "completion": [
    {
      "role": "assistant",
      "content": "The faculty objectives are..."
    }
  ],
  "question": "...",
  "answer": "..."
}
```

This stage teaches the model to imitate handbook-style answers directly.
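
The checks listed above (record layout, duplicate questions, 8:1:1 split) can be sketched in a few lines of Python. This is a minimal illustration and not the notebook's actual validation cell: the helper names `validate_record`, `find_duplicate_questions`, and `split_8_1_1`, the seed, and the whitespace/case normalization used for duplicate detection are all assumptions.

```python
import random

# Keys taken from the example record shown in section 3.2.
REQUIRED_KEYS = ("prompt", "completion", "question", "answer")


def validate_record(rec):
    """Return True if rec follows the chat-style layout (hypothetical helper)."""
    if not all(key in rec for key in REQUIRED_KEYS):
        return False
    roles = [msg.get("role") for msg in rec["prompt"]]
    if roles != ["system", "user"]:
        return False
    completion = rec["completion"]
    return len(completion) == 1 and completion[0].get("role") == "assistant"


def find_duplicate_questions(records):
    """Return normalized questions that occur more than once.

    Lowercasing and whitespace collapsing are assumed normalization rules.
    """
    seen, dupes = set(), set()
    for rec in records:
        key = " ".join(rec["question"].lower().split())
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


def split_8_1_1(records, seed=42):
    """Shuffle deterministically and split into train/validation/test at 8:1:1."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_valid = len(rows) // 10
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test
```

With 1000 valid rows, `split_8_1_1` yields 800, 100, and 100 rows, which matches the split sizes the notebook reports. Rows would typically be read with `json.loads` on each line of the JSONL file before being passed to these helpers.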

---

## 3.3 Baseline 2 RAG Dataset

Baseline 2 uses the same SFT dataset direction, but adds external retrieval resources:

```text
UM_RAG_Knowledge_Base.jsonl
UM_RAG_Evaluation_Dataset.jsonl
SFT_QA_Metadata.jsonl
```

The RAG knowledge base contains structured fields such as:

```text
kb_id
source_doc
scope_label
section
pages
source_text
retrieval_text
retrieval_keywords
grounded_answer_bank
matched_qa_ids
```

The RAG knowledge base loaded in the final Stage 3 runtime contains:

```text
Loaded KB rows: 521
```

The metadata layer allows the system to distinguish:

```text
general
undergraduate
postgraduate
```

This is important because many handbook questions look similar but require different answers depending on the student scope.

---

## 3.4 PPO Preference Dataset

The Improved Model uses:

```text
UM_Handbook_PPO_Preference_Dataset.jsonl
```

The final PPO run uses the full dataset:

```text
Total PPO preference rows: 1000
Train rows: 900
Validation rows: 100
Train fraction: 0.90
```

The PPO dataset is not used like normal SFT data.

In SFT, the model directly imitates a reference answer. In PPO, the model generates its own answer, receives a reward, and updates toward higher-reward behavior.

---

# 4. Baseline 1: Closed-book SFT Qwen3-8B

## 4.1 Purpose

Baseline 1 asks a simple question:

> Can Qwen3-8B learn UM Handbook question answering from supervised fine-tuning alone?

This is a **closed-book baseline**. The model does not retrieve handbook evidence during inference. It must answer from what it learned during SFT.

This is useful as a control baseline because it shows what happens when the model relies mainly on parameter memory.

---

## 4.2 Model

Baseline 1 uses:

```text
Base model: Qwen/Qwen3-8B
Local path: /scr/user/kevin2002/TensorCat/NLP/UM_Handbook/models/Qwen3-8B
```

The notebook detected:

```text
Backend: CUDA
GPU: NVIDIA A100-SXM4-80GB
dtype: bfloat16
4-bit QLoRA: enabled
```

---

## 4.3 Training Method

Baseline 1 uses LoRA / QLoRA supervised fine-tuning.

LoRA configuration:

```text
LoRA rank: 16
LoRA alpha: 32
LoRA dropout: 0.10
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
```

Training configuration:

```text
Epochs: 8
Train split: 800
Validation split: 100
Test split: 100
Per-device train batch size: 2
Per-device eval batch size: 2
Gradient accumulation steps: 8
Learning rate: 1e-4
Packing: False
```

---

## 4.4 Baseline 1 Results

The training completed successfully.

Training summary:

```text
Training steps: 400
Training runtime: ~18.68 minutes for the main train stage
Train loss: 0.4824
Final validation loss: ~0.146
Test loss: ~0.197
Perplexity: ~1.157
```

Generation metrics:

### Validation

```text
Exact match: 0.77
Token F1: 0.9111
ROUGE-1: 0.9122
ROUGE-2: 0.8700
ROUGE-L: 0.8979
SacreBLEU: 81.7240
chrF++: 86.8916
Average prediction words: 36.35
Average reference words: 38.57
```

### Test

```text
Exact match: 0.72
Token F1: 0.8869
ROUGE-1: 0.8857
ROUGE-2: 0.8352
ROUGE-L: 0.8677
SacreBLEU: 81.1138
chrF++: 87.7054
Average prediction words: 38.03
Average reference words: 37.03
```

---

## 4.5 Baseline 1 Strengths

Baseline 1 is strong when the question is close to the training distribution. It can reproduce handbook-style answers well and shows high text overlap with the reference answers.

It is useful because:

- it establishes the basic Qwen3-8B SFT capability
- it verifies that the dataset format is learnable
- it creates a clean closed-book control model
- it provides a baseline for later RAG and PPO improvements

---

## 4.6 Baseline 1 Limitations

Baseline 1 is still limited because it is a closed-book model.

Main limitations:

1. **No retrieval evidence**
   It cannot check the handbook at inference time.
2. **Potential hallucination**
   If the question is out-of-distribution or requires exact source grounding, the model may answer from memory.
3. **Scope confusion**
   Undergraduate and postgraduate rules may be mixed if the question is ambiguous.
4. **No official web update mechanism**
   It cannot answer dynamic or latest-information questions reliably.
5. **No harness guardrails**
   It does not include fake URL detection, evidence judging, WAF handling, or fallback control.

Baseline 1 is therefore a necessary but incomplete starting point.

---

# 5. Baseline 2: RAG + SFT + Metadata-aware Retrieval + Harness Agent

## 5.1 Purpose

Baseline 2 asks:

> What improves if we keep the same Qwen3-8B family but add retrieval-grounded evidence?

The goal is to reduce hallucination and scope confusion by giving the model relevant handbook evidence at inference time.

This stage introduces RAG and agentic harness logic while keeping the same broad model family and handbook task.

---

## 5.2 What RAG Means in This Project

RAG stands for **Retrieval-Augmented Generation**.

In simple terms:

```text
Instead of asking the model to answer only from memory,
the system first retrieves relevant handbook chunks,
then asks the model to answer using those chunks.
```

In this project, RAG is not just keyword search. It uses:

```text
Transformer embedding model
+ FAISS vector search
+ metadata-aware reranking
+ scope labels
+ top-k evidence blocks
```

The Baseline 2 retriever uses:

```text
Embedding model: BAAI/bge-base-en-v1.5
Vector index: FAISS
Similarity: inner product after embedding normalization
Top-k retrieval: 3
```

---

## 5.3 Metadata-aware Retrieval

The RAG system uses metadata to control retrieval quality.

Important metadata fields include:

```text
source_doc
scope_label
section
pages
kb_id
knowledge group
retrieval keywords
grounded answer bank
```

This allows the retriever to prefer the correct audience scope.

Example:

```text
Question: What are the candidature requirements for Master of Software Engineering?
Expected scope: postgraduate
```

The system should retrieve postgraduate chunks, not undergraduate chunks.

This is one of the main improvements over Baseline 1.
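
The scope preference described above can be sketched as a rerank step on top of dense similarity. The code below is a toy illustration under stated assumptions, not the project's retriever: it uses plain Python lists instead of `BAAI/bge-base-en-v1.5` embeddings and FAISS, and the keyword heuristic in `infer_scope`, the `scope_bonus` weight, and the `embedding` field on each row are hypothetical. Only `kb_id` and `scope_label` are field names taken from the knowledge base description.

```python
import math


def normalize(vec):
    """L2-normalize a vector so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def infer_scope(question):
    """Hypothetical keyword heuristic; the project's real scope logic is not shown."""
    q = question.lower()
    if any(w in q for w in ("master", "phd", "doctor", "postgraduate", "candidature")):
        return "postgraduate"
    if any(w in q for w in ("bachelor", "undergraduate", "industrial training")):
        return "undergraduate"
    return "general"


def retrieve(question, query_vec, kb_rows, top_k=3, scope_bonus=0.15):
    """Score rows by inner product, add a bonus for matching scope, keep top_k.

    scope_bonus is an assumed weight chosen only for illustration.
    """
    scope = infer_scope(question)
    q = normalize(query_vec)
    scored = []
    for row in kb_rows:
        dense = sum(a * b for a, b in zip(q, normalize(row["embedding"])))
        bonus = scope_bonus if row["scope_label"] == scope else 0.0
        scored.append((dense + bonus, row["kb_id"], row["scope_label"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

For the Master of Software Engineering question above, a postgraduate chunk with slightly lower raw similarity than an undergraduate chunk would still be ranked first, which is the behavior the metadata layer is meant to produce.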

---

## 5.4 RAG-augmented Training Dataset

Baseline 2 creates a RAG-augmented dataset where training examples include evidence context.

The training prompt can contain:

```text
User question
+ retrieved handbook evidence
+ source metadata
+ answer instruction
```

This teaches the model to answer with evidence-aware context rather than only memorized answers.

---

## 5.5 Baseline 2 Training Configuration

Baseline 2 uses Qwen3-8B with LoRA fine-tuning.

Configuration:

```text
Base model: Qwen/Qwen3-8B
Embedding model: BAAI/bge-base-en-v1.5
LoRA rank: 8
LoRA alpha: 16
LoRA dropout: 0.05
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
Epochs: 20
Per-device train batch size: 4
Per-device eval batch size: 8
Target global batch size: 8
Learning rate: 8e-5
Max sequence length: 1024
Validation ratio: 0.10
Test ratio: 0.10
Save merged model: False
Runtime model path: base model + LoRA adapter
```

The notebook uses a safer non-merged runtime path when merged model export is unavailable or memory-expensive.

---

## 5.6 Baseline 2 Retrieval Evaluation

Baseline 2 includes a retrieval evaluation set.

Retrieval metrics:

```json
{
  "retrieval_eval_size": 1000,
  "top_k": 3,
  "hit_at_1_primary": 0.821,
  "hit_at_k_primary": 0.954,
  "hit_at_k_same_group": 0.991,
  "scope_match_at_1": 0.996,
  "retriever_type": "dense_embedding + faiss + metadata_rerank",
  "embedding_model_name": "BAAI/bge-base-en-v1.5"
}
```

Interpretation:

- `hit_at_1_primary = 0.821` means the top retrieved chunk is exactly the expected primary evidence in 82.1% of cases.
- `hit_at_k_primary = 0.954` means the correct primary evidence appears within the top 3 in 95.4% of cases.
- `hit_at_k_same_group = 0.991` means an acceptable evidence chunk from the same knowledge group appears in the top 3 in 99.1% of cases.
- `scope_match_at_1 = 0.996` means the top result almost always matches the correct undergraduate/postgraduate/general scope.

This confirms that the RAG system is not random retrieval.
It is a strong metadata-aware retrieval baseline.

---

## 5.7 Baseline 2 Generation Evaluation

Generation evaluation was run on a smaller selected set for runtime practicality.

Results:

```json
{
  "generation_eval_size": 20,
  "top_k": 3,
  "plain_exact_match": 0.0,
  "plain_token_f1": 0.3391,
  "rag_exact_match": 0.0,
  "rag_token_f1": 0.8460,
  "rag_minus_plain_exact_match": 0.0,
  "rag_minus_plain_token_f1": 0.5069
}
```

This shows a large improvement from RAG:

```text
Plain token F1: 0.3391
RAG token F1: 0.8460
Improvement: +0.5069
```

This is one of the strongest pieces of evidence in the project. On this 20-example set, retrieval grounding raises token F1 by about 0.51 over plain generation, while exact match stays at 0.0 in both settings.

---

# 6. Agent Layer in Baseline 2 and Improved Model

## 6.1 Why an Agent Is Needed

The handbook is reliable for stable academic rules, but some questions may require official web information.

Examples:

```text
Who is the current dean?
Where can students find residential college information?
What official page mentions PEKOM?
Where is the official SPeCTRUM page?
```

For these cases, the system needs a controlled web agent.

However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent.

---

## 6.2 Official UM / FSKTM Web Agent

The web agent is constrained to official UM / FSKTM domains.

Priority domains:

```text
fsktm.um.edu.my
www.um.edu.my
```

Auxiliary official domains include UM-related systems such as:

```text
aasd.um.edu.my
maya.um.edu.my
umlib.um.edu.my
umresearch.um.edu.my
jobs.um.edu.my
careerportal.fsktm.um.edu.my
intra.fsktm.um.edu.my
gallery.fsktm.um.edu.my
```

The agent performs:

```text
query planning
official web discovery
URL filtering
page fetching
evidence extraction
evidence scoring
Qwen-based evidence judging
retry if weak
fallback to handbook RAG if needed
```

---

## 6.3 Agent Is Not Fully Autonomous by Design

This project does not use a completely unrestricted autonomous agent.
That is intentional.

For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be:

```text
safe
source-constrained
traceable
fallback-aware
grounded
```

So the agent is better described as:

> A constrained official-source web agent controlled by Harness Engineering.

---

# 7. Harness Engineering

## 7.1 What Harness Engineering Means Here

Harness Engineering is the guardrail system around the model and agent.

A simple analogy:

```text
The LLM/agent is the car.
Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard.
```

The model can generate fluent answers, but the harness controls:

- what it is allowed to search
- what sources it can trust
- whether a URL is fake
- whether evidence is useful
- whether the answer is grounded
- whether the system should retry
- whether it should fall back to local handbook RAG
- what trace should be shown to the user

---

## 7.2 Harness Pipeline

The standardized TensorTalk Harness Core follows this structure:

```text
User Question
    ↓
Local Handbook RAG
    ↓
Official Web Discovery
    ↓
Domain Guard
    ↓
Fake URL Guard
    ↓
WAF Detection
    ↓
Evidence Normalizer
    ↓
Qwen Evidence Judge
    ↓
Entity-aware Retry
    ↓
Weak Evidence Fallback
    ↓
Answer Generator
    ↓
Answer Grounding Judge
    ↓
Completeness Guard
    ↓
UI Trace
```

---

## 7.3 Harness Components

The notebooks include several engineering patches and layers:

### V14: WAF-aware Harness

Handles web pages blocked by WAF or browser failures.

Functions:

- detect WAF block pages
- exclude blocked pages from evidence
- provide diagnostics
- use safe static fallback if browser click fails
- reject query-fabricated URLs before evidence building

---

### V15: Qwen Evidence Judge Loop

Adds an LLM-based evidence judge.

Flow:

```text
Planner
→ Search / Fetch
→ Evidence Filter
→ Qwen Judge
→ Retry
→ Final Evidence
```

The purpose is to avoid trusting weak web snippets blindly.
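
The judge loop above, combined with the domain whitelist from section 6.2, can be sketched as follows. This is a simplified illustration, not the V15 implementation: `judge_loop`, its `plan_queries`, `search`, and `judge` callables, and the `max_retries` and `min_score` values are hypothetical stand-ins for the planner, the fetcher, and the Qwen judge. Only the domain names come from this document.

```python
from urllib.parse import urlparse

# Official domains listed in section 6.2.
OFFICIAL_HOSTS = {
    "fsktm.um.edu.my", "www.um.edu.my", "aasd.um.edu.my", "maya.um.edu.my",
    "umlib.um.edu.my", "umresearch.um.edu.my", "jobs.um.edu.my",
    "careerportal.fsktm.um.edu.my", "intra.fsktm.um.edu.my",
    "gallery.fsktm.um.edu.my",
}


def is_official(url):
    """Domain guard: accept only exact matches against the whitelist."""
    host = (urlparse(url).hostname or "").lower()
    return host in OFFICIAL_HOSTS


def judge_loop(question, plan_queries, search, judge, max_retries=2, min_score=0.6):
    """Planner -> search -> filter -> judge -> retry, with a fallback flag.

    plan_queries(question, attempt) returns a list of query strings,
    search(query) returns a list of {"url", "text"} dicts, and
    judge(question, text) returns a score in [0, 1]. All three are injected,
    so no real search or model API is assumed here.
    """
    trace = []
    for attempt in range(max_retries + 1):
        for query in plan_queries(question, attempt):
            for item in search(query):
                if not is_official(item["url"]):
                    trace.append(("rejected_domain", item["url"]))
                    continue
                score = judge(question, item["text"])
                trace.append(("judged", item["url"], score))
                if score >= min_score:
                    return {"evidence": item, "fallback": False, "trace": trace}
    # No attempt produced strong evidence: signal fallback to local handbook RAG.
    return {"evidence": None, "fallback": True, "trace": trace}
```

Returning the trace alongside the decision mirrors the document's emphasis on traceability: the same list can feed a UI trace panel whether the loop accepted web evidence or fell back to local RAG.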

---

### V16: Local-aware Judge Repair

Improves routing and fallback.

It handles:

```text
PEKOM routing
CCNA Lab routing
residential college routing
local RAG fallback
entity-aware retry
fake URL rejection
```

---

### V17: Strict Entity Judge and UI Polish

Adds stricter entity matching and improves trace display.

This helps avoid cases where a query about one entity is answered with another related but wrong page.

---

### V18: Balanced Official Reference Fallback

Allows the system to still provide official references when strong web evidence is not enough, while avoiding over-trusting weak pages.

---

### V19: Answer Grounding Judge

Checks whether the final generated answer is actually supported by evidence.

This is important because even if retrieval is correct, the model may still introduce unsupported details.

---

### Completeness Guard

Checks whether the answer is too incomplete and whether a rewrite or fallback should be triggered.

---

# 8. Improved Model: PPO Rule-reward Post-training + RAG + Agent + Harness

## 8.1 Purpose

The Improved Model asks:

> Can we further improve the model’s behavior after SFT/RAG by using PPO reward-based post-training?

Baseline 2 already improves factual grounding through RAG and Harness Engineering. The Improved Model adds PPO to shape the model’s behavior.

The goal is not to replace RAG. The goal is to make the model more aligned with the desired answer style and safety behavior.

---

## 8.2 What PPO Means in This Project

PPO stands for **Proximal Policy Optimization**.

In simple terms:

```text
SFT teaches the model by imitation.
PPO lets the model generate answers, scores them with a reward function,
and updates the model toward higher-reward answers.
```

In this project:

```text
Actor model: Qwen3-8B + LoRA
Critic/value head: TRL value head model
Reference model: frozen Qwen3-8B reference
Reward: rule-based preference reward function
KL control: used to avoid drifting too far from the reference model
```

---

## 8.3 Rule-based Reward Function

This project uses a rule-based reward function rather than a separately trained neural reward model.

The reward function evaluates:

```text
gold answer similarity
rejected answer penalty
evidence overlap
scope correctness
hallucinated URL penalty
vague answer penalty
process/thinking leakage penalty
direct answer bonus
repetition penalty
degeneration/collapse penalty
```

This is why the model card should describe the final stage as:

> Rule-reward PPO post-training

not:

> Full RLHF with a trained reward model

The reward model type recorded in the notebook is:

```text
rule_based_preference_reward_function
uses_separate_neural_reward_model: False
```

---

## 8.4 PPO Training Configuration

The final PPO run uses:

```text
Preference dataset rows: 1000
Train rows: 900
Validation rows: 100
MAX_PPO_ROWS: None
Train fraction: 0.90
PPO epochs: 2
Batch size: 2
Mini-batch size: 1
Max new tokens: 72
Max PPO steps per epoch: None
Planned steps per epoch: 450
Total planned steps: 900
Learning rate: 2e-6
Target KL: 0.10
Generation temperature: 0.45
Top-p: 0.78
Repetition penalty: 1.3
No-repeat ngram size: 4
```

The run completed successfully:

```text
Global PPO steps: 900 / 900
Elapsed time: 04:47:59
Degenerate ratio: 0.00%
```

---

## 8.5 PPO Artifact Verification

The Stage 3 notebook includes strict artifact verification.

This is important because PPO notebooks can easily appear to run while silently saving old or incomplete artifacts.

The strict save cell verifies:

```text
training_log exists
training_log records = 900
expected steps = 900
MAX_PPO_ROWS = None
train rows = 900
valid rows = 100
NUM_PPO_EPOCHS = 2
MAX_PPO_STEPS_PER_EPOCH = None
parameter hash changed after PPO
PPO inference full actor exists
PPO LoRA adapter exists
non-PPO fallback forbidden
```

The final strict save output confirms:

```text
Final PPO records saved: 900 / expected 900
Strict full PPO artifact contract passed.
```

The parameter change proof confirms:

```text
aggregate_hash_changed: true
changed_trainable_tensors: 506
unchanged_trainable_tensors: 0
```

This proves that PPO training changed the trainable LoRA/value-head parameters rather than merely running a dry notebook.

---

## 8.6 Strict PPO-only Runtime

The final runtime is configured so that the UI must use PPO artifacts only.

The strict PPO gate confirms:

```text
PPO records: 900
PPO full actor usable: True
PPO LoRA adapter usable: True
Strict PPO-only UI mode: True
```

The runtime loading order is:

```text
1. PPO full inference actor if full weights exist
2. Otherwise base Qwen3-8B + PPO LoRA adapter
3. Non-PPO fallback is forbidden
```

This prevents the final demo from accidentally loading an old Baseline 2 model or a stale 150-step PPO proof artifact.

---

## 8.7 PPO Validation

The PPO-only validation evaluation uses a held-out validation sample.

The displayed validation summary is:

```text
reward: 0.477789
gold_overlap: 0.255351
rejected_overlap: 0.155080
```

Interpretation:

- reward is positive
- gold overlap is higher than rejected overlap
- the PPO-trained actor tends to move closer to preferred answers than rejected answers

This does not mean the PPO model is perfect. It means the reward-shaped behavior is directionally positive.

---

## 8.8 PPO Limitations

The PPO run is successful, but the raw PPO generations still show some imperfections.

Observed issues include:

1. **Process leakage**
   Some outputs still include phrases like:

   ```text
   Okay, let me try to figure out...
   Wait, I need to check again...
   ```

   The reward function penalizes this, but it is not completely eliminated.

2. **Occasional hallucinated URLs**
   Some raw generations may still invent URLs. The harness fake URL guard is therefore still necessary.

3. **OCR-style text artifacts**
   Some source chunks contain spacing or OCR issues, and the model may reproduce them.

4. **KL can be high**
   Some PPO logs show high `objective/kl`, meaning the PPO actor can drift noticeably from the reference model. However, the run completed with:

   ```text
   degenerate_ratio = 0.00%
   ```

   and no detected repetition collapse.

5. **RAG/Harness remains necessary**
   PPO improves model behavior, but it does not replace retrieval grounding or guardrails.

---

# 9. TensorTalk UI

The project includes a WhatsApp-style Jupyter HTML UI called **TensorTalk**.

The UI supports:

- chat-style interface
- TensorCat avatar
- RAG on/off control
- web agent on/off control
- collapsed trace panels
- retrieved evidence display
- web evidence display
- planning/thinking display layer
- harness decision trace
- answer grounding information
- strict PPO artifact loading
- new chat reset behavior

The UI is part of the engineering contribution because it makes the harness process visible rather than hidden.

---

# 10. Smoke Tests

## 10.1 What Smoke Test Means Here

A smoke test is a lightweight system sanity check.

It is not a full evaluation. It is a quick check that the main pipeline still works.

In this project, smoke tests check whether:

```text
PPO model loads
RAG retrieves evidence
web agent searches official sources
fake URL guard blocks synthetic links
answer grounding returns a result
trace structure is produced
fallback behavior still works
```

---

## 10.2 Example Smoke Tests

The notebook defines smoke tests such as:

```text
1. PEKOM should not be routed to AI bachelor page
2. Residential college should prefer student-affairs residential page
3. CCNA Lab should not invent synthetic URLs
```

These are not random examples. They are chosen to test known fragile parts of the pipeline:

- entity routing
- official URL preference
- fake URL rejection
- web/RAG trace structure

---

# 11. Control Variable Design

The project uses a control-variable style comparison.

The base task remains the same:

```text
UM FSKTM Handbook QA
```

The base model family remains the same:

```text
Qwen3-8B
```

The dataset domain remains the same:

```text
Undergraduate + postgraduate + general UM Handbook knowledge
```

What changes is the system layer:

```text
Baseline 1: SFT only
Baseline 2: SFT + RAG + Harness Agent
Improved: SFT/RAG/Harness + PPO post-training
```

This allows the project to compare which improvements come from:

- parameter learning
- retrieval grounding
- metadata-aware scope control
- official web augmentation
- harness guardrails
- PPO reward shaping

This is more rigorous than simply building three unrelated systems.

---

# 12. Stage-by-stage Comparison Table

| Dimension | Baseline 1: Closed-book SFT | Baseline 2: RAG + SFT + Agent/Harness | Improved Model: PPO + RAG + Agent/Harness |
|---|---|---|---|
| Main research question | Can the model memorize and reproduce handbook QA from SFT? | Does retrieval-grounded evidence improve handbook QA? | Can rule-reward PPO further align answer behavior while keeping RAG/Harness control? |
| Base model | Qwen3-8B | Qwen3-8B | Qwen3-8B |
| Main training method | Supervised fine-tuning | RAG-augmented supervised fine-tuning | Rule-reward PPO post-training |
| Dataset used | 1000 SFT QA rows | SFT QA + metadata + RAG KB + RAG eval | 1000 PPO preference rows |
| Train/validation/test | 800 / 100 / 100 | 8:1:1 RAG-augmented split | 900 train / 100 validation |
| Retrieval | No | Yes | Yes |
| Retrieval type | None | Dense embedding + FAISS + metadata-aware rerank | Same RAG runtime reused |
| Embedding model | None | BAAI/bge-base-en-v1.5 | RAG runtime inherited from Baseline 2 |
| Top-k evidence | None | Top-3 | Top-3 / runtime-dependent |
| Metadata awareness | Hidden metadata only, not used at inference | Yes, scope/source/section aware | Yes, used by RAG/Harness runtime |
| Scope control | Weak; model may confuse UG/PG if prompt is ambiguous | Stronger due to metadata-aware retrieval | Stronger due to RAG + PPO reward + harness |
| Web agent | No | Yes | Yes |
| Official domain control | No | Yes, UM/FSKTM official domain whitelist | Yes, same official-source guardrails |
| Fake URL guard | No | Yes | Yes |
| WAF handling | No | Yes | Yes |
| Evidence judge | No | Yes, Qwen evidence judge | Yes |
| Retry/fallback policy | No | Yes | Yes |
| Answer grounding judge | No | Yes | Yes |
| Completeness guard | No | Yes | Yes |
| UI trace | Basic chat UI | Harness trace panels | Strict PPO + Harness trace panels |
| LoRA rank | 16 | 8 | PPO actor based on LoRA actor/value setup |
| Training epochs | 8 SFT epochs | 20 SFT epochs | 2 PPO epochs |
| Main output artifact | LoRA adapter + merged model + `.pt` export | LoRA adapter, optional non-merged runtime | PPO full inference actor + PPO LoRA adapter + manifest |
| Artifact strictness | Standard save | Adapter/runtime path checks | Manifest, training log count, parameter hash proof, strict gate |
| Key metric | Test token F1 ≈ 0.8869 | RAG token F1 ≈ 0.846 on selected eval; retrieval Hit@3 ≈ 0.954 | PPO validation reward ≈ 0.4778; gold overlap > rejected overlap |
| Strongest contribution | Clean SFT baseline | Evidence-grounded QA and metadata-aware retrieval | Full PPO post-training with strict artifact verification and harnessed runtime |
| Main weakness | Closed-book hallucination risk | More complex runtime, depends on retriever quality | PPO raw outputs still need Harness/RAG due to possible process leakage and fake URLs |
| Control variable role | Establishes parameter-only baseline | Adds retrieval and harness while keeping same domain/model family | Adds PPO reward shaping while preserving RAG/Harness pipeline |

---

# 13. Technical Comparison of the Three Stages

## 13.1 Content-level Difference

| Content Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Stable handbook facts | Learned into model parameters | Retrieved from handbook KB | Retrieved and answered by PPO-aligned actor |
| Latest or official web info | Not supported | Supported through official web agent | Supported through same official web agent |
| UG vs PG distinction | Learned implicitly | Controlled by metadata retrieval | Controlled by metadata retrieval + reward/harness |
| Evidence visibility | Not shown | Evidence shown in RAG trace | Evidence shown in PPO/Harness trace |
| Hallucination control | Mostly prompt-based | Retrieval + grounding | Retrieval + grounding + reward penalties |
| Fake URL control | Not available | Harness URL guard | Harness URL guard + PPO penalty signal |

---

## 13.2 Engineering-level Difference

| Engineering Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Notebook purpose | Train and evaluate closed-book SFT model | Build RAG-augmented model and harnessed agent runtime | Train PPO actor and attach it to final harness runtime |
| Runtime complexity | Low | High | Highest |
| Debug trace | Basic | Detailed RAG/Web/Harness trace | Detailed PPO/RAG/Web/Harness trace |
| Failure handling | Minimal | Fallback and guardrail logic | Strict PPO-only fallback prevention plus harness fallback |
| Artifact verification | Basic output save | Adapter/merged path checks | Manifest, training log count, parameter hash proof, strict gate |
| Risk of stale artifact use | Moderate | Moderate | Actively guarded against |
| Demo readiness | Good for simple QA | Strong for grounded QA | Strongest for final controlled system demo |

---

# 14. Why the Improved Model Does Not Replace RAG

A key design decision is that PPO does not replace RAG.

PPO improves the model’s tendency to:

- answer directly
- avoid rejected-style answers
- avoid vague answers
- avoid process leakage
- avoid fake URLs
- avoid repetition collapse
- use evidence-like wording more appropriately

But PPO does not guarantee factual correctness by itself.

Therefore, the final system still needs:

```text
RAG for evidence
Web Agent for official/latest information
Harness for source control
Grounding judge for answer verification
Fallback for weak evidence
```

This is the correct division of responsibility:

```text
SFT: teaches domain answer style
RAG: supplies factual evidence
Agent: finds official external evidence
Harness: controls trust, routing, fallback, and trace
PPO: improves answer behavior according to reward preferences
```

---

# 15. Known Limitations

This project is a strong applied LLM system prototype, but it has limitations.

## 15.1 Not a full human-feedback RLHF system

The PPO stage uses a rule-based reward function. It does not train a separate neural reward model from human preference labels.

Correct description:

```text
Rule-reward PPO post-training
```

Not:

```text
Full RLHF with learned reward model
```

---

## 15.2 Raw PPO generations can still be imperfect

Observed raw PPO generations may include:

- process leakage
- occasional hallucinated URLs
- OCR-like token spacing
- incomplete course titles
- noisy source-text reproduction

The final Harness runtime is therefore necessary.
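
To make the first two failure modes concrete, the sketch below shows one way a post-generation check could flag process leakage and URLs that do not appear in the retrieved evidence. It is a hypothetical illustration and not the project's harness guard: the function name `check_raw_generation`, the leakage patterns (modelled on the example phrases in section 8.8), and the rule that a URL is acceptable only if it occurs verbatim in the evidence are all assumptions.

```python
import re

# Assumed patterns, based on the leaked phrases quoted in section 8.8.
LEAK_PATTERNS = [
    r"\bokay, let me\b",
    r"\bwait, i need to\b",
    r"\blet me (try|think|check)\b",
]
URL_RE = re.compile(r"https?://[^\s)>\]\"']+")


def extract_urls(text):
    """Return the set of URLs in text, with trailing punctuation stripped."""
    return {url.rstrip(".,;:") for url in URL_RE.findall(text)}


def check_raw_generation(answer, evidence_texts):
    """Flag process leakage and URLs that are absent from the evidence."""
    lowered = answer.lower()
    leaks = [p for p in LEAK_PATTERNS if re.search(p, lowered)]
    allowed = set()
    for text in evidence_texts:
        allowed |= extract_urls(text)
    fake_urls = sorted(extract_urls(answer) - allowed)
    return {
        "process_leakage": bool(leaks),
        "fake_urls": fake_urls,
        "ok": not leaks and not fake_urls,
    }
```

A failed check would be a natural trigger for the rewrite or fallback paths described in the harness sections, rather than a reason to show the raw generation to the user.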

---

## 15.3 Web search is constrained

The web agent is intentionally limited to official UM/FSKTM sources. It may refuse or fall back when official evidence is weak.

This is a feature, not a bug, because the system prioritizes trustworthiness over open-ended browsing.

---

## 15.4 RAG depends on knowledge base quality

If the RAG KB contains OCR noise or incomplete chunks, the model may inherit that noise.

Future work should improve source cleaning and chunk normalization.

---

## 15.5 Notebook-based prototype

The project is implemented as notebooks. A production version should separate modules into:

```text
data/
retrieval/
agent/
harness/
training/
evaluation/
ui/
tests/
```

---

# 16. Recommended Usage

This project is intended for research, coursework, and demonstration purposes.

It is not an official Universiti Malaya system. For official academic decisions, students should always refer to the official handbook, faculty office, or UM/FSKTM official websites.

---

# 17. Suggested Inference Flow

For final demonstration, use the Improved Model runtime:

```text
1. Load PPO full inference actor if available.
2. If unavailable, load base Qwen3-8B + PPO LoRA adapter.
3. Initialize local handbook RAG.
4. Enable official UM/FSKTM web agent if the question may need external/latest information.
5. Run through TensorTalkHarnessCore.
6. Display answer with evidence trace.
```

Strict runtime requirement:

```text
Non-PPO fallback is forbidden in the final Improved Model demo.
```

---

# 18. Summary

TensorTalk demonstrates a staged LLM system development workflow:

```text
Baseline 1: Qwen3-8B learns handbook QA through closed-book SFT.
Baseline 2: The system adds RAG, dense retrieval, metadata-aware reranking, official web search, and Harness Engineering.
Improved Model: The system adds full 1000-row rule-reward PPO post-training, strict artifact verification, and a PPO-only final harness runtime.
```

The most important contribution is not only that the model can answer handbook questions, but that the system is controlled, evidence-aware, source-constrained, traceable, and evaluated through a clear baseline progression.

The final system should be understood as:

> **A Qwen3-8B based UM Handbook RAG Agent, improved with rule-reward PPO and controlled by Harness Engineering.**