TensorCat
/

TensorTalk

TensorCat committed 6 days ago

Commit

8a70417

verified ·

1 Parent(s): b4b8aae

Update README.md

Files changed (1)

README.md +1300 -157

README.md CHANGED

@@ -1,250 +1,1393 @@
 ---
 license: apache-2.0
 ---
-# TensorTalk / UM_Handbook
-TensorTalk is a handbook-grounded academic chat assistant built for the **Faculty of Computer Science and Information Technology, Universiti Malaya (UM)**.
-This project focuses on turning UM handbook content into a usable question-answering system through:
-- handbook preprocessing
-- source chunk construction
-- supervised QA dataset building
-- Qwen3-8B LoRA fine-tuning
-- merged-model deployment
-- a browser-style HTML chat demo
 ---
-## Project Goal
-The main goal of this project is to build a handbook-based assistant that can answer student questions using information learned from the UM handbook domain.
-The current version is designed around:
-- undergraduate and postgraduate handbook content
-- handbook-faithful answers
-- concise student-facing responses
-- a local/demo deployment workflow on DICC and notebook environments
-This project is also intended to support a broader experimental pipeline:
-- **Baseline 1:** closed-book supervised fine-tuning
-- **Baseline 2:** retrieval-augmented version for later comparison
 ---
-## What This Project Contains
-### 1. Dataset Preparation
-The project includes scripts and resources for preparing handbook data before fine-tuning:
-- handbook markdown preprocessing
-- source chunk dataset building
-- SFT QA dataset construction
-- configuration management for the preprocessing and dataset pipeline
-### 2. Fine-Tuning Workflow
-The model training workflow uses a Qwen3-8B base model with LoRA-based fine-tuning on the UM handbook QA dataset.
-The fine-tuning workflow includes:
-- notebook-based training on DICC
-- device-aware loading logic
-- train / validation / test style evaluation workflow
-- merged-model export for direct inference
-- LoRA adapter export for optional PEFT-based reuse
-- metrics and prediction file generation
-### 3. Deployment Demo
-The project includes a notebook-based HTML chat UI called **TensorTalk**.
-The demo provides:
-- a browser-style chat layout
-- a handbook-focused system prompt
-- merged-model loading for direct inference
-- a student-facing question-answer workflow
-- a simple deployment path for demonstration purposes
 ---
-## Current Project Structure
 ```text
-UM_Handbook/
-├── Dataset/
-│   └── SFT_Dataset/
-│       ├── SFT_QA_Training_Ready.jsonl
-│       ├── SFT_QA_Training_Ready_pretty.json
-│       ├── SFT_QA_Metadata.jsonl
-│       └── SFT_QA_Metadata_pretty.json
-├── assets/
-├── outputs/
-│   └── qwen3_um_handbook_optimized_1/
-│       ├── lora_adapter/
-│       ├── merged_model/
-│       ├── trainer_runs/
-│       ├── test_eval_runs/
-│       ├── dataset_split_summary.json
-│       ├── final_metrics.json
-│       ├── test_predictions.jsonl
-│       └── validation_predictions.jsonl
-├── FineTune_QWEN3_UM_Handbook_optimized_1.ipynb
-├── UM_Handbook_Markdown_Preprocess.py
-├── UM_SFT_QA_Dataset_Builder_from_Index.py
-├── UM_Source_Chunk_Dataset_Builder.py
-└── um_handbook_config.py
 ```
 ---
-## Key Files
-### Training and Data
-- `Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl`
-  Main SFT training dataset used for handbook QA fine-tuning.
-- `UM_Handbook_Markdown_Preprocess.py`
-  Preprocesses handbook markdown / extracted source text.
-- `UM_Source_Chunk_Dataset_Builder.py`
-  Builds source chunks for downstream dataset and retrieval-related use.
-- `UM_SFT_QA_Dataset_Builder_from_Index.py`
-  Builds the supervised QA dataset from curated handbook content.
-- `um_handbook_config.py`
-  Central configuration file for paths and data-processing settings.
-### Training Output
-- `outputs/qwen3_um_handbook_optimized_1/merged_model/`
-  Main inference-ready model directory.
-  This is the directory used by the demo chat UI.
-- `outputs/qwen3_um_handbook_optimized_1/lora_adapter/`
-  LoRA adapter weights.
-  This is useful for PEFT-style loading with a base model, but it is not the primary path used by the current demo UI.
-- `outputs/qwen3_um_handbook_optimized_1/final_metrics.json`
-  Final evaluation summary.
-- `outputs/qwen3_um_handbook_optimized_1/validation_predictions.jsonl`
-  Validation-set generated answers for inspection.
-- `outputs/qwen3_um_handbook_optimized_1/test_predictions.jsonl`
-  Test-set generated answers for inspection.
-### Demo
-- `FineTune_QWEN3_UM_Handbook_optimized_1.ipynb`
-  Main notebook that contains the fine-tuning workflow and the TensorTalk HTML chat demo.
 ---
-## Model Artifact Notes
-This project may contain several model-related outputs. They are not all used in the same way.
-### `merged_model/`
-This is the most important deployment artifact for the current demo.
-Use this when:
-- running the current TensorTalk HTML chat UI
-- loading the fine-tuned model directly with Hugging Face `from_pretrained(...)`
-- sharing the main inference-ready model
-### `lora_adapter/`
-This contains LoRA delta weights only.
-Use this when:
-- loading the adapter on top of the original base model
-- reusing the fine-tuning result in a PEFT workflow
-- experimenting with a smaller transferable fine-tuning artifact
-### `.pt` exported model file
-If present, the `.pt` file is mainly a saved full-model artifact / backup export.
-Use this when:
-- archiving the full fine-tuned weights
-- running a custom loading workflow that explicitly expects a `.pt` file
-For the current TensorTalk chat UI, the primary runtime artifact is still **`merged_model/`**.
 ---
-## Current Demo Behavior
-The current demo is designed to answer questions such as:
-- dress code and appearance guidance
-- programme core courses / credit requirements
-- undergraduate vs postgraduate handbook information
-- academic rules and handbook-supported policy questions
-The answer style is intended to be:
-- handbook-grounded
-- short and direct
-- student-facing
-- non-speculative
 ---
-## Example Demo Output
-The screenshot below shows the current TensorTalk chat interface running with the fine-tuned UM handbook model.
-![TensorTalk Demo](assets/tensortalk_demo_chat.jpg)
 ---
-## Repository Preview
-The screenshot below shows the current top-level project layout.
-![Repository Structure](UM_Handbook/assets/tensortalk_demo_chat.jpg)
 ---
-## Suggested Minimal Deployment Package
-If the goal is only to demonstrate the chat UI to teammates, the minimal useful set is:
-- `merged_model/`
-- the chat notebook / UI code
-- optional avatar image under `assets/`
-The following items are not required for a simple demo run:
-- intermediate training checkpoints
-- test evaluation run directories
-- optional full `.pt` export
-- raw training logs not used by the demo
 ---
-## Notes
-- The project is organized so that **Dataset**, **models / outputs**, and **demo code** remain separate.
-- The current demo is notebook-friendly and was prepared around a DICC workflow.
-- The deployment path prioritizes clarity and reproducibility over a heavyweight full-stack application setup.
 ---
-## Status
-Current project status:
-- handbook preprocessing pipeline prepared
-- supervised QA dataset prepared
-- LoRA fine-tuning workflow completed
-- merged model exported
-- TensorTalk HTML chat demo running
-- evaluation outputs generated
 ---
-## Author / Project Name
-**TensorTalk**
-UM Handbook QA / Fine-Tuned Qwen3-8B LoRA Project

 ---
 license: apache-2.0
+language:
+- en
+- zh
+base_model:
+- Qwen/Qwen3-8B
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- qwen3
+- qwen3-8b
+- lora
+- qlora
+- sft
+- rag
+- faiss
+- dense-retrieval
+- agent
+- ppo
+- rlhf
+- rule-reward
+- harness-engineering
+- um-handbook
+- question-answering
+- chatbot
+- education
+- tensor-talk
 ---
+# TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering
+TensorTalk is a staged LLM engineering project for **handbook question answering at the Faculty of Computer Science and Information Technology, Universiti Malaya**. It answers undergraduate, postgraduate, and general faculty handbook questions and is developed through a controlled progression of three experimental stages:
+1. **Baseline 1 — Closed-book SFT Qwen3-8B**
+2. **Baseline 2 — SFT Qwen3-8B + Metadata-aware RAG + Official Web Agent + Harness Engineering**
+3. **Improved Model — Rule-reward PPO post-training + RAG + Agent + Harness Engineering**
+The project is more than a chatbot. It is a controlled comparison of how an LLM system improves when moving from closed-book supervised fine-tuning that relies on parameter memory, to retrieval-grounded answering, and finally to rule-reward post-training with a guarded agentic runtime.
+The main idea is:
+> Baseline 1 tests whether a fine-tuned model can answer handbook questions from parameters alone.
+> Baseline 2 keeps the same base model family but adds retrieval and harnessed evidence control.
+> The Improved Model keeps the RAG + Agent + Harness runtime and further adds PPO post-training to make the model better aligned with the desired answer behavior.
+---
+# 1. Project Goal
+The goal of this project is to build a reliable and traceable UM Handbook assistant that can answer questions about:
+- Faculty objectives, vision, mission, history, facilities, and academic calendar
+- Undergraduate programme details
+- Postgraduate programme details
+- Candidature requirements
+- Grading and academic rules
+- Industrial training
+- Academic project requirements
+- Supervision policy
+- Thesis/dissertation requirements
+- Academic integrity and plagiarism
+- Facilities and labs
+- Official UM/FSKTM web information when handbook knowledge is insufficient or time-sensitive
+The project also aims to demonstrate a complete LLM system development path:
+```text
+Closed-book SFT
+→ RAG-augmented SFT
+→ Metadata-aware retrieval
+→ Official-source web agent
+→ Harness Engineering guardrails
+→ PPO rule-reward post-training
+→ Strict artifact verification
+→ Traceable TensorTalk UI
+```
+---
+# 2. High-level System Overview
+The final TensorTalk system contains several layers.
+```text
+User Question
+    ↓
+TensorTalk UI
+    ↓
+Planning / Thinking Display Layer
+    ↓
+Local Handbook RAG
+    ↓
+Official UM / FSKTM Web Agent
+    ↓
+Harness Engineering Guardrails
+    ↓
+Evidence Judge / Retry / Fallback
+    ↓
+PPO-trained Qwen3-8B Actor
+    ↓
+Answer Grounding Judge
+    ↓
+Completeness Guard
+    ↓
+Final Answer + Trace Panels
+```
+The final model is not used alone. It is wrapped inside a runtime harness that controls:
+- where the system can search
+- which sources it can trust
+- whether web evidence is useful
+- whether retrieved evidence supports the answer
+- whether the model produced fake URLs
+- whether the answer leaked internal reasoning
+- whether fallback to local handbook RAG is needed
+- whether the final answer is grounded enough to show
+This is why the final stage is better described as:
+> **A PPO-aligned RAG agent system with Harness Engineering**, rather than only a fine-tuned model.
+---
+# 3. Dataset Design
+## 3.1 Source Domain
+The dataset is built around UM FSKTM undergraduate and postgraduate handbook content. The data is organized into:
+- SFT question-answer dataset
+- hidden metadata
+- RAG knowledge base
+- RAG evaluation dataset
+- PPO preference dataset
+The project separates **model-visible training text** from **metadata used for retrieval, evaluation, and analysis**.
+This distinction is important:
+- Baseline 1 intentionally trains on question-answer text without forcing explicit metadata labels into the model-visible answer.
+- Baseline 2 uses metadata-aware retrieval to reduce scope confusion.
+- Stage 3 PPO uses preference pairs and reward functions to shape answer behavior.
+---
+## 3.2 Baseline 1 SFT Dataset
+Baseline 1 uses:
+```text
+SFT_QA_Training_Ready.jsonl
+```
+The notebook validates:
+```text
+Total examples: 1000
+Train examples: 800
+Validation examples: 100
+Test examples: 100
+Split ratio: 8:1:1
+Duplicate question groups: 0
+Duplicate question rows: 0
+```
+Each example follows a supervised chat-style format:
+```json
+{
+  "prompt": [
+    {
+      "role": "system",
+      "content": "You are an academic assistant for the Faculty of Computer Science and Information Technology, Universiti Malaya..."
+    },
+    {
+      "role": "user",
+      "content": "What are the faculty objectives?"
+    }
+  ],
+  "completion": [
+    {
+      "role": "assistant",
+      "content": "The faculty objectives are..."
+    }
+  ],
+  "question": "...",
+  "answer": "..."
+}
+```
+This stage teaches the model to imitate handbook-style answers directly.
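The split and duplicate checks reported above can be sketched in a few lines. This is a minimal illustration, assuming an integer ratio and whitespace/case-insensitive question matching; the helper names `split_counts` and `duplicate_questions` are hypothetical and are not taken from the notebook.

```python
from collections import Counter


def split_counts(total: int, ratios=(8, 1, 1)) -> tuple:
    """Return (train, validation, test) sizes for an integer ratio such as 8:1:1."""
    unit = total // sum(ratios)
    train, valid = ratios[0] * unit, ratios[1] * unit
    # Any remainder goes to the test split so the sizes always sum to `total`.
    return train, valid, total - train - valid


def duplicate_questions(rows: list) -> list:
    """List normalized questions that occur more than once in the dataset."""
    counts = Counter(" ".join(r["question"].lower().split()) for r in rows)
    return [q for q, n in counts.items() if n > 1]


# Toy rows for illustration only; real rows come from SFT_QA_Training_Ready.jsonl.
rows = [{"question": "What are the faculty objectives?"},
        {"question": "what are the  faculty objectives?"}]
print(split_counts(1000))         # (800, 100, 100)
print(duplicate_questions(rows))  # ['what are the faculty objectives?']
```

Normalizing case and whitespace before counting is a design choice that catches near-identical questions; a stricter or looser definition of "duplicate" would change the reported counts.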
+---
+## 3.3 Baseline 2 RAG Dataset
+Baseline 2 follows the same SFT dataset design, but adds external retrieval resources:
+```text
+UM_RAG_Knowledge_Base.jsonl
+UM_RAG_Evaluation_Dataset.jsonl
+SFT_QA_Metadata.jsonl
+```
+The RAG knowledge base contains structured fields such as:
+```text
+kb_id
+source_doc
+scope_label
+section
+pages
+source_text
+retrieval_text
+retrieval_keywords
+grounded_answer_bank
+matched_qa_ids
+```
+The RAG knowledge base loaded in the final Stage 3 runtime contains:
+```text
+Loaded KB rows: 521
+```
+The metadata layer allows the system to distinguish:
+```text
+general
+undergraduate
+postgraduate
+```
+This is important because many handbook questions look similar but require different answers depending on the student scope.
+---
+## 3.4 PPO Preference Dataset
+The Improved Model uses:
+```text
+UM_Handbook_PPO_Preference_Dataset.jsonl
+```
+The final PPO run uses the full dataset:
+```text
+Total PPO preference rows: 1000
+Train rows: 900
+Validation rows: 100
+Train fraction: 0.90
+```
+The PPO dataset is not used like normal SFT data. In SFT, the model directly imitates a reference answer. In PPO, the model generates its own answer, receives a reward, and updates toward higher-reward behavior.
+---
+# 4. Baseline 1 — Closed-book SFT Qwen3-8B
+## 4.1 Purpose
+Baseline 1 asks a simple question:
+> Can Qwen3-8B learn UM Handbook question answering from supervised fine-tuning alone?
+This is a **closed-book baseline**. The model does not retrieve handbook evidence during inference. It must answer from what it learned during SFT.
+This is useful as a control baseline because it shows what happens when the model relies mainly on parameter memory.
+---
+## 4.2 Model
+Baseline 1 uses:
+```text
+Base model: Qwen/Qwen3-8B
+Local path: /scr/user/kevin2002/TensorCat/NLP/UM_Handbook/models/Qwen3-8B
+```
+The notebook detected:
+```text
+Backend: CUDA
+GPU: NVIDIA A100-SXM4-80GB
+dtype: bfloat16
+4-bit QLoRA: enabled
+```
+---
+## 4.3 Training Method
+Baseline 1 uses LoRA / QLoRA supervised fine-tuning.
+LoRA configuration:
+```text
+LoRA rank: 16
+LoRA alpha: 32
+LoRA dropout: 0.10
+Target modules:
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+  - gate_proj
+  - up_proj
+  - down_proj
+```
+Training configuration:
+```text
+Epochs: 8
+Train split: 800
+Validation split: 100
+Test split: 100
+Per-device train batch size: 2
+Per-device eval batch size: 2
+Gradient accumulation steps: 8
+Learning rate: 1e-4
+Packing: False
+```
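As a sanity check on the configuration above, the number of optimizer updates follows from the effective batch size. The sketch below assumes a single GPU (`num_devices=1` is an assumption, as is the helper name `optimizer_steps`); under that assumption it reproduces the 400 training steps reported in the results.

```python
import math


def optimizer_steps(train_rows, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, total optimizer updates) for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_rows / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 800 train rows, batch 2, accumulation 8, 8 epochs
print(optimizer_steps(800, 2, 8, 8))  # (16, 400)
```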
+---
+## 4.4 Baseline 1 Results
+The training completed successfully.
+Training summary:
+```text
+Training steps: 400
+Training runtime: ~18.68 minutes for the main train stage
+Train loss: 0.4824
+Final validation loss: ~0.146
+Test loss: ~0.197
+Perplexity: ~1.157
+```
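Perplexity is conventionally the exponential of the mean token-level cross-entropy loss. Assuming the notebook follows that convention, the reported perplexity of about 1.157 is consistent with the validation loss rather than the test loss:

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Perplexity as exp of the mean cross-entropy loss (natural log base)."""
    return math.exp(mean_token_loss)


print(round(perplexity(0.146), 3))  # 1.157 (validation loss)
print(round(perplexity(0.197), 3))  # 1.218 (test loss)
```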
+Generation metrics:
+### Validation
+```text
+Exact match: 0.77
+Token F1: 0.9111
+ROUGE-1: 0.9122
+ROUGE-2: 0.8700
+ROUGE-L: 0.8979
+SacreBLEU: 81.7240
+chrF++: 86.8916
+Average prediction words: 36.35
+Average reference words: 38.57
+```
+### Test
+```text
+Exact match: 0.72
+Token F1: 0.8869
+ROUGE-1: 0.8857
+ROUGE-2: 0.8352
+ROUGE-L: 0.8677
+SacreBLEU: 81.1138
+chrF++: 87.7054
+Average prediction words: 38.03
+Average reference words: 37.03
+```
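For readers unfamiliar with the exact match and token F1 metrics, the following is a minimal sketch of a common SQuAD-style formulation. The normalization rules (lowercasing, dropping punctuation and articles) are an assumption; the notebook's own implementation may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, strip punctuation and articles, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Neutral toy strings, not handbook content.
print(exact_match("The cat sat on the mat.", "a cat sat on a mat"))  # 1.0
print(round(token_f1("cat sat", "the cat sat on the mat"), 4))       # 0.6667
```

Token F1 gives partial credit for overlapping wording, which is why it can stay high even when exact match is well below 1.0.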
+---
+## 4.5 Baseline 1 Strengths
+Baseline 1 is strong when the question is close to the training distribution. It can reproduce handbook-style answers well and shows high text overlap with the reference answers.
+It is useful because:
+- it establishes the basic Qwen3-8B SFT capability
+- it verifies that the dataset format is learnable
+- it creates a clean closed-book control model
+- it provides a baseline for later RAG and PPO improvements
+---
+## 4.6 Baseline 1 Limitations
+Baseline 1 is still limited because it is a closed-book model.
+Main limitations:
+1. **No retrieval evidence**
+   It cannot check the handbook at inference time.
+2. **Potential hallucination**
+   If the question is out-of-distribution or requires exact source grounding, the model may answer from memory.
+3. **Scope confusion**
+   Undergraduate and postgraduate rules may be mixed if the question is ambiguous.
+4. **No official web update mechanism**
+   It cannot answer dynamic or latest-information questions reliably.
+5. **No harness guardrails**
+   It does not include fake URL detection, evidence judging, WAF handling, or fallback control.
+Baseline 1 is therefore a necessary but incomplete starting point.
+---
+# 5. Baseline 2 — RAG + SFT + Metadata-aware Retrieval + Harness Agent
+## 5.1 Purpose
+Baseline 2 asks:
+> What improves if we keep the same Qwen3-8B family but add retrieval-grounded evidence?
+The goal is to reduce hallucination and scope confusion by giving the model relevant handbook evidence at inference time.
+This stage introduces RAG and agentic harness logic while keeping the same broad model family and handbook task.
 ---
+## 5.2 What RAG Means in This Project
+RAG stands for **Retrieval-Augmented Generation**.
+In simple terms:
+```text
+Instead of asking the model to answer only from memory,
+the system first retrieves relevant handbook chunks,
+then asks the model to answer using those chunks.
+```
+In this project, RAG is not just keyword search. It uses:
+```text
+Transformer embedding model
++ FAISS vector search
++ metadata-aware reranking
++ scope labels
++ top-k evidence blocks
+```
+The Baseline 2 retriever uses:
+```text
+Embedding model: BAAI/bge-base-en-v1.5
+Vector index: FAISS
+Similarity: inner product after embedding normalization
+Top-k retrieval: 3
+```
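"Inner product after embedding normalization" is equivalent to cosine similarity. The sketch below shows the idea with a brute-force search in plain Python; it deliberately swaps FAISS and the BGE embedding model for toy 2-D vectors, so the ids and vectors are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=3):
    """Brute-force inner-product search over L2-normalized vectors."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)
    return [(doc_id, round(score, 4)) for score, doc_id in scored[:k]]


# Toy vectors; real vectors would come from the embedding model.
docs = {"kb_001": [1.0, 0.0], "kb_002": [0.6, 0.8], "kb_003": [0.0, 2.0]}
print(top_k([2.0, 0.0], docs, k=2))  # [('kb_001', 1.0), ('kb_002', 0.6)]
```

Because every vector is scaled to unit length first, vector magnitude does not affect the ranking; only direction does.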
 ---
+## 5.3 Metadata-aware Retrieval
+The RAG system uses metadata to control retrieval quality.
+Important metadata fields include:
+```text
+source_doc
+scope_label
+section
+pages
+kb_id
+knowledge group
+retrieval keywords
+grounded answer bank
+```
+This allows the retriever to prefer the correct audience scope.
+Example:
+```text
+Question: What are the candidature requirements for Master of Software Engineering?
+Expected scope: postgraduate
+```
+The system should retrieve postgraduate chunks, not undergraduate chunks.
+This is one of the main improvements over Baseline 1.
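One simple way to prefer the correct audience scope is to add a bonus to the dense score when a chunk's `scope_label` matches the scope inferred from the question. The sketch below is only an illustration: the keyword hints, the `scope_bonus` value, the candidate ids, and the scores are all hypothetical and are not the project's actual reranking rules.

```python
SCOPE_HINTS = {  # hypothetical keyword hints, not the project's actual rules
    "postgraduate": ("master", "phd", "doctor", "postgraduate", "candidature"),
    "undergraduate": ("bachelor", "undergraduate"),
}


def infer_scope(question: str) -> str:
    """Guess the audience scope from keywords; default to 'general'."""
    q = question.lower()
    for scope, hints in SCOPE_HINTS.items():
        if any(h in q for h in hints):
            return scope
    return "general"


def rerank(question, candidates, scope_bonus=0.2):
    """Sort candidates by dense score plus a bonus for a matching scope_label."""
    scope = infer_scope(question)

    def adjusted(c):
        return c["score"] + (scope_bonus if c["scope_label"] == scope else 0.0)

    return sorted(candidates, key=adjusted, reverse=True)


# Made-up candidates: the undergraduate chunk has the higher raw score.
cands = [
    {"kb_id": "ug_12", "scope_label": "undergraduate", "score": 0.71},
    {"kb_id": "pg_07", "scope_label": "postgraduate", "score": 0.68},
]
q = "What are the candidature requirements for Master of Software Engineering?"
print([c["kb_id"] for c in rerank(q, cands)])  # ['pg_07', 'ug_12']
```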
 ---
+## 5.4 RAG-augmented Training Dataset
+Baseline 2 creates a RAG-augmented dataset where training examples include evidence context.
+The training prompt can contain:
 ```text
+User question
++ retrieved handbook evidence
++ source metadata
++ answer instruction
 ```
+This teaches the model to answer with evidence-aware context rather than only memorized answers.
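A prompt of this shape could be assembled as follows. The field names (`source_doc`, `scope_label`, `section`, `pages`, `source_text`) come from the knowledge base schema listed earlier; the instruction wording, the layout, and the function name `build_rag_prompt` are assumptions for illustration.

```python
def build_rag_prompt(question: str, chunks: list) -> str:
    """Combine retrieved chunks, their metadata, and the question into one prompt."""
    evidence = []
    for i, c in enumerate(chunks, start=1):
        header = (f"[{i}] {c['source_doc']} | {c['scope_label']} | "
                  f"{c['section']} | pages {c['pages']}")
        evidence.append(f"{header}\n{c['source_text']}")
    return (
        "Answer using only the handbook evidence below. "
        "If the evidence is insufficient, say so.\n\n"
        "Evidence:\n" + "\n\n".join(evidence)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder chunk; the values are not real handbook content.
chunk = {"source_doc": "handbook.pdf", "scope_label": "postgraduate",
         "section": "Candidature", "pages": "12-13",
         "source_text": "placeholder evidence text"}
print(build_rag_prompt("What are the candidature requirements?", [chunk]))
```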
 ---
+## 5.5 Baseline 2 Training Configuration
+Baseline 2 uses Qwen3-8B with LoRA fine-tuning.
+Configuration:
+```text
+Base model: Qwen/Qwen3-8B
+Embedding model: BAAI/bge-base-en-v1.5
+LoRA rank: 8
+LoRA alpha: 16
+LoRA dropout: 0.05
+Target modules:
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+  - gate_proj
+  - up_proj
+  - down_proj
+Epochs: 20
+Per-device train batch size: 4
+Per-device eval batch size: 8
+Target global batch size: 8
+Learning rate: 8e-5
+Max sequence length: 1024
+Validation ratio: 0.10
+Test ratio: 0.10
+Save merged model: False
+Runtime model path: base model + LoRA adapter
+```
+The notebook uses a safer non-merged runtime path when merged model export is unavailable or memory-expensive.
+---
+## 5.6 Baseline 2 Retrieval Evaluation
+Baseline 2 includes a retrieval evaluation set.
+Retrieval metrics:
+```json
+{
+  "retrieval_eval_size": 1000,
+  "top_k": 3,
+  "hit_at_1_primary": 0.821,
+  "hit_at_k_primary": 0.954,
+  "hit_at_k_same_group": 0.991,
+  "scope_match_at_1": 0.996,
+  "retriever_type": "dense_embedding + faiss + metadata_rerank",
+  "embedding_model_name": "BAAI/bge-base-en-v1.5"
+}
+```
+Interpretation:
+- `hit_at_1_primary = 0.821` means the top retrieved chunk is exactly the expected primary evidence in 82.1% of cases.
+- `hit_at_k_primary = 0.954` means the correct primary evidence appears within top-3 in 95.4% of cases.
+- `hit_at_k_same_group = 0.991` means an acceptable evidence chunk from the same knowledge group appears in the top 3 in 99.1% of cases.
+- `scope_match_at_1 = 0.996` means the top result almost always matches the correct undergraduate/postgraduate/general scope.
+This confirms that the RAG system is not random retrieval. It is a strong metadata-aware retrieval baseline.
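The hit-rate metrics above can be computed with a short helper. This is a minimal sketch with toy ids; `hit_at_k` is an illustrative name, and the project's evaluation code may track primary and same-group hits differently.

```python
def hit_at_k(retrieved_ids, expected_ids, k):
    """Fraction of queries whose expected id appears in the top-k retrieved ids."""
    hits = sum(exp in ids[:k] for ids, exp in zip(retrieved_ids, expected_ids))
    return hits / len(expected_ids)


# Four toy queries, each with a ranked list of three retrieved ids.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
expected = ["a", "e", "z", "l"]
print(hit_at_k(retrieved, expected, k=1))  # 0.25
print(hit_at_k(retrieved, expected, k=3))  # 0.75
```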
 ---
+## 5.7 Baseline 2 Generation Evaluation
+Generation evaluation was run on a smaller selected set for runtime practicality.
+Results:
+```json
+{
+  "generation_eval_size": 20,
+  "top_k": 3,
+  "plain_exact_match": 0.0,
+  "plain_token_f1": 0.3391,
+  "rag_exact_match": 0.0,
+  "rag_token_f1": 0.8460,
+  "rag_minus_plain_exact_match": 0.0,
+  "rag_minus_plain_token_f1": 0.5069
+}
+```
+This shows a large improvement from RAG:
+```text
+Plain token F1: 0.3391
+RAG token F1:   0.8460
+Improvement:    +0.5069
+```
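+This is one of the strongest pieces of evidence in the project, although the 20-example sample is small and exact match stays at 0.0 in both settings.
+It shows that retrieval grounding substantially improves token-level answer quality compared with plain generation.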
+---
+# 6. Agent Layer in Baseline 2 and Improved Model
+## 6.1 Why an Agent Is Needed
+The handbook is reliable for stable academic rules, but some questions may require official web information.
+Examples:
+```text
+Who is the current dean?
+Where can students find residential college information?
+What official page mentions PEKOM?
+Where is the official SPeCTRUM page?
+```
+For these cases, the system needs a controlled web agent.
+However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent.
 ---
+## 6.2 Official UM / FSKTM Web Agent
+The web agent is constrained to official UM / FSKTM domains.
+Priority domains:
+```text
+fsktm.um.edu.my
+www.um.edu.my
+```
+Auxiliary official domains include UM-related systems such as:
+```text
+aasd.um.edu.my
+maya.um.edu.my
+umlib.um.edu.my
+umresearch.um.edu.my
+jobs.um.edu.my
+careerportal.fsktm.um.edu.my
+intra.fsktm.um.edu.my
+gallery.fsktm.um.edu.my
+```
+The agent performs:
+```text
+query planning
+official web discovery
+URL filtering
+page fetching
+evidence extraction
+evidence scoring
+Qwen-based evidence judging
+retry if weak
+fallback to handbook RAG if needed
+```
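The URL filtering step can be illustrated with an allowlist check on the parsed hostname. The domain lists are copied from above; exact hostname matching (no wildcard subdomains) and the function name `domain_tier` are assumptions, and the project's guard may be more permissive or stricter.

```python
from urllib.parse import urlparse

PRIORITY_DOMAINS = {"fsktm.um.edu.my", "www.um.edu.my"}
AUXILIARY_DOMAINS = {
    "aasd.um.edu.my", "maya.um.edu.my", "umlib.um.edu.my", "umresearch.um.edu.my",
    "jobs.um.edu.my", "careerportal.fsktm.um.edu.my", "intra.fsktm.um.edu.my",
    "gallery.fsktm.um.edu.my",
}


def domain_tier(url: str):
    """Return 'priority', 'auxiliary', or None for URLs outside the allowlist."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https"):
        return None
    if host in PRIORITY_DOMAINS:
        return "priority"
    if host in AUXILIARY_DOMAINS:
        return "auxiliary"
    return None


print(domain_tier("https://fsktm.um.edu.my/"))           # priority
print(domain_tier("https://maya.um.edu.my/"))            # auxiliary
print(domain_tier("https://fsktm.um.edu.my.evil.com/"))  # None
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting look-alike hosts that merely contain an official domain name.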
 ---
+## 6.3 Agent Is Not Fully Autonomous by Design
+This project does not use a completely unrestricted autonomous agent.
+That is intentional.
+For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be:
+```text
+safe
+source-constrained
+traceable
+fallback-aware
+grounded
+```
+So the agent is better described as:
+> A constrained official-source web agent controlled by Harness Engineering.
 ---
+# 7. Harness Engineering
+## 7.1 What Harness Engineering Means Here
+Harness Engineering is the guardrail system around the model and agent.
+A simple analogy:
+```text
+The LLM/agent is the car.
+Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard.
+```
+The model can generate fluent answers, but the harness controls:
+- what it is allowed to search
+- what sources it can trust
+- whether a URL is fake
+- whether evidence is useful
+- whether the answer is grounded
+- whether the system should retry
+- whether it should fall back to local handbook RAG
+- what trace should be shown to the user
 ---
+## 7.2 Harness Pipeline
+The standardized TensorTalk Harness Core follows this structure:
+```text
+User Question
+    ↓
+Local Handbook RAG
+    ↓
+Official Web Discovery
+    ↓
+Domain Guard
+    ↓
+Fake URL Guard
+    ↓
+WAF Detection
+    ↓
+Evidence Normalizer
+    ↓
+Qwen Evidence Judge
+    ↓
+Entity-aware Retry
+    ↓
+Weak Evidence Fallback
+    ↓
+Answer Generator
+    ↓
+Answer Grounding Judge
+    ↓
+Completeness Guard
+    ↓
+UI Trace
+```
+---
+## 7.3 Harness Components
+The notebooks include several engineering patches and layers:
+### V14 — WAF-aware Harness
+Handles web pages blocked by WAF or browser failures.
+Functions:
+- detect WAF block pages
+- exclude blocked pages from evidence
+- provide diagnostics
+- use safe static fallback if browser click fails
+- reject query-fabricated URLs before evidence building
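One plausible form of a fake URL check is to flag any URL in the generated answer that was never observed in fetched evidence. The sketch below is hypothetical: the regular expression, the trailing-slash normalization, and the function name `unsupported_urls` are assumptions, and the second URL in the example is deliberately made up.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def unsupported_urls(answer: str, evidence_urls: set) -> list:
    """Return URLs cited in the answer that were never seen in fetched evidence."""
    seen = {u.rstrip("/") for u in evidence_urls}
    cited = (u.rstrip(".,;").rstrip("/") for u in URL_PATTERN.findall(answer))
    return [u for u in cited if u not in seen]


evidence = {"https://fsktm.um.edu.my/"}
# The second URL is synthetic and stands in for a model-invented link.
answer = "See https://fsktm.um.edu.my and https://fsktm.um.edu.my/made-up-page."
print(unsupported_urls(answer, evidence))  # ['https://fsktm.um.edu.my/made-up-page']
```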
 ---
+### V15 — Qwen Evidence Judge Loop
+Adds an LLM-based evidence judge.
+Flow:
+```text
+Planner
+→ Search / Fetch
+→ Evidence Filter
+→ Qwen Judge
+→ Retry
+→ Final Evidence
+```
+The purpose is to avoid trusting weak web snippets blindly.
 ---
+### V16 — Local-aware Judge Repair
+Improves routing and fallback.
+It handles:
+```text
+PEKOM routing
+CCNA Lab routing
+residential college routing
+local RAG fallback
+entity-aware retry
+fake URL rejection
+```
 ---
+### V17 — Strict Entity Judge and UI Polish
+Adds stricter entity matching and improves trace display.
+This helps avoid cases where a query about one entity is answered with another related but wrong page.
+---
+### V18 — Balanced Official Reference Fallback
+Allows the system to still provide official references when the available web evidence is not strong enough on its own, while avoiding over-trusting weak pages.
+---
+### V19 — Answer Grounding Judge
+Checks whether the final generated answer is actually supported by evidence.
+This is important because even if retrieval is correct, the model may still introduce unsupported details.
+---
+### Completeness Guard
+Checks whether the answer is incomplete and, if so, whether a rewrite or fallback should be triggered.
+---
+# 8. Improved Model — PPO Rule-reward Post-training + RAG + Agent + Harness
+## 8.1 Purpose
+The Improved Model asks:
+> Can we further improve the model’s behavior after SFT/RAG by using PPO reward-based post-training?
+Baseline 2 already improves factual grounding through RAG and Harness Engineering. The Improved Model adds PPO to shape the model’s behavior.
+The goal is not to replace RAG. The goal is to make the model more aligned with the desired answer style and safety behavior.
+---
+## 8.2 What PPO Means in This Project
+PPO stands for **Proximal Policy Optimization**.
+In simple terms:
+```text
+SFT teaches the model by imitation.
+PPO lets the model generate answers, scores them with a reward function, and updates the model toward higher-reward answers.
+```
+In this project:
+```text
+Actor model: Qwen3-8B + LoRA
+Critic/value head: TRL value head model
+Reference model: frozen Qwen3-8B reference
+Reward: rule-based preference reward function
+KL control: used to avoid drifting too far from the reference model
+```
+---
+## 8.3 Rule-based Reward Function
+This project uses a rule-based reward function rather than a separately trained neural reward model.
+The reward function evaluates:
+```text
+gold answer similarity
+rejected answer penalty
+evidence overlap
+scope correctness
+hallucinated URL penalty
+vague answer penalty
+process/thinking leakage penalty
+direct answer bonus
+repetition penalty
+degeneration/collapse penalty
+```
+This is why the model card should describe the final stage as:
+> Rule-reward PPO post-training
+not:
+> Full RLHF with a trained reward model
+The reward model type recorded in the notebook is:
+```text
+rule_based_preference_reward_function
+uses_separate_neural_reward_model: False
+```
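To make the idea of a rule-based reward concrete, here is a heavily simplified sketch covering four of the listed components. The weights, the marker phrases, the `chosen`/`rejected` argument names, and the overlap measure are all assumptions for illustration; the notebook's actual reward function is not reproduced here.

```python
import re

PROCESS_MARKERS = ("okay, let me", "wait, i need", "let me think")  # illustrative


def overlap(a: str, b: str) -> float:
    """Fraction of tokens in `a` that also appear in `b`."""
    ta, tb = a.lower().split(), set(b.lower().split())
    return sum(t in tb for t in ta) / len(ta) if ta else 0.0


def rule_reward(response, chosen, rejected, evidence_urls=frozenset()):
    """Score a response with hand-written rules (hypothetical weights)."""
    reward = overlap(response, chosen)            # gold answer similarity
    reward -= 0.5 * overlap(response, rejected)   # rejected answer penalty
    urls = re.findall(r"https?://\S+", response)
    if any(u.rstrip(".,") not in evidence_urls for u in urls):
        reward -= 1.0                             # hallucinated URL penalty
    if any(m in response.lower() for m in PROCESS_MARKERS):
        reward -= 0.5                             # process/thinking leakage penalty
    return reward


print(rule_reward("a b", "a b", "c d"))                         # 1.0
print(rule_reward("Okay, let me think a b", "a b", "c d") < 0)  # True
```

A rule-based reward like this is cheap and transparent, but it only rewards what the rules can measure, which is one reason the runtime harness remains in place after PPO.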
+---
+## 8.4 PPO Training Configuration
+The final PPO run uses:
+```text
+Preference dataset rows: 1000
+Train rows: 900
+Validation rows: 100
+MAX_PPO_ROWS: None
+Train fraction: 0.90
+PPO epochs: 2
+Batch size: 2
+Mini-batch size: 1
+Max new tokens: 72
+Max PPO steps per epoch: None
+Planned steps per epoch: 450
+Total planned steps: 900
+Learning rate: 2e-6
+Target KL: 0.10
+Generation temperature: 0.45
+Top-p: 0.78
+Repetition penalty: 1.3
+No-repeat ngram size: 4
+```
+The run completed successfully:
+```text
+Global PPO steps: 900 / 900
+Elapsed time: 04:47:59
+Degenerate ratio: 0.00%
+```
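The planned step count and the average time per step follow directly from the numbers above. This is a small arithmetic check; the helper names are illustrative.

```python
def planned_ppo_steps(train_rows, batch_size, epochs):
    """Return (steps per epoch, total steps) when every row is used once per epoch."""
    steps_per_epoch = train_rows // batch_size
    return steps_per_epoch, steps_per_epoch * epochs


def seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


per_epoch, total = planned_ppo_steps(900, 2, 2)
print(per_epoch, total)                        # 450 900
print(round(seconds("04:47:59") / total, 1))   # 19.2 seconds per step on average
```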
+---
+## 8.5 PPO Artifact Verification
+The Stage 3 notebook includes strict artifact verification.
+This is important because PPO notebooks can easily appear to run while silently saving old or incomplete artifacts.
+The strict save cell verifies:
+```text
+training_log exists
+training_log records = 900
+expected steps = 900
+MAX_PPO_ROWS = None
+train rows = 900
+valid rows = 100
+NUM_PPO_EPOCHS = 2
+MAX_PPO_STEPS_PER_EPOCH = None
+parameter hash changed after PPO
+PPO inference full actor exists
+PPO LoRA adapter exists
+non-PPO fallback forbidden
+```
+The final strict save output confirms:
+```text
+Final PPO records saved: 900 / expected 900
+Strict full PPO artifact contract passed.
+```
+The parameter change proof confirms:
+```text
+aggregate_hash_changed: true
+changed_trainable_tensors: 506
+unchanged_trainable_tensors: 0
+```
+This proves that PPO training changed the trainable LoRA/value-head parameters rather than merely running a dry notebook.
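A parameter-change proof of this kind can be sketched by hashing the trainable tensors before and after training. The example below assumes each tensor has already been serialized to bytes (how that is done for the real model is left out); `aggregate_hash` and `changed_tensors` are hypothetical helper names, and the tensor names are placeholders.

```python
import hashlib


def aggregate_hash(tensors: dict) -> str:
    """SHA-256 over all (name, bytes) pairs in a stable, sorted order."""
    digest = hashlib.sha256()
    for name in sorted(tensors):
        digest.update(name.encode("utf-8"))
        digest.update(tensors[name])
    return digest.hexdigest()


def changed_tensors(before: dict, after: dict) -> list:
    """Names of tensors whose serialized bytes differ after training."""
    return [n for n in sorted(before) if before[n] != after.get(n)]


# Placeholder snapshots standing in for serialized LoRA weights.
before = {"lora_A": b"\x00\x01", "lora_B": b"\x02\x03"}
after = {"lora_A": b"\x00\x09", "lora_B": b"\x02\x03"}
print(aggregate_hash(before) != aggregate_hash(after))  # True
print(changed_tensors(before, after))                   # ['lora_A']
```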
+---
+## 8.6 Strict PPO-only Runtime
+The final runtime is configured so that the UI must use PPO artifacts only.
+The strict PPO gate confirms:
+```text
+PPO records: 900
+PPO full actor usable: True
+PPO LoRA adapter usable: True
+Strict PPO-only UI mode: True
+```
+The runtime loading order is:
+```text
+1. PPO full inference actor if full weights exist
+2. Otherwise base Qwen3-8B + PPO LoRA adapter
+3. Non-PPO fallback is forbidden
+```
+This prevents the final demo from accidentally loading an old Baseline 2 model or a stale 150-step PPO proof artifact.
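The loading order can be expressed as a small resolver that raises instead of silently falling back. This is a sketch under assumptions: the directory arguments are hypothetical, full weights are assumed to be stored as `*.safetensors` files, and the adapter is detected by the `adapter_config.json` file that PEFT writes when saving an adapter.

```python
from pathlib import Path


def resolve_ppo_artifact(full_actor_dir, lora_adapter_dir):
    """Pick the PPO artifact to load, or fail loudly if none exists."""
    full_dir, lora_dir = Path(full_actor_dir), Path(lora_adapter_dir)
    # 1. Prefer the PPO full inference actor when weight files are present.
    if full_dir.is_dir() and any(full_dir.glob("*.safetensors")):
        return "ppo_full_actor", full_dir
    # 2. Otherwise use the base model plus the PPO LoRA adapter.
    if (lora_dir / "adapter_config.json").is_file():
        return "base_plus_ppo_lora", lora_dir
    # 3. Never fall back to a non-PPO model.
    raise FileNotFoundError("No PPO artifact found; non-PPO fallback is forbidden.")
```

Raising an exception in the last branch is the point of the design: a missing artifact stops the demo instead of quietly loading an older model.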
+---
+## 8.7 PPO Validation
+The PPO-only validation evaluation uses a held-out validation sample.
+The displayed validation summary is:
+```text
+reward:            0.477789
+gold_overlap:      0.255351
+rejected_overlap:  0.155080
+```
+Interpretation:
+- reward is positive
+- gold overlap is higher than rejected overlap
+- the PPO-trained actor tends to move closer to preferred answers than rejected answers
+This does not mean the PPO model is perfect. It means the reward-shaped behavior is directionally positive.
+---
+## 8.8 PPO Limitations
+The PPO run is successful, but the raw PPO generations still show some imperfections.
+Observed issues include:
+1. **Process leakage**
+   Some outputs still include phrases like:
+   ```text
+   Okay, let me try to figure out...
+   Wait, I need to check again...
+   ```
+   The reward function penalizes this, but it is not completely eliminated.
+2. **Occasional hallucinated URLs**
+   Some raw generations may still invent URLs. The harness fake URL guard is therefore still necessary.
+3. **OCR-style text artifacts**
+   Some source chunks contain spacing or OCR issues, and the model may reproduce them.
+4. **KL can be high**
+   Some PPO logs show high `objective/kl`, meaning the PPO actor can drift noticeably from the reference model. However, the run completed with:
+   ```text
+   degenerate_ratio = 0.00%
+   ```
+   and no detected repetition collapse.
+5. **RAG/Harness remains necessary**
+   PPO improves model behavior, but it does not replace retrieval grounding or guardrails.
+---
+# 9. TensorTalk UI
+The project includes a WhatsApp-style Jupyter HTML UI called **TensorTalk**.
+The UI supports:
+- chat-style interface
+- TensorCat avatar
+- RAG on/off control
+- web agent on/off control
+- collapsed trace panels
+- retrieved evidence display
+- web evidence display
+- planning/thinking display layer
+- harness decision trace
+- answer grounding information
+- strict PPO artifact loading
+- new chat reset behavior
+The UI is part of the engineering contribution because it makes the harness process visible rather than hidden.
+---
+# 10. Smoke Tests
+## 10.1 What Smoke Test Means Here
+A smoke test is a lightweight system sanity check.
+It is not a full evaluation. It is a quick check that the main pipeline still works.
+In this project, smoke tests check whether:
+```text
+PPO model loads
+RAG retrieves evidence
+web agent searches official sources
+fake URL guard blocks synthetic links
+answer grounding returns a result
+trace structure is produced
+fallback behavior still works
+```
+---
+## 10.2 Example Smoke Tests
+The notebook defines smoke tests such as:
+```text
+1. PEKOM should not be routed to AI bachelor page
+2. Residential college should prefer student-affairs residential page
+3. CCNA Lab should not invent synthetic URLs
+```
+These are not random examples. They are chosen to test known fragile parts of the pipeline:
+- entity routing
+- official URL preference
+- fake URL rejection
+- web/RAG trace structure
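The fake URL rejection check can be sketched as a small guard that flags any URL in an answer that is either off the official domain or not present in the retrieved evidence. This is an illustrative sketch only: the allowlist below is an assumption (the notebook's actual whitelist is not shown here), and `find_fake_urls` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

# Assumption: illustrative allowlist; the real whitelist may contain more entries.
OFFICIAL_DOMAINS = ("um.edu.my",)

URL_PATTERN = re.compile(r"https?://\S+")


def is_official(url: str) -> bool:
    """True if the host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in OFFICIAL_DOMAINS)


def find_fake_urls(answer: str, evidence_urls: set[str]) -> list[str]:
    """Return URLs in the answer that are off-domain or absent from the evidence."""
    fake = []
    for raw in URL_PATTERN.findall(answer):
        url = raw.rstrip(".,;:)]\"'")  # drop trailing sentence punctuation
        if not is_official(url) or url not in evidence_urls:
            fake.append(url)
    return fake
```

A smoke test in the spirit of "CCNA Lab should not invent synthetic URLs" would then assert that `find_fake_urls(answer, evidence_urls)` returns an empty list for the final answer. Requiring the URL to appear in the evidence, and not only on an official domain, is what catches plausible-looking but invented paths.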
+---
+# 11. Control Variable Design
+The project uses a control-variable style comparison.
+The base task remains the same:
+```text
+UM FSKTM Handbook QA
+```
+The base model family remains the same:
+```text
+Qwen3-8B
+```
+The dataset domain remains the same:
+```text
+Undergraduate + postgraduate + general UM Handbook knowledge
+```
+What changes is the system layer:
+```text
+Baseline 1: SFT only
+Baseline 2: SFT + RAG + Harness Agent
+Improved: SFT/RAG/Harness + PPO post-training
+```
+This makes it possible to attribute improvements to individual factors:
+- parameter learning
+- retrieval grounding
+- metadata-aware scope control
+- official web augmentation
+- harness guardrails
+- PPO reward shaping
+This is more rigorous than simply building three unrelated systems.
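The control-variable idea can be made concrete by describing each stage as a configuration and listing which fields differ between two stages. The sketch below is a simplification: the class, field names, and boolean flags are hypothetical, and only the stage composition is taken from the text above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StageConfig:
    base_model: str
    task: str
    sft: bool
    rag: bool
    harness: bool
    ppo: bool


TASK = "UM FSKTM Handbook QA"
STAGES = {
    "Baseline 1": StageConfig("Qwen3-8B", TASK, sft=True, rag=False, harness=False, ppo=False),
    "Baseline 2": StageConfig("Qwen3-8B", TASK, sft=True, rag=True, harness=True, ppo=False),
    "Improved": StageConfig("Qwen3-8B", TASK, sft=True, rag=True, harness=True, ppo=True),
}


def changed_factors(a: str, b: str) -> list[str]:
    """Names of the fields that differ between two stages."""
    left, right = asdict(STAGES[a]), asdict(STAGES[b])
    return [key for key in left if left[key] != right[key]]
```

Calling `changed_factors("Baseline 2", "Improved")` isolates the PPO step as the only system-level change. The flags do not capture hyperparameters such as LoRA rank or epoch count, which also differ between stages according to the comparison table.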
+---
+# 12. Stage-by-stage Comparison Table
+| Dimension | Baseline 1: Closed-book SFT | Baseline 2: RAG + SFT + Agent/Harness | Improved Model: PPO + RAG + Agent/Harness |
+|---|---|---|---|
+| Main research question | Can the model memorize and reproduce handbook QA from SFT? | Does retrieval-grounded evidence improve handbook QA? | Can rule-reward PPO further align answer behavior while keeping RAG/Harness control? |
+| Base model | Qwen3-8B | Qwen3-8B | Qwen3-8B |
+| Main training method | Supervised fine-tuning | RAG-augmented supervised fine-tuning | Rule-reward PPO post-training |
+| Dataset used | 1000 SFT QA rows | SFT QA + metadata + RAG KB + RAG eval | 1000 PPO preference rows |
+| Train/validation/test | 800 / 100 / 100 | 8:1:1 RAG-augmented split | 900 train / 100 validation |
+| Retrieval | No | Yes | Yes |
+| Retrieval type | None | Dense embedding + FAISS + metadata-aware rerank | Same RAG runtime reused |
+| Embedding model | None | BAAI/bge-base-en-v1.5 | RAG runtime inherited from Baseline 2 |
+| Top-k evidence | None | Top-3 | Top-3 / runtime-dependent |
+| Metadata awareness | Hidden metadata only, not used at inference | Yes, scope/source/section aware | Yes, used by RAG/Harness runtime |
+| Scope control | Weak; model may confuse UG/PG if prompt is ambiguous | Stronger due to metadata-aware retrieval | Stronger due to RAG + PPO reward + harness |
+| Web agent | No | Yes | Yes |
+| Official domain control | No | Yes, UM/FSKTM official domain whitelist | Yes, same official-source guardrails |
+| Fake URL guard | No | Yes | Yes |
+| WAF handling | No | Yes | Yes |
+| Evidence judge | No | Yes, Qwen evidence judge | Yes |
+| Retry/fallback policy | No | Yes | Yes |
+| Answer grounding judge | No | Yes | Yes |
+| Completeness guard | No | Yes | Yes |
+| UI trace | Basic chat UI | Harness trace panels | Strict PPO + Harness trace panels |
+| LoRA rank | 16 | 8 | PPO actor based on LoRA actor/value setup |
+| Training epochs | 8 SFT epochs | 20 SFT epochs | 2 PPO epochs |
+| Main output artifact | LoRA adapter + merged model + `.pt` export | LoRA adapter, optional non-merged runtime | PPO full inference actor + PPO LoRA adapter + manifest |
+| Artifact strictness | Standard save | Adapter/runtime path checks | Manifest, training log count, parameter hash proof, strict gate |
+| Key metric | Test token F1 ≈ 0.8869 | RAG token F1 ≈ 0.846 on selected eval; retrieval Hit@3 ≈ 0.954 | PPO validation reward ≈ 0.4778; gold overlap > rejected overlap |
+| Strongest contribution | Clean SFT baseline | Evidence-grounded QA and metadata-aware retrieval | Full PPO post-training with strict artifact verification and harnessed runtime |
+| Main weakness | Closed-book hallucination risk | More complex runtime, depends on retriever quality | PPO raw outputs still need Harness/RAG due to possible process leakage and fake URLs |
+| Control variable role | Establishes parameter-only baseline | Adds retrieval and harness while keeping same domain/model family | Adds PPO reward shaping while preserving RAG/Harness pipeline |
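The table reports retrieval Hit@3 for Baseline 2. Assuming the usual definition (a query counts as a hit when its gold chunk appears among the top-k retrieved chunks), the metric can be sketched as follows. The chunk-ID representation is an assumption; the notebook's evaluation code is not shown here.

```python
def hit_at_k(retrieved: list[list[str]], gold: list[str], k: int = 3) -> float:
    """Fraction of queries whose gold chunk ID appears in the top-k retrieved IDs."""
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have the same length")
    if not gold:
        return 0.0
    # ranked[:k] keeps only the k highest-ranked chunk IDs for each query.
    hits = sum(g in ranked[:k] for ranked, g in zip(retrieved, gold))
    return hits / len(gold)
```

Under this definition, a Hit@3 of about 0.954 would mean that roughly 95 of every 100 evaluation queries had the gold chunk within the three passages passed to the model.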
+---
+# 13. Technical Comparison of the Three Stages
+## 13.1 Content-level Difference
+| Content Aspect | Baseline 1 | Baseline 2 | Improved Model |
+|---|---|---|---|
+| Stable handbook facts | Learned into model parameters | Retrieved from handbook KB | Retrieved and answered by PPO-aligned actor |
+| Latest or official web info | Not supported | Supported through official web agent | Supported through same official web agent |
+| UG vs PG distinction | Learned implicitly | Controlled by metadata retrieval | Controlled by metadata retrieval + reward/harness |
+| Evidence visibility | Not shown | Evidence shown in RAG trace | Evidence shown in PPO/Harness trace |
+| Hallucination control | Mostly prompt-based | Retrieval + grounding | Retrieval + grounding + reward penalties |
+| Fake URL control | Not available | Harness URL guard | Harness URL guard + PPO penalty signal |
+---
+## 13.2 Engineering-level Difference
+| Engineering Aspect | Baseline 1 | Baseline 2 | Improved Model |
+|---|---|---|---|
+| Notebook purpose | Train and evaluate closed-book SFT model | Build RAG-augmented model and harnessed agent runtime | Train PPO actor and attach it to final harness runtime |
+| Runtime complexity | Low | High | Highest |
+| Debug trace | Basic | Detailed RAG/Web/Harness trace | Detailed PPO/RAG/Web/Harness trace |
+| Failure handling | Minimal | Fallback and guardrail logic | Strict PPO-only fallback prevention plus harness fallback |
+| Artifact verification | Basic output save | Adapter/merged path checks | Manifest, training log count, parameter hash proof, strict gate |
+| Risk of stale artifact use | Moderate | Moderate | Actively guarded against |
+| Demo readiness | Good for simple QA | Strong for grounded QA | Strongest for final controlled system demo |
+---
+# 14. Why the Improved Model Does Not Replace RAG
+A key design decision is that PPO does not replace RAG.
+PPO improves the model’s tendency to:
+- answer directly
+- avoid rejected-style answers
+- avoid vague answers
+- avoid process leakage
+- avoid fake URLs
+- avoid repetition collapse
+- use evidence-like wording more appropriately
+But PPO does not guarantee factual correctness by itself.
+Therefore, the final system still needs:
+```text
+RAG for evidence
+Web Agent for official/latest information
+Harness for source control
+Grounding judge for answer verification
+Fallback for weak evidence
+```
+The intended division of responsibility is:
+```text
+SFT: teaches domain answer style
+RAG: supplies factual evidence
+Agent: finds official external evidence
+Harness: controls trust, routing, fallback, and trace
+PPO: improves answer behavior according to reward preferences
+```
+---
+# 15. Known Limitations
+This project is an applied LLM system prototype, and it has the following limitations.
+## 15.1 Not a full human-feedback RLHF system
+The PPO stage uses a rule-based reward function. It does not train a separate neural reward model from human preference labels.
+Correct description:
+```text
+Rule-reward PPO post-training
+```
+Not:
+```text
+Full RLHF with learned reward model
+```
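To show what "rule-reward" means in contrast to a learned reward model, here is a minimal sketch of a rule-based reward. It is not the project's reward function: the Jaccard overlap, the penalty weights, and the leakage patterns (taken from the leakage examples quoted in this README) are all illustrative assumptions.

```python
import re

# Assumption: patterns derived from the leakage examples in this README.
LEAK_PATTERNS = [r"\bokay, let me\b", r"\bwait, i need to\b"]
URL_PATTERN = re.compile(r"https?://\S+")


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def rule_reward(
    answer: str,
    gold: str,
    rejected: str,
    allowed_urls: frozenset[str] = frozenset(),
) -> float:
    """Hypothetical rule-based reward: overlap margin minus fixed penalties."""
    # Reward moving toward the preferred answer and away from the rejected one.
    reward = jaccard(answer, gold) - jaccard(answer, rejected)
    lowered = answer.lower()
    if any(re.search(pattern, lowered) for pattern in LEAK_PATTERNS):
        reward -= 0.5  # process leakage penalty (illustrative weight)
    urls = {u.rstrip(".,;:)]\"'") for u in URL_PATTERN.findall(answer)}
    if urls - allowed_urls:
        reward -= 0.5  # penalty for URLs not backed by evidence (illustrative weight)
    return reward
```

Every term is a hand-written rule evaluated directly on the text, so no human preference labels or separate reward network are involved. That is the distinction the two descriptions above draw.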
+---
+## 15.2 Raw PPO generations can still be imperfect
+Observed raw PPO generations may include:
+- process leakage
+- occasional hallucinated URLs
+- OCR-like token spacing
+- incomplete course titles
+- noisy source-text reproduction
+The final Harness runtime is therefore necessary.
+---
+## 15.3 Web search is constrained
+The web agent is intentionally limited to official UM/FSKTM sources. It may refuse or fall back when official evidence is weak.
+This is a feature, not a bug, because the system prioritizes trustworthiness over open-ended browsing.
+---
+## 15.4 RAG depends on knowledge base quality
+If the RAG KB contains OCR noise or incomplete chunks, the model may inherit that noise. Future work should improve source cleaning and chunk normalization.
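As a starting point for the chunk normalization mentioned above, the sketch below applies a few conservative cleanup rules to a source chunk. The rules are heuristics chosen for illustration, not the project's preprocessing code, and they should be checked against real handbook chunks before use.

```python
import re
import unicodedata


def normalize_chunk(text: str) -> str:
    """Light cleanup for OCR-style noise in a source chunk."""
    # Fold ligatures and full-width forms into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break. Caveat: this also joins
    # genuine hyphenated compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of spaced-out single letters, e.g. "C o u r s e" -> "Course".
    # Heuristic: three or more single letters in a row are treated as one word.
    text = re.sub(
        r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b",
        lambda match: match.group(0).replace(" ", ""),
        text,
    )
    # Collapse all remaining whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Running the same normalization over both the knowledge-base chunks and the SFT/PPO training text would keep the retriever and the model from seeing two different spellings of the same content.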
+---
+## 15.5 Notebook-based prototype
+The project is implemented as notebooks. A production version should separate modules into:
+```text
+data/
+retrieval/
+agent/
+harness/
+training/
+evaluation/
+ui/
+tests/
+```
+---
+# 16. Recommended Usage
+This project is intended for research, coursework, and demonstration purposes.
+It is not an official Universiti Malaya system.
+For official academic decisions, students should always refer to the official handbook, faculty office, or UM/FSKTM official websites.
+---
+# 17. Suggested Inference Flow
+For final demonstration, use the Improved Model runtime:
+```text
+1. Load PPO full inference actor if available.
+2. If unavailable, load base Qwen3-8B + PPO LoRA adapter.
+3. Initialize local handbook RAG.
+4. Enable official UM/FSKTM web agent if the question may need external/latest information.
+5. Run through TensorTalkHarnessCore.
+6. Display answer with evidence trace.
+```
+Strict runtime requirement:
+```text
+Non-PPO fallback is forbidden in the final Improved Model demo.
+```
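Steps 1 and 2 together with the strict requirement can be sketched as an artifact selection function that prefers the full PPO actor, falls back only to the PPO LoRA adapter, and otherwise refuses to continue. The directory layout, function name, and exception class are hypothetical. The real runtime also verifies a manifest and parameter hashes, which this sketch omits.

```python
from pathlib import Path


class NonPPOFallbackError(RuntimeError):
    """Raised when no PPO artifact is available; the strict demo must not continue."""


def select_ppo_artifact(full_actor_dir: Path, adapter_dir: Path) -> dict:
    """Choose the load plan for the Improved Model, preferring the full inference actor."""
    # Step 1: use the PPO full inference actor if its directory exists and is non-empty.
    if full_actor_dir.is_dir() and any(full_actor_dir.iterdir()):
        return {"mode": "ppo_full_actor", "path": full_actor_dir}
    # Step 2: otherwise plan to load the base model plus the PPO LoRA adapter.
    if adapter_dir.is_dir() and any(adapter_dir.iterdir()):
        return {"mode": "base_plus_ppo_lora", "base": "Qwen3-8B", "path": adapter_dir}
    # Strict gate: never fall back silently to a non-PPO model.
    raise NonPPOFallbackError("No PPO artifact found; refusing to load a non-PPO model.")
```

Raising an exception instead of returning a default keeps a missing artifact from silently downgrading the demo to an earlier baseline.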
+---
+# 18. Summary
+TensorTalk demonstrates a staged LLM system development workflow:
+```text
+Baseline 1:
+Qwen3-8B learns handbook QA through closed-book SFT.
+Baseline 2:
+The system adds RAG, dense retrieval, metadata-aware reranking, official web search, and Harness Engineering.
+Improved Model:
+The system adds full 1000-row rule-reward PPO post-training, strict artifact verification, and a PPO-only final harness runtime.
+```
+The most important contribution is not only that the model can answer handbook questions, but that the system is controlled, evidence-aware, source-constrained, traceable, and evaluated through a clear baseline progression.
+The final system should be understood as:
+> **A Qwen3-8B-based UM Handbook RAG Agent, improved with rule-reward PPO and controlled by Harness Engineering.**