| license: apache-2.0 | |
| language: | |
| - en | |
| - zh | |
| base_model: | |
| - Qwen/Qwen3-8B | |
| pipeline_tag: text-generation | |
| library_name: transformers | |
| tags: | |
| - qwen3 | |
| - qwen3-8b | |
| - lora | |
| - qlora | |
| - sft | |
| - rag | |
| - faiss | |
| - dense-retrieval | |
| - agent | |
| - ppo | |
| - rlhf | |
| - rule-reward | |
| - harness-engineering | |
| - um-handbook | |
| - question-answering | |
| - chatbot | |
| - education | |
| - tensor-talk | |
| # TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering | |
| TensorTalk is a staged LLM engineering project built for **Universiti Malaya Faculty of Computer Science and Information Technology handbook question answering**. The system is designed to answer undergraduate, postgraduate, and general faculty handbook questions using a controlled progression of three experimental stages: | |
| 1. **Baseline 1 — Closed-book SFT Qwen3-8B** | |
| 2. **Baseline 2 — SFT Qwen3-8B + Metadata-aware RAG + Official Web Agent + Harness Engineering** | |
| 3. **Improved Model — Rule-reward PPO post-training + RAG + Agent + Harness Engineering** | |
| The project is more than a chatbot. It is a controlled comparison of how an LLM system changes as it moves from closed-book supervised fine-tuning, to retrieval-grounded answering, and finally to rule-reward post-training inside a guarded agentic runtime. | |
| The main idea is: | |
| > Baseline 1 tests whether a fine-tuned model can answer handbook questions from parameters alone. | |
| > Baseline 2 keeps the same base model family but adds retrieval and harnessed evidence control. | |
| > The Improved Model keeps the RAG + Agent + Harness runtime and further adds PPO post-training to make the model better aligned with the desired answer behavior. | |
| --- | |
| # 1. Project Goal | |
| The goal of this project is to build a reliable and traceable UM Handbook assistant that can answer questions about: | |
| - Faculty objectives, vision, mission, history, facilities, and academic calendar | |
| - Undergraduate programme details | |
| - Postgraduate programme details | |
| - Candidature requirements | |
| - Grading and academic rules | |
| - Industrial training | |
| - Academic project requirements | |
| - Supervision policy | |
| - Thesis/dissertation requirements | |
| - Academic integrity and plagiarism | |
| - Facilities and labs | |
| - Official UM/FSKTM web information when handbook knowledge is insufficient or time-sensitive | |
| The project also aims to demonstrate a complete LLM system development path: | |
| ```text | |
| Closed-book SFT | |
| → RAG-augmented SFT | |
| → Metadata-aware retrieval | |
| → Official-source web agent | |
| → Harness Engineering guardrails | |
| → PPO rule-reward post-training | |
| → Strict artifact verification | |
| → Traceable TensorTalk UI | |
| ``` | |
| --- | |
| # 2. High-level System Overview | |
| The final TensorTalk system contains several layers. | |
| ```text | |
| User Question | |
| ↓ | |
| TensorTalk UI | |
| ↓ | |
| Planning / Thinking Display Layer | |
| ↓ | |
| Local Handbook RAG | |
| ↓ | |
| Official UM / FSKTM Web Agent | |
| ↓ | |
| Harness Engineering Guardrails | |
| ↓ | |
| Evidence Judge / Retry / Fallback | |
| ↓ | |
| PPO-trained Qwen3-8B Actor | |
| ↓ | |
| Answer Grounding Judge | |
| ↓ | |
| Completeness Guard | |
| ↓ | |
| Final Answer + Trace Panels | |
| ``` | |
| The final model is not used alone. It is wrapped inside a runtime harness that controls: | |
| - where the system can search | |
| - which sources it can trust | |
| - whether web evidence is useful | |
| - whether retrieved evidence supports the answer | |
| - whether the model produced fake URLs | |
| - whether the answer leaked internal reasoning | |
| - whether fallback to local handbook RAG is needed | |
| - whether the final answer is grounded enough to show | |
| This is why the final stage is better described as: | |
| > **A PPO-aligned RAG agent system with Harness Engineering**, rather than only a fine-tuned model. | |
| --- | |
| # 3. Dataset Design | |
| ## 3.1 Source Domain | |
| The dataset is built around UM FSKTM undergraduate and postgraduate handbook content. The data is organized into: | |
| - SFT question-answer dataset | |
| - hidden metadata | |
| - RAG knowledge base | |
| - RAG evaluation dataset | |
| - PPO preference dataset | |
| The project separates **model-visible training text** from **metadata used for retrieval, evaluation, and analysis**. | |
| This distinction is important: | |
| - Baseline 1 intentionally trains on question-answer text without forcing explicit metadata labels into the model-visible answer. | |
| - Baseline 2 uses metadata-aware retrieval to reduce scope confusion. | |
| - Stage 3 PPO uses preference pairs and reward functions to shape answer behavior. | |
| --- | |
| ## 3.2 Baseline 1 SFT Dataset | |
| Baseline 1 uses: | |
| ```text | |
| SFT_QA_Training_Ready.jsonl | |
| ``` | |
| The notebook validates: | |
| ```text | |
| Total examples: 1000 | |
| Train examples: 800 | |
| Validation examples: 100 | |
| Test examples: 100 | |
| Split ratio: 8:1:1 | |
| Duplicate question groups: 0 | |
| Duplicate question rows: 0 | |
| ``` | |
| Each example follows a supervised chat-style format: | |
| ```json | |
| { | |
| "prompt": [ | |
| { | |
| "role": "system", | |
| "content": "You are an academic assistant for the Faculty of Computer Science and Information Technology, Universiti Malaya..." | |
| }, | |
| { | |
| "role": "user", | |
| "content": "What are the faculty objectives?" | |
| } | |
| ], | |
| "completion": [ | |
| { | |
| "role": "assistant", | |
| "content": "The faculty objectives are..." | |
| } | |
| ], | |
| "question": "...", | |
| "answer": "..." | |
| } | |
| ``` | |
| This stage teaches the model to imitate handbook-style answers directly. | |
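The record format and the duplicate checks reported above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the notebook's validation code: the helper names `validate_record` and `find_duplicate_questions` are hypothetical, and the role order (system, then user) and the duplicate rule (lowercase plus whitespace collapsing) are assumptions.

```python
REQUIRED_KEYS = {"prompt", "completion", "question", "answer"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one SFT record (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    # Assumption: every prompt is exactly one system turn followed by one user turn.
    prompt_roles = [message.get("role") for message in record["prompt"]]
    if prompt_roles != ["system", "user"]:
        problems.append(f"unexpected prompt roles: {prompt_roles}")
    completion_roles = [message.get("role") for message in record["completion"]]
    if completion_roles != ["assistant"]:
        problems.append(f"unexpected completion roles: {completion_roles}")
    return problems


def find_duplicate_questions(records: list[dict]) -> list[str]:
    """Return questions that repeat after lowercasing and whitespace collapsing."""
    seen, duplicates = set(), []
    for record in records:
        key = " ".join(record["question"].lower().split())
        if key in seen:
            duplicates.append(record["question"])
        seen.add(key)
    return duplicates


example = {
    "prompt": [
        {"role": "system", "content": "You are an academic assistant..."},
        {"role": "user", "content": "What are the faculty objectives?"},
    ],
    "completion": [{"role": "assistant", "content": "The faculty objectives are..."}],
    "question": "What are the faculty objectives?",
    "answer": "The faculty objectives are...",
}

print(validate_record(example))  # an empty list means the record looks valid
```

Running both helpers over every line of the JSONL file before training is a cheap way to catch malformed rows and leakage between splits.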
| --- | |
| ## 3.3 Baseline 2 RAG Dataset | |
| Baseline 2 builds on the same SFT dataset and adds external retrieval resources: | |
| ```text | |
| UM_RAG_Knowledge_Base.jsonl | |
| UM_RAG_Evaluation_Dataset.jsonl | |
| SFT_QA_Metadata.jsonl | |
| ``` | |
| The RAG knowledge base contains structured fields such as: | |
| ```text | |
| kb_id | |
| source_doc | |
| scope_label | |
| section | |
| pages | |
| source_text | |
| retrieval_text | |
| retrieval_keywords | |
| grounded_answer_bank | |
| matched_qa_ids | |
| ``` | |
| The RAG knowledge base loaded in the final Stage 3 runtime contains: | |
| ```text | |
| Loaded KB rows: 521 | |
| ``` | |
| The metadata layer allows the system to distinguish: | |
| ```text | |
| general | |
| undergraduate | |
| postgraduate | |
| ``` | |
| This is important because many handbook questions look similar but require different answers depending on the student scope. | |
| --- | |
| ## 3.4 PPO Preference Dataset | |
| The Improved Model uses: | |
| ```text | |
| UM_Handbook_PPO_Preference_Dataset.jsonl | |
| ``` | |
| The final PPO run uses the full dataset: | |
| ```text | |
| Total PPO preference rows: 1000 | |
| Train rows: 900 | |
| Validation rows: 100 | |
| Train fraction: 0.90 | |
| ``` | |
| The PPO dataset is not used like normal SFT data. In SFT, the model directly imitates a reference answer. In PPO, the model generates its own answer, receives a reward, and updates toward higher-reward behavior. | |
| --- | |
| # 4. Baseline 1 — Closed-book SFT Qwen3-8B | |
| ## 4.1 Purpose | |
| Baseline 1 asks a simple question: | |
| > Can Qwen3-8B learn UM Handbook question answering from supervised fine-tuning alone? | |
| This is a **closed-book baseline**. The model does not retrieve handbook evidence during inference. It must answer from what it learned during SFT. | |
| This is useful as a control baseline because it shows what happens when the model relies mainly on parameter memory. | |
| --- | |
| ## 4.2 Model | |
| Baseline 1 uses: | |
| ```text | |
| Base model: Qwen/Qwen3-8B | |
| Local path: /scr/user/kevin2002/TensorCat/NLP/UM_Handbook/models/Qwen3-8B | |
| ``` | |
| The notebook detected: | |
| ```text | |
| Backend: CUDA | |
| GPU: NVIDIA A100-SXM4-80GB | |
| dtype: bfloat16 | |
| 4-bit QLoRA: enabled | |
| ``` | |
| --- | |
| ## 4.3 Training Method | |
| Baseline 1 uses LoRA / QLoRA supervised fine-tuning. | |
| LoRA configuration: | |
| ```text | |
| LoRA rank: 16 | |
| LoRA alpha: 32 | |
| LoRA dropout: 0.10 | |
| Target modules: | |
| - q_proj | |
| - k_proj | |
| - v_proj | |
| - o_proj | |
| - gate_proj | |
| - up_proj | |
| - down_proj | |
| ``` | |
| Training configuration: | |
| ```text | |
| Epochs: 8 | |
| Train split: 800 | |
| Validation split: 100 | |
| Test split: 100 | |
| Per-device train batch size: 2 | |
| Per-device eval batch size: 2 | |
| Gradient accumulation steps: 8 | |
| Learning rate: 1e-4 | |
| Packing: False | |
| ``` | |
| --- | |
| ## 4.4 Baseline 1 Results | |
| The training completed successfully. | |
| Training summary: | |
| ```text | |
| Training steps: 400 | |
| Training runtime: ~18.68 minutes for the main train stage | |
| Train loss: 0.4824 | |
| Final validation loss: ~0.146 | |
| Test loss: ~0.197 | |
| Perplexity: ~1.157 | |
| ``` | |
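The reported step count and perplexity can be reproduced from the configuration with simple arithmetic. The sketch below assumes a single GPU (the notebook reports one A100, but the device count is not stated explicitly) and assumes the perplexity is the exponential of the validation loss, which is what the numbers suggest.

```python
import math

train_examples = 800
per_device_batch = 2
grad_accum_steps = 8
epochs = 8
num_devices = 1  # assumption: single-GPU training

# Effective batch size per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices  # 16
steps_per_epoch = math.ceil(train_examples / effective_batch)        # 50
total_steps = steps_per_epoch * epochs                               # 400

# Perplexity is exp(cross-entropy loss); here taken from the validation loss.
val_loss = 0.146
val_perplexity = math.exp(val_loss)

print(total_steps, round(val_perplexity, 3))  # 400 1.157
```

Under these assumptions the 400 training steps and the perplexity of about 1.157 are both consistent with the configuration and the final validation loss.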
| Generation metrics: | |
| ### Validation | |
| ```text | |
| Exact match: 0.77 | |
| Token F1: 0.9111 | |
| ROUGE-1: 0.9122 | |
| ROUGE-2: 0.8700 | |
| ROUGE-L: 0.8979 | |
| SacreBLEU: 81.7240 | |
| chrF++: 86.8916 | |
| Average prediction words: 36.35 | |
| Average reference words: 38.57 | |
| ``` | |
| ### Test | |
| ```text | |
| Exact match: 0.72 | |
| Token F1: 0.8869 | |
| ROUGE-1: 0.8857 | |
| ROUGE-2: 0.8352 | |
| ROUGE-L: 0.8677 | |
| SacreBLEU: 81.1138 | |
| chrF++: 87.7054 | |
| Average prediction words: 38.03 | |
| Average reference words: 37.03 | |
| ``` | |
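For readers unfamiliar with the first two metrics, exact match and token F1 are commonly computed in the SQuAD style shown below. This is a sketch under assumptions: the notebook's exact normalization is not documented here, so the lowercasing and punctuation stripping in `normalize` are illustrative choices.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 rewards partially correct answers, which is why it stays high (about 0.89 on the test split) even when only 72% of predictions match the reference exactly.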
| --- | |
| ## 4.5 Baseline 1 Strengths | |
| Baseline 1 is strong when the question is close to the training distribution. It can reproduce handbook-style answers well and shows high text overlap with the reference answers. | |
| It is useful because: | |
| - it establishes the basic Qwen3-8B SFT capability | |
| - it verifies that the dataset format is learnable | |
| - it creates a clean closed-book control model | |
| - it provides a baseline for later RAG and PPO improvements | |
| --- | |
| ## 4.6 Baseline 1 Limitations | |
| Baseline 1 is still limited because it is a closed-book model. | |
| Main limitations: | |
| 1. **No retrieval evidence** | |
| It cannot check the handbook at inference time. | |
| 2. **Potential hallucination** | |
| If the question is out-of-distribution or requires exact source grounding, the model can only answer from parametric memory and may produce plausible but unsupported details. | |
| 3. **Scope confusion** | |
| Undergraduate and postgraduate rules may be mixed if the question is ambiguous. | |
| 4. **No official web update mechanism** | |
| It cannot answer dynamic or latest-information questions reliably. | |
| 5. **No harness guardrails** | |
| It does not include fake URL detection, evidence judging, web application firewall (WAF) handling, or fallback control. | |
| Baseline 1 is therefore a necessary but incomplete starting point. | |
| --- | |
| # 5. Baseline 2 — RAG + SFT + Metadata-aware Retrieval + Harness Agent | |
| ## 5.1 Purpose | |
| Baseline 2 asks: | |
| > What improves if we keep the same Qwen3-8B family but add retrieval-grounded evidence? | |
| The goal is to reduce hallucination and scope confusion by giving the model relevant handbook evidence at inference time. | |
| This stage introduces RAG and agentic harness logic while keeping the same broad model family and handbook task. | |
| --- | |
| ## 5.2 What RAG Means in This Project | |
| RAG stands for **Retrieval-Augmented Generation**. | |
| In simple terms: | |
| ```text | |
| Instead of asking the model to answer only from memory, | |
| the system first retrieves relevant handbook chunks, | |
| then asks the model to answer using those chunks. | |
| ``` | |
| In this project, RAG is not just keyword search. It uses: | |
| ```text | |
| Transformer embedding model | |
| + FAISS vector search | |
| + metadata-aware reranking | |
| + scope labels | |
| + top-k evidence blocks | |
| ``` | |
| The Baseline 2 retriever uses: | |
| ```text | |
| Embedding model: BAAI/bge-base-en-v1.5 | |
| Vector index: FAISS | |
| Similarity: inner product after embedding normalization | |
| Top-k retrieval: 3 | |
| ``` | |
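"Inner product after embedding normalization" is equivalent to cosine similarity. The sketch below illustrates the scoring with plain Python lists standing in for the embedding model and the FAISS index. The vectors and `kb_*` identifiers are made up for illustration, and the real pipeline would delegate this search to FAISS.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_inner_product(query, chunks: dict, k: int = 3):
    """Rank chunks by inner product of unit vectors, i.e. cosine similarity."""
    unit_query = l2_normalize(query)
    scored = []
    for chunk_id, vector in chunks.items():
        unit_chunk = l2_normalize(vector)
        score = sum(a * b for a, b in zip(unit_query, unit_chunk))
        scored.append((score, chunk_id))
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy 2-dimensional "embeddings"; real BGE embeddings are much higher-dimensional.
toy_chunks = {
    "kb_a": [1.0, 0.0],
    "kb_b": [1.0, 1.0],
    "kb_c": [0.0, 1.0],
    "kb_d": [-1.0, 0.0],
}
results = top_k_inner_product([2.0, 0.1], toy_chunks, k=3)
```

Because all vectors are normalized first, scores fall in the range [-1, 1] and the magnitude of the raw embedding does not affect the ranking.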
| --- | |
| ## 5.3 Metadata-aware Retrieval | |
| The RAG system uses metadata to control retrieval quality. | |
| Important metadata fields include: | |
| ```text | |
| source_doc | |
| scope_label | |
| section | |
| pages | |
| kb_id | |
| knowledge group | |
| retrieval keywords | |
| grounded answer bank | |
| ``` | |
| This allows the retriever to prefer the correct audience scope. | |
| Example: | |
| ```text | |
| Question: What are the candidature requirements for Master of Software Engineering? | |
| Expected scope: postgraduate | |
| ``` | |
| The system should retrieve postgraduate chunks, not undergraduate chunks. | |
| This is one of the main improvements over Baseline 1. | |
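One simple way to make retrieval scope-aware is to add a bonus to candidates whose `scope_label` matches the expected scope before sorting. The sketch below is only an illustration of that idea: the function name, the bonus values, and the candidate ids are hypothetical, and the actual reranking rule used in the notebook may differ.

```python
def rerank_by_scope(candidates, expected_scope, scope_bonus=0.2, general_bonus=0.05):
    """Re-sort dense-retrieval candidates, favouring the expected audience scope.

    Each candidate is a dict with 'kb_id', 'scope_label', and 'score'.
    The bonus values are illustrative, not tuned.
    """
    def adjusted_score(candidate):
        if candidate["scope_label"] == expected_scope:
            return candidate["score"] + scope_bonus
        if candidate["scope_label"] == "general":
            return candidate["score"] + general_bonus
        return candidate["score"]

    return sorted(candidates, key=adjusted_score, reverse=True)


candidates = [
    {"kb_id": "ug_01", "scope_label": "undergraduate", "score": 0.80},
    {"kb_id": "pg_07", "scope_label": "postgraduate", "score": 0.74},
    {"kb_id": "gen_03", "scope_label": "general", "score": 0.70},
]
reranked = rerank_by_scope(candidates, expected_scope="postgraduate")
```

In this toy case the postgraduate chunk moves to the top even though an undergraduate chunk had a slightly higher dense score, which is the behaviour the Master of Software Engineering example calls for.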
| --- | |
| ## 5.4 RAG-augmented Training Dataset | |
| Baseline 2 creates a RAG-augmented dataset where training examples include evidence context. | |
| The training prompt can contain: | |
| ```text | |
| User question | |
| + retrieved handbook evidence | |
| + source metadata | |
| + answer instruction | |
| ``` | |
| This teaches the model to answer with evidence-aware context rather than only memorized answers. | |
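A prompt of this shape could be assembled as follows. The field names come from the knowledge-base schema listed earlier. The instruction wording, the `[Evidence n]` header format, and the sample block are assumptions made for illustration.

```python
def build_rag_prompt(question: str, evidence_blocks: list[dict]) -> str:
    """Combine retrieved evidence, its source metadata, and the user question."""
    lines = ["Answer the question using only the handbook evidence below.", ""]
    for index, block in enumerate(evidence_blocks, start=1):
        header = (
            f"[Evidence {index}] {block['source_doc']} | "
            f"{block['scope_label']} | {block['section']} | pages {block['pages']}"
        )
        lines += [header, block["source_text"], ""]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical evidence block; the values are placeholders, not real KB rows.
sample_block = {
    "source_doc": "postgraduate_handbook",
    "scope_label": "postgraduate",
    "section": "Candidature",
    "pages": "12-13",
    "source_text": "(retrieved handbook text goes here)",
}
prompt = build_rag_prompt("What are the candidature requirements?", [sample_block])
```

Keeping the metadata in the evidence header lets the model, and later the trace panels, attribute each statement to a specific source and scope.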
| --- | |
| ## 5.5 Baseline 2 Training Configuration | |
| Baseline 2 uses Qwen3-8B with LoRA fine-tuning. | |
| Configuration: | |
| ```text | |
| Base model: Qwen/Qwen3-8B | |
| Embedding model: BAAI/bge-base-en-v1.5 | |
| LoRA rank: 8 | |
| LoRA alpha: 16 | |
| LoRA dropout: 0.05 | |
| Target modules: | |
| - q_proj | |
| - k_proj | |
| - v_proj | |
| - o_proj | |
| - gate_proj | |
| - up_proj | |
| - down_proj | |
| Epochs: 20 | |
| Per-device train batch size: 4 | |
| Per-device eval batch size: 8 | |
| Target global batch size: 8 | |
| Learning rate: 8e-5 | |
| Max sequence length: 1024 | |
| Validation ratio: 0.10 | |
| Test ratio: 0.10 | |
| Save merged model: False | |
| Runtime model path: base model + LoRA adapter | |
| ``` | |
| The notebook uses a safer non-merged runtime path when merged model export is unavailable or memory-expensive. | |
| --- | |
| ## 5.6 Baseline 2 Retrieval Evaluation | |
| Baseline 2 includes a retrieval evaluation set. | |
| Retrieval metrics: | |
| ```json | |
| { | |
| "retrieval_eval_size": 1000, | |
| "top_k": 3, | |
| "hit_at_1_primary": 0.821, | |
| "hit_at_k_primary": 0.954, | |
| "hit_at_k_same_group": 0.991, | |
| "scope_match_at_1": 0.996, | |
| "retriever_type": "dense_embedding + faiss + metadata_rerank", | |
| "embedding_model_name": "BAAI/bge-base-en-v1.5" | |
| } | |
| ``` | |
| Interpretation: | |
| - `hit_at_1_primary = 0.821` means the top retrieved chunk is exactly the expected primary evidence in 82.1% of cases. | |
| - `hit_at_k_primary = 0.954` means the correct primary evidence appears within top-3 in 95.4% of cases. | |
| - `hit_at_k_same_group = 0.991` means an acceptable evidence chunk from the same knowledge group appears within top-3 in 99.1% of cases. | |
| - `scope_match_at_1 = 0.996` means the top result almost always matches the correct undergraduate/postgraduate/general scope. | |
| These numbers indicate that retrieval is far from random: the metadata-aware retriever is a strong baseline, and most of its top-1 misses are recovered within the top 3. | |
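The hit-rate metrics above follow the usual hit@k definition: the fraction of queries whose expected chunk appears among the first k retrieved ids. A minimal sketch with toy data (the ids are made up) is:

```python
def hit_at_k(ranked_ids_per_query: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    hits = sum(
        1
        for ranked_ids, gold_id in zip(ranked_ids_per_query, gold_ids)
        if gold_id in ranked_ids[:k]
    )
    return hits / len(gold_ids)


# Four toy queries with their top-3 retrieved ids and the expected (gold) id.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "d", "e"], ["x", "y", "z"]]
gold = ["a", "a", "e", "q"]

print(hit_at_k(ranked, gold, k=1))  # 0.25
print(hit_at_k(ranked, gold, k=3))  # 0.75
```

`scope_match_at_1` can be computed the same way by comparing the scope label of the top result with the expected scope instead of comparing chunk ids.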
| --- | |
| ## 5.7 Baseline 2 Generation Evaluation | |
| Generation evaluation was run on a smaller selected set for runtime practicality. | |
| Results: | |
| ```json | |
| { | |
| "generation_eval_size": 20, | |
| "top_k": 3, | |
| "plain_exact_match": 0.0, | |
| "plain_token_f1": 0.3391, | |
| "rag_exact_match": 0.0, | |
| "rag_token_f1": 0.8460, | |
| "rag_minus_plain_exact_match": 0.0, | |
| "rag_minus_plain_token_f1": 0.5069 | |
| } | |
| ``` | |
| This shows a large improvement from RAG: | |
| ```text | |
| Plain token F1: 0.3391 | |
| RAG token F1: 0.8460 | |
| Improvement: +0.5069 | |
| ``` | |
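| This is one of the strongest pieces of evidence in the project, although it comes from only 20 examples. | |
| Retrieval grounding raised token F1 by about 0.51 over plain generation, while exact match stayed at 0.0 in both settings, so the gain reflects much closer token overlap rather than verbatim matches. | |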
| --- | |
| # 6. Agent Layer in Baseline 2 and Improved Model | |
| ## 6.1 Why an Agent Is Needed | |
| The handbook is reliable for stable academic rules, but some questions may require official web information. | |
| Examples: | |
| ```text | |
| Who is the current dean? | |
| Where can students find residential college information? | |
| What official page mentions PEKOM? | |
| Where is the official SPeCTRUM page? | |
| ``` | |
| For these cases, the system needs a controlled web agent. | |
| However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent. | |
| --- | |
| ## 6.2 Official UM / FSKTM Web Agent | |
| The web agent is constrained to official UM / FSKTM domains. | |
| Priority domains: | |
| ```text | |
| fsktm.um.edu.my | |
| www.um.edu.my | |
| ``` | |
| Auxiliary official domains include UM-related systems such as: | |
| ```text | |
| aasd.um.edu.my | |
| maya.um.edu.my | |
| umlib.um.edu.my | |
| umresearch.um.edu.my | |
| jobs.um.edu.my | |
| careerportal.fsktm.um.edu.my | |
| intra.fsktm.um.edu.my | |
| gallery.fsktm.um.edu.my | |
| ``` | |
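A domain guard over these lists can be as small as a hostname lookup. The sketch below is an assumption about how such a guard might look, not the notebook's code: it matches hostnames exactly, and whether the real guard also accepts other subdomains is not stated. The paths in the usage note are hypothetical.

```python
from urllib.parse import urlparse

PRIORITY_DOMAINS = {"fsktm.um.edu.my", "www.um.edu.my"}
AUXILIARY_DOMAINS = {
    "aasd.um.edu.my",
    "maya.um.edu.my",
    "umlib.um.edu.my",
    "umresearch.um.edu.my",
    "jobs.um.edu.my",
    "careerportal.fsktm.um.edu.my",
    "intra.fsktm.um.edu.my",
    "gallery.fsktm.um.edu.my",
}


def classify_url(url: str) -> str:
    """Return 'priority', 'auxiliary', or 'rejected' for a candidate URL."""
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        return "rejected"
    # Compare the full hostname so look-alike hosts that merely contain an
    # official domain as a prefix are not accepted.
    host = (parsed.hostname or "").lower()
    if host in PRIORITY_DOMAINS:
        return "priority"
    if host in AUXILIARY_DOMAINS:
        return "auxiliary"
    return "rejected"
```

For example, `classify_url("https://fsktm.um.edu.my/about")` returns `"priority"`, while a look-alike host such as `fsktm.um.edu.my.evil.example` is rejected because the comparison is on the whole hostname rather than a substring.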
| The agent performs: | |
| ```text | |
| query planning | |
| official web discovery | |
| URL filtering | |
| page fetching | |
| evidence extraction | |
| evidence scoring | |
| Qwen-based evidence judging | |
| retry if weak | |
| fallback to handbook RAG if needed | |
| ``` | |
| --- | |
| ## 6.3 Agent Is Not Fully Autonomous by Design | |
| This project does not use a completely unrestricted autonomous agent. | |
| That is intentional. | |
| For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be: | |
| ```text | |
| safe | |
| source-constrained | |
| traceable | |
| fallback-aware | |
| grounded | |
| ``` | |
| So the agent is better described as: | |
| > A constrained official-source web agent controlled by Harness Engineering. | |
| --- | |
| # 7. Harness Engineering | |
| ## 7.1 What Harness Engineering Means Here | |
| Harness Engineering is the guardrail system around the model and agent. | |
| A simple analogy: | |
| ```text | |
| The LLM/agent is the car. | |
| Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard. | |
| ``` | |
| The model can generate fluent answers, but the harness controls: | |
| - what it is allowed to search | |
| - what sources it can trust | |
| - whether a URL is fake | |
| - whether evidence is useful | |
| - whether the answer is grounded | |
| - whether the system should retry | |
| - whether it should fall back to local handbook RAG | |
| - what trace should be shown to the user | |
| --- | |
| ## 7.2 Harness Pipeline | |
| The standardized TensorTalk Harness Core follows this structure: | |
| ```text | |
| User Question | |
| ↓ | |
| Local Handbook RAG | |
| ↓ | |
| Official Web Discovery | |
| ↓ | |
| Domain Guard | |
| ↓ | |
| Fake URL Guard | |
| ↓ | |
| WAF Detection | |
| ↓ | |
| Evidence Normalizer | |
| ↓ | |
| Qwen Evidence Judge | |
| ↓ | |
| Entity-aware Retry | |
| ↓ | |
| Weak Evidence Fallback | |
| ↓ | |
| Answer Generator | |
| ↓ | |
| Answer Grounding Judge | |
| ↓ | |
| Completeness Guard | |
| ↓ | |
| UI Trace | |
| ``` | |
| --- | |
| ## 7.3 Harness Components | |
| The notebooks include several engineering patches and layers: | |
| ### V14 — WAF-aware Harness | |
| Handles web pages blocked by WAF or browser failures. | |
| Functions: | |
| - detect WAF block pages | |
| - exclude blocked pages from evidence | |
| - provide diagnostics | |
| - use safe static fallback if browser click fails | |
| - reject query-fabricated URLs before evidence building | |
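One way to reject fabricated URLs is to accept only URLs that were actually returned by discovery or fetching. The sketch below applies that rule to a piece of text. The function name, the regular expression, and the example paths are all hypothetical, and the notebook's guard may use different criteria.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def find_unverified_urls(text: str, fetched_urls: list[str]) -> list[str]:
    """Return URLs mentioned in `text` that were never actually fetched."""
    # Normalize by dropping trailing slashes and lowercasing before comparing.
    known = {url.rstrip("/").lower() for url in fetched_urls}
    mentioned = [url.rstrip(".,;") for url in URL_PATTERN.findall(text)]
    return [url for url in mentioned if url.rstrip("/").lower() not in known]


# Hypothetical paths, used only to exercise the guard.
answer = (
    "See https://fsktm.um.edu.my/pekom and "
    "https://fsktm.um.edu.my/made-up-page."
)
fetched = ["https://fsktm.um.edu.my/pekom/"]
suspicious = find_unverified_urls(answer, fetched)
```

The same check can be applied to candidate URLs before evidence building and to URLs that appear in a generated answer.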
| --- | |
| ### V15 — Qwen Evidence Judge Loop | |
| Adds an LLM-based evidence judge. | |
| Flow: | |
| ```text | |
| Planner | |
| → Search / Fetch | |
| → Evidence Filter | |
| → Qwen Judge | |
| → Retry | |
| → Final Evidence | |
| ``` | |
| The purpose is to avoid trusting weak web snippets blindly. | |
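The control flow of such a judge loop can be sketched with plain callables standing in for the real components. Everything here is illustrative: in the project the judge is Qwen-based and the search step fetches official pages, whereas below they are stub functions, and the 0.5 threshold and query strings are arbitrary.

```python
from typing import Callable


def gather_evidence(
    question: str,
    queries: list[str],
    search: Callable[[str], list[str]],
    judge: Callable[[str, str], float],
    local_rag: Callable[[str], list[str]],
    threshold: float = 0.5,
) -> dict:
    """Try each planned query in turn; fall back to local RAG if all are weak."""
    trace = []
    for query in queries:
        snippets = search(query)
        # Keep only snippets the judge scores as useful for the question.
        kept = [s for s in snippets if judge(question, s) >= threshold]
        trace.append({"query": query, "fetched": len(snippets), "kept": len(kept)})
        if kept:
            return {"source": "web", "evidence": kept, "trace": trace}
    return {"source": "local_rag", "evidence": local_rag(question), "trace": trace}


# Stubs: the first query finds nothing, the retry finds one snippet.
fake_pages = {"FSKTM dean": [], "FSKTM dean official page": ["Dean's office page text"]}
result = gather_evidence(
    "Who is the current dean?",
    ["FSKTM dean", "FSKTM dean official page"],
    search=lambda query: fake_pages.get(query, []),
    judge=lambda question, snippet: 0.9,
    local_rag=lambda question: ["handbook chunk"],
)
```

Returning the trace alongside the evidence is what allows the UI to show which queries were tried and why a fallback happened.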
| --- | |
| ### V16 — Local-aware Judge Repair | |
| Improves routing and fallback. | |
| It handles: | |
| ```text | |
| PEKOM routing | |
| CCNA Lab routing | |
| residential college routing | |
| local RAG fallback | |
| entity-aware retry | |
| fake URL rejection | |
| ``` | |
| --- | |
| ### V17 — Strict Entity Judge and UI Polish | |
| Adds stricter entity matching and improves trace display. | |
| This helps avoid cases where a query about one entity is answered with another related but wrong page. | |
| --- | |
| ### V18 — Balanced Official Reference Fallback | |
| Allows the system to still provide official references when the web evidence is not strong enough to answer from, while avoiding over-trusting weak pages. | |
| --- | |
| ### V19 — Answer Grounding Judge | |
| Checks whether the final generated answer is actually supported by evidence. | |
| This is important because even if retrieval is correct, the model may still introduce unsupported details. | |
| --- | |
| ### Completeness Guard | |
| Checks whether the answer is too incomplete and whether a rewrite or fallback should be triggered. | |
| --- | |
| # 8. Improved Model — PPO Rule-reward Post-training + RAG + Agent + Harness | |
| ## 8.1 Purpose | |
| The Improved Model asks: | |
| > Can we further improve the model’s behavior after SFT/RAG by using PPO reward-based post-training? | |
| Baseline 2 already improves factual grounding through RAG and Harness Engineering. The Improved Model adds PPO to shape the model’s behavior. | |
| The goal is not to replace RAG. The goal is to make the model more aligned with the desired answer style and safety behavior. | |
| --- | |
| ## 8.2 What PPO Means in This Project | |
| PPO stands for **Proximal Policy Optimization**. | |
| In simple terms: | |
| ```text | |
| SFT teaches the model by imitation. | |
| PPO lets the model generate answers, scores them with a reward function, and updates the model toward higher-reward answers. | |
| ``` | |
| In this project: | |
| ```text | |
| Actor model: Qwen3-8B + LoRA | |
| Critic/value head: TRL value head model | |
| Reference model: frozen Qwen3-8B reference | |
| Reward: rule-based preference reward function | |
| KL control: used to avoid drifting too far from the reference model | |
| ``` | |
| --- | |
| ## 8.3 Rule-based Reward Function | |
| This project uses a rule-based reward function rather than a separately trained neural reward model. | |
| The reward function evaluates: | |
| ```text | |
| gold answer similarity | |
| rejected answer penalty | |
| evidence overlap | |
| scope correctness | |
| hallucinated URL penalty | |
| vague answer penalty | |
| process/thinking leakage penalty | |
| direct answer bonus | |
| repetition penalty | |
| degeneration/collapse penalty | |
| ``` | |
| This is why the model card should describe the final stage as: | |
| > Rule-reward PPO post-training | |
| not: | |
| > Full RLHF with a trained reward model | |
| The reward model type recorded in the notebook is: | |
| ```text | |
| rule_based_preference_reward_function | |
| uses_separate_neural_reward_model: False | |
| ``` | |
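To make the idea of a rule-based reward concrete, the sketch below combines a few of the listed signals into one score. It is a simplified illustration, not the notebook's reward function: the weights, the leak phrases, and the token-overlap measure are assumptions, and several listed terms (scope correctness, repetition, degeneration) are omitted.

```python
import re

LEAK_PHRASES = ("okay, let me", "wait, i need to")  # illustrative leak markers


def token_set(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def overlap(answer: str, target: str) -> float:
    """Fraction of the target's distinct tokens that also occur in the answer."""
    target_tokens = token_set(target)
    if not target_tokens:
        return 0.0
    return len(token_set(answer) & target_tokens) / len(target_tokens)


def rule_reward(answer: str, gold: str, rejected: str) -> float:
    """Toy rule-based reward; all weights are hypothetical."""
    reward = overlap(answer, gold) - 0.5 * overlap(answer, rejected)
    # A URL that the gold answer does not contain is treated as hallucinated.
    if re.search(r"https?://", answer) and not re.search(r"https?://", gold):
        reward -= 0.5
    # Penalize leaked process/thinking text.
    if any(phrase in answer.lower() for phrase in LEAK_PHRASES):
        reward -= 0.5
    # Penalize answers too short to be informative.
    if len(token_set(answer)) < 3:
        reward -= 0.25
    return reward


gold = "Students must complete industrial training"
rejected = "No training is required"
clean_reward = rule_reward(gold, gold, rejected)
leaky_reward = rule_reward("Okay, let me try to figure out... " + gold, gold, rejected)
```

Because every term is an explicit rule, each reward value can be traced back to the checks that produced it, which is the main practical difference from a learned neural reward model.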
| --- | |
| ## 8.4 PPO Training Configuration | |
| The final PPO run uses: | |
| ```text | |
| Preference dataset rows: 1000 | |
| Train rows: 900 | |
| Validation rows: 100 | |
| MAX_PPO_ROWS: None | |
| Train fraction: 0.90 | |
| PPO epochs: 2 | |
| Batch size: 2 | |
| Mini-batch size: 1 | |
| Max new tokens: 72 | |
| Max PPO steps per epoch: None | |
| Planned steps per epoch: 450 | |
| Total planned steps: 900 | |
| Learning rate: 2e-6 | |
| Target KL: 0.10 | |
| Generation temperature: 0.45 | |
| Top-p: 0.78 | |
| Repetition penalty: 1.3 | |
| No-repeat ngram size: 4 | |
| ``` | |
| The run completed successfully: | |
| ```text | |
| Global PPO steps: 900 / 900 | |
| Elapsed time: 04:47:59 | |
| Degenerate ratio: 0.00% | |
| ``` | |
| --- | |
| ## 8.5 PPO Artifact Verification | |
| The Stage 3 notebook includes strict artifact verification. | |
| This is important because PPO notebooks can easily appear to run while silently saving old or incomplete artifacts. | |
| The strict save cell verifies: | |
| ```text | |
| training_log exists | |
| training_log records = 900 | |
| expected steps = 900 | |
| MAX_PPO_ROWS = None | |
| train rows = 900 | |
| valid rows = 100 | |
| NUM_PPO_EPOCHS = 2 | |
| MAX_PPO_STEPS_PER_EPOCH = None | |
| parameter hash changed after PPO | |
| PPO inference full actor exists | |
| PPO LoRA adapter exists | |
| non-PPO fallback forbidden | |
| ``` | |
| The final strict save output confirms: | |
| ```text | |
| Final PPO records saved: 900 / expected 900 | |
| Strict full PPO artifact contract passed. | |
| ``` | |
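A contract check of this kind can be expressed as a function that returns a list of violations. The sketch below is a hedged illustration: the dictionary keys and helper names are hypothetical, and only a subset of the listed checks is shown. The expected step count follows from the configuration (900 train rows, batch size 2, 2 epochs).

```python
def planned_ppo_steps(train_rows: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch times number of epochs."""
    return (train_rows // batch_size) * epochs


def check_contract(summary: dict) -> list[str]:
    """Return a list of contract violations (empty if the run is complete)."""
    failures = []
    expected = planned_ppo_steps(
        summary["train_rows"], summary["batch_size"], summary["num_ppo_epochs"]
    )
    if summary["training_log_records"] != expected:
        failures.append(
            f"expected {expected} log records, found {summary['training_log_records']}"
        )
    if summary["max_ppo_rows"] is not None:
        failures.append("MAX_PPO_ROWS must be None for a full run")
    if summary["max_ppo_steps_per_epoch"] is not None:
        failures.append("MAX_PPO_STEPS_PER_EPOCH must be None for a full run")
    if not summary["parameter_hash_changed"]:
        failures.append("parameter hash did not change after PPO")
    return failures


full_run = {
    "train_rows": 900,
    "batch_size": 2,
    "num_ppo_epochs": 2,
    "training_log_records": 900,
    "max_ppo_rows": None,
    "max_ppo_steps_per_epoch": None,
    "parameter_hash_changed": True,
}
print(check_contract(full_run))  # []
```

A run that saved only 150 log records would fail the first check, which is the kind of stale artifact the strict save cell is meant to catch.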
| The parameter change proof confirms: | |
| ```text | |
| aggregate_hash_changed: true | |
| changed_trainable_tensors: 506 | |
| unchanged_trainable_tensors: 0 | |
| ``` | |
| This shows that PPO training actually updated the trainable LoRA/value-head parameters, rather than the notebook running without any effect on the weights. | |
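A parameter-change proof of this kind can be built by hashing the trainable tensors before and after training. The sketch below uses plain Python lists in place of real tensors so that it stays self-contained. The function names and toy parameter names are hypothetical, and the notebook's hashing scheme may differ.

```python
import hashlib
import struct


def aggregate_hash(named_tensors: dict[str, list[float]]) -> str:
    """SHA-256 over all parameter names and values, in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(named_tensors):
        values = named_tensors[name]
        digest.update(name.encode("utf-8"))
        # Pack the floats as raw doubles so any numeric change alters the hash.
        digest.update(struct.pack(f"{len(values)}d", *values))
    return digest.hexdigest()


def changed_tensors(before: dict, after: dict) -> list[str]:
    """Names of tensors whose values differ between two snapshots."""
    return [name for name in before if before[name] != after[name]]


# Toy snapshots standing in for trainable LoRA parameters.
before = {"lora_A": [0.1, 0.2], "lora_B": [0.0, 0.0]}
after = {"lora_A": [0.1, 0.2001], "lora_B": [0.0, 0.01]}

hash_changed = aggregate_hash(before) != aggregate_hash(after)
changed = changed_tensors(before, after)
```

With real models the same idea applies to each trainable tensor's raw bytes. Comparing per-tensor snapshots in addition to the aggregate hash gives counts like the `changed_trainable_tensors` and `unchanged_trainable_tensors` values reported above.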
| --- | |
| ## 8.6 Strict PPO-only Runtime | |
| The final runtime is configured so that the UI must use PPO artifacts only. | |
| The strict PPO gate confirms: | |
| ```text | |
| PPO records: 900 | |
| PPO full actor usable: True | |
| PPO LoRA adapter usable: True | |
| Strict PPO-only UI mode: True | |
| ``` | |
| The runtime loading order is: | |
| ```text | |
| 1. PPO full inference actor if full weights exist | |
| 2. Otherwise base Qwen3-8B + PPO LoRA adapter | |
| 3. Non-PPO fallback is forbidden | |
| ``` | |
| This prevents the final demo from accidentally loading an old Baseline 2 model or a stale 150-step PPO proof artifact. | |
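The loading order can be enforced with a small resolver that raises instead of silently falling back. This is a sketch under assumptions: the directory names, the mode labels, and the "non-empty directory" test are illustrative, not the notebook's actual checks.

```python
import tempfile
from pathlib import Path


def resolve_runtime_model(full_actor_dir: Path, lora_adapter_dir: Path) -> dict:
    """Pick the PPO artifact to load, refusing any non-PPO fallback."""
    if full_actor_dir.is_dir() and any(full_actor_dir.iterdir()):
        return {"mode": "ppo_full_actor", "path": full_actor_dir}
    if lora_adapter_dir.is_dir() and any(lora_adapter_dir.iterdir()):
        return {"mode": "base_plus_ppo_lora", "path": lora_adapter_dir}
    raise FileNotFoundError("No PPO artifact found; non-PPO fallback is forbidden.")


# Demo: only the LoRA adapter directory exists, so the second option is chosen.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "ppo_lora_adapter"
    adapter_dir.mkdir()
    (adapter_dir / "adapter_config.json").write_text("{}")
    choice = resolve_runtime_model(Path(tmp) / "ppo_full_actor", adapter_dir)

print(choice["mode"])  # base_plus_ppo_lora
```

Raising an error when neither artifact exists makes a misconfigured demo fail loudly rather than quietly serving an older model.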
| --- | |
| ## 8.7 PPO Validation | |
| The PPO-only validation evaluation uses a held-out validation sample. | |
| The displayed validation summary is: | |
| ```text | |
| reward: 0.477789 | |
| gold_overlap: 0.255351 | |
| rejected_overlap: 0.155080 | |
| ``` | |
| Interpretation: | |
| - reward is positive | |
| - gold overlap is higher than rejected overlap | |
| - the PPO-trained actor tends to move closer to preferred answers than rejected answers | |
| This does not mean the PPO model is perfect. It means the reward-shaped behavior is directionally positive. | |
| --- | |
| ## 8.8 PPO Limitations | |
| The PPO run completed successfully, but the raw PPO generations still show some imperfections. | |
| Observed issues include: | |
| 1. **Process leakage** | |
| Some outputs still include phrases like: | |
| ```text | |
| Okay, let me try to figure out... | |
| Wait, I need to check again... | |
| ``` | |
| The reward function penalizes this, but it is not completely eliminated. | |
| 2. **Occasional hallucinated URLs** | |
| Some raw generations may still invent URLs. The harness fake URL guard is therefore still necessary. | |
| 3. **OCR-style text artifacts** | |
| Some source chunks contain spacing or OCR issues, and the model may reproduce them. | |
| 4. **KL can be high** | |
| Some PPO logs show high `objective/kl`, meaning the PPO actor can drift noticeably from the reference model. However, the run completed with: | |
| ```text | |
| degenerate_ratio = 0.00% | |
| ``` | |
| and no detected repetition collapse. | |
| 5. **RAG/Harness remains necessary** | |
| PPO improves model behavior, but it does not replace retrieval grounding or guardrails. | |
| --- | |
| # 9. TensorTalk UI | |
| The project includes a WhatsApp-style Jupyter HTML UI called **TensorTalk**. | |
| The UI supports: | |
| - chat-style interface | |
| - TensorCat avatar | |
| - RAG on/off control | |
| - web agent on/off control | |
| - collapsed trace panels | |
| - retrieved evidence display | |
| - web evidence display | |
| - planning/thinking display layer | |
| - harness decision trace | |
| - answer grounding information | |
| - strict PPO artifact loading | |
| - new chat reset behavior | |
| The UI is part of the engineering contribution because it makes the harness process visible rather than hidden. | |
| --- | |
| # 10. Smoke Tests | |
| ## 10.1 What Smoke Test Means Here | |
| A smoke test is a lightweight system sanity check. | |
| It is not a full evaluation. It is a quick check that the main pipeline still works. | |
| In this project, smoke tests check whether: | |
| ```text | |
| PPO model loads | |
| RAG retrieves evidence | |
| web agent searches official sources | |
| fake URL guard blocks synthetic links | |
| answer grounding returns a result | |
| trace structure is produced | |
| fallback behavior still works | |
| ``` | |
| --- | |
| ## 10.2 Example Smoke Tests | |
| The notebook defines smoke tests such as: | |
| ```text | |
| 1. PEKOM should not be routed to AI bachelor page | |
| 2. Residential college should prefer student-affairs residential page | |
| 3. CCNA Lab should not invent synthetic URLs | |
| ``` | |
| These are not random examples. They are chosen to test known fragile parts of the pipeline: | |
| - entity routing | |
| - official URL preference | |
| - fake URL rejection | |
| - web/RAG trace structure | |

---

# 11. Control Variable Design

The project uses a control-variable style comparison.

The base task remains the same:

```text
UM FSKTM Handbook QA
```

The base model family remains the same:

```text
Qwen3-8B
```

The dataset domain remains the same:

```text
Undergraduate + postgraduate + general UM Handbook knowledge
```

What changes is the system layer:

```text
Baseline 1: SFT only
Baseline 2: SFT + RAG + Harness Agent
Improved: SFT/RAG/Harness + PPO post-training
```

This allows the project to compare which improvements come from:

- parameter learning
- retrieval grounding
- metadata-aware scope control
- official web augmentation
- harness guardrails
- PPO reward shaping

This is more rigorous than simply building three unrelated systems.

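
One way to keep the held-fixed and varied factors explicit is to describe each stage with the same configuration type and then diff the stages. The sketch below does this with a dataclass. The class, field names, and boolean flags are hypothetical simplifications of the three stages described above, not the project's actual configuration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StageConfig:
    name: str
    base_model: str
    task: str
    use_rag: bool
    use_harness: bool
    use_ppo: bool


# Factors held constant across all stages.
SHARED = {"base_model": "Qwen3-8B", "task": "UM FSKTM Handbook QA"}

BASELINE_1 = StageConfig(
    name="Baseline 1", use_rag=False, use_harness=False, use_ppo=False, **SHARED
)
BASELINE_2 = StageConfig(
    name="Baseline 2", use_rag=True, use_harness=True, use_ppo=False, **SHARED
)
IMPROVED = StageConfig(
    name="Improved", use_rag=True, use_harness=True, use_ppo=True, **SHARED
)


def changed_factors(before, after):
    """List the configuration fields (other than the name) that differ."""
    return [
        f.name
        for f in fields(before)
        if f.name != "name" and getattr(before, f.name) != getattr(after, f.name)
    ]
```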

---

# 12. Stage-by-stage Comparison Table

| Dimension | Baseline 1: Closed-book SFT | Baseline 2: RAG + SFT + Agent/Harness | Improved Model: PPO + RAG + Agent/Harness |
|---|---|---|---|
| Main research question | Can the model memorize and reproduce handbook QA from SFT? | Does retrieval-grounded evidence improve handbook QA? | Can rule-reward PPO further align answer behavior while keeping RAG/Harness control? |
| Base model | Qwen3-8B | Qwen3-8B | Qwen3-8B |
| Main training method | Supervised fine-tuning | RAG-augmented supervised fine-tuning | Rule-reward PPO post-training |
| Dataset used | 1000 SFT QA rows | SFT QA + metadata + RAG KB + RAG eval | 1000 PPO preference rows |
| Train/validation/test | 800 / 100 / 100 | 8:1:1 RAG-augmented split | 900 train / 100 validation |
| Retrieval | No | Yes | Yes |
| Retrieval type | None | Dense embedding + FAISS + metadata-aware rerank | Same RAG runtime reused |
| Embedding model | None | BAAI/bge-base-en-v1.5 | RAG runtime inherited from Baseline 2 |
| Top-k evidence | None | Top-3 | Top-3 / runtime-dependent |
| Metadata awareness | Hidden metadata only, not used at inference | Yes, scope/source/section aware | Yes, used by RAG/Harness runtime |
| Scope control | Weak; model may confuse UG/PG if prompt is ambiguous | Stronger due to metadata-aware retrieval | Stronger due to RAG + PPO reward + harness |
| Web agent | No | Yes | Yes |
| Official domain control | No | Yes, UM/FSKTM official domain whitelist | Yes, same official-source guardrails |
| Fake URL guard | No | Yes | Yes |
| WAF handling | No | Yes | Yes |
| Evidence judge | No | Yes, Qwen evidence judge | Yes |
| Retry/fallback policy | No | Yes | Yes |
| Answer grounding judge | No | Yes | Yes |
| Completeness guard | No | Yes | Yes |
| UI trace | Basic chat UI | Harness trace panels | Strict PPO + Harness trace panels |
| LoRA rank | 16 | 8 | PPO actor based on LoRA actor/value setup |
| Training epochs | 8 SFT epochs | 20 SFT epochs | 2 PPO epochs |
| Main output artifact | LoRA adapter + merged model + `.pt` export | LoRA adapter, optional non-merged runtime | PPO full inference actor + PPO LoRA adapter + manifest |
| Artifact strictness | Standard save | Adapter/runtime path checks | Manifest, training log count, parameter hash proof, strict gate |
| Key metric | Test token F1 ≈ 0.8869 | RAG token F1 ≈ 0.846 on selected eval; retrieval Hit@3 ≈ 0.954 | PPO validation reward ≈ 0.4778; gold overlap > rejected overlap |
| Strongest contribution | Clean SFT baseline | Evidence-grounded QA and metadata-aware retrieval | Full PPO post-training with strict artifact verification and harnessed runtime |
| Main weakness | Closed-book hallucination risk | More complex runtime, depends on retriever quality | PPO raw outputs still need Harness/RAG due to possible process leakage and fake URLs |
| Control variable role | Establishes parameter-only baseline | Adds retrieval and harness while keeping same domain/model family | Adds PPO reward shaping while preserving RAG/Harness pipeline |

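
For readers unfamiliar with the two metric names in the table, the sketch below shows one common way to compute token-level F1 and Hit@k. It follows the widely used SQuAD-style bag-of-tokens definition with simple lowercase whitespace tokenization. The notebook's exact tokenization and normalization are not stated here, so treat this as an assumption rather than the project's evaluation code.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def hit_at_k(retrieved_ids, gold_id, k=3):
    """1.0 if the gold chunk appears among the top-k retrieved chunks, else 0.0."""
    return float(gold_id in retrieved_ids[:k])


f1 = token_f1("the course has 3 credits", "the course carries 3 credits")
hit = hit_at_k(["c7", "c2", "c9", "c1"], "c9", k=3)
miss = hit_at_k(["c7", "c2", "c9", "c1"], "c1", k=3)
```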

---

# 13. Technical Comparison of the Three Stages

## 13.1 Content-level Difference

| Content Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Stable handbook facts | Learned into model parameters | Retrieved from handbook KB | Retrieved and answered by PPO-aligned actor |
| Latest or official web info | Not supported | Supported through official web agent | Supported through same official web agent |
| UG vs PG distinction | Learned implicitly | Controlled by metadata retrieval | Controlled by metadata retrieval + reward/harness |
| Evidence visibility | Not shown | Evidence shown in RAG trace | Evidence shown in PPO/Harness trace |
| Hallucination control | Mostly prompt-based | Retrieval + grounding | Retrieval + grounding + reward penalties |
| Fake URL control | Not available | Harness URL guard | Harness URL guard + PPO penalty signal |

---

## 13.2 Engineering-level Difference

| Engineering Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Notebook purpose | Train and evaluate closed-book SFT model | Build RAG-augmented model and harnessed agent runtime | Train PPO actor and attach it to final harness runtime |
| Runtime complexity | Low | High | Highest |
| Debug trace | Basic | Detailed RAG/Web/Harness trace | Detailed PPO/RAG/Web/Harness trace |
| Failure handling | Minimal | Fallback and guardrail logic | Strict PPO-only fallback prevention plus harness fallback |
| Artifact verification | Basic output save | Adapter/merged path checks | Manifest, training log count, parameter hash proof, strict gate |
| Risk of stale artifact use | Moderate | Moderate | Actively guarded against |
| Demo readiness | Good for simple QA | Strong for grounded QA | Strongest for final controlled system demo |

---

# 14. Why the Improved Model Does Not Replace RAG

A key design decision is that PPO does not replace RAG.

PPO improves the model’s tendency to:

- answer directly
- avoid rejected-style answers
- avoid vague answers
- avoid process leakage
- avoid fake URLs
- avoid repetition collapse
- use evidence-like wording more appropriately

But PPO does not guarantee factual correctness by itself.

Therefore, the final system still needs:

```text
RAG for evidence
Web Agent for official/latest information
Harness for source control
Grounding judge for answer verification
Fallback for weak evidence
```

This is the correct division of responsibility:

```text
SFT: teaches domain answer style
RAG: supplies factual evidence
Agent: finds official external evidence
Harness: controls trust, routing, fallback, and trace
PPO: improves answer behavior according to reward preferences
```

---

# 15. Known Limitations

This project is a strong applied LLM system prototype, but it has limitations.

## 15.1 Not a full human-feedback RLHF system

The PPO stage uses a rule-based reward function. It does not train a separate neural reward model from human preference labels.

Correct description:

```text
Rule-reward PPO post-training
```

Not:

```text
Full RLHF with learned reward model
```
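
To show what a rule-based reward means in practice, the sketch below scores one answer with hand-written rules that mirror the behaviors listed in Section 14: overlap with the gold answer, distance from the rejected answer, and penalties for process leakage, vague phrasing, unverified URLs, and repetition. Every marker string, weight, and threshold here is an illustrative assumption, and the project's actual reward function may differ substantially.

```python
import re

# Hypothetical marker lists; the real reward rules are not reproduced here.
LEAK_MARKERS = ("<think>", "step 1:", "my reasoning")
VAGUE_PHRASES = ("it depends", "i am not sure")


def _overlap(candidate, target):
    """Fraction of the target's unique tokens that also appear in the candidate."""
    cand, tgt = set(candidate.lower().split()), set(target.lower().split())
    return len(cand & tgt) / max(len(tgt), 1)


def rule_reward(answer, gold, rejected, allowed_urls=()):
    """Score one answer with fixed rules; all weights are illustrative only."""
    text = answer.lower()
    reward = _overlap(answer, gold) - 0.5 * _overlap(answer, rejected)
    if any(marker in text for marker in LEAK_MARKERS):
        reward -= 0.5  # process leakage
    if any(phrase in text for phrase in VAGUE_PHRASES):
        reward -= 0.25  # vague answer
    urls = [u.rstrip(".,") for u in re.findall(r"https?://\S+", answer)]
    if any(u not in allowed_urls for u in urls):
        reward -= 0.5  # URL not backed by evidence
    tokens = text.split()
    if len(tokens) >= 8 and len(set(tokens)) / len(tokens) < 0.5:
        reward -= 0.5  # repetition collapse
    return reward


gold = "The programme requires 120 credit hours"
rejected = "It depends on the programme"
good_score = rule_reward(gold, gold, rejected)
bad_score = rule_reward(
    "It depends on the programme, see https://example.com/fake", gold, rejected
)
```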

---

## 15.2 Raw PPO generations can still be imperfect

Observed raw PPO generations may include:

- process leakage
- occasional hallucinated URLs
- OCR-like token spacing
- incomplete course titles
- noisy source-text reproduction

The final Harness runtime is therefore necessary.

---

## 15.3 Web search is constrained

The web agent is intentionally limited to official UM/FSKTM sources. It may refuse or fall back when official evidence is weak.

This is a feature, not a bug, because the system prioritizes trustworthiness over open-ended browsing.

---

## 15.4 RAG depends on knowledge base quality

If the RAG KB contains OCR noise or incomplete chunks, the model may inherit that noise. Future work should improve source cleaning and chunk normalization.

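
As a starting point for the chunk normalization mentioned above, the sketch below applies a few conservative cleanup rules to a KB chunk: Unicode normalization, rejoining words hyphenated across line breaks, collapsing letter-spaced tokens, and whitespace cleanup. The rules and the function name are assumptions for illustration. They are not the project's cleaning code and would need tuning against the real handbook text.

```python
import re
import unicodedata


def normalize_chunk(text):
    """Apply conservative cleanup to one KB chunk (illustrative rules only)."""
    # Fold compatibility characters such as ligatures and full-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break ("regis-\ntration").
    # Caveat: this also merges genuine hyphenated compounds split at a break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of three or more single letters ("C C N A" -> "CCNA").
    text = re.sub(
        r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b",
        lambda match: match.group(0).replace(" ", ""),
        text,
    )
    # Collapse remaining whitespace, including newlines, to single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_chunk("Course  regis-\ntration for C C N A   Lab")
```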

---

## 15.5 Notebook-based prototype

The project is implemented as notebooks. A production version should separate modules into:

```text
data/
retrieval/
agent/
harness/
training/
evaluation/
ui/
tests/
```

---

# 16. Recommended Usage

This project is intended for research, coursework, and demonstration purposes.

It is not an official Universiti Malaya system.

For official academic decisions, students should always refer to the official handbook, faculty office, or UM/FSKTM official websites.

---

# 17. Suggested Inference Flow

For the final demonstration, use the Improved Model runtime:

```text
1. Load PPO full inference actor if available.
2. If unavailable, load base Qwen3-8B + PPO LoRA adapter.
3. Initialize local handbook RAG.
4. Enable official UM/FSKTM web agent if the question may need external/latest information.
5. Run through TensorTalkHarnessCore.
6. Display answer with evidence trace.
```

Strict runtime requirement:

```text
Non-PPO fallback is forbidden in the final Improved Model demo.
```
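
The loading order in steps 1 and 2, together with the strict requirement above, can be sketched as a small selection function that prefers the full PPO actor, falls back to the base model plus PPO adapter, and otherwise raises instead of silently loading a non-PPO model. The function name, exception class, and directory arguments are hypothetical, and the actual model loading is deliberately left out.

```python
from pathlib import Path


class StrictPPOError(RuntimeError):
    """Raised when no PPO artifact is available (no silent non-PPO fallback)."""


def _has_files(directory):
    """True if the directory exists and contains at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def select_ppo_actor(full_actor_dir, adapter_dir):
    """Decide which PPO artifact to load, in the order given by the flow above."""
    full_actor_dir, adapter_dir = Path(full_actor_dir), Path(adapter_dir)
    if _has_files(full_actor_dir):
        return {"mode": "full_actor", "path": full_actor_dir}
    if _has_files(adapter_dir):
        # Base model name as written in this README.
        return {"mode": "base_plus_adapter", "base": "Qwen3-8B", "path": adapter_dir}
    raise StrictPPOError("No PPO artifact found; refusing non-PPO fallback.")
```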

---

# 18. Summary

TensorTalk demonstrates a staged LLM system development workflow:

```text
Baseline 1:
Qwen3-8B learns handbook QA through closed-book SFT.
Baseline 2:
The system adds RAG, dense retrieval, metadata-aware reranking, official web search, and Harness Engineering.
Improved Model:
The system adds full 1000-row rule-reward PPO post-training, strict artifact verification, and a PPO-only final harness runtime.
```

The most important contribution is not only that the model can answer handbook questions, but that the system is controlled, evidence-aware, source-constrained, traceable, and evaluated through a clear baseline progression.

The final system should be understood as:

> **A Qwen3-8B-based UM Handbook RAG Agent, improved with rule-reward PPO and controlled by Harness Engineering.**