Update README.md
README.md
CHANGED
|
@@ -1,250 +1,1393 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
| 4 |
-
# TensorTalk / UM_Handbook
|
| 5 |
|
| 6 |
-
TensorTalk
|
| 7 |
|
| 8 |
-
|
| 9 |
|
| 10 |
-
-
|
| 11 |
-
-
|
| 12 |
-
-
|
| 13 |
-
|
| 14 |
-
-
|
| 15 |
-
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
-
##
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
- handbook-faithful answers
|
| 27 |
-
- concise student-facing responses
|
| 28 |
-
- a local/demo deployment workflow on DICC and notebook environments
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
|
| 35 |
---
|
| 36 |
|
| 37 |
-
##
|
| 38 |
|
| 39 |
-
|
| 40 |
-
The project includes scripts and resources for preparing handbook data before fine-tuning:
|
| 41 |
|
| 42 |
-
|
| 43 |
-
- source chunk dataset building
|
| 44 |
-
- SFT QA dataset construction
|
| 45 |
-
- configuration management for the preprocessing and dataset pipeline
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
- device-aware loading logic
|
| 54 |
-
- train / validation / test style evaluation workflow
|
| 55 |
-
- merged-model export for direct inference
|
| 56 |
-
- LoRA adapter export for optional PEFT-based reuse
|
| 57 |
-
- metrics and prediction file generation
|
| 58 |
|
| 59 |
-
|
| 60 |
-
|
| 61 |
|
| 62 |
-
The
|
| 63 |
|
| 64 |
-
|
| 65 |
-
- a handbook-focused system prompt
|
| 66 |
-
- merged-model loading for direct inference
|
| 67 |
-
- a student-facing question-answer workflow
|
| 68 |
-
- a simple deployment path for demonstration purposes
|
| 69 |
|
| 70 |
---
|
| 71 |
|
| 72 |
-
##
|
| 73 |
|
| 74 |
```text
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
│ ├── SFT_QA_Training_Ready_pretty.json
|
| 80 |
-
│ ├── SFT_QA_Metadata.jsonl
|
| 81 |
-
│ └── SFT_QA_Metadata_pretty.json
|
| 82 |
-
├── assets/
|
| 83 |
-
├── outputs/
|
| 84 |
-
│ └── qwen3_um_handbook_optimized_1/
|
| 85 |
-
│ ├── lora_adapter/
|
| 86 |
-
│ ├── merged_model/
|
| 87 |
-
│ ├── trainer_runs/
|
| 88 |
-
│ ├── test_eval_runs/
|
| 89 |
-
│ ├── dataset_split_summary.json
|
| 90 |
-
│ ├── final_metrics.json
|
| 91 |
-
│ ├── test_predictions.jsonl
|
| 92 |
-
│ └── validation_predictions.jsonl
|
| 93 |
-
├── FineTune_QWEN3_UM_Handbook_optimized_1.ipynb
|
| 94 |
-
├── UM_Handbook_Markdown_Preprocess.py
|
| 95 |
-
├── UM_SFT_QA_Dataset_Builder_from_Index.py
|
| 96 |
-
├── UM_Source_Chunk_Dataset_Builder.py
|
| 97 |
-
└── um_handbook_config.py
|
| 98 |
```
|
| 99 |
|
| 100 |
---
|
| 101 |
|
| 102 |
-
##
|
| 103 |
|
| 104 |
-
|
| 105 |
-
- `Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl`
|
| 106 |
-
Main SFT training dataset used for handbook QA fine-tuning.
|
| 107 |
|
| 108 |
-
|
| 109 |
-
|
| 110 |
|
| 111 |
-
-
|
| 112 |
-
Builds source chunks for downstream dataset and retrieval-related use.
|
| 113 |
|
| 114 |
-
-
|
| 115 |
-
Builds the supervised QA dataset from curated handbook content.
|
| 116 |
|
| 117 |
-
|
| 118 |
-
Central configuration file for paths and data-processing settings.
|
| 119 |
|
| 120 |
-
|
| 121 |
-
- `outputs/qwen3_um_handbook_optimized_1/merged_model/`
|
| 122 |
-
Main inference-ready model directory.
|
| 123 |
-
This is the directory used by the demo chat UI.
|
| 124 |
|
| 125 |
-
|
| 126 |
-
LoRA adapter weights.
|
| 127 |
-
This is useful for PEFT-style loading with a base model, but it is not the primary path used by the current demo UI.
|
| 128 |
|
| 129 |
-
|
| 130 |
-
|
| 131 |
|
| 132 |
-
|
| 133 |
-
Validation-set generated answers for inspection.
|
| 134 |
|
| 135 |
-
- `
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
-
- `FineTune_QWEN3_UM_Handbook_optimized_1.ipynb`
|
| 140 |
-
Main notebook that contains the fine-tuning workflow and the TensorTalk HTML chat demo.
|
| 141 |
|
| 142 |
---
|
| 143 |
|
| 144 |
-
##
|
| 145 |
|
| 146 |
-
|
| 147 |
|
| 148 |
-
|
| 149 |
-
This is the most important deployment artifact for the current demo.
|
| 150 |
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
|
| 156 |
-
|
| 157 |
-
|
| 158 |
|
| 159 |
-
|
| 160 |
-
- loading the adapter on top of the original base model
|
| 161 |
-
- reusing the fine-tuning result in a PEFT workflow
|
| 162 |
-
- experimenting with a smaller transferable fine-tuning artifact
|
| 163 |
|
| 164 |
-
##
|
| 165 |
-
If present, the `.pt` file is mainly a saved full-model artifact / backup export.
|
| 166 |
|
| 167 |
-
|
| 168 |
-
- archiving the full fine-tuned weights
|
| 169 |
-
- running a custom loading workflow that explicitly expects a `.pt` file
|
| 170 |
|
| 171 |
-
|
| 172 |
|
| 173 |
---
|
| 174 |
|
| 175 |
-
##
|
| 176 |
|
| 177 |
-
The
|
| 178 |
|
| 179 |
-
|
| 180 |
-
- programme core courses / credit requirements
|
| 181 |
-
- undergraduate vs postgraduate handbook information
|
| 182 |
-
- academic rules and handbook-supported policy questions
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
|
| 191 |
---
|
| 192 |
|
| 193 |
-
##
|
| 194 |
|
| 195 |
-
|
| 196 |
|
| 197 |
-
|
| 198 |
|
| 199 |
---
|
| 200 |
|
| 201 |
-
#
|
| 202 |
|
| 203 |
-
|
| 204 |
|
| 205 |
-
|
| 206 |
|
| 207 |
---
|
| 208 |
|
| 209 |
-
##
|
| 210 |
|
| 211 |
-
|
| 212 |
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
| 218 |
|
| 219 |
-
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
|
| 223 |
|
| 224 |
---
|
| 225 |
|
| 226 |
-
##
|
| 227 |
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
|
| 231 |
|
| 232 |
---
|
| 233 |
|
| 234 |
-
##
|
| 235 |
|
| 236 |
-
|
| 237 |
|
| 238 |
-
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
-
|
| 244 |
|
| 245 |
---
|
| 246 |
|
| 247 |
-
##
|
| 248 |
|
| 249 |
-
**
|
| 250 |
-
UM Handbook QA / Fine-Tuned Qwen3-8B LoRA Project
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
- zh
|
| 6 |
+
base_model:
|
| 7 |
+
- Qwen/Qwen3-8B
|
| 8 |
+
pipeline_tag: text-generation
|
| 9 |
+
library_name: transformers
|
| 10 |
+
tags:
|
| 11 |
+
- qwen3
|
| 12 |
+
- qwen3-8b
|
| 13 |
+
- lora
|
| 14 |
+
- qlora
|
| 15 |
+
- sft
|
| 16 |
+
- rag
|
| 17 |
+
- faiss
|
| 18 |
+
- dense-retrieval
|
| 19 |
+
- agent
|
| 20 |
+
- ppo
|
| 21 |
+
- rlhf
|
| 22 |
+
- rule-reward
|
| 23 |
+
- harness-engineering
|
| 24 |
+
- um-handbook
|
| 25 |
+
- question-answering
|
| 26 |
+
- chatbot
|
| 27 |
+
- education
|
| 28 |
+
- tensor-talk
|
| 29 |
---
|
| 30 |
|
| 31 |
+
# TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering
|
| 32 |
|
| 33 |
+
TensorTalk is a staged LLM engineering project built for **Universiti Malaya Faculty of Computer Science and Information Technology handbook question answering**. The system is designed to answer undergraduate, postgraduate, and general faculty handbook questions using a controlled progression of three experimental stages:
|
| 34 |
|
| 35 |
+
1. **Baseline 1 — Closed-book SFT Qwen3-8B**
|
| 36 |
+
2. **Baseline 2 — SFT Qwen3-8B + Metadata-aware RAG + Official Web Agent + Harness Engineering**
|
| 37 |
+
3. **Improved Model — Rule-reward PPO post-training + RAG + Agent + Harness Engineering**
|
| 38 |
+
|
| 39 |
+
The project is more than a chatbot. It is a controlled comparison of how an LLM system improves when it moves from closed-book supervised fine-tuning, to retrieval-grounded answering, and finally to rule-reward post-training with a guarded agentic runtime.
|
| 40 |
+
|
| 41 |
+
The main idea is:
|
| 42 |
+
|
| 43 |
+
> Baseline 1 tests whether a fine-tuned model can answer handbook questions from parameters alone.
|
| 44 |
+
> Baseline 2 keeps the same base model family but adds retrieval and harnessed evidence control.
|
| 45 |
+
> The Improved Model keeps the RAG + Agent + Harness runtime and adds PPO post-training to align the model more closely with the desired answer behavior.
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
# 1. Project Goal
|
| 50 |
+
|
| 51 |
+
The goal of this project is to build a reliable and traceable UM Handbook assistant that can answer questions about:
|
| 52 |
+
|
| 53 |
+
- Faculty objectives, vision, mission, history, facilities, and academic calendar
|
| 54 |
+
- Undergraduate programme details
|
| 55 |
+
- Postgraduate programme details
|
| 56 |
+
- Candidature requirements
|
| 57 |
+
- Grading and academic rules
|
| 58 |
+
- Industrial training
|
| 59 |
+
- Academic project requirements
|
| 60 |
+
- Supervision policy
|
| 61 |
+
- Thesis/dissertation requirements
|
| 62 |
+
- Academic integrity and plagiarism
|
| 63 |
+
- Facilities and labs
|
| 64 |
+
- Official UM/FSKTM web information when handbook knowledge is insufficient or time-sensitive
|
| 65 |
+
|
| 66 |
+
The project also aims to demonstrate a complete LLM system development path:
|
| 67 |
+
|
| 68 |
+
```text
|
| 69 |
+
Closed-book SFT
|
| 70 |
+
→ RAG-augmented SFT
|
| 71 |
+
→ Metadata-aware retrieval
|
| 72 |
+
→ Official-source web agent
|
| 73 |
+
→ Harness Engineering guardrails
|
| 74 |
+
→ PPO rule-reward post-training
|
| 75 |
+
→ Strict artifact verification
|
| 76 |
+
→ Traceable TensorTalk UI
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
# 2. High-level System Overview
|
| 82 |
+
|
| 83 |
+
The final TensorTalk system contains several layers.
|
| 84 |
+
|
| 85 |
+
```text
|
| 86 |
+
User Question
|
| 87 |
+
↓
|
| 88 |
+
TensorTalk UI
|
| 89 |
+
↓
|
| 90 |
+
Planning / Thinking Display Layer
|
| 91 |
+
↓
|
| 92 |
+
Local Handbook RAG
|
| 93 |
+
↓
|
| 94 |
+
Official UM / FSKTM Web Agent
|
| 95 |
+
↓
|
| 96 |
+
Harness Engineering Guardrails
|
| 97 |
+
↓
|
| 98 |
+
Evidence Judge / Retry / Fallback
|
| 99 |
+
↓
|
| 100 |
+
PPO-trained Qwen3-8B Actor
|
| 101 |
+
↓
|
| 102 |
+
Answer Grounding Judge
|
| 103 |
+
↓
|
| 104 |
+
Completeness Guard
|
| 105 |
+
↓
|
| 106 |
+
Final Answer + Trace Panels
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
The final model is not used alone. It is wrapped inside a runtime harness that controls:
|
| 110 |
+
|
| 111 |
+
- where the system can search
|
| 112 |
+
- which sources it can trust
|
| 113 |
+
- whether web evidence is useful
|
| 114 |
+
- whether retrieved evidence supports the answer
|
| 115 |
+
- whether the model produced fake URLs
|
| 116 |
+
- whether the answer leaked internal reasoning
|
| 117 |
+
- whether fallback to local handbook RAG is needed
|
| 118 |
+
- whether the final answer is grounded enough to show
|
| 119 |
+
|
| 120 |
+
This is why the final stage is better described as:
|
| 121 |
+
|
| 122 |
+
> **A PPO-aligned RAG agent system with Harness Engineering**, rather than only a fine-tuned model.
|
| 123 |
+
|
| 124 |
+
---
|
| 125 |
+
|
| 126 |
+
# 3. Dataset Design
|
| 127 |
+
|
| 128 |
+
## 3.1 Source Domain
|
| 129 |
+
|
| 130 |
+
The dataset is built around UM FSKTM undergraduate and postgraduate handbook content. The data is organized into:
|
| 131 |
+
|
| 132 |
+
- SFT question-answer dataset
|
| 133 |
+
- hidden metadata
|
| 134 |
+
- RAG knowledge base
|
| 135 |
+
- RAG evaluation dataset
|
| 136 |
+
- PPO preference dataset
|
| 137 |
+
|
| 138 |
+
The project separates **model-visible training text** from **metadata used for retrieval, evaluation, and analysis**.
|
| 139 |
+
|
| 140 |
+
This distinction is important:
|
| 141 |
+
|
| 142 |
+
- Baseline 1 intentionally trains on question-answer text without forcing explicit metadata labels into the model-visible answer.
|
| 143 |
+
- Baseline 2 uses metadata-aware retrieval to reduce scope confusion.
|
| 144 |
+
- Stage 3 PPO uses preference pairs and reward functions to shape answer behavior.
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## 3.2 Baseline 1 SFT Dataset
|
| 149 |
+
|
| 150 |
+
Baseline 1 uses:
|
| 151 |
+
|
| 152 |
+
```text
|
| 153 |
+
SFT_QA_Training_Ready.jsonl
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
The notebook validates:
|
| 157 |
+
|
| 158 |
+
```text
|
| 159 |
+
Total examples: 1000
|
| 160 |
+
Train examples: 800
|
| 161 |
+
Validation examples: 100
|
| 162 |
+
Test examples: 100
|
| 163 |
+
Split ratio: 8:1:1
|
| 164 |
+
Duplicate question groups: 0
|
| 165 |
+
Duplicate question rows: 0
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
Each example follows a supervised chat-style format:
|
| 169 |
+
|
| 170 |
+
```json
|
| 171 |
+
{
|
| 172 |
+
"prompt": [
|
| 173 |
+
{
|
| 174 |
+
"role": "system",
|
| 175 |
+
"content": "You are an academic assistant for the Faculty of Computer Science and Information Technology, Universiti Malaya..."
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"role": "user",
|
| 179 |
+
"content": "What are the faculty objectives?"
|
| 180 |
+
}
|
| 181 |
+
],
|
| 182 |
+
"completion": [
|
| 183 |
+
{
|
| 184 |
+
"role": "assistant",
|
| 185 |
+
"content": "The faculty objectives are..."
|
| 186 |
+
}
|
| 187 |
+
],
|
| 188 |
+
"question": "...",
|
| 189 |
+
"answer": "..."
|
| 190 |
+
}
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
This stage teaches the model to imitate handbook-style answers directly.
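
The record shape and the dataset checks reported above (8:1:1 split, no duplicate questions) can be sketched as follows. This is a minimal illustration, not the notebook's actual validation code: the function names, the strict `system`/`user`/`assistant` role check, and the case-insensitive duplicate test are assumptions made for the example.

```python
import json
from collections import Counter

# Keys taken from the example record above.
REQUIRED_KEYS = {"prompt", "completion", "question", "answer"}


def validate_records(lines):
    """Parse JSONL lines and check each record has the chat-style shape."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        prompt_roles = [msg["role"] for msg in record["prompt"]]
        completion_roles = [msg["role"] for msg in record["completion"]]
        # Assumption: every record uses exactly system + user, then assistant.
        if prompt_roles != ["system", "user"] or completion_roles != ["assistant"]:
            raise ValueError(f"line {line_no}: unexpected roles")
        records.append(record)
    return records


def duplicate_questions(records):
    """Return questions that appear more than once (case-insensitive)."""
    counts = Counter(r["question"].strip().lower() for r in records)
    return {question: n for question, n in counts.items() if n > 1}


def split_sizes(total, ratios=(8, 1, 1)):
    """Turn a ratio such as 8:1:1 into train / validation / test counts."""
    unit = total // sum(ratios)
    train, validation = unit * ratios[0], unit * ratios[1]
    return train, validation, total - train - validation
```

With 1000 examples, `split_sizes(1000)` gives the 800 / 100 / 100 split reported by the notebook.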
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
## 3.3 Baseline 2 RAG Dataset
|
| 198 |
+
|
| 199 |
+
Baseline 2 follows the same SFT dataset design but adds external retrieval resources:
|
| 200 |
+
|
| 201 |
+
```text
|
| 202 |
+
UM_RAG_Knowledge_Base.jsonl
|
| 203 |
+
UM_RAG_Evaluation_Dataset.jsonl
|
| 204 |
+
SFT_QA_Metadata.jsonl
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
The RAG knowledge base contains structured fields such as:
|
| 208 |
+
|
| 209 |
+
```text
|
| 210 |
+
kb_id
|
| 211 |
+
source_doc
|
| 212 |
+
scope_label
|
| 213 |
+
section
|
| 214 |
+
pages
|
| 215 |
+
source_text
|
| 216 |
+
retrieval_text
|
| 217 |
+
retrieval_keywords
|
| 218 |
+
grounded_answer_bank
|
| 219 |
+
matched_qa_ids
|
| 220 |
+
```
|
| 221 |
+
|
| 222 |
+
The RAG knowledge base loaded in the final Stage 3 runtime contains:
|
| 223 |
+
|
| 224 |
+
```text
|
| 225 |
+
Loaded KB rows: 521
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
The metadata layer allows the system to distinguish:
|
| 229 |
+
|
| 230 |
+
```text
|
| 231 |
+
general
|
| 232 |
+
undergraduate
|
| 233 |
+
postgraduate
|
| 234 |
+
```
|
| 235 |
+
|
| 236 |
+
This is important because many handbook questions look similar but require different answers depending on the student scope.
|
| 237 |
+
|
| 238 |
+
---
|
| 239 |
+
|
| 240 |
+
## 3.4 PPO Preference Dataset
|
| 241 |
+
|
| 242 |
+
The Improved Model uses:
|
| 243 |
+
|
| 244 |
+
```text
|
| 245 |
+
UM_Handbook_PPO_Preference_Dataset.jsonl
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
The final PPO run uses the full dataset:
|
| 249 |
+
|
| 250 |
+
```text
|
| 251 |
+
Total PPO preference rows: 1000
|
| 252 |
+
Train rows: 900
|
| 253 |
+
Validation rows: 100
|
| 254 |
+
Train fraction: 0.90
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
The PPO dataset is not used like normal SFT data. In SFT, the model directly imitates a reference answer. In PPO, the model generates its own answer, receives a reward, and updates toward higher-reward behavior.
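
To make the idea of a rule-based reward concrete, here is a toy scoring function. It is only a sketch: this section does not list the project's actual reward rules, so the three rules below (reference overlap, a penalty for URLs outside an allowlist, a penalty for leaked `<think>` tags), their weights, and the `ALLOWED_DOMAINS` value are all hypothetical, loosely modelled on the harness checks described in Section 2.

```python
import re

# Hypothetical allowlist; the real list of trusted domains is not given here.
ALLOWED_DOMAINS = ("um.edu.my",)

URL_PATTERN = re.compile(r"https?://([^/\s]+)")


def rule_reward(answer, reference):
    """Score a generated answer with simple hand-written rules."""
    reward = 0.0

    # Rule 1 (assumed): reward word overlap with the reference answer.
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    if reference_tokens:
        reward += len(answer_tokens & reference_tokens) / len(reference_tokens)

    # Rule 2 (assumed): penalize every URL whose host is not allowlisted.
    for host in URL_PATTERN.findall(answer):
        if not host.lower().endswith(ALLOWED_DOMAINS):
            reward -= 1.0

    # Rule 3 (assumed): penalize leaked reasoning markers.
    if "<think>" in answer.lower():
        reward -= 0.5

    return reward
```

During PPO, a scalar like this would be computed for each sampled answer and used as the reward signal, in contrast to SFT, where the reference answer is the training target itself.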
|
| 258 |
+
|
| 259 |
+
---
|
| 260 |
+
|
| 261 |
+
# 4. Baseline 1 — Closed-book SFT Qwen3-8B
|
| 262 |
+
|
| 263 |
+
## 4.1 Purpose
|
| 264 |
+
|
| 265 |
+
Baseline 1 asks a simple question:
|
| 266 |
+
|
| 267 |
+
> Can Qwen3-8B learn UM Handbook question answering from supervised fine-tuning alone?
|
| 268 |
+
|
| 269 |
+
This is a **closed-book baseline**. The model does not retrieve handbook evidence during inference. It must answer from what it learned during SFT.
|
| 270 |
+
|
| 271 |
+
This is useful as a control baseline because it shows what happens when the model relies mainly on parameter memory.
|
| 272 |
+
|
| 273 |
+
---
|
| 274 |
+
|
| 275 |
+
## 4.2 Model
|
| 276 |
+
|
| 277 |
+
Baseline 1 uses:
|
| 278 |
+
|
| 279 |
+
```text
|
| 280 |
+
Base model: Qwen/Qwen3-8B
|
| 281 |
+
Local path: /scr/user/kevin2002/TensorCat/NLP/UM_Handbook/models/Qwen3-8B
|
| 282 |
+
```
|
| 283 |
+
|
| 284 |
+
The notebook detected:
|
| 285 |
+
|
| 286 |
+
```text
|
| 287 |
+
Backend: CUDA
|
| 288 |
+
GPU: NVIDIA A100-SXM4-80GB
|
| 289 |
+
dtype: bfloat16
|
| 290 |
+
4-bit QLoRA: enabled
|
| 291 |
+
```
|
| 292 |
+
|
| 293 |
+
---
|
| 294 |
+
|
| 295 |
+
## 4.3 Training Method
|
| 296 |
+
|
| 297 |
+
Baseline 1 uses LoRA / QLoRA supervised fine-tuning.
|
| 298 |
+
|
| 299 |
+
LoRA configuration:
|
| 300 |
+
|
| 301 |
+
```text
|
| 302 |
+
LoRA rank: 16
|
| 303 |
+
LoRA alpha: 32
|
| 304 |
+
LoRA dropout: 0.10
|
| 305 |
+
Target modules:
|
| 306 |
+
- q_proj
|
| 307 |
+
- k_proj
|
| 308 |
+
- v_proj
|
| 309 |
+
- o_proj
|
| 310 |
+
- gate_proj
|
| 311 |
+
- up_proj
|
| 312 |
+
- down_proj
|
| 313 |
+
```
|
| 314 |
+
|
| 315 |
+
Training configuration:
|
| 316 |
+
|
| 317 |
+
```text
|
| 318 |
+
Epochs: 8
|
| 319 |
+
Train split: 800
|
| 320 |
+
Validation split: 100
|
| 321 |
+
Test split: 100
|
| 322 |
+
Per-device train batch size: 2
|
| 323 |
+
Per-device eval batch size: 2
|
| 324 |
+
Gradient accumulation steps: 8
|
| 325 |
+
Learning rate: 1e-4
|
| 326 |
+
Packing: False
|
| 327 |
+
```
|
| 328 |
+
|
| 329 |
+
---
|
| 330 |
+
|
| 331 |
+
## 4.4 Baseline 1 Results
|
| 332 |
+
|
| 333 |
+
The training completed successfully.
|
| 334 |
+
|
| 335 |
+
Training summary:
|
| 336 |
+
|
| 337 |
+
```text
|
| 338 |
+
Training steps: 400
|
| 339 |
+
Training runtime: ~18.68 minutes for the main train stage
|
| 340 |
+
Train loss: 0.4824
|
| 341 |
+
Final validation loss: ~0.146
|
| 342 |
+
Test loss: ~0.197
|
| 343 |
+
Perplexity: ~1.157
|
| 344 |
+
```
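
The step count and perplexity above can be cross-checked from the training configuration. The sketch below assumes a single GPU and that perplexity is the exponential of the mean validation loss; the README does not state either point explicitly, but the reported numbers are consistent with both.

```python
import math


def optimizer_steps(train_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Number of optimizer updates for a run with gradient accumulation."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    return steps_per_epoch * epochs


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)


# 800 examples / (2 * 8) = 50 updates per epoch, times 8 epochs.
steps = optimizer_steps(train_examples=800, per_device_batch=2, grad_accum=8, epochs=8)

# The reported validation loss of about 0.146 maps to a perplexity near 1.157.
ppl = perplexity(0.146)
```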
|
| 345 |
+
|
| 346 |
+
Generation metrics:
|
| 347 |
+
|
| 348 |
+
### Validation
|
| 349 |
+
|
| 350 |
+
```text
|
| 351 |
+
Exact match: 0.77
|
| 352 |
+
Token F1: 0.9111
|
| 353 |
+
ROUGE-1: 0.9122
|
| 354 |
+
ROUGE-2: 0.8700
|
| 355 |
+
ROUGE-L: 0.8979
|
| 356 |
+
SacreBLEU: 81.7240
|
| 357 |
+
chrF++: 86.8916
|
| 358 |
+
Average prediction words: 36.35
|
| 359 |
+
Average reference words: 38.57
|
| 360 |
+
```
|
| 361 |
+
|
| 362 |
+
### Test
|
| 363 |
+
|
| 364 |
+
```text
|
| 365 |
+
Exact match: 0.72
|
| 366 |
+
Token F1: 0.8869
|
| 367 |
+
ROUGE-1: 0.8857
|
| 368 |
+
ROUGE-2: 0.8352
|
| 369 |
+
ROUGE-L: 0.8677
|
| 370 |
+
SacreBLEU: 81.1138
|
| 371 |
+
chrF++: 87.7054
|
| 372 |
+
Average prediction words: 38.03
|
| 373 |
+
Average reference words: 37.03
|
| 374 |
+
```
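
For readers unfamiliar with the first two metrics, exact match and token F1 can be computed roughly as below. This is a sketch in the common SQuAD style; the normalization used by the notebook (here simply lowercasing and splitting on whitespace) is an assumption.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 gives partial credit when an answer is mostly right, which is why it sits well above exact match in both tables.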
|
| 375 |
+
|
| 376 |
+
---
|
| 377 |
+
|
| 378 |
+
## 4.5 Baseline 1 Strengths
|
| 379 |
+
|
| 380 |
+
Baseline 1 is strong when the question is close to the training distribution. It can reproduce handbook-style answers well and shows high text overlap with the reference answers.
|
| 381 |
+
|
| 382 |
+
It is useful because:
|
| 383 |
+
|
| 384 |
+
- it establishes the basic Qwen3-8B SFT capability
|
| 385 |
+
- it verifies that the dataset format is learnable
|
| 386 |
+
- it creates a clean closed-book control model
|
| 387 |
+
- it provides a baseline for later RAG and PPO improvements
|
| 388 |
+
|
| 389 |
+
---
|
| 390 |
+
|
| 391 |
+
## 4.6 Baseline 1 Limitations
|
| 392 |
+
|
| 393 |
+
Baseline 1 is still limited because it is a closed-book model.
|
| 394 |
+
|
| 395 |
+
Main limitations:
|
| 396 |
+
|
| 397 |
+
1. **No retrieval evidence**
|
| 398 |
+
It cannot check the handbook at inference time.
|
| 399 |
+
|
| 400 |
+
2. **Potential hallucination**
|
| 401 |
+
If the question is out-of-distribution or requires exact source grounding, the model may answer from memory and state details that the handbook does not support.
|
| 402 |
+
|
| 403 |
+
3. **Scope confusion**
|
| 404 |
+
Undergraduate and postgraduate rules may be mixed if the question is ambiguous.
|
| 405 |
+
|
| 406 |
+
4. **No official web update mechanism**
|
| 407 |
+
It cannot answer dynamic or latest-information questions reliably.
|
| 408 |
+
|
| 409 |
+
5. **No harness guardrails**
|
| 410 |
+
It does not include fake URL detection, evidence judging, WAF handling, or fallback control.
|
| 411 |
+
|
| 412 |
+
Baseline 1 is therefore a necessary but incomplete starting point.
|
| 413 |
+
|
| 414 |
+
---
|
| 415 |
+
|
| 416 |
+
# 5. Baseline 2 — RAG + SFT + Metadata-aware Retrieval + Harness Agent
|
| 417 |
+
|
| 418 |
+
## 5.1 Purpose
|
| 419 |
+
|
| 420 |
+
Baseline 2 asks:
|
| 421 |
+
|
| 422 |
+
> What improves if we keep the same Qwen3-8B family but add retrieval-grounded evidence?
|
| 423 |
+
|
| 424 |
+
The goal is to reduce hallucination and scope confusion by giving the model relevant handbook evidence at inference time.
|
| 425 |
+
|
| 426 |
+
This stage introduces RAG and agentic harness logic while keeping the same broad model family and handbook task.
|
| 427 |
|
| 428 |
---
|
| 429 |
|
| 430 |
+
## 5.2 What RAG Means in This Project
|
| 431 |
|
| 432 |
+
RAG stands for **Retrieval-Augmented Generation**.
|
| 433 |
|
| 434 |
+
In simple terms:
|
| 435 |
+
|
| 436 |
+
```text
|
| 437 |
+
Instead of asking the model to answer only from memory,
|
| 438 |
+
the system first retrieves relevant handbook chunks,
|
| 439 |
+
then asks the model to answer using those chunks.
|
| 440 |
+
```
|
| 441 |
|
| 442 |
+
In this project, RAG is not just keyword search. It uses:
|
| 443 |
|
| 444 |
+
```text
|
| 445 |
+
Transformer embedding model
|
| 446 |
+
+ FAISS vector search
|
| 447 |
+
+ metadata-aware reranking
|
| 448 |
+
+ scope labels
|
| 449 |
+
+ top-k evidence blocks
|
| 450 |
+
```
|
| 451 |
|
| 452 |
+
The Baseline 2 retriever uses:
|
| 453 |
+
|
| 454 |
+
```text
|
| 455 |
+
Embedding model: BAAI/bge-base-en-v1.5
|
| 456 |
+
Vector index: FAISS
|
| 457 |
+
Similarity: inner product after embedding normalization
|
| 458 |
+
Top-k retrieval: 3
|
| 459 |
+
```
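
"Inner product after embedding normalization" is equivalent to cosine similarity. The pure-Python sketch below shows the idea on toy vectors; in the real pipeline the vectors come from the embedding model and the search is done by FAISS (a flat inner-product index such as `faiss.IndexFlatIP` is one option, though the README does not say which index type is used). The function names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, corpus, k=3):
    """Return the k best (score, index) pairs by normalized inner product."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, v))  # cosine similarity
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because both sides are normalized first, a long vector such as `[2, 0]` scores the same against `[1, 0]` as a unit vector pointing the same way; only direction matters.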
|
| 460 |
|
| 461 |
---
|
| 462 |
|
| 463 |
+
## 5.3 Metadata-aware Retrieval
|
| 464 |
|
| 465 |
+
The RAG system uses metadata to control retrieval quality.
|
| 466 |
|
| 467 |
+
Important metadata fields include:
|
| 468 |
|
| 469 |
+
```text
|
| 470 |
+
source_doc
|
| 471 |
+
scope_label
|
| 472 |
+
section
|
| 473 |
+
pages
|
| 474 |
+
kb_id
|
| 475 |
+
knowledge group
|
| 476 |
+
retrieval keywords
|
| 477 |
+
grounded answer bank
|
| 478 |
+
```
|
| 479 |
|
| 480 |
+
This allows the retriever to prefer the correct audience scope.
|
| 481 |
|
| 482 |
+
Example:
|
| 483 |
|
| 484 |
+
```text
|
| 485 |
+
Question: What are the candidature requirements for Master of Software Engineering?
|
| 486 |
+
Expected scope: postgraduate
|
| 487 |
+
```

The system should retrieve postgraduate chunks, not undergraduate chunks.

This is one of the main improvements over Baseline 1.

---

## 5.4 RAG-augmented Training Dataset

Baseline 2 creates a RAG-augmented dataset where training examples include evidence context.

The training prompt can contain:

```text
User question
+ retrieved handbook evidence
+ source metadata
+ answer instruction
```
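
The four prompt parts listed above could be assembled roughly as follows. This is a minimal sketch: the `build_rag_prompt` helper, the exact layout, and the instruction wording are assumptions, and the angle-bracket strings are placeholders rather than real handbook content.

```python
def build_rag_prompt(question, evidence_chunks, instruction):
    """Join the question, numbered evidence with metadata, and an instruction."""
    lines = [f"User question: {question}", "", "Retrieved handbook evidence:"]
    for i, chunk in enumerate(evidence_chunks, start=1):
        meta = (
            f"source_doc={chunk['source_doc']}, "
            f"scope_label={chunk['scope_label']}, "
            f"section={chunk['section']}"
        )
        lines.append(f"[{i}] {chunk['text']} ({meta})")
    lines += ["", f"Instruction: {instruction}"]
    return "\n".join(lines)

prompt = build_rag_prompt(
    "What are the candidature requirements for Master of Software Engineering?",
    [{
        "text": "<retrieved chunk text>",      # placeholder, not real content
        "source_doc": "<handbook file>",       # placeholder
        "scope_label": "postgraduate",
        "section": "<section title>",          # placeholder
    }],
    "Answer using only the evidence above.",
)
```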

This teaches the model to answer with evidence-aware context rather than only memorized answers.

---

## 5.5 Baseline 2 Training Configuration

Baseline 2 uses Qwen3-8B with LoRA fine-tuning.

Configuration:

```text
Base model: Qwen/Qwen3-8B
Embedding model: BAAI/bge-base-en-v1.5
LoRA rank: 8
LoRA alpha: 16
LoRA dropout: 0.05
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
Epochs: 20
Per-device train batch size: 4
Per-device eval batch size: 8
Target global batch size: 8
Learning rate: 8e-5
Max sequence length: 1024
Validation ratio: 0.10
Test ratio: 0.10
Save merged model: False
Runtime model path: base model + LoRA adapter
```

The notebook falls back to the non-merged runtime path (base model + LoRA adapter) when merged-model export is unavailable or too memory-expensive.

---

## 5.6 Baseline 2 Retrieval Evaluation

Baseline 2 includes a retrieval evaluation set.

Retrieval metrics:

```json
{
  "retrieval_eval_size": 1000,
  "top_k": 3,
  "hit_at_1_primary": 0.821,
  "hit_at_k_primary": 0.954,
  "hit_at_k_same_group": 0.991,
  "scope_match_at_1": 0.996,
  "retriever_type": "dense_embedding + faiss + metadata_rerank",
  "embedding_model_name": "BAAI/bge-base-en-v1.5"
}
```
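
Hit-rate fields of this kind are typically plain fractions over the evaluation set. The sketch below shows one plausible way to compute them. The record layout (`gold_kb_id`, `gold_group`, `gold_scope`, `retrieved`) and the two toy records are assumptions for illustration, not the notebook's actual evaluation code or data.

```python
def retrieval_metrics(examples, k=3):
    """Compute the four hit-rate fields over a list of evaluation records."""
    n = len(examples)

    def rate(predicate):
        return sum(1 for ex in examples if predicate(ex)) / n

    return {
        "hit_at_1_primary": rate(
            lambda ex: ex["retrieved"][0]["kb_id"] == ex["gold_kb_id"]),
        "hit_at_k_primary": rate(
            lambda ex: any(r["kb_id"] == ex["gold_kb_id"] for r in ex["retrieved"][:k])),
        "hit_at_k_same_group": rate(
            lambda ex: any(r["group"] == ex["gold_group"] for r in ex["retrieved"][:k])),
        "scope_match_at_1": rate(
            lambda ex: ex["retrieved"][0]["scope"] == ex["gold_scope"]),
    }

# Two toy records (hypothetical IDs): the second misses at rank 1 but hits at rank 2.
toy_eval = [
    {"gold_kb_id": "kb1", "gold_group": "g1", "gold_scope": "postgraduate",
     "retrieved": [{"kb_id": "kb1", "group": "g1", "scope": "postgraduate"},
                   {"kb_id": "kb7", "group": "g3", "scope": "general"}]},
    {"gold_kb_id": "kb2", "gold_group": "g2", "gold_scope": "undergraduate",
     "retrieved": [{"kb_id": "kb9", "group": "g2", "scope": "undergraduate"},
                   {"kb_id": "kb2", "group": "g2", "scope": "undergraduate"}]},
]
metrics = retrieval_metrics(toy_eval, k=3)
```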

Interpretation:

- `hit_at_1_primary = 0.821` means the top retrieved chunk is exactly the expected primary evidence in 82.1% of cases.
- `hit_at_k_primary = 0.954` means the expected primary evidence appears within the top 3 in 95.4% of cases.
- `hit_at_k_same_group = 0.991` means an acceptable chunk from the same knowledge group appears within the top 3 in 99.1% of cases.
- `scope_match_at_1 = 0.996` means the top result almost always matches the correct undergraduate/postgraduate/general scope.

These numbers indicate that retrieval is far from random and that the metadata-aware retriever is a strong baseline.

---

## 5.7 Baseline 2 Generation Evaluation

Generation evaluation was run on a smaller selected set for runtime practicality.

Results:

```json
{
  "generation_eval_size": 20,
  "top_k": 3,
  "plain_exact_match": 0.0,
  "plain_token_f1": 0.3391,
  "rag_exact_match": 0.0,
  "rag_token_f1": 0.8460,
  "rag_minus_plain_exact_match": 0.0,
  "rag_minus_plain_token_f1": 0.5069
}
```

On this 20-example set, RAG gives a large improvement in token F1:

```text
Plain token F1: 0.3391
RAG token F1: 0.8460
Improvement: +0.5069
```
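
For readers unfamiliar with the metric, token F1 is commonly computed as the harmonic mean of token-level precision and recall between a prediction and a reference. The sketch below assumes simple lowercase whitespace tokenization, which may differ from the notebook's own normalization, and uses neutral toy strings rather than real handbook answers.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 with lowercase whitespace tokenization (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Three of four tokens shared in each direction: precision = recall = 0.75.
toy_f1 = token_f1("a b c d", "a b c e")

# The reported improvement is the difference of the two averaged scores.
improvement = round(0.8460 - 0.3391, 4)
```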

This is one of the strongest pieces of evidence in the project.

It shows that retrieval grounding substantially improves answer quality compared with plain generation. Exact match stays at 0.0 in both settings, so the gain is in token overlap rather than verbatim agreement, and it is measured on only 20 examples.

---

# 6. Agent Layer in Baseline 2 and Improved Model

## 6.1 Why an Agent Is Needed

The handbook is reliable for stable academic rules, but some questions may require official web information.

Examples:

```text
Who is the current dean?
Where can students find residential college information?
What official page mentions PEKOM?
Where is the official SPeCTRUM page?
```

For these cases, the system needs a controlled web agent.

However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent.

---

## 6.2 Official UM / FSKTM Web Agent

The web agent is constrained to official UM / FSKTM domains.

Priority domains:

```text
fsktm.um.edu.my
www.um.edu.my
```

Auxiliary official domains include UM-related systems such as:

```text
aasd.um.edu.my
maya.um.edu.my
umlib.um.edu.my
umresearch.um.edu.my
jobs.um.edu.my
careerportal.fsktm.um.edu.my
intra.fsktm.um.edu.my
gallery.fsktm.um.edu.my
```
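
A domain restriction like this can be enforced with an exact-host allow-list check before any page is fetched. The sketch below is an assumption about how such a guard might look, not the notebook's implementation. It deliberately compares the full hostname rather than a suffix, so look-alike hosts such as `fsktm.um.edu.my.evil.example` are rejected.

```python
from urllib.parse import urlparse

PRIORITY_DOMAINS = {"fsktm.um.edu.my", "www.um.edu.my"}
AUXILIARY_DOMAINS = {
    "aasd.um.edu.my", "maya.um.edu.my", "umlib.um.edu.my",
    "umresearch.um.edu.my", "jobs.um.edu.my",
    "careerportal.fsktm.um.edu.my", "intra.fsktm.um.edu.my",
    "gallery.fsktm.um.edu.my",
}
ALLOWED_HOSTS = PRIORITY_DOMAINS | AUXILIARY_DOMAINS

def is_official_url(url):
    """Accept only http(s) URLs whose exact hostname is on the allow-list."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in {"http", "https"} and host in ALLOWED_HOSTS

checks = {
    "official": is_official_url("https://fsktm.um.edu.my/"),
    "lookalike": is_official_url("https://fsktm.um.edu.my.evil.example/x"),
    "unrelated": is_official_url("https://example.com/"),
    "wrong_scheme": is_official_url("ftp://www.um.edu.my/file"),
}
```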

The agent performs:

```text
query planning
official web discovery
URL filtering
page fetching
evidence extraction
evidence scoring
Qwen-based evidence judging
retry if weak
fallback to handbook RAG if needed
```

---

## 6.3 Agent Is Not Fully Autonomous by Design

This project does not use a completely unrestricted autonomous agent.

That is intentional.

For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be:

```text
safe
source-constrained
traceable
fallback-aware
grounded
```

So the agent is better described as:

> A constrained official-source web agent controlled by Harness Engineering.

---

# 7. Harness Engineering

## 7.1 What Harness Engineering Means Here

Harness Engineering is the guardrail system around the model and agent.

A simple analogy:

```text
The LLM/agent is the car.
Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard.
```

The model can generate fluent answers, but the harness controls:

- what it is allowed to search
- what sources it can trust
- whether a URL is fake
- whether evidence is useful
- whether the answer is grounded
- whether the system should retry
- whether it should fall back to local handbook RAG
- what trace should be shown to the user

---

## 7.2 Harness Pipeline

The standardized TensorTalk Harness Core follows this structure:

```text
User Question
↓
Local Handbook RAG
↓
Official Web Discovery
↓
Domain Guard
↓
Fake URL Guard
↓
WAF Detection
↓
Evidence Normalizer
↓
Qwen Evidence Judge
↓
Entity-aware Retry
↓
Weak Evidence Fallback
↓
Answer Generator
↓
Answer Grounding Judge
↓
Completeness Guard
↓
UI Trace
```

---

## 7.3 Harness Components

The notebooks include several engineering patches and layers:

### V14: WAF-aware Harness

Handles web pages blocked by WAF or browser failures.

Functions:

- detect WAF block pages
- exclude blocked pages from evidence
- provide diagnostics
- use safe static fallback if browser click fails
- reject query-fabricated URLs before evidence building

---

### V15: Qwen Evidence Judge Loop

Adds an LLM-based evidence judge.

Flow:

```text
Planner
→ Search / Fetch
→ Evidence Filter
→ Qwen Judge
→ Retry
→ Final Evidence
```
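
The flow above amounts to a bounded retry loop: try a planned query, let a judge score the evidence, and stop at the first result that clears a threshold. The sketch below is a simplified assumption of that control flow. `gather_evidence`, `search_fn`, `judge_fn`, and the `min_score` threshold are hypothetical names, and the stubs stand in for the real search step and the Qwen judge.

```python
def gather_evidence(queries, search_fn, judge_fn, min_score=0.5):
    """Try each planned query in order; return the first evidence the judge accepts."""
    trace = []
    for attempt, query in enumerate(queries, start=1):
        evidence = search_fn(query)
        score = judge_fn(query, evidence) if evidence else 0.0
        trace.append({"attempt": attempt, "query": query, "score": score})
        if score >= min_score:
            return evidence, trace
    # Nothing passed the judge: the caller would fall back to local handbook RAG.
    return None, trace

# Stub search and judge for demonstration only.
fake_pages = {"q1": "weak snippet", "q2": "strong official page text"}
search_stub = fake_pages.get
judge_stub = lambda query, evidence: 0.9 if "strong" in evidence else 0.2

evidence, trace = gather_evidence(["q1", "q2"], search_stub, judge_stub)
no_evidence, failed_trace = gather_evidence(["q1", "missing"], search_stub, judge_stub)
```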

The purpose is to avoid trusting weak web snippets blindly.

---

### V16: Local-aware Judge Repair

Improves routing and fallback.

It handles:

```text
PEKOM routing
CCNA Lab routing
residential college routing
local RAG fallback
entity-aware retry
fake URL rejection
```

---

### V17: Strict Entity Judge and UI Polish

Adds stricter entity matching and improves trace display.

This helps avoid cases where a query about one entity is answered with another related but wrong page.

---

### V18: Balanced Official Reference Fallback

Allows the system to still provide official references when strong web evidence is not enough, while avoiding over-trusting weak pages.

---

### V19: Answer Grounding Judge

Checks whether the final generated answer is actually supported by evidence.

This is important because even if retrieval is correct, the model may still introduce unsupported details.

---

### Completeness Guard

Checks whether the answer is too incomplete and whether a rewrite or fallback should be triggered.

---

# 8. Improved Model: PPO Rule-reward Post-training + RAG + Agent + Harness

## 8.1 Purpose

The Improved Model asks:

> Can we further improve the model’s behavior after SFT/RAG by using PPO reward-based post-training?

Baseline 2 already improves factual grounding through RAG and Harness Engineering. The Improved Model adds PPO to shape the model’s behavior.

The goal is not to replace RAG. The goal is to make the model more aligned with the desired answer style and safety behavior.

---

## 8.2 What PPO Means in This Project

PPO stands for **Proximal Policy Optimization**.

In simple terms:

```text
SFT teaches the model by imitation.
PPO lets the model generate answers, scores them with a reward function, and updates the model toward higher-reward answers.
```

In this project:

```text
Actor model: Qwen3-8B + LoRA
Critic/value head: TRL value head model
Reference model: frozen Qwen3-8B reference
Reward: rule-based preference reward function
KL control: used to avoid drifting too far from the reference model
```

---

## 8.3 Rule-based Reward Function

This project uses a rule-based reward function rather than a separately trained neural reward model.

The reward function evaluates:

```text
gold answer similarity
rejected answer penalty
evidence overlap
scope correctness
hallucinated URL penalty
vague answer penalty
process/thinking leakage penalty
direct answer bonus
repetition penalty
degeneration/collapse penalty
```

This is why the model card should describe the final stage as:

> Rule-reward PPO post-training

not:

> Full RLHF with a trained reward model

The reward model type recorded in the notebook is:

```text
rule_based_preference_reward_function
uses_separate_neural_reward_model: False
```
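
To make the idea of a rule-based reward concrete, the sketch below combines a few of the listed signals (gold similarity, rejected-answer penalty, evidence overlap, hallucinated-URL penalty, process-leakage penalty) into one scalar. Every weight, regular expression, and helper name here is a hypothetical choice for illustration. The notebook's actual reward function covers more terms and may score them differently.

```python
import re

def overlap(a, b):
    """Fraction of tokens in `a` that also appear in `b`."""
    tokens_a, tokens_b = a.lower().split(), set(b.lower().split())
    return sum(t in tokens_b for t in tokens_a) / len(tokens_a) if tokens_a else 0.0

def rule_reward(answer, gold, rejected, evidence, allowed_hosts):
    """Scalar reward from simple rules; all weights are illustrative assumptions."""
    reward = overlap(answer, gold) - 0.5 * overlap(answer, rejected)
    reward += 0.3 * overlap(answer, evidence)
    hosts = re.findall(r"https?://([^/\s]+)", answer)
    if any(h.lower() not in allowed_hosts for h in hosts):
        reward -= 1.0  # hallucinated URL penalty
    if re.match(r"\s*(okay|wait|let me)\b", answer, flags=re.IGNORECASE):
        reward -= 0.5  # process/thinking leakage penalty
    return reward

allowed = {"fsktm.um.edu.my", "www.um.edu.my"}
clean_reward = rule_reward("a b c", "a b c", "x y z", "a b c d", allowed)
leaky_reward = rule_reward(
    "Okay, let me think https://fake.example/page", "a b c", "x y z", "a b c d", allowed
)
```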

---

## 8.4 PPO Training Configuration

The final PPO run uses:

```text
Preference dataset rows: 1000
Train rows: 900
Validation rows: 100
MAX_PPO_ROWS: None
Train fraction: 0.90
PPO epochs: 2
Batch size: 2
Mini-batch size: 1
Max new tokens: 72
Max PPO steps per epoch: None
Planned steps per epoch: 450
Total planned steps: 900
Learning rate: 2e-6
Target KL: 0.10
Generation temperature: 0.45
Top-p: 0.78
Repetition penalty: 1.3
No-repeat ngram size: 4
```
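
The planned step counts follow directly from the row count, batch size, and epoch count. A small sketch of that arithmetic, assuming one PPO step consumes one batch and that an optional per-epoch cap simply truncates the epoch (the `planned_ppo_steps` helper is hypothetical):

```python
import math

def planned_ppo_steps(train_rows, batch_size, epochs, max_steps_per_epoch=None):
    """Return (steps_per_epoch, total_steps) for a PPO run."""
    steps_per_epoch = math.ceil(train_rows / batch_size)
    if max_steps_per_epoch is not None:
        steps_per_epoch = min(steps_per_epoch, max_steps_per_epoch)
    return steps_per_epoch, steps_per_epoch * epochs

# 1000 preference rows at a 0.90 train fraction give 900 train rows.
train_rows = int(1000 * 0.90)
steps_per_epoch, total_steps = planned_ppo_steps(train_rows, batch_size=2, epochs=2)
```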

The run completed successfully:

```text
Global PPO steps: 900 / 900
Elapsed time: 04:47:59
Degenerate ratio: 0.00%
```

---

## 8.5 PPO Artifact Verification

The Stage 3 notebook includes strict artifact verification.

This is important because PPO notebooks can easily appear to run while silently saving old or incomplete artifacts.

The strict save cell verifies:

```text
training_log exists
training_log records = 900
expected steps = 900
MAX_PPO_ROWS = None
train rows = 900
valid rows = 100
NUM_PPO_EPOCHS = 2
MAX_PPO_STEPS_PER_EPOCH = None
parameter hash changed after PPO
PPO inference full actor exists
PPO LoRA adapter exists
non-PPO fallback forbidden
```

The final strict save output confirms:

```text
Final PPO records saved: 900 / expected 900
Strict full PPO artifact contract passed.
```

The parameter change proof confirms:

```text
aggregate_hash_changed: true
changed_trainable_tensors: 506
unchanged_trainable_tensors: 0
```
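
A parameter-change proof of this shape can be produced by hashing each trainable tensor before and after training and comparing the digests. The sketch below illustrates the idea with plain Python lists instead of real tensors. The function names, the SHA-256 choice, and the toy snapshots are assumptions, not the notebook's code.

```python
import hashlib
import struct

def tensor_hash(values):
    """SHA-256 digest of a flat list of floats packed as 64-bit doubles."""
    return hashlib.sha256(struct.pack(f"{len(values)}d", *values)).hexdigest()

def aggregate_hash(snapshot):
    """Digest over all per-tensor digests, in sorted name order."""
    joined = "".join(tensor_hash(snapshot[name]) for name in sorted(snapshot))
    return hashlib.sha256(joined.encode()).hexdigest()

def parameter_change_proof(before, after):
    """Count which named tensors changed between two snapshots."""
    changed = [n for n in before if tensor_hash(before[n]) != tensor_hash(after[n])]
    return {
        "aggregate_hash_changed": aggregate_hash(before) != aggregate_hash(after),
        "changed_trainable_tensors": len(changed),
        "unchanged_trainable_tensors": len(before) - len(changed),
    }

# Toy snapshots (hypothetical tensor names and values).
before = {"lora_A": [0.0, 0.1], "value_head": [0.0, 0.0]}
after = {"lora_A": [0.01, 0.1], "value_head": [0.2, 0.0]}
proof = parameter_change_proof(before, after)
```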

This shows that PPO training actually updated the trainable LoRA/value-head parameters, rather than the notebook merely running end to end without effect.

---

## 8.6 Strict PPO-only Runtime

The final runtime is configured so that the UI must use PPO artifacts only.

The strict PPO gate confirms:

```text
PPO records: 900
PPO full actor usable: True
PPO LoRA adapter usable: True
Strict PPO-only UI mode: True
```

The runtime loading order is:

```text
1. PPO full inference actor if full weights exist
2. Otherwise base Qwen3-8B + PPO LoRA adapter
3. Non-PPO fallback is forbidden
```
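
The loading order can be expressed as a small selection function that raises instead of silently falling back. This is a sketch under assumptions: the directory names and the `select_ppo_artifact` helper are hypothetical, and the real gate also checks record counts and manifests, which are omitted here.

```python
import tempfile
from pathlib import Path

def select_ppo_artifact(full_actor_dir, lora_adapter_dir):
    """Pick the PPO artifact to load; refuse any non-PPO fallback."""
    if Path(full_actor_dir).is_dir():
        return "ppo_full_actor", str(full_actor_dir)
    if Path(lora_adapter_dir).is_dir():
        return "base_plus_ppo_lora", str(lora_adapter_dir)
    raise RuntimeError("No PPO artifact found; non-PPO fallback is forbidden.")

# Demo with a temporary directory where only the adapter exists.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "ppo_lora_adapter"
    adapter_dir.mkdir()
    mode, _ = select_ppo_artifact(Path(tmp) / "ppo_full_actor", adapter_dir)
```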

This prevents the final demo from accidentally loading an old Baseline 2 model or a stale 150-step PPO proof artifact.

---

## 8.7 PPO Validation

The PPO-only validation evaluation uses a held-out validation sample.

The displayed validation summary is:

```text
reward: 0.477789
gold_overlap: 0.255351
rejected_overlap: 0.155080
```

Interpretation:

- reward is positive
- gold overlap is higher than rejected overlap
- the PPO-trained actor tends to move closer to preferred answers than rejected answers

This does not mean the PPO model is perfect. It means the reward-shaped behavior is directionally positive.

---

## 8.8 PPO Limitations

The PPO run completed successfully, but the raw PPO generations still show some imperfections.

Observed issues include:

1. **Process leakage**
   Some outputs still include phrases like:
   ```text
   Okay, let me try to figure out...
   Wait, I need to check again...
   ```
   The reward function penalizes this, but it is not completely eliminated.

2. **Occasional hallucinated URLs**
   Some raw generations may still invent URLs. The harness fake URL guard is therefore still necessary.

3. **OCR-style text artifacts**
   Some source chunks contain spacing or OCR issues, and the model may reproduce them.

4. **KL can be high**
   Some PPO logs show high `objective/kl`, meaning the PPO actor can drift noticeably from the reference model. However, the run completed with:
   ```text
   degenerate_ratio = 0.00%
   ```
   and no detected repetition collapse.

5. **RAG/Harness remains necessary**
   PPO improves model behavior, but it does not replace retrieval grounding or guardrails.

---

# 9. TensorTalk UI

The project includes a WhatsApp-style Jupyter HTML UI called **TensorTalk**.

The UI supports:

- chat-style interface
- TensorCat avatar
- RAG on/off control
- web agent on/off control
- collapsed trace panels
- retrieved evidence display
- web evidence display
- planning/thinking display layer
- harness decision trace
- answer grounding information
- strict PPO artifact loading
- new chat reset behavior

The UI is part of the engineering contribution because it makes the harness process visible rather than hidden.

---

# 10. Smoke Tests

## 10.1 What Smoke Test Means Here

A smoke test is a lightweight system sanity check.

It is not a full evaluation. It is a quick check that the main pipeline still works.

In this project, smoke tests check whether:

```text
PPO model loads
RAG retrieves evidence
web agent searches official sources
fake URL guard blocks synthetic links
answer grounding returns a result
trace structure is produced
fallback behavior still works
```

---

## 10.2 Example Smoke Tests

The notebook defines smoke tests such as:

```text
1. PEKOM should not be routed to AI bachelor page
2. Residential college should prefer student-affairs residential page
3. CCNA Lab should not invent synthetic URLs
```

These are not random examples. They are chosen to test known fragile parts of the pipeline:

- entity routing
- official URL preference
- fake URL rejection
- web/RAG trace structure

---

# 11. Control Variable Design

The project uses a control-variable style comparison.

The base task remains the same:

```text
UM FSKTM Handbook QA
```

The base model family remains the same:

```text
Qwen3-8B
```

The dataset domain remains the same:

```text
Undergraduate + postgraduate + general UM Handbook knowledge
```

What changes is the system layer:

```text
Baseline 1: SFT only
Baseline 2: SFT + RAG + Harness Agent
Improved: SFT/RAG/Harness + PPO post-training
```

This allows the project to compare which improvements come from:

- parameter learning
- retrieval grounding
- metadata-aware scope control
- official web augmentation
- harness guardrails
- PPO reward shaping

This is more rigorous than simply building three unrelated systems.

---

# 12. Stage-by-stage Comparison Table

| Dimension | Baseline 1: Closed-book SFT | Baseline 2: RAG + SFT + Agent/Harness | Improved Model: PPO + RAG + Agent/Harness |
|---|---|---|---|
| Main research question | Can the model memorize and reproduce handbook QA from SFT? | Does retrieval-grounded evidence improve handbook QA? | Can rule-reward PPO further align answer behavior while keeping RAG/Harness control? |
| Base model | Qwen3-8B | Qwen3-8B | Qwen3-8B |
| Main training method | Supervised fine-tuning | RAG-augmented supervised fine-tuning | Rule-reward PPO post-training |
| Dataset used | 1000 SFT QA rows | SFT QA + metadata + RAG KB + RAG eval | 1000 PPO preference rows |
| Train/validation/test | 800 / 100 / 100 | 8:1:1 RAG-augmented split | 900 train / 100 validation |
| Retrieval | No | Yes | Yes |
| Retrieval type | None | Dense embedding + FAISS + metadata-aware rerank | Same RAG runtime reused |
| Embedding model | None | BAAI/bge-base-en-v1.5 | RAG runtime inherited from Baseline 2 |
| Top-k evidence | None | Top-3 | Top-3 / runtime-dependent |
| Metadata awareness | Hidden metadata only, not used at inference | Yes, scope/source/section aware | Yes, used by RAG/Harness runtime |
| Scope control | Weak; model may confuse UG/PG if prompt is ambiguous | Stronger due to metadata-aware retrieval | Stronger due to RAG + PPO reward + harness |
| Web agent | No | Yes | Yes |
| Official domain control | No | Yes, UM/FSKTM official domain whitelist | Yes, same official-source guardrails |
| Fake URL guard | No | Yes | Yes |
| WAF handling | No | Yes | Yes |
| Evidence judge | No | Yes, Qwen evidence judge | Yes |
| Retry/fallback policy | No | Yes | Yes |
| Answer grounding judge | No | Yes | Yes |
| Completeness guard | No | Yes | Yes |
| UI trace | Basic chat UI | Harness trace panels | Strict PPO + Harness trace panels |
| LoRA rank | 16 | 8 | PPO actor based on LoRA actor/value setup |
| Training epochs | 8 SFT epochs | 20 SFT epochs | 2 PPO epochs |
| Main output artifact | LoRA adapter + merged model + `.pt` export | LoRA adapter, optional non-merged runtime | PPO full inference actor + PPO LoRA adapter + manifest |
| Artifact strictness | Standard save | Adapter/runtime path checks | Manifest, training log count, parameter hash proof, strict gate |
| Key metric | Test token F1 ≈ 0.8869 | RAG token F1 ≈ 0.846 on selected eval; retrieval Hit@3 ≈ 0.954 | PPO validation reward ≈ 0.4778; gold overlap > rejected overlap |
| Strongest contribution | Clean SFT baseline | Evidence-grounded QA and metadata-aware retrieval | Full PPO post-training with strict artifact verification and harnessed runtime |
| Main weakness | Closed-book hallucination risk | More complex runtime, depends on retriever quality | PPO raw outputs still need Harness/RAG due to possible process leakage and fake URLs |
| Control variable role | Establishes parameter-only baseline | Adds retrieval and harness while keeping same domain/model family | Adds PPO reward shaping while preserving RAG/Harness pipeline |

---

# 13. Technical Comparison of the Three Stages

## 13.1 Content-level Difference

| Content Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Stable handbook facts | Learned into model parameters | Retrieved from handbook KB | Retrieved and answered by PPO-aligned actor |
| Latest or official web info | Not supported | Supported through official web agent | Supported through same official web agent |
| UG vs PG distinction | Learned implicitly | Controlled by metadata retrieval | Controlled by metadata retrieval + reward/harness |
| Evidence visibility | Not shown | Evidence shown in RAG trace | Evidence shown in PPO/Harness trace |
| Hallucination control | Mostly prompt-based | Retrieval + grounding | Retrieval + grounding + reward penalties |
| Fake URL control | Not available | Harness URL guard | Harness URL guard + PPO penalty signal |

---

## 13.2 Engineering-level Difference

| Engineering Aspect | Baseline 1 | Baseline 2 | Improved Model |
|---|---|---|---|
| Notebook purpose | Train and evaluate closed-book SFT model | Build RAG-augmented model and harnessed agent runtime | Train PPO actor and attach it to final harness runtime |
| Runtime complexity | Low | High | Highest |
| Debug trace | Basic | Detailed RAG/Web/Harness trace | Detailed PPO/RAG/Web/Harness trace |
| Failure handling | Minimal | Fallback and guardrail logic | Strict PPO-only fallback prevention plus harness fallback |
| Artifact verification | Basic output save | Adapter/merged path checks | Manifest, training log count, parameter hash proof, strict gate |
| Risk of stale artifact use | Moderate | Moderate | Actively guarded against |
| Demo readiness | Good for simple QA | Strong for grounded QA | Strongest for final controlled system demo |

---

# 14. Why the Improved Model Does Not Replace RAG

A key design decision is that PPO does not replace RAG.

PPO improves the model’s tendency to:

- answer directly
- avoid rejected-style answers
- avoid vague answers
- avoid process leakage
- avoid fake URLs
- avoid repetition collapse
- use evidence-like wording more appropriately

But PPO does not guarantee factual correctness by itself.

Therefore, the final system still needs:

```text
RAG for evidence
Web Agent for official/latest information
Harness for source control
Grounding judge for answer verification
Fallback for weak evidence
```

This is the correct division of responsibility:

```text
SFT: teaches domain answer style
RAG: supplies factual evidence
Agent: finds official external evidence
Harness: controls trust, routing, fallback, and trace
PPO: improves answer behavior according to reward preferences
```

---

# 15. Known Limitations

This project is a strong applied LLM system prototype, but it has limitations.

## 15.1 Not a full human-feedback RLHF system

The PPO stage uses a rule-based reward function. It does not train a separate neural reward model from human preference labels.

Correct description:

```text
Rule-reward PPO post-training
```

Not:

```text
Full RLHF with learned reward model
```

---

## 15.2 Raw PPO generations can still be imperfect

Observed raw PPO generations may include:

- process leakage
- occasional hallucinated URLs
- OCR-like token spacing
- incomplete course titles
- noisy source-text reproduction

The final Harness runtime is therefore necessary.

---

## 15.3 Web search is constrained

The web agent is intentionally limited to official UM/FSKTM sources. It may refuse or fall back when official evidence is weak.

This is a feature, not a bug, because the system prioritizes trustworthiness over open-ended browsing.

---

## 15.4 RAG depends on knowledge base quality

If the RAG KB contains OCR noise or incomplete chunks, the model may inherit that noise. Future work should improve source cleaning and chunk normalization.

---
|
| 1325 |
+
|
| 1326 |
+
## 15.5 Notebook-based prototype
|
| 1327 |
+
|
| 1328 |
+
The project is implemented as notebooks. A production version should separate modules into:
|
| 1329 |
+
|
| 1330 |
+
```text
|
| 1331 |
+
data/
|
| 1332 |
+
retrieval/
|
| 1333 |
+
agent/
|
| 1334 |
+
harness/
|
| 1335 |
+
training/
|
| 1336 |
+
evaluation/
|
| 1337 |
+
ui/
|
| 1338 |
+
tests/
|
| 1339 |
+
```

---

# 16. Recommended Usage

This project is intended for research, coursework, and demonstration purposes.

It is not an official Universiti Malaya system.

For official academic decisions, students should always refer to the official handbook, faculty office, or UM/FSKTM official websites.

---

# 17. Suggested Inference Flow

For the final demonstration, use the Improved Model runtime:

```text
1. Load PPO full inference actor if available.
2. If unavailable, load base Qwen3-8B + PPO LoRA adapter.
3. Initialize local handbook RAG.
4. Enable official UM/FSKTM web agent if the question may need external/latest information.
5. Run through TensorTalkHarnessCore.
6. Display answer with evidence trace.
```

Strict runtime requirement:

```text
Non-PPO fallback is forbidden in the final Improved Model demo.
```
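
Steps 1 and 2, together with the strict requirement, amount to a selection rule that fails loudly instead of falling back to a non-PPO model. The sketch below shows that rule in isolation. The directory layout, the `select_model_source` helper, the `PPOArtifactError` class, and the `Qwen/Qwen3-8B` base identifier are assumptions for illustration; the actual loading code lives in the project notebooks.

```python
from pathlib import Path


class PPOArtifactError(RuntimeError):
    """Raised when no PPO artifact is available for the Improved Model demo."""


def _has_files(path: Path) -> bool:
    # A directory counts as available only if it exists and is not empty.
    return path.is_dir() and any(path.iterdir())


def select_model_source(full_actor_dir: str, adapter_dir: str) -> dict:
    """Pick the PPO full actor, else base + PPO LoRA adapter, else fail."""
    full_actor = Path(full_actor_dir)
    adapter = Path(adapter_dir)

    # Step 1: prefer the full PPO inference actor.
    if _has_files(full_actor):
        return {"mode": "ppo_full_actor", "path": str(full_actor)}

    # Step 2: otherwise use the base model plus the PPO LoRA adapter.
    if _has_files(adapter):
        return {
            "mode": "base_plus_ppo_lora",
            "base": "Qwen/Qwen3-8B",  # assumed base model identifier
            "adapter": str(adapter),
        }

    # Strict requirement: never continue with a non-PPO model.
    raise PPOArtifactError("No PPO artifact found; non-PPO fallback is forbidden.")
```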

---

# 18. Summary

TensorTalk demonstrates a staged LLM system development workflow:

```text
Baseline 1:
Qwen3-8B learns handbook QA through closed-book SFT.

Baseline 2:
The system adds RAG, dense retrieval, metadata-aware reranking, official web search, and Harness Engineering.

Improved Model:
The system adds full 1000-row rule-reward PPO post-training, strict artifact verification, and a PPO-only final harness runtime.
```

The most important contribution is not only that the model can answer handbook questions, but that the system is controlled, evidence-aware, source-constrained, traceable, and evaluated through a clear baseline progression.

The final system should be understood as:

> **A Qwen3-8B-based UM Handbook RAG Agent, improved with rule-reward PPO and controlled by Harness Engineering.**