TensorCat committed on
Commit
8a70417
·
verified ·
1 Parent(s): b4b8aae

Update README.md

Files changed (1)
  1. README.md +1300 -157
README.md CHANGED
@@ -1,250 +1,1393 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
- # TensorTalk / UM_Handbook
5
 
6
- TensorTalk is a handbook-grounded academic chat assistant built for the **Faculty of Computer Science and Information Technology, Universiti Malaya (UM)**.
7
 
8
- This project focuses on turning UM handbook content into a usable question-answering system through:
9
 
10
- - handbook preprocessing
11
- - source chunk construction
12
- - supervised QA dataset building
13
- - Qwen3-8B LoRA fine-tuning
14
- - merged-model deployment
15
- - a browser-style HTML chat demo
16
 
17
  ---
18
 
19
- ## Project Goal
20
 
21
- The main goal of this project is to build a handbook-based assistant that can answer student questions using information learned from the UM handbook domain.
22
 
23
- The current version is designed around:
 
 
 
 
 
 
24
 
25
- - undergraduate and postgraduate handbook content
26
- - handbook-faithful answers
27
- - concise student-facing responses
28
- - a local/demo deployment workflow on DICC and notebook environments
29
 
30
- This project is also intended to support a broader experimental pipeline:
 
 
 
 
 
 
31
 
32
- - **Baseline 1:** closed-book supervised fine-tuning
33
- - **Baseline 2:** retrieval-augmented version for later comparison
 
 
 
 
 
 
34
 
35
  ---
36
 
37
- ## What This Project Contains
38
 
39
- ### 1. Dataset Preparation
40
- The project includes scripts and resources for preparing handbook data before fine-tuning:
41
 
42
- - handbook markdown preprocessing
43
- - source chunk dataset building
44
- - SFT QA dataset construction
45
- - configuration management for the preprocessing and dataset pipeline
46
 
47
- ### 2. Fine-Tuning Workflow
48
- The model training workflow uses a Qwen3-8B base model with LoRA-based fine-tuning on the UM handbook QA dataset.
 
 
 
 
 
 
 
 
49
 
50
- The fine-tuning workflow includes:
51
 
52
- - notebook-based training on DICC
53
- - device-aware loading logic
54
- - train / validation / test style evaluation workflow
55
- - merged-model export for direct inference
56
- - LoRA adapter export for optional PEFT-based reuse
57
- - metrics and prediction file generation
58
 
59
- ### 3. Deployment Demo
60
- The project includes a notebook-based HTML chat UI called **TensorTalk**.
 
 
61
 
62
- The demo provides:
63
 
64
- - a browser-style chat layout
65
- - a handbook-focused system prompt
66
- - merged-model loading for direct inference
67
- - a student-facing question-answer workflow
68
- - a simple deployment path for demonstration purposes
69
 
70
  ---
71
 
72
- ## Current Project Structure
 
 
 
 
73
 
74
  ```text
75
- UM_Handbook/
76
- ├── Dataset/
77
- │ └── SFT_Dataset/
78
- │ ├── SFT_QA_Training_Ready.jsonl
79
- │ ├── SFT_QA_Training_Ready_pretty.json
80
- │ ├── SFT_QA_Metadata.jsonl
81
- │ └── SFT_QA_Metadata_pretty.json
82
- ├── assets/
83
- ├── outputs/
84
- │ └── qwen3_um_handbook_optimized_1/
85
- │ ├── lora_adapter/
86
- │ ├── merged_model/
87
- │ ├── trainer_runs/
88
- │ ├── test_eval_runs/
89
- │ ├── dataset_split_summary.json
90
- │ ├── final_metrics.json
91
- │ ├── test_predictions.jsonl
92
- │ └── validation_predictions.jsonl
93
- ├── FineTune_QWEN3_UM_Handbook_optimized_1.ipynb
94
- ├── UM_Handbook_Markdown_Preprocess.py
95
- ├── UM_SFT_QA_Dataset_Builder_from_Index.py
96
- ├── UM_Source_Chunk_Dataset_Builder.py
97
- └── um_handbook_config.py
98
  ```
99
 
 
 
100
  ---
101
 
102
- ## Key Files
103
 
104
- ### Training and Data
105
- - `Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl`
106
- Main SFT training dataset used for handbook QA fine-tuning.
107
 
108
- - `UM_Handbook_Markdown_Preprocess.py`
109
- Preprocesses handbook markdown / extracted source text.
110
 
111
- - `UM_Source_Chunk_Dataset_Builder.py`
112
- Builds source chunks for downstream dataset and retrieval-related use.
113
 
114
- - `UM_SFT_QA_Dataset_Builder_from_Index.py`
115
- Builds the supervised QA dataset from curated handbook content.
116
 
117
- - `um_handbook_config.py`
118
- Central configuration file for paths and data-processing settings.
119
 
120
- ### Training Output
121
- - `outputs/qwen3_um_handbook_optimized_1/merged_model/`
122
- Main inference-ready model directory.
123
- This is the directory used by the demo chat UI.
124
 
125
- - `outputs/qwen3_um_handbook_optimized_1/lora_adapter/`
126
- LoRA adapter weights.
127
- This is useful for PEFT-style loading with a base model, but it is not the primary path used by the current demo UI.
128
 
129
- - `outputs/qwen3_um_handbook_optimized_1/final_metrics.json`
130
- Final evaluation summary.
 
 
 
 
 
 
 
 
 
 
131
 
132
- - `outputs/qwen3_um_handbook_optimized_1/validation_predictions.jsonl`
133
- Validation-set generated answers for inspection.
134
 
135
- - `outputs/qwen3_um_handbook_optimized_1/test_predictions.jsonl`
136
- Test-set generated answers for inspection.
 
 
137
 
138
- ### Demo
139
- - `FineTune_QWEN3_UM_Handbook_optimized_1.ipynb`
140
- Main notebook that contains the fine-tuning workflow and the TensorTalk HTML chat demo.
141
 
142
  ---
143
 
144
- ## Model Artifact Notes
145
 
146
- This project may contain several model-related outputs. They are not all used in the same way.
147
 
148
- ### `merged_model/`
149
- This is the most important deployment artifact for the current demo.
150
 
151
- Use this when:
152
- - running the current TensorTalk HTML chat UI
153
- - loading the fine-tuned model directly with Hugging Face `from_pretrained(...)`
154
- - sharing the main inference-ready model
 
 
 
 
 
 
 
 
155
 
156
- ### `lora_adapter/`
157
- This contains LoRA delta weights only.
 
 
 
 
 
 
 
 
 
 
 
158
 
159
- Use this when:
160
- - loading the adapter on top of the original base model
161
- - reusing the fine-tuning result in a PEFT workflow
162
- - experimenting with a smaller transferable fine-tuning artifact
163
 
164
- ### `.pt` exported model file
165
- If present, the `.pt` file is mainly a saved full-model artifact / backup export.
166
 
167
- Use this when:
168
- - archiving the full fine-tuned weights
169
- - running a custom loading workflow that explicitly expects a `.pt` file
170
 
171
- For the current TensorTalk chat UI, the primary runtime artifact is still **`merged_model/`**.
 
 
 
 
 
 
 
 
 
 
 
172
 
173
  ---
174
 
175
- ## Current Demo Behavior
176
 
177
- The current demo is designed to answer questions such as:
178
 
179
- - dress code and appearance guidance
180
- - programme core courses / credit requirements
181
- - undergraduate vs postgraduate handbook information
182
- - academic rules and handbook-supported policy questions
183
 
184
- The answer style is intended to be:
 
 
 
185
 
186
- - handbook-grounded
187
- - short and direct
188
- - student-facing
189
- - non-speculative
190
 
191
  ---
192
 
193
- ## Example Demo Output
 
 
194
 
195
- The screenshot below shows the current TensorTalk chat interface running with the fine-tuned UM handbook model.
196
 
197
- ![TensorTalk Demo](assets/tensortalk_demo_chat.jpg)
 
 
 
 
 
 
 
 
 
 
 
 
198
 
199
  ---
200
 
201
- ## Repository Preview
 
 
202
 
203
- The screenshot below shows the current top-level project layout.
204
 
205
- ![Repository Structure](UM_Handbook/assets/tensortalk_demo_chat.jpg)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
206
 
207
  ---
208
 
209
- ## Suggested Minimal Deployment Package
210
 
211
- If the goal is only to demonstrate the chat UI to teammates, the minimal useful set is:
212
 
213
- - `merged_model/`
214
- - the chat notebook / UI code
215
- - optional avatar image under `assets/`
216
 
217
- The following items are not required for a simple demo run:
218
 
219
- - intermediate training checkpoints
220
- - test evaluation run directories
221
- - optional full `.pt` export
222
- - raw training logs not used by the demo
 
 
 
 
 
 
 
 
 
223
 
224
  ---
225
 
226
- ## Notes
227
 
228
- - The project is organized so that **Dataset**, **models / outputs**, and **demo code** remain separate.
229
- - The current demo is notebook-friendly and was prepared around a DICC workflow.
230
- - The deployment path prioritizes clarity and reproducibility over a heavyweight full-stack application setup.
 
 
 
 
 
 
 
 
 
 
 
231
 
232
  ---
233
 
234
- ## Status
 
 
235
 
236
- Current project status:
237
 
238
- - handbook preprocessing pipeline prepared
239
- - supervised QA dataset prepared
240
- - LoRA fine-tuning workflow completed
241
- - merged model exported
242
- - TensorTalk HTML chat demo running
243
- - evaluation outputs generated
 
 
244
 
245
  ---
246
 
247
- ## Author / Project Name
248
 
249
- **TensorTalk**
250
- UM Handbook QA / Fine-Tuned Qwen3-8B LoRA Project
 
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ base_model:
7
+ - Qwen/Qwen3-8B
8
+ pipeline_tag: text-generation
9
+ library_name: transformers
10
+ tags:
11
+ - qwen3
12
+ - qwen3-8b
13
+ - lora
14
+ - qlora
15
+ - sft
16
+ - rag
17
+ - faiss
18
+ - dense-retrieval
19
+ - agent
20
+ - ppo
21
+ - rlhf
22
+ - rule-reward
23
+ - harness-engineering
24
+ - um-handbook
25
+ - question-answering
26
+ - chatbot
27
+ - education
28
+ - tensor-talk
29
  ---
 
30
 
31
+ # TensorTalk: UM Handbook Qwen3-8B SFT + RAG + Agent + PPO + Harness Engineering
32
 
33
+ TensorTalk is a staged LLM engineering project for **handbook question answering at the Faculty of Computer Science and Information Technology (FSKTM), Universiti Malaya**. It answers undergraduate, postgraduate, and general faculty handbook questions, and it is developed through a controlled progression of three experimental stages:
34
 
35
+ 1. **Baseline 1 — Closed-book SFT Qwen3-8B**
36
+ 2. **Baseline 2 — SFT Qwen3-8B + Metadata-aware RAG + Official Web Agent + Harness Engineering**
37
+ 3. **Improved Model — Rule-reward PPO post-training + RAG + Agent + Harness Engineering**
38
+
39
+ The project is not just a simple chatbot. It is a controlled comparison of how an LLM system improves when moving from memorized supervised fine-tuning, to retrieval-grounded answering, and finally to rule-reward post-training with a guarded agentic runtime.
40
+
41
+ The main idea is:
42
+
43
+ > Baseline 1 tests whether a fine-tuned model can answer handbook questions from parameters alone.
44
+ > Baseline 2 keeps the same base model family but adds retrieval and harnessed evidence control.
45
+ > The Improved Model keeps the RAG + Agent + Harness runtime and further adds PPO post-training to make the model better aligned with the desired answer behavior.
46
+
47
+ ---
48
+
49
+ # 1. Project Goal
50
+
51
+ The goal of this project is to build a reliable and traceable UM Handbook assistant that can answer questions about:
52
+
53
+ - Faculty objectives, vision, mission, history, facilities, and academic calendar
54
+ - Undergraduate programme details
55
+ - Postgraduate programme details
56
+ - Candidature requirements
57
+ - Grading and academic rules
58
+ - Industrial training
59
+ - Academic project requirements
60
+ - Supervision policy
61
+ - Thesis/dissertation requirements
62
+ - Academic integrity and plagiarism
63
+ - Facilities and labs
64
+ - Official UM/FSKTM web information when handbook knowledge is insufficient or time-sensitive
65
+
66
+ The project also aims to demonstrate a complete LLM system development path:
67
+
68
+ ```text
69
+ Closed-book SFT
70
+ → RAG-augmented SFT
71
+ → Metadata-aware retrieval
72
+ → Official-source web agent
73
+ → Harness Engineering guardrails
74
+ → PPO rule-reward post-training
75
+ → Strict artifact verification
76
+ → Traceable TensorTalk UI
77
+ ```
78
+
79
+ ---
80
+
81
+ # 2. High-level System Overview
82
+
83
+ The final TensorTalk system contains several layers.
84
+
85
+ ```text
86
+ User Question
87
+
88
+ TensorTalk UI
89
+
90
+ Planning / Thinking Display Layer
91
+
92
+ Local Handbook RAG
93
+
94
+ Official UM / FSKTM Web Agent
95
+
96
+ Harness Engineering Guardrails
97
+
98
+ Evidence Judge / Retry / Fallback
99
+
100
+ PPO-trained Qwen3-8B Actor
101
+
102
+ Answer Grounding Judge
103
+
104
+ Completeness Guard
105
+
106
+ Final Answer + Trace Panels
107
+ ```
108
+
109
+ The final model is not used alone. It is wrapped inside a runtime harness that controls:
110
+
111
+ - where the system can search
112
+ - which sources it can trust
113
+ - whether web evidence is useful
114
+ - whether retrieved evidence supports the answer
115
+ - whether the model produced fake URLs
116
+ - whether the answer leaked internal reasoning
117
+ - whether fallback to local handbook RAG is needed
118
+ - whether the final answer is grounded enough to show
119
+
120
+ This is why the final stage is better described as:
121
+
122
+ > **A PPO-aligned RAG agent system with Harness Engineering**, rather than only a fine-tuned model.
123
+
124
+ ---
125
+
126
+ # 3. Dataset Design
127
+
128
+ ## 3.1 Source Domain
129
+
130
+ The dataset is built around UM FSKTM undergraduate and postgraduate handbook content. The data is organized into:
131
+
132
+ - SFT question-answer dataset
133
+ - hidden metadata
134
+ - RAG knowledge base
135
+ - RAG evaluation dataset
136
+ - PPO preference dataset
137
+
138
+ The project separates **model-visible training text** from **metadata used for retrieval, evaluation, and analysis**.
139
+
140
+ This distinction is important:
141
+
142
+ - Baseline 1 intentionally trains on question-answer text without forcing explicit metadata labels into the model-visible answer.
143
+ - Baseline 2 uses metadata-aware retrieval to reduce scope confusion.
144
+ - Stage 3 PPO uses preference pairs and reward functions to shape answer behavior.
145
+
146
+ ---
147
+
148
+ ## 3.2 Baseline 1 SFT Dataset
149
+
150
+ Baseline 1 uses:
151
+
152
+ ```text
153
+ SFT_QA_Training_Ready.jsonl
154
+ ```
155
+
156
+ The notebook validates:
157
+
158
+ ```text
159
+ Total examples: 1000
160
+ Train examples: 800
161
+ Validation examples: 100
162
+ Test examples: 100
163
+ Split ratio: 8:1:1
164
+ Duplicate question groups: 0
165
+ Duplicate question rows: 0
166
+ ```
167
+
168
+ Each example follows a supervised chat-style format:
169
+
170
+ ```json
171
+ {
172
+ "prompt": [
173
+ {
174
+ "role": "system",
175
+ "content": "You are an academic assistant for the Faculty of Computer Science and Information Technology, Universiti Malaya..."
176
+ },
177
+ {
178
+ "role": "user",
179
+ "content": "What are the faculty objectives?"
180
+ }
181
+ ],
182
+ "completion": [
183
+ {
184
+ "role": "assistant",
185
+ "content": "The faculty objectives are..."
186
+ }
187
+ ],
188
+ "question": "...",
189
+ "answer": "..."
190
+ }
191
+ ```
192
+
193
+ This stage teaches the model to imitate handbook-style answers directly.
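The record layout above can be checked mechanically before training. The following is a minimal sketch, assuming each JSONL line carries exactly the four top-level keys shown in the example. The function name `validate_sft_record` and the sample record are hypothetical and are not taken from the project notebooks.

```python
import json

REQUIRED_KEYS = ("prompt", "completion", "question", "answer")


def validate_sft_record(line: str) -> dict:
    """Parse one JSONL line and check the chat-style layout shown above."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    prompt_roles = [message["role"] for message in record["prompt"]]
    if prompt_roles != ["system", "user"]:
        raise ValueError(f"unexpected prompt roles: {prompt_roles}")
    completion_roles = [message["role"] for message in record["completion"]]
    if completion_roles != ["assistant"]:
        raise ValueError(f"unexpected completion roles: {completion_roles}")
    return record


# Hypothetical record, shaped like the example above.
sample_line = json.dumps({
    "prompt": [
        {"role": "system", "content": "You are an academic assistant."},
        {"role": "user", "content": "What are the faculty objectives?"},
    ],
    "completion": [{"role": "assistant", "content": "The faculty objectives are..."}],
    "question": "What are the faculty objectives?",
    "answer": "The faculty objectives are...",
})
record = validate_sft_record(sample_line)
print(record["prompt"][1]["content"])
```

Running a check like this over every line of the training file is a cheap way to catch malformed rows before they reach the trainer.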
194
+
195
+ ---
196
+
197
+ ## 3.3 Baseline 2 RAG Dataset
198
+
199
+ Baseline 2 follows the same SFT dataset design, but adds external retrieval resources:
200
+
201
+ ```text
202
+ UM_RAG_Knowledge_Base.jsonl
203
+ UM_RAG_Evaluation_Dataset.jsonl
204
+ SFT_QA_Metadata.jsonl
205
+ ```
206
+
207
+ The RAG knowledge base contains structured fields such as:
208
+
209
+ ```text
210
+ kb_id
211
+ source_doc
212
+ scope_label
213
+ section
214
+ pages
215
+ source_text
216
+ retrieval_text
217
+ retrieval_keywords
218
+ grounded_answer_bank
219
+ matched_qa_ids
220
+ ```
221
+
222
+ The RAG knowledge base loaded in the final Stage 3 runtime contains:
223
+
224
+ ```text
225
+ Loaded KB rows: 521
226
+ ```
227
+
228
+ The metadata layer allows the system to distinguish:
229
+
230
+ ```text
231
+ general
232
+ undergraduate
233
+ postgraduate
234
+ ```
235
+
236
+ This is important because many handbook questions look similar but require different answers depending on the student scope.
237
+
238
+ ---
239
+
240
+ ## 3.4 PPO Preference Dataset
241
+
242
+ The Improved Model uses:
243
+
244
+ ```text
245
+ UM_Handbook_PPO_Preference_Dataset.jsonl
246
+ ```
247
+
248
+ The final PPO run uses the full dataset:
249
+
250
+ ```text
251
+ Total PPO preference rows: 1000
252
+ Train rows: 900
253
+ Validation rows: 100
254
+ Train fraction: 0.90
255
+ ```
256
+
257
+ The PPO dataset is not used like normal SFT data. In SFT, the model directly imitates a reference answer. In PPO, the model generates its own answer, receives a reward, and updates toward higher-reward behavior.
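To make the contrast with SFT concrete, here is a toy rule-based reward. It is only a sketch: the function name `rule_reward`, the penalty weights, the word limit, and the use of a `<think>` tag as a marker of leaked reasoning are all assumptions for illustration, not the reward used in the project.

```python
def rule_reward(answer: str, reference: str, max_words: int = 80) -> float:
    """Toy rule reward: reference coverage minus assumed penalties."""
    answer_tokens = answer.lower().split()
    reference_tokens = set(reference.lower().split())
    # Fraction of distinct reference words that the answer covers.
    covered = len(reference_tokens & set(answer_tokens)) / max(len(reference_tokens), 1)
    reward = covered
    if "<think>" in answer.lower():
        reward -= 0.5   # assumed penalty: leaked internal reasoning
    if len(answer_tokens) > max_words:
        reward -= 0.25  # assumed penalty: overly long answer
    return reward


clean = rule_reward("The pass mark is 50", "the pass mark is 50")
leaky = rule_reward("<think> hmm </think> the pass mark is 50", "the pass mark is 50")
print(clean, leaky)
```

In a PPO loop, a scalar like this would be computed for each sampled answer and used as the signal that the policy update tries to increase.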
258
+
259
+ ---
260
+
261
+ # 4. Baseline 1 — Closed-book SFT Qwen3-8B
262
+
263
+ ## 4.1 Purpose
264
+
265
+ Baseline 1 asks a simple question:
266
+
267
+ > Can Qwen3-8B learn UM Handbook question answering from supervised fine-tuning alone?
268
+
269
+ This is a **closed-book baseline**. The model does not retrieve handbook evidence during inference. It must answer from what it learned during SFT.
270
+
271
+ This is useful as a control baseline because it shows what happens when the model relies mainly on parameter memory.
272
+
273
+ ---
274
+
275
+ ## 4.2 Model
276
+
277
+ Baseline 1 uses:
278
+
279
+ ```text
280
+ Base model: Qwen/Qwen3-8B
281
+ Local path: /scr/user/kevin2002/TensorCat/NLP/UM_Handbook/models/Qwen3-8B
282
+ ```
283
+
284
+ The notebook detected:
285
+
286
+ ```text
287
+ Backend: CUDA
288
+ GPU: NVIDIA A100-SXM4-80GB
289
+ dtype: bfloat16
290
+ 4-bit QLoRA: enabled
291
+ ```
292
+
293
+ ---
294
+
295
+ ## 4.3 Training Method
296
+
297
+ Baseline 1 uses LoRA / QLoRA supervised fine-tuning.
298
+
299
+ LoRA configuration:
300
+
301
+ ```text
302
+ LoRA rank: 16
303
+ LoRA alpha: 32
304
+ LoRA dropout: 0.10
305
+ Target modules:
306
+ - q_proj
307
+ - k_proj
308
+ - v_proj
309
+ - o_proj
310
+ - gate_proj
311
+ - up_proj
312
+ - down_proj
313
+ ```
314
+
315
+ Training configuration:
316
+
317
+ ```text
318
+ Epochs: 8
319
+ Train split: 800
320
+ Validation split: 100
321
+ Test split: 100
322
+ Per-device train batch size: 2
323
+ Per-device eval batch size: 2
324
+ Gradient accumulation steps: 8
325
+ Learning rate: 1e-4
326
+ Packing: False
327
+ ```
328
+
329
+ ---
330
+
331
+ ## 4.4 Baseline 1 Results
332
+
333
+ The training completed successfully.
334
+
335
+ Training summary:
336
+
337
+ ```text
338
+ Training steps: 400
339
+ Training runtime: ~18.68 minutes for the main train stage
340
+ Train loss: 0.4824
341
+ Final validation loss: ~0.146
342
+ Test loss: ~0.197
343
+ Validation perplexity: ~1.157
344
+ ```
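The reported step count and perplexity can be cross-checked against the training configuration in section 4.3. The sketch below assumes a single-GPU run (`num_devices = 1`) and assumes that the reported perplexity is the exponential of the final validation loss. Both assumptions are consistent with the numbers above, but neither is stated explicitly in the notebook output.

```python
import math

train_examples = 800
per_device_batch = 2
grad_accum_steps = 8
epochs = 8
num_devices = 1  # assumption: single-GPU run

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

validation_loss = 0.146
perplexity = math.exp(validation_loss)

print(effective_batch, steps_per_epoch, total_steps)  # 16 50 400
print(round(perplexity, 3))                           # 1.157
```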
345
+
346
+ Generation metrics:
347
+
348
+ ### Validation
349
+
350
+ ```text
351
+ Exact match: 0.77
352
+ Token F1: 0.9111
353
+ ROUGE-1: 0.9122
354
+ ROUGE-2: 0.8700
355
+ ROUGE-L: 0.8979
356
+ SacreBLEU: 81.7240
357
+ chrF++: 86.8916
358
+ Average prediction words: 36.35
359
+ Average reference words: 38.57
360
+ ```
361
+
362
+ ### Test
363
+
364
+ ```text
365
+ Exact match: 0.72
366
+ Token F1: 0.8869
367
+ ROUGE-1: 0.8857
368
+ ROUGE-2: 0.8352
369
+ ROUGE-L: 0.8677
370
+ SacreBLEU: 81.1138
371
+ chrF++: 87.7054
372
+ Average prediction words: 38.03
373
+ Average reference words: 37.03
374
+ ```
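For reference, exact match and token F1 are commonly computed as in the sketch below. It assumes a SQuAD-style normalization (lowercasing and punctuation stripping). The notebook's own normalization may differ, so treat this as an illustration of the metrics rather than the project's evaluation code.

```python
from collections import Counter
import string


def normalize(text: str) -> list:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The pass mark is 50.", "the pass mark is 50"))
print(token_f1("a b c", "a b d"))
```

Token F1 gives partial credit for overlapping wording, which is why it can be high even when exact match is low.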
375
+
376
+ ---
377
+
378
+ ## 4.5 Baseline 1 Strengths
379
+
380
+ Baseline 1 is strong when the question is close to the training distribution. It can reproduce handbook-style answers well and shows high text overlap with the reference answers.
381
+
382
+ It is useful because:
383
+
384
+ - it establishes the basic Qwen3-8B SFT capability
385
+ - it verifies that the dataset format is learnable
386
+ - it creates a clean closed-book control model
387
+ - it provides a baseline for later RAG and PPO improvements
388
+
389
+ ---
390
+
391
+ ## 4.6 Baseline 1 Limitations
392
+
393
+ Baseline 1 is still limited because it is a closed-book model.
394
+
395
+ Main limitations:
396
+
397
+ 1. **No retrieval evidence**
398
+ It cannot check the handbook at inference time.
399
+
400
+ 2. **Potential hallucination**
401
+ If the question is out-of-distribution or requires exact source grounding, the model may answer from memory.
402
+
403
+ 3. **Scope confusion**
404
+ Undergraduate and postgraduate rules may be mixed if the question is ambiguous.
405
+
406
+ 4. **No official web update mechanism**
407
+ It cannot answer dynamic or latest-information questions reliably.
408
+
409
+ 5. **No harness guardrails**
410
+ It does not include fake URL detection, evidence judging, WAF (web application firewall) handling, or fallback control.
411
+
412
+ Baseline 1 is therefore a necessary but incomplete starting point.
413
+
414
+ ---
415
+
416
+ # 5. Baseline 2 — RAG + SFT + Metadata-aware Retrieval + Harness Agent
417
+
418
+ ## 5.1 Purpose
419
+
420
+ Baseline 2 asks:
421
+
422
+ > What improves if we keep the same Qwen3-8B family but add retrieval-grounded evidence?
423
+
424
+ The goal is to reduce hallucination and scope confusion by giving the model relevant handbook evidence at inference time.
425
+
426
+ This stage introduces RAG and agentic harness logic while keeping the same broad model family and handbook task.
427
 
428
  ---
429
 
430
+ ## 5.2 What RAG Means in This Project
431
 
432
+ RAG stands for **Retrieval-Augmented Generation**.
433
 
434
+ In simple terms:
435
+
436
+ ```text
437
+ Instead of asking the model to answer only from memory,
438
+ the system first retrieves relevant handbook chunks,
439
+ then asks the model to answer using those chunks.
440
+ ```
441
 
442
+ In this project, RAG is not just keyword search. It uses:
 
 
 
443
 
444
+ ```text
445
+ Transformer embedding model
446
+ + FAISS vector search
447
+ + metadata-aware reranking
448
+ + scope labels
449
+ + top-k evidence blocks
450
+ ```
451
 
452
+ The Baseline 2 retriever uses:
453
+
454
+ ```text
455
+ Embedding model: BAAI/bge-base-en-v1.5
456
+ Vector index: FAISS
457
+ Similarity: inner product after embedding normalization
458
+ Top-k retrieval: 3
459
+ ```
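Taking the inner product after L2-normalizing the embeddings is the same as ranking by cosine similarity. The sketch below illustrates this with a brute-force search in plain Python, which stands in for the FAISS index. The three-dimensional vectors and `kb_*` identifiers are hypothetical, and the real embedding model produces much longer vectors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_inner_product(query, chunk_vectors, k=3):
    """Brute-force stand-in for an inner-product search over normalized vectors."""
    q = l2_normalize(query)
    scored = []
    for chunk_id, vector in chunk_vectors.items():
        v = l2_normalize(vector)
        scored.append((sum(a * b for a, b in zip(q, v)), chunk_id))
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hypothetical embeddings for four knowledge-base chunks.
chunks = {
    "kb_001": [0.9, 0.1, 0.0],
    "kb_002": [0.0, 1.0, 0.0],
    "kb_003": [0.7, 0.7, 0.0],
    "kb_004": [0.0, 0.0, 1.0],
}
results = top_k_inner_product([1.0, 0.0, 0.0], chunks, k=3)
for chunk_id, score in results:
    print(chunk_id, round(score, 4))
```

Because every vector has unit length, each score lies between -1 and 1, which makes scores comparable across queries.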
460
 
461
  ---
462
 
463
+ ## 5.3 Metadata-aware Retrieval
464
 
465
+ The RAG system uses metadata to control retrieval quality.
 
466
 
467
+ Important metadata fields include:
 
 
 
468
 
469
+ ```text
470
+ source_doc
471
+ scope_label
472
+ section
473
+ pages
474
+ kb_id
475
+ knowledge group
476
+ retrieval keywords
477
+ grounded answer bank
478
+ ```
479
 
480
+ This allows the retriever to prefer the correct audience scope.
481
 
482
+ Example:
 
 
 
 
 
483
 
484
+ ```text
485
+ Question: What are the candidature requirements for Master of Software Engineering?
486
+ Expected scope: postgraduate
487
+ ```
488
 
489
+ The system should retrieve postgraduate chunks, not undergraduate chunks.
490
 
491
+ This is one of the main improvements over Baseline 1.
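One simple way to make retrieval scope-aware is to add a score bonus to candidates whose `scope_label` matches the scope inferred from the question. The sketch below is a hypothetical illustration: the keyword rules in `infer_scope`, the `scope_bonus` value, and the candidate scores are assumptions, not the reranker used in the notebooks.

```python
def infer_scope(question: str) -> str:
    """Hypothetical keyword rule for guessing the audience scope of a question."""
    text = question.lower()
    if any(word in text for word in ("master", "phd", "doctor", "postgraduate", "candidature")):
        return "postgraduate"
    if any(word in text for word in ("bachelor", "undergraduate", "industrial training")):
        return "undergraduate"
    return "general"


def rerank_by_scope(question, candidates, scope_bonus=0.2):
    """Add an assumed bonus to candidates whose scope_label matches the inferred scope."""
    scope = infer_scope(question)
    rescored = []
    for candidate in candidates:
        bonus = scope_bonus if candidate["scope_label"] == scope else 0.0
        rescored.append({**candidate, "score": candidate["score"] + bonus})
    return sorted(rescored, key=lambda c: c["score"], reverse=True)


# Hypothetical dense-retrieval candidates with nearly equal similarity scores.
candidates = [
    {"kb_id": "kb_ug_010", "scope_label": "undergraduate", "score": 0.71},
    {"kb_id": "kb_pg_042", "scope_label": "postgraduate", "score": 0.69},
]
question = "What are the candidature requirements for Master of Software Engineering?"
reranked = rerank_by_scope(question, candidates)
print(reranked[0]["kb_id"])  # kb_pg_042
```

Without the bonus, the undergraduate chunk would rank first on raw similarity. With it, the postgraduate chunk wins, which is the behavior the example question calls for.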
 
 
 
 
492
 
493
  ---
494
 
495
+ ## 5.4 RAG-augmented Training Dataset
496
+
497
+ Baseline 2 creates a RAG-augmented dataset where training examples include evidence context.
498
+
499
+ The training prompt can contain:
500
 
501
  ```text
502
+ User question
503
+ + retrieved handbook evidence
504
+ + source metadata
505
+ + answer instruction
506
  ```
507
 
508
+ This teaches the model to answer with evidence-aware context rather than only memorized answers.
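The prompt structure above could be assembled roughly as follows. This is a sketch under stated assumptions: the field names come from the knowledge-base schema in section 3.3, but the function `build_rag_prompt`, the evidence header layout, the instruction wording, and the sample chunk are hypothetical.

```python
def build_rag_prompt(question, evidence_chunks):
    """Assemble a prompt from retrieved chunks, their metadata, and an answer instruction."""
    blocks = []
    for index, chunk in enumerate(evidence_chunks, start=1):
        header = (
            f"[Evidence {index}] source_doc={chunk['source_doc']} | "
            f"scope_label={chunk['scope_label']} | section={chunk['section']} | "
            f"pages={chunk['pages']}"
        )
        blocks.append(f"{header}\n{chunk['source_text']}")
    evidence = "\n\n".join(blocks)
    # Assumed instruction wording; the notebook's actual instruction may differ.
    instruction = "Answer using only the evidence above. If it is insufficient, say so."
    return f"Question: {question}\n\n{evidence}\n\n{instruction}"


# Hypothetical chunk with placeholder text.
sample_chunk = {
    "source_doc": "postgraduate_handbook",
    "scope_label": "postgraduate",
    "section": "Candidature",
    "pages": "12-13",
    "source_text": "Placeholder handbook text about candidature requirements.",
}
prompt = build_rag_prompt("What are the candidature requirements?", [sample_chunk])
print(prompt)
```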
509
+
510
  ---
511
 
512
+ ## 5.5 Baseline 2 Training Configuration
513
 
514
+ Baseline 2 uses Qwen3-8B with LoRA fine-tuning.
 
 
515
 
516
+ Configuration:
517
+
518
+ ```text
519
+ Base model: Qwen/Qwen3-8B
520
+ Embedding model: BAAI/bge-base-en-v1.5
521
+ LoRA rank: 8
522
+ LoRA alpha: 16
523
+ LoRA dropout: 0.05
524
+ Target modules:
525
+ - q_proj
526
+ - k_proj
527
+ - v_proj
528
+ - o_proj
529
+ - gate_proj
530
+ - up_proj
531
+ - down_proj
532
+ Epochs: 20
533
+ Per-device train batch size: 4
534
+ Per-device eval batch size: 8
535
+ Target global batch size: 8
536
+ Learning rate: 8e-5
537
+ Max sequence length: 1024
538
+ Validation ratio: 0.10
539
+ Test ratio: 0.10
540
+ Save merged model: False
541
+ Runtime model path: base model + LoRA adapter
542
+ ```
543
 
544
+ The notebook uses a safer non-merged runtime path when merged model export is unavailable or memory-expensive.
 
545
 
546
+ ---
 
547
 
548
+ ## 5.6 Baseline 2 Retrieval Evaluation
 
549
 
550
+ Baseline 2 includes a retrieval evaluation set.
 
 
 
551
 
552
+ Retrieval metrics:
 
 
553
 
554
+ ```json
555
+ {
556
+ "retrieval_eval_size": 1000,
557
+ "top_k": 3,
558
+ "hit_at_1_primary": 0.821,
559
+ "hit_at_k_primary": 0.954,
560
+ "hit_at_k_same_group": 0.991,
561
+ "scope_match_at_1": 0.996,
562
+ "retriever_type": "dense_embedding + faiss + metadata_rerank",
563
+ "embedding_model_name": "BAAI/bge-base-en-v1.5"
564
+ }
565
+ ```
566
 
567
+ Interpretation:
 
568
 
569
+ - `hit_at_1_primary = 0.821` means the top retrieved chunk is exactly the expected primary evidence in 82.1% of cases.
570
+ - `hit_at_k_primary = 0.954` means the correct primary evidence appears within top-3 in 95.4% of cases.
571
+ - `hit_at_k_same_group = 0.991` means an acceptable chunk from the same knowledge group appears within top-3 in 99.1% of cases.
572
+ - `scope_match_at_1 = 0.996` means the top result almost always matches the correct undergraduate/postgraduate/general scope.
573
 
574
+ This confirms that the RAG system is not random retrieval. It is a strong metadata-aware retrieval baseline.
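The hit-rate metrics interpreted above can be computed from ranked retrieval results as in this sketch. The field names `primary_kb_id` and `ranked_kb_ids` and the four toy examples are hypothetical. The project's evaluation file may use a different schema.

```python
def retrieval_metrics(examples, k=3):
    """Compute hit@1 and hit@k over ranked kb_id lists."""
    total = len(examples)
    # hit@1: the top-ranked chunk is the expected primary evidence.
    hit_at_1 = sum(ex["ranked_kb_ids"][:1] == [ex["primary_kb_id"]] for ex in examples)
    # hit@k: the expected primary evidence appears anywhere in the top k.
    hit_at_k = sum(ex["primary_kb_id"] in ex["ranked_kb_ids"][:k] for ex in examples)
    return {
        "hit_at_1_primary": hit_at_1 / total,
        "hit_at_k_primary": hit_at_k / total,
    }


# Hypothetical evaluation rows.
examples = [
    {"primary_kb_id": "kb_001", "ranked_kb_ids": ["kb_001", "kb_007", "kb_003"]},
    {"primary_kb_id": "kb_002", "ranked_kb_ids": ["kb_009", "kb_002", "kb_004"]},
    {"primary_kb_id": "kb_003", "ranked_kb_ids": ["kb_003", "kb_001", "kb_002"]},
    {"primary_kb_id": "kb_004", "ranked_kb_ids": ["kb_005", "kb_006", "kb_008"]},
]
metrics = retrieval_metrics(examples, k=3)
print(metrics)
```

The same pattern extends to `hit_at_k_same_group` and `scope_match_at_1` by comparing group identifiers or scope labels instead of `kb_id` values.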
 
 
575
 
576
  ---
577
 
578
+ ## 5.7 Baseline 2 Generation Evaluation
579
 
580
+ Generation evaluation was run on a smaller selected set for runtime practicality.
581
 
582
+ Results:
 
583
 
584
+ ```json
585
+ {
586
+ "generation_eval_size": 20,
587
+ "top_k": 3,
588
+ "plain_exact_match": 0.0,
589
+ "plain_token_f1": 0.3391,
590
+ "rag_exact_match": 0.0,
591
+ "rag_token_f1": 0.8460,
592
+ "rag_minus_plain_exact_match": 0.0,
593
+ "rag_minus_plain_token_f1": 0.5069
594
+ }
595
+ ```
596
 
597
+ This shows a large improvement from RAG:
598
+
599
+ ```text
600
+ Plain token F1: 0.3391
601
+ RAG token F1: 0.8460
602
+ Improvement: +0.5069
603
+ ```
604
+
605
+ This is one of the strongest pieces of evidence in the project.
606
+
607
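+ On this 20-example set, retrieval grounding raises token F1 from 0.3391 to 0.8460, while exact match stays at 0.0 in both settings, so the gain reflects closer wording overlap with the references rather than verbatim matches.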
608
+
609
+ ---
610
 
611
+ # 6. Agent Layer in Baseline 2 and Improved Model
 
 
 
612
 
613
+ ## 6.1 Why an Agent Is Needed
 
614
 
615
+ The handbook is reliable for stable academic rules, but some questions may require official web information.
 
 
616
 
617
+ Examples:
618
+
619
+ ```text
620
+ Who is the current dean?
621
+ Where can students find residential college information?
622
+ What official page mentions PEKOM?
623
+ Where is the official SPeCTRUM page?
624
+ ```
625
+
626
+ For these cases, the system needs a controlled web agent.
627
+
628
+ However, a web agent can be dangerous if it freely browses or trusts random pages. Therefore, this project uses a restricted official-source agent.
629
 
630
  ---
631
 
632
+ ## 6.2 Official UM / FSKTM Web Agent
633
 
634
+ The web agent is constrained to official UM / FSKTM domains.
635
 
636
+ Priority domains:
 
 
 
637
 
638
+ ```text
639
+ fsktm.um.edu.my
640
+ www.um.edu.my
641
+ ```
642
 
643
+ Auxiliary official domains include UM-related systems such as:
644
+
645
+ ```text
646
+ aasd.um.edu.my
647
+ maya.um.edu.my
648
+ umlib.um.edu.my
649
+ umresearch.um.edu.my
650
+ jobs.um.edu.my
651
+ careerportal.fsktm.um.edu.my
652
+ intra.fsktm.um.edu.my
653
+ gallery.fsktm.um.edu.my
654
+ ```
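A domain guard over these lists can be sketched with the standard library. The version below assumes an exact hostname match against the allowlist. The real agent may treat subdomains or redirects differently, and the function name `is_official_url` is hypothetical.

```python
from urllib.parse import urlparse

PRIORITY_DOMAINS = {"fsktm.um.edu.my", "www.um.edu.my"}
AUXILIARY_DOMAINS = {
    "aasd.um.edu.my",
    "maya.um.edu.my",
    "umlib.um.edu.my",
    "umresearch.um.edu.my",
    "jobs.um.edu.my",
    "careerportal.fsktm.um.edu.my",
    "intra.fsktm.um.edu.my",
    "gallery.fsktm.um.edu.my",
}


def is_official_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return host in PRIORITY_DOMAINS or host in AUXILIARY_DOMAINS


print(is_official_url("https://fsktm.um.edu.my/about"))
print(is_official_url("https://fsktm.um.edu.my.evil.example/login"))
```

Matching on the parsed hostname rather than on a substring of the URL is what rejects lookalike hosts such as the second example.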
655
+
656
+ The agent performs:
657
+
658
+ ```text
659
+ query planning
660
+ official web discovery
661
+ URL filtering
662
+ page fetching
663
+ evidence extraction
664
+ evidence scoring
665
+ Qwen-based evidence judging
666
+ retry if weak
667
+ fallback to handbook RAG if needed
668
+ ```
669
 
670
  ---
671
 
672
+ ## 6.3 Agent Is Not Fully Autonomous by Design
673
+
674
+ This project does not use a completely unrestricted autonomous agent.
675
 
676
+ That is intentional.
677
 
678
+ For a university handbook assistant, unrestricted autonomy is less useful than controlled evidence routing. The system needs to be:
679
+
680
+ ```text
681
+ safe
682
+ source-constrained
683
+ traceable
684
+ fallback-aware
685
+ grounded
686
+ ```
687
+
688
+ So the agent is better described as:
689
+
690
+ > A constrained official-source web agent controlled by Harness Engineering.
691
 
692
  ---
693
 
694
+ # 7. Harness Engineering
695
+
696
+ ## 7.1 What Harness Engineering Means Here
697
 
698
+ Harness Engineering is the guardrail system around the model and agent.
699
 
700
+ A simple analogy:
701
+
702
+ ```text
703
+ The LLM/agent is the car.
704
+ Harness Engineering is the guardrail, traffic rule, checkpoint, fallback route, and dashboard.
705
+ ```
706
+
707
+ The model can generate fluent answers, but the harness controls:
708
+
709
+ - what it is allowed to search
710
+ - what sources it can trust
711
+ - whether a URL is fake
712
+ - whether evidence is useful
713
+ - whether the answer is grounded
714
+ - whether the system should retry
715
+ - whether it should fall back to local handbook RAG
716
+ - what trace should be shown to the user
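As one example of such a check, a fake-URL guard can compare the URLs cited in a generated answer with the URLs that actually appeared in the collected evidence. The sketch below is a hypothetical illustration. The function name `find_unsupported_urls`, the regular expression, and the trailing-slash normalization are assumptions.

```python
import re

# Match http(s) URLs up to whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def find_unsupported_urls(answer: str, evidence_urls) -> list:
    """Return URLs cited in the answer that never appeared in the collected evidence."""
    allowed = {url.rstrip("/") for url in evidence_urls}
    # Strip sentence punctuation and trailing slashes before comparing.
    cited = [match.rstrip(".,;").rstrip("/") for match in URL_PATTERN.findall(answer)]
    return [url for url in cited if url not in allowed]


answer = (
    "See https://fsktm.um.edu.my/about and "
    "https://fsktm.um.edu.my/made-up-page."
)
unsupported = find_unsupported_urls(answer, ["https://fsktm.um.edu.my/about/"])
print(unsupported)
```

A non-empty result would be a signal for the harness to strip the link, regenerate the answer, or fall back to local handbook evidence.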
717
 
718
  ---
719
 
720
+ ## 7.2 Harness Pipeline
721
 
722
+ The standardized TensorTalk Harness Core follows this structure:
723
 
724
+ ```text
725
+ User Question
726
+
727
+ Local Handbook RAG
728
+
729
+ Official Web Discovery
730
+
731
+ Domain Guard
732
+
733
+ Fake URL Guard
734
+
735
+ WAF Detection
736
+
737
+ Evidence Normalizer
738
+
739
+ Qwen Evidence Judge
740
+
741
+ Entity-aware Retry
742
+
743
+ Weak Evidence Fallback
744
+
745
+ Answer Generator
746
+
747
+ Answer Grounding Judge
748
+
749
+ Completeness Guard
750
+
751
+ UI Trace
752
+ ```
753
+
754
+ ---
755
 
756
+ ## 7.3 Harness Components
757
 
758
+ The notebooks include several engineering patches and layers:
759
+
760
+ ### V14 WAF-aware Harness
761
+
762
+ Handles web pages blocked by a WAF, as well as browser failures.
763
+
764
+ Functions:
765
+
766
+ - detect WAF block pages
767
+ - exclude blocked pages from evidence
768
+ - provide diagnostics
769
+ - use safe static fallback if browser click fails
770
+ - reject query-fabricated URLs before evidence building
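The guards listed above might look roughly like the following. This is a sketch under stated assumptions: the allowlist suffix, the WAF marker phrases, and all function names are illustrative choices, and the README does not list the real values used in the notebooks.

```python
from urllib.parse import urlparse

# Assumed allowlist; the notebook's real official-domain list is not shown here.
OFFICIAL_SUFFIXES = ("um.edu.my",)

# Assumed marker phrases; real WAF block pages vary by vendor.
WAF_MARKERS = ("access denied", "request blocked", "verify you are human")


def is_official(url: str) -> bool:
    """Domain guard: the host must equal or be a subdomain of an allowed suffix."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in OFFICIAL_SUFFIXES)


def is_discovered(url: str, discovered: set) -> bool:
    """Fake URL guard: a URL counts as real only if search or crawling returned it."""
    return url in discovered


def looks_like_waf_block(html: str) -> bool:
    """WAF detection: flag pages whose text contains a known block phrase."""
    text = html.lower()
    return any(marker in text for marker in WAF_MARKERS)


def filter_evidence(pages: dict, discovered: set) -> list:
    """Keep only official, discovered, non-blocked pages as evidence candidates."""
    return [
        url
        for url, html in pages.items()
        if is_official(url)
        and is_discovered(url, discovered)
        and not looks_like_waf_block(html)
    ]
```

Checking the parsed hostname rather than the raw string matters: a URL such as `https://evil.example.com/um.edu.my` contains the allowed suffix in its path but is rejected by the host comparison.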
771
 
772
  ---
773
 
774
+ ### V15 — Qwen Evidence Judge Loop
775
 
776
+ Adds an LLM-based evidence judge.
777
+
778
+ Flow:
779
+
780
+ ```text
781
+ Planner
782
+ → Search / Fetch
783
+ → Evidence Filter
784
+ → Qwen Judge
785
+ → Retry
786
+ → Final Evidence
787
+ ```
788
+
789
+ The purpose is to avoid trusting weak web snippets blindly.
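A judge loop of this shape can be sketched as follows. The planner, search, and judge are passed in as plain callables so the example runs without a model; in the real system the judge is Qwen, and the threshold, retry count, and stub functions below are assumptions made for illustration.

```python
from typing import Callable, List


def judge_loop(
    question: str,
    plan_query: Callable[[str, int], str],
    search: Callable[[str], List[str]],
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
    max_retries: int = 2,
) -> List[str]:
    """Search, score each snippet with the judge, and retry with a new query if none pass."""
    for attempt in range(max_retries + 1):
        query = plan_query(question, attempt)
        accepted = [s for s in search(query) if judge(question, s) >= threshold]
        if accepted:
            return accepted
    # Nothing passed the judge: the caller should fall back, e.g. to local RAG.
    return []


# Stub components standing in for the planner, web search, and LLM judge.
def stub_plan(question: str, attempt: int) -> str:
    return question if attempt == 0 else question + " FSKTM"


def stub_search(query: str) -> List[str]:
    return ["CCNA Lab page snippet"] if "FSKTM" in query else ["unrelated snippet"]


def stub_judge(question: str, snippet: str) -> float:
    return 1.0 if "CCNA" in snippet else 0.0


evidence = judge_loop("CCNA Lab", stub_plan, stub_search, stub_judge)
print(evidence)
```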
790
 
791
  ---
792
 
793
+ ### V16 — Local-aware Judge Repair
794
+
795
+ Improves routing and fallback.
796
 
797
+ It handles:
798
 
799
+ ```text
800
+ PEKOM routing
801
+ CCNA Lab routing
802
+ residential college routing
803
+ local RAG fallback
804
+ entity-aware retry
805
+ fake URL rejection
806
+ ```
807
 
808
  ---
809
 
810
+ ### V17 Strict Entity Judge and UI Polish
811
+
812
+ Adds stricter entity matching and improves trace display.
813
+
814
+ This helps avoid cases where a query about one entity is answered with a page about a related but different entity.
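One simple way to implement strict matching is to require the whole entity phrase to appear on word boundaries in the page title or body. The sketch below assumes that approach; `normalize` and `entity_matches` are hypothetical helpers, and the notebook's actual matching rules may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, collapse punctuation to spaces, and pad so phrases match on word boundaries."""
    return " " + re.sub(r"[^a-z0-9]+", " ", text.lower()).strip() + " "


def entity_matches(entity: str, title: str, body: str) -> bool:
    """Strict check: the full entity phrase must appear in the title or the body."""
    needle = normalize(entity)
    return needle in normalize(title) or needle in normalize(body)


# A PEKOM query should not accept a page that never mentions PEKOM.
print(entity_matches("PEKOM", "Bachelor of Computer Science (Artificial Intelligence)", "Programme structure"))
print(entity_matches("CCNA Lab", "Facilities", "Information about the CCNA Lab."))
```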
815
+
816
+ ---
817
+
818
+ ### V18 — Balanced Official Reference Fallback
819
+
820
+ Allows the system to still provide official references when web evidence is not strong enough to answer directly, while avoiding over-trusting weak pages.
821
+
822
+ ---
823
+
824
+ ### V19 — Answer Grounding Judge
825
+
826
+ Checks whether the final generated answer is actually supported by evidence.
827
+
828
+ This is important because even if retrieval is correct, the model may still introduce unsupported details.
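As a rough illustration of what a grounding check looks at, the sketch below scores the share of answer tokens that also occur in the evidence and flags any URL that the evidence does not contain. It is a lexical stand-in, not the project's judge: the function names, the 0.6 threshold, and the toy strings are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_report(answer: str, evidence: list, min_support: float = 0.6) -> dict:
    """Report lexical support for the answer and any URLs missing from the evidence."""
    evidence_text = " ".join(evidence)
    # Strip trailing punctuation that the URL pattern may have captured.
    urls = [u.rstrip(".,;)") for u in URL_RE.findall(answer)]
    answer_tokens = tokens(URL_RE.sub(" ", answer))
    supported = answer_tokens & tokens(evidence_text)
    support = len(supported) / max(len(answer_tokens), 1)
    unsupported_urls = [u for u in urls if u not in evidence_text]
    return {
        "support": round(support, 3),
        "unsupported_urls": unsupported_urls,
        "grounded": support >= min_support and not unsupported_urls,
    }


# Toy evidence and answers, made up for this example.
evidence = ["The programme requires 120 credits to graduate."]
ok = grounding_report("The programme requires 120 credits.", evidence)
bad = grounding_report("The programme requires 120 credits. See https://example.com/fake.", evidence)
print(ok["grounded"], bad["unsupported_urls"])
```

A lexical check like this only catches surface mismatches; an LLM-based judge is still needed for paraphrased or subtly contradicted claims.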
829
+
830
+ ---
831
+
832
+ ### Completeness Guard
833
+
834
+ Checks whether the answer is too incomplete and whether a rewrite or fallback should be triggered.
835
+
836
+ ---
837
+
838
+ # 8. Improved Model — PPO Rule-reward Post-training + RAG + Agent + Harness
839
+
840
+ ## 8.1 Purpose
841
+
842
+ The Improved Model asks:
843
+
844
+ > Can we further improve the model’s behavior after SFT/RAG by using PPO reward-based post-training?
845
+
846
+ Baseline 2 already improves factual grounding through RAG and Harness Engineering. The Improved Model adds PPO to shape the model’s behavior.
847
+
848
+ The goal is not to replace RAG. The goal is to make the model more aligned with the desired answer style and safety behavior.
849
+
850
+ ---
851
+
852
+ ## 8.2 What PPO Means in This Project
853
+
854
+ PPO stands for **Proximal Policy Optimization**.
855
+
856
+ In simple terms:
857
+
858
+ ```text
859
+ SFT teaches the model by imitation.
860
+ PPO lets the model generate answers, scores them with a reward function, and updates the model toward higher-reward answers.
861
+ ```
862
+
863
+ In this project:
864
+
865
+ ```text
866
+ Actor model: Qwen3-8B + LoRA
867
+ Critic/value head: TRL value head model
868
+ Reference model: frozen Qwen3-8B reference
869
+ Reward: rule-based preference reward function
870
+ KL control: used to avoid drifting too far from the reference model
871
+ ```
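For reference, the standard PPO clipped objective and a commonly used KL-shaped reward are written out below. This is the textbook formulation, given under the assumption that the notebook follows it; the README does not state the clip range $\epsilon$ or the KL coefficient $\beta$, and $r_{\text{rule}}$ is a label introduced here for the rule-based reward.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]

R(x, y) = r_{\text{rule}}(x, y) - \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\right)
```

Here $\pi_\theta$ is the actor, $\pi_{\text{ref}}$ is the frozen reference model, and $\hat{A}_t$ is the advantage estimated with the value head. The KL term is what keeps the actor from drifting too far from the reference.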
872
+
873
+ ---
874
+
875
+ ## 8.3 Rule-based Reward Function
876
+
877
+ This project uses a rule-based reward function rather than a separately trained neural reward model.
878
+
879
+ The reward function evaluates:
880
+
881
+ ```text
882
+ gold answer similarity
883
+ rejected answer penalty
884
+ evidence overlap
885
+ scope correctness
886
+ hallucinated URL penalty
887
+ vague answer penalty
888
+ process/thinking leakage penalty
889
+ direct answer bonus
890
+ repetition penalty
891
+ degeneration/collapse penalty
892
+ ```
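A rule-based reward of this kind can be sketched as a sum of weighted checks. The example below covers only a subset of the listed terms (gold similarity, rejected penalty, evidence overlap, hallucinated URL penalty, process leakage penalty, repetition penalty). All weights, thresholds, and helper names are assumptions for illustration; the notebook's actual reward function is not reproduced in this README.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Leakage cues modelled on phrases like "Okay, let me try to figure out...".
LEAK_PHRASES = ("okay, let me", "wait, i need to")


def tokens(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(reference: str, candidate: str) -> float:
    """Fraction of the reference's unique tokens that also appear in the candidate."""
    ref = set(tokens(reference))
    if not ref:
        return 0.0
    return len(ref & set(tokens(candidate))) / len(ref)


def rule_reward(response: str, gold: str, rejected: str, evidence: str) -> float:
    score = 0.0
    score += 1.0 * overlap(gold, response)       # gold answer similarity
    score -= 0.5 * overlap(rejected, response)   # rejected answer penalty
    score += 0.5 * overlap(response, evidence)   # evidence overlap
    for url in URL_RE.findall(response):
        if url.rstrip(".,;)") not in evidence:   # hallucinated URL penalty
            score -= 0.5
    if any(p in response.lower() for p in LEAK_PHRASES):
        score -= 0.5                             # process/thinking leakage penalty
    toks = tokens(response)
    if toks and len(set(toks)) / len(toks) < 0.5:
        score -= 0.5                             # repetition penalty
    return score


# Toy strings, made up for this example.
gold = "Students must complete 12 credits of electives."
rejected = "I am not sure, please check the website."
evidence = "Students must complete 12 credits of electives before graduation."
good = rule_reward("Students must complete 12 credits of electives.", gold, rejected, evidence)
bad = rule_reward("Okay, let me think. I am not sure, please check https://example.com/made-up", gold, rejected, evidence)
print(good, bad)
```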
893
+
894
+ This is why the model card should describe the final stage as:
895
+
896
+ > Rule-reward PPO post-training
897
+
898
+ not:
899
+
900
+ > Full RLHF with a trained reward model
901
+
902
+ The reward model type recorded in the notebook is:
903
+
904
+ ```text
905
+ rule_based_preference_reward_function
906
+ uses_separate_neural_reward_model: False
907
+ ```
908
+
909
+ ---
910
+
911
+ ## 8.4 PPO Training Configuration
912
+
913
+ The final PPO run uses:
914
+
915
+ ```text
916
+ Preference dataset rows: 1000
917
+ Train rows: 900
918
+ Validation rows: 100
919
+ MAX_PPO_ROWS: None
920
+ Train fraction: 0.90
921
+ PPO epochs: 2
922
+ Batch size: 2
923
+ Mini-batch size: 1
924
+ Max new tokens: 72
925
+ Max PPO steps per epoch: None
926
+ Planned steps per epoch: 450
927
+ Total planned steps: 900
928
+ Learning rate: 2e-6
929
+ Target KL: 0.10
930
+ Generation temperature: 0.45
931
+ Top-p: 0.78
932
+ Repetition penalty: 1.3
933
+ No-repeat ngram size: 4
934
+ ```
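The planned step counts follow directly from the row counts, batch size, and number of epochs. The short check below reproduces that arithmetic; the variable names are illustrative and only the numbers come from the configuration above.

```python
preference_rows = 1000
train_fraction = 0.90
batch_size = 2
ppo_epochs = 2

train_rows = round(preference_rows * train_fraction)  # 900
valid_rows = preference_rows - train_rows             # 100
steps_per_epoch = train_rows // batch_size            # 450
total_steps = steps_per_epoch * ppo_epochs            # 900

print(train_rows, valid_rows, steps_per_epoch, total_steps)
```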
935
+
936
+ The run completed successfully:
937
+
938
+ ```text
939
+ Global PPO steps: 900 / 900
940
+ Elapsed time: 04:47:59
941
+ Degenerate ratio: 0.00%
942
+ ```
943
+
944
+ ---
945
+
946
+ ## 8.5 PPO Artifact Verification
947
+
948
+ The Stage 3 notebook includes strict artifact verification.
949
+
950
+ This is important because PPO notebooks can easily appear to run while silently saving old or incomplete artifacts.
951
+
952
+ The strict save cell verifies:
953
+
954
+ ```text
955
+ training_log exists
956
+ training_log records = 900
957
+ expected steps = 900
958
+ MAX_PPO_ROWS = None
959
+ train rows = 900
960
+ valid rows = 100
961
+ NUM_PPO_EPOCHS = 2
962
+ MAX_PPO_STEPS_PER_EPOCH = None
963
+ parameter hash changed after PPO
964
+ PPO inference full actor exists
965
+ PPO LoRA adapter exists
966
+ non-PPO fallback forbidden
967
+ ```
968
+
969
+ The final strict save output confirms:
970
+
971
+ ```text
972
+ Final PPO records saved: 900 / expected 900
973
+ Strict full PPO artifact contract passed.
974
+ ```
975
+
976
+ The parameter change proof confirms:
977
+
978
+ ```text
979
+ aggregate_hash_changed: true
980
+ changed_trainable_tensors: 506
981
+ unchanged_trainable_tensors: 0
982
+ ```
983
+
984
+ This shows that PPO training actually updated the trainable LoRA/value-head parameters, rather than the notebook running to completion without changing any weights.
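A parameter change proof of this kind can be built by hashing each trainable tensor before and after training and comparing the results. The sketch below uses plain Python lists in place of tensors so it runs without a deep learning framework. The tensor names and helper functions are hypothetical; only the three report keys mirror the output shown above.

```python
import hashlib
import struct


def tensor_hash(values) -> str:
    """Hash a flat sequence of floats by their exact 64-bit representation."""
    digest = hashlib.sha256()
    for v in values:
        digest.update(struct.pack("<d", float(v)))
    return digest.hexdigest()


def aggregate_hash(params: dict) -> str:
    """Combine per-tensor hashes in a stable (sorted) name order."""
    digest = hashlib.sha256()
    for name in sorted(params):
        digest.update(name.encode("utf-8"))
        digest.update(tensor_hash(params[name]).encode("ascii"))
    return digest.hexdigest()


def change_report(before: dict, after: dict) -> dict:
    changed = [n for n in sorted(before) if tensor_hash(before[n]) != tensor_hash(after[n])]
    return {
        "aggregate_hash_changed": aggregate_hash(before) != aggregate_hash(after),
        "changed_trainable_tensors": len(changed),
        "unchanged_trainable_tensors": len(before) - len(changed),
    }


# Toy parameters standing in for LoRA weights before and after training.
before = {"lora_A": [0.0, 0.1], "lora_B": [0.0, 0.0]}
after = {"lora_A": [0.01, 0.1], "lora_B": [0.0, 0.02]}
report = change_report(before, after)
print(report)
```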
985
+
986
+ ---
987
+
988
+ ## 8.6 Strict PPO-only Runtime
989
+
990
+ The final runtime is configured so that the UI must use PPO artifacts only.
991
+
992
+ The strict PPO gate confirms:
993
+
994
+ ```text
995
+ PPO records: 900
996
+ PPO full actor usable: True
997
+ PPO LoRA adapter usable: True
998
+ Strict PPO-only UI mode: True
999
+ ```
1000
+
1001
+ The runtime loading order is:
1002
+
1003
+ ```text
1004
+ 1. PPO full inference actor if full weights exist
1005
+ 2. Otherwise base Qwen3-8B + PPO LoRA adapter
1006
+ 3. Non-PPO fallback is forbidden
1007
+ ```
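The loading order can be expressed as a small selection function that raises instead of falling back. This is a sketch only: the directory names, the `adapter_config.json` placeholder file, and `choose_ppo_artifact` are assumptions, and the real gate also checks the manifest and training log.

```python
import tempfile
from pathlib import Path


def has_files(directory: Path) -> bool:
    return directory.is_dir() and any(directory.iterdir())


def choose_ppo_artifact(full_actor_dir: Path, lora_adapter_dir: Path) -> str:
    """Return which PPO artifact to load, or fail rather than use a non-PPO model."""
    if has_files(full_actor_dir):
        return "ppo_full_actor"
    if has_files(lora_adapter_dir):
        return "base_plus_ppo_lora"
    raise RuntimeError("No PPO artifact found; non-PPO fallback is forbidden.")


# Usage with temporary directories: only the LoRA adapter exists here.
with tempfile.TemporaryDirectory() as tmp:
    full_dir = Path(tmp) / "ppo_full_actor"
    lora_dir = Path(tmp) / "ppo_lora_adapter"
    lora_dir.mkdir()
    (lora_dir / "adapter_config.json").write_text("{}")
    choice = choose_ppo_artifact(full_dir, lora_dir)
print(choice)
```

Raising an error in the last branch is the point of the design: a missing artifact stops the demo instead of silently loading a different model.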
1008
+
1009
+ This prevents the final demo from accidentally loading an old Baseline 2 model or a stale 150-step PPO proof artifact.
1010
+
1011
+ ---
1012
+
1013
+ ## 8.7 PPO Validation
1014
+
1015
+ The PPO-only evaluation uses a held-out validation sample.
1016
+
1017
+ The displayed validation summary is:
1018
+
1019
+ ```text
1020
+ reward: 0.477789
1021
+ gold_overlap: 0.255351
1022
+ rejected_overlap: 0.155080
1023
+ ```
1024
+
1025
+ Interpretation:
1026
+
1027
+ - reward is positive
1028
+ - gold overlap is higher than rejected overlap
1029
+ - the PPO-trained actor tends to move closer to preferred answers than rejected answers
1030
+
1031
+ This does not mean the PPO model is perfect. It means the reward-shaped behavior is directionally positive.
1032
+
1033
+ ---
1034
+
1035
+ ## 8.8 PPO Limitations
1036
+
1037
+ The PPO run completed successfully, but the raw PPO generations still show some imperfections.
1038
+
1039
+ Observed issues include:
1040
+
1041
+ 1. **Process leakage**
1042
+ Some outputs still include phrases like:
1043
+ ```text
1044
+ Okay, let me try to figure out...
1045
+ Wait, I need to check again...
1046
+ ```
1047
+ The reward function penalizes this, but it is not completely eliminated.
1048
+
1049
+ 2. **Occasional hallucinated URLs**
1050
+ Some raw generations may still invent URLs. The harness fake URL guard is therefore still necessary.
1051
+
1052
+ 3. **OCR-style text artifacts**
1053
+ Some source chunks contain spacing or OCR issues, and the model may reproduce them.
1054
+
1055
+ 4. **KL can be high**
1056
+ Some PPO logs show high `objective/kl`, meaning the PPO actor can drift noticeably from the reference model. However, the run completed with:
1057
+ ```text
1058
+ degenerate_ratio = 0.00%
1059
+ ```
1060
+ and no detected repetition collapse.
1061
+
1062
+ 5. **RAG/Harness remains necessary**
1063
+ PPO improves model behavior, but it does not replace retrieval grounding or guardrails.
1064
+
1065
+ ---
1066
+
1067
+ # 9. TensorTalk UI
1068
+
1069
+ The project includes a WhatsApp-style Jupyter HTML UI called **TensorTalk**.
1070
+
1071
+ The UI supports:
1072
+
1073
+ - chat-style interface
1074
+ - TensorCat avatar
1075
+ - RAG on/off control
1076
+ - web agent on/off control
1077
+ - collapsed trace panels
1078
+ - retrieved evidence display
1079
+ - web evidence display
1080
+ - planning/thinking display layer
1081
+ - harness decision trace
1082
+ - answer grounding information
1083
+ - strict PPO artifact loading
1084
+ - new chat reset behavior
1085
+
1086
+ The UI is part of the engineering contribution because it makes the harness process visible rather than hidden.
1087
+
1088
+ ---
1089
+
1090
+ # 10. Smoke Tests
1091
+
1092
+ ## 10.1 What Smoke Test Means Here
1093
+
1094
+ A smoke test is a lightweight system sanity check.
1095
+
1096
+ It is not a full evaluation. It is a quick check that the main pipeline still works.
1097
+
1098
+ In this project, smoke tests check whether:
1099
+
1100
+ ```text
1101
+ PPO model loads
1102
+ RAG retrieves evidence
1103
+ web agent searches official sources
1104
+ fake URL guard blocks synthetic links
1105
+ answer grounding returns a result
1106
+ trace structure is produced
1107
+ fallback behavior still works
1108
+ ```
1109
+
1110
+ ---
1111
+
1112
+ ## 10.2 Example Smoke Tests
1113
+
1114
+ The notebook defines smoke tests such as:
1115
+
1116
+ ```text
1117
+ 1. PEKOM should not be routed to AI bachelor page
1118
+ 2. Residential college should prefer student-affairs residential page
1119
+ 3. CCNA Lab should not invent synthetic URLs
1120
+ ```
1121
+
1122
+ These are not random examples. They are chosen to test known fragile parts of the pipeline:
1123
+
1124
+ - entity routing
1125
+ - official URL preference
1126
+ - fake URL rejection
1127
+ - web/RAG trace structure
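Smoke tests like these can be written as named cases, each with a question and a predicate over the pipeline result. The sketch below uses a stub in place of the real pipeline. The result keys (`cited_urls`, `discovered_urls`, `trace`), the URL slug in the first check, and all function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SmokeCase:
    name: str
    question: str
    check: Callable[[Dict], bool]


def run_smoke(cases: List[SmokeCase], ask: Callable[[str], Dict]) -> Dict[str, bool]:
    """Run each case through the pipeline and record pass/fail by case name."""
    return {case.name: bool(case.check(ask(case.question))) for case in cases}


def stub_ask(question: str) -> Dict:
    # Stand-in for the real pipeline; returns a fixed, well-formed result.
    return {"answer": "stub", "cited_urls": [], "discovered_urls": [], "trace": ["local_rag"]}


cases = [
    SmokeCase(
        "pekom_not_ai_page",
        "What is PEKOM?",
        lambda r: not any("artificial-intelligence" in u for u in r["cited_urls"]),
    ),
    SmokeCase(
        "ccna_no_synthetic_urls",
        "Where is the CCNA Lab?",
        lambda r: set(r["cited_urls"]) <= set(r["discovered_urls"]),
    ),
    SmokeCase("trace_present", "What is PEKOM?", lambda r: len(r["trace"]) > 0),
]

results = run_smoke(cases, stub_ask)
print(results)
```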
1128
+
1129
+ ---
1130
+
1131
+ # 11. Control Variable Design
1132
+
1133
+ The project uses a control-variable style comparison.
1134
+
1135
+ The base task remains the same:
1136
+
1137
+ ```text
1138
+ UM FSKTM Handbook QA
1139
+ ```
1140
+
1141
+ The base model family remains the same:
1142
+
1143
+ ```text
1144
+ Qwen3-8B
1145
+ ```
1146
+
1147
+ The dataset domain remains the same:
1148
+
1149
+ ```text
1150
+ Undergraduate + postgraduate + general UM Handbook knowledge
1151
+ ```
1152
+
1153
+ What changes is the system layer:
1154
+
1155
+ ```text
1156
+ Baseline 1: SFT only
1157
+ Baseline 2: SFT + RAG + Harness Agent
1158
+ Improved: SFT/RAG/Harness + PPO post-training
1159
+ ```
1160
+
1161
+ This makes it possible to attribute improvements to the components added at each stage:
1162
+
1163
+ - parameter learning
1164
+ - retrieval grounding
1165
+ - metadata-aware scope control
1166
+ - official web augmentation
1167
+ - harness guardrails
1168
+ - PPO reward shaping
1169
+
1170
+ This is more rigorous than simply building three unrelated systems.
1171
+
1172
+ ---
1173
+
1174
+ # 12. Stage-by-stage Comparison Table
1175
+
1176
+ | Dimension | Baseline 1: Closed-book SFT | Baseline 2: RAG + SFT + Agent/Harness | Improved Model: PPO + RAG + Agent/Harness |
1177
+ |---|---|---|---|
1178
+ | Main research question | Can the model memorize and reproduce handbook QA from SFT? | Does retrieval-grounded evidence improve handbook QA? | Can rule-reward PPO further align answer behavior while keeping RAG/Harness control? |
1179
+ | Base model | Qwen3-8B | Qwen3-8B | Qwen3-8B |
1180
+ | Main training method | Supervised fine-tuning | RAG-augmented supervised fine-tuning | Rule-reward PPO post-training |
1181
+ | Dataset used | 1000 SFT QA rows | SFT QA + metadata + RAG KB + RAG eval | 1000 PPO preference rows |
1182
+ | Train/validation/test | 800 / 100 / 100 | 8:1:1 RAG-augmented split | 900 train / 100 validation |
1183
+ | Retrieval | No | Yes | Yes |
1184
+ | Retrieval type | None | Dense embedding + FAISS + metadata-aware rerank | Same RAG runtime reused |
1185
+ | Embedding model | None | BAAI/bge-base-en-v1.5 | RAG runtime inherited from Baseline 2 |
1186
+ | Top-k evidence | None | Top-3 | Top-3 / runtime-dependent |
1187
+ | Metadata awareness | Hidden metadata only, not used at inference | Yes, scope/source/section aware | Yes, used by RAG/Harness runtime |
1188
+ | Scope control | Weak; model may confuse UG/PG if prompt is ambiguous | Stronger due to metadata-aware retrieval | Stronger due to RAG + PPO reward + harness |
1189
+ | Web agent | No | Yes | Yes |
1190
+ | Official domain control | No | Yes, UM/FSKTM official domain whitelist | Yes, same official-source guardrails |
1191
+ | Fake URL guard | No | Yes | Yes |
1192
+ | WAF handling | No | Yes | Yes |
1193
+ | Evidence judge | No | Yes, Qwen evidence judge | Yes |
1194
+ | Retry/fallback policy | No | Yes | Yes |
1195
+ | Answer grounding judge | No | Yes | Yes |
1196
+ | Completeness guard | No | Yes | Yes |
1197
+ | UI trace | Basic chat UI | Harness trace panels | Strict PPO + Harness trace panels |
1198
+ | LoRA rank | 16 | 8 | PPO actor based on LoRA actor/value setup |
1199
+ | Training epochs | 8 SFT epochs | 20 SFT epochs | 2 PPO epochs |
1200
+ | Main output artifact | LoRA adapter + merged model + `.pt` export | LoRA adapter, optional non-merged runtime | PPO full inference actor + PPO LoRA adapter + manifest |
1201
+ | Artifact strictness | Standard save | Adapter/runtime path checks | Manifest, training log count, parameter hash proof, strict gate |
1202
+ | Key metric | Test token F1 ≈ 0.8869 | RAG token F1 ≈ 0.846 on selected eval; retrieval Hit@3 ≈ 0.954 | PPO validation reward ≈ 0.4778; gold overlap > rejected overlap |
1203
+ | Strongest contribution | Clean SFT baseline | Evidence-grounded QA and metadata-aware retrieval | Full PPO post-training with strict artifact verification and harnessed runtime |
1204
+ | Main weakness | Closed-book hallucination risk | More complex runtime, depends on retriever quality | PPO raw outputs still need Harness/RAG due to possible process leakage and fake URLs |
1205
+ | Control variable role | Establishes parameter-only baseline | Adds retrieval and harness while keeping same domain/model family | Adds PPO reward shaping while preserving RAG/Harness pipeline |
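The table reports token F1 and retrieval Hit@3. For readers unfamiliar with these metrics, the sketch below shows one common definition of each. It assumes whitespace tokenization and a single relevant chunk per question; the notebooks' exact tokenization and evaluation code are not shown in this README, so treat it as illustrative.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def hit_at_k(retrieved_ids: List[str], relevant_id: str, k: int = 3) -> bool:
    """True if the relevant chunk appears among the top-k retrieved chunks."""
    return relevant_id in retrieved_ids[:k]


print(token_f1("a b c", "a b d"))
print(hit_at_k(["c1", "c2", "c3", "c4"], "c3"))
```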
1206
+
1207
+ ---
1208
+
1209
+ # 13. Technical Comparison of the Three Stages
1210
+
1211
+ ## 13.1 Content-level Difference
1212
+
1213
+ | Content Aspect | Baseline 1 | Baseline 2 | Improved Model |
1214
+ |---|---|---|---|
1215
+ | Stable handbook facts | Learned into model parameters | Retrieved from handbook KB | Retrieved and answered by PPO-aligned actor |
1216
+ | Latest or official web info | Not supported | Supported through official web agent | Supported through same official web agent |
1217
+ | UG vs PG distinction | Learned implicitly | Controlled by metadata retrieval | Controlled by metadata retrieval + reward/harness |
1218
+ | Evidence visibility | Not shown | Evidence shown in RAG trace | Evidence shown in PPO/Harness trace |
1219
+ | Hallucination control | Mostly prompt-based | Retrieval + grounding | Retrieval + grounding + reward penalties |
1220
+ | Fake URL control | Not available | Harness URL guard | Harness URL guard + PPO penalty signal |
1221
+
1222
+ ---
1223
+
1224
+ ## 13.2 Engineering-level Difference
1225
+
1226
+ | Engineering Aspect | Baseline 1 | Baseline 2 | Improved Model |
1227
+ |---|---|---|---|
1228
+ | Notebook purpose | Train and evaluate closed-book SFT model | Build RAG-augmented model and harnessed agent runtime | Train PPO actor and attach it to final harness runtime |
1229
+ | Runtime complexity | Low | High | Highest |
1230
+ | Debug trace | Basic | Detailed RAG/Web/Harness trace | Detailed PPO/RAG/Web/Harness trace |
1231
+ | Failure handling | Minimal | Fallback and guardrail logic | Strict PPO-only fallback prevention plus harness fallback |
1232
+ | Artifact verification | Basic output save | Adapter/merged path checks | Manifest, training log count, parameter hash proof, strict gate |
1233
+ | Risk of stale artifact use | Moderate | Moderate | Actively guarded against |
1234
+ | Demo readiness | Good for simple QA | Strong for grounded QA | Strongest for final controlled system demo |
1235
+
1236
+ ---
1237
+
1238
+ # 14. Why the Improved Model Does Not Replace RAG
1239
+
1240
+ A key design decision is that PPO does not replace RAG.
1241
+
1242
+ PPO improves the model’s tendency to:
1243
+
1244
+ - answer directly
1245
+ - avoid rejected-style answers
1246
+ - avoid vague answers
1247
+ - avoid process leakage
1248
+ - avoid fake URLs
1249
+ - avoid repetition collapse
1250
+ - use evidence-like wording more appropriately
1251
+
1252
+ But PPO does not guarantee factual correctness by itself.
1253
+
1254
+ Therefore, the final system still needs:
1255
+
1256
+ ```text
1257
+ RAG for evidence
1258
+ Web Agent for official/latest information
1259
+ Harness for source control
1260
+ Grounding judge for answer verification
1261
+ Fallback for weak evidence
1262
+ ```
1263
+
1264
+ This is the correct division of responsibility:
1265
+
1266
+ ```text
1267
+ SFT: teaches domain answer style
1268
+ RAG: supplies factual evidence
1269
+ Agent: finds official external evidence
1270
+ Harness: controls trust, routing, fallback, and trace
1271
+ PPO: improves answer behavior according to reward preferences
1272
+ ```
1273
+
1274
+ ---
1275
+
1276
+ # 15. Known Limitations
1277
+
1278
+ This project is a strong applied LLM system prototype, but it has limitations.
1279
+
1280
+ ## 15.1 Not a full human-feedback RLHF system
1281
+
1282
+ The PPO stage uses a rule-based reward function. It does not train a separate neural reward model from human preference labels.
1283
+
1284
+ Correct description:
1285
+
1286
+ ```text
1287
+ Rule-reward PPO post-training
1288
+ ```
1289
+
1290
+ Not:
1291
+
1292
+ ```text
1293
+ Full RLHF with learned reward model
1294
+ ```
1295
+
1296
+ ---
1297
+
1298
+ ## 15.2 Raw PPO generations can still be imperfect
1299
+
1300
+ Observed raw PPO generations may include:
1301
+
1302
+ - process leakage
1303
+ - occasional hallucinated URLs
1304
+ - OCR-like token spacing
1305
+ - incomplete course titles
1306
+ - noisy source-text reproduction
1307
+
1308
+ The final Harness runtime is therefore necessary.
1309
+
1310
+ ---
1311
+
1312
+ ## 15.3 Web search is constrained
1313
+
1314
+ The web agent is intentionally limited to official UM/FSKTM sources. It may refuse or fall back when official evidence is weak.
1315
+
1316
+ This is a feature, not a bug, because the system prioritizes trustworthiness over open-ended browsing.
1317
+
1318
+ ---
1319
+
1320
+ ## 15.4 RAG depends on knowledge base quality
1321
+
1322
+ If the RAG KB contains OCR noise or incomplete chunks, the model may inherit that noise. Future work should improve source cleaning and chunk normalization.
1323
+
1324
+ ---
1325
+
1326
+ ## 15.5 Notebook-based prototype
1327
+
1328
+ The project is implemented as notebooks. A production version should separate modules into:
1329
+
1330
+ ```text
1331
+ data/
1332
+ retrieval/
1333
+ agent/
1334
+ harness/
1335
+ training/
1336
+ evaluation/
1337
+ ui/
1338
+ tests/
1339
+ ```
1340
+
1341
+ ---
1342
+
1343
+ # 16. Recommended Usage
1344
+
1345
+ This project is intended for research, coursework, and demonstration purposes.
1346
+
1347
+ It is not an official Universiti Malaya system.
1348
+
1349
+ For official academic decisions, students should always refer to the official handbook, faculty office, or UM/FSKTM official websites.
1350
+
1351
+ ---
1352
+
1353
+ # 17. Suggested Inference Flow
1354
+
1355
+ For final demonstration, use the Improved Model runtime:
1356
+
1357
+ ```text
1358
+ 1. Load PPO full inference actor if available.
1359
+ 2. If unavailable, load base Qwen3-8B + PPO LoRA adapter.
1360
+ 3. Initialize local handbook RAG.
1361
+ 4. Enable official UM/FSKTM web agent if the question may need external/latest information.
1362
+ 5. Run through TensorTalkHarnessCore.
1363
+ 6. Display answer with evidence trace.
1364
+ ```
1365
+
1366
+ Strict runtime requirement:
1367
+
1368
+ ```text
1369
+ Non-PPO fallback is forbidden in the final Improved Model demo.
1370
+ ```
1371
+
1372
+ ---
1373
+
1374
+ # 18. Summary
1375
+
1376
+ TensorTalk demonstrates a staged LLM system development workflow:
1377
+
1378
+ ```text
1379
+ Baseline 1:
1380
+ Qwen3-8B learns handbook QA through closed-book SFT.
1381
+
1382
+ Baseline 2:
1383
+ The system adds RAG, dense retrieval, metadata-aware reranking, official web search, and Harness Engineering.
1384
+
1385
+ Improved Model:
1386
+ The system adds full 1000-row rule-reward PPO post-training, strict artifact verification, and a PPO-only final harness runtime.
1387
+ ```
1388
+
1389
+ The most important contribution is not only that the model can answer handbook questions, but that the system is controlled, evidence-aware, source-constrained, traceable, and evaluated through a clear baseline progression.
1390
+
1391
+ The final system should be understood as:
1392
 
1393
+ > **A Qwen3-8B based UM Handbook RAG Agent, improved with rule-reward PPO and controlled by Harness Engineering.**