TensorCat commited on
Commit
36f82a2
Β·
verified Β·
1 Parent(s): 052d67e

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +250 -3
README.md CHANGED
@@ -1,3 +1,250 @@
1
- ---
2
- license: apache-2.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+ # TensorTalk / UM_Handbook
5
+
6
+ TensorTalk is a handbook-grounded academic chat assistant built for the **Faculty of Computer Science and Information Technology, Universiti Malaya (UM)**.
7
+
8
+ This project focuses on turning UM handbook content into a usable question-answering system through:
9
+
10
+ - handbook preprocessing
11
+ - source chunk construction
12
+ - supervised QA dataset building
13
+ - Qwen3-8B LoRA fine-tuning
14
+ - merged-model deployment
15
+ - a browser-style HTML chat demo
16
+
17
+ ---
18
+
19
+ ## Project Goal
20
+
21
+ The main goal of this project is to build a handbook-based assistant that can answer student questions using information learned from the UM handbook domain.
22
+
23
+ The current version is designed around:
24
+
25
+ - undergraduate and postgraduate handbook content
26
+ - handbook-faithful answers
27
+ - concise student-facing responses
28
+ - a local/demo deployment workflow on DICC and notebook environments
29
+
30
+ This project is also intended to support a broader experimental pipeline:
31
+
32
+ - **Baseline 1:** closed-book supervised fine-tuning
33
+ - **Baseline 2:** retrieval-augmented version for later comparison
34
+
35
+ ---
36
+
37
+ ## What This Project Contains
38
+
39
+ ### 1. Dataset Preparation
40
+ The project includes scripts and resources for preparing handbook data before fine-tuning:
41
+
42
+ - handbook markdown preprocessing
43
+ - source chunk dataset building
44
+ - SFT QA dataset construction
45
+ - configuration management for the preprocessing and dataset pipeline
46
+
47
+ ### 2. Fine-Tuning Workflow
48
+ The model training workflow uses a Qwen3-8B base model with LoRA-based fine-tuning on the UM handbook QA dataset.
49
+
50
+ The fine-tuning workflow includes:
51
+
52
+ - notebook-based training on DICC
53
+ - device-aware loading logic
54
+ - train / validation / test style evaluation workflow
55
+ - merged-model export for direct inference
56
+ - LoRA adapter export for optional PEFT-based reuse
57
+ - metrics and prediction file generation
58
+
59
+ ### 3. Deployment Demo
60
+ The project includes a notebook-based HTML chat UI called **TensorTalk**.
61
+
62
+ The demo provides:
63
+
64
+ - a browser-style chat layout
65
+ - a handbook-focused system prompt
66
+ - merged-model loading for direct inference
67
+ - a student-facing question-answer workflow
68
+ - a simple deployment path for demonstration purposes
69
+
70
+ ---
71
+
72
+ ## Current Project Structure
73
+
74
+ ```text
75
+ UM_Handbook/
76
+ β”œβ”€β”€ Dataset/
77
+ β”‚ └── SFT_Dataset/
78
+ β”‚ β”œβ”€β”€ SFT_QA_Training_Ready.jsonl
79
+ β”‚ β”œβ”€β”€ SFT_QA_Training_Ready_pretty.json
80
+ β”‚ β”œβ”€β”€ SFT_QA_Metadata.jsonl
81
+ β”‚ └── SFT_QA_Metadata_pretty.json
82
+ β”œβ”€β”€ assets/
83
+ β”œβ”€β”€ outputs/
84
+ β”‚ └── qwen3_um_handbook_optimized_1/
85
+ β”‚ β”œβ”€β”€ lora_adapter/
86
+ β”‚ β”œβ”€β”€ merged_model/
87
+ β”‚ β”œβ”€β”€ trainer_runs/
88
+ β”‚ β”œβ”€β”€ test_eval_runs/
89
+ β”‚ β”œβ”€β”€ dataset_split_summary.json
90
+ β”‚ β”œβ”€β”€ final_metrics.json
91
+ β”‚ β”œβ”€β”€ test_predictions.jsonl
92
+ β”‚ └── validation_predictions.jsonl
93
+ β”œβ”€β”€ FineTune_QWEN3_UM_Handbook_optimized_1.ipynb
94
+ β”œβ”€β”€ UM_Handbook_Markdown_Preprocess.py
95
+ β”œβ”€β”€ UM_SFT_QA_Dataset_Builder_from_Index.py
96
+ β”œβ”€β”€ UM_Source_Chunk_Dataset_Builder.py
97
+ └── um_handbook_config.py
98
+ ```
99
+
100
+ ---
101
+
102
+ ## Key Files
103
+
104
+ ### Training and Data
105
+ - `Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl`
106
+ Main SFT training dataset used for handbook QA fine-tuning.
107
+
108
+ - `UM_Handbook_Markdown_Preprocess.py`
109
+ Preprocesses handbook markdown / extracted source text.
110
+
111
+ - `UM_Source_Chunk_Dataset_Builder.py`
112
+ Builds source chunks for downstream dataset and retrieval-related use.
113
+
114
+ - `UM_SFT_QA_Dataset_Builder_from_Index.py`
115
+ Builds the supervised QA dataset from curated handbook content.
116
+
117
+ - `um_handbook_config.py`
118
+ Central configuration file for paths and data-processing settings.
119
+
120
+ ### Training Output
121
+ - `outputs/qwen3_um_handbook_optimized_1/merged_model/`
122
+ Main inference-ready model directory.
123
+ This is the directory used by the demo chat UI.
124
+
125
+ - `outputs/qwen3_um_handbook_optimized_1/lora_adapter/`
126
+ LoRA adapter weights.
127
+ This is useful for PEFT-style loading with a base model, but it is not the primary path used by the current demo UI.
128
+
129
+ - `outputs/qwen3_um_handbook_optimized_1/final_metrics.json`
130
+ Final evaluation summary.
131
+
132
+ - `outputs/qwen3_um_handbook_optimized_1/validation_predictions.jsonl`
133
+ Validation-set generated answers for inspection.
134
+
135
+ - `outputs/qwen3_um_handbook_optimized_1/test_predictions.jsonl`
136
+ Test-set generated answers for inspection.
137
+
138
+ ### Demo
139
+ - `FineTune_QWEN3_UM_Handbook_optimized_1.ipynb`
140
+ Main notebook that contains the fine-tuning workflow and the TensorTalk HTML chat demo.
141
+
142
+ ---
143
+
144
+ ## Model Artifact Notes
145
+
146
+ This project may contain several model-related outputs. They are not all used in the same way.
147
+
148
+ ### `merged_model/`
149
+ This is the most important deployment artifact for the current demo.
150
+
151
+ Use this when:
152
+ - running the current TensorTalk HTML chat UI
153
+ - loading the fine-tuned model directly with Hugging Face `from_pretrained(...)`
154
+ - sharing the main inference-ready model
155
+
156
+ ### `lora_adapter/`
157
+ This contains LoRA delta weights only.
158
+
159
+ Use this when:
160
+ - loading the adapter on top of the original base model
161
+ - reusing the fine-tuning result in a PEFT workflow
162
+ - experimenting with a smaller transferable fine-tuning artifact
163
+
164
+ ### `.pt` exported model file
165
+ If present, the `.pt` file is mainly a saved full-model artifact / backup export.
166
+
167
+ Use this when:
168
+ - archiving the full fine-tuned weights
169
+ - running a custom loading workflow that explicitly expects a `.pt` file
170
+
171
+ For the current TensorTalk chat UI, the primary runtime artifact is still **`merged_model/`**.
172
+
173
+ ---
174
+
175
+ ## Current Demo Behavior
176
+
177
+ The current demo is designed to answer questions such as:
178
+
179
+ - dress code and appearance guidance
180
+ - programme core courses / credit requirements
181
+ - undergraduate vs postgraduate handbook information
182
+ - academic rules and handbook-supported policy questions
183
+
184
+ The answer style is intended to be:
185
+
186
+ - handbook-grounded
187
+ - short and direct
188
+ - student-facing
189
+ - non-speculative
190
+
191
+ ---
192
+
193
+ ## Example Demo Output
194
+
195
+ The screenshot below shows the current TensorTalk chat interface running with the fine-tuned UM handbook model.
196
+
197
+ ![TensorTalk Demo](assets/tensortalk_demo_chat.jpg)
198
+
199
+ ---
200
+
201
+ ## Repository Preview
202
+
203
+ The screenshot below shows the current top-level project layout.
204
+
205
+ ![Repository Structure](assets/repo_structure.png)
206
+
207
+ ---
208
+
209
+ ## Suggested Minimal Deployment Package
210
+
211
+ If the goal is only to demonstrate the chat UI to teammates, the minimal useful set is:
212
+
213
+ - `merged_model/`
214
+ - the chat notebook / UI code
215
+ - optional avatar image under `assets/`
216
+
217
+ The following items are not required for a simple demo run:
218
+
219
+ - intermediate training checkpoints
220
+ - test evaluation run directories
221
+ - optional full `.pt` export
222
+ - raw training logs not used by the demo
223
+
224
+ ---
225
+
226
+ ## Notes
227
+
228
+ - The project is organized so that **Dataset**, **models / outputs**, and **demo code** remain separate.
229
+ - The current demo is notebook-friendly and was prepared around a DICC workflow.
230
+ - The deployment path prioritizes clarity and reproducibility over a heavyweight full-stack application setup.
231
+
232
+ ---
233
+
234
+ ## Status
235
+
236
+ Current project status:
237
+
238
+ - handbook preprocessing pipeline prepared
239
+ - supervised QA dataset prepared
240
+ - LoRA fine-tuning workflow completed
241
+ - merged model exported
242
+ - TensorTalk HTML chat demo running
243
+ - evaluation outputs generated
244
+
245
+ ---
246
+
247
+ ## Author / Project Name
248
+
249
+ **TensorTalk**
250
+ UM Handbook QA / Fine-Tuned Qwen3-8B LoRA Project