PEFT
qlora
sft
trl
qwen3
tmf921
intent-based-networking
network-slicing
rtx-6000-ada
ml-intern
nraptisss commited on
Commit
d7752eb
Β·
verified Β·
1 Parent(s): c8eb079

Restore and update project journal with zero-shot baseline

Browse files
Files changed (1) hide show
  1. PROJECT_JOURNAL.md +366 -0
PROJECT_JOURNAL.md CHANGED
@@ -0,0 +1,366 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # TMF921 Intent-to-Configuration Research Journal
2
+
3
+ This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
4
+
5
+ Repository links:
6
+
7
+ - Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
8
+ - Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
9
+ - Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
10
+ - Base model: https://huggingface.co/Qwen/Qwen3-8B
11
+
12
+ ---
13
+
14
+ ## Current status summary
15
+
16
+ Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
17
+
18
+ Stage 2 status: **diagnostic / not promoted**.
19
+
20
+ Best stage-1 normalized metrics:
21
+
22
+ | Split | JSON parse | Normalized field F1 | Normalized key F1 |
23
+ |---|---:|---:|---:|
24
+ | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
25
+ | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
26
+ | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
27
+ | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
28
+ | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
29
+
30
+ Zero-shot Qwen3-8B baseline, 200 examples per split:
31
+
32
+ | Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
33
+ |---|---:|---:|---:|
34
+ | `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
35
+ | `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
36
+ | `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
37
+ | `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
38
+ | `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
39
+
40
+ Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
41
+
42
+ ---
43
+
44
+ ## 2026-04-30 β€” Dataset cloned and audited
45
+
46
+ The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
47
+
48
+ Key findings:
49
+
50
+ - Total rows: **41,815**
51
+ - Train: **39,294**
52
+ - Test: **2,521**
53
+ - Missing values: **0**
54
+ - Duplicate IDs: **0**
55
+ - Assistant JSON parse validity: **100%**
56
+ - Exact train/test full-message overlap: **0**
57
+ - Near-duplicate prompt similarity was high:
58
+ - >= 0.90: **1,290 / 2,521**
59
+ - >= 0.95: **602 / 2,521**
60
+ - >= 0.98: **262 / 2,521**
61
+ - `create` lifecycle operation: **95.9%**
62
+ - adversarial rows: **166 = 0.397%**
63
+ - unique JSON structure signatures: **31**
64
+
65
+ Interpretation:
66
+
67
+ The dataset is technically clean and suitable for SFT, but the original split is mainly in-distribution/template-compliance rather than a strong OOD benchmark.
68
+
69
+ Decision:
70
+
71
+ Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
72
+
73
+ ---
74
+
75
+ ## 2026-04-30 β€” Research SOTA dataset created
76
+
77
+ Created:
78
+
79
+ - https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
80
+
81
+ Splits:
82
+
83
+ | Split | Rows | Purpose |
84
+ |---|---:|---|
85
+ | `train_base` | 26,357 | unaugmented training after OOD holdouts |
86
+ | `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
87
+ | `validation` | 1,547 | validation |
88
+ | `test_in_distribution` | 1,455 | in-distribution test |
89
+ | `test_template_ood` | 3,503 | held-out prompt-template family |
90
+ | `test_use_case_ood` | 4,341 | held-out use cases |
91
+ | `test_sector_ood` | 4,579 | held-out sectors |
92
+ | `test_adversarial` | 33 | held-out adversarial examples |
93
+
94
+ Qwen3 token-length audit:
95
+
96
+ - mean: **754.1**
97
+ - p50: **705**
98
+ - p95: **1293**
99
+ - p99: **1300**
100
+ - max: **1316**
101
+ - fit within 2048: **100%**
102
+
103
+ `train_sota` balancing:
104
+
105
+ - non-create lifecycle rows: **5,166 = 15.97%**
106
+ - adversarial rows: **2,115 = 6.54%**
107
+ - synthetic multi-turn wrappers: **1,281**
108
+
109
+ Decision:
110
+
111
+ Use `train_sota` for the first Qwen3-8B QLoRA training run.
112
+
113
+ ---
114
+
115
+ ## 2026-04-30 / 2026-05-01 β€” Training/evaluation repo created
116
+
117
+ Created:
118
+
119
+ - https://huggingface.co/nraptisss/tmf921-intent-training
120
+
121
+ Default recipe:
122
+
123
+ - Base model: `Qwen/Qwen3-8B`
124
+ - Method: QLoRA SFT
125
+ - Quantization: 4-bit NF4 + double quantization
126
+ - LoRA target modules: `all-linear`
127
+ - LoRA rank: 64
128
+ - LR: 2e-4
129
+ - Max length: 2048
130
+ - Loss: assistant-only SFT loss
131
+ - bf16: enabled
132
+ - gradient checkpointing: enabled
133
+ - train split: `train_sota`
134
+
135
+ The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold.
136
+
137
+ ---
138
+
139
+ ## 2026-05-01 β€” Runtime issues fixed
140
+
141
+ Fixed issues:
142
+
143
+ 1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
144
+ 2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
145
+ 3. Trackio invalid Space ID: sanitized Trackio config and added `DISABLE_TRACKIO=1`.
146
+ 4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.
147
+
148
+ Server GPU evidence:
149
+
150
+ ```text
151
+ torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
152
+ cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
153
+ ```
154
+
155
+ ---
156
+
157
+ ## 2026-05-01 / 2026-05-02 β€” Stage-1 Qwen3-8B QLoRA training completed
158
+
159
+ Run directory:
160
+
161
+ ```text
162
+ runs/qwen3-8b-qlora-20260501-083834
163
+ ```
164
+
165
+ Training behavior:
166
+
167
+ - Initial loss: **1.212**
168
+ - Later loss: **~0.14–0.15**
169
+ - Mean token accuracy: **~0.945–0.953**
170
+ - Validation loss plateau: **~0.153**
171
+
172
+ No observed:
173
+
174
+ - CUDA OOM
175
+ - NaNs
176
+ - divergence
177
+ - gradient explosion
178
+
179
+ Decision:
180
+
181
+ Evaluate the trained adapter across ID and OOD splits.
182
+
183
+ ---
184
+
185
+ ## 2026-05-02 / 2026-05-04 β€” Evaluation speed issue fixed
186
+
187
+ Initial 4-bit adapter evaluation was too slow:
188
+
189
+ ```text
190
+ test_in_distribution: 1455 examples in ~25h
191
+ ```
192
+
193
+ Fixes:
194
+
195
+ - batched generation,
196
+ - dynamic generation length,
197
+ - periodic save/resume,
198
+ - merged bf16 model evaluation.
199
+
200
+ ---
201
+
202
+ ## 2026-05-04 β€” Stage-1 raw and normalized evaluation
203
+
204
+ Raw metrics:
205
+
206
+ | Split | JSON parse | Exact match | Field F1 | KPI presence |
207
+ |---|---:|---:|---:|---:|
208
+ | `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
209
+ | `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
210
+ | `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
211
+ | `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
212
+ | `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
213
+
214
+ Normalized metrics:
215
+
216
+ | Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
217
+ |---|---:|---:|---:|---:|
218
+ | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
219
+ | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
220
+ | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
221
+ | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
222
+ | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
223
+
224
+ Interpretation:
225
+
226
+ The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
227
+
228
+ Weak layers:
229
+
230
+ - `o1_nrm`: normalized field F1 around **0.39–0.40**
231
+ - `a1_policy`: normalized field F1 around **0.67–0.68**
232
+ - `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
233
+ - `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
234
+
235
+ Decision:
236
+
237
+ Test a stage-2 weak-layer continuation experiment.
238
+
239
+ ---
240
+
241
+ ## 2026-05-05 β€” Stage-2 weak-layer continuation run and evaluation
242
+
243
+ Stage-2 setup:
244
+
245
+ - initialized from stage-1 adapter,
246
+ - weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
247
+ - stage-2 rows: **13,829**,
248
+ - weak rows: **10,638**,
249
+ - replay rows: **3,191**,
250
+ - LR: **5e-5**,
251
+ - epochs: **1**.
252
+
253
+ Stage-2 training was stable. Adapter continuation was correctly configured:
254
+
255
+ ```text
256
+ trainable params: 174,587,904
257
+ requires_grad={'default': True}
258
+ devices={'default': ['cuda']}
259
+ ```
260
+
261
+ Stage-2 evaluation comparison:
262
+
263
+ | Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
264
+ |---|---:|---:|---:|---:|---:|---:|
265
+ | `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
266
+ | `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
267
+ | `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
268
+ | `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
269
+ | `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
270
+
271
+ Decision:
272
+
273
+ Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.
274
+
275
+ Interpretation:
276
+
277
+ Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
278
+
279
+ ---
280
+
281
+ ## 2026-05-06 β€” Zero-shot Qwen3-8B baseline completed
282
+
283
+ Goal:
284
+
285
+ Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
286
+
287
+ Action:
288
+
289
+ Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:
290
+
291
+ ```bash
292
+ EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
293
+ bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
294
+ ```
295
+
296
+ Zero-shot metrics:
297
+
298
+ | Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
299
+ |---|---:|---:|---:|
300
+ | `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
301
+ | `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
302
+ | `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
303
+ | `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
304
+ | `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
305
+
306
+ Comparison with fine-tuned stage 1:
307
+
308
+ | Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
309
+ |---|---:|---:|---:|---:|---:|---:|
310
+ | ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
311
+ | Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
312
+ | Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
313
+ | Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
314
+ | Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
315
+
316
+ Interpretation:
317
+
318
+ Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
319
+
320
+ ---
321
+
322
+ ## 2026-05-07 β€” Publication packaging and paper scaffold
323
+
324
+ Completed:
325
+
326
+ - finalized dataset card,
327
+ - finalized primary stage-1 model card,
328
+ - added `REPRODUCIBILITY.md`,
329
+ - added `scripts/reproduce_stage1_eval.sh`,
330
+ - added `scripts/run_zero_shot_baseline.sh`,
331
+ - added `scripts/package_results.py`,
332
+ - added `scripts/sample_failure_examples.py`,
333
+ - uploaded `results/` and `analysis/` artifacts,
334
+ - added `paper/outline.md`,
335
+ - added `paper/tables.md`.
336
+
337
+ Current publication-ready assets:
338
+
339
+ - dataset card,
340
+ - model card,
341
+ - results package,
342
+ - qualitative examples,
343
+ - reproducibility checklist,
344
+ - paper outline,
345
+ - draft tables,
346
+ - project journal.
347
+
348
+ ---
349
+
350
+ ## Current open research questions
351
+
352
+ 1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
353
+ 2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
354
+ 3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
355
+ 4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
356
+
357
+ ## Next recommended step
358
+
359
+ Write the first manuscript draft using:
360
+
361
+ - `paper/outline.md`,
362
+ - `paper/tables.md`,
363
+ - `PROJECT_JOURNAL.md`,
364
+ - `results/stage1_vs_stage2_comparison.md`,
365
+ - `results/baselines/zero_shot_vs_finetuned.md`,
366
+ - `analysis/stage1_examples/failure_examples.md`.