# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 — Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (the similarity measure is sketched after this list):
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**

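The audit's exact n-gram settings are not recorded in this journal; a minimal sketch of a character n-gram Jaccard similarity of the kind used for the near-duplicate check above, assuming 5-grams over lowercased, whitespace-collapsed prompts:

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of character n-grams for a normalized prompt string."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}


def ngram_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity over character n-gram sets, in [0.0, 1.0]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def max_similarity_to_train(test_prompt: str, train_prompts: list[str]) -> float:
    """Highest similarity of one test prompt against all train prompts."""
    return max(ngram_similarity(test_prompt, p) for p in train_prompts)
```

Counting test prompts whose `max_similarity_to_train` exceeds 0.90, 0.95, and 0.98 reproduces the style of thresholds reported above.
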
Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset with:

- OOD splits,
- train/eval provenance columns,
- a token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.

---

## 2026-04-30 — Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented splits:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit (a token-counting sketch follows this list):

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**

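The audit behind `qwen3_chat_template_tokens` and `fits_2048_qwen3` lives in the dataset build scripts; a minimal sketch of how such counts can be obtained with the Qwen3 tokenizer (the example messages are illustrative, not dataset rows):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")


def chat_template_token_count(messages: list[dict]) -> int:
    """Token count of a conversation rendered with the Qwen3 chat template."""
    ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,  # count the full system/user/assistant exchange
    )
    return len(ids)


# Illustrative row: check that it fits the 2048-token training window.
example = [
    {"role": "system", "content": "Translate TMF921 intents into configurations."},
    {"role": "user", "content": "Create an eMBB slice with 50 Mbps downlink."},
    {"role": "assistant", "content": "{\"intent\": {}}"},
]
print(chat_template_token_count(example), chat_template_token_count(example) <= 2048)
```
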
`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

### Interpretation

`max_length=2048` is justified for Qwen3-8B. `train_sota` improves rare-class exposure. OOD splits allow scientifically meaningful generalization reporting.

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on an RTX 6000 Ada 48/50GB GPU.

### Action

Created `nraptisss/tmf921-intent-training` with:

- QLoRA SFT training script,
- evaluation script,
- merge script,
- RTX 6000 Ada install script,
- GPU preflight,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe (a training-setup sketch follows this list):

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quant
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`

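The actual training script lives in the repo; a minimal sketch of a QLoRA SFT setup matching the recipe above, assuming recent `transformers`/`peft`/`trl` releases (argument names such as `max_length` and `assistant_only_loss` differ between TRL versions):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# QLoRA: 4-bit NF4 base weights with double quantization, bf16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)

peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

data = load_dataset("nraptisss/TMF921-intent-to-config-research-sota")
train = data["train_sota"].select_columns(["messages"])  # conversational format only
val = data["validation"].select_columns(["messages"])

args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_length=2048,
    assistant_only_loss=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model, args=args, peft_config=peft_config,
    train_dataset=train, eval_dataset=val,
)
trainer.train()
```
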
### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with the QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require a CUDA preflight, and ensure unique output directories to avoid overwriting results.

---

## 2026-05-01 — Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses the GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

There was concern that training might not be using the GPU.

Fix (a preflight sketch follows this list):

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to the training/evaluation scripts.

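The repo's `scripts/check_gpu.py` is not reproduced here; a minimal fail-fast preflight of the kind described above could look like:

```python
import sys

import torch


def preflight() -> None:
    """Abort before training if CUDA is unavailable or no GPU is visible."""
    print(f"torch={torch.__version__} torch.version.cuda={torch.version.cuda}")
    if not torch.cuda.is_available():
        sys.exit("CUDA is not available; refusing to start a CPU-only run.")
    count = torch.cuda.device_count()
    if count == 0:
        sys.exit("No CUDA devices visible (check CUDA_VISIBLE_DEVICES).")
    for i in range(count):
        print(f"cuda device {i}: {torch.cuda.get_device_name(i)}")


if __name__ == "__main__":
    preflight()
```
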
Evidence from server logs:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.

#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains `messages` plus convenience `prompt`/`completion` columns. TRL inferred prompt-completion format instead of conversational format.

Fix:

The training script now passes only the `messages` column:

```python
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

Invalid `TRACKIO_SPACE_ID=nraptisss/`.

Fix:

Added validation/sanitization for Trackio Space IDs and support for:

```bash
DISABLE_TRACKIO=1
```

#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed the config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with the fixed scripts and Trackio disabled to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 — Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under nohup with a unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

Not observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Training loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.

### Decision / next step

Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

The initial evaluator used single-example 4-bit adapter generation with a large `max_new_tokens`, which made evaluation very slow:

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched the evaluator to support:

- batched generation,
- dynamic generation length based on target length + buffer,
- periodic save/resume,
- partial prediction reuse.

Also recommended merging the adapter into the base bf16 model for faster inference, as sketched below.

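The repo's `scripts/merge_adapter.py` is used for this later in the journal; a minimal sketch of merging a LoRA adapter into the bf16 base model with PEFT (paths are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3-8B"
adapter_dir = "runs/qwen3-8b-qlora-20260501-083834/outputs/adapter"
out_dir = "runs/qwen3-8b-qlora-20260501-083834/outputs/merged"

# Load the base model in bf16 (not 4-bit) so the merged weights stay full precision.
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```
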
### Decision / next step

Use merged-model evaluation and normalized metrics.

---

## 2026-05-04 — Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact match is low for the primary config layers, but that metric is likely too strict because many fields are volatile/generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 — Normalized evaluator implemented and run

### Goal

Re-score existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes (a scoring sketch follows this list):

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.

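The full logic is in `scripts/normalize_eval_metrics.py`; a minimal sketch of the general idea, flattening prediction and reference JSON into path/value pairs, dropping volatile keys, and scoring field F1 (the volatile-key patterns below are illustrative, not the script's actual rules):

```python
import re

# Illustrative volatile-key patterns; the real evaluator masks more categories.
VOLATILE = re.compile(r"(^|\.)(id|href|name|description|@schemaLocation|.*[Tt]imestamp)$")


def flatten(obj, prefix=""):
    """Yield (json_path, value) pairs for every leaf of a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield prefix, obj


def normalized_fields(obj) -> set:
    """Leaf (path, value) pairs with volatile paths removed."""
    return {(p, str(v)) for p, v in flatten(obj) if not VOLATILE.search(p)}


def normalized_field_f1(pred: dict, ref: dict) -> float:
    """F1 over normalized (path, value) pairs; key F1 would compare paths only."""
    p, r = normalized_fields(pred), normalized_fields(ref)
    if not p or not r:
        return float(p == r)
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```
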
### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact-match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across ID and OOD splits. Field-level value fidelity is moderate-to-strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower LR, and include replay from strong layers to prevent forgetting.

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization,
- excellent adversarial rejection,
- normalized structural key F1 around 98% across non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. Current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from the current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus a replay buffer,
- lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- same max length 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage, or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD — Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```

---

## 2026-05-04 — Stage 2 weak-layer continuation plan implemented

### Goal

Improve the weak target layers identified by the normalized evaluation without degrading strong layers.

Weak layers from the normalized evaluation:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
- `tmf921_lifecycle_scale`: mixed, included because lifecycle scaling still had noticeable errors

### Action

Added stage-2 tooling:

- `scripts/build_weak_layer_dataset.py`
- `scripts/train_continue_adapter.py`
- `configs/stage2_weak_layer_qwen3_8b.yaml`
- `scripts/nohup_stage2_weak.sh`

The weak-layer dataset builder creates a local parquet training set with (see the sketch after this list):

1. all weak-layer rows from `train_sota`,
2. duplicated rare weak layers up to a minimum count,
3. a replay buffer from non-weak layers to reduce forgetting.

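A minimal sketch of that builder logic using `datasets` (the real `scripts/build_weak_layer_dataset.py` may differ; the `target_layer` column name is an assumption based on the composition tables recorded below):

```python
import random

from datasets import Dataset, concatenate_datasets, load_dataset

WEAK = {"o1_nrm", "a1_policy", "tmf921_lifecycle_report",
        "tmf921_lifecycle_monitor", "tmf921_lifecycle_scale"}


def build_stage2_dataset(min_per_layer: int = 1500, replay_ratio: float = 0.3,
                         seed: int = 0) -> Dataset:
    """Weak-layer rows, rare-layer duplication, plus a replay buffer of strong layers."""
    rng = random.Random(seed)
    train = load_dataset("nraptisss/TMF921-intent-to-config-research-sota",
                         split="train_sota")
    weak = train.filter(lambda r: r["target_layer"] in WEAK)       # assumed column
    strong = train.filter(lambda r: r["target_layer"] not in WEAK)

    # Duplicate rare weak layers up to min_per_layer rows each.
    parts = []
    for layer in WEAK:
        rows = weak.filter(lambda r, layer=layer: r["target_layer"] == layer)
        idx = list(range(len(rows)))
        while len(rows) > 0 and len(idx) < min_per_layer:
            idx.append(rng.randrange(len(rows)))
        parts.append(rows.select(idx))
    weak_upsampled = concatenate_datasets(parts)

    # Replay buffer sized relative to the weak portion to limit forgetting.
    n_replay = min(int(replay_ratio * len(weak_upsampled)), len(strong))
    replay = strong.shuffle(seed=seed).select(range(n_replay))

    return concatenate_datasets([weak_upsampled, replay]).shuffle(seed=seed)
```

With `min_per_layer=1500` and `replay_ratio=0.3`, this kind of logic reproduces the 10,638 weak rows and 3,191 replay rows reported for the run below.
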
The continuation trainer loads (see the loading sketch after this list):

1. the Qwen3-8B base model in 4-bit NF4,
2. the existing LoRA adapter with `is_trainable=True`,
3. the local weak-layer replay dataset,
4. TRL `SFTTrainer` without a new `peft_config`, per PEFT/TRL continuation best practices.

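A minimal sketch of steps 1 and 2, loading the quantized base model and resuming training of the existing adapter (the trainer wiring around it follows the stage-1 recipe):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)

# Load the stage-1 adapter so its LoRA weights keep training; no new peft_config is
# passed to SFTTrainer, so no fresh adapter is created on top of this one.
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/outputs/adapter",
    is_trainable=True,
)
model.print_trainable_parameters()  # should report only LoRA parameters as trainable
```
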
Stage-2 default hyperparameters:

```yaml
learning_rate: 5e-5
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
```

### Interpretation

A lower learning rate and replay buffer should improve weak-layer value fidelity while reducing catastrophic forgetting on strong layers. This is a targeted continuation, not a replacement for Gen4 data generation or official schema validation.

### Decision / next step

Run stage-2 from the completed stage-1 adapter:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

After training, evaluate with the same raw + normalized OOD protocol and compare against stage-1 metrics.

---

## 2026-05-05 — Stage 2 weak-layer continuation run started

### Goal

Run the stage-2 weak-layer continuation experiment implemented on 2026-05-04.

The intended scientific question is:

> Can a short, low-learning-rate continuation on weak target layers improve low-performing layer-specific value fidelity while preserving the strong global JSON validity, key structure, and adversarial behavior from stage 1?

### Action

Started stage 2 with:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

Generated run:

```text
runs/stage2-weak-20260505-080040
```

Source adapter:

```text
runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
```

### Stage-2 dataset composition

The weak-layer dataset builder produced:

```json
{
  "rows_train_stage2": 13829,
  "rows_validation": 1547,
  "weak_rows_total_after_duplication": 10638,
  "replay_rows": 3191,
  "rare_min_per_layer": 1500,
  "replay_ratio": 0.3
}
```

Layer counts before/after rare-layer duplication:

| Target layer | Before | After |
|---|---:|---:|
| `o1_nrm` | 2,672 | 2,672 |
| `a1_policy` | 3,466 | 3,466 |
| `tmf921_lifecycle_report` | 596 | 1,500 |
| `tmf921_lifecycle_monitor` | 726 | 1,500 |
| `tmf921_lifecycle_scale` | 576 | 1,500 |

Replay buffer size:

- replay rows from non-weak layers: **3,191**
- purpose: reduce catastrophic forgetting on strong layers such as `tmf921`, `camara`, `intent_3gpp`, `etsi_zsm`, and adversarial rejection.

Full target-layer composition in the stage-2 train set:

| Target layer | Rows |
|---|---:|
| `a1_policy` | 3,466 |
| `o1_nrm` | 2,672 |
| `tmf921_lifecycle_monitor` | 1,500 |
| `tmf921_lifecycle_report` | 1,500 |
| `tmf921_lifecycle_scale` | 1,500 |
| `tmf921` replay | 902 |
| `intent_3gpp` replay | 630 |
| `camara` replay | 618 |
| `etsi_zsm` replay | 335 |
| adversarial replay and other lifecycle replay | remaining rows |

### Training configuration

Resolved stage-2 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
adapter_path: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
dataset_dir: runs/stage2-weak-20260505-080040/weak_layer_data
output_dir: runs/stage2-weak-20260505-080040/outputs/adapter
learning_rate: 5.0e-05
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
```

### Evidence that adapter continuation was configured correctly

The server log confirmed:

```text
Base model: Qwen/Qwen3-8B
Adapter: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
trainable params: 174,587,904 || all params: 8,365,323,264 || trainable%: 2.0870
TunerModelStatus(... active_adapters=['default'], requires_grad={'default': True}, devices={'default': ['cuda']})
```

Interpretation:

- The existing adapter was loaded.
- Adapter weights are trainable.
- Training is on CUDA.
- The base model is not being fully fine-tuned; only the LoRA adapter parameters are updated.

### Early training evidence

Stage-2 training began normally after tokenization:

```text
Tokenizing train dataset: 13,829 / 13,829
Tokenizing eval dataset: 1,547 / 1,547
```

Representative early logs:

```text
loss: 0.1313, grad_norm: 0.0199, lr: 5e-05, mean_token_accuracy: 0.9572, epoch: 0.0012
loss: 0.1686, grad_norm: 0.0317, lr: 5e-05, mean_token_accuracy: 0.9435, epoch: 0.0116
loss: 0.1541, grad_norm: 0.0166, lr: 5e-05, mean_token_accuracy: 0.9463, epoch: 0.1157
```

Validation during stage 2:

```text
eval_loss: 0.1581 at epoch 0.1157
eval_loss: 0.1582 at epoch 0.2314
eval_loss: 0.1584 at epoch 0.3471
eval_loss: 0.1585 at epoch 0.4628
```

At approximately 50% completion:

```text
epoch: 0.4975 / 1.0
loss: 0.1366-0.1428 range near midpoint
grad_norm: generally <0.14
mean_token_accuracy: about 0.95
```

### Interpretation

The stage-2 run is healthy:

- no CUDA OOM,
- no NaN/Inf,
- no gradient explosion,
- the GPU is active,
- adapter continuation is correctly configured.

Validation loss is slightly worse than the stage-1 plateau (~0.153), but this is expected because stage 2 intentionally shifts the training distribution toward harder weak layers. The decisive evaluation is not broad validation loss alone; it is the post-stage-2 OOD normalized weak-layer comparison.

### Decision / next step

Let stage 2 finish. After completion:

1. merge the stage-2 adapter,
2. run OOD evaluation,
3. run the normalized evaluator,
4. compare against the stage-1 baselines.

Commands planned after stage 2:

```bash
RUN_DIR="runs/stage2-weak-20260505-080040"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"

EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"

python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval"
```

### Success criteria

Stage 2 is successful if:

1. weak-layer normalized field F1 improves:
   - `o1_nrm` above stage-1 ~0.39-0.40,
   - `a1_policy` above stage-1 ~0.67-0.68,
   - `tmf921_lifecycle_report` above stage-1 ~0.15-0.18,
   - `tmf921_lifecycle_monitor` above stage-1 ~0.39-0.52;
2. global normalized field F1 does not regress substantially:
   - stage-1 ID: 0.7956,
   - stage-1 template OOD: 0.7865,
   - stage-1 use-case OOD: 0.7907,
   - stage-1 sector OOD: 0.7697;
3. JSON parse remains near 100%;
4. adversarial normalized exact remains close to 0.9697.

### Failure modes to watch

- Global regression from weak-layer overfitting.
- Adversarial degradation from insufficient replay.
- O1 NRM still weak, suggesting the need for a layer-specific semantic evaluator or improved data generation rather than more SFT.
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.

---

## 2026-05-05 — Stage 2 evaluation completed and decision made

### Goal

Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.

### Action

After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:

- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

The normalized evaluator was then run on the generated predictions:

```bash
python scripts/normalize_eval_metrics.py \
    --eval_dir runs/stage2-weak-20260505-080040/eval
```

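The stage-1 vs. stage-2 deltas in the tables below were assembled from the two runs' metric outputs; a small helper of the following kind could automate that comparison (the per-split JSON file layout and the `norm_field_f1` key are assumptions, not the repo's documented output format):

```python
import json
from pathlib import Path


def load_metrics(eval_dir: str) -> dict:
    """Load per-split metrics, assuming one <split>.json file per split."""
    return {p.stem: json.loads(p.read_text()) for p in Path(eval_dir).glob("*.json")}


def compare(stage1_dir: str, stage2_dir: str, key: str = "norm_field_f1") -> None:
    """Print stage-1 value, stage-2 value, and delta for one metric on each split."""
    s1, s2 = load_metrics(stage1_dir), load_metrics(stage2_dir)
    for split in sorted(s1.keys() & s2.keys()):
        a, b = s1[split][key], s2[split][key]
        print(f"{split:24s} {a:.4f} -> {b:.4f}  delta {b - a:+.4f}")


# Hypothetical usage with the two run directories from this journal:
# compare("runs/qwen3-8b-qlora-20260501-083834/eval",
#         "runs/stage2-weak-20260505-080040/eval")
```
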
### Evidence / result

Global normalized comparison, stage 1 -> stage 2:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

JSON parse comparison:

| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |

Weak-layer normalized field F1 comparison, stage 1 -> stage 2:

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |

### Interpretation

Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.

Key observations:

1. Global normalized field F1 changed by at most 0.12 percentage points on the non-adversarial splits. This is effectively flat.
2. Normalized key F1 regressed slightly across all splits.
3. Adversarial performance regressed meaningfully:
   - normalized field F1: **0.9697 -> 0.9596**
   - normalized key F1: **1.0000 -> 0.9697**
   - parse rate: **1.0000 -> 0.9697**
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
5. `a1_policy` also did not improve meaningfully.
6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.

The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but one or more of:

- insufficient semantic supervision in the data,
- inadequacy of flat field F1 for some low-level configs,
- the need for layer-specific validators and value extractors,
- the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.

### Decision

Stage 2 should **not** replace the stage-1 model as the main model.

The stage-1 adapter remains the current primary model because it has:

- slightly better global normalized metrics,
- better adversarial robustness,
- no meaningful disadvantage on O1/A1 compared with stage 2.

Stage 2 is retained as a diagnostic experiment; its main value is as evidence that weak-layer continuation alone is not sufficient.

### Next step

Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for the weak layers:

1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.

### Updated project status

Primary model: **stage 1 Qwen3-8B QLoRA adapter**

Stage 2 status: **diagnostic / not promoted**

Current best headline metrics remain the stage-1 normalized results:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |