nraptisss committed
Commit a91c7ff · verified · 1 Parent(s): 63d52bc

Add scientific project journal

  1. PROJECT_JOURNAL.md +574 -0
PROJECT_JOURNAL.md ADDED
@@ -0,0 +1,574 @@
1
+ # TMF921 Intent-to-Configuration Research Journal
2
+
3
+ This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
4
+
5
+ Repository links:
6
+
7
+ - Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
8
+ - Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
9
+ - Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
10
+ - Base model: https://huggingface.co/Qwen/Qwen3-8B
11
+
12
+ ---
13
+
14
+ ## Journal conventions
15
+
16
+ Each entry should include:
17
+
18
+ 1. **Date/time**
19
+ 2. **Goal**
20
+ 3. **Action**
21
+ 4. **Evidence / result**
22
+ 5. **Interpretation**
23
+ 6. **Decision / next step**
24
+
25
+ For research claims, prefer numeric evidence over qualitative statements.
26
+
27
+ ---
28
+
29
+ ## 2026-04-30: Dataset cloned and audited
30
+
31
+ ### Goal
32
+
33
+ Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.
34
+
35
+ ### Action
36
+
37
+ The dataset was cloned in the sandbox and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.
38
+
39
+ ### Evidence / result
40
+
41
+ Dataset size:
42
+
43
+ - Total rows: **41,815**
44
+ - Train: **39,294**
45
+ - Test: **2,521**
46
+
47
+ Quality checks:
48
+
49
+ - Missing values: **0**
50
+ - Duplicate IDs: **0**
51
+ - Duplicate full conversations: **0**
52
+ - Assistant JSON parse validity: **41,815 / 41,815 = 100%**
53
+ - Role sequence: `system -> user -> assistant` for all rows
54
+
55
+ Leakage / similarity findings:
56
+
57
+ - Exact train/test user-prompt overlap: **0**
58
+ - Exact train/test full-message overlap: **0**
59
+ - Near-duplicate prompt similarity was high:
60
+ - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
61
+ - >= 0.95: **602 / 2,521**
62
+ - >= 0.98: **262 / 2,521**
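The journal does not state which similarity measure produced the counts above. The following is a minimal sketch of one plausible approach, assuming Jaccard similarity over character 5-grams; the n-gram size, the normalization, and all function names are illustrative assumptions, not the audit's actual code.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a lowercased, whitespace-normalized string."""
    norm = " ".join(text.lower().split())
    if len(norm) < n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_train_similarity(test_prompt: str, train_prompts: list[str], n: int = 5) -> float:
    """Highest similarity between one test prompt and any training prompt."""
    test_grams = char_ngrams(test_prompt, n)
    return max((jaccard(test_grams, char_ngrams(p, n)) for p in train_prompts), default=0.0)


def count_at_or_above(scores: list[float], thresholds=(0.90, 0.95, 0.98)) -> dict[float, int]:
    """Number of test prompts whose best train match meets each threshold."""
    return {t: sum(s >= t for s in scores) for t in thresholds}
```

This brute-force version compares every test prompt with every training prompt (about 2,521 x 39,294 pairs here), so a real audit would likely precompute the training n-gram sets or use an approximate method.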
63
+
64
+ Distribution findings:
65
+
66
+ - `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
67
+ - non-create lifecycle rows: **1,725 = 4.1%**
68
+ - adversarial rows: **166 = 0.397%**
69
+ - only **31 unique JSON structure signatures** across 41,815 rows
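How the structure signatures were computed is not described in the journal. As a hedged illustration, a signature can be derived from key names and value types only, so that two completions with different values but the same shape collide. The function names and the choice of SHA-1 truncation below are assumptions made for this sketch.

```python
import hashlib
import json


def structure_signature(node) -> str:
    """Describe the shape of a JSON value: keys and value types, never the values."""
    if isinstance(node, dict):
        inner = ",".join(f"{key}:{structure_signature(node[key])}" for key in sorted(node))
        return "{" + inner + "}"
    if isinstance(node, list):
        # Collapse a list to its distinct element shapes so list length does not matter.
        shapes = sorted({structure_signature(item) for item in node})
        return "[" + "|".join(shapes) + "]"
    if isinstance(node, bool):  # checked before int because bool is a subclass of int
        return "bool"
    if isinstance(node, (int, float)):
        return "num"
    if node is None:
        return "null"
    return "str"


def structure_id(completion: str) -> str:
    """Short, stable identifier for the structure of an assistant JSON completion."""
    signature = structure_signature(json.loads(completion))
    return hashlib.sha1(signature.encode("utf-8")).hexdigest()[:12]
```

Counting the distinct `structure_id` values over all rows would give a figure comparable to the 31 signatures reported above, although the exact count depends on how lists and optional keys are treated.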
70
+
71
+ ### Interpretation
72
+
73
+ The source dataset is technically clean and suitable for SFT. However, 1,290 of the 2,521 test prompts (51.2%) have char-ngram similarity >= 0.90 to a training prompt, so the original split mainly measures in-distribution template compliance and is not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.
74
+
75
+ ### Decision / next step
76
+
77
+ Create a research-grade derivative dataset with:
78
+
79
+ - OOD splits,
80
+ - train/eval provenance columns,
81
+ - token-length audit,
82
+ - validation flags,
83
+ - lifecycle/adversarial upsampling for training only,
84
+ - no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.
85
+
86
+ ---
87
+
88
+ ## 2026-04-30: Research SOTA dataset created
89
+
90
+ ### Goal
91
+
92
+ Implement the audit recommendations while preserving scientific soundness.
93
+
94
+ ### Action
95
+
96
+ Created `nraptisss/TMF921-intent-to-config-research-sota`.
97
+
98
+ Implemented:
99
+
100
+ - `train_base`
101
+ - `train_sota`
102
+ - `validation`
103
+ - `test_in_distribution`
104
+ - `test_template_ood`
105
+ - `test_use_case_ood`
106
+ - `test_sector_ood`
107
+ - `test_adversarial`
108
+
109
+ Added columns:
110
+
111
+ - `system`, `prompt`, `completion`
112
+ - `prompt_template_id`
113
+ - `scenario_id`
114
+ - `json_structure_id`
115
+ - `json_root_family`
116
+ - `messages_format_valid`
117
+ - `assistant_is_valid_json`
118
+ - `slice_sst_valid`
119
+ - `kpi_profile_valid`
120
+ - `semantic_rule_valid_v1`
121
+ - `qwen3_chat_template_tokens`
122
+ - `fits_2048_qwen3`
123
+ - `fits_4096_qwen3`
124
+ - `sampling_weight_*`
125
+ - `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`
126
+
127
+ ### Evidence / result
128
+
129
+ Published dataset:
130
+
131
+ - https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
132
+
133
+ Splits:
134
+
135
+ | Split | Rows | Purpose |
136
+ |---|---:|---|
137
+ | `train_base` | 26,357 | unaugmented training after OOD holdouts |
138
+ | `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
139
+ | `validation` | 1,547 | validation |
140
+ | `test_in_distribution` | 1,455 | in-distribution test |
141
+ | `test_template_ood` | 3,503 | held-out prompt-template family |
142
+ | `test_use_case_ood` | 4,341 | held-out use cases |
143
+ | `test_sector_ood` | 4,579 | held-out sectors |
144
+ | `test_adversarial` | 33 | held-out adversarial examples |
145
+
146
+ Qwen3 token-length audit:
147
+
148
+ - mean: **754.1**
149
+ - p50: **705**
150
+ - p95: **1293**
151
+ - p99: **1300**
152
+ - max: **1316**
153
+ - fit within 2048: **100%**
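The summary above can be reproduced from a list of per-row token counts. The sketch below assumes those counts were already obtained by applying the Qwen3 chat template and tokenizing each conversation (that step is not shown), and it assumes nearest-rank percentiles; the journal does not say which percentile method was used.

```python
import math
import statistics


def percentile(sorted_values: list[int], q: float) -> int:
    """Nearest-rank percentile of an ascending list, for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_lengths(token_counts: list[int], budgets=(2048, 4096)) -> dict:
    """Mean, selected percentiles, max, and the fraction of rows within each budget."""
    values = sorted(token_counts)
    summary = {
        "mean": statistics.fmean(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "max": values[-1],
    }
    for budget in budgets:
        summary[f"fits_{budget}"] = sum(v <= budget for v in values) / len(values)
    return summary
```

A `fits_2048` value of 1.0 is what supports training with `max_length=2048` without truncating any example.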
154
+
155
+ `train_sota` balancing:
156
+
157
+ - non-create lifecycle rows: **5,166 = 15.97%**
158
+ - adversarial rows: **2,115 = 6.54%**
159
+ - synthetic multi-turn wrappers: **1,281**
160
+
161
+ ### Interpretation
162
+
163
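+ `max_length=2048` is justified for Qwen3-8B: the longest example is 1,316 tokens under the Qwen3 chat template, so nothing is truncated. `train_sota` improves rare-class exposure, raising non-create lifecycle rows to 15.97% and adversarial rows to 6.54% (from 4.1% and 0.397% in the source dataset). OOD splits allow scientifically meaningful generalization reporting.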
164
+
165
+ ### Decision / next step
166
+
167
+ Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.
168
+
169
+ ---
170
+
171
+ ## 2026-04-30 / 2026-05-01: Training/evaluation repo created
172
+
173
+ ### Goal
174
+
175
+ Create a reproducible repo for training and evaluation on a single RTX 6000 Ada (48 GB).
176
+
177
+ ### Action
178
+
179
+ Created `nraptisss/tmf921-intent-training` with:
180
+
181
+ - QLoRA SFT training script,
182
+ - evaluation script,
183
+ - merge script,
184
+ - RTX 6000 Ada install script,
185
+ - GPU preflight,
186
+ - nohup run scripts,
187
+ - resumable checkpoints,
188
+ - unique run directories.
189
+
190
+ Default recipe:
191
+
192
+ - model: `Qwen/Qwen3-8B`
193
+ - method: QLoRA NF4 + double quant
194
+ - LoRA target modules: `all-linear`
195
+ - LoRA rank: `64`
196
+ - LoRA alpha: `16`
197
+ - LoRA dropout: `0.05`
198
+ - LR: `2e-4`
199
+ - scheduler: constant
200
+ - max length: `2048`
201
+ - assistant-only loss: enabled
202
+ - bf16: enabled
203
+ - gradient checkpointing: enabled
204
+ - train split: `train_sota`
205
+ - eval split: `validation`
206
+
207
+ ### Evidence / result
208
+
209
+ Repo:
210
+
211
+ - https://huggingface.co/nraptisss/tmf921-intent-training
212
+
213
+ ### Interpretation
214
+
215
+ The training approach is consistent with QLoRA literature and fits the memory constraints of a 48 GB RTX 6000 Ada GPU.
216
+
217
+ ### Decision / next step
218
+
219
+ Run training under `nohup`, require CUDA preflight, and ensure unique output directories to avoid overwriting results.
220
+
221
+ ---
222
+
223
+ ## 2026-05-01: Runtime issues fixed
224
+
225
+ ### Goal
226
+
227
+ Resolve server-side training errors and ensure training uses GPU.
228
+
229
+ ### Issues encountered and fixes
230
+
231
+ #### 1. CPU/GPU uncertainty
232
+
233
+ There was a concern that training might fall back to CPU instead of using the GPU.
234
+
235
+ Fix:
236
+
237
+ - Added `scripts/check_gpu.py`
238
+ - Added `scripts/install_rtx6000ada.sh`
239
+ - Added fail-fast CUDA checks to training/evaluation scripts.
240
+
241
+ Evidence from server logs:
242
+
243
+ ```text
244
+ torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
245
+ cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
246
+ ```
247
+
248
+ Conclusion: GPU setup confirmed.
249
+
250
+ #### 2. TRL conversational dataset detection error
251
+
252
+ Error:
253
+
254
+ ```text
255
+ ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
256
+ ```
257
+
258
+ Cause:
259
+
260
+ The dataset contains `messages` plus convenience `prompt`/`completion` columns. TRL inferred prompt-completion format instead of conversational format.
261
+
262
+ Fix:
263
+
264
+ Training script now passes only:
265
+
266
+ ```python
267
+ train_dataset = train_dataset.select_columns(["messages"])
268
+ eval_dataset = eval_dataset.select_columns(["messages"])
269
+ ```
270
+
271
+ #### 3. Trackio invalid Space ID
272
+
273
+ Error:
274
+
275
+ ```text
276
+ HFValidationError: Repo id ... 'nraptisss/'
277
+ ```
278
+
279
+ Cause:
280
+
281
+ Invalid `TRACKIO_SPACE_ID=nraptisss/`.
282
+
283
+ Fix:
284
+
285
+ Added validation/sanitization for Trackio Space IDs and support for:
286
+
287
+ ```bash
288
+ DISABLE_TRACKIO=1
289
+ ```
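The validation logic itself is not shown in the journal. A minimal sketch of what such a check could look like follows; `resolve_trackio_space_id` is a hypothetical helper, the regular expression is a simplified assumption about the `owner/name` repo-id format, and `nraptisss/tmf921-runs` in the usage note is an invented example ID.

```python
import os
import re

# Simplified assumption: a Space ID looks like "owner/name" with non-empty parts.
_SPACE_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$")


def resolve_trackio_space_id(env=None):
    """Return a usable Space ID, or None when tracking should be skipped."""
    env = os.environ if env is None else env
    # An explicit opt-out wins over any configured Space ID.
    if env.get("DISABLE_TRACKIO", "").strip().lower() in {"1", "true"}:
        return None
    space_id = env.get("TRACKIO_SPACE_ID", "").strip()
    # Reject malformed values such as "nraptisss/" instead of failing later.
    if not _SPACE_ID.match(space_id):
        return None
    return space_id
```

With this sketch, `TRACKIO_SPACE_ID=nraptisss/` resolves to `None`, while a well-formed value such as `nraptisss/tmf921-runs` is passed through unchanged.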
290
+
291
+ #### 4. Deprecated warmup argument
292
+
293
+ Warning:
294
+
295
+ ```text
296
+ warmup_ratio is deprecated
297
+ ```
298
+
299
+ Fix:
300
+
301
+ Changed config/script to use:
302
+
303
+ ```yaml
304
+ warmup_steps: 0
305
+ ```
306
+
307
+ ### Decision / next step
308
+
309
+ Restart training with fixed scripts and disabled Trackio to avoid external logging failures.
310
+
311
+ ---
312
+
313
+ ## 2026-05-01 / 2026-05-02: Qwen3-8B QLoRA training run completed
314
+
315
+ ### Goal
316
+
317
+ Train Qwen3-8B QLoRA on `train_sota`.
318
+
319
+ ### Action
320
+
321
+ Started training under nohup with unique run directory:
322
+
323
+ ```text
324
+ runs/qwen3-8b-qlora-20260501-083834
325
+ ```
326
+
327
+ Trackio disabled:
328
+
329
+ ```bash
330
+ DISABLE_TRACKIO=1
331
+ ```
332
+
333
+ ### Evidence / result
334
+
335
+ Training logs showed stable convergence.
336
+
337
+ Representative metrics:
338
+
339
+ Initial:
340
+
341
+ ```text
342
+ loss: 1.212
343
+ mean_token_accuracy: 0.7922
344
+ ```
345
+
346
+ After early training:
347
+
348
+ ```text
349
+ loss: ~0.15
350
+ mean_token_accuracy: ~0.945-0.953
351
+ ```
352
+
353
+ Validation loss over training:
354
+
355
+ ```text
356
+ eval_loss: 0.1593 at epoch 0.1236
357
+ eval_loss: 0.1561 at epoch 0.2472
358
+ eval_loss: 0.1548 at epoch 0.3709
359
+ eval_loss: 0.1535 at epoch 0.8653
360
+ eval_loss: 0.1530 at epoch 1.607
361
+ eval_loss: 0.1532 at epoch 1.730
362
+ ```
363
+
364
+ No observed:
365
+
366
+ - CUDA OOM,
367
+ - NaNs,
368
+ - divergence,
369
+ - gradient explosion.
370
+
371
+ ### Interpretation
372
+
373
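+ The run converged smoothly. Training loss stabilized around 0.14–0.15. Validation loss fell from 0.1593 at epoch 0.12 to about 0.153 by epoch 0.87 and then stayed flat through epoch 1.73, so most of the improvement came within the first epoch.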
374
+
375
+ ### Decision / next step
376
+
377
+ Evaluate the trained adapter across ID and OOD splits.
378
+
379
+ ---
380
+
381
+ ## 2026-05-02 / 2026-05-04: Evaluation speed issue and merged-model evaluation
382
+
383
+ ### Goal
384
+
385
+ Evaluate the trained adapter on all splits.
386
+
387
+ ### Issue
388
+
389
+ The initial evaluator generated one example at a time through the 4-bit adapter with a large `max_new_tokens`, which made evaluation very slow:
390
+
391
+ ```text
392
+ test_in_distribution: 1455 examples in ~25h
393
+ test_template_ood: ~30-90s/example
394
+ ```
395
+
396
+ ### Action
397
+
398
+ Patched evaluator to support:
399
+
400
+ - batched generation,
401
+ - dynamic generation length based on target length + buffer,
402
+ - periodic save/resume,
403
+ - partial prediction reuse.
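The patched evaluator is not reproduced in the journal. The sketch below illustrates only the planning side of the three ideas in the list (batching, a target-length-based generation budget, and skipping already-saved predictions). The field names `id` and `target_tokens`, the buffer of 64 tokens, and the cap of 2048 are assumptions chosen for illustration.

```python
def dynamic_max_new_tokens(target_token_counts, buffer=64, cap=2048):
    """Generation budget for a batch: longest reference length plus a buffer, capped."""
    return min(max(target_token_counts) + buffer, cap)


def length_sorted_batches(examples, batch_size=8):
    """Group examples of similar target length to limit padding and wasted decoding."""
    ordered = sorted(examples, key=lambda ex: ex["target_tokens"])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


def plan_generation(examples, done_ids, batch_size=8, buffer=64, cap=2048):
    """Skip examples that already have predictions and attach a per-batch token budget."""
    pending = [ex for ex in examples if ex["id"] not in done_ids]
    plan = []
    for batch in length_sorted_batches(pending, batch_size):
        budget = dynamic_max_new_tokens([ex["target_tokens"] for ex in batch], buffer, cap)
        plan.append({"ids": [ex["id"] for ex in batch], "max_new_tokens": budget})
    return plan
```

One caveat of tying the budget to the reference length is that an over-long prediction gets cut off at the budget, so the buffer should be generous enough that truncation does not distort the JSON parse rate.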
404
+
405
+ Merging the adapter into the bf16 base model was also recommended for faster inference.
406
+
407
+ ### Decision / next step
408
+
409
+ Use merged model evaluation and normalized metrics.
410
+
411
+ ---
412
+
413
+ ## 2026-05-04: Raw evaluation results
414
+
415
+ ### Goal
416
+
417
+ Measure raw JSON and field-level performance.
418
+
419
+ ### Evidence / result
420
+
421
+ Raw metrics:
422
+
423
+ | Split | JSON parse | Exact match | Field F1 | KPI presence |
424
+ |---|---:|---:|---:|---:|
425
+ | `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
426
+ | `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
427
+ | `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
428
+ | `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
429
+ | `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
430
+
431
+ ### Interpretation
432
+
433
+ The model learned JSON formatting and adversarial rejection very well: JSON parse rate is at least 0.9998 on every split, and adversarial exact match is 0.9697. Raw exact match is at most 0.0227 on the non-adversarial splits, but this metric is likely too strict because many fields are volatile/generated (`id`, `href`, timestamps, descriptions, schema links).
434
+
435
+ ### Decision / next step
436
+
437
+ Implement a normalized evaluator that removes volatile fields before scoring.
438
+
439
+ ---
440
+
441
+ ## 2026-05-04: Normalized evaluator implemented and run
442
+
443
+ ### Goal
444
+
445
+ Re-score existing predictions using metrics that better reflect structural/semantic configuration agreement.
446
+
447
+ ### Action
448
+
449
+ Added:
450
+
451
+ ```text
452
+ scripts/normalize_eval_metrics.py
453
+ ```
454
+
455
+ Normalization removes/masks:
456
+
457
+ - IDs,
458
+ - hrefs,
459
+ - names/descriptions,
460
+ - timestamps,
461
+ - schema links,
462
+ - UUID/hash-like strings,
463
+ - generated request/policy/booking/intent IDs.
464
+
465
+ It computes:
466
+
467
+ - normalized exact match,
468
+ - normalized field precision/recall/F1,
469
+ - normalized key precision/recall/F1,
470
+ - stratified metrics.
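The contents of `scripts/normalize_eval_metrics.py` are not included in the journal. As a hedged sketch of the general technique, the code below flattens each JSON object into `(path, value)` leaves, drops or masks volatile content, and scores the prediction against the reference. The volatile key list, the masking patterns, and every function name are assumptions, not the script's actual rules.

```python
import json
import re

# Assumed volatile keys and patterns; the real script's lists are not shown in the journal.
VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation"}
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
UUID_LIKE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def flatten(node, path=""):
    """Yield (path, value) pairs for every non-volatile leaf of a JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in VOLATILE_KEYS:
                continue
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from flatten(item, f"{path}[{index}]")
    else:
        if isinstance(node, str) and (TIMESTAMP.match(node) or UUID_LIKE.match(node)):
            node = "<MASKED>"
        # json.dumps keeps 1, true and "1" distinct and makes the value hashable.
        yield path, json.dumps(node)


def prf(pred_items: set, gold_items: set):
    """Set-based precision, recall and F1."""
    if not pred_items and not gold_items:
        return 1.0, 1.0, 1.0
    tp = len(pred_items & gold_items)
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def normalized_scores(pred_json: str, gold_json: str) -> dict:
    """Normalized exact match, field F1 (path and value) and key F1 (path only)."""
    pred = set(flatten(json.loads(pred_json)))
    gold = set(flatten(json.loads(gold_json)))
    _, _, field_f1 = prf(pred, gold)
    _, _, key_f1 = prf({p for p, _ in pred}, {p for p, _ in gold})
    return {"norm_exact": float(pred == gold), "norm_field_f1": field_f1, "norm_key_f1": key_f1}
```

Under this scheme a prediction with the right keys but one wrong KPI value scores a key F1 of 1.0 and a field F1 below 1.0, which is the pattern the headline metrics show: structure is almost always right while values are only partly right.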
471
+
472
+ ### Evidence / result
473
+
474
+ Headline normalized metrics:
475
+
476
+ | Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
477
+ |---|---:|---:|---:|---:|---:|
478
+ | `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
479
+ | `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
480
+ | `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
481
+ | `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
482
+ | `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |
483
+
484
+ Strong layers:
485
+
486
+ - `tmf921`: normalized field F1 around **0.93–0.94**
487
+ - `camara`: normalized field F1 around **0.81–0.87**
488
+ - `intent_3gpp`: normalized field F1 around **0.80–0.82**
489
+ - `etsi_zsm`: normalized field F1 around **0.75–0.79**
490
+
491
+ Weak layers:
492
+
493
+ - `o1_nrm`: normalized field F1 around **0.39–0.40**
494
+ - `a1_policy`: normalized field F1 around **0.67–0.68**
495
+ - `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
496
+ - `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
497
+
498
+ ### Interpretation
499
+
500
+ The model is much stronger than raw exact-match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across ID and OOD splits. Field-level value fidelity is moderate-to-strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.
501
+
502
+ ### Decision / next step
503
+
504
+ Plan a second-stage weak-layer fine-tune focused on:
505
+
506
+ - `o1_nrm`,
507
+ - `a1_policy`,
508
+ - `tmf921_lifecycle_report`,
509
+ - `tmf921_lifecycle_monitor`,
510
+ - optionally `tmf921_lifecycle_scale`.
511
+
512
+ Use the current adapter as initialization, lower LR, and include replay from strong layers to prevent forgetting.
513
+
514
+ ---
515
+
516
+ ## Current scientific status
517
+
518
+ ### What can be claimed now
519
+
520
+ The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:
521
+
522
+ - near-perfect JSON validity,
523
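+ - stable OOD generalization (normalized field F1 on every OOD split is within about 0.03 of the ID split),
523
+ - excellent adversarial rejection (0.9697 exact match on the 33 held-out adversarial examples),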
525
+ - normalized structural key F1 around 98% across non-adversarial ID/OOD splits,
526
+ - normalized field F1 around 77–80% across ID/OOD splits.
527
+
528
+ ### What should not be overclaimed
529
+
530
+ Do not claim production-grade standards compliance yet. Current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.
531
+
532
+ ### Main weaknesses
533
+
534
+ - O1 NRM value fidelity is poor despite correct structure.
535
+ - Lifecycle report/monitor outputs need targeted improvement.
536
+ - Raw exact match remains low for primary create configs.
537
+
538
+ ### Next planned experiment
539
+
540
+ Second-stage weak-layer adapter continuation:
541
+
542
+ - initialize from current Qwen3-8B TMF921 adapter,
543
+ - train on weak-layer examples plus replay buffer,
544
+ - lower LR: `5e-5` or `1e-4`,
545
+ - 1 epoch,
546
+ - same max length 2048,
547
+ - evaluate again with raw + normalized metrics.
548
+
549
+ ---
550
+
551
+ ## Open questions
552
+
553
+ 1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
554
+ 2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
555
+ 3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
556
+ 4. Should training use a weak-layer second stage or should dataset generation be improved first?
557
+
558
+ ---
559
+
560
+ ## Running log template
561
+
562
+ ```markdown
563
+ ## YYYY-MM-DD: Short title
564
+
565
+ ### Goal
566
+
567
+ ### Action
568
+
569
+ ### Evidence / result
570
+
571
+ ### Interpretation
572
+
573
+ ### Decision / next step
574
+ ```