nraptisss
/

tmf921-intent-training

+# TMF921 Intent-to-Configuration Research Journal
+This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
+Repository links:
+- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
+- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
+- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
+- Base model: https://huggingface.co/Qwen/Qwen3-8B
+---
+## Current status summary
+Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
+Stage 2 status: **diagnostic / not promoted**.
+Best stage-1 normalized metrics:
+| Split | JSON parse | Normalized field F1 | Normalized key F1 |
+|---|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
+| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
+| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
+| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
+| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
+Zero-shot Qwen3-8B baseline, 200 examples per split:
+| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
+|---|---:|---:|---:|
+| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
+| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
+| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
+| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
+| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
+Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
+---
+## 2026-04-30 — Dataset cloned and audited
+The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
+Key findings:
+- Total rows: **41,815**
+- Train: **39,294**
+- Test: **2,521**
+- Missing values: **0**
+- Duplicate IDs: **0**
+- Assistant JSON parse validity: **100%**
+- Exact train/test full-message overlap: **0**
+- Near-duplicate prompt similarity was high:
+  - >= 0.90: **1,290 / 2,521**
+  - >= 0.95: **602 / 2,521**
+  - >= 0.98: **262 / 2,521**
+- `create` lifecycle operation: **95.9%**
+- adversarial rows: **166 = 0.397%**
+- unique JSON structure signatures: **31**
+Interpretation:
+The dataset is technically clean and suitable for SFT, but the original split is mainly in-distribution/template-compliance rather than a strong OOD benchmark.
+Decision:
+Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
+---
+## 2026-04-30 — Research SOTA dataset created
+Created:
+- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
+Splits:
+| Split | Rows | Purpose |
+|---|---:|---|
+| `train_base` | 26,357 | unaugmented training after OOD holdouts |
+| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
+| `validation` | 1,547 | validation |
+| `test_in_distribution` | 1,455 | in-distribution test |
+| `test_template_ood` | 3,503 | held-out prompt-template family |
+| `test_use_case_ood` | 4,341 | held-out use cases |
+| `test_sector_ood` | 4,579 | held-out sectors |
+| `test_adversarial` | 33 | held-out adversarial examples |
+Qwen3 token-length audit:
+- mean: **754.1**
+- p50: **705**
+- p95: **1293**
+- p99: **1300**
+- max: **1316**
+- fit within 2048: **100%**
+`train_sota` balancing:
+- non-create lifecycle rows: **5,166 = 15.97%**
+- adversarial rows: **2,115 = 6.54%**
+- synthetic multi-turn wrappers: **1,281**
+Decision:
+Use `train_sota` for the first Qwen3-8B QLoRA training run.
+---
+## 2026-04-30 / 2026-05-01 — Training/evaluation repo created
+Created:
+- https://huggingface.co/nraptisss/tmf921-intent-training
+Default recipe:
+- Base model: `Qwen/Qwen3-8B`
+- Method: QLoRA SFT
+- Quantization: 4-bit NF4 + double quantization
+- LoRA target modules: `all-linear`
+- LoRA rank: 64
+- LR: 2e-4
+- Max length: 2048
+- Loss: assistant-only SFT loss
+- bf16: enabled
+- gradient checkpointing: enabled
+- train split: `train_sota`
+The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold.
+---
+## 2026-05-01 — Runtime issues fixed
+Fixed issues:
+1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
+2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
+3. Trackio invalid Space ID: sanitized Trackio config and added `DISABLE_TRACKIO=1`.
+4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.
+Server GPU evidence:
+```text
+torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
+cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
+```
+---
+## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed
+Run directory:
+```text
+runs/qwen3-8b-qlora-20260501-083834
+```
+Training behavior:
+- Initial loss: **1.212**
+- Later loss: **~0.14–0.15**
+- Mean token accuracy: **~0.945–0.953**
+- Validation loss plateau: **~0.153**
+No observed:
+- CUDA OOM
+- NaNs
+- divergence
+- gradient explosion
+Decision:
+Evaluate the trained adapter across ID and OOD splits.
+---
+## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed
+Initial 4-bit adapter evaluation was too slow:
+```text
+test_in_distribution: 1455 examples in ~25h
+```
+Fixes:
+- batched generation,
+- dynamic generation length,
+- periodic save/resume,
+- merged bf16 model evaluation.
+---
+## 2026-05-04 — Stage-1 raw and normalized evaluation
+Raw metrics:
+| Split | JSON parse | Exact match | Field F1 | KPI presence |
+|---|---:|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
+| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
+| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
+| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
+| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
+Normalized metrics:
+| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
+|---|---:|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
+| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
+| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
+| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
+| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
+Interpretation:
+The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
+Weak layers:
+- `o1_nrm`: normalized field F1 around **0.39–0.40**
+- `a1_policy`: normalized field F1 around **0.67–0.68**
+- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
+- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
+Decision:
+Test a stage-2 weak-layer continuation experiment.
+---
+## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation
+Stage-2 setup:
+- initialized from stage-1 adapter,
+- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
+- stage-2 rows: **13,829**,
+- weak rows: **10,638**,
+- replay rows: **3,191**,
+- LR: **5e-5**,
+- epochs: **1**.
+Stage-2 training was stable. Adapter continuation was correctly configured:
+```text
+trainable params: 174,587,904
+requires_grad={'default': True}
+devices={'default': ['cuda']}
+```
+Stage-2 evaluation comparison:
+| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
+|---|---:|---:|---:|---:|---:|---:|
+| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
+| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
+| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
+| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
+| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
+Decision:
+Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.
+Interpretation:
+Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
+---
+## 2026-05-06 — Zero-shot Qwen3-8B baseline completed
+Goal:
+Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
+Action:
+Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:
+```bash
+EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
+bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
+```
+Zero-shot metrics:
+| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
+|---|---:|---:|---:|
+| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
+| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
+| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
+| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
+| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
+Comparison with fine-tuned stage 1:
+| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
+|---|---:|---:|---:|---:|---:|---:|
+| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
+| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
+| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
+| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
+| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
+Interpretation:
+Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
+---
+## 2026-05-07 — Publication packaging and paper scaffold
+Completed:
+- finalized dataset card,
+- finalized primary stage-1 model card,
+- added `REPRODUCIBILITY.md`,
+- added `scripts/reproduce_stage1_eval.sh`,
+- added `scripts/run_zero_shot_baseline.sh`,
+- added `scripts/package_results.py`,
+- added `scripts/sample_failure_examples.py`,
+- uploaded `results/` and `analysis/` artifacts,
+- added `paper/outline.md`,
+- added `paper/tables.md`.
+Current publication-ready assets:
+- dataset card,
+- model card,
+- results package,
+- qualitative examples,
+- reproducibility checklist,
+- paper outline,
+- draft tables,
+- project journal.
+---
+## Current open research questions
+1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
+2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
+3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
+4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
+## Next recommended step
+Write the first manuscript draft using:
+- `paper/outline.md`,
+- `paper/tables.md`,
+- `PROJECT_JOURNAL.md`,
+- `results/stage1_vs_stage2_comparison.md`,
+- `results/baselines/zero_shot_vs_finetuned.md`,
+- `analysis/stage1_examples/failure_examples.md`.