# TMF921 Intent-to-Configuration Research Journal
This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
Repository links:
- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B
---
## Current status summary
Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
Stage 2 status: **diagnostic / not promoted**.
Best stage-1 normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
Zero-shot Qwen3-8B baseline, 200 examples per split:
| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
---
## 2026-04-30 — Dataset cloned and audited
The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
Key findings:
- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**
- Missing values: **0**
- Duplicate IDs: **0**
- Assistant JSON parse validity: **100%**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (see the similarity-audit sketch after this list):
- >= 0.90: **1,290 / 2,521**
- >= 0.95: **602 / 2,521**
- >= 0.98: **262 / 2,521**
- `create` lifecycle operation share: **95.9%**
- Adversarial rows: **166 (0.397%)**
- Unique JSON structure signatures: **31**
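A minimal sketch of how such a near-duplicate audit can be computed, assuming prompts are available as plain strings; the character n-gram vectorizer and chunk size are illustrative choices, not necessarily what the audit script used:
```python
# Illustrative near-duplicate audit; vectorizer settings and chunking are
# assumptions, not the repo's actual audit script.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_counts(train_prompts, test_prompts, thresholds=(0.90, 0.95, 0.98)):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    train_m = vec.fit_transform(train_prompts)
    test_m = vec.transform(test_prompts)
    # For each test prompt, keep its highest similarity to any train prompt,
    # computed in chunks to bound memory on the ~2.5k x ~39k comparison.
    max_sim = np.zeros(test_m.shape[0])
    for start in range(0, test_m.shape[0], 256):
        block = cosine_similarity(test_m[start:start + 256], train_m)
        max_sim[start:start + 256] = block.max(axis=1)
    return {t: int((max_sim >= t).sum()) for t in thresholds}
```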
Interpretation:
The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than serving as a strong OOD benchmark.
Decision:
Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
---
## 2026-04-30 — Research SOTA dataset created
Created:
- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
Splits:
| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |
Qwen3 token-length audit:
- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**
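The audit can be reproduced along these lines, assuming each row carries a chat-style `messages` column (the exact audit script may differ):
```python
# Hedged sketch of the Qwen3 token-length audit; assumes a `messages` column.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train_sota")

# Token count of the fully templated conversation, matching what SFT sees.
lengths = np.array([len(tok.apply_chat_template(row["messages"], tokenize=True)) for row in ds])
print(f"mean={lengths.mean():.1f} p50={np.percentile(lengths, 50):.0f} "
      f"p95={np.percentile(lengths, 95):.0f} p99={np.percentile(lengths, 99):.0f} "
      f"max={lengths.max()} fit_2048={(lengths <= 2048).mean():.2%}")
```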
`train_sota` balancing:
- non-create lifecycle rows: **5,166 (15.97%)**
- adversarial rows: **2,115 (6.54%)**
- synthetic multi-turn wrappers: **1,281**
Decision:
Use `train_sota` for the first Qwen3-8B QLoRA training run.
---
## 2026-04-30 / 2026-05-01 — Training/evaluation repo created
Created:
- https://huggingface.co/nraptisss/tmf921-intent-training
Default recipe:
- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
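A hedged sketch of this recipe with TRL + PEFT + bitsandbytes follows. A recent TRL is assumed for `assistant_only_loss` and `max_length`; `lora_alpha` and other unstated values are assumptions, and the repo's run scripts remain the authoritative version:
```python
# Sketch of the default recipe; unstated hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NF4
    bnb_4bit_use_double_quant=True,     # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
peft_cfg = LoraConfig(
    r=64,
    lora_alpha=128,                     # assumption: alpha is not stated above
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
    assistant_only_loss=True,           # loss only on assistant tokens
    model_init_kwargs={"quantization_config": bnb},
)
ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train_sota")
trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=args,
    # Pass only `messages` so TRL's chat-format detection works
    # (see the runtime fix logged below).
    train_dataset=ds.select_columns(["messages"]),
    peft_config=peft_cfg,
)
trainer.train()
```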
The repo includes GPU preflight checks, nohup run/resume scripts, evaluation scripts, a normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and a paper scaffold.
---
## 2026-05-01 — Runtime issues fixed
Fixed issues:
1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
3. Trackio invalid Space ID: sanitized Trackio config and added `DISABLE_TRACKIO=1`.
4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.
Server GPU evidence:
```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```
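A minimal sketch of a fail-fast CUDA preflight in the spirit of `check_gpu.py` (the actual script may check more, e.g. driver or compute-capability details):
```python
# Fail fast if CUDA is unavailable, instead of silently training on CPU.
import sys
import torch

if not torch.cuda.is_available():
    sys.exit("FATAL: CUDA is not available; refusing to start a CPU-only run.")
print(f"torch={torch.__version__} cuda={torch.version.cuda} "
      f"device_count={torch.cuda.device_count()} gpu0={torch.cuda.get_device_name(0)}")
```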
---
## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed
Run directory:
```text
runs/qwen3-8b-qlora-20260501-083834
```
Training behavior:
- Initial loss: **1.212**
- Later loss: **~0.14–0.15**
- Mean token accuracy: **~0.945–0.953**
- Validation loss plateau: **~0.153**
Not observed:
- CUDA OOM
- NaNs
- divergence
- gradient explosion
Decision:
Evaluate the trained adapter across ID and OOD splits.
---
## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed
Initial 4-bit adapter evaluation was too slow:
```text
test_in_distribution: 1455 examples in ~25h
```
Fixes:
- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation.
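The merged-model and batched-generation fixes can be sketched as below; the adapter subdirectory name and batching details are assumptions, not the repo's exact evaluation script:
```python
# Merge the QLoRA adapter into a bf16 base model, then generate in batches
# with left padding so prompt stripping stays aligned.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)
merged = PeftModel.from_pretrained(
    base, "runs/qwen3-8b-qlora-20260501-083834/adapter"  # hypothetical adapter path
).merge_and_unload()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", padding_side="left")

def generate_batch(prompts, max_new_tokens=1344):  # covers the p99/max audit lengths
    inputs = tok(prompts, return_tensors="pt", padding=True).to(merged.device)
    out = merged.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip prompt tokens before decoding.
    return tok.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```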
---
## 2026-05-04 — Stage-1 raw and normalized evaluation
Raw metrics:
| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
Normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
Interpretation:
The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
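To make the normalization concrete, here is an illustrative normalized field F1 that masks volatile keys before flat field comparison; the key list and flattening rules are hypothetical stand-ins for the repo's normalized evaluator:
```python
# Illustrative normalized field F1; VOLATILE_KEYS is a hypothetical mask list.
VOLATILE_KEYS = {"id", "href", "creationDate", "lastUpdate"}

def flatten(obj, prefix=""):
    """Flatten nested JSON into (dotted-path, value) pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

def normalized_field_f1(pred, ref):
    p = {(k, str(v)) for k, v in flatten(pred) if k.split(".")[-1] not in VOLATILE_KEYS}
    r = {(k, str(v)) for k, v in flatten(ref) if k.split(".")[-1] not in VOLATILE_KEYS}
    if not p or not r:
        return 0.0
    tp = len(p & r)
    prec, rec = tp / len(p), tp / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```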
Weak layers:
- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
Decision:
Test a stage-2 weak-layer continuation experiment.
---
## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation
Stage-2 setup:
- initialized from stage-1 adapter,
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: **13,829**,
- weak rows: **10,638**,
- replay rows: **3,191**,
- LR: **5e-5**,
- epochs: **1**.
Stage-2 training was stable. Adapter continuation was correctly configured:
```text
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```
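The continuation setup amounts to reloading the stage-1 adapter as trainable before stage-2 SFT, roughly as below (the stage-1 adapter path is a hypothetical placeholder):
```python
# Continue training an existing LoRA adapter; is_trainable=True keeps the
# loaded adapter weights unfrozen instead of inference-only.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/adapter",  # hypothetical stage-1 adapter path
    is_trainable=True,
)
model.print_trainable_parameters()  # should report the ~174.6M trainable LoRA params
```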
Stage-2 evaluation comparison:
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
Decision:
Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.
Interpretation:
Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
---
## 2026-05-06 — Zero-shot Qwen3-8B baseline completed
Goal:
Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
Action:
Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:
```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```
Zero-shot metrics:
| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Comparison with fine-tuned stage 1:
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
Interpretation:
Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
---
## 2026-05-07 — Publication packaging and paper scaffold
Completed:
- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.
Current publication-ready assets:
- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.
---
## Current open research questions
1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
## Next recommended step
Write the first manuscript draft using:
- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.
---
## 2026-05-07 — O1/A1 semantic evaluator results added
### Goal
Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.
### Action
Implemented and ran a prototype semantic evaluator:
```bash
python scripts/evaluate_semantic_o1_a1.py \
--eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
--eval_dir runs/stage2-weak-20260505-080040/eval
```
The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for:
- `o1_nrm`
- `a1_policy`
This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
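The scoring idea can be illustrated as follows; the field groups and equal weighting are hypothetical, not the evaluator's actual rubric:
```python
# Illustrative layer-specific semantic scoring; core/KPI key lists and the
# 50/50 weighting are assumptions.
def semantic_scores(pred: dict, ref: dict, core_keys: list[str], kpi_keys: list[str]) -> dict:
    def hit_rate(keys):
        scored = [pred.get(k) == ref.get(k) for k in keys if k in ref]
        return sum(scored) / len(scored) if scored else 1.0
    core = hit_rate(core_keys)  # structural/identifier fields
    kpi = hit_rate(kpi_keys)    # telecom KPI/value fields
    return {"sem_core_score": core, "sem_kpi_score": kpi,
            "sem_overall_score": 0.5 * core + 0.5 * kpi}
```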
### Evidence / result
Global O1/A1 semantic comparison:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |
A1 policy:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |
O1 NRM:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |
### Interpretation
The semantic evaluator confirms the previous conclusion with more nuance:
1. Stage 2 gives very small improvements to O1/A1 semantic scores.
2. The gains are mostly in core structural/identifier fields.
3. KPI/value fidelity remains weak, especially for O1 NRM.
4. The improvements are too small to offset stage-2 adversarial regression.
5. Stage 1 remains the primary model.
The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:
- O1 semantic core score: about **0.90**
- O1 semantic KPI score: about **0.28**
Thus, the main weakness is not JSON structure but low-level telecom value fidelity.
### Decision / next step
Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on:
- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.
Artifacts added:
- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`