# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Current status summary

Current primary model: **stage-1 Qwen3-8B QLoRA adapter**. Stage 2 status: **diagnostic / not promoted**.

Best stage-1 normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |

Zero-shot Qwen3-8B baseline, 200 examples per split:

| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |

Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.

---

## 2026-04-30 — Dataset cloned and audited

The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.

Key findings:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**
- Missing values: **0**
- Duplicate IDs: **0**
- Assistant JSON parse validity: **100%**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (see the sketch at the end of this entry):
  - >= 0.90: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**
- `create` lifecycle operation: **95.9%**
- adversarial rows: **166 = 0.397%**
- unique JSON structure signatures: **31**

Interpretation: The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than providing a strong OOD benchmark.

Decision: Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
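For reference, a minimal sketch of how the near-duplicate counts above could be reproduced. The chat column name, prompt extraction, and the character n-gram TF-IDF cosine similarity are assumptions; the actual audit script may use a different text representation, so the counts may not match exactly.

```python
"""Minimal sketch: audit near-duplicate prompt similarity between test and train.

Assumptions (may differ from the actual audit script): rows carry a chat-style
`messages` column, the prompt is the first user turn, and similarity is cosine
similarity over character n-gram TF-IDF vectors.
"""
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ds = load_dataset("nraptisss/TMF921-intent-to-config-augmented")

def first_user_prompt(row):
    return next(m["content"] for m in row["messages"] if m["role"] == "user")

train_prompts = [first_user_prompt(r) for r in ds["train"]]
test_prompts = [first_user_prompt(r) for r in ds["test"]]

# Character n-grams are robust to small template edits (entity swaps, reordering).
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=2)
train_mat = vec.fit_transform(train_prompts)
test_mat = vec.transform(test_prompts)

# For each test prompt, keep only its maximum similarity to any train prompt.
max_sim = np.zeros(test_mat.shape[0])
for start in range(0, test_mat.shape[0], 256):  # chunk to bound memory
    block = cosine_similarity(test_mat[start:start + 256], train_mat)
    max_sim[start:start + 256] = block.max(axis=1)

for threshold in (0.90, 0.95, 0.98):
    count = int((max_sim >= threshold).sum())
    print(f">= {threshold:.2f}: {count} / {len(test_prompts)}")
```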
---

## 2026-04-30 — Research SOTA dataset created

Created:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit:

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**

`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

Decision: Use `train_sota` for the first Qwen3-8B QLoRA training run.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

Created:

- https://huggingface.co/nraptisss/tmf921-intent-training

Default recipe:

- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`

The repo includes GPU preflight checks, nohup run/resume scripts, evaluation scripts, a normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and a paper scaffold.

---

## 2026-05-01 — Runtime issues fixed

Fixed issues:

1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
3. Trackio invalid Space ID: sanitized the Trackio config and added `DISABLE_TRACKIO=1`.
4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.

Server GPU evidence:

```text
torch=2.6.0+cu124
torch.version.cuda=12.4
CUDA_VISIBLE_DEVICES=0
cuda device_count=1
gpu0=NVIDIA RTX 6000 Ada Generation
```

---

## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed

Run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Training behavior:

- Initial loss: **1.212**
- Later loss: **~0.14–0.15**
- Mean token accuracy: **~0.945–0.953**
- Validation loss plateau: **~0.153**

Not observed:

- CUDA OOM
- NaNs
- divergence
- gradient explosion

Decision: Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed

Initial 4-bit adapter evaluation was too slow:

```text
test_in_distribution: 1455 examples in ~25h
```

Fixes:

- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation.
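The last fix above (merged bf16 model evaluation) amounts to folding the QLoRA adapter into a bf16 copy of the base model and evaluating the merged checkpoint, which avoids 4-bit dequantization overhead at generation time. A minimal sketch follows; the adapter path and output directory are illustrative, and the actual evaluation scripts in the training repo may differ.

```python
"""Minimal sketch: merge the stage-1 LoRA adapter into a bf16 base model for
faster batched evaluation. Paths are illustrative, not the repo's exact layout.
"""
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-8B"
ADAPTER = "runs/qwen3-8b-qlora-20260501-083834/checkpoint-final"  # illustrative path

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()       # fold the LoRA deltas into the bf16 weights
merged.save_pretrained("merged-bf16")   # evaluate this checkpoint with batched generation
tokenizer.save_pretrained("merged-bf16")
```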
---

## 2026-05-04 — Stage-1 raw and normalized evaluation

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

Normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

Interpretation: The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields hold volatile or generated values.

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

Decision: Test a stage-2 weak-layer continuation experiment.

---

## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation

Stage-2 setup:

- initialized from the stage-1 adapter (see the loading sketch at the end of this entry),
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: **13,829**,
- weak rows: **10,638**,
- replay rows: **3,191**,
- LR: **5e-5**,
- epochs: **1**.

Stage-2 training was stable. Adapter continuation was correctly configured:

```text
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```

Stage-2 evaluation comparison:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Decision: Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.

Interpretation: Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
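For reference, a minimal sketch of the adapter-continuation loading referenced in the setup above. The checkpoint path and the surrounding trainer wiring are illustrative assumptions; the actual stage-2 script may differ. The key detail is `is_trainable=True`, which is what the `requires_grad={'default': True}` evidence verifies.

```python
"""Minimal sketch: continue training from the stage-1 LoRA adapter (stage-2 setup).

Adapter path is illustrative; the real stage-2 script may organize this differently.
"""
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/checkpoint-final",  # illustrative path
    is_trainable=True,  # keep the LoRA weights trainable for the continuation run
)
model.print_trainable_parameters()  # should report only the LoRA parameters as trainable
```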
---

## 2026-05-06 — Zero-shot Qwen3-8B baseline completed

Goal: Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.

Action: Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:

```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
  bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```

Zero-shot metrics:

| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |

Comparison with fine-tuned stage 1:

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |

Interpretation: Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.

---

## 2026-05-07 — Publication packaging and paper scaffold

Completed:

- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.

Current publication-ready assets:

- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.

---

## Current open research questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?

## Next recommended step

Write the first manuscript draft using:

- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.

---

## 2026-05-07 — O1/A1 semantic evaluator results added

### Goal

Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.

### Action

Implemented and ran a prototype semantic evaluator:

```bash
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```

The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for:

- `o1_nrm`
- `a1_policy`

This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
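A minimal sketch of the scoring idea behind `sem_core_score`, `sem_kpi_score`, and `sem_overall_score`: core structural/identifier fields and KPI value fields are scored separately, with tolerance on numeric values. The field groupings, tolerance, and the 50/50 aggregation below are illustrative assumptions, not the actual logic of `scripts/evaluate_semantic_o1_a1.py`.

```python
"""Minimal sketch of the core-vs-KPI scoring split used in the semantic metrics.

Field names, the tolerance, and the 50/50 weighting are illustrative assumptions;
the real evaluator may group, weight, and aggregate differently.
"""

def _match(pred, ref, rel_tol=0.05):
    """Numeric values match within a relative tolerance; other values must match exactly."""
    if isinstance(ref, (int, float)) and isinstance(pred, (int, float)):
        return abs(pred - ref) <= rel_tol * max(abs(ref), 1e-9)
    return pred == ref

def semantic_scores(pred: dict, ref: dict, core_fields: list, kpi_fields: list) -> dict:
    """Score core structural/identifier fields and KPI value fields separately."""
    core_hits = [(f in pred) and _match(pred.get(f), ref.get(f)) for f in core_fields if f in ref]
    kpi_hits = [(f in pred) and _match(pred.get(f), ref.get(f)) for f in kpi_fields if f in ref]
    core = sum(core_hits) / len(core_hits) if core_hits else 1.0
    kpi = sum(kpi_hits) / len(kpi_hits) if kpi_hits else 1.0
    # One possible aggregation; the real evaluator may weight components differently.
    return {"sem_core_score": core, "sem_kpi_score": kpi,
            "sem_overall_score": 0.5 * core + 0.5 * kpi}

# Illustrative usage for an o1_nrm row (field names are assumptions, not the real schema):
# semantic_scores(pred_json, ref_json,
#                 core_fields=["managedElementId", "gnbId", "cellId"],
#                 kpi_fields=["maxNumberOfUes", "dlThptPerSliceMbps", "latencyBudgetMs"])
```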
### Evidence / result

Global O1/A1 semantic comparison:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |

A1 policy:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |

O1 NRM:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |

### Interpretation

The semantic evaluator confirms the previous conclusion with more nuance:

1. Stage 2 gives very small improvements to O1/A1 semantic scores.
2. The gains are mostly in core structural/identifier fields.
3. KPI/value fidelity remains weak, especially for O1 NRM.
4. The improvements are too small to offset the stage-2 adversarial regression.
5. Stage 1 remains the primary model.

The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:

- O1 semantic core score: about **0.90**
- O1 semantic KPI score: about **0.28**

Thus, the main weakness is not JSON structure but low-level telecom value fidelity.

### Decision / next step

Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune.

Future work should focus on:

- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.

Artifacts added:

- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`