# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Current status summary

Current primary model: **stage-1 Qwen3-8B QLoRA adapter**. Stage 2 status: **diagnostic / not promoted**.

Best stage-1 normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |

Zero-shot Qwen3-8B baseline, 200 examples per split:

| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |

Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.

---

## 2026-04-30 — Dataset cloned and audited

The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.

Key findings:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**
- Missing values: **0**
- Duplicate IDs: **0**
- Assistant JSON parse validity: **100%**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (see the sketch at the end of this entry):
  - >= 0.90: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**
- `create` lifecycle operation: **95.9%**
- adversarial rows: **166 = 0.397%**
- unique JSON structure signatures: **31**

Interpretation: The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than providing a strong OOD benchmark.

Decision: Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
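For reference, a minimal sketch of how the near-duplicate counts above could be reproduced. The chat column name, prompt extraction, and the character n-gram TF-IDF cosine similarity are assumptions; the actual audit script may use a different text representation, so the counts may not match exactly.

```python
"""Minimal sketch: audit near-duplicate prompt similarity between test and train.

Assumptions (may differ from the actual audit script): rows carry a chat-style
`messages` column, the prompt is the first user turn, and similarity is cosine
similarity over character n-gram TF-IDF vectors.
"""
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ds = load_dataset("nraptisss/TMF921-intent-to-config-augmented")

def first_user_prompt(row):
    return next(m["content"] for m in row["messages"] if m["role"] == "user")

train_prompts = [first_user_prompt(r) for r in ds["train"]]
test_prompts = [first_user_prompt(r) for r in ds["test"]]

# Character n-grams are robust to small template edits (entity swaps, reordering).
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=2)
train_mat = vec.fit_transform(train_prompts)
test_mat = vec.transform(test_prompts)

# For each test prompt, keep only its maximum similarity to any train prompt.
max_sim = np.zeros(test_mat.shape[0])
for start in range(0, test_mat.shape[0], 256):  # chunk to bound memory
    block = cosine_similarity(test_mat[start:start + 256], train_mat)
    max_sim[start:start + 256] = block.max(axis=1)

for threshold in (0.90, 0.95, 0.98):
    count = int((max_sim >= threshold).sum())
    print(f">= {threshold:.2f}: {count} / {len(test_prompts)}")
```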
---

## 2026-04-30 — Research SOTA dataset created

Created:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit:

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**

`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

Decision: Use `train_sota` for the first Qwen3-8B QLoRA training run.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

Created:

- https://huggingface.co/nraptisss/tmf921-intent-training

Default recipe:

- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`

The repo includes GPU preflight checks, nohup run/resume scripts, evaluation scripts, a normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and a paper scaffold.

---

## 2026-05-01 — Runtime issues fixed

Fixed issues:

1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
3. Trackio invalid Space ID: sanitized the Trackio config and added `DISABLE_TRACKIO=1`.
4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.

Server GPU evidence:

```text
torch=2.6.0+cu124
torch.version.cuda=12.4
CUDA_VISIBLE_DEVICES=0
cuda device_count=1
gpu0=NVIDIA RTX 6000 Ada Generation
```

---

## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed

Run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Training behavior:

- Initial loss: **1.212**
- Later loss: **~0.14–0.15**
- Mean token accuracy: **~0.945–0.953**
- Validation loss plateau: **~0.153**

Not observed:

- CUDA OOM
- NaNs
- divergence
- gradient explosion

Decision: Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed

Initial 4-bit adapter evaluation was too slow:

```text
test_in_distribution: 1455 examples in ~25h
```

Fixes:

- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation.
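The last fix above (merged bf16 model evaluation) amounts to folding the QLoRA adapter into a bf16 copy of the base model and evaluating the merged checkpoint, which avoids 4-bit dequantization overhead at generation time. A minimal sketch follows; the adapter path and output directory are illustrative, and the actual evaluation scripts in the training repo may differ.

```python
"""Minimal sketch: merge the stage-1 LoRA adapter into a bf16 base model for
faster batched evaluation. Paths are illustrative, not the repo's exact layout.
"""
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-8B"
ADAPTER = "runs/qwen3-8b-qlora-20260501-083834/checkpoint-final"  # illustrative path

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()       # fold the LoRA deltas into the bf16 weights
merged.save_pretrained("merged-bf16")   # evaluate this checkpoint with batched generation
tokenizer.save_pretrained("merged-bf16")
```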
---

## 2026-05-04 — Stage-1 raw and normalized evaluation

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

Normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

Interpretation: The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields hold volatile or generated values.

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

Decision: Test a stage-2 weak-layer continuation experiment.

---

## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation

Stage-2 setup:

- initialized from the stage-1 adapter (see the loading sketch at the end of this entry),
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: **13,829**,
- weak rows: **10,638**,
- replay rows: **3,191**,
- LR: **5e-5**,
- epochs: **1**.

Stage-2 training was stable. Adapter continuation was correctly configured:

```text
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```

Stage-2 evaluation comparison:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Decision: Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.

Interpretation: Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
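For reference, a minimal sketch of the adapter-continuation loading referenced in the setup above. The checkpoint path and the surrounding trainer wiring are illustrative assumptions; the actual stage-2 script may differ. The key detail is `is_trainable=True`, which is what the `requires_grad={'default': True}` evidence verifies.

```python
"""Minimal sketch: continue training from the stage-1 LoRA adapter (stage-2 setup).

Adapter path is illustrative; the real stage-2 script may organize this differently.
"""
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/checkpoint-final",  # illustrative path
    is_trainable=True,  # keep the LoRA weights trainable for the continuation run
)
model.print_trainable_parameters()  # should report only the LoRA parameters as trainable
```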
---

## 2026-05-06 — Zero-shot Qwen3-8B baseline completed

Goal: Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.

Action: Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:

```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
  bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```

Zero-shot metrics:

| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |

Comparison with fine-tuned stage 1:

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |

Interpretation: Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.

---

## 2026-05-07 — Publication packaging and paper scaffold

Completed:

- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.

Current publication-ready assets:

- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.

---

## Current open research questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?

## Next recommended step

Write the first manuscript draft using:

- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.

---

## 2026-05-07 — O1/A1 semantic evaluator results added

### Goal

Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.

### Action

Implemented and ran a prototype semantic evaluator:

```bash
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```

The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for:

- `o1_nrm`
- `a1_policy`

This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
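A minimal sketch of the scoring idea behind `sem_core_score`, `sem_kpi_score`, and `sem_overall_score`: core structural/identifier fields and KPI value fields are scored separately, with tolerance on numeric values. The field groupings, tolerance, and the 50/50 aggregation below are illustrative assumptions, not the actual logic of `scripts/evaluate_semantic_o1_a1.py`.

```python
"""Minimal sketch of the core-vs-KPI scoring split used in the semantic metrics.

Field names, the tolerance, and the 50/50 weighting are illustrative assumptions;
the real evaluator may group, weight, and aggregate differently.
"""

def _match(pred, ref, rel_tol=0.05):
    """Numeric values match within a relative tolerance; other values must match exactly."""
    if isinstance(ref, (int, float)) and isinstance(pred, (int, float)):
        return abs(pred - ref) <= rel_tol * max(abs(ref), 1e-9)
    return pred == ref

def semantic_scores(pred: dict, ref: dict, core_fields: list, kpi_fields: list) -> dict:
    """Score core structural/identifier fields and KPI value fields separately."""
    core_hits = [(f in pred) and _match(pred.get(f), ref.get(f)) for f in core_fields if f in ref]
    kpi_hits = [(f in pred) and _match(pred.get(f), ref.get(f)) for f in kpi_fields if f in ref]
    core = sum(core_hits) / len(core_hits) if core_hits else 1.0
    kpi = sum(kpi_hits) / len(kpi_hits) if kpi_hits else 1.0
    # One possible aggregation; the real evaluator may weight components differently.
    return {"sem_core_score": core, "sem_kpi_score": kpi,
            "sem_overall_score": 0.5 * core + 0.5 * kpi}

# Illustrative usage for an o1_nrm row (field names are assumptions, not the real schema):
# semantic_scores(pred_json, ref_json,
#                 core_fields=["managedElementId", "gnbId", "cellId"],
#                 kpi_fields=["maxNumberOfUes", "dlThptPerSliceMbps", "latencyBudgetMs"])
```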
### Evidence / result

Global O1/A1 semantic comparison:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |

A1 policy:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |

O1 NRM:

| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |

### Interpretation

The semantic evaluator confirms the previous conclusion with more nuance:

1. Stage 2 gives very small improvements to O1/A1 semantic scores.
2. The gains are mostly in core structural/identifier fields.
3. KPI/value fidelity remains weak, especially for O1 NRM.
4. The improvements are too small to offset the stage-2 adversarial regression.
5. Stage 1 remains the primary model.

The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:

- O1 semantic core score: about **0.90**
- O1 semantic KPI score: about **0.28**

Thus, the main weakness is not JSON structure but low-level telecom value fidelity.

### Decision / next step

Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune.

Future work should focus on:

- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.

Artifacts added:

- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`