| # TMF921 Intent-to-Configuration Research Journal |
|
|
| This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step. |
|
|
| Repository links: |
|
|
| - Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented |
| - Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota |
| - Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training |
| - Base model: https://huggingface.co/Qwen/Qwen3-8B |
|
|
| --- |
|
|
| ## Current status summary |
|
|
| Current primary model: **stage-1 Qwen3-8B QLoRA adapter**. |
|
|
| Stage 2 status: **diagnostic / not promoted**. |
|
|
| Best stage-1 normalized metrics: |
|
|
| | Split | JSON parse | Normalized field F1 | Normalized key F1 | |
| |---|---:|---:|---:| |
| | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | |
| | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | |
| | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | |
| | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | |
| | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | |
|
|
| Zero-shot Qwen3-8B baseline, 200 examples per split: |
|
|
| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
| |---|---:|---:|---:| |
| | `test_in_distribution` | 0.335 | 0.0009 | 0.0169 | |
| | `test_template_ood` | 0.340 | 0.0014 | 0.0172 | |
| | `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 | |
| | `test_sector_ood` | 0.345 | 0.0008 | 0.0171 | |
| | `test_adversarial` | 0.000 | 0.0000 | 0.0000 | |
|
|
| Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation. |
|
|
| --- |
|
|
| ## 2026-04-30 — Dataset cloned and audited |
|
|
| The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited. |
|
|
| Key findings: |
|
|
| - Total rows: **41,815** |
| - Train: **39,294** |
| - Test: **2,521** |
| - Missing values: **0** |
| - Duplicate IDs: **0** |
| - Assistant JSON parse validity: **100%** |
| - Exact train/test full-message overlap: **0** |
- Near-duplicate test-to-train prompt similarity was high (test prompts whose closest train prompt meets each threshold):
  - >= 0.90: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**
- `create` lifecycle operation: **95.9%**
- Adversarial rows: **166 = 0.397%**
- Unique JSON structure signatures: **31**
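The near-duplicate audit can be reproduced in sketch form with a pairwise similarity sweep; the audit's actual similarity measure is not recorded here, so this stand-in uses `difflib`'s sequence ratio:

```python
from difflib import SequenceMatcher

def max_train_similarity(test_prompt: str, train_prompts: list[str]) -> float:
    """Return the highest similarity between one test prompt and any train prompt."""
    return max(SequenceMatcher(None, test_prompt, p).ratio() for p in train_prompts)

def near_duplicate_counts(test_prompts, train_prompts, thresholds=(0.90, 0.95, 0.98)):
    """Count test prompts whose best train match meets each similarity threshold."""
    sims = [max_train_similarity(t, train_prompts) for t in test_prompts]
    return {thr: sum(s >= thr for s in sims) for thr in thresholds}
```

A TF-IDF or embedding-based measure would scale better than `difflib` over ~40k rows, but the threshold-sweep logic is the same.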
|
|
| Interpretation: |
|
|
The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than serving as a strong OOD benchmark.
|
|
| Decision: |
|
|
| Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling. |
|
|
| --- |
|
|
| ## 2026-04-30 — Research SOTA dataset created |
|
|
| Created: |
|
|
| - https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota |
|
|
| Splits: |
|
|
| | Split | Rows | Purpose | |
| |---|---:|---| |
| | `train_base` | 26,357 | unaugmented training after OOD holdouts | |
| | `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers | |
| | `validation` | 1,547 | validation | |
| | `test_in_distribution` | 1,455 | in-distribution test | |
| | `test_template_ood` | 3,503 | held-out prompt-template family | |
| | `test_use_case_ood` | 4,341 | held-out use cases | |
| | `test_sector_ood` | 4,579 | held-out sectors | |
| | `test_adversarial` | 33 | held-out adversarial examples | |
|
|
| Qwen3 token-length audit: |
|
|
| - mean: **754.1** |
| - p50: **705** |
| - p95: **1293** |
| - p99: **1300** |
| - max: **1316** |
| - fit within 2048: **100%** |
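The percentile audit can be sketched with a nearest-rank helper over precomputed token counts; in practice the counts would come from the Qwen3 tokenizer applied to each full chat-formatted example (an assumption — the audit's exact tokenization path is not recorded here):

```python
import math

def length_percentiles(lengths, ps=(50, 95, 99)):
    """Nearest-rank percentiles plus mean/max/context-fit for a list of token counts."""
    xs = sorted(lengths)
    out = {f"p{p}": xs[min(len(xs) - 1, math.ceil(p / 100 * len(xs)) - 1)] for p in ps}
    out["mean"] = sum(xs) / len(xs)
    out["max"] = xs[-1]
    out["fit_2048"] = sum(x <= 2048 for x in xs) / len(xs)  # fraction within max length
    return out
```

The `fit_2048` figure is what justified the 2048 max-length setting in the training recipe.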
|
|
| `train_sota` balancing: |
|
|
| - non-create lifecycle rows: **5,166 = 15.97%** |
| - adversarial rows: **2,115 = 6.54%** |
| - synthetic multi-turn wrappers: **1,281** |
|
|
| Decision: |
|
|
| Use `train_sota` for the first Qwen3-8B QLoRA training run. |
|
|
| --- |
|
|
| ## 2026-04-30 / 2026-05-01 — Training/evaluation repo created |
|
|
| Created: |
|
|
| - https://huggingface.co/nraptisss/tmf921-intent-training |
|
|
| Default recipe: |
|
|
| - Base model: `Qwen/Qwen3-8B` |
| - Method: QLoRA SFT |
| - Quantization: 4-bit NF4 + double quantization |
| - LoRA target modules: `all-linear` |
| - LoRA rank: 64 |
| - LR: 2e-4 |
| - Max length: 2048 |
| - Loss: assistant-only SFT loss |
| - bf16: enabled |
| - gradient checkpointing: enabled |
| - train split: `train_sota` |
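A minimal sketch of how the quantization and LoRA settings above map onto `transformers`/`peft` config objects; `lora_alpha` and `lora_dropout` are assumptions not recorded in the recipe:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank-64 LoRA over all linear layers; alpha and dropout are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```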
|
|
| The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold. |
|
|
| --- |
|
|
| ## 2026-05-01 — Runtime issues fixed |
|
|
| Fixed issues: |
|
|
| 1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks. |
| 2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works. |
| 3. Trackio invalid Space ID: sanitized Trackio config and added `DISABLE_TRACKIO=1`. |
| 4. Deprecated `warmup_ratio`: replaced with `warmup_steps`. |
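Fix 2 can be sketched as below, assuming a recent TRL where `SFTConfig` exposes `assistant_only_loss` (option names vary across TRL versions); `model` and `raw_ds` are placeholders for the loaded Qwen3-8B model and the `train_sota` split:

```python
from trl import SFTConfig, SFTTrainer

# Keeping only the `messages` column lets TRL detect the conversational
# format, which `assistant_only_loss=True` requires.
train_ds = raw_ds.select_columns(["messages"])

args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    assistant_only_loss=True,
    bf16=True,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    warmup_steps=100,  # assumption: the exact warmup value is not recorded here
)
trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds)
```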
|
|
| Server GPU evidence: |
|
|
| ```text |
| torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0 |
| cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation |
| ``` |
|
|
| --- |
|
|
| ## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed |
|
|
| Run directory: |
|
|
| ```text |
| runs/qwen3-8b-qlora-20260501-083834 |
| ``` |
|
|
| Training behavior: |
|
|
| - Initial loss: **1.212** |
| - Later loss: **~0.14–0.15** |
| - Mean token accuracy: **~0.945–0.953** |
| - Validation loss plateau: **~0.153** |
|
|
None of the following were observed:
|
|
| - CUDA OOM |
| - NaNs |
| - divergence |
| - gradient explosion |
|
|
| Decision: |
|
|
| Evaluate the trained adapter across ID and OOD splits. |
|
|
| --- |
|
|
| ## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed |
|
|
| Initial 4-bit adapter evaluation was too slow: |
|
|
| ```text |
| test_in_distribution: 1455 examples in ~25h |
| ``` |
|
|
| Fixes: |
|
|
| - batched generation, |
| - dynamic generation length, |
| - periodic save/resume, |
| - merged bf16 model evaluation. |
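The dynamic generation-length fix can be sketched as a per-batch token budget derived from reference output lengths; the margin, floor, and cap values here are illustrative, not the values used in the evaluation scripts:

```python
def dynamic_max_new_tokens(reference_lengths, margin=1.25, floor=256, cap=1400):
    """Budget generation length from the batch's longest reference output.

    margin: headroom over the longest reference; floor/cap bound the budget
    (cap chosen near the dataset's observed max of ~1316 tokens).
    """
    budget = int(max(reference_lengths) * margin)
    return max(floor, min(cap, budget))
```

Compared with a fixed 2048-token budget, this avoids generating (and then discarding) hundreds of tokens per example on short references.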
|
|
| --- |
|
|
| ## 2026-05-04 — Stage-1 raw and normalized evaluation |
|
|
| Raw metrics: |
|
|
| | Split | JSON parse | Exact match | Field F1 | KPI presence | |
| |---|---:|---:|---:|---:| |
| | `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 | |
| | `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 | |
| | `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 | |
| | `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 | |
| | `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 | |
|
|
| Normalized metrics: |
|
|
| | Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact | |
| |---|---:|---:|---:|---:| |
| | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 | |
| | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 | |
| | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 | |
| | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 | |
| | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 | |
|
|
| Interpretation: |
|
|
| The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated. |
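In the normalized evaluator, volatile fields (for example generated IDs and timestamps) are canonicalized before comparison; the flat field-F1 core on top of that normalization can be sketched as below — function names and path conventions are illustrative, not the evaluator's actual API:

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value} pairs."""
    if isinstance(obj, dict):
        items = {}
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}{k}."))
        return items
    if isinstance(obj, list):
        items = {}
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}{i}."))
        return items
    return {prefix[:-1]: obj}

def field_f1(pred: dict, gold: dict) -> float:
    """F1 over exact (path, value) pairs; key F1 would compare paths only."""
    p, g = set(flatten(pred).items()), set(flatten(gold).items())
    if not p or not g:
        return float(p == g)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

Key F1 being near 0.98 while field F1 sits near 0.79 is exactly the signature of correct paths with wrong leaf values.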
|
|
| Weak layers: |
|
|
| - `o1_nrm`: normalized field F1 around **0.39–0.40** |
| - `a1_policy`: normalized field F1 around **0.67–0.68** |
| - `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18** |
| - `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52** |
|
|
| Decision: |
|
|
| Test a stage-2 weak-layer continuation experiment. |
|
|
| --- |
|
|
| ## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation |
|
|
| Stage-2 setup: |
|
|
| - initialized from stage-1 adapter, |
| - weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`, |
| - stage-2 rows: **13,829**, |
| - weak rows: **10,638**, |
| - replay rows: **3,191**, |
| - LR: **5e-5**, |
| - epochs: **1**. |
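The mixture above (all weak-layer rows plus roughly 30% replay of the rest) can be sketched as follows; the `layer` column name and the `replay_frac` value are assumptions inferred from the recorded row counts (3,191 / 10,638 ≈ 0.30):

```python
import random

def build_stage2_mixture(rows, weak_layers, replay_frac=0.30, seed=0):
    """Take all weak-layer rows plus a seeded replay sample of the rest."""
    weak = [r for r in rows if r["layer"] in weak_layers]
    rest = [r for r in rows if r["layer"] not in weak_layers]
    rng = random.Random(seed)  # fixed seed for reproducible replay selection
    replay = rng.sample(rest, k=min(len(rest), int(len(weak) * replay_frac)))
    return weak + replay
```

The replay portion is what guards against catastrophic forgetting of the non-weak layers during continuation training.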
|
|
| Stage-2 training was stable. Adapter continuation was correctly configured: |
|
|
| ```text |
| trainable params: 174,587,904 |
| requires_grad={'default': True} |
| devices={'default': ['cuda']} |
| ``` |
|
|
| Stage-2 evaluation comparison: |
|
|
| | Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta | |
| |---|---:|---:|---:|---:|---:|---:| |
| | `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 | |
| | `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 | |
| | `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 | |
| | `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 | |
| | `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 | |
|
|
| Decision: |
|
|
| Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model. |
|
|
| Interpretation: |
|
|
| Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune. |
|
|
| --- |
|
|
| ## 2026-05-06 — Zero-shot Qwen3-8B baseline completed |
|
|
| Goal: |
|
|
| Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning. |
|
|
| Action: |
|
|
| Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split: |
|
|
| ```bash |
| EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \ |
| bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot |
| ``` |
|
|
| Zero-shot metrics: |
|
|
| | Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 | |
| |---|---:|---:|---:| |
| | `test_in_distribution` | 0.335 | 0.0009 | 0.0169 | |
| | `test_template_ood` | 0.340 | 0.0014 | 0.0172 | |
| | `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 | |
| | `test_sector_ood` | 0.345 | 0.0008 | 0.0171 | |
| | `test_adversarial` | 0.000 | 0.0000 | 0.0000 | |
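Zero-shot outputs often wrap JSON in prose or omit it entirely, so the parse metric needs an extraction step before `json.loads`; one common sketch (first `{` to last `}`) is shown below — the actual evaluator's extraction rules may differ:

```python
import json

def parse_rate(outputs):
    """Fraction of model outputs containing one parseable JSON object."""
    ok = 0
    for text in outputs:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            try:
                json.loads(text[start:end + 1])
                ok += 1
            except json.JSONDecodeError:
                pass  # brace-delimited span was not valid JSON
    return ok / len(outputs) if outputs else 0.0
```

Under this definition the fine-tuned model's ~1.0 parse rate means it emits a clean JSON object essentially every time, while the zero-shot model does so in roughly a third of cases.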
|
|
| Comparison with fine-tuned stage 1: |
|
|
| | Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 | |
| |---|---:|---:|---:|---:|---:|---:| |
| | ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 | |
| | Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 | |
| | Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 | |
| | Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 | |
| | Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 | |
|
|
| Interpretation: |
|
|
| Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential. |
|
|
| --- |
|
|
| ## 2026-05-07 — Publication packaging and paper scaffold |
|
|
| Completed: |
|
|
| - finalized dataset card, |
| - finalized primary stage-1 model card, |
| - added `REPRODUCIBILITY.md`, |
| - added `scripts/reproduce_stage1_eval.sh`, |
| - added `scripts/run_zero_shot_baseline.sh`, |
| - added `scripts/package_results.py`, |
| - added `scripts/sample_failure_examples.py`, |
| - uploaded `results/` and `analysis/` artifacts, |
| - added `paper/outline.md`, |
| - added `paper/tables.md`. |
|
|
| Current publication-ready assets: |
|
|
| - dataset card, |
| - model card, |
| - results package, |
| - qualitative examples, |
| - reproducibility checklist, |
| - paper outline, |
| - draft tables, |
| - project journal. |
|
|
| --- |
|
|
| ## Current open research questions |
|
|
| 1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1? |
| 2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring? |
| 3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation? |
| 4. Can official or derived validators be added for TMF921/CAMARA/A1/O1? |
|
|
| ## Next recommended step |
|
|
| Write the first manuscript draft using: |
|
|
| - `paper/outline.md`, |
| - `paper/tables.md`, |
| - `PROJECT_JOURNAL.md`, |
| - `results/stage1_vs_stage2_comparison.md`, |
| - `results/baselines/zero_shot_vs_finetuned.md`, |
| - `analysis/stage1_examples/failure_examples.md`. |
|
|
| --- |
|
|
| ## 2026-05-07 — O1/A1 semantic evaluator results added |
|
|
| ### Goal |
|
|
| Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality. |
|
|
| ### Action |
|
|
| Implemented and ran a prototype semantic evaluator: |
|
|
| ```bash |
| python scripts/evaluate_semantic_o1_a1.py \ |
| --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged |
| |
| python scripts/evaluate_semantic_o1_a1.py \ |
| --eval_dir runs/stage2-weak-20260505-080040/eval |
| ``` |
|
|
| The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for: |
|
|
| - `o1_nrm` |
| - `a1_policy` |
|
|
| This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance. |
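A minimal sketch of the core/KPI split the evaluator makes: exact match on core structural/identifier fields, tolerance-based credit on numeric KPI fields. The key lists, the 5% relative tolerance, and the function names are assumptions, not the evaluator's actual rules:

```python
def score_kpi(pred, gold, rel_tol=0.05):
    """Give credit for numeric KPI values within a relative tolerance."""
    try:
        p, g = float(pred), float(gold)
    except (TypeError, ValueError):
        return float(pred == gold)  # non-numeric KPI: fall back to exact match
    if g == 0:
        return float(p == 0)
    return float(abs(p - g) / abs(g) <= rel_tol)

def semantic_scores(pred, gold, core_keys, kpi_keys):
    """Average exact match over core fields and tolerant match over KPI fields."""
    core = [float(pred.get(k) == gold.get(k)) for k in core_keys]
    kpi = [score_kpi(pred.get(k), gold.get(k)) for k in kpi_keys]
    return {
        "sem_core_score": sum(core) / len(core) if core else 1.0,
        "sem_kpi_score": sum(kpi) / len(kpi) if kpi else 1.0,
    }
```

Separating the two scores is what exposes the core-vs-KPI gap reported below for O1 NRM.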
|
|
| ### Evidence / result |
|
|
| Global O1/A1 semantic comparison: |
|
|
| | Metric | Stage 1 | Stage 2 | Delta | |
| |---|---:|---:|---:| |
| | `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 | |
| | `sem_core_score` | 0.8777 | 0.8883 | +0.0106 | |
| | `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 | |
| | `parse_json` | 1.0000 | 1.0000 | +0.0000 | |
| | `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 | |
|
|
| A1 policy: |
|
|
| | Metric | Stage 1 | Stage 2 | Delta | |
| |---|---:|---:|---:| |
| | `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 | |
| | `sem_core_score` | 0.8569 | 0.8714 | +0.0144 | |
| | `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 | |
| | `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 | |
|
|
| O1 NRM: |
|
|
| | Metric | Stage 1 | Stage 2 | Delta | |
| |---|---:|---:|---:| |
| | `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 | |
| | `sem_core_score` | 0.9022 | 0.9082 | +0.0060 | |
| | `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 | |
| | `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 | |
|
|
| ### Interpretation |
|
|
| The semantic evaluator confirms the previous conclusion with more nuance: |
|
|
| 1. Stage 2 gives very small improvements to O1/A1 semantic scores. |
| 2. The gains are mostly in core structural/identifier fields. |
| 3. KPI/value fidelity remains weak, especially for O1 NRM. |
4. The improvements are too small to offset the stage-2 adversarial regression.
| 5. Stage 1 remains the primary model. |
|
|
| The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment: |
|
|
| - O1 semantic core score: about **0.90** |
| - O1 semantic KPI score: about **0.28** |
|
|
| Thus, the main weakness is not JSON structure but low-level telecom value fidelity. |
|
|
| ### Decision / next step |
|
|
| Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on: |
|
|
| - canonical scenario labels, |
| - O1/A1 semantic validators, |
| - standards-derived schema validation, |
| - Gen4 deterministic per-layer renderers. |
|
|
| Artifacts added: |
|
|
| - `results/semantic/o1_a1_stage1_vs_stage2.md` |
| - `results/semantic/o1_a1_stage1_vs_stage2_summary.json` |
|
|
|
|