TMF921 Intent-to-Configuration Research Journal
This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
Repository links:
- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B
Current status summary
Current primary model: stage-1 Qwen3-8B QLoRA adapter.
Stage 2 status: diagnostic / not promoted.
Best stage-1 normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
Zero-shot Qwen3-8B baseline, 200 examples per split:
| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---|---|---|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
2026-04-30 – Dataset cloned and audited
The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
Key findings:
- Total rows: 41,815
- Train: 39,294
- Test: 2,521
- Missing values: 0
- Duplicate IDs: 0
- Assistant JSON parse validity: 100%
- Exact train/test full-message overlap: 0
- Near-duplicate prompt similarity was high (an audit sketch follows this list):
  - similarity ≥ 0.90: 1,290 / 2,521
  - similarity ≥ 0.95: 602 / 2,521
  - similarity ≥ 0.98: 262 / 2,521
- `create` lifecycle operation: 95.9%
- adversarial rows: 166 (0.397%)
- unique JSON structure signatures: 31
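For reproducibility, a minimal sketch of the near-duplicate audit is below. It assumes TF-IDF character n-gram cosine similarity between test and train prompts; the metric and vectorizer settings are illustrative, not the recorded implementation of the actual audit script.

```python
# Near-duplicate audit sketch: for each test prompt, find its most similar
# train prompt and count how many exceed each threshold.
# The TF-IDF character n-gram metric is an assumption.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_counts(train_prompts, test_prompts,
                          thresholds=(0.90, 0.95, 0.98)):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    train_mat = vec.fit_transform(train_prompts)
    test_mat = vec.transform(test_prompts)
    # Dense (n_test, n_train) similarity matrix; chunk the test rows if memory is tight.
    max_sim = cosine_similarity(test_mat, train_mat).max(axis=1)
    return {t: int(np.sum(max_sim >= t)) for t in thresholds}
```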
Interpretation:
The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than providing a strong OOD benchmark.
Decision:
Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
2026-04-30 – Research SOTA dataset created
Created: nraptisss/TMF921-intent-to-config-research-sota.
Splits:
| Split | Rows | Purpose |
|---|---|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |
Qwen3 token-length audit (a reproduction sketch follows the list):
- mean: 754.1
- p50: 705
- p95: 1293
- p99: 1300
- max: 1316
- fit within 2048: 100%
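A minimal sketch of this audit, assuming each row stores a chat-format `messages` column (the column name and the use of `apply_chat_template` are assumptions about the dataset layout):

```python
# Token-length audit sketch over a chat-format dataset.
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def audit_token_lengths(rows, budget=2048):
    lengths = np.array([
        len(tok.apply_chat_template(row["messages"], tokenize=True))
        for row in rows
    ])
    print(f"mean: {lengths.mean():.1f}")
    for p in (50, 95, 99):
        print(f"p{p}: {np.percentile(lengths, p):.0f}")
    print(f"max: {lengths.max()}")
    print(f"fit within {budget}: {(lengths <= budget).mean():.2%}")
```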
`train_sota` balancing (upsampling sketched below):
- non-create lifecycle rows: 5,166 (15.97%)
- adversarial rows: 2,115 (6.54%)
- synthetic multi-turn wrappers: 1,281
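To make "marked upsampling" concrete: rare rows (non-create lifecycle, adversarial) are duplicated in the training split only, while validation and test splits stay untouched. The helper below is illustrative; the predicate and multipliers behind `train_sota` are not reproduced here.

```python
# Illustrative training-only upsampling. The is_rare predicate and the
# multiplier are hypothetical; only train rows are ever duplicated.
import random

def upsample_train(rows, is_rare, factor=4, seed=0):
    extra = [row for row in rows if is_rare(row)] * (factor - 1)
    out = list(rows) + extra
    random.Random(seed).shuffle(out)
    return out
```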
Decision:
Use `train_sota` for the first Qwen3-8B QLoRA training run.
2026-04-30 / 2026-05-01 – Training/evaluation repo created
Created: nraptisss/tmf921-intent-training.
Default recipe (a code sketch follows the list):
- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
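A minimal sketch of this recipe with TRL and PEFT, keeping library defaults for everything not listed above (batch size, epochs, warmup steps); `train_sota` stands for the already-loaded training split:

```python
# Stage-1 QLoRA SFT sketch (TRL + PEFT); only the hyperparameters listed
# above are set explicitly.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # 4-bit NF4
    bnb_4bit_use_double_quant=True,   # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora = LoraConfig(r=64, target_modules="all-linear", task_type="CAUSAL_LM")
args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    max_length=2048,
    assistant_only_loss=True,         # loss on assistant tokens only
    bf16=True,
    gradient_checkpointing=True,
    model_init_kwargs={"quantization_config": bnb},
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=args,
    train_dataset=train_sota,         # dataset exposing only a "messages" column
    peft_config=lora,
)
trainer.train()
```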
The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold.
2026-05-01 – Runtime issues fixed
Fixed issues:
- GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks (see the sketch below).
- TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
- Trackio invalid Space ID: sanitized the Trackio config and added `DISABLE_TRACKIO=1`.
- Deprecated `warmup_ratio`: replaced with `warmup_steps`.
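The evidence lines below come from checks of this kind. A fail-fast sketch (not the actual contents of `check_gpu.py`):

```python
# Fail-fast CUDA preflight sketch; the real check_gpu.py may verify more.
import os
import sys
import torch

print(f"torch={torch.__version__} torch.version.cuda={torch.version.cuda} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")
if not torch.cuda.is_available():
    sys.exit("FATAL: CUDA unavailable; refusing to fall back to CPU training.")
print(f"cuda device_count={torch.cuda.device_count()} "
      f"gpu0={torch.cuda.get_device_name(0)}")
```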
Server GPU evidence:
```
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```
2026-05-01 / 2026-05-02 – Stage-1 Qwen3-8B QLoRA training completed
Run directory: `runs/qwen3-8b-qlora-20260501-083834`
Training behavior:
- Initial loss: 1.212
- Later loss: ~0.14–0.15
- Mean token accuracy: ~0.945–0.953
- Validation loss plateau: ~0.153
Not observed:
- CUDA OOM
- NaNs
- divergence
- gradient explosion
Decision:
Evaluate the trained adapter across ID and OOD splits.
2026-05-02 / 2026-05-04 – Evaluation speed issue fixed
Initial 4-bit adapter evaluation was too slow:
`test_in_distribution`: 1,455 examples in ~25 h
Fixes:
- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation (sketched below).
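The merged-model path can be sketched as follows; the output directory name is illustrative:

```python
# Merge the stage-1 LoRA adapter into a bf16 copy of the base model so
# evaluation avoids per-step 4-bit dequantization overhead.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "runs/qwen3-8b-qlora-20260501-083834")
merged = model.merge_and_unload()   # bake LoRA deltas into the base weights
merged.save_pretrained("runs/qwen3-8b-qlora-20260501-083834/merged_bf16")
```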
2026-05-04 – Stage-1 raw and normalized evaluation
Raw metrics:
| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
Normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
Interpretation:
The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
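This journal does not spell out the normalization rules, but one plausible construction is sketched below: flatten each JSON object into (path, value) pairs, mask values under volatile-looking keys, and compute F1 over the resulting sets. The volatile-key pattern is an assumption, not the evaluator's actual rule.

```python
# Plausible normalized field F1 sketch: flatten JSON to (path, value) pairs,
# mask volatile values (ids, hrefs, timestamps), compare as sets.
import re

VOLATILE_KEY = re.compile(r"(^|\.)(id|href|[^.]*timestamp[^.]*|[^.]*date[^.]*)$",
                          re.IGNORECASE)

def flatten(obj, path=""):
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from flatten(val, f"{path}[{i}]")
    else:
        yield (path, "<VOLATILE>" if VOLATILE_KEY.search(path) else obj)

def normalized_field_f1(pred, gold):
    p, g = set(flatten(pred)), set(flatten(gold))
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)
```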
Weak layers:
- `o1_nrm`: normalized field F1 around 0.39–0.40
- `a1_policy`: normalized field F1 around 0.67–0.68
- `tmf921_lifecycle_report`: normalized field F1 around 0.15–0.18
- `tmf921_lifecycle_monitor`: normalized field F1 around 0.39–0.52
Decision:
Test a stage-2 weak-layer continuation experiment.
2026-05-05 – Stage-2 weak-layer continuation run and evaluation
Stage-2 setup:
- initialized from the stage-1 adapter,
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: 13,829,
- weak rows: 10,638,
- replay rows: 3,191,
- LR: 5e-5,
- epochs: 1.
Stage-2 training was stable. Adapter continuation was correctly configured:
```
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```
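In PEFT terms, the continuation amounts to loading the stage-1 adapter as trainable before stage-2 SFT; a minimal sketch, with illustrative paths:

```python
# Load the stage-1 adapter as trainable for stage-2 continuation.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834",   # stage-1 adapter directory
    is_trainable=True,                       # keep LoRA weights trainable
)
model.print_trainable_parameters()           # expect ~174.6M trainable params
```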
Stage-2 evaluation comparison:
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---|---|---|---|---|---|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
Decision:
Stage 2 is diagnostic only and is not promoted. Stage 1 remains the primary model.
Interpretation:
Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
2026-05-06 – Zero-shot Qwen3-8B baseline completed
Goal:
Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
Action:
Ran zero-shot Qwen/Qwen3-8B on 200 examples per split:
```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
  bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```
Zero-shot metrics:
| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---|---|---|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Comparison with fine-tuned stage 1:
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---|---|---|---|---|---|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
Interpretation:
Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
2026-05-07 – Publication packaging and paper scaffold
Completed:
- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.
Current publication-ready assets:
- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.
Current open research questions
- Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
- Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
- Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
- Can official or derived validators be added for TMF921/CAMARA/A1/O1?
Next recommended step
Write the first manuscript draft using:
- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.
2026-05-07 – O1/A1 semantic evaluator results added
Goal
Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.
Action
Implemented and ran a prototype semantic evaluator:
```bash
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```
The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for `o1_nrm` and `a1_policy`.
This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
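To make "scores telecom-relevant fields" concrete, here is a sketch of one plausible component, tolerance-based KPI scoring; the 5% relative tolerance and the flat KPI dict layout are assumptions, not the prototype's recorded rules.

```python
# Tolerance-based KPI scoring sketch: a predicted numeric KPI counts as a hit
# if it lies within rel_tol of the gold value; non-numeric KPIs must match
# exactly. Tolerance and dict layout are illustrative.
def kpi_score(pred_kpis, gold_kpis, rel_tol=0.05):
    if not gold_kpis:
        return 1.0
    hits = 0
    for name, gold_val in gold_kpis.items():
        pred_val = pred_kpis.get(name)
        both_numeric = (isinstance(pred_val, (int, float))
                        and isinstance(gold_val, (int, float)))
        if both_numeric and gold_val != 0:
            hits += abs(pred_val - gold_val) / abs(gold_val) <= rel_tol
        else:
            hits += pred_val == gold_val
    return hits / len(gold_kpis)
```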
Evidence / result
Global O1/A1 semantic comparison:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |
A1 policy:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |
O1 NRM:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |
Interpretation
The semantic evaluator confirms the previous conclusion with more nuance:
- Stage 2 gives very small improvements to O1/A1 semantic scores.
- The gains are mostly in core structural/identifier fields.
- KPI/value fidelity remains weak, especially for O1 NRM.
- The improvements are too small to offset stage-2 adversarial regression.
- Stage 1 remains the primary model.
The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:
- O1 semantic core score: about 0.90
- O1 semantic KPI score: about 0.28
Thus, the main weakness is not JSON structure but low-level telecom value fidelity.
Decision / next step
Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on:
- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.
Artifacts added:
- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`