TMF921 Intent-to-Configuration Research Journal
This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
Repository links:
- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B
Current status summary
Current primary model: stage-1 Qwen3-8B QLoRA adapter.
Stage 2 status: diagnostic / not promoted.
Best stage-1 normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
Zero-shot Qwen3-8B baseline, 200 examples per split:
| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---|---|---|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
2026-04-30 – Dataset cloned and audited
The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
Key findings:
- Total rows: 41,815
- Train: 39,294
- Test: 2,521
- Missing values: 0
- Duplicate IDs: 0
- Assistant JSON parse validity: 100%
- Exact train/test full-message overlap: 0
- Near-duplicate prompt similarity was high (an audit sketch follows this list):
  - similarity ≥ 0.90: 1,290 / 2,521
  - similarity ≥ 0.95: 602 / 2,521
  - similarity ≥ 0.98: 262 / 2,521
- `create` lifecycle operation: 95.9%
- adversarial rows: 166 (0.397%)
- unique JSON structure signatures: 31
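For reproducibility, a minimal sketch of the near-duplicate audit is below. It assumes TF-IDF character n-gram cosine similarity between test and train prompts; the metric and vectorizer settings are illustrative, not the recorded implementation of the actual audit script.

```python
# Near-duplicate audit sketch: for each test prompt, find its most similar
# train prompt and count how many exceed each threshold.
# The TF-IDF character n-gram metric is an assumption.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_counts(train_prompts, test_prompts,
                          thresholds=(0.90, 0.95, 0.98)):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    train_mat = vec.fit_transform(train_prompts)
    test_mat = vec.transform(test_prompts)
    # Dense (n_test, n_train) similarity matrix; chunk the test rows if memory is tight.
    max_sim = cosine_similarity(test_mat, train_mat).max(axis=1)
    return {t: int(np.sum(max_sim >= t)) for t in thresholds}
```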
Interpretation:
The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than providing a strong OOD benchmark.
Decision:
Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
2026-04-30 – Research SOTA dataset created
Created: nraptisss/TMF921-intent-to-config-research-sota.
Splits:
| Split | Rows | Purpose |
|---|---|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |
Qwen3 token-length audit (a reproduction sketch follows the list):
- mean: 754.1
- p50: 705
- p95: 1293
- p99: 1300
- max: 1316
- fit within 2048: 100%
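A minimal sketch of this audit, assuming each row stores a chat-format `messages` column (the column name and the use of `apply_chat_template` are assumptions about the dataset layout):

```python
# Token-length audit sketch over a chat-format dataset.
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def audit_token_lengths(rows, budget=2048):
    lengths = np.array([
        len(tok.apply_chat_template(row["messages"], tokenize=True))
        for row in rows
    ])
    print(f"mean: {lengths.mean():.1f}")
    for p in (50, 95, 99):
        print(f"p{p}: {np.percentile(lengths, p):.0f}")
    print(f"max: {lengths.max()}")
    print(f"fit within {budget}: {(lengths <= budget).mean():.2%}")
```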
`train_sota` balancing (upsampling sketched below):
- non-create lifecycle rows: 5,166 (15.97%)
- adversarial rows: 2,115 (6.54%)
- synthetic multi-turn wrappers: 1,281
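To make "marked upsampling" concrete: rare rows (non-create lifecycle, adversarial) are duplicated in the training split only, while validation and test splits stay untouched. The helper below is illustrative; the predicate and multipliers behind `train_sota` are not reproduced here.

```python
# Illustrative training-only upsampling. The is_rare predicate and the
# multiplier are hypothetical; only train rows are ever duplicated.
import random

def upsample_train(rows, is_rare, factor=4, seed=0):
    extra = [row for row in rows if is_rare(row)] * (factor - 1)
    out = list(rows) + extra
    random.Random(seed).shuffle(out)
    return out
```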
Decision:
Use `train_sota` for the first Qwen3-8B QLoRA training run.
2026-04-30 / 2026-05-01 – Training/evaluation repo created
Created: nraptisss/tmf921-intent-training.
Default recipe (a code sketch follows the list):
- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
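A minimal sketch of this recipe with TRL and PEFT, keeping library defaults for everything not listed above (batch size, epochs, warmup steps); `train_sota` stands for the already-loaded training split:

```python
# Stage-1 QLoRA SFT sketch (TRL + PEFT); only the hyperparameters listed
# above are set explicitly.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # 4-bit NF4
    bnb_4bit_use_double_quant=True,   # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora = LoraConfig(r=64, target_modules="all-linear", task_type="CAUSAL_LM")
args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    max_length=2048,
    assistant_only_loss=True,         # loss on assistant tokens only
    bf16=True,
    gradient_checkpointing=True,
    model_init_kwargs={"quantization_config": bnb},
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=args,
    train_dataset=train_sota,         # dataset exposing only a "messages" column
    peft_config=lora,
)
trainer.train()
```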
The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold.
2026-05-01 – Runtime issues fixed
Fixed issues:
- GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks (see the sketch below).
- TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
- Trackio invalid Space ID: sanitized the Trackio config and added `DISABLE_TRACKIO=1`.
- Deprecated `warmup_ratio`: replaced with `warmup_steps`.
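The evidence lines below come from checks of this kind. A fail-fast sketch (not the actual contents of `check_gpu.py`):

```python
# Fail-fast CUDA preflight sketch; the real check_gpu.py may verify more.
import os
import sys
import torch

print(f"torch={torch.__version__} torch.version.cuda={torch.version.cuda} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")
if not torch.cuda.is_available():
    sys.exit("FATAL: CUDA unavailable; refusing to fall back to CPU training.")
print(f"cuda device_count={torch.cuda.device_count()} "
      f"gpu0={torch.cuda.get_device_name(0)}")
```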
Server GPU evidence:
```
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```
2026-05-01 / 2026-05-02 – Stage-1 Qwen3-8B QLoRA training completed
Run directory: `runs/qwen3-8b-qlora-20260501-083834`
Training behavior:
- Initial loss: 1.212
- Later loss: ~0.14–0.15
- Mean token accuracy: ~0.945–0.953
- Validation loss plateau: ~0.153
Not observed:
- CUDA OOM
- NaNs
- divergence
- gradient explosion
Decision:
Evaluate the trained adapter across ID and OOD splits.
2026-05-02 / 2026-05-04 – Evaluation speed issue fixed
Initial 4-bit adapter evaluation was too slow:
`test_in_distribution`: 1,455 examples in ~25 h
Fixes:
- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation (sketched below).
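The merged-model path can be sketched as follows; the output directory name is illustrative:

```python
# Merge the stage-1 LoRA adapter into a bf16 copy of the base model so
# evaluation avoids per-step 4-bit dequantization overhead.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "runs/qwen3-8b-qlora-20260501-083834")
merged = model.merge_and_unload()   # bake LoRA deltas into the base weights
merged.save_pretrained("runs/qwen3-8b-qlora-20260501-083834/merged_bf16")
```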
2026-05-04 – Stage-1 raw and normalized evaluation
Raw metrics:
| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
Normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---|---|---|---|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
Interpretation:
The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
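This journal does not spell out the normalization rules, but one plausible construction is sketched below: flatten each JSON object into (path, value) pairs, mask values under volatile-looking keys, and compute F1 over the resulting sets. The volatile-key pattern is an assumption, not the evaluator's actual rule.

```python
# Plausible normalized field F1 sketch: flatten JSON to (path, value) pairs,
# mask volatile values (ids, hrefs, timestamps), compare as sets.
import re

VOLATILE_KEY = re.compile(r"(^|\.)(id|href|[^.]*timestamp[^.]*|[^.]*date[^.]*)$",
                          re.IGNORECASE)

def flatten(obj, path=""):
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from flatten(val, f"{path}[{i}]")
    else:
        yield (path, "<VOLATILE>" if VOLATILE_KEY.search(path) else obj)

def normalized_field_f1(pred, gold):
    p, g = set(flatten(pred)), set(flatten(gold))
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)
```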
Weak layers:
- `o1_nrm`: normalized field F1 around 0.39–0.40
- `a1_policy`: normalized field F1 around 0.67–0.68
- `tmf921_lifecycle_report`: normalized field F1 around 0.15–0.18
- `tmf921_lifecycle_monitor`: normalized field F1 around 0.39–0.52
Decision:
Test a stage-2 weak-layer continuation experiment.
2026-05-05 – Stage-2 weak-layer continuation run and evaluation
Stage-2 setup:
- initialized from the stage-1 adapter,
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: 13,829,
- weak rows: 10,638,
- replay rows: 3,191,
- LR: 5e-5,
- epochs: 1.
Stage-2 training was stable. Adapter continuation was correctly configured:
```
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```
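In PEFT terms, the continuation amounts to loading the stage-1 adapter as trainable before stage-2 SFT; a minimal sketch, with illustrative paths:

```python
# Load the stage-1 adapter as trainable for stage-2 continuation.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834",   # stage-1 adapter directory
    is_trainable=True,                       # keep LoRA weights trainable
)
model.print_trainable_parameters()           # expect ~174.6M trainable params
```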
Stage-2 evaluation comparison:
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---|---|---|---|---|---|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
Decision:
Stage 2 is diagnostic only and is not promoted. Stage 1 remains the primary model.
Interpretation:
Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
2026-05-06 – Zero-shot Qwen3-8B baseline completed
Goal:
Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
Action:
Ran zero-shot Qwen/Qwen3-8B on 200 examples per split:
```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
  bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```
Zero-shot metrics:
| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---|---|---|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Comparison with fine-tuned stage 1:
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---|---|---|---|---|---|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
Interpretation:
Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
2026-05-07 – Publication packaging and paper scaffold
Completed:
- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.
Current publication-ready assets:
- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.
Current open research questions
- Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
- Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
- Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
- Can official or derived validators be added for TMF921/CAMARA/A1/O1?
Next recommended step
Write the first manuscript draft using:
- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.
2026-05-07 – O1/A1 semantic evaluator results added
Goal
Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.
Action
Implemented and ran a prototype semantic evaluator:
```bash
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```
The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for `o1_nrm` and `a1_policy`.
This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
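To make "scores telecom-relevant fields" concrete, here is a sketch of one plausible component, tolerance-based KPI scoring; the 5% relative tolerance and the flat KPI dict layout are assumptions, not the prototype's recorded rules.

```python
# Tolerance-based KPI scoring sketch: a predicted numeric KPI counts as a hit
# if it lies within rel_tol of the gold value; non-numeric KPIs must match
# exactly. Tolerance and dict layout are illustrative.
def kpi_score(pred_kpis, gold_kpis, rel_tol=0.05):
    if not gold_kpis:
        return 1.0
    hits = 0
    for name, gold_val in gold_kpis.items():
        pred_val = pred_kpis.get(name)
        both_numeric = (isinstance(pred_val, (int, float))
                        and isinstance(gold_val, (int, float)))
        if both_numeric and gold_val != 0:
            hits += abs(pred_val - gold_val) / abs(gold_val) <= rel_tol
        else:
            hits += pred_val == gold_val
    return hits / len(gold_kpis)
```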
Evidence / result
Global O1/A1 semantic comparison:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |
A1 policy:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |
O1 NRM:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---|---|---|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |
Interpretation
The semantic evaluator confirms the previous conclusion with more nuance:
- Stage 2 gives very small improvements to O1/A1 semantic scores.
- The gains are mostly in core structural/identifier fields.
- KPI/value fidelity remains weak, especially for O1 NRM.
- The improvements are too small to offset stage-2 adversarial regression.
- Stage 1 remains the primary model.
The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:
- O1 semantic core score: about 0.90
- O1 semantic KPI score: about 0.28
Thus, the main weakness is not JSON structure but low-level telecom value fidelity.
Decision / next step
Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on:
- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.
Artifacts added:
- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`