TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:


Current status summary

Current primary model: stage-1 Qwen3-8B QLoRA adapter.

Stage 2 status: diagnostic / not promoted.

Best stage-1 normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
| --- | --- | --- | --- |
| test_in_distribution | 1.0000 | 0.7956 | 0.9811 |
| test_template_ood | 1.0000 | 0.7865 | 0.9801 |
| test_use_case_ood | 0.9998 | 0.7907 | 0.9805 |
| test_sector_ood | 1.0000 | 0.7697 | 0.9818 |
| test_adversarial | 1.0000 | 0.9697 | 1.0000 |

Zero-shot Qwen3-8B baseline, 200 examples per split:

| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
| --- | --- | --- | --- |
| test_in_distribution | 0.335 | 0.0009 | 0.0169 |
| test_template_ood | 0.340 | 0.0014 | 0.0172 |
| test_use_case_ood | 0.325 | 0.0012 | 0.0198 |
| test_sector_ood | 0.345 | 0.0008 | 0.0171 |
| test_adversarial | 0.000 | 0.0000 | 0.0000 |

Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.


2026-04-30 — Dataset cloned and audited

The source dataset nraptisss/TMF921-intent-to-config-augmented was cloned and audited.

Key findings:

  • Total rows: 41,815
  • Train: 39,294
  • Test: 2,521
  • Missing values: 0
  • Duplicate IDs: 0
  • Assistant JSON parse validity: 100%
  • Exact train/test full-message overlap: 0
  • Near-duplicate prompt similarity was high:
    • ≥ 0.90: 1,290 / 2,521
    • ≥ 0.95: 602 / 2,521
    • ≥ 0.98: 262 / 2,521

  • create lifecycle operation: 95.9%
  • adversarial rows: 166 = 0.397%
  • unique JSON structure signatures: 31
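
The near-duplicate audit can be illustrated with a minimal sketch. The journal does not record the similarity measure used, so `difflib.SequenceMatcher` stands in here, and the prompts are invented toy examples:

```python
from difflib import SequenceMatcher

def near_duplicate_counts(test_prompts, train_prompts, thresholds=(0.90, 0.95, 0.98)):
    """Count test prompts whose best train-prompt similarity meets each threshold."""
    counts = {t: 0 for t in thresholds}
    for tp in test_prompts:
        best = max(SequenceMatcher(None, tp, trp).ratio() for trp in train_prompts)
        for t in thresholds:
            if best >= t:
                counts[t] += 1
    return counts

# Toy example: one near-duplicate pair, one distinct prompt.
train = ["Create a URLLC slice for factory robots with 1 ms latency."]
test = [
    "Create a URLLC slice for factory robots with 2 ms latency.",
    "Delete the eMBB slice used by the stadium broadcast service.",
]
print(near_duplicate_counts(test, train))
```

On the real dataset this kind of scan is what motivates treating the original test split as largely in-distribution.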

Interpretation:

The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than serving as a strong OOD benchmark.

Decision:

Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.


2026-04-30 — Research SOTA dataset created

Created:

Splits:

| Split | Rows | Purpose |
| --- | --- | --- |
| train_base | 26,357 | unaugmented training after OOD holdouts |
| train_sota | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| validation | 1,547 | validation |
| test_in_distribution | 1,455 | in-distribution test |
| test_template_ood | 3,503 | held-out prompt-template family |
| test_use_case_ood | 4,341 | held-out use cases |
| test_sector_ood | 4,579 | held-out sectors |
| test_adversarial | 33 | held-out adversarial examples |

Qwen3 token-length audit:

  • mean: 754.1
  • p50: 705
  • p95: 1293
  • p99: 1300
  • max: 1316
  • fit within 2048: 100%
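
The audit statistics above can be reproduced with a small percentile helper. How the lengths were obtained is an assumption: most likely each full chat-formatted example was tokenized with the Qwen3 tokenizer (e.g. `len(tokenizer.apply_chat_template(messages))`); the sketch below only covers the summary step over a list of lengths:

```python
import math

def length_audit(token_lengths, max_len=2048):
    """Summarize a token-length distribution: mean, p50/p95/p99, max, fit ratio."""
    xs = sorted(token_lengths)
    def pct(p):
        # Nearest-rank percentile over the sorted lengths.
        k = max(0, math.ceil(p / 100 * len(xs)) - 1)
        return xs[k]
    return {
        "mean": sum(xs) / len(xs),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": xs[-1],
        "fit_within_max": sum(1 for x in xs if x <= max_len) / len(xs),
    }
```

The `fit_within_max` ratio of 100% is what justified the 2048 max-length setting in the training recipe.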

train_sota balancing:

  • non-create lifecycle rows: 5,166 = 15.97%
  • adversarial rows: 2,115 = 6.54%
  • synthetic multi-turn wrappers: 1,281
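
The rare-class upsampling can be sketched as follows. The journal does not record the exact procedure, so this is a minimal sketch that duplicates rare rows until they reach a target fraction; the helper name and signature are hypothetical:

```python
import random

def upsample_rare(rows, is_rare, target_frac, seed=0):
    """Duplicate rare rows until they make up ~target_frac of the split."""
    rng = random.Random(seed)
    rare = [r for r in rows if is_rare(r)]
    out = list(rows)
    if not rare:
        return out
    # Solve (len(rare) + n_extra) / (len(rows) + n_extra) ~= target_frac.
    n_extra = max(0, round((target_frac * len(rows) - len(rare)) / (1 - target_frac)))
    out.extend(rng.choice(rare) for _ in range(n_extra))
    return out
```

Applied to lifecycle operations, this is how the non-create share could be lifted from ~4% in the source data to ~16% in train_sota.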

Decision:

Use train_sota for the first Qwen3-8B QLoRA training run.


2026-04-30 / 2026-05-01 — Training/evaluation repo created

Created:

Default recipe:

  • Base model: Qwen/Qwen3-8B
  • Method: QLoRA SFT
  • Quantization: 4-bit NF4 + double quantization
  • LoRA target modules: all-linear
  • LoRA rank: 64
  • LR: 2e-4
  • Max length: 2048
  • Loss: assistant-only SFT loss
  • bf16: enabled
  • gradient checkpointing: enabled
  • train split: train_sota
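
The recipe above can be written as a configuration fragment. This is a sketch, not the repo's actual script: parameter names follow recent transformers/peft/trl releases (older TRL versions use `max_seq_length` instead of `max_length`), and the `lora_alpha` and `warmup_steps` values are assumptions not recorded in the journal:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 quantization with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank-64 LoRA over all linear modules.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,            # assumption: alpha is not recorded in the journal
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
    assistant_only_loss=True,  # loss computed on assistant tokens only
    warmup_steps=100,          # assumption: exact value not recorded
)
```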

The repo includes GPU preflight, nohup run/resume scripts, evaluation scripts, normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and paper scaffold.


2026-05-01 — Runtime issues fixed

Fixed issues:

  1. GPU uncertainty: added check_gpu.py, install_rtx6000ada.sh, and fail-fast CUDA checks.
  2. TRL dataset detection: passed only messages to SFTTrainer so assistant_only_loss=True works.
  3. Trackio invalid Space ID: sanitized Trackio config and added DISABLE_TRACKIO=1.
  4. Deprecated warmup_ratio: replaced with warmup_steps.

Server GPU evidence:

torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation

2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed

Run directory:

runs/qwen3-8b-qlora-20260501-083834

Training behavior:

  • Initial loss: 1.212
  • Later loss: ~0.14–0.15
  • Mean token accuracy: ~0.945–0.953
  • Validation loss plateau: ~0.153

Not observed during training:

  • CUDA OOM
  • NaNs
  • divergence
  • gradient explosion

Decision:

Evaluate the trained adapter across ID and OOD splits.


2026-05-02 / 2026-05-04 — Evaluation speed issue fixed

Initial 4-bit adapter evaluation was too slow:

test_in_distribution: 1,455 examples in ~25 h

Fixes:

  • batched generation,
  • dynamic generation length,
  • periodic save/resume,
  • merged bf16 model evaluation.
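
The periodic save/resume fix can be sketched as a JSONL checkpointing loop. The helper name and file layout are hypothetical, and `generate_fn` stands in for the (batched) generation call:

```python
import json
import os

def generate_with_resume(examples, generate_fn, out_path, save_every=50):
    """Skip already-evaluated ids and checkpoint predictions every save_every rows."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {json.loads(line)["id"] for line in f}
    buf = []
    with open(out_path, "a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue  # resume: this example was completed in a previous run
            buf.append({"id": ex["id"], "prediction": generate_fn(ex)})
            if len(buf) >= save_every:
                f.writelines(json.dumps(r) + "\n" for r in buf)
                f.flush()
                buf.clear()
        f.writelines(json.dumps(r) + "\n" for r in buf)
```

With this pattern a killed 25-hour run loses at most `save_every` generations instead of everything.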

2026-05-04 — Stage-1 raw and normalized evaluation

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
| --- | --- | --- | --- | --- |
| test_in_distribution | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| test_template_ood | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| test_use_case_ood | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| test_sector_ood | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| test_adversarial | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

Normalized metrics:

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
| --- | --- | --- | --- | --- |
| test_in_distribution | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| test_template_ood | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| test_use_case_ood | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| test_sector_ood | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| test_adversarial | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

Interpretation:

The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
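
The distinction between key F1 (structure) and field F1 (structure plus values) can be illustrated with a minimal sketch over flattened JSON. The actual evaluator's normalization of volatile fields is not reproduced here:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted_key_path: scalar_value}."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, f"{prefix}{i}."))
    else:
        out[prefix.rstrip(".")] = obj
    return out

def f1(pred_items, ref_items):
    """Set-based F1 between prediction and reference items."""
    if not pred_items or not ref_items:
        return 0.0
    tp = len(pred_items & ref_items)
    p, r = tp / len(pred_items), tp / len(ref_items)
    return 2 * p * r / (p + r) if p + r else 0.0

def normalized_scores(pred_json, ref_json):
    pred, ref = flatten(json.loads(pred_json)), flatten(json.loads(ref_json))
    return {
        "key_f1": f1(set(pred), set(ref)),           # key paths only
        "field_f1": f1(set(pred.items()), set(ref.items())),  # paths + values
    }
```

A prediction with the right schema but a wrong value scores 1.0 on key F1 and less on field F1, which is exactly the ~0.98 vs ~0.79 gap seen above.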

Weak layers:

  • o1_nrm: normalized field F1 around 0.39–0.40
  • a1_policy: normalized field F1 around 0.67–0.68
  • tmf921_lifecycle_report: normalized field F1 around 0.15–0.18
  • tmf921_lifecycle_monitor: normalized field F1 around 0.39–0.52

Decision:

Test a stage-2 weak-layer continuation experiment.


2026-05-05 — Stage-2 weak-layer continuation run and evaluation

Stage-2 setup:

  • initialized from stage-1 adapter,
  • weak layers: o1_nrm, a1_policy, tmf921_lifecycle_report, tmf921_lifecycle_monitor, tmf921_lifecycle_scale,
  • stage-2 rows: 13,829,
  • weak rows: 10,638,
  • replay rows: 3,191,
  • LR: 5e-5,
  • epochs: 1.

Stage-2 training was stable. Adapter continuation was correctly configured:

trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
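
The continuation setup can be sketched with standard PEFT adapter loading; this is not the repo's actual script, and the adapter path layout is an assumption:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
# is_trainable=True keeps the loaded stage-1 adapter weights trainable,
# so stage 2 continues the same adapter instead of training a fresh one.
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834",  # assumption: adapter path layout
    is_trainable=True,
)
model.print_trainable_parameters()
```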

Stage-2 evaluation comparison:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| test_in_distribution | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| test_template_ood | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| test_use_case_ood | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| test_sector_ood | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| test_adversarial | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Decision:

Stage 2 is diagnostic only and is not promoted. Stage 1 remains the primary model.

Interpretation:

Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.


2026-05-06 — Zero-shot Qwen3-8B baseline completed

Goal:

Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.

Action:

Ran zero-shot Qwen/Qwen3-8B on 200 examples per split:

EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot

Zero-shot metrics:

| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
| --- | --- | --- | --- |
| test_in_distribution | 0.335 | 0.0009 | 0.0169 |
| test_template_ood | 0.340 | 0.0014 | 0.0172 |
| test_use_case_ood | 0.325 | 0.0012 | 0.0198 |
| test_sector_ood | 0.345 | 0.0008 | 0.0171 |
| test_adversarial | 0.000 | 0.0000 | 0.0000 |

Comparison with fine-tuned stage 1:

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |

Interpretation:

Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.


2026-05-07 — Publication packaging and paper scaffold

Completed:

  • finalized dataset card,
  • finalized primary stage-1 model card,
  • added REPRODUCIBILITY.md,
  • added scripts/reproduce_stage1_eval.sh,
  • added scripts/run_zero_shot_baseline.sh,
  • added scripts/package_results.py,
  • added scripts/sample_failure_examples.py,
  • uploaded results/ and analysis/ artifacts,
  • added paper/outline.md,
  • added paper/tables.md.

Current publication-ready assets:

  • dataset card,
  • model card,
  • results package,
  • qualitative examples,
  • reproducibility checklist,
  • paper outline,
  • draft tables,
  • project journal.

Current open research questions

  1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
  2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
  3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
  4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?

Next recommended step

Write the first manuscript draft using:

  • paper/outline.md,
  • paper/tables.md,
  • PROJECT_JOURNAL.md,
  • results/stage1_vs_stage2_comparison.md,
  • results/baselines/zero_shot_vs_finetuned.md,
  • analysis/stage1_examples/failure_examples.md.

2026-05-07 — O1/A1 semantic evaluator results added

Goal

Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.

Action

Implemented and ran a prototype semantic evaluator:

python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged

python scripts/evaluate_semantic_o1_a1.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval

The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for:

  • o1_nrm
  • a1_policy

This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
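
The core-versus-KPI scoring idea can be sketched as follows, with hypothetical field names and a simple 5% relative tolerance; the actual evaluator's field lists, weighting, and tolerance rules are not reproduced here:

```python
def semantic_scores(pred, ref, core_keys, kpi_keys, rel_tol=0.05):
    """Score core identifier/structure fields exactly; numeric KPI fields within a relative tolerance."""
    core_hits = sum(pred.get(k) == ref.get(k) for k in core_keys)
    kpi_hits = 0
    for k in kpi_keys:
        p, r = pred.get(k), ref.get(k)
        if isinstance(p, (int, float)) and isinstance(r, (int, float)) and r:
            kpi_hits += abs(p - r) <= rel_tol * abs(r)  # tolerant numeric match
        else:
            kpi_hits += p == r                          # exact match otherwise
    core = core_hits / len(core_keys) if core_keys else 1.0
    kpi = kpi_hits / len(kpi_keys) if kpi_keys else 1.0
    # Assumption: equal weighting of core and KPI sub-scores.
    return {"core": core, "kpi": kpi, "overall": (core + kpi) / 2}
```

Separating the two sub-scores is what exposes the pattern reported below: near-correct core structure with weak KPI value assignment.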

Evidence / result

Global O1/A1 semantic comparison:

| Metric | Stage 1 | Stage 2 | Delta |
| --- | --- | --- | --- |
| sem_overall_score | 0.6830 | 0.6893 | +0.0063 |
| sem_core_score | 0.8777 | 0.8883 | +0.0106 |
| sem_kpi_score | 0.5125 | 0.5148 | +0.0023 |
| parse_json | 1.0000 | 1.0000 | +0.0000 |
| norm_field_f1 | 0.5462 | 0.5459 | -0.0003 |

A1 policy:

| Metric | Stage 1 | Stage 2 | Delta |
| --- | --- | --- | --- |
| sem_overall_score | 0.8077 | 0.8148 | +0.0071 |
| sem_core_score | 0.8569 | 0.8714 | +0.0144 |
| sem_kpi_score | 0.7118 | 0.7112 | -0.0007 |
| norm_field_f1 | 0.6776 | 0.6771 | -0.0005 |

O1 NRM:

| Metric | Stage 1 | Stage 2 | Delta |
| --- | --- | --- | --- |
| sem_overall_score | 0.5366 | 0.5420 | +0.0053 |
| sem_core_score | 0.9022 | 0.9082 | +0.0060 |
| sem_kpi_score | 0.2784 | 0.2841 | +0.0057 |
| norm_field_f1 | 0.3918 | 0.3918 | -0.0001 |

Interpretation

The semantic evaluator confirms the previous conclusion with more nuance:

  1. Stage 2 gives very small improvements to O1/A1 semantic scores.
  2. The gains are mostly in core structural/identifier fields.
  3. KPI/value fidelity remains weak, especially for O1 NRM.
  4. The improvements are too small to offset stage-2 adversarial regression.
  5. Stage 1 remains the primary model.

The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:

  • O1 semantic core score: about 0.90
  • O1 semantic KPI score: about 0.28

Thus, the main weakness is not JSON structure but low-level telecom value fidelity.

Decision / next step

Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on:

  • canonical scenario labels,
  • O1/A1 semantic validators,
  • standards-derived schema validation,
  • Gen4 deterministic per-layer renderers.

Artifacts added:

  • results/semantic/o1_a1_stage1_vs_stage2.md
  • results/semantic/o1_a1_stage1_vs_stage2_summary.json