# TMF921 Intent-to-Configuration Research Journal
This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.
Repository links:
- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B
---
## Current status summary
Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
Stage 2 status: **diagnostic / not promoted**.
Best stage-1 normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
Zero-shot Qwen3-8B baseline, 200 examples per split:
| Split | Zero-shot parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Main conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-configuration generation.
---
## 2026-04-30 — Dataset cloned and audited
The source dataset `nraptisss/TMF921-intent-to-config-augmented` was cloned and audited.
Key findings:
- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**
- Missing values: **0**
- Duplicate IDs: **0**
- Assistant JSON parse validity: **100%**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (see the similarity-audit sketch after this list):
- >= 0.90: **1,290 / 2,521**
- >= 0.95: **602 / 2,521**
- >= 0.98: **262 / 2,521**
- `create` lifecycle operation share: **95.9%**
- Adversarial rows: **166 (0.397%)**
- Unique JSON structure signatures: **31**
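A minimal sketch of how such a near-duplicate audit can be computed, assuming prompts are available as plain strings; the character n-gram vectorizer and chunk size are illustrative choices, not necessarily what the audit script used:
```python
# Illustrative near-duplicate audit; vectorizer settings and chunking are
# assumptions, not the repo's actual audit script.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_counts(train_prompts, test_prompts, thresholds=(0.90, 0.95, 0.98)):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    train_m = vec.fit_transform(train_prompts)
    test_m = vec.transform(test_prompts)
    # For each test prompt, keep its highest similarity to any train prompt,
    # computed in chunks to bound memory on the ~2.5k x ~39k comparison.
    max_sim = np.zeros(test_m.shape[0])
    for start in range(0, test_m.shape[0], 256):
        block = cosine_similarity(test_m[start:start + 256], train_m)
        max_sim[start:start + 256] = block.max(axis=1)
    return {t: int((max_sim >= t).sum()) for t in thresholds}
```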
Interpretation:
The dataset is technically clean and suitable for SFT, but the original split mainly measures in-distribution template compliance rather than serving as a strong OOD benchmark.
Decision:
Create a research-grade derivative dataset with OOD splits, provenance columns, token audit, validation flags, and training-only rare-class upsampling.
---
## 2026-04-30 — Research SOTA dataset created
Created:
- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
Splits:
| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |
Qwen3 token-length audit:
- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**
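The audit can be reproduced along these lines, assuming each row carries a chat-style `messages` column (the exact audit script may differ):
```python
# Hedged sketch of the Qwen3 token-length audit; assumes a `messages` column.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train_sota")

# Token count of the fully templated conversation, matching what SFT sees.
lengths = np.array([len(tok.apply_chat_template(row["messages"], tokenize=True)) for row in ds])
print(f"mean={lengths.mean():.1f} p50={np.percentile(lengths, 50):.0f} "
      f"p95={np.percentile(lengths, 95):.0f} p99={np.percentile(lengths, 99):.0f} "
      f"max={lengths.max()} fit_2048={(lengths <= 2048).mean():.2%}")
```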
`train_sota` balancing:
- non-create lifecycle rows: **5,166 (15.97%)**
- adversarial rows: **2,115 (6.54%)**
- synthetic multi-turn wrappers: **1,281**
Decision:
Use `train_sota` for the first Qwen3-8B QLoRA training run.
---
## 2026-04-30 / 2026-05-01 — Training/evaluation repo created
Created:
- https://huggingface.co/nraptisss/tmf921-intent-training
Default recipe:
- Base model: `Qwen/Qwen3-8B`
- Method: QLoRA SFT
- Quantization: 4-bit NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: 64
- LR: 2e-4
- Max length: 2048
- Loss: assistant-only SFT loss
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
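A hedged sketch of this recipe with TRL + PEFT + bitsandbytes follows. A recent TRL is assumed for `assistant_only_loss` and `max_length`; `lora_alpha` and other unstated values are assumptions, and the repo's run scripts remain the authoritative version:
```python
# Sketch of the default recipe; unstated hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NF4
    bnb_4bit_use_double_quant=True,     # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
peft_cfg = LoraConfig(
    r=64,
    lora_alpha=128,                     # assumption: alpha is not stated above
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
    assistant_only_loss=True,           # loss only on assistant tokens
    model_init_kwargs={"quantization_config": bnb},
)
ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train_sota")
trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=args,
    # Pass only `messages` so TRL's chat-format detection works
    # (see the runtime fix logged below).
    train_dataset=ds.select_columns(["messages"]),
    peft_config=peft_cfg,
)
trainer.train()
```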
The repo includes GPU preflight checks, nohup run/resume scripts, evaluation scripts, a normalized evaluator, stage-2 diagnostic tooling, packaging scripts, and a paper scaffold.
---
## 2026-05-01 — Runtime issues fixed
Fixed issues:
1. GPU uncertainty: added `check_gpu.py`, `install_rtx6000ada.sh`, and fail-fast CUDA checks.
2. TRL dataset detection: passed only `messages` to SFTTrainer so `assistant_only_loss=True` works.
3. Trackio invalid Space ID: sanitized Trackio config and added `DISABLE_TRACKIO=1`.
4. Deprecated `warmup_ratio`: replaced with `warmup_steps`.
Server GPU evidence:
```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```
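A minimal sketch of a fail-fast CUDA preflight in the spirit of `check_gpu.py` (the actual script may check more, e.g. driver or compute-capability details):
```python
# Fail fast if CUDA is unavailable, instead of silently training on CPU.
import sys
import torch

if not torch.cuda.is_available():
    sys.exit("FATAL: CUDA is not available; refusing to start a CPU-only run.")
print(f"torch={torch.__version__} cuda={torch.version.cuda} "
      f"device_count={torch.cuda.device_count()} gpu0={torch.cuda.get_device_name(0)}")
```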
---
## 2026-05-01 / 2026-05-02 — Stage-1 Qwen3-8B QLoRA training completed
Run directory:
```text
runs/qwen3-8b-qlora-20260501-083834
```
Training behavior:
- Initial loss: **1.212**
- Later loss: **~0.14–0.15**
- Mean token accuracy: **~0.945–0.953**
- Validation loss plateau: **~0.153**
Not observed:
- CUDA OOM
- NaNs
- divergence
- gradient explosion
Decision:
Evaluate the trained adapter across ID and OOD splits.
---
## 2026-05-02 / 2026-05-04 — Evaluation speed issue fixed
Initial 4-bit adapter evaluation was too slow:
```text
test_in_distribution: 1455 examples in ~25h
```
Fixes:
- batched generation,
- dynamic generation length,
- periodic save/resume,
- merged bf16 model evaluation.
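The merged-model and batched-generation fixes can be sketched as below; the adapter subdirectory name and batching details are assumptions, not the repo's exact evaluation script:
```python
# Merge the QLoRA adapter into a bf16 base model, then generate in batches
# with left padding so prompt stripping stays aligned.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)
merged = PeftModel.from_pretrained(
    base, "runs/qwen3-8b-qlora-20260501-083834/adapter"  # hypothetical adapter path
).merge_and_unload()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", padding_side="left")

def generate_batch(prompts, max_new_tokens=1344):  # covers the p99/max audit lengths
    inputs = tok(prompts, return_tensors="pt", padding=True).to(merged.device)
    out = merged.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip prompt tokens before decoding.
    return tok.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```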
---
## 2026-05-04 — Stage-1 raw and normalized evaluation
Raw metrics:
| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
Normalized metrics:
| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
Interpretation:
The model reliably emits valid JSON and correct structural schemas. Raw exact match underestimates performance because many fields are volatile/generated.
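To make the normalization concrete, here is an illustrative normalized field F1 that masks volatile keys before flat field comparison; the key list and flattening rules are hypothetical stand-ins for the repo's normalized evaluator:
```python
# Illustrative normalized field F1; VOLATILE_KEYS is a hypothetical mask list.
VOLATILE_KEYS = {"id", "href", "creationDate", "lastUpdate"}

def flatten(obj, prefix=""):
    """Flatten nested JSON into (dotted-path, value) pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

def normalized_field_f1(pred, ref):
    p = {(k, str(v)) for k, v in flatten(pred) if k.split(".")[-1] not in VOLATILE_KEYS}
    r = {(k, str(v)) for k, v in flatten(ref) if k.split(".")[-1] not in VOLATILE_KEYS}
    if not p or not r:
        return 0.0
    tp = len(p & r)
    prec, rec = tp / len(p), tp / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```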
Weak layers:
- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
Decision:
Test a stage-2 weak-layer continuation experiment.
---
## 2026-05-05 — Stage-2 weak-layer continuation run and evaluation
Stage-2 setup:
- initialized from stage-1 adapter,
- weak layers: `o1_nrm`, `a1_policy`, `tmf921_lifecycle_report`, `tmf921_lifecycle_monitor`, `tmf921_lifecycle_scale`,
- stage-2 rows: **13,829**,
- weak rows: **10,638**,
- replay rows: **3,191**,
- LR: **5e-5**,
- epochs: **1**.
Stage-2 training was stable. Adapter continuation was correctly configured:
```text
trainable params: 174,587,904
requires_grad={'default': True}
devices={'default': ['cuda']}
```
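The continuation setup amounts to reloading the stage-1 adapter as trainable before stage-2 SFT, roughly as below (the stage-1 adapter path is a hypothetical placeholder):
```python
# Continue training an existing LoRA adapter; is_trainable=True keeps the
# loaded adapter weights unfrozen instead of inference-only.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/adapter",  # hypothetical stage-1 adapter path
    is_trainable=True,
)
model.print_trainable_parameters()  # should report the ~174.6M trainable LoRA params
```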
Stage-2 evaluation comparison:
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
Decision:
Stage 2 is **diagnostic only** and is **not promoted**. Stage 1 remains the primary model.
Interpretation:
Weak-layer exposure alone did not solve O1/A1 value fidelity. The next scientific step is semantic evaluation and better canonical data generation, not another blind weak-layer fine-tune.
---
## 2026-05-06 — Zero-shot Qwen3-8B baseline completed
Goal:
Determine whether Qwen3-8B can perform the task without domain-specific fine-tuning.
Action:
Ran zero-shot `Qwen/Qwen3-8B` on 200 examples per split:
```bash
EVAL_BATCH_SIZE=4 BASELINE_MAX_SAMPLES=200 \
bash scripts/run_zero_shot_baseline.sh outputs/baselines/qwen3-8b-zero-shot
```
Zero-shot metrics:
| Split | Zero-shot JSON parse | Zero-shot norm field F1 | Zero-shot norm key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 0.335 | 0.0009 | 0.0169 |
| `test_template_ood` | 0.340 | 0.0014 | 0.0172 |
| `test_use_case_ood` | 0.325 | 0.0012 | 0.0198 |
| `test_sector_ood` | 0.345 | 0.0008 | 0.0171 |
| `test_adversarial` | 0.000 | 0.0000 | 0.0000 |
Comparison with fine-tuned stage 1:
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
Interpretation:
Zero-shot Qwen3-8B largely fails the task. Domain-specific QLoRA fine-tuning is essential.
---
## 2026-05-07 — Publication packaging and paper scaffold
Completed:
- finalized dataset card,
- finalized primary stage-1 model card,
- added `REPRODUCIBILITY.md`,
- added `scripts/reproduce_stage1_eval.sh`,
- added `scripts/run_zero_shot_baseline.sh`,
- added `scripts/package_results.py`,
- added `scripts/sample_failure_examples.py`,
- uploaded `results/` and `analysis/` artifacts,
- added `paper/outline.md`,
- added `paper/tables.md`.
Current publication-ready assets:
- dataset card,
- model card,
- results package,
- qualitative examples,
- reproducibility checklist,
- paper outline,
- draft tables,
- project journal.
---
## Current open research questions
1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
## Next recommended step
Write the first manuscript draft using:
- `paper/outline.md`,
- `paper/tables.md`,
- `PROJECT_JOURNAL.md`,
- `results/stage1_vs_stage2_comparison.md`,
- `results/baselines/zero_shot_vs_finetuned.md`,
- `analysis/stage1_examples/failure_examples.md`.
---
## 2026-05-07 — O1/A1 semantic evaluator results added
### Goal
Assess whether the weak-layer problem is genuinely value-level or whether flat normalized field F1 underestimates O1 NRM and A1 policy quality.
### Action
Implemented and ran a prototype semantic evaluator:
```bash
python scripts/evaluate_semantic_o1_a1.py \
--eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged
python scripts/evaluate_semantic_o1_a1.py \
--eval_dir runs/stage2-weak-20260505-080040/eval
```
The evaluator reads existing predictions and recovers metadata from the benchmark dataset by row id. It scores telecom-relevant fields and structures for:
- `o1_nrm`
- `a1_policy`
This evaluator is a prototype and does not claim official 3GPP/O-RAN compliance.
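The scoring idea can be illustrated as follows; the field groups and equal weighting are hypothetical, not the evaluator's actual rubric:
```python
# Illustrative layer-specific semantic scoring; core/KPI key lists and the
# 50/50 weighting are assumptions.
def semantic_scores(pred: dict, ref: dict, core_keys: list[str], kpi_keys: list[str]) -> dict:
    def hit_rate(keys):
        scored = [pred.get(k) == ref.get(k) for k in keys if k in ref]
        return sum(scored) / len(scored) if scored else 1.0
    core = hit_rate(core_keys)  # structural/identifier fields
    kpi = hit_rate(kpi_keys)    # telecom KPI/value fields
    return {"sem_core_score": core, "sem_kpi_score": kpi,
            "sem_overall_score": 0.5 * core + 0.5 * kpi}
```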
### Evidence / result
Global O1/A1 semantic comparison:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.6830 | 0.6893 | +0.0063 |
| `sem_core_score` | 0.8777 | 0.8883 | +0.0106 |
| `sem_kpi_score` | 0.5125 | 0.5148 | +0.0023 |
| `parse_json` | 1.0000 | 1.0000 | +0.0000 |
| `norm_field_f1` | 0.5462 | 0.5459 | -0.0003 |
A1 policy:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.8077 | 0.8148 | +0.0071 |
| `sem_core_score` | 0.8569 | 0.8714 | +0.0144 |
| `sem_kpi_score` | 0.7118 | 0.7112 | -0.0007 |
| `norm_field_f1` | 0.6776 | 0.6771 | -0.0005 |
O1 NRM:
| Metric | Stage 1 | Stage 2 | Delta |
|---|---:|---:|---:|
| `sem_overall_score` | 0.5366 | 0.5420 | +0.0053 |
| `sem_core_score` | 0.9022 | 0.9082 | +0.0060 |
| `sem_kpi_score` | 0.2784 | 0.2841 | +0.0057 |
| `norm_field_f1` | 0.3918 | 0.3918 | -0.0001 |
### Interpretation
The semantic evaluator confirms the previous conclusion with more nuance:
1. Stage 2 gives very small improvements to O1/A1 semantic scores.
2. The gains are mostly in core structural/identifier fields.
3. KPI/value fidelity remains weak, especially for O1 NRM.
4. The improvements are too small to offset stage-2 adversarial regression.
5. Stage 1 remains the primary model.
The most important new insight is that O1 NRM has strong core structural recognition but weak KPI/value assignment:
- O1 semantic core score: about **0.90**
- O1 semantic KPI score: about **0.28**
Thus, the main weakness is not JSON structure but low-level telecom value fidelity.
### Decision / next step
Use the semantic evaluator results in the paper as additional evidence that O1/A1 errors are value-fidelity problems. Do not run another blind weak-layer fine-tune. Future work should focus on:
- canonical scenario labels,
- O1/A1 semantic validators,
- standards-derived schema validation,
- Gen4 deterministic per-layer renderers.
Artifacts added:
- `results/semantic/o1_a1_stage1_vs_stage2.md`
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`