# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 – Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox and audited comprehensively: schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (the similarity check is sketched after this list):
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**
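
For reproducibility, the near-duplicate check can be approximated with character n-gram Jaccard similarity along the following lines. This is a minimal sketch under assumed parameters (5-grams, max over all train prompts); the audit's exact implementation may differ.

```python
# Sketch of the char-ngram near-duplicate check (assumed parameters: n=5,
# Jaccard similarity, max over all train prompts).
def char_ngrams(text: str, n: int = 5) -> set:
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def max_train_similarity(test_prompt: str, train_prompts: list) -> float:
    # Score a test prompt by its closest train prompt.
    test_ng = char_ngrams(test_prompt)
    return max(jaccard(test_ng, char_ngrams(p)) for p in train_prompts)
```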

Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset (grouped-holdout sketch after this list) with:

- OOD splits,
- train/eval provenance columns,
- a token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.
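
The OOD splits rest on grouped holdouts: whole prompt-template families, use cases, and sectors leave the training pool, rather than random rows. A minimal sketch of the template case, assuming the `prompt_template_id` provenance column described in the next entry:

```python
# Sketch of a grouped OOD holdout: entire prompt-template families leave the
# training pool rather than random rows (held-out family names hypothetical).
def split_template_ood(rows, held_out_families):
    train_pool, test_template_ood = [], []
    for row in rows:
        if row["prompt_template_id"] in held_out_families:
            test_template_ood.append(row)
        else:
            train_pool.append(row)
    return train_pool, test_template_ood
```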

---

## 2026-04-30 – Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented splits:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit (token counting sketched after this list):

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**
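
The token audit that justifies the context-length choice can be reproduced roughly as follows, assuming the standard `transformers` chat-template API:

```python
# Sketch of the qwen3_chat_template_tokens audit column (assumed approach).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def qwen3_chat_template_tokens(messages):
    # Render the full system/user/assistant conversation through Qwen3's chat
    # template, then count the resulting tokens.
    text = tok.apply_chat_template(messages, tokenize=False)
    return len(tok(text).input_ids)

# fits_2048_qwen3 is then simply: qwen3_chat_template_tokens(msgs) <= 2048
```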

`train_sota` balancing (upsampling sketched below):

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**
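
The upsampling itself is plain row duplication with provenance marks (`is_augmented`, `augmentation_type`). A minimal sketch; the duplication factor and the row predicate are illustrative assumptions:

```python
# Sketch of marked rare-class upsampling (illustrative factor; real counts
# come from the published train_sota split).
from datasets import concatenate_datasets

def upsample(ds, predicate, factor):
    rare = ds.filter(predicate)
    # The base dataset already contains one copy of each rare row.
    return concatenate_datasets([ds] + [rare] * (factor - 1))

# e.g. boost non-create lifecycle rows (the predicate is an assumption about
# how lifecycle operations are encoded per row):
# train_sota = upsample(train_base, lambda r: r["lifecycle_op"] != "create", 3)
```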

### Interpretation

`max_length=2048` is justified for Qwen3-8B: every example fits. `train_sota` improves rare-class exposure. OOD splits allow scientifically meaningful generalization reporting.

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.

---

## 2026-04-30 / 2026-05-01 – Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on an RTX 6000 Ada 48/50GB GPU.

### Action

Created `nraptisss/tmf921-intent-training` with:

- a QLoRA SFT training script,
- an evaluation script,
- a merge script,
- an RTX 6000 Ada install script,
- a GPU preflight check,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe (mirrored as a TRL/PEFT config sketch after this list):

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quant
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`
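
Expressed in TRL/PEFT terms, the recipe corresponds roughly to the configuration below. This is a sketch rather than the repo's exact script; argument names follow recent TRL/PEFT releases:

```python
# Configuration mirror of the recipe above (sketch; not the repo's script).
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    max_length=2048,
    assistant_only_loss=True,
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    bf16=True,
    gradient_checkpointing=True,
)
```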

### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with the QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require a CUDA preflight, and ensure unique output directories to avoid overwriting results.

---

## 2026-05-01 – Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses the GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

Observed concern that training might not be using the GPU.

Fix:

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to the training/evaluation scripts (preflight sketched at the end of this subsection).

Evidence from server logs:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.
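
For reference, the fail-fast preflight amounts to something like the following sketch; `scripts/check_gpu.py` holds the actual version:

```python
# Sketch of the fail-fast CUDA preflight (see scripts/check_gpu.py for the
# actual implementation).
import os
import torch

def require_cuda():
    print(f"torch={torch.__version__} torch.version.cuda={torch.version.cuda} "
          f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available; refusing to train on CPU.")
    print(f"cuda device_count={torch.cuda.device_count()} "
          f"gpu0={torch.cuda.get_device_name(0)}")
```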

#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains a `messages` column plus convenience `prompt`/`completion` columns, so TRL inferred prompt-completion format instead of conversational format.

Fix:

The training script now passes only the `messages` column:

```python
# Drop the convenience columns so TRL detects the conversational format.
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

An invalid `TRACKIO_SPACE_ID=nraptisss/` (missing the Space name after the namespace).

Fix:

Added validation/sanitization for Trackio Space IDs (sketched below) and support for:

```bash
DISABLE_TRACKIO=1
```
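
The sanitization reduces to checking the `namespace/name` shape before use; a minimal sketch (the repo's helper may differ):

```python
# Sketch of the Trackio Space ID validation (assumed logic).
import os
import re

def resolve_trackio_space():
    if os.environ.get("DISABLE_TRACKIO") == "1":
        return None
    space = os.environ.get("TRACKIO_SPACE_ID", "").strip()
    # A valid Space ID is "namespace/name"; "nraptisss/" fails this check.
    if not re.fullmatch(r"[\w.-]+/[\w.-]+", space):
        return None  # fall back to local-only logging instead of crashing
    return space
```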

#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed the config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with the fixed scripts and Trackio disabled to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 – Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under `nohup` with a unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

Not observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Training loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.

### Decision / next step

Evaluate the trained adapter across the ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 – Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

The initial evaluator generated one example at a time through the 4-bit adapter with a large fixed `max_new_tokens`, which made evaluation very slow:

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched the evaluator to support (generation sketched after this list):

- batched generation,
- dynamic generation length based on target length plus a buffer,
- periodic save/resume,
- partial prediction reuse.
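
The batched, dynamically budgeted generation looks roughly like the sketch below; the 64-token buffer and greedy decoding are assumptions:

```python
# Sketch of batched generation with a dynamic token budget (buffer size and
# greedy decoding are assumptions).
import torch

def generate_batch(model, tok, prompts, target_token_lens):
    # Budget near the longest reference completion in the batch instead of a
    # fixed large max_new_tokens.
    max_new = max(target_token_lens) + 64
    tok.padding_side = "left"  # required for batched decoder-only generation
    inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new, do_sample=False)
    completions = out[:, inputs["input_ids"].shape[1]:]
    return tok.batch_decode(completions, skip_special_tokens=True)
```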

Merging the adapter into the bf16 base model was also recommended for faster inference.
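
The merge itself is the standard PEFT flow; a sketch with illustrative paths:

```python
# Sketch of merging the adapter into a bf16 base model (illustrative paths).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "runs/qwen3-8b-qlora-20260501-083834").merge_and_unload()
merged.save_pretrained("runs/qwen3-8b-qlora-20260501-083834-merged")
```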

### Decision / next step

Use merged-model evaluation and normalized metrics.

---

## 2026-05-04 – Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact match is low for the primary config layers, but raw exact match is likely too strict a metric here: many fields are volatile or generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 – Normalized evaluator implemented and run

### Goal

Re-score the existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes (sketched after this list):

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.
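
A minimal sketch of the idea, with an assumed denylist (the authoritative list lives in `scripts/normalize_eval_metrics.py`):

```python
# Sketch of volatile-field masking and normalized field F1 (assumed key list;
# see scripts/normalize_eval_metrics.py for the real implementation).
VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation"}
TIMESTAMP_HINTS = ("date", "time", "timestamp")

def normalize(obj):
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()
                if k not in VOLATILE_KEYS
                and not any(h in k.lower() for h in TIMESTAMP_HINTS)}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj

def flatten(obj, prefix=""):
    # Flatten JSON to (path, value) pairs so F1 can be computed set-wise.
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield (prefix, repr(obj))

def normalized_field_f1(pred, ref):
    p, r = set(flatten(normalize(pred))), set(flatten(normalize(ref)))
    if not p or not r:
        return float(p == r)
    prec, rec = len(p & r) / len(p), len(p & r) / len(r)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```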

### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across the ID and OOD splits. Field-level value fidelity is moderate to strong overall, but weak for low-level O1 NRM values and for monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower the LR, and include replay from strong layers to prevent forgetting (mixture sketched below).
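
Sketched under the assumptions that the weak-layer names match `json_root_family` values and that a 2:1 weak-to-replay ratio is acceptable (both are open choices):

```python
# Sketch of the planned stage-2 mixture (replay ratio illustrative; assumes
# weak-layer names match json_root_family values).
from datasets import concatenate_datasets, load_dataset

WEAK = {"o1_nrm", "a1_policy", "tmf921_lifecycle_report", "tmf921_lifecycle_monitor"}

train_sota = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train_sota")
weak = train_sota.filter(lambda r: r["json_root_family"] in WEAK)
strong = train_sota.filter(lambda r: r["json_root_family"] not in WEAK)

# Replay: keep roughly one strong-layer example for every two weak-layer ones.
replay = strong.shuffle(seed=42).select(range(min(len(weak) // 2, len(strong))))
stage2_train = concatenate_datasets([weak, replay]).shuffle(seed=42)
```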

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization,
- excellent adversarial rejection,
- normalized structural key F1 around 98% across the non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across the ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. The current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from the current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus a replay buffer,
- a lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- the same max length of 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage, or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD – Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```