# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 — Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (the similarity measure is sketched after this list):
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**

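The audit's exact n-gram settings are not recorded in this journal; a minimal sketch of a character n-gram Jaccard similarity of the kind used for the near-duplicate check above, assuming 5-grams over lowercased, whitespace-collapsed prompts:

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of character n-grams for a normalized prompt string."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}


def ngram_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity over character n-gram sets, in [0.0, 1.0]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def max_similarity_to_train(test_prompt: str, train_prompts: list[str]) -> float:
    """Highest similarity of one test prompt against all train prompts."""
    return max(ngram_similarity(test_prompt, p) for p in train_prompts)
```

Counting test prompts whose `max_similarity_to_train` exceeds 0.90, 0.95, and 0.98 reproduces the style of thresholds reported above.
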
Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset with:

- OOD splits,
- train/eval provenance columns,
- a token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.

---

## 2026-04-30 — Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented splits:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit (a token-counting sketch follows this list):

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**

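The audit behind `qwen3_chat_template_tokens` and `fits_2048_qwen3` lives in the dataset build scripts; a minimal sketch of how such counts can be obtained with the Qwen3 tokenizer (the example messages are illustrative, not dataset rows):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")


def chat_template_token_count(messages: list[dict]) -> int:
    """Token count of a conversation rendered with the Qwen3 chat template."""
    ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,  # count the full system/user/assistant exchange
    )
    return len(ids)


# Illustrative row: check that it fits the 2048-token training window.
example = [
    {"role": "system", "content": "Translate TMF921 intents into configurations."},
    {"role": "user", "content": "Create an eMBB slice with 50 Mbps downlink."},
    {"role": "assistant", "content": "{\"intent\": {}}"},
]
print(chat_template_token_count(example), chat_template_token_count(example) <= 2048)
```
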
`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

### Interpretation

`max_length=2048` is justified for Qwen3-8B. `train_sota` improves rare-class exposure. OOD splits allow scientifically meaningful generalization reporting.

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on an RTX 6000 Ada 48/50GB GPU.

### Action

Created `nraptisss/tmf921-intent-training` with:

- QLoRA SFT training script,
- evaluation script,
- merge script,
- RTX 6000 Ada install script,
- GPU preflight,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe (a training-setup sketch follows this list):

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quant
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`

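The actual training script lives in the repo; a minimal sketch of a QLoRA SFT setup matching the recipe above, assuming recent `transformers`/`peft`/`trl` releases (argument names such as `max_length` and `assistant_only_loss` differ between TRL versions):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# QLoRA: 4-bit NF4 base weights with double quantization, bf16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)

peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

data = load_dataset("nraptisss/TMF921-intent-to-config-research-sota")
train = data["train_sota"].select_columns(["messages"])  # conversational format only
val = data["validation"].select_columns(["messages"])

args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_length=2048,
    assistant_only_loss=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model, args=args, peft_config=peft_config,
    train_dataset=train, eval_dataset=val,
)
trainer.train()
```
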
### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with the QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require a CUDA preflight, and ensure unique output directories to avoid overwriting results.

---

## 2026-05-01 — Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses the GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

There was concern that training might not be using the GPU.

Fix (a preflight sketch follows this list):

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to the training/evaluation scripts.

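The repo's `scripts/check_gpu.py` is not reproduced here; a minimal fail-fast preflight of the kind described above could look like:

```python
import sys

import torch


def preflight() -> None:
    """Abort before training if CUDA is unavailable or no GPU is visible."""
    print(f"torch={torch.__version__} torch.version.cuda={torch.version.cuda}")
    if not torch.cuda.is_available():
        sys.exit("CUDA is not available; refusing to start a CPU-only run.")
    count = torch.cuda.device_count()
    if count == 0:
        sys.exit("No CUDA devices visible (check CUDA_VISIBLE_DEVICES).")
    for i in range(count):
        print(f"cuda device {i}: {torch.cuda.get_device_name(i)}")


if __name__ == "__main__":
    preflight()
```
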
Evidence from server logs:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.

#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains `messages` plus convenience `prompt`/`completion` columns. TRL inferred prompt-completion format instead of conversational format.

Fix:

The training script now passes only the `messages` column:

```python
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

Invalid `TRACKIO_SPACE_ID=nraptisss/`.

Fix:

Added validation/sanitization for Trackio Space IDs and support for:

```bash
DISABLE_TRACKIO=1
```

#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed the config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with the fixed scripts and Trackio disabled to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 — Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under nohup with a unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

Not observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Training loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.

### Decision / next step

Evaluate the trained adapter across ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

The initial evaluator used single-example 4-bit adapter generation with a large `max_new_tokens`, which made evaluation very slow:

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched the evaluator to support:

- batched generation,
- dynamic generation length based on target length + buffer,
- periodic save/resume,
- partial prediction reuse.

Also recommended merging the adapter into the base bf16 model for faster inference, as sketched below.

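The repo's `scripts/merge_adapter.py` is used for this later in the journal; a minimal sketch of merging a LoRA adapter into the bf16 base model with PEFT (paths are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3-8B"
adapter_dir = "runs/qwen3-8b-qlora-20260501-083834/outputs/adapter"
out_dir = "runs/qwen3-8b-qlora-20260501-083834/outputs/merged"

# Load the base model in bf16 (not 4-bit) so the merged weights stay full precision.
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```
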
### Decision / next step

Use merged-model evaluation and normalized metrics.

---

## 2026-05-04 — Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact match is low for the primary config layers, but that metric is likely too strict because many fields are volatile/generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 — Normalized evaluator implemented and run

### Goal

Re-score existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes (a scoring sketch follows this list):

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.

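The full logic is in `scripts/normalize_eval_metrics.py`; a minimal sketch of the general idea, flattening prediction and reference JSON into path/value pairs, dropping volatile keys, and scoring field F1 (the volatile-key patterns below are illustrative, not the script's actual rules):

```python
import re

# Illustrative volatile-key patterns; the real evaluator masks more categories.
VOLATILE = re.compile(r"(^|\.)(id|href|name|description|@schemaLocation|.*[Tt]imestamp)$")


def flatten(obj, prefix=""):
    """Yield (json_path, value) pairs for every leaf of a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield prefix, obj


def normalized_fields(obj) -> set:
    """Leaf (path, value) pairs with volatile paths removed."""
    return {(p, str(v)) for p, v in flatten(obj) if not VOLATILE.search(p)}


def normalized_field_f1(pred: dict, ref: dict) -> float:
    """F1 over normalized (path, value) pairs; key F1 would compare paths only."""
    p, r = normalized_fields(pred), normalized_fields(ref)
    if not p or not r:
        return float(p == r)
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```
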
### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact-match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across ID and OOD splits. Field-level value fidelity is moderate-to-strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower LR, and include replay from strong layers to prevent forgetting.

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization,
- excellent adversarial rejection,
- normalized structural key F1 around 98% across non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. Current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from the current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus a replay buffer,
- lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- same max length 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage, or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD — Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```

---

## 2026-05-04 — Stage 2 weak-layer continuation plan implemented

### Goal

Improve the weak target layers identified by the normalized evaluation without degrading strong layers.

Weak layers from the normalized evaluation:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
- `tmf921_lifecycle_scale`: mixed, included because lifecycle scaling still had noticeable errors

### Action

Added stage-2 tooling:

- `scripts/build_weak_layer_dataset.py`
- `scripts/train_continue_adapter.py`
- `configs/stage2_weak_layer_qwen3_8b.yaml`
- `scripts/nohup_stage2_weak.sh`

The weak-layer dataset builder creates a local parquet training set with (see the sketch after this list):

1. all weak-layer rows from `train_sota`,
2. duplicated rare weak layers up to a minimum count,
3. a replay buffer from non-weak layers to reduce forgetting.

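A minimal sketch of that builder logic using `datasets` (the real `scripts/build_weak_layer_dataset.py` may differ; the `target_layer` column name is an assumption based on the composition tables recorded below):

```python
import random

from datasets import Dataset, concatenate_datasets, load_dataset

WEAK = {"o1_nrm", "a1_policy", "tmf921_lifecycle_report",
        "tmf921_lifecycle_monitor", "tmf921_lifecycle_scale"}


def build_stage2_dataset(min_per_layer: int = 1500, replay_ratio: float = 0.3,
                         seed: int = 0) -> Dataset:
    """Weak-layer rows, rare-layer duplication, plus a replay buffer of strong layers."""
    rng = random.Random(seed)
    train = load_dataset("nraptisss/TMF921-intent-to-config-research-sota",
                         split="train_sota")
    weak = train.filter(lambda r: r["target_layer"] in WEAK)       # assumed column
    strong = train.filter(lambda r: r["target_layer"] not in WEAK)

    # Duplicate rare weak layers up to min_per_layer rows each.
    parts = []
    for layer in WEAK:
        rows = weak.filter(lambda r, layer=layer: r["target_layer"] == layer)
        idx = list(range(len(rows)))
        while len(rows) > 0 and len(idx) < min_per_layer:
            idx.append(rng.randrange(len(rows)))
        parts.append(rows.select(idx))
    weak_upsampled = concatenate_datasets(parts)

    # Replay buffer sized relative to the weak portion to limit forgetting.
    n_replay = min(int(replay_ratio * len(weak_upsampled)), len(strong))
    replay = strong.shuffle(seed=seed).select(range(n_replay))

    return concatenate_datasets([weak_upsampled, replay]).shuffle(seed=seed)
```

With `min_per_layer=1500` and `replay_ratio=0.3`, this kind of logic reproduces the 10,638 weak rows and 3,191 replay rows reported for the run below.
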
The continuation trainer loads (see the loading sketch after this list):

1. the Qwen3-8B base model in 4-bit NF4,
2. the existing LoRA adapter with `is_trainable=True`,
3. the local weak-layer replay dataset,
4. TRL `SFTTrainer` without a new `peft_config`, per PEFT/TRL continuation best practices.

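A minimal sketch of steps 1 and 2, loading the quantized base model and resuming training of the existing adapter (the trainer wiring around it follows the stage-1 recipe):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)

# Load the stage-1 adapter so its LoRA weights keep training; no new peft_config is
# passed to SFTTrainer, so no fresh adapter is created on top of this one.
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/outputs/adapter",
    is_trainable=True,
)
model.print_trainable_parameters()  # should report only LoRA parameters as trainable
```
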
Stage-2 default hyperparameters:

```yaml
learning_rate: 5e-5
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
```

### Interpretation

A lower learning rate and replay buffer should improve weak-layer value fidelity while reducing catastrophic forgetting on strong layers. This is a targeted continuation, not a replacement for Gen4 data generation or official schema validation.

### Decision / next step

Run stage-2 from the completed stage-1 adapter:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

After training, evaluate with the same raw + normalized OOD protocol and compare against stage-1 metrics.

---

## 2026-05-05 — Stage 2 weak-layer continuation run started

### Goal

Run the stage-2 weak-layer continuation experiment implemented on 2026-05-04.

The intended scientific question is:

> Can a short, low-learning-rate continuation on weak target layers improve low-performing layer-specific value fidelity while preserving the strong global JSON validity, key structure, and adversarial behavior from stage 1?

### Action

Started stage 2 with:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

Generated run:

```text
runs/stage2-weak-20260505-080040
```

Source adapter:

```text
runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
```

### Stage-2 dataset composition

The weak-layer dataset builder produced:

```json
{
  "rows_train_stage2": 13829,
  "rows_validation": 1547,
  "weak_rows_total_after_duplication": 10638,
  "replay_rows": 3191,
  "rare_min_per_layer": 1500,
  "replay_ratio": 0.3
}
```

Layer counts before/after rare-layer duplication:

| Target layer | Before | After |
|---|---:|---:|
| `o1_nrm` | 2,672 | 2,672 |
| `a1_policy` | 3,466 | 3,466 |
| `tmf921_lifecycle_report` | 596 | 1,500 |
| `tmf921_lifecycle_monitor` | 726 | 1,500 |
| `tmf921_lifecycle_scale` | 576 | 1,500 |

Replay buffer size:

- replay rows from non-weak layers: **3,191**
- purpose: reduce catastrophic forgetting on strong layers such as `tmf921`, `camara`, `intent_3gpp`, `etsi_zsm`, and adversarial rejection.

Full target-layer composition in the stage-2 train set:

| Target layer | Rows |
|---|---:|
| `a1_policy` | 3,466 |
| `o1_nrm` | 2,672 |
| `tmf921_lifecycle_monitor` | 1,500 |
| `tmf921_lifecycle_report` | 1,500 |
| `tmf921_lifecycle_scale` | 1,500 |
| `tmf921` replay | 902 |
| `intent_3gpp` replay | 630 |
| `camara` replay | 618 |
| `etsi_zsm` replay | 335 |
| adversarial replay and other lifecycle replay | remaining rows |

### Training configuration

Resolved stage-2 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
adapter_path: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
dataset_dir: runs/stage2-weak-20260505-080040/weak_layer_data
output_dir: runs/stage2-weak-20260505-080040/outputs/adapter
learning_rate: 5.0e-05
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
```

### Evidence that adapter continuation was configured correctly

The server log confirmed:

```text
Base model: Qwen/Qwen3-8B
Adapter: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
trainable params: 174,587,904 || all params: 8,365,323,264 || trainable%: 2.0870
TunerModelStatus(... active_adapters=['default'], requires_grad={'default': True}, devices={'default': ['cuda']})
```

Interpretation:

- The existing adapter was loaded.
- Adapter weights are trainable.
- Training is on CUDA.
- The base model is not being fully fine-tuned; only the LoRA adapter parameters are updated.

### Early training evidence

Stage-2 training began normally after tokenization:

```text
Tokenizing train dataset: 13,829 / 13,829
Tokenizing eval dataset: 1,547 / 1,547
```

Representative early logs:

```text
loss: 0.1313, grad_norm: 0.0199, lr: 5e-05, mean_token_accuracy: 0.9572, epoch: 0.0012
loss: 0.1686, grad_norm: 0.0317, lr: 5e-05, mean_token_accuracy: 0.9435, epoch: 0.0116
loss: 0.1541, grad_norm: 0.0166, lr: 5e-05, mean_token_accuracy: 0.9463, epoch: 0.1157
```

Validation during stage 2:

```text
eval_loss: 0.1581 at epoch 0.1157
eval_loss: 0.1582 at epoch 0.2314
eval_loss: 0.1584 at epoch 0.3471
eval_loss: 0.1585 at epoch 0.4628
```

At approximately 50% completion:

```text
epoch: 0.4975 / 1.0
loss: 0.1366-0.1428 range near midpoint
grad_norm: generally <0.14
mean_token_accuracy: about 0.95
```

### Interpretation

The stage-2 run is healthy:

- no CUDA OOM,
- no NaN/Inf,
- no gradient explosion,
- the GPU is active,
- adapter continuation is correctly configured.

Validation loss is slightly worse than the stage-1 plateau (~0.153), but this is expected because stage 2 intentionally shifts the training distribution toward harder weak layers. The decisive evaluation is not broad validation loss alone; it is the post-stage-2 OOD normalized weak-layer comparison.

### Decision / next step

Let stage 2 finish. After completion:

1. merge the stage-2 adapter,
2. run OOD evaluation,
3. run the normalized evaluator,
4. compare against the stage-1 baselines.

Commands planned after stage 2:

```bash
RUN_DIR="runs/stage2-weak-20260505-080040"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"

EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"

python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval"
```

### Success criteria

Stage 2 is successful if:

1. weak-layer normalized field F1 improves:
   - `o1_nrm` above stage-1 ~0.39-0.40,
   - `a1_policy` above stage-1 ~0.67-0.68,
   - `tmf921_lifecycle_report` above stage-1 ~0.15-0.18,
   - `tmf921_lifecycle_monitor` above stage-1 ~0.39-0.52;
2. global normalized field F1 does not regress substantially:
   - stage-1 ID: 0.7956,
   - stage-1 template OOD: 0.7865,
   - stage-1 use-case OOD: 0.7907,
   - stage-1 sector OOD: 0.7697;
3. JSON parse remains near 100%;
4. adversarial normalized exact remains close to 0.9697.

### Failure modes to watch

- Global regression from weak-layer overfitting.
- Adversarial degradation from insufficient replay.
- O1 NRM still weak, suggesting the need for a layer-specific semantic evaluator or improved data generation rather than more SFT.
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.

---

## 2026-05-05 — Stage 2 evaluation completed and decision made

### Goal

Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.

### Action

After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:

- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

The normalized evaluator was then run on the generated predictions:

```bash
python scripts/normalize_eval_metrics.py \
    --eval_dir runs/stage2-weak-20260505-080040/eval
```

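The stage-1 vs. stage-2 deltas in the tables below were assembled from the two runs' metric outputs; a small helper of the following kind could automate that comparison (the per-split JSON file layout and the `norm_field_f1` key are assumptions, not the repo's documented output format):

```python
import json
from pathlib import Path


def load_metrics(eval_dir: str) -> dict:
    """Load per-split metrics, assuming one <split>.json file per split."""
    return {p.stem: json.loads(p.read_text()) for p in Path(eval_dir).glob("*.json")}


def compare(stage1_dir: str, stage2_dir: str, key: str = "norm_field_f1") -> None:
    """Print stage-1 value, stage-2 value, and delta for one metric on each split."""
    s1, s2 = load_metrics(stage1_dir), load_metrics(stage2_dir)
    for split in sorted(s1.keys() & s2.keys()):
        a, b = s1[split][key], s2[split][key]
        print(f"{split:24s} {a:.4f} -> {b:.4f}  delta {b - a:+.4f}")


# Hypothetical usage with the two run directories from this journal:
# compare("runs/qwen3-8b-qlora-20260501-083834/eval",
#         "runs/stage2-weak-20260505-080040/eval")
```
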
### Evidence / result

Global normalized comparison, stage 1 -> stage 2:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

JSON parse comparison:

| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |

Weak-layer normalized field F1 comparison, stage 1 -> stage 2:

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |

### Interpretation

Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.

Key observations:

1. Global normalized field F1 changed by at most 0.12 percentage points on the non-adversarial splits. This is effectively flat.
2. Normalized key F1 regressed slightly across all splits.
3. Adversarial performance regressed meaningfully:
   - normalized field F1: **0.9697 -> 0.9596**
   - normalized key F1: **1.0000 -> 0.9697**
   - parse rate: **1.0000 -> 0.9697**
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
5. `a1_policy` also did not improve meaningfully.
6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.

The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but one or more of:

- insufficient semantic supervision in the data,
- inadequacy of flat field F1 for some low-level configs,
- the need for layer-specific validators and value extractors,
- the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.

### Decision

Stage 2 should **not** replace the stage-1 model as the main model.

The stage-1 adapter remains the current primary model because it has:

- slightly better global normalized metrics,
- better adversarial robustness,
- no meaningful disadvantage on O1/A1 compared with stage 2.

Stage 2 is retained as a diagnostic experiment; its main value is as evidence that weak-layer continuation alone is not sufficient.

### Next step

Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for the weak layers:

1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.

### Updated project status

Primary model: **stage 1 Qwen3-8B QLoRA adapter**

Stage 2 status: **diagnostic / not promoted**

Current best headline metrics remain the stage-1 normalized results:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |