nraptisss/tmf921-intent-training

@@ -848,3 +848,134 @@ Stage 2 is successful if:
 - O1 NRM still weak, suggesting need for layer-specific semantic evaluator or improved data generation rather than more SFT.
 - Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.

+---
+## 2026-05-05 — Stage 2 evaluation completed and decision made
+### Goal
+Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.
+### Action
+After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:
+- `test_in_distribution`
+- `test_template_ood`
+- `test_use_case_ood`
+- `test_sector_ood`
+- `test_adversarial`
+The normalized evaluator was then run on the generated predictions:
+```bash
+python scripts/normalize_eval_metrics.py \
+  --eval_dir runs/stage2-weak-20260505-080040/eval
+```
+### Evidence / result
+Global normalized comparison, stage 1 -> stage 2:
+| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Field F1 delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Key F1 delta |
+|---|---:|---:|---:|---:|---:|---:|
+| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
+| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
+| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
+| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
+| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
+JSON parse comparison:
+| Split | Stage 1 parse | Stage 2 parse | Delta |
+|---|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
+| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
+| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
+| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
+| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |
+Weak-layer normalized field F1 comparison, stage 1 -> stage 2:
+| Split | Layer | Stage 1 | Stage 2 | Delta |
+|---|---|---:|---:|---:|
+| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
+| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
+| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
+| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
+| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
+| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
+| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
+| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
+| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
+| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
+| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
+| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
+| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
+| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
+| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
+| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
+| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
+| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
+| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
+### Interpretation
+Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.
+Key observations:
+1. Global normalized field F1 changed by at most 0.0012 (about 0.12 percentage points) on every non-adversarial split. This is effectively flat.
+2. Normalized key F1 regressed on every split: by 0.0009 to 0.0018 on the non-adversarial splits and by 0.0303 on `test_adversarial`.
+3. Adversarial performance regressed meaningfully:
+   - normalized field F1: **0.9697 -> 0.9596**
+   - normalized key F1: **1.0000 -> 0.9697**
+   - parse rate: **1.0000 -> 0.9697**
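+4. `o1_nrm` did not improve in any meaningful way. Per-split changes range from -0.0042 to +0.0029, which is most likely noise-level.
+5. `a1_policy` also did not improve meaningfully (per-split changes from -0.0050 to +0.0023).
+6. Lifecycle layers improved unevenly. `tmf921_lifecycle_report` and `tmf921_lifecycle_scale` gained on every split where they are reported (largest gains on use-case OOD: +0.0450 and +0.0418), while `tmf921_lifecycle_monitor` gained on use-case and sector OOD (+0.0312, +0.0385) but regressed in distribution (-0.0246). These gains are not consistent enough to justify replacing the stage-1 model.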
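+The ranges behind observations 4 to 6 can be re-derived from the Delta column of the weak-layer table. A minimal sketch follows; the dictionary layout and the `summarize` helper are illustrative and not part of the repository:
```python
from statistics import mean

# Per-split deltas (stage 2 minus stage 1) copied from the weak-layer table.
# Order: ID, template OOD, use-case OOD, sector OOD.
# The table has no template-OOD row for tmf921_lifecycle_monitor,
# so that layer only has three values.
WEAK_LAYER_DELTAS = {
    "o1_nrm": [-0.0021, 0.0017, -0.0042, 0.0029],
    "a1_policy": [-0.0050, -0.0004, -0.0023, 0.0023],
    "tmf921_lifecycle_report": [0.0222, 0.0106, 0.0450, 0.0067],
    "tmf921_lifecycle_monitor": [-0.0246, 0.0312, 0.0385],
    "tmf921_lifecycle_scale": [0.0108, 0.0197, 0.0418, 0.0158],
}


def summarize(deltas_by_layer):
    """Return min, max and mean delta for each layer."""
    return {
        layer: {
            "min": min(values),
            "max": max(values),
            "mean": round(mean(values), 4),
        }
        for layer, values in deltas_by_layer.items()
    }


summary = summarize(WEAK_LAYER_DELTAS)
for layer, stats in summary.items():
    print(f"{layer:28s} min={stats['min']:+.4f} max={stats['max']:+.4f} mean={stats['mean']:+.4f}")
```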
+The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but one or more of the following:
+- insufficient semantic supervision in the data,
+- inadequacy of flat field F1 for some low-level configs,
+- the need for layer-specific validators and value extractors,
+- the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.
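+To make the flat field F1 concern concrete, here is a minimal sketch that assumes the metric flattens each JSON object into (path, value) pairs and scores exact matches. The helpers `flatten` and `field_f1` and the example field names are hypothetical; they are not taken from `scripts/normalize_eval_metrics.py`.
```python
from typing import Any, Dict, Iterator, Tuple


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, leaf_value) pairs for a nested JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


def field_f1(pred: Dict[str, Any], gold: Dict[str, Any]) -> float:
    """Exact-match F1 over flattened (path, value) pairs (assumed metric shape)."""
    pred_pairs = set(flatten(pred))
    gold_pairs = set(flatten(gold))
    if not pred_pairs and not gold_pairs:
        return 1.0
    true_pos = len(pred_pairs & gold_pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_pairs)
    recall = true_pos / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical O1-style fragment; field names are illustrative only.
gold = {"cell": {"id": "cell-17", "prbRatio": 0.30}, "snssai": {"sst": 1}}
pred = {"cell": {"id": "cell-17", "prbRatio": 0.31}, "snssai": {"sst": 1}}
score = field_f1(pred, gold)
print(f"flat field F1 = {score:.4f}")
```
+Under this assumed metric, a PRB ratio that is off by 0.01 is scored exactly like a completely wrong value, which is one way a layer dominated by numeric fields could stay low regardless of further SFT.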
+### Decision
+Stage 2 should **not** replace the stage-1 model as the main model.
+The stage-1 adapter remains the current primary model because it has:
+- slightly better global normalized metrics,
+- better adversarial robustness,
+- no meaningful disadvantage on O1/A1 compared with stage 2.
+Stage 2 is retained only as a diagnostic experiment. Its main value is as evidence that weak-layer continuation alone is not sufficient.
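+The decision can also be expressed as a simple regression gate. The sketch below is an assumption about how such a check might look, not the procedure actually used: the `MAX_REGRESSION` threshold and the `regressions` helper are hypothetical, while the metric values are copied from the global comparison tables.
```python
# (normalized field F1, JSON parse rate) per split, copied from the tables.
STAGE1 = {
    "test_in_distribution": (0.7956, 1.0000),
    "test_template_ood": (0.7865, 1.0000),
    "test_use_case_ood": (0.7907, 0.9998),
    "test_sector_ood": (0.7697, 1.0000),
    "test_adversarial": (0.9697, 1.0000),
}
STAGE2 = {
    "test_in_distribution": (0.7952, 0.9993),
    "test_template_ood": (0.7855, 1.0000),
    "test_use_case_ood": (0.7895, 0.9995),
    "test_sector_ood": (0.7694, 1.0000),
    "test_adversarial": (0.9596, 0.9697),
}

# Assumed tolerance: a drop larger than this on any metric blocks promotion.
MAX_REGRESSION = 0.005


def regressions(baseline, candidate, max_regression):
    """Return {split: worst_drop} for splits that regress beyond the tolerance."""
    flagged = {}
    for split, base_metrics in baseline.items():
        drops = [round(b - c, 4) for b, c in zip(base_metrics, candidate[split])]
        worst = max(drops)
        if worst > max_regression:
            flagged[split] = worst
    return flagged


flagged = regressions(STAGE1, STAGE2, MAX_REGRESSION)
promote = not flagged
print("flagged splits:", flagged)
print("promote stage 2:", promote)
```
+With this assumed 0.005 tolerance, only `test_adversarial` is flagged (worst drop 0.0303, the parse rate), so the gate blocks promotion. Deltas recomputed from the rounded table values can differ from the printed Delta column in the last digit.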
+### Next step
+Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for weak layers:
+1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
+2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
+3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
+4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.
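+Step 1 could start from something like the following sketch, which scores only a hand-picked set of fields per layer and allows a numeric tolerance where exact matching is too strict. Everything here is hypothetical: the field paths (`cell.id`, `cell.prbRatio`, `snssai.sst`), the 5% relative tolerance, and the helper names are placeholders, not the real O1 NRM schema or the project's evaluator.
```python
import math
from typing import Any, Callable, Dict

Comparator = Callable[[Any, Any], bool]


def exact(pred_value: Any, gold_value: Any) -> bool:
    """Strict equality, for identifiers and enumerated values."""
    return pred_value == gold_value


def within(rel_tol: float) -> Comparator:
    """Build a comparator that accepts numeric values within a relative tolerance."""
    def compare(pred_value: Any, gold_value: Any) -> bool:
        try:
            return math.isclose(float(pred_value), float(gold_value), rel_tol=rel_tol)
        except (TypeError, ValueError):
            return False
    return compare


def get_path(obj: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if it is missing."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical per-layer field specification: path -> comparator.
O1_NRM_FIELDS: Dict[str, Comparator] = {
    "cell.id": exact,
    "cell.prbRatio": within(0.05),
    "snssai.sst": exact,
}


def semantic_score(pred: dict, gold: dict, fields: Dict[str, Comparator]) -> float:
    """Fraction of gold-present critical fields that the prediction gets right."""
    scored = matched = 0
    for path, compare in fields.items():
        gold_value = get_path(gold, path)
        if gold_value is None:
            continue  # field not applicable to this example
        scored += 1
        pred_value = get_path(pred, path)
        if pred_value is not None and compare(pred_value, gold_value):
            matched += 1
    return matched / scored if scored else 1.0


gold = {"cell": {"id": "cell-17", "prbRatio": 0.30}, "snssai": {"sst": 1}}
pred = {"cell": {"id": "cell-17", "prbRatio": 0.31}, "snssai": {"sst": 1}}
score = semantic_score(pred, gold, O1_NRM_FIELDS)
print(f"semantic score = {score:.4f}")
```
+In this example the predicted PRB ratio of 0.31 is accepted against a gold value of 0.30, whereas exact matching on flattened JSON values would count it as a miss. The same per-field comparator idea could carry the tolerance-based scoring suggested for the lifecycle report/monitor layers.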
+### Updated project status
+Primary model: **stage 1 Qwen3-8B QLoRA adapter**
+Stage 2 status: **diagnostic / not promoted**
+Current best headline metrics remain the stage-1 normalized results:
+| Split | JSON parse | Normalized field F1 | Normalized key F1 |
+|---|---:|---:|---:|
+| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
+| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
+| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
+| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
+| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |