Commit 3241031 (verified), committed by nraptisss · 1 parent: eccc07b

Record stage2 evaluation results and decision not to promote

Files changed (1)
  1. PROJECT_JOURNAL.md +131 -0
@@ -848,3 +848,134 @@ Stage 2 is successful if:
  - O1 NRM still weak, suggesting need for layer-specific semantic evaluator or improved data generation rather than more SFT.
  - Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.
 
+
+ ---
+
+ ## 2026-05-05 — Stage 2 evaluation completed and decision made
+
+ ### Goal
+
+ Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.
+
+ ### Action
+
+ After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:
+
+ - `test_in_distribution`
+ - `test_template_ood`
+ - `test_use_case_ood`
+ - `test_sector_ood`
+ - `test_adversarial`
+
+ The normalized evaluator was then run on the generated predictions:
+
+ ```bash
+ python scripts/normalize_eval_metrics.py \
+   --eval_dir runs/stage2-weak-20260505-080040/eval
+ ```
+
+ ### Evidence / result
+
+ Global normalized comparison, stage 1 -> stage 2:
+
+ | Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
+ |---|---:|---:|---:|---:|---:|---:|
+ | `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
+ | `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
+ | `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
+ | `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
+ | `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
+
+ JSON parse comparison:
+
+ | Split | Stage 1 parse | Stage 2 parse | Delta |
+ |---|---:|---:|---:|
+ | `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
+ | `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
+ | `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
+ | `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
+ | `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |
+
+ Weak-layer normalized field F1 comparison, stage 1 -> stage 2:
+
+ | Split | Layer | Stage 1 | Stage 2 | Delta |
+ |---|---|---:|---:|---:|
+ | ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
+ | ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
+ | ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
+ | ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
+ | ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
+ | Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
+ | Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
+ | Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
+ | Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
+ | Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
+ | Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
+ | Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
+ | Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
+ | Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
+ | Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
+ | Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
+ | Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
+ | Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
+ | Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
+
+ ### Interpretation
+
+ Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.
+
+ Key observations:
+
+ 1. Global normalized field F1 moved by at most about 0.12 percentage points (0.0002 to 0.0012 absolute) on the non-adversarial splits, which is effectively flat.
+ 2. Normalized key F1 regressed on every split: by 0.0009 to 0.0018 on the non-adversarial splits and by 0.0303 on the adversarial split.
+ 3. Adversarial performance regressed meaningfully:
+    - normalized field F1: **0.9697 -> 0.9596**
+    - normalized key F1: **1.0000 -> 0.9697**
+    - parse rate: **1.0000 -> 0.9697**
+ 4. `o1_nrm` did not improve in any meaningful way. Changes range from about -0.004 to +0.003, which is noise-level.
+ 5. `a1_policy` also did not improve meaningfully (changes range from -0.0050 to +0.0023).
+ 6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.
+
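The "effectively flat" reading in observation 1 can be checked directly from the table. The sketch below is illustrative only: it hard-codes the four-decimal values from the global comparison table, so its deltas may differ in the last digit from the table's Delta column (which appears to be computed from unrounded scores), and the variable names are not taken from the repository.

```python
# Rounded normalized field F1 per split, copied from the global table:
# (stage 1, stage 2).
field_f1 = {
    "test_in_distribution": (0.7956, 0.7952),
    "test_template_ood": (0.7865, 0.7855),
    "test_use_case_ood": (0.7907, 0.7895),
    "test_sector_ood": (0.7697, 0.7694),
    "test_adversarial": (0.9697, 0.9596),
}

# Stage 2 minus stage 1, expressed in percentage points.
delta_pp = {
    split: round((s2 - s1) * 100, 2) for split, (s1, s2) in field_f1.items()
}

# Largest drop among the non-adversarial splits.
non_adversarial = {k: v for k, v in delta_pp.items() if k != "test_adversarial"}
worst_non_adversarial = min(non_adversarial.values())

for split, delta in delta_pp.items():
    print(f"{split:24s} {delta:+.2f} pp")
print(f"largest non-adversarial drop: {worst_non_adversarial:+.2f} pp")
```

The adversarial split stands apart from the others by roughly an order of magnitude, which is why it is treated separately in observation 3.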
939
+ The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but either:
940
+
941
+ - insufficient semantic supervision in the data,
942
+ - inadequacy of flat field-F1 for some low-level configs,
943
+ - need for layer-specific validators and value extractors,
944
+ - or the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.
945
+
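For context on the flat field-F1 limitation listed above, here is a minimal sketch of how such a metric is commonly computed: nested JSON is flattened to (path, value) pairs and F1 is taken over exact pair matches. This is an assumption about the general technique, not the logic of `scripts/normalize_eval_metrics.py`; the function names and the sample payload, including its field names, are hypothetical.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> set[tuple[str, str]]:
    """Flatten nested dicts/lists into a set of (path, value-as-string) pairs."""
    pairs: set[tuple[str, str]] = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs |= flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            pairs |= flatten(value, f"{prefix}[{i}]")
    else:
        pairs.add((prefix, str(obj)))
    return pairs


def flat_field_f1(pred: Any, gold: Any) -> float:
    """F1 over exactly matching (path, value) pairs."""
    p, g = flatten(pred), flatten(gold)
    if not p and not g:
        return 1.0
    true_positives = len(p & g)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(p)
    recall = true_positives / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical payloads: the prediction has a near-miss ratio and a
# reordered cell list, but the same slice identifier.
gold = {"sliceProfile": {"sNSSAI": "01-000001", "prbRatio": 0.30, "cells": ["c1", "c2"]}}
pred = {"sliceProfile": {"sNSSAI": "01-000001", "prbRatio": 0.31, "cells": ["c2", "c1"]}}

score = flat_field_f1(pred, gold)
print(round(score, 4))
```

In this toy case only one of four flattened pairs matches exactly, so a near-miss ratio and a reordered list are both scored as full misses. That is the kind of brittleness a flat metric can show on low-level configs.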
+ ### Decision
+
+ Stage 2 should **not** replace the stage-1 model as the main model.
+
+ The stage-1 adapter remains the current primary model because it has:
+
+ - slightly better global normalized metrics,
+ - better adversarial robustness,
+ - no meaningful disadvantage on O1/A1 compared with stage 2.
+
+ Stage 2 is retained as a diagnostic experiment and may be useful only as evidence that weak-layer continuation alone is not sufficient.
+
+ ### Next step
+
+ Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for weak layers:
+
+ 1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
+ 2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
+ 3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
+ 4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.
+
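Step 1 could start from something like the sketch below: extract a small set of telecom-relevant values from a prediction and score each with a rule suited to its type (exact match for identifiers, relative tolerance for ratios, order-insensitive match for cell lists). Everything here is hypothetical: the field names, the extractor, and the 5% tolerance are placeholders, not the actual `o1_nrm` schema or an existing evaluator in the repository.

```python
import math
from typing import Any, Callable

# Hypothetical per-field comparison rules for an o1_nrm-like payload.
RULES: dict[str, Callable[[Any, Any], bool]] = {
    "snssai": lambda p, g: str(p).lower() == str(g).lower(),
    "prb_ratio": lambda p, g: math.isclose(float(p), float(g), rel_tol=0.05),
    "cell_ids": lambda p, g: sorted(p) == sorted(g),
}


def extract_fields(payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical extractor: map a nested config to canonical field names."""
    profile = payload.get("sliceProfile", {})
    return {
        "snssai": profile.get("sNSSAI"),
        "prb_ratio": profile.get("prbRatio"),
        "cell_ids": profile.get("cells"),
    }


def semantic_score(pred: dict[str, Any], gold: dict[str, Any]) -> dict[str, bool]:
    """Return a per-field pass/fail verdict instead of one flat score."""
    p, g = extract_fields(pred), extract_fields(gold)
    results: dict[str, bool] = {}
    for name, rule in RULES.items():
        if p[name] is None or g[name] is None:
            results[name] = False  # a missing value counts as a miss in this sketch
            continue
        try:
            results[name] = bool(rule(p[name], g[name]))
        except (TypeError, ValueError):
            results[name] = False  # malformed value, e.g. a non-numeric ratio
    return results


gold = {"sliceProfile": {"sNSSAI": "01-000001", "prbRatio": 0.30, "cells": ["c1", "c2"]}}
near = {"sliceProfile": {"sNSSAI": "01-000001", "prbRatio": 0.31, "cells": ["c2", "c1"]}}
wrong = {"sliceProfile": {"sNSSAI": "01-000002", "prbRatio": 0.50, "cells": ["c1"]}}

near_result = semantic_score(near, gold)
wrong_result = semantic_score(wrong, gold)
print(near_result)
print(wrong_result)
```

Per-field verdicts also support the manual inspection in step 2, since they separate wrong identifiers from wrong ratios rather than folding both into one number.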
+ ### Updated project status
+
+ Primary model: **stage 1 Qwen3-8B QLoRA adapter**
+
+ Stage 2 status: **diagnostic / not promoted**
+
+ Current best headline metrics remain the stage-1 normalized results:
+
+ | Split | JSON parse | Normalized field F1 | Normalized key F1 |
+ |---|---:|---:|---:|
+ | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
+ | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
+ | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
+ | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
+ | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |