Record stage2 evaluation results and decision not to promote
PROJECT_JOURNAL.md +131 -0
|
@@ -848,3 +848,134 @@ Stage 2 is successful if:
|
|
| 848 |
- O1 NRM still weak, suggesting need for layer-specific semantic evaluator or improved data generation rather than more SFT.
|
| 849 |
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.
|
| 850 |
|
| 851 |
+
|
| 852 |
+
---
|
| 853 |
+
|
| 854 |
+
## 2026-05-05 — Stage 2 evaluation completed and decision made
|
| 855 |
+
|
| 856 |
+
### Goal
|
| 857 |
+
|
| 858 |
+
Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.
|
| 859 |
+
|
| 860 |
+
### Action
|
| 861 |
+
|
| 862 |
+
After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:
|
| 863 |
+
|
| 864 |
+
- `test_in_distribution`
|
| 865 |
+
- `test_template_ood`
|
| 866 |
+
- `test_use_case_ood`
|
| 867 |
+
- `test_sector_ood`
|
| 868 |
+
- `test_adversarial`
|
| 869 |
+
|
| 870 |
+
The normalized evaluator was then run on the generated predictions:
|
| 871 |
+
|
| 872 |
+
```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```
|
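As a rough illustration of what a flat normalized field F1 and key F1 measure, the sketch below flattens predicted and gold JSON into path/value pairs, drops volatile keys, and scores the overlap. It is not the code in `scripts/normalize_eval_metrics.py`; the function names, the `VOLATILE_KEYS` set, and the example payloads are assumptions made for illustration only.

```python
import json

# Hypothetical set of keys treated as volatile and ignored during scoring.
VOLATILE_KEYS = {"id", "timestamp"}


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b[0].c": leaf_value}."""
    if isinstance(obj, dict):
        children = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        children = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat = {}
    for path, value in children:
        flat.update(flatten(value, path))
    return flat


def normalize(flat):
    """Drop leaves whose final key name is volatile."""
    return {p: v for p, v in flat.items() if p.rsplit(".", 1)[-1] not in VOLATILE_KEYS}


def set_f1(pred, gold):
    """F1 between two sets; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def field_and_key_f1(pred_obj, gold_obj):
    pred = normalize(flatten(pred_obj))
    gold = normalize(flatten(gold_obj))
    key_f1 = set_f1(set(pred), set(gold))
    # json.dumps keeps values hashable and distinguishes 1 from true.
    field_f1 = set_f1(
        {(p, json.dumps(v)) for p, v in pred.items()},
        {(p, json.dumps(v)) for p, v in gold.items()},
    )
    return field_f1, key_f1


gold = {"id": "a1", "slice": {"sst": 1, "sd": "000001"}, "maxUes": 100}
pred = {"id": "b2", "slice": {"sst": 1, "sd": "000002"}, "maxUes": 100}
field_f1, key_f1 = field_and_key_f1(pred, gold)
print(round(field_f1, 4), key_f1)  # 0.6667 1.0
```

In this toy case a single wrong value lowers field F1 while key F1 stays at 1.0, which is the same pattern as the gap between key F1 (about 0.98) and field F1 (about 0.79) in the results.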
| 876 |
+
|
| 877 |
+
### Evidence / result
|
| 878 |
+
|
| 879 |
+
Global normalized comparison, stage 1 -> stage 2:
|
| 880 |
+
|
| 881 |
+
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
|
| 888 |
+
|
| 889 |
+
JSON parse comparison:
|
| 890 |
+
|
| 891 |
+
| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |
|
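The adversarial parse rate of 0.9697 and its delta of -0.0303 are the values one would see if exactly one example out of 33 failed to parse. This entry does not state the size of the adversarial split, so the figure of 33 is an inference rather than a recorded fact. The sketch below recovers the simplest ratio consistent with the rounded rate; the helper name `simplest_ratio` and the cap of 100 examples are assumptions.

```python
from fractions import Fraction


def simplest_ratio(rate: str, max_examples: int = 100) -> Fraction:
    """Closest fraction to a rounded rate with a denominator up to max_examples."""
    return Fraction(rate).limit_denominator(max_examples)


ratio = simplest_ratio("0.9697")
print(ratio)                # 32/33
print(round(1 - ratio, 4))  # 0.0303
```

If the split really has 33 examples, the identical -0.0303 drop in normalized key F1 is also consistent with that single unparsed output scoring zero.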
| 898 |
+
|
| 899 |
+
Weak-layer normalized field F1 comparison, stage 1 -> stage 2:
|
| 900 |
+
|
| 901 |
+
| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
|
| 922 |
+
|
| 923 |
+
### Interpretation
|
| 924 |
+
|
| 925 |
+
Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.
|
| 926 |
+
|
| 927 |
+
Key observations:
|
| 928 |
+
|
| 929 |
+
1. Global normalized field F1 dropped by at most about 0.12 percentage points (0.0012 absolute) on every non-adversarial split. This is effectively flat.
|
| 930 |
+
2. Normalized key F1 regressed on all splits: slightly on the non-adversarial splits (0.0009 to 0.0018) and by 0.0303 on `test_adversarial`.
|
| 931 |
+
3. Adversarial performance regressed meaningfully:
|
| 932 |
+
- normalized field F1: **0.9697 -> 0.9596**
|
| 933 |
+
- normalized key F1: **1.0000 -> 0.9697**
|
| 934 |
+
- parse rate: **1.0000 -> 0.9697**
|
| 935 |
+
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
|
| 936 |
+
5. `a1_policy` also did not improve meaningfully. Changes range from -0.0050 to +0.0023.
|
| 937 |
+
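6. Lifecycle report/monitor/scale improved on the OOD splits where they are reported, most clearly on use-case and sector OOD (up to +0.0450), but `tmf921_lifecycle_monitor` regressed in distribution (-0.0246). The gains are not consistent enough to justify replacing the stage-1 model.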
|
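The first and fourth observations can be re-derived from the rounded table values with a few lines of Python. This is a convenience sketch, not part of the evaluation scripts; because it subtracts the rounded numbers shown above, individual deltas can differ by 0.0001 from the Delta columns, which were presumably computed before rounding.

```python
# (stage 1, stage 2) normalized field F1, copied from the tables above.
GLOBAL_FIELD_F1 = {
    "test_in_distribution": (0.7956, 0.7952),
    "test_template_ood": (0.7865, 0.7855),
    "test_use_case_ood": (0.7907, 0.7895),
    "test_sector_ood": (0.7697, 0.7694),
    "test_adversarial": (0.9697, 0.9596),
}
O1_NRM_FIELD_F1 = {
    "ID": (0.3927, 0.3906),
    "Template OOD": (0.3976, 0.3993),
    "Use-case OOD": (0.3936, 0.3895),
    "Sector OOD": (0.3858, 0.3888),
}


def deltas(table):
    """Stage 2 minus stage 1 for each row, rounded to four decimals."""
    return {name: round(s2 - s1, 4) for name, (s1, s2) in table.items()}


global_deltas = deltas(GLOBAL_FIELD_F1)
non_adversarial = {k: v for k, v in global_deltas.items() if k != "test_adversarial"}
max_abs_change = max(abs(v) for v in non_adversarial.values())
o1_deltas = deltas(O1_NRM_FIELD_F1)

print(max_abs_change)                                   # 0.0012
print(all(v < 0 for v in global_deltas.values()))       # True
print(min(o1_deltas.values()), max(o1_deltas.values()))
```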
| 938 |
+
|
| 939 |
+
The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but either:
|
| 940 |
+
|
| 941 |
+
- insufficient semantic supervision in the data,
|
| 942 |
+
- inadequacy of flat field-F1 for some low-level configs,
|
| 943 |
+
- need for layer-specific validators and value extractors,
|
| 944 |
+
- or the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.
|
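One way to picture canonical scenario generation with per-layer rendering rules is sketched below: a single validated object holds the critical values, and each layer is rendered from it so the layers cannot disagree. All class, function, and JSON key names here are hypothetical and do not reflect the actual O1 NRM or A1 policy schemas. The only external fact relied on is that an S-NSSAI combines an 8-bit SST with an optional 24-bit SD, written here as six hex digits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalScenario:
    sst: int                # slice/service type, 0..255
    sd: str                 # slice differentiator, six hex digits
    cell_id: str
    prb_ratio_percent: int  # 0..100


def validate_scenario(s):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    if not 0 <= s.sst <= 255:
        errors.append("sst must fit in 8 bits")
    if len(s.sd) != 6 or any(c not in "0123456789abcdefABCDEF" for c in s.sd):
        errors.append("sd must be six hex digits")
    if not 0 <= s.prb_ratio_percent <= 100:
        errors.append("prb_ratio_percent must be between 0 and 100")
    return errors


def render_o1_nrm(s):
    # Hypothetical layout, for illustration only.
    return {
        "cellId": s.cell_id,
        "snssai": {"sst": s.sst, "sd": s.sd},
        "prbRatio": s.prb_ratio_percent,
    }


def render_a1_policy(s):
    # Hypothetical layout, for illustration only.
    return {"scope": {"sliceId": {"sst": s.sst, "sd": s.sd}, "cellId": s.cell_id}}


RENDERERS = {"o1_nrm": render_o1_nrm, "a1_policy": render_a1_policy}


def render_all(s):
    """Validate once, then render every layer from the same object."""
    errors = validate_scenario(s)
    if errors:
        raise ValueError("; ".join(errors))
    return {layer: render(s) for layer, render in RENDERERS.items()}


rendered = render_all(CanonicalScenario(sst=1, sd="000001", cell_id="cell-17", prb_ratio_percent=40))
```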
| 945 |
+
|
| 946 |
+
### Decision
|
| 947 |
+
|
| 948 |
+
Stage 2 should **not** replace the stage-1 model as the main model.
|
| 949 |
+
|
| 950 |
+
The stage-1 adapter remains the current primary model because it has:
|
| 951 |
+
|
| 952 |
+
- slightly better global normalized metrics,
|
| 953 |
+
- better adversarial robustness,
|
| 954 |
+
- no meaningful disadvantage on O1/A1 compared with stage 2.
|
| 955 |
+
|
| 956 |
+
Stage 2 is retained as a diagnostic experiment and may be useful only as evidence that weak-layer continuation alone is not sufficient.
|
| 957 |
+
|
| 958 |
+
### Next step
|
| 959 |
+
|
| 960 |
+
Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for weak layers:
|
| 961 |
+
|
| 962 |
+
1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values.
|
| 963 |
+
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
|
| 964 |
+
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
|
| 965 |
+
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.
|
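A layer-specific evaluator along the lines of step 1 might look like the sketch below. It scores only a short list of critical fields per layer and allows a relative tolerance on numeric values, instead of comparing every flattened JSON leaf exactly. The field paths, the 5% tolerance, and the example payloads are all assumptions; the real critical fields would come out of the manual inspection in step 2.

```python
import math

# Hypothetical critical fields per layer: dotted path -> comparison mode.
CRITICAL_FIELDS = {
    "o1_nrm": {
        "snssai.sst": "exact",
        "snssai.sd": "exact",
        "cellId": "exact",
        "prbRatio": "numeric",
    },
}


def get_path(obj, dotted):
    """Follow a dotted path through nested dicts; None if any part is missing."""
    for part in dotted.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def score_layer(layer, pred, gold, rel_tol=0.05):
    """Fraction of critical gold fields that the prediction reproduces."""
    matched = total = 0
    for path, mode in CRITICAL_FIELDS[layer].items():
        expected = get_path(gold, path)
        if expected is None:
            continue  # field not present in the gold output, nothing to score
        total += 1
        actual = get_path(pred, path)
        if mode == "numeric" and is_number(expected) and is_number(actual):
            matched += math.isclose(actual, expected, rel_tol=rel_tol)
        else:
            matched += actual == expected
    return matched / total if total else 1.0


gold = {"cellId": "cell-17", "snssai": {"sst": 1, "sd": "000001"}, "prbRatio": 40}
pred = {"cellId": "cell-17", "snssai": {"sst": 1, "sd": "000002"}, "prbRatio": 41}
print(score_layer("o1_nrm", pred, gold))  # 0.75
```

Here the PRB ratio of 41 is accepted within tolerance while the wrong SD is still penalized, which separates measurement-style drift from identity errors.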
| 966 |
+
|
| 967 |
+
### Updated project status
|
| 968 |
+
|
| 969 |
+
Primary model: **stage 1 Qwen3-8B QLoRA adapter**
|
| 970 |
+
|
| 971 |
+
Stage 2 status: **diagnostic / not promoted**
|
| 972 |
+
|
| 973 |
+
Current best headline metrics remain the stage-1 normalized results:
|
| 974 |
+
|
| 975 |
+
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |
|