# TMF921 Intent-to-Configuration Research Journal

This file is the running scientific journal for the TMF921 intent-to-configuration project. It records what was done, why decisions were made, what failed, what was fixed, and what evidence supports each next step.

Repository links:

- Source augmented dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-augmented
- Research SOTA dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation repo: https://huggingface.co/nraptisss/tmf921-intent-training
- Base model: https://huggingface.co/Qwen/Qwen3-8B

---

## Journal conventions

Each entry should include:

1. **Date/time**
2. **Goal**
3. **Action**
4. **Evidence / result**
5. **Interpretation**
6. **Decision / next step**

For research claims, prefer numeric evidence over qualitative statements.

---

## 2026-04-30 — Dataset cloned and audited

### Goal

Clone and scientifically audit `nraptisss/TMF921-intent-to-config-augmented` before training.

### Action

The dataset was cloned in the sandbox, and a comprehensive audit was run over schema, missingness, ChatML formatting, JSON validity, duplicates/leakage, distribution balance, numeric KPI ranges, train/test similarity, and scientific validity.

### Evidence / result

Dataset size:

- Total rows: **41,815**
- Train: **39,294**
- Test: **2,521**

Quality checks:

- Missing values: **0**
- Duplicate IDs: **0**
- Duplicate full conversations: **0**
- Assistant JSON parse validity: **41,815 / 41,815 = 100%**
- Role sequence: `system -> user -> assistant` for all rows

Leakage / similarity findings:

- Exact train/test user-prompt overlap: **0**
- Exact train/test full-message overlap: **0**
- Near-duplicate prompt similarity was high (measured as sketched below):
  - test prompts with char-ngram similarity >= 0.90 to train: **1,290 / 2,521**
  - >= 0.95: **602 / 2,521**
  - >= 0.98: **262 / 2,521**
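
For reproducibility, here is a minimal sketch of the near-duplicate check. It assumes Jaccard similarity over lowercased character 3-grams; the audit's exact n-gram size and similarity definition are not recorded in this entry, so treat both as assumptions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a lowercased prompt (n = 3 is an assumption)."""
    t = text.lower()
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 0))}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the char n-gram sets of two prompts."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def max_similarity_to_train(test_prompt: str, train_prompts: list) -> float:
    """Highest similarity of one test prompt to any train prompt (brute force)."""
    return max(ngram_similarity(test_prompt, p) for p in train_prompts)
```

The thresholds reported above are then counts of test prompts whose maximum train similarity clears 0.90, 0.95, or 0.98; at this dataset size the brute-force pairwise scan is slow but tractable.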
Distribution findings:

- `create` lifecycle operation: **40,090 / 41,815 = 95.9%**
- non-create lifecycle rows: **1,725 = 4.1%**
- adversarial rows: **166 = 0.397%**
- only **31 unique JSON structure signatures** across 41,815 rows

### Interpretation

The source dataset is technically clean and suitable for SFT, but the original split is mainly an in-distribution/template-compliance split, not a strong OOD benchmark. JSON validity is excellent, but scientific benchmark validity requires OOD splits and normalized/semantic evaluation.

### Decision / next step

Create a research-grade derivative dataset with:

- OOD splits,
- train/eval provenance columns,
- a token-length audit,
- validation flags,
- lifecycle/adversarial upsampling for training only,
- no fabricated continuous-KPI or cross-layer-paired examples without a validated generator.

---

## 2026-04-30 — Research SOTA dataset created

### Goal

Implement the audit recommendations while preserving scientific soundness.

### Action

Created `nraptisss/TMF921-intent-to-config-research-sota`.

Implemented splits:

- `train_base`
- `train_sota`
- `validation`
- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

Added columns:

- `system`, `prompt`, `completion`
- `prompt_template_id`
- `scenario_id`
- `json_structure_id`
- `json_root_family`
- `messages_format_valid`
- `assistant_is_valid_json`
- `slice_sst_valid`
- `kpi_profile_valid`
- `semantic_rule_valid_v1`
- `qwen3_chat_template_tokens`
- `fits_2048_qwen3`
- `fits_4096_qwen3`
- `sampling_weight_*`
- `is_augmented`, `augmentation_type`, `source_id`, `conversation_type`

### Evidence / result

Published dataset:

- https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota

Splits:

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | training split with marked lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | validation |
| `test_in_distribution` | 1,455 | in-distribution test |
| `test_template_ood` | 3,503 | held-out prompt-template family |
| `test_use_case_ood` | 4,341 | held-out use cases |
| `test_sector_ood` | 4,579 | held-out sectors |
| `test_adversarial` | 33 | held-out adversarial examples |

Qwen3 token-length audit (per-row counts computed as sketched below):

- mean: **754.1**
- p50: **705**
- p95: **1293**
- p99: **1300**
- max: **1316**
- fit within 2048: **100%**
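
A minimal sketch of how the `qwen3_chat_template_tokens` column can be computed with the standard `transformers` chat-template API; the dataset build script is authoritative, so treat this as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def chat_template_token_count(messages) -> int:
    """Token count of a conversation rendered with the Qwen3 chat template."""
    ids = tokenizer.apply_chat_template(messages, tokenize=True)
    return len(ids)

# Per-row flags mirroring the dataset columns:
# row["fits_2048_qwen3"] = chat_template_token_count(row["messages"]) <= 2048
# row["fits_4096_qwen3"] = chat_template_token_count(row["messages"]) <= 4096
```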
`train_sota` balancing:

- non-create lifecycle rows: **5,166 = 15.97%**
- adversarial rows: **2,115 = 6.54%**
- synthetic multi-turn wrappers: **1,281**

### Interpretation

`max_length=2048` is justified for Qwen3-8B. `train_sota` improves rare-class exposure. OOD splits allow scientifically meaningful generalization reporting.

### Decision / next step

Build a training/evaluation repository for a single RTX 6000 Ada server using Qwen3-8B QLoRA.

---

## 2026-04-30 / 2026-05-01 — Training/evaluation repo created

### Goal

Create a reproducible repo for training and evaluation on an RTX 6000 Ada 48/50GB server.

### Action

Created `nraptisss/tmf921-intent-training` with:

- a QLoRA SFT training script,
- an evaluation script,
- a merge script,
- an RTX 6000 Ada install script,
- a GPU preflight check,
- nohup run scripts,
- resumable checkpoints,
- unique run directories.

Default recipe (restated as configs in the sketch after this list):

- model: `Qwen/Qwen3-8B`
- method: QLoRA NF4 + double quantization
- LoRA target modules: `all-linear`
- LoRA rank: `64`
- LoRA alpha: `16`
- LoRA dropout: `0.05`
- LR: `2e-4`
- scheduler: constant
- max length: `2048`
- assistant-only loss: enabled
- bf16: enabled
- gradient checkpointing: enabled
- train split: `train_sota`
- eval split: `validation`
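
For concreteness, a minimal sketch of the quantization and LoRA objects implied by this recipe, using standard `transformers`/`peft` APIs; the repo's training script is authoritative:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# QLoRA quantization: 4-bit NF4 with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings matching the recipe above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```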
### Evidence / result

Repo:

- https://huggingface.co/nraptisss/tmf921-intent-training

### Interpretation

The training approach is consistent with the QLoRA literature and fits the memory constraints of a 48/50GB RTX 6000 Ada GPU.

### Decision / next step

Run training under `nohup`, require a CUDA preflight check, and ensure unique output directories to avoid overwriting results.

---

## 2026-05-01 — Runtime issues fixed

### Goal

Resolve server-side training errors and ensure training uses the GPU.

### Issues encountered and fixes

#### 1. CPU/GPU uncertainty

There was concern that training might not be using the GPU.

Fix:

- Added `scripts/check_gpu.py`
- Added `scripts/install_rtx6000ada.sh`
- Added fail-fast CUDA checks to the training/evaluation scripts.

Evidence from server logs:

```text
torch=2.6.0+cu124 torch.version.cuda=12.4 CUDA_VISIBLE_DEVICES=0
cuda device_count=1 gpu0=NVIDIA RTX 6000 Ada Generation
```

Conclusion: GPU setup confirmed.
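
The fail-fast check is essentially of this shape; a minimal sketch, since the exact contents of `scripts/check_gpu.py` are not reproduced here:

```python
import torch

def require_cuda() -> None:
    """Abort before any training work if CUDA is unavailable."""
    if not torch.cuda.is_available():
        raise SystemExit("CUDA is not available; refusing to fall back to CPU.")
    print(
        f"torch={torch.__version__} torch.version.cuda={torch.version.cuda} "
        f"cuda device_count={torch.cuda.device_count()} "
        f"gpu0={torch.cuda.get_device_name(0)}"
    )

if __name__ == "__main__":
    require_cuda()
```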
#### 2. TRL conversational dataset detection error

Error:

```text
ValueError: You set assistant_only_loss=True, but the dataset is not conversational.
```

Cause:

The dataset contains `messages` plus convenience `prompt`/`completion` columns. TRL inferred the prompt-completion format instead of the conversational format.

Fix:

The training script now passes only:

```python
# Keep only the conversational column so TRL detects the messages format.
train_dataset = train_dataset.select_columns(["messages"])
eval_dataset = eval_dataset.select_columns(["messages"])
```

#### 3. Trackio invalid Space ID

Error:

```text
HFValidationError: Repo id ... 'nraptisss/'
```

Cause:

Invalid `TRACKIO_SPACE_ID=nraptisss/` (a namespace with no Space name).

Fix:

Added validation/sanitization for Trackio Space IDs (sketched below) and support for:

```bash
DISABLE_TRACKIO=1
```
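
A minimal sketch of the kind of validation added; the exact rules live in the training scripts, and the regex here is an assumption based on the `namespace/name` Hub ID format:

```python
import os
import re

# Hub Space IDs look like "namespace/name"; "nraptisss/" fails because the
# name component is empty.
_SPACE_ID_RE = re.compile(r"^[\w.-]+/[\w.-]+$")

def resolve_trackio_space_id():
    """Return a valid Space ID, or None to disable Trackio logging."""
    if os.environ.get("DISABLE_TRACKIO") == "1":
        return None
    space_id = os.environ.get("TRACKIO_SPACE_ID", "").strip()
    if not _SPACE_ID_RE.match(space_id):
        print(f"Ignoring invalid TRACKIO_SPACE_ID={space_id!r}")
        return None
    return space_id
```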
#### 4. Deprecated warmup argument

Warning:

```text
warmup_ratio is deprecated
```

Fix:

Changed the config/script to use:

```yaml
warmup_steps: 0
```

### Decision / next step

Restart training with the fixed scripts and Trackio disabled to avoid external logging failures.

---

## 2026-05-01 / 2026-05-02 — Qwen3-8B QLoRA training run completed

### Goal

Train Qwen3-8B QLoRA on `train_sota`.

### Action

Started training under nohup with a unique run directory:

```text
runs/qwen3-8b-qlora-20260501-083834
```

Trackio disabled:

```bash
DISABLE_TRACKIO=1
```

### Evidence / result

Training logs showed stable convergence.

Representative metrics:

Initial:

```text
loss: 1.212
mean_token_accuracy: 0.7922
```

After early training:

```text
loss: ~0.15
mean_token_accuracy: ~0.945-0.953
```

Validation loss over training:

```text
eval_loss: 0.1593 at epoch 0.1236
eval_loss: 0.1561 at epoch 0.2472
eval_loss: 0.1548 at epoch 0.3709
eval_loss: 0.1535 at epoch 0.8653
eval_loss: 0.1530 at epoch 1.607
eval_loss: 0.1532 at epoch 1.730
```

Not observed:

- CUDA OOM,
- NaNs,
- divergence,
- gradient explosion.

### Interpretation

The run converged smoothly. Training loss stabilized around 0.14–0.15 and validation loss plateaued near 0.153, indicating stable SFT convergence.

### Decision / next step

Evaluate the trained adapter across the ID and OOD splits.

---

## 2026-05-02 / 2026-05-04 — Evaluation speed issue and merged-model evaluation

### Goal

Evaluate the trained adapter on all splits.

### Issue

The initial evaluator used single-example 4-bit adapter generation with a large `max_new_tokens`, causing very slow evaluation:

```text
test_in_distribution: 1455 examples in ~25h
test_template_ood: ~30-90s/example
```

### Action

Patched the evaluator to support:

- batched generation,
- dynamic generation length based on target length + buffer,
- periodic save/resume,
- partial prediction reuse.

Also recommended merging the adapter into the base bf16 model for faster inference. A sketch of the batched, length-capped generation loop follows.
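
A minimal sketch of the patched generation loop, assuming a merged bf16 model and a per-batch cap derived from the reference completions; names like `buffer_tokens` are illustrative, not the evaluator's actual parameters:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Left padding keeps prompts aligned at the end for decoder-only batch
# generation; this assumes the tokenizer defines a pad token.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "runs/qwen3-8b-qlora-20260501-083834/outputs/merged",  # merged bf16 model
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

def generate_batch(prompts, references, buffer_tokens=64):
    """Batched greedy generation, capped near the longest reference length."""
    max_ref = max(len(tokenizer(r).input_ids) for r in references)
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_ref + buffer_tokens,
        do_sample=False,
    )
    # Strip the prompt tokens before decoding.
    new_tokens = out[:, inputs.input_ids.shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Capping `max_new_tokens` per batch instead of using one large global value is what removes most of the wasted decoding time.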
### Decision / next step

Use merged-model evaluation and normalized metrics.

---

## 2026-05-04 — Raw evaluation results

### Goal

Measure raw JSON and field-level performance.

### Evidence / result

Raw metrics:

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

### Interpretation

The model learned JSON formatting and adversarial rejection very well. Raw exact match is low for the primary config layers, but raw exact match is likely too strict because many fields are volatile/generated (`id`, `href`, timestamps, descriptions, schema links).

### Decision / next step

Implement a normalized evaluator that removes volatile fields before scoring.

---

## 2026-05-04 — Normalized evaluator implemented and run

### Goal

Re-score the existing predictions using metrics that better reflect structural/semantic configuration agreement.

### Action

Added:

```text
scripts/normalize_eval_metrics.py
```

Normalization removes/masks:

- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.

It computes (a sketch of the scoring follows this list):

- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1,
- stratified metrics.
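
A minimal sketch of the scoring logic, assuming the evaluator flattens predicted and reference JSON to `path -> value` pairs and drops volatile paths by pattern; the regex below is illustrative, not the script's exact list:

```python
import re

# Illustrative volatile-key pattern; the real script matches more cases
# (UUID/hash-like strings, schema links, generated request/policy IDs, ...).
VOLATILE_KEY_RE = re.compile(r"(^|\.)(id|href|name|description)$|[Tt]ime")

def flatten(obj, prefix=""):
    """Flatten nested JSON into (dotted.path, value) pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield prefix, obj

def normalized_field_f1(pred, ref):
    """F1 over (path, value) pairs after dropping volatile paths."""
    p = {(k, str(v)) for k, v in flatten(pred) if not VOLATILE_KEY_RE.search(k)}
    r = {(k, str(v)) for k, v in flatten(ref) if not VOLATILE_KEY_RE.search(k)}
    if not p and not r:
        return 1.0
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```

Normalized key F1 is the same computation over the paths alone (values ignored), which is why it can sit near 0.98 while field F1 is lower.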
### Evidence / result

Headline normalized metrics:

| Split | JSON parse | Raw field F1 | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.6868 | **0.7956** | **0.9811** | 0.0351 |
| `test_template_ood` | 1.0000 | 0.6790 | **0.7865** | **0.9801** | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.6825 | **0.7907** | **0.9805** | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.6610 | **0.7697** | **0.9818** | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | **0.9697** | **1.0000** | 0.9697 |

Strong layers:

- `tmf921`: normalized field F1 around **0.93–0.94**
- `camara`: normalized field F1 around **0.81–0.87**
- `intent_3gpp`: normalized field F1 around **0.80–0.82**
- `etsi_zsm`: normalized field F1 around **0.75–0.79**

Weak layers:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**

### Interpretation

The model is much stronger than raw exact match suggested. It reliably emits valid JSON and correct structural schemas (`norm_key_f1 ≈ 0.98`) across the ID and OOD splits. Field-level value fidelity is moderate to strong overall, but weak for low-level O1 NRM values and monitoring/report lifecycle outputs.

### Decision / next step

Plan a second-stage weak-layer fine-tune focused on:

- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- optionally `tmf921_lifecycle_scale`.

Use the current adapter as initialization, lower the LR, and include replay from the strong layers to prevent forgetting.

---

## Current scientific status

### What can be claimed now

The Qwen3-8B QLoRA model trained on the TMF921 Research SOTA split achieves:

- near-perfect JSON validity,
- stable OOD generalization,
- excellent adversarial rejection,
- normalized structural key F1 around 98% across the non-adversarial ID/OOD splits,
- normalized field F1 around 77–80% across the ID/OOD splits.

### What should not be overclaimed

Do not claim production-grade standards compliance yet. The current evaluation is normalized JSON/field scoring, not official TMF921/3GPP/ETSI/CAMARA/O-RAN schema validation.

### Main weaknesses

- O1 NRM value fidelity is poor despite correct structure.
- Lifecycle report/monitor outputs need targeted improvement.
- Raw exact match remains low for primary create configs.

### Next planned experiment

Second-stage weak-layer adapter continuation:

- initialize from the current Qwen3-8B TMF921 adapter,
- train on weak-layer examples plus a replay buffer,
- lower LR: `5e-5` or `1e-4`,
- 1 epoch,
- same max length of 2048,
- evaluate again with raw + normalized metrics.

---

## Open questions

1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
4. Should training use a weak-layer second stage, or should dataset generation be improved first?

---

## Running log template

```markdown
## YYYY-MM-DD — Short title

### Goal

### Action

### Evidence / result

### Interpretation

### Decision / next step
```

---

## 2026-05-04 — Stage 2 weak-layer continuation plan implemented

### Goal

Improve the weak target layers identified by the normalized evaluation without degrading the strong layers.

Weak layers from the normalized evaluation:

- `o1_nrm`: normalized field F1 around **0.39–0.40**
- `a1_policy`: normalized field F1 around **0.67–0.68**
- `tmf921_lifecycle_report`: normalized field F1 around **0.15–0.18**
- `tmf921_lifecycle_monitor`: normalized field F1 around **0.39–0.52**
- `tmf921_lifecycle_scale`: mixed, included because lifecycle scaling still had noticeable errors

### Action

Added stage-2 tooling:

- `scripts/build_weak_layer_dataset.py`
- `scripts/train_continue_adapter.py`
- `configs/stage2_weak_layer_qwen3_8b.yaml`
- `scripts/nohup_stage2_weak.sh`

The weak-layer dataset builder creates a local parquet training set with (see the sketch after this list):

1. all weak-layer rows from `train_sota`,
2. duplicated rare weak layers up to a minimum count,
3. a replay buffer from non-weak layers to reduce forgetting.
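
A sketch of the builder's sampling logic. It assumes the target layer is identified by the `json_root_family` column; the script's actual column names and CLI flags may differ:

```python
from datasets import concatenate_datasets, load_dataset

WEAK = {
    "o1_nrm", "a1_policy", "tmf921_lifecycle_report",
    "tmf921_lifecycle_monitor", "tmf921_lifecycle_scale",
}

def build_stage2_train(min_per_layer=1500, replay_ratio=0.3, seed=0):
    train = load_dataset(
        "nraptisss/TMF921-intent-to-config-research-sota", split="train_sota"
    )
    strong = train.filter(lambda r: r["json_root_family"] not in WEAK)

    parts = []
    for layer in sorted(WEAK):
        rows = train.filter(lambda r, l=layer: r["json_root_family"] == l)
        parts.append(rows)
        need = min_per_layer - len(rows)
        while need > 0:  # duplicate rare layers up to the minimum count
            take = min(need, len(rows))
            parts.append(rows.shuffle(seed=seed).select(range(take)))
            need -= take

    weak_total = sum(len(p) for p in parts)
    # Replay buffer from non-weak layers, sized relative to the weak rows
    # (0.3 * 10,638 = 3,191, matching the reported composition).
    n_replay = int(weak_total * replay_ratio)
    parts.append(strong.shuffle(seed=seed).select(range(n_replay)))
    return concatenate_datasets(parts).shuffle(seed=seed)
```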
The continuation trainer loads (see the sketch after this list):

1. the Qwen3-8B base model in 4-bit NF4,
2. the existing LoRA adapter with `is_trainable=True`,
3. the local weak-layer replay dataset,
4. TRL `SFTTrainer` without a new `peft_config`, per PEFT/TRL continuation best practices.
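
A minimal sketch of this continuation setup; the parquet file name is an assumption (the config only specifies `dataset_dir`), and `scripts/train_continue_adapter.py` remains authoritative:

```python
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="cuda"
)

# Resume the stage-1 adapter with trainable weights instead of creating a
# fresh LoRA; this is why no new peft_config is passed to SFTTrainer.
model = PeftModel.from_pretrained(
    base,
    "runs/qwen3-8b-qlora-20260501-083834/outputs/adapter",
    is_trainable=True,
)

# Assumed file name under the configured dataset_dir.
train_dataset = load_dataset(
    "parquet",
    data_files="runs/stage2-weak-20260505-080040/weak_layer_data/train.parquet",
)["train"].select_columns(["messages"])

trainer = SFTTrainer(model=model, train_dataset=train_dataset)  # no peft_config
```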
Stage-2 default hyperparameters:

```yaml
learning_rate: 5e-5
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
```

### Interpretation

A lower learning rate and a replay buffer should improve weak-layer value fidelity while reducing catastrophic forgetting on the strong layers. This is a targeted continuation, not a replacement for Gen4 data generation or official schema validation.

### Decision / next step

Run stage 2 from the completed stage-1 adapter:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

After training, evaluate with the same raw + normalized OOD protocol and compare against the stage-1 metrics.

---

## 2026-05-05 — Stage 2 weak-layer continuation run started

### Goal

Run the stage-2 weak-layer continuation experiment implemented on 2026-05-04.

The intended scientific question is:

> Can a short, low-learning-rate continuation on weak target layers improve low-performing layer-specific value fidelity while preserving the strong global JSON validity, key structure, and adversarial behavior from stage 1?

### Action

Started stage 2 with:

```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-20260501-083834
```

Generated run:

```text
runs/stage2-weak-20260505-080040
```

Source adapter:

```text
runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
```

### Stage-2 dataset composition

The weak-layer dataset builder produced:

```json
{
  "rows_train_stage2": 13829,
  "rows_validation": 1547,
  "weak_rows_total_after_duplication": 10638,
  "replay_rows": 3191,
  "rare_min_per_layer": 1500,
  "replay_ratio": 0.3
}
```

Layer counts before/after rare-layer duplication:

| Target layer | Before | After |
|---|---:|---:|
| `o1_nrm` | 2,672 | 2,672 |
| `a1_policy` | 3,466 | 3,466 |
| `tmf921_lifecycle_report` | 596 | 1,500 |
| `tmf921_lifecycle_monitor` | 726 | 1,500 |
| `tmf921_lifecycle_scale` | 576 | 1,500 |

Replay buffer size:

- replay rows from non-weak layers: **3,191**
- purpose: reduce catastrophic forgetting on strong layers such as `tmf921`, `camara`, `intent_3gpp`, `etsi_zsm`, and adversarial rejection.

Full target-layer composition in the stage-2 train set:

| Target layer | Rows |
|---|---:|
| `a1_policy` | 3,466 |
| `o1_nrm` | 2,672 |
| `tmf921_lifecycle_monitor` | 1,500 |
| `tmf921_lifecycle_report` | 1,500 |
| `tmf921_lifecycle_scale` | 1,500 |
| `tmf921` replay | 902 |
| `intent_3gpp` replay | 630 |
| `camara` replay | 618 |
| `etsi_zsm` replay | 335 |
| adversarial replay and other lifecycle replay | remaining rows |

### Training configuration

Resolved stage-2 config:

```yaml
model_name_or_path: Qwen/Qwen3-8B
adapter_path: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
dataset_dir: runs/stage2-weak-20260505-080040/weak_layer_data
output_dir: runs/stage2-weak-20260505-080040/outputs/adapter
learning_rate: 5.0e-05
epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_length: 2048
assistant_only_loss: true
bf16: true
gradient_checkpointing: true
optim: paged_adamw_32bit
```

### Evidence that adapter continuation was configured correctly

The server log confirmed:

```text
Base model: Qwen/Qwen3-8B
Adapter: runs/qwen3-8b-qlora-20260501-083834/outputs/adapter
trainable params: 174,587,904 || all params: 8,365,323,264 || trainable%: 2.0870
TunerModelStatus(... active_adapters=['default'], requires_grad={'default': True}, devices={'default': ['cuda']})
```

Interpretation:

- The existing adapter was loaded.
- Adapter weights are trainable.
- Training is on CUDA.
- The base model is not being fully fine-tuned; only LoRA adapter parameters are updated.

### Early training evidence

Stage-2 training began normally after tokenization:

```text
Tokenizing train dataset: 13,829 / 13,829
Tokenizing eval dataset: 1,547 / 1,547
```

Representative early logs:

```text
loss: 0.1313, grad_norm: 0.0199, lr: 5e-05, mean_token_accuracy: 0.9572, epoch: 0.0012
loss: 0.1686, grad_norm: 0.0317, lr: 5e-05, mean_token_accuracy: 0.9435, epoch: 0.0116
loss: 0.1541, grad_norm: 0.0166, lr: 5e-05, mean_token_accuracy: 0.9463, epoch: 0.1157
```

Validation during stage 2:

```text
eval_loss: 0.1581 at epoch 0.1157
eval_loss: 0.1582 at epoch 0.2314
eval_loss: 0.1584 at epoch 0.3471
eval_loss: 0.1585 at epoch 0.4628
```

At approximately 50% completion (epoch 0.4975 / 1.0):

- loss: in the 0.1366–0.1428 range near the midpoint
- grad_norm: generally < 0.14
- mean_token_accuracy: about 0.95

### Interpretation

The stage-2 run is healthy:

- no CUDA OOM,
- no NaN/Inf,
- no gradient explosion,
- GPU is active,
- adapter continuation is correctly configured.

Validation loss is slightly worse than the stage-1 plateau (~0.153), but this is expected because stage 2 intentionally shifts the training distribution toward harder weak layers. The decisive evaluation is not broad validation loss alone; it is the post-stage-2 OOD normalized weak-layer comparison.

### Decision / next step

Let stage 2 finish. After completion:

1. merge the stage-2 adapter,
2. run the OOD evaluation,
3. run the normalized evaluator,
4. compare against the stage-1 baselines.

Commands planned after stage 2:

```bash
RUN_DIR="runs/stage2-weak-20260505-080040"

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter "$RUN_DIR/outputs/adapter" \
  --output_dir "$RUN_DIR/outputs/merged"

EVAL_BATCH_SIZE=8 \
bash scripts/nohup_eval.sh "$RUN_DIR" "$RUN_DIR/outputs/merged"

python scripts/normalize_eval_metrics.py \
  --eval_dir "$RUN_DIR/eval"
```

### Success criteria

Stage 2 is successful if:

1. weak-layer normalized field F1 improves:
   - `o1_nrm` above stage-1 ~0.39–0.40,
   - `a1_policy` above stage-1 ~0.67–0.68,
   - `tmf921_lifecycle_report` above stage-1 ~0.15–0.18,
   - `tmf921_lifecycle_monitor` above stage-1 ~0.39–0.52;
2. global normalized field F1 does not regress substantially:
   - stage-1 ID: 0.7956,
   - stage-1 template OOD: 0.7865,
   - stage-1 use-case OOD: 0.7907,
   - stage-1 sector OOD: 0.7697;
3. JSON parse remains near 100%;
4. adversarial normalized exact match remains close to 0.9697.

### Failure modes to watch

- Global regression from weak-layer overfitting.
- Adversarial degradation from insufficient replay.
- O1 NRM still weak, suggesting the need for a layer-specific semantic evaluator or improved data generation rather than more SFT.
- Lifecycle report/monitor still weak, suggesting those outputs include measurement/simulation fields that may require tolerance-based scoring.

---
## 2026-05-05 — Stage 2 evaluation completed and decision made

### Goal

Determine whether the stage-2 weak-layer continuation improved the weak target layers enough to replace the stage-1 adapter as the main model.

### Action

After stage-2 training completed, the adapter was merged into the Qwen3-8B base model and evaluated on the same OOD protocol used for stage 1:

- `test_in_distribution`
- `test_template_ood`
- `test_use_case_ood`
- `test_sector_ood`
- `test_adversarial`

The normalized evaluator was then run on the generated predictions:

```bash
python scripts/normalize_eval_metrics.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval
```

### Evidence / result

Global normalized comparison, stage 1 -> stage 2:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

JSON parse comparison:

| Split | Stage 1 parse | Stage 2 parse | Delta |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.9993 | -0.0007 |
| `test_template_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_use_case_ood` | 0.9998 | 0.9995 | -0.0002 |
| `test_sector_ood` | 1.0000 | 1.0000 | +0.0000 |
| `test_adversarial` | 1.0000 | 0.9697 | -0.0303 |

Weak-layer normalized field F1 comparison, stage 1 -> stage 2:

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |

### Interpretation

Stage 2 produced only marginal global changes and did not solve the main weak-layer problem.

Key observations:

1. Global normalized field F1 changed by at most 0.12 percentage points on the non-adversarial splits. This is effectively flat.
2. Normalized key F1 regressed slightly across all splits.
3. Adversarial performance regressed meaningfully:
   - normalized field F1: **0.9697 -> 0.9596**
   - normalized key F1: **1.0000 -> 0.9697**
   - parse rate: **1.0000 -> 0.9697**
4. `o1_nrm` did not improve in any meaningful way. Changes are between about -0.004 and +0.003, which is noise-level.
5. `a1_policy` also did not improve meaningfully.
6. Lifecycle report/monitor/scale improved on some OOD splits, especially use-case and sector OOD, but not consistently enough to justify replacing the stage-1 model.

The experiment is scientifically useful because it shows that simply continuing LoRA training on weak-layer examples is insufficient for O1 NRM and A1 policy value fidelity. The likely limitation is not lack of exposure alone, but either:

- insufficient semantic supervision in the data,
- inadequacy of flat field F1 for some low-level configs,
- the need for layer-specific validators and value extractors,
- or the need for Gen4 canonical scenario generation with explicit per-layer rendering rules.

### Decision

Stage 2 should **not** replace the stage-1 model as the main model.

The stage-1 adapter remains the current primary model because it has:

- slightly better global normalized metrics,
- better adversarial robustness,
- no meaningful disadvantage on O1/A1 compared with stage 2.

Stage 2 is retained as a diagnostic experiment and may be useful only as evidence that weak-layer continuation alone is not sufficient.

### Next step

Do **not** run another blind weak-layer fine-tune yet. The next scientifically sound step is to improve evaluation/data for the weak layers:

1. Build a layer-specific semantic evaluator for `o1_nrm` and `a1_policy` that extracts and scores telecom-relevant fields rather than flat JSON values (a sketch of this direction follows the list).
2. Inspect O1 NRM predictions manually to identify whether failures are wrong values, wrong cell identities, wrong PRB ratios, wrong S-NSSAI encoding, or volatile fields still not normalized.
3. For Gen4, generate canonical scenario objects first, then render all target layers from the same canonical object with explicit validators.
4. Add row-level canonical labels for critical values so evaluation does not depend on brittle JSON flattening.
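
To make step 1 concrete, a heavily hedged sketch of what a layer-specific semantic evaluator for `o1_nrm` could look like. The extracted quantities (PRB ratios, S-NSSAI, per-node identity) come from the failure hypotheses in step 2; the JSON paths and function names are hypothetical, since the real O1 NRM structure is defined by the dataset:

```python
# Hypothetical extractor: the paths below are placeholders, not the actual
# O1 NRM schema; a real implementation would walk the dataset's structure.
def extract_o1_semantics(config: dict) -> dict:
    out = {}
    for node in config.get("attributes", {}).get("RRMPolicyRatio", []):
        # Score the telecom-meaningful values, not the volatile wrapper fields.
        out[f"prb_max_{node.get('id')}"] = node.get("rRMPolicyMaxRatio")
    for snssai in config.get("sNSSAIList", []):
        out[f"snssai_{snssai.get('sst')}"] = snssai.get("sd")
    return out

def semantic_agreement(pred: dict, ref: dict) -> float:
    """Fraction of reference semantic values reproduced exactly."""
    p, r = extract_o1_semantics(pred), extract_o1_semantics(ref)
    if not r:
        return 1.0
    return sum(1 for k, v in r.items() if p.get(k) == v) / len(r)
```

Unlike flat field F1, a scorer of this shape would stay insensitive to structural wrapper noise and would directly expose which value families (ratios, slice encodings, identities) are failing.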

### Updated project status

Primary model: **stage 1 Qwen3-8B QLoRA adapter**

Stage 2 status: **diagnostic / not promoted**

Current best headline metrics remain the stage-1 normalized results:

| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|---|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 |