Update paper tables with zero-shot baseline
paper/tables.md +159 -0
@@ -0,0 +1,159 @@
# Paper Tables

This file contains draft tables for the manuscript.

---

## Table 1: Research dataset splits

| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | Unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | Validation during training |
| `test_in_distribution` | 1,455 | In-distribution test |
| `test_template_ood` | 3,503 | Held-out prompt-template family |
| `test_use_case_ood` | 4,341 | Held-out use cases |
| `test_sector_ood` | 4,579 | Held-out sectors |
| `test_adversarial` | 33 | Held-out adversarial rejection examples |
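
As a quick sanity check on Table 1, the sketch below recomputes a few totals from the listed row counts. The dictionary name `SPLIT_ROWS` and the derived variable names are illustrative assumptions. The file does not say that `train_sota` is a strict superset of `train_base`, so the first quantity is only the row-count difference between the two training splits.

```python
# Row counts copied from Table 1.
SPLIT_ROWS = {
    "train_base": 26_357,
    "train_sota": 32_357,
    "validation": 1_547,
    "test_in_distribution": 1_455,
    "test_template_ood": 3_503,
    "test_use_case_ood": 4_341,
    "test_sector_ood": 4_579,
    "test_adversarial": 33,
}

# Row-count difference between the augmented and unaugmented training splits.
extra_train_rows = SPLIT_ROWS["train_sota"] - SPLIT_ROWS["train_base"]

# Total rows across all held-out test splits.
test_rows = sum(n for name, n in SPLIT_ROWS.items() if name.startswith("test_"))

# Rows in the three OOD test splits only.
ood_rows = sum(n for name, n in SPLIT_ROWS.items() if name.endswith("_ood"))

print(extra_train_rows, test_rows, ood_rows)
```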

---

## Table 2: Qwen3 token-length audit

| Statistic | Tokens |
|---|---:|
| Mean | 754.1 |
| p50 | 705 |
| p95 | 1293 |
| p99 | 1300 |
| Max | 1316 |
| Fit under 2048 | 100% |
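
The statistics in Table 2 can be reproduced from a list of per-example token counts. The following is a minimal sketch, assuming the counts have already been obtained by tokenizing each formatted example; the tokenizer call itself is omitted. The function names, the nearest-rank percentile definition, the inclusive `<=` comparison, and the toy lengths are all assumptions, not the audit script used for the paper.

```python
import math
from statistics import mean


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an ascending list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def audit(lengths, max_len=2048):
    """Summarize token lengths in the same shape as Table 2."""
    vals = sorted(lengths)
    return {
        "mean": round(mean(vals), 1),
        "p50": percentile(vals, 50),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "max": vals[-1],
        # Fraction of examples that fit within the training max length.
        "fit_under_max_len": sum(v <= max_len for v in vals) / len(vals),
    }


# Placeholder lengths for illustration only; not the real dataset.
toy_lengths = [700, 705, 710, 1290, 1316]
report = audit(toy_lengths)
print(report)
```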

---

## Table 3: Stage-1 training configuration

| Item | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Training method | QLoRA SFT |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | `all-linear` |
| Max length | 2048 |
| Loss | Assistant-only SFT loss |
| Learning rate | 2e-4 |
| Scheduler | constant |
| Optimizer | paged AdamW 32-bit |
| Gradient checkpointing | enabled |
| Hardware | RTX 6000 Ada 48/50GB |
| Train split | `train_sota` |
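
For readers who want the Table 3 settings in machine-readable form, here is a minimal sketch that mirrors them in a plain dataclass. The class and field names are illustrative assumptions and do not come from the training code. The `lora_scaling` property assumes standard LoRA, where the low-rank update is scaled by alpha / rank; the `max_observed_tokens` default is the Max value from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Config:
    """Hypothetical container for the values listed in Table 3."""

    base_model: str = "Qwen/Qwen3-8B"
    quant_type: str = "nf4"
    double_quant: bool = True
    lora_rank: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    target_modules: str = "all-linear"
    max_length: int = 2048
    learning_rate: float = 2e-4
    lr_scheduler: str = "constant"
    optimizer: str = "paged_adamw_32bit"
    gradient_checkpointing: bool = True
    train_split: str = "train_sota"
    max_observed_tokens: int = 1316  # Max row of Table 2

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the adapter update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def length_headroom(self) -> int:
        # Tokens left between the longest audited example and max_length.
        return self.max_length - self.max_observed_tokens


cfg = Stage1Config()
print(cfg.lora_scaling, cfg.length_headroom)
```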

---

## Table 4: Stage-1 raw metrics

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

---

## Table 5: Stage-1 normalized metrics

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
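
This file does not define the normalized metrics in Table 5, so the sketch below shows only one plausible reading: flatten each JSON object into path/value pairs, drop volatile fields, and compute set-based F1 over the pairs (field F1) and over the paths alone (key F1). The `VOLATILE_KEYS` set, the function names, and the example objects are hypothetical. In the toy example one value is wrong but every key is present, which gives a key F1 of 1.0 and a lower field F1, the same qualitative pattern as the table.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value}."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


# Hypothetical names of fields treated as volatile (IDs, links, timestamps).
VOLATILE_KEYS = {"id", "href", "creationDate"}


def normalize(flat):
    """Drop entries whose final path segment is a volatile field."""
    return {p: v for p, v in flat.items() if p.split(".")[-1] not in VOLATILE_KEYS}


def f1(pred_items, gold_items):
    """Set-based F1 between predicted and gold items."""
    pred_set, gold_set = set(pred_items), set(gold_items)
    if not pred_set and not gold_set:
        return 1.0
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def score(pred, gold):
    p, g = normalize(flatten(pred)), normalize(flatten(gold))
    return {
        "field_f1": f1(p.items(), g.items()),  # path and value must match
        "key_f1": f1(p.keys(), g.keys()),      # only the path must match
        "exact": float(p == g),
    }


gold = {"id": "a1", "name": "slice", "kpi": {"latencyMs": 10, "throughputMbps": 100}}
pred = {"id": "zz", "name": "slice", "kpi": {"latencyMs": 10, "throughputMbps": 50}}
result = score(pred, gold)
print(result)
```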

---

## Table 6: Stage-1 strong and weak target layers

| Target layer | Normalized field F1 range | Interpretation |
|---|---:|---|
| `tmf921` | 0.93–0.94 | Strong high-level intent object generation |
| `camara` | 0.81–0.87 | Strong after volatile-field normalization |
| `intent_3gpp` | 0.80–0.82 | Strong/moderate |
| `etsi_zsm` | 0.75–0.79 | Moderate/strong |
| `a1_policy` | 0.67–0.68 | Moderate, value fidelity remains limited |
| `o1_nrm` | 0.39–0.40 | Weak value fidelity despite correct structure |
| `tmf921_lifecycle_report` | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
| `tmf921_lifecycle_monitor` | 0.39–0.52 | Weak/mixed |

---

## Table 7: Stage 1 vs Stage 2 global comparison

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta (field F1) | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta (key F1) |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Decision: Stage 2 is kept as a diagnostic run and is not promoted, because it does not improve on Stage 1 on either global metric for any split.

---

## Table 8: Stage 1 vs Stage 2 weak-layer comparison

| Split | Layer | Stage 1 | Stage 2 | Delta |
|---|---|---:|---:|---:|
| ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
| ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
| ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
| ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
| ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
| Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
| Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
| Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
| Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
| Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
| Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
| Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
| Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
| Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
| Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
| Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
| Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
| Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
| Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
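
Some Delta values in Table 8 differ by 0.0001 from the difference of the two rounded scores, which would be expected if the deltas were computed from unrounded scores (an assumption, since the file does not say). A minimal consistency check along those lines is sketched below on a handful of rows copied from the table. The tolerance of 1.5e-4 (one unit in the fourth decimal place plus floating-point slack) and all names are illustrative choices.

```python
# (split, layer, stage1, stage2, reported_delta) for a few rows of Table 8.
ROWS = [
    ("ID", "o1_nrm", 0.3927, 0.3906, -0.0021),
    ("ID", "tmf921_lifecycle_report", 0.1667, 0.1889, +0.0222),
    ("Template OOD", "a1_policy", 0.6763, 0.6758, -0.0004),
    ("Use-case OOD", "tmf921_lifecycle_report", 0.1531, 0.1981, +0.0450),
    ("Sector OOD", "tmf921_lifecycle_monitor", 0.4310, 0.4696, +0.0385),
]

# One unit in the last reported decimal place, plus float slack.
TOLERANCE = 1.5e-4


def inconsistent_rows(rows, tol=TOLERANCE):
    """Return (split, layer) pairs whose delta disagrees with stage2 - stage1."""
    return [
        (split, layer)
        for split, layer, stage1, stage2, delta in rows
        if abs((stage2 - stage1) - delta) > tol
    ]


flagged = inconsistent_rows(ROWS)
print(flagged)
```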

---

## Table 9: Zero-shot Qwen3-8B baseline vs fine-tuned QLoRA

The zero-shot baseline was evaluated on 200 examples per split. Note that `test_adversarial` contains only 33 rows (Table 1), so at most 33 examples could be scored there. Fine-tuned Stage-1 results are computed on the full splits, so the two sets of columns are not measured on identical examples.

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.0000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.0000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.0000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.0000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
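
The parse columns in Table 9 depend on how examples were sampled and on what counts as a successful parse, neither of which is specified in this file. The sketch below assumes a seeded subsample capped at the split size and strict parsing with `json.loads` on the raw model output; the function names, the seed, and the sample outputs are hypothetical.

```python
import json
import random


def subsample(examples, n=200, seed=0):
    """Return up to n examples, sampled reproducibly without replacement."""
    if len(examples) <= n:
        return list(examples)
    return random.Random(seed).sample(examples, n)


def json_parse_rate(outputs):
    """Fraction of raw model outputs that are valid JSON documents."""
    def parses(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False

    return sum(parses(o) for o in outputs) / len(outputs) if outputs else 0.0


# Illustrative outputs: two valid JSON strings, one with a chatty
# preamble, and one truncated object.
outputs = [
    '{"a": 1}',
    'Sure! Here is the JSON: {"a": 1}',
    '{"a": ',
    "[]",
]
rate = json_parse_rate(outputs)
print(rate, len(subsample(list(range(33)))), len(subsample(list(range(1455)))))
```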

---

## Table 10: Limitations summary

| Limitation | Impact | Mitigation / future work |
|---|---|---|
| Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
| No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
| O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
| A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
| Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
| Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |
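
The tolerance scoring proposed in Table 10 for lifecycle measurement fields could take the following form. This is a minimal sketch, not an implemented scorer: the function name and the 5% relative tolerance are arbitrary assumptions, and non-numeric values still require exact equality.

```python
import math


def values_match(pred, gold, rel_tol=0.05):
    """Compare two leaf values, allowing numeric values to differ slightly."""
    # bool is a subclass of int in Python, so handle it before the numeric case.
    if isinstance(pred, bool) or isinstance(gold, bool):
        return pred is gold
    if isinstance(pred, (int, float)) and isinstance(gold, (int, float)):
        return math.isclose(pred, gold, rel_tol=rel_tol)
    return pred == gold


# A simulated measurement that is close to the reference counts as a match,
# while a value that is off by a factor of two does not.
print(values_match(98.7, 99.0), values_match(10, 20), values_match("active", "active"))
```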