nraptisss committed · Commit 045a049 · verified · 1 Parent(s): 0e636fc

Update paper tables with zero-shot baseline

Files changed (1)
  1. paper/tables.md +159 -0
paper/tables.md CHANGED
@@ -0,0 +1,159 @@
+ # Paper Tables
+
+ This file contains draft tables for the manuscript.
+
+ ---
+
+ ## Table 1: Research dataset splits
+
+ | Split | Rows | Purpose |
+ |---|---:|---|
+ | `train_base` | 26,357 | Unaugmented training after OOD holdouts |
+ | `train_sota` | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
+ | `validation` | 1,547 | Validation during training |
+ | `test_in_distribution` | 1,455 | In-distribution test |
+ | `test_template_ood` | 3,503 | Held-out prompt-template family |
+ | `test_use_case_ood` | 4,341 | Held-out use cases |
+ | `test_sector_ood` | 4,579 | Held-out sectors |
+ | `test_adversarial` | 33 | Held-out adversarial rejection examples |
+
+ ---
+
+ ## Table 2: Qwen3 token-length audit
+
+ | Statistic | Tokens |
+ |---|---:|
+ | Mean | 754.1 |
+ | p50 | 705 |
+ | p95 | 1293 |
+ | p99 | 1300 |
+ | Max | 1316 |
+ | Fit under 2048 | 100% |
+
+ ---
+
+ ## Table 3: Stage-1 training configuration
+
+ | Item | Value |
+ |---|---|
+ | Base model | `Qwen/Qwen3-8B` |
+ | Training method | QLoRA SFT |
+ | Quantization | 4-bit NF4 + double quantization |
+ | LoRA rank | 64 |
+ | LoRA alpha | 16 |
+ | LoRA dropout | 0.05 |
+ | Target modules | `all-linear` |
+ | Max length | 2048 |
+ | Loss | Assistant-only SFT loss |
+ | Learning rate | 2e-4 |
+ | Scheduler | constant |
+ | Optimizer | paged AdamW 32-bit |
+ | Gradient checkpointing | enabled |
+ | Hardware | RTX 6000 Ada 48/50GB |
+ | Train split | `train_sota` |
+
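As a reading aid for Table 3, the sketch below collects the listed values as plain keyword dictionaries. The key names follow the constructor arguments of `peft.LoraConfig`, `transformers.BitsAndBytesConfig`, and `transformers.TrainingArguments`. The grouping and the compute dtype are assumptions, since the commit does not show the training script. The maximum length and the assistant-only loss are left out because their parameter names differ across TRL versions.

```python
# Values copied from Table 3; dictionary names are illustrative only.

# Keyword arguments in the style of peft.LoraConfig.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}

# Keyword arguments in the style of transformers.BitsAndBytesConfig.
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    # Assumption: the compute dtype is not stated in Table 3.
    "bnb_4bit_compute_dtype": "bfloat16",
}

# Keyword arguments in the style of transformers.TrainingArguments.
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "constant",
    "optim": "paged_adamw_32bit",
    "gradient_checkpointing": True,
}

# With standard (non rank-stabilized) LoRA, the adapter update is
# scaled by lora_alpha / r.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scaling factor: {lora_scaling}")
```

With rank 64 and alpha 16, the standard LoRA scaling factor is 0.25, so the adapter update is down-weighted relative to the common choice of alpha equal to or larger than the rank.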
+ ---
+
+ ## Table 4: Stage-1 raw metrics
+
+ | Split | JSON parse | Exact match | Field F1 | KPI presence |
+ |---|---:|---:|---:|---:|
+ | `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
+ | `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
+ | `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
+ | `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
+ | `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
+
+ ---
+
+ ## Table 5: Stage-1 normalized metrics
+
+ | Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
+ |---|---:|---:|---:|---:|
+ | `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
+ | `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
+ | `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
+ | `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
+ | `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
+
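The commit does not include the scorer behind the normalized metrics, so the following is only a minimal sketch of how a normalized field F1 and key F1 could be computed. It assumes that normalization means dropping volatile keys before comparison and that fields are compared as flattened path/value pairs. The `VOLATILE_KEYS` set, the function names, and the example objects are all hypothetical.

```python
import json

# Hypothetical volatile keys; the source does not list which fields are normalized.
VOLATILE_KEYS = {"id", "href", "creationDate", "lastUpdate"}


def normalize(obj):
    """Recursively drop volatile keys from a parsed JSON value."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items() if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj


def flatten(obj, prefix=""):
    """Map every scalar leaf to its path, e.g. {'/kpi/latencyMs': 10}."""
    if isinstance(obj, dict):
        pairs = {}
        for key, value in obj.items():
            pairs.update(flatten(value, f"{prefix}/{key}"))
        return pairs
    if isinstance(obj, list):
        pairs = {}
        for index, value in enumerate(obj):
            pairs.update(flatten(value, f"{prefix}/{index}"))
        return pairs
    return {prefix: obj}


def f1(predicted, gold):
    """Set-based F1 between predicted and gold items."""
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(predicted_json, gold_json):
    """Return (field F1 over path/value pairs, key F1 over paths only)."""
    pred = flatten(normalize(predicted_json))
    gold = flatten(normalize(gold_json))
    key_f1 = f1(set(pred), set(gold))
    field_f1 = f1(
        {(path, json.dumps(value)) for path, value in pred.items()},
        {(path, json.dumps(value)) for path, value in gold.items()},
    )
    return field_f1, key_f1


# Illustrative objects, not taken from the dataset.
gold = {"id": "intent-001", "name": "urllc-slice",
        "kpi": {"latencyMs": 10, "reliability": 0.999}}
pred = {"id": "intent-xyz", "name": "urllc-slice",
        "kpi": {"latencyMs": 20, "reliability": 0.999}}
field_f1, key_f1 = score(pred, gold)
print(field_f1, key_f1)
```

In this toy case the differing `id` is ignored, the structure matches fully (key F1 of 1.0), and one wrong value out of three lowers field F1 to 2/3. The same pattern, correct structure with imperfect values, is one plausible reading of the gap between key F1 and field F1 in Table 5.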
+ ---
+
+ ## Table 6: Stage-1 strong and weak target layers
+
+ | Target layer | Normalized field F1 range | Interpretation |
+ |---|---:|---|
+ | `tmf921` | 0.93–0.94 | Strong high-level intent object generation |
+ | `camara` | 0.81–0.87 | Strong after volatile-field normalization |
+ | `intent_3gpp` | 0.80–0.82 | Strong/moderate |
+ | `etsi_zsm` | 0.75–0.79 | Moderate/strong |
+ | `a1_policy` | 0.67–0.68 | Moderate, value fidelity remains limited |
+ | `o1_nrm` | 0.39–0.40 | Weak value fidelity despite correct structure |
+ | `tmf921_lifecycle_report` | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
+ | `tmf921_lifecycle_monitor` | 0.39–0.52 | Weak/mixed |
+
+ ---
+
+ ## Table 7: Stage 1 vs Stage 2 global comparison
+
+ | Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Field F1 delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Key F1 delta |
+ |---|---:|---:|---:|---:|---:|---:|
+ | `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
+ | `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
+ | `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
+ | `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
+ | `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
+
+ Decision: Stage 2 is kept as a diagnostic run and is not promoted, since it does not improve on Stage 1 for any split in this table.
+
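Some deltas in Table 7 differ by 0.0001 from the difference of the two displayed scores (for example, 0.7952 - 0.7956 = -0.0004 against a reported -0.0003). This is consistent with deltas computed from unrounded scores, although the commit does not say so. The sketch below checks that every reported delta agrees with the displayed scores within rounding tolerance and that no global delta is positive. The function names and the tolerance are assumptions.

```python
# (split, stage 1 score, stage 2 score, reported delta), copied from Table 7.
FIELD_F1 = [
    ("test_in_distribution", 0.7956, 0.7952, -0.0003),
    ("test_template_ood", 0.7865, 0.7855, -0.0010),
    ("test_use_case_ood", 0.7907, 0.7895, -0.0012),
    ("test_sector_ood", 0.7697, 0.7694, -0.0002),
    ("test_adversarial", 0.9697, 0.9596, -0.0101),
]
KEY_F1 = [
    ("test_in_distribution", 0.9811, 0.9796, -0.0014),
    ("test_template_ood", 0.9801, 0.9786, -0.0015),
    ("test_use_case_ood", 0.9805, 0.9787, -0.0018),
    ("test_sector_ood", 0.9818, 0.9809, -0.0009),
    ("test_adversarial", 1.0000, 0.9697, -0.0303),
]

# Three values, each rounded to 4 decimals, can disagree by up to 1.5e-4.
TOLERANCE = 1.5e-4


def inconsistent_rows(rows, tol=TOLERANCE):
    """Return splits whose reported delta cannot be explained by rounding."""
    return [
        split
        for split, stage1, stage2, delta in rows
        if abs((stage2 - stage1) - delta) > tol + 1e-12
    ]


def stage2_never_better(rows):
    """True if no reported delta is positive."""
    return all(delta <= 0 for _, _, _, delta in rows)


print(inconsistent_rows(FIELD_F1), inconsistent_rows(KEY_F1))
print(stage2_never_better(FIELD_F1), stage2_never_better(KEY_F1))
```

Both lists come back empty and both checks are true, which matches the decision not to promote Stage 2 on the global metrics.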
+ ---
+
+ ## Table 8: Stage 1 vs Stage 2 weak-layer comparison
+
+ | Split | Layer | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta |
+ |---|---|---:|---:|---:|
+ | ID | `o1_nrm` | 0.3927 | 0.3906 | -0.0021 |
+ | ID | `a1_policy` | 0.6837 | 0.6787 | -0.0050 |
+ | ID | `tmf921_lifecycle_report` | 0.1667 | 0.1889 | +0.0222 |
+ | ID | `tmf921_lifecycle_monitor` | 0.5172 | 0.4926 | -0.0246 |
+ | ID | `tmf921_lifecycle_scale` | 0.9345 | 0.9453 | +0.0108 |
+ | Template OOD | `o1_nrm` | 0.3976 | 0.3993 | +0.0017 |
+ | Template OOD | `a1_policy` | 0.6763 | 0.6758 | -0.0004 |
+ | Template OOD | `tmf921_lifecycle_report` | 0.1799 | 0.1905 | +0.0106 |
+ | Template OOD | `tmf921_lifecycle_scale` | 0.5363 | 0.5560 | +0.0197 |
+ | Use-case OOD | `o1_nrm` | 0.3936 | 0.3895 | -0.0042 |
+ | Use-case OOD | `a1_policy` | 0.6808 | 0.6786 | -0.0023 |
+ | Use-case OOD | `tmf921_lifecycle_report` | 0.1531 | 0.1981 | +0.0450 |
+ | Use-case OOD | `tmf921_lifecycle_monitor` | 0.3875 | 0.4187 | +0.0312 |
+ | Use-case OOD | `tmf921_lifecycle_scale` | 0.6993 | 0.7411 | +0.0418 |
+ | Sector OOD | `o1_nrm` | 0.3858 | 0.3888 | +0.0029 |
+ | Sector OOD | `a1_policy` | 0.6740 | 0.6763 | +0.0023 |
+ | Sector OOD | `tmf921_lifecycle_report` | 0.1763 | 0.1830 | +0.0067 |
+ | Sector OOD | `tmf921_lifecycle_monitor` | 0.4310 | 0.4696 | +0.0385 |
+ | Sector OOD | `tmf921_lifecycle_scale` | 0.7279 | 0.7437 | +0.0158 |
+
+ ---
+
+ ## Table 9: Zero-shot Qwen3-8B baseline vs fine-tuned QLoRA
+
+ The zero-shot baseline was evaluated on 200 examples per split, whereas the fine-tuned Stage-1 results are full-split metrics, so the two sets of columns are not computed on identical example sets. `test_adversarial` has only 33 rows in Table 1, so its zero-shot sample is necessarily smaller than 200.
+
+ | Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
+ |---|---:|---:|---:|---:|---:|---:|
+ | ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
+ | Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
+ | Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
+ | Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
+ | Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
+
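Because the zero-shot numbers come from 200-example samples, their sampling uncertainty can be estimated. The sketch below computes a 95% Wilson score interval for the ID zero-shot parse rate. The count of 67 successes is inferred from 0.335 × 200 and is not stated in the commit. The function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 0.335 parse rate on 200 examples corresponds to 67 parsed outputs.
low, high = wilson_interval(67, 200)
print(f"ID zero-shot parse rate: 0.335, 95% interval [{low:.3f}, {high:.3f}]")
```

The interval is roughly 0.27 to 0.40, far below the fine-tuned parse rate of 1.000, so the subsampling does not change the qualitative comparison in Table 9.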
+ ---
+
+ ## Table 10: Limitations summary
+
+ | Limitation | Impact | Mitigation / future work |
+ |---|---|---|
+ | Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
+ | No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
+ | O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
+ | A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
+ | Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
+ | Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |