Commit 0e636fc (verified) by nraptisss · 1 parent: d7752eb

Add zero-shot vs fine-tuned baseline summary

results/baselines/zero_shot_vs_finetuned.md CHANGED
@@ -1,6 +1,6 @@
  # Zero-shot Qwen3-8B vs Fine-tuned Qwen3-8B QLoRA
 
- Zero-shot baseline was evaluated on 200 examples per split. Fine-tuned results are full split metrics.
+ Zero-shot baseline was evaluated on 200 examples per split. Fine-tuned stage-1 results are full split metrics.
 
  | Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
  |---|---:|---:|---:|---:|---:|---:|
@@ -10,4 +10,10 @@ Zero-shot baseline was evaluated on 200 examples per split. Fine-tuned results a
  | Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
  | Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
 
- Conclusion: domain QLoRA fine-tuning is essential for structured telecom intent-to-config generation.
+ ## Interpretation
+
+ Zero-shot Qwen3-8B mostly fails structured telecom intent-to-configuration generation. Domain QLoRA fine-tuning is essential: it raises JSON parse rate from roughly one-third to near 100%, normalized key F1 from about 0.02 to about 0.98, and normalized field F1 from near zero to about 0.77-0.80 across non-adversarial ID/OOD splits.
+
+ ## Caveat
+
+ The zero-shot baseline is sampled at 200 examples per split for compute efficiency. Fine-tuned metrics are reported on the full evaluation splits. If a strict apples-to-apples comparison is required, rerun the fine-tuned model on the same sampled subset.
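
The caveat above could be addressed along the following lines. This is a minimal sketch, not the repository's evaluation code: the commit does not say how the 200 zero-shot examples were drawn, so the helper names `sample_subset` and `wilson_interval`, the seed, and the use of string example ids are all assumptions made for illustration. The first function fixes a reproducible subset that both models can be scored on; the second gives a rough sense of how much sampling noise a 200-example parse rate carries.

```python
import math
import random


def sample_subset(example_ids, k=200, seed=0):
    """Return a reproducible sample of at most k ids (hypothetical helper).

    Sorting first makes the result independent of the input order, so the
    zero-shot and fine-tuned runs select the same examples.
    """
    ids = sorted(example_ids)
    if len(ids) <= k:
        return ids
    rng = random.Random(seed)  # local RNG; global random state is untouched
    return sorted(rng.sample(ids, k))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Assumed ids; a real split would supply its own stable identifiers.
    split_ids = [f"ex-{i}" for i in range(1000)]
    subset = sample_subset(split_ids, k=200, seed=0)
    print(len(subset), subset[:3])

    # Sector OOD zero-shot parse rate 0.345 on 200 examples is 69 successes.
    low, high = wilson_interval(69, 200)
    print(f"parse rate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the Sector OOD zero-shot parse rate of 0.345 has a 95% interval of roughly 0.28 to 0.41, which is far from the fine-tuned 1.000, so the sampling mismatch is unlikely to change the qualitative conclusion. A strict comparison still requires scoring the fine-tuned model on the exact subset that was used for the zero-shot run.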