Add zero-shot vs fine-tuned baseline summary
results/baselines/zero_shot_vs_finetuned.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Zero-shot Qwen3-8B vs Fine-tuned Qwen3-8B QLoRA
|
| 2 |
|
| 3 |
-
Zero-shot baseline was evaluated on 200 examples per split. Fine-tuned results are full split metrics.
|
| 4 |
|
| 5 |
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|
| 6 |
|---|---:|---:|---:|---:|---:|---:|
|
|
@@ -10,4 +10,10 @@ Zero-shot baseline was evaluated on 200 examples per split. Fine-tuned results a
|
|
| 10 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
|
| 11 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
|
| 12 |
|
| 13 |
-
|
| 1 |
# Zero-shot Qwen3-8B vs Fine-tuned Qwen3-8B QLoRA
|
| 2 |
|
| 3 |
+
The zero-shot baseline was evaluated on 200 examples per split. Fine-tuned stage-1 results are full-split metrics.
|
| 4 |
|
| 5 |
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|
| 6 |
|---|---:|---:|---:|---:|---:|---:|
|
|
|
|
| 10 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
|
| 11 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
|
| 12 |
|
| 13 |
+
## Interpretation
|
| 14 |
+
|
| 15 |
+
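Zero-shot Qwen3-8B mostly fails at structured telecom intent-to-configuration generation. Domain QLoRA fine-tuning is essential: across the non-adversarial ID/OOD splits it raises the JSON parse rate from roughly one-third to near 100%, normalized key F1 from about 0.02 to about 0.98, and normalized field F1 from near zero to about 0.77-0.80. On the Adversarial split the zero-shot model produces no parseable output (parse rate 0.000), while the fine-tuned model reaches a parse rate of 1.000 and a normalized field F1 of 0.9697.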
|
| 16 |
+
|
| 17 |
+
## Caveat
|
| 18 |
+
|
| 19 |
+
The zero-shot baseline was evaluated on a sample of 200 examples per split to save compute, whereas fine-tuned metrics are reported on the full evaluation splits, so the two sets of columns are not computed over identical examples. If a strict apples-to-apples comparison is required, rerun the fine-tuned model on the same sampled subset.
|
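The matched rerun suggested in the caveat could be set up roughly as follows. This is a minimal sketch, not the procedure used for the reported numbers: it assumes each example carries a unique `id` field, and the helper names `sample_subset_ids` and `filter_to_subset`, the seed, and the toy data are all hypothetical. If the IDs of the original 200-example zero-shot samples were saved, those should be reused directly instead of resampling.

```python
import random


def sample_subset_ids(example_ids, n=200, seed=0):
    """Return a reproducible sample of up to n IDs from one split.

    Hypothetical helper: the seed and the ID scheme are assumptions,
    not taken from the original evaluation code.
    """
    # Sort first so the result does not depend on the input order.
    ids = sorted(example_ids)
    if len(ids) <= n:
        return ids
    rng = random.Random(seed)
    return sorted(rng.sample(ids, n))


def filter_to_subset(examples, subset_ids, id_key="id"):
    """Keep only the examples whose ID is in the shared subset."""
    keep = set(subset_ids)
    return [ex for ex in examples if ex[id_key] in keep]


# Toy split standing in for one evaluation split (assumed structure).
split = [{"id": f"ex-{i:04d}"} for i in range(1000)]
subset = sample_subset_ids([ex["id"] for ex in split])

# Both the zero-shot and the fine-tuned model would be scored on `shared`.
shared = filter_to_subset(split, subset)
print(len(shared))
```

Persisting the subset IDs alongside the results would let both models be scored on exactly the same examples in later runs.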