# v0.5 Chat-Tune Postmortem (2026-05-03)

The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at **36.9%** on CTIBench MCQ.
This document records the recovery attempts that followed and what they
actually changed about our understanding.

## Result table

| Run | Recipe | Steps | LR | Val | CTIBench MCQ |
|---|---|---|---|---|---|
| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | n/a | 19.0% |
| **chat-v3 (canonical)** | **Raw letter-only MCQ × 5** | 1500 | 5e-5 | n/a | **36.9%** |
| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | n/a | 25.0% |
| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | n/a | 32.5% |
| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | n/a | 17.1% |
| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
| **chat-v5 (this run)** | **Hybrid raw × 5 + CoT × 2 + small-talk × 8** | 2000 | 5e-5 | 2.990 | **34.8%** |

## What we learned

### What chat-v3 actually does

The 36.9% canonical is a *pattern-match shortcut*, not reasoning. With raw
letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in
'Answer:', emit a single letter consistent with the surface features of the
options." This is a known class of MCQ artifact (Answer Matching Outperforms
MCQ, arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by
exploiting the choice distribution without understanding the question.
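
To make "exploiting the choice distribution" concrete, here is a minimal sketch of the most degenerate version of that shortcut: always emit whichever letter was most common in training. The answer keys are hypothetical and deliberately skewed; this is not a measurement on CTIBench, and chat-v3 is presumably doing something richer (conditioning on option surface features), but the baseline shows how a scorer that never reads the question can still land well above chance.

```python
from collections import Counter


def majority_letter_accuracy(train_answers, eval_answers):
    """Score a 'model' that always emits the most common training letter."""
    most_common_letter, _ = Counter(train_answers).most_common(1)[0]
    hits = sum(1 for gold in eval_answers if gold == most_common_letter)
    return most_common_letter, hits / len(eval_answers)


# Hypothetical answer keys, skewed toward "B" for illustration only.
train_key = list("BBCBABDBCB")
eval_key = list("BCBBADBB")

letter, acc = majority_letter_accuracy(train_key, eval_key)
print(f"always answer {letter}: {acc:.1%}")  # always answer B: 62.5%
```

Running the same baseline against a real answer key would give a floor to compare any MCQ score against.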

### Why CoT-MCQ alone made it worse

`chat-recovered` (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1.
The CoT records have the format `"B. <1-2 sentence justification>"`, with the
reasoning generated by Qwen-14B. The hypothesis, drawn from Phi-3.5-mini and
OpenMath-Mini, was that reasoning supervision should outperform pattern-match
supervision even at low multipliers.
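
The two supervision formats differ only in the target string. The sketch below builds both for one record. The prompt layout, the helper names, and the sample question are assumptions for illustration; the source only states that the prompt ends in `Answer:` and that CoT targets look like `"B. <1-2 sentence justification>"`.

```python
def build_prompt(question, options):
    """Assumed prompt layout: question, lettered options, then 'Answer:'."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in options.items()]
    lines.append("Answer:")
    return "\n".join(lines)


def raw_target(letter):
    """Letter-only target, as in the chat-v3 recipe."""
    return letter


def cot_target(letter, justification):
    """CoT target: the letter followed by a short justification."""
    return f"{letter}. {justification}"


# Hypothetical record, not taken from CTIBench.
question = "Which MITRE ATT&CK tactic covers OS credential dumping?"
options = {
    "A": "Discovery",
    "B": "Credential Access",
    "C": "Exfiltration",
    "D": "Impact",
}

prompt = build_prompt(question, options)
print(prompt)
print(repr(raw_target("B")))
print(repr(cot_target("B", "Credential dumping is a Credential Access technique.")))
```

The raw target is a single token of signal per record, while the CoT target spreads the loss over a full sentence, which is the difference the next paragraph turns on.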

It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec
reasoning into useful weight updates, and it loses the letter-shortcut signal
in the process. This is a documented size effect: weaker students benefit from
coarser supervision, and long rationales over-smooth gradients (Skip-Thinking,
arXiv 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).

The × 30 small-talk multiplier compounded the damage by pushing task-data
share below 5% of the SFT mix, well outside the SmolLM2 reference of
≥ 20% task share.
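
The share arithmetic can be sketched as follows. The multipliers come from the result table; the per-source example counts are hypothetical placeholders (the real dataset sizes are not recorded here), chosen only so the `chat-recovered` case reproduces the "below 5%" figure. The sketch also assumes share is measured by example count rather than by tokens, and that the mix contains no other sources.

```python
def mix_shares(sources):
    """sources maps name -> (example_count, multiplier); returns name -> share."""
    weighted = {name: count * mult for name, (count, mult) in sources.items()}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


def task_share(sources, task_names):
    """Combined share of the sources that count as task data."""
    shares = mix_shares(sources)
    return sum(shares[name] for name in task_names)


# Example counts are HYPOTHETICAL; only the multipliers are from the table.
chat_recovered = {"cot_mcq": (2000, 1), "small_talk": (1500, 30)}
chat_v5 = {"raw_mcq": (2000, 5), "cot_mcq": (2000, 2), "small_talk": (1500, 8)}

recovered_share = task_share(chat_recovered, ["cot_mcq"])
v5_share = task_share(chat_v5, ["raw_mcq", "cot_mcq"])

print(f"chat-recovered task share: {recovered_share:.1%}")  # 4.3%
print(f"chat-v5 task share:        {v5_share:.1%}")         # 53.8%
```

Checking a planned mix against the 20% reference this way takes seconds and would have flagged the × 30 multiplier before training started.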

### Why chat-v4 (lr 2e-4) diverged

Our background research suggested that an undertrained backbone needs an
aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses 3e-4 SFT lr at
135M params. Scaled down to 36M with mean-init new tokens, 2e-4 was still too
hot: val loss climbed monotonically across 3 evals (3.175 → 3.285 → 3.403)
before we killed it at step 300.

Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new
embedding rows. The safe range is closer to 5e-5.
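
One way to catch this kind of divergence automatically is a simple monotonic-increase check on validation loss. This is a sketch of a plausible kill rule, not the rule actually used for the chat-v4 run (that decision was made by hand); the function name and the `patience` parameter are assumptions.

```python
def is_diverging(val_losses, patience=3):
    """Return True if the last `patience` evals are strictly increasing."""
    if len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Validation losses reported for the failed lr 2e-4 run.
history = [3.175, 3.285, 3.403]
print(is_diverging(history))  # True
```

With `patience=3` the check fires on two consecutive increases, which matches the point at which the run was stopped; a noisier eval would likely need a larger window or a minimum-delta threshold.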

### What chat-v5 got right (and didn't)

The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init
embeddings) lifted the score from 30.8% to 34.8%, a real **+4.0 point** gain
over the prior recovery attempt. But it still trails canonical by 2.1 points.

The hybrid was directionally right: keeping the letter-shortcut anchor
(raw × 5) preserved the discriminative signal, while CoT × 2 added some
reasoning supervision without over-rotating. Mean-init for new tokens kept
the residual stream stable.
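
For readers unfamiliar with mean-init, the idea is sketched below, assuming it means what it usually does: each new token's embedding row starts as the column-wise mean of the existing rows, so new tokens begin at a typical point in embedding space rather than at random noise. The function name is hypothetical and plain lists are used to keep the sketch dependency-free; the actual training code would apply the same operation to the model's embedding tensor.

```python
def mean_init_rows(embeddings, num_new_tokens):
    """Append `num_new_tokens` rows equal to the mean of the existing rows."""
    if not embeddings:
        raise ValueError("need at least one existing embedding row")
    n = len(embeddings)
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    # Copy the mean for each new token so rows can be trained independently.
    return embeddings + [list(mean_row) for _ in range(num_new_tokens)]


# Toy 2-token, 2-dimensional embedding table (illustrative values only).
table = [[1.0, 2.0], [3.0, 6.0]]
grown = mean_init_rows(table, 2)
print(grown)  # [[1.0, 2.0], [3.0, 6.0], [2.0, 4.0], [2.0, 4.0]]
```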

What it didn't fix: the letter-shortcut at × 5 is still doing most of the
work, and there's no mechanism in this recipe that actually transfers
*knowledge* into the model, only better calibration on top of the shortcut.
To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:

1. **Bigger model** (ghost-base ~350M) so reasoning supervision actually fits.
2. **Better pretrain coverage** of the CTIBench knowledge domain (more
   cyber threat intel, MITRE corpus depth) so the shortcut isn't the only
   path to a correct answer.
3. **Proper retrieval at inference** (RAG done right, not the chat-v4 RAFT
   attempt that conflated training-time and inference-time augmentation).

## Decision

- **Canonical stays:** `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
- **Ship chat-v5 separately:** push to `Ghostgim/GhostLM-v0.5-experimental`
  with this postmortem in the model card. Honest framing: "improved CoT
  hybrid recipe, still 2.1pt below canonical, primarily of research interest."
- **No more chat-tune iterations on v0.5.** The 36.9% ceiling is a
  pretrain-plus-capacity ceiling, not a recipe ceiling. Next swing should be
  ghost-base or a corpus-side fix, not another SFT permutation.

## Sources

- [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856)
- [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1)
- [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1)
- [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1)
- [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU)