| # v0.5 Chat-Tune Postmortem (2026-05-03) |
|
|
| The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at **36.9%** on CTIBench MCQ. |
| This document records the recovery attempts that followed and what they |
| actually changed about our understanding. |
|
|
| ## Result table |
|
|
| | Run | Recipe | Steps | LR | Val loss | CTIBench MCQ | |
| |---|---|---|---|---|---| |
| | chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | n/a | 19.0% | |
| | **chat-v3 (canonical)** | **Raw letter-only MCQ × 5** | 1500 | 5e-5 | n/a | **36.9%** | |
| | chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | n/a | 25.0% | |
| | chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | n/a | 32.5% | |
| | chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | n/a | 17.1% | |
| | chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% | |
| | chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed | |
| | **chat-v5 (this run)** | **Hybrid raw × 5 + CoT × 2 + small-talk × 8** | 2000 | 5e-5 | 2.990 | **34.8%** | |
|
|
| ## What we learned |
|
|
| ### What chat-v3 actually does |
|
|
| The 36.9% canonical is a *pattern-match shortcut*, not reasoning. With raw |
| letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends |
| in 'Answer:', emit a single letter consistent with the surface features of |
| the options." This is a known class of MCQ artifact (Answer Matching > MCQ, |
| arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by |
| exploiting the choice distribution without understanding the question. |
|
|
| ### Why CoT-MCQ alone made it worse |
|
|
| `chat-recovered` (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1. |
| The CoT records have the format `"B. <1-2 sentence justification>"`, with |
| the reasoning generated by Qwen-14B. The hypothesis, from Phi-3.5-mini and |
| OpenMath-Mini, was that reasoning supervision should outperform |
| pattern-match supervision even at low multipliers. |
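
To make the two supervision styles concrete, here is a minimal sketch of how a raw letter-only target and a CoT target might be built for the same question. The function names, the prompt layout, and the sample question are assumptions for illustration; only the trailing `Answer:` cue and the `"B. <justification>"` shape come from the description above.

```python
import string

# Hypothetical helpers: the real data pipeline is not shown in this postmortem.
LETTERS = string.ascii_uppercase


def build_prompt(question: str, options: list[str]) -> str:
    """Lay out the question and lettered options, ending in the 'Answer:' cue."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def raw_target(answer_idx: int) -> str:
    """Raw letter-only supervision: the target is just the answer letter."""
    return LETTERS[answer_idx]


def cot_target(answer_idx: int, justification: str) -> str:
    """CoT supervision: the letter followed by a short justification."""
    return f"{LETTERS[answer_idx]}. {justification.strip()}"


# Illustrative question, not taken from CTIBench.
prompt = build_prompt(
    "Which ATT&CK tactic does OS credential dumping fall under?",
    ["Discovery", "Credential Access", "Exfiltration", "Impact"],
)
print(prompt)
print(raw_target(1))
print(cot_target(1, "Dumping credentials is a way of obtaining account secrets."))
```

The raw target is a single token-sized signal, while the CoT target asks the model to also fit a full sentence, which is the extra load the next paragraph blames for the regression.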
|
|
| It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec |
| reasoning into useful weight updates, and it loses the letter-shortcut signal |
| in the process. Documented size effect: weaker students benefit from coarser |
| supervision; long rationales over-smooth gradients (Skip-Thinking, arXiv |
| 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001). |
|
|
| The 30 × small-talk multiplier compounded the damage by pushing task-data |
| share below 5% of the SFT mix, well outside the SmolLM2 reference of |
| ≥ 20% task share. |
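
The task-share arithmetic can be sketched as follows. The dataset sizes are placeholders (the real record counts are not given in this document, and the actual mix may contain other components), so the printed shares are illustrative only; the multipliers are the ones from the result table.

```python
def task_share(sizes: dict[str, int],
               multipliers: dict[str, int],
               task_keys: set[str]) -> float:
    """Fraction of the upsampled SFT mix that comes from task (MCQ) data."""
    weighted = {name: sizes[name] * multipliers.get(name, 1) for name in sizes}
    total = sum(weighted.values())
    task = sum(count for name, count in weighted.items() if name in task_keys)
    return task / total


# ASSUMPTION: equal placeholder sizes of 1000 records per source.
recovered = task_share(
    sizes={"cot_mcq": 1000, "small_talk": 1000},
    multipliers={"cot_mcq": 1, "small_talk": 30},
    task_keys={"cot_mcq"},
)
v5 = task_share(
    sizes={"raw_mcq": 1000, "cot_mcq": 1000, "small_talk": 1000},
    multipliers={"raw_mcq": 5, "cot_mcq": 2, "small_talk": 8},
    task_keys={"raw_mcq", "cot_mcq"},
)
print(f"chat-recovered task share: {recovered:.1%}")
print(f"chat-v5 task share:        {v5:.1%}")
```

With equal-sized sources, a × 1 task set against a × 30 small-talk set lands at roughly 3% task share, which is consistent with the "below 5%" figure above; the true value depends on the real dataset sizes.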
|
|
| ### Why chat-v4 (lr 2e-4) diverged |
|
|
| Prior research suggested that an undertrained backbone needs an aggressive |
| SFT lr to escape a bad pretrain basin, and SmolLM2 uses 3e-4 SFT lr at 135M |
| params. Scaled down to 36M with mean-init new tokens, 2e-4 was still too |
| hot: val loss climbed monotonically across 3 evals (3.175 → 3.285 → 3.403) |
| before we killed it at step 300. |
|
|
| Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new |
| embedding rows. The safe range is closer to 5e-5. |
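
The kill decision above was made by hand. A minimal sketch of an automated guard follows, assuming the rule is "stop when the last three val-loss evals are strictly increasing"; the function name and the `patience` parameter are hypothetical, not part of the training code.

```python
def should_kill(val_losses: list[float], patience: int = 3) -> bool:
    """Return True when the last `patience` evals are strictly increasing."""
    if len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Val losses recorded for the lr 2e-4 run before it was killed at step 300.
print(should_kill([3.175, 3.285, 3.403]))
```

A stricter variant could also require a minimum rise per eval, so that noise around a plateau does not trigger a stop.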
|
|
| ### What chat-v5 got right (and didn't) |
|
|
| The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init |
| embeddings) lifted the score from 30.8% to 34.8%, a real **+4.0 point** gain |
| over the prior recovery attempt. But it still trails canonical by 2.1 points. |
|
|
| The hybrid was directionally right: keeping the letter-shortcut anchor |
| (raw × 5) preserved the discriminative signal, while CoT × 2 added some |
| reasoning supervision without over-rotating. Mean-init for new tokens kept |
| the residual stream stable. |
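
For readers unfamiliar with mean-init, here is a minimal pure-Python sketch, assuming it means setting each newly added embedding row to the element-wise mean of the existing rows. The function name is hypothetical, and a real implementation would operate on the framework's embedding weight tensor rather than nested lists.

```python
def mean_init_rows(embeddings: list[list[float]],
                   num_new: int) -> list[list[float]]:
    """Append `num_new` rows, each equal to the mean of the existing rows."""
    n_rows = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n_rows for j in range(dim)]
    # Copy the mean per new row so later updates to one row do not alias another.
    return embeddings + [list(mean) for _ in range(num_new)]


# Toy 2-token, 2-dim table extended with two new chat tokens.
table = [[1.0, 3.0], [3.0, 5.0]]
print(mean_init_rows(table, num_new=2))
```

Starting new rows at the mean keeps them at a typical scale and direction for the existing table, which is one plausible reason the run avoided early instability.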
|
|
| What it didn't fix: the letter-shortcut at × 5 is still doing most of the |
| work, and there's no mechanism in this recipe that actually transfers |
| *knowledge* into the model, only better calibration on top of the shortcut. |
| To beat 36.9% durably, the lever isn't another SFT recipe. It's one of: |
|
|
| 1. **Bigger model** (ghost-base ~350M) so reasoning supervision actually fits. |
| 2. **Better pretrain coverage** of the CTIBench knowledge domain (more |
| cyber threat intel, MITRE corpus depth) so the shortcut isn't the only |
| path to a correct answer. |
| 3. **Proper retrieval at inference** (RAG done right, not the chat-v4 RAFT |
| attempt that conflated training-time and inference-time augmentation). |
|
|
| ## Decision |
|
|
| - **Canonical stays:** `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo. |
| - **Ship chat-v5 separately:** push to `Ghostgim/GhostLM-v0.5-experimental` |
| with this postmortem in the model card. Honest framing: "improved CoT |
| hybrid recipe, still 2.1pt below canonical, primarily of research interest." |
| - **No more chat-tune iterations on v0.5.** The 36.9% ceiling is a pretrain |
| + capacity ceiling, not a recipe ceiling. Next swing should be ghost-base |
| or a corpus-side fix, not another SFT permutation. |
|
|
| ## Sources |
|
|
| - [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856) |
| - [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1) |
| - [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1) |
| - [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1) |
| - [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU) |
|
|