	# v0.5 Chat-Tune Postmortem (2026-05-03)

	The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at 36.9% on CTIBench MCQ.
	This document records the recovery attempts that followed and what they
	actually changed about our understanding.

	## Result table

	| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
	|---|---|---|---|---|---|
	| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | n/a | 19.0% |
	| chat-v3 (canonical) | Raw letter-only MCQ × 5 | 1500 | 5e-5 | n/a | 36.9% |
	| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | n/a | 25.0% |
	| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | n/a | 32.5% |
	| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | n/a | 17.1% |
	| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
	| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
	| chat-v5 (this run) | Hybrid raw × 5 + CoT × 2 + small-talk × 8 | 2000 | 5e-5 | 2.990 | 34.8% |
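
The gaps quoted later in this postmortem (+4.0 over `chat-recovered`, 2.1 below canonical) follow directly from the table. A minimal sketch of that arithmetic, with scores copied from the rows above; the `SCORES` dictionary and `gap_to` helper are illustrative names, not part of the training code.

```python
# CTIBench MCQ accuracy (%) copied from the result table.
# The killed lr 2e-4 run has no score and is omitted.
SCORES = {
    "chat-v2": 19.0,
    "chat-v3 (canonical)": 36.9,
    "chat-v4 (RAFT)": 25.0,
    "chat (v0.5 base re-tune)": 32.5,
    "chat-long": 17.1,
    "chat-recovered": 30.8,
    "chat-v5 (this run)": 34.8,
}


def gap_to(reference: str, scores: dict[str, float]) -> dict[str, float]:
    """Return each run's score minus the reference run's score, in points."""
    ref = scores[reference]
    return {run: round(score - ref, 1) for run, score in scores.items()}


gaps = gap_to("chat-v3 (canonical)", SCORES)
# Print runs from best to worst relative to canonical.
for run, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{run:28s} {gap:+.1f}")
```

Every run other than canonical comes out negative, which is the quantitative core of the decision recorded at the end of this document.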

	## What we learned

	### What chat-v3 actually does

	The 36.9% canonical score is a pattern-match shortcut, not reasoning. With raw
	letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in
	'Answer:', emit a single letter consistent with the surface features of the
	options." This is a known class of MCQ artifact (Answer Matching Outperforms
	MCQ, arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by
	exploiting the choice distribution without understanding the question.

	### Why CoT-MCQ alone made it worse

	`chat-recovered` (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1.
	The CoT records have the format `"B. <1-2 sentence justification>"`, with the
	reasoning generated by Qwen-14B. The hypothesis, drawn from Phi-3.5-mini and
	OpenMath-Mini, was that reasoning supervision should outperform pattern-match
	supervision even at low multipliers.
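
The two supervision styles differ only in the target the model is trained to emit. The sketch below renders both for one record. The record fields, the option layout, and the example question are hypothetical; the source only specifies that the prompt ends in `Answer:` and that the CoT target looks like `"B. <1-2 sentence justification>"`.

```python
def build_prompt(question: str, options: dict[str, str]) -> str:
    """Render an MCQ prompt that ends in 'Answer:' (layout is an assumption)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer:")
    return "\n".join(lines)


def letter_only_target(answer: str) -> str:
    """Raw letter-only supervision: the target is just the option letter."""
    return f" {answer}"


def cot_target(answer: str, rationale: str) -> str:
    """CoT supervision: the letter followed by a short justification."""
    return f" {answer}. {rationale}"


# Hypothetical record, for illustration only.
record = {
    "question": "Which MITRE ATT&CK tactic covers credential dumping?",
    "options": {
        "A": "Discovery",
        "B": "Credential Access",
        "C": "Impact",
        "D": "Collection",
    },
    "answer": "B",
    "rationale": (
        "Credential dumping extracts account secrets, "
        "which falls under Credential Access."
    ),
}

prompt = build_prompt(record["question"], record["options"])
raw = letter_only_target(record["answer"])
cot = cot_target(record["answer"], record["rationale"])

print(prompt + raw)
print(prompt + cot)
# Rough proxy for how much text each target asks the model to fit.
print(len(raw.split()), "vs", len(cot.split()), "whitespace-separated words")
```

The letter-only target is a single short token of supervision per example, while the CoT target spreads the loss over a full sentence, which is the contrast the next paragraphs turn on.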

	It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec
	reasoning into useful weight updates, and it loses the letter-shortcut signal
	in the process. This matches a documented size effect: weaker students benefit
	from coarser supervision, and long rationales over-smooth gradients
	(Skip-Thinking, arXiv 2505.18642; Unveiling Key Factors for Distilling CoT,
	arXiv 2502.18001).

	The small-talk × 30 multiplier compounded the damage by pushing task-data
	share below 5% of the SFT mix, well outside the SmolLM2 reference of
	≥ 20% task share.
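
How a multiplier translates into mix share depends on the underlying dataset sizes, which this document does not list. The sketch below shows the calculation with placeholder counts; the numbers are assumptions chosen only to make the arithmetic visible, and shares are counted per example rather than per token.

```python
def mix_shares(sources: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map source name -> share of the SFT mix after applying multipliers.

    Each value is (num_examples, multiplier).
    """
    weighted = {name: n * mult for name, (n, mult) in sources.items()}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


# Example counts are placeholders, NOT the real dataset sizes.
recovered_mix = {
    "cot_mcq": (1000, 1),
    "small_talk": (1000, 30),
}
v5_mix = {
    "raw_mcq": (1000, 5),
    "cot_mcq": (1000, 2),
    "small_talk": (1000, 8),
}

recovered_shares = mix_shares(recovered_mix)
v5_shares = mix_shares(v5_mix)

recovered_task = recovered_shares["cot_mcq"]
v5_task = v5_shares["raw_mcq"] + v5_shares["cot_mcq"]

print(f"chat-recovered task share (placeholder sizes): {recovered_task:.1%}")
print(f"chat-v5 task share (placeholder sizes):        {v5_task:.1%}")
```

Plugging in the real example counts (or token counts, if the mix is token-weighted) gives the actual shares to compare against the ≥ 20% reference.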

	### Why chat-v4 (lr 2e-4) diverged

	The expectation from prior work was that an undertrained backbone needs an
	aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses 3e-4 SFT lr at
	135M params. Scaled down to 36M with mean-init new tokens, 2e-4 was still too
	hot: validation loss climbed monotonically across 3 evals
	(3.175 → 3.285 → 3.403) before we killed it at step 300.

	Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new
	embedding rows. The safe range is closer to 5e-5.
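
The kill decision here was made by hand. A minimal sketch of the same rule as an automatic check, assuming "diverging" means the most recent validation losses are strictly increasing; the function name and the `patience` default are illustrative choices, not settings from the actual training script.

```python
def is_diverging(val_losses: list[float], patience: int = 3) -> bool:
    """True if the last `patience` validation losses are strictly increasing."""
    if len(val_losses) < patience:
        return False  # not enough evals yet to judge
    window = val_losses[-patience:]
    return all(earlier < later for earlier, later in zip(window, window[1:]))


# Validation losses reported for the lr 2e-4 run.
failed_run = [3.175, 3.285, 3.403]
print(is_diverging(failed_run))
```

A stricter variant could also require a minimum total increase, so that small fluctuations around a plateau do not trigger a stop.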

	### What chat-v5 got right (and didn't)

	The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init
	embeddings) lifted the score from 30.8% → 34.8% — a real +4.0 point gain
	over the prior recovery attempt. But it still trails canonical by 2.1 points.

	The hybrid was directionally right — keeping the letter-shortcut anchor
	(raw × 5) preserved the discriminative signal, while CoT × 2 added some
	reasoning supervision without over-rotating. Mean-init for new tokens kept
	the residual stream stable.
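
Mean-init is taken here to mean setting each new token's embedding row to the column-wise mean of the existing rows; the source does not spell out the exact procedure, so treat this as an assumption. The sketch uses plain Python lists to stay self-contained, whereas a real run would apply the same operation to the model's embedding tensor.

```python
def mean_init_rows(
    embeddings: list[list[float]], num_new: int
) -> list[list[float]]:
    """Append `num_new` rows, each set to the column-wise mean of existing rows."""
    if not embeddings:
        raise ValueError("need at least one existing embedding row")
    n = len(embeddings)
    mean_row = [sum(col) / n for col in zip(*embeddings)]
    # Copy the mean row for each new token so rows can diverge during training.
    return embeddings + [list(mean_row) for _ in range(num_new)]


# Toy 2-dimensional table with two existing tokens.
table = [[1.0, 2.0], [3.0, 4.0]]
extended = mean_init_rows(table, num_new=2)
print(extended)
```

New rows start at the centre of the existing embedding distribution rather than at a random point, so the first forward passes through the new tokens produce activations of typical scale.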

	What it didn't fix: the letter-shortcut at × 5 is still doing most of the
	work, and there's no mechanism in this recipe that actually transfers
	knowledge into the model, only better calibration on top of the shortcut.
	To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:

	1. Bigger model (ghost-base ~350M) so reasoning supervision actually fits.
	2. Better pretrain coverage of the CTIBench knowledge domain (more
	cyber threat intel, MITRE corpus depth) so the shortcut isn't the only
	path to a correct answer.
	3. Proper retrieval at inference (RAG done right, not the chat-v4 RAFT
	attempt that conflated training-time and inference-time augmentation).
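
To make option 3 concrete: inference-time retrieval leaves the weights untouched and only changes what goes into the prompt. The sketch below is a toy version under stated assumptions. The word-overlap scorer, the prompt template, and the three-passage corpus are all hypothetical stand-ins; a real system would use a proper retriever over the CTI corpus.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (toy scorer)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, corpus: list[str], k: int = 2) -> str:
    """Prepend the top-k retrieved passages to the question at inference time."""
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "Credential dumping is a technique under the Credential Access tactic.",
    "Phishing is a common initial access technique.",
    "Ransomware encrypts files for impact.",
]
question = "Which tactic covers credential dumping?"
print(build_rag_prompt(question, corpus, k=1))
```

Nothing in this path touches training data, which is the separation the chat-v4 RAFT attempt did not maintain.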

	## Decision

	- Canonical stays: `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
	- Ship chat-v5 separately: push to `Ghostgim/GhostLM-v0.5-experimental`
	with this postmortem in the model card. Honest framing: "improved CoT
	hybrid recipe, still 2.1pt below canonical, primarily of research interest."
	- No more chat-tune iterations on v0.5. The 36.9% ceiling is a
	pretrain + capacity ceiling, not a recipe ceiling. Next swing should be
	ghost-base or a corpus-side fix, not another SFT permutation.

	## Sources

	- [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856)
	- [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1)
	- [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1)
	- [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1)
	- [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU)