Ghostgim committed
Commit 54d41ec · verified
1 Parent(s): dc6f3be

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +99 -0
  2. config.json +33 -0
  3. pytorch_model.pt +3 -0
  4. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,99 @@
+ # v0.5 Chat-Tune Postmortem (2026-05-03)
+
+ The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at **36.9%** on CTIBench MCQ.
+ This document records the recovery attempts that followed and what they
+ actually changed about our understanding.
+
+ ## Result table
+
+ | Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
+ |---|---|---|---|---|---|
+ | chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | - | 19.0% |
+ | **chat-v3 (canonical)** | **Raw letter-only MCQ × 5** | 1500 | 5e-5 | - | **36.9%** |
+ | chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | - | 25.0% |
+ | chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | - | 32.5% |
+ | chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | - | 17.1% |
+ | chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
+ | chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
+ | **chat-v5 (this run)** | **Hybrid raw × 5 + CoT × 2 + small-talk × 8** | 2000 | 5e-5 | 2.990 | **34.8%** |
+
+ ## What we learned
+
+ ### What chat-v3 actually does
+
+ The canonical 36.9% score is a *pattern-match shortcut*, not reasoning. With raw
+ letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in
+ 'Answer:', emit a single letter consistent with the surface features of the
+ options." This is a known class of MCQ artifact (Answer Matching > MCQ,
+ arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by
+ exploiting the choice distribution without understanding the question.
+
+ ### Why CoT-MCQ alone made it worse
+
+ `chat-recovered` (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1.
+ The CoT records have the format `"B. <1-2 sentence justification>"`, with the
+ reasoning generated by Qwen-14B. The hypothesis, from Phi-3.5-mini and
+ OpenMath-Mini, was that reasoning supervision should outperform pattern-match
+ supervision even at low multipliers.
+
+ It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec
+ reasoning into useful weight updates, and it loses the letter-shortcut signal
+ in the process. This is a documented size effect: weaker students benefit from
+ coarser supervision; long rationales over-smooth gradients (Skip-Thinking, arXiv
+ 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).
+
+ The × 30 small-talk multiplier compounded the damage by pushing task-data
+ share below 5% of the SFT mix, well outside the SmolLM2 reference of
+ ≥ 20% task share.
+
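The effect of multipliers on task-data share can be checked with a few lines of arithmetic. This is a minimal sketch, not the project's data loader: the function name `mix_shares` and the raw dataset sizes are hypothetical, since the postmortem does not state how many examples each source contains. Only the multipliers (× 1 and × 30) come from the result table.

```python
def mix_shares(sizes, multipliers):
    """Return each dataset's share of the SFT mix after upsampling.

    sizes: raw example counts per dataset.
    multipliers: how many times each dataset is repeated in the mix.
    """
    weighted = {name: sizes[name] * multipliers[name] for name in sizes}
    total = sum(weighted.values())
    return {name: count / total for name, count in weighted.items()}


# Hypothetical raw sizes (NOT from the postmortem): two equal-sized datasets.
sizes = {"cot_mcq": 1000, "small_talk": 1000}

# chat-recovered multipliers from the result table: CoT MCQ x 1, small-talk x 30.
recovered = mix_shares(sizes, {"cot_mcq": 1, "small_talk": 30})
print(f"task share: {recovered['cot_mcq']:.1%}")  # task share: 3.2%
```

Under the equal-size assumption, the task data ends up at roughly 3% of the mix, which is consistent with the "below 5%" figure above. The real share depends on the actual dataset sizes.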
+ ### Why chat-v4 (lr 2e-4) diverged
+
+ The research we consulted suggested that an undertrained backbone needs an
+ aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses 3e-4 SFT lr at
+ 135M params. Scaled down to 36M with mean-init new tokens, 2e-4 was still too
+ hot: val loss climbed monotonically across 3 evals (3.175 → 3.285 → 3.403)
+ before we killed it at step 300.
+
+ Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new
+ embedding rows. The safe range is closer to 5e-5.
+
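The kill decision described above (three evals in a row with rising validation loss) can be expressed as a simple check. This is a sketch under assumptions: `is_diverging` and its `window` parameter are illustrative names, and the postmortem does not say whether the run was stopped by hand or by an automated rule.

```python
def is_diverging(val_losses, window=3):
    """True if the last `window` validation losses are strictly increasing."""
    if len(val_losses) < window:
        return False
    recent = val_losses[-window:]
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


# Validation losses reported for the lr 2e-4 run.
history = [3.175, 3.285, 3.403]
print(is_diverging(history))  # True
```

A strict-increase rule is deliberately blunt. A noisier run might call for a tolerance or a longer window before stopping.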
+ ### What chat-v5 got right (and didn't)
+
+ The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init
+ embeddings) lifted the score from 30.8% to 34.8%, a real **+4.0 point** gain
+ over the prior recovery attempt. But it still trails canonical by 2.1 points.
+
+ The hybrid was directionally right: keeping the letter-shortcut anchor
+ (raw × 5) preserved the discriminative signal, while CoT × 2 added some
+ reasoning supervision without over-rotating. Mean-init for new tokens kept
+ the residual stream stable.
+
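The postmortem does not show how mean-init was implemented. A common reading, assumed here, is that each newly added token row starts as the column-wise mean of the existing embedding rows. The sketch below illustrates that idea in plain Python on a toy table. The helper name `mean_init_extend` and the values are hypothetical, and a real model would apply the same idea to its embedding weight tensor.

```python
def mean_init_extend(embedding, n_new):
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    n_rows = len(embedding)
    dim = len(embedding[0])
    mean_row = [sum(row[j] for row in embedding) / n_rows for j in range(dim)]
    # Copy the old rows so the caller's table is not mutated.
    return [list(row) for row in embedding] + [list(mean_row) for _ in range(n_new)]


# Toy 3 x 2 embedding table (hypothetical values), extended by 2 new token rows.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
extended = mean_init_extend(table, n_new=2)
print(extended[-1])  # [3.0, 2.0]
```

Starting new rows at the mean keeps them at the same scale as the trained rows. That is one plausible reason the recipe avoided destabilizing early SFT steps.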
+ What it didn't fix: the letter-shortcut at × 5 is still doing most of the
+ work, and there's no mechanism in this recipe that actually transfers
+ *knowledge* into the model; it only improves calibration on top of the
+ shortcut. To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:
+
+ 1. **Bigger model** (ghost-base ~350M) so reasoning supervision actually fits.
+ 2. **Better pretrain coverage** of the CTIBench knowledge domain (more
+    cyber threat intel, MITRE corpus depth) so the shortcut isn't the only
+    path to a correct answer.
+ 3. **Proper retrieval at inference** (RAG done right, not the chat-v4 RAFT
+    attempt that conflated training-time and inference-time augmentation).
+
+ ## Decision
+
+ - **Canonical stays:** `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
+ - **Ship chat-v5 separately:** push to `Ghostgim/GhostLM-v0.5-experimental`
+   with this postmortem in the model card. Honest framing: "improved CoT
+   hybrid recipe, still 2.1pt below canonical, primarily of research interest."
+ - **No more chat-tune iterations on v0.5.** The 36.9% ceiling is a
+   pretrain + capacity ceiling, not a recipe ceiling. Next swing should be
+   ghost-base or a corpus-side fix, not another SFT permutation.
+
+ ## Sources
+
+ - [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856)
+ - [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1)
+ - [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1)
+ - [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1)
+ - [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU)
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "vocab_size": 32007,
+   "context_length": 512,
+   "d_model": 512,
+   "n_heads": 8,
+   "n_layers": 6,
+   "d_ff": 2048,
+   "dropout": 0.1,
+   "bias": true,
+   "use_rope": true,
+   "rope_base": 10000.0,
+   "use_swiglu": true,
+   "use_rmsnorm": true,
+   "use_flash_attention": false,
+   "batch_size": 4,
+   "learning_rate": 5e-05,
+   "weight_decay": 0.1,
+   "beta1": 0.9,
+   "beta2": 0.95,
+   "grad_clip": 1.0,
+   "grad_accum_steps": 8,
+   "warmup_steps": 100,
+   "max_steps": 2000,
+   "eval_interval": 100,
+   "save_interval": 500,
+   "data_dir": "data/processed",
+   "checkpoint_dir": "checkpoints/phase8_chat_v05_v5",
+   "log_dir": "logs/phase8_chat_v05_v5",
+   "device": "auto",
+   "dtype": "float32",
+   "seed": 42,
+   "use_wandb": false
+ }
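As a rough reading of this config, the sketch below derives the effective batch size and an upper bound on tokens seen during the run. It rests on assumptions the commit does not confirm: that `batch_size` is the micro-batch per forward pass, that `grad_accum_steps` micro-batches are accumulated per optimizer step, that `max_steps` counts optimizer steps, and that every sequence is filled to `context_length`.

```python
import json

# The relevant fields, copied from config.json.
config = json.loads("""
{
  "context_length": 512,
  "batch_size": 4,
  "grad_accum_steps": 8,
  "max_steps": 2000
}
""")

# Assumption: batch_size is the micro-batch, accumulated grad_accum_steps times.
sequences_per_step = config["batch_size"] * config["grad_accum_steps"]

# Assumption: sequences are filled to context_length, so this is an upper bound.
tokens_per_step = sequences_per_step * config["context_length"]

# Assumption: max_steps counts optimizer steps, not micro-batches.
total_tokens = tokens_per_step * config["max_steps"]

print(sequences_per_step, tokens_per_step, total_tokens)  # 32 16384 32768000
```

If `max_steps` counts micro-batches instead, the total would be smaller by a factor of `grad_accum_steps`.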
pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0cf98025b870066c073cd1fe58efdd966e915495b7eadb21a181e8dd822b0771
+ size 428158001
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff