Upload folder using huggingface_hub
- README.md +99 -0
- config.json +33 -0
- pytorch_model.pt +3 -0
- tokenizer.json +0 -0
README.md
ADDED
@@ -0,0 +1,99 @@
# v0.5 Chat-Tune Postmortem (2026-05-03)

The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at **36.9%** on CTIBench MCQ.
This document records the recovery attempts that followed and what they
actually changed about our understanding.

## Result table

| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
|---|---|---|---|---|---|
| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | - | 19.0% |
| **chat-v3 (canonical)** | **Raw letter-only MCQ ×5** | 1500 | 5e-5 | - | **36.9%** |
| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | - | 25.0% |
| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | - | 32.5% |
| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | - | 17.1% |
| chat-recovered | CoT MCQ ×1 + small-talk ×30 | 1500 | 3e-5 | 2.808 | 30.8% |
| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
| **chat-v5 (this run)** | **Hybrid raw ×5 + CoT ×2 + small-talk ×8** | 2000 | 5e-5 | 2.990 | **34.8%** |

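
The comparisons quoted later in this postmortem (chat-v5 trailing canonical by 2.1 points and gaining 4.0 points over chat-recovered) follow directly from the table. A minimal sketch of that arithmetic, with scores copied from the rows above; the dictionary layout and the helper names `delta_points` and `ranked` are illustrative assumptions, not part of the training code:

```python
# CTIBench MCQ accuracy (%) copied from the result table.
# The killed chat-v4 (lr 2e-4) run has no score and is omitted.
SCORES = {
    "chat-v2": 19.0,
    "chat-v3 (canonical)": 36.9,
    "chat-v4 (RAFT)": 25.0,
    "chat (v0.5 base re-tune)": 32.5,
    "chat-long": 17.1,
    "chat-recovered": 30.8,
    "chat-v5": 34.8,
}


def delta_points(run, baseline, scores=SCORES):
    """Return the accuracy gap in percentage points, rounded to one decimal."""
    return round(scores[run] - scores[baseline], 1)


def ranked(scores=SCORES):
    """Return run names sorted from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


print(delta_points("chat-v5", "chat-v3 (canonical)"))  # gap to canonical
print(delta_points("chat-v5", "chat-recovered"))       # gain over prior recovery
print(ranked()[:3])                                    # top three runs
```

Rounding to one decimal matches the precision the table reports and avoids floating-point noise in the differences.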
## What we learned

### What chat-v3 actually does

The 36.9% canonical score is a *pattern-match shortcut*, not reasoning. With raw
letter-only MCQ at a ×5 multiplier, the model learns "after the prompt ends in
'Answer:', emit a single letter consistent with the surface features of the
options." This is a known class of MCQ artifact (Answer Matching > MCQ,
arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by
exploiting the choice distribution without understanding the question.
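
To make the shortcut concrete, here is a toy sketch of the simplest possible question-blind predictor: it never reads the question and always emits the answer letter that was most frequent in training. This is an illustration of the general idea, not a description of what chat-v3 computes internally. The answer keys below are invented for the example (they are not CTIBench data), and `majority_letter` and `accuracy` are hypothetical helper names.

```python
from collections import Counter


def majority_letter(train_answers):
    """Return the most frequent answer letter in the training key."""
    return Counter(train_answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of positions where the prediction equals the gold letter."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Invented, deliberately skewed answer keys (NOT CTIBench data).
train_key = list("BBCBABDBCB")
test_key = list("BCBBADBB")

letter = majority_letter(train_key)  # chosen without reading any question
preds = [letter] * len(test_key)
print(letter, accuracy(preds, test_key))
```

When the answer key is skewed, a predictor like this beats the 25% chance rate of a four-option MCQ without any understanding, which is why a raw MCQ score alone says little about reasoning.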
### Why CoT-MCQ alone made it worse

`chat-recovered` (30.8%) replaced the letter-only MCQ ×5 with CoT MCQ ×1.
The CoT records have the format `"B. <1-2 sentence justification>"`, with the
reasoning generated by Qwen-14B. The hypothesis, drawn from Phi-3.5-mini and
OpenMath-Mini, was that reasoning supervision should outperform pattern-match
supervision even at low multipliers.

It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec
reasoning into useful weight updates, and it loses the letter-shortcut signal
in the process. This is a documented size effect: weaker students benefit from
coarser supervision, and long rationales over-smooth gradients (Skip-Thinking,
arXiv 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).

The 30× small-talk multiplier compounded the damage by pushing the task-data
share below 5% of the SFT mix, well outside the SmolLM2 reference of
≥ 20% task share.
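
The task-data share can be estimated from per-source record counts and their multipliers. The sketch below is a rough illustration under loud assumptions: the record counts are hypothetical placeholders (this postmortem does not state them), the share is measured in records rather than tokens, and the mix is reduced to two sources. Only the multipliers (CoT MCQ ×1, small-talk ×30) come from the result table.

```python
def task_share(sources, task_names):
    """Share of the SFT mix taken by task data.

    sources: {name: (record_count, multiplier)}
    task_names: names in `sources` that count as task data.
    """
    weighted = {name: count * mult for name, (count, mult) in sources.items()}
    total = sum(weighted.values())
    return sum(weighted[name] for name in task_names) / total


# Record counts are HYPOTHETICAL placeholders; multipliers are from the table.
chat_recovered_mix = {
    "cot_mcq": (1000, 1),      # CoT MCQ x1
    "small_talk": (1000, 30),  # small-talk x30
}

share = task_share(chat_recovered_mix, ["cot_mcq"])
print(f"{share:.1%}")
```

With equal raw counts, a ×30 multiplier on the non-task source alone is enough to push the task share under 5%; the real figure depends on the actual dataset sizes.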
### Why chat-v4 (lr 2e-4) diverged

Our background research suggested that an undertrained backbone needs an
aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses a 3e-4 SFT lr
at 135M params. Scaled down to 36M with mean-init new tokens, 2e-4 was still
too hot: val loss climbed monotonically across 3 evals (3.175 → 3.285 → 3.403)
before we killed the run at step 300.

Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new
embedding rows. The safe range is closer to 5e-5.
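
The kill decision above amounts to a simple rule: stop when validation loss has risen at every one of the last few evals. A minimal sketch, assuming a hypothetical helper `is_diverging` and a patience of three evals; the actual training loop's stopping logic is not shown in this repo, so this is an illustration rather than the code that was run:

```python
def is_diverging(val_losses, patience=3):
    """True if the last `patience` val losses are strictly increasing."""
    if len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Val losses from the three evals of the lr 2e-4 run.
chat_v4_failed = [3.175, 3.285, 3.403]
print(is_diverging(chat_v4_failed))      # True: strictly rising
print(is_diverging([3.10, 3.02, 2.99]))  # False: falling, keep training
```

A strict-increase check is deliberately conservative; a noisier run might call for comparing against the best loss seen so far instead.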
### What chat-v5 got right (and didn't)

The hybrid recipe (raw ×5 + CoT ×2, small-talk ×8, lr 5e-5, mean-init
embeddings) lifted the score from 30.8% to 34.8%, a real **+4.0 point** gain
over the prior recovery attempt. But it still trails canonical by 2.1 points.

The hybrid was directionally right: keeping the letter-shortcut anchor
(raw ×5) preserved the discriminative signal, while CoT ×2 added some
reasoning supervision without over-rotating. Mean-init for new tokens kept
the residual stream stable.

What it didn't fix: the letter-shortcut at ×5 is still doing most of the
work, and there's no mechanism in this recipe that actually transfers
*knowledge* into the model, only better calibration on top of the shortcut.
To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:

1. **Bigger model** (ghost-base ~350M) so reasoning supervision actually fits.
2. **Better pretrain coverage** of the CTIBench knowledge domain (more
   cyber threat intel, MITRE corpus depth) so the shortcut isn't the only
   path to a correct answer.
3. **Proper retrieval at inference** (RAG done right, not the chat-v4 RAFT
   attempt that conflated training-time and inference-time augmentation).
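
Mean-init for new tokens, credited above with keeping the residual stream stable, means each newly added embedding row starts at the average of the existing rows rather than at a random vector. A minimal sketch in plain Python lists; the function name `extend_with_mean_rows` is hypothetical, and a real implementation would operate on the model's embedding tensor instead:

```python
def extend_with_mean_rows(embedding, num_new):
    """Append `num_new` rows, each the column-wise mean of the existing rows."""
    if not embedding:
        raise ValueError("embedding matrix must have at least one row")
    n_rows = len(embedding)
    mean_row = [sum(col) / n_rows for col in zip(*embedding)]
    # Copy the mean row so the new rows do not alias each other.
    existing = [list(row) for row in embedding]
    return existing + [list(mean_row) for _ in range(num_new)]


# Toy 3-token, 2-dim embedding table (illustrative values only).
old = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new = extend_with_mean_rows(old, num_new=2)
print(new[3], new[4])
```

Starting new rows at the mean keeps their norm and direction in the range the rest of the network already handles, which is the usual argument for preferring it over random initialization during a short fine-tune.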
## Decision

- **Canonical stays:** `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
- **Ship chat-v5 separately:** push to `Ghostgim/GhostLM-v0.5-experimental`
  with this postmortem in the model card. Honest framing: "improved CoT
  hybrid recipe, still 2.1pt below canonical, primarily of research interest."
- **No more chat-tune iterations on v0.5.** The 36.9% ceiling is a pretrain
  and capacity ceiling, not a recipe ceiling. Next swing should be ghost-base
  or a corpus-side fix, not another SFT permutation.
## Sources

- [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856)
- [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1)
- [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1)
- [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1)
- [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU)
config.json
ADDED
@@ -0,0 +1,33 @@
{
  "vocab_size": 32007,
  "context_length": 512,
  "d_model": 512,
  "n_heads": 8,
  "n_layers": 6,
  "d_ff": 2048,
  "dropout": 0.1,
  "bias": true,
  "use_rope": true,
  "rope_base": 10000.0,
  "use_swiglu": true,
  "use_rmsnorm": true,
  "use_flash_attention": false,
  "batch_size": 4,
  "learning_rate": 5e-05,
  "weight_decay": 0.1,
  "beta1": 0.9,
  "beta2": 0.95,
  "grad_clip": 1.0,
  "grad_accum_steps": 8,
  "warmup_steps": 100,
  "max_steps": 2000,
  "eval_interval": 100,
  "save_interval": 500,
  "data_dir": "data/processed",
  "checkpoint_dir": "checkpoints/phase8_chat_v05_v5",
  "log_dir": "logs/phase8_chat_v05_v5",
  "device": "auto",
  "dtype": "float32",
  "seed": 42,
  "use_wandb": false
}
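
A few derived quantities help when reading this config. The sketch below computes the per-head dimension, the number of sequences per optimizer step, and an upper bound on tokens seen. It assumes that `batch_size` counts sequences, that `max_steps` counts optimizer steps, that gradient accumulation simply multiplies the batch, and that every sequence fills the full `context_length`; the training script is not part of this commit, so all four are assumptions.

```python
import json

# Subset of the config.json fields that the arithmetic needs.
CONFIG = json.loads("""
{
  "context_length": 512,
  "d_model": 512,
  "n_heads": 8,
  "batch_size": 4,
  "grad_accum_steps": 8,
  "max_steps": 2000
}
""")


def derived(cfg):
    """Derive head size, effective batch, and a token upper bound."""
    if cfg["d_model"] % cfg["n_heads"] != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = cfg["d_model"] // cfg["n_heads"]
    sequences_per_step = cfg["batch_size"] * cfg["grad_accum_steps"]
    max_tokens_per_step = sequences_per_step * cfg["context_length"]
    return {
        "head_dim": head_dim,
        "sequences_per_step": sequences_per_step,
        "max_tokens_per_step": max_tokens_per_step,
        "max_tokens_total": max_tokens_per_step * cfg["max_steps"],
    }


stats = derived(CONFIG)
print(stats)
```

Under those assumptions the run sees at most about 32.8M tokens, fewer if sequences are shorter than the context length.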
pytorch_model.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0cf98025b870066c073cd1fe58efdd966e915495b7eadb21a181e8dd822b0771
size 428158001
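
The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A small sketch of parsing such a pointer and checking a downloaded file against it; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the usage comment is an assumption:

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 against a parsed pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("only sha256 oids are handled in this sketch")
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in chunks so a multi-hundred-megabyte file is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:0cf98025b870066c073cd1fe58efdd966e915495b7eadb21a181e8dd822b0771
size 428158001
"""

ptr = parse_lfs_pointer(POINTER)
print(ptr["algo"], ptr["size"])
# Usage (path is hypothetical): matches_pointer("pytorch_model.pt", ptr)
```

Checking the size first is a cheap way to reject a truncated download before spending time on the hash.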
tokenizer.json
ADDED
The diff for this file is too large to render.