Upload folder using huggingface_hub

Files changed (4)

README.md +99 -0
config.json +33 -0
pytorch_model.pt +3 -0
tokenizer.json +0 -0

README.md ADDED

	@@ -0,0 +1,99 @@

+# v0.5 Chat-Tune Postmortem (2026-05-03)
+The canonical v0.5 chat is `chat-v3 (MCQ-tuned)` at **36.9%** on CTIBench MCQ.
+This document records the recovery attempts that followed and what they
+actually changed about our understanding.
+## Result table
+| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
+|---|---|---|---|---|---|
+| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | — | 19.0% |
+| **chat-v3 (canonical)** | **Raw letter-only MCQ × 5** | 1500 | 5e-5 | — | **36.9%** |
+| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | — | 25.0% |
+| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | — | 32.5% |
+| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | — | 17.1% |
+| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
+| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
+| **chat-v5 (this run)** | **Hybrid raw × 5 + CoT × 2 + small-talk × 8** | 2000 | 5e-5 | 2.990 | **34.8%** |
+## What we learned
+### What chat-v3 actually does
+The 36.9% canonical score reflects a *pattern-match shortcut*, not reasoning. With raw
+letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in
+'Answer:', emit a single letter consistent with the surface features of the
+options." This is a known class of MCQ artifact (Answer Matching Outperforms MCQ,
+arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by
+exploiting the choice distribution without understanding the question.
+### Why CoT-MCQ alone made it worse
+`chat-recovered` (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1.
+The CoT records have the format `"B. <1-2 sentence justification>"` — Qwen-14B
+generated the reasoning. The hypothesis, from Phi-3.5-mini and OpenMath-Mini,
+was that reasoning supervision should outperform pattern-match supervision
+even at low multipliers.
+It didn't. At 36M params, the model can't compress 1-2 sentences of cybersec
+reasoning into useful weight updates, and it loses the letter-shortcut signal
+in the process. This matches a documented size effect: weaker students benefit
+from coarser supervision, and long rationales over-smooth gradients
+(Skip-Thinking, arXiv 2505.18642; Unveiling Key Factors for Distilling CoT,
+arXiv 2502.18001).
+The small-talk × 30 multiplier compounded the damage by pushing the task-data
+share below 5% of the SFT mix, well outside the SmolLM2 reference of
+≥ 20% task share.
+### Why chat-v4 (lr 2e-4) diverged
+The premise going in, from our background research, was that an undertrained
+backbone needs an aggressive SFT lr to escape a bad pretrain basin. SmolLM2
+uses 3e-4 SFT lr at 135M params. Scaled down to 36M with mean-init new tokens,
+2e-4 was still too hot: validation loss climbed monotonically across 3 evals
+(3.175 → 3.285 → 3.403) before we killed it at step 300.
+Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new
+embedding rows. The safe range is closer to 5e-5.
+### What chat-v5 got right (and didn't)
+The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init
+embeddings) lifted the score from 30.8% to 34.8%, a real **+4.0 point** gain
+over the prior recovery attempt (`chat-recovered`). But it still trails
+canonical by 2.1 points.
+The hybrid was directionally right — keeping the letter-shortcut anchor
+(raw × 5) preserved the discriminative signal, while CoT × 2 added some
+reasoning supervision without over-rotating. Mean-init for new tokens kept
+the residual stream stable.
+What it didn't fix: the letter-shortcut at × 5 is still doing most of the
+work, and there's no mechanism in this recipe that actually transfers
+*knowledge* into the model — only better calibration on top of the shortcut.
+To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:
+1. **Bigger model** (ghost-base ~350M) so reasoning supervision actually fits.
+2. **Better pretrain coverage** of the CTIBench knowledge domain (more
+   cyber threat intel, MITRE corpus depth) so the shortcut isn't the only
+   path to a correct answer.
+3. **Proper retrieval at inference** (RAG done right, not the chat-v4 RAFT
+   attempt that conflated training-time and inference-time augmentation).
+## Decision
+- **Canonical stays:** `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
+- **Ship chat-v5 separately:** push to `Ghostgim/GhostLM-v0.5-experimental`
+  with this postmortem in the model card. Honest framing: "improved CoT
+  hybrid recipe, still 2.1pt below canonical, primarily of research interest."
+- **No more chat-tune iterations on v0.5.** The 36.9% ceiling is a pretrain
+  + capacity ceiling, not a recipe ceiling. Next swing should be ghost-base
+  or a corpus-side fix, not another SFT permutation.
+## Sources
+- [Answer Matching Outperforms MCQ, arXiv 2507.02856](https://arxiv.org/abs/2507.02856)
+- [Skip-Thinking, arXiv 2505.18642](https://arxiv.org/html/2505.18642v1)
+- [Unveiling Key Factors for Distilling CoT, arXiv 2502.18001](https://arxiv.org/html/2502.18001v1)
+- [SmolLM2, arXiv 2502.02737](https://arxiv.org/html/2502.02737v1)
+- [How Abilities in LLMs are Affected by SFT Data Composition](https://openreview.net/forum?id=6M5G5hNiAU)
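
The README's account of the diverged lr 2e-4 run (validation loss rising on three consecutive evals before a manual kill at step 300) suggests a check that could be automated. The following is a minimal sketch, not part of the GhostLM training code: the function name `should_stop` and the `patience` parameter are hypothetical, and it assumes the training loop collects one validation loss per eval.

```python
from typing import Sequence


def should_stop(val_losses: Sequence[float], patience: int = 3) -> bool:
    """Return True when the last `patience` evals are strictly increasing.

    Hypothetical helper: `patience` counts evals, so patience=3 means
    two consecutive rises across the three most recent evals.
    """
    if patience < 2 or len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Validation losses reported in the README for the diverged lr 2e-4 run.
diverged_run = [3.175, 3.285, 3.403]
print(should_stop(diverged_run))  # True

# Illustrative (made-up) history with a dip, which should not trigger a stop.
noisy_run = [3.10, 3.00, 3.05]
print(should_stop(noisy_run))  # False
```

A strict monotone-increase rule is deliberately conservative: it would have fired at the same point the run was killed by hand, but it ignores slow drifts with occasional dips, which would need a different criterion.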

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "vocab_size": 32007,
+  "context_length": 512,
+  "d_model": 512,
+  "n_heads": 8,
+  "n_layers": 6,
+  "d_ff": 2048,
+  "dropout": 0.1,
+  "bias": true,
+  "use_rope": true,
+  "rope_base": 10000.0,
+  "use_swiglu": true,
+  "use_rmsnorm": true,
+  "use_flash_attention": false,
+  "batch_size": 4,
+  "learning_rate": 5e-05,
+  "weight_decay": 0.1,
+  "beta1": 0.9,
+  "beta2": 0.95,
+  "grad_clip": 1.0,
+  "grad_accum_steps": 8,
+  "warmup_steps": 100,
+  "max_steps": 2000,
+  "eval_interval": 100,
+  "save_interval": 500,
+  "data_dir": "data/processed",
+  "checkpoint_dir": "checkpoints/phase8_chat_v05_v5",
+  "log_dir": "logs/phase8_chat_v05_v5",
+  "device": "auto",
+  "dtype": "float32",
+  "seed": 42,
+  "use_wandb": false
+}
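
A few derived quantities follow directly from the config above. This is a rough sketch under stated assumptions: that `grad_accum_steps` multiplies `batch_size` in the usual way, that training runs on a single device, and that every sequence is filled to `context_length` (so the token counts are upper bounds). None of these assumptions is confirmed by the commit itself.

```python
# Values copied from config.json in this commit.
config = {
    "batch_size": 4,
    "grad_accum_steps": 8,
    "context_length": 512,
    "max_steps": 2000,
    "warmup_steps": 100,
}

# Sequences contributing to each optimizer step (assumes a single device).
effective_batch = config["batch_size"] * config["grad_accum_steps"]

# Upper bound on tokens per optimizer step (assumes full-length sequences).
tokens_per_step = effective_batch * config["context_length"]

# Upper bound on tokens seen over the whole run.
total_tokens = tokens_per_step * config["max_steps"]

# Share of the run spent in learning-rate warmup.
warmup_fraction = config["warmup_steps"] / config["max_steps"]

print(effective_batch)   # 32
print(tokens_per_step)   # 16384
print(total_tokens)      # 32768000
print(warmup_fraction)   # 0.05
```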

pytorch_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0cf98025b870066c073cd1fe58efdd966e915495b7eadb21a181e8dd822b0771
+size 428158001

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff