Update README to v0.9.5: nine differentiation bets, 1,505 templated records
README sync for the v0.9.5 release (https://github.com/joemunene-by/GhostLM/releases/tag/v0.9.5). Replace the six-bet table with the nine-bet table; add bets 7 (code-for-security), 8 (binary/hex literacy), and 9 (provenance cite tags). Add the combined templated-synth corpus table (1,505 records, 99.4% acceptance). Update the citation note to v0.9.5. The v0.9 chat checkpoint itself is unchanged; bench numbers are intact.
README.md
@@ -76,28 +76,46 @@ parameters the model has the *register* of cybersec writing but not the
|
|
| 76 |
This repo holds the slim inference checkpoint
|
| 77 |
(`best_model.pt`, 324 MB, model + config only, optimizer state stripped).
|
| 78 |
|
| 79 |
-
## v0.9.
|
| 80 |
|
| 81 |
-
The
|
| 82 |
-
|
| 83 |
-
the
|
| 84 |
-
|
| 85 |
|
| 86 |
| Bet | Status | Result |
|
| 87 |
|---|---|---|
|
| 88 |
-
| 1. Tool-grounded SFT |
|
| 89 |
-
| 2. Daily LoRA over fresh threat-intel | scaffolded | scripts/daily_finetune.py, ~1-2 GPU hr/day |
|
| 90 |
-
| 3. Custom 32K BPE | **measured + settled** | +4.0% on cyber, -2.5% on general vs GPT-2 BPE; +25-35% projection falsified
|
| 91 |
-
| 4. Long context via RoPE NTK | scaffolded | scripts/extend_context_ntk.py, ~3-5 GPU hr |
|
| 92 |
-
| 5. MoE for ghost-1B+ | **smoke validated** | 100-step training PASS
|
| 93 |
-
| 6. Format-aware pretrain (STIX/YARA/Sigma/MISP) | **
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
|
| 99 |
The v0.9 chat checkpoint in this repo is unchanged; it's the
|
| 100 |
-
baseline against which
|
| 101 |
|
| 102 |
## Bench numbers
|
| 103 |
|
|
@@ -247,7 +265,7 @@ output rather than reliable answers.
|
|
| 247 |
author = {Munene, Joe},
|
| 248 |
year = {2026},
|
| 249 |
howpublished = {\url{https://github.com/joemunene-by/GhostLM}},
|
| 250 |
-
note = {v0.9.
|
| 251 |
}
|
| 252 |
```
|
| 253 |
|
|
|
|
| 76 |
This repo holds the slim inference checkpoint
|
| 77 |
(`best_model.pt`, 324 MB, model + config only, optimizer state stripped).
|
| 78 |
|
| 79 |
+
## v0.9.5 update (2026-05-08): nine differentiation bets, 1,505 templated SFT records ready
|
| 80 |
|
| 81 |
+
The strategic frame went from "six bets, three measured" (v0.9.4) to
|
| 82 |
+
"nine bets, all shipped, 1,505 deterministic SFT records ready for
|
| 83 |
+
the v1.0 GPU run." The new bets answer **"what would make GhostLM
|
| 84 |
+
exceptional, beyond what general-purpose small LMs offer?"**
|
| 85 |
+
|
| 86 |
+
Strategic frame: [`docs/differentiation.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/differentiation.md).
|
| 87 |
|
| 88 |
| Bet | Status | Result |
|
| 89 |
|---|---|---|
|
| 90 |
+
| 1. Tool-grounded SFT | **training data ready** | 424 templated traces, 98.6% acceptance under `trace_quality_ok`; ~10% "not found" injection trains lookup-failure acknowledgement. [tool_use_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/tool_use_synth.md) |
|
| 91 |
+
| 2. Daily LoRA over fresh threat-intel | scaffolded | `scripts/daily_finetune.py`, ~1-2 GPU hr/day |
|
| 92 |
+
| 3. Custom 32K BPE | **measured + settled** | +4.0% on cyber, -2.5% on general vs GPT-2 BPE; +25-35% projection falsified. [bpe_corpus_ablation.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/bpe_corpus_ablation.md) |
|
| 93 |
+
| 4. Long context via RoPE NTK | scaffolded | `scripts/extend_context_ntk.py`, ~3-5 GPU hr |
|
| 94 |
+
| 5. MoE for ghost-1B+ | **smoke validated** | 100-step training PASS. [moe_training_smoke.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/moe_training_smoke.md); presets `ghost-1b` (2.1B/1.2B-active) and `ghost-3b` (6.0B/3.3B-active) |
|
| 95 |
+
| 6. Format-aware pretrain (STIX/YARA/Sigma/MISP) | **measured baseline + training data ready** | v0.9 baseline locked at 0/32 = 0% [Wilson 95% CI 0.0-10.7]. 560 templated records ready. [format_baseline_v09.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_baseline_v09.md), [format_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_synth.md) |
|
| 96 |
+
| 7. Code-for-security | **NEW**, training data ready | 12-pattern bank covering OWASP-Top-10 CWE classes (Python/JS/C); 48 records, 100% pass. [code_security_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/code_security_synth.md) |
|
| 97 |
+
| 8. Binary / hex literacy | **NEW**, training data ready, **most novel bet** | 15-pattern bank: PE/ELF/Mach-O/ZIP/PDF/OLE2/PNG file magic, UPX/Themida packers, NOP sleds + x64 syscall, PE Optional Header Magic + Machine, x64 execve('/bin/sh') shellcode; 44 records, 100% pass. **No other small cybersec LM does this.** [binary_literacy_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/binary_literacy_synth.md) |
|
| 98 |
+
| 9. Provenance / cite tags | **NEW**, training data ready | 429 cite-augmented tool-use traces with `<\|cite\|>{source_type}:{id}#field<\|/cite\|>` inline in the answer; 99.8% acceptance under `trace_with_cites_quality_ok`. Stacks on bet 1 for ~853-record SFT corpus. [provenance_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/provenance_synth.md) |
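The bet 6 row reports a v0.9 baseline of 0/32 with a Wilson 95% CI of 0.0-10.7. A minimal sketch of how such an interval can be computed is shown here; the function name `wilson_interval` and the default `z` value are illustrative assumptions, not taken from the GhostLM scripts.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z defaults to the two-sided
    95% normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    # Clamp to [0, 1] so floating-point noise cannot leave the valid range.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(0, 32)
print(f"{100 * low:.1f}-{100 * high:.1f}")  # 0.0-10.7
```

With zero successes the lower bound collapses to zero and the upper bound reduces to z²/(n + z²), which is why a 0/32 result still leaves roughly a 10.7% ceiling on the true pass rate.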
|
| 99 |
+
|
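The bet 9 row describes inline tags of the form `<|cite|>{source_type}:{id}#field<|/cite|>` (the pipes are presumably escaped in the README only because the text sits in a markdown table cell). As a rough sketch of how a consumer might pull these tags out of an answer, assuming `field` is a placeholder like the other two parts: the regex, the `extract_cites` helper, and the sample `nvd` / `cvss` values below are all hypothetical and not taken from `provenance_synth.md`.

```python
import re

# Assumed grammar: <|cite|>source_type:id#field<|/cite|>
# source_type stops at the first colon, id stops at the first '#'.
CITE_RE = re.compile(
    r"<\|cite\|>"
    r"(?P<source_type>[^:<>]+):(?P<id>[^#<>]+)#(?P<field>[^<>]+)"
    r"<\|/cite\|>"
)


def extract_cites(text: str) -> list:
    """Return one dict per cite tag found in the text (hypothetical helper)."""
    return [m.groupdict() for m in CITE_RE.finditer(text)]


def strip_cites(text: str) -> str:
    """Remove cite tags, leaving the plain answer text."""
    return CITE_RE.sub("", text)


# Illustrative answer string; the source_type and field names are made up.
answer = "The base score is 10.0 <|cite|>nvd:CVE-2021-44228#cvss<|/cite|>."
print(extract_cites(answer))
print(strip_cites(answer))
```

Keeping extraction and stripping separate lets the same trace be scored for citation correctness and displayed without the tags.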
| 100 |
+
### Combined templated-synth corpus
|
| 101 |
+
|
| 102 |
+
| Bet | Records | Acceptance |
|
| 103 |
+
|---|---:|---:|
|
| 104 |
+
| 1 (tool-use, plain) | 424 | 98.6% |
|
| 105 |
+
| 6 (STIX / YARA / Sigma / MISP) | 560 | 99.8% |
|
| 106 |
+
| 7 (code-for-security) | 48 | 100.0% |
|
| 107 |
+
| 8 (binary / hex literacy) | 44 | 100.0% |
|
| 108 |
+
| 9 (cite-augmented tool-use) | 429 | 99.8% |
|
| 109 |
+
| **TOTAL** | **1,505** | **99.4%** |
|
| 110 |
+
|
| 111 |
+
That's the deterministic floor. LLM-distilled records on top
|
| 112 |
+
(bet 1 production at ~$200, bet 6 production at ~$50-100 on
|
| 113 |
+
Anthropic) bring the realistic ghost-base SFT mix to ~10K records
|
| 114 |
+
for a few hundred dollars, with no GPU spend until the actual
|
| 115 |
+
pretrain run.
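The totals in the combined templated-synth table can be re-derived from the per-bet rows. The sketch below sums the record counts and computes a record-weighted mean of the published acceptance rates. Because the per-bet rates are already rounded and the raw accept/reject counts are not given here, the weighted mean is only an approximation and need not reproduce the stated 99.4% to the last digit.

```python
# Per-bet rows copied from the combined templated-synth corpus table:
# (label, record count, acceptance rate in percent).
rows = [
    ("1 (tool-use, plain)", 424, 98.6),
    ("6 (STIX / YARA / Sigma / MISP)", 560, 99.8),
    ("7 (code-for-security)", 48, 100.0),
    ("8 (binary / hex literacy)", 44, 100.0),
    ("9 (cite-augmented tool-use)", 429, 99.8),
]

total_records = sum(count for _, count, _ in rows)

# Record-weighted mean of the rounded per-bet rates (an approximation).
weighted_acceptance = sum(count * pct for _, count, pct in rows) / total_records

# Bets 1 and 9 stack into the tool-use SFT corpus mentioned in the bet 9 row.
tool_use_records = rows[0][1] + rows[4][1]

print(total_records)      # 1505
print(tool_use_records)   # 853
print(f"{weighted_acceptance:.2f}%")
```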
|
| 116 |
|
| 117 |
The v0.9 chat checkpoint in this repo is unchanged; it's the
|
| 118 |
+
baseline against which all bet measurements are made.
|
| 119 |
|
| 120 |
## Bench numbers
|
| 121 |
|
|
|
|
| 265 |
author = {Munene, Joe},
|
| 266 |
year = {2026},
|
| 267 |
howpublished = {\url{https://github.com/joemunene-by/GhostLM}},
|
| 268 |
+
note = {v0.9.5 release; 81M-parameter chat checkpoint plus nine differentiation bets, 1505 templated SFT records}
|
| 269 |
}
|
| 270 |
```
|
| 271 |
|