---
license: apache-2.0
language:
  - en
library_name: pytorch
pipeline_tag: text-generation
tags:
  - cybersecurity
  - language-model
  - from-scratch
  - small-model
  - cti
  - ctibench
  - chat
datasets:
  - trendmicro-ailab/Primus-Seed
  - trendmicro-ailab/Primus-FineWeb
  - AI4Sec/cti-bench
model-index:
  - name: GhostLM v0.9 chat (81M)
    results:
      - task:
          type: multiple-choice
          name: CTIBench MCQ (debiased text-scoring, n=2500)
        dataset:
          name: CTIBench MCQ
          type: AI4Sec/cti-bench
          split: test
        metrics:
          - name: 2-permutation per-perm avg
            type: accuracy
            value: 0.289
      - task:
          type: multiple-choice
          name: SecQA (n=210)
        dataset:
          name: SecQA
          type: zefang-liu/secqa
        metrics:
          - name: accuracy
            type: accuracy
            value: 0.393
      - task:
          type: multiple-choice
          name: GhostLM CTF MCQ (in-repo, n=30)
        dataset:
          name: GhostLM CTF eval
          type: in-repo
        metrics:
          - name: accuracy
            type: accuracy
            value: 0.592
      - task:
          type: text-generation
          name: GhostLM free-form fact recall (n=50)
        dataset:
          name: GhostLM fact-recall bench
          type: in-repo
        metrics:
          - name: substring-match accuracy
            type: accuracy
            value: 0.02
---

# GhostLM v0.9 chat (81M, from-scratch cybersecurity LM)

GhostLM is a family of cybersecurity language models trained entirely
from scratch in PyTorch along a multi-rung scale ladder. **v0.9 chat is the bench
winner of the ghost-small (45-81M) line on every multiple-choice
benchmark we evaluated.** It is also where the line saturates: at 81M
parameters the model has the *register* of cybersec writing but not the
*facts* in any retrievable form. The next rung is
[ghost-base](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md)
(~360M, SmolLM2-360M shape), gated on rented GPU compute.

This repo holds the slim inference checkpoint
(`best_model.pt`, 324 MB, model + config only, optimizer state stripped).

## v0.9.5 update (2026-05-08): nine differentiation bets, 1,505 templated SFT records ready

The strategic frame went from "six bets, three measured" (v0.9.4) to
"nine bets, all shipped, 1,505 deterministic SFT records ready for
the v1.0 GPU run." The new bets answer **"what would make GhostLM
exceptional, beyond what general-purpose small LMs offer?"**

Strategic frame: [`docs/differentiation.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/differentiation.md).

| Bet | Status | Result |
|---|---|---|
| 1. Tool-grounded SFT | **training data ready** | 424 templated traces, 98.6% acceptance under `trace_quality_ok`; ~10% "not found" injection trains lookup-failure acknowledgement. [tool_use_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/tool_use_synth.md) |
| 2. Daily LoRA over fresh threat-intel | scaffolded | `scripts/daily_finetune.py`, ~1-2 GPU hr/day |
| 3. Custom 32K BPE | **measured + settled** | +4.0% on cyber, -2.5% on general vs GPT-2 BPE; +25-35% projection falsified. [bpe_corpus_ablation.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/bpe_corpus_ablation.md) |
| 4. Long context via RoPE NTK | scaffolded | `scripts/extend_context_ntk.py`, ~3-5 GPU hr |
| 5. MoE for ghost-1B+ | **smoke validated** | 100-step training PASS. [moe_training_smoke.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/moe_training_smoke.md); presets `ghost-1b` (2.1B/1.2B-active) and `ghost-3b` (6.0B/3.3B-active) |
| 6. Format-aware pretrain (STIX/YARA/Sigma/MISP) | **measured baseline + training data ready** | v0.9 baseline locked at 0/32 = 0% [Wilson 95% CI 0.0-10.7]. 560 templated records ready. [format_baseline_v09.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_baseline_v09.md), [format_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_synth.md) |
| 7. Code-for-security | **NEW**, training data ready | 12-pattern bank covering OWASP-Top-10 CWE classes (Python/JS/C); 48 records, 100% pass. [code_security_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/code_security_synth.md) |
| 8. Binary / hex literacy | **NEW**, training data ready, **most novel bet** | 15-pattern bank: PE/ELF/Mach-O/ZIP/PDF/OLE2/PNG file magic, UPX/Themida packers, NOP sleds + x64 syscall, PE Optional Header Magic + Machine, x64 execve('/bin/sh') shellcode; 44 records, 100% pass. **No other small cybersec LM does this.** [binary_literacy_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/binary_literacy_synth.md) |
| 9. Provenance / cite tags | **NEW**, training data ready | 429 cite-augmented tool-use traces with `<\|cite\|>{source_type}:{id}#field<\|/cite\|>` inline in the answer; 99.8% acceptance under `trace_with_cites_quality_ok`. Stacks on bet 1 for ~853-record SFT corpus. [provenance_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/provenance_synth.md) |
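
The cite-tag shape from bet 9 can be checked mechanically. The snippet below is a minimal sketch of one way to pull `{source_type}:{id}#field` triples out of an answer string. The regex, the `Cite` tuple, and the example values (`nvd`, `description`) are illustrative assumptions, not the repo's `trace_with_cites_quality_ok` implementation.

```python
import re
from typing import NamedTuple


class Cite(NamedTuple):
    source_type: str
    id: str
    field: str


# Follows the `<|cite|>{source_type}:{id}#field<|/cite|>` shape from the table.
# The exact character classes are an assumption made for this sketch.
CITE_RE = re.compile(r"<\|cite\|>([^:<]+):([^#<]+)#([^<]+)<\|/cite\|>")


def extract_cites(answer: str) -> list[Cite]:
    """Return every well-formed cite tag found in an answer string."""
    return [Cite(*m.groups()) for m in CITE_RE.finditer(answer)]


# Hypothetical answer text; source type and field name are made up.
example = (
    "Log4Shell allows remote code execution "
    "<|cite|>nvd:CVE-2021-44228#description<|/cite|>."
)
print(extract_cites(example))
```

A quality gate could then require at least one extracted `Cite` per answer and check that each `id` matches the tool result it claims to quote.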

### Combined templated-synth corpus

| Bet | Records | Acceptance |
|---|---:|---:|
| 1 (tool-use, plain) | 424 | 98.6% |
| 6 (STIX / YARA / Sigma / MISP) | 560 | 99.8% |
| 7 (code-for-security) | 48 | 100.0% |
| 8 (binary / hex literacy) | 44 | 100.0% |
| 9 (cite-augmented tool-use) | 429 | 99.8% |
| **TOTAL** | **1,505** | **99.4%** |

That's the deterministic floor. LLM-distilled records on top
(bet 1 production at ~$200, bet 6 production at ~$50-100 on
Anthropic) bring the realistic ghost-base SFT mix to ~10K records
for a few hundred dollars, with no GPU spend until the actual
pretrain run.

The v0.9 chat checkpoint in this repo is unchanged; it's the
baseline against which all bet measurements are made.
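
The bet 6 baseline interval quoted above (0/32, Wilson 95% CI 0.0-10.7) follows from the standard Wilson score formula. A minimal sketch, with a hypothetical function name, for anyone who wants to recompute intervals for their own bet measurements:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~ 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(0, 32)
print(f"0/32: [{100 * lo:.1f}, {100 * hi:.1f}]")  # 0/32: [0.0, 10.7]
```

With zero successes the upper bound reduces to `z**2 / (n + z**2)`, which is why a 0/32 result still leaves room for a true rate of roughly 10%.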

## Bench numbers

All multiple-choice benches run with debiased multi-permutation
text-scoring on checkpointed CPU/GPU inference; free-form fact recall is
graded by substring match. Methodology in
[`docs/ctibench_bias_finding.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ctibench_bias_finding.md).

| Benchmark | Records | v0.4 chat-v3 | v0.7 chat | **v0.9 chat** | Random |
|---|---:|---:|---:|---:|---:|
| CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | **28.9%** | 25.0% |
| In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | **59.2%** | 25.0% |
| SecQA (external, n=210) | 210 | 35.0% | 37.6% | **39.3%** | 25.0% |
| Free-form fact recall | 50 | 0/50 | 1/50 | **1/50** | 0/50 |

v0.9 wins every multiple-choice benchmark by 1.3-9.2 pp over the
next-best checkpoint in the table. The first-place ranking holds across
CTIBench, the in-repo CTF eval, and the external SecQA bench.
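
For readers unfamiliar with multi-permutation scoring, the sketch below shows the general idea: present the same options in several orders, score each order, and average per-permutation accuracy so that a preference for a fixed answer slot cannot inflate the number. The `score_fn` interface, the choice of permutations, and the toy data are assumptions for illustration; the actual procedure is the one documented in `docs/ctibench_bias_finding.md`.

```python
from typing import Callable, Sequence

# Hypothetical interface: one score per presented option, higher = preferred.
ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]


def per_perm_accuracy(items, score_fn: ScoreFn, perms: Sequence[Sequence[int]]) -> float:
    """Average accuracy over option orderings.

    items: iterable of (question, options, gold_index).
    perms: each perm lists original option indices in presented order.
    """
    items = list(items)
    correct = 0
    for perm in perms:
        for question, options, gold in items:
            shown = [options[i] for i in perm]
            scores = score_fn(question, shown)
            picked = max(range(len(shown)), key=lambda j: scores[j])
            # Map the picked slot back to the original option index.
            correct += int(perm[picked] == gold)
    return correct / (len(perms) * len(items))


def first_slot_scorer(question, shown):
    """A fully position-biased scorer: always prefers the first slot."""
    return [1.0] + [0.0] * (len(shown) - 1)


toy = [("q1", ["a", "b", "c", "d"], 0), ("q2", ["a", "b", "c", "d"], 0)]
identity, reverse = [0, 1, 2, 3], [3, 2, 1, 0]
print(per_perm_accuracy(toy, first_slot_scorer, [identity]))           # 1.0
print(per_perm_accuracy(toy, first_slot_scorer, [identity, reverse]))  # 0.5
```

The toy scorer looks perfect on a single ordering where the gold answer happens to sit first, and drops once a second ordering is averaged in, which is the bias the debiased numbers are meant to remove.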

**But free-form fact recall is at floor across the entire 81M ghost-small
rung.** A 50-question hand-written fact-recall set (CVE / CWE / MITRE /
OWASP / crypto / protocol / misc) graded by substring match scores
0-2% across every chat-tune in the line. The v0.9 model's one "hit"
("256" appearing in a SHA-256 question) is arguably spurious. **MCQ wins
measure register matching and topic distinctness, not factual recall.**
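
Substring grading is lenient by construction, which is how a spurious hit like the "256" case can occur. A minimal sketch of such a grader follows; the function name, the case-insensitive matching, and the example question are assumptions, not the in-repo bench code.

```python
def substring_hit(response: str, accepted: list[str]) -> bool:
    """True if any accepted answer string appears in the response (case-insensitive)."""
    text = response.lower()
    return any(ans.lower() in text for ans in accepted)


# Hypothetical item, invented for illustration.
# Question: "How many bits are in a SHA-256 digest?"  Accepted answer: "256"
echo = "SHA-256 is a hashing algorithm used in security."
print(substring_hit(echo, ["256"]))             # True, although nothing was answered
print(substring_hit("It is a hash.", ["256"]))  # False
```

A response that merely repeats the algorithm name from the question already contains the accepted string, so the grader counts it even though no fact was recalled.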

## Architecture

| Field | Value |
|---|---|
| Layers | 6 |
| d_model | 768 |
| Attention heads | 12 (head_dim 64) |
| FFN | SwiGLU, hidden = `int(d_ff × 2/3)` rounded to 64 = 2048 |
| Normalization | RMSNorm |
| Position | RoPE (base 10000) |
| Vocab | 50,264 (GPT-2 50K BPE + 7 special tokens) |
| Context | 512 train, 1024 inference |
| Total params | ~81M |

Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is
what produces the bench delta over v0.7.
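
The ~81M figure can be sanity-checked from the table. The sketch below assumes `d_ff = 4 × d_model`, rounding up to the next multiple of 64, tied input/output embeddings, and no bias terms; none of these are stated in this card, so treat the result as a back-of-envelope check rather than the exact count reported by the code.

```python
def swiglu_hidden(d_ff: int, multiple: int = 64) -> int:
    """int(d_ff * 2/3), rounded up to a multiple of `multiple` (rounding direction assumed)."""
    h = int(d_ff * 2 / 3)
    return ((h + multiple - 1) // multiple) * multiple


def param_count(vocab: int, d_model: int, n_layers: int, hidden: int) -> int:
    embed = vocab * d_model        # token embedding, assumed tied with the LM head
    attn = 4 * d_model * d_model   # Q, K, V and output projections, no biases assumed
    ffn = 3 * d_model * hidden     # SwiGLU: gate, up and down projections
    norms = 2 * d_model            # two RMSNorm weight vectors per block
    return embed + n_layers * (attn + ffn + norms) + d_model  # plus final RMSNorm


hidden = swiglu_hidden(4 * 768)    # d_ff = 4 * d_model is an assumption
total = param_count(vocab=50_264, d_model=768, n_layers=6, hidden=hidden)
print(hidden, f"{total / 1e6:.1f}M")  # 2048 81.1M
```

Under these assumptions the embedding table accounts for about 38.6M of the total, so nearly half of the parameters sit in the vocabulary rather than in the six transformer blocks.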

## Training data

Pretrain corpus: 273M tokens spanning

- **PRIMUS-Seed** (Trend Micro AI Lab, Apache 2.0): curated cybersec text
- **PRIMUS-FineWeb** (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl
- **NVD CVEs** (NIST, public domain): full v2 description text
- **MITRE ATT&CK + CWE + CAPEC** (MITRE, custom permissive): technique / weakness / pattern descriptions
- **OWASP** (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance
- **IETF RFCs** (BCP 78, public): security-relevant RFCs
- **CTFtime + Exploit-DB** (open): real CTF write-ups and exploit POCs
- **arXiv cs.CR**: full-text academic papers
- **fact-QA**: ~11K Q&A pairs distilled by Qwen-14B from the corpus

Per-source breakdown in
[`CORPUS.md`](https://github.com/joemunene-by/GhostLM/blob/main/CORPUS.md).

Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using
the chat-v3 recipe. Three role tokens (`<|ghost_user|>`,
`<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer.
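
As an illustration of how three role tokens can delimit a conversation, here is a text-level sketch. The layout (role token, content, end token, then an open assistant slot) is an assumed template for illustration only; the authoritative format is whatever `GhostTokenizer.format_chat_prompt` produces.

```python
USER, ASSISTANT, END = "<|ghost_user|>", "<|ghost_assistant|>", "<|ghost_end|>"


def render_chat(turns: list[dict]) -> str:
    """Render turns as text and leave an open assistant slot for generation.

    The ordering of role token, content and end token is an assumption.
    """
    role_token = {"user": USER, "assistant": ASSISTANT}
    parts = [f"{role_token[t['role']]}{t['content']}{END}" for t in turns]
    return "".join(parts) + ASSISTANT


print(render_chat([{"role": "user", "content": "What is XSS?"}]))
```

Under this template generation would stop when the model emits `<|ghost_end|>`.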

## Intended use

- Educational: a transparent, hand-written reference implementation of
  a from-scratch decoder-only cybersecurity LM, trained on a curated
  corpus, with all code on GitHub and all recipes documented.
- Research: a bench artifact for "what does an 81M from-scratch cyber
  LM actually score on CTIBench / SecQA?" The honest answer (28.9% /
  39.3%) is meaningful evidence about the parameter-count requirement
  for factual recall on cybersec MCQ.

## What this model is NOT for

- **Anything that depends on factual recall.** Free-form fact recall
  is at floor. CVE numbers, version chains, MITRE technique IDs,
  CVSS scores produced by this model are unreliable. Verify against
  authoritative sources.
- **General-purpose tasks.** Outside cybersecurity the model politely
  declines and returns to its domain. Do not expect it to summarize
  news, write code, or answer arbitrary questions.
- **Production cybersec workflows.** Not for incident response,
  threat hunting, or any decision that affects real systems.

## Loading

```python
from dataclasses import fields

import torch
from huggingface_hub import hf_hub_download

from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM
from ghostlm.tokenizer import GhostTokenizer

# Pull weights from the Hub (cached locally after the first call)
ckpt_path = hf_hub_download(
    repo_id="Ghostgim/GhostLM-v0.9-experimental",
    filename="best_model.pt",
)

# Load the checkpoint on CPU and rebuild the config, keeping only the
# keys that GhostLMConfig actually defines
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
saved = ckpt["config"]
config = GhostLMConfig(**{
    f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved
})
model = GhostLM(config)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # inference mode: disables dropout

tokenizer = GhostTokenizer()  # GPT-2 BPE + 7 special tokens

# Multi-turn chat using the role tokens
turns = [{"role": "user", "content": "What is XSS?"}]
prompt_ids = tokenizer.format_chat_prompt(turns)
# ... see scripts/chat.py for full generation loop
```

The full code (architecture, tokenizer, generation, eval, training) is
in the [GhostLM GitHub repo](https://github.com/joemunene-by/GhostLM).

## Live demo

[`huggingface.co/spaces/Ghostgim/ghostlm`](https://huggingface.co/spaces/Ghostgim/ghostlm)

Pulls these weights via `hf_hub_download` on first launch. CPU inference
takes ~15-25 s per reply at the default 200-token cap. The demo is
intentionally honest about the fact-recall floor; expect register-shaped
output rather than reliable answers.

## Caveats

- **Hallucination is the norm**, not the exception. This is an 81M
  from-scratch model, not a fine-tuned 7B foundation model.
- **MCQ wins do not imply factual recall.** Test with the free-form
  fact-recall benchmark, not just CTIBench.
- **Pretrain corpus is sub-Chinchilla.** 273M tokens for 81M params is
  ~3.4 tokens per parameter, roughly 6× under the ~20 tokens per
  parameter usually taken as Chinchilla-optimal; the chat tune partially
  compensates, but the model is undertrained relative to its capacity.

## Citation

```bibtex
@misc{munene2026ghostlm,
  title         = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder},
  author        = {Munene, Joe},
  year          = {2026},
  howpublished  = {\url{https://github.com/joemunene-by/GhostLM}},
  note          = {v0.9.5 release; 81M-parameter chat checkpoint plus nine differentiation bets, 1505 templated SFT records}
}
```

## Roadmap

The next rung is **ghost-base (~360M, SmolLM2-360M shape)**, gated on
rented GPU compute. Acceptance gate:

- ≥40% on debiased CTIBench (full n=2500), OR
- ≥65% on the in-repo CTF MCQ eval, OR
- ≥30% on the 50-question free-form fact-recall set.

The fact-recall bar is the truth metric. Spec at
[`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md);
multi-year pathway through ghost-7B in
[`docs/hardware_pathway.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/hardware_pathway.md).

After ghost-base lands, the v0.9.4 differentiation bets compose on
top of it: tool-use SFT (bet 1) on the fresh ghost-base, format-aware
pretrain mix (bet 6) using the 560 templated records plus
LLM-distilled traces, RoPE NTK context extension to 16K (bet 4), and
eventually ghost-1B with native MoE from step 0 (bet 5). Sequencing
detail in
[`docs/differentiation.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/differentiation.md).
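
For context on bet 4, NTK-aware RoPE scaling is commonly implemented by enlarging the rotary base rather than interpolating positions. The sketch below uses the widely cited `base * scale ** (d / (d - 2))` rule with the v0.9 head size and base; the scale factor is illustrative, and nothing here is taken from `scripts/extend_context_ntk.py`, which may do something different.

```python
def ntk_scaled_base(base: float, head_dim: int, scale: float) -> float:
    """NTK-aware RoPE base: base * scale ** (d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


def inv_freqs(base: float, head_dim: int) -> list[float]:
    """Per-pair rotation frequencies base ** (-2i / d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


base, head_dim = 10_000.0, 64  # values from the v0.9 architecture table
scale = 16_384 / 1_024         # illustrative: 16K target over an assumed 1K context
new_base = ntk_scaled_base(base, head_dim, scale)

old, new = inv_freqs(base, head_dim), inv_freqs(new_base, head_dim)
print(f"new base ~ {new_base:,.0f}")
print(f"fastest pair ratio: {new[0] / old[0]:.3f}")    # 1.000: local resolution kept
print(f"slowest pair ratio: {old[-1] / new[-1]:.3f}")  # equals scale: longest wavelength stretched
```

The exponent is chosen so that the slowest-rotating pair is stretched by exactly the scale factor while the fastest pair is left untouched.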

## License

Apache 2.0. Same license as the GhostLM source code.

Built by Joe Munene.