Ghostgim committed on
Commit 2aba6a5 · verified · 1 Parent(s): d15c7e7

docs: model card with bench numbers, intended use, caveats

Adds proper README.md frontmatter so the model surfaces in HF model search
and tags, with the v0.9 chat bench numbers (CTIBench 28.9% / SecQA 39.3% /
in-repo CTF 59.2% / free-form fact recall 1/50) under the model-index schema.

Body documents architecture (6L x 768d x 12h, RoPE + SwiGLU + RMSNorm),
training corpus (273M tokens spanning PRIMUS / NVD / MITRE / CWE / OWASP /
RFCs / arXiv / fact-QA), intended use (educational + research), what the
model is NOT for (anything requiring factual recall), loading code, and
explicit caveats around the fact-recall floor.

Cross-links the GitHub source, the demo Space, the ghost_base_spec, and the
multi-year hardware pathway.

Files changed (1)
  1. README.md +249 -0
README.md ADDED
@@ -0,0 +1,249 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: pytorch
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - cybersecurity
9
+ - language-model
10
+ - from-scratch
11
+ - small-model
12
+ - cti
13
+ - ctibench
14
+ - chat
15
+ datasets:
16
+ - trendmicro-ailab/Primus-Seed
17
+ - trendmicro-ailab/Primus-FineWeb
18
+ - AI4Sec/cti-bench
19
+ model-index:
20
+ - name: GhostLM v0.9 chat (81M)
21
+ results:
22
+ - task:
23
+ type: multiple-choice
24
+ name: CTIBench MCQ (debiased text-scoring, n=2500)
25
+ dataset:
26
+ name: CTIBench MCQ
27
+ type: AI4Sec/cti-bench
28
+ split: test
29
+ metrics:
30
+ - name: 2-permutation per-perm avg
31
+ type: accuracy
32
+ value: 0.289
33
+ - task:
34
+ type: multiple-choice
35
+ name: SecQA (n=210)
36
+ dataset:
37
+ name: SecQA
38
+ type: zefang-liu/secqa
39
+ metrics:
40
+ - name: accuracy
41
+ type: accuracy
42
+ value: 0.393
43
+ - task:
44
+ type: multiple-choice
45
+ name: GhostLM CTF MCQ (in-repo, n=30)
46
+ dataset:
47
+ name: GhostLM CTF eval
48
+ type: in-repo
49
+ metrics:
50
+ - name: accuracy
51
+ type: accuracy
52
+ value: 0.592
53
+ - task:
54
+ type: text-generation
55
+ name: GhostLM free-form fact recall (n=50)
56
+ dataset:
57
+ name: GhostLM fact-recall bench
58
+ type: in-repo
59
+ metrics:
60
+ - name: substring-match accuracy
61
+ type: accuracy
62
+ value: 0.02
63
+ ---
64
+
65
+ # GhostLM v0.9 chat (81M, from-scratch cybersecurity LM)
66
+
67
+ GhostLM is a multi-rung scale-ladder cybersecurity language model
68
+ trained entirely from scratch in PyTorch. **v0.9 chat is the bench
69
+ winner of the ghost-small (45-81M) line on every multiple-choice
70
+ benchmark we evaluated.** It is also where the line saturates: at 81M
71
+ parameters the model has the *register* of cybersec writing but not the
72
+ *facts* in any retrievable form. The next rung is
73
+ [ghost-base](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md)
74
+ (~360M, SmolLM2-360M shape), gated on rented GPU compute.
75
+
76
+ This repo holds the slim inference checkpoint
77
+ (`best_model.pt`, 324 MB, model + config only, optimizer state stripped).
78
+
79
+ ## Bench numbers
80
+
81
+ All multiple-choice benches run with debiased multi-permutation text-scoring on
82
+ checkpointed CPU/GPU inference. Methodology in
83
+ [`docs/ctibench_bias_finding.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ctibench_bias_finding.md).
84
+
85
+ | Benchmark | Records | v0.4 chat-v3 | v0.7 chat | **v0.9 chat** | Random |
86
+ |---|---:|---:|---:|---:|---:|
87
+ | CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | **28.9%** | 25.0% |
88
+ | In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | **59.2%** | 25.0% |
89
+ | SecQA (external) | 210 | 35.0% | 37.6% | **39.3%** | 25.0% |
90
+ | Free-form fact recall | 50 | 0/50 | 1/50 | **1/50** | 0/50 |
91
+
92
+ v0.9 wins every multiple-choice benchmark, by 1.3-9.2 pp over the best earlier checkpoint in the table. The MCQ
93
+ ranking holds across CTIBench, the in-repo CTF eval, and the
94
+ external SecQA bench.
95
+
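The sketch below shows one plausible reading of "debiased multi-permutation text-scoring", not the project's actual harness; the linked methodology doc is authoritative. The idea assumed here: shuffle each question's option order, pick the option whose text scores highest, compute accuracy per permutation, and average. `permuted_accuracy`, `score_option`, and the toy question are hypothetical names and data introduced for illustration.

```python
import random
from typing import Callable, Sequence

# Hypothetical scorer signature: (question, options in presented order, index) -> score.
# In a real harness this would be a model log-likelihood of the option text.
Scorer = Callable[[str, Sequence[str], int], float]


def permuted_accuracy(
    questions: Sequence[dict],
    score_option: Scorer,
    n_perms: int = 2,
    seed: int = 0,
) -> float:
    """Average of per-permutation accuracies over n_perms shuffles of option order."""
    rng = random.Random(seed)
    per_perm = []
    for _ in range(n_perms):
        correct = 0
        for q in questions:
            options = list(q["options"])
            rng.shuffle(options)  # new presentation order for this permutation
            scores = [score_option(q["question"], options, i) for i in range(len(options))]
            picked = options[scores.index(max(scores))]
            # Grade on the option text, never on its letter or position.
            correct += picked == q["answer"]
        per_perm.append(correct / len(questions))
    return sum(per_perm) / len(per_perm)


# Toy data, assumed for illustration only.
toy_set = [
    {
        "question": "Which port does HTTPS use by default?",
        "options": ["21", "25", "443", "3389"],
        "answer": "443",
    }
]


def oracle(question: str, options: Sequence[str], i: int) -> float:
    """Stand-in scorer that always prefers the correct text."""
    return 1.0 if options[i] == "443" else 0.0


def first_position(question: str, options: Sequence[str], i: int) -> float:
    """Stand-in scorer with pure position bias: always prefers the first option."""
    return -float(i)


print(permuted_accuracy(toy_set, oracle))
print(permuted_accuracy(toy_set, first_position, n_perms=8))
```

A scorer driven only by position, like `first_position`, tends toward chance as the number of permutations grows, which is the bias this style of evaluation is meant to expose.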
96
+ **But free-form fact recall is at floor across the entire 81M ghost-small
97
+ rung.** A 50-question hand-written fact-recall set (CVE / CWE / MITRE /
98
+ OWASP / crypto / protocol / misc) graded by substring match scores
99
+ 0-2% across every chat-tune in the line. The v0.9 model's one "hit"
100
+ ("256" appearing in a SHA-256 question) is arguably spurious. **MCQ wins
101
+ measure register matching and topic distinctness, not factual recall.**
102
+
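A minimal sketch of substring-match grading, assuming a case-insensitive check against a list of accepted strings per question. The function names and the example question are made up for illustration; the repo's grader may differ. It also shows how a reply can score a "hit" simply by echoing a token from the question.

```python
from typing import Sequence


def substring_hit(reply: str, accepted: Sequence[str]) -> bool:
    """True if any accepted answer string occurs in the reply (case-insensitive)."""
    text = reply.casefold()
    return any(answer.casefold() in text for answer in accepted)


def recall_rate(replies: Sequence[str], answer_key: Sequence[Sequence[str]]) -> float:
    """Fraction of replies that contain at least one accepted answer string."""
    hits = sum(substring_hit(r, a) for r, a in zip(replies, answer_key))
    return hits / len(answer_key)


# Hypothetical question: "How many bits long is a SHA-256 digest?" with key ["256"].
# The reply below never states the digest length, yet it matches because it
# repeats "SHA-256" from the question.
echo_reply = "SHA-256 is a hashing algorithm used in many security protocols."
print(substring_hit(echo_reply, ["256"]))
```

Stricter grading (for example, requiring the answer outside any text copied from the question) would avoid this kind of match, at the cost of more hand-tuning per question.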
103
+ ## Architecture
104
+
105
+ | Field | Value |
106
+ |---|---|
107
+ | Layers | 6 |
108
+ | d_model | 768 |
109
+ | Attention heads | 12 (head_dim 64) |
110
+ | FFN | SwiGLU, hidden = `int(d_ff × 2/3)` rounded to a multiple of 64 = 2048 |
111
+ | Normalization | RMSNorm |
112
+ | Position | RoPE (base 10000) |
113
+ | Vocab | 50,264 (GPT-2 50K BPE + 7 special tokens) |
114
+ | Context | 512 train, 1024 inference |
115
+ | Total params | ~81M |
116
+
117
+ Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is
118
+ what produces the bench delta over v0.7.
119
+
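As a sanity check on the table above, the parameter count can be reproduced with back-of-envelope arithmetic. This is a sketch under assumptions the card does not state: `d_ff = 4 × d_model = 3072`, tied input/output embeddings, no bias terms, two RMSNorm weight vectors per layer plus one final norm, rounding the SwiGLU width up to a multiple of 64, and fp32 storage for the size estimate.

```python
def ghost_small_params(
    n_layers: int = 6,
    d_model: int = 768,
    d_ff: int = 3072,       # assumed 4 * d_model; not stated in the card
    vocab: int = 50_264,
    multiple: int = 64,
) -> tuple[int, int]:
    """Return (SwiGLU hidden width, total parameter count) under the stated assumptions."""
    hidden = (2 * d_ff) // 3
    hidden = -(-hidden // multiple) * multiple   # round up to a multiple of 64
    attn = 4 * d_model * d_model                 # Q, K, V, and output projections
    ffn = 3 * d_model * hidden                   # gate, up, and down projections
    norms = 2 * d_model                          # two RMSNorm weight vectors per layer
    embed = vocab * d_model                      # shared with the output head (assumed tied)
    total = embed + n_layers * (attn + ffn + norms) + d_model  # + final RMSNorm
    return hidden, total


hidden, total = ghost_small_params()
print(f"SwiGLU hidden width: {hidden}")
print(f"{total:,} params, ~{total * 4 / 1e6:.0f} MB in fp32")
```

Under these assumptions the total is about 81M and the fp32 size is about 324 MB, in line with the figures quoted for the architecture and for `best_model.pt`.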
120
+ ## Training data
121
+
122
+ Pretrain corpus: 273M tokens spanning
123
+
124
+ - **PRIMUS-Seed** (Trend Micro AI Lab, Apache 2.0): curated cybersec text
125
+ - **PRIMUS-FineWeb** (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl
126
+ - **NVD CVEs** (NIST, public domain): full v2 description text
127
+ - **MITRE ATT&CK + CWE + CAPEC** (MITRE, custom permissive): technique / weakness / pattern descriptions
128
+ - **OWASP** (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance
129
+ - **IETF RFCs** (BCP 78, public): security-relevant RFCs
130
+ - **CTFtime + Exploit-DB** (open): real CTF write-ups and exploit POCs
131
+ - **arXiv cs.CR**: full-text academic papers
132
+ - **fact-QA**: ~11K Q&A pairs distilled by Qwen-14B from the corpus
133
+
134
+ Per-source breakdown in
135
+ [`CORPUS.md`](https://github.com/joemunene-by/GhostLM/blob/main/CORPUS.md).
136
+
137
+ Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using
138
+ the chat-v3 recipe. Three role tokens (`<|ghost_user|>`,
139
+ `<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer.
140
+
141
+ ## Intended use
142
+
143
+ - Educational: a transparent, hand-written reference implementation of
144
+ a from-scratch decoder-only cybersecurity LM, trained on a curated
145
+ corpus, with all code on GitHub and all recipes documented.
146
+ - Research: a bench artifact for "what does an 81M from-scratch cyber
147
+ LM actually score on CTIBench / SecQA?" The honest answer (28.9% /
148
+ 39.3%) is meaningful evidence about the parameter-count requirement
149
+ for factual recall on cybersec MCQ.
150
+
151
+ ## What this model is NOT for
152
+
153
+ - **Anything that depends on factual recall.** Free-form fact recall
154
+ is at floor. CVE numbers, version chains, MITRE technique IDs,
155
+ CVSS scores produced by this model are unreliable. Verify against
156
+ authoritative sources.
157
+ - **General-purpose tasks.** Outside cybersecurity the model politely
158
+ declines and returns to its domain. Do not expect it to summarize
159
+ news, write code, or answer arbitrary questions.
160
+ - **Production cybersec workflows.** Not for incident response,
161
+ threat hunting, or any decision that affects real systems.
162
+
163
+ ## Loading
164
+
165
+ ```python
166
+ import torch
167
+ from huggingface_hub import hf_hub_download
168
+ from ghostlm.config import GhostLMConfig
169
+ from ghostlm.model import GhostLM
170
+ from ghostlm.tokenizer import GhostTokenizer
171
+ from dataclasses import fields
172
+
173
+ # Pull weights
174
+ ckpt_path = hf_hub_download(
175
+ repo_id="Ghostgim/GhostLM-v0.9-experimental",
176
+ filename="best_model.pt",
177
+ )
178
+
179
+ # Load (weights_only=False unpickles arbitrary Python objects; only use it on a checkpoint you trust)
180
+ ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
181
+ saved = ckpt["config"]
182
+ config = GhostLMConfig(**{
183
+ f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved
184
+ })
185
+ model = GhostLM(config)
186
+ model.load_state_dict(ckpt["model_state_dict"])
187
+ model.eval()
188
+
189
+ tokenizer = GhostTokenizer() # GPT-2 BPE + 7 special tokens
190
+
191
+ # Multi-turn chat using the role tokens
192
+ turns = [{"role": "user", "content": "What is XSS?"}]
193
+ prompt_ids = tokenizer.format_chat_prompt(turns)
194
+ # ... see scripts/chat.py for full generation loop
195
+ ```
196
+
197
+ The full code (architecture, tokenizer, generation, eval, training) is
198
+ in the [GhostLM GitHub repo](https://github.com/joemunene-by/GhostLM).
199
+
200
+ ## Live demo
201
+
202
+ [`huggingface.co/spaces/Ghostgim/ghostlm`](https://huggingface.co/spaces/Ghostgim/ghostlm)
203
+
204
+ Pulls these weights via `hf_hub_download` on first launch. CPU inference
205
+ takes ~15-25 s per reply at the default 200-token cap. The demo is
206
+ intentionally honest about the fact-recall floor; expect register-shaped
207
+ output rather than reliable answers.
208
+
209
+ ## Caveats
210
+
211
+ - **Hallucination is the norm**, not the exception. This is an 81M
212
+ from-scratch model, not a fine-tuned 7B foundation model.
213
+ - **MCQ wins do not imply factual recall.** Test with the free-form
214
+ fact-recall benchmark, not just CTIBench.
215
+ - **Pretrain corpus is sub-Chinchilla.** 273M tokens for 81M params is
216
+ ~3.4 tokens per parameter, roughly 6× under the ~20 tokens-per-parameter Chinchilla rule of thumb; the chat tune partially compensates,
217
+ but the model is undertrained relative to its capacity.
218
+
219
+ ## Citation
220
+
221
+ ```bibtex
222
+ @misc{munene2026ghostlm,
223
+ title = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder},
224
+ author = {Munene, Joe},
225
+ year = {2026},
226
+ howpublished = {\url{https://github.com/joemunene-by/GhostLM}},
227
+ note = {v0.9.2 release; 81M-parameter chat checkpoint}
228
+ }
229
+ ```
230
+
231
+ ## Roadmap
232
+
233
+ The next rung is **ghost-base (~360M, SmolLM2-360M shape)**, gated on
234
+ rented GPU compute. Acceptance gate:
235
+
236
+ - ≥40% on debiased CTIBench (full n=2500), OR
237
+ - ≥65% on the in-repo CTF MCQ eval, OR
238
+ - ≥30% on the 50-question free-form fact-recall set.
239
+
240
+ The fact-recall bar is the truth metric. Spec at
241
+ [`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md);
242
+ multi-year pathway through ghost-7B in
243
+ [`docs/hardware_pathway.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/hardware_pathway.md).
244
+
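The acceptance gate above is a plain OR over three thresholds. A minimal sketch, with `BenchResult` and `passes_gate` as hypothetical names, applied to the v0.9 chat numbers reported in this card:

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    ctibench_acc: float      # debiased CTIBench MCQ accuracy, full n=2500
    ctf_acc: float           # in-repo CTF MCQ accuracy
    fact_recall_acc: float   # 50-question free-form fact recall, substring match


def passes_gate(r: BenchResult) -> bool:
    """True if any one of the three ghost-base acceptance thresholds is met."""
    return (
        r.ctibench_acc >= 0.40
        or r.ctf_acc >= 0.65
        or r.fact_recall_acc >= 0.30
    )


v09_chat = BenchResult(ctibench_acc=0.289, ctf_acc=0.592, fact_recall_acc=1 / 50)
print(passes_gate(v09_chat))  # False: v0.9 chat clears none of the three bars
```

Because the conditions are joined by OR, a future checkpoint could pass on MCQ accuracy alone while fact recall stays at floor, which is worth keeping in mind when reading a pass.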
245
+ ## License
246
+
247
+ Apache 2.0. Same license as the GhostLM source code.
248
+
249
+ Built by Joe Munene.