docs: model card with bench numbers, intended use, caveats
Adds proper README.md frontmatter so the model surfaces in HF model search
and tags, with the v0.9 chat bench numbers (CTIBench 28.9% / SecQA 39.3% /
in-repo CTF 59.2% / free-form fact recall 1/50) under the model-index schema.
Body documents architecture (6L x 768d x 12h, RoPE + SwiGLU + RMSNorm),
training corpus (273M tokens spanning PRIMUS / NVD / MITRE / CWE / OWASP /
RFCs / arXiv / fact-QA), intended use (educational + research), what the
model is NOT for (anything requiring factual recall), loading code, and
explicit caveats around the fact-recall floor.
Cross-links the GitHub source, the demo Space, the ghost_base_spec, and the
multi-year hardware pathway.
README.md
ADDED
@@ -0,0 +1,249 @@
---
license: apache-2.0
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- cybersecurity
- language-model
- from-scratch
- small-model
- cti
- ctibench
- chat
datasets:
- trendmicro-ailab/Primus-Seed
- trendmicro-ailab/Primus-FineWeb
- AI4Sec/cti-bench
model-index:
- name: GhostLM v0.9 chat (81M)
  results:
  - task:
      type: multiple-choice
      name: CTIBench MCQ (debiased text-scoring, n=2500)
    dataset:
      name: CTIBench MCQ
      type: AI4Sec/cti-bench
      split: test
    metrics:
    - name: 2-permutation per-perm avg
      type: accuracy
      value: 0.289
  - task:
      type: multiple-choice
      name: SecQA (n=210)
    dataset:
      name: SecQA
      type: zefang-liu/secqa
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.393
  - task:
      type: multiple-choice
      name: GhostLM CTF MCQ (in-repo, n=30)
    dataset:
      name: GhostLM CTF eval
      type: in-repo
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.592
  - task:
      type: text-generation
      name: GhostLM free-form fact recall (n=50)
    dataset:
      name: GhostLM fact-recall bench
      type: in-repo
    metrics:
    - name: substring-match accuracy
      type: accuracy
      value: 0.02
---

# GhostLM v0.9 chat (81M, from-scratch cybersecurity LM)

GhostLM is a multi-rung scale-ladder cybersecurity language model
trained entirely from scratch in PyTorch. **v0.9 chat is the bench
winner of the ghost-small (45-81M) line on every multiple-choice
benchmark we evaluated.** It is also where the line saturates: at 81M
parameters the model has the *register* of cybersec writing but not the
*facts* in any retrievable form. The next rung is
[ghost-base](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md)
(~360M, SmolLM2-360M shape), gated on rented GPU compute.

This repo holds the slim inference checkpoint
(`best_model.pt`, 324 MB, model + config only, optimizer state stripped).

## Bench numbers

All benches run with debiased multi-permutation text-scoring on
checkpointed CPU/GPU inference. Methodology in
[`docs/ctibench_bias_finding.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ctibench_bias_finding.md).

| Benchmark | Records | v0.4 chat-v3 | v0.7 chat | **v0.9 chat** | Random |
|---|---:|---:|---:|---:|---:|
| CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | **28.9%** | 25.0% |
| In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | **59.2%** | 25.0% |
| SecQA (external) | 210 | 35.0% | 37.6% | **39.3%** | 25.0% |
| Free-form fact recall | 50 | 0/50 | 1/50 | **1/50** | 0/50 |

v0.9 wins every multiple-choice benchmark, by 1.3 to 9.2 pp over the
best earlier checkpoint in the table above. The MCQ ranking holds
across CTIBench, the in-repo CTF eval, and the external SecQA bench.
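
The idea behind permutation-debiased MCQ scoring can be sketched in a few lines: each question is scored under several orderings of its options, and accuracy is averaged over every (question, permutation) pair, so a preference for one answer slot stops paying off. This is an illustration only, not the repo's evaluation code. `per_permutation_accuracy`, `pick_fn`, and the toy pickers are hypothetical names, and the four cyclic shifts are chosen because they make the toy collapse exactly to chance (the card itself reports a 2-permutation average).

```python
from typing import Callable, Sequence

Question = tuple[Sequence[str], int]  # (options, index of the correct option)

# Four cyclic shifts: every option visits every slot exactly once.
CYCLIC = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]


def per_permutation_accuracy(
    questions: Sequence[Question],
    permutations: Sequence[Sequence[int]],
    pick_fn: Callable[[Sequence[str]], int],
) -> float:
    """Average accuracy over every (question, permutation) pair."""
    hits = 0
    total = 0
    for options, answer in questions:
        for perm in permutations:
            shuffled = [options[i] for i in perm]
            picked_slot = pick_fn(shuffled)
            # Map the picked slot back to the original option index.
            hits += int(perm[picked_slot] == answer)
            total += 1
    return hits / total


def make_question(answer: int) -> Question:
    """Toy 4-option question whose correct option is the string 'correct'."""
    options = [f"distractor {i}" for i in range(4)]
    options[answer] = "correct"
    return options, answer


def always_first(options: Sequence[str]) -> int:
    """Position-biased picker: ignores content, always takes slot 0."""
    return 0


def reads_content(options: Sequence[str]) -> int:
    """Content-based picker: finds the correct string wherever it sits."""
    return list(options).index("correct")


# An answer key skewed toward slot 0 flatters the position-biased picker.
questions = [make_question(a) for a in (0, 0, 0, 1)]
biased_single = per_permutation_accuracy(questions, CYCLIC[:1], always_first)
biased_debiased = per_permutation_accuracy(questions, CYCLIC, always_first)
content_debiased = per_permutation_accuracy(questions, CYCLIC, reads_content)
```

On this toy set the slot-0 picker looks strong under the original ordering alone and drops to the 25% random baseline once all orderings are averaged, while a picker that actually reads the options is unaffected.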

**But free-form fact recall is at floor across the entire 81M ghost-small
rung.** A 50-question hand-written fact-recall set (CVE / CWE / MITRE /
OWASP / crypto / protocol / misc) graded by substring match scores
0-2% across every chat-tune in the line. The v0.9 model's one "hit"
("256" appearing in a SHA-256 question) is arguably spurious. **MCQ wins
measure register matching and topic distinctness, not factual recall.**
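
Substring-match grading is simple enough to sketch, and the sketch shows why a lone hit can be spurious: the grader only checks that an accepted string occurs somewhere in the output. The function names and the sample output below are invented for illustration; the actual grader and question set live in the GitHub repo.

```python
def substring_match(output: str, accepted: list[str]) -> bool:
    """True if any accepted answer occurs in the output, case-insensitively."""
    text = output.lower()
    return any(answer.lower() in text for answer in accepted)


def substring_accuracy(outputs: list[str], accepted: list[list[str]]) -> float:
    """Fraction of outputs that contain at least one accepted answer."""
    hits = sum(substring_match(o, a) for o, a in zip(outputs, accepted))
    return hits / len(outputs)


# Hypothetical case: the accepted answer "256" is part of the name "SHA-256",
# so an output that merely repeats the name is graded as a hit.
echo_hit = substring_match("SHA-256 is used by attackers to scan ports.", ["256"])
clean_miss = substring_match("It produces a short digest.", ["256"])
one_in_fifty = substring_accuracy(["256"] + ["unrelated"] * 49, [["256"]] * 50)
```

A stricter grader (exact match, or a check that the answer is not already present in the question) would trade false positives for false negatives; the card's 1/50 figure uses plain substring match.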

## Architecture

| Field | Value |
|---|---|
| Layers | 6 |
| d_model | 768 |
| Attention heads | 12 (head_dim 64) |
| FFN | SwiGLU, hidden = `int(d_ff × 2/3)` rounded to a multiple of 64 = 2048 |
| Normalization | RMSNorm |
| Position | RoPE (base 10000) |
| Vocab | 50,264 (GPT-2 50K BPE + 7 special tokens) |
| Context | 512 train, 1024 inference |
| Total params | ~81M |

Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is
what produces the bench delta over v0.7.
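
The ~81M total can be reproduced from the table with a back-of-envelope count. The sketch below assumes things the card does not state: tied input/output embeddings, bias-free linear layers, `d_ff = 4 × d_model = 3072`, and rounding up to the next multiple of 64. `swiglu_hidden` and `count_params` are illustrative names, not functions from the repo.

```python
def swiglu_hidden(d_ff: int, multiple: int = 64) -> int:
    """Two thirds of d_ff, rounded up to a multiple of `multiple` (assumed)."""
    raw = d_ff * 2 // 3
    return (raw + multiple - 1) // multiple * multiple


def count_params(
    n_layers: int = 6,
    d_model: int = 768,
    vocab: int = 50_264,
    d_ff: int = 3072,  # assumption: 4 * d_model
) -> int:
    hidden = swiglu_hidden(d_ff)
    embed = vocab * d_model          # shared with the output head (assumed tied)
    attn = 4 * d_model * d_model     # Q, K, V and output projections, no biases
    ffn = 3 * d_model * hidden       # SwiGLU gate, up and down projections
    norms = 2 * d_model              # two RMSNorm gain vectors per block
    final_norm = d_model
    return embed + n_layers * (attn + ffn + norms) + final_norm


total = count_params()
size_mb_fp32 = total * 4 / 1e6  # 4 bytes per parameter in float32
```

Under these assumptions the count is 81,080,064 parameters, of which about 38.6M are the embedding matrix. At 4 bytes each that is roughly 324 MB, which is at least consistent with the stated size of `best_model.pt`.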

## Training data

Pretrain corpus: 273M tokens spanning

- **PRIMUS-Seed** (Trend Micro AI Lab, Apache 2.0): curated cybersec text
- **PRIMUS-FineWeb** (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl
- **NVD CVEs** (NIST, public domain): full v2 description text
- **MITRE ATT&CK + CWE + CAPEC** (MITRE, custom permissive): technique / weakness / pattern descriptions
- **OWASP** (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance
- **IETF RFCs** (BCP 78, public): security-relevant RFCs
- **CTFtime + Exploit-DB** (open): real CTF write-ups and exploit POCs
- **arXiv cs.CR**: full-text academic papers
- **fact-QA**: ~11K Q&A pairs distilled by Qwen-14B from the corpus

Per-source breakdown in
[`CORPUS.md`](https://github.com/joemunene-by/GhostLM/blob/main/CORPUS.md).

Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using
the chat-v3 recipe. Three role tokens (`<|ghost_user|>`,
`<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer.
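
To make the role tokens concrete, here is one plausible way they could delimit turns in a prompt string. The layout is an assumption made for illustration: the real format is whatever `GhostTokenizer.format_chat_prompt` produces in the repo, and `render_chat` is a hypothetical helper, not part of GhostLM.

```python
USER = "<|ghost_user|>"
ASSISTANT = "<|ghost_assistant|>"
END = "<|ghost_end|>"


def render_chat(turns: list[dict]) -> str:
    """Wrap each turn in its role token and END (assumed layout)."""
    parts = []
    for turn in turns:
        if turn["role"] == "user":
            tag = USER
        elif turn["role"] == "assistant":
            tag = ASSISTANT
        else:
            raise ValueError(f"unknown role: {turn['role']!r}")
        parts.append(f"{tag}{turn['content']}{END}")
    # Leave an assistant turn open so generation continues from here.
    parts.append(ASSISTANT)
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "What is XSS?"}])
```

With this layout, generation would stop when the model emits `<|ghost_end|>`, which is the usual reason for giving the end-of-turn marker its own token.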

## Intended use

- Educational: a transparent, hand-written reference implementation of
  a from-scratch decoder-only cybersecurity LM, trained on a curated
  corpus, with all code on GitHub and all recipes documented.
- Research: a bench artifact for "what does an 81M from-scratch cyber
  LM actually score on CTIBench / SecQA?" The honest answer (28.9% /
  39.3%) is meaningful evidence about the parameter-count requirement
  for factual recall on cybersec MCQ.

## What this model is NOT for

- **Anything that depends on factual recall.** Free-form fact recall
  is at floor. CVE numbers, version chains, MITRE technique IDs, and
  CVSS scores produced by this model are unreliable. Verify against
  authoritative sources.
- **General-purpose tasks.** Outside cybersecurity the model politely
  declines and returns to its domain. Do not expect it to summarize
  news, write code, or answer arbitrary questions.
- **Production cybersec workflows.** Not for incident response,
  threat hunting, or any decision that affects real systems.

## Loading

```python
from dataclasses import fields

import torch
from huggingface_hub import hf_hub_download

# ghostlm is the package from the GhostLM GitHub repo; it must be importable.
from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM
from ghostlm.tokenizer import GhostTokenizer

# Pull weights
ckpt_path = hf_hub_download(
    repo_id="Ghostgim/GhostLM-v0.9-experimental",
    filename="best_model.pt",
)

# Load. weights_only=False unpickles arbitrary Python objects, so only use it
# on checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
saved = ckpt["config"]
# Keep only the keys that GhostLMConfig actually declares.
config = GhostLMConfig(**{
    f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved
})
model = GhostLM(config)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = GhostTokenizer()  # GPT-2 BPE + 7 special tokens

# Multi-turn chat using the role tokens
turns = [{"role": "user", "content": "What is XSS?"}]
prompt_ids = tokenizer.format_chat_prompt(turns)
# ... see scripts/chat.py for full generation loop
```

The full code (architecture, tokenizer, generation, eval, training) is
in the [GhostLM GitHub repo](https://github.com/joemunene-by/GhostLM).
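
The real generation loop is in `scripts/chat.py`. As a rough picture of what such a loop does, here is a greedy-decoding sketch with a 200-token cap. `next_logits` is a hypothetical stand-in for a forward pass that returns next-token scores for a list of token ids, and the toy scorer exists only to make the example runnable; the repo's script may well sample with temperature or top-k instead.

```python
from typing import Callable, Sequence


def greedy_generate(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    end_id: int,
    max_new_tokens: int = 200,
) -> list[int]:
    """Append the highest-scoring token until end_id or the token cap."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == end_id:
            break
        ids.append(next_id)
    # Return only the newly generated tokens, without the prompt.
    return ids[len(prompt_ids):]


def toy_next_logits(ids: Sequence[int]) -> list[float]:
    """Toy 'model' over a 5-token vocab: always predicts (last id + 1) mod 5."""
    logits = [0.0] * 5
    logits[(ids[-1] + 1) % 5] = 1.0
    return logits


reply = greedy_generate(toy_next_logits, prompt_ids=[0], end_id=4)
capped = greedy_generate(toy_next_logits, [0], end_id=-1, max_new_tokens=7)
```

With the actual model, `next_logits` would wrap a `torch.no_grad()` forward pass and `end_id` would be the id of the end-of-turn token.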

## Live demo

[`huggingface.co/spaces/Ghostgim/ghostlm`](https://huggingface.co/spaces/Ghostgim/ghostlm)

Pulls these weights via `hf_hub_download` on first launch. CPU inference
takes ~15-25 s per reply at the default 200-token cap. The demo is
intentionally honest about the fact-recall floor; expect register-shaped
output rather than reliable answers.

## Caveats

- **Hallucination is the norm**, not the exception. This is an 81M
  from-scratch model, not a fine-tuned 7B foundation model.
- **MCQ wins do not imply factual recall.** Test with the free-form
  fact-recall benchmark, not just CTIBench.
- **Pretrain corpus is sub-Chinchilla.** 273M tokens for 81M params is
  about 3.4 tokens per parameter, roughly 6× under the ~20
  tokens-per-parameter Chinchilla rule of thumb (about 3× under if only
  non-embedding parameters are counted); the chat tune partially
  compensates, but the model is undertrained relative to its capacity.
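
The Chinchilla comparison is quick to check. The sketch assumes the commonly quoted rule of thumb of about 20 training tokens per parameter, and estimates non-embedding parameters by subtracting a 50,264 × 768 embedding matrix from 81M (which presumes tied embeddings); `chinchilla_gap` is an illustrative name.

```python
def chinchilla_gap(tokens: float, params: float, tokens_per_param: float = 20.0) -> float:
    """How many times fewer tokens were used than the rule of thumb suggests."""
    return tokens_per_param * params / tokens


TOKENS = 273e6
TOTAL_PARAMS = 81e6
# Assumption: one tied 50,264 x 768 embedding matrix is the only embedding cost.
NON_EMBED_PARAMS = TOTAL_PARAMS - 50_264 * 768

tokens_per_param = TOKENS / TOTAL_PARAMS
gap_total = chinchilla_gap(TOKENS, TOTAL_PARAMS)
gap_non_embed = chinchilla_gap(TOKENS, NON_EMBED_PARAMS)
```

This gives about 3.4 tokens per parameter, a gap of about 5.9× on total parameters, and about 3.1× when only non-embedding parameters are counted, which is where a "~3×" figure would come from.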

## Citation

```bibtex
@misc{munene2026ghostlm,
  title = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder},
  author = {Munene, Joe},
  year = {2026},
  howpublished = {\url{https://github.com/joemunene-by/GhostLM}},
  note = {v0.9.2 release; 81M-parameter chat checkpoint}
}
```

## Roadmap

The next rung is **ghost-base (~360M, SmolLM2-360M shape)**, gated on
rented GPU compute. Acceptance gate:

- ≥40% on debiased CTIBench (full n=2500), OR
- ≥65% on the in-repo CTF MCQ eval, OR
- ≥30% on the 50-question free-form fact-recall set.

The fact-recall bar is the truth metric. Spec at
[`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md);
multi-year pathway through ghost-7B in
[`docs/hardware_pathway.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/hardware_pathway.md).
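
The acceptance gate is a disjunction: clearing any one of the three bars is enough. A minimal sketch, with `passes_gate` as a hypothetical helper name and scores expressed as fractions:

```python
def passes_gate(ctibench: float, ctf: float, fact_recall: float) -> bool:
    """True if any one of the three ghost-base acceptance bars is met."""
    return ctibench >= 0.40 or ctf >= 0.65 or fact_recall >= 0.30


# v0.9 chat numbers from this card, shown only for scale: the gate applies to
# ghost-base, not to this checkpoint.
v09_would_pass = passes_gate(ctibench=0.289, ctf=0.592, fact_recall=0.02)
# 15/50 on fact recall alone would clear the gate.
fact_recall_alone = passes_gate(ctibench=0.289, ctf=0.592, fact_recall=15 / 50)
```

For reference, the v0.9 chat scores sit below all three bars, with fact recall furthest from its threshold.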

## License

Apache 2.0. Same license as the GhostLM source code.

Built by Joe Munene.