Ghostgim committed on
Commit 9740f19 · verified · 1 Parent(s): 2ac3661

docs: HF-flavored model card for v0.5.0 chat-v3 (CTIBench 36.9%)

Files changed (1)
  1. README.md +205 -307
README.md CHANGED
@@ -1,351 +1,249 @@
1
  ---
2
- language:
3
- - en
4
  license: mit
5
- library_name: pytorch
 
6
  tags:
7
- - cybersecurity
8
- - transformer
9
- - language-model
10
- - decoder-only
11
- - from-scratch
12
- - cve
13
- - ctf
14
- - security
15
- datasets:
16
- - custom
17
  pipeline_tag: text-generation
18
  model-index:
19
- - name: ghost-tiny
20
- results: []
21
- - name: ghost-small
22
- results: []
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23
  ---
24
 
25
- # GhostLM — Cybersecurity Language Model
26
-
27
- ## Model Details
28
 
29
- | Field | Value |
30
- |---|---|
31
- | **Model Names** | `ghostlm/ghost-small` (~45M params, current canonical). `ghostlm/ghost-tiny` (14.7M, historical canonical and better PMI-suite scorer). Future: `ghost-base`, `ghost-1B` |
32
- | **Architecture** | Decoder-only transformer |
33
- | **Author** | [Joe Munene](https://github.com/joemunene-by) |
34
- | **License** | MIT |
35
- | **Language** | English |
36
- | **Framework** | PyTorch (built from scratch, no pretrained weights) |
37
- | **Version** | 0.4.0 (Phase 4 ghost-small — 30K steps on the 12.56M-token Phase 3.6 corpus, val_loss 2.3535, overall val PPL 11.12 — capacity-reallocation hypothesis confirmed) |
38
-
39
- ## Model Description
40
 
41
- GhostLM is a cybersecurity-focused decoder-only transformer language model built entirely from scratch in PyTorch. No pretrained weights, no wrappers — every component (attention, feed-forward, embeddings, training loop) is hand-implemented.
 
 
42
 
43
- The model is trained on CVE vulnerability descriptions from the National Vulnerability Database, CTF writeups, and security research papers. It is designed for cybersecurity reasoning tasks: CVE analysis, exploit explanation, penetration testing assistance, and security concept generation.
44
 
45
- ## Model Variants
 
 
 
 
 
 
 
 
 
 
46
 
47
- | Variant | Layers | d_model | Heads | d_ff | Context | Params | Status |
48
- |---|---|---|---|---|---|---|---|
49
- | `ghostlm/ghost-tiny` | 2 | 256 | 4 | 1024 | 1024 | 14.7M | Phase 3.5 (historical canonical). 30K steps on ~8.8M tokens, overall PPL 66, PMI suite 31.2% |
50
- | `ghostlm/ghost-small` | 6 | 512 | 8 | 2048 | 1024 | ~45M | **Phase 4 complete (current canonical). 30K steps on ~12.56M tokens, overall PPL 11.12 (−83%), val_loss 2.3535** |
51
- | `ghostlm/ghost-base` | 12 | 768 | 12 | 3072 | 1024 | ~350M | Planned (rented GPU) |
52
- | `ghostlm/ghost-1B` | 24 | 1024 | 16 | 4096 | 1024 | ~1B | Long-term goal |
53
 
54
- ghost-tiny is the iteration vehicle. The scale ladder above is the path to a genuinely useful from-scratch cyber LM. See [ROADMAP.md](ROADMAP.md) for phased milestones, compute requirements, and corpus targets.
 
 
 
 
55
 
56
  ## Architecture
57
 
58
- - **Type:** Decoder-only transformer with causal self-attention
59
- - **Normalization:** Pre-norm (LayerNorm before attention and FFN sub-layers)
60
- - **Positional encoding:** Learned positional embeddings
61
- - **Activation:** GELU
62
- - **Tokenizer:** GPT-2 BPE via tiktoken (50,257 base tokens + 4 special tokens = 50,261 total)
63
- - **Weight tying:** Output projection shares weights with token embedding
64
- - **Attention:** Multi-head causal self-attention with combined QKV projection
65
- - **Initialization:** Normal(0, 0.02) with scaled residual init (std=0.02/sqrt(2*n_layers)) for projection layers
66
-
67
- ## Training Data
68
-
69
- The released v0.3.5 checkpoint was trained on the rebalanced Phase 3.5 corpus. NVD's full 333,540-record pull is on disk, but its training contribution is capped at 6M tokens by content-hash subsample so the corpus isn't 90% CVE descriptions:
70
-
71
- | Source | Records (raw → trained) | Trained tokens | Share | Type |
72
- |---|---|---|---|---|
73
- | NVD CVE Database | 333,540 → 71,828 | ~5.74M | **65.3%** | Real, capped via `--max-cve-tokens 6000000` |
74
- | Synthetic CTF Writeups | 3,000 | ~1.51M | 17.2% | Synthetic, placeholder until real CTFtime grows |
75
- | arXiv cs.CR Abstracts | 2,000 | ~0.74M | 8.4% | Real |
76
- | CTFtime real writeups | 473 → 467 | ~0.47M | 5.3% | Real, inline-only, per-record attribution |
77
- | MITRE ATT&CK | 691 | ~0.26M | 2.9% | Real (Apache 2.0) |
78
- | CAPEC | 609 | ~0.07M | 0.9% | Real (Apache 2.0) |
79
- | **Total (post-dedup)** | **74,635** | **~8.79M** | | train: 70,965 / val: 3,670 |
80
-
81
- **Data splits:** deterministic by content hash — identical or near-duplicate texts always land in the same split. Train/val leakage check returns 0.
82
-
83
- **Token share comparison (what the model sees):**
84
-
85
- | Phase | NVD share | Top non-NVD source | Overall |
86
- |---|---|---|---|
87
- | v0.3.3 (Phase 3) | 87% | CTF synthetic 5% | NVD-dominated |
88
- | **v0.3.5 (Phase 3.5)** | **65.3%** | **synthetic 17.2%** | **balanced across 6 sources** |
89
-
90
- The rebalance is reproducible: `python3 scripts/rebuild_corpus.py --max-cve-tokens 6000000` always produces the same 71,828-record CVE prefix.
91
-
92
- **Topics covered:** vulnerability detection, adversarial ML, network intrusion, cryptographic protocols, fuzzing, side-channel attacks, ransomware detection, supply chain security, memory safety, WAF evasion, SQL injection, XSS, buffer overflow, privilege escalation, reverse engineering, binary exploitation, steganography, network forensics.
93
-
94
- For corpus expansion plans (CTFtime, security blogs, MITRE ATT&CK, tool docs) and licensing notes, see [CORPUS.md](CORPUS.md).
95
-
96
- ## Training Details
97
-
98
- | Parameter | Value |
99
  |---|---|
100
- | Optimizer | AdamW (beta1=0.9, beta2=0.95, weight_decay=0.1) |
101
- | Learning rate | 3e-4 (with cosine decay to 1e-5) |
102
- | Warmup steps | 2,000 |
103
- | Gradient clipping | 1.0 |
104
- | Gradient accumulation | 4 steps |
105
- | Batch size (Phase 3.5) | 2 (effective batch = 8 with grad_accum) |
106
- | Max steps (Phase 3.5) | 30,000 |
107
- | Dropout | 0.1 |
108
- | Mixed precision | AMP on CUDA, fp32 on CPU |
109
-
110
- **Weight decay separation:** No weight decay applied to biases, LayerNorm parameters, or embedding weights. Only linear layer weights receive weight decay.
111
-
112
- **Hardware (Phase 3.5):** Mac Mini M4 (CPU). ~3h13m wall-clock for 30K steps at ~2.4 it/s. Cross-machine workflow: Linux box for data prep, corpus curation, and SSH-driven Mac orchestration; Mac Mini M4 for the training loop. The previous Nemotron-on-Mac harness was replaced this phase by direct `ssh ghostlm-mac` from Linux — drops the email-relay friction and lets the dev box drive the workhorse cleanly.
113
-
114
- **Phase 1** was run on a ThinkPad Yoga 11e (Celeron N4100) and is preserved as `checkpoints/best_model_phase1.pt`. **Phase 2** is preserved as `checkpoints/best_model_phase2.pt` (val_loss 3.78 on the 2.66M-token corpus). **Phase 3 (v0.3.3)** is preserved as `checkpoints/phase3_refresh/best_model.pt` (val_loss 3.45 on the post-NVD-pull corpus, overall PPL 172).
 
115
 
116
  ## Evaluation
117
 
118
- The v0.3.5 model is evaluated on two complementary axes — domain-modeling quality (per-source perplexity) and downstream reasoning (PMI-corrected security task accuracy).
119
-
120
- ### Per-source perplexity on the validation split
121
-
122
- 100 records sampled per source (deterministic seed). Lower is better.
123
-
124
- | Source | v0.3.3 PPL | v0.3.5 PPL | Δ% | Reading |
125
- |---|---|---|---|---|
126
- | MITRE ATT&CK | 615.43 | 55.14 | **−91%** | Was OOD for v0.3.3; now in training |
127
- | CTFtime real writeups | 184.24 | 60.71 | **−67%** | Was OOD for v0.3.3; now in training |
128
- | CAPEC | 326.11 | 133.81 | **−59%** | Was OOD for v0.3.3; now in training |
129
- | Synthetic CTF | 67.57 | 28.48 | **−58%** | Same data both phases — capacity reallocation |
130
- | arXiv cs.CR | 671.09 | 354.95 | **−47%** | Same data both phases — capacity reallocation |
131
- | NVD CVE | 24.19 | 27.55 | +14% | The expected, modest cost |
132
- | **Overall** | **171.84** | **66.05** | **−62%** | |
133
-
134
- The rebalance shifted the model from "knows NVD register, treats everything else as generic English" to "models each domain in proportion to its training share." The 47–58% improvements on synthetic CTF and arXiv are particularly notable because **the training data for those sources didn't change** — the gain comes from parameter capacity that v0.3.3 was burning on memorizing duplicate CVE descriptions being redirected onto already-present sources.
135
-
136
- ### PMI-corrected security task accuracy
137
-
138
- Five classification tasks × 25 hand-crafted samples each (125 total). PMI scoring (commit `aee8008`) replaces the previous mode-collapsed length-normalized scoring that reported 4/30 = 13.3% on every phase under logp scoring. Per-task random baseline depends on the number of candidate labels.
139
-
140
- The eval was expanded from 30 to 125 samples in v0.3.6; the v0.3.5 model below was re-scored on the larger suite, so the numbers in this table are not directly comparable to the 30-sample numbers in older releases. The expanded suite is the new canonical measurement; future phases will be reported on it. The smaller suite is preserved at `logs/eval_security_phase3.5_pmi.json` for archaeology.
141
-
142
- | Task | Labels | Random | v0.3.5 (125-sample) | Most-common share |
143
- |---|---|---|---|---|
144
- | CVE Severity Classification | 4 | 25.0% | 8/25 (32.0%) | Critical 72% |
145
- | Vulnerability Type Detection | 10 | 10.0% | 8/25 (32.0%) | IDOR 44% |
146
- | Attack Technique Identification | 10 | 10.0% | 10/25 (40.0%) | LatMov 36% |
147
- | CTF Challenge Categorization | 5 | 20.0% | 10/25 (40.0%) | Forensics 64% |
148
- | MITRE ATT&CK Tactic Classification | 12 | 8.3% | 3/25 (12.0%) | LatMov 40% |
149
- | **Overall** | — | ~14.5% (avg) | **39/125 (31.2%)** | — |
150
-
151
- Reading the table:
152
-
153
- - **Vulnerability Type Detection (+22 pp), Attack Technique Identification (+30 pp), CTF Challenge Categorization (+20 pp)** are the three tasks where v0.3.5 is meaningfully above random. These map onto the corpora that grew during the Phase 3.5 rebalance (CWE-tagged CVE bodies, MITRE technique pages, CTFtime real writeups) and the eval picks up that the model has internalized those domains.
154
- - **CVE Severity Classification (+7 pp above random with 72% prediction collapse onto Critical).** The model has learned that NVD descriptions usually accompany severe CVEs and bets that way regardless of input. The previous 10-sample suite happened to over-weight Critical/High labels in a way that masked this; the 25-sample suite with balanced severity distribution exposes it. This is the canary metric for whether subsequent training rungs learn calibrated severity reasoning.
155
- - **MITRE ATT&CK Tactic Classification (+3.7 pp above random).** Tactic-level classification is the model's weakest task — distinguishing Persistence from Privilege Escalation from Defense Evasion is hard from a single description even for humans, and ghost-tiny at 14.7M params on 8.8M tokens has not built that abstraction. This is the metric to watch when ghost-small is trained: if scaling the model doesn't move tactic accuracy above ~25%, the architectural jump didn't produce reasoning gains.
156
-
157
- #### Cross-phase trajectory on the expanded suite
158
-
159
- Every preserved ghost-tiny checkpoint was re-scored on the new 125-sample suite so the trajectory is end-to-end comparable. Cells are `correct/total (accuracy) [most-common-share]`:
160
-
161
- | Task | Phase 1 (2K) | Phase 2 (v0.3.0) | Phase 3 (v0.3.3) | Phase 3.5 (v0.3.5) | Phase 3.6 (v0.3.7) | **Phase 4 (v0.4.0)** |
162
- |---|---|---|---|---|---|---|
163
- | CVE Severity Classification | 7/25 (28.0%) [100%] | 5/25 (20.0%) [96%] | 4/25 (16.0%) [48%] | **8/25 (32.0%) [72%]** | 4/25 (16.0%) [60%] | 6/25 (24.0%) [72%] |
164
- | Vulnerability Type Detection | 3/25 (12.0%) [48%] | 6/25 (24.0%) [76%] | 7/25 (28.0%) [48%] | 8/25 (32.0%) [44%] | 3/25 (12.0%) [96%] | **10/25 (40.0%) [44%]** |
165
- | Attack Technique Identification | 2/25 (8.0%) [24%] | 3/25 (12.0%) [88%] | 5/25 (20.0%) [72%] | **10/25 (40.0%) [36%]** | 4/25 (16.0%) [60%] | 4/25 (16.0%) [52%] |
166
- | CTF Challenge Categorization | 2/25 (8.0%) [84%] | 7/25 (28.0%) [76%] | 6/25 (24.0%) [88%] | **10/25 (40.0%) [64%]** | 5/25 (20.0%) [48%] | 7/25 (28.0%) [72%] |
167
- | MITRE ATT&CK Tactic Classification | 1/25 (4.0%) [72%] | 2/25 (8.0%) [76%] | 3/25 (12.0%) [64%] | 3/25 (12.0%) [40%] | 5/25 (20.0%) [76%] | 2/25 (8.0%) [44%] |
168
- | **Overall (PMI)** | **15/125 (12.0%)** | **23/125 (18.4%)** | **25/125 (20.0%)** | **39/125 (31.2%)** | **21/125 (16.8%)** | **29/125 (23.2%)** |
169
-
170
- Phase 4 ghost-small (v0.4.0) is the new canonical model for density / generation work but lands lower than Phase 3.5 on the PMI scoring above. The honest read requires the second column type — **logp scoring** — which the suite also supports via `--scoring logp`:
171
-
172
- | Phase | PMI | logp | Δ (PMI − logp) |
173
- |---|---:|---:|---:|
174
- | Phase 3.5 (ghost-tiny) | **31.2%** | 17.6% | +13.6 pp |
175
- | Phase 4 (ghost-small) | 23.2% | **19.2%** | +4.0 pp |
176
-
177
- Two things to note:
178
-
179
- 1. **PMI flatters Phase 3.5 by 13.6 pp.** PMI subtracts unconditional candidate log-prob to break ties — useful when the model is mode-collapsing because it normalizes for "this candidate is just inherently high-probability". A loose-distribution model with weakly differentiated logits gives PMI more separation to extract; a tight-distribution model gives less. Phase 3.5 (low capacity) gets the bigger PMI uplift; Phase 4 (higher capacity, sharper distribution) gets a smaller one.
180
- 2. **Logp — the more conservative scorer — picks Phase 4.** With logp scoring, Phase 4 narrowly beats Phase 3.5 (19.2% vs 17.6%) on this same 125-sample suite. The PMI vs logp gap diagnoses an eval-methodology limitation rather than a model regression.
181
-
182
- The cleanest model metric remains per-source val PPL (no scoring rule, just density), where Phase 4 dominates Phase 3.5 by 59–78% across every source. See README's "Per-source perplexity" section for the full table.
183
-
184
- The clean head-to-head between deliberate moves (PMI suite):
185
- - **Phase 2→3 (3× training volume, fixed corpus): +1.6 pp**
186
- - **Phase 3→3.5 (corpus rebalance, fixed model+steps): +11.2 pp**
187
- - **Phase 3.5→3.6 (corpus volume, fixed model+steps): −14.4 pp** (ghost-tiny capacity ceiling)
188
- - **Phase 3.6→4 (model capacity, fixed corpus+steps): +6.4 pp PMI / +1.6 pp logp / −75% per-source PPL** (capacity-reallocation hypothesis confirmed)
189
-
190
- Use `make eval-security-all-phases` to re-run end-to-end, or `make eval-compare-phases` to regenerate the PMI table from saved JSONs. Run with `--scoring logp` to reproduce the logp column.
191
-
192
- ### Cyber-text perplexity vs GPT-2 (fixed external test set, ten samples)
193
-
194
- The benchmark sample is held out from training and unchanged across phases — it's directly comparable.
195
-
196
- | Phase | Perplexity | vs prior |
197
- |---|---|---|
198
- | Phase 1 | 2,183.94 | — |
199
- | Phase 2 | 152.71 | −93% |
200
- | Phase 3 (v0.3.3) | 142.09 | −7% |
201
- | **Phase 3.5 (v0.3.5)** | **96.24** | **−32%** |
202
- | GPT-2 small (117M) | 26.76 | (frozen baseline) |
203
-
204
- ghost-tiny is 14.7M params vs GPT-2 small's 117M — so we're closing the cyber-text gap with ~8× less capacity. Still far behind GPT-2 in absolute terms, which is correct: a 14.7M-param ghost-tiny is a learning artifact, not a competitor. The trajectory is what matters.
205
-
206
- ### Note on val_loss
207
-
208
- Final v0.3.5 val_loss is 3.5518 vs v0.3.3's 3.4458. **Do not read this as v0.3.3 being a better model.** The val sets are different — v0.3.5's val covers six sources (NVD, arxiv, ctftime, mitre, capec, synthetic) while v0.3.3's was NVD-dominated. A more diverse val set is harder to predict per-token regardless of model quality. The per-source perplexity table above is the cleaner read.
209
-
210
- ## Intended Uses
211
-
212
- ### Primary use cases
213
- - CVE analysis and vulnerability explanation
214
- - CTF challenge reasoning and methodology
215
- - Penetration testing report generation
216
- - Security concept explanation and education
217
- - Cybersecurity text completion and generation
218
-
219
- ### Out-of-scope uses
220
- - **Production security decisions:** This is a small research model. Do not use it to make real security assessments.
221
- - **Malware creation:** The model should not be used to develop malicious software or exploits for unauthorized use.
222
- - **Attacking systems without authorization:** Any use for illegal cybersecurity activity is prohibited.
223
-
224
- ## Limitations
225
-
226
- - **Small model size:** At 14.7M parameters, ghost-tiny is two-to-three orders of magnitude below production LLMs. Output quality reflects this.
227
- - **Limited training data:** ~30M tokens is still small for language-model pre-training (Chinchilla-optimal for 14.7M params would be ~300M tokens; for ghost-1B, ~20B tokens). The corpus needs to grow another ~30× for the upper rungs of the scale ladder.
228
- - **Surface-level fluency, weak grounding:** the model has learned the CVE-database register and surface vocabulary of cyber writing — it produces structurally correct CVE descriptions and security-prose grammar — but will hallucinate version chains, mix product names, and bind topic only loosely. See [Sample Generations](#sample-generations) below.
229
- - **Hallucinated CVE-shaped output:** prompting with `CVE-YYYY-NNNNN is a vulnerability in...` will produce plausible-looking but entirely fabricated CVE descriptions, including invented version chains. **Do not use these as factual.**
230
- - **No instruction tuning:** ghost-tiny is a base language model. It generates text completions, not structured answers.
231
- - **Mode-collapse on severity classification:** the model predicts "Critical" on 72% of CVE Severity samples regardless of input. Above random (32% vs 25% baseline) but the prior is doing most of the work. Calibrated severity reasoning is not present at this scale.
232
- - **No tactic-level abstraction:** MITRE ATT&CK Tactic Classification accuracy is 12% (vs 8.3% random baseline) — the model can identify concrete techniques but not the higher-level adversary goal that groups them. Expected to be the slowest reasoning capability to emerge with scale.
233
- - **Web/binary CTF confusion:** CTF Categorization scores 100% on Forensics and Cryptography but 0% on Web Exploitation — the model conflates web-exploit writeups with adjacent categories. Targeted corpus expansion in v0.4.0 should focus there.
234
- - **Synthetic CTF share:** 17.2% of training tokens are synthetic CTF writeups. Replacement with real CTFtime / GitHub writeup corpus is the priority next-track work; the synthetic source will be dropped once real-writeup volume exceeds it.
235
- - **NVD corpus weight:** 65.3% of training tokens are CVE descriptions (down from 87% pre-rebalance). The model still tilts toward CVE-style prose on neutral prompts.
236
- - **English only.**
237
-
238
- ## Sample Generations
239
-
240
- Real outputs from `checkpoints/best_model.pt` (Phase 3, val_loss 3.45), at `temperature=0.8, top_k=50`:
241
-
242
- > **Prompt:** *A SQL injection attack works by*
243
- > ...injecting a crafted payload in the "password" parameter. NOTE: some of these details are obtained from third party information. EINXCMS 3.4.7.2.0 contains a cross-site scripting vulnerability that allows users to embed arbitrary JavaScript code in the Web UI thus altering the intended functionality potentially leading to credentials disclosure within a trusted session. This issue is fixed in 1.0.10. Use after free in Google Chrome before 17.0.66.75 allowed a remote...
244
-
245
- > **Prompt:** *CVE-2021-44228 is a vulnerability in*
246
- > ...the same product which allows local users to overwrite the filesystem via a large value (1) id parameter to the filename parameter to wp-admin/admin/mod_api/admin/admin/admin/login.php in QNAP Business Application Platform, iMC Net-based, iQAP_NAP-V, and MQ-IP-P, MQ-V-M DB2G, MQ-AN-W, MQ-2 A300-
247
-
248
- > **Prompt:** *To enumerate Active Directory users from a Linux box, an attacker can*
249
- > ...exploit an attacker to inject malicious scripts as the target system and execute arbitrary code in the context of the current process. The vulnerability is due to the application. An attacker could exploit this vulnerability by submitting a crafted link that executes in the application. A successful exploit could allow the attacker to obtain sensitive information on the user. The BIO plugin for WordPress is vulnerable to Cross-Site Request Forgery in all versions up to, and including, 2.2. This is due to missing nonce validation on...
250
-
251
- > **Prompt:** *Cross-site scripting (XSS) allows attackers to*
252
- > ...inject arbitrary web script or HTML via unspecified vectors. The davcodec_server function in libav.c in the Linux kernel before 4.4.0.5, 4.x before 4.6.1, and 4.x before 4.3.0.x before 4.3.3, does not properly restrict access to (1) Ntcpd) PTP-based LDAP (2) X-3-NAP00-00N-N
253
-
254
- **What this shows compared to Phase 2:** the *register* is correct now — phrases like "Cross-Site Request Forgery in all versions up to, and including, 2.2 — this is due to missing nonce validation," "use after free," "remote attacker," "submitting a crafted link," "in the context of the current process" are all real CVE-database language used in roughly the right context. Phase 2 produced fragments like "the login page is used to the login page's name of the login page does not properly sanitization" — the same model class can't produce that anymore. **The hallucinations are still rampant** (made-up products, scrambled version strings, mixed-up vendor names) — the model has the *form* of CVE descriptions but not the *facts*. This is the expected outcome of corpus expansion at fixed model size: better surface fluency, no new factual capability.
255
-
256
- ## Ethical Considerations
257
-
258
- GhostLM is trained on cybersecurity content that inherently includes offensive security knowledge — exploit techniques, attack methodologies, and vulnerability details. This is the same information freely available in CVE databases, security conferences, and published research.
259
-
260
- **Responsible use:**
261
- - This model is intended for defensive security, education, and research.
262
- - Users should follow responsible disclosure practices when working with vulnerability information.
263
- - The model's outputs should not be used to attack systems without explicit authorization.
264
- - Security professionals should apply the same ethical standards they would to any security tool.
265
-
266
- **Dual-use risk:** Like any cybersecurity knowledge base, the information the model generates could theoretically be misused. However, the model's small size and limited capabilities make it far less capable than freely available tools and resources already in the security community.
267
 
268
- ## How to Use
269
 
270
  ```python
271
- import torch
272
- from ghostlm import GhostLM, GhostLMConfig, GhostTokenizer
273
-
274
- # Load ghost-tiny
275
- config = GhostLMConfig.from_preset("ghost-tiny")
276
- model = GhostLM(config)
277
- tokenizer = GhostTokenizer()
278
-
279
- # Load trained weights (v0.3.3 — Phase 3 ghost-tiny refresh)
280
- checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
281
- model.load_state_dict(checkpoint["model_state_dict"])
282
- model.eval()
283
-
284
- # Generate
285
- prompt = "A SQL injection attack works by"
286
- ids = tokenizer.encode(prompt)
287
- input_tensor = torch.tensor(ids).unsqueeze(0)
288
- output = model.generate(input_tensor, max_new_tokens=100, temperature=0.8, top_k=50)
289
- print(tokenizer.decode(output[0].tolist()))
290
  ```
291
 
292
- ## Evaluation (Phase 3 — 30K Steps, post-NVD-pull corpus)
293
-
294
- ### Validation loss
295
 
296
- - **Final validation loss (step 30000):** **3.4458** (perplexity ≈ 31)
297
- - **Curve shape:** monotonic decrease over 60 eval points; no instability, still slightly descending at step 30K (diminishing returns rather than plateau).
298
- - Comparison: Phase 2 val_loss 3.7813 on the 2.66M-token corpus. Both runs use the deterministic-hash split, so the **0.34 nat drop is a real corpus-quality dividend at fixed model size**.
 
 
 
299
 
300
- ### Perplexity vs GPT-2 (cyber-text benchmark)
301
 
302
- Same hardcoded `BENCHMARK_TEXTS` set used for every prior phase (10 cyber-text samples, fair comparison):
 
 
 
303
 
304
- | Model | Perplexity (lower is better) |
305
- |---|---|
306
- | GPT-2 (124M baseline) | **26.76** |
307
- | **ghost-tiny — Phase 3 (released)** | **142.09** |
308
- | ghost-tiny — Phase 2 | 152.71 |
309
- | ghost-tiny — Phase 1 | 2,183.94 |
310
 
311
- Phase 3 is **7% better** than Phase 2 on this benchmark and **15.4× better** than Phase 1. Still 5.3× behind GPT-2, expected for a 14.7M-param model on ~30M tokens vs. a 124M-param model on ~40B tokens of WebText. The Phase 2→3 gain is modest because the 10-text benchmark contains generic security prose that already overlapped both corpora — most of the perplexity dividend was earned at Phase 2 (corpus quality + clean split), and the residual gain at Phase 3 is from the larger volume. Raw output: `logs/benchmark_phase3.json`.
 
 
 
 
 
 
 
312
 
313
- ### Security-domain task evaluation
314
 
315
- Re-run on the Phase 3 checkpoint via `scripts/eval_security.py` (3 tasks, 30 questions: CVE Severity Classification, Vulnerability Type Detection, Attack Technique Identification):
 
 
316
 
317
- | Phase | Score | Failure mode |
318
- |---|---|---|
319
- | Phase 1 | 4/30 (13.3%) | Mode-collapsed |
320
- | Phase 2 | 4/30 (13.3%) | Mode-collapsed: predicts "High" / "Cross-Site Scripting" / "Supply Chain Compromise" |
321
- | **Phase 3** | **4/30 (13.3%)** | Mode-collapsed: predicts "Medium-or-High" / "Cross-Site Scripting" / "DLL Search Order Hijacking" |
322
 
323
- Same numerical score as prior phases, **but with a different mode-collapse pattern** — the model has learned the *most frequent label per task* rather than the discriminative structure, and at Phase 3 the most-frequent attack technique label has shifted (from Supply Chain Compromise to DLL Search Order Hijacking) reflecting the corpus shift. CVE-severity picks up some genuine discrimination (gets 2 right by mixing in Mediums). **Random-guess baseline is ~33%** (4-way multiple choice), so 13.3% is below random — confirming the model is not yet doing real classification at this scale. Raw output: `logs/eval_security_phase3.json`.
324
 
325
- **What this means:** the corpus-expansion dividend is real on language modeling (val_loss + perplexity) but invisible on structured-task eval. Both numbers are baselines for the next scale rung — ghost-small at ~55M params is where structured-task eval should start to reward better corpus.
326
 
327
- ### Phase comparison plot
 
 
 
 
 
 
 
 
 
328
 
329
- `logs/phase_comparison.png` shows final val_loss, perplexity (vs GPT-2 baseline), and security-task accuracy across all three phases side by side. Generated by `scripts/plot_phase_comparison.py`.
 
 
 
330
 
331
- ### Training curve
332
 
333
- `logs/phase3_refresh/training_curve.png` shows the 30K-step Phase 3 curve. Phase 1 and Phase 2 logs were too sparse for real curves (3–5 endpoint datapoints); Phase 3 has 60 eval points, the first dense ghost-tiny training curve we've ever produced.
334
 
335
  ## Citation
336
 
337
- ```bibtex
338
- @misc{ghostlm2026,
339
- author = {Joe Munene},
340
- title = {GhostLM: An Open-Source Cybersecurity-Focused Language Model},
341
- year = {2026},
342
- publisher = {GitHub},
343
- url = {https://github.com/joemunene-by/GhostLM}
344
  }
345
  ```
346
-
347
- ## Links
348
-
349
- - **GitHub:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
350
- - **Author:** [Joe Munene](https://github.com/joemunene-by)
351
- - **License:** [MIT](LICENSE)
 
1
  ---
 
 
2
  license: mit
3
+ language:
4
+ - en
5
  tags:
6
+ - cybersecurity
7
+ - security
8
+ - cti
9
+ - mitre-attack
10
+ - cve
11
+ - chat
12
+ - from-scratch
13
+ - small-lm
14
+ library_name: pytorch
 
15
  pipeline_tag: text-generation
16
  model-index:
17
+ - name: GhostLM ghost-small chat-v3
18
+ results:
19
+ - task:
20
+ type: text-classification
21
+ name: Multiple-choice cyber-LLM benchmark
22
+ dataset:
23
+ type: AI4Sec/cti-bench
24
+ name: CTIBench MCQ
25
+ config: cti-mcq
26
+ split: test
27
+ metrics:
28
+ - type: accuracy
29
+ value: 0.369
30
+ name: accuracy (chat-v3)
31
+ - type: accuracy
32
+ value: 0.190
33
+ name: accuracy (chat-v2)
34
+ - type: accuracy
35
+ value: 0.178
36
+ name: accuracy (pretrain only, no chat)
37
  ---
38
 
39
+ # GhostLM
 
 
40
 
41
+ A small cybersecurity language model trained from scratch. Not a fine-tune
42
+ of an existing base — every parameter learned on a curated security
43
+ corpus. Currently shipping the v0.5.0 chat-tuned variant of `ghost-small`.
 
 
 
 
 
 
 
 
44
 
45
+ - **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
46
+ - **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
47
+ - **License:** MIT
48
 
49
+ ## What this model is
50
 
51
+ `ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
52
+ from random initialization on **12.56M tokens of cybersecurity text** —
53
+ NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
54
+ Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
55
+ a small synthetic CTF-writeup augmentation. After 30,000 steps of
56
+ pretraining (`Phase 4`), it was supervised-fine-tuned for chat with a
57
+ mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
58
+ pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
59
+ acronyms). The chat tune uses three new role tokens
60
+ (`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) appended after
61
+ the base GPT-2 BPE vocabulary (50,261 → 50,264).
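The role-token scheme described above can be sketched as follows. Only the three token strings and the vocabulary sizes (50,261 and 50,264) come from the card; the order in which the tokens receive IDs, the exact turn layout, and the helper name `format_chat` are assumptions made for illustration, so check the GhostLM repository for the template the chat tune actually used.

```python
# Token strings are from the model card; everything else here is illustrative.
USER, ASSISTANT, END = "<|ghost_user|>", "<|ghost_assistant|>", "<|ghost_end|>"

BASE_VOCAB = 50_261  # GPT-2 BPE (50,257) plus the 4 pre-existing special tokens

# Assumption: the three role tokens take the next three IDs in this order.
ROLE_TOKEN_IDS = {tok: BASE_VOCAB + i for i, tok in enumerate((USER, ASSISTANT, END))}


def format_chat(turns, add_generation_prompt=True):
    """Render (role, text) pairs into one prompt string.

    Hypothetical layout: each turn is <role token> + text + <end token>.
    With add_generation_prompt=True the string ends on the assistant role
    token, so the model's next tokens are the assistant's reply.
    """
    role_to_token = {"user": USER, "assistant": ASSISTANT}
    parts = []
    for role, text in turns:
        parts.append(f"{role_to_token[role]}{text}{END}")
    if add_generation_prompt:
        parts.append(ASSISTANT)
    return "".join(parts)


prompt = format_chat([("user", "What does CWE-79 describe?")])
print(prompt)
print(len(ROLE_TOKEN_IDS) + BASE_VOCAB)  # total vocabulary size after the chat tune
```

Generation would normally stop when the model emits the end token, which is why a dedicated `<|ghost_end|>` token is useful alongside the two role markers.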
62
 
63
+ ## Why a 45M from-scratch model
 
 
 
 
 
64
 
65
+ A 45M model is too small to be a general-purpose assistant. The thesis is
66
+ specialization: a focused security corpus + targeted SFT can match or
67
+ beat much larger general models on narrow security tasks at a fraction
68
+ of the size, while running on a laptop CPU. CTIBench results below are
69
+ the test of that thesis.
70
 
71
  ## Architecture
72
 
73
+ | | |
 
74
  |---|---|
75
+ | Type | Decoder-only Transformer (GPT-2 family) |
76
+ | Layers | 6 |
77
+ | Hidden dim (`d_model`) | 512 |
78
+ | Heads | 8 (head dim 64) |
79
+ | FFN dim (`d_ff`) | 2048, GELU |
80
+ | Norm | LayerNorm, pre-norm |
81
+ | Positional encoding | Learned absolute |
82
+ | Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
83
+ | Context length | 1024 |
84
+ | Total params | ~45.2M |
85
+ | Tied input/output embeddings | yes |
86
+
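As a sanity check on the ~45.2M figure in the table, the parameter count can be rebuilt from the listed dimensions. This is a back-of-the-envelope sketch, not the repository's own counting code: it assumes GPT-2-style layers with bias terms on every linear layer, two LayerNorms per block plus one final LayerNorm, and no separate output matrix because the embeddings are tied.

```python
def count_params(vocab=50_264, d_model=512, n_layers=6, d_ff=2048, ctx=1024):
    """Estimate parameters for a GPT-2-style decoder with tied embeddings."""
    tok_emb = vocab * d_model            # shared with the output projection
    pos_emb = ctx * d_model              # learned absolute positions

    # Attention: combined QKV projection plus output projection, with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: up-projection and down-projection, with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * (2 * d_model)

    per_layer = attn + ffn + norms
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * per_layer + final_norm


total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the token embedding alone accounts for roughly 25.7M of the total, so more than half of the model's parameters sit in the vocabulary table rather than in the six transformer blocks.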
87
+ The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired in the
88
+ codebase but disabled by default for backward compatibility with this
89
+ checkpoint. A `ghost-small-v0.5` preset flips them on for the next
90
+ pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).
91
 
  ## Evaluation

+ Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
+ (2,500 multiple-choice cyber threat-intelligence questions, scored by the
+ log-probability of A/B/C/D as the next token after `Answer:`):
+
+ | Checkpoint | n | Accuracy | Notes |
+ |---|---:|---:|---|
+ | ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random (25%); completion model, doesn't follow MCQ format |
+ | ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal work, no MCQ-format signal |
+ | ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
+ | **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical**; 1.48× random |
+
+ `chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
+ acronym definition) to the chat training mix. The assistant turn is the
+ bare letter A/B/C/D, with a 30% subset followed by a one-line
+ justification. This teaches the model to output a single letter after
+ `Answer:` rather than continuing into prose, which is the dominant failure
+ mode of small models on MCQ format.
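The letter-scoring rule behind the table can be sketched as follows. This is a minimal illustration, not the repo's evaluation script: `pick_letter` and the toy logits are hypothetical, and it assumes each of A/B/C/D maps to a single token id whose logit is read at the position right after `Answer:`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def pick_letter(next_token_logits, letter_token_ids):
    """Return the letter whose token has the highest next-token log-probability."""
    logprobs = log_softmax(next_token_logits)
    scores = {letter: logprobs[tid] for letter, tid in letter_token_ids.items()}
    return max(scores, key=scores.get), scores

# Toy vocabulary of 6 tokens; ids 0..3 stand in for the A/B/C/D tokens.
letter_ids = {"A": 0, "B": 1, "C": 2, "D": 3}
toy_logits = [0.2, 1.7, -0.3, 0.9, 2.5, -1.0]   # id 4 is some non-letter token
prediction, scores = pick_letter(toy_logits, letter_ids)
print(prediction, scores)
```

Only the four letter tokens are compared, so a model that would rather continue into prose (token 4 in the toy example) is still assigned a letter. That is one plausible reason a completion-only checkpoint can land near or below chance under this metric.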
+
+ Honest comparisons: 36.9% is well above random (25%) but well below the
+ 85-95% that frontier models score on the same benchmark. The model was
+ trained on 12.56M tokens of pure cybersecurity text, about 1.4% of the
+ Chinchilla-optimal data budget for 45M parameters (roughly 20 tokens per
+ parameter, or about 900M tokens). The next benchmark gain is
+ expected to come from corpus expansion (`v0.4.2`) and the v0.5
+ architecture upgrade.
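The percentages in this section follow from simple arithmetic. The sketch below assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter; that ratio is an approximation, not a figure taken from the GhostLM repo.

```python
PARAMS = 45.2e6            # total parameters (Architecture table)
TRAIN_TOKENS = 12.56e6     # pretraining corpus size
TOKENS_PER_PARAM = 20      # assumed Chinchilla rule of thumb

optimal_tokens = PARAMS * TOKENS_PER_PARAM
budget_fraction = TRAIN_TOKENS / optimal_tokens

random_baseline = 1 / 4    # four answer options per question
accuracy = 0.369           # chat-v3 on CTIBench MCQ
lift = accuracy / random_baseline

print(f"Chinchilla-optimal budget: ~{optimal_tokens / 1e6:.0f}M tokens")
print(f"Fraction of budget used:   {budget_fraction:.1%}")
print(f"Accuracy vs. random:       {lift:.2f}x")
```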
+
+ ## Usage
+
+ ### Direct use (no HF transformers integration)
+
+ GhostLM has a custom architecture: it does **not** use the
+ Hugging Face `transformers` library and is **not** auto-loadable via
+ `AutoModelForCausalLM`. You need the GhostLM repo itself.
+
+ ```bash
+ git clone https://github.com/joemunene-by/GhostLM
+ cd GhostLM
+ pip install -r requirements.txt
+ ```
 
+ Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:

  ```python
+ from huggingface_hub import hf_hub_download
+ from pathlib import Path
+
+ dest = Path("checkpoints/phase5_chat_v3")
+ dest.mkdir(parents=True, exist_ok=True)
+ hf_hub_download(
+     repo_id="Ghostgim/GhostLM",
+     filename="pytorch_model.pt",
+     local_dir=str(dest),
+ )
+ # Rename to match what the loader expects
+ (dest / "pytorch_model.pt").rename(dest / "best_model.pt")
  ```
 
+ ### Chat REPL

+ ```bash
+ PYTHONPATH=. python3 scripts/chat.py \
+     --checkpoint checkpoints/phase5_chat_v3/best_model.pt \
+     --temperature 0.7 --top-k 40 --top-p 0.95 \
+     --repetition-penalty 1.25
+ ```
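For readers unfamiliar with the sampling flags above, the sketch below shows one common way they are applied to next-token logits. It is an illustration under stated assumptions, not GhostLM's `chat.py` implementation: `filter_logits` is a hypothetical name, the repetition penalty follows the widely used CTRL-style rule (divide positive logits, multiply negative ones), and the order of the steps may differ in the repo.

```python
import math

def filter_logits(logits, prev_ids, temperature=0.7, top_k=40, top_p=0.95,
                  repetition_penalty=1.25):
    """Return {token_id: probability} after penalty, temperature, top-k and top-p."""
    logits = list(logits)
    # 1. Repetition penalty: make already-generated tokens less likely.
    for tid in set(prev_ids):
        if logits[tid] > 0:
            logits[tid] /= repetition_penalty
        else:
            logits[tid] *= repetition_penalty
    # 2. Temperature: values below 1 sharpen the distribution.
    logits = [x / temperature for x in logits]
    # 3. Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # 4. Softmax over the survivors.
    m = logits[ranked[0]]
    weights = [math.exp(logits[i] - m) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 5. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tid, p in zip(ranked, probs):
        kept.append((tid, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tid: p / norm for tid, p in kept}

# Toy example: token 2 was just generated, so its logit is penalised.
dist = filter_logits([1.0, 0.5, 3.0, -2.0], prev_ids=[2], top_k=3)
print(dist)
```

A token can then be drawn from the result with `random.choices(list(dist), weights=list(dist.values()))`.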
 
+ Chat format uses three special tokens:

+ ```
+ <|ghost_user|>What is XSS?<|ghost_end|>
+ <|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
+ ```

+ Use the helper to build an inference-ready prompt:

+ ```python
+ from ghostlm.tokenizer import GhostTokenizer
+ tok = GhostTokenizer()
+ prompt_ids = tok.format_chat_prompt([
+     {"role": "user", "content": "What is XSS?"},
+ ])
+ # prompt_ids ends in <|ghost_assistant|> ready for generation
+ ```
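To make the format concrete, here is a string-level sketch of what such a helper might produce. It is not the repo's `format_chat_prompt` (which returns token ids): `render_chat_prompt` is a hypothetical name, and joining turns with a newline is an assumption based on the two-line example above.

```python
ROLE_TOKENS = {"user": "<|ghost_user|>", "assistant": "<|ghost_assistant|>"}
END_TOKEN = "<|ghost_end|>"

def render_chat_prompt(messages):
    """Render chat turns as text and leave an open assistant turn at the end."""
    turns = [f"{ROLE_TOKENS[m['role']]}{m['content']}{END_TOKEN}" for m in messages]
    turns.append(ROLE_TOKENS["assistant"])   # the model continues from here
    return "\n".join(turns)

prompt = render_chat_prompt([{"role": "user", "content": "What is XSS?"}])
print(prompt)
```

The open `<|ghost_assistant|>` at the end is what signals that the next tokens belong to the model's reply.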
 
+ ### MCP server for Claude Code / Claude Desktop

+ GhostLM ships an MCP server that exposes three tools (`ghostlm_query`,
+ `ghostlm_explain_cve`, `ghostlm_map_to_attack`) over stdio. Install
+ with:

+ ```bash
+ claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
+     --checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
+ ```

+ Requires Python 3.10 plus `pip install mcp torch tiktoken`.
 
+ ## Training data

+ | Source | Records | License | Notes |
+ |---|---:|---|---|
+ | NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
+ | Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
+ | Synthetic CTF writeups | 2,847 | Custom (generated locally) | turned into 2,847 chat pairs |
+ | arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
+ | MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
+ | CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
+ | CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
+ | Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |

+ Total chat training set: ~17,000 records after 30× small_talk oversampling
+ and 2× MCQ oversampling. The `data/raw/` source files and the
+ `data/processed/train.jsonl` pretrain corpus are reproducible from the
+ collector scripts in the GitHub repo.
 
+ ## Limitations

+ - **No general world knowledge.** Outside cybersecurity the model is
+   wrong, repetitive, or both. It will refuse politely on most OOD
+   topics ("what's the weather", "tell me a joke"), but accuracy on
+   general questions is essentially zero.
+ - **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates,
+   and technique IDs are memorized incompletely; the model often
+   confabulates plausible-looking but wrong specifics. Always verify
+   against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
+   or original vendor advisories.
+ - **Short coherence window.** With a 1024-token context and no RoPE, long
+   multi-turn conversations drift. The chat REPL trims old turns when
+   the running prompt overflows.
+ - **CTIBench 36.9% is well above random but well below larger models.**
+   This is expected at 45M parameters and 12.56M training tokens.
+ - **Repetition prone without penalty.** Use `--repetition-penalty 1.25`
+   in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.
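The turn trimming mentioned under the short coherence window can be sketched as below. This is one plausible policy, not the REPL's actual code: `trim_history`, the `reserve` headroom left for the reply, and the whitespace-based token counter are all assumptions for illustration (a real implementation would count tokens with the model's tokenizer).

```python
def trim_history(turns, max_tokens=1024, reserve=256, count_tokens=None):
    """Drop the oldest turns until the remaining ones fit in the context budget."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())   # crude stand-in for a tokenizer
    budget = max_tokens - reserve
    kept = list(turns)
    # Always keep the newest turn, even if it alone exceeds the budget.
    while len(kept) > 1 and sum(count_tokens(t) for t in kept) > budget:
        kept.pop(0)
    return kept

history = ["word " * 400, "word " * 300, "word " * 200, "What is XSS?"]
trimmed = trim_history(history)
print(len(trimmed))
```

Dropping whole turns from the front keeps the most recent exchange intact, at the cost of the model forgetting how the conversation started.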
+
+ ## Intended use
+
+ - Hands-on learning: explore how a small specialized LM behaves on a
+   narrow domain.
+ - Local cybersecurity Q&A as a complement to a larger general model
+   via the MCP server.
+ - Research baseline for cyber-LLM evaluation work: a small, fully
+   reproducible from-scratch model with published benchmark numbers.
+
+ **Out of scope:** production security advice, vulnerability triage,
+ incident response. The model is a research artifact; never act on its
+ output without verifying against authoritative sources.
 
  ## Citation

+ ```
+ @misc{ghostlm-2026,
+   author = {Munene, Joe},
+   title = {GhostLM: A small cybersecurity language model trained from scratch},
+   year = {2026},
+   url = {https://github.com/joemunene-by/GhostLM},
  }
  ```