docs: HF-flavored model card for v0.5.0 chat-v3 (CTIBench 36.9%)
README.md
CHANGED
|
@@ -1,351 +1,249 @@
|
|
| 1 |
---
|
| 2 |
-
language:
|
| 3 |
-
- en
|
| 4 |
license: mit
|
| 5 |
-
|
|
|
|
| 6 |
tags:
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
- custom
|
| 17 |
pipeline_tag: text-generation
|
| 18 |
model-index:
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
-
|
| 22 |
-
|
| 23 |
---
|
| 24 |
|
| 25 |
-
# GhostLM
|
| 26 |
-
|
| 27 |
-
## Model Details
|
| 28 |
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
| **Architecture** | Decoder-only transformer |
|
| 33 |
-
| **Author** | [Joe Munene](https://github.com/joemunene-by) |
|
| 34 |
-
| **License** | MIT |
|
| 35 |
-
| **Language** | English |
|
| 36 |
-
| **Framework** | PyTorch (built from scratch, no pretrained weights) |
|
| 37 |
-
| **Version** | 0.4.0 (Phase 4 ghost-small — 30K steps on the 12.56M-token Phase 3.6 corpus, val_loss 2.3535, overall val PPL 11.12 — capacity-reallocation hypothesis confirmed) |
|
| 38 |
-
|
| 39 |
-
## Model Description
|
| 40 |
|
| 41 |
-
|
|
|
|
|
|
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|---|---|---|---|---|---|---|---|
|
| 49 |
-
| `ghostlm/ghost-tiny` | 2 | 256 | 4 | 1024 | 1024 | 14.7M | Phase 3.5 (historical canonical). 30K steps on ~8.8M tokens, overall PPL 66, PMI suite 31.2% |
|
| 50 |
-
| `ghostlm/ghost-small` | 6 | 512 | 8 | 2048 | 1024 | ~45M | **Phase 4 complete (current canonical). 30K steps on ~12.56M tokens, overall PPL 11.12 (−83%), val_loss 2.3535** |
|
| 51 |
-
| `ghostlm/ghost-base` | 12 | 768 | 12 | 3072 | 1024 | ~350M | Planned (rented GPU) |
|
| 52 |
-
| `ghostlm/ghost-1B` | 24 | 1024 | 16 | 4096 | 1024 | ~1B | Long-term goal |
|
| 53 |
|
| 54 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
|
| 56 |
## Architecture
|
| 57 |
|
| 58 |
-
|
| 59 |
-
- **Normalization:** Pre-norm (LayerNorm before attention and FFN sub-layers)
|
| 60 |
-
- **Positional encoding:** Learned positional embeddings
|
| 61 |
-
- **Activation:** GELU
|
| 62 |
-
- **Tokenizer:** GPT-2 BPE via tiktoken (50,257 base tokens + 4 special tokens = 50,261 total)
|
| 63 |
-
- **Weight tying:** Output projection shares weights with token embedding
|
| 64 |
-
- **Attention:** Multi-head causal self-attention with combined QKV projection
|
| 65 |
-
- **Initialization:** Normal(0, 0.02) with scaled residual init (std=0.02/sqrt(2*n_layers)) for projection layers
|
| 66 |
-
|
| 67 |
-
## Training Data
|
| 68 |
-
|
| 69 |
-
The released v0.3.5 checkpoint was trained on the rebalanced Phase 3.5 corpus. NVD's full 333,540-record pull is on disk, but its training contribution is capped at 6M tokens by content-hash subsample so the corpus isn't 90% CVE descriptions:
|
| 70 |
-
|
| 71 |
-
| Source | Records (raw → trained) | Trained tokens | Share | Type |
|
| 72 |
-
|---|---|---|---|---|
|
| 73 |
-
| NVD CVE Database | 333,540 → 71,828 | ~5.74M | **65.3%** | Real, capped via `--max-cve-tokens 6000000` |
|
| 74 |
-
| Synthetic CTF Writeups | 3,000 | ~1.51M | 17.2% | Synthetic, placeholder until real CTFtime grows |
|
| 75 |
-
| arXiv cs.CR Abstracts | 2,000 | ~0.74M | 8.4% | Real |
|
| 76 |
-
| CTFtime real writeups | 473 → 467 | ~0.47M | 5.3% | Real, inline-only, per-record attribution |
|
| 77 |
-
| MITRE ATT&CK | 691 | ~0.26M | 2.9% | Real (Apache 2.0) |
|
| 78 |
-
| CAPEC | 609 | ~0.07M | 0.9% | Real (Apache 2.0) |
|
| 79 |
-
| **Total (post-dedup)** | **74,635** | **~8.79M** | | train: 70,965 / val: 3,670 |
|
| 80 |
-
|
| 81 |
-
**Data splits:** deterministic by content hash — identical or near-duplicate texts always land in the same split. Train/val leakage check returns 0.
|
| 82 |
-
|
| 83 |
-
**Token share comparison (what the model sees):**
|
| 84 |
-
|
| 85 |
-
| Phase | NVD share | Top non-NVD source | Overall |
|
| 86 |
-
|---|---|---|---|
|
| 87 |
-
| v0.3.3 (Phase 3) | 87% | CTF synthetic 5% | NVD-dominated |
|
| 88 |
-
| **v0.3.5 (Phase 3.5)** | **65.3%** | **synthetic 17.2%** | **balanced across 6 sources** |
|
| 89 |
-
|
| 90 |
-
The rebalance is reproducible: `python3 scripts/rebuild_corpus.py --max-cve-tokens 6000000` always produces the same 71,828-record CVE prefix.
|
| 91 |
-
|
| 92 |
-
**Topics covered:** vulnerability detection, adversarial ML, network intrusion, cryptographic protocols, fuzzing, side-channel attacks, ransomware detection, supply chain security, memory safety, WAF evasion, SQL injection, XSS, buffer overflow, privilege escalation, reverse engineering, binary exploitation, steganography, network forensics.
|
| 93 |
-
|
| 94 |
-
For corpus expansion plans (CTFtime, security blogs, MITRE ATT&CK, tool docs) and licensing notes, see [CORPUS.md](CORPUS.md).
|
| 95 |
-
|
| 96 |
-
## Training Details
|
| 97 |
-
|
| 98 |
-
| Parameter | Value |
|
| 99 |
|---|---|
|
| 100 |
-
|
|
| 101 |
-
|
|
| 102 |
-
|
|
| 103 |
-
|
|
| 104 |
-
|
|
| 105 |
-
|
|
| 106 |
-
|
|
| 107 |
-
|
|
| 108 |
-
|
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
|
|
|
| 115 |
|
| 116 |
## Evaluation
|
| 117 |
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
|
| 125 |
-
|---|
|
| 126 |
-
|
|
| 127 |
-
|
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
#### Cross-phase trajectory on the expanded suite
|
| 158 |
-
|
| 159 |
-
Every preserved ghost-tiny checkpoint was re-scored on the new 125-sample suite so the trajectory is end-to-end comparable. Cells are `correct/total (accuracy) [most-common-share]`:
|
| 160 |
-
|
| 161 |
-
| Task | Phase 1 (2K) | Phase 2 (v0.3.0) | Phase 3 (v0.3.3) | Phase 3.5 (v0.3.5) | Phase 3.6 (v0.3.7) | **Phase 4 (v0.4.0)** |
|
| 162 |
-
|---|---|---|---|---|---|---|
|
| 163 |
-
| CVE Severity Classification | 7/25 (28.0%) [100%] | 5/25 (20.0%) [96%] | 4/25 (16.0%) [48%] | **8/25 (32.0%) [72%]** | 4/25 (16.0%) [60%] | 6/25 (24.0%) [72%] |
|
| 164 |
-
| Vulnerability Type Detection | 3/25 (12.0%) [48%] | 6/25 (24.0%) [76%] | 7/25 (28.0%) [48%] | 8/25 (32.0%) [44%] | 3/25 (12.0%) [96%] | **10/25 (40.0%) [44%]** |
|
| 165 |
-
| Attack Technique Identification | 2/25 (8.0%) [24%] | 3/25 (12.0%) [88%] | 5/25 (20.0%) [72%] | **10/25 (40.0%) [36%]** | 4/25 (16.0%) [60%] | 4/25 (16.0%) [52%] |
|
| 166 |
-
| CTF Challenge Categorization | 2/25 (8.0%) [84%] | 7/25 (28.0%) [76%] | 6/25 (24.0%) [88%] | **10/25 (40.0%) [64%]** | 5/25 (20.0%) [48%] | 7/25 (28.0%) [72%] |
|
| 167 |
-
| MITRE ATT&CK Tactic Classification | 1/25 (4.0%) [72%] | 2/25 (8.0%) [76%] | 3/25 (12.0%) [64%] | 3/25 (12.0%) [40%] | 5/25 (20.0%) [76%] | 2/25 (8.0%) [44%] |
|
| 168 |
-
| **Overall (PMI)** | **15/125 (12.0%)** | **23/125 (18.4%)** | **25/125 (20.0%)** | **39/125 (31.2%)** | **21/125 (16.8%)** | **29/125 (23.2%)** |
|
| 169 |
-
|
| 170 |
-
Phase 4 ghost-small (v0.4.0) is the new canonical model for density / generation work but lands lower than Phase 3.5 on the PMI scoring above. The honest read requires the second column type — **logp scoring** — which the suite also supports via `--scoring logp`:
|
| 171 |
-
|
| 172 |
-
| Phase | PMI | logp | Δ (PMI − logp) |
|
| 173 |
-
|---|---:|---:|---:|
|
| 174 |
-
| Phase 3.5 (ghost-tiny) | **31.2%** | 17.6% | +13.6 pp |
|
| 175 |
-
| Phase 4 (ghost-small) | 23.2% | **19.2%** | +4.0 pp |
|
| 176 |
-
|
| 177 |
-
Two things to note:
|
| 178 |
-
|
| 179 |
-
1. **PMI flatters Phase 3.5 by 13.6 pp.** PMI subtracts unconditional candidate log-prob to break ties — useful when the model is mode-collapsing because it normalizes for "this candidate is just inherently high-probability". A loose-distribution model with weakly differentiated logits gives PMI more separation to extract; a tight-distribution model gives less. Phase 3.5 (low capacity) gets the bigger PMI uplift; Phase 4 (higher capacity, sharper distribution) gets a smaller one.
|
| 180 |
-
2. **Logp — the more conservative scorer — picks Phase 4.** With logp scoring, Phase 4 narrowly beats Phase 3.5 (19.2% vs 17.6%) on this same 125-sample suite. The PMI vs logp gap diagnoses an eval-methodology limitation rather than a model regression.
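The difference between the two scorers can be illustrated with a toy example. This is a minimal sketch, assuming PMI here means the candidate's conditional log-probability minus its unconditional log-probability, as item 1 describes; the function names and the numbers are hypothetical and not taken from the evaluation suite.

```python
# Toy comparison of logp scoring and PMI scoring for one multiple-choice item.
# All log-probabilities below are made-up illustration values.

def pick_logp(cond: dict[str, float]) -> str:
    """Pick the candidate with the highest conditional log-probability."""
    return max(cond, key=cond.get)

def pick_pmi(cond: dict[str, float], uncond: dict[str, float]) -> str:
    """Pick the candidate with the largest gain over its unconditional log-probability."""
    return max(cond, key=lambda c: cond[c] - uncond[c])

# log p(candidate | prompt): the model leans on "Critical" regardless of input.
cond = {"Critical": -1.0, "High": -1.4, "Medium": -2.5, "Low": -3.0}
# log p(candidate) with no informative prompt: "Critical" is simply a likely string.
uncond = {"Critical": -0.8, "High": -2.0, "Medium": -2.4, "Low": -2.9}

print("logp picks:", pick_logp(cond))
print("PMI picks: ", pick_pmi(cond, uncond))
```

In this toy case logp follows the prior and PMI picks the candidate whose probability the prompt raised the most, which is why the two columns can rank checkpoints differently.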
|
| 181 |
-
|
| 182 |
-
The cleanest model metric remains per-source val PPL (no scoring rule, just density), where Phase 4 dominates Phase 3.5 by 59–78% across every source. See README's "Per-source perplexity" section for the full table.
|
| 183 |
-
|
| 184 |
-
The clean head-to-head between deliberate moves (PMI suite):
|
| 185 |
-
- **Phase 2→3 (3× training volume, fixed corpus): +1.6 pp**
|
| 186 |
-
- **Phase 3→3.5 (corpus rebalance, fixed model+steps): +11.2 pp**
|
| 187 |
-
- **Phase 3.5→3.6 (corpus volume, fixed model+steps): −14.4 pp** (ghost-tiny capacity ceiling)
|
| 188 |
-
- **Phase 3.6→4 (model capacity, fixed corpus+steps): +6.4 pp PMI / +1.6 pp logp / −75% per-source PPL** (capacity-reallocation hypothesis confirmed)
|
| 189 |
-
|
| 190 |
-
Use `make eval-security-all-phases` to re-run end-to-end, or `make eval-compare-phases` to regenerate the PMI table from saved JSONs. Run with `--scoring logp` to reproduce the logp column.
|
| 191 |
-
|
| 192 |
-
### Cyber-text perplexity vs GPT-2 (fixed external test set, ten samples)
|
| 193 |
-
|
| 194 |
-
The benchmark sample is held out from training and unchanged across phases — it's directly comparable.
|
| 195 |
-
|
| 196 |
-
| Phase | Perplexity | vs prior |
|
| 197 |
-
|---|---|---|
|
| 198 |
-
| Phase 1 | 2,183.94 | — |
|
| 199 |
-
| Phase 2 | 152.71 | −93% |
|
| 200 |
-
| Phase 3 (v0.3.3) | 142.09 | −7% |
|
| 201 |
-
| **Phase 3.5 (v0.3.5)** | **96.24** | **−32%** |
|
| 202 |
-
| GPT-2 small (117M) | 26.76 | (frozen baseline) |
|
| 203 |
-
|
| 204 |
-
ghost-tiny is 14.7M params vs GPT-2 small's 117M — so we're closing the cyber-text gap with ~8× less capacity. Still far behind GPT-2 in absolute terms, which is correct: a 14.7M-param ghost-tiny is a learning artifact, not a competitor. The trajectory is what matters.
|
| 205 |
-
|
| 206 |
-
### Note on val_loss
|
| 207 |
-
|
| 208 |
-
Final v0.3.5 val_loss is 3.5518 vs v0.3.3's 3.4458. **Do not read this as v0.3.3 being a better model.** The val sets are different — v0.3.5's val covers six sources (NVD, arxiv, ctftime, mitre, capec, synthetic) while v0.3.3's was NVD-dominated. A more diverse val set is harder to predict per-token regardless of model quality. The per-source perplexity table above is the cleaner read.
|
| 209 |
-
|
| 210 |
-
## Intended Uses
|
| 211 |
-
|
| 212 |
-
### Primary use cases
|
| 213 |
-
- CVE analysis and vulnerability explanation
|
| 214 |
-
- CTF challenge reasoning and methodology
|
| 215 |
-
- Penetration testing report generation
|
| 216 |
-
- Security concept explanation and education
|
| 217 |
-
- Cybersecurity text completion and generation
|
| 218 |
-
|
| 219 |
-
### Out-of-scope uses
|
| 220 |
-
- **Production security decisions:** This is a small research model. Do not use it to make real security assessments.
|
| 221 |
-
- **Malware creation:** The model should not be used to develop malicious software or exploits for unauthorized use.
|
| 222 |
-
- **Attacking systems without authorization:** Any use for illegal cybersecurity activity is prohibited.
|
| 223 |
-
|
| 224 |
-
## Limitations
|
| 225 |
-
|
| 226 |
-
- **Small model size:** At 14.7M parameters, ghost-tiny is two-to-three orders of magnitude below production LLMs. Output quality reflects this.
|
| 227 |
-
- **Limited training data:** ~30M tokens is still small for language-model pre-training (Chinchilla-optimal for 14.7M params would be ~300M tokens; for ghost-1B, ~20B tokens). The corpus needs to grow another ~30× for the upper rungs of the scale ladder.
|
| 228 |
-
- **Surface-level fluency, weak grounding:** the model has learned the CVE-database register and surface vocabulary of cyber writing — it produces structurally correct CVE descriptions and security-prose grammar — but will hallucinate version chains, mix product names, and bind topic only loosely. See [Sample Generations](#sample-generations) below.
|
| 229 |
-
- **Hallucinated CVE-shaped output:** prompting with `CVE-YYYY-NNNNN is a vulnerability in...` will produce plausible-looking but entirely fabricated CVE descriptions, including invented version chains. **Do not use these as factual.**
|
| 230 |
-
- **No instruction tuning:** ghost-tiny is a base language model. It generates text completions, not structured answers.
|
| 231 |
-
- **Mode-collapse on severity classification:** the model predicts "Critical" on 72% of CVE Severity samples regardless of input. Above random (32% vs 25% baseline) but the prior is doing most of the work. Calibrated severity reasoning is not present at this scale.
|
| 232 |
-
- **No tactic-level abstraction:** MITRE ATT&CK Tactic Classification accuracy is 12% (vs 8.3% random baseline) — the model can identify concrete techniques but not the higher-level adversary goal that groups them. Expected to be the slowest reasoning capability to emerge with scale.
|
| 233 |
-
- **Web/binary CTF confusion:** CTF Categorization scores 100% on Forensics and Cryptography but 0% on Web Exploitation — the model conflates web-exploit writeups with adjacent categories. Targeted corpus expansion in v0.4.0 should focus there.
|
| 234 |
-
- **Synthetic CTF share:** 17.2% of training tokens are synthetic CTF writeups. Replacement with real CTFtime / GitHub writeup corpus is the priority next-track work; the synthetic source will be dropped once real-writeup volume exceeds it.
|
| 235 |
-
- **NVD corpus weight:** 65.3% of training tokens are CVE descriptions (down from 87% pre-rebalance). The model still tilts toward CVE-style prose on neutral prompts.
|
| 236 |
-
- **English only.**
|
| 237 |
-
|
| 238 |
-
## Sample Generations
|
| 239 |
-
|
| 240 |
-
Real outputs from `checkpoints/best_model.pt` (Phase 3, val_loss 3.45), at `temperature=0.8, top_k=50`:
|
| 241 |
-
|
| 242 |
-
> **Prompt:** *A SQL injection attack works by*
|
| 243 |
-
> ...injecting a crafted payload in the "password" parameter. NOTE: some of these details are obtained from third party information. EINXCMS 3.4.7.2.0 contains a cross-site scripting vulnerability that allows users to embed arbitrary JavaScript code in the Web UI thus altering the intended functionality potentially leading to credentials disclosure within a trusted session. This issue is fixed in 1.0.10. Use after free in Google Chrome before 17.0.66.75 allowed a remote...
|
| 244 |
-
|
| 245 |
-
> **Prompt:** *CVE-2021-44228 is a vulnerability in*
|
| 246 |
-
> ...the same product which allows local users to overwrite the filesystem via a large value (1) id parameter to the filename parameter to wp-admin/admin/mod_api/admin/admin/admin/login.php in QNAP Business Application Platform, iMC Net-based, iQAP_NAP-V, and MQ-IP-P, MQ-V-M DB2G, MQ-AN-W, MQ-2 A300-
|
| 247 |
-
|
| 248 |
-
> **Prompt:** *To enumerate Active Directory users from a Linux box, an attacker can*
|
| 249 |
-
> ...exploit an attacker to inject malicious scripts as the target system and execute arbitrary code in the context of the current process. The vulnerability is due to the application. An attacker could exploit this vulnerability by submitting a crafted link that executes in the application. A successful exploit could allow the attacker to obtain sensitive information on the user. The BIO plugin for WordPress is vulnerable to Cross-Site Request Forgery in all versions up to, and including, 2.2. This is due to missing nonce validation on...
|
| 250 |
-
|
| 251 |
-
> **Prompt:** *Cross-site scripting (XSS) allows attackers to*
|
| 252 |
-
> ...inject arbitrary web script or HTML via unspecified vectors. The davcodec_server function in libav.c in the Linux kernel before 4.4.0.5, 4.x before 4.6.1, and 4.x before 4.3.0.x before 4.3.3, does not properly restrict access to (1) Ntcpd) PTP-based LDAP (2) X-3-NAP00-00N-N
|
| 253 |
-
|
| 254 |
-
**What this shows compared to Phase 2:** the *register* is correct now — phrases like "Cross-Site Request Forgery in all versions up to, and including, 2.2 — this is due to missing nonce validation," "use after free," "remote attacker," "submitting a crafted link," "in the context of the current process" are all real CVE-database language used in roughly the right context. Phase 2 produced fragments like "the login page is used to the login page's name of the login page does not properly sanitization" — the same model class can't produce that anymore. **The hallucinations are still rampant** (made-up products, scrambled version strings, mixed-up vendor names) — the model has the *form* of CVE descriptions but not the *facts*. This is the expected outcome of corpus expansion at fixed model size: better surface fluency, no new factual capability.
|
| 255 |
-
|
| 256 |
-
## Ethical Considerations
|
| 257 |
-
|
| 258 |
-
GhostLM is trained on cybersecurity content that inherently includes offensive security knowledge — exploit techniques, attack methodologies, and vulnerability details. This is the same information freely available in CVE databases, security conferences, and published research.
|
| 259 |
-
|
| 260 |
-
**Responsible use:**
|
| 261 |
-
- This model is intended for defensive security, education, and research.
|
| 262 |
-
- Users should follow responsible disclosure practices when working with vulnerability information.
|
| 263 |
-
- The model's outputs should not be used to attack systems without explicit authorization.
|
| 264 |
-
- Security professionals should apply the same ethical standards they would to any security tool.
|
| 265 |
-
|
| 266 |
-
**Dual-use risk:** Like any cybersecurity knowledge base, the information the model generates could theoretically be misused. However, the model's small size and limited capabilities make it far less capable than freely available tools and resources already in the security community.
|
| 267 |
|
| 268 |
-
|
| 269 |
|
| 270 |
```python
|
| 271 |
-
import
|
| 272 |
-
from
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
# Generate
|
| 285 |
-
prompt = "A SQL injection attack works by"
|
| 286 |
-
ids = tokenizer.encode(prompt)
|
| 287 |
-
input_tensor = torch.tensor(ids).unsqueeze(0)
|
| 288 |
-
output = model.generate(input_tensor, max_new_tokens=100, temperature=0.8, top_k=50)
|
| 289 |
-
print(tokenizer.decode(output[0].tolist()))
|
| 290 |
```
|
| 291 |
|
| 292 |
-
##
|
| 293 |
-
|
| 294 |
-
### Validation loss
|
| 295 |
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
-
|
|
|
|
|
|
|
|
|
|
| 299 |
|
| 300 |
-
|
| 301 |
|
| 302 |
-
|
|
|
|
|
|
|
|
|
|
| 303 |
|
| 304 |
-
|
| 305 |
-
|---|---|
|
| 306 |
-
| GPT-2 (124M baseline) | **26.76** |
|
| 307 |
-
| **ghost-tiny — Phase 3 (released)** | **142.09** |
|
| 308 |
-
| ghost-tiny — Phase 2 | 152.71 |
|
| 309 |
-
| ghost-tiny — Phase 1 | 2,183.94 |
|
| 310 |
|
| 311 |
-
|
| 312 |
|
| 313 |
-
###
|
| 314 |
|
| 315 |
-
|
|
|
|
|
|
|
| 316 |
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
| 321 |
-
| **Phase 3** | **4/30 (13.3%)** | Mode-collapsed: predicts "Medium-or-High" / "Cross-Site Scripting" / "DLL Search Order Hijacking" |
|
| 322 |
|
| 323 |
-
|
| 324 |
|
| 325 |
-
|
| 326 |
|
| 327 |
-
|
| 328 |
|
| 329 |
-
|
|
|
|
|
|
|
|
|
|
| 330 |
|
| 331 |
-
##
|
| 332 |
|
| 333 |
-
|
| 334 |
|
| 335 |
## Citation
|
| 336 |
|
| 337 |
-
```
|
| 338 |
-
@misc{
|
| 339 |
-
author = {
|
| 340 |
-
title
|
| 341 |
-
year
|
| 342 |
-
|
| 343 |
-
url = {https://github.com/joemunene-by/GhostLM}
|
| 344 |
}
|
| 345 |
```
|
| 346 |
-
|
| 347 |
-
## Links
|
| 348 |
-
|
| 349 |
-
- **GitHub:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
|
| 350 |
-
- **Author:** [Joe Munene](https://github.com/joemunene-by)
|
| 351 |
-
- **License:** [MIT](LICENSE)
|
|
|
|
| 1 |
---
|
|
|
|
|
|
|
| 2 |
license: mit
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
tags:
|
| 6 |
+
- cybersecurity
|
| 7 |
+
- security
|
| 8 |
+
- cti
|
| 9 |
+
- mitre-attack
|
| 10 |
+
- cve
|
| 11 |
+
- chat
|
| 12 |
+
- from-scratch
|
| 13 |
+
- small-lm
|
| 14 |
+
library_name: pytorch
|
|
|
|
| 15 |
pipeline_tag: text-generation
|
| 16 |
model-index:
|
| 17 |
+
- name: GhostLM ghost-small chat-v3
|
| 18 |
+
results:
|
| 19 |
+
- task:
|
| 20 |
+
type: text-classification
|
| 21 |
+
name: Multiple-choice cyber-LLM benchmark
|
| 22 |
+
dataset:
|
| 23 |
+
type: AI4Sec/cti-bench
|
| 24 |
+
name: CTIBench MCQ
|
| 25 |
+
config: cti-mcq
|
| 26 |
+
split: test
|
| 27 |
+
metrics:
|
| 28 |
+
- type: accuracy
|
| 29 |
+
value: 0.369
|
| 30 |
+
name: accuracy (chat-v3)
|
| 31 |
+
- type: accuracy
|
| 32 |
+
value: 0.190
|
| 33 |
+
name: accuracy (chat-v2)
|
| 34 |
+
- type: accuracy
|
| 35 |
+
value: 0.178
|
| 36 |
+
name: accuracy (pretrain only, no chat)
|
| 37 |
---
|
| 38 |
|
| 39 |
+
# GhostLM
|
|
|
|
|
|
|
| 40 |
|
| 41 |
+
A small cybersecurity language model trained from scratch. Not a fine-tune
|
| 42 |
+
of an existing base — every parameter learned on a curated security
|
| 43 |
+
corpus. Currently shipping the v0.5.0 chat-tuned variant of `ghost-small`.
|
| 44 |
|
| 45 |
+
- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
|
| 46 |
+
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
|
| 47 |
+
- **License:** MIT
|
| 48 |
|
| 49 |
+
## What this model is
|
| 50 |
|
| 51 |
+
`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
|
| 52 |
+
from random initialization on **12.56M tokens of cybersecurity text** —
|
| 53 |
+
NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
|
| 54 |
+
Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
|
| 55 |
+
a small synthetic CTF-writeup augmentation. After 30,000 steps of
|
| 56 |
+
pretraining (`Phase 4`), it was supervised-fine-tuned for chat with a
|
| 57 |
+
mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
|
| 58 |
+
pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
|
| 59 |
+
acronyms). The chat tune uses three new role tokens
|
| 60 |
+
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) appended after
|
| 61 |
+
the base GPT-2 BPE vocabulary (50,261 → 50,264).
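The vocabulary arithmetic can be checked with a short sketch. It assumes the three role tokens are appended after the existing 50,261 entries in the order listed above; the resulting ID assignment is an illustration and is not read from the tokenizer source.

```python
# Sketch: where the three chat-role tokens would land if appended in order
# after the existing vocabulary (an assumption, not taken from the repo).
BASE_VOCAB_SIZE = 50_261  # GPT-2 BPE plus the earlier special tokens, per the text

ROLE_TOKENS = ["<|ghost_user|>", "<|ghost_assistant|>", "<|ghost_end|>"]

role_token_ids = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(ROLE_TOKENS)}
new_vocab_size = BASE_VOCAB_SIZE + len(ROLE_TOKENS)

for tok, idx in role_token_ids.items():
    print(f"{tok:22} -> {idx}")
print("vocabulary size:", new_vocab_size)
```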
|
| 62 |
|
| 63 |
+
## Why a 45M from-scratch model
|
| 64 |
|
| 65 |
+
A 45M model is too small to be a general-purpose assistant. The thesis is
|
| 66 |
+
specialization: a focused security corpus + targeted SFT can match or
|
| 67 |
+
beat much larger general models on narrow security tasks at a fraction
|
| 68 |
+
of the size, while running on a laptop CPU. CTIBench results below are
|
| 69 |
+
the test of that thesis.
|
| 70 |
|
| 71 |
## Architecture
|
| 72 |
|
| 73 |
+
| | |
|
| 74 |
|---|---|
|
| 75 |
+
| Type | Decoder-only Transformer (GPT-2 family) |
|
| 76 |
+
| Layers | 6 |
|
| 77 |
+
| Hidden dim (`d_model`) | 512 |
|
| 78 |
+
| Heads | 8 (head dim 64) |
|
| 79 |
+
| FFN dim (`d_ff`) | 2048, GELU |
|
| 80 |
+
| Norm | LayerNorm, pre-norm |
|
| 81 |
+
| Positional encoding | Learned absolute |
|
| 82 |
+
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
|
| 83 |
+
| Context length | 1024 |
|
| 84 |
+
| Total params | ~45.2M |
|
| 85 |
+
| Tied input/output embeddings | yes |
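
The ~45.2M total can be roughly reproduced from the rows above. The sketch below assumes a GPT-2-style layout (biases on every linear layer, two LayerNorms per block, one final LayerNorm); the actual GhostLM module layout may differ slightly, so treat the result as an estimate.

```python
# Rough parameter count for a GPT-2-style decoder with tied embeddings.
# The bias and LayerNorm layout is an assumption, not read from the GhostLM code.

def count_params(vocab: int, d_model: int, n_layers: int, d_ff: int, ctx: int) -> int:
    tok_emb = vocab * d_model                    # shared with the output projection
    pos_emb = ctx * d_model                      # learned absolute positions
    attn = 4 * d_model * d_model + 4 * d_model   # Q, K, V and output projections, with biases
    ffn = 2 * d_model * d_ff + d_ff + d_model    # up and down projections, with biases
    norms = 2 * 2 * d_model                      # two LayerNorms (weight + bias) per block
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * (attn + ffn + norms) + final_norm

total = count_params(vocab=50_264, d_model=512, n_layers=6, d_ff=2048, ctx=1024)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions more than half of the parameters (about 25.7M) sit in the token embedding, which is typical for a small model with a GPT-2-sized vocabulary.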
|
| 86 |
+
|
| 87 |
+
The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired in the
|
| 88 |
+
codebase but disabled by default for backward compatibility with this
|
| 89 |
+
checkpoint. A `ghost-small-v0.5` preset flips them on for the next
|
| 90 |
+
pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).
|
| 91 |
|
| 92 |
## Evaluation
|
| 93 |
|
| 94 |
+
Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
|
| 95 |
+
(2,500 multiple-choice cyber threat-intelligence questions, scored by the
|
| 96 |
+
log-probability of A/B/C/D as the next token after `Answer:`):
|
| 97 |
+
|
| 98 |
+
| Checkpoint | n | Accuracy | Notes |
|
| 99 |
+
|---|---:|---:|---|
|
| 100 |
+
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below the 25% random baseline; a completion model that does not follow the MCQ format |
|
| 101 |
+
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
|
| 102 |
+
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
|
| 103 |
+
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical** — 1.48× random |
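
The scoring rule described above the table (compare the log-probabilities of A/B/C/D as the next token after `Answer:`) can be sketched in plain Python. This is a minimal illustration, not the repo's evaluation script: `score_mcq`, the toy logits, and the toy letter-to-token-ID mapping are all hypothetical.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a full next-token logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def score_mcq(next_token_logits: list[float],
              letter_token_ids: dict[str, int]) -> tuple[str, dict[str, float]]:
    """Return the letter whose token has the highest next-token log-probability."""
    logprobs = log_softmax(next_token_logits)
    per_letter = {letter: logprobs[tid] for letter, tid in letter_token_ids.items()}
    best = max(per_letter, key=per_letter.get)
    return best, per_letter

# Toy six-token vocabulary; in practice the logits come from the model's
# forward pass on the prompt ending in "Answer:".
toy_logits = [0.1, 2.0, 0.3, -1.0, 0.5, 1.2]
toy_letter_ids = {"A": 1, "B": 2, "C": 3, "D": 5}  # hypothetical token IDs

choice, scores = score_mcq(toy_logits, toy_letter_ids)
print("predicted letter:", choice)
```

Because only the four letter tokens are compared, a model that would otherwise continue into prose still receives a score; accuracy then depends on whether the format signal is strong enough to separate the letters.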
|
| 104 |
+
|
| 105 |
+
`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
|
| 106 |
+
acronym definition) to the chat training mix. The assistant turn is the
|
| 107 |
+
bare letter A/B/C/D, with a 30% subset followed by a one-line
|
| 108 |
+
justification. This teaches the model to output a single letter after
|
| 109 |
+
`Answer:` rather than continuing into prose — the dominant failure mode
|
| 110 |
+
of small models on MCQ format.
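
A templated MCQ training example of the kind described above might be built as follows. This is a sketch under stated assumptions: the prompt layout, the `format_mcq_example` helper, and the sample question are hypothetical; only the role tokens, the bare-letter answer, and the 30% justification rate come from the text.

```python
import random

def format_mcq_example(question: str, options: list[str], answer: str,
                       justification: str, rng: random.Random,
                       justify_rate: float = 0.3) -> str:
    """Render one MCQ as a single chat exchange (layout is an assumption)."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer:")
    user_turn = "\n".join(lines)

    # Most examples answer with the bare letter; a subset adds one line of reasoning.
    reply = answer
    if rng.random() < justify_rate:
        reply = f"{answer}. {justification}"

    return (f"<|ghost_user|>{user_turn}<|ghost_end|>\n"
            f"<|ghost_assistant|>{reply}<|ghost_end|>")

example = format_mcq_example(
    question="Which MITRE ATT&CK tactic covers keeping access across restarts?",
    options=["Persistence", "Discovery", "Exfiltration", "Impact"],
    answer="A",
    justification="Persistence groups techniques that maintain a foothold.",
    rng=random.Random(0),
)
print(example)
```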
|
| 111 |
+
|
| 112 |
+
Honest comparisons: 36.9% is well above random (25%) but well below the
|
| 113 |
+
85-95% that frontier models score on the same benchmark. The model was
|
| 114 |
+
trained on 12.56M tokens of pure cybersecurity text — about 1.4% of the
|
| 115 |
+
Chinchilla-optimal data budget for 45M parameters. The next bench bump is
|
| 116 |
+
expected to come from corpus expansion (`v0.4.2`) and the v0.5
|
| 117 |
+
architecture upgrade.
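
The two ratios quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the common rule of thumb of about 20 training tokens per parameter for the Chinchilla-optimal budget; that constant is an assumption, not a figure stated in this card.

```python
# Back-of-the-envelope check of the data-budget and accuracy ratios.
TOKENS_PER_PARAM = 20        # rule-of-thumb Chinchilla ratio (assumption)

params = 45e6                # ghost-small parameter count
trained_tokens = 12.56e6     # pretraining corpus size

optimal_tokens = params * TOKENS_PER_PARAM     # 900M tokens under this rule
budget_fraction = trained_tokens / optimal_tokens

accuracy = 0.369             # chat-v3 on CTIBench MCQ
random_baseline = 0.25       # four answer options
lift = accuracy / random_baseline

print(f"optimal budget: {optimal_tokens / 1e6:.0f}M tokens")
print(f"share of budget used: {budget_fraction:.1%}")
print(f"accuracy vs random: {lift:.2f}x")
```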
|
| 118 |
+
|
| 119 |
+
## Usage
|
| 120 |
+
|
| 121 |
+
### Direct use (no HF transformers integration)
|
| 122 |
+
|
| 123 |
+
GhostLM has a custom architecture — it does **not** use the
|
| 124 |
+
HuggingFace `transformers` library and is **not** auto-loadable via
|
| 125 |
+
`AutoModelForCausalLM`. You need the GhostLM repo itself.
|
| 126 |
+
|
| 127 |
+
```bash
|
| 128 |
+
git clone https://github.com/joemunene-by/GhostLM
|
| 129 |
+
cd GhostLM
|
| 130 |
+
pip install -r requirements.txt
|
| 131 |
+
```
|
| 132 |
|
| 133 |
+
Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:
|
| 134 |
|
| 135 |
```python
|
| 136 |
+
from huggingface_hub import hf_hub_download
|
| 137 |
+
from pathlib import Path
|
| 138 |
+
|
| 139 |
+
dest = Path("checkpoints/phase5_chat_v3")
|
| 140 |
+
dest.mkdir(parents=True, exist_ok=True)
|
| 141 |
+
hf_hub_download(
|
| 142 |
+
repo_id="Ghostgim/GhostLM",
|
| 143 |
+
filename="pytorch_model.pt",
|
| 144 |
+
local_dir=str(dest),
|
| 145 |
+
)
|
| 146 |
+
# Rename to match what the loader expects
|
| 147 |
+
(dest / "pytorch_model.pt").rename(dest / "best_model.pt")
|
| 148 |
```
|
| 149 |
|
| 150 |
+
### Chat REPL
|
|
|
|
|
|
|
| 151 |
|
| 152 |
+
```bash
|
| 153 |
+
PYTHONPATH=. python3 scripts/chat.py \
|
| 154 |
+
--checkpoint checkpoints/phase5_chat_v3/best_model.pt \
|
| 155 |
+
--temperature 0.7 --top-k 40 --top-p 0.95 \
|
| 156 |
+
--repetition-penalty 1.25
|
| 157 |
+
```
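
For readers unfamiliar with the `--repetition-penalty` flag, the sketch below shows one common formulation (the CTRL-style penalty, which scales down the logits of tokens that have already been generated). How `scripts/chat.py` implements the flag is not shown in this card, so this is an illustration of the general technique rather than the repo's code.

```python
def apply_repetition_penalty(logits: list[float], generated_ids: list[int],
                             penalty: float = 1.25) -> list[float]:
    """Make already-generated tokens less likely before sampling the next one."""
    out = list(logits)
    for tid in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out

toy_logits = [2.0, -1.0, 0.5]   # toy three-token vocabulary
history = [0, 1, 0]             # tokens 0 and 1 were already emitted
penalized = apply_repetition_penalty(toy_logits, history)
print(penalized)
```

Tokens that have not appeared keep their logits, so the penalty only reshapes the distribution around repeated tokens.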
|
| 158 |
|
| 159 |
+
Chat format uses three special tokens:
|
| 160 |
|
| 161 |
+
```
|
| 162 |
+
<|ghost_user|>What is XSS?<|ghost_end|>
|
| 163 |
+
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
|
| 164 |
+
```
|
| 165 |
|
| 166 |
+
Use the helper to build an inference-ready prompt:
|
| 167 |
|
| 168 |
+
```python
|
| 169 |
+
from ghostlm.tokenizer import GhostTokenizer
|
| 170 |
+
tok = GhostTokenizer()
|
| 171 |
+
prompt_ids = tok.format_chat_prompt([
|
| 172 |
+
{"role": "user", "content": "What is XSS?"},
|
| 173 |
+
])
|
| 174 |
+
# prompt_ids ends in <|ghost_assistant|> ready for generation
|
| 175 |
+
```
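
At the string level, the prompt the helper produces presumably looks like the chat format shown earlier with an open assistant turn at the end. The sketch below renders that string without the repo installed; `render_chat_prompt` is a hypothetical stand-in, the newline between turns is assumed from the example layout, and the real helper returns token IDs rather than text.

```python
ROLE_TAGS = {"user": "<|ghost_user|>", "assistant": "<|ghost_assistant|>"}
END_TAG = "<|ghost_end|>"

def render_chat_prompt(messages: list[dict[str, str]]) -> str:
    """Render prior turns, then open an assistant turn for the model to complete."""
    parts = [f"{ROLE_TAGS[m['role']]}{m['content']}{END_TAG}" for m in messages]
    parts.append(ROLE_TAGS["assistant"])  # left open: generation continues from here
    return "\n".join(parts)

prompt_text = render_chat_prompt([{"role": "user", "content": "What is XSS?"}])
print(prompt_text)
```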
|
| 176 |
|
| 177 |
+
### MCP server for Claude Code / Claude Desktop
|
| 178 |
|
| 179 |
+
GhostLM ships an MCP server that exposes three tools — `ghostlm_query`,
|
| 180 |
+
`ghostlm_explain_cve`, `ghostlm_map_to_attack` — over stdio. Install
|
| 181 |
+
with:
|
| 182 |
|
| 183 |
+
```bash
|
| 184 |
+
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
|
| 185 |
+
--checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
|
| 186 |
+
```
|
|
|
|
| 187 |
|
| 188 |
+
Requires Python ≥ 3.10 and the `mcp`, `torch`, and `tiktoken` packages (`pip install mcp torch tiktoken`).
|
| 189 |
|
| 190 |
+
## Training data
|
| 191 |
|
| 192 |
+
| Source | Records | License | Notes |
|
| 193 |
+
|---|---:|---|---|
|
| 194 |
+
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
|
| 195 |
+
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
|
| 196 |
+
| Synthetic CTF writeups | 2,847 | Custom — generated locally | turned into 2,847 chat pairs |
|
| 197 |
+
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
|
| 198 |
+
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
|
| 199 |
+
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
|
| 200 |
+
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
|
| 201 |
+
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |
|
| 202 |
|
| 203 |
+
Total chat training set: ~17,000 records after 30× small_talk oversampling
|
| 204 |
+
and 2× MCQ oversampling. The `data/raw/` source files and the
|
| 205 |
+
`data/processed/train.jsonl` pretrain corpus are reproducible from the
|
| 206 |
+
collector scripts in the GitHub repo.
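
The oversampling step can be sketched as plain replication. The record counts (153 small_talk pairs, 1,802 MCQ examples) and the factors (30×, 2×) come from this card; the `oversample` helper and the placeholder records are hypothetical, and the sketch does not attempt to reconstruct the full ~17,000-record mix.

```python
def oversample(records: list[str], factor: int) -> list[str]:
    """Repeat every record `factor` times so the source carries more weight."""
    return records * factor

# Placeholder records standing in for the real chat pairs.
small_talk = [f"small_talk_{i}" for i in range(153)]
mcq = [f"mcq_{i}" for i in range(1_802)]

small_talk_os = oversample(small_talk, 30)
mcq_os = oversample(mcq, 2)

print("small_talk after 30x:", len(small_talk_os))
print("MCQ after 2x:        ", len(mcq_os))
```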
|
| 207 |
|
| 208 |
+
## Limitations
|
| 209 |
|
| 210 |
+
- **No general world knowledge.** Outside cybersecurity the model is
|
| 211 |
+
wrong, repetitive, or both. It will refuse politely on most OOD
|
| 212 |
+
topics ("what's the weather", "tell me a joke") but accuracy on
|
| 213 |
+
general questions is essentially zero.
|
| 214 |
+
- **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates,
|
| 215 |
+
and technique IDs are memorized incompletely — the model often
|
| 216 |
+
confabulates plausible-looking but wrong specifics. Always verify
|
| 217 |
+
against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
|
| 218 |
+
or original vendor advisories.
|
| 219 |
+
- **Short coherence window.** 1024-token context, no RoPE — long
|
| 220 |
+
multi-turn conversations drift. The chat REPL trims old turns when
|
| 221 |
+
the running prompt overflows.
|
| 222 |
+
- **CTIBench 36.9% is well above random but well below larger models.**
|
| 223 |
+
This is expected at 45M parameters and 12.56M training tokens.
|
| 224 |
+
- **Repetition-prone without a penalty.** Use `--repetition-penalty 1.25`
|
| 225 |
+
in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.
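
The turn trimming mentioned under "Short coherence window" can be sketched as dropping the oldest turns until the conversation fits the context. This is a minimal illustration, not the REPL's actual logic: `first_turn_to_keep` and the `reserve` budget for the reply are hypothetical.

```python
def first_turn_to_keep(turn_token_counts: list[int], budget: int = 1024,
                       reserve: int = 128) -> int:
    """Return the index of the oldest turn that still fits in the context.

    Walks backwards from the newest turn and keeps turns while their total
    token count stays within `budget - reserve` (room left for the reply).
    """
    limit = budget - reserve
    total = 0
    start = len(turn_token_counts)
    for i in range(len(turn_token_counts) - 1, -1, -1):
        if total + turn_token_counts[i] > limit:
            break
        total += turn_token_counts[i]
        start = i
    return start

turn_lengths = [300, 400, 250, 200]   # tokens per turn, oldest first (toy values)
keep_from = first_turn_to_keep(turn_lengths)
print("keep turns from index", keep_from, "->", turn_lengths[keep_from:])
```

Trimming whole turns rather than individual tokens keeps every remaining turn wrapped in its role tokens, which matters for a model tuned on that exact format.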
|
| 226 |
+
|
| 227 |
+
## Intended use
|
| 228 |
+
|
| 229 |
+
- Hands-on learning: explore how a small specialized LM behaves on a
|
| 230 |
+
narrow domain.
|
| 231 |
+
- Local cybersecurity Q&A as a complement to a larger general model
|
| 232 |
+
via the MCP server.
|
| 233 |
+
- Research baseline for cyber-LLM evaluation work — a small, fully
|
| 234 |
+
reproducible from-scratch model with published benchmark numbers.
|
| 235 |
+
|
| 236 |
+
**Out of scope:** production security advice, vulnerability triage,
|
| 237 |
+
incident response. The model is a research artifact — never act on its
|
| 238 |
+
output without verifying against authoritative sources.
|
| 239 |
|
| 240 |
## Citation
|
| 241 |
|
| 242 |
+
```
|
| 243 |
+
@misc{ghostlm-2026,
|
| 244 |
+
author = {Munene, Joe},
|
| 245 |
+
title = {GhostLM: A small cybersecurity language model trained from scratch},
|
| 246 |
+
year = {2026},
|
| 247 |
+
url = {https://github.com/joemunene-by/GhostLM},
|
|
|
|
| 248 |
}
|
| 249 |
```
|