| --- |
| license: apache-2.0 |
| language: |
| - en |
| library_name: pytorch |
| pipeline_tag: text-generation |
| tags: |
| - cybersecurity |
| - language-model |
| - from-scratch |
| - small-model |
| - cti |
| - ctibench |
| - chat |
| datasets: |
| - trendmicro-ailab/Primus-Seed |
| - trendmicro-ailab/Primus-FineWeb |
| - AI4Sec/cti-bench |
| model-index: |
| - name: GhostLM v0.9 chat (81M) |
| results: |
| - task: |
| type: multiple-choice |
| name: CTIBench MCQ (debiased text-scoring, n=2500) |
| dataset: |
| name: CTIBench MCQ |
| type: AI4Sec/cti-bench |
| split: test |
| metrics: |
| - name: 2-permutation per-perm avg |
| type: accuracy |
| value: 0.289 |
| - task: |
| type: multiple-choice |
| name: SecQA (n=210) |
| dataset: |
| name: SecQA |
| type: zefang-liu/secqa |
| metrics: |
| - name: accuracy |
| type: accuracy |
| value: 0.393 |
| - task: |
| type: multiple-choice |
| name: GhostLM CTF MCQ (in-repo, n=30) |
| dataset: |
| name: GhostLM CTF eval |
| type: in-repo |
| metrics: |
| - name: accuracy |
| type: accuracy |
| value: 0.592 |
| - task: |
| type: text-generation |
| name: GhostLM free-form fact recall (n=50) |
| dataset: |
| name: GhostLM fact-recall bench |
| type: in-repo |
| metrics: |
| - name: substring-match accuracy |
| type: accuracy |
| value: 0.02 |
| --- |
| |
| # GhostLM v0.9 chat (81M, from-scratch cybersecurity LM) |
|
|
| GhostLM is a multi-rung scale-ladder cybersecurity language model |
| trained entirely from scratch in PyTorch. **v0.9 chat is the bench |
| winner of the ghost-small (45-81M) line on every multiple-choice |
| benchmark we evaluated.** It is also where the line saturates: at 81M |
| parameters the model has the *register* of cybersec writing but not the |
| *facts* in any retrievable form. The next rung is |
| [ghost-base](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md) |
| (~360M, SmolLM2-360M shape), gated on rented GPU compute. |
|
|
| This repo holds the slim inference checkpoint |
| (`best_model.pt`, 324 MB, model + config only, optimizer state stripped). |
|
|
| ## v0.9.5 update (2026-05-08): nine differentiation bets, 1,505 templated SFT records ready |
|
|
| The strategic frame went from "six bets, three measured" (v0.9.4) to |
| "nine bets, all shipped, 1,505 deterministic SFT records ready for |
| the v1.0 GPU run." The new bets answer **"what would make GhostLM |
| exceptional, beyond what general-purpose small LMs offer?"** |
|
|
| Strategic frame: [`docs/differentiation.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/differentiation.md). |
|
|
| | Bet | Status | Result | |
| |---|---|---| |
| | 1. Tool-grounded SFT | **training data ready** | 424 templated traces, 98.6% acceptance under `trace_quality_ok`; ~10% "not found" injection trains lookup-failure acknowledgement. [tool_use_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/tool_use_synth.md) | |
| | 2. Daily LoRA over fresh threat-intel | scaffolded | `scripts/daily_finetune.py`, ~1-2 GPU hr/day | |
| | 3. Custom 32K BPE | **measured + settled** | +4.0% on cyber, -2.5% on general vs GPT-2 BPE; +25-35% projection falsified. [bpe_corpus_ablation.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/bpe_corpus_ablation.md) | |
| | 4. Long context via RoPE NTK | scaffolded | `scripts/extend_context_ntk.py`, ~3-5 GPU hr | |
| | 5. MoE for ghost-1B+ | **smoke validated** | 100-step training PASS. [moe_training_smoke.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/moe_training_smoke.md); presets `ghost-1b` (2.1B/1.2B-active) and `ghost-3b` (6.0B/3.3B-active) | |
| | 6. Format-aware pretrain (STIX/YARA/Sigma/MISP) | **measured baseline + training data ready** | v0.9 baseline locked at 0/32 = 0% [Wilson 95% CI 0.0-10.7]. 560 templated records ready. [format_baseline_v09.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_baseline_v09.md), [format_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/format_synth.md) | |
| | 7. Code-for-security | **NEW**, training data ready | 12-pattern bank covering OWASP-Top-10 CWE classes (Python/JS/C); 48 records, 100% pass. [code_security_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/code_security_synth.md) | |
| | 8. Binary / hex literacy | **NEW**, training data ready, **most novel bet** | 15-pattern bank: PE/ELF/Mach-O/ZIP/PDF/OLE2/PNG file magic, UPX/Themida packers, NOP sleds + x64 syscall, PE Optional Header Magic + Machine, x64 execve('/bin/sh') shellcode; 44 records, 100% pass. **No other small cybersec LM does this.** [binary_literacy_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/binary_literacy_synth.md) | |
| | 9. Provenance / cite tags | **NEW**, training data ready | 429 cite-augmented tool-use traces with `<\|cite\|>{source_type}:{id}#field<\|/cite\|>` inline in the answer; 99.8% acceptance under `trace_with_cites_quality_ok`. Stacks on bet 1 for ~853-record SFT corpus. [provenance_synth.md](https://github.com/joemunene-by/GhostLM/blob/main/docs/provenance_synth.md) | |
|
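The Wilson interval quoted for the bet 6 baseline (0/32) can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the textbook Wilson score formula at z = 1.96; the function name `wilson_interval` is illustrative and is not taken from the GhostLM repo.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(0, 32)
print(f"{100 * low:.1f}-{100 * high:.1f}")  # 0.0-10.7
```

With zero successes the upper bound reduces to z²/(n + z²), which is why a 0/32 result still leaves room for a true pass rate of up to roughly 10.7%.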
|
| ### Combined templated-synth corpus |
|
|
| | Bet | Records | Acceptance | |
| |---|---:|---:| |
| | 1 (tool-use, plain) | 424 | 98.6% | |
| | 6 (STIX / YARA / Sigma / MISP) | 560 | 99.8% | |
| | 7 (code-for-security) | 48 | 100.0% | |
| | 8 (binary / hex literacy) | 44 | 100.0% | |
| | 9 (cite-augmented tool-use) | 429 | 99.8% | |
| | **TOTAL** | **1,505** | **99.4%** | |
|
|
| That's the deterministic floor. LLM-distilled records on top |
| (bet 1 production at ~$200, bet 6 production at ~$50-100 on |
| Anthropic) bring the realistic ghost-base SFT mix to ~10K records |
| for a few hundred dollars, with no GPU spend until the actual |
| pretrain run. |
|
|
| The v0.9 chat checkpoint in this repo is unchanged; it's the |
| baseline against which all bet measurements are made. |
|
|
| ## Bench numbers |
|
|
| All multiple-choice benches use debiased multi-permutation text-scoring,
| run from saved checkpoints on CPU or GPU; free-form fact recall is
| graded by substring match. Methodology in
| [`docs/ctibench_bias_finding.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ctibench_bias_finding.md).
|
|
| | Benchmark | Records | v0.4 chat-v3 | v0.7 chat | **v0.9 chat** | Random | |
| |---|---:|---:|---:|---:|---:| |
| | CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | **28.9%** | 25.0% | |
| | In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | **59.2%** | 25.0% | |
| | SecQA (external) | 210 | 35.0% | 37.6% | **39.3%** | 25.0% | |
| | Free-form fact recall | 50 | 0/50 | 1/50 | **1/50** | 0/50 | |
|
|
| v0.9 wins every multiple-choice benchmark (CTIBench, the in-repo CTF
| eval, and the external SecQA bench) by 1.3-9.2 pp over the best
| earlier checkpoint.
|
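The idea behind multi-permutation debiasing can be sketched as follows: score every option under several orderings of the answer choices and average the per-permutation accuracies, so that a preference for a fixed slot stops looking like skill. This is a hedged illustration, not the repo's evaluator; the item layout, the `score_option` callback, and the toy `always_first` scorer are all assumptions made for the example.

```python
from typing import Callable, Sequence

# (question, options, index of the correct option): a hypothetical item layout
Item = tuple[str, list[str], int]
ScoreFn = Callable[[str, list[str], int], float]


def per_permutation_accuracy(
    items: Sequence[Item],
    score_option: ScoreFn,
    permutations: Sequence[Sequence[int]],
) -> float:
    """Average accuracy over several orderings of the answer options."""
    accuracies = []
    for perm in permutations:
        correct = 0
        for question, options, gold in items:
            shown = [options[i] for i in perm]  # options in permuted order
            scores = [score_option(question, shown, pos) for pos in range(len(shown))]
            best_pos = max(range(len(shown)), key=scores.__getitem__)
            correct += int(perm[best_pos] == gold)  # map slot back to original index
        accuracies.append(correct / len(items))
    return sum(accuracies) / len(accuracies)


def always_first(question: str, options: list[str], position: int) -> float:
    """Toy scorer with pure position bias: it always prefers the first slot."""
    return -float(position)


opts = ["a", "b", "c", "d"]
items = [("q1", opts, 0), ("q2", opts, 0), ("q3", opts, 0), ("q4", opts, 1)]
identity = [(0, 1, 2, 3)]
cyclic = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)]

print(per_permutation_accuracy(items, always_first, identity))  # 0.75
print(per_permutation_accuracy(items, always_first, cyclic))    # 0.25
```

In this toy setup a scorer that knows nothing reaches 75% on a single ordering because the gold answers cluster in slot 0, and drops to the 25% random baseline once every slot is rotated through. A real evaluator would replace `always_first` with a model-based score for each option's text.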
|
| **But free-form fact recall is at floor across the entire 81M ghost-small |
| rung.** A 50-question hand-written fact-recall set (CVE / CWE / MITRE / |
| OWASP / crypto / protocol / misc) graded by substring match scores |
| 0-2% across every chat-tune in the line. The v0.9 model's one "hit" |
| ("256" appearing in a SHA-256 question) is arguably spurious. **MCQ wins |
| measure register matching and topic distinctness, not factual recall.** |
|
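Substring-match grading, and the way it can credit an answer that merely echoes the question, can be illustrated with a short sketch. The function names and the case-insensitive comparison are assumptions for this example; the example answers are invented and are not model outputs.

```python
def substring_hit(answer: str, accepted: list[str]) -> bool:
    """True if any accepted string occurs in the answer (case-insensitive)."""
    text = answer.casefold()
    return any(target.casefold() in text for target in accepted)


def substring_accuracy(answers: list[str], accepted: list[list[str]]) -> float:
    """Fraction of answers that contain at least one accepted substring."""
    hits = sum(substring_hit(a, acc) for a, acc in zip(answers, accepted))
    return hits / len(answers)


# An answer that only repeats the algorithm name still contains "256".
echo = "SHA-256 is a widely used hash function."
print(substring_hit(echo, ["256"]))                      # True
print(substring_hit("It is a hash function.", ["256"]))  # False
```

This is why a single hit on a question whose expected answer also appears in the question text is weak evidence of recall.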
|
| ## Architecture |
|
|
| | Field | Value | |
| |---|---| |
| | Layers | 6 | |
| | d_model | 768 | |
| | Attention heads | 12 (head_dim 64) | |
| | FFN | SwiGLU, hidden = `int(d_ff × 2/3)` rounded to 64 = 2048 | |
| | Normalization | RMSNorm | |
| | Position | RoPE (base 10000) | |
| | Vocab | 50,264 (GPT-2 BPE, 50,257 tokens, + 7 special tokens) | |
| | Context | 512 train, 1024 inference | |
| | Total params | ~81M | |
|
|
| Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is |
| what produces the bench delta over v0.7. |
|
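The ~81M total can be sanity-checked from the table with back-of-envelope arithmetic. The sketch below assumes tied input/output embeddings, bias-free linear layers, two RMSNorm gains per block plus a final norm, `d_ff = 4 × d_model = 3072`, and rounding up to a multiple of 64; none of these details are stated in this card, so treat the result as an estimate.

```python
def swiglu_hidden(d_ff: int, multiple: int = 64) -> int:
    """int(d_ff * 2/3), rounded up to a multiple of `multiple` (assumed rounding)."""
    hidden = (2 * d_ff) // 3
    return ((hidden + multiple - 1) // multiple) * multiple


def count_params(
    n_layers: int = 6,
    d_model: int = 768,
    ffn_hidden: int = 2048,
    vocab: int = 50_264,
    tied_embeddings: bool = True,  # assumption: output head shares the embedding
) -> int:
    embed = vocab * d_model
    attn = 4 * d_model * d_model    # Q, K, V and output projections, no biases
    ffn = 3 * d_model * ffn_hidden  # SwiGLU: gate, up and down projections
    norms = 2 * d_model             # two RMSNorm gain vectors per block
    total = embed + n_layers * (attn + ffn + norms) + d_model  # + final RMSNorm
    if not tied_embeddings:
        total += vocab * d_model
    return total


print(swiglu_hidden(3072))   # 2048
print(count_params())        # 81080064
print(count_params() * 4)    # 324320256 bytes in fp32
```

Under these assumptions the count lands at about 81.1M, and four bytes per parameter gives roughly 324 MB, which lines up with the size of the slim checkpoint. RoPE adds no learned parameters, so it does not appear in the count.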
|
| ## Training data |
|
|
| Pretrain corpus: 273M tokens spanning |
|
|
| - **PRIMUS-Seed** (Trend Micro AI Lab, Apache 2.0): curated cybersec text |
| - **PRIMUS-FineWeb** (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl |
| - **NVD CVEs** (NIST, public domain): full v2 description text |
| - **MITRE ATT&CK + CWE + CAPEC** (MITRE, custom permissive): technique / weakness / pattern descriptions |
| - **OWASP** (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance |
| - **IETF RFCs** (BCP 78, public): security-relevant RFCs |
| - **CTFtime + Exploit-DB** (open): real CTF write-ups and exploit POCs |
| - **arXiv cs.CR**: full-text academic papers |
| - **fact-QA**: ~11K Q&A pairs distilled by Qwen-14B from the corpus |
|
|
| Per-source breakdown in |
| [`CORPUS.md`](https://github.com/joemunene-by/GhostLM/blob/main/CORPUS.md). |
|
|
| Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using |
| the chat-v3 recipe. Three role tokens (`<|ghost_user|>`, |
| `<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer. |
|
|
| ## Intended use |
|
|
| - Educational: a transparent, hand-written reference implementation of |
| a from-scratch decoder-only cybersecurity LM, trained on a curated |
| corpus, with all code on GitHub and all recipes documented. |
| - Research: a bench artifact for "what does an 81M from-scratch cyber |
| LM actually score on CTIBench / SecQA?" The honest answer (28.9% / |
| 39.3%) is meaningful evidence about the parameter-count requirement |
| for factual recall on cybersec MCQ. |
|
|
| ## What this model is NOT for |
|
|
| - **Anything that depends on factual recall.** Free-form fact recall |
| is at floor. CVE numbers, version chains, MITRE technique IDs, |
| CVSS scores produced by this model are unreliable. Verify against |
| authoritative sources. |
| - **General-purpose tasks.** Outside cybersecurity the model politely |
| declines and returns to its domain. Do not expect it to summarize |
| news, write code, or answer arbitrary questions. |
| - **Production cybersec workflows.** Not for incident response, |
| threat hunting, or any decision that affects real systems. |
|
|
| ## Loading |
|
|
| ```python |
| from dataclasses import fields |
| |
| import torch |
| from huggingface_hub import hf_hub_download |
| |
| from ghostlm.config import GhostLMConfig |
| from ghostlm.model import GhostLM |
| from ghostlm.tokenizer import GhostTokenizer |
| |
| # Pull the slim checkpoint (model + config only) from the Hub |
| ckpt_path = hf_hub_download( |
|     repo_id="Ghostgim/GhostLM-v0.9-experimental", |
|     filename="best_model.pt", |
| ) |
| |
| # weights_only=False unpickles arbitrary objects, so only load checkpoints you trust |
| ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False) |
| saved = ckpt["config"] |
| # Keep only the fields that GhostLMConfig declares |
| config = GhostLMConfig(**{ |
|     f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved |
| }) |
| model = GhostLM(config) |
| model.load_state_dict(ckpt["model_state_dict"]) |
| model.eval()  # switch to evaluation mode |
| |
| tokenizer = GhostTokenizer()  # GPT-2 BPE + 7 special tokens |
| |
| # Multi-turn chat using the role tokens |
| turns = [{"role": "user", "content": "What is XSS?"}] |
| prompt_ids = tokenizer.format_chat_prompt(turns) |
| # ... see scripts/chat.py for full generation loop |
| ``` |
|
|
| The full code (architecture, tokenizer, generation, eval, training) is |
| in the [GhostLM GitHub repo](https://github.com/joemunene-by/GhostLM). |
|
|
| ## Live demo |
|
|
| [`huggingface.co/spaces/Ghostgim/ghostlm`](https://huggingface.co/spaces/Ghostgim/ghostlm) |
|
|
| Pulls these weights via `hf_hub_download` on first launch. CPU inference |
| takes ~15-25 s per reply at the default 200-token cap. The demo is |
| intentionally honest about the fact-recall floor; expect register-shaped |
| output rather than reliable answers. |
|
|
| ## Caveats |
|
|
| - **Hallucination is the norm**, not the exception. This is an 81M |
| from-scratch model, not a fine-tuned 7B foundation model. |
| - **MCQ wins do not imply factual recall.** Test with the free-form |
| fact-recall benchmark, not just CTIBench. |
| - **Pretrain corpus is sub-Chinchilla.** 273M tokens for 81M params is
| about 3.4 tokens per parameter, roughly 6× under the ~20 tokens per
| parameter Chinchilla rule of thumb; the chat tune partially compensates,
| but the model is undertrained relative to its capacity.
|
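The token budget in the last caveat can be checked with quick arithmetic. The sketch assumes the commonly cited reading of Chinchilla as roughly 20 training tokens per parameter; that ratio is a rule of thumb rather than a figure from this card.

```python
params = 81_000_000
tokens = 273_000_000
chinchilla_tokens_per_param = 20  # assumption: common rule-of-thumb ratio

tokens_per_param = tokens / params
optimal_tokens = chinchilla_tokens_per_param * params
shortfall = optimal_tokens / tokens

print(f"{tokens_per_param:.2f} tokens per parameter")   # 3.37
print(f"{optimal_tokens / 1e9:.2f}B tokens suggested")  # 1.62B
print(f"{shortfall:.1f}x under the rule of thumb")      # 5.9x
```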
|
| ## Citation |
|
|
| ```bibtex |
| @misc{munene2026ghostlm, |
| title = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder}, |
| author = {Munene, Joe}, |
| year = {2026}, |
| howpublished = {\url{https://github.com/joemunene-by/GhostLM}}, |
| note = {v0.9.5 release; 81M-parameter chat checkpoint plus nine differentiation bets, 1505 templated SFT records} |
| } |
| ``` |
|
|
| ## Roadmap |
|
|
| The next rung is **ghost-base (~360M, SmolLM2-360M shape)**, gated on |
| rented GPU compute. Acceptance gate: |
|
|
| - ≥40% on debiased CTIBench (full n=2500), OR |
| - ≥65% on the in-repo CTF MCQ eval, OR |
| - ≥30% on the 50-question free-form fact-recall set. |
|
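The gate above is a disjunction, which can be written as a small predicate. This is an illustrative sketch with a hypothetical function name; the thresholds come from the list above and the sample inputs are the v0.9 chat numbers from this card.

```python
def passes_ghost_base_gate(ctibench: float, ctf_mcq: float, fact_recall: float) -> bool:
    """True if any one of the three acceptance thresholds is met."""
    return ctibench >= 0.40 or ctf_mcq >= 0.65 or fact_recall >= 0.30


# v0.9 chat: 28.9% CTIBench, 59.2% CTF MCQ, 1/50 fact recall
print(passes_ghost_base_gate(0.289, 0.592, 1 / 50))  # False
```

Measured against the same gate, the current checkpoint clears none of the three bars.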
|
| The fact-recall bar is the one that tests factual knowledge rather
| than register matching. Spec at
| [`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md); |
| multi-year pathway through ghost-7B in |
| [`docs/hardware_pathway.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/hardware_pathway.md). |
|
|
| After ghost-base lands, the v0.9.4 differentiation bets compose on |
| top of it: tool-use SFT (bet 1) on the fresh ghost-base, format-aware |
| pretrain mix (bet 6) using the 560 templated records plus |
| LLM-distilled traces, RoPE NTK context extension to 16K (bet 4), and |
| eventually ghost-1B with native MoE from step 0 (bet 5). Sequencing |
| detail in |
| [`docs/differentiation.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/differentiation.md). |
|
|
| ## License |
|
|
| Apache 2.0. Same license as the GhostLM source code. |
|
|
| Built by Joe Munene. |
|
|