---
title: GhostLM
emoji: 🔐
colorFrom: purple
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: From-scratch 81M cybersecurity LM, v0.9 chat demo
---

# GhostLM Chat (v0.9)

Interactive Gradio chat for **GhostLM v0.9 chat**, an 81M-parameter
cybersecurity language model trained from scratch in PyTorch.

The Space ships a multi-turn chat interface backed by the v0.9 chat
weights. Generation uses the model's three role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) and stops the
moment the assistant's `<|ghost_end|>` is sampled.
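
The stop-on-end behaviour can be sketched in a few lines. This is a minimal illustration, not the code in `app.py`: the prompt layout (role token, text, end token, no separators) is an assumption, and `build_prompt`, `generate_reply`, and `sample_next` are hypothetical names standing in for the real tokenizer and model.

```python
from typing import Callable, List, Tuple

USER = "<|ghost_user|>"
ASSISTANT = "<|ghost_assistant|>"
END = "<|ghost_end|>"


def build_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Render prior (user, assistant) turns plus the new user message.

    Assumed layout: every turn is role token + text + end token. The real
    template is whatever the chat SFT data used.
    """
    parts = []
    for user_text, assistant_text in history:
        parts.append(f"{USER}{user_text}{END}")
        parts.append(f"{ASSISTANT}{assistant_text}{END}")
    parts.append(f"{USER}{user_message}{END}")
    parts.append(ASSISTANT)  # leave the assistant turn open for generation
    return "".join(parts)


def generate_reply(
    sample_next: Callable[[str], str], prompt: str, max_new_tokens: int = 200
) -> str:
    """Sample one token at a time; stop at END or at the token cap."""
    reply = []
    context = prompt
    for _ in range(max_new_tokens):
        token = sample_next(context)
        if token == END:
            break  # the end token itself is never shown to the user
        reply.append(token)
        context += token
    return "".join(reply)


# Scripted stand-in for the model: anything after END is never consumed.
scripted = iter(["Port ", "scanning", END, "ignored"])
prompt = build_prompt([], "What is nmap used for?")
print(generate_reply(lambda ctx: next(scripted), prompt))  # prints "Port scanning"
```

Passing the sampler in as a callable keeps the loop independent of any particular model class, which makes the stopping rule easy to check in isolation.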

## Bench numbers (v0.9 chat)

The v0.9 chat checkpoint scores highest among the ghost-small
checkpoints on every multiple-choice benchmark we ran:

| Benchmark | n | Score |
|---|---:|---:|
| [CTIBench MCQ](https://huggingface.co/datasets/AI4Sec/cti-bench), 2-permutation debiased | 2,500 | **28.9%** |
| in-repo CTF MCQ eval | 30 | **59.2%** |
| SecQA (external) | 210 | **39.3%** |
| free-form fact recall, hand-written | 50 | 1/50 (at floor) |
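
The table does not spell out the debiasing procedure. One plausible reading of "2-permutation debiased", sketched below as an assumption, is that each question is asked under two option orderings and counted as correct only if the model picks the right answer under both. The question format and the names `permuted`, `debiased_accuracy`, and `pick` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def permuted(
    options: Sequence[str], answer: int, order: Sequence[int]
) -> Tuple[List[str], int]:
    """Reorder the options and return the new index of the correct one."""
    return [options[i] for i in order], list(order).index(answer)


def debiased_accuracy(
    questions: List[Dict],
    pick: Callable[[str, List[str]], int],
    orders: Sequence[Sequence[int]],
) -> float:
    """Fraction of questions answered correctly under every ordering."""
    hits = 0
    for q in questions:
        correct_everywhere = True
        for order in orders:
            options, gold = permuted(q["options"], q["answer"], order)
            if pick(q["stem"], options) != gold:
                correct_everywhere = False
                break
        hits += correct_everywhere
    return hits / len(questions)


questions = [
    {"stem": "q1", "options": ["a", "b", "c", "d"], "answer": 0},
    {"stem": "q2", "options": ["a", "b", "c", "d"], "answer": 2},
]
identity, reverse = [0, 1, 2, 3], [3, 2, 1, 0]

# A picker that always answers with the first option looks 50% accurate
# on one ordering but drops to 0% once the reversed ordering is required.
always_first = lambda stem, options: 0
print(debiased_accuracy(questions, always_first, [identity]))           # 0.5
print(debiased_accuracy(questions, always_first, [identity, reverse]))  # 0.0
```

Under this reading, a purely position-biased model cannot score above zero, so any remaining accuracy has to come from the option text.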

Free-form fact recall is at floor across the entire 81M ghost-small
rung, which is expected at this size: the model has the *register* of
cybersec writing but not the *facts* in any retrievable form. The next
rung (ghost-base ~360M, SmolLM2-360M shape) is blocked on renting GPU
compute. Spec: [`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md).

## Architecture

6 layers, d_model 768, 12 heads, with RoPE + SwiGLU + RMSNorm. Pretrain
corpus: 273M tokens spanning PRIMUS-Seed, PRIMUS-FineWeb, NVD CVEs,
MITRE ATT&CK, CWE, CAPEC, OWASP, IETF RFCs, Exploit-DB, CTFtime, arXiv
cs.CR, plus a fact-dense Q&A set. Chat-tuned with the chat-v3 SFT recipe.
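
As a rough illustration of these hyperparameters, the sketch below records the stated shape in a config object and shows RMSNorm in plain Python. The class and field names are hypothetical (the repository's own config may differ), the `eps` value is an assumption, and the pure-Python norm is for reading, not for speed.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GhostSmallConfig:
    """Shape values taken from this README; field names are illustrative."""

    n_layers: int = 6
    d_model: int = 768
    n_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the model width.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: rescale x by its root-mean-square, then apply a learned gain.

    Unlike LayerNorm, no mean is subtracted and there is no bias term.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


cfg = GhostSmallConfig()
print(cfg.head_dim)  # 64
```

A per-head width of 64 is even, which RoPE needs because it rotates dimensions in pairs.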

## Where the weights live

The 324 MB slim weights are stored in the Models repo
[`Ghostgim/GhostLM-v0.9-experimental`](https://huggingface.co/Ghostgim/GhostLM-v0.9-experimental).
The Space's `app.py` calls `huggingface_hub.hf_hub_download` on first
launch and caches them locally. This keeps the Space comfortably under
HF's 1 GB free-tier LFS cap; the source code stays small and the
weights are versioned separately.
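
A minimal sketch of that download step, assuming `huggingface_hub` is installed. The checkpoint filename is not given here, so it is left as a required argument; `fetch_weights` and `approx_size_mb` are hypothetical helpers, not functions from `app.py`. The size check is only a consistency observation: 81M parameters at 4 bytes each comes to 324 MB, although the storage dtype is not stated in this README.

```python
REPO_ID = "Ghostgim/GhostLM-v0.9-experimental"


def approx_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Back-of-envelope checkpoint size in decimal megabytes."""
    return n_params * bytes_per_param / 1_000_000


def fetch_weights(filename: str, repo_id: str = REPO_ID) -> str:
    """Download one file from the model repo and return its local path.

    hf_hub_download stores the file in the local Hugging Face cache, so a
    second call with the same arguments reuses the cached copy.
    """
    # Imported lazily so the module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


print(approx_size_mb(81_000_000, 4))  # 324.0
```

Calling `fetch_weights` with the actual checkpoint filename from the model repo returns a path that can be handed to the model loader.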

## Source

GitHub: [`joemunene-by/GhostLM`](https://github.com/joemunene-by/GhostLM)

Run locally:

```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r demo/requirements.txt
PYTHONPATH=. python3 demo/app.py
```

The model is small enough to run on a laptop CPU; expect ~10-25 s per
chat reply at the default 200-token cap, or roughly 8-20 tokens/s when
a reply uses the full cap.

## Caveats

- **Hallucinates facts.** CVE IDs, CVSS scores, technique IDs, and
  version ranges are all unreliable. Outputs are register-shaped
  fiction, not reference material. Verify against authoritative sources.
- **No general-knowledge tuning.** Outside cybersecurity the model
  politely declines and returns to its domain. Don't expect it to
  summarize a news article or write Python.
- **The MCQ wins do not imply factual recall.** The 28.9% on debiased
  CTIBench measures the register-matching component of the benchmark;
  the free-form fact recall result (1/50, at floor) is the more honest
  measure of what the model actually knows.

## License

Apache 2.0. Built by Joe Munene.