---
title: GhostLM
emoji: 🔐
colorFrom: purple
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: From-scratch 81M cybersecurity LM, v0.9 chat demo
---
# GhostLM Chat (v0.9)
Interactive Gradio chat for **GhostLM v0.9 chat**, an 81M-parameter
cybersecurity language model trained from scratch in PyTorch.
The Space ships a multi-turn chat interface backed by the v0.9 chat
weights. Generation uses the model's three role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) and stops the
moment the assistant's `<|ghost_end|>` is sampled.
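
The role-token flow described above can be sketched as follows. This is a minimal illustration, not the Space's actual `app.py`: the exact turn layout (role token, text, end token) is an assumption, and `build_prompt`, `generate_reply`, and `sample_next` are hypothetical names standing in for the real tokenizer and sampler.

```python
from typing import Callable, List, Tuple

USER = "<|ghost_user|>"
ASSISTANT = "<|ghost_assistant|>"
END = "<|ghost_end|>"


def build_prompt(history: List[Tuple[str, str]], user_msg: str) -> str:
    """Flatten past (user, assistant) turns plus the new user message.

    Assumed template: each turn is role token + text + end token.
    """
    parts = []
    for user_text, assistant_text in history:
        parts.append(f"{USER}{user_text}{END}")
        parts.append(f"{ASSISTANT}{assistant_text}{END}")
    parts.append(f"{USER}{user_msg}{END}")
    parts.append(ASSISTANT)  # leave the assistant turn open for generation
    return "".join(parts)


def generate_reply(
    sample_next: Callable[[str], str],
    prompt: str,
    max_new_tokens: int = 200,
) -> str:
    """Collect sampled tokens until END appears or the token cap is hit.

    `sample_next` is a hypothetical callable that returns the next token
    as a string given the text so far.
    """
    reply = []
    text = prompt
    for _ in range(max_new_tokens):
        token = sample_next(text)
        if token == END:
            break  # stop as soon as the end token is sampled
        reply.append(token)
        text += token
    return "".join(reply)
```

The end token itself is dropped from the returned reply, so the chat UI only ever sees the assistant's text.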
## Bench numbers (v0.9 chat)
The v0.9 chat checkpoint is the bench winner of the ghost-small line on
every multiple-choice benchmark we ran:

| Benchmark | n (items) | Score |
|---|---:|---:|
| [CTIBench MCQ](https://huggingface.co/datasets/AI4Sec/cti-bench), 2-permutation debiased | 2,500 | **28.9%** |
| In-repo CTF MCQ eval | 30 | **59.2%** |
| SecQA (external) | 210 | **39.3%** |
| Free-form fact recall, hand-written | 50 | 1/50 (at floor) |

Free-form fact recall is at floor across the entire 81M ghost-small
rung by design. At this parameter count the model has the *register*
of cybersec writing but not the *facts* in any retrievable form. The
next rung (ghost-base ~360M, SmolLM2-360M shape) is gated on rented
GPU compute. Spec: [`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md).
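
The CTIBench row is reported as "2-permutation debiased", but this README does not spell out the procedure. One plausible reading, sketched here purely as an assumption, is that each question is scored under two orderings of its answer options and the results are averaged, so a model that always picks the same position gains nothing from it. The names `debiased_accuracy` and `answer_fn` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# (question stem, answer options, index of the correct option)
Question = Tuple[str, List[str], int]


def debiased_accuracy(
    questions: Sequence[Question],
    answer_fn: Callable[[str, List[str]], int],
    orders: Sequence[Sequence[int]],
) -> float:
    """Average accuracy over every (question, option order) pair.

    `answer_fn` returns the position it picks in the order shown;
    `orders` lists the option permutations to present (two of them
    for a 2-permutation setup).
    """
    hits = 0
    total = 0
    for stem, options, correct in questions:
        for order in orders:
            shown = [options[i] for i in order]
            picked = answer_fn(stem, shown)
            # Map the picked position back to the original option index.
            hits += int(order[picked] == correct)
            total += 1
    return hits / total


# A position-biased "model" that always picks the first option shown.
always_first = lambda stem, shown: 0
questions = [("q1", ["a", "b", "c", "d"], 0), ("q2", ["a", "b", "c", "d"], 0)]
identity, reverse = (0, 1, 2, 3), (3, 2, 1, 0)

plain = debiased_accuracy(questions, always_first, [identity])
debiased = debiased_accuracy(questions, always_first, [identity, reverse])
```

In this toy case the biased picker scores 1.0 on the original order alone and 0.5 once the reversed order is included, which is the effect the debiasing is meant to expose.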
## Architecture
6 layers, d_model 768, 12 heads, with RoPE + SwiGLU + RMSNorm. Pretrain
corpus: 273M tokens spanning PRIMUS-Seed, PRIMUS-FineWeb, NVD CVEs,
MITRE ATT&CK, CWE, CAPEC, OWASP, IETF RFCs, Exploit-DB, CTFtime, arXiv
cs.CR, plus a fact-dense Q&A set. Chat-tuned with the chat-v3 SFT recipe.
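
As a rough sanity check on these dimensions, the sketch below derives the per-head width and the attention parameter count. Only `n_layers`, `d_model`, and `n_heads` come from this README; the bias-free, full multi-head projections are assumptions, and the FFN hidden size and vocabulary size are not stated here, so they are left as caller-supplied, hypothetical arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GhostSmallConfig:
    n_layers: int = 6
    d_model: int = 768
    n_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Each head gets an equal slice of the model width.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

    def attention_params(self) -> int:
        # Q, K, V and output projections, each d_model x d_model.
        # Assumes no biases and no grouped-query sharing.
        return self.n_layers * 4 * self.d_model * self.d_model

    def swiglu_params(self, ffn_hidden: int) -> int:
        # SwiGLU uses three matrices per layer: gate, up, and down.
        # `ffn_hidden` is not given in the README.
        return self.n_layers * 3 * self.d_model * ffn_hidden

    def embedding_params(self, vocab_size: int) -> int:
        # `vocab_size` is not given in the README.
        return vocab_size * self.d_model


cfg = GhostSmallConfig()
print(cfg.head_dim)            # 64
print(cfg.attention_params())  # 14155776
```

RoPE adds no learned parameters and RMSNorm adds only one `d_model`-sized vector per norm, so the remainder of the 81M total would sit in the feed-forward and embedding matrices.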
## Where the weights live
The 324 MB slim weights are stored in the Models repo
[`Ghostgim/GhostLM-v0.9-experimental`](https://huggingface.co/Ghostgim/GhostLM-v0.9-experimental).
The Space's `app.py` calls `huggingface_hub.hf_hub_download` on first
launch and caches them locally. This keeps the Space comfortably under
HF's 1 GB free-tier LFS cap; the source code stays small and the
weights are versioned separately.
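
A minimal sketch of that download step, assuming the standard `hf_hub_download(repo_id=..., filename=...)` call. The weights filename is not given in this README, so it is left as a parameter, and the `fetch_weights` wrapper with its injectable `download` argument is a hypothetical helper, not the Space's actual code.

```python
from typing import Callable, Optional

REPO_ID = "Ghostgim/GhostLM-v0.9-experimental"


def fetch_weights(
    filename: str,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path to `filename` from the weights repo.

    By default this defers to huggingface_hub, which downloads on the
    first call and serves later calls from its local cache.
    """
    if download is None:
        # Imported lazily so the helper can be tested without the dependency.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return download(repo_id=REPO_ID, filename=filename)
```

Passing a stub as `download` exercises the helper offline; in the Space it would be called once at startup with the real checkpoint filename.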
## Source
GitHub: [`joemunene-by/GhostLM`](https://github.com/joemunene-by/GhostLM)

Run locally:
```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r demo/requirements.txt
PYTHONPATH=. python3 demo/app.py
```
The model is small enough to run on a laptop CPU; expect ~10-25 s per
chat reply at the default 200-token cap.
## Caveats
- **Hallucinates facts.** CVE IDs, CVSS scores, technique IDs, and version
ranges are all unreliable. Outputs are register-shaped fiction, not
reference material. Verify against authoritative sources.
- **No general-knowledge tuning.** Outside cybersecurity the model
politely declines and returns to its domain. Don't expect it to
summarize a news article or write Python.
- **The MCQ wins do not imply factual recall.** The 28.9% on debiased
CTIBench measures the register-matching component of the
benchmark; the free-form fact recall floor (1/50) is the metric that
reflects what the model actually knows.
## License
Apache 2.0. Built by Joe Munene.