---
title: GhostLM
emoji: 🔐
colorFrom: purple
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: From-scratch 81M cybersecurity LM, v0.9 chat demo
---

# GhostLM Chat (v0.9)

Interactive Gradio chat for **GhostLM v0.9 chat**, an 81M-parameter cybersecurity language model trained from scratch in PyTorch.

The Space ships a multi-turn chat interface backed by the v0.9 chat weights. Generation uses the model's three role tokens (`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) and stops the moment the assistant's `<|ghost_end|>` is sampled.

## Bench numbers (v0.9 chat)

The v0.9 chat checkpoint is the bench winner of the ghost-small line on every multiple-choice benchmark we ran:

| Benchmark | n | Score |
|---|---:|---:|
| [CTIBench MCQ](https://huggingface.co/datasets/AI4Sec/cti-bench), 2-permutation debiased | 2,500 | **28.9%** |
| in-repo CTF MCQ eval | 30 | **59.2%** |
| SecQA (external) | 210 | **39.3%** |
| free-form fact recall, hand-written | 50 | 1/50 (at floor) |

Free-form fact recall is at floor across the entire 81M ghost-small rung by design. At this parameter count the model has the *register* of cybersec writing but not the *facts* in any retrievable form.

The next rung (ghost-base ~360M, SmolLM2-360M shape) is gated on rented GPU compute. Spec: [`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md).

## Architecture

6 layers, d_model 768, 12 heads, with RoPE + SwiGLU + RMSNorm.

Pretrain corpus: 273M tokens spanning PRIMUS-Seed, PRIMUS-FineWeb, NVD CVEs, MITRE ATT&CK, CWE, CAPEC, OWASP, IETF RFCs, Exploit-DB, CTFtime, arXiv cs.CR, plus a fact-dense Q&A set.

Chat-tuned with the chat-v3 SFT recipe.

## Where the weights live

The 324 MB slim weights are stored in the Models repo [`Ghostgim/GhostLM-v0.9-experimental`](https://huggingface.co/Ghostgim/GhostLM-v0.9-experimental).
The Space's `app.py` calls `huggingface_hub.hf_hub_download` on first launch and caches them locally. This keeps the Space comfortably under HF's 1 GB free-tier LFS cap; the source code stays small and the weights are versioned separately.

## Source

GitHub: [`joemunene-by/GhostLM`](https://github.com/joemunene-by/GhostLM)

Run locally:

```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r demo/requirements.txt
PYTHONPATH=. python3 demo/app.py
```

The model is small enough to run on a laptop CPU; expect ~10-25 s per chat reply at the default 200-token cap.

## Caveats

- **Hallucinates facts.** CVE IDs, CVSS scores, technique IDs, version ranges are all unreliable. Outputs are register-shaped fiction, not reference material. Verify against authoritative sources.
- **No general-knowledge tuning.** Outside cybersecurity the model politely declines and returns to its domain. Don't expect it to summarize a news article or write Python.
- **The MCQ wins do not mean factual recall.** The 28.9% on debiased CTIBench measures the register-matching component of the benchmark; the free-form fact recall floor (1/50) is the truth metric.

## License

Apache 2.0. Built by Joe Munene.
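
## Prompt format sketch

The role tokens named at the top of this README (`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) suggest how a multi-turn conversation could be flattened into one prompt and how a reply could be cut at the end token. The sketch below is illustrative only: the exact layout (no separators, every turn closed by `<|ghost_end|>`) and the helper names `build_prompt` and `cut_at_end` are assumptions, not the code in `demo/app.py`. The real demo stops at the token level during sampling, whereas this works on decoded strings.

```python
# Role tokens named in this README; the layout around them is an assumption.
USER = "<|ghost_user|>"
ASSISTANT = "<|ghost_assistant|>"
END = "<|ghost_end|>"


def build_prompt(history, user_message):
    """Flatten prior (user, assistant) turns plus the new message into one string.

    Assumed layout: each turn is role token + text + END, and the prompt
    finishes with an open assistant token for the model to continue from.
    """
    parts = []
    for user_text, assistant_text in history:
        parts.append(f"{USER}{user_text}{END}")
        parts.append(f"{ASSISTANT}{assistant_text}{END}")
    parts.append(f"{USER}{user_message}{END}")
    parts.append(ASSISTANT)  # left open so the model writes the reply
    return "".join(parts)


def cut_at_end(generated):
    """Keep only the text before the first END token, if one was produced."""
    reply, _, _ = generated.partition(END)
    return reply.strip()
```

A caller would pass `build_prompt(history, message)` to the model, decode the continuation, and show `cut_at_end(continuation)` in the chat window. If the model never emits `<|ghost_end|>`, the whole continuation is returned, so a token cap such as the 200-token default mentioned above is still needed.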