docs: model card with bench numbers, intended use, caveats

Adds proper README.md frontmatter so the model surfaces in HF model search
and tags, with the v0.9 chat bench numbers (CTIBench 28.9% / SecQA 39.3% /
in-repo CTF 59.2% / free-form fact recall 1/50) under the model-index schema.

Body documents architecture (6L x 768d x 12h, RoPE + SwiGLU + RMSNorm),
training corpus (273M tokens spanning PRIMUS / NVD / MITRE / CWE / OWASP /
RFCs / arXiv / fact-QA), intended use (educational + research), what the
model is NOT for (anything requiring factual recall), loading code, and
explicit caveats around the fact-recall floor.

Cross-links the GitHub source, the demo Space, the ghost_base_spec, and the
multi-year hardware pathway.

README.md +249 -0

	@@ -0,0 +1,249 @@

+---
+license: apache-2.0
+language:
+  - en
+library_name: pytorch
+pipeline_tag: text-generation
+tags:
+  - cybersecurity
+  - language-model
+  - from-scratch
+  - small-model
+  - cti
+  - ctibench
+  - chat
+datasets:
+  - trendmicro-ailab/Primus-Seed
+  - trendmicro-ailab/Primus-FineWeb
+  - AI4Sec/cti-bench
+model-index:
+  - name: GhostLM v0.9 chat (81M)
+    results:
+      - task:
+          type: multiple-choice
+          name: CTIBench MCQ (debiased text-scoring, n=2500)
+        dataset:
+          name: CTIBench MCQ
+          type: AI4Sec/cti-bench
+          split: test
+        metrics:
+          - name: 2-permutation per-perm avg
+            type: accuracy
+            value: 0.289
+      - task:
+          type: multiple-choice
+          name: SecQA (n=210)
+        dataset:
+          name: SecQA
+          type: zefang-liu/secqa
+        metrics:
+          - name: accuracy
+            type: accuracy
+            value: 0.393
+      - task:
+          type: multiple-choice
+          name: GhostLM CTF MCQ (in-repo, n=30)
+        dataset:
+          name: GhostLM CTF eval
+          type: in-repo
+        metrics:
+          - name: accuracy
+            type: accuracy
+            value: 0.592
+      - task:
+          type: text-generation
+          name: GhostLM free-form fact recall (n=50)
+        dataset:
+          name: GhostLM fact-recall bench
+          type: in-repo
+        metrics:
+          - name: substring-match accuracy
+            type: accuracy
+            value: 0.02
+---
+# GhostLM v0.9 chat (81M, from-scratch cybersecurity LM)
+GhostLM is a multi-rung scale-ladder cybersecurity language model
+trained entirely from scratch in PyTorch. **v0.9 chat is the bench
+winner of the ghost-small (45-81M) line on every multiple-choice
+benchmark we evaluated.** It is also where the line saturates: at 81M
+parameters the model has the *register* of cybersec writing but not the
+*facts* in any retrievable form. The next rung is
+[ghost-base](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md)
+(~360M, SmolLM2-360M shape), gated on rented GPU compute.
+This repo holds the slim inference checkpoint
+(`best_model.pt`, 324 MB, model + config only, optimizer state stripped).
+## Bench numbers
+All multiple-choice benches use debiased multi-permutation text-scoring,
+run from the saved checkpoint on CPU or GPU; free-form fact recall is
+graded by substring match. Methodology in
+[`docs/ctibench_bias_finding.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ctibench_bias_finding.md).
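A minimal sketch of per-permutation averaging, for readers who have not opened the methodology doc. It is an illustration under assumptions, not the repo's evaluation code: `score_fn` is a hypothetical callable that scores an option's text (for example, a mean token log-likelihood), options are reordered by cyclic rotation, and option texts within a question are assumed to be distinct.

```python
from typing import Callable, List, Sequence, Tuple

# (question, options, index of the correct option in the original order)
Record = Tuple[str, Sequence[str], int]
# Hypothetical scorer: (question, options as shown, candidate text) -> score
ScoreFn = Callable[[str, Sequence[str], str], float]


def rotate(options: Sequence[str], k: int) -> List[str]:
    """Cyclically rotate the option order by k positions."""
    k %= len(options)
    return list(options[k:]) + list(options[:k])


def per_permutation_accuracy(
    records: Sequence[Record], score_fn: ScoreFn, n_perms: int = 2
) -> List[float]:
    """Accuracy under each of n_perms option orderings."""
    accuracies = []
    for k in range(n_perms):
        correct = 0
        for question, options, answer_idx in records:
            shown = rotate(options, k)
            # Score the option *text*, not its letter, so the label carries no signal.
            scores = [score_fn(question, shown, text) for text in shown]
            picked = shown[scores.index(max(scores))]
            correct += picked == options[answer_idx]
        accuracies.append(correct / len(records))
    return accuracies


def debiased_accuracy(
    records: Sequence[Record], score_fn: ScoreFn, n_perms: int = 2
) -> float:
    """Mean of the per-permutation accuracies."""
    accuracies = per_permutation_accuracy(records, score_fn, n_perms)
    return sum(accuracies) / len(accuracies)
```

With four options and four rotations, a scorer that always prefers whichever option is shown first lands on exactly 25%, which is the point of averaging over orderings: position preference alone cannot beat the random baseline.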
+| Benchmark | Records | v0.4 chat-v3 | v0.7 chat | **v0.9 chat** | Random |
+|---|---:|---:|---:|---:|---:|
+| CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | **28.9%** | 25.0% |
+| In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | **59.2%** | 25.0% |
+| SecQA (external, n=210) | 210 | 35.0% | 37.6% | **39.3%** | 25.0% |
+| Free-form fact recall | 50 | 0/50 | 1/50 | **1/50** | 0/50 |
+v0.9 wins every multiple-choice benchmark, by 1.3-9.2 pp over the best
+earlier chat tune. It ranks first on CTIBench, the in-repo CTF eval, and
+the external SecQA bench.
+**But free-form fact recall is at floor across the entire 81M ghost-small
+rung.** A 50-question hand-written fact-recall set (CVE / CWE / MITRE /
+OWASP / crypto / protocol / misc) graded by substring match scores
+0-2% across every chat-tune in the line. The v0.9 model's one "hit"
+("256" appearing in a SHA-256 question) is arguably spurious. **MCQ wins
+measure register matching and topic distinctness, not factual recall.**
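A minimal sketch of substring-match grading, to show why a hit like the "256" one can be spurious. The function names and the sample strings in the usage note are made up for illustration; the repo's grader may normalize text differently.

```python
from typing import Iterable, Sequence


def substring_hit(output: str, accepted: Iterable[str]) -> bool:
    """True if any accepted answer appears in the output, ignoring case."""
    text = output.casefold()
    return any(answer.casefold() in text for answer in accepted)


def recall_accuracy(
    outputs: Sequence[str], answer_sets: Sequence[Sequence[str]]
) -> float:
    """Fraction of outputs that contain at least one accepted answer."""
    if len(outputs) != len(answer_sets):
        raise ValueError("outputs and answer_sets must have the same length")
    hits = sum(substring_hit(o, a) for o, a in zip(outputs, answer_sets))
    return hits / len(outputs)
```

Under this rule, `substring_hit("SHA-256 is a hash function.", ["256"])` is `True` even though the output only echoes a token from the question, so a single hit out of 50 says very little about recall.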
+## Architecture
+| Field | Value |
+|---|---|
+| Layers | 6 |
+| d_model | 768 |
+| Attention heads | 12 (head_dim 64) |
+| FFN | SwiGLU, hidden = `int(d_ff × 2/3)` rounded to a multiple of 64 = 2048 |
+| Normalization | RMSNorm |
+| Position | RoPE (base 10000) |
+| Vocab | 50,264 (GPT-2 BPE, 50,257 tokens, + 7 special tokens) |
+| Context | 512 train, 1024 inference |
+| Total params | ~81M |
+Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is
+what produces the bench delta over v0.7.
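The table values can be cross-checked with a back-of-the-envelope parameter count. This is a sketch under stated assumptions, none of which the card confirms: `d_ff = 4 × d_model`, input and output embeddings share one matrix, linear layers have no biases, each block has two RMSNorm gains, and rounding goes up to the next multiple of 64.

```python
def swiglu_hidden(d_ff: int, multiple: int = 64) -> int:
    """Two thirds of d_ff, rounded up to a multiple of `multiple` (assumed rule)."""
    hidden = 2 * d_ff // 3
    return ((hidden + multiple - 1) // multiple) * multiple


def count_params(vocab: int = 50_264, d_model: int = 768, n_layers: int = 6) -> dict:
    """Rough parameter count for the architecture table above."""
    d_ff = 4 * d_model                    # assumption: conventional 4x expansion
    hidden = swiglu_hidden(d_ff)          # 2048 for d_model = 768
    embedding = vocab * d_model           # assumed tied with the output head
    attention = 4 * d_model * d_model     # Q, K, V and output projections, no biases
    ffn = 3 * d_model * hidden            # SwiGLU gate, up and down projections
    norms = 2 * d_model                   # two RMSNorm gain vectors per block
    per_layer = attention + ffn + norms
    total = embedding + n_layers * per_layer + d_model  # plus a final RMSNorm
    return {"hidden": hidden, "embedding": embedding,
            "per_layer": per_layer, "total": total}


counts = count_params()
fp32_megabytes = counts["total"] * 4 / 1e6  # four bytes per parameter
```

Under these assumptions the total comes to about 81.1M parameters, and four bytes per parameter gives roughly 324 MB, in line with the ~81M and 324 MB figures quoted in this card. RoPE adds no learned parameters.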
+## Training data
+Pretrain corpus: 273M tokens, drawn from:
+- **PRIMUS-Seed** (Trend Micro AI Lab, Apache 2.0): curated cybersec text
+- **PRIMUS-FineWeb** (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl
+- **NVD CVEs** (NIST, public domain): full v2 description text
+- **MITRE ATT&CK + CWE + CAPEC** (MITRE, custom permissive): technique / weakness / pattern descriptions
+- **OWASP** (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance
+- **IETF RFCs** (BCP 78, public): security-relevant RFCs
+- **CTFtime + Exploit-DB** (open): real CTF write-ups and exploit POCs
+- **arXiv cs.CR**: full-text academic papers
+- **fact-QA**: ~11K Q&A pairs distilled by Qwen-14B from the corpus
+Per-source breakdown in
+[`CORPUS.md`](https://github.com/joemunene-by/GhostLM/blob/main/CORPUS.md).
+Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using
+the chat-v3 recipe. Three role tokens (`<|ghost_user|>`,
+`<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer.
+## Intended use
+- Educational: a transparent, hand-written reference implementation of
+  a from-scratch decoder-only cybersecurity LM, trained on a curated
+  corpus, with all code on GitHub and all recipes documented.
+- Research: a bench artifact for "what does an 81M from-scratch cyber
+  LM actually score on CTIBench / SecQA?" The honest answer (28.9% /
+  39.3%) is meaningful evidence about the parameter-count requirement
+  for factual recall on cybersec MCQ.
+## What this model is NOT for
+- **Anything that depends on factual recall.** Free-form fact recall
+  is at floor. CVE numbers, version chains, MITRE technique IDs,
+  CVSS scores produced by this model are unreliable. Verify against
+  authoritative sources.
+- **General-purpose tasks.** Outside cybersecurity the model politely
+  declines and returns to its domain. Do not expect it to summarize
+  news, write code, or answer arbitrary questions.
+- **Production cybersec workflows.** Not for incident response,
+  threat hunting, or any decision that affects real systems.
+## Loading
+```python
+from dataclasses import fields
+
+import torch
+from huggingface_hub import hf_hub_download
+
+from ghostlm.config import GhostLMConfig
+from ghostlm.model import GhostLM
+from ghostlm.tokenizer import GhostTokenizer
+
+# Download the slim inference checkpoint from the Hub (cached after the first call).
+ckpt_path = hf_hub_download(
+    repo_id="Ghostgim/GhostLM-v0.9-experimental",
+    filename="best_model.pt",
+)
+
+# weights_only=False runs the full pickle loader, which can execute arbitrary
+# code. Only load checkpoints you trust.
+ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+
+# Rebuild the config from the saved values. Only fields that GhostLMConfig
+# defines are used; fields missing from the checkpoint keep their defaults.
+saved = ckpt["config"]
+config = GhostLMConfig(**{
+    f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved
+})
+
+model = GhostLM(config)
+model.load_state_dict(ckpt["model_state_dict"])
+model.eval()  # switch to evaluation mode for inference
+
+tokenizer = GhostTokenizer()  # GPT-2 BPE + 7 special tokens
+
+# Multi-turn chat using the role tokens
+turns = [{"role": "user", "content": "What is XSS?"}]
+prompt_ids = tokenizer.format_chat_prompt(turns)
+# ... see scripts/chat.py for full generation loop
+```
+The full code (architecture, tokenizer, generation, eval, training) is
+in the [GhostLM GitHub repo](https://github.com/joemunene-by/GhostLM).
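The loading snippet stops before generation. A generic greedy decoding loop could look like the sketch below. It is not the code in `scripts/chat.py`: `next_logits` is a hypothetical callable that wraps the model's forward pass and returns next-token logits as a list, and `end_id` is assumed to be the id of `<|ghost_end|>`. The defaults reuse the 200-token cap and 1024-token inference context mentioned in this card.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    prompt_ids: Sequence[int],
    next_logits: Callable[[List[int]], Sequence[float]],  # hypothetical model wrapper
    end_id: int,                                          # assumed id of <|ghost_end|>
    max_new_tokens: int = 200,
    context_len: int = 1024,
) -> List[int]:
    """Append the highest-scoring token until end_id or the token cap is reached."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        window = ids[-context_len:]  # keep only the most recent context
        logits = next_logits(window)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == end_id:
            break
        ids.append(next_id)
        generated.append(next_id)
    return generated
```

Greedy decoding is used here only because it is deterministic and easy to check; a chat loop would more commonly sample with a temperature or top-k cutoff.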
+## Live demo
+[`huggingface.co/spaces/Ghostgim/ghostlm`](https://huggingface.co/spaces/Ghostgim/ghostlm)
+The Space pulls these weights via `hf_hub_download` on first launch. CPU inference
+takes ~15-25 s per reply at the default 200-token cap. The demo is
+intentionally honest about the fact-recall floor; expect register-shaped
+output rather than reliable answers.
+## Caveats
+- **Hallucination is the norm**, not the exception. This is an 81M
+  from-scratch model, not a fine-tuned 7B foundation model.
+- **MCQ wins do not imply factual recall.** Test with the free-form
+  fact-recall benchmark, not just CTIBench.
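+- **Pretrain corpus is sub-Chinchilla.** 273M tokens for 81M params is
+  about 3.4 tokens per parameter, ~6× under the ~20 tokens-per-parameter
+  Chinchilla heuristic (~3× if the 38.6M embedding parameters are
+  excluded); the chat tune partially compensates, but the model is
+  undertrained relative to its capacity.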
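The size of the sub-Chinchilla gap depends on which parameters are counted. A rough check, assuming the common rule of thumb of about 20 training tokens per parameter (a reading of the Chinchilla result, not a figure from this card):

```python
TOKENS_PER_PARAM = 20  # assumption: rule-of-thumb compute-optimal ratio


def undertraining_factor(params: int, tokens: int,
                         tokens_per_param: int = TOKENS_PER_PARAM) -> float:
    """How many times more tokens the rule of thumb would call for."""
    return params * tokens_per_param / tokens


total_params = 81_000_000
corpus_tokens = 273_000_000
embedding_params = 50_264 * 768  # vocab x d_model from the architecture table

tokens_per_param = corpus_tokens / total_params
factor_total = undertraining_factor(total_params, corpus_tokens)
factor_non_embedding = undertraining_factor(
    total_params - embedding_params, corpus_tokens
)
```

Counting all 81M parameters gives a gap of about 5.9×; counting only the roughly 42M non-embedding parameters gives about 3.1×.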
+## Citation
+```bibtex
+@misc{munene2026ghostlm,
+  title         = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder},
+  author        = {Munene, Joe},
+  year          = {2026},
+  howpublished  = {\url{https://github.com/joemunene-by/GhostLM}},
+  note          = {v0.9.2 release; 81M-parameter chat checkpoint}
+}
+```
+## Roadmap
+The next rung is **ghost-base (~360M, SmolLM2-360M shape)**, gated on
+rented GPU compute. Acceptance gate:
+- ≥40% on debiased CTIBench (full n=2500), OR
+- ≥65% on the in-repo CTF MCQ eval, OR
+- ≥30% on the 50-question free-form fact-recall set.
+The fact-recall bar is the one that measures factual recall directly. Spec at
+[`docs/ghost_base_spec.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/ghost_base_spec.md);
+multi-year pathway through ghost-7B in
+[`docs/hardware_pathway.md`](https://github.com/joemunene-by/GhostLM/blob/main/docs/hardware_pathway.md).
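The acceptance gate is a plain OR over three thresholds. A small sketch of it as a check, with the function and argument names chosen here for illustration:

```python
def passes_gate(ctibench: float, ctf: float, fact_recall: float) -> bool:
    """True if any one of the three ghost-base acceptance thresholds is met."""
    return (
        ctibench >= 0.40        # debiased CTIBench, full n=2500
        or ctf >= 0.65          # in-repo CTF MCQ eval
        or fact_recall >= 0.30  # 50-question free-form fact-recall set
    )


# v0.9 chat numbers from the bench table: none of the three bars is cleared.
v09_passes = passes_gate(ctibench=0.289, ctf=0.592, fact_recall=0.02)
```

Because the conditions are joined by OR, a model could clear the gate on an MCQ bench alone while fact recall stays at floor.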
+## License
+Apache 2.0. Same license as the GhostLM source code.
+Built by Joe Munene.