Space: Ghostgim/ghostlm

ghost committed 15 days ago

Commit

4fcda01

0 Parent(s):

chore: squash history to reclaim LFS storage from removed checkpoint

Files changed (11)

.gitattributes +8 -0
README.md +157 -0
app.py +265 -0
checkpoints/phase5_chat_v3/best_model.pt +3 -0
ghostlm/__init__.py +19 -0
ghostlm/config.py +162 -0
ghostlm/dataset.py +125 -0
ghostlm/model.py +428 -0
ghostlm/tokenizer.py +321 -0
ghostlm/trainer.py +317 -0
requirements.txt +22 -0

.gitattributes ADDED

	@@ -0,0 +1,8 @@

+# Hugging Face Spaces LFS rules. Copy this to the Space repo root
+# alongside app.py / requirements.txt / README.md / ghostlm/ /
+# checkpoints/. Without it the ~177 MB checkpoints either fail to push
+# or land as broken pointer files.
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,157 @@

+---
+title: GhostLM
+emoji: 🔐
+colorFrom: purple
+colorTo: gray
+sdk: gradio
+app_file: app.py
+pinned: false
+license: apache-2.0
+short_description: From-scratch cybersecurity LM — interactive demo
+---
+# GhostLM Demo
+Interactive Gradio UI for the canonical Phase 3.5 ghost-tiny model. Two
+tabs: a single-checkpoint **Generate** view with curated prompt presets
+and a generation history, and an optional **Compare** tab that runs the
+same prompt through two checkpoints side-by-side (the canonical v0.3.5
+vs. the v0.3.7 attempt that regressed).
+This file is dual-purpose:
+- **In the GitHub repo** (`demo/README.md`) — documents the demo and
+  the deploy steps.
+- **As an HF Space README** — the YAML frontmatter at the top is parsed
+  by Hugging Face Spaces as the Space metadata. Keep it intact when
+  copying this file to a Space repo.
+## Run locally
+From the repo root:
+```bash
+pip install -r demo/requirements.txt
+PYTHONPATH=. python3 demo/app.py
+```
+Open `http://localhost:7860`. The demo defaults to
+`checkpoints/phase3.5_balanced/best_model.pt` — pass `--checkpoint` to
+load a different one:
+```bash
+PYTHONPATH=. python3 demo/app.py --checkpoint checkpoints/phase3.6_exploitdb/best_model.pt
+```
+To enable the Compare tab, add a second checkpoint:
+```bash
+PYTHONPATH=. python3 demo/app.py \
+  --checkpoint checkpoints/phase3.5_balanced/best_model.pt \
+  --compare-checkpoint checkpoints/phase3.6_exploitdb/best_model.pt
+```
+A `--share` flag is also available, mirroring Gradio's `share=True` launch option:
+```bash
+PYTHONPATH=. python3 demo/app.py --share
+```
+## Deploy to Hugging Face Spaces
+A Space is a separate git repo on huggingface.co. The demo here lives
+under `demo/` in the GhostLM repo so the source stays in one place; to
+deploy you copy the demo files plus the `ghostlm/` package and a
+checkpoint into a fresh Space repo.
+### 1. Create the Space
+Either via the Hugging Face web UI (New → Space, SDK = Gradio) or via
+CLI:
+```bash
+pip install huggingface_hub
+huggingface-cli login
+huggingface-cli repo create ghostlm --type space --space-sdk gradio
+```
+Replace `ghostlm` with your preferred Space name.
+### 2. Clone the Space repo and stage files
+```bash
+git clone https://huggingface.co/spaces/<your-user>/ghostlm hf-space
+cd hf-space
+# Track the checkpoint via LFS (it's ~177 MB)
+git lfs install
+git lfs track "*.pt"
+# Copy the demo + the ghostlm package + the canonical checkpoint
+cp ../demo/app.py .
+cp ../demo/requirements.txt .
+cp ../demo/README.md .
+cp -r ../ghostlm .
+mkdir -p checkpoints/phase3.5_balanced
+cp ../checkpoints/phase3.5_balanced/best_model.pt checkpoints/phase3.5_balanced/
+git add .
+git commit -m "Initial GhostLM Space deploy"
+git push
+```
+The Space will start building automatically; first build takes ~3–5
+minutes (gradio + torch wheel install + checkpoint LFS pull). The
+README's frontmatter tells HF this is a Gradio Space, sets the colors,
+and pins `app_file: app.py`.
+### 3. Optional — include the Phase 3.6 checkpoint for the Compare tab
+If you want the Compare tab live in the Space, also copy the Phase 3.6
+checkpoint (~177 MB more) and set the env var in the Space's Settings
+page:
+```bash
+mkdir -p checkpoints/phase3.6_exploitdb
+cp ../checkpoints/phase3.6_exploitdb/best_model.pt checkpoints/phase3.6_exploitdb/
+git add checkpoints/phase3.6_exploitdb
+git commit -m "Add Phase 3.6 for compare tab"
+git push
+```
+In the Space's **Settings → Variables**, add:
+```
+GHOSTLM_COMPARE_CHECKPOINT = checkpoints/phase3.6_exploitdb/best_model.pt
+```
+The Space restarts automatically. The Compare tab will now be visible.
+### 4. Updates
+Push to the Space repo whenever the demo changes; the Space rebuilds.
+To update a checkpoint, push the new `.pt` file; LFS handles the upload.
+## What it looks like
+The **Generate** tab gives you a prompt textbox, three sampling sliders
+(max tokens, temperature, top-k), and a continuation panel. Below that,
+collapsible accordions group the preset prompts by register (CVE / MITRE
+/ CTF / CAPEC / free-form) so visitors can immediately see what kind of
+prose the model knows. A history panel keeps the last five generations
+visible.
+The **Compare** tab — only shown when a second checkpoint is loaded —
+sends the same prompt + sampling settings to both models in turn so the
+Phase 3.5 → 3.6 trajectory is visible in real text rather than just
+accuracy numbers.
+## Why this exists
+The point of the demo isn't to impress visitors with fluency — at 14.7M
+parameters trained on 8.8M tokens, the model produces register-shaped
+fiction, not knowledge. The point is to make the project's
+trajectory-over-absolute-quality framing concrete: visitors can poke at
+the canonical model, see exactly what it knows and doesn't, and if both
+checkpoints are loaded, see the empirical capacity-ceiling finding for
+themselves.
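
Note on step 3 of the README: the `GHOSTLM_COMPARE_CHECKPOINT` variable is named in the README, but the code that reads it is not part of this commit's `app.py`. The following is only a sketch of how such a lookup could work; the helper name `resolve_compare_checkpoint` and the choice to return `None` when the variable is unset or the file is missing are assumptions, not the project's implementation.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Variable name taken from the README; everything else here is hypothetical.
ENV_VAR = "GHOSTLM_COMPARE_CHECKPOINT"


def resolve_compare_checkpoint(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the compare-checkpoint path, or None if it is unset or missing."""
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return None  # variable not set: the Compare tab stays hidden
    # Only accept the value if it points at a real file.
    return raw if Path(raw).is_file() else None


if __name__ == "__main__":
    path = resolve_compare_checkpoint()
    print("Compare tab enabled" if path else "Compare tab disabled")
```

Passing the environment as a mapping keeps the helper easy to exercise without touching the real process environment.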

app.py ADDED

	@@ -0,0 +1,265 @@

+"""GhostLM Gradio Space — chat UI for the v0.5.0 chat-v3 (CTIBench 36.9%) model.
+Multi-turn chat using the model's three role tokens
+(<|ghost_user|>, <|ghost_assistant|>, <|ghost_end|>). Generation stops the
+moment the assistant's <|ghost_end|> is sampled. Repetition penalty is on
+by default — without it the 45M model occasionally degenerates into
+"Wifi Wifi Wifi…" loops on small prompts.
+Runs on Spaces cpu-basic (2 vCPU). Generation is ~5-15 s per reply at
+the default 200-token cap.
+"""
+from __future__ import annotations
+import os
+import sys
+from dataclasses import fields
+from pathlib import Path
+from typing import List
+import gradio as gr
+import torch
+import torch.nn.functional as F
+REPO_ROOT = Path(__file__).resolve().parent
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+from ghostlm.config import GhostLMConfig
+from ghostlm.model import GhostLM
+from ghostlm.tokenizer import GhostTokenizer
+# ---------------------------------------------------------------------------
+# Loading
+# ---------------------------------------------------------------------------
+CHECKPOINT_CANDIDATES = [
+    "checkpoints/phase5_chat_v3/best_model.pt",
+    "checkpoints/best_model.pt",  # fallback if pushed at the root
+]
+def find_checkpoint() -> str:
+    """Return the first checkpoint path that exists, or empty string."""
+    for path in CHECKPOINT_CANDIDATES:
+        if Path(path).exists():
+            return path
+    return ""
+def load_model(path: str):
+    """Load a GhostLM checkpoint into eval mode on CPU."""
+    if not path:
+        # Random-init fallback so the UI still launches if weights are missing.
+        config = GhostLMConfig.from_preset("ghost-tiny")
+        config.vocab_size = 50264
+        config.context_length = 256
+        model = GhostLM(config).eval()
+        return model, config, "(random ghost-tiny — weights missing on Space)"
+    ckpt = torch.load(path, map_location="cpu", weights_only=False)
+    saved = ckpt["config"]
+    config = GhostLMConfig(**{
+        f.name: saved[f.name]
+        for f in fields(GhostLMConfig)
+        if f.name in saved
+    })
+    model = GhostLM(config)
+    state = ckpt.get("model_state_dict", ckpt.get("model"))
+    model.load_state_dict(state, strict=False)
+    model.eval()
+    return model, config, path
+# ---------------------------------------------------------------------------
+# Generation — inlined from scripts/chat.py so the Space stays self-contained.
+# ---------------------------------------------------------------------------
+def sample_next(
+    logits: torch.Tensor,
+    *,
+    temperature: float,
+    top_k: int,
+    top_p: float,
+    prev_ids: List[int],
+    repetition_penalty: float,
+) -> int:
+    """Sample one token from logits with temperature, top-k / top-p, and rep-penalty."""
+    if prev_ids and repetition_penalty != 1.0:
+        for tok in set(prev_ids):
+            if logits[tok] > 0:
+                logits[tok] = logits[tok] / repetition_penalty
+            else:
+                logits[tok] = logits[tok] * repetition_penalty
+    logits = logits / max(temperature, 1e-6)
+    if top_k and top_k > 0:
+        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+        logits[logits < v[..., -1:]] = float("-inf")
+    if top_p and top_p < 1.0:
+        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
+        probs = F.softmax(sorted_logits, dim=-1)
+        cum = probs.cumsum(dim=-1)
+        cutoff = cum > top_p
+        cutoff[..., 0] = False
+        sorted_logits[cutoff] = float("-inf")
+        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
+    probs = F.softmax(logits, dim=-1)
+    return int(torch.multinomial(probs, num_samples=1).item())
+def generate_until_end(
+    model,
+    prompt_ids: List[int],
+    *,
+    end_id: int,
+    max_new_tokens: int,
+    temperature: float,
+    top_k: int,
+    top_p: float,
+    repetition_penalty: float,
+) -> List[int]:
+    """Sample tokens until ``end_id`` is drawn or ``max_new_tokens`` is reached."""
+    ids = torch.tensor(prompt_ids, dtype=torch.long).unsqueeze(0)
+    new_ids: List[int] = []
+    ctx = model.config.context_length
+    with torch.no_grad():
+        for _ in range(max_new_tokens):
+            cond = ids[:, -ctx:]
+            logits, _ = model(cond)
+            next_logits = logits[:, -1, :].squeeze(0).clone()
+            tok = sample_next(
+                next_logits,
+                temperature=temperature, top_k=top_k, top_p=top_p,
+                prev_ids=new_ids[-128:], repetition_penalty=repetition_penalty,
+            )
+            if tok == end_id:
+                break
+            new_ids.append(tok)
+            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
+    return new_ids
+# ---------------------------------------------------------------------------
+# Module-level state
+# ---------------------------------------------------------------------------
+CHECKPOINT_PATH = find_checkpoint()
+MODEL, CONFIG, LOADED_FROM = load_model(CHECKPOINT_PATH)
+TOKENIZER = GhostTokenizer()
+END_ID = TOKENIZER._special_tokens[TOKENIZER.END]
+# ---------------------------------------------------------------------------
+# Chat handler
+# ---------------------------------------------------------------------------
+def chat_fn(message: str, history: list, temperature: float, top_k: int,
+            top_p: float, max_tokens: int, repetition_penalty: float) -> str:
+    """Generate one assistant turn given the prior history + new user message.
+    ``history`` may arrive in either Gradio-tuples format
+    ``[(user, bot), ...]`` (older) or messages format
+    ``[{"role", "content"}, ...]`` (newer). We coerce to messages.
+    """
+    turns: list = []
+    for h in history:
+        if isinstance(h, dict) and h.get("role") in ("user", "assistant"):
+            turns.append({"role": h["role"], "content": h["content"]})
+        elif isinstance(h, (list, tuple)) and len(h) == 2:
+            user_msg, bot_msg = h
+            if user_msg:
+                turns.append({"role": "user", "content": user_msg})
+            if bot_msg:
+                turns.append({"role": "assistant", "content": bot_msg})
+    turns.append({"role": "user", "content": message})
+    prompt_ids = TOKENIZER.format_chat_prompt(turns)
+    # Trim conversation if the prompt overflows the context budget.
+    ctx_budget = CONFIG.context_length - max_tokens - 8
+    while len(prompt_ids) > ctx_budget and len(turns) > 1:
+        # Drop the oldest user/assistant pair, but keep the just-asked turn.
+        if len(turns) >= 3:
+            del turns[:2]
+            prompt_ids = TOKENIZER.format_chat_prompt(turns)
+        else:
+            break
+    new_ids = generate_until_end(
+        MODEL, prompt_ids,
+        end_id=END_ID,
+        max_new_tokens=int(max_tokens),
+        temperature=float(temperature),
+        top_k=int(top_k),
+        top_p=float(top_p),
+        repetition_penalty=float(repetition_penalty),
+    )
+    return TOKENIZER.decode(new_ids).strip() or "(no response)"
+# ---------------------------------------------------------------------------
+# UI
+# ---------------------------------------------------------------------------
+DESCRIPTION = f"""
+# GhostLM — chat-v3 (v0.5.0)
+A 45M-parameter cybersecurity language model **trained from scratch** on
+12.56M tokens of NVD / MITRE ATT&CK / Exploit-DB / CTFtime / arXiv cs.CR
+text. The chat-tuned checkpoint here scored **36.9% on
+[CTIBench MCQ](https://huggingface.co/datasets/AI4Sec/cti-bench)** — 1.48× random for a
+2,500-question security multiple-choice benchmark.
+**Honest expectations:** identity, OOD-refusal, and chat shape work. Specific
+facts (CVE numbers, CVSS scores, dates, technique IDs) are unreliable —
+the model often confabulates plausible-looking but wrong specifics. Always
+verify against authoritative sources. Outside cybersecurity, the model
+politely declines and returns to its domain.
+**Loaded checkpoint:** `{LOADED_FROM}`
+"""
+EXAMPLES = [
+    "What is XSS?",
+    "Explain MITRE ATT&CK technique T1059.",
+    "What does SSRF stand for?",
+    "How does a buffer overflow work?",
+    "Walk me through a typical SQL injection attack.",
+    "What's the difference between CVE and CWE?",
+    "Where do I start learning cybersecurity?",
+    "Are you ChatGPT?",
+]
+with gr.Blocks(title="GhostLM Chat") as demo:
+    gr.Markdown(DESCRIPTION)
+    with gr.Row():
+        with gr.Column(scale=3):
+            chat = gr.ChatInterface(
+                fn=chat_fn,
+                # Each example needs values for every additional_input when
+                # they're configured below — list-of-lists [message, temp,
+                # top_k, top_p, max_tokens, rep_penalty]. The defaults below
+                # match the sliders so a user can click an example and get
+                # consistent generation settings.
+                examples=[[ex, 0.7, 40, 0.95, 200, 1.25] for ex in EXAMPLES],
+                additional_inputs=[
+                    gr.Slider(0.1, 1.5, value=0.7, step=0.1, label="Temperature"),
+                    gr.Slider(0, 100, value=40, step=1, label="Top-k"),
+                    gr.Slider(0.1, 1.0, value=0.95, step=0.05, label="Top-p"),
+                    gr.Slider(32, 400, value=200, step=8, label="Max tokens"),
+                    gr.Slider(1.0, 2.0, value=1.25, step=0.05, label="Repetition penalty"),
+                ],
+            )
+    gr.Markdown(
+        "Source: [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)"
+        " · Weights: [Ghostgim/GhostLM](https://huggingface.co/Ghostgim/GhostLM)"
+        " · The model is small enough to run locally — see the GitHub README for instructions."
+    )
+if __name__ == "__main__":
+    demo.queue().launch()
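
Note on `sample_next`: the repetition penalty in this file is sign-aware. Positive logits of already-seen tokens are divided by the penalty and non-positive ones are multiplied, so both cases become less likely. A minimal pure-Python sketch of just that step follows; the helper name `apply_repetition_penalty` is hypothetical, and plain lists stand in for the tensor used in `app.py`.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float], prev_ids: Iterable[int], penalty: float
) -> List[float]:
    """Return a copy of ``logits`` with previously seen token ids penalised."""
    out = list(logits)
    if penalty == 1.0:
        return out  # a penalty of 1.0 is a no-op
    for tok in set(prev_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty  # shrink positive logits toward zero
        else:
            out[tok] = out[tok] * penalty  # push non-positive logits further down
    return out


logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 were generated before; token 2 was not.
print(apply_repetition_penalty(logits, prev_ids=[0, 1, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

Dividing a negative logit would raise it, which is why the two branches differ.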

checkpoints/phase5_chat_v3/best_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a1c2dbbb3f2559153953cdec8c0e8adbcdf0659fe4b61c3eb05a4e21c6b216f0
+size 542187521
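
Note on the pointer file: the three added lines are a Git LFS pointer, not the weights themselves. The sketch below parses such a pointer and reports the recorded size. The helper names `parse_lfs_pointer` and `is_lfs_pointer` are hypothetical, and the checks cover only the three keys shown in this diff.

```python
from typing import Dict

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:a1c2dbbb3f2559153953cdec8c0e8adbcdf0659fe4b61c3eb05a4e21c6b216f0
size 542187521
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def is_lfs_pointer(text: str) -> bool:
    """Heuristic check that the text looks like an LFS pointer."""
    fields = parse_lfs_pointer(text)
    return (
        fields.get("version", "").startswith("https://git-lfs.github.com/spec/")
        and fields.get("oid", "").startswith("sha256:")
        and fields.get("size", "").isdigit()
    )


size_bytes = int(parse_lfs_pointer(POINTER_TEXT)["size"])
print(is_lfs_pointer(POINTER_TEXT), f"{size_bytes / 1e6:.1f} MB")
```

A check like this can tell a real checkpoint from a pointer that was pushed without LFS, which is the failure mode the `.gitattributes` comment warns about.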

ghostlm/__init__.py ADDED

	@@ -0,0 +1,19 @@

+"""GhostLM — open-source cybersecurity-focused language model."""
+from ghostlm.config import GhostLMConfig
+from ghostlm.model import GhostLM
+from ghostlm.tokenizer import GhostTokenizer
+from ghostlm.dataset import GhostDataset, build_dataloaders
+from ghostlm.trainer import GhostTrainer
+__version__ = "0.1.0"
+__author__ = "Joe Munene"
+__all__ = [
+    "GhostLMConfig",
+    "GhostLM",
+    "GhostTokenizer",
+    "GhostDataset",
+    "build_dataloaders",
+    "GhostTrainer",
+]

ghostlm/config.py ADDED

	@@ -0,0 +1,162 @@

+"""GhostLM configuration — all model and training hyperparameters live here."""
+from dataclasses import dataclass, field
+@dataclass
+class GhostLMConfig:
+    """Configuration dataclass for the GhostLM transformer language model.
+    Holds all hyperparameters for model architecture, training, data paths,
+    and system settings. Supports preset configurations and parameter counting.
+    """
+    # Model architecture
+    vocab_size: int = 50257
+    context_length: int = 1024
+    d_model: int = 512
+    n_heads: int = 8
+    n_layers: int = 6
+    d_ff: int = 2048
+    dropout: float = 0.1
+    bias: bool = True
+    use_rope: bool = False
+    use_flash_attention: bool = False
+    # Training
+    batch_size: int = 32
+    learning_rate: float = 3e-4
+    weight_decay: float = 0.1
+    beta1: float = 0.9
+    beta2: float = 0.95
+    grad_clip: float = 1.0
+    grad_accum_steps: int = 4
+    warmup_steps: int = 2000
+    max_steps: int = 100000
+    eval_interval: int = 500
+    save_interval: int = 1000
+    # Paths
+    data_dir: str = "data/processed"
+    checkpoint_dir: str = "checkpoints"
+    log_dir: str = "logs"
+    # System
+    device: str = "auto"
+    dtype: str = "float32"
+    seed: int = 42
+    use_wandb: bool = False
+    def model_size(self) -> str:
+        """Estimate total parameter count and return a human-readable string.
+        Computes the approximate number of trainable parameters based on
+        vocab_size, d_model, n_heads, n_layers, and d_ff.
+        Returns:
+            A string like "124M" or "1.2B" representing the estimated size.
+        """
+        embedding_params = self.vocab_size * self.d_model
+        attention_params = self.n_layers * (
+            4 * self.d_model * self.d_model + 2 * self.d_model
+        )
+        ffn_params = self.n_layers * (
+            2 * self.d_model * self.d_ff + self.d_model + self.d_ff
+        )
+        layer_norm_params = self.n_layers * 4 * self.d_model
+        output_head_params = self.d_model * self.vocab_size
+        total = embedding_params + attention_params + ffn_params + layer_norm_params + output_head_params
+        if total >= 1e9:
+            return f"{total / 1e9:.1f}B"
+        elif total >= 1e6:
+            return f"{total / 1e6:.0f}M"
+        else:
+            return f"{total / 1e3:.0f}K"
+    @classmethod
+    def from_preset(cls, preset: str) -> "GhostLMConfig":
+        """Return a GhostLMConfig instance from a named preset.
+        Args:
+            preset: One of "ghost-tiny", "ghost-small", or "ghost-medium".
+        Returns:
+            A GhostLMConfig configured with the preset's hyperparameters.
+        Raises:
+            ValueError: If the preset name is not recognized.
+        """
+        presets = {
+            "ghost-tiny": {
+                "n_layers": 2,
+                "d_model": 256,
+                "n_heads": 4,
+                "d_ff": 1024,
+            },
+            "ghost-small": {
+                "n_layers": 6,
+                "d_model": 512,
+                "n_heads": 8,
+                "d_ff": 2048,
+            },
+            "ghost-medium": {
+                "n_layers": 12,
+                "d_model": 768,
+                "n_heads": 12,
+                "d_ff": 3072,
+            },
+        }
+        if preset not in presets:
+            raise ValueError(
+                f"Unknown preset '{preset}'. "
+                f"Available presets: {', '.join(presets.keys())}"
+            )
+        return cls(**presets[preset])
+    def __repr__(self) -> str:
+        """Return a clean, grouped string summary of all config values.
+        Returns:
+            A formatted multi-line string with config values grouped by
+            category: Architecture, Training, Paths, and System.
+        """
+        lines = [
+            "GhostLMConfig",
+            "=" * 40,
+            "Architecture:",
+            f"  vocab_size:      {self.vocab_size}",
+            f"  context_length:  {self.context_length}",
+            f"  d_model:         {self.d_model}",
+            f"  n_heads:         {self.n_heads}",
+            f"  n_layers:        {self.n_layers}",
+            f"  d_ff:            {self.d_ff}",
+            f"  dropout:         {self.dropout}",
+            f"  bias:            {self.bias}",
+            "Training:",
+            f"  batch_size:      {self.batch_size}",
+            f"  learning_rate:   {self.learning_rate}",
+            f"  weight_decay:    {self.weight_decay}",
+            f"  beta1:           {self.beta1}",
+            f"  beta2:           {self.beta2}",
+            f"  grad_clip:       {self.grad_clip}",
+            f"  warmup_steps:    {self.warmup_steps}",
+            f"  max_steps:       {self.max_steps}",
+            f"  eval_interval:   {self.eval_interval}",
+            f"  save_interval:   {self.save_interval}",
+            "Paths:",
+            f"  data_dir:        {self.data_dir}",
+            f"  checkpoint_dir:  {self.checkpoint_dir}",
+            f"  log_dir:         {self.log_dir}",
+            "System:",
+            f"  device:          {self.device}",
+            f"  dtype:           {self.dtype}",
+            f"  seed:            {self.seed}",
+            f"  use_wandb:       {self.use_wandb}",
+            "=" * 40,
+            f"Estimated size: {self.model_size()}",
+        ]
+        return "\n".join(lines)
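
Note on `model_size`: the estimate can be reproduced by hand from the formula in this file. The sketch below restates that arithmetic as a standalone function (the name `estimate_params` is hypothetical) and evaluates it for the three presets with the default `vocab_size` of 50257.

```python
def estimate_params(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Mirror the parameter estimate used by GhostLMConfig.model_size()."""
    embedding = vocab_size * d_model
    attention = n_layers * (4 * d_model * d_model + 2 * d_model)
    ffn = n_layers * (2 * d_model * d_ff + d_model + d_ff)
    layer_norm = n_layers * 4 * d_model
    output_head = d_model * vocab_size  # counted separately from the embedding
    return embedding + attention + ffn + layer_norm + output_head


PRESETS = {
    "ghost-tiny": dict(n_layers=2, d_model=256, d_ff=1024),
    "ghost-small": dict(n_layers=6, d_model=512, d_ff=2048),
    "ghost-medium": dict(n_layers=12, d_model=768, d_ff=3072),
}

for name, p in PRESETS.items():
    total = estimate_params(50257, p["d_model"], p["n_layers"], p["d_ff"])
    print(f"{name}: {total:,} (~{total / 1e6:.0f}M)")
# ghost-tiny: 27,310,080 (~27M)
# ghost-small: 70,371,328 (~70M)
# ghost-medium: 162,230,784 (~162M)
```

The formula counts the token embedding and the output head as separate matrices and leaves out positional embeddings. The README quotes 14.7M parameters for ghost-tiny; that gap would be consistent with tied embedding and output weights, but this is an assumption, since the relevant part of `model.py` is not shown in this excerpt.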

ghostlm/dataset.py ADDED

	@@ -0,0 +1,125 @@

+"""GhostLM dataset — converts processed JSONL data into PyTorch DataLoader-ready tensors."""
+import json
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple
+import torch
+from torch.utils.data import DataLoader, Dataset
+from ghostlm.config import GhostLMConfig
+from ghostlm.tokenizer import GhostTokenizer
+class GhostDataset(Dataset):
+    """PyTorch Dataset for GhostLM language model training.
+    Loads tokenized text from a JSONL file, concatenates all tokens
+    into a single flat sequence, and yields fixed-length chunks for
+    autoregressive language modeling (x, y shifted by one token).
+    """
+    def __init__(self, jsonl_path: str, tokenizer: GhostTokenizer, config: GhostLMConfig):
+        """Initialize the dataset from a JSONL file.
+        Reads all records, tokenizes the "text" field of each, and
+        concatenates them into one continuous token stream.
+        Args:
+            jsonl_path: Path to the processed JSONL file.
+            tokenizer: GhostTokenizer instance for encoding text.
+            config: GhostLMConfig containing context_length.
+        """
+        self.context_length = config.context_length
+        self.tokens: List[int] = []
+        with open(jsonl_path, "r", encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if not line:
+                    continue
+                record = json.loads(line)
+                text = record.get("text", "")
+                if text:
+                    self.tokens.extend(tokenizer.encode(text))
+        print(f"  Loaded {len(self.tokens):,} tokens from {jsonl_path}")
+    def __len__(self) -> int:
+        """Return the number of non-overlapping context-length chunks.
+        Returns:
+            Integer count of available training samples.
+        """
+        return len(self.tokens) // self.context_length
+    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Retrieve a single (input, target) token chunk.
+        The target sequence is the input sequence shifted left by one
+        token, enabling next-token prediction training.
+        Args:
+            idx: Index of the chunk to retrieve.
+        Returns:
+            Tuple of (x, y) tensors, each of shape (context_length,).
+        """
+        start = idx * self.context_length
+        end = start + self.context_length
+        x = self.tokens[start:end]
+        y = self.tokens[start + 1 : end + 1]
+        # Pad the target with -1 at the end of the data. The loss must be called with
+        # ignore_index=-1 for this to be skipped; F.cross_entropy defaults to -100.
+        if len(y) < len(x):
+            y = y + [-1] * (len(x) - len(y))
+        return (
+            torch.tensor(x, dtype=torch.long),
+            torch.tensor(y, dtype=torch.long),
+        )
+def build_dataloaders(
+    train_path: str,
+    val_path: str,
+    tokenizer: GhostTokenizer,
+    config: GhostLMConfig,
+) -> Tuple[DataLoader, DataLoader]:
+    """Build train and validation DataLoaders from JSONL files.
+    Creates GhostDataset instances for both splits and wraps them
+    in PyTorch DataLoaders with appropriate batching and shuffling.
+    Args:
+        train_path: Path to the training JSONL file.
+        val_path: Path to the validation JSONL file.
+        tokenizer: GhostTokenizer instance for encoding.
+        config: GhostLMConfig with batch_size and context_length.
+    Returns:
+        Tuple of (train_loader, val_loader).
+    """
+    train_dataset = GhostDataset(train_path, tokenizer, config)
+    val_dataset = GhostDataset(val_path, tokenizer, config)
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=config.batch_size,
+        shuffle=True,
+        drop_last=True,
+        num_workers=0,
+        pin_memory=True,
+    )
+    val_loader = DataLoader(
+        val_dataset,
+        batch_size=config.batch_size,
+        shuffle=False,
+        drop_last=False,
+        num_workers=0,
+        pin_memory=True,
+    )
+    return train_loader, val_loader
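
Note on `GhostDataset.__getitem__`: the chunking and the shift-by-one target are easiest to see on a toy token stream. This sketch reproduces the slicing with plain lists; the function name `chunk_pair` is hypothetical, and the tensors of the real class are omitted.

```python
from typing import List, Tuple


def chunk_pair(tokens: List[int], idx: int, context_length: int) -> Tuple[List[int], List[int]]:
    """Return (x, y) for chunk ``idx``; y is x shifted left by one token."""
    start = idx * context_length
    end = start + context_length
    x = tokens[start:end]
    y = tokens[start + 1 : end + 1]
    # The last chunk has no "next token" for its final position, so pad with -1.
    if len(y) < len(x):
        y = y + [-1] * (len(x) - len(y))
    return x, y


tokens = list(range(8))
context_length = 4
num_chunks = len(tokens) // context_length  # 2 non-overlapping chunks

print(chunk_pair(tokens, 0, context_length))  # ([0, 1, 2, 3], [1, 2, 3, 4])
print(chunk_pair(tokens, 1, context_length))  # ([4, 5, 6, 7], [5, 6, 7, -1])
```

Padding only occurs when the token count is an exact multiple of `context_length`; otherwise the leftover tokens supply the final target and the remainder is dropped by `__len__`.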

ghostlm/model.py ADDED

	@@ -0,0 +1,428 @@

+"""GhostLM transformer model — decoder-only architecture built from scratch in PyTorch."""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ghostlm.config import GhostLMConfig
+class RotaryEmbedding(nn.Module):
+    """Rotary Position Embedding (RoPE).
+    Encodes relative position information directly into the attention
+    computation by rotating query and key vectors. Used by LLaMA, Mistral,
+    and most modern transformer architectures.
+    """
+    def __init__(self, head_dim: int, context_length: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # Precompute cos/sin for all positions
+        t = torch.arange(context_length).float()
+        freqs = torch.outer(t, inv_freq)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos(), persistent=False)
+        self.register_buffer("sin_cached", emb.sin(), persistent=False)
+    def forward(self, seq_len: int):
+        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
+def _rotate_half(x):
+    """Split the last dimension into halves (x1, x2) and return (-x2, x1)."""
+    x1, x2 = x.chunk(2, dim=-1)
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin):
+    """Apply rotary position embeddings to query and key tensors.
+    Args:
+        q: Query tensor of shape (B, n_heads, T, head_dim).
+        k: Key tensor of shape (B, n_heads, T, head_dim).
+        cos: Cosine frequencies of shape (T, head_dim).
+        sin: Sine frequencies of shape (T, head_dim).
+    Returns:
+        Tuple of (rotated_q, rotated_k).
+    """
+    cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, T, head_dim)
+    sin = sin.unsqueeze(0).unsqueeze(0)
+    q_embed = (q * cos) + (_rotate_half(q) * sin)
+    k_embed = (k * cos) + (_rotate_half(k) * sin)
+    return q_embed, k_embed
+class CausalSelfAttention(nn.Module):
+    """Multi-head causal self-attention with autoregressive masking.
+    Uses a single combined QKV projection for efficiency, then splits
+    the result into separate query, key, and value tensors. Supports
+    optional RoPE (Rotary Position Embeddings) and Flash Attention.
+    """
+    def __init__(self, config: GhostLMConfig):
+        """Initialize causal self-attention.
+        Args:
+            config: GhostLMConfig containing d_model, n_heads, dropout,
+                    context_length, bias, use_rope, and use_flash_attention.
+        """
+        super().__init__()
+        assert config.d_model % config.n_heads == 0, "d_model must be divisible by n_heads"
+        self.n_heads = config.n_heads
+        self.head_dim = config.d_model // config.n_heads
+        self.context_length = config.context_length
+        self.use_rope = config.use_rope
+        self.use_flash_attention = config.use_flash_attention
+        self.dropout_p = config.dropout
+        # Single combined QKV projection
+        self.c_qkv = nn.Linear(config.d_model, 3 * config.d_model, bias=config.bias)
+        self.proj = nn.Linear(config.d_model, config.d_model, bias=config.bias)
+        # Dropout applied to attention weights (manual path only)
+        self.attn_dropout = nn.Dropout(config.dropout)
+        self.resid_dropout = nn.Dropout(config.dropout)
+        # RoPE
+        if self.use_rope:
+            self.rope = RotaryEmbedding(self.head_dim, config.context_length)
+        # Causal mask buffer (only needed for manual attention path)
+        if not self.use_flash_attention:
+            self.register_buffer(
+                "causal_mask",
+                torch.tril(torch.ones(config.context_length, config.context_length))
+                .view(1, 1, config.context_length, config.context_length),
+                persistent=False,
+            )
+    def forward(self, x):
+        """Forward pass through causal self-attention.
+        Args:
+            x: Input tensor of shape (B, T, d_model).
+        Returns:
+            Output tensor of shape (B, T, d_model).
+        """
+        B, T, C = x.size()
+        # Combined QKV projection and split
+        qkv = self.c_qkv(x)
+        q, k, v = qkv.split(self.n_heads * self.head_dim, dim=-1)
+        # Reshape to (B, n_heads, T, head_dim)
+        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        # Apply RoPE to Q and K (not V)
+        if self.use_rope:
+            cos, sin = self.rope(T)
+            q, k = apply_rotary_pos_emb(q, k, cos, sin)
+        if self.use_flash_attention:
+            # PyTorch 2.0+ scaled_dot_product_attention with automatic backend selection
+            y = F.scaled_dot_product_attention(
+                q, k, v,
+                attn_mask=None,
+                dropout_p=self.dropout_p if self.training else 0.0,
+                is_causal=True,
+            )
+        else:
+            # Manual attention path
+            scale = 1.0 / math.sqrt(self.head_dim)
+            att = (q @ k.transpose(-2, -1)) * scale
+            att = att.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))
+            att = F.softmax(att, dim=-1)
+            att = self.attn_dropout(att)
+            y = att @ v
+        # Reassemble heads and project
+        y = y.transpose(1, 2).contiguous().view(B, T, C)
+        y = self.resid_dropout(self.proj(y))
+        return y
+class FeedForward(nn.Module):
+    """Position-wise feed-forward network with GELU activation.
+    Two linear layers with an intermediate GELU non-linearity:
+    d_model -> d_ff -> d_model, with dropout after the second layer.
+    """
+    def __init__(self, config: GhostLMConfig):
+        """Initialize the feed-forward network.
+        Args:
+            config: GhostLMConfig containing d_model, d_ff, dropout, and bias.
+        """
+        super().__init__()
+        self.fc1 = nn.Linear(config.d_model, config.d_ff, bias=config.bias)
+        self.fc2 = nn.Linear(config.d_ff, config.d_model, bias=config.bias)
+        self.dropout = nn.Dropout(config.dropout)
+    def forward(self, x):
+        """Forward pass through the feed-forward network.
+        Args:
+            x: Input tensor of shape (B, T, d_model).
+        Returns:
+            Output tensor of shape (B, T, d_model).
+        """
+        x = self.fc1(x)
+        x = F.gelu(x)
+        x = self.fc2(x)
+        x = self.dropout(x)
+        return x
+class TransformerBlock(nn.Module):
+    """Single transformer decoder block with pre-normalization.
+    Applies LayerNorm before both the self-attention and feed-forward
+    sub-layers (pre-norm architecture), with residual connections
+    around each sub-layer.
+    """
+    def __init__(self, config: GhostLMConfig):
+        """Initialize the transformer block.
+        Args:
+            config: GhostLMConfig passed to sub-modules.
+        """
+        super().__init__()
+        self.ln_1 = nn.LayerNorm(config.d_model)
+        self.attn = CausalSelfAttention(config)
+        self.ln_2 = nn.LayerNorm(config.d_model)
+        self.ffn = FeedForward(config)
+    def forward(self, x):
+        """Forward pass through the transformer block.
+        Args:
+            x: Input tensor of shape (B, T, d_model).
+        Returns:
+            Output tensor of shape (B, T, d_model).
+        """
+        # Pre-norm + self-attention with residual
+        x = x + self.attn(self.ln_1(x))
+        # Pre-norm + feed-forward with residual
+        x = x + self.ffn(self.ln_2(x))
+        return x
+class GhostLM(nn.Module):
+    """GhostLM decoder-only transformer language model.
+    Built from scratch in PyTorch with either RoPE or learned positional
+    embeddings (selected by config.use_rope), stacked transformer blocks,
+    and weight-tied output projection.
+    """
+    def __init__(self, config: GhostLMConfig):
+        """Initialize the GhostLM model.
+        Args:
+            config: GhostLMConfig with all model hyperparameters.
+        """
+        super().__init__()
+        self.config = config
+        # Embeddings
+        self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)
+        if not config.use_rope:
+            self.pos_embedding = nn.Embedding(config.context_length, config.d_model)
+        self.dropout = nn.Dropout(config.dropout)
+        # Transformer blocks
+        self.blocks = nn.ModuleList(
+            [TransformerBlock(config) for _ in range(config.n_layers)]
+        )
+        # Final layer norm
+        self.ln_f = nn.LayerNorm(config.d_model)
+        # Output head with weight tying (no bias)
+        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
+        self.lm_head.weight = self.token_embedding.weight
+        # Initialize weights
+        self.apply(self._init_weights)
+        # Apply scaled residual initialization for deeper models
+        for pn, p in self.named_parameters():
+            if pn.endswith("proj.weight") or pn.endswith("fc2.weight"):
+                torch.nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layers))
+    def _init_weights(self, module):
+        """Initialize module weights with a normal distribution.
+        Args:
+            module: nn.Module to initialize.
+        """
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    def forward(self, idx, targets=None):
+        """Forward pass of the model.
+        Args:
+            idx: Input token ids of shape (B, T).
+            targets: Optional target token ids of shape (B, T) for loss computation.
+        Returns:
+            Tuple of (logits, loss). Logits have shape (B, T, vocab_size).
+            Loss is None unless targets are provided.
+        Raises:
+            AssertionError: If sequence length exceeds context_length.
+        """
+        B, T = idx.size()
+        assert T <= self.config.context_length, (
+            f"Sequence length {T} exceeds context length {self.config.context_length}"
+        )
+        # Token + positional embeddings
+        tok_emb = self.token_embedding(idx)
+        if self.config.use_rope:
+            x = self.dropout(tok_emb)
+        else:
+            pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
+            pos_emb = self.pos_embedding(pos)
+            x = self.dropout(tok_emb + pos_emb)
+        # Transformer blocks
+        for block in self.blocks:
+            x = block(x)
+        # Final layer norm
+        x = self.ln_f(x)
+        # Output logits
+        logits = self.lm_head(x)
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(
+                logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1
+            )
+        return logits, loss
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
+        """Autoregressively generate new tokens.
+        Args:
+            idx: Input token ids of shape (B, T) serving as the prompt.
+            max_new_tokens: Number of tokens to generate.
+            temperature: Sampling temperature (higher = more random).
+            top_k: If set, only sample from the top-k most likely tokens.
+        Returns:
+            Tensor of shape (B, T + max_new_tokens) with generated tokens.
+        """
+        for _ in range(max_new_tokens):
+            # Crop context if needed
+            idx_cond = idx[:, -self.config.context_length:]
+            # Forward pass
+            logits, _ = self(idx_cond)
+            # Take logits at the last position
+            logits = logits[:, -1, :] / temperature
+            # Optional top-k filtering
+            if top_k is not None:
+                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                logits[logits < v[:, [-1]]] = float("-inf")
+            # Apply softmax and sample
+            probs = F.softmax(logits, dim=-1)
+            idx_next = torch.multinomial(probs, num_samples=1)
+            # Append to sequence
+            idx = torch.cat((idx, idx_next), dim=1)
+        return idx
+    def num_params(self) -> int:
+        """Return the total number of trainable parameters.
+        Returns:
+            Integer count of trainable parameters in the model.
+        """
+        return sum(p.numel() for p in self.parameters() if p.requires_grad)
+    def configure_optimizers(self, config: GhostLMConfig):
+        """Create an AdamW optimizer with weight decay separation.
+        Separates parameters into two groups: those that should receive
+        weight decay (linear weights) and those that should not
+        (biases, LayerNorm weights, embeddings).
+        Args:
+            config: GhostLMConfig containing learning_rate, betas, and weight_decay.
+        Returns:
+            torch.optim.AdamW optimizer with properly configured parameter groups.
+        """
+        decay = set()
+        no_decay = set()
+        whitelist = (nn.Linear,)
+        blacklist = (nn.LayerNorm, nn.Embedding)
+        for mn, m in self.named_modules():
+            for pn, p in m.named_parameters():
+                fpn = f"{mn}.{pn}" if mn else pn
+                if pn.endswith("bias"):
+                    no_decay.add(fpn)
+                elif pn.endswith("weight") and isinstance(m, whitelist):
+                    decay.add(fpn)
+                elif pn.endswith("weight") and isinstance(m, blacklist):
+                    no_decay.add(fpn)
+        # Remove lm_head.weight from decay if present — it is tied to token_embedding.weight
+        decay.discard("lm_head.weight")
+        no_decay.discard("lm_head.weight")
+        # Validate all parameters are accounted for (excluding tied weight)
+        param_dict = {pn: p for pn, p in self.named_parameters()}
+        all_params = decay | no_decay
+        uncategorized = {k for k in param_dict.keys() if k not in all_params and k != "lm_head.weight"}
+        assert len(uncategorized) == 0, f"Parameters {uncategorized} not categorized"
+        optim_groups = [
+            {
+                "params": [param_dict[pn] for pn in sorted(decay)],
+                "weight_decay": config.weight_decay,
+            },
+            {
+                "params": [param_dict[pn] for pn in sorted(no_decay)],
+                "weight_decay": 0.0,
+            },
+        ]
+        optimizer = torch.optim.AdamW(
+            optim_groups,
+            lr=config.learning_rate,
+            betas=(config.beta1, config.beta2),
+        )
+        return optimizer
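
The temperature scaling and top-k filtering inside `generate` can be sketched without PyTorch. This is a minimal illustration under stated assumptions, not the model's own code: `sample_top_k` is a hypothetical helper that works on a plain list of logits, and it uses `random.choices` where the model uses `torch.multinomial`.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from logits after temperature scaling and top-k filtering.

    Hypothetical helper for illustration only. Returns (index, probabilities).
    """
    scaled = [value / temperature for value in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        # The k-th largest logit is the cutoff; anything below it is masked out.
        cutoff = sorted(scaled, reverse=True)[k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # Numerically stable softmax; masked entries become exactly 0.0.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


index, probs = sample_top_k([2.0, 1.0, 0.1, -1.0], top_k=2, rng=random.Random(0))
print(index, [round(p, 3) for p in probs])
```

With `top_k=2`, only the two largest logits keep non-zero probability, so the sampled index is always 0 or 1. Lower temperatures sharpen the distribution further toward index 0.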

ghostlm/tokenizer.py ADDED Viewed

	@@ -0,0 +1,321 @@

+"""GhostLM tokenizer — wraps tiktoken's GPT-2 BPE tokenizer with cybersecurity-aware utilities."""
+import json
+import os
+from pathlib import Path
+from typing import List, Optional, Union
+import tiktoken
+import torch
+class GhostTokenizer:
+    """Wrapper around tiktoken GPT-2 BPE tokenizer with GhostLM utilities.
+    Provides encoding, decoding, batching, padding, and text chunking
+    utilities tailored for cybersecurity document processing.
+    """
+    # Special token strings
+    BOS = "<|ghost_bos|>"
+    EOS = "<|ghost_eos|>"
+    PAD = "<|ghost_pad|>"
+    UNK = "<|ghost_unk|>"
+    # Chat role markers (added in v0.5 chat-tuning) — IDs appended after the
+    # original four so pre-chat checkpoints can be expanded by 3 rows rather
+    # than reshuffled.
+    USER = "<|ghost_user|>"
+    ASSISTANT = "<|ghost_assistant|>"
+    END = "<|ghost_end|>"
+    def __init__(self):
+        """Initialize the GhostTokenizer with the GPT-2 BPE encoding.
+        Loads the tiktoken gpt2 encoding and assigns special token IDs
+        beyond the standard vocabulary for begin-of-sequence, end-of-sequence,
+        padding, unknown, and chat role markers.
+        """
+        self._encoder = tiktoken.get_encoding("gpt2")
+        self._vocab_size = self._encoder.n_vocab
+        # Assign special token IDs beyond the base vocabulary
+        self._special_tokens = {
+            self.BOS: self._vocab_size,
+            self.EOS: self._vocab_size + 1,
+            self.PAD: self._vocab_size + 2,
+            self.UNK: self._vocab_size + 3,
+            self.USER: self._vocab_size + 4,
+            self.ASSISTANT: self._vocab_size + 5,
+            self.END: self._vocab_size + 6,
+        }
+        # Reverse mapping for quick lookup
+        self._id_to_special = {v: k for k, v in self._special_tokens.items()}
+    @property
+    def vocab_size(self) -> int:
+        """Return the effective vocabulary size including special tokens.
+        Returns:
+            Total vocabulary size (base vocab + 7 special tokens).
+        """
+        return self._vocab_size + len(self._special_tokens)
+    def _special_token_ids(self) -> set:
+        """Return a set of all special token IDs.
+        Returns:
+            Set of integer token IDs reserved for special tokens.
+        """
+        return set(self._special_tokens.values())
+    def encode(self, text: str, add_bos: bool = False, add_eos: bool = False) -> List[int]:
+        """Encode a text string into a list of token IDs.
+        Args:
+            text: Input text to encode.
+            add_bos: If True, prepend the BOS token ID.
+            add_eos: If True, append the EOS token ID.
+        Returns:
+            List of integer token IDs.
+        """
+        ids = self._encoder.encode(text, allowed_special="all")
+        if add_bos:
+            ids = [self._special_tokens[self.BOS]] + ids
+        if add_eos:
+            ids = ids + [self._special_tokens[self.EOS]]
+        return ids
+    def decode(self, ids: List[int], skip_special: bool = True) -> str:
+        """Decode a list of token IDs back into a text string.
+        Args:
+            ids: List of integer token IDs to decode.
+            skip_special: If True, filter out special token IDs before decoding.
+        Returns:
+            Decoded text string.
+        """
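+        special_ids = self._special_token_ids()
+        if skip_special:
+            return self._encoder.decode([i for i in ids if i not in special_ids])
+        # tiktoken does not know the GhostLM special IDs, so render them from
+        # the reverse mapping and decode the ordinary runs between them.
+        pieces, run = [], []
+        for i in ids:
+            if i in special_ids:
+                pieces += [self._encoder.decode(run), self._id_to_special[i]]
+                run = []
+            else:
+                run.append(i)
+        return "".join(pieces) + self._encoder.decode(run)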
+    def encode_chat(self, turns: List[dict]) -> tuple:
+        """Encode a multi-turn chat conversation with role markers and a loss mask.
+        Format: <|ghost_user|>{content}<|ghost_end|><|ghost_assistant|>{content}<|ghost_end|>...
+        The loss mask is 1 on assistant content tokens and the assistant's trailing
+        <|ghost_end|> (so the model learns to stop), and 0 everywhere else (user
+        prompts and role markers themselves).
+        Args:
+            turns: List of {"role": "user"|"assistant", "content": str} dicts,
+                strictly alternating starting with "user".
+        Returns:
+            Tuple (token_ids, loss_mask) — same length, both lists of int.
+        """
+        user_id = self._special_tokens[self.USER]
+        assistant_id = self._special_tokens[self.ASSISTANT]
+        end_id = self._special_tokens[self.END]
+        ids: List[int] = []
+        mask: List[int] = []
+        for turn in turns:
+            role = turn["role"]
+            content_ids = self._encoder.encode(turn["content"], allowed_special="all")
+            if role == "user":
+                ids.append(user_id)
+                mask.append(0)
+                ids.extend(content_ids)
+                mask.extend([0] * len(content_ids))
+                ids.append(end_id)
+                mask.append(0)
+            elif role == "assistant":
+                ids.append(assistant_id)
+                mask.append(0)
+                ids.extend(content_ids)
+                mask.extend([1] * len(content_ids))
+                ids.append(end_id)
+                mask.append(1)
+            else:
+                raise ValueError(f"Unknown role: {role!r}")
+        return ids, mask
+    def format_chat_prompt(self, turns: List[dict]) -> List[int]:
+        """Encode a chat history and append <|ghost_assistant|> ready for generation.
+        Used at inference: feed the resulting token ids to the model; it should
+        generate the assistant's reply followed by <|ghost_end|>.
+        Args:
+            turns: List of {"role": "user"|"assistant", "content": str}, ending
+                with a "user" turn (the prompt awaiting a reply).
+        Returns:
+            List of token IDs ending in the assistant role marker.
+        """
+        ids, _ = self.encode_chat(turns)
+        ids.append(self._special_tokens[self.ASSISTANT])
+        return ids
+    def encode_batch(self, texts: List[str], add_bos: bool = False, add_eos: bool = False) -> List[List[int]]:
+        """Encode a list of text strings into lists of token IDs.
+        Args:
+            texts: List of input text strings to encode.
+            add_bos: If True, prepend BOS token ID to each sequence.
+            add_eos: If True, append EOS token ID to each sequence.
+        Returns:
+            List of lists of integer token IDs, one per input text.
+        """
+        return [self.encode(text, add_bos=add_bos, add_eos=add_eos) for text in texts]
+    def to_tensor(self, ids: List[int], device: str = "cpu") -> torch.Tensor:
+        """Convert a list of token IDs to a PyTorch tensor.
+        Args:
+            ids: List of integer token IDs.
+            device: Target device for the tensor (default: "cpu").
+        Returns:
+            torch.LongTensor of shape (1, len(ids)).
+        """
+        return torch.tensor(ids, dtype=torch.long, device=device).unsqueeze(0)
+    def pad_batch(self, batch: List[List[int]], pad_left: bool = False) -> tuple:
+        """Pad a batch of token ID lists to the same length.
+        Pads all sequences in the batch to the length of the longest sequence
+        using the PAD token ID. Returns both the padded tensor and an attention
+        mask indicating real tokens (1) vs padding (0).
+        Args:
+            batch: List of token ID lists, each potentially different length.
+            pad_left: If True, pad on the left side (useful for generation).
+                      If False, pad on the right side (default).
+        Returns:
+            Tuple of (padded_tensor, attention_mask) where:
+                - padded_tensor: torch.LongTensor of shape (batch_size, max_len)
+                - attention_mask: torch.LongTensor of shape (batch_size, max_len)
+        """
+        max_len = max(len(seq) for seq in batch)
+        pad_id = self._special_tokens[self.PAD]
+        padded = []
+        masks = []
+        for seq in batch:
+            pad_count = max_len - len(seq)
+            if pad_left:
+                padded_seq = [pad_id] * pad_count + seq
+                mask = [0] * pad_count + [1] * len(seq)
+            else:
+                padded_seq = seq + [pad_id] * pad_count
+                mask = [1] * len(seq) + [0] * pad_count
+            padded.append(padded_seq)
+            masks.append(mask)
+        padded_tensor = torch.tensor(padded, dtype=torch.long)
+        mask_tensor = torch.tensor(masks, dtype=torch.long)
+        return padded_tensor, mask_tensor
+    def chunk_text(self, text: str, chunk_size: int = 1024, overlap: int = 64) -> List[List[int]]:
+        """Encode text and split into overlapping token chunks.
+        Useful for processing long cybersecurity documents that exceed
+        the model's context length. Overlapping chunks preserve context
+        continuity across boundaries.
+        Args:
+            text: Input text string to chunk.
+            chunk_size: Maximum number of tokens per chunk.
+            overlap: Number of overlapping tokens between consecutive chunks.
+        Returns:
+            List of token ID lists, each of length at most chunk_size.
+        """
+        ids = self.encode(text)
+        if len(ids) <= chunk_size:
+            return [ids]
+        chunks = []
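+        if not 0 <= overlap < chunk_size:
+            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
+        stride = chunk_size - overlap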
+        for i in range(0, len(ids), stride):
+            chunk = ids[i : i + chunk_size]
+            chunks.append(chunk)
+            if i + chunk_size >= len(ids):
+                break
+        return chunks
+    def save(self, path: str) -> None:
+        """Save tokenizer metadata to a JSON file.
+        Stores vocabulary size, special token strings, and their assigned
+        IDs so the tokenizer can be reconstructed later.
+        Args:
+            path: File path to save the JSON metadata.
+        """
+        metadata = {
+            "vocab_size": self._vocab_size,
+            "special_tokens": self._special_tokens,
+        }
+        Path(path).parent.mkdir(parents=True, exist_ok=True)
+        with open(path, "w") as f:
+            json.dump(metadata, f, indent=2)
+    @classmethod
+    def load(cls, path: str) -> "GhostTokenizer":
+        """Load a GhostTokenizer from saved metadata JSON.
+        Reconstructs the tokenizer by reading special token assignments
+        from the saved metadata file.
+        Args:
+            path: File path to the saved JSON metadata.
+        Returns:
+            GhostTokenizer instance loaded with the saved configuration.
+        """
+        with open(path, "r") as f:
+            metadata = json.load(f)
+        tokenizer = cls()
+        # Restore special token mappings
+        tokenizer._special_tokens = {k: int(v) for k, v in metadata["special_tokens"].items()}
+        tokenizer._id_to_special = {v: k for k, v in tokenizer._special_tokens.items()}
+        return tokenizer
+    def __len__(self) -> int:
+        """Return the effective vocabulary size.
+        Returns:
+            Integer count of tokens including special tokens.
+        """
+        return self.vocab_size
+    def __repr__(self) -> str:
+        """Return a concise string representation of the tokenizer.
+        Returns:
+            String like: GhostTokenizer(vocab_size=50264, special_tokens=7)
+        """
+        return f"GhostTokenizer(vocab_size={self.vocab_size}, special_tokens={len(self._special_tokens)})"
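
The loss-mask layout produced by `encode_chat` is easier to see on a tiny example. The sketch below is an assumption-laden stand-in, not the tokenizer itself: `toy_encode` is a hypothetical one-ID-per-character encoder replacing tiktoken, and the marker IDs `USER`, `ASSISTANT`, and `END` are made-up values (the real ones sit just past the GPT-2 vocabulary).

```python
from typing import Dict, List, Tuple

# Made-up marker IDs for illustration; the real IDs are n_vocab + 4, + 5, + 6.
USER, ASSISTANT, END = 1000, 1001, 1002


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in for the BPE encoder: one ID per character."""
    return [ord(ch) for ch in text]


def encode_chat(turns: List[Dict[str, str]]) -> Tuple[List[int], List[int]]:
    """Build token IDs plus a loss mask that is 1 only on assistant output."""
    ids: List[int] = []
    mask: List[int] = []
    for turn in turns:
        content = toy_encode(turn["content"])
        if turn["role"] == "user":
            marker, learn = USER, 0
        elif turn["role"] == "assistant":
            marker, learn = ASSISTANT, 1
        else:
            raise ValueError(f"Unknown role: {turn['role']!r}")
        # Role marker is never trained on; content and the closing END
        # marker are trained on only for assistant turns.
        ids += [marker] + content + [END]
        mask += [0] + [learn] * len(content) + [learn]
    return ids, mask


ids, mask = encode_chat([
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "ok"},
])
print(ids)   # [1000, 104, 105, 1002, 1001, 111, 107, 1002]
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The trailing 1 on the assistant's end marker is what lets the model learn when to stop. The user turn and both role markers contribute nothing to the loss.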

ghostlm/trainer.py ADDED Viewed

	@@ -0,0 +1,317 @@

+"""GhostLM trainer — handles the full training loop, evaluation, checkpointing, and logging."""
+import json
+import math
+import os
+import time
+from dataclasses import asdict
+from pathlib import Path
+from typing import Dict, Optional, Tuple
+import torch
+from tqdm import tqdm
+from ghostlm.config import GhostLMConfig
+from ghostlm.model import GhostLM
+class GhostTrainer:
+    """Manages the GhostLM training loop with evaluation, checkpointing, and logging.
+    Handles device placement, optimizer setup, cosine learning rate scheduling
+    with warmup, gradient clipping, periodic evaluation, checkpoint saving,
+    and JSON-based training log persistence. Supports mixed precision (AMP)
+    training on CUDA devices for faster throughput and lower memory usage.
+    """
+    def __init__(self, model: GhostLM, config: GhostLMConfig, use_amp: Optional[bool] = None):
+        """Initialize the trainer.
+        Args:
+            model: GhostLM model instance to train.
+            config: GhostLMConfig with training hyperparameters and paths.
+            use_amp: Enable mixed precision (AMP) training. Defaults to True
+                when running on CUDA, False otherwise. AMP is only supported
+                on CUDA devices — setting True on CPU/MPS will be ignored.
+        """
+        self.model = model
+        self.config = config
+        # Resolve device
+        if config.device == "auto":
+            if torch.cuda.is_available():
+                self.device = "cuda"
+            elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+                self.device = "mps"
+            else:
+                self.device = "cpu"
+        else:
+            self.device = config.device
+        self.model = self.model.to(self.device)
+        # Mixed precision (AMP) — only effective on CUDA
+        if use_amp is None:
+            self.use_amp = self.device == "cuda"
+        else:
+            self.use_amp = use_amp and self.device == "cuda"
+        self.grad_scaler = torch.amp.GradScaler("cuda", enabled=self.use_amp)
+        # Optimizer
+        self.optimizer = self.model.configure_optimizers(config)
+        # Create directories
+        self.checkpoint_dir = Path(config.checkpoint_dir)
+        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
+        self.log_dir = Path(config.log_dir)
+        self.log_dir.mkdir(parents=True, exist_ok=True)
+        # State
+        self.step = 0
+        self.accum_steps = getattr(config, 'grad_accum_steps', 4)
+        self.best_val_loss = float("inf")
+        self.log: list = []
+    def get_lr(self) -> float:
+        """Compute the current learning rate using cosine decay with linear warmup.
+        During the warmup phase (step < warmup_steps), the learning rate ramps
+        linearly from learning_rate / warmup_steps up to config.learning_rate.
+        After warmup, it follows a cosine decay schedule down to a minimum of 1e-5.
+        Returns:
+            Current learning rate as a float.
+        """
+        step = self.step
+        warmup = self.config.warmup_steps
+        max_steps = self.config.max_steps
+        base_lr = self.config.learning_rate
+        min_lr = 1e-5
+        if step < warmup:
+            return base_lr * (step + 1) / warmup
+        decay_ratio = (step - warmup) / max(1, max_steps - warmup)
+        decay_ratio = min(decay_ratio, 1.0)
+        cosine_decay = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
+        return min_lr + (base_lr - min_lr) * cosine_decay
+    def _set_lr(self) -> None:
+        """Apply the current learning rate from get_lr() to all optimizer parameter groups."""
+        lr = self.get_lr()
+        for group in self.optimizer.param_groups:
+            group["lr"] = lr
+    def train_step(self, batch: Tuple[torch.Tensor, torch.Tensor]) -> float:
+        """Execute a single training step with gradient accumulation and optional AMP.
+        The incoming batch is split into micro-batches of about
+        batch_size // self.accum_steps samples, and their gradients are
+        accumulated before a single weight update. This lowers peak memory
+        for a given batch size rather than enlarging the batch. When AMP is
+        enabled, the forward pass runs under CUDA autocast (float16 by default)
+        and the GradScaler handles loss scaling for stable training.
+        Args:
+            batch: Tuple of (input_ids, target_ids) tensors.
+        Returns:
+            Training loss as a float.
+        """
+        x, y = batch
+        x = x.to(self.device)
+        y = y.to(self.device)
+        self.model.train()
+        # Split batch into micro-batches for gradient accumulation
+        micro_x = x.split(max(1, x.size(0) // self.accum_steps), dim=0)
+        micro_y = y.split(max(1, y.size(0) // self.accum_steps), dim=0)
+        total_loss = 0.0
+        for mx, my in zip(micro_x, micro_y):
+            with torch.amp.autocast("cuda", enabled=self.use_amp):
+                _, loss = self.model(mx, targets=my)
+                # Scale loss by number of accumulation steps
+                scaled_loss = loss / len(micro_x)
+            self.grad_scaler.scale(scaled_loss).backward()
+            total_loss += loss.item()
+        # Set the learning rate for this update before stepping, so the very
+        # first update uses the warmup value instead of the full base LR.
+        self._set_lr()
+        # Gradient clipping and optimizer step after accumulation
+        self.grad_scaler.unscale_(self.optimizer)
+        torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_clip)
+        self.grad_scaler.step(self.optimizer)
+        self.grad_scaler.update()
+        self.optimizer.zero_grad(set_to_none=True)
+        self.step += 1
+        return total_loss / len(micro_x)
+    def eval_step(self, val_loader, num_batches: int = 20) -> float:
+        """Run evaluation over a number of validation batches.
+        Args:
+            val_loader: DataLoader yielding (input_ids, target_ids) batches.
+            num_batches: Maximum number of batches to evaluate over.
+        Returns:
+            Average validation loss as a float.
+        """
+        self.model.eval()
+        total_loss = 0.0
+        count = 0
+        with torch.no_grad():
+            for i, batch in enumerate(val_loader):
+                if i >= num_batches:
+                    break
+                x, y = batch
+                x = x.to(self.device)
+                y = y.to(self.device)
+                with torch.amp.autocast("cuda", enabled=self.use_amp):
+                    _, loss = self.model(x, targets=y)
+                total_loss += loss.item()
+                count += 1
+        return total_loss / max(count, 1)
+    def save_checkpoint(self, val_loss: float) -> None:
+        """Save a model checkpoint to disk.
+        Saves the current step, validation loss, model state dict, optimizer
+        state dict, GradScaler state dict, and config. Also saves as
+        "best_model.pt" if the current validation loss is the best seen so far.
+        Args:
+            val_loss: Current validation loss for comparison.
+        """
+        checkpoint = {
+            "step": self.step,
+            "val_loss": val_loss,
+            "model_state_dict": self.model.state_dict(),
+            "optimizer_state_dict": self.optimizer.state_dict(),
+            "grad_scaler_state_dict": self.grad_scaler.state_dict(),
+            "config": asdict(self.config),
+        }
+        filename = f"checkpoint_step_{self.step}.pt"
+        path = self.checkpoint_dir / filename
+        torch.save(checkpoint, path)
+        print(f"  Saved checkpoint: {path}")
+        if val_loss < self.best_val_loss:
+            self.best_val_loss = val_loss
+            best_path = self.checkpoint_dir / "best_model.pt"
+            torch.save(checkpoint, best_path)
+            print(f"  New best model saved: {best_path} (val_loss={val_loss:.4f})")
+    def load_checkpoint(self, path: str) -> None:
+        """Load a model checkpoint from disk.
+        Restores the model state dict, optimizer state dict, GradScaler state
+        (when present), and training step. The checkpoint's own validation loss
+        becomes the best-so-far baseline for subsequent saves.
+        Args:
+            path: File path to the checkpoint .pt file.
+        """
+        checkpoint = torch.load(path, map_location=self.device, weights_only=False)
+        self.model.load_state_dict(checkpoint["model_state_dict"])
+        self.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
+        if "grad_scaler_state_dict" in checkpoint:
+            self.grad_scaler.load_state_dict(checkpoint["grad_scaler_state_dict"])
+        self.step = checkpoint["step"]
+        self.best_val_loss = checkpoint["val_loss"]
+        print(f"Loaded checkpoint from step {self.step} (val_loss={self.best_val_loss:.4f})")
+    def _log(self, data: dict) -> None:
+        """Append a data dict to the training log and persist as JSON.
+        Args:
+            data: Dictionary of metrics and metadata to log.
+        """
+        self.log.append(data)
+        log_path = self.log_dir / "training_log.json"
+        with open(log_path, "w") as f:
+            json.dump(self.log, f, indent=2)
+    def train(self, train_loader, val_loader) -> None:
+        """Run the main training loop.
+        Iterates from the current step to config.max_steps, performing training
+        steps with a tqdm progress bar. Evaluates periodically at config.eval_interval
+        and saves checkpoints at config.save_interval. Performs a final evaluation
+        and saves the final checkpoint at the end of training.
+        Args:
+            train_loader: DataLoader yielding (input_ids, target_ids) training batches.
+            val_loader: DataLoader yielding (input_ids, target_ids) validation batches.
+        """
+        print(f"Training on device: {self.device}")
+        print(f"Mixed precision (AMP): {'enabled' if self.use_amp else 'disabled'}")
+        print(f"Model size: {self.model.num_params():,} parameters")
+        print(f"Training from step {self.step} to {self.config.max_steps}")
+        # Create iterator that cycles through train_loader
+        def cycle(loader):
+            while True:
+                for batch in loader:
+                    yield batch
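+        train_iter = cycle(train_loader)
+        # Defaults so the final log entry is well-defined even when a resumed
+        # run is already at max_steps and the loop body never executes.
+        loss, lr, dt = float("nan"), self.get_lr(), 0.0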
+        with tqdm(initial=self.step, total=self.config.max_steps, desc="Training") as pbar:
+            while self.step < self.config.max_steps:
+                t0 = time.time()
+                # Training step
+                batch = next(train_iter)
+                loss = self.train_step(batch)
+                dt = time.time() - t0
+                lr = self.get_lr()
+                pbar.set_postfix(loss=f"{loss:.4f}", lr=f"{lr:.2e}", dt=f"{dt:.3f}s")
+                pbar.update(1)
+                # Periodic evaluation
+                if self.step % self.config.eval_interval == 0:
+                    val_loss = self.eval_step(val_loader)
+                    print(f"\n  Step {self.step} | val_loss={val_loss:.4f} | train_loss={loss:.4f}")
+                    self._log({
+                        "step": self.step,
+                        "train_loss": loss,
+                        "val_loss": val_loss,
+                        "lr": lr,
+                        "time": dt,
+                    })
+                # Periodic checkpoint
+                if self.step % self.config.save_interval == 0:
+                    val_loss = self.eval_step(val_loader)
+                    self.save_checkpoint(val_loss)
+        # Final evaluation and checkpoint
+        print("\nTraining complete. Running final evaluation...")
+        val_loss = self.eval_step(val_loader)
+        print(f"Final val_loss: {val_loss:.4f}")
+        self.save_checkpoint(val_loss)
+        self._log({
+            "step": self.step,
+            "train_loss": loss,
+            "val_loss": val_loss,
+            "lr": lr,
+            "time": dt,
+            "status": "complete",
+        })
+        print(f"Training log saved to {self.log_dir / 'training_log.json'}")
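
The warmup-plus-cosine schedule in `get_lr` can be checked in isolation. The following is a minimal standalone sketch: `lr_at` is a hypothetical free function, and the default values for `base_lr`, `warmup`, and `max_steps` are illustrative assumptions rather than values from the project's config (only the `1e-5` floor comes from the trainer code).

```python
import math


def lr_at(step, base_lr=3e-4, warmup=100, max_steps=1000, min_lr=1e-5):
    """Linear warmup followed by cosine decay to min_lr.

    Hypothetical helper; default hyperparameters are illustrative only.
    """
    if step < warmup:
        # Step 0 already gets a non-zero rate: base_lr / warmup.
        return base_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    ratio = min((step - warmup) / max(1, max_steps - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 50, 99, 100, 550, 1000, 5000):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at `base_lr` on the last warmup step, sits at the midpoint between `base_lr` and `min_lr` halfway through the decay phase, and stays pinned at `min_lr` for any step at or past `max_steps`.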

requirements.txt ADDED Viewed

	@@ -0,0 +1,22 @@

+# Hugging Face Spaces installs from this file at build time.
+# Pinned conservatively so a Space build doesn't regress on a future
+# breaking change in any of the deps.
+# Note: gradio is intentionally NOT listed here. HF Spaces auto-installs
+# `gradio[oauth,mcp]==<sdk_version>` on top of this file based on the SDK
+# selection in README.md frontmatter. Listing it here causes a pip
+# version-conflict at build time when our pin disagrees with HF's.
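+# torch >= 2.0 for the scaled_dot_product_attention path. CPU-only is
+# fine on free Spaces. Note that ghostlm/trainer.py calls
+# torch.amp.GradScaler, which older 2.x releases may not provide
+# (it appears to need torch 2.3 or later), so training needs a newer torch.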
+torch>=2.0.0
+# tiktoken is the GPT-2 BPE backend the GhostTokenizer wraps.
+tiktoken>=0.5.0
+# Python 3.13 removed the stdlib audioop module that gradio's transitive
+# pydub dep imports at module-load time. Without this the entire gradio
+# import chain fails with ModuleNotFoundError: No module named
+# 'pyaudioop'. The PEP 594 replacement is audioop-lts. Conditional so
+# 3.12 and earlier (where stdlib audioop still exists) skip it.
+audioop-lts; python_version >= '3.13'