<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>I Captured The Ghosts In The Machine (And Named It Prism) | TinyMemoryLM</title>
<link rel="stylesheet" href="bluesheet.css">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono&display=swap" rel="stylesheet">
<style>
:root {
--blue-900: #000000;
--blue-800: #0a0a0a;
--blue-700: #111111;
--blue-600: #1a1a1a;
--blue-500: #333333;
--blue-400: #555555;
--blue-300: #777777;
--blue-200: #888888;
--blue-100: #aaaaaa;
--white: #ffffff;
--white-soft: #f5f5f5;
--white-muted: #e0e0e0;
--grid-line: rgba(255, 255, 255, 0.03);
--grid-line-major: rgba(255, 255, 255, 0.06);
--accent: #ededed;
--accent-muted: #888888;
--font-sans: 'Geist', -apple-system, BlinkMacSystemFont, sans-serif;
--font-mono: 'Geist Mono', 'SF Mono', 'Fira Code', monospace;
--container-max: 1100px;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
html { font-size: 16px; scroll-behavior: smooth; }
body { font-family: var(--font-sans); background: var(--blue-900); color: var(--white-muted); line-height: 1.7; -webkit-font-smoothing: antialiased; }
a { color: var(--white); text-decoration: none; transition: color 0.15s ease; }
a:hover { color: var(--accent); }
.container { max-width: var(--container-max); margin: 0 auto; padding: 0 24px; }
nav { position: fixed; top: 0; left: 0; right: 0; z-index: 100; background: rgba(0, 0, 0, 0.85); backdrop-filter: blur(12px); border-bottom: 1px solid var(--blue-600); padding: 16px 0; }
nav .container { display: flex; justify-content: space-between; align-items: center; }
.nav-brand { font-size: 18px; font-weight: 600; color: var(--white); display: flex; align-items: center; gap: 8px; }
.nav-brand span { color: var(--accent); }
.nav-links { display: flex; gap: 32px; }
.nav-links a { font-size: 14px; font-weight: 500; color: var(--blue-200); }
.nav-links a:hover { color: var(--white); }
.post { padding: 140px 0 80px; }
.post-back { display: inline-block; color: var(--blue-200); font-size: 14px; margin-bottom: 32px; }
.post-back:hover { color: var(--accent); }
.post-back::before { content: '← '; }
.post-meta { display: flex; gap: 12px; margin-bottom: 20px; }
.post-date { font-size: 13px; color: var(--blue-200); font-family: var(--font-mono); }
.post-tag { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: var(--white); background: rgba(255, 255, 255, 0.08); padding: 4px 10px; border-radius: 4px; }
.post h1 { font-size: 36px; font-weight: 700; color: var(--white); margin-bottom: 32px; line-height: 1.2; letter-spacing: -0.02em; }
.post-body p { font-size: 17px; line-height: 1.8; margin-bottom: 24px; color: var(--blue-200); }
.post-body p:first-of-type { font-size: 20px; color: var(--white-muted); }
.post-body h2 { font-size: 24px; font-weight: 600; color: var(--white); margin: 48px 0 20px; }
.post-body blockquote { border-left: 3px solid var(--accent); padding: 20px 24px; margin: 32px 0; background: var(--blue-800); border-radius: 0 8px 8px 0; }
.post-body blockquote p { font-size: 16px; font-style: italic; color: var(--blue-200); margin: 0; }
.post-body hr { border: none; height: 1px; background: var(--blue-600); margin: 48px 0; }
.code-block { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 12px; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word; }
.code-block .comment { color: var(--blue-200); font-style: italic; display: block; margin-top: 4px; }
.stats-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin: 24px 0; }
.stat-card { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; text-align: center; }
.stat-card .number { font-size: 28px; font-weight: 700; color: var(--accent); font-family: var(--font-mono); }
.stat-card .label { font-size: 13px; color: var(--blue-200); margin-top: 8px; }
.dataset-table { width: 100%; border-collapse: collapse; margin: 24px 0; font-family: var(--font-mono); font-size: 13px; }
.dataset-table th { text-align: left; padding: 12px; border-bottom: 2px solid var(--accent); color: var(--white); font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em; }
.dataset-table td { padding: 12px; border-bottom: 1px solid var(--blue-600); color: var(--blue-200); }
.dataset-table tr:hover { background: var(--blue-800); }
.dataset-table .highlight { color: var(--accent); font-weight: 600; }
.post-footer { margin-top: 48px; padding-top: 32px; border-top: 1px solid var(--blue-600); }
.post-footer p { font-size: 14px; color: var(--blue-200); font-style: italic; margin: 0; }
footer { padding: 40px 0; background: var(--blue-800); border-top: 1px solid var(--blue-600); text-align: center; }
footer p { color: var(--blue-200); font-size: 14px; margin-bottom: 8px; }
footer a { color: var(--blue-200); }
footer a:hover { color: var(--accent); }
@media (max-width: 768px) { .post h1 { font-size: 28px; } .nav-links { display: none; } .stats-grid { grid-template-columns: 1fr; } }
</style>
</head>
<body>
<svg class="scribbles" viewBox="0 0 1440 900" preserveAspectRatio="xMidYMid slice">
<path d="M100,50 Q150,30 200,60 T300,40 T400,70" fill="none" stroke="white" stroke-width="1"/>
<path d="M800,200 Q850,180 900,210 T1000,190 T1100,220" fill="none" stroke="white" stroke-width="0.8"/>
<path d="M200,700 Q250,680 300,710 T400,690 T500,720" fill="none" stroke="white" stroke-width="0.6"/>
<path d="M1200,400 Q1250,380 1300,410 T1400,390" fill="none" stroke="white" stroke-width="0.7"/>
<path d="M50,400 Q100,380 150,420 T250,400" fill="none" stroke="white" stroke-width="0.5"/>
<circle cx="350" cy="150" r="30" fill="none" stroke="white" stroke-width="0.6"/>
<circle cx="1100" cy="600" r="25" fill="none" stroke="white" stroke-width="0.5"/>
<path d="M600,100 L620,80 L640,100 L660,80" fill="none" stroke="white" stroke-width="0.7"/>
<path d="M1300,750 Q1320,730 1340,760 T1380,740" fill="none" stroke="white" stroke-width="0.5"/>
<path d="M100,800 Q120,780 140,810 T180,790 T220,820" fill="none" stroke="white" stroke-width="0.6"/>
<path d="M700,500 Q720,480 740,510 T780,490 T820,520" fill="none" stroke="white" stroke-width="0.4"/>
<path d="M400,300 C420,280 440,320 460,300 C480,280 500,320 520,300" fill="none" stroke="white" stroke-width="0.5"/>
<path d="M900,700 C920,680 940,720 960,700 C980,680 1000,720 1020,700" fill="none" stroke="white" stroke-width="0.6"/>
<path d="M150,250 Q170,230 190,260 Q210,240 230,270" fill="none" stroke="white" stroke-width="0.4"/>
<path d="M1050,100 Q1070,80 1090,110 Q1110,90 1130,120" fill="none" stroke="white" stroke-width="0.5"/>
<path d="M500,850 C520,830 540,860 560,840 C580,820 600,860 620,840" fill="none" stroke="white" stroke-width="0.4"/>
<path d="M1350,50 Q1370,30 1390,60 T1430,40" fill="none" stroke="white" stroke-width="0.5"/>
<path d="M30,600 Q50,580 70,610 T110,590" fill="none" stroke="white" stroke-width="0.4"/>
</svg>
<nav>
<div class="container">
<a href="index.html" class="nav-brand"><span>/</span>TinyMemoryLM</a>
<div class="nav-links">
<a href="index.html">Home</a>
<a href="blog.html">Blog</a>
<a href="status.html">Status</a>
</div>
</div>
</nav>
<main>
<article class="post">
<div class="container">
<a href="blog.html" class="post-back">Back to Blog</a>
<header>
<div class="post-meta">
<span class="post-date">2026-03-30</span>
<span class="post-tag">Datasets</span>
</div>
<h1>I Captured The Ghosts In The Machine (And Named It Prism)</h1>
</header>
<div class="post-body">
<p>Most distillation datasets are flat. They show you what the AI said. They do not show you what the AI thought about saying. They show you the destination. They hide the journey. I decided to capture the journey. I decided to name it Prism.</p>
<p>CompactAI-Prism is now a thing. Actually, two things. Maybe a third thing soon. I have been busy while Haiku-2 outputs pipe characters at me.</p>
<blockquote>
<p>A prism takes a single beam of light and refracts it to reveal the full spectrum. This dataset takes a single AI response and reveals the spectrum of probability that existed at every step.</p>
</blockquote>
<h2>The Datasets</h2>
<p>I have two datasets out now. A third is in the planning phase. My hard drive hates me. My GPU loves me. I am somewhere in the middle questioning my life choices.</p>
<table class="dataset-table">
<thead>
<tr>
<th>Dataset</th>
<th>Size</th>
<th>Teacher Model</th>
<th>Top-K</th>
</tr>
</thead>
<tbody>
<tr>
<td class="highlight">cAI-Prism-K50</td>
<td>247 MB</td>
<td>Qwen3.5-0.8B</td>
<td>50</td>
</tr>
<tr>
<td class="highlight">cAI-Prism-B.5-K50</td>
<td>2.2 GB</td>
<td>Qwen3.5-2B</td>
<td>50</td>
</tr>
</tbody>
</table>
<p>The third one exists only in my dreams and a partially written Python script. It will be large. It will be expensive. It will be worth it. Probably.</p>
<h2>What Does B Stand For</h2>
<p>Good question. I needed something for the 2B model variant. K50 means Top-50 tokens. That is clear. B needed meaning. I considered options:</p>
<div class="code-block">
<span class="comment"># Naming brainstorm session</span>
Base? Too generic.
Balanced? Too corporate.
Big? Too honest.
Brave? Too dramatic.
B.5? Just right.
</div>
<p>B.5 it is. It sounds technical. It sounds intentional. It sounds like I planned this instead of naming things at 3 AM while waiting for training to finish. Which is exactly what happened.</p>
<p>The B.5 designation means it sits between the base K50 and whatever comes next. Maybe B.6. Maybe C. Maybe I run out of letters and start using Greek. Omega Prism has a nice ring to it.</p>
<h2>The Concept</h2>
<p>Normal distillation gives you the answer. Prism gives you the answer plus the top 50 alternatives for every single token. You see the chosen path. You also see the 50 paths not taken. You see the logprobs. You see the uncertainty.</p>
<div class="stats-grid">
<div class="stat-card">
<div class="number">50</div>
<div class="label">Top K Alternatives</div>
</div>
<div class="stat-card">
<div class="number">50x</div>
<div class="label">More Data Per Prompt</div>
</div>
</div>
<p>The math is simple. Call the number of tokens in a response x and the number of prompts y. The total number of data points is x times y times 50. A standard dataset stores one target token per position, so this is 50 times more training signal per prompt.</p>
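<p>That arithmetic can be sketched in a few lines. The 223-token response length comes from the sample entry later in the post; the prompt count is hypothetical, purely for illustration:</p>

```python
# Back-of-envelope: data points per dataset at Top-K = 50.
tokens_per_response = 223  # x: response length from the sample entry
prompts = 1_000            # y: hypothetical prompt count
top_k = 50

# Every position carries 50 (token, logprob) alternatives.
data_points = tokens_per_response * prompts * top_k
print(data_points)  # 11150000
```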
<h2>Why Qwen 3.5</h2>
<p>Not 2.5. Not 3.0. Specifically 3.5. The 0.8B and 2B variants from the Qwen3.5 family. They are small enough to run locally. They are smart enough to teach my models. They are open enough to share logits.</p>
<p>I could have used larger models. I could have used closed APIs. I did not. My wallet said no. My principles said no. My hard drive also said no but I ignored that one.</p>
<h2>The Data Structure</h2>
<p>It is JSONL. It is standard. It is easy to parse. It is also massive. Here is what a single entry looks like. Notice the token_logprobs array. Notice the top_k list inside each token. That is the gold.</p>
<div class="code-block">
<span class="comment"># Example CompactAI-Prism Entry</span>
{
"messages": [
{"role": "user", "content": "About how many words does the average person speak?"},
{"role": "assistant", "content": "Research into speech habits suggests..."}
],
"response_tokens": 223,
"token_logprobs": [
{
"position": 0,
"generated_token_id": 26539,
"generated_token": "Research",
"logprob": -2.984375,
"top_k": [
{"token_id": 1206, "token": "To", "logprob": -1.359375},
{"token_id": 760, "token": "The", "logprob": -1.984375},
{"token_id": 27775, "token": "Based", "logprob": -1.984375},
... (50 total alternatives)
]
}
]
}
</div>
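<p>A minimal parsing sketch, assuming only the field names shown above. The inline entry is a toy stand-in, not real dataset content:</p>

```python
import json

# Toy stand-in for one Prism JSONL line, using the field names from
# the example entry above (messages, token_logprobs, top_k).
line = json.dumps({
    "messages": [{"role": "user", "content": "Hi"}],
    "response_tokens": 1,
    "token_logprobs": [{
        "position": 0,
        "generated_token": "Hello",
        "logprob": -0.5,
        "top_k": [
            {"token_id": 1, "token": "Hello", "logprob": -0.5},
            {"token_id": 2, "token": "Hi", "logprob": -1.2},
        ],
    }],
})

entry = json.loads(line)
first = entry["token_logprobs"][0]
# Map each alternative token to its logprob for the first position.
alts = {alt["token"]: alt["logprob"] for alt in first["top_k"]}
print(first["generated_token"], alts["Hi"])  # Hello -1.2
```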
<h2>The Full Vocab Dream</h2>
<p>The third dataset plans to capture everything. Not just 50 tokens. Not just 100 tokens. The full vocabulary. Qwen3.5 has 248,320 tokens. I want to capture all of them for every generation step.</p>
<p>The third dataset would be 248,320 times denser than a standard dataset, which records a single target token per position. This level of density maps the entire semantic landscape. Standard datasets show the path taken. Full vocab shows the terrain.</p>
<div class="code-block">
<span class="comment"># Future dataset specs (hopefully)</span>
Top-K: 248,320 (Full Vocab)
Density: 248,320x a standard one-target-per-token dataset
Status: Planning / Praying
</div>
<p>Imagine knowing every road not taken. Most models learn the highway. This dataset teaches the backroads. It shows the model that "Why?" lives in a different neighborhood than "Hey! What's up?". The probability distance tells a story. Low probability means semantic distance. High probability means similarity.</p>
<p>The student model learns the shape of language itself. It understands that certain responses belong together. It understands tone. It understands context. It sees the probability mass surrounding each decision. A model trained on this knows why "Research" fits better than "To" in a formal context. It sees the weight of every possibility.</p>
<p>Will it work? I hope so. Can my hard drive handle it? Probably not. Will I try anyway? Absolutely. A single response could become gigabytes. A dataset could become terabytes. I am aware of the scale. I am proceeding with caution. And hope. Mostly hope.</p>
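<p>A rough uncompressed storage estimate shows why the hard drive objects. Assuming one float32 logprob per vocabulary entry per position, with the 223-token length from the sample entry:</p>

```python
# Rough storage for ONE full-vocab entry, float32, uncompressed.
vocab = 248_320          # Qwen3.5 vocabulary size, per the post
bytes_per_logprob = 4    # float32
tokens = 223             # response length from the sample entry

bytes_total = vocab * bytes_per_logprob * tokens
print(round(bytes_total / 1e9, 2))  # ~0.22 GB for a single response
```

<p>At that rate a few thousand responses is already a terabyte, before any compression or sparsity tricks.</p>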
<h2>Why This Matters</h2>
<p>Training on just the final tokens teaches mimicry. Training on the probability distribution teaches reasoning. The student model learns why "Paris" was chosen over "London". It learns why "Research" was chosen over "To". It learns the shape of the decision.</p>
<p>This is how you make small models smart. You do not just show them the answer. You show them the thinking. You show them the doubt. You show them the confidence. Prism does all of this.</p>
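<p>One generic way to train on that signal is a soft-target cross-entropy over the stored top-k slice. This is a common distillation loss sketched from scratch, not the actual training code here; the renormalization step handles the fact that the top-50 slice does not cover the full vocabulary:</p>

```python
import math

def topk_cross_entropy(teacher_logprobs, student_logprobs):
    """Cross-entropy of the student against the teacher over the
    same k token ids. Teacher probabilities are renormalized because
    the stored top-k slice is a truncated distribution."""
    probs = [math.exp(lp) for lp in teacher_logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]  # renormalize truncated mass
    return -sum(p * s for p, s in zip(probs, student_logprobs))

# Teacher logprobs for "To", "The", "Based" from the example entry;
# the student logprobs are hypothetical.
teacher = [-1.359375, -1.984375, -1.984375]
student = [-1.2, -2.1, -2.0]
loss = topk_cross_entropy(teacher, student)
print(round(loss, 3))
```

<p>A one-hot label would reward only the chosen token; this loss also rewards the student for putting mass on "To" and "The" in proportion to the teacher's doubt.</p>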
<h2>What Comes Next</h2>
<p>Both datasets are MIT licensed. Both are on Hugging Face. Both store the top 50 tokens per position. Both are free. Use them. Fork them. Train your tiny models on them. Make something better than my tiny models.</p>
<p>Haiku-2 will train on Prism. Sonnet-2 will train on Prism. Opus will train on Prism. Maybe they will learn to speak. Maybe they will learn to think. Maybe they will still output pipe characters. We will find out together.</p>
<h2>Final Thoughts</h2>
<p>Prism exists now. Two versions. A third dreaming of full vocab. 50 times denser than standard datasets. Longer training times. Smarter models. Full hard drives. Empty wallet. Full heart.</p>
<p>This is something real. This is something useful. This helps the little guys train better models without spending millions on API calls. That is the goal. That is the dream. That is Prism.</p>
<hr>
</div>
<footer class="post-footer">
<p>Current status: Prism-K50 and Prism-B.5-K50 released. Full vocab dataset planned. Hard drive full. GPU happy. Haiku-2 still training. Will release when it stops outputting pipe characters.</p>
</footer>
</div>
</article>
</main>
<footer>
<div class="container">
<p>Built with curiosity over compute</p>
<p>TinyMemoryLM by AILAY | 2026</p>
</div>
</footer>
</body>
</html>