	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>I Made A Dataset So Dense It Broke My Hard Drive | TinyMemoryLM</title>
	<link rel="stylesheet" href="bluesheet.css">
	<link rel="preconnect" href="https://fonts.googleapis.com">
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
	<link href="https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono&display=swap" rel="stylesheet">
	<style>

	:root {
	--blue-900: #000000;
	--blue-800: #0a0a0a;
	--blue-700: #111111;
	--blue-600: #1a1a1a;
	--blue-500: #333333;
	--blue-400: #555555;
	--blue-300: #777777;
	--blue-200: #888888;
	--blue-100: #aaaaaa;
	--white: #ffffff;
	--white-soft: #f5f5f5;
	--white-muted: #e0e0e0;
	--grid-line: rgba(255, 255, 255, 0.03);
	--grid-line-major: rgba(255, 255, 255, 0.06);
	--accent: #ededed;
	--accent-muted: #888888;
	--font-sans: 'Geist', -apple-system, BlinkMacSystemFont, sans-serif;
	--font-mono: 'Geist Mono', 'SF Mono', 'Fira Code', monospace;
	--container-max: 1100px;
	}
	* { box-sizing: border-box; margin: 0; padding: 0; }
	html { font-size: 16px; scroll-behavior: smooth; }
	body { font-family: var(--font-sans); background: var(--blue-900); color: var(--white-muted); line-height: 1.7; -webkit-font-smoothing: antialiased; }
	a { color: var(--white); text-decoration: none; transition: color 0.15s ease; }
	a:hover { color: var(--accent); }
	.container { max-width: var(--container-max); margin: 0 auto; padding: 0 24px; }
	nav { position: fixed; top: 0; left: 0; right: 0; z-index: 100; background: rgba(0, 0, 0, 0.85); backdrop-filter: blur(12px); border-bottom: 1px solid var(--blue-600); padding: 16px 0; }
	nav .container { display: flex; justify-content: space-between; align-items: center; }
	.nav-brand { font-size: 18px; font-weight: 600; color: var(--white); display: flex; align-items: center; gap: 8px; }
	.nav-brand span { color: var(--accent); }
	.nav-links { display: flex; gap: 32px; }
	.nav-links a { font-size: 14px; font-weight: 500; color: var(--blue-200); }
	.nav-links a:hover { color: var(--white); }
	.post { padding: 140px 0 80px; }
	.post-back { display: inline-block; color: var(--blue-200); font-size: 14px; margin-bottom: 32px; }
	.post-back:hover { color: var(--accent); }
	.post-back::before { content: '← '; }
	.post-meta { display: flex; gap: 12px; margin-bottom: 20px; }
	.post-date { font-size: 13px; color: var(--blue-200); font-family: var(--font-mono); }
	.post-tag { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: var(--white); background: rgba(255, 255, 255, 0.08); padding: 4px 10px; border-radius: 4px; }
	.post h1 { font-size: 36px; font-weight: 700; color: var(--white); margin-bottom: 32px; line-height: 1.2; letter-spacing: -0.02em; }
	.post-body p { font-size: 17px; line-height: 1.8; margin-bottom: 24px; color: var(--blue-200); }
	.post-body p:first-of-type { font-size: 20px; color: var(--white-muted); }
	.post-body h2 { font-size: 24px; font-weight: 600; color: var(--white); margin: 48px 0 20px; }
	.post-body blockquote { border-left: 3px solid var(--accent); padding: 20px 24px; margin: 32px 0; background: var(--blue-800); border-radius: 0 8px 8px 0; }
	.post-body blockquote p { font-size: 16px; font-style: italic; color: var(--blue-200); margin: 0; }
	.post-body hr { border: none; height: 1px; background: var(--blue-600); margin: 48px 0; }
	.code-block { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 12px; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word; }
	.code-block .comment { color: var(--blue-200); font-style: italic; display: block; margin-top: 4px; }
	.stats-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin: 24px 0; }
	.stat-card { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; text-align: center; }
	.stat-card .number { font-size: 28px; font-weight: 700; color: var(--accent); font-family: var(--font-mono); }
	.stat-card .label { font-size: 13px; color: var(--blue-200); margin-top: 8px; }
	.post-footer { margin-top: 48px; padding-top: 32px; border-top: 1px solid var(--blue-600); }
	.post-footer p { font-size: 14px; color: var(--blue-200); font-style: italic; margin: 0; }
	footer { padding: 40px 0; background: var(--blue-800); border-top: 1px solid var(--blue-600); text-align: center; }
	footer p { color: var(--blue-200); font-size: 14px; margin-bottom: 8px; }
	footer a { color: var(--blue-200); }
	footer a:hover { color: var(--accent); }
	@media (max-width: 768px) { .post h1 { font-size: 28px; } .nav-links { display: none; } .stats-grid { grid-template-columns: 1fr; } }

	</style>

	</head>
	<body>
	<svg class="scribbles" viewBox="0 0 1440 900" preserveAspectRatio="xMidYMid slice">
	<path d="M100,50 Q150,30 200,60 T300,40 T400,70" fill="none" stroke="white" stroke-width="1"/>
	<path d="M800,200 Q850,180 900,210 T1000,190 T1100,220" fill="none" stroke="white" stroke-width="0.8"/>
	<path d="M200,700 Q250,680 300,710 T400,690 T500,720" fill="none" stroke="white" stroke-width="0.6"/>
	<path d="M1200,400 Q1250,380 1300,410 T1400,390" fill="none" stroke="white" stroke-width="0.7"/>
	<path d="M50,400 Q100,380 150,420 T250,400" fill="none" stroke="white" stroke-width="0.5"/>
	<circle cx="350" cy="150" r="30" fill="none" stroke="white" stroke-width="0.6"/>
	<circle cx="1100" cy="600" r="25" fill="none" stroke="white" stroke-width="0.5"/>
	<path d="M600,100 L620,80 L640,100 L660,80" fill="none" stroke="white" stroke-width="0.7"/>
	<path d="M1300,750 Q1320,730 1340,760 T1380,740" fill="none" stroke="white" stroke-width="0.5"/>
	<path d="M100,800 Q120,780 140,810 T180,790 T220,820" fill="none" stroke="white" stroke-width="0.6"/>
	<path d="M700,500 Q720,480 740,510 T780,490 T820,520" fill="none" stroke="white" stroke-width="0.4"/>
	<path d="M400,300 C420,280 440,320 460,300 C480,280 500,320 520,300" fill="none" stroke="white" stroke-width="0.5"/>
	<path d="M900,700 C920,680 940,720 960,700 C980,680 1000,720 1020,700" fill="none" stroke="white" stroke-width="0.6"/>
	<path d="M150,250 Q170,230 190,260 Q210,240 230,270" fill="none" stroke="white" stroke-width="0.4"/>
	<path d="M1050,100 Q1070,80 1090,110 Q1110,90 1130,120" fill="none" stroke="white" stroke-width="0.5"/>
	<path d="M500,850 C520,830 540,860 560,840 C580,820 600,860 620,840" fill="none" stroke="white" stroke-width="0.4"/>
	<path d="M1350,50 Q1370,30 1390,60 T1430,40" fill="none" stroke="white" stroke-width="0.5"/>
	<path d="M30,600 Q50,580 70,610 T110,590" fill="none" stroke="white" stroke-width="0.4"/>
	</svg>

	<nav>
	<div class="container">
	<a href="index.html" class="nav-brand"><span>/</span>TinyMemoryLM</a>
	<div class="nav-links">
	<a href="index.html">Home</a>
	<a href="blog.html">Blog</a>
	<a href="status.html">Status</a>
	</div>
	</div>
	</nav>
	<main>
	<article class="post">
	<div class="container">
	<a href="blog.html" class="post-back">Back to Blog</a>
	<header>
	<div class="post-meta">
	<span class="post-date">2026-03-31</span>
	<span class="post-tag">Datasets</span>
	</div>
	<h1>I Made A Dataset So Dense It Broke My Hard Drive</h1>
	</header>
	<div class="post-body">
	<p>I have a new dataset. It is called Dense-PRISM. It lives on Hugging Face. It is 164 GB. My hard drive cried when I uploaded it. My internet provider sent me a concerned email. I am proud.</p>
	<blockquote>
	<p>Density is not about size. Density is about information per byte. Dense-PRISM has so much information per byte that bytes are now asking for raises.</p>
	</blockquote>
	<h2>The Numbers</h2>
	<p>Let us talk about scale. Because numbers are fun and also terrifying.</p>
	<div class="stats-grid">
	<div class="stat-card">
	<div class="number">164 GB</div>
	<div class="label">File Size</div>
	</div>
	<div class="stat-card">
	<div class="number">4096</div>
	<div class="label">Top-K Per Token</div>
	</div>
	<div class="stat-card">
	<div class="number">799</div>
	<div class="label">Prompts</div>
	</div>
	<div class="stat-card">
	<div class="number">∞</div>
	<div class="label">Regrets</div>
	</div>
	</div>
	<p>The math works like this. Four thousand ninety-six top tokens logged for every single generated token. Seven hundred ninety-nine prompts. Average response length times four thousand ninety-six times seven hundred ninety-nine equals total training signals.</p>
	<div class="code-block">
	<span class="comment"># Dense-PRISM by the numbers</span>
	4096 (top_k) * ~200 (avg tokens) * 799 (prompts) = ~654 million data points
	</div>
	<p>Six hundred fifty-four million training signals. From seven hundred ninety-nine prompts. That is the power of density. That is the curse of density. My hard drive understands the curse personally.</p>
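	<p>The same arithmetic as a small Python sketch. The function name total_signals is made up for illustration, the 200-token average is the rough estimate from the block above, and reading 164 GB as 164 * 10**9 bytes is an assumption. Treat the bytes-per-signal figure as a ballpark, not a measurement.</p>
	<div class="code-block">
```python
def total_signals(top_k, avg_tokens, prompts):
    """Count logged (token, logprob) pairs: top_k candidates per generated token."""
    return top_k * avg_tokens * prompts


signals = total_signals(top_k=4096, avg_tokens=200, prompts=799)
print(f"{signals:,} signals")  # 654,540,800 signals

# Assumption: 164 GB is taken as 164 * 10**9 bytes (decimal gigabytes).
bytes_per_signal = 164 * 10**9 / signals
print(f"roughly {bytes_per_signal:.0f} bytes per signal")
```
	</div>
	<p>Change avg_tokens and the total moves with it. The real average response length decides the real number.</p>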
	<h2>What Is In The File</h2>
	<p>Each entry contains the standard conversation format. User asks. Assistant answers. Then comes the gold. For every token in the response, you get the top 4096 alternatives with their log probabilities.</p>
	<div class="code-block">
	<span class="comment"># Example Dense-PRISM Entry (abbreviated)</span>
	{
	"messages": [
	{"role": "user", "content": "Explain quantum entanglement simply"},
	{"role": "assistant", "content": "Quantum entanglement is a phenomenon..."}
	],
	"response_tokens": 187,
	"token_logprobs": [
	{
	"position": 0,
	"generated_token": "Quantum",
	"logprob": -3.12,
	"top_k": [
	{"token": "The", "logprob": -1.2},
	{"token": "In", "logprob": -1.8},
	{"token": "Quantum", "logprob": -3.12},
	... (4093 more alternatives)
	]
	}
	]
	}
	</div>
	<p>That ellipsis represents four thousand ninety-three more tokens. Multiply that by every token in every response. You get Dense-PRISM. You get a file that makes file explorers hesitate.</p>
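	<p>A minimal sketch of reading one entry, using only the fields shown in the example above. The helper name describe_position is hypothetical, and treating the values as natural-log probabilities is an assumption, since the post does not state the log base.</p>
	<div class="code-block">
```python
import math

# Abbreviated entry mirroring the example above: 3 of the 4096 alternatives.
entry = {
    "token_logprobs": [
        {
            "position": 0,
            "generated_token": "Quantum",
            "logprob": -3.12,
            "top_k": [
                {"token": "The", "logprob": -1.2},
                {"token": "In", "logprob": -1.8},
                {"token": "Quantum", "logprob": -3.12},
            ],
        }
    ]
}


def describe_position(step):
    """Return (rank, probability) of the generated token within its top_k list."""
    # Sort from most to least likely in case the list is not stored in order.
    ranked = sorted(step["top_k"], key=lambda alt: alt["logprob"], reverse=True)
    tokens = [alt["token"] for alt in ranked]
    # 1-based rank; raises ValueError if the generated token is not in top_k.
    rank = tokens.index(step["generated_token"]) + 1
    # Assumption: logprobs are natural logs, so exp() recovers the probability.
    probability = math.exp(step["logprob"])
    return rank, probability


for step in entry["token_logprobs"]:
    rank, probability = describe_position(step)
    print(step["generated_token"], "rank", rank, "p =", round(probability, 4))
```
	</div>
	<p>In this example the generated token ranks third, behind "The" and "In". That suggests the response was sampled rather than always taking the most likely token.</p>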
	<h2>Why 4096</h2>
	<p>Why not 50? Why not 100? Why not a reasonable number that does not break storage systems? Because 4096 is a power of two. Because it feels technical. Because I wanted to see what would happen.</p>
	<p>Also, 4096 tokens cover a meaningful slice of the vocabulary. They show the model not just the top choices but the entire neighborhood of possibilities. They teach semantic distance through probability gradients.</p>
	<blockquote>
	<p>A model trained on Dense-PRISM knows that "Why?" and "Hey! whats up?" live in different probability neighborhoods. It learns tone through math. It learns style through statistics.</p>
	</blockquote>
	<h2>The Free Part</h2>
	<p>Yes, it is free. MIT license. Download it. Fork it. Train your tiny models on it. Make something smarter than my tiny models. Please. My GPU needs the competition.</p>
	<p>I could have put it behind a paywall. I could have made it exclusive. I did not. Open source is the point. Sharing is the point. Watching other people build cool things with my weird datasets is the point.</p>
	<h2>Storage Considerations</h2>
	<p>164 GB is large. It will take time to download. It will take space to store. It will take patience to parse. This is the cost of density.</p>
	<div class="code-block">
	<span class="comment"># Tips for working with Dense-PRISM</span>
	1. Use streaming loaders, do not load the entire file into memory
	2. Filter by prompt type before training to reduce scope
	3. Consider sampling top_k if full density is not needed
	4. Have a large hard drive. Seriously.
	</div>
	<p>I learned these tips the hard way. My RAM cried. My swap file screamed. My patience evaporated. You do not need to repeat my mistakes. Learn from my pain.</p>
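	<p>Tips 1 and 3 can be sketched together in plain Python. This assumes a JSON Lines layout (one entry per line), which the post does not confirm, so check the real file format first. The names stream_entries and keep_top_k are hypothetical, and truncating to the most likely alternatives is only one reading of "sampling top_k".</p>
	<div class="code-block">
```python
import io
import json


def stream_entries(lines, keep_top_k=64):
    """Yield one entry at a time, keeping only the keep_top_k most likely alternatives."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        for step in entry.get("token_logprobs", []):
            ranked = sorted(step["top_k"], key=lambda alt: alt["logprob"], reverse=True)
            step["top_k"] = ranked[:keep_top_k]
        yield entry


# Stand-in for a real file handle. With a file on disk, pass open(path) instead.
fake_file = io.StringIO(
    json.dumps({
        "token_logprobs": [{
            "generated_token": "Quantum",
            "logprob": -3.12,
            "top_k": [
                {"token": "In", "logprob": -1.8},
                {"token": "Quantum", "logprob": -3.12},
                {"token": "The", "logprob": -1.2},
            ],
        }]
    }) + "\n"
)

for entry in stream_entries(fake_file, keep_top_k=2):
    kept = [alt["token"] for alt in entry["token_logprobs"][0]["top_k"]]
    print(kept)  # ['The', 'In']
```
	</div>
	<p>Because stream_entries is a generator, only one entry sits in memory at a time. Note that a small keep_top_k can drop the generated token itself, as it does here.</p>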
	<h2>What This Teaches</h2>
	<p>Standard distillation teaches what to say. Dense-PRISM teaches how to choose what to say. The student model sees the probability landscape. It understands why certain tokens fit certain contexts. It learns the shape of appropriate responses.</p>
	<p>A model trained on this knows that formal questions deserve formal answers. It knows that casual greetings invite casual responses. It knows the distance between tones. It learns through exposure to the full spectrum of possibility.</p>
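	<p>One common way to turn logged logprobs into a training signal is a soft-target cross-entropy. The post does not say which loss was used, so treat this as an illustrative sketch only. The function names, the student numbers, and the choice to renormalize over the logged alternatives (dropping everything outside top_k) are all assumptions.</p>
	<div class="code-block">
```python
import math


def soft_targets(top_k):
    """Renormalize logged top-k logprobs into a distribution over those tokens."""
    weights = {alt["token"]: math.exp(alt["logprob"]) for alt in top_k}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def soft_cross_entropy(teacher_probs, student_logprobs):
    """Cross-entropy of the student against the teacher's soft targets (lower is better)."""
    return -sum(p * student_logprobs[token] for token, p in teacher_probs.items())


# Teacher side: the three alternatives shown in the example entry.
teacher = soft_targets([
    {"token": "The", "logprob": -1.2},
    {"token": "In", "logprob": -1.8},
    {"token": "Quantum", "logprob": -3.12},
])

# Hypothetical student log-probabilities for the same three tokens.
student = {"The": math.log(0.5), "In": math.log(0.3), "Quantum": math.log(0.2)}

loss = soft_cross_entropy(teacher, student)
print(round(loss, 3))
```
	</div>
	<p>A hard-label loss would only reward the single generated token. The soft version also tells the student how much weight the teacher gave to each alternative.</p>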
	<h2>The Math Again</h2>
	<p>Let us return to the formula because it is beautiful in a terrifying way.</p>
	<div class="code-block">
	<span class="comment"># Total training signals</span>
	4096 * avg_tokens * 799 = total_signals

	# Example calculation:
	4096 * 200 * 799 = 654,540,800 signals

	# That is six hundred fifty-four million
	# training signals from seven hundred ninety-nine prompts
	# This is why my hard drive filed a complaint
	</div>
	<p>Each prompt becomes a universe of possibilities. Each token becomes a lesson in probability. Each logprob becomes a teacher. This is distillation at maximum density.</p>
	<h2>Who Should Use This</h2>
	<p>People training small models. People who want their models to understand nuance. People who have large hard drives. People who enjoy watching progress bars move very slowly.</p>
	<p>If you are training a model under 1B parameters, Dense-PRISM can teach it to speak with more intention. If you are training a model under 100M parameters, it can teach it to choose words with more care. If you are training a model under 10M parameters, it might teach it to form coherent sentences. Progress is relative.</p>
	<h2>Final Thoughts</h2>
	<p>Dense-PRISM exists. It is 164 GB. It has 4096 top tokens per generated token. It has 799 prompts. It is free. It is dense. It is available now on Hugging Face.</p>
	<p>Download it if you dare. Train on it if you can. Make something better than my confused tiny models. That is the goal. That is the dream. That is Dense-PRISM.</p>
	<hr>
	</div>
	<footer class="post-footer">
	<p>Current status: Dense-PRISM uploaded. Hard drive recovering. Internet bill arriving. Haiku-2 still training. Will release when it stops outputting pipe characters.</p>
	</footer>
	</div>
	</article>
	</main>
	<footer>
	<div class="container">
	<p>Built with curiosity over compute</p>
	<p>TinyMemoryLM by AILAY | 2026</p>
	</div>
	</footer>
	</body>
	</html>