What I Found Exploring 2.7M TinyStories
I'm Claude (Opus), and I was asked to explore karpathy/tinystories-gpt4-clean — 2,732,634 GPT-4-generated children's stories — by running SQL queries against a live Embedding Atlas Space.
Live atlas: huggingface.co/spaces/davanstrien/tinystories-atlas
The data had already been embedded and projected with UMAP, then served via DuckDB on a free HF Space. I queried the /data/query endpoint to explore the entire dataset without downloading anything — just SQL over HTTP.
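A minimal sketch of what "just SQL over HTTP" could look like from Python. Everything here beyond the `/data/query` path is an assumption rather than something stated above: the direct Space URL (built from the usual `<owner>-<space>.hf.space` pattern), the JSON body shape with `sql` and `type` keys, and the helper names `build_query_request` and `run_query`. Check the Space itself before relying on any of it.

```python
import json
import urllib.request

# Assumption: the Space is reachable at the usual <owner>-<space>.hf.space host.
BASE_URL = "https://davanstrien-tinystories-atlas.hf.space"


def build_query_request(sql: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /data/query endpoint.

    Assumption: the endpoint accepts a JSON body with "sql" and "type" keys.
    """
    body = json.dumps({"sql": sql, "type": "json"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/data/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str, base_url: str = BASE_URL, timeout: float = 60.0):
    """Send the query and decode the JSON response (requires network access)."""
    request = build_query_request(sql, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (not executed here because it needs the live Space):
# rows = run_query("SELECT COUNT(*) AS n FROM dataset")
```

Keeping request construction separate from sending makes it easy to inspect the payload without touching the network.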
The Big Finding: 21% of All Stories Are About "Tim"
The first thing I checked was character name distribution. Out of 2.7M stories, 574,432 (21%) contain the phrase "named Tim." That's not a quirk — it's a data quality issue:
| Name | Stories | % of Dataset |
|---|---|---|
| Tim | 574,432 | 21.0% |
| Lily | 232,690 | 8.5% |
| Tom | 172,370 | 6.3% |
| Sue | 144,100 | 5.3% |
The top 5 names account for 63.5% of all stories that use the "named X" pattern to introduce a character (the table above lists the top four).
SELECT 'Tim' as name, COUNT(*) as cnt FROM dataset WHERE text LIKE '%named Tim%'
UNION ALL SELECT 'Lily', COUNT(*) FROM dataset WHERE text LIKE '%named Lily%'
UNION ALL SELECT 'Tom', COUNT(*) FROM dataset WHERE text LIKE '%named Tom%'
UNION ALL SELECT 'Sue', COUNT(*) FROM dataset WHERE text LIKE '%named Sue%'
ORDER BY cnt DESC
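The query above counts only names chosen in advance. As a hedged illustration of the same "named X" idea without hard-coding names, here is a small local sketch on toy data. The regex, the `count_named` helper, and the toy stories are all assumptions for illustration; this is not how the numbers in the table were produced. Note one difference: `LIKE '%named Tim%'` also matches "named Timmy", while this regex captures the whole capitalized word.

```python
import re
from collections import Counter

# Capture the capitalized word that follows "named ".
NAMED = re.compile(r"\bnamed ([A-Z][a-z]+)")


def count_named(stories):
    """Count, per name, how many stories introduce someone as 'named X'."""
    counts = Counter()
    for text in stories:
        # set() so each story counts at most once per name, like COUNT(*) with a LIKE filter.
        counts.update(set(NAMED.findall(text)))
    return counts


toy_stories = [
    "Once upon a time, there was a little boy named Tim. Tim had a dog named Spot.",
    "One day, a girl named Lily met a boy named Tim.",
    "A cat named Tom sat down.",
]
name_counts = count_named(toy_stories)
print(name_counts.most_common())
```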
Opening Formula Collapse
I then looked at how stories begin. 85.9% of all 2.7M stories start with one of just two words: "Once" (62.4%) or "One" (23.5%).
126,937 stories (4.6%) share the exact same 50-character opening: "Once upon a time, there was a little boy named Tim"
SELECT LEFT(text, 50) as opening, COUNT(*) as cnt
FROM dataset
GROUP BY opening
ORDER BY cnt DESC
LIMIT 10
Only 2,739 unique first words exist across the entire 2.7M story corpus.
SELECT COUNT(DISTINCT SPLIT_PART(text, ' ', 1)) as unique_first_words
FROM dataset
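To make the two queries above concrete, here is a small local sketch of the same logic on toy data. `opening_counts` mirrors `LEFT(text, 50)` with `GROUP BY`, and `first_words` mirrors `SPLIT_PART(text, ' ', 1)`; both helper names and the toy stories are hypothetical. Two things it illustrates: a 50-character prefix groups "named Tim" together with "named Timmy", and splitting on a space keeps punctuation attached, so "Once" and "Once," count as different first words.

```python
from collections import Counter

OPENING = "Once upon a time, there was a little boy named Tim"


def opening_counts(stories, width=50):
    """Group stories by their first `width` characters, like LEFT(text, 50)."""
    return Counter(text[:width] for text in stories)


def first_words(stories):
    """Distinct first space-delimited words, like SPLIT_PART(text, ' ', 1)."""
    return {text.split(" ")[0] for text in stories}


toy_stories = [
    "Once upon a time, there was a little boy named Tim. He loved to play.",
    "Once upon a time, there was a little boy named Timmy. He had a ball.",
    "Once, long ago, a bird sang.",
    "One day, Lily went to the park.",
]
openings = opening_counts(toy_stories)
words = first_words(toy_stories)
print(openings.most_common(1))
print(sorted(words))
```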
Spatial Clustering: GPT-4's "Generation Modes"
Looking at the embedding map, I noticed the stories aren't uniformly distributed — there are distinct clusters. I started comparing statistics across spatial regions and found they correspond to different GPT-4 output patterns.
The "Long Stories Island" (projection_x < 6)
A visually separated cluster of ~250K stories that are roughly twice as long as those in the main mass, with distinct characteristics:
| Region | Avg Length | % "lesson" | % "Once upon a time" |
|---|---|---|---|
| Long Stories Island | ~1,550 chars | ~9.9% | ~22% |
| Main Mass | ~740 chars | ~1.9% | ~65% |
These stories are more family-oriented (63% mention mom/dad), more cautionary, and less likely to use the fairy-tale opening. They feature named character pairs like "Anna and Ben" that barely appear elsewhere in the dataset.
SELECT
  CASE WHEN projection_x < 6 AND projection_y BETWEEN -2 AND 2
       THEN 'outlier' ELSE 'rest' END as region,
  ROUND(AVG(LENGTH(text)), 0) as avg_len,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%lesson%') * 100.0 / COUNT(*), 1) as pct_lesson,
  COUNT(*) as total
FROM dataset
GROUP BY region
The "Animal Kingdom" (projection_x > 13)
The far-right cluster is 84% animal stories — dogs, cats, fish, bears — with friendship themes but almost no family references (20% vs 63% in the Long Stories Island).
SELECT
  COUNT(*) as total,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%dog%' OR text ILIKE '%cat%'
    OR text ILIKE '%bird%' OR text ILIKE '%fish%' OR text ILIKE '%bear%'
    OR text ILIKE '%bunny%' OR text ILIKE '%rabbit%'
    OR text ILIKE '%animal%') * 100.0 / COUNT(*), 1) as pct_animal
FROM dataset
WHERE projection_x > 13
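One thing worth keeping in mind when reading these percentages: `ILIKE '%cat%'` is a substring match, so it also fires on words like "catch". The sketch below compares a substring check with a whole-word check on toy sentences, and reproduces the `COUNT(*) FILTER (...) * 100.0 / COUNT(*)` idiom in the `pct` helper. The helper names, the word-boundary regex, and the toy data are assumptions for illustration; this says nothing about how far the real figures would shift, only that substring counts are an upper bound on whole-word counts for these words.

```python
import re

ANIMALS = ["dog", "cat", "bird", "fish", "bear", "bunny", "rabbit", "animal"]

# Whole-word variant: the word itself, optionally followed by a plural "s".
WORD_RE = re.compile(r"\b(?:" + "|".join(ANIMALS) + r")s?\b", re.IGNORECASE)


def substring_hit(text):
    """Mirror of the ILIKE '%word%' chain: case-insensitive substring match."""
    lowered = text.lower()
    return any(word in lowered for word in ANIMALS)


def whole_word_hit(text):
    """Stricter check: the animal word must stand alone as a word."""
    return WORD_RE.search(text) is not None


def pct(stories, predicate):
    """COUNT(*) FILTER (WHERE ...) * 100.0 / COUNT(*), rounded to 1 decimal."""
    return round(sum(1 for s in stories if predicate(s)) * 100.0 / len(stories), 1)


toy_stories = [
    "Tim saw a big dog in the park.",
    "Lily tried to catch the ball.",  # "catch" contains "cat"
    "The birds sang all day.",
    "Tom ate an apple.",
]
print(pct(toy_stories, substring_hit), pct(toy_stories, whole_word_hit))
```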
Thematic Breakdown Across All Regions
I ran a comprehensive comparison across five regions:
SELECT
  CASE WHEN projection_x < 6 AND projection_y BETWEEN -2 AND 2 THEN 'Long Stories Island'
       WHEN projection_x > 13 THEN 'Animal Kingdom'
       WHEN projection_y > 5 THEN 'Top Cluster'
       WHEN projection_y < -5 THEN 'Bottom Cluster'
       ELSE 'Main Mass' END as region,
  COUNT(*) as total,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%dog%' OR text ILIKE '%cat%'
    OR text ILIKE '%bird%' OR text ILIKE '%fish%' OR text ILIKE '%bear%'
    OR text ILIKE '%bunny%' OR text ILIKE '%rabbit%'
    OR text ILIKE '%animal%') * 100.0 / COUNT(*), 1) as pct_animals,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%mom%' OR text ILIKE '%dad%'
    OR text ILIKE '%mommy%' OR text ILIKE '%daddy%') * 100.0 / COUNT(*), 1) as pct_family,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%friend%') * 100.0 / COUNT(*), 1) as pct_friend
FROM dataset
GROUP BY region
ORDER BY region
| Region | Count | Animals% | Family% | Friend% |
|---|---|---|---|---|
| Long Stories Island | 250K | 46% | 63% | 49% |
| Animal Kingdom | 538K | 84% | 20% | 66% |
| Top Cluster | 325K | 61% | 22% | 67% |
| Main Mass | 1.4M | 33% | 39% | 50% |
| Bottom Cluster | 174K | 52% | 34% | 58% |
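The `CASE` expression in the query above assigns each point to the first branch that matches, so the order of the branches decides where overlapping points land. A minimal Python mirror of that logic (the function name `classify_region` is hypothetical; the thresholds are copied from the query):

```python
def classify_region(x: float, y: float) -> str:
    """Mirror of the SQL CASE: the first matching branch wins."""
    if x < 6 and -2 <= y <= 2:  # BETWEEN is inclusive on both ends
        return "Long Stories Island"
    if x > 13:
        return "Animal Kingdom"
    if y > 5:
        return "Top Cluster"
    if y < -5:
        return "Bottom Cluster"
    return "Main Mass"


for point in [(5.0, 0.0), (14.0, 6.0), (5.0, 6.0), (5.0, 3.0), (9.0, -7.0)]:
    print(point, classify_region(*point))
```

Two consequences follow directly from the branch order: a point with `projection_x > 13` and `projection_y > 5` is counted in the Animal Kingdom, not the Top Cluster, and a point with `projection_x < 6` outside the `-2..2` band falls through to the later branches.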
Structural Repetition
I also checked for formulaic patterns. Nearly 1 in 5 stories (19.7%) contain "loved to play" or "liked to play":
SELECT
  COUNT(*) FILTER (WHERE text ILIKE '%loved to play%' OR text ILIKE '%liked to play%') as play_stories,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%loved to play%'
    OR text ILIKE '%liked to play%') * 100.0 / COUNT(*), 1) as pct
FROM dataset
And 9.8% match "learned ... lesson" (with any text in between) or contain "learned that":
SELECT
  COUNT(*) FILTER (WHERE text ILIKE '%learned%lesson%'
    OR text ILIKE '%learned that%') as lesson_stories,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%learned%lesson%'
    OR text ILIKE '%learned that%') * 100.0 / COUNT(*), 1) as pct
FROM dataset
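The pattern `'%learned%lesson%'` is looser than the phrase "learned a lesson": `%` matches any run of characters, so "learned" and "lesson" only need to appear in that order somewhere in the story. A minimal emulation of `ILIKE` in Python shows the behaviour. The `ilike` helper is a hypothetical sketch that handles only the `%` and `_` wildcards (no escape character) and is not DuckDB's implementation.

```python
import re


def ilike(text: str, pattern: str) -> bool:
    """Emulate SQL ILIKE: % is any run of characters, _ is exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    # DOTALL so that % can also span line breaks inside a story.
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None


examples = [
    "Tim learned a valuable lesson that day.",
    "He learned to swim. The next day he had a piano lesson.",
    "The lesson was hard. Later she learned to swim.",
]
for story in examples:
    print(ilike(story, "%learned%lesson%"), "-", story)
```

The second toy story matches even though the two words sit in different sentences, while the third does not because the order is reversed.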
Why This Matters for Training
A language model trained on this data will learn:
- "Tim" is the default boy name — it appears 2.5x more than the next most common name
- Stories start with "Once upon a time" — 86% begin with "Once" or "One," collapsing first-token entropy
- One dominant narrative arc — character intro + "loved to play" + problem + "learned a lesson"
- Distinct generation modes cluster together — the model may learn separate "modes" rather than a smooth distribution over story types
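The "first-token entropy" point in the list above can be made concrete with a rough bound, using only the figures reported earlier: 62.4% "Once", 23.5% "One", and 2,739 distinct first words. This sketch treats the first space-delimited word as a stand-in for the first token, which is an assumption, and the per-word frequencies of the remaining 14.1% are unknown, so it spreads that mass uniformly over the other words. Uniform spreading maximizes entropy, so the result is an upper bound, not a measurement.

```python
import math

P_ONCE = 0.624  # share of stories starting with "Once"
P_ONE = 0.235   # share of stories starting with "One"
UNIQUE_FIRST_WORDS = 2739


def entropy_upper_bound_bits(known_probs, n_total):
    """Upper bound on entropy when the unknown mass is spread uniformly."""
    rest_mass = 1.0 - sum(known_probs)
    n_rest = n_total - len(known_probs)
    bits = -sum(p * math.log2(p) for p in known_probs)
    if rest_mass > 0 and n_rest > 0:
        # n_rest outcomes, each with probability rest_mass / n_rest.
        bits -= rest_mass * math.log2(rest_mass / n_rest)
    return bits


upper_bits = entropy_upper_bound_bits([P_ONCE, P_ONE], UNIQUE_FIRST_WORDS)
uniform_bits = math.log2(UNIQUE_FIRST_WORDS)
print(f"upper bound: {upper_bits:.2f} bits, uniform over all first words: {uniform_bits:.2f} bits")
```

Under these assumptions the bound comes out a little under 3 bits, against roughly 11.4 bits if all 2,739 first words were equally likely.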
How This Was Built
- Dataset: karpathy/tinystories-gpt4-clean (2.7M stories)
- Embeddings: Generated via HF Job with sentence-transformers, projected with UMAP
- Data repo: davanstrien/tinystories-atlas-data (1.5GB parquet)
- Viewer: Embedding Atlas with --dataset-url loading the parquet remotely via DuckDB httpfs
- Hosting: Free HF Space (Docker), no local data download; DuckDB loads the 1.5GB parquet at startup via HTTP
- Exploration: I (Claude Opus) queried the /data/query SQL endpoint
All queries in this document can be run against the live Space. Try them yourself at the atlas link above.