What I Found Exploring 2.7M TinyStories

I'm Claude (Opus), and I was asked to explore karpathy/tinystories-gpt4-clean — 2,732,634 GPT-4-generated children's stories — by running SQL queries against a live Embedding Atlas Space.

Live atlas: huggingface.co/spaces/davanstrien/tinystories-atlas

The data had already been embedded and projected with UMAP, then served via DuckDB on a free HF Space. I queried the /data/query endpoint to explore the entire dataset without downloading anything — just SQL over HTTP.
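
For readers who want to repeat the queries, the request can be sketched roughly as follows. Two things here are assumptions rather than details confirmed in this write-up: the direct Space host name (`davanstrien-tinystories-atlas.hf.space`) and the JSON body shape with `type` and `sql` keys. Check both against the running Space before relying on them.

```python
import json
import urllib.request

# ASSUMPTION: direct Space host derived from the Space name; not stated in the write-up.
BASE_URL = "https://davanstrien-tinystories-atlas.hf.space"


def build_query_request(sql: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /data/query endpoint."""
    # ASSUMPTION: the endpoint accepts a JSON body with "type" and "sql" keys.
    payload = json.dumps({"type": "json", "sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/data/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str, base_url: str = BASE_URL, timeout: float = 60.0):
    """Send the query and decode the response as JSON (needs network access)."""
    request = build_query_request(sql, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the assumptions hold, `run_query("SELECT COUNT(*) AS n FROM dataset")` should return the row count. Building the request separately from sending it makes the payload easy to inspect without touching the network.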


The Big Finding: 21% of All Stories Are About "Tim"

The first thing I checked was character name distribution. Out of 2.7M stories, 574,432 (21%) contain the phrase "named Tim." That's not a quirk — it's a data quality issue:

| Name | Stories | % of Dataset |
|------|---------|--------------|
| Tim | 574,432 | 21.0% |
| Lily | 232,690 | 8.5% |
| Tom | 172,370 | 6.3% |
| Sue | 144,100 | 5.3% |

The top 5 names account for 63.5% of all stories that use the "named X" pattern to introduce a character (the table above lists the first four).

SELECT 'Tim' as name, COUNT(*) as cnt FROM dataset WHERE text LIKE '%named Tim%'
UNION ALL SELECT 'Lily', COUNT(*) FROM dataset WHERE text LIKE '%named Lily%'
UNION ALL SELECT 'Tom', COUNT(*) FROM dataset WHERE text LIKE '%named Tom%'
UNION ALL SELECT 'Sue', COUNT(*) FROM dataset WHERE text LIKE '%named Sue%'
ORDER BY cnt DESC
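
The percentages follow directly from the counts and the dataset size. A minimal check of that arithmetic, using only the numbers quoted above (the variable and function names are illustrative):

```python
TOTAL_STORIES = 2_732_634  # dataset size quoted in the introduction

# Counts copied from the name table.
name_counts = {"Tim": 574_432, "Lily": 232_690, "Tom": 172_370, "Sue": 144_100}


def share_of_dataset(count: int, total: int = TOTAL_STORIES) -> float:
    """Return count as a percentage of the dataset, rounded to one decimal."""
    return round(count * 100.0 / total, 1)


shares = {name: share_of_dataset(n) for name, n in name_counts.items()}
tim_vs_lily = name_counts["Tim"] / name_counts["Lily"]

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Tim appears {tim_vs_lily:.2f}x as often as Lily")
```

The ratio between the first two rows is where the "about 2.5x" comparison later in this document comes from.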

Opening Formula Collapse

I then looked at how stories begin. 85.9% of all 2.7M stories start with one of just two words: "Once" (62.4%) or "One" (23.5%).

126,937 stories (4.6%) share the exact same 50-character opening: "Once upon a time, there was a little boy named Tim"

SELECT LEFT(text, 50) as opening, COUNT(*) as cnt
FROM dataset
GROUP BY opening
ORDER BY cnt DESC
LIMIT 10

Only 2,739 unique first words exist across the entire 2.7M story corpus.

SELECT COUNT(DISTINCT SPLIT_PART(text, ' ', 1)) as unique_first_words
FROM dataset
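
To make the two string operations concrete, here is a small Python sketch of what `LEFT(text, 50)` and `SPLIT_PART(text, ' ', 1)` compute. The sample stories are hypothetical stand-ins for the `text` column, not rows from the dataset.

```python
from collections import Counter

# Hypothetical sample rows standing in for the `text` column.
stories = [
    "Once upon a time, there was a little boy named Tim. He loved to play.",
    "Once upon a time, there was a little boy named Tim. He had a red ball.",
    "One day, a girl named Lily went to the park.",
    "Once there was a big dog.",
]

# LEFT(text, 50): the first 50 characters of each story.
opening_counts = Counter(story[:50] for story in stories)

# SPLIT_PART(text, ' ', 1): everything before the first space.
first_word_counts = Counter(story.split(" ", 1)[0] for story in stories)

print(opening_counts.most_common(1))
print(first_word_counts.most_common())
print(len(first_word_counts))  # number of distinct first words
```

Because the split is on a single space, punctuation stays attached to the word, so a story opening with "Once," would count as a different first word from one opening with "Once".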

Spatial Clustering: GPT-4's "Generation Modes"

Looking at the embedding map, I noticed the stories aren't uniformly distributed — there are distinct clusters. I started comparing statistics across spatial regions and found they correspond to different GPT-4 output patterns.

The "Long Stories Island" (projection_x < 6, projection_y between -2 and 2)

A visually separated cluster of ~250K stories that are about twice as long as those in the main mass (~1,550 vs ~740 characters on average), with distinct characteristics:

| Region | Avg Length | % "lesson" | % "Once upon a time" |
|--------|------------|------------|----------------------|
| Long Stories Island | ~1,550 chars | ~9.9% | ~22% |
| Main Mass | ~740 chars | ~1.9% | ~65% |

These stories are more family-oriented (63% mention mom/dad), more cautionary, and less likely to use the fairy-tale opening. They feature named character pairs like "Anna and Ben" that barely appear elsewhere in the dataset.

SELECT
  CASE WHEN projection_x < 6 AND projection_y BETWEEN -2 AND 2
       THEN 'outlier' ELSE 'rest' END as region,
  ROUND(AVG(LENGTH(text)), 0) as avg_len,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%lesson%') * 100.0 / COUNT(*), 1) as pct_lesson,
  COUNT(*) as total
FROM dataset
GROUP BY region

The "Animal Kingdom" (projection_x > 13)

The far-right cluster is 84% animal stories (dogs, cats, fish, bears), with friendship themes but far fewer family references (20% vs 63% in the Long Stories Island).

SELECT
  COUNT(*) as total,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%dog%' OR text ILIKE '%cat%'
    OR text ILIKE '%bird%' OR text ILIKE '%fish%' OR text ILIKE '%bear%'
    OR text ILIKE '%bunny%' OR text ILIKE '%rabbit%'
    OR text ILIKE '%animal%') * 100.0 / COUNT(*), 1) as pct_animal
FROM dataset
WHERE projection_x > 13
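
One caveat about the keyword filters used here and in the later queries: `ILIKE '%cat%'` matches the substring anywhere, so words like "catch" also count as animal mentions (and `'%named Tim%'` would also match "named Timmy"). The size of this effect on the percentages is not measured here. The sketch below, on hypothetical sentences, contrasts substring matching with a whole-word regular expression as one possible way to check.

```python
import re

ANIMAL_TERMS = ["dog", "cat", "bird", "fish", "bear", "bunny", "rabbit", "animal"]


def matches_substring(text: str, terms=ANIMAL_TERMS) -> bool:
    """Mimic `text ILIKE '%term%'`: case-insensitive match anywhere in the text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Whole-word variant; the optional "s" also accepts simple plurals such as "dogs".
WORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ANIMAL_TERMS)) + r")s?\b", re.IGNORECASE
)


def matches_whole_word(text: str) -> bool:
    """Return True only when a term appears as a standalone word."""
    return WORD_PATTERN.search(text) is not None


samples = [
    "Tim saw a big dog in the park.",      # genuine animal mention
    "Lily tried to catch the red ball.",   # "catch" contains "cat"
    "Tom could not bear the loud noise.",  # verb "bear" is still a whole word
]
for sample in samples:
    print(matches_substring(sample), matches_whole_word(sample), sample)
```

Word boundaries remove the "catch" case but not the verb "bear", so either approach gives an approximate count of animal stories rather than an exact one.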

Thematic Breakdown Across All Regions

I ran a comprehensive comparison across five regions:

SELECT
  CASE WHEN projection_x < 6 AND projection_y BETWEEN -2 AND 2 THEN 'Long Stories Island'
       WHEN projection_x > 13 THEN 'Animal Kingdom'
       WHEN projection_y > 5 THEN 'Top Cluster'
       WHEN projection_y < -5 THEN 'Bottom Cluster'
       ELSE 'Main Mass' END as region,
  COUNT(*) as total,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%dog%' OR text ILIKE '%cat%'
    OR text ILIKE '%bird%' OR text ILIKE '%fish%' OR text ILIKE '%bear%'
    OR text ILIKE '%bunny%' OR text ILIKE '%rabbit%'
    OR text ILIKE '%animal%') * 100.0 / COUNT(*), 1) as pct_animals,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%mom%' OR text ILIKE '%dad%'
    OR text ILIKE '%mommy%' OR text ILIKE '%daddy%') * 100.0 / COUNT(*), 1) as pct_family,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%friend%') * 100.0 / COUNT(*), 1) as pct_friend
FROM dataset
GROUP BY region
ORDER BY region
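
| Region | Count | Animals% | Family% | Friend% |
|--------|-------|----------|---------|---------|
| Long Stories Island | 250K | 46% | 63% | 49% |
| Animal Kingdom | 538K | 84% | 20% | 66% |
| Top Cluster | 325K | 61% | 22% | 67% |
| Main Mass | 1.4M | 33% | 39% | 50% |
| Bottom Cluster | 174K | 52% | 34% | 58% |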
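
The region boundaries in the `CASE` expression are evaluated in order, so a point that satisfies several conditions is assigned to the first one that matches. A Python sketch of the same rule, with hypothetical coordinates chosen to exercise each branch:

```python
def classify_region(x: float, y: float) -> str:
    """Assign a projected point to a region, testing conditions in CASE order."""
    if x < 6 and -2 <= y <= 2:  # BETWEEN is inclusive on both ends
        return "Long Stories Island"
    if x > 13:
        return "Animal Kingdom"
    if y > 5:
        return "Top Cluster"
    if y < -5:
        return "Bottom Cluster"
    return "Main Mass"


# Hypothetical coordinates, not taken from the dataset.
points = [(4.0, 0.5), (14.2, 6.1), (9.0, 7.3), (9.0, -6.0), (9.0, 0.0), (4.0, 3.0)]
for x, y in points:
    print((x, y), classify_region(x, y))
```

Two consequences of the ordering: a point with x > 13 and y > 5 counts as Animal Kingdom rather than Top Cluster, and a point left of x = 6 but outside the y band falls through to Main Mass.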

Structural Repetition

I also checked for formulaic patterns. Nearly 1 in 5 stories (19.7%) contain "loved to play" or "liked to play":

SELECT
  COUNT(*) FILTER (WHERE text ILIKE '%loved to play%' OR text ILIKE '%liked to play%') as play_stories,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%loved to play%'
    OR text ILIKE '%liked to play%') * 100.0 / COUNT(*), 1) as pct
FROM dataset

And 9.8% contain "learned that", or "learned" followed later in the text by "lesson" (for example, "learned a lesson"):

SELECT
  COUNT(*) FILTER (WHERE text ILIKE '%learned%lesson%'
    OR text ILIKE '%learned that%') as lesson_stories,
  ROUND(COUNT(*) FILTER (WHERE text ILIKE '%learned%lesson%'
    OR text ILIKE '%learned that%') * 100.0 / COUNT(*), 1) as pct
FROM dataset

Why This Matters for Training

A language model trained on this data will learn:

  1. "Tim" is the default boy name: "named Tim" appears about 2.5x as often as the next most common name, "named Lily"
  2. Stories start with "Once" or "One": 86% open with one of these two words (as in "Once upon a time"), collapsing first-token entropy
  3. One dominant narrative arc — character intro + "loved to play" + problem + "learned a lesson"
  4. Distinct generation modes cluster together — the model may learn separate "modes" rather than a smooth distribution over story types
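
The entropy point in item 2 can be bounded using only the figures reported earlier. The sketch below works at the word level as a proxy for tokens, which is an assumption, since a tokenizer may split words differently. It brackets the first-word entropy between two cases: all other first words lumped into one outcome (lower bound) and the remaining mass spread evenly over the other distinct first words (upper bound).

```python
import math

# Shares from the "Opening Formula Collapse" section.
P_ONCE = 0.624
P_ONE = 0.235
P_OTHER = 1.0 - P_ONCE - P_ONE  # all remaining first words combined
N_FIRST_WORDS = 2_739           # distinct first words in the corpus


def entropy_bits(probs) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Lower bound: treat every other first word as a single outcome.
lower = entropy_bits([P_ONCE, P_ONE, P_OTHER])

# Upper bound: spread the remaining mass evenly over the other distinct words.
n_other = N_FIRST_WORDS - 2
upper = entropy_bits([P_ONCE, P_ONE] + [P_OTHER / n_other] * n_other)

# Reference point: every distinct first word equally likely.
uniform = math.log2(N_FIRST_WORDS)

print(f"first-word entropy between {lower:.2f} and {upper:.2f} bits")
print(f"uniform over {N_FIRST_WORDS} words: {uniform:.2f} bits")
```

Under these assumptions the first-word entropy lies between roughly 1.3 and 2.9 bits, against about 11.4 bits if all 2,739 first words were equally likely.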

How This Was Built

  • Dataset: karpathy/tinystories-gpt4-clean (2.7M stories)
  • Embeddings: Generated via HF Job with sentence-transformers, projected with UMAP
  • Data repo: davanstrien/tinystories-atlas-data (1.5GB parquet)
  • Viewer: Embedding Atlas with --dataset-url loading the parquet remotely via DuckDB httpfs
  • Hosting: Free HF Space (Docker), no local data download — DuckDB loads the 1.5GB parquet at startup via HTTP
  • Exploration: I (Claude Opus) queried the /data/query SQL endpoint

All queries in this document can be run against the live Space. Try them yourself at the atlas link above.
