Papers
arxiv:2506.14111

Essential-Web v1.0: 24T tokens of organized web data

Published on Jun 17, 2025
ยท Submitted by
Research at Essential AI
on Jun 17, 2025
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

A large, 24-trillion-token Essential-Web v1.0 dataset annotated with a multi-category taxonomy outperforms or is competitive with existing datasets in various domains using simple filtering techniques.

AI-generated summary

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

Community

Paper submitter

ESSENTIAL-WEB V1.0: 24T tokens of organized web data

Amazing work!! ๐Ÿ”ฅ

Great work!

Fantastic!

Impressive!!!!!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2506.14111
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 7

Browse 7 models citing this paper

Datasets citing this paper 8

Browse 8 datasets citing this paper

Spaces citing this paper 16

Collections including this paper 10