qrater-web-small-v1.0

A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Distilled from qrater-web-base-v1.0 (0.6B) using temperature-scaled KL-divergence distillation into a 210M encoder model. Runs at 34 docs/s on a single GPU with only 0.5 GB memory.

| Model | Params | Base | Speed (HF) | GPU Mem | Val Acc | Gold Acc |
|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~9 docs/s | ~8 GB | 92.1% | 89.0% |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | ~2 GB | 92.4% | 89.0% |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | ~0.5 GB | 90.6% | 86.0% |

Speed measured on a single A100-80GB, HuggingFace inference, max 4096 tokens.

What it does

Given a web page (as markdown or plain text), the model predicts:

  • clean (label 1): substantive, readable content suitable for AI consumption
  • dirty (label 0): noise, boilerplate, broken formatting, or thin content

Usage

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-small-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,  # EuroBERT uses custom modeling code
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
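In a corpus-cleaning setting, a typical next step is to keep only pages the model labels clean with high confidence. A minimal sketch of that filtering step, assuming predictions in the pipeline output format shown above (the 0.8 threshold and the `filter_clean` helper are illustrative choices, not part of the model):

```python
def filter_clean(docs, predictions, threshold=0.8):
    """Keep documents labeled 'clean' with score at or above threshold."""
    kept = []
    for doc, pred in zip(docs, predictions):
        if pred["label"] == "clean" and pred["score"] >= threshold:
            kept.append(doc)
    return kept

docs = ["# How DNS Works\n\nDNS resolution starts when...",
        "Accept cookies? | Login | Sign up | Home | About"]
predictions = [
    {"label": "clean", "score": 0.97},
    {"label": "dirty", "score": 0.99},
]
clean_docs = filter_clean(docs, predictions)  # only the DNS article survives
```

Raising the threshold trades recall for precision; for pretraining-data filtering, a stricter threshold is usually the safer default.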

Training

  • Teacher model: qrater-web-base-v1.0 (Qwen3-Embedding-0.6B, distilled from the 4B teacher)
  • Student base: EuroBERT/EuroBERT-210m
  • Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
    • Temperature: 1.0
    • Alpha (soft label weight): 0.5
    • Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
  • Training data: 10,000 labeled web pages
    • 4,128 samples from live web search results, labeled by Claude
    • 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
    • Target distribution: ~30% clean / ~70% dirty
  • Hyperparameters: 3 epochs, lr=5e-5, effective batch size 128, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
  • Hardware: 4x A100-80GB with gradient checkpointing
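The combined loss above can be illustrated with a toy numeric sketch in plain Python (not the training code). The KL(teacher β€– student) direction and the T² scaling of the KL term are standard distillation conventions assumed here; with T = 1.0, as used in training, the scaling is a no-op. The logits below are made-up values:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally temperature-scaled."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label,
                 temperature=1.0, alpha=0.5):
    # Soft targets from the teacher, soft predictions from the student.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 (a no-op at T = 1.0).
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    kl *= temperature ** 2
    # Cross-entropy against the hard label (0 = dirty, 1 = clean).
    ce = -math.log(softmax(student_logits)[hard_label])
    # alpha = 0.5 gives the 0.5 * KL + 0.5 * CE combination above.
    return alpha * kl + (1 - alpha) * ce

# Toy 2-class logits: the student roughly agrees with the teacher.
loss = distill_loss([1.2, -0.4], [2.0, -1.0], hard_label=0)
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted cross-entropy remains.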

Why EuroBERT?

We evaluated three encoder architectures for the small model tier:

| Base Model | Params | Val Accuracy | Val F1 |
|---|---|---|---|
| EuroBERT-210m | 210M | 90.6% | 0.843 |
| EmbeddingGemma-300m | 300M | 90.8% | 0.849 |
| ModernBERT-large | 395M | 81.0% | 0.668 |

EuroBERT-210m matches EmbeddingGemma's accuracy with 30% fewer parameters, a longer context window (8K vs 2K tokens), and an Apache 2.0 license (vs the Gemma license).

Label definition

A page is clean if:

  • It contains substantive, original content (articles, tutorials, documentation, research papers)
  • The main content is intact and readable after markdown conversion
  • It has minimal boilerplate relative to the main content

A page is dirty if:

  • Dominated by navigation, ads, cookie notices, or login walls
  • Thin or auto-generated content with little substance
  • Broken formatting or encoding issues that make content unusable
  • Primarily lists of links, product listings, or search result pages

Evaluation

Validation set (1,000 held-out samples, same distribution as training):

  • Accuracy: 90.6%
  • F1 (clean class): 0.843

Gold standard (100 human-labeled samples):

  • Accuracy: 86.0%
  • F1 (clean class): 0.741

Throughput

| Metric | This model | 0.6B (base) | 4B (large) |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 34.0 docs/s | 16.0 docs/s | 8.7 docs/s |
| Peak GPU memory | 0.5 GB | 2.1 GB | ~8 GB |
| Avg latency | 29.4 ms/doc | 62.6 ms/doc | ~115 ms/doc |
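The throughput and latency rows are consistent with each other: throughput is simply the reciprocal of per-document latency. A quick arithmetic check:

```python
def docs_per_sec(latency_ms):
    """Convert average per-document latency (ms) to throughput (docs/s)."""
    return 1000.0 / latency_ms

print(round(docs_per_sec(29.4), 1))   # this model -> 34.0
print(round(docs_per_sec(62.6), 1))   # 0.6B base  -> 16.0
print(round(docs_per_sec(115.0), 1))  # 4B large   -> 8.7
```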

Limitations

  • English-only: trained exclusively on English web content
  • Max input: 4,096 tokens; longer pages are truncated (the base encoder supports 8K context, but training used 4K)
  • Requires trust_remote_code=True: EuroBERT uses custom modeling code
  • Optimized for informational content: may be less well calibrated on creative writing, social media, or e-commerce pages
  • Binary classification: does not grade quality on a spectrum

Citation

```bibtex
@misc{qrater2026,
  title={qrater-web-small-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-small-v1.0}
}
```

License

Apache 2.0
