qrater-web-small-v1.0
A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).
Distilled from qrater-web-base-v1.0 (0.6B) using temperature-scaled KL-divergence distillation into a 210M encoder model. Runs at 34 docs/s on a single GPU with only 0.5 GB memory.
| Model | Params | Base | Speed (HF) | GPU Mem | Val Acc | Gold Acc |
|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~9 docs/s | ~8 GB | 92.1% | 89.0% |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | ~2 GB | 92.4% | 89.0% |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | ~0.5 GB | 90.6% | 86.0% |
Speed measured on a single A100-80GB, HuggingFace inference, max 4096 tokens.
What it does
Given a web page (as markdown or plain text), the model predicts:
- clean (label 1): substantive, readable content suitable for AI consumption
- dirty (label 0): noise, boilerplate, broken formatting, thin content
Usage
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-small-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
Training
- Teacher model: qrater-web-base-v1.0 (Qwen3-Embedding-0.6B, distilled from the 4B teacher)
- Student base: EuroBERT/EuroBERT-210m
- Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
- Temperature: 1.0
- Alpha (soft label weight): 0.5
- Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
- Training data: 10,000 labeled web pages
- 4,128 samples from live web search results, labeled by Claude
- 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
- Target distribution: ~30% clean / ~70% dirty
- Hyperparameters: 3 epochs, lr=5e-5, effective batch size 128, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- Hardware: 4x A100-80GB with gradient checkpointing
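The combined objective above can be sketched in PyTorch. This is a minimal illustration of the loss formula from the card, not the actual training code; the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=1.0):
    """alpha * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    t = temperature
    # Soft-label term: KL divergence between temperature-scaled distributions,
    # rescaled by t^2 so gradients are comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    # Hard-label term: standard cross-entropy against the binary labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 4 documents, 2 classes (dirty=0, clean=1).
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([1, 0, 0, 1])
loss = distillation_loss(student, teacher, labels)
```

With temperature 1.0 and alpha 0.5, this reduces to the equal-weight sum stated above.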
Why EuroBERT?
We evaluated three encoder architectures for the small model tier:
| Base Model | Params | Val Accuracy | Val F1 |
|---|---|---|---|
| EuroBERT-210m | 210M | 90.6% | 0.843 |
| EmbeddingGemma-300m | 300M | 90.8% | 0.849 |
| ModernBERT-large | 395M | 81.0% | 0.668 |
EuroBERT-210m matches EmbeddingGemma-300m to within 0.2 points of accuracy with 30% fewer parameters, a longer context window (8K vs 2K tokens), and an Apache 2.0 license rather than the Gemma license.
Label definition
A page is clean if:
- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Minimal boilerplate relative to content
A page is dirty if:
- Dominated by navigation, ads, cookie notices, or login walls
- Thin or auto-generated content with little substance
- Broken formatting or encoding issues that make content unusable
- Primarily lists of links, product listings, or search result pages
Evaluation
Validation set (1,000 held-out samples, same distribution as training):
- Accuracy: 90.6%
- F1 (clean class): 0.843
Gold standard (100 human-labeled samples):
- Accuracy: 86.0%
- F1 (clean class): 0.741
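The reported metrics can be reproduced from model predictions with a few lines. A self-contained sketch, using toy prediction and label lists rather than the actual evaluation sets:

```python
def accuracy_and_f1(preds, labels, positive=1):
    """Accuracy plus F1 for the positive ('clean' = 1) class."""
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1

# Toy example: 6 documents, clean=1 / dirty=0.
preds  = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(preds, labels)
```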
Throughput
| Metric | This model | 0.6B (base) | 4B (large) |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 34.0 docs/s | 16.0 docs/s | 8.7 docs/s |
| Peak GPU memory | 0.5 GB | 2.1 GB | ~8 GB |
| Avg latency | 29.4 ms/doc | 62.6 ms/doc | ~115 ms/doc |
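The numbers above were measured on an A100-80GB; throughput on other hardware will differ. A minimal pattern for benchmarking on your own setup (the stand-in classifier below lets the sketch run without downloading the model; substitute `pipe` from the Usage section for real numbers):

```python
import time

def measure_throughput(classify, docs, warmup=2):
    """Return (docs/s, ms/doc) for any classify(text) callable."""
    for d in docs[:warmup]:          # warm up (CUDA init, kernel caching)
        classify(d)
    start = time.perf_counter()
    for d in docs:
        classify(d)
    elapsed = time.perf_counter() - start
    return len(docs) / elapsed, 1000 * elapsed / len(docs)

# Stand-in classifier so the sketch runs anywhere; replace with the
# real pipeline, e.g. measure_throughput(pipe, my_docs).
docs_per_s, ms_per_doc = measure_throughput(lambda t: "clean", ["doc"] * 100)
```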
Limitations
- English-only: trained exclusively on English web content
- Max input 4,096 tokens: longer pages are truncated (the base model supports 8K but training used 4K)
- Requires `trust_remote_code=True`: EuroBERT uses custom modeling code
- Optimized for informational content: may be less calibrated on creative writing, social media, or e-commerce pages
- Binary classification: does not grade quality on a spectrum
Citation
```bibtex
@misc{qrater2026,
  title={qrater-web-small-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-small-v1.0}
}
```
License
Apache 2.0