qrater-web-small-v1.0

A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Distilled from qrater-web-base-v1.0 (0.6B) using temperature-scaled KL-divergence distillation into a 210M encoder model. Runs at 34 docs/s on a single GPU with only 0.5 GB memory.

| Model | Params | Base | Speed (HF) | GPU Mem | Val Acc | Gold Acc |
|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~9 docs/s | ~8 GB | 92.1% | 89.0% |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | ~2 GB | 92.4% | 89.0% |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | ~0.5 GB | 90.6% | 86.0% |

Speed measured on a single A100-80GB, HuggingFace inference, max 4096 tokens.

What it does

Given a web page (as markdown or plain text), the model predicts:

  • clean (label 1): substantive, readable content suitable for AI consumption
  • dirty (label 0): noise, boilerplate, broken formatting, or thin content

Usage

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-small-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,  # EuroBERT uses custom modeling code
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
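In a corpus-cleaning setting, a typical next step is to keep only pages the model labels clean with high confidence. A minimal sketch of that filtering step, assuming predictions in the pipeline output format shown above (the 0.8 threshold and the `filter_clean` helper are illustrative choices, not part of the model):

```python
def filter_clean(docs, predictions, threshold=0.8):
    """Keep documents labeled 'clean' with score at or above threshold."""
    kept = []
    for doc, pred in zip(docs, predictions):
        if pred["label"] == "clean" and pred["score"] >= threshold:
            kept.append(doc)
    return kept

docs = ["# How DNS Works\n\nDNS resolution starts when...",
        "Accept cookies? | Login | Sign up | Home | About"]
predictions = [
    {"label": "clean", "score": 0.97},
    {"label": "dirty", "score": 0.99},
]
clean_docs = filter_clean(docs, predictions)  # only the DNS article survives
```

Raising the threshold trades recall for precision; for pretraining-data filtering, a stricter threshold is usually the safer default.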

Training

  • Teacher model: qrater-web-base-v1.0 (Qwen3-Embedding-0.6B, distilled from the 4B teacher)
  • Student base: EuroBERT/EuroBERT-210m
  • Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
    • Temperature: 1.0
    • Alpha (soft label weight): 0.5
    • Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
  • Training data: 10,000 labeled web pages
    • 4,128 samples from live web search results, labeled by Claude
    • 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
    • Target distribution: ~30% clean / ~70% dirty
  • Hyperparameters: 3 epochs, lr=5e-5, effective batch size 128, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
  • Hardware: 4x A100-80GB with gradient checkpointing
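The combined loss above can be illustrated with a toy numeric sketch in plain Python (not the training code). The KL(teacher β€– student) direction and the T² scaling of the KL term are standard distillation conventions assumed here; with T = 1.0, as used in training, the scaling is a no-op. The logits below are made-up values:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally temperature-scaled."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label,
                 temperature=1.0, alpha=0.5):
    # Soft targets from the teacher, soft predictions from the student.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 (a no-op at T = 1.0).
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    kl *= temperature ** 2
    # Cross-entropy against the hard label (0 = dirty, 1 = clean).
    ce = -math.log(softmax(student_logits)[hard_label])
    # alpha = 0.5 gives the 0.5 * KL + 0.5 * CE combination above.
    return alpha * kl + (1 - alpha) * ce

# Toy 2-class logits: the student roughly agrees with the teacher.
loss = distill_loss([1.2, -0.4], [2.0, -1.0], hard_label=0)
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted cross-entropy remains.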

Why EuroBERT?

We evaluated three encoder architectures for the small model tier:

| Base Model | Params | Val Accuracy | Val F1 |
|---|---|---|---|
| EuroBERT-210m | 210M | 90.6% | 0.843 |
| EmbeddingGemma-300m | 300M | 90.8% | 0.849 |
| ModernBERT-large | 395M | 81.0% | 0.668 |

EuroBERT-210m matches EmbeddingGemma's accuracy with 30% fewer parameters, a longer context window (8K vs 2K tokens), and an Apache 2.0 license (vs the Gemma license).

Label definition

A page is clean if:

  • It contains substantive, original content (articles, tutorials, documentation, research papers)
  • The main content is intact and readable after markdown conversion
  • It has minimal boilerplate relative to the main content

A page is dirty if:

  • Dominated by navigation, ads, cookie notices, or login walls
  • Thin or auto-generated content with little substance
  • Broken formatting or encoding issues that make content unusable
  • Primarily lists of links, product listings, or search result pages

Evaluation

Validation set (1,000 held-out samples, same distribution as training):

  • Accuracy: 90.6%
  • F1 (clean class): 0.843

Gold standard (100 human-labeled samples):

  • Accuracy: 86.0%
  • F1 (clean class): 0.741

Throughput

| Metric | This model | 0.6B (base) | 4B (large) |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 34.0 docs/s | 16.0 docs/s | 8.7 docs/s |
| Peak GPU memory | 0.5 GB | 2.1 GB | ~8 GB |
| Avg latency | 29.4 ms/doc | 62.6 ms/doc | ~115 ms/doc |
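The throughput and latency rows are consistent with each other: throughput is simply the reciprocal of per-document latency. A quick arithmetic check:

```python
def docs_per_sec(latency_ms):
    """Convert average per-document latency (ms) to throughput (docs/s)."""
    return 1000.0 / latency_ms

print(round(docs_per_sec(29.4), 1))   # this model -> 34.0
print(round(docs_per_sec(62.6), 1))   # 0.6B base  -> 16.0
print(round(docs_per_sec(115.0), 1))  # 4B large   -> 8.7
```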

Limitations

  • English-only: trained exclusively on English web content
  • Max input: 4,096 tokens; longer pages are truncated (the base encoder supports 8K context, but training used 4K)
  • Requires trust_remote_code=True: EuroBERT uses custom modeling code
  • Optimized for informational content: may be less well calibrated on creative writing, social media, or e-commerce pages
  • Binary classification: does not grade quality on a spectrum

Citation

```bibtex
@misc{qrater2026,
  title={qrater-web-small-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-small-v1.0}
}
```

License

Apache 2.0
