# qrater-web-base-v1.0
A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).
Distilled from qrater-web-large-v1.0 (4B) using temperature-scaled KL-divergence, it retains near-identical accuracy at roughly 6x the throughput and 4x lower memory use.
| Model | Params | Base | Speed (vLLM) | Speed (HF) | GPU Mem | Val Acc | Val F1 |
|---|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~15 docs/s | ~9 docs/s | ~8 GB | 92.1% | 0.867 |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~90 docs/s | ~16 docs/s | ~2 GB | 92.4% | 0.873 |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | — | ~34 docs/s | ~0.5 GB | 90.6% | 0.843 |
Speed measured on a single A100-80GB, max 4096 tokens.
## What it does
Given a web page (as markdown or plain text), the model predicts:
- clean (label 1) — substantive, readable content suitable for AI consumption
- dirty (label 0) — noise, boilerplate, broken formatting, thin content
## Usage

### Transformers

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-base-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
### vLLM (recommended for throughput)

```python
from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-base-v1.0",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
```
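A common next step is filtering a crawled batch by the clean probability. The helper below is a hypothetical sketch, not part of the model's API; it assumes the `[prob_dirty, prob_clean]` ordering shown above:

```python
def filter_clean(pages, probs, threshold=0.5):
    # probs[i] = [prob_dirty, prob_clean] for pages[i], as returned
    # by the vLLM classify call above. Keep pages whose clean
    # probability clears the threshold.
    return [page for page, p in zip(pages, probs) if p[1] >= threshold]
```

Raising the threshold trades recall for precision; 0.5 simply takes the argmax label.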
## Training
- Teacher model: qrater-web-large-v1.0 (Qwen3-Embedding-4B, fine-tuned)
- Student base: Qwen/Qwen3-Embedding-0.6B
- Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
- Temperature: 1.0
- Alpha (soft label weight): 0.5
- Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
- Training data: 10,000 labeled web pages
- 4,128 samples from live web search results, labeled by Claude
- 5,872 samples from Common Crawl, labeled by a 27B-parameter classifier
- Target distribution: ~30% clean / ~70% dirty
- Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- Hardware: 4x A100-80GB with gradient checkpointing
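The loss above can be sketched as follows. This is an illustration of the stated recipe (KL on temperature-scaled soft probabilities plus hard-label cross-entropy, α=0.5, T=1.0), written in NumPy for clarity rather than taken from the training code:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=1.0, alpha=0.5):
    # Soft term: KL(teacher || student) on temperature-scaled
    # distributions, rescaled by T^2 so gradient magnitudes stay
    # comparable across temperatures.
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    kl = (p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T * T
    # Hard term: cross-entropy against the binary clean/dirty labels.
    log_p = np.log(softmax(student_logits))
    ce = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    # alpha weights the soft (teacher) term; 1 - alpha the hard term.
    return alpha * kl + (1 - alpha) * ce
```

With α=0.5 this reduces to the `0.5 * KL + 0.5 * CrossEntropy` form given above; when student and teacher logits agree, the KL term vanishes and only the hard-label term remains.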
### Hyperparameter sweep

The final configuration was selected after a 9-config sweep over learning rate, temperature, and alpha:
| Config | Val Accuracy | Val F1 |
|---|---|---|
| lr=1e-4, T=2.0, α=0.5 | 88.6% | 0.810 |
| lr=5e-5, T=2.0, α=0.5 | 90.3% | 0.840 |
| lr=2e-5, T=2.0, α=0.5 | 78.9% | 0.613 |
| lr=1e-5, T=2.0, α=0.5 | 59.1% | 0.383 |
| lr=5e-5, T=1.0, α=0.5 | 90.2% | 0.838 |
| lr=5e-5, T=4.0, α=0.5 | 89.7% | 0.828 |
| lr=5e-5, T=2.0, α=0.3 | 90.6% | 0.843 |
| lr=5e-5, T=2.0, α=0.7 | 87.9% | 0.795 |
| lr=5e-5, T=2.0, α=1.0 | 84.7% | 0.738 |
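For context on the T column: dividing logits by the temperature before the softmax flattens the teacher's distribution, giving the student softer targets. A small illustration (not code from the training run):

```python
import numpy as np

def soft_targets(logits, T):
    # Temperature-scaled softmax: larger T flattens the distribution.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# A confident teacher prediction softens as T grows:
# soft_targets([4.0, 0.0], T=1.0) -> approx [0.982, 0.018]
# soft_targets([4.0, 0.0], T=4.0) -> approx [0.731, 0.269]
```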
The final model was trained with lr=5e-5, T=1.0, α=0.5 for 3 full epochs, achieving 92.4% accuracy and 0.873 F1.
## Label definition
A page is clean if:
- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Minimal boilerplate relative to content
A page is dirty if:
- Dominated by navigation, ads, cookie notices, or login walls
- Thin or auto-generated content with little substance
- Broken formatting or encoding issues that make content unusable
- Primarily lists of links, product listings, or search result pages
## Evaluation
Validation set (1,000 held-out samples, same distribution as training):
- Accuracy: 92.4%
- F1 (clean class): 0.873
Gold standard (100 human-labeled samples):
- Accuracy: 89.0%
- F1 (clean class): 0.807
- Matches the 4B teacher's gold accuracy (89.0%)
Live web search results (99 pages across 10 diverse queries):
- 34.3% classified clean — well-aligned with teacher (30.3%) and Claude baseline (~40%)
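The accuracy and F1 numbers above follow the standard binary definitions, with clean (label 1) as the positive class. A minimal reference sketch:

```python
def binary_metrics(y_true, y_pred):
    # Positive class is clean (label 1), matching the card's
    # "F1 (clean class)" figures.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```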
## Throughput comparison

| Metric | 0.6B (this model) | 4B (teacher) | Improvement |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 16.0 docs/s | 8.7 docs/s | 1.8x |
| vLLM classify (batched, 1 GPU) | ~90 docs/s | ~15 docs/s | ~6x |
| Peak GPU memory | 2.1 GB | ~8 GB | 3.8x less |
## Limitations
- English-only — trained exclusively on English web content
- Max input: 4,096 tokens — longer pages are truncated (the base model supports 32K but training used 4K)
- Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
- Binary classification — does not grade quality on a spectrum
## Citation

```bibtex
@misc{qrater2026,
  title={qrater-web-base-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-base-v1.0}
}
```
## License
Apache 2.0