qrater-web-base-v1.0

A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Distilled from qrater-web-large-v1.0 (4B) using temperature-scaled KL-divergence, it retains near-identical accuracy at roughly 6x the throughput and a quarter of the memory.

| Model | Params | Base | Speed (vLLM) | Speed (HF) | GPU Mem | Val Acc | Val F1 |
|---|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~15 docs/s | ~9 docs/s | ~8 GB | 92.1% | 0.867 |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~90 docs/s | ~16 docs/s | ~2 GB | 92.4% | 0.873 |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | n/a | ~0.5 GB | 90.6% | 0.843 |

Speeds measured on a single A100-80GB with a maximum input length of 4,096 tokens.

What it does

Given a web page (as markdown or plain text), the model predicts:

  • clean (label 1) — substantive, readable content suitable for AI consumption
  • dirty (label 0) — noise, boilerplate, broken formatting, thin content

Usage

Transformers

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-base-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]

vLLM (recommended for throughput)

from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-base-v1.0",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
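The `probs` vector is just a two-way distribution; a minimal post-processing sketch (assuming the `[prob_dirty, prob_clean]` index order stated in the comment above — `to_prediction` and `clean_threshold` are illustrative names, not part of the vLLM API) turns it into a labeled prediction:

```python
def to_prediction(probs, clean_threshold=0.5):
    """Map a [prob_dirty, prob_clean] pair to a labeled prediction.

    The index order is an assumption taken from the comment above;
    clean_threshold lets you trade precision for recall.
    """
    clean_score = probs[1]
    label = "clean" if clean_score >= clean_threshold else "dirty"
    return {"label": label, "score": round(clean_score, 4)}
```

Raising `clean_threshold` above 0.5 keeps only high-confidence clean pages, which can be useful when filtering a corpus for training data.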

Training

  • Teacher model: qrater-web-large-v1.0 (Qwen3-Embedding-4B, fine-tuned)
  • Student base: Qwen/Qwen3-Embedding-0.6B
  • Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
    • Temperature: 1.0
    • Alpha (soft label weight): 0.5
    • Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
  • Training data: 10,000 labeled web pages
    • 4,128 samples from live web search results, labeled by Claude
    • 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
    • Target distribution: ~30% clean / ~70% dirty
  • Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
  • Hardware: 4x A100-80GB with gradient checkpointing
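The combined loss above can be sketched in plain Python (a minimal illustration over single examples, not the actual training code; the `T**2` rescaling of the KL term is the usual distillation convention and is a no-op at the final config's T=1.0):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, T=1.0, alpha=0.5):
    # Soft term: KL(teacher || student), both distributions softened by T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # Hard term: cross-entropy against the 0/1 label, at T=1.
    ce = -math.log(softmax(student_logits)[hard_label])
    # T**2 keeps the soft-gradient magnitude comparable to CE as T grows.
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With alpha=0.5 this reduces to the stated `0.5 * KL + 0.5 * CrossEntropy` objective.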

Hyperparameter sweep

The final configuration was chosen after a 9-config sweep over learning rate, temperature (T), and soft-label weight (α):

| Config | Val Accuracy | Val F1 |
|---|---|---|
| lr=1e-4, T=2.0, α=0.5 | 88.6% | 0.810 |
| lr=5e-5, T=2.0, α=0.5 | 90.3% | 0.840 |
| lr=2e-5, T=2.0, α=0.5 | 78.9% | 0.613 |
| lr=1e-5, T=2.0, α=0.5 | 59.1% | 0.383 |
| lr=5e-5, T=1.0, α=0.5 | 90.2% | 0.838 |
| lr=5e-5, T=4.0, α=0.5 | 89.7% | 0.828 |
| lr=5e-5, T=2.0, α=0.3 | 90.6% | 0.843 |
| lr=5e-5, T=2.0, α=0.7 | 87.9% | 0.795 |
| lr=5e-5, T=2.0, α=1.0 | 84.7% | 0.738 |
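Ranking the sweep rows by validation F1 can be sketched as follows (values copied verbatim from the table; just an illustration of the selection step):

```python
# Sweep results from the table above, as plain dicts.
sweep = [
    {"lr": 1e-4, "T": 2.0, "alpha": 0.5, "acc": 0.886, "f1": 0.810},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.5, "acc": 0.903, "f1": 0.840},
    {"lr": 2e-5, "T": 2.0, "alpha": 0.5, "acc": 0.789, "f1": 0.613},
    {"lr": 1e-5, "T": 2.0, "alpha": 0.5, "acc": 0.591, "f1": 0.383},
    {"lr": 5e-5, "T": 1.0, "alpha": 0.5, "acc": 0.902, "f1": 0.838},
    {"lr": 5e-5, "T": 4.0, "alpha": 0.5, "acc": 0.897, "f1": 0.828},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.3, "acc": 0.906, "f1": 0.843},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.7, "acc": 0.879, "f1": 0.795},
    {"lr": 5e-5, "T": 2.0, "alpha": 1.0, "acc": 0.847, "f1": 0.738},
]
# Highest validation F1 in the sweep.
best = max(sweep, key=lambda cfg: cfg["f1"])
```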

The final model was trained with lr=5e-5, T=1.0, α=0.5 for 3 full epochs, achieving 92.4% accuracy and 0.873 F1.

Label definition

A page is clean if:

  • It contains substantive, original content (articles, tutorials, documentation, research papers)
  • The main content is intact and readable after markdown conversion
  • Minimal boilerplate relative to content

A page is dirty if:

  • Dominated by navigation, ads, cookie notices, or login walls
  • Thin or auto-generated content with little substance
  • Broken formatting or encoding issues that make content unusable
  • Primarily lists of links, product listings, or search result pages

Evaluation

Validation set (1,000 held-out samples, same distribution as training):

  • Accuracy: 92.4%
  • F1 (clean class): 0.873

Gold standard (100 human-labeled samples):

  • Accuracy: 89.0%
  • F1 (clean class): 0.807
  • Matches the 4B teacher's gold accuracy (89.0%)
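F1 here is reported for the clean class specifically; a minimal sketch of that computation (standard single-class F1 with clean = 1 as the positive label; `f1_for_class` is an illustrative helper, not from any library):

```python
def f1_for_class(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives
    # for the chosen positive class (clean = 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```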

Live web search results (99 pages across 10 diverse queries):

  • 34.3% classified clean — close to the teacher (30.3%) and in the same range as the Claude baseline (~40%)

Throughput comparison

| Engine | 0.6B (this model) | 4B (teacher) | Speedup |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 16.0 docs/s | 8.7 docs/s | 1.8x |
| vLLM classify (batched, 1 GPU) | ~90 docs/s | ~15 docs/s | ~6x |
| Peak GPU memory | 2.1 GB | ~8 GB | 3.8x less |

Limitations

  • English-only — trained exclusively on English web content
  • Max input: 4,096 tokens — longer pages are truncated (the base model supports 32K but training used 4K)
  • Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
  • Binary classification — does not grade quality on a spectrum
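Because inputs past 4,096 tokens are silently truncated, very long pages are judged by their opening content only. A rough pre-truncation guard can be sketched as follows (whitespace words are a crude proxy for tokens — an assumption; use the model's own tokenizer for exact counts):

```python
def truncate_rough(text, max_words=3000):
    # Crude guard against silent truncation at the model's 4,096-token limit.
    # Whitespace-split words only approximate tokens; the real cutoff
    # depends on the model's tokenizer.
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words])
```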

Citation

@misc{qrater2026,
  title={qrater-web-base-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-base-v1.0}
}

License

Apache 2.0
