qrater-web-base-v1.0

A fast, lightweight binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Distilled from qrater-web-large-v1.0 (4B) using temperature-scaled KL-divergence, it retains near-identical accuracy at roughly 6x the throughput and a quarter of the memory.

| Model | Params | Base | Speed (vLLM) | Speed (HF) | GPU Mem | Val Acc | Val F1 |
|---|---|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~15 docs/s | ~9 docs/s | ~8 GB | 92.1% | 0.867 |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~90 docs/s | ~16 docs/s | ~2 GB | 92.4% | 0.873 |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | n/a | ~0.5 GB | 90.6% | 0.843 |

Speeds measured on a single A100-80GB with a maximum input length of 4,096 tokens.

What it does

Given a web page (as markdown or plain text), the model predicts:

  • clean (label 1) — substantive, readable content suitable for AI consumption
  • dirty (label 0) — noise, boilerplate, broken formatting, thin content

Usage

Transformers

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-base-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]

vLLM (recommended for throughput)

from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-base-v1.0",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
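The `probs` vector is just a two-way distribution; a minimal post-processing sketch (assuming the `[prob_dirty, prob_clean]` index order stated in the comment above — `to_prediction` and `clean_threshold` are illustrative names, not part of the vLLM API) turns it into a labeled prediction:

```python
def to_prediction(probs, clean_threshold=0.5):
    """Map a [prob_dirty, prob_clean] pair to a labeled prediction.

    The index order is an assumption taken from the comment above;
    clean_threshold lets you trade precision for recall.
    """
    clean_score = probs[1]
    label = "clean" if clean_score >= clean_threshold else "dirty"
    return {"label": label, "score": round(clean_score, 4)}
```

Raising `clean_threshold` above 0.5 keeps only high-confidence clean pages, which can be useful when filtering a corpus for training data.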

Training

  • Teacher model: qrater-web-large-v1.0 (Qwen3-Embedding-4B, fine-tuned)
  • Student base: Qwen/Qwen3-Embedding-0.6B
  • Distillation method: KL-divergence loss on teacher soft probabilities combined with hard-label cross-entropy
    • Temperature: 1.0
    • Alpha (soft label weight): 0.5
    • Loss = 0.5 * KL(student, teacher) + 0.5 * CrossEntropy(student, hard_labels)
  • Training data: 10,000 labeled web pages
    • 4,128 samples from live web search results, labeled by Claude
    • 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
    • Target distribution: ~30% clean / ~70% dirty
  • Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
  • Hardware: 4x A100-80GB with gradient checkpointing
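The combined loss above can be sketched in plain Python (a minimal illustration over single examples, not the actual training code; the `T**2` rescaling of the KL term is the usual distillation convention and is a no-op at the final config's T=1.0):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, T=1.0, alpha=0.5):
    # Soft term: KL(teacher || student), both distributions softened by T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # Hard term: cross-entropy against the 0/1 label, at T=1.
    ce = -math.log(softmax(student_logits)[hard_label])
    # T**2 keeps the soft-gradient magnitude comparable to CE as T grows.
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With alpha=0.5 this reduces to the stated `0.5 * KL + 0.5 * CrossEntropy` objective.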

Hyperparameter sweep

The final configuration was chosen after a 9-config sweep over learning rate, temperature (T), and soft-label weight (α):

| Config | Val Accuracy | Val F1 |
|---|---|---|
| lr=1e-4, T=2.0, α=0.5 | 88.6% | 0.810 |
| lr=5e-5, T=2.0, α=0.5 | 90.3% | 0.840 |
| lr=2e-5, T=2.0, α=0.5 | 78.9% | 0.613 |
| lr=1e-5, T=2.0, α=0.5 | 59.1% | 0.383 |
| lr=5e-5, T=1.0, α=0.5 | 90.2% | 0.838 |
| lr=5e-5, T=4.0, α=0.5 | 89.7% | 0.828 |
| lr=5e-5, T=2.0, α=0.3 | 90.6% | 0.843 |
| lr=5e-5, T=2.0, α=0.7 | 87.9% | 0.795 |
| lr=5e-5, T=2.0, α=1.0 | 84.7% | 0.738 |
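Ranking the sweep rows by validation F1 can be sketched as follows (values copied verbatim from the table; just an illustration of the selection step):

```python
# Sweep results from the table above, as plain dicts.
sweep = [
    {"lr": 1e-4, "T": 2.0, "alpha": 0.5, "acc": 0.886, "f1": 0.810},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.5, "acc": 0.903, "f1": 0.840},
    {"lr": 2e-5, "T": 2.0, "alpha": 0.5, "acc": 0.789, "f1": 0.613},
    {"lr": 1e-5, "T": 2.0, "alpha": 0.5, "acc": 0.591, "f1": 0.383},
    {"lr": 5e-5, "T": 1.0, "alpha": 0.5, "acc": 0.902, "f1": 0.838},
    {"lr": 5e-5, "T": 4.0, "alpha": 0.5, "acc": 0.897, "f1": 0.828},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.3, "acc": 0.906, "f1": 0.843},
    {"lr": 5e-5, "T": 2.0, "alpha": 0.7, "acc": 0.879, "f1": 0.795},
    {"lr": 5e-5, "T": 2.0, "alpha": 1.0, "acc": 0.847, "f1": 0.738},
]
# Highest validation F1 in the sweep.
best = max(sweep, key=lambda cfg: cfg["f1"])
```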

The final model was trained with lr=5e-5, T=1.0, α=0.5 for 3 full epochs, achieving 92.4% accuracy and 0.873 F1.

Label definition

A page is clean if:

  • It contains substantive, original content (articles, tutorials, documentation, research papers)
  • The main content is intact and readable after markdown conversion
  • Minimal boilerplate relative to content

A page is dirty if:

  • Dominated by navigation, ads, cookie notices, or login walls
  • Thin or auto-generated content with little substance
  • Broken formatting or encoding issues that make content unusable
  • Primarily lists of links, product listings, or search result pages

Evaluation

Validation set (1,000 held-out samples, same distribution as training):

  • Accuracy: 92.4%
  • F1 (clean class): 0.873

Gold standard (100 human-labeled samples):

  • Accuracy: 89.0%
  • F1 (clean class): 0.807
  • Matches the 4B teacher's gold accuracy (89.0%)
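F1 here is reported for the clean class specifically; a minimal sketch of that computation (standard single-class F1 with clean = 1 as the positive label; `f1_for_class` is an illustrative helper, not from any library):

```python
def f1_for_class(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives
    # for the chosen positive class (clean = 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```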

Live web search results (99 pages across 10 diverse queries):

  • 34.3% classified clean — close to the teacher (30.3%) and in the same range as the Claude baseline (~40%)

Throughput comparison

| Engine | 0.6B (this model) | 4B (teacher) | Speedup |
|---|---|---|---|
| HuggingFace (single doc, 1 GPU) | 16.0 docs/s | 8.7 docs/s | 1.8x |
| vLLM classify (batched, 1 GPU) | ~90 docs/s | ~15 docs/s | ~6x |
| Peak GPU memory | 2.1 GB | ~8 GB | 3.8x less |

Limitations

  • English-only — trained exclusively on English web content
  • Max input: 4,096 tokens — longer pages are truncated (the base model supports 32K but training used 4K)
  • Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
  • Binary classification — does not grade quality on a spectrum
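Because inputs past 4,096 tokens are silently truncated, very long pages are judged by their opening content only. A rough pre-truncation guard can be sketched as follows (whitespace words are a crude proxy for tokens — an assumption; use the model's own tokenizer for exact counts):

```python
def truncate_rough(text, max_words=3000):
    # Crude guard against silent truncation at the model's 4,096-token limit.
    # Whitespace-split words only approximate tokens; the real cutoff
    # depends on the model's tokenizer.
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words])
```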

Citation

@misc{qrater2026,
  title={qrater-web-base-v1.0: Distilled Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-base-v1.0}
}

License

Apache 2.0
