# ModernBERT IAB News Classifier

A ModernBERT-base model fine-tuned to classify English news articles into 35 IAB Content Taxonomy 3.1 Tier 1 categories.

The model supports top-k classification: it returns multiple categories with confidence scores, which is useful for articles that span multiple topics.

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "mdonigian/modernbert-iab-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The Federal Reserve raised interest rates by 25 basis points on Wednesday, citing persistent inflation concerns despite recent banking sector turmoil."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
top5_probs, top5_ids = torch.topk(probs, k=5)

for prob, idx in zip(top5_probs, top5_ids):
    label = model.config.id2label[idx.item()]
    print(f"  {label:40s} {prob.item():.4f}")
```

Output:

```
  Business and Finance                     0.8234
  Personal Finance                         0.0891
  Politics                                 0.0412
  Law                                      0.0198
  Education                                0.0067
```

## Batch Inference

```python
articles = ["article text 1...", "article text 2...", ...]

inputs = tokenizer(articles, return_tensors="pt", truncation=True, max_length=1024, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)

for i, article_probs in enumerate(probs):
    top3_probs, top3_ids = torch.topk(article_probs, k=3)
    categories = [
        (model.config.id2label[idx.item()], prob.item())
        for prob, idx in zip(top3_probs, top3_ids)
    ]
    print(f"Article {i}: {categories}")
```

## Performance

| Metric | Score |
|---|---|
| Top-1 Accuracy | 75.5% |
| Top-2 Accuracy | 88.2% |
| Top-3 Accuracy | 92.9% |
| Top-5 Accuracy | 95.8% |
| Top-10 Accuracy | 98.6% |
| Macro F1 | 0.71 |
| Weighted F1 | 0.75 |
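For reference, top-k accuracy as reported above can be computed directly from the probability matrix with `torch.topk`. This is a minimal, self-contained sketch on synthetic data — `probs` and `labels` here are stand-ins for real model outputs and gold labels, not values from this model:

```python
import torch

def topk_accuracy(probs: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of rows whose true label is among the k highest-probability classes."""
    topk_ids = probs.topk(k, dim=-1).indices            # shape (n, k)
    hits = (topk_ids == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Synthetic example: 4 articles, 5 classes
probs = torch.tensor([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.15, 0.50, 0.25, 0.05, 0.05],
    [0.35, 0.30, 0.20, 0.10, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.60],
])
labels = torch.tensor([0, 2, 0, 4])

print(topk_accuracy(probs, labels, k=1))  # 0.75 — row 1's true label is ranked second
print(topk_accuracy(probs, labels, k=2))  # 1.0
```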

### Confidence Distribution

| Confidence Threshold | % of Predictions |
|---|---|
| >= 0.9 | 32.7% |
| >= 0.8 | 48.6% |
| >= 0.7 | 59.6% |
| >= 0.5 | 78.6% |

Mean confidence is 0.79 on correct predictions vs. 0.52 on incorrect ones.

### Most Confused Pairs

| True Label | Predicted As | Count |
|---|---|---|
| Law | Crime | 26 |
| Pop Culture | Entertainment | 23 |
| Crime | Law | 22 |
| Shopping | Style & Fashion | 21 |
| Entertainment | Pop Culture | 21 |
| Medical Health | Healthy Living | 17 |
| Disasters | Science | 16 |
| Politics | Law | 15 |
| Home & Garden | Shopping | 15 |
| Business and Finance | Personal Finance | 15 |

## Categories

The model classifies into the 35 IAB Content Taxonomy 3.1 Tier 1 categories:

| Category | Category | Category |
|---|---|---|
| Attractions | Automotive | Books and Literature |
| Business and Finance | Careers | Communication |
| Crime | Disasters | Education |
| Entertainment | Events | Family and Relationships |
| Fine Art | Food & Drink | Healthy Living |
| Hobbies & Interests | Holidays | Home & Garden |
| Law | Medical Health | Personal Celebrations & Life Events |
| Personal Finance | Pets | Politics |
| Pop Culture | Real Estate | Religion & Spirituality |
| Science | Shopping | Sports |
| Style & Fashion | Technology & Computing | Travel |
| Video Gaming | War and Conflicts | |

## Training Details

- **Base model:** `answerdotai/ModernBERT-base` (149M parameters)
- **Training data:** 106K news articles labeled by GPT-5-nano via OpenAI Batch API, downsampled to ~47K with a 2,000 cap per category
- **Sources:** Common Crawl CC-NEWS archives + Spider.cloud targeted crawls across 70+ news domains
- **Max sequence length:** 1,024 tokens
- **Training:** 5 epochs, cosine LR schedule, bf16, Flash Attention 2, AdamW (lr=3e-5)
- **Hardware:** Single NVIDIA RTX 4090 (24 GB)

## Recommended Usage

- **Single-label classification:** take the argmax of the logits
- **Multi-label / topic tagging:** take all categories above a confidence threshold (e.g. 0.15)
- **Confidence filtering:** discard predictions below 0.5 confidence for higher precision
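The multi-label option can be sketched as follows. To keep the snippet self-contained, `id2label` and `probs` are stubbed stand-ins for `model.config.id2label` and a softmax output row — the real label map has all 35 categories:

```python
import torch

# Stand-ins for model.config.id2label and one row of softmax output
id2label = {0: "Business and Finance", 1: "Personal Finance", 2: "Politics", 3: "Sports"}
probs = torch.tensor([0.55, 0.28, 0.12, 0.05])

THRESHOLD = 0.15  # the example threshold from the list above; tune per application

# Keep every category whose probability clears the threshold
tags = [
    (id2label[i], round(p.item(), 4))
    for i, p in enumerate(probs)
    if p.item() >= THRESHOLD
]
print(tags)  # [('Business and Finance', 0.55), ('Personal Finance', 0.28)]
```

Because the model is trained with a softmax (single-label) head, the per-class scores compete with each other; a low threshold such as 0.15 is what makes this usable for topic tagging.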

## Limitations

- Trained on English-language news articles only; performance on non-news text or other languages will be lower
- Labels were generated by GPT-5-nano, not human-annotated, so some label noise exists
- Categories with fewer training examples (Holidays, Communication, Careers) may have lower accuracy
- Very long articles are truncated to 1,024 tokens, so classification is based on the beginning of the article
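One possible workaround for the truncation limitation — not part of the released model, just a common pattern — is to classify overlapping chunks of a long article and average the per-chunk probabilities. A minimal sketch, where `classify_fn` stands for any function mapping a text chunk to a probability vector (e.g. the tokenize/forward/softmax steps from the Usage section):

```python
import torch

def chunk_words(words, size=800, stride=600):
    """Overlapping word windows; stride < size keeps some context shared between chunks."""
    if len(words) <= size:
        return [" ".join(words)]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += stride
    return chunks

def classify_long(text, classify_fn, size=800, stride=600):
    """Average per-chunk probability vectors so late content also contributes."""
    chunks = chunk_words(text.split(), size, stride)
    return torch.stack([classify_fn(c) for c in chunks]).mean(dim=0)
```

Word counts are only a rough proxy for token counts, so the `size` here is deliberately below the model's 1,024-token limit.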

## Citation

```bibtex
@misc{modernbert,
    title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
    author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
    year={2024},
    eprint={2412.13663},
    archivePrefix={arXiv},
}
```

IAB Content Taxonomy: [github.com/InteractiveAdvertisingBureau/Taxonomies](https://github.com/InteractiveAdvertisingBureau/Taxonomies)
