# ModernBERT IAB News Classifier

A ModernBERT-base model fine-tuned to classify English news articles into 35 IAB Content Taxonomy 3.1 Tier 1 categories.

The model supports top-k classification: it returns multiple categories with confidence scores, which is useful for articles that span multiple topics.

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "mdonigian/modernbert-iab-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The Federal Reserve raised interest rates by 25 basis points on Wednesday, citing persistent inflation concerns despite recent banking sector turmoil."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
top5_probs, top5_ids = torch.topk(probs, k=5)

for prob, idx in zip(top5_probs, top5_ids):
    label = model.config.id2label[idx.item()]
    print(f"  {label:40s} {prob.item():.4f}")
```

Output:

```
  Business and Finance                     0.8234
  Personal Finance                         0.0891
  Politics                                 0.0412
  Law                                      0.0198
  Education                                0.0067
```

## Batch Inference

```python
articles = ["article text 1...", "article text 2...", ...]

inputs = tokenizer(articles, return_tensors="pt", truncation=True, max_length=1024, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)

for i, article_probs in enumerate(probs):
    top3_probs, top3_ids = torch.topk(article_probs, k=3)
    categories = [
        (model.config.id2label[idx.item()], prob.item())
        for prob, idx in zip(top3_probs, top3_ids)
    ]
    print(f"Article {i}: {categories}")
```

## Performance

| Metric | Score |
|---|---|
| Top-1 Accuracy | 75.5% |
| Top-2 Accuracy | 88.2% |
| Top-3 Accuracy | 92.9% |
| Top-5 Accuracy | 95.8% |
| Top-10 Accuracy | 98.6% |
| Macro F1 | 0.71 |
| Weighted F1 | 0.75 |
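For reference, top-k accuracy as reported above can be computed directly from the probability matrix with `torch.topk`. This is a minimal, self-contained sketch on synthetic data — `probs` and `labels` here are stand-ins for real model outputs and gold labels, not values from this model:

```python
import torch

def topk_accuracy(probs: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of rows whose true label is among the k highest-probability classes."""
    topk_ids = probs.topk(k, dim=-1).indices            # shape (n, k)
    hits = (topk_ids == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Synthetic example: 4 articles, 5 classes
probs = torch.tensor([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.15, 0.50, 0.25, 0.05, 0.05],
    [0.35, 0.30, 0.20, 0.10, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.60],
])
labels = torch.tensor([0, 2, 0, 4])

print(topk_accuracy(probs, labels, k=1))  # 0.75 — row 1's true label is ranked second
print(topk_accuracy(probs, labels, k=2))  # 1.0
```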

### Confidence Distribution

| Confidence Threshold | % of Predictions |
|---|---|
| >= 0.9 | 32.7% |
| >= 0.8 | 48.6% |
| >= 0.7 | 59.6% |
| >= 0.5 | 78.6% |

Mean confidence is 0.79 on correct predictions vs. 0.52 on incorrect ones.

### Most Confused Pairs

| True Label | Predicted As | Count |
|---|---|---|
| Law | Crime | 26 |
| Pop Culture | Entertainment | 23 |
| Crime | Law | 22 |
| Shopping | Style & Fashion | 21 |
| Entertainment | Pop Culture | 21 |
| Medical Health | Healthy Living | 17 |
| Disasters | Science | 16 |
| Politics | Law | 15 |
| Home & Garden | Shopping | 15 |
| Business and Finance | Personal Finance | 15 |

## Categories

The model classifies into the 35 IAB Content Taxonomy 3.1 Tier 1 categories:

| Category | Category | Category |
|---|---|---|
| Attractions | Automotive | Books and Literature |
| Business and Finance | Careers | Communication |
| Crime | Disasters | Education |
| Entertainment | Events | Family and Relationships |
| Fine Art | Food & Drink | Healthy Living |
| Hobbies & Interests | Holidays | Home & Garden |
| Law | Medical Health | Personal Celebrations & Life Events |
| Personal Finance | Pets | Politics |
| Pop Culture | Real Estate | Religion & Spirituality |
| Science | Shopping | Sports |
| Style & Fashion | Technology & Computing | Travel |
| Video Gaming | War and Conflicts | |

## Training Details

- **Base model:** `answerdotai/ModernBERT-base` (149M parameters)
- **Training data:** 106K news articles labeled by GPT-5-nano via OpenAI Batch API, downsampled to ~47K with a 2,000 cap per category
- **Sources:** Common Crawl CC-NEWS archives + Spider.cloud targeted crawls across 70+ news domains
- **Max sequence length:** 1,024 tokens
- **Training:** 5 epochs, cosine LR schedule, bf16, Flash Attention 2, AdamW (lr=3e-5)
- **Hardware:** Single NVIDIA RTX 4090 (24 GB)

## Recommended Usage

- **Single-label classification:** take the argmax of the logits
- **Multi-label / topic tagging:** take all categories above a confidence threshold (e.g. 0.15)
- **Confidence filtering:** discard predictions below 0.5 confidence for higher precision
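The multi-label option can be sketched as follows. To keep the snippet self-contained, `id2label` and `probs` are stubbed stand-ins for `model.config.id2label` and a softmax output row — the real label map has all 35 categories:

```python
import torch

# Stand-ins for model.config.id2label and one row of softmax output
id2label = {0: "Business and Finance", 1: "Personal Finance", 2: "Politics", 3: "Sports"}
probs = torch.tensor([0.55, 0.28, 0.12, 0.05])

THRESHOLD = 0.15  # the example threshold from the list above; tune per application

# Keep every category whose probability clears the threshold
tags = [
    (id2label[i], round(p.item(), 4))
    for i, p in enumerate(probs)
    if p.item() >= THRESHOLD
]
print(tags)  # [('Business and Finance', 0.55), ('Personal Finance', 0.28)]
```

Because the model is trained with a softmax (single-label) head, the per-class scores compete with each other; a low threshold such as 0.15 is what makes this usable for topic tagging.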

## Limitations

- Trained on English-language news articles only; performance on non-news text or other languages will be lower
- Labels were generated by GPT-5-nano, not human-annotated, so some label noise exists
- Categories with fewer training examples (Holidays, Communication, Careers) may have lower accuracy
- Very long articles are truncated to 1,024 tokens, so classification is based on the beginning of the article
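One possible workaround for the truncation limitation — not part of the released model, just a common pattern — is to classify overlapping chunks of a long article and average the per-chunk probabilities. A minimal sketch, where `classify_fn` stands for any function mapping a text chunk to a probability vector (e.g. the tokenize/forward/softmax steps from the Usage section):

```python
import torch

def chunk_words(words, size=800, stride=600):
    """Overlapping word windows; stride < size keeps some context shared between chunks."""
    if len(words) <= size:
        return [" ".join(words)]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += stride
    return chunks

def classify_long(text, classify_fn, size=800, stride=600):
    """Average per-chunk probability vectors so late content also contributes."""
    chunks = chunk_words(text.split(), size, stride)
    return torch.stack([classify_fn(c) for c in chunks]).mean(dim=0)
```

Word counts are only a rough proxy for token counts, so the `size` here is deliberately below the model's 1,024-token limit.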

## Citation

```bibtex
@misc{modernbert,
    title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
    author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
    year={2024},
    eprint={2412.13663},
    archivePrefix={arXiv},
}
```

IAB Content Taxonomy: [github.com/InteractiveAdvertisingBureau/Taxonomies](https://github.com/InteractiveAdvertisingBureau/Taxonomies)
