# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## Overview
This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:
- AP (Associated Press)
- UPI (United Press International)
- NEA (Newspaper Enterprise Association)
The models are designed for historical news classification from public-domain newswire articles.
## Model Architecture

- Base Model: `bert-base-uncased`
- Task: Binary classification (1 if from the specific newswire, 0 otherwise)
- Optimizer: AdamW
- Loss Function: Binary Cross-Entropy with Logits
- Batch Size: 16
- Epochs: 4
- Learning Rate: 2e-5
- Device: TPU (v2-8) in Google Colab
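The configuration above can be sketched as a single optimization step. The original training script is not included in the card, so `DummyEncoder` below is a hypothetical stand-in for the BERT encoder plus classification head; only the optimizer, learning rate, batch size, and loss function come from the card:

```python
import torch
from torch import nn

# Hypothetical stand-in for bert-base-uncased + a classification head.
class DummyEncoder(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.body = nn.Linear(hidden_size, hidden_size)
        self.head = nn.Linear(hidden_size, 1)  # single logit for the binary label

    def forward(self, x):
        return self.head(torch.relu(self.body(x)))

model = DummyEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the card
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy with logits

# One training step on a fake batch of 16 (the card's batch size).
features = torch.randn(16, 32)
labels = torch.randint(0, 2, (16, 1)).float()

optimizer.zero_grad()
logits = model(features)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(round(loss.item(), 4))
```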
## Training Data

- Articles: 4,000 per training round (1,000 from the target newswire, 3,000 from other sources)
- Features Used: First 100 characters of each article.
- Labeling: 1 for articles from the target newswire, 0 for all others.
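The truncation and labeling scheme can be sketched in a few lines. The function name and arguments below are illustrative, not the original preprocessing code:

```python
def make_example(article_text, source, target="AP", n_chars=100):
    """Truncate an article and label it 1 if it came from the target newswire."""
    label = 1 if source == target else 0
    return article_text[:n_chars], label

text, label = make_example("(AP) President speaks at conference...", "AP")
print(label)  # 1 (article is from the target newswire)
text, label = make_example("(UPI) Markets fall sharply...", "UPI", target="AP")
print(label)  # 0 (article is from another source)
```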
## Model Performance
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| AP | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
| UPI | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| NEA | 0.9875 | 0.9880 | 0.9875 | 0.9876 |
## Usage

### Installation

```bash
pip install transformers torch
```
### Example Inference (AP Classifier)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Each classifier lives in its own subfolder of the repository (AP, UPI, NEA).
repo = "mike-mcrae/newswire_classifier"
model = AutoModelForSequenceClassification.from_pretrained(repo, subfolder="AP")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="AP")

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

prediction = outputs.logits.argmax(dim=-1).item()
print("AP Article" if prediction == 1 else "Not AP Article")
```
## Recommended Usage Notes

- The models were trained on the first 100 characters of the headline, the author line, and the first 100 characters of the article body, since the newswire attribution usually appears in these sections. Using the same format at inference time may improve accuracy.
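A minimal sketch of reproducing that training-time format at inference. The separator and field order here are assumptions, since the card does not specify the exact concatenation:

```python
def build_input(headline, author, article, n_chars=100):
    """Join truncated headline, author line, and truncated article body."""
    return " ".join([headline[:n_chars], author, article[:n_chars]])

text = build_input(
    "President speaks at conference",
    "By JOHN SMITH",
    "(AP) WASHINGTON - The president addressed reporters on Tuesday...",
)
print(text)
```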
## Licensing & Data Source
- Training Data: Newspaper articles from public-domain sources.
- License: Public domain (for data) and MIT License (for model and code).
## Citation
If you use these models, please cite:
```bibtex
@misc{newswire_classifier,
  author    = {McRae, Michael},
  title     = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/username/newswire_classifier}
}
```