Article Topic Service SciBERT
A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.
Labels
- Artificial Intelligence
- Natural Language Processing
- Computer Vision
- Machine Learning
- Computer Science Theory and Algorithms
- Mathematics
- Statistics
- Electrical Engineering
- Astrophysics
- Condensed Matter Physics
- Quantum Physics
- Quantitative Biology
Dataset
Balanced 12-class subset built from librarian-bots/arxiv-metadata-snapshot.
- Train: 30,000 examples
- Validation: 3,600 examples
- Test: 3,600 examples
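If the splits are perfectly balanced, the sizes above work out to 30,000 / 12 = 2,500 training examples and 3,600 / 12 = 300 validation and test examples per class. The card does not describe how the subset was drawn, so the following is only a minimal sketch of one way to sample a class-balanced subset. The `balanced_sample` helper and the `"label"` record field are hypothetical names, not part of the dataset or the original pipeline.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_class, seed=0):
    """Draw exactly `per_class` records for every label, without replacement.

    `records` is an iterable of dicts with a "label" key (hypothetical schema).
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"label {label!r} has only {len(pool)} records")
        sample.extend(rng.sample(pool, per_class))

    # Shuffle so examples of the same class are not grouped together.
    rng.shuffle(sample)
    return sample
```

Fixing the seed keeps the subset reproducible across runs.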
Metrics
- Validation accuracy: 0.8350
- Validation macro F1: 0.8351
- Test accuracy: 0.8356
- Test macro F1: 0.8351
- Title-only test accuracy: 0.7522
- Title-only test macro F1: 0.7495
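For reference, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of size. The sketch below shows how the two metric types reported above are defined. It is a plain-Python illustration, not the evaluation code used for this model, and the function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

On a balanced test set such as this one, accuracy and macro F1 tend to be close, which is consistent with the figures above.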
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are joined into a single input string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Run the forward pass without gradient tracking and convert logits to probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability class index to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
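Because Machine Learning overlaps with several neighbouring classes, it can help to look at the few most probable labels instead of only the top one. The two helpers below are a minimal sketch of that idea. `build_input` and `top_k` are hypothetical names, and the title-only format (the `Title:` line alone) is an assumption, since the card does not state how title-only inputs were formatted for evaluation.

```python
def build_input(title, abstract=None):
    """Format a title and optional abstract the way the usage example does."""
    title = title.strip()
    if abstract and abstract.strip():
        return f"Title: {title}\n\nAbstract: {abstract.strip()}"
    # Assumed title-only format: drop the abstract section entirely.
    return f"Title: {title}"


def top_k(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first.

    `probs` is a plain sequence of floats; `id2label` maps class index to name.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], float(p)) for index, p in ranked[:k]]
```

With the model loaded as in the usage example, `top_k(probs.tolist(), model.config.id2label)` would list the three strongest candidates for an input built with `build_input(title, abstract)`.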
Notes
The current baseline is strongest on physics-heavy classes and weakest on the broad Machine Learning category, where topical overlap with AI, NLP, CV, and Statistics remains high.