Article Topic Service SciBERT

A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.

Labels

  • Artificial Intelligence
  • Natural Language Processing
  • Computer Vision
  • Machine Learning
  • Computer Science Theory and Algorithms
  • Mathematics
  • Statistics
  • Electrical Engineering
  • Astrophysics
  • Condensed Matter Physics
  • Quantum Physics
  • Quantitative Biology

Dataset

Balanced 12-class subset built from librarian-bots/arxiv-metadata-snapshot.

  • Train: 30,000 examples
  • Validation: 3,600 examples
  • Test: 3,600 examples
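
If the splits are exactly balanced, the sizes above work out to 2,500 training examples and 300 validation and 300 test examples per class (30,000 / 12 and 3,600 / 12). The card does not describe how the subset was sampled, so the following is only a minimal sketch of one way to draw disjoint, class-balanced splits. The function name `balanced_split`, the `(text, label)` record shape, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_split(records, per_class, seed=0):
    """Sample disjoint splits with the same number of examples per label.

    records:   iterable of (text, label) pairs (hypothetical record shape).
    per_class: dict such as {"train": 2500, "validation": 300, "test": 300}.
    """
    # Group the records by label so each class can be sampled independently.
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    needed = sum(per_class.values())
    splits = {name: [] for name in per_class}

    for label, items in sorted(by_label.items()):
        if len(items) < needed:
            raise ValueError(f"label {label!r} has only {len(items)} examples")
        rng.shuffle(items)
        # Hand out consecutive, non-overlapping slices to each split.
        start = 0
        for name, count in per_class.items():
            splits[name].extend(items[start:start + count])
            start += count
    return splits
```

Calling `balanced_split(records, {"train": 2500, "validation": 300, "test": 300})` on records covering all 12 labels would yield splits of the sizes listed above, provided every class has at least 3,100 examples.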

Metrics

  • Validation accuracy: 0.8350
  • Validation macro F1: 0.8351
  • Test accuracy: 0.8356
  • Test macro F1: 0.8351
  • Title-only test accuracy: 0.7522
  • Title-only test macro F1: 0.7495
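
For readers unfamiliar with the two metrics, the sketch below computes accuracy and macro F1 under the standard definition, where macro F1 is the unweighted mean of the per-class F1 scores. It is not the evaluation script used for this model, which the card does not include; the function name `accuracy_and_macro_f1` is hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro F1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro_f1 = sum(f1_scores) / len(f1_scores)  # every class weighs the same
    return accuracy, macro_f1
```

On a balanced test set such as this one, accuracy and macro F1 tend to be close, which is consistent with the near-identical values reported above.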

Usage

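from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are passed together as a single string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
# Inputs longer than 256 tokens are truncated.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.inference_mode():
    # Softmax over the logits of the single input gives class probabilities.
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability class index back to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)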
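
Two small helpers can make the snippet above easier to reuse: one that builds the input string and one that returns the top few labels instead of only the best one. This is a sketch under stated assumptions. The "Title: ... / Abstract: ..." template is copied from the example string above; the title-only format, the whitespace normalization, and the names `build_input` and `top_k` are illustrative guesses, not part of the published model.

```python
import math


def build_input(title, abstract=None):
    """Format a title and optional abstract like the usage example's text."""
    # Assumption: collapsing runs of whitespace is harmless for the model.
    title = " ".join(title.split())
    if not abstract:
        # Assumption: title-only inputs keep just the "Title:" part.
        return f"Title: {title}"
    abstract = " ".join(abstract.split())
    return f"Title: {title}\n\nAbstract: {abstract}"


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]
```

With the model loaded as above, `top_k(model(**inputs).logits[0].tolist(), model.config.id2label)` would list the three most likely topics with their probabilities, which can help when an article sits between overlapping categories.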

Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad Machine Learning category, where topical overlap with AI, NLP, CV, and Statistics remains high.
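
One way to inspect this kind of overlap is to count which (true label, predicted label) pairs account for the most errors. The sketch below assumes you already have lists of gold and predicted labels for an evaluation set; `most_confused_pairs` is a hypothetical helper, not something shipped with the model.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top=5):
    """Return the most frequent (true, predicted) pairs among the errors."""
    # Only misclassified examples are counted; correct predictions are skipped.
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)
```

If Machine Learning is the weakest class, pairs with "Machine Learning" as the true label would be expected near the top of this list.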

Model size: 0.1B params (Safetensors, tensor type F32)