metadata
language:
  - en
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
  - arxiv
  - scientific-text-classification
  - scibert
  - streamlit-demo
datasets:
  - librarian-bots/arxiv-metadata-snapshot
metrics:
  - accuracy
  - f1

Article Topic Service SciBERT

A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.

Labels

  • Artificial Intelligence
  • Natural Language Processing
  • Computer Vision
  • Machine Learning
  • Computer Science Theory and Algorithms
  • Mathematics
  • Statistics
  • Electrical Engineering
  • Astrophysics
  • Condensed Matter Physics
  • Quantum Physics
  • Quantitative Biology

Dataset

Balanced 12-class subset built from librarian-bots/arxiv-metadata-snapshot.

  • Train: 30,000 examples
  • Validation: 3,600 examples
  • Test: 3,600 examples
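As a quick sanity check on the split sizes, the sketch below works out how many examples each label would get per split. It assumes that "balanced" means exactly equal counts per class, which the card does not state explicitly; the names `SPLITS` and `per_class_counts` are illustrative only.

```python
NUM_CLASSES = 12
SPLITS = {"train": 30_000, "validation": 3_600, "test": 3_600}


def per_class_counts(splits, num_classes):
    """Return examples per class for each split, assuming an exactly even split."""
    counts = {}
    for name, total in splits.items():
        per_class, remainder = divmod(total, num_classes)
        if remainder:
            # An uneven total cannot be perfectly balanced across classes.
            raise ValueError(f"{name}: {total} is not divisible by {num_classes}")
        counts[name] = per_class
    return counts


PER_CLASS = per_class_counts(SPLITS, NUM_CLASSES)
print(PER_CLASS)  # {'train': 2500, 'validation': 300, 'test': 300}
```

Under that assumption, each test-set label is backed by 300 examples, so a single misclassified example moves per-class recall by about 0.33 percentage points.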

Metrics

  • Validation accuracy: 0.8350
  • Validation macro F1: 0.8351
  • Test accuracy: 0.8356
  • Test macro F1: 0.8351
  • Title-only test accuracy: 0.7522
  • Title-only test macro F1: 0.7495
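For readers unfamiliar with macro F1, the following is a minimal pure-Python sketch of the metric: F1 is computed per label and then averaged with equal weight. It is not the evaluation script used for this model (the card does not say which one was used), and the toy labels `a`, `b`, `c` are made up for illustration.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data, not taken from the model's evaluation.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
toy_macro_f1 = macro_f1(y_true, y_pred, labels=["a", "b", "c"])
toy_accuracy = accuracy(y_true, y_pred)
print(round(toy_macro_f1, 4), round(toy_accuracy, 4))
```

On a class-balanced test set, accuracy and macro F1 tend to land close together, which is consistent with the near-identical figures reported above.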

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned tokenizer and classifier from the Hub.
model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are passed as one string; inputs longer than 256 tokens are truncated.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.inference_mode():
    # Softmax over the logits of the single input gives per-label probabilities.
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability index back to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
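Two small helpers can make the snippet above easier to reuse: one builds the input string in the same "Title: ... Abstract: ..." layout, and one ranks labels by probability instead of returning only the top one. Both `format_article` and `top_k_labels` are hypothetical names, not part of this repository, and the title-only format shown here is an assumption, since the card does not say how title-only inputs were constructed for evaluation.

```python
def format_article(title, abstract=None):
    """Build the model input string from a title and an optional abstract."""
    text = f"Title: {title.strip()}"
    if abstract and abstract.strip():
        text += f"\n\nAbstract: {abstract.strip()}"
    return text


def top_k_labels(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, highest first.

    `probs` is a plain sequence of floats, e.g. `probs.tolist()` from the
    usage snippet; `id2label` is a mapping such as `model.config.id2label`.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], prob) for index, prob in ranked[:k]]


# Toy values for illustration only; real probabilities come from the model.
example_text = format_article("Graph neural networks for molecules", "We propose a method.")
example_top = top_k_labels([0.1, 0.7, 0.2], {0: "x", 1: "y", 2: "z"}, k=2)
print(example_text)
print(example_top)
```

Inspecting the top few labels is useful for this label set, since several categories overlap and the second-ranked label is often a plausible alternative.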

Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad Machine Learning category, where topical overlap with AI, NLP, CV, and Statistics remains high.
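One way to examine this kind of overlap is to count which (true label, predicted label) pairs are confused most often on a held-out set. The sketch below does that with the standard library; `top_confusions` is an illustrative helper, and the toy labels are invented rather than drawn from the model's actual predictions.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (true, pred) for true, pred in zip(y_true, y_pred) if true != pred
    )
    return errors.most_common(n)


# Hypothetical predictions for illustration only.
y_true = ["Machine Learning", "Machine Learning", "Machine Learning", "Statistics"]
y_pred = ["Artificial Intelligence", "Artificial Intelligence", "Statistics", "Statistics"]
confusions = top_confusions(y_true, y_pred, n=2)
print(confusions)
```

Run against real test predictions, a listing like this would show whether Machine Learning errors concentrate on one neighbouring class or spread across several.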