language:
  - en
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
  - arxiv
  - scientific-text-classification
  - scibert
  - streamlit-demo
datasets:
  - librarian-bots/arxiv-metadata-snapshot
metrics:
  - accuracy
  - f1

Article Topic Service SciBERT

A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.

Labels

Artificial Intelligence
Natural Language Processing
Computer Vision
Machine Learning
Computer Science Theory and Algorithms
Mathematics
Statistics
Electrical Engineering
Astrophysics
Condensed Matter Physics
Quantum Physics
Quantitative Biology

Dataset

Balanced 12-class subset built from librarian-bots/arxiv-metadata-snapshot.

Train: 30,000 examples
Validation: 3,600 examples
Test: 3,600 examples
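
As a quick sanity check on the split sizes, the sketch below works out the implied per-class counts. It assumes "balanced" means every one of the 12 classes has exactly the same number of examples in each split; the card does not state per-class counts, so treat the result as an inference rather than a reported figure. The names `SPLIT_SIZES` and `per_class_counts` are illustrative only.

```python
# Split sizes as reported in the Dataset section.
SPLIT_SIZES = {"train": 30_000, "validation": 3_600, "test": 3_600}
NUM_CLASSES = 12


def per_class_counts(split_sizes, num_classes):
    """Return examples per class for each split, assuming exact balance."""
    counts = {}
    for split, size in split_sizes.items():
        per_class, remainder = divmod(size, num_classes)
        if remainder:
            # An uneven split would contradict the exact-balance assumption.
            raise ValueError(f"{split} size {size} is not divisible by {num_classes}")
        counts[split] = per_class
    return counts


print(per_class_counts(SPLIT_SIZES, NUM_CLASSES))
# {'train': 2500, 'validation': 300, 'test': 300}
```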

Metrics

Validation accuracy: 0.8350
Validation macro F1: 0.8351
Test accuracy: 0.8356
Test macro F1: 0.8351
Title-only test accuracy: 0.7522
Title-only test macro F1: 0.7495
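
To make the cost of dropping the abstract explicit, the following short calculation takes the test-set figures above and computes the absolute difference between full-input and title-only scores. The variable names are illustrative, and no new measurements are introduced.

```python
# Test-set figures copied from the Metrics section.
full_input = {"accuracy": 0.8356, "macro_f1": 0.8351}
title_only = {"accuracy": 0.7522, "macro_f1": 0.7495}

# Absolute drop when the abstract is withheld, rounded to 4 decimals.
drop = {name: round(full_input[name] - title_only[name], 4) for name in full_input}
print(drop)
# {'accuracy': 0.0834, 'macro_f1': 0.0856}
```

In other words, the abstract contributes roughly 8 points of accuracy and macro F1 on the test set.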

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned tokenizer and classifier from the Hub.
model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are combined into a single input string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
# Inputs longer than 256 tokens are truncated.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.inference_mode():
    # Softmax converts the logits into one probability per label.
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability index back to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
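
Two small helpers can make the snippet above easier to reuse. This is a minimal sketch, not part of the released code: `format_article` and `top_k` are hypothetical names, the "Title: / Abstract:" layout is copied from the usage example, and the title-only form (omitting the abstract block entirely) is an assumption, since the card does not say how title-only inputs were formatted for evaluation.

```python
def format_article(title, abstract=None):
    """Build the model input string in the layout used by the usage example."""
    text = f"Title: {title.strip()}"
    if abstract and abstract.strip():
        text += f"\n\nAbstract: {abstract.strip()}"
    return text


def top_k(probabilities, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], prob) for index, prob in ranked[:k]]


# Toy values for illustration only; these are not real model outputs.
toy_id2label = {0: "Machine Learning", 1: "Statistics", 2: "Mathematics"}
toy_probs = [0.55, 0.30, 0.15]

print(format_article("Large language models for scientific document classification"))
print(top_k(toy_probs, toy_id2label, k=2))
# [('Machine Learning', 0.55), ('Statistics', 0.3)]
```

With the real model, the same helper could be called as `top_k(probs.tolist(), model.config.id2label)`. Inspecting the top few labels instead of only the argmax is useful for borderline articles that sit between neighbouring categories.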

Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad Machine Learning category, where topical overlap with AI, NLP, CV, and Statistics remains high.
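
One way to check this kind of per-class behaviour on your own predictions is to compute F1 per label and count which labels a weak class is mistaken for. The sketch below uses only the standard library; the function names are hypothetical and the label lists are toy data chosen for illustration, not results from this model.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Compute F1 for every label that appears in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


def confusions(y_true, y_pred, label):
    """Count the wrong labels assigned to examples whose true label is `label`."""
    return Counter(p for t, p in zip(y_true, y_pred) if t == label and p != label)


# Toy data for illustration only.
y_true = ["Machine Learning", "Machine Learning", "Statistics", "Statistics", "Quantum Physics"]
y_pred = ["Statistics", "Machine Learning", "Statistics", "Statistics", "Quantum Physics"]

scores = per_class_f1(y_true, y_pred)
weakest = min(scores, key=scores.get)
print(weakest, round(scores[weakest], 4))         # Machine Learning 0.6667
print(round(macro_f1(scores), 4))                 # 0.8222
print(confusions(y_true, y_pred, weakest))        # Counter({'Statistics': 1})
```

Because the splits are balanced, macro F1 and accuracy are expected to stay close, which matches the reported figures; the per-class view is what exposes overlap between neighbouring categories.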