---
language:
- en
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
- arxiv
- scientific-text-classification
- scibert
- streamlit-demo
datasets:
- librarian-bots/arxiv-metadata-snapshot
metrics:
- accuracy
- f1
---

# Article Topic Service SciBERT

A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.

## Labels

- Artificial Intelligence
- Natural Language Processing
- Computer Vision
- Machine Learning
- Computer Science Theory and Algorithms
- Mathematics
- Statistics
- Electrical Engineering
- Astrophysics
- Condensed Matter Physics
- Quantum Physics
- Quantitative Biology

## Dataset

A balanced 12-class subset built from `librarian-bots/arxiv-metadata-snapshot`, split as follows:

- Train: 30,000 examples
- Validation: 3,600 examples
- Test: 3,600 examples
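
For illustration, a balanced subset of this kind can be drawn by grouping records by label and sampling the same number from each group. The sketch below is a minimal standard-library version and is not the pipeline used to build this dataset: `balanced_sample`, the `label` key, and the toy records are all hypothetical. If the splits are exactly balanced, 30,000 training examples over 12 classes would correspond to 2,500 per class.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_class, label_key="label", seed=0):
    """Draw the same number of records from every class (hypothetical helper)."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} records")
        sample.extend(rng.sample(pool, per_class))  # sampling without replacement

    rng.shuffle(sample)  # avoid emitting the classes in contiguous blocks
    return sample


# Toy records standing in for arXiv metadata rows (assumed shape, not the real schema).
toy_records = [
    {"label": name, "title": f"{name} paper {i}"}
    for name in ("Mathematics", "Statistics", "Astrophysics")
    for i in range(10)
]
subset = balanced_sample(toy_records, per_class=4)
print(len(subset))
```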

## Metrics

- Validation accuracy: 0.8350
- Validation macro F1: 0.8351
- Test accuracy: 0.8356
- Test macro F1: 0.8351
- Title-only test accuracy: 0.7522
- Title-only test macro F1: 0.7495
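
Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. On a balanced test set it tends to sit close to accuracy, which is consistent with the numbers above. The sketch below shows both computations in plain Python on toy labels; it is an illustration of the metric definitions, not the evaluation script used for this model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels only; these are not model outputs.
y_true = ["Mathematics", "Mathematics", "Statistics", "Statistics", "Astrophysics", "Astrophysics"]
y_pred = ["Mathematics", "Statistics", "Statistics", "Statistics", "Astrophysics", "Astrophysics"]

print(f"accuracy={accuracy(y_true, y_pred):.4f}  macro_f1={macro_f1(y_true, y_pred):.4f}")
```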

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned tokenizer and classifier from the Hub.
model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are joined into a single input string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Run a forward pass without gradients and turn the logits into class probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability class index back to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
```
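
Because several of the labels overlap topically, it can be more informative to look at the few most probable topics than at the single best one. The sketch below ranks labels by probability in plain Python. `softmax` and `top_k_labels` are hypothetical helpers, and the logits and the three-entry `id2label` mapping are toy values, not outputs or the label order of this model. With the snippet above, `probs.tolist()` and `model.config.id2label` could be passed in instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration only.
toy_id2label = {0: "Mathematics", 1: "Statistics", 2: "Astrophysics"}
toy_probs = softmax([2.0, 1.0, 0.1])

for label, prob in top_k_labels(toy_probs, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```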

## Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad `Machine Learning` category, where topical overlap with Artificial Intelligence, Natural Language Processing, Computer Vision, and Statistics remains high.