---
language:
- en
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
- arxiv
- scientific-text-classification
- scibert
- streamlit-demo
datasets:
- librarian-bots/arxiv-metadata-snapshot
metrics:
- accuracy
- f1
---

# Article Topic Service SciBERT

SciBERT text classifier for scientific article topic prediction from article title and abstract.

## Labels

- Artificial Intelligence
- Natural Language Processing
- Computer Vision
- Machine Learning
- Computer Science Theory and Algorithms
- Mathematics
- Statistics
- Electrical Engineering
- Astrophysics
- Condensed Matter Physics
- Quantum Physics
- Quantitative Biology

## Dataset

Balanced 12-class subset built from `librarian-bots/arxiv-metadata-snapshot`.

- Train: 30,000 examples
- Validation: 3,600 examples
- Test: 3,600 examples

## Metrics

- Validation accuracy: 0.8350
- Validation macro F1: 0.8351
- Test accuracy: 0.8356
- Test macro F1: 0.8351
- Title-only test accuracy: 0.7522
- Title-only test macro F1: 0.7495

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Input format: title and abstract joined into a single string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Run inference without gradient tracking and convert logits to probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

# Map the highest-probability class index back to its label name.
predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
```

## Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad `Machine Learning` category, where topical overlap with AI, NLP, CV, and Statistics remains high.
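
To check per-class strength and class overlap of the kind described in the Notes, one option is to compute per-class F1, macro F1, and the most frequent confusion pairs from gold and predicted labels. The following is a minimal, dependency-free sketch, not the evaluation code used for this model. The helper names (`per_class_f1`, `macro_f1`, `top_confusions`) and the four-example toy data are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, treating each label one-vs-rest."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


# Hypothetical toy data, not taken from the model's test set.
labels = ["Machine Learning", "Statistics", "Quantum Physics"]
y_true = ["Machine Learning", "Machine Learning", "Statistics", "Quantum Physics"]
y_pred = ["Statistics", "Machine Learning", "Statistics", "Quantum Physics"]

print(per_class_f1(y_true, y_pred, labels))
print(macro_f1(y_true, y_pred, labels))
print(top_confusions(y_true, y_pred))
```

In practice, `y_pred` would come from running the Usage snippet over a labeled split and collecting `predicted_label` for each example. Sorting the per-class scores then shows which categories drag the macro average down.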