lostelf/section-classifier-imrad

Text Classification

Generated from Trainer

text-embeddings-inference

lostelf committed 19 days ago

Commit

ee753dd

1 Parent(s): 9cd0272

Update README.md

Files changed (1)

README.md +32 -5

@@ -21,7 +21,7 @@ should probably proofread and complete it, then remove this comment. -->
 # section-classifier-imrad
-This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.6404
 - Accuracy: 0.7714
@@ -31,17 +31,44 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters

 # section-classifier-imrad
+This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the [saier/unarXive_imrad_clf](https://huggingface.co/datasets/saier/unarXive_imrad_clf) dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.6404
 - Accuracy: 0.7714
 ## Model description
+This model classifies scientific paper sections into IMRaD categories (Introduction, Methods, Results, and Discussion). It's a fine-tuned version of DistilBERT trained on the unarXive dataset with weighted cross-entropy loss to handle class imbalance.
 ## Intended uses & limitations
+Intended use: Automatically categorizing sections in academic papers, particularly arXiv submissions.
+Limitations: Trained exclusively on arXiv papers; may not generalize well to non-academic text or to papers from other domains. Input is limited to 512 tokens, so longer sections must be truncated or split.
 ## Training and evaluation data
+Trained on saier/unarXive_imrad_clf, a dataset of labeled paper sections from arXiv. Training used a class-weighted loss to account for the imbalanced label distribution across the dataset's five classes.
+## How to use
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Hub repository ID of this model
+model_id = "lostelf/section-classifier-imrad"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+texts = [
+    "In this paper, we propose a new method for retrieval.",
+    "We evaluate on three benchmarks and report state-of-the-art results.",
+]
+
+# Pad to a common length and truncate to the model's maximum input length
+inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+
+# Inference only, so skip gradient tracking
+with torch.no_grad():
+    logits = model(**inputs).logits
+
+# Map the highest-scoring class index to its label name
+pred_ids = torch.argmax(logits, dim=-1).tolist()
+id2label = model.config.id2label
+for text, pred_id in zip(texts, pred_ids):
+    print(f"{id2label[pred_id]}: {text}")
+```
 ### Training hyperparameters
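
The model description in this diff mentions a weighted cross-entropy loss for class imbalance, but the README does not show how the weights were computed or applied. The following is a minimal, dependency-free sketch of one common approach, assuming inverse-frequency ("balanced") weights. The function names `balanced_class_weights` and `weighted_cross_entropy` and the toy labels are hypothetical and are not taken from the author's training code.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * count_of_class).

    ASSUMPTION: this weighting scheme is illustrative; the README does not
    say which scheme was actually used.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    # Classes that never occur get weight 0.0 to avoid division by zero.
    return [
        n_samples / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted cross-entropy averaged over the summed target weights."""
    total, norm = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += weights[target] * (log_z - row[target])
        norm += weights[target]
    return total / norm


# Toy label IDs for five classes (hypothetical data, not the real distribution).
toy_labels = [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
class_weights = balanced_class_weights(toy_labels, num_classes=5)
print(class_weights)
```

Rare classes receive larger weights, so mistakes on them contribute more to the loss. In PyTorch, the same effect is typically obtained by passing the weights as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`, which with the default mean reduction also normalizes by the summed target weights.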