lostelf committed on
Commit ee753dd · verified · 1 Parent(s): 9cd0272

Update README.md

Files changed (1)
  1. README.md +32 -5
README.md CHANGED
@@ -21,7 +21,7 @@ should probably proofread and complete it, then remove this comment. -->
 
  # section-classifier-imrad
 
- This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
+ This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the [saier/unarXive_imrad_clf](https://huggingface.co/datasets/saier/unarXive_imrad_clf) dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.6404
  - Accuracy: 0.7714
@@ -31,17 +31,44 @@ It achieves the following results on the evaluation set:
 
  ## Model description
 
- More information needed
+ This model classifies scientific paper sections into IMRaD categories (Introduction, Methods, Results, and Discussion). It's a fine-tuned version of DistilBERT trained on the unarXive dataset with weighted cross-entropy loss to handle class imbalance.
 
  ## Intended uses & limitations
 
- More information needed
+ Intended use: Automatically categorizing sections in academic papers, particularly arXiv submissions.
+
+ Limitations: Trained exclusively on arXiv papers; may not generalize well to non-academic text or to text from other domains. Input is limited to 512 tokens, so longer segments must be truncated or split.
 
  ## Training and evaluation data
 
- More information needed
+ Trained on saier/unarXive_imrad_clf, a dataset of labeled paper sections from arXiv. Training uses class weights to compensate for the imbalanced label distribution across the five IMRaD categories.
+
+ ## How to use
+
+ ```
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "your-username/section-classifier-imrad"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ texts = [
+     "In this paper, we propose a new method for retrieval.",
+     "We evaluate on three benchmarks and report state-of-the-art results."
+ ]
+
+ inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ pred_ids = torch.argmax(logits, dim=-1).tolist()
+ id2label = model.config.id2label
 
- ## Training procedure
+ for t, i in zip(texts, pred_ids):
+     print(id2label[i], ":", t)
+ ```
 
  ### Training hyperparameters
 
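The updated README says the model was trained with weighted cross-entropy loss to handle class imbalance, but the commit does not show how the weights were derived. The following is a minimal sketch of one common approach, assuming inverse-frequency class weights. The function names, label ids, and logits are hypothetical and chosen for illustration; they are not taken from the training script.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class: n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    # max(..., 1) avoids division by zero for classes absent from the sample.
    return [n / (num_classes * max(counts.get(c, 0), 1)) for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example negative log-likelihoods.

    Dividing by the sum of the applied weights mirrors the 'mean'
    reduction of torch.nn.CrossEntropyLoss(weight=...).
    """
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm


# Hypothetical label ids for a small imbalanced sample (not the real dataset).
labels = [0, 0, 0, 0, 1, 1, 2, 3, 4, 4]
weights = inverse_frequency_weights(labels, num_classes=5)

# Uniform logits stand in for model outputs, one row per example.
logits = [[0.0] * 5 for _ in labels]
loss = weighted_cross_entropy(logits, labels, weights)
print(weights, round(loss, 4))
```

Rare classes receive larger weights, so their errors contribute more to the loss. In a PyTorch training loop the same weights could be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`; whether the author did exactly that is an assumption.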