This is a text classification model, fully fine-tuned from [`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased). It reuses the main BERT encoder and fits an ordinal regression head on the `[CLS]` token. The model is fine-tuned on the certainty labels collected in [Wurl et al. (2024): _Understanding Fine-Grained Distortions in Reports of Scientific Findings_](https://aclanthology.org/2024.findings-acl.369/). The authors originally collected certainty annotations from humans using a 4-point Likert scale ranging from (1) Uncertain to (4) Certain. Because the resulting dataset suffers from severe class imbalance, we merge the classes (1) Uncertain and (2) Somewhat Uncertain.

### Dataset Statistics

There are 1330 examples in the training set and 334 in the test set. Each example is one sentence long. Examples are filtered from the [copenlu/spiced](https://huggingface.co/datasets/copenlu/spiced) dataset to have a final score greater than or equal to 4. The original base rates are as follows:

| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 5.5970 | 7.1856 |
| 1 - Somewhat Uncertain | 15.2985 | 17.6647 |
| 2 - Somewhat Certain | 32.3881 | 33.2335 |
| 3 - Certain | 46.7164 | 41.9162 |

After combining classes 0 and 1, we obtain the base rates below. Note that this mimics the procedure adopted in the original paper.

| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 20.8955 | 24.8503 |
| 1 - Somewhat Certain | 32.3881 | 33.2335 |
| 2 - Certain | 46.7164 | 41.9162 |

### Hyperparameter Optimization

The published model is one of 29 different model configurations.
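The class merge described above amounts to remapping the original 4-point labels onto a 3-class scheme. A minimal sketch of that remapping (the function name and mapping constant are illustrative, not part of the released code):

```python
# Remap the original 4-point certainty labels (0-3) onto the merged
# 3-class scheme: (0) Uncertain and (1) Somewhat Uncertain collapse
# into a single class 0, shifting the remaining classes down by one.
MERGE_MAP = {0: 0, 1: 0, 2: 1, 3: 2}

def merge_labels(labels):
    """Map 4-point Likert labels onto the merged 3-class ordinal scheme."""
    return [MERGE_MAP[y] for y in labels]

print(merge_labels([0, 1, 2, 3]))  # → [0, 0, 1, 2]
```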
The selected model maximizes Quadratic Weighted Kappa (QWK, implemented using [`cohen_kappa_score` with quadratic weights](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)), which is better suited to ordinal problems such as Likert scales. Under this metric, a random model would score 0. We adopt this metric, as opposed to accuracy or macro F1, to account for class imbalance. Here are the classification report and test set metrics:

```
17:44:36 INFO test loss=0.9565 acc=0.578 QWK=0.5004
17:44:36 INFO               precision    recall  f1-score   support

           0       0.58      0.51      0.54        83
           1       0.47      0.46      0.46       111
           2       0.65      0.71      0.68       140

    accuracy                           0.58       334
   macro avg       0.57      0.56      0.56       334
weighted avg       0.57      0.58      0.57       334
```

We conduct a sweep over the following hyperparameters:

- Freeze / unfreeze
- Learning rate: 1e-6 through 1e-3
- Batch size: 16, 32
- Hidden size dimensions: 256, 128
- Warmup ratio: 0.05, 0.1, 0.2, 0.3
- Epochs: 30 (with patience)

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
```
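For reference, the model-selection metric can be reproduced with scikit-learn as follows (the label arrays below are placeholders, not actual model outputs):

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic Weighted Kappa penalizes disagreements by the squared distance
# between ordinal classes, so predicting 2 for a true 0 costs more than
# predicting 1. A score of 1 is perfect agreement; 0 is chance level.
y_true = [0, 0, 1, 1, 2, 2]  # placeholder gold labels
y_pred = [0, 1, 1, 1, 2, 0]  # placeholder model predictions

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK: {qwk:.4f}")
```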
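Building on the loading snippet above, here is a hedged inference sketch. The label names follow the merged 3-class scheme described earlier; note that, depending on how the custom ordinal head (loaded via `trust_remote_code`) exposes its outputs, a plain argmax over the logits may need to be replaced by the head's own decoding rule:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)

# Label order assumed to match the merged 3-class scheme above.
labels = ["Uncertain", "Somewhat Certain", "Certain"]
sentence = "The drug may reduce symptoms in some patients."  # example input

inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: argmax decoding; an ordinal head may define its own decoding.
print(labels[logits.argmax(dim=-1).item()])
```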