This is a text classification model, fully fine-tuned from [`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased). It reuses the main BERT encoder and fits an ordinal regression head on the `[CLS]` token. The model is fine-tuned on the certainty labels collected in [Wurl et al. (2024): _Understanding Fine-Grained Distortions in Reports of Scientific Findings_](https://aclanthology.org/2024.findings-acl.369/). The authors originally collected certainty annotations from humans using a 4-point Likert scale ranging from (1) Uncertain to (4) Certain. Because the resulting dataset suffers from severe class imbalance, we merge the classes (1) Uncertain and (2) Somewhat Uncertain.

### Dataset Statistics

There are 1330 examples in the training set and 334 in the test set. Each example is one sentence long. Examples are filtered from the [copenlu/spiced](https://huggingface.co/datasets/copenlu/spiced) dataset to have a final score greater than or equal to 4. The original base rates are as follows:

| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 5.5970 | 7.1856 |
| 1 - Somewhat Uncertain | 15.2985 | 17.6647 |
| 2 - Somewhat Certain | 32.3881 | 33.2335 |
| 3 - Certain | 46.7164 | 41.9162 |

After combining classes 0 and 1, we obtain the base rates below. Note that this mimics the procedure adopted in the original paper.

| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 20.8955 | 24.8503 |
| 1 - Somewhat Certain | 32.3881 | 33.2335 |
| 2 - Certain | 46.7164 | 41.9162 |

### Hyperparameter Optimization

The published model is one of 29 different model configurations.
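The class merge described above amounts to remapping the original 4-point labels onto a 3-class scheme. A minimal sketch of that remapping (the function name and mapping constant are illustrative, not part of the released code):

```python
# Remap the original 4-point certainty labels (0-3) onto the merged
# 3-class scheme: (0) Uncertain and (1) Somewhat Uncertain collapse
# into a single class 0, shifting the remaining classes down by one.
MERGE_MAP = {0: 0, 1: 0, 2: 1, 3: 2}

def merge_labels(labels):
    """Map 4-point Likert labels onto the merged 3-class ordinal scheme."""
    return [MERGE_MAP[y] for y in labels]

print(merge_labels([0, 1, 2, 3]))  # → [0, 0, 1, 2]
```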
The selected model maximizes Quadratic Weighted Kappa (QWK, implemented using [`cohen_kappa_score` with quadratic weights](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)), which is better suited to ordinal problems such as Likert scales. Under this metric, a random model would score 0. We adopt this metric, as opposed to accuracy or macro F1, to account for class imbalance. Here are the classification report and test set metrics:

```
17:44:36 INFO test loss=0.9565 acc=0.578 QWK=0.5004
17:44:36 INFO               precision    recall  f1-score   support

           0       0.58      0.51      0.54        83
           1       0.47      0.46      0.46       111
           2       0.65      0.71      0.68       140

    accuracy                           0.58       334
   macro avg       0.57      0.56      0.56       334
weighted avg       0.57      0.58      0.57       334
```

We conduct a sweep over the following hyperparameters:

- Freeze / unfreeze
- Learning rate: 1e-6 through 1e-3
- Batch size: 16, 32
- Hidden size dimensions: 256, 128
- Warmup ratio: 0.05, 0.1, 0.2, 0.3
- Epochs: 30 (with patience)

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
```
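For reference, the model-selection metric can be reproduced with scikit-learn as follows (the label arrays below are placeholders, not actual model outputs):

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic Weighted Kappa penalizes disagreements by the squared distance
# between ordinal classes, so predicting 2 for a true 0 costs more than
# predicting 1. A score of 1 is perfect agreement; 0 is chance level.
y_true = [0, 0, 1, 1, 2, 2]  # placeholder gold labels
y_pred = [0, 1, 1, 1, 2, 0]  # placeholder model predictions

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK: {qwk:.4f}")
```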
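Building on the loading snippet above, here is a hedged inference sketch. The label names follow the merged 3-class scheme described earlier; note that, depending on how the custom ordinal head (loaded via `trust_remote_code`) exposes its outputs, a plain argmax over the logits may need to be replaced by the head's own decoding rule:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)

# Label order assumed to match the merged 3-class scheme above.
labels = ["Uncertain", "Somewhat Certain", "Certain"]
sentence = "The drug may reduce symptoms in some patients."  # example input

inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: argmax decoding; an ordinal head may define its own decoding.
print(labels[logits.argmax(dim=-1).item()])
```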