Model Card for roberta-nli-classifier
This is a binary Natural Language Inference (NLI) classifier that determines whether a hypothesis is entailed by a given premise. It is a Category A (Transformer-based) implementation that fine-tunes roberta-large with a multi-pooling MLP head, achieving 92.43% dev accuracy and 92.42% macro F1: a statistically significant improvement of +9.93 percentage points over the provided BERT baseline (one-sample t-test: t = 28.58, p < 0.0001).
Model Details
Model Description
This model builds on the roberta-large transformer encoder (Liu et al., 2019). A premise-hypothesis pair is tokenised jointly using the RoBERTa tokeniser and fed through the full 24-layer encoder. Rather than relying on the CLS token alone, the model aggregates the final hidden states using three parallel pooling strategies:
- CLS token: the `[CLS]` representation conventionally used for sequence classification.
- Mean pooling: the attention-masked mean of all non-padding token hidden states, capturing the average contextual signal across the sequence.
- Max pooling: the element-wise maximum over all non-padding token positions, preserving the strongest activated feature per dimension.
The three vectors (each of size H = 1024) are concatenated into a single 3H = 3072-dimensional representation and passed through a 3-layer MLP classifier:
Linear(3072 → 1024) → GELU → Dropout(0.1)
Linear(1024 → 512) → GELU → Dropout(0.1)
Linear(512 → 2)
The final logits are passed to a standard cross-entropy loss during training. The best checkpoint was selected based on peak dev accuracy across 5 epochs, rather than the final epoch, to prevent overfitting.
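The pooling-and-head design above can be sketched in PyTorch as follows (a minimal illustration only; the class and variable names are our own, not the exact training code):

```python
import torch
import torch.nn as nn

class MultiPoolingHead(nn.Module):
    """CLS + masked-mean + masked-max pooling over encoder states, then a 3-layer MLP."""
    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, 1024), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 2),
        )

    def forward(self, hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, H) final encoder states; attention_mask: (B, T), 1 = real token
        mask = attention_mask.unsqueeze(-1).float()
        cls = hidden[:, 0]                                               # CLS token
        mean = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)      # masked mean
        maxp = hidden.masked_fill(mask == 0, float("-inf")).max(1).values  # masked max
        return self.mlp(torch.cat([cls, mean, maxp], dim=-1))            # (B, 2) logits
```

The concatenated 3072-dimensional vector matches the MLP input size listed above; the returned logits feed directly into cross-entropy loss.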
- Developed by: 11309715, 11303252 and 11382801 (Group 8)
- Language(s): English
- Model type: Supervised fine-tuning
- Model architecture: Transformer encoder (RoBERTa-large) with CLS+Mean+Max pooling and 3-layer MLP classifier head
- Finetuned from model: `roberta-large`
- Total parameters: 359,032,322
- Model size: ~1.42 GB
Model Resources
Repository: https://huggingface.co/hk414xin/roberta-nli-classifier/
Foundational Architecture: RoBERTa Pre-trained Encoder (Liu et al., 2019)
Training Details
Training Data
The dataset comprises 24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track dataset. Labels are near-balanced: 12,648 positive (entailment) and 11,784 negative (non-entailment) in training, and 3,478 positive and 3,258 negative in development. No class weighting or oversampling was applied given this near-balance.
Each example was tokenised as a joint sequence `[CLS] premise [SEP] hypothesis [SEP]`, padded or truncated to a maximum of 256 subword tokens.
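A hedged sketch of this joint tokenisation step using the Hugging Face tokeniser (the example sentence pair is hypothetical; note that RoBERTa's actual special tokens are `<s>`/`</s>`, playing the roles of `[CLS]`/`[SEP]`):

```python
from transformers import AutoTokenizer

# Joint pair encoding with padding/truncation to 256 subword tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
batch = tokenizer(
    ["A man is playing a guitar."],   # premise (hypothetical example)
    ["Someone is making music."],     # hypothesis (hypothetical example)
    padding="max_length",
    truncation=True,
    max_length=256,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 256])
```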
Training Procedure
Training proceeded in two stages: a hyperparameter grid search, followed by a full 5-epoch training run using the best identified configuration.
Stage 1 – Hyperparameter Grid Search
A 16-configuration grid search was conducted, evaluating each configuration for 2 warm-up epochs on the full training set and recording dev accuracy. The search space crossed learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}, with warmup ratio and weight decay fixed at 0.1. The best configuration by 2-epoch dev accuracy was lr=2e-5, batch=32, dropout=0.1, max_len=256, achieving 0.9256.
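The search loop can be sketched as follows (`train_for_two_epochs` is a hypothetical stand-in for the actual 2-epoch training-and-evaluation routine):

```python
from itertools import product

# 2 x 2 x 2 x 2 = 16 configurations, matching the search space described above.
grid = {
    "lr": [1e-5, 2e-5],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.2],
    "max_len": [128, 256],
}

def search(train_for_two_epochs):
    """Return the configuration with the highest 2-epoch dev accuracy."""
    best_acc, best_cfg = 0.0, None
    for lr, bs, dr, ml in product(*grid.values()):
        cfg = {"lr": lr, "batch_size": bs, "dropout": dr, "max_len": ml}
        acc = train_for_two_epochs(cfg)  # dev accuracy after 2 warm-up epochs
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```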
Stage 2 – Full Training Run
The best configuration was trained for 5 full epochs with seed 42. The checkpoint at epoch 4 (highest dev accuracy) was saved as the final model.
Training Hyperparameters
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- num_epochs: 5
- max_seq_length: 256
- dropout: 0.1
- warmup_ratio: 0.1
- weight_decay: 0.1
- optimizer: AdamW with three parameter groups: encoder parameters with decay (lr=2e-5, wd=0.1), encoder bias/LayerNorm parameters without decay (lr=2e-5, wd=0.0), and the classifier head at 5× the encoder lr (lr=1e-4, wd=0.1)
- lr_scheduler: linear warmup + linear decay (`get_linear_schedule_with_warmup`, 10% warmup steps)
- fp16: True (`torch.amp`)
- gradient_clipping: 1.0
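The three-group AdamW setup can be sketched as below (a sketch under the assumption that the encoder and head are reachable as `model.encoder` and `model.classifier`; the real attribute names may differ):

```python
import torch
from torch.optim import AdamW

def build_optimizer(model, encoder_lr=2e-5, head_lr=1e-4, weight_decay=0.1):
    """Three parameter groups: encoder with decay, encoder bias/LayerNorm without
    decay, and the classifier head at 5x the encoder lr."""
    no_decay = ("bias", "LayerNorm.weight")
    enc = list(model.encoder.named_parameters())
    groups = [
        {"params": [p for n, p in enc if not any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": weight_decay},
        {"params": [p for n, p in enc if any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": 0.0},
        {"params": list(model.classifier.parameters()),
         "lr": head_lr, "weight_decay": weight_decay},
    ]
    return AdamW(groups)
```

The scheduler would then wrap this optimizer via `get_linear_schedule_with_warmup` with 10% of total steps as warmup.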
Speeds, Sizes, Times
- GPU used: NVIDIA GH200 120GB (102 GB VRAM)
- Duration per training epoch: ~67–68 seconds
- Total training time (5 epochs): ~5.7 minutes
- Total parameters: 359,032,322
- Model size: ~1.42 GB (roberta-large base weights) + custom MLP head
Evaluation
Testing Data
The full development set of 6,736 premise-hypothesis pairs provided as part of the COMP34812 NLI track, used as a held-out evaluation set after the best checkpoint (epoch 4) was selected.
Metrics
- Accuracy
- Macro F1-score
- Per-class Precision, Recall, F1-score (via `sklearn.metrics.classification_report`)
- Statistical significance (one-sample and two-sample t-tests via `scipy.stats`)
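A small sketch of how these metrics are computed with scikit-learn (`y_true`/`y_pred` are toy placeholder arrays, not real dev-set predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Toy labels: 0 = non-entailment, 1 = entailment.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
report = classification_report(
    y_true, y_pred, target_names=["non-entailment", "entailment"], digits=4
)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f}")
print(report)
```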
Fine Tuning Results
The best checkpoint (saved at epoch 4) achieved a dev accuracy of 92.43% and macro F1 of 92.42%.
Training Trajectory:
| Epoch | Train Loss | Train Acc | Dev Loss | Dev Acc | Notes |
|---|---|---|---|---|---|
| 1 | 0.4146 | 79.21% | 0.2429 | 91.29% | Best saved |
| 2 | 0.1961 | 93.12% | 0.2810 | 90.65% | – |
| 3 | 0.1117 | 97.02% | 0.3548 | 91.97% | Best saved |
| 4 | 0.0624 | 98.61% | 0.4174 | 92.43% | Best checkpoint |
| 5 | 0.0305 | 99.34% | 0.4744 | 92.19% | Not saved (overfit) |
By epoch 5, training loss continued to drop while dev loss rose from 0.4174 to 0.4744 and dev accuracy declined β a clear sign of overfitting.
Per-Class Classification Report (Epoch 4 checkpoint):
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 – Non-entailment | 0.9249 | 0.9180 | 0.9214 | 3,258 |
| 1 – Entailment | 0.9238 | 0.9301 | 0.9269 | 3,478 |
| Macro avg | 0.9243 | 0.9241 | 0.9242 | 6,736 |
| Weighted avg | 0.9243 | 0.9243 | 0.9243 | 6,736 |
The model performs symmetrically across both classes, with slightly higher recall on entailment (93.01%) and slightly higher precision on non-entailment (92.49%).
Multi-Run Reliability & Statistical Significance
To ensure results are robust against random seed variation and to confirm that gains are not due to "lucky" initialisations, the model was trained 5 times using independently sampled random seeds.
Baseline Comparison
Performance was benchmarked against the official COMP34812 NLI track BERT baseline, evaluated on the same 6,736-pair development set. While BERT provides a strong transformer-based baseline, it remains significantly below the performance of the proposed RoBERTa-based approach.
| Metric | SVM | LSTM | BERT |
|---|---|---|---|
| Accuracy | 0.5863 | 0.6606 | 0.8202 |
| Macro Precision | 0.5855 | 0.6603 | 0.8204 |
| Macro Recall | 0.5848 | 0.6603 | 0.8196 |
| Macro F1 | 0.5846 | 0.6603 | 0.8198 |
| Weighted Macro Precision | 0.5857 | 0.6606 | 0.8203 |
| Weighted Macro Recall | 0.5863 | 0.6606 | 0.8202 |
| Weighted Macro F1 | 0.5854 | 0.6606 | 0.8201 |
| Matthews Corr. Coef. | 0.1703 | 0.3206 | 0.6400 |
Experimental Results & Variance
The table below summarises the reliability of the fine-tuned model across multiple runs. The low standard deviation (0.70%) indicates reasonable training stability. Note that Run 5 converged more slowly (seed=67935 produced a difficult initialisation), which pulled the mean down slightly relative to the single-seed best result.
| Run | Seed | Best Dev Acc |
|---|---|---|
| 1 | 33513 | 92.47% |
| 2 | 65070 | 92.21% |
| 3 | 83360 | 92.13% |
| 4 | 20017 | 92.38% |
| 5 | 67935 | 90.59% |
| Mean ± Std | – | 92.00% ± 0.70% |
Note on Run 5: This run used seed 67935, which led to a slower warm-up phase. By epoch 5 it had recovered to 90.59%, but never reached the peak achieved by the other runs. This is reflected in the wider standard deviation compared to the single fixed-seed result.
Statistical Hypothesis Testing
A one-sample, one-sided t-test was conducted to evaluate the significance of the improvement over the strongest baseline (BERT Accuracy = 0.8202).
- Null Hypothesis (H₀): μ_acc ≤ 0.8202
- Alternative Hypothesis (H₁): μ_acc > 0.8202
The test yielded a t-statistic of 28.58 (p = 0.000004) at a significance level of α = 0.05. The mean multi-run accuracy of 92.00% represents an improvement of +9.93 percentage points over the BERT baseline. We reject the null hypothesis, confirming with high confidence that the performance gains are statistically significant and not merely due to a favourable random seed.
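This test can be reproduced from the per-run accuracies reported in the multi-run table above (rounding those accuracies to four decimal places shifts the statistic slightly from the reported t = 28.58):

```python
from scipy.stats import ttest_1samp

# One-sided one-sample t-test of the five multi-run dev accuracies
# against the BERT baseline accuracy of 0.8202.
run_accs = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]
baseline = 0.8202

result = ttest_1samp(run_accs, popmean=baseline, alternative="greater")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.6f}")
```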
K-Fold Cross-Validation & Consistency Check
To confirm performance is not an artefact of the particular train/dev split, 5-fold stratified cross-validation was run on the full training set (19,545–19,546 train / 4,886–4,887 val per fold):
| Fold | Val Acc | Val Macro F1 |
|---|---|---|
| 1 | 92.39% | 92.38% |
| 2 | 92.84% | 92.83% |
| 3 | 91.98% | 91.96% |
| 4 | 92.75% | 92.75% |
| 5 | 92.75% | 92.75% |
| Mean ± Std | 92.54% ± 0.32% | 92.53% ± 0.33% |
A two-sample t-test comparing the 5 multi-run dev accuracies against the 5 k-fold val accuracies found no significant difference: t = −1.5289, p = 0.1648 (α = 0.05). The non-significant result confirms that performance is consistent across evaluation strategies. The reported accuracy is a reliable estimate of true generalisation ability, not an optimistic artefact of a favourable train/dev partition.
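This comparison can be reproduced from the two accuracy tables above (rounded table values shift the statistic marginally):

```python
from scipy.stats import ttest_ind

# Two-sample t-test: multi-run dev accuracies vs k-fold validation accuracies.
multi_run = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]
k_fold = [0.9239, 0.9284, 0.9198, 0.9275, 0.9275]

t_stat, p_val = ttest_ind(multi_run, k_fold)
print(f"t = {t_stat:.4f}, p = {p_val:.4f}")
```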
Technical Specifications
Hardware
- GPU: NVIDIA GH200 120GB
- VRAM: 102 GB
- Mixed precision (fp16) training enabled via `torch.amp`
Software
- PyTorch (with `torch.amp` for mixed-precision training)
- Transformers (Hugging Face): `AutoModel`, `AutoTokenizer`, `get_linear_schedule_with_warmup`
- scikit-learn: `accuracy_score`, `f1_score`, `classification_report`, `StratifiedKFold`
- SciPy: `scipy.stats.ttest_1samp`, `scipy.stats.ttest_ind`
- pandas, numpy
Bias, Risks, and Limitations
- Any premise-hypothesis pair tokenised to more than 256 subword tokens will be truncated, which may cause information loss for longer inputs.
- The model was trained and evaluated on a single academic NLI dataset (COMP34812). Performance on out-of-distribution or domain-shifted text pairs (e.g., legal, medical, conversational) may be lower.
- The model predicts binary entailment only (entailment vs. non-entailment) and is not designed for three-way NLI (entailment / neutral / contradiction).
- Alternative pooling strategies (gated pooling, attention pooling) combined with LLRD, label smoothing, and freeze-then-unfreeze scheduling were explored but did not yield meaningful improvements over the CLS+mean+max baseline (see Additional Information).
- As with all NLI datasets, annotation artefacts (e.g., lexical overlap, negation patterns) may have been learned by the model rather than genuine semantic reasoning.
- The multi-run standard deviation (0.70%) is higher than the k-fold standard deviation (0.32%), partly driven by Run 5's slow initialisation. Users should be aware that occasional seeds may produce a weaker result.
Additional Information
The initial training run used a fixed seed (42) and `max_len=128`. To determine whether a longer context window or different regularisation would help, a systematic 16-configuration grid search was conducted, crossing learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}. The winning configuration (lr=2e-5, batch=32, dropout=0.1, max_len=256) outperformed its `max_len=128` equivalent (0.9256 vs 0.9219 at 2 epochs), confirming that the longer context window captures nuanced wording in longer premise-hypothesis pairs that would otherwise be truncated.

In parallel, two alternative pooling architectures were implemented and evaluated against the CLS+mean+max baseline:
Gated Pooling replaced fixed pooling with a learned sigmoid gate: `gate_scores = sigmoid(Linear(H→1)(seq))`, `output = (gate_scores * seq).sum(1) / gate_scores.sum(1)`. This allows the model to softly weight each token's contribution to the final representation.

Attention Pooling used a trainable query vector to compute a softmax distribution over token hidden states: `scores = Linear(H→1)(tanh(Linear(H→H)(seq)))`, `output = (softmax(scores) * seq).sum(1)`.
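Both variants can be sketched in PyTorch as follows (padding masks are added here for completeness; other implementation details are assumptions):

```python
import torch
import torch.nn as nn

class GatedPooling(nn.Module):
    """Learned sigmoid gate over token states, normalised by the total gate mass."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, seq: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, H); mask: (B, T), 1 for real tokens
        scores = torch.sigmoid(self.gate(seq)) * mask.unsqueeze(-1)  # zero out padding
        return (scores * seq).sum(1) / scores.sum(1).clamp(min=1e-9)

class AttentionPooling(nn.Module):
    """Trainable-query softmax attention over token hidden states."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.query = nn.Linear(hidden_size, 1)

    def forward(self, seq: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        scores = self.query(torch.tanh(self.proj(seq)))               # (B, T, 1)
        scores = scores.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)
        return (weights * seq).sum(1)                                  # (B, H)
```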
Both variants were further combined with Layer-wise Learning Rate Decay (LLRD), label smoothing, and a freeze-then-unfreeze training schedule. Despite these additions, neither architecture produced meaningful gains over the simpler CLS+mean+max baseline. Dev accuracy remained largely unchanged across all configurations, suggesting the baseline multi-pooling strategy already captures sufficient contextual information for binary NLI on this dataset, and that the added complexity introduced variance without benefit.