Model Card for roberta-nli-classifier
This is a binary Natural Language Inference (NLI) classifier that determines whether a hypothesis is entailed by a given premise. It is a Category A (Transformer-based) implementation that fine-tunes roberta-large with a multi-pooling MLP head, achieving 92.43% dev accuracy and 92.42% macro F1: a statistically significant improvement of +9.93 percentage points over the provided BERT baseline (one-sample t-test: t = 28.58, p < 0.0001).
Model Details
Model Description
This model builds on the roberta-large transformer encoder (Liu et al., 2019). A premise-hypothesis pair is tokenised jointly using the RoBERTa tokeniser and fed through the full 24-layer encoder. Rather than relying on the CLS token alone, the model aggregates the final hidden states using three parallel pooling strategies:
- CLS token: the `[CLS]` representation conventionally used for sequence classification.
- Mean pooling: the attention-masked mean of all non-padding token hidden states, capturing the average contextual signal across the sequence.
- Max pooling: the element-wise maximum over all non-padding token positions, preserving the strongest activated feature per dimension.
The three vectors (each of size H = 1024) are concatenated into a single 3H = 3072-dimensional representation and passed through a 3-layer MLP classifier:
Linear(3072 → 1024) → GELU → Dropout(0.1)
Linear(1024 → 512) → GELU → Dropout(0.1)
Linear(512 → 2)
The final logits are passed to a standard cross-entropy loss during training. The best checkpoint was selected based on peak dev accuracy across 5 epochs, rather than the final epoch, to prevent overfitting.
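The pooling-and-head design above can be sketched in PyTorch as follows (a minimal illustration only; the class and variable names are our own, not the exact training code):

```python
import torch
import torch.nn as nn

class MultiPoolingHead(nn.Module):
    """CLS + masked-mean + masked-max pooling over encoder states, then a 3-layer MLP."""
    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, 1024), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 2),
        )

    def forward(self, hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, H) final encoder states; attention_mask: (B, T), 1 = real token
        mask = attention_mask.unsqueeze(-1).float()
        cls = hidden[:, 0]                                               # CLS token
        mean = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)      # masked mean
        maxp = hidden.masked_fill(mask == 0, float("-inf")).max(1).values  # masked max
        return self.mlp(torch.cat([cls, mean, maxp], dim=-1))            # (B, 2) logits
```

The concatenated 3072-dimensional vector matches the MLP input size listed above; the returned logits feed directly into cross-entropy loss.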
- Developed by: 11309715, 11303252 and 11382801 (Group 8)
- Language(s): English
- Model type: Supervised fine-tuning
- Model architecture: Transformer encoder (RoBERTa-large) with CLS+Mean+Max pooling and 3-layer MLP classifier head
- Finetuned from model: `roberta-large`
- Total parameters: 359,032,322
- Model size: ~1.42 GB
Model Resources
Repository: https://huggingface.co/hk414xin/roberta-nli-classifier/
Foundational Architecture: RoBERTa Pre-trained Encoder (Liu et al., 2019)
Training Details
Training Data
The dataset comprises 24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track dataset. Labels are near-balanced: 12,648 positive (entailment) and 11,784 negative (non-entailment) in training, and 3,478 positive and 3,258 negative in development. No class weighting or oversampling was applied given this near-balance.
Each example was tokenised as a joint sequence `[CLS] premise [SEP] hypothesis [SEP]`, padded or truncated to a maximum of 256 subword tokens.
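A hedged sketch of this joint tokenisation step using the Hugging Face tokeniser (the example sentence pair is hypothetical; note that RoBERTa's actual special tokens are `<s>`/`</s>`, playing the roles of `[CLS]`/`[SEP]`):

```python
from transformers import AutoTokenizer

# Joint pair encoding with padding/truncation to 256 subword tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
batch = tokenizer(
    ["A man is playing a guitar."],   # premise (hypothetical example)
    ["Someone is making music."],     # hypothesis (hypothetical example)
    padding="max_length",
    truncation=True,
    max_length=256,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 256])
```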
Training Procedure
Training proceeded in two stages: a hyperparameter grid search, followed by a full 5-epoch training run using the best identified configuration.
Stage 1 – Hyperparameter Grid Search
A 16-configuration grid search was conducted, evaluating each configuration for 2 warm-up epochs on the full training set and recording dev accuracy. The search space crossed learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}, with warmup ratio and weight decay fixed at 0.1. The best configuration by 2-epoch dev accuracy was lr=2e-5, batch=32, dropout=0.1, max_len=256, achieving 0.9256.
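The search loop can be sketched as follows (`train_for_two_epochs` is a hypothetical stand-in for the actual 2-epoch training-and-evaluation routine):

```python
from itertools import product

# 2 x 2 x 2 x 2 = 16 configurations, matching the search space described above.
grid = {
    "lr": [1e-5, 2e-5],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.2],
    "max_len": [128, 256],
}

def search(train_for_two_epochs):
    """Return the configuration with the highest 2-epoch dev accuracy."""
    best_acc, best_cfg = 0.0, None
    for lr, bs, dr, ml in product(*grid.values()):
        cfg = {"lr": lr, "batch_size": bs, "dropout": dr, "max_len": ml}
        acc = train_for_two_epochs(cfg)  # dev accuracy after 2 warm-up epochs
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```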
Stage 2 – Full Training Run
The best configuration was trained for 5 full epochs with seed 42. The checkpoint at epoch 4 (highest dev accuracy) was saved as the final model.
Training Hyperparameters
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- num_epochs: 5
- max_seq_length: 256
- dropout: 0.1
- warmup_ratio: 0.1
- weight_decay: 0.1
- optimizer: AdamW with three parameter groups: encoder parameters with decay (lr=2e-5, wd=0.1), encoder bias/LayerNorm parameters without decay (lr=2e-5, wd=0.0), and the classifier head at 5× the encoder lr (lr=1e-4, wd=0.1)
- lr_scheduler: linear warmup + linear decay (`get_linear_schedule_with_warmup`, 10% warmup steps)
- fp16: True (`torch.amp`)
- gradient_clipping: 1.0
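The three-group AdamW setup can be sketched as below (a sketch under the assumption that the encoder and head are reachable as `model.encoder` and `model.classifier`; the real attribute names may differ):

```python
import torch
from torch.optim import AdamW

def build_optimizer(model, encoder_lr=2e-5, head_lr=1e-4, weight_decay=0.1):
    """Three parameter groups: encoder with decay, encoder bias/LayerNorm without
    decay, and the classifier head at 5x the encoder lr."""
    no_decay = ("bias", "LayerNorm.weight")
    enc = list(model.encoder.named_parameters())
    groups = [
        {"params": [p for n, p in enc if not any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": weight_decay},
        {"params": [p for n, p in enc if any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": 0.0},
        {"params": list(model.classifier.parameters()),
         "lr": head_lr, "weight_decay": weight_decay},
    ]
    return AdamW(groups)
```

The scheduler would then wrap this optimizer via `get_linear_schedule_with_warmup` with 10% of total steps as warmup.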
Speeds, Sizes, Times
- GPU used: NVIDIA GH200 120GB (102 GB VRAM)
- Duration per training epoch: ~67–68 seconds
- Total training time (5 epochs): ~5.7 minutes
- Total parameters: 359,032,322
- Model size: ~1.42 GB (roberta-large base weights) + custom MLP head
Evaluation
Testing Data
The full development set of 6,736 premise-hypothesis pairs provided as part of the COMP34812 NLI track, used as a held-out evaluation set after the best checkpoint (epoch 4) was selected.
Metrics
- Accuracy
- Macro F1-score
- Per-class Precision, Recall, F1-score (via `sklearn.metrics.classification_report`)
- Statistical significance (one-sample and two-sample t-tests via `scipy.stats`)
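A small sketch of how these metrics are computed with scikit-learn (`y_true`/`y_pred` are toy placeholder arrays, not real dev-set predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Toy labels: 0 = non-entailment, 1 = entailment.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
report = classification_report(
    y_true, y_pred, target_names=["non-entailment", "entailment"], digits=4
)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f}")
print(report)
```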
Fine Tuning Results
The best checkpoint (saved at epoch 4) achieved a dev accuracy of 92.43% and macro F1 of 92.42%.
Training Trajectory:
| Epoch | Train Loss | Train Acc | Dev Loss | Dev Acc | Notes |
|---|---|---|---|---|---|
| 1 | 0.4146 | 79.21% | 0.2429 | 91.29% | Best saved |
| 2 | 0.1961 | 93.12% | 0.2810 | 90.65% | – |
| 3 | 0.1117 | 97.02% | 0.3548 | 91.97% | Best saved |
| 4 | 0.0624 | 98.61% | 0.4174 | 92.43% | Best checkpoint |
| 5 | 0.0305 | 99.34% | 0.4744 | 92.19% | Not saved (overfit) |
By epoch 5, training loss continued to drop while dev loss rose from 0.4174 to 0.4744 and dev accuracy declined β a clear sign of overfitting.
Per-Class Classification Report (Epoch 4 checkpoint):
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 – Non-entailment | 0.9249 | 0.9180 | 0.9214 | 3,258 |
| 1 – Entailment | 0.9238 | 0.9301 | 0.9269 | 3,478 |
| Macro avg | 0.9243 | 0.9241 | 0.9242 | 6,736 |
| Weighted avg | 0.9243 | 0.9243 | 0.9243 | 6,736 |
The model performs symmetrically across both classes, with slightly higher recall on entailment (93.01%) and slightly higher precision on non-entailment (92.49%).
Multi-Run Reliability & Statistical Significance
To ensure results are robust against random seed variation and to confirm that gains are not due to "lucky" initialisations, the model was trained 5 times using independently sampled random seeds.
Baseline Comparison
Performance was benchmarked against the official COMP34812 NLI track BERT baseline, evaluated on the same 6,736-pair development set. While BERT provides a strong transformer-based baseline, it remains significantly below the performance of the proposed RoBERTa-based approach.
| Metric | SVM | LSTM | BERT |
|---|---|---|---|
| Accuracy | 0.5863 | 0.6606 | 0.8202 |
| Macro Precision | 0.5855 | 0.6603 | 0.8204 |
| Macro Recall | 0.5848 | 0.6603 | 0.8196 |
| Macro F1 | 0.5846 | 0.6603 | 0.8198 |
| Weighted Macro Precision | 0.5857 | 0.6606 | 0.8203 |
| Weighted Macro Recall | 0.5863 | 0.6606 | 0.8202 |
| Weighted Macro F1 | 0.5854 | 0.6606 | 0.8201 |
| Matthews Corr. Coef. | 0.1703 | 0.3206 | 0.6400 |
Experimental Results & Variance
The table below summarises the reliability of the fine-tuned model across multiple runs. The low standard deviation (0.70%) indicates reasonable training stability. Note that Run 5 converged more slowly (seed=67935 produced a difficult initialisation), which pulled the mean down slightly relative to the single-seed best result.
| Run | Seed | Best Dev Acc |
|---|---|---|
| 1 | 33513 | 92.47% |
| 2 | 65070 | 92.21% |
| 3 | 83360 | 92.13% |
| 4 | 20017 | 92.38% |
| 5 | 67935 | 90.59% |
| Mean ± Std | – | 92.00% ± 0.70% |
Note on Run 5: This run used seed 67935, which led to a slower warm-up phase. By epoch 5 it had recovered to 90.59%, but never reached the peak achieved by the other runs. This is reflected in the wider standard deviation compared to the single fixed-seed result.
Statistical Hypothesis Testing
A one-sample, one-sided t-test was conducted to evaluate the significance of the improvement over the strongest baseline (BERT Accuracy = 0.8202).
- Null Hypothesis (H₀): μ_acc ≤ 0.8202
- Alternative Hypothesis (H₁): μ_acc > 0.8202
The test yielded a t-statistic of 28.58 (p = 0.000004) at a significance level of α = 0.05. The mean multi-run accuracy of 92.00% represents an improvement of +9.93 percentage points over the BERT baseline. We reject the null hypothesis, confirming with high confidence that the performance gains are statistically significant and not merely due to a favourable random seed.
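This test can be reproduced from the per-run accuracies reported in the multi-run table above (rounding those accuracies to four decimal places shifts the statistic slightly from the reported t = 28.58):

```python
from scipy.stats import ttest_1samp

# One-sided one-sample t-test of the five multi-run dev accuracies
# against the BERT baseline accuracy of 0.8202.
run_accs = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]
baseline = 0.8202

result = ttest_1samp(run_accs, popmean=baseline, alternative="greater")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.6f}")
```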
K-Fold Cross-Validation & Consistency Check
To confirm performance is not an artefact of the particular train/dev split, 5-fold stratified cross-validation was run on the full training set (19,545–19,546 train / 4,886–4,887 val per fold):
| Fold | Val Acc | Val Macro F1 |
|---|---|---|
| 1 | 92.39% | 92.38% |
| 2 | 92.84% | 92.83% |
| 3 | 91.98% | 91.96% |
| 4 | 92.75% | 92.75% |
| 5 | 92.75% | 92.75% |
| Mean ± Std | 92.54% ± 0.32% | 92.53% ± 0.33% |
A two-sample t-test comparing the 5 multi-run dev accuracies against the 5 k-fold val accuracies found no significant difference: t = −1.5289, p = 0.1648 (α = 0.05). The non-significant result confirms that performance is consistent across evaluation strategies. The reported accuracy is a reliable estimate of true generalisation ability, not an optimistic artefact of a favourable train/dev partition.
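This comparison can be reproduced from the two accuracy tables above (rounded table values shift the statistic marginally):

```python
from scipy.stats import ttest_ind

# Two-sample t-test: multi-run dev accuracies vs k-fold validation accuracies.
multi_run = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]
k_fold = [0.9239, 0.9284, 0.9198, 0.9275, 0.9275]

t_stat, p_val = ttest_ind(multi_run, k_fold)
print(f"t = {t_stat:.4f}, p = {p_val:.4f}")
```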
Technical Specifications
Hardware
- GPU: NVIDIA GH200 120GB
- VRAM: 102 GB
- Mixed precision (fp16) training enabled via `torch.amp`
Software
- PyTorch (with `torch.amp` for mixed-precision training)
- Transformers (Hugging Face): `AutoModel`, `AutoTokenizer`, `get_linear_schedule_with_warmup`
- scikit-learn: `accuracy_score`, `f1_score`, `classification_report`, `StratifiedKFold`
- SciPy: `scipy.stats.ttest_1samp`, `scipy.stats.ttest_ind`
- pandas, numpy
Bias, Risks, and Limitations
- Any premise-hypothesis pair tokenised to more than 256 subword tokens will be truncated, which may cause information loss for longer inputs.
- The model was trained and evaluated on a single academic NLI dataset (COMP34812). Performance on out-of-distribution or domain-shifted text pairs (e.g., legal, medical, conversational) may be lower.
- The model predicts binary entailment only (entailment vs. non-entailment) and is not designed for three-way NLI (entailment / neutral / contradiction).
- Alternative pooling strategies (gated pooling, attention pooling) combined with LLRD, label smoothing, and freeze-then-unfreeze scheduling were explored but did not yield meaningful improvements over the CLS+mean+max baseline (see Additional Information).
- As with all NLI datasets, annotation artefacts (e.g., lexical overlap, negation patterns) may have been learned by the model rather than genuine semantic reasoning.
- The multi-run standard deviation (0.70%) is higher than the k-fold standard deviation (0.32%), partly driven by Run 5's slow initialisation. Users should be aware that occasional seeds may produce a weaker result.
Additional Information
The initial training run used a fixed seed (42) and `max_len=128`. To determine whether a longer context window or different regularisation would help, a systematic 16-configuration grid search was conducted, crossing learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}. The winning configuration (lr=2e-5, batch=32, dropout=0.1, max_len=256) outperformed its `max_len=128` equivalent (0.9256 vs 0.9219 at 2 epochs), confirming that the longer context window captures nuanced wording in longer premise-hypothesis pairs that would otherwise be truncated.

In parallel, two alternative pooling architectures were implemented and evaluated against the CLS+mean+max baseline:
Gated Pooling replaced fixed pooling with a learned sigmoid gate: `gate_scores = sigmoid(Linear(H→1)(seq))`, `output = (gate_scores * seq).sum(1) / gate_scores.sum(1)`. This allows the model to softly weight each token's contribution to the final representation.

Attention Pooling used a trainable query vector to compute a softmax distribution over token hidden states: `scores = Linear(H→1)(tanh(Linear(H→H)(seq)))`, `output = (softmax(scores) * seq).sum(1)`.
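Both variants can be sketched in PyTorch as follows (padding masks are added here for completeness; other implementation details are assumptions):

```python
import torch
import torch.nn as nn

class GatedPooling(nn.Module):
    """Learned sigmoid gate over token states, normalised by the total gate mass."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, seq: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, H); mask: (B, T), 1 for real tokens
        scores = torch.sigmoid(self.gate(seq)) * mask.unsqueeze(-1)  # zero out padding
        return (scores * seq).sum(1) / scores.sum(1).clamp(min=1e-9)

class AttentionPooling(nn.Module):
    """Trainable-query softmax attention over token hidden states."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.query = nn.Linear(hidden_size, 1)

    def forward(self, seq: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        scores = self.query(torch.tanh(self.proj(seq)))               # (B, T, 1)
        scores = scores.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)
        return (weights * seq).sum(1)                                  # (B, H)
```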
Both variants were further combined with Layer-wise Learning Rate Decay (LLRD), label smoothing, and a freeze-then-unfreeze training schedule. Despite these additions, neither architecture produced meaningful gains over the simpler CLS+mean+max baseline. Dev accuracy remained largely unchanged across all configurations, suggesting the baseline multi-pooling strategy already captures sufficient contextual information for binary NLI on this dataset, and that the added complexity introduced variance without benefit.