Model Card for roberta-nli-classifier

This is a binary Natural Language Inference (NLI) classifier that determines whether a hypothesis is entailed by a given premise. It is a Category A (Transformer-based) implementation that fine-tunes roberta-large with a multi-pooling MLP head, achieving 92.43% dev accuracy and 92.42% macro F1. The mean accuracy across five seeded runs (92.00%) is a statistically significant improvement of +9.93 percentage points over the provided BERT baseline (one-sample t-test: t = 28.58, p < 0.0001).

Model Details

Model Description

This model builds on the roberta-large transformer encoder (Liu et al., 2019). A premise-hypothesis pair is tokenised jointly using the RoBERTa tokeniser and fed through the full 24-layer encoder. Rather than relying on the CLS token alone, the model aggregates the final hidden states using three parallel pooling strategies:

  1. CLS token: the [CLS] representation conventionally used for sequence classification.
  2. Mean pooling: the attention-masked mean of all non-padding token hidden states, capturing the average contextual signal across the sequence.
  3. Max pooling: the element-wise maximum over all non-padding token positions, preserving the strongest activated feature per dimension.

The three vectors (each of size H = 1024) are concatenated into a single 3H = 3072-dimensional representation and passed through a 3-layer MLP classifier:

Linear(3072 → 1024) → GELU → Dropout(0.1)
Linear(1024 → 512)  → GELU → Dropout(0.1)
Linear(512  → 2)

The final logits are passed to a standard cross-entropy loss during training. The best checkpoint was selected based on peak dev accuracy across 5 epochs, rather than the final epoch, to prevent overfitting.
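
The three pooling operations can be sketched as follows for a single unbatched sequence (a NumPy illustration of the shapes involved; the actual model applies the same logic to batched torch tensors):

```python
import numpy as np

def multi_pool(hidden, attention_mask):
    """CLS + masked-mean + masked-max pooling over final hidden states.

    hidden:         (T, H) final-layer hidden states for one sequence.
    attention_mask: (T,)   1 for real tokens, 0 for padding.
    Returns a (3H,) vector, the input to the MLP head above.
    """
    keep = attention_mask.astype(bool)
    cls_vec = hidden[0]                   # [CLS] sits at position 0
    mean_vec = hidden[keep].mean(axis=0)  # average over non-padding tokens
    max_vec = hidden[keep].max(axis=0)    # element-wise max over tokens
    return np.concatenate([cls_vec, mean_vec, max_vec])
```

For H = 1024 this yields the 3072-dimensional vector expected by the first Linear layer of the head.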

  • Developed by: 11309715, 11303252 and 11382801 (Group 8)
  • Language(s): English
  • Model type: Supervised fine-tuning
  • Model architecture: Transformer encoder (RoBERTa-large) with CLS+Mean+Max pooling and 3-layer MLP classifier head
  • Finetuned from model: roberta-large
  • Total parameters: 359,032,322
  • Model size: ~1.42 GB

Training Details

Training Data

24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track dataset. Labels are near-balanced: 12,648 positive (entailment) and 11,784 negative (non-entailment) in training, and 3,478 positive and 3,258 negative in development. No class weighting or oversampling was applied given this near-balance.

Each example was tokenised as a joint sequence [CLS] premise [SEP] hypothesis [SEP], padded or truncated to a maximum of 256 subword tokens.
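
The fixed-length behaviour amounts to the following (a minimal pure-Python sketch over token ids; the real RoBERTa tokeniser also inserts the special tokens shown above, and its pad token id is 1):

```python
def pad_or_truncate(token_ids, max_len=256, pad_id=1):
    """Clip a joint premise-hypothesis id sequence to max_len positions,
    then right-pad with pad_id so every example has the same length.
    pad_id=1 matches RoBERTa's <pad> token id."""
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))
```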

Training Procedure

Training proceeded in two stages: a hyperparameter grid search, followed by a full 5-epoch training run using the best identified configuration.

Stage 1: Hyperparameter Grid Search

A 16-configuration grid search was conducted, training each configuration for 2 epochs on the full training set and recording dev accuracy. The search space crossed learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}, with warmup ratio and weight decay fixed at 0.1. The best configuration by 2-epoch dev accuracy was lr=2e-5, batch=32, dropout=0.1, max_len=256, achieving 0.9256.
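
The grid enumeration itself is straightforward; a sketch of the stage-1 loop structure (the train/evaluate call is elided, since it depends on the training harness):

```python
from itertools import product

# The four axes of the stage-1 search: 2 * 2 * 2 * 2 = 16 configurations.
GRID = {
    "lr": [1e-5, 2e-5],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.2],
    "max_len": [128, 256],
}

def configurations(grid):
    """Yield every point of the grid as a {name: value} dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# for cfg in configurations(GRID):
#     dev_acc = train_for_two_epochs(cfg)  # elided: harness-specific
```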

Stage 2: Full Training Run

The best configuration was trained for 5 full epochs with seed 42. The checkpoint at epoch 4 (highest dev accuracy) was saved as the final model.

Training Hyperparameters

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • num_epochs: 5
  • max_seq_length: 256
  • dropout: 0.1
  • warmup_ratio: 0.1
  • weight_decay: 0.1
  • optimizer: AdamW with three parameter groups: encoder parameters with decay (lr=2e-5, wd=0.1), encoder bias/LayerNorm parameters without decay (lr=2e-5, wd=0.0), and the classifier head at 5× the encoder lr (lr=1e-4, wd=0.1)
  • lr_scheduler: Linear warmup + linear decay (get_linear_schedule_with_warmup, 10% warmup steps)
  • fp16: True (torch.amp)
  • gradient_clipping: 1.0
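
The three-group optimizer setup can be sketched as a pure function over (name, parameter) pairs, mirroring what torch's model.named_parameters() yields (the "classifier." prefix for the head is an illustrative assumption; the real module name may differ):

```python
def build_param_groups(named_params, enc_lr=2e-5, head_lr=1e-4, wd=0.1):
    """Split parameters into the three AdamW groups described above:
    encoder weights with decay, encoder bias/LayerNorm without decay,
    and the MLP head at 5x the encoder learning rate."""
    no_decay = ("bias", "LayerNorm")
    enc_decay, enc_no_decay, head = [], [], []
    for name, param in named_params:
        if name.startswith("classifier."):            # assumed head prefix
            head.append(param)
        elif any(marker in name for marker in no_decay):
            enc_no_decay.append(param)
        else:
            enc_decay.append(param)
    return [
        {"params": enc_decay, "lr": enc_lr, "weight_decay": wd},
        {"params": enc_no_decay, "lr": enc_lr, "weight_decay": 0.0},
        {"params": head, "lr": head_lr, "weight_decay": wd},
    ]
```

The returned list is the standard per-parameter-group format accepted by torch.optim.AdamW.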

Speeds, Sizes, Times

  • GPU used: NVIDIA GH200 120GB (102 GB VRAM)
  • Duration per training epoch: ~67–68 seconds
  • Total training time (5 epochs): ~5.7 minutes
  • Total parameters: 359,032,322
  • Model size: ~1.42 GB (fp32 weights: roberta-large encoder plus the custom MLP head)

Evaluation

Testing Data

The full development set of 6,736 premise-hypothesis pairs provided as part of the COMP34812 NLI track, used as a held-out evaluation set after the best checkpoint (epoch 4) was selected.

Metrics

  • Accuracy
  • Macro F1-score
  • Per-class Precision, Recall, F1-score (via sklearn.metrics.classification_report)
  • Statistical significance (one-sample t-test and two-sample t-test via scipy.stats)
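
Macro F1 is the unweighted mean of the per-class F1 scores; for the binary case it reduces to the following (a minimal pure-Python sketch, equivalent to sklearn's f1_score with average="macro"):

```python
def binary_macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the two classes {0, 1}."""
    f1s = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```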

Fine-Tuning Results

The best checkpoint (saved at epoch 4) achieved a dev accuracy of 92.43% and macro F1 of 92.42%.

Training Trajectory:

| Epoch | Train Loss | Train Acc | Dev Loss | Dev Acc | Notes |
|-------|------------|-----------|----------|---------|-------|
| 1 | 0.4146 | 79.21% | 0.2429 | 91.29% | Best saved |
| 2 | 0.1961 | 93.12% | 0.2810 | 90.65% | – |
| 3 | 0.1117 | 97.02% | 0.3548 | 91.97% | Best saved |
| 4 | 0.0624 | 98.61% | 0.4174 | 92.43% | Best checkpoint |
| 5 | 0.0305 | 99.34% | 0.4744 | 92.19% | Not saved (overfit) |

By epoch 5, training loss continued to drop while dev loss rose from 0.4174 to 0.4744 and dev accuracy declined, a clear sign of overfitting.

Per-Class Classification Report (Epoch 4 checkpoint):

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (Non-entailment) | 0.9249 | 0.9180 | 0.9214 | 3,258 |
| 1 (Entailment) | 0.9238 | 0.9301 | 0.9269 | 3,478 |
| Macro avg | 0.9243 | 0.9241 | 0.9242 | 6,736 |
| Weighted avg | 0.9243 | 0.9243 | 0.9243 | 6,736 |

The model performs symmetrically across both classes, with slightly higher recall on entailment (93.01%) and slightly higher precision on non-entailment (92.49%).

Multi-Run Reliability & Statistical Significance

To ensure results are robust against random seed variation and to confirm that gains are not due to "lucky" initialisations, the model was trained 5 times using independently sampled random seeds.

Baseline Comparison

Performance was benchmarked against the official COMP34812 NLI track baselines (SVM, LSTM, and BERT), all evaluated on the same 6,736-pair development set. BERT is the strongest of the three, but it remains significantly below the performance of the proposed RoBERTa-based approach.

| Metric | SVM | LSTM | BERT |
|--------|-----|------|------|
| Accuracy | 0.5863 | 0.6606 | 0.8202 |
| Macro Precision | 0.5855 | 0.6603 | 0.8204 |
| Macro Recall | 0.5848 | 0.6603 | 0.8196 |
| Macro F1 | 0.5846 | 0.6603 | 0.8198 |
| Weighted Macro Precision | 0.5857 | 0.6606 | 0.8203 |
| Weighted Macro Recall | 0.5863 | 0.6606 | 0.8202 |
| Weighted Macro F1 | 0.5854 | 0.6606 | 0.8201 |
| Matthews Corr. Coef. | 0.1703 | 0.3206 | 0.6400 |

Experimental Results & Variance

The table below summarises the reliability of the fine-tuned model across multiple runs. The low standard deviation (0.70%) indicates reasonable training stability. Note that Run 5 converged more slowly (seed=67935 produced a difficult initialisation), which pulled the mean down slightly relative to the single-seed best result.

| Run | Seed | Best Dev Acc |
|-----|------|--------------|
| 1 | 33513 | 92.47% |
| 2 | 65070 | 92.21% |
| 3 | 83360 | 92.13% |
| 4 | 20017 | 92.38% |
| 5 | 67935 | 90.59% |
| Mean ± Std | – | 92.00% ± 0.70% |

Note on Run 5: This run used seed 67935, which led to a slower warm-up phase. By epoch 5 it had recovered to 90.59%, but never reached the peak achieved by the other runs. This is reflected in the wider standard deviation compared to the single fixed-seed result.

Statistical Hypothesis Testing

A one-sample, one-sided t-test was conducted to evaluate the significance of the improvement over the strongest baseline (BERT Accuracy = 0.8202).

  • Null Hypothesis (H0): μ_acc ≤ 0.8202
  • Alternative Hypothesis (H1): μ_acc > 0.8202

The test yielded a t-statistic of 28.58 (p = 0.000004) at a significance level of α = 0.05. The mean multi-run accuracy of 92.00% represents an improvement of +9.93 percentage points over the BERT baseline. We reject the null hypothesis, confirming with high confidence that the performance gains are statistically significant and not merely due to a favourable random seed.
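
The statistic can be reproduced from the per-run accuracies in the table above (they are rounded to two decimal places, so the result agrees with the reported t = 28.58 only up to rounding); a pure-Python sketch of what scipy.stats.ttest_1samp computes:

```python
import math

# Per-run best dev accuracies (multi-run table) and the BERT baseline.
RUNS = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]
BASELINE = 0.8202

def one_sample_t(xs, mu0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n)), with s the
    sample standard deviation (ddof=1), as in scipy.stats.ttest_1samp."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

t_stat = one_sample_t(RUNS, BASELINE)  # roughly 28.6 from the rounded values
```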

K-Fold Cross-Validation & Consistency Check

To confirm performance is not an artefact of the particular train/dev split, 5-fold stratified cross-validation was run on the full training set (19,545–19,546 train / 4,886–4,887 val per fold):

| Fold | Val Acc | Val Macro F1 |
|------|---------|--------------|
| 1 | 92.39% | 92.38% |
| 2 | 92.84% | 92.83% |
| 3 | 91.98% | 91.96% |
| 4 | 92.75% | 92.75% |
| 5 | 92.75% | 92.75% |
| Mean ± Std | 92.54% ± 0.32% | 92.53% ± 0.33% |

A two-sample t-test comparing the 5 multi-run dev accuracies against the 5 k-fold val accuracies found no significant difference: t = -1.5289, p = 0.1648 (α = 0.05). The non-significant result confirms that performance is consistent across evaluation strategies. The reported accuracy is a reliable estimate of true generalisation ability, not an optimistic artefact of a favourable train/dev partition.
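
Recomputing this from the rounded values in the two tables reproduces the reported statistic up to rounding; a pure-Python sketch of what scipy.stats.ttest_ind computes with its default equal_var=True (pooled variance):

```python
import math

MULTI_RUN = [0.9247, 0.9221, 0.9213, 0.9238, 0.9059]  # 5 seeded runs
K_FOLD    = [0.9239, 0.9284, 0.9198, 0.9275, 0.9275]  # 5 CV folds

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic (Student's t-test)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

t_stat = two_sample_t(MULTI_RUN, K_FOLD)  # roughly -1.53 from rounded values
```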

Technical Specifications

Hardware

  • GPU: NVIDIA GH200 120GB
  • VRAM: 102 GB
  • Mixed precision (fp16) training enabled via torch.amp

Software

  • PyTorch (with torch.amp for mixed-precision training)
  • Transformers (HuggingFace) β€” AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
  • scikit-learn β€” accuracy_score, f1_score, classification_report, StratifiedKFold
  • SciPy β€” scipy.stats.ttest_1samp, scipy.stats.ttest_ind
  • pandas, numpy

Bias, Risks, and Limitations

  • Any premise-hypothesis pair tokenised to more than 256 subword tokens will be truncated, which may cause information loss for longer inputs.
  • The model was trained and evaluated on a single academic NLI dataset (COMP34812). Performance on out-of-distribution or domain-shifted text pairs (e.g., legal, medical, conversational) may be lower.
  • The model predicts binary entailment only (entailment vs. non-entailment) and is not designed for three-way NLI (entailment / neutral / contradiction).
  • Alternative pooling strategies (gated pooling, attention pooling) combined with LLRD, label smoothing, and freeze-then-unfreeze scheduling were explored but did not yield meaningful improvements over the CLS+mean+max baseline (see Additional Information).
  • As with all NLI datasets, annotation artefacts (e.g., lexical overlap, negation patterns) may have been learned by the model rather than genuine semantic reasoning.
  • The multi-run standard deviation (0.70%) is higher than the k-fold standard deviation (0.32%), partly driven by Run 5's slow initialisation. Users should be aware that occasional seeds may produce a weaker result.

Additional Information

  • The initial training run used a fixed seed (42) and max_len=128. To determine whether a longer context window or different regularisation would help, a systematic 16-configuration grid search was conducted crossing learning rate ∈ {1e-5, 2e-5}, batch size ∈ {16, 32}, dropout ∈ {0.1, 0.2}, and max sequence length ∈ {128, 256}. The winning configuration (lr=2e-5, batch=32, dropout=0.1, max_len=256) outperformed its max_len=128 equivalent (0.9256 vs 0.9219 at 2 epochs), confirming that the longer context window captures nuanced wording in longer premise-hypothesis pairs that would otherwise be truncated.

  • In parallel, two alternative pooling architectures were implemented and evaluated against the CLS+mean+max baseline:

    • Gated Pooling replaced fixed pooling with a learned sigmoid gate: gate_scores = sigmoid(Linear(H→1)(seq)), output = (gate_scores * seq).sum(1) / gate_scores.sum(1). This allows the model to softly weight each token's contribution to the final representation.

    • Attention Pooling used a trainable query vector to compute a softmax distribution over token hidden states: scores = Linear(H→1)(tanh(Linear(H→H)(seq))), output = (softmax(scores) * seq).sum(1).

    Both variants were further combined with Layer-wise Learning Rate Decay (LLRD), label smoothing, and a freeze-then-unfreeze training schedule. Despite these additions, neither architecture produced meaningful gains over the simpler CLS+mean+max baseline. Dev accuracy remained largely unchanged across all configurations, suggesting the baseline multi-pooling strategy already captures sufficient contextual information for binary NLI on this dataset, and that the added complexity introduced variance without benefit.
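
For concreteness, the gated-pooling variant can be sketched for a single unbatched sequence (NumPy, with the gate's Linear(H→1) weights passed in explicitly; the values used below are illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_pool(seq, w, b):
    """Gated pooling: a learned sigmoid gate softly weights each token.

    seq: (T, H) hidden states; w: (H,) and b: scalar are the weights of
    the Linear(H -> 1) gate. Returns the (H,) gate-weighted average.
    """
    gate = sigmoid(seq @ w + b)                  # (T,) per-token scores
    return (gate[:, None] * seq).sum(axis=0) / gate.sum()
```

With zero gate weights every token receives the same score, and the output reduces to a plain mean over tokens.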
