# Model Card for esim-cnn-bilstm-ensemble
This is a Natural Language Inference (NLI) classifier that determines the logical relationship between a given premise and hypothesis. It is a Category B (Non-Transformer) implementation that augments a standard CNN-BiLSTM baseline with an Enhanced Sequential Inference Model (ESIM) Cross-Attention matrix, stabilized via a 5-model majority-vote ensemble.
## Model Details
### Model Description
This model evaluates text pairs without relying on modern global self-attention mechanisms. Instead, the encoder uses two parallel paths: a CNN extracts global n-gram features, while a BiLSTM extracts sequential features. To overcome the information bottleneck of standard recurrent networks, the BiLSTM hidden states are passed through an ESIM local inference matrix, which computes the dot-product similarity between every word in the premise and every word in the hypothesis prior to pooling.
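The local inference step can be sketched in a few lines. The following is a minimal numpy illustration of the ESIM alignment, not the project's actual code; `a` and `b` stand in for the BiLSTM hidden states of the premise and hypothesis.

```python
import numpy as np

def esim_local_inference(a, b):
    """Sketch of the ESIM local inference step (Chen et al., 2017).

    a: premise hidden states, shape (len_a, d)
    b: hypothesis hidden states, shape (len_b, d)
    Returns soft-aligned representations of each sequence.
    """
    # Unnormalised alignment matrix: e[i, j] = a_i . b_j
    e = a @ b.T                               # (len_a, len_b)

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        ex = np.exp(x)
        return ex / ex.sum(axis=axis, keepdims=True)

    # Each premise word attends over all hypothesis words, and vice versa
    a_tilde = softmax(e, axis=1) @ b          # (len_a, d)
    b_tilde = softmax(e, axis=0).T @ a        # (len_b, d)
    return a_tilde, b_tilde
```

In the full ESIM formulation, these soft-aligned vectors are then combined with the original states (via difference and element-wise product) before pooling.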
The pooled vectors from the CNN and ESIM paths are concatenated and passed through a 3-layer MLP classifier with GELU activations. To reduce the variance introduced by random weight initialization, final predictions are produced by a majority-vote ensemble of 5 identical architectures trained with different random seeds.
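The ensembling rule itself is plain hard voting over label predictions. A hypothetical sketch (function name and input shapes are illustrative, not taken from the project code):

```python
from collections import Counter

def majority_vote(per_model_preds):
    """Combine label predictions from several independently seeded models.

    per_model_preds: one prediction list per model, e.g. 5 models x N
    examples. With an odd number of voters and binary labels, ties
    cannot occur.
    """
    n = len(per_model_preds[0])
    return [
        Counter(model[i] for model in per_model_preds).most_common(1)[0][0]
        for i in range(n)
    ]
```

For example, `majority_vote([[1,0,1],[1,1,0],[0,0,1],[1,0,1],[1,1,1]])` returns `[1, 0, 1]`.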
- Developed by: 11303252, 11309715 and 11382801
- Language(s): English
- Model type: Supervised Classification (Majority-Vote Ensemble)
- Model architecture: CNN (Global Features) + BiLSTM (Sequential Features) + ESIM Cross-Attention Matrix
- Finetuned from model: N/A (Trained from scratch using static GloVe 300d embeddings)
### Model Resources
Foundational Architecture Literature:
- ESIM & BiLSTM Framework: Enhanced LSTM for Natural Language Inference (Chen et al., 2017).
- CNN Feature Extraction: Convolutional Neural Networks for Sentence Classification (Kim, 2014).
- Word Embeddings: GloVe: Global Vectors for Word Representation (Pennington et al., 2014).
## Training Details
### Training Data
The dataset comprises 24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track. Vocabulary size: 23,053 tokens, of which 93.40% are covered by GloVe vectors.
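The coverage figure reduces to a set intersection between the corpus vocabulary and the GloVe vocabulary; a minimal sketch, assuming both are sets of token strings:

```python
def glove_coverage(vocab, glove_vocab):
    """Fraction of corpus vocabulary types that have a GloVe vector.

    vocab, glove_vocab: sets of token strings. On this dataset the
    reported coverage was 93.40% of 23,053 tokens; out-of-vocabulary
    tokens receive no pretrained vector.
    """
    covered = len(vocab & glove_vocab)
    return covered / len(vocab)
```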
### Training Procedure
During baseline development, a grid search evaluated hidden_dim ∈ [128, 256], dropout ∈ [0.2, 0.3, 0.5], and learning_rate ∈ [1e-3, 5e-4]. The optimal configuration for the sequential baseline was identified as 128 hidden dimensions and 0.3 dropout.
Upon upgrading to the ESIM architecture, a secondary 3D grid search was performed to account for the increased parameter count. The final system uses a 5-model ensemble strategy: each constituent model was trained on 100% of the training data with a distinct random seed (42-46).
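The sweep described above amounts to an exhaustive loop over the Cartesian product of the grid. A hypothetical sketch, where `train_and_eval` stands in for a full training run that returns dev F1:

```python
from itertools import product

# Grid from the baseline search described above
grid = {
    "hidden_dim": [128, 256],
    "dropout": [0.2, 0.3, 0.5],
    "learning_rate": [1e-3, 5e-4],
}

def grid_search(train_and_eval):
    """Return the best configuration and its dev F1."""
    best_cfg, best_f1 = None, -1.0
    for hd, dr, lr in product(*grid.values()):
        f1 = train_and_eval(hidden_dim=hd, dropout=dr, learning_rate=lr)
        if f1 > best_f1:
            best_cfg = {"hidden_dim": hd, "dropout": dr, "learning_rate": lr}
            best_f1 = f1
    return best_cfg, best_f1
```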
#### Training Hyperparameters
- learning_rate: 1e-03 (ReduceLROnPlateau, factor=0.5, patience=1)
- train_batch_size: 32
- seed: {42, 43, 44, 45, 46}
- num_epochs: 10
- dropout: 0.5
- hidden_dim: 256
- optimizer: AdamW (weight_decay: 1e-04)
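A minimal re-implementation of the scheduling rule listed above, assuming PyTorch-style ReduceLROnPlateau semantics (the LR halves after patience + 1 consecutive epochs without improvement) and assuming the scheduler monitored validation F1:

```python
class PlateauHalver:
    """Sketch of ReduceLROnPlateau with factor=0.5, patience=1,
    maximizing a monitored metric (here: validation F1)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_f1):
        if val_f1 > self.best:
            self.best, self.bad_epochs = val_f1, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:  # tolerate `patience` bad epochs
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Under these assumptions, the rule is consistent with the LR column of the seed-42 trajectory reported under Evaluation.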
## Evaluation
### Results (Official nlu-score Scorer)
The following metrics were generated using the nlu_scorer_env on the 6,736-pair development set.
#### 1. Random-Seed ESIM Ensemble (Primary Submission)
| Metric | Value |
|---|---|
| Accuracy | 0.7096 |
| Macro F1 | 0.7086 |
| Macro Precision | 0.7098 |
| Macro Recall | 0.7084 |
| Matthews CorrCoef | 0.4182 |
#### 2. K-Fold Ensemble (Experimental Study)
| Metric | Value |
|---|---|
| Accuracy | 0.6980 |
| Macro F1 | 0.6977 |
| Matthews CorrCoef | 0.3954 |
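For reference, the Matthews correlation coefficient reported above can be computed from a two-class confusion matrix as follows. This is a generic sketch, not the official scorer:

```python
import math

def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from binary confusion-matrix counts; ranges from -1 to +1,
    with 0 indicating performance no better than chance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```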
**Training Trajectory (Model 1 - Seed 42):** The best state dict for each constituent model was captured at peak validation F1 (e.g., Epoch 6 for Seed 42).
| Epoch | LR | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|---|---|---|---|---|---|---|
| 1 | 1e-3 | 0.6996 | 0.5954 | 0.6119 | 0.6626 | 0.6522 |
| 2 | 1e-3 | 0.6331 | 0.6419 | 0.5996 | 0.6730 | 0.6721 |
| 3 | 1e-3 | 0.6554 | 0.6569 | 0.5916 | 0.6689 | 0.6554 |
| 4 | 5e-4 | 0.6099 | 0.6732 | 0.6233 | 0.6730 | 0.6719 |
| 5 | 5e-4 | 0.5937 | 0.6949 | 0.5790 | 0.6881 | 0.6871 |
| 6 | 5e-4 | 0.5764 | 0.7093 | 0.5785 | 0.6934 | 0.6921 |
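Checkpoint selection at peak validation F1, as described above, reduces to an argmax over the per-epoch history; a minimal sketch (the function name is chosen for illustration):

```python
def track_best(epoch_f1s):
    """Return (best_epoch, best_f1), mimicking how each constituent
    model's state dict was checkpointed at peak validation F1.
    Epochs are numbered from 1, matching the trajectory table."""
    best_epoch, best_f1 = max(enumerate(epoch_f1s, start=1), key=lambda p: p[1])
    return best_epoch, best_f1
```

Applied to the Val F1 column above, this selects epoch 6 for seed 42.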
## Additional Information
ESIM Architecture Grid Search & Regularization Strategy:
The 3D grid search revealed an optimal configuration of 256 hidden dimensions and 0.5 dropout (peak validation F1: 0.6930). This supports the architectural rationale: expanding hidden_dim to 256 provided the capacity needed to model the ESIM semantic alignments, while aggressive 50% dropout was required to prevent the network from memorizing training sequences.
Ensembling Experiments & Statistical Significance: Three paired-samples t-tests were conducted on the development set:
- Random-Seed Ensemble vs. Official Baseline: Demonstrated a significant architectural improvement over the LSTM baseline (70.96% vs. 66.06%; t = 8.0035, p < 0.05).
- K-Fold Ensemble vs. Official Baseline: Also showed a significant improvement (t = 6.1321, p < 0.05).
- Ablation Study (Random-Seed vs. K-Fold): Confirmed the Random-Seed approach was statistically superior to the K-Fold approach (t = 3.2637, p < 0.05). This indicates that, for this non-Transformer architecture, maximizing raw data and vocabulary exposure (each K-Fold model withholds 20% of the training data) was more beneficial than the data diversity gained through cross-validation slicing.
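The paired-samples t statistic used in these comparisons can be sketched in pure Python, assuming per-example 0/1 correctness vectors for the two systems under comparison (the project's exact test inputs are not reproduced here):

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic for per-example scores of two systems.

    x, y: equal-length lists of per-item scores, e.g. 0/1 correctness
    on the dev set for the ensemble vs. the baseline.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

The resulting statistic is compared against the t distribution with n - 1 degrees of freedom to obtain the p-values reported above.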