# Model Card for esim-cnn-bilstm-ensemble
This is a Natural Language Inference (NLI) classifier that determines the logical relationship between a given premise and hypothesis. It is a Category B (Non-Transformer) implementation that augments a standard CNN-BiLSTM baseline with an Enhanced Sequential Inference Model (ESIM) Cross-Attention matrix, stabilized via a 5-model majority-vote ensemble.
## Model Details
### Model Description
This model evaluates text pairs without relying on modern global self-attention mechanisms. Instead, the encoder uses two parallel paths: a CNN extracts global n-gram features, while a BiLSTM extracts sequential features. To overcome the information bottleneck of standard recurrent networks, the BiLSTM hidden states are passed through an ESIM local inference matrix, which computes the dot-product similarity between every word in the premise and every word in the hypothesis prior to pooling.
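The local inference step can be sketched in a few lines. The following is a minimal numpy illustration of the ESIM alignment, not the project's actual code; `a` and `b` stand in for the BiLSTM hidden states of the premise and hypothesis.

```python
import numpy as np

def esim_local_inference(a, b):
    """Sketch of the ESIM local inference step (Chen et al., 2017).

    a: premise hidden states, shape (len_a, d)
    b: hypothesis hidden states, shape (len_b, d)
    Returns soft-aligned representations of each sequence.
    """
    # Unnormalised alignment matrix: e[i, j] = a_i . b_j
    e = a @ b.T                               # (len_a, len_b)

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        ex = np.exp(x)
        return ex / ex.sum(axis=axis, keepdims=True)

    # Each premise word attends over all hypothesis words, and vice versa
    a_tilde = softmax(e, axis=1) @ b          # (len_a, d)
    b_tilde = softmax(e, axis=0).T @ a        # (len_b, d)
    return a_tilde, b_tilde
```

In the full ESIM formulation, these soft-aligned vectors are then combined with the original states (via difference and element-wise product) before pooling.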
The pooled vectors from the CNN and ESIM paths are concatenated and passed through a 3-layer MLP classifier with GELU activations. To reduce the variance introduced by random weight initialization, final predictions are produced by a majority-vote ensemble of 5 identical architectures trained with different random seeds.
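The ensembling rule itself is plain hard voting over label predictions. A hypothetical sketch (function name and input shapes are illustrative, not taken from the project code):

```python
from collections import Counter

def majority_vote(per_model_preds):
    """Combine label predictions from several independently seeded models.

    per_model_preds: one prediction list per model, e.g. 5 models x N
    examples. With an odd number of voters and binary labels, ties
    cannot occur.
    """
    n = len(per_model_preds[0])
    return [
        Counter(model[i] for model in per_model_preds).most_common(1)[0][0]
        for i in range(n)
    ]
```

For example, `majority_vote([[1,0,1],[1,1,0],[0,0,1],[1,0,1],[1,1,1]])` returns `[1, 0, 1]`.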
- Developed by: 11303252, 11309715 and 11382801
- Language(s): English
- Model type: Supervised Classification (Majority-Vote Ensemble)
- Model architecture: CNN (Global Features) + BiLSTM (Sequential Features) + ESIM Cross-Attention Matrix
- Finetuned from model: N/A (Trained from scratch using static GloVe 300d embeddings)
### Model Resources
Foundational Architecture Literature:
- ESIM & BiLSTM Framework: Enhanced LSTM for Natural Language Inference (Chen et al., 2017).
- CNN Feature Extraction: Convolutional Neural Networks for Sentence Classification (Kim, 2014).
- Word Embeddings: GloVe: Global Vectors for Word Representation (Pennington et al., 2014).
## Training Details
### Training Data
The dataset comprises 24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track. Vocabulary size: 23,053 tokens, of which 93.40% are covered by GloVe vectors.
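The coverage figure reduces to a set intersection between the corpus vocabulary and the GloVe vocabulary; a minimal sketch, assuming both are sets of token strings:

```python
def glove_coverage(vocab, glove_vocab):
    """Fraction of corpus vocabulary types that have a GloVe vector.

    vocab, glove_vocab: sets of token strings. On this dataset the
    reported coverage was 93.40% of 23,053 tokens; out-of-vocabulary
    tokens receive no pretrained vector.
    """
    covered = len(vocab & glove_vocab)
    return covered / len(vocab)
```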
### Training Procedure
During baseline development, a grid search evaluated hidden_dim ∈ [128, 256], dropout ∈ [0.2, 0.3, 0.5], and learning_rate ∈ [1e-3, 5e-4]. The optimal configuration for the sequential baseline was identified as 128 hidden dimensions and 0.3 dropout.
Upon upgrading to the ESIM architecture, a secondary 3D grid search was performed to account for the increased parameter count. The final system uses a 5-model ensemble strategy: each constituent model was trained on 100% of the training data with a distinct random seed (42-46).
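The sweep described above amounts to an exhaustive loop over the Cartesian product of the grid. A hypothetical sketch, where `train_and_eval` stands in for a full training run that returns dev F1:

```python
from itertools import product

# Grid from the baseline search described above
grid = {
    "hidden_dim": [128, 256],
    "dropout": [0.2, 0.3, 0.5],
    "learning_rate": [1e-3, 5e-4],
}

def grid_search(train_and_eval):
    """Return the best configuration and its dev F1."""
    best_cfg, best_f1 = None, -1.0
    for hd, dr, lr in product(*grid.values()):
        f1 = train_and_eval(hidden_dim=hd, dropout=dr, learning_rate=lr)
        if f1 > best_f1:
            best_cfg = {"hidden_dim": hd, "dropout": dr, "learning_rate": lr}
            best_f1 = f1
    return best_cfg, best_f1
```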
#### Training Hyperparameters
- learning_rate: 1e-03 (ReduceLROnPlateau, factor=0.5, patience=1)
- train_batch_size: 32
- seed: {42, 43, 44, 45, 46}
- num_epochs: 10
- dropout: 0.5
- hidden_dim: 256
- optimizer: AdamW (weight_decay: 1e-04)
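A minimal re-implementation of the scheduling rule listed above, assuming PyTorch-style ReduceLROnPlateau semantics (the LR halves after patience + 1 consecutive epochs without improvement) and assuming the scheduler monitored validation F1:

```python
class PlateauHalver:
    """Sketch of ReduceLROnPlateau with factor=0.5, patience=1,
    maximizing a monitored metric (here: validation F1)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_f1):
        if val_f1 > self.best:
            self.best, self.bad_epochs = val_f1, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:  # tolerate `patience` bad epochs
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Under these assumptions, the rule is consistent with the LR column of the seed-42 trajectory reported under Evaluation.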
## Evaluation
### Results (Official nlu-score Scorer)
The following metrics were generated using the nlu_scorer_env on the 6,736-pair development set.
#### 1. Random-Seed ESIM Ensemble (Primary Submission)
| Metric | Value |
|---|---|
| Accuracy | 0.7096 |
| Macro F1 | 0.7086 |
| Macro Precision | 0.7098 |
| Macro Recall | 0.7084 |
| Matthews CorrCoef | 0.4182 |
#### 2. K-Fold Ensemble (Experimental Study)
| Metric | Value |
|---|---|
| Accuracy | 0.6980 |
| Macro F1 | 0.6977 |
| Matthews CorrCoef | 0.3954 |
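For reference, the Matthews correlation coefficient reported above can be computed from a two-class confusion matrix as follows. This is a generic sketch, not the official scorer:

```python
import math

def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from binary confusion-matrix counts; ranges from -1 to +1,
    with 0 indicating performance no better than chance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```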
**Training Trajectory (Model 1 - Seed 42):** The best state dict for each constituent model was captured at peak validation F1 (e.g., Epoch 6 for Seed 42).
| Epoch | LR | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|---|---|---|---|---|---|---|
| 1 | 1e-3 | 0.6996 | 0.5954 | 0.6119 | 0.6626 | 0.6522 |
| 2 | 1e-3 | 0.6331 | 0.6419 | 0.5996 | 0.6730 | 0.6721 |
| 3 | 1e-3 | 0.6554 | 0.6569 | 0.5916 | 0.6689 | 0.6554 |
| 4 | 5e-4 | 0.6099 | 0.6732 | 0.6233 | 0.6730 | 0.6719 |
| 5 | 5e-4 | 0.5937 | 0.6949 | 0.5790 | 0.6881 | 0.6871 |
| 6 | 5e-4 | 0.5764 | 0.7093 | 0.5785 | 0.6934 | 0.6921 |
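Checkpoint selection at peak validation F1, as described above, reduces to an argmax over the per-epoch history; a minimal sketch (the function name is chosen for illustration):

```python
def track_best(epoch_f1s):
    """Return (best_epoch, best_f1), mimicking how each constituent
    model's state dict was checkpointed at peak validation F1.
    Epochs are numbered from 1, matching the trajectory table."""
    best_epoch, best_f1 = max(enumerate(epoch_f1s, start=1), key=lambda p: p[1])
    return best_epoch, best_f1
```

Applied to the Val F1 column above, this selects epoch 6 for seed 42.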
## Additional Information
ESIM Architecture Grid Search & Regularization Strategy:
The 3D grid search revealed an optimal configuration of 256 hidden dimensions and 0.5 dropout (peak validation F1: 0.6930). This supports the architectural rationale: expanding hidden_dim to 256 provided the capacity needed to model the ESIM semantic alignments, while aggressive 50% dropout was required to prevent the network from memorizing training sequences.
Ensembling Experiments & Statistical Significance: Three paired-samples t-tests were conducted on the development set:
- Random-Seed Ensemble vs. Official Baseline: Demonstrated a significant architectural improvement over the LSTM baseline (70.96% vs. 66.06%; t = 8.0035, p < 0.05).
- K-Fold Ensemble vs. Official Baseline: Also showed a significant improvement (t = 6.1321, p < 0.05).
- Ablation Study (Random-Seed vs. K-Fold): Confirmed the Random-Seed approach was statistically superior to the K-Fold approach (t = 3.2637, p < 0.05). This indicates that, for this non-Transformer architecture, maximizing raw data and vocabulary exposure (each K-Fold model withholds 20% of the training data) was more beneficial than the data diversity gained through cross-validation slicing.
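The paired-samples t statistic used in these comparisons can be sketched in pure Python, assuming per-example 0/1 correctness vectors for the two systems under comparison (the project's exact test inputs are not reproduced here):

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic for per-example scores of two systems.

    x, y: equal-length lists of per-item scores, e.g. 0/1 correctness
    on the dev set for the ensemble vs. the baseline.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

The resulting statistic is compared against the t distribution with n - 1 degrees of freedom to obtain the p-values reported above.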