Model Card for esim-cnn-bilstm-ensemble

This is a Natural Language Inference (NLI) classifier that determines the logical relationship between a given premise and hypothesis. It is a Category B (Non-Transformer) implementation that augments a standard CNN-BiLSTM baseline with an Enhanced Sequential Inference Model (ESIM) Cross-Attention matrix, stabilized via a 5-model majority-vote ensemble.

Model Details

Model Description

This model evaluates text pairs without relying on modern global self-attention mechanisms. Instead, the encoder utilizes parallel processing: a CNN extracts global n-gram features, while a BiLSTM extracts sequential features. To overcome the information bottleneck of standard recurrent networks, the BiLSTM hidden states are passed through an ESIM local inference matrix, which computes the dot-product similarity between every word in the premise and every word in the hypothesis prior to pooling.
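The local inference step described above can be sketched as follows. This is an illustrative NumPy version (the released model presumably uses a deep learning framework), and the function names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def esim_local_inference(a, b):
    """ESIM-style local inference between two encoded sequences.
    a: (len_a, d) premise BiLSTM states; b: (len_b, d) hypothesis states.
    Returns the cross-attended representation of each sequence."""
    e = a @ b.T                       # (len_a, len_b) dot-product similarity matrix
    a_tilde = softmax(e, axis=1) @ b  # each premise word attends over the hypothesis
    b_tilde = softmax(e.T, axis=1) @ a
    return a_tilde, b_tilde
```

Each row of the similarity matrix is normalized into attention weights, so every premise word is re-expressed as a convex combination of hypothesis states (and vice versa) before pooling.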

The pooled vectors from both the CNN and ESIM paths are concatenated and passed through a 3-layer MLP classifier with GELU activations. To minimize statistical variance caused by random weight initialization, the final predictions are generated via a majority-vote ensemble of 5 identical architectures trained on different random seeds.
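The majority vote over the five seed models can be sketched as below; class labels are assumed to be integer-encoded, and `majority_vote` is an illustrative name rather than the released code:

```python
import numpy as np

def majority_vote(preds):
    """preds: (n_models, n_examples) array of integer class labels.
    Returns the per-example majority label (ties broken toward the
    lowest label id by argmax)."""
    preds = np.asarray(preds)
    n_classes = preds.max() + 1
    # Count votes per class for each example.
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)
```

With five voters and a binary label set, ties cannot occur, so the vote is always decisive.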

  • Developed by: 11303252, 11309715 and 11382801
  • Language(s): English
  • Model type: Supervised Classification (Majority-Vote Ensemble)
  • Model architecture: CNN (Global Features) + BiLSTM (Sequential Features) + ESIM Cross-Attention Matrix
  • Finetuned from model: N/A (Trained from scratch using static GloVe 300d embeddings)

Model Resources

Foundational Architecture Literature:

  • ESIM & BiLSTM Framework: Enhanced LSTM for Natural Language Inference (Chen et al., 2017).
  • CNN Feature Extraction: Convolutional Neural Networks for Sentence Classification (Kim, 2014).
  • Word Embeddings: GloVe: Global Vectors for Word Representation (Pennington et al., 2014).

Training Details

Training Data

24,432 premise-hypothesis pairs for training and 6,736 pairs for development, sourced from the COMP34812 NLI track. Vocabulary Size: 23,053 tokens, with 93.40% covered by GloVe vectors.
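A coverage figure of this kind reduces to a set membership count over the corpus vocabulary; a minimal sketch with hypothetical names:

```python
def glove_coverage(vocab, glove_vocab):
    """Fraction of corpus vocabulary tokens that have a GloVe vector.
    vocab: iterable of corpus token types; glove_vocab: set of GloVe keys."""
    vocab = list(vocab)
    covered = sum(1 for tok in vocab if tok in glove_vocab)
    return covered / len(vocab)
```

Out-of-vocabulary tokens (the remaining ~6.6% here) would typically be mapped to a randomly initialized or zero embedding.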

Training Procedure

During baseline development, a grid search evaluated hidden_dim ∈ [128, 256], dropout ∈ [0.2, 0.3, 0.5], and learning_rate ∈ [1e-3, 5e-4]. The optimal sequential configuration was identified as 128 hidden dimensions and 0.3 dropout.
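A grid search of that form can be sketched as an exhaustive sweep over the Cartesian product of the candidate values; `train_eval` stands in for a full train-and-validate run and is hypothetical:

```python
from itertools import product

def grid_search(train_eval, hidden_dims, dropouts, lrs):
    """Exhaustively evaluate every (hidden_dim, dropout, lr) combination.
    train_eval(hd, do, lr) -> validation F1; returns the best config and score."""
    best_cfg, best_f1 = None, float("-inf")
    for hd, do, lr in product(hidden_dims, dropouts, lrs):
        f1 = train_eval(hd, do, lr)
        if f1 > best_f1:
            best_cfg, best_f1 = (hd, do, lr), f1
    return best_cfg, best_f1
```

The sweep above covers 2 × 3 × 2 = 12 configurations, each requiring a full training run.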

Upon upgrading to the ESIM architecture, a secondary 3D grid search was performed to tune the enlarged parameter space. The final architecture utilizes a 5-model ensemble strategy: each constituent model was trained on 100% of the training dataset using a distinct random seed (42-46).

Training Hyperparameters

  • learning_rate: 1e-03 (ReduceLROnPlateau, factor=0.5, patience=1)
  • train_batch_size: 32
  • seed: {42, 43, 44, 45, 46}
  • num_epochs: 10
  • dropout: 0.5
  • hidden_dim: 256
  • optimizer: AdamW (weight_decay: 1e-04)
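The scheduling behaviour implied by these hyperparameters (ReduceLROnPlateau with factor=0.5, patience=1) can be sketched without a framework dependency. This mirrors the semantics of PyTorch's scheduler in "min" mode but is not the library class itself:

```python
class PlateauScheduler:
    """Minimal sketch of ReduceLROnPlateau (factor=0.5, patience=1):
    halves the learning rate once the monitored metric fails to improve
    for more than `patience` consecutive epochs."""
    def __init__(self, lr=1e-3, factor=0.5, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

This matches the trajectory below, where the learning rate halves from 1e-3 to 5e-4 after validation performance stalls.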

Evaluation

Results (Official nlu-score Scorer)

The following metrics were generated using the nlu_scorer_env on the 6,736-pair development set.

1. Random-Seed ESIM Ensemble (Primary Submission)

| Metric            | Value  |
|-------------------|--------|
| Accuracy          | 0.7096 |
| Macro F1          | 0.7086 |
| Macro Precision   | 0.7098 |
| Macro Recall      | 0.7084 |
| Matthews CorrCoef | 0.4182 |

2. K-Fold Ensemble (Experimental Study)

| Metric            | Value  |
|-------------------|--------|
| Accuracy          | 0.6980 |
| Macro F1          | 0.6977 |
| Matthews CorrCoef | 0.3954 |

Training Trajectory (Model 1 - Seed 42): The best state dict for each constituent model was captured at peak validation F1 (e.g., Epoch 6 for Seed 42).

| Epoch | LR   | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|-------|------|------------|-----------|----------|---------|--------|
| 1     | 1e-3 | 0.6996     | 0.5954    | 0.6119   | 0.6626  | 0.6522 |
| 2     | 1e-3 | 0.6331     | 0.6419    | 0.5996   | 0.6730  | 0.6721 |
| 3     | 1e-3 | 0.6554     | 0.6569    | 0.5916   | 0.6689  | 0.6554 |
| 4     | 5e-4 | 0.6099     | 0.6732    | 0.6233   | 0.6730  | 0.6719 |
| 5     | 5e-4 | 0.5937     | 0.6949    | 0.5790   | 0.6881  | 0.6871 |
| 6     | 5e-4 | 0.5764     | 0.7093    | 0.5785   | 0.6934  | 0.6921 |
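Checkpoint selection as described (best state dict at peak validation F1) amounts to an argmax over the per-epoch F1 column; for the Seed 42 trajectory this picks epoch 6:

```python
def best_epoch(val_f1s):
    """1-based index of the epoch with the highest validation F1;
    the state dict saved at that epoch is the one kept for the ensemble."""
    return max(range(len(val_f1s)), key=lambda i: val_f1s[i]) + 1
```

In practice this means saving a checkpoint whenever validation F1 improves, rather than keeping the final-epoch weights.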

Additional Information

ESIM Architecture Grid Search & Regularization Strategy: The 3D grid search revealed an optimal configuration of 256 hidden dimensions and 0.5 dropout (peak validation F1: 0.6930). This empirically supports the architectural design: expanding the hidden_dim to 256 provided the capacity needed to process the complex ESIM semantic alignments, while concurrently requiring aggressive 50% dropout to prevent sequence memorization.

Ensembling Experiments & Statistical Significance: Three Paired Samples t-tests were conducted on the development set:

  1. Random-Seed Ensemble vs. Official Baseline: Demonstrated a significant architectural improvement over the official LSTM baseline (70.96% vs 66.06%; t = 8.0035, p < 0.05).
  2. K-Fold Ensemble vs. Official Baseline: Also showed a significant improvement (t = 6.1321, p < 0.05).
  3. Ablation Study (Random-Seed vs. K-Fold): Confirmed the Random-Seed approach was statistically superior to the K-Fold approach (t = 3.2637, p < 0.05). This indicates that for this specific non-Transformer architecture, maximizing raw data and vocabulary exposure (starved by 20% in K-Folding) was more beneficial than the data diversity gained through cross-validation slicing.
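The paired-samples t statistic used in these comparisons is computed over per-item score differences between two systems on the same development examples. A dependency-free sketch (scipy.stats.ttest_rel would yield the same t along with the p-value):

```python
import math

def paired_t_statistic(x, y):
    """Paired-samples t statistic for per-item scores of two systems:
    t = mean(d) / (std(d) / sqrt(n)), where d = x - y and std uses ddof=1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n)
```

Because the two systems are scored on the same 6,736 development pairs, the paired test controls for per-example difficulty, making it more sensitive than an unpaired comparison.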