An analytical overview of machine learning and deep learning models for Network Intrusion Detection Systems (NIDS), evaluated on two benchmark datasets under a unified experimental framework.
Two widely-used intrusion detection datasets selected to represent both classical and modern network traffic environments.
KDD Cup 99: One of the earliest and most widely used IDS benchmarks. It contains diverse attack types across 41 features, enabling systematic ML evaluation despite its age.
UNSW-NB15: A modern dataset with realistic traffic patterns and diverse attack categories. It reflects contemporary network environments and exhibits significant class imbalance.
Comprehensive metrics including balanced accuracy, F1-macro, G-Mean, precision, and recall evaluated across all models on both datasets — with and without SMOTE oversampling.
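As a sketch of how these metrics can be computed with scikit-learn, the snippet below scores a toy, illustrative detector on an imbalanced label set (the labels and predictions are invented for illustration, not drawn from either dataset). G-Mean, the geometric mean of per-class recalls, is derived from the per-class recall vector:

```python
# Toy illustration of the evaluation metrics; the "detector" here is a
# made-up prediction vector, not one of the evaluated models.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# 95 "normal" flows (0), 5 "attack" flows (1); only 1 of 5 attacks caught.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [1, 0, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)               # 0.96
bal_acc = balanced_accuracy_score(y_true, y_pred)  # 0.60
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

# G-Mean: geometric mean of the per-class recalls, sqrt(1.0 * 0.2) here.
per_class_recall = recall_score(y_true, y_pred, average=None)
g_mean = float(np.prod(per_class_recall) ** (1 / len(per_class_recall)))

print(f"acc={acc:.2f} bal_acc={bal_acc:.2f} f1_macro={f1:.3f} g_mean={g_mean:.3f}")
```

Note how a 96% accuracy coexists with a balanced accuracy of 0.60 and a G-Mean of ~0.45; this is exactly the pattern visible in the UNSW-NB15 results below.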
UNSW-NB15, baseline (no SMOTE):

| Model | Type | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|---|
| Random Forest | ML | 0.9757 | 0.5657 | 0.6197 | 0.5657 | 0.5810 | 0.4822 |
| XGBoost | ML | 0.9738 | 0.4247 | 0.6435 | 0.4247 | 0.4510 | 0.0180 |
| Decision Tree | ML | 0.9753 | 0.4703 | 0.5760 | 0.4703 | 0.4850 | 0.0238 |
| LSTM-CNN | DL | 0.9729 | 0.3823 | 0.4673 | 0.3823 | 0.3950 | 0.0001 |
| ANN | DL | 0.9589 | 0.3744 | 0.3568 | 0.3744 | 0.2990 | ≈0 |
| LSTM | DL | 0.9673 | 0.3287 | 0.3546 | 0.3287 | 0.3250 | ≈0 |
| CNN | DL | 0.9524 | 0.2781 | 0.3386 | 0.2781 | 0.2730 | ≈0 |
UNSW-NB15, with SMOTE oversampling:

| Model | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|
| LSTM-CNN SMOTE | 0.0112 | 0.1241 | 0.0918 | 0.1241 | 0.0077 | 0.0017 |
| Random Forest SMOTE | 0.9194 | 0.1001 | 0.1919 | 0.1001 | 0.0961 | ≈0 |
| ANN SMOTE | 0.9194 | 0.1000 | 0.0919 | 0.1000 | 0.0958 | ≈0 |
| CNN SMOTE | 0.9194 | 0.1000 | 0.0919 | 0.1000 | 0.0958 | ≈0 |
| XGBoost SMOTE | 0.1104 | 0.1108 | 0.1044 | 0.1108 | 0.0248 | ≈0 |
| Decision Tree SMOTE | 0.8633 | 0.0960 | 0.0930 | 0.0960 | 0.0937 | ≈0 |
| LSTM SMOTE | 0.0352 | 0.0683 | 0.0814 | 0.0683 | 0.0097 | ≈0 |
Applying SMOTE to the UNSW-NB15 dataset caused substantial deterioration across all models: balanced accuracy, F1-macro, and G-Mean all dropped sharply relative to the raw-data setting. This suggests that sample-level oversampling is ineffective on high-dimensional, complex datasets, and that model robustness and algorithm-level imbalance handling matter more than data-level balancing.
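To make concrete what "sample-level oversampling" does, here is a minimal sketch of the interpolation step at the heart of SMOTE: each synthetic minority sample is placed on the line segment between a minority point and one of its k nearest minority neighbors. The function `smote_oversample` is illustrative; actual experiments would typically use imbalanced-learn's `SMOTE` class.

```python
# Minimal sketch of SMOTE's core interpolation step (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 of the neighbor matrix is each point itself; keep columns 1..k.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # random minority sample
        j = neighbors[i, rng.integers(k)]      # one of its k neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Toy minority class: 20 points in a 41-dimensional feature space.
X_min = np.random.default_rng(1).normal(size=(20, 41))
X_new = smote_oversample(X_min, n_synthetic=100)
print(X_new.shape)  # (100, 41)
```

Because every synthetic point is a convex combination of two real minority samples, SMOTE can only fill in the space between existing points; in a high-dimensional, complex feature space like UNSW-NB15's, those interpolations may fall in regions that do not resemble real attack traffic, which is consistent with the degradation observed above.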
KDD Cup 99, baseline (no SMOTE):

| Model | Type | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|---|
| Random Forest | ML | 0.9999 | 0.9940 | 0.9969 | 0.9940 | 0.9955 | 0.9939 |
| XGBoost | ML | 0.9997 | 0.9909 | 0.9941 | 0.9909 | 0.9925 | 0.9908 |
| LSTM-CNN | DL | 0.9992 | 0.9848 | 0.9687 | 0.9848 | 0.9758 | 0.9846 |
| Decision Tree | ML | 0.9988 | 0.9614 | 0.9659 | 0.9614 | 0.9622 | 0.9586 |
| LSTM | DL | 0.9985 | 0.9673 | 0.9401 | 0.9673 | 0.9517 | 0.9663 |
| ANN | DL | 0.9987 | 0.9427 | 0.9622 | 0.9427 | 0.9519 | 0.9362 |
| CNN | DL | 0.9899 | 0.6256 | 0.7080 | 0.6256 | 0.6446 | 0.0051 |
KDD Cup 99, with SMOTE oversampling:

| Model | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|
| Random Forest SMOTE | 0.9999 | 0.9922 | 0.9978 | 0.9922 | 0.9949 | 0.9920 |
| XGBoost SMOTE | 0.9997 | 0.9971 | 0.9864 | 0.9971 | 0.9916 | 0.9971 |
| ANN SMOTE | 0.9926 | 0.9953 | 0.7993 | 0.9954 | 0.8655 | 0.9953 |
| LSTM-CNN SMOTE | 0.9992 | 0.9907 | 0.9589 | 0.9908 | 0.9737 | 0.9906 |
| Decision Tree SMOTE | 0.9970 | 0.9931 | 0.9252 | 0.9931 | 0.9386 | 0.9930 |
| LSTM SMOTE | 0.9981 | 0.9980 | 0.9111 | 0.9912 | 0.9431 | 0.9910 |
| CNN SMOTE | 0.9494 | 0.9721 | 0.5231 | 0.9721 | 0.6199 | 0.9717 |
SMOTE notably improved minority-class recall for the deep learning models (balanced accuracy, ANN: 0.94 → 0.99; CNN: 0.63 → 0.97). However, precision dropped, indicating more false positives. Tree-based models (Random Forest, XGBoost) were largely unaffected: they already handle class imbalance well and gain little from data-level oversampling.
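One algorithm-level alternative to oversampling is class weighting, which reweights the loss contribution of minority samples inside the learner itself. The sketch below contrasts an unweighted Random Forest with a `class_weight="balanced"` one on synthetic imbalanced data; the dataset and scores are illustrative, not taken from the evaluated benchmarks.

```python
# Illustrative comparison: unweighted vs class-weighted Random Forest on
# synthetic imbalanced data (none of these numbers come from KDD/UNSW).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# ~97% majority / ~3% minority, loosely mimicking NIDS-style imbalance.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for cw in (None, "balanced"):
    clf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                 random_state=0).fit(X_tr, y_tr)
    scores[cw] = balanced_accuracy_score(y_te, clf.predict(X_te))
print(scores)
```

Unlike SMOTE, class weighting changes no samples, so it cannot introduce unrealistic synthetic traffic; it only shifts the learner's decision threshold toward the minority class.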
Inference time and memory footprint are critical factors for real-world NIDS deployment. Deep learning architectures show significantly higher latency and resource consumption.
| Model | Type | UNSW-NB15 Inference (s) | UNSW-NB15 Memory (MB) | KDD Cup 99 Inference (s) | KDD Cup 99 Memory (MB) |
|---|---|---|---|---|---|
| Decision Tree | ML | 0.025 | 1,579 | 0.026 | 4,959 |
| XGBoost | ML | 0.333 | 1,571 | 0.226 | 4,959 |
| Random Forest | ML | 0.571 | 22,424 | 0.373 | 4,989 |
| ANN | DL | 1.622 | 5,267 | 1.410 | 4,503 |
| CNN | DL | 1.649 | 4,375 | 1.314 | 5,096 |
| LSTM | DL | 1.902 | 4,710 | 1.475 | 5,104 |
| LSTM-CNN | DL | 3.721 | 4,720 | 2.995 | 5,112 |
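Latency figures like those above are typically obtained by wall-clock timing a single batched `predict()` call. The sketch below shows one way to do this with `time.perf_counter`; the model, data sizes, and feature count are illustrative stand-ins, not the actual benchmark setup.

```python
# Illustrative inference-latency measurement for a tree model.
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5_000, 41))   # 41 features, as in KDD Cup 99
y_train = rng.integers(0, 2, size=5_000)
X_test = rng.normal(size=(20_000, 41))

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

start = time.perf_counter()
preds = model.predict(X_test)            # time only the inference call
elapsed = time.perf_counter() - start
print(f"Decision Tree inference: {elapsed:.4f} s for {len(X_test)} samples")
```

Memory figures of the kind tabulated above are usually process-level (e.g. peak resident set size) and are captured with an OS-level tool rather than inside the timing script itself.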
Six critical insights derived from unified experimental evaluation across models and datasets.
Random Forest and XGBoost achieved near-perfect F1-macro (>0.99) on KDD Cup 99. Ensemble methods consistently outperformed all other approaches on structured, tabular datasets.
Multiple DL models reported >96% accuracy on UNSW-NB15, yet their balanced accuracy and G-Mean revealed near-zero minority class detection, exposing the danger of relying on a single metric.
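This accuracy paradox can be reproduced with a trivial majority-class baseline: a classifier that always predicts "normal" scores high accuracy on imbalanced data while detecting nothing. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels (the 97/3 split is illustrative, not the exact dataset ratio).

```python
# The accuracy paradox on a majority-class baseline (synthetic labels).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y = np.array([0] * 970 + [1] * 30)  # 97% "normal", 3% "attack"
X = np.zeros((len(y), 1))           # features are irrelevant to this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)               # 0.97 -- looks strong
bal = balanced_accuracy_score(y, y_pred)      # 0.50 -- zero attacks detected
print(f"accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```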
The LSTM-CNN hybrid achieved strong KDD Cup 99 performance (F1-macro: 0.976) but needs ~3.7 s for inference and roughly 5 GB of memory. This computational cost significantly limits real-time applicability.
Applying SMOTE to UNSW-NB15 degraded all metrics across all models substantially. Simple sample-level balancing cannot compensate for feature complexity and distribution shifts.
ANN balanced accuracy improved from 0.943 to 0.995, and CNN from 0.626 to 0.972 after SMOTE on KDD Cup 99. However, this came at the cost of precision degradation.
With ~0.025 s inference time and the smallest memory footprint on UNSW-NB15, Decision Trees are optimal for latency-sensitive deployments where sub-second detection is a hard requirement.
Practical decision framework for selecting the right model and strategy based on deployment requirements and dataset characteristics.
Decision Trees achieve inference times as low as ~0.025 s, suiting real-time monitoring, while the hybrid LSTM-CNN model approaches or exceeds 3 s latency with roughly 5 GB of memory, making it suitable only for offline analysis. Random Forest and XGBoost provide the best balance of strong detection performance and manageable inference cost, making them the most practical choice for production NIDS environments.
Future research should explore transformer-based architectures for NIDS — leveraging self-attention for parallel processing and better capture of global traffic patterns, potentially overcoming the computational bottlenecks observed in LSTM and CNN-LSTM models.