An analytical overview of machine learning and deep learning models for Network Intrusion Detection Systems (NIDS), evaluated on two benchmark datasets under a unified experimental framework.
Two widely-used intrusion detection datasets selected to represent both classical and modern network traffic environments.
KDD Cup 99: One of the earliest and most widely used IDS benchmarks. It contains diverse attack types across 41 features, enabling systematic ML evaluation despite its age.
UNSW-NB15: A modern dataset with realistic traffic patterns and diverse attack categories. It reflects contemporary network environments and exhibits significant class imbalance.
Comprehensive metrics including balanced accuracy, F1-macro, G-Mean, precision, and recall evaluated across all models on both datasets — with and without SMOTE oversampling.
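As a sketch of how these metrics can be computed with scikit-learn, the snippet below scores a toy, illustrative detector on an imbalanced label set (the labels and predictions are invented for illustration, not drawn from either dataset). G-Mean, the geometric mean of per-class recalls, is derived from the per-class recall vector:

```python
# Toy illustration of the evaluation metrics; the "detector" here is a
# made-up prediction vector, not one of the evaluated models.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# 95 "normal" flows (0), 5 "attack" flows (1); only 1 of 5 attacks caught.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [1, 0, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)               # 0.96
bal_acc = balanced_accuracy_score(y_true, y_pred)  # 0.60
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

# G-Mean: geometric mean of the per-class recalls, sqrt(1.0 * 0.2) here.
per_class_recall = recall_score(y_true, y_pred, average=None)
g_mean = float(np.prod(per_class_recall) ** (1 / len(per_class_recall)))

print(f"acc={acc:.2f} bal_acc={bal_acc:.2f} f1_macro={f1:.3f} g_mean={g_mean:.3f}")
```

Note how a 96% accuracy coexists with a balanced accuracy of 0.60 and a G-Mean of ~0.45; this is exactly the pattern visible in the UNSW-NB15 results below.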
UNSW-NB15, baseline (no SMOTE):

| Model | Type | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|---|
| Random Forest | ML | 0.9757 | 0.5657 | 0.6197 | 0.5657 | 0.5810 | 0.4822 |
| XGBoost | ML | 0.9738 | 0.4247 | 0.6435 | 0.4247 | 0.4510 | 0.0180 |
| Decision Tree | ML | 0.9753 | 0.4703 | 0.5760 | 0.4703 | 0.4850 | 0.0238 |
| LSTM-CNN | DL | 0.9729 | 0.3823 | 0.4673 | 0.3823 | 0.3950 | 0.0001 |
| ANN | DL | 0.9589 | 0.3744 | 0.3568 | 0.3744 | 0.2990 | ≈0 |
| LSTM | DL | 0.9673 | 0.3287 | 0.3546 | 0.3287 | 0.3250 | ≈0 |
| CNN | DL | 0.9524 | 0.2781 | 0.3386 | 0.2781 | 0.2730 | ≈0 |
UNSW-NB15, with SMOTE oversampling:

| Model | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|
| LSTM-CNN SMOTE | 0.0112 | 0.1241 | 0.0918 | 0.1241 | 0.0077 | 0.0017 |
| Random Forest SMOTE | 0.9194 | 0.1001 | 0.1919 | 0.1001 | 0.0961 | ≈0 |
| ANN SMOTE | 0.9194 | 0.1000 | 0.0919 | 0.1000 | 0.0958 | ≈0 |
| CNN SMOTE | 0.9194 | 0.1000 | 0.0919 | 0.1000 | 0.0958 | ≈0 |
| XGBoost SMOTE | 0.1104 | 0.1108 | 0.1044 | 0.1108 | 0.0248 | ≈0 |
| Decision Tree SMOTE | 0.8633 | 0.0960 | 0.0930 | 0.0960 | 0.0937 | ≈0 |
| LSTM SMOTE | 0.0352 | 0.0683 | 0.0814 | 0.0683 | 0.0097 | ≈0 |
Applying SMOTE to the UNSW-NB15 dataset caused substantial deterioration across all models: balanced accuracy, F1-macro, and G-Mean all dropped sharply relative to the raw-data setting. This suggests that sample-level oversampling is ineffective on high-dimensional, complex datasets, and that model robustness and algorithm-level imbalance handling matter more than data-level balancing.
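To make concrete what "sample-level oversampling" does, here is a minimal sketch of the interpolation step at the heart of SMOTE: each synthetic minority sample is placed on the line segment between a minority point and one of its k nearest minority neighbors. The function `smote_oversample` is illustrative; actual experiments would typically use imbalanced-learn's `SMOTE` class.

```python
# Minimal sketch of SMOTE's core interpolation step (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 of the neighbor matrix is each point itself; keep columns 1..k.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # random minority sample
        j = neighbors[i, rng.integers(k)]      # one of its k neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Toy minority class: 20 points in a 41-dimensional feature space.
X_min = np.random.default_rng(1).normal(size=(20, 41))
X_new = smote_oversample(X_min, n_synthetic=100)
print(X_new.shape)  # (100, 41)
```

Because every synthetic point is a convex combination of two real minority samples, SMOTE can only fill in the space between existing points; in a high-dimensional, complex feature space like UNSW-NB15's, those interpolations may fall in regions that do not resemble real attack traffic, which is consistent with the degradation observed above.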
KDD Cup 99, baseline (no SMOTE):

| Model | Type | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|---|
| Random Forest | ML | 0.9999 | 0.9940 | 0.9969 | 0.9940 | 0.9955 | 0.9939 |
| XGBoost | ML | 0.9997 | 0.9909 | 0.9941 | 0.9909 | 0.9925 | 0.9908 |
| LSTM-CNN | DL | 0.9992 | 0.9848 | 0.9687 | 0.9848 | 0.9758 | 0.9846 |
| Decision Tree | ML | 0.9988 | 0.9614 | 0.9659 | 0.9614 | 0.9622 | 0.9586 |
| LSTM | DL | 0.9985 | 0.9673 | 0.9401 | 0.9673 | 0.9517 | 0.9663 |
| ANN | DL | 0.9987 | 0.9427 | 0.9622 | 0.9427 | 0.9519 | 0.9362 |
| CNN | DL | 0.9899 | 0.6256 | 0.7080 | 0.6256 | 0.6446 | 0.0051 |
KDD Cup 99, with SMOTE oversampling:

| Model | Accuracy | Balanced Acc. | Precision | Recall | F1-Macro | G-Mean |
|---|---|---|---|---|---|---|
| Random Forest SMOTE | 0.9999 | 0.9922 | 0.9978 | 0.9922 | 0.9949 | 0.9920 |
| XGBoost SMOTE | 0.9997 | 0.9971 | 0.9864 | 0.9971 | 0.9916 | 0.9971 |
| ANN SMOTE | 0.9926 | 0.9953 | 0.7993 | 0.9954 | 0.8655 | 0.9953 |
| LSTM-CNN SMOTE | 0.9992 | 0.9907 | 0.9589 | 0.9908 | 0.9737 | 0.9906 |
| Decision Tree SMOTE | 0.9970 | 0.9931 | 0.9252 | 0.9931 | 0.9386 | 0.9930 |
| LSTM SMOTE | 0.9981 | 0.9980 | 0.9111 | 0.9912 | 0.9431 | 0.9910 |
| CNN SMOTE | 0.9494 | 0.9721 | 0.5231 | 0.9721 | 0.6199 | 0.9717 |
SMOTE notably improved minority-class recall for the deep learning models (balanced accuracy, ANN: 0.94 → 0.99; CNN: 0.63 → 0.97). However, precision dropped, indicating more false positives. Tree-based models (Random Forest, XGBoost) were largely unaffected: they already handle class imbalance well and gain little from data-level oversampling.
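One algorithm-level alternative to oversampling is class weighting, which reweights the loss contribution of minority samples inside the learner itself. The sketch below contrasts an unweighted Random Forest with a `class_weight="balanced"` one on synthetic imbalanced data; the dataset and scores are illustrative, not taken from the evaluated benchmarks.

```python
# Illustrative comparison: unweighted vs class-weighted Random Forest on
# synthetic imbalanced data (none of these numbers come from KDD/UNSW).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# ~97% majority / ~3% minority, loosely mimicking NIDS-style imbalance.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for cw in (None, "balanced"):
    clf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                 random_state=0).fit(X_tr, y_tr)
    scores[cw] = balanced_accuracy_score(y_te, clf.predict(X_te))
print(scores)
```

Unlike SMOTE, class weighting changes no samples, so it cannot introduce unrealistic synthetic traffic; it only shifts the learner's decision threshold toward the minority class.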
Inference time and memory footprint are critical factors for real-world NIDS deployment. Deep learning architectures show significantly higher latency and resource consumption.
| Model | Type | UNSW-NB15 Inference (s) | UNSW-NB15 Memory (MB) | KDD Cup 99 Inference (s) | KDD Cup 99 Memory (MB) |
|---|---|---|---|---|---|
| Decision Tree | ML | 0.025 | 1,579 | 0.026 | 4,959 |
| XGBoost | ML | 0.333 | 1,571 | 0.226 | 4,959 |
| Random Forest | ML | 0.571 | 22,424 | 0.373 | 4,989 |
| ANN | DL | 1.622 | 5,267 | 1.410 | 4,503 |
| CNN | DL | 1.649 | 4,375 | 1.314 | 5,096 |
| LSTM | DL | 1.902 | 4,710 | 1.475 | 5,104 |
| LSTM-CNN | DL | 3.721 | 4,720 | 2.995 | 5,112 |
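Latency figures like those above are typically obtained by wall-clock timing a single batched `predict()` call. The sketch below shows one way to do this with `time.perf_counter`; the model, data sizes, and feature count are illustrative stand-ins, not the actual benchmark setup.

```python
# Illustrative inference-latency measurement for a tree model.
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5_000, 41))   # 41 features, as in KDD Cup 99
y_train = rng.integers(0, 2, size=5_000)
X_test = rng.normal(size=(20_000, 41))

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

start = time.perf_counter()
preds = model.predict(X_test)            # time only the inference call
elapsed = time.perf_counter() - start
print(f"Decision Tree inference: {elapsed:.4f} s for {len(X_test)} samples")
```

Memory figures of the kind tabulated above are usually process-level (e.g. peak resident set size) and are captured with an OS-level tool rather than inside the timing script itself.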
Six critical insights derived from unified experimental evaluation across models and datasets.
Random Forest and XGBoost achieved near-perfect F1-macro (>0.99) on KDD Cup 99. Ensemble methods consistently outperformed all other approaches on structured, tabular datasets.
Multiple DL models reported >96% accuracy on UNSW-NB15, yet their balanced accuracy and G-Mean revealed near-zero minority class detection, exposing the danger of relying on a single metric.
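This accuracy paradox can be reproduced with a trivial majority-class baseline: a classifier that always predicts "normal" scores high accuracy on imbalanced data while detecting nothing. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels (the 97/3 split is illustrative, not the exact dataset ratio).

```python
# The accuracy paradox on a majority-class baseline (synthetic labels).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y = np.array([0] * 970 + [1] * 30)  # 97% "normal", 3% "attack"
X = np.zeros((len(y), 1))           # features are irrelevant to this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)               # 0.97 -- looks strong
bal = balanced_accuracy_score(y, y_pred)      # 0.50 -- zero attacks detected
print(f"accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```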
The LSTM-CNN hybrid achieved strong KDD Cup 99 performance (F1-macro: 0.976) but needs ~3.7 s for inference and roughly 5 GB of memory. This computational cost significantly limits real-time applicability.
Applying SMOTE to UNSW-NB15 degraded all metrics across all models substantially. Simple sample-level balancing cannot compensate for feature complexity and distribution shifts.
ANN balanced accuracy improved from 0.943 to 0.995, and CNN from 0.626 to 0.972 after SMOTE on KDD Cup 99. However, this came at the cost of precision degradation.
With ~0.025 s inference time and the smallest memory footprint on UNSW-NB15, Decision Trees are optimal for latency-sensitive deployments where sub-second detection is a hard requirement.
Practical decision framework for selecting the right model and strategy based on deployment requirements and dataset characteristics.
Decision Trees achieve inference times as low as ~0.025 s, suiting real-time monitoring, while the hybrid LSTM-CNN model approaches or exceeds 3 s latency with roughly 5 GB of memory, making it suitable only for offline analysis. Random Forest and XGBoost provide the best balance of strong detection performance and manageable inference cost, making them the most practical choice for production NIDS environments.
Future research should explore transformer-based architectures for NIDS — leveraging self-attention for parallel processing and better capture of global traffic patterns, potentially overcoming the computational bottlenecks observed in LSTM and CNN-LSTM models.