---
tags:
- anomaly-detection
- intrusion-detection
- vehicle-security
- automotive
- CAN-bus
- cybersecurity
- sklearn
- xgboost
- tabular-classification
license: apache-2.0
library_name: sklearn
metrics:
- accuracy
- f1
- precision
- recall
---

# Vehicle Intrusion Detection System (IDS): Anomaly Detector

**Multi-tiered hybrid IDS for detecting hacking attempts in vehicle CAN bus traffic.**

Based on the [MTH-IDS architecture](https://arxiv.org/abs/2105.13289). The MTH-IDS paper (340 citations) reports 99.99% accuracy; the results for this model are listed under Performance.

## Architecture

### Tier 1: Signature-Based IDS (Multi-Class Classification)

- **Stacking Ensemble**: XGBoost + RandomForest + ExtraTrees + DecisionTree → Logistic Regression meta-learner
- Detects 4 known attack types: **DoS, Fuzzy, RPM Spoofing, Gear Spoofing**
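
A stacking ensemble of this shape can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the released training script: the synthetic data and all hyperparameters are assumptions, and XGBoost is left out so the sketch depends only on scikit-learn (an `xgboost.XGBClassifier` could be appended to `base_learners` in the same way).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for CAN features: 5 classes (Normal + 4 attack types).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners (hyperparameters are assumptions, chosen to run quickly).
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# The meta-learner is trained on cross-validated base-learner predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Because the meta-learner sees out-of-fold predictions (`cv=3` here), it learns how much to trust each base learner without being fit on predictions the base learners made for their own training rows.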

### Tier 2: Anomaly-Based IDS (Zero-Day Detection)

- **Isolation Forest** trained on normal traffic only
- Detects unknown/novel attacks not seen during training
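
The idea of fitting on normal traffic only can be sketched with scikit-learn's `IsolationForest`. The feature matrices below are synthetic and the `contamination` value is an assumption; neither comes from this model's training run.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical feature matrices: normal traffic clusters near the origin,
# while the "novel attack" samples sit far away from it.
X_normal = rng.normal(0.0, 1.0, size=(500, 4))
X_novel = rng.normal(8.0, 1.0, size=(20, 4))

# Fit on normal traffic only; no attack labels are needed.
detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
detector.fit(X_normal)

# score_samples(): the lower the score, the more abnormal the sample.
normal_scores = detector.score_samples(X_normal)
novel_scores = detector.score_samples(X_novel)

# predict() returns +1 for inliers and -1 for anomalies.
is_anomaly_normal = detector.predict(X_normal) == -1
is_anomaly_novel = detector.predict(X_novel) == -1
print("mean score, normal:", normal_scores.mean())
print("mean score, novel: ", novel_scores.mean())
print("fraction of normal traffic flagged:", is_anomaly_normal.mean())
```

The `contamination` parameter sets the score threshold, so it directly trades false alarms on normal traffic against missed anomalies.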

### Combined Detection

- If Tier 1 classifies as Normal but Tier 2 flags as anomaly → **UNKNOWN_ATTACK** alert
- Catches zero-day attacks that evade signature-based detection
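
The combination rule can be written as a small function. This is a sketch under assumptions: the function name `combine_alerts` and the `"Normal"` label string are hypothetical, and mapping every non-Normal Tier 1 label to `KNOWN_ATTACK` regardless of the Tier 2 flag is an assumed reading of the rule above.

```python
def combine_alerts(tier1_labels, tier2_is_anomaly):
    """Merge Tier 1 class labels and Tier 2 anomaly flags into alerts."""
    alerts = []
    for label, is_anomaly in zip(tier1_labels, tier2_is_anomaly):
        if label != "Normal":
            # Tier 1 recognized a known signature (assumed to take priority).
            alerts.append("KNOWN_ATTACK")
        elif is_anomaly:
            # Tier 1 saw nothing known, but Tier 2 flagged the message.
            alerts.append("UNKNOWN_ATTACK")
        else:
            alerts.append("NORMAL")
    return alerts


tier1 = ["Normal", "DoS", "Normal", "Fuzzy"]
tier2 = [False, True, True, False]
alerts = combine_alerts(tier1, tier2)
print(alerts)
# ['NORMAL', 'KNOWN_ATTACK', 'UNKNOWN_ATTACK', 'KNOWN_ATTACK']
```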

## Performance

| Metric | Tier 1 (Multi-Class) | Tier 2 (Anomaly) |
|--------|---------------------|-------------------|
| Accuracy | 0.9586 | 0.6103 |
| F1 (weighted) | 0.9584 | 0.7035 |
| Precision | 0.9597 | 0.9243 |
| Recall | 0.9586 | 0.5678 |
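
As a quick consistency check on the Tier 2 column, F1 is the harmonic mean of precision and recall. The sketch below assumes the Tier 2 precision and recall are single-class (anomaly) scores, which the table does not state.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


tier2_f1 = f1_from(0.9243, 0.5678)
print(round(tier2_f1, 4))  # 0.7035
```

Under that assumption the result matches the reported 0.7035. The precision and recall pair also shows the trade-off in Tier 2: most of what it flags is a real anomaly, but a recall of 0.5678 means roughly 43% of anomalies go undetected.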

### Base Learner Validation Accuracies

| Model | Accuracy |
|-------|----------|
| Decision Tree | 0.9664 |
| Random Forest | 0.9690 |
| Extra Trees | 0.9690 |
| XGBoost | 0.9689 |

## Attack Types Detected

| Attack | Description | Detection |
|--------|-------------|-----------|
| **DoS** | Flood CAN bus with dominant ID (0x0000) every 0.3 ms | Signature (Tier 1) |
| **Fuzzy** | Random CAN ID and data injection every 0.5 ms | Signature (Tier 1) |
| **RPM Spoofing** | Inject fake RPM gauge values every 1 ms | Signature (Tier 1) |
| **Gear Spoofing** | Inject fake drive gear values every 1 ms | Signature (Tier 1) |
| **Unknown/Zero-Day** | Any novel attack pattern | Anomaly (Tier 2) |

## Usage

```python
import pickle
import pandas as pd
from inference import load_model, preprocess, predict

# Load model
model = load_model('vehicle_ids_model.pkl')

# Load CAN bus data (CSV format: timestamp, can_id, dlc, d0-d7, flag)
df = pd.read_csv('can_traffic.csv')

# Preprocess and predict
X = preprocess(df, model)
results = predict(X, model)

# Results contain: attack_type, anomaly_score, is_anomaly, alert
# alert values: NORMAL, KNOWN_ATTACK, UNKNOWN_ATTACK
print(results['alert'].value_counts())
```

## Input Format

CAN bus message CSV with columns:

- `timestamp`: Recording time (seconds)
- `can_id`: CAN identifier in HEX (e.g., "043F")
- `dlc`: Data Length Code (0-8)
- `d0`-`d7`: Data bytes in HEX (e.g., "FF")
- `flag`: R (normal) or T (injected/attack)
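
Rows in this layout can be decoded with the standard library alone. The sketch below is illustrative: the two sample rows are made up, `parse_row` is a hypothetical helper (not part of `inference.py`), and it assumes a header row with the column names listed above.

```python
import csv
import io

# Two made-up rows in the column order described above (header included).
SAMPLE = """timestamp,can_id,dlc,d0,d1,d2,d3,d4,d5,d6,d7,flag
1478198376.389427,043F,8,00,40,60,FF,5A,6E,00,00,R
1478198376.389636,0000,8,00,00,00,00,00,00,00,00,T
"""


def parse_row(row: dict) -> dict:
    """Convert hex strings to integers and the flag to a boolean."""
    dlc = int(row["dlc"])
    return {
        "timestamp": float(row["timestamp"]),
        "can_id": int(row["can_id"], 16),
        "dlc": dlc,
        # Only the first `dlc` data columns carry payload bytes.
        "data": [int(row[f"d{i}"], 16) for i in range(dlc)],
        "injected": row["flag"] == "T",
    }


messages = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(messages[0]["can_id"], messages[0]["data"])
# 1087 [0, 64, 96, 255, 90, 110, 0, 0]
```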

## Feature Engineering

Features are extracted from raw CAN messages in the following groups; 10 are used by the model:

- CAN ID (decimal), DLC
- 8 data bytes (decimal)
- Statistical features: mean, std, min, max, range, sum of data bytes
- Temporal: inter-arrival time (IAT)
- Frequency: CAN ID frequency in traffic
- Information-theoretic: data byte entropy
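
The statistical, temporal, frequency, and entropy features can be sketched in plain Python. The exact definitions used by this model are not given here, so the following are assumptions: population standard deviation, IAT measured between consecutive messages regardless of CAN ID, frequency as the share of messages with the same ID in the capture, and Shannon entropy over the byte values of a single message. The function names are hypothetical, and each message is assumed to carry at least one data byte.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def byte_entropy(data: list[int]) -> float:
    """Shannon entropy (bits) of the byte values in one message."""
    counts = Counter(data)
    total = len(data)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def extract_features(messages: list[dict]) -> list[dict]:
    """messages: dicts with 'timestamp', 'can_id', 'data' (list of ints)."""
    id_counts = Counter(m["can_id"] for m in messages)
    features = []
    prev_ts = None
    for m in messages:
        data = m["data"]
        features.append({
            "can_id": m["can_id"],
            "mean": mean(data),
            "std": pstdev(data),
            "min": min(data),
            "max": max(data),
            "range": max(data) - min(data),
            "sum": sum(data),
            # Time since the previous message (0.0 for the first one).
            "iat": 0.0 if prev_ts is None else m["timestamp"] - prev_ts,
            # Share of all messages in the capture that carry this CAN ID.
            "id_freq": id_counts[m["can_id"]] / len(messages),
            "entropy": byte_entropy(data),
        })
        prev_ts = m["timestamp"]
    return features


msgs = [
    {"timestamp": 0.000, "can_id": 0x043F, "data": [0, 0, 255, 255]},
    {"timestamp": 0.001, "can_id": 0x0000, "data": [0, 0, 0, 0]},
    {"timestamp": 0.003, "can_id": 0x043F, "data": [1, 2, 3, 4]},
]
feats = extract_features(msgs)
print([f["entropy"] for f in feats])  # [1.0, 0.0, 2.0]
```

A constant payload such as the all-zero DoS frame has zero entropy, while a payload of four distinct bytes reaches the 2-bit maximum for that length, which is why entropy helps separate flooding from random injection.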

## Training Details

- Dataset: ~1,077,264 CAN bus messages (HCRL Car-Hacking format)
- Train/Val/Test split: 70/15/15
- SMOTE for class imbalance handling
- Feature selection via Mutual Information
- Z-score normalization
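
The split and normalization steps can be sketched as follows. This covers only the 70/15/15 split and z-scoring, not SMOTE or mutual-information selection. Computing the mean and standard deviation on the training split alone is a common practice assumed here, not something stated above, and the helper names are hypothetical.

```python
import random
from statistics import mean, pstdev


def split_70_15_15(rows: list, seed: int = 0):
    """Shuffle and split rows into train/validation/test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def fit_zscore(values: list[float]):
    """Return a function that standardizes with the given sample's mean/std."""
    mu, sigma = mean(values), pstdev(values)
    sigma = sigma or 1.0  # avoid division by zero for constant features
    return lambda x: (x - mu) / sigma


values = [float(v) for v in range(100)]
train, val, test = split_70_15_15(values)

# Statistics come from the training split only, then get reused elsewhere.
scale = fit_zscore(train)
train_z = [scale(v) for v in train]
val_z = [scale(v) for v in val]
test_z = [scale(v) for v in test]
print(len(train), len(val), len(test))  # 70 15 15
```

Reusing the training statistics on the validation and test splits keeps information about held-out data from leaking into preprocessing.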

## References

1. Yang et al., "MTH-IDS: A Multi-Tiered Hybrid Intrusion Detection System for IoV", IEEE Internet of Things Journal, 2021. [arXiv:2105.13289](https://arxiv.org/abs/2105.13289)
2. Seo et al., "GIDS: GAN based IDS for In-Vehicle Network", PST 2018. [arXiv:1907.07377](https://arxiv.org/abs/1907.07377)
3. Song et al., "In-vehicle network intrusion detection using deep CNN", Vehicular Communications, 2020.