---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- siem
- security-logs
- mitre-attack
- apt
- tabular-classification
- synthetic-data
- xgboost
- baseline
- leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb010-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb010-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 5-class attack lifecycle phase classification
    dataset:
      type: xpertsystems/cyb010-sample
      name: CYB010 Synthetic Security Event Log Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9904
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.9493
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7781
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.936
      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.988
      name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
---

# CYB010 Baseline Classifier

**Attack lifecycle phase classifier (5-class) trained on the CYB010
synthetic security event log sample. It predicts which of 5 attack phases
(`benign_background` / `initial_access` / `lateral_movement` /
`persistence_establishment` / `exfiltration_or_impact`) a security
event belongs to, using per-event features. The repo also ships
`leakage_diagnostic.json`, which documents 11 oracle paths discovered
across the dataset's targets, plus 2 README-suggested targets that are
unlearnable on the sample after honest leak removal.**

> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `attack_lifecycle_phase` (the
> dataset's headline target), and (2) `leakage_diagnostic.json`,
> which documents 11 separate oracle paths plus 2 unlearnable targets.
> Both files matter; the diagnostic is required reading for anyone
> evaluating CYB010 for SIEM ML work.

## Model overview

| Property | Value |
|---|---|
| Primary task | 5-class `attack_lifecycle_phase` classification |
| Secondary artifact | `leakage_diagnostic.json`: 11 oracle paths + 2 unlearnable targets |
| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | **Group-aware** (GroupShuffleSplit on `incident_id`) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |

## Why this task, and what was dropped

The CYB010 README's central concept is the "5-phase attack lifecycle
state machine", and `attack_lifecycle_phase` is the data's headline
target. We piloted six candidate targets (the two `threat_actor_profile`
formulations share one bullet) and found:

- **`attack_lifecycle_phase` 5-class**: strongest honest result.
  Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes
  are represented, with per-class F1 ranging from 0.48 to 1.00.
- **`threat_actor_profile` 5-class**: reaches acc 0.84, but per-class
  F1 shows the result is driven almost entirely by `benign_user`
  separation (F1 1.00 vs F1 0.17-0.69 for the 4 malicious classes). The
  4-class malicious-only formulation falls below the majority baseline
  (acc 0.55 vs 0.61).
- **`label_true_positive` binary on alerts**: documented as a secondary
  finding. It has 7 oracle features; honest acc is 0.80 and AUC 0.89
  after dropping all of them.
- **`mitre_tactic` 14-class**: reaches acc 0.90 but macro-F1 of only
  0.37, which reflects imbalance gaming (the benign class dominates at
  57%).
- **`event_class` 12-class**: unlearnable (acc 0.35 vs majority 0.42).

### Six oracle columns dropped from the phase task

CYB010 encodes the benign vs malicious distinction explicitly in
multiple columns. Each is a perfect or near-perfect oracle for the
`benign_background` phase:

| Column | Oracle relationship |
|---|---|
| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
| `event_type` | Many values are phase-specific (e.g. `c2_beacon_outbound` → 100% `exfiltration_or_impact`) |

With these six columns present, a plain XGBoost model trivially
separates benign from malicious events. The published baseline is
trained with all six excluded.
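
The oracle check and the column drop described above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the code behind `leakage_diagnostic.json`: the helper name `is_exact_oracle` and the toy rows are hypothetical, and only the column names come from the table.

```python
import pandas as pd

# The six oracle columns named in the table above.
ORACLE_COLUMNS = [
    "mitre_tactic", "mitre_technique_id", "label_malicious",
    "threat_actor_id", "threat_actor_profile", "event_type",
]

def is_exact_oracle(df, column, value, phase, target="attack_lifecycle_phase"):
    """True when `column == value` holds on exactly the rows where target == phase."""
    return bool(((df[column] == value) == (df[target] == phase)).all())

# Toy frame with hypothetical rows (not taken from CYB010).
toy = pd.DataFrame({
    "mitre_tactic": ["benign", "benign", "exfiltration", "lateral-movement"],
    "source_port": [443, 51514, 8080, 445],
    "attack_lifecycle_phase": [
        "benign_background", "benign_background",
        "exfiltration_or_impact", "lateral_movement",
    ],
})

# Check one oracle relationship, then drop every oracle column plus the target.
leaks = is_exact_oracle(toy, "mitre_tactic", "benign", "benign_background")
features = toy.drop(columns=ORACLE_COLUMNS + ["attack_lifecycle_phase"], errors="ignore")
```

Passing `errors="ignore"` lets the same drop list be reused on frames that contain only some of the six columns.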

Two model artifacts are published. They are designed to be used
together:

- `model_xgb.json`: gradient-boosted trees (slightly higher F1)
- `model_mlp.safetensors`: PyTorch MLP

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download, snapshot_download

REPO = "xpertsystems/cyb010-baseline-classifier"

# Download the artifacts. The MLP weights and scaler are fetched too,
# but they are not used in this XGBoost-only snippet.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the bundled feature pipeline importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)

meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_event` is a single raw event record that you supply.
# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type - those were the
# oracle columns.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB010, 21,896 per-event records:

| Phase | Events | Class share |
|---|---:|---:|
| `benign_background` | 12,448 | 56.9% |
| `exfiltration_or_impact` | 6,205 | 28.3% |
| `initial_access` | 1,674 | 7.6% |
| `lateral_movement` | 968 | 4.4% |
| `persistence_establishment` | 601 | 2.7% |

### Group-aware split by incident_id

The sample holds 500 incidents with ~44 events each. Events from the
same incident share host, threat actor, and phase trajectory, so random
splitting risks train/test contamination. The baseline uses
**GroupShuffleSplit** on `incident_id` (nested 70/15/15):

| Fold | Events | Incidents |
|---|---:|---:|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |
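
A nested group-aware split of this kind can be sketched with scikit-learn's `GroupShuffleSplit`. The training script is private, so the two-stage order, the reuse of one seed for both stages, and the helper name `nested_group_split` are assumptions; the synthetic `incident_id` array only mimics the 500 × ~44 shape of the sample.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def nested_group_split(groups, seed=42, test_size=0.15, val_size=0.15):
    """Split row indices into train/val/test so that no group spans two folds."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))

    # Stage 1: hold out the test groups.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(outer.split(idx, groups=groups))

    # Stage 2: carve validation out of the remainder, rescaled so that it
    # covers roughly `val_size` of all groups.
    inner = GroupShuffleSplit(
        n_splits=1, test_size=val_size / (1.0 - test_size), random_state=seed
    )
    tr, va = next(inner.split(trainval_idx, groups=groups[trainval_idx]))
    return trainval_idx[tr], trainval_idx[va], test_idx

# Synthetic stand-in: 500 incidents with 44 events each.
incident_id = np.repeat(np.arange(500), 44)
train_idx, val_idx, test_idx = nested_group_split(incident_id)
```

Because whole incidents move together, fold sizes in events vary from seed to seed even though the incident counts stay close to 350/75/75.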

All 10 multi-seed evaluations yielded all 5 classes in the test fold.
Class imbalance is addressed with balanced class weights: they are
passed as `sample_weight` for XGBoost (the equivalent of
`class_weight='balanced'`) and as a weighted cross-entropy loss for
the MLP.
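
To make the balanced weighting concrete, the sketch below computes per-class weights with scikit-learn's `compute_sample_weight`. It uses the whole-sample class counts from the table above as a stand-in; the published model would have derived its weights from the training fold only, so the exact values are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Whole-sample class counts from the training-data table (stand-in for the train fold).
counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}

# One integer label per event, in the order of `counts`.
y = np.repeat(np.arange(len(counts)), list(counts.values()))

# "balanced" assigns each row n_samples / (n_classes * count_of_its_class).
weights = compute_sample_weight(class_weight="balanced", y=y)

# Every row of a class shares one weight, so read off the first row of each class.
per_class = {name: float(weights[y == k][0]) for k, name in enumerate(counts)}
```

The resulting `weights` array has the shape expected by the `sample_weight` argument of `XGBClassifier.fit`; the per-class values could likewise seed the `weight` tensor of `torch.nn.CrossEntropyLoss`.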

## Feature pipeline

The bundled `feature_engineering.py` is the canonical recipe. 87
features survive after encoding, drawn from:

- **Per-event numeric** (5): `source_port`, `dest_port`,
  `cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
- **Per-event categorical** (3, one-hot): `event_class` (12 values),
  `log_source_type` (8 values), `severity_level` (5 values)
- **Host features** (joined from `host_inventory.csv`): 3 numeric +
  7 categorical (`os_type`, `host_role`, `network_segment`,
  `defender_posture`, `criticality_rating`, `cloud_provider`,
  `siem_platform`)
- **Engineered** (9): `hour_of_day`, `is_off_hours`, `is_weekend`,
  `log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`,
  `is_outbound_web`, `risk_composite`
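
For readers building their own pipeline, here is a rough sketch of how seven of the nine engineered features might be derived. Every definition is an assumption: the off-hours window (before 08:00 or from 18:00 UTC), the high-CVSS cutoff (7.0), the port ranges (IANA system ports below 1024, dynamic ports from 49152), and which port column each flag reads. `is_outbound_web` and `risk_composite` are omitted because this card does not define them. The bundled `feature_engineering.py` remains the authoritative recipe.

```python
import numpy as np
import pandas as pd

def add_engineered_features(events: pd.DataFrame) -> pd.DataFrame:
    """Append illustrative engineered columns; thresholds here are assumptions."""
    out = events.copy()
    ts = pd.to_datetime(out["timestamp"], utc=True)

    # Timing features.
    out["hour_of_day"] = ts.dt.hour
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    out["is_off_hours"] = ((ts.dt.hour < 8) | (ts.dt.hour >= 18)).astype(int)

    # CVSS features.
    out["log_cvss"] = np.log1p(out["cvss_score_analogue"])
    out["is_high_cvss"] = (out["cvss_score_analogue"] >= 7.0).astype(int)

    # Port features.
    out["is_well_known_port"] = (out["dest_port"] < 1024).astype(int)
    out["is_dynamic_port"] = (out["source_port"] >= 49152).astype(int)
    return out

# Two hypothetical events: a Saturday-night one and a Monday-afternoon one.
toy = pd.DataFrame({
    "timestamp": ["2026-05-16T03:15:00Z", "2026-05-18T14:00:00Z"],
    "cvss_score_analogue": [9.1, 2.0],
    "source_port": [51514, 1025],
    "dest_port": [443, 8080],
})
feats = add_engineered_features(toy)
```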

### Partial-oracle features kept as legitimate observables

`event_class` (max purity 0.87, mean 0.72 across phases) is the
strongest non-oracle feature. C2 beacon traffic (`event_class =
network_flow`) is 65% exfiltration phase but also 29% benign and 6%
other phases, which is real overlap rather than deterministic
encoding. Kept.

`severity_level` and `cvss_score_analogue` correlate strongly with
phase (high-severity events skew toward exfil and initial_access) but
with substantial overlap. Kept.

`label_log_tampered` is a real observable (APTs tamper more than
script_kiddies) but is not phase-deterministic. Kept.
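
One way to quantify how close a feature comes to being an oracle is a purity profile. The sketch assumes that "purity" of a feature value means the share of its most common phase, with the max and mean then taken over values; the card does not spell out its exact definition, so figures from this helper may not match the ones quoted above. The function name and toy rows are hypothetical.

```python
import pandas as pd

def purity_profile(df, feature, target="attack_lifecycle_phase"):
    """Return (max, mean) over feature values of the dominant-phase share."""
    # Row-normalised contingency table: each row sums to 1.
    shares = pd.crosstab(df[feature], df[target], normalize="index")
    per_value = shares.max(axis=1)
    return float(per_value.max()), float(per_value.mean())

# Toy frame with hypothetical rows: one mostly-pure value and one mixed value.
toy = pd.DataFrame({
    "event_class": ["network_flow"] * 4 + ["authentication"] * 2,
    "attack_lifecycle_phase": (
        ["exfiltration_or_impact"] * 3
        + ["benign_background"]
        + ["initial_access", "lateral_movement"]
    ),
})
max_purity, mean_purity = purity_profile(toy, "event_class")
```

A max purity of exactly 1.0 on a frequent value is the signal worth investigating as a possible oracle; values well below 1.0 indicate genuine class overlap.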

## Evaluation

### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9904** |
| Accuracy | **0.9493** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9861** |
| Accuracy | **0.9412** |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |

XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941,
macro-F1 0.778 vs 0.753). The gap is consistent across seeds.
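
The four metrics in these tables can be computed from a matrix of predicted class probabilities with scikit-learn. The sketch below is an assumption about the evaluation recipe rather than the private evaluation code; it runs on random synthetic probabilities, so its numbers have no relation to the published ones.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def phase_metrics(y_true, proba):
    """Accuracy, macro/weighted F1 and macro one-vs-rest ROC-AUC."""
    y_pred = proba.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "macro_roc_auc_ovr": roc_auc_score(
            y_true, proba, multi_class="ovr", average="macro"
        ),
    }

# Synthetic 5-class problem: noisy logits nudged toward the true class.
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(5), 20)
logits = rng.normal(size=(100, 5))
logits[np.arange(100), y_true] += 2.0

# Softmax so that each row sums to 1, as roc_auc_score requires for multiclass.
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
metrics = phase_metrics(y_true, proba)
```

Macro averaging weights all five phases equally, which is why macro-F1 sits well below accuracy when the rare middle classes are hard.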

### Multi-seed robustness (XGBoost, 10 seeds)

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |

**Tightest ROC-AUC std in the XpertSystems catalog** (0.001). All 10
seeds yielded all 5 classes in the test fold. Full per-seed results are
in [`multi_seed_results.json`](./multi_seed_results.json).

### Per-class F1 (seed 42)

| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `benign_background` | 56.9% | **0.998** | 0.994 |
| `exfiltration_or_impact` | 28.3% | **0.987** | 0.981 |
| `initial_access` | 7.6% | 0.720 | 0.651 |
| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
| `lateral_movement` | 4.4% | **0.483** | 0.451 |

The two largest classes (`benign_background` and `exfiltration_or_impact`)
are nearly perfectly separable: `benign_background` because the
non-oracle features (severity, CVSS, log_source) still cleanly separate
non-malicious traffic, and `exfiltration_or_impact` because it is
dominated by network_flow events (C2 beacons). The three middle
classes overlap substantially in feature space; `lateral_movement` is
the hardest (F1 0.48) because lateral movement events look similar to
initial_access events at the per-event level. A sequence model that
considers event ordering within an incident would likely do better
than the per-event baseline.

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---:|---:|---:|---:|---:|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | n/a | n/a |
| No `event_class` | 0.9206 | 0.5969 | 0.9723 | **−0.0287** | **−0.181** |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |

Three findings:

1. **`event_class` is the dominant signal** (macro-F1 drops 18pp when
   it is removed). Without it, phase prediction loses most of its
   discrimination between the middle classes.
2. **CVSS features are second-strongest** (macro-F1 drops 3pp). They
   capture severity information that complements `event_class`.
3. **Host features and timing add modest noise.** The model performs
   marginally *better* without host features (+0.3pp accuracy), and
   timing features contribute essentially nothing. Both groups are
   kept in the pipeline as a documented baseline reference.
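
The Δ columns are plain differences against the full-feature row. As a quick arithmetic check, the sketch below recomputes them for three rows using values copied from the table; the dictionary layout and helper name are illustrative and do not reflect the schema of `ablation_results.json`.

```python
# Seed-42 values copied from the ablation table.
BASELINE = {"accuracy": 0.9493, "macro_f1": 0.7781}
ABLATIONS = {
    "No event_class": {"accuracy": 0.9206, "macro_f1": 0.5969},
    "No CVSS features": {"accuracy": 0.9383, "macro_f1": 0.7475},
    "No host features": {"accuracy": 0.9522, "macro_f1": 0.7828},
}

def deltas(row, base=BASELINE):
    """Ablated metric minus baseline metric, rounded to 4 decimals."""
    return {metric: round(row[metric] - base[metric], 4) for metric in base}

delta_table = {name: deltas(row) for name, row in ABLATIONS.items()}

for name, d in delta_table.items():
    print(f"{name:<20} d_acc={d['accuracy']:+.4f} d_f1={d['macro_f1']:+.4f}")
```

A negative delta means the removed group was helping; the positive deltas for host features are what motivates the third finding.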

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.

Training hyperparameters are held internally by XpertSystems.
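
The MLP layout described above maps onto a small `nn.Sequential`. This is a sketch of the stated architecture only: the builder name `build_mlp` is hypothetical, and the parameter names it produces are not guaranteed to match the keys stored in `model_mlp.safetensors`, so the published weights may not load into it without remapping.

```python
import torch
from torch import nn

def build_mlp(n_features=87, hidden=(128, 64), n_classes=5, dropout=0.3):
    """Linear -> BatchNorm1d -> ReLU -> Dropout per hidden layer, then a linear head."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))  # raw logits for the 5 phases
    return nn.Sequential(*layers)

# eval() disables dropout and makes BatchNorm use its running statistics.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 87))  # dummy batch of 4 feature vectors
proba = torch.softmax(logits, dim=1)
```

The head returns logits rather than probabilities because `torch.nn.CrossEntropyLoss` applies the log-softmax itself during training.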

## Limitations

**This is a baseline reference, not a production phase classifier.**

1. **The leakage diagnostic is required reading.** Six oracle columns
   for the phase task and seven for the alert TP task are documented
   in `leakage_diagnostic.json`. If you use CYB010 sample data for
   your own training, you MUST drop these or your model will learn
   the oracles instead of the task.
2. **`lateral_movement` F1 0.48 is the weakest class.** The class has
   only 968 events and overlaps substantially with `initial_access`,
   which makes it hard. A sequence model that considers event ordering
   within incidents would likely do better than per-event
   classification.
3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable
   on this sample** (acc 0.55 vs majority 0.61). The 5-class
   formulation with benign included works only because `benign_user`
   separation is structurally trivial.
4. **`event_class` 12-class is unlearnable on this sample** (acc 0.35
   vs majority 0.42). `event_class` is a structural property of the
   event itself, not something to predict from other features.
5. **Synthetic-vs-real transfer.** The dataset is synthetic, calibrated
   to 6 benchmark metrics drawn from SANS / IBM / Mandiant / Verizon /
   CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has
   different noise characteristics. In particular, the explicit
   `mitre_tactic == "benign"` marker and the `threat_actor_id == "NONE"`
   benign sentinel would not be present in real data, where the
   benign-vs-malicious distinction is implicit and emerges from event
   content. Do not assume metrics transfer end-to-end.
6. **21,896 events / 500 incidents is a modest training set.** The
   3,726-event / ~75-incident test fold yields stable multi-seed
   metrics (std 0.007 on accuracy), but per-class confidence intervals
   widen for the smallest classes (`lateral_movement`,
   `persistence_establishment`).

## Notes on dataset schema

The CYB010 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `security_events` has 16 columns | Data has **23 columns** |
| Field renames | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
| `alert_records` has 9 columns | Data has **21 columns** |
| Field renames | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
| README's `ioc_matched` | Not present in `alert_records` |
| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
| `incident_summary` has 8 columns | Data has **24 columns** |
| `host_inventory` has 6 columns | Data has **15 columns** |
| `threat_actor_profile` has 4 values | Data has **5 values** (adds `benign_user` at 57% of events) |
| `attack_lifecycle_phase` 5-phase malicious lifecycle | Data adds `benign_background` as a phase value (57% of events), so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |

None of these affects model correctness, because the feature pipeline
uses the actual column names. If you build your own pipeline against
the dataset, use the actual columns.

## Intended use

- **Evaluating fit** of the CYB010 dataset for your SIEM ML research
- **Baseline reference** for new model architectures on the
  attack-phase classification task
- **Reference example of structural-leakage diagnostics** for
  synthetic SIEM datasets (the methodology is reusable)
- **Feature engineering reference** for per-event SIEM telemetry

## Out-of-scope use

- Production SIEM phase detection on real telemetry
- Threat actor attribution (4-class malicious-only is unlearnable
  on the sample)
- Event-class prediction (this is a structural property, not a
  learnable target)
- Any operational decision affecting actual security operations
  without further validation on your own data

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact) and a
nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published
sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic, and the trained weights in this repo correspond exactly
to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in `multi_seed_results.json` confirm robust performance across splits
(std 0.007 on accuracy and 0.001 on ROC-AUC, the tightest ROC-AUC std
in the XpertSystems catalog).

The training script itself is private to XpertSystems.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| **`leakage_diagnostic.json`** | **11-oracle-path audit + 2 unlearnable targets** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB010** dataset contains **~550,000 rows** across four files,
with calibrated benchmark validation against 6 metrics drawn from
authoritative SOC operations and threat intelligence sources (SANS SOC
Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA
Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
  - https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)

## Citation

```bibtex
@misc{xpertsystems_cyb010_baseline_2026,
  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}
```