---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- siem
- security-logs
- mitre-attack
- apt
- tabular-classification
- synthetic-data
- xgboost
- baseline
- leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb010-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb010-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 5-class attack lifecycle phase classification
    dataset:
      type: xpertsystems/cyb010-sample
      name: CYB010 Synthetic Security Event Log Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9904
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.9493
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7781
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.936
      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.988
      name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
---

# CYB010 Baseline Classifier

**Attack lifecycle phase classifier (5-class) trained on the CYB010 synthetic security event log sample. Predicts which of 5 attack phases (`benign_background` / `initial_access` / `lateral_movement` / `persistence_establishment` / `exfiltration_or_impact`) a security event belongs to, from per-event features. ALSO ships a comprehensive `leakage_diagnostic.json` documenting 11 oracle paths discovered across the dataset's targets and 2 README-suggested targets that are unlearnable on the sample after honest leak removal.**

> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `attack_lifecycle_phase` (the
> dataset's headline target), and (2) `leakage_diagnostic.json`
> documenting 11 separate oracle paths plus 2 unlearnable targets.
> Both files matter; the diagnostic is required reading for anyone
> evaluating CYB010 for SIEM ML work.

## Model overview

| Property | Value |
|---|---|
| Primary task | 5-class `attack_lifecycle_phase` classification |
| Secondary artifact | `leakage_diagnostic.json`: 11 oracle paths + 2 unlearnable targets |
| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | **Group-aware** (GroupShuffleSplit on `incident_id`) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |

## Why this task, and what was dropped

The CYB010 README's central concept is the "5-phase attack lifecycle state machine", and `attack_lifecycle_phase` is the data's headline target. We piloted six candidate targets and found:

- **`attack_lifecycle_phase` 5-class**: strongest honest result. Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes represented, per-class F1 range 0.48–1.00.
- **`threat_actor_profile` 5-class**: works at acc 0.84 but per-class F1 reveals it's almost entirely driven by `benign_user` separation (F1 1.00 vs F1 0.17-0.69 for the 4 malicious classes). The 4-class malicious-only formulation is below majority (acc 0.55 vs 0.61).
- **`label_true_positive` binary on alerts**: documented as a secondary finding. Has 7 oracle features; honest acc 0.80, AUC 0.89 after dropping all of them.
- **`mitre_tactic` 14-class**: hits acc 0.90 but macro-F1 0.37, which is imbalance gaming (benign class dominates at 57%).
- **`event_class` 12-class**: unlearnable (acc 0.35 vs majority 0.42).

### Six oracle columns dropped from the phase task

CYB010 encodes the benign vs malicious distinction explicitly in multiple columns.
Each is a perfect or near-perfect oracle for the `benign_background` phase:

| Column | Oracle relationship |
|---|---|
| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
| `event_type` | Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`) |

With these six columns present, a plain XGBoost trivially separates benign vs malicious. The published baseline trains with all six excluded.

Two model artifacts are published. They are designed to be used together:

- `model_xgb.json`: gradient-boosted trees (slightly higher F1)
- `model_mlp.safetensors`: PyTorch MLP

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import json
import os
import sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download, snapshot_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb010-baseline-classifier"

# Download the model artifacts and the feature pipeline from the model repo
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)

meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern)
# my_event is a single raw event record that you supply.
# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type - those were the
# oracle columns.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full copy-paste demo.

## Training data

Trained on the public sample of CYB010, 21,896 per-event records:

| Phase | Events | Class share |
|---|---:|---:|
| `benign_background` | 12,448 | 56.9% |
| `exfiltration_or_impact` | 6,205 | 28.3% |
| `initial_access` | 1,674 | 7.6% |
| `lateral_movement` | 968 | 4.4% |
| `persistence_establishment` | 601 | 2.7% |

### Group-aware split by incident_id

500 incidents × ~44 events each. Events from the same incident share host, threat actor, and phase trajectory, so train/test contamination is a real risk with random splitting. The baseline uses **GroupShuffleSplit** on `incident_id` (nested 70/15/15):

| Fold | Events | Incidents |
|---|---:|---:|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |

All 10 multi-seed evaluations yielded all 5 classes in the test fold. Class imbalance is addressed with balanced class weights (`class_weight='balanced'`, applied to XGBoost through `sample_weight`) and weighted cross-entropy (MLP).

## Feature pipeline

The bundled `feature_engineering.py` is the canonical recipe.
87 features survive after encoding, drawn from:

- **Per-event numeric** (5): `source_port`, `dest_port`, `cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
- **Per-event categorical** (3, one-hot): `event_class` (12 values), `log_source_type` (8 values), `severity_level` (5 values)
- **Host features** (joined from `host_inventory.csv`): 3 numeric + 7 categorical (os_type, host_role, network_segment, defender_posture, criticality_rating, cloud_provider, siem_platform)
- **Engineered** (9): `hour_of_day`, `is_off_hours`, `is_weekend`, `log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`, `is_outbound_web`, `risk_composite`

### Partial-oracle features kept as legitimate observables

`event_class` (max purity 0.87, mean 0.72 across phases) is the strongest non-oracle feature. C2 beacon traffic (`event_class = network_flow`) is 65% exfiltration phase but also 29% benign and 6% other phases: real overlap, not deterministic encoding. Kept.

`severity_level` and `cvss_score_analogue` correlate strongly with phase (high-severity events skew toward exfil and initial_access) but with substantial overlap. Kept.

`label_log_tampered` is a real observable (APTs tamper more than script_kiddies) but is not phase-deterministic. Kept.

## Evaluation

### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9904** |
| Accuracy | **0.9493** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9861** |
| Accuracy | **0.9412** |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |

XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941, macro-F1 0.778 vs 0.753). The gap is consistent across seeds.

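
The four metrics in the tables above can be recomputed from any model's class probabilities. The sketch below assumes scikit-learn is installed (it is not in the `pip install` line above); the `summarize` helper and the toy data are illustrative only and are not part of this repo.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def summarize(y_true, proba):
    """Headline metrics for integer labels and an (n_samples, n_classes) probability matrix."""
    y_pred = proba.argmax(axis=1)
    return {
        "macro_roc_auc_ovr": roc_auc_score(
            y_true, proba, multi_class="ovr", average="macro"
        ),
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }


# Toy stand-in for real predictions: 5 classes, 20 events each (hypothetical data)
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(5), 20)
logits = rng.normal(size=(100, 5))
logits[np.arange(100), y_true] += 2.0  # nudge the true class so the toy model beats chance
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax, rows sum to 1

metrics = summarize(y_true, proba)
print(metrics)
```

With the real artifacts, `proba` would be the output of `xgb_model.predict_proba` on the test fold. Macro-F1 weights all five phases equally, which is why it sits well below accuracy on this imbalanced target.
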
### Multi-seed robustness (XGBoost, 10 seeds)

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |

**Tightest ROC-AUC std in the catalog** (0.001). All 10 seeds yielded all 5 classes in the test fold. Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).

### Per-class F1 (seed 42)

| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `benign_background` | 56.9% | **0.998** | 0.994 |
| `exfiltration_or_impact` | 28.3% | **0.987** | 0.981 |
| `initial_access` | 7.6% | 0.720 | 0.651 |
| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
| `lateral_movement` | 4.4% | **0.483** | 0.451 |

The two largest classes (`benign_background` and `exfiltration_or_impact`) are nearly perfectly separable: `benign_background` because the non-oracle features (severity, CVSS, log_source) still cleanly separate non-malicious traffic, and `exfiltration_or_impact` because it's dominated by network_flow events (C2 beacons). The three middle classes overlap substantially in feature space; `lateral_movement` is the hardest (F1 0.48) because lateral movement events look similar to initial_access events at the per-event level. A sequence model that considers event ordering within an incident would likely do better than the per-event baseline.

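
To see which phases are confused with each other, per-class F1 can be read alongside a confusion matrix. This is a minimal sketch assuming scikit-learn; the integer-to-phase ordering in `PHASES` is a placeholder (the real mapping is `INT_TO_LABEL` in `feature_engineering.py`), and the labels are made up to show a `lateral_movement` / `initial_access` mix-up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Assumed label order for illustration only; use INT_TO_LABEL for the real mapping
PHASES = [
    "benign_background",
    "initial_access",
    "lateral_movement",
    "persistence_establishment",
    "exfiltration_or_impact",
]


def per_class_report(y_true, y_pred, labels=PHASES):
    """Return ({phase: F1}, confusion matrix) with rows = true class, columns = predicted class."""
    idx = np.arange(len(labels))
    f1 = f1_score(y_true, y_pred, labels=idx, average=None, zero_division=0)
    cm = confusion_matrix(y_true, y_pred, labels=idx)
    return dict(zip(labels, f1)), cm


# Hypothetical predictions in which two lateral_movement events are called initial_access
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4])
y_pred = np.array([0, 0, 0, 0, 1, 1, 2, 1, 1, 2, 3, 3, 4, 4, 4])

f1_by_phase, cm = per_class_report(y_true, y_pred)
for phase, score in f1_by_phase.items():
    print(f"{phase:28s} F1 = {score:.3f}")
print(cm)
```

Off-diagonal mass in the `lateral_movement` row shows where that class's recall is lost, which is the kind of overlap described above.
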
### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---:|---:|---:|---:|---:|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | n/a | n/a |
| No `event_class` | 0.9206 | 0.5969 | 0.9723 | **−0.0287** | **−0.181** |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |

Three findings:

1. **`event_class` is the dominant signal** (drops 18pp macro-F1 when removed). Phase prediction without it loses most discrimination between the middle classes.
2. **CVSS features are second-strongest** (drops 3pp F1). They capture severity information that complements event_class.
3. **Host features and timing add modest noise.** The model performs marginally *better* without host features (+0.3pp accuracy), and timing features contribute essentially nothing. Both groups are kept in the pipeline as a documented baseline reference.

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes), `hist` tree method, class-balanced sample weights, early stopping on validation mlogloss.

**MLP:** `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d` → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.

Training hyperparameters are held internally by XpertSystems.

## Limitations

**This is a baseline reference, not a production phase classifier.**

1. **The leakage diagnostic is required reading.** Six oracle columns for the phase task and seven for the alert TP task are documented in `leakage_diagnostic.json`. If you use CYB010 sample data for your own training, you MUST drop these or your model will learn the oracles instead of the task.
2. **`lateral_movement` F1 0.48 is the weakest class.** The 968-event sample with substantial overlap to `initial_access` makes this class hard. A sequence model that considers event ordering within incidents would likely do better than per-event classification.
3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable on this sample** (acc 0.55 vs majority 0.61). The 5-class formulation with benign included works only because benign_user separation is structurally trivial.
4. **`event_class` 12-class is unlearnable on this sample** (acc 0.35 vs majority 0.42). event_class is a structural property of the event itself, not something to predict from other features.
5. **Synthetic-vs-real transfer.** The dataset is synthetic, calibrated to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise characteristics; in particular, the explicit `mitre_tactic == "benign"` marker and `threat_actor_id == "NONE"` benign sentinel would not be present in real data. Real telemetry has implicit benign-vs-malicious distinctions that emerge from event content. Do not assume metrics transfer end-to-end.
6. **21,896 events / 500 incidents is a modest training set.** The 3,726-event / ~75-incident test fold yields stable multi-seed metrics (std 0.007 on accuracy) but per-class confidence intervals widen for the smallest classes (lateral_movement, persistence).

## Notes on dataset schema

The CYB010 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `security_events` has 16 columns | Data has **23 columns** |
| Field renames | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
| `alert_records` has 9 columns | Data has **21 columns** |
| Field renames | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
| README's `ioc_matched` | Not present in `alert_records` |
| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
| `incident_summary` has 8 columns | Data has **24 columns** |
| `host_inventory` has 6 columns | Data has **15 columns** |
| `threat_actor_profile` has 4 values | Data has **5 values** (adds `benign_user` at 57% of events) |
| `attack_lifecycle_phase` 5-phase malicious lifecycle | Data adds `benign_background` as a phase value (57% of events), so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |

None of these affects model correctness: the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.

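
If your own code was written against the README's field names, a small translation step can catch mismatches early. The sketch below uses only the `security_events` renames listed in the table; the `resolve_columns` helper and the toy frame are hypothetical and not part of the bundled pipeline.

```python
import pandas as pd

# README field name -> actual column name in security_events (from the table above)
README_TO_ACTUAL = {
    "timestamp_utc": "timestamp",
    "user": "user_id",
    "log_format": "log_source_type",
}


def resolve_columns(df, wanted):
    """Select columns by README or actual name, failing loudly on anything absent."""
    resolved = [README_TO_ACTUAL.get(name, name) for name in wanted]
    missing = [col for col in resolved if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not present in the data: {missing}")
    return df[resolved]


# Toy frame that uses the actual column names (hypothetical values)
events = pd.DataFrame({
    "timestamp": ["2026-01-01T00:00:00Z"],
    "user_id": ["u-001"],
    "log_source_type": ["firewall"],
    "event_class": ["network_flow"],
})

subset = resolve_columns(events, ["timestamp_utc", "user", "event_class"])
print(list(subset.columns))
```

Asking for a README-only field such as `command_line` raises a `KeyError` rather than silently yielding an empty feature.
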
## Intended use

- **Evaluating fit** of the CYB010 dataset for your SIEM ML research
- **Baseline reference** for new model architectures on the attack-phase classification task
- **Reference example of structural-leakage diagnostics** for synthetic SIEM datasets; the methodology is reusable
- **Feature engineering reference** for per-event SIEM telemetry

## Out-of-scope use

- Production SIEM phase detection on real telemetry
- Threat actor attribution (4-class malicious-only is unlearnable on the sample)
- Event-class prediction (this is a structural property, not a learnable target)
- Any operational decision affecting actual security operations without further validation on your own data

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact), nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated 2026-05-16). The feature pipeline in `feature_engineering.py` is deterministic and the trained weights in this repo correspond exactly to the metrics above. Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in `multi_seed_results.json` confirm robust performance across splits (std 0.007 on accuracy, 0.001 on ROC-AUC, the tightest ROC-AUC std in the XpertSystems catalog).

The training script itself is private to XpertSystems.

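
Since the training script is not published, the following is only a sketch of how a nested 70/15/15 group-aware split could be built with scikit-learn's `GroupShuffleSplit`. The `nested_group_split` helper, the reuse of one seed for both stages, and the toy incident IDs are assumptions; the exact fold membership of the published artifact is not guaranteed to match.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def nested_group_split(groups, seed=42):
    """Split event indices ~70/15/15 by group so no incident spans two folds."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))

    # Stage 1: hold out ~30% of incidents
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, hold_idx = next(outer.split(idx, groups=groups))

    # Stage 2: split the held-out incidents in half (validation / test)
    inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=seed)
    val_rel, test_rel = next(inner.split(hold_idx, groups=groups[hold_idx]))
    return train_idx, hold_idx[val_rel], hold_idx[test_rel]


# Toy data: 100 hypothetical incidents with 4 events each
incident_id = np.repeat(np.arange(100), 4)
train_idx, val_idx, test_idx = nested_group_split(incident_id)

train_groups = set(incident_id[train_idx])
val_groups = set(incident_id[val_idx])
test_groups = set(incident_id[test_idx])
print(len(train_groups), len(val_groups), len(test_groups))
```

Because `test_size` is a fraction of groups rather than of rows, event counts per fold vary with incident length, which is consistent with the uneven validation and test event counts reported in the split table.
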
## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| **`leakage_diagnostic.json`** | **11-oracle-path audit + 2 unlearnable targets** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB010** dataset contains **~550,000 rows** across four files, with calibrated benchmark validation against 6 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
  - https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)

## Citation

```bibtex
@misc{xpertsystems_cyb010_baseline_2026,
  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}
```