| --- |
| license: cc-by-nc-4.0 |
| library_name: pytorch |
| tags: |
| - cybersecurity |
| - mitre-attack |
| - kill-chain |
| - apt |
| - tabular-classification |
| - synthetic-data |
| - xgboost |
| - baseline |
| pipeline_tag: tabular-classification |
| base_model: [] |
| datasets: |
| - xpertsystems/cyb002-sample |
| metrics: |
| - accuracy |
| - f1 |
| - roc_auc |
| model-index: |
| - name: cyb002-baseline-classifier |
|   results: |
|   - task: |
|       type: tabular-classification |
|       name: 10-class MITRE ATT&CK kill-chain phase classification |
|     dataset: |
|       type: xpertsystems/cyb002-sample |
|       name: CYB002 Synthetic Cyber Attack Dataset (Sample) |
|     metrics: |
|     - type: roc_auc |
|       value: 0.8599 |
|       name: Test macro ROC-AUC OvR (XGBoost) |
|     - type: f1 |
|       value: 0.4255 |
|       name: Test macro-F1 (XGBoost) |
|     - type: accuracy |
|       value: 0.4683 |
|       name: Test accuracy (XGBoost) |
|     - type: roc_auc |
|       value: 0.8496 |
|       name: Test macro ROC-AUC OvR (MLP) |
|     - type: f1 |
|       value: 0.3911 |
|       name: Test macro-F1 (MLP) |
|     - type: accuracy |
|       value: 0.4449 |
|       name: Test accuracy (MLP) |
| --- |
| |
| # CYB002 Baseline Classifier |
|
|
| **MITRE ATT&CK kill-chain phase classifier trained on the CYB002 |
| synthetic cyber attack sample. Predicts which of 10 kill-chain phases |
| an attack event belongs to, from observable event + segment features.** |
|
|
| > **Baseline reference, not for production use.** This model demonstrates |
| > that the [CYB002 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb002-sample) |
| > is learnable end-to-end and gives prospective buyers a working starting |
| > point. It is not a production threat detector or SOC tool. See |
| > [Limitations](#limitations). |
|
|
| ## Model overview |
|
|
| | Property | Value | |
| |---|---| |
| | Task | 10-class kill-chain phase classification | |
| | Training data | `xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns) | |
| | Models | XGBoost + PyTorch MLP | |
| | Input features | 90 (after one-hot encoding) | |
| | Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) | |
| | License | CC-BY-NC-4.0 (matches dataset) | |
| | Status | Reference baseline | |
| |
| Two model artifacts are published. They are designed to be used together, since disagreement between them is a useful triage signal: |
| |
| - `model_xgb.json`: gradient-boosted trees, primary recommendation |
| - `model_mlp.safetensors`: PyTorch MLP in SafeTensors format |
| |
| ## Quick start |
| |
| ```bash |
| pip install xgboost torch safetensors pandas huggingface_hub |
| ``` |
| |
| ```python |
| import json, os, sys |
| |
| import numpy as np, torch, xgboost as xgb |
| from huggingface_hub import hf_hub_download |
| from safetensors.torch import load_file  # for the MLP weights (not used below) |
| |
| REPO = "xpertsystems/cyb002-baseline-classifier" |
| |
| # Download the model artifacts and the bundled feature pipeline |
| paths = {n: hf_hub_download(REPO, n) for n in [ |
|     "model_xgb.json", "model_mlp.safetensors", |
|     "feature_engineering.py", "feature_meta.json", "feature_scaler.json", |
| ]} |
| |
| # Make the downloaded feature_engineering.py importable |
| sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"])) |
| from feature_engineering import ( |
|     transform_single, load_meta, INT_TO_LABEL, build_segment_lookup |
| ) |
| |
| meta = load_meta(paths["feature_meta.json"]) |
| xgb_model = xgb.XGBClassifier() |
| xgb_model.load_model(paths["model_xgb.json"]) |
| |
| # Build the segment-aggregate lookup from the dataset's topology CSV |
| seg_lookup = build_segment_lookup("path/to/network_topology.csv") |
| |
| # my_event is not defined here: supply one raw event as a mapping keyed by the |
| # dataset's column names (see inference_example.ipynb for the full pattern) |
| seg_agg = seg_lookup.get(my_event["target_segment_id"], {}) |
| X = transform_single(my_event, meta, segment_aggregates=seg_agg) |
| proba = xgb_model.predict_proba(X)[0] |
| print(INT_TO_LABEL[int(np.argmax(proba))]) |
| ``` |
| |
| See [`inference_example.ipynb`](./inference_example.ipynb) for an |
| end-to-end copy-paste demo including segment-aggregate setup and |
| batch prediction. |
| |
| ## Training data |
| |
| Trained on the public sample of CYB002, 4,353 attack events from 100 |
| distinct campaigns: |
| |
| | Phase | Train (n=2,822) | Test (n=726) | Test share | |
| |---|---:|---:|---:| |
| | `dwell_idle` | 581 | 141 | 19.4% | |
| | `reconnaissance` | 411 | 112 | 15.4% | |
| | `initial_access` | 358 | 106 | 14.6% | |
| | `execution` | 324 | 74 | 10.2% | |
| | `persistence` | 287 | 79 | 10.9% | |
| | `privilege_escalation` | 249 | 68 | 9.4% | |
| | `lateral_movement` | 201 | 54 | 7.4% | |
| | `collection` | 162 | 40 | 5.5% | |
| | `exfiltration` | 113 | 31 | 4.3% | |
| | `impact` | 105 | 21 | 2.9% | |
| |
| ### Group-aware split |
| |
| A single campaign generates ~40 highly correlated events. Random row-level |
| splitting would put events from the same campaign in both train and test, |
| inflating metrics in a way that does not generalize to new campaigns. |
| |
| This release uses **GroupShuffleSplit by `campaign_id`**: |
|
|
| | Fold | Campaigns | Events | |
| |---|---:|---:| |
| | Train | 69 | 2,822 | |
| | Validation | 16 | 805 | |
| | Test | 15 | 726 | |
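
A campaign-level split of this kind can be sketched without any dependencies. The release states that it used scikit-learn's `GroupShuffleSplit`; the helper below (`group_split`, a hypothetical name) is only a stand-in that shuffles campaign IDs and cuts them 70/15/15, so it illustrates the idea and will not reproduce the published folds. The toy campaign IDs are made up.

```python
import random


def group_split(campaign_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole campaigns to train/val/test so no campaign spans two folds.

    Returns three lists of row indices. Illustrative stand-in for
    sklearn.model_selection.GroupShuffleSplit, not the release's training code.
    """
    groups = sorted(set(campaign_ids))
    random.Random(seed).shuffle(groups)
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    fold_of = {}
    for i, g in enumerate(groups):
        fold_of[g] = 0 if i < n_train else 1 if i < n_train + n_val else 2
    folds = ([], [], [])
    for row, g in enumerate(campaign_ids):
        folds[fold_of[g]].append(row)
    return folds


# Toy data: 100 hypothetical campaigns with 5 events each.
campaign_ids = [f"C{c:03d}" for c in range(100) for _ in range(5)]
train_idx, val_idx, test_idx = group_split(campaign_ids)

# Campaign sets per fold; they must not overlap.
train_c = {campaign_ids[i] for i in train_idx}
val_c = {campaign_ids[i] for i in val_idx}
test_c = {campaign_ids[i] for i in test_idx}
print(len(train_c), len(val_c), len(test_c))
```

Splitting on campaign IDs first and mapping rows afterwards is what guarantees disjoint folds; fold sizes in events then vary with campaign length, which is why the event counts in the table are not an exact 70/15/15.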
|
|
| All test campaigns are completely unseen during training. Class imbalance |
| is addressed with balanced class weights (`class_weight='balanced'`), applied |
| to XGBoost as per-row `sample_weight` and to the MLP as weighted cross-entropy. |
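
For readers unfamiliar with the "balanced" heuristic, here is a minimal dependency-free sketch. It assumes the usual definition, weight = n_samples / (n_classes × class count); the exact weighting used in training is not published, and the function name and toy labels are illustrative.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Per-row weights w_c = n_samples / (n_classes * count_c).

    With these weights every class carries the same total weight,
    regardless of how many rows it has.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    class_weight = {c: n / (k * m) for c, m in counts.items()}
    return [class_weight[y] for y in labels], class_weight


# Toy labels using three of the phase names, with a 6:3:1 imbalance.
labels = ["dwell_idle"] * 6 + ["execution"] * 3 + ["impact"] * 1
weights, class_weight = balanced_sample_weights(labels)
print(class_weight)
```

The per-row list is the shape `XGBClassifier.fit(..., sample_weight=...)` expects, and the per-class values, ordered by class index, are what the `weight` argument of `torch.nn.CrossEntropyLoss` takes.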
|
|
| ## Feature pipeline |
|
|
| The bundled `feature_engineering.py` is the canonical feature recipe. |
|
|
| **Three columns are deliberately excluded** because they leak the target: |
|
|
| - `technique_id`: 62 of 63 ATT&CK techniques map to exactly one phase. |
|   Including it gives perfect-looking metrics that mean nothing. |
| - `technique_name`: 1:1 alias of `technique_id` (63 unique values each). |
| - `tactic_category`: direct alias of `kill_chain_phase`. |
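
A quick way to spot this kind of leakage in your own pipeline is to count how many values of a column only ever occur with a single phase. The sketch below is an illustration, not the check used for this release; the helper names and toy rows (including the technique IDs) are hypothetical.

```python
from collections import defaultdict

LEAKY_COLUMNS = ("technique_id", "technique_name", "tactic_category")


def phases_per_value(rows, column, target="kill_chain_phase"):
    """Map each value of `column` to the set of phases it co-occurs with."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[target])
    return seen


def single_phase_fraction(rows, column):
    """Fraction of distinct values in `column` seen with exactly one phase."""
    seen = phases_per_value(rows, column)
    return sum(len(p) == 1 for p in seen.values()) / len(seen)


# Toy rows; the technique IDs and pairings are illustrative only.
rows = [
    {"technique_id": "T-A", "kill_chain_phase": "reconnaissance", "protocol": "tcp"},
    {"technique_id": "T-A", "kill_chain_phase": "reconnaissance", "protocol": "udp"},
    {"technique_id": "T-B", "kill_chain_phase": "execution", "protocol": "udp"},
    {"technique_id": "T-C", "kill_chain_phase": "persistence", "protocol": "tcp"},
    {"technique_id": "T-C", "kill_chain_phase": "execution", "protocol": "tcp"},
]

technique_fraction = single_phase_fraction(rows, "technique_id")  # 2 of 3 values
protocol_fraction = single_phase_fraction(rows, "protocol")       # 0 of 2 values
print(technique_fraction, protocol_fraction)

# Drop the leaky columns before building features.
features = [{k: v for k, v in r.items() if k not in LEAKY_COLUMNS} for r in rows]
```

A fraction close to 1 means the column nearly determines the target and should be treated as a label alias rather than a feature.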
|
|
| **90 features survive after encoding**, drawn from: |
|
|
| - **Event-level numeric** (10): `timestep`, `dest_port`, `bytes_transferred`, `connection_duration_s`, `auth_failure_count`, `process_injection_flag`, `lateral_hop_count`, `c2_beacon_interval_s`, `edr_blocked_flag`, `siem_rule_triggered` |
| - **Event-level categorical** (7, one-hot encoded): `target_asset_type`, `source_ip_class`, `protocol`, `attacker_capability_tier`, `defender_maturity_level`, `alert_severity`, `detection_outcome` |
| - **Segment-level topology aggregates** (13): mean `patch_lag_days`, mean `exposure_score`, max `vulnerability_count`, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus segment_type and defender_maturity_level (segment-constant) |
| - **Engineered** (6): `byte_volume_log`, `has_c2_beacon`, `is_brute_forcing`, `attacker_defender_advantage`, `is_high_volume`, `is_privileged_port` |
| |
| None of the engineered features is derived from phase or technique, since |
| that would re-introduce the leakage we just excluded. |
| |
| ### Note on detection-outcome features |
| |
| `detection_outcome`, `alert_severity`, `edr_blocked_flag`, and |
| `siem_rule_triggered` are post-hoc observables from the SOC's perspective. |
| They are kept as features for the realistic use case where a SOC analyst |
| has just seen an action and its initial detection signal and is reasoning |
| about which phase the campaign is in. Buyers who want a strictly |
| pre-detection model can drop these four columns and retrain; the ablation |
| results below show this **does not hurt accuracy** (the model doesn't |
| lean on them for phase prediction). |
| |
| ## Evaluation |
| |
| ### Test-set metrics (n = 726 events from 15 disjoint campaigns) |
| |
| **XGBoost** |
| |
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.8599** | |
| | Accuracy | 0.4683 | |
| | Macro-F1 | 0.4255 | |
| | Weighted-F1 | 0.4604 | |
| |
| **MLP** |
| |
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.8496** | |
| | Accuracy | 0.4449 | |
| | Macro-F1 | 0.3911 | |
| | Weighted-F1 | 0.4350 | |
| |
| ### Headline interpretation |
| |
| Accuracy of 47% looks low at first glance, but the right comparison is: |
| |
| | Baseline | Accuracy | Macro-F1 | |
| |---|---:|---:| |
| | Random uniform guess (1/10 classes) | 0.10 | ~0.10 | |
| | Always predict majority (`dwell_idle`) | 0.19 | n/a | |
| | **XGBoost (this model)** | **0.47** | **0.43** | |
|
|
| The macro ROC-AUC of **0.86** tells the cleaner story: the model |
| distinguishes the 10 phases meaningfully well even though the |
| argmax-prediction sometimes lands on an adjacent phase. |
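
The majority-class row can be reproduced from the test-set counts in the class-distribution table. The sketch below also fills in the macro-F1 that the table lists as n/a, under the common convention (an assumption here) that a class which is never predicted scores F1 = 0.

```python
# Test-set support per phase, copied from the class-distribution table.
test_support = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31, "impact": 21,
}
n = sum(test_support.values())
majority = max(test_support, key=test_support.get)

# Always predicting the majority class: recall is 1 for that class and
# precision equals its share; every other class is assigned F1 = 0.
precision = test_support[majority] / n
majority_accuracy = precision
majority_f1 = 2 * precision * 1.0 / (precision + 1.0)
majority_macro_f1 = majority_f1 / len(test_support)

print(f"n={n}, majority={majority}")
print(f"accuracy={majority_accuracy:.3f}, macro-F1={majority_macro_f1:.3f}")
```

Under that convention the always-`dwell_idle` baseline scores a macro-F1 of roughly 0.03, well below the model's 0.43.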
|
|
| ### Per-class F1: where the signal is and isn't |
|
|
| | Phase | XGBoost F1 | MLP F1 | Note | |
| |---|---:|---:|---| |
| | `reconnaissance` | **0.753** | 0.725 | Strong: early timestep, distinct protocols/targets | |
| | `lateral_movement` | **0.742** | 0.783 | Strong: lateral-hop count, post-privesc pattern | |
| | `initial_access` | **0.647** | 0.648 | Strong: perimeter targets, specific protocols | |
| | `privilege_escalation` | 0.500 | 0.488 | Moderate | |
| | `execution` | 0.441 | 0.510 | Moderate | |
| | `persistence` | 0.413 | 0.301 | Moderate, easily confused with execution | |
| | `exfiltration` | 0.273 | 0.119 | Weak: late-phase, similar to collection/impact | |
| | `impact` | 0.226 | 0.132 | Weak: late-phase clustering | |
| | `collection` | 0.220 | 0.191 | Weak: late-phase clustering | |
| | `dwell_idle` | 0.040 | 0.013 | Very weak: no-op steps lack distinguishing features | |
|
|
| The model has solid signal on **early and mid-campaign phases** and |
| genuinely struggles to disambiguate **late-stage objective-completion |
| phases** (collection / exfiltration / impact), which arrive close in |
| time and look similar at the event level. This is an honest limitation |
| of flat-tabular classification; sequence models would help here. |
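
Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline figures can be cross-checked against the table above. A small sketch using the published per-class values (rounded to three decimals, so the last digit may differ):

```python
# Per-class F1 copied from the table above (three-decimal rounding).
f1_xgb = {
    "reconnaissance": 0.753, "lateral_movement": 0.742, "initial_access": 0.647,
    "privilege_escalation": 0.500, "execution": 0.441, "persistence": 0.413,
    "exfiltration": 0.273, "impact": 0.226, "collection": 0.220, "dwell_idle": 0.040,
}
f1_mlp = {
    "reconnaissance": 0.725, "lateral_movement": 0.783, "initial_access": 0.648,
    "privilege_escalation": 0.488, "execution": 0.510, "persistence": 0.301,
    "exfiltration": 0.119, "impact": 0.132, "collection": 0.191, "dwell_idle": 0.013,
}


def macro_f1(per_class):
    """Unweighted mean over classes: every phase counts equally."""
    return sum(per_class.values()) / len(per_class)


macro_xgb = macro_f1(f1_xgb)
macro_mlp = macro_f1(f1_mlp)
print(f"XGBoost macro-F1 = {macro_xgb:.4f}, MLP macro-F1 = {macro_mlp:.4f}")
```

Because every class counts equally, the near-zero `dwell_idle` score pulls macro-F1 down by several points even though the early-phase classes score above 0.6.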
|
|
| ### Ablation: which feature groups matter |
|
|
| | Configuration | Accuracy | Macro-F1 | Δ accuracy vs full | |
| |---|---:|---:|---:| |
| | Full feature set (published) | 0.4683 | 0.4255 | n/a | |
| | No `timestep` | 0.3264 | 0.3102 | **−0.1419** | |
| | No topology aggregates | 0.4601 | 0.4093 | −0.0083 | |
| | No engineered features | 0.4642 | 0.4240 | −0.0041 | |
| | No detection-signal features | 0.4725 | 0.4284 | **+0.0041** | |
|
|
| Two clear findings: |
|
|
| 1. **`timestep` is by far the most important feature** (drops 14 pp when |
| removed). The honest reading: kill chains progress in time, and where |
| you are in the campaign timeline carries most of the phase signal. |
| 2. **Detection-signal features (`detection_outcome`, `alert_severity`, |
|    `edr_blocked_flag`, `siem_rule_triggered`) do not help phase prediction.** |
|    Removing them changes accuracy by +0.4 pp, about 3 of the 726 test |
|    events, which is too small to read as a real gain. A buyer who wants |
|    a pre-detection model can drop these four columns with no measured loss. |
|
|
| Topology aggregates contribute about 0.8 pp of accuracy and engineered features about 0.4 pp. |
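
To put the ablation deltas in perspective, it helps to convert them into test-event counts. This is plain arithmetic on the published accuracies, assuming each accuracy is a whole number of correct predictions out of the 726 test events:

```python
FULL_ACCURACY = 0.4683
N_TEST = 726

ablations = {
    "No timestep": 0.3264,
    "No topology aggregates": 0.4601,
    "No engineered features": 0.4642,
    "No detection-signal features": 0.4725,
}

full_correct = round(FULL_ACCURACY * N_TEST)
delta_events = {}
for name, acc in ablations.items():
    delta_pp = (acc - FULL_ACCURACY) * 100           # percentage points
    delta_events[name] = round(acc * N_TEST) - full_correct
    print(f"{name:30s} {delta_pp:+6.2f} pp  {delta_events[name]:+4d} events")
```

On a test set this small, one percentage point is about seven events, so only the `timestep` ablation moves the result by a margin that is clearly larger than a handful of predictions.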
|
|
| ### Architecture |
|
|
| **XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes), |
| `hist` tree method, class-balanced sample weights, early stopping on |
| validation mlogloss. |
|
|
| **MLP:** `90 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d` |
| → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, |
| early stopping on validation macro-F1. |
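
The MLP description maps onto a small PyTorch module. The sketch below is one plausible reading of it, not the released definition: the use of `nn.Sequential`, the function name `build_mlp`, and the resulting parameter names are assumptions, so the state-dict keys in `model_mlp.safetensors` may not line up without remapping. The network here is randomly initialised.

```python
import torch
from torch import nn


def build_mlp(n_features=90, hidden=(128, 64), n_classes=10, dropout=0.3):
    """Linear -> BatchNorm1d -> ReLU -> Dropout per hidden layer, then a linear head."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))  # outputs raw logits
    return nn.Sequential(*layers)


model = build_mlp()
model.eval()  # disable dropout and use BatchNorm running statistics

with torch.no_grad():
    logits = model(torch.zeros(4, 90))      # batch of 4 already-scaled feature rows
    proba = torch.softmax(logits, dim=1)    # per-phase probabilities

print(tuple(proba.shape))
```

Calling `model.eval()` before inference matters for this architecture: in training mode, `BatchNorm1d` uses batch statistics and `Dropout` zeroes activations, which makes single-event predictions unstable.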
|
|
| Training hyperparameters (learning rate, batch size, n_estimators, |
| early-stopping patience, weight decay, class-weighting strategy) are |
| held internally by XpertSystems and are not part of this release. |
| |
| ## Limitations |
| |
| **This is a baseline reference, not a production threat detection system.** |
| |
| 1. **Late-phase confusion.** Per-class F1 for `collection`, `exfiltration`, |
|    and `impact` is 0.22 to 0.27. These phases arrive near campaign-end with |
| similar feature signatures, and a flat-tabular event-level model can't |
| easily disambiguate them. Sequence models (LSTM / transformer over the |
| per-campaign event sequence) would substantially improve this. |
| |
| 2. **`dwell_idle` is essentially unlearnable in this framing.** The |
| class-balanced weights amplify rare classes; `dwell_idle` is common |
| but featureless ("no action this timestep"), so the model trades |
| `dwell_idle` recall for late-phase recall. F1 = 0.04. A real SOC |
| pipeline would handle idle steps with a separate gating rule, not a |
| classifier head. |
|
|
| 3. **Sample-size constraints.** 100 campaigns / 4,353 events with a |
| group-aware split leaves 69 training campaigns. The full 380k-event |
| CYB002 product supports much more reliable per-class estimation, |
| especially on the rare late-phase classes. |
|
|
| 4. **Synthetic-vs-real transfer.** The dataset is synthetic and |
| calibrated to threat-intelligence benchmark targets (Mandiant |
| M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real |
| attack telemetry has different noise characteristics, adversary |
| adaptation, and gaps in coverage. Do not assume metrics transfer. |
|
|
| 5. **Adversarial robustness not evaluated.** The dataset is not |
| adversarially generated; the model has not been red-teamed. |
|
|
| 6. **MLP brittleness on OOD inputs.** With ~2.8k training events, the |
| MLP can produce confidently-wrong predictions on hand-crafted |
| records far from the training manifold. XGBoost is more robust. |
| Use both; treat disagreement as a signal for human review. |
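
The "use both" advice can be sketched as a small triage rule over the two models' probability vectors. The function name, the return format, and the `min_confidence` threshold are hypothetical choices for illustration, not settings shipped with this release.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Flag an event for human review when the two models disagree,
    or when they agree but XGBoost's top probability is low.

    `min_confidence` is a hypothetical threshold, not a published setting.
    """
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    agree = top_xgb == top_mlp
    confident = proba_xgb[top_xgb] >= min_confidence
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "needs_review": not (agree and confident),
    }


labels = ["collection", "exfiltration", "impact"]  # subset, for illustration
agree_case = triage([0.1, 0.7, 0.2], [0.2, 0.6, 0.2], labels)
split_case = triage([0.1, 0.7, 0.2], [0.5, 0.3, 0.2], labels)
print(agree_case["needs_review"], split_case["needs_review"])
```

In practice the probability vectors would come from `xgb_model.predict_proba` and a softmax over the MLP logits, indexed in the same class order.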
|
|
| ## Notes on dataset schema |
|
|
| The CYB002 sample dataset README describes some fields differently from |
| the actual schema. The model was trained on the actual schema; this note |
| is to help buyers reconcile what they read with what they receive. |
|
|
| | What the README says | What the data actually contains | |
| |---|---| |
| | "9 ATT&CK phases" | 10 phases including `dwell_idle` (idle/no-op steps) | |
| | 4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state` | 4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state` | |
| | 5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`) | 5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust` | |
| | Field name `phase` | Actual column: `kill_chain_phase` | |
| | Field name `tactic` | Actual column: `tactic_category` | |
| | Field name `segment_id` | Actual column: `target_segment_id` | |
| | Field name `attacker_tier` | Actual column: `attacker_capability_tier` | |
| | Field name `defender_maturity` | Actual column: `defender_maturity_level` | |
| | Field name `detected`, `blocked`, `stealth_score` | Actual: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; no `stealth_score` on events | |
|
|
| None of this affects model correctness: `feature_engineering.py` uses the |
| actual column names. If you build your own pipeline against the dataset, |
| use the actual columns, not the README descriptions. |
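
If you already have code written against the names in the dataset README, a rename map is usually the least invasive fix. The sketch below covers only the five one-to-one renames from the table; the `detected` / `blocked` / `stealth_score` row is left out because the table does not give an unambiguous one-to-one mapping for it. The helper name and sample record are illustrative.

```python
# One-to-one renames taken from the schema table above.
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}


def to_actual_schema(record):
    """Rename README-style keys to the dataset's actual column names.

    Keys that are not in the map are passed through unchanged.
    """
    return {README_TO_ACTUAL.get(key, key): value for key, value in record.items()}


# Hypothetical record written against the README names.
readme_style = {"phase": "execution", "segment_id": "seg-01", "protocol": "tcp"}
actual_style = to_actual_schema(readme_style)
print(actual_style)
```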
|
|
| ## Intended use |
|
|
| - **Evaluating fit** of the CYB002 dataset for your ATT&CK / kill-chain |
| research |
| - **Baseline reference** for new model architectures (especially sequence |
| models, which should beat this baseline on the late-phase classes) |
| - **Teaching and demo** for tabular classification on attack-event data |
| - **Feature engineering reference** for MITRE ATT&CK-aligned datasets |
|
|
| ## Out-of-scope use |
|
|
| - Production threat detection on real network telemetry |
| - SOC alert triage on real systems |
| - Forensic attribution of real attacks |
| - Adversarial-evasion evaluation (dataset not adversarially generated) |
| - Any safety-critical or operational security decision |
|
|
| ## Reproducibility |
|
|
| Outputs above were produced with `seed = 42`, group-aware nested |
| `GroupShuffleSplit` (70/15/15 by campaign_id), on the published sample |
| (`xpertsystems/cyb002-sample`, version 1.0.0, generated 2026-05-16). |
| The feature pipeline in `feature_engineering.py` is deterministic and |
| the trained weights in this repo correspond exactly to the metrics above. |
|
|
| The training script itself is private to XpertSystems. The published |
| artifacts contain the feature pipeline, model weights, scaler, metadata, |
| and validation results, which is sufficient to reproduce inference but not |
| training. |
|
|
| ## Files in this repo |
|
|
| | File | Purpose | |
| |---|---| |
| | `model_xgb.json` | XGBoost weights | |
| | `model_mlp.safetensors` | PyTorch MLP weights | |
| | `feature_engineering.py` | Feature pipeline (load → aggregate topology → engineer → encode) | |
| | `feature_meta.json` | Feature column order + categorical levels | |
| | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) | |
| | `validation_results.json` | Per-class metrics, confusion matrix, architecture | |
| | `ablation_results.json` | Per-feature-group ablation (timestep, topology, engineered, detection-signals) | |
| | `inference_example.ipynb` | End-to-end inference demo notebook | |
| | `README.md` | This file | |
|
|
| ## Contact and full product |
|
|
| The full **CYB002** dataset contains ~454,000 rows across four files, |
| with calibrated benchmark validation against 12 metrics drawn from |
| authoritative threat intelligence sources (Mandiant, IBM, Verizon, |
| CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data |
| catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & |
| Risk, Oil & Gas, and Materials & Energy. |
|
|
| - Email: **pradeep@xpertsystems.ai** |
| - Website: **https://xpertsystems.ai** |
| - Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample |
| - Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{xpertsystems_cyb002_baseline_2026, |
| title = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT&CK Kill-Chain Phase Classification}, |
| author = {XpertSystems.ai}, |
| year = {2026}, |
| url = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier}, |
| note = {Baseline reference model trained on xpertsystems/cyb002-sample} |
| } |
| ``` |
|
|