---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- mitre-attack
- kill-chain
- apt
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb002-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb002-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 10-class MITRE ATT&CK kill-chain phase classification
    dataset:
      type: xpertsystems/cyb002-sample
      name: CYB002 Synthetic Cyber Attack Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.8599
      name: Test macro ROC-AUC OvR (XGBoost)
    - type: f1
      value: 0.4255
      name: Test macro-F1 (XGBoost)
    - type: accuracy
      value: 0.4683
      name: Test accuracy (XGBoost)
    - type: roc_auc
      value: 0.8496
      name: Test macro ROC-AUC OvR (MLP)
    - type: f1
      value: 0.3911
      name: Test macro-F1 (MLP)
    - type: accuracy
      value: 0.4449
      name: Test accuracy (MLP)
---
# CYB002 Baseline Classifier
**MITRE ATT&CK kill-chain phase classifier trained on the CYB002
synthetic cyber attack sample. Predicts which of 10 kill-chain phases
an attack event belongs to, from observable event + segment features.**
> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB002 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb002-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production threat detector or SOC tool. See
> [Limitations](#limitations).
## Model overview
| Property | Value |
|---|---|
| Task | 10-class kill-chain phase classification |
| Training data | `xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns) |
| Models | XGBoost + PyTorch MLP |
| Input features | 90 (after one-hot encoding) |
| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |
Two model artifacts are published. They are designed to be used together; disagreement between them is a useful triage signal:
- `model_xgb.json`: gradient-boosted trees, primary recommendation
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format
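One way to act on that disagreement signal, sketched under assumptions: given both models' class-probability vectors for a single event, flag the event when the top labels differ or when either model's top probability is low. The helper name `triage`, the `min_confidence` threshold of 0.5, and the example probabilities are illustrative and not part of the release.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Compare two probability vectors for one event (hypothetical helper)."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    agree = top_xgb == top_mlp
    # Both models must be reasonably sure of their own top class.
    confident = min(proba_xgb[top_xgb], proba_mlp[top_mlp]) >= min_confidence
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "needs_review": not (agree and confident),
    }

# Three of the ten phases, for brevity; probabilities are made up.
labels = ["reconnaissance", "initial_access", "execution"]
agreeing = triage([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], labels)
disagreeing = triage([0.7, 0.2, 0.1], [0.2, 0.7, 0.1], labels)
```

A suitable threshold depends on how much analyst time is available; 0.5 is only a placeholder.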
## Quick start
```bash
pip install xgboost torch safetensors pandas huggingface_hub
```
```python
# json, torch and load_file are unused in this XGBoost-only snippet
import json, os, sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb002-baseline-classifier"

# Download the model artifacts and the bundled feature pipeline
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
)

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Build the segment-aggregate lookup from the dataset's topology CSV
seg_lookup = build_segment_lookup("path/to/network_topology.csv")

# Predict (see inference_example.ipynb for the full pattern).
# `my_event` is one raw event record that you supply.
seg_agg = seg_lookup.get(my_event["target_segment_id"], {})
X = transform_single(my_event, meta, segment_aggregates=seg_agg)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```
See [`inference_example.ipynb`](./inference_example.ipynb) for an
end-to-end copy-paste demo including segment-aggregate setup and
batch prediction.
## Training data
Trained on the public sample of CYB002: 4,353 attack events from 100
distinct campaigns. Per-phase counts for the train and test folds:
| Phase | Train (n=2,822) | Test (n=726) | Test share |
|---|---:|---:|---:|
| `dwell_idle` | 581 | 141 | 19.4% |
| `reconnaissance` | 411 | 112 | 15.4% |
| `initial_access` | 358 | 106 | 14.6% |
| `execution` | 324 | 74 | 10.2% |
| `persistence` | 287 | 79 | 10.9% |
| `privilege_escalation` | 249 | 68 | 9.4% |
| `lateral_movement` | 201 | 54 | 7.4% |
| `collection` | 162 | 40 | 5.5% |
| `exfiltration` | 113 | 31 | 4.3% |
| `impact` | 105 | 21 | 2.9% |
### Group-aware split
A single campaign generates ~40 highly-correlated events. Random row-level
splitting would put events from the same campaign in both train and test,
inflating metrics in a way that does not generalize to new campaigns.
This release uses **GroupShuffleSplit by `campaign_id`**:
| Fold | Campaigns | Events |
|---|---:|---:|
| Train | 69 | 2,822 |
| Validation | 16 | 805 |
| Test | 15 | 726 |
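A minimal sketch of the idea using only the standard library: shuffle the unique campaign IDs, cut the shuffled list into train/validation/test, and send every event to its campaign's fold. This is a simplified stand-in for scikit-learn's `GroupShuffleSplit`, not the private training script; the helper name `split_by_campaign` and the toy events are hypothetical.

```python
import random

def split_by_campaign(events, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole campaigns to folds so no campaign spans two folds."""
    campaigns = sorted({e["campaign_id"] for e in events})
    random.Random(seed).shuffle(campaigns)
    n_train = round(fractions[0] * len(campaigns))
    n_val = round(fractions[1] * len(campaigns))
    fold_of = {}
    for i, cid in enumerate(campaigns):
        if i < n_train:
            fold_of[cid] = "train"
        elif i < n_train + n_val:
            fold_of[cid] = "val"
        else:
            fold_of[cid] = "test"
    folds = {"train": [], "val": [], "test": []}
    for e in events:
        folds[fold_of[e["campaign_id"]]].append(e)
    return folds

# Toy data: 20 campaigns with 5 events each.
events = [{"campaign_id": f"C{c:03d}", "timestep": t}
          for c in range(20) for t in range(5)]
folds = split_by_campaign(events)
ids = {name: {e["campaign_id"] for e in rows} for name, rows in folds.items()}
```

Because folds are cut on campaigns rather than rows, event counts per fold only approximate the requested fractions, which is why a 70/15/15 campaign split can yield uneven event totals.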
All test campaigns are completely unseen during training. Class imbalance
is addressed with balanced class weights, applied as per-row
`sample_weight` in XGBoost and as weighted cross-entropy in the MLP.
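The weighting can be worked through on the per-phase training counts from the table above. This sketch assumes the scikit-learn "balanced" convention, `w_c = N / (K * n_c)`; the release does not publish its exact class-weighting strategy, so treat the resulting numbers as illustrative.

```python
# Per-phase training counts, copied from the table above.
train_counts = {
    "dwell_idle": 581, "reconnaissance": 411, "initial_access": 358,
    "execution": 324, "persistence": 287, "privilege_escalation": 249,
    "lateral_movement": 201, "collection": 162, "exfiltration": 113,
    "impact": 105,
}
n_total = sum(train_counts.values())
n_classes = len(train_counts)

# Assumed formula: rare classes get proportionally larger weights.
class_weight = {c: n_total / (n_classes * n) for c, n in train_counts.items()}

def sample_weights(labels):
    """Map each row's label to its class weight (one weight per row)."""
    return [class_weight[y] for y in labels]

weights = sample_weights(["dwell_idle", "impact", "impact"])
```

Under this convention the weighted class totals are all equal, so every phase contributes the same mass to the loss regardless of how many rows it has.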
## Feature pipeline
The bundled `feature_engineering.py` is the canonical feature recipe.
**Three columns are deliberately excluded** because they leak the target:
- `technique_id`: 62 of 63 ATT&CK techniques map 1:1 to a single phase.
Including it gives perfect-looking metrics that mean nothing.
- `technique_name`: 1:1 alias of `technique_id` (63 unique values each).
- `tactic_category`: direct alias of `kill_chain_phase`.
**90 features survive after encoding**, drawn from:
- **Event-level numeric** (10): `timestep`, `dest_port`, `bytes_transferred`, `connection_duration_s`, `auth_failure_count`, `process_injection_flag`, `lateral_hop_count`, `c2_beacon_interval_s`, `edr_blocked_flag`, `siem_rule_triggered`
- **Event-level categorical** (7, one-hot encoded): `target_asset_type`, `source_ip_class`, `protocol`, `attacker_capability_tier`, `defender_maturity_level`, `alert_severity`, `detection_outcome`
- **Segment-level topology aggregates** (13): mean `patch_lag_days`, mean `exposure_score`, max `vulnerability_count`, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus segment_type and defender_maturity_level (segment-constant)
- **Engineered** (6): `byte_volume_log`, `has_c2_beacon`, `is_brute_forcing`, `attacker_defender_advantage`, `is_high_volume`, `is_privileged_port`
None of the engineered features is derived from phase or technique;
doing so would reintroduce the leakage excluded above.
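To make the recipe concrete, here is a hedged sketch of two of its steps on a single event dict: drop the three leaky columns, then derive a few engineered features. The formulas (`log1p` of bytes, a positive beacon interval, ports below 1024) and the brute-force threshold of 5 failures are assumptions chosen for illustration; the canonical definitions live in `feature_engineering.py` and may differ.

```python
import math

LEAKY_COLUMNS = ("technique_id", "technique_name", "tactic_category")

def sketch_features(event, brute_force_threshold=5):
    """Illustrative only: not the bundled feature_engineering.py logic."""
    # Step 1: remove columns that leak the target.
    row = {k: v for k, v in event.items() if k not in LEAKY_COLUMNS}
    # Step 2: engineered features (assumed definitions).
    row["byte_volume_log"] = math.log1p(row["bytes_transferred"])
    row["has_c2_beacon"] = int(row["c2_beacon_interval_s"] > 0)
    row["is_brute_forcing"] = int(row["auth_failure_count"] >= brute_force_threshold)
    row["is_privileged_port"] = int(row["dest_port"] < 1024)
    return row

# Hypothetical event; the values are made up for the example.
event = {
    "technique_id": "T1190",
    "technique_name": "Exploit Public-Facing Application",
    "tactic_category": "initial_access",
    "dest_port": 443,
    "bytes_transferred": 2048,
    "auth_failure_count": 0,
    "c2_beacon_interval_s": 0.0,
}
row = sketch_features(event)
```

None of the derived columns reads the phase or technique fields, which is the property the pipeline relies on.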
### Note on detection-outcome features
`detection_outcome`, `alert_severity`, `edr_blocked_flag`, and
`siem_rule_triggered` are post-hoc observables from the SOC's perspective.
They are kept as features for the realistic use case where a SOC analyst
has just seen an action and its initial detection signal and is reasoning
about which phase the campaign is in. Buyers who want a strictly
pre-detection model can drop these four columns and retrain; the ablation
results below show this **does not hurt accuracy** (the model doesn't
lean on them for phase prediction).
## Evaluation
### Test-set metrics (n = 726 events from 15 disjoint campaigns)
**XGBoost**
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.8599** |
| Accuracy | 0.4683 |
| Macro-F1 | 0.4255 |
| Weighted-F1 | 0.4604 |
**MLP**
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.8496** |
| Accuracy | 0.4449 |
| Macro-F1 | 0.3911 |
| Weighted-F1 | 0.4350 |
### Headline interpretation
Accuracy of 47% looks low at first glance, but the right comparison is:
| Baseline | Accuracy | Macro-F1 |
|---|---:|---:|
| Random uniform guess (1/10 classes) | 0.10 | ~0.10 |
| Always predict majority (`dwell_idle`) | 0.19 | n/a |
| **XGBoost (this model)** | **0.47** | **0.43** |
The macro ROC-AUC of **0.86** tells the cleaner story: the model
distinguishes the 10 phases meaningfully well even though the
argmax-prediction sometimes lands on an adjacent phase.
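The two trivial baselines can be recomputed from the per-phase test counts in the Training data table. The sketch below is plain arithmetic; only the variable names are invented.

```python
# Per-phase test counts, copied from the Training data table.
test_counts = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31,
    "impact": 21,
}
n_test = sum(test_counts.values())

# Always predicting the most frequent class scores its share of the test set.
majority_class = max(test_counts, key=test_counts.get)
majority_accuracy = test_counts[majority_class] / n_test

# Uniform random guessing over K classes scores 1/K in expectation.
uniform_accuracy = 1 / len(test_counts)

xgb_accuracy = 0.4683  # published test accuracy
lift_over_majority = xgb_accuracy / majority_accuracy
```

The published XGBoost accuracy is more than twice the majority-class baseline.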
### Per-class F1: where the signal is and isn't
| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `reconnaissance` | **0.753** | 0.725 | Strong: early timestep, distinct protocols/targets |
| `lateral_movement` | **0.742** | 0.783 | Strong: lateral-hop count, post-privesc pattern |
| `initial_access` | **0.647** | 0.648 | Strong: perimeter targets, specific protocols |
| `privilege_escalation` | 0.500 | 0.488 | Moderate |
| `execution` | 0.441 | 0.510 | Moderate |
| `persistence` | 0.413 | 0.301 | Moderate, easily confused with execution |
| `exfiltration` | 0.273 | 0.119 | Weak: late-phase, similar to collection/impact |
| `impact` | 0.226 | 0.132 | Weak: late-phase clustering |
| `collection` | 0.220 | 0.191 | Weak: late-phase clustering |
| `dwell_idle` | 0.040 | 0.013 | Very weak: no-op steps lack distinguishing features |
The model has solid signal on **early and mid-campaign phases** and
genuinely struggles to disambiguate **late-stage objective-completion
phases** (collection / exfiltration / impact), which arrive close in
time and look similar at the event level. This is an honest limitation
of flat-tabular classification; sequence models would help here.
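Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline figures can be checked against the per-class table. The sketch below uses the published three-decimal values; small last-digit differences from the reported macro-F1 are rounding.

```python
# Per-class F1 from the table above, in table order.
xgb_f1 = {
    "reconnaissance": 0.753, "lateral_movement": 0.742, "initial_access": 0.647,
    "privilege_escalation": 0.500, "execution": 0.441, "persistence": 0.413,
    "exfiltration": 0.273, "impact": 0.226, "collection": 0.220,
    "dwell_idle": 0.040,
}
mlp_f1 = {
    "reconnaissance": 0.725, "lateral_movement": 0.783, "initial_access": 0.648,
    "privilege_escalation": 0.488, "execution": 0.510, "persistence": 0.301,
    "exfiltration": 0.119, "impact": 0.132, "collection": 0.191,
    "dwell_idle": 0.013,
}

def macro_f1(per_class):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(per_class.values()) / len(per_class)

xgb_macro = macro_f1(xgb_f1)
mlp_macro = macro_f1(mlp_f1)

# Phases where the MLP edges out XGBoost.
mlp_better = [c for c in xgb_f1 if mlp_f1[c] > xgb_f1[c]]
```

The MLP scores higher on `lateral_movement`, `initial_access`, and `execution`, which is one reason to keep both models rather than only the stronger one.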
### Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | Δ accuracy vs full |
|---|---:|---:|---:|
| Full feature set (published) | 0.4683 | 0.4255 | n/a |
| No `timestep` | 0.3264 | 0.3102 | **−0.1419** |
| No topology aggregates | 0.4601 | 0.4093 | −0.0083 |
| No engineered features | 0.4642 | 0.4240 | −0.0041 |
| No detection-signal features | 0.4725 | 0.4284 | **+0.0041** |
Two clear findings:
1. **`timestep` is by far the most important feature** (drops 14 pp when
removed). The honest reading: kill chains progress in time, and where
you are in the campaign timeline carries most of the phase signal.
2. **Detection-signal features (`detection_outcome`, `alert_severity`,
`edr_blocked_flag`, `siem_rule_triggered`) do not help phase prediction.**
Removing them actually improves the score marginally. A buyer who wants
a pre-detection model can drop these four columns with no loss.
Topology aggregates and engineered features contribute little by comparison: about 0.8 pp and 0.4 pp of accuracy, respectively.
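The accuracy deltas can be recomputed from the table in percentage points. This sketch uses the rounded four-decimal accuracies, so the last digit may differ slightly from the table's Δ column; the dictionary keys are invented labels for the four ablations.

```python
full_accuracy = 0.4683
ablation_accuracy = {
    "no_timestep": 0.3264,
    "no_topology": 0.4601,
    "no_engineered": 0.4642,
    "no_detection_signals": 0.4725,
}

# Change in accuracy relative to the full feature set, in percentage points.
delta_pp = {
    name: round((acc - full_accuracy) * 100, 2)
    for name, acc in ablation_accuracy.items()
}

# Most harmful removal first.
ranked = sorted(delta_pp, key=delta_pp.get)
```

Sorting by delta puts `no_timestep` first by a wide margin and `no_detection_signals` last, the only removal with a positive delta.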
### Architecture
**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.
**MLP:** `90 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
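The MLP description translates into a small PyTorch module along the following lines. This is a sketch of the stated layer sizes only: the class name `BaselineMLP` and the attribute name `net` are assumptions, so the parameter names in `model_mlp.safetensors` may not match this layout without remapping.

```python
import torch
from torch import nn

class BaselineMLP(nn.Module):
    """90 -> 128 -> 64 -> 10 with BatchNorm1d, ReLU, Dropout after each hidden layer."""

    def __init__(self, n_features=90, n_classes=10, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, n_classes),  # raw logits; softmax is applied outside
        )

    def forward(self, x):
        return self.net(x)

# eval() disables dropout and uses BatchNorm running statistics.
model = BaselineMLP().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 90))   # random stand-in for 4 scaled feature rows
    proba = torch.softmax(logits, dim=1)
```

Inputs to the real model should be standardized with the values in `feature_scaler.json` before the forward pass; the random tensor here only exercises the shapes.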
Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.
## Limitations
**This is a baseline reference, not a production threat detection system.**
1. **Late-phase confusion.** Per-class F1 for `collection`, `exfiltration`,
and `impact` is 0.22 to 0.27. These phases arrive near campaign-end with
similar feature signatures, and a flat-tabular event-level model can't
easily disambiguate them. Sequence models (LSTM / transformer over the
per-campaign event sequence) would substantially improve this.
2. **`dwell_idle` is essentially unlearnable in this framing.** The
class-balanced weights amplify rare classes; `dwell_idle` is common
but featureless ("no action this timestep"), so the model trades
`dwell_idle` recall for late-phase recall. F1 = 0.04. A real SOC
pipeline would handle idle steps with a separate gating rule, not a
classifier head.
3. **Sample-size constraints.** 100 campaigns / 4,353 events with a
group-aware split leaves 69 training campaigns. The full 380k-event
CYB002 product supports much more reliable per-class estimation,
especially on the rare late-phase classes.
4. **Synthetic-vs-real transfer.** The dataset is synthetic and
calibrated to threat-intelligence benchmark targets (Mandiant
M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real
attack telemetry has different noise characteristics, adversary
adaptation, and gaps in coverage. Do not assume metrics transfer.
5. **Adversarial robustness not evaluated.** The dataset is not
adversarially generated; the model has not been red-teamed.
6. **MLP brittleness on OOD inputs.** With ~2.8k training events, the
MLP can produce confidently-wrong predictions on hand-crafted
records far from the training manifold. XGBoost is more robust.
Use both; treat disagreement as a signal for human review.
## Notes on dataset schema
The CYB002 sample dataset README describes some fields differently from
the actual schema. The model was trained on the actual schema; this note
is to help buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| "9 ATT&CK phases" | 10 phases including `dwell_idle` (idle/no-op steps) |
| 4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state` | 4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state` |
| 5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`) | 5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust` |
| Field name `phase` | Actual column: `kill_chain_phase` |
| Field name `tactic` | Actual column: `tactic_category` |
| Field name `segment_id` | Actual column: `target_segment_id` |
| Field name `attacker_tier` | Actual column: `attacker_capability_tier` |
| Field name `defender_maturity` | Actual column: `defender_maturity_level` |
| Field names `detected`, `blocked`, `stealth_score` | Actual: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; no `stealth_score` on events |
None of this affects model correctness: `feature_engineering.py` uses the
actual column names. If you build your own pipeline against the dataset,
use the actual columns, not the README descriptions.
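If you have code written against the README names, a small rename map covers the one-to-one cases in the table above. The helper `to_actual_schema` is hypothetical. The `detected` / `blocked` / `stealth_score` row is left out on purpose because the table does not give a one-to-one mapping for it.

```python
# README name -> actual column name (one-to-one rows of the table only).
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}

def to_actual_schema(record):
    """Rename README-style keys; keys that are already correct pass through."""
    return {README_TO_ACTUAL.get(key, key): value for key, value in record.items()}

# Hypothetical record using README-style names.
readme_style = {"phase": "execution", "segment_id": "SEG-01", "dest_port": 445}
actual_style = to_actual_schema(readme_style)
```

Category values still need separate handling: for example, the attacker tier `organized_crime` from the README does not appear in the data, which uses `script_kiddie` instead.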
## Intended use
- **Evaluating fit** of the CYB002 dataset for your ATT&CK / kill-chain
research
- **Baseline reference** for new model architectures (especially sequence
models, which should beat this baseline on the late-phase classes)
- **Teaching and demo** for tabular classification on attack-event data
- **Feature engineering reference** for MITRE ATT&CK-aligned datasets
## Out-of-scope use
- Production threat detection on real network telemetry
- SOC alert triage on real systems
- Forensic attribution of real attacks
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any safety-critical or operational security decision
## Reproducibility
Outputs above were produced with `seed = 42`, group-aware nested
`GroupShuffleSplit` (70/15/15 by campaign_id), on the published sample
(`xpertsystems/cyb002-sample`, version 1.0.0, generated 2026-05-16).
The feature pipeline in `feature_engineering.py` is deterministic and
the trained weights in this repo correspond exactly to the metrics above.
The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results, which is sufficient to reproduce inference but
not training.
## Files in this repo
| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights |
| `model_mlp.safetensors` | PyTorch MLP weights |
| `feature_engineering.py` | Feature pipeline (load → aggregate topology → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, topology, engineered, detection-signals) |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |
## Contact and full product
The full **CYB002** dataset contains ~454,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative threat intelligence sources (Mandiant, IBM, Verizon,
CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data
catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance &
Risk, Oil & Gas, and Materials & Energy.
- Email: **pradeep@xpertsystems.ai**
- Web: **https://xpertsystems.ai**
- Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample
- Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier
## Citation
```bibtex
@misc{xpertsystems_cyb002_baseline_2026,
title = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT&CK Kill-Chain Phase Classification},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier},
note = {Baseline reference model trained on xpertsystems/cyb002-sample}
}
```