---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- siem
- security-logs
- mitre-attack
- apt
- tabular-classification
- synthetic-data
- xgboost
- baseline
- leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb010-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb010-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 5-class attack lifecycle phase classification
    dataset:
      type: xpertsystems/cyb010-sample
      name: CYB010 Synthetic Security Event Log Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9904
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.9493
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7781
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.936
      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.988
      name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
---
# CYB010 Baseline Classifier
**Attack lifecycle phase classifier (5-class) trained on the CYB010
synthetic security event log sample. Predicts which of 5 attack phases
(`benign_background` / `initial_access` / `lateral_movement` /
`persistence_establishment` / `exfiltration_or_impact`) a security
event belongs to, from per-event features. ALSO ships a comprehensive
`leakage_diagnostic.json` documenting 11 oracle paths discovered
across the dataset's targets and 2 README-suggested targets that are
unlearnable on the sample after honest leak removal.**
> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `attack_lifecycle_phase` (the
> dataset's headline target), and (2) `leakage_diagnostic.json`
> documenting 11 separate oracle paths plus 2 unlearnable targets.
> Both files matter; the diagnostic is required reading for anyone
> evaluating CYB010 for SIEM ML work.
## Model overview
| Property | Value |
|---|---|
| Primary task | 5-class `attack_lifecycle_phase` classification |
| Secondary artifact | `leakage_diagnostic.json` — 11 oracle paths + 2 unlearnable targets |
| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | **Group-aware** (GroupShuffleSplit on `incident_id`) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |
## Why this task — and what was dropped
The CYB010 README's central concept is the "5-phase attack lifecycle
state machine", and `attack_lifecycle_phase` is the data's headline
target. We piloted six candidate targets and found:
- **`attack_lifecycle_phase` 5-class**: strongest honest result.
Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes
represented, per-class F1 range 0.48–1.00.
- **`threat_actor_profile` 5-class**: works at acc 0.84 but per-class
F1 reveals it's almost entirely driven by `benign_user` separation
(F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The 4-class
malicious-only formulation falls below the majority-class baseline
(acc 0.55 vs 0.61).
- **`label_true_positive` binary on alerts**: documented as a secondary
finding. Has 7 oracle features; honest acc 0.80, AUC 0.89 after
dropping all of them.
- **`mitre_tactic` 14-class**: reaches acc 0.90 but only macro-F1 0.37;
the accuracy comes from class imbalance (the benign class dominates at 57%).
- **`event_class` 12-class**: unlearnable (acc 0.35 vs majority 0.42).
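The "majority" figures quoted in this list are the accuracy of always predicting the most frequent class. As a minimal sketch of that baseline, the snippet below applies it to the phase counts listed in the Training data section; the function name is illustrative and not part of this repo.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())

# Whole-sample phase counts, copied from the Training data table.
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
labels = [phase for phase, n in phase_counts.items() for _ in range(n)]

baseline = majority_baseline_accuracy(labels)
print(f"majority baseline for attack_lifecycle_phase: {baseline:.3f}")
```

A target counts as learnable here only when the model clears this number, which is why accuracy alone is never reported without it.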
### Six oracle columns dropped from the phase task
CYB010 encodes the benign vs malicious distinction explicitly in
multiple columns. Each is a perfect or near-perfect oracle for the
`benign_background` phase:
| Column | Oracle relationship |
|---|---|
| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
| `event_type` | Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`) |
With these six columns present, a plain XGBoost trivially separates
benign vs malicious. The published baseline trains with all six
excluded.
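One way to screen a column for this kind of oracle behaviour is to measure, for each of its values, how often the most common phase occurs. The sketch below is an assumption about how such a check could be written, not the code behind `leakage_diagnostic.json`; the toy rows and the `TA-01` actor ID are made up for illustration.

```python
from collections import Counter, defaultdict

def value_purity(rows, feature, target):
    """For each value of `feature`, return (dominant target, share, count)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    report = {}
    for value, counts in by_value.items():
        top_target, top_n = counts.most_common(1)[0]
        total = sum(counts.values())
        report[value] = (top_target, top_n / total, total)
    return report

# Toy rows (hypothetical, not drawn from CYB010).
rows = [
    {"threat_actor_id": "NONE", "attack_lifecycle_phase": "benign_background"},
    {"threat_actor_id": "NONE", "attack_lifecycle_phase": "benign_background"},
    {"threat_actor_id": "TA-01", "attack_lifecycle_phase": "initial_access"},
    {"threat_actor_id": "TA-01", "attack_lifecycle_phase": "lateral_movement"},
]

report = value_purity(rows, "threat_actor_id", "attack_lifecycle_phase")
for value, (phase, share, n) in report.items():
    flag = "possible oracle" if share == 1.0 else "mixed"
    print(f"{value!r}: {phase} share={share:.2f} n={n} -> {flag}")
```

A value with share 1.0 and a large count maps deterministically to one phase, which is the pattern the table above describes for `threat_actor_id == "NONE"`.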
Two model artifacts are published. They are designed to be used
together:
- `model_xgb.json` — gradient-boosted trees (slightly higher F1)
- `model_mlp.safetensors` — PyTorch MLP
## Quick start
```bash
pip install xgboost torch safetensors pandas huggingface_hub
```
```python
import json, os, sys

import numpy as np, torch, xgboost as xgb
from huggingface_hub import hf_hub_download, snapshot_download
from safetensors.torch import load_file  # torch/safetensors are only needed for the MLP artifact

REPO = "xpertsystems/cyb010-baseline-classifier"
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)
meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_event` is a placeholder for one raw event record; define it first.
# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type - those were the
# oracle columns.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```
See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.
## Training data
Trained on the public sample of CYB010, 21,896 per-event records:
| Phase | Events | Class share |
|---|---:|---:|
| `benign_background` | 12,448 | 56.9% |
| `exfiltration_or_impact` | 6,205 | 28.3% |
| `initial_access` | 1,674 | 7.6% |
| `lateral_movement` | 968 | 4.4% |
| `persistence_establishment` | 601 | 2.7% |
### Group-aware split by incident_id
500 incidents × ~44 events each. Events from the same incident share
host, threat actor, and phase trajectory — so train/test contamination
is a real risk with random splitting. The baseline uses
**GroupShuffleSplit** on `incident_id` (nested 70/15/15):
| Fold | Events | Incidents |
|---|---:|---:|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |
All 10 multi-seed evaluations yielded all 5 classes in the test fold.
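The idea behind the group-aware split can be sketched with the standard library alone: shuffle the unique incident IDs, then assign whole incidents to folds. This is a stand-in that mirrors what `GroupShuffleSplit` guarantees, not the private training script; the `INC-0000` ID format and the three-events-per-incident toy data are assumptions.

```python
import random

def group_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to train/val/test so no group spans two folds."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    fold_of = {}
    for i, group in enumerate(unique):
        if i < n_train:
            fold_of[group] = "train"
        elif i < n_train + n_val:
            fold_of[group] = "val"
        else:
            fold_of[group] = "test"
    folds = {"train": [], "val": [], "test": []}
    for row_idx, group in enumerate(group_ids):
        folds[fold_of[group]].append(row_idx)
    return folds

# Toy data: 500 hypothetical incident IDs with 3 events each.
incident_ids = [f"INC-{i:04d}" for i in range(500) for _ in range(3)]
folds = group_split(incident_ids)

# Collect the incidents that ended up in each fold and confirm no overlap.
groups_in = {name: {incident_ids[i] for i in rows} for name, rows in folds.items()}
print({name: len(groups) for name, groups in groups_in.items()})
```

Because folds are cut at the incident level, event counts per fold only approximate 70/15/15, which matches the uneven event totals in the table above.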
Class imbalance is addressed with balanced class weights, passed to
XGBoost as per-row `sample_weight`, and with weighted cross-entropy for the MLP.
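For reference, the common "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below assumes that heuristic (the exact weighting code is not published) and uses the whole-sample phase counts, so the training-fold weights will differ slightly.

```python
from collections import Counter

def balanced_weights(labels):
    """Return (per-sample weights, per-class weights) using n / (k * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    class_weight = {c: n / (k * n_c) for c, n_c in counts.items()}
    return [class_weight[y] for y in labels], class_weight

# Whole-sample phase counts from the Training data table.
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
labels = [phase for phase, n in phase_counts.items() for _ in range(n)]

sample_weight, class_weight = balanced_weights(labels)
for phase, weight in class_weight.items():
    print(f"{phase:28s} {weight:.3f}")
```

The per-sample list is the shape expected by the `sample_weight` argument of `XGBClassifier.fit`, and the per-class values can be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.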
## Feature pipeline
The bundled `feature_engineering.py` is the canonical recipe. 87
features survive after encoding, drawn from:
- **Per-event numeric** (5): `source_port`, `dest_port`,
`cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
- **Per-event categorical** (3, one-hot): `event_class` (12 values),
`log_source_type` (8 values), `severity_level` (5 values)
- **Host features** (joined from `host_inventory.csv`): 3 numeric +
7 categorical (os_type, host_role, network_segment, defender_posture,
criticality_rating, cloud_provider, siem_platform)
- **Engineered** (9): `hour_of_day`, `is_off_hours`, `is_weekend`,
`log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`,
`is_outbound_web`, `risk_composite`
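To give a feel for the engineered group, here is a hedged sketch of how a few of these features could be derived. The canonical definitions live in `feature_engineering.py`; the working-hours window, the CVSS cutoff, the `log1p` transform, and the choice of `dest_port` for the port flags are all assumptions made for illustration. `is_outbound_web` and `risk_composite` are omitted because their definitions are not stated here.

```python
import math
from datetime import datetime

def engineer(timestamp, dest_port, cvss):
    """Derive a few per-event features; thresholds are illustrative assumptions."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "hour_of_day": ts.hour,
        "is_off_hours": int(ts.hour < 8 or ts.hour >= 18),    # assumed 08:00-18:00 workday
        "is_weekend": int(ts.weekday() >= 5),                  # Saturday=5, Sunday=6
        "log_cvss": math.log1p(cvss),                          # assumed log1p transform
        "is_high_cvss": int(cvss >= 7.0),                      # assumed cutoff
        "is_well_known_port": int(0 <= dest_port <= 1023),     # IANA system port range
        "is_dynamic_port": int(49152 <= dest_port <= 65535),   # IANA dynamic port range
    }

# Hypothetical event: Saturday evening, HTTPS destination port, CVSS-like score 8.1.
features = engineer("2024-01-06T22:30:00", dest_port=443, cvss=8.1)
for name, value in features.items():
    print(f"{name}: {value}")
```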
### Partial-oracle features kept as legitimate observables
`event_class` (max purity 0.87, mean 0.72 across phases) is the
strongest non-oracle feature. C2 beacon traffic (`event_class =
network_flow`) is 65% exfiltration phase but also 29% benign and 6%
other phases — real overlap, not deterministic encoding. Kept.
`severity_level` and `cvss_score_analogue` correlate strongly with
phase (high-severity events skew toward exfil and initial_access) but
with substantial overlap. Kept.
`label_log_tampered` is a real observable — APTs tamper more than
script_kiddies — but is not phase-deterministic. Kept.
## Evaluation
### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)
**XGBoost** (the published `model_xgb.json` artifact)
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9904** |
| Accuracy | **0.9493** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |
**MLP** (the published `model_mlp.safetensors` artifact)
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9861** |
| Accuracy | **0.9412** |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |
XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941,
macro-F1 0.778 vs 0.753). The gap is consistent across seeds.
### Multi-seed robustness (XGBoost, 10 seeds)
| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |
**Tightest ROC-AUC std in the catalog** (0.001). All 10 seeds yielded
all 5 classes in the test fold. Full per-seed results in
[`multi_seed_results.json`](./multi_seed_results.json).
### Per-class F1 (seed 42)
| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `benign_background` | 56.9% | **0.998** | 0.994 |
| `exfiltration_or_impact` | 28.3% | **0.987** | 0.981 |
| `initial_access` | 7.6% | 0.720 | 0.651 |
| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
| `lateral_movement` | 4.4% | **0.483** | 0.451 |
The two largest classes (`benign_background` and `exfiltration_or_impact`)
are nearly perfectly separable — `benign_background` because the
non-oracle features (severity, CVSS, log_source) still cleanly separate
non-malicious traffic, and `exfiltration_or_impact` because it's
dominated by network_flow events (C2 beacons). The three middle
classes overlap substantially in feature space; `lateral_movement` is
the hardest (F1 0.48) because lateral movement events look similar to
initial_access events at the per-event level. A sequence model that
considers event ordering within an incident would likely do better
than the per-event baseline.
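Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline values can be checked directly from the table above. The short sketch below does that arithmetic; small differences from the reported numbers come from the per-class scores being rounded to three decimals.

```python
from statistics import mean

# Per-class F1 at seed 42, copied from the table above.
per_class_f1 = {
    "benign_background":         {"xgb": 0.998, "mlp": 0.994},
    "exfiltration_or_impact":    {"xgb": 0.987, "mlp": 0.981},
    "initial_access":            {"xgb": 0.720, "mlp": 0.651},
    "persistence_establishment": {"xgb": 0.703, "mlp": 0.690},
    "lateral_movement":          {"xgb": 0.483, "mlp": 0.451},
}

# Unweighted mean over the five classes, one value per model.
macro = {m: mean(row[m] for row in per_class_f1.values()) for m in ("xgb", "mlp")}
print(f"XGBoost macro-F1 ~ {macro['xgb']:.4f} (reported 0.7781)")
print(f"MLP macro-F1     ~ {macro['mlp']:.4f} (reported 0.7534)")
```

Because every class counts equally, the three minority phases pull macro-F1 well below accuracy even though they make up under 15% of events.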
### Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---:|---:|---:|---:|---:|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | — | — |
| No `event_class` | 0.9206 | 0.5969 | 0.9723 | **−0.0287** | **−0.181** |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |
Three findings:
1. **`event_class` is the dominant signal** (drops 18pp macro-F1 when
removed). Phase prediction without it loses most discrimination
between the middle classes.
2. **CVSS features are second-strongest** (drops 3pp F1). Captures
severity information that complements event_class.
3. **Host features and timing add little beyond noise.** The model performs
marginally *better* without host features (+0.3pp accuracy, +0.5pp
macro-F1), and removing timing features changes accuracy by less than
0.1pp. Both are kept in the pipeline as a documented baseline reference.
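The Δ columns are simple differences against the full feature set. As a sketch, the snippet below recomputes them from the table values and ranks the feature groups by macro-F1 impact; the dictionary layout is illustrative and is not the schema of `ablation_results.json`.

```python
FULL_ACC, FULL_F1 = 0.9493, 0.7781

# (accuracy, macro-F1) per ablation, copied from the table above.
ablations = {
    "No event_class":         (0.9206, 0.5969),
    "No CVSS features":       (0.9383, 0.7475),
    "No log_source_type":     (0.9469, 0.7655),
    "No engineered features": (0.9471, 0.7655),
    "No ports":               (0.9463, 0.7621),
    "No severity_level":      (0.9479, 0.7688),
    "No tamper flags":        (0.9469, 0.7657),
    "No timing":              (0.9501, 0.7730),
    "No host features":       (0.9522, 0.7828),
}

# Delta versus the full feature set, rounded as in the table.
deltas = {
    name: (round(acc - FULL_ACC, 4), round(f1 - FULL_F1, 3))
    for name, (acc, f1) in ablations.items()
}

# Most harmful removal first (most negative macro-F1 delta).
ranked = sorted(deltas.items(), key=lambda item: item[1][1])
for name, (d_acc, d_f1) in ranked:
    print(f"{name:24s} d_acc={d_acc:+.4f} d_macro_f1={d_f1:+.3f}")
```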
### Architecture
**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.
**MLP:** `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d` →
`ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
Training hyperparameters are held internally by XpertSystems.
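The MLP layout described above can be written as a small PyTorch module. This is a sketch under stated assumptions: the class name `PhaseMLP` and the use of a single `nn.Sequential` are illustrative, so the parameter names may not match the keys stored in `model_mlp.safetensors`.

```python
import torch
from torch import nn

class PhaseMLP(nn.Module):
    """87 -> 128 -> 64 -> 5; each hidden layer is followed by BatchNorm1d, ReLU, Dropout."""

    def __init__(self, n_features=87, n_classes=5, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_classes),  # raw logits; softmax is applied outside the model
        )

    def forward(self, x):
        return self.net(x)

# eval() disables dropout and makes BatchNorm use its running statistics.
model = PhaseMLP().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 87))  # random stand-in for 4 scaled feature rows
probs = torch.softmax(logits, dim=1)

n_params = sum(p.numel() for p in model.parameters())
print(tuple(probs.shape), n_params)
```

For training, the logits would feed `nn.CrossEntropyLoss` with a per-class `weight` tensor, which is the standard PyTorch form of the weighted cross-entropy named above.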
## Limitations
**This is a baseline reference, not a production phase classifier.**
1. **The leakage diagnostic is required reading.** Six oracle columns
for the phase task and seven for the alert TP task are documented
in `leakage_diagnostic.json`. If you use CYB010 sample data for
your own training, you MUST drop these or your model will learn
the oracles instead of the task.
2. **`lateral_movement` F1 0.48 is the weakest class.** With only 968
events and substantial overlap with `initial_access`, this class is
hard to separate. A sequence model that considers event ordering within
incidents would likely do better than per-event classification.
3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable
on this sample** (acc 0.55 vs majority 0.61). The 5-class
formulation with benign included works only because benign_user
separation is structurally trivial.
4. **`event_class` 12-class is unlearnable on this sample** (acc 0.35
vs majority 0.42). event_class is a structural property of the
event itself, not something to predict from other features.
5. **Synthetic-vs-real transfer.** The dataset is synthetic, calibrated
to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE
ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise
characteristics — and in particular, the explicit `mitre_tactic ==
"benign"` marker and `threat_actor_id == "NONE"` benign sentinel
would not be present in real data. Real telemetry has implicit
benign-vs-malicious distinctions that emerge from event content.
Do not assume metrics transfer end-to-end.
6. **21,896 events / 500 incidents is a modest training set.** The
3,726-event / ~75-incident test fold yields stable multi-seed
metrics (std 0.007 on accuracy) but per-class confidence intervals
widen for the smallest classes (lateral_movement, persistence).
## Notes on dataset schema
The CYB010 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| `security_events` has 16 columns | Data has **23 columns** |
| Field renames | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
| `alert_records` has 9 columns | Data has **21 columns** |
| Field renames | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
| README's `ioc_matched` | Not present in `alert_records` |
| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
| `incident_summary` has 8 columns | Data has **24 columns** |
| `host_inventory` has 6 columns | Data has **15 columns** |
| `threat_actor_profile` has 4 values | Data has **5 values** (adds `benign_user` at 57% of events) |
| `attack_lifecycle_phase` 5-phase malicious lifecycle | Data adds `benign_background` as a phase value (57% of events) — so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |
None of these affects model correctness — the feature pipeline uses
the actual column names. If you build your own pipeline against the
dataset, use the actual columns.
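If you wrote code against the README's field names, a small rename map is one way to reconcile it with the shipped files. The sketch below encodes only the renames and absent columns listed in the table above; the helper names are hypothetical and not part of `feature_engineering.py`.

```python
# README name -> actual column name, per table in this section.
SECURITY_EVENTS_RENAMES = {
    "timestamp_utc": "timestamp",
    "user": "user_id",
    "log_format": "log_source_type",
}
ALERT_RECORDS_RENAMES = {
    "alert_severity": "severity_level",
    "detection_rule": "alert_rule_name",
}
# Columns the README mentions for security_events that the data does not contain.
SECURITY_EVENTS_ABSENT = {"command_line", "process_name", "is_off_hours"}

def to_actual_columns(readme_names, renames, absent=frozenset()):
    """Translate README field names to actual ones, dropping absent columns."""
    return [renames.get(name, name) for name in readme_names if name not in absent]

wanted = ["timestamp_utc", "user", "log_format", "command_line", "event_class"]
actual = to_actual_columns(wanted, SECURITY_EVENTS_RENAMES, SECURITY_EVENTS_ABSENT)
print(actual)
```

Names that are already correct, such as `event_class`, pass through unchanged, so the same helper works on a mixed list.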
## Intended use
- **Evaluating fit** of the CYB010 dataset for your SIEM ML research
- **Baseline reference** for new model architectures on the
attack-phase classification task
- **Reference example of structural-leakage diagnostics** for
synthetic SIEM datasets — the methodology is reusable
- **Feature engineering reference** for per-event SIEM telemetry
## Out-of-scope use
- Production SIEM phase detection on real telemetry
- Threat actor attribution (4-class malicious-only is unlearnable
on the sample)
- Event-class prediction (this is a structural property, not a
learnable target)
- Any operational decision affecting actual security operations
without further validation on your own data
## Reproducibility
Outputs above were produced with `seed = 42` (published artifact),
nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published
sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.
Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in `multi_seed_results.json` confirm robust performance across splits
(std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std
in the XpertSystems catalog).
The training script itself is private to XpertSystems.
## Files in this repo
| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| **`leakage_diagnostic.json`** | **11-oracle-path audit + 2 unlearnable targets** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |
## Contact and full product
The full **CYB010** dataset contains **~550,000 rows** across four files,
with calibrated benchmark validation against 6 metrics drawn from
authoritative SOC operations and threat intelligence sources (SANS SOC
Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA
Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).
The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.
- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
- 🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
- https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)
## Citation
```bibtex
@misc{xpertsystems_cyb010_baseline_2026,
title = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
note = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}
```