	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- cybersecurity
	- siem
	- security-logs
	- mitre-attack
	- apt
	- tabular-classification
	- synthetic-data
	- xgboost
	- baseline
	- leakage-diagnostic
	pipeline_tag: tabular-classification
	base_model: []
	datasets:
	- xpertsystems/cyb010-sample
	metrics:
	- accuracy
	- f1
	- roc_auc
	model-index:
	- name: cyb010-baseline-classifier
	  results:
	  - task:
	      type: tabular-classification
	      name: 5-class attack lifecycle phase classification
	    dataset:
	      type: xpertsystems/cyb010-sample
	      name: CYB010 Synthetic Security Event Log Dataset (Sample)
	    metrics:
	    - type: roc_auc
	      value: 0.9904
	      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.9493
	      name: Test accuracy (XGBoost, seed 42)
	    - type: f1
	      value: 0.7781
	      name: Test macro-F1 (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.936
	      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.988
	      name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
	---

	# CYB010 Baseline Classifier

	**Attack lifecycle phase classifier (5-class) trained on the CYB010
	synthetic security event log sample. It predicts which of 5 phases
	(`benign_background` / `initial_access` / `lateral_movement` /
	`persistence_establishment` / `exfiltration_or_impact`) a security
	event belongs to, using per-event features. The repo also ships
	`leakage_diagnostic.json`, which documents 11 oracle paths found
	across the dataset's targets and 2 README-suggested targets that are
	unlearnable on the sample once the leaks are removed.**

	> Read this first. This repo ships two related artifacts:
	> (1) a working baseline classifier for `attack_lifecycle_phase` (the
	> dataset's headline target), and (2) `leakage_diagnostic.json`
	> documenting 11 separate oracle paths plus 2 unlearnable targets.
	> Both files matter; the diagnostic is required reading for anyone
	> evaluating CYB010 for SIEM ML work.

	## Model overview

	| Property | Value |
	|---|---|
	| Primary task | 5-class `attack_lifecycle_phase` classification |
	| Secondary artifact | `leakage_diagnostic.json`: 11 oracle paths + 2 unlearnable targets |
	| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
	| Models | XGBoost + PyTorch MLP |
	| Input features | 87 (after one-hot encoding) |
	| Split | Group-aware (GroupShuffleSplit on `incident_id`) |
	| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
	| License | CC-BY-NC-4.0 (matches dataset) |
	| Status | Reference baseline + comprehensive leakage diagnostic |

	## Why this task — and what was dropped

	The CYB010 README's central concept is the "5-phase attack lifecycle
	state machine", and `attack_lifecycle_phase` is the data's headline
	target. We piloted six candidate formulations across five targets and found:

	- `attack_lifecycle_phase` 5-class: strongest honest result.
	Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes
	represented, per-class F1 range 0.48–1.00.

	- `threat_actor_profile` 5-class: reaches acc 0.84, but per-class
	F1 shows the score is driven almost entirely by `benign_user` separation
	(F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The 4-class
	malicious-only formulation falls below the majority-class baseline
	(acc 0.55 vs 0.61).

	- `label_true_positive` binary on alerts: documented as a secondary
	finding. It has 7 oracle features; honest acc is 0.80 and AUC 0.89
	after dropping all of them.

	- `mitre_tactic` 14-class: reaches acc 0.90 but macro-F1 is only 0.37.
	The accuracy reflects class imbalance (the benign class dominates at
	57%), not learned separation of the tactics.

	- `event_class` 12-class: unlearnable (acc 0.35 vs majority-class
	baseline 0.42).

	### Six oracle columns dropped from the phase task

	CYB010 encodes the benign vs malicious distinction explicitly in
	multiple columns. Each is a perfect or near-perfect oracle for the
	`benign_background` phase:

	| Column | Oracle relationship |
	|---|---|
	| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
	| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
	| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
	| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
	| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
	| `event_type` | Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`) |

	With these six columns present, a plain XGBoost trivially separates
	benign vs malicious. The published baseline trains with all six
	excluded.
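One way to audit a column for this kind of leakage is to measure, for each value of the column, how concentrated the target label is. The sketch below is a dependency-free illustration under stated assumptions: `value_purity` and `drop_oracles` are hypothetical helper names (they are not part of `feature_engineering.py`), and the toy events are made up for the example, not taken from CYB010. A purity of 1.0 for a value means that value determines the phase outright.

```python
from collections import Counter, defaultdict

# The six columns excluded from the phase task (names from the table above).
ORACLE_COLUMNS = [
    "mitre_tactic", "mitre_technique_id", "label_malicious",
    "threat_actor_id", "threat_actor_profile", "event_type",
]

def value_purity(rows, column, target):
    """For each value of `column`, return the share of its most common target label."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    return {
        value: counts.most_common(1)[0][1] / sum(counts.values())
        for value, counts in by_value.items()
    }

def drop_oracles(row):
    """Return a copy of one event record without the oracle columns."""
    return {k: v for k, v in row.items() if k not in ORACLE_COLUMNS}

# Toy events (illustrative only, not rows from CYB010).
events = [
    {"threat_actor_id": "NONE", "severity_level": "low",
     "attack_lifecycle_phase": "benign_background"},
    {"threat_actor_id": "NONE", "severity_level": "low",
     "attack_lifecycle_phase": "benign_background"},
    {"threat_actor_id": "TA-01", "severity_level": "high",
     "attack_lifecycle_phase": "initial_access"},
    {"threat_actor_id": "TA-01", "severity_level": "low",
     "attack_lifecycle_phase": "lateral_movement"},
]

purity = value_purity(events, "threat_actor_id", "attack_lifecycle_phase")
clean = [drop_oracles(e) for e in events]
```

In the toy data, the `"NONE"` sentinel maps to a single phase, which is the pattern the table above reports for the real columns.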

	Two model artifacts are published. They are designed to be used
	together:

	- `model_xgb.json` — gradient-boosted trees (slightly higher F1)
	- `model_mlp.safetensors` — PyTorch MLP

	## Quick start

	```bash
	pip install xgboost torch safetensors pandas huggingface_hub
	```

	```python
	import json
	import os
	import sys

	import numpy as np
	import torch  # only needed for the MLP path
	import xgboost as xgb
	from huggingface_hub import hf_hub_download, snapshot_download
	from safetensors.torch import load_file  # only needed for the MLP path

	REPO = "xpertsystems/cyb010-baseline-classifier"

	# Download the model weights and the feature pipeline from the Hub
	paths = {n: hf_hub_download(REPO, n) for n in [
	    "model_xgb.json", "model_mlp.safetensors",
	    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
	]}

	# Make the downloaded feature_engineering.py importable
	sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
	from feature_engineering import (
	    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
	)

	meta = load_meta(paths["feature_meta.json"])

	# Host features are joined from host_inventory.csv at inference time
	ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
	host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

	xgb_model = xgb.XGBClassifier()
	xgb_model.load_model(paths["model_xgb.json"])

	# Predict (see inference_example.ipynb for the full pattern).
	# `my_event` is a placeholder for one raw event record that you supply.
	# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
	# threat_actor_id, threat_actor_profile, or event_type; those were the
	# oracle columns.
	X = transform_single(my_event, meta, host_lookup=host_lookup)
	proba = xgb_model.predict_proba(X)[0]
	print(INT_TO_LABEL[int(np.argmax(proba))])
	```

	See [`inference_example.ipynb`](./inference_example.ipynb) for the full
	copy-paste demo.

	## Training data

	Trained on the public sample of CYB010, 21,896 per-event records:

	| Phase | Events | Class share |
	|---|---:|---:|
	| `benign_background` | 12,448 | 56.9% |
	| `exfiltration_or_impact` | 6,205 | 28.3% |
	| `initial_access` | 1,674 | 7.6% |
	| `lateral_movement` | 968 | 4.4% |
	| `persistence_establishment` | 601 | 2.7% |

	### Group-aware split by incident_id

	500 incidents × ~44 events each. Events from the same incident share
	host, threat actor, and phase trajectory — so train/test contamination
	is a real risk with random splitting. The baseline uses
	GroupShuffleSplit on `incident_id` (nested 70/15/15):

	| Fold | Events | Incidents |
	|---|---:|---:|
	| Train | 14,697 | ~350 |
	| Validation | 3,473 | ~75 |
	| Test | 3,726 | ~75 |
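The idea behind a group-aware split can be sketched without any dependencies: shuffle the incident IDs, not the events, and assign whole incidents to folds. This is a minimal illustration of the principle, not the baseline's actual code (which uses scikit-learn's `GroupShuffleSplit`); `group_split` and the `INC-…` identifiers are hypothetical names invented for the example.

```python
import random

def group_split(groups, seed=42, fractions=(0.70, 0.15, 0.15)):
    """Assign each row to train/val/test so that no group spans two folds."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)  # shuffle incidents, not individual events

    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))

    fold_of = {}
    for i, group in enumerate(unique):
        if i < n_train:
            fold_of[group] = "train"
        elif i < n_train + n_val:
            fold_of[group] = "val"
        else:
            fold_of[group] = "test"
    return [fold_of[g] for g in groups]

# Toy data: 20 hypothetical incidents with 3 events each.
incident_ids = [f"INC-{i:03d}" for i in range(20) for _ in range(3)]
folds = group_split(incident_ids)
```

Because the fold is decided per incident, event counts per fold only approximate 70/15/15 when incidents differ in size, which is consistent with the uneven event counts in the table above.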

	All 10 multi-seed evaluations yielded all 5 classes in the test fold.
	Class imbalance is addressed with balanced class weights, passed to
	XGBoost as per-row `sample_weight` and to the MLP as weighted
	cross-entropy.
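As a rough illustration of what balanced weighting does to this class distribution, the sketch below applies the common heuristic `n_samples / (n_classes * class_count)`. It is an assumption that the training script uses exactly this formula; `balanced_class_weights` is a hypothetical helper, and the counts are the full-sample figures from the table above, whereas real weights would be computed on the training fold only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Full-sample class counts from the training-data table.
CLASS_COUNTS = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
labels = [cls for cls, count in CLASS_COUNTS.items() for _ in range(count)]

weights = balanced_class_weights(labels)

# One weight per row, the shape expected by XGBClassifier.fit(sample_weight=...).
sample_weight = [weights[y] for y in labels]

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls:<28} {w:.3f}")
```

Under this heuristic every class contributes the same total weight, so the rare `persistence_establishment` rows count far more per event than `benign_background` rows.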

	## Feature pipeline

	The bundled `feature_engineering.py` is the canonical recipe. 87
	features survive after encoding, drawn from:

	- Per-event numeric (5): `source_port`, `dest_port`,
	`cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
	- Per-event categorical (3, one-hot): `event_class` (12 values),
	`log_source_type` (8 values), `severity_level` (5 values)
	- Host features (joined from `host_inventory.csv`): 3 numeric +
	7 categorical (os_type, host_role, network_segment, defender_posture,
	criticality_rating, cloud_provider, siem_platform)
	- Engineered (9): `hour_of_day`, `is_off_hours`, `is_weekend`,
	`log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`,
	`is_outbound_web`, `risk_composite`
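To make the engineered features concrete, here is a hedged sketch of how several of them could be derived from one raw event. The canonical definitions live in `feature_engineering.py`; everything below is an assumption made for illustration, including the function name `engineer`, the 08:00–18:00 business-hours window, the `log1p` transform, the 7.0 CVSS threshold, and the choice of `dest_port` for the port flags. The port ranges follow the IANA conventions (0–1023 well-known, 49152–65535 dynamic). `is_outbound_web` and `risk_composite` are omitted because this card does not define them.

```python
import math
from datetime import datetime

def engineer(event):
    """Derive illustrative per-event features from one raw event record."""
    ts = datetime.fromisoformat(event["timestamp"])
    cvss = float(event["cvss_score_analogue"])
    dest_port = int(event["dest_port"])
    return {
        "hour_of_day": ts.hour,
        # Assumed business hours: 08:00 to 18:00.
        "is_off_hours": int(ts.hour < 8 or ts.hour >= 18),
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        "is_weekend": int(ts.weekday() >= 5),
        # Assumed transform: log(1 + cvss).
        "log_cvss": math.log1p(cvss),
        # Assumed threshold for "high" severity.
        "is_high_cvss": int(cvss >= 7.0),
        "is_well_known_port": int(0 <= dest_port <= 1023),
        "is_dynamic_port": int(49152 <= dest_port <= 65535),
    }

# Hypothetical event; 2024-01-06 was a Saturday.
features = engineer({
    "timestamp": "2024-01-06T22:30:00",
    "cvss_score_analogue": 9.1,
    "dest_port": 443,
})
```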

	### Partial-oracle features kept as legitimate observables

	`event_class` (max purity 0.87, mean 0.72 across phases) is the
	strongest non-oracle feature. C2 beacon traffic (`event_class =
	network_flow`) is 65% exfiltration phase but also 29% benign and 6%
	other phases — real overlap, not deterministic encoding. Kept.

	`severity_level` and `cvss_score_analogue` correlate strongly with
	phase (high-severity events skew toward exfil and initial_access) but
	with substantial overlap. Kept.

	`label_log_tampered` is a real observable — APTs tamper more than
	script_kiddies — but is not phase-deterministic. Kept.

	## Evaluation

	### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

	XGBoost (the published `model_xgb.json` artifact)

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9904 |
	| Accuracy | 0.9493 |
	| Macro-F1 | 0.7781 |
	| Weighted-F1 | 0.9478 |

	MLP (the published `model_mlp.safetensors` artifact)

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9861 |
	| Accuracy | 0.9412 |
	| Macro-F1 | 0.7534 |
	| Weighted-F1 | 0.9396 |

	XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941,
	macro-F1 0.778 vs 0.753). The gap is consistent across seeds.

	### Multi-seed robustness (XGBoost, 10 seeds)

	| Metric | Mean | Std | Min | Max |
	|---|---:|---:|---:|---:|
	| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
	| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
	| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |

	This is the tightest ROC-AUC std in the XpertSystems catalog (0.001).
	All 10 seeds yielded all 5 classes in the test fold. Full per-seed
	results are in [`multi_seed_results.json`](./multi_seed_results.json).

	### Per-class F1 (seed 42)

	| Phase | Class share | XGBoost F1 | MLP F1 |
	|---|---:|---:|---:|
	| `benign_background` | 56.9% | 0.998 | 0.994 |
	| `exfiltration_or_impact` | 28.3% | 0.987 | 0.981 |
	| `initial_access` | 7.6% | 0.720 | 0.651 |
	| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
	| `lateral_movement` | 4.4% | 0.483 | 0.451 |

	The two largest classes (`benign_background` and `exfiltration_or_impact`)
	are nearly perfectly separable: `benign_background` because the
	non-oracle features (severity, CVSS, log source) still cleanly separate
	non-malicious traffic, and `exfiltration_or_impact` because it is
	dominated by `network_flow` events (C2 beacons). The three smaller
	classes overlap substantially in feature space; `lateral_movement` is
	the hardest (F1 0.48) because its events look similar to
	`initial_access` events at the per-event level. A sequence model that
	considers event ordering within an incident would likely do better
	than the per-event baseline.
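The per-class table also explains why macro-F1 sits near 0.78 while accuracy is near 0.95: macro-F1 is the unweighted mean of the five per-class F1 scores, so the three small classes count as much as the two large ones. The short check below recomputes it from the rounded values in the table; because those inputs are rounded to three decimals, the result can differ from the reported figure in the fourth decimal.

```python
# Per-class F1 scores copied from the seed-42 table above.
PER_CLASS_F1 = {
    "xgboost": {
        "benign_background": 0.998,
        "exfiltration_or_impact": 0.987,
        "initial_access": 0.720,
        "persistence_establishment": 0.703,
        "lateral_movement": 0.483,
    },
    "mlp": {
        "benign_background": 0.994,
        "exfiltration_or_impact": 0.981,
        "initial_access": 0.651,
        "persistence_establishment": 0.690,
        "lateral_movement": 0.451,
    },
}

def macro_f1(f1_by_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1_by_class.values()) / len(f1_by_class)

macro = {name: round(macro_f1(scores), 4) for name, scores in PER_CLASS_F1.items()}
print(macro)
```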

	### Ablation: which feature groups matter

	| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
	|---|---:|---:|---:|---:|---:|
	| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | n/a | n/a |
	| No `event_class` | 0.9206 | 0.5969 | 0.9723 | −0.0287 | −0.181 |
	| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
	| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
	| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
	| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
	| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
	| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
	| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
	| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |

	Three findings:

	1. `event_class` is the dominant signal: removing it drops macro-F1
	by 18pp. Without it, the model loses most of its discrimination
	among the three smaller classes.
	2. CVSS features are second-strongest (removing them drops macro-F1
	by 3pp). They capture severity information that complements
	`event_class`.
	3. Host and timing features contribute little. The model performs
	marginally better without host features (+0.3pp accuracy), and
	removing timing features changes accuracy by less than 0.1pp. Both
	groups are kept in the pipeline as a documented baseline reference.
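The deltas and the ranking behind these findings can be reproduced directly from the ablation table. The snippet below copies the accuracy and macro-F1 values from that table, subtracts the full-feature baseline, and sorts the feature groups by macro-F1 change, most harmful removal first.

```python
# (accuracy, macro-F1) for the published full feature set.
FULL = (0.9493, 0.7781)

# (accuracy, macro-F1) with one feature group removed, from the ablation table.
ABLATED = {
    "event_class": (0.9206, 0.5969),
    "CVSS features": (0.9383, 0.7475),
    "log_source_type": (0.9469, 0.7655),
    "engineered features": (0.9471, 0.7655),
    "ports": (0.9463, 0.7621),
    "severity_level": (0.9479, 0.7688),
    "tamper flags": (0.9469, 0.7657),
    "timing": (0.9501, 0.7730),
    "host features": (0.9522, 0.7828),
}

deltas = {
    group: (round(acc - FULL[0], 4), round(f1 - FULL[1], 4))
    for group, (acc, f1) in ABLATED.items()
}

# Most negative macro-F1 change first.
ranked = sorted(deltas, key=lambda group: deltas[group][1])
for group in ranked:
    d_acc, d_f1 = deltas[group]
    print(f"{group:<20} d_acc={d_acc:+.4f}  d_f1={d_f1:+.4f}")
```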

	### Architecture

	XGBoost: multi-class gradient boosting (`multi:softprob`, 5 classes),
	`hist` tree method, class-balanced sample weights, early stopping on
	validation mlogloss.

	MLP: `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
	→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
	early stopping on validation macro-F1.
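For readers who want to see the MLP shape as code, the following is a minimal PyTorch sketch of the `87 → 128 → 64 → 5` layout with the layer order described above. It is a reconstruction from this description only: `build_mlp` is a hypothetical name, and the parameter names produced by `nn.Sequential` are not guaranteed to match the keys stored in `model_mlp.safetensors`, so loading the published weights may require remapping.

```python
import torch
from torch import nn

def build_mlp(n_features=87, hidden=(128, 64), n_classes=5, dropout=0.3):
    """Linear -> BatchNorm1d -> ReLU -> Dropout per hidden layer, then a linear head."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))  # raw logits for the 5 phases
    return nn.Sequential(*layers)

model = build_mlp()
model.eval()  # use BatchNorm running stats and disable dropout for inference

with torch.no_grad():
    logits = model(torch.zeros(4, 87))  # dummy batch of 4 events
    probs = torch.softmax(logits, dim=1)

n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)
```

The head returns logits, not probabilities, which is the form `nn.CrossEntropyLoss` (with its optional per-class `weight` argument) expects during training.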

	Training hyperparameters are held internally by XpertSystems.

	## Limitations

	This is a baseline reference, not a production phase classifier.

	1. The leakage diagnostic is required reading. Six oracle columns
	for the phase task and seven for the alert TP task are documented
	in `leakage_diagnostic.json`. If you use CYB010 sample data for
	your own training, you MUST drop these or your model will learn
	the oracles instead of the task.

	2. `lateral_movement` F1 0.48 is the weakest class. The 968-event
	sample with substantial overlap to `initial_access` makes this
	class hard. A sequence model that considers event ordering within
	incidents would likely do better than per-event classification.

	3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable
	on this sample** (acc 0.55 vs majority 0.61). The 5-class
	formulation with benign included works only because benign_user
	separation is structurally trivial.

	4. `event_class` 12-class is unlearnable on this sample (acc 0.35
	vs majority 0.42). event_class is a structural property of the
	event itself, not something to predict from other features.

	5. Synthetic-vs-real transfer. The dataset is synthetic, calibrated
	to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE
	ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise
	characteristics — and in particular, the explicit `mitre_tactic ==
	"benign"` marker and `threat_actor_id == "NONE"` benign sentinel
	would not be present in real data. Real telemetry has implicit
	benign-vs-malicious distinctions that emerge from event content.
	Do not assume metrics transfer end-to-end.

	6. 21,896 events / 500 incidents is a modest training set. The
	3,726-event / ~75-incident test fold yields stable multi-seed
	metrics (std 0.007 on accuracy) but per-class confidence intervals
	widen for the smallest classes (lateral_movement, persistence).

	## Notes on dataset schema

	The CYB010 sample dataset README describes some fields differently
	from the actual schema. The model was trained on the actual schema;
	this note helps buyers reconcile what they read with what they receive.

	| What the README says | What the data actually contains |
	|---|---|
	| `security_events` has 16 columns | Data has 23 columns |
	| Field renames | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
	| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
	| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
	| `alert_records` has 9 columns | Data has 21 columns |
	| Field renames | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
	| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
	| README's `ioc_matched` | Not present in `alert_records` |
	| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
	| `incident_summary` has 8 columns | Data has 24 columns |
	| `host_inventory` has 6 columns | Data has 15 columns |
	| `threat_actor_profile` has 4 values | Data has 5 values (adds `benign_user` at 57% of events) |
	| `attack_lifecycle_phase` is a 5-phase malicious lifecycle | Data includes `benign_background` as a phase value (57% of events), so the 5 classes are 4 malicious phases plus benign |
	| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |

	None of these affects model correctness — the feature pipeline uses
	the actual column names. If you build your own pipeline against the
	dataset, use the actual columns.
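If you have code written against the README's field names, a small rename map is one way to bridge to the actual schema. The mappings below come from the "Field renames" rows of the table above; the helper `readme_to_actual` and the sample record (including its values) are hypothetical and only illustrate the idea.

```python
# README name -> actual column name, per the "Field renames" rows above.
EVENT_RENAMES = {
    "timestamp_utc": "timestamp",
    "user": "user_id",
    "log_format": "log_source_type",
}
ALERT_RENAMES = {
    "alert_severity": "severity_level",
    "detection_rule": "alert_rule_name",
}

def readme_to_actual(record, renames):
    """Return a copy of `record` with README-style keys renamed to actual columns."""
    return {renames.get(key, key): value for key, value in record.items()}

# Hypothetical record keyed by README-style names (values are made up).
readme_style = {
    "timestamp_utc": "2024-01-06T22:30:00",
    "user": "u-001",
    "log_format": "syslog",
    "dest_port": 443,
}
actual = readme_to_actual(readme_style, EVENT_RENAMES)
print(sorted(actual))
```

Keys that are not in the map pass through unchanged, so the helper is safe to apply to records that already use the actual column names.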

	## Intended use

	- Evaluating fit of the CYB010 dataset for your SIEM ML research
	- Baseline reference for new model architectures on the
	attack-phase classification task
	- Reference example of structural-leakage diagnostics for
	synthetic SIEM datasets — the methodology is reusable
	- Feature engineering reference for per-event SIEM telemetry

	## Out-of-scope use

	- Production SIEM phase detection on real telemetry
	- Threat actor attribution (4-class malicious-only is unlearnable
	on the sample)
	- Event-class prediction (this is a structural property, not a
	learnable target)
	- Any operational decision affecting actual security operations
	without further validation on your own data

	## Reproducibility

	Outputs above were produced with `seed = 42` (published artifact),
	nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published
	sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated
	2026-05-16). The feature pipeline in `feature_engineering.py` is
	deterministic and the trained weights in this repo correspond exactly
	to the metrics above.

	Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
	in `multi_seed_results.json` confirm robust performance across splits
	(std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std
	in the XpertSystems catalog).

	The training script itself is private to XpertSystems.

	## Files in this repo

	| File | Purpose |
	|---|---|
	| `model_xgb.json` | XGBoost weights (seed 42) |
	| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
	| `feature_engineering.py` | Feature pipeline |
	| `feature_meta.json` | Feature column order + categorical levels |
	| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
	| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
	| `ablation_results.json` | Per-feature-group ablation |
	| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
	| `leakage_diagnostic.json` | 11-oracle-path audit + 2 unlearnable targets |
	| `inference_example.ipynb` | End-to-end inference demo notebook |
	| `README.md` | This file |

	## Contact and full product

	The full CYB010 dataset contains ~550,000 rows across four files,
	with calibrated benchmark validation against 6 metrics drawn from
	authoritative SOC operations and threat intelligence sources (SANS SOC
	Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA
	Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

	The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
	Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
	& Energy.

	- 📧 pradeep@xpertsystems.ai
	- 🌐 https://xpertsystems.ai
	- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
	- 🤖 Companion models:
	  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
	  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
	  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
	  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
	  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
	  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
	  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
	  - https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
	  - https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)

	## Citation

	```bibtex
	@misc{xpertsystems_cyb010_baseline_2026,
	title = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
	author = {XpertSystems.ai},
	year = {2026},
	url = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
	note = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
	}
	```