CYB010 Baseline Classifier

Attack lifecycle phase classifier (5-class) trained on the CYB010 synthetic security event log sample. It predicts which of 5 attack phases (benign_background / initial_access / lateral_movement / persistence_establishment / exfiltration_or_impact) a security event belongs to, from per-event features. The repo also ships leakage_diagnostic.json, which documents 11 oracle paths discovered across the dataset's targets, plus 2 README-suggested targets that are unlearnable on the sample after honest leak removal.

Read this first. This repo ships two related artifacts: (1) a working baseline classifier for attack_lifecycle_phase (the dataset's headline target), and (2) leakage_diagnostic.json documenting 11 separate oracle paths plus 2 unlearnable targets. Both files matter; the diagnostic is required reading for anyone evaluating CYB010 for SIEM ML work.

Model overview

Property	Value
Primary task	5-class `attack_lifecycle_phase` classification
Secondary artifact	`leakage_diagnostic.json` — 11 oracle paths + 2 unlearnable targets
Training data	`xpertsystems/cyb010-sample` (21,896 events / 500 incidents)
Models	XGBoost + PyTorch MLP
Input features	87 (after one-hot encoding)
Split	Group-aware (GroupShuffleSplit on `incident_id`)
Validation	Single seed (artifact) + multi-seed aggregate across 10 seeds
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline + comprehensive leakage diagnostic

Why this task — and what was dropped

The CYB010 README's central concept is the "5-phase attack lifecycle state machine", and attack_lifecycle_phase is the data's headline target. We piloted six candidate targets and found:

attack_lifecycle_phase 5-class: strongest honest result. Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes represented, per-class F1 range 0.48–1.00.
threat_actor_profile 5-class: reaches acc 0.84, but per-class F1 shows the result is almost entirely driven by benign_user separation (F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The 4-class malicious-only formulation falls below the majority-class baseline (acc 0.55 vs 0.61).
label_true_positive binary on alerts: documented as a secondary finding. It has 7 oracle features; honest acc is 0.80 and AUC 0.89 after dropping all of them.
mitre_tactic 14-class: reaches acc 0.90 but only macro-F1 0.37, so the accuracy comes from class imbalance (the benign class dominates at 57%).
event_class 12-class: unlearnable (acc 0.35 vs majority-class baseline 0.42).
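
The "below majority" and imbalance checks in the list above can be illustrated with a small sketch. The labels are toy data, not CYB010 records, and the helper names are hypothetical. It only shows that a constant majority-class predictor can score high accuracy while its macro-F1 collapses, which is why both numbers are reported for each pilot target.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / counts.sum()

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels: class 0 dominates at 90%, classes 1 and 2 share the rest.
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.zeros_like(y_true)  # constant majority-class prediction

baseline_acc = float(majority_baseline_accuracy(y_true))
accuracy = float(np.mean(y_pred == y_true))
toy_macro_f1 = macro_f1(y_true, y_pred)
print(baseline_acc, accuracy, round(toy_macro_f1, 3))
```

A candidate target is only worth keeping when its accuracy clears the majority baseline and its macro-F1 stays reasonable across classes.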

Six oracle columns dropped from the phase task

CYB010 encodes the benign vs malicious distinction explicitly in multiple columns. Each is a perfect or near-perfect oracle for the benign_background phase:

Column	Oracle relationship
`mitre_tactic`	`=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect)
`mitre_technique_id`	Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic)
`label_malicious`	`==False` ↔ `benign_background` (perfect)
`threat_actor_id`	`=="NONE"` ↔ `benign_background` (perfect)
`threat_actor_profile`	`=="benign_user"` ↔ `benign_background` (perfect)
`event_type`	Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`)

With these six columns present, a plain XGBoost trivially separates benign vs malicious. The published baseline trains with all six excluded.
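
As a rough sketch of how the exclusion could look in your own pipeline: the column list is taken from the table above, while the helper names and the toy rows (including the `TA-01` style actor IDs) are hypothetical. The bundled feature_engineering.py remains the canonical recipe.

```python
import pandas as pd

# The six columns listed in the table above.
ORACLE_COLUMNS = [
    "mitre_tactic", "mitre_technique_id", "label_malicious",
    "threat_actor_id", "threat_actor_profile", "event_type",
]

def drop_oracle_columns(events: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `events` without the phase-task oracle columns."""
    return events.drop(columns=ORACLE_COLUMNS, errors="ignore")

def is_perfect_oracle(events, column, value, target, target_value):
    """True if `column == value` holds exactly when `target == target_value`."""
    matches = (events[column] == value) == (events[target] == target_value)
    return bool(matches.all())

# Toy rows (hypothetical values) mimicking the benign sentinel pattern.
toy = pd.DataFrame({
    "threat_actor_id": ["NONE", "NONE", "TA-01", "TA-02"],
    "attack_lifecycle_phase": ["benign_background", "benign_background",
                               "initial_access", "lateral_movement"],
    "dest_port": [443, 53, 3389, 445],
})
leaks = is_perfect_oracle(toy, "threat_actor_id", "NONE",
                          "attack_lifecycle_phase", "benign_background")
clean = drop_oracle_columns(toy)
print(leaks, list(clean.columns))
```

Running a check like `is_perfect_oracle` before training is a cheap way to confirm that a suspicious column really is a label in disguise.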

Two model artifacts are published. They are designed to be used together:

model_xgb.json — gradient-boosted trees (slightly higher F1)
model_mlp.safetensors — PyTorch MLP

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

from huggingface_hub import hf_hub_download, snapshot_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file

REPO = "xpertsystems/cyb010-baseline-classifier"

paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)

meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern)
# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type - those were the
# oracle columns.
# my_event is a placeholder for your own event record; it is not defined
# in this snippet.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB010, 21,896 per-event records:

Phase	Events	Class share
`benign_background`	12,448	56.9%
`exfiltration_or_impact`	6,205	28.3%
`initial_access`	1,674	7.6%
`lateral_movement`	968	4.4%
`persistence_establishment`	601	2.7%

Group-aware split by incident_id

500 incidents × ~44 events each. Events from the same incident share host, threat actor, and phase trajectory — so train/test contamination is a real risk with random splitting. The baseline uses GroupShuffleSplit on incident_id (nested 70/15/15):

Fold	Events	Incidents
Train	14,697	~350
Validation	3,473	~75
Test	3,726	~75
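
A minimal sketch of a nested group-aware split, assuming "nested 70/15/15" means a first GroupShuffleSplit that holds out 30% of incidents and a second one that halves the held-out part. The function name and the toy incident IDs are hypothetical; the actual training script is not published.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def nested_group_split(groups, seed=42):
    """Split row indices 70/15/15 so that no group spans two folds."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))
    # Stage 1: hold out 30% of groups for validation + test.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, rest_idx = next(outer.split(idx, groups=groups))
    # Stage 2: split the held-out groups in half.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=seed)
    val_pos, test_pos = next(inner.split(rest_idx, groups=groups[rest_idx]))
    return train_idx, rest_idx[val_pos], rest_idx[test_pos]

# Toy data: 100 hypothetical incidents with 10 events each.
incident_id = np.repeat(np.arange(100), 10)
train_idx, val_idx, test_idx = nested_group_split(incident_id)

train_groups = set(incident_id[train_idx])
val_groups = set(incident_id[val_idx])
test_groups = set(incident_id[test_idx])
print(len(train_groups), len(val_groups), len(test_groups))
```

Because the proportions apply to incidents rather than events, the event counts per fold vary with incident size, which is consistent with the approximate incident counts in the table.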

All 10 multi-seed evaluations yielded all 5 classes in the test fold. Class imbalance is addressed with balanced class weights, applied as per-sample weights (sample_weight) for XGBoost and as weighted cross-entropy for the MLP.
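
The balanced weighting can be sketched as follows, assuming the usual "balanced" heuristic n_samples / (n_classes * class_count). The function name is hypothetical, and the counts are the full-sample phase counts from the table above, whereas a real run would compute weights on the training fold only.

```python
import numpy as np

def balanced_sample_weights(y):
    """Per-sample weights n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    lookup = dict(zip(classes, class_weight))
    return np.array([lookup[label] for label in y])

# Phase counts from the training-data table above.
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
y = np.concatenate([np.full(n, name) for name, n in phase_counts.items()])
weights = balanced_sample_weights(y)

for name in phase_counts:
    print(name, round(float(weights[y == name][0]), 3))
```

With these weights every class contributes the same total weight, so the 601 persistence_establishment events count as much in the loss as the 12,448 benign ones. For XGBoost the array would be passed as `sample_weight` to `fit`.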

Feature pipeline

The bundled feature_engineering.py is the canonical recipe. 87 features survive after encoding, drawn from:

Per-event numeric (5): source_port, dest_port, cvss_score_analogue, label_log_tampered, label_false_positive
Per-event categorical (3, one-hot): event_class (12 values), log_source_type (8 values), severity_level (5 values)
Host features (joined from host_inventory.csv): 3 numeric + 7 categorical (os_type, host_role, network_segment, defender_posture, criticality_rating, cloud_provider, siem_platform)
Engineered (9): hour_of_day, is_off_hours, is_weekend, log_cvss, is_high_cvss, is_well_known_port, is_dynamic_port, is_outbound_web, risk_composite
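
The sketch below shows what several of the engineered features might look like. Every threshold here is an assumption (the off-hours window, the 7.0 CVSS cut-off, and applying the port flags to dest_port), and is_outbound_web and risk_composite are omitted because their definitions are not given in this card. Consult feature_engineering.py for the definitions the model was actually trained with.

```python
import numpy as np
import pandas as pd

def add_engineered_features(events: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative time, CVSS, and port features (thresholds assumed)."""
    out = events.copy()
    ts = pd.to_datetime(out["timestamp"], utc=True)
    out["hour_of_day"] = ts.dt.hour
    # Assumption: off-hours means before 08:00 or from 18:00 onward (UTC).
    out["is_off_hours"] = ((ts.dt.hour < 8) | (ts.dt.hour >= 18)).astype(int)
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    out["log_cvss"] = np.log1p(out["cvss_score_analogue"])
    # Assumption: "high" follows the CVSS v3 convention of 7.0 and above.
    out["is_high_cvss"] = (out["cvss_score_analogue"] >= 7.0).astype(int)
    # IANA ranges: well-known ports 0-1023, dynamic ports 49152-65535.
    out["is_well_known_port"] = (out["dest_port"] < 1024).astype(int)
    out["is_dynamic_port"] = (out["dest_port"] >= 49152).astype(int)
    return out

# Two toy events: a Saturday-night HTTPS event and a Wednesday-afternoon one.
toy = pd.DataFrame({
    "timestamp": ["2026-05-16T03:15:00Z", "2026-05-13T14:00:00Z"],
    "cvss_score_analogue": [9.1, 2.0],
    "dest_port": [443, 50000],
})
features = add_engineered_features(toy)
print(features.iloc[0].to_dict())
```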

Partial-oracle features kept as legitimate observables

event_class (max purity 0.87, mean 0.72 across phases) is the strongest non-oracle feature. C2 beacon traffic (event_class = network_flow) is 65% exfiltration phase but also 29% benign and 6% other phases — real overlap, not deterministic encoding. Kept.

severity_level and cvss_score_analogue correlate strongly with phase (high-severity events skew toward exfil and initial_access) but with substantial overlap. Kept.

label_log_tampered is a real observable — APTs tamper more than script_kiddies — but is not phase-deterministic. Kept.
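
One way to distinguish a partial oracle from a perfect one is to measure purity per feature value. The sketch below assumes purity means "share of the most common phase among events with that value"; the card does not spell out the exact definition behind the 0.87 / 0.72 figures, and the `auth` event class and toy rows are hypothetical.

```python
import pandas as pd

def value_purity(events, feature, target="attack_lifecycle_phase"):
    """For each feature value, the share of its most common target class."""
    shares = pd.crosstab(events[feature], events[target], normalize="index")
    return shares.max(axis=1)

# Toy rows: network_flow is mixed, the hypothetical auth class is pure.
toy = pd.DataFrame({
    "event_class": ["network_flow"] * 4 + ["auth"] * 2,
    "attack_lifecycle_phase": [
        "exfiltration_or_impact", "exfiltration_or_impact",
        "exfiltration_or_impact", "benign_background",
        "initial_access", "initial_access",
    ],
})
purity = value_purity(toy, "event_class")
print(purity.to_dict())
```

A value with purity 1.0 behaves like an oracle for its phase; values well below 1.0 indicate real overlap, which is the argument for keeping a feature.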

Evaluation

Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

XGBoost (the published model_xgb.json artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9904
Accuracy	0.9493
Macro-F1	0.7781
Weighted-F1	0.9478

MLP (the published model_mlp.safetensors artifact)

Metric	Value
Macro ROC-AUC (OvR)	0.9861
Accuracy	0.9412
Macro-F1	0.7534
Weighted-F1	0.9396

XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941, macro-F1 0.778 vs 0.753). The gap is consistent across seeds.

Multi-seed robustness (XGBoost, 10 seeds)

Metric	Mean	Std	Min	Max
Accuracy	0.936	0.007	0.923	0.949
Macro-F1	0.759	0.015	0.741	0.781
Macro ROC-AUC OvR	0.988	0.001	0.986	0.990

The ROC-AUC standard deviation (0.001) is the tightest in the catalog. All 10 seeds yielded all 5 classes in the test fold. Full per-seed results are in multi_seed_results.json.
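
For readers who want to reproduce the aggregate columns from their own runs, a minimal sketch follows. The per-seed records are hypothetical placeholders rather than the contents of multi_seed_results.json (whose schema is not described here), and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def aggregate_seeds(per_seed, metric):
    """Mean, std, min, and max of one metric across seed runs."""
    values = np.array([run[metric] for run in per_seed], dtype=float)
    return {
        "mean": float(values.mean()),
        "std": float(values.std(ddof=1)),  # sample std; ddof is an assumption
        "min": float(values.min()),
        "max": float(values.max()),
    }

# Hypothetical per-seed records, for illustration only.
per_seed = [
    {"seed": 42, "accuracy": 0.95},
    {"seed": 7, "accuracy": 0.93},
    {"seed": 13, "accuracy": 0.94},
]
summary = aggregate_seeds(per_seed, "accuracy")
print(summary)
```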

Per-class F1 (seed 42)

Phase	Class share	XGBoost F1	MLP F1
`benign_background`	56.9%	0.998	0.994
`exfiltration_or_impact`	28.3%	0.987	0.981
`initial_access`	7.6%	0.720	0.651
`persistence_establishment`	2.7%	0.703	0.690
`lateral_movement`	4.4%	0.483	0.451

The two largest classes (benign_background and exfiltration_or_impact) are nearly perfectly separable — benign_background because the non-oracle features (severity, CVSS, log_source) still cleanly separate non-malicious traffic, and exfiltration_or_impact because it's dominated by network_flow events (C2 beacons). The three middle classes overlap substantially in feature space; lateral_movement is the hardest (F1 0.48) because lateral movement events look similar to initial_access events at the per-event level. A sequence model that considers event ordering within an incident would likely do better than the per-event baseline.
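
As a quick consistency check, macro-F1 is the unweighted mean of the per-class F1 values, so the headline macro-F1 figures can be recomputed from the per-class table. The small difference for XGBoost comes from the per-class values being rounded to three decimals.

```python
# Per-class F1 values copied from the seed-42 table above.
xgb_f1 = {
    "benign_background": 0.998,
    "exfiltration_or_impact": 0.987,
    "initial_access": 0.720,
    "persistence_establishment": 0.703,
    "lateral_movement": 0.483,
}
mlp_f1 = {
    "benign_background": 0.994,
    "exfiltration_or_impact": 0.981,
    "initial_access": 0.651,
    "persistence_establishment": 0.690,
    "lateral_movement": 0.451,
}

# Macro-F1 gives every class equal weight regardless of its share.
macro_xgb = sum(xgb_f1.values()) / len(xgb_f1)
macro_mlp = sum(mlp_f1.values()) / len(mlp_f1)
print(round(macro_xgb, 4), round(macro_mlp, 4))
```

This is also why macro-F1 (about 0.78) sits far below accuracy (about 0.95): the three small classes pull the unweighted mean down even though they cover under 15% of events.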

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	ROC-AUC	Δ accuracy	Δ macro-F1
Full feature set (published)	0.9493	0.7781	0.9904	—	—
No `event_class`	0.9206	0.5969	0.9723	−0.0287	−0.181
No CVSS features	0.9383	0.7475	0.9812	−0.0110	−0.031
No `log_source_type`	0.9469	0.7655	0.9902	−0.0024	−0.013
No engineered features	0.9471	0.7655	0.9903	−0.0022	−0.013
No ports	0.9463	0.7621	0.9903	−0.0030	−0.016
No `severity_level`	0.9479	0.7688	0.9902	−0.0014	−0.009
No tamper flags	0.9469	0.7657	0.9905	−0.0024	−0.012
No timing	0.9501	0.7730	0.9907	+0.0008	−0.005
No host features	0.9522	0.7828	0.9917	+0.0029	+0.005

Three findings:

event_class is the dominant signal (macro-F1 drops 18pp when it is removed). Without it, phase prediction loses most of its discrimination between the middle classes.
CVSS features are second-strongest (macro-F1 drops 3pp). They capture severity information that complements event_class.
Host features and timing add modest noise. The model performs marginally better without host features (+0.3pp accuracy), and timing features contribute essentially nothing. Both are kept in the pipeline as a documented baseline reference.
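
The delta columns and the ranking behind these findings can be recomputed directly from the ablation table. The dictionary keys below are shorthand labels chosen for this sketch, not identifiers from ablation_results.json.

```python
# Accuracy and macro-F1 copied from the ablation table above.
FULL = {"accuracy": 0.9493, "macro_f1": 0.7781}
ABLATIONS = {
    "event_class": {"accuracy": 0.9206, "macro_f1": 0.5969},
    "cvss": {"accuracy": 0.9383, "macro_f1": 0.7475},
    "log_source_type": {"accuracy": 0.9469, "macro_f1": 0.7655},
    "engineered": {"accuracy": 0.9471, "macro_f1": 0.7655},
    "ports": {"accuracy": 0.9463, "macro_f1": 0.7621},
    "severity_level": {"accuracy": 0.9479, "macro_f1": 0.7688},
    "tamper_flags": {"accuracy": 0.9469, "macro_f1": 0.7657},
    "timing": {"accuracy": 0.9501, "macro_f1": 0.7730},
    "host": {"accuracy": 0.9522, "macro_f1": 0.7828},
}

# Negative delta: removing the group hurts. Positive: removing it helps.
delta_f1 = {
    name: round(metrics["macro_f1"] - FULL["macro_f1"], 4)
    for name, metrics in ABLATIONS.items()
}
ranked = sorted(delta_f1, key=delta_f1.get)  # most harmful removal first

for name in ranked:
    print(f"{name:16s} {delta_f1[name]:+.4f}")
```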

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 5 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 87 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
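
A minimal PyTorch sketch of the MLP layout described above. The layer order follows the text (Linear, then BatchNorm1d, ReLU, Dropout(0.3)); the function name `build_mlp`, the learning rate, and the uniform class weights are placeholders, and the parameter names of this module are not guaranteed to match the keys stored in model_mlp.safetensors.

```python
import torch
from torch import nn

def build_mlp(n_features=87, n_classes=5, dropout=0.3):
    """87 -> 128 -> 64 -> 5 with BatchNorm1d, ReLU, and Dropout per hidden layer."""
    def block(n_in, n_out):
        return [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                nn.ReLU(), nn.Dropout(dropout)]
    return nn.Sequential(
        *block(n_features, 128),
        *block(128, 64),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied separately
    )

model = build_mlp()

# Weighted cross-entropy and AdamW as named in the text.
# Assumption: uniform weights and lr=1e-3 are placeholders, not published values.
criterion = nn.CrossEntropyLoss(weight=torch.ones(5))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Inference on a dummy batch of 4 standardized feature vectors.
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 87))
    proba = torch.softmax(logits, dim=1)
print(tuple(proba.shape))
```

Calling `model.eval()` before inference matters here because both BatchNorm1d and Dropout behave differently in training mode.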

Training hyperparameters are held internally by XpertSystems.

Limitations

This is a baseline reference, not a production phase classifier.

The leakage diagnostic is required reading. Six oracle columns for the phase task and seven for the alert TP task are documented in leakage_diagnostic.json. If you use CYB010 sample data for your own training, you MUST drop these or your model will learn the oracles instead of the task.
lateral_movement is the weakest class (F1 0.48). With only 968 events in the sample and substantial overlap with initial_access, this class is hard to separate. A sequence model that considers event ordering within incidents would likely do better than per-event classification.
threat_actor_profile 4-class (malicious-only) is unlearnable on this sample (acc 0.55 vs majority 0.61). The 5-class formulation with benign included works only because benign_user separation is structurally trivial.
event_class 12-class is unlearnable on this sample (acc 0.35 vs majority 0.42). event_class is a structural property of the event itself, not something to predict from other features.
Synthetic-vs-real transfer. The dataset is synthetic, calibrated to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise characteristics — and in particular, the explicit mitre_tactic == "benign" marker and threat_actor_id == "NONE" benign sentinel would not be present in real data. Real telemetry has implicit benign-vs-malicious distinctions that emerge from event content. Do not assume metrics transfer end-to-end.
21,896 events / 500 incidents is a modest training set. The 3,726-event / ~75-incident test fold yields stable multi-seed metrics (std 0.007 on accuracy), but per-class confidence intervals widen for the smallest classes (lateral_movement, persistence_establishment).

Notes on dataset schema

The CYB010 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
`security_events` has 16 columns	Data has 23 columns
Field renames	`timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type`
README missing from `security_events`	`event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented
README claims `command_line` / `process_name` / `is_off_hours` columns	Not present in `security_events` (off-hours derived from timestamp in pipeline)
`alert_records` has 9 columns	Data has 21 columns
Field renames	`alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name`
README's `triage_outcome` (categorical)	Replaced by `label_true_positive` / `label_false_positive` (mirror booleans)
README's `ioc_matched`	Not present in `alert_records`
README missing from `alert_records`	`correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented
`incident_summary` has 8 columns	Data has 24 columns
`host_inventory` has 6 columns	Data has 15 columns
`threat_actor_profile` has 4 values	Data has 5 values (adds `benign_user` at 57% of events)
`attack_lifecycle_phase` 5-phase malicious lifecycle	Data adds `benign_background` as a phase value (57% of events) — so the lifecycle is 5-class with benign included
README says MITRE ATT&CK v14 with 50 techniques	Data has 54 unique technique IDs across 14 tactics + benign

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
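
If you build your own pipeline from the README's column names, a small reconciliation helper can catch mismatches early. The rename mappings below come from the table above, read as README name → actual name; the function name and the toy frame are hypothetical.

```python
import pandas as pd

# README name -> actual column name, per the table above.
README_TO_ACTUAL = {
    "security_events": {
        "timestamp_utc": "timestamp",
        "user": "user_id",
        "log_format": "log_source_type",
    },
    "alert_records": {
        "alert_severity": "severity_level",
        "detection_rule": "alert_rule_name",
    },
}

def missing_columns(frame, table, readme_columns):
    """README-named columns with no counterpart in the loaded table."""
    renames = README_TO_ACTUAL.get(table, {})
    return [c for c in readme_columns if renames.get(c, c) not in frame.columns]

# Toy frame with a few of the actual security_events column names.
toy_events = pd.DataFrame(
    columns=["timestamp", "user_id", "log_source_type", "event_class"]
)
missing = missing_columns(
    toy_events, "security_events",
    ["timestamp_utc", "user", "log_format", "command_line"],
)
print(missing)
```

Renamed columns resolve through the mapping, while columns the README lists but the data lacks (such as command_line) are reported as missing.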

Intended use

Evaluating fit of the CYB010 dataset for your SIEM ML research
Baseline reference for new model architectures on the attack-phase classification task
Reference example of structural-leakage diagnostics for synthetic SIEM datasets — the methodology is reusable
Feature engineering reference for per-event SIEM telemetry

Out-of-scope use

Production SIEM phase detection on real telemetry
Threat actor attribution (4-class malicious-only is unlearnable on the sample)
Event-class prediction (this is a structural property, not a learnable target)
Any operational decision affecting actual security operations without further validation on your own data

Reproducibility

Outputs above were produced with seed = 42 (published artifact), nested GroupShuffleSplit on incident_id (70/15/15), on the published sample (xpertsystems/cyb010-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits (std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std in the XpertSystems catalog).

The training script itself is private to XpertSystems.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights (seed 42)
`model_mlp.safetensors`	PyTorch MLP weights (seed 42)
`feature_engineering.py`	Feature pipeline
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation
`multi_seed_results.json`	XGBoost metrics across 10 seeds
`leakage_diagnostic.json`	11-oracle-path audit + 2 unlearnable targets
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file

Contact and full product

The full CYB010 dataset contains ~550,000 rows across four files, with calibrated benchmark validation against 6 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
- https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)

Citation

@misc{xpertsystems_cyb010_baseline_2026,
  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}
