CYB010 Baseline Classifier
Attack lifecycle phase classifier (5-class) trained on the CYB010
synthetic security event log sample. Predicts which of 5 attack phases
(benign_background / initial_access / lateral_movement /
persistence_establishment / exfiltration_or_impact) a security
event belongs to, from per-event features. ALSO ships a comprehensive
leakage_diagnostic.json documenting 11 oracle paths discovered
across the dataset's targets and 2 README-suggested targets that are
unlearnable on the sample after honest leak removal.
Read this first. This repo ships two related artifacts: (1) a working baseline classifier for
attack_lifecycle_phase (the dataset's headline target), and (2) leakage_diagnostic.json, documenting 11 separate oracle paths plus 2 unlearnable targets. Both files matter; the diagnostic is required reading for anyone evaluating CYB010 for SIEM ML work.
Model overview
| Property | Value |
|---|---|
| Primary task | 5-class attack_lifecycle_phase classification |
| Secondary artifact | leakage_diagnostic.json — 11 oracle paths + 2 unlearnable targets |
| Training data | xpertsystems/cyb010-sample (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | Group-aware (GroupShuffleSplit on incident_id) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |
Why this task — and what was dropped
The CYB010 README's central concept is the "5-phase attack lifecycle
state machine", and attack_lifecycle_phase is the data's headline
target. We piloted six candidate targets and found:
- attack_lifecycle_phase 5-class: strongest honest result. Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes represented, per-class F1 range 0.48–1.00.
- threat_actor_profile 5-class: works at acc 0.84, but per-class F1 reveals it's almost entirely driven by benign_user separation (F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The 4-class malicious-only formulation is below majority (acc 0.55 vs 0.61).
- label_true_positive binary on alerts: documented as a secondary finding. Has 7 oracle features; honest acc 0.80, AUC 0.89 after dropping all of them.
- mitre_tactic 14-class: hits acc 0.90 but macro-F1 0.37, which is imbalance gaming (benign class dominates at 57%).
- event_class 12-class: unlearnable (acc 0.35 vs majority 0.42).
Six oracle columns dropped from the phase task
CYB010 encodes the benign vs malicious distinction explicitly in
multiple columns. Each is a perfect or near-perfect oracle for the
benign_background phase:
| Column | Oracle relationship |
|---|---|
| mitre_tactic | =="benign" ↔ benign_background phase (12,448/12,448, perfect) |
| mitre_technique_id | Perfect ATT&CK-by-design oracle for mitre_tactic (54/54 techniques → single tactic) |
| label_malicious | ==False ↔ benign_background (perfect) |
| threat_actor_id | =="NONE" ↔ benign_background (perfect) |
| threat_actor_profile | =="benign_user" ↔ benign_background (perfect) |
| event_type | Many values phase-specific (c2_beacon_outbound → 100% exfiltration_or_impact) |
With these six columns present, a plain XGBoost trivially separates benign vs malicious. The published baseline trains with all six excluded.
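A quick way to spot this kind of oracle in your own copy of the data is to measure, for each value of a column, how concentrated its rows are in a single phase. The sketch below is a minimal, dependency-free illustration of that check. The `value_purity` and `drop_oracles` helpers and the toy rows (including the `authentication` and `exfiltration` values) are assumptions made up for the example; only the six dropped column names come from the table above.

```python
from collections import Counter, defaultdict

# The six columns excluded from the phase task (from the table above).
ORACLE_COLUMNS = {
    "mitre_tactic", "mitre_technique_id", "label_malicious",
    "threat_actor_id", "threat_actor_profile", "event_type",
}

def value_purity(rows, column, target="attack_lifecycle_phase"):
    """For each value of `column`, return (dominant target label, its share)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    purity = {}
    for value, counts in by_value.items():
        label, hits = counts.most_common(1)[0]
        purity[value] = (label, hits / sum(counts.values()))
    return purity

def drop_oracles(row):
    """Return a copy of the event without the oracle columns."""
    return {k: v for k, v in row.items() if k not in ORACLE_COLUMNS}

# Toy rows, invented for illustration only (not taken from the sample).
rows = [
    {"mitre_tactic": "benign", "event_class": "network_flow",
     "attack_lifecycle_phase": "benign_background"},
    {"mitre_tactic": "benign", "event_class": "authentication",
     "attack_lifecycle_phase": "benign_background"},
    {"mitre_tactic": "exfiltration", "event_class": "network_flow",
     "attack_lifecycle_phase": "exfiltration_or_impact"},
    {"mitre_tactic": "exfiltration", "event_class": "network_flow",
     "attack_lifecycle_phase": "exfiltration_or_impact"},
]

tactic_purity = value_purity(rows, "mitre_tactic")  # purity 1.0 per value: oracle
class_purity = value_purity(rows, "event_class")    # mixed purity: real overlap
clean_event = drop_oracles(rows[0])
```

A column whose values all show purity at or near 1.0 is a candidate oracle; a column with mixed purity carries signal without encoding the label directly.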
Two model artifacts are published. They are designed to be used together:
- model_xgb.json: gradient-boosted trees (slightly higher F1)
- model_mlp.safetensors: PyTorch MLP
Quick start
pip install xgboost torch safetensors pandas huggingface_hub
from huggingface_hub import hf_hub_download, snapshot_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file
REPO = "xpertsystems/cyb010-baseline-classifier"
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}
import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)
meta = load_meta(paths["feature_meta.json"])
# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")
xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
# Predict (see inference_example.ipynb for the full pattern)
# Note: do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type - those were the
# oracle columns.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
See inference_example.ipynb for the full
copy-paste demo.
Training data
Trained on the public sample of CYB010, 21,896 per-event records:
| Phase | Events | Class share |
|---|---|---|
| benign_background | 12,448 | 56.9% |
| exfiltration_or_impact | 6,205 | 28.3% |
| initial_access | 1,674 | 7.6% |
| lateral_movement | 968 | 4.4% |
| persistence_establishment | 601 | 2.7% |
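The class shares above set the floor any honest result has to clear. As a worked example, the snippet below computes the accuracy and macro-F1 of a trivial classifier that always predicts the majority phase, using the event counts from the table. The numbers are for the full sample, so the per-fold floor will differ slightly.

```python
# Event counts per phase, copied from the table above.
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}

total = sum(phase_counts.values())
majority_phase = max(phase_counts, key=phase_counts.get)

# Always predicting the majority phase is right exactly as often as it occurs.
majority_accuracy = phase_counts[majority_phase] / total

# For that classifier: recall = 1 and precision = class share on the majority
# class, and F1 = 0 on every other class.
precision = majority_accuracy
majority_f1 = 2 * precision * 1.0 / (precision + 1.0)
majority_macro_f1 = majority_f1 / len(phase_counts)

print(f"majority accuracy: {majority_accuracy:.3f}")
print(f"majority macro-F1: {majority_macro_f1:.3f}")
```

The gap between the two floors is why macro-F1 is reported alongside accuracy throughout this card.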
Group-aware split by incident_id
500 incidents × ~44 events each. Events from the same incident share
host, threat actor, and phase trajectory — so train/test contamination
is a real risk with random splitting. The baseline uses
GroupShuffleSplit on incident_id (nested 70/15/15):
| Fold | Events | Incidents |
|---|---|---|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |
All 10 multi-seed evaluations yielded all 5 classes in the test fold.
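The idea behind the group-aware split is that whole incidents, not individual events, are assigned to a fold. The sketch below is a dependency-free stand-in for that idea, not the scikit-learn GroupShuffleSplit call used for the baseline, so it will not reproduce the published folds. The `group_split` helper and the `INC-0000` style incident ids are hypothetical.

```python
import random

def group_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to train/val/test; returns three sets of group ids."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test

# Hypothetical ids: 500 incidents with 44 events each (illustrative only).
incident_ids = [f"INC-{i:04d}" for i in range(500) for _ in range(44)]
train_g, val_g, test_g = group_split(incident_ids)

# Every event follows its incident, so no incident straddles two folds.
fold_of_event = [
    "train" if g in train_g else "val" if g in val_g else "test"
    for g in incident_ids
]
```

Because real incidents vary in length, event counts per fold only approximate 70/15/15 even when incident counts match exactly, which is consistent with the table above.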
Class imbalance is addressed with balanced class weights, passed to
XGBoost as per-row sample_weight, and with weighted cross-entropy (MLP).
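For reference, the "balanced" heuristic used by scikit-learn weights each class by n_samples / (n_classes × class_count). The sketch below applies that formula to the full-sample phase counts. Whether the baseline computed weights on the train fold with exactly this formula is an assumption, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(counts):
    """scikit-learn style 'balanced' weights: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Full-sample event counts per phase (from the Training data table).
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
weights = balanced_class_weights(phase_counts)

# Per-row weights for a few example labels, in the shape a trainer that
# accepts a sample_weight vector would expect.
labels = ["benign_background", "lateral_movement", "persistence_establishment"]
sample_weight = [weights[y] for y in labels]
```

With these counts the rarest phase is weighted roughly twenty times more heavily than benign_background, which is what keeps the minority phases from being ignored during training.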
Feature pipeline
The bundled feature_engineering.py is the canonical recipe. 87
features survive after encoding, drawn from:
- Per-event numeric (5): source_port, dest_port, cvss_score_analogue, label_log_tampered, label_false_positive
- Per-event categorical (3, one-hot): event_class (12 values), log_source_type (8 values), severity_level (5 values)
- Host features (joined from host_inventory.csv): 3 numeric + 7 categorical (os_type, host_role, network_segment, defender_posture, criticality_rating, cloud_provider, siem_platform)
- Engineered (9): hour_of_day, is_off_hours, is_weekend, log_cvss, is_high_cvss, is_well_known_port, is_dynamic_port, is_outbound_web, risk_composite
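To give a feel for the engineered group, here is a rough sketch of how features with these names could be derived from a single event. The authoritative definitions live in feature_engineering.py; every threshold below (the 08:00 to 18:00 working day, CVSS ≥ 7.0 as "high", ports 80/443 as web, `log1p` for log_cvss) is an assumption for illustration, and risk_composite is omitted because its formula is not described here. The port boundaries follow the IANA ranges (well-known 0–1023, dynamic 49152–65535).

```python
import math
from datetime import datetime

def engineer(event):
    """Derive illustrative per-event features; thresholds are assumptions."""
    ts = datetime.fromisoformat(event["timestamp"])
    cvss = float(event["cvss_score_analogue"])
    port = int(event["dest_port"])
    return {
        "hour_of_day": ts.hour,
        "is_off_hours": int(ts.hour < 8 or ts.hour >= 18),  # assumed workday
        "is_weekend": int(ts.weekday() >= 5),               # Sat=5, Sun=6
        "log_cvss": math.log1p(cvss),                       # assumed transform
        "is_high_cvss": int(cvss >= 7.0),                   # assumed cut-off
        "is_well_known_port": int(port < 1024),             # IANA 0-1023
        "is_dynamic_port": int(port >= 49152),              # IANA 49152-65535
        "is_outbound_web": int(port in (80, 443)),          # assumed definition
    }

# Hypothetical event using the actual column names from the schema notes.
example = {
    "timestamp": "2026-05-16T23:30:00",
    "cvss_score_analogue": 8.1,
    "dest_port": 443,
}
features = engineer(example)
```

For real inference, call transform_single from the bundled pipeline instead, so that feature order and encodings match the trained models.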
Partial-oracle features kept as legitimate observables
event_class (max purity 0.87, mean 0.72 across phases) is the
strongest non-oracle feature. C2 beacon traffic (event_class = network_flow) is 65% exfiltration phase but also 29% benign and 6%
other phases — real overlap, not deterministic encoding. Kept.
severity_level and cvss_score_analogue correlate strongly with
phase (high-severity events skew toward exfil and initial_access) but
with substantial overlap. Kept.
label_log_tampered is a real observable — APTs tamper more than
script_kiddies — but is not phase-deterministic. Kept.
Evaluation
Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)
XGBoost (the published model_xgb.json artifact)
| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9904 |
| Accuracy | 0.9493 |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |
MLP (the published model_mlp.safetensors artifact)
| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9861 |
| Accuracy | 0.9412 |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |
XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941, macro-F1 0.778 vs 0.753). The gap is consistent across seeds.
Multi-seed robustness (XGBoost, 10 seeds)
| Metric | Mean | Std | Min | Max |
|---|---|---|---|---|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |
Tightest ROC-AUC std in the catalog (0.001). All 10 seeds yielded
all 5 classes in the test fold. Full per-seed results in
multi_seed_results.json.
Per-class F1 (seed 42)
| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---|---|---|
| benign_background | 56.9% | 0.998 | 0.994 |
| exfiltration_or_impact | 28.3% | 0.987 | 0.981 |
| initial_access | 7.6% | 0.720 | 0.651 |
| persistence_establishment | 2.7% | 0.703 | 0.690 |
| lateral_movement | 4.4% | 0.483 | 0.451 |
The two largest classes (benign_background and exfiltration_or_impact)
are nearly perfectly separable — benign_background because the
non-oracle features (severity, CVSS, log_source) still cleanly separate
non-malicious traffic, and exfiltration_or_impact because it's
dominated by network_flow events (C2 beacons). The three middle
classes overlap substantially in feature space; lateral_movement is
the hardest (F1 0.48) because lateral movement events look similar to
initial_access events at the per-event level. A sequence model that
considers event ordering within an incident would likely do better
than the per-event baseline.
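The per-class table also explains the gap between accuracy (about 0.95) and macro-F1 (about 0.78): macro-F1 is the unweighted mean of the five per-class scores, so the three small classes count as much as the two large ones. The arithmetic below reproduces the reported macro-F1 from the table. The share-weighted figure is only approximate, because it uses full-sample class shares rather than the test-fold shares behind the published weighted-F1.

```python
# (class share, XGBoost F1, MLP F1) per phase, copied from the table above.
per_class = {
    "benign_background":         (0.569, 0.998, 0.994),
    "exfiltration_or_impact":    (0.283, 0.987, 0.981),
    "initial_access":            (0.076, 0.720, 0.651),
    "persistence_establishment": (0.027, 0.703, 0.690),
    "lateral_movement":          (0.044, 0.483, 0.451),
}

n = len(per_class)
macro_xgb = sum(xgb_f1 for _, xgb_f1, _ in per_class.values()) / n
macro_mlp = sum(mlp_f1 for _, _, mlp_f1 in per_class.values()) / n

# Share-weighted mean (approximation of weighted-F1, see the note above).
share_total = sum(share for share, _, _ in per_class.values())
weighted_xgb_approx = sum(
    share * xgb_f1 for share, xgb_f1, _ in per_class.values()
) / share_total

print(f"macro-F1  XGBoost {macro_xgb:.4f}  MLP {macro_mlp:.4f}")
print(f"share-weighted F1 (approx.) XGBoost {weighted_xgb_approx:.4f}")
```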
Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---|---|---|---|---|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | — | — |
| No event_class | 0.9206 | 0.5969 | 0.9723 | −0.0287 | −0.181 |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No log_source_type | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No severity_level | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |
Three findings:
- event_class is the dominant signal (drops 18pp macro-F1 when removed). Phase prediction without it loses most discrimination between the middle classes.
- CVSS features are second-strongest (drops 3pp F1). They capture severity information that complements event_class.
- Host features and timing add modest noise. The model performs marginally better without host features (+0.3pp accuracy), and timing features contribute essentially nothing. Kept in the pipeline as a documented baseline reference.
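The delta columns and the ranking behind these findings can be recomputed directly from the accuracy and macro-F1 columns of the ablation table. The snippet below does that arithmetic; the values are copied from the table, and the same numbers should also be available in ablation_results.json, whose exact schema is not assumed here.

```python
# (accuracy, macro-F1) per configuration, copied from the ablation table.
baseline = (0.9493, 0.7781)
ablations = {
    "No event_class":         (0.9206, 0.5969),
    "No CVSS features":       (0.9383, 0.7475),
    "No log_source_type":     (0.9469, 0.7655),
    "No engineered features": (0.9471, 0.7655),
    "No ports":               (0.9463, 0.7621),
    "No severity_level":      (0.9479, 0.7688),
    "No tamper flags":        (0.9469, 0.7657),
    "No timing":              (0.9501, 0.7730),
    "No host features":       (0.9522, 0.7828),
}

# Change relative to the full feature set; negative means the group helps.
deltas = {
    name: (acc - baseline[0], f1 - baseline[1])
    for name, (acc, f1) in ablations.items()
}

# Most harmful removal first, judged by macro-F1.
ranked = sorted(deltas, key=lambda name: deltas[name][1])
for name in ranked:
    d_acc, d_f1 = deltas[name]
    print(f"{name:<24} d_acc {d_acc:+.4f}  d_macro_f1 {d_f1:+.4f}")
```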
Architecture
XGBoost: multi-class gradient boosting (multi:softprob, 5 classes),
hist tree method, class-balanced sample weights, early stopping on
validation mlogloss.
MLP: 87 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d
→ ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
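As a sanity check on model size, the layer widths above are enough to estimate the MLP's trainable parameter count. The arithmetic below assumes PyTorch defaults that are not stated in this card: bias terms on every linear layer and affine BatchNorm1d (one scale and one shift per feature). ReLU and Dropout add no parameters.

```python
# Layer widths from the architecture description: 87 -> 128 -> 64 -> 5.
widths = [87, 128, 64, 5]

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias assumed present)."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch-norm layer (assumed)."""
    return 2 * n_features

total_params = 0
for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
    total_params += linear_params(n_in, n_out)
    is_hidden = i < len(widths) - 2  # the output layer has no BatchNorm
    if is_hidden:
        total_params += batchnorm_params(n_out)

print(f"estimated trainable parameters: {total_params:,}")
```

Under those assumptions the network is small (on the order of twenty thousand parameters), which fits its role as a reference baseline on a 21,896-event sample.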
Training hyperparameters are held internally by XpertSystems.
Limitations
This is a baseline reference, not a production phase classifier.
- The leakage diagnostic is required reading. Six oracle columns for the phase task and seven for the alert TP task are documented in leakage_diagnostic.json. If you use CYB010 sample data for your own training, you MUST drop these or your model will learn the oracles instead of the task.
- lateral_movement F1 0.48 is the weakest class. The 968-event sample with substantial overlap to initial_access makes this class hard. A sequence model that considers event ordering within incidents would likely do better than per-event classification.
- threat_actor_profile 4-class (malicious-only) is unlearnable on this sample (acc 0.55 vs majority 0.61). The 5-class formulation with benign included works only because benign_user separation is structurally trivial.
- event_class 12-class is unlearnable on this sample (acc 0.35 vs majority 0.42). event_class is a structural property of the event itself, not something to predict from other features.
- Synthetic-vs-real transfer. The dataset is synthetic, calibrated to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise characteristics; in particular, the explicit mitre_tactic == "benign" marker and threat_actor_id == "NONE" benign sentinel would not be present in real data. Real telemetry has implicit benign-vs-malicious distinctions that emerge from event content. Do not assume metrics transfer end-to-end.
- 21,896 events / 500 incidents is a modest training set. The 3,726-event / ~75-incident test fold yields stable multi-seed metrics (std 0.007 on accuracy) but per-class confidence intervals widen for the smallest classes (lateral_movement, persistence).
Notes on dataset schema
The CYB010 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| security_events has 16 columns | Data has 23 columns |
| Field renames | timestamp_utc → timestamp, user → user_id, log_format → log_source_type |
| README missing from security_events | event_class, severity_level, label_malicious, label_log_tampered, threat_actor_id, cvss_score_analogue are in data but not documented |
| README claims command_line / process_name / is_off_hours columns | Not present in security_events (off-hours derived from timestamp in pipeline) |
| alert_records has 9 columns | Data has 21 columns |
| Field renames | alert_severity → severity_level, detection_rule → alert_rule_name |
| README's triage_outcome (categorical) | Replaced by label_true_positive / label_false_positive (mirror booleans) |
| README's ioc_matched | Not present in alert_records |
| README missing from alert_records | correlated_chain_length, time_to_detect_seconds, suppression_reason, analyst_triage_priority are in data but not documented |
| incident_summary has 8 columns | Data has 24 columns |
| host_inventory has 6 columns | Data has 15 columns |
| threat_actor_profile has 4 values | Data has 5 values (adds benign_user at 57% of events) |
| attack_lifecycle_phase 5-phase malicious lifecycle | Data adds benign_background as a phase value (57% of events), so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |
None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
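If you are porting code written against the dataset README, a small lookup that translates README field names to the actual columns can catch mismatches early. The sketch below encodes only the renames and missing fields listed in the table above; the `resolve_field` helper and the dictionary layout are hypothetical, not part of the bundled pipeline.

```python
# README name -> actual column name, per table (from the schema notes above).
README_TO_ACTUAL = {
    "security_events": {
        "timestamp_utc": "timestamp",
        "user": "user_id",
        "log_format": "log_source_type",
    },
    "alert_records": {
        "alert_severity": "severity_level",
        "detection_rule": "alert_rule_name",
    },
}

# Fields the README describes that have no column of that name in the data.
README_ONLY = {
    "security_events": {"command_line", "process_name", "is_off_hours"},
    "alert_records": {"ioc_matched", "triage_outcome"},
}

def resolve_field(table, readme_field):
    """Map a README field name to the actual column, or raise if it is absent."""
    if readme_field in README_ONLY.get(table, set()):
        raise KeyError(f"{table}.{readme_field} is documented but not in the data")
    return README_TO_ACTUAL.get(table, {}).get(readme_field, readme_field)

actual = resolve_field("security_events", "log_format")
unchanged = resolve_field("security_events", "dest_port")
```

Fields that are not renamed pass through unchanged, and README-only fields fail loudly instead of silently producing an empty column.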
Intended use
- Evaluating fit of the CYB010 dataset for your SIEM ML research
- Baseline reference for new model architectures on the attack-phase classification task
- Reference example of structural-leakage diagnostics for synthetic SIEM datasets — the methodology is reusable
- Feature engineering reference for per-event SIEM telemetry
Out-of-scope use
- Production SIEM phase detection on real telemetry
- Threat actor attribution (4-class malicious-only is unlearnable on the sample)
- Event-class prediction (this is a structural property, not a learnable target)
- Any operational decision affecting actual security operations without further validation on your own data
Reproducibility
Outputs above were produced with seed = 42 (published artifact),
nested GroupShuffleSplit on incident_id (70/15/15), on the published
sample (xpertsystems/cyb010-sample, version 1.0.0, generated
2026-05-16). The feature pipeline in feature_engineering.py is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.
Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in multi_seed_results.json confirm robust performance across splits
(std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std
in the XpertSystems catalog).
The training script itself is private to XpertSystems.
Files in this repo
| File | Purpose |
|---|---|
| model_xgb.json | XGBoost weights (seed 42) |
| model_mlp.safetensors | PyTorch MLP weights (seed 42) |
| feature_engineering.py | Feature pipeline |
| feature_meta.json | Feature column order + categorical levels |
| feature_scaler.json | MLP input mean/std (XGBoost ignores) |
| validation_results.json | Per-class metrics, confusion matrix, architecture |
| ablation_results.json | Per-feature-group ablation |
| multi_seed_results.json | XGBoost metrics across 10 seeds |
| leakage_diagnostic.json | 11-oracle-path audit + 2 unlearnable targets |
| inference_example.ipynb | End-to-end inference demo notebook |
| README.md | This file |
Contact and full product
The full CYB010 dataset contains ~550,000 rows across four files, with calibrated benchmark validation against 6 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).
The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.
- 📧 pradeep@xpertsystems.ai
- 🌐 https://xpertsystems.ai
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
- 🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
- https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
- https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)
Citation
@misc{xpertsystems_cyb010_baseline_2026,
title = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
note = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}