CYB010 Baseline Classifier

Attack lifecycle phase classifier (5-class) trained on the CYB010 synthetic security event log sample. From per-event features, it predicts which of 5 attack phases (benign_background / initial_access / lateral_movement / persistence_establishment / exfiltration_or_impact) a security event belongs to. The repo also ships leakage_diagnostic.json, which documents 11 oracle paths discovered across the dataset's targets and 2 README-suggested targets that are unlearnable on the sample after honest leak removal.

Read this first. This repo ships two related artifacts: (1) a working baseline classifier for attack_lifecycle_phase (the dataset's headline target), and (2) leakage_diagnostic.json documenting 11 separate oracle paths plus 2 unlearnable targets. Both files matter; the diagnostic is required reading for anyone evaluating CYB010 for SIEM ML work.

Model overview

| Property | Value |
|---|---|
| Primary task | 5-class attack_lifecycle_phase classification |
| Secondary artifact | leakage_diagnostic.json (11 oracle paths + 2 unlearnable targets) |
| Training data | xpertsystems/cyb010-sample (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | Group-aware (GroupShuffleSplit on incident_id) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |

Why this task — and what was dropped

The CYB010 README's central concept is the "5-phase attack lifecycle state machine", and attack_lifecycle_phase is the data's headline target. We piloted six candidate targets and found:

  • attack_lifecycle_phase 5-class: strongest honest result. Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes represented, per-class F1 range 0.48–1.00.

  • threat_actor_profile 5-class: works at acc 0.84 but per-class F1 reveals it's almost entirely driven by benign_user separation (F1 1.00 vs F1 0.17-0.69 for the 4 malicious classes). The 4-class malicious-only formulation is below majority (acc 0.55 vs 0.61).

  • label_true_positive binary on alerts: documented as a secondary finding. Has 7 oracle features; honest acc 0.80, AUC 0.89 after dropping all of them.

  • mitre_tactic 14-class: hits acc 0.90 but macro-F1 0.37 - imbalance gaming (benign class dominates at 57%).

  • event_class 12-class: unlearnable (acc 0.35 vs majority 0.42).
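The verdicts in this list come from comparing each score with the majority-class baseline. The sketch below restates that comparison using only figures quoted in this card. The function name `verdict` and the 0.5 macro-F1 cut-off are illustrative assumptions, not part of the published pipeline.

```python
def verdict(acc, majority, macro_f1=None, min_macro_f1=0.5):
    """Classify a pilot result against the majority-class baseline.

    min_macro_f1 is an assumed, illustrative cut-off, not a published threshold.
    """
    if acc <= majority:
        return "unlearnable"        # no better than always predicting the majority class
    if macro_f1 is not None and macro_f1 < min_macro_f1:
        return "imbalance gaming"   # accuracy is carried by the dominant class
    return "learnable"

# Figures quoted in this card (accuracy, majority share, macro-F1 where given)
pilots = {
    "event_class (12-class)": dict(acc=0.35, majority=0.42),
    "threat_actor_profile (4-class, malicious only)": dict(acc=0.55, majority=0.61),
    "mitre_tactic (14-class)": dict(acc=0.90, majority=0.57, macro_f1=0.37),
    "attack_lifecycle_phase (5-class)": dict(acc=0.936, majority=0.569, macro_f1=0.759),
}
results = {name: verdict(**kw) for name, kw in pilots.items()}
for name, v in results.items():
    print(f"{name}: {v}")
```

Accuracy alone would rank mitre_tactic close to the phase task; the macro-F1 check is what separates them.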

Six oracle columns dropped from the phase task

CYB010 encodes the benign vs malicious distinction explicitly in multiple columns. Each is a perfect or near-perfect oracle for the benign_background phase:

| Column | Oracle relationship |
|---|---|
| mitre_tactic | == "benign" → benign_background phase (12,448/12,448, perfect) |
| mitre_technique_id | Perfect ATT&CK-by-design oracle for mitre_tactic (54/54 techniques → single tactic) |
| label_malicious | == False → benign_background (perfect) |
| threat_actor_id | == "NONE" → benign_background (perfect) |
| threat_actor_profile | == "benign_user" → benign_background (perfect) |
| event_type | Many values are phase-specific (c2_beacon_outbound → 100% exfiltration_or_impact) |

With these six columns present, a plain XGBoost trivially separates benign vs malicious. The published baseline trains with all six excluded.
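One way to screen a column for this kind of leakage is to measure, for each of its values, how pure the target distribution is. The sketch below is an assumption about how such a check could look, not the method behind leakage_diagnostic.json. The rows are toy data, not CYB010 records, and the helper name `column_purity` is hypothetical.

```python
from collections import Counter, defaultdict

def column_purity(rows, column, target):
    """For each value of `column`, return (majority target label, purity, count)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    report = {}
    for value, counts in by_value.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        report[value] = (label, top / total, total)
    return report

# Toy rows for illustration only; not drawn from CYB010.
rows = [
    {"mitre_tactic": "benign", "attack_lifecycle_phase": "benign_background"},
    {"mitre_tactic": "benign", "attack_lifecycle_phase": "benign_background"},
    {"mitre_tactic": "exfiltration", "attack_lifecycle_phase": "exfiltration_or_impact"},
    {"mitre_tactic": "discovery", "attack_lifecycle_phase": "lateral_movement"},
    {"mitre_tactic": "discovery", "attack_lifecycle_phase": "initial_access"},
]
report = column_purity(rows, "mitre_tactic", "attack_lifecycle_phase")

# Flag values that map to a single phase and occur at least twice
oracle_values = [v for v, (_, purity, n) in report.items() if purity == 1.0 and n >= 2]
print(oracle_values)
```

A value with purity 1.0 over thousands of rows, such as mitre_tactic == "benign" in the table above, is the signature of an oracle. Purity 1.0 over a handful of rows may just be small-sample noise, which is why the sketch also requires a minimum count.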

Two model artifacts are published. They are designed to be used together:

  • model_xgb.json — gradient-boosted trees (slightly higher F1)
  • model_mlp.safetensors — PyTorch MLP

Quick start

Install the dependencies:

pip install xgboost torch safetensors pandas huggingface_hub

Then load the artifacts and run a prediction in Python:

from huggingface_hub import hf_hub_download, snapshot_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file

REPO = "xpertsystems/cyb010-baseline-classifier"

paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)

meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# my_event is a placeholder: supply one raw security_events record in the
# format shown in inference_example.ipynb before running this line.
# Do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type: those were the
# oracle columns.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for the full copy-paste demo.

Training data

Trained on the public sample of CYB010, 21,896 per-event records:

| Phase | Events | Class share |
|---|---|---|
| benign_background | 12,448 | 56.9% |
| exfiltration_or_impact | 6,205 | 28.3% |
| initial_access | 1,674 | 7.6% |
| lateral_movement | 968 | 4.4% |
| persistence_establishment | 601 | 2.7% |

Group-aware split by incident_id

500 incidents × ~44 events each. Events from the same incident share host, threat actor, and phase trajectory — so train/test contamination is a real risk with random splitting. The baseline uses GroupShuffleSplit on incident_id (nested 70/15/15):

| Fold | Events | Incidents |
|---|---|---|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |
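The property that matters in a group-aware split is that every incident lands in exactly one fold. The baseline uses scikit-learn's GroupShuffleSplit. The sketch below reimplements the idea with the standard library only, so its exact fold assignment will differ from the published split. The incident ids and events-per-incident counts are synthetic, and the helper name `group_split` is hypothetical.

```python
import random

def group_split(groups, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to train/val/test so no group spans two folds."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    fold_of = {}
    for i, g in enumerate(unique):
        fold_of[g] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    return [fold_of[g] for g in groups]

# Synthetic incident ids: 500 incidents with 40-48 events each (illustrative only)
rng = random.Random(0)
incident_ids = [f"INC{i:04d}" for i in range(500) for _ in range(rng.randint(40, 48))]
folds = group_split(incident_ids)

incidents_per_fold = {
    f: len({g for g, x in zip(incident_ids, folds) if x == f})
    for f in ("train", "val", "test")
}
print(incidents_per_fold)
```

Because the split is made over incidents rather than events, the event counts per fold only approximate 70/15/15, which matches the uneven event totals in the table above.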

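All 10 multi-seed evaluations yielded all 5 classes in the test fold. Class imbalance is addressed with balanced class weights: the equivalent of scikit-learn's class_weight='balanced', applied through per-row sample_weight for XGBoost and through a weighted cross-entropy loss for the MLP.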
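For reference, the usual "balanced" heuristic sets each class weight to N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the class count. The sketch below applies it to the full-sample counts from the Training data table. That is an assumption made for illustration: the baseline presumably computes its weights on the training fold, whose counts are not published here.

```python
# Full-sample class counts from the Training data table
counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}
n_total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rarer classes get proportionally larger weights
weights = {c: n_total / (n_classes * n) for c, n in counts.items()}
for c, w in weights.items():
    print(f"{c:28s} {w:.3f}")
```

Under this scheme every class contributes the same total weight (N / K), so the 601 persistence_establishment events count as much in the loss as the 12,448 benign ones.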

Feature pipeline

The bundled feature_engineering.py is the canonical recipe. 87 features survive after encoding, drawn from:

  • Per-event numeric (5): source_port, dest_port, cvss_score_analogue, label_log_tampered, label_false_positive
  • Per-event categorical (3, one-hot): event_class (12 values), log_source_type (8 values), severity_level (5 values)
  • Host features (joined from host_inventory.csv): 3 numeric + 7 categorical (os_type, host_role, network_segment, defender_posture, criticality_rating, cloud_provider, siem_platform)
  • Engineered (9): hour_of_day, is_off_hours, is_weekend, log_cvss, is_high_cvss, is_well_known_port, is_dynamic_port, is_outbound_web, risk_composite
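The bundled feature_engineering.py is the authoritative definition of the engineered features. As a rough illustration of what several of them could look like, the sketch below derives them from one event dict. Every threshold in it (the 08:00-18:00 working day, the 7.0 CVSS cut-off, log1p for log_cvss) is an assumption, and is_outbound_web and risk_composite are omitted because this card does not define them.

```python
import math
from datetime import datetime

def engineer(event):
    """Derive a few illustrative features from one raw event dict."""
    ts = datetime.fromisoformat(event["timestamp"])
    cvss = float(event["cvss_score_analogue"])
    port = int(event["dest_port"])
    return {
        "hour_of_day": ts.hour,
        "is_off_hours": int(ts.hour < 8 or ts.hour >= 18),  # assumed 08:00-18:00 working day
        "is_weekend": int(ts.weekday() >= 5),               # Saturday = 5, Sunday = 6
        "log_cvss": math.log1p(cvss),                       # assumed log(1 + cvss)
        "is_high_cvss": int(cvss >= 7.0),                   # assumed cut-off
        "is_well_known_port": int(port < 1024),             # IANA system ports 0-1023
        "is_dynamic_port": int(port >= 49152),              # IANA dynamic ports 49152-65535
    }

# Hypothetical event; field names follow the actual schema described in this card
example = {"timestamp": "2026-05-16T23:15:00", "cvss_score_analogue": 8.1, "dest_port": 443}
features = engineer(example)
print(features)
```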

Partial-oracle features kept as legitimate observables

event_class (max purity 0.87, mean 0.72 across phases) is the strongest non-oracle feature. For example, network_flow events (the class that carries C2 beacon traffic) are 65% exfiltration phase but also 29% benign and 6% other phases. That is real overlap, not a deterministic encoding, so the column is kept.

severity_level and cvss_score_analogue correlate strongly with phase (high-severity events skew toward exfil and initial_access) but with substantial overlap. Kept.

label_log_tampered is a real observable — APTs tamper more than script_kiddies — but is not phase-deterministic. Kept.

Evaluation

Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

XGBoost (the published model_xgb.json artifact)

| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9904 |
| Accuracy | 0.9493 |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |

MLP (the published model_mlp.safetensors artifact)

| Metric | Value |
|---|---|
| Macro ROC-AUC (OvR) | 0.9861 |
| Accuracy | 0.9412 |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |

XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941, macro-F1 0.778 vs 0.753). The gap is consistent across seeds.

Multi-seed robustness (XGBoost, 10 seeds)

| Metric | Mean | Std | Min | Max |
|---|---|---|---|---|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |

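The ROC-AUC standard deviation (0.001) is the tightest in the catalog. All 10 seeds yielded all 5 classes in the test fold. Note that the published seed-42 artifact sits at the top of the multi-seed accuracy range (0.949 is the maximum), so the multi-seed mean (0.936) is the safer figure to quote. Full per-seed results are in multi_seed_results.json.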

Per-class F1 (seed 42)

| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---|---|---|
| benign_background | 56.9% | 0.998 | 0.994 |
| exfiltration_or_impact | 28.3% | 0.987 | 0.981 |
| initial_access | 7.6% | 0.720 | 0.651 |
| persistence_establishment | 2.7% | 0.703 | 0.690 |
| lateral_movement | 4.4% | 0.483 | 0.451 |

The two largest classes (benign_background and exfiltration_or_impact) are nearly perfectly separable — benign_background because the non-oracle features (severity, CVSS, log_source) still cleanly separate non-malicious traffic, and exfiltration_or_impact because it's dominated by network_flow events (C2 beacons). The three middle classes overlap substantially in feature space; lateral_movement is the hardest (F1 0.48) because lateral movement events look similar to initial_access events at the per-event level. A sequence model that considers event ordering within an incident would likely do better than the per-event baseline.
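The gap between weighted-F1 (about 0.95) and macro-F1 (about 0.78) follows directly from the per-class table: macro-F1 is the unweighted mean of the five per-class scores, so the three small classes pull it down. A quick check using only the numbers above:

```python
# Per-class F1 at seed 42: (XGBoost, MLP), copied from the table above
per_class_f1 = {
    "benign_background": (0.998, 0.994),
    "exfiltration_or_impact": (0.987, 0.981),
    "initial_access": (0.720, 0.651),
    "persistence_establishment": (0.703, 0.690),
    "lateral_movement": (0.483, 0.451),
}

# Macro-F1 is the plain average over classes, ignoring class size
macro_xgb = sum(x for x, _ in per_class_f1.values()) / len(per_class_f1)
macro_mlp = sum(m for _, m in per_class_f1.values()) / len(per_class_f1)
print(f"XGBoost macro-F1 ~ {macro_xgb:.4f}, MLP macro-F1 ~ {macro_mlp:.4f}")
```

Both averages agree with the reported macro-F1 values (0.7781 and 0.7534) to within the rounding of the per-class figures. Weighted-F1 is not recomputed here because it depends on the test-fold class supports, which this card does not list.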

Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---|---|---|---|---|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | | |
| No event_class | 0.9206 | 0.5969 | 0.9723 | −0.0287 | −0.181 |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No log_source_type | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No severity_level | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |

Three findings:

  1. event_class is the dominant signal: removing it drops macro-F1 by 18pp, and the model loses most of its discrimination between the middle classes.
  2. CVSS features are second-strongest: removing them drops macro-F1 by 3pp. They capture severity information that complements event_class.
  3. Host features and timing add modest noise. The model performs marginally better without host features (+0.3pp accuracy), and timing features contribute essentially nothing. Both groups are kept in the pipeline as a documented baseline reference.
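The delta columns and the ranking behind these findings can be reproduced from the ablation table alone. The sketch below recomputes each delta against the full-feature baseline and sorts configurations by macro-F1 impact; the variable names are illustrative.

```python
baseline = {"accuracy": 0.9493, "macro_f1": 0.7781}

# (accuracy, macro-F1) per ablation, copied from the table above
ablations = {
    "No event_class": (0.9206, 0.5969),
    "No CVSS features": (0.9383, 0.7475),
    "No log_source_type": (0.9469, 0.7655),
    "No engineered features": (0.9471, 0.7655),
    "No ports": (0.9463, 0.7621),
    "No severity_level": (0.9479, 0.7688),
    "No tamper flags": (0.9469, 0.7657),
    "No timing": (0.9501, 0.7730),
    "No host features": (0.9522, 0.7828),
}

deltas = {
    name: (round(acc - baseline["accuracy"], 4), round(f1 - baseline["macro_f1"], 3))
    for name, (acc, f1) in ablations.items()
}

# Most harmful removal first (largest macro-F1 drop)
ranked = sorted(deltas.items(), key=lambda kv: kv[1][1])
for name, (d_acc, d_f1) in ranked:
    print(f"{name:24s} d_acc={d_acc:+.4f} d_macro_f1={d_f1:+.3f}")
```

Sorting by macro-F1 rather than accuracy is deliberate: accuracy is dominated by the two large classes, so it understates how much the middle classes depend on event_class.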

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 5 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 87 → 128 → 64 → 5, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
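A minimal PyTorch sketch of the MLP shape described above, assuming a plain nn.Sequential layout. The layer names and ordering inside the published model_mlp.safetensors are not documented in this card, so this module may not load those weights as-is; treat it as an illustration of the architecture only. The function name `build_mlp` is hypothetical.

```python
import torch
from torch import nn

def build_mlp(n_features=87, n_classes=5, dropout=0.3):
    """87 -> 128 -> 64 -> 5, with BatchNorm1d -> ReLU -> Dropout after each hidden layer."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied at inference
    )

model = build_mlp()
model.eval()  # disables dropout and uses BatchNorm running statistics
with torch.no_grad():
    logits = model(torch.randn(4, 87))  # random stand-in for 4 scaled feature rows
probs = torch.softmax(logits, dim=1)
n_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), n_params)
```

For training, the weighted cross-entropy mentioned above would correspond to nn.CrossEntropyLoss with a per-class weight tensor, and the optimizer to torch.optim.AdamW; the actual hyperparameters are not published.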

Training hyperparameters are held internally by XpertSystems.

Limitations

This is a baseline reference, not a production phase classifier.

  1. The leakage diagnostic is required reading. Six oracle columns for the phase task and seven for the alert TP task are documented in leakage_diagnostic.json. If you use CYB010 sample data for your own training, you MUST drop these or your model will learn the oracles instead of the task.

  2. lateral_movement F1 0.48 is the weakest class. The 968-event sample with substantial overlap to initial_access makes this class hard. A sequence model that considers event ordering within incidents would likely do better than per-event classification.

  3. threat_actor_profile 4-class (malicious-only) is unlearnable on this sample (acc 0.55 vs majority 0.61). The 5-class formulation with benign included works only because benign_user separation is structurally trivial.

  4. event_class 12-class is unlearnable on this sample (acc 0.35 vs majority 0.42). event_class is a structural property of the event itself, not something to predict from other features.

  5. Synthetic-vs-real transfer. The dataset is synthetic, calibrated to 6 benchmarks from SANS / IBM / Mandiant / Verizon / CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise characteristics — and in particular, the explicit mitre_tactic == "benign" marker and threat_actor_id == "NONE" benign sentinel would not be present in real data. Real telemetry has implicit benign-vs-malicious distinctions that emerge from event content. Do not assume metrics transfer end-to-end.

  6. 21,896 events / 500 incidents is a modest training set. The 3,726-event / ~75-incident test fold yields stable multi-seed metrics (std 0.007 on accuracy) but per-class confidence intervals widen for the smallest classes (lateral_movement, persistence).

Notes on dataset schema

The CYB010 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| security_events has 16 columns | Data has 23 columns |
| Field renames (security_events) | timestamp_utc → timestamp, user → user_id, log_format → log_source_type |
| README missing from security_events | event_class, severity_level, label_malicious, label_log_tampered, threat_actor_id, cvss_score_analogue are in data but not documented |
| README claims command_line / process_name / is_off_hours columns | Not present in security_events (off-hours derived from timestamp in pipeline) |
| alert_records has 9 columns | Data has 21 columns |
| Field renames (alert_records) | alert_severity → severity_level, detection_rule → alert_rule_name |
| README's triage_outcome (categorical) | Replaced by label_true_positive / label_false_positive (mirror booleans) |
| README's ioc_matched | Not present in alert_records |
| README missing from alert_records | correlated_chain_length, time_to_detect_seconds, suppression_reason, analyst_triage_priority are in data but not documented |
| incident_summary has 8 columns | Data has 24 columns |
| host_inventory has 6 columns | Data has 15 columns |
| threat_actor_profile has 4 values | Data has 5 values (adds benign_user at 57% of events) |
| attack_lifecycle_phase 5-phase malicious lifecycle | Data adds benign_background as a phase value (57% of events), so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |

None of these affects model correctness — the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.
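If you have code written against the dataset README's column names, a small translation table avoids silent KeyErrors. The sketch below encodes only the renames listed in the table above; the helper name `to_actual` is hypothetical and not part of feature_engineering.py.

```python
# README column name -> actual column name, per table
README_TO_ACTUAL = {
    "security_events": {
        "timestamp_utc": "timestamp",
        "user": "user_id",
        "log_format": "log_source_type",
    },
    "alert_records": {
        "alert_severity": "severity_level",
        "detection_rule": "alert_rule_name",
    },
}

def to_actual(table, columns):
    """Translate README-style column names to the names found in the data."""
    mapping = README_TO_ACTUAL.get(table, {})
    return [mapping.get(c, c) for c in columns]

print(to_actual("security_events", ["timestamp_utc", "user", "dest_port"]))
print(to_actual("alert_records", ["alert_severity", "detection_rule"]))
```

Columns that the README documents but the data lacks (command_line, process_name, ioc_matched, triage_outcome) have no mapping and pass through unchanged, so they still need to be handled explicitly.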

Intended use

  • Evaluating fit of the CYB010 dataset for your SIEM ML research
  • Baseline reference for new model architectures on the attack-phase classification task
  • Reference example of structural-leakage diagnostics for synthetic SIEM datasets — the methodology is reusable
  • Feature engineering reference for per-event SIEM telemetry

Out-of-scope use

  • Production SIEM phase detection on real telemetry
  • Threat actor attribution (4-class malicious-only is unlearnable on the sample)
  • Event-class prediction (this is a structural property, not a learnable target)
  • Any operational decision affecting actual security operations without further validation on your own data

Reproducibility

Outputs above were produced with seed = 42 (published artifact), nested GroupShuffleSplit on incident_id (70/15/15), on the published sample (xpertsystems/cyb010-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in multi_seed_results.json confirm robust performance across splits (std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std in the XpertSystems catalog).

The training script itself is private to XpertSystems.

Files in this repo

| File | Purpose |
|---|---|
| model_xgb.json | XGBoost weights (seed 42) |
| model_mlp.safetensors | PyTorch MLP weights (seed 42) |
| feature_engineering.py | Feature pipeline |
| feature_meta.json | Feature column order + categorical levels |
| feature_scaler.json | MLP input mean/std (XGBoost ignores) |
| validation_results.json | Per-class metrics, confusion matrix, architecture |
| ablation_results.json | Per-feature-group ablation |
| multi_seed_results.json | XGBoost metrics across 10 seeds |
| leakage_diagnostic.json | 11-oracle-path audit + 2 unlearnable targets |
| inference_example.ipynb | End-to-end inference demo notebook |
| README.md | This file |

Contact and full product

The full CYB010 dataset contains ~550,000 rows across four files, with calibrated benchmark validation against 6 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

Citation

@misc{xpertsystems_cyb010_baseline_2026,
  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}