---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - cybersecurity
  - siem
  - security-logs
  - mitre-attack
  - apt
  - tabular-classification
  - synthetic-data
  - xgboost
  - baseline
  - leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
  - xpertsystems/cyb010-sample
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: cyb010-baseline-classifier
    results:
      - task:
          type: tabular-classification
          name: 5-class attack lifecycle phase classification
        dataset:
          type: xpertsystems/cyb010-sample
          name: CYB010 Synthetic Security Event Log Dataset (Sample)
        metrics:
          - type: roc_auc
            value: 0.9904
            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
          - type: accuracy
            value: 0.9493
            name: Test accuracy (XGBoost, seed 42)
          - type: f1
            value: 0.7781
            name: Test macro-F1 (XGBoost, seed 42)
          - type: accuracy
            value: 0.936
            name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.988
            name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
---

# CYB010 Baseline Classifier

**Attack lifecycle phase classifier (5-class) trained on the CYB010
synthetic security event log sample. Predicts which of 5 lifecycle phases
(`benign_background` / `initial_access` / `lateral_movement` /
`persistence_establishment` / `exfiltration_or_impact`) a security
event belongs to, from per-event features. The repo also ships
`leakage_diagnostic.json`, which documents 11 oracle paths discovered
across the dataset's targets and 2 README-suggested targets that are
unlearnable on the sample after honest leak removal.**

> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `attack_lifecycle_phase` (the
> dataset's headline target), and (2) `leakage_diagnostic.json`
> documenting 11 separate oracle paths plus 2 unlearnable targets.
> Both files matter; the diagnostic is required reading for anyone
> evaluating CYB010 for SIEM ML work.

## Model overview

| Property | Value |
|---|---|
| Primary task | 5-class `attack_lifecycle_phase` classification |
| Secondary artifact | `leakage_diagnostic.json` — 11 oracle paths + 2 unlearnable targets |
| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 87 (after one-hot encoding) |
| Split | **Group-aware** (GroupShuffleSplit on `incident_id`) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + comprehensive leakage diagnostic |

## Why this task — and what was dropped

The CYB010 README's central concept is the "5-phase attack lifecycle
state machine", and `attack_lifecycle_phase` is the data's headline
target. We piloted six candidate targets and found:

- **`attack_lifecycle_phase` 5-class**: strongest honest result.
  Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes
  represented, per-class F1 range 0.48–1.00.

- **`threat_actor_profile` 5-class**: reaches acc 0.84, but per-class
  F1 shows the result is driven almost entirely by `benign_user`
  separation (F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The
  4-class malicious-only formulation falls below the majority-class
  baseline (acc 0.55 vs 0.61).

- **`label_true_positive` binary on alerts**: documented as a secondary
  finding. Has 7 oracle features; honest acc 0.80, AUC 0.89 after
  dropping all of them.

- **`mitre_tactic` 14-class**: reaches acc 0.90 but macro-F1 of only
  0.37. The accuracy comes from the dominant benign class (57% of
  events), not from separating the tactics.

- **`event_class` 12-class**: unlearnable (acc 0.35 vs majority-class
  baseline 0.42).

### Six oracle columns dropped from the phase task

CYB010 encodes the benign vs malicious distinction explicitly in
multiple columns. Each is a perfect or near-perfect oracle for the
`benign_background` phase:

| Column | Oracle relationship |
|---|---|
| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
| `event_type` | Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`) |

With these six columns present, a plain XGBoost trivially separates
benign vs malicious. The published baseline trains with all six
excluded.
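
As a minimal sketch of the exclusion described above, assuming the events are loaded into a pandas DataFrame: the column names come from the table, while `drop_oracle_columns` and `is_two_way_oracle` are hypothetical helpers written for illustration, not functions from `feature_engineering.py`. The toy frame is made up and only demonstrates the `↔` check.

```python
import pandas as pd

# The six oracle columns listed in the table above.
ORACLE_COLUMNS = [
    "mitre_tactic",
    "mitre_technique_id",
    "label_malicious",
    "threat_actor_id",
    "threat_actor_profile",
    "event_type",
]


def drop_oracle_columns(events: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `events` without the oracle columns (absent ones are ignored)."""
    return events.drop(columns=ORACLE_COLUMNS, errors="ignore")


def is_two_way_oracle(events, column, value, target, target_value) -> bool:
    """True if `column == value` holds on exactly the rows where `target == target_value`."""
    return bool(((events[column] == value) == (events[target] == target_value)).all())


# Toy frame for illustration only; the values are not taken from CYB010.
toy = pd.DataFrame({
    "label_malicious": [False, False, True, True],
    "severity_level": ["low", "high", "high", "low"],
    "attack_lifecycle_phase": [
        "benign_background", "benign_background",
        "initial_access", "initial_access",
    ],
})

target = "attack_lifecycle_phase"
print(is_two_way_oracle(toy, "label_malicious", False, target, "benign_background"))
print(is_two_way_oracle(toy, "severity_level", "low", target, "benign_background"))
print(list(drop_oracle_columns(toy).columns))
```

Running the same check on each candidate column before training is a cheap way to catch sentinel-style leaks in your own pipeline.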

Two model artifacts are published. They are designed to be used
together:

- `model_xgb.json` — gradient-boosted trees (slightly higher F1)
- `model_mlp.safetensors` — PyTorch MLP

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download, snapshot_download

REPO = "xpertsystems/cyb010-baseline-classifier"

# Download the model artifacts and the feature pipeline. The MLP weights and
# scaler are fetched too, but this snippet only runs the XGBoost model.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
)

meta = load_meta(paths["feature_meta.json"])

# Host features are joined from host_inventory.csv at inference time
ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")

xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict. `my_event` is a placeholder for one raw event that you supply;
# see inference_example.ipynb for the full pattern.
# Do NOT include mitre_tactic, mitre_technique_id, label_malicious,
# threat_actor_id, threat_actor_profile, or event_type: those are the
# oracle columns excluded from training.
X = transform_single(my_event, meta, host_lookup=host_lookup)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB010, 21,896 per-event records:

| Phase | Events | Class share |
|---|---:|---:|
| `benign_background` | 12,448 | 56.9% |
| `exfiltration_or_impact` | 6,205 | 28.3% |
| `initial_access` | 1,674 | 7.6% |
| `lateral_movement` | 968 | 4.4% |
| `persistence_establishment` | 601 | 2.7% |

### Group-aware split by incident_id

The sample contains 500 incidents with ~44 events each. Events from the
same incident share host, threat actor, and phase trajectory, so a
random split would leak incident-level information between train and
test. The baseline instead uses **GroupShuffleSplit** on `incident_id`
(nested 70/15/15):

| Fold | Events | Incidents |
|---|---:|---:|
| Train | 14,697 | ~350 |
| Validation | 3,473 | ~75 |
| Test | 3,726 | ~75 |
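
A nested group-aware split can be sketched as follows. The exact nesting used for the published artifact is not documented here, so this version assumes one plausible reading of 70/15/15: hold out 30% of incidents, then halve the held-out incidents into validation and test. `nested_group_split` is a hypothetical helper and the incident ids are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def nested_group_split(groups, seed=42, holdout=0.30):
    """Split row indices into train/val/test with no group shared across folds.

    Assumed nesting: hold out `holdout` of the groups, then halve the
    held-out groups into validation and test (70/15/15 for holdout=0.30).
    """
    groups = np.asarray(groups)
    rows = np.arange(len(groups))

    outer = GroupShuffleSplit(n_splits=1, test_size=holdout, random_state=seed)
    train_idx, rest_idx = next(outer.split(rows, groups=groups))

    # The inner split yields positions within rest_idx, so map them back.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_pos, test_pos = next(inner.split(rest_idx, groups=groups[rest_idx]))
    return train_idx, rest_idx[val_pos], rest_idx[test_pos]


# Synthetic incident ids for illustration: 500 incidents, 44 events each.
incident_ids = np.repeat(np.arange(500), 44)
train_idx, val_idx, test_idx = nested_group_split(incident_ids)

train_g = set(incident_ids[train_idx])
val_g = set(incident_ids[val_idx])
test_g = set(incident_ids[test_idx])
print(len(train_g), len(val_g), len(test_g))
print(train_g.isdisjoint(val_g), train_g.isdisjoint(test_g), val_g.isdisjoint(test_g))
```

Because the split is over incidents rather than rows, event counts per fold vary with the seed even though incident counts stay fixed.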

All 10 multi-seed evaluations yielded all 5 classes in the test fold.
Class imbalance is addressed with balanced class weights (the
`class_weight='balanced'` scheme), applied as per-row `sample_weight`
for XGBoost and as weighted cross-entropy for the MLP.
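
The balanced weighting can be sketched with NumPy alone. This assumes the usual "balanced" formula, `n_samples / (n_classes * class_count)`, which is what scikit-learn's `compute_sample_weight(class_weight="balanced", y=...)` computes; `balanced_sample_weights` is a hypothetical helper and the labels are a toy example.

```python
import numpy as np


def balanced_sample_weights(y):
    """Per-row weights so that every class contributes the same total weight."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    # np.unique returns sorted classes, so searchsorted maps labels to indices.
    return class_weight[np.searchsorted(classes, y)]


# Toy labels for illustration: class 0 is common, class 2 is rare.
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
weights = balanced_sample_weights(y_toy)

for cls in np.unique(y_toy):
    print(cls, round(float(weights[y_toy == cls].sum()), 4))
```

The resulting array can be passed as `sample_weight` to `XGBClassifier.fit`; for the MLP, the per-class weights would instead go to the `weight` argument of `torch.nn.CrossEntropyLoss`.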

## Feature pipeline

The bundled `feature_engineering.py` is the canonical recipe. 87
features survive after encoding, drawn from:

- **Per-event numeric** (5): `source_port`, `dest_port`,
  `cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
- **Per-event categorical** (3, one-hot): `event_class` (12 values),
  `log_source_type` (8 values), `severity_level` (5 values)
- **Host features** (joined from `host_inventory.csv`): 3 numeric +
  7 categorical (os_type, host_role, network_segment, defender_posture,
  criticality_rating, cloud_provider, siem_platform)
- **Engineered** (9): `hour_of_day`, `is_off_hours`, `is_weekend`,
  `log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`,
  `is_outbound_web`, `risk_composite`
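
For orientation, here is a rough sketch of how features with these names could be derived. The real definitions live in `feature_engineering.py`; every threshold below is an assumption (business hours 08:00 to 17:59, CVSS "high" at 7.0 and above, `log1p` for `log_cvss`, ports 80/443 as web, and the use of `dest_port`). `risk_composite` is omitted because its formula is not described here, and `engineer` is a hypothetical function name.

```python
import math
from datetime import datetime


def engineer(timestamp: str, dest_port: int, cvss: float) -> dict:
    """Derive illustrative per-event features from an ISO-8601 timestamp, port, and CVSS score."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "hour_of_day": ts.hour,
        # Assumption: off-hours means outside 08:00-17:59.
        "is_off_hours": int(ts.hour < 8 or ts.hour >= 18),
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        "is_weekend": int(ts.weekday() >= 5),
        # Assumption: log1p keeps a score of 0 finite.
        "log_cvss": math.log1p(cvss),
        # Assumption: the CVSS v3 "High" band starts at 7.0.
        "is_high_cvss": int(cvss >= 7.0),
        # IANA ranges: 0-1023 are well-known, 49152-65535 are dynamic.
        "is_well_known_port": int(0 <= dest_port <= 1023),
        "is_dynamic_port": int(49152 <= dest_port <= 65535),
        # Assumption: "outbound web" is approximated by destination port 80/443.
        "is_outbound_web": int(dest_port in (80, 443)),
    }


saturday_night = engineer("2024-01-06T22:30:00", 443, 9.1)
monday_morning = engineer("2024-01-08T10:00:00", 50000, 3.0)
print(saturday_night)
print(monday_morning)
```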

### Partial-oracle features kept as legitimate observables

`event_class` (max purity 0.87, mean 0.72 across phases) is the
strongest non-oracle feature. C2 beacon traffic (`event_class =
network_flow`) is 65% exfiltration phase but also 29% benign and 6%
other phases — real overlap, not deterministic encoding. Kept.

`severity_level` and `cvss_score_analogue` correlate strongly with
phase (high-severity events skew toward exfil and initial_access) but
with substantial overlap. Kept.

`label_log_tampered` is a real observable — APTs tamper more than
script_kiddies — but is not phase-deterministic. Kept.
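
The purity figures quoted in this section can be reproduced in spirit with a few lines of Python. This sketch assumes "purity" means the share of a feature value's rows held by its most common phase, which may differ from the exact definition used in `leakage_diagnostic.json`. `value_purity` is a hypothetical helper; the toy rows reuse the 65% / 29% / 6% split quoted above for `network_flow`, with `other_phase` as a placeholder label.

```python
from collections import Counter, defaultdict


def value_purity(feature_values, labels):
    """For each feature value, the share of its rows held by the most common label."""
    by_value = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        by_value[value][label] += 1
    return {v: max(c.values()) / sum(c.values()) for v, c in by_value.items()}


# 100 toy rows mirroring the quoted network_flow split.
features = ["network_flow"] * 100
phases = (
    ["exfiltration_or_impact"] * 65
    + ["benign_background"] * 29
    + ["other_phase"] * 6
)

purity = value_purity(features, phases)
print(purity["network_flow"])  # 0.65
```

A purity of 1.0 for some value would flag a deterministic encoding worth dropping; values well below 1.0 indicate real overlap.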

## Evaluation

### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9904** |
| Accuracy | **0.9493** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9478 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9861** |
| Accuracy | **0.9412** |
| Macro-F1 | 0.7534 |
| Weighted-F1 | 0.9396 |

XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941,
macro-F1 0.778 vs 0.753). The gap is consistent across seeds.

### Multi-seed robustness (XGBoost, 10 seeds)

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |

**Tightest ROC-AUC std in the XpertSystems catalog** (0.001). All 10 seeds yielded
all 5 classes in the test fold. Full per-seed results in
[`multi_seed_results.json`](./multi_seed_results.json).

### Per-class F1 (seed 42)

| Phase | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `benign_background` | 56.9% | **0.998** | 0.994 |
| `exfiltration_or_impact` | 28.3% | **0.987** | 0.981 |
| `initial_access` | 7.6% | 0.720 | 0.651 |
| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
| `lateral_movement` | 4.4% | **0.483** | 0.451 |

The two largest classes (`benign_background` and `exfiltration_or_impact`)
are nearly perfectly separable: `benign_background` because the
non-oracle features (severity, CVSS, log_source) still cleanly separate
non-malicious traffic, and `exfiltration_or_impact` because it is
dominated by `network_flow` events (C2 beacons). The three smaller
classes overlap substantially in feature space; `lateral_movement` is
the hardest (F1 0.48) because its events look similar to
`initial_access` events at the per-event level. A sequence model that
considers event ordering within an incident would likely do better
than the per-event baseline.
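
The gap between accuracy (0.95) and macro-F1 (0.78) follows directly from the table: macro-F1 is the unweighted mean of the per-class F1 scores, so the three small classes count as much as the two large ones. The arithmetic below uses only the seed-42 values from the table above.

```python
# Per-class F1 at seed 42, copied from the table above.
per_class_f1 = {
    "xgboost": {
        "benign_background": 0.998,
        "exfiltration_or_impact": 0.987,
        "initial_access": 0.720,
        "persistence_establishment": 0.703,
        "lateral_movement": 0.483,
    },
    "mlp": {
        "benign_background": 0.994,
        "exfiltration_or_impact": 0.981,
        "initial_access": 0.651,
        "persistence_establishment": 0.690,
        "lateral_movement": 0.451,
    },
}

# Macro-F1 weights every class equally, regardless of its share of events.
macro_f1 = {
    model: sum(scores.values()) / len(scores)
    for model, scores in per_class_f1.items()
}
print(round(macro_f1["xgboost"], 3))  # 0.778
print(round(macro_f1["mlp"], 3))      # 0.753
```

Both values match the reported macro-F1 figures (0.7781 and 0.7534) up to rounding of the per-class scores.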

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
|---|---:|---:|---:|---:|---:|
| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | — | — |
| No `event_class` | 0.9206 | 0.5969 | 0.9723 | **−0.0287** | **−0.181** |
| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |

Three findings:

1. **`event_class` is the dominant signal** (macro-F1 drops 18pp when
   it is removed). Without it, the model loses most of its
   discrimination between the three smaller classes.
2. **CVSS features are second-strongest** (macro-F1 drops 3pp). They
   capture severity information that complements `event_class`.
3. **Host features and timing add modest noise.** The model performs
   marginally *better* without host features (+0.3pp accuracy), and
   timing features contribute essentially nothing. Both are kept in the
   pipeline as a documented baseline reference.
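
A leave-one-group-out ablation like the one in the table can be organised as a small loop. This is only a sketch of the procedure, not the private training script: `run_ablation`, the column names, the group mapping, and `toy_evaluate` are all hypothetical, and in practice `evaluate` would retrain the model on the kept columns and return a test metric.

```python
from typing import Callable, Dict, List, Tuple


def run_ablation(
    all_columns: List[str],
    groups: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Score the full feature set, then rescore with each feature group removed."""
    baseline = evaluate(all_columns)
    results = {}
    for name, group_columns in groups.items():
        removed = set(group_columns)
        kept = [c for c in all_columns if c not in removed]
        score = evaluate(kept)
        results[name] = {"score": score, "delta": score - baseline}
    return baseline, results


# Toy stand-in for "retrain and score": each column adds a fixed amount.
TOY_VALUE = {"event_class_a": 0.5, "cvss_score_analogue": 0.2, "hour_of_day": 0.0}


def toy_evaluate(columns: List[str]) -> float:
    return sum(TOY_VALUE[c] for c in columns)


baseline, results = run_ablation(
    list(TOY_VALUE),
    {"No event_class": ["event_class_a"], "No timing": ["hour_of_day"]},
    toy_evaluate,
)
for name, row in results.items():
    print(name, round(row["delta"], 3))
```

With one-hot encoded inputs, each group needs to list every encoded column derived from the original field so the whole feature is removed at once.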

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.

Training hyperparameters are held internally by XpertSystems.
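
The MLP shape described above can be written out in PyTorch as follows. This is a sketch of the stated layer sequence only: the module layout and parameter names are assumptions and may not match the keys stored in `model_mlp.safetensors`, and `build_mlp` is a hypothetical helper.

```python
import torch
from torch import nn


def build_mlp(n_features: int = 87, n_classes: int = 5, dropout: float = 0.3) -> nn.Sequential:
    """87 -> 128 -> 64 -> 5, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""

    def hidden(in_dim: int, out_dim: int) -> list:
        return [
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        ]

    return nn.Sequential(
        *hidden(n_features, 128),
        *hidden(128, 64),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied by the loss
    )


model = build_mlp().eval()  # eval() disables dropout and uses BatchNorm running stats
with torch.no_grad():
    logits = model(torch.zeros(4, 87))  # dummy batch of 4 standardized feature rows
print(tuple(logits.shape))  # (4, 5)
```

Inputs to the real model are expected to be standardized with the mean/std in `feature_scaler.json` before the forward pass.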

## Limitations

**This is a baseline reference, not a production phase classifier.**

1. **The leakage diagnostic is required reading.** Six oracle columns
   for the phase task and seven for the alert TP task are documented
   in `leakage_diagnostic.json`. If you use CYB010 sample data for
   your own training, you MUST drop these or your model will learn
   the oracles instead of the task.

2. **`lateral_movement` is the weakest class (F1 0.48).** It has only
   968 events in the sample and overlaps substantially with
   `initial_access`. A sequence model that considers event ordering
   within incidents would likely do better than per-event classification.

3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable
   on this sample** (acc 0.55 vs majority 0.61). The 5-class
   formulation with benign included works only because benign_user
   separation is structurally trivial.

4. **`event_class` 12-class is unlearnable on this sample** (acc 0.35
   vs majority 0.42). event_class is a structural property of the
   event itself, not something to predict from other features.

5. **Synthetic-vs-real transfer.** The dataset is synthetic, calibrated
   to 6 benchmark metrics drawn from SANS / IBM / Mandiant / Verizon /
   CISA / MITRE ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise
   characteristics — and in particular, the explicit `mitre_tactic ==
   "benign"` marker and `threat_actor_id == "NONE"` benign sentinel
   would not be present in real data. Real telemetry has implicit
   benign-vs-malicious distinctions that emerge from event content.
   Do not assume metrics transfer end-to-end.

6. **21,896 events / 500 incidents is a modest training set.** The
   3,726-event / ~75-incident test fold yields stable multi-seed
   metrics (std 0.007 on accuracy) but per-class confidence intervals
   widen for the smallest classes (lateral_movement, persistence).

## Notes on dataset schema

The CYB010 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `security_events` has 16 columns | Data has **23 columns** |
| Field renames (`security_events`) | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
| `alert_records` has 9 columns | Data has **21 columns** |
| Field renames (`alert_records`) | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
| README's `ioc_matched` | Not present in `alert_records` |
| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
| `incident_summary` has 8 columns | Data has **24 columns** |
| `host_inventory` has 6 columns | Data has **15 columns** |
| `threat_actor_profile` has 4 values | Data has **5 values** (adds `benign_user` at 57% of events) |
| `attack_lifecycle_phase` 5-phase malicious lifecycle | Data adds `benign_background` as a phase value (57% of events) — so the lifecycle is 5-class with benign included |
| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |

None of these affects model correctness — the feature pipeline uses
the actual column names. If you build your own pipeline against the
dataset, use the actual columns.
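
If you already have code written against the README's column names, the renames in the table can be captured in a small mapping. The name pairs below are taken from the table; `README_TO_ACTUAL` and `to_actual_columns` are hypothetical names introduced for this sketch.

```python
# README column name -> actual column name, per table (from the schema notes above).
README_TO_ACTUAL = {
    "security_events": {
        "timestamp_utc": "timestamp",
        "user": "user_id",
        "log_format": "log_source_type",
    },
    "alert_records": {
        "alert_severity": "severity_level",
        "detection_rule": "alert_rule_name",
    },
}


def to_actual_columns(table: str, columns: list) -> list:
    """Translate README-style column names to the names present in the data."""
    mapping = README_TO_ACTUAL.get(table, {})
    return [mapping.get(name, name) for name in columns]


print(to_actual_columns("security_events", ["timestamp_utc", "user", "event_class"]))
print(to_actual_columns("alert_records", ["alert_severity", "detection_rule"]))
```

The mapping only covers renames. Columns the README lists but the data lacks (`command_line`, `process_name`, `ioc_matched`) have no replacement and need to be removed from downstream code.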

## Intended use

- **Evaluating fit** of the CYB010 dataset for your SIEM ML research
- **Baseline reference** for new model architectures on the
  attack-phase classification task
- **Reference example of structural-leakage diagnostics** for
  synthetic SIEM datasets — the methodology is reusable
- **Feature engineering reference** for per-event SIEM telemetry

## Out-of-scope use

- Production SIEM phase detection on real telemetry
- Threat actor attribution (4-class malicious-only is unlearnable
  on the sample)
- Event-class prediction (this is a structural property, not a
  learnable target)
- Any operational decision affecting actual security operations
  without further validation on your own data

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published
sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in `multi_seed_results.json` confirm robust performance across splits
(std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std
in the XpertSystems catalog).

The training script itself is private to XpertSystems.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| **`leakage_diagnostic.json`** | **11-oracle-path audit + 2 unlearnable targets** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB010** dataset contains **~550,000 rows** across four files,
with calibrated benchmark validation against 6 metrics drawn from
authoritative SOC operations and threat intelligence sources (SANS SOC
Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA
Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
  - https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)

## Citation

```bibtex
@misc{xpertsystems_cyb010_baseline_2026,
  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
}
```