---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - cybersecurity
  - soc-operations
  - alert-triage
  - mitre-attack
  - soar
  - siem
  - tabular-classification
  - synthetic-data
  - xgboost
  - baseline
  - leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
  - xpertsystems/cyb008-sample
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: cyb008-baseline-classifier
    results:
      - task:
          type: tabular-classification
          name: 5-class SOC alert triage outcome classification
        dataset:
          type: xpertsystems/cyb008-sample
          name: CYB008 Synthetic SOC Alert Dataset (Sample)
        metrics:
          - type: roc_auc
            value: 0.9522
            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
          - type: accuracy
            value: 0.7659
            name: Test accuracy (XGBoost, seed 42)
          - type: f1
            value: 0.7430
            name: Test macro-F1 (XGBoost, seed 42)
          - type: accuracy
            value: 0.777
            name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.955
            name: Multi-seed ROC-AUC mean ± 0.003 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.9552
            name: Test macro ROC-AUC OvR (MLP, seed 42)
          - type: accuracy
            value: 0.7674
            name: Test accuracy (MLP, seed 42)
          - type: f1
            value: 0.7510
            name: Test macro-F1 (MLP, seed 42)
---

# CYB008 Baseline Classifier

**SOC alert triage classifier trained on the CYB008 synthetic SOC alert
sample. It predicts which of 5 triage outcome classes
(`auto_resolved_soar` / `duplicate_merged` / `false_positive_closed` /
`true_positive_remediated` / `true_positive_escalated`) an alert
will reach, using per-alert features. It also ships a leakage diagnostic
for the three structural-oracle columns dropped from the feature
pipeline.**

> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `resolution_outcome` (the
> primary product), and (2) a `leakage_diagnostic.json` file
> documenting (a) the three structural oracle columns that were
> dropped from the feature set, and (b) the separate finding that
> MITRE ATT&CK tactic classification, one of the use cases the
> dataset README suggests, is **not learnable** on this sample. Both
> files matter; the diagnostic is required reading for anyone
> evaluating CYB008 for a triage product.

## Model overview

| Property | Value |
|---|---|
| Primary task | 5-class `resolution_outcome` classification (SOC alert triage) |
| Secondary artifact | `leakage_diagnostic.json` — structural oracle + unlearnable-target audit |
| Training data | `xpertsystems/cyb008-sample` (9,200 alerts) |
| Models | XGBoost + PyTorch MLP |
| Input features | 53 (after one-hot encoding) |
| Split | **Stratified random** (no natural group key in this dataset — see rationale below) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + leakage diagnostic |

## Why this task — and what was dropped

The CYB008 README lists **alert triage (TP vs FP prediction)** as its
first suggested use case and **MITRE ATT&CK tactic classification** as
its second. We piloted both on the sample dataset:

- **Triage outcome:** learnable without leakage. After dropping 3
  structural oracle columns, the model achieves **acc 0.777 ± 0.007,
  ROC-AUC 0.955 ± 0.003** on 5-class classification (XGBoost, 10
  seeds). This is the primary baseline.

- **MITRE tactic classification:** **does NOT work on this sample.**
  Without `mitre_technique_id` (which is a perfect ATT&CK-by-design
  oracle), the per-tactic feature distributions are nearly identical
  (raw_score 0.37–0.39 across all 12 tactics, similar for enriched
  score and fatigue). A trained XGBoost achieves accuracy 0.08,
  below the majority baseline of 0.14. The README's stated use case
  cannot be honestly demonstrated on the sample. See
  [`leakage_diagnostic.json`](./leakage_diagnostic.json) for the full
  finding and our recommendation to the dataset author.

### The three structural oracle columns (dropped)

CYB008 has three columns that structurally encode the
`resolution_outcome` label:

| Column | Oracle relationship |
|---|---|
| `alert_lifecycle_phase` | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
| `automation_resolved` | Exact 1:1 with `auto_resolved_soar` outcome |
| `escalation_flag` | 1319 escalation flags = 1319 `true_positive_escalated` outcomes (near-1:1) |

With all three present, plain XGBoost achieves **100% test accuracy
across all seeds**: mechanical, not learned. With all three dropped,
accuracy is **0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003** across 10
seeds: real learning on a non-trivial 5-class task. The published
baseline trains with these three columns excluded.
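
The kind of check that surfaces a structural oracle can be sketched with a per-value purity table. This is a minimal illustration, not the audit code behind `leakage_diagnostic.json`: the helper name `oracle_purity` and the toy rows are assumptions, and `triaged` is a hypothetical stand-in for the fourth lifecycle value, which this card does not name.

```python
import pandas as pd


def oracle_purity(df: pd.DataFrame, column: str, target: str) -> pd.DataFrame:
    """For each value of `column`, report the dominant target class and its share."""
    table = pd.crosstab(df[column], df[target])
    return pd.DataFrame({
        "n_rows": table.sum(axis=1),
        "dominant_outcome": table.idxmax(axis=1),
        "purity": table.max(axis=1) / table.sum(axis=1),
    })


# Toy rows for illustration only (not drawn from CYB008).
# "triaged" is a hypothetical placeholder for the non-deterministic phase.
toy = pd.DataFrame({
    "alert_lifecycle_phase": [
        "auto_closed", "auto_closed", "escalated",
        "triaged", "triaged", "suppressed_duplicate",
    ],
    "resolution_outcome": [
        "auto_resolved_soar", "auto_resolved_soar", "true_positive_escalated",
        "false_positive_closed", "true_positive_remediated", "duplicate_merged",
    ],
})

report = oracle_purity(toy, "alert_lifecycle_phase", "resolution_outcome")
print(report)
```

A value with purity 1.0 across many rows maps deterministically to one outcome, which is the pattern that motivated dropping the three columns.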

Two model artifacts are published. They are designed to be used
together — disagreement is a useful triage signal:

- `model_xgb.json` — gradient-boosted trees
- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format

On CYB008 the MLP slightly outperforms XGBoost on the test fold
(0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42) — only
the second SKU in the XpertSystems baseline catalog where this
happens (after CYB007).
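
One way to use the two models together is to flag alerts where their top predictions differ. The sketch below assumes you already have class-probability arrays from each model; the function name `disagreement_mask` and the probability values are illustrative, not output from the published artifacts.

```python
import numpy as np


def disagreement_mask(proba_a: np.ndarray, proba_b: np.ndarray) -> np.ndarray:
    """Return True for rows where the two models pick different top classes."""
    return proba_a.argmax(axis=1) != proba_b.argmax(axis=1)


# Illustrative probabilities for 3 alerts over 5 classes (not real model output).
proba_xgb = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.20, 0.35, 0.25],
    [0.05, 0.05, 0.80, 0.05, 0.05],
])
proba_mlp = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.20, 0.25, 0.35],
    [0.10, 0.05, 0.70, 0.10, 0.05],
])

needs_review = disagreement_mask(proba_xgb, proba_mlp)
print(needs_review)
```

Here only the second alert is flagged, because the two models rank different classes first for it.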

## Quick start

```bash
pip install xgboost scikit-learn torch safetensors pandas numpy huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb008-baseline-classifier"

# Download the artifacts. The MLP weights and scaler are fetched here, but
# this snippet only runs the XGBoost model.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the bundled feature pipeline importable from the download directory.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_alert_record` is a placeholder for your own alert record.
# Note: do NOT include alert_lifecycle_phase, automation_resolved, or
# escalation_flag in your record - those were the oracle columns.
X = transform_single(my_alert_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB008, 9,200 per-alert records:

| Outcome | Alerts | Class share |
|---|---:|---:|
| `false_positive_closed` | 2,996 | 32.6% |
| `auto_resolved_soar` | 2,642 | 28.7% |
| `true_positive_remediated` | 1,848 | 20.1% |
| `true_positive_escalated` | 1,319 | 14.3% |
| `duplicate_merged` | 395 | 4.3% |
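
For context on the accuracy figures reported in this card, the class counts in the table above imply a majority-class baseline. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Class counts copied from the table above.
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total

print(total, majority_class, round(majority_baseline, 3))
# 9200 false_positive_closed 0.326
```

Always predicting `false_positive_closed` would therefore score about 0.326 accuracy on the full sample.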

### Stratified split (no natural group key)

CYB008 does not have a natural row-level group key for group-aware
splitting:
- 25 analysts — group-aware split would yield only ~4 test analysts
- 5 SOCs — would yield 1 test SOC
- 589 incidents — only 9% of alerts have a non-null `incident_id`

Alerts are essentially independent given features, so we use
**StratifiedShuffleSplit** (nested 70/15/15), the same approach as
CYB001 for network flow classification:

| Fold | Alerts |
|---|---:|
| Train | 6,440 |
| Validation | 1,380 |
| Test | 1,380 |

Class imbalance is addressed with balanced class weights, passed to
XGBoost as per-row `sample_weight`, and with weighted cross-entropy for
the MLP.
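
The split and weighting described above can be sketched with scikit-learn. This is an illustration under stated assumptions rather than the private training script: the labels are synthetic (built from the class counts in the training-data table), features are omitted, and the exact seeds and call order used for the published folds are not known.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic integer labels with the documented class counts (features omitted).
counts = [2996, 2642, 1848, 1319, 395]
y = np.repeat(np.arange(5), counts)
idx = np.arange(len(y))

# Outer split: 70% train, 30% held out.
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
train_idx, hold_idx = next(outer.split(idx, y))

# Inner split: halve the held-out 30% into validation and test.
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
val_rel, test_rel = next(inner.split(hold_idx, y[hold_idx]))
val_idx, test_idx = hold_idx[val_rel], hold_idx[test_rel]

# Balanced per-row weights: n_samples / (n_classes * count of the row's class).
w_train = compute_sample_weight(class_weight="balanced", y=y[train_idx])

print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting `w_train` array is the kind of value that would be passed as `sample_weight` when fitting an XGBoost classifier; rows from the rarest class receive the largest weight.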

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
53 features survive after encoding, drawn from:

- **Per-alert numeric** (9): `raw_score`, `enriched_score`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `soar_playbook_triggered`, `sla_breached_flag`, `mttd_minutes`, `mttr_minutes`, `fatigue_score_at_alert`
- **Per-alert categorical** (5, one-hot): `alert_severity` (7 values), `alert_source` (8 values), `mitre_tactic` (12 values), `analyst_tier` (3 values), `siem_platform` (8 values)
- **Engineered** (6): `enrichment_lift`, `log_mttr`, `log_mttd`, `queue_pressure`, `enrichment_per_minute`, `is_high_confidence`
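
The feature count follows from the list above: 9 numeric columns, 38 one-hot columns, and 6 engineered columns. The sketch below checks that arithmetic and shows one common way to one-hot encode against a fixed level list so that column order stays stable at inference time. It is not the bundled `feature_engineering.py`; the helper `one_hot_fixed` is hypothetical, and in practice the level lists would come from `feature_meta.json`.

```python
import pandas as pd

# Feature count implied by the list above.
n_numeric = 9
n_one_hot = 7 + 8 + 12 + 3 + 8  # severity, source, tactic, tier, SIEM platform
n_engineered = 6
n_features = n_numeric + n_one_hot + n_engineered
print(n_features)  # 53

# The three analyst_tier values documented for the alerts table.
ANALYST_TIER_LEVELS = ["L1_junior", "L2_senior", "L3_threat_hunter"]


def one_hot_fixed(series: pd.Series, levels: list, prefix: str) -> pd.DataFrame:
    """One-hot encode against a fixed level list; unseen values become all-zero rows."""
    cat = pd.Series(pd.Categorical(series, categories=levels), index=series.index)
    return pd.get_dummies(cat, prefix=prefix, dtype=float)


# The last value is not an alert-level tier, so it encodes as all zeros.
tiers = pd.Series(["L1_junior", "L3_threat_hunter", "L4_incident_commander"])
encoded = one_hot_fixed(tiers, ANALYST_TIER_LEVELS, "analyst_tier")
print(encoded)
```

Encoding against a fixed list keeps every expected column present even when a batch contains only some of the levels.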

### Excluded columns

**Oracle columns** (dropped to allow honest evaluation):

| Column | Why excluded |
|---|---|
| `alert_lifecycle_phase` | 3 of 4 values are deterministic outcome oracles |
| `automation_resolved` | 1:1 with `auto_resolved_soar` outcome |
| `escalation_flag` | Near-1:1 with `true_positive_escalated` outcome |

**High-cardinality columns** (dropped for tractability):

| Column | Why excluded |
|---|---|
| `mitre_technique_id` | 36 unique values; perfect oracle for `mitre_tactic` but unrelated to this target |
| `detection_rule_id` | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |

### Partial-oracle features (kept as legitimate observables)

`soar_playbook_triggered` is a *necessary but not sufficient* condition
for `auto_resolved_soar` — when 0, the alert is never auto-resolved;
when 1, the outcome is auto-resolved 68% of the time but can also be
TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is
a legitimate observable that downstream operators would already have
on hand at decision time. KEPT in the pipeline.

## Evaluation

### Test-set metrics, seed 42 (n = 1,380 alerts)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9522** |
| Accuracy | **0.7659** |
| Macro-F1 | 0.7430 |
| Weighted-F1 | 0.7672 |

**MLP** (the published `model_mlp.safetensors` artifact) — **slightly outperforms XGBoost**

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9552** |
| Accuracy | **0.7674** |
| Macro-F1 | 0.7510 |
| Weighted-F1 | 0.7691 |

With 6,440 training rows and 53 features, the MLP has enough data to
compete favorably with boosted trees. Both models are published.

### Multi-seed robustness (XGBoost, 10 seeds)

Very stable performance — std 0.007 on accuracy is among the tightest
in the XpertSystems catalog:

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
| Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
| Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 5 classes in the test fold (stratified split
guarantees this).

### Per-class F1 (seed 42)

| Outcome | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `false_positive_closed` | 32.6% | **0.904** | 0.910 |
| `duplicate_merged` | 4.3% | 0.794 | 0.825 |
| `auto_resolved_soar` | 28.7% | 0.757 | 0.751 |
| `true_positive_remediated` | 20.1% | 0.701 | 0.698 |
| `true_positive_escalated` | 14.3% | 0.559 | 0.571 |

The model performs best on `false_positive_closed` (clearest behavioural
profile — low scores, fast resolution by L1 analysts) and
`duplicate_merged` (smallest class but distinctive — duplicate-suppressed
severity is a strong tell). The hardest discrimination is between
`true_positive_remediated` and `true_positive_escalated` — both are
genuine threats, differing primarily by whether the alert was closed
by the original analyst or passed to a higher tier. In production this
matters less because both are TP outcomes; binary TP-vs-FP recall is
much higher.
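
To see why confusion between the two TP classes does not hurt a binary view, the 5-class labels can be collapsed to TP versus everything else before computing recall. The sketch below uses hand-written toy labels (not model output), and the helper names are illustrative.

```python
import numpy as np

TP_CLASSES = {"true_positive_remediated", "true_positive_escalated"}


def to_binary(labels) -> np.ndarray:
    """Map 5-class outcome labels to True (any TP outcome) or False."""
    return np.array([label in TP_CLASSES for label in labels])


def recall(y_true_bin: np.ndarray, y_pred_bin: np.ndarray) -> float:
    """Fraction of true positives that were also predicted positive."""
    tp = np.sum(y_true_bin & y_pred_bin)
    fn = np.sum(y_true_bin & ~y_pred_bin)
    return float(tp / (tp + fn))


# Toy labels for illustration only.
y_true = ["true_positive_escalated", "true_positive_remediated",
          "false_positive_closed", "true_positive_escalated"]
y_pred = ["true_positive_remediated", "true_positive_remediated",
          "false_positive_closed", "auto_resolved_soar"]

binary_recall = recall(to_binary(y_true), to_binary(y_pred))
print(round(binary_recall, 3))
```

In this toy case the first alert is wrong in the 5-class sense (escalated predicted as remediated) but still counts as a correctly detected TP, so binary recall is 2 of 3.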

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | — |
| No alert severity | 0.5138 | 0.3933 | 0.7304 | **−0.2522** |
| No `soar_playbook_triggered` | 0.6188 | 0.5773 | 0.8369 | **−0.1471** |
| No analyst tier | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
| No siem platform | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
| No alert source | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
| No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
| No mitre_tactic | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
| No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
| No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |

Four findings:

1. **Alert severity carries the dominant signal** (drops 25 pp
   accuracy, 22 pp ROC-AUC). This is intuitive: severity directly
   drives triage priority, which drives outcome. `false_positive`
   severity → `false_positive_closed`; `duplicate_suppressed` severity
   → `duplicate_merged`.
2. **`soar_playbook_triggered` is the second-strongest signal**
   (drops 15 pp accuracy). It's a partial oracle for the
   `auto_resolved_soar` outcome class.
3. **MITRE tactic and analyst tier contribute essentially nothing.**
   The model performs marginally *better* without them — they add
   noise that the trees over-fit on the training set.
4. **Engineered features and timing features are near-flat.** The
   trees recover composites from raw inputs. Kept in the pipeline as
   a documented baseline reference.
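
The ablation procedure itself (drop one feature group, retrain, compare against the full model) can be sketched on synthetic data. This is not the code behind `ablation_results.json`: it swaps in a small scikit-learn decision tree instead of XGBoost, and the data, group names, and helper `accuracy_without` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one informative column and three pure-noise columns.
rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=(n, 1))
noise = rng.normal(size=(n, 3))
X = np.hstack([signal, noise])
y = (signal[:, 0] > 0).astype(int)

# Feature groups expressed as column indices.
groups = {"signal": [0], "noise": [1, 2, 3]}
train, test = slice(0, 400), slice(400, n)


def accuracy_without(drop_cols) -> float:
    """Retrain with the given columns removed and return test accuracy."""
    keep = [c for c in range(X.shape[1]) if c not in drop_cols]
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train][:, keep], y[train])
    return float((model.predict(X[test][:, keep]) == y[test]).mean())


baseline = accuracy_without([])
deltas = {name: accuracy_without(cols) - baseline for name, cols in groups.items()}
print(baseline, deltas)
```

A large negative delta marks a group the model depends on, while a delta near zero marks a group it can do without, which is how the table above should be read.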

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `53 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
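
The MLP layout described above can be written as a small PyTorch module. This is a sketch of the stated architecture only: the use of `nn.Sequential`, the function name `build_mlp`, and the resulting parameter names are assumptions, so the state-dict keys in `model_mlp.safetensors` may not match this module without remapping.

```python
import torch
from torch import nn


def build_mlp(n_features: int = 53, n_classes: int = 5, dropout: float = 0.3) -> nn.Sequential:
    """53 -> 128 -> 64 -> 5, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(64, n_classes),
    )


# Untrained weights; eval() disables dropout and uses BatchNorm running stats.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 53))
    proba = torch.softmax(logits, dim=1)

print(proba.shape)
```

The forward pass returns logits; applying softmax gives per-class probabilities comparable to `predict_proba` from the XGBoost model.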

Training hyperparameters are held internally by XpertSystems.

## Limitations

**This is a baseline reference, not a production SOC triage system.**

1. **MITRE tactic classification is unlearnable on this sample.** The
   README lists it as a suggested use case but the per-tactic feature
   distributions are too similar (raw_score 0.37–0.39 across all 12
   tactics). See [`leakage_diagnostic.json`](./leakage_diagnostic.json)
   for the full audit. Real SOC data has stronger per-tactic feature
   signatures.

2. **TP-remediated vs TP-escalated is the hardest discrimination.**
   F1 0.56 on TP-escalated is the weakest per-class result. Both are
   genuine threats; the difference is workflow rather than threat
   nature. For most operational uses (TP-vs-FP recall, SLA-breach
   reduction), this confusion does not matter.

3. **MLP modestly outperforms XGBoost.** Both are shipped; we
   recommend running both and treating disagreement as a triage
   signal. The gap is small enough that for production deployment,
   the choice between them is essentially an engineering
   preference.

4. **Synthetic-vs-real transfer.** The dataset is synthetic and
   calibrated to 12 SOC-operations benchmark metrics drawn from the
   SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends,
   Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS,
   CrowdStrike, Splunk State of Security, and Verizon DBIR. Real SOC
   telemetry has different noise characteristics, and the
   structural-oracle pattern documented above (`alert_lifecycle_phase`
   deterministically encoding outcome) would not be present in real
   data, where lifecycle phases transition stochastically. Do not
   assume metrics transfer end-to-end.

5. **9,200 alerts is a modest training set.** The 1,380-alert test
   fold yields stable multi-seed metrics (std 0.007), but full
   confidence intervals for downstream production decisions should
   come from the full ~280k-alert product.

## Notes on dataset schema

The CYB008 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `incident_summary` has 8 columns | Data has **23 columns** including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
| `alert_severity` has 6 values (info / low / medium / high / critical / false_positive) | **7 values**: adds `duplicate_suppressed`. Values use longer names than the README lists (e.g. `high_severity`, `low_severity`, `critical_confirmed`, `informational`). |
| `analyst_tier` has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (`L1_junior`, `L2_senior`, `L3_threat_hunter`); 4 on `soc_topology` (adds `L4_incident_commander`). |
| 14 MITRE ATT&CK tactics | 12 tactics in the data (no `reconnaissance` or `resource_development` from PRE-ATT&CK). |
| Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is `alert_source` (not `detection_source`); 8 values: `edr_behavioural_engine`, `nids_signature`, `ueba_user_anomaly`, `cspm_cloud_rule`, `siem_correlation_rule`, `threat_intel_ioc_match`, `honeypot_trigger`, `itdr_identity_anomaly`. |
| `triage_score` / `enrichment_score` columns | Actual names: `raw_score` / `enriched_score`. |
| `alert_timestamp` (ISO string) | Actual: `alert_timestamp_min` (integer minutes from epoch). |
| `kill_chain_stage`, `storm_event_flag` columns on alerts | Not present in the data. |
| `resolution_outcome` values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_escalated`, `true_positive_remediated`. |
| Extra columns in data not in README | `shift_id`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `fatigue_score_at_alert`, `siem_platform`, `soar_playbook_id`, `detection_rule_id`, `alert_lifecycle_phase` |

None of these affects model correctness — the feature pipeline uses
the actual column names. If you build your own pipeline against the
dataset, use the actual columns.

## Intended use

- **Evaluating fit** of the CYB008 dataset for your SOC-triage research
- **Baseline reference** for new model architectures
- **Reference example of structural-leakage diagnostics** in
  synthetic SOC datasets — the diagnostic methodology is reusable
- **Feature engineering reference** for per-alert SOC telemetry

## Out-of-scope use

- Production SOC triage decisions on real telemetry
- MITRE ATT&CK tactic prediction (this baseline establishes that the
  task is unlearnable on the sample)
- SLA-breach prediction (also tested as unlearnable on the sample —
  acc 0.68 vs majority 0.82)
- Any operational decision affecting actual security operations
  without further validation on your own data

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
nested `StratifiedShuffleSplit` (70/15/15), on the published sample
(`xpertsystems/cyb008-sample`, version 1.0.0, generated 2026-05-16).
The feature pipeline in `feature_engineering.py` is deterministic and
the trained weights in this repo correspond exactly to the metrics
above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
in `multi_seed_results.json` confirm robust performance across splits.

The training script itself is private to XpertSystems.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| `leakage_diagnostic.json` | **Structural-oracle audit + unlearnable-target finding** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB008** dataset contains ~335,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative SOC operations and threat intelligence sources (SANS
SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester
Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk
State of Security, Verizon DBIR). The full XpertSystems.ai synthetic
data catalog spans 41 SKUs across Cybersecurity, Healthcare,
Insurance & Risk, Oil & Gas, and Materials & Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)

## Citation

```bibtex
@misc{xpertsystems_cyb008_baseline_2026,
  title  = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb008-sample}
}
```