---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- soc-operations
- alert-triage
- mitre-attack
- soar
- siem
- tabular-classification
- synthetic-data
- xgboost
- baseline
- leakage-diagnostic
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb008-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb008-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 5-class SOC alert triage outcome classification
    dataset:
      type: xpertsystems/cyb008-sample
      name: CYB008 Synthetic SOC Alert Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9522
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.7659
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7430
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.777
      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.955
      name: Multi-seed ROC-AUC mean ± 0.003 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.9552
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.7674
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.7510
      name: Test macro-F1 (MLP, seed 42)
---

# CYB008 Baseline Classifier

**SOC alert triage classifier trained on the CYB008 synthetic SOC alert sample. Predicts which of 5 triage outcome classes (`auto_resolved_soar` / `duplicate_merged` / `false_positive_closed` / `true_positive_remediated` / `true_positive_escalated`) an alert will reach, from per-alert features.
It also ships a leakage diagnostic for the three structural-oracle columns dropped from the feature pipeline.**

> **Read this first.** This repo ships two related artifacts:
> (1) a working baseline classifier for `resolution_outcome` (the
> primary product), and (2) a `leakage_diagnostic.json` file
> documenting (a) the three structural oracle columns that were
> dropped from the feature set, and (b) the separate finding that the
> dataset README's second suggested use case, MITRE ATT&CK tactic
> classification, is **not learnable** on this sample. Both files
> matter; the diagnostic is required reading for anyone evaluating
> CYB008 for a triage product.

## Model overview

| Property | Value |
|---|---|
| Primary task | 5-class `resolution_outcome` classification (SOC alert triage) |
| Secondary artifact | `leakage_diagnostic.json`: structural oracle + unlearnable-target audit |
| Training data | `xpertsystems/cyb008-sample` (9,200 alerts) |
| Models | XGBoost + PyTorch MLP |
| Input features | 53 (after one-hot encoding) |
| Split | **Stratified random** (no natural group key in this dataset; see rationale below) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline + leakage diagnostic |

## Why this task, and what was dropped

The CYB008 README lists **alert triage (TP vs FP prediction)** as its first suggested use case and **MITRE ATT&CK tactic classification** as its second. We piloted both on the sample dataset:

- **Triage outcome:** learnable without leakage. After dropping 3 structural oracle columns, the model achieves **acc 0.777 ± 0.007, ROC-AUC 0.955 ± 0.003** on 5-class classification. This is the primary baseline.
- **MITRE tactic classification:** **does NOT work on this sample.** Without `mitre_technique_id` (which is a perfect ATT&CK-by-design oracle), the per-tactic feature distributions are nearly identical (raw_score 0.37–0.39 across all 12 tactics, similar for enriched score and fatigue). A trained XGBoost achieves accuracy 0.08, below the majority baseline of 0.14. The README's stated use case cannot be honestly demonstrated on the sample. See [`leakage_diagnostic.json`](./leakage_diagnostic.json) for the full finding and our recommendation to the dataset author.

### The three structural oracle columns (dropped)

CYB008 has three columns that structurally encode the `resolution_outcome` label:

| Column | Oracle relationship |
|---|---|
| `alert_lifecycle_phase` | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
| `automation_resolved` | Exact 1:1 with `auto_resolved_soar` outcome |
| `escalation_flag` | 1,319 escalation flags = 1,319 `true_positive_escalated` outcomes (near-1:1) |

With all three present, plain XGBoost achieves **100% test accuracy across all seeds**, which is mechanical rather than learned. With all three dropped, accuracy is **0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003** across 10 seeds: real learning on a non-trivial 5-class task. The published baseline trains with these three columns excluded.

Two model artifacts are published. They are designed to be used together, because disagreement between them is a useful triage signal:

- `model_xgb.json`: gradient-boosted trees
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

On CYB008 the MLP slightly outperforms XGBoost on the test fold (0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42). This is only the second SKU in the XpertSystems baseline catalog where this happens (after CYB007).
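
One way to use the two models together is to compare their top-1 predictions per alert and send disagreements to a human reviewer. The sketch below is illustrative only: the helper `flag_disagreement`, the probability values, and the label ordering are assumptions, not part of the published artifacts (the real ordering comes from `INT_TO_LABEL` in `feature_engineering.py`).

```python
from typing import List, Sequence

# Hypothetical label order for illustration; the real mapping is INT_TO_LABEL
# in feature_engineering.py and may differ.
LABELS = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_escalated",
    "true_positive_remediated",
]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def flag_disagreement(xgb_proba, mlp_proba) -> List[dict]:
    """Compare the top-1 class of two models, one entry per alert."""
    results = []
    for p_xgb, p_mlp in zip(xgb_proba, mlp_proba):
        i_xgb, i_mlp = argmax(p_xgb), argmax(p_mlp)
        results.append({
            "xgb": LABELS[i_xgb],
            "mlp": LABELS[i_mlp],
            "disagree": i_xgb != i_mlp,
        })
    return results


# Made-up probabilities for two alerts (not real model output).
xgb_out = [[0.05, 0.02, 0.80, 0.05, 0.08], [0.10, 0.05, 0.05, 0.45, 0.35]]
mlp_out = [[0.04, 0.03, 0.85, 0.03, 0.05], [0.08, 0.04, 0.06, 0.30, 0.52]]

# Alerts where the two models pick different classes go to manual review.
review_queue = [r for r in flag_disagreement(xgb_out, mlp_out) if r["disagree"]]
```

In this toy input the models agree on the first alert and split between the two true-positive classes on the second, so only the second lands in `review_queue`.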
## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import json
import os
import sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb008-baseline-classifier"

# Download the model weights and the feature pipeline from the Hub.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# my_alert_record is a placeholder for your own alert record.
# Note: do NOT include alert_lifecycle_phase, automation_resolved, or
# escalation_flag in your record - those were the oracle columns.
X = transform_single(my_alert_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full copy-paste demo.
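
Because the three oracle columns must not reach the model, it can help to strip them from a raw record before calling `transform_single`. A minimal sketch, assuming the record is a plain `dict` keyed by column name: the helper `strip_oracle_columns` and the example values are hypothetical, and only the column names come from this card. Whether `transform_single` itself ignores or rejects these keys is not documented here, so treat this as a defensive step.

```python
# The three structural oracle columns named in this card.
ORACLE_COLUMNS = ("alert_lifecycle_phase", "automation_resolved", "escalation_flag")


def strip_oracle_columns(record: dict) -> dict:
    """Return a copy of the record without the three oracle columns."""
    return {k: v for k, v in record.items() if k not in ORACLE_COLUMNS}


# Made-up record for illustration; a real one carries more fields.
raw_record = {
    "raw_score": 0.41,
    "enriched_score": 0.57,
    "soar_playbook_triggered": 1,
    "alert_lifecycle_phase": "escalated",
    "automation_resolved": 0,
    "escalation_flag": 1,
}

clean_record = strip_oracle_columns(raw_record)
```

The original record is left untouched, so the dropped fields remain available for logging or auditing.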
## Training data

Trained on the public sample of CYB008, 9,200 per-alert records:

| Outcome | Alerts | Class share |
|---|---:|---:|
| `false_positive_closed` | 2,996 | 32.6% |
| `auto_resolved_soar` | 2,642 | 28.7% |
| `true_positive_remediated` | 1,848 | 20.1% |
| `true_positive_escalated` | 1,319 | 14.3% |
| `duplicate_merged` | 395 | 4.3% |

### Stratified split (no natural group key)

CYB008 does not have a natural row-level group key for group-aware splitting:

- 25 analysts: a group-aware split would yield only ~4 test analysts
- 5 SOCs: a group-aware split would yield 1 test SOC
- 589 incidents: only 9% of alerts have a non-null `incident_id`

Alerts are essentially independent given features, so we use **StratifiedShuffleSplit** (nested 70/15/15), the same approach as CYB001 for network flow classification:

| Fold | Alerts |
|---|---:|
| Train | 6,440 |
| Validation | 1,380 |
| Test | 1,380 |

Class imbalance is addressed with balanced class weights, passed to XGBoost as `sample_weight` (the equivalent of `class_weight='balanced'`), and with weighted cross-entropy for the MLP.

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
53 features survive after encoding, drawn from:

- **Per-alert numeric** (9): `raw_score`, `enriched_score`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `soar_playbook_triggered`, `sla_breached_flag`, `mttd_minutes`, `mttr_minutes`, `fatigue_score_at_alert`
- **Per-alert categorical** (5, one-hot): `alert_severity` (7 values), `alert_source` (8 values), `mitre_tactic` (12 values), `analyst_tier` (3 values), `siem_platform` (8 values)
- **Engineered** (6): `enrichment_lift`, `log_mttr`, `log_mttd`, `queue_pressure`, `enrichment_per_minute`, `is_high_confidence`

### Excluded columns

**Oracle columns** (dropped to allow honest evaluation):

| Column | Why excluded |
|---|---|
| `alert_lifecycle_phase` | 3 of 4 values are deterministic outcome oracles |
| `automation_resolved` | 1:1 with `auto_resolved_soar` outcome |
| `escalation_flag` | Near-1:1 with `true_positive_escalated` outcome |

**High-cardinality columns** (dropped for tractability):

| Column | Why excluded |
|---|---|
| `mitre_technique_id` | 36 unique values; perfect oracle for `mitre_tactic` but unrelated to this target |
| `detection_rule_id` | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |

### Partial-oracle features (kept as legitimate observables)

`soar_playbook_triggered` is a *necessary but not sufficient* condition for `auto_resolved_soar`: when 0, the alert is never auto-resolved; when 1, the outcome is auto-resolved 68% of the time but can also be TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is a legitimate observable that downstream operators would already have on hand at decision time, so it is kept in the pipeline.
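
The difference between a full oracle and a partial oracle can be checked by tabulating outcomes per feature value. The following is a minimal sketch on made-up rows, not the procedure used to produce `leakage_diagnostic.json`; `outcome_table` and `is_deterministic` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def outcome_table(rows, column, target="resolution_outcome"):
    """Map each value of `column` to a Counter of target labels."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[column]][row[target]] += 1
    return dict(table)


def is_deterministic(counter):
    """True when a feature value always co-occurs with a single label."""
    return len(counter) == 1


# Toy rows for illustration only; they mimic the pattern described above.
rows = [
    {"soar_playbook_triggered": 1, "automation_resolved": 1,
     "resolution_outcome": "auto_resolved_soar"},
    {"soar_playbook_triggered": 1, "automation_resolved": 1,
     "resolution_outcome": "auto_resolved_soar"},
    {"soar_playbook_triggered": 1, "automation_resolved": 0,
     "resolution_outcome": "true_positive_remediated"},
    {"soar_playbook_triggered": 0, "automation_resolved": 0,
     "resolution_outcome": "false_positive_closed"},
    {"soar_playbook_triggered": 0, "automation_resolved": 0,
     "resolution_outcome": "true_positive_escalated"},
]

full_oracle = outcome_table(rows, "automation_resolved")
partial_oracle = outcome_table(rows, "soar_playbook_triggered")
```

In the toy data, `automation_resolved == 1` pins the label to a single outcome, while `soar_playbook_triggered == 1` still leaves more than one outcome open, which is the pattern that separates a dropped oracle from a kept observable.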
## Evaluation

### Test-set metrics, seed 42 (n = 1,380 alerts)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9522** |
| Accuracy | **0.7659** |
| Macro-F1 | 0.7430 |
| Weighted-F1 | 0.7672 |

**MLP** (the published `model_mlp.safetensors` artifact): **slightly outperforms XGBoost**

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9552** |
| Accuracy | **0.7674** |
| Macro-F1 | 0.7510 |
| Weighted-F1 | 0.7691 |

With 6,440 training rows and 53 features, the MLP has enough data to compete favorably with boosted trees. Both models are published.

### Multi-seed robustness (XGBoost, 10 seeds)

Performance is very stable: a std of 0.007 on accuracy is among the tightest in the XpertSystems catalog.

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
| Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
| Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json). All 10 seeds yielded all 5 classes in the test fold (stratified split guarantees this).

### Per-class F1 (seed 42)

| Outcome | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `false_positive_closed` | 32.6% | 0.904 | 0.910 |
| `duplicate_merged` | 4.3% | 0.794 | 0.825 |
| `auto_resolved_soar` | 28.7% | 0.757 | 0.751 |
| `true_positive_remediated` | 20.1% | 0.701 | 0.698 |
| `true_positive_escalated` | 14.3% | 0.559 | 0.571 |

Both models perform best on `false_positive_closed` (clearest behavioural profile: low scores, fast resolution by L1 analysts) and `duplicate_merged` (smallest class but distinctive: duplicate-suppressed severity is a strong tell). The hardest discrimination is between `true_positive_remediated` and `true_positive_escalated`. Both are genuine threats, differing primarily by whether the alert was closed by the original analyst or passed to a higher tier.
In production this matters less because both are TP outcomes; binary TP-vs-FP recall is much higher.

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | n/a |
| No alert severity | 0.5138 | 0.3933 | 0.7304 | **−0.2522** |
| No `soar_playbook_triggered` | 0.6188 | 0.5773 | 0.8369 | **−0.1471** |
| No analyst tier | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
| No siem platform | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
| No alert source | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
| No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
| No mitre_tactic | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
| No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
| No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |

Four findings:

1. **Alert severity carries the dominant signal** (drops 25 pp accuracy, 22 pp ROC-AUC). This is intuitive: severity directly drives triage priority, which drives outcome. `false_positive` severity → `false_positive_closed`; `duplicate_suppressed` severity → `duplicate_merged`.
2. **`soar_playbook_triggered` is the second-strongest signal** (drops 15 pp accuracy). It's a partial oracle for the `auto_resolved_soar` outcome class.
3. **MITRE tactic and analyst tier contribute essentially nothing.** The model performs marginally *better* without them; they add noise that the trees over-fit on the training set.
4. **Engineered features and timing features are near-flat.** The trees recover composites from raw inputs. Kept in the pipeline as a documented baseline reference.

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes), `hist` tree method, class-balanced sample weights, early stopping on validation mlogloss.
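
For reference, the common "balanced" heuristic gives each class the weight `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts from the Training data table. It is an assumption that the private training script uses exactly this formula, and `balanced_class_weights` is a hypothetical helper name.

```python
# Class counts copied from the Training data table in this card.
CLASS_COUNTS = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}


def balanced_class_weights(counts: dict) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(CLASS_COUNTS)

# Per-row sample weights are a lookup on each row's label; a list like this
# is what would be passed as sample_weight when fitting.
example_labels = ["duplicate_merged", "false_positive_closed"]
sample_weight = [weights[label] for label in example_labels]
```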

**MLP:** `53 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d` → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.

Training hyperparameters are held internally by XpertSystems.

## Limitations

**This is a baseline reference, not a production SOC triage system.**

1. **MITRE tactic classification is unlearnable on this sample.** The dataset README lists it as a suggested use case but the per-tactic feature distributions are too similar (raw_score 0.37–0.39 across all 12 tactics). See [`leakage_diagnostic.json`](./leakage_diagnostic.json) for the full audit. Real SOC data has stronger per-tactic feature signatures.
2. **TP-remediated vs TP-escalated is the hardest discrimination.** F1 0.56 on TP-escalated is the weakest per-class result. Both are genuine threats; the difference is workflow rather than threat nature. For most operational uses (TP-vs-FP recall, SLA-breach reduction), this confusion does not matter.
3. **MLP modestly outperforms XGBoost.** Both are shipped; we recommend running both and treating disagreement as a triage signal. The gap is small enough (0.767 vs 0.766 accuracy at seed 42) that for production deployment, the choice between them is essentially an engineering preference.
4. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated to 12 benchmark metrics drawn from SOC-operations sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR). Real SOC telemetry has different noise characteristics, and the structural-oracle pattern documented above (`alert_lifecycle_phase` deterministically encoding outcome) would not be present in real data, where lifecycle phases transition stochastically. Do not assume metrics transfer end-to-end.
5. **9,200 alerts is a modest training set.** The 1,380-alert test fold yields stable multi-seed metrics (std 0.007), but full confidence intervals for downstream production decisions should come from the full ~280k-alert product.

## Notes on dataset schema

The CYB008 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `incident_summary` has 8 columns | Data has **23 columns** including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
| `alert_severity` has 6 values (info / low / medium / high / critical / false_positive) | **7 values**: adds `duplicate_suppressed`. All values are suffixed (`high_severity`, `low_severity`, `critical_confirmed`, `informational`). |
| `analyst_tier` has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (`L1_junior`, `L2_senior`, `L3_threat_hunter`); 4 on `soc_topology` (adds `L4_incident_commander`). |
| 14 MITRE ATT&CK tactics | 12 tactics in the data (no `reconnaissance` or `resource_development` from PRE-ATT&CK). |
| Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is `alert_source` (not `detection_source`); 8 values: `edr_behavioural_engine`, `nids_signature`, `ueba_user_anomaly`, `cspm_cloud_rule`, `siem_correlation_rule`, `threat_intel_ioc_match`, `honeypot_trigger`, `itdr_identity_anomaly`. |
| `triage_score` / `enrichment_score` columns | Actual names: `raw_score` / `enriched_score`. |
| `alert_timestamp` (ISO string) | Actual: `alert_timestamp_min` (integer minutes from epoch). |
| `kill_chain_stage`, `storm_event_flag` columns on alerts | Not present in the data. |
| `resolution_outcome` values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_escalated`, `true_positive_remediated`. |
| Extra columns in data not in README | `shift_id`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `fatigue_score_at_alert`, `siem_platform`, `soar_playbook_id`, `detection_rule_id`, `alert_lifecycle_phase` |

None of these discrepancies affects model correctness, because the feature pipeline uses the actual column names. If you build your own pipeline against the dataset, use the actual columns.

## Intended use

- **Evaluating fit** of the CYB008 dataset for your SOC-triage research
- **Baseline reference** for new model architectures
- **Reference example of structural-leakage diagnostics** in synthetic SOC datasets; the diagnostic methodology is reusable
- **Feature engineering reference** for per-alert SOC telemetry

## Out-of-scope use

- Production SOC triage decisions on real telemetry
- MITRE ATT&CK tactic prediction (this baseline establishes that the task is unlearnable on the sample)
- SLA-breach prediction (also tested as unlearnable on the sample: acc 0.68 vs majority 0.82)
- Any operational decision affecting actual security operations without further validation on your own data

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact), nested `StratifiedShuffleSplit` (70/15/15), on the published sample (`xpertsystems/cyb008-sample`, version 1.0.0, generated 2026-05-16). The feature pipeline in `feature_engineering.py` is deterministic and the trained weights in this repo correspond exactly to the metrics above. Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in `multi_seed_results.json` confirm robust performance across splits. The training script itself is private to XpertSystems.
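
The 70/15/15 split can be approximated with two chained `StratifiedShuffleSplit` passes from scikit-learn (an extra dependency, not in the Quick start install line). This is a sketch under assumptions: the training script is not published, so the exact nesting, the use of integer fold sizes, and the placeholder feature matrix are illustrative. Only the class counts, fold sizes, and seed 42 come from this card, and the resulting row assignment will not match the published folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Class counts from the Training data table; labels are encoded 0..4 here.
counts = [2996, 2642, 1848, 1319, 395]
y = np.repeat(np.arange(len(counts)), counts)
X = np.zeros((len(y), 1))  # placeholder features; only y drives stratification

# Outer pass: hold out 2,760 rows (30%), leaving 6,440 for training.
outer = StratifiedShuffleSplit(n_splits=1, test_size=2760, random_state=42)
train_idx, hold_idx = next(outer.split(X, y))

# Inner pass: split the holdout in half into validation and test.
inner = StratifiedShuffleSplit(n_splits=1, test_size=1380, random_state=42)
val_rel, test_rel = next(inner.split(X[hold_idx], y[hold_idx]))

# Inner indices are relative to the holdout, so map them back to row ids.
val_idx, test_idx = hold_idx[val_rel], hold_idx[test_rel]
```

Passing integer fold sizes avoids rounding surprises from fractional `test_size` values and reproduces the 6,440 / 1,380 / 1,380 fold sizes reported above.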
## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| `leakage_diagnostic.json` | **Structural-oracle audit + unlearnable-target finding** |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB008** dataset contains ~335,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative SOC operations and threat intelligence sources (SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of Security, Verizon DBIR).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)

## Citation

```bibtex
@misc{xpertsystems_cyb008_baseline_2026,
  title  = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb008-sample}
}
```