	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- cybersecurity
	- soc-operations
	- alert-triage
	- mitre-attack
	- soar
	- siem
	- tabular-classification
	- synthetic-data
	- xgboost
	- baseline
	- leakage-diagnostic
	pipeline_tag: tabular-classification
	base_model: []
	datasets:
	- xpertsystems/cyb008-sample
	metrics:
	- accuracy
	- f1
	- roc_auc
	model-index:
	- name: cyb008-baseline-classifier
	  results:
	  - task:
	      type: tabular-classification
	      name: 5-class SOC alert triage outcome classification
	    dataset:
	      type: xpertsystems/cyb008-sample
	      name: CYB008 Synthetic SOC Alert Dataset (Sample)
	    metrics:
	    - type: roc_auc
	      value: 0.9522
	      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.7659
	      name: Test accuracy (XGBoost, seed 42)
	    - type: f1
	      value: 0.7430
	      name: Test macro-F1 (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.777
	      name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.955
	      name: Multi-seed ROC-AUC mean ± 0.003 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.9552
	      name: Test macro ROC-AUC OvR (MLP, seed 42)
	    - type: accuracy
	      value: 0.7674
	      name: Test accuracy (MLP, seed 42)
	    - type: f1
	      value: 0.7510
	      name: Test macro-F1 (MLP, seed 42)
	---

	# CYB008 Baseline Classifier

	**SOC alert triage classifier trained on the CYB008 synthetic SOC alert
	sample. It predicts which of 5 triage outcome classes
	(`auto_resolved_soar` / `duplicate_merged` / `false_positive_closed` /
	`true_positive_remediated` / `true_positive_escalated`) an alert
	will reach, from per-alert features. It also ships a leakage diagnostic
	for the three structural-oracle columns dropped from the feature
	pipeline.**

	> Read this first. This repo ships two related artifacts:
	> (1) a working baseline classifier for `resolution_outcome` (the
	> primary product), and (2) a `leakage_diagnostic.json` file
	> documenting (a) the three structural oracle columns that were
	> dropped from the feature set, and (b) the separate finding that
	> MITRE ATT&CK tactic classification, one of the dataset README's
	> suggested use cases, is not learnable on this sample. Both
	> artifacts matter; the diagnostic is required reading for anyone
	> evaluating CYB008 for a triage product.

	## Model overview

	| Property | Value |
	|---|---|
	| Primary task | 5-class `resolution_outcome` classification (SOC alert triage) |
	| Secondary artifact | `leakage_diagnostic.json`: structural oracle + unlearnable-target audit |
	| Training data | `xpertsystems/cyb008-sample` (9,200 alerts) |
	| Models | XGBoost + PyTorch MLP |
	| Input features | 53 (after one-hot encoding) |
	| Split | Stratified random (no natural group key in this dataset; see rationale below) |
	| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
	| License | CC-BY-NC-4.0 (matches dataset) |
	| Status | Reference baseline + leakage diagnostic |

	## Why this task — and what was dropped

	The CYB008 dataset README lists alert triage (TP vs FP prediction) as its
	first suggested use case and MITRE ATT&CK tactic classification as
	its second. We piloted both on the sample dataset:

	- Triage outcome: learnable without leakage. After dropping 3 structural
	  oracle columns, XGBoost achieves **acc 0.777 ± 0.007, ROC-AUC
	  0.955 ± 0.003** (mean ± std over 10 seeds) on 5-class
	  classification. This is the primary baseline.

	- MITRE tactic classification: does NOT work on this sample.
	  Without `mitre_technique_id` (which is a perfect ATT&CK-by-design
	  oracle), the per-tactic feature distributions are nearly identical
	  (raw_score 0.37–0.39 across all 12 tactics, similar for enriched
	  score and fatigue). A trained XGBoost achieves accuracy 0.08,
	  below the majority baseline of 0.14. The dataset README's stated
	  use case cannot be honestly demonstrated on the sample. See
	  [`leakage_diagnostic.json`](./leakage_diagnostic.json) for the full
	  finding and our recommendation to the dataset author.
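
The "below the majority baseline" test used above is simple to reproduce. The following is a minimal sketch, not the check shipped in `leakage_diagnostic.json`: the function names are hypothetical and the toy labels are made up (chosen only so that the majority share lands near 0.14).

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def beats_majority(model_accuracy, labels, margin=0.0):
    """True only if the model exceeds the majority baseline by more than `margin`."""
    return model_accuracy > majority_baseline_accuracy(labels) + margin

# Toy labels (made up): 12 classes, one slightly more common than the rest.
toy_tactics = ["tactic_0"] * 14 + [f"tactic_{i}" for i in range(1, 12) for _ in range(8)]

baseline = majority_baseline_accuracy(toy_tactics)
print(f"majority baseline: {baseline:.2f}")
print("accuracy 0.08 beats baseline:", beats_majority(0.08, toy_tactics))
```

A target whose trained model cannot clear this bar is a candidate for the "unlearnable on this sample" label.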

	### The three structural oracle columns (dropped)

	CYB008 has three columns that structurally encode the
	`resolution_outcome` label:

	| Column | Oracle relationship |
	|---|---|
	| `alert_lifecycle_phase` | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
	| `automation_resolved` | Exact 1:1 with `auto_resolved_soar` outcome |
	| `escalation_flag` | 1319 escalation flags = 1319 `true_positive_escalated` outcomes (near-1:1) |

	With all three present, plain XGBoost achieves **100% test accuracy
	across all seeds**, which is mechanical rather than learned. With all
	three dropped, accuracy is 0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003
	across 10 seeds: real learning on a non-trivial 5-class task. The
	published baseline trains with these three columns excluded.
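
One way to surface structural oracles like these is to check, for each value of a candidate column, how often the most common outcome occurs. The sketch below assumes rows are plain dictionaries; `value_purity` is a hypothetical helper, the toy rows are made up, and `analyst_review` is an invented name for the non-deterministic phase value.

```python
from collections import Counter, defaultdict

def value_purity(rows, column, label):
    """Map each value of `column` to (majority label, share of rows carrying it)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[label]] += 1
    report = {}
    for value, counts in by_value.items():
        top_label, top_count = counts.most_common(1)[0]
        report[value] = (top_label, top_count / sum(counts.values()))
    return report

# Toy rows (made up) mimicking the pattern described for alert_lifecycle_phase.
rows = (
    [{"phase": "auto_closed", "outcome": "auto_resolved_soar"}] * 5
    + [{"phase": "escalated", "outcome": "true_positive_escalated"}] * 3
    + [{"phase": "analyst_review", "outcome": "false_positive_closed"}] * 4
    + [{"phase": "analyst_review", "outcome": "true_positive_remediated"}] * 4
)

report = value_purity(rows, "phase", "outcome")
# Values with purity 1.0 fully determine the outcome and behave as oracles.
oracle_values = sorted(v for v, (_, share) in report.items() if share == 1.0)
print(oracle_values)
```

A column where most values have purity at or near 1.0 should be reviewed before it is allowed into the feature set.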

	Two model artifacts are published. They are designed to be used
	together, since disagreement between them is a useful triage signal:

	- `model_xgb.json`: gradient-boosted trees
	- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

	On CYB008 the MLP slightly outperforms XGBoost on the test fold
	(0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42). This is
	only the second SKU in the XpertSystems baseline catalog where this
	happens (after CYB007).
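
Using model disagreement as a triage signal could look roughly like the sketch below. It is an illustration rather than shipped code: `triage_with_disagreement` is a hypothetical helper, the averaging fallback is one possible policy, and the label order is an assumption (the real order comes from `INT_TO_LABEL` in `feature_engineering.py`).

```python
def triage_with_disagreement(proba_xgb, proba_mlp, labels):
    """Return (label, needs_review); needs_review is True when the two models disagree."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    if top_xgb == top_mlp:
        return labels[top_xgb], False
    # On disagreement, fall back to the class with the highest averaged probability.
    avg = [(a + b) / 2 for a, b in zip(proba_xgb, proba_mlp)]
    return labels[max(range(len(labels)), key=lambda i: avg[i])], True

# Assumed label order, for illustration only.
LABELS = [
    "auto_resolved_soar", "duplicate_merged", "false_positive_closed",
    "true_positive_escalated", "true_positive_remediated",
]

agree = triage_with_disagreement([0.1, 0.0, 0.7, 0.1, 0.1], [0.2, 0.0, 0.6, 0.1, 0.1], LABELS)
split = triage_with_disagreement([0.05, 0.0, 0.05, 0.5, 0.4], [0.05, 0.0, 0.05, 0.3, 0.6], LABELS)
print(agree)
print(split)
```

Alerts flagged with `needs_review=True` can be routed to an analyst instead of being auto-actioned.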

	## Quick start

	```bash
	pip install xgboost torch safetensors pandas huggingface_hub
	```

	```python
	import json
	import os
	import sys

	import numpy as np
	import torch
	import xgboost as xgb
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	# json, torch, and load_file are unused in this XGBoost-only snippet;
	# they are needed only if you also load the MLP artifacts.

	REPO = "xpertsystems/cyb008-baseline-classifier"

	# Download the model artifacts and the feature pipeline from the Hub.
	paths = {n: hf_hub_download(REPO, n) for n in [
	    "model_xgb.json", "model_mlp.safetensors",
	    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
	]}

	# Make the downloaded feature_engineering.py importable.
	sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
	from feature_engineering import transform_single, load_meta, INT_TO_LABEL

	meta = load_meta(paths["feature_meta.json"])
	xgb_model = xgb.XGBClassifier()
	xgb_model.load_model(paths["model_xgb.json"])

	# Predict (see inference_example.ipynb for the full pattern).
	# my_alert_record is a placeholder for your own raw alert record; define it first.
	# Note: do NOT include alert_lifecycle_phase, automation_resolved, or
	# escalation_flag in your record - those were the oracle columns.
	X = transform_single(my_alert_record, meta)
	proba = xgb_model.predict_proba(X)[0]
	print(INT_TO_LABEL[int(np.argmax(proba))])
	```

	See [`inference_example.ipynb`](./inference_example.ipynb) for the full
	copy-paste demo.

	## Training data

	Trained on the public sample of CYB008, 9,200 per-alert records:

	| Outcome | Alerts | Class share |
	|---|---:|---:|
	| `false_positive_closed` | 2,996 | 32.6% |
	| `auto_resolved_soar` | 2,642 | 28.7% |
	| `true_positive_remediated` | 1,848 | 20.1% |
	| `true_positive_escalated` | 1,319 | 14.3% |
	| `duplicate_merged` | 395 | 4.3% |

	### Stratified split (no natural group key)

	CYB008 does not have a natural row-level group key for group-aware
	splitting:
	- 25 analysts: a group-aware split would leave only ~4 test analysts
	- 5 SOCs: a group-aware split would leave 1 test SOC
	- 589 incidents: only 9% of alerts have a non-null `incident_id`

	Alerts are essentially independent given features, so we use
	`StratifiedShuffleSplit` (nested 70/15/15), the same approach as
	CYB001 for network flow classification:

	| Fold | Alerts |
	|---|---:|
	| Train | 6,440 |
	| Validation | 1,380 |
	| Test | 1,380 |
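
The fold sizes above follow from a nested 70/15/15 stratified split. The training script is private, so the following is only a sketch of one way to reproduce the sizes with scikit-learn: the integer label coding, the reuse of `random_state=42` for both splits, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Class counts taken from the outcome table; the integer coding is arbitrary.
class_counts = [2996, 2642, 1848, 1319, 395]
y = np.repeat(np.arange(len(class_counts)), class_counts)
idx = np.arange(len(y))

n_held = round(0.30 * len(y))  # 30% held out for validation + test
n_test = n_held // 2           # half of the held-out rows form the test fold

# Outer split: train vs held-out, stratified on the outcome label.
outer = StratifiedShuffleSplit(n_splits=1, test_size=n_held, random_state=42)
train_idx, held_idx = next(outer.split(idx, y))

# Inner split: divide the held-out rows evenly into validation and test.
inner = StratifiedShuffleSplit(n_splits=1, test_size=n_test, random_state=42)
val_pos, test_pos = next(inner.split(held_idx, y[held_idx]))
val_idx, test_idx = held_idx[val_pos], held_idx[test_pos]

print(len(train_idx), len(val_idx), len(test_idx))
```

Because both splits are stratified, every fold keeps all five outcome classes at roughly their overall share.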

	Class imbalance is addressed with balanced class weights (the
	`class_weight='balanced'` scheme), passed to XGBoost as per-row
	`sample_weight` and to the MLP as a weighted cross-entropy loss.
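
For reference, the "balanced" scheme assigns each row the weight `n_samples / (n_classes * count(class))`, which is the formula scikit-learn documents for `class_weight='balanced'`. The sketch below applies it to the whole-sample class counts from the table above for illustration; in training the weights would be computed on the training fold, and `balanced_sample_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Per-row weights: n_samples / (n_classes * count of the row's class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]

# Whole-sample class counts from the outcome table (illustration only).
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}
labels = [name for name, c in counts.items() for _ in range(c)]
weights = balanced_sample_weights(labels)

# One weight per class: rare classes receive proportionally larger weights.
per_class = dict(zip(labels, weights))
for name, w in per_class.items():
    print(f"{name}: {w:.3f}")
```

The resulting list can be passed as `sample_weight` to `XGBClassifier.fit`; the per-class values can seed the `weight` argument of a weighted cross-entropy loss.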

	## Feature pipeline

	The bundled `feature_engineering.py` is the canonical feature recipe.
	53 features survive after encoding, drawn from:

	- Per-alert numeric (9): `raw_score`, `enriched_score`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `soar_playbook_triggered`, `sla_breached_flag`, `mttd_minutes`, `mttr_minutes`, `fatigue_score_at_alert`
	- Per-alert categorical (5, one-hot): `alert_severity` (7 values), `alert_source` (8 values), `mitre_tactic` (12 values), `analyst_tier` (3 values), `siem_platform` (8 values)
	- Engineered (6): `enrichment_lift`, `log_mttr`, `log_mttd`, `queue_pressure`, `enrichment_per_minute`, `is_high_confidence`
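
As a quick sanity check, the 53-feature total follows from the counts listed above, assuming every categorical level gets its own one-hot column (no dropped reference level):

```python
# Counts copied from the feature list above.
numeric = 9
one_hot_levels = {
    "alert_severity": 7,
    "alert_source": 8,
    "mitre_tactic": 12,
    "analyst_tier": 3,
    "siem_platform": 8,
}
engineered = 6

total = numeric + sum(one_hot_levels.values()) + engineered
print(total)  # 53
```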

	### Excluded columns

	Oracle columns (dropped to allow honest evaluation):

	| Column | Why excluded |
	|---|---|
	| `alert_lifecycle_phase` | 3 of 4 values are deterministic outcome oracles |
	| `automation_resolved` | 1:1 with `auto_resolved_soar` outcome |
	| `escalation_flag` | Near-1:1 with `true_positive_escalated` outcome |

	High-cardinality columns (dropped for tractability):

	| Column | Why excluded |
	|---|---|
	| `mitre_technique_id` | 36 unique values; perfect oracle for `mitre_tactic` but unrelated to this target |
	| `detection_rule_id` | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |

	### Partial-oracle features (kept as legitimate observables)

	`soar_playbook_triggered` is a necessary but not sufficient condition
	for `auto_resolved_soar`: when 0, the alert is never auto-resolved;
	when 1, the outcome is auto-resolved 68% of the time but can also be
	TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is
	a legitimate observable that downstream operators would already have
	on hand at decision time, so it is kept in the pipeline.
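
The "necessary but not sufficient" property can be checked directly. This is a minimal sketch on made-up rows (sized so the hit rate comes out at 68%); `necessary_not_sufficient` is a hypothetical helper, not part of `feature_engineering.py`.

```python
def necessary_not_sufficient(rows, flag, label, target):
    """Return (necessary, hit_rate).

    necessary: True if `target` never occurs when the flag is 0.
    hit_rate:  share of flag == 1 rows whose label equals `target`.
    """
    off = [r for r in rows if r[flag] == 0]
    on = [r for r in rows if r[flag] == 1]
    necessary = all(r[label] != target for r in off)
    hit_rate = sum(r[label] == target for r in on) / len(on)
    return necessary, hit_rate

# Toy rows (made up): 25 playbook-triggered alerts, 17 of them auto-resolved.
rows = (
    [{"soar_playbook_triggered": 1, "outcome": "auto_resolved_soar"}] * 17
    + [{"soar_playbook_triggered": 1, "outcome": "true_positive_remediated"}] * 8
    + [{"soar_playbook_triggered": 0, "outcome": "false_positive_closed"}] * 10
)

necessary, hit_rate = necessary_not_sufficient(
    rows, "soar_playbook_triggered", "outcome", "auto_resolved_soar"
)
print(necessary, round(hit_rate, 2))
```

A hit rate of exactly 1.0 would instead mark the column as a full oracle for that class.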

	## Evaluation

	### Test-set metrics, seed 42 (n = 1,380 alerts)

	XGBoost (the published `model_xgb.json` artifact)

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9522 |
	| Accuracy | 0.7659 |
	| Macro-F1 | 0.7430 |
	| Weighted-F1 | 0.7672 |

	MLP (the published `model_mlp.safetensors` artifact) — slightly outperforms XGBoost

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9552 |
	| Accuracy | 0.7674 |
	| Macro-F1 | 0.7510 |
	| Weighted-F1 | 0.7691 |

	With 6,440 training rows and 53 features, the MLP has enough data to
	compete favorably with boosted trees. Both models are published.

	### Multi-seed robustness (XGBoost, 10 seeds)

	Very stable performance — std 0.007 on accuracy is among the tightest
	in the XpertSystems catalog:

	| Metric | Mean | Std | Min | Max |
	|---|---:|---:|---:|---:|
	| Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
	| Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
	| Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |

	Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
	All 10 seeds yielded all 5 classes in the test fold (stratified split
	guarantees this).

	### Per-class F1 (seed 42)

	| Outcome | Class share | XGBoost F1 | MLP F1 |
	|---|---:|---:|---:|
	| `false_positive_closed` | 32.6% | 0.904 | 0.910 |
	| `duplicate_merged` | 4.3% | 0.794 | 0.825 |
	| `auto_resolved_soar` | 28.7% | 0.757 | 0.751 |
	| `true_positive_remediated` | 20.1% | 0.701 | 0.698 |
	| `true_positive_escalated` | 14.3% | 0.559 | 0.571 |

	The model performs best on `false_positive_closed` (clearest behavioural
	profile — low scores, fast resolution by L1 analysts) and
	`duplicate_merged` (smallest class but distinctive — duplicate-suppressed
	severity is a strong tell). The hardest discrimination is between
	`true_positive_remediated` and `true_positive_escalated` — both are
	genuine threats, differing primarily by whether the alert was closed
	by the original analyst or passed to a higher tier. In production this
	matters less because both are TP outcomes; binary TP-vs-FP recall is
	much higher.
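
To see why the remediated-vs-escalated confusion costs nothing in a binary view, collapse the five outcomes into TP vs non-TP before scoring. The sketch below uses made-up predictions; treating `auto_resolved_soar` and `duplicate_merged` as non-TP is an assumption of this example, and the helper names are hypothetical.

```python
TP_CLASSES = {"true_positive_remediated", "true_positive_escalated"}

def collapse(label):
    """Map a 5-class outcome to a binary TP / non-TP label."""
    return "TP" if label in TP_CLASSES else "non_TP"

def recall(y_true, y_pred, positive):
    """Share of truly-positive rows that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)

# Toy predictions (made up): one escalated alert is mislabeled as remediated.
y_true = ["true_positive_escalated", "true_positive_escalated",
          "true_positive_remediated", "false_positive_closed"]
y_pred = ["true_positive_remediated", "true_positive_escalated",
          "true_positive_remediated", "false_positive_closed"]

escalated_recall = recall(y_true, y_pred, "true_positive_escalated")
binary_tp_recall = recall([collapse(t) for t in y_true],
                          [collapse(p) for p in y_pred], "TP")
print(escalated_recall, binary_tp_recall)
```

The mislabeled alert halves the 5-class recall for `true_positive_escalated` but leaves the binary TP recall untouched, because both labels collapse to TP.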

	### Ablation: which feature groups matter

	| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
	|---|---:|---:|---:|---:|
	| Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | n/a |
	| No `alert_severity` | 0.5138 | 0.3933 | 0.7304 | −0.2522 |
	| No `soar_playbook_triggered` | 0.6188 | 0.5773 | 0.8369 | −0.1471 |
	| No `analyst_tier` | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
	| No `siem_platform` | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
	| No `alert_source` | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
	| No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
	| No `mitre_tactic` | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
	| No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
	| No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |

	Four findings:

	1. Alert severity carries the dominant signal (removing it drops
	   accuracy by 25 pp and ROC-AUC by 22 pp). This is intuitive: severity
	   directly drives triage priority, which drives outcome. `false_positive`
	   severity → `false_positive_closed`; `duplicate_suppressed` severity
	   → `duplicate_merged`.
	2. `soar_playbook_triggered` is the second-strongest signal
	   (removing it drops accuracy by 15 pp). It is a partial oracle for the
	   `auto_resolved_soar` outcome class.
	3. MITRE tactic and analyst tier contribute essentially nothing.
	   The model performs marginally better without them (+1.5 pp and
	   +0.6 pp accuracy), plausibly because they add noise that the trees
	   over-fit on the training set.
	4. Engineered features and timing features are near-flat. The
	   trees appear to recover the composites from raw inputs. They are
	   kept in the pipeline as a documented baseline reference.
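
A feature-group ablation like this one can be organized as a small harness: score the full column set, then re-score with one group removed at a time. The sketch below is hypothetical throughout. The one-hot column naming (`alert_severity_<level>`) is assumed, and `fake_evaluate` is a stand-in that returns made-up scores; a real run would retrain and score the model on each column subset.

```python
def drop_group(columns, prefixes):
    """Return the columns that do not start with any of the given prefixes."""
    return [c for c in columns if not c.startswith(tuple(prefixes))]

def run_ablation(columns, groups, evaluate):
    """Score the full set, then each set with one group removed; return accuracy deltas."""
    full = evaluate(columns)
    return {name: evaluate(drop_group(columns, prefixes)) - full
            for name, prefixes in groups.items()}

# Hypothetical column names following an assumed one-hot naming scheme.
columns = ["raw_score", "enriched_score",
           "alert_severity_high_severity", "alert_severity_informational",
           "analyst_tier_L1_junior"]
groups = {"no_alert_severity": ["alert_severity_"],
          "no_analyst_tier": ["analyst_tier_"]}

def fake_evaluate(cols):
    # Stand-in scorer with made-up values; replace with real train-and-score logic.
    return 0.75 if any(c.startswith("alert_severity_") for c in cols) else 0.5

deltas = run_ablation(columns, groups, fake_evaluate)
print(deltas)
```

Keeping the scorer as a callable makes it easy to repeat the same ablation across several seeds before reading much into small deltas.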

	### Architecture

	XGBoost: multi-class gradient boosting (`multi:softprob`, 5 classes),
	`hist` tree method, class-balanced sample weights, early stopping on
	validation mlogloss.

	MLP: `53 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
	→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
	early stopping on validation macro-F1.
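
The MLP layout described above can be written out in PyTorch roughly as follows. This is a sketch reconstructed from the description, not the shipped training code: `build_mlp` is a hypothetical helper, and the parameter names it produces are not guaranteed to match the keys stored in `model_mlp.safetensors`.

```python
import torch
from torch import nn

def build_mlp(n_features=53, hidden=(128, 64), n_classes=5, dropout=0.3):
    """53 -> 128 -> 64 -> 5, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))  # raw logits; softmax is applied outside
    return nn.Sequential(*layers)

model = build_mlp().eval()  # eval mode: BatchNorm uses running stats, Dropout is off
with torch.no_grad():
    logits = model(torch.zeros(2, 53))  # dummy batch of two standardized feature rows
proba = torch.softmax(logits, dim=1)
print(proba.shape)
```

For training, a weighted loss of the kind described would be `nn.CrossEntropyLoss(weight=...)` paired with `torch.optim.AdamW`; the actual hyperparameters are not published.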

	Training hyperparameters are held internally by XpertSystems.

	## Limitations

	This is a baseline reference, not a production SOC triage system.

	1. MITRE tactic classification is unlearnable on this sample. The
	README lists it as a suggested use case but the per-tactic feature
	distributions are too similar (raw_score 0.37–0.39 across all 12
	tactics). See [`leakage_diagnostic.json`](./leakage_diagnostic.json)
	for the full audit. Real SOC data has stronger per-tactic feature
	signatures.

	2. TP-remediated vs TP-escalated is the hardest discrimination.
	F1 0.56 on TP-escalated is the weakest per-class result. Both are
	genuine threats; the difference is workflow rather than threat
	nature. For most operational uses (TP-vs-FP recall, SLA-breach
	reduction), this confusion does not matter.

	3. MLP modestly outperforms XGBoost. Both are shipped; we
	recommend running both and treating disagreement as a triage
	signal. The gap is small enough that for production
	deployment, the choice between them is essentially an engineering
	preference.

	4. Synthetic-vs-real transfer. The dataset is synthetic and
	calibrated to 12 benchmark metrics drawn from SOC-operations sources
	(SANS SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends,
	Forrester Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS,
	CrowdStrike, Splunk State of Security, Verizon DBIR). Real SOC
	telemetry has different noise characteristics, and the
	structural-oracle pattern documented above (`alert_lifecycle_phase`
	deterministically encoding outcome) would not be present in real
	data, where lifecycle phases transition stochastically. Do not
	assume metrics transfer end-to-end.

	5. 9,200 alerts is a modest training set. The 1,380-alert test
	fold yields stable multi-seed metrics (std 0.007), but full
	confidence intervals for downstream production decisions should
	come from the full ~280k-alert product.

	## Notes on dataset schema

	The CYB008 sample dataset README describes some fields differently
	from the actual schema. The model was trained on the actual schema;
	this note helps buyers reconcile what they read with what they receive.

	| What the README says | What the data actually contains |
	|---|---|
	| `incident_summary` has 8 columns | Data has 23 columns including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
	| `alert_severity` has 6 values (info / low / medium / high / critical / false_positive) | 7 values: adds `duplicate_suppressed`. All values are suffixed (`high_severity`, `low_severity`, `critical_confirmed`, `informational`). |
	| `analyst_tier` has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (`L1_junior`, `L2_senior`, `L3_threat_hunter`); 4 on `soc_topology` (adds `L4_incident_commander`). |
	| 14 MITRE ATT&CK tactics | 12 tactics in the data (no `reconnaissance` or `resource_development` from PRE-ATT&CK). |
	| Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is `alert_source` (not `detection_source`); 8 values: `edr_behavioural_engine`, `nids_signature`, `ueba_user_anomaly`, `cspm_cloud_rule`, `siem_correlation_rule`, `threat_intel_ioc_match`, `honeypot_trigger`, `itdr_identity_anomaly`. |
	| `triage_score` / `enrichment_score` columns | Actual names: `raw_score` / `enriched_score`. |
	| `alert_timestamp` (ISO string) | Actual: `alert_timestamp_min` (integer minutes from epoch). |
	| `kill_chain_stage`, `storm_event_flag` columns on alerts | Not present in the data. |
	| `resolution_outcome` values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_escalated`, `true_positive_remediated`. |
	| Extra columns in data not in README | `shift_id`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `fatigue_score_at_alert`, `siem_platform`, `soar_playbook_id`, `detection_rule_id`, `alert_lifecycle_phase` |

	None of these affects model correctness — the feature pipeline uses
	the actual column names. If you build your own pipeline against the
	dataset, use the actual columns.
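
If you have code written against the README's column names, a small mapping step can reconcile it with the actual schema. The sketch below encodes only the renames and absent columns listed in the table above; `reconcile_columns` is a hypothetical helper. Note that `alert_timestamp` also differs in type (ISO string vs integer minutes), which a rename alone does not address.

```python
# README name -> actual column name, taken from the table above.
README_TO_ACTUAL = {
    "detection_source": "alert_source",
    "triage_score": "raw_score",
    "enrichment_score": "enriched_score",
    "alert_timestamp": "alert_timestamp_min",
}
# Columns the README mentions that are not present on alerts.
NOT_IN_DATA = {"kill_chain_stage", "storm_event_flag"}

def reconcile_columns(requested):
    """Map README-style names to actual ones; collect names absent from the data."""
    resolved, missing = [], []
    for name in requested:
        if name in NOT_IN_DATA:
            missing.append(name)
        else:
            resolved.append(README_TO_ACTUAL.get(name, name))
    return resolved, missing

resolved, missing = reconcile_columns(
    ["triage_score", "detection_source", "storm_event_flag", "mitre_tactic"]
)
print(resolved)
print(missing)
```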

	## Intended use

	- Evaluating fit of the CYB008 dataset for your SOC-triage research
	- Baseline reference for new model architectures
	- Reference example of structural-leakage diagnostics in
	synthetic SOC datasets — the diagnostic methodology is reusable
	- Feature engineering reference for per-alert SOC telemetry

	## Out-of-scope use

	- Production SOC triage decisions on real telemetry
	- MITRE ATT&CK tactic prediction (this baseline establishes that the
	  task is unlearnable on the sample)
	- SLA-breach prediction (also tested as unlearnable on the sample:
	  acc 0.68 vs majority 0.82)
	- Any operational decision affecting actual security operations
	  without further validation on your own data

	## Reproducibility

	Outputs above were produced with `seed = 42` (published artifact),
	nested `StratifiedShuffleSplit` (70/15/15), on the published sample
	(`xpertsystems/cyb008-sample`, version 1.0.0, generated 2026-05-16).
	The feature pipeline in `feature_engineering.py` is deterministic and
	the trained weights in this repo correspond exactly to the metrics
	above.

	Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
	in `multi_seed_results.json` confirm robust performance across splits.

	The training script itself is private to XpertSystems.

	## Files in this repo

	| File | Purpose |
	|---|---|
	| `model_xgb.json` | XGBoost weights (seed 42) |
	| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
	| `feature_engineering.py` | Feature pipeline |
	| `feature_meta.json` | Feature column order + categorical levels |
	| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
	| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
	| `ablation_results.json` | Per-feature-group ablation |
	| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
	| `leakage_diagnostic.json` | Structural-oracle audit + unlearnable-target finding |
	| `inference_example.ipynb` | End-to-end inference demo notebook |
	| `README.md` | This file |

	## Contact and full product

	The full CYB008 dataset contains ~335,000 rows across four files,
	with calibrated benchmark validation against 12 metrics drawn from
	authoritative SOC operations and threat intelligence sources (SANS
	SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester
	Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk
	State of Security, Verizon DBIR). The full XpertSystems.ai synthetic
	data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
	Insurance & Risk, Oil & Gas, and Materials & Energy.

	- 📧 pradeep@xpertsystems.ai
	- 🌐 https://xpertsystems.ai
	- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
	- 🤖 Companion models:
	  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
	  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
	  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
	  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
	  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
	  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
	  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)

	## Citation

	```bibtex
	@misc{xpertsystems_cyb008_baseline_2026,
	title = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
	author = {XpertSystems.ai},
	year = {2026},
	url = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
	note = {Baseline reference model trained on xpertsystems/cyb008-sample}
	}
	```