pradeep-xpert committed
Commit 001717c · verified · 1 Parent(s): 8f2baa5

Initial release: XGBoost + MLP for SOC alert triage outcome classification, with structural-leakage and unlearnable-target diagnostic

README.md ADDED
@@ -0,0 +1,495 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ library_name: pytorch
4
+ tags:
5
+ - cybersecurity
6
+ - soc-operations
7
+ - alert-triage
8
+ - mitre-attack
9
+ - soar
10
+ - siem
11
+ - tabular-classification
12
+ - synthetic-data
13
+ - xgboost
14
+ - baseline
15
+ - leakage-diagnostic
16
+ pipeline_tag: tabular-classification
17
+ base_model: []
18
+ datasets:
19
+ - xpertsystems/cyb008-sample
20
+ metrics:
21
+ - accuracy
22
+ - f1
23
+ - roc_auc
24
+ model-index:
25
+ - name: cyb008-baseline-classifier
26
+ results:
27
+ - task:
28
+ type: tabular-classification
29
+ name: 5-class SOC alert triage outcome classification
30
+ dataset:
31
+ type: xpertsystems/cyb008-sample
32
+ name: CYB008 Synthetic SOC Alert Dataset (Sample)
33
+ metrics:
34
+ - type: roc_auc
35
+ value: 0.9522
36
+ name: Test macro ROC-AUC OvR (XGBoost, seed 42)
37
+ - type: accuracy
38
+ value: 0.7659
39
+ name: Test accuracy (XGBoost, seed 42)
40
+ - type: f1
41
+ value: 0.7430
42
+ name: Test macro-F1 (XGBoost, seed 42)
43
+ - type: accuracy
44
+ value: 0.777
45
+ name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
46
+ - type: roc_auc
47
+ value: 0.955
48
+ name: Multi-seed ROC-AUC mean ± 0.003 (XGBoost, 10 seeds)
49
+ - type: roc_auc
50
+ value: 0.9552
51
+ name: Test macro ROC-AUC OvR (MLP, seed 42)
52
+ - type: accuracy
53
+ value: 0.7674
54
+ name: Test accuracy (MLP, seed 42)
55
+ - type: f1
56
+ value: 0.7510
57
+ name: Test macro-F1 (MLP, seed 42)
58
+ ---
59
+
60
+ # CYB008 Baseline Classifier
61
+
62
+ **SOC alert triage classifier trained on the CYB008 synthetic SOC alert
63
+ sample. Predicts which of 5 triage outcome classes
64
+ (`auto_resolved_soar` / `duplicate_merged` / `false_positive_closed` /
65
+ `true_positive_remediated` / `true_positive_escalated`) an alert
66
+ will reach, from per-alert features. It also ships a leakage diagnostic
67
+ for the three structural-oracle columns dropped from the feature
68
+ pipeline.**
69
+
70
+ > **Read this first.** This repo ships two related artifacts:
71
+ > (1) a working baseline classifier for `resolution_outcome` (the
72
+ > primary product), and (2) a `leakage_diagnostic.json` file
73
+ > documenting (a) the three structural oracle columns that were
74
+ > dropped from the feature set, and (b) the separate finding that the
75
+ > dataset README's second suggested use case, MITRE ATT&CK tactic
+ > classification, is **not learnable** on this sample. Both files
77
+ > matter; the diagnostic is required reading for anyone evaluating
78
+ > CYB008 for a triage product.
79
+
80
+ ## Model overview
81
+
82
+ | Property | Value |
83
+ |---|---|
84
+ | Primary task | 5-class `resolution_outcome` classification (SOC alert triage) |
85
+ | Secondary artifact | `leakage_diagnostic.json` — structural oracle + unlearnable-target audit |
86
+ | Training data | `xpertsystems/cyb008-sample` (9,200 alerts) |
87
+ | Models | XGBoost + PyTorch MLP |
88
+ | Input features | 53 (after one-hot encoding) |
89
+ | Split | **Stratified random** (no natural group key in this dataset — see rationale below) |
90
+ | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
91
+ | License | CC-BY-NC-4.0 (matches dataset) |
92
+ | Status | Reference baseline + leakage diagnostic |
93
+
94
+ ## Why this task — and what was dropped
95
+
96
+ The CYB008 README lists **alert triage (TP vs FP prediction)** as its
97
+ first suggested use case and **MITRE ATT&CK tactic classification** as
98
+ its second. We piloted both on the sample dataset:
99
+
100
+ - **Triage outcome:** works honestly. After dropping 3 structural
101
+ oracle columns, the model achieves **acc 0.777 ± 0.007, ROC-AUC
102
+ 0.955 ± 0.003** on 5-class classification. This is the primary
103
+ baseline.
104
+
105
+ - **MITRE tactic classification:** **does NOT work on this sample.**
106
+ Without `mitre_technique_id` (which is a perfect ATT&CK-by-design
107
+ oracle), the per-tactic feature distributions are nearly identical
108
+ (raw_score 0.37–0.39 across all 12 tactics, similar for enriched
109
+ score and fatigue). A trained XGBoost achieves accuracy 0.08,
110
+ below the majority baseline of 0.14. The dataset README's stated use case
111
+ cannot be honestly demonstrated on the sample. See
112
+ [`leakage_diagnostic.json`](./leakage_diagnostic.json) for the full
113
+ finding and our recommendation to the dataset author.
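The learnability check behind both bullets compares a model's accuracy with the share of the most frequent class. The sketch below is illustrative only: `majority_baseline` is a hypothetical helper, not part of this repo, and the accuracies are the figures quoted in this README rather than recomputed values.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# resolution_outcome class counts, copied from the "Training data" table.
outcome_counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}
labels = [name for name, n in outcome_counts.items() for _ in range(n)]
baseline = majority_baseline(labels)

# Reported figures: triage accuracy 0.777 (10-seed mean);
# tactic accuracy 0.08 against a majority baseline of 0.14.
triage_beats_baseline = 0.777 > baseline
tactic_beats_baseline = 0.08 > 0.14

print(f"triage majority baseline = {baseline:.3f}, beaten: {triage_beats_baseline}")
print(f"tactic baseline beaten: {tactic_beats_baseline}")
```

A target whose trained model cannot beat this baseline on held-out data is a reasonable candidate for the "unlearnable on this sample" label.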
114
+
115
+ ### The three structural oracle columns (dropped)
116
+
117
+ CYB008 has three columns that structurally encode the
118
+ `resolution_outcome` label:
119
+
120
+ | Column | Oracle relationship |
121
+ |---|---|
122
+ | `alert_lifecycle_phase` | 3 of 4 values deterministically map to specific outcomes (auto_closed → auto_resolved_soar; escalated → true_positive_escalated; suppressed_duplicate → duplicate_merged) |
123
+ | `automation_resolved` | Exact 1:1 with `auto_resolved_soar` outcome |
124
+ | `escalation_flag` | 1319 escalation flags = 1319 `true_positive_escalated` outcomes (near-1:1) |
125
+
126
+ With all three present, plain XGBoost achieves **100% test accuracy
127
+ across all seeds** — mechanical, not learned. With all three dropped,
128
+ accuracy is **0.777 ± 0.007 with ROC-AUC 0.955 ± 0.003** (10 seeds): real learning on a
129
+ non-trivial 5-class task. The published baseline trains with these
130
+ three columns excluded.
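One way to spot columns like these is to measure, for each value of a column, how often the most common label occurs. This is a minimal sketch and not the audit actually used for `leakage_diagnostic.json`; `value_purity` is a hypothetical helper and the rows are made up for illustration.

```python
from collections import Counter, defaultdict

def value_purity(rows, column, target):
    """Map each value of `column` to (most common target label, its share)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    result = {}
    for value, counts in by_value.items():
        label, n = counts.most_common(1)[0]
        result[value] = (label, n / sum(counts.values()))
    return result

# Toy rows (made up, NOT drawn from CYB008).
rows = [
    {"automation_resolved": 1, "soar_playbook_triggered": 1, "resolution_outcome": "auto_resolved_soar"},
    {"automation_resolved": 1, "soar_playbook_triggered": 1, "resolution_outcome": "auto_resolved_soar"},
    {"automation_resolved": 0, "soar_playbook_triggered": 1, "resolution_outcome": "true_positive_remediated"},
    {"automation_resolved": 0, "soar_playbook_triggered": 0, "resolution_outcome": "false_positive_closed"},
    {"automation_resolved": 0, "soar_playbook_triggered": 0, "resolution_outcome": "true_positive_escalated"},
]

for column in ("automation_resolved", "soar_playbook_triggered"):
    purity = value_purity(rows, column, "resolution_outcome")
    # A share of 1.0 means that value fully determines the label in these rows.
    determined = [value for value, (_, share) in purity.items() if share == 1.0]
    print(column, "-> fully determined values:", determined)
```

On real data a share of 1.0 only matters when the value is frequent; a purity threshold combined with a minimum row count is a sensible refinement.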
131
+
132
+ Two model artifacts are published. They are designed to be used
133
+ together — disagreement is a useful triage signal:
134
+
135
+ - `model_xgb.json` — gradient-boosted trees
136
+ - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
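A disagreement check between the two models can be as small as comparing their top classes. The sketch below is an assumption-laden illustration: `compare_predictions` is a hypothetical helper, the label order is taken from the confusion-matrix `labels` array in `ablation_results.json` (the authoritative mapping is `INT_TO_LABEL` in `feature_engineering.py`), and the probability vectors are made up.

```python
# Assumed label order (from the confusion-matrix labels in ablation_results.json).
LABELS = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_remediated",
    "true_positive_escalated",
]

def compare_predictions(proba_xgb, proba_mlp, labels=LABELS):
    """Return each model's top label and whether the two models disagree."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "disagree": top_xgb != top_mlp,
    }

# Made-up class probabilities for a single alert (illustration only).
result = compare_predictions(
    [0.10, 0.02, 0.03, 0.45, 0.40],
    [0.08, 0.02, 0.05, 0.35, 0.50],
)
print(result)
```

Alerts flagged with `disagree` could be routed to an analyst rather than auto-actioned; whether that helps is something to validate on your own data.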
137
+
138
+ On CYB008 the MLP slightly outperforms XGBoost on the test fold
139
+ (0.767 vs 0.766 accuracy, 0.955 vs 0.952 ROC-AUC at seed 42) — only
140
+ the second SKU in the XpertSystems baseline catalog where this
141
+ happens (after CYB007).
142
+
143
+ ## Quick start
144
+
145
+ ```bash
146
+ pip install xgboost torch safetensors pandas huggingface_hub
147
+ ```
148
+
149
+ ```python
+ import json, os, sys
+ import numpy as np, torch, xgboost as xgb
+ from huggingface_hub import hf_hub_download
+ from safetensors.torch import load_file  # needed only for the MLP weights
+
+ REPO = "xpertsystems/cyb008-baseline-classifier"
+
+ # Download the model artifacts and the feature pipeline from the Hub.
+ paths = {n: hf_hub_download(REPO, n) for n in [
+     "model_xgb.json", "model_mlp.safetensors",
+     "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+ ]}
+
+ # Make the downloaded feature_engineering.py importable.
+ sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+ from feature_engineering import transform_single, load_meta, INT_TO_LABEL
+
+ meta = load_meta(paths["feature_meta.json"])
+ xgb_model = xgb.XGBClassifier()
+ xgb_model.load_model(paths["model_xgb.json"])
+
+ # Predict (see inference_example.ipynb for the full pattern).
+ # `my_alert_record` is a placeholder for your own per-alert record.
+ # Do NOT include alert_lifecycle_phase, automation_resolved, or
+ # escalation_flag in it: those were the oracle columns.
+ X = transform_single(my_alert_record, meta)
+ proba = xgb_model.predict_proba(X)[0]
+ print(INT_TO_LABEL[int(np.argmax(proba))])
+ ```
175
+
176
+ See [`inference_example.ipynb`](./inference_example.ipynb) for the full
177
+ copy-paste demo.
178
+
179
+ ## Training data
180
+
181
+ Trained on the public sample of CYB008, 9,200 per-alert records:
182
+
183
+ | Outcome | Alerts | Class share |
184
+ |---|---:|---:|
185
+ | `false_positive_closed` | 2,996 | 32.6% |
186
+ | `auto_resolved_soar` | 2,642 | 28.7% |
187
+ | `true_positive_remediated` | 1,848 | 20.1% |
188
+ | `true_positive_escalated` | 1,319 | 14.3% |
189
+ | `duplicate_merged` | 395 | 4.3% |
190
+
191
+ ### Stratified split (no natural group key)
192
+
193
+ CYB008 does not have a natural row-level group key for group-aware
194
+ splitting:
195
+ - 25 analysts — group-aware split would yield only ~4 test analysts
196
+ - 5 SOCs — would yield 1 test SOC
197
+ - 589 incidents — only 9% of alerts have a non-null `incident_id`
198
+
199
+ Alerts are essentially independent given features, so we use
200
+ **StratifiedShuffleSplit** (nested 70/15/15), the same approach as
201
+ CYB001 for network flow classification:
202
+
203
+ | Fold | Alerts |
204
+ |---|---:|
205
+ | Train | 6,440 |
206
+ | Validation | 1,380 |
207
+ | Test | 1,380 |
208
+
209
+ Class imbalance is addressed with balanced class weights (passed to XGBoost as
210
+ `sample_weight`) and weighted cross-entropy (MLP).
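The fold sizes and the balanced weights follow from simple arithmetic. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes × class count), and for illustration applies it to the full-sample counts from the table above; the actual training code is private and presumably computes weights on the training fold, so the exact values may differ.

```python
# resolution_outcome class counts for the full 9,200-alert sample.
counts = {
    "false_positive_closed": 2996,
    "auto_resolved_soar": 2642,
    "true_positive_remediated": 1848,
    "true_positive_escalated": 1319,
    "duplicate_merged": 395,
}
n_total = sum(counts.values())
n_classes = len(counts)

# Nested 70/15/15 split sizes.
n_train = round(0.70 * n_total)
n_val = round(0.15 * n_total)
n_test = n_total - n_train - n_val

# "Balanced" heuristic: rarer classes receive proportionally larger weights,
# so every class contributes the same total weight.
class_weight = {name: n_total / (n_classes * n) for name, n in counts.items()}

print("folds:", n_train, n_val, n_test)
for name, weight in class_weight.items():
    print(f"{name:26s} weight = {weight:.3f}")
```

For XGBoost, each row would carry the weight of its class in `sample_weight`; for the MLP, the same per-class values can serve as the class weights of the cross-entropy loss.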
211
+
212
+ ## Feature pipeline
213
+
214
+ The bundled `feature_engineering.py` is the canonical feature recipe.
215
+ 53 features survive after encoding, drawn from:
216
+
217
+ - **Per-alert numeric** (9): `raw_score`, `enriched_score`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `soar_playbook_triggered`, `sla_breached_flag`, `mttd_minutes`, `mttr_minutes`, `fatigue_score_at_alert`
218
+ - **Per-alert categorical** (5, one-hot): `alert_severity` (7 values), `alert_source` (8 values), `mitre_tactic` (12 values), `analyst_tier` (3 values), `siem_platform` (8 values)
219
+ - **Engineered** (6): `enrichment_lift`, `log_mttr`, `log_mttd`, `queue_pressure`, `enrichment_per_minute`, `is_high_confidence`
220
+
221
+ ### Excluded columns
222
+
223
+ **Oracle columns** (dropped to allow honest evaluation):
224
+
225
+ | Column | Why excluded |
226
+ |---|---|
227
+ | `alert_lifecycle_phase` | 3 of 4 values are deterministic outcome oracles |
228
+ | `automation_resolved` | 1:1 with `auto_resolved_soar` outcome |
229
+ | `escalation_flag` | Near-1:1 with `true_positive_escalated` outcome |
230
+
231
+ **High-cardinality columns** (dropped for tractability):
232
+
233
+ | Column | Why excluded |
234
+ |---|---|
235
+ | `mitre_technique_id` | 36 unique values; perfect oracle for `mitre_tactic` but unrelated to this target |
236
+ | `detection_rule_id` | 656 unique values; one-hot explosion with no real per-tactic affinity (only 5% of rules map to a single tactic) |
237
+
238
+ ### Partial-oracle features (kept as legitimate observables)
239
+
240
+ `soar_playbook_triggered` is a *necessary but not sufficient* condition
241
+ for `auto_resolved_soar` — when 0, the alert is never auto-resolved;
242
+ when 1, the outcome is auto-resolved 68% of the time but can also be
243
+ TP-remediated, TP-escalated, FP-closed, or duplicate-merged. This is
244
+ a legitimate observable that downstream operators would already have
245
+ on hand at decision time. KEPT in the pipeline.
246
+
247
+ ## Evaluation
248
+
249
+ ### Test-set metrics, seed 42 (n = 1,380 alerts)
250
+
251
+ **XGBoost** (the published `model_xgb.json` artifact)
252
+
253
+ | Metric | Value |
254
+ |---|---:|
255
+ | Macro ROC-AUC (OvR) | **0.9522** |
256
+ | Accuracy | **0.7659** |
257
+ | Macro-F1 | 0.7430 |
258
+ | Weighted-F1 | 0.7669 |
259
+
260
+ **MLP** (the published `model_mlp.safetensors` artifact) — **slightly outperforms XGBoost**
261
+
262
+ | Metric | Value |
263
+ |---|---:|
264
+ | Macro ROC-AUC (OvR) | **0.9552** |
265
+ | Accuracy | **0.7674** |
266
+ | Macro-F1 | 0.7510 |
267
+ | Weighted-F1 | 0.7691 |
268
+
269
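+ With 6,440 training rows and 53 features, the MLP has enough data to
+ compete with boosted trees. The seed-42 accuracy gap is 2 of 1,380 test
+ alerts, well inside the multi-seed std of 0.007, so treat the two models
+ as effectively tied. Both models are published.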
271
+
272
+ ### Multi-seed robustness (XGBoost, 10 seeds)
273
+
274
+ Very stable performance — std 0.007 on accuracy is among the tightest
275
+ in the XpertSystems catalog:
276
+
277
+ | Metric | Mean | Std | Min | Max |
278
+ |---|---:|---:|---:|---:|
279
+ | Accuracy | 0.777 | 0.007 | 0.766 | 0.792 |
280
+ | Macro-F1 | 0.765 | 0.011 | 0.743 | 0.783 |
281
+ | Macro ROC-AUC OvR | 0.955 | 0.003 | 0.950 | 0.960 |
282
+
283
+ Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
284
+ All 10 seeds yielded all 5 classes in the test fold (stratified split
285
+ guarantees this).
286
+
287
+ ### Per-class F1 (seed 42)
288
+
289
+ | Outcome | Class share | XGBoost F1 | MLP F1 |
290
+ |---|---:|---:|---:|
291
+ | `false_positive_closed` | 32.6% | **0.904** | 0.910 |
292
+ | `duplicate_merged` | 4.3% | 0.794 | 0.825 |
293
+ | `auto_resolved_soar` | 28.7% | 0.757 | 0.751 |
294
+ | `true_positive_remediated` | 20.1% | 0.701 | 0.698 |
295
+ | `true_positive_escalated` | 14.3% | 0.559 | 0.571 |
296
+
297
+ The model performs best on `false_positive_closed` (clearest behavioural
298
+ profile — low scores, fast resolution by L1 analysts) and
299
+ `duplicate_merged` (smallest class but distinctive — duplicate-suppressed
300
+ severity is a strong tell). The hardest discrimination is between
301
+ `true_positive_remediated` and `true_positive_escalated` — both are
302
+ genuine threats, differing primarily by whether the alert was closed
303
+ by the original analyst or passed to a higher tier. In production this
304
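+ matters less because both are TP outcomes: in the seed-42 XGBoost confusion
+ matrix, 396 of 475 true-positive alerts (0.83) are predicted as one of the two
+ TP classes, and none is predicted as `false_positive_closed`.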
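The per-class F1 values, and the effect of merging the two TP classes, can be recomputed from the seed-42 XGBoost confusion matrix stored in `ablation_results.json`. The sketch below assumes rows are true classes and columns are predictions, which is consistent with the row sums matching the class shares; the helper name `per_class_f1` is illustrative.

```python
LABELS = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_remediated",
    "true_positive_escalated",
]
# full_model_metrics.confusion_matrix.matrix from ablation_results.json
# (assumed layout: rows = true class, columns = predicted class).
MATRIX = [
    [340, 17, 6, 16, 17],
    [9, 50, 0, 0, 0],
    [74, 0, 376, 0, 0],
    [40, 0, 0, 189, 48],
    [39, 0, 0, 57, 102],
]

def per_class_f1(matrix):
    """F1 per class: 2*TP / (number predicted + number actual)."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)
        actual = sum(matrix[k])
        scores.append(2 * tp / (predicted + actual) if predicted + actual else 0.0)
    return scores

total = sum(sum(row) for row in MATRIX)
accuracy = sum(MATRIX[k][k] for k in range(len(MATRIX))) / total
f1 = dict(zip(LABELS, per_class_f1(MATRIX)))

# Merge the two TP classes: a TP alert counts as recalled when it is
# predicted as either TP class.
tp_idx = [LABELS.index("true_positive_remediated"),
          LABELS.index("true_positive_escalated")]
tp_total = sum(sum(MATRIX[i]) for i in tp_idx)
tp_hit = sum(MATRIX[i][j] for i in tp_idx for j in tp_idx)
tp_recall = tp_hit / tp_total

print(f"accuracy = {accuracy:.4f}")
print({name: round(score, 3) for name, score in f1.items()})
print(f"merged TP recall = {tp_hit}/{tp_total} = {tp_recall:.3f}")
```

The remaining TP alerts in this matrix are all predicted as `auto_resolved_soar`, which is the confusion worth examining first.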
306
+
307
+ ### Ablation: which feature groups matter
308
+
309
+ | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy (ablated − full) |
310
+ |---|---:|---:|---:|---:|
311
+ | Full feature set (published) | 0.7659 | 0.7430 | 0.9522 | — |
312
+ | No alert severity | 0.5138 | 0.3933 | 0.7304 | **−0.2522** |
313
+ | No `soar_playbook_triggered` | 0.6188 | 0.5773 | 0.8369 | **−0.1471** |
314
+ | No analyst tier | 0.7717 | 0.7471 | 0.9524 | +0.0058 |
315
+ | No siem platform | 0.7681 | 0.7474 | 0.9522 | +0.0022 |
316
+ | No alert source | 0.7638 | 0.7406 | 0.9511 | −0.0022 |
317
+ | No engineered features | 0.7681 | 0.7480 | 0.9533 | +0.0022 |
318
+ | No mitre_tactic | 0.7812 | 0.7656 | 0.9530 | +0.0152 |
319
+ | No timing features | 0.7775 | 0.7572 | 0.9547 | +0.0116 |
320
+ | No score features | 0.7710 | 0.7569 | 0.9541 | +0.0051 |
321
+
322
+ Four findings:
323
+
324
+ 1. **Alert severity carries the dominant signal** (drops 25 pp
325
+ accuracy, 22 pp ROC-AUC). This is intuitive: severity directly
326
+ drives triage priority, which drives outcome. `false_positive`
327
+ severity → `false_positive_closed`; `duplicate_suppressed` severity
328
+ → `duplicate_merged`.
329
+ 2. **`soar_playbook_triggered` is the second-strongest signal**
330
+ (drops 15 pp accuracy). It's a partial oracle for the
331
+ `auto_resolved_soar` outcome class.
332
+ 3. **MITRE tactic and analyst tier contribute essentially nothing.**
333
+ The model performs marginally *better* without them — they add
334
+ noise that the trees over-fit on the training set.
335
+ 4. **Engineered features and timing features are near-flat.** The
336
+ trees recover composites from raw inputs. Kept in the pipeline as
337
+ a documented baseline reference.
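The ranking behind these findings is a subtraction per row. The sketch below copies the accuracies from the ablation table and uses the table's sign convention (ablated minus full, so a negative value means the group helps); note that `ablation_results.json` stores `delta_accuracy` with the opposite sign (full minus ablated).

```python
FULL_ACCURACY = 0.7659  # full feature set, seed 42

# Accuracy with one feature group removed, copied from the ablation table.
ablated_accuracy = {
    "no alert severity": 0.5138,
    "no soar_playbook_triggered": 0.6188,
    "no analyst tier": 0.7717,
    "no siem platform": 0.7681,
    "no alert source": 0.7638,
    "no engineered features": 0.7681,
    "no mitre_tactic": 0.7812,
    "no timing features": 0.7775,
    "no score features": 0.7710,
}

# Delta in percentage points: negative means removing the group hurts.
delta_pp = {
    name: (acc - FULL_ACCURACY) * 100 for name, acc in ablated_accuracy.items()
}
ranked = sorted(delta_pp.items(), key=lambda item: item[1])

for name, delta in ranked:
    print(f"{name:28s} {delta:+7.2f} pp")
```

Deltas within about one percentage point are comparable to the multi-seed std of 0.007, so the small positive values should be read as noise rather than as evidence that removing a group helps.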
338
+
339
+ ### Architecture
340
+
341
+ **XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
342
+ `hist` tree method, class-balanced sample weights, early stopping on
343
+ validation mlogloss.
344
+
345
+ **MLP:** `53 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
346
+ → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
347
+ early stopping on validation macro-F1.
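The MLP layout described above can be written down as a short PyTorch module. This is a sketch under assumptions: `build_mlp` is a hypothetical name, and the layer order and parameter names of the published `model_mlp.safetensors` may differ, so use the module definition from `inference_example.ipynb` when loading the real weights.

```python
import torch
from torch import nn

def build_mlp(n_features=53, n_classes=5, dropout=0.3):
    """53 -> 128 -> 64 -> 5; each hidden layer: BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied by the loss
    )

model = build_mlp()
model.eval()  # disable dropout and use BatchNorm running statistics
with torch.no_grad():
    logits = model(torch.zeros(4, 53))  # batch of 4 dummy feature vectors

n_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), n_params)
```

At this size the network has under 16k trainable parameters, which is small relative to the 6,440 training rows.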
348
+
349
+ Training hyperparameters are held internally by XpertSystems.
350
+
351
+ ## Limitations
352
+
353
+ **This is a baseline reference, not a production SOC triage system.**
354
+
355
+ 1. **MITRE tactic classification is unlearnable on this sample.** The
356
+ dataset README lists it as a suggested use case, but the per-tactic feature
357
+ distributions are too similar (raw_score 0.37–0.39 across all 12
358
+ tactics). See [`leakage_diagnostic.json`](./leakage_diagnostic.json)
359
+ for the full audit. Real SOC data has stronger per-tactic feature
360
+ signatures.
361
+
362
+ 2. **TP-remediated vs TP-escalated is the hardest discrimination.**
363
+ F1 0.56 on TP-escalated is the weakest per-class result. Both are
364
+ genuine threats; the difference is workflow rather than threat
365
+ nature. For most operational uses (TP-vs-FP recall, SLA-breach
366
+ reduction), this confusion does not matter.
367
+
368
+ 3. **MLP modestly outperforms XGBoost.** Both are shipped; we
369
+ recommend running both and treating disagreement as a triage
370
+ signal. The margin is modest enough that for production
371
+ deployment, the choice between them is essentially an engineering
372
+ preference.
373
+
374
+ 4. **Synthetic-vs-real transfer.** The dataset is synthetic and
375
+ calibrated to 12 benchmark metrics drawn from SOC-operations sources (SANS SOC Survey, IBM
376
+ Cost of Data Breach, Mandiant M-Trends, Forrester Wave SOAR,
377
+ Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk State of
378
+ Security, Verizon DBIR). Real SOC telemetry has different noise
379
+ characteristics and the structural-oracle pattern documented
380
+ above (alert_lifecycle_phase deterministically encoding outcome)
381
+ would not be present in real data — real lifecycle phases
382
+ transition stochastically. Do not assume metrics transfer
383
+ end-to-end.
384
+
385
+ 5. **9,200 alerts is a modest training set.** The 1,380-alert test
386
+ fold yields stable multi-seed metrics (std 0.007), but full
387
+ confidence intervals for downstream production decisions should
388
+ come from the full ~280k-alert product.
389
+
390
+ ## Notes on dataset schema
391
+
392
+ The CYB008 sample dataset README describes some fields differently
393
+ from the actual schema. The model was trained on the actual schema;
394
+ this note helps buyers reconcile what they read with what they receive.
395
+
396
+ | What the README says | What the data actually contains |
397
+ |---|---|
398
+ | `incident_summary` has 8 columns | Data has **23 columns** including incident_type, kill_chain_stages_observed, false_positive_rate, soar_actions_taken, etc. |
399
+ | `alert_severity` has 6 values (info / low / medium / high / critical / false_positive) | **7 values**: adds `duplicate_suppressed`. All values are suffixed (`high_severity`, `low_severity`, `critical_confirmed`, `informational`). |
400
+ | `analyst_tier` has 4 values (tier_1 / tier_2 / tier_3 / manager) | 3 values on alerts (`L1_junior`, `L2_senior`, `L3_threat_hunter`); 4 on `soc_topology` (adds `L4_incident_commander`). |
401
+ | 14 MITRE ATT&CK tactics | 12 tactics in the data (no `reconnaissance` or `resource_development` from PRE-ATT&CK). |
402
+ | Detection source mix: edr, siem, ndr, ids, ueba, casb, deception, threat intel | Field is `alert_source` (not `detection_source`); 8 values: `edr_behavioural_engine`, `nids_signature`, `ueba_user_anomaly`, `cspm_cloud_rule`, `siem_correlation_rule`, `threat_intel_ioc_match`, `honeypot_trigger`, `itdr_identity_anomaly`. |
403
+ | `triage_score` / `enrichment_score` columns | Actual names: `raw_score` / `enriched_score`. |
404
+ | `alert_timestamp` (ISO string) | Actual: `alert_timestamp_min` (integer minutes from epoch). |
405
+ | `kill_chain_stage`, `storm_event_flag` columns on alerts | Not present in the data. |
406
+ | Field rename: `detection_source` ↔ data `alert_source` | Same fact noted twice |
407
+ | `resolution_outcome` values (true_positive / false_positive / duplicate / suppressed) | Actual 5 values: `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_escalated`, `true_positive_remediated`. |
408
+ | Extra columns in data not in README | `shift_id`, `time_in_phase_minutes`, `queue_depth_at_ingestion`, `fatigue_score_at_alert`, `siem_platform`, `soar_playbook_id`, `detection_rule_id`, `alert_lifecycle_phase` |
409
+
410
+ None of these affects model correctness — the feature pipeline uses
411
+ the actual column names. If you build your own pipeline against the
412
+ dataset, use the actual columns.
413
+
414
+ ## Intended use
415
+
416
+ - **Evaluating fit** of the CYB008 dataset for your SOC-triage research
417
+ - **Baseline reference** for new model architectures
418
+ - **Reference example of structural-leakage diagnostics** in
419
+ synthetic SOC datasets — the diagnostic methodology is reusable
420
+ - **Feature engineering reference** for per-alert SOC telemetry
421
+
422
+ ## Out-of-scope use
423
+
424
+ - Production SOC triage decisions on real telemetry
425
+ - MITRE ATT&CK tactic prediction (this baseline establishes that
426
+ task is unlearnable on the sample)
427
+ - SLA-breach prediction (also tested as unlearnable on the sample —
428
+ acc 0.68 vs majority 0.82)
429
+ - Any operational decision affecting actual security operations
430
+ without further validation on your own data
431
+
432
+ ## Reproducibility
433
+
434
+ Outputs above were produced with `seed = 42` (published artifact),
435
+ nested `StratifiedShuffleSplit` (70/15/15), on the published sample
436
+ (`xpertsystems/cyb008-sample`, version 1.0.0, generated 2026-05-16).
437
+ The feature pipeline in `feature_engineering.py` is deterministic and
438
+ the trained weights in this repo correspond exactly to the metrics
439
+ above.
440
+
441
+ Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
442
+ in `multi_seed_results.json` confirm robust performance across splits.
443
+
444
+ The training script itself is private to XpertSystems.
445
+
446
+ ## Files in this repo
447
+
448
+ | File | Purpose |
449
+ |---|---|
450
+ | `model_xgb.json` | XGBoost weights (seed 42) |
451
+ | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
452
+ | `feature_engineering.py` | Feature pipeline |
453
+ | `feature_meta.json` | Feature column order + categorical levels |
454
+ | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
455
+ | `validation_results.json` | Per-class metrics, confusion matrix, architecture |
456
+ | `ablation_results.json` | Per-feature-group ablation |
457
+ | `multi_seed_results.json` | XGBoost metrics across 10 seeds |
458
+ | `leakage_diagnostic.json` | **Structural-oracle audit + unlearnable-target finding** |
459
+ | `inference_example.ipynb` | End-to-end inference demo notebook |
460
+ | `README.md` | This file |
461
+
462
+ ## Contact and full product
463
+
464
+ The full **CYB008** dataset contains ~335,000 rows across four files,
465
+ with calibrated benchmark validation against 12 metrics drawn from
466
+ authoritative SOC operations and threat intelligence sources (SANS
467
+ SOC Survey, IBM Cost of Data Breach, Mandiant M-Trends, Forrester
468
+ Wave SOAR, Gartner SIEM Magic Quadrant, SOC.OS, CrowdStrike, Splunk
469
+ State of Security, Verizon DBIR). The full XpertSystems.ai synthetic
470
+ data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
471
+ Insurance & Risk, Oil & Gas, and Materials & Energy.
472
+
473
+ - 📧 **pradeep@xpertsystems.ai**
474
+ - 🌐 **https://xpertsystems.ai**
475
+ - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb008-sample
476
+ - 🤖 Companion models:
477
+ - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
478
+ - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
479
+ - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
480
+ - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
481
+ - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
482
+ - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
483
+ - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
484
+
485
+ ## Citation
486
+
487
+ ```bibtex
488
+ @misc{xpertsystems_cyb008_baseline_2026,
489
+ title = {CYB008 Baseline Classifier: XGBoost and MLP for SOC Alert Triage Outcome Classification, with Structural-Leakage and Unlearnable-Target Diagnostic},
490
+ author = {XpertSystems.ai},
491
+ year = {2026},
492
+ url = {https://huggingface.co/xpertsystems/cyb008-baseline-classifier},
493
+ note = {Baseline reference model trained on xpertsystems/cyb008-sample}
494
+ }
495
+ ```
ablation_results.json ADDED
@@ -0,0 +1,659 @@
1
+ {
2
+ "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same stratified split, with one feature group dropped at a time.",
3
+ "full_model_metrics": {
4
+ "model": "xgboost",
5
+ "accuracy": 0.7659420289855072,
6
+ "macro_f1": 0.7429876131468711,
7
+ "weighted_f1": 0.7669168766123218,
8
+ "per_class_f1": {
9
+ "auto_resolved_soar": 0.7572383073496659,
10
+ "duplicate_merged": 0.7936507936507936,
11
+ "false_positive_closed": 0.9038461538461539,
12
+ "true_positive_remediated": 0.7012987012987013,
13
+ "true_positive_escalated": 0.5589041095890411
14
+ },
15
+ "confusion_matrix": {
16
+ "labels": [
17
+ "auto_resolved_soar",
18
+ "duplicate_merged",
19
+ "false_positive_closed",
20
+ "true_positive_remediated",
21
+ "true_positive_escalated"
22
+ ],
23
+ "matrix": [
24
+ [
25
+ 340,
26
+ 17,
27
+ 6,
28
+ 16,
29
+ 17
30
+ ],
31
+ [
32
+ 9,
33
+ 50,
34
+ 0,
35
+ 0,
36
+ 0
37
+ ],
38
+ [
39
+ 74,
40
+ 0,
41
+ 376,
42
+ 0,
43
+ 0
44
+ ],
45
+ [
46
+ 40,
47
+ 0,
48
+ 0,
49
+ 189,
50
+ 48
51
+ ],
52
+ [
53
+ 39,
54
+ 0,
55
+ 0,
56
+ 57,
57
+ 102
58
+ ]
59
+ ]
60
+ },
61
+ "macro_roc_auc_ovr": 0.9522005654044479
62
+ },
63
+ "ablations": {
64
+ "no_severity": {
65
+ "n_features": 46,
66
+ "dropped_count": 7,
67
+ "metrics": {
68
+ "model": "xgboost_no_severity",
69
+ "accuracy": 0.513768115942029,
70
+ "macro_f1": 0.39328452803110936,
71
+ "weighted_f1": 0.48887003837655496,
72
+ "per_class_f1": {
73
+ "auto_resolved_soar": 0.8058455114822547,
74
+ "duplicate_merged": 0.0,
75
+ "false_positive_closed": 0.4,
76
+ "true_positive_remediated": 0.3155893536121673,
77
+ "true_positive_escalated": 0.4449877750611247
78
+ },
79
+ "confusion_matrix": {
80
+ "labels": [
81
+ "auto_resolved_soar",
82
+ "duplicate_merged",
83
+ "false_positive_closed",
84
+ "true_positive_remediated",
85
+ "true_positive_escalated"
86
+ ],
87
+ "matrix": [
88
+ [
89
+ 386,
90
+ 1,
91
+ 1,
92
+ 3,
93
+ 5
94
+ ],
95
+ [
96
+ 15,
97
+ 0,
98
+ 17,
99
+ 21,
100
+ 6
101
+ ],
102
+ [
103
+ 75,
104
+ 30,
105
+ 149,
106
+ 122,
107
+ 74
108
+ ],
109
+ [
110
+ 42,
111
+ 26,
112
+ 91,
113
+ 83,
114
+ 35
115
+ ],
116
+ [
117
+ 44,
118
+ 6,
119
+ 37,
120
+ 20,
121
+ 91
122
+ ]
123
+ ]
124
+ },
125
+ "macro_roc_auc_ovr": 0.7303857388456401
126
+ },
127
+ "delta_accuracy": 0.25217391304347825,
128
+ "delta_macro_f1": 0.34970308511576176
129
+ },
130
+ "no_alert_source": {
131
+ "n_features": 45,
132
+ "dropped_count": 8,
133
+ "metrics": {
134
+ "model": "xgboost_no_alert_source",
135
+ "accuracy": 0.763768115942029,
136
+ "macro_f1": 0.7406277489805807,
137
+ "weighted_f1": 0.764708131838635,
138
+ "per_class_f1": {
139
+ "auto_resolved_soar": 0.755011135857461,
140
+ "duplicate_merged": 0.784,
141
+ "false_positive_closed": 0.8984468339307049,
142
+ "true_positive_remediated": 0.6981132075471698,
143
+ "true_positive_escalated": 0.5675675675675675
144
+ },
145
+ "confusion_matrix": {
146
+ "labels": [
147
+ "auto_resolved_soar",
148
+ "duplicate_merged",
149
+ "false_positive_closed",
150
+ "true_positive_remediated",
151
+ "true_positive_escalated"
152
+ ],
153
+ "matrix": [
154
+ [
155
+ 339,
156
+ 17,
157
+ 11,
158
+ 13,
159
+ 16
160
+ ],
161
+ [
162
+ 10,
163
+ 49,
164
+ 0,
165
+ 0,
166
+ 0
167
+ ],
168
+ [
169
+ 74,
170
+ 0,
171
+ 376,
172
+ 0,
173
+ 0
174
+ ],
175
+ [
176
+ 41,
177
+ 0,
178
+ 0,
179
+ 185,
180
+ 51
181
+ ],
182
+ [
183
+ 38,
184
+ 0,
185
+ 0,
186
+ 55,
187
+ 105
188
+ ]
189
+ ]
190
+ },
191
+ "macro_roc_auc_ovr": 0.9511218248098263
192
+ },
193
+ "delta_accuracy": 0.0021739130434782483,
194
+ "delta_macro_f1": 0.002359864166290415
195
+ },
196
+ "no_tactic": {
197
+ "n_features": 41,
198
+ "dropped_count": 12,
199
+ "metrics": {
200
+ "model": "xgboost_no_tactic",
201
+ "accuracy": 0.7811594202898551,
202
+ "macro_f1": 0.7655889644647986,
203
+ "weighted_f1": 0.7823552630061641,
204
+ "per_class_f1": {
205
+ "auto_resolved_soar": 0.7717750826901875,
206
+ "duplicate_merged": 0.8346456692913385,
207
+ "false_positive_closed": 0.908433734939759,
208
+ "true_positive_remediated": 0.7088122605363985,
209
+ "true_positive_escalated": 0.6042780748663101
210
+ },
211
+ "confusion_matrix": {
212
+ "labels": [
213
+ "auto_resolved_soar",
214
+ "duplicate_merged",
215
+ "false_positive_closed",
216
+ "true_positive_remediated",
217
+ "true_positive_escalated"
218
+ ],
219
+ "matrix": [
220
+ [
221
+ 350,
222
+ 15,
223
+ 3,
224
+ 15,
225
+ 13
226
+ ],
227
+ [
228
+ 6,
229
+ 53,
230
+ 0,
231
+ 0,
232
+ 0
233
+ ],
234
+ [
235
+ 73,
236
+ 0,
237
+ 377,
238
+ 0,
239
+ 0
240
+ ],
241
+ [
242
+ 42,
243
+ 0,
244
+ 0,
245
+ 185,
246
+ 50
247
+ ],
248
+ [
249
+ 40,
250
+ 0,
251
+ 0,
252
+ 45,
253
+ 113
254
+ ]
255
+ ]
256
+ },
257
+ "macro_roc_auc_ovr": 0.9529923809161402
258
+ },
259
+ "delta_accuracy": -0.01521739130434785,
260
+ "delta_macro_f1": -0.02260135131792751
261
+ },
262
+ "no_siem": {
263
+ "n_features": 45,
264
+ "dropped_count": 8,
265
+ "metrics": {
266
+ "model": "xgboost_no_siem",
267
+ "accuracy": 0.7681159420289855,
268
+ "macro_f1": 0.747392848800313,
269
+ "weighted_f1": 0.7695871955675133,
270
+ "per_class_f1": {
271
+ "auto_resolved_soar": 0.7577777777777778,
272
+ "duplicate_merged": 0.8,
273
+ "false_positive_closed": 0.9025270758122743,
274
+ "true_positive_remediated": 0.706766917293233,
275
+ "true_positive_escalated": 0.5698924731182796
276
+ },
277
+ "confusion_matrix": {
278
+ "labels": [
279
+ "auto_resolved_soar",
280
+ "duplicate_merged",
281
+ "false_positive_closed",
282
+ "true_positive_remediated",
283
+ "true_positive_escalated"
284
+ ],
285
+ "matrix": [
286
+ [
287
+ 341,
288
+ 16,
289
+ 6,
290
+ 15,
291
+ 18
292
+ ],
293
+ [
294
+ 9,
295
+ 50,
296
+ 0,
297
+ 0,
298
+ 0
299
+ ],
300
+ [
301
+ 75,
302
+ 0,
303
+ 375,
304
+ 0,
305
+ 0
306
+ ],
307
+ [
308
+ 39,
309
+ 0,
310
+ 0,
311
+ 188,
312
+ 50
313
+ ],
314
+ [
315
+ 40,
316
+ 0,
317
+ 0,
318
+ 52,
319
+ 106
320
+ ]
321
+ ]
322
+ },
323
+ "macro_roc_auc_ovr": 0.9521514669693077
324
+ },
325
+ "delta_accuracy": -0.0021739130434782483,
326
+ "delta_macro_f1": -0.0044052356534418635
327
+ },
328
+ "no_analyst_tier": {
329
+ "n_features": 50,
330
+ "dropped_count": 3,
331
+ "metrics": {
332
+ "model": "xgboost_no_analyst_tier",
333
+ "accuracy": 0.7717391304347826,
334
+ "macro_f1": 0.7470947169858246,
335
+ "weighted_f1": 0.7727339237745289,
336
+ "per_class_f1": {
337
+ "auto_resolved_soar": 0.768893756845564,
338
+ "duplicate_merged": 0.784,
339
+ "false_positive_closed": 0.9071170084439083,
340
+ "true_positive_remediated": 0.6948176583493282,
341
+ "true_positive_escalated": 0.5806451612903226
342
+ },
343
+ "confusion_matrix": {
344
+ "labels": [
345
+ "auto_resolved_soar",
346
+ "duplicate_merged",
347
+ "false_positive_closed",
348
+ "true_positive_remediated",
349
+ "true_positive_escalated"
350
+ ],
351
+ "matrix": [
352
+ [
353
+ 351,
354
+ 17,
355
+ 3,
356
+ 14,
357
+ 11
358
+ ],
359
+ [
360
+ 10,
361
+ 49,
362
+ 0,
363
+ 0,
364
+ 0
365
+ ],
366
+ [
367
+ 74,
368
+ 0,
369
+ 376,
370
+ 0,
371
+ 0
372
+ ],
373
+ [
374
+ 41,
375
+ 0,
376
+ 0,
377
+ 181,
378
+ 55
379
+ ],
380
+ [
381
+ 41,
382
+ 0,
383
+ 0,
384
+ 49,
385
+ 108
386
+ ]
387
+ ]
388
+ },
389
+ "macro_roc_auc_ovr": 0.9524262361561989
390
+ },
391
+ "delta_accuracy": -0.005797101449275366,
392
+ "delta_macro_f1": -0.004107103838953519
393
+ },
394
+ "no_timing": {
395
+ "n_features": 48,
396
+ "dropped_count": 5,
397
+ "metrics": {
398
+ "model": "xgboost_no_timing",
399
+ "accuracy": 0.777536231884058,
400
+ "macro_f1": 0.7572452763946715,
401
+ "weighted_f1": 0.7795520836463574,
402
+ "per_class_f1": {
403
+ "auto_resolved_soar": 0.7676991150442478,
404
+ "duplicate_merged": 0.8031496062992126,
405
+ "false_positive_closed": 0.9071170084439083,
406
+ "true_positive_remediated": 0.723404255319149,
407
+ "true_positive_escalated": 0.5848563968668408
408
+ },
409
+ "confusion_matrix": {
410
+ "labels": [
411
+ "auto_resolved_soar",
412
+ "duplicate_merged",
413
+ "false_positive_closed",
414
+ "true_positive_remediated",
415
+ "true_positive_escalated"
416
+ ],
417
+ "matrix": [
418
+ [
419
+ 347,
420
+ 17,
421
+ 3,
422
+ 9,
423
+ 20
424
+ ],
425
+ [
426
+ 8,
427
+ 51,
428
+ 0,
429
+ 0,
430
+ 0
431
+ ],
432
+ [
433
+ 74,
434
+ 0,
435
+ 376,
436
+ 0,
437
+ 0
438
+ ],
439
+ [
440
+ 37,
441
+ 0,
442
+ 0,
443
+ 187,
444
+ 53
445
+ ],
446
+ [
447
+ 42,
448
+ 0,
449
+ 0,
450
+ 44,
451
+ 112
452
+ ]
453
+ ]
454
+ },
455
+ "macro_roc_auc_ovr": 0.9546713378957848
456
+ },
457
+ "delta_accuracy": -0.011594202898550732,
458
+ "delta_macro_f1": -0.014257663247800423
459
+ },
460
+ "no_scores": {
461
+ "n_features": 48,
462
+ "dropped_count": 5,
463
+ "metrics": {
464
+ "model": "xgboost_no_scores",
465
+ "accuracy": 0.7710144927536232,
466
+ "macro_f1": 0.7569411600325896,
467
+ "weighted_f1": 0.7729871790343515,
468
+ "per_class_f1": {
469
+ "auto_resolved_soar": 0.7531285551763367,
470
+ "duplicate_merged": 0.8253968253968254,
471
+ "false_positive_closed": 0.9019138755980861,
472
+ "true_positive_remediated": 0.7047970479704797,
473
+ "true_positive_escalated": 0.5994694960212201
474
+ },
475
+ "confusion_matrix": {
476
+ "labels": [
477
+ "auto_resolved_soar",
478
+ "duplicate_merged",
479
+ "false_positive_closed",
480
+ "true_positive_remediated",
481
+ "true_positive_escalated"
482
+ ],
483
+ "matrix": [
484
+ [
485
+ 331,
486
+ 15,
487
+ 9,
488
+ 23,
489
+ 18
490
+ ],
491
+ [
492
+ 7,
493
+ 52,
494
+ 0,
495
+ 0,
496
+ 0
497
+ ],
498
+ [
499
+ 73,
500
+ 0,
501
+ 377,
502
+ 0,
503
+ 0
504
+ ],
505
+ [
506
+ 38,
507
+ 0,
508
+ 0,
509
+ 191,
510
+ 48
511
+ ],
512
+ [
513
+ 34,
514
+ 0,
515
+ 0,
516
+ 51,
517
+ 113
518
+ ]
519
+ ]
520
+ },
521
+ "macro_roc_auc_ovr": 0.9541430544791097
522
+ },
523
+ "delta_accuracy": -0.005072463768115987,
524
+ "delta_macro_f1": -0.013953546885718482
525
+ },
526
+ "no_soar": {
527
+ "n_features": 52,
528
+ "dropped_count": 1,
529
+ "metrics": {
530
+ "model": "xgboost_no_soar",
531
+ "accuracy": 0.618840579710145,
532
+ "macro_f1": 0.5773360587813117,
533
+ "weighted_f1": 0.5258347983183296,
534
+ "per_class_f1": {
535
+ "auto_resolved_soar": 0.028846153846153848,
536
+ "duplicate_merged": 0.8194444444444444,
537
+ "false_positive_closed": 0.8424068767908309,
538
+ "true_positive_remediated": 0.6328358208955224,
539
+ "true_positive_escalated": 0.5631469979296067
540
+ },
541
+ "confusion_matrix": {
542
+ "labels": [
543
+ "auto_resolved_soar",
544
+ "duplicate_merged",
545
+ "false_positive_closed",
546
+ "true_positive_remediated",
547
+ "true_positive_escalated"
548
+ ],
549
+ "matrix": [
550
+ [
551
+ 6,
552
+ 26,
553
+ 156,
554
+ 122,
555
+ 86
556
+ ],
557
+ [
558
+ 0,
559
+ 59,
560
+ 0,
561
+ 0,
562
+ 0
563
+ ],
564
+ [
565
+ 9,
566
+ 0,
567
+ 441,
568
+ 0,
569
+ 0
570
+ ],
571
+ [
572
+ 2,
573
+ 0,
574
+ 0,
575
+ 212,
576
+ 63
577
+ ],
578
+ [
579
+ 3,
580
+ 0,
581
+ 0,
582
+ 59,
583
+ 136
584
+ ]
585
+ ]
586
+ },
587
+ "macro_roc_auc_ovr": 0.8369099942380366
588
+ },
589
+ "delta_accuracy": 0.14710144927536228,
590
+ "delta_macro_f1": 0.16565155436555945
591
+ },
592
+ "no_engineered": {
593
+ "n_features": 47,
594
+ "dropped_count": 6,
595
+ "metrics": {
596
+ "model": "xgboost_no_engineered",
597
+ "accuracy": 0.7681159420289855,
598
+ "macro_f1": 0.7479996795268518,
599
+ "weighted_f1": 0.7700206321761683,
600
+ "per_class_f1": {
601
+ "auto_resolved_soar": 0.7542087542087542,
602
+ "duplicate_merged": 0.796875,
603
+ "false_positive_closed": 0.9027611044417767,
604
+ "true_positive_remediated": 0.7094339622641509,
605
+ "true_positive_escalated": 0.5767195767195767
606
+ },
607
+ "confusion_matrix": {
608
+ "labels": [
609
+ "auto_resolved_soar",
610
+ "duplicate_merged",
611
+ "false_positive_closed",
612
+ "true_positive_remediated",
613
+ "true_positive_escalated"
614
+ ],
615
+ "matrix": [
616
+ [
617
+ 336,
618
+ 18,
619
+ 7,
620
+ 13,
621
+ 22
622
+ ],
623
+ [
624
+ 8,
625
+ 51,
626
+ 0,
627
+ 0,
628
+ 0
629
+ ],
630
+ [
631
+ 74,
632
+ 0,
633
+ 376,
634
+ 0,
635
+ 0
636
+ ],
637
+ [
638
+ 40,
639
+ 0,
640
+ 0,
641
+ 188,
642
+ 49
643
+ ],
644
+ [
645
+ 37,
646
+ 0,
647
+ 0,
648
+ 52,
649
+ 109
650
+ ]
651
+ ]
652
+ },
653
+ "macro_roc_auc_ovr": 0.9533153727185603
654
+ },
655
+ "delta_accuracy": -0.0021739130434782483,
656
+ "delta_macro_f1": -0.00501206637998064
657
+ }
658
+ }
659
+ }
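The ablation entries above do not state the sign convention of `delta_accuracy` and `delta_macro_f1`. The numbers are consistent with delta = full-model score minus ablated score: adding each group's `accuracy` to its `delta_accuracy` gives the same value every time. The sketch below uses values copied from the JSON above to check that reading, rank the groups, and recompute the `auto_resolved_soar` F1 for `no_soar` from its confusion matrix. It assumes matrix rows are true labels and columns are predictions; the JSON does not say so, but the row sums are identical across ablations, which fits that reading. The helper `class_f1` is illustrative and not part of the repository.

```python
# (accuracy, delta_accuracy) per ablation, copied from the JSON above.
ablations = {
    "no_tactic": (0.7811594202898551, -0.01521739130434785),
    "no_siem": (0.7681159420289855, -0.0021739130434782483),
    "no_analyst_tier": (0.7717391304347826, -0.005797101449275366),
    "no_timing": (0.777536231884058, -0.011594202898550732),
    "no_scores": (0.7710144927536232, -0.005072463768115987),
    "no_soar": (0.618840579710145, 0.14710144927536228),
    "no_engineered": (0.7681159420289855, -0.0021739130434782483),
}

# Assumed convention: delta = full_model_score - ablated_score,
# so the full-model score is ablated_score + delta.
implied_full = {name: acc + delta for name, (acc, delta) in ablations.items()}
full_accuracy = sum(implied_full.values()) / len(implied_full)
consistent = all(abs(v - full_accuracy) < 1e-9 for v in implied_full.values())

# Largest positive delta first = feature group whose removal hurts most.
ranked = sorted(ablations, key=lambda name: ablations[name][1], reverse=True)

# Confusion matrix of the no_soar ablation (assumed rows = true, cols = predicted).
no_soar_matrix = [
    [6, 26, 156, 122, 86],
    [0, 59, 0, 0, 0],
    [9, 0, 441, 0, 0],
    [2, 0, 0, 212, 63],
    [3, 0, 0, 59, 136],
]


def class_f1(matrix, k):
    """F1 for class k from a square confusion matrix."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # rest of the true-label row
    fp = sum(row[k] for row in matrix) - tp  # rest of the predicted column
    return 2 * tp / (2 * tp + fp + fn)


auto_f1 = class_f1(no_soar_matrix, 0)

print(f"consistent sign convention: {consistent}")
print(f"implied full-model accuracy: {full_accuracy:.4f}")
print(f"most damaging ablation: {ranked[0]}")
print(f"auto_resolved_soar F1 without SOAR flag: {auto_f1:.4f}")
```

Under that reading, removing `soar_playbook_triggered` is the only ablation that costs a substantial amount of accuracy; every other group's delta is slightly negative, meaning the ablated model scored marginally higher than the full one on this test split.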
feature_engineering.py ADDED
@@ -0,0 +1,338 @@
+ """
+ feature_engineering.py
+ ======================
+
+ Feature pipeline for the CYB008 baseline classifier.
+
+ Predicts `resolution_outcome` (5-class triage outcome) from per-alert
+ features on the CYB008 sample dataset.
+
+ CSV inputs:
+     soc_alerts.csv        (primary, one row per alert, 9,200 alerts)
+     soc_topology.csv      (per-analyst registry; reserved for future
+                            work - 25 analysts is too small to be a
+                            useful join target beyond the analyst_tier
+                            column already on soc_alerts)
+     incident_summary.csv  (per-incident aggregates; reserved - only
+                            9% of alerts link to an incident)
+     alert_events.csv      (discrete alert event log; reserved)
+
+ Target classes (5):
+     auto_resolved_soar, duplicate_merged, false_positive_closed,
+     true_positive_escalated, true_positive_remediated
+
+ Grouping decision
+ -----------------
+ There is no natural row-level group key for CYB008:
+     - 25 analysts   -> group-aware split would yield ~4 test analysts
+     - 5 SOCs        -> group-aware split would yield ~1 test SOC
+     - 589 incidents -> only 9% of alerts have a non-null incident_id
+
+ This baseline uses STRATIFIED random splitting (like CYB001 for network
+ flows), which is the right choice when alerts are independent given
+ features. The model card documents this rationale.
+
+ Leakage audit
+ -------------
+ Three columns are structural oracles for resolution_outcome and are
+ DROPPED from the feature set:
+
+   1. `alert_lifecycle_phase` (4 values: auto_closed, escalated, resolved,
+      suppressed_duplicate): three of the four values map deterministically
+      to specific resolution_outcome classes. Drop.
+
+   2. `automation_resolved` (binary): exactly 1:1 with auto_resolved_soar
+      outcome. Drop.
+
+   3. `escalation_flag` (binary): near-1:1 with true_positive_escalated
+      outcome (1319 escalation flags = 1319 escalated outcomes). Drop.
+
+ With all three dropped, accuracy drops from 100% to 79% - confirming
+ they were structural oracles, not real predictive signal.
+
+ `soar_playbook_triggered` is a PARTIAL oracle (one-way necessary
+ condition: auto_resolved_soar => soar_playbook_triggered=1, but
+ soar_playbook_triggered=1 also yields 32% non-auto-resolve outcomes).
+ This is a legitimate observable - a SOAR playbook actually executing
+ is part of how the alert is triaged. KEPT.
+
+ `mitre_technique_id` is a perfect oracle for mitre_tactic (every T-
+ number belongs to one tactic by ATT&CK design) but has no relationship
+ to resolution_outcome. It is high-cardinality (36 values from a small
+ sample of a 600+-value enterprise space) and contributes nothing to
+ this task. Dropped for parsimony.
+
+ `detection_rule_id` has 656 unique values - too high-cardinality for
+ one-hot encoding. Dropped.
+
+ Identifier / non-feature columns
+ --------------------------------
+ Dropped: alert_id, incident_id (mostly null), analyst_id, soc_id,
+ shift_id, alert_timestamp_min, soar_playbook_id (high cardinality).
+
+ Public API
+ ----------
+     build_features(alerts_path) -> (X, y, ids, meta)
+     transform_single(record, meta) -> np.ndarray
+     save_meta(meta, path) / load_meta(path)
+
+ License
+ -------
+ Ships with the public model on Hugging Face under CC-BY-NC-4.0,
+ matching the dataset license. See README.md.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+ import pandas as pd
+
+ # ---------------------------------------------------------------------------
+ # Label space
+ # ---------------------------------------------------------------------------
+
+ # Ordered by triage spectrum: auto -> dup -> FP -> TP-remediate -> TP-escalate
+ LABEL_ORDER = [
+     "auto_resolved_soar",
+     "duplicate_merged",
+     "false_positive_closed",
+     "true_positive_remediated",
+     "true_positive_escalated",
+ ]
+ LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+ INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+
+ # ---------------------------------------------------------------------------
+ # Identifier and target columns
+ # ---------------------------------------------------------------------------
+
+ ID_COLUMNS = [
+     "alert_id", "incident_id", "analyst_id", "soc_id", "shift_id",
+     "alert_timestamp_min", "soar_playbook_id",
+ ]
+ TARGET_COLUMN = "resolution_outcome"
+
+ # Structural oracle columns - dropped from features.
+ ORACLE_COLUMNS = [
+     "alert_lifecycle_phase",  # deterministically maps to 3 of 5 outcomes
+     "automation_resolved",    # 1:1 with auto_resolved_soar outcome
+     "escalation_flag",        # near-1:1 with true_positive_escalated outcome
+ ]
+
+ # High-cardinality categorical columns - dropped for tractability.
+ HIGH_CARDINALITY_COLUMNS = [
+     "mitre_technique_id",  # 36 values; no relationship to outcome
+     "detection_rule_id",   # 656 values; one-hot explosion
+ ]
+
+ DROPPED_FROM_FEATURES = ORACLE_COLUMNS + HIGH_CARDINALITY_COLUMNS
+
+ # ---------------------------------------------------------------------------
+ # Per-alert numeric features
+ # ---------------------------------------------------------------------------
+
+ DIRECT_NUMERIC_FEATURES = [
+     "raw_score",
+     "enriched_score",
+     "time_in_phase_minutes",
+     "queue_depth_at_ingestion",
+     "soar_playbook_triggered",  # partial oracle, kept as observable
+     "sla_breached_flag",
+     "mttd_minutes",
+     "mttr_minutes",
+     "fatigue_score_at_alert",
+ ]
+
+ CATEGORICAL_FEATURES = [
+     "alert_severity",  # 7 values
+     "alert_source",    # 8 values
+     "mitre_tactic",    # 12 values
+     "analyst_tier",    # 3 values (alerts) / 4 (topology) -- 3 here
+     "siem_platform",   # 8 values
+ ]
+
+
+ # ---------------------------------------------------------------------------
+ # Engineered features
+ # ---------------------------------------------------------------------------
+
+ def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Six engineered features encoding triage-outcome hypotheses.
+     Each composite is a quantity a SOC analyst would compute by hand
+     to assess an alert's likely disposition.
+     """
+     df = df.copy()
+
+     # 1. Enrichment lift: how much enrichment improved the raw score.
+     #    Positive lift = enrichment increased confidence (often -> TP).
+     df["enrichment_lift"] = (
+         df["enriched_score"] - df["raw_score"]
+     ).astype(float)
+
+     # 2. Log-scaled MTTR. MTTR is heavy-tailed (auto-resolves seconds,
+     #    escalations hours). log1p compresses for both XGBoost and MLP.
+     df["log_mttr"] = np.log1p(df["mttr_minutes"].clip(lower=0)).astype(float)
+
+     # 3. Log-scaled MTTD. Same heavy-tail shape.
+     df["log_mttd"] = np.log1p(df["mttd_minutes"].clip(lower=0)).astype(float)
+
+     # 4. Queue pressure: queue depth times analyst fatigue. High =
+     #    overloaded analyst, more likely to auto-resolve or escalate.
+     df["queue_pressure"] = (
+         df["queue_depth_at_ingestion"] * df["fatigue_score_at_alert"]
+     ).astype(float)
+
+     # 5. Triage time efficiency: enriched_score per minute in phase.
+     #    The denominator is clipped at 0.1 to avoid division by zero.
+     df["enrichment_per_minute"] = (
+         df["enriched_score"] / df["time_in_phase_minutes"].clip(lower=0.1)
+     ).astype(float)
+
+     # 6. Is high-confidence alert: enriched score above 0.7 typically
+     #    indicates a strong signal that warrants escalation.
+     df["is_high_confidence"] = (df["enriched_score"] > 0.7).astype(int)
+
+     return df
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def build_features(
+     alerts_path: str | Path,
+ ) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+     """
+     Load soc_alerts.csv, drop target + identifiers + oracle columns,
+     engineer features, one-hot encode, return (X, y, ids, meta).
+
+     `ids` is a Series of alert_id values aligned with X (used for
+     round-tripping; not a group label since this task uses stratified
+     random splitting).
+     """
+     alerts = pd.read_csv(alerts_path)
+
+     y = alerts[TARGET_COLUMN].map(LABEL_TO_INT)
+     if y.isna().any():
+         bad = alerts.loc[y.isna(), TARGET_COLUMN].unique()
+         raise ValueError(f"Unknown resolution_outcome values: {bad}")
+     y = y.astype(int)
+     ids = alerts["alert_id"].copy()
+
+     alerts = alerts.drop(
+         columns=ID_COLUMNS + [TARGET_COLUMN] + DROPPED_FROM_FEATURES,
+         errors="ignore",
+     )
+
+     alerts = _add_engineered_features(alerts)
+
+     numeric_features = (
+         DIRECT_NUMERIC_FEATURES
+         + [
+             "enrichment_lift", "log_mttr", "log_mttd",
+             "queue_pressure", "enrichment_per_minute", "is_high_confidence",
+         ]
+     )
+     numeric_features = [c for c in numeric_features if c in alerts.columns]
+     X_numeric = alerts[numeric_features].astype(float)
+
+     categorical_levels: dict[str, list[str]] = {}
+     blocks: list[pd.DataFrame] = []
+     for col in CATEGORICAL_FEATURES:
+         if col not in alerts.columns:
+             continue
+         levels = sorted(alerts[col].dropna().unique().tolist())
+         categorical_levels[col] = levels
+         block = pd.get_dummies(
+             alerts[col].astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         blocks.append(block)
+
+     X = pd.concat(
+         [X_numeric.reset_index(drop=True)]
+         + [b.reset_index(drop=True) for b in blocks],
+         axis=1,
+     ).fillna(0.0)
+
+     meta = {
+         "feature_names": X.columns.tolist(),
+         "numeric_features": numeric_features,
+         "categorical_levels": categorical_levels,
+         "label_to_int": LABEL_TO_INT,
+         "int_to_label": INT_TO_LABEL,
+         "oracle_excluded": ORACLE_COLUMNS,
+         "high_cardinality_excluded": HIGH_CARDINALITY_COLUMNS,
+     }
+     return X, y, ids, meta
+
+
+ def transform_single(
+     record: dict | pd.DataFrame,
+     meta: dict[str, Any],
+ ) -> np.ndarray:
+     """Encode one alert record (dict) or a DataFrame of records for inference."""
+     if isinstance(record, dict):
+         df = pd.DataFrame([record.copy()])
+     else:
+         # Reset the index so the one-hot blocks align row-for-row with the
+         # positionally built numeric block when they are concatenated below.
+         df = record.reset_index(drop=True)
+
+     df = _add_engineered_features(df)
+
+     numeric = pd.DataFrame({
+         col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+         for col in meta["numeric_features"]
+     })
+     blocks: list[pd.DataFrame] = [numeric]
+     for col, levels in meta["categorical_levels"].items():
+         val = df.get(col, pd.Series([None] * len(df)))
+         block = pd.get_dummies(
+             val.astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         for lvl in levels:
+             cname = f"{col}_{lvl}"
+             if cname not in block.columns:
+                 block[cname] = 0
+         block = block[[f"{col}_{lvl}" for lvl in levels]]
+         blocks.append(block)
+
+     X = pd.concat(blocks, axis=1).fillna(0.0)
+     X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+     return X.values.astype(np.float32)
+
+
+ def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+     serializable = {
+         "feature_names": meta["feature_names"],
+         "numeric_features": meta["numeric_features"],
+         "categorical_levels": meta["categorical_levels"],
+         "label_to_int": meta["label_to_int"],
+         "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+         "oracle_excluded": meta.get("oracle_excluded", []),
+         "high_cardinality_excluded": meta.get("high_cardinality_excluded", []),
+     }
+     with open(path, "w") as f:
+         json.dump(serializable, f, indent=2)
+
+
+ def load_meta(path: str | Path) -> dict[str, Any]:
+     with open(path) as f:
+         meta = json.load(f)
+     meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+     return meta
+
+
+ if __name__ == "__main__":
+     import sys
+     base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+     X, y, ids, meta = build_features(base / "soc_alerts.csv")
+     print(f"X shape: {X.shape}")
+     print(f"y shape: {y.shape}")
+     print(f"n_features: {len(meta['feature_names'])}")
+     print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+     print(f"X has NaN: {X.isnull().any().any()}")
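The leakage audit in the module docstring identifies oracle columns by checking whether a column's values map deterministically onto `resolution_outcome` classes. The repository does not show how that audit was run. The following is a minimal sketch of one way to do it, using only the standard library; the helper `value_purity` and the toy rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def value_purity(rows, column, target):
    """For each value of `column`, return (majority target class, its share).

    A share of 1.0 means that value determines the target outright, which is
    the signature of a structural oracle.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[column]][row[target]] += 1
    purity = {}
    for value, ctr in counts.items():
        label, n = ctr.most_common(1)[0]
        purity[value] = (label, n / sum(ctr.values()))
    return purity


# Made-up rows for illustration only.
toy_rows = [
    {"automation_resolved": 1, "soar_playbook_triggered": 1,
     "resolution_outcome": "auto_resolved_soar"},
    {"automation_resolved": 1, "soar_playbook_triggered": 1,
     "resolution_outcome": "auto_resolved_soar"},
    {"automation_resolved": 0, "soar_playbook_triggered": 1,
     "resolution_outcome": "true_positive_remediated"},
    {"automation_resolved": 0, "soar_playbook_triggered": 0,
     "resolution_outcome": "false_positive_closed"},
    {"automation_resolved": 0, "soar_playbook_triggered": 0,
     "resolution_outcome": "false_positive_closed"},
]

oracle = value_purity(toy_rows, "automation_resolved", "resolution_outcome")
partial = value_purity(toy_rows, "soar_playbook_triggered", "resolution_outcome")

print("automation_resolved:", oracle)
print("soar_playbook_triggered:", partial)
```

In the toy data, `automation_resolved=1` has purity 1.0 (an oracle for one class), while `soar_playbook_triggered=1` is mixed, which mirrors the docstring's distinction between a dropped oracle and a kept partial oracle.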
feature_meta.json ADDED
@@ -0,0 +1,147 @@
1
+ {
2
+ "feature_names": [
3
+ "raw_score",
4
+ "enriched_score",
5
+ "time_in_phase_minutes",
6
+ "queue_depth_at_ingestion",
7
+ "soar_playbook_triggered",
8
+ "sla_breached_flag",
9
+ "mttd_minutes",
10
+ "mttr_minutes",
11
+ "fatigue_score_at_alert",
12
+ "enrichment_lift",
13
+ "log_mttr",
14
+ "log_mttd",
15
+ "queue_pressure",
16
+ "enrichment_per_minute",
17
+ "is_high_confidence",
18
+ "alert_severity_critical_confirmed",
19
+ "alert_severity_duplicate_suppressed",
20
+ "alert_severity_false_positive",
21
+ "alert_severity_high_severity",
22
+ "alert_severity_informational",
23
+ "alert_severity_low_severity",
24
+ "alert_severity_medium_severity",
25
+ "alert_source_cspm_cloud_rule",
26
+ "alert_source_edr_behavioural_engine",
27
+ "alert_source_honeypot_trigger",
28
+ "alert_source_itdr_identity_anomaly",
29
+ "alert_source_nids_signature",
30
+ "alert_source_siem_correlation_rule",
31
+ "alert_source_threat_intel_ioc_match",
32
+ "alert_source_ueba_user_anomaly",
33
+ "mitre_tactic_collection",
34
+ "mitre_tactic_command_and_control",
35
+ "mitre_tactic_credential_access",
36
+ "mitre_tactic_defense_evasion",
37
+ "mitre_tactic_discovery",
38
+ "mitre_tactic_execution",
39
+ "mitre_tactic_exfiltration",
40
+ "mitre_tactic_impact",
41
+ "mitre_tactic_initial_access",
42
+ "mitre_tactic_lateral_movement",
43
+ "mitre_tactic_persistence",
44
+ "mitre_tactic_privilege_escalation",
45
+ "analyst_tier_L1_junior",
46
+ "analyst_tier_L2_senior",
47
+ "analyst_tier_L3_threat_hunter",
48
+ "siem_platform_chronicle_google",
49
+ "siem_platform_elastic_siem",
50
+ "siem_platform_exabeam_fusion",
51
+ "siem_platform_ibm_qradar",
52
+ "siem_platform_logrhythm_axon",
53
+ "siem_platform_microsoft_sentinel",
54
+ "siem_platform_splunk_enterprise",
55
+ "siem_platform_sumo_logic"
56
+ ],
57
+ "numeric_features": [
58
+ "raw_score",
59
+ "enriched_score",
60
+ "time_in_phase_minutes",
61
+ "queue_depth_at_ingestion",
62
+ "soar_playbook_triggered",
63
+ "sla_breached_flag",
64
+ "mttd_minutes",
65
+ "mttr_minutes",
66
+ "fatigue_score_at_alert",
67
+ "enrichment_lift",
68
+ "log_mttr",
69
+ "log_mttd",
70
+ "queue_pressure",
71
+ "enrichment_per_minute",
72
+ "is_high_confidence"
73
+ ],
74
+ "categorical_levels": {
75
+ "alert_severity": [
76
+ "critical_confirmed",
77
+ "duplicate_suppressed",
78
+ "false_positive",
79
+ "high_severity",
80
+ "informational",
81
+ "low_severity",
82
+ "medium_severity"
83
+ ],
84
+ "alert_source": [
85
+ "cspm_cloud_rule",
86
+ "edr_behavioural_engine",
87
+ "honeypot_trigger",
88
+ "itdr_identity_anomaly",
89
+ "nids_signature",
90
+ "siem_correlation_rule",
91
+ "threat_intel_ioc_match",
92
+ "ueba_user_anomaly"
93
+ ],
94
+ "mitre_tactic": [
95
+ "collection",
96
+ "command_and_control",
97
+ "credential_access",
98
+ "defense_evasion",
99
+ "discovery",
100
+ "execution",
101
+ "exfiltration",
102
+ "impact",
103
+ "initial_access",
104
+ "lateral_movement",
105
+ "persistence",
106
+ "privilege_escalation"
107
+ ],
108
+ "analyst_tier": [
109
+ "L1_junior",
110
+ "L2_senior",
111
+ "L3_threat_hunter"
112
+ ],
113
+ "siem_platform": [
114
+ "chronicle_google",
115
+ "elastic_siem",
116
+ "exabeam_fusion",
117
+ "ibm_qradar",
118
+ "logrhythm_axon",
119
+ "microsoft_sentinel",
120
+ "splunk_enterprise",
121
+ "sumo_logic"
122
+ ]
123
+ },
124
+ "label_to_int": {
125
+ "auto_resolved_soar": 0,
126
+ "duplicate_merged": 1,
127
+ "false_positive_closed": 2,
128
+ "true_positive_remediated": 3,
129
+ "true_positive_escalated": 4
130
+ },
131
+ "int_to_label": {
132
+ "0": "auto_resolved_soar",
133
+ "1": "duplicate_merged",
134
+ "2": "false_positive_closed",
135
+ "3": "true_positive_remediated",
136
+ "4": "true_positive_escalated"
137
+ },
138
+ "oracle_excluded": [
139
+ "alert_lifecycle_phase",
140
+ "automation_resolved",
141
+ "escalation_flag"
142
+ ],
143
+ "high_cardinality_excluded": [
144
+ "mitre_technique_id",
145
+ "detection_rule_id"
146
+ ]
147
+ }
feature_scaler.json ADDED
@@ -0,0 +1 @@
1
+ {"mean": [0.38312180124223605, 0.44073427018633543, 494.7303866459627, 0.0, 0.42220496894409937, 0.1781055900621118, 137.2831350931677, 494.7303866459627, 0.6417724378881986, 0.05761246894409938, 6.162163956057129, 4.838456166953409, 0.0, 0.0009746911233458116, 0.12018633540372671, 0.025, 0.06164596273291925, 0.4549689440993789, 0.07577639751552795, 0.08214285714285714, 0.12950310559006212, 0.17096273291925465, 0.12298136645962733, 0.12406832298136646, 0.13198757763975155, 0.1253105590062112, 0.12593167701863353, 0.12468944099378881, 0.12111801242236025, 0.12391304347826088, 0.06055900621118013, 0.062111801242236024, 0.10170807453416149, 0.10791925465838509, 0.07872670807453416, 0.10512422360248447, 0.05031055900621118, 0.05015527950310559, 0.13788819875776398, 0.06506211180124223, 0.08649068322981367, 0.09394409937888198, 0.7020186335403726, 0.21521739130434783, 0.0827639751552795, 0.044099378881987575, 0.19145962732919256, 0.1203416149068323, 0.20543478260869566, 0.09208074534161491, 0.15434782608695652, 0.0891304347826087, 0.1031055900621118], "std": [0.17850135030508904, 0.20892201626750886, 146.90053054989468, 1.0, 0.49394920704438045, 0.3826313144711351, 58.62096838689685, 146.90053054989468, 0.1734504703334948, 0.04815657263795656, 0.29940309845142277, 0.43629738062432477, 1.0, 0.0005739726052285025, 0.325204554451481, 0.15613707287413436, 0.24053008473802837, 0.4980067419072385, 0.2646605593600251, 0.2746035639662693, 0.33578201102512484, 0.3765056291068823, 0.3284413197786903, 0.3296850798759676, 0.3385035444954766, 0.33109642900410163, 0.33179810798133247, 0.33039209198094494, 0.3262897045833066, 0.32950790685171577, 0.23853814883841196, 0.24137724091987214, 0.3022874975869994, 0.3103024985870334, 0.26933265211902974, 0.3067372346334799, 0.21860198300436606, 0.2182822165526327, 0.34480937505453896, 0.2466545770469458, 0.2811090811235689, 0.29177358481916393, 0.4574067767634188, 0.4110049834120366, 0.2755465283870223, 0.20533185439625573, 
0.3934804694938916, 0.3253858494057618, 0.4040503472586642, 0.28916235120376915, 0.3613099024493325, 0.28495404697734833, 0.304120352879204]}
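The inference notebook standardises MLP inputs as `(X - mean) / std` using these two arrays. Two positions (indices 3 and 12, which are `queue_depth_at_ingestion` and `queue_pressure` in `feature_names`) carry mean 0.0 and std 1.0. That is what a scaler produces when an all-zero column has its zero std replaced by 1 to avoid division by zero, but this is an inference from the values and not something the repository states. A minimal sketch of such a scaler follows; `fit_scaler` and `apply_scaler` are hypothetical names, and the use of the population standard deviation is an assumption.

```python
import math


def fit_scaler(rows):
    """Column-wise mean and population std; zero std is replaced by 1.0."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        std = math.sqrt(var)
        # Guard: a constant column would otherwise divide by zero later.
        stds.append(std if std > 0 else 1.0)
    return {"mean": means, "std": stds}


def apply_scaler(row, scaler):
    """Standardise one feature row with a fitted scaler."""
    return [(x - m) / s for x, m, s in zip(row, scaler["mean"], scaler["std"])]


# Toy data: the second column is constant zero.
rows = [[0.2, 0.0], [0.4, 0.0], [0.6, 0.0]]
scaler = fit_scaler(rows)
scaled = apply_scaler([0.4, 0.0], scaler)

print(scaler)
print(scaled)
```

With this guard a constant column passes through as zeros, so it contributes nothing to the MLP input rather than producing NaN or infinity.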
inference_example.ipynb ADDED
@@ -0,0 +1,311 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# CYB008 Baseline Classifier: Inference Example\n",
+     "\n",
+     "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **SOC alert triage outcome** from a per-alert record.\n",
+     "\n",
+     "**Models predict one of 5 outcome classes:** `auto_resolved_soar`, `duplicate_merged`, `false_positive_closed`, `true_positive_remediated`, `true_positive_escalated`.\n",
+     "\n",
+     "**This is a baseline reference model**, not a production SOC triage system. See the model card and **especially `leakage_diagnostic.json`** for the structural-leakage findings (three columns were dropped as oracles)."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 1. Install dependencies"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 2. Download model artifacts from Hugging Face"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from huggingface_hub import hf_hub_download\n",
+     "\n",
+     "REPO_ID = \"xpertsystems/cyb008-baseline-classifier\"\n",
+     "\n",
+     "files = {}\n",
+     "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+     "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+     "             \"feature_scaler.json\"]:\n",
+     "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+     "    print(f\" downloaded: {name}\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import sys, os\n",
+     "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+     "if fe_dir not in sys.path:\n",
+     "    sys.path.insert(0, fe_dir)\n",
+     "\n",
+     "from feature_engineering import transform_single, load_meta, INT_TO_LABEL"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 3. Load models and metadata"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "import numpy as np\n",
+     "import torch\n",
+     "import torch.nn as nn\n",
+     "import xgboost as xgb\n",
+     "from safetensors.torch import load_file\n",
+     "\n",
+     "meta = load_meta(files[\"feature_meta.json\"])\n",
+     "with open(files[\"feature_scaler.json\"]) as f:\n",
+     "    scaler = json.load(f)\n",
+     "\n",
+     "N_FEATURES = len(meta[\"feature_names\"])\n",
+     "N_CLASSES = len(meta[\"int_to_label\"])\n",
+     "print(f\"feature count: {N_FEATURES}\")\n",
+     "print(f\"class count: {N_CLASSES}\")\n",
+     "print(f\"label classes: {list(meta['int_to_label'].values())}\")\n",
+     "print(f\"\\noracle columns excluded (do not pass these to the model):\")\n",
+     "for c in meta.get(\"oracle_excluded\", []):\n",
+     "    print(f\" - {c}\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "xgb_model = xgb.XGBClassifier()\n",
+     "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+     "\n",
+     "# MLP architecture (must match training)\n",
+     "class TriageMLP(nn.Module):\n",
+     "    def __init__(self, n_features, n_classes=5, hidden1=128, hidden2=64, dropout=0.3):\n",
+     "        super().__init__()\n",
+     "        self.net = nn.Sequential(\n",
+     "            nn.Linear(n_features, hidden1),\n",
+     "            nn.BatchNorm1d(hidden1),\n",
+     "            nn.ReLU(),\n",
+     "            nn.Dropout(dropout),\n",
+     "            nn.Linear(hidden1, hidden2),\n",
+     "            nn.BatchNorm1d(hidden2),\n",
+     "            nn.ReLU(),\n",
+     "            nn.Dropout(dropout),\n",
+     "            nn.Linear(hidden2, n_classes),\n",
+     "        )\n",
+     "    def forward(self, x):\n",
+     "        return self.net(x)\n",
+     "\n",
+     "mlp_model = TriageMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+     "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+     "mlp_model.eval()\n",
+     "print(\"models loaded\")"
+    ]
+   },
138
+ {
139
+ "cell_type": "markdown",
140
+ "metadata": {},
141
+ "source": [
142
+ "## 4. Prediction helper"
143
+ ]
144
+ },
145
+ {
146
+ "cell_type": "code",
147
+ "execution_count": null,
148
+ "metadata": {},
149
+ "outputs": [],
150
+ "source": [
151
+ "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
152
+ "SD = np.array(scaler[\"std\"], dtype=np.float32)\n",
153
+ "\n",
154
+ "def predict_triage_outcome(record: dict) -> dict:\n",
155
+ " \"\"\"Predict the resolution outcome for one SOC alert record.\n",
156
+ "\n",
157
+ " Note: do NOT include alert_lifecycle_phase, automation_resolved,\n",
158
+ " or escalation_flag in the record. These were structural oracles\n",
159
+ " in the training data and are excluded from the feature set.\n",
160
+ " \"\"\"\n",
161
+ " X = transform_single(record, meta)\n",
162
+ "\n",
163
+ " xgb_proba = xgb_model.predict_proba(X)[0]\n",
164
+ " xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
165
+ "\n",
166
+ " Xs = ((X - MU) / SD).astype(np.float32)\n",
167
+ " with torch.no_grad():\n",
168
+ " logits = mlp_model(torch.tensor(Xs))\n",
169
+ " mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
170
+ " mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
171
+ "\n",
172
+ " return {\n",
173
+ " \"xgboost\": {\n",
174
+ " \"label\": xgb_label,\n",
175
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
176
+ " },\n",
177
+ " \"mlp\": {\n",
178
+ " \"label\": mlp_label,\n",
179
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
180
+ " },\n",
181
+ " }"
182
+ ]
183
+ },
184
+ {
185
+ "cell_type": "markdown",
186
+ "metadata": {},
187
+ "source": [
188
+ "## 5. Run on an example record\n",
189
+ "\n",
190
+ "This record is a high-severity ITDR identity-anomaly alert from the sample dataset. It was assigned to an L3 threat hunter, who escalated it as a true-positive incident. Both models should predict `true_positive_escalated` or the adjacent `true_positive_remediated`."
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": null,
196
+ "metadata": {},
197
+ "outputs": [],
198
+ "source": [
199
+ "# Real alert from the sample dataset (true outcome: true_positive_escalated)\n",
200
+ "example_record = {\n",
201
+ " \"alert_severity\": \"high_severity\",\n",
202
+ " \"alert_source\": \"itdr_identity_anomaly\",\n",
203
+ " \"mitre_tactic\": \"initial_access\",\n",
204
+ " \"analyst_tier\": \"L3_threat_hunter\",\n",
205
+ " \"siem_platform\": \"logrhythm_axon\",\n",
206
+ " \"raw_score\": 0.2683,\n",
207
+ " \"enriched_score\": 0.343,\n",
208
+ " \"time_in_phase_minutes\": 429.26,\n",
209
+ " \"queue_depth_at_ingestion\": 0,\n",
210
+ " \"soar_playbook_triggered\": 0,\n",
211
+ " \"sla_breached_flag\": 1,\n",
212
+ " \"mttd_minutes\": 177.47,\n",
213
+ " \"mttr_minutes\": 429.26,\n",
214
+ " \"fatigue_score_at_alert\": 0.3805,\n",
215
+ "}\n",
216
+ "\n",
217
+ "result = predict_triage_outcome(example_record)\n",
218
+ "\n",
219
+ "print(f\"XGBoost -> {result['xgboost']['label']}\")\n",
220
+ "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
221
+ " print(f\" P({lbl:30s}) = {p:.4f}\")\n",
222
+ "\n",
223
+ "print(f\"\\nMLP -> {result['mlp']['label']}\")\n",
224
+ "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
225
+ " print(f\" P({lbl:30s}) = {p:.4f}\")"
226
+ ]
227
+ },
228
+ {
229
+ "cell_type": "markdown",
230
+ "metadata": {},
231
+ "source": [
232
+ "### Honest confusion between TP-remediated and TP-escalated\n",
233
+ "\n",
234
+ "The two `true_positive_*` outcomes look behaviourally similar in the data: both involve genuine threats. They differ in whether the alert was closed by the original analyst (remediated) or passed to a higher tier (escalated). When the trained models confuse these two classes on individual alerts, that reflects genuine overlap in the features rather than a defect.\n",
235
+ "\n",
236
+ "In a production triage workflow, the better operational metric is **TP vs FP** (recall on true positives, regardless of remediated/escalated). The published baseline achieves a macro one-vs-rest ROC-AUC of about 0.955 on the full 5-class task, which indicates strong ranking signal for a downstream binary TP-vs-FP decision."
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "markdown",
241
+ "metadata": {},
242
+ "source": [
243
+ "## 6. Batch prediction on the sample dataset"
244
+ ]
245
+ },
246
+ {
247
+ "cell_type": "code",
248
+ "execution_count": null,
249
+ "metadata": {},
250
+ "outputs": [],
251
+ "source": [
252
+ "from huggingface_hub import snapshot_download\n",
253
+ "import pandas as pd\n",
254
+ "\n",
255
+ "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb008-sample\", repo_type=\"dataset\")\n",
256
+ "alerts = pd.read_csv(f\"{ds_path}/soc_alerts.csv\")\n",
257
+ "\n",
258
+ "# Score the first 500 alerts\n",
259
+ "sample = alerts.head(500).copy()\n",
260
+ "preds = [predict_triage_outcome(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
261
+ "sample[\"xgb_pred\"] = preds\n",
262
+ "\n",
263
+ "ct = pd.crosstab(sample[\"resolution_outcome\"], sample[\"xgb_pred\"],\n",
264
+ " rownames=[\"true\"], colnames=[\"pred\"])\n",
265
+ "print(\"Confusion on first 500 sample alerts (XGBoost):\")\n",
266
+ "print(ct)\n",
267
+ "acc = (sample[\"resolution_outcome\"] == sample[\"xgb_pred\"]).mean()\n",
268
+ "print(f\"\\nbatch accuracy on first 500 alerts (in-distribution): {acc:.4f}\")\n",
269
+ "print(\"\\nNote: this includes training-set alerts. See validation_results.json\\n\"\n",
270
+ " \"for proper held-out test metrics.\")"
271
+ ]
272
+ },
273
+ {
274
+ "cell_type": "markdown",
275
+ "metadata": {},
276
+ "source": [
277
+ "## 7. Important reading: the leakage diagnostic\n",
278
+ "\n",
279
+ "Before using CYB008 sample data to train your own triage model, read **`leakage_diagnostic.json`** in this repo. The CYB008 sample has three columns (`alert_lifecycle_phase`, `automation_resolved`, `escalation_flag`) that structurally encode the resolution_outcome label. With these columns present, a plain XGBoost achieves 100% accuracy that does not reflect real learning. The published baseline excludes them; the diagnostic file shows the cumulative ablation.\n",
280
+ "\n",
281
+ "The diagnostic also documents that **mitre_tactic prediction is unlearnable on this sample** (XGBoost accuracy 0.08 vs. majority-baseline accuracy 0.14). The CYB008 dataset README lists tactic classification as its first suggested use case, but the per-tactic feature distributions are too similar to learn from."
282
+ ]
283
+ },
284
+ {
285
+ "cell_type": "markdown",
286
+ "metadata": {},
287
+ "source": [
288
+ "## 8. Next steps\n",
289
+ "\n",
290
+ "- See `validation_results.json` for held-out test metrics (1,380 alerts).\n",
291
+ "- See `multi_seed_results.json` for results across 10 seeds (accuracy 0.777 ± 0.007, ROC-AUC 0.955 ± 0.003).\n",
292
+ "- See `ablation_results.json` for per-feature-group contribution. Alert severity carries the dominant signal (−25 pp accuracy when removed); the SOAR-playbook-triggered indicator is second (−15 pp).\n",
293
+ "- See **`leakage_diagnostic.json`** for the full structural-leakage and unlearnable-target audit.\n",
294
+ "- For the full ~335k-row CYB008 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
295
+ ]
296
+ }
297
+ ],
298
+ "metadata": {
299
+ "kernelspec": {
300
+ "display_name": "Python 3",
301
+ "language": "python",
302
+ "name": "python3"
303
+ },
304
+ "language_info": {
305
+ "name": "python",
306
+ "version": "3.10"
307
+ }
308
+ },
309
+ "nbformat": 4,
310
+ "nbformat_minor": 5
311
+ }
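The notebook above argues that TP-vs-FP is the more useful operational view. A minimal sketch of that collapse follows, assuming a `probabilities` dict shaped like the one returned by `predict_triage_outcome` and the label names listed in `validation_results.json`. The helper name `tp_probability`, the example numbers, and the 0.5 threshold are illustrative assumptions, not part of the repo.

```python
# Hypothetical helper: collapse 5-class probabilities into a single P(true positive).
TP_LABELS = ("true_positive_remediated", "true_positive_escalated")


def tp_probability(probabilities: dict) -> float:
    """Sum the probability mass assigned to the two true_positive_* classes."""
    return sum(probabilities.get(label, 0.0) for label in TP_LABELS)


# Illustrative numbers only (not actual model output).
example = {
    "auto_resolved_soar": 0.10,
    "duplicate_merged": 0.05,
    "false_positive_closed": 0.15,
    "true_positive_remediated": 0.30,
    "true_positive_escalated": 0.40,
}

p_tp = tp_probability(example)
is_tp = p_tp >= 0.5  # assumed threshold; tune it on validation data
print(f"P(TP) = {p_tp:.2f}, flagged as TP = {is_tp}")
```

Summing the two classes means a confident split between remediated and escalated still counts as a confident true positive, which is the behaviour the notebook's discussion asks for.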
leakage_diagnostic.json ADDED
@@ -0,0 +1,77 @@
1
+ {
2
+ "purpose": "Document the three structural oracle columns dropped from the primary feature pipeline, and the unlearnable-target finding for mitre_tactic. CYB008 is calibrated against 12 SOC-operations benchmarks but encodes the resolution_outcome label structurally into alert_lifecycle_phase, automation_resolved, and escalation_flag. Real SOC telemetry has substantial overlap between these signals; the sample does not.",
3
+ "primary_target": "resolution_outcome (5-class)",
4
+ "split": "StratifiedShuffleSplit, 70/15/15 nested",
5
+ "oracle_structural_findings": {
6
+ "alert_lifecycle_phase": {
7
+ "deterministic_mapping": {
8
+ "auto_closed": "100% -> auto_resolved_soar",
9
+ "escalated": "100% -> true_positive_escalated",
10
+ "suppressed_duplicate": "100% -> duplicate_merged",
11
+ "resolved": "splits ~62/38 false_positive_closed / true_positive_remediated"
12
+ },
13
+ "note": "3 of 4 lifecycle phases are perfect class oracles. Drop required to evaluate honest learning."
14
+ },
15
+ "automation_resolved": {
16
+ "deterministic_mapping": {
17
+ "1": "100% -> auto_resolved_soar",
18
+ "0": "0 cases of auto_resolved_soar"
19
+ },
20
+ "note": "Exact 1:1 oracle with auto_resolved_soar outcome class."
21
+ },
22
+ "escalation_flag": {
23
+ "deterministic_mapping": {
24
+ "1 (n=1875)": "1319 true_positive_escalated + 556 auto_resolved_soar",
25
+ "0 (n=7325)": "0 cases of true_positive_escalated"
26
+ },
27
+ "note": "Near-perfect oracle for true_positive_escalated outcome."
28
+ }
29
+ },
30
+ "ablation_experiments": [
31
+ {
32
+ "config": "full features (all oracles intact)",
33
+ "n_features": 53,
34
+ "accuracy": 1.0,
35
+ "roc_auc": 1.0
36
+ },
37
+ {
38
+ "config": "cumulative drop through alert_lifecycle_phase",
39
+ "dropped_so_far": [
40
+ "alert_lifecycle_phase"
41
+ ],
42
+ "n_features": 49,
43
+ "accuracy": 1.0,
44
+ "roc_auc": 1.0
45
+ },
46
+ {
47
+ "config": "cumulative drop through automation_resolved",
48
+ "dropped_so_far": [
49
+ "alert_lifecycle_phase",
50
+ "automation_resolved"
51
+ ],
52
+ "n_features": 48,
53
+ "accuracy": 0.8388888888888889,
54
+ "roc_auc": 0.9726930344746759
55
+ },
56
+ {
57
+ "config": "cumulative drop through escalation_flag",
58
+ "dropped_so_far": [
59
+ "alert_lifecycle_phase",
60
+ "automation_resolved",
61
+ "escalation_flag"
62
+ ],
63
+ "n_features": 47,
64
+ "accuracy": 0.7898550724637681,
65
+ "roc_auc": 0.9562439021017856
66
+ }
67
+ ],
68
+ "conclusion": "With all three oracle columns dropped, test accuracy is 0.79 (vs 1.00 with oracles intact, and 0.33 majority baseline). The honest model still achieves ROC-AUC 0.96 on a 5-class task: real learning, real signal, no mechanical leakage. The published baseline trains with the three oracle columns excluded.",
69
+ "mitre_tactic_unlearnable": {
70
+ "purpose": "The CYB008 README's first suggested use case is 'MITRE ATT&CK tactic classification from alert features'. We tested this on the sample dataset and found it is NOT LEARNABLE: the features do not distinguish tactics, and the model performs below the majority baseline.",
71
+ "task": "mitre_tactic 12-class (with mitre_technique_id excluded - it would be a perfect ATT&CK oracle)",
72
+ "majority_baseline_accuracy": 0.14097826086956522,
73
+ "xgboost_accuracy_mean_3seeds": 0.07971014492753624,
74
+ "interpretation": "Per-tactic feature distributions are nearly identical (raw_score 0.37-0.39, enriched_score similar, fatigue 0.64 across all 12 tactics). Without mitre_technique_id (which is a 100% ATT&CK-by-design oracle), alert_source is the only discriminating signal, and it has cross-tactic purity of 0.14 - close to random. Real SOC telemetry has stronger source-to-tactic associations and per-tactic feature distributions; the sample does not reproduce these.",
75
+ "recommendation_to_dataset_author": "To make tactic classification a viable benchmark, the generator should produce stronger per-tactic feature signatures (differentiated raw_score / enriched_score distributions per tactic, source-tactic affinity > 0.3 purity, characteristic MTTD / MTTR per tactic)."
76
+ }
77
+ }
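The `deterministic_mapping` entries above amount to a per-value purity check: for each value of a candidate column, what share of rows carry the single most common label? The sketch below shows one way to run that check with the standard library only. The function name `value_purity` and the toy rows are hypothetical, and this is not necessarily how the diagnostic file was produced.

```python
from collections import Counter, defaultdict


def value_purity(rows, column, label):
    """Map each value of `column` to (most common label, share of rows with it)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[column]][row[label]] += 1
    result = {}
    for value, label_counts in counts.items():
        top_label, top_n = label_counts.most_common(1)[0]
        result[value] = (top_label, top_n / sum(label_counts.values()))
    return result


# Toy rows for illustration only; real input would be the rows of soc_alerts.csv.
rows = [
    {"alert_lifecycle_phase": "escalated", "resolution_outcome": "true_positive_escalated"},
    {"alert_lifecycle_phase": "escalated", "resolution_outcome": "true_positive_escalated"},
    {"alert_lifecycle_phase": "resolved", "resolution_outcome": "false_positive_closed"},
    {"alert_lifecycle_phase": "resolved", "resolution_outcome": "false_positive_closed"},
    {"alert_lifecycle_phase": "resolved", "resolution_outcome": "true_positive_remediated"},
]

purity = value_purity(rows, "alert_lifecycle_phase", "resolution_outcome")
for value, (top_label, share) in purity.items():
    note = "  <- candidate oracle" if share == 1.0 else ""
    print(f"{value:12s} -> {top_label:28s} purity={share:.2f}{note}")
```

A value with purity 1.0 fully determines the label, so any column with several such values deserves the ablation treatment shown in `ablation_experiments`.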
model_mlp.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d34f5334ddfda002098c5fa294c98908478b3037593ce06b31dbbfd4f4b672e
3
+ size 66268
model_xgb.json ADDED
The diff for this file is too large to render. See raw diff
 
multi_seed_results.json ADDED
@@ -0,0 +1,98 @@
1
+ {
2
+ "purpose": "Multi-seed evaluation across 10 stratified splits of the 9,200-alert sample. Reports XGBoost performance averaged over the full set of seeds.",
3
+ "seeds_evaluated": [
4
+ 42,
5
+ 7,
6
+ 13,
7
+ 17,
8
+ 23,
9
+ 31,
10
+ 45,
11
+ 99,
12
+ 123,
13
+ 200
14
+ ],
15
+ "per_seed": [
16
+ {
17
+ "seed": 42,
18
+ "test_n_classes": 5,
19
+ "accuracy": 0.7659420289855072,
20
+ "macro_f1": 0.7429876131468711,
21
+ "macro_roc_auc_ovr": 0.9522005654044479
22
+ },
23
+ {
24
+ "seed": 7,
25
+ "test_n_classes": 5,
26
+ "accuracy": 0.7768115942028986,
27
+ "macro_f1": 0.769435481914568,
28
+ "macro_roc_auc_ovr": 0.9535405274694995
29
+ },
30
+ {
31
+ "seed": 13,
32
+ "test_n_classes": 5,
33
+ "accuracy": 0.7862318840579711,
34
+ "macro_f1": 0.7773476010631033,
35
+ "macro_roc_auc_ovr": 0.9593350309948587
36
+ },
37
+ {
38
+ "seed": 17,
39
+ "test_n_classes": 5,
40
+ "accuracy": 0.7731884057971015,
41
+ "macro_f1": 0.7657000386460112,
42
+ "macro_roc_auc_ovr": 0.9510884009809615
43
+ },
44
+ {
45
+ "seed": 23,
46
+ "test_n_classes": 5,
47
+ "accuracy": 0.7768115942028986,
48
+ "macro_f1": 0.7655808630589699,
49
+ "macro_roc_auc_ovr": 0.9557712595581618
50
+ },
51
+ {
52
+ "seed": 31,
53
+ "test_n_classes": 5,
54
+ "accuracy": 0.7789855072463768,
55
+ "macro_f1": 0.7635031878905345,
56
+ "macro_roc_auc_ovr": 0.9575528903552497
57
+ },
58
+ {
59
+ "seed": 45,
60
+ "test_n_classes": 5,
61
+ "accuracy": 0.7920289855072464,
62
+ "macro_f1": 0.7827912746822961,
63
+ "macro_roc_auc_ovr": 0.9599146202095736
64
+ },
65
+ {
66
+ "seed": 99,
67
+ "test_n_classes": 5,
68
+ "accuracy": 0.7666666666666667,
69
+ "macro_f1": 0.7513856936195747,
70
+ "macro_roc_auc_ovr": 0.9498718419129876
71
+ },
72
+ {
73
+ "seed": 123,
74
+ "test_n_classes": 5,
75
+ "accuracy": 0.7760869565217391,
76
+ "macro_f1": 0.7672910648132462,
77
+ "macro_roc_auc_ovr": 0.9549881182366795
78
+ },
79
+ {
80
+ "seed": 200,
81
+ "test_n_classes": 5,
82
+ "accuracy": 0.7753623188405797,
83
+ "macro_f1": 0.7594433532222149,
84
+ "macro_roc_auc_ovr": 0.9530752276015168
85
+ }
86
+ ],
87
+ "aggregate": {
88
+ "accuracy_mean": 0.7768115942028986,
89
+ "accuracy_std": 0.0074957104585424775,
90
+ "accuracy_min": 0.7659420289855072,
91
+ "accuracy_max": 0.7920289855072464,
92
+ "macro_f1_mean": 0.7645466172057391,
93
+ "macro_f1_std": 0.01093479933122361,
94
+ "roc_auc_mean": 0.9547338482723937,
95
+ "roc_auc_std": 0.003234503129030738
96
+ },
97
+ "published_artifact_seed": 42
98
+ }
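The `aggregate` block can be reproduced from the `per_seed` accuracies. The sketch below copies the ten values from the file and recomputes the mean and spread with the standard library. The reported `accuracy_std` matches the population standard deviation (`ddof=0`), not the sample one; the variable names are illustrative.

```python
import statistics

# Per-seed accuracies copied from multi_seed_results.json (seeds 42 ... 200).
accuracies = [
    0.7659420289855072, 0.7768115942028986, 0.7862318840579711,
    0.7731884057971015, 0.7768115942028986, 0.7789855072463768,
    0.7920289855072464, 0.7666666666666667, 0.7760869565217391,
    0.7753623188405797,
]

mean = statistics.fmean(accuracies)
pop_std = statistics.pstdev(accuracies)    # population std, divides by n
sample_std = statistics.stdev(accuracies)  # sample std, divides by n - 1

print(f"mean       = {mean:.4f}")
print(f"pop std    = {pop_std:.4f}")
print(f"sample std = {sample_std:.4f}")
```

With only 10 seeds the sample standard deviation is noticeably larger than the population one, so it is worth stating which convention a reported "±" uses.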
validation_results.json ADDED
@@ -0,0 +1,180 @@
1
+ {
2
+ "version": "1.0.0",
3
+ "dataset": "xpertsystems/cyb008-sample",
4
+ "task": "5-class resolution_outcome classification (SOC alert triage)",
5
+ "baselines": {
6
+ "always_predict_majority_accuracy": 0.32608695652173914,
7
+ "majority_class": "false_positive_closed",
8
+ "random_guess_accuracy": 0.2
9
+ },
10
+ "split": {
11
+ "strategy": "stratified (StratifiedShuffleSplit, nested 70/15/15)",
12
+ "rationale": "CYB008 has no natural row-level group key: 25 analysts (group-aware split would yield ~4 test analysts), 5 SOCs (would yield 1 test SOC), 589 incidents but only 9% of alerts have a non-null incident_id. Alerts are essentially independent given features, so stratified random split is the right choice (same approach as CYB001 for network flow classification).",
13
+ "alerts_train": 6440,
14
+ "alerts_val": 1380,
15
+ "alerts_test": 1380,
16
+ "seed": 42
17
+ },
18
+ "n_features": 53,
19
+ "label_classes": [
20
+ "auto_resolved_soar",
21
+ "duplicate_merged",
22
+ "false_positive_closed",
23
+ "true_positive_remediated",
24
+ "true_positive_escalated"
25
+ ],
26
+ "class_distribution_train": {
27
+ "false_positive_closed": 2097,
28
+ "auto_resolved_soar": 1849,
29
+ "true_positive_remediated": 1294,
30
+ "true_positive_escalated": 923,
31
+ "duplicate_merged": 277
32
+ },
33
+ "class_distribution_test": {
34
+ "false_positive_closed": 450,
35
+ "auto_resolved_soar": 396,
36
+ "true_positive_remediated": 277,
37
+ "true_positive_escalated": 198,
38
+ "duplicate_merged": 59
39
+ },
40
+ "oracle_excluded_features": [
41
+ "alert_lifecycle_phase (deterministically maps to 3 of 5 outcomes)",
42
+ "automation_resolved (1:1 with auto_resolved_soar)",
43
+ "escalation_flag (near 1:1 with true_positive_escalated)"
44
+ ],
45
+ "high_cardinality_excluded_features": [
46
+ "mitre_technique_id (36 unique values; perfect oracle for mitre_tactic but unrelated to this target)",
47
+ "detection_rule_id (656 unique values; one-hot explosion)"
48
+ ],
49
+ "leakage_audit_note": "See leakage_diagnostic.json for the full audit of structural oracles and the separate unlearnable-target finding for mitre_tactic. The model is trained with all three oracle columns excluded; full-features experiments showed 100% test accuracy, confirming the structural leakage.",
50
+ "models": {
51
+ "xgboost": {
52
+ "architecture": "Gradient-boosted decision trees, multi:softprob, 5 classes",
53
+ "framework": "xgboost",
54
+ "test_metrics": {
55
+ "model": "xgboost",
56
+ "accuracy": 0.7659420289855072,
57
+ "macro_f1": 0.7429876131468711,
58
+ "weighted_f1": 0.7669168766123218,
59
+ "per_class_f1": {
60
+ "auto_resolved_soar": 0.7572383073496659,
61
+ "duplicate_merged": 0.7936507936507936,
62
+ "false_positive_closed": 0.9038461538461539,
63
+ "true_positive_remediated": 0.7012987012987013,
64
+ "true_positive_escalated": 0.5589041095890411
65
+ },
66
+ "confusion_matrix": {
67
+ "labels": [
68
+ "auto_resolved_soar",
69
+ "duplicate_merged",
70
+ "false_positive_closed",
71
+ "true_positive_remediated",
72
+ "true_positive_escalated"
73
+ ],
74
+ "matrix": [
75
+ [
76
+ 340,
77
+ 17,
78
+ 6,
79
+ 16,
80
+ 17
81
+ ],
82
+ [
83
+ 9,
84
+ 50,
85
+ 0,
86
+ 0,
87
+ 0
88
+ ],
89
+ [
90
+ 74,
91
+ 0,
92
+ 376,
93
+ 0,
94
+ 0
95
+ ],
96
+ [
97
+ 40,
98
+ 0,
99
+ 0,
100
+ 189,
101
+ 48
102
+ ],
103
+ [
104
+ 39,
105
+ 0,
106
+ 0,
107
+ 57,
108
+ 102
109
+ ]
110
+ ]
111
+ },
112
+ "macro_roc_auc_ovr": 0.9522005654044479
113
+ }
114
+ },
115
+ "mlp": {
116
+ "architecture": "PyTorch MLP, 53 -> 128 -> 64 -> 5, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
117
+ "framework": "pytorch",
118
+ "test_metrics": {
119
+ "model": "mlp",
120
+ "accuracy": 0.7673913043478261,
121
+ "macro_f1": 0.7510024599009764,
122
+ "weighted_f1": 0.769556192579193,
123
+ "per_class_f1": {
124
+ "auto_resolved_soar": 0.7505773672055427,
125
+ "duplicate_merged": 0.8251748251748252,
126
+ "false_positive_closed": 0.910411622276029,
127
+ "true_positive_remediated": 0.6981818181818182,
128
+ "true_positive_escalated": 0.5706666666666667
129
+ },
130
+ "confusion_matrix": {
131
+ "labels": [
132
+ "auto_resolved_soar",
133
+ "duplicate_merged",
134
+ "false_positive_closed",
135
+ "true_positive_remediated",
136
+ "true_positive_escalated"
137
+ ],
138
+ "matrix": [
139
+ [
140
+ 325,
141
+ 25,
142
+ 0,
143
+ 23,
144
+ 23
145
+ ],
146
+ [
147
+ 0,
148
+ 59,
149
+ 0,
150
+ 0,
151
+ 0
152
+ ],
153
+ [
154
+ 74,
155
+ 0,
156
+ 376,
157
+ 0,
158
+ 0
159
+ ],
160
+ [
161
+ 38,
162
+ 0,
163
+ 0,
164
+ 192,
165
+ 47
166
+ ],
167
+ [
168
+ 33,
169
+ 0,
170
+ 0,
171
+ 58,
172
+ 107
173
+ ]
174
+ ]
175
+ },
176
+ "macro_roc_auc_ovr": 0.9552409409036638
177
+ }
178
+ }
179
+ }
180
+ }
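The per-class F1 scores and the accuracy in `test_metrics` follow directly from the confusion matrix. The sketch below recomputes them for the XGBoost matrix, reading rows as true labels and columns as predictions (the row sums match `class_distribution_test`). The function name `per_class_f1` is illustrative.

```python
labels = [
    "auto_resolved_soar",
    "duplicate_merged",
    "false_positive_closed",
    "true_positive_remediated",
    "true_positive_escalated",
]

# XGBoost confusion matrix from validation_results.json (rows = true, cols = predicted).
matrix = [
    [340, 17, 6, 16, 17],
    [9, 50, 0, 0, 0],
    [74, 0, 376, 0, 0],
    [40, 0, 0, 189, 48],
    [39, 0, 0, 57, 102],
]


def per_class_f1(cm):
    """F1 per class, using F1 = 2*TP / (actual count + predicted count)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                         # row sum
        predicted = sum(cm[i][k] for i in range(n)) # column sum
        denom = actual + predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


f1 = dict(zip(labels, per_class_f1(matrix)))
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / sum(map(sum, matrix))
macro_f1 = sum(f1.values()) / len(f1)

for name, score in f1.items():
    print(f"{name:26s} F1 = {score:.4f}")
print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
```

The same function applies unchanged to the MLP matrix, which makes it easy to compare where the two models trade errors between the `true_positive_*` classes.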