pradeep-xpert committed · Commit 146a3a4 · verified · 1 parent: eea6138

Initial release: XGBoost + MLP for ATT&CK phase classification

README.md ADDED
@@ -0,0 +1,408 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ library_name: pytorch
4
+ tags:
5
+ - cybersecurity
6
+ - mitre-attack
7
+ - kill-chain
8
+ - apt
9
+ - tabular-classification
10
+ - synthetic-data
11
+ - xgboost
12
+ - baseline
13
+ pipeline_tag: tabular-classification
14
+ base_model: []
15
+ datasets:
16
+ - xpertsystems/cyb002-sample
17
+ metrics:
18
+ - accuracy
19
+ - f1
20
+ - roc_auc
21
+ model-index:
22
+ - name: cyb002-baseline-classifier
23
+ results:
24
+ - task:
25
+ type: tabular-classification
26
+ name: 10-class MITRE ATT&CK kill-chain phase classification
27
+ dataset:
28
+ type: xpertsystems/cyb002-sample
29
+ name: CYB002 Synthetic Cyber Attack Dataset (Sample)
30
+ metrics:
31
+ - type: roc_auc
32
+ value: 0.8599
33
+ name: Test macro ROC-AUC OvR (XGBoost)
34
+ - type: f1
35
+ value: 0.4255
36
+ name: Test macro-F1 (XGBoost)
37
+ - type: accuracy
38
+ value: 0.4683
39
+ name: Test accuracy (XGBoost)
40
+ - type: roc_auc
41
+ value: 0.8496
42
+ name: Test macro ROC-AUC OvR (MLP)
43
+ - type: f1
44
+ value: 0.3911
45
+ name: Test macro-F1 (MLP)
46
+ - type: accuracy
47
+ value: 0.4449
48
+ name: Test accuracy (MLP)
49
+ ---
50
+
51
+ # CYB002 Baseline Classifier
52
+
53
+ **MITRE ATT&CK kill-chain phase classifier trained on the CYB002
54
+ synthetic cyber attack sample. Predicts which of 10 kill-chain phases
55
+ an attack event belongs to, from observable event + segment features.**
56
+
57
+ > **Baseline reference, not for production use.** This model demonstrates
58
+ > that the [CYB002 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb002-sample)
59
+ > is learnable end-to-end and gives prospective buyers a working starting
60
+ > point. It is not a production threat detector or SOC tool. See
61
+ > [Limitations](#limitations).
62
+
63
+ ## Model overview
64
+
65
+ | Property | Value |
66
+ |---|---|
67
+ | Task | 10-class kill-chain phase classification |
68
+ | Training data | `xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns) |
69
+ | Models | XGBoost + PyTorch MLP |
70
+ | Input features | 90 (after one-hot encoding) |
71
+ | Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
72
+ | License | CC-BY-NC-4.0 (matches dataset) |
73
+ | Status | Reference baseline |
74
+
75
+ Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
76
+
77
+ - `model_xgb.json` — gradient-boosted trees, primary recommendation
78
+ - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
79
+
80
+ ## Quick start
81
+
82
+ ```bash
83
+ pip install xgboost torch safetensors pandas huggingface_hub
84
+ ```
85
+
86
+ ```python
87
+ from huggingface_hub import hf_hub_download
88
+ import json, numpy as np, torch, xgboost as xgb
89
+ from safetensors.torch import load_file
90
+
91
+ REPO = "xpertsystems/cyb002-baseline-classifier"
92
+
93
+ paths = {n: hf_hub_download(REPO, n) for n in [
94
+ "model_xgb.json", "model_mlp.safetensors",
95
+ "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
96
+ ]}
97
+
98
+ import sys, os
99
+ sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
100
+ from feature_engineering import (
101
+ transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
102
+ )
103
+
104
+ meta = load_meta(paths["feature_meta.json"])
105
+ xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
106
+
107
+ # Build the segment-aggregate lookup from the dataset's topology CSV
108
+ seg_lookup = build_segment_lookup("path/to/network_topology.csv")
109
+
110
+ # Predict; `my_event` is one attack_events.csv row as a dict-like record (see inference_example.ipynb for the full pattern)
111
+ seg_agg = seg_lookup.get(my_event["target_segment_id"], {})
112
+ X = transform_single(my_event, meta, segment_aggregates=seg_agg)
113
+ proba = xgb_model.predict_proba(X)[0]
114
+ print(INT_TO_LABEL[int(np.argmax(proba))])
115
+ ```
116
+
117
+ See [`inference_example.ipynb`](./inference_example.ipynb) for an
118
+ end-to-end copy-paste demo including segment-aggregate setup and
119
+ batch prediction.
120
+
121
+ ## Training data
122
+
123
+ Trained on the public sample of CYB002, 4,353 attack events from 100
124
+ distinct campaigns:
125
+
126
+ | Phase | Train (n=2,822) | Test (n=726) | Test share |
127
+ |---|---:|---:|---:|
128
+ | `dwell_idle` | 581 | 141 | 19.4% |
129
+ | `reconnaissance` | 411 | 112 | 15.4% |
130
+ | `initial_access` | 358 | 106 | 14.6% |
131
+ | `execution` | 324 | 74 | 10.2% |
132
+ | `persistence` | 287 | 79 | 10.9% |
133
+ | `privilege_escalation` | 249 | 68 | 9.4% |
134
+ | `lateral_movement` | 201 | 54 | 7.4% |
135
+ | `collection` | 162 | 40 | 5.5% |
136
+ | `exfiltration` | 113 | 31 | 4.3% |
137
+ | `impact` | 105 | 21 | 2.9% |
138
+
139
+ ### Group-aware split
140
+
141
+ A single campaign generates ~40 highly-correlated events. Random row-level
142
+ splitting would put events from the same campaign in both train and test,
143
+ inflating metrics in a way that does not generalize to new campaigns.
144
+
145
+ This release uses **GroupShuffleSplit by `campaign_id`**:
146
+
147
+ | Fold | Campaigns | Events |
148
+ |---|---:|---:|
149
+ | Train | 69 | 2,822 |
150
+ | Validation | 16 | 805 |
151
+ | Test | 15 | 726 |
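The idea behind a group-aware split can be sketched without any dependencies. This is a minimal illustration, not the private training script: the helper `group_split`, the toy campaign IDs, and the rounding of fold sizes are all assumptions. The release itself only states that `GroupShuffleSplit` on `campaign_id` was used.

```python
import random


def group_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to train/val/test so no group spans two folds.

    Returns three lists of row indices (train, val, test).
    """
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    fold_of = {}
    for i, g in enumerate(groups):
        fold_of[g] = 0 if i < n_train else (1 if i < n_train + n_val else 2)
    folds = ([], [], [])
    for row, g in enumerate(group_ids):
        folds[fold_of[g]].append(row)
    return folds


# Toy data: 20 hypothetical campaigns with 3 events each.
campaign_ids = [f"C{c:03d}" for c in range(20) for _ in range(3)]
train, val, test = group_split(campaign_ids)


def campaigns(rows):
    """Set of campaign IDs touched by the given row indices."""
    return {campaign_ids[r] for r in rows}


print(len(campaigns(train)), len(campaigns(val)), len(campaigns(test)))
```

Because folds are assigned per campaign rather than per row, every event of a campaign lands in the same fold, which is the property the table above relies on.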
152
+
153
+ All test campaigns are completely unseen during training. Class imbalance
154
+ is addressed with balanced class weights (passed to XGBoost as per-row `sample_weight`) and
155
+ weighted cross-entropy (MLP).
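One common definition of "balanced" weights is `n_samples / (n_classes * class_count)`, the formula scikit-learn documents for `class_weight='balanced'`. Whether the private training script uses exactly this formula is an assumption. The counts below are the Train column of the phase table, and `balanced_class_weights` is a hypothetical helper name.

```python
train_counts = {
    "dwell_idle": 581, "reconnaissance": 411, "initial_access": 358,
    "execution": 324, "persistence": 287, "privilege_escalation": 249,
    "lateral_movement": 201, "collection": 162, "exfiltration": 113,
    "impact": 105,
}


def balanced_class_weights(counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


weights = balanced_class_weights(train_counts)

# Per-row weights: each event receives the weight of its class.
labels = ["impact", "dwell_idle", "impact"]  # toy label column
sample_weight = [weights[y] for y in labels]

for phase, w in weights.items():
    print(f"{phase:22s} {w:.2f}")
```

A list like `sample_weight` is what `XGBClassifier.fit(..., sample_weight=...)` accepts; the per-class dictionary is the analogue of the class-weight tensor a weighted cross-entropy loss takes.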
156
+
157
+ ## Feature pipeline
158
+
159
+ The bundled `feature_engineering.py` is the canonical feature recipe.
160
+
161
+ **Three columns are deliberately excluded** because they leak the target:
162
+
163
+ - `technique_id` — 62 of 63 ATT&CK techniques map 1:1 to a single phase.
164
+ Including it gives perfect-looking metrics that mean nothing.
165
+ - `technique_name` — 1:1 alias of `technique_id` (63 unique values each).
166
+ - `tactic_category` — direct alias of `kill_chain_phase`.
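A quick way to spot this kind of leakage is to count how many distinct values of a column map to exactly one target class. The sketch below uses toy rows; the technique/phase pairings are illustrative only, and `target_purity` is a hypothetical helper, not part of `feature_engineering.py`.

```python
from collections import defaultdict


def target_purity(rows, column, target):
    """Return (values that map to exactly one target class, distinct values)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[target])
    pure = sum(1 for targets in seen.values() if len(targets) == 1)
    return pure, len(seen)


# Toy rows, not taken from the dataset.
rows = [
    {"technique_id": "T1595", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T1595", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T1566", "kill_chain_phase": "initial_access"},
    {"technique_id": "T1078", "kill_chain_phase": "initial_access"},
    {"technique_id": "T1078", "kill_chain_phase": "persistence"},
]
pure, total = target_purity(rows, "technique_id", "kill_chain_phase")
print(f"{pure} of {total} values map to a single phase")
```

A column where nearly every value is "pure" in this sense, as reported for `technique_id`, is effectively a lookup table for the label and should not be used as a feature.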
167
+
168
+ **90 features survive after encoding**, drawn from:
169
+
170
+ - **Event-level numeric** (10): `timestep`, `dest_port`, `bytes_transferred`, `connection_duration_s`, `auth_failure_count`, `process_injection_flag`, `lateral_hop_count`, `c2_beacon_interval_s`, `edr_blocked_flag`, `siem_rule_triggered`
171
+ - **Event-level categorical** (7, one-hot encoded): `target_asset_type`, `source_ip_class`, `protocol`, `attacker_capability_tier`, `defender_maturity_level`, `alert_severity`, `detection_outcome`
172
+ - **Segment-level topology aggregates** (13): mean `patch_lag_days`, mean `exposure_score`, max `vulnerability_count`, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus segment_type and defender_maturity_level (segment-constant)
173
+ - **Engineered** (6): `byte_volume_log`, `has_c2_beacon`, `is_brute_forcing`, `attacker_defender_advantage`, `is_high_volume`, `is_privileged_port`
174
+
175
+ None of the engineered features is derived from phase or technique —
176
+ that would re-introduce the leakage we just excluded.
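To make the engineered group concrete, here is a hedged sketch of what such features could look like. The definitions and both thresholds are assumptions chosen for illustration; the shipped `feature_engineering.py` is the canonical source and may define them differently. `attacker_defender_advantage` is omitted because its definition cannot be inferred from the names alone.

```python
import math


def engineer(event, brute_force_threshold=5, high_volume_bytes=1_000_000):
    """Hypothetical definitions; thresholds are placeholders, not the shipped ones."""
    return {
        "byte_volume_log": math.log1p(event["bytes_transferred"]),
        "has_c2_beacon": int(event["c2_beacon_interval_s"] > 0),
        "is_brute_forcing": int(event["auth_failure_count"] >= brute_force_threshold),
        "is_high_volume": int(event["bytes_transferred"] >= high_volume_bytes),
        "is_privileged_port": int(event["dest_port"] < 1024),
    }


# Toy event using column names from the feature list above.
event = {"bytes_transferred": 2_500_000, "c2_beacon_interval_s": 0.0,
         "auth_failure_count": 7, "dest_port": 443}
features = engineer(event)
print(features)
```

Every input here is an event-level observable; none reads `kill_chain_phase`, `technique_id`, or `tactic_category`, which is the constraint the paragraph above describes.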
177
+
178
+ ### Note on detection-outcome features
179
+
180
+ `detection_outcome`, `alert_severity`, `edr_blocked_flag`, and
181
+ `siem_rule_triggered` are post-hoc observables from the SOC's perspective.
182
+ They are kept as features for the realistic use case where a SOC analyst
183
+ has just seen an action and its initial detection signal and is reasoning
184
+ about which phase the campaign is in. Buyers who want a strictly
185
+ pre-detection model can drop these four columns and retrain — the ablation
186
+ results below show this **does not hurt accuracy** (the model doesn't
187
+ lean on them for phase prediction).
188
+
189
+ ## Evaluation
190
+
191
+ ### Test-set metrics (n = 726 events from 15 disjoint campaigns)
192
+
193
+ **XGBoost**
194
+
195
+ | Metric | Value |
196
+ |---|---:|
197
+ | Macro ROC-AUC (OvR) | **0.8599** |
198
+ | Accuracy | 0.4683 |
199
+ | Macro-F1 | 0.4255 |
200
+ | Weighted-F1 | 0.4407 |
201
+
202
+ **MLP**
203
+
204
+ | Metric | Value |
205
+ |---|---:|
206
+ | Macro ROC-AUC (OvR) | **0.8496** |
207
+ | Accuracy | 0.4449 |
208
+ | Macro-F1 | 0.3911 |
209
+ | Weighted-F1 | 0.4350 |
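The XGBoost accuracy and F1 figures can be re-derived from the confusion matrix published in `ablation_results.json` (`full_model_metrics.confusion_matrix`). The sketch below assumes rows are true labels and columns are predictions; the row sums match the per-phase test counts, which supports that reading. `summarize` is a hypothetical helper name.

```python
# XGBoost test confusion matrix copied from ablation_results.json,
# in the label order listed there (dwell_idle ... impact).
CM = [
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def summarize(cm):
    """Return (accuracy, macro-F1, weighted-F1) for a square confusion matrix."""
    n = len(cm)
    total = sum(map(sum, cm))
    row = [sum(cm[i]) for i in range(n)]                      # support per true class
    col = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class
    # Per-class F1 = 2*TP / (row total + column total).
    f1 = [2 * cm[k][k] / (row[k] + col[k]) if row[k] + col[k] else 0.0
          for k in range(n)]
    accuracy = sum(cm[k][k] for k in range(n)) / total
    macro_f1 = sum(f1) / n
    weighted_f1 = sum(f * r for f, r in zip(f1, row)) / total
    return accuracy, macro_f1, weighted_f1


accuracy, macro_f1, weighted_f1 = summarize(CM)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

Run as written, this prints accuracy 0.4683, macro-F1 0.4255 and weighted-F1 0.4407, which agree with `accuracy`, `macro_f1` and `weighted_f1` under `full_model_metrics` in `ablation_results.json`.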
210
+
211
+ ### Headline interpretation
212
+
213
+ Accuracy of 47% looks low at first glance, but the right comparison is:
214
+
215
+ | Baseline | Accuracy | Macro-F1 |
216
+ |---|---:|---:|
217
+ | Random uniform guess (1/10 classes) | 0.10 | ~0.10 |
218
+ | Always predict majority (`dwell_idle`) | 0.19 | ~0.03 |
219
+ | **XGBoost (this model)** | **0.47** | **0.43** |
220
+
221
+ The macro ROC-AUC of **0.86** tells the cleaner story: the model
222
+ distinguishes the 10 phases meaningfully well even though the
223
+ argmax-prediction sometimes lands on an adjacent phase.
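The trivial baselines in the comparison can be worked out from the test-set class counts alone. This is a sketch under one stated assumption: classes that are never predicted are scored with F1 = 0, which is the usual convention for macro averaging.

```python
test_counts = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31,
    "impact": 21,
}
n = sum(test_counts.values())
n_classes = len(test_counts)

# Majority baseline: always predict the most frequent class.
majority = max(test_counts, key=test_counts.get)
majority_accuracy = test_counts[majority] / n

# For the majority class, recall is 1 and precision equals its prevalence;
# every other class is never predicted and contributes F1 = 0.
precision = majority_accuracy
majority_f1 = 2 * precision / (1 + precision)
majority_macro_f1 = majority_f1 / n_classes

# Uniform random guessing is right 1 time in n_classes on average.
uniform_accuracy = 1 / n_classes

print(f"majority={majority} acc={majority_accuracy:.3f} macro_f1={majority_macro_f1:.3f}")
print(f"uniform acc={uniform_accuracy:.2f}")
```

The majority baseline scores well below the uniform one on macro-F1 because nine of its ten per-class F1 values are zero, which is why macro-F1 is the more demanding yardstick on this imbalanced test set.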
224
+
225
+ ### Per-class F1 — where the signal is and isn't
226
+
227
+ | Phase | XGBoost F1 | MLP F1 | Note |
228
+ |---|---:|---:|---|
229
+ | `reconnaissance` | **0.753** | 0.725 | Strong: early timestep, distinct protocols/targets |
230
+ | `lateral_movement` | **0.742** | 0.783 | Strong: lateral-hop count, post-privesc pattern |
231
+ | `initial_access` | **0.647** | 0.648 | Strong: perimeter targets, specific protocols |
232
+ | `privilege_escalation` | 0.500 | 0.488 | Moderate |
233
+ | `execution` | 0.441 | 0.510 | Moderate |
234
+ | `persistence` | 0.413 | 0.301 | Moderate, easily confused with execution |
235
+ | `exfiltration` | 0.273 | 0.119 | Weak: late-phase, similar to collection/impact |
236
+ | `impact` | 0.226 | 0.132 | Weak: late-phase clustering |
237
+ | `collection` | 0.220 | 0.191 | Weak: late-phase clustering |
238
+ | `dwell_idle` | 0.040 | 0.013 | Very weak: no-op steps lack distinguishing features |
239
+
240
+ The model has solid signal on **early and mid-campaign phases** and
241
+ genuinely struggles to disambiguate **late-stage objective-completion
242
+ phases** (collection / exfiltration / impact), which arrive close in
243
+ time and look similar at the event level. This is an honest limitation
244
+ of flat-tabular classification — sequence models would help here.
245
+
246
+ ### Ablation: which feature groups matter
247
+
248
+ | Configuration | Accuracy | Macro-F1 | Δ accuracy vs full |
249
+ |---|---:|---:|---:|
250
+ | Full feature set (published) | 0.4683 | 0.4255 | — |
251
+ | No `timestep` | 0.3264 | 0.3102 | **−0.1419** |
252
+ | No topology aggregates | 0.4601 | 0.4093 | −0.0083 |
253
+ | No engineered features | 0.4642 | 0.4240 | −0.0041 |
254
+ | No detection-signal features | 0.4725 | 0.4284 | **+0.0041** |
255
+
256
+ Two clear findings:
257
+
258
+ 1. **`timestep` is by far the most important feature** (drops 14 pp when
259
+ removed). The honest reading: kill chains progress in time, and where
260
+ you are in the campaign timeline carries most of the phase signal.
261
+ 2. **Detection-signal features (`detection_outcome`, `alert_severity`,
262
+ `edr_blocked_flag`, `siem_rule_triggered`) do not help phase prediction.**
263
+ Removing them changes accuracy by only +0.4 pp (3 of 726 test events), well within noise. A buyer who wants
264
+ a pre-detection model can drop these four columns with no measurable loss.
265
+
266
+ Topology aggregates and engineered features contribute little: removing them costs 0.8 pp and 0.4 pp of accuracy, respectively.
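The ablation deltas are easier to judge when converted to percentage points and to a count of test events. The accuracies below are copied from `ablation_results.json`; the sketch uses the README table's sign convention (ablated minus full). Note that the JSON's own `delta_accuracy` field uses the opposite sign (full minus ablated), so a positive value there means a drop.

```python
FULL_ACCURACY = 0.46831955922865015
ABLATED_ACCURACY = {
    "no_timestep": 0.32644628099173556,
    "no_topology": 0.46005509641873277,
    "no_engineered": 0.4641873278236915,
    "no_detection_signals": 0.4724517906336088,
}
N_TEST = 726  # test events

event_delta = {}
for name, acc in ABLATED_ACCURACY.items():
    delta = acc - FULL_ACCURACY            # README convention: ablated minus full
    event_delta[name] = round(delta * N_TEST)
    print(f"{name:22s} {delta:+.4f} ({delta * 100:+.1f} pp, "
          f"{event_delta[name]:+d} test events)")
```

Seen this way, only the `timestep` ablation moves more than a handful of events; the other three configurations differ from the full model by six events or fewer out of 726.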
267
+
268
+ ### Architecture
269
+
270
+ **XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
271
+ `hist` tree method, class-balanced sample weights, early stopping on
272
+ validation mlogloss.
273
+
274
+ **MLP:** `90 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
275
+ → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
276
+ early stopping on validation macro-F1.
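As a rough size check on the MLP, the trainable parameter count implied by `90 → 128 → 64 → 10` can be worked out by hand. The sketch assumes linear layers with bias terms and affine batch normalisation (a scale and a shift per unit) on each hidden layer; the actual module definition is not shown here, so treat the total as an estimate. `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(sizes, batchnorm=True):
    """Trainable parameters of a fully connected net with the given layer sizes."""
    total = 0
    for i, (fan_in, fan_out) in enumerate(zip(sizes, sizes[1:])):
        total += fan_in * fan_out + fan_out   # linear weight matrix + bias
        is_hidden = i < len(sizes) - 2
        if batchnorm and is_hidden:
            total += 2 * fan_out              # batch-norm scale + shift
    return total


n_params = mlp_param_count([90, 128, 64, 10])
print(n_params)  # 20938
```

Under these assumptions the network has about 21k parameters, several times the roughly 2.8k training events, which is consistent with the use of dropout, weight decay and early stopping.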
277
+
278
+ Training hyperparameters (learning rate, batch size, n_estimators,
279
+ early-stopping patience, weight decay, class-weighting strategy) are
280
+ held internally by XpertSystems and are not part of this release.
281
+
282
+ ## Limitations
283
+
284
+ **This is a baseline reference, not a production threat detection system.**
285
+
286
+ 1. **Late-phase confusion.** Per-class F1 for `collection`, `exfiltration`,
287
+ and `impact` is 0.22–0.27. These phases arrive near campaign-end with
288
+ similar feature signatures, and a flat-tabular event-level model can't
289
+ easily disambiguate them. Sequence models (LSTM / transformer over the
290
+ per-campaign event sequence) would substantially improve this.
291
+
292
+ 2. **`dwell_idle` is essentially unlearnable in this framing.** The
293
+ class-balanced weights amplify rare classes; `dwell_idle` is common
294
+ but featureless ("no action this timestep"), so the model trades
295
+ `dwell_idle` recall for late-phase recall. F1 = 0.04. A real SOC
296
+ pipeline would handle idle steps with a separate gating rule, not a
297
+ classifier head.
298
+
299
+ 3. **Sample-size constraints.** 100 campaigns / 4,353 events with a
300
+ group-aware split leaves 69 training campaigns. The full 380k-event
301
+ CYB002 product supports much more reliable per-class estimation,
302
+ especially on the rare late-phase classes.
303
+
304
+ 4. **Synthetic-vs-real transfer.** The dataset is synthetic and
305
+ calibrated to threat-intelligence benchmark targets (Mandiant
306
+ M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real
307
+ attack telemetry has different noise characteristics, adversary
308
+ adaptation, and gaps in coverage. Do not assume metrics transfer.
309
+
310
+ 5. **Adversarial robustness not evaluated.** The dataset is not
311
+ adversarially generated; the model has not been red-teamed.
312
+
313
+ 6. **MLP brittleness on OOD inputs.** With ~2.8k training events, the
314
+ MLP can produce confidently-wrong predictions on hand-crafted
315
+ records far from the training manifold. XGBoost is more robust.
316
+ Use both; treat disagreement as a signal for human review.
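The "use both, review on disagreement" advice in item 6 can be sketched as a small triage rule. The function name `triage`, the `min_confidence` threshold, and the toy probabilities are assumptions for illustration; neither model's API is involved, only two probability vectors over the same label order.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Flag an event for human review when the two models disagree or when
    XGBoost's top probability falls below a (hypothetical) threshold."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    needs_review = top_xgb != top_mlp or proba_xgb[top_xgb] < min_confidence
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "needs_review": needs_review,
    }


labels = ["collection", "exfiltration", "impact"]  # toy 3-class subset
agree = triage([0.1, 0.7, 0.2], [0.2, 0.6, 0.2], labels)
clash = triage([0.1, 0.7, 0.2], [0.2, 0.3, 0.5], labels)
print(agree["needs_review"], clash["needs_review"])
```

Treating XGBoost as the primary prediction follows the recommendation earlier in the README; the confidence threshold is an optional extra that would need tuning on validation data.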
317
+
318
+ ## Notes on dataset schema
319
+
320
+ The CYB002 sample dataset README describes some fields differently from
321
+ the actual schema. The model was trained on the actual schema; this note
322
+ is to help buyers reconcile what they read with what they receive.
323
+
324
+ | What the README says | What the data actually contains |
325
+ |---|---|
326
+ | "9 ATT&CK phases" | 10 phases including `dwell_idle` (idle/no-op steps) |
327
+ | 4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state` | 4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state` |
328
+ | 5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`) | 5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust` |
329
+ | Field name `phase` | Actual column: `kill_chain_phase` |
330
+ | Field name `tactic` | Actual column: `tactic_category` |
331
+ | Field name `segment_id` | Actual column: `target_segment_id` |
332
+ | Field name `attacker_tier` | Actual column: `attacker_capability_tier` |
333
+ | Field name `defender_maturity` | Actual column: `defender_maturity_level` |
334
+ | Field names `detected`, `blocked`, `stealth_score` | Actual columns: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; events have no `stealth_score` |
335
+
336
+ None of this affects model correctness — `feature_engineering.py` uses the
337
+ actual column names. If you build your own pipeline against the dataset,
338
+ use the actual columns, not the README descriptions.
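If you already have code written against the field names in the dataset README, a small rename map is one way to bridge the gap. The mapping below covers only the one-to-one renames from the table above; the `detected` / `blocked` / `stealth_score` row is left out because it is not a simple rename. `to_actual_schema` and the sample record are hypothetical.

```python
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}


def to_actual_schema(record):
    """Rename README-style keys to the actual column names; other keys pass through."""
    return {README_TO_ACTUAL.get(key, key): value for key, value in record.items()}


legacy = {"phase": "execution", "segment_id": "SEG-07", "timestep": 12}  # toy record
converted = to_actual_schema(legacy)
print(converted)
```

Going the other way (reading the CSVs directly) needs no mapping at all, which is the simpler option for new pipelines.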
339
+
340
+ ## Intended use
341
+
342
+ - **Evaluating fit** of the CYB002 dataset for your ATT&CK / kill-chain
343
+ research
344
+ - **Baseline reference** for new model architectures (especially sequence
345
+ models, which should beat this baseline on the late-phase classes)
346
+ - **Teaching and demo** for tabular classification on attack-event data
347
+ - **Feature engineering reference** for MITRE ATT&CK-aligned datasets
348
+
349
+ ## Out-of-scope use
350
+
351
+ - Production threat detection on real network telemetry
352
+ - SOC alert triage on real systems
353
+ - Forensic attribution of real attacks
354
+ - Adversarial-evasion evaluation (dataset not adversarially generated)
355
+ - Any safety-critical or operational security decision
356
+
357
+ ## Reproducibility
358
+
359
+ Outputs above were produced with `seed = 42`, group-aware nested
360
+ `GroupShuffleSplit` (nominally 70/15/15 by `campaign_id`, which yields 69/16/15 campaigns), on the published sample
361
+ (`xpertsystems/cyb002-sample`, version 1.0.0, generated 2026-05-16).
362
+ The feature pipeline in `feature_engineering.py` is deterministic and
363
+ the trained weights in this repo correspond exactly to the metrics above.
364
+
365
+ The training script itself is private to XpertSystems. The published
366
+ artifacts contain the feature pipeline, model weights, scaler, metadata,
367
+ and validation results — sufficient to reproduce inference but not
368
+ training.
369
+
370
+ ## Files in this repo
371
+
372
+ | File | Purpose |
373
+ |---|---|
374
+ | `model_xgb.json` | XGBoost weights |
375
+ | `model_mlp.safetensors` | PyTorch MLP weights |
376
+ | `feature_engineering.py` | Feature pipeline (load → aggregate topology → engineer → encode) |
377
+ | `feature_meta.json` | Feature column order + categorical levels |
378
+ | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
379
+ | `validation_results.json` | Per-class metrics, confusion matrix, architecture |
380
+ | `ablation_results.json` | Per-feature-group ablation (timestep, topology, engineered, detection-signals) |
381
+ | `inference_example.ipynb` | End-to-end inference demo notebook |
382
+ | `README.md` | This file |
383
+
384
+ ## Contact and full product
385
+
386
+ The full **CYB002** dataset contains ~454,000 rows across four files,
387
+ with calibrated benchmark validation against 12 metrics drawn from
388
+ authoritative threat intelligence sources (Mandiant, IBM, Verizon,
389
+ CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data
390
+ catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance &
391
+ Risk, Oil & Gas, and Materials & Energy.
392
+
393
+ - 📧 **pradeep@xpertsystems.ai**
394
+ - 🌐 **https://xpertsystems.ai**
395
+ - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample
396
+ - 🤖 Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier
397
+
398
+ ## Citation
399
+
400
+ ```bibtex
401
+ @misc{xpertsystems_cyb002_baseline_2026,
402
+ title = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT&CK Kill-Chain Phase Classification},
403
+ author = {XpertSystems.ai},
404
+ year = {2026},
405
+ url = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier},
406
+ note = {Baseline reference model trained on xpertsystems/cyb002-sample}
407
+ }
408
+ ```
ablation_results.json ADDED
@@ -0,0 +1,804 @@
1
+ {
2
+ "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
3
+ "full_model_metrics": {
4
+ "model": "xgboost",
5
+ "accuracy": 0.46831955922865015,
6
+ "macro_f1": 0.42549880749552066,
7
+ "weighted_f1": 0.440668872633435,
8
+ "per_class_f1": {
9
+ "dwell_idle": 0.040268456375838924,
10
+ "reconnaissance": 0.7532467532467533,
11
+ "initial_access": 0.6467661691542289,
12
+ "execution": 0.4406779661016949,
13
+ "persistence": 0.41304347826086957,
14
+ "privilege_escalation": 0.5,
15
+ "lateral_movement": 0.7422680412371134,
16
+ "collection": 0.22018348623853212,
17
+ "exfiltration": 0.2727272727272727,
18
+ "impact": 0.22580645161290322
19
+ },
20
+ "confusion_matrix": {
21
+ "labels": [
22
+ "dwell_idle",
23
+ "reconnaissance",
24
+ "initial_access",
25
+ "execution",
26
+ "persistence",
27
+ "privilege_escalation",
28
+ "lateral_movement",
29
+ "collection",
30
+ "exfiltration",
31
+ "impact"
32
+ ],
33
+ "matrix": [
34
+ [
35
+ 3,
36
+ 23,
37
+ 23,
38
+ 18,
39
+ 21,
40
+ 18,
41
+ 2,
42
+ 17,
43
+ 9,
44
+ 7
45
+ ],
46
+ [
47
+ 2,
48
+ 87,
49
+ 2,
50
+ 21,
51
+ 0,
52
+ 0,
53
+ 0,
54
+ 0,
55
+ 0,
56
+ 0
57
+ ],
58
+ [
59
+ 1,
60
+ 5,
61
+ 65,
62
+ 5,
63
+ 3,
64
+ 26,
65
+ 1,
66
+ 0,
67
+ 0,
68
+ 0
69
+ ],
70
+ [
71
+ 2,
72
+ 4,
73
+ 1,
74
+ 39,
75
+ 24,
76
+ 3,
77
+ 1,
78
+ 0,
79
+ 0,
80
+ 0
81
+ ],
82
+ [
83
+ 0,
84
+ 0,
85
+ 1,
86
+ 12,
87
+ 38,
88
+ 9,
89
+ 0,
90
+ 18,
91
+ 1,
92
+ 0
93
+ ],
94
+ [
95
+ 0,
96
+ 0,
97
+ 3,
98
+ 8,
99
+ 4,
100
+ 44,
101
+ 3,
102
+ 5,
103
+ 1,
104
+ 0
105
+ ],
106
+ [
107
+ 0,
108
+ 0,
109
+ 0,
110
+ 0,
111
+ 6,
112
+ 6,
113
+ 36,
114
+ 2,
115
+ 0,
116
+ 4
117
+ ],
118
+ [
119
+ 0,
120
+ 0,
121
+ 0,
122
+ 0,
123
+ 2,
124
+ 1,
125
+ 0,
126
+ 12,
127
+ 15,
128
+ 10
129
+ ],
130
+ [
131
+ 0,
132
+ 0,
133
+ 0,
134
+ 0,
135
+ 5,
136
+ 0,
137
+ 0,
138
+ 4,
139
+ 9,
140
+ 13
141
+ ],
142
+ [
143
+ 0,
144
+ 0,
145
+ 0,
146
+ 0,
147
+ 2,
148
+ 1,
149
+ 0,
150
+ 11,
151
+ 0,
152
+ 7
153
+ ]
154
+ ]
155
+ },
156
+ "macro_roc_auc_ovr": 0.8598653258869782
157
+ },
158
+ "ablations": {
159
+ "no_topology": {
160
+ "n_features": 67,
161
+ "dropped_count": 23,
162
+ "metrics": {
163
+ "model": "xgboost_no_topology",
164
+ "accuracy": 0.46005509641873277,
165
+ "macro_f1": 0.4093395066167947,
166
+ "weighted_f1": 0.4281869072634682,
167
+ "per_class_f1": {
168
+ "dwell_idle": 0.013513513513513514,
169
+ "reconnaissance": 0.7574468085106383,
170
+ "initial_access": 0.6435643564356436,
171
+ "execution": 0.45348837209302323,
172
+ "persistence": 0.3829787234042553,
173
+ "privilege_escalation": 0.4943820224719101,
174
+ "lateral_movement": 0.72,
175
+ "collection": 0.205607476635514,
176
+ "exfiltration": 0.25,
177
+ "impact": 0.1724137931034483
178
+ },
179
+ "confusion_matrix": {
180
+ "labels": [
181
+ "dwell_idle",
182
+ "reconnaissance",
183
+ "initial_access",
184
+ "execution",
185
+ "persistence",
186
+ "privilege_escalation",
187
+ "lateral_movement",
188
+ "collection",
189
+ "exfiltration",
190
+ "impact"
191
+ ],
192
+ "matrix": [
193
+ [
194
+ 1,
195
+ 24,
196
+ 24,
197
+ 16,
198
+ 24,
199
+ 16,
200
+ 4,
201
+ 15,
202
+ 10,
203
+ 7
204
+ ],
205
+ [
206
+ 2,
207
+ 89,
208
+ 2,
209
+ 16,
210
+ 3,
211
+ 0,
212
+ 0,
213
+ 0,
214
+ 0,
215
+ 0
216
+ ],
217
+ [
218
+ 1,
219
+ 6,
220
+ 65,
221
+ 4,
222
+ 3,
223
+ 26,
224
+ 1,
225
+ 0,
226
+ 0,
227
+ 0
228
+ ],
229
+ [
230
+ 1,
231
+ 4,
232
+ 1,
233
+ 39,
234
+ 25,
235
+ 3,
236
+ 1,
237
+ 0,
238
+ 0,
239
+ 0
240
+ ],
241
+ [
242
+ 1,
243
+ 0,
244
+ 0,
245
+ 16,
246
+ 36,
247
+ 9,
248
+ 0,
249
+ 16,
250
+ 1,
251
+ 0
252
+ ],
253
+ [
254
+ 0,
255
+ 0,
256
+ 3,
257
+ 7,
258
+ 4,
259
+ 44,
260
+ 3,
261
+ 5,
262
+ 2,
263
+ 0
264
+ ],
265
+ [
266
+ 0,
267
+ 0,
268
+ 1,
269
+ 0,
270
+ 5,
271
+ 9,
272
+ 36,
273
+ 2,
274
+ 0,
275
+ 1
276
+ ],
277
+ [
278
+ 1,
279
+ 0,
280
+ 0,
281
+ 0,
282
+ 2,
283
+ 2,
284
+ 1,
285
+ 11,
286
+ 11,
287
+ 12
288
+ ],
289
+ [
290
+ 0,
291
+ 0,
292
+ 0,
293
+ 0,
294
+ 5,
295
+ 0,
296
+ 0,
297
+ 6,
298
+ 8,
299
+ 12
300
+ ],
301
+ [
302
+ 0,
303
+ 0,
304
+ 0,
305
+ 0,
306
+ 2,
307
+ 1,
308
+ 0,
309
+ 12,
310
+ 1,
311
+ 5
312
+ ]
313
+ ]
314
+ },
315
+ "macro_roc_auc_ovr": 0.8625474585447981
316
+ },
317
+ "delta_accuracy": 0.008264462809917383,
318
+ "delta_macro_f1": 0.01615930087872597
319
+ },
320
+ "no_engineered": {
321
+ "n_features": 84,
322
+ "dropped_count": 6,
323
+ "metrics": {
324
+ "model": "xgboost_no_engineered",
325
+ "accuracy": 0.4641873278236915,
326
+ "macro_f1": 0.4239556593623024,
327
+ "weighted_f1": 0.4373277421758876,
328
+ "per_class_f1": {
329
+ "dwell_idle": 0.02631578947368421,
330
+ "reconnaissance": 0.7368421052631579,
331
+ "initial_access": 0.6305418719211823,
332
+ "execution": 0.46060606060606063,
333
+ "persistence": 0.4419889502762431,
334
+ "privilege_escalation": 0.49142857142857144,
335
+ "lateral_movement": 0.7346938775510204,
336
+ "collection": 0.24347826086956523,
337
+ "exfiltration": 0.2647058823529412,
338
+ "impact": 0.208955223880597
339
+ },
340
+ "confusion_matrix": {
341
+ "labels": [
342
+ "dwell_idle",
343
+ "reconnaissance",
344
+ "initial_access",
345
+ "execution",
346
+ "persistence",
347
+ "privilege_escalation",
348
+ "lateral_movement",
349
+ "collection",
350
+ "exfiltration",
351
+ "impact"
352
+ ],
353
+ "matrix": [
354
+ [
355
+ 2,
356
+ 23,
357
+ 24,
358
+ 14,
359
+ 23,
360
+ 20,
361
+ 2,
362
+ 17,
363
+ 9,
364
+ 7
365
+ ],
366
+ [
367
+ 4,
368
+ 84,
369
+ 3,
370
+ 21,
371
+ 0,
372
+ 0,
373
+ 0,
374
+ 0,
375
+ 0,
376
+ 0
377
+ ],
378
+ [
379
+ 2,
380
+ 5,
381
+ 64,
382
+ 4,
383
+ 1,
384
+ 29,
385
+ 1,
386
+ 0,
387
+ 0,
388
+ 0
389
+ ],
390
+ [
391
+ 3,
392
+ 4,
393
+ 1,
394
+ 38,
395
+ 25,
396
+ 2,
397
+ 1,
398
+ 0,
399
+ 0,
400
+ 0
401
+ ],
402
+ [
403
+ 0,
404
+ 0,
405
+ 2,
406
+ 7,
407
+ 40,
408
+ 9,
409
+ 0,
410
+ 20,
411
+ 1,
412
+ 0
413
+ ],
414
+ [
415
+ 0,
416
+ 0,
417
+ 3,
418
+ 7,
419
+ 5,
420
+ 43,
421
+ 4,
422
+ 5,
423
+ 1,
424
+ 0
425
+ ],
426
+ [
427
+ 0,
428
+ 0,
429
+ 0,
430
+ 0,
431
+ 0,
432
+ 3,
433
+ 36,
434
+ 4,
435
+ 4,
436
+ 7
437
+ ],
438
+ [
439
+ 0,
440
+ 0,
441
+ 0,
442
+ 0,
443
+ 1,
444
+ 0,
445
+ 0,
446
+ 14,
447
+ 13,
448
+ 12
449
+ ],
450
+ [
451
+ 0,
452
+ 0,
453
+ 0,
454
+ 0,
455
+ 5,
456
+ 0,
457
+ 0,
458
+ 4,
459
+ 9,
460
+ 13
461
+ ],
462
+ [
463
+ 0,
464
+ 0,
465
+ 0,
466
+ 0,
467
+ 2,
468
+ 1,
469
+ 0,
470
+ 11,
471
+ 0,
472
+ 7
473
+ ]
474
+ ]
475
+ },
476
+ "macro_roc_auc_ovr": 0.8559080760692732
477
+ },
478
+ "delta_accuracy": 0.004132231404958664,
479
+ "delta_macro_f1": 0.001543148133218264
480
+ },
481
+ "no_timestep": {
482
+ "n_features": 89,
483
+ "dropped_count": 1,
484
+ "metrics": {
485
+ "model": "xgboost_no_timestep",
486
+ "accuracy": 0.32644628099173556,
487
+ "macro_f1": 0.31019209599143654,
488
+ "weighted_f1": 0.3273550154519158,
489
+ "per_class_f1": {
490
+ "dwell_idle": 0.06060606060606061,
491
+ "reconnaissance": 0.3728813559322034,
492
+ "initial_access": 0.5666666666666667,
493
+ "execution": 0.4090909090909091,
494
+ "persistence": 0.22818791946308725,
495
+ "privilege_escalation": 0.4520547945205479,
496
+ "lateral_movement": 0.7058823529411765,
497
+ "collection": 0.0975609756097561,
498
+ "exfiltration": 0.1836734693877551,
499
+ "impact": 0.02531645569620253
500
+ },
501
+ "confusion_matrix": {
502
+ "labels": [
503
+ "dwell_idle",
504
+ "reconnaissance",
505
+ "initial_access",
506
+ "execution",
507
+ "persistence",
508
+ "privilege_escalation",
509
+ "lateral_movement",
510
+ "collection",
511
+ "exfiltration",
512
+ "impact"
513
+ ],
514
+ "matrix": [
515
+ [
516
+ 5,
517
+ 11,
518
+ 35,
519
+ 11,
520
+ 17,
521
+ 13,
522
+ 1,
523
+ 25,
524
+ 15,
525
+ 8
526
+ ],
527
+ [
528
+ 7,
529
+ 33,
530
+ 1,
531
+ 11,
532
+ 11,
533
+ 0,
534
+ 0,
535
+ 19,
536
+ 17,
537
+ 13
538
+ ],
539
+ [
540
+ 5,
541
+ 0,
542
+ 68,
543
+ 1,
544
+ 2,
545
+ 16,
546
+ 5,
547
+ 6,
548
+ 3,
549
+ 0
550
+ ],
551
+ [
552
+ 3,
553
+ 6,
554
+ 1,
555
+ 27,
556
+ 4,
557
+ 4,
558
+ 2,
559
+ 20,
560
+ 2,
561
+ 5
562
+ ],
563
+ [
564
+ 2,
565
+ 12,
566
+ 4,
567
+ 1,
568
+ 17,
569
+ 5,
570
+ 0,
571
+ 19,
572
+ 6,
573
+ 13
574
+ ],
575
+ [
576
+ 0,
577
+ 0,
578
+ 17,
579
+ 7,
580
+ 2,
581
+ 33,
582
+ 3,
583
+ 3,
584
+ 2,
585
+ 1
586
+ ],
587
+ [
588
+ 0,
589
+ 1,
590
+ 7,
591
+ 0,
592
+ 2,
593
+ 2,
594
+ 36,
595
+ 1,
596
+ 0,
597
+ 5
598
+ ],
599
+ [
600
+ 0,
601
+ 2,
602
+ 0,
603
+ 0,
604
+ 6,
605
+ 4,
606
+ 1,
607
+ 8,
608
+ 12,
609
+ 7
610
+ ],
611
+ [
612
+ 1,
613
+ 0,
614
+ 1,
615
+ 0,
616
+ 7,
617
+ 0,
618
+ 0,
619
+ 8,
620
+ 9,
621
+ 5
622
+ ],
623
+ [
624
+ 1,
625
+ 0,
626
+ 0,
627
+ 0,
628
+ 2,
629
+ 1,
630
+ 0,
631
+ 15,
632
+ 1,
633
+ 1
634
+ ]
635
+ ]
636
+ },
637
+ "macro_roc_auc_ovr": 0.7557281412642529
638
+ },
639
+ "delta_accuracy": 0.1418732782369146,
640
+ "delta_macro_f1": 0.11530671150408411
641
+ },
642
+ "no_detection_signals": {
643
+ "n_features": 76,
644
+ "dropped_count": 14,
645
+ "metrics": {
646
+ "model": "xgboost_no_detection_signals",
647
+ "accuracy": 0.4724517906336088,
648
+ "macro_f1": 0.4284152317167137,
649
+ "weighted_f1": 0.4449655177644492,
650
+ "per_class_f1": {
651
+ "dwell_idle": 0.039735099337748346,
652
+ "reconnaissance": 0.7456140350877193,
653
+ "initial_access": 0.6600985221674877,
654
+ "execution": 0.47126436781609193,
655
+ "persistence": 0.43333333333333335,
656
+ "privilege_escalation": 0.4971751412429379,
657
+ "lateral_movement": 0.7272727272727273,
658
+ "collection": 0.21818181818181817,
659
+ "exfiltration": 0.2727272727272727,
660
+ "impact": 0.21875
661
+ },
662
+ "confusion_matrix": {
663
+ "labels": [
664
+ "dwell_idle",
665
+ "reconnaissance",
666
+ "initial_access",
667
+ "execution",
668
+ "persistence",
669
+ "privilege_escalation",
670
+ "lateral_movement",
671
+ "collection",
672
+ "exfiltration",
673
+ "impact"
674
+ ],
675
+ "matrix": [
676
+ [
677
+ 3,
678
+ 23,
679
+ 23,
680
+ 18,
681
+ 22,
682
+ 17,
683
+ 3,
684
+ 16,
685
+ 9,
686
+ 7
687
+ ],
688
+ [
689
+ 2,
690
+ 85,
691
+ 3,
692
+ 22,
693
+ 0,
694
+ 0,
695
+ 0,
696
+ 0,
697
+ 0,
698
+ 0
699
+ ],
700
+ [
701
+ 1,
702
+ 5,
703
+ 67,
704
+ 2,
705
+ 2,
706
+ 28,
707
+ 1,
708
+ 0,
709
+ 0,
710
+ 0
711
+ ],
712
+ [
713
+ 2,
714
+ 3,
715
+ 1,
716
+ 41,
717
+ 23,
718
+ 3,
719
+ 1,
720
+ 0,
721
+ 0,
722
+ 0
723
+ ],
724
+ [
725
+ 0,
726
+ 0,
727
+ 1,
728
+ 9,
729
+ 39,
730
+ 9,
731
+ 0,
732
+ 19,
733
+ 1,
734
+ 1
735
+ ],
736
+ [
737
+ 0,
738
+ 0,
739
+ 2,
740
+ 8,
741
+ 3,
742
+ 44,
743
+ 4,
744
+ 6,
745
+ 1,
746
+ 0
747
+ ],
748
+ [
749
+ 1,
750
+ 0,
751
+ 0,
752
+ 0,
753
+ 3,
754
+ 6,
755
+ 36,
756
+ 2,
757
+ 0,
758
+ 6
759
+ ],
760
+ [
761
+ 0,
762
+ 0,
763
+ 0,
764
+ 0,
765
+ 2,
766
+ 1,
767
+ 0,
768
+ 12,
769
+ 15,
770
+ 10
771
+ ],
772
+ [
773
+ 1,
774
+ 0,
775
+ 0,
776
+ 0,
777
+ 5,
778
+ 0,
779
+ 0,
780
+ 4,
781
+ 9,
782
+ 12
783
+ ],
784
+ [
785
+ 0,
786
+ 0,
787
+ 0,
788
+ 0,
789
+ 2,
790
+ 1,
791
+ 0,
792
+ 11,
793
+ 0,
794
+ 7
795
+ ]
796
+ ]
797
+ },
798
+ "macro_roc_auc_ovr": 0.8544378745036634
799
+ },
800
+ "delta_accuracy": -0.004132231404958664,
801
+ "delta_macro_f1": -0.002916424221193037
802
+ }
803
+ }
804
+ }
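The ablation metrics above can be cross-checked against the confusion matrix itself. The following is a minimal sketch, not part of the repository: it recomputes accuracy, per-class F1, and macro-F1 for the `no_detection_signals` run from the matrix listed above. The helper name `f1_per_class` is hypothetical. F1 is computed as 2·TP / (row sum + column sum), which gives the same value whether rows hold true or predicted labels, so the matrix orientation does not need to be assumed. The last step assumes that `delta_accuracy` means "full model minus ablated model"; that reading is an inference from the numbers, not something the file states.

```python
# Confusion matrix of the `no_detection_signals` ablation, copied from the JSON above.
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution", "persistence",
    "privilege_escalation", "lateral_movement", "collection", "exfiltration", "impact",
]
MATRIX = [
    [3, 23, 23, 18, 22, 17, 3, 16, 9, 7],
    [2, 85, 3, 22, 0, 0, 0, 0, 0, 0],
    [1, 5, 67, 2, 2, 28, 1, 0, 0, 0],
    [2, 3, 1, 41, 23, 3, 1, 0, 0, 0],
    [0, 0, 1, 9, 39, 9, 0, 19, 1, 1],
    [0, 0, 2, 8, 3, 44, 4, 6, 1, 0],
    [1, 0, 0, 0, 3, 6, 36, 2, 0, 6],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [1, 0, 0, 0, 5, 0, 0, 4, 9, 12],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def f1_per_class(matrix):
    """Per-class F1 as 2*TP / (row_sum + col_sum); hypothetical helper."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        row_sum = sum(matrix[i])
        col_sum = sum(matrix[r][i] for r in range(n))
        denom = row_sum + col_sum
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(len(MATRIX)))
accuracy = correct / total

per_class_f1 = dict(zip(LABELS, f1_per_class(MATRIX)))
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Assumption: delta = full-model metric minus ablated metric.
DELTA_ACCURACY = -0.004132231404958664  # copied from the JSON above
implied_full_accuracy = accuracy + DELTA_ACCURACY

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
print(f"implied full-model accuracy={implied_full_accuracy:.4f}")
```

Under that assumption the implied full-model accuracy rounds to the 0.4683 reported for XGBoost in the model card metadata, so dropping the detection signals changes test accuracy by only three events out of 726.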
feature_engineering.py ADDED
@@ -0,0 +1,394 @@
+ """
+ feature_engineering.py
+ ======================
+
+ Feature pipeline for the CYB002 baseline classifier.
+
+ Predicts `kill_chain_phase` (10-class) from event + segment-level
+ observables on the CYB002 sample dataset.
+
+ CSV inputs:
+     attack_events.csv     (primary, one row per timestep-level action)
+     network_topology.csv  (asset-level inventory; aggregated to segment
+                            level before joining on target_segment_id)
+     campaign_summary.csv  (reserved for future work, not used in v1)
+     campaign_events.csv   (reserved for future work, not used in v1)
+
+ Target classes:
+     dwell_idle, reconnaissance, initial_access, execution, persistence,
+     privilege_escalation, lateral_movement, collection, exfiltration, impact
+
+ This corresponds to the README's first listed use case: predicting the
+ next ATT&CK phase from observable features. The challenge is that three
+ fields determine the phase almost exactly by construction:
+
+     - technique_id    -> 62 of 63 techniques map 1:1 to a single phase
+     - technique_name  -> 1:1 with technique_id
+     - tactic_category -> direct alias of phase
+
+ These are dropped before feature assembly. Phase is predicted from:
+ timestep position (recon mean=6, impact mean=66), target asset type,
+ protocol/port, byte volumes, connection duration, auth-failure count,
+ process-injection / lateral-hop counts, attacker tier vs defender
+ maturity, and segment-level topology aggregates.
+
+ Public API
+ ----------
+     build_features(attack_events_path, topology_path,
+                    campaign_summary_path=None) -> (X, y, groups, meta)
+     transform_single(record, meta, segment_aggregates=None) -> np.ndarray
+     save_meta(meta, path) / load_meta(path)
+     build_segment_lookup(topology_path) -> dict
+
+ License
+ -------
+ Ships with the public model on Hugging Face under CC-BY-NC-4.0, matching
+ the dataset license. See README.md.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+ import pandas as pd
+
+ # ---------------------------------------------------------------------------
+ # Label space
+ # ---------------------------------------------------------------------------
+
+ # The 10 phases observed in the sample. dwell_idle is a no-op step
+ # between actions; technique_id=T0000, tactic_category=NaN. Ordering
+ # follows tactic flow for readability only; cross-entropy loss is
+ # invariant to the order of the class indices.
+ LABEL_ORDER = [
+     "dwell_idle",
+     "reconnaissance",
+     "initial_access",
+     "execution",
+     "persistence",
+     "privilege_escalation",
+     "lateral_movement",
+     "collection",
+     "exfiltration",
+     "impact",
+ ]
+ LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+ INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+
+ # ---------------------------------------------------------------------------
+ # Columns dropped because they leak the target (kill_chain_phase)
+ # ---------------------------------------------------------------------------
+
+ # `technique_id`: 62 of 63 ATT&CK techniques map 1:1 to a single phase.
+ # T1078 Valid Accounts is the one shared technique (appears in both
+ # initial_access and persistence, which is correct ATT&CK behavior).
+ # Including technique_id as a feature is effectively label memorization.
+ #
+ # `technique_name`: 1:1 alias of technique_id (63 unique values each).
+ #
+ # `tactic_category`: direct alias of kill_chain_phase; the two columns
+ # carry identical information except tactic_category is null for
+ # dwell_idle steps. Drop.
+ LEAKY_COLUMNS = [
+     "technique_id",
+     "technique_name",
+     "tactic_category",
+ ]
+
+ # ---------------------------------------------------------------------------
+ # Columns kept as features
+ # ---------------------------------------------------------------------------
+
+ DIRECT_NUMERIC_EVENT_FEATURES = [
+     "timestep",  # strong signal: recon mean=6, impact mean=66
+     "dest_port",
+     "bytes_transferred",
+     "connection_duration_s",
+     "auth_failure_count",
+     "process_injection_flag",
+     "lateral_hop_count",
+     "c2_beacon_interval_s",  # null-aware; filled with -1 + has_c2_beacon flag
+     # Detection-related fields. These are POST-HOC observables from the
+     # SOC's perspective. We keep them as features because in the realistic
+     # phase-prediction use case, a SOC analyst has just seen an action and
+     # its initial detection outcome, and is trying to reason about which
+     # phase the campaign is in. Buyers who want a strictly pre-detection
+     # model can drop the four detection-related columns (edr_blocked_flag
+     # and siem_rule_triggered here, alert_severity and detection_outcome
+     # in CATEGORICAL_EVENT_FEATURES) and retrain.
+     "edr_blocked_flag",
+     "siem_rule_triggered",
+ ]
+
+ CATEGORICAL_EVENT_FEATURES = [
+     "target_asset_type",
+     "source_ip_class",
+     "protocol",
+     "attacker_capability_tier",
+     "defender_maturity_level",
+     "alert_severity",  # critical / high / medium / low / informational
+     "detection_outcome",  # see note above re: post-hoc observables
+ ]
+
+ ID_COLUMNS = ["campaign_id", "attacker_id"]
+
+ # ---------------------------------------------------------------------------
+ # Topology aggregation
+ # ---------------------------------------------------------------------------
+ #
+ # network_topology.csv is ASSET-LEVEL (651 rows, 12 segments, ~54 assets
+ # per segment). Direct join would explode rows. Aggregate to segment level:
+ # constant fields as-is, numeric fields mean/max as appropriate, 0/1 flags
+ # as fraction-with-coverage.
+
+ SEGMENT_CONSTANT_TOPO_COLS = ["segment_type", "defender_maturity_level"]
+ SEGMENT_NUMERIC_AGGREGATES = {
+     "patch_lag_days": "mean",
+     "exposure_score": "mean",
+     "vulnerability_count": "max",  # worst-case asset matters more
+     "inter_segment_trust_level": "mean",
+     "alert_threshold_sensitivity": "mean",
+     "mttd_baseline_hours": "mean",
+     "mttr_baseline_hours": "mean",
+     "siem_coverage_flag": "mean",  # fraction with SIEM
+     "edr_deployed_flag": "mean",  # fraction with EDR
+     "ndr_coverage_flag": "mean",
+     "mfa_enforced_flag": "mean",
+ }
+
+
+ def _aggregate_topology(topology: pd.DataFrame) -> pd.DataFrame:
+     """Collapse asset-level topology to one row per segment."""
+     parts = []
+     for col in SEGMENT_CONSTANT_TOPO_COLS:
+         parts.append(topology.groupby("segment_id")[col].first().rename(f"seg_{col}"))
+     for col, agg in SEGMENT_NUMERIC_AGGREGATES.items():
+         parts.append(topology.groupby("segment_id")[col].agg(agg).rename(f"seg_{col}_{agg}"))
+     return pd.concat(parts, axis=1).reset_index()
+
+
+ TOPOLOGY_FEATURE_NAMES_NUMERIC = [
+     f"seg_{col}_{agg}" for col, agg in SEGMENT_NUMERIC_AGGREGATES.items()
+ ]
+ TOPOLOGY_FEATURE_NAMES_CATEGORICAL = [f"seg_{col}" for col in SEGMENT_CONSTANT_TOPO_COLS]
+
+
+ # ---------------------------------------------------------------------------
+ # Engineered features
+ # ---------------------------------------------------------------------------
+ #
+ # Important: NO phase-derived engineered features. is_dwell_idle,
+ # is_high_severity_phase, phase_order_index would all be oracles when
+ # phase is the target. Six features instead, each a stated hypothesis
+ # about phase-discriminative signal in pre-phase observables.
+
+ TIER_RANK = {"script_kiddie": 1, "opportunistic": 2, "apt": 3, "nation_state": 4}
+ DEFENDER_RANK = {"minimal": 1, "baseline": 2, "managed": 3, "advanced": 4, "zero_trust": 5}
+
+
+ def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+     """Six engineered features, no phase-derived oracles."""
+     df = df.copy()
+
+     # 1. Byte volume on log scale. Heavy-tailed across phases: recon
+     #    transfers tend to be bytes; exfiltration megabytes. log1p tames
+     #    the tail and gives both XGBoost and the MLP a usable feature.
+     df["byte_volume_log"] = np.log1p(df["bytes_transferred"].clip(lower=0)).astype(float)
+
+     # 2. C2 beacon presence. c2_beacon_interval_s is null for non-C2
+     #    actions. Encode presence as a binary flag and fill the value
+     #    column with -1 so it stays usable.
+     df["has_c2_beacon"] = df["c2_beacon_interval_s"].notna().astype(int)
+     df["c2_beacon_interval_s"] = df["c2_beacon_interval_s"].fillna(-1.0)
+
+     # 3. Brute-force indicator. auth_failure_count > 0 distinguishes
+     #    credential-stuffing style actions from authenticated-path
+     #    actions; loads differently into early phases.
+     df["is_brute_forcing"] = (df["auth_failure_count"] > 0).astype(int)
+
+     # 4. Attacker vs defender advantage. Positive when attacker outclasses
+     #    defender; influences which phases an attacker can reach.
+     tier_r = df["attacker_capability_tier"].map(TIER_RANK).fillna(2).astype(int)
+     def_r = df["defender_maturity_level"].map(DEFENDER_RANK).fillna(2).astype(int)
+     df["attacker_defender_advantage"] = (tier_r - def_r).astype(int)
+
+     # 5. High-volume action indicator. Simple binary above 100 KB,
+     #    correlates with collection / exfiltration phases.
+     df["is_high_volume"] = (df["bytes_transferred"] > 100_000).astype(int)
+
+     # 6. Privileged-port indicator. dest_port < 1024, typically system
+     #    services; common in initial-access and lateral-movement actions.
+     df["is_privileged_port"] = (df["dest_port"] < 1024).astype(int)
+
+     return df
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def build_features(
+     attack_events_path: str | Path,
+     topology_path: str | Path,
+     campaign_summary_path: str | Path | None = None,
+ ) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+     """
+     Load CSVs, aggregate topology, drop leaky columns, engineer features,
+     one-hot encode, return (X, y, groups, meta).
+
+     `groups` is a Series of campaign_id values aligned with X for
+     GroupShuffleSplit / GroupKFold use. A single campaign generates ~40
+     correlated events; row-level random splitting inflates metrics.
+     """
+     events = pd.read_csv(attack_events_path)
+     topology = pd.read_csv(topology_path)
+
+     events = events.drop(columns=LEAKY_COLUMNS, errors="ignore")
+
+     topo_agg = _aggregate_topology(topology)
+     events = events.merge(
+         topo_agg, left_on="target_segment_id", right_on="segment_id", how="left",
+     ).drop(columns=["segment_id"], errors="ignore")
+
+     y = events["kill_chain_phase"].map(LABEL_TO_INT)
+     if y.isna().any():
+         bad = events.loc[y.isna(), "kill_chain_phase"].unique()
+         raise ValueError(f"Unknown kill_chain_phase values: {bad}")
+     y = y.astype(int)
+     groups = events["campaign_id"].copy()
+
+     events = _add_engineered_features(events)
+
+     numeric_features = (
+         DIRECT_NUMERIC_EVENT_FEATURES
+         + TOPOLOGY_FEATURE_NAMES_NUMERIC
+         + [
+             "byte_volume_log", "has_c2_beacon", "is_brute_forcing",
+             "attacker_defender_advantage", "is_high_volume",
+             "is_privileged_port",
+         ]
+     )
+     X_numeric = events[numeric_features].astype(float)
+
+     all_categorical = (
+         [(col, "event") for col in CATEGORICAL_EVENT_FEATURES]
+         + [(col, "topology") for col in TOPOLOGY_FEATURE_NAMES_CATEGORICAL]
+     )
+     categorical_levels: dict[str, list[str]] = {}
+     blocks: list[pd.DataFrame] = []
+     for col, _src in all_categorical:
+         levels = sorted(events[col].dropna().unique().tolist())
+         categorical_levels[col] = levels
+         block = pd.get_dummies(
+             events[col].astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         blocks.append(block)
+
+     X = pd.concat(
+         [X_numeric.reset_index(drop=True)]
+         + [b.reset_index(drop=True) for b in blocks],
+         axis=1,
+     ).fillna(0.0)
+
+     meta = {
+         "feature_names": X.columns.tolist(),
+         "numeric_features": numeric_features,
+         "categorical_levels": categorical_levels,
+         "label_to_int": LABEL_TO_INT,
+         "int_to_label": INT_TO_LABEL,
+         "topology_aggregation": {
+             "segment_constant": SEGMENT_CONSTANT_TOPO_COLS,
+             "segment_numeric_aggregates": SEGMENT_NUMERIC_AGGREGATES,
+         },
+     }
+     return X, y, groups, meta
+
+
+ def transform_single(
+     record: dict | pd.DataFrame,
+     meta: dict[str, Any],
+     segment_aggregates: dict | None = None,
+ ) -> np.ndarray:
+     """Encode a single event record for inference.
+
+     `record` must contain event-level fields (sans leaky columns) plus
+     the segment-level aggregate fields. If you only have the raw event,
+     pass `segment_aggregates` as a dict {seg_*: value, ...} and they'll
+     be merged in.
+     """
+     if isinstance(record, dict):
+         df = pd.DataFrame([record.copy()])
+     else:
+         # Reset the index so the one-hot blocks below (which keep df's
+         # index) align row-for-row with the positionally built numeric
+         # block when the caller passes a slice of a larger frame.
+         df = record.copy().reset_index(drop=True)
+
+     if segment_aggregates is not None:
+         for k, v in segment_aggregates.items():
+             df[k] = v
+
+     df = _add_engineered_features(df)
+
+     numeric = pd.DataFrame({
+         col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+         for col in meta["numeric_features"]
+     })
+     blocks: list[pd.DataFrame] = [numeric]
+     for col, levels in meta["categorical_levels"].items():
+         val = df.get(col, pd.Series([None] * len(df)))
+         block = pd.get_dummies(
+             val.astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         for lvl in levels:
+             cname = f"{col}_{lvl}"
+             if cname not in block.columns:
+                 block[cname] = 0
+         block = block[[f"{col}_{lvl}" for lvl in levels]]
+         blocks.append(block)
+
+     X = pd.concat(blocks, axis=1).fillna(0.0)
+     X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+     return X.values.astype(np.float32)
+
+
+ def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+     serializable = {
+         "feature_names": meta["feature_names"],
+         "numeric_features": meta["numeric_features"],
+         "categorical_levels": meta["categorical_levels"],
+         "label_to_int": meta["label_to_int"],
+         "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+         "topology_aggregation": meta["topology_aggregation"],
+     }
+     with open(path, "w") as f:
+         json.dump(serializable, f, indent=2)
+
+
+ def load_meta(path: str | Path) -> dict[str, Any]:
+     with open(path) as f:
+         meta = json.load(f)
+     meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+     return meta
+
+
+ def build_segment_lookup(topology_path: str | Path) -> dict[str, dict]:
+     """Build a {segment_id: {seg_* feature values}} lookup for inference."""
+     topology = pd.read_csv(topology_path)
+     agg = _aggregate_topology(topology)
+     return {row["segment_id"]: {k: v for k, v in row.items() if k != "segment_id"}
+             for _, row in agg.iterrows()}
+
+
+ if __name__ == "__main__":
+     import sys
+     base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+     X, y, groups, meta = build_features(
+         base / "attack_events.csv",
+         base / "network_topology.csv",
+     )
+     print(f"X shape: {X.shape}")
+     print(f"y shape: {y.shape}")
+     print(f"groups: {groups.nunique()} campaigns")
+     print(f"n features: {len(meta['feature_names'])}")
+     print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+     print(f"X has NaN: {X.isnull().any().any()}")
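The `build_features` docstring recommends splitting by `campaign_id` rather than by row, because events from one campaign are correlated. As a rough illustration of what the returned `groups` value is for, the sketch below performs a campaign-level hold-out split using only the standard library. It is a stand-in for scikit-learn's `GroupShuffleSplit`, not the split used to train the released models; the function name `group_holdout_split` and the toy campaign IDs are made up for the example.

```python
import random


def group_holdout_split(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides.

    Hypothetical helper: shuffles the unique group IDs and holds out a
    fraction of *groups* (not rows) for testing.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: 5 campaigns with 4 events each (the docstring cites ~40 per campaign).
groups = [f"CAMP-{c:03d}" for c in range(5) for _ in range(4)]
train_idx, test_idx = group_holdout_split(groups, test_fraction=0.2, seed=42)

train_campaigns = {groups[i] for i in train_idx}
test_campaigns = {groups[i] for i in test_idx}
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"campaigns on both sides: {train_campaigns & test_campaigns}")
```

With the real pipeline the same index lists could be applied positionally, for example `X.iloc[train_idx]` and `y.iloc[train_idx]`, after passing `groups.tolist()` to the helper.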
feature_meta.json ADDED
@@ -0,0 +1,249 @@
1
+ {
2
+ "feature_names": [
3
+ "timestep",
4
+ "dest_port",
5
+ "bytes_transferred",
6
+ "connection_duration_s",
7
+ "auth_failure_count",
8
+ "process_injection_flag",
9
+ "lateral_hop_count",
10
+ "c2_beacon_interval_s",
11
+ "edr_blocked_flag",
12
+ "siem_rule_triggered",
13
+ "seg_patch_lag_days_mean",
14
+ "seg_exposure_score_mean",
15
+ "seg_vulnerability_count_max",
16
+ "seg_inter_segment_trust_level_mean",
17
+ "seg_alert_threshold_sensitivity_mean",
18
+ "seg_mttd_baseline_hours_mean",
19
+ "seg_mttr_baseline_hours_mean",
20
+ "seg_siem_coverage_flag_mean",
21
+ "seg_edr_deployed_flag_mean",
22
+ "seg_ndr_coverage_flag_mean",
23
+ "seg_mfa_enforced_flag_mean",
24
+ "byte_volume_log",
25
+ "has_c2_beacon",
26
+ "is_brute_forcing",
27
+ "attacker_defender_advantage",
28
+ "is_high_volume",
29
+ "is_privileged_port",
30
+ "target_asset_type_backup_system",
31
+ "target_asset_type_cloud_vm",
32
+ "target_asset_type_container",
33
+ "target_asset_type_database_server",
34
+ "target_asset_type_domain_controller",
35
+ "target_asset_type_ehr_system",
36
+ "target_asset_type_email_server",
37
+ "target_asset_type_firewall",
38
+ "target_asset_type_iot_device",
39
+ "target_asset_type_router",
40
+ "target_asset_type_scada_plc",
41
+ "target_asset_type_server",
42
+ "target_asset_type_vpn_gateway",
43
+ "target_asset_type_web_server",
44
+ "target_asset_type_workstation",
45
+ "source_ip_class_cloud_egress",
46
+ "source_ip_class_external_internet",
47
+ "source_ip_class_internal_lan",
48
+ "source_ip_class_tor_exit",
49
+ "source_ip_class_vpn_tunnel",
50
+ "protocol_dns",
51
+ "protocol_ftp",
52
+ "protocol_http",
53
+ "protocol_https",
54
+ "protocol_icmp",
55
+ "protocol_rdp",
56
+ "protocol_smb",
57
+ "protocol_ssh",
58
+ "protocol_tcp",
59
+ "protocol_udp",
60
+ "attacker_capability_tier_apt",
61
+ "attacker_capability_tier_nation_state",
62
+ "attacker_capability_tier_opportunistic",
63
+ "attacker_capability_tier_script_kiddie",
64
+ "defender_maturity_level_advanced",
65
+ "defender_maturity_level_baseline",
66
+ "defender_maturity_level_managed",
67
+ "defender_maturity_level_minimal",
68
+ "defender_maturity_level_zero_trust",
69
+ "alert_severity_critical",
70
+ "alert_severity_high",
71
+ "alert_severity_informational",
72
+ "alert_severity_low",
73
+ "alert_severity_medium",
74
+ "detection_outcome_blind_spot",
75
+ "detection_outcome_edr_blocked",
76
+ "detection_outcome_evasion_success",
77
+ "detection_outcome_high_confidence_alert",
78
+ "detection_outcome_ir_escalated",
79
+ "detection_outcome_marginal_alert",
80
+ "detection_outcome_suppressed_alert",
81
+ "seg_segment_type_cloud_workload",
82
+ "seg_segment_type_corporate_lan",
83
+ "seg_segment_type_data_exfiltration_target",
84
+ "seg_segment_type_endpoint_fleet",
85
+ "seg_segment_type_soc_management_plane",
86
+ "seg_segment_type_supply_chain_interface",
87
+ "seg_segment_type_zero_trust_segment",
88
+ "seg_defender_maturity_level_advanced",
89
+ "seg_defender_maturity_level_baseline",
90
+ "seg_defender_maturity_level_managed",
91
+ "seg_defender_maturity_level_minimal",
92
+ "seg_defender_maturity_level_zero_trust"
93
+ ],
94
+ "numeric_features": [
95
+ "timestep",
96
+ "dest_port",
97
+ "bytes_transferred",
98
+ "connection_duration_s",
99
+ "auth_failure_count",
100
+ "process_injection_flag",
101
+ "lateral_hop_count",
102
+ "c2_beacon_interval_s",
103
+ "edr_blocked_flag",
104
+ "siem_rule_triggered",
105
+ "seg_patch_lag_days_mean",
106
+ "seg_exposure_score_mean",
107
+ "seg_vulnerability_count_max",
108
+ "seg_inter_segment_trust_level_mean",
109
+ "seg_alert_threshold_sensitivity_mean",
110
+ "seg_mttd_baseline_hours_mean",
111
+ "seg_mttr_baseline_hours_mean",
112
+ "seg_siem_coverage_flag_mean",
113
+ "seg_edr_deployed_flag_mean",
114
+ "seg_ndr_coverage_flag_mean",
115
+ "seg_mfa_enforced_flag_mean",
116
+ "byte_volume_log",
117
+ "has_c2_beacon",
118
+ "is_brute_forcing",
119
+ "attacker_defender_advantage",
120
+ "is_high_volume",
121
+ "is_privileged_port"
122
+ ],
123
+ "categorical_levels": {
124
+ "target_asset_type": [
125
+ "backup_system",
126
+ "cloud_vm",
127
+ "container",
128
+ "database_server",
129
+ "domain_controller",
130
+ "ehr_system",
131
+ "email_server",
132
+ "firewall",
133
+ "iot_device",
134
+ "router",
135
+ "scada_plc",
136
+ "server",
137
+ "vpn_gateway",
138
+ "web_server",
139
+ "workstation"
140
+ ],
141
+ "source_ip_class": [
142
+ "cloud_egress",
143
+ "external_internet",
144
+ "internal_lan",
145
+ "tor_exit",
146
+ "vpn_tunnel"
147
+ ],
148
+ "protocol": [
149
+ "dns",
150
+ "ftp",
151
+ "http",
152
+ "https",
153
+ "icmp",
154
+ "rdp",
155
+ "smb",
156
+ "ssh",
157
+ "tcp",
158
+ "udp"
159
+ ],
160
+ "attacker_capability_tier": [
161
+ "apt",
162
+ "nation_state",
163
+ "opportunistic",
164
+ "script_kiddie"
165
+ ],
166
+ "defender_maturity_level": [
167
+ "advanced",
168
+ "baseline",
169
+ "managed",
170
+ "minimal",
171
+ "zero_trust"
172
+ ],
173
+ "alert_severity": [
174
+ "critical",
175
+ "high",
176
+ "informational",
177
+ "low",
178
+ "medium"
179
+ ],
180
+ "detection_outcome": [
181
+ "blind_spot",
182
+ "edr_blocked",
183
+ "evasion_success",
184
+ "high_confidence_alert",
185
+ "ir_escalated",
186
+ "marginal_alert",
187
+ "suppressed_alert"
188
+ ],
189
+ "seg_segment_type": [
190
+ "cloud_workload",
191
+ "corporate_lan",
192
+ "data_exfiltration_target",
193
+ "endpoint_fleet",
194
+ "soc_management_plane",
195
+ "supply_chain_interface",
196
+ "zero_trust_segment"
197
+ ],
198
+ "seg_defender_maturity_level": [
199
+ "advanced",
200
+ "baseline",
201
+ "managed",
202
+ "minimal",
203
+ "zero_trust"
204
+ ]
205
+ },
206
+ "label_to_int": {
207
+ "dwell_idle": 0,
208
+ "reconnaissance": 1,
209
+ "initial_access": 2,
210
+ "execution": 3,
211
+ "persistence": 4,
212
+ "privilege_escalation": 5,
213
+ "lateral_movement": 6,
214
+ "collection": 7,
215
+ "exfiltration": 8,
216
+ "impact": 9
217
+ },
218
+ "int_to_label": {
219
+ "0": "dwell_idle",
220
+ "1": "reconnaissance",
221
+ "2": "initial_access",
222
+ "3": "execution",
223
+ "4": "persistence",
224
+ "5": "privilege_escalation",
225
+ "6": "lateral_movement",
226
+ "7": "collection",
227
+ "8": "exfiltration",
228
+ "9": "impact"
229
+ },
230
+ "topology_aggregation": {
231
+ "segment_constant": [
232
+ "segment_type",
233
+ "defender_maturity_level"
234
+ ],
235
+ "segment_numeric_aggregates": {
236
+ "patch_lag_days": "mean",
237
+ "exposure_score": "mean",
238
+ "vulnerability_count": "max",
239
+ "inter_segment_trust_level": "mean",
240
+ "alert_threshold_sensitivity": "mean",
241
+ "mttd_baseline_hours": "mean",
242
+ "mttr_baseline_hours": "mean",
243
+ "siem_coverage_flag": "mean",
244
+ "edr_deployed_flag": "mean",
245
+ "ndr_coverage_flag": "mean",
246
+ "mfa_enforced_flag": "mean"
247
+ }
248
+ }
249
+ }
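`feature_meta.json` pins both the categorical levels and the final column order, which is what lets `transform_single` encode one record with the same layout as the training matrix. The sketch below mimics that behaviour in plain Python for a single column, using the `source_ip_class` levels listed above. `one_hot` is a hypothetical helper, not a function from `feature_engineering.py`; like the pandas-based pipeline, it maps a value outside the stored levels to an all-zero block rather than creating a new column.

```python
# Levels copied from "categorical_levels" -> "source_ip_class" above.
SOURCE_IP_LEVELS = [
    "cloud_egress",
    "external_internet",
    "internal_lan",
    "tor_exit",
    "vpn_tunnel",
]


def one_hot(column, value, levels):
    """Encode `value` against a fixed, ordered list of levels.

    Hypothetical helper. Column names follow the `<column>_<level>`
    pattern seen in "feature_names"; unseen values yield all zeros.
    """
    return {f"{column}_{level}": int(value == level) for level in levels}


known = one_hot("source_ip_class", "tor_exit", SOURCE_IP_LEVELS)
unseen = one_hot("source_ip_class", "satellite_link", SOURCE_IP_LEVELS)  # made-up level

print(known)
print(unseen)
```

Because the levels are stored rather than re-derived at inference time, a record whose category never appeared in training still produces a vector of the expected width.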
feature_scaler.json ADDED
@@ -0,0 +1 @@
1
+ {"mean": [29.669737774627922, 2374.859673990078, 94190.43841601702, 14.909633238837705, 1.3384124734231042, 0.13040396881644223, 0.05705173635719348, 0.0, 0.3621545003543586, 0.166194188518781, 34.396196846241345, 0.512852316745022, 14.379518072289157, 0.392728801495667, 0.7238749335681469, 6.124241842889212, 36.93126845133998, 0.6976009715267184, 0.8059781553368865, 0.4883178731877128, 0.6477277267624112, 9.540027804902557, 1.0, 0.5510276399716513, -0.010276399716513111, 0.16725726435152374, 0.6463501063075833, 0.0627214741318214, 0.06909992912827782, 0.06591070163004961, 0.0705173635719348, 0.07122608079376329, 0.06520198440822111, 0.0673281360737066, 0.06520198440822111, 0.07299787384833452, 0.058114812189936214, 0.06945428773919206, 0.06130403968816442, 0.057406094968107724, 0.07973068745570518, 0.06378454996456413, 0.19383416017009214, 0.20233876683203403, 0.20411055988660523, 0.2147413182140326, 0.184975194897236, 0.10063784549964565, 0.09815733522324592, 0.10311835577604536, 0.09780297661233169, 0.09319631467044649, 0.09886605244507442, 0.10099220411055988, 0.10276399716513111, 0.09673990077958894, 0.10772501771793054, 0.2271438695960312, 0.4875974486180014, 0.22749822820694543, 0.05776045357902197, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376, 0.43834160170092135, 0.06413890857547838, 0.4043231750531538, 0.0673281360737066, 0.0258681785967399, 0.059532246633593196, 0.3621545003543586, 0.3447909284195606, 0.10701630049610206, 0.03330970942593905, 0.0258681785967399, 0.0673281360737066, 0.04642097802976612, 0.23954642097802978, 0.09603118355776046, 0.22041105598866054, 0.08788093550673282, 0.22395464209780297, 0.08575478384124734, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376], "std": [21.611718068894575, 3262.3953544252254, 493540.4889491936, 26.882083698757928, 1.7063611088856259, 0.33680702458505324, 0.23198255508867544, 1.0, 
0.4807083352654771, 0.3723208326229761, 27.918565668886338, 0.16437622036073063, 6.809572056022862, 0.031089587614791407, 0.16380824644388434, 3.278380945942728, 19.765170276913693, 0.1795819790728066, 0.06230034648225459, 0.1392601592567418, 0.14782851174966183, 1.9732855896589672, 1.0, 0.49747751507194377, 1.14101486329445, 0.3732715435730835, 0.47818686198141197, 0.2425042887305301, 0.25366894008973206, 0.2481699123323166, 0.2560622962417658, 0.2572476946006164, 0.2469256805073951, 0.2506338325607766, 0.2469256805073951, 0.2601791150733033, 0.23400188966661606, 0.25427013215761435, 0.23992968449882038, 0.2326581539410086, 0.2709238172522079, 0.24441204871928013, 0.3953705491117327, 0.4018146379110633, 0.40312160075220743, 0.4107155466365413, 0.38834625527902134, 0.30090190074813306, 0.29757999359048065, 0.3041672976137825, 0.29710071221594014, 0.29075886802325146, 0.2985349856721442, 0.3013718026414007, 0.303704202734562, 0.2956556572543326, 0.3100877479347836, 0.41906057037245537, 0.49993473898865376, 0.4192911666305911, 0.23333125829157636, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016, 0.49627161445350315, 0.24504364292201097, 0.49084755398842195, 0.2506338325607766, 0.15877011238413913, 0.2366601047140391, 0.4807083352654771, 0.4753842926323826, 0.30918875753830577, 0.17947586784069838, 0.15877011238413913, 0.2506338325607766, 0.21043232273668783, 0.426882310965684, 0.2946862192784542, 0.4145973147754568, 0.2831718407328375, 0.4169659091226798, 0.28005123240617036, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016]}
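`feature_scaler.json` stores one mean and one standard deviation per feature; the notebook below describes it as the MLP input standardization. The following is a minimal sketch of how such a scaler would typically be applied, assuming plain z-scoring `(x - mean) / std` with entries ordered like `feature_names` in `feature_meta.json`. The exact call used by the training script is not shown in this commit section, so treat this as an illustration only. Only the first three entries (`timestep`, `dest_port`, `bytes_transferred`) are copied here, and `standardize` is a hypothetical helper name.

```python
# First three entries of "mean" and "std" from feature_scaler.json above
# (assumed to correspond to timestep, dest_port, bytes_transferred).
MEAN = [29.669737774627922, 2374.859673990078, 94190.43841601702]
STD = [21.611718068894575, 3262.3953544252254, 493540.4889491936]


def standardize(row, mean, std):
    """Z-score each value: (x - mean) / std. Hypothetical helper."""
    if not (len(row) == len(mean) == len(std)):
        raise ValueError("row, mean and std must have the same length")
    return [(x - m) / s for x, m, s in zip(row, mean, std)]


# Illustrative event: timestep 6, HTTPS port, 1.5 KB transferred.
z = standardize([6.0, 443.0, 1500.0], MEAN, STD)
print([round(v, 3) for v in z])
```

Two entries in the file (`c2_beacon_interval_s` and `has_c2_beacon`) have a standard deviation of exactly 1.0, which would be consistent with a guard against division by zero for constant columns, though that is an inference from the values rather than something the file states.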
inference_example.ipynb ADDED
@@ -0,0 +1,343 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# CYB002 Baseline Classifier — Inference Example\n",
8
+ "\n",
9
+ "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **MITRE ATT&CK kill-chain phase** of a new attack-event record.\n",
10
+ "\n",
11
+ "**Models predict one of 10 phases:** `dwell_idle`, `reconnaissance`, `initial_access`, `execution`, `persistence`, `privilege_escalation`, `lateral_movement`, `collection`, `exfiltration`, `impact`.\n",
12
+ "\n",
13
+ "**This is a baseline reference model**, not a production threat detector. See the model card for full metrics and limitations."
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "markdown",
18
+ "metadata": {},
19
+ "source": [
20
+ "## 1. Install dependencies"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "metadata": {},
27
+ "outputs": [],
28
+ "source": [
29
+ "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "## 2. Download model artifacts from Hugging Face\n",
37
+ "\n",
38
+ "Five files are needed:\n",
39
+ "- `model_xgb.json` — XGBoost weights\n",
40
+ "- `model_mlp.safetensors` — PyTorch MLP weights\n",
41
+ "- `feature_engineering.py` — feature pipeline (must match the one used at training)\n",
42
+ "- `feature_meta.json` — feature column order + categorical levels\n",
43
+ "- `feature_scaler.json` — MLP input standardization (mean / std)"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "execution_count": null,
49
+ "metadata": {},
50
+ "outputs": [],
51
+ "source": [
52
+ "from huggingface_hub import hf_hub_download\n",
53
+ "\n",
54
+ "REPO_ID = \"xpertsystems/cyb002-baseline-classifier\"\n",
55
+ "\n",
56
+ "files = {}\n",
57
+ "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
58
+ " \"feature_engineering.py\", \"feature_meta.json\",\n",
59
+ " \"feature_scaler.json\"]:\n",
60
+ " files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
61
+ " print(f\" downloaded: {name}\")"
62
+ ]
63
+ },
64
+ {
65
+ "cell_type": "code",
66
+ "execution_count": null,
67
+ "metadata": {},
68
+ "outputs": [],
69
+ "source": [
70
+ "# Make feature_engineering.py importable\n",
71
+ "import sys, os\n",
72
+ "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
73
+ "if fe_dir not in sys.path:\n",
74
+ " sys.path.insert(0, fe_dir)\n",
75
+ "\n",
76
+ "from feature_engineering import (\n",
77
+ " transform_single, load_meta, INT_TO_LABEL, build_segment_lookup\n",
78
+ ")"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "markdown",
83
+ "metadata": {},
84
+ "source": [
85
+ "## 3. Load models and metadata"
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": null,
91
+ "metadata": {},
92
+ "outputs": [],
93
+ "source": [
94
+ "import json\n",
95
+ "import numpy as np\n",
96
+ "import torch\n",
97
+ "import torch.nn as nn\n",
98
+ "import xgboost as xgb\n",
99
+ "from safetensors.torch import load_file\n",
100
+ "\n",
101
+ "meta = load_meta(files[\"feature_meta.json\"])\n",
102
+ "with open(files[\"feature_scaler.json\"]) as f:\n",
103
+ " scaler = json.load(f)\n",
104
+ "\n",
105
+ "N_FEATURES = len(meta[\"feature_names\"])\n",
106
+ "N_CLASSES = len(meta[\"int_to_label\"])\n",
107
+ "print(f\"feature count: {N_FEATURES}\")\n",
108
+ "print(f\"class count: {N_CLASSES}\")\n",
109
+ "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "code",
114
+ "execution_count": null,
115
+ "metadata": {},
116
+ "outputs": [],
117
+ "source": [
118
+ "# XGBoost\n",
119
+ "xgb_model = xgb.XGBClassifier()\n",
120
+ "xgb_model.load_model(files[\"model_xgb.json\"])\n",
121
+ "\n",
122
+ "# MLP architecture (must match training)\n",
123
+ "class PhaseMLP(nn.Module):\n",
124
+ " def __init__(self, n_features, n_classes=10, hidden1=128, hidden2=64, dropout=0.3):\n",
125
+ " super().__init__()\n",
126
+ " self.net = nn.Sequential(\n",
127
+ " nn.Linear(n_features, hidden1),\n",
128
+ " nn.BatchNorm1d(hidden1),\n",
129
+ " nn.ReLU(),\n",
130
+ " nn.Dropout(dropout),\n",
131
+ " nn.Linear(hidden1, hidden2),\n",
132
+ " nn.BatchNorm1d(hidden2),\n",
133
+ " nn.ReLU(),\n",
134
+ " nn.Dropout(dropout),\n",
135
+ " nn.Linear(hidden2, n_classes),\n",
136
+ " )\n",
137
+ " def forward(self, x):\n",
138
+ " return self.net(x)\n",
139
+ "\n",
140
+ "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
141
+ "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
142
+ "mlp_model.eval()\n",
143
+ "print(\"models loaded\")"
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "markdown",
148
+ "metadata": {},
149
+ "source": [
150
+ "## 4. Build segment-aggregate lookup from the dataset\n",
151
+ "\n",
152
+ "Per-segment topology aggregates (mean exposure, fraction with EDR, etc.) are computed at training time and must be available at inference time too. The helper `build_segment_lookup` pulls them from `network_topology.csv`."
153
+ ]
154
+ },
155
+ {
156
+ "cell_type": "code",
157
+ "execution_count": null,
158
+ "metadata": {},
159
+ "outputs": [],
160
+ "source": [
161
+ "from huggingface_hub import snapshot_download\n",
162
+ "\n",
163
+ "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb002-sample\", repo_type=\"dataset\")\n",
164
+ "\n",
165
+ "import os\n",
166
+ "segment_aggregates_lookup = build_segment_lookup(\n",
167
+ " os.path.join(ds_path, \"network_topology.csv\")\n",
168
+ ")\n",
169
+ "print(f\"loaded {len(segment_aggregates_lookup)} segment aggregates\")"
170
+ ]
171
+ },
172
+ {
173
+ "cell_type": "markdown",
174
+ "metadata": {},
175
+ "source": [
176
+ "## 5. Prediction helper"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "code",
181
+ "execution_count": null,
182
+ "metadata": {},
183
+ "outputs": [],
184
+ "source": [
185
+ "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
186
+ "SD = np.array(scaler[\"std\"], dtype=np.float32)\n",
187
+ "\n",
188
+ "def predict_phase(record: dict) -> dict:\n",
189
+ " \"\"\"Predict the kill-chain phase for one event record.\n",
190
+ "\n",
191
+ " `record` is a dict with event-level fields. Segment-level aggregates\n",
192
+ " are pulled automatically from `segment_aggregates_lookup` using the\n",
193
+ " `target_segment_id` field.\n",
194
+ "\n",
195
+ " Returns a dict with both models' predictions and per-class probabilities.\n",
196
+ " \"\"\"\n",
197
+ " seg_id = record.get(\"target_segment_id\")\n",
198
+ " seg_agg = segment_aggregates_lookup.get(seg_id, {})\n",
199
+ " X = transform_single(record, meta, segment_aggregates=seg_agg)\n",
200
+ "\n",
201
+ " xgb_proba = xgb_model.predict_proba(X)[0]\n",
202
+ " xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
203
+ "\n",
204
+ " Xs = ((X - MU) / SD).astype(np.float32)\n",
205
+ " with torch.no_grad():\n",
206
+ " logits = mlp_model(torch.tensor(Xs))\n",
207
+ " mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
208
+ " mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
209
+ "\n",
210
+ " return {\n",
211
+ " \"xgboost\": {\n",
212
+ " \"label\": xgb_label,\n",
213
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
214
+ " },\n",
215
+ " \"mlp\": {\n",
216
+ " \"label\": mlp_label,\n",
217
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
218
+ " },\n",
219
+ " }"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "markdown",
224
+ "metadata": {},
225
+ "source": [
226
+ "## 6. Run on an example record\n",
227
+ "\n",
228
+ "This is a real `reconnaissance` event lifted from the sample dataset: opportunistic attacker scanning an email server early in a campaign (timestep 0). Both models should predict `reconnaissance`."
229
+ ]
230
+ },
231
+ {
232
+ "cell_type": "code",
233
+ "execution_count": null,
234
+ "metadata": {},
235
+ "outputs": [],
236
+ "source": [
237
+ "# Real attack event from the sample dataset (true label: reconnaissance)\n",
238
+ "example_record = {\n",
239
+ " \"campaign_id\": \"CAMP-000030\",\n",
240
+ " \"attacker_id\": \"ATK-0003\",\n",
241
+ " \"timestep\": 0,\n",
242
+ " \"target_segment_id\": \"SEG-0008\",\n",
243
+ " \"target_asset_type\": \"email_server\",\n",
244
+ " \"source_ip_class\": \"vpn_tunnel\",\n",
245
+ " \"dest_port\": 22,\n",
246
+ " \"protocol\": \"icmp\",\n",
247
+ " \"bytes_transferred\": 15648.48,\n",
248
+ " \"connection_duration_s\": 3.913,\n",
249
+ " \"auth_failure_count\": 0,\n",
250
+ " \"process_injection_flag\": 0,\n",
251
+ " \"lateral_hop_count\": 0,\n",
252
+ " \"c2_beacon_interval_s\": 0.0,\n",
253
+ " \"detection_outcome\": \"edr_blocked\",\n",
254
+ " \"alert_severity\": \"critical\",\n",
255
+ " \"siem_rule_triggered\": 0,\n",
256
+ " \"edr_blocked_flag\": 1,\n",
257
+ " \"attacker_capability_tier\": \"opportunistic\",\n",
258
+ " \"defender_maturity_level\": \"baseline\",\n",
259
+ "}\n",
260
+ "\n",
261
+ "result = predict_phase(example_record)\n",
262
+ "\n",
263
+ "print(f\"XGBoost -> {result['xgboost']['label']}\")\n",
264
+ "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
265
+ " print(f\" P({lbl:25s}) = {p:.4f}\")\n",
266
+ "\n",
267
+ "print(f\"\\nMLP -> {result['mlp']['label']}\")\n",
268
+ "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
269
+ " print(f\" P({lbl:25s}) = {p:.4f}\")"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "markdown",
274
+ "metadata": {},
275
+ "source": [
276
+ "### Note: when the two models disagree\n",
277
+ "\n",
278
+ "XGBoost and the MLP can disagree on out-of-distribution records, particularly hand-crafted inputs whose feature combinations fall outside the training distribution. The MLP, trained with BatchNorm on a small dataset (2,822 training events), generalizes less reliably to such inputs than the tree ensemble. Disagreement is a useful triage signal: in a SOC workflow, events on which the two models disagree are worth a human review."
279
+ ]
280
+ },
281
+ {
282
+ "cell_type": "markdown",
283
+ "metadata": {},
284
+ "source": [
285
+ "## 7. Batch prediction on the sample dataset"
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": null,
291
+ "metadata": {},
292
+ "outputs": [],
293
+ "source": [
294
+ "import pandas as pd\n",
295
+ "\n",
296
+ "events = pd.read_csv(os.path.join(ds_path, \"attack_events.csv\"))\n",
297
+ "\n",
298
+ "# Drop leakage columns the model was never trained on\n",
299
+ "events = events.drop(columns=[\"technique_id\", \"technique_name\", \"tactic_category\"],\n",
300
+ " errors=\"ignore\")\n",
301
+ "\n",
302
+ "# Score the first 200 events\n",
303
+ "sample = events.head(200).copy()\n",
304
+ "preds = [predict_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
305
+ "sample[\"xgb_pred\"] = preds\n",
306
+ "\n",
307
+ "ct = pd.crosstab(sample[\"kill_chain_phase\"], sample[\"xgb_pred\"],\n",
308
+ " rownames=[\"true\"], colnames=[\"pred\"])\n",
309
+ "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
310
+ "print(ct)\n",
311
+ "acc = (sample[\"kill_chain_phase\"] == sample[\"xgb_pred\"]).mean()\n",
312
+ "print(f\"\\nbatch accuracy on first 200 (in-distribution): {acc:.4f}\")\n",
313
+ "print(\"\\nNote: this includes training-set events. See validation_results.json\\n\"\n",
314
+ " \"for proper held-out test-set metrics from disjoint campaigns.\")"
315
+ ]
316
+ },
317
+ {
318
+ "cell_type": "markdown",
319
+ "metadata": {},
320
+ "source": [
321
+ "## 8. Next steps\n",
322
+ "\n",
323
+ "- See `validation_results.json` for held-out test-set metrics (15 disjoint campaigns, 726 events).\n",
324
+ "- See `ablation_results.json` for per-feature-group contributions. `timestep` is by far the most predictive feature, which is expected: kill-chain phases progress in time, so an event's position in the campaign timeline carries most of the phase signal.\n",
325
+ "- The model card's **Limitations** section explains the gap between this baseline and production threat-detection systems.\n",
326
+ "- For the full 380k-row CYB002 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
327
+ ]
328
+ }
329
+ ],
330
+ "metadata": {
331
+ "kernelspec": {
332
+ "display_name": "Python 3",
333
+ "language": "python",
334
+ "name": "python3"
335
+ },
336
+ "language_info": {
337
+ "name": "python",
338
+ "version": "3.10"
339
+ }
340
+ },
341
+ "nbformat": 4,
342
+ "nbformat_minor": 5
343
+ }
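The notebook's note on model disagreement can be turned into a small triage helper. The following is a minimal sketch, not part of the released notebook: it assumes the result dictionary has the shape returned by `predict_phase` above, and the function name `needs_review` and the `min_confidence` threshold of 0.5 are hypothetical choices made for illustration.

```python
def needs_review(result, min_confidence=0.5):
    """Return True when an event should be routed to a human analyst.

    `result` is assumed to follow the structure returned by predict_phase:
    {"xgboost": {"label": ..., "probabilities": {...}}, "mlp": {...}}.
    `min_confidence` is an illustrative threshold, not a tuned value.
    """
    xgb_out, mlp_out = result["xgboost"], result["mlp"]

    # Rule 1: the two models predict different phases.
    if xgb_out["label"] != mlp_out["label"]:
        return True

    # Rule 2: the models agree, but at least one of them is unsure.
    top_xgb = max(xgb_out["probabilities"].values())
    top_mlp = max(mlp_out["probabilities"].values())
    return min(top_xgb, top_mlp) < min_confidence


# Hand-written example results (made-up probabilities, for illustration only).
agree_confident = {
    "xgboost": {"label": "reconnaissance",
                "probabilities": {"reconnaissance": 0.9, "execution": 0.1}},
    "mlp": {"label": "reconnaissance",
            "probabilities": {"reconnaissance": 0.8, "execution": 0.2}},
}
disagree = {
    "xgboost": {"label": "reconnaissance",
                "probabilities": {"reconnaissance": 0.7, "execution": 0.3}},
    "mlp": {"label": "execution",
            "probabilities": {"reconnaissance": 0.4, "execution": 0.6}},
}

print(needs_review(agree_confident))  # models agree and both are confident
print(needs_review(disagree))         # models disagree
```

Given the modest held-out accuracy reported in `validation_results.json`, a rule like this would likely flag a large share of events, so the threshold would need tuning against analyst capacity.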
model_mlp.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f35e1a5f1a92330b2ebdf1f65a097ead961fed4b9dbf4ea11aed7d74a5f293bd
3
+ size 86512
model_xgb.json ADDED
The diff for this file is too large to render. See raw diff
 
validation_results.json ADDED
@@ -0,0 +1,383 @@
1
+ {
2
+ "version": "1.0.0",
3
+ "dataset": "xpertsystems/cyb002-sample",
4
+ "task": "10-class kill_chain_phase classification",
5
+ "baselines": {
6
+ "always_predict_majority_accuracy": 0.19421487603305784,
7
+ "majority_class": "dwell_idle",
8
+ "random_guess_accuracy": 0.1
9
+ },
10
+ "split": {
11
+ "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
12
+ "rationale": "100 campaigns generate ~4,353 events; random row-split would leak campaign-level correlations into the test set. The group-aware split ensures train/val/test campaigns are disjoint.",
13
+ "campaigns_train": 69,
14
+ "campaigns_val": 16,
15
+ "campaigns_test": 15,
16
+ "events_train": 2822,
17
+ "events_val": 805,
18
+ "events_test": 726,
19
+ "seed": 42
20
+ },
21
+ "n_features": 90,
22
+ "label_classes": [
23
+ "dwell_idle",
24
+ "reconnaissance",
25
+ "initial_access",
26
+ "execution",
27
+ "persistence",
28
+ "privilege_escalation",
29
+ "lateral_movement",
30
+ "collection",
31
+ "exfiltration",
32
+ "impact"
33
+ ],
34
+ "class_distribution_train": {
35
+ "dwell_idle": 609,
36
+ "reconnaissance": 439,
37
+ "initial_access": 346,
38
+ "execution": 313,
39
+ "persistence": 275,
40
+ "privilege_escalation": 254,
41
+ "lateral_movement": 205,
42
+ "collection": 165,
43
+ "exfiltration": 117,
44
+ "impact": 99
45
+ },
46
+ "class_distribution_test": {
47
+ "dwell_idle": 141,
48
+ "reconnaissance": 112,
49
+ "initial_access": 106,
50
+ "persistence": 79,
51
+ "execution": 74,
52
+ "privilege_escalation": 68,
53
+ "lateral_movement": 54,
54
+ "collection": 40,
55
+ "exfiltration": 31,
56
+ "impact": 21
57
+ },
58
+ "leakage_excluded_features": [
59
+ "technique_id (62/63 techniques map 1:1 to a single phase)",
60
+ "technique_name (1:1 alias of technique_id)",
61
+ "tactic_category (direct alias of kill_chain_phase)"
62
+ ],
63
+ "models": {
64
+ "xgboost": {
65
+ "architecture": "Gradient-boosted decision trees, multi:softprob, 10 classes",
66
+ "framework": "xgboost",
67
+ "test_metrics": {
68
+ "model": "xgboost",
69
+ "accuracy": 0.46831955922865015,
70
+ "macro_f1": 0.42549880749552066,
71
+ "weighted_f1": 0.440668872633435,
72
+ "per_class_f1": {
73
+ "dwell_idle": 0.040268456375838924,
74
+ "reconnaissance": 0.7532467532467533,
75
+ "initial_access": 0.6467661691542289,
76
+ "execution": 0.4406779661016949,
77
+ "persistence": 0.41304347826086957,
78
+ "privilege_escalation": 0.5,
79
+ "lateral_movement": 0.7422680412371134,
80
+ "collection": 0.22018348623853212,
81
+ "exfiltration": 0.2727272727272727,
82
+ "impact": 0.22580645161290322
83
+ },
84
+ "confusion_matrix": {
85
+ "labels": [
86
+ "dwell_idle",
87
+ "reconnaissance",
88
+ "initial_access",
89
+ "execution",
90
+ "persistence",
91
+ "privilege_escalation",
92
+ "lateral_movement",
93
+ "collection",
94
+ "exfiltration",
95
+ "impact"
96
+ ],
97
+ "matrix": [
98
+ [
99
+ 3,
100
+ 23,
101
+ 23,
102
+ 18,
103
+ 21,
104
+ 18,
105
+ 2,
106
+ 17,
107
+ 9,
108
+ 7
109
+ ],
110
+ [
111
+ 2,
112
+ 87,
113
+ 2,
114
+ 21,
115
+ 0,
116
+ 0,
117
+ 0,
118
+ 0,
119
+ 0,
120
+ 0
121
+ ],
122
+ [
123
+ 1,
124
+ 5,
125
+ 65,
126
+ 5,
127
+ 3,
128
+ 26,
129
+ 1,
130
+ 0,
131
+ 0,
132
+ 0
133
+ ],
134
+ [
135
+ 2,
136
+ 4,
137
+ 1,
138
+ 39,
139
+ 24,
140
+ 3,
141
+ 1,
142
+ 0,
143
+ 0,
144
+ 0
145
+ ],
146
+ [
147
+ 0,
148
+ 0,
149
+ 1,
150
+ 12,
151
+ 38,
152
+ 9,
153
+ 0,
154
+ 18,
155
+ 1,
156
+ 0
157
+ ],
158
+ [
159
+ 0,
160
+ 0,
161
+ 3,
162
+ 8,
163
+ 4,
164
+ 44,
165
+ 3,
166
+ 5,
167
+ 1,
168
+ 0
169
+ ],
170
+ [
171
+ 0,
172
+ 0,
173
+ 0,
174
+ 0,
175
+ 6,
176
+ 6,
177
+ 36,
178
+ 2,
179
+ 0,
180
+ 4
181
+ ],
182
+ [
183
+ 0,
184
+ 0,
185
+ 0,
186
+ 0,
187
+ 2,
188
+ 1,
189
+ 0,
190
+ 12,
191
+ 15,
192
+ 10
193
+ ],
194
+ [
195
+ 0,
196
+ 0,
197
+ 0,
198
+ 0,
199
+ 5,
200
+ 0,
201
+ 0,
202
+ 4,
203
+ 9,
204
+ 13
205
+ ],
206
+ [
207
+ 0,
208
+ 0,
209
+ 0,
210
+ 0,
211
+ 2,
212
+ 1,
213
+ 0,
214
+ 11,
215
+ 0,
216
+ 7
217
+ ]
218
+ ]
219
+ },
220
+ "macro_roc_auc_ovr": 0.8598653258869782
221
+ }
222
+ },
223
+ "mlp": {
224
+ "architecture": "PyTorch MLP, 90 -> 128 -> 64 -> 10, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
225
+ "framework": "pytorch",
226
+ "test_metrics": {
227
+ "model": "mlp",
228
+ "accuracy": 0.44490358126721763,
229
+ "macro_f1": 0.3911186394257205,
230
+ "weighted_f1": 0.4172764238320775,
231
+ "per_class_f1": {
232
+ "dwell_idle": 0.013422818791946308,
233
+ "reconnaissance": 0.7250996015936255,
234
+ "initial_access": 0.6484018264840182,
235
+ "execution": 0.5100671140939598,
236
+ "persistence": 0.30120481927710846,
237
+ "privilege_escalation": 0.4880952380952381,
238
+ "lateral_movement": 0.782608695652174,
239
+ "collection": 0.19130434782608696,
240
+ "exfiltration": 0.11940298507462686,
241
+ "impact": 0.13157894736842105
242
+ },
243
+ "confusion_matrix": {
244
+ "labels": [
245
+ "dwell_idle",
246
+ "reconnaissance",
247
+ "initial_access",
248
+ "execution",
249
+ "persistence",
250
+ "privilege_escalation",
251
+ "lateral_movement",
252
+ "collection",
253
+ "exfiltration",
254
+ "impact"
255
+ ],
256
+ "matrix": [
257
+ [
258
+ 1,
259
+ 26,
260
+ 27,
261
+ 11,
262
+ 20,
263
+ 18,
264
+ 1,
265
+ 20,
266
+ 10,
267
+ 7
268
+ ],
269
+ [
270
+ 0,
271
+ 91,
272
+ 4,
273
+ 10,
274
+ 7,
275
+ 0,
276
+ 0,
277
+ 0,
278
+ 0,
279
+ 0
280
+ ],
281
+ [
282
+ 1,
283
+ 4,
284
+ 71,
285
+ 1,
286
+ 5,
287
+ 21,
288
+ 0,
289
+ 3,
290
+ 0,
291
+ 0
292
+ ],
293
+ [
294
+ 1,
295
+ 10,
296
+ 3,
297
+ 38,
298
+ 17,
299
+ 3,
300
+ 0,
301
+ 2,
302
+ 0,
303
+ 0
304
+ ],
305
+ [
306
+ 4,
307
+ 8,
308
+ 2,
309
+ 8,
310
+ 25,
311
+ 9,
312
+ 0,
313
+ 11,
314
+ 5,
315
+ 7
316
+ ],
317
+ [
318
+ 0,
319
+ 0,
320
+ 6,
321
+ 7,
322
+ 4,
323
+ 41,
324
+ 1,
325
+ 7,
326
+ 2,
327
+ 0
328
+ ],
329
+ [
330
+ 0,
331
+ 0,
332
+ 0,
333
+ 0,
334
+ 0,
335
+ 7,
336
+ 36,
337
+ 3,
338
+ 4,
339
+ 4
340
+ ],
341
+ [
342
+ 1,
343
+ 0,
344
+ 0,
345
+ 0,
346
+ 1,
347
+ 1,
348
+ 0,
349
+ 11,
350
+ 11,
351
+ 15
352
+ ],
353
+ [
354
+ 0,
355
+ 0,
356
+ 0,
357
+ 0,
358
+ 5,
359
+ 0,
360
+ 0,
361
+ 5,
362
+ 4,
363
+ 17
364
+ ],
365
+ [
366
+ 0,
367
+ 0,
368
+ 0,
369
+ 0,
370
+ 3,
371
+ 0,
372
+ 0,
373
+ 13,
374
+ 0,
375
+ 5
376
+ ]
377
+ ]
378
+ },
379
+ "macro_roc_auc_ovr": 0.8496117986303245
380
+ }
381
+ }
382
+ }
383
+ }
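The summary metrics in `validation_results.json` can be cross-checked against the stored confusion matrix. The sketch below is an illustration rather than the authors' evaluation code: it assumes rows are true labels and columns are predictions (the row sums match `class_distribution_test`, which supports that reading), and the function name `metrics_from_confusion` is hypothetical. The matrix is copied from the XGBoost entry above.

```python
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution",
    "persistence", "privilege_escalation", "lateral_movement",
    "collection", "exfiltration", "impact",
]

# XGBoost test confusion matrix from validation_results.json
# (assumed layout: rows = true phase, columns = predicted phase).
XGB_MATRIX = [
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def metrics_from_confusion(labels, matrix):
    """Recompute accuracy, per-class F1, and macro-F1 from a confusion matrix."""
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))

    per_class_f1 = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        support = sum(matrix[i])                  # events whose true phase is i
        predicted = sum(row[i] for row in matrix)  # events predicted as phase i
        denom = support + predicted
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        per_class_f1[label] = 2 * tp / denom if denom else 0.0

    return {
        "accuracy": correct / total,
        "macro_f1": sum(per_class_f1.values()) / n,
        "per_class_f1": per_class_f1,
    }


metrics = metrics_from_confusion(LABELS, XGB_MATRIX)
print(f"accuracy: {metrics['accuracy']:.4f}")
print(f"macro-F1: {metrics['macro_f1']:.4f}")
for label, f1 in metrics["per_class_f1"].items():
    print(f"  {label:22s} F1 = {f1:.4f}")
```

The same function applies to the MLP matrix. The near-zero `dwell_idle` F1 is visible directly in the first row, where the 141 true `dwell_idle` events are spread across almost every predicted phase.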