pradeep-xpert committed · Commit 16be928 (verified) · 1 Parent(s): 095013c

Initial release: XGBoost + MLP for phishing campaign-phase classification

README.md ADDED
@@ -0,0 +1,455 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ library_name: pytorch
4
+ tags:
5
+ - cybersecurity
6
+ - phishing
7
+ - email-security
8
+ - bec
9
+ - social-engineering
10
+ - tabular-classification
11
+ - synthetic-data
12
+ - xgboost
13
+ - baseline
14
+ pipeline_tag: tabular-classification
15
+ base_model: []
16
+ datasets:
17
+ - xpertsystems/cyb004-sample
18
+ metrics:
19
+ - accuracy
20
+ - f1
21
+ - roc_auc
22
+ model-index:
23
+ - name: cyb004-baseline-classifier
24
+ results:
25
+ - task:
26
+ type: tabular-classification
27
+ name: 7-class phishing campaign phase classification
28
+ dataset:
29
+ type: xpertsystems/cyb004-sample
30
+ name: CYB004 Synthetic Phishing Campaign Dataset (Sample)
31
+ metrics:
32
+ - type: roc_auc
33
+ value: 0.9356
34
+ name: Test macro ROC-AUC OvR (XGBoost, seed 42)
35
+ - type: accuracy
36
+ value: 0.6547
37
+ name: Test accuracy (XGBoost, seed 42)
38
+ - type: f1
39
+ value: 0.6401
40
+ name: Test macro-F1 (XGBoost, seed 42)
41
+ - type: accuracy
42
+ value: 0.649
43
+ name: Multi-seed accuracy mean ± 0.038 (XGBoost, 10 seeds)
44
+ - type: roc_auc
45
+ value: 0.937
46
+ name: Multi-seed ROC-AUC mean ± 0.010 (XGBoost, 10 seeds)
47
+ - type: roc_auc
48
+ value: 0.9265
49
+ name: Test macro ROC-AUC OvR (MLP, seed 42)
50
+ - type: accuracy
51
+ value: 0.6427
52
+ name: Test accuracy (MLP, seed 42)
53
+ - type: f1
54
+ value: 0.6275
55
+ name: Test macro-F1 (MLP, seed 42)
56
+ ---
57
+
58
+ # CYB004 Baseline Classifier
59
+
60
+ **Phishing campaign phase classifier trained on the CYB004 synthetic
61
+ phishing campaign sample. Predicts which of 7 lifecycle phases a
62
+ per-timestep telemetry record belongs to, from observable trajectory
63
+ and victim-topology features.**
64
+
65
+ > **Baseline reference, not for production use.** This model demonstrates
66
+ > that the [CYB004 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb004-sample)
67
+ > is learnable end-to-end and gives prospective buyers a working starting
68
+ > point. It is not a production email-security platform, SOAR component,
69
+ > or threat detector. See [Limitations](#limitations).
70
+
71
+ ## Model overview
72
+
73
+ | Property | Value |
74
+ |---|---|
75
+ | Task | 7-class campaign_phase classification |
76
+ | Training data | `xpertsystems/cyb004-sample` (3,952 timesteps across 100 phishing campaigns) |
77
+ | Models | XGBoost + PyTorch MLP |
78
+ | Input features | 53 (after one-hot encoding) |
79
+ | Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
80
+ | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
81
+ | License | CC-BY-NC-4.0 (matches dataset) |
82
+ | Status | Reference baseline |
83
+
84
+ ## Why this task instead of actor-tier attribution?
85
+
86
+ The CYB004 dataset README leads with "actor attribution modelling — 4-tier
87
+ classification" as a suggested use case. We piloted that target first and
88
+ found a serious issue: four features in the dataset
89
+ (`lure_personalisation_score`, `click_through_rate`,
90
+ `credential_submission_rate`, `target_department_id`) are **constant per
91
+ campaign**, not per-timestep. They look like per-step features but each
92
+ takes a single value across all ~40 timesteps of a given campaign.
93
+
94
+ Because these constants are tier-correlated (especially
95
+ `lure_personalisation_score`, which differs systematically across the
96
+ four actor tiers), they leak tier identity through the campaign-level
97
+ fingerprint they create. With a 15-campaign test fold, many test
98
+ campaigns land in the same feature ranges as training campaigns of the
99
+ same tier, and the model achieves spurious 97%+ accuracy that does not
100
+ generalize. Removing those features (the honest fix) drops tier
101
+ prediction to **accuracy 0.45 (ROC-AUC 0.70), below the majority-class
102
+ baseline accuracy of 0.59**. The full 335k-row CYB004 product, with
103
+ ~4,800 campaigns, will not have this constraint; the sample at n=100
104
+ cannot support honest tier learning.
105
+
106
+ We pivoted to **campaign_phase prediction**, which has 3,952 rows of
107
+ per-timestep data spread across 7 phases with tight timestep windows.
108
+ It learns cleanly under the same group-aware split: 65% accuracy,
109
+ ROC-AUC 0.94, stable across 10 seeds. This is a legitimate
110
+ email-security use case — SOAR playbooks and threat-hunting workflows
111
+ need to tag what phase of a phishing campaign observed activity
112
+ belongs to.
113
+
114
+ Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
115
+
116
+ - `model_xgb.json` — gradient-boosted trees, primary recommendation
117
+ - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
118
+
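One way to act on that disagreement signal is sketched below. It assumes both models return a 7-element probability vector in the class order used by the confusion-matrix labels in `ablation_results.json`; the `triage` helper and its `min_confidence` threshold are hypothetical and not part of this release.

```python
from typing import Sequence

# Class order assumed to match the label order in ablation_results.json.
PHASES = [
    "target_reconnaissance", "infrastructure_setup", "lure_crafting",
    "email_delivery", "victim_engagement", "credential_harvesting",
    "post_compromise_escalation",
]

def triage(p_xgb: Sequence[float], p_mlp: Sequence[float],
           min_confidence: float = 0.5) -> dict:
    """Compare two probability vectors and flag records for human review."""
    top_xgb = max(range(len(p_xgb)), key=p_xgb.__getitem__)
    top_mlp = max(range(len(p_mlp)), key=p_mlp.__getitem__)
    agree = top_xgb == top_mlp
    # Review when the models disagree or the XGBoost top class is weak.
    needs_review = (not agree) or p_xgb[top_xgb] < min_confidence
    return {
        "xgb": PHASES[top_xgb],
        "mlp": PHASES[top_mlp],
        "agree": agree,
        "needs_review": needs_review,
    }

result = triage(
    [0.1, 0.1, 0.1, 0.6, 0.1, 0.0, 0.0],
    [0.05, 0.05, 0.1, 0.2, 0.6, 0.0, 0.0],
)
```

Here the two vectors pick different phases, so the record is flagged even though XGBoost alone clears the confidence threshold.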
119
+ ## Quick start
120
+
121
+ ```bash
122
+ pip install xgboost torch safetensors pandas huggingface_hub
123
+ ```
124
+
125
+ ```python
126
+ from huggingface_hub import hf_hub_download
127
+ import json, numpy as np, torch, xgboost as xgb
128
+ from safetensors.torch import load_file
129
+
130
+ REPO = "xpertsystems/cyb004-baseline-classifier"
131
+
132
+ paths = {n: hf_hub_download(REPO, n) for n in [
133
+     "model_xgb.json", "model_mlp.safetensors",
134
+     "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
135
+ ]}
136
+
137
+ import sys, os
138
+ sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
139
+ from feature_engineering import (
140
+     transform_single, load_meta, INT_TO_LABEL, build_department_lookup
141
+ )
142
+
143
+ meta = load_meta(paths["feature_meta.json"])
144
+ xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
145
+ dept_lookup = build_department_lookup("path/to/victim_topology.csv")
146
+
147
+ # Predict for my_record, your own mapping of raw per-timestep columns (full pattern in inference_example.ipynb)
148
+ dept_aggs = dept_lookup.get(my_record["target_department_id"], {})
149
+ X = transform_single(my_record, meta, victim_aggregates=dept_aggs)
150
+ proba = xgb_model.predict_proba(X)[0]
151
+ print(INT_TO_LABEL[int(np.argmax(proba))])
152
+ ```
153
+
154
+ See [`inference_example.ipynb`](./inference_example.ipynb) for the full
155
+ copy-paste demo.
156
+
157
+ ## Training data
158
+
159
+ Trained on the public sample of CYB004, 3,952 per-timestep trajectory
160
+ rows from 100 phishing campaigns (~40 timesteps per campaign):
161
+
162
+ | Phase | Total rows | Test rows (seed 42) |
163
+ |---|---:|---:|
164
+ | `email_delivery` | 919 | 134 |
165
+ | `victim_engagement` | 667 | 102 |
166
+ | `target_reconnaissance` | 558 | 89 |
167
+ | `post_compromise_escalation` | 533 | 50 |
168
+ | `credential_harvesting` | 494 | 91 |
169
+ | `lure_crafting` | 435 | 71 |
170
+ | `infrastructure_setup` | 346 | 48 |
171
+
172
+ ### Group-aware split
173
+
174
+ A single campaign generates ~40 highly-correlated timesteps. Random
175
+ row-level splitting would put timesteps from the same campaign in both
176
+ train and test, inflating metrics in a way that does not generalize to
177
+ new campaigns.
178
+
179
+ This release uses **GroupShuffleSplit by `campaign_id`** (nested,
180
+ 70/15/15):
181
+
182
+ | Fold | Campaigns | Timesteps |
183
+ |---|---:|---:|
184
+ | Train | 69 | 2,792 |
185
+ | Validation | 16 | 575 |
186
+ | Test | 15 | 585 |
187
+
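The idea behind the split can be sketched with the standard library alone. This is a stand-in for illustration, not scikit-learn's `GroupShuffleSplit` and not the private training script; the `group_split` helper and the toy campaign IDs are assumptions, and its fold sizes (70/15/15 campaigns) will not reproduce the 69/16/15 folds in the table above.

```python
import random
from collections import defaultdict

def group_split(groups, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices into train/val/test so no group spans two folds."""
    by_group = defaultdict(list)
    for row_idx, g in enumerate(groups):
        by_group[g].append(row_idx)
    ids = sorted(by_group)
    random.Random(seed).shuffle(ids)  # shuffle campaigns, not rows
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    folds = (ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:])
    return [[i for g in fold for i in by_group[g]] for fold in folds]

# Toy data: 100 campaigns with exactly 40 timesteps each.
groups = [f"C{c:03d}" for c in range(100) for _ in range(40)]
train, val, test = group_split(groups)
```

Because whole campaigns are assigned to a fold, every timestep of a test campaign is unseen during training, which is the property the paragraph above relies on.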
188
+ All test campaigns are completely unseen during training. Class imbalance
189
+ is addressed with balanced class weights (`class_weight='balanced'`),
190
+ applied as per-row `sample_weight` for XGBoost and as weighted cross-entropy for the MLP.
191
+
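As a rough illustration of what "balanced" weighting means, the sketch below applies the usual heuristic `weight_c = n_samples / (n_classes * n_c)` to the whole-sample phase counts from the table above. The released models would have used training-fold counts, so these exact numbers are an assumption for illustration only.

```python
# Whole-sample row counts per phase, copied from the training-data table.
phase_counts = {
    "email_delivery": 919, "victim_engagement": 667,
    "target_reconnaissance": 558, "post_compromise_escalation": 533,
    "credential_harvesting": 494, "lure_crafting": 435,
    "infrastructure_setup": 346,
}

def balanced_weights(counts):
    """weight_c = n_samples / (n_classes * n_c), the usual 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

weights = balanced_weights(phase_counts)
for phase, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{phase:28s} {w:.3f}")
```

With these counts the most common phase (`email_delivery`) is down-weighted to roughly 0.61 and the rarest (`infrastructure_setup`) is up-weighted to roughly 1.63, so each class contributes equally to the total weight.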
192
+ ## Feature pipeline
193
+
194
+ The bundled `feature_engineering.py` is the canonical feature recipe.
195
+ 53 features survive after encoding, drawn from:
196
+
197
+ - **Per-timestep numeric** (7): `timestep`, `emails_sent_cumulative`, `click_through_rate`, `credential_submission_rate`, `gateway_detection_score`, `lure_personalisation_score`, `target_department_id`
198
+ - **Per-timestep categorical** (2, one-hot): `evasion_technique_active`, `actor_capability_tier`
199
+ - **Victim topology numeric** (5): `employee_count`, `privileged_account_density`, `mfa_enrollment_rate`, `click_susceptibility_base`, `email_volume_daily`
200
+ - **Victim topology categorical** (5, one-hot): `department_type`, `industry_sector`, `awareness_training_level`, `gateway_architecture`, `dmarc_enforcement_level`
201
+ - **Engineered** (6): `log_emails_sent`, `is_gateway_blocked_step`, `is_evasion_active`, `is_high_personalisation`, `has_credential_capture`, `has_user_engagement`
202
+
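One-hot encoding against a fixed level list is what keeps the column order stable between training and inference. The sketch below is an assumption about the general pattern, not the code in `feature_engineering.py`; the `one_hot` helper and the `name=level` column naming are hypothetical, and the level order is chosen for illustration (the released order lives in `feature_meta.json`).

```python
def one_hot(value, levels, prefix):
    """Encode one categorical value against a fixed, ordered level list."""
    # Values outside the level list map to an all-zero vector instead of raising.
    return {f"{prefix}={lvl}": int(value == lvl) for lvl in levels}

# Levels observed in the sample for awareness_training_level.
AWARENESS_LEVELS = ["none", "basic", "annual", "quarterly", "continuous"]

row = one_hot("quarterly", AWARENESS_LEVELS, "awareness_training_level")
unseen = one_hot("monthly", AWARENESS_LEVELS, "awareness_training_level")
```

Mapping an unseen level to all zeros is one design choice among several; raising an error would be a stricter alternative.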
203
+ ### Leakage audit
204
+
205
+ **One column dropped:** `delivery_outcome` (7-class categorical). Its
206
+ crosstab with `campaign_phase` shows that `no_delivery` appears in only
207
+ five phases (`target_reconnaissance`, `infrastructure_setup`,
208
+ `lure_crafting`, `credential_harvesting`, `post_compromise_escalation`)
209
+ and never in `email_delivery` or `victim_engagement`. Its cell purity is
210
+ 0.36 against a uniform baseline of 0.14, so keeping it would give the model
211
+ a near-oracle for separating those two phases from the other five.
212
+
213
+ **No oracle features remain.** All retained features have phase-purity
214
+ under 0.20.
215
+
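This README does not define "cell purity". One plausible reading, sketched below as an assumption, is the row-weighted share of rows that fall in the majority phase of their feature value; under that reading a feature unrelated to a balanced 7-class target scores about 1/7 ≈ 0.14, which matches the stated uniform baseline. The `purity` helper and the toy values are hypothetical.

```python
from collections import Counter, defaultdict

def purity(feature_values, labels):
    """Row-weighted mean, over feature values, of the majority-label share."""
    cells = defaultdict(Counter)
    for v, y in zip(feature_values, labels):
        cells[v][y] += 1
    total = sum(sum(c.values()) for c in cells.values())
    majority = sum(max(c.values()) for c in cells.values())
    return majority / total

# Made-up toy data: value "b" always co-occurs with label "y".
score = purity(["a", "a", "b", "b"], ["x", "y", "y", "y"])
```

A score near 1.0 would mean the feature value almost determines the phase, which is the oracle behaviour the audit is screening for.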
216
+ ### Per-campaign-constant features
217
+
218
+ Four features (`lure_personalisation_score`, `click_through_rate`,
219
+ `credential_submission_rate`, `target_department_id`) are constant
220
+ within each campaign. For **phase prediction** this is acceptable —
221
+ their phase-purity is low, so the model uses them as conditioning
222
+ context (similar to "we know this is an APT campaign targeting finance"
223
+ when reasoning about which phase we're in), not as oracle features.
224
+ They became a problem only for the abandoned actor-tier task.
225
+
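A check for per-campaign-constant columns can be sketched as follows. The `constant_within_group` helper and the toy rows are illustrative assumptions; only the column names come from the dataset.

```python
from collections import defaultdict

def constant_within_group(rows, group_key, columns):
    """Return the columns that take exactly one value inside every group."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for col in columns:
            seen[col][row[group_key]].add(row[col])
    return [c for c in columns if all(len(v) == 1 for v in seen[c].values())]

# Toy rows (made-up values): two campaigns with two timesteps each.
rows = [
    {"campaign_id": "A", "timestep": 0, "lure_personalisation_score": 0.8},
    {"campaign_id": "A", "timestep": 1, "lure_personalisation_score": 0.8},
    {"campaign_id": "B", "timestep": 0, "lure_personalisation_score": 0.3},
    {"campaign_id": "B", "timestep": 1, "lure_personalisation_score": 0.3},
]
constant_cols = constant_within_group(
    rows, "campaign_id", ["timestep", "lure_personalisation_score"]
)
```

Running a check like this before choosing a target is a cheap way to spot columns that act as a campaign fingerprint.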
226
+ ## Evaluation
227
+
228
+ ### Test-set metrics, seed 42 (n = 585 timesteps from 15 disjoint campaigns)
229
+
230
+ **XGBoost** (the published `model_xgb.json` artifact)
231
+
232
+ | Metric | Value |
233
+ |---|---:|
234
+ | Macro ROC-AUC (OvR) | **0.9356** |
235
+ | Accuracy | **0.6547** |
236
+ | Macro-F1 | 0.6401 |
237
+ | Weighted-F1 | 0.6572 |
238
+
239
+ **MLP** (the published `model_mlp.safetensors` artifact)
240
+
241
+ | Metric | Value |
242
+ |---|---:|
243
+ | Macro ROC-AUC (OvR) | 0.9265 |
244
+ | Accuracy | 0.6427 |
245
+ | Macro-F1 | 0.6275 |
246
+ | Weighted-F1 | 0.6492 |
247
+
248
+ ### Multi-seed robustness (XGBoost, 10 seeds)
249
+
250
+ Accuracy varies by about ±4 pp across seeds while ROC-AUC stays within 0.92–0.95, so the result is not an artifact of one lucky split:
251
+
252
+ | Metric | Mean | Std | Min | Max |
253
+ |---|---:|---:|---:|---:|
254
+ | Accuracy | 0.649 | 0.038 | 0.592 | 0.711 |
255
+ | Macro-F1 | 0.638 | 0.040 | 0.574 | 0.714 |
256
+ | Macro ROC-AUC OvR | 0.937 | 0.010 | 0.923 | 0.954 |
257
+
258
+ Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
259
+ All 10 seeds yielded all 7 classes in the test fold.
260
+
261
+ ### Per-class F1 (seed 42) — where the signal is and isn't
262
+
263
+ | Phase | XGBoost F1 | MLP F1 | Note |
264
+ |---|---:|---:|---|
265
+ | `target_reconnaissance` | **0.888** | 0.831 | Tight early window (timesteps 0-7) |
266
+ | `email_delivery` | **0.791** | 0.761 | Tight window (8-30); gateway signals + email volume |
267
+ | `infrastructure_setup` | **0.712** | 0.702 | Tight window (5-18) |
268
+ | `lure_crafting` | **0.676** | 0.561 | Tight window (3-13) |
269
+ | `post_compromise_escalation` | 0.604 | 0.717 | Late window (22-52) |
270
+ | `victim_engagement` | 0.469 | 0.387 | Mid window (14-38), overlaps with adjacent phases |
271
+ | `credential_harvesting` | 0.341 | 0.434 | Mid-late (19-45), similar features to victim_engagement |
272
+
273
+ Four early phases (target_reconnaissance, infrastructure_setup,
274
+ lure_crafting, email_delivery) classify comparatively well because they
275
+ sit in tight, only partly overlapping timestep windows with distinctive
276
+ features. Three later phases (victim_engagement, credential_harvesting,
277
+ post_compromise_escalation) overlap substantially in timestep range
278
+ (14-38, 19-45, 22-52) and share similar behavioural footprints
279
+ (non-zero click/credential rates, deployed evasion); these are
280
+ genuinely harder for a flat-tabular model. Sequence models with
281
+ campaign-level context would help here.
282
+
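The seed-42 XGBoost figures can be re-derived from the confusion matrix published in `ablation_results.json`. The sketch below assumes rows are true labels and columns are predictions (the file does not say which); per-class F1 and accuracy come out the same under either orientation, because F1 depends only on the sum of the row and column totals.

```python
LABELS = [
    "target_reconnaissance", "infrastructure_setup", "lure_crafting",
    "email_delivery", "victim_engagement", "credential_harvesting",
    "post_compromise_escalation",
]
# full_model_metrics.confusion_matrix.matrix from ablation_results.json
CM = [
    [75, 0, 9, 0, 0, 0, 0],
    [0, 37, 16, 0, 0, 0, 0],
    [10, 10, 47, 0, 0, 0, 0],
    [0, 4, 0, 110, 28, 1, 0],
    [0, 0, 0, 21, 46, 24, 9],
    [0, 0, 0, 4, 16, 23, 20],
    [0, 0, 0, 0, 6, 24, 45],
]

def per_class_f1(cm):
    """F1_k = 2*TP_k / (row_sum_k + col_sum_k)."""
    scores = []
    for k in range(len(cm)):
        row_sum = sum(cm[k])
        col_sum = sum(r[k] for r in cm)
        denom = row_sum + col_sum
        scores.append(2 * cm[k][k] / denom if denom else 0.0)
    return scores

f1 = per_class_f1(CM)
accuracy = sum(CM[k][k] for k in range(len(CM))) / sum(map(sum, CM))
macro_f1 = sum(f1) / len(f1)
for name, score in zip(LABELS, f1):
    print(f"{name:28s} {score:.3f}")
```

The matrix also makes the mid-to-late confusion concrete: most `credential_harvesting` errors land in the two neighbouring phases.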
283
+ ### Ablation: which feature groups matter
284
+
285
+ | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
286
+ |---|---:|---:|---:|---:|
287
+ | Full feature set (published) | 0.6547 | 0.6401 | 0.9356 | — |
288
+ | No `timestep` | 0.3624 | 0.3139 | 0.8128 | **−0.2923** |
289
+ | No behavioural features | 0.5795 | 0.5735 | 0.9188 | −0.0752 |
290
+ | No topology features | 0.6410 | 0.6260 | 0.9342 | −0.0137 |
291
+ | No engineered features | 0.6581 | 0.6402 | 0.9370 | +0.0034 |
292
+
293
+ Three findings:
294
+
295
+ 1. **`timestep` is by far the dominant feature** (drops 29 pp when
296
+ removed, ROC-AUC still 0.81). Phishing campaigns progress through
297
+ phases over time; where you are in the campaign timeline carries
298
+ most of the phase signal.
299
+ 2. **Behavioural features contribute ~8 pp accuracy.** These are the
300
+ per-timestep observables (emails sent, gateway score, click rate,
301
+ evasion technique).
302
+ 3. **Topology contributes ~1 pp; engineered features add nothing
303
+ measurable** (accuracy moves +0.3 pp without them). Trees recover the
304
+ engineered features on their own; topology provides modest context.
305
+
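The Δ accuracy column is plain subtraction, sketched here in percentage points from the rounded table values. Note the sign convention: the table reports ablated minus full, whereas the `delta_accuracy` fields in `ablation_results.json` use full minus ablated, so the signs are flipped between the two. The `delta_pp` helper is illustrative.

```python
FULL_ACCURACY = 0.6547  # full feature set, seed 42

ablated_accuracy = {
    "no_timestep": 0.3624,
    "no_behavioural": 0.5795,
    "no_topology": 0.6410,
    "no_engineered": 0.6581,
}

def delta_pp(ablated, full=FULL_ACCURACY):
    """Change in accuracy, in percentage points, when a group is removed."""
    return round((ablated - full) * 100, 2)

deltas = {name: delta_pp(acc) for name, acc in ablated_accuracy.items()}
for name, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {d:+.2f} pp")
```

A positive delta, as for `no_engineered`, means the model did slightly better without that group on this split.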
306
+ ### Architecture
307
+
308
+ **XGBoost:** multi-class gradient boosting (`multi:softprob`, 7 classes),
309
+ `hist` tree method, class-balanced sample weights, early stopping on
310
+ validation mlogloss.
311
+
312
+ **MLP:** `53 → 128 → 64 → 7`, each hidden layer followed by `BatchNorm1d`
313
+ → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
314
+ early stopping on validation macro-F1.
315
+
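For a sense of scale, the learnable parameter count of the `53 → 128 → 64 → 7` MLP can be worked out by hand. The sketch assumes linear layers with biases and affine batch norm after each hidden layer (the PyTorch defaults); the release does not state these settings, and batch-norm running statistics are not counted.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch-norm layer."""
    return 2 * n_features

layer_sizes = [53, 128, 64, 7]
total = 0
for i, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
    total += linear_params(n_in, n_out)
    is_hidden = i < len(layer_sizes) - 2  # the output layer has no batch norm
    if is_hidden:
        total += batchnorm_params(n_out)

print(total)
```

Under these assumptions the network has about 16k learnable parameters, which is small relative to the ~2.8k training timesteps only because of the dropout and early stopping described above.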
316
+ Training hyperparameters (learning rate, batch size, n_estimators,
317
+ early-stopping patience, weight decay, class-weighting strategy) are
318
+ held internally by XpertSystems and are not part of this release.
319
+
320
+ ## Limitations
321
+
322
+ **This is a baseline reference, not a production email-security system.**
323
+
324
+ 1. **Mid- and late-phase confusion.** Per-class F1 for
325
+ `victim_engagement`, `credential_harvesting`, and
326
+ `post_compromise_escalation` is 0.34–0.60 (XGBoost). These phases overlap in
327
+ timestep range and share similar behavioural signatures. Sequence
328
+ models that consider campaign-level context would help substantially.
329
+
330
+ 2. **The pivot away from actor-tier classification is dataset-limited,
331
+ not method-limited.** With 100 campaigns and 4 tiers (some with only
332
+ 10 campaigns total), tier classification is below majority baseline
333
+ once leakage-prone features are removed. The full 335k-row CYB004
334
+ product provides ~4,800 campaigns; the sample does not.
335
+
336
+ 3. **Synthetic-vs-real transfer.** The dataset is synthetic and
337
+ calibrated to email-security and threat-intelligence benchmark
338
+ targets (Proofpoint State of the Phish, KnowBe4 Industry Benchmark,
339
+ Cofense PIQ, Mandiant M-Trends, FBI IC3 BEC Report, Verizon DBIR,
340
+ CISA, APWG). Real phishing telemetry has different noise
341
+ characteristics, adversary adaptation, and instrumentation gaps. Do
342
+ not assume metrics transfer.
343
+
344
+ 4. **Adversarial robustness not evaluated.** The dataset is not
345
+ adversarially generated; the model has not been red-teamed against
346
+ evasive lures or novel infrastructure.
347
+
348
+ 5. **MLP brittleness on OOD inputs.** With ~2.8k training timesteps,
349
+ the MLP can produce confidently-wrong predictions on hand-crafted
350
+ records far from the training manifold. XGBoost is more robust.
351
+ Use both; treat disagreement as a signal for human review.
352
+
353
+ 6. **`timestep` dominance is a property of the dataset.** Real
354
+ phishing telemetry doesn't carry a clean per-campaign normalized
355
+ timestep — that's a simulator artifact. A buyer transferring this
356
+ baseline to real campaign telemetry would need to recover an
357
+ equivalent temporal-position feature (e.g. hours since campaign
358
+ first observation, position in stage-detection pipeline).
359
+
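For limitation 6, one possible substitute for `timestep` on real telemetry is elapsed time since a campaign was first observed. The sketch below is an assumption about how that could be derived; the `hours_since_first_seen` helper, the event format, and the sample timestamps are all made up for illustration.

```python
from datetime import datetime

def hours_since_first_seen(events):
    """Map (campaign_id, ISO timestamp) pairs to hours since that campaign's first event."""
    parsed = [(cid, datetime.fromisoformat(ts)) for cid, ts in events]
    first = {}
    for cid, t in parsed:
        first[cid] = min(t, first.get(cid, t))
    return [(t - first[cid]).total_seconds() / 3600.0 for cid, t in parsed]

# Made-up events for illustration only; order need not be chronological.
events = [
    ("camp-1", "2026-01-05T08:00:00"),
    ("camp-1", "2026-01-05T20:30:00"),
    ("camp-2", "2026-01-06T09:00:00"),
    ("camp-1", "2026-01-06T08:00:00"),
]
hours = hours_since_first_seen(events)
```

Such a feature would still need rescaling to resemble the simulator's timestep range, and campaign boundaries in real data are themselves an inference.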
360
+ ## Notes on dataset schema
361
+
362
+ The CYB004 sample dataset README describes some fields differently from
363
+ the actual schema. The model was trained on the actual schema; this note
364
+ helps buyers reconcile what they read with what they receive.
365
+
366
+ | What the dataset README says | What the data actually contains |
367
+ |---|---|
368
+ | "9 campaign phases" (reconnaissance, infrastructure_setup, lure_creation, send_wave, gateway_evaluation, user_interaction, credential_capture, lateral_pivot, exfiltration) | 7 phases with different names: target_reconnaissance, infrastructure_setup, lure_crafting, email_delivery, victim_engagement, credential_harvesting, post_compromise_escalation |
369
+ | 4 actor tiers: `opportunistic`, `organized_crime`, `targeted`, `nation_state_apt` | 4 tiers: `opportunistic`, `cybercriminal_gang`, `initial_access_broker`, `nation_state_apt` |
370
+ | 8 department types listed | 4 department types: `executive_leadership`, `finance_accounts_payable`, `human_resources`, `information_technology` |
371
+ | 4 gateway architectures | 8 gateway architectures including `ai_sender_reputation`, `integrated_cloud_defender`, `zero_trust_email_proxy` |
372
+ | Awareness training: none, annual, semi-annual, quarterly, monthly | annual, none, continuous, basic, quarterly (no semi-annual or monthly) |
373
+ | Per-timestep fields: `send_volume`, `gateway_blocked`, `emails_delivered`, `user_report_count`, `mfa_bypass_attempted`, `bec_attempt`, `lateral_pivot_attempted`, `operational_stealth_score`, `dmarc_enforcement_active` | None of these exist per-timestep. The actual per-timestep columns are: `emails_sent_cumulative`, `gateway_detection_score`, `delivery_outcome`, `lure_personalisation_score`, `evasion_technique_active`. BEC / MFA bypass / lateral phishing flags exist only at the campaign-summary level. |
374
+
375
+ None of these discrepancies affects model correctness — the feature
376
+ pipeline uses the actual column names. If you build your own pipeline
377
+ against the dataset, use the actual columns.
378
+
379
+ ## Intended use
380
+
381
+ - **Evaluating fit** of the CYB004 dataset for your email-security
382
+ or threat-hunting research
383
+ - **Baseline reference** for new model architectures (especially
384
+ sequence models, which should beat this baseline on the overlapping
385
+ mid-late phases)
386
+ - **Teaching and demo** for tabular classification on phishing
387
+ campaign telemetry
388
+ - **Feature engineering reference** for per-timestep campaign data
389
+
390
+ ## Out-of-scope use
391
+
392
+ - Production email security on real campaign telemetry
393
+ - Threat hunting / SOAR playbooks on real systems
394
+ - Actor attribution (this baseline does not address that task; see why above)
395
+ - Adversarial-evasion evaluation (dataset not adversarially generated)
396
+ - Any operational security decision
397
+
398
+ ## Reproducibility
399
+
400
+ Outputs above were produced with `seed = 42` (published artifact),
401
+ group-aware nested `GroupShuffleSplit` (70/15/15 by campaign_id), on the
402
+ published sample (`xpertsystems/cyb004-sample`, version 1.0.0, generated
403
+ 2026-05-16). The feature pipeline in `feature_engineering.py` is
404
+ deterministic and the trained weights in this repo correspond exactly
405
+ to the metrics above.
406
+
407
+ Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
408
+ `multi_seed_results.json` confirm robust performance across splits.
409
+
410
+ The training script itself is private to XpertSystems.
411
+
412
+ ## Files in this repo
413
+
414
+ | File | Purpose |
415
+ |---|---|
416
+ | `model_xgb.json` | XGBoost weights (seed 42) |
417
+ | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
418
+ | `feature_engineering.py` | Feature pipeline (load → join topology → engineer → encode) |
419
+ | `feature_meta.json` | Feature column order + categorical levels |
420
+ | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
421
+ | `validation_results.json` | Per-class metrics, confusion matrix, architecture |
422
+ | `ablation_results.json` | Per-feature-group ablation |
423
+ | `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
424
+ | `inference_example.ipynb` | End-to-end inference demo notebook |
425
+ | `README.md` | This file |
426
+
427
+ ## Contact and full product
428
+
429
+ The full **CYB004** dataset contains ~335,000 rows across four files,
430
+ with calibrated benchmark validation against 12 metrics from email
431
+ security and threat intelligence sources (Proofpoint, KnowBe4,
432
+ Cofense, Mandiant, FBI IC3, Verizon, CISA, APWG). The full
433
+ XpertSystems.ai synthetic data catalogue spans 41 SKUs across
434
+ Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
435
+ & Energy.
436
+
437
+ - 📧 **pradeep@xpertsystems.ai**
438
+ - 🌐 **https://xpertsystems.ai**
439
+ - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb004-sample
440
+ - 🤖 Companion models:
441
+ - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
442
+ - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
443
+ - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
444
+
445
+ ## Citation
446
+
447
+ ```bibtex
448
+ @misc{xpertsystems_cyb004_baseline_2026,
449
+ title = {CYB004 Baseline Classifier: XGBoost and MLP for Phishing Campaign Phase Classification},
450
+ author = {XpertSystems.ai},
451
+ year = {2026},
452
+ url = {https://huggingface.co/xpertsystems/cyb004-baseline-classifier},
453
+ note = {Baseline reference model trained on xpertsystems/cyb004-sample}
454
+ }
455
+ ```
ablation_results.json ADDED
@@ -0,0 +1,489 @@
1
+ {
2
+ "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
3
+ "full_model_metrics": {
4
+ "model": "xgboost",
5
+ "accuracy": 0.6547008547008547,
6
+ "macro_f1": 0.6401276666852063,
7
+ "weighted_f1": 0.657179533714298,
8
+ "per_class_f1": {
9
+ "target_reconnaissance": 0.8875739644970414,
10
+ "infrastructure_setup": 0.7115384615384616,
11
+ "lure_crafting": 0.6762589928057554,
12
+ "email_delivery": 0.7913669064748201,
13
+ "victim_engagement": 0.46938775510204084,
14
+ "credential_harvesting": 0.34074074074074073,
15
+ "post_compromise_escalation": 0.6040268456375839
16
+ },
17
+ "confusion_matrix": {
18
+ "labels": [
19
+ "target_reconnaissance",
20
+ "infrastructure_setup",
21
+ "lure_crafting",
22
+ "email_delivery",
23
+ "victim_engagement",
24
+ "credential_harvesting",
25
+ "post_compromise_escalation"
26
+ ],
27
+ "matrix": [
28
+ [
29
+ 75,
30
+ 0,
31
+ 9,
32
+ 0,
33
+ 0,
34
+ 0,
35
+ 0
36
+ ],
37
+ [
38
+ 0,
39
+ 37,
40
+ 16,
41
+ 0,
42
+ 0,
43
+ 0,
44
+ 0
45
+ ],
46
+ [
47
+ 10,
48
+ 10,
49
+ 47,
50
+ 0,
51
+ 0,
52
+ 0,
53
+ 0
54
+ ],
55
+ [
56
+ 0,
57
+ 4,
58
+ 0,
59
+ 110,
60
+ 28,
61
+ 1,
62
+ 0
63
+ ],
64
+ [
65
+ 0,
66
+ 0,
67
+ 0,
68
+ 21,
69
+ 46,
70
+ 24,
71
+ 9
72
+ ],
73
+ [
74
+ 0,
75
+ 0,
76
+ 0,
77
+ 4,
78
+ 16,
79
+ 23,
80
+ 20
81
+ ],
82
+ [
83
+ 0,
84
+ 0,
85
+ 0,
86
+ 0,
87
+ 6,
88
+ 24,
89
+ 45
90
+ ]
91
+ ]
92
+ },
93
+ "macro_roc_auc_ovr": 0.935584434710217
94
+ },
95
+ "ablations": {
96
+ "no_topology": {
97
+ "n_features": 23,
98
+ "dropped_count": 30,
99
+ "metrics": {
100
+ "model": "xgboost_no_topology",
101
+ "accuracy": 0.6410256410256411,
102
+ "macro_f1": 0.626013906528604,
103
+ "weighted_f1": 0.6377089952999916,
104
+ "per_class_f1": {
105
+ "target_reconnaissance": 0.891566265060241,
106
+ "infrastructure_setup": 0.7586206896551724,
107
+ "lure_crafting": 0.676923076923077,
108
+ "email_delivery": 0.7598566308243727,
109
+ "victim_engagement": 0.40609137055837563,
110
+ "credential_harvesting": 0.2782608695652174,
111
+ "post_compromise_escalation": 0.6107784431137725
112
+ },
113
+ "confusion_matrix": {
114
+ "labels": [
115
+ "target_reconnaissance",
116
+ "infrastructure_setup",
117
+ "lure_crafting",
118
+ "email_delivery",
119
+ "victim_engagement",
120
+ "credential_harvesting",
121
+ "post_compromise_escalation"
122
+ ],
123
+ "matrix": [
124
+ [
125
+ 74,
126
+ 0,
127
+ 10,
128
+ 0,
129
+ 0,
130
+ 0,
131
+ 0
132
+ ],
133
+ [
134
+ 0,
135
+ 44,
136
+ 9,
137
+ 0,
138
+ 0,
139
+ 0,
140
+ 0
141
+ ],
142
+ [
143
+ 8,
144
+ 15,
145
+ 44,
146
+ 0,
147
+ 0,
148
+ 0,
149
+ 0
150
+ ],
151
+ [
152
+ 0,
153
+ 4,
154
+ 0,
155
+ 106,
156
+ 30,
157
+ 3,
158
+ 0
159
+ ],
160
+ [
161
+ 0,
162
+ 0,
163
+ 0,
164
+ 26,
165
+ 40,
166
+ 16,
167
+ 18
168
+ ],
169
+ [
170
+ 0,
171
+ 0,
172
+ 0,
173
+ 4,
174
+ 20,
175
+ 16,
176
+ 23
177
+ ],
178
+ [
179
+ 0,
180
+ 0,
181
+ 0,
182
+ 0,
183
+ 7,
184
+ 17,
185
+ 51
186
+ ]
187
+ ]
188
+ },
189
+ "macro_roc_auc_ovr": 0.9341744835062434
190
+ },
191
+ "delta_accuracy": 0.013675213675213627,
192
+ "delta_macro_f1": 0.014113760156602262
193
+ },
194
+ "no_behavioural": {
195
+ "n_features": 36,
196
+ "dropped_count": 17,
197
+ "metrics": {
198
+ "model": "xgboost_no_behavioural",
199
+ "accuracy": 0.5794871794871795,
200
+ "macro_f1": 0.5734830391013238,
201
+ "weighted_f1": 0.5833619015067782,
202
+ "per_class_f1": {
203
+ "target_reconnaissance": 0.9024390243902439,
204
+ "infrastructure_setup": 0.4745762711864407,
205
+ "lure_crafting": 0.6619718309859155,
206
+ "email_delivery": 0.6390977443609023,
207
+ "victim_engagement": 0.3404255319148936,
208
+ "credential_harvesting": 0.3472222222222222,
209
+ "post_compromise_escalation": 0.6486486486486487
210
+ },
211
+ "confusion_matrix": {
212
+ "labels": [
213
+ "target_reconnaissance",
214
+ "infrastructure_setup",
215
+ "lure_crafting",
216
+ "email_delivery",
217
+ "victim_engagement",
218
+ "credential_harvesting",
219
+ "post_compromise_escalation"
220
+ ],
221
+ "matrix": [
222
+ [
223
+ 74,
224
+ 0,
225
+ 10,
226
+ 0,
227
+ 0,
228
+ 0,
229
+ 0
230
+ ],
231
+ [
232
+ 0,
233
+ 28,
234
+ 16,
235
+ 9,
236
+ 0,
237
+ 0,
238
+ 0
239
+ ],
240
+ [
241
+ 6,
242
+ 13,
243
+ 47,
244
+ 1,
245
+ 0,
246
+ 0,
247
+ 0
248
+ ],
249
+ [
250
+ 0,
251
+ 23,
252
+ 2,
253
+ 85,
254
+ 30,
255
+ 3,
256
+ 0
257
+ ],
258
+ [
259
+ 0,
260
+ 1,
261
+ 0,
262
+ 26,
263
+ 32,
264
+ 34,
265
+ 7
266
+ ],
267
+ [
268
+ 0,
269
+ 0,
270
+ 0,
271
+ 2,
272
+ 18,
273
+ 25,
274
+ 18
275
+ ],
276
+ [
277
+ 0,
278
+ 0,
279
+ 0,
280
+ 0,
281
+ 8,
282
+ 19,
283
+ 48
284
+ ]
285
+ ]
286
+ },
287
+ "macro_roc_auc_ovr": 0.9187512184393106
288
+ },
289
+ "delta_accuracy": 0.07521367521367517,
290
+ "delta_macro_f1": 0.06664462758388245
291
+ },
292
+ "no_timestep": {
293
+ "n_features": 52,
294
+ "dropped_count": 1,
295
+ "metrics": {
296
+ "model": "xgboost_no_timestep",
297
+ "accuracy": 0.3623931623931624,
298
+ "macro_f1": 0.3138802646284953,
299
+ "weighted_f1": 0.3500013055228507,
300
+ "per_class_f1": {
301
+ "target_reconnaissance": 0.4419889502762431,
302
+ "infrastructure_setup": 0.24,
303
+ "lure_crafting": 0.2748091603053435,
304
+ "email_delivery": 0.5617283950617284,
305
+ "victim_engagement": 0.26666666666666666,
306
+ "credential_harvesting": 0.11666666666666667,
307
+ "post_compromise_escalation": 0.2953020134228188
308
+ },
309
+ "confusion_matrix": {
310
+ "labels": [
311
+ "target_reconnaissance",
312
+ "infrastructure_setup",
313
+ "lure_crafting",
314
+ "email_delivery",
315
+ "victim_engagement",
316
+ "credential_harvesting",
317
+ "post_compromise_escalation"
318
+ ],
319
+ "matrix": [
320
+ [
321
+ 40,
322
+ 18,
323
+ 26,
324
+ 0,
325
+ 0,
326
+ 0,
327
+ 0
328
+ ],
329
+ [
330
+ 23,
331
+ 12,
332
+ 18,
333
+ 0,
334
+ 0,
335
+ 0,
336
+ 0
337
+ ],
338
+ [
339
+ 32,
+               17,
+               18,
+               0,
+               0,
+               0,
+               0
+             ],
+             [
+               2,
+               0,
+               2,
+               91,
+               16,
+               17,
+               15
+             ],
+             [
+               0,
+               0,
+               0,
+               36,
+               22,
+               20,
+               22
+             ],
+             [
+               0,
+               0,
+               0,
+               25,
+               16,
+               7,
+               15
+             ],
+             [
+               0,
+               0,
+               0,
+               29,
+               11,
+               13,
+               22
+             ]
+           ]
+         },
+         "macro_roc_auc_ovr": 0.8128267634071407
+       },
+       "delta_accuracy": 0.2923076923076923,
+       "delta_macro_f1": 0.326247402056711
+     },
+     "no_engineered": {
+       "n_features": 47,
+       "dropped_count": 6,
+       "metrics": {
+         "model": "xgboost_no_engineered",
+         "accuracy": 0.6581196581196581,
+         "macro_f1": 0.6401951204875947,
+         "weighted_f1": 0.6592473136316277,
+         "per_class_f1": {
+           "target_reconnaissance": 0.8809523809523809,
+           "infrastructure_setup": 0.7155963302752294,
+           "lure_crafting": 0.6518518518518519,
+           "email_delivery": 0.8,
+           "victim_engagement": 0.49473684210526314,
+           "credential_harvesting": 0.3484848484848485,
+           "post_compromise_escalation": 0.5897435897435898
+         },
+         "confusion_matrix": {
+           "labels": [
+             "target_reconnaissance",
+             "infrastructure_setup",
+             "lure_crafting",
+             "email_delivery",
+             "victim_engagement",
+             "credential_harvesting",
+             "post_compromise_escalation"
+           ],
+           "matrix": [
+             [
+               74,
+               0,
+               10,
+               0,
+               0,
+               0,
+               0
+             ],
+             [
+               0,
+               39,
+               14,
+               0,
+               0,
+               0,
+               0
+             ],
+             [
+               10,
+               13,
+               44,
+               0,
+               0,
+               0,
+               0
+             ],
+             [
+               0,
+               4,
+               0,
+               112,
+               26,
+               1,
+               0
+             ],
+             [
+               0,
+               0,
+               0,
+               20,
+               47,
+               22,
+               11
+             ],
+             [
+               0,
+               0,
+               0,
+               5,
+               11,
+               23,
+               24
+             ],
+             [
+               0,
+               0,
+               0,
+               0,
+               6,
+               23,
+               46
+             ]
+           ]
+         },
+         "macro_roc_auc_ovr": 0.9369503919262667
+       },
+       "delta_accuracy": -0.0034188034188034067,
+       "delta_macro_f1": -6.745380238848409e-05
+     }
+   }
+ }
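The summary metrics in the `no_engineered` ablation can be re-derived from its confusion matrix. The following is a minimal sketch, not the repository's evaluation code: the helper names `accuracy` and `per_class_f1` are hypothetical, and the rows-are-true / columns-are-predicted orientation is an assumption (F1 is unaffected by it, since it only uses the row sum plus the column sum).

```python
# Confusion matrix copied from the "no_engineered" ablation entry.
# Assumption: rows are true phases, columns are predicted phases.
LABELS = [
    "target_reconnaissance",
    "infrastructure_setup",
    "lure_crafting",
    "email_delivery",
    "victim_engagement",
    "credential_harvesting",
    "post_compromise_escalation",
]
MATRIX = [
    [74, 0, 10, 0, 0, 0, 0],
    [0, 39, 14, 0, 0, 0, 0],
    [10, 13, 44, 0, 0, 0, 0],
    [0, 4, 0, 112, 26, 1, 0],
    [0, 0, 0, 20, 47, 22, 11],
    [0, 0, 0, 5, 11, 23, 24],
    [0, 0, 0, 0, 6, 23, 46],
]


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_f1(matrix, labels):
    """F1 per class: 2*TP / (row sum + column sum)."""
    n = len(matrix)
    scores = {}
    for k in range(n):
        tp = matrix[k][k]
        row_sum = sum(matrix[k])
        col_sum = sum(matrix[i][k] for i in range(n))
        denom = row_sum + col_sum
        scores[labels[k]] = 2 * tp / denom if denom else 0.0
    return scores


ACC = accuracy(MATRIX)                  # 385 correct out of 585 test rows
F1 = per_class_f1(MATRIX, LABELS)
MACRO_F1 = sum(F1.values()) / len(F1)   # unweighted mean over the 7 phases
```

Run against the matrix above, this reproduces the reported `accuracy` and `macro_f1` values, which is a quick consistency check when editing results files by hand.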
feature_engineering.py ADDED
@@ -0,0 +1,341 @@
+ """
+ feature_engineering.py
+ ======================
+
+ Feature pipeline for the CYB004 baseline classifier.
+
+ Predicts `campaign_phase` (7-class) from per-timestep phishing campaign
+ trajectory data on the CYB004 sample dataset.
+
+ CSV inputs:
+     campaign_trajectories.csv   (primary, one row per timestep, 100
+                                  campaigns x ~40 timesteps = 3,952 rows)
+     victim_topology.csv         (per-department victim configuration,
+                                  joined on target_department_id)
+     campaign_summary.csv        (per-campaign aggregates; reserved for
+                                  future work)
+     campaign_events.csv         (discrete event log; reserved for
+                                  future work)
+
+ Target classes (7 phases observed in the sample):
+     target_reconnaissance, infrastructure_setup, lure_crafting,
+     email_delivery, victim_engagement, credential_harvesting,
+     post_compromise_escalation
+
+ This is the email-security / SOC use case: given the observable
+ campaign telemetry at a moment in time, what phase of the phishing
+ lifecycle is the campaign in?
+
+ The pivot to campaign_phase (away from actor_capability_tier, the
+ README's headline use case) happened because per-campaign-constant
+ features (lure_personalisation_score, click_through_rate,
+ credential_submission_rate, target_department_id) leak tier via the
+ small test fold under group-aware splitting. With those features
+ removed, honest tier prediction is below majority baseline. The full
+ 335k-row CYB004 dataset would address this; the sample does not.
+ See the model card for full discussion.
+
+ Public API
+ ----------
+     build_features(trajectories_path, topology_path)
+         -> (X, y, groups, meta)
+     transform_single(record, meta, victim_aggregates=None) -> np.ndarray
+     save_meta(meta, path) / load_meta(path)
+     build_department_lookup(topology_path) -> dict
+
+ License
+ -------
+ Ships with the public model on Hugging Face under CC-BY-NC-4.0, matching
+ the dataset license. See README.md.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+ import pandas as pd
+
+ # ---------------------------------------------------------------------------
+ # Label space
+ # ---------------------------------------------------------------------------
+
+ LABEL_ORDER = [
+     "target_reconnaissance",
+     "infrastructure_setup",
+     "lure_crafting",
+     "email_delivery",
+     "victim_engagement",
+     "credential_harvesting",
+     "post_compromise_escalation",
+ ]
+ LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+ INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+
+ # ---------------------------------------------------------------------------
+ # Identifier and target columns - not features
+ # ---------------------------------------------------------------------------
+
+ ID_COLUMNS = ["campaign_id", "actor_id"]
+ TARGET_COLUMN = "campaign_phase"
+
+ # `actor_capability_tier` is kept as a feature - it's a real SOC observable
+ # (analysts typically have an actor cluster hypothesis), and its
+ # purity-vs-phase is 0.18 (uniform baseline 0.14), so it isn't an oracle.
+
+ # `delivery_outcome` is dropped: its purity vs phase is much higher
+ # (0.36) - `no_delivery` appears only in early phases, effectively
+ # encoding phase position. Keeping it would give the model a near-oracle.
+ LEAKY_COLUMNS = [
+     "delivery_outcome",
+ ]
+
+ # ---------------------------------------------------------------------------
+ # Per-timestep numeric features
+ # ---------------------------------------------------------------------------
+
+ DIRECT_NUMERIC_TIMESTEP_FEATURES = [
+     "timestep",                    # strong but non-deterministic phase signal
+     "emails_sent_cumulative",      # increases through campaign; useful position proxy
+     "click_through_rate",          # per-campaign constant; informative when combined with timestep
+     "credential_submission_rate",  # per-campaign constant
+     "gateway_detection_score",     # per-step variation
+     "lure_personalisation_score",  # per-campaign constant; tier signal
+     "target_department_id",        # per-campaign constant; treated as ordinal ID
+ ]
+
+ # Per-timestep categoricals
+ CATEGORICAL_TIMESTEP_FEATURES = [
+     "evasion_technique_active",  # 6 levels incl. "none" (82%); active evasion correlates with mid-late phases
+     "actor_capability_tier",     # 4 levels; mostly per-campaign constant
+ ]
+
+ # ---------------------------------------------------------------------------
+ # Victim topology features (joined on target_department_id)
+ # ---------------------------------------------------------------------------
+
+ TOPOLOGY_NUMERIC_FEATURES = [
+     "employee_count",
+     "privileged_account_density",
+     "mfa_enrollment_rate",
+     "click_susceptibility_base",
+     "email_volume_daily",
+ ]
+
+ TOPOLOGY_CATEGORICAL_FEATURES = [
+     "department_type",
+     "industry_sector",
+     "awareness_training_level",
+     "gateway_architecture",
+     "dmarc_enforcement_level",
+ ]
+
+
+ # ---------------------------------------------------------------------------
+ # Engineered features (none derived from phase or timestep alone)
+ # ---------------------------------------------------------------------------
+
+ def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Six engineered features. None directly encode phase; each is a
+     behavioural composite that helps disambiguate adjacent phases.
+     """
+     df = df.copy()
+
+     # 1. Log-scaled email volume. emails_sent_cumulative is heavy-tailed
+     #    (0 in recon, hundreds-to-thousands by post_compromise).
+     df["log_emails_sent"] = np.log1p(df["emails_sent_cumulative"].clip(lower=0)).astype(float)
+
+     # 2. Gateway-blocked step. gateway_detection_score > 0.7 marks
+     #    high-confidence gateway intervention; common in email_delivery.
+     df["is_gateway_blocked_step"] = (df["gateway_detection_score"] > 0.7).astype(int)
+
+     # 3. Evasion-active flag. Non-"none" evasion_technique_active
+     #    concentrates in lure_crafting and email_delivery.
+     df["is_evasion_active"] = (df["evasion_technique_active"] != "none").astype(int)
+
+     # 4. High-personalisation flag. lure_personalisation_score > 0.7 is
+     #    an APT-tier signature.
+     df["is_high_personalisation"] = (df["lure_personalisation_score"] > 0.7).astype(int)
+
+     # 5. Has credential capture flag. credential_submission_rate > 0
+     #    indicates the campaign has reached credential-capture phases.
+     df["has_credential_capture"] = (df["credential_submission_rate"] > 0).astype(int)
+
+     # 6. Engaged-victim flag. click_through_rate > 0 indicates
+     #    victim_engagement or later phase.
+     df["has_user_engagement"] = (df["click_through_rate"] > 0).astype(int)
+
+     return df
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def build_features(
+     trajectories_path: str | Path,
+     topology_path: str | Path,
+ ) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+     """
+     Load CSVs, join topology, drop target + leaky columns, engineer features,
+     one-hot encode, return (X, y, groups, meta).
+
+     `groups` is a Series of campaign_id values aligned with X. Use it with
+     GroupShuffleSplit / GroupKFold: a single campaign generates ~40
+     correlated timesteps; row-level random splitting inflates metrics.
+     """
+     traj = pd.read_csv(trajectories_path)
+     topo = pd.read_csv(topology_path)
+
+     y = traj[TARGET_COLUMN].map(LABEL_TO_INT)
+     if y.isna().any():
+         bad = traj.loc[y.isna(), TARGET_COLUMN].unique()
+         raise ValueError(f"Unknown campaign_phase values: {bad}")
+     y = y.astype(int)
+     groups = traj["campaign_id"].copy()
+
+     traj = traj.drop(columns=ID_COLUMNS + [TARGET_COLUMN] + LEAKY_COLUMNS,
+                      errors="ignore")
+
+     topo_cols_needed = (
+         ["department_id"]
+         + TOPOLOGY_NUMERIC_FEATURES
+         + TOPOLOGY_CATEGORICAL_FEATURES
+     )
+     traj = traj.merge(
+         topo[topo_cols_needed],
+         left_on="target_department_id", right_on="department_id", how="left",
+     ).drop(columns=["department_id"], errors="ignore")
+
+     traj = _add_engineered_features(traj)
+
+     numeric_features = (
+         DIRECT_NUMERIC_TIMESTEP_FEATURES
+         + TOPOLOGY_NUMERIC_FEATURES
+         + [
+             "log_emails_sent", "is_gateway_blocked_step", "is_evasion_active",
+             "is_high_personalisation", "has_credential_capture", "has_user_engagement",
+         ]
+     )
+     X_numeric = traj[numeric_features].astype(float)
+
+     all_categorical = (
+         [(col, "timestep") for col in CATEGORICAL_TIMESTEP_FEATURES]
+         + [(col, "topology") for col in TOPOLOGY_CATEGORICAL_FEATURES]
+     )
+     categorical_levels: dict[str, list[str]] = {}
+     blocks: list[pd.DataFrame] = []
+     for col, _src in all_categorical:
+         if col not in traj.columns:
+             continue
+         levels = sorted(traj[col].dropna().unique().tolist())
+         categorical_levels[col] = levels
+         block = pd.get_dummies(
+             traj[col].astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         blocks.append(block)
+
+     X = pd.concat(
+         [X_numeric.reset_index(drop=True)]
+         + [b.reset_index(drop=True) for b in blocks],
+         axis=1,
+     ).fillna(0.0)
+
+     meta = {
+         "feature_names": X.columns.tolist(),
+         "numeric_features": numeric_features,
+         "categorical_levels": categorical_levels,
+         "label_to_int": LABEL_TO_INT,
+         "int_to_label": INT_TO_LABEL,
+         "leakage_excluded": LEAKY_COLUMNS,
+     }
+     return X, y, groups, meta
+
+
+ def transform_single(
+     record: dict | pd.DataFrame,
+     meta: dict[str, Any],
+     victim_aggregates: dict | None = None,
+ ) -> np.ndarray:
+     """Encode a single timestep record for inference."""
+     if isinstance(record, dict):
+         df = pd.DataFrame([record.copy()])
+     else:
+         # Reset the index so the encoded blocks align row-for-row in the
+         # concat below, even if the caller passes a DataFrame slice.
+         df = record.copy().reset_index(drop=True)
+
+     if victim_aggregates is not None:
+         for k, v in victim_aggregates.items():
+             df[k] = v
+
+     df = _add_engineered_features(df)
+
+     numeric = pd.DataFrame({
+         col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+         for col in meta["numeric_features"]
+     })
+     blocks: list[pd.DataFrame] = [numeric]
+     for col, levels in meta["categorical_levels"].items():
+         val = df.get(col, pd.Series([None] * len(df)))
+         block = pd.get_dummies(
+             val.astype("category").cat.set_categories(levels),
+             prefix=col, dummy_na=False,
+         ).astype(int)
+         for lvl in levels:
+             cname = f"{col}_{lvl}"
+             if cname not in block.columns:
+                 block[cname] = 0
+         block = block[[f"{col}_{lvl}" for lvl in levels]]
+         blocks.append(block)
+
+     X = pd.concat(blocks, axis=1).fillna(0.0)
+     X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+     return X.values.astype(np.float32)
+
+
+ def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+     """Write `meta` to JSON; integer keys of int_to_label become strings."""
+     serializable = {
+         "feature_names": meta["feature_names"],
+         "numeric_features": meta["numeric_features"],
+         "categorical_levels": meta["categorical_levels"],
+         "label_to_int": meta["label_to_int"],
+         "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+         "leakage_excluded": meta.get("leakage_excluded", []),
+     }
+     with open(path, "w") as f:
+         json.dump(serializable, f, indent=2)
+
+
+ def load_meta(path: str | Path) -> dict[str, Any]:
+     """Read meta JSON and restore integer keys on int_to_label."""
+     with open(path) as f:
+         meta = json.load(f)
+     meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+     return meta
+
+
+ def build_department_lookup(topology_path: str | Path) -> dict[int, dict]:
+     """Build {department_id: {topology features}} for inference-time lookup."""
+     topo = pd.read_csv(topology_path)
+     cols = TOPOLOGY_NUMERIC_FEATURES + TOPOLOGY_CATEGORICAL_FEATURES
+     out = {}
+     for _, row in topo.iterrows():
+         out[int(row["department_id"])] = {c: row[c] for c in cols if c in topo.columns}
+     return out
+
+
+ if __name__ == "__main__":
+     import sys
+     base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+     X, y, groups, meta = build_features(
+         base / "campaign_trajectories.csv",
+         base / "victim_topology.csv",
+     )
+     print(f"X shape: {X.shape}")
+     print(f"y shape: {y.shape}")
+     print(f"groups: {groups.nunique()} campaigns")
+     print(f"n features: {len(meta['feature_names'])}")
+     print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+     print(f"X has NaN: {X.isnull().any().any()}")
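The `build_features` docstring recommends splitting by `campaign_id` because one campaign yields roughly 40 correlated timesteps. The sketch below illustrates that idea in plain Python so it runs without extra dependencies; it is a stand-in for scikit-learn's `GroupShuffleSplit`, not the split used to train the released models. The function name `group_holdout_split`, the campaign IDs, and the 0.15 test fraction are illustrative assumptions.

```python
import random


def group_holdout_split(groups, test_fraction=0.15, seed=42):
    """Return (train_idx, test_idx) such that no group appears on both sides."""
    unique = sorted(set(groups))          # deterministic starting order
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx


# Toy stand-in for the `groups` Series: 100 hypothetical campaigns x 40 timesteps.
groups = [f"C{c:03d}" for c in range(100) for _ in range(40)]
train_idx, test_idx = group_holdout_split(groups)

train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
```

Because whole campaigns are assigned to one side, every timestep of a held-out campaign is unseen during training, which is the property a row-level random split would violate.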
feature_meta.json ADDED
@@ -0,0 +1,149 @@
+ {
+   "feature_names": [
+     "timestep",
+     "emails_sent_cumulative",
+     "click_through_rate",
+     "credential_submission_rate",
+     "gateway_detection_score",
+     "lure_personalisation_score",
+     "target_department_id",
+     "employee_count",
+     "privileged_account_density",
+     "mfa_enrollment_rate",
+     "click_susceptibility_base",
+     "email_volume_daily",
+     "log_emails_sent",
+     "is_gateway_blocked_step",
+     "is_evasion_active",
+     "is_high_personalisation",
+     "has_credential_capture",
+     "has_user_engagement",
+     "evasion_technique_active_base64_payload_embedding",
+     "evasion_technique_active_homoglyph_substitution",
+     "evasion_technique_active_html_obfuscation",
+     "evasion_technique_active_image_only_lure",
+     "evasion_technique_active_none",
+     "evasion_technique_active_redirect_chain",
+     "actor_capability_tier_cybercriminal_gang",
+     "actor_capability_tier_initial_access_broker",
+     "actor_capability_tier_nation_state_apt",
+     "actor_capability_tier_opportunistic",
+     "department_type_executive_leadership",
+     "department_type_finance_accounts_payable",
+     "department_type_human_resources",
+     "department_type_information_technology",
+     "industry_sector_financial_services",
+     "industry_sector_government_state_local",
+     "industry_sector_retail_ecommerce",
+     "industry_sector_technology",
+     "awareness_training_level_annual",
+     "awareness_training_level_basic",
+     "awareness_training_level_continuous",
+     "awareness_training_level_none",
+     "awareness_training_level_quarterly",
+     "gateway_architecture_ai_sender_reputation",
+     "gateway_architecture_ensemble_layered_gateway",
+     "gateway_architecture_integrated_cloud_defender",
+     "gateway_architecture_legacy_spam_filter",
+     "gateway_architecture_ml_classifier_gateway",
+     "gateway_architecture_rule_based_filter",
+     "gateway_architecture_sandbox_detonation",
+     "gateway_architecture_zero_trust_email_proxy",
+     "dmarc_enforcement_level_monitoring",
+     "dmarc_enforcement_level_none",
+     "dmarc_enforcement_level_quarantine",
+     "dmarc_enforcement_level_reject"
+   ],
+   "numeric_features": [
+     "timestep",
+     "emails_sent_cumulative",
+     "click_through_rate",
+     "credential_submission_rate",
+     "gateway_detection_score",
+     "lure_personalisation_score",
+     "target_department_id",
+     "employee_count",
+     "privileged_account_density",
+     "mfa_enrollment_rate",
+     "click_susceptibility_base",
+     "email_volume_daily",
+     "log_emails_sent",
+     "is_gateway_blocked_step",
+     "is_evasion_active",
+     "is_high_personalisation",
+     "has_credential_capture",
+     "has_user_engagement"
+   ],
+   "categorical_levels": {
+     "evasion_technique_active": [
+       "base64_payload_embedding",
+       "homoglyph_substitution",
+       "html_obfuscation",
+       "image_only_lure",
+       "none",
+       "redirect_chain"
+     ],
+     "actor_capability_tier": [
+       "cybercriminal_gang",
+       "initial_access_broker",
+       "nation_state_apt",
+       "opportunistic"
+     ],
+     "department_type": [
+       "executive_leadership",
+       "finance_accounts_payable",
+       "human_resources",
+       "information_technology"
+     ],
+     "industry_sector": [
+       "financial_services",
+       "government_state_local",
+       "retail_ecommerce",
+       "technology"
+     ],
+     "awareness_training_level": [
+       "annual",
+       "basic",
+       "continuous",
+       "none",
+       "quarterly"
+     ],
+     "gateway_architecture": [
+       "ai_sender_reputation",
+       "ensemble_layered_gateway",
+       "integrated_cloud_defender",
+       "legacy_spam_filter",
+       "ml_classifier_gateway",
+       "rule_based_filter",
+       "sandbox_detonation",
+       "zero_trust_email_proxy"
+     ],
+     "dmarc_enforcement_level": [
+       "monitoring",
+       "none",
+       "quarantine",
+       "reject"
+     ]
+   },
+   "label_to_int": {
+     "target_reconnaissance": 0,
+     "infrastructure_setup": 1,
+     "lure_crafting": 2,
+     "email_delivery": 3,
+     "victim_engagement": 4,
+     "credential_harvesting": 5,
+     "post_compromise_escalation": 6
+   },
+   "int_to_label": {
+     "0": "target_reconnaissance",
+     "1": "infrastructure_setup",
+     "2": "lure_crafting",
+     "3": "email_delivery",
+     "4": "victim_engagement",
+     "5": "credential_harvesting",
+     "6": "post_compromise_escalation"
+   },
+   "leakage_excluded": [
+     "delivery_outcome"
+   ]
+ }
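The column order in `feature_names` follows from the other two fields: the numeric features come first, then one `{column}_{level}` entry per categorical level in the stored order. A minimal sketch of that relationship is below; `expected_feature_names` is a hypothetical helper (not part of `feature_engineering.py`), and it assumes the JSON object order is preserved on load, which holds for Python's `json` module.

```python
def expected_feature_names(meta):
    """Rebuild the one-hot column order implied by a feature_meta dict."""
    names = list(meta["numeric_features"])
    for col, levels in meta["categorical_levels"].items():
        # pandas.get_dummies(prefix=col) names each column "<col>_<level>".
        names.extend(f"{col}_{lvl}" for lvl in levels)
    return names


# Toy subset of the metadata above, for illustration only.
toy_meta = {
    "numeric_features": ["timestep", "log_emails_sent"],
    "categorical_levels": {
        "dmarc_enforcement_level": ["monitoring", "none", "quarantine", "reject"],
    },
}
toy_meta["feature_names"] = expected_feature_names(toy_meta)
```

Comparing `expected_feature_names(meta)` with `meta["feature_names"]` after `load_meta` is a cheap guard against a hand-edited or mismatched metadata file before running inference.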
feature_scaler.json ADDED
@@ -0,0 +1 @@
+ {"mean": [19.882267966775007, 264.6323582520766, 0.052154893463344176, 0.03361545684362586, 0.6047568436258577, 0.4326537739256049, 17.127121704586493, 172.1491513181654, 0.7521711809317442, 0.8172881906825569, 0.07423943661971831, 1151.8490429758035, 3.894440315600361, 0.30371975442397975, 0.1758757674250632, 0.11917659804983749, 1.0, 1.0, 0.030335861321776816, 0.053810039725532686, 0.04333694474539545, 0.025279884434814014, 0.8241242325749368, 0.02311303719754424, 0.18923799205489347, 0.10220296135789093, 0.10509209100758396, 0.6034669555796316, 0.27230046948356806, 0.26291079812206575, 0.1632358252076562, 0.30155290718671, 0.27230046948356806, 0.30155290718671, 0.1632358252076562, 0.26291079812206575, 0.30841459010473093, 0.16143011917659805, 0.20548934633441676, 0.29360780065005415, 0.031058143734200072, 0.12639942217407008, 0.13578909353557242, 0.14590104730949802, 0.11014806789454677, 0.06608884073672806, 0.09750812567713976, 0.13867822318526543, 0.1794871794871795, 0.09750812567713976, 0.11014806789454677, 0.2047670639219935, 0.58757674250632], "std": [12.12092281961143, 240.98788415799402, 0.020507195059365872, 0.012951632990740584, 0.16345254609210969, 0.1787513429787685, 9.161154583852591, 85.48823018511177, 0.13799067057693098, 0.10193473774948415, 0.02923768201623528, 772.2778476847263, 2.791161927013341, 0.45994615422530144, 0.38078320056479364, 0.3240547183619046, 1.0, 1.0, 0.1715407352835541, 0.22568321453759693, 0.20365125061044834, 0.15700227356694219, 0.38078320056479364, 0.15028965965395197, 0.3917683030033992, 0.30296974343852234, 0.30672743633583477, 0.4892658167708199, 0.4452241130504305, 0.4402939026663002, 0.3696474491122128, 0.45901507812196696, 0.4452241130504305, 0.45901507812196696, 0.3696474491122128, 0.4402939026663002, 0.4619221667863049, 0.3679936701948691, 0.40413173266488067, 0.4554966395152088, 0.17350621713351017, 0.3323589938739654, 0.34262634311115486, 0.35307074530111743, 0.31313077339534806, 0.24848220047758568, 
0.2967020106382041, 0.3456728601775323, 0.38382904647787225, 0.2967020106382041, 0.31313077339534806, 0.40360418981498464, 0.4923594837624244]}
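The scaler file stores one mean and one standard deviation per feature, in `feature_names` order; the MLP input is standardized as `(x - mean) / std`. The sketch below shows that transform in plain Python. The `standardize` helper and its zero-variance guard are assumptions for illustration, not the training code. The entries for `has_credential_capture` and `has_user_engagement` (mean 1.0, std 1.0) are consistent with such a guard having been applied to constant columns, though the source does not say so.

```python
def standardize(row, mean, std, eps=1e-12):
    """Apply (x - mean) / std element-wise, treating near-zero std as 1.0."""
    if not (len(row) == len(mean) == len(std)):
        raise ValueError("row, mean and std must have the same length")
    return [
        (x - m) / (s if abs(s) > eps else 1.0)
        for x, m, s in zip(row, mean, std)
    ]


# First three scaler entries: timestep, emails_sent_cumulative, click_through_rate.
MEAN = [19.882267966775007, 264.6323582520766, 0.052154893463344176]
STD = [12.12092281961143, 240.98788415799402, 0.020507195059365872]

# A row equal to the training mean maps to all zeros.
scaled = standardize(MEAN, MEAN, STD)
```

XGBoost consumes the unscaled features, so this step applies only on the MLP path.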
inference_example.ipynb ADDED
@@ -0,0 +1,320 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# CYB004 Baseline Classifier — Inference Example\n",
8
+ "\n",
9
+ "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **phishing campaign phase** of a new per-timestep telemetry record.\n",
10
+ "\n",
11
+ "**Models predict one of 7 phases:** `target_reconnaissance`, `infrastructure_setup`, `lure_crafting`, `email_delivery`, `victim_engagement`, `credential_harvesting`, `post_compromise_escalation`.\n",
12
+ "\n",
13
+ "**This is a baseline reference model**, not a production email-security platform. See the model card for full metrics and limitations."
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "markdown",
18
+ "metadata": {},
19
+ "source": [
20
+ "## 1. Install dependencies"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "metadata": {},
27
+ "outputs": [],
28
+ "source": [
29
+ "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "## 2. Download model artifacts from Hugging Face"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "execution_count": null,
42
+ "metadata": {},
43
+ "outputs": [],
44
+ "source": [
45
+ "from huggingface_hub import hf_hub_download\n",
46
+ "\n",
47
+ "REPO_ID = \"xpertsystems/cyb004-baseline-classifier\"\n",
48
+ "\n",
49
+ "files = {}\n",
50
+ "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
51
+ " \"feature_engineering.py\", \"feature_meta.json\",\n",
52
+ " \"feature_scaler.json\"]:\n",
53
+ " files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
54
+ " print(f\" downloaded: {name}\")"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": null,
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "import sys, os\n",
64
+ "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
65
+ "if fe_dir not in sys.path:\n",
66
+ " sys.path.insert(0, fe_dir)\n",
67
+ "\n",
68
+ "from feature_engineering import (\n",
69
+ " transform_single, load_meta, INT_TO_LABEL, build_department_lookup\n",
70
+ ")"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "markdown",
75
+ "metadata": {},
76
+ "source": [
77
+ "## 3. Load models and metadata"
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "code",
82
+ "execution_count": null,
83
+ "metadata": {},
84
+ "outputs": [],
85
+ "source": [
86
+ "import json\n",
87
+ "import numpy as np\n",
88
+ "import torch\n",
89
+ "import torch.nn as nn\n",
90
+ "import xgboost as xgb\n",
91
+ "from safetensors.torch import load_file\n",
92
+ "\n",
93
+ "meta = load_meta(files[\"feature_meta.json\"])\n",
94
+ "with open(files[\"feature_scaler.json\"]) as f:\n",
95
+ " scaler = json.load(f)\n",
96
+ "\n",
97
+ "N_FEATURES = len(meta[\"feature_names\"])\n",
98
+ "N_CLASSES = len(meta[\"int_to_label\"])\n",
99
+ "print(f\"feature count: {N_FEATURES}\")\n",
100
+ "print(f\"class count: {N_CLASSES}\")\n",
101
+ "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": null,
107
+ "metadata": {},
108
+ "outputs": [],
109
+ "source": [
110
+ "# XGBoost\n",
111
+ "xgb_model = xgb.XGBClassifier()\n",
112
+ "xgb_model.load_model(files[\"model_xgb.json\"])\n",
113
+ "\n",
114
+ "# MLP architecture (must match training)\n",
115
+ "class PhaseMLP(nn.Module):\n",
116
+ " def __init__(self, n_features, n_classes=7, hidden1=128, hidden2=64, dropout=0.3):\n",
117
+ " super().__init__()\n",
118
+ " self.net = nn.Sequential(\n",
119
+ " nn.Linear(n_features, hidden1),\n",
120
+ " nn.BatchNorm1d(hidden1),\n",
121
+ " nn.ReLU(),\n",
122
+ " nn.Dropout(dropout),\n",
123
+ " nn.Linear(hidden1, hidden2),\n",
124
+ " nn.BatchNorm1d(hidden2),\n",
125
+ " nn.ReLU(),\n",
126
+ " nn.Dropout(dropout),\n",
127
+ " nn.Linear(hidden2, n_classes),\n",
128
+ " )\n",
129
+ " def forward(self, x):\n",
130
+ " return self.net(x)\n",
131
+ "\n",
132
+ "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
133
+ "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
134
+ "mlp_model.eval()\n",
135
+ "print(\"models loaded\")"
136
+ ]
137
+ },
138
+ {
139
+ "cell_type": "markdown",
140
+ "metadata": {},
141
+ "source": [
142
+ "## 4. Build the department lookup\n",
143
+ "\n",
144
+ "Per-department topology features (employee_count, MFA enrollment, gateway architecture, DMARC level, etc.) are pulled from `victim_topology.csv` and merged into each timestep record by `target_department_id`."
145
+ ]
146
+ },
147
+ {
148
+ "cell_type": "code",
149
+ "execution_count": null,
150
+ "metadata": {},
151
+ "outputs": [],
152
+ "source": [
153
+ "from huggingface_hub import snapshot_download\n",
154
+ "\n",
155
+ "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb004-sample\", repo_type=\"dataset\")\n",
156
+ "\n",
157
+ "dept_lookup = build_department_lookup(\n",
158
+ " os.path.join(ds_path, \"victim_topology.csv\")\n",
159
+ ")\n",
160
+ "print(f\"loaded {len(dept_lookup)} department profiles\")"
161
+ ]
162
+ },
163
+ {
164
+ "cell_type": "markdown",
165
+ "metadata": {},
166
+ "source": [
167
+ "## 5. Prediction helper"
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": null,
173
+ "metadata": {},
174
+ "outputs": [],
175
+ "source": [
176
+ "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
177
+ "SD = np.array(scaler[\"std\"], dtype=np.float32)\n",
178
+ "\n",
179
+ "def predict_phase(record: dict) -> dict:\n",
180
+ " \"\"\"Predict the campaign phase for one per-timestep telemetry record.\n",
181
+ "\n",
182
+ " Per-department topology features are pulled automatically via\n",
183
+ " `target_department_id` from the dept_lookup loaded above.\n",
184
+ " \"\"\"\n",
185
+ " dept_id = int(record.get(\"target_department_id\", -1))\n",
186
+ " dept_aggs = dept_lookup.get(dept_id, {})\n",
187
+ " X = transform_single(record, meta, victim_aggregates=dept_aggs)\n",
188
+ "\n",
189
+ " xgb_proba = xgb_model.predict_proba(X)[0]\n",
190
+ " xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
191
+ "\n",
192
+ " Xs = ((X - MU) / SD).astype(np.float32)\n",
193
+ " with torch.no_grad():\n",
194
+ " logits = mlp_model(torch.tensor(Xs))\n",
195
+ " mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
196
+ " mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
197
+ "\n",
198
+ " return {\n",
199
+ " \"xgboost\": {\n",
200
+ " \"label\": xgb_label,\n",
201
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
202
+ " },\n",
203
+ " \"mlp\": {\n",
204
+ " \"label\": mlp_label,\n",
205
+ " \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
206
+ " },\n",
207
+ " }"
208
+ ]
209
+ },
210
+ {
211
+ "cell_type": "markdown",
212
+ "metadata": {},
213
+ "source": [
214
+ "## 6. Run on an example record\n",
215
+ "\n",
216
+ "Real `email_delivery` event lifted from the sample dataset: a nation-state APT campaign at timestep 13, with homoglyph substitution evasion active and 58 emails sent. Both models should predict `email_delivery`."
217
+ ]
218
+ },
219
+ {
220
+ "cell_type": "code",
221
+ "execution_count": null,
222
+ "metadata": {},
223
+ "outputs": [],
224
+ "source": [
225
+ "# Real timestep record from the sample dataset (true phase: email_delivery)\n",
226
+ "example_record = {\n",
227
+ " \"timestep\": 13,\n",
228
+ " \"emails_sent_cumulative\": 58,\n",
229
+ " \"click_through_rate\": 0.1158,\n",
230
+ " \"credential_submission_rate\": 0.0713,\n",
231
+ " \"gateway_detection_score\": 0.7327,\n",
232
+ " \"lure_personalisation_score\": 0.7507,\n",
233
+ " \"evasion_technique_active\": \"homoglyph_substitution\",\n",
234
+ " \"target_department_id\": 10,\n",
235
+ " \"actor_capability_tier\": \"nation_state_apt\",\n",
236
+ "}\n",
237
+ "\n",
238
+ "result = predict_phase(example_record)\n",
239
+ "\n",
240
+ "print(f\"XGBoost -> {result['xgboost']['label']}\")\n",
241
+ "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
242
+ " print(f\" P({lbl:30s}) = {p:.4f}\")\n",
243
+ "\n",
244
+ "print(f\"\\nMLP -> {result['mlp']['label']}\")\n",
245
+ "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
246
+ " print(f\" P({lbl:30s}) = {p:.4f}\")"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "markdown",
251
+ "metadata": {},
252
+ "source": [
253
+ "### Note: when the two models disagree\n",
254
+ "\n",
255
+ "XGBoost and the MLP can disagree on mid-pipeline phases (`victim_engagement`, `credential_harvesting`) where timestep windows overlap. The per-class F1 in the model card identifies which phases are robustly predicted vs. which are not. In a SOC workflow, conflicting predictions are worth surfacing for human review."
256
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 7. Batch prediction on the sample dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "traj = pd.read_csv(f\"{ds_path}/campaign_trajectories.csv\")\n",
+ "\n",
+ "# Drop the leaky column the model was never trained on\n",
+ "traj = traj.drop(columns=[\"delivery_outcome\"], errors=\"ignore\")\n",
+ "\n",
+ "# Score the first 200 timesteps\n",
+ "sample = traj.head(200).copy()\n",
+ "preds = [predict_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
+ "sample[\"xgb_pred\"] = preds\n",
+ "\n",
+ "ct = pd.crosstab(sample[\"campaign_phase\"], sample[\"xgb_pred\"],\n",
+ " rownames=[\"true\"], colnames=[\"pred\"])\n",
+ "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
+ "print(ct)\n",
+ "acc = (sample[\"campaign_phase\"] == sample[\"xgb_pred\"]).mean()\n",
+ "print(f\"\\nbatch accuracy on first 200 rows (in-distribution): {acc:.4f}\")\n",
+ "print(\"\\nNote: these rows include training-set campaigns. See validation_results.json\\n\"\n",
+ " \"for proper held-out test metrics from disjoint campaigns.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 8. Next steps\n",
+ "\n",
+ "- See `validation_results.json` for held-out test metrics (15 disjoint campaigns, 585 timesteps).\n",
+ "- See `multi_seed_results.json` for robustness across 10 seeds (accuracy 0.649 ± 0.038, ROC-AUC 0.937 ± 0.010).\n",
+ "- See `ablation_results.json` for per-feature-group contributions. `timestep` carries the dominant signal.\n",
+ "- The model card explains why `actor_capability_tier` was *not* used as the target despite being the README's headline use case.\n",
+ "- For the full 335k-row CYB004 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
model_mlp.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:072999e38cd542460473780a9c71164efc1a53a1037a4b579064cc93f3f5b4b8
+ size 66788
model_xgb.json ADDED
The diff for this file is too large to render. See raw diff
 
multi_seed_results.json ADDED
@@ -0,0 +1,98 @@
+ {
+ "purpose": "With n=100 campaigns, single-seed metrics carry test-fold variance. Multi-seed evaluation gives a more reliable picture.",
+ "seeds_evaluated": [
+ 42,
+ 7,
+ 13,
+ 17,
+ 23,
+ 31,
+ 45,
+ 99,
+ 123,
+ 200
+ ],
+ "per_seed": [
+ {
+ "seed": 42,
+ "test_n_classes": 7,
+ "accuracy": 0.6547008547008547,
+ "macro_f1": 0.6401276666852063,
+ "macro_roc_auc_ovr": 0.935584434710217
+ },
+ {
+ "seed": 7,
+ "test_n_classes": 7,
+ "accuracy": 0.6267123287671232,
+ "macro_f1": 0.6141815367358149,
+ "macro_roc_auc_ovr": 0.9256987657069029
+ },
+ {
+ "seed": 13,
+ "test_n_classes": 7,
+ "accuracy": 0.5983050847457627,
+ "macro_f1": 0.5953435905708684,
+ "macro_roc_auc_ovr": 0.9235372520169014
+ },
+ {
+ "seed": 17,
+ "test_n_classes": 7,
+ "accuracy": 0.64349376114082,
+ "macro_f1": 0.6328717716731788,
+ "macro_roc_auc_ovr": 0.9426545946495839
+ },
+ {
+ "seed": 23,
+ "test_n_classes": 7,
+ "accuracy": 0.5915254237288136,
+ "macro_f1": 0.5734921834318393,
+ "macro_roc_auc_ovr": 0.9245031023094512
+ },
+ {
+ "seed": 31,
+ "test_n_classes": 7,
+ "accuracy": 0.6220095693779905,
+ "macro_f1": 0.6103022022937624,
+ "macro_roc_auc_ovr": 0.9325576570435162
+ },
+ {
+ "seed": 45,
+ "test_n_classes": 7,
+ "accuracy": 0.6678082191780822,
+ "macro_f1": 0.655097964659693,
+ "macro_roc_auc_ovr": 0.9396074000285977
+ },
+ {
+ "seed": 99,
+ "test_n_classes": 7,
+ "accuracy": 0.7111111111111111,
+ "macro_f1": 0.7136854710276727,
+ "macro_roc_auc_ovr": 0.9538147161172963
+ },
+ {
+ "seed": 123,
+ "test_n_classes": 7,
+ "accuracy": 0.6823734729493892,
+ "macro_f1": 0.6727927606720584,
+ "macro_roc_auc_ovr": 0.9443324151480283
+ },
+ {
+ "seed": 200,
+ "test_n_classes": 7,
+ "accuracy": 0.6931407942238267,
+ "macro_f1": 0.6752712902262269,
+ "macro_roc_auc_ovr": 0.9450377543018418
+ }
+ ],
+ "aggregate": {
+ "accuracy_mean": 0.6491180619923773,
+ "accuracy_std": 0.03799334369624316,
+ "accuracy_min": 0.5915254237288136,
+ "accuracy_max": 0.7111111111111111,
+ "macro_f1_mean": 0.638316643797632,
+ "macro_f1_std": 0.039956794294168915,
+ "roc_auc_mean": 0.9367328092032338,
+ "roc_auc_std": 0.009623085359130642
+ },
+ "published_artifact_seed": 42
+ }
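The `aggregate` block can be sanity-checked against the `per_seed` entries. The following is a minimal sketch, not part of the commit: it hard-codes the ten accuracies from the file instead of loading it, uses only the standard library, and the variable names are illustrative. With these values, the reported `accuracy_std` appears to correspond to the population standard deviation (`ddof=0`) rather than the sample estimate.

```python
from statistics import fmean, pstdev, stdev

# Per-seed test accuracies copied from multi_seed_results.json
# (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200, in that order).
accuracies = [
    0.6547008547008547,
    0.6267123287671232,
    0.5983050847457627,
    0.64349376114082,
    0.5915254237288136,
    0.6220095693779905,
    0.6678082191780822,
    0.7111111111111111,
    0.6823734729493892,
    0.6931407942238267,
]

acc_mean = fmean(accuracies)         # arithmetic mean over seeds
acc_pop_std = pstdev(accuracies)     # population std (divides by n)
acc_sample_std = stdev(accuracies)   # sample std (divides by n - 1)

print(f"mean accuracy        : {acc_mean:.4f}")
print(f"population std (n)   : {acc_pop_std:.4f}")
print(f"sample std (n - 1)   : {acc_sample_std:.4f}")
print(f"min / max            : {min(accuracies):.4f} / {max(accuracies):.4f}")
```

With only 10 seeds the two standard deviations differ noticeably, so it is worth stating which one a quoted "± 0.038" refers to.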
validation_results.json ADDED
@@ -0,0 +1,246 @@
+ {
+ "version": "1.0.0",
+ "dataset": "xpertsystems/cyb004-sample",
+ "task": "7-class campaign_phase classification",
+ "baselines": {
+ "always_predict_majority_accuracy": 0.24444444444444444,
+ "majority_class": "email_delivery",
+ "random_guess_accuracy": 0.14285714285714285
+ },
+ "split": {
+ "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
+ "rationale": "100 phishing campaigns generate ~3,952 timesteps (~40 per campaign). Random row-split would leak per-campaign correlations into the test fold. Group-aware split keeps train/val/test campaigns disjoint.",
+ "campaigns_train": 69,
+ "campaigns_val": 16,
+ "campaigns_test": 15,
+ "timesteps_train": 2769,
+ "timesteps_val": 598,
+ "timesteps_test": 585,
+ "seed": 42
+ },
+ "n_features": 53,
+ "label_classes": [
+ "target_reconnaissance",
+ "infrastructure_setup",
+ "lure_crafting",
+ "email_delivery",
+ "victim_engagement",
+ "credential_harvesting",
+ "post_compromise_escalation"
+ ],
+ "class_distribution_train": {
+ "email_delivery": 655,
+ "victim_engagement": 459,
+ "post_compromise_escalation": 388,
+ "target_reconnaissance": 381,
+ "credential_harvesting": 352,
+ "lure_crafting": 300,
+ "infrastructure_setup": 234
+ },
+ "class_distribution_test": {
+ "email_delivery": 143,
+ "victim_engagement": 100,
+ "target_reconnaissance": 84,
+ "post_compromise_escalation": 75,
+ "lure_crafting": 67,
+ "credential_harvesting": 63,
+ "infrastructure_setup": 53
+ },
+ "leakage_excluded_features": [
+ "delivery_outcome (purity 0.36 vs phase; no_delivery appears only in early phases - near-oracle)"
+ ],
+ "models": {
+ "xgboost": {
+ "architecture": "Gradient-boosted decision trees, multi:softprob, 7 classes",
+ "framework": "xgboost",
+ "test_metrics": {
+ "model": "xgboost",
+ "accuracy": 0.6547008547008547,
+ "macro_f1": 0.6401276666852063,
+ "weighted_f1": 0.657179533714298,
+ "per_class_f1": {
+ "target_reconnaissance": 0.8875739644970414,
+ "infrastructure_setup": 0.7115384615384616,
+ "lure_crafting": 0.6762589928057554,
+ "email_delivery": 0.7913669064748201,
+ "victim_engagement": 0.46938775510204084,
+ "credential_harvesting": 0.34074074074074073,
+ "post_compromise_escalation": 0.6040268456375839
+ },
+ "confusion_matrix": {
+ "labels": [
+ "target_reconnaissance",
+ "infrastructure_setup",
+ "lure_crafting",
+ "email_delivery",
+ "victim_engagement",
+ "credential_harvesting",
+ "post_compromise_escalation"
+ ],
+ "matrix": [
+ [
+ 75,
+ 0,
+ 9,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 37,
+ 16,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 10,
+ 10,
+ 47,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 4,
+ 0,
+ 110,
+ 28,
+ 1,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 21,
+ 46,
+ 24,
+ 9
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 4,
+ 16,
+ 23,
+ 20
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 6,
+ 24,
+ 45
+ ]
+ ]
+ },
+ "macro_roc_auc_ovr": 0.935584434710217
+ }
+ },
+ "mlp": {
+ "architecture": "PyTorch MLP, 53 -> 128 -> 64 -> 7, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+ "framework": "pytorch",
+ "test_metrics": {
+ "model": "mlp",
+ "accuracy": 0.6427350427350428,
+ "macro_f1": 0.6275373447450349,
+ "weighted_f1": 0.6380162402905546,
+ "per_class_f1": {
+ "target_reconnaissance": 0.8313253012048193,
+ "infrastructure_setup": 0.7017543859649122,
+ "lure_crafting": 0.5606060606060606,
+ "email_delivery": 0.7612456747404844,
+ "victim_engagement": 0.3867403314917127,
+ "credential_harvesting": 0.43410852713178294,
+ "post_compromise_escalation": 0.7169811320754716
+ },
+ "confusion_matrix": {
+ "labels": [
+ "target_reconnaissance",
+ "infrastructure_setup",
+ "lure_crafting",
+ "email_delivery",
+ "victim_engagement",
+ "credential_harvesting",
+ "post_compromise_escalation"
+ ],
+ "matrix": [
+ [
+ 69,
+ 1,
+ 14,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 40,
+ 13,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 13,
+ 17,
+ 37,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 3,
+ 1,
+ 110,
+ 23,
+ 6,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 32,
+ 35,
+ 21,
+ 12
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 4,
+ 16,
+ 28,
+ 15
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 7,
+ 11,
+ 57
+ ]
+ ]
+ },
+ "macro_roc_auc_ovr": 0.9264812360054401
+ }
+ }
+ }
+ }
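The headline XGBoost metrics in `validation_results.json` can be recomputed from the confusion matrix alone. The sketch below is illustrative and not part of the commit: it hard-codes the XGBoost matrix, assumes rows are true labels and columns are predictions (the row sums match `class_distribution_test`, which is consistent with that reading), and the helper name `per_class_f1` is hypothetical.

```python
# Label order and XGBoost confusion matrix copied from validation_results.json.
# Assumption: rows = true class, columns = predicted class.
labels = [
    "target_reconnaissance",
    "infrastructure_setup",
    "lure_crafting",
    "email_delivery",
    "victim_engagement",
    "credential_harvesting",
    "post_compromise_escalation",
]
xgb_matrix = [
    [75, 0, 9, 0, 0, 0, 0],
    [0, 37, 16, 0, 0, 0, 0],
    [10, 10, 47, 0, 0, 0, 0],
    [0, 4, 0, 110, 28, 1, 0],
    [0, 0, 0, 21, 46, 24, 9],
    [0, 0, 0, 4, 16, 23, 20],
    [0, 0, 0, 0, 6, 24, 45],
]


def per_class_f1(matrix):
    """Return F1 per class, using F1 = 2*TP / (row_sum + col_sum)."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        actual = sum(matrix[k])                          # true count of class k
        predicted = sum(matrix[i][k] for i in range(n))  # predicted count of class k
        denom = actual + predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


total = sum(sum(row) for row in xgb_matrix)
accuracy = sum(xgb_matrix[k][k] for k in range(len(labels))) / total
f1 = per_class_f1(xgb_matrix)
macro_f1 = sum(f1) / len(f1)

print(f"test rows: {total}, accuracy: {accuracy:.4f}, macro F1: {macro_f1:.4f}")
for name, score in zip(labels, f1):
    print(f"  {name:28s} F1 = {score:.4f}")
```

The same function can be applied to the MLP matrix. Reading the off-diagonal cells shows that most errors fall on adjacent phases, for example `victim_engagement` rows predicted as `email_delivery` or `credential_harvesting`.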