Initial release: XGBoost + MLP for phishing campaign-phase classification

Files changed (10)

README.md +455 -0
ablation_results.json +489 -0
feature_engineering.py +341 -0
feature_meta.json +149 -0
feature_scaler.json +1 -0
inference_example.ipynb +320 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
multi_seed_results.json +98 -0
validation_results.json +246 -0

README.md ADDED

	@@ -0,0 +1,455 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - phishing
+  - email-security
+  - bec
+  - social-engineering
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb004-sample
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+model-index:
+  - name: cyb004-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 7-class phishing campaign phase classification
+        dataset:
+          type: xpertsystems/cyb004-sample
+          name: CYB004 Synthetic Phishing Campaign Dataset (Sample)
+        metrics:
+          - type: roc_auc
+            value: 0.9356
+            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.6547
+            name: Test accuracy (XGBoost, seed 42)
+          - type: f1
+            value: 0.6401
+            name: Test macro-F1 (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.649
+            name: Multi-seed accuracy mean ± 0.038 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.937
+            name: Multi-seed ROC-AUC mean ± 0.010 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.9265
+            name: Test macro ROC-AUC OvR (MLP, seed 42)
+          - type: accuracy
+            value: 0.6427
+            name: Test accuracy (MLP, seed 42)
+          - type: f1
+            value: 0.6275
+            name: Test macro-F1 (MLP, seed 42)
+---
+# CYB004 Baseline Classifier
+**Phishing campaign phase classifier trained on the CYB004 synthetic
+phishing campaign sample. Predicts which of 7 lifecycle phases a
+per-timestep telemetry record belongs to, from observable trajectory
+and victim-topology features.**
+> **Baseline reference, not for production use.** This model demonstrates
+> that the [CYB004 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb004-sample)
+> is learnable end-to-end and gives prospective buyers a working starting
+> point. It is not a production email-security platform, SOAR component,
+> or threat detector. See [Limitations](#limitations).
+## Model overview
+| Property | Value |
+|---|---|
+| Task | 7-class campaign_phase classification |
+| Training data | `xpertsystems/cyb004-sample` (3,952 timesteps across 100 phishing campaigns) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 53 (after one-hot encoding) |
+| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
+| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline |
+## Why this task instead of actor-tier attribution?
+The CYB004 dataset README leads with "actor attribution modelling — 4-tier
+classification" as a suggested use case. We piloted that target first and
+found a serious issue: four features in the dataset
+(`lure_personalisation_score`, `click_through_rate`,
+`credential_submission_rate`, `target_department_id`) are **constant per
+campaign**, not per-timestep. They look like per-step features but each
+takes a single value across all ~40 timesteps of a given campaign.
+Because these constants are tier-correlated (especially
+`lure_personalisation_score`, which differs systematically across the
+four actor tiers), they leak tier identity through the campaign-level
+fingerprint they create. With a 15-campaign test fold, many test
+campaigns land in the same feature ranges as training campaigns of the
+same tier, and the model achieves spurious 97%+ accuracy that does not
+generalize. Removing those features (the honest fix) drops tier
+prediction to **accuracy 0.45, ROC-AUC 0.70 — below majority baseline
+of 0.59**. The full 335k-row CYB004 product, with ~4,800 campaigns,
+will not have this constraint; the sample at n=100 cannot support
+honest tier learning.
+We pivoted to **campaign_phase prediction**, which has 3,952 rows of
+per-timestep data spread across 7 phases with tight timestep windows.
+It learns cleanly under the same group-aware split: 65% accuracy,
+ROC-AUC 0.94, stable across 10 seeds. This is a legitimate
+email-security use case — SOAR playbooks and threat-hunting workflows
+need to tag what phase of a phishing campaign observed activity
+belongs to.
+Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
+- `model_xgb.json` — gradient-boosted trees, primary recommendation
+- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
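The disagreement-as-triage idea can be sketched as follows. This is a minimal illustration, not part of the release: the `triage` helper is hypothetical, the probabilities are made up, and the label order is assumed to follow the confusion-matrix labels in `ablation_results.json` (the real mapping is `INT_TO_LABEL` in `feature_engineering.py`).

```python
from typing import Sequence

# Assumed label order (taken from the confusion-matrix labels); check INT_TO_LABEL.
PHASES = [
    "target_reconnaissance", "infrastructure_setup", "lure_crafting",
    "email_delivery", "victim_engagement", "credential_harvesting",
    "post_compromise_escalation",
]


def triage(proba_xgb: Sequence[float], proba_mlp: Sequence[float],
           labels: Sequence[str] = PHASES) -> dict:
    """Compare both models' top-1 phase for one record (hypothetical helper)."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "needs_review": top_xgb != top_mlp,  # disagreement -> route to a human
    }


# Illustrative probabilities only; they do not come from the published models.
result = triage(
    [0.02, 0.03, 0.05, 0.10, 0.45, 0.30, 0.05],
    [0.01, 0.02, 0.02, 0.05, 0.30, 0.50, 0.10],
)
print(result)
```

In practice the two probability vectors would come from `xgb_model.predict_proba` and a softmax over the MLP logits for the same record.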
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+import os
+import sys
+import numpy as np
+import xgboost as xgb
+from huggingface_hub import hf_hub_download
+REPO = "xpertsystems/cyb004-baseline-classifier"
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+# feature_engineering.py ships in the repo; put its download directory on the import path.
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import (
+    transform_single, load_meta, INT_TO_LABEL, build_department_lookup
+)
+meta = load_meta(paths["feature_meta.json"])
+xgb_model = xgb.XGBClassifier()
+xgb_model.load_model(paths["model_xgb.json"])
+dept_lookup = build_department_lookup("path/to/victim_topology.csv")
+# Predict. `my_record` is a dict-like holding one raw per-timestep row that you
+# supply (see inference_example.ipynb for the full pattern).
+dept_aggs = dept_lookup.get(my_record["target_department_id"], {})
+X = transform_single(my_record, meta, victim_aggregates=dept_aggs)
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for the full
+copy-paste demo.
+## Training data
+Trained on the public sample of CYB004, 3,952 per-timestep trajectory
+rows from 100 phishing campaigns (~40 timesteps per campaign):
+| Phase | Total rows | Test rows (seed 42) |
+|---|---:|---:|
+| `email_delivery` | 919 | 134 |
+| `victim_engagement` | 667 | 102 |
+| `target_reconnaissance` | 558 | 89 |
+| `post_compromise_escalation` | 533 | 50 |
+| `credential_harvesting` | 494 | 91 |
+| `lure_crafting` | 435 | 71 |
+| `infrastructure_setup` | 346 | 48 |
+### Group-aware split
+A single campaign generates ~40 highly-correlated timesteps. Random
+row-level splitting would put timesteps from the same campaign in both
+train and test, inflating metrics in a way that does not generalize to
+new campaigns.
+This release uses **GroupShuffleSplit by `campaign_id`** (nested,
+70/15/15):
+| Fold | Campaigns | Timesteps |
+|---|---:|---:|
+| Train | 69 | 2,792 |
+| Validation | 16 | 575 |
+| Test | 15 | 585 |
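The core property of a group-aware split (no campaign appears in more than one fold) can be illustrated with a stdlib-only sketch. This is a stand-in for scikit-learn's `GroupShuffleSplit`, not the release's training code; the `group_split` helper and the toy campaign IDs are hypothetical, and the resulting fold sizes will not match the 69/16/15 table exactly.

```python
import random
from collections import defaultdict


def group_split(campaign_ids, seed=42, frac_train=0.70, frac_val=0.15):
    """Assign whole campaigns to train/val/test so no campaign spans two folds."""
    unique = sorted(set(campaign_ids))
    random.Random(seed).shuffle(unique)          # shuffle campaigns, not rows
    n_train = round(frac_train * len(unique))
    n_val = round(frac_val * len(unique))
    fold_of = {}
    for i, cid in enumerate(unique):
        if i < n_train:
            fold_of[cid] = "train"
        elif i < n_train + n_val:
            fold_of[cid] = "val"
        else:
            fold_of[cid] = "test"
    folds = defaultdict(list)
    for row_idx, cid in enumerate(campaign_ids):
        folds[fold_of[cid]].append(row_idx)      # every row follows its campaign
    return dict(folds), fold_of


# Toy stand-in for the sample: 100 campaigns with 40 timesteps each.
campaign_ids = [f"C{c:03d}" for c in range(100) for _ in range(40)]
folds, fold_of = group_split(campaign_ids)
print({name: len(rows) for name, rows in folds.items()})
```

Because the shuffle operates on campaign IDs rather than rows, correlated timesteps from one campaign can never straddle the train/test boundary.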
+All test campaigns are completely unseen during training. Class imbalance
+is addressed with balanced class weights, applied as per-row
+`sample_weight` for XGBoost and as a weighted cross-entropy loss for the MLP.
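For reference, the usual "balanced" heuristic sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below applies it to the whole-sample class counts from the table above; the release would compute weights on the training fold, whose per-class counts are not published, so the numbers here are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), the common 'balanced' rule."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Whole-sample counts from the class table (assumption: used here only as a
# stand-in for the unpublished training-fold counts).
CLASS_COUNTS = {
    "email_delivery": 919, "victim_engagement": 667,
    "target_reconnaissance": 558, "post_compromise_escalation": 533,
    "credential_harvesting": 494, "lure_crafting": 435,
    "infrastructure_setup": 346,
}
labels = [phase for phase, m in CLASS_COUNTS.items() for _ in range(m)]
weights = balanced_class_weights(labels)
sample_weight = [weights[y] for y in labels]   # one weight per row

for phase, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{phase:28s} {w:.3f}")
```

A per-row list like `sample_weight` is what `XGBClassifier.fit(X, y, sample_weight=...)` accepts; the per-class dictionary corresponds to the class weights of a weighted cross-entropy loss.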
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical feature recipe.
+53 features survive after encoding, drawn from:
+- **Per-timestep numeric** (7): `timestep`, `emails_sent_cumulative`, `click_through_rate`, `credential_submission_rate`, `gateway_detection_score`, `lure_personalisation_score`, `target_department_id`
+- **Per-timestep categorical** (2, one-hot): `evasion_technique_active`, `actor_capability_tier`
+- **Victim topology numeric** (5): `employee_count`, `privileged_account_density`, `mfa_enrollment_rate`, `click_susceptibility_base`, `email_volume_daily`
+- **Victim topology categorical** (5, one-hot): `department_type`, `industry_sector`, `awareness_training_level`, `gateway_architecture`, `dmarc_enforcement_level`
+- **Engineered** (6): `log_emails_sent`, `is_gateway_blocked_step`, `is_evasion_active`, `is_high_personalisation`, `has_credential_capture`, `has_user_engagement`
+### Leakage audit
+**One column dropped:** `delivery_outcome` (7-class categorical). Its
+crosstab with `campaign_phase` shows that `no_delivery` appears only in
+five phases (`target_reconnaissance`, `infrastructure_setup`,
+`lure_crafting`, `credential_harvesting`, `post_compromise_escalation`)
+and never in `email_delivery` or `victim_engagement`. Cell purity 0.36
+(uniform baseline 0.14). Keeping it would give the model a near-oracle
+for separating `email_delivery` and `victim_engagement` from the other
+five phases.
+**No oracle features remain.** All retained features have phase-purity
+under 0.20.
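The card does not define "purity" precisely. One common definition, assumed here, is the fraction of rows that fall in the majority class of their feature value: an oracle feature scores 1.0, and a feature independent of 7 uniformly distributed classes scores about 1/7 ≈ 0.14. A minimal sketch on toy data:

```python
from collections import Counter, defaultdict


def purity(feature_values, labels):
    """Share of rows belonging to the majority label of their feature value.

    Assumed definition (standard cluster purity); the release's exact audit
    metric is not stated in the card.
    """
    by_value = defaultdict(Counter)
    for v, y in zip(feature_values, labels):
        by_value[v][y] += 1
    majority_rows = sum(max(c.values()) for c in by_value.values())
    return majority_rows / len(labels)


# Toy data: two balanced classes.
toy_labels = ["a", "b"] * 4
oracle_feature = list(toy_labels)      # feature value reveals the label
uninformative = ["x"] * 8              # same value on every row

print(purity(oracle_feature, toy_labels))   # 1.0
print(purity(uninformative, toy_labels))    # 0.5
```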
+### Per-campaign-constant features
+Four features (`lure_personalisation_score`, `click_through_rate`,
+`credential_submission_rate`, `target_department_id`) are constant
+within each campaign. For **phase prediction** this is acceptable —
+their phase-purity is low, so the model uses them as conditioning
+context (similar to "we know this is an APT campaign targeting finance"
+when reasoning about which phase we're in), not as oracle features.
+They became a problem only for the abandoned actor-tier task.
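A check for per-campaign-constant columns is straightforward: a column is constant within campaigns if every campaign sees exactly one distinct value for it. The sketch below is a stdlib illustration with made-up rows; the helper name is hypothetical and the values are not taken from the dataset.

```python
from collections import defaultdict


def per_group_constant_columns(rows, group_key="campaign_id"):
    """Return columns that take a single value within every group."""
    seen = defaultdict(lambda: defaultdict(set))   # column -> group -> values
    for row in rows:
        for col, val in row.items():
            if col != group_key:
                seen[col][row[group_key]].add(val)
    return sorted(
        col for col, by_group in seen.items()
        if all(len(values) == 1 for values in by_group.values())
    )


# Toy rows (illustrative values only).
rows = [
    {"campaign_id": "A", "timestep": 0, "lure_personalisation_score": 0.8},
    {"campaign_id": "A", "timestep": 1, "lure_personalisation_score": 0.8},
    {"campaign_id": "B", "timestep": 0, "lure_personalisation_score": 0.3},
    {"campaign_id": "B", "timestep": 1, "lure_personalisation_score": 0.3},
]
constant_cols = per_group_constant_columns(rows)
print(constant_cols)
```

Columns flagged this way behave like a campaign fingerprint, which is why they matter for any target that is itself fixed per campaign, such as actor tier.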
+## Evaluation
+### Test-set metrics, seed 42 (n = 585 timesteps from 15 disjoint campaigns)
+**XGBoost** (the published `model_xgb.json` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.9356** |
+| Accuracy | **0.6547** |
+| Macro-F1 | 0.6401 |
+| Weighted-F1 | 0.6572 |
+**MLP** (the published `model_mlp.safetensors` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | 0.9265 |
+| Accuracy | 0.6427 |
+| Macro-F1 | 0.6275 |
+| Weighted-F1 | 0.6492 |
+### Multi-seed robustness (XGBoost, 10 seeds)
+Performance is consistent across seeds (accuracy std 0.038, ROC-AUC std 0.010), so the seed-42 result is not a lucky draw:
+| Metric | Mean | Std | Min | Max |
+|---|---:|---:|---:|---:|
+| Accuracy | 0.649 | 0.038 | 0.592 | 0.711 |
+| Macro-F1 | 0.638 | 0.040 | 0.574 | 0.714 |
+| Macro ROC-AUC OvR | 0.937 | 0.010 | 0.923 | 0.954 |
+Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
+All 10 seeds yielded all 7 classes in the test fold.
+### Per-class F1 (seed 42) — where the signal is and isn't
+| Phase | XGBoost F1 | MLP F1 | Note |
+|---|---:|---:|---|
+| `target_reconnaissance` | **0.888** | 0.831 | Tight early window (timesteps 0-7) |
+| `email_delivery` | **0.791** | 0.761 | Tight window (8-30); gateway signals + email volume |
+| `infrastructure_setup` | **0.712** | 0.702 | Tight window (5-18) |
+| `lure_crafting` | **0.676** | 0.561 | Tight window (3-13) |
+| `post_compromise_escalation` | 0.604 | 0.717 | Late window (22-52) |
+| `victim_engagement` | 0.469 | 0.387 | Mid window (14-38), overlaps with adjacent phases |
+| `credential_harvesting` | 0.341 | 0.434 | Mid-late (19-45), similar features to victim_engagement |
+Four early phases (target_reconnaissance, infrastructure_setup,
+lure_crafting, email_delivery) classify best because they sit in
+comparatively tight timestep windows (which still overlap one another
+at the edges) and have distinctive features.
+Three later phases (victim_engagement, credential_harvesting,
+post_compromise_escalation) overlap substantially in timestep range
+(14-38, 19-45, 22-52) and share similar behavioural footprints
+(non-zero click/credential rates, deployed evasion); these are
+genuinely harder for a flat-tabular model. Sequence models with
+campaign-level context would likely help here.
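The XGBoost per-class F1 values can be reproduced from the full-model confusion matrix published in `ablation_results.json`. The sketch below copies that matrix verbatim. Whether rows are true labels and columns predictions is an assumption (the scikit-learn convention), but F1 is symmetric in row and column totals, so the result does not depend on it.

```python
LABELS = [
    "target_reconnaissance", "infrastructure_setup", "lure_crafting",
    "email_delivery", "victim_engagement", "credential_harvesting",
    "post_compromise_escalation",
]
# full_model_metrics.confusion_matrix.matrix from ablation_results.json
CM = [
    [75, 0, 9, 0, 0, 0, 0],
    [0, 37, 16, 0, 0, 0, 0],
    [10, 10, 47, 0, 0, 0, 0],
    [0, 4, 0, 110, 28, 1, 0],
    [0, 0, 0, 21, 46, 24, 9],
    [0, 0, 0, 4, 16, 23, 20],
    [0, 0, 0, 0, 6, 24, 45],
]


def per_class_f1(cm):
    """F1_k = 2*TP_k / (row_total_k + column_total_k)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        row_total = sum(cm[k])
        col_total = sum(cm[r][k] for r in range(n))
        denom = row_total + col_total
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


f1 = per_class_f1(CM)
macro_f1 = sum(f1) / len(f1)
accuracy = sum(CM[k][k] for k in range(len(CM))) / sum(map(sum, CM))

for name, score in zip(LABELS, f1):
    print(f"{name:28s} {score:.3f}")
print(f"macro-F1 {macro_f1:.4f}  accuracy {accuracy:.4f}")
```

The off-diagonal mass in the last three rows is where the mid- and late-phase confusion shows up.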
+### Ablation: which feature groups matter
+| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
+|---|---:|---:|---:|---:|
+| Full feature set (published) | 0.6547 | 0.6401 | 0.9356 | — |
+| No `timestep` | 0.3624 | 0.3139 | 0.8128 | **−0.2923** |
+| No behavioural features | 0.5795 | 0.5735 | 0.9188 | −0.0752 |
+| No topology features | 0.6410 | 0.6260 | 0.9342 | −0.0137 |
+| No engineered features | 0.6581 | 0.6402 | 0.9370 | +0.0034 |
+Three findings:
+1. **`timestep` is by far the dominant feature** (accuracy drops 29 pp
+   when it is removed; ROC-AUC falls from 0.94 to 0.81). Phishing campaigns progress through
+   phases over time; where you are in the campaign timeline carries
+   most of the phase signal.
+2. **Behavioural features contribute ~8 pp accuracy.** These are the
+   per-timestep observables (emails sent, gateway score, click rate,
+   evasion technique).
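+3. **Topology contributes ~1 pp; engineered features contribute nothing
+   measurable.** Removing topology costs 1.4 pp, while removing the
+   engineered features changes accuracy by +0.3 pp, which is well inside
+   the seed-to-seed spread (std 3.8 pp). Trees recover most of the
+   engineered features on their own; topology provides modest
+   conditioning context.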
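The percentage-point figures quoted in the findings follow directly from the accuracy column of the ablation table. A small sketch of that arithmetic, using the rounded table values:

```python
FULL_ACC = 0.6547  # full feature set, seed 42

# Accuracy with one feature group removed (values from the ablation table).
ABLATED_ACC = {
    "no_timestep": 0.3624,
    "no_behavioural": 0.5795,
    "no_topology": 0.6410,
    "no_engineered": 0.6581,
}

# Delta in percentage points: negative means the group was helping.
deltas_pp = {name: 100 * (acc - FULL_ACC) for name, acc in ABLATED_ACC.items()}

for name, delta in sorted(deltas_pp.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {delta:+.1f} pp")
```

Deltas smaller than the multi-seed accuracy standard deviation (3.8 pp) should be read as indistinguishable from noise on a single seed.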
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 7 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation mlogloss.
+**MLP:** `53 → 128 → 64 → 7`, each hidden layer followed by `BatchNorm1d`
+→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
+early stopping on validation macro-F1.
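The MLP description above maps onto a small PyTorch module. The sketch below follows the stated layer sizes and ordering, but the `build_mlp` helper and the use of `nn.Sequential` are assumptions: the parameter names in `model_mlp.safetensors` may differ, so loading the published weights may require the layout used in `inference_example.ipynb`.

```python
import torch
from torch import nn


def build_mlp(n_in=53, hidden=(128, 64), n_out=7, p_drop=0.3):
    """53 -> 128 -> 64 -> 7, each hidden layer: BatchNorm1d -> ReLU -> Dropout."""
    layers = []
    prev = n_in
    for width in hidden:
        layers += [
            nn.Linear(prev, width),
            nn.BatchNorm1d(width),
            nn.ReLU(),
            nn.Dropout(p_drop),
        ]
        prev = width
    layers.append(nn.Linear(prev, n_out))   # raw logits; softmax is applied at inference
    return nn.Sequential(*layers)


model = build_mlp()
model.eval()                                 # disable dropout, use running BN stats
with torch.no_grad():
    logits = model(torch.zeros(4, 53))       # batch of 4 dummy feature rows
n_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), n_params)
```

At roughly 16k trainable parameters the network is small relative to the ~2.8k training timesteps.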
+Training hyperparameters (learning rate, batch size, n_estimators,
+early-stopping patience, weight decay, class-weighting strategy) are
+held internally by XpertSystems and are not part of this release.
+## Limitations
+**This is a baseline reference, not a production email-security system.**
+1. **Mid- and late-phase confusion.** XGBoost per-class F1 (seed 42) for
+   `victim_engagement`, `credential_harvesting`, and
+   `post_compromise_escalation` is 0.34–0.60. These phases overlap in
+   timestep range and share similar behavioural signatures. Sequence
+   models that consider campaign-level context would help substantially.
+2. **The pivot away from actor-tier classification is dataset-limited,
+   not method-limited.** With 100 campaigns and 4 tiers (some with only
+   10 campaigns total), tier classification is below majority baseline
+   once leakage-prone features are removed. The full 335k-row CYB004
+   product provides ~4,800 campaigns; the sample does not.
+3. **Synthetic-vs-real transfer.** The dataset is synthetic and
+   calibrated to email-security and threat-intelligence benchmark
+   targets (Proofpoint State of the Phish, KnowBe4 Industry Benchmark,
+   Cofense PIQ, Mandiant M-Trends, FBI IC3 BEC Report, Verizon DBIR,
+   CISA, APWG). Real phishing telemetry has different noise
+   characteristics, adversary adaptation, and instrumentation gaps. Do
+   not assume metrics transfer.
+4. **Adversarial robustness not evaluated.** The dataset is not
+   adversarially generated; the model has not been red-teamed against
+   evasive lures or novel infrastructure.
+5. **MLP brittleness on OOD inputs.** With ~2.8k training timesteps,
+   the MLP can produce confidently-wrong predictions on hand-crafted
+   records far from the training manifold. XGBoost is more robust.
+   Use both; treat disagreement as a signal for human review.
+6. **`timestep` dominance is a property of the dataset.** Real
+   phishing telemetry doesn't carry a clean per-campaign normalized
+   timestep — that's a simulator artifact. A buyer transferring this
+   baseline to real campaign telemetry would need to recover an
+   equivalent temporal-position feature (e.g. hours since campaign
+   first observation, position in stage-detection pipeline).
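For limitation 6, one possible substitute for the simulator's `timestep` is "hours since the campaign was first observed". The sketch below is an illustration under assumptions: real telemetry would first need campaign clustering to assign a `campaign_id`, and the helper name, event tuples, and timestamps are all hypothetical.

```python
from datetime import datetime


def hours_since_first_seen(events):
    """events: list of (campaign_id, datetime). Returns hours elapsed since
    each campaign's earliest event, in the same order as the input."""
    first_seen = {}
    for cid, ts in events:
        if cid not in first_seen or ts < first_seen[cid]:
            first_seen[cid] = ts
    return [(ts - first_seen[cid]).total_seconds() / 3600.0 for cid, ts in events]


# Hypothetical observations for two campaigns.
events = [
    ("A", datetime(2026, 1, 1, 0, 0)),
    ("A", datetime(2026, 1, 1, 6, 30)),
    ("B", datetime(2026, 1, 2, 12, 0)),
]
elapsed = hours_since_first_seen(events)
print(elapsed)
```

A feature built this way is on a different scale than the simulator's `timestep`, so a model trained on the sample would need retraining or rescaling before it could consume it.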
+## Notes on dataset schema
+The CYB004 sample dataset README describes some fields differently from
+the actual schema. The model was trained on the actual schema; this note
+helps buyers reconcile what they read with what they receive.
+| What the README says | What the data actually contains |
+|---|---|
+| "9 campaign phases" (reconnaissance, infrastructure_setup, lure_creation, send_wave, gateway_evaluation, user_interaction, credential_capture, lateral_pivot, exfiltration) | 7 phases with different names: target_reconnaissance, infrastructure_setup, lure_crafting, email_delivery, victim_engagement, credential_harvesting, post_compromise_escalation |
+| 4 actor tiers: `opportunistic`, `organized_crime`, `targeted`, `nation_state_apt` | 4 tiers: `opportunistic`, `cybercriminal_gang`, `initial_access_broker`, `nation_state_apt` |
+| 8 department types listed | 4 department types: `executive_leadership`, `finance_accounts_payable`, `human_resources`, `information_technology` |
+| 4 gateway architectures | 8 gateway architectures including `ai_sender_reputation`, `integrated_cloud_defender`, `zero_trust_email_proxy` |
+| Awareness training: none, annual, semi-annual, quarterly, monthly | annual, none, continuous, basic, quarterly (no semi-annual or monthly) |
+| Per-timestep fields: `send_volume`, `gateway_blocked`, `emails_delivered`, `user_report_count`, `mfa_bypass_attempted`, `bec_attempt`, `lateral_pivot_attempted`, `operational_stealth_score`, `dmarc_enforcement_active` | None of these exist per-timestep. The actual per-timestep columns are: `emails_sent_cumulative`, `gateway_detection_score`, `delivery_outcome`, `lure_personalisation_score`, `evasion_technique_active`. BEC / MFA bypass / lateral phishing flags exist only at the campaign-summary level. |
+None of these discrepancies affects model correctness — the feature
+pipeline uses the actual column names. If you build your own pipeline
+against the dataset, use the actual columns.
+## Intended use
+- **Evaluating fit** of the CYB004 dataset for your email-security
+  or threat-hunting research
+- **Baseline reference** for new model architectures (especially
+  sequence models, which should beat this baseline on the overlapping
+  mid-late phases)
+- **Teaching and demo** for tabular classification on phishing
+  campaign telemetry
+- **Feature engineering reference** for per-timestep campaign data
+## Out-of-scope use
+- Production email security on real campaign telemetry
+- Threat hunting / SOAR playbooks on real systems
+- Actor attribution (this baseline does not address that task; see why above)
+- Adversarial-evasion evaluation (dataset not adversarially generated)
+- Any operational security decision
+## Reproducibility
+Outputs above were produced with `seed = 42` (published artifact),
+group-aware nested `GroupShuffleSplit` (70/15/15 by campaign_id), on the
+published sample (`xpertsystems/cyb004-sample`, version 1.0.0, generated
+2026-05-16). The feature pipeline in `feature_engineering.py` is
+deterministic and the trained weights in this repo correspond exactly
+to the metrics above.
+Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
+`multi_seed_results.json` confirm robust performance across splits.
+The training script itself is private to XpertSystems.
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights (seed 42) |
+| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
+| `feature_engineering.py` | Feature pipeline (load → join topology → engineer → encode) |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Per-feature-group ablation |
+| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB004** dataset contains ~335,000 rows across four files,
+with calibrated benchmark validation against 12 metrics from email
+security and threat intelligence sources (Proofpoint, KnowBe4,
+Cofense, Mandiant, FBI IC3, Verizon, CISA, APWG). The full
+XpertSystems.ai synthetic data catalogue spans 41 SKUs across
+Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
+& Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb004-sample
+- 🤖 Companion models:
+  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
+  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
+  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
+## Citation
+```bibtex
+@misc{xpertsystems_cyb004_baseline_2026,
+  title  = {CYB004 Baseline Classifier: XGBoost and MLP for Phishing Campaign Phase Classification},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb004-baseline-classifier},
+  note   = {Baseline reference model trained on xpertsystems/cyb004-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,489 @@

+{
+  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.6547008547008547,
+    "macro_f1": 0.6401276666852063,
+    "weighted_f1": 0.657179533714298,
+    "per_class_f1": {
+      "target_reconnaissance": 0.8875739644970414,
+      "infrastructure_setup": 0.7115384615384616,
+      "lure_crafting": 0.6762589928057554,
+      "email_delivery": 0.7913669064748201,
+      "victim_engagement": 0.46938775510204084,
+      "credential_harvesting": 0.34074074074074073,
+      "post_compromise_escalation": 0.6040268456375839
+    },
+    "confusion_matrix": {
+      "labels": [
+        "target_reconnaissance",
+        "infrastructure_setup",
+        "lure_crafting",
+        "email_delivery",
+        "victim_engagement",
+        "credential_harvesting",
+        "post_compromise_escalation"
+      ],
+      "matrix": [
+        [
+          75,
+          0,
+          9,
+          0,
+          0,
+          0,
+          0
+        ],
+        [
+          0,
+          37,
+          16,
+          0,
+          0,
+          0,
+          0
+        ],
+        [
+          10,
+          10,
+          47,
+          0,
+          0,
+          0,
+          0
+        ],
+        [
+          0,
+          4,
+          0,
+          110,
+          28,
+          1,
+          0
+        ],
+        [
+          0,
+          0,
+          0,
+          21,
+          46,
+          24,
+          9
+        ],
+        [
+          0,
+          0,
+          0,
+          4,
+          16,
+          23,
+          20
+        ],
+        [
+          0,
+          0,
+          0,
+          0,
+          6,
+          24,
+          45
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.935584434710217
+  },
+  "ablations": {
+    "no_topology": {
+      "n_features": 23,
+      "dropped_count": 30,
+      "metrics": {
+        "model": "xgboost_no_topology",
+        "accuracy": 0.6410256410256411,
+        "macro_f1": 0.626013906528604,
+        "weighted_f1": 0.6377089952999916,
+        "per_class_f1": {
+          "target_reconnaissance": 0.891566265060241,
+          "infrastructure_setup": 0.7586206896551724,
+          "lure_crafting": 0.676923076923077,
+          "email_delivery": 0.7598566308243727,
+          "victim_engagement": 0.40609137055837563,
+          "credential_harvesting": 0.2782608695652174,
+          "post_compromise_escalation": 0.6107784431137725
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              74,
+              0,
+              10,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              44,
+              9,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              8,
+              15,
+              44,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              4,
+              0,
+              106,
+              30,
+              3,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              26,
+              40,
+              16,
+              18
+            ],
+            [
+              0,
+              0,
+              0,
+              4,
+              20,
+              16,
+              23
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              7,
+              17,
+              51
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9341744835062434
+      },
+      "delta_accuracy": 0.013675213675213627,
+      "delta_macro_f1": 0.014113760156602262
+    },
+    "no_behavioural": {
+      "n_features": 36,
+      "dropped_count": 17,
+      "metrics": {
+        "model": "xgboost_no_behavioural",
+        "accuracy": 0.5794871794871795,
+        "macro_f1": 0.5734830391013238,
+        "weighted_f1": 0.5833619015067782,
+        "per_class_f1": {
+          "target_reconnaissance": 0.9024390243902439,
+          "infrastructure_setup": 0.4745762711864407,
+          "lure_crafting": 0.6619718309859155,
+          "email_delivery": 0.6390977443609023,
+          "victim_engagement": 0.3404255319148936,
+          "credential_harvesting": 0.3472222222222222,
+          "post_compromise_escalation": 0.6486486486486487
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              74,
+              0,
+              10,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              28,
+              16,
+              9,
+              0,
+              0,
+              0
+            ],
+            [
+              6,
+              13,
+              47,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              23,
+              2,
+              85,
+              30,
+              3,
+              0
+            ],
+            [
+              0,
+              1,
+              0,
+              26,
+              32,
+              34,
+              7
+            ],
+            [
+              0,
+              0,
+              0,
+              2,
+              18,
+              25,
+              18
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              8,
+              19,
+              48
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9187512184393106
+      },
+      "delta_accuracy": 0.07521367521367517,
+      "delta_macro_f1": 0.06664462758388245
+    },
+    "no_timestep": {
+      "n_features": 52,
+      "dropped_count": 1,
+      "metrics": {
+        "model": "xgboost_no_timestep",
+        "accuracy": 0.3623931623931624,
+        "macro_f1": 0.3138802646284953,
+        "weighted_f1": 0.3500013055228507,
+        "per_class_f1": {
+          "target_reconnaissance": 0.4419889502762431,
+          "infrastructure_setup": 0.24,
+          "lure_crafting": 0.2748091603053435,
+          "email_delivery": 0.5617283950617284,
+          "victim_engagement": 0.26666666666666666,
+          "credential_harvesting": 0.11666666666666667,
+          "post_compromise_escalation": 0.2953020134228188
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              40,
+              18,
+              26,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              23,
+              12,
+              18,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              32,
+              17,
+              18,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              2,
+              0,
+              2,
+              91,
+              16,
+              17,
+              15
+            ],
+            [
+              0,
+              0,
+              0,
+              36,
+              22,
+              20,
+              22
+            ],
+            [
+              0,
+              0,
+              0,
+              25,
+              16,
+              7,
+              15
+            ],
+            [
+              0,
+              0,
+              0,
+              29,
+              11,
+              13,
+              22
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8128267634071407
+      },
+      "delta_accuracy": 0.2923076923076923,
+      "delta_macro_f1": 0.326247402056711
+    },
+    "no_engineered": {
+      "n_features": 47,
+      "dropped_count": 6,
+      "metrics": {
+        "model": "xgboost_no_engineered",
+        "accuracy": 0.6581196581196581,
+        "macro_f1": 0.6401951204875947,
+        "weighted_f1": 0.6592473136316277,
+        "per_class_f1": {
+          "target_reconnaissance": 0.8809523809523809,
+          "infrastructure_setup": 0.7155963302752294,
+          "lure_crafting": 0.6518518518518519,
+          "email_delivery": 0.8,
+          "victim_engagement": 0.49473684210526314,
+          "credential_harvesting": 0.3484848484848485,
+          "post_compromise_escalation": 0.5897435897435898
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              74,
+              0,
+              10,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              39,
+              14,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              10,
+              13,
+              44,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              4,
+              0,
+              112,
+              26,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              20,
+              47,
+              22,
+              11
+            ],
+            [
+              0,
+              0,
+              0,
+              5,
+              11,
+              23,
+              24
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              6,
+              23,
+              46
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9369503919262667
+      },
+      "delta_accuracy": -0.0034188034188034067,
+      "delta_macro_f1": -6.745380238848409e-05
+    }
+  }
+}

feature_engineering.py ADDED

	@@ -0,0 +1,341 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB004 baseline classifier.
+Predicts `campaign_phase` (7-class) from per-timestep phishing campaign
+trajectory data on the CYB004 sample dataset.
+CSV inputs:
+    campaign_trajectories.csv  (primary, one row per timestep, 100
+                                campaigns x ~40 timesteps = 3,952 rows)
+    victim_topology.csv        (per-department victim configuration,
+                                joined on target_department_id)
+    campaign_summary.csv       (per-campaign aggregates; reserved for
+                                future work)
+    campaign_events.csv        (discrete event log; reserved for
+                                future work)
+Target classes (7 phases observed in the sample):
+    target_reconnaissance, infrastructure_setup, lure_crafting,
+    email_delivery, victim_engagement, credential_harvesting,
+    post_compromise_escalation
+This is the email-security / SOC use case: given the observable
+campaign telemetry at a moment in time, what phase of the phishing
+lifecycle is the campaign in?
+The target is campaign_phase rather than actor_capability_tier (the
+README's headline use case) because per-campaign-constant features
+(lure_personalisation_score, click_through_rate,
+credential_submission_rate, target_department_id) leak tier via the
+small test fold under group-aware splitting. With those features
+removed, honest tier prediction falls below the majority baseline. The
+full 335k-row CYB004 dataset would address this; the sample does not.
+See the model card for full discussion.
+Public API
+----------
+    build_features(trajectories_path, topology_path)
+        -> (X, y, groups, meta)
+    transform_single(record, meta, victim_aggregates=None) -> np.ndarray
+    save_meta(meta, path) / load_meta(path)
+    build_department_lookup(topology_path) -> dict
+License
+-------
+Ships with the public model on Hugging Face under CC-BY-NC-4.0, matching
+the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Label space
+# ---------------------------------------------------------------------------
+LABEL_ORDER = [
+    "target_reconnaissance",
+    "infrastructure_setup",
+    "lure_crafting",
+    "email_delivery",
+    "victim_engagement",
+    "credential_harvesting",
+    "post_compromise_escalation",
+]
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# ---------------------------------------------------------------------------
+# Identifier and target columns - not features
+# ---------------------------------------------------------------------------
+ID_COLUMNS = ["campaign_id", "actor_id"]
+TARGET_COLUMN = "campaign_phase"
+# `actor_capability_tier` is kept as a feature - it's a real SOC observable
+# (analysts typically have an actor cluster hypothesis), and its
+# purity-vs-phase is 0.18 (uniform baseline 0.14), so it isn't an oracle.
+# `delivery_outcome` is dropped: its purity vs phase is much higher
+# (0.36) - `no_delivery` appears only in early phases, effectively
+# encoding phase position. Keeping it would give the model a near-oracle.
+LEAKY_COLUMNS = [
+    "delivery_outcome",
+]
+# ---------------------------------------------------------------------------
+# Per-timestep numeric features
+# ---------------------------------------------------------------------------
+DIRECT_NUMERIC_TIMESTEP_FEATURES = [
+    "timestep",                      # strong but non-deterministic phase signal
+    "emails_sent_cumulative",        # increases through campaign; useful position proxy
+    "click_through_rate",            # per-campaign constant; informative when combined with timestep
+    "credential_submission_rate",    # per-campaign constant
+    "gateway_detection_score",       # per-step variation
+    "lure_personalisation_score",    # per-campaign constant; tier signal
+    "target_department_id",          # per-campaign constant; treated as ordinal ID
+]
+# Per-timestep categoricals
+CATEGORICAL_TIMESTEP_FEATURES = [
+    "evasion_technique_active",      # 6 levels incl. "none" (82%); active evasion correlates with mid-late phases
+    "actor_capability_tier",         # 4 levels; mostly per-campaign constant
+]
+# ---------------------------------------------------------------------------
+# Victim topology features (joined on target_department_id)
+# ---------------------------------------------------------------------------
+TOPOLOGY_NUMERIC_FEATURES = [
+    "employee_count",
+    "privileged_account_density",
+    "mfa_enrollment_rate",
+    "click_susceptibility_base",
+    "email_volume_daily",
+]
+TOPOLOGY_CATEGORICAL_FEATURES = [
+    "department_type",
+    "industry_sector",
+    "awareness_training_level",
+    "gateway_architecture",
+    "dmarc_enforcement_level",
+]
+# ---------------------------------------------------------------------------
+# Engineered features (none derived from phase or timestep alone)
+# ---------------------------------------------------------------------------
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Six engineered features. None directly encode phase; each is a
+    behavioural composite that helps disambiguate adjacent phases.
+    """
+    df = df.copy()
+    # 1. Log-scaled email volume. emails_sent_cumulative is heavy-tailed
+    #    (0 in recon, hundreds-to-thousands by post_compromise).
+    df["log_emails_sent"] = np.log1p(df["emails_sent_cumulative"].clip(lower=0)).astype(float)
+    # 2. Gateway-blocked step. gateway_detection_score > 0.7 marks
+    #    high-confidence gateway intervention; common in email_delivery.
+    df["is_gateway_blocked_step"] = (df["gateway_detection_score"] > 0.7).astype(int)
+    # 3. Evasion-active flag. Non-"none" evasion_technique_active
+    #    concentrates in lure_crafting and email_delivery.
+    df["is_evasion_active"] = (df["evasion_technique_active"] != "none").astype(int)
+    # 4. High-personalisation flag. lure_personalisation_score > 0.7 is
+    #    an APT-tier signature.
+    df["is_high_personalisation"] = (df["lure_personalisation_score"] > 0.7).astype(int)
+    # 5. Credential-capture flag. credential_submission_rate is a
+    #    per-campaign constant, so this marks campaigns with any recorded
+    #    credential capture; it does not change with phase inside a campaign.
+    df["has_credential_capture"] = (df["credential_submission_rate"] > 0).astype(int)
+    # 6. Engaged-victim flag. click_through_rate is also a per-campaign
+    #    constant, so this marks campaigns with any recorded clicks rather
+    #    than the phase of the current timestep.
+    df["has_user_engagement"] = (df["click_through_rate"] > 0).astype(int)
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    trajectories_path: str | Path,
+    topology_path: str | Path,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+    """
+    Load CSVs, join topology, drop target + leaky columns, engineer features,
+    one-hot encode, return (X, y, groups, meta).
+    `groups` is a Series of campaign_id values aligned with X. Use it with
+    GroupShuffleSplit / GroupKFold: a single campaign generates ~40
+    correlated timesteps; row-level random splitting inflates metrics.
+    """
+    traj = pd.read_csv(trajectories_path)
+    topo = pd.read_csv(topology_path)
+    y = traj[TARGET_COLUMN].map(LABEL_TO_INT)
+    if y.isna().any():
+        bad = traj.loc[y.isna(), TARGET_COLUMN].unique()
+        raise ValueError(f"Unknown campaign_phase values: {bad}")
+    y = y.astype(int)
+    groups = traj["campaign_id"].copy()
+    traj = traj.drop(columns=ID_COLUMNS + [TARGET_COLUMN] + LEAKY_COLUMNS,
+                     errors="ignore")
+    topo_cols_needed = (
+        ["department_id"]
+        + TOPOLOGY_NUMERIC_FEATURES
+        + TOPOLOGY_CATEGORICAL_FEATURES
+    )
+    traj = traj.merge(
+        topo[topo_cols_needed],
+        left_on="target_department_id", right_on="department_id", how="left",
+    ).drop(columns=["department_id"], errors="ignore")
+    traj = _add_engineered_features(traj)
+    numeric_features = (
+        DIRECT_NUMERIC_TIMESTEP_FEATURES
+        + TOPOLOGY_NUMERIC_FEATURES
+        + [
+            "log_emails_sent", "is_gateway_blocked_step", "is_evasion_active",
+            "is_high_personalisation", "has_credential_capture", "has_user_engagement",
+        ]
+    )
+    X_numeric = traj[numeric_features].astype(float)
+    all_categorical = (
+        [(col, "timestep") for col in CATEGORICAL_TIMESTEP_FEATURES]
+        + [(col, "topology") for col in TOPOLOGY_CATEGORICAL_FEATURES]
+    )
+    categorical_levels: dict[str, list[str]] = {}
+    blocks: list[pd.DataFrame] = []
+    for col, _src in all_categorical:
+        if col not in traj.columns:
+            continue
+        levels = sorted(traj[col].dropna().unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            traj[col].astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        blocks.append(block)
+    X = pd.concat(
+        [X_numeric.reset_index(drop=True)]
+        + [b.reset_index(drop=True) for b in blocks],
+        axis=1,
+    ).fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+        "leakage_excluded": LEAKY_COLUMNS,
+    }
+    return X, y, groups, meta
+def transform_single(
+    record: dict | pd.DataFrame,
+    meta: dict[str, Any],
+    victim_aggregates: dict | None = None,
+) -> np.ndarray:
+    """Encode a single timestep record for inference."""
+    if isinstance(record, dict):
+        df = pd.DataFrame([record.copy()])
+    else:
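+        # reset_index returns a new frame with a RangeIndex, so the numeric
+        # and one-hot blocks concatenated below align row for row.
+        df = record.reset_index(drop=True)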
+    if victim_aggregates is not None:
+        for k, v in victim_aggregates.items():
+            df[k] = v
+    df = _add_engineered_features(df)
+    numeric = pd.DataFrame({
+        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+        for col in meta["numeric_features"]
+    })
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df)))
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        for lvl in levels:
+            cname = f"{col}_{lvl}"
+            if cname not in block.columns:
+                block[cname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+        "leakage_excluded": meta.get("leakage_excluded", []),
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+def build_department_lookup(topology_path: str | Path) -> dict[int, dict]:
+    """Build {department_id: {topology features}} for inference-time lookup."""
+    topo = pd.read_csv(topology_path)
+    cols = TOPOLOGY_NUMERIC_FEATURES + TOPOLOGY_CATEGORICAL_FEATURES
+    out = {}
+    for _, row in topo.iterrows():
+        out[int(row["department_id"])] = {c: row[c] for c in cols if c in topo.columns}
+    return out
+if __name__ == "__main__":
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, groups, meta = build_features(
+        base / "campaign_trajectories.csv",
+        base / "victim_topology.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"groups: {groups.nunique()} campaigns")
+    print(f"n features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
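
The `build_features` docstring recommends splitting by `groups` because one campaign yields about 40 correlated timesteps. The sketch below shows the idea with the standard library only. The helper `group_holdout_split`, the toy campaign IDs, and the 15% test fraction are assumptions for illustration, and the split procedure actually used for training is not shown in this diff. In practice scikit-learn's `GroupShuffleSplit` or `GroupKFold`, which the docstring names, would do this job.

```python
import random


def group_holdout_split(groups, test_fraction=0.15, seed=42):
    """Return (train_idx, test_idx) such that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy stand-in for the `groups` Series: 100 campaigns x 40 timesteps.
# The "C000"-style IDs are hypothetical.
groups = [f"C{c:03d}" for c in range(100) for _ in range(40)]

train_idx, test_idx = group_holdout_split(groups)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}

print(f"train campaigns: {len(train_groups)}, test campaigns: {len(test_groups)}")
print(f"shared campaigns: {len(train_groups & test_groups)}")
```

A row-level random split would put timesteps from the same campaign on both sides, which is the metric inflation the docstring warns about.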

feature_meta.json ADDED Viewed

	@@ -0,0 +1,149 @@

+{
+  "feature_names": [
+    "timestep",
+    "emails_sent_cumulative",
+    "click_through_rate",
+    "credential_submission_rate",
+    "gateway_detection_score",
+    "lure_personalisation_score",
+    "target_department_id",
+    "employee_count",
+    "privileged_account_density",
+    "mfa_enrollment_rate",
+    "click_susceptibility_base",
+    "email_volume_daily",
+    "log_emails_sent",
+    "is_gateway_blocked_step",
+    "is_evasion_active",
+    "is_high_personalisation",
+    "has_credential_capture",
+    "has_user_engagement",
+    "evasion_technique_active_base64_payload_embedding",
+    "evasion_technique_active_homoglyph_substitution",
+    "evasion_technique_active_html_obfuscation",
+    "evasion_technique_active_image_only_lure",
+    "evasion_technique_active_none",
+    "evasion_technique_active_redirect_chain",
+    "actor_capability_tier_cybercriminal_gang",
+    "actor_capability_tier_initial_access_broker",
+    "actor_capability_tier_nation_state_apt",
+    "actor_capability_tier_opportunistic",
+    "department_type_executive_leadership",
+    "department_type_finance_accounts_payable",
+    "department_type_human_resources",
+    "department_type_information_technology",
+    "industry_sector_financial_services",
+    "industry_sector_government_state_local",
+    "industry_sector_retail_ecommerce",
+    "industry_sector_technology",
+    "awareness_training_level_annual",
+    "awareness_training_level_basic",
+    "awareness_training_level_continuous",
+    "awareness_training_level_none",
+    "awareness_training_level_quarterly",
+    "gateway_architecture_ai_sender_reputation",
+    "gateway_architecture_ensemble_layered_gateway",
+    "gateway_architecture_integrated_cloud_defender",
+    "gateway_architecture_legacy_spam_filter",
+    "gateway_architecture_ml_classifier_gateway",
+    "gateway_architecture_rule_based_filter",
+    "gateway_architecture_sandbox_detonation",
+    "gateway_architecture_zero_trust_email_proxy",
+    "dmarc_enforcement_level_monitoring",
+    "dmarc_enforcement_level_none",
+    "dmarc_enforcement_level_quarantine",
+    "dmarc_enforcement_level_reject"
+  ],
+  "numeric_features": [
+    "timestep",
+    "emails_sent_cumulative",
+    "click_through_rate",
+    "credential_submission_rate",
+    "gateway_detection_score",
+    "lure_personalisation_score",
+    "target_department_id",
+    "employee_count",
+    "privileged_account_density",
+    "mfa_enrollment_rate",
+    "click_susceptibility_base",
+    "email_volume_daily",
+    "log_emails_sent",
+    "is_gateway_blocked_step",
+    "is_evasion_active",
+    "is_high_personalisation",
+    "has_credential_capture",
+    "has_user_engagement"
+  ],
+  "categorical_levels": {
+    "evasion_technique_active": [
+      "base64_payload_embedding",
+      "homoglyph_substitution",
+      "html_obfuscation",
+      "image_only_lure",
+      "none",
+      "redirect_chain"
+    ],
+    "actor_capability_tier": [
+      "cybercriminal_gang",
+      "initial_access_broker",
+      "nation_state_apt",
+      "opportunistic"
+    ],
+    "department_type": [
+      "executive_leadership",
+      "finance_accounts_payable",
+      "human_resources",
+      "information_technology"
+    ],
+    "industry_sector": [
+      "financial_services",
+      "government_state_local",
+      "retail_ecommerce",
+      "technology"
+    ],
+    "awareness_training_level": [
+      "annual",
+      "basic",
+      "continuous",
+      "none",
+      "quarterly"
+    ],
+    "gateway_architecture": [
+      "ai_sender_reputation",
+      "ensemble_layered_gateway",
+      "integrated_cloud_defender",
+      "legacy_spam_filter",
+      "ml_classifier_gateway",
+      "rule_based_filter",
+      "sandbox_detonation",
+      "zero_trust_email_proxy"
+    ],
+    "dmarc_enforcement_level": [
+      "monitoring",
+      "none",
+      "quarantine",
+      "reject"
+    ]
+  },
+  "label_to_int": {
+    "target_reconnaissance": 0,
+    "infrastructure_setup": 1,
+    "lure_crafting": 2,
+    "email_delivery": 3,
+    "victim_engagement": 4,
+    "credential_harvesting": 5,
+    "post_compromise_escalation": 6
+  },
+  "int_to_label": {
+    "0": "target_reconnaissance",
+    "1": "infrastructure_setup",
+    "2": "lure_crafting",
+    "3": "email_delivery",
+    "4": "victim_engagement",
+    "5": "credential_harvesting",
+    "6": "post_compromise_escalation"
+  },
+  "leakage_excluded": [
+    "delivery_outcome"
+  ]
+}
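
The `feature_names` list in this file should be exactly `numeric_features` followed by one `{column}_{level}` entry per categorical level, in the stored order. The sketch below rebuilds the names from the two smaller fields as a consistency check. `expected_feature_names` is an illustrative helper, not a function from `feature_engineering.py`.

```python
NUMERIC_FEATURES = [
    "timestep", "emails_sent_cumulative", "click_through_rate",
    "credential_submission_rate", "gateway_detection_score",
    "lure_personalisation_score", "target_department_id", "employee_count",
    "privileged_account_density", "mfa_enrollment_rate",
    "click_susceptibility_base", "email_volume_daily", "log_emails_sent",
    "is_gateway_blocked_step", "is_evasion_active", "is_high_personalisation",
    "has_credential_capture", "has_user_engagement",
]
CATEGORICAL_LEVELS = {
    "evasion_technique_active": [
        "base64_payload_embedding", "homoglyph_substitution",
        "html_obfuscation", "image_only_lure", "none", "redirect_chain",
    ],
    "actor_capability_tier": [
        "cybercriminal_gang", "initial_access_broker",
        "nation_state_apt", "opportunistic",
    ],
    "department_type": [
        "executive_leadership", "finance_accounts_payable",
        "human_resources", "information_technology",
    ],
    "industry_sector": [
        "financial_services", "government_state_local",
        "retail_ecommerce", "technology",
    ],
    "awareness_training_level": [
        "annual", "basic", "continuous", "none", "quarterly",
    ],
    "gateway_architecture": [
        "ai_sender_reputation", "ensemble_layered_gateway",
        "integrated_cloud_defender", "legacy_spam_filter",
        "ml_classifier_gateway", "rule_based_filter",
        "sandbox_detonation", "zero_trust_email_proxy",
    ],
    "dmarc_enforcement_level": ["monitoring", "none", "quarantine", "reject"],
}


def expected_feature_names(numeric, levels):
    """Numeric columns first, then one-hot columns in stored level order."""
    names = list(numeric)
    for col, lvls in levels.items():
        names.extend(f"{col}_{lvl}" for lvl in lvls)
    return names


names = expected_feature_names(NUMERIC_FEATURES, CATEGORICAL_LEVELS)
print(f"{len(NUMERIC_FEATURES)} numeric + "
      f"{len(names) - len(NUMERIC_FEATURES)} one-hot = {len(names)} features")
```

Comparing this list against the loaded `feature_names` before inference is a cheap guard against a metadata file that has drifted from the trained model's input layout.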

feature_scaler.json ADDED Viewed

	@@ -0,0 +1 @@

+ {"mean": [19.882267966775007, 264.6323582520766, 0.052154893463344176, 0.03361545684362586, 0.6047568436258577, 0.4326537739256049, 17.127121704586493, 172.1491513181654, 0.7521711809317442, 0.8172881906825569, 0.07423943661971831, 1151.8490429758035, 3.894440315600361, 0.30371975442397975, 0.1758757674250632, 0.11917659804983749, 1.0, 1.0, 0.030335861321776816, 0.053810039725532686, 0.04333694474539545, 0.025279884434814014, 0.8241242325749368, 0.02311303719754424, 0.18923799205489347, 0.10220296135789093, 0.10509209100758396, 0.6034669555796316, 0.27230046948356806, 0.26291079812206575, 0.1632358252076562, 0.30155290718671, 0.27230046948356806, 0.30155290718671, 0.1632358252076562, 0.26291079812206575, 0.30841459010473093, 0.16143011917659805, 0.20548934633441676, 0.29360780065005415, 0.031058143734200072, 0.12639942217407008, 0.13578909353557242, 0.14590104730949802, 0.11014806789454677, 0.06608884073672806, 0.09750812567713976, 0.13867822318526543, 0.1794871794871795, 0.09750812567713976, 0.11014806789454677, 0.2047670639219935, 0.58757674250632], "std": [12.12092281961143, 240.98788415799402, 0.020507195059365872, 0.012951632990740584, 0.16345254609210969, 0.1787513429787685, 9.161154583852591, 85.48823018511177, 0.13799067057693098, 0.10193473774948415, 0.02923768201623528, 772.2778476847263, 2.791161927013341, 0.45994615422530144, 0.38078320056479364, 0.3240547183619046, 1.0, 1.0, 0.1715407352835541, 0.22568321453759693, 0.20365125061044834, 0.15700227356694219, 0.38078320056479364, 0.15028965965395197, 0.3917683030033992, 0.30296974343852234, 0.30672743633583477, 0.4892658167708199, 0.4452241130504305, 0.4402939026663002, 0.3696474491122128, 0.45901507812196696, 0.4452241130504305, 0.45901507812196696, 0.3696474491122128, 0.4402939026663002, 0.4619221667863049, 0.3679936701948691, 0.40413173266488067, 0.4554966395152088, 0.17350621713351017, 0.3323589938739654, 0.34262634311115486, 0.35307074530111743, 0.31313077339534806, 0.24848220047758568, 
0.2967020106382041, 0.3456728601775323, 0.38382904647787225, 0.2967020106382041, 0.31313077339534806, 0.40360418981498464, 0.4923594837624244]}
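
The notebook applies these statistics as `(X - mean) / std` before calling the MLP. Two entries (positions 17 and 18, `has_credential_capture` and `has_user_engagement`) have mean 1.0 and std 1.0, which is what a constant column looks like if a zero standard deviation is replaced by 1.0. The training script is not part of this diff, so that guard is an assumption. The sketch below shows how such a scaler could be fitted and applied; `fit_scaler` and `standardize` are hypothetical helpers.

```python
import statistics


def fit_scaler(columns):
    """columns: one list of values per feature. Returns (means, stds)."""
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    # Assumed guard: a constant column has std 0; use 1.0 to avoid dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds


def standardize(row, means, stds):
    """Apply (x - mean) / std feature by feature."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


# Toy data: one varying feature and one flag that is 1 on every row.
columns = [
    [3.0, 13.0, 25.0, 38.0],
    [1.0, 1.0, 1.0, 1.0],
]
means, stds = fit_scaler(columns)
z = standardize([13.0, 1.0], means, stds)

print(f"means = {means}")
print(f"stds  = {stds}")
print(f"z     = {z}")
```

Under this reading, a flag that is constant in the training data always standardizes to 0 and contributes nothing to the MLP input.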

inference_example.ipynb ADDED Viewed

	@@ -0,0 +1,320 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB004 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **phishing campaign phase** of a new per-timestep telemetry record.\n",
+    "\n",
+    "**Models predict one of 7 phases:** `target_reconnaissance`, `infrastructure_setup`, `lure_crafting`, `email_delivery`, `victim_engagement`, `credential_harvesting`, `post_compromise_escalation`.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production email-security platform. See the model card for full metrics and limitations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb004-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, os\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import (\n",
+    "    transform_single, load_meta, INT_TO_LABEL, build_department_lookup\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "N_CLASSES = len(meta[\"int_to_label\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"class count:   {N_CLASSES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# XGBoost\n",
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# MLP architecture (must match training)\n",
+    "class PhaseMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=7, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Build the department lookup\n",
+    "\n",
+    "Per-department topology features (employee_count, MFA enrollment, gateway architecture, DMARC level, etc.) are pulled from `victim_topology.csv` and merged into each timestep record by `target_department_id`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb004-sample\", repo_type=\"dataset\")\n",
+    "\n",
+    "dept_lookup = build_department_lookup(\n",
+    "    os.path.join(ds_path, \"victim_topology.csv\")\n",
+    ")\n",
+    "print(f\"loaded {len(dept_lookup)} department profiles\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Prediction helper"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "\n",
+    "def predict_phase(record: dict) -> dict:\n",
+    "    \"\"\"Predict the campaign phase for one per-timestep telemetry record.\n",
+    "\n",
+    "    Per-department topology features are pulled automatically via\n",
+    "    `target_department_id` from the dept_lookup loaded above.\n",
+    "    \"\"\"\n",
+    "    dept_id = int(record.get(\"target_department_id\", -1))\n",
+    "    dept_aggs = dept_lookup.get(dept_id, {})\n",
+    "    X = transform_single(record, meta, victim_aggregates=dept_aggs)\n",
+    "\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Run on an example record\n",
+    "\n",
+    "Real `email_delivery` event lifted from the sample dataset: a nation-state APT campaign at timestep 13, with homoglyph substitution evasion active and 58 emails sent. Both models should predict `email_delivery`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Real timestep record from the sample dataset (true phase: email_delivery)\n",
+    "example_record = {\n",
+    "    \"timestep\": 13,\n",
+    "    \"emails_sent_cumulative\": 58,\n",
+    "    \"click_through_rate\": 0.1158,\n",
+    "    \"credential_submission_rate\": 0.0713,\n",
+    "    \"gateway_detection_score\": 0.7327,\n",
+    "    \"lure_personalisation_score\": 0.7507,\n",
+    "    \"evasion_technique_active\": \"homoglyph_substitution\",\n",
+    "    \"target_department_id\": 10,\n",
+    "    \"actor_capability_tier\": \"nation_state_apt\",\n",
+    "}\n",
+    "\n",
+    "result = predict_phase(example_record)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
+    "    print(f\"    P({lbl:30s}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
+    "    print(f\"    P({lbl:30s}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Note: when the two models disagree\n",
+    "\n",
+    "XGBoost and the MLP can disagree on mid-pipeline phases (`victim_engagement`, `credential_harvesting`) where timestep windows overlap. The per-class F1 in the model card identifies which phases are robustly predicted vs. which are not. In a SOC workflow, conflicting predictions are worth surfacing for human review."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Batch prediction on the sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "traj = pd.read_csv(f\"{ds_path}/campaign_trajectories.csv\")\n",
+    "\n",
+    "# Drop the leaky column the model was never trained on\n",
+    "traj = traj.drop(columns=[\"delivery_outcome\"], errors=\"ignore\")\n",
+    "\n",
+    "# Score the first 200 timesteps\n",
+    "sample = traj.head(200).copy()\n",
+    "preds = [predict_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
+    "sample[\"xgb_pred\"] = preds\n",
+    "\n",
+    "ct = pd.crosstab(sample[\"campaign_phase\"], sample[\"xgb_pred\"],\n",
+    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
+    "print(ct)\n",
+    "acc = (sample[\"campaign_phase\"] == sample[\"xgb_pred\"]).mean()\n",
+    "print(f\"\\nbatch accuracy on first 200 rows (in-distribution): {acc:.4f}\")\n",
+    "print(\"\\nNote: these rows include training-set campaigns. See validation_results.json\\n\"\n",
+    "      \"for proper held-out test metrics from disjoint campaigns.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for held-out test metrics (15 disjoint campaigns, ~580 timesteps).\n",
+    "- See `multi_seed_results.json` for the across-10-seeds robustness picture (accuracy 0.649 ± 0.038, ROC-AUC 0.937 ± 0.010).\n",
+    "- See `ablation_results.json` for per-feature-group contribution. `timestep` carries the dominant signal.\n",
+    "- The model card explains why `actor_capability_tier` was *not* used as the target despite being the README's headline use case.\n",
+    "- For the full 335k-row CYB004 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
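
The notebook's note on model disagreement suggests surfacing conflicting predictions for human review. A minimal sketch of such a check follows. It works on the dictionary shape returned by `predict_phase`. The helper name `review_flag` and the `min_confidence` threshold of 0.5 are assumptions for illustration, not values from the model card.

```python
def review_flag(result, min_confidence=0.5):
    """Summarize whether the XGBoost and MLP predictions need analyst review."""
    xgb, mlp = result["xgboost"], result["mlp"]
    xgb_conf = xgb["probabilities"][xgb["label"]]
    mlp_conf = mlp["probabilities"][mlp["label"]]
    agree = xgb["label"] == mlp["label"]
    return {
        "agree": agree,
        "xgboost": (xgb["label"], xgb_conf),
        "mlp": (mlp["label"], mlp_conf),
        # Flag disagreements, and agreements where either model is unsure.
        "needs_review": (not agree) or min(xgb_conf, mlp_conf) < min_confidence,
    }


# Hand-written example in the same shape as predict_phase() output
# (probabilities are made up and truncated to two classes).
example = {
    "xgboost": {
        "label": "victim_engagement",
        "probabilities": {"victim_engagement": 0.48, "credential_harvesting": 0.41},
    },
    "mlp": {
        "label": "credential_harvesting",
        "probabilities": {"victim_engagement": 0.35, "credential_harvesting": 0.52},
    },
}
flag = review_flag(example)
print(flag)
```

A suitable threshold would depend on how many alerts an analyst queue can absorb, so it is left as a parameter.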

model_mlp.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:072999e38cd542460473780a9c71164efc1a53a1037a4b579064cc93f3f5b4b8
+size 66788

model_xgb.json ADDED Viewed

The diff for this file is too large to render. See raw diff

multi_seed_results.json ADDED Viewed

	@@ -0,0 +1,98 @@

+{
+  "purpose": "With n=100 campaigns, single-seed metrics carry test-fold variance. Multi-seed evaluation gives a more reliable picture.",
+  "seeds_evaluated": [
+    42,
+    7,
+    13,
+    17,
+    23,
+    31,
+    45,
+    99,
+    123,
+    200
+  ],
+  "per_seed": [
+    {
+      "seed": 42,
+      "test_n_classes": 7,
+      "accuracy": 0.6547008547008547,
+      "macro_f1": 0.6401276666852063,
+      "macro_roc_auc_ovr": 0.935584434710217
+    },
+    {
+      "seed": 7,
+      "test_n_classes": 7,
+      "accuracy": 0.6267123287671232,
+      "macro_f1": 0.6141815367358149,
+      "macro_roc_auc_ovr": 0.9256987657069029
+    },
+    {
+      "seed": 13,
+      "test_n_classes": 7,
+      "accuracy": 0.5983050847457627,
+      "macro_f1": 0.5953435905708684,
+      "macro_roc_auc_ovr": 0.9235372520169014
+    },
+    {
+      "seed": 17,
+      "test_n_classes": 7,
+      "accuracy": 0.64349376114082,
+      "macro_f1": 0.6328717716731788,
+      "macro_roc_auc_ovr": 0.9426545946495839
+    },
+    {
+      "seed": 23,
+      "test_n_classes": 7,
+      "accuracy": 0.5915254237288136,
+      "macro_f1": 0.5734921834318393,
+      "macro_roc_auc_ovr": 0.9245031023094512
+    },
+    {
+      "seed": 31,
+      "test_n_classes": 7,
+      "accuracy": 0.6220095693779905,
+      "macro_f1": 0.6103022022937624,
+      "macro_roc_auc_ovr": 0.9325576570435162
+    },
+    {
+      "seed": 45,
+      "test_n_classes": 7,
+      "accuracy": 0.6678082191780822,
+      "macro_f1": 0.655097964659693,
+      "macro_roc_auc_ovr": 0.9396074000285977
+    },
+    {
+      "seed": 99,
+      "test_n_classes": 7,
+      "accuracy": 0.7111111111111111,
+      "macro_f1": 0.7136854710276727,
+      "macro_roc_auc_ovr": 0.9538147161172963
+    },
+    {
+      "seed": 123,
+      "test_n_classes": 7,
+      "accuracy": 0.6823734729493892,
+      "macro_f1": 0.6727927606720584,
+      "macro_roc_auc_ovr": 0.9443324151480283
+    },
+    {
+      "seed": 200,
+      "test_n_classes": 7,
+      "accuracy": 0.6931407942238267,
+      "macro_f1": 0.6752712902262269,
+      "macro_roc_auc_ovr": 0.9450377543018418
+    }
+  ],
+  "aggregate": {
+    "accuracy_mean": 0.6491180619923773,
+    "accuracy_std": 0.03799334369624316,
+    "accuracy_min": 0.5915254237288136,
+    "accuracy_max": 0.7111111111111111,
+    "macro_f1_mean": 0.638316643797632,
+    "macro_f1_std": 0.039956794294168915,
+    "roc_auc_mean": 0.9367328092032338,
+    "roc_auc_std": 0.009623085359130642
+  },
+  "published_artifact_seed": 42
+}
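
The `aggregate` block above can be reproduced from the `per_seed` entries. The following is a minimal sketch, not the author's evaluation script: it uses only the Python standard library, copies the ten accuracies from the file, and compares the population and sample standard deviations. The variable names are illustrative assumptions.

```python
import statistics

# Per-seed test accuracies copied verbatim from "per_seed" above.
accuracy_by_seed = {
    42: 0.6547008547008547,
    7: 0.6267123287671232,
    13: 0.5983050847457627,
    17: 0.64349376114082,
    23: 0.5915254237288136,
    31: 0.6220095693779905,
    45: 0.6678082191780822,
    99: 0.7111111111111111,
    123: 0.6823734729493892,
    200: 0.6931407942238267,
}

values = list(accuracy_by_seed.values())

# Arithmetic mean across seeds.
accuracy_mean = statistics.fmean(values)

# Population standard deviation (divides by n); this is the variant
# that reproduces the "accuracy_std" value reported in the file.
accuracy_std_population = statistics.pstdev(values)

# Sample standard deviation (divides by n - 1), shown for comparison.
accuracy_std_sample = statistics.stdev(values)

print(f"mean            = {accuracy_mean:.4f}")
print(f"std (n)         = {accuracy_std_population:.4f}")
print(f"std (n - 1)     = {accuracy_std_sample:.4f}")
print(f"min / max       = {min(values):.4f} / {max(values):.4f}")
```

With only 10 seeds the two estimators differ noticeably, so it is worth stating which one a "±" figure refers to when quoting these numbers.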

validation_results.json ADDED Viewed

	@@ -0,0 +1,246 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb004-sample",
+  "task": "7-class campaign_phase classification",
+  "baselines": {
+    "always_predict_majority_accuracy": 0.24444444444444444,
+    "majority_class": "email_delivery",
+    "random_guess_accuracy": 0.14285714285714285
+  },
+  "split": {
+    "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
+    "rationale": "100 phishing campaigns generate 3,952 timesteps (~40 per campaign). A random row-level split would leak per-campaign correlations into the test fold. A group-aware split keeps train/val/test campaigns disjoint.",
+    "campaigns_train": 69,
+    "campaigns_val": 16,
+    "campaigns_test": 15,
+    "timesteps_train": 2769,
+    "timesteps_val": 598,
+    "timesteps_test": 585,
+    "seed": 42
+  },
+  "n_features": 53,
+  "label_classes": [
+    "target_reconnaissance",
+    "infrastructure_setup",
+    "lure_crafting",
+    "email_delivery",
+    "victim_engagement",
+    "credential_harvesting",
+    "post_compromise_escalation"
+  ],
+  "class_distribution_train": {
+    "email_delivery": 655,
+    "victim_engagement": 459,
+    "post_compromise_escalation": 388,
+    "target_reconnaissance": 381,
+    "credential_harvesting": 352,
+    "lure_crafting": 300,
+    "infrastructure_setup": 234
+  },
+  "class_distribution_test": {
+    "email_delivery": 143,
+    "victim_engagement": 100,
+    "target_reconnaissance": 84,
+    "post_compromise_escalation": 75,
+    "lure_crafting": 67,
+    "credential_harvesting": 63,
+    "infrastructure_setup": 53
+  },
+  "leakage_excluded_features": [
+    "delivery_outcome (purity 0.36 vs phase; no_delivery appears only in early phases - near-oracle)"
+  ],
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 7 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.6547008547008547,
+        "macro_f1": 0.6401276666852063,
+        "weighted_f1": 0.657179533714298,
+        "per_class_f1": {
+          "target_reconnaissance": 0.8875739644970414,
+          "infrastructure_setup": 0.7115384615384616,
+          "lure_crafting": 0.6762589928057554,
+          "email_delivery": 0.7913669064748201,
+          "victim_engagement": 0.46938775510204084,
+          "credential_harvesting": 0.34074074074074073,
+          "post_compromise_escalation": 0.6040268456375839
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              75,
+              0,
+              9,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              37,
+              16,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              10,
+              10,
+              47,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              4,
+              0,
+              110,
+              28,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              21,
+              46,
+              24,
+              9
+            ],
+            [
+              0,
+              0,
+              0,
+              4,
+              16,
+              23,
+              20
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              6,
+              24,
+              45
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.935584434710217
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 53 -> 128 -> 64 -> 7, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.6427350427350428,
+        "macro_f1": 0.6275373447450349,
+        "weighted_f1": 0.6380162402905546,
+        "per_class_f1": {
+          "target_reconnaissance": 0.8313253012048193,
+          "infrastructure_setup": 0.7017543859649122,
+          "lure_crafting": 0.5606060606060606,
+          "email_delivery": 0.7612456747404844,
+          "victim_engagement": 0.3867403314917127,
+          "credential_harvesting": 0.43410852713178294,
+          "post_compromise_escalation": 0.7169811320754716
+        },
+        "confusion_matrix": {
+          "labels": [
+            "target_reconnaissance",
+            "infrastructure_setup",
+            "lure_crafting",
+            "email_delivery",
+            "victim_engagement",
+            "credential_harvesting",
+            "post_compromise_escalation"
+          ],
+          "matrix": [
+            [
+              69,
+              1,
+              14,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              40,
+              13,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              13,
+              17,
+              37,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              3,
+              1,
+              110,
+              23,
+              6,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              32,
+              35,
+              21,
+              12
+            ],
+            [
+              0,
+              0,
+              0,
+              4,
+              16,
+              28,
+              15
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              7,
+              11,
+              57
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9264812360054401
+      }
+    }
+  }
+}
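
The headline metrics in `validation_results.json` follow directly from the confusion matrices. As a hedged illustration, the sketch below recomputes accuracy, per-class F1, macro-F1, and the majority-class baseline from the XGBoost matrix using plain Python. It assumes rows are true labels and columns are predictions, which is consistent with the row sums matching `class_distribution_test`; the helper `f1_for` is a hypothetical name, not part of the repository.

```python
# Labels and XGBoost test confusion matrix copied from validation_results.json.
# Assumption: rows = true class, columns = predicted class.
labels = [
    "target_reconnaissance",
    "infrastructure_setup",
    "lure_crafting",
    "email_delivery",
    "victim_engagement",
    "credential_harvesting",
    "post_compromise_escalation",
]
matrix = [
    [75, 0, 9, 0, 0, 0, 0],
    [0, 37, 16, 0, 0, 0, 0],
    [10, 10, 47, 0, 0, 0, 0],
    [0, 4, 0, 110, 28, 1, 0],
    [0, 0, 0, 21, 46, 24, 9],
    [0, 0, 0, 4, 16, 23, 20],
    [0, 0, 0, 0, 6, 24, 45],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total


def f1_for(i: int) -> float:
    """F1 for class i: 2*TP / (2*TP + FP + FN)."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # true class i, predicted otherwise
    fp = sum(row[i] for row in matrix) - tp  # predicted class i, true otherwise
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_class_f1 = {label: f1_for(i) for i, label in enumerate(labels)}
macro_f1 = sum(per_class_f1.values()) / len(labels)

# Always predicting the most frequent true class in the test fold.
majority_baseline = max(sum(row) for row in matrix) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"macro-F1          = {macro_f1:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
for label, score in per_class_f1.items():
    print(f"{label:28s} F1 = {score:.4f}")
```

Nearly all off-diagonal mass sits between neighbouring phases (for example `victim_engagement` versus `credential_harvesting`), which explains why those two classes have the lowest F1 while the macro ROC-AUC remains high.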