Initial release: XGBoost + MLP for ransomware actor-tier attribution

Files changed (10)

README.md +444 -0
ablation_results.json +264 -0
feature_engineering.py +388 -0
feature_meta.json +157 -0
feature_scaler.json +1 -0
inference_example.ipynb +326 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
multi_seed_results.json +98 -0
validation_results.json +146 -0

README.md ADDED

	@@ -0,0 +1,444 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - ransomware
+  - threat-intelligence
+  - threat-attribution
+  - mitre-attack
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb005-sample
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+model-index:
+  - name: cyb005-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 4-class threat-actor capability tier attribution
+        dataset:
+          type: xpertsystems/cyb005-sample
+          name: CYB005 Synthetic Ransomware Attack Simulation (Sample)
+        metrics:
+          - type: roc_auc
+            value: 0.8736
+            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.6898
+            name: Test accuracy (XGBoost, seed 42)
+          - type: f1
+            value: 0.6751
+            name: Test macro-F1 (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.603
+            name: Multi-seed accuracy mean ± 0.040 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.853
+            name: Multi-seed ROC-AUC mean ± 0.031 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.8072
+            name: Test macro ROC-AUC OvR (MLP, seed 42)
+          - type: accuracy
+            value: 0.5118
+            name: Test accuracy (MLP, seed 42)
+          - type: f1
+            value: 0.5121
+            name: Test macro-F1 (MLP, seed 42)
+---
+# CYB005 Baseline Classifier
+**Threat-actor capability-tier classifier trained on the CYB005 synthetic
+ransomware campaign sample. Predicts which of 4 actor tiers
+(lone_actor / organised_syndicate / raas_affiliate / nation_state_nexus)
+is behind an observed ransomware campaign from per-timestep telemetry.**
+> **Baseline reference, not for production use.** This model demonstrates
+> that the [CYB005 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb005-sample)
+> is learnable end-to-end and gives prospective buyers a working starting
+> point for threat-attribution research. It is not a production
+> threat-intelligence system, attribution engine, or incident-response
+> tool. See [Limitations](#limitations).
+## Model overview
+| Property | Value |
+|---|---|
+| Task | 4-class actor_capability_tier classification |
+| Training data | `xpertsystems/cyb005-sample` (37,489 timesteps across 500 ransomware campaigns) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 63 (after one-hot encoding) |
+| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
+| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline |
+## Why this task — and why CYB005 ships it where CYB002/3/4 could not
+This is the first XpertSystems baseline that targets the **dataset's
+stated headline use case**. The CYB005 README's first suggested use case
+is "ransomware classifier models (4-tier actor attribution)", and that is
+exactly what this baseline ships.
+In CYB002 (kill-chain), CYB003 (malware family), and CYB004 (actor tier),
+the sample datasets had only ~100 groups (events / samples / campaigns),
+which limits group-aware test folds to ~15 unseen groups and 1.5–2 groups
+per class. Each baseline had to pivot to a phase-prediction subtask that
+was learnable at sample size.
+CYB005's sample is intentionally **5× larger — 500 campaigns** — because
+the README explicitly notes that "benchmarks are conditional on small
+actor-tier subsets". The larger sample makes a held-out test fold of
+75 disjoint campaigns possible, with each of the four tiers represented
+by 11–30 unseen test campaigns. Tier attribution becomes genuinely
+learnable, and that's what we publish.
+Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
+- `model_xgb.json` — gradient-boosted trees, primary recommendation
+- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+from huggingface_hub import hf_hub_download
+import json, numpy as np, torch, xgboost as xgb
+from safetensors.torch import load_file
+REPO = "xpertsystems/cyb005-baseline-classifier"
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+import sys, os
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import (
+    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
+)
+meta = load_meta(paths["feature_meta.json"])
+xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
+seg_lookup = build_segment_lookup("path/to/victim_topology.csv")
+# Predict for `my_record`, one per-timestep telemetry record that you supply
+# (see inference_example.ipynb for the full pattern)
+seg_aggs = seg_lookup.get(my_record["target_segment_id"], {})
+X = transform_single(my_record, meta, segment_aggregates=seg_aggs)
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for the full
+copy-paste demo.
+## Training data
+Trained on the public sample of CYB005: 37,489 per-timestep telemetry
+rows from 500 ransomware campaigns (up to 75 timesteps per campaign):
+| Tier | Campaigns | Timestep rows | Campaign share |
+|---|---:|---:|---:|
+| `organised_syndicate` | 200 | 14,998 | 40.0% |
+| `raas_affiliate` | 150 | 11,250 | 30.0% |
+| `lone_actor` | 75 | 5,625 | 15.0% |
+| `nation_state_nexus` | 75 | 5,616 | 15.0% |
+### Group-aware split
+A single campaign generates up to 75 highly correlated timesteps. Random
+row-level splitting would put timesteps from the same campaign in both
+train and test, inflating metrics in a way that does not generalize to
+new campaigns.
+This release uses **GroupShuffleSplit by `campaign_id`** (nested,
+70/15/15):
+| Fold | Campaigns | Timesteps |
+|---|---:|---:|
+| Train | 350 | 26,242 |
+| Validation | 75 | 5,624 |
+| Test | 75 | 5,623 |
+All test campaigns are completely unseen during training. Class imbalance
+is addressed with balanced class weights (`class_weight='balanced'`-style),
+passed to XGBoost as per-row `sample_weight` and to the MLP as weighted
+cross-entropy.
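The two ideas above (campaign-disjoint folds and balanced class weights) can be sketched with the standard library alone. This is a minimal illustration, not the release's training code: the release states it uses `GroupShuffleSplit`, whereas `group_split` and `balanced_sample_weights` below are hypothetical helpers written for this example, and the campaign IDs are toy values.

```python
import random
from collections import Counter


def group_split(groups, train=0.70, val=0.15, seed=42):
    """Assign row indices to folds so that no group spans two folds."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_train = round(len(unique) * train)
    n_val = round(len(unique) * val)
    fold_of = {}
    for i, g in enumerate(unique):
        if i < n_train:
            fold_of[g] = "train"
        elif i < n_train + n_val:
            fold_of[g] = "val"
        else:
            fold_of[g] = "test"
    folds = {"train": [], "val": [], "test": []}
    for row, g in enumerate(groups):
        folds[fold_of[g]].append(row)
    return folds


def balanced_sample_weights(labels):
    """Per-row weight n_rows / (n_classes * rows_in_class)."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return [n_rows / (n_classes * counts[y]) for y in labels]


# Toy data: 20 hypothetical campaigns with 3 timesteps each.
groups = [f"C{c:03d}" for c in range(20) for _ in range(3)]
folds = group_split(groups)
fold_groups = {name: {groups[i] for i in rows} for name, rows in folds.items()}
print({name: len(g) for name, g in fold_groups.items()})

# Rare classes receive proportionally larger weights.
weights = balanced_sample_weights(["a", "a", "a", "b"])
print(weights)
```

Splitting the shuffled list of unique campaign IDs, rather than the rows, is what keeps every timestep of a campaign in a single fold.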
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical feature recipe.
+63 features survive after encoding, drawn from:
+- **Per-timestep numeric** (15): `timestep`, `files_encrypted_cumulative`, `encryption_throughput_mbps`, `endpoints_compromised`, `lateral_move_count`, `credential_harvest_count`, `c2_bytes_exfiltrated`, `defender_alert_score`, `blast_radius_pct`, `living_off_land_score`, `attribution_risk_score`, `data_exfiltrated_gb`, `wiper_flag`, `double_extortion_flag`, `ir_activated`
+- **Per-timestep categorical** (2, one-hot): `attack_phase`, `detection_outcome`
+- **Victim segment** (10 numeric, 3 categorical one-hot): EDR coverage, network segmentation quality, patch posture, IR latency, endpoint count, AD domain complexity, SOC maturity score, backup recovery probability, backup recovery time, SIEM cadence; `segment_type`, `soc_maturity_tier`, `backup_maturity_tier`
+- **Engineered** (6): `c2_intensity_score`, `escalation_velocity`, `is_destructive`, `dwell_efficiency`, `is_post_detonation`, `lotl_intensity_bin`
+- **Ordinal** (1): `segment_id_hash` (segment ID hashed to integer)
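As a sanity check on the 63-feature total, the group sizes above can be added up. The sketch below assumes the categorical level counts given in the comments of `feature_engineering.py` (8 attack phases, 5 detection outcomes, 8 segment types, 6 backup tiers) and that every level becomes one column. Under those assumptions the total implies four one-hot columns for `soc_maturity_tier`, a number the README does not state directly.

```python
# Counts stated in the feature list above.
non_one_hot = {
    "per_timestep_numeric": 15,
    "topology_numeric": 10,
    "engineered": 6,
    "segment_id_hash": 1,
}

# Level counts taken from comments in feature_engineering.py (assumed to
# map one-to-one onto one-hot columns, with no level dropped).
known_one_hot = {
    "attack_phase": 8,
    "detection_outcome": 5,
    "segment_type": 8,
    "backup_maturity_tier": 6,
}

TOTAL_FEATURES = 63
soc_maturity_levels = (
    TOTAL_FEATURES - sum(non_one_hot.values()) - sum(known_one_hot.values())
)
print("implied soc_maturity_tier columns:", soc_maturity_levels)

# Cross-check against the dropped_count values in ablation_results.json,
# assuming "behavioural" = per-timestep columns except `timestep`, and
# "topology" = the joined segment columns without segment_id_hash.
behavioural = (15 - 1) + known_one_hot["attack_phase"] + known_one_hot["detection_outcome"]
topology = (
    10
    + known_one_hot["segment_type"]
    + soc_maturity_levels
    + known_one_hot["backup_maturity_tier"]
)
print("behavioural:", behavioural, "topology:", topology)
```

Both cross-checks agree with the ablation file (27 and 28 dropped columns), which supports the grouping assumed here without confirming it.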
+### Leakage audit
+Three columns were audited as potential tier oracles. **None were
+dropped** for this task:
+| Feature | Cross-tier ranges (mean) | Verdict |
+|---|---|---|
+| `attribution_risk_score` | lone 0.016 / nation_state 0.017 / organised 0.026 / raas 0.025 | Overlapping; NOT an oracle. Keep. |
+| `living_off_land_score` | lone 0.05 / nation_state 0.20 / organised 0.16 / raas 0.13 | Mild correlation with massive overlap (std 0.08–0.25). Real observable. Keep. |
+| `attack_phase` | Phase-purity vs tier is ~uniform | No oracle relationship. Keep. |
+`detection_outcome` contains a `recovery_in_progress` value that is 1:1
+identical to the `attack_phase` value of the same name (purity 0.89 vs
+phase), but this only matters for *phase* prediction, not *tier*
+prediction. The column is kept as a feature for tier work.
+Dropping the two candidate-leakage columns (`attribution_risk_score` +
+`living_off_land_score`) reduces accuracy by 2 pp, which indicates that
+they provide modest legitimate signal rather than oracle leakage. They
+are kept in the published pipeline.
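The audit refers to "purity" without defining it. One common definition, assumed here and not confirmed by the source, is the share of rows that fall into the majority class of their feature value. The `purity` helper and the toy rows below are hypothetical.

```python
from collections import Counter, defaultdict


def purity(values, labels):
    """Fraction of rows whose label is the majority label for their value."""
    by_value = defaultdict(Counter)
    for v, y in zip(values, labels):
        by_value[v][y] += 1
    majority_rows = sum(c.most_common(1)[0][1] for c in by_value.values())
    return majority_rows / len(values)


# Toy example: a feature that fully determines the label (an oracle) ...
oracle = purity(["a", "a", "b", "b"], ["x", "x", "y", "y"])
# ... versus one whose values are spread evenly across both labels.
uninformative = purity(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(oracle, uninformative)
```

Under this definition, a value near 1.0 flags a likely oracle, while a value near the majority-class share means the column says little about the label by itself.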
+## Evaluation
+### Test-set metrics, seed 42 (n = 5,623 timesteps from 75 disjoint campaigns)
+**XGBoost** (the published `model_xgb.json` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.8736** |
+| Accuracy | **0.6898** |
+| Macro-F1 | 0.6751 |
+| Weighted-F1 | 0.6881 |
+**MLP** (the published `model_mlp.safetensors` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | 0.8072 |
+| Accuracy | 0.5118 |
+| Macro-F1 | 0.5121 |
+| Weighted-F1 | 0.5160 |
+The MLP underperforms XGBoost on this task (a common pattern on tabular
+data with limited training scale). Both are published so users can pick
+the right tool, and disagreement between them is a useful triage signal.
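The XGBoost headline metrics can be recomputed from the seed-42 confusion matrix published under `full_model_metrics` in `ablation_results.json`. The sketch below assumes rows are true tiers and columns are predicted tiers. The row sums (750, 2,325, 1,725, 823) are close to whole multiples of 75 timesteps, which is consistent with that reading.

```python
LABELS = ["lone_actor", "organised_syndicate", "raas_affiliate", "nation_state_nexus"]

# Copied from ablation_results.json -> full_model_metrics -> confusion_matrix.
CM = [
    [466, 67, 216, 1],
    [83, 1795, 275, 172],
    [156, 433, 1057, 79],
    [25, 237, 0, 561],
]


def per_class_f1(cm):
    """F1 = 2*TP / (rows with that true label + rows with that prediction)."""
    k = len(cm)
    scores = []
    for i in range(k):
        true_positive = cm[i][i]
        support = sum(cm[i])
        predicted = sum(cm[r][i] for r in range(k))
        scores.append(2 * true_positive / (support + predicted))
    return scores


support = [sum(row) for row in CM]
total = sum(support)
accuracy = sum(CM[i][i] for i in range(len(CM))) / total
f1 = per_class_f1(CM)
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(s * f for s, f in zip(support, f1)) / total

for name, score in zip(LABELS, f1):
    print(f"{name:22s} F1 = {score:.4f}")
print(f"accuracy = {accuracy:.4f}, macro-F1 = {macro_f1:.4f}, weighted-F1 = {weighted_f1:.4f}")
```

Macro-F1 averages the four per-class scores equally, while weighted-F1 weights each class by its number of test rows, so the two differ whenever the larger tiers score better or worse than the smaller ones.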
+### Multi-seed robustness (XGBoost, 10 seeds)
+All 10 seeds yield all 4 tiers in the test fold. Accuracy ranges from
+0.533 to 0.690 across seeds:
+| Metric | Mean | Std | Min | Max |
+|---|---:|---:|---:|---:|
+| Accuracy | 0.603 | 0.040 | 0.533 | 0.690 |
+| Macro-F1 | 0.593 | 0.047 | 0.509 | 0.675 |
+| Macro ROC-AUC OvR | 0.853 | 0.031 | 0.796 | 0.891 |
+Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
+Seed 42 is a stronger-than-average seed: its accuracy (0.69 vs mean 0.60)
+and macro-F1 match the best values observed across the 10 seeds. The
+published artifact uses seed 42 because it produces clean ROC-AUC
+computation; the **multi-seed aggregate ROC-AUC of 0.853 ± 0.031
+is the honest performance estimate**.
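To put a number on how far seed 42 sits from a typical seed, its metrics can be expressed in standard deviations from the multi-seed mean. This is plain arithmetic on the figures reported above; the `z_score` helper is introduced only for this example.

```python
def z_score(value, mean, std):
    """Distance of one seed's metric from the multi-seed mean, in std units."""
    return (value - mean) / std


# (seed-42 value, multi-seed mean, multi-seed std) as reported in this README.
seed42_vs_aggregate = {
    "accuracy": (0.6898, 0.603, 0.040),
    "macro_f1": (0.6751, 0.593, 0.047),
    "roc_auc": (0.8736, 0.853, 0.031),
}

z = {name: z_score(*vals) for name, vals in seed42_vs_aggregate.items()}
for name, score in z.items():
    print(f"{name:9s} seed 42 is {score:+.2f} std from the mean")
```

Accuracy and macro-F1 for seed 42 sit roughly two standard deviations above the mean, while ROC-AUC sits within one. That is a reason to quote the aggregate rather than the single-seed figures when comparing against this baseline.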
+### Per-class F1 (seed 42)
+| Tier | Class share | XGBoost F1 | MLP F1 |
+|---|---:|---:|---:|
+| `organised_syndicate` | 40% | **0.739** | 0.520 |
+| `nation_state_nexus` | 15% | **0.686** | 0.602 |
+| `raas_affiliate` | 30% | 0.646 | 0.499 |
+| `lone_actor` | 15% | 0.630 | 0.428 |
+XGBoost F1 stays between 0.63 and 0.74 across the four classes, so no
+single tier collapses. The minority `nation_state_nexus` tier scores
+second-highest (F1 0.69 despite only 15% prevalence), which suggests the
+model picks up on nation-state-specific behaviours (high LotL score,
+wiper deployment, sustained C2 dwell). The hardest tier is `lone_actor`,
+the behaviourally most variable class.
+### Ablation: which feature groups matter
+| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
+|---|---:|---:|---:|---:|
+| Full feature set (published) | 0.6898 | 0.6751 | 0.8736 | — |
+| No behavioural features | 0.5673 | 0.5214 | 0.8107 | **−0.1225** |
+| No topology features | 0.6146 | 0.6302 | 0.8707 | −0.0752 |
+| No `timestep` | 0.6717 | 0.6417 | 0.8673 | −0.0181 |
+| No engineered features | 0.6882 | 0.6563 | 0.8747 | −0.0016 |
+Four findings:
+1. **Behavioural features carry the most tier signal** (drops 12 pp accuracy,
+   15 pp macro-F1 when removed). This is the most important finding:
+   tier prediction is genuinely behaviour-driven, not a topology-lookup
+   shortcut. Sustained C2 intensity, lateral-move velocity, wiper
+   deployment, and LotL technique use jointly discriminate tiers.
+2. **Topology contributes ~7 pp accuracy.** Defender posture (SOC
+   maturity, backup tier, EDR coverage) provides useful conditioning
+   context — actors target environments differently by tier.
+3. **`timestep` matters much less than for phase prediction** (drops only
+   ~2 pp). This is expected and good: phase prediction depends on
+   knowing *where* in the lifecycle you are; tier prediction depends on
+   *how* the actor operates, which is more invariant to timestep.
+4. **Engineered features barely contribute on their own** — the trees
+   recover most of the c2_intensity, escalation_velocity, etc. signal
+   directly from the raw features. They remain in the pipeline as
+   a documented baseline-feature reference.
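The ranking behind these findings follows directly from the ablation table. The sketch below recomputes each drop relative to the full feature set using the rounded values from the table; the dictionary layout is chosen for this example and does not mirror the structure of `ablation_results.json`.

```python
FULL = {"accuracy": 0.6898, "macro_f1": 0.6751, "roc_auc": 0.8736}

ABLATIONS = {
    "no_behavioural": {"accuracy": 0.5673, "macro_f1": 0.5214, "roc_auc": 0.8107},
    "no_topology": {"accuracy": 0.6146, "macro_f1": 0.6302, "roc_auc": 0.8707},
    "no_timestep": {"accuracy": 0.6717, "macro_f1": 0.6417, "roc_auc": 0.8673},
    "no_engineered": {"accuracy": 0.6882, "macro_f1": 0.6563, "roc_auc": 0.8747},
}

# Drop = full-model metric minus ablated metric (positive means the group helped).
drops = {
    name: {metric: FULL[metric] - value for metric, value in metrics.items()}
    for name, metrics in ABLATIONS.items()
}
ranked = sorted(drops, key=lambda name: drops[name]["accuracy"], reverse=True)

for name in ranked:
    d = drops[name]
    print(
        f"{name:15s} acc {d['accuracy'] * 100:5.2f} pp  "
        f"macro-F1 {d['macro_f1'] * 100:5.2f} pp  "
        f"ROC-AUC {d['roc_auc'] * 100:5.2f} pp"
    )
```

One detail is easy to miss in the table: removing the engineered features leaves ROC-AUC marginally higher than the full model, so their small contribution shows up mainly in macro-F1.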
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 4 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation mlogloss.
+**MLP:** `63 → 128 → 64 → 4`, each hidden layer followed by `BatchNorm1d`
+→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
+early stopping on validation macro-F1.
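For a sense of scale, the MLP's learnable parameter count can be estimated from the layer widths alone. This is a back-of-the-envelope sketch that assumes each `→` is a fully connected layer with bias and that each `BatchNorm1d` carries one scale and one shift per unit; the released weights may differ in such details.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of a batch-norm layer (running stats excluded)."""
    return 2 * n_features


widths = [63, 128, 64, 4]  # input -> hidden -> hidden -> output

total = sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))
total += sum(batchnorm_params(h) for h in widths[1:-1])

TRAIN_ROWS = 26_242  # training-fold timesteps reported in this README
print(f"learnable parameters: {total:,}")
print(f"training rows per parameter: {TRAIN_ROWS / total:.2f}")
```

Under these assumptions the network has about 17 thousand parameters against about 26 thousand training rows, and those rows come from only 350 campaigns with strongly correlated timesteps.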
+Training hyperparameters (learning rate, batch size, n_estimators,
+early-stopping patience, weight decay, class-weighting strategy) are
+held internally by XpertSystems and are not part of this release.
+## Limitations
+**This is a baseline reference, not a production threat-attribution system.**
+1. **Tiers are confused with one another.** In the seed-42 XGBoost
+   confusion matrix (`ablation_results.json`), the largest confusions are
+   `raas_affiliate` ↔ `organised_syndicate` (433 and 275 test rows;
+   operationally similar in mid-campaign), `nation_state_nexus` →
+   `organised_syndicate` (237 rows), and `lone_actor` → `raas_affiliate`
+   (216 rows). `lone_actor` ↔ `nation_state_nexus` is rarely confused
+   (1 and 25 rows). Confusion-matrix-aware downstream logic (e.g. flagging
+   disagreement between XGBoost and MLP for analyst review) is recommended.
+2. **MLP weaker than XGBoost.** The MLP lags ~18 pp accuracy behind
+   XGBoost. This is a common pattern on tabular data when training set
+   sizes don't justify deep-model parameter counts. Both are published;
+   the recommendation is XGBoost as the primary predictor and the MLP
+   for disagreement-as-triage signal.
+3. **Synthetic-vs-real transfer.** The dataset is synthetic and
+   calibrated to ransomware threat-intelligence benchmark targets
+   (Mandiant M-Trends, CrowdStrike GTR, Coveware Quarterly, Sophos
+   State of Ransomware, IBM CODB, Verizon DBIR, CISA #StopRansomware,
+   Chainalysis). Real ransomware telemetry has different noise
+   characteristics, adversary adaptation, and instrumentation gaps. Do
+   not assume metrics transfer.
+4. **Adversarial robustness not evaluated.** The dataset is not
+   adversarially generated; the model has not been red-teamed against
+   tier-spoofing campaigns (a real attacker may deliberately mimic
+   another tier's TTPs to evade attribution).
+5. **Per-tier sample sizes are still modest.** `lone_actor` and
+   `nation_state_nexus` have only 75 campaigns each in the sample,
+   before the 70/15/15 split. The full ~5,500-campaign CYB005 product
+   (with ~825 per minority tier) would tighten the per-class confidence
+   intervals materially.
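The disagreement-as-triage recommendation in items 1 and 2 can be sketched as a small routing function. The function name, the returned fields, and the `min_confidence` threshold are hypothetical choices for this example; the probability vectors would come from `predict_proba` on the XGBoost model and a softmax over the MLP logits.

```python
LABELS = ["lone_actor", "organised_syndicate", "raas_affiliate", "nation_state_nexus"]


def triage(proba_xgb, proba_mlp, min_confidence=0.5):
    """Return the XGBoost prediction and flag it when the models disagree
    or when XGBoost's top probability is below `min_confidence`."""
    top_xgb = max(range(len(proba_xgb)), key=proba_xgb.__getitem__)
    top_mlp = max(range(len(proba_mlp)), key=proba_mlp.__getitem__)
    disagree = top_xgb != top_mlp
    low_confidence = proba_xgb[top_xgb] < min_confidence
    return {
        "prediction": LABELS[top_xgb],
        "mlp_prediction": LABELS[top_mlp],
        "needs_review": disagree or low_confidence,
    }


# Illustrative probability vectors (not model output).
agree = triage([0.05, 0.80, 0.10, 0.05], [0.10, 0.70, 0.10, 0.10])
split = triage([0.10, 0.60, 0.20, 0.10], [0.10, 0.20, 0.60, 0.10])
print(agree)
print(split)
```

XGBoost stays the primary predictor here, in line with the recommendation above. The MLP only influences whether a record is routed to an analyst.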
+## Notes on dataset schema
+The CYB005 sample dataset README describes some fields differently
+from the actual schema. The model was trained on the actual schema;
+this note helps buyers reconcile what they read with what they receive.
+| What the README says | What the data actually contains |
+|---|---|
+| "7 attack phases" (initial_access, persistence, privilege_escalation, lateral_movement, data_exfiltration, encryption_deployment, ransom_demand) | **8 attack phases**: `initial_access`, `internal_recon`, `privilege_escalation`, `lateral_movement`, `exfiltration_staging`, `encryption_detonation`, `ransom_negotiation`, `recovery_in_progress`. (No `persistence` phase as a distinct value; `recovery_in_progress` is the dominant phase at 35% of rows because campaigns run beyond detonation.) |
+| Backup tiers include `cloud_replicated`, `immutable_object_lock` | Backup tiers in the actual data use `offsite_unverified`, `offsite_verified_immutable` for those concepts |
+| Summary has `campaign_outcome`, `dwell_time_pre_detonation_hrs` | Neither field exists. Use `total_dwell_time_hrs` and `campaign_success_flag` / `detection_phase` instead |
+| Per-timestep includes `endpoints_compromised`, `lateral_pivots`, `edr_alerted`, `siem_correlated`, `lotl_technique_used`, `vss_deletion_attempted`, `wiper_component_deployed`, `dwell_hours`, `c2_beacon_active`, `backup_maturity_tier` | Actual per-timestep columns: `endpoints_compromised` ✓, `lateral_move_count` (not pivots), no `edr_alerted`/`siem_correlated`/`vss_deletion_attempted`/`dwell_hours`/`c2_beacon_active`; `defender_alert_score` and `attribution_risk_score` exist instead; `backup_maturity_tier` is only on per-segment `victim_topology`, not per-timestep |
+None of these discrepancies affects model correctness — the feature
+pipeline uses the actual column names. If you build your own pipeline
+against the dataset, use the actual columns.
+## Intended use
+- **Evaluating fit** of the CYB005 dataset for your threat-attribution
+  or ransomware-research work
+- **Baseline reference** for new model architectures (especially
+  sequence models, which should beat this baseline by leveraging
+  temporal context across the 75-step campaign)
+- **Teaching and demo** for multi-class tabular classification on
+  cybersecurity telemetry
+- **Feature engineering reference** for ransomware campaign attribution
+## Out-of-scope use
+- Production threat-actor attribution on real ransomware campaigns
+- Incident-response decision-making on real systems
+- Adversarial-evasion evaluation (dataset not adversarially generated)
+- Any operational security or law-enforcement decision
+## Reproducibility
+Outputs above were produced with `seed = 42` (published artifact),
+group-aware nested `GroupShuffleSplit` (70/15/15 by campaign_id), on the
+published sample (`xpertsystems/cyb005-sample`, version 1.0.0, generated
+2026-05-16). The feature pipeline in `feature_engineering.py` is
+deterministic and the trained weights in this repo correspond exactly
+to the metrics above.
+Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
+`multi_seed_results.json` quantify the variation across splits
+(accuracy 0.603 ± 0.040, ROC-AUC 0.853 ± 0.031).
+The training script itself is private to XpertSystems.
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights (seed 42) |
+| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
+| `feature_engineering.py` | Feature pipeline (load → join topology → engineer → encode) |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Per-feature-group ablation |
+| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB005** dataset contains ~358,000 rows across four files,
+with calibrated benchmark validation against 12 metrics drawn from
+authoritative ransomware threat-intelligence sources (Mandiant
+M-Trends, CrowdStrike GTR, Coveware Quarterly Ransomware Report,
+Sophos State of Ransomware, IBM CODB, Verizon DBIR, CISA
+#StopRansomware, Chainalysis). The full XpertSystems.ai synthetic
+data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
+Insurance & Risk, Oil & Gas, and Materials & Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb005-sample
+- 🤖 Companion models:
+  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
+  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
+  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
+  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
+## Citation
+```bibtex
+@misc{xpertsystems_cyb005_baseline_2026,
+  title  = {CYB005 Baseline Classifier: XGBoost and MLP for Ransomware Actor-Tier Attribution},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb005-baseline-classifier},
+  note   = {Baseline reference model trained on xpertsystems/cyb005-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,264 @@

+{
+  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.6898452783211808,
+    "macro_f1": 0.6751447018282526,
+    "weighted_f1": 0.6881356546405818,
+    "per_class_f1": {
+      "lone_actor": 0.6297297297297297,
+      "organised_syndicate": 0.7391393864525427,
+      "raas_affiliate": 0.6458906202260922,
+      "nation_state_nexus": 0.6858190709046454
+    },
+    "confusion_matrix": {
+      "labels": [
+        "lone_actor",
+        "organised_syndicate",
+        "raas_affiliate",
+        "nation_state_nexus"
+      ],
+      "matrix": [
+        [
+          466,
+          67,
+          216,
+          1
+        ],
+        [
+          83,
+          1795,
+          275,
+          172
+        ],
+        [
+          156,
+          433,
+          1057,
+          79
+        ],
+        [
+          25,
+          237,
+          0,
+          561
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.873606865711172
+  },
+  "ablations": {
+    "no_topology": {
+      "n_features": 35,
+      "dropped_count": 28,
+      "metrics": {
+        "model": "xgboost_no_topology",
+        "accuracy": 0.6146185310332563,
+        "macro_f1": 0.630244354214636,
+        "weighted_f1": 0.6146007963862242,
+        "per_class_f1": {
+          "lone_actor": 0.5802285146547441,
+          "organised_syndicate": 0.595659765527563,
+          "raas_affiliate": 0.5862656072644722,
+          "nation_state_nexus": 0.7588235294117647
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              584,
+              41,
+              110,
+              15
+            ],
+            [
+              308,
+              1194,
+              655,
+              168
+            ],
+            [
+              273,
+              370,
+              1033,
+              49
+            ],
+            [
+              98,
+              79,
+              1,
+              645
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8706652220620055
+      },
+      "delta_accuracy": 0.07522674728792456,
+      "delta_macro_f1": 0.04490034761361661
+    },
+    "no_behavioural": {
+      "n_features": 36,
+      "dropped_count": 27,
+      "metrics": {
+        "model": "xgboost_no_behavioural",
+        "accuracy": 0.5673128223368309,
+        "macro_f1": 0.5213632789864133,
+        "weighted_f1": 0.5706324884542183,
+        "per_class_f1": {
+          "lone_actor": 0.44366608289550497,
+          "organised_syndicate": 0.6739977090492555,
+          "raas_affiliate": 0.5680505911465493,
+          "nation_state_nexus": 0.3997387328543436
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              380,
+              45,
+              306,
+              19
+            ],
+            [
+              101,
+              1471,
+              498,
+              255
+            ],
+            [
+              319,
+              245,
+              1033,
+              128
+            ],
+            [
+              163,
+              279,
+              75,
+              306
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8106558391572862
+      },
+      "delta_accuracy": 0.12253245598434992,
+      "delta_macro_f1": 0.15378142284183927
+    },
+    "no_timestep": {
+      "n_features": 62,
+      "dropped_count": 1,
+      "metrics": {
+        "model": "xgboost_no_timestep",
+        "accuracy": 0.6717054952872132,
+        "macro_f1": 0.6417349625987673,
+        "weighted_f1": 0.6719572046072043,
+        "per_class_f1": {
+          "lone_actor": 0.5438813349814586,
+          "organised_syndicate": 0.7479365079365079,
+          "raas_affiliate": 0.6453731343283582,
+          "nation_state_nexus": 0.6297488731487444
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              440,
+              66,
+              240,
+              4
+            ],
+            [
+              154,
+              1767,
+              230,
+              174
+            ],
+            [
+              169,
+              412,
+              1081,
+              63
+            ],
+            [
+              105,
+              155,
+              74,
+              489
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8672596014037719
+      },
+      "delta_accuracy": 0.01813978303396757,
+      "delta_macro_f1": 0.033409739229485313
+    },
+    "no_engineered": {
+      "n_features": 57,
+      "dropped_count": 6,
+      "metrics": {
+        "model": "xgboost_no_engineered",
+        "accuracy": 0.6882447092299484,
+        "macro_f1": 0.6562913668551777,
+        "weighted_f1": 0.6881813027750402,
+        "per_class_f1": {
+          "lone_actor": 0.5686274509803921,
+          "organised_syndicate": 0.7419631375910845,
+          "raas_affiliate": 0.7053364269141531,
+          "nation_state_nexus": 0.6092384519350812
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              435,
+              71,
+              219,
+              25
+            ],
+            [
+              127,
+              1731,
+              287,
+              180
+            ],
+            [
+              107,
+              316,
+              1216,
+              86
+            ],
+            [
+              111,
+              223,
+              1,
+              488
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.874702950892121
+      },
+      "delta_accuracy": 0.0016005690912324066,
+      "delta_macro_f1": 0.018853334973074842
+    }
+  }
+}

feature_engineering.py ADDED

	@@ -0,0 +1,388 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB005 baseline classifier.
+Predicts `actor_capability_tier` (4-class) from per-timestep ransomware
+campaign telemetry on the CYB005 sample dataset.
+CSV inputs:
+    attack_timelines.csv    (primary, one row per timestep, 500 campaigns
+                             x up to 75 timesteps = 37,489 rows)
+    victim_topology.csv     (per-segment defender configuration, joined
+                             on target_segment_id; one row per segment)
+    campaign_summary.csv    (per-campaign aggregates; reserved for future
+                             work - many fields are post-hoc outcomes that
+                             would leak the tier through training)
+    campaign_events.csv     (discrete event log; reserved for future work)
+Target classes (4):
+    lone_actor, organised_syndicate, raas_affiliate, nation_state_nexus
+Sample size note
+----------------
+CYB005's sample is intentionally larger than its sister datasets (500
+campaigns vs 100 in CYB002/3/4). The README states this is because
+"benchmarks are conditional on small actor-tier subsets". The larger
+sample makes tier attribution genuinely learnable here, where it was
+not in CYB003/CYB004.
+Leakage audit
+-------------
+Three columns inspected for tier leakage:
+- `attribution_risk_score` - mean 0.016-0.026 across tiers, ranges
+  overlap heavily. NOT an oracle; keep.
+- `living_off_land_score` - mean 0.05 (lone) to 0.20 (nation_state),
+  with substantial overlap (std 0.08-0.25). Real observable, not
+  an oracle; keep.
+- `attack_phase` - 89% purity vs `detection_outcome` (recovery_in_progress
+  is a 1:1 alias), but for TIER prediction it has no oracle relationship.
+  Keep.
+No columns are dropped for tier prediction. The model is trained on what
+a SOC analyst would actually see at observation time.
+Public API
+----------
+    build_features(timelines_path, topology_path)
+        -> (X, y, groups, meta)
+    transform_single(record, meta, segment_aggregates=None) -> np.ndarray
+    save_meta(meta, path) / load_meta(path)
+    build_segment_lookup(topology_path) -> dict
+License
+-------
+Ships with the public model on Hugging Face under CC-BY-NC-4.0,
+matching the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Label space
+# ---------------------------------------------------------------------------
+# Ordered roughly by capability: lone -> nation_state. Class imbalance:
+# organised_syndicate (40%), raas_affiliate (30%), lone_actor (15%),
+# nation_state_nexus (15%).
+LABEL_ORDER = [
+    "lone_actor",
+    "organised_syndicate",
+    "raas_affiliate",
+    "nation_state_nexus",
+]
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# ---------------------------------------------------------------------------
+# Identifier and target columns - not features
+# ---------------------------------------------------------------------------
+ID_COLUMNS = ["campaign_id", "actor_id"]
+TARGET_COLUMN = "actor_capability_tier"
+# No columns dropped for leakage. See module docstring's "Leakage audit"
+# for the rationale on each candidate.
+LEAKY_COLUMNS: list[str] = []
+# ---------------------------------------------------------------------------
+# Per-timestep numeric features
+# ---------------------------------------------------------------------------
+DIRECT_NUMERIC_TIMESTEP_FEATURES = [
+    "timestep",                       # position in 75-step lifecycle
+    "files_encrypted_cumulative",
+    "encryption_throughput_mbps",
+    "endpoints_compromised",
+    "lateral_move_count",
+    "credential_harvest_count",
+    "c2_bytes_exfiltrated",
+    "defender_alert_score",
+    "blast_radius_pct",
+    "living_off_land_score",
+    "attribution_risk_score",
+    "data_exfiltrated_gb",
+    "wiper_flag",
+    "double_extortion_flag",
+    "ir_activated",
+]
+# Per-timestep categoricals to one-hot
+CATEGORICAL_TIMESTEP_FEATURES = [
+    "attack_phase",        # 8 phases
+    "detection_outcome",   # 5 outcomes incl. recovery_in_progress
+]
+# ---------------------------------------------------------------------------
+# Victim topology features (joined on target_segment_id == segment_id)
+# ---------------------------------------------------------------------------
+# victim_topology.csv is segment-level (300 rows, one per segment). Each
+# campaign targets one segment, so these become per-campaign-constant
+# features. They provide useful conditioning context (what defender
+# posture is the actor working against) without being tier oracles.
+TOPOLOGY_NUMERIC_FEATURES = [
+    "edr_coverage_rate",
+    "network_segmentation_quality",
+    "patch_posture_score",
+    "ir_activation_latency_hrs",
+    "endpoint_count",
+    "ad_domain_complexity",
+    "soc_maturity_score",
+    "backup_recovery_prob",
+    "backup_recovery_hrs_mean",
+    "siem_rule_refresh_cadence_days",
+]
+TOPOLOGY_CATEGORICAL_FEATURES = [
+    "segment_type",            # 8 values: corporate_workstation_fleet / dmz_perimeter / cloud_workload_tier / ot_ics_control_network / ...
+    "soc_maturity_tier",       # 4 values: none / tier1 / tier2 / tier3_mdr
+    "backup_maturity_tier",    # 6 values: no_backup / local_only / network_attached / ...
+]
+# ---------------------------------------------------------------------------
+# Engineered features
+# ---------------------------------------------------------------------------
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Six engineered features encoding tier-discriminative hypotheses.
+    Each is a behavioural composite that a threat analyst would compute
+    by hand to distinguish actor sophistication levels.
+    """
+    df = df.copy()
+    # 1. C2 intensity: log-scaled product of C2 bytes exfiltrated and
+    #    encryption throughput. Nation-state and organised tiers tend to
+    #    sustain high values of both; lone actors burst, then go quiet.
+    df["c2_intensity_score"] = np.log1p(
+        df["c2_bytes_exfiltrated"].clip(lower=0)
+        * df["encryption_throughput_mbps"].clip(lower=0)
+    ).astype(float)
+    # 2. Escalation velocity: lateral moves per timestep elapsed.
+    #    Higher = aggressive (raas/syndicate). Lower = methodical (apt).
+    df["escalation_velocity"] = (
+        df["lateral_move_count"] / df["timestep"].clip(lower=1)
+    ).astype(float)
+    # 3. Destructive intent: wiper or double_extortion deployed.
+    #    Wiper is a strong nation_state signature.
+    df["is_destructive"] = (
+        (df["wiper_flag"] == 1) | (df["double_extortion_flag"] == 1)
+    ).astype(int)
+    # 4. Dwell efficiency: blast radius per timestep. High = fast,
+    #    low = patient. Helps separate organised_syndicate (fast) from
+    #    nation_state_nexus (patient).
+    df["dwell_efficiency"] = (
+        df["blast_radius_pct"] / df["timestep"].clip(lower=1)
+    ).astype(float)
+    # 5. Post-detonation indicator. Timesteps after 50 are typically
+    #    encryption_detonation / ransom_negotiation / recovery phases,
+    #    which surface tier signal through ransom posture.
+    df["is_post_detonation"] = (df["timestep"] > 50).astype(int)
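+    # 6. LotL intensity bin. Four fixed-threshold bins of
+    #    living_off_land_score (inner edges 0.1 / 0.3 / 0.6, not quartiles)
+    #    give the trees a categorical view of an otherwise continuous
+    #    tier-correlated feature. Values outside the bin edges (or NaN)
+    #    map to no bin, and the int cast below then fails.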
+    df["lotl_intensity_bin"] = pd.cut(
+        df["living_off_land_score"], bins=[-0.01, 0.1, 0.3, 0.6, 1.01],
+        labels=[0, 1, 2, 3],
+    ).astype(int)
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    timelines_path: str | Path,
+    topology_path: str | Path,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+    """
+    Load CSVs, join topology, drop target + identifiers, engineer features,
+    one-hot encode, return (X, y, groups, meta).
+    `groups` is a Series of campaign_id values aligned with X. Use it with
+    GroupShuffleSplit / GroupKFold so train and test sets contain disjoint
+    campaigns - each campaign generates 75 highly-correlated timesteps.
+    """
+    timelines = pd.read_csv(timelines_path)
+    topo = pd.read_csv(topology_path)
+    y = timelines[TARGET_COLUMN].map(LABEL_TO_INT)
+    if y.isna().any():
+        bad = timelines.loc[y.isna(), TARGET_COLUMN].unique()
+        raise ValueError(f"Unknown actor_capability_tier values: {bad}")
+    y = y.astype(int)
+    groups = timelines["campaign_id"].copy()
+    timelines = timelines.drop(
+        columns=ID_COLUMNS + [TARGET_COLUMN] + LEAKY_COLUMNS, errors="ignore",
+    )
+    # Join victim topology features on target_segment_id == segment_id
+    topo_cols_needed = (
+        ["segment_id"] + TOPOLOGY_NUMERIC_FEATURES + TOPOLOGY_CATEGORICAL_FEATURES
+    )
+    timelines = timelines.merge(
+        topo[topo_cols_needed],
+        left_on="target_segment_id", right_on="segment_id", how="left",
+    ).drop(columns=["segment_id"], errors="ignore")
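+    # target_segment_id is high-cardinality (251 unique). Encode it as a
+    # single integer category code rather than one-hot. Despite the column
+    # name this is not a hash: codes follow the sorted order of the IDs
+    # present in this file, so they are only stable for the same dataset.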
+    timelines["segment_id_hash"] = (
+        timelines["target_segment_id"].astype("category").cat.codes.astype(float)
+    )
+    timelines = timelines.drop(columns=["target_segment_id"])
+    timelines = _add_engineered_features(timelines)
+    numeric_features = (
+        DIRECT_NUMERIC_TIMESTEP_FEATURES
+        + TOPOLOGY_NUMERIC_FEATURES
+        + [
+            "segment_id_hash",
+            "c2_intensity_score", "escalation_velocity", "is_destructive",
+            "dwell_efficiency", "is_post_detonation", "lotl_intensity_bin",
+        ]
+    )
+    X_numeric = timelines[numeric_features].astype(float)
+    all_categorical = (
+        [(col, "timestep") for col in CATEGORICAL_TIMESTEP_FEATURES]
+        + [(col, "topology") for col in TOPOLOGY_CATEGORICAL_FEATURES]
+    )
+    categorical_levels: dict[str, list[str]] = {}
+    blocks: list[pd.DataFrame] = []
+    for col, _src in all_categorical:
+        if col not in timelines.columns:
+            continue
+        levels = sorted(timelines[col].dropna().unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            timelines[col].astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        blocks.append(block)
+    X = pd.concat(
+        [X_numeric.reset_index(drop=True)]
+        + [b.reset_index(drop=True) for b in blocks],
+        axis=1,
+    ).fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+        "leakage_excluded": LEAKY_COLUMNS,
+    }
+    return X, y, groups, meta
+def transform_single(
+    record: dict | pd.DataFrame,
+    meta: dict[str, Any],
+    segment_aggregates: dict | None = None,
+) -> np.ndarray:
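+    """
+    Encode one timestep record (dict or DataFrame) for inference.
+    `segment_aggregates` maps topology column names to the values for the
+    record's segment, e.g. one entry of build_segment_lookup(). The raw
+    columns used by _add_engineered_features must be present; any other
+    missing feature column is filled with 0, and an unseen categorical
+    level encodes as an all-zero one-hot block. Returns a float32 array
+    of shape (n_rows, len(meta["feature_names"])).
+    """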
+    if isinstance(record, dict):
+        df = pd.DataFrame([record.copy()])
+    else:
+        df = record.copy()
+    if segment_aggregates is not None:
+        for k, v in segment_aggregates.items():
+            df[k] = v
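+    # segment_id_hash is dataset-specific (see build_features), so unless
+    # the caller supplies it, fall back to 0.0. Note that 0.0 is also the
+    # code of the first segment seen in training, not a dedicated
+    # "unknown" value.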
+    if "segment_id_hash" not in df.columns:
+        df["segment_id_hash"] = 0.0
+    if "target_segment_id" in df.columns:
+        df = df.drop(columns=["target_segment_id"])
+    df = _add_engineered_features(df)
+    numeric = pd.DataFrame({
+        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+        for col in meta["numeric_features"]
+    })
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df)))
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        for lvl in levels:
+            cname = f"{col}_{lvl}"
+            if cname not in block.columns:
+                block[cname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+        "leakage_excluded": meta.get("leakage_excluded", []),
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+def build_segment_lookup(topology_path: str | Path) -> dict[str, dict]:
+    """Build {segment_id: {topology feature values}} for inference-time lookup."""
+    topo = pd.read_csv(topology_path)
+    cols = TOPOLOGY_NUMERIC_FEATURES + TOPOLOGY_CATEGORICAL_FEATURES
+    out = {}
+    for _, row in topo.iterrows():
+        out[row["segment_id"]] = {c: row[c] for c in cols if c in topo.columns}
+    return out
+if __name__ == "__main__":
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, groups, meta = build_features(
+        base / "attack_timelines.csv",
+        base / "victim_topology.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"groups: {groups.nunique()} campaigns")
+    print(f"n features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
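
The `build_features` docstring recommends splitting by `groups` so that no campaign lands in both train and test. The sketch below illustrates that idea with a hand-rolled, standard-library stand-in for scikit-learn's `GroupShuffleSplit`; it is not the repo's training code. The `group_split` helper, the 15% test fraction, and the toy campaign IDs are all assumptions made for illustration.

```python
import random


def group_split(groups, test_frac=0.15, seed=42):
    """Return (train_idx, test_idx) such that no group appears in both.

    Hypothetical helper: shuffles the unique group IDs, assigns the first
    `test_frac` share of them to the test side, then maps rows to sides.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data (assumption): 20 campaigns with 5 timesteps each.
groups = [f"CAMP{c:03d}" for c in range(20) for _ in range(5)]
train_idx, test_idx = group_split(groups)

train_campaigns = {groups[i] for i in train_idx}
test_campaigns = {groups[i] for i in test_idx}
print(f"train campaigns: {len(train_campaigns)}, test campaigns: {len(test_campaigns)}")
print(f"overlap: {train_campaigns & test_campaigns}")
```

Splitting on campaign IDs rather than rows is what keeps the correlated timesteps of one campaign on a single side of the split.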

feature_meta.json ADDED Viewed

	@@ -0,0 +1,157 @@

+{
+  "feature_names": [
+    "timestep",
+    "files_encrypted_cumulative",
+    "encryption_throughput_mbps",
+    "endpoints_compromised",
+    "lateral_move_count",
+    "credential_harvest_count",
+    "c2_bytes_exfiltrated",
+    "defender_alert_score",
+    "blast_radius_pct",
+    "living_off_land_score",
+    "attribution_risk_score",
+    "data_exfiltrated_gb",
+    "wiper_flag",
+    "double_extortion_flag",
+    "ir_activated",
+    "edr_coverage_rate",
+    "network_segmentation_quality",
+    "patch_posture_score",
+    "ir_activation_latency_hrs",
+    "endpoint_count",
+    "ad_domain_complexity",
+    "soc_maturity_score",
+    "backup_recovery_prob",
+    "backup_recovery_hrs_mean",
+    "siem_rule_refresh_cadence_days",
+    "segment_id_hash",
+    "c2_intensity_score",
+    "escalation_velocity",
+    "is_destructive",
+    "dwell_efficiency",
+    "is_post_detonation",
+    "lotl_intensity_bin",
+    "attack_phase_encryption_detonation",
+    "attack_phase_exfiltration_staging",
+    "attack_phase_initial_access",
+    "attack_phase_internal_recon",
+    "attack_phase_lateral_movement",
+    "attack_phase_privilege_escalation",
+    "attack_phase_ransom_negotiation",
+    "attack_phase_recovery_in_progress",
+    "detection_outcome_alert_generated",
+    "detection_outcome_delayed_detection",
+    "detection_outcome_no_detection",
+    "detection_outcome_partial_containment",
+    "detection_outcome_recovery_in_progress",
+    "segment_type_active_directory_domain",
+    "segment_type_backup_infrastructure",
+    "segment_type_cloud_workload_tier",
+    "segment_type_corporate_workstation_fleet",
+    "segment_type_dmz_perimeter",
+    "segment_type_executive_endpoint_zone",
+    "segment_type_file_server_cluster",
+    "segment_type_ot_ics_control_network",
+    "soc_maturity_tier_none",
+    "soc_maturity_tier_tier1",
+    "soc_maturity_tier_tier2",
+    "soc_maturity_tier_tier3_mdr",
+    "backup_maturity_tier_air_gapped_gold_standard",
+    "backup_maturity_tier_local_only",
+    "backup_maturity_tier_network_attached",
+    "backup_maturity_tier_no_backup",
+    "backup_maturity_tier_offsite_unverified",
+    "backup_maturity_tier_offsite_verified_immutable"
+  ],
+  "numeric_features": [
+    "timestep",
+    "files_encrypted_cumulative",
+    "encryption_throughput_mbps",
+    "endpoints_compromised",
+    "lateral_move_count",
+    "credential_harvest_count",
+    "c2_bytes_exfiltrated",
+    "defender_alert_score",
+    "blast_radius_pct",
+    "living_off_land_score",
+    "attribution_risk_score",
+    "data_exfiltrated_gb",
+    "wiper_flag",
+    "double_extortion_flag",
+    "ir_activated",
+    "edr_coverage_rate",
+    "network_segmentation_quality",
+    "patch_posture_score",
+    "ir_activation_latency_hrs",
+    "endpoint_count",
+    "ad_domain_complexity",
+    "soc_maturity_score",
+    "backup_recovery_prob",
+    "backup_recovery_hrs_mean",
+    "siem_rule_refresh_cadence_days",
+    "segment_id_hash",
+    "c2_intensity_score",
+    "escalation_velocity",
+    "is_destructive",
+    "dwell_efficiency",
+    "is_post_detonation",
+    "lotl_intensity_bin"
+  ],
+  "categorical_levels": {
+    "attack_phase": [
+      "encryption_detonation",
+      "exfiltration_staging",
+      "initial_access",
+      "internal_recon",
+      "lateral_movement",
+      "privilege_escalation",
+      "ransom_negotiation",
+      "recovery_in_progress"
+    ],
+    "detection_outcome": [
+      "alert_generated",
+      "delayed_detection",
+      "no_detection",
+      "partial_containment",
+      "recovery_in_progress"
+    ],
+    "segment_type": [
+      "active_directory_domain",
+      "backup_infrastructure",
+      "cloud_workload_tier",
+      "corporate_workstation_fleet",
+      "dmz_perimeter",
+      "executive_endpoint_zone",
+      "file_server_cluster",
+      "ot_ics_control_network"
+    ],
+    "soc_maturity_tier": [
+      "none",
+      "tier1",
+      "tier2",
+      "tier3_mdr"
+    ],
+    "backup_maturity_tier": [
+      "air_gapped_gold_standard",
+      "local_only",
+      "network_attached",
+      "no_backup",
+      "offsite_unverified",
+      "offsite_verified_immutable"
+    ]
+  },
+  "label_to_int": {
+    "lone_actor": 0,
+    "organised_syndicate": 1,
+    "raas_affiliate": 2,
+    "nation_state_nexus": 3
+  },
+  "int_to_label": {
+    "0": "lone_actor",
+    "1": "organised_syndicate",
+    "2": "raas_affiliate",
+    "3": "nation_state_nexus"
+  },
+  "leakage_excluded": []
+}
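
In the file above, `feature_names` is the 32 `numeric_features` followed by one `{column}_{level}` entry per categorical level (8 + 5 + 8 + 4 + 6), for 63 columns in total. A minimal sketch of a consistency check along those lines follows; `expected_feature_names` and `toy_meta` are hypothetical and only mimic the structure of `feature_meta.json`.

```python
def expected_feature_names(meta):
    """Rebuild the column order: numeric features, then one-hot blocks.

    Relies on dicts preserving insertion order (Python 3.7+), which also
    holds for dicts produced by json.load.
    """
    names = list(meta["numeric_features"])
    for col, levels in meta["categorical_levels"].items():
        names.extend(f"{col}_{lvl}" for lvl in levels)
    return names


# Toy metadata (assumption): same keys as feature_meta.json, far fewer entries.
toy_meta = {
    "numeric_features": ["timestep", "wiper_flag"],
    "categorical_levels": {
        "soc_maturity_tier": ["none", "tier1", "tier2", "tier3_mdr"],
    },
    "feature_names": [
        "timestep", "wiper_flag",
        "soc_maturity_tier_none", "soc_maturity_tier_tier1",
        "soc_maturity_tier_tier2", "soc_maturity_tier_tier3_mdr",
    ],
}

is_consistent = expected_feature_names(toy_meta) == toy_meta["feature_names"]
print(f"feature_names consistent with levels: {is_consistent}")
```

Running the same check on the loaded metadata before inference is a cheap way to catch a hand-edited or mismatched `feature_meta.json`.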

feature_scaler.json ADDED Viewed

	@@ -0,0 +1 @@

+ {"mean": [37.00038106851612, 17991.317582501335, 14.544094504991996, 74.80169194421157, 138.9533191067754, 7.282105022483043, 27653651.954919595, 0.37065581891624116, 0.022953715418032164, 0.13800967914030943, 0.02037649569392577, 3.1598579757640426, 0.0799862815334197, 0.5887508574041612, 0.11912201813886136, 0.5610207110738511, 0.5038509793460865, 0.6147713474582729, 211.73937619083915, 403.9024845667251, 0.42209049615120803, 0.4762784848715799, 0.3303223839646368, 232.6633640728603, 97.50316286868379, 123.18851459492417, 2.460235469644747, 3.32033272147805, 0.6430150140995351, 0.0003669160190541892, 0.32009755354012653, 0.5515585702309275, 0.11199603688743236, 0.0907324136879811, 0.07987196097858396, 0.11043365597134365, 0.13082082158372074, 0.0968676167975002, 0.033800777379772884, 0.34547671671366514, 0.3409801082234586, 0.033724563676549045, 0.19442115692401493, 0.08539745446231232, 0.34547671671366514, 0.15707644234433352, 0.09999237862967762, 0.13143053120951148, 0.1342123313771816, 0.14575870741559332, 0.10288849935218353, 0.09717247161039555, 0.13146863806112338, 0.21141681274293117, 0.26575718314152885, 0.30569316363082083, 0.21713284048471915, 0.05716027741787973, 0.16850849782790947, 0.31422909839189084, 0.054302263546985745, 0.2343571374133069, 0.1714427254020273], "std": [21.65185417886005, 70075.14437404645, 46.623050778770434, 138.8927820606075, 333.3671687687729, 10.411700758798842, 112444742.30416173, 0.45135105008302945, 0.07477089986783442, 0.20000045211083176, 0.05994709512338441, 8.340943676601865, 0.27127712884039307, 0.49206962131159826, 0.32393820662667727, 0.17535640681475026, 0.19840256506792886, 0.206597959364603, 215.90980436865328, 223.6938623978402, 0.2049917965317542, 0.30290492736879715, 0.26507070287211687, 186.3537217786501, 49.81550590467409, 68.8922730412834, 6.962899822938947, 8.322473481208718, 0.47911945627130226, 0.0012057216657705649, 0.46652267197064756, 0.8570014461906231, 0.3153675864625374, 0.28723367966000857, 
0.271100039653078, 0.3134354914270541, 0.3372107166070688, 0.29578305477334943, 0.18071947703591806, 0.4755325142084713, 0.4740477164149472, 0.1805227390800372, 0.3957619729670082, 0.2794775584414953, 0.4755325142084713, 0.36387975936434375, 0.2999955539014646, 0.3378770441859857, 0.34088679887143486, 0.3528708710157629, 0.3038189815388037, 0.2961981188534845, 0.3379186095056718, 0.4083210715101615, 0.4417439743053282, 0.46070917250642185, 0.41230164679802983, 0.2321530397678419, 0.37432435596831565, 0.4642169579437471, 0.2266174854606215, 0.423604423344658, 0.37690254787912686]}
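
The notebook applies these statistics to the MLP input as `(X - MU) / SD`. The sketch below shows the same standardisation in plain Python, assuming the `mean` and `std` arrays hold one entry per feature in `feature_names` order. The length check and the zero-std guard are additions for illustration and are not part of the notebook; `standardize` and `toy_scaler` are hypothetical names.

```python
def standardize(row, mean, std, eps=1e-12):
    """Apply (x - mean) / std per feature.

    A feature whose std is (near) zero is only centred, which avoids a
    division by zero for constant columns.
    """
    if not (len(row) == len(mean) == len(std)):
        raise ValueError("row, mean and std must have the same length")
    return [
        (x - m) / (s if s > eps else 1.0)
        for x, m, s in zip(row, mean, std)
    ]


# Toy scaler (assumption): three features instead of the full set.
toy_scaler = {"mean": [37.0, 0.5, 0.0], "std": [21.65, 0.25, 0.0]}
scaled = standardize([58.65, 0.75, 0.0], toy_scaler["mean"], toy_scaler["std"])
print(scaled)
```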

inference_example.ipynb ADDED Viewed

	@@ -0,0 +1,326 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB005 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **threat-actor capability tier** of a ransomware campaign from a per-timestep telemetry record.\n",
+    "\n",
+    "**Models predict one of 4 tiers:** `lone_actor`, `organised_syndicate`, `raas_affiliate`, `nation_state_nexus`.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production threat-attribution system. See the model card for full metrics and limitations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb005-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, os\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import (\n",
+    "    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "N_CLASSES = len(meta[\"int_to_label\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"class count:   {N_CLASSES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# XGBoost\n",
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# MLP architecture (must match training)\n",
+    "class TierMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=4, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = TierMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Build the segment lookup\n",
+    "\n",
+    "Per-segment topology features (SOC maturity, EDR coverage, backup tier, etc.) are pulled from `victim_topology.csv` and merged into each timestep record by `target_segment_id`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb005-sample\", repo_type=\"dataset\")\n",
+    "\n",
+    "seg_lookup = build_segment_lookup(\n",
+    "    os.path.join(ds_path, \"victim_topology.csv\")\n",
+    ")\n",
+    "print(f\"loaded {len(seg_lookup)} segment profiles\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Prediction helper"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "\n",
+    "def predict_tier(record: dict) -> dict:\n",
+    "    \"\"\"Predict the threat-actor tier for one per-timestep telemetry record.\n",
+    "\n",
+    "    Per-segment topology features are pulled automatically via\n",
+    "    `target_segment_id` from the seg_lookup loaded above.\n",
+    "    \"\"\"\n",
+    "    seg_id = record.get(\"target_segment_id\")\n",
+    "    seg_aggs = seg_lookup.get(seg_id, {})\n",
+    "    X = transform_single(record, meta, segment_aggregates=seg_aggs)\n",
+    "\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Run on an example record\n",
+    "\n",
+    "Real `encryption_detonation` event from the sample dataset: a nation-state-tier ransomware campaign at timestep 68, with a wiper component deployed and 36,586 files encrypted across 634 endpoints. Both models should lean toward `nation_state_nexus`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Real timestep record from the sample dataset (true tier: nation_state_nexus)\n",
+    "example_record = {\n",
+    "    \"timestep\": 68,\n",
+    "    \"attack_phase\": \"encryption_detonation\",\n",
+    "    \"files_encrypted_cumulative\": 36586,\n",
+    "    \"encryption_throughput_mbps\": 244.913,\n",
+    "    \"endpoints_compromised\": 634,\n",
+    "    \"lateral_move_count\": 1498,\n",
+    "    \"credential_harvest_count\": 17,\n",
+    "    \"c2_bytes_exfiltrated\": 138747511.1,\n",
+    "    \"defender_alert_score\": 1.0,\n",
+    "    \"detection_outcome\": \"alert_generated\",\n",
+    "    \"blast_radius_pct\": 0.4032,\n",
+    "    \"living_off_land_score\": 0.35,\n",
+    "    \"attribution_risk_score\": 0.0,\n",
+    "    \"data_exfiltrated_gb\": 14.852,\n",
+    "    \"wiper_flag\": 1,\n",
+    "    \"double_extortion_flag\": 0,\n",
+    "    \"ir_activated\": 0,\n",
+    "    \"target_segment_id\": \"SEG00150\",\n",
+    "}\n",
+    "\n",
+    "result = predict_tier(example_record)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:25s}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:25s}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### When the two models disagree\n",
+    "\n",
+    "XGBoost and the MLP can disagree on borderline cases — `lone_actor` ↔ `nation_state_nexus` (low blast radius can look similar across both extremes), or `raas_affiliate` ↔ `organised_syndicate` (operational similarity). In threat-attribution workflows, disagreement is a useful triage signal for human analyst review."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Batch prediction on the sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "timelines = pd.read_csv(f\"{ds_path}/attack_timelines.csv\")\n",
+    "\n",
+    "# Score the first 500 timesteps\n",
+    "sample = timelines.head(500).copy()\n",
+    "preds = [predict_tier(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
+    "sample[\"xgb_pred\"] = preds\n",
+    "\n",
+    "ct = pd.crosstab(sample[\"actor_capability_tier\"], sample[\"xgb_pred\"],\n",
+    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 500 sample rows (XGBoost):\")\n",
+    "print(ct)\n",
+    "acc = (sample[\"actor_capability_tier\"] == sample[\"xgb_pred\"]).mean()\n",
+    "print(f\"\\nbatch accuracy on first 500 rows (in-distribution): {acc:.4f}\")\n",
+    "print(\"\\nNote: these rows include training-set campaigns. See validation_results.json\\n\"\n",
+    "      \"for proper held-out test metrics from disjoint campaigns.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for held-out test metrics (75 disjoint campaigns, ~5,600 timesteps).\n",
+    "- See `multi_seed_results.json` for the across-10-seeds robustness picture (accuracy 0.603 ± 0.040, ROC-AUC 0.853 ± 0.031).\n",
+    "- See `ablation_results.json` for per-feature-group contribution. Behavioural features carry the most tier signal (−12pp accuracy when removed).\n",
+    "- The model card explains the leakage audit and the per-class tier-confusion patterns.\n",
+    "- For the full ~358k-row CYB005 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
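
The notebook suggests treating XGBoost/MLP disagreement as a triage signal for analyst review. A minimal sketch of that idea follows, assuming a result dict shaped like the return value of `predict_tier`. The `triage` helper, the 0.5 confidence threshold, and the probabilities in `toy_result` are hypothetical and not taken from the repo.

```python
def triage(result, min_confidence=0.5):
    """Flag a prediction for review when the models disagree or are unsure."""
    xgb, mlp = result["xgboost"], result["mlp"]
    agree = xgb["label"] == mlp["label"]
    # Use the less confident of the two models' top-class probabilities.
    weakest_top_p = min(
        max(xgb["probabilities"].values()),
        max(mlp["probabilities"].values()),
    )
    return {
        "agree": agree,
        "weakest_top_probability": weakest_top_p,
        "needs_review": (not agree) or weakest_top_p < min_confidence,
    }


# Toy result (assumption): made-up probabilities, not real model output.
toy_result = {
    "xgboost": {
        "label": "nation_state_nexus",
        "probabilities": {"lone_actor": 0.30, "organised_syndicate": 0.05,
                          "raas_affiliate": 0.05, "nation_state_nexus": 0.60},
    },
    "mlp": {
        "label": "lone_actor",
        "probabilities": {"lone_actor": 0.45, "organised_syndicate": 0.10,
                          "raas_affiliate": 0.05, "nation_state_nexus": 0.40},
    },
}

decision = triage(toy_result)
print(decision)
```

Any threshold like `min_confidence` would need tuning on held-out campaigns before being used to route cases.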

model_mlp.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d0bec2554d1504c06a93a9d6ad1b6de3fa12d0a09eb6468e6b874cc778739840
+size 71128

model_xgb.json ADDED Viewed

The diff for this file is too large to render. See raw diff

multi_seed_results.json ADDED Viewed

	@@ -0,0 +1,98 @@

+{
+  "purpose": "Multi-seed evaluation of XGBoost across 10 random splits of the 500 ransomware campaigns. The aggregate block gives the mean and standard deviation of each metric over all 10 seeds, plus the accuracy range.",
+  "seeds_evaluated": [
+    42,
+    7,
+    13,
+    17,
+    23,
+    31,
+    45,
+    99,
+    123,
+    200
+  ],
+  "per_seed": [
+    {
+      "seed": 42,
+      "test_n_classes": 4,
+      "accuracy": 0.6898452783211808,
+      "macro_f1": 0.6751447018282526,
+      "macro_roc_auc_ovr": 0.873606865711172
+    },
+    {
+      "seed": 7,
+      "test_n_classes": 4,
+      "accuracy": 0.5936,
+      "macro_f1": 0.6058668770031597,
+      "macro_roc_auc_ovr": 0.8807958394340375
+    },
+    {
+      "seed": 13,
+      "test_n_classes": 4,
+      "accuracy": 0.6160412591143518,
+      "macro_f1": 0.6098050823090829,
+      "macro_roc_auc_ovr": 0.891446004502376
+    },
+    {
+      "seed": 17,
+      "test_n_classes": 4,
+      "accuracy": 0.5668563300142248,
+      "macro_f1": 0.5260776400679491,
+      "macro_roc_auc_ovr": 0.8435537531292995
+    },
+    {
+      "seed": 23,
+      "test_n_classes": 4,
+      "accuracy": 0.5331673483905388,
+      "macro_f1": 0.5092426374129808,
+      "macro_roc_auc_ovr": 0.8177927651119797
+    },
+    {
+      "seed": 31,
+      "test_n_classes": 4,
+      "accuracy": 0.6072953736654805,
+      "macro_f1": 0.6146362246152752,
+      "macro_roc_auc_ovr": 0.8585576361068035
+    },
+    {
+      "seed": 45,
+      "test_n_classes": 4,
+      "accuracy": 0.5793777777777778,
+      "macro_f1": 0.5739793543388237,
+      "macro_roc_auc_ovr": 0.8200552847948792
+    },
+    {
+      "seed": 99,
+      "test_n_classes": 4,
+      "accuracy": 0.6200640341515475,
+      "macro_f1": 0.6242476136431796,
+      "macro_roc_auc_ovr": 0.8679174384576391
+    },
+    {
+      "seed": 123,
+      "test_n_classes": 4,
+      "accuracy": 0.6323372465314835,
+      "macro_f1": 0.6277596831292473,
+      "macro_roc_auc_ovr": 0.8854323799134519
+    },
+    {
+      "seed": 200,
+      "test_n_classes": 4,
+      "accuracy": 0.587157595161864,
+      "macro_f1": 0.5653959696484754,
+      "macro_roc_auc_ovr": 0.7957473212581817
+    }
+  ],
+  "aggregate": {
+    "accuracy_mean": 0.602574224312845,
+    "accuracy_std": 0.039951201198129296,
+    "accuracy_min": 0.5331673483905388,
+    "accuracy_max": 0.6898452783211808,
+    "macro_f1_mean": 0.5932155783996427,
+    "macro_f1_std": 0.04739799073577289,
+    "roc_auc_mean": 0.853490528841982,
+    "roc_auc_std": 0.031096980060089464
+  },
+  "published_artifact_seed": 42
+}
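
The `aggregate` block can be re-derived from the `per_seed` entries. The sketch below recomputes the accuracy mean and standard deviation from the ten values listed above using only the standard library. The comparison suggests the reported `accuracy_std` is the population form (ddof=0, as with NumPy's default), though the file itself does not say which form was used.

```python
import statistics

# Per-seed accuracies copied from multi_seed_results.json
# (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200).
accuracies = [
    0.6898452783211808, 0.5936, 0.6160412591143518, 0.5668563300142248,
    0.5331673483905388, 0.6072953736654805, 0.5793777777777778,
    0.6200640341515475, 0.6323372465314835, 0.587157595161864,
]

acc_mean = statistics.fmean(accuracies)
acc_pstd = statistics.pstdev(accuracies)  # population std (ddof=0)
acc_sstd = statistics.stdev(accuracies)   # sample std (ddof=1), for comparison

print(f"mean:           {acc_mean:.6f}")
print(f"population std: {acc_pstd:.6f}")
print(f"sample std:     {acc_sstd:.6f}")
```

With only 10 seeds the two forms differ noticeably, so it is worth stating which one a reported "± 0.040" refers to.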

validation_results.json ADDED Viewed

	@@ -0,0 +1,146 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb005-sample",
+  "task": "4-class actor_capability_tier classification",
+  "baselines": {
+    "always_predict_majority_accuracy": 0.41348034856837984,
+    "majority_class": "organised_syndicate",
+    "random_guess_accuracy": 0.25
+  },
+  "split": {
+    "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
+    "rationale": "500 ransomware campaigns generate 37,489 timesteps (about 75 per campaign). A random row split would leak per-campaign correlations into the test fold. A group-aware split keeps train/val/test campaigns disjoint.",
+    "campaigns_train": 350,
+    "campaigns_val": 75,
+    "campaigns_test": 75,
+    "timesteps_train": 26242,
+    "timesteps_val": 5624,
+    "timesteps_test": 5623,
+    "seed": 42
+  },
+  "n_features": 63,
+  "label_classes": [
+    "lone_actor",
+    "organised_syndicate",
+    "raas_affiliate",
+    "nation_state_nexus"
+  ],
+  "class_distribution_train": {
+    "organised_syndicate": 10423,
+    "raas_affiliate": 7950,
+    "lone_actor": 4125,
+    "nation_state_nexus": 3744
+  },
+  "class_distribution_test": {
+    "organised_syndicate": 2325,
+    "raas_affiliate": 1725,
+    "nation_state_nexus": 823,
+    "lone_actor": 750
+  },
+  "leakage_excluded_features": [],
+  "leakage_audit_notes": "Three columns were audited as potential tier oracles: attribution_risk_score (mean 0.016-0.026 with overlapping ranges - not an oracle, kept); living_off_land_score (mean 0.05-0.20 with large overlap - real observable, kept); attack_phase (no oracle relationship to tier - kept). detection_outcome contains a recovery_in_progress value that is 1:1 with the attack_phase of the same name, but this is a phase-prediction leak, not a tier-prediction one. No features dropped for this task.",
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 4 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.6898452783211808,
+        "macro_f1": 0.6751447018282526,
+        "weighted_f1": 0.6881356546405818,
+        "per_class_f1": {
+          "lone_actor": 0.6297297297297297,
+          "organised_syndicate": 0.7391393864525427,
+          "raas_affiliate": 0.6458906202260922,
+          "nation_state_nexus": 0.6858190709046454
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              466,
+              67,
+              216,
+              1
+            ],
+            [
+              83,
+              1795,
+              275,
+              172
+            ],
+            [
+              156,
+              433,
+              1057,
+              79
+            ],
+            [
+              25,
+              237,
+              0,
+              561
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.873606865711172
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 63 -> 128 -> 64 -> 4, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.5118264271741063,
+        "macro_f1": 0.512148917800585,
+        "weighted_f1": 0.5133102239521222,
+        "per_class_f1": {
+          "lone_actor": 0.427515633882888,
+          "organised_syndicate": 0.5204107187578262,
+          "raas_affiliate": 0.49878147847278637,
+          "nation_state_nexus": 0.6018878400888396
+        },
+        "confusion_matrix": {
+          "labels": [
+            "lone_actor",
+            "organised_syndicate",
+            "raas_affiliate",
+            "nation_state_nexus"
+          ],
+          "matrix": [
+            [
+              376,
+              17,
+              280,
+              77
+            ],
+            [
+              282,
+              1039,
+              745,
+              259
+            ],
+            [
+              248,
+              456,
+              921,
+              100
+            ],
+            [
+              103,
+              156,
+              22,
+              542
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8071564672462985
+      }
+    }
+  }
+}
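
The headline numbers in validation_results.json can be cross-checked from the confusion matrix alone. The sketch below is not part of the committed files. It copies the XGBoost matrix from above and assumes rows are true classes and columns are predicted classes, which is consistent with the row sums matching `class_distribution_test`. The helper name `summarize` is hypothetical.

```python
# Confusion matrix copied from models.xgboost.test_metrics.confusion_matrix.
# Assumption: rows = true class, columns = predicted class.
LABELS = ["lone_actor", "organised_syndicate", "raas_affiliate", "nation_state_nexus"]
XGB_MATRIX = [
    [466, 67, 216, 1],
    [83, 1795, 275, 172],
    [156, 433, 1057, 79],
    [25, 237, 0, 561],
]


def summarize(matrix, labels):
    """Derive accuracy, per-class F1, macro-F1 and the majority baseline."""
    row_sums = [sum(row) for row in matrix]        # true count per class
    col_sums = [sum(col) for col in zip(*matrix)]  # predicted count per class
    total = sum(row_sums)
    diag = [matrix[i][i] for i in range(len(labels))]

    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum)
    per_class_f1 = {}
    for i, label in enumerate(labels):
        denom = row_sums[i] + col_sums[i]
        per_class_f1[label] = 2 * diag[i] / denom if denom else 0.0

    return {
        "accuracy": sum(diag) / total,
        "per_class_f1": per_class_f1,
        "macro_f1": sum(per_class_f1.values()) / len(labels),
        # Accuracy of always predicting the most frequent true class.
        "majority_baseline": max(row_sums) / total,
    }


summary = summarize(XGB_MATRIX, LABELS)
for key, value in summary.items():
    print(key, value)
```

Swapping in the MLP matrix gives the same kind of check against the `mlp` block. ROC-AUC cannot be recovered this way because it needs the predicted probabilities, not just the hard predictions.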