Initial release: XGBoost + MLP for ATT&CK phase classification

Files changed (9)

README.md +408 -0
ablation_results.json +804 -0
feature_engineering.py +394 -0
feature_meta.json +249 -0
feature_scaler.json +1 -0
inference_example.ipynb +343 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
validation_results.json +383 -0

README.md ADDED

	@@ -0,0 +1,408 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - mitre-attack
+  - kill-chain
+  - apt
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb002-sample
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+model-index:
+  - name: cyb002-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 10-class MITRE ATT&CK kill-chain phase classification
+        dataset:
+          type: xpertsystems/cyb002-sample
+          name: CYB002 Synthetic Cyber Attack Dataset (Sample)
+        metrics:
+          - type: roc_auc
+            value: 0.8599
+            name: Test macro ROC-AUC OvR (XGBoost)
+          - type: f1
+            value: 0.4255
+            name: Test macro-F1 (XGBoost)
+          - type: accuracy
+            value: 0.4683
+            name: Test accuracy (XGBoost)
+          - type: roc_auc
+            value: 0.8496
+            name: Test macro ROC-AUC OvR (MLP)
+          - type: f1
+            value: 0.3911
+            name: Test macro-F1 (MLP)
+          - type: accuracy
+            value: 0.4449
+            name: Test accuracy (MLP)
+---
+# CYB002 Baseline Classifier
+**MITRE ATT&CK kill-chain phase classifier trained on the CYB002
+synthetic cyber attack sample. Predicts which of 10 kill-chain phases
+an attack event belongs to, from observable event + segment features.**
+> **Baseline reference, not for production use.** This model demonstrates
+> that the [CYB002 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb002-sample)
+> is learnable end-to-end and gives prospective buyers a working starting
+> point. It is not a production threat detector or SOC tool. See
+> [Limitations](#limitations).
+## Model overview
+| Property | Value |
+|---|---|
+| Task | 10-class kill-chain phase classification |
+| Training data | `xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 90 (after one-hot encoding) |
+| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline |
+Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
+- `model_xgb.json` — gradient-boosted trees, primary recommendation
+- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+import os, sys
+import numpy as np, torch, xgboost as xgb
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+# torch and load_file are only needed to load model_mlp.safetensors (not shown here)
+REPO = "xpertsystems/cyb002-baseline-classifier"
+# Download the model artifacts and the feature pipeline into the local HF cache
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+# Make the downloaded feature_engineering.py importable
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import (
+    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
+)
+meta = load_meta(paths["feature_meta.json"])
+xgb_model = xgb.XGBClassifier()
+xgb_model.load_model(paths["model_xgb.json"])
+# Build the segment-aggregate lookup from the dataset's topology CSV
+seg_lookup = build_segment_lookup("path/to/network_topology.csv")
+# my_event is a placeholder: one raw event as a dict keyed by the dataset's column names
+# Predict (see inference_example.ipynb for the full pattern)
+seg_agg = seg_lookup.get(my_event["target_segment_id"], {})
+X = transform_single(my_event, meta, segment_aggregates=seg_agg)
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for an
+end-to-end copy-paste demo including segment-aggregate setup and
+batch prediction.
+## Training data
+Trained on the public sample of CYB002, 4,353 attack events from 100
+distinct campaigns:
+| Phase | Train (n=2,822) | Test (n=726) | Test share |
+|---|---:|---:|---:|
+| `dwell_idle` | 581 | 141 | 19.4% |
+| `reconnaissance` | 411 | 112 | 15.4% |
+| `initial_access` | 358 | 106 | 14.6% |
+| `execution` | 324 | 74 | 10.2% |
+| `persistence` | 287 | 79 | 10.9% |
+| `privilege_escalation` | 249 | 68 | 9.4% |
+| `lateral_movement` | 201 | 54 | 7.4% |
+| `collection` | 162 | 40 | 5.5% |
+| `exfiltration` | 113 | 31 | 4.3% |
+| `impact` | 105 | 21 | 2.9% |
+### Group-aware split
+A single campaign generates about 44 events on average (4,353 events
+across 100 campaigns), and those events are highly correlated. Random
+row-level splitting would put events from the same campaign in both train
+and test, inflating metrics in a way that does not generalize to new campaigns.
+This release uses **GroupShuffleSplit by `campaign_id`**:
+| Fold | Campaigns | Events |
+|---|---:|---:|
+| Train | 69 | 2,822 |
+| Validation | 16 | 805 |
+| Test | 15 | 726 |
+All test campaigns are completely unseen during training. Class imbalance
+is addressed with balanced class weights, passed to XGBoost as per-row
+`sample_weight`, and with weighted cross-entropy for the MLP.
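The split and weighting described above can be sketched without any dependencies. This is a minimal illustration, not the private training script: `group_split` and `balanced_weights` are hypothetical helpers, the toy events are made up, and the release itself names scikit-learn's `GroupShuffleSplit`, which rounds fold sizes differently from this sketch.

```python
import random
from collections import Counter

def group_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups (campaigns) to train/val/test so no group spans two folds."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_val = round(fractions[1] * len(groups))
    n_test = round(fractions[2] * len(groups))
    test = set(groups[:n_test])
    val = set(groups[n_test:n_test + n_val])
    train = set(groups[n_test + n_val:])
    return train, val, test

def balanced_weights(labels):
    """Per-row weight n_samples / (n_classes * class_count), the usual 'balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# Toy data (illustrative only): 20 campaigns, 5 events each, one rare label per campaign.
events = [
    {"campaign_id": f"C{c:03d}",
     "kill_chain_phase": "impact" if i == 4 else "dwell_idle"}
    for c in range(20) for i in range(5)
]

train_ids, val_ids, test_ids = group_split([e["campaign_id"] for e in events])
train = [e for e in events if e["campaign_id"] in train_ids]
weights = balanced_weights([e["kill_chain_phase"] for e in train])

print(len(train_ids), len(val_ids), len(test_ids))   # campaigns per fold
print(sorted(set(round(w, 3) for w in weights)))     # one weight per class
```

The rare class receives the larger weight, and the weights sum to the number of training rows, so the overall loss scale is unchanged.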
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical feature recipe.
+**Three columns are deliberately excluded** because they leak the target:
+- `technique_id` — 62 of 63 ATT&CK techniques map 1:1 to a single phase.
+  Including it gives perfect-looking metrics that mean nothing.
+- `technique_name` — 1:1 alias of `technique_id` (63 unique values each).
+- `tactic_category` — direct alias of `kill_chain_phase`.
+**90 features survive after encoding**, drawn from:
+- **Event-level numeric** (10): `timestep`, `dest_port`, `bytes_transferred`, `connection_duration_s`, `auth_failure_count`, `process_injection_flag`, `lateral_hop_count`, `c2_beacon_interval_s`, `edr_blocked_flag`, `siem_rule_triggered`
+- **Event-level categorical** (7, one-hot encoded): `target_asset_type`, `source_ip_class`, `protocol`, `attacker_capability_tier`, `defender_maturity_level`, `alert_severity`, `detection_outcome`
+- **Segment-level topology aggregates** (13): mean `patch_lag_days`, mean `exposure_score`, max `vulnerability_count`, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus segment_type and defender_maturity_level (segment-constant)
+- **Engineered** (6): `byte_volume_log`, `has_c2_beacon`, `is_brute_forcing`, `attacker_defender_advantage`, `is_high_volume`, `is_privileged_port`
+None of the engineered features is derived from phase or technique —
+that would re-introduce the leakage we just excluded.
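One way to see why a column such as `technique_id` has to be excluded is to measure how often a single value of that column pins down the label on its own. The sketch below is illustrative only: `label_purity` and `drop_leaky` are hypothetical helpers (not functions from `feature_engineering.py`), and the rows and technique identifiers are made-up toy values.

```python
from collections import defaultdict

# Columns the README excludes because they leak the target.
LEAKY_COLUMNS = ("technique_id", "technique_name", "tactic_category")

def label_purity(rows, column, label="kill_chain_phase"):
    """Fraction of distinct values in `column` that always co-occur with one label."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[label])
    return sum(len(labels) == 1 for labels in seen.values()) / len(seen)

def drop_leaky(row):
    """Return a copy of the event without the target-leaking columns."""
    return {k: v for k, v in row.items() if k not in LEAKY_COLUMNS}

# Toy rows: identifiers and pairings are invented for the example.
rows = [
    {"technique_id": "T-A", "protocol": "tcp", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T-A", "protocol": "udp", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T-B", "protocol": "tcp", "kill_chain_phase": "execution"},
    {"technique_id": "T-C", "protocol": "udp", "kill_chain_phase": "persistence"},
    {"technique_id": "T-C", "protocol": "tcp", "kill_chain_phase": "execution"},
]

print(label_purity(rows, "technique_id"))  # 2 of 3 toy techniques map to one phase
print(label_purity(rows, "protocol"))      # neither protocol fixes the phase
print(drop_leaky(rows[0]))
```

On the real data, the README's "62 of 63 techniques" corresponds to a purity of about 0.98 for `technique_id`, which is why a model given that column would score well without learning anything about event behaviour.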
+### Note on detection-outcome features
+`detection_outcome`, `alert_severity`, `edr_blocked_flag`, and
+`siem_rule_triggered` are post-hoc observables from the SOC's perspective.
+They are kept as features for the realistic use case where a SOC analyst
+has just seen an action and its initial detection signal and is reasoning
+about which phase the campaign is in. Buyers who want a strictly
+pre-detection model can drop these four columns and retrain. The ablation
+results below show this **does not hurt accuracy** (0.4725 without them
+versus 0.4683 with them), so the model does not lean on them for phase
+prediction.
+## Evaluation
+### Test-set metrics (n = 726 events from 15 disjoint campaigns)
+**XGBoost**
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.8599** |
+| Accuracy | 0.4683 |
+| Macro-F1 | 0.4255 |
+| Weighted-F1 | 0.4407 |
+**MLP**
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.8496** |
+| Accuracy | 0.4449 |
+| Macro-F1 | 0.3911 |
+| Weighted-F1 | 0.4350 |
+### Headline interpretation
+Accuracy of 47% looks low at first glance, but the right comparison is:
+| Baseline | Accuracy | Macro-F1 |
+|---|---:|---:|
+| Random uniform guess (1/10 classes) | 0.10 | ~0.10 |
+| Always predict majority (`dwell_idle`) | 0.19 | ~0.03 |
+| **XGBoost (this model)** | **0.47** | **0.43** |
+The macro ROC-AUC of **0.86** tells the cleaner story: the model
+ranks the 10 phases meaningfully well even though the argmax
+prediction is wrong for 53% of test events, in many cases landing on
+an adjacent phase.
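The two trivial baselines in the table can be derived directly from the test-set class counts. This is a small worked sketch using the counts from the "Training data" table; it assumes the common convention that a class which is never predicted contributes an F1 of 0 to the macro average.

```python
# Test-set class counts, copied from the "Training data" table.
test_counts = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31, "impact": 21,
}
n = sum(test_counts.values())
k = len(test_counts)

# Uniform random guessing: expected accuracy is 1/k regardless of class balance.
uniform_accuracy = 1 / k

# Always predicting the most frequent class.
majority = max(test_counts, key=test_counts.get)
majority_accuracy = test_counts[majority] / n

# Only the majority class has non-zero F1: precision equals its share, recall is 1.
precision, recall = majority_accuracy, 1.0
majority_f1 = 2 * precision * recall / (precision + recall)
majority_macro_f1 = majority_f1 / k  # the other k-1 classes score 0 (assumed convention)

print(f"n={n}, uniform accuracy={uniform_accuracy:.2f}")
print(f"majority={majority}, accuracy={majority_accuracy:.4f}, "
      f"macro-F1={majority_macro_f1:.4f}")
```

The majority-class predictor reaches 19% accuracy but only about 0.03 macro-F1, so macro-F1 is the stricter yardstick on an imbalanced 10-class problem.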
+### Per-class F1 — where the signal is and isn't
+| Phase | XGBoost F1 | MLP F1 | Note |
+|---|---:|---:|---|
+| `reconnaissance` | **0.753** | 0.725 | Strong: early timestep, distinct protocols/targets |
+| `lateral_movement` | **0.742** | 0.783 | Strong: lateral-hop count, post-privesc pattern |
+| `initial_access` | **0.647** | 0.648 | Strong: perimeter targets, specific protocols |
+| `privilege_escalation` | 0.500 | 0.488 | Moderate |
+| `execution` | 0.441 | 0.510 | Moderate |
+| `persistence` | 0.413 | 0.301 | Moderate, easily confused with execution |
+| `exfiltration` | 0.273 | 0.119 | Weak: late-phase, similar to collection/impact |
+| `impact` | 0.226 | 0.132 | Weak: late-phase clustering |
+| `collection` | 0.220 | 0.191 | Weak: late-phase clustering |
+| `dwell_idle` | 0.040 | 0.013 | Very weak: no-op steps lack distinguishing features |
+The model has solid signal on **early and mid-campaign phases** and
+genuinely struggles to disambiguate **late-stage objective-completion
+phases** (collection / exfiltration / impact), which arrive close in
+time and look similar at the event level. This is an honest limitation
+of flat-tabular classification — sequence models would help here.
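The per-class scores can be recomputed from the XGBoost confusion matrix stored under `full_model_metrics` in `ablation_results.json`. The sketch below hard-codes that matrix and assumes rows are true phases and columns are predicted phases; the row sums match the test-set class counts, which supports that reading.

```python
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution", "persistence",
    "privilege_escalation", "lateral_movement", "collection", "exfiltration", "impact",
]
# Confusion matrix copied from ablation_results.json (rows = true, columns = predicted).
CM = [
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]

def per_class_f1(cm):
    """F1 per class as 2*TP / (true count + predicted count)."""
    k = len(cm)
    scores = []
    for i in range(k):
        tp = cm[i][i]
        true_count = sum(cm[i])                       # row sum
        pred_count = sum(cm[r][i] for r in range(k))  # column sum
        denom = true_count + pred_count
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

f1 = per_class_f1(CM)
support = [sum(row) for row in CM]
n = sum(support)
accuracy = sum(CM[i][i] for i in range(len(CM))) / n
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(score * s for score, s in zip(f1, support)) / n

for label, score in zip(LABELS, f1):
    print(f"{label:22s} {score:.3f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

Reading the first row of the matrix also shows why `dwell_idle` scores so low: only 3 of its 141 test events are predicted as `dwell_idle`, with the rest spread across nearly every other phase.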
+### Ablation: which feature groups matter
+| Configuration | Accuracy | Macro-F1 | Δ accuracy vs full |
+|---|---:|---:|---:|
+| Full feature set (published) | 0.4683 | 0.4255 | — |
+| No `timestep` | 0.3264 | 0.3102 | **−0.1419** |
+| No topology aggregates | 0.4601 | 0.4093 | −0.0083 |
+| No engineered features | 0.4642 | 0.4240 | −0.0041 |
+| No detection-signal features | 0.4725 | 0.4284 | **+0.0041** |
+Two clear findings:
+1. **`timestep` is by far the most important feature** (accuracy drops
+   14 pp and macro-F1 drops 11.5 pp when it is removed). The honest
+   reading: kill chains progress in time, and where you are in the
+   campaign timeline carries most of the phase signal.
+2. **Detection-signal features (`detection_outcome`, `alert_severity`,
+   `edr_blocked_flag`, `siem_rule_triggered`) do not help phase prediction.**
+   Removing them raises accuracy by 0.4 pp, which is only 3 of the 726
+   test events, so the difference is best read as noise. A buyer who wants
+   a pre-detection model can drop these four columns with no loss.
+Topology aggregates contribute about 0.8 pp of accuracy and engineered
+features about 0.4 pp, both small next to `timestep`.
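To put the ablation deltas in perspective, it helps to convert them into test-event counts. The sketch below uses the rounded accuracies from the ablation table and the test-set size of 726; the variable names are illustrative.

```python
N_TEST = 726          # test events
FULL_ACCURACY = 0.4683

# Accuracy with one feature group removed, from the ablation table.
ablations = {
    "no timestep": 0.3264,
    "no topology aggregates": 0.4601,
    "no engineered features": 0.4642,
    "no detection-signal features": 0.4725,
}

deltas = {}
for name, accuracy in ablations.items():
    delta = accuracy - FULL_ACCURACY
    # (change in percentage points, change in correctly classified test events)
    deltas[name] = (delta * 100, round(delta * N_TEST))

for name, (pp, n_events) in sorted(deltas.items(), key=lambda kv: kv[1][0]):
    print(f"{name:30s} {pp:+6.1f} pp  {n_events:+4d} events")
```

Only the `timestep` ablation moves more than a handful of events; the other three differences amount to six events or fewer on this test set.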
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation mlogloss.
+**MLP:** `90 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
+→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
+early stopping on validation macro-F1.
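For a sense of scale, the learnable parameter count implied by the stated MLP layer sizes can be worked out by hand. This sketch assumes a bias term on every linear layer and a learnable scale and shift in each `BatchNorm1d` (the PyTorch defaults); whether the released model uses those defaults is not stated in the README.

```python
LAYER_SIZES = [90, 128, 64, 10]  # input -> hidden -> hidden -> output

def count_parameters(sizes):
    """Learnable parameters: Linear weights and biases, plus BatchNorm scale/shift on hidden layers."""
    total = 0
    n_layers = len(sizes) - 1
    for i, (n_in, n_out) in enumerate(zip(sizes, sizes[1:])):
        total += n_in * n_out + n_out   # Linear weight matrix + bias
        if i < n_layers - 1:            # hidden layers only
            total += 2 * n_out          # BatchNorm1d scale + shift (assumed affine)
    return total

n_params = count_parameters(LAYER_SIZES)
print(n_params)
```

Under these assumptions the network has roughly 21k parameters against about 2.8k training events, which is consistent with the README's caution about MLP behaviour far from the training data.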
+Training hyperparameters (learning rate, batch size, n_estimators,
+early-stopping patience, weight decay, class-weighting strategy) are
+held internally by XpertSystems and are not part of this release.
+## Limitations
+**This is a baseline reference, not a production threat detection system.**
+1. **Late-phase confusion.** Per-class F1 for `collection`, `exfiltration`,
+   and `impact` is 0.22–0.27. These phases arrive near campaign-end with
+   similar feature signatures, and a flat-tabular event-level model can't
+   easily disambiguate them. Sequence models (LSTM / transformer over the
+   per-campaign event sequence) would substantially improve this.
+2. **`dwell_idle` is essentially unlearnable in this framing.** The
+   class-balanced weights amplify rare classes; `dwell_idle` is common
+   but featureless ("no action this timestep"), so the model trades
+   `dwell_idle` recall for late-phase recall. F1 = 0.04. A real SOC
+   pipeline would handle idle steps with a separate gating rule, not a
+   classifier head.
+3. **Sample-size constraints.** 100 campaigns / 4,353 events with a
+   group-aware split leaves 69 training campaigns. The full 380k-event
+   CYB002 product supports much more reliable per-class estimation,
+   especially on the rare late-phase classes.
+4. **Synthetic-vs-real transfer.** The dataset is synthetic and
+   calibrated to threat-intelligence benchmark targets (Mandiant
+   M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real
+   attack telemetry has different noise characteristics, adversary
+   adaptation, and gaps in coverage. Do not assume metrics transfer.
+5. **Adversarial robustness not evaluated.** The dataset is not
+   adversarially generated; the model has not been red-teamed.
+6. **MLP brittleness on OOD inputs.** With ~2.8k training events, the
+   MLP can produce confidently-wrong predictions on hand-crafted
+   records far from the training manifold. XGBoost is more robust.
+   Use both; treat disagreement as a signal for human review.
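The "use both, treat disagreement as a signal" advice in item 6 can be expressed as a small rule. The sketch below is one possible reading, not part of the released code: `triage` is a hypothetical helper, the `min_confidence` threshold of 0.5 is an arbitrary illustrative value, and the probability vectors are toy inputs over a three-label subset.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Compare two probability vectors over the same label order and flag cases for review."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    agree = top_xgb == top_mlp
    # Also flag cases where the models agree but at least one is unsure.
    low_confidence = min(proba_xgb[top_xgb], proba_mlp[top_mlp]) < min_confidence
    return {
        "xgb": labels[top_xgb],
        "mlp": labels[top_mlp],
        "needs_review": (not agree) or low_confidence,
    }

labels = ["collection", "exfiltration", "impact"]  # toy subset of the 10 phases

agree_case = triage([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], labels)
disagree_case = triage([0.5, 0.4, 0.1], [0.2, 0.7, 0.1], labels)
unsure_case = triage([0.4, 0.3, 0.3], [0.35, 0.33, 0.32], labels)

for case in (agree_case, disagree_case, unsure_case):
    print(case)
```

Flagging low-confidence agreement as well as outright disagreement is a design choice for this sketch; a stricter or looser rule may suit a given workflow better.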
+## Notes on dataset schema
+The CYB002 sample dataset README describes some fields differently from
+the actual schema. The model was trained on the actual schema; this note
+is to help buyers reconcile what they read with what they receive.
+| What the README says | What the data actually contains |
+|---|---|
+| "9 ATT&CK phases" | 10 phases including `dwell_idle` (idle/no-op steps) |
+| 4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state` | 4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state` |
+| 5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`) | 5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust` |
+| Field name `phase` | Actual column: `kill_chain_phase` |
+| Field name `tactic` | Actual column: `tactic_category` |
+| Field name `segment_id` | Actual column: `target_segment_id` |
+| Field name `attacker_tier` | Actual column: `attacker_capability_tier` |
+| Field name `defender_maturity` | Actual column: `defender_maturity_level` |
+| Field name `detected`, `blocked`, `stealth_score` | Actual: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; no `stealth_score` on events |
+None of this affects model correctness — `feature_engineering.py` uses the
+actual column names. If you build your own pipeline against the dataset,
+use the actual columns, not the README descriptions.
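For pipelines that were written against the field names in the dataset README, a rename step is a simple way to reconcile the two. The mapping below covers only the one-to-one rows of the table above; `to_actual_schema` is a hypothetical helper, and the example event values are made up.

```python
# README field name -> actual column name (one-to-one rows of the table only).
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}
# `detected`, `blocked` and `stealth_score` are left out on purpose: the table
# does not give a one-to-one mapping for them.

def to_actual_schema(event):
    """Rename README-style keys to the actual column names; other keys pass through."""
    return {README_TO_ACTUAL.get(key, key): value for key, value in event.items()}

# Toy event using README-style names (values are illustrative).
event = {"phase": "execution", "segment_id": "SEG-01", "dest_port": 443}
renamed = to_actual_schema(event)
print(renamed)
```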
+## Intended use
+- **Evaluating fit** of the CYB002 dataset for your ATT&CK / kill-chain
+  research
+- **Baseline reference** for new model architectures (especially sequence
+  models, which should beat this baseline on the late-phase classes)
+- **Teaching and demo** for tabular classification on attack-event data
+- **Feature engineering reference** for MITRE ATT&CK-aligned datasets
+## Out-of-scope use
+- Production threat detection on real network telemetry
+- SOC alert triage on real systems
+- Forensic attribution of real attacks
+- Adversarial-evasion evaluation (dataset not adversarially generated)
+- Any safety-critical or operational security decision
+## Reproducibility
+Outputs above were produced with `seed = 42` and a group-aware nested
+`GroupShuffleSplit` by `campaign_id` (nominally 70/15/15, which yields
+69/16/15 campaigns on this sample), on the published sample
+(`xpertsystems/cyb002-sample`, version 1.0.0, generated 2026-05-16).
+The feature pipeline in `feature_engineering.py` is deterministic and
+the trained weights in this repo correspond exactly to the metrics above.
+The training script itself is private to XpertSystems. The published
+artifacts contain the feature pipeline, model weights, scaler, metadata,
+and validation results — sufficient to reproduce inference but not
+training.
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights |
+| `model_mlp.safetensors` | PyTorch MLP weights |
+| `feature_engineering.py` | Feature pipeline (load → aggregate topology → engineer → encode) |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Per-feature-group ablation (timestep, topology, engineered, detection-signals) |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB002** dataset contains ~454,000 rows across four files,
+with calibrated benchmark validation against 12 metrics drawn from
+authoritative threat intelligence sources (Mandiant, IBM, Verizon,
+CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data
+catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance &
+Risk, Oil & Gas, and Materials & Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample
+- 🤖 Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier
+## Citation
+```bibtex
+@misc{xpertsystems_cyb002_baseline_2026,
+  title  = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT&CK Kill-Chain Phase Classification},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier},
+  note   = {Baseline reference model trained on xpertsystems/cyb002-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,804 @@

+{
+  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.46831955922865015,
+    "macro_f1": 0.42549880749552066,
+    "weighted_f1": 0.440668872633435,
+    "per_class_f1": {
+      "dwell_idle": 0.040268456375838924,
+      "reconnaissance": 0.7532467532467533,
+      "initial_access": 0.6467661691542289,
+      "execution": 0.4406779661016949,
+      "persistence": 0.41304347826086957,
+      "privilege_escalation": 0.5,
+      "lateral_movement": 0.7422680412371134,
+      "collection": 0.22018348623853212,
+      "exfiltration": 0.2727272727272727,
+      "impact": 0.22580645161290322
+    },
+    "confusion_matrix": {
+      "labels": [
+        "dwell_idle",
+        "reconnaissance",
+        "initial_access",
+        "execution",
+        "persistence",
+        "privilege_escalation",
+        "lateral_movement",
+        "collection",
+        "exfiltration",
+        "impact"
+      ],
+      "matrix": [
+        [
+          3,
+          23,
+          23,
+          18,
+          21,
+          18,
+          2,
+          17,
+          9,
+          7
+        ],
+        [
+          2,
+          87,
+          2,
+          21,
+          0,
+          0,
+          0,
+          0,
+          0,
+          0
+        ],
+        [
+          1,
+          5,
+          65,
+          5,
+          3,
+          26,
+          1,
+          0,
+          0,
+          0
+        ],
+        [
+          2,
+          4,
+          1,
+          39,
+          24,
+          3,
+          1,
+          0,
+          0,
+          0
+        ],
+        [
+          0,
+          0,
+          1,
+          12,
+          38,
+          9,
+          0,
+          18,
+          1,
+          0
+        ],
+        [
+          0,
+          0,
+          3,
+          8,
+          4,
+          44,
+          3,
+          5,
+          1,
+          0
+        ],
+        [
+          0,
+          0,
+          0,
+          0,
+          6,
+          6,
+          36,
+          2,
+          0,
+          4
+        ],
+        [
+          0,
+          0,
+          0,
+          0,
+          2,
+          1,
+          0,
+          12,
+          15,
+          10
+        ],
+        [
+          0,
+          0,
+          0,
+          0,
+          5,
+          0,
+          0,
+          4,
+          9,
+          13
+        ],
+        [
+          0,
+          0,
+          0,
+          0,
+          2,
+          1,
+          0,
+          11,
+          0,
+          7
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.8598653258869782
+  },
+  "ablations": {
+    "no_topology": {
+      "n_features": 67,
+      "dropped_count": 23,
+      "metrics": {
+        "model": "xgboost_no_topology",
+        "accuracy": 0.46005509641873277,
+        "macro_f1": 0.4093395066167947,
+        "weighted_f1": 0.4281869072634682,
+        "per_class_f1": {
+          "dwell_idle": 0.013513513513513514,
+          "reconnaissance": 0.7574468085106383,
+          "initial_access": 0.6435643564356436,
+          "execution": 0.45348837209302323,
+          "persistence": 0.3829787234042553,
+          "privilege_escalation": 0.4943820224719101,
+          "lateral_movement": 0.72,
+          "collection": 0.205607476635514,
+          "exfiltration": 0.25,
+          "impact": 0.1724137931034483
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              1,
+              24,
+              24,
+              16,
+              24,
+              16,
+              4,
+              15,
+              10,
+              7
+            ],
+            [
+              2,
+              89,
+              2,
+              16,
+              3,
+              0,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              6,
+              65,
+              4,
+              3,
+              26,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              4,
+              1,
+              39,
+              25,
+              3,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              0,
+              0,
+              16,
+              36,
+              9,
+              0,
+              16,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              3,
+              7,
+              4,
+              44,
+              3,
+              5,
+              2,
+              0
+            ],
+            [
+              0,
+              0,
+              1,
+              0,
+              5,
+              9,
+              36,
+              2,
+              0,
+              1
+            ],
+            [
+              1,
+              0,
+              0,
+              0,
+              2,
+              2,
+              1,
+              11,
+              11,
+              12
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              5,
+              0,
+              0,
+              6,
+              8,
+              12
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              12,
+              1,
+              5
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8625474585447981
+      },
+      "delta_accuracy": 0.008264462809917383,
+      "delta_macro_f1": 0.01615930087872597
+    },
+    "no_engineered": {
+      "n_features": 84,
+      "dropped_count": 6,
+      "metrics": {
+        "model": "xgboost_no_engineered",
+        "accuracy": 0.4641873278236915,
+        "macro_f1": 0.4239556593623024,
+        "weighted_f1": 0.4373277421758876,
+        "per_class_f1": {
+          "dwell_idle": 0.02631578947368421,
+          "reconnaissance": 0.7368421052631579,
+          "initial_access": 0.6305418719211823,
+          "execution": 0.46060606060606063,
+          "persistence": 0.4419889502762431,
+          "privilege_escalation": 0.49142857142857144,
+          "lateral_movement": 0.7346938775510204,
+          "collection": 0.24347826086956523,
+          "exfiltration": 0.2647058823529412,
+          "impact": 0.208955223880597
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              2,
+              23,
+              24,
+              14,
+              23,
+              20,
+              2,
+              17,
+              9,
+              7
+            ],
+            [
+              4,
+              84,
+              3,
+              21,
+              0,
+              0,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              2,
+              5,
+              64,
+              4,
+              1,
+              29,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              3,
+              4,
+              1,
+              38,
+              25,
+              2,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              0,
+              2,
+              7,
+              40,
+              9,
+              0,
+              20,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              3,
+              7,
+              5,
+              43,
+              4,
+              5,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              0,
+              3,
+              36,
+              4,
+              4,
+              7
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              1,
+              0,
+              0,
+              14,
+              13,
+              12
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              5,
+              0,
+              0,
+              4,
+              9,
+              13
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              11,
+              0,
+              7
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8559080760692732
+      },
+      "delta_accuracy": 0.004132231404958664,
+      "delta_macro_f1": 0.001543148133218264
+    },
+    "no_timestep": {
+      "n_features": 89,
+      "dropped_count": 1,
+      "metrics": {
+        "model": "xgboost_no_timestep",
+        "accuracy": 0.32644628099173556,
+        "macro_f1": 0.31019209599143654,
+        "weighted_f1": 0.3273550154519158,
+        "per_class_f1": {
+          "dwell_idle": 0.06060606060606061,
+          "reconnaissance": 0.3728813559322034,
+          "initial_access": 0.5666666666666667,
+          "execution": 0.4090909090909091,
+          "persistence": 0.22818791946308725,
+          "privilege_escalation": 0.4520547945205479,
+          "lateral_movement": 0.7058823529411765,
+          "collection": 0.0975609756097561,
+          "exfiltration": 0.1836734693877551,
+          "impact": 0.02531645569620253
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              5,
+              11,
+              35,
+              11,
+              17,
+              13,
+              1,
+              25,
+              15,
+              8
+            ],
+            [
+              7,
+              33,
+              1,
+              11,
+              11,
+              0,
+              0,
+              19,
+              17,
+              13
+            ],
+            [
+              5,
+              0,
+              68,
+              1,
+              2,
+              16,
+              5,
+              6,
+              3,
+              0
+            ],
+            [
+              3,
+              6,
+              1,
+              27,
+              4,
+              4,
+              2,
+              20,
+              2,
+              5
+            ],
+            [
+              2,
+              12,
+              4,
+              1,
+              17,
+              5,
+              0,
+              19,
+              6,
+              13
+            ],
+            [
+              0,
+              0,
+              17,
+              7,
+              2,
+              33,
+              3,
+              3,
+              2,
+              1
+            ],
+            [
+              0,
+              1,
+              7,
+              0,
+              2,
+              2,
+              36,
+              1,
+              0,
+              5
+            ],
+            [
+              0,
+              2,
+              0,
+              0,
+              6,
+              4,
+              1,
+              8,
+              12,
+              7
+            ],
+            [
+              1,
+              0,
+              1,
+              0,
+              7,
+              0,
+              0,
+              8,
+              9,
+              5
+            ],
+            [
+              1,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              15,
+              1,
+              1
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.7557281412642529
+      },
+      "delta_accuracy": 0.1418732782369146,
+      "delta_macro_f1": 0.11530671150408411
+    },
+    "no_detection_signals": {
+      "n_features": 76,
+      "dropped_count": 14,
+      "metrics": {
+        "model": "xgboost_no_detection_signals",
+        "accuracy": 0.4724517906336088,
+        "macro_f1": 0.4284152317167137,
+        "weighted_f1": 0.4449655177644492,
+        "per_class_f1": {
+          "dwell_idle": 0.039735099337748346,
+          "reconnaissance": 0.7456140350877193,
+          "initial_access": 0.6600985221674877,
+          "execution": 0.47126436781609193,
+          "persistence": 0.43333333333333335,
+          "privilege_escalation": 0.4971751412429379,
+          "lateral_movement": 0.7272727272727273,
+          "collection": 0.21818181818181817,
+          "exfiltration": 0.2727272727272727,
+          "impact": 0.21875
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              3,
+              23,
+              23,
+              18,
+              22,
+              17,
+              3,
+              16,
+              9,
+              7
+            ],
+            [
+              2,
+              85,
+              3,
+              22,
+              0,
+              0,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              5,
+              67,
+              2,
+              2,
+              28,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              2,
+              3,
+              1,
+              41,
+              23,
+              3,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              0,
+              1,
+              9,
+              39,
+              9,
+              0,
+              19,
+              1,
+              1
+            ],
+            [
+              0,
+              0,
+              2,
+              8,
+              3,
+              44,
+              4,
+              6,
+              1,
+              0
+            ],
+            [
+              1,
+              0,
+              0,
+              0,
+              3,
+              6,
+              36,
+              2,
+              0,
+              6
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              12,
+              15,
+              10
+            ],
+            [
+              1,
+              0,
+              0,
+              0,
+              5,
+              0,
+              0,
+              4,
+              9,
+              12
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              11,
+              0,
+              7
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8544378745036634
+      },
+      "delta_accuracy": -0.004132231404958664,
+      "delta_macro_f1": -0.002916424221193037
+    }
+  }
+}
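The `accuracy`, `per_class_f1`, and `macro_f1` values reported for the `no_detection_signals` ablation can be re-derived from its confusion matrix. The following is a minimal sketch, assuming rows are true labels and columns are predicted labels (the JSON does not state the orientation; accuracy and F1 come out the same either way). The helper name `metrics_from_confusion` is illustrative and not part of the repository.

```python
# Confusion matrix copied from the "no_detection_signals" block above.
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution",
    "persistence", "privilege_escalation", "lateral_movement",
    "collection", "exfiltration", "impact",
]
MATRIX = [
    [3, 23, 23, 18, 22, 17, 3, 16, 9, 7],
    [2, 85, 3, 22, 0, 0, 0, 0, 0, 0],
    [1, 5, 67, 2, 2, 28, 1, 0, 0, 0],
    [2, 3, 1, 41, 23, 3, 1, 0, 0, 0],
    [0, 0, 1, 9, 39, 9, 0, 19, 1, 1],
    [0, 0, 2, 8, 3, 44, 4, 6, 1, 0],
    [1, 0, 0, 0, 3, 6, 36, 2, 0, 6],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [1, 0, 0, 0, 5, 0, 0, 4, 9, 12],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def metrics_from_confusion(matrix):
    """Return (accuracy, per-class F1 list, macro-F1) for a square count matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    f1_scores = []
    for i in range(n):
        tp = matrix[i][i]
        row_sum = sum(matrix[i])                       # TP + FN (assumed: true count)
        col_sum = sum(matrix[r][i] for r in range(n))  # TP + FP (assumed: predicted count)
        denom = row_sum + col_sum
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum)
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / total, f1_scores, sum(f1_scores) / n


accuracy, per_class_f1, macro_f1 = metrics_from_confusion(MATRIX)
for label, score in zip(LABELS, per_class_f1):
    print(f"{label:22s} F1={score:.4f}")
print(f"accuracy={accuracy:.4f}  macro_f1={macro_f1:.4f}")
```

The matrix also shows where the low `dwell_idle` score comes from: only 3 of the 141 events in the first row sit on the diagonal, and the rest are spread across the other phases.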

feature_engineering.py ADDED

	@@ -0,0 +1,394 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB002 baseline classifier.
+Predicts `kill_chain_phase` (10-class) from event + segment-level
+observables on the CYB002 sample dataset.
+CSV inputs:
+    attack_events.csv     (primary, one row per timestep-level action)
+    network_topology.csv  (asset-level inventory; aggregated to segment
+                           level before joining on target_segment_id)
+    campaign_summary.csv  (reserved for future work, not used in v1)
+    campaign_events.csv   (reserved for future work, not used in v1)
+Target classes:
+    dwell_idle, reconnaissance, initial_access, execution, persistence,
+    privilege_escalation, lateral_movement, collection, exfiltration, impact
+This corresponds to the README's first listed use case: predicting the
+next ATT&CK phase from observable features. The challenge is that three
+fields perfectly determine phase by construction:
+  - technique_id    -> 62 of 63 techniques map 1:1 to a single phase
+  - technique_name  -> 1:1 with technique_id
+  - tactic_category -> direct alias of phase
+These are dropped before feature assembly. Phase is predicted from:
+timestep position (recon mean=6, impact mean=66), target asset type,
+protocol/port, byte volumes, connection duration, auth-failure count,
+process-injection / lateral-hop counts, attacker tier vs defender
+maturity, and segment-level topology aggregates.
+Public API
+----------
+    build_features(attack_events_path, topology_path,
+                   campaign_summary_path=None) -> (X, y, groups, meta)
+    transform_single(record, meta, segment_aggregates=None) -> np.ndarray
+    save_meta(meta, path) / load_meta(path)
+    build_segment_lookup(topology_path) -> dict
+License
+-------
+Ships with the public model on Hugging Face under CC-BY-NC-4.0, matching
+the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Label space
+# ---------------------------------------------------------------------------
+# The 10 phases observed in the sample. dwell_idle is a no-op step
+# between actions (technique_id=T0000, tactic_category=NaN). Ordering
+# follows tactic flow for readability only; cross-entropy loss is
+# indifferent to how class indices are ordered.
+LABEL_ORDER = [
+    "dwell_idle",
+    "reconnaissance",
+    "initial_access",
+    "execution",
+    "persistence",
+    "privilege_escalation",
+    "lateral_movement",
+    "collection",
+    "exfiltration",
+    "impact",
+]
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# ---------------------------------------------------------------------------
+# Columns dropped because they leak the target (kill_chain_phase)
+# ---------------------------------------------------------------------------
+# `technique_id`: 62 of 63 ATT&CK techniques map 1:1 to a single phase.
+# T1078 Valid Accounts is the one shared technique (appears in both
+# initial_access and persistence, which is correct ATT&CK behavior).
+# Including technique_id as a feature is effectively label memorization.
+#
+# `technique_name`: 1:1 alias of technique_id (63 unique values each).
+#
+# `tactic_category`: direct alias of kill_chain_phase; the two columns
+# carry identical information except tactic_category is null for
+# dwell_idle steps. Drop.
+LEAKY_COLUMNS = [
+    "technique_id",
+    "technique_name",
+    "tactic_category",
+]
+# ---------------------------------------------------------------------------
+# Columns kept as features
+# ---------------------------------------------------------------------------
+DIRECT_NUMERIC_EVENT_FEATURES = [
+    "timestep",                # strong signal: recon mean=6, impact mean=66
+    "dest_port",
+    "bytes_transferred",
+    "connection_duration_s",
+    "auth_failure_count",
+    "process_injection_flag",
+    "lateral_hop_count",
+    "c2_beacon_interval_s",    # null-aware; filled with -1 + has_c2_beacon flag
+    # Detection-related fields. These are POST-HOC observables from the
+    # SOC's perspective. We keep them as features because in the realistic
+    # phase-prediction use case, a SOC analyst has just seen an action and
+    # its initial detection outcome, and is trying to reason about which
+    # phase the campaign is in. Buyers who want a strictly pre-detection
+    # model can drop all four detection columns (edr_blocked_flag and
+    # siem_rule_triggered here, plus alert_severity and detection_outcome
+    # in CATEGORICAL_EVENT_FEATURES) and retrain.
+    "edr_blocked_flag",
+    "siem_rule_triggered",
+]
+CATEGORICAL_EVENT_FEATURES = [
+    "target_asset_type",
+    "source_ip_class",
+    "protocol",
+    "attacker_capability_tier",
+    "defender_maturity_level",
+    "alert_severity",        # critical / high / medium / low / informational
+    "detection_outcome",     # see note above re: post-hoc observables
+]
+ID_COLUMNS = ["campaign_id", "attacker_id"]
+# ---------------------------------------------------------------------------
+# Topology aggregation
+# ---------------------------------------------------------------------------
+#
+# network_topology.csv is ASSET-LEVEL (651 rows, 12 segments, ~54 assets
+# per segment). Direct join would explode rows. Aggregate to segment level:
+# constant fields as-is, numeric fields mean/max as appropriate, 0/1 flags
+# as fraction-with-coverage.
+SEGMENT_CONSTANT_TOPO_COLS = ["segment_type", "defender_maturity_level"]
+SEGMENT_NUMERIC_AGGREGATES = {
+    "patch_lag_days":              "mean",
+    "exposure_score":              "mean",
+    "vulnerability_count":         "max",    # worst-case asset matters more
+    "inter_segment_trust_level":   "mean",
+    "alert_threshold_sensitivity": "mean",
+    "mttd_baseline_hours":         "mean",
+    "mttr_baseline_hours":         "mean",
+    "siem_coverage_flag":          "mean",   # fraction with SIEM
+    "edr_deployed_flag":           "mean",   # fraction with EDR
+    "ndr_coverage_flag":           "mean",
+    "mfa_enforced_flag":           "mean",
+}
+def _aggregate_topology(topology: pd.DataFrame) -> pd.DataFrame:
+    """Collapse asset-level topology to one row per segment."""
+    parts = []
+    for col in SEGMENT_CONSTANT_TOPO_COLS:
+        parts.append(topology.groupby("segment_id")[col].first().rename(f"seg_{col}"))
+    for col, agg in SEGMENT_NUMERIC_AGGREGATES.items():
+        parts.append(topology.groupby("segment_id")[col].agg(agg).rename(f"seg_{col}_{agg}"))
+    return pd.concat(parts, axis=1).reset_index()
+TOPOLOGY_FEATURE_NAMES_NUMERIC = [
+    f"seg_{col}_{agg}" for col, agg in SEGMENT_NUMERIC_AGGREGATES.items()
+]
+TOPOLOGY_FEATURE_NAMES_CATEGORICAL = [f"seg_{col}" for col in SEGMENT_CONSTANT_TOPO_COLS]
+# ---------------------------------------------------------------------------
+# Engineered features
+# ---------------------------------------------------------------------------
+#
+# Important: NO phase-derived engineered features. is_dwell_idle,
+# is_high_severity_phase, phase_order_index would all be oracles when
+# phase is the target. Six features instead, each a stated hypothesis
+# about phase-discriminative signal in pre-phase observables.
+TIER_RANK     = {"script_kiddie": 1, "opportunistic": 2, "apt": 3, "nation_state": 4}
+DEFENDER_RANK = {"minimal": 1, "baseline": 2, "managed": 3, "advanced": 4, "zero_trust": 5}
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """Six engineered features, no phase-derived oracles."""
+    df = df.copy()
+    # 1. Byte volume on log scale. Heavy-tailed across phases: recon
+    #    transfers tend to be bytes; exfiltration megabytes. log1p tames
+    #    the tail and gives both XGBoost and the MLP a usable feature.
+    df["byte_volume_log"] = np.log1p(df["bytes_transferred"].clip(lower=0)).astype(float)
+    # 2. C2 beacon presence. c2_beacon_interval_s is null for non-C2
+    #    actions. Encode presence as a binary flag and fill the value
+    #    column with -1 so it stays usable.
+    df["has_c2_beacon"] = df["c2_beacon_interval_s"].notna().astype(int)
+    df["c2_beacon_interval_s"] = df["c2_beacon_interval_s"].fillna(-1.0)
+    # 3. Brute-force indicator. auth_failure_count > 0 distinguishes
+    #    credential-stuffing-style actions from authenticated-path
+    #    actions; hypothesized to skew toward early phases.
+    df["is_brute_forcing"] = (df["auth_failure_count"] > 0).astype(int)
+    # 4. Attacker vs defender advantage. Positive when attacker outclasses
+    #    defender; influences which phases an attacker can reach.
+    tier_r = df["attacker_capability_tier"].map(TIER_RANK).fillna(2).astype(int)
+    def_r  = df["defender_maturity_level"].map(DEFENDER_RANK).fillna(2).astype(int)
+    df["attacker_defender_advantage"] = (tier_r - def_r).astype(int)
+    # 5. High-volume action indicator. Simple binary above 100 KB,
+    #    correlates with collection / exfiltration phases.
+    df["is_high_volume"] = (df["bytes_transferred"] > 100_000).astype(int)
+    # 6. Privileged-port indicator. dest_port < 1024, typically system
+    #    services; common in initial-access and lateral-movement actions.
+    df["is_privileged_port"] = (df["dest_port"] < 1024).astype(int)
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    attack_events_path: str | Path,
+    topology_path: str | Path,
+    campaign_summary_path: str | Path | None = None,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+    """
+    Load CSVs, aggregate topology, drop leaky columns, engineer features,
+    one-hot encode, return (X, y, groups, meta).
+    `groups` is a Series of campaign_id values aligned with X for
+    GroupShuffleSplit / GroupKFold use. A single campaign generates ~40
+    correlated events; row-level random splitting inflates metrics.
+    """
+    events = pd.read_csv(attack_events_path)
+    topology = pd.read_csv(topology_path)
+    events = events.drop(columns=LEAKY_COLUMNS, errors="ignore")
+    topo_agg = _aggregate_topology(topology)
+    events = events.merge(
+        topo_agg, left_on="target_segment_id", right_on="segment_id", how="left",
+    ).drop(columns=["segment_id"], errors="ignore")
+    y = events["kill_chain_phase"].map(LABEL_TO_INT)
+    if y.isna().any():
+        bad = events.loc[y.isna(), "kill_chain_phase"].unique()
+        raise ValueError(f"Unknown kill_chain_phase values: {bad}")
+    y = y.astype(int)
+    groups = events["campaign_id"].copy()
+    events = _add_engineered_features(events)
+    numeric_features = (
+        DIRECT_NUMERIC_EVENT_FEATURES
+        + TOPOLOGY_FEATURE_NAMES_NUMERIC
+        + [
+            "byte_volume_log", "has_c2_beacon", "is_brute_forcing",
+            "attacker_defender_advantage", "is_high_volume",
+            "is_privileged_port",
+        ]
+    )
+    X_numeric = events[numeric_features].astype(float)
+    all_categorical = (
+        [(col, "event")    for col in CATEGORICAL_EVENT_FEATURES]
+        + [(col, "topology") for col in TOPOLOGY_FEATURE_NAMES_CATEGORICAL]
+    )
+    categorical_levels: dict[str, list[str]] = {}
+    blocks: list[pd.DataFrame] = []
+    for col, _src in all_categorical:
+        levels = sorted(events[col].dropna().unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            events[col].astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        blocks.append(block)
+    X = pd.concat(
+        [X_numeric.reset_index(drop=True)]
+        + [b.reset_index(drop=True) for b in blocks],
+        axis=1,
+    ).fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+        "topology_aggregation": {
+            "segment_constant": SEGMENT_CONSTANT_TOPO_COLS,
+            "segment_numeric_aggregates": SEGMENT_NUMERIC_AGGREGATES,
+        },
+    }
+    return X, y, groups, meta
+def transform_single(
+    record: dict | pd.DataFrame,
+    meta: dict[str, Any],
+    segment_aggregates: dict | None = None,
+) -> np.ndarray:
+    """Encode a single event record for inference.
+    `record` must contain event-level fields (sans leaky columns) plus
+    the segment-level aggregate fields. If you only have the raw event,
+    pass `segment_aggregates` as a dict {seg_*: value, ...} and they'll
+    be merged in.
+    """
+    if isinstance(record, dict):
+        df = pd.DataFrame([record.copy()])
+    else:
+        # Reset the index so the positional blocks built below stay row-aligned.
+        df = record.copy().reset_index(drop=True)
+    if segment_aggregates is not None:
+        for k, v in segment_aggregates.items():
+            df[k] = v
+    df = _add_engineered_features(df)
+    numeric = pd.DataFrame({
+        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+        for col in meta["numeric_features"]
+    })
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df)))
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        for lvl in levels:
+            cname = f"{col}_{lvl}"
+            if cname not in block.columns:
+                block[cname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    """Write `meta` to JSON; int_to_label keys are stringified for JSON."""
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+        "topology_aggregation": meta["topology_aggregation"],
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    """Read meta JSON and restore int_to_label keys from str back to int."""
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+def build_segment_lookup(topology_path: str | Path) -> dict[str, dict]:
+    """Build a {segment_id: {seg_* feature values}} lookup for inference."""
+    topology = pd.read_csv(topology_path)
+    agg = _aggregate_topology(topology)
+    return {row["segment_id"]: {k: v for k, v in row.items() if k != "segment_id"}
+            for _, row in agg.iterrows()}
+if __name__ == "__main__":
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, groups, meta = build_features(
+        base / "attack_events.csv",
+        base / "network_topology.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"groups: {groups.nunique()} campaigns")
+    print(f"n features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
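`build_features` returns `groups` so that a split can keep every campaign entirely on one side. Below is a minimal standard-library sketch of such a split. The function name `group_train_test_split`, the 80/20 ratio, and the toy campaign IDs are assumptions for illustration; scikit-learn's `GroupShuffleSplit`, which the docstring mentions, is the usual library route.

```python
import random


def group_train_test_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, never individual rows.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy stand-in for the `groups` Series: 5 campaigns with 4 events each.
groups = [f"CAMP-{c:03d}" for c in range(5) for _ in range(4)]
train_idx, test_idx = group_train_test_split(groups, test_fraction=0.2, seed=42)

train_campaigns = {groups[i] for i in train_idx}
test_campaigns = {groups[i] for i in test_idx}
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"shared campaigns: {train_campaigns & test_campaigns}")
```

Splitting by campaign rather than by row is what keeps correlated events from the same campaign out of both sides at once.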

feature_meta.json ADDED

	@@ -0,0 +1,249 @@

+{
+  "feature_names": [
+    "timestep",
+    "dest_port",
+    "bytes_transferred",
+    "connection_duration_s",
+    "auth_failure_count",
+    "process_injection_flag",
+    "lateral_hop_count",
+    "c2_beacon_interval_s",
+    "edr_blocked_flag",
+    "siem_rule_triggered",
+    "seg_patch_lag_days_mean",
+    "seg_exposure_score_mean",
+    "seg_vulnerability_count_max",
+    "seg_inter_segment_trust_level_mean",
+    "seg_alert_threshold_sensitivity_mean",
+    "seg_mttd_baseline_hours_mean",
+    "seg_mttr_baseline_hours_mean",
+    "seg_siem_coverage_flag_mean",
+    "seg_edr_deployed_flag_mean",
+    "seg_ndr_coverage_flag_mean",
+    "seg_mfa_enforced_flag_mean",
+    "byte_volume_log",
+    "has_c2_beacon",
+    "is_brute_forcing",
+    "attacker_defender_advantage",
+    "is_high_volume",
+    "is_privileged_port",
+    "target_asset_type_backup_system",
+    "target_asset_type_cloud_vm",
+    "target_asset_type_container",
+    "target_asset_type_database_server",
+    "target_asset_type_domain_controller",
+    "target_asset_type_ehr_system",
+    "target_asset_type_email_server",
+    "target_asset_type_firewall",
+    "target_asset_type_iot_device",
+    "target_asset_type_router",
+    "target_asset_type_scada_plc",
+    "target_asset_type_server",
+    "target_asset_type_vpn_gateway",
+    "target_asset_type_web_server",
+    "target_asset_type_workstation",
+    "source_ip_class_cloud_egress",
+    "source_ip_class_external_internet",
+    "source_ip_class_internal_lan",
+    "source_ip_class_tor_exit",
+    "source_ip_class_vpn_tunnel",
+    "protocol_dns",
+    "protocol_ftp",
+    "protocol_http",
+    "protocol_https",
+    "protocol_icmp",
+    "protocol_rdp",
+    "protocol_smb",
+    "protocol_ssh",
+    "protocol_tcp",
+    "protocol_udp",
+    "attacker_capability_tier_apt",
+    "attacker_capability_tier_nation_state",
+    "attacker_capability_tier_opportunistic",
+    "attacker_capability_tier_script_kiddie",
+    "defender_maturity_level_advanced",
+    "defender_maturity_level_baseline",
+    "defender_maturity_level_managed",
+    "defender_maturity_level_minimal",
+    "defender_maturity_level_zero_trust",
+    "alert_severity_critical",
+    "alert_severity_high",
+    "alert_severity_informational",
+    "alert_severity_low",
+    "alert_severity_medium",
+    "detection_outcome_blind_spot",
+    "detection_outcome_edr_blocked",
+    "detection_outcome_evasion_success",
+    "detection_outcome_high_confidence_alert",
+    "detection_outcome_ir_escalated",
+    "detection_outcome_marginal_alert",
+    "detection_outcome_suppressed_alert",
+    "seg_segment_type_cloud_workload",
+    "seg_segment_type_corporate_lan",
+    "seg_segment_type_data_exfiltration_target",
+    "seg_segment_type_endpoint_fleet",
+    "seg_segment_type_soc_management_plane",
+    "seg_segment_type_supply_chain_interface",
+    "seg_segment_type_zero_trust_segment",
+    "seg_defender_maturity_level_advanced",
+    "seg_defender_maturity_level_baseline",
+    "seg_defender_maturity_level_managed",
+    "seg_defender_maturity_level_minimal",
+    "seg_defender_maturity_level_zero_trust"
+  ],
+  "numeric_features": [
+    "timestep",
+    "dest_port",
+    "bytes_transferred",
+    "connection_duration_s",
+    "auth_failure_count",
+    "process_injection_flag",
+    "lateral_hop_count",
+    "c2_beacon_interval_s",
+    "edr_blocked_flag",
+    "siem_rule_triggered",
+    "seg_patch_lag_days_mean",
+    "seg_exposure_score_mean",
+    "seg_vulnerability_count_max",
+    "seg_inter_segment_trust_level_mean",
+    "seg_alert_threshold_sensitivity_mean",
+    "seg_mttd_baseline_hours_mean",
+    "seg_mttr_baseline_hours_mean",
+    "seg_siem_coverage_flag_mean",
+    "seg_edr_deployed_flag_mean",
+    "seg_ndr_coverage_flag_mean",
+    "seg_mfa_enforced_flag_mean",
+    "byte_volume_log",
+    "has_c2_beacon",
+    "is_brute_forcing",
+    "attacker_defender_advantage",
+    "is_high_volume",
+    "is_privileged_port"
+  ],
+  "categorical_levels": {
+    "target_asset_type": [
+      "backup_system",
+      "cloud_vm",
+      "container",
+      "database_server",
+      "domain_controller",
+      "ehr_system",
+      "email_server",
+      "firewall",
+      "iot_device",
+      "router",
+      "scada_plc",
+      "server",
+      "vpn_gateway",
+      "web_server",
+      "workstation"
+    ],
+    "source_ip_class": [
+      "cloud_egress",
+      "external_internet",
+      "internal_lan",
+      "tor_exit",
+      "vpn_tunnel"
+    ],
+    "protocol": [
+      "dns",
+      "ftp",
+      "http",
+      "https",
+      "icmp",
+      "rdp",
+      "smb",
+      "ssh",
+      "tcp",
+      "udp"
+    ],
+    "attacker_capability_tier": [
+      "apt",
+      "nation_state",
+      "opportunistic",
+      "script_kiddie"
+    ],
+    "defender_maturity_level": [
+      "advanced",
+      "baseline",
+      "managed",
+      "minimal",
+      "zero_trust"
+    ],
+    "alert_severity": [
+      "critical",
+      "high",
+      "informational",
+      "low",
+      "medium"
+    ],
+    "detection_outcome": [
+      "blind_spot",
+      "edr_blocked",
+      "evasion_success",
+      "high_confidence_alert",
+      "ir_escalated",
+      "marginal_alert",
+      "suppressed_alert"
+    ],
+    "seg_segment_type": [
+      "cloud_workload",
+      "corporate_lan",
+      "data_exfiltration_target",
+      "endpoint_fleet",
+      "soc_management_plane",
+      "supply_chain_interface",
+      "zero_trust_segment"
+    ],
+    "seg_defender_maturity_level": [
+      "advanced",
+      "baseline",
+      "managed",
+      "minimal",
+      "zero_trust"
+    ]
+  },
+  "label_to_int": {
+    "dwell_idle": 0,
+    "reconnaissance": 1,
+    "initial_access": 2,
+    "execution": 3,
+    "persistence": 4,
+    "privilege_escalation": 5,
+    "lateral_movement": 6,
+    "collection": 7,
+    "exfiltration": 8,
+    "impact": 9
+  },
+  "int_to_label": {
+    "0": "dwell_idle",
+    "1": "reconnaissance",
+    "2": "initial_access",
+    "3": "execution",
+    "4": "persistence",
+    "5": "privilege_escalation",
+    "6": "lateral_movement",
+    "7": "collection",
+    "8": "exfiltration",
+    "9": "impact"
+  },
+  "topology_aggregation": {
+    "segment_constant": [
+      "segment_type",
+      "defender_maturity_level"
+    ],
+    "segment_numeric_aggregates": {
+      "patch_lag_days": "mean",
+      "exposure_score": "mean",
+      "vulnerability_count": "max",
+      "inter_segment_trust_level": "mean",
+      "alert_threshold_sensitivity": "mean",
+      "mttd_baseline_hours": "mean",
+      "mttr_baseline_hours": "mean",
+      "siem_coverage_flag": "mean",
+      "edr_deployed_flag": "mean",
+      "ndr_coverage_flag": "mean",
+      "mfa_enforced_flag": "mean"
+    }
+  }
+}
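The `categorical_levels` lists above fix both the set and the order of one-hot columns, so a value that was never seen at training time encodes to an all-zero block instead of adding a column. A minimal sketch of that behavior follows, mirroring the `set_categories` + `get_dummies` pattern used in `feature_engineering.py`. The helper name `one_hot_fixed` and the level `satellite_link` are made up for illustration.

```python
import pandas as pd

# Levels copied from "categorical_levels" -> "source_ip_class" above.
LEVELS = ["cloud_egress", "external_internet", "internal_lan", "tor_exit", "vpn_tunnel"]


def one_hot_fixed(values, column, levels):
    """One-hot encode `values` against a fixed, ordered list of levels."""
    # Values outside `levels` become NaN once the categories are pinned.
    cat = pd.Series(values).astype("category").cat.set_categories(levels)
    # dummy_na=False means NaN rows get zeros in every column.
    return pd.get_dummies(cat, prefix=column, dummy_na=False).astype(int)


# "satellite_link" is a hypothetical level absent from the training data.
encoded = one_hot_fixed(["tor_exit", "satellite_link"], "source_ip_class", LEVELS)
print(encoded.columns.tolist())
print(encoded.values.tolist())
```

An all-zero block is indistinguishable from a missing value, so callers who need to detect unseen levels would have to check the raw field before encoding.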

feature_scaler.json ADDED

	@@ -0,0 +1 @@

+ {"mean": [29.669737774627922, 2374.859673990078, 94190.43841601702, 14.909633238837705, 1.3384124734231042, 0.13040396881644223, 0.05705173635719348, 0.0, 0.3621545003543586, 0.166194188518781, 34.396196846241345, 0.512852316745022, 14.379518072289157, 0.392728801495667, 0.7238749335681469, 6.124241842889212, 36.93126845133998, 0.6976009715267184, 0.8059781553368865, 0.4883178731877128, 0.6477277267624112, 9.540027804902557, 1.0, 0.5510276399716513, -0.010276399716513111, 0.16725726435152374, 0.6463501063075833, 0.0627214741318214, 0.06909992912827782, 0.06591070163004961, 0.0705173635719348, 0.07122608079376329, 0.06520198440822111, 0.0673281360737066, 0.06520198440822111, 0.07299787384833452, 0.058114812189936214, 0.06945428773919206, 0.06130403968816442, 0.057406094968107724, 0.07973068745570518, 0.06378454996456413, 0.19383416017009214, 0.20233876683203403, 0.20411055988660523, 0.2147413182140326, 0.184975194897236, 0.10063784549964565, 0.09815733522324592, 0.10311835577604536, 0.09780297661233169, 0.09319631467044649, 0.09886605244507442, 0.10099220411055988, 0.10276399716513111, 0.09673990077958894, 0.10772501771793054, 0.2271438695960312, 0.4875974486180014, 0.22749822820694543, 0.05776045357902197, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376, 0.43834160170092135, 0.06413890857547838, 0.4043231750531538, 0.0673281360737066, 0.0258681785967399, 0.059532246633593196, 0.3621545003543586, 0.3447909284195606, 0.10701630049610206, 0.03330970942593905, 0.0258681785967399, 0.0673281360737066, 0.04642097802976612, 0.23954642097802978, 0.09603118355776046, 0.22041105598866054, 0.08788093550673282, 0.22395464209780297, 0.08575478384124734, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376], "std": [21.611718068894575, 3262.3953544252254, 493540.4889491936, 26.882083698757928, 1.7063611088856259, 0.33680702458505324, 0.23198255508867544, 1.0, 
0.4807083352654771, 0.3723208326229761, 27.918565668886338, 0.16437622036073063, 6.809572056022862, 0.031089587614791407, 0.16380824644388434, 3.278380945942728, 19.765170276913693, 0.1795819790728066, 0.06230034648225459, 0.1392601592567418, 0.14782851174966183, 1.9732855896589672, 1.0, 0.49747751507194377, 1.14101486329445, 0.3732715435730835, 0.47818686198141197, 0.2425042887305301, 0.25366894008973206, 0.2481699123323166, 0.2560622962417658, 0.2572476946006164, 0.2469256805073951, 0.2506338325607766, 0.2469256805073951, 0.2601791150733033, 0.23400188966661606, 0.25427013215761435, 0.23992968449882038, 0.2326581539410086, 0.2709238172522079, 0.24441204871928013, 0.3953705491117327, 0.4018146379110633, 0.40312160075220743, 0.4107155466365413, 0.38834625527902134, 0.30090190074813306, 0.29757999359048065, 0.3041672976137825, 0.29710071221594014, 0.29075886802325146, 0.2985349856721442, 0.3013718026414007, 0.303704202734562, 0.2956556572543326, 0.3100877479347836, 0.41906057037245537, 0.49993473898865376, 0.4192911666305911, 0.23333125829157636, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016, 0.49627161445350315, 0.24504364292201097, 0.49084755398842195, 0.2506338325607766, 0.15877011238413913, 0.2366601047140391, 0.4807083352654771, 0.4753842926323826, 0.30918875753830577, 0.17947586784069838, 0.15877011238413913, 0.2506338325607766, 0.21043232273668783, 0.426882310965684, 0.2946862192784542, 0.4145973147754568, 0.2831718407328375, 0.4169659091226798, 0.28005123240617036, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016]}
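The `mean` and `std` arrays each hold 90 entries, matching the 90 names in `feature_meta.json`, so they appear to line up positionally with `feature_names`. Here is a minimal sketch of applying them, assuming the MLP input is standardized as `(x - mean) / std`; the notebook describes this file as "MLP input standardization" but the exact call is not shown in this part of the diff. `standardize` is a hypothetical helper, and only the first three features are used.

```python
# First three entries of feature_scaler.json, which correspond to
# timestep, dest_port, bytes_transferred in feature_meta.json.
MEAN = [29.669737774627922, 2374.859673990078, 94190.43841601702]
STD = [21.611718068894575, 3262.3953544252254, 493540.4889491936]


def standardize(row, mean, std):
    """Return (x - mean) / std element-wise; all lengths must match."""
    if not (len(row) == len(mean) == len(std)):
        raise ValueError("row, mean and std must have the same length")
    return [(x - m) / s for x, m, s in zip(row, mean, std)]


# Hypothetical event: timestep 6, destination port 443, 1,200 bytes.
z = standardize([6.0, 443.0, 1200.0], MEAN, STD)
print([round(v, 3) for v in z])
```

The eighth entry (`c2_beacon_interval_s`) has mean 0.0 and std 1.0, which suggests that zero-variance columns were given a unit divisor to avoid division by zero. That is an inference from the stored values, not something the file states.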

inference_example.ipynb ADDED

	@@ -0,0 +1,343 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB002 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **MITRE ATT&CK kill-chain phase** of a new attack-event record.\n",
+    "\n",
+    "**Models predict one of 10 phases:** `dwell_idle`, `reconnaissance`, `initial_access`, `execution`, `persistence`, `privilege_escalation`, `lateral_movement`, `collection`, `exfiltration`, `impact`.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production threat detector. See the model card for full metrics and limitations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost scikit-learn torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face\n",
+    "\n",
+    "Five files are needed:\n",
+    "- `model_xgb.json` — XGBoost weights\n",
+    "- `model_mlp.safetensors` — PyTorch MLP weights\n",
+    "- `feature_engineering.py` — feature pipeline (must match the one used at training)\n",
+    "- `feature_meta.json` — feature column order + categorical levels\n",
+    "- `feature_scaler.json` — MLP input standardization (mean / std)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb002-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Make feature_engineering.py importable\n",
+    "import sys, os\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import (\n",
+    "    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "N_CLASSES = len(meta[\"int_to_label\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"class count:   {N_CLASSES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# XGBoost\n",
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# MLP architecture (must match training)\n",
+    "class PhaseMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=10, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Build segment-aggregate lookup from the dataset\n",
+    "\n",
+    "Per-segment topology aggregates (mean exposure, fraction with EDR, etc.) are computed at training time and must be available at inference time too. The helper `build_segment_lookup` pulls them from `network_topology.csv`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb002-sample\", repo_type=\"dataset\")\n",
+    "\n",
+    "import os\n",
+    "segment_aggregates_lookup = build_segment_lookup(\n",
+    "    os.path.join(ds_path, \"network_topology.csv\")\n",
+    ")\n",
+    "print(f\"loaded {len(segment_aggregates_lookup)} segment aggregates\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Prediction helper"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "SD[SD == 0] = 1.0  # guard: a constant feature would otherwise divide by zero\n",
+    "\n",
+    "def predict_phase(record: dict) -> dict:\n",
+    "    \"\"\"Predict the kill-chain phase for one event record.\n",
+    "\n",
+    "    `record` is a dict with event-level fields. Segment-level aggregates\n",
+    "    are pulled automatically from `segment_aggregates_lookup` using the\n",
+    "    `target_segment_id` field.\n",
+    "\n",
+    "    Returns a dict with both models' predictions and per-class probabilities.\n",
+    "    \"\"\"\n",
+    "    seg_id = record.get(\"target_segment_id\")\n",
+    "    seg_agg = segment_aggregates_lookup.get(seg_id, {})\n",
+    "    X = transform_single(record, meta, segment_aggregates=seg_agg)\n",
+    "\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Run on an example record\n",
+    "\n",
+    "This is a real `reconnaissance` event taken from the sample dataset: an opportunistic attacker probing an email server at the start of a campaign (timestep 0). Both models should predict `reconnaissance`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Real attack event from the sample dataset (true label: reconnaissance)\n",
+    "example_record = {\n",
+    "    \"campaign_id\": \"CAMP-000030\",\n",
+    "    \"attacker_id\": \"ATK-0003\",\n",
+    "    \"timestep\": 0,\n",
+    "    \"target_segment_id\": \"SEG-0008\",\n",
+    "    \"target_asset_type\": \"email_server\",\n",
+    "    \"source_ip_class\": \"vpn_tunnel\",\n",
+    "    \"dest_port\": 22,\n",
+    "    \"protocol\": \"icmp\",\n",
+    "    \"bytes_transferred\": 15648.48,\n",
+    "    \"connection_duration_s\": 3.913,\n",
+    "    \"auth_failure_count\": 0,\n",
+    "    \"process_injection_flag\": 0,\n",
+    "    \"lateral_hop_count\": 0,\n",
+    "    \"c2_beacon_interval_s\": 0.0,\n",
+    "    \"detection_outcome\": \"edr_blocked\",\n",
+    "    \"alert_severity\": \"critical\",\n",
+    "    \"siem_rule_triggered\": 0,\n",
+    "    \"edr_blocked_flag\": 1,\n",
+    "    \"attacker_capability_tier\": \"opportunistic\",\n",
+    "    \"defender_maturity_level\": \"baseline\",\n",
+    "}\n",
+    "\n",
+    "result = predict_phase(example_record)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
+    "    print(f\"    P({lbl:25s}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
+    "    print(f\"    P({lbl:25s}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Note: when the two models disagree\n",
+    "\n",
+    "XGBoost and the MLP can disagree on out-of-distribution records, particularly hand-crafted inputs whose feature combinations lie off the training-data manifold. The MLP, trained with BatchNorm on a small dataset, is reliable over a narrower range of inputs than the tree ensemble. Disagreement is a useful triage signal: in a SOC workflow, events with conflicting predictions are worth a human look."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Batch prediction on the sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "events = pd.read_csv(os.path.join(ds_path, \"attack_events.csv\"))\n",
+    "\n",
+    "# Drop leakage columns the model was never trained on\n",
+    "events = events.drop(columns=[\"technique_id\", \"technique_name\", \"tactic_category\"],\n",
+    "                     errors=\"ignore\")\n",
+    "\n",
+    "# Score the first 200 events\n",
+    "sample = events.head(200).copy()\n",
+    "preds = [predict_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
+    "sample[\"xgb_pred\"] = preds\n",
+    "\n",
+    "ct = pd.crosstab(sample[\"kill_chain_phase\"], sample[\"xgb_pred\"],\n",
+    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
+    "print(ct)\n",
+    "acc = (sample[\"kill_chain_phase\"] == sample[\"xgb_pred\"]).mean()\n",
+    "print(f\"\\nbatch accuracy on first 200 (in-distribution): {acc:.4f}\")\n",
+    "print(\"\\nNote: this includes training-set events. See validation_results.json\\n\"\n",
+    "      \"for proper held-out test-set metrics from disjoint campaigns.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for held-out test-set metrics (15 disjoint campaigns, 726 events).\n",
+    "- See `ablation_results.json` for per-feature-group contributions. `timestep` is by far the most predictive feature, which is expected: kill-chain phases progress in time, so an event's position in the campaign timeline carries most of the phase signal.\n",
+    "- The model card's **Limitations** section explains the gap between this baseline and production threat-detection systems.\n",
+    "- For the full 380k-row CYB002 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
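
The notebook's note on model disagreement can be turned into a small triage helper. The sketch below assumes only the result shape returned by `predict_phase` (a `label` and a `probabilities` dict per model). The function name `triage_disagreement`, the `min_margin` threshold, and the toy probabilities are hypothetical and not part of the repository.

```python
def triage_disagreement(result: dict, min_margin: float = 0.10) -> dict:
    """Flag one prediction for human review.

    `result` is assumed to look like the dict returned by `predict_phase`:
    {"xgboost": {"label": str, "probabilities": {label: p}}, "mlp": {...}}.
    `min_margin` is a hypothetical threshold, not a tuned value.
    """
    def top_margin(probs: dict) -> float:
        # Gap between the most likely and second most likely class.
        ranked = sorted(probs.values(), reverse=True)
        return ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)

    xgb_out, mlp_out = result["xgboost"], result["mlp"]
    models_disagree = xgb_out["label"] != mlp_out["label"]
    margins = {
        "xgboost": top_margin(xgb_out["probabilities"]),
        "mlp": top_margin(mlp_out["probabilities"]),
    }
    low_confidence = min(margins.values()) < min_margin
    return {
        "needs_review": models_disagree or low_confidence,
        "models_disagree": models_disagree,
        "margins": margins,
    }


# Toy input with made-up probabilities (three classes only, for brevity).
toy_result = {
    "xgboost": {
        "label": "reconnaissance",
        "probabilities": {"reconnaissance": 0.7, "execution": 0.2, "dwell_idle": 0.1},
    },
    "mlp": {
        "label": "execution",
        "probabilities": {"reconnaissance": 0.3, "execution": 0.6, "dwell_idle": 0.1},
    },
}
flag = triage_disagreement(toy_result)
print(flag["needs_review"], flag["models_disagree"])
```

In the notebook this would be called as `triage_disagreement(predict_phase(record))`. Using the top-two margin alongside label disagreement is one possible design choice; it also catches cases where both models agree but neither is confident.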

model_mlp.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f35e1a5f1a92330b2ebdf1f65a097ead961fed4b9dbf4ea11aed7d74a5f293bd
+size 86512
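
The pointer above records the SHA-256 digest and byte size of the real weights file, so a downloaded copy can be checked against it. This is a minimal sketch using only the standard library; the helper name `matches_lfs_pointer` is hypothetical, and the two constants are copied from the pointer.

```python
import hashlib
import os

# Values copied from the Git LFS pointer for model_mlp.safetensors.
EXPECTED_SHA256 = "f35e1a5f1a92330b2ebdf1f65a097ead961fed4b9dbf4ea11aed7d74a5f293bd"
EXPECTED_SIZE = 86512


def matches_lfs_pointer(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file at `path` has the given size and SHA-256 digest."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

With the notebook's `files` dict in scope, the check would be `matches_lfs_pointer(files["model_mlp.safetensors"], EXPECTED_SHA256, EXPECTED_SIZE)`.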

model_xgb.json ADDED

The diff for this file is too large to render. See raw diff

validation_results.json ADDED

	@@ -0,0 +1,383 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb002-sample",
+  "task": "10-class kill_chain_phase classification",
+  "baselines": {
+    "always_predict_majority_accuracy": 0.19421487603305784,
+    "majority_class": "dwell_idle",
+    "random_guess_accuracy": 0.1
+  },
+  "split": {
+    "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
+    "rationale": "100 campaigns generate 4,353 events; a random row split would leak campaign-level correlations into the test set. The group-aware split keeps train, val, and test campaigns disjoint.",
+    "campaigns_train": 69,
+    "campaigns_val": 16,
+    "campaigns_test": 15,
+    "events_train": 2822,
+    "events_val": 805,
+    "events_test": 726,
+    "seed": 42
+  },
+  "n_features": 90,
+  "label_classes": [
+    "dwell_idle",
+    "reconnaissance",
+    "initial_access",
+    "execution",
+    "persistence",
+    "privilege_escalation",
+    "lateral_movement",
+    "collection",
+    "exfiltration",
+    "impact"
+  ],
+  "class_distribution_train": {
+    "dwell_idle": 609,
+    "reconnaissance": 439,
+    "initial_access": 346,
+    "execution": 313,
+    "persistence": 275,
+    "privilege_escalation": 254,
+    "lateral_movement": 205,
+    "collection": 165,
+    "exfiltration": 117,
+    "impact": 99
+  },
+  "class_distribution_test": {
+    "dwell_idle": 141,
+    "reconnaissance": 112,
+    "initial_access": 106,
+    "persistence": 79,
+    "execution": 74,
+    "privilege_escalation": 68,
+    "lateral_movement": 54,
+    "collection": 40,
+    "exfiltration": 31,
+    "impact": 21
+  },
+  "leakage_excluded_features": [
+    "technique_id (62/63 techniques map 1:1 to a single phase)",
+    "technique_name (1:1 alias of technique_id)",
+    "tactic_category (direct alias of kill_chain_phase)"
+  ],
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 10 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.46831955922865015,
+        "macro_f1": 0.42549880749552066,
+        "weighted_f1": 0.440668872633435,
+        "per_class_f1": {
+          "dwell_idle": 0.040268456375838924,
+          "reconnaissance": 0.7532467532467533,
+          "initial_access": 0.6467661691542289,
+          "execution": 0.4406779661016949,
+          "persistence": 0.41304347826086957,
+          "privilege_escalation": 0.5,
+          "lateral_movement": 0.7422680412371134,
+          "collection": 0.22018348623853212,
+          "exfiltration": 0.2727272727272727,
+          "impact": 0.22580645161290322
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              3,
+              23,
+              23,
+              18,
+              21,
+              18,
+              2,
+              17,
+              9,
+              7
+            ],
+            [
+              2,
+              87,
+              2,
+              21,
+              0,
+              0,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              5,
+              65,
+              5,
+              3,
+              26,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              2,
+              4,
+              1,
+              39,
+              24,
+              3,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              0,
+              0,
+              1,
+              12,
+              38,
+              9,
+              0,
+              18,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              3,
+              8,
+              4,
+              44,
+              3,
+              5,
+              1,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              6,
+              6,
+              36,
+              2,
+              0,
+              4
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              12,
+              15,
+              10
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              5,
+              0,
+              0,
+              4,
+              9,
+              13
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              2,
+              1,
+              0,
+              11,
+              0,
+              7
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8598653258869782
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 90 -> 128 -> 64 -> 10, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.44490358126721763,
+        "macro_f1": 0.3911186394257205,
+        "weighted_f1": 0.4172764238320775,
+        "per_class_f1": {
+          "dwell_idle": 0.013422818791946308,
+          "reconnaissance": 0.7250996015936255,
+          "initial_access": 0.6484018264840182,
+          "execution": 0.5100671140939598,
+          "persistence": 0.30120481927710846,
+          "privilege_escalation": 0.4880952380952381,
+          "lateral_movement": 0.782608695652174,
+          "collection": 0.19130434782608696,
+          "exfiltration": 0.11940298507462686,
+          "impact": 0.13157894736842105
+        },
+        "confusion_matrix": {
+          "labels": [
+            "dwell_idle",
+            "reconnaissance",
+            "initial_access",
+            "execution",
+            "persistence",
+            "privilege_escalation",
+            "lateral_movement",
+            "collection",
+            "exfiltration",
+            "impact"
+          ],
+          "matrix": [
+            [
+              1,
+              26,
+              27,
+              11,
+              20,
+              18,
+              1,
+              20,
+              10,
+              7
+            ],
+            [
+              0,
+              91,
+              4,
+              10,
+              7,
+              0,
+              0,
+              0,
+              0,
+              0
+            ],
+            [
+              1,
+              4,
+              71,
+              1,
+              5,
+              21,
+              0,
+              3,
+              0,
+              0
+            ],
+            [
+              1,
+              10,
+              3,
+              38,
+              17,
+              3,
+              0,
+              2,
+              0,
+              0
+            ],
+            [
+              4,
+              8,
+              2,
+              8,
+              25,
+              9,
+              0,
+              11,
+              5,
+              7
+            ],
+            [
+              0,
+              0,
+              6,
+              7,
+              4,
+              41,
+              1,
+              7,
+              2,
+              0
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              0,
+              7,
+              36,
+              3,
+              4,
+              4
+            ],
+            [
+              1,
+              0,
+              0,
+              0,
+              1,
+              1,
+              0,
+              11,
+              11,
+              15
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              5,
+              0,
+              0,
+              5,
+              4,
+              17
+            ],
+            [
+              0,
+              0,
+              0,
+              0,
+              3,
+              0,
+              0,
+              13,
+              0,
+              5
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8496117986303245
+      }
+    }
+  }
+}
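
The headline XGBoost numbers in `validation_results.json` can be re-derived from its confusion matrix alone. The sketch below copies that matrix from the file and assumes rows are true labels and columns are predictions, which is consistent with the row sums matching `class_distribution_test`. The helper name `metrics_from_confusion` is hypothetical.

```python
import numpy as np

LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution", "persistence",
    "privilege_escalation", "lateral_movement", "collection", "exfiltration", "impact",
]

# XGBoost test confusion matrix from validation_results.json
# (rows: true label, columns: predicted label).
XGB_CONFUSION = np.array([
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
])


def metrics_from_confusion(cm) -> dict:
    """Accuracy, per-class F1, macro-F1, and majority baseline from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correct predictions per class
    support = cm.sum(axis=1)      # true count per class (row sums)
    predicted = cm.sum(axis=0)    # predicted count per class (column sums)
    denom = support + predicted
    # F1 = 2*TP / (support + predicted); defined as 0 when a class never appears.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "per_class_f1": f1,
        "macro_f1": f1.mean(),
        "majority_baseline": support.max() / cm.sum(),
    }


xgb_metrics = metrics_from_confusion(XGB_CONFUSION)
for name, score in zip(LABELS, xgb_metrics["per_class_f1"]):
    print(f"{name:22s} F1 = {score:.4f}")
print(f"accuracy = {xgb_metrics['accuracy']:.4f}, macro-F1 = {xgb_metrics['macro_f1']:.4f}")
```

The `dwell_idle` row shows why that class scores near zero: only 3 of its 141 test events are predicted correctly, with the rest spread across almost every other phase.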