Initial release: attack_lifecycle_phase 5-class baseline + 11-oracle-path leakage diagnostic

Files changed (11)

README.md +483 -0
ablation_results.json +659 -0
feature_engineering.py +413 -0
feature_meta.json +224 -0
feature_scaler.json +1 -0
inference_example.ipynb +350 -0
leakage_diagnostic.json +186 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
multi_seed_results.json +98 -0
validation_results.json +180 -0

README.md ADDED

	@@ -0,0 +1,483 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - siem
+  - security-logs
+  - mitre-attack
+  - apt
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+  - leakage-diagnostic
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb010-sample
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+model-index:
+  - name: cyb010-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 5-class attack lifecycle phase classification
+        dataset:
+          type: xpertsystems/cyb010-sample
+          name: CYB010 Synthetic Security Event Log Dataset (Sample)
+        metrics:
+          - type: roc_auc
+            value: 0.9904
+            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.9493
+            name: Test accuracy (XGBoost, seed 42)
+          - type: f1
+            value: 0.7781
+            name: Test macro-F1 (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.936
+            name: Multi-seed accuracy mean ± 0.007 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.988
+            name: Multi-seed ROC-AUC mean ± 0.001 (XGBoost, 10 seeds)
+---
+# CYB010 Baseline Classifier
+**Attack lifecycle phase classifier (5-class) trained on the CYB010
+synthetic security event log sample. Predicts which of 5 attack phases
+(`benign_background` / `initial_access` / `lateral_movement` /
+`persistence_establishment` / `exfiltration_or_impact`) a security
+event belongs to, from per-event features. It also ships a comprehensive
+`leakage_diagnostic.json` documenting 11 oracle paths discovered
+across the dataset's targets and 2 README-suggested targets that are
+unlearnable on the sample after honest leak removal.**
+> **Read this first.** This repo ships two related artifacts:
+> (1) a working baseline classifier for `attack_lifecycle_phase` (the
+> dataset's headline target), and (2) `leakage_diagnostic.json`
+> documenting 11 separate oracle paths plus 2 unlearnable targets.
+> Both files matter; the diagnostic is required reading for anyone
+> evaluating CYB010 for SIEM ML work.
+## Model overview
+| Property | Value |
+|---|---|
+| Primary task | 5-class `attack_lifecycle_phase` classification |
+| Secondary artifact | `leakage_diagnostic.json` — 11 oracle paths + 2 unlearnable targets |
+| Training data | `xpertsystems/cyb010-sample` (21,896 events / 500 incidents) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 87 (after one-hot encoding) |
+| Split | **Group-aware** (GroupShuffleSplit on `incident_id`) |
+| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline + comprehensive leakage diagnostic |
+## Why this task — and what was dropped
+The CYB010 README's central concept is the "5-phase attack lifecycle
+state machine", and `attack_lifecycle_phase` is the data's headline
+target. We piloted six candidate targets and found:
+- **`attack_lifecycle_phase` 5-class**: strongest honest result.
+  Acc 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001 (multi-seed). All 5 classes
+  represented, per-class F1 range 0.48–1.00.
+- **`threat_actor_profile` 5-class**: reaches acc 0.84, but per-class
+  F1 shows the score is driven almost entirely by `benign_user` separation
+  (F1 1.00 vs F1 0.17–0.69 for the 4 malicious classes). The 4-class
+  malicious-only formulation falls below the majority-class baseline
+  (acc 0.55 vs 0.61).
+- **`label_true_positive` binary on alerts**: documented as a secondary
+  finding. It has 7 oracle features; honest acc 0.80, AUC 0.89 after
+  dropping all of them.
+- **`mitre_tactic` 14-class**: hits acc 0.90 but macro-F1 is only 0.37,
+  i.e. imbalance gaming (the benign class dominates at 57%).
+- **`event_class` 12-class**: unlearnable (acc 0.35 vs majority-class
+  baseline 0.42).
+### Six oracle columns dropped from the phase task
+CYB010 encodes the benign vs malicious distinction explicitly in
+multiple columns. Each is a perfect or near-perfect oracle for the
+`benign_background` phase:
+| Column | Oracle relationship |
+|---|---|
+| `mitre_tactic` | `=="benign"` ↔ `benign_background` phase (12,448/12,448, perfect) |
+| `mitre_technique_id` | Perfect ATT&CK-by-design oracle for `mitre_tactic` (54/54 techniques → single tactic) |
+| `label_malicious` | `==False` ↔ `benign_background` (perfect) |
+| `threat_actor_id` | `=="NONE"` ↔ `benign_background` (perfect) |
+| `threat_actor_profile` | `=="benign_user"` ↔ `benign_background` (perfect) |
+| `event_type` | Many values phase-specific (`c2_beacon_outbound` → 100% `exfiltration_or_impact`) |
+With these six columns present, a plain XGBoost trivially separates
+benign vs malicious. The published baseline trains with all six
+excluded.
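The oracle relationships in the table above can be checked mechanically. The following is a minimal sketch of such a check, not the audit code behind `leakage_diagnostic.json`: the helper name `value_purity`, the actor ID `TA-07`, the `authentication` event class, and the toy rows are all hypothetical. For each value of a candidate column it reports the most common phase and the share of rows that fall into it; a purity of 1.0 means that value deterministically encodes the target.

```python
from collections import Counter, defaultdict

def value_purity(rows, column, target):
    """Map each value of `column` to (top target label, purity, row count)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[target]] += 1
    report = {}
    for value, counts in by_value.items():
        n_rows = sum(counts.values())
        label, hits = counts.most_common(1)[0]
        report[value] = (label, hits / n_rows, n_rows)
    return report

# Toy rows (hypothetical values, NOT taken from CYB010)
toy_rows = [
    {"threat_actor_id": "NONE", "event_class": "network_flow", "phase": "benign_background"},
    {"threat_actor_id": "NONE", "event_class": "authentication", "phase": "benign_background"},
    {"threat_actor_id": "TA-07", "event_class": "network_flow", "phase": "exfiltration_or_impact"},
    {"threat_actor_id": "TA-07", "event_class": "network_flow", "phase": "exfiltration_or_impact"},
    {"threat_actor_id": "TA-07", "event_class": "authentication", "phase": "initial_access"},
]

# A sentinel column: the value "NONE" always maps to the benign phase
oracle = value_purity(toy_rows, "threat_actor_id", "phase")
# A partially informative column: values skew toward a phase but overlap
partial = value_purity(toy_rows, "event_class", "phase")

for name, report in (("threat_actor_id", oracle), ("event_class", partial)):
    for value, (label, purity, n_rows) in report.items():
        print(f"{name}={value!r}: {purity:.2f} -> {label} (n={n_rows})")
```

In this toy data the `NONE` sentinel has purity 1.0, while both `event_class` values stay well below 1.0. That gap is the difference between a deterministic oracle and a legitimately informative feature.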
+Two model artifacts are published. They are designed to be used
+together:
+- `model_xgb.json` — gradient-boosted trees (slightly higher F1)
+- `model_mlp.safetensors` — PyTorch MLP
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+import json, os, sys
+import numpy as np, torch, xgboost as xgb
+from huggingface_hub import hf_hub_download, snapshot_download
+from safetensors.torch import load_file  # torch / load_file are only needed for the MLP artifact
+REPO = "xpertsystems/cyb010-baseline-classifier"
+# Download the model artifacts and the feature pipeline from the model repo
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+# Make the downloaded feature_engineering.py importable
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import (
+    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,
+)
+meta = load_meta(paths["feature_meta.json"])
+# Host features are joined from host_inventory.csv at inference time
+ds = snapshot_download("xpertsystems/cyb010-sample", repo_type="dataset")
+host_lookup = build_host_lookup(f"{ds}/host_inventory.csv")
+xgb_model = xgb.XGBClassifier()
+xgb_model.load_model(paths["model_xgb.json"])
+# Predict (see inference_example.ipynb for the full pattern).
+# `my_event` is one raw event record that you supply; it is not defined here.
+# Do NOT include mitre_tactic, mitre_technique_id, label_malicious,
+# threat_actor_id, threat_actor_profile, or event_type: those are the
+# oracle columns excluded from training.
+X = transform_single(my_event, meta, host_lookup=host_lookup)
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for the full
+copy-paste demo.
+## Training data
+Trained on the public sample of CYB010, 21,896 per-event records:
+| Phase | Events | Class share |
+|---|---:|---:|
+| `benign_background` | 12,448 | 56.9% |
+| `exfiltration_or_impact` | 6,205 | 28.3% |
+| `initial_access` | 1,674 | 7.6% |
+| `lateral_movement` | 968 | 4.4% |
+| `persistence_establishment` | 601 | 2.7% |
+### Group-aware split by incident_id
+500 incidents × ~44 events each. Events from the same incident share
+host, threat actor, and phase trajectory — so train/test contamination
+is a real risk with random splitting. The baseline uses
+**GroupShuffleSplit** on `incident_id` (nested 70/15/15):
+| Fold | Events | Incidents |
+|---|---:|---:|
+| Train | 14,697 | ~350 |
+| Validation | 3,473 | ~75 |
+| Test | 3,726 | ~75 |
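A minimal, standard-library sketch of what a group-aware 70/15/15 split does: whole incidents, never individual events, are assigned to a fold. This is not the private training script and not scikit-learn's `GroupShuffleSplit` itself (it shuffles once and cuts, rather than nesting two splits); the function name `group_split` and the toy events are illustrative assumptions.

```python
import random

def group_split(events, group_key="incident_id",
                fractions=(0.70, 0.15, 0.15), seed=42):
    """Split events into train/val/test so that no group spans two folds."""
    groups = sorted({e[group_key] for e in events})
    random.Random(seed).shuffle(groups)          # reproducible group order
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    fold_of = {}
    for i, g in enumerate(groups):
        if i < n_train:
            fold_of[g] = "train"
        elif i < n_train + n_val:
            fold_of[g] = "val"
        else:
            fold_of[g] = "test"
    folds = {"train": [], "val": [], "test": []}
    for e in events:
        folds[fold_of[e[group_key]]].append(e)   # event follows its incident
    return folds

# Toy data: 20 hypothetical incidents with 3 events each
events = [{"incident_id": f"INC-{i:03d}", "event_no": j}
          for i in range(20) for j in range(3)]
folds = group_split(events)

for name, fold in folds.items():
    incidents = {e["incident_id"] for e in fold}
    print(f"{name}: {len(fold)} events from {len(incidents)} incidents")
```

Because fold membership is decided per incident, event counts per fold only approximate the target fractions when incidents differ in size, which is why the table above lists incident counts as approximate.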
+All 10 multi-seed evaluations yielded all 5 classes in the test fold.
+Class imbalance is addressed with balanced class weights, passed to
+XGBoost as per-row `sample_weight`, and with weighted cross-entropy
+for the MLP.
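For reference, the common "balanced" heuristic sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below applies it to the full-sample phase counts from the Training data table, purely as an illustration. The published models presumably derived their weights from the training fold only, whose per-class counts are not listed here, and the helper names are assumptions.

```python
# Full-sample phase counts, copied from the Training data table
phase_counts = {
    "benign_background": 12448,
    "exfiltration_or_impact": 6205,
    "initial_access": 1674,
    "lateral_movement": 968,
    "persistence_establishment": 601,
}

def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

def sample_weights(labels, class_weights):
    """Expand class weights to one weight per row (the shape a per-row
    sample_weight argument expects)."""
    return [class_weights[y] for y in labels]

weights = balanced_class_weights(phase_counts)
for label, w in weights.items():
    print(f"{label:28s} {w:.3f}")

row_weights = sample_weights(
    ["benign_background", "lateral_movement"], weights
)
```

With these weights every class contributes the same total weight, so the rarest phase is weighted roughly twenty times more heavily per event than the benign majority.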
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical recipe. 87
+features survive after encoding, drawn from:
+- **Per-event numeric** (5): `source_port`, `dest_port`,
+  `cvss_score_analogue`, `label_log_tampered`, `label_false_positive`
+- **Per-event categorical** (3, one-hot): `event_class` (12 values),
+  `log_source_type` (8 values), `severity_level` (5 values)
+- **Host features** (joined from `host_inventory.csv`): 3 numeric +
+  7 categorical (os_type, host_role, network_segment, defender_posture,
+  criticality_rating, cloud_provider, siem_platform)
+- **Engineered** (9): `hour_of_day`, `is_off_hours`, `is_weekend`,
+  `log_cvss`, `is_high_cvss`, `is_well_known_port`, `is_dynamic_port`,
+  `is_outbound_web`, `risk_composite`
+### Partial-oracle features kept as legitimate observables
+`event_class` (max purity 0.87, mean 0.72 across phases) is the
+strongest non-oracle feature. C2 beacon traffic (`event_class =
+network_flow`) is 65% exfiltration phase but also 29% benign and 6%
+other phases — real overlap, not deterministic encoding. Kept.
+`severity_level` and `cvss_score_analogue` correlate strongly with
+phase (high-severity events skew toward exfil and initial_access) but
+with substantial overlap. Kept.
+`label_log_tampered` is a real observable — APTs tamper more than
+script_kiddies — but is not phase-deterministic. Kept.
+## Evaluation
+### Test-set metrics, seed 42 (n = 3,726 events from ~75 test incidents)
+**XGBoost** (the published `model_xgb.json` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.9904** |
+| Accuracy | **0.9493** |
+| Macro-F1 | 0.7781 |
+| Weighted-F1 | 0.9522 |
+**MLP** (the published `model_mlp.safetensors` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.9861** |
+| Accuracy | **0.9412** |
+| Macro-F1 | 0.7534 |
+| Weighted-F1 | 0.9396 |
+XGBoost slightly outperforms MLP on this task (acc 0.949 vs 0.941,
+macro-F1 0.778 vs 0.753). The gap is consistent across seeds.
+### Multi-seed robustness (XGBoost, 10 seeds)
+| Metric | Mean | Std | Min | Max |
+|---|---:|---:|---:|---:|
+| Accuracy | 0.936 | 0.007 | 0.923 | 0.949 |
+| Macro-F1 | 0.759 | 0.015 | 0.741 | 0.781 |
+| Macro ROC-AUC OvR | 0.988 | 0.001 | 0.986 | 0.990 |
+**Tightest ROC-AUC std in the catalog** (0.001). All 10 seeds yielded
+all 5 classes in the test fold. Full per-seed results in
+[`multi_seed_results.json`](./multi_seed_results.json).
+### Per-class F1 (seed 42)
+| Phase | Class share | XGBoost F1 | MLP F1 |
+|---|---:|---:|---:|
+| `benign_background` | 56.9% | **0.998** | 0.994 |
+| `exfiltration_or_impact` | 28.3% | **0.987** | 0.981 |
+| `initial_access` | 7.6% | 0.720 | 0.651 |
+| `persistence_establishment` | 2.7% | 0.703 | 0.690 |
+| `lateral_movement` | 4.4% | **0.483** | 0.451 |
+The two largest classes (`benign_background` and `exfiltration_or_impact`)
+are nearly perfectly separable — `benign_background` because the
+non-oracle features (severity, CVSS, log_source) still cleanly separate
+non-malicious traffic, and `exfiltration_or_impact` because it's
+dominated by network_flow events (C2 beacons). The three middle
+classes overlap substantially in feature space; `lateral_movement` is
+the hardest (F1 0.48) because lateral movement events look similar to
+initial_access events at the per-event level. A sequence model that
+considers event ordering within an incident would likely do better
+than the per-event baseline.
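The per-class F1 values above can be recomputed from the seed-42 XGBoost confusion matrix stored under `full_model_metrics` in `ablation_results.json`. The sketch below assumes rows are true labels and columns are predicted labels; the row sums add up to the 3,726 test events, which is consistent with that reading.

```python
labels = [
    "benign_background", "initial_access", "lateral_movement",
    "persistence_establishment", "exfiltration_or_impact",
]
# Copied from ablation_results.json -> full_model_metrics.confusion_matrix
matrix = [
    [2078, 6, 0, 0, 0],
    [4, 172, 65, 6, 0],
    [0, 38, 72, 6, 2],
    [0, 11, 22, 58, 0],
    [0, 4, 21, 4, 1157],
]

def f1_per_class(m):
    """F1_i = 2*TP_i / (actual_i + predicted_i), rows = true, cols = predicted."""
    scores = []
    for i in range(len(m)):
        tp = m[i][i]
        actual = sum(m[i])                    # support of class i
        predicted = sum(row[i] for row in m)  # times class i was predicted
        denom = actual + predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
f1 = f1_per_class(matrix)
macro_f1 = sum(f1) / len(f1)
# Support-weighted mean of the per-class F1 scores
weighted_f1 = sum(s * sum(row) for s, row in zip(f1, matrix)) / total

for label, score in zip(labels, f1):
    print(f"{label:28s} F1 = {score:.3f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

The results round to the `accuracy`, `macro_f1`, `weighted_f1`, and `per_class_f1` values recorded in the JSON. The matrix also shows where `lateral_movement` loses ground: 38 of its 118 test events are predicted as `initial_access`, and 65 `initial_access` events are predicted as `lateral_movement`.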
+### Ablation: which feature groups matter
+| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | Δ macro-F1 |
+|---|---:|---:|---:|---:|---:|
+| Full feature set (published) | 0.9493 | 0.7781 | 0.9904 | — | — |
+| No `event_class` | 0.9206 | 0.5969 | 0.9723 | **−0.0287** | **−0.181** |
+| No CVSS features | 0.9383 | 0.7475 | 0.9812 | −0.0110 | −0.031 |
+| No `log_source_type` | 0.9469 | 0.7655 | 0.9902 | −0.0024 | −0.013 |
+| No engineered features | 0.9471 | 0.7655 | 0.9903 | −0.0022 | −0.013 |
+| No ports | 0.9463 | 0.7621 | 0.9903 | −0.0030 | −0.016 |
+| No `severity_level` | 0.9479 | 0.7688 | 0.9902 | −0.0014 | −0.009 |
+| No tamper flags | 0.9469 | 0.7657 | 0.9905 | −0.0024 | −0.012 |
+| No timing | 0.9501 | 0.7730 | 0.9907 | +0.0008 | −0.005 |
+| No host features | 0.9522 | 0.7828 | 0.9917 | +0.0029 | +0.005 |
+Three findings:
+1. **`event_class` is the dominant signal** (drops 18pp macro-F1 when
+   removed). Phase prediction without it loses most discrimination
+   between the middle classes.
+2. **CVSS features are second-strongest** (drops 3pp F1). Captures
+   severity information that complements event_class.
+3. **Host features and timing add modest noise.** The model performs
+   marginally *better* without host features (+0.3pp accuracy), and
+   timing features contribute essentially nothing. Kept in the
+   pipeline as documented baseline reference.
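A note on signs when cross-checking this table against `ablation_results.json`: the JSON entries shown in this commit store `delta_accuracy` and `delta_macro_f1` as full minus ablated (positive means the feature group helped), while the table reports ablated minus full. A small sketch with values copied from the JSON; the helper names are illustrative.

```python
# Copied from ablation_results.json
full = {"accuracy": 0.9492753623188406, "macro_f1": 0.7780594102481514}
ablated = {
    "no_event_class": {"accuracy": 0.9205582393988191, "macro_f1": 0.5968926085832369},
    "no_cvss": {"accuracy": 0.9382716049382716, "macro_f1": 0.7475120671323378},
    "no_host": {"accuracy": 0.9522275899087493, "macro_f1": 0.7828011365615016},
}

def table_delta(name, metric):
    """README table convention: ablated minus full (negative = removal hurt)."""
    return ablated[name][metric] - full[metric]

def json_delta(name, metric):
    """JSON convention: full minus ablated (positive = the group helped)."""
    return full[metric] - ablated[name][metric]

for name in ablated:
    print(
        f"{name:16s} "
        f"d_acc={table_delta(name, 'accuracy'):+.4f} "
        f"d_macro_f1={table_delta(name, 'macro_f1'):+.3f}"
    )
```

Under either convention, `no_host` is the only configuration among these three whose removal improves both accuracy and macro-F1.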
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 5 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation mlogloss.
+**MLP:** `87 → 128 → 64 → 5`, each hidden layer followed by `BatchNorm1d`
+→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
+early stopping on validation macro-F1.
+Training hyperparameters are held internally by XpertSystems.
+## Limitations
+**This is a baseline reference, not a production phase classifier.**
+1. **The leakage diagnostic is required reading.** Six oracle columns
+   for the phase task and seven for the alert TP task are documented
+   in `leakage_diagnostic.json`. If you use CYB010 sample data for
+   your own training, you MUST drop these or your model will learn
+   the oracles instead of the task.
+2. **`lateral_movement` F1 0.48 is the weakest class.** The 968-event
+   sample with substantial overlap to `initial_access` makes this
+   class hard. A sequence model that considers event ordering within
+   incidents would likely do better than per-event classification.
+3. **`threat_actor_profile` 4-class (malicious-only) is unlearnable
+   on this sample** (acc 0.55 vs majority 0.61). The 5-class
+   formulation with benign included works only because benign_user
+   separation is structurally trivial.
+4. **`event_class` 12-class is unlearnable on this sample** (acc 0.35
+   vs majority 0.42). event_class is a structural property of the
+   event itself, not something to predict from other features.
+5. **Synthetic-vs-real transfer.** The dataset is synthetic, calibrated
+   to 6 benchmark metrics drawn from SANS / IBM / Mandiant / Verizon / CISA / MITRE
+   ATT&CK Evaluations / Splunk. Real SIEM telemetry has different noise
+   characteristics — and in particular, the explicit `mitre_tactic ==
+   "benign"` marker and `threat_actor_id == "NONE"` benign sentinel
+   would not be present in real data. Real telemetry has implicit
+   benign-vs-malicious distinctions that emerge from event content.
+   Do not assume metrics transfer end-to-end.
+6. **21,896 events / 500 incidents is a modest training set.** The
+   3,726-event / ~75-incident test fold yields stable multi-seed
+   metrics (std 0.007 on accuracy) but per-class confidence intervals
+   widen for the smallest classes (lateral_movement, persistence).
+## Notes on dataset schema
+The CYB010 sample dataset README describes some fields differently
+from the actual schema. The model was trained on the actual schema;
+this note helps buyers reconcile what they read with what they receive.
+| What the README says | What the data actually contains |
+|---|---|
+| `security_events` has 16 columns | Data has **23 columns** |
+| Field renames | `timestamp_utc` → `timestamp`, `user` → `user_id`, `log_format` → `log_source_type` |
+| README missing from `security_events` | `event_class`, `severity_level`, `label_malicious`, `label_log_tampered`, `threat_actor_id`, `cvss_score_analogue` are in data but not documented |
+| README claims `command_line` / `process_name` / `is_off_hours` columns | Not present in `security_events` (off-hours derived from timestamp in pipeline) |
+| `alert_records` has 9 columns | Data has **21 columns** |
+| Field renames | `alert_severity` → `severity_level`, `detection_rule` → `alert_rule_name` |
+| README's `triage_outcome` (categorical) | Replaced by `label_true_positive` / `label_false_positive` (mirror booleans) |
+| README's `ioc_matched` | Not present in `alert_records` |
+| README missing from `alert_records` | `correlated_chain_length`, `time_to_detect_seconds`, `suppression_reason`, `analyst_triage_priority` are in data but not documented |
+| `incident_summary` has 8 columns | Data has **24 columns** |
+| `host_inventory` has 6 columns | Data has **15 columns** |
+| `threat_actor_profile` has 4 values | Data has **5 values** (adds `benign_user` at 57% of events) |
+| `attack_lifecycle_phase` 5-phase malicious lifecycle | Data adds `benign_background` as a phase value (57% of events) — so the lifecycle is 5-class with benign included |
+| README says MITRE ATT&CK v14 with 50 techniques | Data has 54 unique technique IDs across 14 tactics + benign |
+None of these affects model correctness — the feature pipeline uses
+the actual column names. If you build your own pipeline against the
+dataset, use the actual columns.
+## Intended use
+- **Evaluating fit** of the CYB010 dataset for your SIEM ML research
+- **Baseline reference** for new model architectures on the
+  attack-phase classification task
+- **Reference example of structural-leakage diagnostics** for
+  synthetic SIEM datasets — the methodology is reusable
+- **Feature engineering reference** for per-event SIEM telemetry
+## Out-of-scope use
+- Production SIEM phase detection on real telemetry
+- Threat actor attribution (4-class malicious-only is unlearnable
+  on the sample)
+- Event-class prediction (this is a structural property, not a
+  learnable target)
+- Any operational decision affecting actual security operations
+  without further validation on your own data
+## Reproducibility
+Outputs above were produced with `seed = 42` (published artifact),
+nested `GroupShuffleSplit` on `incident_id` (70/15/15), on the published
+sample (`xpertsystems/cyb010-sample`, version 1.0.0, generated
+2026-05-16). The feature pipeline in `feature_engineering.py` is
+deterministic and the trained weights in this repo correspond exactly
+to the metrics above.
+Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200)
+in `multi_seed_results.json` confirm robust performance across splits
+(std 0.007 on accuracy, 0.001 on ROC-AUC — the tightest ROC-AUC std
+in the XpertSystems catalog).
+The training script itself is private to XpertSystems.
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights (seed 42) |
+| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
+| `feature_engineering.py` | Feature pipeline |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Per-feature-group ablation |
+| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
+| **`leakage_diagnostic.json`** | **11-oracle-path audit + 2 unlearnable targets** |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB010** dataset contains **~550,000 rows** across four files,
+with calibrated benchmark validation against 6 metrics drawn from
+authoritative SOC operations and threat intelligence sources (SANS SOC
+Survey, IBM Cost of Data Breach, Mandiant M-Trends, Verizon DBIR, CISA
+Joint Advisories, MITRE ATT&CK Evaluations, Splunk State of Security).
+The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
+Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
+& Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb010-sample
+- 🤖 Companion models:
+  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
+  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
+  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
+  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
+  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
+  - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
+  - https://huggingface.co/xpertsystems/cyb007-baseline-classifier (insider threat type)
+  - https://huggingface.co/xpertsystems/cyb008-baseline-classifier (SOC alert triage + leakage diagnostic)
+  - https://huggingface.co/xpertsystems/cyb009-baseline-classifier (vulnerability classification + leakage diagnostic)
+## Citation
+```bibtex
+@misc{xpertsystems_cyb010_baseline_2026,
+  title  = {CYB010 Baseline Classifier: XGBoost and MLP for Attack Lifecycle Phase Classification, with 11-Oracle-Path Leakage Diagnostic},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb010-baseline-classifier},
+  note   = {Baseline reference model + comprehensive leakage audit trained on xpertsystems/cyb010-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,659 @@

+{
+  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.9492753623188406,
+    "macro_f1": 0.7780594102481514,
+    "weighted_f1": 0.9522470071864876,
+    "per_class_f1": {
+      "benign_background": 0.9975996159385502,
+      "initial_access": 0.7196652719665272,
+      "lateral_movement": 0.48322147651006714,
+      "persistence_establishment": 0.703030303030303,
+      "exfiltration_or_impact": 0.9867803837953092
+    },
+    "confusion_matrix": {
+      "labels": [
+        "benign_background",
+        "initial_access",
+        "lateral_movement",
+        "persistence_establishment",
+        "exfiltration_or_impact"
+      ],
+      "matrix": [
+        [
+          2078,
+          6,
+          0,
+          0,
+          0
+        ],
+        [
+          4,
+          172,
+          65,
+          6,
+          0
+        ],
+        [
+          0,
+          38,
+          72,
+          6,
+          2
+        ],
+        [
+          0,
+          11,
+          22,
+          58,
+          0
+        ],
+        [
+          0,
+          4,
+          21,
+          4,
+          1157
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.9904125505537232
+  },
+  "ablations": {
+    "no_event_class": {
+      "n_features": 75,
+      "dropped_count": 12,
+      "metrics": {
+        "model": "xgboost_no_event_class",
+        "accuracy": 0.9205582393988191,
+        "macro_f1": 0.5968926085832369,
+        "weighted_f1": 0.9214122465392139,
+        "per_class_f1": {
+          "benign_background": 0.9978412089230031,
+          "initial_access": 0.5674044265593562,
+          "lateral_movement": 0.3170731707317073,
+          "persistence_establishment": 0.11965811965811966,
+          "exfiltration_or_impact": 0.9824861170439982
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2080,
+              4,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              141,
+              94,
+              6,
+              2
+            ],
+            [
+              0,
+              54,
+              52,
+              9,
+              3
+            ],
+            [
+              1,
+              40,
+              43,
+              7,
+              0
+            ],
+            [
+              0,
+              11,
+              21,
+              4,
+              1150
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9722802673741894
+      },
+      "delta_accuracy": 0.028717122920021487,
+      "delta_macro_f1": 0.1811668016649145
+    },
+    "no_log_source": {
+      "n_features": 79,
+      "dropped_count": 8,
+      "metrics": {
+        "model": "xgboost_no_log_source",
+        "accuracy": 0.9468599033816425,
+        "macro_f1": 0.7655457635864822,
+        "weighted_f1": 0.9496485129647918,
+        "per_class_f1": {
+          "benign_background": 0.9975996159385502,
+          "initial_access": 0.7080745341614907,
+          "lateral_movement": 0.4536082474226804,
+          "persistence_establishment": 0.6829268292682927,
+          "exfiltration_or_impact": 0.985519591141397
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2078,
+              6,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              171,
+              65,
+              6,
+              1
+            ],
+            [
+              0,
+              43,
+              66,
+              7,
+              2
+            ],
+            [
+              0,
+              12,
+              21,
+              56,
+              2
+            ],
+            [
+              0,
+              4,
+              21,
+              4,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9902223408149018
+      },
+      "delta_accuracy": 0.0024154589371980784,
+      "delta_macro_f1": 0.012513646661669209
+    },
+    "no_severity": {
+      "n_features": 82,
+      "dropped_count": 5,
+      "metrics": {
+        "model": "xgboost_no_severity",
+        "accuracy": 0.9479334406870639,
+        "macro_f1": 0.7688286964848263,
+        "weighted_f1": 0.9505815101921871,
+        "per_class_f1": {
+          "benign_background": 0.9971195391262602,
+          "initial_access": 0.7213114754098361,
+          "lateral_movement": 0.4689655172413793,
+          "persistence_establishment": 0.6708074534161491,
+          "exfiltration_or_impact": 0.985939497230507
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2077,
+              7,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              176,
+              60,
+              7,
+              0
+            ],
+            [
+              0,
+              42,
+              68,
+              5,
+              3
+            ],
+            [
+              1,
+              12,
+              23,
+              54,
+              1
+            ],
+            [
+              0,
+              4,
+              21,
+              4,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9901923411691304
+      },
+      "delta_accuracy": 0.0013419216317767102,
+      "delta_macro_f1": 0.009230713763325071
+    },
+    "no_cvss": {
+      "n_features": 84,
+      "dropped_count": 3,
+      "metrics": {
+        "model": "xgboost_no_cvss",
+        "accuracy": 0.9382716049382716,
+        "macro_f1": 0.7475120671323378,
+        "weighted_f1": 0.940926432572893,
+        "per_class_f1": {
+          "benign_background": 0.9930737998566993,
+          "initial_access": 0.6948775055679287,
+          "lateral_movement": 0.43278688524590164,
+          "persistence_establishment": 0.6428571428571429,
+          "exfiltration_or_impact": 0.9739650021340163
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2079,
+              4,
+              0,
+              0,
+              1
+            ],
+            [
+              12,
+              156,
+              60,
+              14,
+              5
+            ],
+            [
+              6,
+              31,
+              66,
+              5,
+              10
+            ],
+            [
+              6,
+              8,
+              23,
+              54,
+              0
+            ],
+            [
+              0,
+              3,
+              38,
+              4,
+              1141
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9812083795500166
+      },
+      "delta_accuracy": 0.011003757380569024,
+      "delta_macro_f1": 0.03054734311581364
+    },
+    "no_host": {
+      "n_features": 39,
+      "dropped_count": 48,
+      "metrics": {
+        "model": "xgboost_no_host",
+        "accuracy": 0.9522275899087493,
+        "macro_f1": 0.7828011365615016,
+        "weighted_f1": 0.9541737562003638,
+        "per_class_f1": {
+          "benign_background": 0.9983217453847998,
+          "initial_access": 0.746268656716418,
+          "lateral_movement": 0.4962962962962963,
+          "persistence_establishment": 0.6871794871794872,
+          "exfiltration_or_impact": 0.985939497230507
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2082,
+              1,
+              0,
+              1,
+              0
+            ],
+            [
+              4,
+              175,
+              49,
+              18,
+              1
+            ],
+            [
+              0,
+              36,
+              67,
+              13,
+              2
+            ],
+            [
+              1,
+              6,
+              16,
+              67,
+              1
+            ],
+            [
+              0,
+              4,
+              20,
+              5,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9917448228530954
+      },
+      "delta_accuracy": -0.0029522275899087624,
+      "delta_macro_f1": -0.004741726313350236
+    },
+    "no_timing": {
+      "n_features": 84,
+      "dropped_count": 3,
+      "metrics": {
+        "model": "xgboost_no_timing",
+        "accuracy": 0.9500805152979066,
+        "macro_f1": 0.7730074031058032,
+        "weighted_f1": 0.9527084816660557,
+        "per_class_f1": {
+          "benign_background": 0.9990407673860912,
+          "initial_access": 0.7326315789473684,
+          "lateral_movement": 0.48484848484848486,
+          "persistence_establishment": 0.6625766871165644,
+          "exfiltration_or_impact": 0.985939497230507
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2083,
+              1,
+              0,
+              0,
+              0
+            ],
+            [
+              3,
+              174,
+              60,
+              8,
+              2
+            ],
+            [
+              0,
+              39,
+              72,
+              5,
+              2
+            ],
+            [
+              0,
+              9,
+              28,
+              54,
+              0
+            ],
+            [
+              0,
+              5,
+              19,
+              5,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9906863118522171
+      },
+      "delta_accuracy": -0.0008051529790660261,
+      "delta_macro_f1": 0.005052007142348214
+    },
+    "no_ports": {
+      "n_features": 82,
+      "dropped_count": 5,
+      "metrics": {
+        "model": "xgboost_no_ports",
+        "accuracy": 0.9463231347289318,
+        "macro_f1": 0.7620715002556177,
+        "weighted_f1": 0.949550457691939,
+        "per_class_f1": {
+          "benign_background": 0.9978401727861771,
+          "initial_access": 0.7036247334754797,
+          "lateral_movement": 0.45544554455445546,
+          "persistence_establishment": 0.6666666666666666,
+          "exfiltration_or_impact": 0.9867803837953092
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2079,
+              5,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              165,
+              72,
+              6,
+              0
+            ],
+            [
+              0,
+              38,
+              69,
+              9,
+              2
+            ],
+            [
+              0,
+              11,
+              24,
+              56,
+              0
+            ],
+            [
+              0,
+              3,
+              20,
+              6,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9902855327593585
+      },
+      "delta_accuracy": 0.0029522275899087624,
+      "delta_macro_f1": 0.015987909992533744
+    },
+    "no_engineered": {
+      "n_features": 79,
+      "dropped_count": 8,
+      "metrics": {
+        "model": "xgboost_no_engineered",
+        "accuracy": 0.9471282877079978,
+        "macro_f1": 0.7655097846280253,
+        "weighted_f1": 0.9499972622574527,
+        "per_class_f1": {
+          "benign_background": 0.9975984630163305,
+          "initial_access": 0.7166666666666667,
+          "lateral_movement": 0.4697986577181208,
+          "persistence_establishment": 0.6583850931677019,
+          "exfiltration_or_impact": 0.9851000425713069
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2077,
+              7,
+              0,
+              0,
+              0
+            ],
+            [
+              3,
+              172,
+              63,
+              8,
+              1
+            ],
+            [
+              0,
+              40,
+              70,
+              5,
+              3
+            ],
+            [
+              0,
+              10,
+              26,
+              53,
+              2
+            ],
+            [
+              0,
+              4,
+              21,
+              4,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9903013631552575
+      },
+      "delta_accuracy": 0.0021470746108427363,
+      "delta_macro_f1": 0.01254962562012607
+    },
+    "no_tamper": {
+      "n_features": 85,
+      "dropped_count": 2,
+      "metrics": {
+        "model": "xgboost_no_tamper",
+        "accuracy": 0.9468599033816425,
+        "macro_f1": 0.7656884000157337,
+        "weighted_f1": 0.9499631319237402,
+        "per_class_f1": {
+          "benign_background": 0.9980806142034548,
+          "initial_access": 0.7048832271762208,
+          "lateral_movement": 0.4605263157894737,
+          "persistence_establishment": 0.6790123456790124,
+          "exfiltration_or_impact": 0.985939497230507
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2080,
+              4,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              166,
+              70,
+              6,
+              1
+            ],
+            [
+              0,
+              39,
+              70,
+              7,
+              2
+            ],
+            [
+              0,
+              11,
+              24,
+              55,
+              1
+            ],
+            [
+              0,
+              4,
+              22,
+              3,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9904534455006762
+      },
+      "delta_accuracy": 0.0024154589371980784,
+      "delta_macro_f1": 0.012371010232417712
+    }
+  }
+}
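
The stored per-variant metrics can be cross-checked against the confusion matrices saved next to them. The sketch below recomputes accuracy and macro-F1 for the `no_timing` variant from its matrix. The helper name `metrics_from_confusion` is hypothetical, and the sketch assumes rows are true classes and columns are predictions (accuracy and F1 are unchanged if the convention is transposed).

```python
def metrics_from_confusion(matrix):
    """Return (accuracy, macro_f1, per_class_f1) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_class_f1 = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                       # rest of row i
        fp = sum(matrix[r][i] for r in range(n)) - tp  # rest of column i
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
    return correct / total, sum(per_class_f1) / n, per_class_f1


# Confusion matrix of the "no_timing" variant, copied from the JSON above.
NO_TIMING = [
    [2083, 1, 0, 0, 0],
    [3, 174, 60, 8, 2],
    [0, 39, 72, 5, 2],
    [0, 9, 28, 54, 0],
    [0, 5, 19, 5, 1157],
]

accuracy, macro_f1, per_class = metrics_from_confusion(NO_TIMING)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

The recomputed values should agree with the `accuracy` and `macro_f1` fields stored for `no_timing`; a mismatch would point to a label-order or serialization problem rather than a modelling one.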

feature_engineering.py ADDED

	@@ -0,0 +1,413 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB010 baseline classifier, which predicts
+`attack_lifecycle_phase` (5-class attack phase) from per-event
+features on the CYB010 sample dataset.
+CSV inputs:
+    security_events.csv    (primary, one row per event, 21,896 events)
+    host_inventory.csv     (per-host registry, joined for host context)
+    alert_records.csv      (per-alert records; reserved)
+    incident_summary.csv   (per-incident summaries; reserved)
+Target classes (5):
+    benign_background, initial_access, lateral_movement,
+    persistence_establishment, exfiltration_or_impact
+Why this task
+-------------
+The CYB010 README's central concept is the "5-phase attack lifecycle
+state machine", and `attack_lifecycle_phase` is the data's headline
+target. We piloted six candidate targets and found it gives the
+strongest honest result on the sample (acc 0.95, macro-F1 0.78,
+ROC-AUC 0.99 with group-aware split on incident_id).
+The other README-suggested targets either have unrecoverable structural
+leakage or are weaker after honest leak removal:
+- `threat_actor_profile` 5-class works (acc 0.84) but is benign-driven:
+  the 4-class malicious-only variant collapses to acc 0.57 vs majority 0.61.
+- `label_true_positive` on alerts has 9 oracle features; after dropping
+  all of them, honest acc 0.80, AUC 0.89 (documented as a secondary
+  finding in leakage_diagnostic.json).
+- `mitre_tactic` 14-class reaches acc 0.90 but macro-F1 only 0.37; the
+  accuracy comes from class imbalance (the benign class is 57% of events).
+- `event_class` 12-class is unlearnable (acc 0.35 vs majority 0.42).
+Group structure
+---------------
+500 incidents x ~44 events each. The per-event task has clear group
+structure: events from the same incident share host, threat actor, and
+phase trajectory. Group-aware split by `incident_id` is required to
+prevent train/test contamination. With 500 incidents, ~75 test
+incidents per fold gives reasonable estimation precision.
+Leakage audit
+-------------
+Six columns are dropped from features because they are structural
+oracles (or near-oracles) for the target:
+1. `mitre_tactic`: when == "benign", deterministically pins
+   attack_lifecycle_phase == "benign_background" (12,448 cases - all
+   benign events).
+2. `mitre_technique_id`: perfect oracle for `mitre_tactic` by ATT&CK
+   design (54 techniques, each maps to exactly one tactic). Dropped
+   because it indirectly encodes the benign vs malicious distinction.
+3. `label_malicious`: when False, perfect oracle for
+   benign_background phase.
+4. `threat_actor_id`: when == "NONE", perfect oracle for benign
+   profile/phase. The non-"NONE" actor IDs are 10 distinct labels
+   that would also leak actor profile information indirectly.
+5. `threat_actor_profile`: contains "benign_user" which trivially
+   identifies benign_background phase.
+6. `event_type`: many event types are phase-specific
+   (`c2_beacon_outbound` -> 99% exfiltration_or_impact). Dropped to
+   avoid this near-oracle path.
+KEPT features that are informative but NOT oracles:
+- `event_class` (12 values): max purity 0.87, mean 0.72 - real signal
+  with substantial overlap. C2 beacons (network_flow class) hit 65%
+  exfil phase but also 29% benign. Strong feature, kept.
+- `severity_level`, `cvss_score_analogue`: per-event severity is a
+  real observable, correlates with phase, has overlap.
+- `label_log_tampered`: real observable (APTs tamper more), correlates
+  with malicious phases but not deterministic.
+- `log_source_type`, `siem_platform`: not phase-deterministic.
+- All host context features.
+Public API
+----------
+    build_features(events_path, hosts_path) -> (X, y, ids, groups, meta)
+    transform_single(record, meta, host_lookup=None) -> np.ndarray
+    save_meta(meta, path) / load_meta(path)
+    build_host_lookup(hosts_path) -> dict
+License
+-------
+Ships with the public model on Hugging Face under CC-BY-NC-4.0,
+matching the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Label space
+# ---------------------------------------------------------------------------
+# Ordered by attack progression.
+LABEL_ORDER = [
+    "benign_background",
+    "initial_access",
+    "lateral_movement",
+    "persistence_establishment",
+    "exfiltration_or_impact",
+]
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# ---------------------------------------------------------------------------
+# Identifier and target columns
+# ---------------------------------------------------------------------------
+ID_COLUMNS = [
+    "event_id", "host_id", "incident_id", "timestamp", "user_id",
+    "source_ip", "dest_ip", "raw_log_payload",
+]
+TARGET_COLUMN = "attack_lifecycle_phase"
+GROUP_COLUMN = "incident_id"
+# Oracle columns dropped from features.
+ORACLE_COLUMNS = [
+    "mitre_tactic",          # benign value -> benign_background phase
+    "mitre_technique_id",    # ATT&CK technique -> tactic deterministic
+    "label_malicious",       # False -> benign_background
+    "threat_actor_id",       # NONE -> benign
+    "threat_actor_profile",  # benign_user -> benign_background
+    "event_type",            # many event types phase-specific (e.g. c2_beacon_outbound)
+]
+# ---------------------------------------------------------------------------
+# Per-event features (numeric and categorical)
+# ---------------------------------------------------------------------------
+EVENT_NUMERIC_FEATURES = [
+    "source_port",
+    "dest_port",
+    "cvss_score_analogue",
+    "label_log_tampered",  # bool kept as observable
+    "label_false_positive",  # bool kept as observable (all False on events)
+]
+EVENT_CATEGORICAL_FEATURES = [
+    "event_class",      # 12 values
+    "log_source_type",  # 8 values
+    "severity_level",   # 5 values
+]
+# ---------------------------------------------------------------------------
+# Host features (joined on host_id from host_inventory.csv)
+# ---------------------------------------------------------------------------
+HOST_NUMERIC_FEATURES = [
+    "edr_agent_installed",
+    "patch_compliance_level",
+    "vulnerability_count_open",
+]
+HOST_CATEGORICAL_FEATURES = [
+    "os_type",                 # 7 values
+    "host_role",               # 10 values
+    "network_segment",         # 8 values
+    "defender_posture_tier",   # 4 values
+    "criticality_rating",      # 4 values
+    "cloud_provider",          # 4 values
+    "siem_platform",           # 8 values
+]
+# ---------------------------------------------------------------------------
+# Engineered features
+# ---------------------------------------------------------------------------
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Six groups of engineered features (nine columns in total) encoding
+    phase-discriminative hypotheses.
+    Each composite is something a SOC analyst would compute by hand.
+    """
+    df = df.copy()
+    # 1. Timing: hour of day (0-23), off-hours flag (before 09:00 or from
+    #    18:00 on) and weekend flag, if a timestamp is available
+    if "timestamp" in df.columns:
+        ts = pd.to_datetime(df["timestamp"], errors="coerce")
+        df["hour_of_day"] = ts.dt.hour.fillna(12).astype(int)
+        df["is_off_hours"] = ((ts.dt.hour < 9) | (ts.dt.hour > 17)).fillna(False).astype(int)
+        df["is_weekend"] = (ts.dt.weekday >= 5).fillna(False).astype(int)
+    else:
+        df["hour_of_day"] = 12
+        df["is_off_hours"] = 0
+        df["is_weekend"] = 0
+    # 2. Log-scaled CVSS (heavy-tailed). Fall back to an all-zero Series when
+    #    the column is missing: df.get(..., 0) would return a plain int, which
+    #    has no .clip() or .fillna().
+    if "cvss_score_analogue" in df.columns:
+        cvss = df["cvss_score_analogue"]
+    else:
+        cvss = pd.Series(0.0, index=df.index)
+    df["log_cvss"] = np.log1p(cvss.clip(lower=0)).astype(float)
+    # 3. High-CVSS indicator
+    df["is_high_cvss"] = (cvss >= 7.0).astype(int)
+    # 4. Port category: well-known (<1024) vs registered vs dynamic
+    if "dest_port" in df.columns:
+        dest = df["dest_port"].fillna(0).astype(int)
+    else:
+        dest = pd.Series(0, index=df.index)
+    df["is_well_known_port"] = (dest < 1024).astype(int)
+    df["is_dynamic_port"] = (dest >= 49152).astype(int)
+    # 5. Web-service destination: dest_port is a common HTTP(S) port.
+    #    Rough proxy for outbound web traffic.
+    df["is_outbound_web"] = (dest.isin([80, 443, 8080, 8443])).astype(int)
+    # 6. Risk composite: CVSS x defender weakness (1 - patch compliance).
+    #    Hypothesis: a higher composite indicates a later phase.
+    if "patch_compliance_level" in df.columns:
+        df["risk_composite"] = (
+            cvss.fillna(0) * (1 - df["patch_compliance_level"].fillna(1))
+        ).astype(float)
+    else:
+        df["risk_composite"] = 0.0
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    events_path: str | Path,
+    hosts_path: str | Path,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series, pd.Series, dict[str, Any]]:
+    """
+    Load security_events.csv, join host_inventory.csv, drop target +
+    identifiers + oracle columns, engineer features, one-hot encode,
+    return (X, y, ids, groups, meta).
+    """
+    events = pd.read_csv(events_path)
+    hosts = pd.read_csv(hosts_path)
+    y = events[TARGET_COLUMN].map(LABEL_TO_INT)
+    if y.isna().any():
+        bad = events.loc[y.isna(), TARGET_COLUMN].unique()
+        raise ValueError(f"Unknown attack_lifecycle_phase values: {bad}")
+    y = y.astype(int)
+    ids = events["event_id"].copy()
+    groups = events[GROUP_COLUMN].copy()
+    host_cols_needed = (
+        ["host_id"] + HOST_NUMERIC_FEATURES + HOST_CATEGORICAL_FEATURES
+    )
+    events = events.merge(
+        hosts[host_cols_needed], on="host_id", how="left",
+    )
+    # Apply engineered features BEFORE dropping timestamp
+    events = _add_engineered_features(events)
+    events = events.drop(
+        columns=ID_COLUMNS + [TARGET_COLUMN] + ORACLE_COLUMNS,
+        errors="ignore",
+    )
+    numeric_features = (
+        EVENT_NUMERIC_FEATURES
+        + HOST_NUMERIC_FEATURES
+        + [
+            "hour_of_day", "is_off_hours", "is_weekend",
+            "log_cvss", "is_high_cvss",
+            "is_well_known_port", "is_dynamic_port", "is_outbound_web",
+            "risk_composite",
+        ]
+    )
+    numeric_features = [c for c in numeric_features if c in events.columns]
+    X_numeric = events[numeric_features].apply(
+        lambda s: s.astype(float) if s.dtype != bool else s.astype(int).astype(float)
+    )
+    all_categorical = EVENT_CATEGORICAL_FEATURES + HOST_CATEGORICAL_FEATURES
+    categorical_levels: dict[str, list[str]] = {}
+    blocks: list[pd.DataFrame] = []
+    for col in all_categorical:
+        if col not in events.columns:
+            continue
+        levels = sorted(events[col].dropna().astype(str).unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            events[col].astype(str).astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        blocks.append(block)
+    X = pd.concat(
+        [X_numeric.reset_index(drop=True)]
+        + [b.reset_index(drop=True) for b in blocks],
+        axis=1,
+    ).fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+        "oracle_excluded": ORACLE_COLUMNS,
+    }
+    return X, y, ids, groups, meta
+def transform_single(
+    record: dict | pd.DataFrame,
+    meta: dict[str, Any],
+    host_lookup: dict | None = None,
+) -> np.ndarray:
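+    """Encode a single event record for inference.
+
+    A DataFrame input is expected to hold one row: host context is looked
+    up from the first row's `host_id` only.
+    """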
+    if isinstance(record, dict):
+        df = pd.DataFrame([record.copy()])
+    else:
+        df = record.copy()
+    if host_lookup is not None and "host_id" in df.columns:
+        host_id = df["host_id"].iloc[0]
+        host_feats = host_lookup.get(host_id, {})
+        for k, v in host_feats.items():
+            if k not in df.columns:
+                df[k] = v
+    df = _add_engineered_features(df)
+    numeric = pd.DataFrame()
+    for col in meta["numeric_features"]:
+        s = df.get(col, pd.Series([0.0] * len(df)))
+        if s.dtype == bool:
+            s = s.astype(int)
+        numeric[col] = s.astype(float).values
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df))).astype(str)
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        for lvl in levels:
+            cname = f"{col}_{lvl}"
+            if cname not in block.columns:
+                block[cname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+        "oracle_excluded": meta.get("oracle_excluded", []),
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+def build_host_lookup(hosts_path: str | Path) -> dict[str, dict]:
+    """Build {host_id: {host feature values}} for inference-time lookup."""
+    hosts = pd.read_csv(hosts_path)
+    cols = HOST_NUMERIC_FEATURES + HOST_CATEGORICAL_FEATURES
+    out = {}
+    for _, row in hosts.iterrows():
+        out[row["host_id"]] = {c: row[c] for c in cols if c in hosts.columns}
+    return out
+if __name__ == "__main__":
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, ids, groups, meta = build_features(
+        base / "security_events.csv",
+        base / "host_inventory.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"groups: {groups.nunique()} unique incidents")
+    print(f"n_features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
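
The module docstring requires a group-aware split on `incident_id`, but the split itself is not part of this file. A minimal standard-library sketch of the idea follows. The function name `group_train_test_split`, the 15% test fraction (inferred from "~75 test incidents" out of 500) and the toy incident IDs are assumptions, not the training script's actual code; scikit-learn's `GroupShuffleSplit` gives the same guarantee.

```python
import random


def group_train_test_split(groups, test_frac=0.15, seed=42):
    """Split row indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: 20 incidents with 5 events each (hypothetical IDs).
groups = [f"INC-{i // 5:03d}" for i in range(100)]
train_idx, test_idx = group_train_test_split(groups, test_frac=0.15)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
print(len(test_groups), "test incidents; overlap:", train_groups & test_groups)
```

With the real data, the `groups` Series returned by `build_features` would take the place of the toy list. The property that matters is the empty overlap: every event of an incident lands on one side of the split.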

feature_meta.json ADDED

	@@ -0,0 +1,224 @@

+{
+  "feature_names": [
+    "source_port",
+    "dest_port",
+    "cvss_score_analogue",
+    "label_log_tampered",
+    "label_false_positive",
+    "edr_agent_installed",
+    "patch_compliance_level",
+    "vulnerability_count_open",
+    "hour_of_day",
+    "is_off_hours",
+    "is_weekend",
+    "log_cvss",
+    "is_high_cvss",
+    "is_well_known_port",
+    "is_dynamic_port",
+    "is_outbound_web",
+    "risk_composite",
+    "event_class_application_api",
+    "event_class_application_waf",
+    "event_class_authentication",
+    "event_class_cloud_compute",
+    "event_class_cloud_iam",
+    "event_class_cloud_storage",
+    "event_class_dns_resolution",
+    "event_class_endpoint_filesystem",
+    "event_class_endpoint_process",
+    "event_class_endpoint_registry",
+    "event_class_network_flow",
+    "event_class_threat_intelligence_match",
+    "log_source_type_arcsight_esm",
+    "log_source_type_aws_security_hub",
+    "log_source_type_elastic_siem",
+    "log_source_type_google_chronicle",
+    "log_source_type_ibm_qradar",
+    "log_source_type_microsoft_sentinel",
+    "log_source_type_palo_alto_xsiam",
+    "log_source_type_splunk",
+    "severity_level_critical",
+    "severity_level_high",
+    "severity_level_informational",
+    "severity_level_low",
+    "severity_level_medium",
+    "os_type_cloud_managed",
+    "os_type_linux_debian",
+    "os_type_linux_rhel",
+    "os_type_linux_ubuntu",
+    "os_type_macos",
+    "os_type_windows_server",
+    "os_type_windows_workstation",
+    "host_role_cloud_compute_instance",
+    "host_role_database_server",
+    "host_role_domain_controller",
+    "host_role_file_server",
+    "host_role_ot_ics_controller",
+    "host_role_siem_collector",
+    "host_role_vpn_gateway",
+    "host_role_web_server",
+    "host_role_workstation_privileged",
+    "host_role_workstation_standard",
+    "network_segment_cloud_workload",
+    "network_segment_corporate_lan",
+    "network_segment_data_exfiltration_target",
+    "network_segment_dmz_perimeter",
+    "network_segment_endpoint_fleet",
+    "network_segment_ot_ics_control_network",
+    "network_segment_soc_management_plane",
+    "network_segment_zero_trust_segment",
+    "defender_posture_tier_hardened",
+    "defender_posture_tier_minimal",
+    "defender_posture_tier_standard",
+    "defender_posture_tier_zero_trust",
+    "criticality_rating_critical",
+    "criticality_rating_high",
+    "criticality_rating_low",
+    "criticality_rating_medium",
+    "cloud_provider_aws",
+    "cloud_provider_azure",
+    "cloud_provider_gcp",
+    "cloud_provider_on_premises",
+    "siem_platform_arcsight_esm",
+    "siem_platform_aws_security_hub",
+    "siem_platform_elastic_siem",
+    "siem_platform_google_chronicle",
+    "siem_platform_ibm_qradar",
+    "siem_platform_microsoft_sentinel",
+    "siem_platform_palo_alto_xsiam",
+    "siem_platform_splunk"
+  ],
+  "numeric_features": [
+    "source_port",
+    "dest_port",
+    "cvss_score_analogue",
+    "label_log_tampered",
+    "label_false_positive",
+    "edr_agent_installed",
+    "patch_compliance_level",
+    "vulnerability_count_open",
+    "hour_of_day",
+    "is_off_hours",
+    "is_weekend",
+    "log_cvss",
+    "is_high_cvss",
+    "is_well_known_port",
+    "is_dynamic_port",
+    "is_outbound_web",
+    "risk_composite"
+  ],
+  "categorical_levels": {
+    "event_class": [
+      "application_api",
+      "application_waf",
+      "authentication",
+      "cloud_compute",
+      "cloud_iam",
+      "cloud_storage",
+      "dns_resolution",
+      "endpoint_filesystem",
+      "endpoint_process",
+      "endpoint_registry",
+      "network_flow",
+      "threat_intelligence_match"
+    ],
+    "log_source_type": [
+      "arcsight_esm",
+      "aws_security_hub",
+      "elastic_siem",
+      "google_chronicle",
+      "ibm_qradar",
+      "microsoft_sentinel",
+      "palo_alto_xsiam",
+      "splunk"
+    ],
+    "severity_level": [
+      "critical",
+      "high",
+      "informational",
+      "low",
+      "medium"
+    ],
+    "os_type": [
+      "cloud_managed",
+      "linux_debian",
+      "linux_rhel",
+      "linux_ubuntu",
+      "macos",
+      "windows_server",
+      "windows_workstation"
+    ],
+    "host_role": [
+      "cloud_compute_instance",
+      "database_server",
+      "domain_controller",
+      "file_server",
+      "ot_ics_controller",
+      "siem_collector",
+      "vpn_gateway",
+      "web_server",
+      "workstation_privileged",
+      "workstation_standard"
+    ],
+    "network_segment": [
+      "cloud_workload",
+      "corporate_lan",
+      "data_exfiltration_target",
+      "dmz_perimeter",
+      "endpoint_fleet",
+      "ot_ics_control_network",
+      "soc_management_plane",
+      "zero_trust_segment"
+    ],
+    "defender_posture_tier": [
+      "hardened",
+      "minimal",
+      "standard",
+      "zero_trust"
+    ],
+    "criticality_rating": [
+      "critical",
+      "high",
+      "low",
+      "medium"
+    ],
+    "cloud_provider": [
+      "aws",
+      "azure",
+      "gcp",
+      "on_premises"
+    ],
+    "siem_platform": [
+      "arcsight_esm",
+      "aws_security_hub",
+      "elastic_siem",
+      "google_chronicle",
+      "ibm_qradar",
+      "microsoft_sentinel",
+      "palo_alto_xsiam",
+      "splunk"
+    ]
+  },
+  "label_to_int": {
+    "benign_background": 0,
+    "initial_access": 1,
+    "lateral_movement": 2,
+    "persistence_establishment": 3,
+    "exfiltration_or_impact": 4
+  },
+  "int_to_label": {
+    "0": "benign_background",
+    "1": "initial_access",
+    "2": "lateral_movement",
+    "3": "persistence_establishment",
+    "4": "exfiltration_or_impact"
+  },
+  "oracle_excluded": [
+    "mitre_tactic",
+    "mitre_technique_id",
+    "label_malicious",
+    "threat_actor_id",
+    "threat_actor_profile",
+    "event_type"
+  ]
+}
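
The `oracle_excluded` list records the columns judged to be structural oracles for the target. The audit code is not included in this commit, so the following is only a sketch of how a per-value purity check of the kind described in the `feature_engineering.py` docstring might look. `value_purity` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def value_purity(feature_values, targets):
    """Map each feature value to (majority target, purity, row count)."""
    by_value = defaultdict(Counter)
    for value, target in zip(feature_values, targets):
        by_value[value][target] += 1
    report = {}
    for value, counts in by_value.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        report[value] = (label, top / total, total)
    return report


# Toy rows (hypothetical): the "benign" tactic always maps to one phase.
tactic = ["benign", "benign", "benign", "execution", "execution", "discovery"]
phase = ["benign_background"] * 3 + [
    "initial_access", "lateral_movement", "lateral_movement",
]
report = value_purity(tactic, phase)
for value, (label, purity, n) in sorted(report.items()):
    print(f"{value:10s} -> {label:18s} purity={purity:.2f} n={n}")
```

A value with purity 1.0 over many rows (here `benign`) pins the target deterministically and marks the column as an oracle candidate. Values with substantial overlap, like those the docstring reports for `event_class`, carry real signal and can be kept.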

feature_scaler.json ADDED

	@@ -0,0 +1 @@

+ {"mean": [33252.34347145676, 2996.478260869565, 2.8174688711982037, 0.05245968565013268, 0.0, 0.7555283391168266, 0.7200855956998027, 4.552833911682656, 12.7339593114241, 0.5532421582635912, 0.2685582091583316, 0.8327589119295066, 0.30815812750901544, 0.596652378036334, 0.0, 0.6238007756685038, 0.8264450180308907, 0.015989657753282982, 0.021364904402258963, 0.2782200449071239, 0.014220589235898484, 0.029189630536844254, 0.01483295910730081, 0.009253589167857386, 0.055317411716676874, 0.0923998094849289, 0.03075457576376131, 0.4073620466761924, 0.031094781247873717, 0.10784513846363203, 0.12022861808532354, 0.1143770837585902, 0.10566782336531265, 0.13771517996870108, 0.15207185139824453, 0.1263523167993468, 0.13574198816084915, 0.023065931822820983, 0.3070014288630333, 0.33496631965707285, 0.23208818126148192, 0.10287813839559094, 0.06021637068789549, 0.1196842893107437, 0.17017078315302442, 0.14302238552085458, 0.20902224943866096, 0.08226168605837926, 0.2156222358304416, 0.20466761924202218, 0.056406069265836564, 0.04790093216302647, 0.06736068585425597, 0.01986800027216439, 0.01850717833571477, 0.02878138395590937, 0.13186364564196776, 0.07191943934136218, 0.35272504592774034, 0.11424100156494522, 0.11070286453017622, 0.13553786487038172, 0.13887187861468328, 0.14145744029393753, 0.13771517996870108, 0.10553174117166769, 0.11594202898550725, 0.2548139076001905, 0.20806967408314622, 0.48683404776484995, 0.050282370551813296, 0.048241137647138874, 0.18922229026331905, 0.3440838266312853, 0.4184527454582568, 0.019731918078519425, 0.02605974008301014, 0.014424712526365926, 0.9397836293121045, 0.10784513846363203, 0.12022861808532354, 0.1143770837585902, 0.10566782336531265, 0.13771517996870108, 0.15207185139824453, 0.1263523167993468, 0.13574198816084915], "std": [18715.207254926845, 3628.380310921406, 3.54459303672502, 0.2229597484434109, 1.0, 0.42978812956197554, 0.17539856245260071, 3.3596279846325925, 6.617606652566204, 0.497174105443988, 0.4432246202477297, 
1.0144933397707419, 0.4617479865503102, 0.49058607154102, 1.0, 0.48444745480159396, 1.3184493675097533, 0.12543946439978274, 0.14460244808610237, 0.4481376083636101, 0.11840320083280491, 0.16834347108896366, 0.1208881167845266, 0.09575272369957992, 0.2286065431480699, 0.28959936317018303, 0.17265792825750237, 0.4913599872613703, 0.17357979692806114, 0.3101952797244992, 0.3252397499180114, 0.3182795299089748, 0.3074224535341671, 0.34461252093496103, 0.35910273966521833, 0.3322573102735896, 0.3425260335658089, 0.1501180467072525, 0.4612656808966241, 0.47199474836365296, 0.42217932766928296, 0.303809985457358, 0.23789537641829842, 0.3246030337165473, 0.3757955516422055, 0.35010758763584876, 0.4066241493207977, 0.2747723387770502, 0.41126730452490956, 0.4034722558944266, 0.23071204197108325, 0.21356389250949495, 0.2506541416123477, 0.1395513808948919, 0.1347809285968829, 0.16719724273077177, 0.33835397762852726, 0.25836326255215475, 0.4778343054123098, 0.31811457161396967, 0.3137745038447582, 0.3423088149527841, 0.3458245469796016, 0.34850465828256166, 0.34461252093496103, 0.30724780868090457, 0.32016628422038745, 0.435771386019691, 0.4059407557259376, 0.49984363288754735, 0.21853444401923464, 0.21428265102990002, 0.39169842291834606, 0.47508473363402964, 0.49332200863876363, 0.13908229817866685, 0.15931841410593847, 0.11923760974001169, 0.23789537641829842, 0.3101952797244992, 0.3252397499180114, 0.3182795299089748, 0.3074224535341671, 0.34461252093496103, 0.35910273966521833, 0.3322573102735896, 0.3425260335658089]}
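
The scaler file stores one `mean` and one `std` entry per feature, in `feature_names` order. Assuming it is applied as plain standardization, `(x - mean) / std`, before the MLP (the constant columns `label_false_positive` and `is_dynamic_port` carry mean 0.0 and std 1.0, which is consistent with that reading), a minimal sketch could look like this. `standardize` and the toy scaler are hypothetical.

```python
import json


def standardize(x, scaler):
    """Apply (x - mean) / std feature-wise; a std of 0 is treated as 1."""
    mean, std = scaler["mean"], scaler["std"]
    if not (len(x) == len(mean) == len(std)):
        raise ValueError("feature vector and scaler lengths differ")
    return [(v - m) / (s if s else 1.0) for v, m, s in zip(x, mean, std)]


# Toy 3-feature scaler in the same {"mean": [...], "std": [...]} layout.
toy_scaler = json.loads('{"mean": [10.0, 0.0, 0.5], "std": [2.0, 1.0, 0.5]}')
scaled = standardize([14.0, 0.0, 1.0], toy_scaler)
print(scaled)
```

The length check matters in practice: a feature vector built with a different `feature_meta.json` would otherwise be scaled against the wrong columns without any error.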

inference_example.ipynb ADDED

	@@ -0,0 +1,350 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB010 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **attack lifecycle phase** for a security event.\n",
+    "\n",
+    "**Models predict one of 5 phases:** `benign_background`, `initial_access`, `lateral_movement`, `persistence_establishment`, `exfiltration_or_impact`.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production phase classifier. See the model card and **`leakage_diagnostic.json`** for the structural-leakage findings (11 oracle paths documented across the dataset)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb010-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, os\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import (\n",
+    "    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "N_CLASSES = len(meta[\"int_to_label\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"class count:   {N_CLASSES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")\n",
+    "print(\"\\noracle columns excluded (do not pass these to the model):\")\n",
+    "for c in meta.get(\"oracle_excluded\", []):\n",
+    "    print(f\"  - {c}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# MLP architecture (must match training)\n",
+    "class PhaseMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=5, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Load host inventory for host-feature lookup\n",
+    "\n",
+    "The model uses host context (os_type, host_role, defender_posture, etc.) as features. To predict on a new event, we look up its host features from the host_inventory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb010-sample\", repo_type=\"dataset\")\n",
+    "host_lookup = build_host_lookup(f\"{ds_path}/host_inventory.csv\")\n",
+    "print(f\"loaded {len(host_lookup)} host records\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Prediction helper"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "\n",
+    "def predict_attack_phase(event: dict) -> dict:\n",
+    "    \"\"\"Predict the attack lifecycle phase for one security event.\n",
+    "\n",
+    "    Note: do NOT include mitre_tactic, mitre_technique_id,\n",
+    "    label_malicious, threat_actor_id, threat_actor_profile, or\n",
+    "    event_type in the record. These were structural oracles in the\n",
+    "    training data and are excluded from the feature set.\n",
+    "\n",
+    "    Host features (os_type, host_role, etc.) are looked up from\n",
+    "    host_inventory by host_id.\n",
+    "    \"\"\"\n",
+    "    X = transform_single(event, meta, host_lookup=host_lookup)\n",
+    "\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Run on an example event\n",
+    "\n",
+    "Real high-severity authentication event from the CYB010 sample. True phase is `initial_access` — an APT session anomaly with CVSS 7.56 against a workstation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Real event from the sample dataset (true phase: initial_access)\n",
+    "example_event = {\n",
+    "    \"host_id\": \"HOST-00352\",\n",
+    "    \"timestamp\": \"2024-07-22T21:55:40.046569+00:00\",\n",
+    "    \"source_port\": 27110,\n",
+    "    \"dest_port\": 8443,\n",
+    "    \"event_class\": \"authentication\",\n",
+    "    \"log_source_type\": \"splunk\",\n",
+    "    \"severity_level\": \"high\",\n",
+    "    \"label_false_positive\": False,\n",
+    "    \"label_log_tampered\": False,\n",
+    "    \"cvss_score_analogue\": 7.56,\n",
+    "}\n",
+    "\n",
+    "result = predict_attack_phase(example_event)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:30s}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:30s}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Per-class confidence patterns\n",
+    "\n",
+    "Both models are most confident on `benign_background` and `exfiltration_or_impact` (per-class F1 of 1.00 and 0.99 for XGBoost, 0.99 and 0.98 for the MLP, at seed 42). The middle phases (`initial_access`, `lateral_movement`, `persistence_establishment`) overlap more in feature space, so expect modest confidence (0.4-0.7) on those predictions.\n",
+    "\n",
+    "`lateral_movement` is the hardest class (F1 0.48 at seed 42). Real SOC data would have stronger sequential signal (event-sequence features within an incident) that the per-event baseline does not capture."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Batch prediction on the sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "events = pd.read_csv(f\"{ds_path}/security_events.csv\")\n",
+    "\n",
+    "# Score the first 500 events\n",
+    "sample = events.head(500).copy()\n",
+    "preds = [predict_attack_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
+    "sample[\"xgb_pred\"] = preds\n",
+    "\n",
+    "ct = pd.crosstab(sample[\"attack_lifecycle_phase\"], sample[\"xgb_pred\"],\n",
+    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 500 sample events (XGBoost):\")\n",
+    "print(ct)\n",
+    "acc = (sample[\"attack_lifecycle_phase\"] == sample[\"xgb_pred\"]).mean()\n",
+    "print(f\"\\nbatch accuracy on first 500 events (in-distribution): {acc:.4f}\")\n",
+    "print(\"\\nNote: this includes training-set events. See validation_results.json\\n\"\n",
+    "      \"for proper held-out test metrics (group-aware split by incident_id).\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Important reading: the leakage diagnostic\n",
+    "\n",
+    "Before using CYB010 sample data to train your own models, read **`leakage_diagnostic.json`** in this repo. It documents **11 oracle paths** across the sample's targets:\n",
+    "\n",
+    "**Phase target oracles (6 paths):**\n",
+    "1. `mitre_tactic == \"benign\"` → 100% `benign_background` phase\n",
+    "2. `mitre_technique_id` → `mitre_tactic` (perfect ATT&CK-by-design oracle)\n",
+    "3. `label_malicious == False` → 100% `benign_background`\n",
+    "4. `threat_actor_id == \"NONE\"` → 100% benign\n",
+    "5. `threat_actor_profile == \"benign_user\"` → 100% benign\n",
+    "6. `event_type` → near-oracle for specific phases (e.g. `c2_beacon_outbound` → `exfiltration_or_impact` at 95% purity)\n",
+    "\n",
+    "**Alert TP target oracles (7 paths)** — for the secondary `label_true_positive` task on `alert_records.csv`:\n",
+    "1. `alert_category == \"false_positive_noise\"` → 100% FP\n",
+    "2. `label_false_positive` (mirror of target)\n",
+    "3. `time_to_detect_seconds == 0` → 100% FP\n",
+    "4. `correlated_chain_length > 1` → 100% TP (every FP has length 1, but so do some TPs)\n",
+    "5. `analyst_triage_priority ∈ {P1,P2,P3}` → 100% TP\n",
+    "6. `suppression_reason` is NaN → 100% TP\n",
+    "7. `alert_rule_name` (rule names encode the answer)\n",
+    "\n",
+    "It also documents **2 README-suggested targets that are unlearnable on the sample** after honest leak removal: `threat_actor_profile` 4-class (malicious-only) and `event_class` 12-class."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for held-out test metrics (3,726 events from ~75 test incidents).\n",
+    "- See `multi_seed_results.json` for the across-10-seeds picture (accuracy 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001).\n",
+    "- See `ablation_results.json` for per-feature-group contribution. `event_class` carries the dominant signal (−18pp macro-F1 when removed); CVSS features are second.\n",
+    "- See **`leakage_diagnostic.json`** for the full 11-oracle-path audit.\n",
+    "- For the full ~550k-row CYB010 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
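
Section 7 of the notebook passes `row.to_dict()` straight into `predict_attack_phase`, so each record still carries the six oracle columns the docstring warns against, plus the target itself. Whether `transform_single` silently ignores those keys is not visible in this diff. A minimal sketch, assuming it is safer to drop them explicitly; `strip_oracle_columns`, `ORACLE_COLUMNS`, and `TARGET_COLUMN` are hypothetical names, not part of `feature_engineering.py`, and the toy event below is invented for illustration:

```python
# Hypothetical helper (not part of feature_engineering.py): remove the six
# phase-oracle columns and the target before an event reaches the model.
ORACLE_COLUMNS = frozenset({
    "mitre_tactic", "mitre_technique_id", "label_malicious",
    "threat_actor_id", "threat_actor_profile", "event_type",
})
TARGET_COLUMN = "attack_lifecycle_phase"


def strip_oracle_columns(event: dict) -> dict:
    """Return a copy of `event` without oracle columns or the target."""
    blocked = ORACLE_COLUMNS | {TARGET_COLUMN}
    return {key: value for key, value in event.items() if key not in blocked}


# Toy record invented for illustration; only the column names come from the notebook.
raw_event = {
    "host_id": "HOST-00352",
    "event_class": "authentication",
    "severity_level": "high",
    "mitre_tactic": "initial-access",
    "label_malicious": True,
    "attack_lifecycle_phase": "initial_access",
}
clean_event = strip_oracle_columns(raw_event)
print(sorted(clean_event))
```

In the batch cell this would wrap the existing call as `predict_attack_phase(strip_oracle_columns(row.to_dict()))`. The helper returns a copy, so the original row keeps its true label for the confusion table.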

leakage_diagnostic.json ADDED

	@@ -0,0 +1,186 @@

+{
+  "purpose": "The CYB010 sample has extensive structural leakage in two places. First, the per-event phase/profile labels are oracled by the mitre_tactic == 'benign' and threat_actor_id == 'NONE' markers (both perfect benign indicators). Second, the per-alert label_true_positive target is oracled by seven separate columns: alert_category, label_false_positive, time_to_detect_seconds (sentinel), correlated_chain_length (sentinel), analyst_triage_priority, suppression_reason, and alert_rule_name. The published baseline (attack_lifecycle_phase 5-class) trains with all six phase-oracle columns excluded (P1-P6 below).",
+  "primary_target": "attack_lifecycle_phase (5-class, per-event)",
+  "split": "GroupShuffleSplit on incident_id, 70/15/15 nested",
+  "oracle_paths_documented": {
+    "P1_mitre_tactic_benign": {
+      "target": "attack_lifecycle_phase == 'benign_background'",
+      "leak_column": "mitre_tactic",
+      "mechanism": "All events with mitre_tactic == 'benign' are in benign_background phase; all events in benign_background have mitre_tactic == 'benign'. Perfect bidirectional oracle (12,448 of 12,448 cases).",
+      "evidence_counts": {
+        "tactic_benign_AND_phase_benign": 12448,
+        "tactic_benign_AND_phase_other": 0,
+        "tactic_attack_AND_phase_benign": 0
+      },
+      "verdict": "Perfect oracle for benign_background phase."
+    },
+    "P2_mitre_technique_id": {
+      "target": "mitre_tactic",
+      "leak_column": "mitre_technique_id",
+      "mechanism": "By ATT&CK design, each MITRE technique (T-number) belongs to exactly one tactic. 100% of techniques in the sample (54 of 54) map deterministically to a single tactic. Indirect oracle for phase via the mitre_tactic chain.",
+      "evidence": {
+        "n_unique_techniques": 54,
+        "techniques_mapping_to_single_tactic": 54,
+        "percent_oracle": 100.0
+      },
+      "verdict": "Perfect oracle for mitre_tactic; indirect for phase."
+    },
+    "P3_label_malicious": {
+      "target": "attack_lifecycle_phase == 'benign_background'",
+      "leak_column": "label_malicious",
+      "mechanism": "label_malicious is False if and only if the event is in benign_background phase. Perfect bidirectional encoding.",
+      "evidence_counts": {
+        "label_malicious_False_AND_phase_benign": 12448,
+        "label_malicious_False_AND_phase_other": 0
+      },
+      "verdict": "Perfect oracle for benign_background phase."
+    },
+    "P4_threat_actor_id_NONE": {
+      "target": "attack_lifecycle_phase == 'benign_background'",
+      "leak_column": "threat_actor_id",
+      "mechanism": "threat_actor_id has 11 values: 10 ACTOR-XXXX labels (one per malicious actor) plus 'NONE' for benign events. threat_actor_id == 'NONE' is a perfect oracle for benign phase; the 10 ACTOR-XXXX values are perfect oracles for non-benign phase.",
+      "evidence_counts": {
+        "actor_NONE_AND_phase_benign": 12448,
+        "actor_NONE_AND_phase_other": 0
+      },
+      "verdict": "Perfect oracle for benign_background phase."
+    },
+    "P5_threat_actor_profile_benign": {
+      "target": "attack_lifecycle_phase == 'benign_background'",
+      "leak_column": "threat_actor_profile",
+      "mechanism": "threat_actor_profile == 'benign_user' is a perfect oracle for benign_background phase. The 4 non-benign profile values (apt, nation_state, insider, script_kiddie) all indicate non-benign phase.",
+      "evidence_counts": {
+        "profile_benign_user_AND_phase_benign": 12448
+      },
+      "verdict": "Perfect oracle for benign_background phase."
+    },
+    "P6_event_type_phase": {
+      "target": "attack_lifecycle_phase (multiple phases)",
+      "leak_column": "event_type",
+      "mechanism": "Many event_type values are phase-specific. For example, 'c2_beacon_outbound' (6,158 events) maps to exfiltration_or_impact with 95.1% purity (see near_oracle_event_types). Other event types similarly map to specific phases.",
+      "near_oracle_event_types": {
+        "c2_beacon_outbound": {
+          "maps_to": "exfiltration_or_impact",
+          "purity": 0.9514,
+          "n_events": 6158
+        },
+        "credential_dumping_attempt": {
+          "maps_to": "benign_background",
+          "purity": 0.9518,
+          "n_events": 166
+        },
+        "process_hollowing_detected": {
+          "maps_to": "benign_background",
+          "purity": 0.9527,
+          "n_events": 169
+        }
+      },
+      "n_event_types_with_purity_above_95pct": 3,
+      "verdict": "Strong near-oracle for multiple phases. Dropped."
+    },
+    "A1_alert_category_FP_noise": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "alert_category",
+      "mechanism": "alert_category == 'false_positive_noise' is a perfect oracle for label_true_positive == False (2,721 of 2,721 noise alerts are FP; all 14 other categories are 100% TP).",
+      "verdict": "Perfect oracle."
+    },
+    "A2_label_false_positive_mirror": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "label_false_positive",
+      "mechanism": "label_false_positive is exactly NOT label_true_positive (verified across all 5,162 alerts). Same target.",
+      "verdict": "Perfect oracle (mirror target)."
+    },
+    "A3_time_to_detect_sentinel": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "time_to_detect_seconds",
+      "mechanism": "FP alerts have time_to_detect_seconds == 0 (sentinel for 'no detection time because it's a false positive'). TP alerts have detection times ranging 240 to 2,592,000 seconds. Perfect oracle.",
+      "evidence": {
+        "FP_alerts_time_zero": 2721,
+        "TP_alerts_time_zero": 0
+      },
+      "verdict": "Perfect oracle."
+    },
+    "A4_correlated_chain_sentinel": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "correlated_chain_length",
+      "mechanism": "FP alerts always have correlated_chain_length == 1 (no correlation possible because false positives don't chain). TP alerts have chain length 1-20 with mean 3.14. Perfect oracle when chain_length > 1; chain_length == 1 still allows some TPs.",
+      "verdict": "Strong oracle - chain_length > 1 perfectly identifies TP."
+    },
+    "A5_analyst_triage_priority": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "analyst_triage_priority",
+      "mechanism": "P1, P2, P3 priorities are 100% TP (1,609 alerts total). P4 splits 77% FP / 23% TP (2,721 vs 832 alerts). The P1/P2/P3 indicator alone is a perfect oracle for TP within those alerts.",
+      "evidence_counts": {
+        "P1": {
+          "false": 0,
+          "true": 131
+        },
+        "P2": {
+          "false": 0,
+          "true": 432
+        },
+        "P3": {
+          "false": 0,
+          "true": 1046
+        },
+        "P4": {
+          "false": 2721,
+          "true": 832
+        }
+      },
+      "verdict": "Strong oracle (perfect for P1/P2/P3)."
+    },
+    "A6_suppression_reason": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "suppression_reason",
+      "mechanism": "suppression_reason is NaN only for TP alerts (1,744 of 1,744 NaN values are TP). The converse does not hold: any non-NaN suppression reason is 79-82% FP, so some TP alerts also carry a suppression reason. Strong oracle.",
+      "verdict": "Strong oracle."
+    },
+    "A7_alert_rule_name": {
+      "target": "label_true_positive (alerts)",
+      "leak_column": "alert_rule_name",
+      "mechanism": "alert_rule_name often encodes the answer (rules with 'false_positive' or 'noise' in name map deterministically to FP; rules with attack-specific names map to TP).",
+      "verdict": "Strong oracle by rule naming convention."
+    }
+  },
+  "unlearnable_targets": [
+    {
+      "target": "threat_actor_profile 4-class (malicious events only)",
+      "n_classes": 4,
+      "n_events": 9448,
+      "majority_baseline": 0.6110287891617273,
+      "honest_accuracy": 0.5543902985277928,
+      "honest_roc_auc": 0.7473176763614474,
+      "verdict": "below_majority",
+      "note": "After filtering to malicious events only and dropping all phase/tactic oracles, threat actor attribution is below majority baseline. The 5-class formulation works only because benign_user separation is trivial (which is a structural oracle finding)."
+    },
+    {
+      "target": "event_class 12-class (per-event)",
+      "n_classes": 12,
+      "majority_baseline": 0.4211728169528681,
+      "honest_accuracy": 0.3508069868328931,
+      "verdict": "below_majority",
+      "note": "event_class is a structural property of the event itself (e.g. network_flow, authentication, endpoint_process) and is not learnable from other features without leaking event_type."
+    }
+  ],
+  "alert_task_findings": {
+    "task": "label_true_positive binary on alert_records (5,162 alerts)",
+    "with_oracles_intact_accuracy": 1.0,
+    "with_oracles_intact_note": "100% test accuracy with any single oracle column present",
+    "honest_accuracy_mean_3seeds": 0.7636892643739505,
+    "honest_roc_auc_mean_3seeds": 0.8541442200259074,
+    "majority_baseline": 0.5271212708252615,
+    "interpretation": "After dropping all 7 oracle columns, honest XGBoost achieves acc 0.764 and AUC 0.854 on the alert TP task - real signal from severity_level, siem_platform_type, suppressed_flag, and host context features. This is a viable secondary task but is NOT the published baseline (the per-event attack_lifecycle_phase task is)."
+  },
+  "unlearnable_summary": "Two README-suggested targets are unlearnable on the sample after honest oracle removal: threat_actor_profile 4-class (malicious-only) and event_class 12-class. The 5-class threat_actor_profile WITH benign included is technically viable (acc 0.84) but per-class F1 reveals it's almost entirely driven by benign_user separation (F1 1.00 vs F1 0.17-0.69 for the 4 malicious classes). Hence the published primary target is attack_lifecycle_phase 5-class.",
+  "recommendations_to_dataset_author": [
+    "Remove the threat_actor_id == 'NONE' sentinel for benign events. Use a per-event mask or a separate benign-actor pool with realistic actor IDs.",
+    "Replace the mitre_tactic == 'benign' marker with phase-specific tactic distributions (e.g. benign events should sample from realistic non-malicious tactic-free patterns, not all share a 'benign' value).",
+    "Make event_type less deterministic per phase. 'c2_beacon_outbound' should appear in a few different phases with phase-specific frequencies, rather than sitting at 95% purity in exfiltration_or_impact.",
+    "Replace time_to_detect_seconds == 0 sentinel for FP alerts with realistic detection-time distributions; FP alerts can still have a 'time to detection' value (the time to dismiss).",
+    "Replace correlated_chain_length == 1 sentinel for FP with occasional 2-3 chains (real noise sometimes correlates).",
+    "Replace analyst_triage_priority P1/P2/P3 -> 100% TP with realistic uncertainty; some P1 alerts are FPs in real data.",
+    "Make alert_category values and alert_rule_name values less revealing - a category like 'false_positive_noise' deterministically encodes the label. Use abstract rule IDs and have the FP label come from outcome statistics, not the category or rule name.",
+    "To enable threat_actor_profile 4-class learning, add stronger per-actor feature signatures - APT vs nation_state should have distinct host targeting, dwell time per host, and log_source affinity. Current overlap is too tight."
+  ]
+}
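
The `evidence_counts` and `purity` fields in this file are contingency-table statistics: for each value of a candidate leak column, what fraction of rows fall into that value's most common target class. A minimal sketch of that check, using only the standard library; `value_purity` is a hypothetical helper, the rows are made-up toy data, and the audit script that actually produced this JSON is not shown in this diff:

```python
from collections import Counter, defaultdict


def value_purity(rows, leak_column, target_column):
    """Map each leak-column value to (majority target, purity, row count)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[leak_column]][row[target_column]] += 1
    report = {}
    for value, target_counts in counts.items():
        majority, hits = target_counts.most_common(1)[0]
        total = sum(target_counts.values())
        report[value] = (majority, hits / total, total)
    return report


# Toy rows invented for illustration; not taken from the CYB010 sample.
toy_rows = (
    [{"mitre_tactic": "benign", "phase": "benign_background"}] * 4
    + [{"mitre_tactic": "exfiltration", "phase": "exfiltration_or_impact"}] * 3
    + [{"mitre_tactic": "exfiltration", "phase": "lateral_movement"}]
)
purity_report = value_purity(toy_rows, "mitre_tactic", "phase")
for value, (majority, purity, n_rows) in sorted(purity_report.items()):
    print(f"{value:>12}: {majority} purity={purity:.2f} n={n_rows}")
```

A value with purity 1.0 is a perfect one-directional oracle for its majority class. The bidirectional claims (P1, P3) additionally need the reverse table, with the target as the grouping column.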

model_mlp.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d4be794e569b948bf7f16454742ed84e865710a260635494c641b7216fffd10a
+size 83676

model_xgb.json ADDED

The diff for this file is too large to render. See raw diff

multi_seed_results.json ADDED

	@@ -0,0 +1,98 @@

+{
+  "purpose": "Multi-seed evaluation across 10 group-aware splits of the 21,896-event sample (500 incidents).",
+  "seeds_evaluated": [
+    42,
+    7,
+    13,
+    17,
+    23,
+    31,
+    45,
+    99,
+    123,
+    200
+  ],
+  "per_seed": [
+    {
+      "seed": 42,
+      "test_n_classes": 5,
+      "accuracy": 0.9492753623188406,
+      "macro_f1": 0.7780594102481514,
+      "macro_roc_auc_ovr": 0.9904125505537232
+    },
+    {
+      "seed": 7,
+      "test_n_classes": 5,
+      "accuracy": 0.9371447676362421,
+      "macro_f1": 0.7470429505084855,
+      "macro_roc_auc_ovr": 0.9883780833142183
+    },
+    {
+      "seed": 13,
+      "test_n_classes": 5,
+      "accuracy": 0.9440175631174533,
+      "macro_f1": 0.7786431389219104,
+      "macro_roc_auc_ovr": 0.9893348598508764
+    },
+    {
+      "seed": 17,
+      "test_n_classes": 5,
+      "accuracy": 0.9301659988551803,
+      "macro_f1": 0.7496550235562918,
+      "macro_roc_auc_ovr": 0.9862828960991046
+    },
+    {
+      "seed": 23,
+      "test_n_classes": 5,
+      "accuracy": 0.9409375,
+      "macro_f1": 0.7808189932344203,
+      "macro_roc_auc_ovr": 0.9899045909034948
+    },
+    {
+      "seed": 31,
+      "test_n_classes": 5,
+      "accuracy": 0.930905695611578,
+      "macro_f1": 0.7613555094687323,
+      "macro_roc_auc_ovr": 0.9868934259288492
+    },
+    {
+      "seed": 45,
+      "test_n_classes": 5,
+      "accuracy": 0.9233565586186004,
+      "macro_f1": 0.7409385948742784,
+      "macro_roc_auc_ovr": 0.9864613394709789
+    },
+    {
+      "seed": 99,
+      "test_n_classes": 5,
+      "accuracy": 0.9290322580645162,
+      "macro_f1": 0.7409062534499034,
+      "macro_roc_auc_ovr": 0.9861301771811058
+    },
+    {
+      "seed": 123,
+      "test_n_classes": 5,
+      "accuracy": 0.937037037037037,
+      "macro_f1": 0.7622080835728512,
+      "macro_roc_auc_ovr": 0.9882332249503822
+    },
+    {
+      "seed": 200,
+      "test_n_classes": 5,
+      "accuracy": 0.9404943545926152,
+      "macro_f1": 0.7495112167344459,
+      "macro_roc_auc_ovr": 0.988891453888266
+    }
+  ],
+  "aggregate": {
+    "accuracy_mean": 0.9362367095852064,
+    "accuracy_std": 0.007451938413439355,
+    "accuracy_min": 0.9233565586186004,
+    "accuracy_max": 0.9492753623188406,
+    "macro_f1_mean": 0.758913917456947,
+    "macro_f1_std": 0.014882483861819625,
+    "roc_auc_mean": 0.9880922602141,
+    "roc_auc_std": 0.001489069995610803
+  },
+  "published_artifact_seed": 42
+}
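
The `aggregate` block can be reproduced from the `per_seed` entries. A minimal sketch for the accuracy figures, using only the standard library; it suggests that `accuracy_std` is the population standard deviation (dividing by n, as NumPy's default `ddof=0` does) rather than the sample standard deviation, which is an inference from the numbers and not something the file states:

```python
from statistics import mean, pstdev

# Per-seed test accuracies copied from the "per_seed" list,
# in the order of "seeds_evaluated".
accuracies = [
    0.9492753623188406,  # seed 42
    0.9371447676362421,  # seed 7
    0.9440175631174533,  # seed 13
    0.9301659988551803,  # seed 17
    0.9409375,           # seed 23
    0.930905695611578,   # seed 31
    0.9233565586186004,  # seed 45
    0.9290322580645162,  # seed 99
    0.937037037037037,   # seed 123
    0.9404943545926152,  # seed 200
]

accuracy_mean = mean(accuracies)
accuracy_std = pstdev(accuracies)  # population std: divides by n, not n - 1

print(f"accuracy: {accuracy_mean:.3f} +/- {accuracy_std:.3f}")
print(f"range:    {min(accuracies):.4f} to {max(accuracies):.4f}")
```

With only 10 seeds the sample standard deviation would be about 5% larger, so the choice of estimator is worth stating wherever the ± figure is quoted.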

validation_results.json ADDED

	@@ -0,0 +1,180 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb010-sample",
+  "task": "5-class attack_lifecycle_phase classification",
+  "baselines": {
+    "always_predict_majority_accuracy": 0.5593129361245304,
+    "majority_class": "benign_background",
+    "random_guess_accuracy": 0.2
+  },
+  "split": {
+    "strategy": "group-aware (GroupShuffleSplit on incident_id, nested 70/15/15)",
+    "rationale": "500 incidents x ~44 events each. Events from the same incident share host, threat actor, and phase trajectory. Group-aware splitting prevents train/test leakage. ~75 test incidents per fold.",
+    "events_train": 14697,
+    "events_val": 3473,
+    "events_test": 3726,
+    "n_incidents_train": 350,
+    "seed": 42
+  },
+  "n_features": 87,
+  "label_classes": [
+    "benign_background",
+    "initial_access",
+    "lateral_movement",
+    "persistence_establishment",
+    "exfiltration_or_impact"
+  ],
+  "class_distribution_train": {
+    "benign_background": 8547,
+    "exfiltration_or_impact": 3898,
+    "initial_access": 1187,
+    "lateral_movement": 670,
+    "persistence_establishment": 395
+  },
+  "class_distribution_test": {
+    "benign_background": 2084,
+    "exfiltration_or_impact": 1186,
+    "initial_access": 247,
+    "lateral_movement": 118,
+    "persistence_establishment": 91
+  },
+  "oracle_excluded_features": [
+    "mitre_tactic (benign value -> benign_background phase, perfect oracle)",
+    "mitre_technique_id (ATT&CK-by-design perfect oracle for mitre_tactic)",
+    "label_malicious (False -> benign_background, perfect oracle)",
+    "threat_actor_id (NONE -> benign, perfect oracle)",
+    "threat_actor_profile (benign_user -> benign_background, perfect oracle)",
+    "event_type (many values phase-specific; e.g. c2_beacon_outbound -> 95% exfiltration_or_impact)"
+  ],
+  "leakage_audit_note": "See leakage_diagnostic.json for the full audit. 11 oracle paths documented (4 phase oracles, 1 ATT&CK indirect, 6 event_type near-oracles, 7 alert-task oracles), and 2 unlearnable README-suggested targets after honest leakage removal.",
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 5 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.9492753623188406,
+        "macro_f1": 0.7780594102481514,
+        "weighted_f1": 0.9522470071864876,
+        "per_class_f1": {
+          "benign_background": 0.9975996159385502,
+          "initial_access": 0.7196652719665272,
+          "lateral_movement": 0.48322147651006714,
+          "persistence_establishment": 0.703030303030303,
+          "exfiltration_or_impact": 0.9867803837953092
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2078,
+              6,
+              0,
+              0,
+              0
+            ],
+            [
+              4,
+              172,
+              65,
+              6,
+              0
+            ],
+            [
+              0,
+              38,
+              72,
+              6,
+              2
+            ],
+            [
+              0,
+              11,
+              22,
+              58,
+              0
+            ],
+            [
+              0,
+              4,
+              21,
+              4,
+              1157
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9904125505537232
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 87 -> 128 -> 64 -> 5, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.9412238325281803,
+        "macro_f1": 0.7533989932595785,
+        "weighted_f1": 0.9423850278932477,
+        "per_class_f1": {
+          "benign_background": 0.9937679769894535,
+          "initial_access": 0.6511627906976745,
+          "lateral_movement": 0.4507042253521127,
+          "persistence_establishment": 0.6903553299492385,
+          "exfiltration_or_impact": 0.9810046433094133
+        },
+        "confusion_matrix": {
+          "labels": [
+            "benign_background",
+            "initial_access",
+            "lateral_movement",
+            "persistence_establishment",
+            "exfiltration_or_impact"
+          ],
+          "matrix": [
+            [
+              2073,
+              11,
+              0,
+              0,
+              0
+            ],
+            [
+              10,
+              140,
+              72,
+              17,
+              8
+            ],
+            [
+              2,
+              27,
+              64,
+              12,
+              13
+            ],
+            [
+              2,
+              4,
+              17,
+              68,
+              0
+            ],
+            [
+              1,
+              1,
+              13,
+              9,
+              1162
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.986126094475466
+      }
+    }
+  }
+}
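
The headline XGBoost metrics can be re-derived from the confusion matrix alone. A minimal sketch, assuming rows are true classes and columns are predicted classes; the file does not state the orientation, but the row sums match `class_distribution_test`, which is what that convention would produce:

```python
labels = [
    "benign_background",
    "initial_access",
    "lateral_movement",
    "persistence_establishment",
    "exfiltration_or_impact",
]
# XGBoost test confusion matrix from validation_results.json.
# Assumed orientation: matrix[true][predicted].
matrix = [
    [2078, 6, 0, 0, 0],
    [4, 172, 65, 6, 0],
    [0, 38, 72, 6, 2],
    [0, 11, 22, 58, 0],
    [0, 4, 21, 4, 1157],
]

n_classes = len(labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(n_classes)) / total

per_class_f1 = {}
for i, label in enumerate(labels):
    true_positives = matrix[i][i]
    actual = sum(matrix[i])                    # row sum: events truly in class i
    predicted = sum(row[i] for row in matrix)  # column sum: events predicted as class i
    # F1 = 2*TP / (actual + predicted), the harmonic mean of precision and recall.
    per_class_f1[label] = 2 * true_positives / (actual + predicted)

macro_f1 = sum(per_class_f1.values()) / n_classes

print(f"accuracy: {accuracy:.4f}")
print(f"macro F1: {macro_f1:.4f}")
for label, f1 in per_class_f1.items():
    print(f"  {label:28s} F1 = {f1:.4f}")
```

The `lateral_movement` column shows where the macro-F1 gap comes from: 65 `initial_access` events are predicted as `lateral_movement`, which pulls precision for that class well below its recall.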