Initial release: XGBoost + MLP for ATT&CK phase classification
- README.md +408 -0
- ablation_results.json +804 -0
- feature_engineering.py +394 -0
- feature_meta.json +249 -0
- feature_scaler.json +1 -0
- inference_example.ipynb +343 -0
- model_mlp.safetensors +3 -0
- model_xgb.json +0 -0
- validation_results.json +383 -0
README.md
ADDED
|
@@ -0,0 +1,408 @@
---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- mitre-attack
- kill-chain
- apt
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb002-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb002-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 10-class MITRE ATT&CK kill-chain phase classification
    dataset:
      type: xpertsystems/cyb002-sample
      name: CYB002 Synthetic Cyber Attack Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.8599
      name: Test macro ROC-AUC OvR (XGBoost)
    - type: f1
      value: 0.4255
      name: Test macro-F1 (XGBoost)
    - type: accuracy
      value: 0.4683
      name: Test accuracy (XGBoost)
    - type: roc_auc
      value: 0.8496
      name: Test macro ROC-AUC OvR (MLP)
    - type: f1
      value: 0.3911
      name: Test macro-F1 (MLP)
    - type: accuracy
      value: 0.4449
      name: Test accuracy (MLP)
---

# CYB002 Baseline Classifier

**MITRE ATT&CK kill-chain phase classifier trained on the CYB002
synthetic cyber attack sample. Predicts which of 10 kill-chain phases
an attack event belongs to, from observable event + segment features.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB002 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb002-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production threat detector or SOC tool. See
> [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 10-class kill-chain phase classification |
| Training data | `xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns) |
| Models | XGBoost + PyTorch MLP |
| Input features | 90 (after one-hot encoding) |
| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

Two model artifacts are published. They are designed to be used together, since disagreement between them is a useful triage signal:

- `model_xgb.json`: gradient-boosted trees, primary recommendation
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

## Quick start

```bash
pip install xgboost scikit-learn torch safetensors pandas huggingface_hub
```

```python
import json
import os
import sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb002-baseline-classifier"

# Download every artifact the pipeline needs (cached locally after the first call)
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
)

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Build the segment-aggregate lookup from the dataset's topology CSV
seg_lookup = build_segment_lookup("path/to/network_topology.csv")

# Predict (see inference_example.ipynb for the full pattern).
# `my_event` is a placeholder for one raw event record that you supply.
seg_agg = seg_lookup.get(my_event["target_segment_id"], {})
X = transform_single(my_event, meta, segment_aggregates=seg_agg)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for an
end-to-end copy-paste demo including segment-aggregate setup and
batch prediction.

## Training data

Trained on the public sample of CYB002: 4,353 attack events from 100
distinct campaigns. Per-phase counts for the train and test folds:

| Phase | Train (n=2,822) | Test (n=726) | Test share |
|---|---:|---:|---:|
| `dwell_idle` | 581 | 141 | 19.4% |
| `reconnaissance` | 411 | 112 | 15.4% |
| `initial_access` | 358 | 106 | 14.6% |
| `execution` | 324 | 74 | 10.2% |
| `persistence` | 287 | 79 | 10.9% |
| `privilege_escalation` | 249 | 68 | 9.4% |
| `lateral_movement` | 201 | 54 | 7.4% |
| `collection` | 162 | 40 | 5.5% |
| `exfiltration` | 113 | 31 | 4.3% |
| `impact` | 105 | 21 | 2.9% |

### Group-aware split

A single campaign generates about 44 highly correlated events on average
(4,353 events / 100 campaigns). Random row-level splitting would put events
from the same campaign in both train and test, inflating metrics in a way
that does not generalize to new campaigns.
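
The idea of splitting by campaign rather than by row can be sketched with the standard library alone. This is an illustration, not the training code (which is not published): `group_split` and the toy events are hypothetical, the helper is only a stand-in for scikit-learn's `GroupShuffleSplit`, and the only name taken from the dataset is `campaign_id`.

```python
import random
from collections import defaultdict


def group_split(events, group_key="campaign_id", fractions=(0.70, 0.15, 0.15), seed=42):
    """Split events into train/val/test so that no group spans two folds.

    Hypothetical helper for illustration only.
    """
    by_group = defaultdict(list)
    for event in events:
        by_group[event[group_key]].append(event)

    group_ids = sorted(by_group)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)

    n_train = round(fractions[0] * len(group_ids))
    n_val = round(fractions[1] * len(group_ids))
    folds = {
        "train": group_ids[:n_train],
        "val": group_ids[n_train:n_train + n_val],
        "test": group_ids[n_train + n_val:],
    }
    # Whole campaigns are assigned to a fold, never individual rows.
    return {name: [e for g in ids for e in by_group[g]] for name, ids in folds.items()}


# Toy data: 20 campaigns with 5 events each (made-up values).
toy_events = [{"campaign_id": f"C{c:03d}", "timestep": t} for c in range(20) for t in range(5)]
splits = group_split(toy_events)

train_ids = {e["campaign_id"] for e in splits["train"]}
test_ids = {e["campaign_id"] for e in splits["test"]}
print(len(train_ids), len(test_ids), train_ids & test_ids)
```

Because fold membership is decided per campaign, the intersection of train and test campaign ids is always empty, which is the property a row-level shuffle cannot guarantee.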

This release uses **GroupShuffleSplit by `campaign_id`**:

| Fold | Campaigns | Events |
|---|---:|---:|
| Train | 69 | 2,822 |
| Validation | 16 | 805 |
| Test | 15 | 726 |

All test campaigns are completely unseen during training. Class imbalance
is addressed with class-balanced sample weights (the equivalent of
`class_weight='balanced'`, passed to XGBoost as `sample_weight`) and
weighted cross-entropy (MLP).
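
As a rough illustration of what class-balanced weighting does, the sketch below applies the usual "balanced" heuristic, `n_samples / (n_classes * count_c)`, to the training counts listed in the Training data table. The exact weighting used in training is not published, so treat this as an assumption; `balanced_class_weights` is a hypothetical helper, and `n_samples` here is simply the sum of the listed per-class counts.

```python
# Training-fold class counts copied from the "Training data" table.
train_counts = {
    "dwell_idle": 581, "reconnaissance": 411, "initial_access": 358,
    "execution": 324, "persistence": 287, "privilege_escalation": 249,
    "lateral_movement": 201, "collection": 162, "exfiltration": 113,
    "impact": 105,
}


def balanced_class_weights(counts):
    """Return weight_c = n_samples / (n_classes * count_c) for every class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


weights = balanced_class_weights(train_counts)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:22s} {w:.3f}")
```

Common classes such as `dwell_idle` end up with a weight below 1 and rare classes such as `impact` with a weight above 2, so each class contributes the same total weight to the loss.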

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.

**Three columns are deliberately excluded** because they leak the target:

- `technique_id`: 62 of 63 ATT&CK techniques map 1:1 to a single phase.
  Including it gives perfect-looking metrics that mean nothing.
- `technique_name`: 1:1 alias of `technique_id` (63 unique values each).
- `tactic_category`: direct alias of `kill_chain_phase`.

**90 features survive after encoding**, drawn from:

- **Event-level numeric** (10): `timestep`, `dest_port`, `bytes_transferred`, `connection_duration_s`, `auth_failure_count`, `process_injection_flag`, `lateral_hop_count`, `c2_beacon_interval_s`, `edr_blocked_flag`, `siem_rule_triggered`
- **Event-level categorical** (7, one-hot encoded): `target_asset_type`, `source_ip_class`, `protocol`, `attacker_capability_tier`, `defender_maturity_level`, `alert_severity`, `detection_outcome`
- **Segment-level topology aggregates** (13): mean `patch_lag_days`, mean `exposure_score`, max `vulnerability_count`, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus `segment_type` and `defender_maturity_level` (segment-constant)
- **Engineered** (6): `byte_volume_log`, `has_c2_beacon`, `is_brute_forcing`, `attacker_defender_advantage`, `is_high_volume`, `is_privileged_port`

None of the engineered features is derived from phase or technique;
that would re-introduce the leakage we just excluded.
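
To make the engineered group more concrete, here is a hedged sketch of how features with these names could be derived from the raw event columns. Every formula and threshold below is an assumption chosen for illustration; the canonical definitions are the ones in `feature_engineering.py`. `attacker_defender_advantage` is left out because it would require guessing an ordinal encoding of the tier and maturity levels.

```python
import math


def engineer_features(event, brute_force_threshold=5, high_volume_bytes=1_000_000):
    """Derive illustrative engineered features from one raw event dict.

    All formulas and thresholds are assumptions, not the published recipe.
    Only observable event columns are used, never phase or technique.
    """
    return {
        "byte_volume_log": math.log1p(event["bytes_transferred"]),
        "has_c2_beacon": int(event["c2_beacon_interval_s"] > 0),
        "is_brute_forcing": int(event["auth_failure_count"] >= brute_force_threshold),
        "is_high_volume": int(event["bytes_transferred"] >= high_volume_bytes),
        "is_privileged_port": int(event["dest_port"] < 1024),
    }


# Made-up event for demonstration.
sample_event = {
    "bytes_transferred": 0,
    "c2_beacon_interval_s": 60.0,
    "auth_failure_count": 7,
    "dest_port": 22,
}
features = engineer_features(sample_event)
print(features)
```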

### Note on detection-outcome features

`detection_outcome`, `alert_severity`, `edr_blocked_flag`, and
`siem_rule_triggered` are post-hoc observables from the SOC's perspective.
They are kept as features for the realistic use case where a SOC analyst
has just seen an action and its initial detection signal and is reasoning
about which phase the campaign is in. Buyers who want a strictly
pre-detection model can drop these four columns and retrain. The ablation
results below show this **does not hurt accuracy** (the model doesn't
lean on them for phase prediction).

## Evaluation

### Test-set metrics (n = 726 events from 15 disjoint campaigns)

**XGBoost**

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.8599** |
| Accuracy | 0.4683 |
| Macro-F1 | 0.4255 |
| Weighted-F1 | 0.4407 |

**MLP**

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.8496** |
| Accuracy | 0.4449 |
| Macro-F1 | 0.3911 |
| Weighted-F1 | 0.4350 |

### Headline interpretation

Accuracy of 47% looks low at first glance, but the right comparison is:

| Baseline | Accuracy | Macro-F1 |
|---|---:|---:|
| Random uniform guess (1/10 classes) | 0.10 | ~0.10 |
| Always predict majority (`dwell_idle`) | 0.19 | n/a |
| **XGBoost (this model)** | **0.47** | **0.43** |

The macro ROC-AUC of **0.86** tells the cleaner story: the model
distinguishes the 10 phases meaningfully well even though the
argmax-prediction sometimes lands on an adjacent phase.
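
The trivial baselines can be recomputed from the test-fold class counts. The sketch below is an approximation under stated assumptions: F1 is taken as 0 for classes that are never predicted, and the uniform-guess macro-F1 uses expected precision (the class prevalence) and expected recall (1/10) rather than a simulation. The helper `f1` is hypothetical.

```python
# Test-fold class counts copied from the "Training data" table.
test_counts = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31,
    "impact": 21,
}
n_events = sum(test_counts.values())
n_classes = len(test_counts)
prevalence = {label: c / n_events for label, c in test_counts.items()}


def f1(precision, recall):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Always predict the majority class: recall is 1 and precision equals its
# prevalence; every other class is never predicted and scores 0.
majority = max(test_counts, key=test_counts.get)
majority_accuracy = prevalence[majority]
majority_macro_f1 = f1(prevalence[majority], 1.0) / n_classes

# Uniform random guess, using expected precision and recall per class.
uniform_accuracy = 1 / n_classes
uniform_macro_f1 = sum(f1(p, 1 / n_classes) for p in prevalence.values()) / n_classes

print(f"majority: acc={majority_accuracy:.2f} macro-F1={majority_macro_f1:.3f}")
print(f"uniform:  acc={uniform_accuracy:.2f} macro-F1={uniform_macro_f1:.3f}")
```

Under these assumptions the majority baseline has a macro-F1 of roughly 0.03, far below its 0.19 accuracy, which is why macro-F1 is the more demanding comparison on this class distribution.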

### Per-class F1: where the signal is and isn't

| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `reconnaissance` | **0.753** | 0.725 | Strong: early timestep, distinct protocols/targets |
| `lateral_movement` | **0.742** | 0.783 | Strong: lateral-hop count, post-privesc pattern |
| `initial_access` | **0.647** | 0.648 | Strong: perimeter targets, specific protocols |
| `privilege_escalation` | 0.500 | 0.488 | Moderate |
| `execution` | 0.441 | 0.510 | Moderate |
| `persistence` | 0.413 | 0.301 | Moderate, easily confused with execution |
| `exfiltration` | 0.273 | 0.119 | Weak: late-phase, similar to collection/impact |
| `impact` | 0.226 | 0.132 | Weak: late-phase clustering |
| `collection` | 0.220 | 0.191 | Weak: late-phase clustering |
| `dwell_idle` | 0.040 | 0.013 | Very weak: no-op steps lack distinguishing features |

The model has solid signal on **early and mid-campaign phases** and
genuinely struggles to disambiguate **late-stage objective-completion
phases** (collection / exfiltration / impact), which arrive close in
time and look similar at the event level. This is an honest limitation
of flat-tabular classification; sequence models would help here.
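
The XGBoost per-class scores can be reproduced from the confusion matrix stored under `full_model_metrics` in `ablation_results.json`. The sketch below assumes rows are true phases and columns are predicted phases, which is consistent with the row sums matching the test-fold class counts; `summarize` is a hypothetical helper.

```python
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution", "persistence",
    "privilege_escalation", "lateral_movement", "collection", "exfiltration", "impact",
]

# Copied from ablation_results.json (full model); rows = true, columns = predicted.
CONFUSION = [
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def summarize(cm, labels):
    """Compute accuracy, per-class F1, macro-F1 and weighted-F1 from a confusion matrix."""
    n_total = sum(sum(row) for row in cm)
    support = [sum(row) for row in cm]                                # true count per class
    predicted = [sum(row[j] for row in cm) for j in range(len(cm))]   # predicted count per class
    per_class = {}
    for i, label in enumerate(labels):
        denom = support[i] + predicted[i]       # equals 2*TP + FP + FN
        per_class[label] = 2 * cm[i][i] / denom if denom else 0.0
    return {
        "accuracy": sum(cm[i][i] for i in range(len(cm))) / n_total,
        "macro_f1": sum(per_class.values()) / len(labels),
        "weighted_f1": sum(per_class[l] * s for l, s in zip(labels, support)) / n_total,
        "per_class_f1": per_class,
    }


summary = summarize(CONFUSION, LABELS)
print({k: round(v, 4) for k, v in summary.items() if k != "per_class_f1"})
```

The first row shows why `dwell_idle` scores so low: only 3 of its 141 test events are predicted as `dwell_idle`, with the rest spread across nearly every other phase.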

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | Δ accuracy vs full |
|---|---:|---:|---:|
| Full feature set (published) | 0.4683 | 0.4255 | n/a |
| No `timestep` | 0.3264 | 0.3102 | **−0.1419** |
| No topology aggregates | 0.4601 | 0.4093 | −0.0083 |
| No engineered features | 0.4642 | 0.4240 | −0.0041 |
| No detection-signal features | 0.4725 | 0.4284 | **+0.0041** |

Two clear findings:

1. **`timestep` is by far the most important feature** (accuracy drops 14 pp
   when it is removed). The honest reading: kill chains progress in time, and
   where you are in the campaign timeline carries most of the phase signal.
2. **Detection-signal features (`detection_outcome`, `alert_severity`,
   `edr_blocked_flag`, `siem_rule_triggered`) do not help phase prediction.**
   Removing them actually improves accuracy marginally (+0.4 pp, about 3 of
   the 726 test events). A buyer who wants a pre-detection model can drop
   these four columns with no loss.

Topology aggregates and engineered features each contribute less than 1 pp
of accuracy (about 0.8 pp and 0.4 pp respectively).
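
It can help to read the accuracy deltas as event counts on the 726-event test fold. The sketch below backs out the implied number of correct predictions by rounding `accuracy * 726`; the dictionary keys are hypothetical labels for the table rows, and the inputs are the rounded accuracies from the table above.

```python
N_TEST = 726

# Accuracies copied from the ablation table (rounded to 4 decimals).
accuracy = {
    "full": 0.4683,
    "no_timestep": 0.3264,
    "no_topology": 0.4601,
    "no_engineered": 0.4642,
    "no_detection_signals": 0.4725,
}

# Implied number of correctly classified test events per configuration.
correct = {name: round(acc * N_TEST) for name, acc in accuracy.items()}

# Change relative to the full model, in events and in percentage points.
delta_events = {name: c - correct["full"] for name, c in correct.items() if name != "full"}
delta_pp = {name: 100 * d / N_TEST for name, d in delta_events.items()}

for name in delta_events:
    print(f"{name:22s} {delta_events[name]:+5d} events  {delta_pp[name]:+6.2f} pp")
```

Read this way, dropping `timestep` costs about a hundred correct predictions, while the other three ablations each move the count by only a handful of events.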

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `90 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.
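
For a sense of scale, the size of the MLP can be estimated from the layer widths alone. The count below assumes each linear layer has a bias and each `BatchNorm1d` has a learnable scale and shift (the PyTorch defaults); whether the published weights follow those defaults is an assumption, and `count_parameters` is a hypothetical helper.

```python
LAYER_SIZES = [90, 128, 64, 10]   # input -> hidden -> hidden -> output


def count_parameters(sizes):
    """Count trainable parameters for Linear layers plus BatchNorm1d on hidden layers."""
    total = 0
    n_hidden = len(sizes) - 2
    for i, (fan_in, fan_out) in enumerate(zip(sizes, sizes[1:])):
        total += fan_in * fan_out + fan_out    # Linear weights + bias
        if i < n_hidden:
            total += 2 * fan_out               # BatchNorm1d scale + shift
    return total


n_params = count_parameters(LAYER_SIZES)
print(n_params)
```

ReLU and Dropout add no parameters, and BatchNorm running statistics are buffers rather than trainable weights, so they are not counted.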

## Limitations

**This is a baseline reference, not a production threat detection system.**

1. **Late-phase confusion.** Per-class F1 for `collection`, `exfiltration`,
   and `impact` is 0.22–0.27. These phases arrive near campaign-end with
   similar feature signatures, and a flat-tabular event-level model can't
   easily disambiguate them. Sequence models (LSTM / transformer over the
   per-campaign event sequence) would likely improve this.

2. **`dwell_idle` is essentially unlearnable in this framing.** The
   class-balanced weights amplify rare classes; `dwell_idle` is common
   but featureless ("no action this timestep"), so the model trades
   `dwell_idle` recall for late-phase recall. F1 = 0.04. A real SOC
   pipeline would handle idle steps with a separate gating rule, not a
   classifier head.

3. **Sample-size constraints.** 100 campaigns / 4,353 events with a
   group-aware split leaves 69 training campaigns. The full 380k-event
   CYB002 product supports much more reliable per-class estimation,
   especially on the rare late-phase classes.

4. **Synthetic-vs-real transfer.** The dataset is synthetic and
   calibrated to threat-intelligence benchmark targets (Mandiant
   M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real
   attack telemetry has different noise characteristics, adversary
   adaptation, and gaps in coverage. Do not assume metrics transfer.

5. **Adversarial robustness not evaluated.** The dataset is not
   adversarially generated; the model has not been red-teamed.

6. **MLP brittleness on OOD inputs.** With ~2.8k training events, the
   MLP can produce confidently wrong predictions on hand-crafted
   records far from the training manifold. XGBoost is more robust.
   Use both; treat disagreement as a signal for human review.
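
The "treat disagreement as a signal" advice in item 6 can be sketched as a small triage rule. This is one possible reading, not a published procedure: `triage` is a hypothetical helper, the `min_confidence` threshold is an arbitrary illustrative value, and the probability vectors are made up.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Flag an event for human review when the two models disagree
    or when the XGBoost top probability is below a threshold."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    disagree = top_xgb != top_mlp
    low_confidence = proba_xgb[top_xgb] < min_confidence
    return {
        "xgb_label": labels[top_xgb],
        "mlp_label": labels[top_mlp],
        "needs_review": disagree or low_confidence,
    }


# Toy 3-class example with made-up probabilities.
PHASES = ["collection", "exfiltration", "impact"]
agreeing = triage([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], PHASES)
conflicting = triage([0.7, 0.2, 0.1], [0.2, 0.7, 0.1], PHASES)
print(agreeing["needs_review"], conflicting["needs_review"])
```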

## Notes on dataset schema

The CYB002 sample dataset README describes some fields differently from
the actual schema. The model was trained on the actual schema; this note
is to help buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| "9 ATT&CK phases" | 10 phases including `dwell_idle` (idle/no-op steps) |
| 4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state` | 4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state` |
| 5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`) | 5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust` |
| Field name `phase` | Actual column: `kill_chain_phase` |
| Field name `tactic` | Actual column: `tactic_category` |
| Field name `segment_id` | Actual column: `target_segment_id` |
| Field name `attacker_tier` | Actual column: `attacker_capability_tier` |
| Field name `defender_maturity` | Actual column: `defender_maturity_level` |
| Field name `detected`, `blocked`, `stealth_score` | Actual: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; no `stealth_score` on events |

None of this affects model correctness: `feature_engineering.py` uses the
actual column names. If you build your own pipeline against the dataset,
use the actual columns, not the README descriptions.
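
If existing code was written against the field names in the dataset README, a small rename map is one way to bridge the gap. The sketch below covers only the five one-to-one renames from the table; the detection-related fields are not a clean one-to-one mapping and are deliberately left out. `to_actual_columns` is a hypothetical helper and the sample record is made up.

```python
# One-to-one renames taken from the schema table above.
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}


def to_actual_columns(record):
    """Return a copy of the record with README-style keys renamed to actual columns."""
    return {README_TO_ACTUAL.get(key, key): value for key, value in record.items()}


# Made-up record using README-style names.
legacy_record = {"segment_id": "SEG-01", "attacker_tier": "apt", "timestep": 12}
renamed = to_actual_columns(legacy_record)
print(renamed)
```

Keys that are already correct, such as `timestep`, pass through unchanged.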

## Intended use

- **Evaluating fit** of the CYB002 dataset for your ATT&CK / kill-chain
  research
- **Baseline reference** for new model architectures (especially sequence
  models, which should beat this baseline on the late-phase classes)
- **Teaching and demo** for tabular classification on attack-event data
- **Feature engineering reference** for MITRE ATT&CK-aligned datasets

## Out-of-scope use

- Production threat detection on real network telemetry
- SOC alert triage on real systems
- Forensic attribution of real attacks
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any safety-critical or operational security decision

## Reproducibility

Outputs above were produced with `seed = 42`, group-aware nested
`GroupShuffleSplit` (70/15/15 by campaign_id), on the published sample
(`xpertsystems/cyb002-sample`, version 1.0.0, generated 2026-05-16).
The feature pipeline in `feature_engineering.py` is deterministic and
the trained weights in this repo correspond exactly to the metrics above.

The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results, which is sufficient to reproduce inference but not
training.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights |
| `model_mlp.safetensors` | PyTorch MLP weights |
| `feature_engineering.py` | Feature pipeline (load → aggregate topology → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, topology, engineered, detection-signals) |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB002** dataset contains ~454,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative threat intelligence sources (Mandiant, IBM, Verizon,
CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data
catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance &
Risk, Oil & Gas, and Materials & Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample
- 🤖 Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier

## Citation

```bibtex
@misc{xpertsystems_cyb002_baseline_2026,
  title  = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT\&CK Kill-Chain Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb002-sample}
}
```
ablation_results.json
ADDED
|
@@ -0,0 +1,804 @@
{
  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
  "full_model_metrics": {
    "model": "xgboost",
    "accuracy": 0.46831955922865015,
    "macro_f1": 0.42549880749552066,
    "weighted_f1": 0.440668872633435,
    "per_class_f1": {
      "dwell_idle": 0.040268456375838924,
      "reconnaissance": 0.7532467532467533,
      "initial_access": 0.6467661691542289,
      "execution": 0.4406779661016949,
      "persistence": 0.41304347826086957,
      "privilege_escalation": 0.5,
      "lateral_movement": 0.7422680412371134,
      "collection": 0.22018348623853212,
      "exfiltration": 0.2727272727272727,
      "impact": 0.22580645161290322
    },
    "confusion_matrix": {
      "labels": [
        "dwell_idle",
        "reconnaissance",
        "initial_access",
        "execution",
        "persistence",
        "privilege_escalation",
        "lateral_movement",
        "collection",
        "exfiltration",
        "impact"
      ],
      "matrix": [
        [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
        [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
        [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
        [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
        [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
        [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
        [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
        [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
        [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
        [0, 0, 0, 0, 2, 1, 0, 11, 0, 7]
      ]
    },
    "macro_roc_auc_ovr": 0.8598653258869782
  },
  "ablations": {
    "no_topology": {
      "n_features": 67,
      "dropped_count": 23,
      "metrics": {
        "model": "xgboost_no_topology",
        "accuracy": 0.46005509641873277,
        "macro_f1": 0.4093395066167947,
        "weighted_f1": 0.4281869072634682,
        "per_class_f1": {
          "dwell_idle": 0.013513513513513514,
          "reconnaissance": 0.7574468085106383,
          "initial_access": 0.6435643564356436,
          "execution": 0.45348837209302323,
          "persistence": 0.3829787234042553,
          "privilege_escalation": 0.4943820224719101,
          "lateral_movement": 0.72,
          "collection": 0.205607476635514,
          "exfiltration": 0.25,
          "impact": 0.1724137931034483
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [1, 24, 24, 16, 24, 16, 4, 15, 10, 7],
            [2, 89, 2, 16, 3, 0, 0, 0, 0, 0],
            [1, 6, 65, 4, 3, 26, 1, 0, 0, 0],
            [1, 4, 1, 39, 25, 3, 1, 0, 0, 0],
| 241 |
+
[
|
| 242 |
+
1,
|
| 243 |
+
0,
|
| 244 |
+
0,
|
| 245 |
+
16,
|
| 246 |
+
36,
|
| 247 |
+
9,
|
| 248 |
+
0,
|
| 249 |
+
16,
|
| 250 |
+
1,
|
| 251 |
+
0
|
| 252 |
+
],
|
| 253 |
+
[
|
| 254 |
+
0,
|
| 255 |
+
0,
|
| 256 |
+
3,
|
| 257 |
+
7,
|
| 258 |
+
4,
|
| 259 |
+
44,
|
| 260 |
+
3,
|
| 261 |
+
5,
|
| 262 |
+
2,
|
| 263 |
+
0
|
| 264 |
+
],
|
| 265 |
+
[
|
| 266 |
+
0,
|
| 267 |
+
0,
|
| 268 |
+
1,
|
| 269 |
+
0,
|
| 270 |
+
5,
|
| 271 |
+
9,
|
| 272 |
+
36,
|
| 273 |
+
2,
|
| 274 |
+
0,
|
| 275 |
+
1
|
| 276 |
+
],
|
| 277 |
+
[
|
| 278 |
+
1,
|
| 279 |
+
0,
|
| 280 |
+
0,
|
| 281 |
+
0,
|
| 282 |
+
2,
|
| 283 |
+
2,
|
| 284 |
+
1,
|
| 285 |
+
11,
|
| 286 |
+
11,
|
| 287 |
+
12
|
| 288 |
+
],
|
| 289 |
+
[
|
| 290 |
+
0,
|
| 291 |
+
0,
|
| 292 |
+
0,
|
| 293 |
+
0,
|
| 294 |
+
5,
|
| 295 |
+
0,
|
| 296 |
+
0,
|
| 297 |
+
6,
|
| 298 |
+
8,
|
| 299 |
+
12
|
| 300 |
+
],
|
| 301 |
+
[
|
| 302 |
+
0,
|
| 303 |
+
0,
|
| 304 |
+
0,
|
| 305 |
+
0,
|
| 306 |
+
2,
|
| 307 |
+
1,
|
| 308 |
+
0,
|
| 309 |
+
12,
|
| 310 |
+
1,
|
| 311 |
+
5
|
| 312 |
+
]
|
| 313 |
+
]
|
| 314 |
+
},
|
| 315 |
+
"macro_roc_auc_ovr": 0.8625474585447981
|
| 316 |
+
},
|
| 317 |
+
"delta_accuracy": 0.008264462809917383,
|
| 318 |
+
"delta_macro_f1": 0.01615930087872597
|
| 319 |
+
},
    "no_engineered": {
      "n_features": 84,
      "dropped_count": 6,
      "metrics": {
        "model": "xgboost_no_engineered",
        "accuracy": 0.4641873278236915,
        "macro_f1": 0.4239556593623024,
        "weighted_f1": 0.4373277421758876,
        "per_class_f1": {
          "dwell_idle": 0.02631578947368421,
          "reconnaissance": 0.7368421052631579,
          "initial_access": 0.6305418719211823,
          "execution": 0.46060606060606063,
          "persistence": 0.4419889502762431,
          "privilege_escalation": 0.49142857142857144,
          "lateral_movement": 0.7346938775510204,
          "collection": 0.24347826086956523,
          "exfiltration": 0.2647058823529412,
          "impact": 0.208955223880597
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [2, 23, 24, 14, 23, 20, 2, 17, 9, 7],
            [4, 84, 3, 21, 0, 0, 0, 0, 0, 0],
            [2, 5, 64, 4, 1, 29, 1, 0, 0, 0],
            [3, 4, 1, 38, 25, 2, 1, 0, 0, 0],
            [0, 0, 2, 7, 40, 9, 0, 20, 1, 0],
            [0, 0, 3, 7, 5, 43, 4, 5, 1, 0],
            [0, 0, 0, 0, 0, 3, 36, 4, 4, 7],
            [0, 0, 0, 0, 1, 0, 0, 14, 13, 12],
            [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
            [0, 0, 0, 0, 2, 1, 0, 11, 0, 7]
          ]
        },
        "macro_roc_auc_ovr": 0.8559080760692732
      },
      "delta_accuracy": 0.004132231404958664,
      "delta_macro_f1": 0.001543148133218264
    },
    "no_timestep": {
      "n_features": 89,
      "dropped_count": 1,
      "metrics": {
        "model": "xgboost_no_timestep",
        "accuracy": 0.32644628099173556,
        "macro_f1": 0.31019209599143654,
        "weighted_f1": 0.3273550154519158,
        "per_class_f1": {
          "dwell_idle": 0.06060606060606061,
          "reconnaissance": 0.3728813559322034,
          "initial_access": 0.5666666666666667,
          "execution": 0.4090909090909091,
          "persistence": 0.22818791946308725,
          "privilege_escalation": 0.4520547945205479,
          "lateral_movement": 0.7058823529411765,
          "collection": 0.0975609756097561,
          "exfiltration": 0.1836734693877551,
          "impact": 0.02531645569620253
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [5, 11, 35, 11, 17, 13, 1, 25, 15, 8],
            [7, 33, 1, 11, 11, 0, 0, 19, 17, 13],
            [5, 0, 68, 1, 2, 16, 5, 6, 3, 0],
            [3, 6, 1, 27, 4, 4, 2, 20, 2, 5],
            [2, 12, 4, 1, 17, 5, 0, 19, 6, 13],
            [0, 0, 17, 7, 2, 33, 3, 3, 2, 1],
            [0, 1, 7, 0, 2, 2, 36, 1, 0, 5],
            [0, 2, 0, 0, 6, 4, 1, 8, 12, 7],
            [1, 0, 1, 0, 7, 0, 0, 8, 9, 5],
            [1, 0, 0, 0, 2, 1, 0, 15, 1, 1]
          ]
        },
        "macro_roc_auc_ovr": 0.7557281412642529
      },
      "delta_accuracy": 0.1418732782369146,
      "delta_macro_f1": 0.11530671150408411
    },
    "no_detection_signals": {
      "n_features": 76,
      "dropped_count": 14,
      "metrics": {
        "model": "xgboost_no_detection_signals",
        "accuracy": 0.4724517906336088,
        "macro_f1": 0.4284152317167137,
        "weighted_f1": 0.4449655177644492,
        "per_class_f1": {
          "dwell_idle": 0.039735099337748346,
          "reconnaissance": 0.7456140350877193,
          "initial_access": 0.6600985221674877,
          "execution": 0.47126436781609193,
          "persistence": 0.43333333333333335,
          "privilege_escalation": 0.4971751412429379,
          "lateral_movement": 0.7272727272727273,
          "collection": 0.21818181818181817,
          "exfiltration": 0.2727272727272727,
          "impact": 0.21875
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [3, 23, 23, 18, 22, 17, 3, 16, 9, 7],
            [2, 85, 3, 22, 0, 0, 0, 0, 0, 0],
            [1, 5, 67, 2, 2, 28, 1, 0, 0, 0],
            [2, 3, 1, 41, 23, 3, 1, 0, 0, 0],
            [0, 0, 1, 9, 39, 9, 0, 19, 1, 1],
            [0, 0, 2, 8, 3, 44, 4, 6, 1, 0],
            [1, 0, 0, 0, 3, 6, 36, 2, 0, 6],
            [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
            [1, 0, 0, 0, 5, 0, 0, 4, 9, 12],
            [0, 0, 0, 0, 2, 1, 0, 11, 0, 7]
          ]
        },
        "macro_roc_auc_ovr": 0.8544378745036634
      },
      "delta_accuracy": -0.004132231404958664,
      "delta_macro_f1": -0.002916424221193037
    }
  }
}
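The stored `accuracy`, `per_class_f1` and `macro_f1` values can be re-derived from a confusion matrix, which is a quick sanity check on the file. The sketch below does this for the `no_topology` matrix. It assumes rows are true labels and columns are predictions (the scikit-learn convention; the JSON does not state the orientation, and F1 is the same either way). The helper name `metrics_from_confusion` is hypothetical and not part of this repository.

```python
import numpy as np

# Confusion matrix copied from ablations.no_topology.
# Assumption: rows are true labels, columns are predicted labels.
NO_TOPOLOGY = np.array([
    [1, 24, 24, 16, 24, 16, 4, 15, 10, 7],
    [2, 89, 2, 16, 3, 0, 0, 0, 0, 0],
    [1, 6, 65, 4, 3, 26, 1, 0, 0, 0],
    [1, 4, 1, 39, 25, 3, 1, 0, 0, 0],
    [1, 0, 0, 16, 36, 9, 0, 16, 1, 0],
    [0, 0, 3, 7, 4, 44, 3, 5, 2, 0],
    [0, 0, 1, 0, 5, 9, 36, 2, 0, 1],
    [1, 0, 0, 0, 2, 2, 1, 11, 11, 12],
    [0, 0, 0, 0, 5, 0, 0, 6, 8, 12],
    [0, 0, 0, 0, 2, 1, 0, 12, 1, 5],
])


def metrics_from_confusion(cm):
    """Return (accuracy, per-class F1 array, macro F1) for a square matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    row_sums = cm.sum(axis=1)  # TP + FN per class
    col_sums = cm.sum(axis=0)  # TP + FP per class
    denom = row_sums + col_sums
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum)
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, f1, f1.mean()


accuracy, per_class_f1, macro_f1 = metrics_from_confusion(NO_TOPOLOGY)
print(f"accuracy={accuracy:.6f} macro_f1={macro_f1:.6f}")
```

The diagonal sums to 334 of 726 test rows, which agrees with the stored `accuracy`. Adding each ablation's `delta_accuracy` to its `accuracy` gives the same value (about 0.4683) in all four cases, so the `delta_*` fields appear to be the full-feature score minus the ablated score: a positive delta means removing that feature group hurt.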
|
feature_engineering.py
ADDED
|
@@ -0,0 +1,394 @@
"""
feature_engineering.py
======================

Feature pipeline for the CYB002 baseline classifier.

Predicts `kill_chain_phase` (10-class) from event + segment-level
observables on the CYB002 sample dataset.

CSV inputs:
    attack_events.csv       (primary, one row per timestep-level action)
    network_topology.csv    (asset-level inventory; aggregated to segment
                             level before joining on target_segment_id)
    campaign_summary.csv    (reserved for future work, not used in v1)
    campaign_events.csv     (reserved for future work, not used in v1)

Target classes:
    dwell_idle, reconnaissance, initial_access, execution, persistence,
    privilege_escalation, lateral_movement, collection, exfiltration, impact

This corresponds to the README's first listed use case: predicting the
next ATT&CK phase from observable features. The challenge is that three
fields determine phase almost perfectly by construction:

  - technique_id     -> 62 of 63 techniques map 1:1 to a single phase
  - technique_name   -> 1:1 with technique_id
  - tactic_category  -> direct alias of phase

These are dropped before feature assembly. Phase is predicted from:
timestep position (recon mean=6, impact mean=66), target asset type,
protocol/port, byte volumes, connection duration, auth-failure count,
process-injection / lateral-hop counts, attacker tier vs defender
maturity, and segment-level topology aggregates.

Public API
----------
    build_features(attack_events_path, topology_path,
                   campaign_summary_path=None) -> (X, y, groups, meta)
    transform_single(record, meta, segment_aggregates=None) -> np.ndarray
    save_meta(meta, path) / load_meta(path)
    build_segment_lookup(topology_path) -> dict

License
-------
Ships with the public model on Hugging Face under CC-BY-NC-4.0, matching
the dataset license. See README.md.
"""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

# ---------------------------------------------------------------------------
# Label space
# ---------------------------------------------------------------------------

# The 10 phases observed in the sample. dwell_idle is a no-op step
# between actions; technique_id=T0000, tactic_category=NaN. Ordering
# follows tactic flow for readability; CE-loss doesn't care.
LABEL_ORDER = [
    "dwell_idle",
    "reconnaissance",
    "initial_access",
    "execution",
    "persistence",
    "privilege_escalation",
    "lateral_movement",
    "collection",
    "exfiltration",
    "impact",
]
LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}

# ---------------------------------------------------------------------------
# Columns dropped because they leak the target (kill_chain_phase)
# ---------------------------------------------------------------------------

# `technique_id`: 62 of 63 ATT&CK techniques map 1:1 to a single phase.
#   T1078 Valid Accounts is the one shared technique (appears in both
#   initial_access and persistence, which is correct ATT&CK behavior).
#   Including technique_id as a feature is effectively label memorization.
#
# `technique_name`: 1:1 alias of technique_id (63 unique values each).
#
# `tactic_category`: direct alias of kill_chain_phase; the two columns
#   carry identical information except tactic_category is null for
#   dwell_idle steps. Drop.
LEAKY_COLUMNS = [
    "technique_id",
    "technique_name",
    "tactic_category",
]

# ---------------------------------------------------------------------------
# Columns kept as features
# ---------------------------------------------------------------------------

DIRECT_NUMERIC_EVENT_FEATURES = [
    "timestep",              # strong signal: recon mean=6, impact mean=66
    "dest_port",
    "bytes_transferred",
    "connection_duration_s",
    "auth_failure_count",
    "process_injection_flag",
    "lateral_hop_count",
    "c2_beacon_interval_s",  # null-aware; filled with -1 + has_c2_beacon flag
    # Detection-related fields. These are POST-HOC observables from the
    # SOC's perspective. We keep them as features because in the realistic
    # phase-prediction use case, a SOC analyst has just seen an action and
    # its initial detection outcome, and is trying to reason about which
    # phase the campaign is in. Buyers who want a strictly pre-detection
    # model can drop these four columns and retrain.
    "edr_blocked_flag",
    "siem_rule_triggered",
]

CATEGORICAL_EVENT_FEATURES = [
    "target_asset_type",
    "source_ip_class",
    "protocol",
    "attacker_capability_tier",
    "defender_maturity_level",
    "alert_severity",     # critical / high / medium / low / informational
    "detection_outcome",  # see note above re: post-hoc observables
]

ID_COLUMNS = ["campaign_id", "attacker_id"]

# ---------------------------------------------------------------------------
# Topology aggregation
# ---------------------------------------------------------------------------
#
# network_topology.csv is ASSET-LEVEL (651 rows, 12 segments, ~54 assets
# per segment). Direct join would explode rows. Aggregate to segment level:
# constant fields as-is, numeric fields mean/max as appropriate, 0/1 flags
# as fraction-with-coverage.

SEGMENT_CONSTANT_TOPO_COLS = ["segment_type", "defender_maturity_level"]
SEGMENT_NUMERIC_AGGREGATES = {
    "patch_lag_days": "mean",
    "exposure_score": "mean",
    "vulnerability_count": "max",  # worst-case asset matters more
    "inter_segment_trust_level": "mean",
    "alert_threshold_sensitivity": "mean",
    "mttd_baseline_hours": "mean",
    "mttr_baseline_hours": "mean",
    "siem_coverage_flag": "mean",  # fraction with SIEM
    "edr_deployed_flag": "mean",   # fraction with EDR
    "ndr_coverage_flag": "mean",
    "mfa_enforced_flag": "mean",
}


def _aggregate_topology(topology: pd.DataFrame) -> pd.DataFrame:
    """Collapse asset-level topology to one row per segment."""
    parts = []
    for col in SEGMENT_CONSTANT_TOPO_COLS:
        parts.append(topology.groupby("segment_id")[col].first().rename(f"seg_{col}"))
    for col, agg in SEGMENT_NUMERIC_AGGREGATES.items():
        parts.append(topology.groupby("segment_id")[col].agg(agg).rename(f"seg_{col}_{agg}"))
    return pd.concat(parts, axis=1).reset_index()


TOPOLOGY_FEATURE_NAMES_NUMERIC = [
    f"seg_{col}_{agg}" for col, agg in SEGMENT_NUMERIC_AGGREGATES.items()
]
TOPOLOGY_FEATURE_NAMES_CATEGORICAL = [f"seg_{col}" for col in SEGMENT_CONSTANT_TOPO_COLS]


# ---------------------------------------------------------------------------
# Engineered features
# ---------------------------------------------------------------------------
#
# Important: NO phase-derived engineered features. is_dwell_idle,
# is_high_severity_phase, phase_order_index would all be oracles when
# phase is the target. Six features instead, each a stated hypothesis
# about phase-discriminative signal in pre-phase observables.

TIER_RANK = {"script_kiddie": 1, "opportunistic": 2, "apt": 3, "nation_state": 4}
DEFENDER_RANK = {"minimal": 1, "baseline": 2, "managed": 3, "advanced": 4, "zero_trust": 5}


def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Six engineered features, no phase-derived oracles."""
    df = df.copy()

    # 1. Byte volume on log scale. Heavy-tailed across phases: recon
    #    transfers tend to be bytes; exfiltration megabytes. log1p tames
    #    the tail and gives both XGBoost and the MLP a usable feature.
    df["byte_volume_log"] = np.log1p(df["bytes_transferred"].clip(lower=0)).astype(float)

    # 2. C2 beacon presence. c2_beacon_interval_s is null for non-C2
    #    actions. Encode presence as a binary flag and fill the value
    #    column with -1 so it stays usable.
    df["has_c2_beacon"] = df["c2_beacon_interval_s"].notna().astype(int)
    df["c2_beacon_interval_s"] = df["c2_beacon_interval_s"].fillna(-1.0)

    # 3. Brute-force indicator. auth_failure_count > 0 distinguishes
    #    credential-stuffing style actions from authenticated-path
    #    actions; loads differently into early phases.
    df["is_brute_forcing"] = (df["auth_failure_count"] > 0).astype(int)

    # 4. Attacker vs defender advantage. Positive when attacker outclasses
    #    defender; influences which phases an attacker can reach.
    tier_r = df["attacker_capability_tier"].map(TIER_RANK).fillna(2).astype(int)
    def_r = df["defender_maturity_level"].map(DEFENDER_RANK).fillna(2).astype(int)
    df["attacker_defender_advantage"] = (tier_r - def_r).astype(int)

    # 5. High-volume action indicator. Simple binary above 100 KB,
    #    correlates with collection / exfiltration phases.
    df["is_high_volume"] = (df["bytes_transferred"] > 100_000).astype(int)

    # 6. Privileged-port indicator. dest_port < 1024, typically system
    #    services; common in initial-access and lateral-movement actions.
    df["is_privileged_port"] = (df["dest_port"] < 1024).astype(int)

    return df


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------

def build_features(
    attack_events_path: str | Path,
    topology_path: str | Path,
    campaign_summary_path: str | Path | None = None,
) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
    """
    Load CSVs, aggregate topology, drop leaky columns, engineer features,
    one-hot encode, return (X, y, groups, meta).

    `groups` is a Series of campaign_id values aligned with X for
    GroupShuffleSplit / GroupKFold use. A single campaign generates ~40
    correlated events; row-level random splitting inflates metrics.

    `campaign_summary_path` is accepted but not read in v1
    (campaign_summary.csv is reserved for future work).
    """
    events = pd.read_csv(attack_events_path)
    topology = pd.read_csv(topology_path)

    events = events.drop(columns=LEAKY_COLUMNS, errors="ignore")

    topo_agg = _aggregate_topology(topology)
    events = events.merge(
        topo_agg, left_on="target_segment_id", right_on="segment_id", how="left",
    ).drop(columns=["segment_id"], errors="ignore")

    y = events["kill_chain_phase"].map(LABEL_TO_INT)
    if y.isna().any():
        bad = events.loc[y.isna(), "kill_chain_phase"].unique()
        raise ValueError(f"Unknown kill_chain_phase values: {bad}")
    y = y.astype(int)
    groups = events["campaign_id"].copy()

    events = _add_engineered_features(events)

    numeric_features = (
        DIRECT_NUMERIC_EVENT_FEATURES
        + TOPOLOGY_FEATURE_NAMES_NUMERIC
        + [
            "byte_volume_log", "has_c2_beacon", "is_brute_forcing",
            "attacker_defender_advantage", "is_high_volume",
            "is_privileged_port",
        ]
    )
    X_numeric = events[numeric_features].astype(float)

    all_categorical = (
        [(col, "event") for col in CATEGORICAL_EVENT_FEATURES]
        + [(col, "topology") for col in TOPOLOGY_FEATURE_NAMES_CATEGORICAL]
    )
    categorical_levels: dict[str, list[str]] = {}
    blocks: list[pd.DataFrame] = []
    for col, _src in all_categorical:
        levels = sorted(events[col].dropna().unique().tolist())
        categorical_levels[col] = levels
        block = pd.get_dummies(
            events[col].astype("category").cat.set_categories(levels),
            prefix=col, dummy_na=False,
        ).astype(int)
        blocks.append(block)

    X = pd.concat(
        [X_numeric.reset_index(drop=True)]
        + [b.reset_index(drop=True) for b in blocks],
        axis=1,
    ).fillna(0.0)

    meta = {
        "feature_names": X.columns.tolist(),
        "numeric_features": numeric_features,
        "categorical_levels": categorical_levels,
        "label_to_int": LABEL_TO_INT,
        "int_to_label": INT_TO_LABEL,
        "topology_aggregation": {
            "segment_constant": SEGMENT_CONSTANT_TOPO_COLS,
            "segment_numeric_aggregates": SEGMENT_NUMERIC_AGGREGATES,
        },
    }
    return X, y, groups, meta


def transform_single(
    record: dict | pd.DataFrame,
    meta: dict[str, Any],
    segment_aggregates: dict | None = None,
) -> np.ndarray:
    """Encode a single event record for inference.

    `record` must contain event-level fields (sans leaky columns) plus
    the segment-level aggregate fields. If you only have the raw event,
    pass `segment_aggregates` as a dict {seg_*: value, ...} and they'll
    be merged in.
    """
    if isinstance(record, dict):
        df = pd.DataFrame([record.copy()])
    else:
        df = record.copy()

    if segment_aggregates is not None:
        for k, v in segment_aggregates.items():
            df[k] = v

    df = _add_engineered_features(df)

    numeric = pd.DataFrame({
        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
        for col in meta["numeric_features"]
    })
    blocks: list[pd.DataFrame] = [numeric]
    for col, levels in meta["categorical_levels"].items():
        val = df.get(col, pd.Series([None] * len(df)))
        block = pd.get_dummies(
            val.astype("category").cat.set_categories(levels),
            prefix=col, dummy_na=False,
        ).astype(int)
        for lvl in levels:
            cname = f"{col}_{lvl}"
            if cname not in block.columns:
                block[cname] = 0
        block = block[[f"{col}_{lvl}" for lvl in levels]]
        blocks.append(block)

    X = pd.concat(blocks, axis=1).fillna(0.0)
    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
    return X.values.astype(np.float32)


def save_meta(meta: dict[str, Any], path: str | Path) -> None:
    """Write `meta` to `path` as JSON (int_to_label keys become strings)."""
    serializable = {
        "feature_names": meta["feature_names"],
        "numeric_features": meta["numeric_features"],
        "categorical_levels": meta["categorical_levels"],
        "label_to_int": meta["label_to_int"],
        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
        "topology_aggregation": meta["topology_aggregation"],
    }
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)


def load_meta(path: str | Path) -> dict[str, Any]:
    """Read `meta` from JSON and restore the integer keys of int_to_label."""
    with open(path) as f:
        meta = json.load(f)
    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
    return meta


def build_segment_lookup(topology_path: str | Path) -> dict[str, dict]:
    """Build a {segment_id: {seg_* feature values}} lookup for inference."""
    topology = pd.read_csv(topology_path)
    agg = _aggregate_topology(topology)
    return {row["segment_id"]: {k: v for k, v in row.items() if k != "segment_id"}
            for _, row in agg.iterrows()}


if __name__ == "__main__":
    import sys
    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
    X, y, groups, meta = build_features(
        base / "attack_events.csv",
        base / "network_topology.csv",
    )
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")
    print(f"groups: {groups.nunique()} campaigns")
    print(f"n features: {len(meta['feature_names'])}")
    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
    print(f"X has NaN: {X.isnull().any().any()}")
|
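The categorical handling in `transform_single` depends on pinning each column to the level list stored in the metadata, so that unseen or missing values become an all-zero block instead of a new column. The following is a minimal standalone sketch of that idea, not the module's own code: the helper name `one_hot_fixed` and the toy level list are assumptions made for illustration.

```python
import pandas as pd


def one_hot_fixed(values: pd.Series, col: str, levels: list[str]) -> pd.DataFrame:
    """One-hot encode `values` against a fixed level list (hypothetical helper)."""
    # Values outside `levels` (and None) become NaN under the fixed dtype.
    pinned = values.astype(pd.CategoricalDtype(categories=levels))
    block = pd.get_dummies(pinned, prefix=col, dummy_na=False).astype(int)
    # Guarantee every expected column exists, in the stored order.
    return block.reindex(columns=[f"{col}_{lvl}" for lvl in levels], fill_value=0)


# Toy levels for illustration only; the real lists live in feature_meta.json.
levels = ["dns", "http", "ssh"]
encoded = one_hot_fixed(pd.Series(["ssh", "quic", None]), "protocol", levels)
print(encoded)
```

Here `"quic"` and `None` both encode as all zeros, which is the behavior the pipeline appears to rely on for out-of-vocabulary inputs.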
feature_meta.json
ADDED
|
@@ -0,0 +1,249 @@
{
  "feature_names": [
    "timestep",
    "dest_port",
    "bytes_transferred",
    "connection_duration_s",
    "auth_failure_count",
    "process_injection_flag",
    "lateral_hop_count",
    "c2_beacon_interval_s",
    "edr_blocked_flag",
    "siem_rule_triggered",
    "seg_patch_lag_days_mean",
    "seg_exposure_score_mean",
    "seg_vulnerability_count_max",
    "seg_inter_segment_trust_level_mean",
    "seg_alert_threshold_sensitivity_mean",
    "seg_mttd_baseline_hours_mean",
    "seg_mttr_baseline_hours_mean",
    "seg_siem_coverage_flag_mean",
    "seg_edr_deployed_flag_mean",
    "seg_ndr_coverage_flag_mean",
    "seg_mfa_enforced_flag_mean",
    "byte_volume_log",
    "has_c2_beacon",
    "is_brute_forcing",
    "attacker_defender_advantage",
    "is_high_volume",
    "is_privileged_port",
    "target_asset_type_backup_system",
    "target_asset_type_cloud_vm",
    "target_asset_type_container",
    "target_asset_type_database_server",
    "target_asset_type_domain_controller",
    "target_asset_type_ehr_system",
    "target_asset_type_email_server",
    "target_asset_type_firewall",
    "target_asset_type_iot_device",
    "target_asset_type_router",
    "target_asset_type_scada_plc",
    "target_asset_type_server",
    "target_asset_type_vpn_gateway",
    "target_asset_type_web_server",
    "target_asset_type_workstation",
    "source_ip_class_cloud_egress",
    "source_ip_class_external_internet",
    "source_ip_class_internal_lan",
    "source_ip_class_tor_exit",
    "source_ip_class_vpn_tunnel",
    "protocol_dns",
    "protocol_ftp",
    "protocol_http",
    "protocol_https",
    "protocol_icmp",
    "protocol_rdp",
    "protocol_smb",
    "protocol_ssh",
    "protocol_tcp",
    "protocol_udp",
    "attacker_capability_tier_apt",
    "attacker_capability_tier_nation_state",
    "attacker_capability_tier_opportunistic",
    "attacker_capability_tier_script_kiddie",
    "defender_maturity_level_advanced",
    "defender_maturity_level_baseline",
    "defender_maturity_level_managed",
    "defender_maturity_level_minimal",
    "defender_maturity_level_zero_trust",
    "alert_severity_critical",
    "alert_severity_high",
    "alert_severity_informational",
    "alert_severity_low",
    "alert_severity_medium",
    "detection_outcome_blind_spot",
    "detection_outcome_edr_blocked",
    "detection_outcome_evasion_success",
    "detection_outcome_high_confidence_alert",
    "detection_outcome_ir_escalated",
    "detection_outcome_marginal_alert",
    "detection_outcome_suppressed_alert",
    "seg_segment_type_cloud_workload",
    "seg_segment_type_corporate_lan",
    "seg_segment_type_data_exfiltration_target",
    "seg_segment_type_endpoint_fleet",
    "seg_segment_type_soc_management_plane",
    "seg_segment_type_supply_chain_interface",
    "seg_segment_type_zero_trust_segment",
    "seg_defender_maturity_level_advanced",
    "seg_defender_maturity_level_baseline",
    "seg_defender_maturity_level_managed",
    "seg_defender_maturity_level_minimal",
    "seg_defender_maturity_level_zero_trust"
  ],
  "numeric_features": [
    "timestep",
    "dest_port",
    "bytes_transferred",
    "connection_duration_s",
    "auth_failure_count",
    "process_injection_flag",
    "lateral_hop_count",
    "c2_beacon_interval_s",
    "edr_blocked_flag",
    "siem_rule_triggered",
    "seg_patch_lag_days_mean",
    "seg_exposure_score_mean",
    "seg_vulnerability_count_max",
    "seg_inter_segment_trust_level_mean",
    "seg_alert_threshold_sensitivity_mean",
    "seg_mttd_baseline_hours_mean",
    "seg_mttr_baseline_hours_mean",
    "seg_siem_coverage_flag_mean",
    "seg_edr_deployed_flag_mean",
    "seg_ndr_coverage_flag_mean",
    "seg_mfa_enforced_flag_mean",
    "byte_volume_log",
    "has_c2_beacon",
    "is_brute_forcing",
    "attacker_defender_advantage",
    "is_high_volume",
    "is_privileged_port"
  ],
  "categorical_levels": {
    "target_asset_type": [
      "backup_system",
      "cloud_vm",
      "container",
      "database_server",
      "domain_controller",
      "ehr_system",
      "email_server",
      "firewall",
      "iot_device",
      "router",
      "scada_plc",
      "server",
      "vpn_gateway",
      "web_server",
      "workstation"
    ],
    "source_ip_class": [
      "cloud_egress",
      "external_internet",
      "internal_lan",
      "tor_exit",
      "vpn_tunnel"
    ],
    "protocol": [
      "dns",
      "ftp",
      "http",
      "https",
      "icmp",
      "rdp",
      "smb",
      "ssh",
      "tcp",
      "udp"
    ],
    "attacker_capability_tier": [
      "apt",
      "nation_state",
      "opportunistic",
      "script_kiddie"
    ],
    "defender_maturity_level": [
      "advanced",
      "baseline",
      "managed",
      "minimal",
      "zero_trust"
    ],
    "alert_severity": [
      "critical",
      "high",
      "informational",
      "low",
      "medium"
    ],
    "detection_outcome": [
      "blind_spot",
      "edr_blocked",
      "evasion_success",
      "high_confidence_alert",
      "ir_escalated",
      "marginal_alert",
      "suppressed_alert"
    ],
    "seg_segment_type": [
      "cloud_workload",
      "corporate_lan",
      "data_exfiltration_target",
      "endpoint_fleet",
      "soc_management_plane",
      "supply_chain_interface",
      "zero_trust_segment"
    ],
    "seg_defender_maturity_level": [
      "advanced",
      "baseline",
      "managed",
      "minimal",
      "zero_trust"
    ]
  },
  "label_to_int": {
    "dwell_idle": 0,
    "reconnaissance": 1,
    "initial_access": 2,
    "execution": 3,
    "persistence": 4,
    "privilege_escalation": 5,
    "lateral_movement": 6,
    "collection": 7,
    "exfiltration": 8,
    "impact": 9
  },
  "int_to_label": {
    "0": "dwell_idle",
    "1": "reconnaissance",
    "2": "initial_access",
    "3": "execution",
    "4": "persistence",
    "5": "privilege_escalation",
    "6": "lateral_movement",
    "7": "collection",
    "8": "exfiltration",
    "9": "impact"
  },
  "topology_aggregation": {
    "segment_constant": [
      "segment_type",
      "defender_maturity_level"
    ],
    "segment_numeric_aggregates": {
      "patch_lag_days": "mean",
      "exposure_score": "mean",
      "vulnerability_count": "max",
      "inter_segment_trust_level": "mean",
      "alert_threshold_sensitivity": "mean",
      "mttd_baseline_hours": "mean",
      "mttr_baseline_hours": "mean",
      "siem_coverage_flag": "mean",
      "edr_deployed_flag": "mean",
      "ndr_coverage_flag": "mean",
      "mfa_enforced_flag": "mean"
    }
  }
}
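The layout of `feature_meta.json` suggests an invariant: `feature_names` is the numeric features followed by one `{column}_{level}` entry per categorical level, in stored order. The sketch below checks that on a toy dict. The helper name `expected_feature_names` and the toy values are assumptions; only the key names are taken from the file. It also shows why `int_to_label` needs its keys converted on load.

```python
import json


def expected_feature_names(meta: dict) -> list[str]:
    """Rebuild the column order implied by the metadata (hypothetical helper)."""
    names = list(meta["numeric_features"])
    for col, levels in meta["categorical_levels"].items():
        names.extend(f"{col}_{lvl}" for lvl in levels)
    return names


# Toy metadata with the same key names as feature_meta.json.
toy_meta = {
    "feature_names": ["timestep", "dest_port", "protocol_dns", "protocol_ssh"],
    "numeric_features": ["timestep", "dest_port"],
    "categorical_levels": {"protocol": ["dns", "ssh"]},
}
layout_ok = expected_feature_names(toy_meta) == toy_meta["feature_names"]

# JSON object keys are always strings, so integer keys do not survive a
# round trip unchanged; this is why the loader casts them back to int.
round_tripped = json.loads(json.dumps({0: "dwell_idle"}))
restored = {int(k): v for k, v in round_tripped.items()}
print(layout_ok, round_tripped, restored)
```

Running the same check against the real file after `load_meta` would be a cheap guard against a metadata file and a model that have drifted apart.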
feature_scaler.json
ADDED
|
@@ -0,0 +1 @@
{"mean": [29.669737774627922, 2374.859673990078, 94190.43841601702, 14.909633238837705, 1.3384124734231042, 0.13040396881644223, 0.05705173635719348, 0.0, 0.3621545003543586, 0.166194188518781, 34.396196846241345, 0.512852316745022, 14.379518072289157, 0.392728801495667, 0.7238749335681469, 6.124241842889212, 36.93126845133998, 0.6976009715267184, 0.8059781553368865, 0.4883178731877128, 0.6477277267624112, 9.540027804902557, 1.0, 0.5510276399716513, -0.010276399716513111, 0.16725726435152374, 0.6463501063075833, 0.0627214741318214, 0.06909992912827782, 0.06591070163004961, 0.0705173635719348, 0.07122608079376329, 0.06520198440822111, 0.0673281360737066, 0.06520198440822111, 0.07299787384833452, 0.058114812189936214, 0.06945428773919206, 0.06130403968816442, 0.057406094968107724, 0.07973068745570518, 0.06378454996456413, 0.19383416017009214, 0.20233876683203403, 0.20411055988660523, 0.2147413182140326, 0.184975194897236, 0.10063784549964565, 0.09815733522324592, 0.10311835577604536, 0.09780297661233169, 0.09319631467044649, 0.09886605244507442, 0.10099220411055988, 0.10276399716513111, 0.09673990077958894, 0.10772501771793054, 0.2271438695960312, 0.4875974486180014, 0.22749822820694543, 0.05776045357902197, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376, 0.43834160170092135, 0.06413890857547838, 0.4043231750531538, 0.0673281360737066, 0.0258681785967399, 0.059532246633593196, 0.3621545003543586, 0.3447909284195606, 0.10701630049610206, 0.03330970942593905, 0.0258681785967399, 0.0673281360737066, 0.04642097802976612, 0.23954642097802978, 0.09603118355776046, 0.22041105598866054, 0.08788093550673282, 0.22395464209780297, 0.08575478384124734, 0.4135364989369242, 0.2664776754075124, 0.2147413182140326, 0.050673281360737066, 0.05457122608079376], "std": [21.611718068894575, 3262.3953544252254, 493540.4889491936, 26.882083698757928, 1.7063611088856259, 0.33680702458505324, 0.23198255508867544, 1.0, 0.4807083352654771, 
0.3723208326229761, 27.918565668886338, 0.16437622036073063, 6.809572056022862, 0.031089587614791407, 0.16380824644388434, 3.278380945942728, 19.765170276913693, 0.1795819790728066, 0.06230034648225459, 0.1392601592567418, 0.14782851174966183, 1.9732855896589672, 1.0, 0.49747751507194377, 1.14101486329445, 0.3732715435730835, 0.47818686198141197, 0.2425042887305301, 0.25366894008973206, 0.2481699123323166, 0.2560622962417658, 0.2572476946006164, 0.2469256805073951, 0.2506338325607766, 0.2469256805073951, 0.2601791150733033, 0.23400188966661606, 0.25427013215761435, 0.23992968449882038, 0.2326581539410086, 0.2709238172522079, 0.24441204871928013, 0.3953705491117327, 0.4018146379110633, 0.40312160075220743, 0.4107155466365413, 0.38834625527902134, 0.30090190074813306, 0.29757999359048065, 0.3041672976137825, 0.29710071221594014, 0.29075886802325146, 0.2985349856721442, 0.3013718026414007, 0.303704202734562, 0.2956556572543326, 0.3100877479347836, 0.41906057037245537, 0.49993473898865376, 0.4192911666305911, 0.23333125829157636, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016, 0.49627161445350315, 0.24504364292201097, 0.49084755398842195, 0.2506338325607766, 0.15877011238413913, 0.2366601047140391, 0.4807083352654771, 0.4753842926323826, 0.30918875753830577, 0.17947586784069838, 0.15877011238413913, 0.2506338325607766, 0.21043232273668783, 0.426882310965684, 0.2946862192784542, 0.4145973147754568, 0.2831718407328375, 0.4169659091226798, 0.28005123240617036, 0.4925545999567315, 0.44219522159317903, 0.4107155466365413, 0.21936853137528786, 0.22718163733670016]}
|
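`feature_scaler.json` holds one `mean` and one `std` entry per feature, applied to MLP inputs as `(X - mean) / std`. Some entries pair a constant-looking mean with a `std` of exactly 1.0, which is consistent with zero-variance columns having been guarded at training time, though the training script is not shown here. A minimal sketch of applying such a scaler defensively follows; the function name `standardize` and the toy numbers are assumptions.

```python
import numpy as np


def standardize(X: np.ndarray, mean, std) -> np.ndarray:
    """Apply per-feature standardization with a zero-std guard (illustrative)."""
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    if mean.shape != std.shape or X.shape[1] != mean.shape[0]:
        raise ValueError("scaler length does not match the feature count")
    # Avoid division by zero for constant columns.
    safe_std = np.where(std == 0, 1.0, std)
    return ((X - mean) / safe_std).astype(np.float32)


# Toy scaler: the second feature is constant, so its std is stored as 1.0.
toy_scaler = {"mean": [1.0, 0.0], "std": [2.0, 1.0]}
scaled = standardize(np.array([[3.0, 0.0]], dtype=np.float32), **toy_scaler)
print(scaled)
```

The shape check matters because a scaler and a feature list of different lengths would otherwise broadcast or fail in less obvious ways.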
inference_example.ipynb
ADDED
|
@@ -0,0 +1,343 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CYB002 Baseline Classifier: Inference Example\n",
    "\n",
    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **MITRE ATT&CK kill-chain phase** of a new attack-event record.\n",
    "\n",
    "**Models predict one of 10 phases:** `dwell_idle`, `reconnaissance`, `initial_access`, `execution`, `persistence`, `privilege_escalation`, `lateral_movement`, `collection`, `exfiltration`, `impact`.\n",
    "\n",
    "**This is a baseline reference model**, not a production threat detector. See the model card for full metrics and limitations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download model artifacts from Hugging Face\n",
    "\n",
    "Five files are needed:\n",
    "- `model_xgb.json`: XGBoost weights\n",
    "- `model_mlp.safetensors`: PyTorch MLP weights\n",
    "- `feature_engineering.py`: feature pipeline (must match the one used at training)\n",
    "- `feature_meta.json`: feature column order + categorical levels\n",
    "- `feature_scaler.json`: MLP input standardization (mean / std)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "REPO_ID = \"xpertsystems/cyb002-baseline-classifier\"\n",
    "\n",
    "files = {}\n",
    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
    "             \"feature_scaler.json\"]:\n",
    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
    "    print(f\" downloaded: {name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make feature_engineering.py importable\n",
    "import sys, os\n",
    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
    "if fe_dir not in sys.path:\n",
    "    sys.path.insert(0, fe_dir)\n",
    "\n",
    "from feature_engineering import (\n",
    "    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Load models and metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import xgboost as xgb\n",
    "from safetensors.torch import load_file\n",
    "\n",
    "meta = load_meta(files[\"feature_meta.json\"])\n",
    "with open(files[\"feature_scaler.json\"]) as f:\n",
    "    scaler = json.load(f)\n",
    "\n",
    "N_FEATURES = len(meta[\"feature_names\"])\n",
    "N_CLASSES = len(meta[\"int_to_label\"])\n",
    "print(f\"feature count: {N_FEATURES}\")\n",
    "print(f\"class count: {N_CLASSES}\")\n",
    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost\n",
    "xgb_model = xgb.XGBClassifier()\n",
    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
    "\n",
    "# MLP architecture (must match training)\n",
    "class PhaseMLP(nn.Module):\n",
    "    def __init__(self, n_features, n_classes=10, hidden1=128, hidden2=64, dropout=0.3):\n",
    "        super().__init__()\n",
    "        self.net = nn.Sequential(\n",
    "            nn.Linear(n_features, hidden1),\n",
    "            nn.BatchNorm1d(hidden1),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden1, hidden2),\n",
    "            nn.BatchNorm1d(hidden2),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden2, n_classes),\n",
    "        )\n",
    "    def forward(self, x):\n",
    "        return self.net(x)\n",
    "\n",
    "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
    "mlp_model.eval()\n",
    "print(\"models loaded\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Build segment-aggregate lookup from the dataset\n",
    "\n",
    "Per-segment topology aggregates (mean exposure, fraction with EDR, etc.) are computed at training time and must be available at inference time too. The helper `build_segment_lookup` pulls them from `network_topology.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from huggingface_hub import snapshot_download\n",
    "\n",
    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb002-sample\", repo_type=\"dataset\")\n",
    "\n",
    "segment_aggregates_lookup = build_segment_lookup(\n",
    "    os.path.join(ds_path, \"network_topology.csv\")\n",
    ")\n",
    "print(f\"loaded {len(segment_aggregates_lookup)} segment aggregates\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Prediction helper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
    "SD = np.array(scaler[\"std\"], dtype=np.float32)\n",
    "\n",
    "def predict_phase(record: dict) -> dict:\n",
    "    \"\"\"Predict the kill-chain phase for one event record.\n",
    "\n",
    "    `record` is a dict with event-level fields. Segment-level aggregates\n",
    "    are pulled automatically from `segment_aggregates_lookup` using the\n",
    "    `target_segment_id` field.\n",
    "\n",
    "    Returns a dict with both models' predictions and per-class probabilities.\n",
    "    \"\"\"\n",
    "    seg_id = record.get(\"target_segment_id\")\n",
    "    seg_agg = segment_aggregates_lookup.get(seg_id, {})\n",
    "    X = transform_single(record, meta, segment_aggregates=seg_agg)\n",
    "\n",
    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
    "\n",
    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
    "    with torch.no_grad():\n",
    "        logits = mlp_model(torch.tensor(Xs))\n",
    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
    "\n",
    "    return {\n",
    "        \"xgboost\": {\n",
    "            \"label\": xgb_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
    "        },\n",
    "        \"mlp\": {\n",
    "            \"label\": mlp_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
    "        },\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Run on an example record\n",
    "\n",
    "This is a real `reconnaissance` event lifted from the sample dataset: opportunistic attacker scanning an email server early in a campaign (timestep 0). Both models should predict `reconnaissance`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Real attack event from the sample dataset (true label: reconnaissance)\n",
    "example_record = {\n",
    "    \"campaign_id\": \"CAMP-000030\",\n",
    "    \"attacker_id\": \"ATK-0003\",\n",
    "    \"timestep\": 0,\n",
    "    \"target_segment_id\": \"SEG-0008\",\n",
    "    \"target_asset_type\": \"email_server\",\n",
    "    \"source_ip_class\": \"vpn_tunnel\",\n",
    "    \"dest_port\": 22,\n",
    "    \"protocol\": \"icmp\",\n",
    "    \"bytes_transferred\": 15648.48,\n",
    "    \"connection_duration_s\": 3.913,\n",
    "    \"auth_failure_count\": 0,\n",
    "    \"process_injection_flag\": 0,\n",
    "    \"lateral_hop_count\": 0,\n",
    "    \"c2_beacon_interval_s\": 0.0,\n",
    "    \"detection_outcome\": \"edr_blocked\",\n",
    "    \"alert_severity\": \"critical\",\n",
    "    \"siem_rule_triggered\": 0,\n",
    "    \"edr_blocked_flag\": 1,\n",
    "    \"attacker_capability_tier\": \"opportunistic\",\n",
    "    \"defender_maturity_level\": \"baseline\",\n",
    "}\n",
    "\n",
    "result = predict_phase(example_record)\n",
    "\n",
    "print(f\"XGBoost -> {result['xgboost']['label']}\")\n",
    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
    "    print(f\" P({lbl:25s}) = {p:.4f}\")\n",
    "\n",
    "print(f\"\\nMLP -> {result['mlp']['label']}\")\n",
    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1])[:5]:\n",
    "    print(f\" P({lbl:25s}) = {p:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note: when the two models disagree\n",
    "\n",
    "XGBoost and the MLP can disagree on out-of-distribution records, particularly hand-crafted inputs whose feature combinations don't sit on the training-data manifold. The MLP, with BatchNorm and a small training set, has narrower competence than the tree ensemble. Disagreement is a useful triage signal: in a SOC workflow, events with conflicting predictions are worth a human look."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Batch prediction on the sample dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "events = pd.read_csv(os.path.join(ds_path, \"attack_events.csv\"))\n",
    "\n",
    "# Drop leakage columns the model was never trained on\n",
    "events = events.drop(columns=[\"technique_id\", \"technique_name\", \"tactic_category\"],\n",
    "                     errors=\"ignore\")\n",
    "\n",
    "# Score the first 200 events\n",
    "sample = events.head(200).copy()\n",
    "preds = [predict_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
    "sample[\"xgb_pred\"] = preds\n",
    "\n",
    "ct = pd.crosstab(sample[\"kill_chain_phase\"], sample[\"xgb_pred\"],\n",
    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
    "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
    "print(ct)\n",
    "acc = (sample[\"kill_chain_phase\"] == sample[\"xgb_pred\"]).mean()\n",
    "print(f\"\\nbatch accuracy on first 200 (in-distribution): {acc:.4f}\")\n",
    "print(\"\\nNote: this includes training-set events. See validation_results.json\\n\"\n",
    "      \"for proper held-out test-set metrics from disjoint campaigns.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Next steps\n",
    "\n",
    "- See `validation_results.json` for held-out test-set metrics (15 disjoint campaigns, 726 events).\n",
    "- See `ablation_results.json` for per-feature-group contribution. `timestep` is by far the most predictive feature, which is expected: kill-chain phases progress over time, so an event's position in the campaign timeline carries most of the phase signal.\n",
    "- The model card's **Limitations** section explains the gap between this baseline and production threat-detection systems.\n",
    "- For the full 380k-row CYB002 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
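The notebook's note on model disagreement describes a triage signal but does not show how to act on it. One possible sketch follows, operating on the dict shape that `predict_phase` returns. The function name `needs_review` and the 0.6 confidence floor are assumptions chosen for illustration, not values from the repository.

```python
def needs_review(result: dict, min_confidence: float = 0.6) -> bool:
    """Flag a prediction for human review (hypothetical triage rule).

    A record is flagged when the two models pick different labels, or when
    they agree but either model's top probability is below `min_confidence`.
    """
    xgb, mlp = result["xgboost"], result["mlp"]
    if xgb["label"] != mlp["label"]:
        return True
    weakest_top = min(max(xgb["probabilities"].values()),
                      max(mlp["probabilities"].values()))
    return weakest_top < min_confidence


# Toy results in the same shape as predict_phase's return value.
agree = {
    "xgboost": {"label": "impact", "probabilities": {"impact": 0.9, "collection": 0.1}},
    "mlp": {"label": "impact", "probabilities": {"impact": 0.8, "collection": 0.2}},
}
disagree = {
    "xgboost": {"label": "impact", "probabilities": {"impact": 0.7, "collection": 0.3}},
    "mlp": {"label": "collection", "probabilities": {"impact": 0.4, "collection": 0.6}},
}
print(needs_review(agree), needs_review(disagree))
```

Any threshold used in practice would need tuning against held-out data, since the two models' probabilities are not necessarily calibrated.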
model_mlp.safetensors
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f35e1a5f1a92330b2ebdf1f65a097ead961fed4b9dbf4ea11aed7d74a5f293bd
size 86512
|
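The three lines above are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size of the real `model_mlp.safetensors`. A minimal sketch of checking a downloaded file against such a pointer follows; the helper names `parse_lfs_pointer` and `blob_matches_pointer` are hypothetical, and the demo uses a stand-in byte string because the real file is not available here.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f35e1a5f1a92330b2ebdf1f65a097ead961fed4b9dbf4ea11aed7d74a5f293bd
size 86512
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def blob_matches_pointer(blob, pointer):
    """Return True if the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(blob) == pointer["size"]
        and hashlib.sha256(blob).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)

# Stand-in blob (hypothetical): build a pointer for it, then verify both ways.
demo_blob = b"not the real weights"
demo_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(demo_blob).hexdigest(),
    "size": len(demo_blob),
}
print(blob_matches_pointer(demo_blob, demo_pointer))  # stand-in vs its own pointer
print(blob_matches_pointer(demo_blob, pointer))       # stand-in vs the real pointer
```

With the actual weights downloaded, the same check would be applied to the file's bytes, for example `blob_matches_pointer(open("model_mlp.safetensors", "rb").read(), pointer)`, assuming the file sits in the working directory.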
model_xgb.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
validation_results.json
ADDED
|
@@ -0,0 +1,383 @@
{
  "version": "1.0.0",
  "dataset": "xpertsystems/cyb002-sample",
  "task": "10-class kill_chain_phase classification",
  "baselines": {
    "always_predict_majority_accuracy": 0.19421487603305784,
    "majority_class": "dwell_idle",
    "random_guess_accuracy": 0.1
  },
  "split": {
    "strategy": "group_aware (GroupShuffleSplit by campaign_id, nested)",
    "rationale": "100 campaigns generate ~4,353 events; random row-split would leak campaign-level correlations into the test set. The group-aware split ensures train/val/test campaigns are disjoint.",
    "campaigns_train": 69,
    "campaigns_val": 16,
    "campaigns_test": 15,
    "events_train": 2822,
    "events_val": 805,
    "events_test": 726,
    "seed": 42
  },
  "n_features": 90,
  "label_classes": [
    "dwell_idle",
    "reconnaissance",
    "initial_access",
    "execution",
    "persistence",
    "privilege_escalation",
    "lateral_movement",
    "collection",
    "exfiltration",
    "impact"
  ],
  "class_distribution_train": {
    "dwell_idle": 609,
    "reconnaissance": 439,
    "initial_access": 346,
    "execution": 313,
    "persistence": 275,
    "privilege_escalation": 254,
    "lateral_movement": 205,
    "collection": 165,
    "exfiltration": 117,
    "impact": 99
  },
  "class_distribution_test": {
    "dwell_idle": 141,
    "reconnaissance": 112,
    "initial_access": 106,
    "persistence": 79,
    "execution": 74,
    "privilege_escalation": 68,
    "lateral_movement": 54,
    "collection": 40,
    "exfiltration": 31,
    "impact": 21
  },
  "leakage_excluded_features": [
    "technique_id (62/63 techniques map 1:1 to a single phase)",
    "technique_name (1:1 alias of technique_id)",
    "tactic_category (direct alias of kill_chain_phase)"
  ],
  "models": {
    "xgboost": {
      "architecture": "Gradient-boosted decision trees, multi:softprob, 10 classes",
      "framework": "xgboost",
      "test_metrics": {
        "model": "xgboost",
        "accuracy": 0.46831955922865015,
        "macro_f1": 0.42549880749552066,
        "weighted_f1": 0.440668872633435,
        "per_class_f1": {
          "dwell_idle": 0.040268456375838924,
          "reconnaissance": 0.7532467532467533,
          "initial_access": 0.6467661691542289,
          "execution": 0.4406779661016949,
          "persistence": 0.41304347826086957,
          "privilege_escalation": 0.5,
          "lateral_movement": 0.7422680412371134,
          "collection": 0.22018348623853212,
          "exfiltration": 0.2727272727272727,
          "impact": 0.22580645161290322
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
            [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
            [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
            [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
            [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
            [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
            [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
            [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
            [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
            [0, 0, 0, 0, 2, 1, 0, 11, 0, 7]
          ]
        },
        "macro_roc_auc_ovr": 0.8598653258869782
      }
    },
    "mlp": {
      "architecture": "PyTorch MLP, 90 -> 128 -> 64 -> 10, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
      "framework": "pytorch",
      "test_metrics": {
        "model": "mlp",
        "accuracy": 0.44490358126721763,
        "macro_f1": 0.3911186394257205,
        "weighted_f1": 0.4172764238320775,
        "per_class_f1": {
          "dwell_idle": 0.013422818791946308,
          "reconnaissance": 0.7250996015936255,
          "initial_access": 0.6484018264840182,
          "execution": 0.5100671140939598,
          "persistence": 0.30120481927710846,
          "privilege_escalation": 0.4880952380952381,
          "lateral_movement": 0.782608695652174,
          "collection": 0.19130434782608696,
          "exfiltration": 0.11940298507462686,
          "impact": 0.13157894736842105
        },
        "confusion_matrix": {
          "labels": [
            "dwell_idle",
            "reconnaissance",
            "initial_access",
            "execution",
            "persistence",
            "privilege_escalation",
            "lateral_movement",
            "collection",
            "exfiltration",
            "impact"
          ],
          "matrix": [
            [1, 26, 27, 11, 20, 18, 1, 20, 10, 7],
            [0, 91, 4, 10, 7, 0, 0, 0, 0, 0],
            [1, 4, 71, 1, 5, 21, 0, 3, 0, 0],
            [1, 10, 3, 38, 17, 3, 0, 2, 0, 0],
            [4, 8, 2, 8, 25, 9, 0, 11, 5, 7],
            [0, 0, 6, 7, 4, 41, 1, 7, 2, 0],
            [0, 0, 0, 0, 0, 7, 36, 3, 4, 4],
            [1, 0, 0, 0, 1, 1, 0, 11, 11, 15],
            [0, 0, 0, 0, 5, 0, 0, 5, 4, 17],
            [0, 0, 0, 0, 3, 0, 0, 13, 0, 5]
          ]
        },
        "macro_roc_auc_ovr": 0.8496117986303245
      }
    }
  }
}
|
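The summary metrics in `validation_results.json` can be re-derived from the stored confusion matrix, which is a useful sanity check when reading the file. The sketch below does this for the XGBoost matrix. It assumes rows are true labels and columns are predictions; the file does not state the orientation, but the row sums match `class_distribution_test`, which supports that reading. The function name `metrics_from_confusion` is hypothetical.

```python
# XGBoost test-set confusion matrix, copied from validation_results.json.
LABELS = [
    "dwell_idle", "reconnaissance", "initial_access", "execution", "persistence",
    "privilege_escalation", "lateral_movement", "collection", "exfiltration", "impact",
]
XGB_MATRIX = [
    [3, 23, 23, 18, 21, 18, 2, 17, 9, 7],
    [2, 87, 2, 21, 0, 0, 0, 0, 0, 0],
    [1, 5, 65, 5, 3, 26, 1, 0, 0, 0],
    [2, 4, 1, 39, 24, 3, 1, 0, 0, 0],
    [0, 0, 1, 12, 38, 9, 0, 18, 1, 0],
    [0, 0, 3, 8, 4, 44, 3, 5, 1, 0],
    [0, 0, 0, 0, 6, 6, 36, 2, 0, 4],
    [0, 0, 0, 0, 2, 1, 0, 12, 15, 10],
    [0, 0, 0, 0, 5, 0, 0, 4, 9, 13],
    [0, 0, 0, 0, 2, 1, 0, 11, 0, 7],
]


def metrics_from_confusion(matrix):
    """Derive accuracy and F1 summaries from a square confusion matrix.

    Assumes matrix[i][j] counts events with true class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    support = [sum(row) for row in matrix]                               # true counts
    predicted = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # predicted counts
    tp = [matrix[k][k] for k in range(n)]

    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
    f1 = [
        2 * tp[k] / (support[k] + predicted[k]) if support[k] + predicted[k] else 0.0
        for k in range(n)
    ]
    return {
        "accuracy": sum(tp) / total,
        "macro_f1": sum(f1) / n,
        "weighted_f1": sum(f * s for f, s in zip(f1, support)) / total,
        "per_class_f1": f1,
        "majority_baseline": max(support) / total,
    }


m = metrics_from_confusion(XGB_MATRIX)
print(f"accuracy          {m['accuracy']:.4f}")
print(f"macro F1          {m['macro_f1']:.4f}")
print(f"weighted F1       {m['weighted_f1']:.4f}")
print(f"majority baseline {m['majority_baseline']:.4f}")
for label, score in zip(LABELS, m["per_class_f1"]):
    print(f"{label:<22}{score:.4f}")
```

Reading the first row this way also explains the near-zero `dwell_idle` F1: only 3 of the 141 true `dwell_idle` events are predicted as such, with the rest spread across the other phases.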