pradeep-xpert committed on
Commit ed9d6a1 · verified · 1 Parent(s): 7391013

Initial release: XGBoost + MLP for insider threat type classification

README.md ADDED
@@ -0,0 +1,452 @@
+ ---
+ license: cc-by-nc-4.0
+ library_name: pytorch
+ tags:
+   - cybersecurity
+   - insider-threat
+   - ueba
+   - data-exfiltration
+   - dlp
+   - privileged-access
+   - tabular-classification
+   - synthetic-data
+   - xgboost
+   - baseline
+ pipeline_tag: tabular-classification
+ base_model: []
+ datasets:
+   - xpertsystems/cyb007-sample
+ metrics:
+   - accuracy
+   - f1
+   - roc_auc
+ model-index:
+   - name: cyb007-baseline-classifier
+     results:
+       - task:
+           type: tabular-classification
+           name: 3-class insider threat type classification
+         dataset:
+           type: xpertsystems/cyb007-sample
+           name: CYB007 Synthetic Insider Threat Dataset (Sample)
+         metrics:
+           - type: roc_auc
+             value: 0.9628
+             name: Test macro ROC-AUC OvR (XGBoost, seed 42)
+           - type: accuracy
+             value: 0.8529
+             name: Test accuracy (XGBoost, seed 42)
+           - type: f1
+             value: 0.8496
+             name: Test macro-F1 (XGBoost, seed 42)
+           - type: accuracy
+             value: 0.855
+             name: Multi-seed accuracy mean ± 0.012 (XGBoost, 10 seeds)
+           - type: roc_auc
+             value: 0.961
+             name: Multi-seed ROC-AUC mean ± 0.007 (XGBoost, 10 seeds)
+           - type: roc_auc
+             value: 0.9661
+             name: Test macro ROC-AUC OvR (MLP, seed 42)
+           - type: accuracy
+             value: 0.8685
+             name: Test accuracy (MLP, seed 42)
+           - type: f1
+             value: 0.8636
+             name: Test macro-F1 (MLP, seed 42)
+ ---
58
+
59
+ # CYB007 Baseline Classifier
60
+
61
+ **Insider-threat type classifier trained on the CYB007 synthetic
62
+ insider-threat sample. Predicts which of 3 actor types
63
+ (`negligent_user` / `malicious_employee` / `privileged_insider`) is
64
+ behind an observed insider incident from per-timestep trajectory
65
+ telemetry.**
66
+
67
+ > **Baseline reference, not for production use.** This model demonstrates
68
+ > that the [CYB007 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb007-sample)
69
+ > is learnable end-to-end and gives prospective buyers a working starting
70
+ > point for insider-threat detection research. It is not a production
71
+ > UEBA system, DLP engine, or HR-investigation tool. See [Limitations](#limitations).
72
+
73
+ ## Model overview
74
+
75
+ | Property | Value |
76
+ |---|---|
77
+ | Task | 3-class actor_threat_type classification |
78
+ | Training data | `xpertsystems/cyb007-sample` (32,500 timesteps across 500 incidents) |
79
+ | Models | XGBoost + PyTorch MLP |
80
+ | Input features | 28 (after one-hot encoding) |
81
+ | Split | **Group-aware by incident_id** (disjoint train/val/test incidents) |
82
+ | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
83
+ | License | CC-BY-NC-4.0 (matches dataset) |
84
+ | Status | Reference baseline |
85
+
86
+ ## Why this task — CYB007 ships the README's stated headline use case
87
+
88
+ This is the second XpertSystems baseline (after CYB005) that ships
89
+ the **dataset's stated headline use case** rather than pivoting away
90
+ from it. The CYB007 README's first suggested use case is "training
91
+ insider threat classifier models (4-tier actor attribution)", and
92
+ that is the task this baseline trains on (with one schema correction:
93
+ the sample data contains 3 of the 4 tiers — `compromised_account` is
94
+ absent from the sample).
95
+
96
+ CYB003 (malware family), CYB004 (phishing actor tier), and CYB006
97
+ (threat-actor tier) all had to pivot away from their README headline
98
+ targets — n=100 groups isn't enough to support group-aware tier
99
+ classification, and CYB006 in particular had structural distributional
100
+ leakage. CYB007's 500 incidents (matching CYB005's profile of 500
101
+ campaigns × 75 timesteps) is large enough that tier attribution learns
102
+ honestly under group-aware splitting, with no oracle features and
103
+ multi-seed std of just 0.012.
104
+
105
+ Two model artifacts are published. They are designed to be used
106
+ together — disagreement is a useful triage signal. **Unusually for the
107
+ XpertSystems baseline catalog, on CYB007 the MLP slightly outperforms
108
+ XGBoost on the test fold** (0.869 vs 0.853 accuracy at seed 42, 0.966
109
+ vs 0.963 ROC-AUC):
110
+
111
+ - `model_xgb.json` — gradient-boosted trees
112
+ - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
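A minimal sketch of how disagreement between the two artifacts could drive triage, assuming each model yields a 3-element probability vector in the label order used by `feature_engineering.py`. The function name `triage` and the `min_confidence` threshold are hypothetical and not part of the published code.

```python
LABELS = ["negligent_user", "malicious_employee", "privileged_insider"]


def triage(proba_xgb, proba_mlp, min_confidence=0.6):
    """Compare two class-probability vectors and suggest a routing decision."""
    top_xgb = max(range(len(LABELS)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(LABELS)), key=lambda i: proba_mlp[i])
    if top_xgb != top_mlp:
        # The models pick different tiers: send the record to a human.
        return {"label": None, "action": "review", "reason": "models disagree"}
    # Both models agree: average their probability for the agreed class.
    confidence = (proba_xgb[top_xgb] + proba_mlp[top_mlp]) / 2
    action = "accept" if confidence >= min_confidence else "review"
    return {"label": LABELS[top_xgb], "action": action,
            "reason": f"agreed, mean confidence {confidence:.2f}"}


# Made-up probability vectors, for illustration only.
agree = triage([0.1, 0.2, 0.7], [0.2, 0.1, 0.7])
disagree = triage([0.5, 0.4, 0.1], [0.3, 0.6, 0.1])
print(agree)
print(disagree)
```

The threshold is arbitrary here; in practice it would be tuned on the validation fold.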
113
+
114
+ ## Quick start
115
+
116
+ ```bash
117
+ pip install xgboost torch safetensors pandas huggingface_hub
118
+ ```
119
+
120
+ ```python
+ import json
+ import os
+ import sys
+
+ import numpy as np
+ import torch
+ import xgboost as xgb
+ from huggingface_hub import hf_hub_download
+ from safetensors.torch import load_file
+
+ REPO = "xpertsystems/cyb007-baseline-classifier"
+
+ # Download the model artifacts and the feature pipeline from the Hub.
+ paths = {n: hf_hub_download(REPO, n) for n in [
+     "model_xgb.json", "model_mlp.safetensors",
+     "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+ ]}
+
+ # Make the downloaded feature_engineering.py importable.
+ sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+ from feature_engineering import transform_single, load_meta, INT_TO_LABEL
+
+ meta = load_meta(paths["feature_meta.json"])
+ xgb_model = xgb.XGBClassifier()
+ xgb_model.load_model(paths["model_xgb.json"])
+
+ # Predict (see inference_example.ipynb for the full pattern).
+ # `my_timestep_record` is a placeholder for one per-timestep record.
+ X = transform_single(my_timestep_record, meta)
+ proba = xgb_model.predict_proba(X)[0]
+ print(INT_TO_LABEL[int(np.argmax(proba))])
+ ```
144
+
145
+ See [`inference_example.ipynb`](./inference_example.ipynb) for the full
146
+ copy-paste demo.
147
+
148
+ ## Training data
149
+
150
+ Trained on the public sample of CYB007, 32,500 per-timestep telemetry
151
+ rows from 500 insider threat incidents (65 timesteps per incident):
152
+
153
+ | Tier | Incidents | Timestep rows | Class share |
154
+ |---|---:|---:|---:|
155
+ | `negligent_user` | 250 | 16,250 | 50.0% |
156
+ | `malicious_employee` | 150 | 9,750 | 30.0% |
157
+ | `privileged_insider` | 100 | 6,500 | 20.0% |
158
+
159
+ ### Group-aware split
160
+
161
+ A single incident generates 65 highly-correlated timesteps. Random
162
+ row-level splitting would put timesteps from the same incident in both
163
+ train and test, inflating metrics in a way that does not generalize to
164
+ new incidents.
165
+
166
+ This release uses **GroupShuffleSplit by `incident_id`** (nested,
167
+ 70/15/15):
168
+
169
+ | Fold | Incidents | Timesteps |
170
+ |---|---:|---:|
171
+ | Train | 350 | 22,750 |
172
+ | Validation | 75 | 4,875 |
173
+ | Test | 75 | 4,875 |
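The fold sizes in the table can be reproduced with a small standard-library sketch of a group-aware split. This is a stand-in for illustration, not scikit-learn's `GroupShuffleSplit` and not the private training script; the function name `group_split` and the `INC-0000` style IDs are assumptions.

```python
import random


def group_split(group_ids, seed=42, train_frac=0.70, val_frac=0.15):
    """Split unique group IDs into disjoint train/val/test sets."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(len(unique) * train_frac)
    n_val = round(len(unique) * val_frac)
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test


# 500 hypothetical incident IDs with 65 timestep rows each.
rows = [(f"INC-{i:04d}", t) for i in range(500) for t in range(65)]
train, val, test = group_split([inc for inc, _ in rows])

# Every row follows its incident, so no incident straddles two folds.
fold_rows = {
    name: sum(1 for inc, _ in rows if inc in fold)
    for name, fold in [("train", train), ("val", val), ("test", test)]
}
print(len(train), len(val), len(test))
print(fold_rows)
```

Splitting on incident IDs first and assigning rows afterwards is what keeps correlated timesteps from leaking across folds.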
174
+
175
+ All test incidents are completely unseen during training. Class
176
+ imbalance is addressed with balanced class weights: XGBoost receives them as
+ per-row `sample_weight`, and the MLP uses weighted cross-entropy.
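As an illustration of balanced weighting, the sketch below applies the common heuristic `n_samples / (n_classes * class_count)`. It uses the full-sample row counts from the training data table, because the train-fold counts are not published; the helper name `balanced_class_weights` is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Row counts taken from the training data table (full sample, not the train fold).
labels = (
    ["negligent_user"] * 16250
    + ["malicious_employee"] * 9750
    + ["privileged_insider"] * 6500
)
weights = balanced_class_weights(labels)

# One weight per row, in the shape a `sample_weight` argument expects.
sample_weight = [weights[lbl] for lbl in labels]
print(weights)
```

Under this heuristic each class contributes the same total weight, so the minority tier is not drowned out by the majority tier.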
178
+
179
+ ## Feature pipeline
180
+
181
+ The bundled `feature_engineering.py` is the canonical feature recipe.
182
+ 28 features survive after encoding, drawn from:
183
+
184
+ - **Per-timestep numeric** (7): `timestep`, `data_access_volume_mb`, `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `exfiltration_volume_mb_cumulative`, `behavioural_risk_score`
185
+ - **Per-timestep categorical** (3, one-hot): `incident_phase` (8 values), `detection_outcome` (4 values), `target_data_sensitivity_tier` (3 values)
186
+ - **Engineered** (6): `log_data_volume`, `log_cumulative_exfil`, `exfil_velocity`, `is_privileged_event`, `risk_x_dlp_composite`, `is_late_stage`
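The feature count follows from the list above: 7 numeric columns, 8 + 4 + 3 one-hot columns, and 6 engineered columns. The sketch below shows one plausible way to one-hot encode against a fixed level list. The level names are hypothetical (the real ones ship in `feature_meta.json`), and mapping an unseen value to all zeros is an assumption about the pipeline, not a documented behaviour.

```python
def one_hot(value, levels):
    """Encode one categorical value against a fixed, ordered list of levels."""
    return [1.0 if value == level else 0.0 for level in levels]


# Hypothetical levels for target_data_sensitivity_tier (3 values in the sample).
SENSITIVITY_LEVELS = ["low", "medium", "high"]

known = one_hot("medium", SENSITIVITY_LEVELS)
unseen = one_hot("unknown", SENSITIVITY_LEVELS)  # all zeros under this sketch
print(known, unseen)

# Feature-count arithmetic from the list above.
n_numeric = 7
n_one_hot = 8 + 4 + 3  # incident_phase + detection_outcome + sensitivity tier
n_engineered = 6
total_features = n_numeric + n_one_hot + n_engineered
print(total_features)
```

Fixing the level order at training time is what keeps the column order stable between training and inference.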
187
+
188
+ ### Leakage audit
189
+
190
+ Two features have strongly tier-correlated means but with substantial
191
+ distributional overlap. **Neither was dropped**:
192
+
193
+ | Feature | Distribution by tier | Verdict |
194
+ |---|---|---|
195
+ | `data_access_volume_mb` | negligent [0, 88] mean 14 / malicious [0, 328] mean 44 / privileged [0, 2541] mean 302; median ~9 MB for all three | Massive overlap in [0, 88]; real signal, not oracle. KEEP. |
196
+ | `exfiltration_volume_mb_cumulative` | negligent [0, ~50] mean 5 / malicious [0, ~500] mean 90 / privileged [0, ~10000] mean 818 | Heavy-tailed with overlap in low-quantile region. KEEP. |
197
+
198
+ The honest test: removing the volume features collapses accuracy from 0.85
+ to 0.49 and macro-F1 to 0.47 (see the ablation table), below the 0.50
+ majority baseline. This confirms they carry legitimate discriminative signal
+ that **defines what `privileged_insider` means** (a privileged user with
+ elevated data access) rather than being an oracle leak.
203
+
204
+ `detection_outcome` is a near-oracle for **incident phase** (purity
205
+ 0.79, max 1.00 for reconnaissance which is 100% `suppressed`). But its
206
+ purity vs **tier** is uniform (~0.50 across all tiers), so it has no
207
+ oracle relationship to the target. KEEP.
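The purity figures are not defined in this card. One plausible reading, sketched below, is the share of rows in each category value that belong to that value's most common class, with the overall figure taken as the row-weighted average. The function name and the toy data are made up for illustration and are not drawn from the dataset.

```python
from collections import Counter, defaultdict


def purity(categories, targets):
    """Return per-category purity and the row-weighted overall purity."""
    by_cat = defaultdict(Counter)
    for cat, tgt in zip(categories, targets):
        by_cat[cat][tgt] += 1
    per_cat = {
        cat: max(counts.values()) / sum(counts.values())
        for cat, counts in by_cat.items()
    }
    overall = sum(max(c.values()) for c in by_cat.values()) / len(categories)
    return per_cat, overall


# Toy data: category "a" maps to a single target, category "b" is split evenly.
categories = ["a"] * 4 + ["b"] * 4
targets = ["x", "x", "x", "x", "x", "x", "y", "y"]
per_cat, overall = purity(categories, targets)
print(per_cat, overall)
```

A purity near 1.0 for every value would mark the column as a near-oracle for the target; values near the majority-class share indicate no such relationship.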
208
+
209
+ No columns dropped for this task.
210
+
211
+ ## Evaluation
212
+
213
+ ### Test-set metrics, seed 42 (n = 4,875 timesteps from 75 disjoint incidents)
214
+
215
+ **XGBoost** (the published `model_xgb.json` artifact)
216
+
217
+ | Metric | Value |
218
+ |---|---:|
219
+ | Macro ROC-AUC (OvR) | **0.9628** |
220
+ | Accuracy | **0.8529** |
221
+ | Macro-F1 | 0.8496 |
222
+ | Weighted-F1 | 0.8519 |
223
+
224
+ **MLP** (the published `model_mlp.safetensors` artifact) — **slightly outperforms XGBoost**
225
+
226
+ | Metric | Value |
227
+ |---|---:|
228
+ | Macro ROC-AUC (OvR) | **0.9661** |
229
+ | Accuracy | **0.8685** |
230
+ | Macro-F1 | 0.8636 |
231
+ | Weighted-F1 | 0.8682 |
232
+
233
+ The MLP outperforming XGBoost is unusual for tabular data and unusual
234
+ within the XpertSystems baseline catalog — CYB001–CYB006 all had
235
+ XGBoost ahead. With 22,750 training rows and only 28 features, the
236
+ MLP has enough data to fit cleanly and the tabular advantage of trees
237
+ is reduced. Both models are published.
238
+
239
+ ### Multi-seed robustness (XGBoost, 10 seeds)
240
+
241
+ Very stable performance — std 0.012 on accuracy is among the tightest
242
+ in the XpertSystems catalog:
243
+
244
+ | Metric | Mean | Std | Min | Max |
245
+ |---|---:|---:|---:|---:|
246
+ | Accuracy | 0.855 | 0.012 | 0.831 | 0.873 |
247
+ | Macro-F1 | 0.839 | 0.010 | 0.829 | 0.860 |
248
+ | Macro ROC-AUC OvR | 0.961 | 0.007 | 0.949 | 0.972 |
249
+
250
+ Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
251
+ All 10 seeds yielded all 3 tiers in the test fold.
252
+
253
+ ### Per-class F1 (seed 42)
254
+
255
+ | Tier | Class share | XGBoost F1 | MLP F1 |
256
+ |---|---:|---:|---:|
257
+ | `negligent_user` | 50% | 0.876 | 0.894 |
258
+ | `privileged_insider` | 20% | 0.846 | 0.856 |
259
+ | `malicious_employee` | 30% | 0.826 | 0.841 |
260
+
261
+ Both models perform evenly across all three tiers, with no class collapse.
+ `negligent_user` scores highest, and `privileged_insider` comes second despite
+ being the minority class (20%), which suggests that the volume-based
+ behavioural signature (sustained large data access) is reliably
+ discriminative. `malicious_employee` is marginally the hardest tier: these
+ actors operate in a middle zone, more aggressive than negligent users but
+ without the access volumes that distinguish privileged insiders.
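The seed-42 XGBoost metrics can be recomputed from the confusion matrix published in `ablation_results.json`. The sketch below assumes rows are true classes and columns are predictions; the row sums (2,080 / 1,755 / 1,040) are whole multiples of 65 timesteps, which fits that reading. The function name `summarize` is an assumption.

```python
LABELS = ["negligent_user", "malicious_employee", "privileged_insider"]

# Full-model confusion matrix copied from ablation_results.json.
MATRIX = [
    [1919, 111, 50],
    [291, 1372, 92],
    [90, 83, 867],
]


def summarize(matrix):
    """Compute accuracy, per-class F1, macro-F1 and weighted-F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    supports = [sum(row) for row in matrix]                       # true counts
    predicted = [sum(row[j] for row in matrix) for j in range(n)]  # predicted counts
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
    f1 = [2 * matrix[i][i] / (supports[i] + predicted[i]) for i in range(n)]
    macro_f1 = sum(f1) / n
    weighted_f1 = sum(f * s for f, s in zip(f1, supports)) / total
    return accuracy, f1, macro_f1, weighted_f1


accuracy, f1, macro_f1, weighted_f1 = summarize(MATRIX)
for label, score in zip(LABELS, f1):
    print(f"{label:20s} F1 = {score:.3f}")
print(f"accuracy = {accuracy:.4f}, macro-F1 = {macro_f1:.4f}, "
      f"weighted-F1 = {weighted_f1:.4f}")
```

Macro-F1 averages the three classes equally, while weighted-F1 weights each class by its number of true rows.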
268
+
269
+ ### Ablation: which feature groups matter
270
+
271
+ | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
272
+ |---|---:|---:|---:|---:|
273
+ | Full feature set (published) | 0.8529 | 0.8496 | 0.9628 | — |
274
+ | No volume features | 0.4890 | 0.4736 | 0.6828 | **−0.3639** |
275
+ | No behavioural features | 0.7126 | 0.7055 | 0.8961 | −0.1403 |
276
+ | No `timestep` | 0.8394 | 0.8336 | 0.9569 | −0.0135 |
277
+ | No context features | 0.8544 | 0.8490 | 0.9632 | +0.0014 |
278
+ | No engineered features | 0.8597 | 0.8560 | 0.9629 | +0.0068 |
279
+
280
+ Four findings:
281
+
282
+ 1. **Volume features carry the overwhelmingly dominant signal**
283
+ (drops 36 pp accuracy, 28 pp ROC-AUC when removed). This is by
284
+ design — privileged insiders are *defined* by access to large
285
+ data volumes, and the synthetic generator models this faithfully.
286
+ 2. **Behavioural features (privilege events, communication anomaly,
287
+ DLP confidence, risk scores) contribute 14 pp accuracy.** They
288
+ add a second axis of discrimination beyond pure volume.
289
+ 3. **`timestep` contributes only 1 pp.** Tier attribution is largely
290
+ invariant to where in the incident lifecycle you are — different
291
+ from phase prediction, which is strongly timestep-driven.
292
+ 4. **Context features (incident_phase, sensitivity tier) and engineered
+    composites add nothing measurable.** Removing either group leaves
+    accuracy flat or slightly higher; the trees recover the engineered
+    composites from the raw inputs. Both groups are retained in the
+    pipeline as a documented baseline reference.
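The Δ accuracy column can be recomputed from the accuracies in `ablation_results.json`, as in the sketch below. The table's sign convention is ablated minus full, so a negative value means the ablation hurt; the JSON file stores `delta_accuracy` with the opposite sign.

```python
# Accuracies copied from ablation_results.json.
FULL_ACCURACY = 0.8529230769230769
ABLATED_ACCURACY = {
    "no_volume": 0.489025641025641,
    "no_behavioural": 0.7126153846153847,
    "no_timestep": 0.8393846153846154,
    "no_context": 0.8543589743589743,
    "no_engineered": 0.8596923076923076,
}

# Negative delta: removing the group lowered accuracy.
deltas = {name: acc - FULL_ACCURACY for name, acc in ABLATED_ACCURACY.items()}

# Print from most harmful ablation to least harmful.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {delta:+.4f}")
```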
296
+
297
+ ### Architecture
298
+
299
+ **XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes),
300
+ `hist` tree method, class-balanced sample weights, early stopping on
301
+ validation mlogloss.
302
+
303
+ **MLP:** `28 → 128 → 64 → 3`, each hidden layer followed by `BatchNorm1d`
304
+ → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
305
+ early stopping on validation macro-F1.
306
+
307
+ Training hyperparameters are held internally by XpertSystems.
308
+
309
+ ## Limitations
310
+
311
+ **This is a baseline reference, not a production insider-threat detection system.**
312
+
313
+ 1. **The dataset has 3 tiers, not 4.** The CYB007 README claims a
314
+ 4-tier scheme including `compromised_account` but the sample
315
+ contains only `negligent_user`, `malicious_employee`, and
316
+ `privileged_insider`. If your work requires the 4th tier, request
317
+ regeneration.
318
+
319
+ 2. **Volume-feature dominance is a property of the dataset.** Real
320
+ insider-threat telemetry has more variance — some negligent users
321
+ accidentally trigger large data downloads, some privileged
322
+ insiders work patiently with small transfers. The sample's
323
+ per-tier volume distributions overlap, but not as much as in real
324
+ environments. Buyers should test the model on their own data
325
+ before assuming the reported 0.85–0.87 accuracy transfers.
326
+
327
+ 3. **MLP modestly outperforms XGBoost.** With 22,750 training rows,
328
+ the MLP has enough data to compete favorably. On smaller training
329
+ sets (n < 1k rows) we would expect XGBoost to be stronger.
330
+
331
+ 4. **Synthetic-vs-real transfer.** The dataset is synthetic and
332
+ calibrated to insider-threat research benchmarks (CERT Insider
333
+ Threat Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon
334
+ Institute, MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix,
335
+ Forrester UEBA, Gartner ZTNA, CrowdStrike, Mandiant). Real
336
+ insider telemetry has different noise characteristics, and
337
+ adversarial insiders may deliberately mimic negligent-user
338
+ patterns. Do not assume metrics transfer.
339
+
340
+ 5. **Adversarial robustness not evaluated.** The dataset does not
341
+ simulate insiders deliberately spoofing a different tier's
342
+ behavioural footprint to evade attribution.
343
+
344
+ 6. **The 75-incident test fold is robust but not large.** Multi-seed
345
+ std of 0.012 on accuracy confirms the metric is stable, but full
346
+ confidence intervals for downstream production decisions should
347
+ come from the full ~4,800-incident product.
348
+
349
+ ## Notes on dataset schema
350
+
351
+ The CYB007 sample dataset README describes some fields differently
352
+ from the actual schema. The model was trained on the actual schema;
353
+ this note helps buyers reconcile what they read with what they receive.
354
+
355
+ | What the README says | What the data actually contains |
356
+ |---|---|
357
+ | 4 actor tiers including `compromised_account` | **3 tiers only**: `negligent_user`, `malicious_employee`, `privileged_insider`. No `compromised_account` rows in the sample. |
358
+ | 6 incident phases | **8 phases**: adds `idle_dwell` and `lateral_access` to the 6 documented |
359
+ | Per-timestep columns: `payload_entropy`, `cover_actions_taken`, `dlp_alerts_raised`, `detection_flag`, `blast_radius`, `sensitive_data_accessed`, `threat_type_tier` | Actual per-timestep columns: `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `detection_outcome` (categorical 4-value, not boolean), `behavioural_risk_score`, `target_data_sensitivity_tier`, `actor_threat_type` |
360
+ | Summary field `ueba_status` | Actual field is `ueba_deployment_status` (only on `org_topology.csv`, not on `insider_trajectories.csv` or `incident_summary.csv`) |
361
+ | Summary field `collusion_flag` | Actual: `coordinated_incident_flag` |
362
+ | Summary field `lateral_access_flag` | Actual: `lateral_access_count` (not boolean) |
363
+ | Summary field `sabotage_flag` | Actual: `sabotage_events_executed` (count) |
364
+ | Summary field `cover_tracks_flag` | Actual: `cover_tracks_events` (count) |
365
+ | Summary field `hr_trigger_flag` | Actual: `hr_case_triggers_caused` (count) |
366
+ | Summary field `exfiltration_success_flag` | Actual: `exfiltration_successes` (count) and `exfiltration_success_rate` (float) |
367
+ | Summary field `dwell_time_ratio` | Not present in summary; `actor_efficiency_score` is the closest analog |
368
+
369
+ None of these affects model correctness — the feature pipeline uses
370
+ the actual column names. If you build your own pipeline against the
371
+ dataset, use the actual columns.
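If you have code written against the documented field names, a rename map built from the table above can catch mismatches early. This is a sketch only: the helper `resolve_columns` is hypothetical, and the in-memory CSV is a made-up stand-in for `incident_summary.csv`.

```python
import csv
import io

# Documented name -> actual name, taken from the schema table above.
DOCUMENTED_TO_ACTUAL = {
    "ueba_status": "ueba_deployment_status",
    "collusion_flag": "coordinated_incident_flag",
    "lateral_access_flag": "lateral_access_count",
    "sabotage_flag": "sabotage_events_executed",
    "cover_tracks_flag": "cover_tracks_events",
    "hr_trigger_flag": "hr_case_triggers_caused",
    "exfiltration_success_flag": "exfiltration_successes",
}


def resolve_columns(requested, header):
    """Map requested names to columns present in the header; list the rest."""
    resolved, missing = {}, []
    for name in requested:
        if name in header:
            resolved[name] = name
        elif DOCUMENTED_TO_ACTUAL.get(name) in header:
            resolved[name] = DOCUMENTED_TO_ACTUAL[name]
        else:
            missing.append(name)
    return resolved, missing


# Made-up one-row file standing in for incident_summary.csv.
sample = io.StringIO(
    "incident_id,coordinated_incident_flag,lateral_access_count\n"
    "INC-0001,0,3\n"
)
header = next(csv.reader(sample))
resolved, missing = resolve_columns(
    ["collusion_flag", "lateral_access_flag", "dwell_time_ratio"], header
)
print(resolved)
print(missing)
```

Note that several renamed fields also change type (flag to count), so a rename alone does not make old code correct.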
372
+
373
+ ## Intended use
374
+
375
+ - **Evaluating fit** of the CYB007 dataset for your insider-threat
376
+ research
377
+ - **Baseline reference** for new model architectures (sequence models,
378
+ graph models considering collusion structure)
379
+ - **Teaching and demo** for multi-class tabular classification on
380
+ insider-threat telemetry
381
+ - **Feature engineering reference** for per-timestep insider activity
382
+
383
+ ## Out-of-scope use
384
+
385
+ - Production insider-threat detection on real telemetry
386
+ - HR investigation or employment decisions
387
+ - Adversarial-evasion evaluation (dataset not adversarially generated)
388
+ - Any operational or legal decision affecting actual persons
389
+
390
+ ## Reproducibility
391
+
392
+ Outputs above were produced with `seed = 42` (published artifact),
393
+ group-aware nested `GroupShuffleSplit` (70/15/15 by incident_id), on
394
+ the published sample (`xpertsystems/cyb007-sample`, version 1.0.0,
395
+ generated 2026-05-16). The feature pipeline in `feature_engineering.py`
396
+ is deterministic and the trained weights in this repo correspond
397
+ exactly to the metrics above.
398
+
399
+ Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
400
+ `multi_seed_results.json` confirm robust performance across splits.
401
+
402
+ The training script itself is private to XpertSystems.
403
+
404
+ ## Files in this repo
405
+
406
+ | File | Purpose |
407
+ |---|---|
408
+ | `model_xgb.json` | XGBoost weights (seed 42) |
409
+ | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
410
+ | `feature_engineering.py` | Feature pipeline |
411
+ | `feature_meta.json` | Feature column order + categorical levels |
412
+ | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
413
+ | `validation_results.json` | Per-class metrics, confusion matrix, architecture |
414
+ | `ablation_results.json` | Per-feature-group ablation |
415
+ | `multi_seed_results.json` | XGBoost metrics across 10 seeds |
416
+ | `inference_example.ipynb` | End-to-end inference demo notebook |
417
+ | `README.md` | This file |
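The MLP expects standardized inputs, which is why `feature_scaler.json` ships alongside the weights. The sketch below shows the usual `(x - mean) / std` transform. The JSON layout with `mean` and `std` keys is an assumption about the file, and the zero-variance guard is a design choice of this sketch rather than a documented behaviour.

```python
import json


def standardize(values, mean, std, eps=1e-8):
    """Scale each feature to zero mean and unit variance."""
    # A near-zero std would blow up the division, so fall back to 1.0.
    return [
        (v - m) / (s if s > eps else 1.0)
        for v, m, s in zip(values, mean, std)
    ]


# Hypothetical scaler contents with two features; the real file covers all 28.
scaler = json.loads('{"mean": [10.0, 0.5], "std": [5.0, 0.0]}')
scaled = standardize([20.0, 0.5], scaler["mean"], scaler["std"])
print(scaled)
```

Tree models split on thresholds and are unaffected by this rescaling, which is consistent with the table noting that XGBoost ignores the scaler.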
418
+
419
+ ## Contact and full product
420
+
421
+ The full **CYB007** dataset contains ~335,000 rows across four files,
422
+ with calibrated benchmark validation against 12 metrics drawn from
423
+ authoritative insider-threat research sources (CERT Insider Threat
424
+ Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon Institute,
425
+ MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, Forrester UEBA,
426
+ Gartner ZTNA, CrowdStrike, Mandiant M-Trends). The full
427
+ XpertSystems.ai synthetic data catalogue spans 41 SKUs across
428
+ Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
429
+ & Energy.
430
+
431
+ - 📧 **pradeep@xpertsystems.ai**
432
+ - 🌐 **https://xpertsystems.ai**
433
+ - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb007-sample
434
+ - 🤖 Companion models:
435
+ - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
436
+ - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
437
+ - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
438
+ - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
439
+ - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
440
+ - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
441
+
442
+ ## Citation
443
+
444
+ ```bibtex
445
+ @misc{xpertsystems_cyb007_baseline_2026,
446
+ title = {CYB007 Baseline Classifier: XGBoost and MLP for Insider Threat Type Classification},
447
+ author = {XpertSystems.ai},
448
+ year = {2026},
449
+ url = {https://huggingface.co/xpertsystems/cyb007-baseline-classifier},
450
+ note = {Baseline reference model trained on xpertsystems/cyb007-sample}
451
+ }
452
+ ```
ablation_results.json ADDED
@@ -0,0 +1,251 @@
1
+ {
2
+ "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same group-aware split, with one feature group dropped at a time.",
3
+ "full_model_metrics": {
4
+ "model": "xgboost",
5
+ "accuracy": 0.8529230769230769,
6
+ "macro_f1": 0.8495931102241494,
7
+ "weighted_f1": 0.8518585237469937,
8
+ "per_class_f1": {
9
+ "negligent_user": 0.8762557077625571,
10
+ "malicious_employee": 0.8262571514604035,
11
+ "privileged_insider": 0.8462664714494875
12
+ },
13
+ "confusion_matrix": {
14
+ "labels": [
15
+ "negligent_user",
16
+ "malicious_employee",
17
+ "privileged_insider"
18
+ ],
19
+ "matrix": [
20
+ [
21
+ 1919,
22
+ 111,
23
+ 50
24
+ ],
25
+ [
26
+ 291,
27
+ 1372,
28
+ 92
29
+ ],
30
+ [
31
+ 90,
32
+ 83,
33
+ 867
34
+ ]
35
+ ]
36
+ },
37
+ "macro_roc_auc_ovr": 0.9627526877302969
38
+ },
39
+ "ablations": {
40
+ "no_volume": {
41
+ "n_features": 23,
42
+ "dropped_count": 5,
43
+ "metrics": {
44
+ "model": "xgboost_no_volume",
45
+ "accuracy": 0.489025641025641,
46
+ "macro_f1": 0.47358930080150813,
47
+ "weighted_f1": 0.48784413847470176,
48
+ "per_class_f1": {
49
+ "negligent_user": 0.5617715617715617,
50
+ "malicious_employee": 0.44251626898047725,
51
+ "privileged_insider": 0.41648007165248546
52
+ },
53
+ "confusion_matrix": {
54
+ "labels": [
55
+ "negligent_user",
56
+ "malicious_employee",
57
+ "privileged_insider"
58
+ ],
59
+ "matrix": [
60
+ [
61
+ 1205,
62
+ 483,
63
+ 392
64
+ ],
65
+ [
66
+ 705,
67
+ 714,
68
+ 336
69
+ ],
70
+ [
71
+ 300,
72
+ 275,
73
+ 465
74
+ ]
75
+ ]
76
+ },
77
+ "macro_roc_auc_ovr": 0.6827532681591143
78
+ },
79
+ "delta_accuracy": 0.3638974358974359,
80
+ "delta_macro_f1": 0.3760038094226413
81
+ },
82
+ "no_behavioural": {
83
+ "n_features": 18,
84
+ "dropped_count": 10,
85
+ "metrics": {
86
+ "model": "xgboost_no_behavioural",
87
+ "accuracy": 0.7126153846153847,
88
+ "macro_f1": 0.7054601986097401,
89
+ "weighted_f1": 0.7141318275968602,
90
+ "per_class_f1": {
91
+ "negligent_user": 0.7372585524784734,
92
+ "malicious_employee": 0.7183327906219472,
93
+ "privileged_insider": 0.6607892527287993
94
+ },
95
+ "confusion_matrix": {
96
+ "labels": [
97
+ "negligent_user",
98
+ "malicious_employee",
99
+ "privileged_insider"
100
+ ],
101
+ "matrix": [
102
+ [
103
+ 1584,
104
+ 154,
105
+ 342
106
+ ],
107
+ [
108
+ 439,
109
+ 1103,
110
+ 213
111
+ ],
112
+ [
113
+ 194,
114
+ 59,
115
+ 787
116
+ ]
117
+ ]
118
+ },
119
+ "macro_roc_auc_ovr": 0.896141715091384
120
+ },
121
+ "delta_accuracy": 0.14030769230769224,
122
+ "delta_macro_f1": 0.14413291161440933
123
+ },
124
+ "no_timestep": {
125
+ "n_features": 26,
126
+ "dropped_count": 2,
127
+ "metrics": {
128
+ "model": "xgboost_no_timestep",
129
+ "accuracy": 0.8393846153846154,
130
+ "macro_f1": 0.8335587554093177,
131
+ "weighted_f1": 0.838097363099834,
132
+ "per_class_f1": {
133
+ "negligent_user": 0.8618759794045221,
134
+ "malicious_employee": 0.8233151183970856,
135
+ "privileged_insider": 0.8154851684263449
136
+ },
137
+ "confusion_matrix": {
138
+ "labels": [
139
+ "negligent_user",
140
+ "malicious_employee",
141
+ "privileged_insider"
142
+ ],
143
+ "matrix": [
144
+ [
145
+ 1925,
146
+ 97,
147
+ 58
148
+ ],
149
+ [
150
+ 319,
151
+ 1356,
152
+ 80
153
+ ],
154
+ [
155
+ 143,
156
+ 86,
157
+ 811
158
+ ]
159
+ ]
160
+ },
161
+ "macro_roc_auc_ovr": 0.9568593124770418
162
+ },
163
+ "delta_accuracy": 0.0135384615384615,
164
+ "delta_macro_f1": 0.01603435481483173
165
+ },
166
+ "no_context": {
167
+ "n_features": 17,
168
+ "dropped_count": 11,
169
+ "metrics": {
170
+ "model": "xgboost_no_context",
171
+ "accuracy": 0.8543589743589743,
172
+ "macro_f1": 0.8489739255889375,
173
+ "weighted_f1": 0.8531648766003023,
174
+ "per_class_f1": {
175
+ "negligent_user": 0.8806546942486929,
176
+ "malicious_employee": 0.8314674735249622,
177
+ "privileged_insider": 0.8347996089931574
178
+ },
179
+ "confusion_matrix": {
180
+ "labels": [
181
+ "negligent_user",
182
+ "malicious_employee",
183
+ "privileged_insider"
184
+ ],
185
+ "matrix": [
186
+ [
187
+ 1937,
188
+ 92,
189
+ 51
190
+ ],
191
+ [
192
+ 280,
193
+ 1374,
194
+ 101
195
+ ],
196
+ [
197
+ 102,
198
+ 84,
199
+ 854
200
+ ]
201
+ ]
202
+ },
203
+ "macro_roc_auc_ovr": 0.9632029829754446
204
+ },
205
+ "delta_accuracy": -0.0014358974358974486,
206
+ "delta_macro_f1": 0.0006191846352119335
207
+ },
208
+ "no_engineered": {
209
+ "n_features": 22,
210
+ "dropped_count": 6,
211
+ "metrics": {
212
+ "model": "xgboost_no_engineered",
213
+ "accuracy": 0.8596923076923076,
214
+ "macro_f1": 0.8559750404567971,
215
+ "weighted_f1": 0.8586557301112084,
216
+ "per_class_f1": {
217
+ "negligent_user": 0.8818575005690872,
218
+ "malicious_employee": 0.8366052552099064,
219
+ "privileged_insider": 0.8494623655913979
220
+ },
221
+ "confusion_matrix": {
222
+ "labels": [
223
+ "negligent_user",
224
+ "malicious_employee",
225
+ "privileged_insider"
226
+ ],
227
+ "matrix": [
228
+ [
229
+ 1937,
230
+ 91,
231
+ 52
232
+ ],
233
+ [
234
+ 285,
235
+ 1385,
236
+ 85
237
+ ],
238
+ [
239
+ 91,
240
+ 80,
241
+ 869
242
+ ]
243
+ ]
244
+ },
245
+ "macro_roc_auc_ovr": 0.9629058321133872
246
+ },
247
+ "delta_accuracy": -0.00676923076923075,
248
+ "delta_macro_f1": -0.006381930232647659
249
+ }
250
+ }
251
+ }
feature_engineering.py ADDED
@@ -0,0 +1,309 @@
1
+ """
2
+ feature_engineering.py
3
+ ======================
4
+
5
+ Feature pipeline for the CYB007 baseline classifier.
6
+
7
+ Predicts `actor_threat_type` (3-class: negligent_user / malicious_employee
8
+ / privileged_insider) from per-timestep insider threat trajectory data on
9
+ the CYB007 sample dataset.
10
+
11
+ CSV inputs:
12
+ insider_trajectories.csv (primary, per-timestep, 500 incidents x 65
13
+ timesteps = 32,500 rows)
14
+ incident_summary.csv (per-incident aggregates; reserved for
15
+ future work)
16
+ incident_events.csv (discrete incident event log; reserved
17
+ for future work - 191 collusion records
18
+ out of 38,687 events)
19
+ org_topology.csv (per-department defender configuration;
20
+ joinable to events but not directly to
21
+ per-timestep trajectories without a
22
+ department key on the trajectory row)
23
+
24
+ Target classes (3):
25
+ negligent_user, malicious_employee, privileged_insider
26
+
27
+ The CYB007 README claims 4 actor tiers (adds compromised_account) but
28
+ the sample data contains only 3. We train on the 3 that exist.
29
+
30
+ Sample-size note
31
+ ----------------
32
+ 500 incidents with 65 timesteps each is the same volume profile as
33
+ CYB005 (500 campaigns × 75 timesteps). At this scale, group-aware
34
+ splitting yields ~75 test incidents (~11-25 per tier), which is enough
35
+ to learn tier attribution honestly. CYB003/4/6 pivoted away from the
36
+ README's stated tier-attribution headline because their samples had
37
+ only 100 groups; CYB007 ships the headline use case.
38
+
39
+ Leakage audit
40
+ -------------
41
+ Two features have strongly tier-correlated means but with substantial
42
+ distributional overlap:
43
+ - data_access_volume_mb: privileged 0-2541, malicious 0-328,
44
+ negligent 0-88. Overlap region [0, 88] covers most timesteps for all
45
+ three tiers (median ~9 MB each). Real observable, not oracle. KEPT.
46
+ - exfiltration_volume_mb_cumulative: similar shape, overlap [0, ~50].
47
+ Real observable. KEPT.
48
+
49
+ Removing both features drops accuracy from 0.85 to 0.47 (below
50
+ majority). This confirms they are not oracles - they carry legitimate
51
+ discriminative signal that defines what privileged_insider means.
52
+
53
+ `detection_outcome` is near-oracle for incident_phase (purity 0.79,
54
+ max 1.00 for reconnaissance). For TIER prediction it has no oracle
55
+ relationship (purity vs tier is uniform around 0.50). KEPT.
56
+
57
+ No columns dropped for this task.
58
+
59
+ Public API
60
+ ----------
61
+ build_features(trajectories_path) -> (X, y, groups, meta)
62
+ transform_single(record, meta) -> np.ndarray
63
+ save_meta(meta, path) / load_meta(path)
64
+
65
+ License
66
+ -------
67
+ Ships with the public model on Hugging Face under CC-BY-NC-4.0,
68
+ matching the dataset license. See README.md.
69
+ """
70
+
71
+ from __future__ import annotations
72
+
73
+ import json
74
+ from pathlib import Path
75
+ from typing import Any
76
+
77
+ import numpy as np
78
+ import pandas as pd
79
+
80
+ # ---------------------------------------------------------------------------
81
+ # Label space
82
+ # ---------------------------------------------------------------------------
83
+
84
+ # Ordered roughly by access/sophistication. The CYB007 README claims a 4th
85
+ # tier 'compromised_account' but the sample data contains only 3.
86
+ LABEL_ORDER = [
87
+     "negligent_user",
88
+     "malicious_employee",
89
+     "privileged_insider",
90
+ ]
91
+ LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
92
+ INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
93
+
94
+ # ---------------------------------------------------------------------------
95
+ # Identifier and target columns
96
+ # ---------------------------------------------------------------------------
97
+
98
+ ID_COLUMNS = ["incident_id", "actor_id"]
99
+ TARGET_COLUMN = "actor_threat_type"
100
+
101
+ # No columns dropped for leakage. See module docstring's "Leakage audit".
102
+ LEAKY_COLUMNS: list[str] = []
103
+
104
+ # ---------------------------------------------------------------------------
105
+ # Per-timestep numeric features
106
+ # ---------------------------------------------------------------------------
107
+
108
+ DIRECT_NUMERIC_TIMESTEP_FEATURES = [
109
+     "timestep",  # position in 65-step lifecycle
110
+     "data_access_volume_mb",
111
+     "privilege_event_count",
112
+     "communication_anomaly_score",
113
+     "dlp_confidence_score",
114
+     "exfiltration_volume_mb_cumulative",
115
+     "behavioural_risk_score",
116
+ ]
117
+
118
+ # Per-timestep categoricals to one-hot
119
+ CATEGORICAL_TIMESTEP_FEATURES = [
120
+     "incident_phase",  # 8 values
121
+     "detection_outcome",  # 4 values
122
+     "target_data_sensitivity_tier",  # 3 values
123
+ ]
124
+
125
+
126
+ # ---------------------------------------------------------------------------
127
+ # Engineered features
128
+ # ---------------------------------------------------------------------------
129
+
130
+ def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
131
+     """
132
+     Six engineered features encoding tier-discriminative hypotheses.
133
+     Each is a composite that a security analyst could compute by hand.
134
+     """
135
+     df = df.copy()
136
+
137
+     # 1. Log-scaled data volume. data_access_volume_mb is heavy-tailed
138
+     #    (median ~9 MB, max ~2541 MB for privileged insiders). log1p
139
+     #    compresses the tail, mainly for the MLP; tree splits are rank-based.
140
+     df["log_data_volume"] = np.log1p(
141
+         df["data_access_volume_mb"].clip(lower=0)
142
+     ).astype(float)
143
+
144
+     # 2. Log-scaled cumulative exfiltration. Same heavy-tail shape.
145
+     df["log_cumulative_exfil"] = np.log1p(
146
+         df["exfiltration_volume_mb_cumulative"].clip(lower=0)
147
+     ).astype(float)
148
+
149
+     # 3. Exfil velocity: cumulative exfil per timestep elapsed.
150
+     #    High = aggressive exfiltration; low = patient or accidental.
151
+     df["exfil_velocity"] = (
152
+         df["exfiltration_volume_mb_cumulative"]
153
+         / df["timestep"].clip(lower=1)
154
+     ).astype(float)
155
+
156
+     # 4. Privileged event indicator. privilege_event_count > 0 marks
157
+     #    timesteps with privileged operations. Strong privileged_insider
158
+     #    signature.
159
+     df["is_privileged_event"] = (df["privilege_event_count"] > 0).astype(int)
160
+
161
+     # 5. Risk x DLP composite. Combines behavioural risk score with
162
+     #    DLP confidence - high values indicate both behavioural anomaly
163
+     #    AND DLP-recognised risk pattern.
164
+     df["risk_x_dlp_composite"] = (
165
+         df["behavioural_risk_score"] * df["dlp_confidence_score"]
166
+     ).astype(float)
167
+
168
+     # 6. Late-stage indicator. Timesteps after 40 sit in cover_tracks /
169
+     #    incident_resolution / late exfiltration_attempt; tier signal
170
+     #    differs across these late phases.
171
+     df["is_late_stage"] = (df["timestep"] > 40).astype(int)
172
+
173
+     return df
174
+
175
+
176
+ # ---------------------------------------------------------------------------
177
+ # Public API
178
+ # ---------------------------------------------------------------------------
179
+
180
+ def build_features(
181
+     trajectories_path: str | Path,
182
+ ) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
183
+     """
184
+     Load CSV, drop target + identifiers, engineer features, one-hot encode,
185
+     return (X, y, groups, meta).
186
+
187
+     `groups` is a Series of incident_id values aligned with X. Use it with
188
+     GroupShuffleSplit / GroupKFold so train and test sets contain disjoint
189
+     incidents - each incident generates 65 highly-correlated timesteps.
190
+     """
191
+     traj = pd.read_csv(trajectories_path)
192
+
193
+     y = traj[TARGET_COLUMN].map(LABEL_TO_INT)
194
+     if y.isna().any():
195
+         bad = traj.loc[y.isna(), TARGET_COLUMN].unique()
196
+         raise ValueError(f"Unknown actor_threat_type values: {bad}")
197
+     y = y.astype(int)
198
+     groups = traj["incident_id"].copy()
199
+
200
+     traj = traj.drop(
201
+         columns=ID_COLUMNS + [TARGET_COLUMN] + LEAKY_COLUMNS, errors="ignore",
202
+     )
203
+
204
+     traj = _add_engineered_features(traj)
205
+
206
+     numeric_features = (
207
+         DIRECT_NUMERIC_TIMESTEP_FEATURES
208
+         + [
209
+             "log_data_volume", "log_cumulative_exfil", "exfil_velocity",
210
+             "is_privileged_event", "risk_x_dlp_composite", "is_late_stage",
211
+         ]
212
+     )
213
+     X_numeric = traj[numeric_features].astype(float)
214
+
215
+     categorical_levels: dict[str, list[str]] = {}
216
+     blocks: list[pd.DataFrame] = []
217
+     for col in CATEGORICAL_TIMESTEP_FEATURES:
218
+         if col not in traj.columns:
219
+             continue
220
+         levels = sorted(traj[col].dropna().unique().tolist())
221
+         categorical_levels[col] = levels
222
+         block = pd.get_dummies(
223
+             traj[col].astype("category").cat.set_categories(levels),
224
+             prefix=col, dummy_na=False,
225
+         ).astype(int)
226
+         blocks.append(block)
227
+
228
+     X = pd.concat(
229
+         [X_numeric.reset_index(drop=True)]
230
+         + [b.reset_index(drop=True) for b in blocks],
231
+         axis=1,
232
+     ).fillna(0.0)
233
+
234
+     meta = {
235
+         "feature_names": X.columns.tolist(),
236
+         "numeric_features": numeric_features,
237
+         "categorical_levels": categorical_levels,
238
+         "label_to_int": LABEL_TO_INT,
239
+         "int_to_label": INT_TO_LABEL,
240
+         "leakage_excluded": LEAKY_COLUMNS,
241
+     }
242
+     return X, y, groups, meta
243
+
244
+
245
+ def transform_single(
246
+     record: dict | pd.DataFrame,
247
+     meta: dict[str, Any],
248
+ ) -> np.ndarray:
249
+     """Encode a single timestep record for inference."""
250
+     if isinstance(record, dict):
251
+         df = pd.DataFrame([record.copy()])
252
+     else:
253
+         df = record.copy().reset_index(drop=True)  # keep rows aligned with the blocks built below
254
+
255
+     df = _add_engineered_features(df)
256
+
257
+     numeric = pd.DataFrame({
258
+         col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
259
+         for col in meta["numeric_features"]
260
+     })
261
+     blocks: list[pd.DataFrame] = [numeric]
262
+     for col, levels in meta["categorical_levels"].items():
263
+         val = df.get(col, pd.Series([None] * len(df)))
264
+         block = pd.get_dummies(
265
+             val.astype("category").cat.set_categories(levels),
266
+             prefix=col, dummy_na=False,
267
+         ).astype(int)
268
+         for lvl in levels:
269
+             cname = f"{col}_{lvl}"
270
+             if cname not in block.columns:
271
+                 block[cname] = 0
272
+         block = block[[f"{col}_{lvl}" for lvl in levels]]
273
+         blocks.append(block)
274
+
275
+     X = pd.concat(blocks, axis=1).fillna(0.0)
276
+     X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
277
+     return X.values.astype(np.float32)
278
+
279
+
280
+ def save_meta(meta: dict[str, Any], path: str | Path) -> None:
281
+     serializable = {
282
+         "feature_names": meta["feature_names"],
283
+         "numeric_features": meta["numeric_features"],
284
+         "categorical_levels": meta["categorical_levels"],
285
+         "label_to_int": meta["label_to_int"],
286
+         "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
287
+         "leakage_excluded": meta.get("leakage_excluded", []),
288
+     }
289
+     with open(path, "w") as f:
290
+         json.dump(serializable, f, indent=2)
291
+
292
+
293
+ def load_meta(path: str | Path) -> dict[str, Any]:
294
+     with open(path) as f:
295
+         meta = json.load(f)
296
+     meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
297
+     return meta
298
+
299
+
300
+ if __name__ == "__main__":
301
+     import sys
302
+     base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
303
+     X, y, groups, meta = build_features(base / "insider_trajectories.csv")
304
+     print(f"X shape: {X.shape}")
305
+     print(f"y shape: {y.shape}")
306
+     print(f"groups: {groups.nunique()} incidents")
307
+     print(f"n_features: {len(meta['feature_names'])}")
308
+     print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
309
+     print(f"X has NaN: {X.isnull().any().any()}")
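The `build_features` docstring asks for train and test sets that contain disjoint incidents. A minimal standard-library sketch of that idea follows. The helper `group_split` and the `INC0000`-style ids are hypothetical and only illustrate the principle; the actual training script is not part of this diff, and in practice scikit-learn's `GroupShuffleSplit` is the usual tool. The 70/15/15 fractions are chosen to mirror the 350/75/75 incident counts reported in `validation_results.json`.

```python
import random


def group_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split unique group ids into disjoint train/val/test sets (hypothetical helper)."""
    unique = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test


# Toy stand-in for the incident_id column: 500 incidents x 65 timesteps.
# The id format is an assumption made for this example.
incident_ids = [f"INC{i:04d}" for i in range(500) for _ in range(65)]
train_ids, val_ids, test_ids = group_split(incident_ids)

# Row membership is derived from the incident id, never from row position,
# so all 65 timesteps of an incident land in the same fold.
train_rows = [i for i, g in enumerate(incident_ids) if g in train_ids]
test_rows = [i for i, g in enumerate(incident_ids) if g in test_ids]

print(len(train_ids), len(val_ids), len(test_ids))
print(len(train_rows), len(test_rows))
```

Splitting on ids first and deriving row indices second is what prevents the per-incident correlation from leaking across folds.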
feature_meta.json ADDED
@@ -0,0 +1,81 @@
1
+ {
2
+ "feature_names": [
3
+ "timestep",
4
+ "data_access_volume_mb",
5
+ "privilege_event_count",
6
+ "communication_anomaly_score",
7
+ "dlp_confidence_score",
8
+ "exfiltration_volume_mb_cumulative",
9
+ "behavioural_risk_score",
10
+ "log_data_volume",
11
+ "log_cumulative_exfil",
12
+ "exfil_velocity",
13
+ "is_privileged_event",
14
+ "risk_x_dlp_composite",
15
+ "is_late_stage",
16
+ "incident_phase_access_escalation",
17
+ "incident_phase_cover_tracks",
18
+ "incident_phase_data_staging",
19
+ "incident_phase_exfiltration_attempt",
20
+ "incident_phase_idle_dwell",
21
+ "incident_phase_incident_resolution",
22
+ "incident_phase_lateral_access",
23
+ "incident_phase_reconnaissance",
24
+ "detection_outcome_exfil_success",
25
+ "detection_outcome_high_risk_alert",
26
+ "detection_outcome_moderate_risk_alert",
27
+ "detection_outcome_suppressed",
28
+ "target_data_sensitivity_tier_confidential",
29
+ "target_data_sensitivity_tier_internal",
30
+ "target_data_sensitivity_tier_restricted"
31
+ ],
32
+ "numeric_features": [
33
+ "timestep",
34
+ "data_access_volume_mb",
35
+ "privilege_event_count",
36
+ "communication_anomaly_score",
37
+ "dlp_confidence_score",
38
+ "exfiltration_volume_mb_cumulative",
39
+ "behavioural_risk_score",
40
+ "log_data_volume",
41
+ "log_cumulative_exfil",
42
+ "exfil_velocity",
43
+ "is_privileged_event",
44
+ "risk_x_dlp_composite",
45
+ "is_late_stage"
46
+ ],
47
+ "categorical_levels": {
48
+ "incident_phase": [
49
+ "access_escalation",
50
+ "cover_tracks",
51
+ "data_staging",
52
+ "exfiltration_attempt",
53
+ "idle_dwell",
54
+ "incident_resolution",
55
+ "lateral_access",
56
+ "reconnaissance"
57
+ ],
58
+ "detection_outcome": [
59
+ "exfil_success",
60
+ "high_risk_alert",
61
+ "moderate_risk_alert",
62
+ "suppressed"
63
+ ],
64
+ "target_data_sensitivity_tier": [
65
+ "confidential",
66
+ "internal",
67
+ "restricted"
68
+ ]
69
+ },
70
+ "label_to_int": {
71
+ "negligent_user": 0,
72
+ "malicious_employee": 1,
73
+ "privileged_insider": 2
74
+ },
75
+ "int_to_label": {
76
+ "0": "negligent_user",
77
+ "1": "malicious_employee",
78
+ "2": "privileged_insider"
79
+ },
80
+ "leakage_excluded": []
81
+ }
feature_scaler.json ADDED
@@ -0,0 +1 @@
1
+ {"mean": [32.0, 79.22963365714287, 0.8590769230769231, 0.1355283047912088, 0.4583615573186812, 192.59504764835165, 0.21502486004395605, 2.2549143963709435, 1.4574760046941566, 4.083364973522955, 0.46584615384615385, 0.11968470332782674, 0.36923076923076925, 0.09243956043956043, 0.08004395604395605, 0.13604395604395605, 0.2164835164835165, 0.22254945054945055, 0.09753846153846153, 0.0843076923076923, 0.0705934065934066, 0.034417582417582415, 0.3030769230769231, 0.16013186813186814, 0.5023736263736264, 0.4114285714285714, 0.19714285714285715, 0.3914285714285714], "std": [18.762075397130605, 238.64412799765506, 1.1010367321730437, 0.12227170547855615, 0.38240445028259407, 655.2828426642064, 0.11440456287107159, 2.013024370428324, 2.5410911146144906, 13.558449957039286, 0.49884311462684755, 0.12842022668589032, 0.4826071342931682, 0.28965181846004456, 0.2713672015458289, 0.3428427696663136, 0.41185660084086384, 0.4159673043293796, 0.2966961062219711, 0.27785481618471686, 0.25615007636300646, 0.18230324542936777, 0.4595982883302537, 0.3667363696627672, 0.5000053551155411, 0.4921033902433739, 0.3978498568319774, 0.48808064520747485]}
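The scaler file stores one mean and one standard deviation per feature, in `feature_names` order, and the notebook applies them as `(X - MU) / SD` before the MLP. A small worked sketch of that arithmetic on the first three features (`timestep`, `data_access_volume_mb`, `privilege_event_count`), with values copied from the file above; the `standardize` helper and its `eps` guard are assumptions added for illustration:

```python
# First three entries of feature_scaler.json, in feature_names order.
MEAN = [32.0, 79.22963365714287, 0.8590769230769231]
STD = [18.762075397130605, 238.64412799765506, 1.1010367321730437]


def standardize(values, mean, std, eps=1e-12):
    """Z-score each value; eps guards against a zero std (assumed, not in the notebook)."""
    return [(v - m) / max(s, eps) for v, m, s in zip(values, mean, std)]


# timestep=31, data_access_volume_mb=424.4688, privilege_event_count=2
z = standardize([31, 424.4688, 2], MEAN, STD)
print([round(v, 3) for v in z])
```

A 424 MB access step sits roughly 1.4 standard deviations above the mean on the raw scale, which is why the log-scaled companion feature is also supplied to the MLP.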
inference_example.ipynb ADDED
@@ -0,0 +1,289 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# CYB007 Baseline Classifier — Inference Example\n",
8
+ "\n",
9
+ "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **insider threat type** of an incident from a per-timestep trajectory record.\n",
10
+ "\n",
11
+ "**Models predict one of 3 tiers:** `negligent_user`, `malicious_employee`, `privileged_insider`.\n",
12
+ "\n",
13
+ "**This is a baseline reference model**, not a production insider-threat detection system. See the model card for full metrics and limitations."
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "markdown",
18
+ "metadata": {},
19
+ "source": [
20
+ "## 1. Install dependencies"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "metadata": {},
27
+ "outputs": [],
28
+ "source": [
29
+ "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "## 2. Download model artifacts from Hugging Face"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "execution_count": null,
42
+ "metadata": {},
43
+ "outputs": [],
44
+ "source": [
45
+ "from huggingface_hub import hf_hub_download\n",
46
+ "\n",
47
+ "REPO_ID = \"xpertsystems/cyb007-baseline-classifier\"\n",
48
+ "\n",
49
+ "files = {}\n",
50
+ "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
51
+ "             \"feature_engineering.py\", \"feature_meta.json\",\n",
52
+ "             \"feature_scaler.json\"]:\n",
53
+ "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
54
+ "    print(f\" downloaded: {name}\")"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": null,
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "import sys, os\n",
64
+ "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
65
+ "if fe_dir not in sys.path:\n",
66
+ "    sys.path.insert(0, fe_dir)\n",
67
+ "\n",
68
+ "from feature_engineering import transform_single, load_meta, INT_TO_LABEL"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "metadata": {},
74
+ "source": [
75
+ "## 3. Load models and metadata"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "code",
80
+ "execution_count": null,
81
+ "metadata": {},
82
+ "outputs": [],
83
+ "source": [
84
+ "import json\n",
85
+ "import numpy as np\n",
86
+ "import torch\n",
87
+ "import torch.nn as nn\n",
88
+ "import xgboost as xgb\n",
89
+ "from safetensors.torch import load_file\n",
90
+ "\n",
91
+ "meta = load_meta(files[\"feature_meta.json\"])\n",
92
+ "with open(files[\"feature_scaler.json\"]) as f:\n",
93
+ "    scaler = json.load(f)\n",
94
+ "\n",
95
+ "N_FEATURES = len(meta[\"feature_names\"])\n",
96
+ "N_CLASSES = len(meta[\"int_to_label\"])\n",
97
+ "print(f\"feature count: {N_FEATURES}\")\n",
98
+ "print(f\"class count: {N_CLASSES}\")\n",
99
+ "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
100
+ ]
101
+ },
102
+ {
103
+ "cell_type": "code",
104
+ "execution_count": null,
105
+ "metadata": {},
106
+ "outputs": [],
107
+ "source": [
108
+ "# XGBoost\n",
109
+ "xgb_model = xgb.XGBClassifier()\n",
110
+ "xgb_model.load_model(files[\"model_xgb.json\"])\n",
111
+ "\n",
112
+ "# MLP architecture (must match training)\n",
113
+ "class TierMLP(nn.Module):\n",
114
+ "    def __init__(self, n_features, n_classes=3, hidden1=128, hidden2=64, dropout=0.3):\n",
115
+ "        super().__init__()\n",
116
+ "        self.net = nn.Sequential(\n",
117
+ "            nn.Linear(n_features, hidden1),\n",
118
+ "            nn.BatchNorm1d(hidden1),\n",
119
+ "            nn.ReLU(),\n",
120
+ "            nn.Dropout(dropout),\n",
121
+ "            nn.Linear(hidden1, hidden2),\n",
122
+ "            nn.BatchNorm1d(hidden2),\n",
123
+ "            nn.ReLU(),\n",
124
+ "            nn.Dropout(dropout),\n",
125
+ "            nn.Linear(hidden2, n_classes),\n",
126
+ "        )\n",
127
+ "    def forward(self, x):\n",
128
+ "        return self.net(x)\n",
129
+ "\n",
130
+ "mlp_model = TierMLP(N_FEATURES, n_classes=N_CLASSES)\n",
131
+ "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
132
+ "mlp_model.eval()\n",
133
+ "print(\"models loaded\")"
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "markdown",
138
+ "metadata": {},
139
+ "source": [
140
+ "## 4. Prediction helper"
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "code",
145
+ "execution_count": null,
146
+ "metadata": {},
147
+ "outputs": [],
148
+ "source": [
149
+ "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
150
+ "SD = np.array(scaler[\"std\"], dtype=np.float32)\n",
151
+ "\n",
152
+ "def predict_threat_type(record: dict) -> dict:\n",
153
+ "    \"\"\"Predict the actor threat type for one per-timestep telemetry record.\"\"\"\n",
154
+ "    X = transform_single(record, meta)\n",
155
+ "\n",
156
+ "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
157
+ "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
158
+ "\n",
159
+ "    Xs = ((X - MU) / SD).astype(np.float32)\n",
160
+ "    with torch.no_grad():\n",
161
+ "        logits = mlp_model(torch.tensor(Xs))\n",
162
+ "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
163
+ "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
164
+ "\n",
165
+ "    return {\n",
166
+ "        \"xgboost\": {\n",
167
+ "            \"label\": xgb_label,\n",
168
+ "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
169
+ "        },\n",
170
+ "        \"mlp\": {\n",
171
+ "            \"label\": mlp_label,\n",
172
+ "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
173
+ "        },\n",
174
+ "    }"
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "markdown",
179
+ "metadata": {},
180
+ "source": [
181
+ "## 5. Run on an example record\n",
182
+ "\n",
183
+ "Real `exfiltration_attempt` event from the sample dataset: a privileged-insider incident at timestep 31, accessing 424 MB at a single step with internal-tier data and a moderate-risk DLP alert. Both models should predict `privileged_insider` (large per-step data volume is a strong privileged-insider signature)."
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": null,
189
+ "metadata": {},
190
+ "outputs": [],
191
+ "source": [
192
+ "# Real timestep record from the sample dataset (true tier: privileged_insider)\n",
193
+ "example_record = {\n",
194
+ " \"timestep\": 31,\n",
195
+ " \"incident_phase\": \"exfiltration_attempt\",\n",
196
+ " \"data_access_volume_mb\": 424.4688,\n",
197
+ " \"privilege_event_count\": 2,\n",
198
+ " \"communication_anomaly_score\": 0.407904,\n",
199
+ " \"dlp_confidence_score\": 0.652392,\n",
200
+ " \"detection_outcome\": \"moderate_risk_alert\",\n",
201
+ " \"exfiltration_volume_mb_cumulative\": 0.0,\n",
202
+ " \"behavioural_risk_score\": 0.301542,\n",
203
+ " \"target_data_sensitivity_tier\": \"internal\",\n",
204
+ "}\n",
205
+ "\n",
206
+ "result = predict_threat_type(example_record)\n",
207
+ "\n",
208
+ "print(f\"XGBoost -> {result['xgboost']['label']}\")\n",
209
+ "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
210
+ "    print(f\" P({lbl:25s}) = {p:.4f}\")\n",
211
+ "\n",
212
+ "print(f\"\\nMLP -> {result['mlp']['label']}\")\n",
213
+ "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
214
+ "    print(f\" P({lbl:25s}) = {p:.4f}\")"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "markdown",
219
+ "metadata": {},
220
+ "source": [
221
+ "### When the two models disagree\n",
222
+ "\n",
223
+ "XGBoost and the MLP can disagree on borderline cases — e.g. low-volume timesteps where a malicious employee might look similar to a negligent user, or early-stage timesteps before tier-distinguishing behaviour appears. In threat-investigation workflows, disagreement is a useful triage signal for human analyst review.\n",
224
+ "\n",
225
+ "Unusually for the XpertSystems baseline catalog, on CYB007 the **MLP slightly outperforms XGBoost** on the seed-42 held-out test split (accuracy 0.869 vs 0.853). The multi-seed evaluation covers XGBoost only. Both models are published; we recommend running both and treating disagreement as the triage signal."
226
+ ]
227
+ },
228
+ {
229
+ "cell_type": "markdown",
230
+ "metadata": {},
231
+ "source": [
232
+ "## 6. Batch prediction on the sample dataset"
233
+ ]
234
+ },
235
+ {
236
+ "cell_type": "code",
237
+ "execution_count": null,
238
+ "metadata": {},
239
+ "outputs": [],
240
+ "source": [
241
+ "from huggingface_hub import snapshot_download\n",
242
+ "import pandas as pd\n",
243
+ "\n",
244
+ "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb007-sample\", repo_type=\"dataset\")\n",
245
+ "traj = pd.read_csv(f\"{ds_path}/insider_trajectories.csv\")\n",
246
+ "\n",
247
+ "# Score the first 500 timesteps\n",
248
+ "sample = traj.head(500).copy()\n",
249
+ "preds = [predict_threat_type(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
250
+ "sample[\"xgb_pred\"] = preds\n",
251
+ "\n",
252
+ "ct = pd.crosstab(sample[\"actor_threat_type\"], sample[\"xgb_pred\"],\n",
253
+ "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
254
+ "print(\"Confusion on first 500 sample rows (XGBoost):\")\n",
255
+ "print(ct)\n",
256
+ "acc = (sample[\"actor_threat_type\"] == sample[\"xgb_pred\"]).mean()\n",
257
+ "print(f\"\\nbatch accuracy on first 500 rows (in-distribution): {acc:.4f}\")\n",
258
+ "print(\"\\nNote: these rows include training-set incidents. See validation_results.json\\n\"\n",
259
+ "      \"for proper held-out test metrics from disjoint incidents.\")"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {},
265
+ "source": [
266
+ "## 7. Next steps\n",
267
+ "\n",
268
+ "- See `validation_results.json` for held-out test metrics (75 disjoint incidents, ~4,875 timesteps).\n",
269
+ "- See `multi_seed_results.json` for the across-10-seeds robustness picture (accuracy 0.855 ± 0.012, ROC-AUC 0.961 ± 0.007).\n",
270
+ "- See `ablation_results.json` for per-feature-group contribution. **Volume features carry the dominant tier signal** (−36pp accuracy when removed) — this is the defining behavioural signature of privileged_insider tier.\n",
271
+ "- The model card documents the leakage audit on volume features (they are tier-correlated by design but have substantial distributional overlap — not oracles).\n",
272
+ "- For the full ~335k-row CYB007 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
273
+ ]
274
+ }
275
+ ],
276
+ "metadata": {
277
+ "kernelspec": {
278
+ "display_name": "Python 3",
279
+ "language": "python",
280
+ "name": "python3"
281
+ },
282
+ "language_info": {
283
+ "name": "python",
284
+ "version": "3.10"
285
+ }
286
+ },
287
+ "nbformat": 4,
288
+ "nbformat_minor": 5
289
+ }
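The notebook recommends treating disagreement between the two models as a triage signal. One way that rule could look, as a minimal sketch over the dictionary shape returned by `predict_threat_type`: the `triage` helper, the 0.60 threshold, and the example probabilities are all assumptions made for illustration and are not real model output.

```python
def triage(result, min_top_probability=0.60):
    """Flag a per-timestep prediction for analyst review (hypothetical helper)."""
    xgb_out, mlp_out = result["xgboost"], result["mlp"]
    agree = xgb_out["label"] == mlp_out["label"]
    # The weaker of the two models' top-class probabilities.
    weakest_top = min(
        max(xgb_out["probabilities"].values()),
        max(mlp_out["probabilities"].values()),
    )
    return {
        "agree": agree,
        "weakest_top_probability": weakest_top,
        "needs_review": (not agree) or weakest_top < min_top_probability,
    }


# Made-up probabilities for illustration only.
borderline = {
    "xgboost": {
        "label": "negligent_user",
        "probabilities": {"negligent_user": 0.48, "malicious_employee": 0.44,
                          "privileged_insider": 0.08},
    },
    "mlp": {
        "label": "malicious_employee",
        "probabilities": {"negligent_user": 0.41, "malicious_employee": 0.52,
                          "privileged_insider": 0.07},
    },
}
confident = {
    "xgboost": {
        "label": "privileged_insider",
        "probabilities": {"negligent_user": 0.02, "malicious_employee": 0.05,
                          "privileged_insider": 0.93},
    },
    "mlp": {
        "label": "privileged_insider",
        "probabilities": {"negligent_user": 0.03, "malicious_employee": 0.06,
                          "privileged_insider": 0.91},
    },
}

print(triage(borderline))
print(triage(confident))
```

Adding a confidence floor on top of plain disagreement is a design choice: two models can agree on a label while both being close to a coin flip, and those rows may deserve review as well.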
model_mlp.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d9f44841279c51825d878fe2cad129ff9ee9e7896bb50db68b370144a939ef39
3
+ size 52948
model_xgb.json ADDED
The diff for this file is too large to render. See raw diff
 
multi_seed_results.json ADDED
@@ -0,0 +1,98 @@
1
+ {
2
+ "purpose": "Multi-seed evaluation across 10 random splits of the 500 insider threat incidents. Reports XGBoost performance averaged over the full set of seeds for a robust performance picture.",
3
+ "seeds_evaluated": [
4
+ 42,
5
+ 7,
6
+ 13,
7
+ 17,
8
+ 23,
9
+ 31,
10
+ 45,
11
+ 99,
12
+ 123,
13
+ 200
14
+ ],
15
+ "per_seed": [
16
+ {
17
+ "seed": 42,
18
+ "test_n_classes": 3,
19
+ "accuracy": 0.8529230769230769,
20
+ "macro_f1": 0.8495931102241494,
21
+ "macro_roc_auc_ovr": 0.9627526877302969
22
+ },
23
+ {
24
+ "seed": 7,
25
+ "test_n_classes": 3,
26
+ "accuracy": 0.859897435897436,
27
+ "macro_f1": 0.8489366810370947,
28
+ "macro_roc_auc_ovr": 0.9706063404287054
29
+ },
30
+ {
31
+ "seed": 13,
32
+ "test_n_classes": 3,
33
+ "accuracy": 0.8473846153846154,
34
+ "macro_f1": 0.8308717142007808,
35
+ "macro_roc_auc_ovr": 0.9487669993321273
36
+ },
37
+ {
38
+ "seed": 17,
39
+ "test_n_classes": 3,
40
+ "accuracy": 0.8592820512820513,
41
+ "macro_f1": 0.8303962310053286,
42
+ "macro_roc_auc_ovr": 0.9599908480231973
43
+ },
44
+ {
45
+ "seed": 23,
46
+ "test_n_classes": 3,
47
+ "accuracy": 0.8734358974358974,
48
+ "macro_f1": 0.8422305585058111,
49
+ "macro_roc_auc_ovr": 0.9640019681906883
50
+ },
51
+ {
52
+ "seed": 31,
53
+ "test_n_classes": 3,
54
+ "accuracy": 0.8307692307692308,
55
+ "macro_f1": 0.8309747753220957,
56
+ "macro_roc_auc_ovr": 0.9592892734393724
57
+ },
58
+ {
59
+ "seed": 45,
60
+ "test_n_classes": 3,
61
+ "accuracy": 0.8541538461538462,
62
+ "macro_f1": 0.8389296586948394,
63
+ "macro_roc_auc_ovr": 0.9570438308416293
64
+ },
65
+ {
66
+ "seed": 99,
67
+ "test_n_classes": 3,
68
+ "accuracy": 0.8689230769230769,
69
+ "macro_f1": 0.8596390692085856,
70
+ "macro_roc_auc_ovr": 0.9717446089452725
71
+ },
72
+ {
73
+ "seed": 123,
74
+ "test_n_classes": 3,
75
+ "accuracy": 0.8588717948717949,
76
+ "macro_f1": 0.828584805246768,
77
+ "macro_roc_auc_ovr": 0.9537031820223459
78
+ },
79
+ {
80
+ "seed": 200,
81
+ "test_n_classes": 3,
82
+ "accuracy": 0.8432820512820512,
83
+ "macro_f1": 0.8288042228202444,
84
+ "macro_roc_auc_ovr": 0.9638969223068744
85
+ }
86
+ ],
87
+ "aggregate": {
88
+ "accuracy_mean": 0.8548923076923078,
89
+ "accuracy_std": 0.011740503457401963,
90
+ "accuracy_min": 0.8307692307692308,
91
+ "accuracy_max": 0.8734358974358974,
92
+ "macro_f1_mean": 0.8388960826265699,
93
+ "macro_f1_std": 0.010315931931384944,
94
+ "roc_auc_mean": 0.961179666126051,
95
+ "roc_auc_std": 0.006710943228986276
96
+ },
97
+ "published_artifact_seed": 42
98
+ }
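The `aggregate` block can be reproduced from the `per_seed` accuracies. The sketch below uses the standard library and assumes the reported spread is a population standard deviation (`pstdev`), which is what the published figure appears to match; accuracies are rounded to ten decimals when copied from the file above.

```python
from statistics import fmean, pstdev

# Per-seed XGBoost test accuracies, in the order listed in seeds_evaluated.
accuracies = [
    0.8529230769, 0.8598974359, 0.8473846154, 0.8592820513, 0.8734358974,
    0.8307692308, 0.8541538462, 0.8689230769, 0.8588717949, 0.8432820513,
]

mean_acc = fmean(accuracies)
std_acc = pstdev(accuracies)  # population std (divide by n), assumed convention

print(f"accuracy {mean_acc:.3f} +/- {std_acc:.3f}")
print(f"range {min(accuracies):.3f} to {max(accuracies):.3f}")
```

With only ten seeds, the sample standard deviation (divide by n - 1) would be slightly larger, so the reported spread is best read as a description of these ten splits.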
validation_results.json ADDED
@@ -0,0 +1,121 @@
1
+ {
2
+ "version": "1.0.0",
3
+ "dataset": "xpertsystems/cyb007-sample",
4
+ "task": "3-class actor_threat_type classification",
5
+ "baselines": {
6
+ "always_predict_majority_accuracy": 0.4266666666666667,
7
+ "majority_class": "negligent_user",
8
+ "random_guess_accuracy": 0.3333333333333333
9
+ },
10
+ "split": {
11
+ "strategy": "group_aware (GroupShuffleSplit by incident_id, nested)",
12
+ "rationale": "500 insider threat incidents generate 32,500 timesteps (65 per incident). Random row-split would leak per-incident correlations into the test fold. Group-aware split keeps train/val/test incidents disjoint.",
13
+ "incidents_train": 350,
14
+ "incidents_val": 75,
15
+ "incidents_test": 75,
16
+ "timesteps_train": 22750,
17
+ "timesteps_val": 4875,
18
+ "timesteps_test": 4875,
19
+ "seed": 42
20
+ },
21
+ "n_features": 28,
22
+ "label_classes": [
23
+ "negligent_user",
24
+ "malicious_employee",
25
+ "privileged_insider"
26
+ ],
27
+ "class_distribution_train": {
28
+ "negligent_user": 11895,
29
+ "malicious_employee": 6370,
30
+ "privileged_insider": 4485
31
+ },
32
+ "class_distribution_test": {
33
+ "negligent_user": 2080,
34
+ "malicious_employee": 1755,
35
+ "privileged_insider": 1040
36
+ },
37
+ "leakage_excluded_features": [],
38
+ "leakage_audit_notes": "Two features were audited as potential tier oracles: data_access_volume_mb (privileged 0-2541 MB, malicious 0-328, negligent 0-88; overlap [0, 88] covers most timesteps with median ~9 MB each) and exfiltration_volume_mb_cumulative (similar shape). Both have substantial distributional overlap across tiers and represent legitimate observables. Removing both features drops accuracy from 0.85 to 0.47 (close to the 0.43 majority baseline), confirming they carry real signal; the distributional overlap is what argues against oracle leakage. detection_outcome is a near-oracle for INCIDENT_PHASE (purity 0.79, max 1.00 for reconnaissance) but has uniform purity vs tier (~0.50) and is kept as a feature for tier prediction. No features dropped.",
39
+ "models": {
40
+ "xgboost": {
41
+ "architecture": "Gradient-boosted decision trees, multi:softprob, 3 classes",
42
+ "framework": "xgboost",
43
+ "test_metrics": {
44
+ "model": "xgboost",
45
+ "accuracy": 0.8529230769230769,
46
+ "macro_f1": 0.8495931102241494,
47
+ "weighted_f1": 0.8518585237469937,
48
+ "per_class_f1": {
49
+ "negligent_user": 0.8762557077625571,
50
+ "malicious_employee": 0.8262571514604035,
51
+ "privileged_insider": 0.8462664714494875
52
+ },
53
+ "confusion_matrix": {
54
+ "labels": [
55
+ "negligent_user",
56
+ "malicious_employee",
57
+ "privileged_insider"
58
+ ],
59
+ "matrix": [
60
+ [
61
+ 1919,
62
+ 111,
63
+ 50
64
+ ],
65
+ [
66
+ 291,
67
+ 1372,
68
+ 92
69
+ ],
70
+ [
71
+ 90,
72
+ 83,
73
+ 867
74
+ ]
75
+ ]
76
+ },
77
+ "macro_roc_auc_ovr": 0.9627526877302969
78
+ }
79
+ },
80
+ "mlp": {
81
+ "architecture": "PyTorch MLP, 28 -> 128 -> 64 -> 3, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
82
+ "framework": "pytorch",
83
+ "test_metrics": {
84
+ "model": "mlp",
85
+ "accuracy": 0.8685128205128205,
86
+ "macro_f1": 0.8636019696274673,
87
+ "weighted_f1": 0.866725739844854,
88
+ "per_class_f1": {
89
+ "negligent_user": 0.8934753661784287,
90
+ "malicious_employee": 0.8414481897627965,
91
+ "privileged_insider": 0.8558823529411764
92
+ },
93
+ "confusion_matrix": {
94
+ "labels": [
95
+ "negligent_user",
96
+ "malicious_employee",
97
+ "privileged_insider"
98
+ ],
99
+ "matrix": [
100
+ [
101
+ 2013,
102
+ 22,
103
+ 45
104
+ ],
105
+ [
106
+ 325,
107
+ 1348,
108
+ 82
109
+ ],
110
+ [
111
+ 88,
112
+ 79,
113
+ 873
114
+ ]
115
+ ]
116
+ },
117
+ "macro_roc_auc_ovr": 0.9660800234091633
118
+ }
119
+ }
120
+ }
121
+ }
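The headline XGBoost metrics can be re-derived from the confusion matrix alone, which is a quick consistency check on the file. In the sketch below rows are taken to be true labels and columns predictions; that reading is an inference from the row sums matching `class_distribution_test`, not something the file states.

```python
labels = ["negligent_user", "malicious_employee", "privileged_insider"]

# XGBoost confusion matrix at seed 42 (rows: true label, columns: prediction).
matrix = [
    [1919, 111, 50],
    [291, 1372, 92],
    [90, 83, 867],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

per_class_f1 = {}
for i, label in enumerate(labels):
    true_positive = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                    # row sum
    # F1 = 2*TP / (predicted + actual), equivalent to the harmonic mean
    # of precision and recall.
    per_class_f1[label] = 2 * true_positive / (predicted + actual)

macro_f1 = sum(per_class_f1.values()) / len(labels)
majority_baseline = max(sum(row) for row in matrix) / total

print(f"accuracy          {accuracy:.4f}")
print(f"macro F1          {macro_f1:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
```

The same loop applied to the MLP matrix gives its figures; the off-diagonal cell for malicious_employee predicted as negligent_user is the largest error for both models.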