---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- insider-threat
- ueba
- data-exfiltration
- dlp
- privileged-access
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb007-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb007-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 3-class insider threat type classification
    dataset:
      type: xpertsystems/cyb007-sample
      name: CYB007 Synthetic Insider Threat Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9628
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.8529
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.8496
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.855
      name: Multi-seed accuracy mean ± 0.012 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.961
      name: Multi-seed ROC-AUC mean ± 0.007 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.9661
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.8685
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.8636
      name: Test macro-F1 (MLP, seed 42)
---
# CYB007 Baseline Classifier
**Insider-threat type classifier trained on the CYB007 synthetic
insider-threat sample. Predicts which of 3 actor types
(`negligent_user` / `malicious_employee` / `privileged_insider`) is
behind an observed insider incident from per-timestep trajectory
telemetry.**
> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB007 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb007-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point for insider-threat detection research. It is not a production
> UEBA system, DLP engine, or HR-investigation tool. See [Limitations](#limitations).
## Model overview
| Property | Value |
|---|---|
| Task | 3-class actor_threat_type classification |
| Training data | `xpertsystems/cyb007-sample` (32,500 timesteps across 500 incidents) |
| Models | XGBoost + PyTorch MLP |
| Input features | 28 (after one-hot encoding) |
| Split | **Group-aware by incident_id** (disjoint train/val/test incidents) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |
## Why this task — CYB007 ships the README's stated headline use case
This is the second XpertSystems baseline (after CYB005) that ships
the **dataset's stated headline use case** rather than pivoting away
from it. The CYB007 README's first suggested use case is "training
insider threat classifier models (4-tier actor attribution)", and
that is the task this baseline trains on (with one schema correction:
the sample data contains 3 of the 4 tiers — `compromised_account` is
absent from the sample).
CYB003 (malware family), CYB004 (phishing actor tier), and CYB006
(threat-actor tier) all had to pivot away from their README headline
targets: n=100 groups is not enough to support group-aware tier
classification, and CYB006 in particular had structural distributional
leakage. CYB007's sample of 500 incidents (matching CYB005's profile of
500 campaigns × 75 timesteps) is large enough that tier attribution can
be learned under group-aware splitting, with no oracle features and a
multi-seed accuracy std of just 0.012.
Two model artifacts are published. They are designed to be used
together — disagreement is a useful triage signal. **Unusually for the
XpertSystems baseline catalog, on CYB007 the MLP slightly outperforms
XGBoost on the test fold** (0.869 vs 0.853 accuracy at seed 42, 0.966
vs 0.963 ROC-AUC):
- `model_xgb.json` — gradient-boosted trees
- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
## Quick start
```bash
pip install xgboost torch safetensors pandas huggingface_hub
```
```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

# Not used below; only needed if you also load the MLP artifact and its scaler.
import json, torch
from safetensors.torch import load_file

REPO = "xpertsystems/cyb007-baseline-classifier"

# Download the artifacts; hf_hub_download returns the local cached path of each file.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the bundled feature pipeline importable from the download directory.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_timestep_record` is a placeholder for one raw per-timestep record that you supply.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```
See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.
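
Because the two artifacts are meant to be used together, one simple way to act on disagreement is to compare their predicted classes for the same record. The sketch below is illustrative only: the helper `triage`, the label order in `LABELS`, and the probability vectors are hypothetical and not taken from the repo (the real mapping is `INT_TO_LABEL` in `feature_engineering.py`).

```python
# Hypothetical label order for illustration; the real mapping lives in
# INT_TO_LABEL inside feature_engineering.py.
LABELS = ["negligent_user", "malicious_employee", "privileged_insider"]


def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def triage(xgb_proba, mlp_proba, labels=LABELS):
    """Compare two class-probability vectors for the same record.

    Returns both predicted labels and a flag that is True when the
    models disagree, which can be used to route the record for review.
    """
    xgb_label = labels[argmax(xgb_proba)]
    mlp_label = labels[argmax(mlp_proba)]
    return {"xgb": xgb_label, "mlp": mlp_label, "disagree": xgb_label != mlp_label}


# Made-up probabilities, not model output.
agree = triage([0.70, 0.20, 0.10], [0.60, 0.30, 0.10])
differ = triage([0.45, 0.40, 0.15], [0.30, 0.55, 0.15])
print(agree["disagree"], differ["disagree"])
```

Records where `disagree` is true could be queued for manual review; what threshold or policy to apply is left to the reader.
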
## Training data
Trained on the public sample of CYB007, 32,500 per-timestep telemetry
rows from 500 insider threat incidents (65 timesteps per incident):
| Tier | Incidents | Timestep rows | Class share |
|---|---:|---:|---:|
| `negligent_user` | 250 | 16,250 | 50.0% |
| `malicious_employee` | 150 | 9,750 | 30.0% |
| `privileged_insider` | 100 | 6,500 | 20.0% |
### Group-aware split
A single incident generates 65 highly-correlated timesteps. Random
row-level splitting would put timesteps from the same incident in both
train and test, inflating metrics in a way that does not generalize to
new incidents.
This release uses **GroupShuffleSplit by `incident_id`** (nested,
70/15/15):
| Fold | Incidents | Timesteps |
|---|---:|---:|
| Train | 350 | 22,750 |
| Validation | 75 | 4,875 |
| Test | 75 | 4,875 |
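
The principle behind the split can be sketched without any dependencies. The release uses scikit-learn's `GroupShuffleSplit`; the version below swaps in a plain shuffle of incident IDs to show the same idea, so it will not reproduce the published folds. The helper name `split_incidents` and the `INC-0000` ID format are hypothetical.

```python
import random


def split_incidents(incident_ids, seed=42, train_frac=0.70, val_frac=0.15):
    """Shuffle unique incident IDs and cut them into train/val/test sets.

    Splitting on IDs (not rows) keeps every timestep of an incident
    in exactly one fold.
    """
    ids = sorted(set(incident_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test


# 500 incidents x 65 timesteps, mirroring the shape of the sample.
rows = [(f"INC-{i:04d}", t) for i in range(500) for t in range(65)]
train_ids, val_ids, test_ids = split_incidents(r[0] for r in rows)

# Assign each row to a fold by its incident ID and count rows per fold.
fold_sizes = {
    "train": sum(r[0] in train_ids for r in rows),
    "val": sum(r[0] in val_ids for r in rows),
    "test": sum(r[0] in test_ids for r in rows),
}
print(fold_sizes)
```

Because every incident has the same number of timesteps, splitting 500 IDs 350/75/75 yields the row counts in the table.
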
All test incidents are completely unseen during training. Class
imbalance is addressed with balanced class weights, passed to XGBoost
as per-row `sample_weight`, and with weighted cross-entropy for the MLP.
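
For reference, the "balanced" heuristic as defined by scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies it to the full-sample row counts from the training-data table. It assumes this is the formula used here; the per-class counts of the training fold are not published, so the weights actually used may differ. `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}


# Full-sample timestep rows per tier (the training fold may differ).
counts = {
    "negligent_user": 16_250,
    "malicious_employee": 9_750,
    "privileged_insider": 6_500,
}
weights = balanced_class_weights(counts)
for tier, w in weights.items():
    print(f"{tier}: {w:.3f}")

# Per-row sample weights are then a lookup on each row's label.
labels = ["negligent_user", "privileged_insider", "malicious_employee"]
sample_weight = [weights[y] for y in labels]
```

The minority class receives the largest weight, so each tier contributes roughly equally to the loss.
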
## Feature pipeline
The bundled `feature_engineering.py` is the canonical feature recipe.
28 features survive after encoding, drawn from:
- **Per-timestep numeric** (7): `timestep`, `data_access_volume_mb`, `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `exfiltration_volume_mb_cumulative`, `behavioural_risk_score`
- **Per-timestep categorical** (3, one-hot): `incident_phase` (8 values), `detection_outcome` (4 values), `target_data_sensitivity_tier` (3 values)
- **Engineered** (6): `log_data_volume`, `log_cumulative_exfil`, `exfil_velocity`, `is_privileged_event`, `risk_x_dlp_composite`, `is_late_stage`
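
As a quick check, the group widths listed above add up to the 28 model inputs. The sketch also shows a generic one-hot encoder over a fixed level order, which is how a stable column layout is usually kept. The level names are placeholders (the real ones are stored in `feature_meta.json`), and mapping unknown values to all zeros is an assumption of this sketch, not documented behaviour of `feature_engineering.py`.

```python
# Widths of each feature group as listed above.
NUMERIC = 7
CATEGORICAL_LEVELS = {
    "incident_phase": 8,
    "detection_outcome": 4,
    "target_data_sensitivity_tier": 3,
}
ENGINEERED = 6

total = NUMERIC + sum(CATEGORICAL_LEVELS.values()) + ENGINEERED
print(total)


def one_hot(value, levels):
    """Encode `value` against a fixed, ordered list of levels.

    In this sketch unknown values map to all zeros, so the output
    width never changes.
    """
    return [1.0 if value == level else 0.0 for level in levels]


# Placeholder level names; the real ones are stored in feature_meta.json.
tier_levels = ["tier_a", "tier_b", "tier_c"]
encoded = one_hot("tier_b", tier_levels)
print(encoded)
```
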
### Leakage audit
Two features have strongly tier-correlated means but with substantial
distributional overlap. **Neither was dropped**:
| Feature | Distribution by tier | Verdict |
|---|---|---|
| `data_access_volume_mb` | negligent [0, 88] mean 14 / malicious [0, 328] mean 44 / privileged [0, 2541] mean 302; median ~9 MB for all three | Massive overlap in [0, 88]; real signal, not oracle. KEEP. |
| `exfiltration_volume_mb_cumulative` | negligent [0, ~50] mean 5 / malicious [0, ~500] mean 90 / privileged [0, ~10000] mean 818 | Heavy-tailed with overlap in low-quantile region. KEEP. |
The honest test: dropping both features collapses accuracy from 0.85
to 0.47 (below the 0.50 majority baseline). This confirms they carry
legitimate discriminative signal that **defines what `privileged_insider`
means** — a privileged user with elevated data access — rather than
being an oracle leak.
`detection_outcome` is a near-oracle for **incident phase** (purity
0.79, max 1.00 for reconnaissance which is 100% `suppressed`). But its
purity vs **tier** is uniform (~0.50 across all tiers), so it has no
oracle relationship to the target. KEEP.
No columns dropped for this task.
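
The card does not spell out how purity is computed. One common definition, assumed here, is the share of the most frequent target class among the rows that take a given feature value. The sketch below uses toy rows, not dataset rows; the values `alerted` and `exfiltration` are hypothetical, and `purity_by_value` is an illustrative helper.

```python
from collections import Counter, defaultdict


def purity_by_value(feature_values, targets):
    """For each feature value, return the share of its most common target."""
    by_value = defaultdict(Counter)
    for value, target in zip(feature_values, targets):
        by_value[value][target] += 1
    return {
        value: max(counts.values()) / sum(counts.values())
        for value, counts in by_value.items()
    }


# Toy rows, not drawn from the dataset.
outcome = ["suppressed"] * 4 + ["alerted"] * 2
phase = ["reconnaissance"] * 4 + ["exfiltration"] * 2
tier = [
    "negligent_user", "malicious_employee", "negligent_user",
    "privileged_insider", "negligent_user", "malicious_employee",
]

purity_vs_phase = purity_by_value(outcome, phase)
purity_vs_tier = purity_by_value(outcome, tier)
print(purity_vs_phase)
print(purity_vs_tier)
```

A purity near 1.0 means the feature value almost determines the target. A purity close to the majority-class share means it adds little beyond the class prior.
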
## Evaluation
### Test-set metrics, seed 42 (n = 4,875 timesteps from 75 disjoint incidents)
**XGBoost** (the published `model_xgb.json` artifact)
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9628** |
| Accuracy | **0.8529** |
| Macro-F1 | 0.8496 |
| Weighted-F1 | 0.8543 |
**MLP** (the published `model_mlp.safetensors` artifact) — **slightly outperforms XGBoost**
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9661** |
| Accuracy | **0.8685** |
| Macro-F1 | 0.8636 |
| Weighted-F1 | 0.8682 |
The MLP outperforming XGBoost is unusual for tabular data and unusual
within the XpertSystems baseline catalog — CYB001–CYB006 all had
XGBoost ahead. With 22,750 training rows and only 28 features, the
MLP has enough data to fit cleanly and the tabular advantage of trees
is reduced. Both models are published.
### Multi-seed robustness (XGBoost, 10 seeds)
Very stable performance — std 0.012 on accuracy is among the tightest
in the XpertSystems catalog:
| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.855 | 0.012 | 0.831 | 0.873 |
| Macro-F1 | 0.839 | 0.010 | 0.829 | 0.860 |
| Macro ROC-AUC OvR | 0.961 | 0.007 | 0.949 | 0.972 |
Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 3 tiers in the test fold.
### Per-class F1 (seed 42)
| Tier | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `negligent_user` | 50% | 0.876 | 0.894 |
| `privileged_insider` | 20% | 0.846 | 0.856 |
| `malicious_employee` | 30% | 0.826 | 0.841 |
The model performs evenly across all three tiers, with no class collapse.
`negligent_user`, the majority class, scores highest, but
`privileged_insider` is close behind (F1 0.846 XGBoost, 0.856 MLP)
despite being the minority class (20%). This suggests that the
volume-based behavioural signature (sustained large data access) is
reliably discriminative. `malicious_employee` is marginally the hardest
tier because it occupies a middle zone: more aggressive than negligent
users but without the access volumes that distinguish privileged insiders.
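
Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline figures can be approximately recovered from the table above. The sketch uses the three-decimal values as printed, which is why the results differ from the reported 0.8496 and 0.8636 in the fourth decimal place.

```python
from statistics import mean

# Per-class F1 at seed 42, copied from the table above.
per_class_f1 = {
    "xgboost": {"negligent_user": 0.876, "privileged_insider": 0.846, "malicious_employee": 0.826},
    "mlp": {"negligent_user": 0.894, "privileged_insider": 0.856, "malicious_employee": 0.841},
}

# Macro-F1 gives every class equal weight regardless of its share of rows.
macro_f1 = {model: mean(scores.values()) for model, scores in per_class_f1.items()}
for model, value in macro_f1.items():
    print(f"{model}: {value:.4f}")
```
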
### Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.8529 | 0.8496 | 0.9628 | — |
| No volume features | 0.4890 | 0.4736 | 0.6828 | **−0.3639** |
| No behavioural features | 0.7126 | 0.7055 | 0.8961 | −0.1403 |
| No `timestep` | 0.8394 | 0.8336 | 0.9569 | −0.0135 |
| No context features | 0.8544 | 0.8490 | 0.9632 | +0.0015 |
| No engineered features | 0.8597 | 0.8560 | 0.9629 | +0.0068 |
Four findings:
1. **Volume features carry the overwhelmingly dominant signal**
(drops 36 pp accuracy, 28 pp ROC-AUC when removed). This is by
design — privileged insiders are *defined* by access to large
data volumes, and the synthetic generator models this faithfully.
2. **Behavioural features (privilege events, communication anomaly,
DLP confidence, risk scores) contribute 14 pp accuracy.** They
add a second axis of discrimination beyond pure volume.
3. **`timestep` contributes only 1 pp.** Tier attribution is largely
invariant to where in the incident lifecycle you are — different
from phase prediction, which is strongly timestep-driven.
4. **Context features (incident_phase, sensitivity tier) and
engineered composites are recovered by the trees from raw inputs.**
They are retained in the pipeline as a documented baseline reference
but contribute essentially zero on their own.
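
The Δ accuracy column is each ablated accuracy minus the full-feature accuracy. The short sketch below recomputes it from the accuracy values in the table and ranks the feature groups by impact. The variable names are illustrative.

```python
# Accuracy per configuration, copied from the ablation table.
BASELINE = 0.8529
ablated_accuracy = {
    "No volume features": 0.4890,
    "No behavioural features": 0.7126,
    "No timestep": 0.8394,
    "No context features": 0.8544,
    "No engineered features": 0.8597,
}

# Delta = ablated accuracy minus full-feature accuracy.
# A negative delta means removing the group hurt, i.e. the group helped.
deltas = {name: round(acc - BASELINE, 4) for name, acc in ablated_accuracy.items()}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<26} {delta:+.4f} ({delta * 100:+.1f} pp)")
```
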
### Architecture
**XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.
**MLP:** `28 → 128 → 64 → 3`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
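
The MLP description above maps onto a small `torch.nn.Sequential`. The following is a minimal sketch of that layout with randomly initialised weights. The helper `build_mlp` and the layer naming are assumptions: the parameter names in `model_mlp.safetensors` are not documented here, so loading the published weights into this module may require remapping keys.

```python
import torch
from torch import nn


def build_mlp(n_features=28, hidden=(128, 64), n_classes=3, dropout=0.3):
    """Linear -> BatchNorm1d -> ReLU -> Dropout per hidden layer, then a linear head."""
    layers = []
    width = n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)


model = build_mlp()
model.eval()  # disable dropout and use BatchNorm running statistics

with torch.no_grad():
    logits = model(torch.randn(4, 28))  # 4 random rows standing in for scaled features
    proba = torch.softmax(logits, dim=1)

n_params = sum(p.numel() for p in model.parameters())
print(tuple(proba.shape), n_params)
```

At this size the network has roughly 12.5k trainable parameters, which is small relative to the 22,750 training rows.
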
Training hyperparameters are held internally by XpertSystems.
## Limitations
**This is a baseline reference, not a production insider-threat detection system.**
1. **The dataset has 3 tiers, not 4.** The CYB007 README claims a
4-tier scheme including `compromised_account` but the sample
contains only `negligent_user`, `malicious_employee`, and
`privileged_insider`. If your work requires the 4th tier, request
regeneration.
2. **Volume-feature dominance is a property of the dataset.** Real
insider-threat telemetry has more variance — some negligent users
accidentally trigger large data downloads, some privileged
insiders work patiently with small transfers. The sample's
per-tier volume distributions overlap, but not as much as in real
environments. Buyers should test the model on their own data
before assuming the seed-42 test accuracy (0.853 XGBoost, 0.869 MLP) transfers.
3. **MLP modestly outperforms XGBoost.** With 22,750 training rows,
the MLP has enough data to compete favorably. On smaller training
sets (n < 1k rows) we would expect XGBoost to be stronger.
4. **Synthetic-vs-real transfer.** The dataset is synthetic and
calibrated to insider-threat research benchmarks (CERT Insider
Threat Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon
Institute, MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix,
Forrester UEBA, Gartner ZTNA, CrowdStrike, Mandiant). Real
insider telemetry has different noise characteristics, and
adversarial insiders may deliberately mimic negligent-user
patterns. Do not assume metrics transfer.
5. **Adversarial robustness not evaluated.** The dataset does not
simulate insiders deliberately spoofing a different tier's
behavioural footprint to evade attribution.
6. **The 75-incident test fold is small.** A multi-seed
std of 0.012 on accuracy suggests the metric is stable across splits, but
confidence intervals for downstream production decisions should
come from the full ~4,800-incident product.
## Notes on dataset schema
The CYB007 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| 4 actor tiers including `compromised_account` | **3 tiers only**: `negligent_user`, `malicious_employee`, `privileged_insider`. No `compromised_account` rows in the sample. |
| 6 incident phases | **8 phases**: adds `idle_dwell` and `lateral_access` to the 6 documented |
| Per-timestep columns: `payload_entropy`, `cover_actions_taken`, `dlp_alerts_raised`, `detection_flag`, `blast_radius`, `sensitive_data_accessed`, `threat_type_tier` | Actual per-timestep columns: `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `detection_outcome` (categorical 4-value, not boolean), `behavioural_risk_score`, `target_data_sensitivity_tier`, `actor_threat_type` |
| Summary field `ueba_status` | Actual field is `ueba_deployment_status` (only on `org_topology.csv`, not on `insider_trajectories.csv` or `incident_summary.csv`) |
| Summary field `collusion_flag` | Actual: `coordinated_incident_flag` |
| Summary field `lateral_access_flag` | Actual: `lateral_access_count` (not boolean) |
| Summary field `sabotage_flag` | Actual: `sabotage_events_executed` (count) |
| Summary field `cover_tracks_flag` | Actual: `cover_tracks_events` (count) |
| Summary field `hr_trigger_flag` | Actual: `hr_case_triggers_caused` (count) |
| Summary field `exfiltration_success_flag` | Actual: `exfiltration_successes` (count) and `exfiltration_success_rate` (float) |
| Summary field `dwell_time_ratio` | Not present in summary; `actor_efficiency_score` is the closest analog |
None of these affects model correctness — the feature pipeline uses
the actual column names. If you build your own pipeline against the
dataset, use the actual columns.
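
If existing code was written against the documented names, a small lookup can translate them. The mapping below is copied from the one-to-one rows of the table above; `README_TO_ACTUAL` and `resolve_columns` are hypothetical names, and fields without a single direct equivalent (`exfiltration_success_flag`, `dwell_time_ratio`) are deliberately left out.

```python
# Documented summary-field name -> actual column name, from the table above.
# Only one-to-one renames are listed; fields with no single equivalent are omitted.
README_TO_ACTUAL = {
    "ueba_status": "ueba_deployment_status",
    "collusion_flag": "coordinated_incident_flag",
    "lateral_access_flag": "lateral_access_count",
    "sabotage_flag": "sabotage_events_executed",
    "cover_tracks_flag": "cover_tracks_events",
    "hr_trigger_flag": "hr_case_triggers_caused",
}


def resolve_columns(requested):
    """Translate documented names to actual ones; pass other names through."""
    return [README_TO_ACTUAL.get(name, name) for name in requested]


resolved = resolve_columns(["collusion_flag", "sabotage_flag", "actor_efficiency_score"])
print(resolved)
```

Several of the actual fields are counts instead of booleans, so downstream logic that tests for `True`/`False` needs adjusting as well as the names.
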
## Intended use
- **Evaluating fit** of the CYB007 dataset for your insider-threat
research
- **Baseline reference** for new model architectures (sequence models,
graph models considering collusion structure)
- **Teaching and demo** for multi-class tabular classification on
insider-threat telemetry
- **Feature engineering reference** for per-timestep insider activity
## Out-of-scope use
- Production insider-threat detection on real telemetry
- HR investigation or employment decisions
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational or legal decision affecting actual persons
## Reproducibility
Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by incident_id), on
the published sample (`xpertsystems/cyb007-sample`, version 1.0.0,
generated 2026-05-16). The feature pipeline in `feature_engineering.py`
is deterministic and the trained weights in this repo correspond
exactly to the metrics above.
Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
`multi_seed_results.json` confirm robust performance across splits.
The training script itself is private to XpertSystems.
## Files in this repo
| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |
## Contact and full product
The full **CYB007** dataset contains ~335,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative insider-threat research sources (CERT Insider Threat
Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon Institute,
MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, Forrester UEBA,
Gartner ZTNA, CrowdStrike, Mandiant M-Trends). The full
XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.
- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb007-sample
- 🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
- https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
- https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
- https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
- https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic)
## Citation
```bibtex
@misc{xpertsystems_cyb007_baseline_2026,
title = {CYB007 Baseline Classifier: XGBoost and MLP for Insider Threat Type Classification},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb007-baseline-classifier},
note = {Baseline reference model trained on xpertsystems/cyb007-sample}
}
```