| --- |
| license: cc-by-nc-4.0 |
| library_name: pytorch |
| tags: |
| - cybersecurity |
| - insider-threat |
| - ueba |
| - data-exfiltration |
| - dlp |
| - privileged-access |
| - tabular-classification |
| - synthetic-data |
| - xgboost |
| - baseline |
| pipeline_tag: tabular-classification |
| base_model: [] |
| datasets: |
| - xpertsystems/cyb007-sample |
| metrics: |
| - accuracy |
| - f1 |
| - roc_auc |
| model-index: |
| - name: cyb007-baseline-classifier |
|   results: |
|   - task: |
|       type: tabular-classification |
|       name: 3-class insider threat type classification |
|     dataset: |
|       type: xpertsystems/cyb007-sample |
|       name: CYB007 Synthetic Insider Threat Dataset (Sample) |
|     metrics: |
|     - type: roc_auc |
|       value: 0.9628 |
|       name: Test macro ROC-AUC OvR (XGBoost, seed 42) |
|     - type: accuracy |
|       value: 0.8529 |
|       name: Test accuracy (XGBoost, seed 42) |
|     - type: f1 |
|       value: 0.8496 |
|       name: Test macro-F1 (XGBoost, seed 42) |
|     - type: accuracy |
|       value: 0.855 |
|       name: Multi-seed accuracy mean ± 0.012 (XGBoost, 10 seeds) |
|     - type: roc_auc |
|       value: 0.961 |
|       name: Multi-seed ROC-AUC mean ± 0.007 (XGBoost, 10 seeds) |
|     - type: roc_auc |
|       value: 0.9661 |
|       name: Test macro ROC-AUC OvR (MLP, seed 42) |
|     - type: accuracy |
|       value: 0.8685 |
|       name: Test accuracy (MLP, seed 42) |
|     - type: f1 |
|       value: 0.8636 |
|       name: Test macro-F1 (MLP, seed 42) |
| --- |
| |
| # CYB007 Baseline Classifier |
|
|
| **Insider-threat type classifier trained on the CYB007 synthetic |
| insider-threat sample. Predicts which of 3 actor types |
| (`negligent_user` / `malicious_employee` / `privileged_insider`) is |
| behind an observed insider incident from per-timestep trajectory |
| telemetry.** |
| |
| > **Baseline reference, not for production use.** This model demonstrates |
| > that the [CYB007 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb007-sample) |
| > is learnable end-to-end and gives prospective buyers a working starting |
| > point for insider-threat detection research. It is not a production |
| > UEBA system, DLP engine, or HR-investigation tool. See [Limitations](#limitations). |
| |
| ## Model overview |
| |
| | Property | Value | |
| |---|---| |
| | Task | 3-class actor_threat_type classification | |
| | Training data | `xpertsystems/cyb007-sample` (32,500 timesteps across 500 incidents) | |
| | Models | XGBoost + PyTorch MLP | |
| | Input features | 28 (after one-hot encoding) | |
| | Split | **Group-aware by incident_id** (disjoint train/val/test incidents) | |
| | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds | |
| | License | CC-BY-NC-4.0 (matches dataset) | |
| | Status | Reference baseline | |
|
|
| ## Why this task — CYB007 ships the README's stated headline use case |
|
|
| This is the second XpertSystems baseline (after CYB005) that ships |
| the **dataset's stated headline use case** rather than pivoting away |
| from it. The CYB007 README's first suggested use case is "training |
| insider threat classifier models (4-tier actor attribution)", and |
| that is the task this baseline trains on (with one schema correction: |
| the sample data contains 3 of the 4 tiers — `compromised_account` is |
| absent from the sample). |
|
|
| CYB003 (malware family), CYB004 (phishing actor tier), and CYB006 |
| (threat-actor tier) all had to pivot away from their README headline |
| targets — n=100 groups isn't enough to support group-aware tier |
| classification, and CYB006 in particular had structural distributional |
| leakage. CYB007's 500 incidents (matching CYB005's profile of 500 |
| campaigns × 75 timesteps) is large enough that tier attribution learns |
| honestly under group-aware splitting, with no oracle features and |
| multi-seed std of just 0.012. |
|
|
| Two model artifacts are published. They are designed to be used |
| together — disagreement is a useful triage signal. **Unusually for the |
| XpertSystems baseline catalog, on CYB007 the MLP slightly outperforms |
| XGBoost on the test fold** (0.869 vs 0.853 accuracy at seed 42, 0.966 |
| vs 0.963 ROC-AUC): |
|
|
| - `model_xgb.json` — gradient-boosted trees |
| - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format |
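
The triage idea above can be sketched in a few lines. This is a minimal illustration, not part of the published pipeline: the function name `triage_disagreements` and the toy probabilities are assumptions, and the inputs are taken to be per-timestep class-probability arrays from the two models.

```python
import numpy as np

def triage_disagreements(proba_xgb, proba_mlp):
    """Return both models' predicted class indices and a mask of rows where they differ."""
    proba_xgb = np.asarray(proba_xgb, dtype=float)
    proba_mlp = np.asarray(proba_mlp, dtype=float)
    pred_xgb = proba_xgb.argmax(axis=1)
    pred_mlp = proba_mlp.argmax(axis=1)
    return pred_xgb, pred_mlp, pred_xgb != pred_mlp

# Toy probabilities for three timesteps (made-up numbers, 3 classes each).
proba_xgb = np.array([[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.2, 0.3, 0.5]])
proba_mlp = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]])

pred_xgb, pred_mlp, disagree = triage_disagreements(proba_xgb, proba_mlp)
print(np.flatnonzero(disagree))  # indices of timesteps worth a second look
```

Rows where the mask is true are candidates for manual review; how to weight such rows is left to the analyst.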
|
|
| ## Quick start |
|
|
| ```bash |
| pip install xgboost scikit-learn torch safetensors numpy pandas huggingface_hub |
| ``` |
|
|
| ```python |
| import os |
| import sys |
| |
| import numpy as np |
| import torch  # only needed for the MLP artifact, not used below |
| import xgboost as xgb |
| from huggingface_hub import hf_hub_download |
| from safetensors.torch import load_file  # only needed for the MLP artifact |
| |
| REPO = "xpertsystems/cyb007-baseline-classifier" |
| |
| # Download the model artifacts and the feature pipeline. |
| paths = {n: hf_hub_download(REPO, n) for n in [ |
|     "model_xgb.json", "model_mlp.safetensors", |
|     "feature_engineering.py", "feature_meta.json", "feature_scaler.json", |
| ]} |
| |
| # Make the downloaded feature_engineering.py importable. |
| sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"])) |
| from feature_engineering import transform_single, load_meta, INT_TO_LABEL |
| |
| meta = load_meta(paths["feature_meta.json"]) |
| xgb_model = xgb.XGBClassifier() |
| xgb_model.load_model(paths["model_xgb.json"]) |
| |
| # my_timestep_record is a placeholder for one raw per-timestep row; |
| # see inference_example.ipynb for the full pattern. |
| X = transform_single(my_timestep_record, meta) |
| proba = xgb_model.predict_proba(X)[0] |
| print(INT_TO_LABEL[int(np.argmax(proba))]) |
| ``` |
|
|
| See [`inference_example.ipynb`](./inference_example.ipynb) for the full |
| copy-paste demo. |
|
|
| ## Training data |
|
|
| Trained on the public sample of CYB007, 32,500 per-timestep telemetry |
| rows from 500 insider threat incidents (65 timesteps per incident): |
|
|
| | Tier | Incidents | Timestep rows | Class share | |
| |---|---:|---:|---:| |
| | `negligent_user` | 250 | 16,250 | 50.0% | |
| | `malicious_employee` | 150 | 9,750 | 30.0% | |
| | `privileged_insider` | 100 | 6,500 | 20.0% | |
|
|
| ### Group-aware split |
|
|
| A single incident generates 65 highly-correlated timesteps. Random |
| row-level splitting would put timesteps from the same incident in both |
| train and test, inflating metrics in a way that does not generalize to |
| new incidents. |
|
|
| This release uses **GroupShuffleSplit by `incident_id`** (nested, |
| 70/15/15): |
| |
| | Fold | Incidents | Timesteps | |
| |---|---:|---:| |
| | Train | 350 | 22,750 | |
| | Validation | 75 | 4,875 | |
| | Test | 75 | 4,875 | |
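
A nested group-aware split of this shape can be sketched with scikit-learn's `GroupShuffleSplit`. This is an illustration on synthetic incident IDs, not the private training script. The two-stage layout (hold out 150 incidents, then halve them) and the `random_state` values are assumptions chosen to reproduce the 350/75/75 fold sizes in the table.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the sample: 500 incidents x 65 timesteps each.
n_incidents, steps_per_incident = 500, 65
incident_id = np.repeat(np.arange(n_incidents), steps_per_incident)
rows = np.arange(incident_id.size)

# Stage 1: hold out 30% of incidents (150 groups) for validation + test.
outer = GroupShuffleSplit(n_splits=1, test_size=150, random_state=42)
train_idx, holdout_idx = next(outer.split(rows, groups=incident_id))

# Stage 2: split the held-out incidents in half (75 validation, 75 test).
inner = GroupShuffleSplit(n_splits=1, test_size=75, random_state=42)
val_rel, test_rel = next(inner.split(holdout_idx, groups=incident_id[holdout_idx]))
val_idx, test_idx = holdout_idx[val_rel], holdout_idx[test_rel]

train_groups = set(incident_id[train_idx].tolist())
val_groups = set(incident_id[val_idx].tolist())
test_groups = set(incident_id[test_idx].tolist())

print(len(train_groups), len(val_groups), len(test_groups))
print(len(train_idx), len(val_idx), len(test_idx))
```

Because every incident has the same number of timesteps, splitting by group also fixes the row counts per fold. The split is not stratified by tier, so class shares can differ slightly between folds.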
| |
| All test incidents are completely unseen during training. Class |
| imbalance is addressed with balanced class weights, passed to XGBoost as |
| per-row `sample_weight` and to the MLP as weighted cross-entropy. |
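
For reference, the usual "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies it to the whole-sample row counts from the tier table. The helper name is hypothetical, and the actual training-fold counts (and therefore the exact weights used) are not published.

```python
import numpy as np

def balanced_sample_weight(y):
    """Per-row weights so that every class contributes equal total weight."""
    y = np.asarray(y)
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    class_weight = y.size / (classes.size * counts)
    return class_weight[inverse], dict(zip(classes.tolist(), class_weight.tolist()))

# Row counts per tier taken from the table above (0=negligent, 1=malicious, 2=privileged).
y = np.repeat([0, 1, 2], [16250, 9750, 6500])
weights, per_class = balanced_sample_weight(y)
print(per_class)
```

The per-row array can be passed as `sample_weight` when fitting an `XGBClassifier`, and the per-class values can serve as the `weight` tensor of `torch.nn.CrossEntropyLoss`.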
| |
| ## Feature pipeline |
| |
| The bundled `feature_engineering.py` is the canonical feature recipe. |
| 28 features survive after encoding, drawn from: |
| |
| - **Per-timestep numeric** (7): `timestep`, `data_access_volume_mb`, `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `exfiltration_volume_mb_cumulative`, `behavioural_risk_score` |
| - **Per-timestep categorical** (3, one-hot): `incident_phase` (8 values), `detection_outcome` (4 values), `target_data_sensitivity_tier` (3 values) |
| - **Engineered** (6): `log_data_volume`, `log_cumulative_exfil`, `exfil_velocity`, `is_privileged_event`, `risk_x_dlp_composite`, `is_late_stage` |
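
The feature count follows from 7 numeric + (8 + 4 + 3) one-hot + 6 engineered = 28. The sketch below shows one way to keep the one-hot width fixed even for a single record, by declaring the category levels up front. It is not the bundled `feature_engineering.py`: the level names are placeholders (only the counts 8, 4, and 3 come from the list above) and `one_hot_fixed` is a hypothetical helper.

```python
import pandas as pd

# Placeholder level names; only the number of levels per column is from the model card.
LEVELS = {
    "incident_phase": [f"phase_{i}" for i in range(8)],
    "detection_outcome": [f"outcome_{i}" for i in range(4)],
    "target_data_sensitivity_tier": [f"tier_{i}" for i in range(3)],
}

def one_hot_fixed(df, levels):
    """One-hot encode with a fixed level list so column count and order never change."""
    out = df.copy()
    for col, cats in levels.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    return pd.get_dummies(out, columns=list(levels), dtype=float)

# A single record still expands to all 15 one-hot columns.
record = pd.DataFrame([{
    "incident_phase": "phase_2",
    "detection_outcome": "outcome_0",
    "target_data_sensitivity_tier": "tier_1",
}])
encoded = one_hot_fixed(record, LEVELS)
n_one_hot = encoded.shape[1]
n_total = 7 + n_one_hot + 6  # numeric + one-hot + engineered
print(n_one_hot, n_total)
```

Fixing the levels in metadata, rather than inferring them from the data at hand, is what lets a single-row transform produce the same column layout the model was trained on.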
|
|
| ### Leakage audit |
|
|
| Two features have strongly tier-correlated means but with substantial |
| distributional overlap. **Neither was dropped**: |
|
|
| | Feature | Distribution by tier | Verdict | |
| |---|---|---| |
| | `data_access_volume_mb` | negligent [0, 88] mean 14 / malicious [0, 328] mean 44 / privileged [0, 2541] mean 302; median ~9 MB for all three | Massive overlap in [0, 88]; real signal, not oracle. KEEP. | |
| | `exfiltration_volume_mb_cumulative` | negligent [0, ~50] mean 5 / malicious [0, ~500] mean 90 / privileged [0, ~10000] mean 818 | Heavy-tailed with overlap in low-quantile region. KEEP. | |
|
|
| The honest test: dropping both features collapses accuracy from 0.85 |
| to 0.47 (below the 0.50 majority baseline). This confirms they carry |
| legitimate discriminative signal that **defines what `privileged_insider` |
| means** — a privileged user with elevated data access — rather than |
| being an oracle leak. |
| |
| `detection_outcome` is a near-oracle for **incident phase** (purity |
| 0.79, max 1.00 for reconnaissance which is 100% `suppressed`). But its |
| purity vs **tier** is uniform (~0.50 across all tiers), so it has no |
| oracle relationship to the target. KEEP. |
|
|
| No columns dropped for this task. |
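
The exact purity definition used in the audit is not published. One plausible reading, assumed here, is the share of the most common feature value within each target class, so that a class which always shows the same value scores 1.00. The toy data below is invented; only `reconnaissance` and `suppressed` are names taken from the text above.

```python
import pandas as pd

def purity_by_target(df, feature, target):
    """For each target class, the share of its most frequent feature value."""
    shares = pd.crosstab(df[target], df[feature], normalize="index")
    return shares.max(axis=1)

# Invented toy rows: one phase is always 'suppressed', the other is mixed.
toy = pd.DataFrame({
    "incident_phase": ["reconnaissance"] * 4 + ["exfiltration"] * 4,
    "detection_outcome": ["suppressed"] * 4
    + ["suppressed", "alerted", "alerted", "blocked"],
})
purity = purity_by_target(toy, "detection_outcome", "incident_phase")
print(purity)
```

Under this reading, values near 1.00 for every class would flag an oracle-like feature for that target, while flat values across classes suggest the feature says little about it.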
|
|
| ## Evaluation |
|
|
| ### Test-set metrics, seed 42 (n = 4,875 timesteps from 75 disjoint incidents) |
|
|
| **XGBoost** (the published `model_xgb.json` artifact) |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.9628** | |
| | Accuracy | **0.8529** | |
| | Macro-F1 | 0.8496 | |
| | Weighted-F1 | 0.8543 | |
|
|
| **MLP** (the published `model_mlp.safetensors` artifact) — **slightly outperforms XGBoost** |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.9661** | |
| | Accuracy | **0.8685** | |
| | Macro-F1 | 0.8636 | |
| | Weighted-F1 | 0.8682 | |
|
|
| The MLP outperforming XGBoost is unusual for tabular data and unusual |
| within the XpertSystems baseline catalog — CYB001–CYB006 all had |
| XGBoost ahead. With 22,750 training rows and only 28 features, the |
| MLP has enough data to fit cleanly and the tabular advantage of trees |
| is reduced. Both models are published. |
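
The four reported metrics can be computed with scikit-learn as sketched below. The data is synthetic and the `report` helper is hypothetical. Only the metric definitions (macro one-vs-rest ROC-AUC, accuracy, macro-F1, weighted-F1) mirror the tables above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def report(y_true, proba):
    """Compute the four headline metrics from class probabilities."""
    y_pred = proba.argmax(axis=1)
    return {
        "roc_auc_ovr_macro": roc_auc_score(y_true, proba, multi_class="ovr", average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Synthetic 3-class scores that favour the true class (not CYB007 data).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=300)
logits = rng.normal(size=(300, 3)) + 2.0 * np.eye(3)[y_true]
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

metrics = report(y_true, proba)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro averages treat the three tiers equally, while weighted-F1 follows the class shares, which is why the two F1 values in the tables differ slightly.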
|
|
| ### Multi-seed robustness (XGBoost, 10 seeds) |
|
|
| Very stable performance — std 0.012 on accuracy is among the tightest |
| in the XpertSystems catalog: |
|
|
| | Metric | Mean | Std | Min | Max | |
| |---|---:|---:|---:|---:| |
| | Accuracy | 0.855 | 0.012 | 0.831 | 0.873 | |
| | Macro-F1 | 0.839 | 0.010 | 0.829 | 0.860 | |
| | Macro ROC-AUC OvR | 0.961 | 0.007 | 0.949 | 0.972 | |
|
|
| Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json). |
| All 10 seeds yielded all 3 tiers in the test fold. |
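
A summary row like those above can be produced from per-seed values as follows. The accuracies here are made up, the schema of `multi_seed_results.json` is not assumed, and whether the published std uses the sample (`ddof=1`) or population (`ddof=0`) formula is not stated, so `ddof` is left as a parameter.

```python
import numpy as np

def summarize(values, ddof=1):
    """Mean, std, min, and max of one metric across seeds."""
    v = np.asarray(values, dtype=float)
    return {
        "mean": float(v.mean()),
        "std": float(v.std(ddof=ddof)),
        "min": float(v.min()),
        "max": float(v.max()),
    }

per_seed_accuracy = [0.85, 0.86, 0.84, 0.87, 0.85]  # made-up values
summary = summarize(per_seed_accuracy)
print(summary)
```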
|
|
| ### Per-class F1 (seed 42) |
|
|
| | Tier | Class share | XGBoost F1 | MLP F1 | |
| |---|---:|---:|---:| |
| | `negligent_user` | 50% | 0.876 | 0.894 | |
| | `privileged_insider` | 20% | 0.846 | 0.856 | |
| | `malicious_employee` | 30% | 0.826 | 0.841 | |
|
|
| Both models perform evenly across the three tiers, with no class collapse. |
| `negligent_user`, the majority class, scores highest, and |
| `privileged_insider` comes second despite being the minority class |
| (20%), which suggests that the volume-based behavioural signature |
| (sustained large data access) is reliably discriminative. |
| `malicious_employee` is marginally the hardest tier because these actors |
| operate in a middle zone: more aggressive than negligent users but |
| without the access volumes that distinguish privileged insiders. |
|
|
| ### Ablation: which feature groups matter |
|
|
| | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | |
| |---|---:|---:|---:|---:| |
| | Full feature set (published) | 0.8529 | 0.8496 | 0.9628 | — | |
| | No volume features | 0.4890 | 0.4736 | 0.6828 | **−0.3639** | |
| | No behavioural features | 0.7126 | 0.7055 | 0.8961 | −0.1403 | |
| | No `timestep` | 0.8394 | 0.8336 | 0.9569 | −0.0135 | |
| | No context features | 0.8544 | 0.8490 | 0.9632 | +0.0015 | |
| | No engineered features | 0.8597 | 0.8560 | 0.9629 | +0.0068 | |
|
|
| Four findings: |
|
|
| 1. **Volume features carry the overwhelmingly dominant signal** |
| (drops 36 pp accuracy, 28 pp ROC-AUC when removed). This is by |
| design — privileged insiders are *defined* by access to large |
| data volumes, and the synthetic generator models this faithfully. |
| 2. **Behavioural features (privilege events, communication anomaly, |
| DLP confidence, risk scores) contribute 14 pp accuracy.** They |
| add a second axis of discrimination beyond pure volume. |
| 3. **`timestep` contributes only 1 pp.** Tier attribution is largely |
| invariant to where in the incident lifecycle you are — different |
| from phase prediction, which is strongly timestep-driven. |
| 4. **Context features (incident_phase, sensitivity tier) and |
| engineered composites are recovered by the trees from raw inputs.** |
| They are retained in the pipeline as a documented baseline reference |
| but contribute essentially zero on their own. |
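
A feature-group ablation of this kind amounts to retraining with one group of columns removed and comparing against the full-feature score. The sketch below uses synthetic data and a `DecisionTreeClassifier` as a lightweight stand-in for XGBoost. The group names, the data, and the plain row-level split are illustrative assumptions, not the published experiment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 3, size=n)
X = np.column_stack([
    y * 10.0 + rng.normal(size=n),  # column 0: strongly class-dependent ("volume"-like)
    rng.normal(size=n),             # column 1: pure noise
    rng.normal(size=n),             # column 2: pure noise
])
groups = {"volume": [0], "noise": [1, 2]}
train, test = slice(0, 400), slice(400, n)

def accuracy_without(drop_cols):
    """Retrain on all columns except drop_cols and return test accuracy."""
    keep = [c for c in range(X.shape[1]) if c not in drop_cols]
    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    model.fit(X[train][:, keep], y[train])
    return float((model.predict(X[test][:, keep]) == y[test]).mean())

baseline = accuracy_without([])
deltas = {name: accuracy_without(cols) - baseline for name, cols in groups.items()}
print(baseline, deltas)
```

A large negative delta marks a group the model depends on, while a delta near zero marks a group whose information is redundant or absent.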
| |
| ### Architecture |
| |
| **XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes), |
| `hist` tree method, class-balanced sample weights, early stopping on |
| validation mlogloss. |
| |
| **MLP:** `28 → 128 → 64 → 3`, each hidden layer followed by `BatchNorm1d` |
| → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, |
| early stopping on validation macro-F1. |
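
The MLP layout described above can be written as a plain `nn.Sequential`. This is a sketch under assumptions: layer ordering follows the description, but the parameter names inside `model_mlp.safetensors` may not match this module, so loading the published weights into it is not guaranteed to work without remapping.

```python
import torch
from torch import nn

def build_mlp(n_features=28, n_classes=3, dropout=0.3):
    """28 -> 128 -> 64 -> 3, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(64, n_classes),
    )

# eval() disables dropout and makes BatchNorm use its running statistics.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 28))    # random stand-in for 4 scaled feature rows
    proba = torch.softmax(logits, dim=1)  # per-class probabilities
print(proba.shape)
```

For real inference the inputs would first be standardized with the values in `feature_scaler.json`, since the MLP, unlike XGBoost, is sensitive to feature scale.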
| |
| Training hyperparameters are held internally by XpertSystems. |
| |
| ## Limitations |
| |
| **This is a baseline reference, not a production insider-threat detection system.** |
| |
| 1. **The dataset has 3 tiers, not 4.** The CYB007 README claims a |
| 4-tier scheme including `compromised_account` but the sample |
| contains only `negligent_user`, `malicious_employee`, and |
| `privileged_insider`. If your work requires the 4th tier, request |
| regeneration. |
| |
| 2. **Volume-feature dominance is a property of the dataset.** Real |
| insider-threat telemetry has more variance: some negligent users |
| accidentally trigger large data downloads, and some privileged |
| insiders work patiently with small transfers. The sample's |
| per-tier volume distributions overlap, but not as much as in real |
| environments. Buyers should test the model on their own data |
| before assuming the reported test accuracy (0.853 XGBoost, 0.869 MLP |
| at seed 42) transfers. |
| |
| 3. **MLP modestly outperforms XGBoost.** With 22,750 training rows, |
| the MLP has enough data to compete favorably. On smaller training |
| sets (n < 1k rows) we would expect XGBoost to be stronger. |
| |
| 4. **Synthetic-vs-real transfer.** The dataset is synthetic and |
| calibrated to insider-threat research benchmarks (CERT Insider |
| Threat Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon |
| Institute, MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, |
| Forrester UEBA, Gartner ZTNA, CrowdStrike, Mandiant). Real |
| insider telemetry has different noise characteristics, and |
| adversarial insiders may deliberately mimic negligent-user |
| patterns. Do not assume metrics transfer. |
| |
| 5. **Adversarial robustness not evaluated.** The dataset does not |
| simulate insiders deliberately spoofing a different tier's |
| behavioural footprint to evade attribution. |
| |
| 6. **The 75-incident test fold is stable but small.** Multi-seed |
| std of 0.012 on accuracy indicates the metric is stable across splits, |
| but full confidence intervals for downstream production decisions |
| should come from the full ~4,800-incident product. |
| |
| ## Notes on dataset schema |
| |
| The CYB007 sample dataset README describes some fields differently |
| from the actual schema. The model was trained on the actual schema; |
| this note helps buyers reconcile what they read with what they receive. |
| |
| | What the README says | What the data actually contains | |
| |---|---| |
| | 4 actor tiers including `compromised_account` | **3 tiers only**: `negligent_user`, `malicious_employee`, `privileged_insider`. No `compromised_account` rows in the sample. | |
| | 6 incident phases | **8 phases**: adds `idle_dwell` and `lateral_access` to the 6 documented | |
| | Per-timestep columns: `payload_entropy`, `cover_actions_taken`, `dlp_alerts_raised`, `detection_flag`, `blast_radius`, `sensitive_data_accessed`, `threat_type_tier` | Actual per-timestep columns: `privilege_event_count`, `communication_anomaly_score`, `dlp_confidence_score`, `detection_outcome` (categorical 4-value, not boolean), `behavioural_risk_score`, `target_data_sensitivity_tier`, `actor_threat_type` | |
| | Summary field `ueba_status` | Actual field is `ueba_deployment_status` (only on `org_topology.csv`, not on `insider_trajectories.csv` or `incident_summary.csv`) | |
| | Summary field `collusion_flag` | Actual: `coordinated_incident_flag` | |
| | Summary field `lateral_access_flag` | Actual: `lateral_access_count` (not boolean) | |
| | Summary field `sabotage_flag` | Actual: `sabotage_events_executed` (count) | |
| | Summary field `cover_tracks_flag` | Actual: `cover_tracks_events` (count) | |
| | Summary field `hr_trigger_flag` | Actual: `hr_case_triggers_caused` (count) | |
| | Summary field `exfiltration_success_flag` | Actual: `exfiltration_successes` (count) and `exfiltration_success_rate` (float) | |
| | Summary field `dwell_time_ratio` | Not present in summary; `actor_efficiency_score` is the closest analog | |
| |
| None of these affects model correctness — the feature pipeline uses |
| the actual column names. If you build your own pipeline against the |
| dataset, use the actual columns. |
| |
| ## Intended use |
| |
| - **Evaluating fit** of the CYB007 dataset for your insider-threat |
| research |
| - **Baseline reference** for new model architectures (sequence models, |
| graph models considering collusion structure) |
| - **Teaching and demo** for multi-class tabular classification on |
| insider-threat telemetry |
| - **Feature engineering reference** for per-timestep insider activity |
| |
| ## Out-of-scope use |
| |
| - Production insider-threat detection on real telemetry |
| - HR investigation or employment decisions |
| - Adversarial-evasion evaluation (dataset not adversarially generated) |
| - Any operational or legal decision affecting actual persons |
| |
| ## Reproducibility |
| |
| Outputs above were produced with `seed = 42` (published artifact), |
| group-aware nested `GroupShuffleSplit` (70/15/15 by incident_id), on |
| the published sample (`xpertsystems/cyb007-sample`, version 1.0.0, |
| generated 2026-05-16). The feature pipeline in `feature_engineering.py` |
| is deterministic and the trained weights in this repo correspond |
| exactly to the metrics above. |
| |
| Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in |
| `multi_seed_results.json` confirm robust performance across splits. |
| |
| The training script itself is private to XpertSystems. |
| |
| ## Files in this repo |
| |
| | File | Purpose | |
| |---|---| |
| | `model_xgb.json` | XGBoost weights (seed 42) | |
| | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) | |
| | `feature_engineering.py` | Feature pipeline | |
| | `feature_meta.json` | Feature column order + categorical levels | |
| | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) | |
| | `validation_results.json` | Per-class metrics, confusion matrix, architecture | |
| | `ablation_results.json` | Per-feature-group ablation | |
| | `multi_seed_results.json` | XGBoost metrics across 10 seeds | |
| | `inference_example.ipynb` | End-to-end inference demo notebook | |
| | `README.md` | This file | |
| |
| ## Contact and full product |
| |
| The full **CYB007** dataset contains ~335,000 rows across four files, |
| with calibrated benchmark validation against 12 metrics drawn from |
| authoritative insider-threat research sources (CERT Insider Threat |
| Center, Verizon DBIR, IBM Cost of Insider Threats, Ponemon Institute, |
| MITRE ATT&CK, NIST SP 800-53 / SP 800-207, Securonix, Forrester UEBA, |
| Gartner ZTNA, CrowdStrike, Mandiant M-Trends). The full |
| XpertSystems.ai synthetic data catalogue spans 41 SKUs across |
| Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials |
| & Energy. |
| |
| - 📧 **pradeep@xpertsystems.ai** |
| - 🌐 **https://xpertsystems.ai** |
| - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb007-sample |
| - 🤖 Companion models: |
|   - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic) |
|   - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain) |
|   - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase) |
|   - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase) |
|   - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution) |
|   - https://huggingface.co/xpertsystems/cyb006-baseline-classifier (user risk tier + leakage diagnostic) |
| |
| ## Citation |
| |
| ```bibtex |
| @misc{xpertsystems_cyb007_baseline_2026, |
| title = {CYB007 Baseline Classifier: XGBoost and MLP for Insider Threat Type Classification}, |
| author = {XpertSystems.ai}, |
| year = {2026}, |
| url = {https://huggingface.co/xpertsystems/cyb007-baseline-classifier}, |
| note = {Baseline reference model trained on xpertsystems/cyb007-sample} |
| } |
| ``` |
| |