---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- ransomware
- threat-intelligence
- threat-attribution
- mitre-attack
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb005-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb005-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 4-class threat-actor capability tier attribution
    dataset:
      type: xpertsystems/cyb005-sample
      name: CYB005 Synthetic Ransomware Attack Simulation (Sample)
    metrics:
    - type: roc_auc
      value: 0.8736
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.6898
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.6751
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.603
      name: Multi-seed accuracy mean ± 0.040 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.853
      name: Multi-seed ROC-AUC mean ± 0.031 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.8072
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.5118
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.5121
      name: Test macro-F1 (MLP, seed 42)
---

# CYB005 Baseline Classifier

**Threat-actor capability-tier classifier trained on the CYB005 synthetic
ransomware campaign sample. Predicts which of 4 actor tiers
(lone_actor / organised_syndicate / raas_affiliate / nation_state_nexus)
is behind an observed ransomware campaign from per-timestep telemetry.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB005 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb005-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point for threat-attribution research. It is not a production
> threat-intelligence system, attribution engine, or incident-response
> tool. See [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 4-class actor_capability_tier classification |
| Training data | `xpertsystems/cyb005-sample` (37,489 timesteps across 500 ransomware campaigns) |
| Models | XGBoost + PyTorch MLP |
| Input features | 63 (after one-hot encoding) |
| Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

## Why this task, and why CYB005 ships it where CYB002/3/4 could not

This is the first XpertSystems baseline that targets the **dataset's
stated headline use case**. The CYB005 README's first suggested use case
is "ransomware classifier models (4-tier actor attribution)", and that is
exactly what this baseline ships.

In CYB002 (kill-chain), CYB003 (malware family), and CYB004 (actor tier),
the sample datasets had only ~100 groups (events / samples / campaigns),
which limited group-aware test folds to ~15 unseen groups and 1.5–2 groups
per class. Each baseline had to pivot to a phase-prediction subtask that
was learnable at sample size.

CYB005's sample is intentionally **5× larger (500 campaigns)** because
the README explicitly notes that "benchmarks are conditional on small
actor-tier subsets". The larger sample makes a held-out test fold of
75 disjoint campaigns possible, with each of the four tiers represented
by 11–30 unseen test campaigns. Tier attribution becomes learnable at
this scale, and that is what we publish.

Two model artifacts are published. They are designed to be used together: disagreement between them is a useful triage signal.

- `model_xgb.json`: gradient-boosted trees, primary recommendation
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

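
One way to use the two models together is to compare their top predictions and route disagreements to an analyst. The sketch below is illustrative only: `triage` is a hypothetical helper (not part of `feature_engineering.py`), the `TIERS` order is an assumption (the real mapping is `INT_TO_LABEL`), and the probability vectors and the `min_confidence` threshold are placeholders rather than model outputs.

```python
# Hypothetical label order; the real mapping is INT_TO_LABEL in feature_engineering.py.
TIERS = ["lone_actor", "organised_syndicate", "raas_affiliate", "nation_state_nexus"]


def triage(xgb_proba, mlp_proba, min_confidence=0.5):
    """Return (tier, needs_review) from two per-class probability vectors.

    XGBoost is treated as the primary predictor. A record is flagged for
    analyst review when the two models disagree on the top class or when
    the XGBoost top probability falls below `min_confidence`.
    """
    xgb_top = max(range(len(TIERS)), key=lambda i: xgb_proba[i])
    mlp_top = max(range(len(TIERS)), key=lambda i: mlp_proba[i])
    needs_review = xgb_top != mlp_top or xgb_proba[xgb_top] < min_confidence
    return TIERS[xgb_top], needs_review


# Placeholder probabilities, not real model outputs.
agree = triage([0.05, 0.70, 0.20, 0.05], [0.10, 0.60, 0.20, 0.10])
disagree = triage([0.05, 0.55, 0.35, 0.05], [0.10, 0.30, 0.50, 0.10])
print(agree)     # ('organised_syndicate', False)
print(disagree)  # ('organised_syndicate', True)
```

The threshold is a tuning knob: a higher `min_confidence` sends more records to review.
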
## Quick start

```bash
pip install xgboost scikit-learn numpy torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb005-baseline-classifier"

# Download the artifacts into the local Hugging Face cache.
# model_mlp.safetensors and feature_scaler.json are only needed for the MLP path.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the bundled feature pipeline importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
)

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])
seg_lookup = build_segment_lookup("path/to/victim_topology.csv")

# Predict for one timestep row. `my_record` is a placeholder for your own
# record (see inference_example.ipynb for the full pattern).
seg_aggs = seg_lookup.get(my_record["target_segment_id"], {})
X = transform_single(my_record, meta, segment_aggregates=seg_aggs)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB005: 37,489 per-timestep telemetry
rows from 500 ransomware campaigns (about 75 timesteps per campaign):

| Tier | Campaigns | Timestep rows | Share of sample |
|---|---:|---:|---:|
| `organised_syndicate` | 200 | 14,998 | 40.0% |
| `raas_affiliate` | 150 | 11,250 | 30.0% |
| `lone_actor` | 75 | 5,625 | 15.0% |
| `nation_state_nexus` | 75 | 5,616 | 15.0% |

### Group-aware split

A single campaign generates about 75 highly correlated timesteps. Random
row-level splitting would put timesteps from the same campaign in both
train and test, inflating metrics in a way that does not generalize to
new campaigns.

This release uses **GroupShuffleSplit by `campaign_id`** (nested,
70/15/15):

| Fold | Campaigns | Timesteps |
|---|---:|---:|
| Train | 350 | 26,242 |
| Validation | 75 | 5,624 |
| Test | 75 | 5,623 |

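
The idea behind the split can be sketched without any ML library: shuffle the unique campaign IDs once, cut the shuffled list 70/15/15, and assign every row to the fold of its campaign. This is a stdlib illustration of the same principle as scikit-learn's `GroupShuffleSplit`, not the private training script; `split_by_campaign` and the toy rows are hypothetical, and the resulting assignment will not match the published folds.

```python
import random


def split_by_campaign(campaign_ids, seed=42, fractions=(0.70, 0.15, 0.15)):
    """Map each unique campaign ID to 'train', 'val', or 'test'."""
    unique_ids = sorted(set(campaign_ids))
    random.Random(seed).shuffle(unique_ids)
    n_train = round(fractions[0] * len(unique_ids))
    n_val = round(fractions[1] * len(unique_ids))
    fold_of = {}
    for position, cid in enumerate(unique_ids):
        if position < n_train:
            fold_of[cid] = "train"
        elif position < n_train + n_val:
            fold_of[cid] = "val"
        else:
            fold_of[cid] = "test"
    return fold_of


# Toy data: 500 campaigns x 75 timesteps, one campaign ID per row.
rows = [f"C{c:04d}" for c in range(500) for _ in range(75)]
fold_of = split_by_campaign(rows)
row_folds = [fold_of[cid] for cid in rows]  # every row inherits its campaign's fold

campaigns_per_fold = {f: sum(1 for v in fold_of.values() if v == f)
                      for f in ("train", "val", "test")}
print(campaigns_per_fold)  # {'train': 350, 'val': 75, 'test': 75}
```

Because the fold is decided per campaign, no campaign can contribute rows to more than one fold.
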
All test campaigns are completely unseen during training. Class imbalance
is addressed with balanced class weights (`class_weight='balanced'`),
applied to XGBoost as per-row `sample_weight` and to the MLP as weighted
cross-entropy.

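
For reference, the usual "balanced" heuristic sets each class weight to `n_rows / (n_classes * rows_in_class)`. The sketch below applies that formula to the full-sample row counts from the Training data table purely as an illustration; the weights actually used in training would be computed on the training fold and are not published.

```python
# Timestep rows per tier in the full sample (Training data table).
# Illustration only: real weights would come from the training fold.
rows_per_tier = {
    "organised_syndicate": 14_998,
    "raas_affiliate": 11_250,
    "lone_actor": 5_625,
    "nation_state_nexus": 5_616,
}

n_rows = sum(rows_per_tier.values())
n_classes = len(rows_per_tier)

# 'balanced' heuristic: weight_c = n_rows / (n_classes * rows_c)
class_weight = {tier: n_rows / (n_classes * count)
                for tier, count in rows_per_tier.items()}

for tier, weight in class_weight.items():
    print(f"{tier:22s} {weight:.3f}")
```

With these weights every tier contributes the same total weight to the loss, so the two 15% tiers are not drowned out by `organised_syndicate`.
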
## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
63 features survive after encoding, drawn from:

- **Per-timestep numeric** (15): `timestep`, `files_encrypted_cumulative`, `encryption_throughput_mbps`, `endpoints_compromised`, `lateral_move_count`, `credential_harvest_count`, `c2_bytes_exfiltrated`, `defender_alert_score`, `blast_radius_pct`, `living_off_land_score`, `attribution_risk_score`, `data_exfiltrated_gb`, `wiper_flag`, `double_extortion_flag`, `ir_activated`
- **Per-timestep categorical** (2, one-hot): `attack_phase`, `detection_outcome`
- **Victim segment** (10 numeric, 3 categorical one-hot): EDR coverage, network segmentation quality, patch posture, IR latency, endpoint count, AD domain complexity, SOC maturity score, backup recovery probability, backup recovery time, SIEM cadence; `segment_type`, `soc_maturity_tier`, `backup_maturity_tier`
- **Engineered** (6): `c2_intensity_score`, `escalation_velocity`, `is_destructive`, `dwell_efficiency`, `is_post_detonation`, `lotl_intensity_bin`
- **Ordinal** (1): `segment_id_hash` (segment ID hashed to integer)

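
To make the "after one-hot encoding" count concrete, here is a minimal sketch of one-hot encoding against a fixed list of levels, which keeps the column set and order stable at inference time. It is not the bundled implementation: `one_hot` is a hypothetical helper, the phase values are the eight listed under "Notes on dataset schema", and their order here is an assumption (the real order is stored in `feature_meta.json`).

```python
# The 8 attack_phase values present in the data (see "Notes on dataset schema").
# The order used here is an assumption; the real order lives in feature_meta.json.
ATTACK_PHASES = [
    "initial_access", "internal_recon", "privilege_escalation",
    "lateral_movement", "exfiltration_staging", "encryption_detonation",
    "ransom_negotiation", "recovery_in_progress",
]


def one_hot(value, levels, prefix):
    """Encode `value` as a dict of 0/1 columns, one per known level.

    Unknown values map to all zeros, so the column set never changes.
    """
    return {f"{prefix}_{level}": int(value == level) for level in levels}


encoded = one_hot("lateral_movement", ATTACK_PHASES, "attack_phase")
print(len(encoded))                              # 8
print(encoded["attack_phase_lateral_movement"])  # 1
print(sum(encoded.values()))                     # 1
```
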
### Leakage audit

Three columns were audited as potential tier oracles. **None were
dropped** for this task:

| Feature | Cross-tier comparison (means) | Verdict |
|---|---|---|
| `attribution_risk_score` | lone 0.016 / nation_state 0.017 / organised 0.026 / raas 0.025 | Overlapping; NOT an oracle. Keep. |
| `living_off_land_score` | lone 0.05 / nation_state 0.20 / organised 0.16 / raas 0.13 | Mild correlation with massive overlap (std 0.08–0.25). Real observable. Keep. |
| `attack_phase` | Phase-purity vs tier is ~uniform | No oracle relationship. Keep. |

`detection_outcome` contains a `recovery_in_progress` value that closely
tracks the `attack_phase` value of the same name (purity 0.89 vs
phase), but this only matters for *phase* prediction, not *tier*
prediction. The column is kept as a feature for tier work.

Dropping the two candidate-leakage columns
(`attribution_risk_score` + `living_off_land_score`) reduces accuracy by
2 pp, which indicates they provide modest legitimate signal rather than
oracle leakage. They are kept in the published pipeline.

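
A purity check of this kind can be sketched in a few lines. The definition used here is an assumption, since the audit code is not published: the purity of a value is taken to be the share of rows carrying that value that fall in the single most common target class. `purity` is a hypothetical helper and the rows are toy data (including the made-up `other_outcome` value), not dataset records.

```python
from collections import Counter


def purity(rows, column, value, target):
    """Share of rows with rows[column] == value that fall in the single
    most common `target` value. 1.0 means `value` pins down `target`."""
    targets = Counter(r[target] for r in rows if r[column] == value)
    total = sum(targets.values())
    return max(targets.values()) / total if total else 0.0


# Toy rows, not taken from the dataset.
rows = (
    [{"detection_outcome": "recovery_in_progress", "attack_phase": "recovery_in_progress"}] * 8
    + [{"detection_outcome": "recovery_in_progress", "attack_phase": "ransom_negotiation"}] * 2
    + [{"detection_outcome": "other_outcome", "attack_phase": "initial_access"}] * 5
)
print(purity(rows, "detection_outcome", "recovery_in_progress", "attack_phase"))  # 0.8
```

A value near 1.0 against the prediction target would mark the column as an oracle; values spread across classes, as reported for tier, would not.
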
## Evaluation

### Test-set metrics, seed 42 (n = 5,623 timesteps from 75 disjoint campaigns)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.8736** |
| Accuracy | **0.6898** |
| Macro-F1 | 0.6751 |
| Weighted-F1 | 0.6939 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | 0.8072 |
| Accuracy | 0.5118 |
| Macro-F1 | 0.5121 |
| Weighted-F1 | 0.5160 |

The MLP underperforms XGBoost on this task (a common pattern on tabular
data with limited training scale). Both are published so users can pick
the right tool, and disagreement between them is a useful triage signal.

### Multi-seed robustness (XGBoost, 10 seeds)

Performance is stable across seeds (accuracy std 0.040), and all 10 seeds
yield all 4 tiers in the test fold:

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.603 | 0.040 | 0.533 | 0.690 |
| Macro-F1 | 0.593 | 0.047 | 0.509 | 0.675 |
| Macro ROC-AUC OvR | 0.853 | 0.031 | 0.796 | 0.891 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).

Seed 42 is the strongest of the 10 seeds on accuracy and macro-F1 (acc
0.69 vs mean 0.60). The published artifact uses seed 42 because it
produces clean ROC-AUC computation; the **multi-seed aggregate ROC-AUC of
0.853 ± 0.031 is the honest performance estimate**.

### Per-class F1 (seed 42)

| Tier | Class share | XGBoost F1 | MLP F1 |
|---|---:|---:|---:|
| `organised_syndicate` | 40% | **0.739** | 0.520 |
| `nation_state_nexus` | 15% | **0.686** | 0.602 |
| `raas_affiliate` | 30% | 0.646 | 0.499 |
| `lone_actor` | 15% | 0.630 | 0.428 |

No single tier collapses: XGBoost F1 ranges from 0.63 to 0.74 across the
four classes. The strong result on the minority `nation_state_nexus` tier
(F1 0.69 at only 15% prevalence, second only to `organised_syndicate` for
XGBoost and the MLP's best class) suggests the model picks up on
nation-state-specific behaviours (high LotL score, wiper deployment,
sustained C2 dwell). The hardest tier is `lone_actor`, the
behaviourally most variable class.

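
As a quick consistency check, the reported XGBoost macro-F1 can be recovered from the per-class table, since macro-F1 is the unweighted mean of the per-class scores. Weighted-F1 is not reproduced here because it depends on the per-tier row counts of the test fold, which this card does not list.

```python
# XGBoost per-class F1 on the seed-42 test fold (Per-class F1 table).
xgb_f1 = {
    "organised_syndicate": 0.739,
    "nation_state_nexus": 0.686,
    "raas_affiliate": 0.646,
    "lone_actor": 0.630,
}

# Macro-F1 is the unweighted mean, so each tier counts equally
# regardless of how many test rows it has.
macro_f1 = sum(xgb_f1.values()) / len(xgb_f1)
print(f"{macro_f1:.3f}")  # 0.675
```

The result agrees with the reported 0.6751 up to the three-decimal rounding of the table entries.
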
### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.6898 | 0.6751 | 0.8736 | n/a |
| No behavioural features | 0.5673 | 0.5214 | 0.8107 | **−0.1225** |
| No topology features | 0.6146 | 0.6302 | 0.8707 | −0.0752 |
| No `timestep` | 0.6717 | 0.6417 | 0.8673 | −0.0181 |
| No engineered features | 0.6882 | 0.6563 | 0.8747 | −0.0016 |

Four findings:

1. **Behavioural features carry the most tier signal** (removing them
   costs 12 pp accuracy and 15 pp macro-F1). This is the most important
   finding: tier prediction is genuinely behaviour-driven, not a
   topology-lookup shortcut. Sustained C2 intensity, lateral-move
   velocity, wiper deployment, and LotL technique use jointly
   discriminate tiers.
2. **Topology contributes ~7.5 pp accuracy.** Defender posture (SOC
   maturity, backup tier, EDR coverage) provides useful conditioning
   context, because actors target environments differently by tier.
3. **`timestep` matters much less than for phase prediction** (drops only
   ~2 pp). This is expected and good: phase prediction depends on
   knowing *where* in the lifecycle you are; tier prediction depends on
   *how* the actor operates, which is more invariant to timestep.
4. **Engineered features barely contribute on their own**: the trees
   recover most of the c2_intensity, escalation_velocity, etc. signal
   directly from the raw features (accuracy drops 0.2 pp and ROC-AUC is
   marginally higher without them). They remain in the pipeline as
   a documented baseline-feature reference.

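
The percentage-point figures quoted in the findings follow directly from the ablation table. The short script below recomputes them for all three metrics; the dictionary keys are just labels chosen for this example.

```python
FULL = {"accuracy": 0.6898, "macro_f1": 0.6751, "roc_auc": 0.8736}

ablations = {
    "no_behavioural": {"accuracy": 0.5673, "macro_f1": 0.5214, "roc_auc": 0.8107},
    "no_topology":    {"accuracy": 0.6146, "macro_f1": 0.6302, "roc_auc": 0.8707},
    "no_timestep":    {"accuracy": 0.6717, "macro_f1": 0.6417, "roc_auc": 0.8673},
    "no_engineered":  {"accuracy": 0.6882, "macro_f1": 0.6563, "roc_auc": 0.8747},
}

# Drop in percentage points relative to the full feature set (positive = worse).
drops_pp = {
    name: {m: round((FULL[m] - scores[m]) * 100, 2) for m in FULL}
    for name, scores in ablations.items()
}

# Print the configurations from largest to smallest accuracy drop.
for name, drop in sorted(drops_pp.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(f"{name:15s} acc {drop['accuracy']:6.2f} pp   "
          f"macro-F1 {drop['macro_f1']:6.2f} pp   ROC-AUC {drop['roc_auc']:6.2f} pp")
```

A negative ROC-AUC drop for `no_engineered` reflects the slightly higher ROC-AUC in that row of the table.
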
### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 4 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `63 → 128 → 64 → 4`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.

## Limitations

**This is a baseline reference, not a production threat-attribution system.**

1. **Adjacent-tier confusion is expected.** The hardest discriminations
   are `lone_actor` ↔ `nation_state_nexus` (both small minorities,
   sometimes behaviourally similar in early-phase recon) and
   `raas_affiliate` ↔ `organised_syndicate` (operationally similar in
   mid-campaign). Confusion-matrix-aware downstream logic (e.g. flagging
   disagreement between XGBoost and MLP for analyst review) is recommended.

2. **MLP weaker than XGBoost.** The MLP lags ~18 pp accuracy behind
   XGBoost. This is a common pattern on tabular data when training set
   sizes don't justify deep-model parameter counts. Both are published;
   the recommendation is XGBoost as the primary predictor and the MLP
   for disagreement-as-triage signal.

3. **Synthetic-vs-real transfer.** The dataset is synthetic and
   calibrated to ransomware threat-intelligence benchmark targets
   (Mandiant M-Trends, CrowdStrike GTR, Coveware Quarterly, Sophos
   State of Ransomware, IBM CODB, Verizon DBIR, CISA #StopRansomware,
   Chainalysis). Real ransomware telemetry has different noise
   characteristics, adversary adaptation, and instrumentation gaps. Do
   not assume metrics transfer.

4. **Adversarial robustness not evaluated.** The dataset is not
   adversarially generated; the model has not been red-teamed against
   tier-spoofing campaigns (a real attacker may deliberately mimic
   another tier's TTPs to evade attribution).

5. **Per-tier sample sizes are still modest.** `lone_actor` and
   `nation_state_nexus` have only 75 campaigns each in the whole sample,
   and only about 70% of those land in the training fold. The
   full ~5,500-campaign CYB005 product (with ~825 per minority tier)
   would tighten the per-class confidence intervals materially.

## Notes on dataset schema

The CYB005 sample dataset README describes some fields differently
from the actual schema. The model was trained on the actual schema;
this note helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| "7 attack phases" (initial_access, persistence, privilege_escalation, lateral_movement, data_exfiltration, encryption_deployment, ransom_demand) | **8 attack phases**: `initial_access`, `internal_recon`, `privilege_escalation`, `lateral_movement`, `exfiltration_staging`, `encryption_detonation`, `ransom_negotiation`, `recovery_in_progress`. (No `persistence` phase as a distinct value; `recovery_in_progress` is the dominant phase at 35% of rows because campaigns run beyond detonation.) |
| Backup tiers include `cloud_replicated`, `immutable_object_lock` | Backup tiers in the actual data use `offsite_unverified`, `offsite_verified_immutable` for those concepts |
| Summary has `campaign_outcome`, `dwell_time_pre_detonation_hrs` | Neither field exists. Use `total_dwell_time_hrs` and `campaign_success_flag` / `detection_phase` instead |
| Per-timestep includes `endpoints_compromised`, `lateral_pivots`, `edr_alerted`, `siem_correlated`, `lotl_technique_used`, `vss_deletion_attempted`, `wiper_component_deployed`, `dwell_hours`, `c2_beacon_active`, `backup_maturity_tier` | Actual per-timestep columns: `endpoints_compromised` ✓, `lateral_move_count` (not pivots), no `edr_alerted`/`siem_correlated`/`vss_deletion_attempted`/`dwell_hours`/`c2_beacon_active`; `defender_alert_score` and `attribution_risk_score` exist instead; `backup_maturity_tier` is only on per-campaign `victim_topology`, not per-timestep |

None of these discrepancies affects model correctness, because the feature
pipeline uses the actual column names. If you build your own pipeline
against the dataset, use the actual columns.

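
If you start from the README's column list, a small reconciliation step can surface these differences before they turn into `KeyError`s. The sketch below is a hypothetical helper, not part of this repo: `RENAMED` holds only the one rename documented in the table, and `actual_header` is an abbreviated stand-in for the header you would read from the real per-timestep file.

```python
# README name -> actual column name, for the rename documented in the table.
RENAMED = {"lateral_pivots": "lateral_move_count"}


def reconcile(readme_columns, actual_columns):
    """Translate README column names to actual ones and report what is missing."""
    actual = set(actual_columns)
    resolved, missing = [], []
    for name in readme_columns:
        candidate = RENAMED.get(name, name)
        (resolved if candidate in actual else missing).append(candidate)
    return resolved, missing


# Abbreviated header for illustration; read the real one from the data file.
actual_header = ["endpoints_compromised", "lateral_move_count",
                 "defender_alert_score", "attribution_risk_score"]
resolved, missing = reconcile(
    ["endpoints_compromised", "lateral_pivots", "edr_alerted"], actual_header)
print(resolved)  # ['endpoints_compromised', 'lateral_move_count']
print(missing)   # ['edr_alerted']
```
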
## Intended use

- **Evaluating fit** of the CYB005 dataset for your threat-attribution
  or ransomware-research work
- **Baseline reference** for new model architectures (especially
  sequence models, which should beat this baseline by leveraging
  temporal context across the 75-step campaign)
- **Teaching and demo** for multi-class tabular classification on
  cybersecurity telemetry
- **Feature engineering reference** for ransomware campaign attribution

## Out-of-scope use

- Production threat-actor attribution on real ransomware campaigns
- Incident-response decision-making on real systems
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational security or law-enforcement decision

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by campaign_id), on the
published sample (`xpertsystems/cyb005-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
`multi_seed_results.json` confirm robust performance across splits.

The training script itself is private to XpertSystems.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline (load → join topology → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB005** dataset contains ~358,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative ransomware threat-intelligence sources (Mandiant
M-Trends, CrowdStrike GTR, Coveware Quarterly Ransomware Report,
Sophos State of Ransomware, IBM CODB, Verizon DBIR, CISA
#StopRansomware, Chainalysis). The full XpertSystems.ai synthetic
data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
Insurance & Risk, Oil & Gas, and Materials & Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb005-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)

## Citation

```bibtex
@misc{xpertsystems_cyb005_baseline_2026,
  title  = {CYB005 Baseline Classifier: XGBoost and MLP for Ransomware Actor-Tier Attribution},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb005-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb005-sample}
}
```