---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- phishing
- email-security
- bec
- social-engineering
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb004-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb004-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 7-class phishing campaign phase classification
    dataset:
      type: xpertsystems/cyb004-sample
      name: CYB004 Synthetic Phishing Campaign Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9356
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.6547
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.6401
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.649
      name: Multi-seed accuracy mean ± 0.038 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.937
      name: Multi-seed ROC-AUC mean ± 0.010 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.9265
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.6427
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.6275
      name: Test macro-F1 (MLP, seed 42)
---
| |
| # CYB004 Baseline Classifier |
|
|
| **Phishing campaign phase classifier trained on the CYB004 synthetic |
| phishing campaign sample. Predicts which of 7 lifecycle phases a |
| per-timestep telemetry record belongs to, from observable trajectory |
| and victim-topology features.** |
|
|
| > **Baseline reference, not for production use.** This model demonstrates |
| > that the [CYB004 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb004-sample) |
| > is learnable end-to-end and gives prospective buyers a working starting |
| > point. It is not a production email-security platform, SOAR component, |
| > or threat detector. See [Limitations](#limitations). |
|
|
| ## Model overview |
|
|
| | Property | Value | |
| |---|---| |
| | Task | 7-class campaign_phase classification | |
| | Training data | `xpertsystems/cyb004-sample` (3,952 timesteps across 100 phishing campaigns) | |
| | Models | XGBoost + PyTorch MLP | |
| | Input features | 53 (after one-hot encoding) | |
| | Split | **Group-aware by campaign_id** (disjoint train/val/test campaigns) | |
| | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds | |
| | License | CC-BY-NC-4.0 (matches dataset) | |
| | Status | Reference baseline | |
| |
| ## Why this task instead of actor-tier attribution? |
| |
| The CYB004 dataset README leads with "actor attribution modelling — 4-tier |
| classification" as a suggested use case. We piloted that target first and |
| found a serious issue: four features in the dataset |
| (`lure_personalisation_score`, `click_through_rate`, |
| `credential_submission_rate`, `target_department_id`) are **constant per |
| campaign**, not per-timestep. They look like per-step features but each |
| takes a single value across all ~40 timesteps of a given campaign. |
| |
| Because these constants are tier-correlated (especially |
| `lure_personalisation_score`, which differs systematically across the |
| four actor tiers), they leak tier identity through the campaign-level |
| fingerprint they create. With a 15-campaign test fold, many test |
| campaigns land in the same feature ranges as training campaigns of the |
| same tier, and the model achieves spurious 97%+ accuracy that does not |
| generalize. Removing those features (the honest fix) drops tier |
| prediction to **accuracy 0.45, ROC-AUC 0.70 — below majority baseline |
| of 0.59**. The full 335k-row CYB004 product, with ~4,800 campaigns, |
| will not have this constraint; the sample at n=100 cannot support |
| honest tier learning. |
| |
| We pivoted to **campaign_phase prediction**, which has 3,952 rows of |
| per-timestep data spread across 7 phases with tight timestep windows. |
| It learns cleanly under the same group-aware split: 65% accuracy, |
| ROC-AUC 0.94, stable across 10 seeds. This is a legitimate |
| email-security use case — SOAR playbooks and threat-hunting workflows |
| need to tag what phase of a phishing campaign observed activity |
| belongs to. |
| |
| Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal: |
| |
| - `model_xgb.json` — gradient-boosted trees, primary recommendation |
| - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format |
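
The disagreement signal mentioned above can be sketched as a comparison of the two models' predicted classes. This is a minimal illustration rather than code from this repository: `flag_disagreement` is a hypothetical helper, and the toy probability arrays stand in for the `predict_proba` output of the XGBoost model and the softmax output of the MLP.

```python
import numpy as np

def flag_disagreement(proba_xgb, proba_mlp):
    """Return a boolean mask that is True where the two models pick different classes."""
    proba_xgb = np.asarray(proba_xgb)
    proba_mlp = np.asarray(proba_mlp)
    return proba_xgb.argmax(axis=1) != proba_mlp.argmax(axis=1)

# Toy probabilities for 3 records over 3 classes (illustrative values only).
p_xgb = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
p_mlp = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

needs_review = flag_disagreement(p_xgb, p_mlp)
print(needs_review)  # only the second record is flagged
```

Records where the mask is True are candidates for human review; how to act on them is left to the caller.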
|
|
| ## Quick start |
|
|
| ```bash |
| pip install xgboost torch safetensors pandas huggingface_hub |
| ``` |
|
|
| ```python |
| import json |
| import os |
| import sys |
| |
| import numpy as np |
| import torch |
| import xgboost as xgb |
| from huggingface_hub import hf_hub_download |
| from safetensors.torch import load_file |
| |
| REPO = "xpertsystems/cyb004-baseline-classifier" |
| |
| # Download the model artifacts and the bundled feature pipeline. |
| paths = {n: hf_hub_download(REPO, n) for n in [ |
|     "model_xgb.json", "model_mlp.safetensors", |
|     "feature_engineering.py", "feature_meta.json", "feature_scaler.json", |
| ]} |
| |
| # Make the downloaded feature_engineering.py importable. |
| sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"])) |
| from feature_engineering import ( |
|     transform_single, load_meta, INT_TO_LABEL, build_department_lookup |
| ) |
| |
| meta = load_meta(paths["feature_meta.json"]) |
| xgb_model = xgb.XGBClassifier() |
| xgb_model.load_model(paths["model_xgb.json"]) |
| dept_lookup = build_department_lookup("path/to/victim_topology.csv") |
| |
| # Predict. `my_record` is a placeholder for one raw per-timestep record that |
| # you supply; see inference_example.ipynb for the full pattern. |
| dept_aggs = dept_lookup.get(my_record["target_department_id"], {}) |
| X = transform_single(my_record, meta, victim_aggregates=dept_aggs) |
| proba = xgb_model.predict_proba(X)[0] |
| print(INT_TO_LABEL[int(np.argmax(proba))]) |
| ``` |
|
|
| See [`inference_example.ipynb`](./inference_example.ipynb) for the full |
| copy-paste demo. |
|
|
| ## Training data |
|
|
| Trained on the public sample of CYB004, 3,952 per-timestep trajectory |
| rows from 100 phishing campaigns (~40 timesteps per campaign): |
|
|
| | Phase | Total rows | Test rows (seed 42) | |
| |---|---:|---:| |
| | `email_delivery` | 919 | 134 | |
| | `victim_engagement` | 667 | 102 | |
| | `target_reconnaissance` | 558 | 89 | |
| | `post_compromise_escalation` | 533 | 50 | |
| | `credential_harvesting` | 494 | 91 | |
| | `lure_crafting` | 435 | 71 | |
| | `infrastructure_setup` | 346 | 48 | |
|
|
| ### Group-aware split |
|
|
| A single campaign generates ~40 highly-correlated timesteps. Random |
| row-level splitting would put timesteps from the same campaign in both |
| train and test, inflating metrics in a way that does not generalize to |
| new campaigns. |
|
|
| This release uses **GroupShuffleSplit by `campaign_id`** (nested, |
| 70/15/15): |
| |
| | Fold | Campaigns | Timesteps | |
| |---|---:|---:| |
| | Train | 69 | 2,792 | |
| | Validation | 16 | 575 | |
| | Test | 15 | 585 | |
| |
| All test campaigns are completely unseen during training. Class imbalance |
| is addressed with balanced class weights (`class_weight='balanced'` style), |
| passed to XGBoost as per-row `sample_weight` and to the MLP as weighted |
| cross-entropy. |
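
The nested group-aware split described above can be sketched with scikit-learn. This is an illustration on synthetic stand-in data, not the private training script: the array shapes, the integer fold sizes (chosen to reproduce the 69/16/15 campaign counts in the table) and the reuse of `random_state=42` for both splits are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 campaigns x 40 timesteps, 7 phase labels, 5 features.
groups = np.repeat(np.arange(100), 40)
y = rng.integers(0, 7, size=groups.size)
X = rng.normal(size=(groups.size, 5))

# Outer split: hold out 15 whole campaigns for test.
outer = GroupShuffleSplit(n_splits=1, test_size=15, random_state=42)
dev_idx, test_idx = next(outer.split(X, y, groups))

# Inner split: hold out 16 of the remaining 85 campaigns for validation.
inner = GroupShuffleSplit(n_splits=1, test_size=16, random_state=42)
tr_rel, val_rel = next(inner.split(X[dev_idx], y[dev_idx], groups[dev_idx]))
train_idx, val_idx = dev_idx[tr_rel], dev_idx[val_rel]

# No campaign may appear in more than one fold.
train_c, val_c, test_c = (set(groups[i]) for i in (train_idx, val_idx, test_idx))
print(len(train_c), len(val_c), len(test_c))

# Balanced per-row weights, suitable for passing as sample_weight to fit().
w_train = compute_sample_weight("balanced", y[train_idx])
```

Splitting by group index first and only then selecting rows is what keeps all timesteps of a campaign in a single fold.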
| |
| ## Feature pipeline |
| |
| The bundled `feature_engineering.py` is the canonical feature recipe. |
| 53 features survive after encoding, drawn from: |
| |
| - **Per-timestep numeric** (7): `timestep`, `emails_sent_cumulative`, `click_through_rate`, `credential_submission_rate`, `gateway_detection_score`, `lure_personalisation_score`, `target_department_id` |
| - **Per-timestep categorical** (2, one-hot): `evasion_technique_active`, `actor_capability_tier` |
| - **Victim topology numeric** (5): `employee_count`, `privileged_account_density`, `mfa_enrollment_rate`, `click_susceptibility_base`, `email_volume_daily` |
| - **Victim topology categorical** (5, one-hot): `department_type`, `industry_sector`, `awareness_training_level`, `gateway_architecture`, `dmarc_enforcement_level` |
| - **Engineered** (6): `log_emails_sent`, `is_gateway_blocked_step`, `is_evasion_active`, `is_high_personalisation`, `has_credential_capture`, `has_user_engagement` |
|
|
| ### Leakage audit |
|
|
| **One column dropped:** `delivery_outcome` (7-class categorical). Its |
| crosstab with `campaign_phase` shows that `no_delivery` appears only in |
| five phases (`target_reconnaissance`, `infrastructure_setup`, |
| `lure_crafting`, `credential_harvesting`, `post_compromise_escalation`) |
| and never in `email_delivery` or `victim_engagement`. Cell purity is 0.36 |
| (uniform baseline 0.14). Keeping it would give the model a near-oracle |
| for separating the delivery and engagement phases from the rest. |
|
|
| **No oracle features remain.** All retained features have phase-purity |
| under 0.20. |
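
This card does not spell out how purity is computed. One plausible reading, sketched below as an assumption, is the share of rows that fall in the majority label of their feature value, taken from the feature-by-label crosstab. `cell_purity` is a hypothetical helper and the toy series are illustrative.

```python
import pandas as pd

def cell_purity(feature, label):
    """Share of rows that fall in the majority label of their feature value."""
    table = pd.crosstab(feature, label)
    return table.max(axis=1).sum() / table.values.sum()

labels = pd.Series(["a", "b", "c", "d"] * 3)

# A feature that maps one-to-one onto the label behaves like an oracle.
oracle = labels.str.upper()
# A feature whose every value sees all four labels equally often carries no signal.
uninformative = pd.Series(["x"] * 4 + ["y"] * 4 + ["z"] * 4)

print(cell_purity(oracle, labels), cell_purity(uninformative, labels))
```

Under this definition an oracle feature scores 1.0 and an uninformative feature over k equally frequent labels scores 1/k, which is consistent with the 0.14 uniform baseline quoted for 7 phases.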
|
|
| ### Per-campaign-constant features |
|
|
| Four features (`lure_personalisation_score`, `click_through_rate`, |
| `credential_submission_rate`, `target_department_id`) are constant |
| within each campaign. For **phase prediction** this is acceptable — |
| their phase-purity is low, so the model uses them as conditioning |
| context (similar to "we know this is an APT campaign targeting finance" |
| when reasoning about which phase we're in), not as oracle features. |
| They became a problem only for the abandoned actor-tier task. |
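
A quick way to detect per-campaign-constant columns like the four above is to count distinct values within each campaign. The sketch below assumes a pandas DataFrame with a `campaign_id` column; `per_campaign_constants` is a hypothetical helper and the toy frame is illustrative, not drawn from the dataset.

```python
import pandas as pd

def per_campaign_constants(df, group_col="campaign_id"):
    """List columns that take exactly one distinct value inside every group."""
    nunique = df.groupby(group_col).nunique()
    return [
        c for c in nunique.columns
        if c != group_col and (nunique[c] == 1).all()
    ]

df = pd.DataFrame({
    "campaign_id": [1, 1, 1, 2, 2, 2],
    "timestep": [0, 1, 2, 0, 1, 2],
    "lure_personalisation_score": [0.8, 0.8, 0.8, 0.3, 0.3, 0.3],
    "gateway_detection_score": [0.1, 0.4, 0.2, 0.5, 0.6, 0.2],
})

constants = per_campaign_constants(df)
print(constants)
```

Running a check like this before choosing a target helps reveal features that act as a campaign fingerprint.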
|
|
| ## Evaluation |
|
|
| ### Test-set metrics, seed 42 (n = 585 timesteps from 15 disjoint campaigns) |
|
|
| **XGBoost** (the published `model_xgb.json` artifact) |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.9356** | |
| | Accuracy | **0.6547** | |
| | Macro-F1 | 0.6401 | |
| | Weighted-F1 | 0.6526 | |
|
|
| **MLP** (the published `model_mlp.safetensors` artifact) |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | 0.9265 | |
| | Accuracy | 0.6427 | |
| | Macro-F1 | 0.6275 | |
| | Weighted-F1 | 0.6492 | |
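
The four metrics reported in these tables can be computed from predicted class probabilities with scikit-learn. This is a sketch on toy data, not the evaluation script behind the published numbers; `summarize` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def summarize(y_true, proba):
    """Compute macro ROC-AUC (one-vs-rest), accuracy, macro-F1 and weighted-F1."""
    y_pred = proba.argmax(axis=1)
    return {
        "roc_auc_ovr_macro": roc_auc_score(
            y_true, proba, multi_class="ovr", average="macro"
        ),
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Toy example: 3 classes, every record gets 0.8 probability on its true class.
y_true = np.array([0, 1, 2, 0, 1, 2])
proba = np.full((6, 3), 0.1)
proba[np.arange(6), y_true] = 0.8

scores = summarize(y_true, proba)
print(scores)
```

Note that `roc_auc_score` in multi-class mode expects each probability row to sum to 1 and every class to appear in `y_true`.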
|
|
| ### Multi-seed robustness (XGBoost, 10 seeds) |
|
|
| Performance is consistent across seeds (accuracy std 0.038, ROC-AUC std 0.010), so the seed-42 result is not a lucky split: |
|
|
| | Metric | Mean | Std | Min | Max | |
| |---|---:|---:|---:|---:| |
| | Accuracy | 0.649 | 0.038 | 0.592 | 0.711 | |
| | Macro-F1 | 0.638 | 0.040 | 0.574 | 0.714 | |
| | Macro ROC-AUC OvR | 0.937 | 0.010 | 0.923 | 0.954 | |
|
|
| Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json). |
| All 10 seeds yielded all 7 classes in the test fold. |
|
|
| ### Per-class F1 (seed 42) — where the signal is and isn't |
|
|
| | Phase | XGBoost F1 | MLP F1 | Note | |
| |---|---:|---:|---| |
| | `target_reconnaissance` | **0.888** | 0.831 | Tight early window (timesteps 0-7) | |
| | `email_delivery` | **0.791** | 0.761 | Tight window (8-30); gateway signals + email volume | |
| | `infrastructure_setup` | **0.712** | 0.702 | Tight window (5-18) | |
| | `lure_crafting` | **0.676** | 0.561 | Tight window (3-13) | |
| | `post_compromise_escalation` | 0.604 | 0.717 | Late window (22-52) | |
| | `victim_engagement` | 0.469 | 0.387 | Mid window (14-38), overlaps with adjacent phases | |
| | `credential_harvesting` | 0.341 | 0.434 | Mid-late (19-45), similar features to victim_engagement | |
| |
| Four early phases (target_reconnaissance, infrastructure_setup, |
| lure_crafting, email_delivery) classify best because they occupy |
| earlier, relatively narrow timestep windows with distinctive features, |
| even though neighbouring windows overlap (0-7, 5-18, 3-13, 8-30). |
| Three later phases (victim_engagement, credential_harvesting, |
| post_compromise_escalation) overlap substantially in timestep range |
| (14-38, 19-45, 22-52) and share similar behavioural footprints |
| (non-zero click/credential rates, deployed evasion); these are |
| genuinely harder for a flat-tabular model. Sequence models with |
| campaign-level context would likely help here. |
| |
| ### Ablation: which feature groups matter |
| |
| | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | |
| |---|---:|---:|---:|---:| |
| | Full feature set (published) | 0.6547 | 0.6401 | 0.9356 | — | |
| | No `timestep` | 0.3624 | 0.3139 | 0.8128 | **−0.2923** | |
| | No behavioural features | 0.5795 | 0.5735 | 0.9188 | −0.0752 | |
| | No topology features | 0.6410 | 0.6260 | 0.9342 | −0.0137 | |
| | No engineered features | 0.6581 | 0.6402 | 0.9370 | +0.0034 | |
| |
| Three findings: |
| |
| 1. **`timestep` is by far the dominant feature** (drops 29 pp when |
| removed, ROC-AUC still 0.81). Phishing campaigns progress through |
| phases over time; where you are in the campaign timeline carries |
| most of the phase signal. |
| 2. **Behavioural features contribute ~8 pp accuracy.** These are the |
| per-timestep observables (emails sent, gateway score, click rate, |
| evasion technique). |
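| 3. **Topology contributes ~1 pp; engineered features contribute nothing |
| measurable.** Removing the engineered features changes accuracy by |
| +0.3 pp, well inside the seed-to-seed spread (std 3.8 pp), because trees |
| recover most of them on their own. Topology provides modest |
| conditioning context. |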
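
The accuracy deltas in the ablation table can be re-derived from the published accuracies. The snippet below is plain arithmetic on the numbers shown above; the dictionary keys are shorthand labels chosen for illustration.

```python
full_acc = 0.6547  # full feature set (published)

ablations = {
    "no_timestep": 0.3624,
    "no_behavioural": 0.5795,
    "no_topology": 0.6410,
    "no_engineered": 0.6581,
}

# Change in accuracy relative to the full model, in percentage points.
deltas_pp = {name: round((acc - full_acc) * 100, 2) for name, acc in ablations.items()}

for name, delta in deltas_pp.items():
    print(f"{name:>15}: {delta:+.2f} pp")
```

A negative delta means the removed group was helping; the small positive delta for the engineered features means the model did marginally better without them on this seed.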
| |
| ### Architecture |
| |
| **XGBoost:** multi-class gradient boosting (`multi:softprob`, 7 classes), |
| `hist` tree method, class-balanced sample weights, early stopping on |
| validation mlogloss. |
| |
| **MLP:** `53 → 128 → 64 → 7`, each hidden layer followed by `BatchNorm1d` |
| → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, |
| early stopping on validation macro-F1. |
| |
| Training hyperparameters (learning rate, batch size, n_estimators, |
| early-stopping patience, weight decay, class-weighting strategy) are |
| held internally by XpertSystems and are not part of this release. |
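
The MLP layout described above can be written out as a small PyTorch module. This is a sketch under assumptions: the `nn.Sequential` structure, the `build_mlp` name and the resulting state-dict keys are illustrative, so the published `model_mlp.safetensors` weights may not load into it without remapping. See `inference_example.ipynb` for the supported loading path.

```python
import torch
from torch import nn

def build_mlp(n_features=53, n_classes=7, p_drop=0.3):
    """53 -> 128 -> 64 -> 7, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(64, n_classes),
    )

# eval() disables dropout and makes BatchNorm use its running statistics.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 53))  # batch of 4 random feature vectors
proba = torch.softmax(logits, dim=1)
print(tuple(proba.shape))
```

The final layer emits raw logits, which is what a cross-entropy loss expects during training; softmax is applied only at inference time.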
|
|
| ## Limitations |
|
|
| **This is a baseline reference, not a production email-security system.** |
|
|
| 1. **Mid- and late-phase confusion.** Per-class F1 (seed 42) for |
| `victim_engagement`, `credential_harvesting`, and |
| `post_compromise_escalation` is 0.34–0.60 for XGBoost and 0.39–0.72 |
| for the MLP. These phases overlap in timestep range and share similar |
| behavioural signatures. Sequence models that consider campaign-level |
| context would likely help substantially. |
|
|
| 2. **The pivot away from actor-tier classification is dataset-limited, |
| not method-limited.** With 100 campaigns and 4 tiers (some with only |
| 10 campaigns total), tier classification is below majority baseline |
| once leakage-prone features are removed. The full 335k-row CYB004 |
| product provides ~4,800 campaigns; the sample does not. |
|
|
| 3. **Synthetic-vs-real transfer.** The dataset is synthetic and |
| calibrated to email-security and threat-intelligence benchmark |
| targets (Proofpoint State of the Phish, KnowBe4 Industry Benchmark, |
| Cofense PIQ, Mandiant M-Trends, FBI IC3 BEC Report, Verizon DBIR, |
| CISA, APWG). Real phishing telemetry has different noise |
| characteristics, adversary adaptation, and instrumentation gaps. Do |
| not assume metrics transfer. |
|
|
| 4. **Adversarial robustness not evaluated.** The dataset is not |
| adversarially generated; the model has not been red-teamed against |
| evasive lures or novel infrastructure. |
|
|
| 5. **MLP brittleness on OOD inputs.** With ~2.8k training timesteps, |
| the MLP can produce confidently-wrong predictions on hand-crafted |
| records far from the training manifold. XGBoost is more robust. |
| Use both; treat disagreement as a signal for human review. |
|
|
| 6. **`timestep` dominance is a property of the dataset.** Real |
| phishing telemetry doesn't carry a clean per-campaign normalized |
| timestep — that's a simulator artifact. A buyer transferring this |
| baseline to real campaign telemetry would need to recover an |
| equivalent temporal-position feature (e.g. hours since campaign |
| first observation, position in stage-detection pipeline). |
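
As an illustration of the temporal-position feature suggested in item 6, the sketch below derives hours since a campaign was first observed. It assumes events have already been grouped into campaigns and carry a timestamp; the `observed_at` column and the `add_hours_since_first_seen` helper are hypothetical and not part of the CYB004 schema.

```python
import pandas as pd

def add_hours_since_first_seen(events, group_col="campaign_id", time_col="observed_at"):
    """Add hours elapsed since the earliest event of the same campaign."""
    out = events.copy()
    first_seen = out.groupby(group_col)[time_col].transform("min")
    out["hours_since_first_seen"] = (
        (out[time_col] - first_seen).dt.total_seconds() / 3600.0
    )
    return out

events = pd.DataFrame({
    "campaign_id": ["c1", "c1", "c2", "c2"],
    "observed_at": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 06:00",
        "2024-01-02 12:00", "2024-01-03 12:00",
    ]),
})

result = add_hours_since_first_seen(events)
print(result["hours_since_first_seen"].tolist())
```

Such a feature is only a rough analogue of the simulator's `timestep`; its scale and distribution would need to be checked before reusing this baseline.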
|
|
| ## Notes on dataset schema |
|
|
| The CYB004 sample dataset README describes some fields differently from |
| the actual schema. The model was trained on the actual schema; this note |
| helps buyers reconcile what they read with what they receive. |
|
|
| | What the README says | What the data actually contains | |
| |---|---| |
| | "9 campaign phases" (reconnaissance, infrastructure_setup, lure_creation, send_wave, gateway_evaluation, user_interaction, credential_capture, lateral_pivot, exfiltration) | 7 phases with different names: target_reconnaissance, infrastructure_setup, lure_crafting, email_delivery, victim_engagement, credential_harvesting, post_compromise_escalation | |
| | 4 actor tiers: `opportunistic`, `organized_crime`, `targeted`, `nation_state_apt` | 4 tiers: `opportunistic`, `cybercriminal_gang`, `initial_access_broker`, `nation_state_apt` | |
| | 8 department types listed | 4 department types: `executive_leadership`, `finance_accounts_payable`, `human_resources`, `information_technology` | |
| | 4 gateway architectures | 8 gateway architectures including `ai_sender_reputation`, `integrated_cloud_defender`, `zero_trust_email_proxy` | |
| | Awareness training: none, annual, semi-annual, quarterly, monthly | annual, none, continuous, basic, quarterly (no semi-annual or monthly) | |
| | Per-timestep fields: `send_volume`, `gateway_blocked`, `emails_delivered`, `user_report_count`, `mfa_bypass_attempted`, `bec_attempt`, `lateral_pivot_attempted`, `operational_stealth_score`, `dmarc_enforcement_active` | None of these exist per-timestep. The actual per-timestep columns are: `emails_sent_cumulative`, `gateway_detection_score`, `delivery_outcome`, `lure_personalisation_score`, `evasion_technique_active`. BEC / MFA bypass / lateral phishing flags exist only at the campaign-summary level. | |
|
|
| None of these discrepancies affects model correctness — the feature |
| pipeline uses the actual column names. If you build your own pipeline |
| against the dataset, use the actual columns. |
|
|
| ## Intended use |
|
|
| - **Evaluating fit** of the CYB004 dataset for your email-security |
| or threat-hunting research |
| - **Baseline reference** for new model architectures (especially |
| sequence models, which should beat this baseline on the overlapping |
| mid-late phases) |
| - **Teaching and demo** for tabular classification on phishing |
| campaign telemetry |
| - **Feature engineering reference** for per-timestep campaign data |
|
|
| ## Out-of-scope use |
|
|
| - Production email security on real campaign telemetry |
| - Threat hunting / SOAR playbooks on real systems |
| - Actor attribution (this baseline does not address that task; see why above) |
| - Adversarial-evasion evaluation (dataset not adversarially generated) |
| - Any operational security decision |
|
|
| ## Reproducibility |
|
|
| Outputs above were produced with `seed = 42` (published artifact), |
| group-aware nested `GroupShuffleSplit` (70/15/15 by campaign_id), on the |
| published sample (`xpertsystems/cyb004-sample`, version 1.0.0, generated |
| 2026-05-16). The feature pipeline in `feature_engineering.py` is |
| deterministic and the trained weights in this repo correspond exactly |
| to the metrics above. |
|
|
| Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in |
| `multi_seed_results.json` confirm robust performance across splits. |
|
|
| The training script itself is private to XpertSystems. |
|
|
| ## Files in this repo |
|
|
| | File | Purpose | |
| |---|---| |
| | `model_xgb.json` | XGBoost weights (seed 42) | |
| | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) | |
| | `feature_engineering.py` | Feature pipeline (load → join topology → engineer → encode) | |
| | `feature_meta.json` | Feature column order + categorical levels | |
| | `feature_scaler.json` | MLP input mean/std (XGBoost ignores) | |
| | `validation_results.json` | Per-class metrics, confusion matrix, architecture | |
| | `ablation_results.json` | Per-feature-group ablation | |
| | `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics | |
| | `inference_example.ipynb` | End-to-end inference demo notebook | |
| | `README.md` | This file | |
|
|
| ## Contact and full product |
|
|
| The full **CYB004** dataset contains ~335,000 rows across four files, |
| with calibrated benchmark validation against 12 metrics from email |
| security and threat intelligence sources (Proofpoint, KnowBe4, |
| Cofense, Mandiant, FBI IC3, Verizon, CISA, APWG). The full |
| XpertSystems.ai synthetic data catalogue spans 41 SKUs across |
| Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials |
| & Energy. |
|
|
| - 📧 **pradeep@xpertsystems.ai** |
| - 🌐 **https://xpertsystems.ai** |
| - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb004-sample |
| - 🤖 Companion models: |
| - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic) |
| - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain) |
| - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{xpertsystems_cyb004_baseline_2026, |
| title = {CYB004 Baseline Classifier: XGBoost and MLP for Phishing Campaign Phase Classification}, |
| author = {XpertSystems.ai}, |
| year = {2026}, |
| url = {https://huggingface.co/xpertsystems/cyb004-baseline-classifier}, |
| note = {Baseline reference model trained on xpertsystems/cyb004-sample} |
| } |
| ``` |
|
|