| --- |
| license: cc-by-nc-4.0 |
| library_name: pytorch |
| tags: |
| - cybersecurity |
| - malware |
| - malware-behaviour |
| - sandbox-analysis |
| - edr |
| - tabular-classification |
| - synthetic-data |
| - xgboost |
| - baseline |
| pipeline_tag: tabular-classification |
| base_model: [] |
| datasets: |
| - xpertsystems/cyb003-sample |
| metrics: |
| - accuracy |
| - f1 |
| - roc_auc |
| model-index: |
| - name: cyb003-baseline-classifier |
|   results: |
|   - task: |
|       type: tabular-classification |
|       name: 10-class malware execution phase classification |
|     dataset: |
|       type: xpertsystems/cyb003-sample |
|       name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample) |
|     metrics: |
|     - type: roc_auc |
|       value: 0.9792 |
|       name: Test macro ROC-AUC OvR (XGBoost, seed 42) |
|     - type: accuracy |
|       value: 0.9178 |
|       name: Test accuracy (XGBoost, seed 42) |
|     - type: f1 |
|       value: 0.7781 |
|       name: Test macro-F1 (XGBoost, seed 42) |
|     - type: accuracy |
|       value: 0.905 |
|       name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds) |
|     - type: roc_auc |
|       value: 0.975 |
|       name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds) |
|     - type: roc_auc |
|       value: 0.9681 |
|       name: Test macro ROC-AUC OvR (MLP, seed 42) |
|     - type: accuracy |
|       value: 0.8222 |
|       name: Test accuracy (MLP, seed 42) |
|     - type: f1 |
|       value: 0.7072 |
|       name: Test macro-F1 (MLP, seed 42) |
| --- |
| |
| # CYB003 Baseline Classifier |
|
|
| **Malware execution-phase classifier trained on the CYB003 synthetic |
| malware behaviour sample. Predicts which of 10 execution phases a |
| per-timestep telemetry record belongs to, from observable behavioural |
| and PE-static features.** |
|
|
| > **Baseline reference, not for production use.** This model demonstrates |
| > that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample) |
| > is learnable end-to-end and gives prospective buyers a working starting |
| > point. It is not a production sandbox, EDR, or threat-detection system. |
| > See [Limitations](#limitations). |
|
|
| ## Model overview |
|
|
| | Property | Value | |
| |---|---| |
| | Task | 10-class execution_phase classification | |
| | Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) | |
| | Models | XGBoost + PyTorch MLP | |
| | Input features | 69 (after one-hot encoding) | |
| | Split | **Group-aware by sample_id** (disjoint train/val/test samples) | |
| | Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds | |
| | License | CC-BY-NC-4.0 (matches dataset) | |
| | Status | Reference baseline | |
| |
| ## Why this task instead of malware family classification? |
| |
| The CYB003 dataset README leads with "training malware family classifiers" |
| as a suggested use case. We piloted that target first and found it is |
| **not learnable from the sample dataset** under proper group-aware |
| evaluation: with only 100 unique samples spread across 10 families, |
| XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58 |
| — at majority baseline. Per-sample aggregation gives the same result. |
| |
| This is a **sample-size constraint**, not a feature-engineering failure. |
| With ~7 training samples per family on average (69 training samples across |
| 10 families), the model sees too few examples of each family to generalize, |
| and a held-out test set of 15 samples covers at most ~8 families. |
| The full 280k-row CYB003 product, with ~28 samples per family at the |
| sample's distribution, will not have this constraint. |
| |
| We pivoted to **execution_phase prediction**, which has 6,000 rows of |
| per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable |
| across seeds. This is a legitimate SOC use case — dynamic-analysis tools |
| and EDR systems regularly need to tag what phase of execution observed |
| malware activity belongs to — and it shows the dataset is well-calibrated |
| even when the headline product use case needs more data. |
| |
| Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal: |
| |
| - `model_xgb.json` — gradient-boosted trees, primary recommendation |
| - `model_mlp.safetensors` — PyTorch MLP in SafeTensors format |
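A minimal sketch of that triage idea, assuming both models' class probabilities are available as arrays of shape `(n_rows, n_classes)`. The helper name `flag_disagreements` and the `min_confidence` threshold are illustrative and not part of the published artifacts:

```python
import numpy as np

def flag_disagreements(proba_xgb, proba_mlp, min_confidence=0.5):
    """Return a boolean mask of rows that deserve human review.

    A row is flagged when the two models pick different classes, or when
    either model's top probability falls below `min_confidence`
    (a hypothetical threshold; tune it on validation data).
    """
    proba_xgb = np.asarray(proba_xgb, dtype=float)
    proba_mlp = np.asarray(proba_mlp, dtype=float)
    if proba_xgb.shape != proba_mlp.shape:
        raise ValueError("probability arrays must have the same shape")
    disagree = proba_xgb.argmax(axis=1) != proba_mlp.argmax(axis=1)
    unsure = np.minimum(proba_xgb.max(axis=1), proba_mlp.max(axis=1)) < min_confidence
    return disagree | unsure

# Toy 3-class example: row 0 agrees, row 1 disagrees, row 2 agrees but is low-confidence.
xgb_p = np.array([[0.9, 0.05, 0.05], [0.7, 0.2, 0.1], [0.4, 0.35, 0.25]])
mlp_p = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.45, 0.30, 0.25]])
review_mask = flag_disagreements(xgb_p, mlp_p)
print(review_mask.tolist())  # [False, True, True]
```

Rows where the mask is `True` can be routed to an analyst instead of being tagged automatically.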
|
|
| ## Quick start |
|
|
| ```bash |
| pip install xgboost scikit-learn torch safetensors pandas huggingface_hub |
| ``` |
|
|
| ```python |
| import os |
| import sys |
| |
| import numpy as np |
| import xgboost as xgb |
| from huggingface_hub import hf_hub_download |
| |
| REPO = "xpertsystems/cyb003-baseline-classifier" |
| |
| # Download the published artifacts from the Hub. |
| paths = {n: hf_hub_download(REPO, n) for n in [ |
|     "model_xgb.json", "model_mlp.safetensors", |
|     "feature_engineering.py", "feature_meta.json", "feature_scaler.json", |
| ]} |
| |
| # Make the bundled feature pipeline importable. |
| sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"])) |
| from feature_engineering import transform_single, load_meta, INT_TO_LABEL |
| |
| meta = load_meta(paths["feature_meta.json"]) |
| xgb_model = xgb.XGBClassifier() |
| xgb_model.load_model(paths["model_xgb.json"]) |
| |
| # Predict. `my_timestep_record` is a placeholder for one raw per-timestep |
| # record; see inference_example.ipynb for the full pattern. |
| X = transform_single(my_timestep_record, meta) |
| proba = xgb_model.predict_proba(X)[0] |
| print(INT_TO_LABEL[int(np.argmax(proba))]) |
| ``` |
|
|
| See [`inference_example.ipynb`](./inference_example.ipynb) for the full |
| copy-paste demo. |
|
|
| ## Training data |
|
|
| Trained on the public sample of CYB003, 6,000 per-timestep telemetry |
| rows from 100 malware samples (60 timesteps per sample): |
|
|
| | Phase | Total rows | Share of all rows | Test rows (seed 42) | |
| |---|---:|---:|---:| |
| | `initial_drop` | 801 | 13.4% | 120 | |
| | `lateral_movement` | 799 | 13.3% | 120 | |
| | `persistence_establishment` | 787 | 13.1% | 119 | |
| | `data_exfiltration` | 783 | 13.1% | 100 | |
| | `c2_communication` | 709 | 11.8% | 87 | |
| | `privilege_escalation` | 705 | 11.8% | 107 | |
| | `payload_execution` | 705 | 11.8% | 109 | |
| | `dormancy_dwell` | 250 | 4.2% | 83 | |
| | `sandbox_evasion_stall` | 234 | 3.9% | 32 | |
| | `self_destruct_cleanup` | 227 | 3.8% | 23 | |
|
|
| ### Group-aware split |
|
|
| A single malware sample generates 60 highly correlated timesteps. Random |
| row-level splitting would put timesteps from the same sample in both |
| train and test, inflating metrics in a way that does not generalize to |
| new samples. |
|
|
| This release uses **GroupShuffleSplit by `sample_id`** (nested, 70/15/15): |
| |
| | Fold | Samples | Timesteps | |
| |---|---:|---:| |
| | Train | 69 | 4,140 | |
| | Validation | 16 | 960 | |
| | Test | 15 | 900 | |
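The nested split can be sketched with scikit-learn as below. This is an illustration under assumptions: the integer group counts (15 test samples, then 16 validation samples) are chosen to match the table, the `sample_id` array is a stand-in for the real column, and the `random_state` values are arbitrary, so the selected sample IDs will not match the published split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Stand-in for the real table: 100 samples x 60 timesteps, grouped by sample_id.
sample_id = np.repeat(np.arange(100), 60)
row_idx = np.arange(sample_id.size)

# Outer split: hold out 15 whole samples for test.
outer = GroupShuffleSplit(n_splits=1, test_size=15, random_state=42)
trainval_idx, test_idx = next(outer.split(row_idx, groups=sample_id))

# Inner split: hold out 16 of the remaining 85 samples for validation.
inner = GroupShuffleSplit(n_splits=1, test_size=16, random_state=42)
tr_pos, val_pos = next(inner.split(trainval_idx, groups=sample_id[trainval_idx]))
train_idx, val_idx = trainval_idx[tr_pos], trainval_idx[val_pos]

# No sample_id may appear in more than one fold.
train_ids, val_ids, test_ids = (set(sample_id[i]) for i in (train_idx, val_idx, test_idx))
assert train_ids.isdisjoint(val_ids)
assert train_ids.isdisjoint(test_ids)
assert val_ids.isdisjoint(test_ids)

print(len(train_ids), len(val_ids), len(test_ids))   # 69 16 15
print(len(train_idx), len(val_idx), len(test_idx))   # 4140 960 900
```

The inner splitter returns positions within `trainval_idx`, which is why they are mapped back to row indices before use.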
| |
| All test samples are completely unseen during training. Class imbalance |
| is addressed with `'balanced'` class weights, applied as per-row |
| `sample_weight` for XGBoost (which has no `class_weight` parameter) and |
| as weighted cross-entropy for the MLP. |
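As a rough illustration of the balanced weighting, the sketch below applies the usual heuristic `weight_c = n_rows / (n_classes * count_c)`, which is the formula scikit-learn uses for `class_weight='balanced'`. It uses the whole-sample row counts from the table above for readability; in training the counts would come from the training fold only, and the helper name is hypothetical.

```python
# Per-phase row counts, copied from the "Total rows" column above.
phase_counts = {
    "initial_drop": 801,
    "lateral_movement": 799,
    "persistence_establishment": 787,
    "data_exfiltration": 783,
    "c2_communication": 709,
    "privilege_escalation": 705,
    "payload_execution": 705,
    "dormancy_dwell": 250,
    "sandbox_evasion_stall": 234,
    "self_destruct_cleanup": 227,
}

def balanced_class_weights(counts):
    """weight_c = n_rows / (n_classes * count_c): rare classes get larger weights."""
    n_rows = sum(counts.values())
    n_classes = len(counts)
    return {label: n_rows / (n_classes * c) for label, c in counts.items()}

weights = balanced_class_weights(phase_counts)
for label, w in weights.items():
    print(f"{label:28s} {w:.3f}")

# Per-row sample weights are a lookup on each row's label.
row_labels = ["initial_drop", "self_destruct_cleanup"]
sample_weight = [weights[label] for label in row_labels]
```

With these counts a `self_destruct_cleanup` row weighs about 3.5 times as much as an `initial_drop` row (2.643 versus 0.749).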
| |
| ## Feature pipeline |
| |
| The bundled `feature_engineering.py` is the canonical feature recipe. |
| 69 features survive after encoding, drawn from: |
| |
| - **Per-timestep numeric** (10): `timestep`, `api_call_rate`, `registry_write_count`, `network_connection_count`, `process_injection_flag`, `c2_beacon_interval_sec`, `av_signature_hit_flag`, `sandbox_evasion_flag`, `lateral_propagation_count`, `privilege_escalation_flag` |
| - **PE static features** (11): `pe_entropy_mean`, `pe_entropy_std`, `import_hash_cluster`, `section_count`, `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`, `resource_section_entropy`, `suspicious_import_count`, `packer_detected_flag` |
| - **Categorical** (6, one-hot encoded): `malware_family`, `threat_actor_tier`, `target_platform`, `obfuscation_technique`, `detection_outcome`, `ep_stack` |
| - **Engineered** (6): `api_burst_score`, `is_c2_active`, `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`, `lateral_activity_score` |
|
|
| ### Leakage audit |
|
|
| No categorical feature has a value-to-phase purity above 0.17 (the uniform |
| random baseline is 0.10), so no feature in the dataset acts as an oracle |
| for the target. The model relies on a mix of `timestep` (strong but not |
| deterministic) and behavioural features. |
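One way to run this kind of audit is sketched below. It assumes "purity" means, for each value of a categorical column, the largest share of that value's rows falling in a single phase; that reading, the helper name `max_phase_purity`, and the toy data are assumptions rather than the exact procedure used for this release.

```python
import pandas as pd

def max_phase_purity(df, feature, target="execution_phase"):
    """Largest single-phase share among rows that share one value of `feature`."""
    # Each row of `shares` is one feature value; columns are phases; rows sum to 1.
    shares = pd.crosstab(df[feature], df[target], normalize="index")
    return float(shares.max().max())

# Toy data: "linux" rows are 75% phase "a", so that value is the purest one.
toy = pd.DataFrame({
    "target_platform": ["win", "win", "win", "win", "linux", "linux", "linux", "linux"],
    "execution_phase": ["a", "b", "c", "d", "a", "a", "a", "b"],
})
purity = max_phase_purity(toy, "target_platform")
print(purity)  # 0.75
```

A value close to 1.0 would mean that some categorical value almost determines the phase and should be treated as leakage.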
|
|
| ## Evaluation |
|
|
| ### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples) |
|
|
| **XGBoost** (the published `model_xgb.json` artifact) |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | **0.9792** | |
| | Accuracy | **0.9178** | |
| | Macro-F1 | 0.7781 | |
| | Weighted-F1 | 0.9173 | |
|
|
| **MLP** (the published `model_mlp.safetensors` artifact) |
|
|
| | Metric | Value | |
| |---|---:| |
| | Macro ROC-AUC (OvR) | 0.9681 | |
| | Accuracy | 0.8222 | |
| | Macro-F1 | 0.7072 | |
| | Weighted-F1 | 0.8278 | |
|
|
| ### Multi-seed robustness (XGBoost, 10 seeds) |
|
|
| Accuracy and ROC-AUC are tight across seeds — the task is genuinely |
| learnable, not seed-lucky: |
|
|
| | Metric | Mean | Std | Min | Max | |
| |---|---:|---:|---:|---:| |
| | Accuracy | 0.905 | 0.010 | 0.882 | 0.921 | |
| | Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 | |
| | Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 | |
|
|
| Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json). |
| All 10 seeds yielded all 10 classes in the test fold, supporting clean |
| multi-class ROC-AUC computation. |
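The sketch below shows the macro one-vs-rest computation with scikit-learn on toy 3-class data (the arrays are made up for illustration), and why a test fold that misses a class is a problem: `roc_auc_score` rejects the call when `y_true` contains fewer classes than the probability matrix has columns.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 0, 1, 2])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# One ROC-AUC per class (that class vs the rest), then an unweighted mean.
auc = float(roc_auc_score(y_true, proba, multi_class="ovr", average="macro"))
print(auc)  # 1.0, every class is perfectly ranked in this toy example

# Same scores, but class 2 never occurs in the labels: the call raises.
try:
    roc_auc_score(np.array([0, 1, 0, 1, 0, 1]), proba, multi_class="ovr", average="macro")
    missing_class_ok = True
except ValueError:
    missing_class_ok = False
print(missing_class_ok)  # False
```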
|
|
| ### Per-class F1 (seed 42) — where the signal is and isn't |
|
|
| | Phase | XGBoost F1 | MLP F1 | Note | |
| |---|---:|---:|---| |
| | `c2_communication` | **1.000** | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal | |
| | `persistence_establishment` | **0.992** | 0.870 | Tight timestep window 9-17 + registry writes | |
| | `lateral_movement` | **0.992** | 0.907 | Tight timestep window 26-34 + lateral_propagation | |
| | `privilege_escalation` | **0.991** | 0.915 | Tight timestep window 18-25 + privilege flag | |
| | `data_exfiltration` | **0.970** | 0.918 | Tight timestep window 43-51 + network volume | |
| | `payload_execution` | **0.963** | 0.698 | Tight timestep window 35-42 + API bursts | |
| | `initial_drop` | **0.945** | 0.886 | Tight timestep window 0-8 | |
| | `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range | |
| | `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) | |
| | `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) | |
|
|
| Seven phases are near-trivially classified because they sit in tight |
| timestep windows with characteristic behavioural signatures. **Three |
| phases — `dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup` |
| — scatter across the full 0–59 timestep range** and lack distinctive |
| behavioural features (idle/evasion phases have low activity by design), |
| so a flat-tabular event-level model can't reliably disambiguate them. |
| Sequence models that consider neighbouring timesteps would help here. |
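The gap between accuracy and macro-F1 follows directly from the per-class table: macro-F1 is the unweighted mean of the ten per-class F1 scores, so three weak classes pull it down regardless of how few rows they have. A quick check using the numbers above (the MLP value matches the reported 0.7072 up to rounding of the per-class scores):

```python
from statistics import mean

# Per-class F1 for seed 42, copied from the table above.
xgb_f1 = {
    "c2_communication": 1.000, "persistence_establishment": 0.992,
    "lateral_movement": 0.992, "privilege_escalation": 0.991,
    "data_exfiltration": 0.970, "payload_execution": 0.963,
    "initial_drop": 0.945, "dormancy_dwell": 0.530,
    "self_destruct_cleanup": 0.273, "sandbox_evasion_stall": 0.125,
}
mlp_f1 = {
    "c2_communication": 1.000, "persistence_establishment": 0.870,
    "lateral_movement": 0.907, "privilege_escalation": 0.915,
    "data_exfiltration": 0.918, "payload_execution": 0.698,
    "initial_drop": 0.886, "dormancy_dwell": 0.520,
    "self_destruct_cleanup": 0.282, "sandbox_evasion_stall": 0.077,
}

macro_f1_xgb = mean(xgb_f1.values())
macro_f1_mlp = mean(mlp_f1.values())
print(f"{macro_f1_xgb:.4f} {macro_f1_mlp:.4f}")  # 0.7781 0.7073

# Split the XGBoost scores into the seven windowed phases and the three scattered ones.
hard = ["dormancy_dwell", "self_destruct_cleanup", "sandbox_evasion_stall"]
easy_mean = mean(v for k, v in xgb_f1.items() if k not in hard)
hard_mean = mean(xgb_f1[k] for k in hard)
print(f"{easy_mean:.3f} {hard_mean:.3f}")  # 0.979 0.309
```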
| |
| ### Ablation: which feature groups matter |
| |
| | Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy | |
| |---|---:|---:|---:|---:| |
| | Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | — | |
| | No `timestep` | 0.6933 | 0.5963 | 0.9264 | **−0.2244** | |
| | No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 | |
| | No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 | |
| | No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 | |
| |
| Three clear findings: |
| |
| 1. **`timestep` is by far the dominant feature** (drops 22 pp when removed, |
| ROC-AUC still 0.93). Malware execution progresses in time, and where |
| you are in that timeline carries most of the phase signal. |
| 2. **PE static features are barely used for phase prediction.** This is |
| expected: PE features (entropy, packed sections, import hashes) inform |
| family classification, not phase classification. A buyer doing family |
| work should expect to use them; for phase work they can be dropped. |
| 3. **Behavioural features contribute ~1 pp; engineered features add nothing |
| measurable.** Removing the engineered features changes accuracy by +0.2 pp, |
| well within the seed-to-seed standard deviation of 1 pp, because trees |
| recover most of the engineered features on their own. |
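To put the ablation deltas in context, the sketch below recomputes them in percentage points from the table and compares each against the multi-seed accuracy standard deviation (0.010, i.e. 1 pp). Treating that standard deviation as a noise yardstick for a single-split ablation is a rough assumption, not a formal significance test, and the last digit can differ from the table because the inputs here are already rounded.

```python
# (accuracy, macro-F1, ROC-AUC) per configuration, copied from the ablation table.
ablation = {
    "full": (0.9178, 0.7781, 0.9792),
    "no_timestep": (0.6933, 0.5963, 0.9264),
    "no_behavioural": (0.9089, 0.7579, 0.9705),
    "no_pe_static": (0.9167, 0.7808, 0.9786),
    "no_engineered": (0.9200, 0.7931, 0.9797),
}
SEED_STD_ACC_PP = 1.0  # multi-seed accuracy std of 0.010, in percentage points

base_acc = ablation["full"][0]
delta_pp = {
    name: round((acc - base_acc) * 100, 2)
    for name, (acc, _, _) in ablation.items()
    if name != "full"
}
for name, d in delta_pp.items():
    verdict = "larger than seed noise" if abs(d) > SEED_STD_ACC_PP else "within seed noise"
    print(f"{name:16s} {d:+7.2f} pp  ({verdict})")
```

Only the `timestep` ablation moves accuracy by more than the seed-to-seed spread.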
| |
| ### Architecture |
| |
| **XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes), |
| `hist` tree method, class-balanced sample weights, early stopping on |
| validation mlogloss. |
| |
| **MLP:** `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d` |
| → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer, |
| early stopping on validation macro-F1. |
| |
| Training hyperparameters (learning rate, batch size, n_estimators, |
| early-stopping patience, weight decay, class-weighting strategy) are |
| held internally by XpertSystems and are not part of this release. |
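For readers who want to see the MLP shape in code, here is a minimal PyTorch sketch of the `69 → 128 → 64 → 10` layout described above. The `nn.Sequential` arrangement and the `build_mlp` name are assumptions; the parameter names inside the published `model_mlp.safetensors` are not documented here, so the weights may need key renaming before they load into this module.

```python
import torch
from torch import nn

def build_mlp(n_features=69, n_classes=10, dropout=0.3):
    """Two hidden layers, each followed by BatchNorm1d -> ReLU -> Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits; softmax is applied at inference time
    )

model = build_mlp()
model.eval()  # disables dropout and switches BatchNorm to its running statistics

with torch.no_grad():
    logits = model(torch.randn(4, 69))    # a batch of 4 scaled feature vectors
    proba = torch.softmax(logits, dim=1)  # each row sums to 1 across the 10 phases

n_params = sum(p.numel() for p in model.parameters())
print(tuple(proba.shape), n_params)  # (4, 10) 18250
```

Inputs are expected to be standardized first, using the mean and standard deviation stored in `feature_scaler.json`.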
| |
| ## Limitations |
| |
| **This is a baseline reference, not a production sandbox or threat detector.** |
|
|
| 1. **Three phases are genuinely hard at this sample size.** `dormancy_dwell`, |
| `sandbox_evasion_stall`, and `self_destruct_cleanup` span the full |
| 0–59 timestep range and have low row counts. XGBoost per-class F1 ranges |
| from 0.13 to 0.53. By design, these phases lack distinctive |
| moment-to-moment features (the malware is being quiet to evade |
| detection). Sequence models or per-sample aggregation would likely |
| improve these. |
|
|
| 2. **The pivot away from malware family classification is dataset-limited, |
| not method-limited.** Family classification on 100 samples with 10 |
| classes is at majority baseline. The full 280k-row CYB003 product |
| provides ~5,600 samples and supports proper family classification. |
|
|
| 3. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated |
| to threat-intelligence and AV-testing benchmark targets (VirusTotal, |
| AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR, |
| Verizon DBIR). Real malware telemetry has different noise |
| characteristics, adversary adaptation, and instrumentation gaps. Do |
| not assume metrics transfer. |
|
|
| 4. **Adversarial robustness not evaluated.** The dataset is not |
| adversarially generated; the model has not been red-teamed against |
| evasive samples. |
|
|
| 5. **MLP brittleness on OOD inputs.** With ~4k training timesteps, the |
| MLP can produce confidently wrong predictions on hand-crafted records |
| far from the training manifold. XGBoost is more robust. Use both; |
| treat disagreement as a signal for human review. |
|
|
| 6. **`timestep` dominance is a property of the dataset.** Real malware |
| in production doesn't have a clean "timestep" feature on a per-sample |
| 60-step normalized timeline — that's a simulator artifact. A buyer |
| transferring this baseline to real sandbox traces would need to |
| recover an equivalent temporal-position feature from execution-trace |
| timestamps relative to detonation. |
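For limitation 6, one simple way to recover a temporal-position feature is to rescale each event's time since detonation linearly onto a 60-step timeline. The sketch below is only that: the function name and arguments are hypothetical, and whether linear rescaling matches how the simulator lays out its timeline has not been verified.

```python
def normalized_timestep(event_time, detonation_time, trace_end_time, n_steps=60):
    """Map an event timestamp onto a 0..n_steps-1 index relative to detonation."""
    duration = trace_end_time - detonation_time
    if duration <= 0:
        raise ValueError("trace_end_time must be after detonation_time")
    fraction = (event_time - detonation_time) / duration
    fraction = min(max(fraction, 0.0), 1.0)           # clamp events outside the trace window
    return min(int(fraction * n_steps), n_steps - 1)  # the final instant falls in the last bin

# A 300-second trace: events at 0 s, 150 s and 300 s after detonation.
steps = [normalized_timestep(t, 0.0, 300.0) for t in (0.0, 150.0, 300.0)]
print(steps)  # [0, 30, 59]
```

Traces of very different lengths may need a different normalization, for example fixed-width time bins instead of a fixed number of steps.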
|
|
| ## Notes on dataset schema |
|
|
| The CYB003 sample dataset README describes some fields differently from |
| the actual schema. The model was trained on the actual schema; this note |
| helps buyers reconcile what they read with what they receive. |
|
|
| | What the README says | What the data actually contains | |
| |---|---| |
| | `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) | |
| | `process_injection_count` | `process_injection_flag` (binary, not a count) | |
| | `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) | |
| | `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist in the per-timestep table; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist under different names | |
| | `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) | |
| | 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) | |
| | `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) | |
|
|
| The actual per-timestep table also contains rich PE-static features not |
| listed in the README: `import_hash_cluster`, `section_count`, |
| `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, |
| `code_section_rx_ratio`, `resource_section_entropy`, |
| `suspicious_import_count`. These are excellent features for family |
| classification work and are documented in the model's |
| `feature_engineering.py`. |
|
|
| None of these discrepancies affects model correctness — the feature |
| pipeline uses the actual column names. If you build your own pipeline |
| against the dataset, use the actual columns, not the README descriptions. |
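If you start from column names taken from the dataset README, a small reconciliation helper such as the sketch below can catch mismatches early. The mapping follows the table above; pairing `av_detected` with `av_signature_hit_flag` and `sandbox_evaded` with `sandbox_evasion_flag` is an assumption based on the closest-named columns, and the helper itself is hypothetical.

```python
# README name -> actual column(s), following the discrepancy table above.
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
    "av_detected": ["av_signature_hit_flag"],      # assumed pairing
    "sandbox_evaded": ["sandbox_evasion_flag"],    # assumed pairing
}

def reconcile_columns(requested, available):
    """Split requested column names into usable, remapped and missing."""
    available = set(available)
    usable, remapped, missing = [], {}, []
    for name in requested:
        if name in available:
            usable.append(name)
        elif name in README_TO_ACTUAL and set(README_TO_ACTUAL[name]) <= available:
            remapped[name] = README_TO_ACTUAL[name]
        else:
            missing.append(name)
    return usable, remapped, missing

available = ["timestep", "pe_entropy_mean", "pe_entropy_std",
             "process_injection_flag", "c2_beacon_interval_sec"]
usable, remapped, missing = reconcile_columns(
    ["timestep", "pe_entropy", "c2_beacon_active", "dwell_time_hours"], available
)
print(usable)    # ['timestep']
print(remapped)  # {'pe_entropy': ['pe_entropy_mean', 'pe_entropy_std'], 'c2_beacon_active': ['c2_beacon_interval_sec']}
print(missing)   # ['dwell_time_hours']
```

A remapped column is not always a drop-in replacement: `c2_beacon_interval_sec` is an interval in seconds, so an "active" flag would be `c2_beacon_interval_sec > 0`, given that the interval is 0 when inactive.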
|
|
| ## Intended use |
|
|
| - **Evaluating fit** of the CYB003 dataset for your malware-analysis |
| or sandbox-detection research |
| - **Baseline reference** for new model architectures (especially sequence |
| models, which should beat this baseline on the scattered phases) |
| - **Teaching and demo** for tabular classification on malware telemetry |
| - **Feature engineering reference** for per-timestep behavioural data |
|
|
| ## Out-of-scope use |
|
|
| - Production sandbox analysis on real malware |
| - EDR phase tagging on real systems |
| - Family attribution (this baseline does not address that task; see why above) |
| - Adversarial-evasion evaluation (dataset not adversarially generated) |
| - Any operational security decision |
|
|
| ## Reproducibility |
|
|
| Outputs above were produced with `seed = 42` (published artifact), |
| group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the |
| published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated |
| 2026-05-16). The feature pipeline in `feature_engineering.py` is |
| deterministic and the trained weights in this repo correspond exactly |
| to the metrics above. |
|
|
| Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in |
| `multi_seed_results.json` confirm robust performance across splits. |
|
|
| The training script itself is private to XpertSystems. The published |
| artifacts contain the feature pipeline, model weights, scaler, metadata, |
| and validation results — sufficient to reproduce inference but not |
| training. |
|
|
| ## Files in this repo |
|
|
| | File | Purpose | |
| |---|---| |
| | `model_xgb.json` | XGBoost weights (seed 42) | |
| | `model_mlp.safetensors` | PyTorch MLP weights (seed 42) | |
| | `feature_engineering.py` | Feature pipeline (load → engineer → encode) | |
| | `feature_meta.json` | Feature column order + categorical levels | |
| | `feature_scaler.json` | MLP input mean/std (not used by XGBoost) | |
| | `validation_results.json` | Per-class metrics, confusion matrix, architecture | |
| | `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) | |
| | `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics | |
| | `inference_example.ipynb` | End-to-end inference demo notebook | |
| | `README.md` | This file | |
|
|
| ## Contact and full product |
|
|
| The full **CYB003** dataset contains ~349,000 rows across four files, |
| with calibrated benchmark validation against 12 metrics drawn from |
| authoritative threat intelligence and AV-testing sources (VirusTotal, |
| AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon). |
| The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across |
| Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials |
| & Energy. |
|
|
| - 📧 **pradeep@xpertsystems.ai** |
| - 🌐 **https://xpertsystems.ai** |
| - 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample |
| - 🤖 Companion models: |
| - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic) |
| - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{xpertsystems_cyb003_baseline_2026, |
| title = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification}, |
| author = {XpertSystems.ai}, |
| year = {2026}, |
| url = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier}, |
| note = {Baseline reference model trained on xpertsystems/cyb003-sample} |
| } |
| ``` |
|
|