---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- malware
- malware-behaviour
- sandbox-analysis
- edr
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb003-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb003-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 10-class malware execution phase classification
    dataset:
      type: xpertsystems/cyb003-sample
      name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9792
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.9178
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7781
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.905
      name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.975
      name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.9681
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.8222
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.7072
      name: Test macro-F1 (MLP, seed 42)
---

# CYB003 Baseline Classifier

**Malware execution-phase classifier trained on the CYB003 synthetic malware
behaviour sample. Predicts which of 10 execution phases a per-timestep
telemetry record belongs to, from observable behavioural and PE-static
features.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production sandbox, EDR, or threat-detection system.
> See [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 10-class execution_phase classification |
| Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) |
| Models | XGBoost + PyTorch MLP |
| Input features | 69 (after one-hot encoding) |
| Split | **Group-aware by sample_id** (disjoint train/val/test samples) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

## Why this task instead of malware family classification?

The CYB003 dataset README leads with "training malware family classifiers"
as a suggested use case. We piloted that target first and found it is
**not learnable from the sample dataset** under proper group-aware
evaluation: with only 100 unique samples spread across 10 families, XGBoost
on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58, which is
no better than the majority-class baseline. Per-sample aggregation gives the
same result.

This is a **sample-size constraint**, not a feature-engineering failure.
With ~7 training samples per family on average (69 training samples across
10 families), a held-out test set of 15 samples covers at most ~8 families
and yields a model that cannot generalize. The full 280k-row CYB003 product,
with ~28 samples per family at the sample's distribution, will not have this
constraint.

We pivoted to **execution_phase prediction**, which has 6,000 rows of
per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable
across seeds. This is a legitimate SOC use case, since dynamic-analysis tools
and EDR systems regularly need to tag which phase of execution an observed
malware activity belongs to. It also shows the dataset is well-calibrated
even when the headline product use case needs more data.

Two model artifacts are published.
They are designed to be used together; disagreement between them is a useful
triage signal:

- `model_xgb.json`: gradient-boosted trees, primary recommendation
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

## Quick start

```bash
pip install xgboost scikit-learn torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb003-baseline-classifier"
FILES = [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]
# Download every artifact and remember its local path.
paths = {name: hf_hub_download(REPO, name) for name in FILES}

# Make the bundled feature pipeline importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()  # the sklearn wrapper needs scikit-learn installed
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_timestep_record` is a placeholder for one raw per-timestep record.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo, including the MLP.

## Training data

Trained on the public sample of CYB003: 6,000 per-timestep telemetry rows
from 100 malware samples (60 timesteps per sample).

| Phase | Total rows | Share of all rows | Test rows (seed 42) |
|---|---:|---:|---:|
| `initial_drop` | 801 | 13.4% | 120 |
| `lateral_movement` | 799 | 13.3% | 120 |
| `persistence_establishment` | 787 | 13.1% | 119 |
| `data_exfiltration` | 783 | 13.1% | 100 |
| `c2_communication` | 709 | 11.8% | 87 |
| `privilege_escalation` | 705 | 11.8% | 107 |
| `payload_execution` | 705 | 11.8% | 109 |
| `dormancy_dwell` | 250 | 4.2% | 83 |
| `sandbox_evasion_stall` | 234 | 3.9% | 32 |
| `self_destruct_cleanup` | 227 | 3.8% | 23 |

### Group-aware split

A single malware sample generates 60 highly-correlated timesteps.
Random row-level splitting would put timesteps from the same sample in both
train and test, inflating metrics in a way that does not generalize to new
samples. This release uses **GroupShuffleSplit by `sample_id`** (nested,
70/15/15):

| Fold | Samples | Timesteps |
|---|---:|---:|
| Train | 69 | 4,140 |
| Validation | 16 | 960 |
| Test | 15 | 900 |

All test samples are completely unseen during training. Class imbalance is
addressed with `class_weight='balanced'` (XGBoost `sample_weight`) and
weighted cross-entropy (MLP).

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
69 features survive after encoding, drawn from:

- **Per-timestep numeric** (10): `timestep`, `api_call_rate`,
  `registry_write_count`, `network_connection_count`,
  `process_injection_flag`, `c2_beacon_interval_sec`,
  `av_signature_hit_flag`, `sandbox_evasion_flag`,
  `lateral_propagation_count`, `privilege_escalation_flag`
- **PE static features** (11): `pe_entropy_mean`, `pe_entropy_std`,
  `import_hash_cluster`, `section_count`, `packed_section_ratio`,
  `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`,
  `resource_section_entropy`, `suspicious_import_count`,
  `packer_detected_flag`
- **Categorical** (6, one-hot encoded): `malware_family`,
  `threat_actor_tier`, `target_platform`, `obfuscation_technique`,
  `detection_outcome`, `ep_stack`
- **Engineered** (6): `api_burst_score`, `is_c2_active`,
  `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`,
  `lateral_activity_score`

### Leakage audit

No categorical feature has phase->phase purity above 0.17 (uniform random
baseline is 0.10), so nothing in the dataset is an oracle for the target.
The model relies on a mix of `timestep` (strong but not deterministic) and
behavioural features.

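
The card does not spell out how purity is computed, so the following is only a sketch of one plausible reading: for each level of a categorical feature, take the share of rows belonging to its most common phase, then weight by level size. The function name `categorical_purity` and the toy records (including the `windows` and `linux` values) are hypothetical; only the column names `target_platform` and `execution_phase` come from this card.

```python
from collections import Counter, defaultdict


def categorical_purity(rows, feature, target="execution_phase"):
    """Row-weighted purity of `target` within each level of `feature`.

    For every level of the feature, count its most common target label;
    sum those counts and divide by the number of rows. A value of 1.0
    means the feature alone determines the target.
    """
    by_level = defaultdict(Counter)
    for row in rows:
        by_level[row[feature]][row[target]] += 1
    total = sum(sum(counts.values()) for counts in by_level.values())
    if total == 0:
        raise ValueError("no rows supplied")
    majority = sum(max(counts.values()) for counts in by_level.values())
    return majority / total


# Hypothetical toy records, not taken from the dataset.
toy_rows = [
    {"target_platform": "windows", "execution_phase": "initial_drop"},
    {"target_platform": "windows", "execution_phase": "c2_communication"},
    {"target_platform": "linux", "execution_phase": "initial_drop"},
    {"target_platform": "linux", "execution_phase": "initial_drop"},
]
purity = categorical_purity(toy_rows, "target_platform")
print(f"purity = {purity:.2f}")
```

Running the same function over each of the six categorical columns and comparing the result against the majority-class share is one way to repeat this kind of audit on your own copy of the data.
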
## Evaluation

### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9792** |
| Accuracy | **0.9178** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9173 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | 0.9681 |
| Accuracy | 0.8222 |
| Macro-F1 | 0.7072 |
| Weighted-F1 | 0.8278 |

### Multi-seed robustness (XGBoost, 10 seeds)

Accuracy and ROC-AUC are tight across seeds, so the task is genuinely
learnable rather than seed-lucky:

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 10 classes in the test fold, supporting clean
multi-class ROC-AUC computation.

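
The gap between accuracy (0.9178) and macro-F1 (0.7781) for XGBoost follows from how macro-F1 is defined: it is the unweighted mean of the per-class F1 scores, so a few weak classes pull it down regardless of how few rows they have. The sketch below recomputes it from the seed-42 XGBoost per-class F1 values reported in the per-class table of this card; the variable names are illustrative.

```python
# Seed-42 XGBoost per-class F1, copied from the per-class table in this card.
xgb_f1 = {
    "c2_communication": 1.000,
    "persistence_establishment": 0.992,
    "lateral_movement": 0.992,
    "privilege_escalation": 0.991,
    "data_exfiltration": 0.970,
    "payload_execution": 0.963,
    "initial_drop": 0.945,
    "dormancy_dwell": 0.530,
    "self_destruct_cleanup": 0.273,
    "sandbox_evasion_stall": 0.125,
}

# Macro-F1: every class counts equally, whatever its row count.
macro_f1 = sum(xgb_f1.values()) / len(xgb_f1)

# The same mean restricted to the seven phases with tight timestep windows.
hard = {"dormancy_dwell", "self_destruct_cleanup", "sandbox_evasion_stall"}
easy_scores = [score for phase, score in xgb_f1.items() if phase not in hard]
easy_mean = sum(easy_scores) / len(easy_scores)

print(f"macro-F1 over all 10 phases: {macro_f1:.4f}")
print(f"mean F1 over the 7 windowed phases: {easy_mean:.3f}")
```

The three scattered phases account for essentially all of the distance between the two means.
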
### Per-class F1 (seed 42): where the signal is and isn't

| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `c2_communication` | **1.000** | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
| `persistence_establishment` | **0.992** | 0.870 | Tight timestep window 9-17 + registry writes |
| `lateral_movement` | **0.992** | 0.907 | Tight timestep window 26-34 + lateral_propagation |
| `privilege_escalation` | **0.991** | 0.915 | Tight timestep window 18-25 + privilege flag |
| `data_exfiltration` | **0.970** | 0.918 | Tight timestep window 43-51 + network volume |
| `payload_execution` | **0.963** | 0.698 | Tight timestep window 35-42 + API bursts |
| `initial_drop` | **0.945** | 0.886 | Tight timestep window 0-8 |
| `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
| `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
| `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |

Seven phases are near-trivially classified because they sit in tight
timestep windows with characteristic behavioural signatures. **Three phases
(`dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup`) scatter
across the full 0–59 timestep range** and lack distinctive behavioural
features (idle/evasion phases have low activity by design), so a
flat-tabular event-level model can't reliably disambiguate them. Sequence
models that consider neighbouring timesteps would help here.

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | n/a |
| No `timestep` | 0.6933 | 0.5963 | 0.9264 | **−0.2244** |
| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |

Three clear findings:

1. **`timestep` is by far the dominant feature** (accuracy drops 22 pp when
   it is removed, although ROC-AUC stays at 0.93). Malware execution
   progresses in time, and where you are in that timeline carries most of
   the phase signal.
2. **PE static features are barely used for phase prediction.** This is
   honest: PE features (entropy, packed sections, import hashes) inform
   family classification, not phase classification. A buyer doing family
   work should expect to use them; for phase work they can be dropped.
3. **Behavioural features contribute about 1 pp; engineered features add
   nothing measurable.** Removing the behavioural features costs 0.9 pp of
   accuracy, while removing the engineered features changes accuracy by
   +0.2 pp, which is well inside the seed-to-seed spread (std 0.010). Trees
   recover most of the engineered features on their own.

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are held
internally by XpertSystems and are not part of this release.

## Limitations

**This is a baseline reference, not a production sandbox or threat
detector.**

1. **Three phases are genuinely hard at this sample size.** `dormancy_dwell`,
   `sandbox_evasion_stall`, and `self_destruct_cleanup` span the full 0–59
   timestep range and have low row counts. Per-class F1 = 0.13–0.53. By
   design, these phases lack distinctive moment-to-moment features (the
   malware is being quiet to evade detection). Sequence models or per-sample
   aggregation would substantially improve these.
2. **The pivot away from malware family classification is dataset-limited,
   not method-limited.** Family classification on 100 samples with 10
   classes is at majority baseline. The full 280k-row CYB003 product
   provides ~5,600 samples and supports proper family classification.
3. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated
   to threat-intelligence and AV-testing benchmark targets (VirusTotal,
   AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR,
   Verizon DBIR). Real malware telemetry has different noise
   characteristics, adversary adaptation, and instrumentation gaps. Do not
   assume metrics transfer.
4. **Adversarial robustness not evaluated.** The dataset is not
   adversarially generated; the model has not been red-teamed against
   evasive samples.
5. **MLP brittleness on OOD inputs.** With ~4k training timesteps, the MLP
   can produce confidently-wrong predictions on hand-crafted records far
   from the training manifold. XGBoost is more robust. Use both; treat
   disagreement as a signal for human review.
6. **`timestep` dominance is a property of the dataset.** Real malware in
   production doesn't have a clean "timestep" feature on a per-sample
   60-step normalized timeline; that is a simulator artifact. A buyer
   transferring this baseline to real sandbox traces would need to recover
   an equivalent temporal-position feature from execution-trace timestamps
   relative to detonation.

## Notes on dataset schema

The CYB003 sample dataset README describes some fields differently from the
actual schema. The model was trained on the actual schema; this note helps
buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) |
| `process_injection_count` | `process_injection_flag` (binary, not a count) |
| `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) |
| `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist in the per-timestep table; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist under different names |
| `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) |
| 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) |
| `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |

The actual per-timestep table also contains rich PE-static features not
listed in the README: `import_hash_cluster`, `section_count`,
`packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`,
`code_section_rx_ratio`, `resource_section_entropy`,
`suspicious_import_count`. These are excellent features for family
classification work and are documented in the model's
`feature_engineering.py`.

None of these discrepancies affects model correctness, because the feature
pipeline uses the actual column names. If you build your own pipeline
against the dataset, use the actual columns, not the README descriptions.

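
For anyone porting code written against the README names, a small lookup can catch stale column references early. This is a sketch only: the pairing of `av_detected` with `av_signature_hit_flag` and of `sandbox_evaded` with `sandbox_evasion_flag` is an assumption based on the "equivalents" remark in the table above, and the helper names `find_readme_names` and `c2_beacon_active` are hypothetical rather than part of `feature_engineering.py`.

```python
# README-style name -> column actually present in the per-timestep table.
# The av_detected and sandbox_evaded pairings are assumptions (see lead-in).
README_TO_ACTUAL = {
    "process_injection_count": "process_injection_flag",
    "c2_beacon_active": "c2_beacon_interval_sec",
    "av_detected": "av_signature_hit_flag",
    "sandbox_evaded": "sandbox_evasion_flag",
}


def find_readme_names(columns):
    """Map any README-style names found in `columns` to their actual names."""
    return {c: README_TO_ACTUAL[c] for c in columns if c in README_TO_ACTUAL}


def c2_beacon_active(record):
    """Derive the README's boolean from the interval (0 means inactive)."""
    return record["c2_beacon_interval_sec"] > 0


# Hypothetical column list from a pipeline written against the README.
stale = find_readme_names(["timestep", "c2_beacon_active", "av_detected"])
for old, new in stale.items():
    print(f"replace {old!r} with {new!r}")

print(c2_beacon_active({"c2_beacon_interval_sec": 30.0}))
```

Note that `process_injection_flag` is binary, so code that expected a count needs more than a rename.
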
## Intended use

- **Evaluating fit** of the CYB003 dataset for your malware-analysis or
  sandbox-detection research
- **Baseline reference** for new model architectures (especially sequence
  models, which should beat this baseline on the late/scattered phases)
- **Teaching and demo** for tabular classification on malware telemetry
- **Feature engineering reference** for per-timestep behavioural data

## Out-of-scope use

- Production sandbox analysis on real malware
- EDR phase tagging on real systems
- Family attribution (this baseline does not address that task; see why
  above)
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational security decision

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the
published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly to the
metrics above. Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123,
200) in `multi_seed_results.json` confirm robust performance across splits.

The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata, and
validation results, which is sufficient to reproduce inference but not
training.

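
Because the training script is not published, the exact folds cannot be regenerated from this repo. The sketch below only illustrates the idea behind a split grouped by `sample_id`, using the standard library as a dependency-free stand-in for scikit-learn's `GroupShuffleSplit`. The function `group_split` and the `s000`-style IDs are hypothetical, and its 70/15/15 sample counts will not match the published 69/16/15 folds.

```python
import random


def group_split(sample_ids, seed=42, val_frac=0.15, test_frac=0.15):
    """Assign whole samples (never single rows) to train/val/test."""
    unique = sorted(set(sample_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = round(len(unique) * test_frac)
    n_val = round(len(unique) * val_frac)
    test = set(unique[:n_test])
    val = set(unique[n_test:n_test + n_val])
    train = set(unique[n_test + n_val:])
    return train, val, test


# Hypothetical IDs: 100 samples x 60 timesteps, mirroring the sample's shape.
row_sample_ids = [f"s{i:03d}" for i in range(100) for _ in range(60)]

train, val, test = group_split(row_sample_ids)
folds = {"train": train, "val": val, "test": test}

# Count rows per fold by looking up each row's sample ID.
rows_per_fold = {
    name: sum(sid in ids for sid in row_sample_ids)
    for name, ids in folds.items()
}
print(rows_per_fold)

# No sample may appear in more than one fold.
assert not (train & val or train & test or val & test)
```

The disjointness check at the end is the property that matters: every timestep of a given sample lands in exactly one fold.
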
## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB003** dataset contains ~349,000 rows across four files, with
calibrated benchmark validation against 12 metrics drawn from authoritative
threat intelligence and AV-testing sources (VirusTotal, AV-TEST, MITRE
ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon).

The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials &
Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)

## Citation

```bibtex
@misc{xpertsystems_cyb003_baseline_2026,
  title  = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb003-sample}
}
```