	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- cybersecurity
	- malware
	- malware-behaviour
	- sandbox-analysis
	- edr
	- tabular-classification
	- synthetic-data
	- xgboost
	- baseline
	pipeline_tag: tabular-classification
	base_model: []
	datasets:
	- xpertsystems/cyb003-sample
	metrics:
	- accuracy
	- f1
	- roc_auc
	model-index:
	- name: cyb003-baseline-classifier
	  results:
	  - task:
	      type: tabular-classification
	      name: 10-class malware execution phase classification
	    dataset:
	      type: xpertsystems/cyb003-sample
	      name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample)
	    metrics:
	    - type: roc_auc
	      value: 0.9792
	      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.9178
	      name: Test accuracy (XGBoost, seed 42)
	    - type: f1
	      value: 0.7781
	      name: Test macro-F1 (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.905
	      name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.975
	      name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.9681
	      name: Test macro ROC-AUC OvR (MLP, seed 42)
	    - type: accuracy
	      value: 0.8222
	      name: Test accuracy (MLP, seed 42)
	    - type: f1
	      value: 0.7072
	      name: Test macro-F1 (MLP, seed 42)
	---

	# CYB003 Baseline Classifier

	**Malware execution-phase classifier trained on the CYB003 synthetic
	malware behaviour sample. Predicts which of 10 execution phases a
	per-timestep telemetry record belongs to, from observable behavioural
	and PE-static features.**

	> Baseline reference, not for production use. This model demonstrates
	> that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample)
	> is learnable end-to-end and gives prospective buyers a working starting
	> point. It is not a production sandbox, EDR, or threat-detection system.
	> See [Limitations](#limitations).

	## Model overview

	| Property | Value |
	|---|---|
	| Task | 10-class execution_phase classification |
	| Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) |
	| Models | XGBoost + PyTorch MLP |
	| Input features | 69 (after one-hot encoding) |
	| Split | Group-aware by sample_id (disjoint train/val/test samples) |
	| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
	| License | CC-BY-NC-4.0 (matches dataset) |
	| Status | Reference baseline |

	## Why this task instead of malware family classification?

	The CYB003 dataset README leads with "training malware family classifiers"
	as a suggested use case. We piloted that target first and found it is
	not learnable from the sample dataset under proper group-aware
	evaluation: with only 100 unique samples spread across 10 families,
	XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58
	— at majority baseline. Per-sample aggregation gives the same result.

	This is a sample-size constraint, not a feature-engineering failure.
	With 69 training samples spread over 10 families (~7 per family on
	average), the model sees too few examples of each family to generalize,
	and a held-out test set of 15 samples covers at most ~8 families.
	The full 280k-row CYB003 product, with ~28 samples per family at the
	sample's distribution, will not have this constraint.

	We pivoted to execution_phase prediction, which has 6,000 rows of
	per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable
	across seeds. This is a legitimate SOC use case — dynamic-analysis tools
	and EDR systems regularly need to tag what phase of execution observed
	malware activity belongs to — and it shows the dataset is well-calibrated
	even when the headline product use case needs more data.

	Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

	- `model_xgb.json` — gradient-boosted trees, primary recommendation
	- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
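
A minimal sketch of the disagreement check, assuming you already have class-probability arrays from both models with identical class ordering. The function name `triage_disagreement` and the toy probabilities are illustrative and are not part of the published artifacts.

```python
import numpy as np

def triage_disagreement(proba_xgb: np.ndarray, proba_mlp: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True where the two models pick different phases."""
    pred_xgb = proba_xgb.argmax(axis=1)
    pred_mlp = proba_mlp.argmax(axis=1)
    return pred_xgb != pred_mlp

# Toy probabilities for 3 records and 3 classes (illustrative values only).
proba_xgb = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])
proba_mlp = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])

needs_review = triage_disagreement(proba_xgb, proba_mlp)
print(needs_review)  # [False  True False]
```

Records flagged `True` are the ones worth routing to a human analyst.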

	## Quick start

	```bash
	pip install xgboost torch safetensors pandas huggingface_hub
	```

	```python
	import os
	import sys

	import numpy as np
	import torch  # needed only for the MLP artifact
	import xgboost as xgb
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file  # needed only for the MLP artifact

	REPO = "xpertsystems/cyb003-baseline-classifier"

	# Download the model weights and the bundled feature pipeline.
	paths = {n: hf_hub_download(REPO, n) for n in [
	    "model_xgb.json", "model_mlp.safetensors",
	    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
	]}

	# Make the downloaded feature pipeline importable.
	sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
	from feature_engineering import transform_single, load_meta, INT_TO_LABEL

	meta = load_meta(paths["feature_meta.json"])
	xgb_model = xgb.XGBClassifier()
	xgb_model.load_model(paths["model_xgb.json"])

	# Predict. `my_timestep_record` is a placeholder for one raw per-timestep
	# record that you supply (see inference_example.ipynb for the full pattern).
	X = transform_single(my_timestep_record, meta)
	proba = xgb_model.predict_proba(X)[0]
	print(INT_TO_LABEL[int(np.argmax(proba))])
	```

	See [`inference_example.ipynb`](./inference_example.ipynb) for the full
	copy-paste demo.

	## Training data

	Trained on the public sample of CYB003, 6,000 per-timestep telemetry
	rows from 100 malware samples (60 timesteps per sample):

	| Phase | Total rows | Share of total rows | Test rows (seed 42) |
	|---|---:|---:|---:|
	| `initial_drop` | 801 | 13.4% | 120 |
	| `lateral_movement` | 799 | 13.3% | 120 |
	| `persistence_establishment` | 787 | 13.1% | 119 |
	| `data_exfiltration` | 783 | 13.1% | 100 |
	| `c2_communication` | 709 | 11.8% | 87 |
	| `privilege_escalation` | 705 | 11.8% | 107 |
	| `payload_execution` | 705 | 11.8% | 109 |
	| `dormancy_dwell` | 250 | 4.2% | 83 |
	| `sandbox_evasion_stall` | 234 | 3.9% | 32 |
	| `self_destruct_cleanup` | 227 | 3.8% | 23 |

	### Group-aware split

	A single malware sample generates 60 highly-correlated timesteps. Random
	row-level splitting would put timesteps from the same sample in both
	train and test, inflating metrics in a way that does not generalize to
	new samples.

	This release uses GroupShuffleSplit by `sample_id` (nested, 70/15/15):

	| Fold | Samples | Timesteps |
	|---|---:|---:|
	| Train | 69 | 4,140 |
	| Validation | 16 | 960 |
	| Test | 15 | 900 |

	All test samples are completely unseen during training. Class imbalance
	is addressed with `class_weight='balanced'`-style weights, applied as
	per-row `sample_weight` for XGBoost (which has no `class_weight`
	argument) and as weighted cross-entropy for the MLP.
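
The sketch below illustrates a nested group-aware split and balanced row weights on synthetic stand-in data. It is not the private training script: the fold sizes are hard-coded as group counts (15 test, 16 validation) to match the table above, and the random labels, features, and `random_state` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in: 100 samples x 60 timesteps with random 10-class labels.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(100), 60)       # sample_id for every row
y = rng.integers(0, 10, size=groups.size)    # placeholder phase labels
X = rng.normal(size=(groups.size, 4))        # placeholder features

# Outer split: hold out 15 whole samples for the test fold.
outer = GroupShuffleSplit(n_splits=1, test_size=15, random_state=42)
dev_idx, test_idx = next(outer.split(X, y, groups))

# Inner split: hold out 16 of the remaining 85 samples for validation.
inner = GroupShuffleSplit(n_splits=1, test_size=16, random_state=42)
tr_rel, val_rel = next(inner.split(X[dev_idx], y[dev_idx], groups[dev_idx]))
train_idx, val_idx = dev_idx[tr_rel], dev_idx[val_rel]

# Balanced per-row weights, suitable for passing as sample_weight at fit time.
w_train = compute_sample_weight("balanced", y[train_idx])

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(np.unique(groups[idx])), len(idx))
```

Because the split is made on `sample_id` rather than on rows, no sample contributes timesteps to more than one fold.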

	## Feature pipeline

	The bundled `feature_engineering.py` is the canonical feature recipe.
	69 features survive after encoding, drawn from:

	- Per-timestep numeric (10): `timestep`, `api_call_rate`, `registry_write_count`, `network_connection_count`, `process_injection_flag`, `c2_beacon_interval_sec`, `av_signature_hit_flag`, `sandbox_evasion_flag`, `lateral_propagation_count`, `privilege_escalation_flag`
	- PE static features (11): `pe_entropy_mean`, `pe_entropy_std`, `import_hash_cluster`, `section_count`, `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`, `resource_section_entropy`, `suspicious_import_count`, `packer_detected_flag`
	- Categorical (6, one-hot encoded): `malware_family`, `threat_actor_tier`, `target_platform`, `obfuscation_technique`, `detection_outcome`, `ep_stack`
	- Engineered (6): `api_burst_score`, `is_c2_active`, `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`, `lateral_activity_score`
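
As a rough illustration of why a fixed column order and fixed categorical levels matter at inference time, the sketch below one-hot encodes a single record against a pinned level list. It does not reproduce `feature_engineering.py`; the helper `encode_fixed` and the platform levels are hypothetical.

```python
import pandas as pd

def encode_fixed(df: pd.DataFrame, cat_levels: dict, column_order: list) -> pd.DataFrame:
    """One-hot encode with fixed levels so training and inference columns match."""
    out = df.copy()
    for col, levels in cat_levels.items():
        # Declaring the categories up front yields a column per level,
        # even for levels that are absent from this batch.
        out[col] = pd.Categorical(out[col], categories=levels)
    out = pd.get_dummies(out, columns=list(cat_levels), dtype=float)
    # Reindex pins the column order and fills any missing column with 0.
    return out.reindex(columns=column_order, fill_value=0.0)

cat_levels = {"target_platform": ["windows", "linux"]}  # hypothetical levels
column_order = ["timestep", "target_platform_windows", "target_platform_linux"]

record = pd.DataFrame([{"timestep": 12, "target_platform": "linux"}])
X = encode_fixed(record, cat_levels, column_order)
print(X.to_dict("records"))
```

A single record only ever contains one level per categorical, so without the pinned levels the encoded width would vary from record to record.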

	### Leakage audit

	No single value of any categorical feature maps to one phase with
	purity above 0.17 (a uniform split over 10 phases gives 0.10), so no
	categorical column acts as an oracle for the target. The model relies
	on a mix of `timestep` (strong but not deterministic) and behavioural
	features.
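
One way such a purity check could be computed is sketched below. The definition used here (the largest share that any single phase takes within any one value of the feature) is an assumption, since the exact audit procedure is not published; `max_purity` and the toy frame are illustrative.

```python
import pandas as pd

def max_purity(df: pd.DataFrame, feature: str, target: str = "execution_phase") -> float:
    """Largest share any single phase takes within any one value of `feature`."""
    # Each row of the crosstab is one feature value; normalising by row
    # turns counts into per-value phase shares.
    shares = pd.crosstab(df[feature], df[target], normalize="index")
    return float(shares.to_numpy().max())

# Toy data: value "a" is 75% phase p1, value "b" is spread evenly.
toy = pd.DataFrame({
    "ep_stack": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "execution_phase": ["p1", "p1", "p1", "p2", "p1", "p2", "p3", "p4"],
})
purity = max_purity(toy, "ep_stack")
print(purity)  # 0.75
```

A value close to 1.0 would indicate that some category value almost determines the phase, which is the leakage pattern the audit looks for.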

	## Evaluation

	### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

	XGBoost (the published `model_xgb.json` artifact)

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9792 |
	| Accuracy | 0.9178 |
	| Macro-F1 | 0.7781 |
	| Weighted-F1 | 0.9173 |

	MLP (the published `model_mlp.safetensors` artifact)

	| Metric | Value |
	|---|---:|
	| Macro ROC-AUC (OvR) | 0.9681 |
	| Accuracy | 0.8222 |
	| Macro-F1 | 0.7072 |
	| Weighted-F1 | 0.8278 |
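
For readers who want to compute the same four metrics on their own predictions, here is a minimal scikit-learn sketch. The labels and probabilities are synthetic, so the printed values bear no relation to the tables above. Macro-F1 averages the 10 classes equally while weighted-F1 weights them by support, which is why a few rare, low-scoring phases pull macro-F1 well below weighted-F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
n_classes = 10

# Synthetic labels with every class present, plus noisy class probabilities.
y_true = np.tile(np.arange(n_classes), 30)
logits = rng.normal(size=(y_true.size, n_classes))
logits[np.arange(y_true.size), y_true] += 2.0   # make the toy model better than chance
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y_pred = proba.argmax(axis=1)

metrics = {
    "macro_roc_auc_ovr": roc_auc_score(y_true, proba, multi_class="ovr", average="macro"),
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```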

	### Multi-seed robustness (XGBoost, 10 seeds)

	Accuracy and ROC-AUC are tight across seeds — the task is genuinely
	learnable, not seed-lucky:

	| Metric | Mean | Std | Min | Max |
	|---|---:|---:|---:|---:|
	| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
	| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
	| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |

	Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
	All 10 seeds yielded all 10 classes in the test fold, supporting clean
	multi-class ROC-AUC computation.

	### Per-class F1 (seed 42) — where the signal is and isn't

	| Phase | XGBoost F1 | MLP F1 | Note |
	|---|---:|---:|---|
	| `c2_communication` | 1.000 | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
	| `persistence_establishment` | 0.992 | 0.870 | Tight timestep window 9-17 + registry writes |
	| `lateral_movement` | 0.992 | 0.907 | Tight timestep window 26-34 + lateral_propagation |
	| `privilege_escalation` | 0.991 | 0.915 | Tight timestep window 18-25 + privilege flag |
	| `data_exfiltration` | 0.970 | 0.918 | Tight timestep window 43-51 + network volume |
	| `payload_execution` | 0.963 | 0.698 | Tight timestep window 35-42 + API bursts |
	| `initial_drop` | 0.945 | 0.886 | Tight timestep window 0-8 |
	| `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
	| `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
	| `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |

	Seven phases are near-trivially classified because they sit in tight
	timestep windows with characteristic behavioural signatures. **Three
	phases — `dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup`
	— scatter across the full 0–59 timestep range** and lack distinctive
	behavioural features (idle/evasion phases have low activity by design),
	so a flat-tabular event-level model can't reliably disambiguate them.
	Sequence models that consider neighbouring timesteps would help here.

	### Ablation: which feature groups matter

	| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
	|---|---:|---:|---:|---:|
	| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | n/a |
	| No `timestep` | 0.6933 | 0.5963 | 0.9264 | −0.2244 |
	| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
	| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
	| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |

	Three clear findings:

	1. `timestep` is by far the dominant feature (drops 22 pp when removed,
	ROC-AUC still 0.93). Malware execution progresses in time, and where
	you are in that timeline carries most of the phase signal.
	2. PE static features are barely used for phase prediction. This is
	expected: PE features (entropy, packed sections, import hashes) inform
	family classification, not phase classification. A buyer doing family
	work should expect to use them; for phase work they can be dropped.
	3. Behavioural features contribute about 1 pp (accuracy falls 0.9 pp
	without them). Engineered features add nothing measurable: accuracy
	rises 0.2 pp when they are removed, so the trees recover the same
	signal from the raw columns on their own.
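
To put the Δ accuracy column in perspective, the arithmetic below converts each accuracy in the ablation table into a count of correctly classified rows on the 900-row test fold. The dictionary keys are shorthand labels chosen for this sketch; the accuracy values are copied from the table.

```python
N_TEST = 900  # test-fold size for seed 42

accuracy = {
    "full": 0.9178,
    "no_timestep": 0.6933,
    "no_behavioural": 0.9089,
    "no_pe_static": 0.9167,
    "no_engineered": 0.9200,
}

# Accuracy times fold size gives the number of correctly classified rows.
correct = {name: round(acc * N_TEST) for name, acc in accuracy.items()}
delta_rows = {name: correct[name] - correct["full"] for name in accuracy if name != "full"}

for name, d in delta_rows.items():
    print(f"{name}: {d:+d} rows ({100 * d / N_TEST:+.2f} pp)")
```

Removing PE static or engineered features changes the outcome for only one or two test rows. For comparison, the multi-seed accuracy standard deviation of 0.010 corresponds to about 9 rows on a 900-row fold, so those two differences sit well inside seed-to-seed noise.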

	### Architecture

	XGBoost: multi-class gradient boosting (`multi:softprob`, 10 classes),
	`hist` tree method, class-balanced sample weights, early stopping on
	validation mlogloss.

	MLP: `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
	→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
	early stopping on validation macro-F1.
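
The MLP description above could be written in PyTorch roughly as follows. This is a sketch of the stated layer layout only: the class name `PhaseMLP` and its parameter names are assumptions, so the state-dict keys may not match those in `model_mlp.safetensors`.

```python
import torch
from torch import nn

class PhaseMLP(nn.Module):
    """69 -> 128 -> 64 -> 10 with BatchNorm1d, ReLU and Dropout(0.3) after each hidden layer."""

    def __init__(self, n_in=69, hidden=(128, 64), n_out=10, p_drop=0.3):
        super().__init__()
        layers, width = [], n_in
        for h in hidden:
            layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(p_drop)]
            width = h
        layers.append(nn.Linear(width, n_out))  # output layer returns raw logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# eval() disables dropout and makes BatchNorm use its running statistics.
model = PhaseMLP().eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 69))
proba = torch.softmax(logits, dim=1)
print(proba.shape)  # torch.Size([4, 10])
```

Returning logits rather than probabilities keeps the module compatible with `nn.CrossEntropyLoss`, which accepts per-class weights for the weighted cross-entropy described above.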

	Training hyperparameters (learning rate, batch size, n_estimators,
	early-stopping patience, weight decay, class-weighting strategy) are
	held internally by XpertSystems and are not part of this release.

	## Limitations

	This is a baseline reference, not a production sandbox or threat detector.

	1. Three phases are genuinely hard at this sample size. `dormancy_dwell`,
	`sandbox_evasion_stall`, and `self_destruct_cleanup` span the full
	0–59 timestep range and have low row counts; XGBoost per-class F1
	ranges from 0.13 to 0.53. These phases lack distinctive
	moment-to-moment features by design (the malware is staying quiet to
	evade detection). Sequence models or per-sample aggregation would
	likely improve them.

	2. **The pivot away from malware family classification is dataset-limited,
	not method-limited.** Family classification on 100 samples with 10
	classes is at majority baseline. The full 280k-row CYB003 product
	provides ~5,600 samples and supports proper family classification.

	3. Synthetic-vs-real transfer. The dataset is synthetic and calibrated
	to threat-intelligence and AV-testing benchmark targets (VirusTotal,
	AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR,
	Verizon DBIR). Real malware telemetry has different noise
	characteristics, adversary adaptation, and instrumentation gaps. Do
	not assume metrics transfer.

	4. Adversarial robustness not evaluated. The dataset is not
	adversarially generated; the model has not been red-teamed against
	evasive samples.

	5. MLP brittleness on OOD inputs. With ~4k training timesteps, the
	MLP can produce confidently-wrong predictions on hand-crafted records
	far from the training manifold. XGBoost is more robust. Use both;
	treat disagreement as a signal for human review.

	6. `timestep` dominance is a property of the dataset. Real malware
	in production doesn't have a clean "timestep" feature on a per-sample
	60-step normalized timeline — that's a simulator artifact. A buyer
	transferring this baseline to real sandbox traces would need to
	recover an equivalent temporal-position feature from execution-trace
	timestamps relative to detonation.
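
For limitation 6, one possible way to derive a temporal-position feature from real trace timestamps is sketched below. It assumes a linear mapping of elapsed time since detonation onto 60 bins over the trace duration. That mapping is an illustration, not how the simulator defines `timestep`, and the function `to_timestep` is hypothetical.

```python
import numpy as np

def to_timestep(event_ts, detonation_ts, trace_end_ts, n_steps=60):
    """Map event timestamps (in seconds) to integer bins 0..n_steps-1."""
    elapsed = np.asarray(event_ts, dtype=float) - detonation_ts
    duration = trace_end_ts - detonation_ts
    # Clip so events before detonation or after trace end stay in range.
    frac = np.clip(elapsed / duration, 0.0, 1.0)
    # The final instant would land in bin n_steps, so cap it at the last bin.
    return np.minimum((frac * n_steps).astype(int), n_steps - 1)

steps = to_timestep([0.0, 150.0, 299.0, 300.0], detonation_ts=0.0, trace_end_ts=300.0)
print(steps)  # [ 0 30 59 59]
```

Real traces vary in length and pacing, so a linear mapping is only a starting point; whether it transfers to the trained model would need to be validated on real data.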

	## Notes on dataset schema

	The CYB003 sample dataset README describes some fields differently from
	the actual schema. The model was trained on the actual schema; this note
	helps buyers reconcile what they read with what they receive.

	| What the README says | What the data actually contains |
	|---|---|
	| `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) |
	| `process_injection_count` | `process_injection_flag` (binary, not a count) |
	| `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) |
	| `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist in the per-timestep table; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist under different names |
	| `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) |
	| 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) |
	| `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |

	The actual per-timestep table also contains rich PE-static features not
	listed in the README: `import_hash_cluster`, `section_count`,
	`packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`,
	`code_section_rx_ratio`, `resource_section_entropy`,
	`suspicious_import_count`. These are excellent features for family
	classification work and are documented in the model's
	`feature_engineering.py`.

	None of these discrepancies affects model correctness — the feature
	pipeline uses the actual column names. If you build your own pipeline
	against the dataset, use the actual columns, not the README descriptions.

	## Intended use

	- Evaluating fit of the CYB003 dataset for your malware-analysis
	or sandbox-detection research
	- Baseline reference for new model architectures (especially sequence
	models, which should beat this baseline on the three scattered phases)
	- Teaching and demo for tabular classification on malware telemetry
	- Feature engineering reference for per-timestep behavioural data

	## Out-of-scope use

	- Production sandbox analysis on real malware
	- EDR phase tagging on real systems
	- Family attribution (this baseline does not address that task; see why above)
	- Adversarial-evasion evaluation (dataset not adversarially generated)
	- Any operational security decision

	## Reproducibility

	Outputs above were produced with `seed = 42` (published artifact),
	group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the
	published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated
	2026-05-16). The feature pipeline in `feature_engineering.py` is
	deterministic and the trained weights in this repo correspond exactly
	to the metrics above.

	Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
	`multi_seed_results.json` confirm robust performance across splits.

	The training script itself is private to XpertSystems. The published
	artifacts contain the feature pipeline, model weights, scaler, metadata,
	and validation results — sufficient to reproduce inference but not
	training.

	## Files in this repo

	| File | Purpose |
	|---|---|
	| `model_xgb.json` | XGBoost weights (seed 42) |
	| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
	| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
	| `feature_meta.json` | Feature column order + categorical levels |
	| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
	| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
	| `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
	| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
	| `inference_example.ipynb` | End-to-end inference demo notebook |
	| `README.md` | This file |

	## Contact and full product

	The full CYB003 dataset contains ~349,000 rows across four files,
	with calibrated benchmark validation against 12 metrics drawn from
	authoritative threat intelligence and AV-testing sources (VirusTotal,
	AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon).
	The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
	Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
	& Energy.

	- 📧 pradeep@xpertsystems.ai
	- 🌐 https://xpertsystems.ai
	- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
	- 🤖 Companion models:
	  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
	  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)

	## Citation

	```bibtex
	@misc{xpertsystems_cyb003_baseline_2026,
	title = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
	author = {XpertSystems.ai},
	year = {2026},
	url = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
	note = {Baseline reference model trained on xpertsystems/cyb003-sample}
	}
	```