---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- malware
- malware-behaviour
- sandbox-analysis
- edr
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb003-sample
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: cyb003-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 10-class malware execution phase classification
    dataset:
      type: xpertsystems/cyb003-sample
      name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample)
    metrics:
    - type: roc_auc
      value: 0.9792
      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
    - type: accuracy
      value: 0.9178
      name: Test accuracy (XGBoost, seed 42)
    - type: f1
      value: 0.7781
      name: Test macro-F1 (XGBoost, seed 42)
    - type: accuracy
      value: 0.905
      name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.975
      name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds)
    - type: roc_auc
      value: 0.9681
      name: Test macro ROC-AUC OvR (MLP, seed 42)
    - type: accuracy
      value: 0.8222
      name: Test accuracy (MLP, seed 42)
    - type: f1
      value: 0.7072
      name: Test macro-F1 (MLP, seed 42)
---
# CYB003 Baseline Classifier
**Malware execution-phase classifier trained on the CYB003 synthetic
malware behaviour sample. Predicts which of 10 execution phases a
per-timestep telemetry record belongs to, from observable behavioural
and PE-static features.**
> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production sandbox, EDR, or threat-detection system.
> See [Limitations](#limitations).
## Model overview
| Property | Value |
|---|---|
| Task | 10-class execution_phase classification |
| Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) |
| Models | XGBoost + PyTorch MLP |
| Input features | 69 (after one-hot encoding) |
| Split | **Group-aware by sample_id** (disjoint train/val/test samples) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |
## Why this task instead of malware family classification?
The CYB003 dataset README leads with "training malware family classifiers"
as a suggested use case. We piloted that target first and found it is
**not learnable from the sample dataset** under proper group-aware
evaluation: with only 100 unique samples spread across 10 families,
XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58
— at majority baseline. Per-sample aggregation gives the same result.
This is a **sample-size constraint**, not a feature-engineering failure.
With only ~7 training samples per family on average (69 training samples
across 10 families), the model sees too few examples to generalize, and a
held-out test set of 15 samples covers at most ~8 families.
The full 280k-row CYB003 product, with ~28 samples per family at the
sample's distribution, will not have this constraint.
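The test-fold coverage figure can be sanity-checked with a short calculation. The sketch below assumes the 100 samples are spread evenly, 10 per family, which the dataset does not guarantee; the function name is hypothetical and the result is only a rough guide.

```python
from math import comb


def expected_families_covered(n_families: int, per_family: int, n_test: int) -> float:
    """Expected number of distinct families among n_test samples drawn
    without replacement from n_families * per_family samples."""
    total = n_families * per_family
    # Probability that one given family is entirely absent from the test draw.
    p_absent = comb(total - per_family, n_test) / comb(total, n_test)
    return n_families * (1.0 - p_absent)


# ASSUMPTION: 10 families x 10 samples each (uniform); test fold of 15 samples.
coverage = expected_families_covered(n_families=10, per_family=10, n_test=15)
print(f"Expected families in the test fold: {coverage:.1f}")
```

Under that uniform assumption the expectation comes out at roughly 8 families, so on average about two families have no test samples at all. A skewed family distribution would lower the figure further.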
We pivoted to **execution_phase prediction**, which has 6,000 rows of
per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable
across seeds. This is a legitimate SOC use case — dynamic-analysis tools
and EDR systems regularly need to tag what phase of execution observed
malware activity belongs to — and it shows the dataset is well-calibrated
even when the headline product use case needs more data.
Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
- `model_xgb.json` — gradient-boosted trees, primary recommendation
- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
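One way to turn model disagreement into a triage signal is sketched below. It is not part of the published pipeline: the function name, the returned keys, and the `min_confidence` threshold are all hypothetical, and the inputs are assumed to be the two models' class-probability vectors for the same record.

```python
def triage(proba_xgb, proba_mlp, min_confidence=0.5):
    """Flag a record for human review when the two models disagree,
    or when either top prediction is below min_confidence (assumed threshold)."""
    top_xgb = max(range(len(proba_xgb)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(proba_mlp)), key=lambda i: proba_mlp[i])
    agree = top_xgb == top_mlp
    confident = min(proba_xgb[top_xgb], proba_mlp[top_mlp]) >= min_confidence
    return {
        "xgb_class": top_xgb,
        "mlp_class": top_mlp,
        "needs_review": not (agree and confident),
    }


# Toy 3-class probability vectors, for illustration only.
agreeing = triage([0.9, 0.05, 0.05], [0.8, 0.1, 0.1])
disagreeing = triage([0.9, 0.05, 0.05], [0.2, 0.7, 0.1])
print(agreeing["needs_review"], disagreeing["needs_review"])
```

The class indices can be mapped back to phase names with `INT_TO_LABEL` from the bundled `feature_engineering.py`.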
## Quick start
```bash
pip install xgboost torch safetensors pandas huggingface_hub
```
```python
import json, os, sys

import numpy as np
import torch
import xgboost as xgb
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "xpertsystems/cyb003-baseline-classifier"

# Download the model artifacts and the bundled feature pipeline.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_timestep_record` is a placeholder for one raw per-timestep record you supply.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```
See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.
## Training data
Trained on the public sample of CYB003, 6,000 per-timestep telemetry
rows from 100 malware samples (60 timesteps per sample):
| Phase | Total rows | Share of total | Test rows (seed 42) |
|---|---:|---:|---:|
| `initial_drop` | 801 | 13.4% | 120 |
| `lateral_movement` | 799 | 13.3% | 120 |
| `persistence_establishment` | 787 | 13.1% | 119 |
| `data_exfiltration` | 783 | 13.1% | 100 |
| `c2_communication` | 709 | 11.8% | 87 |
| `privilege_escalation` | 705 | 11.8% | 107 |
| `payload_execution` | 705 | 11.8% | 109 |
| `dormancy_dwell` | 250 | 4.2% | 83 |
| `sandbox_evasion_stall` | 234 | 3.9% | 32 |
| `self_destruct_cleanup` | 227 | 3.8% | 23 |
### Group-aware split
A single malware sample generates 60 highly-correlated timesteps. Random
row-level splitting would put timesteps from the same sample in both
train and test, inflating metrics in a way that does not generalize to
new samples.
This release uses **GroupShuffleSplit by `sample_id`** (nested, 70/15/15):
| Fold | Samples | Timesteps |
|---|---:|---:|
| Train | 69 | 4,140 |
| Validation | 16 | 960 |
| Test | 15 | 900 |
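The idea behind a group-aware split can be sketched with the standard library alone. This is not the release's code, which uses scikit-learn's `GroupShuffleSplit`; the helper name `group_split` and the toy `sample_ids` are hypothetical, and the fold sizes it produces (70/15/15 samples here) differ slightly from the 69/16/15 reported above.

```python
import random
from collections import defaultdict


def group_split(sample_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices into train/val/test so no sample_id spans two folds."""
    rows_by_group = defaultdict(list)
    for row_idx, sid in enumerate(sample_ids):
        rows_by_group[sid].append(row_idx)

    # Shuffle the groups (not the rows), then cut the group list into folds.
    groups = sorted(rows_by_group)
    random.Random(seed).shuffle(groups)
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    folds = {
        "train": groups[:n_train],
        "val": groups[n_train:n_train + n_val],
        "test": groups[n_train + n_val:],
    }
    return {name: [i for g in gs for i in rows_by_group[g]]
            for name, gs in folds.items()}


# Toy data shaped like the sample: 100 sample_ids x 60 timesteps each.
sample_ids = [f"s{n:03d}" for n in range(100) for _ in range(60)]
split = group_split(sample_ids)
print({name: len(rows) for name, rows in split.items()})
```

Because whole samples are assigned to a fold, every timestep of a given malware sample lands on the same side of the split.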
All test samples are completely unseen during training. Class imbalance
is addressed with balanced class weights: per-row `sample_weight` values
for XGBoost and weighted cross-entropy for the MLP.
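As an illustration of balanced weighting, the sketch below applies the common "balanced" heuristic, `n_rows / (n_classes * class_count)`, to the row counts in the phase table. It assumes that heuristic is what the release uses, and it uses whole-sample counts for simplicity; a real run would compute the weights on the training fold only.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n_rows / (n_classes * count)."""
    n_rows = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_rows / (n_classes * count)
            for label, count in class_counts.items()}


# Row counts copied from the phase table (whole sample, not the training fold).
counts = {
    "initial_drop": 801, "lateral_movement": 799,
    "persistence_establishment": 787, "data_exfiltration": 783,
    "c2_communication": 709, "privilege_escalation": 705,
    "payload_execution": 705, "dormancy_dwell": 250,
    "sandbox_evasion_stall": 234, "self_destruct_cleanup": 227,
}
weights = balanced_class_weights(counts)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:28s} {w:.2f}")
```

Each row then receives the weight of its class, so the three rare phases count roughly three to four times as much per row as the common ones.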
## Feature pipeline
The bundled `feature_engineering.py` is the canonical feature recipe.
69 features survive after encoding, drawn from:
- **Per-timestep numeric** (10): `timestep`, `api_call_rate`, `registry_write_count`, `network_connection_count`, `process_injection_flag`, `c2_beacon_interval_sec`, `av_signature_hit_flag`, `sandbox_evasion_flag`, `lateral_propagation_count`, `privilege_escalation_flag`
- **PE static features** (11): `pe_entropy_mean`, `pe_entropy_std`, `import_hash_cluster`, `section_count`, `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`, `resource_section_entropy`, `suspicious_import_count`, `packer_detected_flag`
- **Categorical** (6, one-hot encoded): `malware_family`, `threat_actor_tier`, `target_platform`, `obfuscation_technique`, `detection_outcome`, `ep_stack`
- **Engineered** (6): `api_burst_score`, `is_c2_active`, `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`, `lateral_activity_score`
### Leakage audit
No value of any categorical feature maps to a single phase with purity
above 0.17 (the uniform 10-class baseline is 0.10), so no categorical
feature acts as an oracle for the target. The model relies on a mix of
`timestep` (strong but not deterministic) and behavioural features.
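A purity check of this kind might look like the sketch below. The definition used here, the largest share of any single phase among rows that share one categorical value, is an assumption about how the audit was done, and `max_purity` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def max_purity(values, phases):
    """Return (purity, value) for the categorical value whose rows are most
    concentrated in a single phase."""
    phase_counts = defaultdict(Counter)
    for value, phase in zip(values, phases):
        phase_counts[value][phase] += 1
    return max(
        (max(c.values()) / sum(c.values()), value)
        for value, c in phase_counts.items()
    )


# Toy column: value "b" is 2/4 one phase, value "a" is 1/3 each.
toy_values = ["a", "a", "a", "b", "b", "b", "b"]
toy_phases = ["x", "y", "z", "x", "x", "y", "z"]
purity, worst_value = max_purity(toy_values, toy_phases)
print(purity, worst_value)
```

A purity close to 1.0 for any value would mean that value almost determines the phase, which is the leakage pattern the audit rules out.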
## Evaluation
### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)
**XGBoost** (the published `model_xgb.json` artifact)
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9792** |
| Accuracy | **0.9178** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9173 |
**MLP** (the published `model_mlp.safetensors` artifact)
| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | 0.9681 |
| Accuracy | 0.8222 |
| Macro-F1 | 0.7072 |
| Weighted-F1 | 0.8278 |
### Multi-seed robustness (XGBoost, 10 seeds)
Accuracy and ROC-AUC are tight across seeds — the task is genuinely
learnable, not seed-lucky:
| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |
Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 10 classes in the test fold, supporting clean
multi-class ROC-AUC computation.
### Per-class F1 (seed 42) — where the signal is and isn't
| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `c2_communication` | **1.000** | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
| `persistence_establishment` | **0.992** | 0.870 | Tight timestep window 9-17 + registry writes |
| `lateral_movement` | **0.992** | 0.907 | Tight timestep window 26-34 + lateral_propagation |
| `privilege_escalation` | **0.991** | 0.915 | Tight timestep window 18-25 + privilege flag |
| `data_exfiltration` | **0.970** | 0.918 | Tight timestep window 43-51 + network volume |
| `payload_execution` | **0.963** | 0.698 | Tight timestep window 35-42 + API bursts |
| `initial_drop` | **0.945** | 0.886 | Tight timestep window 0-8 |
| `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
| `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
| `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |
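Macro-F1 is the unweighted mean of the per-class F1 scores, which is why the three weak phases pull it well below accuracy. The sketch below recomputes it from the rounded values in the table; the variable names are illustrative.

```python
from statistics import mean

# Per-class F1 values copied from the table above, in row order (seed 42).
f1_xgb = [1.000, 0.992, 0.992, 0.991, 0.970, 0.963, 0.945, 0.530, 0.273, 0.125]
f1_mlp = [1.000, 0.870, 0.907, 0.915, 0.918, 0.698, 0.886, 0.520, 0.282, 0.077]

macro_xgb = mean(f1_xgb)
macro_mlp = mean(f1_mlp)
print(f"XGBoost macro-F1: {macro_xgb:.4f}")
print(f"MLP macro-F1:     {macro_mlp:.4f}")
```

The XGBoost figure reproduces the reported 0.7781; the MLP figure agrees with the reported 0.7072 to within the rounding of the table entries.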
Seven phases are near-trivially classified because they sit in tight
timestep windows with characteristic behavioural signatures. **Three
phases — `dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup`
— scatter across the full 0–59 timestep range** and lack distinctive
behavioural features (idle/evasion phases have low activity by design),
so a flat-tabular event-level model can't reliably disambiguate them.
Sequence models that consider neighbouring timesteps would help here.
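Short of a full sequence model, one simple way to give a flat model some neighbouring context is to add a rolling mean over adjacent timesteps. The sketch below is an illustration rather than anything used in this release; the helper name, the `_ctx_mean` suffix, and the window size `k` are hypothetical.

```python
def add_neighbour_means(rows, feature, k=2):
    """rows: dicts for ONE sample, sorted by timestep.
    Adds '<feature>_ctx_mean', the mean over timesteps i-k .. i+k."""
    values = [row[feature] for row in rows]
    out = []
    for i, row in enumerate(rows):
        lo, hi = max(0, i - k), min(len(values), i + k + 1)
        window = values[lo:hi]
        out.append({**row, f"{feature}_ctx_mean": sum(window) / len(window)})
    return out


# Toy sample with a single API burst at timestep 2.
toy = [{"timestep": t, "api_call_rate": r} for t, r in enumerate([0, 0, 10, 0, 0])]
with_ctx = add_neighbour_means(toy, "api_call_rate", k=1)
print([round(r["api_call_rate_ctx_mean"], 2) for r in with_ctx])
```

The helper must be applied per `sample_id` so windows never cross sample boundaries. It also looks at future timesteps, so it suits offline analysis rather than streaming detection.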
### Ablation: which feature groups matter
| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | — |
| No `timestep` | 0.6933 | 0.5963 | 0.9264 | **−0.2244** |
| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |
Three clear findings:
1. **`timestep` is by far the dominant feature.** Accuracy drops 22 pp
when it is removed, although ROC-AUC stays at 0.93. Malware execution
progresses in time, and where you are in that timeline carries most of
the phase signal.
2. **PE static features are barely used for phase prediction.** This is
expected: PE features (entropy, packed sections, import hashes) inform
family classification, not phase classification. A buyer doing family
work should expect to use them; for phase work they can be dropped.
3. **Behavioural features contribute ~1 pp; engineered features add no
measurable accuracy.** Removing the engineered features changes accuracy
by +0.2 pp, because trees recover most of them on their own.
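The accuracy deltas can be recomputed from the ablation table in percentage points, as in the sketch below. It works from the rounded table values, so the last digit can differ from the Δ column (for example −22.45 pp versus the listed −0.2244); the dictionary keys are illustrative labels.

```python
# Accuracy per configuration, copied from the ablation table (seed 42).
ablation_accuracy = {
    "full": 0.9178,
    "no_timestep": 0.6933,
    "no_behavioural": 0.9089,
    "no_pe_static": 0.9167,
    "no_engineered": 0.9200,
}

baseline = ablation_accuracy["full"]
deltas_pp = {
    name: round((acc - baseline) * 100, 2)
    for name, acc in ablation_accuracy.items()
    if name != "full"
}
for name, delta in sorted(deltas_pp.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {delta:+.2f} pp")
```

Sorting by delta makes the ranking explicit: `timestep` dominates, behavioural features cost just under 1 pp, and the other two groups move accuracy by a small fraction of a point.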
### Architecture
**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.
**MLP:** `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
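The MLP description above can be written down as a PyTorch module roughly as follows. This is a sketch from the prose only: the layer ordering inside each block follows the text, but the module layout and the resulting parameter names are assumptions and may not match the keys stored in `model_mlp.safetensors`.

```python
import torch
from torch import nn


def build_mlp(n_features: int = 69, n_classes: int = 10, dropout: float = 0.3) -> nn.Sequential:
    """69 -> 128 -> 64 -> 10, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    def block(n_in: int, n_out: int):
        return [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU(), nn.Dropout(dropout)]

    return nn.Sequential(*block(n_features, 128), *block(128, 64), nn.Linear(64, n_classes))


model = build_mlp().eval()  # eval mode: dropout off, BatchNorm uses running stats
with torch.no_grad():
    logits = model(torch.zeros(4, 69))  # dummy batch of 4 records
n_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), n_params)
```

The final layer outputs raw logits; applying `torch.softmax(logits, dim=1)` gives class probabilities comparable to XGBoost's `predict_proba`.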
Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.
## Limitations
**This is a baseline reference, not a production sandbox or threat detector.**
1. **Three phases are genuinely hard at this sample size.** `dormancy_dwell`,
`sandbox_evasion_stall`, and `self_destruct_cleanup` span the full
0–59 timestep range and have low row counts. Per-class F1 = 0.13–0.53.
By design, these phases lack distinctive moment-to-moment features
(the malware is being quiet to evade detection). Sequence models or
per-sample aggregation would likely improve them substantially.
2. **The pivot away from malware family classification is dataset-limited,
not method-limited.** Family classification on 100 samples with 10
classes is at majority baseline. The full 280k-row CYB003 product
provides ~5,600 samples and supports proper family classification.
3. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated
to threat-intelligence and AV-testing benchmark targets (VirusTotal,
AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR,
Verizon DBIR). Real malware telemetry has different noise
characteristics, adversary adaptation, and instrumentation gaps. Do
not assume metrics transfer.
4. **Adversarial robustness not evaluated.** The dataset is not
adversarially generated; the model has not been red-teamed against
evasive samples.
5. **MLP brittleness on OOD inputs.** With ~4k training timesteps, the
MLP can produce confidently-wrong predictions on hand-crafted records
far from the training manifold. XGBoost is more robust. Use both;
treat disagreement as a signal for human review.
6. **`timestep` dominance is a property of the dataset.** Real malware
in production doesn't have a clean "timestep" feature on a per-sample
60-step normalized timeline — that's a simulator artifact. A buyer
transferring this baseline to real sandbox traces would need to
recover an equivalent temporal-position feature from execution-trace
timestamps relative to detonation.
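For limitation 6, one possible way to recover a temporal-position feature from real traces is to scale elapsed time since detonation onto the same 0–59 range. The sketch below assumes a simple linear mapping over the trace duration; the function name and that choice of mapping are hypothetical and would need validating against real sandbox data.

```python
def normalized_timestep(elapsed_sec: float, trace_duration_sec: float, n_steps: int = 60) -> int:
    """Map seconds since detonation onto an integer step in [0, n_steps - 1]."""
    if trace_duration_sec <= 0:
        raise ValueError("trace_duration_sec must be positive")
    # Clamp to [0, 1] so events logged slightly outside the trace stay in range.
    fraction = min(max(elapsed_sec / trace_duration_sec, 0.0), 1.0)
    return min(int(fraction * n_steps), n_steps - 1)


# Toy example: a 10-minute sandbox run.
steps = [normalized_timestep(t, 600.0) for t in (0.0, 300.0, 600.0)]
print(steps)
```

Real malware does not pace itself evenly, so a linear mapping is only a starting point; the learned phase windows would not be expected to carry over unchanged.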
## Notes on dataset schema
The CYB003 sample dataset README describes some fields differently from
the actual schema. The model was trained on the actual schema; this note
helps buyers reconcile what they read with what they receive.
| What the README says | What the data actually contains |
|---|---|
| `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) |
| `process_injection_count` | `process_injection_flag` (binary, not a count) |
| `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) |
| `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist on per-timestep; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist with different names |
| `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) |
| 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) |
| `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |
The actual per-timestep table also contains rich PE-static features not
listed in the README: `import_hash_cluster`, `section_count`,
`packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`,
`code_section_rx_ratio`, `resource_section_entropy`,
`suspicious_import_count`. These are excellent features for family
classification work and are documented in the model's
`feature_engineering.py`.
None of these discrepancies affects model correctness — the feature
pipeline uses the actual column names. If you build your own pipeline
against the dataset, use the actual columns, not the README descriptions.
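When porting code written against the README's field names, a small lookup and a column check can catch mismatches early. The sketch below covers only the three renames stated explicitly in the table above; the helper names are hypothetical, and deriving an active/inactive flag from `c2_beacon_interval_sec > 0` relies on the table's note that the interval is 0 when inactive.

```python
# README field name -> actual column name(s), per the table above.
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
}


def missing_columns(available, required):
    """Return the required column names that are absent from the data."""
    available = set(available)
    return [col for col in required if col not in available]


def c2_beacon_active(row) -> int:
    """Derive a README-style binary flag from the actual interval column."""
    return int(row["c2_beacon_interval_sec"] > 0)


# Toy header and record, for illustration only.
header = ["timestep", "pe_entropy_mean", "pe_entropy_std", "c2_beacon_interval_sec"]
print(missing_columns(header, ["pe_entropy"]))                # README name is absent
print(missing_columns(header, README_TO_ACTUAL["pe_entropy"]))  # actual names are present
print(c2_beacon_active({"c2_beacon_interval_sec": 30.0}))
```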
## Intended use
- **Evaluating fit** of the CYB003 dataset for your malware-analysis
or sandbox-detection research
- **Baseline reference** for new model architectures (especially sequence
models, which should beat this baseline on the late/scattered phases)
- **Teaching and demo** for tabular classification on malware telemetry
- **Feature engineering reference** for per-timestep behavioural data
## Out-of-scope use
- Production sandbox analysis on real malware
- EDR phase tagging on real systems
- Family attribution (this baseline does not address that task; see why above)
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational security decision
## Reproducibility
Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the
published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.
Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
`multi_seed_results.json` confirm robust performance across splits.
The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results — sufficient to reproduce inference but not
training.
## Files in this repo
| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |
## Contact and full product
The full **CYB003** dataset contains ~349,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative threat intelligence and AV-testing sources (VirusTotal,
AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon).
The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.
- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
- 🤖 Companion models:
- https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
- https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
## Citation
```bibtex
@misc{xpertsystems_cyb003_baseline_2026,
title = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
author = {XpertSystems.ai},
year = {2026},
url = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
note = {Baseline reference model trained on xpertsystems/cyb003-sample}
}
```