---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - cybersecurity
  - malware
  - malware-behaviour
  - sandbox-analysis
  - edr
  - tabular-classification
  - synthetic-data
  - xgboost
  - baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
  - xpertsystems/cyb003-sample
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: cyb003-baseline-classifier
    results:
      - task:
          type: tabular-classification
          name: 10-class malware execution phase classification
        dataset:
          type: xpertsystems/cyb003-sample
          name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample)
        metrics:
          - type: roc_auc
            value: 0.9792
            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
          - type: accuracy
            value: 0.9178
            name: Test accuracy (XGBoost, seed 42)
          - type: f1
            value: 0.7781
            name: Test macro-F1 (XGBoost, seed 42)
          - type: accuracy
            value: 0.905
            name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.975
            name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.9681
            name: Test macro ROC-AUC OvR (MLP, seed 42)
          - type: accuracy
            value: 0.8222
            name: Test accuracy (MLP, seed 42)
          - type: f1
            value: 0.7072
            name: Test macro-F1 (MLP, seed 42)
---

# CYB003 Baseline Classifier

**Malware execution-phase classifier trained on the CYB003 synthetic
malware behaviour sample. Predicts which of 10 execution phases a
per-timestep telemetry record belongs to, from observable behavioural
and PE-static features.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production sandbox, EDR, or threat-detection system.
> See [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 10-class execution_phase classification |
| Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) |
| Models | XGBoost + PyTorch MLP |
| Input features | 69 (after one-hot encoding) |
| Split | **Group-aware by sample_id** (disjoint train/val/test samples) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

## Why this task instead of malware family classification?

The CYB003 dataset README leads with "training malware family classifiers"
as a suggested use case. We piloted that target first and found it is
**not learnable from the sample dataset** under proper group-aware
evaluation: with only 100 unique samples spread across 10 families,
XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58
— at majority baseline. Per-sample aggregation gives the same result.

This is a **sample-size constraint**, not a feature-engineering failure.
With ~7 training samples per family on average (69 training samples across
10 families) and a held-out test set of 15 samples that covers at most ~8
families, the resulting model cannot generalize.
The full 280k-row CYB003 product, with ~28 samples per family at the
sample's distribution, will not have this constraint.

We pivoted to **execution_phase prediction**, which has 6,000 rows of
per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable
across seeds. This is a legitimate SOC use case — dynamic-analysis tools
and EDR systems regularly need to tag what phase of execution observed
malware activity belongs to — and it shows the dataset is well-calibrated
even when the headline product use case needs more data.

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

- `model_xgb.json` — gradient-boosted trees, primary recommendation
- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
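
A minimal sketch of the disagreement check, assuming each model has already produced a per-class probability matrix for the same records. The function name `triage` and the toy 3-class probabilities are illustrative, not part of the published artifacts.

```python
import numpy as np

def triage(proba_xgb, proba_mlp):
    """Return each model's predicted class index and a per-row review flag."""
    pred_xgb = np.argmax(proba_xgb, axis=1)
    pred_mlp = np.argmax(proba_mlp, axis=1)
    # Rows where the two models pick different classes go to human review.
    needs_review = pred_xgb != pred_mlp
    return pred_xgb, pred_mlp, needs_review

# Toy probabilities for three records (illustrative values only).
proba_xgb = np.array([[0.80, 0.10, 0.10],
                      [0.20, 0.70, 0.10],
                      [0.40, 0.35, 0.25]])
proba_mlp = np.array([[0.70, 0.20, 0.10],
                      [0.10, 0.80, 0.10],
                      [0.20, 0.50, 0.30]])

pred_xgb, pred_mlp, needs_review = triage(proba_xgb, proba_mlp)
print(needs_review.tolist())
```

Only the third record is flagged here, because the two models choose different classes for it.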

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb003-baseline-classifier"

# Download the model artifacts and the bundled feature pipeline.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_timestep_record` is a placeholder: supply one per-timestep telemetry
# record with the dataset's raw columns before running this line.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB003, 6,000 per-timestep telemetry
rows from 100 malware samples (60 timesteps per sample):

| Phase | Total rows | Share of all rows | Test rows (seed 42) |
|---|---:|---:|---:|
| `initial_drop` | 801 | 13.4% | 120 |
| `lateral_movement` | 799 | 13.3% | 120 |
| `persistence_establishment` | 787 | 13.1% | 119 |
| `data_exfiltration` | 783 | 13.1% | 100 |
| `c2_communication` | 709 | 11.8% | 87 |
| `privilege_escalation` | 705 | 11.8% | 107 |
| `payload_execution` | 705 | 11.8% | 109 |
| `dormancy_dwell` | 250 | 4.2% | 83 |
| `sandbox_evasion_stall` | 234 | 3.9% | 32 |
| `self_destruct_cleanup` | 227 | 3.8% | 23 |

### Group-aware split

A single malware sample generates 60 highly-correlated timesteps. Random
row-level splitting would put timesteps from the same sample in both
train and test, inflating metrics in a way that does not generalize to
new samples.

This release uses **GroupShuffleSplit by `sample_id`** (nested, 70/15/15):

| Fold | Samples | Timesteps |
|---|---:|---:|
| Train | 69 | 4,140 |
| Validation | 16 | 960 |
| Test | 15 | 900 |
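
The nested split can be sketched with scikit-learn's `GroupShuffleSplit` as below. This is an illustration on a toy array of 100 sample IDs with 60 rows each, not the private training script. The `random_state` values and the inner fraction `0.15 / 0.85` are assumptions, so the exact fold sizes may differ from the table.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the sample: 100 sample_ids x 60 timesteps each.
groups = np.repeat(np.arange(100), 60)
X = np.zeros((groups.size, 1))  # feature values do not affect the split

# Outer split: hold out ~15% of sample_ids as the test fold.
outer = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
trainval_idx, test_idx = next(outer.split(X, groups=groups))

# Inner split: carve the validation fold out of the remaining sample_ids.
inner = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.85, random_state=42)
tr, va = next(inner.split(X[trainval_idx], groups=groups[trainval_idx]))
train_idx, val_idx = trainval_idx[tr], trainval_idx[va]

train_ids = set(groups[train_idx])
val_ids = set(groups[val_idx])
test_ids = set(groups[test_idx])
print(len(train_ids), len(val_ids), len(test_ids))
```

Because the split is made over `sample_id` groups rather than rows, no sample contributes timesteps to more than one fold.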

All test samples are completely unseen during training. Class imbalance
is addressed with balanced class weights, passed to XGBoost as per-row
`sample_weight`, and with weighted cross-entropy for the MLP.
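
As a rough illustration of balanced weighting, the sketch below applies the usual "balanced" heuristic, `n_rows / (n_classes * class_count)`, to the total row counts from the phase table. The actual training weights are presumably computed on the training fold only and are not published, so treat these numbers as indicative.

```python
# Total rows per phase, copied from the phase table above.
counts = {
    "initial_drop": 801, "lateral_movement": 799,
    "persistence_establishment": 787, "data_exfiltration": 783,
    "c2_communication": 709, "privilege_escalation": 705,
    "payload_execution": 705, "dormancy_dwell": 250,
    "sandbox_evasion_stall": 234, "self_destruct_cleanup": 227,
}

n_rows = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rare classes get proportionally larger weights.
weights = {phase: n_rows / (n_classes * c) for phase, c in counts.items()}

for phase, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{phase:28s} {w:.3f}")
```

With these weights every class contributes the same total weight (`n_rows / n_classes`), so the three rare phases are not drowned out by the seven common ones.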

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
69 features survive after encoding, drawn from:

- **Per-timestep numeric** (10): `timestep`, `api_call_rate`, `registry_write_count`, `network_connection_count`, `process_injection_flag`, `c2_beacon_interval_sec`, `av_signature_hit_flag`, `sandbox_evasion_flag`, `lateral_propagation_count`, `privilege_escalation_flag`
- **PE static features** (11): `pe_entropy_mean`, `pe_entropy_std`, `import_hash_cluster`, `section_count`, `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`, `resource_section_entropy`, `suspicious_import_count`, `packer_detected_flag`
- **Categorical** (6, one-hot encoded): `malware_family`, `threat_actor_tier`, `target_platform`, `obfuscation_technique`, `detection_outcome`, `ep_stack`
- **Engineered** (6): `api_burst_score`, `is_c2_active`, `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`, `lateral_activity_score`

### Leakage audit

No value of any categorical feature predicts a single phase with purity
above 0.17 (the uniform-random baseline over 10 phases is 0.10), so no
categorical column acts as an oracle for the target. The model relies on
a mix of `timestep` (strong but not deterministic) and behavioural features.
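
One plausible way to run such an audit is sketched below. It assumes "purity" means the largest share of any single phase among rows that share a categorical value. That definition, the helper name `max_purity`, and the toy rows are assumptions for illustration.

```python
from collections import Counter, defaultdict

def max_purity(values, phases):
    """Largest P(phase | value) over all values of one categorical column."""
    by_value = defaultdict(Counter)
    for value, phase in zip(values, phases):
        by_value[value][phase] += 1
    return max(max(c.values()) / sum(c.values()) for c in by_value.values())

# Toy rows (illustrative only): one categorical column and the phase label.
platform = ["windows", "windows", "windows", "linux", "linux", "linux"]
phase = ["initial_drop", "c2_communication", "payload_execution",
         "initial_drop", "initial_drop", "payload_execution"]

print(round(max_purity(platform, phase), 3))

# A column that copies the label is a perfect oracle (purity 1.0).
print(max_purity(phase, phase))
```

A purity near 1.0 would flag a column that leaks the target; values close to the class prior indicate the column carries little direct label information.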

## Evaluation

### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9792** |
| Accuracy | **0.9178** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9173 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | 0.9681 |
| Accuracy | 0.8222 |
| Macro-F1 | 0.7072 |
| Weighted-F1 | 0.8278 |

### Multi-seed robustness (XGBoost, 10 seeds)

Accuracy and ROC-AUC are tight across seeds — the task is genuinely
learnable, not seed-lucky:

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 10 classes in the test fold, supporting clean
multi-class ROC-AUC computation.

### Per-class F1 (seed 42) — where the signal is and isn't

| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `c2_communication` | **1.000** | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
| `persistence_establishment` | **0.992** | 0.870 | Tight timestep window 9-17 + registry writes |
| `lateral_movement` | **0.992** | 0.907 | Tight timestep window 26-34 + lateral_propagation |
| `privilege_escalation` | **0.991** | 0.915 | Tight timestep window 18-25 + privilege flag |
| `data_exfiltration` | **0.970** | 0.918 | Tight timestep window 43-51 + network volume |
| `payload_execution` | **0.963** | 0.698 | Tight timestep window 35-42 + API bursts |
| `initial_drop` | **0.945** | 0.886 | Tight timestep window 0-8 |
| `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
| `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
| `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |

Seven phases are near-trivially classified because they sit in tight
timestep windows with characteristic behavioural signatures. **Three
phases — `dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup`
— scatter across the full 0–59 timestep range** and lack distinctive
behavioural features (idle/evasion phases have low activity by design),
so a flat-tabular event-level model can't reliably disambiguate them.
Sequence models that consider neighbouring timesteps would help here.
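
The gap between accuracy and macro-F1 follows from the per-class table: macro-F1 is the unweighted mean of the ten per-class F1 scores, so the three weak phases pull it down regardless of how few rows they have. The short check below recomputes it from the rounded table values.

```python
# Per-class F1 scores copied from the table above (seed 42), in table order.
f1_xgb = [1.000, 0.992, 0.992, 0.991, 0.970, 0.963, 0.945, 0.530, 0.273, 0.125]
f1_mlp = [1.000, 0.870, 0.907, 0.915, 0.918, 0.698, 0.886, 0.520, 0.282, 0.077]

def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

macro_xgb = macro_f1(f1_xgb)
macro_mlp = macro_f1(f1_mlp)
print(f"XGBoost macro-F1: {macro_xgb:.4f}")
print(f"MLP macro-F1:     {macro_mlp:.4f}")
```

The recomputed values agree with the reported macro-F1 figures up to rounding of the per-class inputs.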

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | — |
| No `timestep` | 0.6933 | 0.5963 | 0.9264 | **−0.2244** |
| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |

Three clear findings:

1. **`timestep` is by far the dominant feature** (accuracy drops 22 pp when
   it is removed, although ROC-AUC stays at 0.93). Malware execution
   progresses in time, and where you are in that timeline carries most of
   the phase signal.
2. **PE static features are barely used for phase prediction.** This is
   expected: PE features (entropy, packed sections, import hashes) inform
   family classification, not phase classification. A buyer doing family
   work should expect to use them; for phase work they can be dropped.
3. **Behavioural features contribute ~1 pp; engineered features contribute
   nothing measurable.** Accuracy rises 0.2 pp when the engineered features
   are removed, which suggests trees recover most of them on their own.

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
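
The MLP layout described above could be written in PyTorch roughly as follows. This is a sketch of the stated layer sequence only: the class name `PhaseMLP` and the parameter names are assumptions, so the published `model_mlp.safetensors` keys may not match this module without renaming.

```python
import torch
from torch import nn

class PhaseMLP(nn.Module):
    """69 -> 128 -> 64 -> 10, with BatchNorm1d -> ReLU -> Dropout per hidden layer."""

    def __init__(self, n_features=69, n_classes=10, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_classes),  # raw logits; softmax is applied outside
        )

    def forward(self, x):
        return self.net(x)

# Forward pass on random inputs in eval mode (dropout off, BatchNorm uses running stats).
model = PhaseMLP().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 69))
    proba = torch.softmax(logits, dim=1)
print(tuple(logits.shape))
```

For training, a weighted loss of the kind mentioned above can be expressed with `nn.CrossEntropyLoss(weight=class_weights)`, where `class_weights` is a tensor with one weight per class.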

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.

## Limitations

**This is a baseline reference, not a production sandbox or threat detector.**

1. **Three phases are genuinely hard at this sample size.** `dormancy_dwell`,
   `sandbox_evasion_stall`, and `self_destruct_cleanup` span the full
   0–59 timestep range and have low row counts. Per-class F1 = 0.13–0.53
   (XGBoost). By design, these phases lack distinctive moment-to-moment
   features (the malware is being quiet to evade detection). Sequence
   models or per-sample aggregation would likely improve these.

2. **The pivot away from malware family classification is dataset-limited,
   not method-limited.** Family classification on 100 samples with 10
   classes is at majority baseline. The full 280k-row CYB003 product
   provides ~5,600 samples and supports proper family classification.

3. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated
   to threat-intelligence and AV-testing benchmark targets (VirusTotal,
   AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR,
   Verizon DBIR). Real malware telemetry has different noise
   characteristics, adversary adaptation, and instrumentation gaps. Do
   not assume metrics transfer.

4. **Adversarial robustness not evaluated.** The dataset is not
   adversarially generated; the model has not been red-teamed against
   evasive samples.

5. **MLP brittleness on OOD inputs.** With ~4k training timesteps, the
   MLP can produce confidently-wrong predictions on hand-crafted records
   far from the training manifold. XGBoost is more robust. Use both;
   treat disagreement as a signal for human review.

6. **`timestep` dominance is a property of the dataset.** Real malware
   in production doesn't have a clean "timestep" feature on a per-sample
   60-step normalized timeline — that's a simulator artifact. A buyer
   transferring this baseline to real sandbox traces would need to
   recover an equivalent temporal-position feature from execution-trace
   timestamps relative to detonation.
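
For the temporal-position feature mentioned in limitation 6, one possible approach is to bin elapsed time since detonation onto the same 0–59 scale. The helper below is a hypothetical sketch: the function name, the linear binning, and the use of total trace duration as the normalizer are assumptions, not something the dataset or this model prescribes.

```python
def to_timestep(elapsed_sec, trace_duration_sec, n_steps=60):
    """Map seconds since detonation to an integer step in [0, n_steps - 1]."""
    if trace_duration_sec <= 0:
        raise ValueError("trace_duration_sec must be positive")
    # Clamp to the trace window, then bin linearly into n_steps buckets.
    frac = min(max(elapsed_sec / trace_duration_sec, 0.0), 1.0)
    return min(n_steps - 1, int(frac * n_steps))

# Example: a 300-second sandbox run.
for t in (0, 150, 299, 300):
    print(t, to_timestep(t, 300))
```

Linear binning treats every part of the trace equally; whether that matches the simulator's timeline on real traces would need to be checked empirically.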

## Notes on dataset schema

The CYB003 sample dataset README describes some fields differently from
the actual schema. The model was trained on the actual schema; this note
helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) |
| `process_injection_count` | `process_injection_flag` (binary, not a count) |
| `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) |
| `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist in the per-timestep table; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist under different names |
| `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) |
| 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) |
| `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |

The actual per-timestep table also contains rich PE-static features not
listed in the README: `import_hash_cluster`, `section_count`,
`packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`,
`code_section_rx_ratio`, `resource_section_entropy`,
`suspicious_import_count`. These are excellent features for family
classification work and are documented in the model's
`feature_engineering.py`.

None of these discrepancies affects model correctness — the feature
pipeline uses the actual column names. If you build your own pipeline
against the dataset, use the actual columns, not the README descriptions.
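
If you are porting code written against the README names, a small lookup like the one below can flag columns that need renaming. The mapping is taken from the table above. Pairing `av_detected` with `av_signature_hit_flag` and `sandbox_evaded` with `sandbox_evasion_flag` is an assumption based on the names, and `find_stale_columns` is a hypothetical helper.

```python
# README-style name -> actual column(s) in the per-timestep table.
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
    "av_detected": ["av_signature_hit_flag"],    # assumed pairing
    "sandbox_evaded": ["sandbox_evasion_flag"],  # assumed pairing
}

def find_stale_columns(columns):
    """Return README-style names found in `columns` with their replacements."""
    return {c: README_TO_ACTUAL[c] for c in columns if c in README_TO_ACTUAL}

requested = ["timestep", "pe_entropy", "c2_beacon_active", "api_call_rate"]
stale = find_stale_columns(requested)
for old, new in stale.items():
    print(f"{old} -> {', '.join(new)}")
```

Note that the replacements are not always drop-in: `c2_beacon_interval_sec` is a duration in seconds (0 when inactive), so an "active" flag has to be derived from it rather than read directly.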

## Intended use

- **Evaluating fit** of the CYB003 dataset for your malware-analysis
  or sandbox-detection research
- **Baseline reference** for new model architectures (especially sequence
  models, which should beat this baseline on the late/scattered phases)
- **Teaching and demo** for tabular classification on malware telemetry
- **Feature engineering reference** for per-timestep behavioural data

## Out-of-scope use

- Production sandbox analysis on real malware
- EDR phase tagging on real systems
- Family attribution (this baseline does not address that task; see why above)
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational security decision

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the
published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
`multi_seed_results.json` confirm robust performance across splits.

The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results — sufficient to reproduce inference but not
training.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB003** dataset contains ~349,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative threat intelligence and AV-testing sources (VirusTotal,
AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon).
The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)

## Citation

```bibtex
@misc{xpertsystems_cyb003_baseline_2026,
  title  = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb003-sample}
}
```