---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - cybersecurity
  - malware
  - malware-behaviour
  - sandbox-analysis
  - edr
  - tabular-classification
  - synthetic-data
  - xgboost
  - baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
  - xpertsystems/cyb003-sample
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: cyb003-baseline-classifier
    results:
      - task:
          type: tabular-classification
          name: 10-class malware execution phase classification
        dataset:
          type: xpertsystems/cyb003-sample
          name: CYB003 Synthetic Malware Behaviour & Classification Dataset (Sample)
        metrics:
          - type: roc_auc
            value: 0.9792
            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
          - type: accuracy
            value: 0.9178
            name: Test accuracy (XGBoost, seed 42)
          - type: f1
            value: 0.7781
            name: Test macro-F1 (XGBoost, seed 42)
          - type: accuracy
            value: 0.905
            name: Multi-seed accuracy mean ± 0.010 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.975
            name: Multi-seed ROC-AUC mean ± 0.002 (XGBoost, 10 seeds)
          - type: roc_auc
            value: 0.9681
            name: Test macro ROC-AUC OvR (MLP, seed 42)
          - type: accuracy
            value: 0.8222
            name: Test accuracy (MLP, seed 42)
          - type: f1
            value: 0.7072
            name: Test macro-F1 (MLP, seed 42)
---

# CYB003 Baseline Classifier

**Malware execution-phase classifier trained on the CYB003 synthetic
malware behaviour sample. Predicts which of 10 execution phases a
per-timestep telemetry record belongs to, from observable behavioural
and PE-static features.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB003 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb003-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point. It is not a production sandbox, EDR, or threat-detection system.
> See [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 10-class execution_phase classification |
| Training data | `xpertsystems/cyb003-sample` (6,000 timesteps across 100 malware samples) |
| Models | XGBoost + PyTorch MLP |
| Input features | 69 (after one-hot encoding) |
| Split | **Group-aware by sample_id** (disjoint train/val/test samples) |
| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

## Why this task instead of malware family classification?

The CYB003 dataset README leads with "training malware family classifiers"
as a suggested use case. We piloted that target first and found it is
**not learnable from the sample dataset** under proper group-aware
evaluation: with only 100 unique samples spread across 10 families,
XGBoost on per-timestep features lands at ~15% accuracy and ROC-AUC ~0.58
— at majority baseline. Per-sample aggregation gives the same result.

This is a **sample-size constraint**, not a feature-engineering failure.
With ~7 training samples per family on average (69 training samples
across 10 families), the model cannot generalize, and a held-out test
set of 15 samples covers at most ~8 families.
The full 280k-row CYB003 product, with ~28 samples per family at the
sample's distribution, will not have this constraint.

We pivoted to **execution_phase prediction**, which has 6,000 rows of
per-timestep data and learns cleanly: 91% accuracy, ROC-AUC 0.98, stable
across seeds. This is a legitimate SOC use case — dynamic-analysis tools
and EDR systems regularly need to tag what phase of execution observed
malware activity belongs to — and it shows the dataset is well-calibrated
even when the headline product use case needs more data.

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

- `model_xgb.json` — gradient-boosted trees, primary recommendation
- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
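
The disagreement-as-triage idea can be sketched as follows. This is a minimal illustration, not code from this repository: the `triage` function and its `min_confidence` threshold are hypothetical, and the probability lists stand in for the outputs of the two published models.

```python
def triage(proba_xgb, proba_mlp, labels, min_confidence=0.5):
    """Compare two models' class probabilities for one record.

    Returns (label, needs_review). `min_confidence` is an illustrative
    threshold, not a value taken from the published models.
    """
    top_xgb = max(range(len(proba_xgb)), key=proba_xgb.__getitem__)
    top_mlp = max(range(len(proba_mlp)), key=proba_mlp.__getitem__)
    disagree = top_xgb != top_mlp
    low_conf = proba_xgb[top_xgb] < min_confidence
    # XGBoost is the primary recommendation, so its label is reported.
    return labels[top_xgb], disagree or low_conf


labels = ["initial_drop", "dormancy_dwell", "c2_communication"]  # toy subset
agree = triage([0.9, 0.05, 0.05], [0.8, 0.1, 0.1], labels)
split = triage([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], labels)
print(agree)
print(split)
```

Records flagged with `needs_review=True` would go to a human analyst; what threshold to use, if any, is a choice for the deployer.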

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb003-baseline-classifier"

# Download the model artifacts and the bundled feature pipeline.
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the downloaded feature_engineering.py importable.
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict (see inference_example.ipynb for the full pattern).
# `my_timestep_record` is a placeholder for one per-timestep record
# that you supply, using the dataset's column names.
X = transform_single(my_timestep_record, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for the full
copy-paste demo.

## Training data

Trained on the public sample of CYB003, 6,000 per-timestep telemetry
rows from 100 malware samples (60 timesteps per sample):

| Phase | Total rows | Share of total | Test rows (seed 42) |
|---|---:|---:|---:|
| `initial_drop` | 801 | 13.4% | 120 |
| `lateral_movement` | 799 | 13.3% | 120 |
| `persistence_establishment` | 787 | 13.1% | 119 |
| `data_exfiltration` | 783 | 13.1% | 100 |
| `c2_communication` | 709 | 11.8% | 87 |
| `privilege_escalation` | 705 | 11.8% | 107 |
| `payload_execution` | 705 | 11.8% | 109 |
| `dormancy_dwell` | 250 | 4.2% | 83 |
| `sandbox_evasion_stall` | 234 | 3.9% | 32 |
| `self_destruct_cleanup` | 227 | 3.8% | 23 |

### Group-aware split

A single malware sample generates 60 highly-correlated timesteps. Random
row-level splitting would put timesteps from the same sample in both
train and test, inflating metrics in a way that does not generalize to
new samples.

This release uses **GroupShuffleSplit by `sample_id`** (nested, 70/15/15):

| Fold | Samples | Timesteps |
|---|---:|---:|
| Train | 69 | 4,140 |
| Validation | 16 | 960 |
| Test | 15 | 900 |
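
The principle behind the split can be sketched without any dependencies. The release itself uses scikit-learn's `GroupShuffleSplit`; the `group_split` helper below is a hypothetical stand-in that shuffles unique sample IDs and cuts them into three disjoint sets. The `S000`-style IDs are invented, and the fold membership will not match the published split.

```python
import random


def group_split(sample_ids, n_val, n_test, seed=42):
    """Split unique sample IDs into disjoint train/val/test groups."""
    unique_ids = sorted(set(sample_ids))
    random.Random(seed).shuffle(unique_ids)
    test = set(unique_ids[:n_test])
    val = set(unique_ids[n_test:n_test + n_val])
    train = set(unique_ids[n_test + n_val:])
    return train, val, test


# Toy data shaped like the sample: 100 samples x 60 timesteps each.
rows = [(f"S{i:03d}", t) for i in range(100) for t in range(60)]
train_ids, val_ids, test_ids = group_split(
    [sid for sid, _ in rows], n_val=16, n_test=15
)

# Every row follows its sample, so no sample straddles two folds.
fold_rows = {
    name: [r for r in rows if r[0] in ids]
    for name, ids in [("train", train_ids), ("val", val_ids), ("test", test_ids)]
}
print({name: len(r) for name, r in fold_rows.items()})
```

With 69/16/15 samples this gives 4,140/960/900 timesteps, the same fold sizes as the table above.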

All test samples are completely unseen during training. Class imbalance
is addressed with balanced class weights, passed to XGBoost as per-row
`sample_weight` and to the MLP as a weighted cross-entropy loss.
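
As a rough sketch of what balanced weighting means, the snippet below assumes the scikit-learn "balanced" heuristic, where each class gets weight `N / (K * n_c)` for `N` rows, `K` classes, and `n_c` rows in that class. The exact weighting used in training is not part of this release, and `balanced_sample_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Per-row weights w = N / (K * n_c), where n_c is the row's class count."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    class_weight = {c: n_rows / (n_classes * n) for c, n in counts.items()}
    return [class_weight[c] for c in labels]


# Toy labels: one common phase and one rare phase.
y = ["initial_drop"] * 8 + ["self_destruct_cleanup"] * 2
w = balanced_sample_weights(y)
print(w[0], w[-1])  # common-class weight, rare-class weight
```

Under this scheme every class contributes the same total weight, so rare phases are not drowned out by common ones. A list like `w` can be passed as `sample_weight` to `XGBClassifier.fit`.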

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
69 features survive after encoding, drawn from:

- **Per-timestep numeric** (10): `timestep`, `api_call_rate`, `registry_write_count`, `network_connection_count`, `process_injection_flag`, `c2_beacon_interval_sec`, `av_signature_hit_flag`, `sandbox_evasion_flag`, `lateral_propagation_count`, `privilege_escalation_flag`
- **PE static features** (11): `pe_entropy_mean`, `pe_entropy_std`, `import_hash_cluster`, `section_count`, `packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`, `code_section_rx_ratio`, `resource_section_entropy`, `suspicious_import_count`, `packer_detected_flag`
- **Categorical** (6, one-hot encoded): `malware_family`, `threat_actor_tier`, `target_platform`, `obfuscation_technique`, `detection_outcome`, `ep_stack`
- **Engineered** (6): `api_burst_score`, `is_c2_active`, `is_high_net_volume`, `is_stealth_step`, `is_destructive_step`, `lateral_activity_score`

### Leakage audit

No level of any categorical feature predicts a single phase with purity
above 0.17 (the uniform random baseline is 0.10), so no categorical
feature acts as an oracle for the target. The model relies on a mix of
`timestep` (strong but not deterministic) and behavioural features.
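
One way to run such an audit is sketched below. It assumes purity means the largest `P(phase | value)` over all levels of a feature; the exact definition used for the figures above is not stated, and `max_purity` and the toy data are illustrative only.

```python
from collections import Counter, defaultdict


def max_purity(values, phases):
    """Return the highest P(phase | value) over all levels of one feature."""
    by_value = defaultdict(Counter)
    for v, p in zip(values, phases):
        by_value[v][p] += 1
    return max(max(c.values()) / sum(c.values()) for c in by_value.values())


# Toy example: 'leaky' pins down the phase, 'benign' does not.
phases = ["drop", "drop", "c2", "c2"]
leaky = ["a", "a", "b", "b"]
benign = ["x", "y", "x", "y"]
print(max_purity(leaky, phases), max_purity(benign, phases))
```

A purity near 1.0 would mean the feature effectively encodes the label; a value close to `1 / n_classes` means it carries little phase information on its own.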

## Evaluation

### Test-set metrics, seed 42 (n = 900 timesteps from 15 disjoint samples)

**XGBoost** (the published `model_xgb.json` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | **0.9792** |
| Accuracy | **0.9178** |
| Macro-F1 | 0.7781 |
| Weighted-F1 | 0.9173 |

**MLP** (the published `model_mlp.safetensors` artifact)

| Metric | Value |
|---|---:|
| Macro ROC-AUC (OvR) | 0.9681 |
| Accuracy | 0.8222 |
| Macro-F1 | 0.7072 |
| Weighted-F1 | 0.8278 |

### Multi-seed robustness (XGBoost, 10 seeds)

Accuracy and ROC-AUC are tight across seeds — the task is genuinely
learnable, not seed-lucky:

| Metric | Mean | Std | Min | Max |
|---|---:|---:|---:|---:|
| Accuracy | 0.905 | 0.010 | 0.882 | 0.921 |
| Macro-F1 | 0.784 | 0.013 | 0.759 | 0.807 |
| Macro ROC-AUC OvR | 0.975 | 0.002 | 0.972 | 0.979 |

Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
All 10 seeds yielded all 10 classes in the test fold, supporting clean
multi-class ROC-AUC computation.

### Per-class F1 (seed 42) — where the signal is and isn't

| Phase | XGBoost F1 | MLP F1 | Note |
|---|---:|---:|---|
| `c2_communication` | **1.000** | 1.000 | Trivial: tight timestep window 52-59 + c2_beacon signal |
| `persistence_establishment` | **0.992** | 0.870 | Tight timestep window 9-17 + registry writes |
| `lateral_movement` | **0.992** | 0.907 | Tight timestep window 26-34 + lateral_propagation |
| `privilege_escalation` | **0.991** | 0.915 | Tight timestep window 18-25 + privilege flag |
| `data_exfiltration` | **0.970** | 0.918 | Tight timestep window 43-51 + network volume |
| `payload_execution` | **0.963** | 0.698 | Tight timestep window 35-42 + API bursts |
| `initial_drop` | **0.945** | 0.886 | Tight timestep window 0-8 |
| `dormancy_dwell` | 0.530 | 0.520 | Hard: spans full 0-59 timestep range |
| `self_destruct_cleanup` | 0.273 | 0.282 | Hard: spans full 0-59, low row count (227) |
| `sandbox_evasion_stall` | 0.125 | 0.077 | Hard: spans full 0-59, low row count (234) |

Seven phases are near-trivially classified because they sit in tight
timestep windows with characteristic behavioural signatures. **Three
phases — `dormancy_dwell`, `sandbox_evasion_stall`, `self_destruct_cleanup`
— scatter across the full 0–59 timestep range** and lack distinctive
behavioural features (idle/evasion phases have low activity by design),
so a flat-tabular event-level model can't reliably disambiguate them.
Sequence models that consider neighbouring timesteps would help here.
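
The gap between accuracy (0.92) and macro-F1 (0.78) follows directly from the table: macro-F1 is the unweighted mean of the per-class scores, so three weak classes pull it down regardless of how few rows they have. A short check, using the XGBoost column copied from the table above:

```python
# Per-class F1 at seed 42 (XGBoost column of the table above).
f1_xgb = {
    "c2_communication": 1.000,
    "persistence_establishment": 0.992,
    "lateral_movement": 0.992,
    "privilege_escalation": 0.991,
    "data_exfiltration": 0.970,
    "payload_execution": 0.963,
    "initial_drop": 0.945,
    "dormancy_dwell": 0.530,
    "self_destruct_cleanup": 0.273,
    "sandbox_evasion_stall": 0.125,
}
hard = {"dormancy_dwell", "self_destruct_cleanup", "sandbox_evasion_stall"}

# Macro-F1 gives every class equal weight, whatever its row count.
macro_f1 = sum(f1_xgb.values()) / len(f1_xgb)
easy_mean = sum(v for k, v in f1_xgb.items() if k not in hard) / (len(f1_xgb) - len(hard))
hard_mean = sum(f1_xgb[k] for k in hard) / len(hard)
print(round(macro_f1, 4), round(easy_mean, 3), round(hard_mean, 3))
```

The mean reproduces the reported macro-F1 of 0.7781, with the seven windowed phases averaging about 0.98 and the three scattered phases about 0.31.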

### Ablation: which feature groups matter

| Configuration | Accuracy | Macro-F1 | ROC-AUC | Δ accuracy |
|---|---:|---:|---:|---:|
| Full feature set (published) | 0.9178 | 0.7781 | 0.9792 | — |
| No `timestep` | 0.6933 | 0.5963 | 0.9264 | **−0.2244** |
| No behavioural features | 0.9089 | 0.7579 | 0.9705 | −0.0089 |
| No PE static features | 0.9167 | 0.7808 | 0.9786 | −0.0011 |
| No engineered features | 0.9200 | 0.7931 | 0.9797 | +0.0022 |

Three clear findings:

1. **`timestep` is by far the dominant feature** (drops 22 pp when removed,
   ROC-AUC still 0.93). Malware execution progresses in time, and where
   you are in that timeline carries most of the phase signal.
2. **PE static features are barely used for phase prediction.** This is
   honest: PE features (entropy, packed sections, import hashes) inform
   family classification, not phase classification. A buyer doing family
   work should expect to use them; for phase work they can be dropped.
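3. **Behavioural features contribute ~1 pp; engineered features contribute
   nothing measurable.** Removing the engineered features changes accuracy
   by +0.2 pp, well within the multi-seed standard deviation of 1.0 pp.
   Trees recover most of the engineered features on their own.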

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 10 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation mlogloss.

**MLP:** `69 → 128 → 64 → 10`, each hidden layer followed by `BatchNorm1d`
→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
early stopping on validation macro-F1.
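
The MLP description above can be written out in PyTorch roughly as follows. This is a sketch under assumptions: the layer order follows the text, but the module layout and parameter names of the published `model_mlp.safetensors` are not documented here, so this model may not load those weights as-is. `build_mlp` is a hypothetical helper and its weights are randomly initialized.

```python
import torch
from torch import nn


def build_mlp(n_features=69, n_classes=10, dropout=0.3):
    """69 -> 128 -> 64 -> 10, with BatchNorm1d, ReLU, Dropout after each hidden layer."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(64, n_classes),  # raw logits, as nn.CrossEntropyLoss expects
    )


# eval() disables dropout and switches BatchNorm to its running statistics.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 69))  # placeholder batch of 4 standardized rows
    proba = torch.softmax(logits, dim=1)
print(proba.shape)
```

Calling `eval()` before inference matters here: in training mode, dropout and batch statistics would make predictions on the same input vary between calls.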

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.

## Limitations

**This is a baseline reference, not a production sandbox or threat detector.**

1. **Three phases are genuinely hard at this sample size.** `dormancy_dwell`,
   `sandbox_evasion_stall`, and `self_destruct_cleanup` span the full
   0–59 timestep range and have low row counts, giving per-class F1 of
   0.13–0.53 (XGBoost). By design, these phases lack distinctive
   moment-to-moment features (the malware is being quiet to evade
   detection). Sequence models or per-sample aggregation would likely
   improve them.

2. **The pivot away from malware family classification is dataset-limited,
   not method-limited.** Family classification on 100 samples with 10
   classes is at majority baseline. The full 280k-row CYB003 product
   provides ~5,600 samples and supports proper family classification.

3. **Synthetic-vs-real transfer.** The dataset is synthetic and calibrated
   to threat-intelligence and AV-testing benchmark targets (VirusTotal,
   AV-TEST, MITRE ATT&CK Evaluations, Mandiant M-Trends, CrowdStrike GTR,
   Verizon DBIR). Real malware telemetry has different noise
   characteristics, adversary adaptation, and instrumentation gaps. Do
   not assume metrics transfer.

4. **Adversarial robustness not evaluated.** The dataset is not
   adversarially generated; the model has not been red-teamed against
   evasive samples.

5. **MLP brittleness on OOD inputs.** With ~4k training timesteps, the
   MLP can produce confidently-wrong predictions on hand-crafted records
   far from the training manifold. XGBoost is more robust. Use both;
   treat disagreement as a signal for human review.

6. **`timestep` dominance is a property of the dataset.** Real malware
   in production doesn't have a clean "timestep" feature on a per-sample
   60-step normalized timeline — that's a simulator artifact. A buyer
   transferring this baseline to real sandbox traces would need to
   recover an equivalent temporal-position feature from execution-trace
   timestamps relative to detonation.
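
For the last point, one possible way to recover a temporal-position feature is to bin elapsed time since detonation into 60 steps. This is only a sketch: linear binning is an assumption and may not match how the simulator places phases on its timeline. `to_timestep` and its parameters are hypothetical names.

```python
def to_timestep(event_ts, detonation_ts, trace_end_ts, n_steps=60):
    """Map an event timestamp to a bin index in [0, n_steps - 1].

    Assumes linear binning of elapsed time since detonation. Timestamps
    are in seconds (any epoch); events outside the trace are clamped.
    """
    duration = trace_end_ts - detonation_ts
    if duration <= 0:
        raise ValueError("trace_end_ts must be after detonation_ts")
    fraction = (event_ts - detonation_ts) / duration
    return min(n_steps - 1, max(0, int(fraction * n_steps)))


# A 300-second trace: each of the 60 bins covers 5 seconds.
start, middle, end = (to_timestep(t, 0.0, 300.0) for t in (0.0, 152.0, 300.0))
print(start, middle, end)
```

Real traces vary widely in length and pacing, so any such mapping should be validated against labelled traces before the baseline's `timestep` behaviour is relied on.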

## Notes on dataset schema

The CYB003 sample dataset README describes some fields differently from
the actual schema. The model was trained on the actual schema; this note
helps buyers reconcile what they read with what they receive.

| What the README says | What the data actually contains |
|---|---|
| `pe_entropy` (one column) | `pe_entropy_mean` + `pe_entropy_std` (two columns) |
| `process_injection_count` | `process_injection_flag` (binary, not a count) |
| `c2_beacon_active` | `c2_beacon_interval_sec` (seconds, 0 when inactive) |
| `av_detected`, `edr_detected`, `sandbox_evaded`, `dwell_time_hours`, `persistence_mechanism`, `lotl_technique_used` (per-timestep) | None of these exist on per-timestep; equivalents (`av_signature_hit_flag`, `sandbox_evasion_flag`) do exist with different names |
| `ep_stack`: 3 values (`legacy_av`, `ngav_ml_based`, `edr_full`) | `ep_stack`: 8 values (`legacy_av_only`, `ngav_ml_based`, `edr_endpoint_detect`, `av_plus_firewall`, `xdr_extended_detect`, `managed_detection_response`, `deception_honeypot`, `no_protection`) |
| 9 malware families listed | 10 families in the data (`apt_implant` is the additional one) |
| `coordinated_campaign_flag` (described as a flag) | Constant = 1 for all rows in the sample (uninformative) |

The actual per-timestep table also contains rich PE-static features not
listed in the README: `import_hash_cluster`, `section_count`,
`packed_section_ratio`, `string_entropy_mean`, `byte_histogram_chi2`,
`code_section_rx_ratio`, `resource_section_entropy`,
`suspicious_import_count`. These are excellent features for family
classification work and are documented in the model's
`feature_engineering.py`.

None of these discrepancies affects model correctness — the feature
pipeline uses the actual column names. If you build your own pipeline
against the dataset, use the actual columns, not the README descriptions.
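
A small guard like the one below can catch README-era column names before they reach a pipeline. The mapping covers only the one-to-one renames stated in the table above; `stale_columns` and `c2_beacon_active` are hypothetical helpers, not part of `feature_engineering.py`.

```python
# README name -> actual column(s), taken from the table above.
README_TO_ACTUAL = {
    "pe_entropy": ["pe_entropy_mean", "pe_entropy_std"],
    "process_injection_count": ["process_injection_flag"],
    "c2_beacon_active": ["c2_beacon_interval_sec"],
}


def stale_columns(requested):
    """Return {README name: actual columns} for names that need replacing."""
    return {c: README_TO_ACTUAL[c] for c in requested if c in README_TO_ACTUAL}


def c2_beacon_active(row):
    """Derive a binary flag; the interval is 0 when no beacon is active."""
    return int(row["c2_beacon_interval_sec"] > 0)


found = stale_columns(["timestep", "pe_entropy", "c2_beacon_active"])
flag = c2_beacon_active({"c2_beacon_interval_sec": 30.0})
print(found)
print(flag)
```

The derived flag shows how a README-style boolean can be rebuilt from the actual column when downstream code expects it.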

## Intended use

- **Evaluating fit** of the CYB003 dataset for your malware-analysis
  or sandbox-detection research
- **Baseline reference** for new model architectures (especially sequence
  models, which should beat this baseline on the three scattered phases)
- **Teaching and demo** for tabular classification on malware telemetry
- **Feature engineering reference** for per-timestep behavioural data

## Out-of-scope use

- Production sandbox analysis on real malware
- EDR phase tagging on real systems
- Family attribution (this baseline does not address that task; see why above)
- Adversarial-evasion evaluation (dataset not adversarially generated)
- Any operational security decision

## Reproducibility

Outputs above were produced with `seed = 42` (published artifact),
group-aware nested `GroupShuffleSplit` (70/15/15 by sample_id), on the
published sample (`xpertsystems/cyb003-sample`, version 1.0.0, generated
2026-05-16). The feature pipeline in `feature_engineering.py` is
deterministic and the trained weights in this repo correspond exactly
to the metrics above.

Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
`multi_seed_results.json` confirm robust performance across splits.

The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results — sufficient to reproduce inference but not
training.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights (seed 42) |
| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Per-feature-group ablation (timestep, behavioural, PE static, engineered) |
| `multi_seed_results.json` | XGBoost metrics across 10 seeds with aggregate statistics |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB003** dataset contains ~349,000 rows across four files,
with calibrated benchmark validation against 12 metrics drawn from
authoritative threat intelligence and AV-testing sources (VirusTotal,
AV-TEST, MITRE ATT&CK Evaluations, Mandiant, CrowdStrike, Verizon).
The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across
Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials
& Energy.

- 📧 **pradeep@xpertsystems.ai**
- 🌐 **https://xpertsystems.ai**
- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb003-sample
- 🤖 Companion models:
  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)

## Citation

```bibtex
@misc{xpertsystems_cyb003_baseline_2026,
  title  = {CYB003 Baseline Classifier: XGBoost and MLP for Malware Execution Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb003-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb003-sample}
}
```