---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- cybersecurity
- network-traffic
- intrusion-detection
- tabular-classification
- synthetic-data
- xgboost
- baseline
pipeline_tag: tabular-classification
base_model: []
datasets:
- xpertsystems/cyb001-sample
metrics:
- accuracy
- f1
model-index:
- name: cyb001-baseline-classifier
  results:
  - task:
      type: tabular-classification
      name: 3-class network flow classification
    dataset:
      type: xpertsystems/cyb001-sample
      name: CYB001 Synthetic Network Traffic (Sample)
    metrics:
    - type: accuracy
      value: 0.9980
      name: Test accuracy (XGBoost)
    - type: f1
      value: 0.9961
      name: Test macro-F1 (XGBoost)
    - type: accuracy
      value: 0.9932
      name: Test accuracy (MLP)
    - type: f1
      value: 0.9869
      name: Test macro-F1 (MLP)
---

# CYB001 Baseline Classifier

**Multi-class network flow classifier trained on the CYB001 synthetic
network traffic sample. Predicts `BENIGN`, `MALICIOUS`, or `AMBIGUOUS`
from per-flow features.**

> **Baseline reference, not for production use.** This model demonstrates
> that the [CYB001 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb001-sample)
> is learnable end-to-end and gives prospective buyers a working starting
> point to evaluate against their own pipelines. It is not an intrusion
> detection system. See [Limitations](#limitations).

## Model overview

| Property | Value |
|---|---|
| Task | 3-class flow classification (BENIGN / MALICIOUS / AMBIGUOUS) |
| Training data | `xpertsystems/cyb001-sample` (9,770 flows, sample only) |
| Models | XGBoost + PyTorch MLP |
| Input features | 101 (after one-hot encoding) |
| License | CC-BY-NC-4.0 (matches dataset) |
| Status | Reference baseline |

Two model artifacts are published. They are designed to be used together,
since disagreement between them is itself a useful triage signal:

- `model_xgb.json`: gradient-boosted trees, primary recommendation
- `model_mlp.safetensors`: PyTorch MLP in SafeTensors format

## Quick start

```bash
pip install xgboost torch safetensors pandas huggingface_hub
```

```python
import os
import sys

import numpy as np
import xgboost as xgb
from huggingface_hub import hf_hub_download

REPO = "xpertsystems/cyb001-baseline-classifier"

# Download artifacts into the local Hugging Face cache.
# (torch and safetensors are only needed for the MLP path shown in the notebook.)
paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

# Make the feature pipeline importable
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import transform_single, load_meta, INT_TO_LABEL

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier()
xgb_model.load_model(paths["model_xgb.json"])

# Predict. `my_flow_record_dict` is a placeholder for one raw flow record
# (a dict) that you supply; see inference_example.ipynb for a full example.
X = transform_single(my_flow_record_dict, meta)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])
```

See [`inference_example.ipynb`](./inference_example.ipynb) for a full
copy-paste demo including the MLP load path and a batch run on 200 rows
from the public sample.

## Training data

Trained on the public sample of CYB001 (9,770 flows). Label counts for the
train and test splits:

| Label | Train (n=6,838) | Test (n=1,466) | Test share |
|---|---:|---:|---:|
| BENIGN | 4,916 | 1,054 | 71.9% |
| MALICIOUS | 1,378 | 295 | 20.1% |
| AMBIGUOUS | 544 | 117 | 8.0% |

Split: 70 / 15 / 15 (train / validation / test), stratified by label, seed 42.

Class imbalance was addressed with class-balanced weights (the
`class_weight='balanced'` heuristic, passed to XGBoost as `sample_weight`)
and weighted cross-entropy (MLP). Stratified splitting preserves the class
proportions in each split.

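The split and weighting described above can be sketched without any dependencies. This is an illustration under stated assumptions, not the private training code: `stratified_split` and `balanced_class_weights` are hypothetical helpers, the toy label list is invented, and the weight formula is the common "balanced" heuristic, `n_samples / (n_classes * class_count)`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # shuffle within each class, then cut at 70% / 85%
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}

# Toy label list (invented) with roughly the class mix of the sample.
toy_labels = ["BENIGN"] * 720 + ["MALICIOUS"] * 200 + ["AMBIGUOUS"] * 80
train_idx, val_idx, test_idx = stratified_split(toy_labels)

# Class weights computed from the train counts in the table above.
train_labels = ["BENIGN"] * 4916 + ["MALICIOUS"] * 1378 + ["AMBIGUOUS"] * 544
weights = balanced_class_weights(train_labels)
print({c: round(w, 2) for c, w in weights.items()})
```

With the train counts from the table, the weights come out at roughly 0.46 (BENIGN), 1.65 (MALICIOUS), and 4.19 (AMBIGUOUS), so each AMBIGUOUS row counts about nine times as much as a BENIGN row in the loss.
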
### Dataset calibration anchors

The CYB001 sample is calibrated to 12 named industry signatures. The
features that surface most prominently in the baseline correspond to
these anchors:

| Calibrated signature | Target | Observed (sample) | Feature(s) the model uses |
|---|---:|---:|---|
| `c2_beacon_regularity_score` | 0.78 | 0.77 | `iat_cv`, `inter_arrival_time_std` |
| `payload_entropy_benign_mean` | 4.80 | 4.86 | `payload_entropy_mean` |
| `fwd_bwd_byte_ratio_benign` | 1.34 | 1.41 | `fwd_bwd_byte_ratio` |
| `malicious_flow_rate` | 0.172 | 0.202 | (class prior) |
| `protocol_violation_rate` | 0.015 | 0.016 | `protocol_violation_flag`, `protocol_violation_count` |
| `scan_probe_density` | 0.043 | 0.045 | `tcp_flag_anomaly_score`, port features |

The full benchmark table is in the [dataset card](https://huggingface.co/datasets/xpertsystems/cyb001-sample).

## Feature pipeline

The bundled `feature_engineering.py` is the canonical feature recipe.
The training script and the inference example both call into it.

**Three columns are deliberately excluded** because they leak the label:

- `traffic_category`: perfectly deterministic of label (every `attack_*`
  category is 100% MALICIOUS, etc.).
- `attack_subcategory`: non-null if and only if the label is MALICIOUS.
- `attacker_capability_tier`: generator metadata assigned to every flow,
  including benign ones; not a real-world observable at inference time.

**Five session-level features were kept** after a per-label leakage audit
(`payload_entropy_mean`, `retransmission_rate`, `protocol_violation_count`,
`c2_beacon_flag`, `session_risk_score`) because their distributions
overlap meaningfully across labels (i.e. they behave like detector
outputs, not oracles). **Three were dropped** (`exfil_volume_bytes`,
`scan_probe_count`, `lateral_move_flag`) because they are zero for all
non-MALICIOUS rows.

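The drop criterion for the three removed session features (zero for every non-MALICIOUS row) is easy to check mechanically. The sketch below is a simplified stand-in for the audit, not the authors' procedure: `leaks_label` is a hypothetical helper, the `label` key is an assumed column name, and the rows are invented toy values.

```python
def leaks_label(rows, feature, label_key="label"):
    """Return the only label for which `feature` is ever non-zero, else None.

    A feature that is zero outside a single class acts as an oracle for that
    class and is a candidate for removal.
    """
    all_labels = {row[label_key] for row in rows}
    labels_with_signal = {row[label_key] for row in rows if row[feature] != 0}
    if len(all_labels) > 1 and len(labels_with_signal) == 1:
        return next(iter(labels_with_signal))
    return None

# Invented toy rows, for illustration only.
rows = [
    {"label": "BENIGN",    "exfil_volume_bytes": 0,     "retransmission_rate": 0.01},
    {"label": "AMBIGUOUS", "exfil_volume_bytes": 0,     "retransmission_rate": 0.04},
    {"label": "MALICIOUS", "exfil_volume_bytes": 52000, "retransmission_rate": 0.03},
    {"label": "MALICIOUS", "exfil_volume_bytes": 0,     "retransmission_rate": 0.00},
]

for feature in ("exfil_volume_bytes", "retransmission_rate"):
    print(feature, "->", leaks_label(rows, feature))
```

On these toy rows `exfil_volume_bytes` is flagged as MALICIOUS-only, while `retransmission_rate` is non-zero across several labels and passes. A fuller audit would also compare per-label distributions, as the paragraph above describes for the five retained features.
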
Engineered features (each encodes a stated domain hypothesis; see the source
for the one-line rationale per feature):

- `iat_cv`: inter-arrival-time coefficient of variation. C2 beacon signature.
- `fwd_bwd_byte_ratio`: exfiltration signature.
- `bytes_per_packet_fwd`, `payload_density`: flow shape.
- `tcp_flag_anomaly_score`: RST/URG/FIN density. Scan and protocol-misuse signature.
- `hour_of_day`, `is_off_hours`: diurnal pattern. APT and insider tiers are off-peak biased in the dataset calibration.
- `is_well_known_dest_port`, `is_ephemeral_src_port`: port observables.

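To make a few of the engineered features above concrete, here is a minimal sketch of how they could be derived from one flow record. It is not the code in `feature_engineering.py`: the raw field names (`inter_arrival_times`, `fwd_bytes`, `bwd_bytes`, `src_port`, `dst_port`) and the 07:00 to 19:00 business-hours window are assumptions made for illustration. The port cut-offs follow the IANA ranges (well-known ports below 1024, dynamic ports from 49152).

```python
import statistics

def engineer(flow):
    """Derive a handful of illustrative features from one raw flow record (a dict)."""
    iats = flow["inter_arrival_times"]          # assumed: seconds between packets
    mean_iat = statistics.fmean(iats)
    # Coefficient of variation: std / mean. Regular beacons give values near 0.
    iat_cv = statistics.pstdev(iats) / mean_iat if mean_iat > 0 else 0.0
    hour = flow["hour_of_day"]
    return {
        "iat_cv": iat_cv,
        # max(..., 1) guards against division by zero on one-way flows
        "fwd_bwd_byte_ratio": flow["fwd_bytes"] / max(flow["bwd_bytes"], 1),
        "is_off_hours": int(hour < 7 or hour >= 19),            # assumed window
        "is_well_known_dest_port": int(flow["dst_port"] < 1024),
        "is_ephemeral_src_port": int(flow["src_port"] >= 49152),
    }

# Invented beacon-like flow: near-constant 60 s spacing, upload-heavy, at 03:00.
beacon = {
    "inter_arrival_times": [60.0, 60.1, 59.9, 60.0],
    "fwd_bytes": 5000, "bwd_bytes": 1000,
    "hour_of_day": 3, "src_port": 51514, "dst_port": 443,
}
features = engineer(beacon)
print(features)
```

The near-constant spacing yields an `iat_cv` close to zero, which is the regularity signal the C2 beacon hypothesis relies on.
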
## Evaluation

### Test-set metrics (n = 1,466, stratified)

**XGBoost**

| Metric | Value |
|---|---:|
| Accuracy | 0.9980 |
| Macro-F1 | 0.9961 |
| Weighted-F1 | 0.9980 |
| Macro ROC-AUC (OvR) | ≈ 1.00 |

| Class | F1 | Support |
|---|---:|---:|
| BENIGN | 0.9986 | 1,054 |
| MALICIOUS | 0.9983 | 295 |
| AMBIGUOUS | 0.9915 | 117 |

**MLP**

| Metric | Value |
|---|---:|
| Accuracy | 0.9932 |
| Macro-F1 | 0.9869 |
| Weighted-F1 | 0.9932 |

| Class | F1 | Support |
|---|---:|---:|
| BENIGN | 0.9962 | 1,054 |
| MALICIOUS | 0.9899 | 295 |
| AMBIGUOUS | 0.9746 | 117 |

Confusion matrices and per-class precision/recall are in
[`validation_results.json`](./validation_results.json).

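The aggregate scores follow directly from the per-class tables. As a consistency check, the sketch below recomputes macro-F1 (unweighted mean over classes) and weighted-F1 (support-weighted mean) from the per-class F1 and support values reported above; the helper names are illustrative.

```python
# Per-class (F1, support) pairs copied from the tables above.
XGB = {"BENIGN": (0.9986, 1054), "MALICIOUS": (0.9983, 295), "AMBIGUOUS": (0.9915, 117)}
MLP = {"BENIGN": (0.9962, 1054), "MALICIOUS": (0.9899, 295), "AMBIGUOUS": (0.9746, 117)}

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)

def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total

for name, table in (("XGBoost", XGB), ("MLP", MLP)):
    print(f"{name}: macro-F1={macro_f1(table):.4f}, weighted-F1={weighted_f1(table):.4f}")
```

Both models reproduce the reported aggregates to four decimals. Macro-F1 sits below weighted-F1 in each case because the small AMBIGUOUS class, which has the lowest F1, carries a third of the macro average but only 8% of the weighted one.
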
### Ablation: contribution of session-level features

To check whether the model is genuinely reading the flow-level signal or
leaning on session aggregates, the same XGBoost configuration was trained
with all five session-aggregate features removed:

| Configuration | Accuracy | Macro-F1 | AMBIGUOUS F1 |
|---|---:|---:|---:|
| Full feature set (published) | 0.9980 | 0.9961 | 0.991 |
| Flow-only (session aggregates dropped) | 0.9884 | 0.9776 | 0.957 |

The session join contributes about **+1.0 pp** of accuracy and **+0.02**
macro-F1. The model is not session-dominated; the flow-level features
carry the bulk of the signal. The full numbers for both configurations
are in [`ablation_results.json`](./ablation_results.json).

### Architecture

**XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes),
`hist` tree method, class-balanced sample weights, early stopping on
validation macro-F1.

**MLP:** `n_features → 128 → 64 → 3`, each hidden layer followed by
`BatchNorm1d` → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss,
AdamW optimizer, early stopping on validation macro-F1.

Training hyperparameters (learning rate, batch size, n_estimators,
early-stopping patience, weight decay, class-weighting strategy) are
held internally by XpertSystems and are not part of this release.

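The MLP layout described above can be written down in a few lines of PyTorch. This is a minimal sketch of the stated shape only (101 inputs, hidden widths 128 and 64, 3 outputs); `build_mlp` is a hypothetical helper, and the parameter names it produces are not guaranteed to match the keys stored in `model_mlp.safetensors`, so refer to the inference notebook for the actual load path.

```python
import torch
from torch import nn

def build_mlp(n_features: int = 101, n_classes: int = 3, dropout: float = 0.3) -> nn.Sequential:
    """n_features -> 128 -> 64 -> n_classes, with BatchNorm1d -> ReLU -> Dropout after each hidden layer."""
    layers = []
    width_in = n_features
    for width_out in (128, 64):
        layers += [
            nn.Linear(width_in, width_out),
            nn.BatchNorm1d(width_out),
            nn.ReLU(),
            nn.Dropout(dropout),
        ]
        width_in = width_out
    layers.append(nn.Linear(width_in, n_classes))  # raw logits; softmax is applied outside
    return nn.Sequential(*layers)

model = build_mlp().eval()           # eval mode: dropout off, BatchNorm uses running stats
with torch.no_grad():
    logits = model(torch.randn(4, 101))   # random stand-in for 4 scaled feature rows
probs = torch.softmax(logits, dim=1)
print(probs.shape)
```

The network returns logits, which is the form `nn.CrossEntropyLoss` expects during training; the softmax is only needed at inference time to obtain class probabilities.
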
## Limitations

**This is a baseline reference, not an intrusion detection system.**

1. **Performance is inflated by synthetic structure.** The numbers above
   reflect performance on calibrated synthetic data where the BENIGN and
   attack categories sit on distinct statistical signatures by
   construction. A real production IDS facing live traffic must contend
   with concept drift, adversarial evasion, encrypted-traffic ambiguity,
   and a much fatter long tail of benign behaviour. Expect substantial
   degradation when transferring to real CICIDS-style datasets or
   in-the-wild traffic.

2. **Sample size for `AMBIGUOUS` is small.** Only 117 test examples;
   the per-class F1 has wide confidence bands. The full CYB001 product
   (~62k AMBIGUOUS flows out of ~500k) supports more reliable estimation.

3. **Trained on the public 1/60th sample only.** The full product
   contains additional traffic categories, longer sequences, and
   richer adversary behaviour. A model trained on the full dataset
   would perform differently, likely with lower headline accuracy and
   better calibration and generalisation. The intent of this release
   is reference, not state-of-the-art.

4. **Topology features are static labels, not signals.** Fields like
   `defender_architecture` and `firewall_policy` are descriptive
   categorical attributes of the network segment, not learned defender
   responses. They help the model condition on context but do not
   simulate real adversarial dynamics.

5. **MLP brittleness on OOD inputs.** With ~7k training rows, the MLP
   can produce confidently-wrong predictions on hand-crafted records
   whose feature combinations are far from the training manifold. The
   inference notebook demonstrates this. XGBoost is more robust here.
   In practice, use both and treat disagreement as a signal for review.

6. **Class imbalance handling is straightforward.** Class-balanced
   weights work for this sample but production-scale rare-class
   detection (e.g. APT C2 at < 0.1% of traffic) needs more careful
   threshold calibration, ranking metrics, and likely calibrated
   probabilities rather than argmax classification.

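Item 5 recommends running both models and treating disagreement as a review signal. One way that rule could look is sketched below. It is an illustration, not part of the published artifacts: `triage` is a hypothetical helper, the label order is assumed (in practice take it from `INT_TO_LABEL`), and the `min_conf` threshold of 0.9 is an arbitrary placeholder that would need tuning.

```python
LABELS = ("BENIGN", "MALICIOUS", "AMBIGUOUS")  # assumed order; use INT_TO_LABEL in practice

def triage(proba_xgb, proba_mlp, labels=LABELS, min_conf=0.9):
    """Combine two per-class probability vectors into (decision, reason)."""
    top_xgb = max(range(len(labels)), key=lambda i: proba_xgb[i])
    top_mlp = max(range(len(labels)), key=lambda i: proba_mlp[i])
    if top_xgb != top_mlp:
        return "REVIEW", "models disagree"
    if min(proba_xgb[top_xgb], proba_mlp[top_mlp]) < min_conf:
        return "REVIEW", "low confidence"
    return labels[top_xgb], "models agree"

# Invented probability vectors, for illustration only.
agree = triage([0.98, 0.01, 0.01], [0.95, 0.03, 0.02])
disagree = triage([0.10, 0.85, 0.05], [0.70, 0.20, 0.10])
unsure = triage([0.60, 0.30, 0.10], [0.55, 0.35, 0.10])
print(agree, disagree, unsure)
```

Routing both disagreements and low-confidence agreements to review trades analyst time for fewer confidently-wrong automatic labels, which is the failure mode item 5 describes for the MLP.
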
## Intended use

- **Evaluating fit** of the CYB001 dataset for your IDS / NDR research
- **Baseline reference** for new model architectures on synthetic
  network traffic
- **Teaching and demo** for tabular classification on flow-level features
- **Feature engineering reference** for CICFlowMeter-compatible fields

## Out-of-scope use

- Production intrusion detection on real network traffic
- Forensic attribution of real attacks
- Adversarial robustness evaluation (the dataset is not adversarially
  generated)
- Any safety-critical decision

## Reproducibility

Outputs above were produced with `seed = 42`, stratified 70/15/15 split,
on the published sample (`xpertsystems/cyb001-sample`, version 1.0.0,
generated 2026-05-16). The feature pipeline in `feature_engineering.py`
is deterministic and the trained weights in this repo correspond exactly
to the metrics above.

The training script itself is private to XpertSystems. The published
artifacts contain the feature pipeline, model weights, scaler, metadata,
and validation results, which is sufficient to reproduce inference but not
training.

## Files in this repo

| File | Purpose |
|---|---|
| `model_xgb.json` | XGBoost weights |
| `model_mlp.safetensors` | PyTorch MLP weights |
| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
| `feature_meta.json` | Feature column order + categorical levels |
| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
| `ablation_results.json` | Flow-only vs full feature set comparison |
| `inference_example.ipynb` | End-to-end inference demo notebook |
| `README.md` | This file |

## Contact and full product

The full **CYB001** dataset contains ~685,000 rows across four files
with calibrated A+ benchmark validation. The full XpertSystems.ai
synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
Insurance & Risk, Oil & Gas, and Materials & Energy.

- Email: **pradeep@xpertsystems.ai**
- Website: **https://xpertsystems.ai**
- Dataset: https://huggingface.co/datasets/xpertsystems/cyb001-sample

## Citation

```bibtex
@misc{xpertsystems_cyb001_baseline_2026,
  title  = {CYB001 Baseline Classifier: XGBoost and MLP for Synthetic Network Flow Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb001-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb001-sample}
}
```