Initial release: XGBoost + MLP baseline on CYB001 sample

Files changed (9)

README.md +344 -0
ablation_results.json +85 -0
feature_engineering.py +363 -0
feature_meta.json +236 -0
feature_scaler.json +1 -0
inference_example.ipynb +343 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
validation_results.json +109 -0

README.md ADDED

	@@ -0,0 +1,344 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - network-traffic
+  - intrusion-detection
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb001-sample
+metrics:
+  - accuracy
+  - f1
+model-index:
+  - name: cyb001-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 3-class network flow classification
+        dataset:
+          type: xpertsystems/cyb001-sample
+          name: CYB001 Synthetic Network Traffic (Sample)
+        metrics:
+          - type: accuracy
+            value: 0.9980
+            name: Test accuracy (XGBoost)
+          - type: f1
+            value: 0.9961
+            name: Test macro-F1 (XGBoost)
+          - type: accuracy
+            value: 0.9932
+            name: Test accuracy (MLP)
+          - type: f1
+            value: 0.9869
+            name: Test macro-F1 (MLP)
+---
+# CYB001 Baseline Classifier
+**Multi-class network flow classifier trained on the CYB001 synthetic
+network traffic sample. Predicts `BENIGN`, `MALICIOUS`, or `AMBIGUOUS`
+from per-flow features.**
+> **Baseline reference, not for production use.** This model demonstrates
+> that the [CYB001 sample dataset](https://huggingface.co/datasets/xpertsystems/cyb001-sample)
+> is learnable end-to-end and gives prospective buyers a working starting
+> point to evaluate against their own pipelines. It is not an intrusion
+> detection system. See [Limitations](#limitations).
+## Model overview
+| Property | Value |
+|---|---|
+| Task | 3-class flow classification (BENIGN / MALICIOUS / AMBIGUOUS) |
+| Training data | `xpertsystems/cyb001-sample` (9,770 flows, sample only) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 101 (after one-hot encoding) |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline |
+Two model artifacts are published. They are designed to be used together — disagreement between them is itself a useful triage signal:
+- `model_xgb.json` — gradient-boosted trees, primary recommendation
+- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+from huggingface_hub import hf_hub_download
+import json, numpy as np, torch, xgboost as xgb
+from safetensors.torch import load_file
+REPO = "xpertsystems/cyb001-baseline-classifier"
+# Download artifacts
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+# Make feature pipeline importable
+import sys, os
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import transform_single, load_meta, INT_TO_LABEL
+meta = load_meta(paths["feature_meta.json"])
+xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
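+# Predict (see inference_example.ipynb for full single-record example).
+# `my_flow_record_dict` is a placeholder: a dict holding one flow row plus
+# its joined session_summary and network_topology fields.
+X = transform_single(my_flow_record_dict, meta)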
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for a full
+copy-paste demo including the MLP load path and a batch run on 200 rows
+from the public sample.
+## Training data
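+Trained on the public CYB001 sample (9,770 flows). Class counts for the
+train and test splits are shown below; the remaining 1,466 flows form the
+validation split: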
+| Label | Train (n=6,838) | Test (n=1,466) | Test share |
+|---|---:|---:|---:|
+| BENIGN | 4,916 | 1,054 | 71.9% |
+| MALICIOUS | 1,378 | 295 | 20.1% |
+| AMBIGUOUS | 544 | 117 | 8.0% |
+Split: 70 / 15 / 15 stratified by label, seed 42.
+Class imbalance was addressed with balanced class weights: applied to
+XGBoost as per-row `sample_weight`, and to the MLP as weighted
+cross-entropy. Stratified splitting preserves the class proportions in
+each of the three splits.
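To make the weighting concrete, here is a minimal sketch of balanced class weights computed from the train counts in the table above. It assumes the common heuristic `n_samples / (n_classes * class_count)`; the README does not state the exact formula, and the helper name `balanced_class_weights` is hypothetical rather than part of the released code.

```python
# Train-split class counts taken from the table above.
TRAIN_COUNTS = {"BENIGN": 4916, "MALICIOUS": 1378, "AMBIGUOUS": 544}


def balanced_class_weights(counts):
    """Assumed heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


weights = balanced_class_weights(TRAIN_COUNTS)
for label, w in weights.items():
    print(f"{label:<10} weight = {w:.3f}")
```

Under this assumption an AMBIGUOUS row weighs roughly nine times as much as a BENIGN row, so each class contributes the same total weight to the loss.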
+### Dataset calibration anchors
+The CYB001 sample is calibrated to 12 named industry signatures. The
+features that surface most prominently in the baseline correspond to
+these anchors:
+| Calibrated signature | Target | Observed (sample) | Feature(s) the model uses |
+|---|---:|---:|---|
+| `c2_beacon_regularity_score` | 0.78 | 0.77 | `iat_cv`, `inter_arrival_time_std` |
+| `payload_entropy_benign_mean` | 4.80 | 4.86 | `payload_entropy_mean` |
+| `fwd_bwd_byte_ratio_benign` | 1.34 | 1.41 | `fwd_bwd_byte_ratio` |
+| `malicious_flow_rate` | 0.172 | 0.202 | (class prior) |
+| `protocol_violation_rate` | 0.015 | 0.016 | `protocol_violation_flag`, `protocol_violation_count` |
+| `scan_probe_density` | 0.043 | 0.045 | `tcp_flag_anomaly_score`, port features |
+Full benchmark table in the [dataset card](https://huggingface.co/datasets/xpertsystems/cyb001-sample).
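The gap between target and observed values is easier to compare as a relative deviation. The sketch below only reuses the six rows of the table above; the function name `relative_deviation` is illustrative and not part of the release.

```python
# (target, observed) pairs copied from the calibration table above.
ANCHORS = {
    "c2_beacon_regularity_score": (0.78, 0.77),
    "payload_entropy_benign_mean": (4.80, 4.86),
    "fwd_bwd_byte_ratio_benign": (1.34, 1.41),
    "malicious_flow_rate": (0.172, 0.202),
    "protocol_violation_rate": (0.015, 0.016),
    "scan_probe_density": (0.043, 0.045),
}


def relative_deviation(target, observed):
    """Signed deviation of the observed value from its target."""
    return (observed - target) / target


for name, (target, observed) in ANCHORS.items():
    print(f"{name:<30} {relative_deviation(target, observed):+.1%}")
```

By this measure `malicious_flow_rate` is the row furthest from its target in the sample.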
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical feature recipe.
+The training script and the inference example both call into it.
+**Three columns are deliberately excluded** because they leak the label:
+- `traffic_category` — perfectly deterministic of label (every `attack_*`
+  category is 100% MALICIOUS, etc.).
+- `attack_subcategory` — non-null iff label is MALICIOUS.
+- `attacker_capability_tier` — generator metadata labeled per flow
+  including benign flows; not a real-world observable at inference time.
+**Five session-level features were kept** after a per-label leakage audit
+(`payload_entropy_mean`, `retransmission_rate`, `protocol_violation_count`,
+`c2_beacon_flag`, `session_risk_score`) because their distributions
+overlap meaningfully across labels (i.e. they behave like detector
+outputs, not oracles). **Three were dropped** (`exfil_volume_bytes`,
+`scan_probe_count`, `lateral_move_flag`) because they are zero for all
+non-MALICIOUS rows.
+Engineered features (each encodes a stated domain hypothesis, see source
+for the one-line rationale per feature):
+- `iat_cv` — inter-arrival-time coefficient of variation. C2 beacon signature.
+- `fwd_bwd_byte_ratio` — exfiltration signature.
+- `bytes_per_packet_fwd`, `payload_density` — flow shape.
+- `tcp_flag_anomaly_score` — RST/URG/FIN density. Scan and protocol-misuse signature.
+- `hour_of_day`, `is_off_hours` — diurnal pattern. APT and insider tiers are off-peak biased in the dataset calibration.
+- `is_well_known_dest_port`, `is_ephemeral_src_port` — port observables.
+## Evaluation
+### Test-set metrics (n = 1,466, stratified)
+**XGBoost**
+| Metric | Value |
+|---|---:|
+| Accuracy | 0.9980 |
+| Macro-F1 | 0.9961 |
+| Weighted-F1 | 0.9980 |
+| Macro ROC-AUC (OvR) | ≈ 1.00 |
+
+| Class | F1 | Support |
+|---|---:|---:|
+| BENIGN | 0.9986 | 1,054 |
+| MALICIOUS | 0.9983 | 295 |
+| AMBIGUOUS | 0.9915 | 117 |
+**MLP**
+| Metric | Value |
+|---|---:|
+| Accuracy | 0.9932 |
+| Macro-F1 | 0.9869 |
+| Weighted-F1 | 0.9932 |
+
+| Class | F1 | Support |
+|---|---:|---:|
+| BENIGN | 0.9962 | 1,054 |
+| MALICIOUS | 0.9899 | 295 |
+| AMBIGUOUS | 0.9746 | 117 |
+Confusion matrices and per-class precision/recall are in
+[`validation_results.json`](./validation_results.json).
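The headline macro-F1 and weighted-F1 values follow directly from the per-class rows. A short check, using only the rounded F1 scores and supports printed in the tables above (so agreement is to about four decimals, not exact):

```python
# (F1, support) per class, copied from the tables above.
XGB = {"BENIGN": (0.9986, 1054), "MALICIOUS": (0.9983, 295), "AMBIGUOUS": (0.9915, 117)}
MLP = {"BENIGN": (0.9962, 1054), "MALICIOUS": (0.9899, 295), "AMBIGUOUS": (0.9746, 117)}


def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)


def weighted_f1(per_class):
    """Mean of the per-class F1 scores weighted by class support."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total


for name, table in (("XGBoost", XGB), ("MLP", MLP)):
    print(f"{name}: macro-F1={macro_f1(table):.4f} weighted-F1={weighted_f1(table):.4f}")
```

Macro-F1 sits below weighted-F1 for both models because the smallest class, AMBIGUOUS, has the lowest F1 and counts as a full third of the macro average.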
+### Ablation: contribution of session-level features
+To check whether the model is genuinely reading the flow-level signal or
+leaning on session aggregates, the same XGBoost configuration was trained
+with all five session-aggregate features removed:
+| Configuration | Accuracy | Macro-F1 | AMBIGUOUS F1 |
+|---|---:|---:|---:|
+| Full feature set (published) | 0.9980 | 0.9961 | 0.991 |
+| Flow-only (session aggregates dropped) | 0.9884 | 0.9776 | 0.957 |
+The session join contributes about **+1.0 pp** of accuracy and **+0.02**
+macro-F1. The model is not session-dominated; the flow-level features
+carry the bulk of the signal. The full numbers for both configurations
+are in [`ablation_results.json`](./ablation_results.json).
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation macro-F1.
+**MLP:** `n_features → 128 → 64 → 3`, each hidden layer followed by
+`BatchNorm1d` → `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss,
+AdamW optimizer, early stopping on validation macro-F1.
+Training hyperparameters (learning rate, batch size, n_estimators,
+early-stopping patience, weight decay, class-weighting strategy) are
+held internally by XpertSystems and are not part of this release.
+## Limitations
+**This is a baseline reference, not an intrusion detection system.**
+1. **Performance is inflated by synthetic structure.** The numbers above
+   reflect performance on calibrated synthetic data where the BENIGN and
+   attack categories sit on distinct statistical signatures by
+   construction. A real production IDS facing live traffic must contend
+   with concept drift, adversarial evasion, encrypted-traffic ambiguity,
+   and a much fatter long tail of benign behaviour. Expect substantial
+   degradation when transferring to real CICIDS-style datasets or
+   in-the-wild traffic.
+2. **Sample size for `AMBIGUOUS` is small.** Only 117 test examples;
+   the per-class F1 has wide confidence bands. The full CYB001 product
+   (~62k AMBIGUOUS flows out of ~500k) supports more reliable estimation.
+3. **Trained on the public 1/60th sample only.** The full product
+   contains additional traffic categories, longer sequences, and
+   richer adversary behaviour. A model trained on the full dataset
+   would perform differently — likely lower headline accuracy with
+   better calibration and generalisation. The intent of this release
+   is reference, not state-of-the-art.
+4. **Topology features are static labels, not signals.** Fields like
+   `defender_architecture` and `firewall_policy` are descriptive
+   categorical attributes of the network segment, not learned defender
+   responses. They help the model condition on context but do not
+   simulate real adversarial dynamics.
+5. **MLP brittleness on OOD inputs.** With ~7k training rows, the MLP
+   can produce confidently-wrong predictions on hand-crafted records
+   whose feature combinations are far from the training manifold. The
+   inference notebook demonstrates this. XGBoost is more robust here.
+   In practice, use both and treat disagreement as a signal for review.
+6. **Class imbalance handling is straightforward.** Class-balanced
+   weights work for this sample but production-scale rare-class
+   detection (e.g. APT C2 at < 0.1% of traffic) needs more careful
+   threshold calibration, ranking metrics, and likely calibrated
+   probabilities rather than argmax classification.
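The "use both and treat disagreement as a signal" advice in item 5 can be sketched as a small triage rule. This is an illustration only: the function name `triage` and the `min_confidence` threshold of 0.8 are assumptions, not values from this release.

```python
LABELS = ["BENIGN", "MALICIOUS", "AMBIGUOUS"]  # same order as LABEL_ORDER


def triage(xgb_proba, mlp_proba, min_confidence=0.8):
    """Return (label, needs_review) for one flow.

    The XGBoost label is reported because the README names it as the
    primary model. A flow is flagged for review when the two models
    disagree or when XGBoost itself is not confident.
    `min_confidence` is a hypothetical threshold.
    """
    xgb_idx = max(range(len(LABELS)), key=lambda i: xgb_proba[i])
    mlp_idx = max(range(len(LABELS)), key=lambda i: mlp_proba[i])
    disagree = xgb_idx != mlp_idx
    low_confidence = xgb_proba[xgb_idx] < min_confidence
    return LABELS[xgb_idx], disagree or low_confidence


print(triage([0.97, 0.02, 0.01], [0.90, 0.05, 0.05]))  # models agree
print(triage([0.10, 0.85, 0.05], [0.70, 0.20, 0.10]))  # models disagree
```

In practice the probability vectors would come from `predict_proba` on the XGBoost model and a softmax over the MLP logits; any threshold should be tuned on validation data.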
+## Intended use
+- **Evaluating fit** of the CYB001 dataset for your IDS / NDR research
+- **Baseline reference** for new model architectures on synthetic
+  network traffic
+- **Teaching and demo** for tabular classification on flow-level features
+- **Feature engineering reference** for CICFlowMeter-compatible fields
+## Out-of-scope use
+- Production intrusion detection on real network traffic
+- Forensic attribution of real attacks
+- Adversarial robustness evaluation (the dataset is not adversarially
+  generated)
+- Any safety-critical decision
+## Reproducibility
+Outputs above were produced with `seed = 42`, stratified 70/15/15 split,
+on the published sample (`xpertsystems/cyb001-sample`, version 1.0.0,
+generated 2026-05-16). The feature pipeline in `feature_engineering.py`
+is deterministic and the trained weights in this repo correspond exactly
+to the metrics above.
+The training script itself is private to XpertSystems. The published
+artifacts contain the feature pipeline, model weights, scaler, metadata,
+and validation results — sufficient to reproduce inference but not
+training.
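For readers who want to build their own comparable split, a stratified 70/15/15 split with a fixed seed can be sketched with the standard library alone. This will not reproduce the published split: the training script is private, its splitter is unknown, and the function `stratified_split` and the toy label list below are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices into train/val/test, preserving class shares."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):  # fixed class order keeps runs repeatable
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy labels with roughly the sample's class mix (not the real data).
labels = ["BENIGN"] * 720 + ["MALICIOUS"] * 200 + ["AMBIGUOUS"] * 80
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each class separately is what keeps the minority AMBIGUOUS share the same in all three parts.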
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights |
+| `model_mlp.safetensors` | PyTorch MLP weights |
+| `feature_engineering.py` | Feature pipeline (load → engineer → encode) |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (not used by XGBoost) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Flow-only vs full feature set comparison |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB001** dataset contains ~685,000 rows across four files
+with calibrated A+ benchmark validation. The full XpertSystems.ai
+synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare,
+Insurance & Risk, Oil & Gas, and Materials & Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb001-sample
+## Citation
+```bibtex
+@misc{xpertsystems_cyb001_baseline_2026,
+  title  = {CYB001 Baseline Classifier: XGBoost and MLP for Synthetic Network Flow Classification},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb001-baseline-classifier},
+  note   = {Baseline reference model trained on xpertsystems/cyb001-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,85 @@

+{
+  "purpose": "Quantify how much the session-aggregate features contribute to the headline number. Trained with identical architecture on the same split, with session features dropped.",
+  "session_features_dropped": [
+    "payload_entropy_mean",
+    "retransmission_rate",
+    "protocol_violation_count",
+    "c2_beacon_flag",
+    "session_risk_score"
+  ],
+  "n_features_full": 101,
+  "n_features_flow_only": 96,
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.9979536152796725,
+    "macro_f1": 0.9961123729105247,
+    "weighted_f1": 0.9979537067605843,
+    "per_class_f1": {
+      "BENIGN": 0.9985761746559089,
+      "MALICIOUS": 0.9983079526226735,
+      "AMBIGUOUS": 0.9914529914529915
+    },
+    "confusion_matrix": {
+      "labels": [
+        "BENIGN",
+        "MALICIOUS",
+        "AMBIGUOUS"
+      ],
+      "matrix": [
+        [
+          1052,
+          1,
+          1
+        ],
+        [
+          0,
+          295,
+          0
+        ],
+        [
+          1,
+          0,
+          116
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.9999888611978185
+  },
+  "flow_only_model_metrics": {
+    "model": "xgboost_flow_only",
+    "accuracy": 0.9884038199181446,
+    "macro_f1": 0.9776308066176851,
+    "weighted_f1": 0.9883464558152856,
+    "per_class_f1": {
+      "BENIGN": 0.9933774834437086,
+      "MALICIOUS": 0.9829931972789115,
+      "AMBIGUOUS": 0.9565217391304348
+    },
+    "confusion_matrix": {
+      "labels": [
+        "BENIGN",
+        "MALICIOUS",
+        "AMBIGUOUS"
+      ],
+      "matrix": [
+        [
+          1050,
+          2,
+          2
+        ],
+        [
+          5,
+          289,
+          1
+        ],
+        [
+          5,
+          2,
+          110
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.9988745635051176
+  },
+  "interpretation": "Removing session aggregates costs roughly 1 percentage point of accuracy. The model is not session-dominated; the flow-level features carry the bulk of the signal."
+}
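The accuracy and macro-F1 values in this file can be recomputed from the two confusion matrices. The sketch below assumes rows are true labels and columns are predictions; the file does not say which, but accuracy and F1 are identical under the transposed reading because false positives and false negatives enter F1 symmetrically.

```python
# Confusion matrices copied from ablation_results.json
# (label order: BENIGN, MALICIOUS, AMBIGUOUS).
FULL = [[1052, 1, 1], [0, 295, 0], [1, 0, 116]]
FLOW_ONLY = [[1050, 2, 2], [5, 289, 1], [5, 2, 110]]


def accuracy(matrix):
    """Diagonal sum divided by the total number of rows."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_class_f1(matrix):
    """F1 per class as 2*TP / (2*TP + FP + FN)."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # rest of the row
        fp = sum(row[k] for row in matrix) - tp  # rest of the column
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


def macro_f1(matrix):
    scores = per_class_f1(matrix)
    return sum(scores) / len(scores)


for name, m in (("full", FULL), ("flow-only", FLOW_ONLY)):
    print(f"{name}: accuracy={accuracy(m):.4f} macro-F1={macro_f1(m):.4f}")
print(f"accuracy delta: {100 * (accuracy(FULL) - accuracy(FLOW_ONLY)):.2f} pp")
```

The accuracy gap corresponds to 14 additional misclassified flows out of 1,466 when the session aggregates are removed.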

feature_engineering.py ADDED

	@@ -0,0 +1,363 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB001 baseline classifier.
+This module produces a flow-level feature matrix and label vector from the
+four CSV files distributed with the CYB001 sample dataset on Hugging Face:
+    network_flows.csv     (primary, one row per flow)
+    session_summary.csv   (one row per session, joined on session_id)
+    network_topology.csv  (one row per network segment, joined on segment_id)
+    flow_events.csv       (one row per security event - NOT used for v1
+                           features; flows lose temporal granularity if
+                           aggregated naively. Reserved for future work.)
+The pipeline is deliberately written to be read end-to-end. Every dropped
+column is dropped with a one-line explanation. Every engineered feature
+sits next to a one-sentence motivation. If you are evaluating the CYB001
+product, this file is the feature recipe; what the model "sees" is exactly
+what this file emits.
+Public API
+----------
+    build_features(flows_path, sessions_path, topology_path) -> (X, y, meta)
+        X : pd.DataFrame  - feature matrix, all numeric, no NaNs
+        y : pd.Series     - integer-encoded label (0=BENIGN, 1=MALICIOUS, 2=AMBIGUOUS)
+        meta : dict       - {feature_names, numeric_features, categorical_levels,
+                            label_to_int, int_to_label}
+The same `meta` dict is used at inference time so a new flow record gets
+encoded identically to training.
+    transform_single(record, meta) -> np.ndarray
+        Encode a single flow record (dict or 1-row DataFrame) for inference.
+License
+-------
+This file ships with the public model on Hugging Face under CC-BY-NC-4.0,
+matching the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Constants - what we keep, what we drop, and why
+# ---------------------------------------------------------------------------
+LABEL_ORDER = ["BENIGN", "MALICIOUS", "AMBIGUOUS"]  # index 0, 1, 2
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# Columns dropped from network_flows.csv because they are ground-truth
+# generator metadata, not observables a real IDS would have at inference time.
+# Including any of these gives perfect or near-perfect accuracy that does
+# not reflect real-world performance.
+LEAKY_FLOW_COLUMNS = [
+    "traffic_category",          # 100% deterministic of label (attack_*/benign_*/ambiguous_*)
+    "attack_subcategory",        # null iff label != MALICIOUS
+    "attacker_capability_tier",  # labeled per flow including benign - generator metadata
+]
+# Identifier / non-feature columns
+ID_COLUMNS = [
+    "flow_id", "session_id",
+    "source_ip_hash", "destination_ip_hash",   # SHA-256 pseudonyms, not useful as features
+    "flow_start_timestamp",                    # consumed by is_off_hours engineered feature
+]
+# Direct numeric features from network_flows.csv (pass-through)
+DIRECT_NUMERIC_FLOW_FEATURES = [
+    "source_port", "dest_port",
+    "flow_duration_ms",
+    "total_fwd_packets", "total_bwd_packets",
+    "total_bytes_fwd", "total_bytes_bwd",
+    "fwd_packet_len_mean", "fwd_packet_len_std",
+    "bwd_packet_len_mean", "bwd_packet_len_std",
+    "flow_bytes_per_sec", "flow_packets_per_sec",
+    "inter_arrival_time_mean", "inter_arrival_time_std",
+    "tcp_flag_syn_count", "tcp_flag_ack_count", "tcp_flag_fin_count",
+    "tcp_flag_rst_count", "tcp_flag_psh_count", "tcp_flag_urg_count",
+    "retransmission_flag", "fragmentation_flag", "protocol_violation_flag",
+]
+# Session-level numeric features (joined on session_id).
+# Selected after a per-label leakage audit:
+#   KEEP: payload_entropy_mean, retransmission_rate, protocol_violation_count,
+#         c2_beacon_flag, session_risk_score   (overlapping distributions across labels)
+#   DROP: exfil_volume_bytes, scan_probe_count, lateral_move_flag
+#         (zero for all BENIGN/AMBIGUOUS - generator oracles, not detector outputs)
+SESSION_FEATURES_KEEP = [
+    "payload_entropy_mean",
+    "retransmission_rate",
+    "protocol_violation_count",
+    "c2_beacon_flag",
+    "session_risk_score",
+]
+# Topology-level numeric features (joined on segment_id)
+TOPOLOGY_NUMERIC_FEATURES = [
+    "trust_level", "avg_concurrent_flows", "bandwidth_mbps",
+    "nat_enabled", "ids_coverage", "diurnal_peak_factor",
+    "feature_space_dim", "alert_threshold",
+    "retraining_cadence_days", "ensemble_size", "device_count",
+]
+# Categorical columns that get one-hot encoded
+CATEGORICAL_FEATURES = [
+    ("protocol",             "flows"),     # TCP / UDP / HTTPS / DNS / SMTP / SSH / FTP / NTP
+    ("flow_lifecycle_phase", "flows"),     # initiation / handshake / transfer / ...
+    ("source_device_type",   "flows"),     # workstation / server / iot / mobile / cloud / ot
+    ("dest_device_type",     "flows"),
+    ("segment_type",         "topology"),  # corporate_lan / dmz / cloud_workload / ...
+    ("firewall_policy",      "topology"),
+    ("qos_policy",           "topology"),
+    ("defender_architecture","topology"),
+]
+# ---------------------------------------------------------------------------
+# Engineered features
+# ---------------------------------------------------------------------------
+def _safe_divide(num: pd.Series, denom: pd.Series, fill: float = 0.0) -> pd.Series:
+    """Element-wise divide, replacing inf/nan from div-by-zero with `fill`."""
+    out = num / denom.replace(0, np.nan)
+    return out.replace([np.inf, -np.inf], np.nan).fillna(fill)
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add nine engineered features that encode domain hypotheses about how
+    each label class behaves. These are NOT learned; they are stated by hand
+    so a buyer can read this function and see what the model is told to look
+    at. Tree models can recover most of these on their own, but giving them
+    explicitly improves both XGBoost convergence and MLP performance.
+    """
+    df = df.copy()
+    # IAT coefficient of variation. Low cv => regular inter-arrival times
+    # => C2 beacon signature (the dataset is calibrated to cv ~= 0.065 for
+    # APT beacons, regularity score ~= 0.93 per the README).
+    df["iat_cv"] = _safe_divide(df["inter_arrival_time_std"],
+                                df["inter_arrival_time_mean"])
+    # Forward/backward byte ratio. >> 1 indicates upload-heavy flow, which
+    # is the exfiltration signature.
+    df["fwd_bwd_byte_ratio"] = _safe_divide(df["total_bytes_fwd"],
+                                            df["total_bytes_bwd"])
+    # Bytes per packet (forward direction). Combined with packet length
+    # std, separates streaming traffic from short-message protocols.
+    total_fwd = df["total_fwd_packets"].replace(0, np.nan)
+    df["bytes_per_packet_fwd"] = (df["total_bytes_fwd"] / total_fwd).fillna(0)
+    # TCP flag anomaly score. RST and URG together, or high counts relative
+    # to total packets, indicate scan/probe or protocol misuse.
+    total_packets = (df["total_fwd_packets"] + df["total_bwd_packets"]).replace(0, np.nan)
+    flag_total = (df["tcp_flag_rst_count"] + df["tcp_flag_urg_count"]
+                  + df["tcp_flag_fin_count"])
+    df["tcp_flag_anomaly_score"] = (flag_total / total_packets).fillna(0)
+    # Payload density. Bytes per packet, normalized to MTU. Low density on
+    # high packet counts indicates beaconing or keep-alive.
+    total_bytes = df["total_bytes_fwd"] + df["total_bytes_bwd"]
+    df["payload_density"] = (total_bytes / (total_packets * 1500)).fillna(0)
+    # Hour of day from timestamp. Off-hours bias is calibrated into the
+    # APT and insider-threat tiers.
+    ts = pd.to_datetime(df["flow_start_timestamp"], errors="coerce")
+    hour = ts.dt.hour.fillna(12).astype(int)
+    df["hour_of_day"] = hour
+    # Off-hours here means 23:00-05:59; hour 22 still counts as on-hours.
+    df["is_off_hours"] = ((hour < 6) | (hour > 22)).astype(int)
+    # Port observables. Well-known ports < 1024, ephemeral ports >= 49152.
+    df["is_well_known_dest_port"] = (df["dest_port"] < 1024).astype(int)
+    df["is_ephemeral_src_port"]   = (df["source_port"] >= 49152).astype(int)
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    flows_path: str | Path,
+    sessions_path: str | Path,
+    topology_path: str | Path,
+) -> tuple[pd.DataFrame, pd.Series, dict[str, Any]]:
+    """
+    Load the three CSVs, join them, drop leaky columns, engineer features,
+    one-hot encode categoricals, and return (X, y, meta).
+    The returned `meta` dict captures the column order and the categorical
+    level set, which is what `transform_single` needs at inference time to
+    encode a new record identically.
+    """
+    flows = pd.read_csv(flows_path)
+    sessions = pd.read_csv(sessions_path)
+    topology = pd.read_csv(topology_path)
+    # Drop columns that leak the label (see LEAKY_FLOW_COLUMNS for rationale)
+    flows = flows.drop(columns=LEAKY_FLOW_COLUMNS, errors="ignore")
+    # Join session-level aggregates
+    df = flows.merge(
+        sessions[["session_id"] + SESSION_FEATURES_KEEP],
+        on="session_id", how="left",
+    )
+    # Join topology features (numeric + categorical)
+    topo_cols = ["segment_id"] + TOPOLOGY_NUMERIC_FEATURES + [
+        col for col, src in CATEGORICAL_FEATURES if src == "topology"
+    ]
+    df = df.merge(topology[topo_cols], on="segment_id", how="left")
+    # Extract labels before adding features
+    y = df["label"].map(LABEL_TO_INT).astype(int)
+    # Engineered features
+    df = _add_engineered_features(df)
+    # Assemble feature columns
+    numeric_features = (
+        DIRECT_NUMERIC_FLOW_FEATURES
+        + SESSION_FEATURES_KEEP
+        + TOPOLOGY_NUMERIC_FEATURES
+        + [
+            "iat_cv", "fwd_bwd_byte_ratio", "bytes_per_packet_fwd",
+            "tcp_flag_anomaly_score", "payload_density",
+            "hour_of_day", "is_off_hours",
+            "is_well_known_dest_port", "is_ephemeral_src_port",
+        ]
+    )
+    X_numeric = df[numeric_features].astype(float)
+    # One-hot encode categoricals. Record the level set in `meta` so we can
+    # reproduce the same columns at inference time even if a new record
+    # contains an unseen level (it will encode to all-zero, which is the
+    # correct fallback for one-hot).
+    categorical_levels: dict[str, list[str]] = {}
+    one_hot_blocks: list[pd.DataFrame] = []
+    for col, _src in CATEGORICAL_FEATURES:
+        levels = sorted(df[col].dropna().unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            df[col].astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        one_hot_blocks.append(block)
+    X = pd.concat([X_numeric.reset_index(drop=True)]
+                  + [b.reset_index(drop=True) for b in one_hot_blocks], axis=1)
+    # Final NaN sweep (defensive - session join can introduce NaN if a
+    # session_id is missing from session_summary.csv).
+    X = X.fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+    }
+    return X, y, meta
+def transform_single(record: dict | pd.DataFrame, meta: dict[str, Any]) -> np.ndarray:
+    """
+    Encode a single flow record for inference.
+    `record` must contain the same columns as network_flows.csv (minus the
+    leaky columns), plus the joined session and topology fields. If you only
+    have the flow row, you must look up the matching session_summary row and
+    network_topology row and merge them into `record` before calling this.
+    Returns a (1, n_features) numpy array ready for model.predict_proba.
+    """
+    if isinstance(record, dict):
+        df = pd.DataFrame([record])
+    else:
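+        # Reset the index so the numeric block (built from .values, default
+        # index) and the one-hot blocks align row-for-row in the concat below.
+        # reset_index returns a new frame, so the caller's data is untouched.
+        df = record.reset_index(drop=True)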
+    df = _add_engineered_features(df)
+    # Numeric features in fixed order
+    numeric = pd.DataFrame({
+        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+        for col in meta["numeric_features"]
+    })
+    # One-hot blocks in fixed order, using the levels seen at fit time
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df)))
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        # Ensure all expected level columns are present (in case a level
+        # didn't appear in this single record)
+        for lvl in levels:
+            colname = f"{col}_{lvl}"
+            if colname not in block.columns:
+                block[colname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    # Reorder to match training column order exactly
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    """Persist meta to JSON for inference-time reuse."""
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    """Load meta from JSON."""
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+if __name__ == "__main__":
+    # Smoke test
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, meta = build_features(
+        base / "network_flows.csv",
+        base / "session_summary.csv",
+        base / "network_topology.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"n features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X dtypes unique: {X.dtypes.unique()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
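The fixed-level one-hot encoding that `transform_single` relies on can be illustrated without pandas. This is a simplified sketch of the idea, not the module's implementation; the helper name `one_hot_fixed` is hypothetical, and the level list is the `protocol` block from `feature_meta.json`.

```python
# Levels recorded at fit time for the `protocol` column (feature_meta.json).
PROTOCOL_LEVELS = ["DNS", "FTP", "HTTPS", "NTP", "SMTP", "SSH", "TCP", "UDP"]


def one_hot_fixed(value, column, levels):
    """Encode `value` against a fixed, ordered list of levels.

    Column names and order depend only on `levels`, never on the input,
    so a level unseen at fit time (or a missing value) yields all zeros.
    """
    return {f"{column}_{lvl}": int(value == lvl) for lvl in levels}


print(one_hot_fixed("HTTPS", "protocol", PROTOCOL_LEVELS))
print(one_hot_fixed("QUIC", "protocol", PROTOCOL_LEVELS))  # unseen level
```

Because the column set is derived from the stored levels, a single record always encodes to the same width and order as the training matrix.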

feature_meta.json ADDED

	@@ -0,0 +1,236 @@

+{
+  "feature_names": [
+    "source_port",
+    "dest_port",
+    "flow_duration_ms",
+    "total_fwd_packets",
+    "total_bwd_packets",
+    "total_bytes_fwd",
+    "total_bytes_bwd",
+    "fwd_packet_len_mean",
+    "fwd_packet_len_std",
+    "bwd_packet_len_mean",
+    "bwd_packet_len_std",
+    "flow_bytes_per_sec",
+    "flow_packets_per_sec",
+    "inter_arrival_time_mean",
+    "inter_arrival_time_std",
+    "tcp_flag_syn_count",
+    "tcp_flag_ack_count",
+    "tcp_flag_fin_count",
+    "tcp_flag_rst_count",
+    "tcp_flag_psh_count",
+    "tcp_flag_urg_count",
+    "retransmission_flag",
+    "fragmentation_flag",
+    "protocol_violation_flag",
+    "payload_entropy_mean",
+    "retransmission_rate",
+    "protocol_violation_count",
+    "c2_beacon_flag",
+    "session_risk_score",
+    "trust_level",
+    "avg_concurrent_flows",
+    "bandwidth_mbps",
+    "nat_enabled",
+    "ids_coverage",
+    "diurnal_peak_factor",
+    "feature_space_dim",
+    "alert_threshold",
+    "retraining_cadence_days",
+    "ensemble_size",
+    "device_count",
+    "iat_cv",
+    "fwd_bwd_byte_ratio",
+    "bytes_per_packet_fwd",
+    "tcp_flag_anomaly_score",
+    "payload_density",
+    "hour_of_day",
+    "is_off_hours",
+    "is_well_known_dest_port",
+    "is_ephemeral_src_port",
+    "protocol_DNS",
+    "protocol_FTP",
+    "protocol_HTTPS",
+    "protocol_NTP",
+    "protocol_SMTP",
+    "protocol_SSH",
+    "protocol_TCP",
+    "protocol_UDP",
+    "flow_lifecycle_phase_connection_initiation",
+    "flow_lifecycle_phase_connection_teardown",
+    "flow_lifecycle_phase_data_transfer",
+    "flow_lifecycle_phase_protocol_handshake",
+    "flow_lifecycle_phase_session_maintenance",
+    "source_device_type_cloud_service",
+    "source_device_type_iot_device",
+    "source_device_type_mobile_endpoint",
+    "source_device_type_ot_controller",
+    "source_device_type_server",
+    "source_device_type_workstation",
+    "dest_device_type_cloud_service",
+    "dest_device_type_iot_device",
+    "dest_device_type_mobile_endpoint",
+    "dest_device_type_ot_controller",
+    "dest_device_type_server",
+    "dest_device_type_workstation",
+    "segment_type_cloud_workload",
+    "segment_type_corporate_lan",
+    "segment_type_data_centre_spine",
+    "segment_type_dmz_perimeter",
+    "segment_type_endpoint_fleet",
+    "segment_type_guest_wifi",
+    "segment_type_ot_ics_control_network",
+    "segment_type_soc_management_plane",
+    "segment_type_zero_trust_segment",
+    "firewall_policy_default_deny",
+    "firewall_policy_open_permissive",
+    "firewall_policy_stateful_inspection",
+    "firewall_policy_strict_allowlist",
+    "firewall_policy_zone_based",
+    "qos_policy_best_effort",
+    "qos_policy_dscp_expedited",
+    "qos_policy_none",
+    "qos_policy_priority_queue",
+    "qos_policy_weighted_fair_queue",
+    "defender_architecture_autoencoder_anomaly",
+    "defender_architecture_ensemble_stacked",
+    "defender_architecture_gradient_boosted_tree",
+    "defender_architecture_isolation_forest",
+    "defender_architecture_lstm_behavioural",
+    "defender_architecture_neural_network_dense",
+    "defender_architecture_rule_based_threshold",
+    "defender_architecture_transformer_sequence"
+  ],
+  "numeric_features": [
+    "source_port",
+    "dest_port",
+    "flow_duration_ms",
+    "total_fwd_packets",
+    "total_bwd_packets",
+    "total_bytes_fwd",
+    "total_bytes_bwd",
+    "fwd_packet_len_mean",
+    "fwd_packet_len_std",
+    "bwd_packet_len_mean",
+    "bwd_packet_len_std",
+    "flow_bytes_per_sec",
+    "flow_packets_per_sec",
+    "inter_arrival_time_mean",
+    "inter_arrival_time_std",
+    "tcp_flag_syn_count",
+    "tcp_flag_ack_count",
+    "tcp_flag_fin_count",
+    "tcp_flag_rst_count",
+    "tcp_flag_psh_count",
+    "tcp_flag_urg_count",
+    "retransmission_flag",
+    "fragmentation_flag",
+    "protocol_violation_flag",
+    "payload_entropy_mean",
+    "retransmission_rate",
+    "protocol_violation_count",
+    "c2_beacon_flag",
+    "session_risk_score",
+    "trust_level",
+    "avg_concurrent_flows",
+    "bandwidth_mbps",
+    "nat_enabled",
+    "ids_coverage",
+    "diurnal_peak_factor",
+    "feature_space_dim",
+    "alert_threshold",
+    "retraining_cadence_days",
+    "ensemble_size",
+    "device_count",
+    "iat_cv",
+    "fwd_bwd_byte_ratio",
+    "bytes_per_packet_fwd",
+    "tcp_flag_anomaly_score",
+    "payload_density",
+    "hour_of_day",
+    "is_off_hours",
+    "is_well_known_dest_port",
+    "is_ephemeral_src_port"
+  ],
+  "categorical_levels": {
+    "protocol": [
+      "DNS",
+      "FTP",
+      "HTTPS",
+      "NTP",
+      "SMTP",
+      "SSH",
+      "TCP",
+      "UDP"
+    ],
+    "flow_lifecycle_phase": [
+      "connection_initiation",
+      "connection_teardown",
+      "data_transfer",
+      "protocol_handshake",
+      "session_maintenance"
+    ],
+    "source_device_type": [
+      "cloud_service",
+      "iot_device",
+      "mobile_endpoint",
+      "ot_controller",
+      "server",
+      "workstation"
+    ],
+    "dest_device_type": [
+      "cloud_service",
+      "iot_device",
+      "mobile_endpoint",
+      "ot_controller",
+      "server",
+      "workstation"
+    ],
+    "segment_type": [
+      "cloud_workload",
+      "corporate_lan",
+      "data_centre_spine",
+      "dmz_perimeter",
+      "endpoint_fleet",
+      "guest_wifi",
+      "ot_ics_control_network",
+      "soc_management_plane",
+      "zero_trust_segment"
+    ],
+    "firewall_policy": [
+      "default_deny",
+      "open_permissive",
+      "stateful_inspection",
+      "strict_allowlist",
+      "zone_based"
+    ],
+    "qos_policy": [
+      "best_effort",
+      "dscp_expedited",
+      "none",
+      "priority_queue",
+      "weighted_fair_queue"
+    ],
+    "defender_architecture": [
+      "autoencoder_anomaly",
+      "ensemble_stacked",
+      "gradient_boosted_tree",
+      "isolation_forest",
+      "lstm_behavioural",
+      "neural_network_dense",
+      "rule_based_threshold",
+      "transformer_sequence"
+    ]
+  },
+  "label_to_int": {
+    "BENIGN": 0,
+    "MALICIOUS": 1,
+    "AMBIGUOUS": 2
+  },
+  "int_to_label": {
+    "0": "BENIGN",
+    "1": "MALICIOUS",
+    "2": "AMBIGUOUS"
+  }
+}
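
The three top-level keys in `feature_meta.json` appear to fit together in a simple way: `feature_names` is `numeric_features` followed by one `<column>_<level>` indicator for every entry in `categorical_levels`, in the order listed (49 numeric + 52 indicators = 101). The sketch below checks that relationship and encodes one record on a trimmed-down metadata dict. `expected_columns` and `encode_record` are hypothetical helpers written for illustration, not functions from `feature_engineering.py`, and mapping an unseen category to all zeros is an assumption.

```python
def expected_columns(meta: dict) -> list[str]:
    """Rebuild the column order: numeric features, then one-hot indicators."""
    cols = list(meta["numeric_features"])
    for column, levels in meta["categorical_levels"].items():
        cols.extend(f"{column}_{level}" for level in levels)
    return cols


def encode_record(record: dict, meta: dict) -> list[float]:
    """Encode one record in feature_names order (illustrative only)."""
    row = [float(record[name]) for name in meta["numeric_features"]]
    for column, levels in meta["categorical_levels"].items():
        # Assumption: a level not seen at training time yields all zeros.
        row.extend(1.0 if record.get(column) == level else 0.0 for level in levels)
    return row


# Trimmed-down stand-in for feature_meta.json (illustrative subset).
meta_small = {
    "feature_names": [
        "source_port", "dest_port",
        "protocol_DNS", "protocol_HTTPS",
        "qos_policy_best_effort", "qos_policy_none",
    ],
    "numeric_features": ["source_port", "dest_port"],
    "categorical_levels": {
        "protocol": ["DNS", "HTTPS"],
        "qos_policy": ["best_effort", "none"],
    },
}

record = {"source_port": 52789, "dest_port": 443,
          "protocol": "HTTPS", "qos_policy": "best_effort"}

print(expected_columns(meta_small) == meta_small["feature_names"])
print(encode_record(record, meta_small))
```

Running the same `expected_columns` check against the full file is a cheap guard against column-order drift between training and inference.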

feature_scaler.json ADDED

	@@ -0,0 +1 @@

+ {"mean": [33246.74978063761, 3092.6561860193037, 3568.162913132495, 40.45100906697865, 37.033050599590524, 27141.939894706054, 18373.229745539633, 681.214243930974, 309.9934191284001, 498.80652237496344, 259.9314126937701, 42512.715467973096, 88.85309593448376, 6202.614864434045, 1564.1459230769228, 0.30622989178122256, 5.673588768645803, 0.02749341912840012, 0.13688212927756654, 0.4839134249780638, 0.006580871599883006, 0.0337818075460661, 0.020181339572974553, 0.04591985960807254, 4.918210324656333, 0.03811861655454812, 0.12445159403334308, 0.03568294823047675, 0.2893867358876865, 0.5161035682948231, 109.0, 884.3067198011114, 0.5624451594033343, 0.7456058496636443, 1.3842245978356242, 70.62840011699328, 0.6860095349517403, 47.037876572097105, 4.353758408891489, 975.8917812225797, 0.5939957658598679, 4.050707090716325, 681.214243930974, 0.007334989923027464, 0.3969841932563371, 12.341620356829482, 0.2244808423515648, 0.8253875402164376, 0.2535829189821585, 0.08028663351857268, 0.0987130739982451, 0.18909037730330505, 0.08730622989178122, 0.10266159695817491, 0.03231939163498099, 0.2244808423515648, 0.18514185434337527, 0.11011991810470897, 0.11085112606025153, 0.4928341620356829, 0.1440479672418836, 0.14214682655747293, 0.1830944720678561, 0.1506288388417666, 0.1646680315881837, 0.15750219362386664, 0.1858730622989178, 0.15823340157940918, 0.16276689090377303, 0.1975723895875987, 0.14360924246855805, 0.15691722725943258, 0.14229306814858145, 0.19684118163205616, 0.0962269669494004, 0.09812810763381105, 0.09213220239836209, 0.09534951740274934, 0.1130447499268792, 0.11801696402456859, 0.1414156186019304, 0.10485522082480257, 0.14083065223749636, 0.19684118163205616, 0.20605440187189236, 0.1715413863702837, 0.23062298917812227, 0.1949400409476455, 0.12035682948230476, 0.2298917812225797, 0.1971336648142732, 0.23866627668909038, 0.213951447791752, 0.0985668324071366, 0.12708394267329629, 0.20108218777420298, 0.15545481134834746, 0.03860778005264697, 
0.18207078093009652, 0.10836501901140684, 0.08876864580286634], "std": [18749.233853911337, 9865.164126780137, 4190.513917198199, 70.31297496929587, 55.68108028369978, 52722.78701229442, 33694.402383674795, 335.5644018194022, 49.512342840830556, 270.8822397388281, 40.15262463355536, 244466.99124808173, 651.6831675402408, 42653.08093220378, 11932.405984612855, 1.9458350117686587, 12.533888245900245, 0.1635281068930336, 0.5531404520192984, 1.6803909327338207, 0.08086111508275948, 0.18068030090744233, 0.14063052768856918, 0.20932662052918952, 1.0708323782832943, 0.026009753855312713, 0.39246443174561263, 0.18551201658587205, 0.2398908556875923, 0.21862773256336868, 1.0, 863.9054990654234, 0.49612155514020645, 0.11670762081450198, 0.18533417502330754, 28.409299682046896, 0.16902662104519858, 23.189955831779056, 2.311340124434154, 542.2224222337586, 0.18671542343860886, 16.074926614793874, 335.5644018194022, 0.05411044984657537, 0.16080068264645078, 6.561931141230041, 0.4172704837070426, 0.37966304603421225, 0.4350934458689722, 0.2717563065620617, 0.2982981995627756, 0.3916090317893234, 0.2823039264898729, 0.30353857668308043, 0.1768598962792337, 0.4172704837070426, 0.3884410045076719, 0.31306206184383634, 0.3139706515776878, 0.4999852087822069, 0.3511640419097188, 0.3492262042384573, 0.3867722366583929, 0.3577128801235108, 0.37090779150009945, 0.36430023473877865, 0.3890326466363804, 0.3649864021965326, 0.3691798504061366, 0.3981968465890802, 0.3507187137731729, 0.363749310369518, 0.3493760176096391, 0.39764035792703506, 0.29492381708363086, 0.2975095397646019, 0.28923363165412663, 0.2937185783775046, 0.3166706484526433, 0.3226518008451023, 0.34847525057745316, 0.3063891835921014, 0.3478722137010598, 0.39764035792703506, 0.40449958390284957, 0.37700891937335396, 0.42126236286042335, 0.3961835126103755, 0.3254021329010349, 0.4207938269197238, 0.3978632080793417, 0.42629949768833464, 0.4101229373439798, 0.2981013378512827, 0.3330913383277822, 0.40083866880603014, 
0.3623642030345733, 0.19267238579308466, 0.38593107324064563, 0.3108635937093521, 0.28443031546987846]}
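
`feature_scaler.json` holds one mean and one standard deviation per feature, in `feature_names` order; the notebook applies them as `(X - mean) / std` before the MLP. The entry for `avg_concurrent_flows` (mean 109.0, std 1.0) would be consistent with a constant column whose zero standard deviation was replaced by 1 to avoid division by zero, but that is an inference from the numbers, not something the diff confirms. A minimal sketch of such a guard, with hypothetical helper names and population standard deviation chosen as an assumption:

```python
from statistics import fmean, pstdev


def fit_scaler(columns: list[list[float]]) -> dict:
    """Per-column mean/std; a zero std is replaced by 1.0 (assumed guard)."""
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]
    return {"mean": means, "std": stds}


def standardize(row: list[float], scaler: dict) -> list[float]:
    """Apply (x - mean) / std, refusing rows of the wrong width."""
    if not (len(row) == len(scaler["mean"]) == len(scaler["std"])):
        raise ValueError("row width does not match scaler length")
    return [(x - m) / s for x, m, s in zip(row, scaler["mean"], scaler["std"])]


# Two toy columns: one varying, one constant (like a single-valued feature).
toy_scaler = fit_scaler([[1.0, 3.0], [109.0, 109.0]])
print(toy_scaler)
print(standardize([3.0, 109.0], toy_scaler))
```

The width check matters in practice: a scaler with 101 entries applied to a differently ordered or truncated feature vector would otherwise fail silently or broadcast incorrectly.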

inference_example.ipynb ADDED

	@@ -0,0 +1,343 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB001 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict on a new flow record.\n",
+    "\n",
+    "**Models predict one of three labels:** `BENIGN`, `MALICIOUS`, or `AMBIGUOUS`.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production IDS. See the model card for full limitations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face\n",
+    "\n",
+    "Five files are needed:\n",
+    "- `model_xgb.json` — XGBoost weights\n",
+    "- `model_mlp.safetensors` — PyTorch MLP weights\n",
+    "- `feature_engineering.py` — feature pipeline (must match the one used at training)\n",
+    "- `feature_meta.json` — feature column order + categorical levels\n",
+    "- `feature_scaler.json` — MLP input standardization (mean / std)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb001-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Make feature_engineering.py importable\n",
+    "import os\n",
+    "import sys\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import transform_single, load_meta, INT_TO_LABEL"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "# --- Metadata ---\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# --- XGBoost ---\n",
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# --- MLP architecture (must match training) ---\n",
+    "class FlowMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=3, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = FlowMLP(N_FEATURES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Define a prediction function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "\n",
+    "def predict_flow(record: dict) -> dict:\n",
+    "    \"\"\"\n",
+    "    Predict the label for one flow record. `record` is a dict containing\n",
+    "    the fields described in the model card's 'Input schema' section.\n",
+    "\n",
+    "    Returns a dict with both models' predictions and per-class probabilities.\n",
+    "    \"\"\"\n",
+    "    X = transform_single(record, meta)\n",
+    "\n",
+    "    # XGBoost\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    # MLP\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Run on an example record\n",
+    "\n",
+    "The fields below are the union of `network_flows.csv`, the joined session-summary subset, and the joined topology fields. In a real deployment you would assemble these by joining a new flow against your session-summary store and your topology lookup.\n",
+    "\n",
+    "This example is a real `BENIGN` HTTPS flow lifted from the sample dataset (workstation → cloud service, port 443). Both models should agree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A real BENIGN HTTPS flow from the sample dataset.\n",
+    "# Workstation -> cloud service, port 443, mid-day. Both models should\n",
+    "# agree on BENIGN. If you hand-construct records, expect occasional\n",
+    "# disagreement between XGBoost and MLP on out-of-distribution inputs.\n",
+    "# Disagreement is itself a useful signal; see the note below.\n",
+    "example_record = {\n",
+    "    # ---- flow-level fields ----\n",
+    "    \"source_port\": 52789, \"dest_port\": 443, \"protocol\": \"HTTPS\",\n",
+    "    \"flow_start_timestamp\": \"2024-01-20 13:27:58.967\",\n",
+    "    \"flow_duration_ms\": 535,\n",
+    "    \"total_fwd_packets\": 37, \"total_bwd_packets\": 30,\n",
+    "    \"total_bytes_fwd\": 17020, \"total_bytes_bwd\": 23310,\n",
+    "    \"fwd_packet_len_mean\": 460, \"fwd_packet_len_std\": 296,\n",
+    "    \"bwd_packet_len_mean\": 777, \"bwd_packet_len_std\": 226,\n",
+    "    \"flow_bytes_per_sec\": 75383.18, \"flow_packets_per_sec\": 125.23,\n",
+    "    \"inter_arrival_time_mean\": 20.618, \"inter_arrival_time_std\": 8.457,\n",
+    "    \"tcp_flag_syn_count\": 0, \"tcp_flag_ack_count\": 0, \"tcp_flag_fin_count\": 0,\n",
+    "    \"tcp_flag_rst_count\": 0, \"tcp_flag_psh_count\": 0, \"tcp_flag_urg_count\": 0,\n",
+    "    \"flow_lifecycle_phase\": \"protocol_handshake\",\n",
+    "    \"source_device_type\": \"workstation\", \"dest_device_type\": \"cloud_service\",\n",
+    "    \"retransmission_flag\": 0, \"fragmentation_flag\": 0, \"protocol_violation_flag\": 0,\n",
+    "\n",
+    "    # ---- session-level fields (from session_summary.csv join) ----\n",
+    "    \"payload_entropy_mean\": 3.6328,\n",
+    "    \"retransmission_rate\": 0.0631,\n",
+    "    \"protocol_violation_count\": 0,\n",
+    "    \"c2_beacon_flag\": 0,\n",
+    "    \"session_risk_score\": 0.1866,\n",
+    "\n",
+    "    # ---- topology fields (from network_topology.csv join) ----\n",
+    "    \"segment_type\": \"corporate_lan\",\n",
+    "    \"trust_level\": 0.6027, \"avg_concurrent_flows\": 109, \"bandwidth_mbps\": 671.0,\n",
+    "    \"nat_enabled\": 1, \"ids_coverage\": 0.8253, \"diurnal_peak_factor\": 1.6239,\n",
+    "    \"feature_space_dim\": 107, \"alert_threshold\": 0.3089,\n",
+    "    \"retraining_cadence_days\": 39, \"ensemble_size\": 1, \"device_count\": 302,\n",
+    "    \"firewall_policy\": \"zone_based\", \"qos_policy\": \"best_effort\",\n",
+    "    \"defender_architecture\": \"lstm_behavioural\",\n",
+    "}\n",
+    "\n",
+    "result = predict_flow(example_record)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in result['xgboost']['probabilities'].items():\n",
+    "    print(f\"    P({lbl}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in result['mlp']['probabilities'].items():\n",
+    "    print(f\"    P({lbl}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Note: when the two models disagree\n",
+    "\n",
+    "XGBoost and the MLP can disagree on out-of-distribution records, particularly hand-crafted inputs whose feature combinations don't lie on the training-data manifold. The MLP, with BatchNorm and only 6,838 training rows, has narrower competence than the tree ensemble. Disagreement is itself a useful triage signal: in a production pipeline you would surface those flows for human review rather than act automatically on either prediction.\n",
+    "\n",
+    "On in-distribution records (e.g. real rows from the sample CSV, as used in section 6 below) the two models agree on >99% of cases."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Batch prediction on the sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Pull the sample dataset CSVs\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb001-sample\", repo_type=\"dataset\")\n",
+    "\n",
+    "flows    = pd.read_csv(f\"{ds_path}/network_flows.csv\")\n",
+    "sessions = pd.read_csv(f\"{ds_path}/session_summary.csv\")\n",
+    "topology = pd.read_csv(f\"{ds_path}/network_topology.csv\")\n",
+    "\n",
+    "# Drop leaky columns the model was never trained on\n",
+    "flows = flows.drop(columns=[\"traffic_category\", \"attack_subcategory\",\n",
+    "                            \"attacker_capability_tier\"], errors=\"ignore\")\n",
+    "\n",
+    "# Build the same enriched frame the training pipeline used\n",
+    "enriched = flows.merge(\n",
+    "    sessions[[\"session_id\", \"payload_entropy_mean\", \"retransmission_rate\",\n",
+    "              \"protocol_violation_count\", \"c2_beacon_flag\", \"session_risk_score\"]],\n",
+    "    on=\"session_id\", how=\"left\",\n",
+    ").merge(topology, on=\"segment_id\", how=\"left\")\n",
+    "\n",
+    "# Score the first 200 rows\n",
+    "sample = enriched.head(200).copy()\n",
+    "preds = []\n",
+    "for _, row in sample.iterrows():\n",
+    "    out = predict_flow(row.to_dict())\n",
+    "    preds.append(out[\"xgboost\"][\"label\"])\n",
+    "\n",
+    "sample[\"xgb_pred\"] = preds\n",
+    "\n",
+    "# Confusion vs ground-truth label\n",
+    "ct = pd.crosstab(sample[\"label\"], sample[\"xgb_pred\"], rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
+    "print(ct)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for full test-set metrics and architecture details.\n",
+    "- The high accuracy is a property of calibrated synthetic data — see the model card's **Limitations** section before extrapolating to production traffic.\n",
+    "- For the full 685k-row CYB001 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
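
The notebook's note on model disagreement can be turned into a small triage rule over the dict that `predict_flow` returns. The sketch below is one possible shape, not part of the repo: `triage` is a hypothetical helper, and the 0.9 confidence floor is an arbitrary illustrative threshold that would need tuning on real data.

```python
def triage(result: dict, min_confidence: float = 0.9) -> dict:
    """Flag a prediction for review when the models disagree or are unsure."""
    xgb, mlp = result["xgboost"], result["mlp"]
    agree = xgb["label"] == mlp["label"]
    # Use the weaker of the two top-class probabilities as the confidence.
    confidence = min(max(xgb["probabilities"].values()),
                     max(mlp["probabilities"].values()))
    return {
        "agree": agree,
        "confidence": confidence,
        "needs_review": (not agree) or confidence < min_confidence,
    }


# Hand-made results in the same shape as predict_flow's return value.
agreeing = {
    "xgboost": {"label": "BENIGN",
                "probabilities": {"BENIGN": 0.98, "MALICIOUS": 0.01, "AMBIGUOUS": 0.01}},
    "mlp": {"label": "BENIGN",
            "probabilities": {"BENIGN": 0.95, "MALICIOUS": 0.03, "AMBIGUOUS": 0.02}},
}
disagreeing = {
    "xgboost": {"label": "BENIGN",
                "probabilities": {"BENIGN": 0.70, "MALICIOUS": 0.20, "AMBIGUOUS": 0.10}},
    "mlp": {"label": "AMBIGUOUS",
            "probabilities": {"BENIGN": 0.30, "MALICIOUS": 0.10, "AMBIGUOUS": 0.60}},
}

print(triage(agreeing))
print(triage(disagreeing))
```

Taking the minimum of the two top-class probabilities is a deliberately conservative choice: a flow is auto-accepted only when both models are confident and agree.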

model_mlp.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b39cd3df3edad09a9b7e41e9adb97f81c8abab37aa5e2c511e53602c74868c0
+size 90324
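
The LFS pointer's size can be sanity-checked against the MLP architecture in the notebook (101 -> 128 -> 64 -> 3 with two `BatchNorm1d` layers). The sketch below counts tensor bytes under the assumption that every parameter and BatchNorm buffer was saved, with float32 weights and int64 `num_batches_tracked` counters; the diff does not confirm those storage details, and `mlp_tensor_bytes` is a hypothetical helper.

```python
def mlp_tensor_bytes(n_features: int = 101, hidden1: int = 128,
                     hidden2: int = 64, n_classes: int = 3) -> int:
    """Bytes of tensor data for Linear+BatchNorm1d blocks plus a final Linear."""
    float32_bytes, int64_bytes = 4, 8
    n_floats = 0
    n_counters = 0
    layers = [(n_features, hidden1, True),
              (hidden1, hidden2, True),
              (hidden2, n_classes, False)]
    for fan_in, fan_out, has_batchnorm in layers:
        n_floats += fan_in * fan_out + fan_out   # Linear weight + bias
        if has_batchnorm:
            n_floats += 4 * fan_out              # BN weight, bias, running mean/var
            n_counters += 1                      # BN num_batches_tracked
    return n_floats * float32_bytes + n_counters * int64_bytes


payload = mlp_tensor_bytes()
print(payload)           # 89116
print(90324 - payload)   # 1208
```

The 1,208-byte remainder is plausible for the safetensors framing, which stores an 8-byte header length followed by a JSON header describing each tensor before the raw data, so the pointer size is consistent with the stated architecture.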

model_xgb.json ADDED

The diff for this file is too large to render.

validation_results.json ADDED

	@@ -0,0 +1,109 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb001-sample",
+  "split": {
+    "train": 6838,
+    "validation": 1466,
+    "test": 1466,
+    "strategy": "stratified",
+    "seed": 42
+  },
+  "n_features": 101,
+  "label_classes": [
+    "BENIGN",
+    "MALICIOUS",
+    "AMBIGUOUS"
+  ],
+  "class_distribution_train": {
+    "BENIGN": 4915,
+    "MALICIOUS": 1379,
+    "AMBIGUOUS": 544
+  },
+  "class_distribution_test": {
+    "BENIGN": 1054,
+    "MALICIOUS": 295,
+    "AMBIGUOUS": 117
+  },
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 3 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.9979536152796725,
+        "macro_f1": 0.9961123729105247,
+        "weighted_f1": 0.9979537067605843,
+        "per_class_f1": {
+          "BENIGN": 0.9985761746559089,
+          "MALICIOUS": 0.9983079526226735,
+          "AMBIGUOUS": 0.9914529914529915
+        },
+        "confusion_matrix": {
+          "labels": [
+            "BENIGN",
+            "MALICIOUS",
+            "AMBIGUOUS"
+          ],
+          "matrix": [
+            [
+              1052,
+              1,
+              1
+            ],
+            [
+              0,
+              295,
+              0
+            ],
+            [
+              1,
+              0,
+              116
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9999888611978185
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 101 -> 128 -> 64 -> 3, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.9931787175989086,
+        "macro_f1": 0.9868796182274947,
+        "weighted_f1": 0.9931977860171972,
+        "per_class_f1": {
+          "BENIGN": 0.9961977186311787,
+          "MALICIOUS": 0.9898648648648649,
+          "AMBIGUOUS": 0.9745762711864406
+        },
+        "confusion_matrix": {
+          "labels": [
+            "BENIGN",
+            "MALICIOUS",
+            "AMBIGUOUS"
+          ],
+          "matrix": [
+            [
+              1048,
+              2,
+              4
+            ],
+            [
+              2,
+              293,
+              0
+            ],
+            [
+              0,
+              2,
+              115
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.9995571752214697
+      }
+    }
+  }
+}
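
The headline metrics in `validation_results.json` can be re-derived from the confusion matrices alone, which is a useful check when comparing against your own pipeline. The sketch below assumes rows are true labels and columns are predictions; that reading is consistent with the row sums matching `class_distribution_test` (1054, 295, 117). `metrics_from_confusion` is a hypothetical helper, not code from the repo.

```python
def metrics_from_confusion(matrix: list[list[int]]) -> dict:
    """Accuracy, per-class F1 and macro-F1 from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_class_f1 = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                         # rest of true row i
        fp = sum(matrix[r][i] for r in range(n)) - tp    # rest of predicted column i
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": correct / total,
        "per_class_f1": per_class_f1,
        "macro_f1": sum(per_class_f1) / n,
    }


# Matrices copied from validation_results.json (BENIGN, MALICIOUS, AMBIGUOUS).
xgb_matrix = [[1052, 1, 1], [0, 295, 0], [1, 0, 116]]
mlp_matrix = [[1048, 2, 4], [2, 293, 0], [0, 2, 115]]

for name, matrix in [("xgboost", xgb_matrix), ("mlp", mlp_matrix)]:
    m = metrics_from_confusion(matrix)
    print(f"{name}: accuracy={m['accuracy']:.4f} macro_f1={m['macro_f1']:.4f}")
```

With only 117 `AMBIGUOUS` test rows, a single extra misclassification moves that class's F1 by several tenths of a point, so small differences between the two models on macro-F1 should be read with that in mind.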