{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CYB001 Baseline Classifier — Inference Example\n",
    "\n",
    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict on a new flow record.\n",
    "\n",
    "**Models predict one of three labels:** `BENIGN`, `MALICIOUS`, or `AMBIGUOUS`.\n",
    "\n",
    "**This is a baseline reference model**, not a production IDS. See the model card for full limitations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download model artifacts from Hugging Face\n",
    "\n",
    "Five files are needed:\n",
    "- `model_xgb.json` — XGBoost weights\n",
    "- `model_mlp.safetensors` — PyTorch MLP weights\n",
    "- `feature_engineering.py` — feature pipeline (must match the one used at training)\n",
    "- `feature_meta.json` — feature column order + categorical levels\n",
    "- `feature_scaler.json` — MLP input standardization (mean / std)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "REPO_ID = \"xpertsystems/cyb001-baseline-classifier\"\n",
    "\n",
    "files = {}\n",
    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
    "             \"feature_scaler.json\"]:\n",
    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
    "    print(f\"  downloaded: {name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make feature_engineering.py importable\n",
    "import os, sys\n",
    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
    "if fe_dir not in sys.path:\n",
    "    sys.path.insert(0, fe_dir)\n",
    "\n",
    "from feature_engineering import transform_single, load_meta, INT_TO_LABEL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Load models and metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import xgboost as xgb\n",
    "from safetensors.torch import load_file\n",
    "\n",
    "# --- Metadata ---\n",
    "meta = load_meta(files[\"feature_meta.json\"])\n",
    "with open(files[\"feature_scaler.json\"]) as f:\n",
    "    scaler = json.load(f)\n",
    "\n",
    "N_FEATURES = len(meta[\"feature_names\"])\n",
    "print(f\"feature count: {N_FEATURES}\")\n",
    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- XGBoost ---\n",
    "xgb_model = xgb.XGBClassifier()\n",
    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
    "\n",
    "# --- MLP architecture (must match training) ---\n",
    "class FlowMLP(nn.Module):\n",
    "    def __init__(self, n_features, n_classes=3, hidden1=128, hidden2=64, dropout=0.3):\n",
    "        super().__init__()\n",
    "        self.net = nn.Sequential(\n",
    "            nn.Linear(n_features, hidden1),\n",
    "            nn.BatchNorm1d(hidden1),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden1, hidden2),\n",
    "            nn.BatchNorm1d(hidden2),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden2, n_classes),\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.net(x)\n",
    "\n",
    "mlp_model = FlowMLP(N_FEATURES)\n",
    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
    "# eval() makes BatchNorm use its running statistics and disables Dropout,\n",
    "# which single-record inference requires.\n",
    "mlp_model.eval()\n",
    "print(\"models loaded\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Define a prediction function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
    "\n",
    "def predict_flow(record: dict) -> dict:\n",
    "    \"\"\"\n",
    "    Predict the label for one flow record. `record` is a dict containing\n",
    "    the fields described in the model card's 'Input schema' section.\n",
    "\n",
    "    Returns a dict with both models' predictions and per-class probabilities.\n",
    "    \"\"\"\n",
    "    X = transform_single(record, meta)\n",
    "\n",
    "    # XGBoost\n",
    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
    "\n",
    "    # MLP\n",
    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
    "    with torch.no_grad():\n",
    "        logits = mlp_model(torch.tensor(Xs))\n",
    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
    "\n",
    "    return {\n",
    "        \"xgboost\": {\n",
    "            \"label\": xgb_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
    "        },\n",
    "        \"mlp\": {\n",
    "            \"label\": mlp_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
    "        },\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Run on an example record\n",
    "\n",
    "The fields below are the union of `network_flows.csv`, the joined session-summary subset, and the joined topology fields. In a real deployment you would assemble these by joining a new flow against your session-summary store and your topology lookup.\n",
    "\n",
    "This example is a real `BENIGN` HTTPS flow lifted from the sample dataset (workstation → cloud service, port 443). Both models should agree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A real BENIGN HTTPS flow from the sample dataset.\n",
    "# Workstation -> cloud service, port 443, mid-day. Both models should\n",
    "# agree on BENIGN. If you hand-construct records, expect occasional\n",
    "# disagreement between XGBoost and MLP on out-of-distribution inputs -\n",
    "# disagreement is itself a useful signal; see note below.\n",
    "example_record = {\n",
    "    # ---- flow-level fields ----\n",
    "    \"source_port\": 52789, \"dest_port\": 443, \"protocol\": \"HTTPS\",\n",
    "    \"flow_start_timestamp\": \"2024-01-20 13:27:58.967\",\n",
    "    \"flow_duration_ms\": 535,\n",
    "    \"total_fwd_packets\": 37, \"total_bwd_packets\": 30,\n",
    "    \"total_bytes_fwd\": 17020, \"total_bytes_bwd\": 23310,\n",
    "    \"fwd_packet_len_mean\": 460, \"fwd_packet_len_std\": 296,\n",
    "    \"bwd_packet_len_mean\": 777, \"bwd_packet_len_std\": 226,\n",
    "    \"flow_bytes_per_sec\": 75383.18, \"flow_packets_per_sec\": 125.23,\n",
    "    \"inter_arrival_time_mean\": 20.618, \"inter_arrival_time_std\": 8.457,\n",
    "    \"tcp_flag_syn_count\": 0, \"tcp_flag_ack_count\": 0, \"tcp_flag_fin_count\": 0,\n",
    "    \"tcp_flag_rst_count\": 0, \"tcp_flag_psh_count\": 0, \"tcp_flag_urg_count\": 0,\n",
    "    \"flow_lifecycle_phase\": \"protocol_handshake\",\n",
    "    \"source_device_type\": \"workstation\", \"dest_device_type\": \"cloud_service\",\n",
    "    \"retransmission_flag\": 0, \"fragmentation_flag\": 0, \"protocol_violation_flag\": 0,\n",
    "\n",
    "    # ---- session-level fields (from session_summary.csv join) ----\n",
    "    \"payload_entropy_mean\": 3.6328,\n",
    "    \"retransmission_rate\": 0.0631,\n",
    "    \"protocol_violation_count\": 0,\n",
    "    \"c2_beacon_flag\": 0,\n",
    "    \"session_risk_score\": 0.1866,\n",
    "\n",
    "    # ---- topology fields (from network_topology.csv join) ----\n",
    "    \"segment_type\": \"corporate_lan\",\n",
    "    \"trust_level\": 0.6027, \"avg_concurrent_flows\": 109, \"bandwidth_mbps\": 671.0,\n",
    "    \"nat_enabled\": 1, \"ids_coverage\": 0.8253, \"diurnal_peak_factor\": 1.6239,\n",
    "    \"feature_space_dim\": 107, \"alert_threshold\": 0.3089,\n",
    "    \"retraining_cadence_days\": 39, \"ensemble_size\": 1, \"device_count\": 302,\n",
    "    \"firewall_policy\": \"zone_based\", \"qos_policy\": \"best_effort\",\n",
    "    \"defender_architecture\": \"lstm_behavioural\",\n",
    "}\n",
    "\n",
    "result = predict_flow(example_record)\n",
    "\n",
    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
    "for lbl, p in result['xgboost']['probabilities'].items():\n",
    "    print(f\"    P({lbl}) = {p:.4f}\")\n",
    "\n",
    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
    "for lbl, p in result['mlp']['probabilities'].items():\n",
    "    print(f\"    P({lbl}) = {p:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note: when the two models disagree\n",
    "\n",
    "XGBoost and the MLP can disagree on out-of-distribution records — particularly hand-crafted inputs whose feature combinations don't lie on the training-data manifold. The MLP, with BatchNorm and only ~7k training rows, has narrower competence than the tree ensemble. Disagreement is itself a useful triage signal: in a production pipeline you would surface those flows for human review rather than auto-act on either prediction.\n",
    "\n",
    "On in-distribution records (e.g. real rows from the sample CSV, as used in section 6 below) the two models agree on >99% of cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Batch prediction on the sample dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "import pandas as pd\n",
    "\n",
    "# Pull the sample dataset CSVs\n",
    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb001-sample\", repo_type=\"dataset\")\n",
    "\n",
    "flows    = pd.read_csv(f\"{ds_path}/network_flows.csv\")\n",
    "sessions = pd.read_csv(f\"{ds_path}/session_summary.csv\")\n",
    "topology = pd.read_csv(f\"{ds_path}/network_topology.csv\")\n",
    "\n",
    "# Drop leaky columns the model was never trained on\n",
    "flows = flows.drop(columns=[\"traffic_category\", \"attack_subcategory\",\n",
    "                            \"attacker_capability_tier\"], errors=\"ignore\")\n",
    "\n",
    "# Build the same enriched frame the training pipeline used\n",
    "enriched = flows.merge(\n",
    "    sessions[[\"session_id\", \"payload_entropy_mean\", \"retransmission_rate\",\n",
    "              \"protocol_violation_count\", \"c2_beacon_flag\", \"session_risk_score\"]],\n",
    "    on=\"session_id\", how=\"left\",\n",
    ").merge(topology, on=\"segment_id\", how=\"left\")\n",
    "\n",
    "# Score the first 200 rows\n",
    "sample = enriched.head(200).copy()\n",
    "preds = []\n",
    "for _, row in sample.iterrows():\n",
    "    out = predict_flow(row.to_dict())\n",
    "    preds.append(out[\"xgboost\"][\"label\"])\n",
    "\n",
    "sample[\"xgb_pred\"] = preds\n",
    "\n",
    "# Confusion vs ground-truth label\n",
    "ct = pd.crosstab(sample[\"label\"], sample[\"xgb_pred\"], rownames=[\"true\"], colnames=[\"pred\"])\n",
    "print(\"Confusion on first 200 sample rows (XGBoost):\")\n",
    "print(ct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Next steps\n",
    "\n",
    "- See `validation_results.json` for full test-set metrics and architecture details.\n",
    "- The high accuracy is a property of calibrated synthetic data — see the model card's **Limitations** section before extrapolating to production traffic.\n",
    "- For the full 685k-row CYB001 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}