{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CYB010 Baseline Classifier — Inference Example\n",
    "\n",
    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **attack lifecycle phase** for a security event.\n",
    "\n",
    "**Models predict one of 5 phases:** `benign_background`, `initial_access`, `lateral_movement`, `persistence_establishment`, `exfiltration_or_impact`.\n",
    "\n",
    "**This is a baseline reference model**, not a production phase classifier. See the model card and **`leakage_diagnostic.json`** for the structural-leakage findings (11 oracle paths documented across the dataset)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download model artifacts from Hugging Face"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "REPO_ID = \"xpertsystems/cyb010-baseline-classifier\"\n",
    "\n",
    "files = {}\n",
    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
    "             \"feature_scaler.json\"]:\n",
    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
    "    print(f\"  downloaded: {name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys, os\n",
    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
    "if fe_dir not in sys.path:\n",
    "    sys.path.insert(0, fe_dir)\n",
    "\n",
    "from feature_engineering import (\n",
    "    transform_single, load_meta, build_host_lookup, INT_TO_LABEL,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Load models and metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import xgboost as xgb\n",
    "from safetensors.torch import load_file\n",
    "\n",
    "meta = load_meta(files[\"feature_meta.json\"])\n",
    "with open(files[\"feature_scaler.json\"]) as f:\n",
    "    scaler = json.load(f)\n",
    "\n",
    "N_FEATURES = len(meta[\"feature_names\"])\n",
    "N_CLASSES = len(meta[\"int_to_label\"])\n",
    "print(f\"feature count: {N_FEATURES}\")\n",
    "print(f\"class count:   {N_CLASSES}\")\n",
    "print(f\"label classes: {list(meta['int_to_label'].values())}\")\n",
    "print(f\"\\noracle columns excluded (do not pass these to the model):\")\n",
    "for c in meta.get(\"oracle_excluded\", []):\n",
    "    print(f\"  - {c}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_model = xgb.XGBClassifier()\n",
    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
    "\n",
    "# MLP architecture (must match training)\n",
    "class PhaseMLP(nn.Module):\n",
    "    def __init__(self, n_features, n_classes=5, hidden1=128, hidden2=64, dropout=0.3):\n",
    "        super().__init__()\n",
    "        self.net = nn.Sequential(\n",
    "            nn.Linear(n_features, hidden1),\n",
    "            nn.BatchNorm1d(hidden1),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden1, hidden2),\n",
    "            nn.BatchNorm1d(hidden2),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(hidden2, n_classes),\n",
    "        )\n",
    "    def forward(self, x):\n",
    "        return self.net(x)\n",
    "\n",
    "mlp_model = PhaseMLP(N_FEATURES, n_classes=N_CLASSES)\n",
    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
    "mlp_model.eval()\n",
    "print(\"models loaded\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Load host inventory for host-feature lookup\n",
    "\n",
    "The model uses host context (os_type, host_role, defender_posture, etc.) as features. To predict on a new event, we look up its host features from the host_inventory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "\n",
    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb010-sample\", repo_type=\"dataset\")\n",
    "host_lookup = build_host_lookup(f\"{ds_path}/host_inventory.csv\")\n",
    "print(f\"loaded {len(host_lookup)} host records\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Prediction helper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
    "\n",
    "def predict_attack_phase(event: dict) -> dict:\n",
    "    \"\"\"Predict the attack lifecycle phase for one security event.\n",
    "\n",
    "    Note: do NOT include mitre_tactic, mitre_technique_id,\n",
    "    label_malicious, threat_actor_id, threat_actor_profile, or\n",
    "    event_type in the record. These were structural oracles in the\n",
    "    training data and are excluded from the feature set.\n",
    "\n",
    "    Host features (os_type, host_role, etc.) are looked up from\n",
    "    host_inventory by host_id.\n",
    "    \"\"\"\n",
    "    X = transform_single(event, meta, host_lookup=host_lookup)\n",
    "\n",
    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
    "\n",
    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
    "    with torch.no_grad():\n",
    "        logits = mlp_model(torch.tensor(Xs))\n",
    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
    "\n",
    "    return {\n",
    "        \"xgboost\": {\n",
    "            \"label\": xgb_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
    "        },\n",
    "        \"mlp\": {\n",
    "            \"label\": mlp_label,\n",
    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
    "        },\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Run on an example event\n",
    "\n",
    "Real high-severity authentication event from the CYB010 sample. True phase is `initial_access` — an APT session anomaly with CVSS 7.56 against a workstation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Real event from the sample dataset (true phase: initial_access)\n",
    "example_event = {\n",
    "    \"host_id\": \"HOST-00352\",\n",
    "    \"timestamp\": \"2024-07-22T21:55:40.046569+00:00\",\n",
    "    \"source_port\": 27110,\n",
    "    \"dest_port\": 8443,\n",
    "    \"event_class\": \"authentication\",\n",
    "    \"log_source_type\": \"splunk\",\n",
    "    \"severity_level\": \"high\",\n",
    "    \"label_false_positive\": False,\n",
    "    \"label_log_tampered\": False,\n",
    "    \"cvss_score_analogue\": 7.56,\n",
    "}\n",
    "\n",
    "result = predict_attack_phase(example_event)\n",
    "\n",
    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
    "    print(f\"    P({lbl:30s}) = {p:.4f}\")\n",
    "\n",
    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
    "    print(f\"    P({lbl:30s}) = {p:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Per-class confidence patterns\n",
    "\n",
    "The model has strong confidence on `benign_background` and `exfiltration_or_impact` (per-class F1 0.99 each). The middle phases (`initial_access`, `lateral_movement`, `persistence_establishment`) overlap more in feature space — expect modest confidence (0.4-0.7) on those predictions.\n",
    "\n",
    "`lateral_movement` is the hardest class (F1 0.48 at seed 42). Real SOC data would have stronger sequential signal (event-sequence features within an incident) that the per-event baseline does not capture."
   ]
  },
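  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confidence bands above suggest handling low-confidence predictions separately. The cell below is a minimal sketch, not part of the released code: `top_prediction` and its `min_confidence` default of 0.7 are hypothetical, chosen only to match the upper end of the range quoted above. It expects a label-to-probability mapping shaped like `result['xgboost']['probabilities']`; the probabilities in the example are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def top_prediction(probabilities: dict, min_confidence: float = 0.7) -> dict:\n",
    "    \"\"\"Return the top label, its probability, and a low-confidence flag.\"\"\"\n",
    "    label, p = max(probabilities.items(), key=lambda kv: kv[1])\n",
    "    return {\"label\": label, \"confidence\": p, \"low_confidence\": p < min_confidence}\n",
    "\n",
    "# Made-up probabilities for illustration only (not model output)\n",
    "illustrative = {\n",
    "    \"benign_background\": 0.05,\n",
    "    \"initial_access\": 0.50,\n",
    "    \"lateral_movement\": 0.30,\n",
    "    \"persistence_establishment\": 0.10,\n",
    "    \"exfiltration_or_impact\": 0.05,\n",
    "}\n",
    "print(top_prediction(illustrative))"
   ]
  },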
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Batch prediction on the sample dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "events = pd.read_csv(f\"{ds_path}/security_events.csv\")\n",
    "\n",
    "# Score the first 500 events\n",
    "sample = events.head(500).copy()\n",
    "preds = [predict_attack_phase(row.to_dict())[\"xgboost\"][\"label\"] for _, row in sample.iterrows()]\n",
    "sample[\"xgb_pred\"] = preds\n",
    "\n",
    "ct = pd.crosstab(sample[\"attack_lifecycle_phase\"], sample[\"xgb_pred\"],\n",
    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
    "print(\"Confusion on first 500 sample events (XGBoost):\")\n",
    "print(ct)\n",
    "acc = (sample[\"attack_lifecycle_phase\"] == sample[\"xgb_pred\"]).mean()\n",
    "print(f\"\\nbatch accuracy on first 500 events (in-distribution): {acc:.4f}\")\n",
    "print(\"\\nNote: this includes training-set events. See validation_results.json\\n\"\n",
    "      \"for proper held-out test metrics (group-aware split by incident_id).\")"
   ]
  },
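  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The note printed by the previous cell refers to a group-aware split by `incident_id`. The cell below sketches one way such a split can be made, so that no incident contributes events to both train and test. It is an illustration under stated assumptions, not the procedure behind `validation_results.json`: `group_aware_split`, `test_frac`, and the toy frame are hypothetical, and the sketch assumes every row has a non-missing group id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def group_aware_split(df, group_col=\"incident_id\", test_frac=0.2, seed=42):\n",
    "    \"\"\"Split rows so that no group appears in both train and test.\"\"\"\n",
    "    groups = df[group_col].unique()\n",
    "    rng = np.random.default_rng(seed)\n",
    "    shuffled = rng.permutation(groups)\n",
    "    # Hold out whole groups, at least one\n",
    "    n_test = max(1, int(round(len(shuffled) * test_frac)))\n",
    "    test_groups = list(shuffled[:n_test])\n",
    "    is_test = df[group_col].isin(test_groups)\n",
    "    return df[~is_test], df[is_test]\n",
    "\n",
    "# Toy frame for illustration only (not CYB010 data)\n",
    "toy = pd.DataFrame({\n",
    "    \"incident_id\": [\"INC-1\", \"INC-1\", \"INC-2\", \"INC-3\", \"INC-3\", \"INC-4\"],\n",
    "    \"dest_port\": [443, 8443, 22, 3389, 445, 80],\n",
    "})\n",
    "train_df, test_df = group_aware_split(toy, test_frac=0.25)\n",
    "# No incident may appear on both sides of the split\n",
    "assert set(train_df[\"incident_id\"]).isdisjoint(test_df[\"incident_id\"])\n",
    "print(len(train_df), len(test_df))"
   ]
  },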
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Important reading: the leakage diagnostic\n",
    "\n",
    "Before using CYB010 sample data to train your own models, read **`leakage_diagnostic.json`** in this repo. It documents **11 oracle paths** across the sample's targets:\n",
    "\n",
    "**Phase target oracles (6 paths):**\n",
    "1. `mitre_tactic == \"benign\"` → 100% `benign_background` phase\n",
    "2. `mitre_technique_id` → `mitre_tactic` (perfect ATT&CK-by-design oracle)\n",
    "3. `label_malicious == False` → 100% `benign_background`\n",
    "4. `threat_actor_id == \"NONE\"` → 100% benign\n",
    "5. `threat_actor_profile == \"benign_user\"` → 100% benign\n",
    "6. `event_type` (e.g. `c2_beacon_outbound`) → 100% specific phase\n",
    "\n",
    "**Alert TP target oracles (7 paths)** — for the secondary `label_true_positive` task on `alert_records.csv`:\n",
    "1. `alert_category == \"false_positive_noise\"` → 100% FP\n",
    "2. `label_false_positive` (mirror of target)\n",
    "3. `time_to_detect_seconds == 0` → 100% FP\n",
    "4. `correlated_chain_length == 1` → near-100% FP\n",
    "5. `analyst_triage_priority ∈ {P1,P2,P3}` → 100% TP\n",
    "6. `suppression_reason == NaN` → 100% TP\n",
    "7. `alert_rule_name` (rule names encode the answer)\n",
    "\n",
    "It also documents **2 README-suggested targets that are unlearnable on the sample** after honest leak removal: `threat_actor_profile` 4-class (malicious-only) and `event_class` 12-class."
   ]
  },
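  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to act on the phase-target list above is to drop those columns before building features for your own model. The cell below is a minimal sketch: the column names are copied from that list, while `PHASE_ORACLE_COLUMNS`, `drop_phase_oracles`, and the toy frame are hypothetical names introduced here. It covers only the phase-target columns; the alert-target columns would need the same treatment on `alert_records.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Column names copied from the phase-target oracle list above\n",
    "PHASE_ORACLE_COLUMNS = [\n",
    "    \"mitre_tactic\", \"mitre_technique_id\", \"label_malicious\",\n",
    "    \"threat_actor_id\", \"threat_actor_profile\", \"event_type\",\n",
    "]\n",
    "\n",
    "def drop_phase_oracles(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"Return a copy of df without the phase-target oracle columns.\"\"\"\n",
    "    # errors=\"ignore\" skips any listed column that is absent from df\n",
    "    return df.drop(columns=PHASE_ORACLE_COLUMNS, errors=\"ignore\")\n",
    "\n",
    "# Toy frame for illustration only (not CYB010 data)\n",
    "toy = pd.DataFrame({\n",
    "    \"event_class\": [\"authentication\", \"network\"],\n",
    "    \"mitre_tactic\": [\"benign\", \"exfiltration\"],\n",
    "    \"label_malicious\": [False, True],\n",
    "})\n",
    "print(drop_phase_oracles(toy).columns.tolist())"
   ]
  },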
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Next steps\n",
    "\n",
    "- See `validation_results.json` for held-out test metrics (3,726 events from ~75 test incidents).\n",
    "- See `multi_seed_results.json` for the across-10-seeds picture (accuracy 0.936 ± 0.007, ROC-AUC 0.988 ± 0.001).\n",
    "- See `ablation_results.json` for per-feature-group contribution. `event_class` carries the dominant signal (−18pp macro-F1 when removed); CVSS features are second.\n",
    "- See **`leakage_diagnostic.json`** for the full 11-oracle-path audit.\n",
    "- For the full ~550k-row CYB010 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}