{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# LogTriageEnv: Training LLM Agents to Triage Production Incidents\n",
    "\n",
    "**Meta Γ— PyTorch Γ— Scaler OpenEnv Grand Finale 2026**\n",
    "\n",
    "This notebook trains an LLM agent with GRPO to identify root causes in cascading production failures.\n",
    "\n",
    "## Quick Info\n",
    "- **GPU:** T4+ required (15GB+ VRAM)\n",
    "- **Time:** 10-15 minutes\n",
    "- **Model:** Auto-selects 32B→7B→3B based on VRAM\n",
    "- **Output:** Trained model + reward curves + CSV logs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "## Step 1: Check GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "print(\"[GPU CHECK]\")\n",
    "if torch.cuda.is_available():\n",
    "    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
    "    print(f\"VRAM: {vram_gb:.1f} GB\")\n",
    "    VRAM_GB = vram_gb\n",
    "else:\n",
    "    print(\"No GPU found\")\n",
    "    VRAM_GB = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## Step 2: Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Installing dependencies in correct order...\")\n",
    "print(\"Step 1: Upgrade pip\")\n",
    "!pip install -q -U pip\n",
    "print(\"Step 2: Install Unsloth FIRST (critical for patching)\")\n",
    "!pip install -q unsloth\n",
    "print(\"Step 3: Install PyTorch\")\n",
    "!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\n",
    "print(\"Step 4: Install remaining packages\")\n",
    "!pip install -q bitsandbytes peft trl transformers datasets accelerate matplotlib requests huggingface_hub mergekit llm_blender\n",
    "print(\"βœ“ All dependencies installed successfully\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## Step 3: The Problem\n",
    "\n",
    "### Scenario: Production Incident at 2 AM\n",
    "\n",
    "Six services firing alerts:\n",
    "```\n",
    "api-gateway      β†’ ERROR: timeout (most visible)\n",
    "auth-service     β†’ WARN: connection pool exhausted\n",
    "user-db          β†’ ERROR: slow query\n",
    "payment-db       β†’ [no logs yet] (ROOT CAUSE - 3 hops upstream)\n",
    "```\n",
    "\n",
    "**Question:** Which service to page first?\n",
    "\n",
    "**Naive Answer:** api-gateway ❌\n",
    "\n",
    "**Correct Answer:** payment-db βœ…\n",
    "\n",
    "### Why It's Hard\n",
    "- Root cause **never logs first**\n",
    "- Symptoms cascade before causes appear\n",
    "- Agent must reason **backward** through dependencies\n",
    "- LLaMA 3.3 70B baseline: only 0.65 accuracy\n",
    "\n",
    "### How We Train\n",
    "GRPO with dense reward shaping forces causal reasoning:\n",
    "- +0.3 for correct root cause\n",
    "- +0.3 for correct escalation\n",
    "- +0.3 for correct fix\n",
    "- **0 for wrong combinations**"
   ]
  },
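  {
   "cell_type": "markdown",
   "id": "cell-6b",
   "metadata": {},
   "source": [
    "The next cell is a minimal, self-contained sketch of the two ideas above. It is **not** the environment's real implementation: the dependency edges mirror the scenario text and the reward constants come from the list above, but the action labels (`page-dba`, `failover`) and both helper functions are hypothetical illustrations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration (NOT the real LogTriageEnv code): walk the dependency\n",
    "# chain upstream from the noisiest symptom, then score a triage decision\n",
    "# with the dense shaped reward described above.\n",
    "\n",
    "# Each service points at the service it depends on (downstream -> upstream).\n",
    "DEPENDS_ON = {\n",
    "    \"api-gateway\": \"auth-service\",\n",
    "    \"auth-service\": \"user-db\",\n",
    "    \"user-db\": \"payment-db\",\n",
    "    \"payment-db\": None,\n",
    "}\n",
    "\n",
    "def walk_upstream(symptom):\n",
    "    \"\"\"Follow dependencies until a service with no upstream: the root-cause candidate.\"\"\"\n",
    "    node = symptom\n",
    "    while DEPENDS_ON[node] is not None:\n",
    "        node = DEPENDS_ON[node]\n",
    "    return node\n",
    "\n",
    "def shaped_reward(pred, truth):\n",
    "    \"\"\"+0.3 per correct field; a wrong root cause zeroes everything (toy rule).\"\"\"\n",
    "    if pred[\"root_cause\"] != truth[\"root_cause\"]:\n",
    "        return 0.0\n",
    "    reward = 0.3\n",
    "    reward += 0.3 if pred[\"escalation\"] == truth[\"escalation\"] else 0.0\n",
    "    reward += 0.3 if pred[\"fix\"] == truth[\"fix\"] else 0.0\n",
    "    return reward\n",
    "\n",
    "truth = {\"root_cause\": \"payment-db\", \"escalation\": \"page-dba\", \"fix\": \"failover\"}\n",
    "naive = dict(truth, root_cause=\"api-gateway\")  # pages the most visible symptom\n",
    "good = dict(truth, root_cause=walk_upstream(\"api-gateway\"))\n",
    "\n",
    "print(f\"Upstream walk from api-gateway -> {walk_upstream('api-gateway')}\")\n",
    "print(f\"Naive triage reward:   {shaped_reward(naive, truth):.1f}\")\n",
    "print(f\"Correct triage reward: {shaped_reward(good, truth):.1f}\")"
   ]
  },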
  {
   "cell_type": "markdown",
   "id": "cell-7",
   "metadata": {},
   "source": [
    "## Step 4: Intelligent Model Selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"[MODEL SELECTION]\")\n",
    "\n",
    "if VRAM_GB >= 24:\n",
    "    model_id = \"Qwen/Qwen2.5-32B-Instruct\"\n",
    "    model_size = \"32B (BEST)\"\n",
    "    improvement = \"+0.12 to +0.15\"\n",
    "    print(f\"βœ“ {VRAM_GB:.1f} GB VRAM\")\n",
    "    print(f\"βœ“ Selected: {model_size}\")\n",
    "elif VRAM_GB >= 10:\n",
    "    model_id = \"Qwen/Qwen2.5-7B-Instruct\"\n",
    "    model_size = \"7B (GOOD)\"\n",
    "    improvement = \"+0.04 to +0.06\"\n",
    "    print(f\"βœ“ {VRAM_GB:.1f} GB VRAM\")\n",
    "    print(f\"βœ“ Selected: {model_size}\")\n",
    "else:\n",
    "    model_id = \"Qwen/Qwen2.5-3B-Instruct\"\n",
    "    model_size = \"3B (FALLBACK)\"\n",
    "    improvement = \"+0.015\"\n",
    "    print(f\"⚠ {VRAM_GB:.1f} GB VRAM (limited)\")\n",
    "    print(f\"⚠ Selected: {model_size}\")\n",
    "\n",
    "print()\n",
    "print(f\"Model: {model_id}\")\n",
    "print(f\"Expected cascading_failure improvement: {improvement}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## Step 5: Launch Training\n",
    "\n",
    "⏱️ This takes ~10-15 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import os\n",
    "import shutil\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"[STEP 5A] Clone Repository from GitHub\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Clone the repository\n",
    "repo_url = \"https://github.com/rohitdecodes/logtriage-env.git\"\n",
    "repo_dir = \"logtriage-env\"\n",
    "\n",
    "# Remove existing repo if it exists\n",
    "if os.path.exists(repo_dir):\n",
    "    print(f\"⚠ {repo_dir} already exists, removing...\")\n",
    "    shutil.rmtree(repo_dir)\n",
    "\n",
    "try:\n",
    "    print(f\"Cloning from {repo_url}...\")\n",
    "    result = subprocess.run(\n",
    "        [\"git\", \"clone\", repo_url, repo_dir],\n",
    "        capture_output=True,\n",
    "        text=True,\n",
    "        timeout=300\n",
    "    )\n",
    "\n",
    "    if result.returncode == 0:\n",
    "        print(f\"βœ“ Repository cloned successfully\")\n",
    "        train_py_path = os.path.join(repo_dir, \"train.py\")\n",
    "    else:\n",
    "        print(f\"⚠ Clone failed: {result.stderr}\")\n",
    "        train_py_path = \"train.py\"\n",
    "except Exception as e:\n",
    "    print(f\"⚠ Clone error: {e}\")\n",
    "    train_py_path = \"train.py\"\n",
    "\n",
    "print()\n",
    "print(\"=\"*60)\n",
    "print(\"[STEP 5B] Launch Training\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Check if train.py exists (either from clone or current directory)\n",
    "if os.path.exists(train_py_path):\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"[START] LogTriageEnv Training\")\n",
    "    print(\"=\"*60)\n",
    "    print(f\"Model: {model_id}\")\n",
    "    print(f\"Episodes: 50 per task (150 total)\")\n",
    "    print(f\"Algorithm: GRPO + 4-bit Unsloth\")\n",
    "    print(\"=\"*60)\n",
    "    print()\n",
    "\n",
    "    cmd = [\n",
    "        \"python\", train_py_path,\n",
    "        \"--model\", model_id,\n",
    "        \"--task\", \"all\",\n",
    "        \"--episodes\", \"50\",\n",
    "        \"--load_in_4bit\",\n",
    "        \"--grpo_max_steps\", \"35\",\n",
    "        \"--env_url\", \"https://ogrohit-logtriage-env.hf.space\"\n",
    "    ]\n",
    "\n",
    "    try:\n",
    "        result = subprocess.run(cmd, capture_output=False, text=True, timeout=1800)\n",
    "        if result.returncode == 0:\n",
    "            print(\"\\n\" + \"=\"*60)\n",
    "            print(\"βœ“ TRAINING COMPLETE\")\n",
    "            print(\"=\"*60)\n",
    "        else:\n",
    "            print(f\"\\n⚠ Process returned code {result.returncode}\")\n",
    "    except subprocess.TimeoutExpired:\n",
    "        print(\"⚠ Training timed out after 30 minutes\")\n",
    "    except Exception as e:\n",
    "        print(f\"Error: {e}\")\n",
    "else:\n",
    "    print(f\"⚠ train.py not found at {train_py_path}\")\n",
    "    print(\"βœ— TRAINING FAILED\")\n",
    "    print(\"Make sure the repository clone was successful or train.py exists in current directory\")"
   ]
  },
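  {
   "cell_type": "markdown",
   "id": "cell-10b",
   "metadata": {},
   "source": [
    "Under the hood, `train.py` talks to the hosted LogTriageEnv over HTTP. The next cell is a minimal sketch of what one episode exchange *might* look like, assuming an OpenEnv-style interface with `/reset` and `/step` endpoints; the routes, payload fields, and the `run_one_episode` helper are illustrative assumptions, not the verified API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch of an OpenEnv-style HTTP episode loop.\n",
    "# ASSUMPTIONS: the /reset and /step routes and the payload/response\n",
    "# fields below are illustrative, not the verified LogTriageEnv API.\n",
    "import requests\n",
    "\n",
    "ENV_URL = \"https://ogrohit-logtriage-env.hf.space\"\n",
    "\n",
    "def run_one_episode(policy, task=\"cascading_failure\"):\n",
    "    # Start an episode and get the initial observation (the log excerpt).\n",
    "    obs = requests.post(f\"{ENV_URL}/reset\", json={\"task\": task}, timeout=30).json()\n",
    "    total_reward, done = 0.0, False\n",
    "    while not done:\n",
    "        action = policy(obs)  # e.g. the LLM's triage decision as text\n",
    "        resp = requests.post(f\"{ENV_URL}/step\", json={\"action\": action}, timeout=30).json()\n",
    "        obs = resp.get(\"observation\", obs)\n",
    "        total_reward += resp.get(\"reward\", 0.0)\n",
    "        done = resp.get(\"done\", True)\n",
    "    return total_reward\n",
    "\n",
    "# Uncomment to try a trivial stand-in policy against the live Space:\n",
    "# print(run_one_episode(lambda obs: \"root_cause: payment-db\"))"
   ]
  },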
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "## Step 6: Analyze Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"RESULTS\")\n",
    "print(\"=\"*60)\n",
    "print()\n",
    "\n",
    "tasks = [\"single_crash\", \"cascading_failure\", \"silent_degradation\"]\n",
    "\n",
    "for task in tasks:\n",
    "    checkpoint_file = f\"./phase2_checkpoints/{task}_ep50.json\"\n",
    "    \n",
    "    if os.path.exists(checkpoint_file):\n",
    "        with open(checkpoint_file, 'r') as f:\n",
    "            data = json.load(f)\n",
    "        \n",
    "        rewards = data.get('rewards', [])\n",
    "        \n",
    "        if rewards:\n",
    "            first_10 = sum(rewards[:10]) / min(10, len(rewards))\n",
    "            last_10 = sum(rewards[-10:]) / min(10, len(rewards))\n",
    "            improvement = last_10 - first_10\n",
    "            \n",
    "            symbol = \"βœ“\" if improvement > 0 else \"↓\"\n",
    "            task_name = task.replace(\"_\", \" \").title()\n",
    "            \n",
    "            print(f\"{symbol} {task_name}\")\n",
    "            print(f\"  First 10 avg: {first_10:+.3f}\")\n",
    "            print(f\"  Last 10 avg:  {last_10:+.3f}\")\n",
    "            print(f\"  Improvement:  {improvement:+.3f}\")\n",
    "            print()\n",
    "    else:\n",
    "        print(f\"⚠ {task}: checkpoint not found\")\n",
    "        print()\n",
    "\n",
    "print(\"=\"*60)\n",
    "print(\"βœ“ Key metric: Cascading Failure improvement\")\n",
    "print(\"  (Shows genuine multi-hop causal learning)\")\n",
    "print(\"=\"*60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## Step 7: Visualize Reward Curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import matplotlib.pyplot as plt\n",
    "from PIL import Image\n",
    "\n",
    "if os.path.exists(\"reward_curve.png\"):\n",
    "    img = Image.open(\"reward_curve.png\")\n",
    "    plt.figure(figsize=(14, 8))\n",
    "    plt.imshow(img)\n",
    "    plt.axis('off')\n",
    "    plt.title(\"Training Reward Curves\", fontsize=14, fontweight='bold')\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    print(\"βœ“ Reward curves displayed\")\n",
    "else:\n",
    "    print(\"⚠ reward_curve.png not found\")\n",
    "    print(\"Generated after first training run\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## Step 8: Verify CSV Logs (Experimental Tracking)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "print(\"[CSV TRACKING VERIFICATION]\")\n",
    "print()\n",
    "\n",
    "csv_dir = \"./logs\"\n",
    "if os.path.exists(csv_dir):\n",
    "    files = os.listdir(csv_dir)\n",
    "    print(f\"βœ“ Log directory exists: {csv_dir}\")\n",
    "    print(f\"  CSV files: {files}\")\n",
    "    print()\n",
    "    \n",
    "    # Show sample of first CSV\n",
    "    if files:\n",
    "        csv_file = os.path.join(csv_dir, files[0])\n",
    "        df = pd.read_csv(csv_file)\n",
    "        print(f\"[{files[0]}]\")\n",
    "        print(df.head(10).to_string())\n",
    "        print(f\"\\nβœ“ {len(df)} episodes tracked\")\n",
    "else:\n",
    "    print(f\"⚠ Log directory not found: {csv_dir}\")\n",
    "    print(\"CSV logs are generated during training\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## Step 9: Download Outputs (Colab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "try:\n",
    "    from google.colab import files\n",
    "    \n",
    "    # Download key outputs\n",
    "    files_to_download = [\n",
    "        \"reward_curve.png\",\n",
    "        \"logs\",\n",
    "        \"phase2_checkpoints\"\n",
    "    ]\n",
    "    \n",
    "    for f in files_to_download:\n",
    "        if os.path.exists(f):\n",
    "            print(f\"Downloading {f}...\")\n",
    "            if os.path.isfile(f):\n",
    "                files.download(f)\n",
    "            else:\n",
    "                !zip -r {f}.zip {f}\n",
    "                files.download(f\"{f}.zip\")\n",
    "            print(f\"βœ“ {f} ready\")\n",
    "        \n",
    "except ImportError:\n",
    "    print(\"[INFO] Not in Colab environment\")\n",
    "    print(\"Files saved locally:\")\n",
    "    !ls -lh reward_curve.png logtriage-trained/ phase2_checkpoints/ logs/ 2>/dev/null || echo \"Check current directory\""
   ]
  },
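  {
   "cell_type": "markdown",
   "id": "cell-18b",
   "metadata": {},
   "source": [
    "## Optional: Smoke-Test the Trained Adapter\n",
    "\n",
    "A minimal sketch for reloading the trained weights, assuming `train.py` saved a PEFT LoRA adapter to `./logtriage-trained/` (see the Summary below). If your run saved merged full weights instead, load that directory directly with `AutoModelForCausalLM.from_pretrained`. The prompt is an invented example, not data from the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ASSUMPTION: ./logtriage-trained/ holds a PEFT LoRA adapter for model_id\n",
    "# (the base model selected in Step 4). Adjust paths if your layout differs.\n",
    "import torch\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
    "from peft import PeftModel\n",
    "\n",
    "adapter_dir = \"./logtriage-trained\"\n",
    "\n",
    "# Load the 4-bit base model, then attach the LoRA adapter on top.\n",
    "bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)\n",
    "base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map=\"auto\")\n",
    "model = PeftModel.from_pretrained(base, adapter_dir)\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "\n",
    "# Invented incident prompt for a quick qualitative check.\n",
    "prompt = (\"Logs: api-gateway ERROR timeout; auth-service WARN pool exhausted; \"\n",
    "          \"user-db ERROR slow query. Which service is the root cause?\")\n",
    "messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
    "with torch.no_grad():\n",
    "    out = model.generate(inputs, max_new_tokens=128)\n",
    "print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))"
   ]
  },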
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### What You Just Did\n",
    "1. βœ“ Auto-selected best model for your GPU\n",
    "2. βœ“ Trained on 3 incident types (150 episodes total)\n",
    "3. βœ“ Generated reward curves\n",
    "4. βœ“ Logged training results to CSV (experimental tracking)\n",
    "5. βœ“ Created trained agent ready for deployment\n",
    "\n",
    "### Outputs Generated\n",
    "- `./logtriage-trained/` - Trained model weights\n",
    "- `./phase2_checkpoints/` - Episode checkpoints (JSON)\n",
    "- `./logs/` - CSV files with episode rewards\n",
    "- `reward_curve.png` - Training visualization\n",
    "\n",
    "### Resources\n",
    "- **Live Environment:** https://huggingface.co/spaces/OGrohit/logtriage-env\n",
    "- **GitHub Repository:** https://github.com/rohitdecodes/logtriage-env\n",
    "- **Blog Post:** See README for details"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}