{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# LogTriageEnv: Training LLM Agents to Triage Production Incidents\n",
    "\n",
    "**Meta × PyTorch × Scaler OpenEnv Grand Finale 2026**\n",
    "\n",
    "This notebook trains an LLM agent with GRPO to identify root causes in cascading production failures.\n",
    "\n",
    "## Quick Info\n",
    "- **GPU:** T4+ required (15GB+ VRAM)\n",
    "- **Time:** 10-15 minutes\n",
    "- **Model:** Auto-selects 32B→7B→3B based on VRAM\n",
    "- **Output:** Trained model + reward curves + CSV logs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "## Step 1: Check GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "print(\"[GPU CHECK]\")\n",
    "if torch.cuda.is_available():\n",
    "    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
    "    print(f\"VRAM: {vram_gb:.1f} GB\")\n",
    "    VRAM_GB = vram_gb\n",
    "else:\n",
    "    print(\"No GPU found\")\n",
    "    VRAM_GB = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## Step 2: Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Installing dependencies in correct order...\")\n",
    "print(\"Step 1: Upgrade pip\")\n",
    "!pip install -q -U pip\n",
    "print(\"Step 2: Install Unsloth FIRST (critical for patching)\")\n",
    "!pip install -q unsloth\n",
    "print(\"Step 3: Install PyTorch\")\n",
    "!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\n",
    "print(\"Step 4: Install remaining packages\")\n",
    "!pip install -q bitsandbytes peft trl transformers datasets accelerate matplotlib requests huggingface_hub mergekit llm_blender\n",
    "print(\"✓ All
dependencies installed successfully\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## Step 3: The Problem\n",
    "\n",
    "### Scenario: Production Incident at 2 AM\n",
    "\n",
    "Six services firing alerts:\n",
    "```\n",
    "api-gateway → ERROR: timeout (most visible)\n",
    "auth-service → WARN: connection pool exhausted\n",
    "user-db → ERROR: slow query\n",
    "payment-db → [no logs yet] (ROOT CAUSE - 3 hops upstream)\n",
    "```\n",
    "\n",
    "**Question:** Which service should you page first?\n",
    "\n",
    "**Naive Answer:** api-gateway ❌\n",
    "\n",
    "**Correct Answer:** payment-db ✅\n",
    "\n",
    "### Why It's Hard\n",
    "- The root cause **never logs first**\n",
    "- Symptoms cascade before causes appear\n",
    "- The agent must reason **backward** through dependencies\n",
    "- LLaMA 3.3 70B baseline: only 0.65 accuracy\n",
    "\n",
    "### How We Train\n",
    "GRPO with dense reward shaping forces causal reasoning:\n",
    "- +0.3 for correct root cause\n",
    "- +0.3 for correct escalation\n",
    "- +0.3 for correct fix\n",
    "- **0 for wrong combinations**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-7",
   "metadata": {},
   "source": [
    "## Step 4: Intelligent Model Selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"[MODEL SELECTION]\")\n",
    "\n",
    "if VRAM_GB >= 24:\n",
    "    model_id = \"Qwen/Qwen2.5-32B-Instruct\"\n",
    "    model_size = \"32B (BEST)\"\n",
    "    improvement = \"+0.12 to +0.15\"\n",
    "    print(f\"✓ {VRAM_GB:.1f} GB VRAM\")\n",
    "    print(f\"✓ Selected: {model_size}\")\n",
    "elif VRAM_GB >= 10:\n",
    "    model_id = \"Qwen/Qwen2.5-7B-Instruct\"\n",
    "    model_size = \"7B (GOOD)\"\n",
    "    improvement = \"+0.04 to +0.06\"\n",
    "    print(f\"✓ {VRAM_GB:.1f} GB VRAM\")\n",
    "    print(f\"✓ Selected: {model_size}\")\n",
    "else:\n",
    "    model_id = \"Qwen/Qwen2.5-3B-Instruct\"\n",
    "    model_size = \"3B (FALLBACK)\"\n",
    "    improvement = \"+0.015\"\n",
    "    print(f\"⚠ {VRAM_GB:.1f} GB VRAM (limited)\")\n",
    "    print(f\"⚠ Selected: {model_size}\")\n",
    "\n",
    "print()\n",
    "print(f\"Model: {model_id}\")\n",
    "print(f\"Expected cascading_failure improvement: {improvement}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## Step 5: Launch Training\n",
    "\n",
    "⏱️ This takes ~10-15 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import os\n",
    "import shutil\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"[STEP 5A] Clone Repository from GitHub\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Clone the repository\n",
    "repo_url = \"https://github.com/rohitdecodes/logtriage-env.git\"\n",
    "repo_dir = \"logtriage-env\"\n",
    "\n",
    "# Remove existing repo if it exists\n",
    "if os.path.exists(repo_dir):\n",
    "    print(f\"⚠ {repo_dir} already exists, removing...\")\n",
    "    shutil.rmtree(repo_dir)\n",
    "\n",
    "try:\n",
    "    print(f\"Cloning from {repo_url}...\")\n",
    "    result = subprocess.run(\n",
    "        [\"git\", \"clone\", repo_url, repo_dir],\n",
    "        capture_output=True,\n",
    "        text=True,\n",
    "        timeout=300\n",
    "    )\n",
    "\n",
    "    if result.returncode == 0:\n",
    "        print(\"✓ Repository cloned successfully\")\n",
    "        train_py_path = os.path.join(repo_dir, \"train.py\")\n",
    "    else:\n",
    "        print(f\"⚠ Clone failed: {result.stderr}\")\n",
    "        train_py_path = \"train.py\"\n",
    "except Exception as e:\n",
    "    print(f\"⚠ Clone error: {e}\")\n",
    "    train_py_path = \"train.py\"\n",
    "\n",
    "print()\n",
    "print(\"=\"*60)\n",
    "print(\"[STEP 5B] Launch Training\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Check if train.py exists (either from clone or current directory)\n",
    "if os.path.exists(train_py_path):\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"[START] LogTriageEnv Training\")\n",
    "    print(\"=\"*60)\n",
    "    print(f\"Model: {model_id}\")\n",
    "    print(\"Episodes: 50 per task (150 total)\")\n",
    "    print(\"Algorithm: GRPO + 4-bit Unsloth\")\n",
    "    print(\"=\"*60)\n",
    "    print()\n",
    "\n",
    "    cmd = [\n",
    "        \"python\", train_py_path,\n",
    "        \"--model\", model_id,\n",
    "        \"--task\", \"all\",\n",
    "        \"--episodes\", \"50\",\n",
    "        \"--load_in_4bit\",\n",
    "        \"--grpo_max_steps\", \"35\",\n",
    "        \"--env_url\", \"https://ogrohit-logtriage-env.hf.space\"\n",
    "    ]\n",
    "\n",
    "    try:\n",
    "        result = subprocess.run(cmd, capture_output=False, text=True, timeout=1800)\n",
    "        if result.returncode == 0:\n",
    "            print(\"\\n\" + \"=\"*60)\n",
    "            print(\"✓ TRAINING COMPLETE\")\n",
    "            print(\"=\"*60)\n",
    "        else:\n",
    "            print(f\"\\n⚠ Process returned code {result.returncode}\")\n",
    "    except subprocess.TimeoutExpired:\n",
    "        print(\"⚠ Training timed out after 30 minutes\")\n",
    "    except Exception as e:\n",
    "        print(f\"Error: {e}\")\n",
    "else:\n",
    "    print(f\"⚠ train.py not found at {train_py_path}\")\n",
    "    print(\"✗ TRAINING FAILED\")\n",
    "    print(\"Make sure the repository clone was successful or train.py exists in current directory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "## Step 6: Analyze Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"RESULTS\")\n",
    "print(\"=\"*60)\n",
    "print()\n",
    "\n",
    "tasks = [\"single_crash\", \"cascading_failure\", \"silent_degradation\"]\n",
    "\n",
    "for task in tasks:\n",
    "    checkpoint_file = f\"./phase2_checkpoints/{task}_ep50.json\"\n",
    "\n",
    "    if os.path.exists(checkpoint_file):\n",
    "        with open(checkpoint_file, 'r') as f:\n",
    "            data = json.load(f)\n",
    "\n",
    "        rewards = data.get('rewards', [])\n",
    "\n",
    "        if rewards:\n",
    "            # Compare the mean reward of the first and last 10 episodes\n",
    "            first_10 = sum(rewards[:10]) / min(10, len(rewards))\n",
    "            last_10 = sum(rewards[-10:]) / min(10, len(rewards))\n",
    "            improvement = last_10 - first_10\n",
    "\n",
    "            symbol = \"✓\" if improvement > 0 else \"↓\"\n",
    "            task_name = task.replace(\"_\", \" \").title()\n",
    "\n",
    "            print(f\"{symbol} {task_name}\")\n",
    "            print(f\" First 10 avg: {first_10:+.3f}\")\n",
    "            print(f\" Last 10 avg: {last_10:+.3f}\")\n",
    "            print(f\" Improvement: {improvement:+.3f}\")\n",
    "            print()\n",
    "    else:\n",
    "        print(f\"⚠ {task}: checkpoint not found\")\n",
    "        print()\n",
    "\n",
    "print(\"=\"*60)\n",
    "print(\"✓ Key metric: Cascading Failure improvement\")\n",
    "print(\" (Shows genuine multi-hop causal learning)\")\n",
    "print(\"=\"*60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## Step 7: Visualize Reward Curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import matplotlib.pyplot as plt\n",
    "from PIL import Image\n",
    "\n",
    "if os.path.exists(\"reward_curve.png\"):\n",
    "    img = Image.open(\"reward_curve.png\")\n",
    "    plt.figure(figsize=(14, 8))\n",
    "    plt.imshow(img)\n",
    "    plt.axis('off')\n",
    "    plt.title(\"Training Reward Curves\", fontsize=14, fontweight='bold')\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    print(\"✓ Reward curves displayed\")\n",
    "else:\n",
    "    print(\"⚠ reward_curve.png not found\")\n",
    "    print(\"Generated after first training run\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## Step 8: Verify CSV Logs (Experimental Tracking)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "print(\"[CSV TRACKING VERIFICATION]\")\n",
    "print()\n",
    "\n",
    "csv_dir = \"./logs\"\n",
    "if os.path.exists(csv_dir):\n",
    "    # Keep only CSV files so read_csv is never handed an unrelated file\n",
    "    files = sorted(f for f in os.listdir(csv_dir) if f.endswith(\".csv\"))\n",
    "    print(f\"✓ Log directory exists: {csv_dir}\")\n",
    "    print(f\" CSV files: {files}\")\n",
    "    print()\n",
    "\n",
    "    # Show sample of first CSV\n",
    "    if files:\n",
    "        csv_file = os.path.join(csv_dir, files[0])\n",
    "        df = pd.read_csv(csv_file)\n",
    "        print(f\"[{files[0]}]\")\n",
    "        print(df.head(10).to_string())\n",
    "        print(f\"\\n✓ {len(df)} episodes tracked\")\n",
    "else:\n",
    "    print(f\"⚠ Log directory not found: {csv_dir}\")\n",
    "    print(\"CSV logs are generated during training\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## Step 9: Download Outputs (Colab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "try:\n",
    "    from google.colab import files\n",
    "\n",
    "    # Download key outputs\n",
    "    files_to_download = [\n",
    "        \"reward_curve.png\",\n",
    "        \"logs\",\n",
    "        \"phase2_checkpoints\"\n",
    "    ]\n",
    "\n",
    "    for f in files_to_download:\n",
    "        if os.path.exists(f):\n",
    "            print(f\"Downloading {f}...\")\n",
    "            if os.path.isfile(f):\n",
    "                files.download(f)\n",
    "            else:\n",
    "                # Directories are zipped before download\n",
    "                !zip -r {f}.zip {f}\n",
    "                files.download(f\"{f}.zip\")\n",
    "            print(f\"✓ {f} ready\")\n",
    "\n",
    "except ImportError:\n",
    "    print(\"[INFO] Not in Colab environment\")\n",
    "    print(\"Files saved locally:\")\n",
    "    !ls -lh reward_curve.png logtriage-trained/ phase2_checkpoints/ logs/ 2>/dev/null || echo \"Check current directory\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### What You Just Did\n",
    "1. ✓ Auto-selected best model for your GPU\n",
    "2. ✓ Trained on 3 incident types (150 episodes total)\n",
    "3. ✓ Generated reward curves\n",
    "4. ✓ Logged training results to CSV (experimental tracking)\n",
    "5. ✓ Created trained agent ready for deployment\n",
    "\n",
    "### Outputs Generated\n",
    "- `./logtriage-trained/` - Trained model weights\n",
    "- `./phase2_checkpoints/` - Episode checkpoints (JSON)\n",
    "- `./logs/` - CSV files with episode rewards\n",
    "- `reward_curve.png` - Training visualization\n",
    "\n",
    "### Resources\n",
    "- **Live Environment:** https://huggingface.co/spaces/OGrohit/logtriage-env\n",
    "- **GitHub Repository:** https://github.com/rohitdecodes/logtriage-env\n",
    "- **Blog Post:** See README for details"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}