Spaces:

OGrohit
/

logtriage-env

Running

App Files Files Community

OGrohit commited on 12 days ago

Commit

7a0f038

verified ·

1 Parent(s): f191fd4

For Judges To Train And Test Script

Browse files

Files changed (1) hide show

LogTriageEnv_Training.ipynb +352 -0

LogTriageEnv_Training.ipynb ADDED Viewed

	@@ -0,0 +1,352 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LogTriageEnv: Training LLM Agents to Triage Production Incidents\n",
+    "\n",
+    "**Meta × PyTorch × Scaler OpenEnv Grand Finale 2026**\n",
+    "\n",
+    "This notebook trains an LLM agent with GRPO to identify root causes in cascading production failures.\n",
+    "\n",
+    "## Quick Info\n",
+    "- **GPU:** T4+ required (15GB+ VRAM)\n",
+    "- **Time:** 10-15 minutes\n",
+    "- **Model:** Auto-selects 32B→7B→3B based on VRAM\n",
+    "- **Output:** Trained model + reward curves"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1: Check GPU"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-smi"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "print(\"[GPU CHECK]\")\n",
+    "if torch.cuda.is_available():\n",
+    "    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
+    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+    "    print(f\"VRAM: {vram_gb:.1f} GB\")\n",
+    "    VRAM_GB = vram_gb\n",
+    "else:\n",
+    "    print(\"No GPU found\")\n",
+    "    VRAM_GB = 0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2: Install Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "print(\"Installing dependencies in correct order...\")\nprint(\"Step 1: Upgrade pip\")\n!pip install -q -U pip\nprint(\"Step 2: Install Unsloth FIRST (critical for patching)\")\n!pip install -q unsloth\nprint(\"Step 3: Install PyTorch\")\n!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\nprint(\"Step 4: Install remaining packages\")\n!pip install -q bitsandbytes peft trl transformers datasets accelerate matplotlib requests huggingface_hub\nprint(\"✓ All dependencies installed successfully\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3: Optional - HuggingFace Login\n",
+    "\n",
+    "Skip this if you just want local training. Uncomment to push to Hub."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: Uncomment to login\n",
+    "# from huggingface_hub import login\n",
+    "# login()\n",
+    "\n",
+    "print(\"HF login: SKIPPED (model will save locally)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4: Clone Repository"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os\n\nif not os.path.exists('logtriage-env'):\n    !git clone https://github.com/rohitdecodes/logtriage-env.git\n    os.chdir('logtriage-env')\nelse:\n    os.chdir('logtriage-env')\n\nprint(f\"Working dir: {os.getcwd()}\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 5: The Problem\n",
+    "\n",
+    "### Scenario: Production Incident at 2 AM\n",
+    "\n",
+    "Six services firing alerts:\n",
+    "```\n",
+    "api-gateway      → ERROR: timeout (most visible)\n",
+    "auth-service     → WARN: connection pool exhausted\n",
+    "user-db          → ERROR: slow query\n",
+    "payment-db       → [no logs yet] (ROOT CAUSE - 3 hops upstream)\n",
+    "```\n",
+    "\n",
+    "**Question:** Which service to page first?\n",
+    "\n",
+    "**Naive Answer:** api-gateway ❌\n",
+    "\n",
+    "**Correct Answer:** payment-db ✅\n",
+    "\n",
+    "### Why It's Hard\n",
+    "- Root cause **never logs first**\n",
+    "- Symptoms cascade before causes appear\n",
+    "- Agent must reason **backward** through dependencies\n",
+    "- LLaMA 3.3 70B baseline: only 0.65 accuracy\n",
+    "\n",
+    "### How We Train\n",
+    "GRPO with dense reward shaping forces causal reasoning:\n",
+    "- +0.3 for correct root cause\n",
+    "- +0.3 for correct escalation\n",
+    "- +0.3 for correct fix\n",
+    "- **0 for wrong combinations**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 6: Intelligent Model Selection"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"[MODEL SELECTION]\")\n",
+    "\n",
+    "if VRAM_GB >= 24:\n",
+    "    model_id = \"Qwen/Qwen2.5-32B-Instruct\"\n",
+    "    model_size = \"32B (BEST)\"\n",
+    "    improvement = \"+0.12 to +0.15\"\n",
+    "    print(f\"✓ {VRAM_GB:.1f} GB VRAM\")\n",
+    "    print(f\"✓ Selected: {model_size}\")\nelif VRAM_GB >= 10:\n",
+    "    model_id = \"Qwen/Qwen2.5-7B-Instruct\"\n",
+    "    model_size = \"7B (GOOD)\"\n",
+    "    improvement = \"+0.04 to +0.06\"\n",
+    "    print(f\"✓ {VRAM_GB:.1f} GB VRAM\")\n",
+    "    print(f\"✓ Selected: {model_size}\")\nelse:\n",
+    "    model_id = \"Qwen/Qwen2.5-3B-Instruct\"\n",
+    "    model_size = \"3B (FALLBACK)\"\n",
+    "    improvement = \"+0.015\"\n",
+    "    print(f\"⚠ {VRAM_GB:.1f} GB VRAM (limited)\")\n",
+    "    print(f\"⚠ Selected: {model_size}\")\n",
+    "\nprint()\nprint(f\"Model: {model_id}\")\nprint(f\"Expected cascading_failure improvement: {improvement}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 7: Launch Training\n",
+    "\n",
+    "⏱️ This takes ~10-15 minutes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import subprocess\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"[START] LogTriageEnv Training\")\n",
+    "print(\"=\"*60)\n",
+    "print(f\"Model: {model_id}\")\n",
+    "print(f\"Episodes: 30 per task (90 total)\")\n",
+    "print(f\"Algorithm: GRPO + 4-bit Unsloth\")\n",
+    "print(\"=\"*60)\nprint()\n",
+    "\n",
+    "cmd = [\n",
+    "    \"python\", \"train.py\",\n",
+    "    \"--model\", model_id,\n",
+    "    \"--task\", \"all\",\n",
+    "    \"--episodes\", \"30\",\n",
+    "    \"--load_in_4bit\",\n",
+    "    \"--grpo_max_steps\", \"10\",\n",
+    "    \"--env_url\", \"https://ogrohit-logtriage-env.hf.space\"\n",
+    "]\n",
+    "\n",
+    "try:\n",
+    "    subprocess.run(cmd, check=True)\n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(\"✓ TRAINING COMPLETE\")\n",
+    "    print(\"=\"*60)\nexcept Exception as e:\n",
+    "    print(f\"Error: {e}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 8: Analyze Results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"RESULTS\")\n",
+    "print(\"=\"*60)\nprint()\n",
+    "\n",
+    "tasks = [\"single_crash\", \"cascading_failure\", \"silent_degradation\"]\n",
+    "\n",
+    "for task in tasks:\n",
+    "    checkpoint_file = f\"./phase2_checkpoints/{task}_ep25.json\"\n",
+    "    \n",
+    "    if os.path.exists(checkpoint_file):\n",
+    "        with open(checkpoint_file, 'r') as f:\n",
+    "            data = json.load(f)\n",
+    "        \n",
+    "        rewards = [ep.get('reward', 0) for ep in data.get('episodes', [])]\n",
+    "        \n",
+    "        if rewards:\n",
+    "            first_10 = sum(rewards[:10]) / 10\n",
+    "            last_10 = sum(rewards[-10:]) / 10\n",
+    "            improvement = last_10 - first_10\n",
+    "            \n",
+    "            symbol = \"✓\" if improvement > 0 else \"↓\"\n",
+    "            task_name = task.replace(\"_\", \" \").title()\n",
+    "            \n",
+    "            print(f\"{symbol} {task_name}\")\n",
+    "            print(f\"  First 10 avg: {first_10:+.3f}\")\n",
+    "            print(f\"  Last 10 avg:  {last_10:+.3f}\")\n",
+    "            print(f\"  Improvement:  {improvement:+.3f}\")\n",
+    "            print()\n",
+    "\nprint(\"=\"*60)\nprint(\"✓ Key metric: Cascading Failure improvement\")\nprint(\"  (Shows genuine multi-hop causal learning)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 9: Visualize Reward Curves"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "if os.path.exists(\"merge_curves.py\"):\n",
+    "    !python merge_curves.py\n",
+    "    print(\"✓ Curves generated\")\nelse:\n",
+    "    print(\"[INFO] merge_curves.py not found\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "from PIL import Image\n",
+    "import os\n",
+    "\n",
+    "if os.path.exists(\"reward_curve.png\"):\n",
+    "    img = Image.open(\"reward_curve.png\")\n",
+    "    plt.figure(figsize=(14, 8))\n",
+    "    plt.imshow(img)\n",
+    "    plt.axis('off')\n",
+    "    plt.title(\"Training Reward Curves\", fontsize=14, fontweight='bold')\n",
+    "    plt.tight_layout()\n",
+    "    plt.show()\nelse:\n",
+    "    print(\"reward_curve.png not found\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 10: Download Outputs (Colab)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "try:\n",
+    "    from google.colab import files\n",
+    "    \n",
+    "    if os.path.exists(\"reward_curve.png\"):\n",
+    "        print(\"Downloading reward_curve.png...\")\n",
+    "        files.download(\"reward_curve.png\")\n",
+    "        print(\"✓ Download started\")\nexcept ImportError:\n",
+    "    print(\"[INFO] Not in Colab. Files saved locally:\")\n",
+    "    !ls -lh reward_curve.png logtriage-trained/ 2>/dev/null || echo \"Check ./logtriage-trained/\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Summary\n\n### What You Just Did\n1. ✓ Auto-selected best model for your GPU\n2. ✓ Trained on 3 incident types (90 episodes total)\n3. ✓ Generated reward curves\n4. ✓ Produced trained agent ready for deployment\n\n### Outputs\n- `./logtriage-trained/` - Trained model\n- `reward_curve.png` - Learning curves\n- `./phase2_checkpoints/` - Episode data\n\n### Next Steps\n1. **Push to Hub:** `huggingface-cli login` then uncomment `--push_to_hub`\n2. **Use Locally:** Load from `./logtriage-trained/`\n3. **Deploy:** Integrate into on-call system\n\n### Resources\n- Environment: https://huggingface.co/spaces/OGrohit/logtriage-env\n- GitHub: https://github.com/rohitdecodes/logtriage-env\n- Blog: https://github.com/rohitdecodes/logtriage-env/blob/main/BLOG_POST.md"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}