Spaces: Amogh-kal1/whipstudio

Amogh-kal1 committed on Apr 5

Commit 0c28a91 (verified) · 1 Parent(s): 3f12d92

Upload folder using huggingface_hub

Files changed (15)

GRADER_ANALYSIS.md +56 -0
HACKATHON_GUIDE.md +214 -0
baseline_agent.py +31 -8
evaluate_mnist.py +816 -0
gradio_app.py +34 -10
improved_agent.py +717 -0
inference.py +368 -0
run_inference.sh +63 -0
server/app.py +38 -8
server/environment.py +3 -2
server/tasks/graders.py +342 -114
server/tasks/task3_oom_leakage.py +5 -1
server/tasks/task6_io_mismatch.py +124 -0
test_api.sh +50 -0
vaidate-submission.sh +185 -0

GRADER_ANALYSIS.md ADDED

	@@ -0,0 +1,56 @@

+# Task & Grader Analysis Report
+## 🔴 CRITICAL FINDING: Why Scores Were Identical
+The score `0.7390` appearing for all tasks suggested the LLM was generating code that:
+1. **Ran successfully** (exit_code = 0)
+2. **Output valid metrics** (LOSSES, VAL_ACC, etc.)
+3. **BUT didn't necessarily fix the actual bugs**
+The old graders gave "partial credit" for any running code, leading to similar scores.
+## ✅ FIXES APPLIED
+### 1. Stricter Sigmoid Centers
+- Old centers were too forgiving (e.g., val_acc center at 0.85)
+- New centers require better performance (e.g., val_acc center at 0.90-0.92)
+- Increased steepness for sharper differentiation (15→25-30)
+### 2. Early Rejection for Unfixed Bugs
+- Added explicit checks for "likely unfixed" states
+- Task 3: Reject if val_acc < 0.65 (buggy code gives ~0.50)
+- Task 4: Reject if f1 < 0.40 (buggy code gives ~0.25)
+- Task 5: Reject if buggy state unchanged
+### 3. Task 3 Mismatch Fixed
+- **Old**: Description said "OOM and data leakage"
+- **Actual bug**: Label inversion (`criterion(out, 1 - yb)`)
+- **Fixed**: Updated grader to match actual bug
+### 4. Reduced Base Scores
+- Old task 2 gave 0.40 "free" for avoiding NaN
+- New gives 0.35 base, with stricter accuracy requirements
+## Updated Grader Summary
+| Task | Bug | Key Metric | Threshold | Weight |
+|------|-----|------------|-----------|--------|
+| task1 | LR + step/backward | VAL_ACC | >0.90 | 60% |
+| task2 | NaN loss | No NaN + VAL_ACC | >0.80 | 40% |
+| task3 | Label inversion | VAL_ACC | >0.92 | 60% |
+| task4 | Wrong loss | F1_SCORE | >0.70 | 55% |
+| task5 | Frozen backbone | Fix detection | Binary | 70% |
+## Expected Score Distribution After Fix
+**Well-fixed code** (correct fix): 0.85-1.00
+**Partially fixed** (runs but suboptimal): 0.40-0.70
+**Unfixed** (bug still present): 0.10-0.25
+**Broken** (crashes): 0.00-0.10
+This creates better separation between models of different capability.
+## Files Modified
+1. `server/tasks/graders.py` - All 5 graders updated
+2. `server/tasks/task3_oom_leakage.py` - Description clarified
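
Reviewer note: to make the centre and steepness changes in GRADER_ANALYSIS.md concrete, here is a minimal sketch of a sigmoid-shaped grader with an early-rejection check. The function names, the rejection score of 0.10, and the particular centre/steepness pairs are illustrative assumptions drawn from the ranges quoted in the report; this is not the code in `server/tasks/graders.py`.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float) -> float:
    """Map a metric to (0, 1); the score is exactly 0.5 at `center`."""
    return 1.0 / (1.0 + math.exp(-steepness * (value - center)))


def grade_val_acc(
    val_acc: float,
    center: float = 0.90,         # assumed "new" centre from the report's 0.90-0.92 range
    steepness: float = 25.0,      # assumed "new" steepness from the report's 25-30 range
    reject_below: float = 0.65,   # Task 3 rejection threshold quoted in the report
    rejected_score: float = 0.10, # hypothetical score for a likely-unfixed run
) -> float:
    """Hypothetical grader: reject likely-unfixed runs early, otherwise apply the sigmoid."""
    if val_acc < reject_below:
        return rejected_score
    return sigmoid_score(val_acc, center, steepness)


if __name__ == "__main__":
    for acc in (0.50, 0.85, 0.90, 0.95):
        old = sigmoid_score(acc, center=0.85, steepness=15.0)
        new = grade_val_acc(acc)
        print(f"val_acc={acc:.2f} old={old:.3f} new={new:.3f}")
```

Moving the centre up and raising the steepness both lower the score of a run that sits at the old centre, which is the separation effect the report describes.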

HACKATHON_GUIDE.md ADDED

	@@ -0,0 +1,214 @@

+# WhipStudio - OpenEnv Hackathon Submission Guide
+Complete guide for running inference, training, and evaluation for the Scaler Meta PyTorch Hackathon.
+## 🚀 Quick Start
+### 1. Environment Setup
+```bash
+# Set your HuggingFace token
+export HF_TOKEN="your_token_here"
+# For HuggingFace models (recommended)
+export API_BASE_URL="https://api-inference.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
+# Or use the convenience script
+./run_inference.sh https://amogh-kal1-whipstudio.hf.space
+```
+### 2. Run Hackathon Inference
+The `inference.py` script meets all hackathon requirements:
+- ✅ Uses OpenAI-compatible client
+- ✅ Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment
+- ✅ Emits [START], [STEP], [END] logs
+- ✅ Runs all 5 tasks with max 3 attempts each
+```bash
+python inference.py --env-url https://amogh-kal1-whipstudio.hf.space
+```
+## 📊 Training with GRPO
+Train a model using Group Relative Policy Optimization:
+### Basic Training
+```bash
+python improved_agent.py \
+    --env_url https://amogh-kal1-whipstudio.hf.space \
+    --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+    --output_dir ./trained-model \
+    --num_iterations 50
+```
+### Memory-Efficient Training (8GB VRAM)
+```bash
+python improved_agent.py \
+    --env_url https://amogh-kal1-whipstudio.hf.space \
+    --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+    --use_lora \
+    --use_4bit \
+    --gradient_checkpointing \
+    --output_dir ./trained-model-lora
+```
+### Training Features
+- **Curriculum Learning**: Starts with easier tasks, progresses to harder ones
+- **LoRA Support**: Efficient fine-tuning with adapters
+- **4-bit Quantization**: Train on GPUs with limited VRAM
+- **Checkpoint Saving**: Best model saved automatically
+- **Early Stopping**: Stops when no improvement
+- **Wandb Logging**: Optional tracking with `--use_wandb`
+## 🎯 Evaluation on MNIST
+Compare base vs trained models on an out-of-distribution MNIST debugging task:
+### Compare Two Models
+```bash
+python evaluate_mnist.py \
+    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+    --trained_model ./trained-model/best \
+    --num_runs 3
+```
+### Use Real MNIST Dataset
+```bash
+python evaluate_mnist.py \
+    --use_real_mnist \
+    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+    --trained_model ./trained-model/best
+```
+### Compare Multiple Models
+```bash
+python evaluate_mnist.py \
+    --use_real_mnist \
+    --models Qwen/Qwen2.5-Coder-1.5B-Instruct \
+             Qwen/Qwen2.5-Coder-7B-Instruct \
+             ./trained-model-v1/best \
+             ./trained-model-v2/best
+```
+## 🔧 Configuration
+### HuggingFace API (Recommended)
+```bash
+export API_BASE_URL="https://api-inference.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-Coder-32B-Instruct"
+export HF_TOKEN="hf_your_token"
+```
+### OpenAI API
+```bash
+export API_BASE_URL="https://api.openai.com/v1"
+export MODEL_NAME="gpt-4o-mini"
+export OPENAI_API_KEY="sk-your-key"
+```
+### Local Model Inference
+```bash
+# Use vLLM or similar OpenAI-compatible server
+export API_BASE_URL="http://localhost:8000/v1"
+export MODEL_NAME="your-local-model"
+export HF_TOKEN="dummy"  # Still required by script
+```
+## 📝 Hackathon Requirements Checklist
+- ✅ **HF Space deploys**: https://amogh-kal1-whipstudio.hf.space
+- ✅ **OpenEnv spec compliance**: openenv.yaml, typed models, endpoints
+- ✅ **Dockerfile builds**: server/Dockerfile
+- ✅ **inference.py exists**: Root directory
+- ✅ **Uses OpenAI Client**: With API_BASE_URL, MODEL_NAME, HF_TOKEN
+- ✅ **Structured logs**: [START], [STEP], [END] format
+- ✅ **3+ tasks with graders**: 5 tasks (task1-task5)
+## 🐛 Troubleshooting
+### 500 Error from HF Space
+```
+[ERROR] Server error '500 Internal Server Error'
+```
+**Solution**:
+1. Visit your HF Space in a browser first: https://amogh-kal1-whipstudio.hf.space
+2. Wait for it to fully start (cold start can take 1-2 minutes)
+3. Check the Space logs for errors
+4. Try the /health endpoint: `curl https://amogh-kal1-whipstudio.hf.space/health`
+### Missing Dependencies
+```bash
+pip install openai httpx transformers torch trl peft bitsandbytes accelerate datasets
+```
+### Out of Memory During Training
+Use memory-efficient options:
+```bash
+python improved_agent.py \
+    --use_4bit \
+    --use_lora \
+    --gradient_checkpointing \
+    --lora_r 8  # Lower rank for less memory
+```
+### HuggingFace API Rate Limits
+If you hit rate limits with HuggingFace's free tier:
+1. Use a smaller model (e.g., 1.5B instead of 32B)
+2. Reduce `--num_iterations` for training
+3. Reduce `--num_runs` for evaluation
+## 📚 File Descriptions
+| File | Purpose |
+|------|---------|
+| `inference.py` | **Hackathon submission script** - runs all tasks with structured logging |
+| `improved_agent.py` | Train model with GRPO (curriculum learning, LoRA, 4-bit) |
+| `evaluate_mnist.py` | Compare models on out-of-distribution MNIST debugging |
+| `run_inference.sh` | Convenience script for quick inference runs |
+| `baseline_agent.py` | Original baseline (not hackathon-compliant) |
+## 🎓 Example Workflow
+```bash
+# 1. Run baseline inference
+export HF_TOKEN="your_token"
+export API_BASE_URL="https://api-inference.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
+python inference.py --env-url https://amogh-kal1-whipstudio.hf.space
+# 2. Train model with GRPO
+python improved_agent.py \
+    --env_url https://amogh-kal1-whipstudio.hf.space \
+    --use_lora --use_4bit \
+    --num_iterations 30 \
+    --output_dir ./my-trained-model
+# 3. Evaluate on MNIST
+python evaluate_mnist.py \
+    --use_real_mnist \
+    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+    --trained_model ./my-trained-model/best \
+    --num_runs 5
+# 4. Validate submission
+./vaidate-submission.sh https://amogh-kal1-whipstudio.hf.space
+```
+## 🏆 Tips for Best Results
+1. **Start with small experiments**: Use `--num_iterations 10` first
+2. **Monitor training**: Use `--use_wandb` to track progress
+3. **Curriculum helps**: Keep `--curriculum_stages 3` for better learning
+4. **Real MNIST is harder**: Expect lower scores but more realistic evaluation
+5. **Multiple runs**: Use `--num_runs 5` to average out run-to-run variance from sampling
+## 📧 Support
+If you encounter issues:
+1. Check the troubleshooting section above
+2. Verify your HF Space is running: visit the URL in browser
+3. Check environment variables: `echo $API_BASE_URL $MODEL_NAME $HF_TOKEN`
+4. Review the logs for detailed error messages
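
Reviewer note: the environment-variable check in the support list can also be done programmatically before any request is sent. Below is a minimal sketch, assuming only the three variables named in the guide; the `InferenceConfig` dataclass and the `load_config` helper are hypothetical and are not taken from `inference.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the three required variables and fail fast, naming any that are missing."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
    )
```

Accepting any mapping rather than reading `os.environ` directly keeps the helper easy to exercise in tests without touching the real environment.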

baseline_agent.py CHANGED

@@ -18,7 +18,17 @@ Rules:
 """.strip()
-def get_model():
     from smolagents import InferenceClientModel
     hf_token = os.environ.get("HF_TOKEN")
@@ -27,8 +37,13 @@ def get_model():
             "HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
         )
     return InferenceClientModel(
-        model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
         token=hf_token,
     )
@@ -81,15 +96,23 @@ def _generate_fixed_code(model, prompt: str) -> str:
     raise AttributeError("Model does not support callable() or generate() inference APIs")
-async def run_single_task(task_id: str, env_url: str = "http://localhost:7860") -> float:
     """Backwards-compatible wrapper that returns just the score."""
-    result = await run_single_task_detailed(task_id, env_url)
     return result["score"]
-async def run_single_task_detailed(task_id: str, env_url: str = "http://localhost:7860") -> dict:
     """Run the baseline agent on a single task. Returns detailed results."""
-    model = get_model()
     timeout = httpx.Timeout(900.0, connect=10.0)
     attempts_log = []
@@ -173,13 +196,13 @@ if __name__ == "__main__":
     async def main():
         scores = {}
-        for tid in ["task1", "task2", "task3"]:
             try:
                 s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
             except TimeoutError:
                 s = 0.0
             scores[tid] = round(s, 4)
             print(f"{tid}: {s:.4f}")
-        print(f"Average: {sum(scores.values()) / 3:.4f}")
     asyncio.run(main())

 """.strip()
+SUPPORTED_MODEL_IDS = [
+    "Qwen/Qwen2.5-Coder-1.5B-Instruct",
+    "Qwen/Qwen2.5-Coder-3B-Instruct",
+    "Qwen/Qwen2.5-Coder-7B-Instruct",
+    "Qwen/Qwen2.5-Coder-14B-Instruct",
+    "Qwen/Qwen2.5-Coder-32B-Instruct",
+    "mistralai/Mistral-7B-Instruct-v0.3",
+]
+def get_model(model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct"):
     from smolagents import InferenceClientModel
     hf_token = os.environ.get("HF_TOKEN")
             "HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
         )
+    if model_id not in SUPPORTED_MODEL_IDS:
+        raise ValueError(
+            f"Unsupported model_id '{model_id}'. Supported options: {SUPPORTED_MODEL_IDS}"
+        )
     return InferenceClientModel(
+        model_id=model_id,
         token=hf_token,
     )
     raise AttributeError("Model does not support callable() or generate() inference APIs")
+async def run_single_task(
+    task_id: str,
+    env_url: str = "http://localhost:7860",
+    model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
+) -> float:
     """Backwards-compatible wrapper that returns just the score."""
+    result = await run_single_task_detailed(task_id, env_url, model_id)
     return result["score"]
+async def run_single_task_detailed(
+    task_id: str,
+    env_url: str = "http://localhost:7860",
+    model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
+) -> dict:
     """Run the baseline agent on a single task. Returns detailed results."""
+    model = get_model(model_id)
     timeout = httpx.Timeout(900.0, connect=10.0)
     attempts_log = []
     async def main():
         scores = {}
+        for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
             try:
                 s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
             except TimeoutError:
                 s = 0.0
             scores[tid] = round(s, 4)
             print(f"{tid}: {s:.4f}")
+        print(f"Average: {sum(scores.values()) / 6:.4f}")
     asyncio.run(main())

evaluate_mnist.py ADDED

	@@ -0,0 +1,816 @@
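
Reviewer note: the deliberately buggy pipeline in this file feeds a softmax output into `nn.NLLLoss`. As a pure-Python illustration (no PyTorch needed; the helper names are hypothetical), negative log-likelihood for one sample just negates the entry at the target index, so giving it probabilities instead of log-probabilities produces a value confined to [-1, 0] rather than a cross-entropy.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(values: List[float], target: int) -> float:
    """Per-sample negative log-likelihood: minus the entry at the target index."""
    return -values[target]


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    target = 0
    probs = softmax(logits)
    log_probs = [math.log(p) for p in probs]
    # Wrong pairing: probabilities in, so the "loss" is negative and bounded.
    print("NLL on probabilities:    ", nll(probs, target))
    # Correct pairing: log-probabilities in, giving the usual cross-entropy.
    print("NLL on log-probabilities:", nll(log_probs, target))
```

Two conventional repairs are to return raw logits and use `nn.CrossEntropyLoss`, or to keep `nn.NLLLoss` and apply `log_softmax` in the forward pass.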

+#!/usr/bin/env python3
+"""
+Evaluate the base vs GRPO-trained Qwen2.5-Coder-1.5B-Instruct on an
+MNIST-style handwritten digit recognition debugging task (synthetic data
+by default, real MNIST with --use_real_mnist).
+This script tests whether the RL-trained model outperforms the base model
+on an out-of-distribution ML debugging task.
+The MNIST debugging task is intentionally NOT in the WhipStudio training set,
+making it a true test of generalization.
+Workflow:
+1. Define a deliberately buggy MNIST training pipeline
+2. Load both base model and GRPO-fine-tuned model
+3. Ask each to fix the buggy code
+4. Execute both fixes and compare results
+5. Generate a comparison report
+Requirements:
+    pip install transformers torch peft bitsandbytes
+Usage:
+    # Basic comparison
+    python evaluate_mnist.py \
+        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+        --trained_model ./whipstudio-debugger/best
+    # Multiple runs for statistical significance
+    python evaluate_mnist.py \
+        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+        --trained_model ./whipstudio-debugger/best \
+        --num_runs 5
+    # Use 4-bit quantization for memory efficiency
+    python evaluate_mnist.py \
+        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
+        --trained_model ./whipstudio-debugger/best \
+        --use_4bit
+"""
+import argparse
+import json
+import math
+import os
+import re
+import subprocess
+import sys
+import tempfile
+import time
+from pathlib import Path
+from typing import Optional
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+# Optional PEFT import for LoRA models
+try:
+    from peft import PeftModel
+    PEFT_AVAILABLE = True
+except ImportError:
+    PEFT_AVAILABLE = False
+# ══════════════════════════════════════════════════════════════════════════════
+# System Prompt (same as training)
+# ══════════════════════════════════════════════════════════════════════════════
+SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
+You receive a broken training script and must fix ALL bugs.
+Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
+Keep all torch.manual_seed() calls intact."""
+# ══════════════════════════════════════════════════════════════════════════════
+# Buggy MNIST Pipeline (Out-of-Distribution Test)
+# ══════════════════════════════════════════════════════════════════════════════
+# Two versions of the buggy code: synthetic (fast) and real MNIST (realistic)
+MNIST_BUGGY_CODE_SYNTHETIC = '''
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, TensorDataset
+torch.manual_seed(42)
+# Simulate MNIST-like data (28x28 images, 10 classes)
+X_train = torch.randn(1000, 1, 28, 28)
+y_train = torch.randint(0, 10, (1000,))
+X_val = torch.randn(200, 1, 28, 28)
+y_val = torch.randint(0, 10, (200,))
+# Make data learnable: label = argmax of mean pixel value in 10 regions
+for i in range(len(X_train)):
+    region_means = X_train[i, 0].reshape(10, -1).mean(dim=1)
+    y_train[i] = region_means.argmax()
+for i in range(len(X_val)):
+    region_means = X_val[i, 0].reshape(10, -1).mean(dim=1)
+    y_val[i] = region_means.argmax()
+train_ds = TensorDataset(X_train, y_train)
+train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
+class SimpleCNN(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
+        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
+        self.fc1 = nn.Linear(32 * 7 * 7, 128)
+        self.fc2 = nn.Linear(128, 10)
+    def forward(self, x):
+        x = F.relu(self.conv1(x))
+        x = F.max_pool2d(x, 2)
+        x = F.relu(self.conv2(x))
+        x = F.max_pool2d(x, 2)
+        x = x.view(x.size(0), -1)
+        x = F.relu(self.fc1(x))
+        # BUG 1: Applying softmax before CrossEntropyLoss (double softmax)
+        x = F.softmax(self.fc2(x), dim=1)
+        return x
+model = SimpleCNN()
+# BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
+criterion = nn.NLLLoss()
+# BUG 3: Learning rate too high for CNN
+optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
+losses = []
+for epoch in range(20):
+    for xb, yb in train_loader:
+        optimizer.zero_grad()
+        out = model(xb)
+        loss = criterion(out, yb)
+        loss.backward()
+        optimizer.step()
+        losses.append(loss.item())
+# Validation
+model.eval()
+with torch.no_grad():
+    val_out = model(X_val)
+    val_preds = val_out.argmax(dim=1)
+    val_acc = (val_preds == y_val).float().mean().item()
+print('##METRICS_START##')
+print('LOSSES:' + str(losses))
+print('VAL_ACC:' + str(round(val_acc, 4)))
+print('##METRICS_END##')
+'''
+MNIST_BUGGY_CODE_REAL = '''
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, Subset
+from torchvision import datasets, transforms
+torch.manual_seed(42)
+# Load REAL MNIST dataset
+transform = transforms.Compose([
+    transforms.ToTensor(),
+    transforms.Normalize((0.1307,), (0.3081,))
+])
+train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
+test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
+# Use subset for faster training (5000 train, 1000 val)
+train_indices = torch.randperm(len(train_dataset))[:5000]
+val_indices = torch.randperm(len(test_dataset))[:1000]
+train_subset = Subset(train_dataset, train_indices)
+val_subset = Subset(test_dataset, val_indices)
+train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
+val_loader = DataLoader(val_subset, batch_size=256, shuffle=False)
+class SimpleCNN(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
+        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
+        self.fc1 = nn.Linear(32 * 7 * 7, 128)
+        self.fc2 = nn.Linear(128, 10)
+    def forward(self, x):
+        x = F.relu(self.conv1(x))
+        x = F.max_pool2d(x, 2)
+        x = F.relu(self.conv2(x))
+        x = F.max_pool2d(x, 2)
+        x = x.view(x.size(0), -1)
+        x = F.relu(self.fc1(x))
+        # BUG 1: Applying softmax before CrossEntropyLoss (double softmax)
+        x = F.softmax(self.fc2(x), dim=1)
+        return x
+model = SimpleCNN()
+# BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
+criterion = nn.NLLLoss()
+# BUG 3: Learning rate too high for CNN
+optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
+losses = []
+for epoch in range(10):  # 10 epochs on real MNIST
+    for xb, yb in train_loader:
+        optimizer.zero_grad()
+        out = model(xb)
+        loss = criterion(out, yb)
+        loss.backward()
+        optimizer.step()
+        losses.append(loss.item())
+# Validation on real MNIST test set
+model.eval()
+correct = 0
+total = 0
+with torch.no_grad():
+    for xb, yb in val_loader:
+        out = model(xb)
+        preds = out.argmax(dim=1)
+        correct += (preds == yb).sum().item()
+        total += yb.size(0)
+val_acc = correct / total
+print('##METRICS_START##')
+print('LOSSES:' + str(losses[-100:]))  # Last 100 losses to avoid huge output
+print('VAL_ACC:' + str(round(val_acc, 4)))
+print('##METRICS_END##')
+'''
+# Default to synthetic for backward compatibility
+MNIST_BUGGY_CODE = MNIST_BUGGY_CODE_SYNTHETIC
+MNIST_TASK_DESCRIPTION_SYNTHETIC = """
+This is a CNN-based handwritten digit classifier (MNIST-like, 10 classes).
+The model has several bugs preventing it from training properly.
+Bugs to identify and fix:
+1. The forward pass has a problem with activation functions
+2. The loss function doesn't match the model output
+3. The optimizer has problematic hyperparameters
+Fix ALL bugs so that after 20 epochs:
+- Loss converges below 1.5
+- Validation accuracy exceeds 0.50
+Print losses as: LOSSES:[val1, val2, ...]
+Print validation accuracy as: VAL_ACC:X.XX
+Wrap metrics in ##METRICS_START## and ##METRICS_END##.
+"""
+MNIST_TASK_DESCRIPTION_REAL = """
+This is a CNN-based MNIST handwritten digit classifier using the REAL MNIST dataset.
+The model has several bugs preventing it from training properly.
+Bugs to identify and fix:
+1. The forward pass has a problem with activation functions
+2. The loss function doesn't match the model output
+3. The optimizer has problematic hyperparameters
+Fix ALL bugs so that after 10 epochs on real MNIST:
+- Loss converges and decreases over time
+- Validation accuracy exceeds 0.85 (should be achievable on real MNIST)
+Print the last 100 losses as: LOSSES:[val1, val2, ...]
+Print validation accuracy as: VAL_ACC:X.XX
+Wrap metrics in ##METRICS_START## and ##METRICS_END##.
+"""
+MNIST_TASK_DESCRIPTION = MNIST_TASK_DESCRIPTION_SYNTHETIC
+# ══════════════════════════════════════════════════════════════════════════════
+# Helpers
+# ══════════════════════════════════════════════════════════════════════════════
+def load_model(
+    model_path: str,
+    use_4bit: bool = False,
+    is_peft: bool = False,
+    base_model_for_peft: Optional[str] = None,
+) -> tuple:
+    """Load model and tokenizer with optional quantization and PEFT."""
+    print(f"  Loading model from {model_path}...")
+    # Quantization config
+    quantization_config = None
+    if use_4bit:
+        quantization_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+        )
+    # Model kwargs
+    model_kwargs = {
+        "trust_remote_code": True,
+        "device_map": "auto",
+    }
+    if quantization_config:
+        model_kwargs["quantization_config"] = quantization_config
+    else:
+        model_kwargs["torch_dtype"] = torch.bfloat16
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    # Check if this is a PEFT/LoRA model
+    adapter_config_path = Path(model_path) / "adapter_config.json"
+    if adapter_config_path.exists() or is_peft:
+        if not PEFT_AVAILABLE:
+            raise ImportError("PEFT model detected but peft is not installed")
+        # For PEFT models, we need to load base model first
+        if base_model_for_peft is None:
+            # Try to read from adapter config
+            if adapter_config_path.exists():
+                with open(adapter_config_path) as f:
+                    adapter_config = json.load(f)
+                    base_model_for_peft = adapter_config.get("base_model_name_or_path")
+        if base_model_for_peft is None:
+            raise ValueError("PEFT model requires --base_model_for_peft or adapter_config.json with base_model_name_or_path")
+        print(f"  Loading base model: {base_model_for_peft}")
+        base_model = AutoModelForCausalLM.from_pretrained(base_model_for_peft, **model_kwargs)
+        print(f"  Loading PEFT adapters from: {model_path}")
+        model = PeftModel.from_pretrained(base_model, model_path)
+    else:
+        # Regular model
+        model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
+    return model, tokenizer
+def generate_fix(model, tokenizer, task_description: str, buggy_code: str) -> str:
+    """Generate a fix using the given model."""
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"},
+    ]
+    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
+    inputs = {k: v.to(model.device) for k, v in inputs.items()}
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=2048,
+            temperature=0.2,
+            top_p=0.95,
+            do_sample=True,
+            pad_token_id=tokenizer.pad_token_id,
+        )
+    # Decode only the generated tokens
+    generated = outputs[0][inputs["input_ids"].shape[1]:]
+    response = tokenizer.decode(generated, skip_special_tokens=True)
+    # Strip markdown fences if present
+    if "```python" in response:
+        response = response.split("```python", 1)[1].split("```", 1)[0].strip()
+    elif "```" in response:
+        response = response.split("```", 1)[1].split("```", 1)[0].strip()
+    return response.strip()
+def execute_code(code: str, timeout: int = 120) -> dict:
+    """Execute code in a subprocess and return results."""
+    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
+        f.write(code)
+        tmp_path = f.name
+    start = time.time()
+    try:
+        proc = subprocess.run(
+            [sys.executable, tmp_path],
+            capture_output=True,
+            text=True,
+            timeout=timeout,
+        )
+        elapsed = time.time() - start
+        return {
+            "exit_code": proc.returncode,
+            "stdout": proc.stdout[:8192],
+            "stderr": proc.stderr[:2048],
+            "elapsed": round(elapsed, 2),
+            "timed_out": False,
+        }
+    except subprocess.TimeoutExpired:
+        return {
+            "exit_code": -1,
+            "stdout": "",
+            "stderr": f"Timed out after {timeout}s",
+            "elapsed": timeout,
+            "timed_out": True,
+        }
+    finally:
+        os.unlink(tmp_path)
+def extract_metrics(stdout: str) -> dict:
+    """Parse metrics from stdout."""
+    metrics: dict = {}
+    # Extract metrics block if present
+    block_match = re.search(r"##METRICS_START##(.*?)##METRICS_END##", stdout, re.DOTALL)
+    text = block_match.group(1) if block_match else stdout
+    # Parse losses
+    match = re.search(r"LOSSES:\[([^\]]+)\]", text)
+    if match:
+        try:
+            losses = [float(x.strip()) for x in match.group(1).split(",")]
+            metrics["losses"] = losses
+            metrics["final_loss"] = losses[-1] if losses else None
+            metrics["initial_loss"] = losses[0] if losses else None
+            metrics["nan_count"] = sum(1 for l in losses if math.isnan(l) or math.isinf(l))
+            metrics["num_steps"] = len(losses)
+        except Exception:
+            pass
+    # Parse val_acc
+    match = re.search(r"VAL_ACC:([\d.]+)", text)
+    if match:
+        metrics["val_acc"] = float(match.group(1))
+    return metrics
+def score_mnist_fix(metrics: dict) -> float:
+    """
+    Score an MNIST fix on a 0-1 scale.
+    Criteria:
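+    - No NaN/Inf in the losses (otherwise the score is 0.05)
+    - Val accuracy, up to 50%: full credit at >= 0.7, partial at >= 0.5 and >= 0.3
+    - Final loss, up to 30%: full credit at < 1.0, partial at < 1.5 and < 2.5
+    - Learning trajectory, up to 20%: last-quarter mean loss below first-quarter mean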
+    """
+    if not metrics:
+        return 0.0
+    if metrics.get("nan_count", 0) > 0:
+        return 0.05
+    score = 0.0
+    # Val accuracy (50% of score)
+    val_acc = metrics.get("val_acc")
+    if val_acc is not None:
+        if val_acc >= 0.7:
+            score += 0.50
+        elif val_acc >= 0.5:
+            score += 0.35
+        elif val_acc >= 0.3:
+            score += 0.15
+    # Final loss (30% of score)
+    final_loss = metrics.get("final_loss")
+    if final_loss is not None:
+        if final_loss < 1.0:
+            score += 0.30
+        elif final_loss < 1.5:
+            score += 0.20
+        elif final_loss < 2.5:
+            score += 0.10
+    # Learning trajectory (20% of score)
+    losses = metrics.get("losses", [])
+    if len(losses) >= 10:
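+        # Use one quarter size for both slices; `-len(losses) // 4` floors toward
+        # negative infinity and can select more items than the divisor assumes.
+        quarter = len(losses) // 4  # >= 2 because len(losses) >= 10
+        first_q = sum(losses[:quarter]) / quarter
+        last_q = sum(losses[-quarter:]) / quarter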
+        if last_q < first_q * 0.7:
+            score += 0.20
+        elif last_q < first_q:
+            score += 0.10
+    return min(1.0, score)
+def evaluate_single_model(
+    model_path: str,
+    label: str,
+    use_4bit: bool = False,
+    is_peft: bool = False,
+    base_model_for_peft: Optional[str] = None,
+    use_real_mnist: bool = False,
+) -> dict:
+    """Load a model, generate a fix, execute it, and return results."""
+    print(f"\n{'=' * 60}")
+    print(f"Evaluating: {label}")
+    print(f"  Model: {model_path}")
+    print(f"  Dataset: {'Real MNIST' if use_real_mnist else 'Synthetic'}")
+    print(f"{'=' * 60}")
+    # Select appropriate buggy code and task description
+    if use_real_mnist:
+        buggy_code = MNIST_BUGGY_CODE_REAL
+        task_desc = MNIST_TASK_DESCRIPTION_REAL
+    else:
+        buggy_code = MNIST_BUGGY_CODE_SYNTHETIC
+        task_desc = MNIST_TASK_DESCRIPTION_SYNTHETIC
+    # Load model
+    model, tokenizer = load_model(
+        model_path,
+        use_4bit=use_4bit,
+        is_peft=is_peft,
+        base_model_for_peft=base_model_for_peft,
+    )
+    # Generate fix
+    print("  Generating fix...")
+    start = time.time()
+    fixed_code = generate_fix(model, tokenizer, task_desc, buggy_code)
+    gen_time = time.time() - start
+    print(f"  Generation took {gen_time:.1f}s ({len(fixed_code)} chars)")
+    # Execute (longer timeout for real MNIST due to dataset download)
+    timeout = 300 if use_real_mnist else 120
+    print(f"  Executing fixed code (timeout={timeout}s)...")
+    result = execute_code(fixed_code, timeout=timeout)
+    metrics = extract_metrics(result["stdout"])
+    score = score_mnist_fix(metrics) if result["exit_code"] == 0 else 0.0
+    # Report
+    print(f"\n  Results for {label}:")
+    print(f"    Exit code:    {result['exit_code']}")
+    print(f"    Timed out:    {result['timed_out']}")
+    print(f"    Val accuracy: {metrics.get('val_acc', 'N/A')}")
+    print(f"    Final loss:   {metrics.get('final_loss', 'N/A')}")
+    print(f"    NaN count:    {metrics.get('nan_count', 'N/A')}")
+    print(f"    Score:        {score:.4f}")
+    if result["stderr"] and result["exit_code"] != 0:
+        print(f"    Stderr: {result['stderr'][:500]}")
+    # Free GPU memory
+    del model
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+    return {
+        "model": label,
+        "model_path": model_path,
+        "fixed_code": fixed_code,
+        "execution": result,
+        "metrics": metrics,
+        "score": score,
+        "generation_time": gen_time,
+    }
+def print_comparison_table(base_results: list, trained_results: list):
+    """Print a nicely formatted comparison table."""
+    # Aggregate scores
+    base_scores = [r["score"] for r in base_results]
+    trained_scores = [r["score"] for r in trained_results]
+    base_accs = [r["metrics"].get("val_acc", 0) or 0 for r in base_results]
+    trained_accs = [r["metrics"].get("val_acc", 0) or 0 for r in trained_results]
+    avg_base_score = sum(base_scores) / len(base_scores)
+    avg_trained_score = sum(trained_scores) / len(trained_scores)
+    avg_base_acc = sum(base_accs) / len(base_accs)
+    avg_trained_acc = sum(trained_accs) / len(trained_accs)
+    # Table
+    print(f"\n{'=' * 70}")
+    print(f"{'COMPARISON: Base vs GRPO-Trained Model':^70}")
+    print(f"{'=' * 70}")
+    headers = ["Metric", "Base Model", "Trained Model", "Δ (Improvement)"]
+    rows = [
+        ["Average Score", f"{avg_base_score:.4f}", f"{avg_trained_score:.4f}",
+         f"{avg_trained_score - avg_base_score:+.4f}"],
+        ["Average Val Acc", f"{avg_base_acc:.4f}", f"{avg_trained_acc:.4f}",
+         f"{avg_trained_acc - avg_base_acc:+.4f}"],
+        ["Best Score", f"{max(base_scores):.4f}", f"{max(trained_scores):.4f}",
+         f"{max(trained_scores) - max(base_scores):+.4f}"],
+        ["Best Val Acc", f"{max(base_accs):.4f}", f"{max(trained_accs):.4f}",
+         f"{max(trained_accs) - max(base_accs):+.4f}"],
+        ["Success Rate (>0.5)", f"{sum(1 for s in base_scores if s > 0.5)}/{len(base_scores)}",
+         f"{sum(1 for s in trained_scores if s > 0.5)}/{len(trained_scores)}", ""],
+    ]
+    # Calculate column widths
+    col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(4)]
+    # Print table
+    header_line = "│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │"
+    sep_line = "├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤"
+    top_line = "┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐"
+    bottom_line = "└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘"
+    print(top_line)
+    print(header_line)
+    print(sep_line)
+    for row in rows:
+        print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
+    print(bottom_line)
+    # Winner announcement
+    print()
+    if avg_trained_score > avg_base_score:
+        delta = avg_trained_score - avg_base_score
+        pct = (delta / max(avg_base_score, 0.001)) * 100
+        print(f"🏆 GRPO-trained model wins by +{delta:.4f} score ({pct:.1f}% improvement)!")
+    elif avg_base_score > avg_trained_score:
+        print("⚠️  Base model performed better (may need more training)")
+    else:
+        print("🤝 Models tied on average score")
+    return {
+        "base_avg_score": avg_base_score,
+        "trained_avg_score": avg_trained_score,
+        "base_avg_acc": avg_base_acc,
+        "trained_avg_acc": avg_trained_acc,
+        "improvement_score": avg_trained_score - avg_base_score,
+        "improvement_acc": avg_trained_acc - avg_base_acc,
+    }
+# ══════════════════════════════════════════════════════════════════════════════
+# Main
+# ══════════════════════════════════════════════════════════════════════════════
+def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate and compare multiple models on MNIST debugging",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Compare base vs trained model
+  python evaluate_mnist.py --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct --trained_model ./trained
+  # Use real MNIST dataset
+  python evaluate_mnist.py --use_real_mnist --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct
+  # Compare multiple models
+  python evaluate_mnist.py --models Qwen/Qwen2.5-Coder-1.5B-Instruct ./trained-v1 ./trained-v2
+  # Memory-efficient evaluation
+  python evaluate_mnist.py --use_4bit --base_model Qwen/Qwen2.5-Coder-7B-Instruct
+        """
+    )
+    # Model selection (flexible)
+    parser.add_argument("--base_model", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
+                        help="Path or HF name of base model")
+    parser.add_argument("--trained_model", type=str, default=None,
+                        help="Path to GRPO-trained model (optional if using --models)")
+    parser.add_argument("--models", type=str, nargs="+", default=None,
+                        help="List of models to compare (overrides --base_model and --trained_model)")
+    # Dataset options
+    parser.add_argument("--use_real_mnist", action="store_true",
+                        help="Use real MNIST dataset (downloads ~50MB, slower but more realistic)")
+    # Output
+    parser.add_argument("--output_file", type=str, default="mnist_eval_results.json",
+                        help="Output file for detailed results")
+    parser.add_argument("--num_runs", type=int, default=3,
+                        help="Number of evaluation runs per model")
+    # Memory options
+    parser.add_argument("--use_4bit", action="store_true",
+                        help="Use 4-bit quantization for memory efficiency")
+    parser.add_argument("--trained_is_peft", action="store_true",
+                        help="Trained model is a PEFT/LoRA adapter")
+    args = parser.parse_args()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    dataset_type = "Real MNIST" if args.use_real_mnist else "Synthetic MNIST-like"
+    print(f"\n{'#' * 70}")
+    print(f"{'MNIST DEBUGGING EVALUATION':^70}")
+    print(f"{'#' * 70}")
+    print(f"\nDevice: {device}")
+    print(f"Dataset: {dataset_type}")
+    print(f"Runs per model: {args.num_runs}")
+    print("\nMNIST Debugging Task (out-of-distribution):")
+    print("  Bugs: softmax before CE, NLLLoss without log, LR=5.0")
+    # Determine which models to evaluate
+    if args.models:
+        # Multi-model comparison mode
+        model_list = args.models
+        print(f"\nModels to compare ({len(model_list)}):")
+        for i, m in enumerate(model_list, 1):
+            print(f"  {i}. {m}")
+    else:
+        # Legacy two-model comparison
+        model_list = [args.base_model]
+        if args.trained_model:
+            model_list.append(args.trained_model)
+        print(f"\nBase model: {args.base_model}")
+        if args.trained_model:
+            print(f"Trained model: {args.trained_model}")
+    # Run evaluations for each model
+    all_results = {model: [] for model in model_list}
+    for run in range(1, args.num_runs + 1):
+        print(f"\n{'─' * 70}")
+        print(f"Run {run}/{args.num_runs}")
+        print(f"{'─' * 70}")
+        for model_path in model_list:
+            # Path.name handles both HF ids ("org/model") and local paths, including a trailing slash
+            model_name = Path(model_path).name
+            # Determine if this is a PEFT model
+            is_peft = args.trained_is_peft and model_path != args.base_model
+            base_for_peft = args.base_model if is_peft else None
+            result = evaluate_single_model(
+                model_path,
+                f"{model_name} (run {run})",
+                use_4bit=args.use_4bit,
+                is_peft=is_peft,
+                base_model_for_peft=base_for_peft,
+                use_real_mnist=args.use_real_mnist,
+            )
+            all_results[model_path].append(result)
+    # Print comparison table for all models
+    print(f"\n{'=' * 80}")
+    print(f"{'RESULTS SUMMARY':^80}")
+    print(f"{'=' * 80}")
+    # Calculate aggregates for each model
+    model_stats = {}
+    for model_path, results in all_results.items():
+        scores = [r["score"] for r in results]
+        accs = [r["metrics"].get("val_acc", 0) or 0 for r in results]
+        model_stats[model_path] = {
+            "avg_score": sum(scores) / len(scores),
+            "avg_acc": sum(accs) / len(accs),
+            "best_score": max(scores),
+            "best_acc": max(accs),
+            "success_rate": sum(1 for s in scores if s > 0.5) / len(scores),
+        }
+    # Print table
+    headers = ["Model", "Avg Score", "Avg Acc", "Best Score", "Success Rate"]
+    rows = []
+    for model_path, stats in model_stats.items():
+        model_name = Path(model_path).name
+        rows.append([
+            model_name[:25],  # Truncate long names
+            f"{stats['avg_score']:.4f}",
+            f"{stats['avg_acc']:.4f}",
+            f"{stats['best_score']:.4f}",
+            f"{stats['success_rate']*100:.0f}%",
+        ])
+    col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(len(headers))]
+    print("┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐")
+    print("│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │")
+    print("├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤")
+    for row in rows:
+        print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
+    print("└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘")
+    # Find winner
+    best_model = max(model_stats.items(), key=lambda x: x[1]["avg_score"])
+    print(f"\n🏆 Best model: {Path(best_model[0]).name} (avg score: {best_model[1]['avg_score']:.4f})")
+    # Legacy comparison if exactly 2 models
+    summary = None
+    if len(model_list) == 2:
+        base_results = all_results[model_list[0]]
+        trained_results = all_results[model_list[1]]
+        summary = print_comparison_table(base_results, trained_results)
+    # Save detailed results
+    output = {
+        "task": f"MNIST debugging ({dataset_type})",
+        "models": model_list,
+        "num_runs": args.num_runs,
+        "device": device,
+        "use_real_mnist": args.use_real_mnist,
+        "model_stats": model_stats,
+        "summary": summary,
+        "runs": {
+            model_path: [
+                {k: v for k, v in r.items() if k != "fixed_code"}
+                for r in results
+            ]
+            for model_path, results in all_results.items()
+        },
+    }
+    with open(args.output_file, "w") as f:
+        json.dump(output, f, indent=2, default=str)
+    print(f"\n📄 Full results saved to {args.output_file}")
+if __name__ == "__main__":
+    main()
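
Review note: the per-model aggregation inside `main()` can be read in isolation. The following is a minimal sketch, assuming each run dictionary carries the `score` and `metrics` keys returned by `evaluate_single_model`. The helper name `summarize_runs`, the `success_threshold` parameter, and the sample values are hypothetical and not part of this commit.

```python
def summarize_runs(results, success_threshold=0.5):
    """Aggregate run dicts the same way main() builds model_stats."""
    scores = [r["score"] for r in results]
    # A missing or None val_acc counts as 0, mirroring `.get("val_acc", 0) or 0`.
    accs = [r["metrics"].get("val_acc", 0) or 0 for r in results]
    return {
        "avg_score": sum(scores) / len(scores),
        "avg_acc": sum(accs) / len(accs),
        "best_score": max(scores),
        "best_acc": max(accs),
        "success_rate": sum(1 for s in scores if s > success_threshold) / len(scores),
    }


# Made-up runs, for illustration only.
runs = [
    {"score": 0.8, "metrics": {"val_acc": 0.95}},
    {"score": 0.0, "metrics": {"val_acc": None}},
]
stats = summarize_runs(runs)
```

Pulling the aggregation into a function like this would also let `print_comparison_table` and the summary table share one definition of "success".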

gradio_app.py CHANGED

@@ -48,6 +48,12 @@ TASK_INFO = {
         "description": "Backbone frozen but its parameters are passed to the optimizer.",
         "hints": "Unfreeze backend or only pass head parameters to Adam.",
     },
+    "task6": {
+        "name": "Input-Output Mismatch",
+        "difficulty": "🔴 Hard",
+        "description": "CNN has 4 bugs: shape mismatch, channel order (HWC/CHW), label encoding, batch dimension.",
+        "hints": "Fix image size (32→28), permute HWC→CHW, use class indices not one-hot, add unsqueeze(0).",
+    },
 }
 # ── Theme ──────────────────────────────────────────────────────────────────
@@ -418,7 +424,7 @@ def do_run_baseline(base_url: str, task_id: str):
     results_md = "### 🤖 Baseline Agent Results\n\n"
     results_md += "| Task | Score |\n|---|---|\n"
-    for tid in ["task1", "task2", "task3", "task4", "task5"]:
+    for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
         s = scores.get(tid, 0.0)
         emoji = "🎯" if s >= 0.9 else ("✅" if s >= 0.7 else ("📈" if s >= 0.4 else "⚠️"))
         results_md += f"| {tid} | {emoji} {s:.4f} |\n"
@@ -492,8 +498,21 @@ def build_ui() -> gr.Blocks:
             # ── Left column: Task selector ──
             with gr.Column(scale=1, min_width=280):
                 gr.Markdown("### 📋 Task Selector")
+                baseline_model = gr.Dropdown(
+                    choices=[
+                        "Qwen/Qwen2.5-Coder-1.5B-Instruct",
+                        "Qwen/Qwen2.5-Coder-3B-Instruct",
+                        "Qwen/Qwen2.5-Coder-7B-Instruct",
+                        "Qwen/Qwen2.5-Coder-14B-Instruct",
+                        "Qwen/Qwen2.5-Coder-32B-Instruct",
+                        "mistralai/Mistral-7B-Instruct-v0.3",
+                    ],
+                    value="Qwen/Qwen2.5-Coder-32B-Instruct",
+                    label="Auto-Agent Model",
+                    info="Choose which LLM to run for baseline auto-agent",
+                )
                 task_id = gr.Radio(
-                    choices=["task1", "task2", "task3", "task4", "task5"],
+                    choices=["task1", "task2", "task3", "task4", "task5", "task6"],
                     value="task1",
                     label="Select Task",
                     info="Choose a debugging challenge",
@@ -642,18 +661,23 @@ Fix optimizer order + learning rate bugs in a linear classifier.
             "task3": "🔴 OOM + Data Leakage",
             "task4": "🟡 Wrong Loss Function",
             "task5": "🟡 Frozen Backbone",
+            "task6": "🔴 Input-Output Mismatch",
         }
-        def run_baseline_live(base_url_val):
+        def run_baseline_live(base_url_val, model_id_val):
             """Generator that yields live progress as each task completes."""
             base = (base_url_val or DEFAULT_BASE_URL).strip().rstrip("/")
+            model_id = (model_id_val or "Qwen/Qwen2.5-Coder-32B-Instruct").strip()
             results = {}
-            lines_header = ["### 🤖 Baseline Agent — Live Progress\n"]
+            lines_header = [
+                "### 🤖 Baseline Agent — Live Progress\n",
+                f"**Model:** `{model_id}`\n",
+            ]
             # Phase 1: Show "starting" state
             yield "\n".join(lines_header + ["⏳ Starting baseline agent..."])
-            for tid in ["task1", "task2", "task3", "task4", "task5"]:
+            for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
                 tname = TASK_NAMES.get(tid, tid)
                 # Show "running this task" update
@@ -673,7 +697,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
                 # Actually call the per-task endpoint
                 try:
                     with httpx.Client(timeout=180.0) as client:
-                        resp = client.get(f"{base}/baseline/task/{tid}")
+                        resp = client.get(f"{base}/baseline/task/{tid}", params={"model_id": model_id})
                         resp.raise_for_status()
                         data = resp.json()
                 except Exception as exc:
@@ -690,7 +714,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
             final_lines = ["### 🤖 Baseline Agent Results\n", "| Task | Score |", "|---|---|"]
             total = 0.0
             has_errors = False
-            for tid in ["task1", "task2", "task3", "task4", "task5"]:
+            for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
                 info = results.get(tid, {"score": 0.0})
                 s = info["score"]
                 total += s
@@ -700,7 +724,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
                     has_errors = True
                     final_lines.append(f"\n> ⚠️ `{info['error'][:200]}`\n")
-            avg = total / 5
+            avg = total / 6
             final_lines.append(f"\n**Average: {avg:.4f}**")
             if avg >= 0.7:
                 final_lines.append("\n🎉 **Agent performed well!** The environment is solvable.")
@@ -713,7 +737,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
                 final_lines.append("\n---\n> [!WARNING]\n> Some tasks failed. Check if `HF_TOKEN` is valid and the model is accessible.")
             final_lines.append("\n---\n### 🔍 Auto-Agent Generated Code & Execution Logs")
-            for tid in ["task1", "task2", "task3", "task4", "task5"]:
+            for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
                 info = results.get(tid, {})
                 fixed_code = str(info.get("fixed_code", ""))
                 output = str(info.get("output", ""))
@@ -731,7 +755,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
             outputs=[baseline_output],
         ).then(
             fn=run_baseline_live,
-            inputs=[base_url],
+            inputs=[base_url, baseline_model],
             outputs=[baseline_output],
         )
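
Review note: this change edits the same task list in five places and bumps a literal divisor from 5 to 6, so the next new task will need the same sweep. One possible refactor is to derive everything from a single constant. The sketch below is standalone and only illustrative; `TASK_IDS` and `average_score` are hypothetical names that do not exist in `gradio_app.py`.

```python
TASK_IDS = [f"task{i}" for i in range(1, 7)]  # task1 .. task6


def average_score(results, task_ids=TASK_IDS):
    """Average per-task scores; a task with no result counts as 0.0."""
    total = sum(results.get(tid, {"score": 0.0})["score"] for tid in task_ids)
    # Dividing by len(task_ids) keeps the average correct when tasks are added.
    return total / len(task_ids)


# Made-up results, for illustration only.
results = {"task1": {"score": 0.9}, "task6": {"score": 0.3}}
avg = average_score(results)
```

The same constant could then feed the `gr.Radio` choices and every `for tid in ...` loop.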

improved_agent.py ADDED

	@@ -0,0 +1,717 @@

+#!/usr/bin/env python3
+"""
+Improved GRPO training script for WhipStudio ML Debug Environment.
+This script trains Qwen2.5-Coder-1.5B (or similar) to debug broken PyTorch scripts
+using Group Relative Policy Optimization (GRPO) with the WhipStudio environment
+as the reward oracle.
+Improvements over basic train_grpo.py:
+1. Memory-efficient training with 4-bit quantization
+2. LoRA fine-tuning for reduced VRAM usage
+3. Curriculum learning (easier tasks first)
+4. Gradient checkpointing for large contexts
+5. Checkpoint saving with best model tracking
+6. Early stopping based on validation scores
+7. Wandb/TensorBoard logging support
+Requirements:
+    pip install "trl>=0.15.0" "transformers>=4.46.0" datasets torch httpx
+    pip install accelerate peft bitsandbytes wandb
+Usage:
+    # Basic training
+    python improved_agent.py \
+        --env_url https://your-space.hf.space \
+        --output_dir ./whipstudio-debugger
+    # Memory-efficient training (8GB VRAM)
+    python improved_agent.py \
+        --env_url https://your-space.hf.space \
+        --use_4bit \
+        --use_lora \
+        --gradient_checkpointing \
+        --output_dir ./whipstudio-debugger-lora
+    # Full training with wandb logging
+    python improved_agent.py \
+        --env_url https://your-space.hf.space \
+        --use_wandb \
+        --wandb_project whipstudio \
+        --num_iterations 100 \
+        --output_dir ./whipstudio-debugger
+"""
+import argparse
+import json
+import math
+import os
+import random
+import re
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Optional
+import httpx
+import torch
+from datasets import Dataset
+from transformers import (
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    BitsAndBytesConfig,
+)
+# TRL imports
+try:
+    from trl import GRPOConfig, GRPOTrainer
+except ImportError as exc:
+    raise ImportError('Please install trl: pip install "trl>=0.15.0"') from exc
+# PEFT imports (optional)
+try:
+    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+    PEFT_AVAILABLE = True
+except ImportError:
+    PEFT_AVAILABLE = False
+# Wandb import (optional)
+try:
+    import wandb
+    WANDB_AVAILABLE = True
+except ImportError:
+    WANDB_AVAILABLE = False
+# ══════════════════════════════════════════════════════════════════════════════
+# Constants
+# ══════════════════════════════════════════════════════════════════════════════
+SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
+You receive a broken training script and must fix ALL bugs.
+Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
+The script must print metrics in the format specified by the task description.
+Keep all torch.manual_seed() calls intact.
+Wrap metrics in ##METRICS_START## and ##METRICS_END## markers."""
+# Task ordering by difficulty for curriculum learning
+TASK_DIFFICULTY = {
+    "task1": 1,  # Easy: broken loop
+    "task4": 2,  # Medium: wrong loss
+    "task5": 2,  # Medium: frozen backbone
+    "task2": 3,  # Medium: NaN loss (tricky)
+    "task3": 4,  # Hard: label inversion (task is labelled "OOM + leakage")
+}
+ALL_TASKS = list(TASK_DIFFICULTY.keys())
+# ══════════════════════════════════════════════════════════════════════════════
+# Environment Client
+# ══════════════════════════════════════════════════════════════════════════════
+class WhipStudioEnv:
+    """Client for the WhipStudio RL environment."""
+    def __init__(self, env_url: str, timeout: float = 180.0):
+        self.env_url = env_url.rstrip("/")
+        self.timeout = httpx.Timeout(timeout, connect=15.0)
+        self._task_cache: dict[str, dict] = {}
+    def reset(self, task_id: str) -> dict:
+        """Reset environment and return observation."""
+        with httpx.Client(timeout=self.timeout) as client:
+            resp = client.post(f"{self.env_url}/reset", json={"task_id": task_id})
+            resp.raise_for_status()
+            data = resp.json()
+            obs = data.get("observation", data)
+            self._task_cache[task_id] = obs
+            return obs
+    def step(self, fixed_code: str, attempt: int = 1) -> dict:
+        """Submit a fix and return the full step result."""
+        payload = {
+            "action": {
+                "fixed_code": fixed_code,
+                "attempt_number": attempt,
+            }
+        }
+        with httpx.Client(timeout=self.timeout) as client:
+            resp = client.post(f"{self.env_url}/step", json=payload)
+            resp.raise_for_status()
+            return resp.json()
+    def get_task_obs(self, task_id: str) -> dict:
+        """Get cached observation or reset to obtain it."""
+        if task_id not in self._task_cache:
+            self.reset(task_id)
+        return self._task_cache[task_id]
+    def health_check(self) -> bool:
+        """Verify the environment is reachable."""
+        try:
+            with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
+                resp = client.get(f"{self.env_url}/health")
+                return resp.status_code == 200
+        except Exception:
+            return False
+# ══════════════════════════════════════════════════════════════════════════════
+# Prompt Utilities
+# ══════════════════════════════════════════════════════════════════════════════
+def build_user_prompt(task_description: str, buggy_code: str) -> str:
+    """Build the user prompt for the model."""
+    return f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"
+def format_chat(tokenizer: Any, user_prompt: str) -> str:
+    """Format as a chat message and return the full text."""
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": user_prompt},
+    ]
+    return tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+def extract_code_from_response(response: str) -> str:
+    """Extract Python code from model response, stripping markdown if present."""
+    text = response.strip()
+    if "```python" in text:
+        text = text.split("```python", 1)[1].split("```", 1)[0].strip()
+    elif "```" in text:
+        text = text.split("```", 1)[1].split("```", 1)[0].strip()
+    return text
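
Review note: the generic-fence branch of `extract_code_from_response` splits on the bare fence, so a reply fenced with a tag other than `python` (for example `py`) keeps that tag as the first line of the extracted code. A regex-based variant is sketched below as one way to avoid this. It is an assumption about the desired behaviour, not the author's implementation, and `extract_code` and `FENCE` are hypothetical names.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
# Opening fence, optional language tag, newline, then the shortest body up to the next fence.
_BLOCK = re.compile(FENCE + r"[ \t]*[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none is found."""
    match = _BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()
```

A fence with no newline after the opening marker does not match, in which case the whole response is returned stripped.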
+# ══════════════════════════════════════════════════════════════════════════════
+# Reward Function
+# ══════════════════════════════════════════════════════════════════════════════
+def create_reward_function(env: WhipStudioEnv, verbose: bool = True):
+    """
+    Create a reward function compatible with TRL's GRPOTrainer.
+    Includes reward shaping:
+    - Bonus for valid Python syntax
+    - Bonus for including required output markers
+    - Environment reward from grader
+    """
+    def reward_fn(completions: list[list[dict]], **kwargs) -> list[float]:
+        """Compute rewards for a batch of completions."""
+        rewards = []
+        task_ids = kwargs.get("task_id", ["task1"] * len(completions))
+        for i, completion in enumerate(completions):
+            task_id = task_ids[i] if i < len(task_ids) else "task1"
+            try:
+                # Extract assistant's response
+                if isinstance(completion, list):
+                    text = ""
+                    for msg in completion:
+                        if isinstance(msg, dict) and msg.get("role") == "assistant":
+                            text = msg.get("content", "")
+                            break
+                    if not text and completion:
+                        text = str(completion[-1].get("content", ""))
+                elif isinstance(completion, str):
+                    text = completion
+                else:
+                    text = str(completion)
+                fixed_code = extract_code_from_response(text)
+                # Reward shaping: syntax check
+                syntax_bonus = 0.0
+                try:
+                    compile(fixed_code, "<string>", "exec")
+                    syntax_bonus = 0.05
+                except SyntaxError:
+                    pass
+                # Reward shaping: output markers present
+                marker_bonus = 0.0
+                if "LOSSES:" in fixed_code or "##METRICS" in fixed_code:
+                    marker_bonus = 0.02
+                if not fixed_code.strip():
+                    rewards.append(0.0)
+                    continue
+                # Get environment reward
+                env.reset(task_id)
+                result = env.step(fixed_code, attempt=1)
+                env_reward = float(result.get("reward", 0.0) or 0.0)
+                # Total reward (capped at 1.0)
+                total_reward = min(1.0, env_reward + syntax_bonus + marker_bonus)
+                rewards.append(total_reward)
+                if verbose:
+                    print(f"  [reward] task={task_id} env={env_reward:.3f} syntax={syntax_bonus:.2f} total={total_reward:.3f}")
+            except Exception as e:
+                if verbose:
+                    print(f"  [reward] ERROR task={task_id}: {e}")
+                rewards.append(0.0)
+        return rewards
+    return reward_fn
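
Review note: the shaping arithmetic in `reward_fn` is easier to check when separated from the HTTP calls. The sketch below assumes the same constants as the diff (0.05 for code that parses, 0.02 for output markers, total capped at 1.0). `shaped_reward` is a hypothetical helper, and `env_reward` stands in for the value the environment would return.

```python
def shaped_reward(fixed_code: str, env_reward: float) -> float:
    """Combine the grader reward with the two small shaping bonuses."""
    if not fixed_code.strip():
        return 0.0  # empty submissions earn nothing
    bonus = 0.0
    try:
        compile(fixed_code, "<string>", "exec")
        bonus += 0.05  # the code parses
    except SyntaxError:
        pass
    if "LOSSES:" in fixed_code or "##METRICS" in fixed_code:
        bonus += 0.02  # expected output markers appear in the source
    return min(1.0, env_reward + bonus)
```

Under these constants, a submission the grader scores at 0.0 can still collect up to 0.07 from shaping alone, which may be worth keeping in mind when reading low training rewards.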
+# ══════════════════════════════════════════════════════════════════════════════
+# Dataset Generation with Curriculum
+# ══════════════════════════════════════════════════════════════════════════════
+def generate_curriculum_dataset(
+    env: WhipStudioEnv,
+    tokenizer: Any,
+    samples_per_task: int = 10,
+    curriculum_stage: int = 0,  # 0 = all tasks, 1 = easier tasks weighted, etc.
+) -> Dataset:
+    """
+    Generate a dataset with curriculum-based sampling.
+    Args:
+        env: WhipStudio environment client
+        tokenizer: Model tokenizer
+        samples_per_task: Base samples per task
+        curriculum_stage: 0=uniform, higher=bias toward easier tasks
+    """
+    records = []
+    # Compute task weights based on curriculum stage
+    task_weights = {}
+    for task_id, difficulty in TASK_DIFFICULTY.items():
+        if curriculum_stage == 0:
+            weight = 1.0
+        else:
+            # Higher curriculum_stage = more weight on easier tasks
+            weight = max(0.2, 1.0 - (difficulty - 1) * 0.2 * curriculum_stage)
+        task_weights[task_id] = weight
+    # Normalize weights
+    total_weight = sum(task_weights.values())
+    task_weights = {k: v / total_weight for k, v in task_weights.items()}
+    for task_id in ALL_TASKS:
+        print(f"  Fetching observation for {task_id} (weight={task_weights[task_id]:.2f})...")
+        obs = env.reset(task_id)
+        user_prompt = build_user_prompt(
+            task_description=obs.get("task_description", ""),
+            buggy_code=obs.get("buggy_code", ""),
+        )
+        formatted = format_chat(tokenizer, user_prompt)
+        # Number of samples proportional to weight
+        n_samples = max(1, int(samples_per_task * task_weights[task_id] * len(ALL_TASKS)))
+        for _ in range(n_samples):
+            records.append({
+                "prompt": formatted,
+                "task_id": task_id,
+            })
+    random.shuffle(records)
+    return Dataset.from_list(records)
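
Review note: the effect of `curriculum_stage` on the dataset mix is not obvious from the formula alone. The sketch below reproduces only the weighting arithmetic from `generate_curriculum_dataset`, with no environment or tokenizer involved. `samples_for_stage` is a hypothetical helper; the difficulty table is copied from `TASK_DIFFICULTY` in this file.

```python
TASK_DIFFICULTY = {"task1": 1, "task4": 2, "task5": 2, "task2": 3, "task3": 4}


def samples_for_stage(curriculum_stage: int, samples_per_task: int = 10) -> dict:
    """Return how many prompts each task contributes at a given curriculum stage."""
    weights = {}
    for task_id, difficulty in TASK_DIFFICULTY.items():
        if curriculum_stage == 0:
            weights[task_id] = 1.0  # uniform sampling
        else:
            # Harder tasks lose weight as the stage grows, floored at 0.2.
            weights[task_id] = max(0.2, 1.0 - (difficulty - 1) * 0.2 * curriculum_stage)
    total = sum(weights.values())
    n_tasks = len(TASK_DIFFICULTY)
    return {
        task_id: max(1, int(samples_per_task * (w / total) * n_tasks))
        for task_id, w in weights.items()
    }


counts = samples_for_stage(curriculum_stage=2)
```

With the default `samples_per_task=10`, stage 2 gives task1 19 prompts, task4 and task5 11 each, and task2 and task3 3 each. Because of the `int()` truncation, the total (47) is slightly below the 50 prompts produced at stage 0.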
+# ══════════════════════════════════════════════════════════════════════════════
+# Model Loading Utilities
+# ══════════════════════════════════════════════════════════════════════════════
+def load_model_and_tokenizer(
+    model_name: str,
+    use_4bit: bool = False,
+    use_8bit: bool = False,
+    gradient_checkpointing: bool = False,
+):
+    """Load model with optional quantization and gradient checkpointing."""
+    print(f"Loading model: {model_name}")
+    # Tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    # Quantization config
+    quantization_config = None
+    if (use_4bit or use_8bit) and not PEFT_AVAILABLE:
+        # prepare_model_for_kbit_training (called below) is imported from peft
+        raise ImportError("Quantized loading requires peft and bitsandbytes")
+    if use_4bit:
+        quantization_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+            bnb_4bit_use_double_quant=True,
+        )
+        print("  Using 4-bit quantization")
+    elif use_8bit:
+        quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+        print("  Using 8-bit quantization")
+    # Model kwargs
+    model_kwargs = {
+        "trust_remote_code": True,
+        "torch_dtype": torch.bfloat16 if not (use_4bit or use_8bit) else None,
+        "device_map": "auto",
+    }
+    if quantization_config:
+        model_kwargs["quantization_config"] = quantization_config
+    # Load model
+    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
+    # Prepare for k-bit training if quantized
+    if use_4bit or use_8bit:
+        model = prepare_model_for_kbit_training(model)
+    # Gradient checkpointing
+    if gradient_checkpointing:
+        model.gradient_checkpointing_enable()
+        print("  Gradient checkpointing enabled")
+    param_count = sum(p.numel() for p in model.parameters())
+    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    print(f"  Total params: {param_count / 1e6:.1f}M, Trainable: {trainable / 1e6:.1f}M")
+    return model, tokenizer
+def apply_lora(
+    model,
+    lora_r: int = 16,
+    lora_alpha: int = 32,
+    target_modules: Optional[list[str]] = None,
+):
+    """Apply LoRA adapters to the model."""
+    if not PEFT_AVAILABLE:
+        raise ImportError("LoRA requires peft: pip install peft")
+    if target_modules is None:
+        # Default targets for Qwen2 and similar architectures
+        target_modules = [
+            "q_proj", "k_proj", "v_proj", "o_proj",
+            "gate_proj", "up_proj", "down_proj",
+        ]
+    lora_config = LoraConfig(
+        r=lora_r,
+        lora_alpha=lora_alpha,
+        target_modules=target_modules,
+        lora_dropout=0.05,
+        bias="none",
+        task_type="CAUSAL_LM",
+    )
+    model = get_peft_model(model, lora_config)
+    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    print(f"  LoRA applied: r={lora_r}, trainable params: {trainable / 1e6:.2f}M")
+    return model, lora_config
+# ══════════════════════════════════════════════════════════════════════════════
+# Validation & Evaluation
+# ══════════════════════════════════════════════════════════════════════════════
+def evaluate_model(
+    model,
+    tokenizer,
+    env: WhipStudioEnv,
+    task_ids: Optional[list[str]] = None,
+    max_new_tokens: int = 2048,
+) -> dict[str, float]:
+    """Evaluate model on tasks and return scores."""
+    if task_ids is None:
+        task_ids = ALL_TASKS
+    model.eval()
+    scores = {}
+    for task_id in task_ids:
+        obs = env.reset(task_id)
+        user_prompt = build_user_prompt(obs["task_description"], obs["buggy_code"])
+        formatted = format_chat(tokenizer, user_prompt)
+        inputs = tokenizer(formatted, return_tensors="pt", truncation=True, max_length=4096)
+        inputs = {k: v.to(model.device) for k, v in inputs.items()}
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=max_new_tokens,
+                temperature=0.2,
+                top_p=0.95,
+                do_sample=True,
+                pad_token_id=tokenizer.pad_token_id,
+            )
+        generated = outputs[0][inputs["input_ids"].shape[1]:]
+        response = tokenizer.decode(generated, skip_special_tokens=True)
+        fixed_code = extract_code_from_response(response)
+        env.reset(task_id)
+        result = env.step(fixed_code, attempt=1)
+        reward = float(result.get("reward", 0.0) or 0.0)
+        scores[task_id] = reward
+        print(f"    {task_id}: {reward:.4f}")
+    return scores
+# ══════════════════════════════════════════════════════════════════════════════
+# Main Training Loop
+# ══════════════════════════════════════════════════════════════════════════════
+def main():
+    parser = argparse.ArgumentParser(description="Improved GRPO training for WhipStudio")
+    # Environment
+    parser.add_argument("--env_url", type=str, required=True,
+                        help="URL of the WhipStudio HF Space")
+    # Model
+    parser.add_argument("--model_name", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
+                        help="Base model to fine-tune")
+    parser.add_argument("--output_dir", type=str, default="./whipstudio-debugger",
+                        help="Directory to save the trained model")
+    # Quantization & Memory
+    parser.add_argument("--use_4bit", action="store_true",
+                        help="Use 4-bit quantization (requires bitsandbytes)")
+    parser.add_argument("--use_8bit", action="store_true",
+                        help="Use 8-bit quantization")
+    parser.add_argument("--gradient_checkpointing", action="store_true",
+                        help="Enable gradient checkpointing to save memory")
+    # LoRA
+    parser.add_argument("--use_lora", action="store_true",
+                        help="Use LoRA for efficient fine-tuning")
+    parser.add_argument("--lora_r", type=int, default=16,
+                        help="LoRA rank")
+    parser.add_argument("--lora_alpha", type=int, default=32,
+                        help="LoRA alpha")
+    # Training
+    parser.add_argument("--num_iterations", type=int, default=50,
+                        help="Total number of training epochs, split evenly across curriculum stages")
+    parser.add_argument("--group_size", type=int, default=4,
+                        help="Number of completions per prompt for GRPO")
+    parser.add_argument("--samples_per_task", type=int, default=10,
+                        help="Base samples per task in dataset")
+    parser.add_argument("--learning_rate", type=float, default=1e-5,
+                        help="Learning rate")
+    parser.add_argument("--max_new_tokens", type=int, default=2048,
+                        help="Max tokens to generate per completion")
+    parser.add_argument("--beta", type=float, default=0.1,
+                        help="KL penalty coefficient")
+    # Curriculum
+    parser.add_argument("--curriculum_stages", type=int, default=3,
+                        help="Number of curriculum stages (0 = no curriculum)")
+    # Logging
+    parser.add_argument("--use_wandb", action="store_true",
+                        help="Log to Weights & Biases")
+    parser.add_argument("--wandb_project", type=str, default="whipstudio",
+                        help="W&B project name")
+    # Early stopping
+    parser.add_argument("--patience", type=int, default=10,
+                        help="Early stopping patience (epochs without improvement)")
+    parser.add_argument("--eval_every", type=int, default=5,
+                        help="Evaluate every N epochs")
+    # Hub
+    parser.add_argument("--push_to_hub", action="store_true",
+                        help="Push trained model to HuggingFace Hub")
+    parser.add_argument("--hub_model_id", type=str, default=None,
+                        help="Model ID on HF Hub")
+    args = parser.parse_args()
+    # ── Verify environment ──
+    print(f"\n{'=' * 60}")
+    print("WhipStudio Improved GRPO Training")
+    print(f"{'=' * 60}")
+    print(f"Environment: {args.env_url}")
+    env = WhipStudioEnv(args.env_url)
+    if not env.health_check():
+        raise ConnectionError(f"Cannot reach WhipStudio at {args.env_url}")
+    print("Environment is reachable ✓")
+    # ── Initialize wandb ──
+    if args.use_wandb:
+        if not WANDB_AVAILABLE:
+            print("Warning: wandb not installed, skipping logging")
+            args.use_wandb = False
+        else:
+            wandb.init(
+                project=args.wandb_project,
+                config=vars(args),
+                name=f"grpo-{args.model_name.split('/')[-1]}",
+            )
+    # ── Load model ──
+    model, tokenizer = load_model_and_tokenizer(
+        args.model_name,
+        use_4bit=args.use_4bit,
+        use_8bit=args.use_8bit,
+        gradient_checkpointing=args.gradient_checkpointing,
+    )
+    # ── Apply LoRA ──
+    peft_config = None
+    if args.use_lora:
+        model, peft_config = apply_lora(
+            model,
+            lora_r=args.lora_r,
+            lora_alpha=args.lora_alpha,
+        )
+    # ── Create output directory ──
+    output_path = Path(args.output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+    # ── Training with curriculum ──
+    best_avg_score = 0.0
+    epochs_without_improvement = 0
+    n_stages = max(1, args.curriculum_stages)
+    # Guard against a zero-epoch stage when num_iterations < n_stages
+    epochs_per_stage = max(1, args.num_iterations // n_stages)
+    for stage in range(n_stages):
+        print(f"\n{'=' * 60}")
+        print(f"Curriculum Stage {stage + 1}/{n_stages}")
+        print(f"{'=' * 60}")
+        # Generate dataset for this curriculum stage
+        dataset = generate_curriculum_dataset(
+            env, tokenizer,
+            samples_per_task=args.samples_per_task,
+            curriculum_stage=stage,
+        )
+        print(f"Dataset: {len(dataset)} samples")
+        # Create reward function
+        reward_fn = create_reward_function(env, verbose=True)
+        # Configure GRPO
+        grpo_config = GRPOConfig(
+            output_dir=str(output_path / f"stage_{stage}"),
+            num_train_epochs=epochs_per_stage,
+            per_device_train_batch_size=1,
+            gradient_accumulation_steps=4,
+            learning_rate=args.learning_rate,
+            lr_scheduler_type="cosine",
+            warmup_ratio=0.1,
+            max_completion_length=args.max_new_tokens,
+            num_generations=args.group_size,
+            logging_steps=1,
+            save_steps=epochs_per_stage,
+            save_total_limit=2,
+            bf16=True,
+            report_to="wandb" if args.use_wandb else "none",
+            beta=args.beta,
+            remove_unused_columns=False,
+        )
+        # Initialize trainer
+        trainer = GRPOTrainer(
+            model=model,
+            args=grpo_config,
+            train_dataset=dataset,
+            processing_class=tokenizer,
+            reward_funcs=reward_fn,
+            peft_config=peft_config if stage == 0 else None,  # Only apply peft on first stage
+        )
+        # Train
+        print(f"\nTraining stage {stage + 1}...")
+        train_result = trainer.train()
+        print(f"  Stage {stage + 1} complete: {train_result.global_step} steps")
+        # Evaluate
+        print("\nEvaluating...")
+        scores = evaluate_model(model, tokenizer, env)
+        avg_score = sum(scores.values()) / len(scores)
+        print(f"  Average score: {avg_score:.4f}")
+        if args.use_wandb:
+            wandb.log({
+                "stage": stage + 1,
+                "avg_score": avg_score,
+                **{f"score/{k}": v for k, v in scores.items()},
+            })
+        # Track best model
+        if avg_score > best_avg_score:
+            best_avg_score = avg_score
+            epochs_without_improvement = 0
+            # Save best model
+            best_path = output_path / "best"
+            trainer.save_model(str(best_path))
+            tokenizer.save_pretrained(str(best_path))
+            print(f"  New best model saved (score={avg_score:.4f})")
+        else:
+            epochs_without_improvement += epochs_per_stage
+        # Early stopping
+        if epochs_without_improvement >= args.patience:
+            print(f"\nEarly stopping: no improvement for {args.patience} epochs")
+            break
+    # ── Final save ──
+    final_path = output_path / "final"
+    trainer.save_model(str(final_path))
+    tokenizer.save_pretrained(str(final_path))
+    print(f"\nFinal model saved to {final_path}")
+    # ── Push to hub ──
+    if args.push_to_hub and args.hub_model_id:
+        print(f"Pushing to Hub as {args.hub_model_id}...")
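+        # Trainer.push_to_hub() takes a commit message as its first argument,
+        # so push the model object to the requested repo ID instead.
+        model.push_to_hub(args.hub_model_id)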
+        tokenizer.push_to_hub(args.hub_model_id)
+        print("Pushed to Hub ✓")
+    # ── Final evaluation ──
+    print(f"\n{'=' * 60}")
+    print("Final Evaluation on All Tasks")
+    print(f"{'=' * 60}")
+    final_scores = evaluate_model(model, tokenizer, env)
+    final_avg = sum(final_scores.values()) / len(final_scores)
+    print(f"\nFinal average score: {final_avg:.4f}")
+    print(f"Best average score during training: {best_avg_score:.4f}")
+    if args.use_wandb:
+        wandb.log({"final_avg_score": final_avg})
+        wandb.finish()
+    # ── Save training summary ──
+    summary = {
+        "model_name": args.model_name,
+        "final_avg_score": final_avg,
+        "best_avg_score": best_avg_score,
+        "final_scores": final_scores,
+        "curriculum_stages": n_stages,
+        "use_lora": args.use_lora,
+        "use_4bit": args.use_4bit,
+    }
+    with open(output_path / "training_summary.json", "w") as f:
+        json.dump(summary, f, indent=2)
+    print("\nTraining complete! ✓")
+if __name__ == "__main__":
+    main()
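The curriculum loop above splits one epoch budget across stages and stops once the evaluation score has not improved for `patience` epochs. Below is a minimal, standalone sketch of that bookkeeping. `plan_stages` and `EarlyStopper` are hypothetical names that do not appear in the script. Unlike the script, the sketch hands leftover epochs to the earliest stages rather than dropping them, and it starts the best score at negative infinity rather than 0.0.

```python
def plan_stages(num_iterations: int, curriculum_stages: int) -> list[int]:
    """Split a total epoch budget across curriculum stages.

    Every stage gets at least one epoch; any remainder goes to the
    earliest stages. (Hypothetical helper, not part of the script.)
    """
    n_stages = max(1, curriculum_stages)
    base, extra = divmod(max(num_iterations, n_stages), n_stages)
    return [base + (1 if i < extra else 0) for i in range(n_stages)]


class EarlyStopper:
    """Track the best score and count epochs since it last improved."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def update(self, score: float, epochs: int) -> bool:
        """Record a score measured after `epochs` more epochs.

        Returns True when training should stop.
        """
        if score > self.best:
            self.best = score
            self.stale_epochs = 0
        else:
            self.stale_epochs += epochs
        return self.stale_epochs >= self.patience
```

Because evaluation only happens once per stage, patience is effectively measured in whole stages: with the defaults (50 epochs, 3 stages, patience 10), a single stage without improvement is already enough to stop.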

inference.py ADDED

	@@ -0,0 +1,368 @@

+#!/usr/bin/env python3
+"""
+Hackathon-compliant inference script for WhipStudio ML Debug Environment.
+This script follows the Scaler Meta PyTorch Hackathon requirements:
+- Uses OpenAI-compatible client with API_BASE_URL and MODEL_NAME
+- Emits structured stdout logs: [START], [STEP], [END]
+- Respects runtime limit (<20 min) and resource constraints
+Environment Variables:
+    API_BASE_URL: The API endpoint for the LLM (e.g., https://api.openai.com/v1)
+    MODEL_NAME: The model identifier (e.g., gpt-4, Qwen/Qwen2.5-Coder-32B-Instruct)
+    HF_TOKEN: Your API key / HuggingFace token
+Usage:
+    # With environment at localhost
+    python inference.py --env-url http://localhost:7860
+    # With HF Space
+    python inference.py --env-url https://your-space.hf.space
+"""
+import argparse
+import json
+import os
+import sys
+import time
+from typing import Any
+import httpx
+from openai import OpenAI
+# ── Configuration ─────────────────────────────────────────────────────────────
+SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
+You receive a broken training script and must fix ALL bugs in it.
+Rules:
+- Return ONLY the complete corrected Python code, nothing else.
+- No markdown, no backticks, no explanation text.
+- The script must print losses in format: LOSSES:[v1, v2, ...]
+- For tasks requiring validation metrics, also print: VAL_ACC:X.XX or VAL_ACCS:[v1,...] and FINAL_LOSS:X.XX
+- Keep all torch.manual_seed() calls intact.
+- Wrap all metrics in ##METRICS_START## and ##METRICS_END## markers.""".strip()
+TASK_IDS = ["task1", "task2", "task3", "task4", "task5"]
+MAX_ATTEMPTS_PER_TASK = 3
+REQUEST_TIMEOUT = 180.0  # 3 minutes per LLM call
+STEP_TIMEOUT = 120.0     # 2 minutes per step (code execution)
+# ── Logging Helpers ───────────────────────────────────────────────────────────
+def log_start(task_id: str) -> None:
+    """Emit [START] log for a task."""
+    print(f"[START] task_id={task_id}", flush=True)
+def log_step(task_id: str, step: int, action_summary: str, reward: float, done: bool) -> None:
+    """Emit [STEP] log for a step within a task."""
+    print(
+        f"[STEP] task_id={task_id} step={step} action={action_summary} reward={reward:.4f} done={str(done).lower()}",
+        flush=True
+    )
+def log_end(task_id: str, final_score: float) -> None:
+    """Emit [END] log for a task."""
+    print(f"[END] task_id={task_id} final_score={final_score:.4f}", flush=True)
+# ── LLM Client ────────────────────────────────────────────────────────────────
+def get_openai_client() -> OpenAI:
+    """Initialize OpenAI-compatible client from environment variables."""
+    api_base = os.environ.get("API_BASE_URL")
+    api_key = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY")
+    if not api_key:
+        raise RuntimeError(
+            "HF_TOKEN or OPENAI_API_KEY must be set in environment"
+        )
+    # Default to OpenAI API if no base URL specified
+    if not api_base:
+        api_base = "https://api.openai.com/v1"
+    return OpenAI(
+        base_url=api_base,
+        api_key=api_key,
+        timeout=REQUEST_TIMEOUT,
+    )
+def get_model_name() -> str:
+    """Get model name from environment or use default."""
+    return os.environ.get("MODEL_NAME", "gpt-4o-mini")
+def generate_fix(client: OpenAI, model: str, prompt: str) -> str:
+    """Generate a code fix using the LLM."""
+    try:
+        response = client.chat.completions.create(
+            model=model,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": prompt},
+            ],
+            temperature=0.2,
+            max_tokens=4096,
+        )
+        content = response.choices[0].message.content or ""
+        # Strip markdown fences if present
+        if "```python" in content:
+            content = content.split("```python", 1)[1].split("```", 1)[0].strip()
+        elif "```" in content:
+            content = content.split("```", 1)[1].split("```", 1)[0].strip()
+        return content.strip()
+    except Exception as e:
+        print(f"[ERROR] LLM call failed: {e}", file=sys.stderr)
+        return ""
+# ── Environment Client ────────────────────────────────────────────────────────
+class WhipStudioClient:
+    """HTTP client for the WhipStudio environment."""
+    def __init__(self, env_url: str):
+        self.env_url = env_url.rstrip("/")
+        self.timeout = httpx.Timeout(STEP_TIMEOUT, connect=10.0)
+    def health_check(self) -> bool:
+        """Check if the environment is reachable."""
+        try:
+            with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
+                resp = client.get(f"{self.env_url}/health")
+                return resp.status_code == 200
+        except Exception:
+            return False
+    def reset(self, task_id: str) -> dict:
+        """Reset environment to a specific task."""
+        with httpx.Client(timeout=self.timeout) as client:
+            resp = client.post(
+                f"{self.env_url}/reset",
+                json={"task_id": task_id}
+            )
+            resp.raise_for_status()
+            data = resp.json()
+            return data.get("observation", data)
+    def step(self, fixed_code: str, attempt_number: int = 1) -> dict:
+        """Submit a fix and get the result."""
+        payload = {
+            "action": {
+                "fixed_code": fixed_code,
+                "attempt_number": attempt_number,
+            }
+        }
+        with httpx.Client(timeout=self.timeout) as client:
+            resp = client.post(f"{self.env_url}/step", json=payload)
+            # Handle potential 422 from different API versions
+            if resp.status_code == 422:
+                resp = client.post(
+                    f"{self.env_url}/step",
+                    json={
+                        "fixed_code": fixed_code,
+                        "attempt_number": attempt_number,
+                    }
+                )
+            resp.raise_for_status()
+            return resp.json()
+    def get_tasks(self) -> list[str]:
+        """Get list of available tasks (returns task IDs only)."""
+        try:
+            with httpx.Client(timeout=self.timeout) as client:
+                resp = client.get(f"{self.env_url}/tasks")
+                if resp.status_code == 200:
+                    data = resp.json()
+                    if isinstance(data, list):
+                        # Extract task IDs from task objects
+                        task_ids = []
+                        for t in data:
+                            if isinstance(t, dict):
+                                task_ids.append(t.get("id", str(t)))
+                            else:
+                                task_ids.append(str(t))
+                        return task_ids
+                    elif isinstance(data, dict):
+                        tasks = data.get("tasks", [])
+                        return [t.get("id", str(t)) if isinstance(t, dict) else str(t) for t in tasks]
+        except Exception as e:
+            print(f"[WARNING] Could not fetch tasks from /tasks endpoint: {e}", file=sys.stderr)
+        # Fallback to default task IDs
+        return TASK_IDS
+# ── Main Inference Loop ───────────────────────────────────────────────────────
+def build_prompt(obs: dict) -> str:
+    """Build the user prompt from observation."""
+    task_desc = obs.get("task_description", "Fix the buggy code.")
+    buggy_code = obs.get("buggy_code", "")
+    error_log = obs.get("error_log", "None")
+    last_reward = obs.get("last_reward", 0.0)
+    return f"""Task: {task_desc}
+Buggy code:
+{buggy_code}
+Previous execution output (if any):
+{error_log}
+Previous score: {last_reward}""".strip()
+def run_task(
+    env: WhipStudioClient,
+    llm_client: OpenAI,
+    model: str,
+    task_id: str,
+) -> float:
+    """Run inference on a single task. Returns the best score achieved."""
+    # Ensure task_id is a string
+    if isinstance(task_id, dict):
+        task_id = task_id.get("id", str(task_id))
+    log_start(task_id)
+    try:
+        obs = env.reset(task_id)
+    except Exception as e:
+        error_msg = str(e)
+        print(f"[ERROR] Failed to reset {task_id}: {error_msg}", file=sys.stderr)
+        # Check if it's a 500 error - likely environment issue
+        if "500" in error_msg:
+            print("[ERROR] HF Space returned 500 - the environment may be starting up or having issues", file=sys.stderr)
+            print(f"[ERROR] Try visiting {env.env_url} in a browser first", file=sys.stderr)
+        log_end(task_id, 0.0)
+        return 0.0
+    best_score = 0.0
+    for step in range(1, MAX_ATTEMPTS_PER_TASK + 1):
+        prompt = build_prompt(obs)
+        # Generate fix
+        fixed_code = generate_fix(llm_client, model, prompt)
+        if not fixed_code.strip():
+            log_step(task_id, step, "empty_response", 0.0, False)
+            continue
+        # Submit fix
+        try:
+            result = env.step(fixed_code, attempt_number=step)
+            reward = float(result.get("reward", 0.0) or 0.0)
+            done = result.get("done", False)
+            obs = result.get("observation", obs)
+            # Track best score
+            if reward > best_score:
+                best_score = reward
+            # Log step
+            code_len = len(fixed_code)
+            log_step(task_id, step, f"submit_fix({code_len}chars)", reward, done)
+            # Early exit if task is solved
+            if done or reward >= 0.95:
+                break
+        except Exception as e:
+            print(f"[ERROR] Step failed for {task_id}: {e}", file=sys.stderr)
+            log_step(task_id, step, "step_error", 0.0, False)
+    log_end(task_id, best_score)
+    return best_score
+def main():
+    parser = argparse.ArgumentParser(
+        description="WhipStudio inference script for OpenEnv Hackathon"
+    )
+    parser.add_argument(
+        "--env-url",
+        default=os.environ.get("ENV_URL", "http://localhost:7860"),
+        help="URL of the WhipStudio environment"
+    )
+    parser.add_argument(
+        "--tasks",
+        nargs="+",
+        default=None,
+        help="Specific tasks to run (default: all tasks)"
+    )
+    args = parser.parse_args()
+    # Initialize clients
+    print(f"[INFO] Connecting to environment at {args.env_url}", flush=True)
+    env = WhipStudioClient(args.env_url)
+    if not env.health_check():
+        print(f"[ERROR] Cannot reach environment at {args.env_url}", file=sys.stderr)
+        sys.exit(1)
+    print("[INFO] Environment is reachable", flush=True)
+    # Initialize LLM client
+    llm_client = get_openai_client()
+    model = get_model_name()
+    print(f"[INFO] Using model: {model}", flush=True)
+    # Determine which tasks to run
+    if args.tasks:
+        task_ids = args.tasks
+    else:
+        task_ids = env.get_tasks()
+    print(f"[INFO] Running tasks: {task_ids}", flush=True)
+    # Run inference on all tasks
+    start_time = time.time()
+    scores = {}
+    for task_id in task_ids:
+        task_start = time.time()
+        score = run_task(env, llm_client, model, task_id)
+        scores[task_id] = score
+        task_elapsed = time.time() - task_start
+        print(f"[INFO] {task_id} completed in {task_elapsed:.1f}s with score {score:.4f}", flush=True)
+    # Summary
+    total_elapsed = time.time() - start_time
+    avg_score = sum(scores.values()) / len(scores) if scores else 0.0
+    print("\n" + "=" * 50, flush=True)
+    print("[SUMMARY]", flush=True)
+    print(f"  Tasks completed: {len(scores)}", flush=True)
+    print(f"  Total time: {total_elapsed:.1f}s", flush=True)
+    print(f"  Average score: {avg_score:.4f}", flush=True)
+    print("  Per-task scores:", flush=True)
+    for tid, score in scores.items():
+        print(f"    {tid}: {score:.4f}", flush=True)
+    print("=" * 50, flush=True)
+    # Warn (without changing the exit code) if the average score is very low
+    if avg_score < 0.1:
+        print("[WARNING] Average score below 0.1 threshold", file=sys.stderr)
+if __name__ == "__main__":
+    main()
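The `[START]`, `[STEP]`, and `[END]` lines emitted by `log_start`, `log_step`, and `log_end` are plain `key=value` records, so a downstream checker can read them back with a small parser. The sketch below is one possible way to do that. `parse_log_line` is a hypothetical helper, not part of `inference.py`, and it assumes field values never contain whitespace, which holds for the action summaries used in the script.

```python
import re
from typing import Optional

# One key=value pair; values are assumed to contain no whitespace.
_FIELD = re.compile(r"(\w+)=(\S+)")
_EVENT = re.compile(r"\[(START|STEP|END)\]\s+(.*)")


def parse_log_line(line: str) -> Optional[dict]:
    """Parse one structured log line into a dict, or return None.

    Lines with other tags (for example [INFO] or [ERROR]) are ignored.
    """
    match = _EVENT.match(line.strip())
    if match is None:
        return None
    record: dict = {"event": match.group(1)}
    for key, value in _FIELD.findall(match.group(2)):
        record[key] = value
    # Convert the fields whose types are known from the log helpers.
    for key in ("reward", "final_score"):
        if key in record:
            record[key] = float(record[key])
    if "step" in record:
        record["step"] = int(record["step"])
    if "done" in record:
        record["done"] = record["done"] == "true"
    return record
```

For example, feeding the script's stdout through `parse_log_line` line by line and keeping the `END` records gives the per-task final scores without re-running inference.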

run_inference.sh ADDED

	@@ -0,0 +1,63 @@

+#!/bin/bash
+# Quick setup script for WhipStudio OpenEnv Hackathon
+set -e
+echo "=========================================="
+echo "WhipStudio Hackathon Setup"
+echo "=========================================="
+# Step 1: Check environment variables
+echo ""
+echo "Step 1: Checking environment variables..."
+if [ -z "$HF_TOKEN" ]; then
+    echo "⚠️  HF_TOKEN not set"
+    if [ -f .env ]; then
+        echo "   Loading from .env file..."
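+        # Accept either a bare token or an HF_TOKEN=<token> line in .env
+        HF_TOKEN=$(grep -v '^#' .env | grep -v '^[[:space:]]*$' | head -1)
+        export HF_TOKEN="${HF_TOKEN#HF_TOKEN=}"
+        echo "   ✓ HF_TOKEN loaded"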
+    else
+        echo "   ❌ Please set HF_TOKEN environment variable or create .env file"
+        exit 1
+    fi
+else
+    echo "   ✓ HF_TOKEN is set"
+fi
+if [ -z "$API_BASE_URL" ]; then
+    echo "⚠️  API_BASE_URL not set, using HuggingFace Inference API"
+    export API_BASE_URL="https://api-inference.huggingface.co/v1"
+fi
+echo "   ✓ API_BASE_URL: $API_BASE_URL"
+if [ -z "$MODEL_NAME" ]; then
+    echo "⚠️  MODEL_NAME not set, using default"
+    export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
+fi
+echo "   ✓ MODEL_NAME: $MODEL_NAME"
+# Step 2: Check HF Space
+ENV_URL="${1:-https://amogh-kal1-whipstudio.hf.space}"
+echo ""
+echo "Step 2: Checking HF Space at $ENV_URL..."
+if curl -sf --max-time 10 "$ENV_URL/health" > /dev/null 2>&1; then
+    echo "   ✓ HF Space is reachable"
+else
+    echo "   ❌ HF Space not reachable or still starting up"
+    echo "   Try visiting $ENV_URL in your browser first"
+    exit 1
+fi
+# Step 3: Run inference
+echo ""
+echo "Step 3: Running inference..."
+echo ""
+python3 inference.py --env-url "$ENV_URL"
+echo ""
+echo "=========================================="
+echo "✅ Inference complete!"
+echo "=========================================="

server/app.py CHANGED

@@ -95,6 +95,7 @@ def list_tasks():
             {"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
             {"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
             {"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
         ],
         "action_schema": {
             "fixed_code": "string (required) — complete runnable Python script",
@@ -122,16 +123,26 @@ def run_grader(payload: dict):
 @app.get("/baseline")
 async def run_baseline(request: Request):
     try:
-        from ..baseline_agent import run_single_task
     except ImportError:
-        from baseline_agent import run_single_task
     env_url = str(request.base_url).rstrip("/")
     results = {}
     task_scores = {}
-    for task_id in ["task1", "task2", "task3", "task4", "task5"]:
         try:
-            score = await asyncio.wait_for(run_single_task(task_id, env_url), timeout=120.0)
             results[task_id] = round(score, 4)
             task_scores[task_id] = round(score, 4)
         except TimeoutError:
@@ -147,24 +158,43 @@ async def run_baseline(request: Request):
             task_scores[task_id] = 0.0
             results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
     avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
-    return {"baseline_scores": results, "average": avg, "env_url": env_url}
 @app.get("/baseline/task/{task_id}")
 async def run_baseline_single(task_id: str, request: Request):
     """Run the baseline agent on a single task. Returns score + details."""
     try:
-        from ..baseline_agent import run_single_task_detailed
     except ImportError:
-        from baseline_agent import run_single_task_detailed
     env_url = str(request.base_url).rstrip("/")
     try:
-        result = await asyncio.wait_for(run_single_task_detailed(task_id, env_url), timeout=120.0)
         return {
             "task_id": task_id,
             "score": round(result["score"], 4),
             "status": "ok",
             "fixed_code": result.get("fixed_code", ""),
             "output": result.get("output", ""),
             "attempts": result.get("attempts", []),

             {"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
             {"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
             {"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
+            {"id": "task6", "name": "Input-Output mismatch", "difficulty": "hard"},
         ],
         "action_schema": {
             "fixed_code": "string (required) — complete runnable Python script",
 @app.get("/baseline")
 async def run_baseline(request: Request):
     try:
+        from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
     except ImportError:
+        from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
     env_url = str(request.base_url).rstrip("/")
+    model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
+    if model_id not in SUPPORTED_MODEL_IDS:
+        return {
+            "error": f"Unsupported model_id '{model_id}'",
+            "supported_model_ids": SUPPORTED_MODEL_IDS,
+        }
     results = {}
     task_scores = {}
+    for task_id in ["task1", "task2", "task3", "task4", "task5", "task6"]:
         try:
+            score = await asyncio.wait_for(
+                run_single_task(task_id, env_url, model_id=model_id),
+                timeout=120.0,
+            )
             results[task_id] = round(score, 4)
             task_scores[task_id] = round(score, 4)
         except TimeoutError:
             task_scores[task_id] = 0.0
             results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
     avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
+    return {
+        "baseline_scores": results,
+        "average": avg,
+        "env_url": env_url,
+        "model_id": model_id,
+    }
 @app.get("/baseline/task/{task_id}")
 async def run_baseline_single(task_id: str, request: Request):
     """Run the baseline agent on a single task. Returns score + details."""
     try:
+        from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
     except ImportError:
+        from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
     env_url = str(request.base_url).rstrip("/")
+    model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
+    if model_id not in SUPPORTED_MODEL_IDS:
+        return {
+            "task_id": task_id,
+            "score": 0.0,
+            "status": "error",
+            "error": f"Unsupported model_id '{model_id}'",
+            "supported_model_ids": SUPPORTED_MODEL_IDS,
+        }
     try:
+        result = await asyncio.wait_for(
+            run_single_task_detailed(task_id, env_url, model_id=model_id),
+            timeout=120.0,
+        )
         return {
             "task_id": task_id,
             "score": round(result["score"], 4),
             "status": "ok",
+            "model_id": model_id,
             "fixed_code": result.get("fixed_code", ""),
             "output": result.get("output", ""),
             "attempts": result.get("attempts", []),
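Both baseline endpoints in this hunk read an optional `model_id` query parameter. Because model IDs such as `Qwen/Qwen2.5-Coder-32B-Instruct` contain a slash, a client should URL-encode the value. The sketch below shows one way to build the request URL. `build_baseline_url` is a hypothetical helper and not part of the server or the baseline agent.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_baseline_url(env_url: str, model_id: str, task_id: Optional[str] = None) -> str:
    """Build the URL for /baseline or /baseline/task/{task_id}.

    The model_id is passed as an encoded query parameter; task_id, when
    given, is encoded as a single path segment.
    """
    base = env_url.rstrip("/")
    if task_id is None:
        path = "/baseline"
    else:
        path = f"/baseline/task/{quote(task_id, safe='')}"
    return f"{base}{path}?{urlencode({'model_id': model_id})}"
```

An unsupported `model_id` does not raise an HTTP error in the code above; the server returns a JSON body with an `error` key, so a caller should check for that key before reading scores.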

server/environment.py CHANGED

@@ -9,12 +9,12 @@ from openenv.core.env_server.types import State
 try:
     from ..models import MLDebugAction, MLDebugObservation
     from .sandbox import execute_code
-    from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
     from .tasks.graders import parse_losses, parse_val_accs, score_task
 except ImportError:
     from models import MLDebugAction, MLDebugObservation
     from server.sandbox import execute_code
-    from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
     from server.tasks.graders import parse_losses, parse_val_accs, score_task
 TASKS = {
@@ -23,6 +23,7 @@ TASKS = {
     "task3": task3_oom_leakage,
     "task4": task4_wrong_loss,
     "task5": task5_frozen_backbone,
 }

 try:
     from ..models import MLDebugAction, MLDebugObservation
     from .sandbox import execute_code
+    from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
     from .tasks.graders import parse_losses, parse_val_accs, score_task
 except ImportError:
     from models import MLDebugAction, MLDebugObservation
     from server.sandbox import execute_code
+    from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
     from server.tasks.graders import parse_losses, parse_val_accs, score_task
 TASKS = {
     "task3": task3_oom_leakage,
     "task4": task4_wrong_loss,
     "task5": task5_frozen_backbone,
+    "task6": task6_io_mismatch,
 }

server/tasks/graders.py CHANGED

@@ -109,11 +109,11 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
     Task 1: Broken Training Loop
     Bugs: 1) lr=10.0 (too high), 2) step() before backward()
-    Grading criteria:
-    - Must have low final loss (<0.3) - indicates proper training
-    - Must have high validation accuracy (>0.85) - indicates learning
-    - Must show monotonic improvement - indicates proper gradient flow
-    - Must NOT have loss spikes - indicates stable training
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
     if not valid:
@@ -128,10 +128,10 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
     if not losses:
         return 0.1, {"reason": "no_losses_parsed"}
-    # Check for NaN/Inf - indicates numerical instability
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
     if nan_count > 0:
-        return 0.15, {"reason": "nan_inf_found", "nan_count": nan_count}
     val_acc = parse_scalar(result.stdout, "VAL_ACC")
     if val_acc is None:
@@ -141,33 +141,39 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
     initial_loss = losses[0]
     max_loss = max(losses)
-    # Check for loss instability (spikes indicate LR too high)
-    # Healthy training shouldn't have losses > 5x initial loss
-    if max_loss > initial_loss * 5.0 or max_loss > 10.0:
-        return 0.2, {
             "reason": "loss_unstable_spikes",
             "max_loss": max_loss,
             "final_loss": final_loss,
             "val_acc": val_acc
         }
-    # Check for loss explosion at end
-    if final_loss > 5.0:
-        return 0.15, {"reason": "loss_unstable", "final_loss": final_loss, "val_acc": val_acc}
-    # Primary: Validation accuracy (higher is better, target > 0.85)
-    acc_score = sigmoid_score(val_acc, center=0.85, steepness=15.0, higher_is_better=True) * 0.5
-    # Secondary: Final loss should be low (lower is better, target < 0.3)
-    loss_score = sigmoid_score(final_loss, center=0.3, steepness=8.0, higher_is_better=False) * 0.3
-    # Bonus: Monotonic improvement (loss should decrease over time)
     monotonic_bonus = 0.0
     if len(losses) >= 10:
-        first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
-        last_quarter = sum(losses[-len(losses)//4:]) / (len(losses)//4)
-        if last_quarter < first_quarter * 0.7:  # At least 30% improvement
-            monotonic_bonus = 0.2
     final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
     breakdown = {
@@ -187,10 +193,10 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
     Task 2: NaN Loss
     Bug: torch.log(pred) when pred can be 0.0 after sigmoid
-    Grading criteria:
-    - Must have NO NaN/Inf losses - this is the primary test
-    - Must have good validation accuracy (>0.75)
-    - Must show loss convergence (<0.4)
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
     if not valid:
@@ -207,11 +213,11 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
-    # Primary criterion: NO NaN/Inf allowed - this is the core bug being tested
     nan_ratio = nan_count / len(losses)
     if nan_count > 0:
-        # Heavily penalize any NaN - this is THE bug we're testing
-        return max(0.05, 0.3 * (1.0 - nan_ratio)), {
             "reason": "has_nans",
             "nan_ratio": nan_ratio,
             "nan_count": nan_count
@@ -219,19 +225,19 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
     val_acc = parse_scalar(result.stdout, "VAL_ACC")
     if val_acc is None:
-        return 0.2, {"reason": "no_val_acc_but_no_nans"}
     finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
     final_loss = finite_losses[-1] if finite_losses else float('inf')
-    # No NaN = base score of 0.4 (the bug is fixed)
-    base_score = 0.4
-    # Validation accuracy bonus (higher is better, target > 0.75)
-    acc_score = sigmoid_score(val_acc, center=0.75, steepness=12.0, higher_is_better=True) * 0.35
-    # Convergence bonus (lower is better, target < 0.4)
-    convergence_score = sigmoid_score(final_loss, center=0.4, steepness=6.0, higher_is_better=False) * 0.25
     final_score = min(1.0, base_score + acc_score + convergence_score)
     breakdown = {
@@ -247,14 +253,13 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
 def grade_task3(result: RunResult) -> tuple[float, dict]:
     """
-    Task 3: Memory Leak + Missing zero_grad
-    Bugs: 1) total_loss += loss retains graph (memory leak)
-          2) Missing optimizer.zero_grad() causes gradient accumulation
-    Grading criteria:
-    - FINAL_LOSS should be reasonable (<20) - memory leak fixed
-    - VAL_ACC should be high (>0.8) - gradient accumulation fixed
-    - Learning trajectory should improve over epochs
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
     if not valid:
@@ -271,40 +276,51 @@ def grade_task3(result: RunResult) -> tuple[float, dict]:
     val_accs = parse_val_accs(result.stdout)
     final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
-    # Memory leak check: FINAL_LOSS should be reasonable
-    # With .item(), total_loss is sum of scalars (~12-20 for 20 epochs)
-    memory_score = 0.0
-    if final_loss_val is not None:
-        memory_score = sigmoid_score(final_loss_val, center=20.0, steepness=0.2, higher_is_better=False) * 0.35
-    else:
-        memory_score = 0.0
-    # Gradient accumulation check: accuracy should be high if training properly
-    # Without zero_grad(), gradients accumulate and training degrades
     acc_score = 0.0
     final_acc = 0.0
     early_acc = 0.0
     trajectory_bonus = 0.0
-    if val_accs and len(val_accs) >= 2:
-        early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
-        final_acc = val_accs[-1]
-        # Final accuracy is the main indicator of correct training
-        acc_score = sigmoid_score(final_acc, center=0.8, steepness=15.0, higher_is_better=True) * 0.45
-        # Learning trajectory: should improve over time
-        if len(val_accs) >= 5:
-            improvement = final_acc - early_acc
-            if improvement > 0.05:
-                trajectory_bonus = 0.1
-            elif improvement > 0.0:
-                trajectory_bonus = 0.05
-    final_score = min(1.0, memory_score + acc_score + trajectory_bonus)
     breakdown = {
-        "memory_score": round(memory_score, 4),
         "acc_score": round(acc_score, 4),
         "trajectory_bonus": round(trajectory_bonus, 4),
         "early_acc": round(early_acc, 4),
         "final_acc": round(final_acc, 4),
@@ -318,10 +334,10 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
     Task 4: Wrong Loss (Multi-label Classification)
     Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
-    Grading criteria:
-    - F1 score should be high (> 0.6) - primary metric
-    - avg_labels should be > 1.0 (proper multi-label output)
-    - Loss should converge
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
     if not valid:
@@ -337,29 +353,45 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
     avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
     f1 = parse_scalar(result.stdout, "F1_SCORE")
-    # F1 score - PRIMARY metric (higher is better, target > 0.6)
     f1_score_val = 0.0
     if f1 is not None:
-        f1_score_val = sigmoid_score(f1, center=0.6, steepness=10.0, higher_is_better=True) * 0.5
-    # Multi-label check: avg_labels should be > 1.0 (proper multi-label predictions)
-    # With 30% probability per class and 5 classes, expected avg ~1.5 labels/sample
     labels_score = 0.0
     if avg_labels is not None:
-        if avg_labels < 0.5:
-            # Way too few labels - likely single-label behavior
-            labels_score = 0.0
         elif avg_labels >= 1.0:
-            # Good - multiple labels per sample
-            labels_score = 0.3
         else:
-            # Partial credit
-            labels_score = sigmoid_score(avg_labels, center=1.0, steepness=5.0, higher_is_better=True) * 0.3
-    # Loss convergence (lower is better, target < 0.5)
     loss_score = 0.0
     if final_loss is not None:
-        loss_score = sigmoid_score(final_loss, center=0.5, steepness=4.0, higher_is_better=False) * 0.2
     final_score = min(1.0, f1_score_val + labels_score + loss_score)
     breakdown = {
@@ -379,14 +411,14 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
     Bug: Backbone is frozen but still passed to optimizer (wastes memory)
     Valid fixes:
-    1. Unfreeze backbone -> grad_norm > 0, same param count
-    2. Only pass head params to optimizer -> grad_norm = 0, reduced param count
-    The buggy code has: grad_norm = 0, param_count = 530442 (full model)
-    Grading criteria:
-    - Either backbone has gradients (unfrozen), OR
-    - Optimizer param count is reduced (only head)
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
     if not valid:
@@ -402,30 +434,39 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
     grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
     param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
-    # Loss should be reasonable (10-class classification, CE loss)
-    loss_score = 0.0
-    if final_loss is not None:
-        loss_score = sigmoid_score(final_loss, center=2.5, steepness=2.0, higher_is_better=False) * 0.3
-    # The bug: frozen backbone (grad_norm=0) but full params in optimizer (param_count=530442)
-    # Fix 1: Unfreeze -> grad_norm > 0 (any amount)
-    # Fix 2: Only head -> param_count < 100000 (head has ~5130 params)
     fix_score = 0.0
     fix_type = "none"
-    if grad_norm is not None and grad_norm > 0.1:
-        # Backbone is unfrozen and training
-        fix_score = 0.7
         fix_type = "unfrozen"
-    elif param_count is not None and param_count < 100000:
-        # Only head params in optimizer (head has ~5130 params)
-        fix_score = 0.7
         fix_type = "head_only"
-    elif grad_norm is not None and grad_norm == 0.0 and (param_count is None or param_count > 100000):
-        # Buggy state: frozen backbone but full params in optimizer
-        fix_score = 0.0
-        fix_type = "buggy"
     final_score = min(1.0, loss_score + fix_score)
     breakdown = {
@@ -439,6 +480,192 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
     return final_score, breakdown
 def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
     graders = {
         "task1": grade_task1,
@@ -446,6 +673,7 @@ def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
         "task3": grade_task3,
         "task4": grade_task4,
         "task5": grade_task5,
     }
     if task_id not in graders:
         raise ValueError(f"Unknown task_id: {task_id}")

     Task 1: Broken Training Loop
     Bugs: 1) lr=10.0 (too high), 2) step() before backward()
+    Grading criteria (STRICT thresholds for differentiation):
+    - VAL_ACC > 0.90 required for high score (target is >0.85)
+    - Final loss < 0.15 required for high score (target is <0.3)
+    - Must show monotonic improvement
+    - Penalize any instability heavily
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
     if not valid:
     if not losses:
         return 0.1, {"reason": "no_losses_parsed"}
+    # Check for NaN/Inf - indicates numerical instability (LR bug not fully fixed)
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
     if nan_count > 0:
+        return 0.1, {"reason": "nan_inf_found", "nan_count": nan_count}
     val_acc = parse_scalar(result.stdout, "VAL_ACC")
     if val_acc is None:
     initial_loss = losses[0]
     max_loss = max(losses)
+    # STRICT: Check for loss instability (spikes indicate LR still too high)
+    if max_loss > initial_loss * 3.0 or max_loss > 5.0:
+        return 0.15, {
             "reason": "loss_unstable_spikes",
             "max_loss": max_loss,
             "final_loss": final_loss,
             "val_acc": val_acc
         }
+    # STRICT: Loss must converge well
+    if final_loss > 1.0:
+        return 0.2, {"reason": "loss_not_converged", "final_loss": final_loss, "val_acc": val_acc}
+    # STRICT thresholds - center points raised for better differentiation
+    # Target: val_acc > 0.90, final_loss < 0.15
+    # Primary: Validation accuracy (60% weight)
+    # Use steeper sigmoid for sharper differentiation
+    acc_score = sigmoid_score(val_acc, center=0.90, steepness=25.0, higher_is_better=True) * 0.60
+    # Secondary: Final loss (30% weight) - must be low
+    loss_score = sigmoid_score(final_loss, center=0.15, steepness=15.0, higher_is_better=False) * 0.30
+    # Bonus: Monotonic improvement - must be significant
     monotonic_bonus = 0.0
     if len(losses) >= 10:
+        first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
+        last_half = sum(losses[-len(losses)//2:]) / (len(losses)//2)
+        improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
+        if improvement_ratio > 0.5:  # >50% improvement required
+            monotonic_bonus = 0.10
+        elif improvement_ratio > 0.3:
+            monotonic_bonus = 0.05
     final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
     breakdown = {
     Task 2: NaN Loss
     Bug: torch.log(pred) when pred can be 0.0 after sigmoid
+    Grading criteria (STRICT - NaN elimination is PRIMARY):
+    - ZERO NaN/Inf required (this is the bug!)
+    - VAL_ACC > 0.80 required for high score
+    - Loss must converge < 0.3
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
     if not valid:
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
+    # PRIMARY: NO NaN/Inf at ALL - this is THE bug being tested
     nan_ratio = nan_count / len(losses)
     if nan_count > 0:
+        # STRICT: Any NaN = major failure (max 0.25 score)
+        return max(0.05, 0.25 * (1.0 - nan_ratio)), {
             "reason": "has_nans",
             "nan_ratio": nan_ratio,
             "nan_count": nan_count
     val_acc = parse_scalar(result.stdout, "VAL_ACC")
     if val_acc is None:
+        return 0.25, {"reason": "no_val_acc_but_no_nans"}
     finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
     final_loss = finite_losses[-1] if finite_losses else float('inf')
+    # No NaN = base score of 0.35 (bug is fixed but need to verify quality)
+    base_score = 0.35
+    # STRICT: Validation accuracy (40% weight, center at 0.80)
+    acc_score = sigmoid_score(val_acc, center=0.80, steepness=20.0, higher_is_better=True) * 0.40
+    # STRICT: Convergence (25% weight, center at 0.25)
+    convergence_score = sigmoid_score(final_loss, center=0.25, steepness=10.0, higher_is_better=False) * 0.25
     final_score = min(1.0, base_score + acc_score + convergence_score)
     breakdown = {
 def grade_task3(result: RunResult) -> tuple[float, dict]:
     """
+    Task 3: Label Inversion Bug
+    Bug: criterion(out, 1 - yb) inverts labels — should be criterion(out, yb)
+    Grading criteria (STRICT - accuracy is PRIMARY):
+    - VAL_ACC > 0.90 required (buggy code gives ~0.50)
+    - FINAL_LOSS < 0.25 required
+    - Must show learning trajectory improvement
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
     if not valid:
     val_accs = parse_val_accs(result.stdout)
     final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
+    # CRITICAL CHECK: Buggy code produces ~0.50 accuracy (random)
+    # Fixed code should produce >0.90 accuracy
     acc_score = 0.0
     final_acc = 0.0
     early_acc = 0.0
     trajectory_bonus = 0.0
+    if not val_accs or len(val_accs) < 2:
+        return 0.15, {"reason": "no_val_accs_parsed"}
+    early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
+    final_acc = val_accs[-1]
+    # STRICT: Final accuracy must be high (>0.90 target)
+    # The bug makes accuracy ~0.50, so anything <0.65 is likely unfixed
+    if final_acc < 0.65:
+        return 0.15, {
+            "reason": "accuracy_too_low_likely_unfixed",
+            "final_acc": final_acc,
+            "expected": ">0.90 for fixed code"
+        }
+    # Primary: Final accuracy (60% weight, center at 0.92)
+    acc_score = sigmoid_score(final_acc, center=0.92, steepness=30.0, higher_is_better=True) * 0.60
+    # Secondary: Loss convergence (25% weight)
+    loss_score = 0.0
+    if final_loss_val is not None:
+        loss_score = sigmoid_score(final_loss_val, center=0.20, steepness=12.0, higher_is_better=False) * 0.25
+    # Bonus: Learning trajectory (15% weight)
+    if len(val_accs) >= 5:
+        improvement = final_acc - early_acc
+        if improvement > 0.15:  # Significant learning
+            trajectory_bonus = 0.15
+        elif improvement > 0.05:
+            trajectory_bonus = 0.08
+        elif improvement > 0.0:
+            trajectory_bonus = 0.03
+    final_score = min(1.0, acc_score + loss_score + trajectory_bonus)
     breakdown = {
         "acc_score": round(acc_score, 4),
+        "loss_score": round(loss_score, 4),
         "trajectory_bonus": round(trajectory_bonus, 4),
         "early_acc": round(early_acc, 4),
         "final_acc": round(final_acc, 4),
     Task 4: Wrong Loss (Multi-label Classification)
     Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
+    Grading criteria (STRICT):
+    - F1 > 0.70 required (buggy code gives ~0.2-0.3)
+    - avg_labels > 1.2 required (proper multi-hot predictions)
+    - Loss must converge < 0.4
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
     if not valid:
     avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
     f1 = parse_scalar(result.stdout, "F1_SCORE")
+    # CRITICAL: Check for multi-label behavior
+    # With CrossEntropyLoss, model predicts only 1 label per sample (avg_labels ≈ 1.0)
+    # With BCEWithLogitsLoss, model should predict multiple (avg_labels > 1.0)
+    if avg_labels is not None and avg_labels < 0.8:
+        return 0.15, {
+            "reason": "too_few_labels_single_label_behavior",
+            "avg_labels": avg_labels,
+            "expected": ">1.2 for multi-label"
+        }
+    # STRICT: F1 score - PRIMARY metric (55% weight)
     f1_score_val = 0.0
     if f1 is not None:
+        if f1 < 0.40:
+            # Very low F1 indicates bug not fixed
+            return 0.20, {
+                "reason": "f1_too_low_likely_unfixed",
+                "f1": f1,
+                "expected": ">0.65 for fixed code"
+            }
+        f1_score_val = sigmoid_score(f1, center=0.70, steepness=15.0, higher_is_better=True) * 0.55
+    else:
+        return 0.15, {"reason": "no_f1_score_parsed"}
+    # Multi-label check: avg_labels (25% weight)
     labels_score = 0.0
     if avg_labels is not None:
+        if avg_labels >= 1.3:
+            labels_score = 0.25  # Full score for proper multi-label
         elif avg_labels >= 1.0:
+            labels_score = 0.15  # Partial - borderline multi-label
         else:
+            labels_score = sigmoid_score(avg_labels, center=1.0, steepness=8.0, higher_is_better=True) * 0.15
+    # Loss convergence (20% weight)
     loss_score = 0.0
     if final_loss is not None:
+        loss_score = sigmoid_score(final_loss, center=0.35, steepness=8.0, higher_is_better=False) * 0.20
     final_score = min(1.0, f1_score_val + labels_score + loss_score)
     breakdown = {
     Bug: Backbone is frozen but still passed to optimizer (wastes memory)
     Valid fixes:
+    1. Unfreeze backbone -> grad_norm > 0
+    2. Only pass head params to optimizer -> param_count < 15000
+    Buggy state: grad_norm = 0, param_count = 530442
+    Grading criteria (STRICT - binary fix detection):
+    - Must demonstrate ONE of the two valid fixes
+    - Loss must be reasonable (<3.0 for CrossEntropy on 10 classes)
     """
     valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
     if not valid:
     grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
     param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
+    # Detect fix type FIRST
     fix_score = 0.0
     fix_type = "none"
+    # Fix 1: Unfreeze backbone (grad_norm > 0)
+    if grad_norm is not None and grad_norm > 0.01:
+        fix_score = 0.70
         fix_type = "unfrozen"
+    # Fix 2: Only head params (param_count should be ~5130 for Linear(512, 10))
+    elif param_count is not None and param_count < 15000:
+        fix_score = 0.70
         fix_type = "head_only"
+    # Buggy state: frozen (grad_norm=0) but full params (param_count > 100000; buggy value is 530442)
+    elif grad_norm is not None and grad_norm == 0.0:
+        if param_count is not None and param_count > 100000:
+            return 0.10, {
+                "reason": "buggy_state_unchanged",
+                "grad_norm": grad_norm,
+                "param_count": param_count,
+                "hint": "Either unfreeze backbone or only pass head params to optimizer"
+            }
+    if fix_score == 0.0:
+        return 0.15, {
+            "reason": "could_not_detect_valid_fix",
+            "grad_norm": grad_norm,
+            "param_count": param_count
+        }
+    # Loss should be reasonable (30% weight)
+    loss_score = 0.0
+    if final_loss is not None:
+        loss_score = sigmoid_score(final_loss, center=2.5, steepness=3.0, higher_is_better=False) * 0.30
     final_score = min(1.0, loss_score + fix_score)
     breakdown = {
     return final_score, breakdown
+def grade_task6(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 6: Input-Output Mismatch (Multiple Bugs)
+    Bugs to fix:
+    1. Shape mismatch: 32x32 images but model expects 28x28
+    2. Channel order: HWC format but model expects CHW
+    3. Label encoding: One-hot labels but CrossEntropyLoss expects indices
+    4. Batch dimension: Single sample missing batch dim in validation
+    Anti-gaming measures:
+    - Must have actual CNN training (convolutions detected in code)
+    - Must show learning trajectory (loss decrease)
+    - Must have reasonable epoch count (>= 15 parsed loss values; the task trains for 30 epochs)
+    - Penalize hardcoded metrics or unrealistic outputs
+    - Check for actual tensor operations (permute, reshape, etc.)
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task6")
+    if not valid:
+        return 0.0, {"reason": reason}
+    if result.timed_out:
+        return 0.05, {"reason": "timed_out"}
+    if result.exit_code != 0:
+        # Check for specific error types that indicate partial fixes
+        stderr_lower = result.stderr.lower()
+        if "shape" in stderr_lower or "size" in stderr_lower:
+            return 0.10, {"reason": "shape_error_unfixed", "stderr": result.stderr[:500]}
+        if "dimension" in stderr_lower or "dim" in stderr_lower:
+            return 0.10, {"reason": "dimension_error_unfixed", "stderr": result.stderr[:500]}
+        if "expected" in stderr_lower and "got" in stderr_lower:
+            return 0.10, {"reason": "type_mismatch_unfixed", "stderr": result.stderr[:500]}
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
+    code = result.fixed_code
+    # ANTI-GAMING: Check for genuine CNN architecture (not replaced with fake output)
+    has_conv = "Conv2d" in code or "conv2d" in code
+    has_training_loop = "optimizer.step()" in code
+    has_model_forward = "model(" in code
+    if not has_conv:
+        return 0.05, {"reason": "gaming_no_convolution", "hint": "Original CNN architecture must be preserved"}
+    if not has_training_loop:
+        return 0.05, {"reason": "gaming_no_training", "hint": "Must have actual training loop"}
+    if not has_model_forward:
+        return 0.05, {"reason": "gaming_no_forward", "hint": "Must use model for inference"}
+    # Parse metrics
+    losses = parse_losses(result.stdout)
+    val_acc = parse_scalar(result.stdout, "VAL_ACC")
+    final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
+    # ANTI-GAMING: Check for hardcoded/faked metrics
+    if "print('VAL_ACC:0.9" in code or "print(\"VAL_ACC:0.9" in code:
+        return 0.05, {"reason": "gaming_hardcoded_metrics"}
+    # ANTI-GAMING: Require reasonable number of loss values (epoch count)
+    if len(losses) < 15:
+        return 0.15, {"reason": "too_few_epochs", "epoch_count": len(losses), "expected": ">=15"}
+    # ANTI-GAMING: Loss should show learning (not flat or random)
+    if len(losses) >= 10:
+        first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
+        last_quarter = sum(losses[-len(losses)//4:]) / (len(losses)//4)
+        if first_quarter <= last_quarter:
+            return 0.20, {
+                "reason": "no_learning_trajectory",
+                "first_quarter_loss": round(first_quarter, 4),
+                "last_quarter_loss": round(last_quarter, 4),
+                "hint": "Loss should decrease during training"
+            }
+    # ANTI-GAMING: Check for unrealistic perfect scores with no learning
+    # High accuracy is OK if there's a valid learning trajectory
+    if val_acc is not None and val_acc > 0.99:
+        # Only flag if loss didn't converge properly (suggests hardcoded output)
+        if final_loss is not None and final_loss > 0.1:
+            return 0.25, {"reason": "unrealistic_accuracy_no_convergence", "val_acc": val_acc, "final_loss": final_loss}
+    if val_acc is None:
+        return 0.15, {"reason": "no_val_acc_parsed"}
+    if final_loss is None:
+        return 0.15, {"reason": "no_final_loss_parsed"}
+    # Check for NaN/Inf in losses
+    nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
+    if nan_count > 0:
+        return 0.10, {"reason": "nan_in_losses", "nan_count": nan_count}
+    # ====== BUG FIX DETECTION ======
+    bug_fixes_detected = 0
+    fix_details = {}
+    # Bug 1: Shape fix - check for resize or architecture change
+    has_resize = any(kw in code for kw in ["resize", "interpolate", "F.adaptive", "8 * 8", "8*8"])
+    has_28_in_data = "28, 28" in code or "28,28" in code
+    if has_resize or has_28_in_data:
+        bug_fixes_detected += 1
+        fix_details["shape_fix"] = True
+    else:
+        fix_details["shape_fix"] = False
+    # Bug 2: Channel order fix - check for permute/transpose OR data created in CHW format
+    has_permute = "permute" in code or "transpose" in code or "contiguous" in code
+    has_channel_reorder = ".permute(0, 3, 1, 2)" in code or "permute(0,3,1,2)" in code
+    # Alternative fix: data created directly in CHW format (n_samples, 1, H, W)
+    has_chw_data = any(pat in code for pat in ["n_samples, 1, 28", "n_samples, 1, 32", "(n_samples, 1,"])
+    if has_permute or has_channel_reorder or has_chw_data:
+        bug_fixes_detected += 1
+        fix_details["channel_fix"] = True
+    else:
+        fix_details["channel_fix"] = False
+    # Bug 3: Label encoding fix - check for argmax or returning indices
+    has_label_fix = any(kw in code for kw in [
+        "argmax", "class_indices", "torch.arange",
+        "labels.long()", "y.long()", "remove one_hot"
+    ])
+    # Also check that one_hot no longer appears in any uncommented code
+    no_onehot = not any("one_hot" in line.split("#", 1)[0] for line in code.splitlines())
+    if has_label_fix or no_onehot:
+        bug_fixes_detected += 1
+        fix_details["label_fix"] = True
+    else:
+        fix_details["label_fix"] = False
+    # Bug 4: Batch dimension fix - check for unsqueeze on single sample
+    has_batch_fix = any(kw in code for kw in ["unsqueeze(0)", "unsqueeze( 0)", "[None,", "[None ,"])
+    if has_batch_fix:
+        bug_fixes_detected += 1
+        fix_details["batch_fix"] = True
+    else:
+        fix_details["batch_fix"] = False
+    # ====== SCORING ======
+    # Base score from bug fixes (40% weight - 10% per bug)
+    bug_fix_score = 0.10 * bug_fixes_detected
+    # Accuracy score (35% weight) - strict threshold
+    # With 5 classes, random is 20%, buggy is ~20-30%, fixed should be >80%
+    if val_acc < 0.50:
+        # Below 50% suggests not all bugs fixed
+        acc_score = 0.0
+        acc_penalty_reason = "accuracy_too_low"
+    else:
+        acc_score = sigmoid_score(val_acc, center=0.82, steepness=20.0, higher_is_better=True) * 0.35
+        acc_penalty_reason = None
+    # Loss convergence score (15% weight)
+    loss_score = sigmoid_score(final_loss, center=0.40, steepness=8.0, higher_is_better=False) * 0.15
+    # Learning trajectory bonus (10% weight)
+    trajectory_bonus = 0.0
+    if len(losses) >= 10:
+        first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
+        last_half = sum(losses[-len(losses)//2:]) / (len(losses)//2)
+        improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
+        if improvement_ratio > 0.5:
+            trajectory_bonus = 0.10
+        elif improvement_ratio > 0.3:
+            trajectory_bonus = 0.05
+    final_score = min(1.0, bug_fix_score + acc_score + loss_score + trajectory_bonus)
+    breakdown = {
+        "bug_fix_score": round(bug_fix_score, 4),
+        "bugs_fixed": bug_fixes_detected,
+        "fix_details": fix_details,
+        "acc_score": round(acc_score, 4),
+        "loss_score": round(loss_score, 4),
+        "trajectory_bonus": round(trajectory_bonus, 4),
+        "val_acc": val_acc,
+        "final_loss": final_loss,
+        "epoch_count": len(losses),
+    }
+    if acc_penalty_reason:
+        breakdown["acc_penalty_reason"] = acc_penalty_reason
+    return final_score, breakdown
 def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
     graders = {
         "task1": grade_task1,
         "task3": grade_task3,
         "task4": grade_task4,
         "task5": grade_task5,
+        "task6": grade_task6,
     }
     if task_id not in graders:
         raise ValueError(f"Unknown task_id: {task_id}")
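
The grader changes above lean on `sigmoid_score(value, center, steepness, higher_is_better)`, whose definition is outside this diff. A minimal sketch, assuming it is a plain logistic curve (that form is an assumption, not confirmed by the source), shows why raising the center and steepness separates borderline runs from good ones. `old_acc_term` and `new_acc_term` are hypothetical helpers that reuse the Task 1 parameters from the removed and added lines.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float,
                  higher_is_better: bool = True) -> float:
    """Hypothetical stand-in for the grader helper: logistic curve in [0, 1]."""
    x = steepness * (value - center)
    if not higher_is_better:
        x = -x
    # Clamp the exponent so extreme inputs cannot overflow math.exp.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def old_acc_term(val_acc: float) -> float:
    # Task 1 accuracy term before this commit.
    return sigmoid_score(val_acc, center=0.85, steepness=15.0) * 0.5


def new_acc_term(val_acc: float) -> float:
    # Task 1 accuracy term after this commit.
    return sigmoid_score(val_acc, center=0.90, steepness=25.0) * 0.60


if __name__ == "__main__":
    for acc in (0.80, 0.88, 0.95):
        print(f"val_acc={acc:.2f} old={old_acc_term(acc):.3f} new={new_acc_term(acc):.3f}")
```

Under this assumed form, a metric sitting exactly at `center` earns half of the term's weight, so moving the center from 0.85 to 0.90 directly lowers what a mediocre run can collect.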

server/tasks/task3_oom_leakage.py CHANGED

@@ -1,7 +1,11 @@
 TASK_DESCRIPTION = """
 This binary classification trainer has a bug causing validation accuracy around 50%.
-Fix the bug. After 20 epochs: VAL_ACC > 0.90, FINAL_LOSS < 0.3.
 Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
 """
 BUGGY_CODE = """

 TASK_DESCRIPTION = """
 This binary classification trainer has a bug causing validation accuracy around 50%.
+The bug inverts the labels during training. Fix it so after 20 epochs:
+- VAL_ACC > 0.90 (the primary goal)
+- FINAL_LOSS < 0.3
 Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
+Wrap output in ##METRICS_START## and ##METRICS_END##
 """
 BUGGY_CODE = """
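
The updated Task 3 description asks submissions to print `VAL_ACCS:[v1,v2,...]` and `FINAL_LOSS:X.XX` between `##METRICS_START##` and `##METRICS_END##`. A minimal sketch of emitting and reading that block follows. `emit_metrics` and `extract_metrics` are hypothetical helpers for illustration, not the repository's `parse_val_accs` or `parse_scalar`, whose implementations are not shown in this diff.

```python
import json
import re

START, END = "##METRICS_START##", "##METRICS_END##"


def emit_metrics(val_accs: list[float], final_loss: float) -> str:
    """Format metrics the way the task description asks for."""
    accs = ",".join(f"{a:.4f}" for a in val_accs)
    return f"{START}\nVAL_ACCS:[{accs}]\nFINAL_LOSS:{final_loss:.2f}\n{END}"


def extract_metrics(stdout: str) -> dict:
    """Hypothetical parser: reads only the text between the two markers."""
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    match = re.search(pattern, stdout, re.DOTALL)
    if match is None:
        return {}
    metrics = {}
    for line in match.group(1).strip().splitlines():
        key, _, value = line.partition(":")
        # Both the list and the scalar are valid JSON literals.
        metrics[key.strip()] = json.loads(value)
    return metrics


if __name__ == "__main__":
    stdout = "epoch logs...\n" + emit_metrics([0.51, 0.93], 0.214)
    print(extract_metrics(stdout))
```

Restricting parsing to the marked block means stray `VAL_ACCS:` text in ordinary log lines is ignored.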

server/tasks/task6_io_mismatch.py ADDED Viewed

	@@ -0,0 +1,124 @@

+TASK_DESCRIPTION = """
+This image classification script has multiple input-output mismatch bugs that cause
+silent failures or crashes. The model is a simple CNN trained on synthetic "images".
+There are 4 BUGS to fix:
+1. Shape mismatch: The model expects 28x28 images but data generator creates 32x32
+2. Channel order mismatch: Model expects CHW but data is HWC format
+3. Label encoding mismatch: Model expects class indices but labels are one-hot encoded
+4. Batch dimension mismatch: A validation step processes unbatched data
+Fix all bugs so that:
+- Training runs without errors for 30 epochs
+- VAL_ACC > 0.85
+- FINAL_LOSS < 0.5
+Print as: LOSSES:[l1,l2,...], VAL_ACC:X.XX, FINAL_LOSS:X.XX
+Wrap output in ##METRICS_START## and ##METRICS_END##
+"""
+BUGGY_CODE = """
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, TensorDataset
+torch.manual_seed(42)
+NUM_CLASSES = 5
+BATCH_SIZE = 32
+EPOCHS = 30
+# BUG 1: Create 32x32 images but model expects 28x28
+# Generate synthetic image data (HWC format - common from PIL/OpenCV)
+def generate_data(n_samples):
+    # Creates images in HWC format (Height x Width x Channels)
+    images = torch.randn(n_samples, 32, 32, 1)  # BUG: Wrong size & HWC format
+    # Each class has a distinct pattern based on mean pixel value region
+    class_indices = torch.randint(0, NUM_CLASSES, (n_samples,))
+    for i, c in enumerate(class_indices):
+        images[i] += c * 0.5  # Add class-dependent offset
+    # BUG 3: Return one-hot labels instead of class indices
+    labels = F.one_hot(class_indices, NUM_CLASSES).float()
+    return images, labels
+X_train, y_train = generate_data(800)
+X_val, y_val = generate_data(200)
+# BUG 2: Model expects CHW format (Channels x Height x Width) and 28x28 images
+class SimpleCNN(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # Expecting input: (batch, 1, 28, 28) in CHW format
+        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
+        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
+        self.pool = nn.MaxPool2d(2)
+        # After two pooling ops on 28x28: 28->14->7, so 7*7*32 = 1568
+        self.fc = nn.Linear(7 * 7 * 32, NUM_CLASSES)
+    def forward(self, x):
+        # Expects x to be (batch, channels, height, width)
+        x = self.pool(F.relu(self.conv1(x)))  # -> (batch, 16, 14, 14)
+        x = self.pool(F.relu(self.conv2(x)))  # -> (batch, 32, 7, 7)
+        x = x.view(x.size(0), -1)
+        return self.fc(x)
+model = SimpleCNN()
+optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+criterion = nn.CrossEntropyLoss()  # Expects class indices, not one-hot
+train_ds = TensorDataset(X_train, y_train)
+train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
+losses = []
+for epoch in range(EPOCHS):
+    model.train()
+    epoch_loss = 0.0
+    for images, labels in train_loader:
+        optimizer.zero_grad()
+        # Missing: permute from HWC to CHW format
+        # Missing: resize from 32x32 to 28x28
+        outputs = model(images)
+        # BUG: criterion expects class indices but labels are one-hot
+        loss = criterion(outputs, labels)
+        loss.backward()
+        optimizer.step()
+        epoch_loss += loss.item()
+    losses.append(epoch_loss / len(train_loader))
+# Validation
+model.eval()
+with torch.no_grad():
+    # BUG 4: Process single sample without batch dimension
+    sample = X_val[0]  # Shape: (32, 32, 1) - missing batch dim
+    single_pred = model(sample)  # Will crash: expects (batch, C, H, W)
+    # Full validation (also has format issues)
+    val_outputs = model(X_val)
+    val_preds = val_outputs.argmax(dim=1)
+    val_labels = y_val.argmax(dim=1)  # Convert one-hot back to indices for comparison
+    val_acc = (val_preds == val_labels).float().mean().item()
+print('##METRICS_START##')
+print('LOSSES:' + str([round(l, 4) for l in losses]))
+print('VAL_ACC:' + str(round(val_acc, 4)))
+print('FINAL_LOSS:' + str(round(losses[-1], 4)))
+print('##METRICS_END##')
+"""
+GROUND_TRUTH_BUGS = [
+    "Shape mismatch: Images are 32x32 but model expects 28x28 - need to resize or fix model architecture",
+    "Channel order mismatch: Data is in HWC format but model expects CHW - use .permute(0, 3, 1, 2)",
+    "Label encoding mismatch: Labels are one-hot but CrossEntropyLoss expects class indices - use .argmax(dim=1) or change generate_data",
+    "Batch dimension mismatch: single sample missing batch dimension - use sample.unsqueeze(0)",
+]
+# Expected fixes (for grader reference):
+# 1. Either resize images to 28x28 OR change model to expect 32x32 (fc layer: 8*8*32)
+# 2. Add: images = images.permute(0, 3, 1, 2)  # HWC -> CHW
+# 3. Either: labels = class_indices (return indices) OR labels = labels.argmax(dim=1) before criterion
+# 4. Add: sample = sample.unsqueeze(0).permute(0, 3, 1, 2) before model(sample)
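
Review note: expected fix 1 offers two options, and the second one changes the input size of the `fc` layer. The arithmetic can be checked with a small helper. `flattened_features` is a hypothetical name; the sketch assumes, as in `SimpleCNN`, that the convolutions preserve spatial size (`kernel_size=3, padding=1`) and that each `MaxPool2d(2)` halves height and width, rounding down.

```python
def flattened_features(side: int, out_channels: int = 32, num_pools: int = 2) -> int:
    """Number of inputs nn.Linear needs once the conv/pool output is flattened."""
    for _ in range(num_pools):
        side //= 2  # MaxPool2d(2) halves each spatial dimension (floor division)
    return side * side * out_channels
```

`flattened_features(28)` gives 1568, matching `nn.Linear(7 * 7 * 32, NUM_CLASSES)`, and `flattened_features(32)` gives 2048, which is the `8 * 8 * 32` mentioned in the grader comment.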

test_api.sh ADDED Viewed

	@@ -0,0 +1,50 @@

+#!/bin/bash
+# Test HuggingFace Inference Providers API
+echo "Testing HuggingFace Inference Providers..."
+echo ""
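+# Fall back to the first line of .env that does not start with '#'; that line must hold the bare token, not KEY=value.
+HF_TOKEN="${HF_TOKEN:-$(grep -v '^#' .env 2>/dev/null | head -1)}"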
+if [ -z "$HF_TOKEN" ]; then
+    echo "❌ HF_TOKEN not set"
+    exit 1
+fi
+echo "Testing with model: Qwen/Qwen2.5-7B-Instruct"
+echo ""
+response=$(curl -s -w "\nHTTP_CODE:%{http_code}" \
+  https://router.huggingface.co/v1/chat/completions \
+  -H "Authorization: Bearer $HF_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-7B-Instruct",
+    "messages": [{"role": "user", "content": "Say hello in 5 words"}],
+    "max_tokens": 20
+  }')
+http_code=$(echo "$response" | grep "HTTP_CODE" | cut -d: -f2)
+body=$(echo "$response" | sed '/HTTP_CODE/d')
+if [ "$http_code" = "200" ]; then
+    echo "✅ API Test Successful!"
+    echo ""
+    echo "Response:"
+    echo "$body" | python3 -m json.tool 2>/dev/null || echo "$body"
+else
+    echo "❌ API Test Failed (HTTP $http_code)"
+    echo ""
+    echo "Response:"
+    echo "$body"
+    exit 1
+fi
+echo ""
+echo "===================="
+echo "✅ Configuration is working!"
+echo "Use this in your .bashrc or .env:"
+echo ""
+echo "export API_BASE_URL=\"https://router.huggingface.co/v1\""
+echo "export MODEL_NAME=\"Qwen/Qwen2.5-7B-Instruct\""
+echo "export HF_TOKEN=\"$HF_TOKEN\""

vaidate-submission.sh ADDED Viewed

	@@ -0,0 +1,185 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker:       https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+#   Or download and run locally:
+#     chmod +x validate-submission.sh
+#     ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
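+# curl's -w already prints 000 when no response arrives, so only default an empty result.
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>/dev/null)
+HTTP_CODE="${HTTP_CODE:-000}"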
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+exit 0