Upload folder using huggingface_hub
- GRADER_ANALYSIS.md +56 -0
- HACKATHON_GUIDE.md +214 -0
- baseline_agent.py +31 -8
- evaluate_mnist.py +816 -0
- gradio_app.py +34 -10
- improved_agent.py +717 -0
- inference.py +368 -0
- run_inference.sh +63 -0
- server/app.py +38 -8
- server/environment.py +3 -2
- server/tasks/graders.py +342 -114
- server/tasks/task3_oom_leakage.py +5 -1
- server/tasks/task6_io_mismatch.py +124 -0
- test_api.sh +50 -0
- vaidate-submission.sh +185 -0
GRADER_ANALYSIS.md
ADDED
|
@@ -0,0 +1,56 @@
# Task & Grader Analysis Report

## 🔴 CRITICAL FINDING: Why Scores Were Identical

The score `0.7390` appearing for all tasks suggested the LLM was generating code that:
1. **Ran successfully** (exit_code = 0)
2. **Output valid metrics** (LOSSES, VAL_ACC, etc.)
3. **BUT didn't necessarily fix the actual bugs**

The old graders gave "partial credit" for any running code, leading to similar scores.

## ✅ FIXES APPLIED

### 1. Stricter Sigmoid Centers
- Old centers were too forgiving (e.g., val_acc center at 0.85)
- New centers require better performance (e.g., val_acc center at 0.90-0.92)
- Increased steepness for sharper differentiation (15 → 25-30)

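To make the effect of moving the center and raising the steepness concrete, here is a minimal sketch. It assumes the graders map a metric to a score with a logistic curve `1 / (1 + exp(-k * (x - c)))`. The function name `sigmoid_score` and the specific pairing of values (center 0.85 with steepness 15 versus center 0.91 with steepness 25) are illustrative assumptions, not code taken from `server/tasks/graders.py`.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float) -> float:
    """Map a metric to (0, 1) with a logistic curve centered at `center`."""
    return 1.0 / (1.0 + math.exp(-steepness * (value - center)))


# Hypothetical old and new settings, picked from the ranges listed above.
OLD = {"center": 0.85, "steepness": 15.0}
NEW = {"center": 0.91, "steepness": 25.0}

for val_acc in (0.80, 0.88, 0.95):
    old = sigmoid_score(val_acc, **OLD)
    new = sigmoid_score(val_acc, **NEW)
    print(f"val_acc={val_acc:.2f}  old={old:.3f}  new={new:.3f}")
```

Under these assumed settings, a `val_acc` of 0.88 lands above one half on the old curve and below one half on the new one, which is the kind of separation the stricter centers are meant to produce.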
### 2. Early Rejection for Unfixed Bugs
- Added explicit checks for "likely unfixed" states
- Task 3: Reject if val_acc < 0.65 (buggy code gives ~0.50)
- Task 4: Reject if f1 < 0.40 (buggy code gives ~0.25)
- Task 5: Reject if buggy state unchanged

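The early-rejection idea can be sketched as a small lookup of metric floors. The thresholds for Task 3 and Task 4 come from the list above. The names `REJECTION_FLOORS`, `early_reject`, and `UNFIXED_SCORE`, the value 0.10, and the choice to treat a missing metric as a rejection are all assumptions made for illustration.

```python
from typing import Optional

# Floors from the list above: metrics below these look like the unfixed bug.
REJECTION_FLOORS = {
    "task3": ("val_acc", 0.65),
    "task4": ("f1", 0.40),
}

# Hypothetical low score handed to rejected submissions.
UNFIXED_SCORE = 0.10


def early_reject(task_id: str, metrics: dict) -> Optional[float]:
    """Return a low score if the bug looks unfixed, or None to continue grading."""
    rule = REJECTION_FLOORS.get(task_id)
    if rule is None:
        return None  # no early-rejection rule for this task
    name, floor = rule
    value = metrics.get(name)
    # Assumption: a missing metric is treated the same as a failing one.
    if value is None or value < floor:
        return UNFIXED_SCORE
    return None


print(early_reject("task3", {"val_acc": 0.50}))  # rejected: below the 0.65 floor
print(early_reject("task3", {"val_acc": 0.93}))  # None: grading continues
```

Returning `None` for the "keep grading" case keeps the check separate from the main scoring path, so the sigmoid-based score is only computed for submissions that pass the floor.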
### 3. Task 3 Mismatch Fixed
- **Old**: Description said "OOM and data leakage"
- **Actual bug**: Label inversion (`criterion(out, 1 - yb)`)
- **Fixed**: Updated grader to match actual bug

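As a rough illustration of why label inversion is damaging, the sketch below computes binary cross-entropy by hand for a single confident prediction. It assumes a binary task with labels in {0, 1}; the helper `bce` is hypothetical and stands in for whatever `criterion` the task script actually uses.

```python
import math


def bce(prob: float, label: int) -> float:
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


prob = 0.9   # the model is fairly sure the sample is positive
label = 1    # and the sample really is positive

correct_loss = bce(prob, label)       # analogous to criterion(out, yb)
inverted_loss = bce(prob, 1 - label)  # analogous to criterion(out, 1 - yb)

print(f"correct target:  {correct_loss:.3f}")
print(f"inverted target: {inverted_loss:.3f}")
```

With the inverted target the correct prediction receives the larger loss, so gradient descent is pushed away from the right answer.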
### 4. Reduced Base Scores
- Old task 2 gave 0.40 "free" for avoiding NaN
- New gives 0.35 base, with stricter accuracy requirements

## Updated Grader Summary

| Task | Bug | Key Metric | Threshold | Weight |
|------|-----|------------|-----------|--------|
| task1 | LR + step/backward | VAL_ACC | >0.90 | 60% |
| task2 | NaN loss | No NaN + VAL_ACC | >0.80 | 40% |
| task3 | Label inversion | VAL_ACC | >0.92 | 60% |
| task4 | Wrong loss | F1_SCORE | >0.70 | 55% |
| task5 | Frozen backbone | Fix detection | Binary | 70% |

## Expected Score Distribution After Fix

- **Well-fixed code** (correct fix): 0.85-1.00
- **Partially fixed** (runs but suboptimal): 0.40-0.70
- **Unfixed** (bug still present): 0.10-0.25
- **Broken** (crashes): 0.00-0.10

This creates better separation between models of different capability.

## Files Modified

1. `server/tasks/graders.py` - All 5 graders updated
2. `server/tasks/task3_oom_leakage.py` - Description clarified
HACKATHON_GUIDE.md
ADDED
|
@@ -0,0 +1,214 @@
# WhipStudio - OpenEnv Hackathon Submission Guide

Complete guide for running inference, training, and evaluation for the Scaler Meta PyTorch Hackathon.

## 🚀 Quick Start

### 1. Environment Setup

```bash
# Set your HuggingFace token
export HF_TOKEN="your_token_here"

# For HuggingFace models (recommended)
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"

# Or use the convenience script
./run_inference.sh https://amogh-kal1-whipstudio.hf.space
```

### 2. Run Hackathon Inference

The `inference.py` script meets all hackathon requirements:
- ✅ Uses OpenAI-compatible client
- ✅ Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment
- ✅ Emits [START], [STEP], [END] logs
- ✅ Runs all 5 tasks with max 3 attempts each

```bash
python inference.py --env-url https://amogh-kal1-whipstudio.hf.space
```

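For readers who want a feel for what tagged log lines look like, here is a minimal sketch of a logger that emits `[START]`, `[STEP]`, and `[END]` records. The field layout (a JSON object after the tag) and the names `log_event`, `task_id`, `attempt`, `score`, and `best_score` are assumptions for illustration; the exact format expected by the hackathon is not shown in this guide and may differ.

```python
import json


def log_event(tag: str, **fields) -> str:
    """Print one line of the form '[TAG] {json payload}' and return it."""
    line = f"[{tag}] {json.dumps(fields, sort_keys=True)}"
    print(line, flush=True)  # flush so the line shows up immediately in streamed logs
    return line


# Hypothetical lifecycle of a single task with one attempt.
log_event("START", task_id="task1")
log_event("STEP", task_id="task1", attempt=1, score=0.42)
log_event("END", task_id="task1", best_score=0.42)
```

Keeping one event per line with a fixed tag prefix makes the output easy to filter with plain text tools and easy to parse back into structured data.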
## 📊 Training with GRPO

Train a model using Group Relative Policy Optimization:

### Basic Training
```bash
python improved_agent.py \
    --env_url https://amogh-kal1-whipstudio.hf.space \
    --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --output_dir ./trained-model \
    --num_iterations 50
```

### Memory-Efficient Training (8GB VRAM)
```bash
python improved_agent.py \
    --env_url https://amogh-kal1-whipstudio.hf.space \
    --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --use_lora \
    --use_4bit \
    --gradient_checkpointing \
    --output_dir ./trained-model-lora
```

### Training Features
- **Curriculum Learning**: Starts with easier tasks, progresses to harder ones
- **LoRA Support**: Efficient fine-tuning with adapters
- **4-bit Quantization**: Train on GPUs with limited VRAM
- **Checkpoint Saving**: Best model saved automatically
- **Early Stopping**: Stops when no improvement
- **Wandb Logging**: Optional tracking with `--use_wandb`

## 🎯 Evaluation on MNIST

Compare base vs trained models on an out-of-distribution MNIST debugging task:

### Compare Two Models
```bash
python evaluate_mnist.py \
    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --trained_model ./trained-model/best \
    --num_runs 3
```

### Use Real MNIST Dataset
```bash
python evaluate_mnist.py \
    --use_real_mnist \
    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --trained_model ./trained-model/best
```

### Compare Multiple Models
```bash
python evaluate_mnist.py \
    --use_real_mnist \
    --models Qwen/Qwen2.5-Coder-1.5B-Instruct \
             Qwen/Qwen2.5-Coder-7B-Instruct \
             ./trained-model-v1/best \
             ./trained-model-v2/best
```

## 🔧 Configuration

### HuggingFace API (Recommended)
```bash
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-Coder-32B-Instruct"
export HF_TOKEN="hf_your_token"
```

### OpenAI API
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="sk-your-key"
```

### Local Model Inference
```bash
# Use vLLM or similar OpenAI-compatible server
export API_BASE_URL="http://localhost:8000/v1"
export MODEL_NAME="your-local-model"
export HF_TOKEN="dummy"  # Still required by script
```

## 📝 Hackathon Requirements Checklist

- ✅ **HF Space deploys**: https://amogh-kal1-whipstudio.hf.space
- ✅ **OpenEnv spec compliance**: openenv.yaml, typed models, endpoints
- ✅ **Dockerfile builds**: server/Dockerfile
- ✅ **inference.py exists**: Root directory
- ✅ **Uses OpenAI Client**: With API_BASE_URL, MODEL_NAME, HF_TOKEN
- ✅ **Structured logs**: [START], [STEP], [END] format
- ✅ **3+ tasks with graders**: 5 tasks (task1-task5)

## 🐛 Troubleshooting

### 500 Error from HF Space
```
[ERROR] Server error '500 Internal Server Error'
```

**Solution**:
1. Visit your HF Space in a browser first: https://amogh-kal1-whipstudio.hf.space
2. Wait for it to fully start (cold start can take 1-2 minutes)
3. Check the Space logs for errors
4. Try the /health endpoint: `curl https://amogh-kal1-whipstudio.hf.space/health`

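The "wait, then retry `/health`" steps above can be automated with a small polling loop. This is a sketch using only the standard library. The helper names `http_status` and `wait_until_healthy`, the 12 attempts at 10-second intervals (about two minutes, matching the cold-start estimate), and the assumption that a healthy Space answers `/health` with HTTP 200 are illustrative choices, not behavior confirmed by the server code.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status such as 500
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and similar


def wait_until_healthy(
    url: str,
    attempts: int = 12,
    delay: float = 10.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for i in range(attempts):
        if fetch(url) == 200:
            return True
        if i < attempts - 1:
            sleep(delay)  # give the Space time to finish its cold start
    return False


# Example call (performs real network requests, so it is left commented out):
# wait_until_healthy("https://amogh-kal1-whipstudio.hf.space/health")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised in tests without touching the network or actually waiting.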
### Missing Dependencies
```bash
pip install openai httpx transformers torch trl peft bitsandbytes accelerate datasets
```

### Out of Memory During Training
Use memory-efficient options:
```bash
python improved_agent.py \
    --use_4bit \
    --use_lora \
    --gradient_checkpointing \
    --lora_r 8  # Lower rank for less memory
```

### HuggingFace API Rate Limits
If you hit rate limits with HuggingFace's free tier:
1. Use a smaller model (e.g., 1.5B instead of 32B)
2. Reduce `--num_iterations` for training
3. Reduce `--num_runs` for evaluation

## 📚 File Descriptions

| File | Purpose |
|------|---------|
| `inference.py` | **Hackathon submission script** - runs all tasks with structured logging |
| `improved_agent.py` | Train model with GRPO (curriculum learning, LoRA, 4-bit) |
| `evaluate_mnist.py` | Compare models on out-of-distribution MNIST debugging |
| `run_inference.sh` | Convenience script for quick inference runs |
| `baseline_agent.py` | Original baseline (not hackathon-compliant) |

## 🎓 Example Workflow

```bash
# 1. Run baseline inference
export HF_TOKEN="your_token"
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
python inference.py --env-url https://amogh-kal1-whipstudio.hf.space

# 2. Train model with GRPO
python improved_agent.py \
    --env_url https://amogh-kal1-whipstudio.hf.space \
    --use_lora --use_4bit \
    --num_iterations 30 \
    --output_dir ./my-trained-model

# 3. Evaluate on MNIST
python evaluate_mnist.py \
    --use_real_mnist \
    --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --trained_model ./my-trained-model/best \
    --num_runs 5

# 4. Validate submission
./vaidate-submission.sh https://amogh-kal1-whipstudio.hf.space
```

## 🏆 Tips for Best Results

1. **Start with small experiments**: Use `--num_iterations 10` first
2. **Monitor training**: Use `--use_wandb` to track progress
3. **Curriculum helps**: Keep `--curriculum_stages 3` for better learning
4. **Real MNIST is harder**: Expect lower scores but more realistic evaluation
5. **Multiple runs**: Use `--num_runs 5` to average out run-to-run variance; a single run can be misleading

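When comparing models over several runs, it helps to look at the spread as well as the mean. The sketch below summarizes a list of per-run scores with the standard library. The function name `summarize` is hypothetical, and the score lists are placeholder values for illustration only, not measured results from this project.

```python
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev


# Placeholder numbers, NOT real results.
base_runs = [0.40, 0.35, 0.45, 0.40, 0.40]
trained_runs = [0.60, 0.55, 0.70, 0.65, 0.50]

for name, runs in (("base", base_runs), ("trained", trained_runs)):
    mean, stdev = summarize(runs)
    print(f"{name}: mean={mean:.3f} stdev={stdev:.3f} (n={len(runs)})")
```

If the gap between the two means is small compared with the standard deviations, more runs are probably needed before drawing a conclusion.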
## 📧 Support

If you encounter issues:
1. Check the troubleshooting section above
2. Verify your HF Space is running: visit the URL in browser
3. Check environment variables: `echo $API_BASE_URL $MODEL_NAME $HF_TOKEN`
4. Review the logs for detailed error messages
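The environment-variable check can also be done without echoing the token to the terminal. This is a small sketch; the helper names `missing_vars` and `describe` are hypothetical, and the variable list simply mirrors the three names used throughout this guide.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def describe(env: Mapping[str, str] = os.environ) -> None:
    """Print each variable's status, hiding the token value."""
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            status = "MISSING"
        elif name == "HF_TOKEN":
            status = f"set ({len(value)} characters, value hidden)"
        else:
            status = value
        print(f"{name}: {status}")


describe()
problems = missing_vars()
if problems:
    print("Please export:", ", ".join(problems))
```

Hiding the token value avoids leaking it into shell history captures, screenshots, or shared logs while still confirming that it is set.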
baseline_agent.py
CHANGED
|
@@ -18,7 +18,17 @@ Rules:
|
|
| 18 |
""".strip()
|
| 19 |
|
| 20 |
|
| 21 |
-
| 22 |
from smolagents import InferenceClientModel
|
| 23 |
|
| 24 |
hf_token = os.environ.get("HF_TOKEN")
|
|
@@ -27,8 +37,13 @@ def get_model():
|
|
| 27 |
"HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
|
| 28 |
)
|
| 29 |
|
| 30 |
return InferenceClientModel(
|
| 31 |
-
model_id=
|
| 32 |
token=hf_token,
|
| 33 |
)
|
| 34 |
|
|
@@ -81,15 +96,23 @@ def _generate_fixed_code(model, prompt: str) -> str:
|
|
| 81 |
raise AttributeError("Model does not support callable() or generate() inference APIs")
|
| 82 |
|
| 83 |
|
| 84 |
-
async def run_single_task(
|
| 85 |
"""Backwards-compatible wrapper that returns just the score."""
|
| 86 |
-
result = await run_single_task_detailed(task_id, env_url)
|
| 87 |
return result["score"]
|
| 88 |
|
| 89 |
|
| 90 |
-
async def run_single_task_detailed(
|
| 91 |
"""Run the baseline agent on a single task. Returns detailed results."""
|
| 92 |
-
model = get_model()
|
| 93 |
timeout = httpx.Timeout(900.0, connect=10.0)
|
| 94 |
|
| 95 |
attempts_log = []
|
|
@@ -173,13 +196,13 @@ if __name__ == "__main__":
|
|
| 173 |
|
| 174 |
async def main():
|
| 175 |
scores = {}
|
| 176 |
-
for tid in ["task1", "task2", "task3"]:
|
| 177 |
try:
|
| 178 |
s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
|
| 179 |
except TimeoutError:
|
| 180 |
s = 0.0
|
| 181 |
scores[tid] = round(s, 4)
|
| 182 |
print(f"{tid}: {s:.4f}")
|
| 183 |
-
print(f"Average: {sum(scores.values()) /
|
| 184 |
|
| 185 |
asyncio.run(main())
|
|
|
|
| 18 |
""".strip()
|
| 19 |
|
| 20 |
|
| 21 |
+
SUPPORTED_MODEL_IDS = [
|
| 22 |
+
"Qwen/Qwen2.5-Coder-1.5B-Instruct",
|
| 23 |
+
"Qwen/Qwen2.5-Coder-3B-Instruct",
|
| 24 |
+
"Qwen/Qwen2.5-Coder-7B-Instruct",
|
| 25 |
+
"Qwen/Qwen2.5-Coder-14B-Instruct",
|
| 26 |
+
"Qwen/Qwen2.5-Coder-32B-Instruct",
|
| 27 |
+
"mistralai/Mistral-7B-Instruct-v0.3",
|
| 28 |
+
]
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def get_model(model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct"):
|
| 32 |
from smolagents import InferenceClientModel
|
| 33 |
|
| 34 |
hf_token = os.environ.get("HF_TOKEN")
|
|
|
|
| 37 |
"HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
|
| 38 |
)
|
| 39 |
|
| 40 |
+
if model_id not in SUPPORTED_MODEL_IDS:
|
| 41 |
+
raise ValueError(
|
| 42 |
+
f"Unsupported model_id '{model_id}'. Supported options: {SUPPORTED_MODEL_IDS}"
|
| 43 |
+
)
|
| 44 |
+
|
| 45 |
return InferenceClientModel(
|
| 46 |
+
model_id=model_id,
|
| 47 |
token=hf_token,
|
| 48 |
)
|
| 49 |
|
|
|
|
| 96 |
raise AttributeError("Model does not support callable() or generate() inference APIs")
|
| 97 |
|
| 98 |
|
| 99 |
+
async def run_single_task(
|
| 100 |
+
task_id: str,
|
| 101 |
+
env_url: str = "http://localhost:7860",
|
| 102 |
+
model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
|
| 103 |
+
) -> float:
|
| 104 |
"""Backwards-compatible wrapper that returns just the score."""
|
| 105 |
+
result = await run_single_task_detailed(task_id, env_url, model_id)
|
| 106 |
return result["score"]
|
| 107 |
|
| 108 |
|
| 109 |
+
async def run_single_task_detailed(
|
| 110 |
+
task_id: str,
|
| 111 |
+
env_url: str = "http://localhost:7860",
|
| 112 |
+
model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
|
| 113 |
+
) -> dict:
|
| 114 |
"""Run the baseline agent on a single task. Returns detailed results."""
|
| 115 |
+
model = get_model(model_id)
|
| 116 |
timeout = httpx.Timeout(900.0, connect=10.0)
|
| 117 |
|
| 118 |
attempts_log = []
|
|
|
|
| 196 |
|
| 197 |
async def main():
|
| 198 |
scores = {}
|
| 199 |
+
for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 200 |
try:
|
| 201 |
s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
|
| 202 |
except TimeoutError:
|
| 203 |
s = 0.0
|
| 204 |
scores[tid] = round(s, 4)
|
| 205 |
print(f"{tid}: {s:.4f}")
|
| 206 |
+
print(f"Average: {sum(scores.values()) / len(scores):.4f}")
|
| 207 |
|
| 208 |
asyncio.run(main())
|
evaluate_mnist.py
ADDED
|
@@ -0,0 +1,816 @@
#!/usr/bin/env python3
"""
Evaluate untrained vs GRPO-trained Qwen2.5-Coder-1.5B on a real
MNIST handwritten digit recognition debugging task.

This script tests whether RL-trained models outperform base models
on out-of-distribution ML debugging tasks.

The MNIST debugging task is intentionally NOT in the WhipStudio training set,
making it a true test of generalization.

Workflow:
    1. Define a deliberately buggy MNIST training pipeline
    2. Load both base model and GRPO-fine-tuned model
    3. Ask each to fix the buggy code
    4. Execute both fixes and compare results
    5. Generate a comparison report

Requirements:
    pip install transformers torch peft bitsandbytes
    (the real-MNIST variant of the task also imports torchvision)

Usage:
    # Basic comparison
    python evaluate_mnist.py \
        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
        --trained_model ./whipstudio-debugger/best

    # Multiple runs to average out run-to-run variance
    python evaluate_mnist.py \
        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
        --trained_model ./whipstudio-debugger/best \
        --num_runs 5

    # Use 4-bit quantization for memory efficiency
    python evaluate_mnist.py \
        --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
        --trained_model ./whipstudio-debugger/best \
        --use_4bit
"""

import argparse
import json
import math
import os
import re
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional PEFT import for LoRA models
try:
    from peft import PeftModel
    PEFT_AVAILABLE = True
except ImportError:
    PEFT_AVAILABLE = False


# ══════════════════════════════════════════════════════════════════════════════
# System Prompt (same as training)
# ══════════════════════════════════════════════════════════════════════════════

SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
You receive a broken training script and must fix ALL bugs.
Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
Keep all torch.manual_seed() calls intact."""


# ══════════════════════════════════════════════════════════════════════════════
# Buggy MNIST Pipeline (Out-of-Distribution Test)
# ══════════════════════════════════════════════════════════════════════════════

# Two versions of the buggy code: synthetic (fast) and real MNIST (realistic)

MNIST_BUGGY_CODE_SYNTHETIC = '''
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Simulate MNIST-like data (28x28 images, 10 classes)
X_train = torch.randn(1000, 1, 28, 28)
y_train = torch.randint(0, 10, (1000,))
X_val = torch.randn(200, 1, 28, 28)
y_val = torch.randint(0, 10, (200,))

# Make data learnable: label = argmax of mean pixel value in 10 regions.
# 784 pixels do not split evenly into 10 regions, so use the first 780 (78 per region).
for i in range(len(X_train)):
    region_means = X_train[i, 0].flatten()[:780].reshape(10, -1).mean(dim=1)
    y_train[i] = region_means.argmax()
for i in range(len(X_val)):
    region_means = X_val[i, 0].flatten()[:780].reshape(10, -1).mean(dim=1)
    y_val[i] = region_means.argmax()

train_ds = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        # BUG 1: Returning softmax probabilities instead of logits or log-probabilities
        x = F.softmax(self.fc2(x), dim=1)
        return x

model = SimpleCNN()

# BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
criterion = nn.NLLLoss()

# BUG 3: Learning rate too high for CNN
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)

losses = []
for epoch in range(20):
    for xb, yb in train_loader:
        optimizer.zero_grad()
        out = model(xb)
        loss = criterion(out, yb)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

# Validation
model.eval()
with torch.no_grad():
    val_out = model(X_val)
    val_preds = val_out.argmax(dim=1)
    val_acc = (val_preds == y_val).float().mean().item()

print('##METRICS_START##')
print('LOSSES:' + str(losses))
print('VAL_ACC:' + str(round(val_acc, 4)))
print('##METRICS_END##')
'''

MNIST_BUGGY_CODE_REAL = '''
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

torch.manual_seed(42)

# Load REAL MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Use subset for faster training (5000 train, 1000 val)
train_indices = torch.randperm(len(train_dataset))[:5000]
val_indices = torch.randperm(len(test_dataset))[:1000]

train_subset = Subset(train_dataset, train_indices)
val_subset = Subset(test_dataset, val_indices)

train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=256, shuffle=False)

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        # BUG 1: Returning softmax probabilities instead of logits or log-probabilities
        x = F.softmax(self.fc2(x), dim=1)
        return x

model = SimpleCNN()

# BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
criterion = nn.NLLLoss()

# BUG 3: Learning rate too high for CNN
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)

losses = []
for epoch in range(10):  # 10 epochs on real MNIST
    for xb, yb in train_loader:
        optimizer.zero_grad()
        out = model(xb)
        loss = criterion(out, yb)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

# Validation on real MNIST test set
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for xb, yb in val_loader:
        out = model(xb)
        preds = out.argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)

val_acc = correct / total

print('##METRICS_START##')
print('LOSSES:' + str(losses[-100:]))  # Last 100 losses to avoid huge output
print('VAL_ACC:' + str(round(val_acc, 4)))
print('##METRICS_END##')
'''

# Default to synthetic for backward compatibility
MNIST_BUGGY_CODE = MNIST_BUGGY_CODE_SYNTHETIC

MNIST_TASK_DESCRIPTION_SYNTHETIC = """
This is a CNN-based handwritten digit classifier (MNIST-like, 10 classes).
The model has several bugs preventing it from training properly.

Bugs to identify and fix:
1. The forward pass has a problem with activation functions
2. The loss function doesn't match the model output
3. The optimizer has problematic hyperparameters

Fix ALL bugs so that after 20 epochs:
- Loss converges below 1.5
- Validation accuracy exceeds 0.50

Print losses as: LOSSES:[val1, val2, ...]
Print validation accuracy as: VAL_ACC:X.XX
Wrap metrics in ##METRICS_START## and ##METRICS_END##.
"""

MNIST_TASK_DESCRIPTION_REAL = """
This is a CNN-based MNIST handwritten digit classifier using the REAL MNIST dataset.
The model has several bugs preventing it from training properly.

Bugs to identify and fix:
1. The forward pass has a problem with activation functions
2. The loss function doesn't match the model output
3. The optimizer has problematic hyperparameters

Fix ALL bugs so that after 10 epochs on real MNIST:
- Loss converges and decreases over time
- Validation accuracy exceeds 0.85 (should be achievable on real MNIST)

Print the last 100 losses as: LOSSES:[val1, val2, ...]
Print validation accuracy as: VAL_ACC:X.XX
Wrap metrics in ##METRICS_START## and ##METRICS_END##.
"""

MNIST_TASK_DESCRIPTION = MNIST_TASK_DESCRIPTION_SYNTHETIC


# ══════════════════════════════════════════════════════════════════════════════
|
| 282 |
+
# Helpers
|
| 283 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 284 |
+
def load_model(
    model_path: str,
    use_4bit: bool = False,
    is_peft: bool = False,
    base_model_for_peft: Optional[str] = None,
) -> tuple:
    """Load model and tokenizer with optional quantization and PEFT."""

    print(f" Loading model from {model_path}...")

    # Quantization config
    quantization_config = None
    if use_4bit:
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    # Model kwargs
    model_kwargs = {
        "trust_remote_code": True,
        "device_map": "auto",
    }
    if quantization_config:
        model_kwargs["quantization_config"] = quantization_config
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Check if this is a PEFT/LoRA model
    adapter_config_path = Path(model_path) / "adapter_config.json"
    if adapter_config_path.exists() or is_peft:
        if not PEFT_AVAILABLE:
            raise ImportError("PEFT model detected but peft is not installed")

        # For PEFT models, we need to load base model first
        if base_model_for_peft is None:
            # Try to read from adapter config
            if adapter_config_path.exists():
                with open(adapter_config_path) as f:
                    adapter_config = json.load(f)
                    base_model_for_peft = adapter_config.get("base_model_name_or_path")

        if base_model_for_peft is None:
            raise ValueError("PEFT model requires --base_model_for_peft or adapter_config.json with base_model_name_or_path")

        print(f" Loading base model: {base_model_for_peft}")
        base_model = AutoModelForCausalLM.from_pretrained(base_model_for_peft, **model_kwargs)

        print(f" Loading PEFT adapters from: {model_path}")
        model = PeftModel.from_pretrained(base_model, model_path)
    else:
        # Regular model
        model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return model, tokenizer


def generate_fix(model, tokenizer, task_description: str, buggy_code: str) -> str:
    """Generate a fix using the given model."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"},
    ]

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=2048,
            temperature=0.2,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the generated tokens
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(generated, skip_special_tokens=True)

    # Strip markdown fences if present
    if "```python" in response:
        response = response.split("```python", 1)[1].split("```", 1)[0].strip()
    elif "```" in response:
        response = response.split("```", 1)[1].split("```", 1)[0].strip()

    return response.strip()


def execute_code(code: str, timeout: int = 120) -> dict:
    """Execute code in a subprocess and return results."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        tmp_path = f.name

    start = time.time()
    try:
        proc = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        elapsed = time.time() - start
        return {
            "exit_code": proc.returncode,
            "stdout": proc.stdout[:8192],
            "stderr": proc.stderr[:2048],
            "elapsed": round(elapsed, 2),
            "timed_out": False,
        }
    except subprocess.TimeoutExpired:
        return {
            "exit_code": -1,
            "stdout": "",
            "stderr": f"Timed out after {timeout}s",
            "elapsed": timeout,
            "timed_out": True,
        }
    finally:
        os.unlink(tmp_path)


def extract_metrics(stdout: str) -> dict:
    """Parse metrics from stdout."""
    metrics: dict = {}

    # Extract metrics block if present
    block_match = re.search(r"##METRICS_START##(.*?)##METRICS_END##", stdout, re.DOTALL)
    text = block_match.group(1) if block_match else stdout

    # Parse losses
    match = re.search(r"LOSSES:\[([^\]]+)\]", text)
    if match:
        try:
            losses = [float(x.strip()) for x in match.group(1).split(",")]
            metrics["losses"] = losses
            metrics["final_loss"] = losses[-1] if losses else None
            metrics["initial_loss"] = losses[0] if losses else None
            metrics["nan_count"] = sum(1 for l in losses if math.isnan(l) or math.isinf(l))
            metrics["num_steps"] = len(losses)
        except Exception:
            pass

    # Parse val_acc
    match = re.search(r"VAL_ACC:([\d.]+)", text)
    if match:
        metrics["val_acc"] = float(match.group(1))

    return metrics


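To make the expected output format concrete, the sketch below applies the same three regular expressions used in `extract_metrics` to a small, made-up stdout sample. The sample text and its numbers are invented for illustration and are not results from any run.

```python
import re

# Hypothetical stdout from a fixed training script.
sample_stdout = """epoch 1 done
##METRICS_START##
LOSSES:[2.31, 1.2, 0.45]
VAL_ACC:0.9123
##METRICS_END##
"""

# Restrict parsing to the delimited block when it is present.
block = re.search(r"##METRICS_START##(.*?)##METRICS_END##", sample_stdout, re.DOTALL)
text = block.group(1) if block else sample_stdout

# The list is printed with str(list), so items are separated by ", ".
losses_match = re.search(r"LOSSES:\[([^\]]+)\]", text)
losses = [float(x.strip()) for x in losses_match.group(1).split(",")]

acc_match = re.search(r"VAL_ACC:([\d.]+)", text)
val_acc = float(acc_match.group(1))

print(losses, val_acc)
```

Because `float()` also accepts the strings `nan` and `inf`, a diverged run printed through `str(losses)` still parses, which is what lets the NaN/Inf count be computed afterwards.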
def score_mnist_fix(metrics: dict) -> float:
    """
    Score an MNIST fix on a 0-1 scale.

    Criteria:
    - No NaN/Inf (base requirement)
    - Final loss < 1.5 (30%)
    - Val accuracy > 0.5 (50%)
    - Learning trajectory (20%)
    """
    if not metrics:
        return 0.0

    if metrics.get("nan_count", 0) > 0:
        return 0.05

    score = 0.0

    # Val accuracy (50% of score)
    val_acc = metrics.get("val_acc")
    if val_acc is not None:
        if val_acc >= 0.7:
            score += 0.50
        elif val_acc >= 0.5:
            score += 0.35
        elif val_acc >= 0.3:
            score += 0.15

    # Final loss (30% of score)
    final_loss = metrics.get("final_loss")
    if final_loss is not None:
        if final_loss < 1.0:
            score += 0.30
        elif final_loss < 1.5:
            score += 0.20
        elif final_loss < 2.5:
            score += 0.10

    # Learning trajectory (20% of score)
    losses = metrics.get("losses", [])
    if len(losses) >= 10:
        # Use one explicit slice length for both quarters. The earlier
        # `losses[-len(losses) // 4:]` parsed as `(-len(losses)) // 4`, which
        # floors toward negative infinity and could take one element more
        # than the divisor accounted for.
        quarter = len(losses) // 4  # at least 2, since len(losses) >= 10
        first_q = sum(losses[:quarter]) / quarter
        last_q = sum(losses[-quarter:]) / quarter
        if last_q < first_q * 0.7:
            score += 0.20
        elif last_q < first_q:
            score += 0.10

    return min(1.0, score)


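A worked example may help when reading the tiers in `score_mnist_fix`. The sketch below is a condensed restatement of the three components (accuracy, final loss, trajectory) with an invented loss curve. The helper `tier`, the function name `score`, and the sample values are assumptions made for illustration; quarter averages here are taken over explicit equal-length slices.

```python
def tier(value, thresholds):
    """Return the points of the first (predicate, points) pair that matches, else 0.0."""
    for predicate, points in thresholds:
        if predicate(value):
            return points
    return 0.0

def score(val_acc, final_loss, losses):
    # Accuracy tiers: up to 0.50 points.
    acc_pts = tier(val_acc, [
        (lambda a: a >= 0.7, 0.50),
        (lambda a: a >= 0.5, 0.35),
        (lambda a: a >= 0.3, 0.15),
    ])
    # Final-loss tiers: up to 0.30 points.
    loss_pts = tier(final_loss, [
        (lambda l: l < 1.0, 0.30),
        (lambda l: l < 1.5, 0.20),
        (lambda l: l < 2.5, 0.10),
    ])
    # Trajectory: compare the mean of the first and last quarter of the curve.
    traj_pts = 0.0
    if len(losses) >= 10:
        quarter = len(losses) // 4
        first_q = sum(losses[:quarter]) / quarter
        last_q = sum(losses[-quarter:]) / quarter
        if last_q < first_q * 0.7:
            traj_pts = 0.20
        elif last_q < first_q:
            traj_pts = 0.10
    return min(1.0, acc_pts + loss_pts + traj_pts)

# Invented curve: 12 steps, falling from 2.3 to 1.2.
losses = [2.3, 2.1, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.2, 1.2]
result = score(val_acc=0.62, final_loss=losses[-1], losses=losses)
print(round(result, 2))
```

Here accuracy 0.62 earns 0.35, final loss 1.2 earns 0.20, and the last-quarter mean (1.2) is below 70% of the first-quarter mean (2.1), which earns 0.20, for a total of about 0.75.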
def evaluate_single_model(
    model_path: str,
    label: str,
    use_4bit: bool = False,
    is_peft: bool = False,
    base_model_for_peft: Optional[str] = None,
    use_real_mnist: bool = False,
) -> dict:
    """Load a model, generate a fix, execute it, and return results."""
    print(f"\n{'=' * 60}")
    print(f"Evaluating: {label}")
    print(f" Model: {model_path}")
    print(f" Dataset: {'Real MNIST' if use_real_mnist else 'Synthetic'}")
    print(f"{'=' * 60}")

    # Select appropriate buggy code and task description
    if use_real_mnist:
        buggy_code = MNIST_BUGGY_CODE_REAL
        task_desc = MNIST_TASK_DESCRIPTION_REAL
    else:
        buggy_code = MNIST_BUGGY_CODE_SYNTHETIC
        task_desc = MNIST_TASK_DESCRIPTION_SYNTHETIC

    # Load model
    model, tokenizer = load_model(
        model_path,
        use_4bit=use_4bit,
        is_peft=is_peft,
        base_model_for_peft=base_model_for_peft,
    )

    # Generate fix
    print(" Generating fix...")
    start = time.time()
    fixed_code = generate_fix(model, tokenizer, task_desc, buggy_code)
    gen_time = time.time() - start
    print(f" Generation took {gen_time:.1f}s ({len(fixed_code)} chars)")

    # Execute (longer timeout for real MNIST due to dataset download)
    timeout = 300 if use_real_mnist else 120
    print(f" Executing fixed code (timeout={timeout}s)...")
    result = execute_code(fixed_code, timeout=timeout)
    metrics = extract_metrics(result["stdout"])
    score = score_mnist_fix(metrics) if result["exit_code"] == 0 else 0.0

    # Report
    print(f"\n Results for {label}:")
    print(f" Exit code: {result['exit_code']}")
    print(f" Timed out: {result['timed_out']}")
    print(f" Val accuracy: {metrics.get('val_acc', 'N/A')}")
    print(f" Final loss: {metrics.get('final_loss', 'N/A')}")
    print(f" NaN count: {metrics.get('nan_count', 'N/A')}")
    print(f" Score: {score:.4f}")

    if result["stderr"] and result["exit_code"] != 0:
        print(f" Stderr: {result['stderr'][:500]}")

    # Free GPU memory
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return {
        "model": label,
        "model_path": model_path,
        "fixed_code": fixed_code,
        "execution": result,
        "metrics": metrics,
        "score": score,
        "generation_time": gen_time,
    }


def print_comparison_table(base_results: list, trained_results: list):
    """Print a nicely formatted comparison table."""
    # Aggregate scores
    base_scores = [r["score"] for r in base_results]
    trained_scores = [r["score"] for r in trained_results]
    base_accs = [r["metrics"].get("val_acc", 0) or 0 for r in base_results]
    trained_accs = [r["metrics"].get("val_acc", 0) or 0 for r in trained_results]

    avg_base_score = sum(base_scores) / len(base_scores)
    avg_trained_score = sum(trained_scores) / len(trained_scores)
    avg_base_acc = sum(base_accs) / len(base_accs)
    avg_trained_acc = sum(trained_accs) / len(trained_accs)

    # Table
    print(f"\n{'=' * 70}")
    print(f"{'COMPARISON: Base vs GRPO-Trained Model':^70}")
    print(f"{'=' * 70}")

    headers = ["Metric", "Base Model", "Trained Model", "Δ (Improvement)"]
    rows = [
        ["Average Score", f"{avg_base_score:.4f}", f"{avg_trained_score:.4f}",
         f"{avg_trained_score - avg_base_score:+.4f}"],
        ["Average Val Acc", f"{avg_base_acc:.4f}", f"{avg_trained_acc:.4f}",
         f"{avg_trained_acc - avg_base_acc:+.4f}"],
        ["Best Score", f"{max(base_scores):.4f}", f"{max(trained_scores):.4f}",
         f"{max(trained_scores) - max(base_scores):+.4f}"],
        ["Best Val Acc", f"{max(base_accs):.4f}", f"{max(trained_accs):.4f}",
         f"{max(trained_accs) - max(base_accs):+.4f}"],
        ["Success Rate (>0.5)", f"{sum(1 for s in base_scores if s > 0.5)}/{len(base_scores)}",
         f"{sum(1 for s in trained_scores if s > 0.5)}/{len(trained_scores)}", ""],
    ]

    # Calculate column widths
    col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(4)]

    # Print table
    header_line = "│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │"
    sep_line = "├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤"
    top_line = "┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐"
    bottom_line = "└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘"

    print(top_line)
    print(header_line)
    print(sep_line)
    for row in rows:
        print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
    print(bottom_line)

    # Winner announcement
    print()
    if avg_trained_score > avg_base_score:
        delta = avg_trained_score - avg_base_score
        pct = (delta / max(avg_base_score, 0.001)) * 100
        print(f"🏆 GRPO-trained model wins by +{delta:.4f} score ({pct:.1f}% improvement)!")
    elif avg_base_score > avg_trained_score:
        print(f"⚠️ Base model performed better (may need more training)")
    else:
        print(f"🤝 Models tied on average score")

    return {
        "base_avg_score": avg_base_score,
        "trained_avg_score": avg_trained_score,
        "base_avg_acc": avg_base_acc,
        "trained_avg_acc": avg_trained_acc,
        "improvement_score": avg_trained_score - avg_base_score,
        "improvement_acc": avg_trained_acc - avg_base_acc,
    }


# ══════════════════════════════════════════════════════════════════════════════
# Main
# ══════════════════════════════════════════════════════════════════════════════

def main():
    parser = argparse.ArgumentParser(
        description="Evaluate and compare multiple models on MNIST debugging",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Compare base vs trained model
  python evaluate_mnist.py --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct --trained_model ./trained

  # Use real MNIST dataset
  python evaluate_mnist.py --use_real_mnist --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct

  # Compare multiple models
  python evaluate_mnist.py --models Qwen/Qwen2.5-Coder-1.5B-Instruct ./trained-v1 ./trained-v2

  # Memory-efficient evaluation
  python evaluate_mnist.py --use_4bit --base_model Qwen/Qwen2.5-Coder-7B-Instruct
        """
    )

    # Model selection (flexible)
    parser.add_argument("--base_model", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
                        help="Path or HF name of base model")
    parser.add_argument("--trained_model", type=str, default=None,
                        help="Path to GRPO-trained model (optional if using --models)")
    parser.add_argument("--models", type=str, nargs="+", default=None,
                        help="List of models to compare (overrides --base_model and --trained_model)")

    # Dataset options
    parser.add_argument("--use_real_mnist", action="store_true",
                        help="Use real MNIST dataset (downloads ~50MB, slower but more realistic)")

    # Output
    parser.add_argument("--output_file", type=str, default="mnist_eval_results.json",
                        help="Output file for detailed results")
    parser.add_argument("--num_runs", type=int, default=3,
                        help="Number of evaluation runs per model")

    # Memory options
    parser.add_argument("--use_4bit", action="store_true",
                        help="Use 4-bit quantization for memory efficiency")
    parser.add_argument("--trained_is_peft", action="store_true",
                        help="Trained model is a PEFT/LoRA adapter")

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dataset_type = "Real MNIST" if args.use_real_mnist else "Synthetic MNIST-like"

    print(f"\n{'#' * 70}")
    print(f"{'MNIST DEBUGGING EVALUATION':^70}")
    print(f"{'#' * 70}")
    print(f"\nDevice: {device}")
    print(f"Dataset: {dataset_type}")
    print(f"Runs per model: {args.num_runs}")
    print(f"\nMNIST Debugging Task (out-of-distribution):")
    print(f" Bugs: softmax before CE, NLLLoss without log, LR=5.0")

    # Determine which models to evaluate
    if args.models:
        # Multi-model comparison mode
        model_list = args.models
        print(f"\nModels to compare ({len(model_list)}):")
        for i, m in enumerate(model_list, 1):
            print(f" {i}. {m}")
    else:
        # Legacy two-model comparison
        model_list = [args.base_model]
        if args.trained_model:
            model_list.append(args.trained_model)
        print(f"\nBase model: {args.base_model}")
        if args.trained_model:
            print(f"Trained model: {args.trained_model}")

    # Run evaluations for each model
    all_results = {model: [] for model in model_list}

    for run in range(1, args.num_runs + 1):
        print(f"\n{'─' * 70}")
        print(f"Run {run}/{args.num_runs}")
        print(f"{'─' * 70}")

        for model_path in model_list:
            model_name = Path(model_path).name if "/" not in model_path else model_path.split("/")[-1]

            # Determine if this is a PEFT model
            is_peft = args.trained_is_peft and model_path != args.base_model
            base_for_peft = args.base_model if is_peft else None

            result = evaluate_single_model(
                model_path,
                f"{model_name} (run {run})",
                use_4bit=args.use_4bit,
                is_peft=is_peft,
                base_model_for_peft=base_for_peft,
                use_real_mnist=args.use_real_mnist,
            )
            all_results[model_path].append(result)

    # Print comparison table for all models
    print(f"\n{'=' * 80}")
    print(f"{'RESULTS SUMMARY':^80}")
    print(f"{'=' * 80}")

    # Calculate aggregates for each model
    model_stats = {}
    for model_path, results in all_results.items():
        scores = [r["score"] for r in results]
        accs = [r["metrics"].get("val_acc", 0) or 0 for r in results]
        model_stats[model_path] = {
            "avg_score": sum(scores) / len(scores),
            "avg_acc": sum(accs) / len(accs),
            "best_score": max(scores),
            "best_acc": max(accs),
            "success_rate": sum(1 for s in scores if s > 0.5) / len(scores),
        }

    # Print table
    headers = ["Model", "Avg Score", "Avg Acc", "Best Score", "Success Rate"]
    rows = []
    for model_path, stats in model_stats.items():
        model_name = Path(model_path).name if "/" not in model_path else model_path.split("/")[-1]
        rows.append([
            model_name[:25],  # Truncate long names
            f"{stats['avg_score']:.4f}",
            f"{stats['avg_acc']:.4f}",
            f"{stats['best_score']:.4f}",
            f"{stats['success_rate']*100:.0f}%",
        ])

    col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(len(headers))]

    print("┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐")
    print("│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │")
    print("├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤")
    for row in rows:
        print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
    print("└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘")

    # Find winner
    best_model = max(model_stats.items(), key=lambda x: x[1]["avg_score"])
    print(f"\n🏆 Best model: {best_model[0].split('/')[-1]} (avg score: {best_model[1]['avg_score']:.4f})")

    # Legacy comparison if exactly 2 models
    summary = None
    if len(model_list) == 2:
        base_results = all_results[model_list[0]]
        trained_results = all_results[model_list[1]]
        summary = print_comparison_table(base_results, trained_results)

    # Save detailed results
    output = {
        "task": f"MNIST debugging ({dataset_type})",
        "models": model_list,
        "num_runs": args.num_runs,
        "device": device,
        "use_real_mnist": args.use_real_mnist,
        "model_stats": model_stats,
        "summary": summary,
        "runs": {
            model_path: [
                {k: v for k, v in r.items() if k != "fixed_code"}
                for r in results
            ]
            for model_path, results in all_results.items()
        },
    }

    with open(args.output_file, "w") as f:
        json.dump(output, f, indent=2, default=str)
    print(f"\n📄 Full results saved to {args.output_file}")


if __name__ == "__main__":
    main()

gradio_app.py
CHANGED
|
@@ -48,6 +48,12 @@ TASK_INFO = {
|
|
| 48 |
"description": "Backbone frozen but its parameters are passed to the optimizer.",
|
| 49 |
"hints": "Unfreeze backbone or only pass head parameters to Adam.",
|
| 50 |
},
|
| 51 |
}
|
| 52 |
|
| 53 |
# ── Theme ──────────────────────────────────────────────────────────────────
|
|
@@ -418,7 +424,7 @@ def do_run_baseline(base_url: str, task_id: str):
|
|
| 418 |
|
| 419 |
results_md = "### 🤖 Baseline Agent Results\n\n"
|
| 420 |
results_md += "| Task | Score |\n|---|---|\n"
|
| 421 |
-
for tid in ["task1", "task2", "task3", "task4", "task5"]:
|
| 422 |
s = scores.get(tid, 0.0)
|
| 423 |
emoji = "🎯" if s >= 0.9 else ("✅" if s >= 0.7 else ("📈" if s >= 0.4 else "⚠️"))
|
| 424 |
results_md += f"| {tid} | {emoji} {s:.4f} |\n"
|
|
@@ -492,8 +498,21 @@ def build_ui() -> gr.Blocks:
|
|
| 492 |
# ── Left column: Task selector ──
|
| 493 |
with gr.Column(scale=1, min_width=280):
|
| 494 |
gr.Markdown("### 📋 Task Selector")
|
| 495 |
task_id = gr.Radio(
|
| 496 |
-
choices=["task1", "task2", "task3", "task4", "task5"],
|
| 497 |
value="task1",
|
| 498 |
label="Select Task",
|
| 499 |
info="Choose a debugging challenge",
|
|
@@ -642,18 +661,23 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 642 |
"task3": "🔴 OOM + Data Leakage",
|
| 643 |
"task4": "🟡 Wrong Loss Function",
|
| 644 |
"task5": "🟡 Frozen Backbone",
|
| 645 |
}
|
| 646 |
|
| 647 |
-
def run_baseline_live(base_url_val):
|
| 648 |
"""Generator that yields live progress as each task completes."""
|
| 649 |
base = (base_url_val or DEFAULT_BASE_URL).strip().rstrip("/")
|
| 650 |
results = {}
|
| 651 |
-
lines_header = [
|
| 652 |
|
| 653 |
# Phase 1: Show "starting" state
|
| 654 |
yield "\n".join(lines_header + ["⏳ Starting baseline agent..."])
|
| 655 |
|
| 656 |
-
for tid in ["task1", "task2", "task3", "task4", "task5"]:
|
| 657 |
tname = TASK_NAMES.get(tid, tid)
|
| 658 |
|
| 659 |
# Show "running this task" update
|
|
@@ -673,7 +697,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 673 |
# Actually call the per-task endpoint
|
| 674 |
try:
|
| 675 |
with httpx.Client(timeout=180.0) as client:
|
| 676 |
-
resp = client.get(f"{base}/baseline/task/{tid}")
|
| 677 |
resp.raise_for_status()
|
| 678 |
data = resp.json()
|
| 679 |
except Exception as exc:
|
|
@@ -690,7 +714,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 690 |
final_lines = ["### 🤖 Baseline Agent Results\n", "| Task | Score |", "|---|---|"]
|
| 691 |
total = 0.0
|
| 692 |
has_errors = False
|
| 693 |
-
for tid in ["task1", "task2", "task3", "task4", "task5"]:
|
| 694 |
info = results.get(tid, {"score": 0.0})
|
| 695 |
s = info["score"]
|
| 696 |
total += s
|
|
@@ -700,7 +724,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 700 |
has_errors = True
|
| 701 |
final_lines.append(f"\n> ⚠️ `{info['error'][:200]}`\n")
|
| 702 |
|
| 703 |
-
avg = total /
|
| 704 |
final_lines.append(f"\n**Average: {avg:.4f}**")
|
| 705 |
if avg >= 0.7:
|
| 706 |
final_lines.append("\n🎉 **Agent performed well!** The environment is solvable.")
|
|
@@ -713,7 +737,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 713 |
final_lines.append("\n---\n> [!WARNING]\n> Some tasks failed. Check if `HF_TOKEN` is valid and the model is accessible.")
|
| 714 |
|
| 715 |
final_lines.append("\n---\n### 🔍 Auto-Agent Generated Code & Execution Logs")
|
| 716 |
-
for tid in ["task1", "task2", "task3", "task4", "task5"]:
|
| 717 |
info = results.get(tid, {})
|
| 718 |
fixed_code = str(info.get("fixed_code", ""))
|
| 719 |
output = str(info.get("output", ""))
|
|
@@ -731,7 +755,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
|
|
| 731 |
outputs=[baseline_output],
|
| 732 |
).then(
|
| 733 |
fn=run_baseline_live,
|
| 734 |
-
inputs=[base_url],
|
| 735 |
outputs=[baseline_output],
|
| 736 |
)
|
| 737 |
|
|
|
|
| 48 |
"description": "Backbone frozen but its parameters are passed to the optimizer.",
|
| 49 |
"hints": "Unfreeze backbone or only pass head parameters to Adam.",
|
| 50 |
},
|
| 51 |
+
"task6": {
|
| 52 |
+
"name": "Input-Output Mismatch",
|
| 53 |
+
"difficulty": "🔴 Hard",
|
| 54 |
+
"description": "CNN has 4 bugs: shape mismatch, channel order (HWC/CHW), label encoding, batch dimension.",
|
| 55 |
+
"hints": "Fix image size (32→28), permute HWC→CHW, use class indices not one-hot, add unsqueeze(0).",
|
| 56 |
+
},
|
| 57 |
}
|
| 58 |
|
| 59 |
# ── Theme ──────────────────────────────────────────────────────────────────
|
|
|
|
| 424 |
|
| 425 |
results_md = "### 🤖 Baseline Agent Results\n\n"
|
| 426 |
results_md += "| Task | Score |\n|---|---|\n"
|
| 427 |
+
for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 428 |
s = scores.get(tid, 0.0)
|
| 429 |
emoji = "🎯" if s >= 0.9 else ("✅" if s >= 0.7 else ("📈" if s >= 0.4 else "⚠️"))
|
| 430 |
results_md += f"| {tid} | {emoji} {s:.4f} |\n"
|
|
|
|
| 498 |
# ── Left column: Task selector ──
|
| 499 |
with gr.Column(scale=1, min_width=280):
|
| 500 |
gr.Markdown("### 📋 Task Selector")
|
| 501 |
+
baseline_model = gr.Dropdown(
|
| 502 |
+
choices=[
|
| 503 |
+
"Qwen/Qwen2.5-Coder-1.5B-Instruct",
|
| 504 |
+
"Qwen/Qwen2.5-Coder-3B-Instruct",
|
| 505 |
+
"Qwen/Qwen2.5-Coder-7B-Instruct",
|
| 506 |
+
"Qwen/Qwen2.5-Coder-14B-Instruct",
|
| 507 |
+
"Qwen/Qwen2.5-Coder-32B-Instruct",
|
| 508 |
+
"mistralai/Mistral-7B-Instruct-v0.3",
|
| 509 |
+
],
|
| 510 |
+
value="Qwen/Qwen2.5-Coder-32B-Instruct",
|
| 511 |
+
label="Auto-Agent Model",
|
| 512 |
+
info="Choose which LLM to run for baseline auto-agent",
|
| 513 |
+
)
|
| 514 |
task_id = gr.Radio(
|
| 515 |
+
choices=["task1", "task2", "task3", "task4", "task5", "task6"],
|
| 516 |
value="task1",
|
| 517 |
label="Select Task",
|
| 518 |
info="Choose a debugging challenge",
|
|
|
|
| 661 |
"task3": "🔴 OOM + Data Leakage",
|
| 662 |
"task4": "🟡 Wrong Loss Function",
|
| 663 |
"task5": "🟡 Frozen Backbone",
|
| 664 |
+
"task6": "🔴 Input-Output Mismatch",
|
| 665 |
}
|
| 666 |
|
| 667 |
+
def run_baseline_live(base_url_val, model_id_val):
|
| 668 |
"""Generator that yields live progress as each task completes."""
|
| 669 |
base = (base_url_val or DEFAULT_BASE_URL).strip().rstrip("/")
|
| 670 |
+
model_id = (model_id_val or "Qwen/Qwen2.5-Coder-32B-Instruct").strip()
|
| 671 |
results = {}
|
| 672 |
+
lines_header = [
|
| 673 |
+
"### 🤖 Baseline Agent — Live Progress\n",
|
| 674 |
+
f"**Model:** `{model_id}`\n",
|
| 675 |
+
]
|
| 676 |
|
| 677 |
# Phase 1: Show "starting" state
|
| 678 |
yield "\n".join(lines_header + ["⏳ Starting baseline agent..."])
|
| 679 |
|
| 680 |
+
for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 681 |
tname = TASK_NAMES.get(tid, tid)
|
| 682 |
|
| 683 |
# Show "running this task" update
|
|
|
|
| 697 |
# Actually call the per-task endpoint
|
| 698 |
try:
|
| 699 |
with httpx.Client(timeout=180.0) as client:
|
| 700 |
+
resp = client.get(f"{base}/baseline/task/{tid}", params={"model_id": model_id})
|
| 701 |
resp.raise_for_status()
|
| 702 |
data = resp.json()
|
| 703 |
except Exception as exc:
|
|
|
|
| 714 |
final_lines = ["### 🤖 Baseline Agent Results\n", "| Task | Score |", "|---|---|"]
|
| 715 |
total = 0.0
|
| 716 |
has_errors = False
|
| 717 |
+
for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 718 |
info = results.get(tid, {"score": 0.0})
|
| 719 |
s = info["score"]
|
| 720 |
total += s
|
|
|
|
| 724 |
has_errors = True
|
| 725 |
final_lines.append(f"\n> ⚠️ `{info['error'][:200]}`\n")
|
| 726 |
|
| 727 |
+
avg = total / 6
|
| 728 |
final_lines.append(f"\n**Average: {avg:.4f}**")
|
| 729 |
if avg >= 0.7:
|
| 730 |
final_lines.append("\n🎉 **Agent performed well!** The environment is solvable.")
|
|
|
|
| 737 |
final_lines.append("\n---\n> [!WARNING]\n> Some tasks failed. Check if `HF_TOKEN` is valid and the model is accessible.")
|
| 738 |
|
| 739 |
final_lines.append("\n---\n### 🔍 Auto-Agent Generated Code & Execution Logs")
|
| 740 |
+
for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 741 |
info = results.get(tid, {})
|
| 742 |
fixed_code = str(info.get("fixed_code", ""))
|
| 743 |
output = str(info.get("output", ""))
|
|
|
|
| 755 |
outputs=[baseline_output],
|
| 756 |
).then(
|
| 757 |
fn=run_baseline_live,
|
| 758 |
+
inputs=[base_url, baseline_model],
|
| 759 |
outputs=[baseline_output],
|
| 760 |
)
|
| 761 |
|
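The results table in `run_baseline_live` repeats the six task IDs in three separate loops and divides by a literal `6`. A minimal sketch of the same averaging step driven by one shared list follows. `TASK_IDS` and `average_score` are hypothetical names that do not appear in this commit, and tasks missing from `results` are scored as 0.0, mirroring the `results.get(tid, {"score": 0.0})` default used in the diff.

```python
# Hypothetical helper, not part of the commit: keep a single task list so the
# average cannot drift out of sync when a task is added or removed.
TASK_IDS = ["task1", "task2", "task3", "task4", "task5", "task6"]


def average_score(results: dict, task_ids: list = TASK_IDS) -> float:
    """Return the mean score over task_ids; missing tasks count as 0.0."""
    if not task_ids:
        return 0.0
    total = sum(
        float(results.get(tid, {"score": 0.0}).get("score", 0.0))
        for tid in task_ids
    )
    return total / len(task_ids)


if __name__ == "__main__":
    # Two tasks scored, four missing: the missing ones contribute 0.0.
    demo = {"task1": {"score": 1.0}, "task2": {"score": 0.5}}
    print(f"Average: {average_score(demo):.4f}")
```

With a helper like this, the three `for tid in [...]` loops could iterate over the same `TASK_IDS` list, so adding a seventh task would only require one change.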
improved_agent.py
ADDED
|
@@ -0,0 +1,717 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Improved GRPO training script for WhipStudio ML Debug Environment.
|
| 4 |
+
|
| 5 |
+
This script trains Qwen2.5-Coder-1.5B-Instruct (or a similar model) to debug broken PyTorch scripts
|
| 6 |
+
using Group Relative Policy Optimization (GRPO) with the WhipStudio environment
|
| 7 |
+
as the reward oracle.
|
| 8 |
+
|
| 9 |
+
Improvements over basic train_grpo.py:
|
| 10 |
+
1. Memory-efficient training with 4-bit quantization
|
| 11 |
+
2. LoRA fine-tuning for reduced VRAM usage
|
| 12 |
+
3. Curriculum learning (easier tasks first)
|
| 13 |
+
4. Gradient checkpointing for large contexts
|
| 14 |
+
5. Checkpoint saving with best model tracking
|
| 15 |
+
6. Early stopping based on validation scores
|
| 16 |
+
7. Weights & Biases (wandb) logging support
|
| 17 |
+
|
| 18 |
+
Requirements:
|
| 19 |
+
pip install "trl>=0.15.0" "transformers>=4.46.0" datasets torch httpx
|
| 20 |
+
pip install accelerate peft bitsandbytes wandb
|
| 21 |
+
|
| 22 |
+
Usage:
|
| 23 |
+
# Basic training
|
| 24 |
+
python improved_agent.py \
|
| 25 |
+
--env_url https://your-space.hf.space \
|
| 26 |
+
--output_dir ./whipstudio-debugger
|
| 27 |
+
|
| 28 |
+
# Memory-efficient training (8GB VRAM)
|
| 29 |
+
python improved_agent.py \
|
| 30 |
+
--env_url https://your-space.hf.space \
|
| 31 |
+
--use_4bit \
|
| 32 |
+
--use_lora \
|
| 33 |
+
--gradient_checkpointing \
|
| 34 |
+
--output_dir ./whipstudio-debugger-lora
|
| 35 |
+
|
| 36 |
+
# Full training with wandb logging
|
| 37 |
+
python improved_agent.py \
|
| 38 |
+
--env_url https://your-space.hf.space \
|
| 39 |
+
--use_wandb \
|
| 40 |
+
--wandb_project whipstudio \
|
| 41 |
+
--num_iterations 100 \
|
| 42 |
+
--output_dir ./whipstudio-debugger
|
| 43 |
+
"""
|
| 44 |
+
|
| 45 |
+
import argparse
|
| 46 |
+
import json
|
| 47 |
+
import math
|
| 48 |
+
import os
|
| 49 |
+
import random
|
| 50 |
+
import re
|
| 51 |
+
import time
|
| 52 |
+
from dataclasses import dataclass
|
| 53 |
+
from pathlib import Path
|
| 54 |
+
from typing import Any, Optional
|
| 55 |
+
|
| 56 |
+
import httpx
|
| 57 |
+
import torch
|
| 58 |
+
from datasets import Dataset
|
| 59 |
+
from transformers import (
|
| 60 |
+
AutoModelForCausalLM,
|
| 61 |
+
AutoTokenizer,
|
| 62 |
+
BitsAndBytesConfig,
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
# TRL imports
|
| 66 |
+
try:
|
| 67 |
+
from trl import GRPOConfig, GRPOTrainer
|
| 68 |
+
except ImportError:
|
| 69 |
+
raise ImportError('Please install trl>=0.15.0: pip install "trl>=0.15.0"')
|
| 70 |
+
|
| 71 |
+
# PEFT imports (optional)
|
| 72 |
+
try:
|
| 73 |
+
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
|
| 74 |
+
PEFT_AVAILABLE = True
|
| 75 |
+
except ImportError:
|
| 76 |
+
PEFT_AVAILABLE = False
|
| 77 |
+
|
| 78 |
+
# Wandb import (optional)
|
| 79 |
+
try:
|
| 80 |
+
import wandb
|
| 81 |
+
WANDB_AVAILABLE = True
|
| 82 |
+
except ImportError:
|
| 83 |
+
WANDB_AVAILABLE = False
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 87 |
+
# Constants
|
| 88 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 89 |
+
|
| 90 |
+
SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
|
| 91 |
+
You receive a broken training script and must fix ALL bugs.
|
| 92 |
+
Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
|
| 93 |
+
The script must print metrics in the format specified by the task description.
|
| 94 |
+
Keep all torch.manual_seed() calls intact.
|
| 95 |
+
Wrap metrics in ##METRICS_START## and ##METRICS_END## markers."""
|
| 96 |
+
|
| 97 |
+
# Task ordering by difficulty for curriculum learning
|
| 98 |
+
TASK_DIFFICULTY = {
|
| 99 |
+
"task1": 1, # Easy: broken loop
|
| 100 |
+
"task4": 2, # Medium: wrong loss
|
| 101 |
+
"task5": 2, # Medium: frozen backbone
|
| 102 |
+
"task2": 3, # Medium: NaN loss (tricky)
|
| 103 |
+
"task3": 4, # Hard: OOM + leakage
|
| 104 |
+
}
|
| 105 |
+
|
| 106 |
+
ALL_TASKS = list(TASK_DIFFICULTY.keys())
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 110 |
+
# Environment Client
|
| 111 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 112 |
+
|
| 113 |
+
class WhipStudioEnv:
|
| 114 |
+
"""Client for the WhipStudio RL environment."""
|
| 115 |
+
|
| 116 |
+
def __init__(self, env_url: str, timeout: float = 180.0):
|
| 117 |
+
self.env_url = env_url.rstrip("/")
|
| 118 |
+
self.timeout = httpx.Timeout(timeout, connect=15.0)
|
| 119 |
+
self._task_cache: dict[str, dict] = {}
|
| 120 |
+
|
| 121 |
+
def reset(self, task_id: str) -> dict:
|
| 122 |
+
"""Reset environment and return observation."""
|
| 123 |
+
with httpx.Client(timeout=self.timeout) as client:
|
| 124 |
+
resp = client.post(f"{self.env_url}/reset", json={"task_id": task_id})
|
| 125 |
+
resp.raise_for_status()
|
| 126 |
+
data = resp.json()
|
| 127 |
+
obs = data.get("observation", data)
|
| 128 |
+
self._task_cache[task_id] = obs
|
| 129 |
+
return obs
|
| 130 |
+
|
| 131 |
+
def step(self, fixed_code: str, attempt: int = 1) -> dict:
|
| 132 |
+
"""Submit a fix and return the full step result."""
|
| 133 |
+
payload = {
|
| 134 |
+
"action": {
|
| 135 |
+
"fixed_code": fixed_code,
|
| 136 |
+
"attempt_number": attempt,
|
| 137 |
+
}
|
| 138 |
+
}
|
| 139 |
+
with httpx.Client(timeout=self.timeout) as client:
|
| 140 |
+
resp = client.post(f"{self.env_url}/step", json=payload)
|
| 141 |
+
resp.raise_for_status()
|
| 142 |
+
return resp.json()
|
| 143 |
+
|
| 144 |
+
def get_task_obs(self, task_id: str) -> dict:
|
| 145 |
+
"""Get cached observation or reset to obtain it."""
|
| 146 |
+
if task_id not in self._task_cache:
|
| 147 |
+
self.reset(task_id)
|
| 148 |
+
return self._task_cache[task_id]
|
| 149 |
+
|
| 150 |
+
def health_check(self) -> bool:
|
| 151 |
+
"""Verify the environment is reachable."""
|
| 152 |
+
try:
|
| 153 |
+
with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
|
| 154 |
+
resp = client.get(f"{self.env_url}/health")
|
| 155 |
+
return resp.status_code == 200
|
| 156 |
+
except Exception:
|
| 157 |
+
return False
|
| 158 |
+
|
| 159 |
+
|
| 160 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 161 |
+
# Prompt Utilities
|
| 162 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 163 |
+
|
| 164 |
+
def build_user_prompt(task_description: str, buggy_code: str) -> str:
|
| 165 |
+
"""Build the user prompt for the model."""
|
| 166 |
+
return f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def format_chat(tokenizer: Any, user_prompt: str) -> str:
|
| 170 |
+
"""Format as a chat message and return the full text."""
|
| 171 |
+
messages = [
|
| 172 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 173 |
+
{"role": "user", "content": user_prompt},
|
| 174 |
+
]
|
| 175 |
+
return tokenizer.apply_chat_template(
|
| 176 |
+
messages, tokenize=False, add_generation_prompt=True
|
| 177 |
+
)
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
def extract_code_from_response(response: str) -> str:
|
| 181 |
+
"""Extract Python code from model response, stripping markdown if present."""
|
| 182 |
+
text = response.strip()
|
| 183 |
+
if "```python" in text:
|
| 184 |
+
text = text.split("```python", 1)[1].split("```", 1)[0].strip()
|
| 185 |
+
elif "```" in text:
|
| 186 |
+
text = text.split("```", 1)[1].split("```", 1)[0].strip()
|
| 187 |
+
return text
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 191 |
+
# Reward Function
|
| 192 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 193 |
+
|
| 194 |
+
def create_reward_function(env: WhipStudioEnv, verbose: bool = True):
|
| 195 |
+
"""
|
| 196 |
+
Create a reward function compatible with TRL's GRPOTrainer.
|
| 197 |
+
|
| 198 |
+
Includes reward shaping:
|
| 199 |
+
- Bonus for valid Python syntax
|
| 200 |
+
- Bonus for including required output markers
|
| 201 |
+
- Environment reward from grader
|
| 202 |
+
"""
|
| 203 |
+
|
| 204 |
+
def reward_fn(completions: list[list[dict]], **kwargs) -> list[float]:
|
| 205 |
+
"""Compute rewards for a batch of completions."""
|
| 206 |
+
rewards = []
|
| 207 |
+
task_ids = kwargs.get("task_id", ["task1"] * len(completions))
|
| 208 |
+
|
| 209 |
+
for i, completion in enumerate(completions):
|
| 210 |
+
task_id = task_ids[i] if i < len(task_ids) else "task1"
|
| 211 |
+
|
| 212 |
+
try:
|
| 213 |
+
# Extract assistant's response
|
| 214 |
+
if isinstance(completion, list):
|
| 215 |
+
text = ""
|
| 216 |
+
for msg in completion:
|
| 217 |
+
if isinstance(msg, dict) and msg.get("role") == "assistant":
|
| 218 |
+
text = msg.get("content", "")
|
| 219 |
+
break
|
| 220 |
+
if not text and completion:
|
| 221 |
+
text = str(completion[-1].get("content", ""))
|
| 222 |
+
elif isinstance(completion, str):
|
| 223 |
+
text = completion
|
| 224 |
+
else:
|
| 225 |
+
text = str(completion)
|
| 226 |
+
|
| 227 |
+
fixed_code = extract_code_from_response(text)
|
| 228 |
+
|
| 229 |
+
# Reward shaping: syntax check
|
| 230 |
+
syntax_bonus = 0.0
|
| 231 |
+
try:
|
| 232 |
+
compile(fixed_code, "<string>", "exec")
|
| 233 |
+
syntax_bonus = 0.05
|
| 234 |
+
except SyntaxError:
|
| 235 |
+
pass
|
| 236 |
+
|
| 237 |
+
# Reward shaping: output markers present
|
| 238 |
+
marker_bonus = 0.0
|
| 239 |
+
if "LOSSES:" in fixed_code or "##METRICS" in fixed_code:
|
| 240 |
+
marker_bonus = 0.02
|
| 241 |
+
|
| 242 |
+
if not fixed_code.strip():
|
| 243 |
+
rewards.append(0.0)
|
| 244 |
+
continue
|
| 245 |
+
|
| 246 |
+
# Get environment reward
|
| 247 |
+
env.reset(task_id)
|
| 248 |
+
result = env.step(fixed_code, attempt=1)
|
| 249 |
+
env_reward = float(result.get("reward", 0.0) or 0.0)
|
| 250 |
+
|
| 251 |
+
# Total reward (capped at 1.0)
|
| 252 |
+
total_reward = min(1.0, env_reward + syntax_bonus + marker_bonus)
|
| 253 |
+
rewards.append(total_reward)
|
| 254 |
+
|
| 255 |
+
if verbose:
|
| 256 |
+
print(f" [reward] task={task_id} env={env_reward:.3f} syntax={syntax_bonus:.2f} marker={marker_bonus:.2f} total={total_reward:.3f}")
|
| 257 |
+
|
| 258 |
+
except Exception as e:
|
| 259 |
+
if verbose:
|
| 260 |
+
print(f" [reward] ERROR task={task_id}: {e}")
|
| 261 |
+
rewards.append(0.0)
|
| 262 |
+
|
| 263 |
+
return rewards
|
| 264 |
+
|
| 265 |
+
return reward_fn
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 269 |
+
# Dataset Generation with Curriculum
|
| 270 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 271 |
+
|
| 272 |
+
def generate_curriculum_dataset(
|
| 273 |
+
env: WhipStudioEnv,
|
| 274 |
+
tokenizer: Any,
|
| 275 |
+
samples_per_task: int = 10,
|
| 276 |
+
curriculum_stage: int = 0, # 0 = all tasks, 1 = easier tasks weighted, etc.
|
| 277 |
+
) -> Dataset:
|
| 278 |
+
"""
|
| 279 |
+
Generate a dataset with curriculum-based sampling.
|
| 280 |
+
|
| 281 |
+
Args:
|
| 282 |
+
env: WhipStudio environment client
|
| 283 |
+
tokenizer: Model tokenizer
|
| 284 |
+
samples_per_task: Base samples per task
|
| 285 |
+
curriculum_stage: 0=uniform, higher=bias toward easier tasks
|
| 286 |
+
"""
|
| 287 |
+
records = []
|
| 288 |
+
|
| 289 |
+
# Compute task weights based on curriculum stage
|
| 290 |
+
task_weights = {}
|
| 291 |
+
for task_id, difficulty in TASK_DIFFICULTY.items():
|
| 292 |
+
if curriculum_stage == 0:
|
| 293 |
+
weight = 1.0
|
| 294 |
+
else:
|
| 295 |
+
# Higher curriculum_stage = more weight on easier tasks
|
| 296 |
+
weight = max(0.2, 1.0 - (difficulty - 1) * 0.2 * curriculum_stage)
|
| 297 |
+
task_weights[task_id] = weight
|
| 298 |
+
|
| 299 |
+
# Normalize weights
|
| 300 |
+
total_weight = sum(task_weights.values())
|
| 301 |
+
task_weights = {k: v / total_weight for k, v in task_weights.items()}
|
| 302 |
+
|
| 303 |
+
for task_id in ALL_TASKS:
|
| 304 |
+
print(f" Fetching observation for {task_id} (weight={task_weights[task_id]:.2f})...")
|
| 305 |
+
obs = env.reset(task_id)
|
| 306 |
+
user_prompt = build_user_prompt(
|
| 307 |
+
task_description=obs.get("task_description", ""),
|
| 308 |
+
buggy_code=obs.get("buggy_code", ""),
|
| 309 |
+
)
|
| 310 |
+
formatted = format_chat(tokenizer, user_prompt)
|
| 311 |
+
|
| 312 |
+
# Number of samples proportional to weight
|
| 313 |
+
n_samples = max(1, int(samples_per_task * task_weights[task_id] * len(ALL_TASKS)))
|
| 314 |
+
|
| 315 |
+
for _ in range(n_samples):
|
| 316 |
+
records.append({
|
| 317 |
+
"prompt": formatted,
|
| 318 |
+
"task_id": task_id,
|
| 319 |
+
})
|
| 320 |
+
|
| 321 |
+
random.shuffle(records)
|
| 322 |
+
return Dataset.from_list(records)
|
| 323 |
+
|
| 324 |
+
|
| 325 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 326 |
+
# Model Loading Utilities
|
| 327 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 328 |
+
|
| 329 |
+
def load_model_and_tokenizer(
|
| 330 |
+
model_name: str,
|
| 331 |
+
use_4bit: bool = False,
|
| 332 |
+
use_8bit: bool = False,
|
| 333 |
+
gradient_checkpointing: bool = False,
|
| 334 |
+
):
|
| 335 |
+
"""Load model with optional quantization and gradient checkpointing."""
|
| 336 |
+
|
| 337 |
+
print(f"Loading model: {model_name}")
|
| 338 |
+
|
| 339 |
+
# Tokenizer
|
| 340 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
|
| 341 |
+
if tokenizer.pad_token is None:
|
| 342 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 343 |
+
|
| 344 |
+
# Quantization config
|
| 345 |
+
quantization_config = None
|
| 346 |
+
if use_4bit:
|
| 347 |
+
if not PEFT_AVAILABLE:
|
| 348 |
+
raise ImportError("4-bit quantization requires peft and bitsandbytes")
|
| 349 |
+
quantization_config = BitsAndBytesConfig(
|
| 350 |
+
load_in_4bit=True,
|
| 351 |
+
bnb_4bit_quant_type="nf4",
|
| 352 |
+
bnb_4bit_compute_dtype=torch.bfloat16,
|
| 353 |
+
bnb_4bit_use_double_quant=True,
|
| 354 |
+
)
|
| 355 |
+
print(" Using 4-bit quantization")
|
| 356 |
+
elif use_8bit:
|
| 357 |
+
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
|
| 358 |
+
print(" Using 8-bit quantization")
|
| 359 |
+
|
| 360 |
+
# Model kwargs
|
| 361 |
+
model_kwargs = {
|
| 362 |
+
"trust_remote_code": True,
|
| 363 |
+
"torch_dtype": torch.bfloat16 if not (use_4bit or use_8bit) else None,
|
| 364 |
+
"device_map": "auto",
|
| 365 |
+
}
|
| 366 |
+
if quantization_config:
|
| 367 |
+
model_kwargs["quantization_config"] = quantization_config
|
| 368 |
+
|
| 369 |
+
# Load model
|
| 370 |
+
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
|
| 371 |
+
|
| 372 |
+
# Prepare for k-bit training if quantized
|
| 373 |
+
if use_4bit or use_8bit:
|
| 374 |
+
model = prepare_model_for_kbit_training(model)
|
| 375 |
+
|
| 376 |
+
# Gradient checkpointing
|
| 377 |
+
if gradient_checkpointing:
|
| 378 |
+
model.gradient_checkpointing_enable()
|
| 379 |
+
print(" Gradient checkpointing enabled")
|
| 380 |
+
|
| 381 |
+
param_count = sum(p.numel() for p in model.parameters())
|
| 382 |
+
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
| 383 |
+
print(f" Total params: {param_count / 1e6:.1f}M, Trainable: {trainable / 1e6:.1f}M")
|
| 384 |
+
|
| 385 |
+
return model, tokenizer
|
| 386 |
+
|
| 387 |
+
|
| 388 |
+
def apply_lora(
|
| 389 |
+
model,
|
| 390 |
+
lora_r: int = 16,
|
| 391 |
+
lora_alpha: int = 32,
|
| 392 |
+
target_modules: Optional[list[str]] = None,
|
| 393 |
+
):
|
| 394 |
+
"""Apply LoRA adapters to the model."""
|
| 395 |
+
if not PEFT_AVAILABLE:
|
| 396 |
+
raise ImportError("LoRA requires peft: pip install peft")
|
| 397 |
+
|
| 398 |
+
if target_modules is None:
|
| 399 |
+
# Default targets for Qwen2 and similar architectures
|
| 400 |
+
target_modules = [
|
| 401 |
+
"q_proj", "k_proj", "v_proj", "o_proj",
|
| 402 |
+
"gate_proj", "up_proj", "down_proj",
|
| 403 |
+
]
|
| 404 |
+
|
| 405 |
+
lora_config = LoraConfig(
|
| 406 |
+
r=lora_r,
|
| 407 |
+
lora_alpha=lora_alpha,
|
| 408 |
+
target_modules=target_modules,
|
| 409 |
+
lora_dropout=0.05,
|
| 410 |
+
bias="none",
|
| 411 |
+
task_type="CAUSAL_LM",
|
| 412 |
+
)
|
| 413 |
+
|
| 414 |
+
model = get_peft_model(model, lora_config)
|
| 415 |
+
|
| 416 |
+
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
| 417 |
+
print(f" LoRA applied: r={lora_r}, trainable params: {trainable / 1e6:.2f}M")
|
| 418 |
+
|
| 419 |
+
return model, lora_config
|
| 420 |
+
|
| 421 |
+
|
| 422 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 423 |
+
# Validation & Evaluation
|
| 424 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 425 |
+
|
| 426 |
+
def evaluate_model(
|
| 427 |
+
model,
|
| 428 |
+
tokenizer,
|
| 429 |
+
env: WhipStudioEnv,
|
| 430 |
+
task_ids: Optional[list[str]] = None,
|
| 431 |
+
max_new_tokens: int = 2048,
|
| 432 |
+
) -> dict[str, float]:
|
| 433 |
+
"""Evaluate model on tasks and return scores."""
|
| 434 |
+
if task_ids is None:
|
| 435 |
+
task_ids = ALL_TASKS
|
| 436 |
+
|
| 437 |
+
model.eval()
|
| 438 |
+
scores = {}
|
| 439 |
+
|
| 440 |
+
for task_id in task_ids:
|
| 441 |
+
obs = env.reset(task_id)
|
| 442 |
+
user_prompt = build_user_prompt(obs["task_description"], obs["buggy_code"])
|
| 443 |
+
formatted = format_chat(tokenizer, user_prompt)
|
| 444 |
+
|
| 445 |
+
inputs = tokenizer(formatted, return_tensors="pt", truncation=True, max_length=4096)
|
| 446 |
+
inputs = {k: v.to(model.device) for k, v in inputs.items()}
|
| 447 |
+
|
| 448 |
+
with torch.no_grad():
|
| 449 |
+
outputs = model.generate(
|
| 450 |
+
**inputs,
|
| 451 |
+
max_new_tokens=max_new_tokens,
|
| 452 |
+
temperature=0.2,
|
| 453 |
+
top_p=0.95,
|
| 454 |
+
do_sample=True,
|
| 455 |
+
pad_token_id=tokenizer.pad_token_id,
|
| 456 |
+
)
|
| 457 |
+
|
| 458 |
+
generated = outputs[0][inputs["input_ids"].shape[1]:]
|
| 459 |
+
response = tokenizer.decode(generated, skip_special_tokens=True)
|
| 460 |
+
fixed_code = extract_code_from_response(response)
|
| 461 |
+
|
| 462 |
+
env.reset(task_id)
|
| 463 |
+
result = env.step(fixed_code, attempt=1)
|
| 464 |
+
reward = float(result.get("reward", 0.0) or 0.0)
|
| 465 |
+
scores[task_id] = reward
|
| 466 |
+
print(f" {task_id}: {reward:.4f}")
|
| 467 |
+
|
| 468 |
+
return scores
|
| 469 |
+
|
| 470 |
+
|
| 471 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 472 |
+
# Main Training Loop
|
| 473 |
+
# ══════════════════════════════════════════════════════════════════════════════
|
| 474 |
+
|
| 475 |
+
def main():
|
| 476 |
+
parser = argparse.ArgumentParser(description="Improved GRPO training for WhipStudio")
|
| 477 |
+
|
| 478 |
+
# Environment
|
| 479 |
+
parser.add_argument("--env_url", type=str, required=True,
|
| 480 |
+
help="URL of the WhipStudio HF Space")
|
| 481 |
+
|
| 482 |
+
# Model
|
| 483 |
+
parser.add_argument("--model_name", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
|
| 484 |
+
help="Base model to fine-tune")
|
| 485 |
+
parser.add_argument("--output_dir", type=str, default="./whipstudio-debugger",
|
| 486 |
+
help="Directory to save the trained model")
|
| 487 |
+
|
| 488 |
+
# Quantization & Memory
|
| 489 |
+
parser.add_argument("--use_4bit", action="store_true",
|
| 490 |
+
help="Use 4-bit quantization (requires bitsandbytes)")
|
| 491 |
+
parser.add_argument("--use_8bit", action="store_true",
|
| 492 |
+
help="Use 8-bit quantization")
|
| 493 |
+
parser.add_argument("--gradient_checkpointing", action="store_true",
|
| 494 |
+
help="Enable gradient checkpointing to save memory")
|
| 495 |
+
|
| 496 |
+
# LoRA
|
| 497 |
+
parser.add_argument("--use_lora", action="store_true",
|
| 498 |
+
help="Use LoRA for efficient fine-tuning")
|
| 499 |
+
parser.add_argument("--lora_r", type=int, default=16,
|
| 500 |
+
help="LoRA rank")
|
| 501 |
+
parser.add_argument("--lora_alpha", type=int, default=32,
|
| 502 |
+
help="LoRA alpha")
|
| 503 |
+
|
| 504 |
+
# Training
|
| 505 |
+
parser.add_argument("--num_iterations", type=int, default=50,
|
| 506 |
+
help="Total number of training epochs, split evenly across curriculum stages")
|
| 507 |
+
parser.add_argument("--group_size", type=int, default=4,
|
| 508 |
+
help="Number of completions per prompt for GRPO")
|
| 509 |
+
parser.add_argument("--samples_per_task", type=int, default=10,
|
| 510 |
+
help="Base samples per task in dataset")
|
| 511 |
+
parser.add_argument("--learning_rate", type=float, default=1e-5,
|
| 512 |
+
help="Learning rate")
|
| 513 |
+
parser.add_argument("--max_new_tokens", type=int, default=2048,
|
| 514 |
+
help="Max tokens to generate per completion")
|
| 515 |
+
parser.add_argument("--beta", type=float, default=0.1,
|
| 516 |
+
help="KL penalty coefficient")
|
| 517 |
+
|
| 518 |
+
# Curriculum
|
| 519 |
+
parser.add_argument("--curriculum_stages", type=int, default=3,
|
| 520 |
+
help="Number of curriculum stages (0 = no curriculum)")
|
| 521 |
+
|
| 522 |
+
# Logging
|
| 523 |
+
parser.add_argument("--use_wandb", action="store_true",
|
| 524 |
+
help="Log to Weights & Biases")
|
| 525 |
+
parser.add_argument("--wandb_project", type=str, default="whipstudio",
|
| 526 |
+
help="W&B project name")
|
| 527 |
+
|
| 528 |
+
# Early stopping
|
| 529 |
+
parser.add_argument("--patience", type=int, default=10,
|
| 530 |
+
help="Early stopping patience (epochs without improvement)")
|
| 531 |
+
parser.add_argument("--eval_every", type=int, default=5,
|
| 532 |
+
help="Evaluate every N epochs")
|
| 533 |
+
|
| 534 |
+
# Hub
|
| 535 |
+
parser.add_argument("--push_to_hub", action="store_true",
|
| 536 |
+
help="Push trained model to HuggingFace Hub")
|
| 537 |
+
parser.add_argument("--hub_model_id", type=str, default=None,
|
| 538 |
+
help="Model ID on HF Hub")
|
| 539 |
+
|
| 540 |
+
args = parser.parse_args()
|
| 541 |
+
|
| 542 |
+
# ── Verify environment ──
|
| 543 |
+
print(f"\n{'=' * 60}")
|
| 544 |
+
print("WhipStudio Improved GRPO Training")
|
| 545 |
+
print(f"{'=' * 60}")
|
| 546 |
+
print(f"Environment: {args.env_url}")
|
| 547 |
+
|
| 548 |
+
env = WhipStudioEnv(args.env_url)
|
| 549 |
+
if not env.health_check():
|
| 550 |
+
raise ConnectionError(f"Cannot reach WhipStudio at {args.env_url}")
|
| 551 |
+
print("Environment is reachable ✓")
|
| 552 |
+
|
| 553 |
+
# ── Initialize wandb ──
|
| 554 |
+
if args.use_wandb:
|
| 555 |
+
if not WANDB_AVAILABLE:
|
| 556 |
+
print("Warning: wandb not installed, skipping logging")
|
| 557 |
+
args.use_wandb = False
|
| 558 |
+
else:
|
| 559 |
+
wandb.init(
|
| 560 |
+
project=args.wandb_project,
|
| 561 |
+
config=vars(args),
|
| 562 |
+
name=f"grpo-{args.model_name.split('/')[-1]}",
|
| 563 |
+
)
|
| 564 |
+
|
| 565 |
+
# ── Load model ──
|
| 566 |
+
model, tokenizer = load_model_and_tokenizer(
|
| 567 |
+
args.model_name,
|
| 568 |
+
use_4bit=args.use_4bit,
|
| 569 |
+
use_8bit=args.use_8bit,
|
| 570 |
+
gradient_checkpointing=args.gradient_checkpointing,
|
| 571 |
+
)
|
| 572 |
+
|
| 573 |
+
# ── Apply LoRA ──
|
| 574 |
+
peft_config = None
|
| 575 |
+
if args.use_lora:
|
| 576 |
+
model, peft_config = apply_lora(
|
| 577 |
+
model,
|
| 578 |
+
lora_r=args.lora_r,
|
| 579 |
+
lora_alpha=args.lora_alpha,
|
| 580 |
+
)
|
| 581 |
+
|
| 582 |
+
# ── Create output directory ──
|
| 583 |
+
output_path = Path(args.output_dir)
|
| 584 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 585 |
+
|
| 586 |
+
# ── Training with curriculum ──
|
| 587 |
+
best_avg_score = 0.0
|
| 588 |
+
epochs_without_improvement = 0
|
| 589 |
+
|
| 590 |
+
n_stages = max(1, args.curriculum_stages)
|
| 591 |
+
epochs_per_stage = max(1, args.num_iterations // n_stages)  # never schedule a zero-epoch stage
|
| 592 |
+
|
| 593 |
+
for stage in range(n_stages):
|
| 594 |
+
print(f"\n{'=' * 60}")
|
| 595 |
+
print(f"Curriculum Stage {stage + 1}/{n_stages}")
|
| 596 |
+
print(f"{'=' * 60}")
|
| 597 |
+
|
| 598 |
+
# Generate dataset for this curriculum stage
|
| 599 |
+
dataset = generate_curriculum_dataset(
|
| 600 |
+
env, tokenizer,
|
| 601 |
+
samples_per_task=args.samples_per_task,
|
| 602 |
+
curriculum_stage=n_stages - 1 - stage,  # easier tasks first; the last stage uses the uniform mix
|
| 603 |
+
)
|
| 604 |
+
print(f"Dataset: {len(dataset)} samples")
|
| 605 |
+
|
| 606 |
+
# Create reward function
|
| 607 |
+
reward_fn = create_reward_function(env, verbose=True)
|
| 608 |
+
|
| 609 |
+
# Configure GRPO
|
| 610 |
+
grpo_config = GRPOConfig(
|
| 611 |
+
output_dir=str(output_path / f"stage_{stage}"),
|
| 612 |
+
num_train_epochs=epochs_per_stage,
|
| 613 |
+
per_device_train_batch_size=1,
|
| 614 |
+
gradient_accumulation_steps=4,
|
| 615 |
+
learning_rate=args.learning_rate,
|
| 616 |
+
lr_scheduler_type="cosine",
|
| 617 |
+
warmup_ratio=0.1,
|
| 618 |
+
max_completion_length=args.max_new_tokens,
|
| 619 |
+
num_generations=args.group_size,
|
| 620 |
+
logging_steps=1,
|
| 621 |
+
save_steps=epochs_per_stage,
|
| 622 |
+
save_total_limit=2,
|
| 623 |
+
bf16=True,
|
| 624 |
+
report_to="wandb" if args.use_wandb else "none",
|
| 625 |
+
beta=args.beta,
|
| 626 |
+
remove_unused_columns=False,
|
| 627 |
+
)
|
| 628 |
+
|
| 629 |
+
# Initialize trainer
|
| 630 |
+
trainer = GRPOTrainer(
|
| 631 |
+
model=model,
|
| 632 |
+
args=grpo_config,
|
| 633 |
+
train_dataset=dataset,
|
| 634 |
+
processing_class=tokenizer,
|
| 635 |
+
reward_funcs=reward_fn,
|
| 636 |
+
peft_config=None,  # LoRA adapters are already attached by apply_lora(); passing the config again would wrap the model twice
|
| 637 |
+
)
|
| 638 |
+
|
| 639 |
+
# Train
|
| 640 |
+
print(f"\nTraining stage {stage + 1}...")
|
| 641 |
+
train_result = trainer.train()
|
| 642 |
+
print(f" Stage {stage + 1} complete: {train_result.global_step} steps")
|
| 643 |
+
|
| 644 |
+
# Evaluate
|
| 645 |
+
print("\nEvaluating...")
|
| 646 |
+
scores = evaluate_model(model, tokenizer, env)
|
| 647 |
+
avg_score = sum(scores.values()) / len(scores)
|
| 648 |
+
print(f" Average score: {avg_score:.4f}")
|
| 649 |
+
|
| 650 |
+
if args.use_wandb:
|
| 651 |
+
wandb.log({
|
| 652 |
+
"stage": stage + 1,
|
| 653 |
+
"avg_score": avg_score,
|
| 654 |
+
**{f"score/{k}": v for k, v in scores.items()},
|
| 655 |
+
})
|
| 656 |
+
|
| 657 |
+
# Track best model
|
| 658 |
+
if avg_score > best_avg_score:
|
| 659 |
+
best_avg_score = avg_score
|
| 660 |
+
epochs_without_improvement = 0
|
| 661 |
+
# Save best model
|
| 662 |
+
best_path = output_path / "best"
|
| 663 |
+
trainer.save_model(str(best_path))
|
| 664 |
+
tokenizer.save_pretrained(str(best_path))
|
| 665 |
+
print(f" New best model saved (score={avg_score:.4f})")
|
| 666 |
+
else:
|
| 667 |
+
epochs_without_improvement += epochs_per_stage
|
| 668 |
+
|
| 669 |
+
# Early stopping
|
| 670 |
+
if epochs_without_improvement >= args.patience:
|
| 671 |
+
print(f"\nEarly stopping: no improvement for {args.patience} epochs")
|
| 672 |
+
break
|
| 673 |
+
|
| 674 |
+
# ── Final save ──
|
| 675 |
+
final_path = output_path / "final"
|
| 676 |
+
trainer.save_model(str(final_path))
|
| 677 |
+
tokenizer.save_pretrained(str(final_path))
|
| 678 |
+
print(f"\nFinal model saved to {final_path}")
|
| 679 |
+
|
| 680 |
+
# ── Push to hub ──
|
| 681 |
+
if args.push_to_hub and args.hub_model_id:
|
| 682 |
+
print(f"Pushing to Hub as {args.hub_model_id}...")
|
| 683 |
+
trainer.model.push_to_hub(args.hub_model_id)  # Trainer.push_to_hub() takes a commit message as its first argument, not a repo id
|
| 684 |
+
tokenizer.push_to_hub(args.hub_model_id)
|
| 685 |
+
print("Pushed to Hub ✓")
|
| 686 |
+
|
| 687 |
+
# ── Final evaluation ──
|
| 688 |
+
print(f"\n{'=' * 60}")
|
| 689 |
+
print("Final Evaluation on All Tasks")
|
| 690 |
+
print(f"{'=' * 60}")
|
| 691 |
+
final_scores = evaluate_model(model, tokenizer, env)
|
| 692 |
+
final_avg = sum(final_scores.values()) / len(final_scores)
|
| 693 |
+
print(f"\nFinal average score: {final_avg:.4f}")
|
| 694 |
+
print(f"Best average score during training: {best_avg_score:.4f}")
|
| 695 |
+
|
| 696 |
+
if args.use_wandb:
|
| 697 |
+
wandb.log({"final_avg_score": final_avg})
|
| 698 |
+
wandb.finish()
|
| 699 |
+
|
| 700 |
+
# ── Save training summary ──
|
| 701 |
+
summary = {
|
| 702 |
+
"model_name": args.model_name,
|
| 703 |
+
"final_avg_score": final_avg,
|
| 704 |
+
"best_avg_score": best_avg_score,
|
| 705 |
+
"final_scores": final_scores,
|
| 706 |
+
"curriculum_stages": n_stages,
|
| 707 |
+
"use_lora": args.use_lora,
|
| 708 |
+
"use_4bit": args.use_4bit,
|
| 709 |
+
}
|
| 710 |
+
with open(output_path / "training_summary.json", "w") as f:
|
| 711 |
+
json.dump(summary, f, indent=2)
|
| 712 |
+
|
| 713 |
+
print("\nTraining complete! ✓")
|
| 714 |
+
|
| 715 |
+
|
| 716 |
+
if __name__ == "__main__":
|
| 717 |
+
main()
|
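The best-model tracking and early-stopping bookkeeping in the staged loop above can be isolated into a small helper, which makes the patience rule easier to test. This is a minimal sketch, not part of `improved_agent.py`: the `EarlyStopper` class is hypothetical, and it assumes the best score starts at 0.0 (the initial value of `best_avg_score` is not visible in this hunk).

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Hypothetical helper mirroring the best-score / patience bookkeeping."""

    patience: int                      # epochs allowed without improvement
    best_score: float = 0.0            # assumed starting value
    epochs_without_improvement: int = 0

    def update(self, avg_score: float, epochs_per_stage: int) -> bool:
        """Record one stage's average score; return True when training should stop."""
        if avg_score > self.best_score:
            # Strict improvement: remember it and reset the counter.
            self.best_score = avg_score
            self.epochs_without_improvement = 0
        else:
            # No improvement: the whole stage counts against the patience budget.
            self.epochs_without_improvement += epochs_per_stage
        return self.epochs_without_improvement >= self.patience
```

Note that, as in the loop above, a score equal to the current best does not count as an improvement, and the counter advances by `epochs_per_stage` rather than by one.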
inference.py
ADDED
|
@@ -0,0 +1,368 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Hackathon-compliant inference script for WhipStudio ML Debug Environment.
|
| 4 |
+
|
| 5 |
+
This script follows the Scaler Meta PyTorch Hackathon requirements:
|
| 6 |
+
- Uses OpenAI-compatible client with API_BASE_URL and MODEL_NAME
|
| 7 |
+
- Emits structured stdout logs: [START], [STEP], [END]
|
| 8 |
+
- Respects runtime limit (<20 min) and resource constraints
|
| 9 |
+
|
| 10 |
+
Environment Variables:
|
| 11 |
+
API_BASE_URL: The API endpoint for the LLM (e.g., https://api.openai.com/v1)
|
| 12 |
+
MODEL_NAME: The model identifier (e.g., gpt-4, Qwen/Qwen2.5-Coder-32B-Instruct)
|
| 13 |
+
HF_TOKEN: Your API key / HuggingFace token
|
| 14 |
+
|
| 15 |
+
Usage:
|
| 16 |
+
# With environment at localhost
|
| 17 |
+
python inference.py --env-url http://localhost:7860
|
| 18 |
+
|
| 19 |
+
# With HF Space
|
| 20 |
+
python inference.py --env-url https://your-space.hf.space
|
| 21 |
+
"""
|
| 22 |
+
|
| 23 |
+
import argparse
|
| 24 |
+
import json
|
| 25 |
+
import os
|
| 26 |
+
import sys
|
| 27 |
+
import time
|
| 28 |
+
from typing import Any
|
| 29 |
+
|
| 30 |
+
import httpx
|
| 31 |
+
from openai import OpenAI
|
| 32 |
+
|
| 33 |
+
# ── Configuration ─────────────────────────────────────────────────────────────
|
| 34 |
+
|
| 35 |
+
SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
|
| 36 |
+
You receive a broken training script and must fix ALL bugs in it.
|
| 37 |
+
|
| 38 |
+
Rules:
|
| 39 |
+
- Return ONLY the complete corrected Python code, nothing else.
|
| 40 |
+
- No markdown, no backticks, no explanation text.
|
| 41 |
+
- The script must print losses in format: LOSSES:[v1, v2, ...]
|
| 42 |
+
- For tasks requiring validation metrics, also print: VAL_ACC:X.XX or VAL_ACCS:[v1,...] and FINAL_LOSS:X.XX
|
| 43 |
+
- Keep all torch.manual_seed() calls intact.
|
| 44 |
+
- Wrap all metrics in ##METRICS_START## and ##METRICS_END## markers.""".strip()
|
| 45 |
+
|
| 46 |
+
TASK_IDS = ["task1", "task2", "task3", "task4", "task5", "task6"]
|
| 47 |
+
|
| 48 |
+
MAX_ATTEMPTS_PER_TASK = 3
|
| 49 |
+
REQUEST_TIMEOUT = 180.0 # 3 minutes per LLM call
|
| 50 |
+
STEP_TIMEOUT = 120.0 # 2 minutes per step (code execution)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
# ── Logging Helpers ───────────────────────────────────────────────────────────
|
| 54 |
+
|
| 55 |
+
def log_start(task_id: str) -> None:
|
| 56 |
+
"""Emit [START] log for a task."""
|
| 57 |
+
print(f"[START] task_id={task_id}", flush=True)
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def log_step(task_id: str, step: int, action_summary: str, reward: float, done: bool) -> None:
|
| 61 |
+
"""Emit [STEP] log for a step within a task."""
|
| 62 |
+
print(
|
| 63 |
+
f"[STEP] task_id={task_id} step={step} action={action_summary} reward={reward:.4f} done={str(done).lower()}",
|
| 64 |
+
flush=True
|
| 65 |
+
)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
def log_end(task_id: str, final_score: float) -> None:
|
| 69 |
+
"""Emit [END] log for a task."""
|
| 70 |
+
print(f"[END] task_id={task_id} final_score={final_score:.4f}", flush=True)
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
# ── LLM Client ────────────────────────────────────────────────────────────────
|
| 74 |
+
|
| 75 |
+
def get_openai_client() -> OpenAI:
|
| 76 |
+
"""Initialize OpenAI-compatible client from environment variables."""
|
| 77 |
+
api_base = os.environ.get("API_BASE_URL")
|
| 78 |
+
api_key = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY")
|
| 79 |
+
|
| 80 |
+
if not api_key:
|
| 81 |
+
raise RuntimeError(
|
| 82 |
+
"HF_TOKEN or OPENAI_API_KEY must be set in environment"
|
| 83 |
+
)
|
| 84 |
+
|
| 85 |
+
# Default to OpenAI API if no base URL specified
|
| 86 |
+
if not api_base:
|
| 87 |
+
api_base = "https://api.openai.com/v1"
|
| 88 |
+
|
| 89 |
+
return OpenAI(
|
| 90 |
+
base_url=api_base,
|
| 91 |
+
api_key=api_key,
|
| 92 |
+
timeout=REQUEST_TIMEOUT,
|
| 93 |
+
)
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def get_model_name() -> str:
|
| 97 |
+
"""Get model name from environment or use default."""
|
| 98 |
+
return os.environ.get("MODEL_NAME", "gpt-4o-mini")
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def generate_fix(client: OpenAI, model: str, prompt: str) -> str:
|
| 102 |
+
"""Generate a code fix using the LLM."""
|
| 103 |
+
try:
|
| 104 |
+
response = client.chat.completions.create(
|
| 105 |
+
model=model,
|
| 106 |
+
messages=[
|
| 107 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 108 |
+
{"role": "user", "content": prompt},
|
| 109 |
+
],
|
| 110 |
+
temperature=0.2,
|
| 111 |
+
max_tokens=4096,
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
content = response.choices[0].message.content or ""
|
| 115 |
+
|
| 116 |
+
# Strip markdown fences if present
|
| 117 |
+
if "```python" in content:
|
| 118 |
+
content = content.split("```python", 1)[1].split("```", 1)[0].strip()
|
| 119 |
+
elif "```" in content:
|
| 120 |
+
content = content.split("```", 1)[1].split("```", 1)[0].strip()
|
| 121 |
+
|
| 122 |
+
return content.strip()
|
| 123 |
+
|
| 124 |
+
except Exception as e:
|
| 125 |
+
print(f"[ERROR] LLM call failed: {e}", file=sys.stderr)
|
| 126 |
+
return ""
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
# ── Environment Client ────────────────────────────────────────────────────────
|
| 130 |
+
|
| 131 |
+
class WhipStudioClient:
|
| 132 |
+
"""HTTP client for the WhipStudio environment."""
|
| 133 |
+
|
| 134 |
+
def __init__(self, env_url: str):
|
| 135 |
+
self.env_url = env_url.rstrip("/")
|
| 136 |
+
self.timeout = httpx.Timeout(STEP_TIMEOUT, connect=10.0)
|
| 137 |
+
|
| 138 |
+
def health_check(self) -> bool:
|
| 139 |
+
"""Check if the environment is reachable."""
|
| 140 |
+
try:
|
| 141 |
+
with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
|
| 142 |
+
resp = client.get(f"{self.env_url}/health")
|
| 143 |
+
return resp.status_code == 200
|
| 144 |
+
except Exception:
|
| 145 |
+
return False
|
| 146 |
+
|
| 147 |
+
def reset(self, task_id: str) -> dict:
|
| 148 |
+
"""Reset environment to a specific task."""
|
| 149 |
+
with httpx.Client(timeout=self.timeout) as client:
|
| 150 |
+
resp = client.post(
|
| 151 |
+
f"{self.env_url}/reset",
|
| 152 |
+
json={"task_id": task_id}
|
| 153 |
+
)
|
| 154 |
+
resp.raise_for_status()
|
| 155 |
+
data = resp.json()
|
| 156 |
+
return data.get("observation", data)
|
| 157 |
+
|
| 158 |
+
def step(self, fixed_code: str, attempt_number: int = 1) -> dict:
|
| 159 |
+
"""Submit a fix and get the result."""
|
| 160 |
+
payload = {
|
| 161 |
+
"action": {
|
| 162 |
+
"fixed_code": fixed_code,
|
| 163 |
+
"attempt_number": attempt_number,
|
| 164 |
+
}
|
| 165 |
+
}
|
| 166 |
+
|
| 167 |
+
with httpx.Client(timeout=self.timeout) as client:
|
| 168 |
+
resp = client.post(f"{self.env_url}/step", json=payload)
|
| 169 |
+
|
| 170 |
+
# Handle potential 422 from different API versions
|
| 171 |
+
if resp.status_code == 422:
|
| 172 |
+
resp = client.post(
|
| 173 |
+
f"{self.env_url}/step",
|
| 174 |
+
json={
|
| 175 |
+
"fixed_code": fixed_code,
|
| 176 |
+
"attempt_number": attempt_number,
|
| 177 |
+
}
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
resp.raise_for_status()
|
| 181 |
+
return resp.json()
|
| 182 |
+
|
| 183 |
+
def get_tasks(self) -> list[str]:
|
| 184 |
+
"""Get list of available tasks (returns task IDs only)."""
|
| 185 |
+
try:
|
| 186 |
+
with httpx.Client(timeout=self.timeout) as client:
|
| 187 |
+
resp = client.get(f"{self.env_url}/tasks")
|
| 188 |
+
if resp.status_code == 200:
|
| 189 |
+
data = resp.json()
|
| 190 |
+
if isinstance(data, list):
|
| 191 |
+
# Extract task IDs from task objects
|
| 192 |
+
task_ids = []
|
| 193 |
+
for t in data:
|
| 194 |
+
if isinstance(t, dict):
|
| 195 |
+
task_ids.append(t.get("id", str(t)))
|
| 196 |
+
else:
|
| 197 |
+
task_ids.append(str(t))
|
| 198 |
+
return task_ids
|
| 199 |
+
elif isinstance(data, dict):
|
| 200 |
+
tasks = data.get("tasks", [])
|
| 201 |
+
return [t.get("id", str(t)) if isinstance(t, dict) else str(t) for t in tasks]
|
| 202 |
+
except Exception as e:
|
| 203 |
+
print(f"[WARNING] Could not fetch tasks from /tasks endpoint: {e}", file=sys.stderr)
|
| 204 |
+
|
| 205 |
+
# Fallback to default task IDs
|
| 206 |
+
return TASK_IDS
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
# ── Main Inference Loop ───────────────────────────────────────────────────────
|
| 210 |
+
|
| 211 |
+
def build_prompt(obs: dict) -> str:
|
| 212 |
+
"""Build the user prompt from observation."""
|
| 213 |
+
task_desc = obs.get("task_description", "Fix the buggy code.")
|
| 214 |
+
buggy_code = obs.get("buggy_code", "")
|
| 215 |
+
error_log = obs.get("error_log", "None")
|
| 216 |
+
last_reward = obs.get("last_reward", 0.0)
|
| 217 |
+
|
| 218 |
+
return f"""Task: {task_desc}
|
| 219 |
+
|
| 220 |
+
Buggy code:
|
| 221 |
+
{buggy_code}
|
| 222 |
+
|
| 223 |
+
Previous execution output (if any):
|
| 224 |
+
{error_log}
|
| 225 |
+
|
| 226 |
+
Previous score: {last_reward}""".strip()
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def run_task(
|
| 230 |
+
env: WhipStudioClient,
|
| 231 |
+
llm_client: OpenAI,
|
| 232 |
+
model: str,
|
| 233 |
+
task_id: str,
|
| 234 |
+
) -> float:
|
| 235 |
+
"""Run inference on a single task. Returns the best score achieved."""
|
| 236 |
+
|
| 237 |
+
# Ensure task_id is a string
|
| 238 |
+
if isinstance(task_id, dict):
|
| 239 |
+
task_id = task_id.get("id", str(task_id))
|
| 240 |
+
|
| 241 |
+
log_start(task_id)
|
| 242 |
+
|
| 243 |
+
try:
|
| 244 |
+
obs = env.reset(task_id)
|
| 245 |
+
except Exception as e:
|
| 246 |
+
error_msg = str(e)
|
| 247 |
+
print(f"[ERROR] Failed to reset {task_id}: {error_msg}", file=sys.stderr)
|
| 248 |
+
|
| 249 |
+
# Check if it's a 500 error - likely environment issue
|
| 250 |
+
if "500" in error_msg:
|
| 251 |
+
print("[ERROR] HF Space returned 500 - the environment may be starting up or having issues", file=sys.stderr)
|
| 252 |
+
print(f"[ERROR] Try visiting {env.env_url} in a browser first", file=sys.stderr)
|
| 253 |
+
|
| 254 |
+
log_end(task_id, 0.0)
|
| 255 |
+
return 0.0
|
| 256 |
+
|
| 257 |
+
best_score = 0.0
|
| 258 |
+
|
| 259 |
+
for step in range(1, MAX_ATTEMPTS_PER_TASK + 1):
|
| 260 |
+
prompt = build_prompt(obs)
|
| 261 |
+
|
| 262 |
+
# Generate fix
|
| 263 |
+
fixed_code = generate_fix(llm_client, model, prompt)
|
| 264 |
+
|
| 265 |
+
if not fixed_code.strip():
|
| 266 |
+
log_step(task_id, step, "empty_response", 0.0, False)
|
| 267 |
+
continue
|
| 268 |
+
|
| 269 |
+
# Submit fix
|
| 270 |
+
try:
|
| 271 |
+
result = env.step(fixed_code, attempt_number=step)
|
| 272 |
+
|
| 273 |
+
reward = float(result.get("reward", 0.0) or 0.0)
|
| 274 |
+
done = result.get("done", False)
|
| 275 |
+
obs = result.get("observation", obs)
|
| 276 |
+
|
| 277 |
+
# Track best score
|
| 278 |
+
if reward > best_score:
|
| 279 |
+
best_score = reward
|
| 280 |
+
|
| 281 |
+
# Log step
|
| 282 |
+
code_len = len(fixed_code)
|
| 283 |
+
log_step(task_id, step, f"submit_fix({code_len}chars)", reward, done)
|
| 284 |
+
|
| 285 |
+
# Early exit if task is solved
|
| 286 |
+
if done or reward >= 0.95:
|
| 287 |
+
break
|
| 288 |
+
|
| 289 |
+
except Exception as e:
|
| 290 |
+
print(f"[ERROR] Step failed for {task_id}: {e}", file=sys.stderr)
|
| 291 |
+
log_step(task_id, step, "step_error", 0.0, False)
|
| 292 |
+
|
| 293 |
+
log_end(task_id, best_score)
|
| 294 |
+
return best_score
|
| 295 |
+
|
| 296 |
+
|
| 297 |
+
def main():
|
| 298 |
+
parser = argparse.ArgumentParser(
|
| 299 |
+
description="WhipStudio inference script for OpenEnv Hackathon"
|
| 300 |
+
)
|
| 301 |
+
parser.add_argument(
|
| 302 |
+
"--env-url",
|
| 303 |
+
default=os.environ.get("ENV_URL", "http://localhost:7860"),
|
| 304 |
+
help="URL of the WhipStudio environment"
|
| 305 |
+
)
|
| 306 |
+
parser.add_argument(
|
| 307 |
+
"--tasks",
|
| 308 |
+
nargs="+",
|
| 309 |
+
default=None,
|
| 310 |
+
help="Specific tasks to run (default: all tasks)"
|
| 311 |
+
)
|
| 312 |
+
args = parser.parse_args()
|
| 313 |
+
|
| 314 |
+
# Initialize clients
|
| 315 |
+
print(f"[INFO] Connecting to environment at {args.env_url}", flush=True)
|
| 316 |
+
env = WhipStudioClient(args.env_url)
|
| 317 |
+
|
| 318 |
+
if not env.health_check():
|
| 319 |
+
print(f"[ERROR] Cannot reach environment at {args.env_url}", file=sys.stderr)
|
| 320 |
+
sys.exit(1)
|
| 321 |
+
|
| 322 |
+
print("[INFO] Environment is reachable", flush=True)
|
| 323 |
+
|
| 324 |
+
# Initialize LLM client
|
| 325 |
+
llm_client = get_openai_client()
|
| 326 |
+
model = get_model_name()
|
| 327 |
+
print(f"[INFO] Using model: {model}", flush=True)
|
| 328 |
+
|
| 329 |
+
# Determine which tasks to run
|
| 330 |
+
if args.tasks:
|
| 331 |
+
task_ids = args.tasks
|
| 332 |
+
else:
|
| 333 |
+
task_ids = env.get_tasks()
|
| 334 |
+
|
| 335 |
+
print(f"[INFO] Running tasks: {task_ids}", flush=True)
|
| 336 |
+
|
| 337 |
+
# Run inference on all tasks
|
| 338 |
+
start_time = time.time()
|
| 339 |
+
scores = {}
|
| 340 |
+
|
| 341 |
+
for task_id in task_ids:
|
| 342 |
+
task_start = time.time()
|
| 343 |
+
score = run_task(env, llm_client, model, task_id)
|
| 344 |
+
scores[task_id] = score
|
| 345 |
+
task_elapsed = time.time() - task_start
|
| 346 |
+
print(f"[INFO] {task_id} completed in {task_elapsed:.1f}s with score {score:.4f}", flush=True)
|
| 347 |
+
|
| 348 |
+
# Summary
|
| 349 |
+
total_elapsed = time.time() - start_time
|
| 350 |
+
avg_score = sum(scores.values()) / len(scores) if scores else 0.0
|
| 351 |
+
|
| 352 |
+
print("\n" + "=" * 50, flush=True)
|
| 353 |
+
print("[SUMMARY]", flush=True)
|
| 354 |
+
print(f" Tasks completed: {len(scores)}", flush=True)
|
| 355 |
+
print(f" Total time: {total_elapsed:.1f}s", flush=True)
|
| 356 |
+
print(f" Average score: {avg_score:.4f}", flush=True)
|
| 357 |
+
print(" Per-task scores:", flush=True)
|
| 358 |
+
for tid, score in scores.items():
|
| 359 |
+
print(f" {tid}: {score:.4f}", flush=True)
|
| 360 |
+
print("=" * 50, flush=True)
|
| 361 |
+
|
| 362 |
+
# Warn (without changing the exit code) if the average score is very low
|
| 363 |
+
if avg_score < 0.1:
|
| 364 |
+
print("[WARNING] Average score below 0.1 threshold", file=sys.stderr)
|
| 365 |
+
|
| 366 |
+
|
| 367 |
+
if __name__ == "__main__":
|
| 368 |
+
main()
|
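The `[START]`, `[STEP]`, and `[END]` lines emitted by `log_start`, `log_step`, and `log_end` above are plain `key=value` records, so a downstream checker can recover them with a few lines of parsing. The following is a minimal sketch under that assumption; `parse_log_line` is a hypothetical helper, not part of `inference.py`, and it relies on the action summaries containing no spaces (true for the three summaries used above).

```python
import re
from typing import Optional

# Matches the three structured prefixes printed by inference.py.
_LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str) -> Optional[dict]:
    """Parse one structured log line; return None for any other line."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        return None
    event, rest = match.groups()
    record: dict = {"event": event}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            record[key] = value
    # Convert the fields whose types are fixed by the log helpers.
    if "step" in record:
        record["step"] = int(record["step"])
    for name in ("reward", "final_score"):
        if name in record:
            record[name] = float(record[name])
    if "done" in record:
        record["done"] = record["done"] == "true"
    return record
```

Lines such as `[INFO] ...` or `[SUMMARY]` are deliberately skipped, so the parser can be run over the script's full stdout.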
run_inference.sh
ADDED
|
@@ -0,0 +1,63 @@
|
| 1 |
+
#!/bin/bash
|
| 2 |
+
# Quick setup script for WhipStudio OpenEnv Hackathon
|
| 3 |
+
|
| 4 |
+
set -e
|
| 5 |
+
|
| 6 |
+
echo "=========================================="
|
| 7 |
+
echo "WhipStudio Hackathon Setup"
|
| 8 |
+
echo "=========================================="
|
| 9 |
+
|
| 10 |
+
# Step 1: Check environment variables
|
| 11 |
+
echo ""
|
| 12 |
+
echo "Step 1: Checking environment variables..."
|
| 13 |
+
|
| 14 |
+
if [ -z "$HF_TOKEN" ]; then
|
| 15 |
+
echo "⚠️ HF_TOKEN not set"
|
| 16 |
+
if [ -f .env ]; then
|
| 17 |
+
echo " Loading from .env file..."
|
| 18 |
+
export HF_TOKEN=$(grep -v '^#' .env | head -1 | sed 's/^HF_TOKEN=//')
|
| 19 |
+
echo " ✓ HF_TOKEN loaded"
|
| 20 |
+
else
|
| 21 |
+
echo " ❌ Please set HF_TOKEN environment variable or create .env file"
|
| 22 |
+
exit 1
|
| 23 |
+
fi
|
| 24 |
+
else
|
| 25 |
+
echo " ✓ HF_TOKEN is set"
|
| 26 |
+
fi
|
| 27 |
+
|
| 28 |
+
if [ -z "$API_BASE_URL" ]; then
|
| 29 |
+
echo "⚠️ API_BASE_URL not set, using HuggingFace Inference API"
|
| 30 |
+
export API_BASE_URL="https://api-inference.huggingface.co/v1"
|
| 31 |
+
fi
|
| 32 |
+
echo " ✓ API_BASE_URL: $API_BASE_URL"
|
| 33 |
+
|
| 34 |
+
if [ -z "$MODEL_NAME" ]; then
|
| 35 |
+
echo "⚠️ MODEL_NAME not set, using default"
|
| 36 |
+
export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
|
| 37 |
+
fi
|
| 38 |
+
echo " ✓ MODEL_NAME: $MODEL_NAME"
|
| 39 |
+
|
| 40 |
+
# Step 2: Check HF Space
|
| 41 |
+
ENV_URL="${1:-https://amogh-kal1-whipstudio.hf.space}"
|
| 42 |
+
echo ""
|
| 43 |
+
echo "Step 2: Checking HF Space at $ENV_URL..."
|
| 44 |
+
|
| 45 |
+
if curl -s --max-time 10 "$ENV_URL/health" > /dev/null 2>&1; then
|
| 46 |
+
echo " ✓ HF Space is reachable"
|
| 47 |
+
else
|
| 48 |
+
echo " ❌ HF Space not reachable or still starting up"
|
| 49 |
+
echo " Try visiting $ENV_URL in your browser first"
|
| 50 |
+
exit 1
|
| 51 |
+
fi
|
| 52 |
+
|
| 53 |
+
# Step 3: Run inference
|
| 54 |
+
echo ""
|
| 55 |
+
echo "Step 3: Running inference..."
|
| 56 |
+
echo ""
|
| 57 |
+
|
| 58 |
+
python3 inference.py --env-url "$ENV_URL"
|
| 59 |
+
|
| 60 |
+
echo ""
|
| 61 |
+
echo "=========================================="
|
| 62 |
+
echo "✅ Inference complete!"
|
| 63 |
+
echo "=========================================="
|
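The script above takes the token from the first non-comment line of `.env`, which only works when that line is the token (or the `HF_TOKEN` entry). A more tolerant lookup can be sketched in Python, the language of the rest of this commit. This is an illustration only: `read_hf_token` is a hypothetical helper, and it assumes `.env` contains either `KEY=value` lines (optionally prefixed with `export`) or a single bare token.

```python
from typing import Optional


def read_hf_token(env_text: str) -> Optional[str]:
    """Return the HF token found in the text of a .env file, or None."""
    bare_token = None
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if sep and key.strip() == "HF_TOKEN":
            # Explicit entry wins; drop optional surrounding quotes.
            return value.strip().strip("'\"")
        if not sep and bare_token is None:
            # A line without '=' is treated as a bare token (assumption).
            bare_token = line
    return bare_token
```

A caller would pass the file contents, for example `read_hf_token(open(".env").read())`, and fall back to an error message when the result is `None`.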
server/app.py
CHANGED
|
@@ -95,6 +95,7 @@ def list_tasks():
|
|
| 95 |
{"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
|
| 96 |
{"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
|
| 97 |
{"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
|
|
|
|
| 98 |
],
|
| 99 |
"action_schema": {
|
| 100 |
"fixed_code": "string (required) — complete runnable Python script",
|
|
@@ -122,16 +123,26 @@ def run_grader(payload: dict):
|
|
| 122 |
@app.get("/baseline")
|
| 123 |
async def run_baseline(request: Request):
|
| 124 |
try:
|
| 125 |
-
from ..baseline_agent import run_single_task
|
| 126 |
except ImportError:
|
| 127 |
-
from baseline_agent import run_single_task
|
| 128 |
|
| 129 |
env_url = str(request.base_url).rstrip("/")
|
| 130 |
results = {}
|
| 131 |
task_scores = {}
|
| 132 |
-
for task_id in ["task1", "task2", "task3", "task4", "task5"]:
|
| 133 |
try:
|
| 134 |
-
score = await asyncio.wait_for(
|
| 135 |
results[task_id] = round(score, 4)
|
| 136 |
task_scores[task_id] = round(score, 4)
|
| 137 |
except asyncio.TimeoutError:
|
|
@@ -147,24 +158,43 @@ async def run_baseline(request: Request):
|
|
| 147 |
task_scores[task_id] = 0.0
|
| 148 |
results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
|
| 149 |
avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
|
| 150 |
-
return {
|
| 151 |
|
| 152 |
|
| 153 |
@app.get("/baseline/task/{task_id}")
|
| 154 |
async def run_baseline_single(task_id: str, request: Request):
|
| 155 |
"""Run the baseline agent on a single task. Returns score + details."""
|
| 156 |
try:
|
| 157 |
-
from ..baseline_agent import run_single_task_detailed
|
| 158 |
except ImportError:
|
| 159 |
-
from baseline_agent import run_single_task_detailed
|
| 160 |
|
| 161 |
env_url = str(request.base_url).rstrip("/")
|
| 162 |
try:
|
| 163 |
-
result = await asyncio.wait_for(
|
| 164 |
return {
|
| 165 |
"task_id": task_id,
|
| 166 |
"score": round(result["score"], 4),
|
| 167 |
"status": "ok",
|
|
|
|
| 168 |
"fixed_code": result.get("fixed_code", ""),
|
| 169 |
"output": result.get("output", ""),
|
| 170 |
"attempts": result.get("attempts", []),
|
|
|
|
| 95 |
{"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
|
| 96 |
{"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
|
| 97 |
{"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
|
| 98 |
+
{"id": "task6", "name": "Input-Output mismatch", "difficulty": "hard"},
|
| 99 |
],
|
| 100 |
"action_schema": {
|
| 101 |
"fixed_code": "string (required) — complete runnable Python script",
|
|
|
|
| 123 |
@app.get("/baseline")
|
| 124 |
async def run_baseline(request: Request):
|
| 125 |
try:
|
| 126 |
+
from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
|
| 127 |
except ImportError:
|
| 128 |
+
from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
|
| 129 |
|
| 130 |
env_url = str(request.base_url).rstrip("/")
|
| 131 |
+
model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
|
| 132 |
+
if model_id not in SUPPORTED_MODEL_IDS:
|
| 133 |
+
return {
|
| 134 |
+
"error": f"Unsupported model_id '{model_id}'",
|
| 135 |
+
"supported_model_ids": SUPPORTED_MODEL_IDS,
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
results = {}
|
| 139 |
task_scores = {}
|
| 140 |
+
for task_id in ["task1", "task2", "task3", "task4", "task5", "task6"]:
|
| 141 |
try:
|
| 142 |
+
score = await asyncio.wait_for(
|
| 143 |
+
run_single_task(task_id, env_url, model_id=model_id),
|
| 144 |
+
timeout=120.0,
|
| 145 |
+
)
|
| 146 |
results[task_id] = round(score, 4)
|
| 147 |
task_scores[task_id] = round(score, 4)
|
| 148 |
except asyncio.TimeoutError:
|
|
|
|
| 158 |
task_scores[task_id] = 0.0
|
| 159 |
results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
|
| 160 |
avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
|
| 161 |
+
return {
|
| 162 |
+
"baseline_scores": results,
|
| 163 |
+
"average": avg,
|
| 164 |
+
"env_url": env_url,
|
| 165 |
+
"model_id": model_id,
|
| 166 |
+
}
|
| 167 |
|
| 168 |
|
| 169 |
@app.get("/baseline/task/{task_id}")
|
| 170 |
async def run_baseline_single(task_id: str, request: Request):
|
| 171 |
"""Run the baseline agent on a single task. Returns score + details."""
|
| 172 |
try:
|
| 173 |
+
from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
|
| 174 |
except ImportError:
|
| 175 |
+
from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
|
| 176 |
|
| 177 |
env_url = str(request.base_url).rstrip("/")
|
| 178 |
+
model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
|
| 179 |
+
if model_id not in SUPPORTED_MODEL_IDS:
|
| 180 |
+
return {
|
| 181 |
+
"task_id": task_id,
|
| 182 |
+
"score": 0.0,
|
| 183 |
+
"status": "error",
|
| 184 |
+
"error": f"Unsupported model_id '{model_id}'",
|
| 185 |
+
"supported_model_ids": SUPPORTED_MODEL_IDS,
|
| 186 |
+
}
|
| 187 |
+
|
| 188 |
try:
|
| 189 |
+
result = await asyncio.wait_for(
|
| 190 |
+
run_single_task_detailed(task_id, env_url, model_id=model_id),
|
| 191 |
+
timeout=120.0,
|
| 192 |
+
)
|
| 193 |
return {
|
| 194 |
"task_id": task_id,
|
| 195 |
"score": round(result["score"], 4),
|
| 196 |
"status": "ok",
|
| 197 |
+
"model_id": model_id,
|
| 198 |
"fixed_code": result.get("fixed_code", ""),
|
| 199 |
"output": result.get("output", ""),
|
| 200 |
"attempts": result.get("attempts", []),
|
server/environment.py
CHANGED
|
@@ -9,12 +9,12 @@ from openenv.core.env_server.types import State
|
|
| 9 |
try:
|
| 10 |
from ..models import MLDebugAction, MLDebugObservation
|
| 11 |
from .sandbox import execute_code
|
| 12 |
-
from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
|
| 13 |
from .tasks.graders import parse_losses, parse_val_accs, score_task
|
| 14 |
except ImportError:
|
| 15 |
from models import MLDebugAction, MLDebugObservation
|
| 16 |
from server.sandbox import execute_code
|
| 17 |
-
from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
|
| 18 |
from server.tasks.graders import parse_losses, parse_val_accs, score_task
|
| 19 |
|
| 20 |
TASKS = {
|
|
@@ -23,6 +23,7 @@ TASKS = {
|
|
| 23 |
"task3": task3_oom_leakage,
|
| 24 |
"task4": task4_wrong_loss,
|
| 25 |
"task5": task5_frozen_backbone,
|
|
|
|
| 26 |
}
|
| 27 |
|
| 28 |
|
|
|
|
| 9 |
try:
|
| 10 |
from ..models import MLDebugAction, MLDebugObservation
|
| 11 |
from .sandbox import execute_code
|
| 12 |
+
from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
|
| 13 |
from .tasks.graders import parse_losses, parse_val_accs, score_task
|
| 14 |
except ImportError:
|
| 15 |
from models import MLDebugAction, MLDebugObservation
|
| 16 |
from server.sandbox import execute_code
|
| 17 |
+
from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
|
| 18 |
from server.tasks.graders import parse_losses, parse_val_accs, score_task
|
| 19 |
|
| 20 |
TASKS = {
|
|
|
|
| 23 |
"task3": task3_oom_leakage,
|
| 24 |
"task4": task4_wrong_loss,
|
| 25 |
"task5": task5_frozen_backbone,
|
| 26 |
+
"task6": task6_io_mismatch,
|
| 27 |
}
|
| 28 |
|
| 29 |
|
server/tasks/graders.py
CHANGED
|
@@ -109,11 +109,11 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
|
|
| 109 |
Task 1: Broken Training Loop
|
| 110 |
Bugs: 1) lr=10.0 (too high), 2) step() before backward()
|
| 111 |
|
| 112 |
-
Grading criteria:
|
| 113 |
-
-
|
| 114 |
-
-
|
| 115 |
-
- Must show monotonic improvement
|
| 116 |
-
-
|
| 117 |
"""
|
| 118 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
|
| 119 |
if not valid:
|
|
@@ -128,10 +128,10 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
|
|
| 128 |
if not losses:
|
| 129 |
return 0.1, {"reason": "no_losses_parsed"}
|
| 130 |
|
| 131 |
-
# Check for NaN/Inf - indicates numerical instability
|
| 132 |
nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
|
| 133 |
if nan_count > 0:
|
| 134 |
-
return 0.
|
| 135 |
|
| 136 |
val_acc = parse_scalar(result.stdout, "VAL_ACC")
|
| 137 |
if val_acc is None:
|
|
@@ -141,33 +141,39 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
|
|
| 141 |
initial_loss = losses[0]
|
| 142 |
max_loss = max(losses)
|
| 143 |
|
| 144 |
-
# Check for loss instability (spikes indicate LR too high)
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
return 0.2, {
|
| 148 |
"reason": "loss_unstable_spikes",
|
| 149 |
"max_loss": max_loss,
|
| 150 |
"final_loss": final_loss,
|
| 151 |
"val_acc": val_acc
|
| 152 |
}
|
| 153 |
|
| 154 |
-
#
|
| 155 |
-
if final_loss >
|
| 156 |
-
return 0.
|
| 157 |
|
| 158 |
-
#
|
| 159 |
-
|
| 160 |
|
| 161 |
-
#
|
| 162 |
-
|
|
|
|
| 163 |
|
| 164 |
-
#
|
|
|
|
|
|
|
|
|
|
| 165 |
monotonic_bonus = 0.0
|
| 166 |
if len(losses) >= 10:
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
|
|
|
|
|
|
|
|
|
| 171 |
|
| 172 |
final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
|
| 173 |
breakdown = {
|
|
@@ -187,10 +193,10 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
|
|
| 187 |
Task 2: NaN Loss
|
| 188 |
Bug: torch.log(pred) when pred can be 0.0 after sigmoid
|
| 189 |
|
| 190 |
-
Grading criteria:
|
| 191 |
-
-
|
| 192 |
-
-
|
| 193 |
-
-
|
| 194 |
"""
|
| 195 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
|
| 196 |
if not valid:
|
|
@@ -207,11 +213,11 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
|
|
| 207 |
|
| 208 |
nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
|
| 209 |
|
| 210 |
-
#
|
| 211 |
nan_ratio = nan_count / len(losses)
|
| 212 |
if nan_count > 0:
|
| 213 |
-
#
|
| 214 |
-
return max(0.05, 0.
|
| 215 |
"reason": "has_nans",
|
| 216 |
"nan_ratio": nan_ratio,
|
| 217 |
"nan_count": nan_count
|
|
@@ -219,19 +225,19 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
|
|
| 219 |
|
| 220 |
val_acc = parse_scalar(result.stdout, "VAL_ACC")
|
| 221 |
if val_acc is None:
|
| 222 |
-
return 0.
|
| 223 |
|
| 224 |
finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
|
| 225 |
final_loss = finite_losses[-1] if finite_losses else float('inf')
|
| 226 |
|
| 227 |
-
# No NaN = base score of 0.
|
| 228 |
-
base_score = 0.
|
| 229 |
|
| 230 |
-
# Validation accuracy
|
| 231 |
-
acc_score = sigmoid_score(val_acc, center=0.
|
| 232 |
|
| 233 |
-
# Convergence
|
| 234 |
-
convergence_score = sigmoid_score(final_loss, center=0.
|
| 235 |
|
| 236 |
final_score = min(1.0, base_score + acc_score + convergence_score)
|
| 237 |
breakdown = {
|
|
@@ -247,14 +253,13 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
|
|
| 247 |
|
| 248 |
def grade_task3(result: RunResult) -> tuple[float, dict]:
|
| 249 |
"""
|
| 250 |
-
Task 3:
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
- FINAL_LOSS
|
| 256 |
-
-
|
| 257 |
-
- Learning trajectory should improve over epochs
|
| 258 |
"""
|
| 259 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
|
| 260 |
if not valid:
|
|
@@ -271,40 +276,51 @@ def grade_task3(result: RunResult) -> tuple[float, dict]:
|
|
| 271 |
val_accs = parse_val_accs(result.stdout)
|
| 272 |
final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
|
| 273 |
|
| 274 |
-
#
|
| 275 |
-
#
|
| 276 |
-
|
| 277 |
-
if final_loss_val is not None:
|
| 278 |
-
memory_score = sigmoid_score(final_loss_val, center=20.0, steepness=0.2, higher_is_better=False) * 0.35
|
| 279 |
-
else:
|
| 280 |
-
memory_score = 0.0
|
| 281 |
-
|
| 282 |
-
# Gradient accumulation check: accuracy should be high if training properly
|
| 283 |
-
# Without zero_grad(), gradients accumulate and training degrades
|
| 284 |
acc_score = 0.0
|
| 285 |
final_acc = 0.0
|
| 286 |
early_acc = 0.0
|
| 287 |
trajectory_bonus = 0.0
|
| 288 |
|
| 289 |
-
if val_accs
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
|
| 304 |
-
|
| 305 |
breakdown = {
|
| 306 |
-
"memory_score": round(memory_score, 4),
|
| 307 |
"acc_score": round(acc_score, 4),
|
|
|
|
| 308 |
"trajectory_bonus": round(trajectory_bonus, 4),
|
| 309 |
"early_acc": round(early_acc, 4),
|
| 310 |
"final_acc": round(final_acc, 4),
|
|
@@ -318,10 +334,10 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
|
|
| 318 |
Task 4: Wrong Loss (Multi-label Classification)
|
| 319 |
Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
|
| 320 |
|
| 321 |
-
Grading criteria:
|
| 322 |
-
- F1
|
| 323 |
-
- avg_labels
|
| 324 |
-
- Loss
|
| 325 |
"""
|
| 326 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
|
| 327 |
if not valid:
|
|
@@ -337,29 +353,45 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
|
|
| 337 |
avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
|
| 338 |
f1 = parse_scalar(result.stdout, "F1_SCORE")
|
| 339 |
|
| 340 |
-
#
|
| 341 |
f1_score_val = 0.0
|
| 342 |
if f1 is not None:
|
| 343 |
-
|
| 344 |
|
| 345 |
-
# Multi-label check: avg_labels
|
| 346 |
-
# With 30% probability per class and 5 classes, expected avg ~1.5 labels/sample
|
| 347 |
labels_score = 0.0
|
| 348 |
if avg_labels is not None:
|
| 349 |
-
if avg_labels
|
| 350 |
-
|
| 351 |
-
labels_score = 0.0
|
| 352 |
elif avg_labels >= 1.0:
|
| 353 |
-
#
|
| 354 |
-
labels_score = 0.3
|
| 355 |
else:
|
| 356 |
-
|
| 357 |
-
labels_score = sigmoid_score(avg_labels, center=1.0, steepness=5.0, higher_is_better=True) * 0.3
|
| 358 |
|
| 359 |
-
# Loss convergence (
|
| 360 |
loss_score = 0.0
|
| 361 |
if final_loss is not None:
|
| 362 |
-
loss_score = sigmoid_score(final_loss, center=0.
|
| 363 |
|
| 364 |
final_score = min(1.0, f1_score_val + labels_score + loss_score)
|
| 365 |
breakdown = {
|
|
@@ -379,14 +411,14 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
|
|
| 379 |
Bug: Backbone is frozen but still passed to optimizer (wastes memory)
|
| 380 |
|
| 381 |
Valid fixes:
|
| 382 |
-
1. Unfreeze backbone -> grad_norm > 0
|
| 383 |
-
2. Only pass head params to optimizer ->
|
| 384 |
|
| 385 |
-
|
| 386 |
|
| 387 |
-
Grading criteria:
|
| 388 |
-
-
|
| 389 |
-
-
|
| 390 |
"""
|
| 391 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
|
| 392 |
if not valid:
|
|
@@ -402,30 +434,39 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
|
|
| 402 |
grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
|
| 403 |
param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
|
| 404 |
|
| 405 |
-
#
|
| 406 |
-
loss_score = 0.0
|
| 407 |
-
if final_loss is not None:
|
| 408 |
-
loss_score = sigmoid_score(final_loss, center=2.5, steepness=2.0, higher_is_better=False) * 0.3
|
| 409 |
-
|
| 410 |
-
# The bug: frozen backbone (grad_norm=0) but full params in optimizer (param_count=530442)
|
| 411 |
-
# Fix 1: Unfreeze -> grad_norm > 0 (any amount)
|
| 412 |
-
# Fix 2: Only head -> param_count < 100000 (head has ~5130 params)
|
| 413 |
-
|
| 414 |
fix_score = 0.0
|
| 415 |
fix_type = "none"
|
| 416 |
|
| 417 |
-
|
| 418 |
-
|
| 419 |
-
fix_score = 0.
|
| 420 |
fix_type = "unfrozen"
|
| 421 |
-
|
| 422 |
-
|
| 423 |
-
fix_score = 0.
|
| 424 |
fix_type = "head_only"
|
| 425 |
-
|
| 426 |
-
|
| 427 |
-
|
| 428 |
-
|
| 429 |
|
| 430 |
final_score = min(1.0, loss_score + fix_score)
|
| 431 |
breakdown = {
|
|
@@ -439,6 +480,192 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
|
|
| 439 |
return final_score, breakdown
|
| 440 |
|
| 441 |
|
| 442 |
def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
|
| 443 |
graders = {
|
| 444 |
"task1": grade_task1,
|
|
@@ -446,6 +673,7 @@ def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
|
|
| 446 |
"task3": grade_task3,
|
| 447 |
"task4": grade_task4,
|
| 448 |
"task5": grade_task5,
|
|
|
|
| 449 |
}
|
| 450 |
if task_id not in graders:
|
| 451 |
raise ValueError(f"Unknown task_id: {task_id}")
|
|
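The graders in this diff combine weighted `sigmoid_score(...)` terms, but the helper's definition is not part of the diff. A minimal sketch, assuming it is a plain logistic curve centered at `center` with slope `steepness` (an assumption; the real helper in `graders.py` may differ), suggests how moving the val_acc center from 0.85 to 0.90 and the steepness from 15 to 25 changes the credit given to the same accuracy. `task1_style_score` is a hypothetical name that only mirrors the 60%/30%/bonus weighting used for Task 1.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float,
                  higher_is_better: bool = True) -> float:
    """Hypothetical stand-in for the graders' helper: logistic curve in (0, 1)."""
    z = steepness * (value - center)
    if not higher_is_better:
        z = -z
    # Clamp the exponent so extreme inputs cannot overflow math.exp.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def task1_style_score(val_acc: float, final_loss: float, bonus: float = 0.0) -> float:
    """Weighted sum in the style of grade_task1 (60% accuracy, 30% loss, bonus)."""
    acc = sigmoid_score(val_acc, center=0.90, steepness=25.0) * 0.60
    loss = sigmoid_score(final_loss, center=0.15, steepness=15.0,
                         higher_is_better=False) * 0.30
    return min(1.0, acc + loss + bonus)


if __name__ == "__main__":
    for acc in (0.80, 0.90, 0.97):
        old = sigmoid_score(acc, center=0.85, steepness=15.0)
        new = sigmoid_score(acc, center=0.90, steepness=25.0)
        print(f"val_acc={acc:.2f} old={old:.3f} new={new:.3f}")
```

Under this assumed curve, an accuracy exactly at the center earns half of that term's weight, so a higher center and a steeper slope both reduce the credit given to mediocre runs.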
|
|
| 109 |
Task 1: Broken Training Loop
|
| 110 |
Bugs: 1) lr=10.0 (too high), 2) step() before backward()
|
| 111 |
|
| 112 |
+
Grading criteria (STRICT thresholds for differentiation):
|
| 113 |
+
- VAL_ACC > 0.90 required for high score (target is >0.85)
|
| 114 |
+
- Final loss < 0.15 required for high score (target is <0.3)
|
| 115 |
+
- Bonus for sustained improvement (second-half mean loss well below first-half mean)
|
| 116 |
+
- Penalize any instability heavily
|
| 117 |
"""
|
| 118 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
|
| 119 |
if not valid:
|
|
|
|
| 128 |
if not losses:
|
| 129 |
return 0.1, {"reason": "no_losses_parsed"}
|
| 130 |
|
| 131 |
+
# Check for NaN/Inf - indicates numerical instability (LR bug not fully fixed)
|
| 132 |
nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
|
| 133 |
if nan_count > 0:
|
| 134 |
+
return 0.1, {"reason": "nan_inf_found", "nan_count": nan_count}
|
| 135 |
|
| 136 |
val_acc = parse_scalar(result.stdout, "VAL_ACC")
|
| 137 |
if val_acc is None:
|
|
|
|
| 141 |
initial_loss = losses[0]
|
| 142 |
max_loss = max(losses)
|
| 143 |
|
| 144 |
+
# STRICT: Check for loss instability (spikes indicate LR still too high)
|
| 145 |
+
if max_loss > initial_loss * 3.0 or max_loss > 5.0:
|
| 146 |
+
return 0.15, {
|
|
|
|
| 147 |
"reason": "loss_unstable_spikes",
|
| 148 |
"max_loss": max_loss,
|
| 149 |
"final_loss": final_loss,
|
| 150 |
"val_acc": val_acc
|
| 151 |
}
|
| 152 |
|
| 153 |
+
# STRICT: Loss must converge well
|
| 154 |
+
if final_loss > 1.0:
|
| 155 |
+
return 0.2, {"reason": "loss_not_converged", "final_loss": final_loss, "val_acc": val_acc}
|
| 156 |
|
| 157 |
+
# STRICT thresholds - center points raised for better differentiation
|
| 158 |
+
# Target: val_acc > 0.90, final_loss < 0.15
|
| 159 |
|
| 160 |
+
# Primary: Validation accuracy (60% weight)
|
| 161 |
+
# Use steeper sigmoid for sharper differentiation
|
| 162 |
+
acc_score = sigmoid_score(val_acc, center=0.90, steepness=25.0, higher_is_better=True) * 0.60
|
| 163 |
|
| 164 |
+
# Secondary: Final loss (30% weight) - must be low
|
| 165 |
+
loss_score = sigmoid_score(final_loss, center=0.15, steepness=15.0, higher_is_better=False) * 0.30
|
| 166 |
+
|
| 167 |
+
# Bonus: Monotonic improvement - must be significant
|
| 168 |
monotonic_bonus = 0.0
|
| 169 |
if len(losses) >= 10:
|
| 170 |
+
first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
|
| 171 |
+
last_half = sum(losses[-(len(losses)//2):]) / (len(losses)//2)
|
| 172 |
+
improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
|
| 173 |
+
if improvement_ratio > 0.5: # >50% improvement required
|
| 174 |
+
monotonic_bonus = 0.10
|
| 175 |
+
elif improvement_ratio > 0.3:
|
| 176 |
+
monotonic_bonus = 0.05
|
| 177 |
|
| 178 |
final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
|
| 179 |
breakdown = {
|
|
|
|
| 193 |
Task 2: NaN Loss
|
| 194 |
Bug: torch.log(pred) when pred can be 0.0 after sigmoid
|
| 195 |
|
| 196 |
+
Grading criteria (STRICT - NaN elimination is PRIMARY):
|
| 197 |
+
- ZERO NaN/Inf required (this is the bug!)
|
| 198 |
+
- VAL_ACC > 0.80 required for high score
|
| 199 |
+
- Loss must converge < 0.3
|
| 200 |
"""
|
| 201 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
|
| 202 |
if not valid:
|
|
|
|
| 213 |
|
| 214 |
nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
|
| 215 |
|
| 216 |
+
# PRIMARY: NO NaN/Inf at ALL - this is THE bug being tested
|
| 217 |
nan_ratio = nan_count / len(losses)
|
| 218 |
if nan_count > 0:
|
| 219 |
+
# STRICT: Any NaN = major failure (max 0.25 score)
|
| 220 |
+
return max(0.05, 0.25 * (1.0 - nan_ratio)), {
|
| 221 |
"reason": "has_nans",
|
| 222 |
"nan_ratio": nan_ratio,
|
| 223 |
"nan_count": nan_count
|
|
|
|
| 225 |
|
| 226 |
val_acc = parse_scalar(result.stdout, "VAL_ACC")
|
| 227 |
if val_acc is None:
|
| 228 |
+
return 0.25, {"reason": "no_val_acc_but_no_nans"}
|
| 229 |
|
| 230 |
finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
|
| 231 |
final_loss = finite_losses[-1] if finite_losses else float('inf')
|
| 232 |
|
| 233 |
+
# No NaN = base score of 0.35 (bug is fixed but need to verify quality)
|
| 234 |
+
base_score = 0.35
|
| 235 |
|
| 236 |
+
# STRICT: Validation accuracy (40% weight, center at 0.80)
|
| 237 |
+
acc_score = sigmoid_score(val_acc, center=0.80, steepness=20.0, higher_is_better=True) * 0.40
|
| 238 |
|
| 239 |
+
# STRICT: Convergence (25% weight, center at 0.25)
|
| 240 |
+
convergence_score = sigmoid_score(final_loss, center=0.25, steepness=10.0, higher_is_better=False) * 0.25
|
| 241 |
|
| 242 |
final_score = min(1.0, base_score + acc_score + convergence_score)
|
| 243 |
breakdown = {
|
|
|
|
| 253 |
|
| 254 |
def grade_task3(result: RunResult) -> tuple[float, dict]:
|
| 255 |
"""
|
| 256 |
+
Task 3: Label Inversion Bug
|
| 257 |
+
Bug: criterion(out, 1 - yb) inverts labels — should be criterion(out, yb)
|
| 258 |
+
|
| 259 |
+
Grading criteria (STRICT - accuracy is PRIMARY):
|
| 260 |
+
- VAL_ACC > 0.90 required (buggy code gives ~0.50)
|
| 261 |
+
- FINAL_LOSS < 0.25 required
|
| 262 |
+
- Must show learning trajectory improvement
|
|
|
|
| 263 |
"""
|
| 264 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
|
| 265 |
if not valid:
|
|
|
|
| 276 |
val_accs = parse_val_accs(result.stdout)
|
| 277 |
final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
|
| 278 |
|
| 279 |
+
# CRITICAL CHECK: Buggy code produces ~0.50 accuracy (random)
|
| 280 |
+
# Fixed code should produce >0.90 accuracy
|
| 281 |
+
|
| 282 |
acc_score = 0.0
|
| 283 |
final_acc = 0.0
|
| 284 |
early_acc = 0.0
|
| 285 |
trajectory_bonus = 0.0
|
| 286 |
|
| 287 |
+
if not val_accs or len(val_accs) < 2:
|
| 288 |
+
return 0.15, {"reason": "no_val_accs_parsed"}
|
| 289 |
+
|
| 290 |
+
early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
|
| 291 |
+
final_acc = val_accs[-1]
|
| 292 |
+
|
| 293 |
+
# STRICT: Final accuracy must be high (>0.90 target)
|
| 294 |
+
# The bug makes accuracy ~0.50, so anything <0.65 is likely unfixed
|
| 295 |
+
if final_acc < 0.65:
|
| 296 |
+
return 0.15, {
|
| 297 |
+
"reason": "accuracy_too_low_likely_unfixed",
|
| 298 |
+
"final_acc": final_acc,
|
| 299 |
+
"expected": ">0.90 for fixed code"
|
| 300 |
+
}
|
| 301 |
+
|
| 302 |
+
# Primary: Final accuracy (60% weight, center at 0.92)
|
| 303 |
+
acc_score = sigmoid_score(final_acc, center=0.92, steepness=30.0, higher_is_better=True) * 0.60
|
| 304 |
+
|
| 305 |
+
# Secondary: Loss convergence (25% weight)
|
| 306 |
+
loss_score = 0.0
|
| 307 |
+
if final_loss_val is not None:
|
| 308 |
+
loss_score = sigmoid_score(final_loss_val, center=0.20, steepness=12.0, higher_is_better=False) * 0.25
|
| 309 |
+
|
| 310 |
+
# Bonus: Learning trajectory (15% weight)
|
| 311 |
+
if len(val_accs) >= 5:
|
| 312 |
+
improvement = final_acc - early_acc
|
| 313 |
+
if improvement > 0.15: # Significant learning
|
| 314 |
+
trajectory_bonus = 0.15
|
| 315 |
+
elif improvement > 0.05:
|
| 316 |
+
trajectory_bonus = 0.08
|
| 317 |
+
elif improvement > 0.0:
|
| 318 |
+
trajectory_bonus = 0.03
|
| 319 |
+
|
| 320 |
+
final_score = min(1.0, acc_score + loss_score + trajectory_bonus)
|
| 321 |
breakdown = {
|
|
|
|
| 322 |
"acc_score": round(acc_score, 4),
|
| 323 |
+
"loss_score": round(loss_score, 4),
|
| 324 |
"trajectory_bonus": round(trajectory_bonus, 4),
|
| 325 |
"early_acc": round(early_acc, 4),
|
| 326 |
"final_acc": round(final_acc, 4),
|
|
|
|
| 334 |
Task 4: Wrong Loss (Multi-label Classification)
|
| 335 |
Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
|
| 336 |
|
| 337 |
+
Grading criteria (STRICT):
|
| 338 |
+
- F1 > 0.70 required (buggy code gives ~0.2-0.3)
|
| 339 |
+
- avg_labels > 1.2 required (proper multi-hot predictions)
|
| 340 |
+
- Loss must converge < 0.4
|
| 341 |
"""
|
| 342 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
|
| 343 |
if not valid:
|
|
|
|
| 353 |
avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
|
| 354 |
f1 = parse_scalar(result.stdout, "F1_SCORE")
|
| 355 |
|
| 356 |
+
# CRITICAL: Check for multi-label behavior
|
| 357 |
+
# With CrossEntropyLoss, model predicts only 1 label per sample (avg_labels ≈ 1.0)
|
| 358 |
+
# With BCEWithLogitsLoss, model should predict multiple (avg_labels > 1.0)
|
| 359 |
+
|
| 360 |
+
if avg_labels is not None and avg_labels < 0.8:
|
| 361 |
+
return 0.15, {
|
| 362 |
+
"reason": "too_few_labels_single_label_behavior",
|
| 363 |
+
"avg_labels": avg_labels,
|
| 364 |
+
"expected": ">1.2 for multi-label"
|
| 365 |
+
}
|
| 366 |
+
|
| 367 |
+
# STRICT: F1 score - PRIMARY metric (55% weight)
|
| 368 |
f1_score_val = 0.0
|
| 369 |
if f1 is not None:
|
| 370 |
+
if f1 < 0.40:
|
| 371 |
+
# Very low F1 indicates bug not fixed
|
| 372 |
+
return 0.20, {
|
| 373 |
+
"reason": "f1_too_low_likely_unfixed",
|
| 374 |
+
"f1": f1,
|
| 375 |
+
"expected": ">0.70 for fixed code"
|
| 376 |
+
}
|
| 377 |
+
f1_score_val = sigmoid_score(f1, center=0.70, steepness=15.0, higher_is_better=True) * 0.55
|
| 378 |
+
else:
|
| 379 |
+
return 0.15, {"reason": "no_f1_score_parsed"}
|
| 380 |
|
| 381 |
+
# Multi-label check: avg_labels (25% weight)
|
|
|
|
| 382 |
labels_score = 0.0
|
| 383 |
if avg_labels is not None:
|
| 384 |
+
if avg_labels >= 1.3:
|
| 385 |
+
labels_score = 0.25 # Full score for proper multi-label
|
|
|
|
| 386 |
elif avg_labels >= 1.0:
|
| 387 |
+
labels_score = 0.15 # Partial - borderline multi-label
|
|
|
|
| 388 |
else:
|
| 389 |
+
labels_score = sigmoid_score(avg_labels, center=1.0, steepness=8.0, higher_is_better=True) * 0.15
|
|
|
|
| 390 |
|
| 391 |
+
# Loss convergence (20% weight)
|
| 392 |
loss_score = 0.0
|
| 393 |
if final_loss is not None:
|
| 394 |
+
loss_score = sigmoid_score(final_loss, center=0.35, steepness=8.0, higher_is_better=False) * 0.20
|
| 395 |
|
| 396 |
final_score = min(1.0, f1_score_val + labels_score + loss_score)
|
| 397 |
breakdown = {
|
|
|
|
| 411 |
Bug: Backbone is frozen but still passed to optimizer (wastes memory)
|
| 412 |
|
| 413 |
Valid fixes:
|
| 414 |
+
1. Unfreeze backbone -> grad_norm > 0
|
| 415 |
+
2. Only pass head params to optimizer -> param_count < 15000
|
| 416 |
|
| 417 |
+
Buggy state: grad_norm = 0, param_count = 530442
|
| 418 |
|
| 419 |
+
Grading criteria (STRICT - binary fix detection):
|
| 420 |
+
- Must demonstrate ONE of the two valid fixes
|
| 421 |
+
- Loss must be reasonable (<3.0 for CrossEntropy on 10 classes)
|
| 422 |
"""
|
| 423 |
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
|
| 424 |
if not valid:
|
|
|
|
| 434 |
grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
|
| 435 |
param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
|
| 436 |
|
| 437 |
+
# Detect fix type FIRST
|
| 438 |
fix_score = 0.0
|
| 439 |
fix_type = "none"
|
| 440 |
|
| 441 |
+
# Fix 1: Unfreeze backbone (grad_norm > 0)
|
| 442 |
+
if grad_norm is not None and grad_norm > 0.01:
|
| 443 |
+
fix_score = 0.70
|
| 444 |
fix_type = "unfrozen"
|
| 445 |
+
# Fix 2: Only head params (param_count should be ~5130 for Linear(512, 10))
|
| 446 |
+
elif param_count is not None and param_count < 15000:
|
| 447 |
+
fix_score = 0.70
|
| 448 |
fix_type = "head_only"
|
| 449 |
+
# Buggy state: frozen (grad_norm=0) but full params (>500000)
|
| 450 |
+
elif grad_norm is not None and grad_norm == 0.0:
|
| 451 |
+
if param_count is not None and param_count > 100000:
|
| 452 |
+
return 0.10, {
|
| 453 |
+
"reason": "buggy_state_unchanged",
|
| 454 |
+
"grad_norm": grad_norm,
|
| 455 |
+
"param_count": param_count,
|
| 456 |
+
"hint": "Either unfreeze backbone or only pass head params to optimizer"
|
| 457 |
+
}
|
| 458 |
+
|
| 459 |
+
if fix_score == 0.0:
|
| 460 |
+
return 0.15, {
|
| 461 |
+
"reason": "could_not_detect_valid_fix",
|
| 462 |
+
"grad_norm": grad_norm,
|
| 463 |
+
"param_count": param_count
|
| 464 |
+
}
|
| 465 |
+
|
| 466 |
+
# Loss should be reasonable (30% weight)
|
| 467 |
+
loss_score = 0.0
|
| 468 |
+
if final_loss is not None:
|
| 469 |
+
loss_score = sigmoid_score(final_loss, center=2.5, steepness=3.0, higher_is_better=False) * 0.30
|
| 470 |
|
| 471 |
final_score = min(1.0, loss_score + fix_score)
|
| 472 |
breakdown = {
|
|
|
|
| 480 |
return final_score, breakdown
|
| 481 |
|
| 482 |
|
| 483 |
+
def grade_task6(result: RunResult) -> tuple[float, dict]:
|
| 484 |
+
"""
|
| 485 |
+
Task 6: Input-Output Mismatch (Multiple Bugs)
|
| 486 |
+
|
| 487 |
+
Bugs to fix:
|
| 488 |
+
1. Shape mismatch: 32x32 images but model expects 28x28
|
| 489 |
+
2. Channel order: HWC format but model expects CHW
|
| 490 |
+
3. Label encoding: One-hot labels but CrossEntropyLoss expects indices
|
| 491 |
+
4. Batch dimension: Single sample missing batch dim in validation
|
| 492 |
+
|
| 493 |
+
Anti-gaming measures:
|
| 494 |
+
- Must have actual CNN training (convolutions detected in code)
|
| 495 |
+
- Must show learning trajectory (loss decrease)
|
| 496 |
+
- Must have reasonable epoch count (>= 20)
|
| 497 |
+
- Penalize hardcoded metrics or unrealistic outputs
|
| 498 |
+
- Check for actual tensor operations (permute, reshape, etc.)
|
| 499 |
+
"""
|
| 500 |
+
valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task6")
|
| 501 |
+
if not valid:
|
| 502 |
+
return 0.0, {"reason": reason}
|
| 503 |
+
|
| 504 |
+
if result.timed_out:
|
| 505 |
+
return 0.05, {"reason": "timed_out"}
|
| 506 |
+
|
| 507 |
+
if result.exit_code != 0:
|
| 508 |
+
# Check for specific error types that indicate partial fixes
|
| 509 |
+
stderr_lower = result.stderr.lower()
|
| 510 |
+
if "shape" in stderr_lower or "size" in stderr_lower:
|
| 511 |
+
return 0.10, {"reason": "shape_error_unfixed", "stderr": result.stderr[:500]}
|
| 512 |
+
if "dimension" in stderr_lower or "dim" in stderr_lower:
|
| 513 |
+
return 0.10, {"reason": "dimension_error_unfixed", "stderr": result.stderr[:500]}
|
| 514 |
+
if "expected" in stderr_lower and "got" in stderr_lower:
|
| 515 |
+
return 0.10, {"reason": "type_mismatch_unfixed", "stderr": result.stderr[:500]}
|
| 516 |
+
return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
|
| 517 |
+
|
| 518 |
+
code = result.fixed_code
|
| 519 |
+
|
| 520 |
+
# ANTI-GAMING: Check for genuine CNN architecture (not replaced with fake output)
|
| 521 |
+
has_conv = "Conv2d" in code or "conv2d" in code
|
| 522 |
+
has_training_loop = "optimizer.step()" in code
|
| 523 |
+
has_model_forward = "model(" in code
|
| 524 |
+
|
| 525 |
+
if not has_conv:
|
| 526 |
+
return 0.05, {"reason": "gaming_no_convolution", "hint": "Original CNN architecture must be preserved"}
|
| 527 |
+
if not has_training_loop:
|
| 528 |
+
return 0.05, {"reason": "gaming_no_training", "hint": "Must have actual training loop"}
|
| 529 |
+
if not has_model_forward:
|
| 530 |
+
return 0.05, {"reason": "gaming_no_forward", "hint": "Must use model for inference"}
|
| 531 |
+
|
| 532 |
+
# Parse metrics
|
| 533 |
+
losses = parse_losses(result.stdout)
|
| 534 |
+
val_acc = parse_scalar(result.stdout, "VAL_ACC")
|
| 535 |
+
final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
|
| 536 |
+
|
| 537 |
+
# ANTI-GAMING: Check for hardcoded/faked metrics
|
| 538 |
+
if "print('VAL_ACC:0.9" in code or "print(\"VAL_ACC:0.9" in code:
|
| 539 |
+
return 0.05, {"reason": "gaming_hardcoded_metrics"}
|
| 540 |
+
|
| 541 |
+
# ANTI-GAMING: Require reasonable number of loss values (epoch count)
|
| 542 |
+
if len(losses) < 15:
|
| 543 |
+
return 0.15, {"reason": "too_few_epochs", "epoch_count": len(losses), "expected": ">=15"}
|
| 544 |
+
|
| 545 |
+
# ANTI-GAMING: Loss should show learning (not flat or random)
|
| 546 |
+
if len(losses) >= 10:
|
| 547 |
+
first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
|
| 548 |
+
last_quarter = sum(losses[-(len(losses)//4):]) / (len(losses)//4)
|
| 549 |
+
|
| 550 |
+
if first_quarter <= last_quarter:
|
| 551 |
+
return 0.20, {
|
| 552 |
+
"reason": "no_learning_trajectory",
|
| 553 |
+
"first_quarter_loss": round(first_quarter, 4),
|
| 554 |
+
"last_quarter_loss": round(last_quarter, 4),
|
| 555 |
+
"hint": "Loss should decrease during training"
|
| 556 |
+
}
|
| 557 |
+
|
| 558 |
+
# ANTI-GAMING: Check for unrealistic perfect scores with no learning
|
| 559 |
+
# High accuracy is OK if there's a valid learning trajectory
|
| 560 |
+
if val_acc is not None and val_acc > 0.99:
|
| 561 |
+
# Only flag if loss didn't converge properly (suggests hardcoded output)
|
| 562 |
+
if final_loss is not None and final_loss > 0.1:
|
| 563 |
+
return 0.25, {"reason": "unrealistic_accuracy_no_convergence", "val_acc": val_acc, "final_loss": final_loss}
|
| 564 |
+
|
| 565 |
+
if val_acc is None:
|
| 566 |
+
return 0.15, {"reason": "no_val_acc_parsed"}
|
| 567 |
+
|
| 568 |
+
if final_loss is None:
|
| 569 |
+
return 0.15, {"reason": "no_final_loss_parsed"}
|
| 570 |
+
|
| 571 |
+
# Check for NaN/Inf in losses
|
| 572 |
+
nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
|
| 573 |
+
if nan_count > 0:
|
| 574 |
+
return 0.10, {"reason": "nan_in_losses", "nan_count": nan_count}
|
| 575 |
+
|
| 576 |
+
# ====== BUG FIX DETECTION ======
|
| 577 |
+
bug_fixes_detected = 0
|
| 578 |
+
fix_details = {}
|
| 579 |
+
|
| 580 |
+
# Bug 1: Shape fix - check for resize or architecture change
|
| 581 |
+
has_resize = any(kw in code for kw in ["resize", "interpolate", "F.adaptive", "8 * 8", "8*8"])
|
| 582 |
+
has_28_in_data = "28, 28" in code or "28,28" in code
|
| 583 |
+
if has_resize or has_28_in_data:
|
| 584 |
+
bug_fixes_detected += 1
|
| 585 |
+
fix_details["shape_fix"] = True
|
| 586 |
+
else:
|
| 587 |
+
fix_details["shape_fix"] = False
|
| 588 |
+
|
| 589 |
+
# Bug 2: Channel order fix - check for permute/transpose OR data created in CHW format
|
| 590 |
+
has_permute = "permute" in code or "transpose" in code or "contiguous" in code
|
| 591 |
+
has_channel_reorder = ".permute(0, 3, 1, 2)" in code or "permute(0,3,1,2)" in code
|
| 592 |
+
# Alternative fix: data created directly in CHW format (n_samples, 1, H, W)
|
| 593 |
+
has_chw_data = any(pat in code for pat in ["n_samples, 1, 28", "n_samples, 1, 32", "(n_samples, 1,"])
|
| 594 |
+
if has_permute or has_channel_reorder or has_chw_data:
|
| 595 |
+
bug_fixes_detected += 1
|
| 596 |
+
fix_details["channel_fix"] = True
|
| 597 |
+
else:
|
| 598 |
+
fix_details["channel_fix"] = False
|
| 599 |
+
|
| 600 |
+
# Bug 3: Label encoding fix - check for argmax or returning indices
|
| 601 |
+
has_label_fix = any(kw in code for kw in [
|
| 602 |
+
"argmax", "class_indices", "torch.arange",
|
| 603 |
+
"labels.long()", "y.long()", "remove one_hot"
|
| 604 |
+
])
|
| 605 |
+
# Also check if one_hot is removed from generate_data
|
| 606 |
+
no_onehot = "one_hot" not in code or ("# " in code and "one_hot" in code)
|
| 607 |
+
if has_label_fix or no_onehot:
|
| 608 |
+
bug_fixes_detected += 1
|
| 609 |
+
fix_details["label_fix"] = True
|
| 610 |
+
else:
|
| 611 |
+
fix_details["label_fix"] = False
|
| 612 |
+
|
| 613 |
+
# Bug 4: Batch dimension fix - check for unsqueeze on single sample
|
| 614 |
+
has_batch_fix = any(kw in code for kw in ["unsqueeze(0)", "unsqueeze( 0)", "[None,", "[None ,"])
|
| 615 |
+
if has_batch_fix:
|
| 616 |
+
bug_fixes_detected += 1
|
| 617 |
+
fix_details["batch_fix"] = True
|
| 618 |
+
else:
|
| 619 |
+
fix_details["batch_fix"] = False
|
| 620 |
+
|
| 621 |
+
# ====== SCORING ======
|
| 622 |
+
# Base score from bug fixes (40% weight - 10% per bug)
|
| 623 |
+
bug_fix_score = 0.10 * bug_fixes_detected
|
| 624 |
+
|
| 625 |
+
# Accuracy score (35% weight) - strict threshold
|
| 626 |
+
# With 5 classes, random is 20%, buggy is ~20-30%, fixed should be >80%
|
| 627 |
+
if val_acc < 0.50:
|
| 628 |
+
# Below 50% suggests not all bugs fixed
|
| 629 |
+
acc_score = 0.0
|
| 630 |
+
acc_penalty_reason = "accuracy_too_low"
|
| 631 |
+
else:
|
| 632 |
+
acc_score = sigmoid_score(val_acc, center=0.82, steepness=20.0, higher_is_better=True) * 0.35
|
| 633 |
+
acc_penalty_reason = None
|
| 634 |
+
|
| 635 |
+
# Loss convergence score (15% weight)
|
| 636 |
+
loss_score = sigmoid_score(final_loss, center=0.40, steepness=8.0, higher_is_better=False) * 0.15
|
| 637 |
+
|
| 638 |
+
# Learning trajectory bonus (10% weight)
|
| 639 |
+
trajectory_bonus = 0.0
|
| 640 |
+
if len(losses) >= 10:
|
| 641 |
+
first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
|
| 642 |
+
last_half = sum(losses[-(len(losses)//2):]) / (len(losses)//2)
|
| 643 |
+
improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
|
| 644 |
+
if improvement_ratio > 0.5:
|
| 645 |
+
trajectory_bonus = 0.10
|
| 646 |
+
elif improvement_ratio > 0.3:
|
| 647 |
+
trajectory_bonus = 0.05
|
| 648 |
+
|
| 649 |
+
final_score = min(1.0, bug_fix_score + acc_score + loss_score + trajectory_bonus)
|
| 650 |
+
|
| 651 |
+
breakdown = {
|
| 652 |
+
"bug_fix_score": round(bug_fix_score, 4),
|
| 653 |
+
"bugs_fixed": bug_fixes_detected,
|
| 654 |
+
"fix_details": fix_details,
|
| 655 |
+
"acc_score": round(acc_score, 4),
|
| 656 |
+
"loss_score": round(loss_score, 4),
|
| 657 |
+
"trajectory_bonus": round(trajectory_bonus, 4),
|
| 658 |
+
"val_acc": val_acc,
|
| 659 |
+
"final_loss": final_loss,
|
| 660 |
+
"epoch_count": len(losses),
|
| 661 |
+
}
|
| 662 |
+
|
| 663 |
+
if acc_penalty_reason:
|
| 664 |
+
breakdown["acc_penalty_reason"] = acc_penalty_reason
|
| 665 |
+
|
| 666 |
+
return final_score, breakdown
|
| 667 |
+
|
| 668 |
+
|
| 669 |
def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
|
| 670 |
graders = {
|
| 671 |
"task1": grade_task1,
|
|
|
|
| 673 |
"task3": grade_task3,
|
| 674 |
"task4": grade_task4,
|
| 675 |
"task5": grade_task5,
|
| 676 |
+
"task6": grade_task6,
|
| 677 |
}
|
| 678 |
if task_id not in graders:
|
| 679 |
raise ValueError(f"Unknown task_id: {task_id}")
|
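The trajectory checks in these graders average the first and last half (or quarter) of `losses`. One Python detail matters there: unary minus binds tighter than `//`, so `-n // 2` floors toward negative infinity and differs from `-(n // 2)` whenever `n` is odd. The sketch below uses a hypothetical helper, `half_means`, which is not part of the graders, to keep the slice length and the divisor in agreement.

```python
def half_means(losses):
    """Return (first_half_mean, last_half_mean) using equal-sized halves."""
    k = len(losses) // 2
    if k == 0:
        raise ValueError("need at least two loss values")
    first = sum(losses[:k]) / k
    # Slice with -k so exactly k values are summed; -len(losses)//2 would
    # take one extra element for odd lengths.
    last = sum(losses[-k:]) / k
    return first, last


if __name__ == "__main__":
    n = 11
    print(-n // 2, -(n // 2))  # the two expressions disagree for odd n
    losses = [float(x) for x in range(11, 0, -1)]
    print(half_means(losses))
```

With an odd number of epochs, the unparenthesised form sums one more value than it divides by, which inflates the late-phase mean and can understate the measured improvement.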
server/tasks/task3_oom_leakage.py
CHANGED
|
@@ -1,7 +1,11 @@
|
|
| 1 |
TASK_DESCRIPTION = """
|
| 2 |
This binary classification trainer has a bug causing validation accuracy around 50%.
|
| 3 |
-
|
| 4 |
Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
|
|
|
|
| 5 |
"""
|
| 6 |
|
| 7 |
BUGGY_CODE = """
|
|
|
|
| 1 |
TASK_DESCRIPTION = """
|
| 2 |
This binary classification trainer has a bug causing validation accuracy around 50%.
|
| 3 |
+
The bug inverts the labels during training. Fix it so after 20 epochs:
|
| 4 |
+
- VAL_ACC > 0.90 (the primary goal)
|
| 5 |
+
- FINAL_LOSS < 0.3
|
| 6 |
+
|
| 7 |
Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
|
| 8 |
+
Wrap output in ##METRICS_START## and ##METRICS_END##
|
| 9 |
"""
|
| 10 |
|
| 11 |
BUGGY_CODE = """
|
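The task descriptions ask submissions to print metrics as `KEY:value` lines wrapped in `##METRICS_START##` and `##METRICS_END##`. The following is a rough sketch of emitting and reading back such a block. `emit_metrics` and `read_metrics` are hypothetical names used only for illustration; the repository's own `parse_scalar` and `parse_val_accs` helpers are not shown in this diff and may behave differently.

```python
import json
import re

START, END = "##METRICS_START##", "##METRICS_END##"


def emit_metrics(val_accs, final_loss):
    """Format metrics the way the task description requests."""
    lines = [
        START,
        "VAL_ACCS:[" + ",".join(f"{v:.4f}" for v in val_accs) + "]",
        f"FINAL_LOSS:{final_loss:.2f}",
        END,
    ]
    return "\n".join(lines)


def read_metrics(stdout):
    """Extract KEY:value pairs found between the two markers."""
    match = re.search(re.escape(START) + r"(.*?)" + re.escape(END), stdout, re.DOTALL)
    if match is None:
        return {}
    metrics = {}
    for line in match.group(1).strip().splitlines():
        key, sep, raw = line.partition(":")
        if not sep:
            continue
        try:
            metrics[key.strip()] = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # skip values that are not plain numbers or lists
    return metrics


if __name__ == "__main__":
    out = "epoch logs...\n" + emit_metrics([0.52, 0.91], 0.1234)
    print(read_metrics(out))
```

Restricting parsing to the marked block keeps stray training logs from being mistaken for metrics, which is presumably why the wrapper lines were added to the task description.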
server/tasks/task6_io_mismatch.py
ADDED
|
@@ -0,0 +1,124 @@
|
|
| 1 |
+
TASK_DESCRIPTION = """
|
| 2 |
+
This image classification script has multiple input-output mismatch bugs that cause
|
| 3 |
+
silent failures or crashes. The model is a simple CNN trained on synthetic "images".
|
| 4 |
+
|
| 5 |
+
There are 4 BUGS to fix:
|
| 6 |
+
1. Shape mismatch: The model expects 28x28 images but data generator creates 32x32
|
| 7 |
+
2. Channel order mismatch: Model expects CHW but data is HWC format
|
| 8 |
+
3. Label encoding mismatch: Model expects class indices but labels are one-hot encoded
|
| 9 |
+
4. Batch dimension mismatch: A validation step processes unbatched data
|
| 10 |
+
|
| 11 |
+
Fix all bugs so that:
|
| 12 |
+
- Training runs without errors for 30 epochs
|
| 13 |
+
- VAL_ACC > 0.85
|
| 14 |
+
- FINAL_LOSS < 0.5
|
| 15 |
+
|
| 16 |
+
Print as: LOSSES:[l1,l2,...], VAL_ACC:X.XX, FINAL_LOSS:X.XX
|
| 17 |
+
Wrap output in ##METRICS_START## and ##METRICS_END##
|
| 18 |
+
"""
+
+BUGGY_CODE = """
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, TensorDataset
+
+torch.manual_seed(42)
+
+NUM_CLASSES = 5
+BATCH_SIZE = 32
+EPOCHS = 30
+
+# BUG 1: Create 32x32 images but model expects 28x28
+# Generate synthetic image data (HWC format - common from PIL/OpenCV)
+def generate_data(n_samples):
+    # Creates images in HWC format (Height x Width x Channels)
+    images = torch.randn(n_samples, 32, 32, 1)  # BUG: Wrong size & HWC format
+    # Each class has a distinct pattern based on mean pixel value region
+    class_indices = torch.randint(0, NUM_CLASSES, (n_samples,))
+    for i, c in enumerate(class_indices):
+        images[i] += c * 0.5  # Add class-dependent offset
+
+    # BUG 3: Return one-hot labels instead of class indices
+    labels = F.one_hot(class_indices, NUM_CLASSES).float()
+    return images, labels
+
+X_train, y_train = generate_data(800)
+X_val, y_val = generate_data(200)
+
+# BUG 2: Model expects CHW format (Channels x Height x Width) and 28x28 images
+class SimpleCNN(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # Expecting input: (batch, 1, 28, 28) in CHW format
+        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
+        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
+        self.pool = nn.MaxPool2d(2)
+        # After two pooling ops on 28x28: 28->14->7, so 7*7*32 = 1568
+        self.fc = nn.Linear(7 * 7 * 32, NUM_CLASSES)
+
+    def forward(self, x):
+        # Expects x to be (batch, channels, height, width)
+        x = self.pool(F.relu(self.conv1(x)))  # -> (batch, 16, 14, 14)
+        x = self.pool(F.relu(self.conv2(x)))  # -> (batch, 32, 7, 7)
+        x = x.view(x.size(0), -1)
+        return self.fc(x)
+
+model = SimpleCNN()
+optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+criterion = nn.CrossEntropyLoss()  # Expects class indices, not one-hot
+
+train_ds = TensorDataset(X_train, y_train)
+train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
+
+losses = []
+for epoch in range(EPOCHS):
+    model.train()
+    epoch_loss = 0.0
+    for images, labels in train_loader:
+        optimizer.zero_grad()
+
+        # Missing: permute from HWC to CHW format
+        # Missing: resize from 32x32 to 28x28
+        outputs = model(images)
+
+        # BUG: criterion expects class indices but labels are one-hot
+        loss = criterion(outputs, labels)
+
+        loss.backward()
+        optimizer.step()
+        epoch_loss += loss.item()
+    losses.append(epoch_loss / len(train_loader))
+
+# Validation
+model.eval()
+with torch.no_grad():
+    # BUG 4: Process single sample without batch dimension
+    sample = X_val[0]  # Shape: (32, 32, 1) - missing batch dim
+    single_pred = model(sample)  # Will crash: expects (batch, C, H, W)
+
+    # Full validation (also has format issues)
+    val_outputs = model(X_val)
+    val_preds = val_outputs.argmax(dim=1)
+    val_labels = y_val.argmax(dim=1)  # Convert one-hot back to indices for comparison
+    val_acc = (val_preds == val_labels).float().mean().item()
+
+print('##METRICS_START##')
+print('LOSSES:' + str([round(l, 4) for l in losses]))
+print('VAL_ACC:' + str(round(val_acc, 4)))
+print('FINAL_LOSS:' + str(round(losses[-1], 4)))
+print('##METRICS_END##')
+"""
+
+GROUND_TRUTH_BUGS = [
+    "Shape mismatch: Images are 32x32 but model expects 28x28 - need to resize or fix model architecture",
+    "Channel order mismatch: Data is in HWC format but model expects CHW - use .permute(0, 3, 1, 2)",
+    "Label encoding mismatch: Labels are one-hot but CrossEntropyLoss expects class indices - use .argmax(dim=1) or change generate_data",
+    "Batch dimension mismatch: single sample missing batch dimension - use sample.unsqueeze(0)",
+]
+
+# Expected fixes (for grader reference):
+# 1. Either resize images to 28x28 OR change model to expect 32x32 (fc layer: 8*8*32)
+# 2. Add: images = images.permute(0, 3, 1, 2)  # HWC -> CHW
+# 3. Either: labels = class_indices (return indices) OR labels = labels.argmax(dim=1) before criterion
+# 4. Add: sample = sample.unsqueeze(0).permute(0, 3, 1, 2) before model(sample)
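
The four expected fixes listed for task 6 can be sketched as two small helpers. This is a minimal illustration, not the grader's reference solution: the helper names `to_model_input` and `to_class_indices` are hypothetical, and resizing with bilinear interpolation is only one of the two options named in fix 1 (the other is changing the `fc` layer to `8*8*32`).

```python
import torch
import torch.nn.functional as F


def to_model_input(images_hwc: torch.Tensor, size: int = 28) -> torch.Tensor:
    """Convert (N, H, W, C) or (H, W, C) images to (N, C, size, size).

    Hypothetical helper covering fixes 1, 2 and 4 from the list above.
    """
    if images_hwc.dim() == 3:
        # Fix 4: a single sample is missing its batch dimension.
        images_hwc = images_hwc.unsqueeze(0)
    # Fix 2: HWC -> CHW.
    images_chw = images_hwc.permute(0, 3, 1, 2)
    # Fix 1 (resize option): bring 32x32 inputs down to the 28x28 the model expects.
    return F.interpolate(
        images_chw, size=(size, size), mode="bilinear", align_corners=False
    )


def to_class_indices(labels: torch.Tensor) -> torch.Tensor:
    """Fix 3: turn one-hot (N, K) labels into (N,) class indices."""
    return labels.argmax(dim=1) if labels.dim() == 2 else labels


if __name__ == "__main__":
    batch = torch.randn(4, 32, 32, 1)
    one_hot = F.one_hot(torch.tensor([0, 1, 2, 3]), 5).float()
    print(to_model_input(batch).shape)
    print(to_model_input(batch[0]).shape)
    print(to_class_indices(one_hot))
```

In the training loop this would amount to calling `model(to_model_input(images))` and `criterion(outputs, to_class_indices(labels))`. Whether the resulting model reaches the stated VAL_ACC and FINAL_LOSS targets is not something this sketch verifies.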
test_api.sh
ADDED
|
@@ -0,0 +1,50 @@
+#!/bin/bash
+# Test HuggingFace Inference Providers API
+
+echo "Testing HuggingFace Inference Providers..."
+echo ""
+
+HF_TOKEN="${HF_TOKEN:-$(grep -v '^#' .env 2>/dev/null | head -1)}"
+
+if [ -z "$HF_TOKEN" ]; then
+    echo "❌ HF_TOKEN not set"
+    exit 1
+fi
+
+echo "Testing with model: Qwen/Qwen2.5-7B-Instruct"
+echo ""
+
+response=$(curl -s -w "\nHTTP_CODE:%{http_code}" \
+    https://router.huggingface.co/v1/chat/completions \
+    -H "Authorization: Bearer $HF_TOKEN" \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "messages": [{"role": "user", "content": "Say hello in 5 words"}],
+        "max_tokens": 20
+    }')
+
+http_code=$(echo "$response" | grep "HTTP_CODE" | cut -d: -f2)
+body=$(echo "$response" | sed '/HTTP_CODE/d')
+
+if [ "$http_code" = "200" ]; then
+    echo "✅ API Test Successful!"
+    echo ""
+    echo "Response:"
+    echo "$body" | python3 -m json.tool 2>/dev/null || echo "$body"
+else
+    echo "❌ API Test Failed (HTTP $http_code)"
+    echo ""
+    echo "Response:"
+    echo "$body"
+    exit 1
+fi
+
+echo ""
+echo "===================="
+echo "✅ Configuration is working!"
+echo "Use this in your .bashrc or .env:"
+echo ""
+echo "export API_BASE_URL=\"https://router.huggingface.co/v1\""
+echo "export MODEL_NAME=\"Qwen/Qwen2.5-7B-Instruct\""
+echo "export HF_TOKEN=\"$HF_TOKEN\""
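
For readers who would rather issue the same smoke-test request from Python, the curl call in `test_api.sh` can be approximated with the standard library alone. This is a sketch under stated assumptions: the function name `build_chat_request` is hypothetical, and only the endpoint, model name, headers, and payload fields are taken from the script.

```python
import json
import urllib.request

# Endpoint and default model as used by test_api.sh.
API_URL = "https://router.huggingface.co/v1/chat/completions"
DEFAULT_MODEL = "Qwen/Qwen2.5-7B-Instruct"


def build_chat_request(token, prompt, model=DEFAULT_MODEL, max_tokens=20):
    """Build (but do not send) the POST request that the script issues via curl."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(build_chat_request(token, "Say hello in 5 words"), timeout=30)`. `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, which plays the role of the script's `http_code` check.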
vaidate-submission.sh
ADDED
|
@@ -0,0 +1,185 @@
+#!/usr/bin/env bash
+#
+# validate-submission.sh - OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker: https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+# Or download and run locally:
+#   chmod +x validate-submission.sh
+#   ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+
+set -uo pipefail
+
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf " repo_dir Path to your repo (default: current directory)\n"
+  exit 1
+fi
+
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+
+log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD} OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo: $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
+
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+
+log " Found Dockerfile in $DOCKER_CONTEXT"
+
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+
+exit 0
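
The `run_with_timeout` helper in the script falls back through `timeout`, `gtimeout`, and a hand-rolled watcher process. As a point of comparison, the same guard might look as follows in Python, where `subprocess.run` takes a `timeout` argument directly. This is an illustrative sketch, not part of the validator. Returning 124 on expiry is a choice made here to mirror GNU `timeout`'s exit status, and the Python function merely reuses the shell helper's name.

```python
import subprocess
import sys


def run_with_timeout(cmd, secs):
    """Run cmd and return (exit_code, combined_output).

    A timeout is reported as exit code 124, mirroring GNU timeout.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
            text=True,
            timeout=secs,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may be missing or delivered as bytes.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return 124, out


if __name__ == "__main__":
    code, out = run_with_timeout([sys.executable, "-c", "print('ok')"], 10)
    print(code, out.strip())
```

A call such as `run_with_timeout(["docker", "build", context_dir], 600)` would correspond to Step 2 of the script, with `context_dir` standing in for `$DOCKER_CONTEXT`.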