Amogh-kal1 committed on
Commit
0c28a91
·
verified ·
1 Parent(s): 3f12d92

Upload folder using huggingface_hub

GRADER_ANALYSIS.md ADDED
@@ -0,0 +1,56 @@
1
+ # Task & Grader Analysis Report
2
+
3
+ ## 🔴 CRITICAL FINDING: Why Scores Were Identical
4
+
5
+ The score `0.7390` appearing for all tasks suggested the LLM was generating code that:
6
+ 1. **Ran successfully** (exit_code = 0)
7
+ 2. **Output valid metrics** (LOSSES, VAL_ACC, etc.)
8
+ 3. **BUT didn't necessarily fix the actual bugs**
9
+
10
+ The old graders gave "partial credit" for any running code, leading to similar scores.
11
+
12
+ ## ✅ FIXES APPLIED
13
+
14
+ ### 1. Stricter Sigmoid Centers
15
+ - Old centers were too forgiving (e.g., val_acc center at 0.85)
16
+ - New centers require better performance (e.g., val_acc center at 0.90-0.92)
17
+ - Increased steepness for sharper differentiation (15→25-30)
18
+
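The effect of moving a sigmoid center and raising its steepness can be illustrated with a small sketch. The function name `sigmoid_score` and the specific values (center 0.91, steepness 25) are assumptions chosen from the ranges quoted above; the real graders in `server/tasks/graders.py` may be structured differently.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float) -> float:
    """Map a metric to (0, 1); the result is exactly 0.5 when value == center."""
    return 1.0 / (1.0 + math.exp(-steepness * (value - center)))


# A run with val_acc = 0.85 sits at the midpoint of the old curve ...
old_score = sigmoid_score(0.85, center=0.85, steepness=15.0)
# ... but falls well below the midpoint of a stricter, steeper curve.
new_score = sigmoid_score(0.85, center=0.91, steepness=25.0)
# A clearly better run still scores high under the stricter curve.
good_score = sigmoid_score(0.95, center=0.91, steepness=25.0)

print(f"old={old_score:.3f} new={new_score:.3f} good={good_score:.3f}")
```

A steeper curve compresses the range of metric values that earn middling credit, which is what separates "runs but unfixed" from "actually fixed".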
19
+ ### 2. Early Rejection for Unfixed Bugs
20
+ - Added explicit checks for "likely unfixed" states
21
+ - Task 3: Reject if val_acc < 0.65 (buggy code gives ~0.50)
22
+ - Task 4: Reject if f1 < 0.40 (buggy code gives ~0.25)
23
+ - Task 5: Reject if buggy state unchanged
24
+
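A minimal sketch of such an early-rejection check follows. The table `REJECT_BELOW`, the function `early_reject`, and the metric key names are hypothetical; only the thresholds (0.65 for task 3, 0.40 for task 4) come from the list above. Task 5 is omitted because its check depends on state the report does not describe.

```python
from typing import Optional

# Hypothetical lookup: task id -> (metric name, minimum acceptable value).
REJECT_BELOW = {
    "task3": ("val_acc", 0.65),  # buggy code gives roughly 0.50
    "task4": ("f1", 0.40),       # buggy code gives roughly 0.25
}


def early_reject(task_id: str, metrics: dict) -> Optional[str]:
    """Return a rejection reason if the run looks unfixed, otherwise None."""
    rule = REJECT_BELOW.get(task_id)
    if rule is None:
        return None  # no early-rejection rule for this task
    name, minimum = rule
    value = metrics.get(name)
    if value is None:
        return f"{name} missing from metrics"
    if value < minimum:
        return f"{name}={value:.2f} is below {minimum:.2f}"
    return None


print(early_reject("task3", {"val_acc": 0.50}))
print(early_reject("task3", {"val_acc": 0.93}))
```

Running the check before any partial-credit scoring keeps a script that merely executes from collecting points it did not earn.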
25
+ ### 3. Task 3 Mismatch Fixed
26
+ - **Old**: Description said "OOM and data leakage"
27
+ - **Actual bug**: Label inversion (`criterion(out, 1 - yb)`)
28
+ - **Fixed**: Updated grader to match actual bug
29
+
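To see why label inversion is so damaging, consider one confident, correct prediction. The sketch below assumes `criterion` is a binary cross-entropy loss, which the report does not state explicitly, and uses a plain-Python helper `bce` rather than the task's actual PyTorch code.

```python
import math


def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one predicted probability p and target y."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


p, y = 0.9, 1.0                    # model is confident and correct
correct_loss = bce(p, y)           # small: the prediction is rewarded
inverted_loss = bce(p, 1.0 - y)    # large: the same prediction is punished

print(f"correct={correct_loss:.3f} inverted={inverted_loss:.3f}")
```

With inverted targets, gradient descent pushes the model toward the wrong class, so training can look healthy (falling loss) while validation accuracy against the true labels stays poor.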
30
+ ### 4. Reduced Base Scores
31
+ - The old task 2 grader gave 0.40 "free" for avoiding NaN
32
+ - The new grader gives a 0.35 base score, with stricter accuracy requirements
33
+
34
+ ## Updated Grader Summary
35
+
36
+ | Task | Bug | Key Metric | Threshold | Weight |
37
+ |------|-----|------------|-----------|--------|
38
+ | task1 | LR + step/backward | VAL_ACC | >0.90 | 60% |
39
+ | task2 | NaN loss | No NaN + VAL_ACC | >0.80 | 40% |
40
+ | task3 | Label inversion | VAL_ACC | >0.92 | 60% |
41
+ | task4 | Wrong loss | F1_SCORE | >0.70 | 55% |
42
+ | task5 | Frozen backbone | Fix detection | Binary | 70% |
43
+
44
+ ## Expected Score Distribution After Fix
45
+
46
+ - **Well-fixed code** (correct fix): 0.85-1.00
47
+ - **Partially fixed** (runs but suboptimal): 0.40-0.70
48
+ - **Unfixed** (bug still present): 0.10-0.25
49
+ - **Broken** (crashes): 0.00-0.10
50
+
51
+ This creates better separation between models of different capability.
52
+
53
+ ## Files Modified
54
+
55
+ 1. `server/tasks/graders.py` - All 5 graders updated
56
+ 2. `server/tasks/task3_oom_leakage.py` - Description clarified
HACKATHON_GUIDE.md ADDED
@@ -0,0 +1,214 @@
1
+ # WhipStudio - OpenEnv Hackathon Submission Guide
2
+
3
+ Complete guide for running inference, training, and evaluation for the Scaler Meta PyTorch Hackathon.
4
+
5
+ ## 🚀 Quick Start
6
+
7
+ ### 1. Environment Setup
8
+
9
+ ```bash
10
+ # Set your HuggingFace token
11
+ export HF_TOKEN="your_token_here"
12
+
13
+ # For HuggingFace models (recommended)
14
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
15
+ export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
16
+
17
+ # Or use the convenience script
18
+ ./run_inference.sh https://amogh-kal1-whipstudio.hf.space
19
+ ```
20
+
21
+ ### 2. Run Hackathon Inference
22
+
23
+ The `inference.py` script meets all hackathon requirements:
24
+ - ✅ Uses OpenAI-compatible client
25
+ - ✅ Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment
26
+ - ✅ Emits [START], [STEP], [END] logs
27
+ - ✅ Runs all 5 tasks with max 3 attempts each
28
+
29
+ ```bash
30
+ python inference.py --env-url https://amogh-kal1-whipstudio.hf.space
31
+ ```
32
+
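The environment-variable and logging requirements can be sketched as follows. This is not the contents of `inference.py`: the helper names `read_config` and `log_line`, and the JSON payload after each tag, are assumptions made for illustration. Only the three variable names and the `[START]`, `[STEP]`, `[END]` tags come from the checklist above.

```python
import json
import os
from typing import Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def read_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the required settings, failing early if any are unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def log_line(tag: str, **fields) -> str:
    """Format one structured log line; the field layout is hypothetical."""
    return f"[{tag}] " + json.dumps(fields, sort_keys=True)


# Example with an explicit mapping instead of the real environment.
config = read_config(
    {"API_BASE_URL": "http://localhost:8000/v1", "MODEL_NAME": "demo", "HF_TOKEN": "dummy"}
)
print(log_line("START", model=config["MODEL_NAME"]))
print(log_line("STEP", task="task1", attempt=1))
print(log_line("END", task="task1", score=0.0))
```

Failing fast on a missing variable gives a clearer error than an authentication failure halfway through a run.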
33
+ ## 📊 Training with GRPO
34
+
35
+ Train a model using Group Relative Policy Optimization:
36
+
37
+ ### Basic Training
38
+ ```bash
39
+ python improved_agent.py \
40
+ --env_url https://amogh-kal1-whipstudio.hf.space \
41
+ --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
42
+ --output_dir ./trained-model \
43
+ --num_iterations 50
44
+ ```
45
+
46
+ ### Memory-Efficient Training (8GB VRAM)
47
+ ```bash
48
+ python improved_agent.py \
49
+ --env_url https://amogh-kal1-whipstudio.hf.space \
50
+ --model_name Qwen/Qwen2.5-Coder-1.5B-Instruct \
51
+ --use_lora \
52
+ --use_4bit \
53
+ --gradient_checkpointing \
54
+ --output_dir ./trained-model-lora
55
+ ```
56
+
57
+ ### Training Features
58
+ - **Curriculum Learning**: Starts with easier tasks, progresses to harder ones
59
+ - **LoRA Support**: Efficient fine-tuning with adapters
60
+ - **4-bit Quantization**: Train on GPUs with limited VRAM
61
+ - **Checkpoint Saving**: Best model saved automatically
62
+ - **Early Stopping**: Stops training when there is no further improvement
63
+ - **Wandb Logging**: Optional tracking with `--use_wandb`
64
+
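Early stopping is usually implemented with a patience counter. The sketch below shows the general idea only; the class `EarlyStopper` and its `patience` and `min_delta` parameters are hypothetical and are not flags or internals of `improved_agent.py`.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0  # evaluations since the last improvement

    def update(self, score: float) -> bool:
        """Record a new score and return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


stopper = EarlyStopper(patience=2)
decisions = [stopper.update(s) for s in (0.3, 0.5, 0.5, 0.4)]
print(decisions)
```

Saving a checkpoint whenever `best` changes pairs naturally with this pattern, since the best model is then already on disk when the stop signal arrives.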
65
+ ## 🎯 Evaluation on MNIST
66
+
67
+ Compare base vs trained models on an out-of-distribution MNIST debugging task:
68
+
69
+ ### Compare Two Models
70
+ ```bash
71
+ python evaluate_mnist.py \
72
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
73
+ --trained_model ./trained-model/best \
74
+ --num_runs 3
75
+ ```
76
+
77
+ ### Use Real MNIST Dataset
78
+ ```bash
79
+ python evaluate_mnist.py \
80
+ --use_real_mnist \
81
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
82
+ --trained_model ./trained-model/best
83
+ ```
84
+
85
+ ### Compare Multiple Models
86
+ ```bash
87
+ python evaluate_mnist.py \
88
+ --use_real_mnist \
89
+ --models Qwen/Qwen2.5-Coder-1.5B-Instruct \
90
+ Qwen/Qwen2.5-Coder-7B-Instruct \
91
+ ./trained-model-v1/best \
92
+ ./trained-model-v2/best
93
+ ```
94
+
95
+ ## 🔧 Configuration
96
+
97
+ ### HuggingFace API (Recommended)
98
+ ```bash
99
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
100
+ export MODEL_NAME="Qwen/Qwen2.5-Coder-32B-Instruct"
101
+ export HF_TOKEN="hf_your_token"
102
+ ```
103
+
104
+ ### OpenAI API
105
+ ```bash
106
+ export API_BASE_URL="https://api.openai.com/v1"
107
+ export MODEL_NAME="gpt-4o-mini"
108
+ export OPENAI_API_KEY="sk-your-key"
109
+ ```
110
+
111
+ ### Local Model Inference
112
+ ```bash
113
+ # Use vLLM or similar OpenAI-compatible server
114
+ export API_BASE_URL="http://localhost:8000/v1"
115
+ export MODEL_NAME="your-local-model"
116
+ export HF_TOKEN="dummy" # Still required by script
117
+ ```
118
+
119
+ ## 📝 Hackathon Requirements Checklist
120
+
121
+ - ✅ **HF Space deploys**: https://amogh-kal1-whipstudio.hf.space
122
+ - ✅ **OpenEnv spec compliance**: openenv.yaml, typed models, endpoints
123
+ - ✅ **Dockerfile builds**: server/Dockerfile
124
+ - ✅ **inference.py exists**: Root directory
125
+ - ✅ **Uses OpenAI Client**: With API_BASE_URL, MODEL_NAME, HF_TOKEN
126
+ - ✅ **Structured logs**: [START], [STEP], [END] format
127
+ - ✅ **3+ tasks with graders**: 5 tasks (task1-task5)
128
+
129
+ ## 🐛 Troubleshooting
130
+
131
+ ### 500 Error from HF Space
132
+ ```
133
+ [ERROR] Server error '500 Internal Server Error'
134
+ ```
135
+
136
+ **Solution**:
137
+ 1. Visit your HF Space in a browser first: https://amogh-kal1-whipstudio.hf.space
138
+ 2. Wait for it to fully start (cold start can take 1-2 minutes)
139
+ 3. Check the Space logs for errors
140
+ 4. Try the /health endpoint: `curl https://amogh-kal1-whipstudio.hf.space/health`
141
+
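Waiting out a cold start can also be scripted. The sketch below polls the `/health` endpoint until it answers with HTTP 200; the function names, the 12 attempts and the 10 second delay are assumptions sized to the 1-2 minute cold start mentioned above, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 500 while the Space is still starting
    except (urllib.error.URLError, OSError):
        return 0


def wait_for_health(url, attempts=12, delay=10.0, fetch=http_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ok = wait_for_health("https://amogh-kal1-whipstudio.hf.space/health")
    print("Space is up" if ok else "Space did not become healthy in time")
```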
142
+ ### Missing Dependencies
143
+ ```bash
144
+ pip install openai httpx transformers torch torchvision trl peft bitsandbytes accelerate datasets
145
+ ```
146
+
147
+ ### Out of Memory During Training
148
+ Use memory-efficient options:
149
+ ```bash
150
+ python improved_agent.py \
151
+ --use_4bit \
152
+ --use_lora \
153
+ --gradient_checkpointing \
154
+ --lora_r 8 # Lower rank for less memory
155
+ ```
156
+
157
+ ### HuggingFace API Rate Limits
158
+ If you hit rate limits with HuggingFace's free tier:
159
+ 1. Use a smaller model (e.g., 1.5B instead of 32B)
160
+ 2. Reduce `--num_iterations` for training
161
+ 3. Reduce `--num_runs` for evaluation
162
+
163
+ ## 📚 File Descriptions
164
+
165
+ | File | Purpose |
166
+ |------|---------|
167
+ | `inference.py` | **Hackathon submission script** - runs all tasks with structured logging |
168
+ | `improved_agent.py` | Train model with GRPO (curriculum learning, LoRA, 4-bit) |
169
+ | `evaluate_mnist.py` | Compare models on out-of-distribution MNIST debugging |
170
+ | `run_inference.sh` | Convenience script for quick inference runs |
171
+ | `baseline_agent.py` | Original baseline (not hackathon-compliant) |
172
+
173
+ ## 🎓 Example Workflow
174
+
175
+ ```bash
176
+ # 1. Run baseline inference
177
+ export HF_TOKEN="your_token"
178
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
179
+ export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
180
+ python inference.py --env-url https://amogh-kal1-whipstudio.hf.space
181
+
182
+ # 2. Train model with GRPO
183
+ python improved_agent.py \
184
+ --env_url https://amogh-kal1-whipstudio.hf.space \
185
+ --use_lora --use_4bit \
186
+ --num_iterations 30 \
187
+ --output_dir ./my-trained-model
188
+
189
+ # 3. Evaluate on MNIST
190
+ python evaluate_mnist.py \
191
+ --use_real_mnist \
192
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
193
+ --trained_model ./my-trained-model/best \
194
+ --num_runs 5
195
+
196
+ # 4. Validate submission
197
+ ./validate-submission.sh https://amogh-kal1-whipstudio.hf.space
198
+ ```
199
+
200
+ ## 🏆 Tips for Best Results
201
+
202
+ 1. **Start with small experiments**: Use `--num_iterations 10` first
203
+ 2. **Monitor training**: Use `--use_wandb` to track progress
204
+ 3. **Curriculum helps**: Keep `--curriculum_stages 3` for better learning
205
+ 4. **Real MNIST is harder**: Expect lower scores but more realistic evaluation
206
+ 5. **Multiple runs**: Use `--num_runs 5` to average out run-to-run variance from sampling
207
+
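When comparing models over several runs, it helps to report a spread alongside the mean. The sketch below is one way to summarize per-run scores; the `summarize` helper is hypothetical and the score lists are made-up placeholders, not results from `evaluate_mnist.py`.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev


# Placeholder numbers for illustration only.
base_runs = [0.2, 0.4]
trained_runs = [0.5, 0.5, 0.5]

for label, runs in (("base", base_runs), ("trained", trained_runs)):
    mean, stdev = summarize(runs)
    print(f"{label}: mean={mean:.3f} stdev={stdev:.3f} (n={len(runs)})")
```

If the gap between two means is smaller than the spread within each model's runs, more runs are needed before drawing a conclusion.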
208
+ ## 📧 Support
209
+
210
+ If you encounter issues:
211
+ 1. Check the troubleshooting section above
212
+ 2. Verify your HF Space is running: visit the URL in browser
213
+ 3. Check environment variables: `echo $API_BASE_URL $MODEL_NAME $HF_TOKEN`
214
+ 4. Review the logs for detailed error messages
baseline_agent.py CHANGED
@@ -18,7 +18,17 @@ Rules:
18
  """.strip()
19
 
20
 
21
- def get_model():
22
  from smolagents import InferenceClientModel
23
 
24
  hf_token = os.environ.get("HF_TOKEN")
@@ -27,8 +37,13 @@ def get_model():
27
  "HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
28
  )
29
 
 
 
 
 
 
30
  return InferenceClientModel(
31
- model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
32
  token=hf_token,
33
  )
34
 
@@ -81,15 +96,23 @@ def _generate_fixed_code(model, prompt: str) -> str:
81
  raise AttributeError("Model does not support callable() or generate() inference APIs")
82
 
83
 
84
- async def run_single_task(task_id: str, env_url: str = "http://localhost:7860") -> float:
85
  """Backwards-compatible wrapper that returns just the score."""
86
- result = await run_single_task_detailed(task_id, env_url)
87
  return result["score"]
88
 
89
 
90
- async def run_single_task_detailed(task_id: str, env_url: str = "http://localhost:7860") -> dict:
91
  """Run the baseline agent on a single task. Returns detailed results."""
92
- model = get_model()
93
  timeout = httpx.Timeout(900.0, connect=10.0)
94
 
95
  attempts_log = []
@@ -173,13 +196,13 @@ if __name__ == "__main__":
173
 
174
  async def main():
175
  scores = {}
176
- for tid in ["task1", "task2", "task3"]:
177
  try:
178
  s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
179
  except TimeoutError:
180
  s = 0.0
181
  scores[tid] = round(s, 4)
182
  print(f"{tid}: {s:.4f}")
183
- print(f"Average: {sum(scores.values()) / 3:.4f}")
184
 
185
  asyncio.run(main())
 
18
  """.strip()
19
 
20
 
21
+ SUPPORTED_MODEL_IDS = [
22
+ "Qwen/Qwen2.5-Coder-1.5B-Instruct",
23
+ "Qwen/Qwen2.5-Coder-3B-Instruct",
24
+ "Qwen/Qwen2.5-Coder-7B-Instruct",
25
+ "Qwen/Qwen2.5-Coder-14B-Instruct",
26
+ "Qwen/Qwen2.5-Coder-32B-Instruct",
27
+ "mistralai/Mistral-7B-Instruct-v0.3",
28
+ ]
29
+
30
+
31
+ def get_model(model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct"):
32
  from smolagents import InferenceClientModel
33
 
34
  hf_token = os.environ.get("HF_TOKEN")
 
37
  "HF_TOKEN is not set. Set HF_TOKEN to run /baseline with InferenceClientModel."
38
  )
39
 
40
+ if model_id not in SUPPORTED_MODEL_IDS:
41
+ raise ValueError(
42
+ f"Unsupported model_id '{model_id}'. Supported options: {SUPPORTED_MODEL_IDS}"
43
+ )
44
+
45
  return InferenceClientModel(
46
+ model_id=model_id,
47
  token=hf_token,
48
  )
49
 
 
96
  raise AttributeError("Model does not support callable() or generate() inference APIs")
97
 
98
 
99
+ async def run_single_task(
100
+ task_id: str,
101
+ env_url: str = "http://localhost:7860",
102
+ model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
103
+ ) -> float:
104
  """Backwards-compatible wrapper that returns just the score."""
105
+ result = await run_single_task_detailed(task_id, env_url, model_id)
106
  return result["score"]
107
 
108
 
109
+ async def run_single_task_detailed(
110
+ task_id: str,
111
+ env_url: str = "http://localhost:7860",
112
+ model_id: str = "Qwen/Qwen2.5-Coder-32B-Instruct",
113
+ ) -> dict:
114
  """Run the baseline agent on a single task. Returns detailed results."""
115
+ model = get_model(model_id)
116
  timeout = httpx.Timeout(900.0, connect=10.0)
117
 
118
  attempts_log = []
 
196
 
197
  async def main():
198
  scores = {}
199
+ for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
200
  try:
201
  s = await asyncio.wait_for(run_single_task(tid, args.env_url), timeout=95.0)
202
  except TimeoutError:
203
  s = 0.0
204
  scores[tid] = round(s, 4)
205
  print(f"{tid}: {s:.4f}")
206
+ print(f"Average: {sum(scores.values()) / len(scores):.4f}")
207
 
208
  asyncio.run(main())
evaluate_mnist.py ADDED
@@ -0,0 +1,816 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluate untrained vs GRPO-trained Qwen2.5-Coder-1.5B-Instruct on a real
4
+ MNIST handwritten digit recognition debugging task.
5
+
6
+ This script tests whether RL-trained models outperform base models
7
+ on out-of-distribution ML debugging tasks.
8
+
9
+ The MNIST debugging task is intentionally NOT in the WhipStudio training set,
10
+ making it a true test of generalization.
11
+
12
+ Workflow:
13
+ 1. Define a deliberately buggy MNIST training pipeline
14
+ 2. Load both base model and GRPO-fine-tuned model
15
+ 3. Ask each to fix the buggy code
16
+ 4. Execute both fixes and compare results
17
+ 5. Generate a comparison report
18
+
19
+ Requirements:
20
+ pip install transformers torch torchvision accelerate peft bitsandbytes
21
+
22
+ Usage:
23
+ # Basic comparison
24
+ python evaluate_mnist.py \
25
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
26
+ --trained_model ./whipstudio-debugger/best
27
+
28
+ # Multiple runs for statistical significance
29
+ python evaluate_mnist.py \
30
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
31
+ --trained_model ./whipstudio-debugger/best \
32
+ --num_runs 5
33
+
34
+ # Use 4-bit quantization for memory efficiency
35
+ python evaluate_mnist.py \
36
+ --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct \
37
+ --trained_model ./whipstudio-debugger/best \
38
+ --use_4bit
39
+ """
40
+
41
+ import argparse
42
+ import json
43
+ import math
44
+ import os
45
+ import re
46
+ import subprocess
47
+ import sys
48
+ import tempfile
49
+ import time
50
+ from pathlib import Path
51
+ from typing import Optional
52
+
53
+ import torch
54
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
55
+
56
+ # Optional PEFT import for LoRA models
57
+ try:
58
+ from peft import PeftModel
59
+ PEFT_AVAILABLE = True
60
+ except ImportError:
61
+ PEFT_AVAILABLE = False
62
+
63
+
64
+ # ══════════════════════════════════════════════════════════════════════════════
65
+ # System Prompt (same as training)
66
+ # ══════════════════════════════════════════════════════════════════════════════
67
+
68
+ SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
69
+ You receive a broken training script and must fix ALL bugs.
70
+ Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
71
+ Keep all torch.manual_seed() calls intact."""
72
+
73
+
74
+ # ══════════════════════════════════════════════════════════════════════════════
75
+ # Buggy MNIST Pipeline (Out-of-Distribution Test)
76
+ # ══════════════════════════════════════════════════════════════════════════════
77
+
78
+ # Two versions of the buggy code: synthetic (fast) and real MNIST (realistic)
79
+
80
+ MNIST_BUGGY_CODE_SYNTHETIC = '''
81
+ import torch
82
+ import torch.nn as nn
83
+ import torch.nn.functional as F
84
+ from torch.utils.data import DataLoader, TensorDataset
85
+
86
+ torch.manual_seed(42)
87
+
88
+ # Simulate MNIST-like data (28x28 images, 10 classes)
89
+ X_train = torch.randn(1000, 1, 28, 28)
90
+ y_train = torch.randint(0, 10, (1000,))
91
+ X_val = torch.randn(200, 1, 28, 28)
92
+ y_val = torch.randint(0, 10, (200,))
93
+
94
+ # Make data learnable: label = argmax of mean pixel value in 10 regions
95
+ for i in range(len(X_train)):
96
+ region_means = X_train[i, 0].flatten()[:780].reshape(10, -1).mean(dim=1)  # 784 pixels do not split into 10 equal regions, so use the first 780
97
+ y_train[i] = region_means.argmax()
98
+ for i in range(len(X_val)):
99
+ region_means = X_val[i, 0].flatten()[:780].reshape(10, -1).mean(dim=1)  # 784 pixels do not split into 10 equal regions, so use the first 780
100
+ y_val[i] = region_means.argmax()
101
+
102
+ train_ds = TensorDataset(X_train, y_train)
103
+ train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
104
+
105
+ class SimpleCNN(nn.Module):
106
+ def __init__(self):
107
+ super().__init__()
108
+ self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
109
+ self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
110
+ self.fc1 = nn.Linear(32 * 7 * 7, 128)
111
+ self.fc2 = nn.Linear(128, 10)
112
+
113
+ def forward(self, x):
114
+ x = F.relu(self.conv1(x))
115
+ x = F.max_pool2d(x, 2)
116
+ x = F.relu(self.conv2(x))
117
+ x = F.max_pool2d(x, 2)
118
+ x = x.view(x.size(0), -1)
119
+ x = F.relu(self.fc1(x))
120
+ # BUG 1: Applying softmax before CrossEntropyLoss (double softmax)
121
+ x = F.softmax(self.fc2(x), dim=1)
122
+ return x
123
+
124
+ model = SimpleCNN()
125
+
126
+ # BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
127
+ criterion = nn.NLLLoss()
128
+
129
+ # BUG 3: Learning rate too high for CNN
130
+ optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
131
+
132
+ losses = []
133
+ for epoch in range(20):
134
+ for xb, yb in train_loader:
135
+ optimizer.zero_grad()
136
+ out = model(xb)
137
+ loss = criterion(out, yb)
138
+ loss.backward()
139
+ optimizer.step()
140
+ losses.append(loss.item())
141
+
142
+ # Validation
143
+ model.eval()
144
+ with torch.no_grad():
145
+ val_out = model(X_val)
146
+ val_preds = val_out.argmax(dim=1)
147
+ val_acc = (val_preds == y_val).float().mean().item()
148
+
149
+ print('##METRICS_START##')
150
+ print('LOSSES:' + str(losses))
151
+ print('VAL_ACC:' + str(round(val_acc, 4)))
152
+ print('##METRICS_END##')
153
+ '''
154
+
155
+ MNIST_BUGGY_CODE_REAL = '''
156
+ import torch
157
+ import torch.nn as nn
158
+ import torch.nn.functional as F
159
+ from torch.utils.data import DataLoader, Subset
160
+ from torchvision import datasets, transforms
161
+
162
+ torch.manual_seed(42)
163
+
164
+ # Load REAL MNIST dataset
165
+ transform = transforms.Compose([
166
+ transforms.ToTensor(),
167
+ transforms.Normalize((0.1307,), (0.3081,))
168
+ ])
169
+
170
+ train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
171
+ test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
172
+
173
+ # Use subset for faster training (5000 train, 1000 val)
174
+ train_indices = torch.randperm(len(train_dataset))[:5000]
175
+ val_indices = torch.randperm(len(test_dataset))[:1000]
176
+
177
+ train_subset = Subset(train_dataset, train_indices)
178
+ val_subset = Subset(test_dataset, val_indices)
179
+
180
+ train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
181
+ val_loader = DataLoader(val_subset, batch_size=256, shuffle=False)
182
+
183
+ class SimpleCNN(nn.Module):
184
+ def __init__(self):
185
+ super().__init__()
186
+ self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
187
+ self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
188
+ self.fc1 = nn.Linear(32 * 7 * 7, 128)
189
+ self.fc2 = nn.Linear(128, 10)
190
+
191
+ def forward(self, x):
192
+ x = F.relu(self.conv1(x))
193
+ x = F.max_pool2d(x, 2)
194
+ x = F.relu(self.conv2(x))
195
+ x = F.max_pool2d(x, 2)
196
+ x = x.view(x.size(0), -1)
197
+ x = F.relu(self.fc1(x))
198
+ # BUG 1: Applying softmax before CrossEntropyLoss (double softmax)
199
+ x = F.softmax(self.fc2(x), dim=1)
200
+ return x
201
+
202
+ model = SimpleCNN()
203
+
204
+ # BUG 2: Using NLLLoss without log_softmax (expects log probabilities)
205
+ criterion = nn.NLLLoss()
206
+
207
+ # BUG 3: Learning rate too high for CNN
208
+ optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
209
+
210
+ losses = []
211
+ for epoch in range(10): # 10 epochs on real MNIST
212
+ for xb, yb in train_loader:
213
+ optimizer.zero_grad()
214
+ out = model(xb)
215
+ loss = criterion(out, yb)
216
+ loss.backward()
217
+ optimizer.step()
218
+ losses.append(loss.item())
219
+
220
+ # Validation on real MNIST test set
221
+ model.eval()
222
+ correct = 0
223
+ total = 0
224
+ with torch.no_grad():
225
+ for xb, yb in val_loader:
226
+ out = model(xb)
227
+ preds = out.argmax(dim=1)
228
+ correct += (preds == yb).sum().item()
229
+ total += yb.size(0)
230
+
231
+ val_acc = correct / total
232
+
233
+ print('##METRICS_START##')
234
+ print('LOSSES:' + str(losses[-100:])) # Last 100 losses to avoid huge output
235
+ print('VAL_ACC:' + str(round(val_acc, 4)))
236
+ print('##METRICS_END##')
237
+ '''
238
+
239
+ # Default to synthetic for backward compatibility
240
+ MNIST_BUGGY_CODE = MNIST_BUGGY_CODE_SYNTHETIC
241
+
242
+ MNIST_TASK_DESCRIPTION_SYNTHETIC = """
243
+ This is a CNN-based handwritten digit classifier (MNIST-like, 10 classes).
244
+ The model has several bugs preventing it from training properly.
245
+
246
+ Bugs to identify and fix:
247
+ 1. The forward pass has a problem with activation functions
248
+ 2. The loss function doesn't match the model output
249
+ 3. The optimizer has problematic hyperparameters
250
+
251
+ Fix ALL bugs so that after 20 epochs:
252
+ - Loss converges below 1.5
253
+ - Validation accuracy exceeds 0.50
254
+
255
+ Print losses as: LOSSES:[val1, val2, ...]
256
+ Print validation accuracy as: VAL_ACC:X.XX
257
+ Wrap metrics in ##METRICS_START## and ##METRICS_END##.
258
+ """
259
+
260
+ MNIST_TASK_DESCRIPTION_REAL = """
261
+ This is a CNN-based MNIST handwritten digit classifier using the REAL MNIST dataset.
262
+ The model has several bugs preventing it from training properly.
263
+
264
+ Bugs to identify and fix:
265
+ 1. The forward pass has a problem with activation functions
266
+ 2. The loss function doesn't match the model output
267
+ 3. The optimizer has problematic hyperparameters
268
+
269
+ Fix ALL bugs so that after 10 epochs on real MNIST:
270
+ - Loss converges and decreases over time
271
+ - Validation accuracy exceeds 0.85 (should be achievable on real MNIST)
272
+
273
+ Print the last 100 losses as: LOSSES:[val1, val2, ...]
274
+ Print validation accuracy as: VAL_ACC:X.XX
275
+ Wrap metrics in ##METRICS_START## and ##METRICS_END##.
276
+ """
277
+
278
+ MNIST_TASK_DESCRIPTION = MNIST_TASK_DESCRIPTION_SYNTHETIC
279
+
280
+
281
+ # ══════════════════════════════════════════════════════════════════════════════
282
+ # Helpers
283
+ # ══════════════════════════════════════════════════════════════════════════════
284
+
285
+ def load_model(
286
+ model_path: str,
287
+ use_4bit: bool = False,
288
+ is_peft: bool = False,
289
+ base_model_for_peft: Optional[str] = None,
290
+ ) -> tuple:
291
+ """Load model and tokenizer with optional quantization and PEFT."""
292
+
293
+ print(f" Loading model from {model_path}...")
294
+
295
+ # Quantization config
296
+ quantization_config = None
297
+ if use_4bit:
298
+ quantization_config = BitsAndBytesConfig(
299
+ load_in_4bit=True,
300
+ bnb_4bit_quant_type="nf4",
301
+ bnb_4bit_compute_dtype=torch.bfloat16,
302
+ )
303
+
304
+ # Model kwargs
305
+ model_kwargs = {
306
+ "trust_remote_code": True,
307
+ "device_map": "auto",
308
+ }
309
+ if quantization_config:
310
+ model_kwargs["quantization_config"] = quantization_config
311
+ else:
312
+ model_kwargs["torch_dtype"] = torch.bfloat16
313
+
314
+ # Load tokenizer
315
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
316
+ if tokenizer.pad_token is None:
317
+ tokenizer.pad_token = tokenizer.eos_token
318
+
319
+ # Check if this is a PEFT/LoRA model
320
+ adapter_config_path = Path(model_path) / "adapter_config.json"
321
+ if adapter_config_path.exists() or is_peft:
322
+ if not PEFT_AVAILABLE:
323
+ raise ImportError("PEFT model detected but peft is not installed")
324
+
325
+ # For PEFT models, we need to load base model first
326
+ if base_model_for_peft is None:
327
+ # Try to read from adapter config
328
+ if adapter_config_path.exists():
329
+ with open(adapter_config_path) as f:
330
+ adapter_config = json.load(f)
331
+ base_model_for_peft = adapter_config.get("base_model_name_or_path")
332
+
333
+ if base_model_for_peft is None:
334
+ raise ValueError("PEFT model requires --base_model_for_peft or adapter_config.json with base_model_name_or_path")
335
+
336
+ print(f" Loading base model: {base_model_for_peft}")
337
+ base_model = AutoModelForCausalLM.from_pretrained(base_model_for_peft, **model_kwargs)
338
+
339
+ print(f" Loading PEFT adapters from: {model_path}")
340
+ model = PeftModel.from_pretrained(base_model, model_path)
341
+ else:
342
+ # Regular model
343
+ model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
344
+
345
+ return model, tokenizer
346
+
347
+
348
+ def generate_fix(model, tokenizer, task_description: str, buggy_code: str) -> str:
349
+ """Generate a fix using the given model."""
350
+ messages = [
351
+ {"role": "system", "content": SYSTEM_PROMPT},
352
+ {"role": "user", "content": f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"},
353
+ ]
354
+
355
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
356
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
357
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
358
+
359
+ with torch.no_grad():
360
+ outputs = model.generate(
361
+ **inputs,
362
+ max_new_tokens=2048,
363
+ temperature=0.2,
364
+ top_p=0.95,
365
+ do_sample=True,
366
+ pad_token_id=tokenizer.pad_token_id,
367
+ )
368
+
369
+ # Decode only the generated tokens
370
+ generated = outputs[0][inputs["input_ids"].shape[1]:]
371
+ response = tokenizer.decode(generated, skip_special_tokens=True)
372
+
373
+ # Strip markdown fences if present
374
+ if "```python" in response:
375
+ response = response.split("```python", 1)[1].split("```", 1)[0].strip()
376
+ elif "```" in response:
377
+ response = response.split("```", 1)[1].split("```", 1)[0].strip()
378
+
379
+ return response.strip()
380
+
381
+
382
+ def execute_code(code: str, timeout: int = 120) -> dict:
383
+ """Execute code in a subprocess and return results."""
384
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
385
+ f.write(code)
386
+ tmp_path = f.name
387
+
388
+ start = time.time()
389
+ try:
390
+ proc = subprocess.run(
391
+ [sys.executable, tmp_path],
392
+ capture_output=True,
393
+ text=True,
394
+ timeout=timeout,
395
+ )
396
+ elapsed = time.time() - start
397
+ return {
398
+ "exit_code": proc.returncode,
399
+ "stdout": proc.stdout[:8192],
400
+ "stderr": proc.stderr[:2048],
401
+ "elapsed": round(elapsed, 2),
402
+ "timed_out": False,
403
+ }
404
+ except subprocess.TimeoutExpired:
405
+ return {
406
+ "exit_code": -1,
407
+ "stdout": "",
408
+ "stderr": f"Timed out after {timeout}s",
409
+ "elapsed": timeout,
410
+ "timed_out": True,
411
+ }
412
+ finally:
413
+ os.unlink(tmp_path)
414
+
415
+
416
+ def extract_metrics(stdout: str) -> dict:
417
+ """Parse metrics from stdout."""
418
+ metrics: dict = {}
419
+
420
+ # Extract metrics block if present
421
+ block_match = re.search(r"##METRICS_START##(.*?)##METRICS_END##", stdout, re.DOTALL)
422
+ text = block_match.group(1) if block_match else stdout
423
+
424
+ # Parse losses
425
+ match = re.search(r"LOSSES:\[([^\]]+)\]", text)
426
+ if match:
427
+ try:
428
+ losses = [float(x.strip()) for x in match.group(1).split(",")]
429
+ metrics["losses"] = losses
430
+ metrics["final_loss"] = losses[-1] if losses else None
431
+ metrics["initial_loss"] = losses[0] if losses else None
432
+ metrics["nan_count"] = sum(1 for l in losses if math.isnan(l) or math.isinf(l))
433
+ metrics["num_steps"] = len(losses)
434
+ except ValueError:
435
+ pass
436
+
437
+ # Parse val_acc
438
+ match = re.search(r"VAL_ACC:([\d.]+)", text)
439
+ if match:
440
+ metrics["val_acc"] = float(match.group(1))
441
+
442
+ return metrics
443
+
444
+
445
+ def score_mnist_fix(metrics: dict) -> float:
446
+ """
447
+ Score an MNIST fix on a 0-1 scale.
448
+
449
+ Criteria:
450
+ - No NaN/Inf (base requirement)
451
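+ - Final loss (30%): full credit below 1.0, partial credit below 1.5 or 2.5
452
+ - Val accuracy (50%): full credit at 0.7 or above, partial credit at 0.5 or 0.3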
453
+ - Learning trajectory (20%)
454
+ """
455
+ if not metrics:
456
+ return 0.0
457
+
458
+ if metrics.get("nan_count", 0) > 0:
459
+ return 0.05
460
+
461
+ score = 0.0
462
+
463
+ # Val accuracy (50% of score)
464
+ val_acc = metrics.get("val_acc")
465
+ if val_acc is not None:
466
+ if val_acc >= 0.7:
467
+ score += 0.50
468
+ elif val_acc >= 0.5:
469
+ score += 0.35
470
+ elif val_acc >= 0.3:
471
+ score += 0.15
472
+
473
+ # Final loss (30% of score)
474
+ final_loss = metrics.get("final_loss")
475
+ if final_loss is not None:
476
+ if final_loss < 1.0:
477
+ score += 0.30
478
+ elif final_loss < 1.5:
479
+ score += 0.20
480
+ elif final_loss < 2.5:
481
+ score += 0.10
482
+
483
+ # Learning trajectory (20% of score)
484
+ losses = metrics.get("losses", [])
485
+ if len(losses) >= 10:
486
+ first_q = sum(losses[:len(losses) // 4]) / max(1, len(losses) // 4)
487
+ last_q = sum(losses[-(len(losses) // 4):]) / max(1, len(losses) // 4)
488
+ if last_q < first_q * 0.7:
489
+ score += 0.20
490
+ elif last_q < first_q:
491
+ score += 0.10
492
+
493
+ return min(1.0, score)
494
+
495
+
496
+ def evaluate_single_model(
497
+ model_path: str,
498
+ label: str,
499
+ use_4bit: bool = False,
500
+ is_peft: bool = False,
501
+ base_model_for_peft: Optional[str] = None,
502
+ use_real_mnist: bool = False,
503
+ ) -> dict:
504
+ """Load a model, generate a fix, execute it, and return results."""
505
+ print(f"\n{'=' * 60}")
506
+ print(f"Evaluating: {label}")
507
+ print(f" Model: {model_path}")
508
+ print(f" Dataset: {'Real MNIST' if use_real_mnist else 'Synthetic'}")
509
+ print(f"{'=' * 60}")
510
+
511
+ # Select appropriate buggy code and task description
512
+ if use_real_mnist:
513
+ buggy_code = MNIST_BUGGY_CODE_REAL
514
+ task_desc = MNIST_TASK_DESCRIPTION_REAL
515
+ else:
516
+ buggy_code = MNIST_BUGGY_CODE_SYNTHETIC
517
+ task_desc = MNIST_TASK_DESCRIPTION_SYNTHETIC
518
+
519
+ # Load model
520
+ model, tokenizer = load_model(
521
+ model_path,
522
+ use_4bit=use_4bit,
523
+ is_peft=is_peft,
524
+ base_model_for_peft=base_model_for_peft,
525
+ )
526
+
527
+ # Generate fix
528
+ print(" Generating fix...")
529
+ start = time.time()
530
+ fixed_code = generate_fix(model, tokenizer, task_desc, buggy_code)
531
+ gen_time = time.time() - start
532
+ print(f" Generation took {gen_time:.1f}s ({len(fixed_code)} chars)")
533
+
534
+ # Execute (longer timeout for real MNIST due to dataset download)
535
+ timeout = 300 if use_real_mnist else 120
536
+ print(f" Executing fixed code (timeout={timeout}s)...")
537
+ result = execute_code(fixed_code, timeout=timeout)
538
+ metrics = extract_metrics(result["stdout"])
539
+ score = score_mnist_fix(metrics) if result["exit_code"] == 0 else 0.0
540
+
541
+ # Report
542
+ print(f"\n Results for {label}:")
543
+ print(f" Exit code: {result['exit_code']}")
544
+ print(f" Timed out: {result['timed_out']}")
545
+ print(f" Val accuracy: {metrics.get('val_acc', 'N/A')}")
546
+ print(f" Final loss: {metrics.get('final_loss', 'N/A')}")
547
+ print(f" NaN count: {metrics.get('nan_count', 'N/A')}")
548
+ print(f" Score: {score:.4f}")
549
+
550
+ if result["stderr"] and result["exit_code"] != 0:
551
+ print(f" Stderr: {result['stderr'][:500]}")
552
+
553
+ # Free GPU memory
554
+ del model
555
+ if torch.cuda.is_available():
556
+ torch.cuda.empty_cache()
557
+
558
+ return {
559
+ "model": label,
560
+ "model_path": model_path,
561
+ "fixed_code": fixed_code,
562
+ "execution": result,
563
+ "metrics": metrics,
564
+ "score": score,
565
+ "generation_time": gen_time,
566
+ }
567
+
568
+
569
+ def print_comparison_table(base_results: list, trained_results: list):
570
+ """Print a nicely formatted comparison table."""
571
+ # Aggregate scores
572
+ base_scores = [r["score"] for r in base_results]
573
+ trained_scores = [r["score"] for r in trained_results]
574
+ base_accs = [r["metrics"].get("val_acc", 0) or 0 for r in base_results]
575
+ trained_accs = [r["metrics"].get("val_acc", 0) or 0 for r in trained_results]
576
+
577
+ avg_base_score = sum(base_scores) / len(base_scores)
578
+ avg_trained_score = sum(trained_scores) / len(trained_scores)
579
+ avg_base_acc = sum(base_accs) / len(base_accs)
580
+ avg_trained_acc = sum(trained_accs) / len(trained_accs)
581
+
582
+ # Table
583
+ print(f"\n{'=' * 70}")
584
+ print(f"{'COMPARISON: Base vs GRPO-Trained Model':^70}")
585
+ print(f"{'=' * 70}")
586
+
587
+ headers = ["Metric", "Base Model", "Trained Model", "Δ (Improvement)"]
588
+ rows = [
589
+ ["Average Score", f"{avg_base_score:.4f}", f"{avg_trained_score:.4f}",
590
+ f"{avg_trained_score - avg_base_score:+.4f}"],
591
+ ["Average Val Acc", f"{avg_base_acc:.4f}", f"{avg_trained_acc:.4f}",
592
+ f"{avg_trained_acc - avg_base_acc:+.4f}"],
593
+ ["Best Score", f"{max(base_scores):.4f}", f"{max(trained_scores):.4f}",
594
+ f"{max(trained_scores) - max(base_scores):+.4f}"],
595
+ ["Best Val Acc", f"{max(base_accs):.4f}", f"{max(trained_accs):.4f}",
596
+ f"{max(trained_accs) - max(base_accs):+.4f}"],
597
+ ["Success Rate (>0.5)", f"{sum(1 for s in base_scores if s > 0.5)}/{len(base_scores)}",
598
+ f"{sum(1 for s in trained_scores if s > 0.5)}/{len(trained_scores)}", ""],
599
+ ]
600
+
601
+ # Calculate column widths
602
+ col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(4)]
603
+
604
+ # Print table
605
+ header_line = "│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │"
606
+ sep_line = "├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤"
607
+ top_line = "┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐"
608
+ bottom_line = "└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘"
609
+
610
+ print(top_line)
611
+ print(header_line)
612
+ print(sep_line)
613
+ for row in rows:
614
+ print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
615
+ print(bottom_line)
616
+
617
+ # Winner announcement
618
+ print()
619
+ if avg_trained_score > avg_base_score:
620
+ delta = avg_trained_score - avg_base_score
621
+ pct = (delta / max(avg_base_score, 0.001)) * 100
622
+ print(f"🏆 GRPO-trained model wins by +{delta:.4f} score ({pct:.1f}% improvement)!")
623
+ elif avg_base_score > avg_trained_score:
624
+ print(f"⚠️ Base model performed better (may need more training)")
625
+ else:
626
+ print(f"🤝 Models tied on average score")
627
+
628
+ return {
629
+ "base_avg_score": avg_base_score,
630
+ "trained_avg_score": avg_trained_score,
631
+ "base_avg_acc": avg_base_acc,
632
+ "trained_avg_acc": avg_trained_acc,
633
+ "improvement_score": avg_trained_score - avg_base_score,
634
+ "improvement_acc": avg_trained_acc - avg_base_acc,
635
+ }
636
+
637
+
638
+ # ══════════════════════════════════════════════════════════════════════════════
639
+ # Main
640
+ # ══════════════════════════════════════════════════════════════════════════════
641
+
642
+ def main():
643
+ parser = argparse.ArgumentParser(
644
+ description="Evaluate and compare multiple models on MNIST debugging",
645
+ formatter_class=argparse.RawDescriptionHelpFormatter,
646
+ epilog="""
647
+ Examples:
648
+ # Compare base vs trained model
649
+ python evaluate_mnist.py --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct --trained_model ./trained
650
+
651
+ # Use real MNIST dataset
652
+ python evaluate_mnist.py --use_real_mnist --base_model Qwen/Qwen2.5-Coder-1.5B-Instruct
653
+
654
+ # Compare multiple models
655
+ python evaluate_mnist.py --models Qwen/Qwen2.5-Coder-1.5B-Instruct ./trained-v1 ./trained-v2
656
+
657
+ # Memory-efficient evaluation
658
+ python evaluate_mnist.py --use_4bit --base_model Qwen/Qwen2.5-Coder-7B-Instruct
659
+ """
660
+ )
661
+
662
+ # Model selection (flexible)
663
+ parser.add_argument("--base_model", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
664
+ help="Path or HF name of base model")
665
+ parser.add_argument("--trained_model", type=str, default=None,
666
+ help="Path to GRPO-trained model (optional if using --models)")
667
+ parser.add_argument("--models", type=str, nargs="+", default=None,
668
+ help="List of models to compare (overrides --base_model and --trained_model)")
669
+
670
+ # Dataset options
671
+ parser.add_argument("--use_real_mnist", action="store_true",
672
+ help="Use real MNIST dataset (downloads ~50MB, slower but more realistic)")
673
+
674
+ # Output
675
+ parser.add_argument("--output_file", type=str, default="mnist_eval_results.json",
676
+ help="Output file for detailed results")
677
+ parser.add_argument("--num_runs", type=int, default=3,
678
+ help="Number of evaluation runs per model")
679
+
680
+ # Memory options
681
+ parser.add_argument("--use_4bit", action="store_true",
682
+ help="Use 4-bit quantization for memory efficiency")
683
+ parser.add_argument("--trained_is_peft", action="store_true",
684
+ help="Trained model is a PEFT/LoRA adapter")
685
+
686
+ args = parser.parse_args()
687
+
688
+ device = "cuda" if torch.cuda.is_available() else "cpu"
689
+ dataset_type = "Real MNIST" if args.use_real_mnist else "Synthetic MNIST-like"
690
+
691
+ print(f"\n{'#' * 70}")
692
+ print(f"{'MNIST DEBUGGING EVALUATION':^70}")
693
+ print(f"{'#' * 70}")
694
+ print(f"\nDevice: {device}")
695
+ print(f"Dataset: {dataset_type}")
696
+ print(f"Runs per model: {args.num_runs}")
697
+ print(f"\nMNIST Debugging Task (out-of-distribution):")
698
+ print(f" Bugs: softmax before CE, NLLLoss without log, LR=5.0")
699
+
700
+ # Determine which models to evaluate
701
+ if args.models:
702
+ # Multi-model comparison mode
703
+ model_list = args.models
704
+ print(f"\nModels to compare ({len(model_list)}):")
705
+ for i, m in enumerate(model_list, 1):
706
+ print(f" {i}. {m}")
707
+ else:
708
+ # Legacy two-model comparison
709
+ model_list = [args.base_model]
710
+ if args.trained_model:
711
+ model_list.append(args.trained_model)
712
+ print(f"\nBase model: {args.base_model}")
713
+ if args.trained_model:
714
+ print(f"Trained model: {args.trained_model}")
715
+
716
+ # Run evaluations for each model
717
+ all_results = {model: [] for model in model_list}
718
+
719
+ for run in range(1, args.num_runs + 1):
720
+ print(f"\n{'─' * 70}")
721
+ print(f"Run {run}/{args.num_runs}")
722
+ print(f"{'─' * 70}")
723
+
724
+ for model_path in model_list:
725
+ model_name = Path(model_path).name
726
+
727
+ # Determine if this is a PEFT model
728
+ is_peft = args.trained_is_peft and model_path != args.base_model
729
+ base_for_peft = args.base_model if is_peft else None
730
+
731
+ result = evaluate_single_model(
732
+ model_path,
733
+ f"{model_name} (run {run})",
734
+ use_4bit=args.use_4bit,
735
+ is_peft=is_peft,
736
+ base_model_for_peft=base_for_peft,
737
+ use_real_mnist=args.use_real_mnist,
738
+ )
739
+ all_results[model_path].append(result)
740
+
741
+ # Print comparison table for all models
742
+ print(f"\n{'=' * 80}")
743
+ print(f"{'RESULTS SUMMARY':^80}")
744
+ print(f"{'=' * 80}")
745
+
746
+ # Calculate aggregates for each model
747
+ model_stats = {}
748
+ for model_path, results in all_results.items():
749
+ scores = [r["score"] for r in results]
750
+ accs = [r["metrics"].get("val_acc", 0) or 0 for r in results]
751
+ model_stats[model_path] = {
752
+ "avg_score": sum(scores) / len(scores),
753
+ "avg_acc": sum(accs) / len(accs),
754
+ "best_score": max(scores),
755
+ "best_acc": max(accs),
756
+ "success_rate": sum(1 for s in scores if s > 0.5) / len(scores),
757
+ }
758
+
759
+ # Print table
760
+ headers = ["Model", "Avg Score", "Avg Acc", "Best Score", "Success Rate"]
761
+ rows = []
762
+ for model_path, stats in model_stats.items():
763
+ model_name = Path(model_path).name
764
+ rows.append([
765
+ model_name[:25], # Truncate long names
766
+ f"{stats['avg_score']:.4f}",
767
+ f"{stats['avg_acc']:.4f}",
768
+ f"{stats['best_score']:.4f}",
769
+ f"{stats['success_rate']*100:.0f}%",
770
+ ])
771
+
772
+ col_widths = [max(len(str(r[i])) for r in [headers] + rows) + 2 for i in range(len(headers))]
773
+
774
+ print("┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐")
775
+ print("│ " + " │ ".join(h.center(w) for h, w in zip(headers, col_widths)) + " │")
776
+ print("├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤")
777
+ for row in rows:
778
+ print("│ " + " │ ".join(str(v).center(w) for v, w in zip(row, col_widths)) + " │")
779
+ print("└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘")
780
+
781
+ # Find winner
782
+ best_model = max(model_stats.items(), key=lambda x: x[1]["avg_score"])
783
+ print(f"\n🏆 Best model: {best_model[0].split('/')[-1]} (avg score: {best_model[1]['avg_score']:.4f})")
784
+
785
+ # Legacy comparison if exactly 2 models
786
+ summary = None
787
+ if len(model_list) == 2:
788
+ base_results = all_results[model_list[0]]
789
+ trained_results = all_results[model_list[1]]
790
+ summary = print_comparison_table(base_results, trained_results)
791
+
792
+ # Save detailed results
793
+ output = {
794
+ "task": f"MNIST debugging ({dataset_type})",
795
+ "models": model_list,
796
+ "num_runs": args.num_runs,
797
+ "device": device,
798
+ "use_real_mnist": args.use_real_mnist,
799
+ "model_stats": model_stats,
800
+ "summary": summary,
801
+ "runs": {
802
+ model_path: [
803
+ {k: v for k, v in r.items() if k != "fixed_code"}
804
+ for r in results
805
+ ]
806
+ for model_path, results in all_results.items()
807
+ },
808
+ }
809
+
810
+ with open(args.output_file, "w") as f:
811
+ json.dump(output, f, indent=2, default=str)
812
+ print(f"\n📄 Full results saved to {args.output_file}")
813
+
814
+
815
+ if __name__ == "__main__":
816
+ main()
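Review note on the learning-trajectory term of `score_mnist_fix`: it compares the mean loss of the first and last quarter of the run. The following is a minimal sketch of that comparison, not the commit's code. The helper name `trajectory_bonus` is hypothetical. It assumes the same thresholds as the diff above (at least 10 losses, 20% for a drop below 70% of the first-quarter mean, 10% for any drop) and computes the quarter size once so that both slices and both divisors agree.

```python
def trajectory_bonus(losses: list[float]) -> float:
    """Hypothetical helper: trajectory share of the score (0.0, 0.10 or 0.20)."""
    if len(losses) < 10:
        return 0.0  # too few steps to judge a trend

    q = len(losses) // 4  # quarter size; at least 2 because len(losses) >= 10
    first_q = sum(losses[:q]) / q   # mean of the first quarter
    last_q = sum(losses[-q:]) / q   # mean of the last quarter (same length)

    if last_q < first_q * 0.7:
        return 0.20  # clear improvement
    if last_q < first_q:
        return 0.10  # some improvement
    return 0.0


if __name__ == "__main__":
    print(trajectory_bonus([2.0] * 7 + [1.0] * 3))
```

Computing `q` once sidesteps a precedence pitfall: Python evaluates `-n // 4` as `(-n) // 4`, which floors toward negative infinity, so for `n = 10` it yields `-3` rather than `-2`.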
gradio_app.py CHANGED
@@ -48,6 +48,12 @@ TASK_INFO = {
48
  "description": "Backbone frozen but its parameters are passed to the optimizer.",
49
  "hints": "Unfreeze backend or only pass head parameters to Adam.",
50
  },
 
 
 
 
 
 
51
  }
52
 
53
  # ── Theme ──────────────────────────────────────────────────────────────────
@@ -418,7 +424,7 @@ def do_run_baseline(base_url: str, task_id: str):
418
 
419
  results_md = "### 🤖 Baseline Agent Results\n\n"
420
  results_md += "| Task | Score |\n|---|---|\n"
421
- for tid in ["task1", "task2", "task3", "task4", "task5"]:
422
  s = scores.get(tid, 0.0)
423
  emoji = "🎯" if s >= 0.9 else ("✅" if s >= 0.7 else ("📈" if s >= 0.4 else "⚠️"))
424
  results_md += f"| {tid} | {emoji} {s:.4f} |\n"
@@ -492,8 +498,21 @@ def build_ui() -> gr.Blocks:
492
  # ── Left column: Task selector ──
493
  with gr.Column(scale=1, min_width=280):
494
  gr.Markdown("### 📋 Task Selector")
 
 
 
 
 
 
 
 
 
 
 
 
 
495
  task_id = gr.Radio(
496
- choices=["task1", "task2", "task3", "task4", "task5"],
497
  value="task1",
498
  label="Select Task",
499
  info="Choose a debugging challenge",
@@ -642,18 +661,23 @@ Fix optimizer order + learning rate bugs in a linear classifier.
642
  "task3": "🔴 OOM + Data Leakage",
643
  "task4": "🟡 Wrong Loss Function",
644
  "task5": "🟡 Frozen Backbone",
 
645
  }
646
 
647
- def run_baseline_live(base_url_val):
648
  """Generator that yields live progress as each task completes."""
649
  base = (base_url_val or DEFAULT_BASE_URL).strip().rstrip("/")
 
650
  results = {}
651
- lines_header = ["### 🤖 Baseline Agent — Live Progress\n"]
 
 
 
652
 
653
  # Phase 1: Show "starting" state
654
  yield "\n".join(lines_header + ["⏳ Starting baseline agent..."])
655
 
656
- for tid in ["task1", "task2", "task3", "task4", "task5"]:
657
  tname = TASK_NAMES.get(tid, tid)
658
 
659
  # Show "running this task" update
@@ -673,7 +697,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
673
  # Actually call the per-task endpoint
674
  try:
675
  with httpx.Client(timeout=180.0) as client:
676
- resp = client.get(f"{base}/baseline/task/{tid}")
677
  resp.raise_for_status()
678
  data = resp.json()
679
  except Exception as exc:
@@ -690,7 +714,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
690
  final_lines = ["### 🤖 Baseline Agent Results\n", "| Task | Score |", "|---|---|"]
691
  total = 0.0
692
  has_errors = False
693
- for tid in ["task1", "task2", "task3", "task4", "task5"]:
694
  info = results.get(tid, {"score": 0.0})
695
  s = info["score"]
696
  total += s
@@ -700,7 +724,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
700
  has_errors = True
701
  final_lines.append(f"\n> ⚠️ `{info['error'][:200]}`\n")
702
 
703
- avg = total / 5
704
  final_lines.append(f"\n**Average: {avg:.4f}**")
705
  if avg >= 0.7:
706
  final_lines.append("\n🎉 **Agent performed well!** The environment is solvable.")
@@ -713,7 +737,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
713
  final_lines.append("\n---\n> [!WARNING]\n> Some tasks failed. Check if `HF_TOKEN` is valid and the model is accessible.")
714
 
715
  final_lines.append("\n---\n### 🔍 Auto-Agent Generated Code & Execution Logs")
716
- for tid in ["task1", "task2", "task3", "task4", "task5"]:
717
  info = results.get(tid, {})
718
  fixed_code = str(info.get("fixed_code", ""))
719
  output = str(info.get("output", ""))
@@ -731,7 +755,7 @@ Fix optimizer order + learning rate bugs in a linear classifier.
731
  outputs=[baseline_output],
732
  ).then(
733
  fn=run_baseline_live,
734
- inputs=[base_url],
735
  outputs=[baseline_output],
736
  )
737
 
 
48
  "description": "Backbone frozen but its parameters are passed to the optimizer.",
49
  "hints": "Unfreeze backend or only pass head parameters to Adam.",
50
  },
51
+ "task6": {
52
+ "name": "Input-Output Mismatch",
53
+ "difficulty": "🔴 Hard",
54
+ "description": "CNN has 4 bugs: shape mismatch, channel order (HWC/CHW), label encoding, batch dimension.",
55
+ "hints": "Fix image size (32→28), permute HWC→CHW, use class indices not one-hot, add unsqueeze(0).",
56
+ },
57
  }
58
 
59
  # ── Theme ──────────────────────────────────────────────────────────────────
 
424
 
425
  results_md = "### 🤖 Baseline Agent Results\n\n"
426
  results_md += "| Task | Score |\n|---|---|\n"
427
+ for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
428
  s = scores.get(tid, 0.0)
429
  emoji = "🎯" if s >= 0.9 else ("✅" if s >= 0.7 else ("📈" if s >= 0.4 else "⚠️"))
430
  results_md += f"| {tid} | {emoji} {s:.4f} |\n"
 
498
  # ── Left column: Task selector ──
499
  with gr.Column(scale=1, min_width=280):
500
  gr.Markdown("### 📋 Task Selector")
501
+ baseline_model = gr.Dropdown(
502
+ choices=[
503
+ "Qwen/Qwen2.5-Coder-1.5B-Instruct",
504
+ "Qwen/Qwen2.5-Coder-3B-Instruct",
505
+ "Qwen/Qwen2.5-Coder-7B-Instruct",
506
+ "Qwen/Qwen2.5-Coder-14B-Instruct",
507
+ "Qwen/Qwen2.5-Coder-32B-Instruct",
508
+ "mistralai/Mistral-7B-Instruct-v0.3",
509
+ ],
510
+ value="Qwen/Qwen2.5-Coder-32B-Instruct",
511
+ label="Auto-Agent Model",
512
+ info="Choose which LLM to run for baseline auto-agent",
513
+ )
514
  task_id = gr.Radio(
515
+ choices=["task1", "task2", "task3", "task4", "task5", "task6"],
516
  value="task1",
517
  label="Select Task",
518
  info="Choose a debugging challenge",
 
661
  "task3": "🔴 OOM + Data Leakage",
662
  "task4": "🟡 Wrong Loss Function",
663
  "task5": "🟡 Frozen Backbone",
664
+ "task6": "🔴 Input-Output Mismatch",
665
  }
666
 
667
+ def run_baseline_live(base_url_val, model_id_val):
668
  """Generator that yields live progress as each task completes."""
669
  base = (base_url_val or DEFAULT_BASE_URL).strip().rstrip("/")
670
+ model_id = (model_id_val or "Qwen/Qwen2.5-Coder-32B-Instruct").strip()
671
  results = {}
672
+ lines_header = [
673
+ "### 🤖 Baseline Agent — Live Progress\n",
674
+ f"**Model:** `{model_id}`\n",
675
+ ]
676
 
677
  # Phase 1: Show "starting" state
678
  yield "\n".join(lines_header + ["⏳ Starting baseline agent..."])
679
 
680
+ for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
681
  tname = TASK_NAMES.get(tid, tid)
682
 
683
  # Show "running this task" update
 
697
  # Actually call the per-task endpoint
698
  try:
699
  with httpx.Client(timeout=180.0) as client:
700
+ resp = client.get(f"{base}/baseline/task/{tid}", params={"model_id": model_id})
701
  resp.raise_for_status()
702
  data = resp.json()
703
  except Exception as exc:
 
714
  final_lines = ["### 🤖 Baseline Agent Results\n", "| Task | Score |", "|---|---|"]
715
  total = 0.0
716
  has_errors = False
717
+ for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
718
  info = results.get(tid, {"score": 0.0})
719
  s = info["score"]
720
  total += s
 
724
  has_errors = True
725
  final_lines.append(f"\n> ⚠️ `{info['error'][:200]}`\n")
726
 
727
+ avg = total / 6
728
  final_lines.append(f"\n**Average: {avg:.4f}**")
729
  if avg >= 0.7:
730
  final_lines.append("\n🎉 **Agent performed well!** The environment is solvable.")
 
737
  final_lines.append("\n---\n> [!WARNING]\n> Some tasks failed. Check if `HF_TOKEN` is valid and the model is accessible.")
738
 
739
  final_lines.append("\n---\n### 🔍 Auto-Agent Generated Code & Execution Logs")
740
+ for tid in ["task1", "task2", "task3", "task4", "task5", "task6"]:
741
  info = results.get(tid, {})
742
  fixed_code = str(info.get("fixed_code", ""))
743
  output = str(info.get("output", ""))
 
755
  outputs=[baseline_output],
756
  ).then(
757
  fn=run_baseline_live,
758
+ inputs=[base_url, baseline_model],
759
  outputs=[baseline_output],
760
  )
761
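Review note on `gradio_app.py`: the task-ID list is repeated in several loops and the average divides by a hard-coded `6`, so adding a task means touching each of them. The following sketch shows one way to derive both from a single place. `TASK_IDS` and `summarize_scores` are hypothetical names, and the `TASK_INFO` dict here is a stand-in that carries only names, not the real one from the app.

```python
# Hypothetical stand-in for the TASK_INFO dict in gradio_app.py (names only).
TASK_INFO = {f"task{i}": {"name": f"Task {i}"} for i in range(1, 7)}

# Dicts preserve insertion order, so this yields task1..task6 in order.
TASK_IDS = list(TASK_INFO)


def summarize_scores(results: dict[str, dict]) -> tuple[list[str], float]:
    """Build markdown table rows and the average score over all known tasks.

    Tasks missing from `results` count as 0.0, mirroring the diff above.
    """
    rows = ["| Task | Score |", "|---|---|"]
    total = 0.0
    for tid in TASK_IDS:
        score = float(results.get(tid, {}).get("score", 0.0))
        total += score
        rows.append(f"| {tid} | {score:.4f} |")
    return rows, total / len(TASK_IDS)


if __name__ == "__main__":
    rows, avg = summarize_scores({"task1": {"score": 0.6}, "task2": {"score": 0.6}})
    print("\n".join(rows))
    print(f"Average: {avg:.4f}")
```

With this shape, registering another task in `TASK_INFO` would update every loop and the divisor at once.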
 
improved_agent.py ADDED
@@ -0,0 +1,717 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Improved GRPO training script for WhipStudio ML Debug Environment.
4
+
5
+ This script trains Qwen2.5-Coder-1.5B (or similar) to debug broken PyTorch scripts
6
+ using Group Relative Policy Optimization (GRPO) with the WhipStudio environment
7
+ as the reward oracle.
8
+
9
+ Improvements over basic train_grpo.py:
10
+ 1. Memory-efficient training with 4-bit quantization
11
+ 2. LoRA fine-tuning for reduced VRAM usage
12
+ 3. Curriculum learning (easier tasks first)
13
+ 4. Gradient checkpointing for large contexts
14
+ 5. Checkpoint saving with best model tracking
15
+ 6. Early stopping based on validation scores
16
+ 7. Wandb/TensorBoard logging support
17
+
18
+ Requirements:
19
+ pip install "trl>=0.15.0" "transformers>=4.46.0" datasets torch httpx
20
+ pip install accelerate peft bitsandbytes wandb
21
+
22
+ Usage:
23
+ # Basic training
24
+ python improved_agent.py \
25
+ --env_url https://your-space.hf.space \
26
+ --output_dir ./whipstudio-debugger
27
+
28
+ # Memory-efficient training (8GB VRAM)
29
+ python improved_agent.py \
30
+ --env_url https://your-space.hf.space \
31
+ --use_4bit \
32
+ --use_lora \
33
+ --gradient_checkpointing \
34
+ --output_dir ./whipstudio-debugger-lora
35
+
36
+ # Full training with wandb logging
37
+ python improved_agent.py \
38
+ --env_url https://your-space.hf.space \
39
+ --use_wandb \
40
+ --wandb_project whipstudio \
41
+ --num_iterations 100 \
42
+ --output_dir ./whipstudio-debugger
43
+ """
44
+
45
+ import argparse
46
+ import json
47
+ import math
48
+ import os
49
+ import random
50
+ import re
51
+ import time
52
+ from dataclasses import dataclass
53
+ from pathlib import Path
54
+ from typing import Any, Optional
55
+
56
+ import httpx
57
+ import torch
58
+ from datasets import Dataset
59
+ from transformers import (
60
+ AutoModelForCausalLM,
61
+ AutoTokenizer,
62
+ BitsAndBytesConfig,
63
+ )
64
+
65
+ # TRL imports
66
+ try:
67
+ from trl import GRPOConfig, GRPOTrainer
68
+ except ImportError:
69
+ raise ImportError('Please install trl>=0.15.0: pip install "trl>=0.15.0"')
70
+
71
+ # PEFT imports (optional)
72
+ try:
73
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
74
+ PEFT_AVAILABLE = True
75
+ except ImportError:
76
+ PEFT_AVAILABLE = False
77
+
78
+ # Wandb import (optional)
79
+ try:
80
+ import wandb
81
+ WANDB_AVAILABLE = True
82
+ except ImportError:
83
+ WANDB_AVAILABLE = False
84
+
85
+
86
+ # ══════════════════════════════════════════════════════════════════════════════
87
+ # Constants
88
+ # ══════════════════════════════════════════════════════════════════════════════
89
+
90
+ SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
91
+ You receive a broken training script and must fix ALL bugs.
92
+ Return ONLY the complete corrected Python code. No markdown, no backticks, no explanation.
93
+ The script must print metrics in the format specified by the task description.
94
+ Keep all torch.manual_seed() calls intact.
95
+ Wrap metrics in ##METRICS_START## and ##METRICS_END## markers."""
96
+
97
+ # Task ordering by difficulty for curriculum learning
98
+ TASK_DIFFICULTY = {
99
+ "task1": 1, # Easy: broken loop
100
+ "task4": 2, # Medium: wrong loss
101
+ "task5": 2, # Medium: frozen backbone
102
+ "task2": 3, # Medium: NaN loss (tricky)
103
+ "task3": 4, # Hard: OOM + leakage
104
+ }
105
+
106
+ ALL_TASKS = list(TASK_DIFFICULTY.keys())
107
+
108
+
109
+ # ══════════════════════════════════════════════════════════════════════════════
110
+ # Environment Client
111
+ # ══════════════════════════════════════════════════════════════════════════════
112
+
113
+ class WhipStudioEnv:
114
+ """Client for the WhipStudio RL environment."""
115
+
116
+ def __init__(self, env_url: str, timeout: float = 180.0):
117
+ self.env_url = env_url.rstrip("/")
118
+ self.timeout = httpx.Timeout(timeout, connect=15.0)
119
+ self._task_cache: dict[str, dict] = {}
120
+
121
+ def reset(self, task_id: str) -> dict:
122
+ """Reset environment and return observation."""
123
+ with httpx.Client(timeout=self.timeout) as client:
124
+ resp = client.post(f"{self.env_url}/reset", json={"task_id": task_id})
125
+ resp.raise_for_status()
126
+ data = resp.json()
127
+ obs = data.get("observation", data)
128
+ self._task_cache[task_id] = obs
129
+ return obs
130
+
131
+ def step(self, fixed_code: str, attempt: int = 1) -> dict:
132
+ """Submit a fix and return the full step result."""
133
+ payload = {
134
+ "action": {
135
+ "fixed_code": fixed_code,
136
+ "attempt_number": attempt,
137
+ }
138
+ }
139
+ with httpx.Client(timeout=self.timeout) as client:
140
+ resp = client.post(f"{self.env_url}/step", json=payload)
141
+ resp.raise_for_status()
142
+ return resp.json()
143
+
144
+ def get_task_obs(self, task_id: str) -> dict:
145
+ """Get cached observation or reset to obtain it."""
146
+ if task_id not in self._task_cache:
147
+ self.reset(task_id)
148
+ return self._task_cache[task_id]
149
+
150
+ def health_check(self) -> bool:
151
+ """Verify the environment is reachable."""
152
+ try:
153
+ with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
154
+ resp = client.get(f"{self.env_url}/health")
155
+ return resp.status_code == 200
156
+ except Exception:
157
+ return False
158
+
159
+
160
+ # ══════════════════════════════════════════════════════════════════════════════
161
+ # Prompt Utilities
162
+ # ══════════════════════════════════════════════════════════════════════════════
163
+
164
+ def build_user_prompt(task_description: str, buggy_code: str) -> str:
165
+ """Build the user prompt for the model."""
166
+ return f"Task: {task_description}\n\nBuggy code:\n{buggy_code}"
167
+
168
+
169
+ def format_chat(tokenizer: Any, user_prompt: str) -> str:
170
+ """Format as a chat message and return the full text."""
171
+ messages = [
172
+ {"role": "system", "content": SYSTEM_PROMPT},
173
+ {"role": "user", "content": user_prompt},
174
+ ]
175
+ return tokenizer.apply_chat_template(
176
+ messages, tokenize=False, add_generation_prompt=True
177
+ )
178
+
179
+
180
+ def extract_code_from_response(response: str) -> str:
181
+ """Extract Python code from model response, stripping markdown if present."""
182
+ text = response.strip()
183
+ if "```python" in text:
184
+ text = text.split("```python", 1)[1].split("```", 1)[0].strip()
185
+ elif "```" in text:
186
+ text = text.split("```", 1)[1].split("```", 1)[0].strip()
187
+ return text
188
+
189
+
190
+ # ══════════════════════════════════════════════════════════════════════════════
191
+ # Reward Function
192
+ # ══════════════════════════════════════════════════════════════════════════════
193
+
194
+ def create_reward_function(env: WhipStudioEnv, verbose: bool = True):
195
+ """
196
+ Create a reward function compatible with TRL's GRPOTrainer.
197
+
198
+ Includes reward shaping:
199
+ - Bonus for valid Python syntax
200
+ - Bonus for including required output markers
201
+ - Environment reward from grader
202
+ """
203
+
204
+ def reward_fn(completions: list[list[dict]], **kwargs) -> list[float]:
205
+ """Compute rewards for a batch of completions."""
206
+ rewards = []
207
+ task_ids = kwargs.get("task_id", ["task1"] * len(completions))
208
+
209
+ for i, completion in enumerate(completions):
210
+ task_id = task_ids[i] if i < len(task_ids) else "task1"
211
+
212
+ try:
213
+ # Extract assistant's response
214
+ if isinstance(completion, list):
215
+ text = ""
216
+ for msg in completion:
217
+ if isinstance(msg, dict) and msg.get("role") == "assistant":
218
+ text = msg.get("content", "")
219
+ break
220
+ if not text and completion:
221
+ text = str(completion[-1].get("content", ""))
222
+ elif isinstance(completion, str):
223
+ text = completion
224
+ else:
225
+ text = str(completion)
226
+
227
+ fixed_code = extract_code_from_response(text)
228
+
229
+ # Reward shaping: syntax check
230
+ syntax_bonus = 0.0
231
+ try:
232
+ compile(fixed_code, "<string>", "exec")
233
+ syntax_bonus = 0.05
234
+ except SyntaxError:
235
+ pass
236
+
237
+ # Reward shaping: output markers present
238
+ marker_bonus = 0.0
239
+ if "LOSSES:" in fixed_code or "##METRICS" in fixed_code:
240
+ marker_bonus = 0.02
241
+
242
+ if not fixed_code.strip():
243
+ rewards.append(0.0)
244
+ continue
245
+
246
+ # Get environment reward
247
+ env.reset(task_id)
248
+ result = env.step(fixed_code, attempt=1)
249
+ env_reward = float(result.get("reward", 0.0) or 0.0)
250
+
251
+ # Total reward (capped at 1.0)
252
+ total_reward = min(1.0, env_reward + syntax_bonus + marker_bonus)
253
+ rewards.append(total_reward)
254
+
255
+ if verbose:
256
+ print(f" [reward] task={task_id} env={env_reward:.3f} syntax={syntax_bonus:.2f} marker={marker_bonus:.2f} total={total_reward:.3f}")
257
+
258
+ except Exception as e:
259
+ if verbose:
260
+ print(f" [reward] ERROR task={task_id}: {e}")
261
+ rewards.append(0.0)
262
+
263
+ return rewards
264
+
265
+ return reward_fn
266
+
267
+
268
+ # ══════════════════════════════════════════════════════════════════════════════
269
+ # Dataset Generation with Curriculum
270
+ # ══════════════════════════════════════════════════════════════════════════════
271
+
272
+ def generate_curriculum_dataset(
273
+ env: WhipStudioEnv,
274
+ tokenizer: Any,
275
+ samples_per_task: int = 10,
276
+ curriculum_stage: int = 0, # 0 = all tasks, 1 = easier tasks weighted, etc.
277
+ ) -> Dataset:
278
+ """
279
+ Generate a dataset with curriculum-based sampling.
280
+
281
+ Args:
282
+ env: WhipStudio environment client
283
+ tokenizer: Model tokenizer
284
+ samples_per_task: Base samples per task
285
+ curriculum_stage: 0=uniform, higher=bias toward easier tasks
286
+ """
287
+ records = []
288
+
289
+ # Compute task weights based on curriculum stage
290
+ task_weights = {}
291
+ for task_id, difficulty in TASK_DIFFICULTY.items():
292
+ if curriculum_stage == 0:
293
+ weight = 1.0
294
+ else:
295
+ # Higher curriculum_stage = more weight on easier tasks
296
+ weight = max(0.2, 1.0 - (difficulty - 1) * 0.2 * curriculum_stage)
297
+ task_weights[task_id] = weight
298
+
299
+ # Normalize weights
300
+ total_weight = sum(task_weights.values())
301
+ task_weights = {k: v / total_weight for k, v in task_weights.items()}
302
+
303
+ for task_id in ALL_TASKS:
304
+ print(f" Fetching observation for {task_id} (weight={task_weights[task_id]:.2f})...")
305
+ obs = env.reset(task_id)
306
+ user_prompt = build_user_prompt(
307
+ task_description=obs.get("task_description", ""),
308
+ buggy_code=obs.get("buggy_code", ""),
309
+ )
310
+ formatted = format_chat(tokenizer, user_prompt)
311
+
312
+ # Number of samples proportional to weight
313
+ n_samples = max(1, int(samples_per_task * task_weights[task_id] * len(ALL_TASKS)))
314
+
315
+ for _ in range(n_samples):
316
+ records.append({
317
+ "prompt": formatted,
318
+ "task_id": task_id,
319
+ })
320
+
321
+ random.shuffle(records)
322
+ return Dataset.from_list(records)
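Review note (not part of the commit): the weighting logic in `generate_curriculum_dataset` can be exercised on its own. The sketch below mirrors the visible formula; `curriculum_weights`, `samples_for_task`, and `EXAMPLE_DIFFICULTY` are hypothetical names, and the real `TASK_DIFFICULTY` table is defined outside this excerpt.

```python
def curriculum_weights(difficulty: dict[str, int], stage: int) -> dict[str, float]:
    """Return normalized sampling weights per task for a curriculum stage."""
    raw = {}
    for task_id, level in difficulty.items():
        if stage == 0:
            raw[task_id] = 1.0  # stage 0: uniform sampling
        else:
            # Higher stage = more weight on easier tasks, floored at 0.2.
            raw[task_id] = max(0.2, 1.0 - (level - 1) * 0.2 * stage)
    total = sum(raw.values())
    return {task_id: weight / total for task_id, weight in raw.items()}


def samples_for_task(weight: float, samples_per_task: int, n_tasks: int) -> int:
    """Number of dataset rows for one task, proportional to its weight."""
    return max(1, int(samples_per_task * weight * n_tasks))


# Hypothetical difficulty table (1 = easiest), for illustration only.
EXAMPLE_DIFFICULTY = {"a": 1, "b": 2, "c": 3, "d": 4}
```

One thing to check against the intent: the training loop passes `curriculum_stage=stage`, so the first stage samples uniformly and later stages lean toward easier tasks. If an easy-first curriculum was intended, the stage index may need to be reversed.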
323
+
324
+
325
+ # ══════════════════════════════════════════════════════════════════════════════
326
+ # Model Loading Utilities
327
+ # ══════════════════════════════════════════════════════════════════════════════
328
+
329
+ def load_model_and_tokenizer(
330
+ model_name: str,
331
+ use_4bit: bool = False,
332
+ use_8bit: bool = False,
333
+ gradient_checkpointing: bool = False,
334
+ ):
335
+ """Load model with optional quantization and gradient checkpointing."""
336
+
337
+ print(f"Loading model: {model_name}")
338
+
339
+ # Tokenizer
340
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
341
+ if tokenizer.pad_token is None:
342
+ tokenizer.pad_token = tokenizer.eos_token
343
+
344
+ # Quantization config
345
+ quantization_config = None
346
+ if use_4bit:
347
+ if not PEFT_AVAILABLE:
348
+ raise ImportError("4-bit quantization requires peft and bitsandbytes")
349
+ quantization_config = BitsAndBytesConfig(
350
+ load_in_4bit=True,
351
+ bnb_4bit_quant_type="nf4",
352
+ bnb_4bit_compute_dtype=torch.bfloat16,
353
+ bnb_4bit_use_double_quant=True,
354
+ )
355
+ print(" Using 4-bit quantization")
356
+ elif use_8bit:
357
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
358
+ print(" Using 8-bit quantization")
359
+
360
+ # Model kwargs
361
+ model_kwargs = {
362
+ "trust_remote_code": True,
363
+ "torch_dtype": torch.bfloat16 if not (use_4bit or use_8bit) else None,
364
+ "device_map": "auto",
365
+ }
366
+ if quantization_config:
367
+ model_kwargs["quantization_config"] = quantization_config
368
+
369
+ # Load model
370
+ model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
371
+
372
+ # Prepare for k-bit training if quantized
373
+ if use_4bit or use_8bit:
374
+ model = prepare_model_for_kbit_training(model)
375
+
376
+ # Gradient checkpointing
377
+ if gradient_checkpointing:
378
+ model.gradient_checkpointing_enable()
379
+ print(" Gradient checkpointing enabled")
380
+
381
+ param_count = sum(p.numel() for p in model.parameters())
382
+ trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
383
+ print(f" Total params: {param_count / 1e6:.1f}M, Trainable: {trainable / 1e6:.1f}M")
384
+
385
+ return model, tokenizer
386
+
387
+
388
+ def apply_lora(
389
+ model,
390
+ lora_r: int = 16,
391
+ lora_alpha: int = 32,
392
+ target_modules: Optional[list[str]] = None,
393
+ ):
394
+ """Apply LoRA adapters to the model."""
395
+ if not PEFT_AVAILABLE:
396
+ raise ImportError("LoRA requires peft: pip install peft")
397
+
398
+ if target_modules is None:
399
+ # Default targets for Qwen2 and similar architectures
400
+ target_modules = [
401
+ "q_proj", "k_proj", "v_proj", "o_proj",
402
+ "gate_proj", "up_proj", "down_proj",
403
+ ]
404
+
405
+ lora_config = LoraConfig(
406
+ r=lora_r,
407
+ lora_alpha=lora_alpha,
408
+ target_modules=target_modules,
409
+ lora_dropout=0.05,
410
+ bias="none",
411
+ task_type="CAUSAL_LM",
412
+ )
413
+
414
+ model = get_peft_model(model, lora_config)
415
+
416
+ trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
417
+ print(f" LoRA applied: r={lora_r}, trainable params: {trainable / 1e6:.2f}M")
418
+
419
+ return model, lora_config
420
+
421
+
422
+ # ══════════════════════════════════════════════════════════════════════════════
423
+ # Validation & Evaluation
424
+ # ══════════════════════════════════════════════════════════════════════════════
425
+
426
+ def evaluate_model(
427
+ model,
428
+ tokenizer,
429
+ env: WhipStudioEnv,
430
+ task_ids: Optional[list[str]] = None,
431
+ max_new_tokens: int = 2048,
432
+ ) -> dict[str, float]:
433
+ """Evaluate model on tasks and return scores."""
434
+ if task_ids is None:
435
+ task_ids = ALL_TASKS
436
+
437
+ model.eval()
438
+ scores = {}
439
+
440
+ for task_id in task_ids:
441
+ obs = env.reset(task_id)
442
+ user_prompt = build_user_prompt(obs["task_description"], obs["buggy_code"])
443
+ formatted = format_chat(tokenizer, user_prompt)
444
+
445
+ inputs = tokenizer(formatted, return_tensors="pt", truncation=True, max_length=4096)
446
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
447
+
448
+ with torch.no_grad():
449
+ outputs = model.generate(
450
+ **inputs,
451
+ max_new_tokens=max_new_tokens,
452
+ temperature=0.2,
453
+ top_p=0.95,
454
+ do_sample=True,
455
+ pad_token_id=tokenizer.pad_token_id,
456
+ )
457
+
458
+ generated = outputs[0][inputs["input_ids"].shape[1]:]
459
+ response = tokenizer.decode(generated, skip_special_tokens=True)
460
+ fixed_code = extract_code_from_response(response)
461
+
462
+ env.reset(task_id)
463
+ result = env.step(fixed_code, attempt=1)
464
+ reward = float(result.get("reward", 0.0) or 0.0)
465
+ scores[task_id] = reward
466
+ print(f" {task_id}: {reward:.4f}")
467
+
468
+ return scores
469
+
470
+
471
+ # ══════════════════════════════════════════════════════════════════════════════
472
+ # Main Training Loop
473
+ # ══════════════════════════════════════════════════════════════════════════════
474
+
475
+ def main():
476
+ parser = argparse.ArgumentParser(description="Improved GRPO training for WhipStudio")
477
+
478
+ # Environment
479
+ parser.add_argument("--env_url", type=str, required=True,
480
+ help="URL of the WhipStudio HF Space")
481
+
482
+ # Model
483
+ parser.add_argument("--model_name", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct",
484
+ help="Base model to fine-tune")
485
+ parser.add_argument("--output_dir", type=str, default="./whipstudio-debugger",
486
+ help="Directory to save the trained model")
487
+
488
+ # Quantization & Memory
489
+ parser.add_argument("--use_4bit", action="store_true",
490
+ help="Use 4-bit quantization (requires bitsandbytes)")
491
+ parser.add_argument("--use_8bit", action="store_true",
492
+ help="Use 8-bit quantization")
493
+ parser.add_argument("--gradient_checkpointing", action="store_true",
494
+ help="Enable gradient checkpointing to save memory")
495
+
496
+ # LoRA
497
+ parser.add_argument("--use_lora", action="store_true",
498
+ help="Use LoRA for efficient fine-tuning")
499
+ parser.add_argument("--lora_r", type=int, default=16,
500
+ help="LoRA rank")
501
+ parser.add_argument("--lora_alpha", type=int, default=32,
502
+ help="LoRA alpha")
503
+
504
+ # Training
505
+ parser.add_argument("--num_iterations", type=int, default=50,
506
+ help="Total training epochs, split evenly across curriculum stages")
507
+ parser.add_argument("--group_size", type=int, default=4,
508
+ help="Number of completions per prompt for GRPO")
509
+ parser.add_argument("--samples_per_task", type=int, default=10,
510
+ help="Base samples per task in dataset")
511
+ parser.add_argument("--learning_rate", type=float, default=1e-5,
512
+ help="Learning rate")
513
+ parser.add_argument("--max_new_tokens", type=int, default=2048,
514
+ help="Max tokens to generate per completion")
515
+ parser.add_argument("--beta", type=float, default=0.1,
516
+ help="KL penalty coefficient")
517
+
518
+ # Curriculum
519
+ parser.add_argument("--curriculum_stages", type=int, default=3,
520
+ help="Number of curriculum stages (0 = no curriculum)")
521
+
522
+ # Logging
523
+ parser.add_argument("--use_wandb", action="store_true",
524
+ help="Log to Weights & Biases")
525
+ parser.add_argument("--wandb_project", type=str, default="whipstudio",
526
+ help="W&B project name")
527
+
528
+ # Early stopping
529
+ parser.add_argument("--patience", type=int, default=10,
530
+ help="Early stopping patience (epochs without improvement)")
531
+ parser.add_argument("--eval_every", type=int, default=5,
532
+ help="Evaluate every N epochs")
533
+
534
+ # Hub
535
+ parser.add_argument("--push_to_hub", action="store_true",
536
+ help="Push trained model to HuggingFace Hub")
537
+ parser.add_argument("--hub_model_id", type=str, default=None,
538
+ help="Model ID on HF Hub")
539
+
540
+ args = parser.parse_args()
541
+
542
+ # ── Verify environment ──
543
+ print(f"\n{'=' * 60}")
544
+ print("WhipStudio Improved GRPO Training")
545
+ print(f"{'=' * 60}")
546
+ print(f"Environment: {args.env_url}")
547
+
548
+ env = WhipStudioEnv(args.env_url)
549
+ if not env.health_check():
550
+ raise ConnectionError(f"Cannot reach WhipStudio at {args.env_url}")
551
+ print("Environment is reachable ✓")
552
+
553
+ # ── Initialize wandb ──
554
+ if args.use_wandb:
555
+ if not WANDB_AVAILABLE:
556
+ print("Warning: wandb not installed, skipping logging")
557
+ args.use_wandb = False
558
+ else:
559
+ wandb.init(
560
+ project=args.wandb_project,
561
+ config=vars(args),
562
+ name=f"grpo-{args.model_name.split('/')[-1]}",
563
+ )
564
+
565
+ # ── Load model ──
566
+ model, tokenizer = load_model_and_tokenizer(
567
+ args.model_name,
568
+ use_4bit=args.use_4bit,
569
+ use_8bit=args.use_8bit,
570
+ gradient_checkpointing=args.gradient_checkpointing,
571
+ )
572
+
573
+ # ── Apply LoRA ──
574
+ peft_config = None
575
+ if args.use_lora:
576
+ model, peft_config = apply_lora(
577
+ model,
578
+ lora_r=args.lora_r,
579
+ lora_alpha=args.lora_alpha,
580
+ )
581
+
582
+ # ── Create output directory ──
583
+ output_path = Path(args.output_dir)
584
+ output_path.mkdir(parents=True, exist_ok=True)
585
+
586
+ # ── Training with curriculum ──
587
+ best_avg_score = 0.0
588
+ epochs_without_improvement = 0
589
+
590
+ n_stages = max(1, args.curriculum_stages)
591
+ epochs_per_stage = max(1, args.num_iterations // n_stages)
592
+
593
+ for stage in range(n_stages):
594
+ print(f"\n{'=' * 60}")
595
+ print(f"Curriculum Stage {stage + 1}/{n_stages}")
596
+ print(f"{'=' * 60}")
597
+
598
+ # Generate dataset for this curriculum stage
599
+ dataset = generate_curriculum_dataset(
600
+ env, tokenizer,
601
+ samples_per_task=args.samples_per_task,
602
+ curriculum_stage=stage,
603
+ )
604
+ print(f"Dataset: {len(dataset)} samples")
605
+
606
+ # Create reward function
607
+ reward_fn = create_reward_function(env, verbose=True)
608
+
609
+ # Configure GRPO
610
+ grpo_config = GRPOConfig(
611
+ output_dir=str(output_path / f"stage_{stage}"),
612
+ num_train_epochs=epochs_per_stage,
613
+ per_device_train_batch_size=1,
614
+ gradient_accumulation_steps=4,
615
+ learning_rate=args.learning_rate,
616
+ lr_scheduler_type="cosine",
617
+ warmup_ratio=0.1,
618
+ max_completion_length=args.max_new_tokens,
619
+ num_generations=args.group_size,
620
+ logging_steps=1,
621
+ save_steps=epochs_per_stage,
622
+ save_total_limit=2,
623
+ bf16=True,
624
+ report_to="wandb" if args.use_wandb else "none",
625
+ beta=args.beta,
626
+ remove_unused_columns=False,
627
+ )
628
+
629
+ # Initialize trainer
630
+ trainer = GRPOTrainer(
631
+ model=model,
632
+ args=grpo_config,
633
+ train_dataset=dataset,
634
+ processing_class=tokenizer,
635
+ reward_funcs=reward_fn,
636
+ peft_config=peft_config if stage == 0 else None, # Only apply peft on first stage
637
+ )
638
+
639
+ # Train
640
+ print(f"\nTraining stage {stage + 1}...")
641
+ train_result = trainer.train()
642
+ print(f" Stage {stage + 1} complete: {train_result.global_step} steps")
643
+
644
+ # Evaluate
645
+ print("\nEvaluating...")
646
+ scores = evaluate_model(model, tokenizer, env)
647
+ avg_score = sum(scores.values()) / len(scores)
648
+ print(f" Average score: {avg_score:.4f}")
649
+
650
+ if args.use_wandb:
651
+ wandb.log({
652
+ "stage": stage + 1,
653
+ "avg_score": avg_score,
654
+ **{f"score/{k}": v for k, v in scores.items()},
655
+ })
656
+
657
+ # Track best model
658
+ if avg_score > best_avg_score:
659
+ best_avg_score = avg_score
660
+ epochs_without_improvement = 0
661
+ # Save best model
662
+ best_path = output_path / "best"
663
+ trainer.save_model(str(best_path))
664
+ tokenizer.save_pretrained(str(best_path))
665
+ print(f" New best model saved (score={avg_score:.4f})")
666
+ else:
667
+ epochs_without_improvement += epochs_per_stage
668
+
669
+ # Early stopping
670
+ if epochs_without_improvement >= args.patience:
671
+ print(f"\nEarly stopping: no improvement for {epochs_without_improvement} epochs (patience={args.patience})")
672
+ break
673
+
674
+ # ── Final save ──
675
+ final_path = output_path / "final"
676
+ trainer.save_model(str(final_path))
677
+ tokenizer.save_pretrained(str(final_path))
678
+ print(f"\nFinal model saved to {final_path}")
679
+
680
+ # ── Push to hub ──
681
+ if args.push_to_hub and args.hub_model_id:
682
+ print(f"Pushing to Hub as {args.hub_model_id}...")
683
+ trainer.push_to_hub(args.hub_model_id)
684
+ tokenizer.push_to_hub(args.hub_model_id)
685
+ print("Pushed to Hub ✓")
686
+
687
+ # ── Final evaluation ──
688
+ print(f"\n{'=' * 60}")
689
+ print("Final Evaluation on All Tasks")
690
+ print(f"{'=' * 60}")
691
+ final_scores = evaluate_model(model, tokenizer, env)
692
+ final_avg = sum(final_scores.values()) / len(final_scores)
693
+ print(f"\nFinal average score: {final_avg:.4f}")
694
+ print(f"Best average score during training: {best_avg_score:.4f}")
695
+
696
+ if args.use_wandb:
697
+ wandb.log({"final_avg_score": final_avg})
698
+ wandb.finish()
699
+
700
+ # ── Save training summary ──
701
+ summary = {
702
+ "model_name": args.model_name,
703
+ "final_avg_score": final_avg,
704
+ "best_avg_score": best_avg_score,
705
+ "final_scores": final_scores,
706
+ "curriculum_stages": n_stages,
707
+ "use_lora": args.use_lora,
708
+ "use_4bit": args.use_4bit,
709
+ }
710
+ with open(output_path / "training_summary.json", "w") as f:
711
+ json.dump(summary, f, indent=2)
712
+
713
+ print("\nTraining complete! ✓")
714
+
715
+
716
+ if __name__ == "__main__":
717
+ main()
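Review note (not part of the commit): the early-stopping bookkeeping in `main()` counts stale epochs in whole-stage increments, which interacts with `--patience`. The sketch below simulates only that bookkeeping; `simulate_early_stopping` is a hypothetical name and the per-stage scores are made-up inputs, not results.

```python
def simulate_early_stopping(
    stage_scores: list[float],
    epochs_per_stage: int,
    patience: int,
) -> tuple[int, float]:
    """Replay the stage loop's best-score tracking; return (stages_run, best_score)."""
    best = 0.0
    stale_epochs = 0
    stages_run = 0
    for score in stage_scores:
        stages_run += 1
        if score > best:
            best = score
            stale_epochs = 0
        else:
            # A non-improving stage adds a whole stage's worth of epochs.
            stale_epochs += epochs_per_stage
        if stale_epochs >= patience:
            break
    return stages_run, best
```

With the script defaults (50 iterations over 3 stages gives 16 epochs per stage, patience 10), a single non-improving stage already exceeds the patience, so training would stop after the first stage that fails to improve.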
inference.py ADDED
@@ -0,0 +1,368 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Hackathon-compliant inference script for WhipStudio ML Debug Environment.
4
+
5
+ This script follows the Scaler Meta PyTorch Hackathon requirements:
6
+ - Uses OpenAI-compatible client with API_BASE_URL and MODEL_NAME
7
+ - Emits structured stdout logs: [START], [STEP], [END]
8
+ - Respects runtime limit (<20 min) and resource constraints
9
+
10
+ Environment Variables:
11
+ API_BASE_URL: The API endpoint for the LLM (e.g., https://api.openai.com/v1)
12
+ MODEL_NAME: The model identifier (e.g., gpt-4, Qwen/Qwen2.5-Coder-32B-Instruct)
13
+ HF_TOKEN: Your API key / HuggingFace token
14
+
15
+ Usage:
16
+ # With environment at localhost
17
+ python inference.py --env-url http://localhost:7860
18
+
19
+ # With HF Space
20
+ python inference.py --env-url https://your-space.hf.space
21
+ """
22
+
23
+ import argparse
24
+ import json
25
+ import os
26
+ import sys
27
+ import time
28
+ from typing import Any
29
+
30
+ import httpx
31
+ from openai import OpenAI
32
+
33
+ # ── Configuration ─────────────────────────────────────────────────────────────
34
+
35
+ SYSTEM_PROMPT = """You are an expert PyTorch debugging agent.
36
+ You receive a broken training script and must fix ALL bugs in it.
37
+
38
+ Rules:
39
+ - Return ONLY the complete corrected Python code, nothing else.
40
+ - No markdown, no backticks, no explanation text.
41
+ - The script must print losses in format: LOSSES:[v1, v2, ...]
42
+ - For tasks requiring validation metrics, also print: VAL_ACC:X.XX or VAL_ACCS:[v1,...] and FINAL_LOSS:X.XX
43
+ - Keep all torch.manual_seed() calls intact.
44
+ - Wrap all metrics in ##METRICS_START## and ##METRICS_END## markers.""".strip()
45
+
46
+ TASK_IDS = ["task1", "task2", "task3", "task4", "task5"]
47
+
48
+ MAX_ATTEMPTS_PER_TASK = 3
49
+ REQUEST_TIMEOUT = 180.0 # 3 minutes per LLM call
50
+ STEP_TIMEOUT = 120.0 # 2 minutes per step (code execution)
51
+
52
+
53
+ # ── Logging Helpers ───────────────────────────────────────────────────────────
54
+
55
+ def log_start(task_id: str) -> None:
56
+ """Emit [START] log for a task."""
57
+ print(f"[START] task_id={task_id}", flush=True)
58
+
59
+
60
+ def log_step(task_id: str, step: int, action_summary: str, reward: float, done: bool) -> None:
61
+ """Emit [STEP] log for a step within a task."""
62
+ print(
63
+ f"[STEP] task_id={task_id} step={step} action={action_summary} reward={reward:.4f} done={str(done).lower()}",
64
+ flush=True
65
+ )
66
+
67
+
68
+ def log_end(task_id: str, final_score: float) -> None:
69
+ """Emit [END] log for a task."""
70
+ print(f"[END] task_id={task_id} final_score={final_score:.4f}", flush=True)
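Review note (not part of the commit): since the `[STEP]` lines are meant to be machine-readable, a consumer could parse them back with a regular expression. The sketch below assumes exactly the field order and formatting emitted by `log_step` above; `STEP_PATTERN` and `parse_step_line` are hypothetical names.

```python
import re
from typing import Optional

# Matches the layout printed by log_step: fixed field order, 'true'/'false' for done.
STEP_PATTERN = re.compile(
    r"^\[STEP\] task_id=(?P<task_id>\S+) step=(?P<step>\d+) "
    r"action=(?P<action>\S+) reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false)$"
)


def parse_step_line(line: str) -> Optional[dict]:
    """Parse one [STEP] log line into a dict, or return None if it does not match."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "task_id": match["task_id"],
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
    }
```

This only works because the action summaries used in the script contain no whitespace; an action string with spaces would need quoting or a different delimiter.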
71
+
72
+
73
+ # ── LLM Client ────────────────────────────────────────────────────────────────
74
+
75
+ def get_openai_client() -> OpenAI:
76
+ """Initialize OpenAI-compatible client from environment variables."""
77
+ api_base = os.environ.get("API_BASE_URL")
78
+ api_key = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY")
79
+
80
+ if not api_key:
81
+ raise RuntimeError(
82
+ "HF_TOKEN or OPENAI_API_KEY must be set in environment"
83
+ )
84
+
85
+ # Default to OpenAI API if no base URL specified
86
+ if not api_base:
87
+ api_base = "https://api.openai.com/v1"
88
+
89
+ return OpenAI(
90
+ base_url=api_base,
91
+ api_key=api_key,
92
+ timeout=REQUEST_TIMEOUT,
93
+ )
94
+
95
+
96
+ def get_model_name() -> str:
97
+ """Get model name from environment or use default."""
98
+ return os.environ.get("MODEL_NAME", "gpt-4o-mini")
99
+
100
+
101
+ def generate_fix(client: OpenAI, model: str, prompt: str) -> str:
102
+ """Generate a code fix using the LLM."""
103
+ try:
104
+ response = client.chat.completions.create(
105
+ model=model,
106
+ messages=[
107
+ {"role": "system", "content": SYSTEM_PROMPT},
108
+ {"role": "user", "content": prompt},
109
+ ],
110
+ temperature=0.2,
111
+ max_tokens=4096,
112
+ )
113
+
114
+ content = response.choices[0].message.content or ""
115
+
116
+ # Strip markdown fences if present
117
+ if "```python" in content:
118
+ content = content.split("```python", 1)[1].split("```", 1)[0].strip()
119
+ elif "```" in content:
120
+ content = content.split("```", 1)[1].split("```", 1)[0].strip()
121
+
122
+ return content.strip()
123
+
124
+ except Exception as e:
125
+ print(f"[ERROR] LLM call failed: {e}", file=sys.stderr)
126
+ return ""
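Review note (not part of the commit): the fence-stripping step in `generate_fix` can be isolated and tested without an LLM call. The sketch below follows the same split logic; `strip_code_fences` is a hypothetical name, and the fence string is built indirectly only so that this example does not contain a literal fence.

```python
FENCE = "`" * 3  # three backticks, built indirectly to keep this example fence-safe


def strip_code_fences(content: str) -> str:
    """Return the body of the first fenced block, or the text itself if unfenced."""
    tagged = FENCE + "python"
    if tagged in content:
        # Prefer a block explicitly tagged as Python.
        content = content.split(tagged, 1)[1].split(FENCE, 1)[0]
    elif FENCE in content:
        # Otherwise take whatever sits inside the first fence pair.
        content = content.split(FENCE, 1)[1].split(FENCE, 1)[0]
    return content.strip()
```

A limitation shared with the original logic: a fence tagged with another label (for example `py`) falls into the second branch, so the label ends up as the first line of the returned code.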
127
+
128
+
129
+ # ── Environment Client ────────────────────────────────────────────────────────
130
+
131
+ class WhipStudioClient:
132
+ """HTTP client for the WhipStudio environment."""
133
+
134
+ def __init__(self, env_url: str):
135
+ self.env_url = env_url.rstrip("/")
136
+ self.timeout = httpx.Timeout(STEP_TIMEOUT, connect=10.0)
137
+
138
+ def health_check(self) -> bool:
139
+ """Check if the environment is reachable."""
140
+ try:
141
+ with httpx.Client(timeout=httpx.Timeout(10.0)) as client:
142
+ resp = client.get(f"{self.env_url}/health")
143
+ return resp.status_code == 200
144
+ except Exception:
145
+ return False
146
+
147
+ def reset(self, task_id: str) -> dict:
148
+ """Reset environment to a specific task."""
149
+ with httpx.Client(timeout=self.timeout) as client:
150
+ resp = client.post(
151
+ f"{self.env_url}/reset",
152
+ json={"task_id": task_id}
153
+ )
154
+ resp.raise_for_status()
155
+ data = resp.json()
156
+ return data.get("observation", data)
157
+
158
+ def step(self, fixed_code: str, attempt_number: int = 1) -> dict:
159
+ """Submit a fix and get the result."""
160
+ payload = {
161
+ "action": {
162
+ "fixed_code": fixed_code,
163
+ "attempt_number": attempt_number,
164
+ }
165
+ }
166
+
167
+ with httpx.Client(timeout=self.timeout) as client:
168
+ resp = client.post(f"{self.env_url}/step", json=payload)
169
+
170
+ # Handle potential 422 from different API versions
171
+ if resp.status_code == 422:
172
+ resp = client.post(
173
+ f"{self.env_url}/step",
174
+ json={
175
+ "fixed_code": fixed_code,
176
+ "attempt_number": attempt_number,
177
+ }
178
+ )
179
+
180
+ resp.raise_for_status()
181
+ return resp.json()
182
+
183
+ def get_tasks(self) -> list[str]:
184
+ """Get list of available tasks (returns task IDs only)."""
185
+ try:
186
+ with httpx.Client(timeout=self.timeout) as client:
187
+ resp = client.get(f"{self.env_url}/tasks")
188
+ if resp.status_code == 200:
189
+ data = resp.json()
190
+ if isinstance(data, list):
191
+ # Extract task IDs from task objects
192
+ task_ids = []
193
+ for t in data:
194
+ if isinstance(t, dict):
195
+ task_ids.append(t.get("id", str(t)))
196
+ else:
197
+ task_ids.append(str(t))
198
+ return task_ids
199
+ elif isinstance(data, dict):
200
+ tasks = data.get("tasks", [])
201
+ return [t.get("id") if isinstance(t, dict) else str(t) for t in tasks]
202
+ except Exception as e:
203
+ print(f"[WARNING] Could not fetch tasks from /tasks endpoint: {e}", file=sys.stderr)
204
+
205
+ # Fallback to default task IDs
206
+ return TASK_IDS
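Review note (not part of the commit): `get_tasks` accepts several response shapes from `/tasks`. The sketch below pulls that normalization into a pure function that can be unit-tested; `normalize_task_ids` and `DEFAULT_TASK_IDS` are hypothetical names. It is a slight variant of the code above, in that it always returns strings even when an entry has no `id` key.

```python
from typing import Any, Optional

DEFAULT_TASK_IDS = ["task1", "task2", "task3", "task4", "task5"]


def normalize_task_ids(data: Any, fallback: Optional[list[str]] = None) -> list[str]:
    """Extract task IDs from a list payload or a {'tasks': [...]} payload."""
    if fallback is None:
        fallback = DEFAULT_TASK_IDS
    if isinstance(data, dict):
        data = data.get("tasks", [])
    if not isinstance(data, list):
        # Unknown payload shape: fall back to the built-in task list.
        return list(fallback)
    ids = []
    for item in data:
        if isinstance(item, dict):
            ids.append(str(item.get("id", item)))
        else:
            ids.append(str(item))
    return ids
```

Keeping the parsing separate from the HTTP call would also make it easier to notice when the server adds a task (such as `task6` in `server/app.py`) that the hard-coded fallback does not include.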
207
+
208
+
209
+ # ── Main Inference Loop ───────────────────────────────────────────────────────
210
+
211
+ def build_prompt(obs: dict) -> str:
212
+ """Build the user prompt from observation."""
213
+ task_desc = obs.get("task_description", "Fix the buggy code.")
214
+ buggy_code = obs.get("buggy_code", "")
215
+ error_log = obs.get("error_log", "None")
216
+ last_reward = obs.get("last_reward", 0.0)
217
+
218
+ return f"""Task: {task_desc}
219
+
220
+ Buggy code:
221
+ {buggy_code}
222
+
223
+ Previous execution output (if any):
224
+ {error_log}
225
+
226
+ Previous score: {last_reward}""".strip()
227
+
228
+
229
+ def run_task(
230
+ env: WhipStudioClient,
231
+ llm_client: OpenAI,
232
+ model: str,
233
+ task_id: str,
234
+ ) -> float:
235
+ """Run inference on a single task. Returns the best score achieved."""
236
+
237
+ # Ensure task_id is a string
238
+ if isinstance(task_id, dict):
239
+ task_id = task_id.get("id", str(task_id))
240
+
241
+ log_start(task_id)
242
+
243
+ try:
244
+ obs = env.reset(task_id)
245
+ except Exception as e:
246
+ error_msg = str(e)
247
+ print(f"[ERROR] Failed to reset {task_id}: {error_msg}", file=sys.stderr)
248
+
249
+ # Check if it's a 500 error - likely environment issue
250
+ if "500" in error_msg:
251
+ print("[ERROR] HF Space returned 500 - the environment may be starting up or having issues", file=sys.stderr)
252
+ print(f"[ERROR] Try visiting {env.env_url} in a browser first", file=sys.stderr)
253
+
254
+ log_end(task_id, 0.0)
255
+ return 0.0
256
+
257
+ best_score = 0.0
258
+
259
+ for step in range(1, MAX_ATTEMPTS_PER_TASK + 1):
260
+ prompt = build_prompt(obs)
261
+
262
+ # Generate fix
263
+ fixed_code = generate_fix(llm_client, model, prompt)
264
+
265
+ if not fixed_code.strip():
266
+ log_step(task_id, step, "empty_response", 0.0, False)
267
+ continue
268
+
269
+ # Submit fix
270
+ try:
271
+ result = env.step(fixed_code, attempt_number=step)
272
+
273
+ reward = float(result.get("reward", 0.0) or 0.0)
274
+ done = result.get("done", False)
275
+ obs = result.get("observation", obs)
276
+
277
+ # Track best score
278
+ if reward > best_score:
279
+ best_score = reward
280
+
281
+ # Log step
282
+ code_len = len(fixed_code)
283
+ log_step(task_id, step, f"submit_fix({code_len}chars)", reward, done)
284
+
285
+ # Early exit if task is solved
286
+ if done or reward >= 0.95:
287
+ break
288
+
289
+ except Exception as e:
290
+ print(f"[ERROR] Step failed for {task_id}: {e}", file=sys.stderr)
291
+ log_step(task_id, step, "step_error", 0.0, False)
292
+
293
+ log_end(task_id, best_score)
294
+ return best_score
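Review note (not part of the commit): the attempt loop in `run_task` keeps the best reward seen and exits early once the task is reported done or the reward reaches 0.95. The sketch below captures only that control flow; `best_of_attempts` and the `submit` callback are hypothetical stand-ins for the LLM call plus `env.step`.

```python
from typing import Callable


def best_of_attempts(
    submit: Callable[[int], tuple[float, bool]],
    max_attempts: int = 3,
    solved_at: float = 0.95,
) -> float:
    """Call submit(attempt) up to max_attempts times and return the best reward."""
    best = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            reward, done = submit(attempt)
        except Exception:
            # A failed step counts as a zero-reward attempt; move on to the next one.
            continue
        best = max(best, reward)
        if done or reward >= solved_at:
            break
    return best
```

Passing the submission step in as a callback is a design choice for testability here; the script itself inlines the calls.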
295
+
296
+
297
+ def main():
298
+ parser = argparse.ArgumentParser(
299
+ description="WhipStudio inference script for OpenEnv Hackathon"
300
+ )
301
+ parser.add_argument(
302
+ "--env-url",
303
+ default=os.environ.get("ENV_URL", "http://localhost:7860"),
304
+ help="URL of the WhipStudio environment"
305
+ )
306
+ parser.add_argument(
307
+ "--tasks",
308
+ nargs="+",
309
+ default=None,
310
+ help="Specific tasks to run (default: all tasks)"
311
+ )
312
+ args = parser.parse_args()
313
+
314
+ # Initialize clients
315
+ print(f"[INFO] Connecting to environment at {args.env_url}", flush=True)
316
+ env = WhipStudioClient(args.env_url)
317
+
318
+ if not env.health_check():
319
+ print(f"[ERROR] Cannot reach environment at {args.env_url}", file=sys.stderr)
320
+ sys.exit(1)
321
+
322
+ print("[INFO] Environment is reachable", flush=True)
323
+
324
+ # Initialize LLM client
325
+ llm_client = get_openai_client()
326
+ model = get_model_name()
327
+ print(f"[INFO] Using model: {model}", flush=True)
328
+
329
+ # Determine which tasks to run
330
+ if args.tasks:
331
+ task_ids = args.tasks
332
+ else:
333
+ task_ids = env.get_tasks()
334
+
335
+ print(f"[INFO] Running tasks: {task_ids}", flush=True)
336
+
337
+ # Run inference on all tasks
338
+ start_time = time.time()
339
+ scores = {}
340
+
341
+ for task_id in task_ids:
342
+ task_start = time.time()
343
+ score = run_task(env, llm_client, model, task_id)
344
+ scores[task_id] = score
345
+ task_elapsed = time.time() - task_start
346
+ print(f"[INFO] {task_id} completed in {task_elapsed:.1f}s with score {score:.4f}", flush=True)
347
+
348
+ # Summary
349
+ total_elapsed = time.time() - start_time
350
+ avg_score = sum(scores.values()) / len(scores) if scores else 0.0
351
+
352
+ print("\n" + "=" * 50, flush=True)
353
+ print("[SUMMARY]", flush=True)
354
+ print(f" Tasks completed: {len(scores)}", flush=True)
355
+ print(f" Total time: {total_elapsed:.1f}s", flush=True)
356
+ print(f" Average score: {avg_score:.4f}", flush=True)
357
+ print(" Per-task scores:", flush=True)
358
+ for tid, score in scores.items():
359
+ print(f" {tid}: {score:.4f}", flush=True)
360
+ print("=" * 50, flush=True)
361
+
362
+ # Warn (without changing the exit code) if the average score is very low
363
+ if avg_score < 0.1:
364
+ print("[WARNING] Average score below 0.1 threshold", file=sys.stderr)
365
+
366
+
367
+ if __name__ == "__main__":
368
+ main()
run_inference.sh ADDED
@@ -0,0 +1,63 @@
1
+ #!/bin/bash
2
+ # Quick setup script for WhipStudio OpenEnv Hackathon
3
+
4
+ set -e
5
+
6
+ echo "=========================================="
7
+ echo "WhipStudio Hackathon Setup"
8
+ echo "=========================================="
9
+
10
+ # Step 1: Check environment variables
11
+ echo ""
12
+ echo "Step 1: Checking environment variables..."
13
+
14
+ if [ -z "$HF_TOKEN" ]; then
15
+ echo "⚠️ HF_TOKEN not set"
16
+ if [ -f .env ]; then
17
+ echo " Loading from .env file..."
18
+ export HF_TOKEN=$(grep -Ev '^[[:space:]]*(#|$)' .env | head -1 | sed 's/^HF_TOKEN=//')
19
+ echo " ✓ HF_TOKEN loaded"
20
+ else
21
+ echo " ❌ Please set HF_TOKEN environment variable or create .env file"
22
+ exit 1
23
+ fi
24
+ else
25
+ echo " ✓ HF_TOKEN is set"
26
+ fi
27
+
28
+ if [ -z "$API_BASE_URL" ]; then
29
+ echo "⚠️ API_BASE_URL not set, using HuggingFace Inference API"
30
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
31
+ fi
32
+ echo " ✓ API_BASE_URL: $API_BASE_URL"
33
+
34
+ if [ -z "$MODEL_NAME" ]; then
35
+ echo "⚠️ MODEL_NAME not set, using default"
36
+ export MODEL_NAME="Qwen/Qwen2.5-Coder-1.5B-Instruct"
37
+ fi
38
+ echo " ✓ MODEL_NAME: $MODEL_NAME"
39
+
40
+ # Step 2: Check HF Space
41
+ ENV_URL="${1:-https://amogh-kal1-whipstudio.hf.space}"
42
+ echo ""
43
+ echo "Step 2: Checking HF Space at $ENV_URL..."
44
+
45
+ if curl -sf --max-time 10 "$ENV_URL/health" > /dev/null 2>&1; then
46
+ echo " ✓ HF Space is reachable"
47
+ else
48
+ echo " ❌ HF Space not reachable or still starting up"
49
+ echo " Try visiting $ENV_URL in your browser first"
50
+ exit 1
51
+ fi
52
+
53
+ # Step 3: Run inference
54
+ echo ""
55
+ echo "Step 3: Running inference..."
56
+ echo ""
57
+
58
+ python3 inference.py --env-url "$ENV_URL"
59
+
60
+ echo ""
61
+ echo "=========================================="
62
+ echo "✅ Inference complete!"
63
+ echo "=========================================="
server/app.py CHANGED
@@ -95,6 +95,7 @@ def list_tasks():
95
  {"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
96
  {"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
97
  {"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
 
98
  ],
99
  "action_schema": {
100
  "fixed_code": "string (required) — complete runnable Python script",
@@ -122,16 +123,26 @@ def run_grader(payload: dict):
122
  @app.get("/baseline")
123
  async def run_baseline(request: Request):
124
  try:
125
- from ..baseline_agent import run_single_task
126
  except ImportError:
127
- from baseline_agent import run_single_task
128
 
129
  env_url = str(request.base_url).rstrip("/")
 
 
 
 
 
 
 
130
  results = {}
131
  task_scores = {}
132
- for task_id in ["task1", "task2", "task3", "task4", "task5"]:
133
  try:
134
- score = await asyncio.wait_for(run_single_task(task_id, env_url), timeout=120.0)
 
 
 
135
  results[task_id] = round(score, 4)
136
  task_scores[task_id] = round(score, 4)
137
  except TimeoutError:
@@ -147,24 +158,43 @@ async def run_baseline(request: Request):
147
  task_scores[task_id] = 0.0
148
  results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
149
  avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
150
- return {"baseline_scores": results, "average": avg, "env_url": env_url}
 
 
 
 
 
151
 
152
 
153
  @app.get("/baseline/task/{task_id}")
154
  async def run_baseline_single(task_id: str, request: Request):
155
  """Run the baseline agent on a single task. Returns score + details."""
156
  try:
157
- from ..baseline_agent import run_single_task_detailed
158
  except ImportError:
159
- from baseline_agent import run_single_task_detailed
160
 
161
  env_url = str(request.base_url).rstrip("/")
 
 
 
 
 
 
 
 
 
 
162
  try:
163
- result = await asyncio.wait_for(run_single_task_detailed(task_id, env_url), timeout=120.0)
 
 
 
164
  return {
165
  "task_id": task_id,
166
  "score": round(result["score"], 4),
167
  "status": "ok",
 
168
  "fixed_code": result.get("fixed_code", ""),
169
  "output": result.get("output", ""),
170
  "attempts": result.get("attempts", []),
 
95
  {"id": "task3", "name": "OOM and data leakage", "difficulty": "hard"},
96
  {"id": "task4", "name": "Wrong loss function", "difficulty": "medium"},
97
  {"id": "task5", "name": "Frozen backbone", "difficulty": "medium"},
98
+ {"id": "task6", "name": "Input-Output mismatch", "difficulty": "hard"},
99
  ],
100
  "action_schema": {
101
  "fixed_code": "string (required) — complete runnable Python script",
 
123
  @app.get("/baseline")
124
  async def run_baseline(request: Request):
125
  try:
126
+ from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
127
  except ImportError:
128
+ from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task
129
 
130
  env_url = str(request.base_url).rstrip("/")
131
+ model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
132
+ if model_id not in SUPPORTED_MODEL_IDS:
133
+ return {
134
+ "error": f"Unsupported model_id '{model_id}'",
135
+ "supported_model_ids": SUPPORTED_MODEL_IDS,
136
+ }
137
+
138
  results = {}
139
  task_scores = {}
140
+ for task_id in ["task1", "task2", "task3", "task4", "task5", "task6"]:
141
  try:
142
+ score = await asyncio.wait_for(
143
+ run_single_task(task_id, env_url, model_id=model_id),
144
+ timeout=120.0,
145
+ )
146
  results[task_id] = round(score, 4)
147
  task_scores[task_id] = round(score, 4)
148
  except TimeoutError:
 
158
  task_scores[task_id] = 0.0
159
  results[f"{task_id}_error"] = f"internal_error: {exc.__class__.__name__}: {exc}"
160
  avg = round(sum(task_scores.values()) / max(1, len(task_scores)), 4)
161
+ return {
162
+ "baseline_scores": results,
163
+ "average": avg,
164
+ "env_url": env_url,
165
+ "model_id": model_id,
166
+ }
167
 
168
 
169
  @app.get("/baseline/task/{task_id}")
170
  async def run_baseline_single(task_id: str, request: Request):
171
  """Run the baseline agent on a single task. Returns score + details."""
172
  try:
173
+ from ..baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
174
  except ImportError:
175
+ from baseline_agent import SUPPORTED_MODEL_IDS, run_single_task_detailed
176
 
177
  env_url = str(request.base_url).rstrip("/")
178
+ model_id = request.query_params.get("model_id", "Qwen/Qwen2.5-Coder-32B-Instruct")
179
+ if model_id not in SUPPORTED_MODEL_IDS:
180
+ return {
181
+ "task_id": task_id,
182
+ "score": 0.0,
183
+ "status": "error",
184
+ "error": f"Unsupported model_id '{model_id}'",
185
+ "supported_model_ids": SUPPORTED_MODEL_IDS,
186
+ }
187
+
188
  try:
189
+ result = await asyncio.wait_for(
190
+ run_single_task_detailed(task_id, env_url, model_id=model_id),
191
+ timeout=120.0,
192
+ )
193
  return {
194
  "task_id": task_id,
195
  "score": round(result["score"], 4),
196
  "status": "ok",
197
+ "model_id": model_id,
198
  "fixed_code": result.get("fixed_code", ""),
199
  "output": result.get("output", ""),
200
  "attempts": result.get("attempts", []),
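Both `/baseline` endpoints above follow the same pattern: reject an unknown `model_id` before doing any work, then bound each task run with `asyncio.wait_for`. The sketch below reproduces that pattern without FastAPI so it can be run on its own. The contents of `SUPPORTED_MODEL_IDS`, the helper `fake_run_single_task`, the wrapper `run_with_guard`, and the `"timeout"` status value are all illustrative assumptions and are not taken from `baseline_agent`.

```python
import asyncio

# Assumption: placeholder list; the real SUPPORTED_MODEL_IDS lives in baseline_agent.
SUPPORTED_MODEL_IDS = ["Qwen/Qwen2.5-Coder-32B-Instruct"]
DEFAULT_MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"


async def fake_run_single_task(task_id: str, delay: float) -> float:
    """Hypothetical stand-in for run_single_task: sleeps, then returns a score."""
    await asyncio.sleep(delay)
    return 0.73904


async def run_with_guard(task_id: str, model_id: str = DEFAULT_MODEL_ID,
                         delay: float = 0.0, timeout: float = 0.05) -> dict:
    # Validate the model first so an unsupported ID never reaches the agent.
    if model_id not in SUPPORTED_MODEL_IDS:
        return {"task_id": task_id, "score": 0.0, "status": "error",
                "error": f"Unsupported model_id '{model_id}'"}
    try:
        score = await asyncio.wait_for(
            fake_run_single_task(task_id, delay), timeout=timeout
        )
    except asyncio.TimeoutError:
        # "timeout" is a hypothetical status label for this sketch.
        return {"task_id": task_id, "score": 0.0, "status": "timeout"}
    return {"task_id": task_id, "score": round(score, 4), "status": "ok"}


ok = asyncio.run(run_with_guard("task1"))
bad_model = asyncio.run(run_with_guard("task1", model_id="not-a-model"))
slow = asyncio.run(run_with_guard("task1", delay=1.0))
```

The sketch catches `asyncio.TimeoutError` rather than the builtin `TimeoutError`. From Python 3.11 the two are the same class, but on earlier versions a bare `except TimeoutError` would not catch a timeout raised by `asyncio.wait_for`.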
server/environment.py CHANGED
@@ -9,12 +9,12 @@ from openenv.core.env_server.types import State
9
  try:
10
  from ..models import MLDebugAction, MLDebugObservation
11
  from .sandbox import execute_code
12
- from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
13
  from .tasks.graders import parse_losses, parse_val_accs, score_task
14
  except ImportError:
15
  from models import MLDebugAction, MLDebugObservation
16
  from server.sandbox import execute_code
17
- from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone
18
  from server.tasks.graders import parse_losses, parse_val_accs, score_task
19
 
20
  TASKS = {
@@ -23,6 +23,7 @@ TASKS = {
23
  "task3": task3_oom_leakage,
24
  "task4": task4_wrong_loss,
25
  "task5": task5_frozen_backbone,
 
26
  }
27
 
28
 
 
9
  try:
10
  from ..models import MLDebugAction, MLDebugObservation
11
  from .sandbox import execute_code
12
+ from .tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
13
  from .tasks.graders import parse_losses, parse_val_accs, score_task
14
  except ImportError:
15
  from models import MLDebugAction, MLDebugObservation
16
  from server.sandbox import execute_code
17
+ from server.tasks import task1_broken_loop, task2_nan_loss, task3_oom_leakage, task4_wrong_loss, task5_frozen_backbone, task6_io_mismatch
18
  from server.tasks.graders import parse_losses, parse_val_accs, score_task
19
 
20
  TASKS = {
 
23
  "task3": task3_oom_leakage,
24
  "task4": task4_wrong_loss,
25
  "task5": task5_frozen_backbone,
26
+ "task6": task6_io_mismatch,
27
  }
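Adding `task6` touches several registries at once: the task list in `server/app.py`, the `TASKS` dict here, and the `graders` dict in `score_task`. A small consistency check can catch an ID that was registered in one place but not another. The sketch below is an assumption about how such a check might look; `check_registries` and the placeholder values are hypothetical and do not exist in the repository.

```python
def check_registries(tasks: dict, graders: dict) -> list[str]:
    """Return the sorted task IDs that appear in exactly one of the two registries."""
    return sorted(set(tasks) ^ set(graders))


# Placeholder registries: the real values are task modules and grade_taskN functions.
tasks = {f"task{i}": object() for i in range(1, 7)}      # task1 .. task6
graders = {f"task{i}": object() for i in range(1, 6)}    # task1 .. task5 only

missing = check_registries(tasks, graders)
```

Running a check like this at import time or in a unit test would surface a missing grader before `score_task` raises `ValueError` for an unknown `task_id`.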
28
 
29
 
server/tasks/graders.py CHANGED
@@ -109,11 +109,11 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
109
  Task 1: Broken Training Loop
110
  Bugs: 1) lr=10.0 (too high), 2) step() before backward()
111
 
112
- Grading criteria:
113
- - Must have low final loss (<0.3) - indicates proper training
114
- - Must have high validation accuracy (>0.85) - indicates learning
115
- - Must show monotonic improvement - indicates proper gradient flow
116
- - Must NOT have loss spikes - indicates stable training
117
  """
118
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
119
  if not valid:
@@ -128,10 +128,10 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
128
  if not losses:
129
  return 0.1, {"reason": "no_losses_parsed"}
130
 
131
- # Check for NaN/Inf - indicates numerical instability
132
  nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
133
  if nan_count > 0:
134
- return 0.15, {"reason": "nan_inf_found", "nan_count": nan_count}
135
 
136
  val_acc = parse_scalar(result.stdout, "VAL_ACC")
137
  if val_acc is None:
@@ -141,33 +141,39 @@ def grade_task1(result: RunResult) -> tuple[float, dict]:
141
  initial_loss = losses[0]
142
  max_loss = max(losses)
143
 
144
- # Check for loss instability (spikes indicate LR too high)
145
- # Healthy training shouldn't have losses > 5x initial loss
146
- if max_loss > initial_loss * 5.0 or max_loss > 10.0:
147
- return 0.2, {
148
  "reason": "loss_unstable_spikes",
149
  "max_loss": max_loss,
150
  "final_loss": final_loss,
151
  "val_acc": val_acc
152
  }
153
 
154
- # Check for loss explosion at end
155
- if final_loss > 5.0:
156
- return 0.15, {"reason": "loss_unstable", "final_loss": final_loss, "val_acc": val_acc}
157
 
158
- # Primary: Validation accuracy (higher is better, target > 0.85)
159
- acc_score = sigmoid_score(val_acc, center=0.85, steepness=15.0, higher_is_better=True) * 0.5
160
 
161
- # Secondary: Final loss should be low (lower is better, target < 0.3)
162
- loss_score = sigmoid_score(final_loss, center=0.3, steepness=8.0, higher_is_better=False) * 0.3
 
163
 
164
- # Bonus: Monotonic improvement (loss should decrease over time)
 
 
 
165
  monotonic_bonus = 0.0
166
  if len(losses) >= 10:
167
- first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
168
- last_quarter = sum(losses[-len(losses)//4:]) / (len(losses)//4)
169
- if last_quarter < first_quarter * 0.7: # At least 30% improvement
170
- monotonic_bonus = 0.2
 
 
 
171
 
172
  final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
173
  breakdown = {
@@ -187,10 +193,10 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
187
  Task 2: NaN Loss
188
  Bug: torch.log(pred) when pred can be 0.0 after sigmoid
189
 
190
- Grading criteria:
191
- - Must have NO NaN/Inf losses - this is the primary test
192
- - Must have good validation accuracy (>0.75)
193
- - Must show loss convergence (<0.4)
194
  """
195
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
196
  if not valid:
@@ -207,11 +213,11 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
207
 
208
  nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
209
 
210
- # Primary criterion: NO NaN/Inf allowed - this is the core bug being tested
211
  nan_ratio = nan_count / len(losses)
212
  if nan_count > 0:
213
- # Heavily penalize any NaN - this is THE bug we're testing
214
- return max(0.05, 0.3 * (1.0 - nan_ratio)), {
215
  "reason": "has_nans",
216
  "nan_ratio": nan_ratio,
217
  "nan_count": nan_count
@@ -219,19 +225,19 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
219
 
220
  val_acc = parse_scalar(result.stdout, "VAL_ACC")
221
  if val_acc is None:
222
- return 0.2, {"reason": "no_val_acc_but_no_nans"}
223
 
224
  finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
225
  final_loss = finite_losses[-1] if finite_losses else float('inf')
226
 
227
- # No NaN = base score of 0.4 (the bug is fixed)
228
- base_score = 0.4
229
 
230
- # Validation accuracy bonus (higher is better, target > 0.75)
231
- acc_score = sigmoid_score(val_acc, center=0.75, steepness=12.0, higher_is_better=True) * 0.35
232
 
233
- # Convergence bonus (lower is better, target < 0.4)
234
- convergence_score = sigmoid_score(final_loss, center=0.4, steepness=6.0, higher_is_better=False) * 0.25
235
 
236
  final_score = min(1.0, base_score + acc_score + convergence_score)
237
  breakdown = {
@@ -247,14 +253,13 @@ def grade_task2(result: RunResult) -> tuple[float, dict]:
247
 
248
  def grade_task3(result: RunResult) -> tuple[float, dict]:
249
  """
250
- Task 3: Memory Leak + Missing zero_grad
251
- Bugs: 1) total_loss += loss retains graph (memory leak)
252
- 2) Missing optimizer.zero_grad() causes gradient accumulation
253
-
254
- Grading criteria:
255
- - FINAL_LOSS should be reasonable (<20) - memory leak fixed
256
- - VAL_ACC should be high (>0.8) - gradient accumulation fixed
257
- - Learning trajectory should improve over epochs
258
  """
259
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
260
  if not valid:
@@ -271,40 +276,51 @@ def grade_task3(result: RunResult) -> tuple[float, dict]:
271
  val_accs = parse_val_accs(result.stdout)
272
  final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
273
 
274
- # Memory leak check: FINAL_LOSS should be reasonable
275
- # With .item(), total_loss is sum of scalars (~12-20 for 20 epochs)
276
- memory_score = 0.0
277
- if final_loss_val is not None:
278
- memory_score = sigmoid_score(final_loss_val, center=20.0, steepness=0.2, higher_is_better=False) * 0.35
279
- else:
280
- memory_score = 0.0
281
-
282
- # Gradient accumulation check: accuracy should be high if training properly
283
- # Without zero_grad(), gradients accumulate and training degrades
284
  acc_score = 0.0
285
  final_acc = 0.0
286
  early_acc = 0.0
287
  trajectory_bonus = 0.0
288
 
289
- if val_accs and len(val_accs) >= 2:
290
- early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
291
- final_acc = val_accs[-1]
292
-
293
- # Final accuracy is the main indicator of correct training
294
- acc_score = sigmoid_score(final_acc, center=0.8, steepness=15.0, higher_is_better=True) * 0.45
295
-
296
- # Learning trajectory: should improve over time
297
- if len(val_accs) >= 5:
298
- improvement = final_acc - early_acc
299
- if improvement > 0.05:
300
- trajectory_bonus = 0.1
301
- elif improvement > 0.0:
302
- trajectory_bonus = 0.05
303
-
304
- final_score = min(1.0, memory_score + acc_score + trajectory_bonus)
305
  breakdown = {
306
- "memory_score": round(memory_score, 4),
307
  "acc_score": round(acc_score, 4),
 
308
  "trajectory_bonus": round(trajectory_bonus, 4),
309
  "early_acc": round(early_acc, 4),
310
  "final_acc": round(final_acc, 4),
@@ -318,10 +334,10 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
318
  Task 4: Wrong Loss (Multi-label Classification)
319
  Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
320
 
321
- Grading criteria:
322
- - F1 score should be high (> 0.6) - primary metric
323
- - avg_labels should be > 1.0 (proper multi-label output)
324
- - Loss should converge
325
  """
326
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
327
  if not valid:
@@ -337,29 +353,45 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
337
  avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
338
  f1 = parse_scalar(result.stdout, "F1_SCORE")
339
 
340
- # F1 score - PRIMARY metric (higher is better, target > 0.6)
341
  f1_score_val = 0.0
342
  if f1 is not None:
343
- f1_score_val = sigmoid_score(f1, center=0.6, steepness=10.0, higher_is_better=True) * 0.5
344
 
345
- # Multi-label check: avg_labels should be > 1.0 (proper multi-label predictions)
346
- # With 30% probability per class and 5 classes, expected avg ~1.5 labels/sample
347
  labels_score = 0.0
348
  if avg_labels is not None:
349
- if avg_labels < 0.5:
350
- # Way too few labels - likely single-label behavior
351
- labels_score = 0.0
352
  elif avg_labels >= 1.0:
353
- # Good - multiple labels per sample
354
- labels_score = 0.3
355
  else:
356
- # Partial credit
357
- labels_score = sigmoid_score(avg_labels, center=1.0, steepness=5.0, higher_is_better=True) * 0.3
358
 
359
- # Loss convergence (lower is better, target < 0.5)
360
  loss_score = 0.0
361
  if final_loss is not None:
362
- loss_score = sigmoid_score(final_loss, center=0.5, steepness=4.0, higher_is_better=False) * 0.2
363
 
364
  final_score = min(1.0, f1_score_val + labels_score + loss_score)
365
  breakdown = {
@@ -379,14 +411,14 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
379
  Bug: Backbone is frozen but still passed to optimizer (wastes memory)
380
 
381
  Valid fixes:
382
- 1. Unfreeze backbone -> grad_norm > 0, same param count
383
- 2. Only pass head params to optimizer -> grad_norm = 0, reduced param count
384
 
385
- The buggy code has: grad_norm = 0, param_count = 530442 (full model)
386
 
387
- Grading criteria:
388
- - Either backbone has gradients (unfrozen), OR
389
- - Optimizer param count is reduced (only head)
390
  """
391
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
392
  if not valid:
@@ -402,30 +434,39 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
402
  grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
403
  param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
404
 
405
- # Loss should be reasonable (10-class classification, CE loss)
406
- loss_score = 0.0
407
- if final_loss is not None:
408
- loss_score = sigmoid_score(final_loss, center=2.5, steepness=2.0, higher_is_better=False) * 0.3
409
-
410
- # The bug: frozen backbone (grad_norm=0) but full params in optimizer (param_count=530442)
411
- # Fix 1: Unfreeze -> grad_norm > 0 (any amount)
412
- # Fix 2: Only head -> param_count < 100000 (head has ~5130 params)
413
-
414
  fix_score = 0.0
415
  fix_type = "none"
416
 
417
- if grad_norm is not None and grad_norm > 0.1:
418
- # Backbone is unfrozen and training
419
- fix_score = 0.7
420
  fix_type = "unfrozen"
421
- elif param_count is not None and param_count < 100000:
422
- # Only head params in optimizer (head has ~5130 params)
423
- fix_score = 0.7
424
  fix_type = "head_only"
425
- elif grad_norm is not None and grad_norm == 0.0 and (param_count is None or param_count > 100000):
426
- # Buggy state: frozen backbone but full params in optimizer
427
- fix_score = 0.0
428
- fix_type = "buggy"
429
 
430
  final_score = min(1.0, loss_score + fix_score)
431
  breakdown = {
@@ -439,6 +480,192 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
439
  return final_score, breakdown
440
 
441
 
442
  def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
443
  graders = {
444
  "task1": grade_task1,
@@ -446,6 +673,7 @@ def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
446
  "task3": grade_task3,
447
  "task4": grade_task4,
448
  "task5": grade_task5,
 
449
  }
450
  if task_id not in graders:
451
  raise ValueError(f"Unknown task_id: {task_id}")
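Every grader in this diff calls `sigmoid_score(value, center, steepness, higher_is_better)`, but the helper's body is not part of the shown hunks. The sketch below assumes it is a plain logistic curve that returns 0.5 at `center`; the real implementation in `graders.py` may differ. It compares the old and new Task 1 accuracy terms for the same `val_acc` to show how moving the center from 0.85 to 0.90 and the steepness from 15 to 25 lowers the credit for a borderline run.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float,
                  higher_is_better: bool = True) -> float:
    """Assumed logistic form; the real helper in graders.py may differ."""
    x = steepness * (value - center)
    if not higher_is_better:
        x = -x
    # Clamp the exponent so extreme losses cannot overflow math.exp.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


val_acc = 0.88  # a run that clears the old target but not the new one

# Old Task 1 accuracy term: center 0.85, steepness 15, weight 0.50.
old_term = sigmoid_score(val_acc, center=0.85, steepness=15.0) * 0.50
# New Task 1 accuracy term: center 0.90, steepness 25, weight 0.60.
new_term = sigmoid_score(val_acc, center=0.90, steepness=25.0) * 0.60
```

Under this assumed form, `new_term` is smaller than `old_term` even though the new weight is larger, because 0.88 now sits below the center instead of above it.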
 
109
  Task 1: Broken Training Loop
110
  Bugs: 1) lr=10.0 (too high), 2) step() before backward()
111
 
112
+ Grading criteria (STRICT thresholds for differentiation):
113
+ - VAL_ACC > 0.90 required for high score (raised from the old 0.85 target)
114
+ - Final loss < 0.15 required for high score (lowered from the old 0.3 target)
115
+ - Must show monotonic improvement
116
+ - Penalize any instability heavily
117
  """
118
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
119
  if not valid:
 
128
  if not losses:
129
  return 0.1, {"reason": "no_losses_parsed"}
130
 
131
+ # Check for NaN/Inf - indicates numerical instability (LR bug not fully fixed)
132
  nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
133
  if nan_count > 0:
134
+ return 0.1, {"reason": "nan_inf_found", "nan_count": nan_count}
135
 
136
  val_acc = parse_scalar(result.stdout, "VAL_ACC")
137
  if val_acc is None:
 
141
  initial_loss = losses[0]
142
  max_loss = max(losses)
143
 
144
+ # STRICT: Check for loss instability (spikes indicate LR still too high)
145
+ if max_loss > initial_loss * 3.0 or max_loss > 5.0:
146
+ return 0.15, {
 
147
  "reason": "loss_unstable_spikes",
148
  "max_loss": max_loss,
149
  "final_loss": final_loss,
150
  "val_acc": val_acc
151
  }
152
 
153
+ # STRICT: Loss must converge well
154
+ if final_loss > 1.0:
155
+ return 0.2, {"reason": "loss_not_converged", "final_loss": final_loss, "val_acc": val_acc}
156
 
157
+ # STRICT thresholds - center points raised for better differentiation
158
+ # Target: val_acc > 0.90, final_loss < 0.15
159
 
160
+ # Primary: Validation accuracy (60% weight)
161
+ # Use steeper sigmoid for sharper differentiation
162
+ acc_score = sigmoid_score(val_acc, center=0.90, steepness=25.0, higher_is_better=True) * 0.60
163
 
164
+ # Secondary: Final loss (30% weight) - must be low
165
+ loss_score = sigmoid_score(final_loss, center=0.15, steepness=15.0, higher_is_better=False) * 0.30
166
+
167
+ # Bonus: Monotonic improvement - must be significant
168
  monotonic_bonus = 0.0
169
  if len(losses) >= 10:
170
+ first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
171
+ last_half = sum(losses[-len(losses)//2:]) / (len(losses)//2)
172
+ improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
173
+ if improvement_ratio > 0.5: # >50% improvement required
174
+ monotonic_bonus = 0.10
175
+ elif improvement_ratio > 0.3:
176
+ monotonic_bonus = 0.05
177
 
178
  final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
179
  breakdown = {
 
193
  Task 2: NaN Loss
194
  Bug: torch.log(pred) when pred can be 0.0 after sigmoid
195
 
196
+ Grading criteria (STRICT - NaN elimination is PRIMARY):
197
+ - ZERO NaN/Inf required (this is the bug!)
198
+ - VAL_ACC > 0.80 required for high score
199
+ - Loss must converge < 0.3
200
  """
201
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
202
  if not valid:
 
213
 
214
  nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
215
 
216
+ # PRIMARY: NO NaN/Inf at ALL - this is THE bug being tested
217
  nan_ratio = nan_count / len(losses)
218
  if nan_count > 0:
219
+ # STRICT: Any NaN = major failure (max 0.25 score)
220
+ return max(0.05, 0.25 * (1.0 - nan_ratio)), {
221
  "reason": "has_nans",
222
  "nan_ratio": nan_ratio,
223
  "nan_count": nan_count
 
225
 
226
  val_acc = parse_scalar(result.stdout, "VAL_ACC")
227
  if val_acc is None:
228
+ return 0.25, {"reason": "no_val_acc_but_no_nans"}
229
 
230
  finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
231
  final_loss = finite_losses[-1] if finite_losses else float('inf')
232
 
233
+ # No NaN = base score of 0.35 (bug is fixed but need to verify quality)
234
+ base_score = 0.35
235
 
236
+ # STRICT: Validation accuracy (40% weight, center at 0.80)
237
+ acc_score = sigmoid_score(val_acc, center=0.80, steepness=20.0, higher_is_better=True) * 0.40
238
 
239
+ # STRICT: Convergence (25% weight, center at 0.25)
240
+ convergence_score = sigmoid_score(final_loss, center=0.25, steepness=10.0, higher_is_better=False) * 0.25
241
 
242
  final_score = min(1.0, base_score + acc_score + convergence_score)
243
  breakdown = {
 
253
 
254
  def grade_task3(result: RunResult) -> tuple[float, dict]:
255
  """
256
+ Task 3: Label Inversion Bug
257
+ Bug: criterion(out, 1 - yb) inverts the labels; it should be criterion(out, yb)
258
+
259
+ Grading criteria (STRICT - accuracy is PRIMARY):
260
+ - VAL_ACC > 0.90 required (buggy code gives ~0.50)
261
+ - FINAL_LOSS < 0.25 required
262
+ - Must show learning trajectory improvement
 
263
  """
264
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
265
  if not valid:
 
276
  val_accs = parse_val_accs(result.stdout)
277
  final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
278
 
279
+ # CRITICAL CHECK: Buggy code produces ~0.50 accuracy (random)
280
+ # Fixed code should produce >0.90 accuracy
281
+
 
 
 
 
 
 
 
282
  acc_score = 0.0
283
  final_acc = 0.0
284
  early_acc = 0.0
285
  trajectory_bonus = 0.0
286
 
287
+ if not val_accs or len(val_accs) < 2:
288
+ return 0.15, {"reason": "no_val_accs_parsed"}
289
+
290
+ early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
291
+ final_acc = val_accs[-1]
292
+
293
+ # STRICT: Final accuracy must be high (>0.90 target)
294
+ # The bug makes accuracy ~0.50, so anything <0.65 is treated as likely unfixed
295
+ if final_acc < 0.65:
296
+ return 0.15, {
297
+ "reason": "accuracy_too_low_likely_unfixed",
298
+ "final_acc": final_acc,
299
+ "expected": ">0.90 for fixed code"
300
+ }
301
+
302
+ # Primary: Final accuracy (60% weight, center at 0.92)
303
+ acc_score = sigmoid_score(final_acc, center=0.92, steepness=30.0, higher_is_better=True) * 0.60
304
+
305
+ # Secondary: Loss convergence (25% weight)
306
+ loss_score = 0.0
307
+ if final_loss_val is not None:
308
+ loss_score = sigmoid_score(final_loss_val, center=0.20, steepness=12.0, higher_is_better=False) * 0.25
309
+
310
+ # Bonus: Learning trajectory (15% weight)
311
+ if len(val_accs) >= 5:
312
+ improvement = final_acc - early_acc
313
+ if improvement > 0.15: # Significant learning
314
+ trajectory_bonus = 0.15
315
+ elif improvement > 0.05:
316
+ trajectory_bonus = 0.08
317
+ elif improvement > 0.0:
318
+ trajectory_bonus = 0.03
319
+
320
+ final_score = min(1.0, acc_score + loss_score + trajectory_bonus)
321
  breakdown = {
 
322
  "acc_score": round(acc_score, 4),
323
+ "loss_score": round(loss_score, 4),
324
  "trajectory_bonus": round(trajectory_bonus, 4),
325
  "early_acc": round(early_acc, 4),
326
  "final_acc": round(final_acc, 4),
 
334
  Task 4: Wrong Loss (Multi-label Classification)
335
  Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
336
 
337
+ Grading criteria (STRICT):
338
+ - F1 > 0.70 required (buggy code gives ~0.2-0.3)
339
+ - avg_labels > 1.2 required (proper multi-hot predictions)
340
+ - Loss must converge < 0.4
341
  """
342
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
343
  if not valid:
 
353
  avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
354
  f1 = parse_scalar(result.stdout, "F1_SCORE")
355
 
356
+ # CRITICAL: Check for multi-label behavior
357
+ # With CrossEntropyLoss, model predicts only 1 label per sample (avg_labels ≈ 1.0)
358
+ # With BCEWithLogitsLoss, model should predict multiple (avg_labels > 1.0)
359
+
360
+ if avg_labels is not None and avg_labels < 0.8:
361
+ return 0.15, {
362
+ "reason": "too_few_labels_single_label_behavior",
363
+ "avg_labels": avg_labels,
364
+ "expected": ">1.2 for multi-label"
365
+ }
366
+
367
+ # STRICT: F1 score - PRIMARY metric (55% weight)
368
  f1_score_val = 0.0
369
  if f1 is not None:
370
+ if f1 < 0.40:
371
+ # Very low F1 indicates bug not fixed
372
+ return 0.20, {
373
+ "reason": "f1_too_low_likely_unfixed",
374
+ "f1": f1,
375
+ "expected": ">0.70 for fixed code"
376
+ }
377
+ f1_score_val = sigmoid_score(f1, center=0.70, steepness=15.0, higher_is_better=True) * 0.55
378
+ else:
379
+ return 0.15, {"reason": "no_f1_score_parsed"}
380
 
381
+ # Multi-label check: avg_labels (25% weight)
 
382
  labels_score = 0.0
383
  if avg_labels is not None:
384
+ if avg_labels >= 1.3:
385
+ labels_score = 0.25 # Full score for proper multi-label
 
386
  elif avg_labels >= 1.0:
387
+ labels_score = 0.15 # Partial - borderline multi-label
 
388
  else:
389
+ labels_score = sigmoid_score(avg_labels, center=1.0, steepness=8.0, higher_is_better=True) * 0.15
 
390
 
391
+ # Loss convergence (20% weight)
392
  loss_score = 0.0
393
  if final_loss is not None:
394
+ loss_score = sigmoid_score(final_loss, center=0.35, steepness=8.0, higher_is_better=False) * 0.20
395
 
396
  final_score = min(1.0, f1_score_val + labels_score + loss_score)
397
  breakdown = {
 
411
  Bug: Backbone is frozen but still passed to optimizer (wastes memory)
412
 
413
  Valid fixes:
414
+ 1. Unfreeze backbone -> grad_norm > 0
415
+ 2. Only pass head params to optimizer -> param_count < 15000
416
 
417
+ Buggy state: grad_norm = 0, param_count = 530442
418
 
419
+ Grading criteria (STRICT - binary fix detection):
420
+ - Must demonstrate ONE of the two valid fixes
421
+ - Loss must be reasonable (<3.0 for CrossEntropy on 10 classes)
422
  """
423
  valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
424
  if not valid:
 
434
  grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
435
  param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
436
 
437
+ # Detect fix type FIRST
 
 
 
 
 
 
 
 
438
  fix_score = 0.0
439
  fix_type = "none"
440
 
441
+ # Fix 1: Unfreeze backbone (grad_norm > 0)
442
+ if grad_norm is not None and grad_norm > 0.01:
443
+ fix_score = 0.70
444
  fix_type = "unfrozen"
445
+ # Fix 2: Only head params (param_count should be ~5130 for Linear(512, 10))
446
+ elif param_count is not None and param_count < 15000:
447
+ fix_score = 0.70
448
  fix_type = "head_only"
449
+ # Buggy state: frozen (grad_norm=0) but full params (>500000)
450
+ elif grad_norm is not None and grad_norm == 0.0:
451
+ if param_count is not None and param_count > 100000:
452
+ return 0.10, {
453
+ "reason": "buggy_state_unchanged",
454
+ "grad_norm": grad_norm,
455
+ "param_count": param_count,
456
+ "hint": "Either unfreeze backbone or only pass head params to optimizer"
457
+ }
458
+
459
+ if fix_score == 0.0:
460
+ return 0.15, {
461
+ "reason": "could_not_detect_valid_fix",
462
+ "grad_norm": grad_norm,
463
+ "param_count": param_count
464
+ }
465
+
466
+ # Loss should be reasonable (30% weight)
467
+ loss_score = 0.0
468
+ if final_loss is not None:
469
+ loss_score = sigmoid_score(final_loss, center=2.5, steepness=3.0, higher_is_better=False) * 0.30
470
 
471
  final_score = min(1.0, loss_score + fix_score)
472
  breakdown = {
 
480
  return final_score, breakdown
481
 
482
 
483
+ def grade_task6(result: RunResult) -> tuple[float, dict]:
484
+ """
485
+ Task 6: Input-Output Mismatch (Multiple Bugs)
486
+
487
+ Bugs to fix:
488
+ 1. Shape mismatch: 32x32 images but model expects 28x28
489
+ 2. Channel order: HWC format but model expects CHW
490
+ 3. Label encoding: One-hot labels but CrossEntropyLoss expects indices
491
+ 4. Batch dimension: Single sample missing batch dim in validation
492
+
493
+ Anti-gaming measures:
494
+ - Must have actual CNN training (convolutions detected in code)
495
+ - Must show learning trajectory (loss decrease)
496
+ - Must have reasonable epoch count (>= 20 expected; fewer than 15 parsed loss values is rejected)
497
+ - Penalize hardcoded metrics or unrealistic outputs
498
+ - Check for actual tensor operations (permute, reshape, etc.)
499
+ """
500
+ valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task6")
501
+ if not valid:
502
+ return 0.0, {"reason": reason}
503
+
504
+ if result.timed_out:
505
+ return 0.05, {"reason": "timed_out"}
506
+
507
+ if result.exit_code != 0:
508
+ # Check for specific error types that indicate partial fixes
509
+ stderr_lower = result.stderr.lower()
510
+ if "shape" in stderr_lower or "size" in stderr_lower:
511
+ return 0.10, {"reason": "shape_error_unfixed", "stderr": result.stderr[:500]}
512
+ if "dimension" in stderr_lower or "dim" in stderr_lower:
513
+ return 0.10, {"reason": "dimension_error_unfixed", "stderr": result.stderr[:500]}
514
+ if "expected" in stderr_lower and "got" in stderr_lower:
515
+ return 0.10, {"reason": "type_mismatch_unfixed", "stderr": result.stderr[:500]}
516
+ return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
517
+
518
+ code = result.fixed_code
519
+
520
+ # ANTI-GAMING: Check for genuine CNN architecture (not replaced with fake output)
521
+ has_conv = "Conv2d" in code or "conv2d" in code
522
+ has_training_loop = "optimizer.step()" in code
523
+ has_model_forward = "model(" in code
524
+
525
+ if not has_conv:
526
+ return 0.05, {"reason": "gaming_no_convolution", "hint": "Original CNN architecture must be preserved"}
527
+ if not has_training_loop:
528
+ return 0.05, {"reason": "gaming_no_training", "hint": "Must have actual training loop"}
529
+ if not has_model_forward:
530
+ return 0.05, {"reason": "gaming_no_forward", "hint": "Must use model for inference"}
531
+
532
+ # Parse metrics
533
+ losses = parse_losses(result.stdout)
534
+ val_acc = parse_scalar(result.stdout, "VAL_ACC")
535
+ final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
536
+
537
+ # ANTI-GAMING: Check for hardcoded/faked metrics
538
+ if "print('VAL_ACC:0.9" in code or "print(\"VAL_ACC:0.9" in code:
539
+ return 0.05, {"reason": "gaming_hardcoded_metrics"}
540
+
541
+ # ANTI-GAMING: Require reasonable number of loss values (epoch count)
542
+ if len(losses) < 15:
543
+ return 0.15, {"reason": "too_few_epochs", "epoch_count": len(losses), "expected": ">=20"}
544
+
545
+ # ANTI-GAMING: Loss should show learning (not flat or random)
546
+ if len(losses) >= 10:
547
+ first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
548
+ last_quarter = sum(losses[-len(losses)//4:]) / (len(losses)//4)
549
+
550
+ if first_quarter <= last_quarter:
551
+ return 0.20, {
552
+ "reason": "no_learning_trajectory",
553
+ "first_quarter_loss": round(first_quarter, 4),
554
+ "last_quarter_loss": round(last_quarter, 4),
555
+ "hint": "Loss should decrease during training"
556
+ }
557
+
558
+ # ANTI-GAMING: Check for unrealistic perfect scores with no learning
559
+ # High accuracy is OK if there's a valid learning trajectory
560
+ if val_acc is not None and val_acc > 0.99:
561
+ # Only flag if loss didn't converge properly (suggests hardcoded output)
562
+ if final_loss is not None and final_loss > 0.1:
563
+ return 0.25, {"reason": "unrealistic_accuracy_no_convergence", "val_acc": val_acc, "final_loss": final_loss}
564
+
565
+ if val_acc is None:
566
+ return 0.15, {"reason": "no_val_acc_parsed"}
567
+
568
+ if final_loss is None:
569
+ return 0.15, {"reason": "no_final_loss_parsed"}
570
+
571
+ # Check for NaN/Inf in losses
572
+ nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
573
+ if nan_count > 0:
574
+ return 0.10, {"reason": "nan_in_losses", "nan_count": nan_count}
575
+
576
+ # ====== BUG FIX DETECTION ======
577
+ bug_fixes_detected = 0
578
+ fix_details = {}
579
+
580
+ # Bug 1: Shape fix - check for resize or architecture change
581
+ has_resize = any(kw in code for kw in ["resize", "interpolate", "F.adaptive", "8 * 8", "8*8"])
582
+ has_28_in_data = "28, 28" in code or "28,28" in code
583
+ if has_resize or has_28_in_data:
584
+ bug_fixes_detected += 1
585
+ fix_details["shape_fix"] = True
586
+ else:
587
+ fix_details["shape_fix"] = False
588
+
589
+ # Bug 2: Channel order fix - check for permute/transpose OR data created in CHW format
590
+ has_permute = "permute" in code or "transpose" in code or "contiguous" in code
591
+ has_channel_reorder = ".permute(0, 3, 1, 2)" in code or "permute(0,3,1,2)" in code
592
+ # Alternative fix: data created directly in CHW format (n_samples, 1, H, W)
593
+ has_chw_data = any(pat in code for pat in ["n_samples, 1, 28", "n_samples, 1, 32", "(n_samples, 1,"])
594
+ if has_permute or has_channel_reorder or has_chw_data:
595
+ bug_fixes_detected += 1
596
+ fix_details["channel_fix"] = True
597
+ else:
598
+ fix_details["channel_fix"] = False
599
+
600
+ # Bug 3: Label encoding fix - check for argmax or returning indices
601
+ has_label_fix = any(kw in code for kw in [
602
+ "argmax", "class_indices", "torch.arange",
603
+ "labels.long()", "y.long()", "remove one_hot"
604
+ ])
605
+ # Also check if one_hot is removed from generate_data
606
+ no_onehot = not any("one_hot" in ln.split("#", 1)[0] for ln in code.splitlines())
607
+ if has_label_fix or no_onehot:
608
+ bug_fixes_detected += 1
609
+ fix_details["label_fix"] = True
610
+ else:
611
+ fix_details["label_fix"] = False
612
+
613
+ # Bug 4: Batch dimension fix - check for unsqueeze on single sample
614
+ has_batch_fix = any(kw in code for kw in ["unsqueeze(0)", "unsqueeze( 0)", "[None,", "[None ,"])
615
+ if has_batch_fix:
616
+ bug_fixes_detected += 1
617
+ fix_details["batch_fix"] = True
618
+ else:
619
+ fix_details["batch_fix"] = False
620
+
621
+ # ====== SCORING ======
622
+ # Base score from bug fixes (40% weight - 10% per bug)
623
+ bug_fix_score = 0.10 * bug_fixes_detected
624
+
625
+ # Accuracy score (35% weight) - strict threshold
626
+ # With 5 classes, random is 20%, buggy is ~20-30%, fixed should be >80%
627
+ if val_acc < 0.50:
628
+ # Below 50% suggests not all bugs fixed
629
+ acc_score = 0.0
630
+ acc_penalty_reason = "accuracy_too_low"
631
+ else:
632
+ acc_score = sigmoid_score(val_acc, center=0.82, steepness=20.0, higher_is_better=True) * 0.35
633
+ acc_penalty_reason = None
634
+
635
+ # Loss convergence score (15% weight)
636
+ loss_score = sigmoid_score(final_loss, center=0.40, steepness=8.0, higher_is_better=False) * 0.15
637
+
638
+ # Learning trajectory bonus (10% weight)
639
+ trajectory_bonus = 0.0
640
+ if len(losses) >= 10:
641
+ first_half = sum(losses[:len(losses)//2]) / (len(losses)//2)
642
+ last_half = sum(losses[-(len(losses)//2):]) / (len(losses)//2)
643
+ improvement_ratio = (first_half - last_half) / first_half if first_half > 0 else 0
644
+ if improvement_ratio > 0.5:
645
+ trajectory_bonus = 0.10
646
+ elif improvement_ratio > 0.3:
647
+ trajectory_bonus = 0.05
648
+
649
+ final_score = min(1.0, bug_fix_score + acc_score + loss_score + trajectory_bonus)
650
+
651
+ breakdown = {
652
+ "bug_fix_score": round(bug_fix_score, 4),
653
+ "bugs_fixed": bug_fixes_detected,
654
+ "fix_details": fix_details,
655
+ "acc_score": round(acc_score, 4),
656
+ "loss_score": round(loss_score, 4),
657
+ "trajectory_bonus": round(trajectory_bonus, 4),
658
+ "val_acc": val_acc,
659
+ "final_loss": final_loss,
660
+ "epoch_count": len(losses),
661
+ }
662
+
663
+ if acc_penalty_reason:
664
+ breakdown["acc_penalty_reason"] = acc_penalty_reason
665
+
666
+ return final_score, breakdown
667
+
668
+
669
  def score_task(task_id: str, result: RunResult) -> tuple[float, dict]:
670
  graders = {
671
  "task1": grade_task1,
 
673
  "task3": grade_task3,
674
  "task4": grade_task4,
675
  "task5": grade_task5,
676
+ "task6": grade_task6,
677
  }
678
  if task_id not in graders:
679
  raise ValueError(f"Unknown task_id: {task_id}")
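The `grade_task6` hunk above calls `sigmoid_score(value, center, steepness, higher_is_better)`, but the body of that helper is not part of this diff. The following is a minimal sketch, assuming it is a plain logistic curve that returns 0.5 at `center`; the real helper may differ. `improvement_ratio` is a hypothetical name for the first-half versus last-half comparison used for the trajectory bonus.

```python
import math


def sigmoid_score(value: float, center: float, steepness: float,
                  higher_is_better: bool = True) -> float:
    """Hypothetical stand-in: logistic curve that equals 0.5 at `center`."""
    x = steepness * (value - center)
    if not higher_is_better:
        x = -x
    # Clamp the exponent so extreme inputs cannot overflow math.exp.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def improvement_ratio(losses: list[float]) -> float:
    """Relative drop from the mean of the first half to the mean of the last half."""
    half = len(losses) // 2
    if half == 0:
        return 0.0
    first = sum(losses[:half]) / half
    last = sum(losses[-half:]) / half  # exactly `half` items, also for odd lengths
    return (first - last) / first if first > 0 else 0.0


# Weighted terms as in the hunk: accuracy carries 35%, final loss carries 15%.
acc_term = sigmoid_score(0.90, center=0.82, steepness=20.0) * 0.35
loss_term = sigmoid_score(0.20, center=0.40, steepness=8.0,
                          higher_is_better=False) * 0.15
```

Under this assumed curve, a `val_acc` exactly at the center earns half of the accuracy weight, and the steepness controls how quickly the score saturates on either side.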
server/tasks/task3_oom_leakage.py CHANGED
@@ -1,7 +1,11 @@
1
  TASK_DESCRIPTION = """
2
  This binary classification trainer has a bug causing validation accuracy around 50%.
3
- Fix the bug. After 20 epochs: VAL_ACC > 0.90, FINAL_LOSS < 0.3.
4
  Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
5
  """
6
 
7
  BUGGY_CODE = """
 
1
  TASK_DESCRIPTION = """
2
  This binary classification trainer has a bug causing validation accuracy around 50%.
3
+ The bug inverts the labels during training. Fix it so after 20 epochs:
4
+ - VAL_ACC > 0.90 (the primary goal)
5
+ - FINAL_LOSS < 0.3
6
+
7
  Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
8
+ Wrap output in ##METRICS_START## and ##METRICS_END##
9
  """
10
 
11
  BUGGY_CODE = """
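The updated task description asks submissions to wrap their metrics in `##METRICS_START##` and `##METRICS_END##`. The grader's actual parser is not shown in this diff, so the following is only a sketch of how such a block could be read. `parse_metrics` is a hypothetical name, and the `KEY:value` line format is taken from the task description.

```python
import ast
import re

START, END = "##METRICS_START##", "##METRICS_END##"


def parse_metrics(stdout: str) -> dict:
    """Hypothetical parser: read KEY:value lines found between the two markers."""
    match = re.search(re.escape(START) + r"(.*?)" + re.escape(END), stdout, re.DOTALL)
    if match is None:
        return {}
    metrics = {}
    for line in match.group(1).splitlines():
        key, sep, raw = line.strip().partition(":")
        if not sep:
            continue  # blank or malformed line
        try:
            metrics[key] = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            metrics[key] = None  # e.g. "nan" is not a Python literal
    return metrics


sample = "\n".join([
    "noise before the block",
    START,
    "VAL_ACCS:[0.52, 0.81, 0.93]",
    "FINAL_LOSS:0.21",
    END,
])
parsed = parse_metrics(sample)
```

Restricting the parse to the marked block keeps stray training logs from being mistaken for metrics, which is presumably why the markers were added to the description.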
server/tasks/task6_io_mismatch.py ADDED
@@ -0,0 +1,124 @@
1
+ TASK_DESCRIPTION = """
2
+ This image classification script has multiple input-output mismatch bugs that cause
3
+ silent failures or crashes. The model is a simple CNN trained on synthetic "images".
4
+
5
+ There are 4 BUGS to fix:
6
+ 1. Shape mismatch: The model expects 28x28 images but data generator creates 32x32
7
+ 2. Channel order mismatch: Model expects CHW but data is HWC format
8
+ 3. Label encoding mismatch: Model expects class indices but labels are one-hot encoded
9
+ 4. Batch dimension mismatch: A validation step processes unbatched data
10
+
11
+ Fix all bugs so that:
12
+ - Training runs without errors for 30 epochs
13
+ - VAL_ACC > 0.85
14
+ - FINAL_LOSS < 0.5
15
+
16
+ Print as: LOSSES:[l1,l2,...], VAL_ACC:X.XX, FINAL_LOSS:X.XX
17
+ Wrap output in ##METRICS_START## and ##METRICS_END##
18
+ """
19
+
20
+ BUGGY_CODE = """
21
+ import torch
22
+ import torch.nn as nn
23
+ import torch.nn.functional as F
24
+ from torch.utils.data import DataLoader, TensorDataset
25
+
26
+ torch.manual_seed(42)
27
+
28
+ NUM_CLASSES = 5
29
+ BATCH_SIZE = 32
30
+ EPOCHS = 30
31
+
32
+ # BUG 1: Create 32x32 images but model expects 28x28
33
+ # Generate synthetic image data (HWC format - common from PIL/OpenCV)
34
+ def generate_data(n_samples):
35
+ # Creates images in HWC format (Height x Width x Channels)
36
+ images = torch.randn(n_samples, 32, 32, 1) # BUG: Wrong size & HWC format
37
+ # Each class has a distinct pattern based on mean pixel value region
38
+ class_indices = torch.randint(0, NUM_CLASSES, (n_samples,))
39
+ for i, c in enumerate(class_indices):
40
+ images[i] += c * 0.5 # Add class-dependent offset
41
+
42
+ # BUG 3: Return one-hot labels instead of class indices
43
+ labels = F.one_hot(class_indices, NUM_CLASSES).float()
44
+ return images, labels
45
+
46
+ X_train, y_train = generate_data(800)
47
+ X_val, y_val = generate_data(200)
48
+
49
+ # BUG 2: Model expects CHW format (Channels x Height x Width) and 28x28 images
50
+ class SimpleCNN(nn.Module):
51
+ def __init__(self):
52
+ super().__init__()
53
+ # Expecting input: (batch, 1, 28, 28) in CHW format
54
+ self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
55
+ self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
56
+ self.pool = nn.MaxPool2d(2)
57
+ # After two pooling ops on 28x28: 28->14->7, so 7*7*32 = 1568
58
+ self.fc = nn.Linear(7 * 7 * 32, NUM_CLASSES)
59
+
60
+ def forward(self, x):
61
+ # Expects x to be (batch, channels, height, width)
62
+ x = self.pool(F.relu(self.conv1(x))) # -> (batch, 16, 14, 14)
63
+ x = self.pool(F.relu(self.conv2(x))) # -> (batch, 32, 7, 7)
64
+ x = x.view(x.size(0), -1)
65
+ return self.fc(x)
66
+
67
+ model = SimpleCNN()
68
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
69
+ criterion = nn.CrossEntropyLoss() # Expects class indices, not one-hot
70
+
71
+ train_ds = TensorDataset(X_train, y_train)
72
+ train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
73
+
74
+ losses = []
75
+ for epoch in range(EPOCHS):
76
+ model.train()
77
+ epoch_loss = 0.0
78
+ for images, labels in train_loader:
79
+ optimizer.zero_grad()
80
+
81
+ # Missing: permute from HWC to CHW format
82
+ # Missing: resize from 32x32 to 28x28
83
+ outputs = model(images)
84
+
85
+ # BUG: criterion expects class indices but labels are one-hot
86
+ loss = criterion(outputs, labels)
87
+
88
+ loss.backward()
89
+ optimizer.step()
90
+ epoch_loss += loss.item()
91
+ losses.append(epoch_loss / len(train_loader))
92
+
93
+ # Validation
94
+ model.eval()
95
+ with torch.no_grad():
96
+ # BUG 4: Process single sample without batch dimension
97
+ sample = X_val[0] # Shape: (32, 32, 1) - missing batch dim
98
+ single_pred = model(sample) # Will crash: expects (batch, C, H, W)
99
+
100
+ # Full validation (also has format issues)
101
+ val_outputs = model(X_val)
102
+ val_preds = val_outputs.argmax(dim=1)
103
+ val_labels = y_val.argmax(dim=1) # Convert one-hot back to indices for comparison
104
+ val_acc = (val_preds == val_labels).float().mean().item()
105
+
106
+ print('##METRICS_START##')
107
+ print('LOSSES:' + str([round(l, 4) for l in losses]))
108
+ print('VAL_ACC:' + str(round(val_acc, 4)))
109
+ print('FINAL_LOSS:' + str(round(losses[-1], 4)))
110
+ print('##METRICS_END##')
111
+ """
112
+
113
+ GROUND_TRUTH_BUGS = [
114
+ "Shape mismatch: Images are 32x32 but model expects 28x28 - need to resize or fix model architecture",
115
+ "Channel order mismatch: Data is in HWC format but model expects CHW - use .permute(0, 3, 1, 2)",
116
+ "Label encoding mismatch: Labels are one-hot but CrossEntropyLoss expects class indices - use .argmax(dim=1) or change generate_data",
117
+ "Batch dimension mismatch: single sample missing batch dimension - use sample.unsqueeze(0)",
118
+ ]
119
+
120
+ # Expected fixes (for grader reference):
121
+ # 1. Either resize images to 28x28 OR change model to expect 32x32 (fc layer: 8*8*32)
122
+ # 2. Add: images = images.permute(0, 3, 1, 2) # HWC -> CHW
123
+ # 3. Either: labels = class_indices (return indices) OR labels = labels.argmax(dim=1) before criterion
124
+ # 4. Add: sample = sample.unsqueeze(0).permute(0, 3, 1, 2) before model(sample)
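The first expected fix offers two routes: resize the data to 28x28, or keep 32x32 inputs and widen the fully connected layer to `8*8*32`. The arithmetic behind both numbers can be checked without PyTorch. This is a sketch using the standard output-size formula for convolution and pooling windows; `conv2d_out` and `flattened_features` are illustrative names, not part of the repository.

```python
def conv2d_out(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution or pooling window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(side: int, out_channels: int = 32) -> int:
    """Follow conv(3, pad=1) -> pool(2) twice, as in SimpleCNN, then flatten."""
    for _ in range(2):
        side = conv2d_out(side, kernel=3, padding=1)  # 3x3 conv with padding=1 keeps the size
        side = conv2d_out(side, kernel=2, stride=2)   # MaxPool2d(2) halves it
    return side * side * out_channels


print(flattened_features(28))  # 1568, the size nn.Linear(7 * 7 * 32, ...) expects
print(flattened_features(32))  # 2048, what unresized 32x32 inputs would produce
```

Either route removes the shape mismatch; the choice only decides whether the data pipeline or the model definition changes.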
test_api.sh ADDED
@@ -0,0 +1,50 @@
1
+ #!/bin/bash
2
+ # Test HuggingFace Inference Providers API
3
+
4
+ echo "Testing HuggingFace Inference Providers..."
5
+ echo ""
6
+
7
+ HF_TOKEN="${HF_TOKEN:-$(grep -E '^(export +)?HF_TOKEN=' .env 2>/dev/null | head -1 | cut -d= -f2- | tr -d "\"'")}"
8
+
9
+ if [ -z "$HF_TOKEN" ]; then
10
+ echo "❌ HF_TOKEN not set"
11
+ exit 1
12
+ fi
13
+
14
+ echo "Testing with model: Qwen/Qwen2.5-7B-Instruct"
15
+ echo ""
16
+
17
+ response=$(curl -s -w "\nHTTP_CODE:%{http_code}" \
18
+ https://router.huggingface.co/v1/chat/completions \
19
+ -H "Authorization: Bearer $HF_TOKEN" \
20
+ -H "Content-Type: application/json" \
21
+ -d '{
22
+ "model": "Qwen/Qwen2.5-7B-Instruct",
23
+ "messages": [{"role": "user", "content": "Say hello in 5 words"}],
24
+ "max_tokens": 20
25
+ }')
26
+
27
+ http_code=$(echo "$response" | grep "HTTP_CODE" | cut -d: -f2)
28
+ body=$(echo "$response" | sed '/HTTP_CODE/d')
29
+
30
+ if [ "$http_code" = "200" ]; then
31
+ echo "✅ API Test Successful!"
32
+ echo ""
33
+ echo "Response:"
34
+ echo "$body" | python3 -m json.tool 2>/dev/null || echo "$body"
35
+ else
36
+ echo "❌ API Test Failed (HTTP $http_code)"
37
+ echo ""
38
+ echo "Response:"
39
+ echo "$body"
40
+ exit 1
41
+ fi
42
+
43
+ echo ""
44
+ echo "===================="
45
+ echo "✅ Configuration is working!"
46
+ echo "Use this in your .bashrc or .env:"
47
+ echo ""
48
+ echo "export API_BASE_URL=\"https://router.huggingface.co/v1\""
49
+ echo "export MODEL_NAME=\"Qwen/Qwen2.5-7B-Instruct\""
50
+ echo "export HF_TOKEN=\"$HF_TOKEN\""
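`test_api.sh` falls back to the `.env` file when `HF_TOKEN` is not exported. The same lookup can be sketched in Python, assuming `.env` holds `KEY=value` lines with optional quotes and `#` comments. `read_env_value` and `resolve_token` are hypothetical helpers, and `hf_example` is a placeholder rather than a real token.

```python
import os
from typing import Optional


def read_env_value(text: str, key: str) -> Optional[str]:
    """Return the value of `key` from .env-style text, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def resolve_token(env_text: str) -> Optional[str]:
    """Prefer the exported HF_TOKEN; fall back to the .env contents."""
    return os.environ.get("HF_TOKEN") or read_env_value(env_text, "HF_TOKEN")


example = (
    "# local secrets\n"
    "API_BASE_URL=https://router.huggingface.co/v1\n"
    'HF_TOKEN="hf_example"\n'
)
token = read_env_value(example, "HF_TOKEN")
```

Matching on the key name, rather than taking the first non-comment line, keeps the lookup correct when `.env` also defines other variables such as `API_BASE_URL`.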
vaidate-submission.sh ADDED
@@ -0,0 +1,185 @@
1
+ #!/usr/bin/env bash
2
+ #
3
+ # validate-submission.sh — OpenEnv Submission Validator
4
+ #
5
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
6
+ #
7
+ # Prerequisites:
8
+ # - Docker: https://docs.docker.com/get-docker/
9
+ # - openenv-core: pip install openenv-core
10
+ # - curl (usually pre-installed)
11
+ #
12
+ # Run:
13
+ # curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
14
+ #
15
+ # Or download and run locally:
16
+ # chmod +x validate-submission.sh
17
+ # ./validate-submission.sh <ping_url> [repo_dir]
18
+ #
19
+ # Arguments:
20
+ # ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
21
+ # repo_dir Path to your repo (default: current directory)
22
+ #
23
+ # Examples:
24
+ # ./validate-submission.sh https://my-team.hf.space
25
+ # ./validate-submission.sh https://my-team.hf.space ./my-repo
26
+ #
27
+
28
+ set -uo pipefail
29
+
30
+ DOCKER_BUILD_TIMEOUT=600
31
+ if [ -t 1 ]; then
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ BOLD='\033[1m'
36
+ NC='\033[0m'
37
+ else
38
+ RED='' GREEN='' YELLOW='' BOLD='' NC=''
39
+ fi
40
+
41
+ run_with_timeout() {
42
+ local secs="$1"; shift
43
+ if command -v timeout &>/dev/null; then
44
+ timeout "$secs" "$@"
45
+ elif command -v gtimeout &>/dev/null; then
46
+ gtimeout "$secs" "$@"
47
+ else
48
+ "$@" &
49
+ local pid=$!
50
+ ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
51
+ local watcher=$!
52
+ wait "$pid" 2>/dev/null
53
+ local rc=$?
54
+ kill "$watcher" 2>/dev/null
55
+ wait "$watcher" 2>/dev/null
56
+ return $rc
57
+ fi
58
+ }
59
+
60
+ portable_mktemp() {
61
+ local prefix="${1:-validate}"
62
+ mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
63
+ }
64
+
65
+ CLEANUP_FILES=()
66
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
67
+ trap cleanup EXIT
68
+
69
+ PING_URL="${1:-}"
70
+ REPO_DIR="${2:-.}"
71
+
72
+ if [ -z "$PING_URL" ]; then
73
+ printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
74
+ printf "\n"
75
+ printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
76
+ printf " repo_dir Path to your repo (default: current directory)\n"
77
+ exit 1
78
+ fi
79
+
80
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
81
+ printf "Error: directory '%s' not found\n" "${2:-.}"
82
+ exit 1
83
+ fi
84
+ PING_URL="${PING_URL%/}"
85
+ export PING_URL
86
+ PASS=0
87
+
88
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
89
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
90
+ fail() { log "${RED}FAILED${NC} -- $1"; }
91
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
92
+ stop_at() {
93
+ printf "\n"
94
+ printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
95
+ exit 1
96
+ }
97
+
98
+ printf "\n"
99
+ printf "${BOLD}========================================${NC}\n"
100
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
101
+ printf "${BOLD}========================================${NC}\n"
102
+ log "Repo: $REPO_DIR"
103
+ log "Ping URL: $PING_URL"
104
+ printf "\n"
105
+
106
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
107
+
108
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
109
+ CLEANUP_FILES+=("$CURL_OUTPUT")
110
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
111
+ -H "Content-Type: application/json" -d '{}' \
112
+ "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT") || HTTP_CODE="000"
113
+
114
+ if [ "$HTTP_CODE" = "200" ]; then
115
+ pass "HF Space is live and responds to /reset"
116
+ elif [ "$HTTP_CODE" = "000" ]; then
117
+ fail "HF Space not reachable (connection failed or timed out)"
118
+ hint "Check your network connection and that the Space is running."
119
+ hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
120
+ stop_at "Step 1"
121
+ else
122
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
123
+ hint "Make sure your Space is running and the URL is correct."
124
+ hint "Try opening $PING_URL in your browser first."
125
+ stop_at "Step 1"
126
+ fi
127
+
128
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
129
+
130
+ if ! command -v docker &>/dev/null; then
131
+ fail "docker command not found"
132
+ hint "Install Docker: https://docs.docker.com/get-docker/"
133
+ stop_at "Step 2"
134
+ fi
135
+
136
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
137
+ DOCKER_CONTEXT="$REPO_DIR"
138
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
139
+ DOCKER_CONTEXT="$REPO_DIR/server"
140
+ else
141
+ fail "No Dockerfile found in repo root or server/ directory"
142
+ stop_at "Step 2"
143
+ fi
144
+
145
+ log " Found Dockerfile in $DOCKER_CONTEXT"
146
+
147
+ BUILD_OK=false
148
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
149
+
150
+ if [ "$BUILD_OK" = true ]; then
151
+ pass "Docker build succeeded"
152
+ else
153
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
154
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
155
+ stop_at "Step 2"
156
+ fi
157
+
158
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
159
+
160
+ if ! command -v openenv &>/dev/null; then
161
+ fail "openenv command not found"
162
+ hint "Install it: pip install openenv-core"
163
+ stop_at "Step 3"
164
+ fi
165
+
166
+ VALIDATE_OK=false
167
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
168
+
169
+ if [ "$VALIDATE_OK" = true ]; then
170
+ pass "openenv validate passed"
171
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
172
+ else
173
+ fail "openenv validate failed"
174
+ printf "%s\n" "$VALIDATE_OUTPUT"
175
+ stop_at "Step 3"
176
+ fi
177
+
178
+ printf "\n"
179
+ printf "${BOLD}========================================${NC}\n"
180
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
181
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
182
+ printf "${BOLD}========================================${NC}\n"
183
+ printf "\n"
184
+
185
+ exit 0