# WhipStudio Task Documentation

Detailed documentation for all 6 debugging tasks in WhipStudio.

---

## Task Overview

| Task | Difficulty | Bug Count | Key Skills Tested |
|------|------------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |

---

## Task 1: Broken Training Loop

**Difficulty:** 🟢 Easy  
**Category:** Basic Training Loop

### Description
A 2-class linear classifier training loop has two bugs preventing convergence.

### Bugs
1. **Learning rate too high** (`lr=10.0` instead of `lr=0.01`)
2. **Wrong optimizer order** (`optimizer.step()` before `loss.backward()`)

### Success Criteria
- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Hints
- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()`
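
The standard pattern can be sketched as follows. This is a minimal illustration, not the task's actual script: the toy data, the `nn.Linear(4, 2)` model, and the epoch count are assumptions chosen only to make the ordering visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: a 2-class problem separable by the first feature
X = torch.randn(200, 4)
y = (X[:, 0] > 0).long()

model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # not lr=10.0

losses = []
for epoch in range(50):
    optimizer.zero_grad()        # 1. clear gradients from the previous step
    logits = model(X)            # 2. forward pass
    loss = criterion(logits, y)  # 3. compute the loss
    loss.backward()              # 4. compute gradients
    optimizer.step()             # 5. update weights using those gradients
    losses.append(loss.item())
```

Calling `optimizer.step()` before `loss.backward()` applies the update before the current batch's gradients exist, which is why the order matters.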

---

## Task 2: Silent NaN Loss

**Difficulty:** 🟡 Medium  
**Category:** Numerical Stability

### Description
A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.

### Bugs
1. **Unprotected log(0)** - The loss uses `torch.log(pred)` where `pred` can be 0.0

### Success Criteria
- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Hints
- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking log
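- Or add a small epsilon inside the log: `torch.log(pred + 1e-7)`
- Alternative: use the built-in `F.binary_cross_entropy`, which clamps its log terms internally (it has no `eps` parameter), or `F.binary_cross_entropy_with_logits` on raw logits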

### Common Mistakes
- Only fixing one of the log calls (there may be multiple)
- Using too small an epsilon (in float32, `1 - 1e-15` rounds to exactly 1.0, so `log(1 - pred)` stays unprotected)
- Removing the loss entirely instead of fixing it
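
A minimal sketch of the clamping fix, assuming the custom loss is a hand-written binary cross-entropy on probabilities. The function name `safe_bce` and the sample tensors are hypothetical; the task's real loss may be structured differently.

```python
import torch

def safe_bce(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Binary cross-entropy on probabilities with both log calls protected."""
    pred = torch.clamp(pred, min=eps, max=1.0 - eps)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred)).mean()

# Predictions that are exactly 0.0 or 1.0, even when they are correct
pred = torch.tensor([0.0, 1.0, 0.5])
target = torch.tensor([0.0, 1.0, 1.0])

# Unprotected version: 0 * log(0) evaluates to 0 * -inf, which is NaN
naive = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()
safe = safe_bce(pred, target)

print(torch.isnan(naive).item(), torch.isfinite(safe).item())
```

Clamping both ends of the range protects `log(pred)` and `log(1 - pred)` in one step, which avoids the mistake of fixing only one of the log calls.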

---

## Task 3: OOM + Data Leakage

**Difficulty:** 🔴 Hard  
**Category:** Memory Management, Data Pipeline

### Description
A CNN training script has memory issues and data leakage between train/validation sets.

### Bugs
1. **Graph accumulation** - `total_loss += loss` keeps the computation graph, causing OOM
2. **Data leakage** - Augmentation applied before train/val split, leaking information

### Success Criteria
- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |

### Hints
- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split data BEFORE applying augmentation
- Apply augmentation only to training set, not validation

### Common Mistakes
- Only fixing one bug
- Using `.detach()` incorrectly (it returns a new tensor rather than modifying `loss` in place, so accumulate the returned value or use `.item()`)
- Applying different augmentations to train/val but still from same augmented source
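
Both fixes can be sketched together. Everything below is a stand-in under stated assumptions: the random tabular data, the `augment` noise function, and the linear model are hypothetical substitutes for the task's images and CNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in data; the real task uses images and a CNN
X = torch.randn(100, 8)
y = torch.randint(0, 2, (100,))

# 1. Split FIRST, on the raw samples
perm = torch.randperm(len(X))
train_idx, val_idx = perm[:80], perm[80:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# 2. Augment ONLY the training split (hypothetical noise augmentation)
def augment(x: torch.Tensor) -> torch.Tensor:
    return x + 0.05 * torch.randn_like(x)

X_train_aug = torch.cat([X_train, augment(X_train)])
y_train_aug = torch.cat([y_train, y_train])

model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for i in range(0, len(X_train_aug), 32):
    xb, yb = X_train_aug[i:i + 32], y_train_aug[i:i + 32]
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()  # Python float, so no graph is kept alive
```

Because the validation rows are set aside before any augmented copies are made, no augmented variant of a validation sample can end up in the training set.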

---

## Task 4: Wrong Loss Function

**Difficulty:** 🟡 Medium  
**Category:** Loss Function Selection

### Description
A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.

### Bugs
1. **Wrong loss function** - CrossEntropyLoss is for single-label, not multi-label
2. **Wrong evaluation** - Predictions should be sigmoid > 0.5, not argmax

### Success Criteria
- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |

### Hints
- Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label

### Common Mistakes
- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
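
A small sketch of the multi-label setup, assuming raw logits as model output. The logits and targets below are hand-written illustrative values, not data from the task.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 3 samples, 4 possible labels
logits = torch.tensor([[ 2.0, -1.0,  0.5, -3.0],
                       [-2.0,  3.0, -0.5,  1.0],
                       [ 0.1, -0.2, -4.0,  2.5]])
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0],
                        [1.0, 0.0, 0.0, 1.0]])  # float multi-hot, not long

criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally
loss = criterion(logits, targets)

# Threshold-based predictions: each label is decided independently
preds = (torch.sigmoid(logits) > 0.5).float()

# For contrast, argmax can only ever return one label per sample
single_label = logits.argmax(dim=1)
```

Here `preds` recovers every active label in each row, while `single_label` holds exactly one index per sample and so cannot represent a sample with two tags.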

---

## Task 5: Frozen Backbone

**Difficulty:** 🟡 Medium  
**Category:** Transfer Learning

### Description
A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.

### Bugs
1. **Frozen parameters in optimizer** - Backbone is frozen but all params passed to Adam

### Success Criteria
- Either only trainable parameters are passed to the optimizer, or the backbone is unfrozen and trains
- Model shows improvement (loss decreases)

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Two Valid Solutions

**Solution A: Only optimize trainable params**
```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```

**Solution B: Unfreeze the backbone**
```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```

### Hints
- Check which parameters have `requires_grad=True`
- Filter parameters when creating optimizer
- Or change `requires_grad` to True for backbone

### Common Mistakes
- Unfreezing after creating an optimizer from filtered parameters (the newly unfrozen params are not in the optimizer)
- Only unfreezing some layers
- Forgetting to also handle bias parameters
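
One way to check the result of Solution A is to list which parameters are trainable and count what the optimizer actually holds. The `TinyModel` class below is a hypothetical stand-in with a `backbone` and a `head`; the task's real model will differ.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in for a backbone + classification head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = TinyModel()

# Freeze the backbone (weights and biases alike) BEFORE building the optimizer
for p in model.backbone.parameters():
    p.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=0.001
)
n_in_optimizer = sum(len(group["params"]) for group in optimizer.param_groups)

print(trainable, n_in_optimizer)
```

Iterating over `parameters()` covers biases as well as weights, so nothing in the backbone is left trainable by accident.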

---

## Task 6: Input-Output Mismatch

**Difficulty:** 🔴 Hard  
**Category:** Data Preprocessing, CNN Architecture

### Description
A CNN for image classification has multiple bugs related to data format and model architecture mismatches.

### Bugs (4 total)
1. **Image size mismatch** - Data is 32x32, model expects 28x28 (fc layer size wrong)
2. **Channel order** - Data is HWC (Height, Width, Channels), model expects CHW
3. **Label encoding** - Labels are one-hot encoded, CrossEntropyLoss expects class indices
4. **Batch dimension** - Single samples missing batch dimension during validation

### Success Criteria
- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85

### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |

### Hints

**Bug 1: Image size**
- Option A: Resize images to 28x28 (`F.interpolate` or change data generation)
- Option B: Adjust fc layer input size (`7*7*32` → `8*8*32` for 32x32 input)
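
For Option B, the flattened size can be measured with a dummy forward pass instead of computed by hand. The feature extractor below is an assumption: two conv + 2x2 pool stages ending in 32 channels with 3 input channels, chosen only because it reproduces the `7*7*32` and `8*8*32` sizes above.

```python
import torch
import torch.nn as nn

# Hypothetical feature extractor; the task's actual layers may differ
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

def flat_size(side: int) -> int:
    """Number of features per sample after flattening, for a side x side input."""
    with torch.no_grad():
        out = features(torch.zeros(1, 3, side, side))
    return out.flatten(1).shape[1]

size_28 = flat_size(28)  # 7 * 7 * 32
size_32 = flat_size(32)  # 8 * 8 * 32
print(size_28, size_32)
```

The value returned for the real input size is what the first fully connected layer should take as `in_features`.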

**Bug 2: Channel order**
```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```

**Bug 3: Label encoding**
```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```

**Bug 4: Batch dimension**
```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```

### Common Mistakes
- Only fixing 1-2 of the 4 bugs
- Fixing in wrong order (need to be careful with dimensions)
- Using wrong dimension for permute/argmax
- Forgetting to handle validation differently from training

---

## Grading Philosophy

### Continuous Rewards
WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:
- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds

### Why Continuous?
1. **Better RL training signal** - Agents can learn from small improvements
2. **Differentiates solutions** - Distinguishes between "almost correct" and "completely wrong"
3. **Encourages incremental progress** - Reward shaping guides learning

### Score Interpretation
| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
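
The table can be read as a simple lookup. The helper below is a hypothetical illustration of the bands, not part of WhipStudio's grader.

```python
def interpret_score(score: float) -> str:
    """Map a score in [0.0, 1.0] to the band names from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.90:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.40:
        return "Partial"
    if score >= 0.15:
        return "Minimal"
    return "Failed"

print(interpret_score(0.75))
```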

---

## Tips for Agents

1. **Read the task description carefully** - It tells you what metrics to print
2. **Keep seed values** - Don't remove `torch.manual_seed()` calls
3. **Print required metrics** - Must output `LOSSES:[...]` and task-specific metrics
4. **Test with tools first** - Use debugging tools before submitting
5. **Fix all bugs** - Partial fixes get partial credit
6. **Don't over-engineer** - Minimal fixes are better than rewrites
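
A sketch of printing metrics at the end of a script. Only the `LOSSES:[...]` prefix comes from the tips above; the `VAL_ACC:` line format and the numbers are placeholders assumed for illustration, so check each task description for the exact metric names it requires.

```python
# Placeholder values for illustration only; a real script would collect
# these during training and validation
losses = [0.69, 0.52, 0.41, 0.33]
val_acc = 0.91

losses_line = "LOSSES:" + str([round(value, 4) for value in losses])
val_line = f"VAL_ACC:{val_acc:.4f}"  # assumed format for a task-specific metric

print(losses_line)
print(val_line)
```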