# WhipStudio Task Documentation
Detailed documentation for all 6 debugging tasks in WhipStudio.
---
## Task Overview
| Task | Difficulty | Bug Count | Key Skills Tested |
|------|------------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |
---
## Task 1: Broken Training Loop
**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop
### Description
A 2-class linear classifier training loop has two bugs preventing convergence.
### Bugs
1. **Learning rate too high** (`lr=10.0` instead of `lr=0.01`)
2. **Wrong optimizer order** (`optimizer.step()` before `loss.backward()`)
### Success Criteria
- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()`
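The standard pattern above can be sketched as follows. This is a minimal illustration, not the task's actual script: the toy data, the `nn.Linear(4, 2)` model, and the epoch count are assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: a 2-class problem separable on feature 0.
X = torch.randn(256, 4)
y = (X[:, 0] > 0).long()

model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # not lr=10.0

losses = []
for epoch in range(200):
    optimizer.zero_grad()          # 1. clear gradients from the previous step
    logits = model(X)              # 2. forward pass
    loss = criterion(logits, y)    # 3. compute the loss
    loss.backward()                # 4. backpropagate to fill .grad
    optimizer.step()               # 5. update weights, only after backward()
    losses.append(loss.item())
```

Calling `optimizer.step()` before `loss.backward()` would apply stale (or missing) gradients, which is why the order matters as much as the learning rate.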
---
## Task 2: Silent NaN Loss
**Difficulty:** 🟡 Medium
**Category:** Numerical Stability
### Description
A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.
### Bugs
1. **Unprotected log(0)** - The loss uses `torch.log(pred)` where `pred` can be 0.0
### Success Criteria
- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Look for `torch.log()` calls
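- Clamp `pred` before taking the log, e.g. `torch.clamp(pred, min=1e-7, max=1 - 1e-7)`, so that neither `log(pred)` nor `log(1 - pred)` receives 0
- Or add a small epsilon inside the log: `torch.log(pred + 1e-7)` (the form `torch.log1p(pred - 1 + 1e-7)` is mathematically the same expression)
- Alternative: use the built-in `F.binary_cross_entropy`, which clamps its log outputs internally (it has no `eps` parameter), or `F.binary_cross_entropy_with_logits` on raw logits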
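A minimal sketch of the clamping approach, assuming the custom loss is a binary cross-entropy computed on probabilities. The function name `safe_bce` and the sample tensors are hypothetical and not taken from the task code.

```python
import torch

def safe_bce(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Binary cross-entropy on probabilities, with log inputs kept away from 0."""
    # Clamp both ends so log(pred) and log(1 - pred) stay finite.
    pred = torch.clamp(pred, min=eps, max=1.0 - eps)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred)).mean()

# Predictions of exactly 0.0 and 1.0 are the cases that break an unprotected log.
pred = torch.tensor([0.0, 1.0, 0.3])
target = torch.tensor([0.0, 1.0, 1.0])
loss = safe_bce(pred, target)
```

Without the clamp, the first sample evaluates `0 * log(0)`, which is `0 * -inf` and therefore NaN, even though the prediction is correct.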
### Common Mistakes
- Only fixing one of the log calls (there may be multiple)
- Using an epsilon that is too small (in float32, `1 - 1e-15` rounds to exactly 1.0, so an upper clamp with it has no effect)
- Removing the loss entirely instead of fixing it
---
## Task 3: OOM + Data Leakage
**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline
### Description
A CNN training script has memory issues and data leakage between train/validation sets.
### Bugs
1. **Graph accumulation** - `total_loss += loss` keeps the computation graph, causing OOM
2. **Data leakage** - Augmentation applied before train/val split, leaking information
### Success Criteria
- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints
- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split data BEFORE applying augmentation
- Apply augmentation only to training set, not validation
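Both fixes can be sketched together. This is an illustration under assumptions: the random tensors, the flip augmentation, and the tiny model are hypothetical stand-ins for the task's real data and CNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical data: 100 single-channel 8x8 "images" with binary labels.
images = torch.randn(100, 1, 8, 8)
labels = torch.randint(0, 2, (100,))

# 1. Split first, so no augmented copy of a validation sample reaches training.
perm = torch.randperm(len(images))
train_idx, val_idx = perm[:80], perm[80:]
train_x, train_y = images[train_idx], labels[train_idx]
val_x, val_y = images[val_idx], labels[val_idx]

# 2. Augment the training split only (here: add horizontally flipped copies).
train_x = torch.cat([train_x, torch.flip(train_x, dims=[3])])
train_y = torch.cat([train_y, train_y])

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

total_loss = 0.0
for start in range(0, len(train_x), 32):
    xb, yb = train_x[start:start + 32], train_y[start:start + 32]
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()  # Python float, so no computation graph is retained
```

Accumulating `loss` itself would keep every batch's graph reachable through `total_loss`, so memory grows with each iteration.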
### Common Mistakes
- Only fixing one bug
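- Using `.detach()` incorrectly (it returns a new tensor rather than modifying `loss` in place, so the result must be the value that is accumulated; `.item()` is the simplest choice for scalar losses)
- Applying different augmentations to train/val while still drawing both from the same augmented source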
---
## Task 4: Wrong Loss Function
**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection
### Description
A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.
### Bugs
1. **Wrong loss function** - CrossEntropyLoss is for single-label, not multi-label
2. **Wrong evaluation** - Predictions should be sigmoid > 0.5, not argmax
### Success Criteria
- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label
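A minimal sketch of the multi-label loss and threshold-based prediction. The logits and targets are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 3 samples, 4 possible labels, raw model outputs (logits).
logits = torch.tensor([[ 2.0, -1.0,  0.5, -3.0],
                       [-2.0,  3.0, -0.5,  1.0],
                       [ 0.1, -0.2, -4.0,  2.5]])
# Multi-hot float targets: each row may contain several 1s.
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0],
                        [1.0, 0.0, 0.0, 1.0]])

criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally
loss = criterion(logits, targets)

# Threshold each label independently instead of taking argmax over labels.
preds = (torch.sigmoid(logits) > 0.5).float()
```

With `argmax`, each row would yield exactly one label, so samples with two or more true labels could never be predicted correctly.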
### Common Mistakes
- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
---
## Task 5: Frozen Backbone
**Difficulty:** 🟡 Medium
**Category:** Transfer Learning
### Description
A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.
### Bugs
1. **Frozen parameters in optimizer** - Backbone is frozen but all params passed to Adam
### Success Criteria
- Either only trainable parameters are passed to the optimizer, or the backbone is unfrozen and training
- Model shows improvement (loss decreases)
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Two Valid Solutions
**Solution A: Only optimize trainable params**
```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```
**Solution B: Unfreeze the backbone**
```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```
### Hints
- Check which parameters have `requires_grad=True`
- Filter parameters when creating optimizer
- Or change `requires_grad` to True for backbone
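One way to check the first hint is to list the trainable parameters and compare them with what the optimizer holds. The `TinyModel` class below is a hypothetical stand-in; only the `backbone` attribute name follows the solutions above.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical model with a backbone and a classification head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = TinyModel()

# Freeze the backbone, as the task setup does.
for param in model.backbone.parameters():
    param.requires_grad = False

# Names of the parameters that will actually receive gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Build the optimizer from trainable parameters only, then count what it holds.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=0.001
)
n_optimized = sum(len(group["params"]) for group in optimizer.param_groups)
```

If `n_optimized` is larger than `len(trainable)`, frozen parameters were passed to the optimizer.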
### Common Mistakes
- Unfreezing after creating an optimizer from filtered parameters (the newly unfrozen params are not in that optimizer)
- Only unfreezing some layers
- Forgetting to also handle bias parameters
---
## Task 6: Input-Output Mismatch
**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture
### Description
A CNN for image classification has multiple bugs related to data format and model architecture mismatches.
### Bugs (4 total)
1. **Image size mismatch** - Data is 32x32, model expects 28x28 (fc layer size wrong)
2. **Channel order** - Data is HWC (Height, Width, Channels), model expects CHW
3. **Label encoding** - Labels are one-hot encoded, CrossEntropyLoss expects class indices
4. **Batch dimension** - Single samples missing batch dimension during validation
### Success Criteria
- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |
### Hints
**Bug 1: Image size**
- Option A: Resize images to 28x28 (`F.interpolate` or change data generation)
- Option B: Adjust fc layer input size (7*7*32 → 8*8*32 for 32x32 input)
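For Option B, the flattened size can be measured with a dummy input rather than computed by hand. The architecture below is an assumption (two conv and 2x2 max-pool stages ending in 32 channels, consistent with the 8*8*32 figure); the real model may differ.

```python
import torch
import torch.nn as nn

# Hypothetical feature extractor: two conv + 2x2 max-pool stages, 32 output channels.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Pass a dummy 32x32 input through the extractor to measure the flattened size.
with torch.no_grad():
    flat_size = features(torch.zeros(1, 3, 32, 32)).flatten(1).shape[1]

fc = nn.Linear(flat_size, 10)  # 10 output classes is an assumption
```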
**Bug 2: Channel order**
```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```
**Bug 3: Label encoding**
```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```
**Bug 4: Batch dimension**
```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```
### Common Mistakes
- Only fixing 1-2 of the 4 bugs
- Applying the fixes in the wrong order (the correct `permute` indices depend on whether the batch dimension is already present)
- Using wrong dimension for permute/argmax
- Forgetting to handle validation differently from training
---
## Grading Philosophy
### Continuous Rewards
WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:
- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds
### Why Continuous?
1. **Better RL training signal** - Agents can learn from small improvements
2. **Differentiates solutions** - Distinguishes between "almost correct" and "completely wrong"
3. **Encourages incremental progress** - Reward shaping guides learning
### Score Interpretation
| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
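The table can be read as a simple lookup from score to band. The helper below is a hypothetical illustration of that mapping, not part of WhipStudio's grader.

```python
def interpret_score(score: float) -> str:
    """Map a continuous score in [0.0, 1.0] to the band names from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.90:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.40:
        return "Partial"
    if score >= 0.15:
        return "Minimal"
    return "Failed"

for s in (0.95, 0.72, 0.50, 0.20, 0.05):
    print(f"{s:.2f} -> {interpret_score(s)}")
```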
---
## Tips for Agents
1. **Read the task description carefully** - It tells you what metrics to print
2. **Keep seed values** - Don't remove `torch.manual_seed()` calls
3. **Print required metrics** - Must output `LOSSES:[...]` and task-specific metrics
4. **Test with tools first** - Use debugging tools before submitting
5. **Fix all bugs** - Partial fixes get partial credit
6. **Don't over-engineer** - Minimal fixes are better than rewrites
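Tip 3 can be sketched as a small helper that prints the metric lines at the end of a script. Only the `LOSSES:[...]` prefix comes from this document; the number formatting and the `VAL_ACC:` line are assumptions, so the task description remains the authority on the exact format.

```python
def format_metrics(losses, val_acc):
    """Build metric lines; everything beyond the `LOSSES:[...]` prefix is an assumed format."""
    loss_str = ", ".join(f"{value:.4f}" for value in losses)
    return [f"LOSSES:[{loss_str}]", f"VAL_ACC:{val_acc:.4f}"]

# Hypothetical values, for illustration only.
for line in format_metrics([0.6931, 0.4120, 0.2503], 0.91):
    print(line)
```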