# WhipStudio Task Documentation
Detailed documentation for all 6 debugging tasks in WhipStudio.
---
## Task Overview
| Task | Difficulty | Bug Count | Key Skills Tested |
|------|------------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |
---
## Task 1: Broken Training Loop
**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop
### Description
A 2-class linear classifier training loop has two bugs preventing convergence.
### Bugs
1. **Learning rate too high** (`lr=10.0` instead of `lr=0.01`)
2. **Wrong optimizer order** (`optimizer.step()` before `loss.backward()`)
### Success Criteria
- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()`
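A minimal sketch of this pattern on toy data (the model, data, and learning rate here are illustrative, not the task's actual code):
```python
import torch
import torch.nn as nn

# Toy 2-class linear classifier on random data, purely to show the order of calls
torch.manual_seed(0)
x, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # sane learning rate

for epoch in range(20):
    optimizer.zero_grad()        # 1. clear old gradients
    logits = model(x)            # 2. forward pass
    loss = criterion(logits, y)  # 3. compute loss
    loss.backward()              # 4. backprop BEFORE stepping
    optimizer.step()             # 5. update weights last
```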
---
## Task 2: Silent NaN Loss
**Difficulty:** 🟡 Medium
**Category:** Numerical Stability
### Description
A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.
### Bugs
1. **Unprotected log(0)** - The loss uses `torch.log(pred)` where `pred` can be 0.0
### Success Criteria
- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking log
- Or add a small epsilon inside the log, e.g. `torch.log(pred + 1e-7)`
- Alternative: use the built-in `F.binary_cross_entropy`, which clamps its log outputs internally, or `F.binary_cross_entropy_with_logits` on raw logits
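A minimal sketch of the clamping approach, assuming a hand-rolled BCE-style loss over probabilities in `[0, 1]` (variable names are illustrative):
```python
import torch

def safe_bce(pred, target, eps=1e-7):
    # Clamp both ends so neither log(pred) nor log(1 - pred) ever sees 0
    pred = torch.clamp(pred, min=eps, max=1.0 - eps)
    return -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

pred = torch.tensor([0.0, 0.5, 1.0])   # includes the values that break an unprotected log
target = torch.tensor([0.0, 1.0, 1.0])
print(safe_bce(pred, target))          # finite loss, no NaN/inf
```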
### Common Mistakes
- Only fixing one of the log calls (there may be multiple)
- Using too small an epsilon (in float32, `1 - 1e-15` rounds back to `1.0`, so `log(1 - pred)` can still hit zero)
- Removing the loss entirely instead of fixing it
---
## Task 3: OOM + Data Leakage
**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline
### Description
A CNN training script has memory issues and data leakage between train/validation sets.
### Bugs
1. **Graph accumulation** - `total_loss += loss` keeps the computation graph, causing OOM
2. **Data leakage** - Augmentation applied before train/val split, leaking information
### Success Criteria
- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (achievable once train/val are properly separated)
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints
- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split data BEFORE applying augmentation
- Apply augmentation only to training set, not validation
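A minimal sketch combining both fixes, with random tensors standing in for the real dataset and a linear layer standing in for the CNN (names, sizes, and the augmentation are illustrative):
```python
import torch
import torch.nn as nn

images = torch.randn(1000, 3 * 32 * 32)
labels = torch.randint(0, 10, (1000,))

# 1. Split FIRST, so no augmented copy of a validation sample can leak into training
train_x, val_x = images[:800], images[800:]
train_y, val_y = labels[:800], labels[800:]

# 2. Augment only the training split (placeholder augmentation: small Gaussian noise)
train_x = train_x + 0.01 * torch.randn_like(train_x)

model = nn.Linear(3 * 32 * 32, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for step in range(10):
    xb, yb = train_x[step * 80:(step + 1) * 80], train_y[step * 80:(step + 1) * 80]
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    # 3. Accumulate a Python float so each step's graph is freed (no memory growth)
    total_loss += loss.item()
```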
### Common Mistakes
- Only fixing one bug
- Using `.detach()` but then treating the result as a plain number (call `.item()` when a Python float is needed)
- Applying different augmentations to train/val but still from same augmented source
---
## Task 4: Wrong Loss Function
**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection
### Description
A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.
### Bugs
1. **Wrong loss function** - CrossEntropyLoss is for single-label, not multi-label
2. **Wrong evaluation** - Predictions should be sigmoid > 0.5, not argmax
### Success Criteria
- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label
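A minimal sketch of the multi-label setup, assuming the model emits raw logits and the targets are multi-hot (shapes are illustrative):
```python
import torch
import torch.nn as nn

logits = torch.randn(4, 5)                     # raw model outputs: 4 samples, 5 labels
targets = torch.randint(0, 2, (4, 5)).float()  # multi-hot targets must be float

criterion = nn.BCEWithLogitsLoss()             # applies sigmoid internally
loss = criterion(logits, targets)

# Threshold-based evaluation instead of argmax
preds = (torch.sigmoid(logits) > 0.5).float()
print(loss.item(), preds.shape)                # scalar loss, predictions shaped [4, 5]
```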
### Common Mistakes
- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
---
## Task 5: Frozen Backbone
**Difficulty:** 🟡 Medium
**Category:** Transfer Learning
### Description
A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.
### Bugs
1. **Frozen parameters in optimizer** - Backbone is frozen but all params passed to Adam
### Success Criteria
- Only trainable parameters in optimizer, OR
- Backbone unfrozen and training
- Model shows improvement (loss decreases)
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Two Valid Solutions
**Solution A: Only optimize trainable params**
```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```
**Solution B: Unfreeze the backbone**
```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```
### Hints
- Check which parameters have `requires_grad=True`
- Filter parameters when creating optimizer
- Or change `requires_grad` to True for backbone
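A minimal diagnostic sketch for the first hint, using a toy model in place of the real backbone/head:
```python
import torch.nn as nn

# Toy stand-in: treat the first Linear as the "backbone" and freeze it
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
for p in model[0].parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print("trainable:", trainable)   # ['2.weight', '2.bias']
print("frozen:   ", frozen)      # ['0.weight', '0.bias']
```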
### Common Mistakes
- Unfreezing after the optimizer has been created (the optimizer's parameter list is fixed at creation, so a filtered optimizer will not pick up newly unfrozen parameters)
- Only unfreezing some layers
- Forgetting to also handle bias parameters
---
## Task 6: Input-Output Mismatch
**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture
### Description
A CNN for image classification has multiple bugs related to data format and model architecture mismatches.
### Bugs (4 total)
1. **Image size mismatch** - Data is 32x32, model expects 28x28 (fc layer size wrong)
2. **Channel order** - Data is HWC (Height, Width, Channels), model expects CHW
3. **Label encoding** - Labels are one-hot encoded, CrossEntropyLoss expects class indices
4. **Batch dimension** - Single samples missing batch dimension during validation
### Success Criteria
- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85
### Grading Rubric
| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |
### Hints
**Bug 1: Image size**
- Option A: Resize images to 28x28 (`F.interpolate` or change data generation)
- Option B: Adjust fc layer input size (7*7*32 → 8*8*32 for 32x32 input)
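A sketch of Option A, assuming the images are already a float tensor in `[N, C, H, W]` layout:
```python
import torch
import torch.nn.functional as F

images = torch.randn(4, 3, 32, 32)   # batch of 32x32 images
resized = F.interpolate(images, size=(28, 28), mode="bilinear", align_corners=False)
print(resized.shape)                 # torch.Size([4, 3, 28, 28])
```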
**Bug 2: Channel order**
```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2) # [N, H, W, C] → [N, C, H, W]
```
**Bug 3: Label encoding**
```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```
**Bug 4: Batch dimension**
```python
# Add batch dimension for single samples
x = x.unsqueeze(0) # [C, H, W] → [1, C, H, W]
```
### Common Mistakes
- Only fixing 1-2 of the 4 bugs
- Fixing in wrong order (need to be careful with dimensions)
- Using wrong dimension for permute/argmax
- Forgetting to handle validation differently from training
---
## Grading Philosophy
### Continuous Rewards
WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:
- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds
### Why Continuous?
1. **Better RL training signal** - Agents can learn from small improvements
2. **Differentiates solutions** - Distinguishes between "almost correct" and "completely wrong"
3. **Encourages incremental progress** - Reward shaping guides learning
### Score Interpretation
| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
---
## Tips for Agents
1. **Read the task description carefully** - It tells you what metrics to print
2. **Keep seed values** - Don't remove `torch.manual_seed()` calls
3. **Print required metrics** - Must output `LOSSES:[...]` and task-specific metrics
4. **Test with tools first** - Use debugging tools before submitting
5. **Fix all bugs** - Partial fixes get partial credit
6. **Don't over-engineer** - Minimal fixes are better than rewrites
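For tip 3, one possible way to emit the loss line, assuming the grader looks for a `LOSSES:` prefix followed by a Python-style list (the exact parsing rules are not specified here):
```python
# Losses collected during training; the values here are placeholders
losses = [0.92, 0.51, 0.27]
print(f"LOSSES:{losses}")   # prints: LOSSES:[0.92, 0.51, 0.27]
```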