# WhipStudio Task Documentation
Detailed documentation for all 6 debugging tasks in WhipStudio.
## Task Overview
| Task | Difficulty | Bug Count | Key Skills Tested |
|---|---|---|---|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |
## Task 1: Broken Training Loop
**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop
### Description
A 2-class linear classifier training loop has two bugs preventing convergence.
### Bugs
- Learning rate too high (`lr=10.0` instead of `lr=0.01`)
- Wrong optimizer order (`optimizer.step()` before `loss.backward()`)
### Success Criteria
- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()` (see the sketch below)
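A minimal sketch of the corrected pattern, using a hypothetical linear model and toy data rather than the task's actual script:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                 # stand-in 2-class linear classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # was lr=10.0

x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))

for epoch in range(20):
    optimizer.zero_grad()                # clear stale gradients
    loss = criterion(model(x), y)        # forward pass + loss
    loss.backward()                      # compute gradients FIRST...
    optimizer.step()                     # ...then apply the update
```

Calling `step()` before `backward()` applies an update from stale (or zero) gradients, which is why the buggy loop never converges.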
## Task 2: Silent NaN Loss
**Difficulty:** 🟡 Medium
**Category:** Numerical Stability
### Description
A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.
### Bugs
- Unprotected log(0) - the loss uses `torch.log(pred)` where `pred` can be 0.0
### Success Criteria
- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80
### Grading Rubric
| Condition | Score Range |
|---|---|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking the log (see the sketch below)
- Or use `torch.log1p(pred - 1 + 1e-7)` for more stability
- Alternative: use the built-in `F.binary_cross_entropy`, which guards its log terms internally
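A minimal sketch of the clamping fix, assuming the buggy loss is a hand-rolled binary cross-entropy over probabilities:

```python
import torch

def safe_bce(pred, target, eps=1e-7):
    # Clamp BOTH tails so neither log(pred) nor log(1 - pred) sees 0
    pred = torch.clamp(pred, min=eps, max=1.0 - eps)
    return -(target * torch.log(pred)
             + (1 - target) * torch.log(1 - pred)).mean()

pred = torch.tensor([0.0, 0.5, 1.0])     # endpoints that yield NaN unclamped
target = torch.tensor([0.0, 1.0, 1.0])
print(safe_bce(pred, target))            # finite, no NaN
```

The `max` clamp protects the second log term, which is the call most fixes miss (see the common mistakes below).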
### Common Mistakes
- Only fixing one of the log calls (there may be multiple)
- Using too small epsilon (1e-15 can still cause issues)
- Removing the loss entirely instead of fixing it
## Task 3: OOM + Data Leakage
**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline
### Description
A CNN training script has memory issues and data leakage between train/validation sets.
### Bugs
- Graph accumulation - `total_loss += loss` keeps the computation graph alive, causing OOM
- Data leakage - augmentation applied before the train/val split, leaking information
### Success Criteria
- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints
- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split the data BEFORE applying augmentation
- Apply augmentation only to the training set, not validation (see the sketch below)
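A self-contained sketch of both fixes on toy tensors; `torch.flip` stands in for whatever augmentation the task script actually uses:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 1, 8, 8)
y = torch.randint(0, 2, (200,))

# Leakage fix: split FIRST, then augment only the training portion, so no
# augmented copy of a validation sample ends up in the training set
X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]
X_train = torch.cat([X_train, torch.flip(X_train, dims=[3])])
y_train = torch.cat([y_train, y_train])

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# OOM fix: accumulate a plain Python float, not a graph-attached tensor
total_loss = 0.0
for i in range(0, len(X_train), 32):
    optimizer.zero_grad()
    loss = criterion(model(X_train[i:i+32]), y_train[i:i+32])
    loss.backward()
    optimizer.step()
    total_loss += loss.item()            # detaches and frees the graph
```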
### Common Mistakes
- Only fixing one bug
- Using `.detach()` incorrectly (`.item()` is the simplest fix for scalar losses)
- Applying different augmentations to train/val while still drawing both from the same augmented source
## Task 4: Wrong Loss Function
**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection
### Description
A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.
### Bugs
- Wrong loss function - CrossEntropyLoss is for single-label, not multi-label
- Wrong evaluation - Predictions should be sigmoid > 0.5, not argmax
### Success Criteria
- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label (see the sketch below)
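A minimal sketch of the corrected loss and evaluation on toy logits (all names illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)                      # 4 samples, 5 possible labels
targets = torch.randint(0, 2, (4, 5)).float()   # multi-hot, float dtype

criterion = nn.BCEWithLogitsLoss()              # not CrossEntropyLoss
loss = criterion(logits, targets)               # per-label binary loss

preds = (torch.sigmoid(logits) > 0.5).float()   # threshold per label, not argmax
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, and it expects float multi-hot targets - the target-format mistake listed below.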
### Common Mistakes
- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
## Task 5: Frozen Backbone
**Difficulty:** 🟡 Medium
**Category:** Transfer Learning
### Description
A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.
### Bugs
- Frozen parameters in optimizer - Backbone is frozen but all params passed to Adam
### Success Criteria
- Only trainable parameters in optimizer, OR
- Backbone unfrozen and training
- Model shows improvement (loss decreases)
### Grading Rubric
| Condition | Score Range |
|---|---|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Two Valid Solutions
**Solution A: Only optimize trainable params**

```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```

**Solution B: Unfreeze the backbone**

```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```
### Hints
- Check which parameters have `requires_grad=True` (the helper sketched below prints a report)
- Filter parameters when creating the optimizer
- Or set `requires_grad` to True for the backbone
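A quick way to check the first hint, assuming `model` is the task's model object:

```python
import torch.nn as nn

def report_trainable(model: nn.Module) -> None:
    # Show exactly which parameters the optimizer should receive
    for name, param in model.named_parameters():
        status = "trainable" if param.requires_grad else "frozen"
        print(f"{name}: {status}")
```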
### Common Mistakes
- Unfreezing after optimizer creation (optimizer already captured frozen params)
- Only unfreezing some layers
- Forgetting to also handle bias parameters
## Task 6: Input-Output Mismatch
**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture
### Description
A CNN for image classification has multiple bugs related to data format and model architecture mismatches.
### Bugs (4 total)
- Image size mismatch - Data is 32x32, model expects 28x28 (fc layer size wrong)
- Channel order - Data is HWC (Height, Width, Channels), model expects CHW
- Label encoding - Labels are one-hot encoded, CrossEntropyLoss expects class indices
- Batch dimension - Single samples missing batch dimension during validation
### Success Criteria
- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85
### Grading Rubric
| Condition | Score Range |
|---|---|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |
### Hints
**Bug 1: Image size**
- Option A: Resize images to 28x28 (`F.interpolate` or change the data generation)
- Option B: Adjust the fc layer input size (`7*7*32` → `8*8*32` for 32x32 input)
**Bug 2: Channel order**

```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```
**Bug 3: Label encoding**

```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```
**Bug 4: Batch dimension**

```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```
### Common Mistakes
- Only fixing 1-2 of the 4 bugs
- Fixing in the wrong order (dimension fixes interact; the sketch below applies them in a safe sequence)
- Using wrong dimension for permute/argmax
- Forgetting to handle validation differently from training
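A sketch of all four fixes applied in a safe order on hypothetical toy tensors; converting to CHW first keeps the later size arithmetic straightforward:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
images = torch.rand(16, 32, 32, 3)               # toy HWC batch
one_hot_labels = torch.eye(10)[torch.randint(0, 10, (16,))]

images = images.permute(0, 3, 1, 2)              # Bug 2: HWC → CHW first
images = F.interpolate(images, size=(28, 28))    # Bug 1: resize to 28x28
labels = one_hot_labels.argmax(dim=1)            # Bug 3: one-hot → indices

sample = images[0]                               # single val sample, [3, 28, 28]
sample = sample.unsqueeze(0)                     # Bug 4: add batch dim → [1, 3, 28, 28]
```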
## Grading Philosophy
### Continuous Rewards
WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:
- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds
### Why Continuous?
- Better RL training signal - Agents can learn from small improvements
- Differentiates solutions - Distinguishes between "almost correct" and "completely wrong"
- Encourages incremental progress - Reward shaping guides learning
### Score Interpretation
| Score Range | Meaning |
|---|---|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
## Tips for Agents
- Read the task description carefully - it tells you which metrics to print
- Keep seed values - don't remove `torch.manual_seed()` calls
- Print required metrics - must output `LOSSES:[...]` and task-specific metrics
- Test with tools first - use the debugging tools before submitting
- Fix all bugs - partial fixes get partial credit
- Don't over-engineer - minimal fixes are better than rewrites