
WhipStudio Task Documentation

Detailed documentation for all 6 debugging tasks in WhipStudio.


Task Overview

| Task | Difficulty | Bug Count | Key Skills Tested |
|-------|-----------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |

Task 1: Broken Training Loop

Difficulty: 🟢 Easy
Category: Basic Training Loop

Description

A 2-class linear classifier training loop has two bugs preventing convergence.

Bugs

  1. Learning rate too high (lr=10.0 instead of lr=0.01)
  2. Wrong optimizer order (optimizer.step() before loss.backward())

Success Criteria

  • Final loss < 0.3
  • Validation accuracy > 0.85
  • Losses should decrease monotonically

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |

Hints

  • Check the learning rate value
  • Compare the order of backward() and step() calls
  • Standard pattern: zero_grad() → forward → loss → backward() → step()
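
The standard ordering from the last hint can be sketched as follows. The synthetic data, model size, and epoch count are illustrative assumptions rather than the actual task code; only lr=0.01 and the call order come from the task description.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in data: a 2-class problem that is linearly separable by construction.
X = torch.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).long()

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # not lr=10.0
criterion = nn.CrossEntropyLoss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()          # 1. clear gradients from the previous step
    logits = model(X)              # 2. forward pass
    loss = criterion(logits, y)    # 3. compute the loss
    loss.backward()                # 4. populate .grad on every parameter
    optimizer.step()               # 5. update weights using the fresh gradients
    losses.append(loss.item())
```

Calling optimizer.step() before loss.backward() would apply gradients from the previous iteration (or none at all on the first one), which is why the order matters.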

Task 2: Silent NaN Loss

Difficulty: 🟡 Medium
Category: Numerical Stability

Description

A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.

Bugs

  1. Unprotected log(0) - The loss uses torch.log(pred) where pred can be 0.0

Success Criteria

  • No NaN values in loss
  • Training completes successfully
  • Validation accuracy > 0.80

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

Hints

  • Look for torch.log() calls
  • Use torch.clamp(pred, min=1e-7) before taking log
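  • Or add a small epsilon inside the log: torch.log(pred + 1e-7) (torch.log1p(pred - 1 + 1e-7) computes the same quantity and is no more stable)
  • Alternative: use the built-in F.binary_cross_entropy, which clamps its log terms to be at least -100 (it has no eps parameter), or F.binary_cross_entropy_with_logits on raw logits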
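
A minimal sketch of the clamping fix, assuming the custom loss has the usual binary cross-entropy form. The function names unsafe_bce and safe_bce are hypothetical and the task's actual loss may be written differently.

```python
import torch

def unsafe_bce(pred, target):
    # Mirrors the bug: log(0) is -inf, and 0 * -inf evaluates to nan.
    return -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

def safe_bce(pred, target, eps=1e-7):
    # Clamp probabilities away from 0 and 1 so both log terms stay finite.
    pred = torch.clamp(pred, min=eps, max=1 - eps)
    return -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

# Predictions that saturate at exactly 0.0 and 1.0 trigger the problem.
pred = torch.tensor([0.0, 1.0, 0.3])
target = torch.tensor([0.0, 1.0, 1.0])

unsafe = unsafe_bce(pred, target)  # nan
safe = safe_bce(pred, target)      # finite
```

Note that the NaN appears even when the saturated predictions are correct, which is part of why the failure is silent.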

Common Mistakes

  • Only fixing one of the log calls (there may be multiple)
  • Using too small an epsilon (in float32, 1 - 1e-15 rounds to 1.0, so any log(1 - pred) term is still unprotected)
  • Removing the loss entirely instead of fixing it

Task 3: OOM + Data Leakage

Difficulty: 🔴 Hard
Category: Memory Management, Data Pipeline

Description

A CNN training script has memory issues and data leakage between train/validation sets.

Bugs

  1. Graph accumulation - total_loss += loss keeps the computation graph, causing OOM
  2. Data leakage - Augmentation applied before train/val split, leaking information

Success Criteria

  • No OOM errors
  • Training completes all epochs
  • Validation accuracy > 0.90 (proper separation)

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |

Hints

  • Use total_loss += loss.item() or total_loss += loss.detach() to prevent graph retention
  • Split data BEFORE applying augmentation
  • Apply augmentation only to training set, not validation
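
Both fixes can be sketched together on toy tensors. The dataset, the noise-based augment function, and the linear model below are placeholders chosen for brevity, not the task's CNN or its real transforms.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in dataset: 100 samples with 8 features each.
data = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

# 1. Split FIRST, so no augmented copy of a validation sample reaches training.
perm = torch.randperm(len(data))
train_idx, val_idx = perm[:80], perm[80:]
train_x, train_y = data[train_idx], labels[train_idx]
val_x, val_y = data[val_idx], labels[val_idx]

def augment(x):
    # Placeholder augmentation (small additive noise).
    return x + 0.01 * torch.randn_like(x)

# 2. Augment the training split only; the validation split stays untouched.
train_x = torch.cat([train_x, augment(train_x)])
train_y = torch.cat([train_y, train_y])

model = torch.nn.Linear(8, 2)
criterion = torch.nn.CrossEntropyLoss()

total_loss = 0.0
for i in range(0, len(train_x), 16):
    loss = criterion(model(train_x[i:i + 16]), train_y[i:i + 16])
    # .item() returns a Python float, so no computation graph is kept alive.
    total_loss += loss.item()
```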

Common Mistakes

  • Only fixing one bug
  • Calling loss.detach() without using the result (it returns a new tensor and does not modify loss in place; .item() is the simplest choice for scalar losses)
  • Applying different augmentations to train/val but still from same augmented source

Task 4: Wrong Loss Function

Difficulty: 🟡 Medium
Category: Loss Function Selection

Description

A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.

Bugs

  1. Wrong loss function - CrossEntropyLoss is for single-label, not multi-label
  2. Wrong evaluation - Predictions should be sigmoid > 0.5, not argmax

Success Criteria

  • F1 score > 0.70
  • Correct multi-hot predictions
  • Training shows improvement

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |

Hints

  • Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
  • Use nn.BCEWithLogitsLoss() for multi-label
  • For predictions: (torch.sigmoid(output) > 0.5).float()
  • Don't use softmax/argmax - that's for single-label
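
The difference between threshold-based and argmax-based prediction is easiest to see on a small batch. The logits and targets below are made-up values for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 3 samples, 4 possible labels, raw logits from a model.
logits = torch.tensor([[ 2.0, -1.0,  3.0, -2.0],
                       [-3.0,  0.5, -0.5,  4.0],
                       [ 1.0,  1.0, -4.0, -4.0]])

# Multi-hot targets must be float for BCEWithLogitsLoss.
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0, 0.0]])

# BCEWithLogitsLoss applies the sigmoid internally, so pass raw logits.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)

# Each label is decided independently, so a sample can have several labels.
preds = (torch.sigmoid(logits) > 0.5).float()

# argmax keeps exactly one label per sample and cannot express multi-label output.
single_label = logits.argmax(dim=1)
```

Here preds recovers both active labels in every row, while single_label holds only one index per sample.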

Common Mistakes

  • Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
  • Keeping argmax evaluation (should be threshold-based)
  • Incorrect target format (should be float, not long)

Task 5: Frozen Backbone

Difficulty: 🟡 Medium
Category: Transfer Learning

Description

A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.

Bugs

  1. Frozen parameters in optimizer - Backbone is frozen but all params passed to Adam

Success Criteria

  • Either only trainable parameters are passed to the optimizer, or the backbone is unfrozen and trained
  • Model shows improvement (loss decreases)

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

Two Valid Solutions

Solution A: Only optimize trainable params

# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)

Solution B: Unfreeze the backbone

# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True

Hints

  • Check which parameters have requires_grad=True
  • Filter parameters when creating optimizer
  • Or change requires_grad to True for backbone
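
To follow the first hint, it can help to list which parameters are trainable and to count what the optimizer actually holds. TinyModel is a hypothetical stand-in with backbone and head attributes; the task's real model is likely larger.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical model with a backbone and a classification head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = TinyModel()

# Freeze the backbone (weights and biases alike).
for p in model.backbone.parameters():
    p.requires_grad = False

# Inspect which parameters will receive gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]

# Pass only trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001
)

# Count the tensors the optimizer actually manages.
n_optimized = sum(len(group["params"]) for group in optimizer.param_groups)
```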

Common Mistakes

  • Unfreezing after creating a filtered optimizer (parameters unfrozen later are not in the optimizer, so they never update)
  • Only unfreezing some layers
  • Forgetting to also handle bias parameters

Task 6: Input-Output Mismatch

Difficulty: 🔴 Hard
Category: Data Preprocessing, CNN Architecture

Description

A CNN for image classification has multiple bugs related to data format and model architecture mismatches.

Bugs (4 total)

  1. Image size mismatch - Data is 32x32, model expects 28x28 (fc layer size wrong)
  2. Channel order - Data is HWC (Height, Width, Channels), model expects CHW
  3. Label encoding - Labels are one-hot encoded, CrossEntropyLoss expects class indices
  4. Batch dimension - Single samples missing batch dimension during validation

Success Criteria

  • All 4 bugs fixed
  • Model trains without shape errors
  • Validation accuracy > 0.85

Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |

Hints

Bug 1: Image size

  • Option A: Resize images to 28x28 (F.interpolate or change data generation)
  • Option B: Adjust fc layer input size (7*7*32 → 8*8*32 for 32x32 input)
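
For Option B, the required fc input size can be computed rather than guessed by pushing a dummy tensor through the convolutional stack. The layers below are an assumed architecture (3 input channels, two 3x3 conv + 2x2 max-pool stages ending in 32 channels); the task's actual model may differ.

```python
import torch
import torch.nn as nn

# Assumed feature extractor; each MaxPool2d(2) halves the height and width.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

def flat_size(height, width):
    """Return the flattened feature count for one image of the given size."""
    with torch.no_grad():
        out = features(torch.zeros(1, 3, height, width))
    return out.flatten(1).shape[1]

size_28 = flat_size(28, 28)  # 28 -> 14 -> 7, so 7 * 7 * 32
size_32 = flat_size(32, 32)  # 32 -> 16 -> 8, so 8 * 8 * 32

# The fc layer must match the size produced by the real input resolution.
fc = nn.Linear(size_32, 10)  # 10 output classes is an arbitrary placeholder
```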

Bug 2: Channel order

# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]

Bug 3: Label encoding

# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)

Bug 4: Batch dimension

# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]

Common Mistakes

  • Only fixing 1-2 of the 4 bugs
  • Fixing in wrong order (need to be careful with dimensions)
  • Using wrong dimension for permute/argmax
  • Forgetting to handle validation differently from training
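
Because a wrong dimension in permute or argmax often fails silently, it can be worth checking the three tensor fixes on dummy data before touching the model. The shapes and class count below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Dummy inputs shaped like the buggy data: HWC images and one-hot labels.
images_hwc = torch.rand(5, 32, 32, 3)  # [N, H, W, C]
one_hot_labels = F.one_hot(torch.tensor([0, 2, 1, 2, 0]), num_classes=3).float()

# Bug 2: move channels in front of the spatial dimensions.
images = images_hwc.permute(0, 3, 1, 2)  # [N, C, H, W]

# Bug 3: reduce over the class dimension (dim=1), not the batch dimension.
labels = one_hot_labels.argmax(dim=1)    # int64 class indices

# Bug 4: a single validation sample needs a leading batch dimension.
single = images[0]                       # [C, H, W]
batch_of_one = single.unsqueeze(0)       # [1, C, H, W]

# Class-index labels are what cross-entropy expects alongside [N, num_classes] logits.
dummy_logits = torch.randn(5, 3)
loss = F.cross_entropy(dummy_logits, labels)
```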

Grading Philosophy

Continuous Rewards

WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:

  • Partial credit for fixing some bugs
  • Partial credit for code that runs but doesn't meet thresholds
  • Higher scores require meeting stricter thresholds

Why Continuous?

  1. Better RL training signal - Agents can learn from small improvements
  2. Differentiates solutions - Distinguishes between "almost correct" and "completely wrong"
  3. Encourages incremental progress - Reward shaping guides learning

Score Interpretation

| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
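
The bands above can be expressed as a small lookup. The helper name interpret_score is hypothetical and is not part of WhipStudio; it only restates the bands listed here.

```python
def interpret_score(score: float) -> str:
    """Map a score in [0.0, 1.0] to the band names listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.90:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.40:
        return "Partial"
    if score >= 0.15:
        return "Minimal"
    return "Failed"

# Example: label a few scores.
bands = [interpret_score(s) for s in (0.95, 0.72, 0.40, 0.20, 0.05)]
```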

Tips for Agents

  1. Read the task description carefully - It tells you what metrics to print
  2. Keep seed values - Don't remove torch.manual_seed() calls
  3. Print required metrics - Must output LOSSES:[...] and task-specific metrics
  4. Test with tools first - Use debugging tools before submitting
  5. Fix all bugs - Partial fixes get partial credit
  6. Don't over-engineer - Minimal fixes are better than rewrites
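
A sketch of emitting metrics at the end of a script, as tip 3 requires. Only the LOSSES:[...] prefix comes from this document; the VAL_ACC line format, the rounding, and the sample values are assumptions, so check each task description for the exact format it expects.

```python
import json

# Placeholder values standing in for metrics collected during training.
losses = [0.6931, 0.5127, 0.4102]
val_acc = 0.9

# LOSSES:[...] prefix as named in the tips; the list is serialized as JSON.
losses_line = "LOSSES:" + json.dumps([round(x, 4) for x in losses])

# Assumed format for a task-specific metric.
val_acc_line = f"VAL_ACC:{val_acc:.4f}"

print(losses_line)
print(val_acc_line)
```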