# WhipStudio Task Documentation
Detailed documentation for all 6 debugging tasks in WhipStudio.
## Task Overview
| Task | Difficulty | Bug Count | Key Skills Tested |
|---|---|---|---|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |
## Task 1: Broken Training Loop
**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop
### Description
A 2-class linear classifier training loop has two bugs preventing convergence.
### Bugs
- Learning rate too high (`lr=10.0` instead of `lr=0.01`)
- Wrong optimizer order (`optimizer.step()` before `loss.backward()`)
### Success Criteria
- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()` (see the sketch below)
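A minimal sketch of the corrected pattern, using a hypothetical linear model and toy data rather than the task's actual script:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                 # stand-in 2-class linear classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # was lr=10.0

x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))

for epoch in range(20):
    optimizer.zero_grad()                # clear stale gradients
    loss = criterion(model(x), y)        # forward pass + loss
    loss.backward()                      # compute gradients FIRST...
    optimizer.step()                     # ...then apply the update
```

Calling `step()` before `backward()` applies an update from stale (or zero) gradients, which is why the buggy loop never converges.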
## Task 2: Silent NaN Loss
**Difficulty:** 🟡 Medium
**Category:** Numerical Stability
### Description
A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.
### Bugs
- Unprotected log(0) - the loss uses `torch.log(pred)` where `pred` can be 0.0
### Success Criteria
- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80
### Grading Rubric
| Condition | Score Range |
|---|---|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking the log (see the sketch below)
- Or use `torch.log1p(pred - 1 + 1e-7)` for more stability
- Alternative: use the built-in `F.binary_cross_entropy`, which guards its log terms internally
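A minimal sketch of the clamping fix, assuming the buggy loss is a hand-rolled binary cross-entropy over probabilities:

```python
import torch

def safe_bce(pred, target, eps=1e-7):
    # Clamp BOTH tails so neither log(pred) nor log(1 - pred) sees 0
    pred = torch.clamp(pred, min=eps, max=1.0 - eps)
    return -(target * torch.log(pred)
             + (1 - target) * torch.log(1 - pred)).mean()

pred = torch.tensor([0.0, 0.5, 1.0])     # endpoints that yield NaN unclamped
target = torch.tensor([0.0, 1.0, 1.0])
print(safe_bce(pred, target))            # finite, no NaN
```

The `max` clamp protects the second log term, which is the call most fixes miss (see the common mistakes below).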
### Common Mistakes
- Only fixing one of the log calls (there may be multiple)
- Using too small epsilon (1e-15 can still cause issues)
- Removing the loss entirely instead of fixing it
## Task 3: OOM + Data Leakage
**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline
### Description
A CNN training script has memory issues and data leakage between train/validation sets.
### Bugs
- Graph accumulation - `total_loss += loss` keeps the computation graph alive, causing OOM
- Data leakage - augmentation applied before the train/val split, leaking information
### Success Criteria
- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints
- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split the data BEFORE applying augmentation
- Apply augmentation only to the training set, not validation (see the sketch below)
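A self-contained sketch of both fixes on toy tensors; `torch.flip` stands in for whatever augmentation the task script actually uses:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 1, 8, 8)
y = torch.randint(0, 2, (200,))

# Leakage fix: split FIRST, then augment only the training portion, so no
# augmented copy of a validation sample ends up in the training set
X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]
X_train = torch.cat([X_train, torch.flip(X_train, dims=[3])])
y_train = torch.cat([y_train, y_train])

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# OOM fix: accumulate a plain Python float, not a graph-attached tensor
total_loss = 0.0
for i in range(0, len(X_train), 32):
    optimizer.zero_grad()
    loss = criterion(model(X_train[i:i+32]), y_train[i:i+32])
    loss.backward()
    optimizer.step()
    total_loss += loss.item()            # detaches and frees the graph
```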
### Common Mistakes
- Only fixing one bug
- Using `.detach()` incorrectly (`.item()` is the simplest fix for scalar losses)
- Applying different augmentations to train/val while still drawing both from the same augmented source
## Task 4: Wrong Loss Function
**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection
### Description
A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.
### Bugs
- Wrong loss function - CrossEntropyLoss is for single-label, not multi-label
- Wrong evaluation - Predictions should be sigmoid > 0.5, not argmax
### Success Criteria
- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement
### Grading Rubric
| Condition | Score Range |
|---|---|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |
### Hints
- Multi-label: each sample can have multiple labels (e.g., image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label (see the sketch below)
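A minimal sketch of the corrected loss and evaluation on toy logits (all names illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)                      # 4 samples, 5 possible labels
targets = torch.randint(0, 2, (4, 5)).float()   # multi-hot, float dtype

criterion = nn.BCEWithLogitsLoss()              # not CrossEntropyLoss
loss = criterion(logits, targets)               # per-label binary loss

preds = (torch.sigmoid(logits) > 0.5).float()   # threshold per label, not argmax
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, and it expects float multi-hot targets - the target-format mistake listed below.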
### Common Mistakes
- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
## Task 5: Frozen Backbone
**Difficulty:** 🟡 Medium
**Category:** Transfer Learning
### Description
A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.
### Bugs
- Frozen parameters in optimizer - Backbone is frozen but all params passed to Adam
### Success Criteria
- Only trainable parameters in optimizer, OR
- Backbone unfrozen and training
- Model shows improvement (loss decreases)
### Grading Rubric
| Condition | Score Range |
|---|---|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Two Valid Solutions
**Solution A: Only optimize trainable params**

```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```

**Solution B: Unfreeze the backbone**

```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```
### Hints
- Check which parameters have `requires_grad=True` (the helper sketched below prints a report)
- Filter parameters when creating the optimizer
- Or set `requires_grad` to True for the backbone
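A quick way to check the first hint, assuming `model` is the task's model object:

```python
import torch.nn as nn

def report_trainable(model: nn.Module) -> None:
    # Show exactly which parameters the optimizer should receive
    for name, param in model.named_parameters():
        status = "trainable" if param.requires_grad else "frozen"
        print(f"{name}: {status}")
```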
### Common Mistakes
- Unfreezing after optimizer creation (optimizer already captured frozen params)
- Only unfreezing some layers
- Forgetting to also handle bias parameters
## Task 6: Input-Output Mismatch
**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture
### Description
A CNN for image classification has multiple bugs related to data format and model architecture mismatches.
### Bugs (4 total)
- Image size mismatch - Data is 32x32, model expects 28x28 (fc layer size wrong)
- Channel order - Data is HWC (Height, Width, Channels), model expects CHW
- Label encoding - Labels are one-hot encoded, CrossEntropyLoss expects class indices
- Batch dimension - Single samples missing batch dimension during validation
### Success Criteria
- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85
### Grading Rubric
| Condition | Score Range |
|---|---|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |
### Hints
**Bug 1: Image size**
- Option A: Resize images to 28x28 (`F.interpolate` or change the data generation)
- Option B: Adjust the fc layer input size (`7*7*32` → `8*8*32` for 32x32 input)
**Bug 2: Channel order**

```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```
**Bug 3: Label encoding**

```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```
**Bug 4: Batch dimension**

```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```
### Common Mistakes
- Only fixing 1-2 of the 4 bugs
- Fixing in the wrong order (dimension fixes interact; the sketch below applies them in a safe sequence)
- Using wrong dimension for permute/argmax
- Forgetting to handle validation differently from training
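A sketch of all four fixes applied in a safe order on hypothetical toy tensors; converting to CHW first keeps the later size arithmetic straightforward:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
images = torch.rand(16, 32, 32, 3)               # toy HWC batch
one_hot_labels = torch.eye(10)[torch.randint(0, 10, (16,))]

images = images.permute(0, 3, 1, 2)              # Bug 2: HWC → CHW first
images = F.interpolate(images, size=(28, 28))    # Bug 1: resize to 28x28
labels = one_hot_labels.argmax(dim=1)            # Bug 3: one-hot → indices

sample = images[0]                               # single val sample, [3, 28, 28]
sample = sample.unsqueeze(0)                     # Bug 4: add batch dim → [1, 3, 28, 28]
```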
## Grading Philosophy
### Continuous Rewards
WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:
- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds
### Why Continuous?
- Better RL training signal - Agents can learn from small improvements
- Differentiates solutions - Distinguishes between "almost correct" and "completely wrong"
- Encourages incremental progress - Reward shaping guides learning
### Score Interpretation
| Score Range | Meaning |
|---|---|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
## Tips for Agents
- Read the task description carefully - it tells you which metrics to print
- Keep seed values - don't remove `torch.manual_seed()` calls
- Print required metrics - must output `LOSSES:[...]` and task-specific metrics
- Test with tools first - use the debugging tools before submitting
- Fix all bugs - partial fixes get partial credit
- Don't over-engineer - minimal fixes are better than rewrites