# WhipStudio Task Documentation

Detailed documentation for all 6 debugging tasks in WhipStudio.

---

## Task Overview

| Task | Difficulty | Bug Count | Key Skills Tested |
|------|------------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |

---
## Task 1: Broken Training Loop

**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop

### Description

A 2-class linear classifier training loop has two bugs preventing convergence.

### Bugs

1. **Learning rate too high** (`lr=10.0` instead of `lr=0.01`)
2. **Wrong optimizer order** (`optimizer.step()` before `loss.backward()`)

### Success Criteria

- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Hints

- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()`
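The standard pattern in the last hint can be sketched as a minimal corrected loop (the model, data, and hyperparameters here are illustrative placeholders, not the task's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the task's model and data
model = nn.Linear(4, 2)                       # 2-class linear classifier
X, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # sane lr, not 10.0

losses = []
for epoch in range(20):
    optimizer.zero_grad()         # clear stale gradients
    logits = model(X)             # forward
    loss = criterion(logits, y)   # compute loss
    loss.backward()               # backward BEFORE step
    optimizer.step()              # update weights last
    losses.append(loss.item())

assert losses[-1] < losses[0]     # loss trends downward
```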
---

## Task 2: Silent NaN Loss

**Difficulty:** 🟡 Medium
**Category:** Numerical Stability

### Description

A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.

### Bugs

1. **Unprotected log(0)** - The loss uses `torch.log(pred)` where `pred` can be 0.0

### Success Criteria

- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints

- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking the log
- Or use `torch.log1p(pred - 1 + 1e-7)` (equivalent to `log(pred + 1e-7)` but more precise when `pred` is near 1)
- Alternative: compute the loss from raw logits with the built-in `F.binary_cross_entropy_with_logits`, which is numerically stable by construction
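The clamp hint can be sketched as a numerically safe version of a hand-written BCE loss (the function name and tensor values here are illustrative, not the task's actual code):

```python
import torch

def stable_bce(pred, target, eps=1e-7):
    """Hand-written binary cross-entropy with clamped probabilities.

    pred: probabilities in [0, 1]; target: 0/1 labels as floats.
    Clamping keeps both log() calls away from log(0) = -inf.
    """
    pred = torch.clamp(pred, min=eps, max=1 - eps)
    return -(target * torch.log(pred)
             + (1 - target) * torch.log(1 - pred)).mean()

pred = torch.tensor([0.0, 0.5, 1.0])     # includes the dangerous endpoints
target = torch.tensor([0.0, 1.0, 1.0])
loss = stable_bce(pred, target)
assert torch.isfinite(loss)              # no NaN/inf even at pred = 0 or 1
```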
### Common Mistakes

- Only fixing one of the log calls (there may be multiple)
- Using too small an epsilon (1e-15 can still cause issues)
- Removing the loss entirely instead of fixing it
---

## Task 3: OOM + Data Leakage

**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline

### Description

A CNN training script has memory issues and data leakage between train/validation sets.

### Bugs

1. **Graph accumulation** - `total_loss += loss` keeps the computation graph, causing OOM
2. **Data leakage** - Augmentation applied before the train/val split, leaking information

### Success Criteria

- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints

- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split data BEFORE applying augmentation
- Apply augmentation only to the training set, not validation
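Both fixes can be sketched together in one toy script (the augmentation, dataset, and model below are placeholders, not the task's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Fix 2: split BEFORE augmenting, and augment only the train split ---
data = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
n_train = 80
train_x, val_x = data[:n_train], data[n_train:]        # split first
train_y, val_y = labels[:n_train], labels[n_train:]
train_x = train_x + 0.01 * torch.randn_like(train_x)   # toy augmentation, train only

# --- Fix 1: accumulate a Python float, not a graph-carrying tensor ---
model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for i in range(0, n_train, 20):
    optimizer.zero_grad()
    loss = criterion(model(train_x[i:i + 20]), train_y[i:i + 20])
    loss.backward()
    optimizer.step()
    total_loss += loss.item()   # .item() drops the graph; a bare `loss` would retain it

assert isinstance(total_loss, float)
```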
### Common Mistakes

- Only fixing one bug
- Confusing `.detach()` and `.item()` (`.item()` yields a Python float, `.detach()` yields a graph-free tensor; either stops graph accumulation)
- Applying different augmentations to train/val but still drawing both from the same augmented source
---

## Task 4: Wrong Loss Function

**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection

### Description

A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.

### Bugs

1. **Wrong loss function** - CrossEntropyLoss is for single-label, not multi-label
2. **Wrong evaluation** - Predictions should be sigmoid > 0.5, not argmax

### Success Criteria

- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |

### Hints

- Multi-label: each sample can have multiple labels (e.g., an image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label
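The loss and evaluation hints combine into a minimal sketch (the shapes and data below are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 3
model = nn.Linear(16, num_classes)
criterion = nn.BCEWithLogitsLoss()     # multi-label: independent sigmoid per class

x = torch.randn(8, 16)
targets = torch.randint(0, 2, (8, num_classes)).float()  # multi-hot FLOAT targets

logits = model(x)
loss = criterion(logits, targets)      # takes raw logits, no explicit sigmoid
loss.backward()

# Evaluation: per-class threshold, NOT argmax
preds = (torch.sigmoid(logits) > 0.5).float()
assert preds.shape == targets.shape    # one independent 0/1 decision per class
```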
### Common Mistakes

- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
---

## Task 5: Frozen Backbone

**Difficulty:** 🟡 Medium
**Category:** Transfer Learning

### Description

A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.

### Bugs

1. **Frozen parameters in optimizer** - Backbone is frozen but all params are passed to Adam

### Success Criteria

- Only trainable parameters in the optimizer, OR
- Backbone unfrozen and training
- Model shows improvement (loss decreases)

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Two Valid Solutions

**Solution A: Only optimize trainable params**

```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```

**Solution B: Unfreeze the backbone**

```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```

### Hints

- Check which parameters have `requires_grad=True`
- Filter parameters when creating the optimizer
- Or set `requires_grad` to True for the backbone
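A quick way to verify Solution A, sketched with a toy two-part model (`ToyTransfer` and its attribute names are placeholders, not the task's actual classes):

```python
import torch
import torch.nn as nn

class ToyTransfer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)   # stand-in for a pretrained backbone
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = ToyTransfer()
for p in model.backbone.parameters():     # freeze the backbone
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)   # Solution A

# Sanity check: only the head's weight and bias are being optimized
n_opt = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
n_head = sum(p.numel() for p in model.head.parameters())
assert n_opt == n_head
```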
### Common Mistakes

- Unfreezing the backbone after creating a filtered optimizer (the newly unfrozen params are not in the optimizer's param groups, so they never update)
- Only unfreezing some layers
- Forgetting to also handle bias parameters

---
## Task 6: Input-Output Mismatch

**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture

### Description

A CNN for image classification has multiple bugs related to data format and model architecture mismatches.

### Bugs (4 total)

1. **Image size mismatch** - Data is 32x32, model expects 28x28 (fc layer size wrong)
2. **Channel order** - Data is HWC (Height, Width, Channels), model expects CHW
3. **Label encoding** - Labels are one-hot encoded, CrossEntropyLoss expects class indices
4. **Batch dimension** - Single samples are missing the batch dimension during validation

### Success Criteria

- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |

### Hints

**Bug 1: Image size**

- Option A: Resize images to 28x28 (`F.interpolate` or change data generation)
- Option B: Adjust the fc layer input size (`7*7*32` → `8*8*32` for 32x32 input)
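Option A can be sketched as follows (assumes the images are already a float tensor in `[N, C, H, W]` order; the batch here is illustrative):

```python
import torch
import torch.nn.functional as F

images = torch.randn(16, 3, 32, 32)   # toy batch in [N, C, H, W]
# Option A: resize 32x32 down to the 28x28 the model expects
images = F.interpolate(images, size=(28, 28), mode="bilinear", align_corners=False)
assert images.shape == (16, 3, 28, 28)
```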
**Bug 2: Channel order**

```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```

**Bug 3: Label encoding**

```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```

**Bug 4: Batch dimension**

```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```

### Common Mistakes

- Only fixing 1-2 of the 4 bugs
- Fixing in the wrong order (need to be careful with dimensions)
- Using the wrong dimension for `permute`/`argmax`
- Forgetting to handle validation differently from training
---

## Grading Philosophy

### Continuous Rewards

WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:

- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds

### Why Continuous?

1. **Better RL training signal** - Agents can learn from small improvements
2. **Differentiates solutions** - Distinguishes between "almost correct" and "completely wrong"
3. **Encourages incremental progress** - Reward shaping guides learning

### Score Interpretation

| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
| ## Tips for Agents | |
| 1. **Read the task description carefully** - It tells you what metrics to print | |
| 2. **Keep seed values** - Don't remove `torch.manual_seed()` calls | |
| 3. **Print required metrics** - Must output `LOSSES:[...]` and task-specific metrics | |
| 4. **Test with tools first** - Use debugging tools before submitting | |
| 5. **Fix all bugs** - Partial fixes get partial credit | |
| 6. **Don't over-engineer** - Minimal fixes are better than rewrites | |
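Tip 3's metric line might be emitted like this (a sketch; the exact string format each task's harness parses is an assumption here, and the loss values are illustrative):

```python
# Per-epoch losses collected during training, then printed
# in a harness-parseable "LOSSES:[...]" line
losses = [0.693, 0.41, 0.28]      # illustrative values
print(f"LOSSES:{losses}")         # prints: LOSSES:[0.693, 0.41, 0.28]
```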