# WhipStudio Task Documentation

Detailed documentation for all 6 debugging tasks in WhipStudio.

---

## Task Overview

| Task | Difficulty | Bug Count | Key Skills Tested |
|------|------------|-----------|-------------------|
| task1 | 🟢 Easy | 2 | Basic PyTorch training loop |
| task2 | 🟡 Medium | 1 | Numerical stability |
| task3 | 🔴 Hard | 2 | Memory management, data handling |
| task4 | 🟡 Medium | 2 | Loss function selection, evaluation |
| task5 | 🟡 Medium | 1 | Transfer learning, parameter freezing |
| task6 | 🔴 Hard | 4 | Data preprocessing, CNN architecture |

---
## Task 1: Broken Training Loop

**Difficulty:** 🟢 Easy
**Category:** Basic Training Loop

### Description

A 2-class linear classifier training loop has two bugs preventing convergence.

### Bugs

1. **Learning rate too high** (`lr=10.0` instead of `lr=0.01`)
2. **Wrong optimizer order** (`optimizer.step()` before `loss.backward()`)

### Success Criteria

- Final loss < 0.3
- Validation accuracy > 0.85
- Losses should decrease monotonically

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.90 | 0.85 - 1.0 |
| One bug fixed, some improvement | 0.40 - 0.70 |
| No bugs fixed, code runs | 0.15 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Hints

- Check the learning rate value
- Compare the order of `backward()` and `step()` calls
- Standard pattern: `zero_grad()` → forward → loss → `backward()` → `step()`
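The standard pattern in the last hint can be sketched as a minimal corrected loop (the model, data, and hyperparameters here are illustrative placeholders, not the task's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the task's model and data
model = nn.Linear(4, 2)                       # 2-class linear classifier
X, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # sane lr, not 10.0

losses = []
for epoch in range(20):
    optimizer.zero_grad()         # clear stale gradients
    logits = model(X)             # forward
    loss = criterion(logits, y)   # compute loss
    loss.backward()               # backward BEFORE step
    optimizer.step()              # update weights last
    losses.append(loss.item())

assert losses[-1] < losses[0]     # loss trends downward
```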
---

## Task 2: Silent NaN Loss

**Difficulty:** 🟡 Medium
**Category:** Numerical Stability

### Description

A custom binary classification loss produces NaN values silently, causing training to fail without obvious errors.

### Bugs

1. **Unprotected log(0)** - The loss uses `torch.log(pred)` where `pred` can be 0.0

### Success Criteria

- No NaN values in loss
- Training completes successfully
- Validation accuracy > 0.80

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| NaN fixed, VAL_ACC > 0.85 | 0.85 - 1.0 |
| NaN fixed, lower accuracy | 0.50 - 0.75 |
| Still has NaN | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |
### Hints

- Look for `torch.log()` calls
- Use `torch.clamp(pred, min=1e-7)` before taking the log
- Or use `torch.log1p(pred - 1 + 1e-7)` (equivalent to `log(pred + 1e-7)` but more precise when `pred` is near 1)
- Alternative: compute the loss from raw logits with the built-in `F.binary_cross_entropy_with_logits`, which is numerically stable by construction
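The clamp hint can be sketched as a numerically safe version of a hand-written BCE loss (the function name and tensor values here are illustrative, not the task's actual code):

```python
import torch

def stable_bce(pred, target, eps=1e-7):
    """Hand-written binary cross-entropy with clamped probabilities.

    pred: probabilities in [0, 1]; target: 0/1 labels as floats.
    Clamping keeps both log() calls away from log(0) = -inf.
    """
    pred = torch.clamp(pred, min=eps, max=1 - eps)
    return -(target * torch.log(pred)
             + (1 - target) * torch.log(1 - pred)).mean()

pred = torch.tensor([0.0, 0.5, 1.0])     # includes the dangerous endpoints
target = torch.tensor([0.0, 1.0, 1.0])
loss = stable_bce(pred, target)
assert torch.isfinite(loss)              # no NaN/inf even at pred = 0 or 1
```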
### Common Mistakes

- Only fixing one of the log calls (there may be multiple)
- Using too small an epsilon (1e-15 can still cause issues)
- Removing the loss entirely instead of fixing it
---

## Task 3: OOM + Data Leakage

**Difficulty:** 🔴 Hard
**Category:** Memory Management, Data Pipeline

### Description

A CNN training script has memory issues and data leakage between train/validation sets.

### Bugs

1. **Graph accumulation** - `total_loss += loss` keeps the computation graph, causing OOM
2. **Data leakage** - Augmentation applied before the train/val split, leaking information

### Success Criteria

- No OOM errors
- Training completes all epochs
- Validation accuracy > 0.90 (proper separation)

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, VAL_ACC > 0.92 | 0.85 - 1.0 |
| One bug fixed | 0.40 - 0.65 |
| Bugs not fixed, VAL_ACC < 0.70 | 0.10 - 0.25 |
| OOM error | 0.0 - 0.10 |
### Hints

- Use `total_loss += loss.item()` or `total_loss += loss.detach()` to prevent graph retention
- Split data BEFORE applying augmentation
- Apply augmentation only to the training set, not validation
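Both fixes can be sketched together in one toy script (the augmentation, dataset, and model below are placeholders, not the task's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Fix 2: split BEFORE augmenting, and augment only the train split ---
data = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
n_train = 80
train_x, val_x = data[:n_train], data[n_train:]        # split first
train_y, val_y = labels[:n_train], labels[n_train:]
train_x = train_x + 0.01 * torch.randn_like(train_x)   # toy augmentation, train only

# --- Fix 1: accumulate a Python float, not a graph-carrying tensor ---
model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for i in range(0, n_train, 20):
    optimizer.zero_grad()
    loss = criterion(model(train_x[i:i + 20]), train_y[i:i + 20])
    loss.backward()
    optimizer.step()
    total_loss += loss.item()   # .item() drops the graph; a bare `loss` would retain it

assert isinstance(total_loss, float)
```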
### Common Mistakes

- Only fixing one bug
- Confusing `.detach()` and `.item()` (`.item()` yields a Python float, `.detach()` yields a graph-free tensor; either stops graph accumulation)
- Applying different augmentations to train/val but still drawing both from the same augmented source
---

## Task 4: Wrong Loss Function

**Difficulty:** 🟡 Medium
**Category:** Loss Function Selection

### Description

A multi-label classification task incorrectly uses CrossEntropyLoss instead of BCEWithLogitsLoss.

### Bugs

1. **Wrong loss function** - CrossEntropyLoss is for single-label, not multi-label
2. **Wrong evaluation** - Predictions should be sigmoid > 0.5, not argmax

### Success Criteria

- F1 score > 0.70
- Correct multi-hot predictions
- Training shows improvement

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Both bugs fixed, F1 > 0.75 | 0.85 - 1.0 |
| Loss fixed, eval partially correct | 0.50 - 0.75 |
| Wrong loss still used | 0.10 - 0.30 |
| Code crashes | 0.0 - 0.10 |

### Hints

- Multi-label: each sample can have multiple labels (e.g., an image tagged with [cat, dog, outdoor])
- Use `nn.BCEWithLogitsLoss()` for multi-label
- For predictions: `(torch.sigmoid(output) > 0.5).float()`
- Don't use softmax/argmax - that's for single-label
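The loss and evaluation hints combine into a minimal sketch (the shapes and data below are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 3
model = nn.Linear(16, num_classes)
criterion = nn.BCEWithLogitsLoss()     # multi-label: independent sigmoid per class

x = torch.randn(8, 16)
targets = torch.randint(0, 2, (8, num_classes)).float()  # multi-hot FLOAT targets

logits = model(x)
loss = criterion(logits, targets)      # takes raw logits, no explicit sigmoid
loss.backward()

# Evaluation: per-class threshold, NOT argmax
preds = (torch.sigmoid(logits) > 0.5).float()
assert preds.shape == targets.shape    # one independent 0/1 decision per class
```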
### Common Mistakes

- Using BCELoss without sigmoid (need BCEWithLogitsLoss or explicit sigmoid)
- Keeping argmax evaluation (should be threshold-based)
- Incorrect target format (should be float, not long)
---

## Task 5: Frozen Backbone

**Difficulty:** 🟡 Medium
**Category:** Transfer Learning

### Description

A transfer learning setup freezes the backbone but still passes frozen parameters to the optimizer.

### Bugs

1. **Frozen parameters in optimizer** - Backbone is frozen but all params are passed to Adam

### Success Criteria

- Only trainable parameters in the optimizer, OR
- Backbone unfrozen and training
- Model shows improvement (loss decreases)

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| Bug properly fixed | 0.85 - 1.0 |
| Partial fix (some improvement) | 0.40 - 0.70 |
| No fix (backbone still frozen + in optimizer) | 0.10 - 0.25 |
| Code crashes | 0.0 - 0.10 |

### Two Valid Solutions

**Solution A: Only optimize trainable params**

```python
# Only pass parameters that require gradients
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001
)
```

**Solution B: Unfreeze the backbone**

```python
# Remove the freezing, train everything
for param in model.backbone.parameters():
    param.requires_grad = True
```

### Hints

- Check which parameters have `requires_grad=True`
- Filter parameters when creating the optimizer
- Or set `requires_grad` to True for the backbone
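A quick way to verify Solution A, sketched with a toy two-part model (`ToyTransfer` and its attribute names are placeholders, not the task's actual classes):

```python
import torch
import torch.nn as nn

class ToyTransfer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)   # stand-in for a pretrained backbone
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = ToyTransfer()
for p in model.backbone.parameters():     # freeze the backbone
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)   # Solution A

# Sanity check: only the head's weight and bias are being optimized
n_opt = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
n_head = sum(p.numel() for p in model.head.parameters())
assert n_opt == n_head
```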
### Common Mistakes

- Unfreezing the backbone after creating a filtered optimizer (the newly unfrozen params are not in the optimizer's param groups, so they never update)
- Only unfreezing some layers
- Forgetting to also handle bias parameters

---
## Task 6: Input-Output Mismatch

**Difficulty:** 🔴 Hard
**Category:** Data Preprocessing, CNN Architecture

### Description

A CNN for image classification has multiple bugs related to data format and model architecture mismatches.

### Bugs (4 total)

1. **Image size mismatch** - Data is 32x32, model expects 28x28 (fc layer size wrong)
2. **Channel order** - Data is HWC (Height, Width, Channels), model expects CHW
3. **Label encoding** - Labels are one-hot encoded, CrossEntropyLoss expects class indices
4. **Batch dimension** - Single samples are missing the batch dimension during validation

### Success Criteria

- All 4 bugs fixed
- Model trains without shape errors
- Validation accuracy > 0.85

### Grading Rubric

| Condition | Score Range |
|-----------|-------------|
| All 4 bugs fixed, VAL_ACC > 0.90 | 0.90 - 1.0 |
| 3 bugs fixed | 0.65 - 0.80 |
| 2 bugs fixed | 0.40 - 0.55 |
| 1 bug fixed | 0.20 - 0.35 |
| No bugs fixed | 0.0 - 0.15 |

### Hints

**Bug 1: Image size**

- Option A: Resize images to 28x28 (`F.interpolate` or change data generation)
- Option B: Adjust the fc layer input size (`7*7*32` → `8*8*32` for 32x32 input)
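Option A can be sketched as follows (assumes the images are already a float tensor in `[N, C, H, W]` order; the batch here is illustrative):

```python
import torch
import torch.nn.functional as F

images = torch.randn(16, 3, 32, 32)   # toy batch in [N, C, H, W]
# Option A: resize 32x32 down to the 28x28 the model expects
images = F.interpolate(images, size=(28, 28), mode="bilinear", align_corners=False)
assert images.shape == (16, 3, 28, 28)
```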
**Bug 2: Channel order**

```python
# Convert HWC to CHW
images = images.permute(0, 3, 1, 2)  # [N, H, W, C] → [N, C, H, W]
```

**Bug 3: Label encoding**

```python
# Convert one-hot to class indices
labels = one_hot_labels.argmax(dim=1)
```

**Bug 4: Batch dimension**

```python
# Add batch dimension for single samples
x = x.unsqueeze(0)  # [C, H, W] → [1, C, H, W]
```

### Common Mistakes

- Only fixing 1-2 of the 4 bugs
- Fixing in the wrong order (need to be careful with dimensions)
- Using the wrong dimension for `permute`/`argmax`
- Forgetting to handle validation differently from training
---

## Grading Philosophy

### Continuous Rewards

WhipStudio uses continuous scoring (0.0-1.0) rather than binary pass/fail:

- Partial credit for fixing some bugs
- Partial credit for code that runs but doesn't meet thresholds
- Higher scores require meeting stricter thresholds

### Why Continuous?

1. **Better RL training signal** - Agents can learn from small improvements
2. **Differentiates solutions** - Distinguishes between "almost correct" and "completely wrong"
3. **Encourages incremental progress** - Reward shaping guides learning

### Score Interpretation

| Score Range | Meaning |
|-------------|---------|
| 0.90 - 1.00 | Excellent - All bugs fixed, exceeds targets |
| 0.70 - 0.89 | Good - Main bugs fixed, meets basic targets |
| 0.40 - 0.69 | Partial - Some bugs fixed, shows improvement |
| 0.15 - 0.39 | Minimal - Code runs but bugs remain |
| 0.00 - 0.14 | Failed - Crashes or no meaningful output |
| ## Tips for Agents | |
| 1. **Read the task description carefully** - It tells you what metrics to print | |
| 2. **Keep seed values** - Don't remove `torch.manual_seed()` calls | |
| 3. **Print required metrics** - Must output `LOSSES:[...]` and task-specific metrics | |
| 4. **Test with tools first** - Use debugging tools before submitting | |
| 5. **Fix all bugs** - Partial fixes get partial credit | |
| 6. **Don't over-engineer** - Minimal fixes are better than rewrites | |
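Tip 3's metric line might be emitted like this (a sketch; the exact string format each task's harness parses is an assumption here, and the loss values are illustrative):

```python
# Per-epoch losses collected during training, then printed
# in a harness-parseable "LOSSES:[...]" line
losses = [0.693, 0.41, 0.28]      # illustrative values
print(f"LOSSES:{losses}")         # prints: LOSSES:[0.693, 0.41, 0.28]
```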