Spaces: Amogh-kal1/whipstudio

Amogh-kal1 committed on Mar 31

Commit 3f12d92 (verified) · 1 Parent(s): 7ac53ec

Upload folder using huggingface_hub

Files changed (6)

TASK_IMPROVEMENTS_SUMMARY.md +104 -0
server/tasks/graders.py +277 -79
server/tasks/task1_broken_loop.py +34 -11
server/tasks/task2_nan_loss.py +40 -14
server/tasks/task3_oom_leakage.py +20 -23
server/tasks/task5_frozen_backbone.py +15 -7

TASK_IMPROVEMENTS_SUMMARY.md ADDED

	@@ -0,0 +1,104 @@

+# Task Improvements Summary
+## Overview
+This document summarizes the improvements made to the WhipStudio ML debugging environment tasks and graders.
+## Key Issues Fixed
+### 1. Unstable Datasets (Tasks 1 & 2)
+**Problem**: Tasks were generating random data inside training loops, making loss values non-deterministic and graders unreliable.
+**Solution**:
+- Fixed datasets with deterministic seeds (`torch.manual_seed()`)
+- Clear train/validation splits
+- Learnable patterns (e.g., `y = (X[:, 0] > 0).long()`)
+### 2. Gameable Graders
+**Problem**: High learning rates (e.g., lr=1000) could get full scores by producing low loss values despite unstable training.
+**Solution**:
+- Added **loss spike detection** in Task 1 grader
+- If `max_loss > initial_loss * 5.0` or `max_loss > 10.0`, submission is penalized
+- Partial fixes with bad LR get capped at 0.2 score
+### 3. Inverted Scoring Logic
+**Problem**: The `sigmoid_reward()` function had a confusing `invert` parameter that caused inverted scoring (low F1 → high score).
+**Solution**:
+- Created new `sigmoid_score(value, center, steepness, higher_is_better)` function
+- Clear semantics: `higher_is_better=True` rewards values above center
+### 4. Task-Specific Validation
+**Problem**: Generic validation rejected valid submissions (e.g., Task 5 was required to contain a loop even though a single forward pass is valid).
+**Solution**:
+- `is_valid_submission(code, stdout, exit_code, task_id)` now takes task_id
+- Task-specific validation rules
+## Task Details
+### Task 1: Broken Training Loop
+- **Bugs**: `lr=10.0`, `step()` before `backward()`
+- **Buggy score**: ~0.003
+- **Fixed score**: ~0.74
+- **Spike detection**: Penalizes unstable training (score capped at 0.2)
+### Task 2: NaN Loss
+- **Bug**: `torch.log(pred)` when pred can be 0.0
+- **Task change**: Raised the buggy code's LR from 0.01 to 0.5 so the NaN actually occurs
+- **Buggy score**: ~0.16 (has NaN values)
+- **Fixed score**: ~0.83
+### Task 3: Label Inversion
+- **Bug**: `criterion(out, 1 - yb)` inverts labels
+- **Buggy score**: ~0.34 (accuracy ~5%)
+- **Fixed score**: ~0.80 (accuracy ~95%)
+### Task 4: Wrong Loss (Multi-label)
+- **Bug**: Using `CrossEntropyLoss` instead of `BCEWithLogitsLoss`
+- **Buggy score**: ~0.74 (F1 ~0.73)
+- **Fixed score**: ~0.97 (F1 = 1.0)
+### Task 5: Frozen Backbone
+- **Bug**: Backbone frozen but still passed to optimizer
+- **Two valid fixes**:
+  1. Unfreeze backbone (grad_norm > 0; the grader requires > 0.1)
+  2. Only pass head params (param_count < 100k)
+- **Added**: `OPTIMIZER_PARAM_COUNT` metric for grading
+- **Buggy score**: ~0.18
+- **Fixed score**: ~0.88
+## Grading Structure
+All graders follow a consistent pattern:
+```python
+# Primary metric (50-60% weight)
+primary_score = sigmoid_score(metric, center, steepness, higher_is_better) * weight
+# Secondary metrics (30% weight)
+secondary_score = ...
+# Bonus conditions (10-20%)
+bonus = ...
+final_score = min(1.0, primary_score + secondary_score + bonus)
+```
+## Testing Results
+| Task | Buggy Score | Fixed Score | Discrimination |
+|------|-------------|-------------|----------------|
+| 1    | 0.003       | 0.739       | ✅ Excellent   |
+| 2    | 0.157       | 0.827       | ✅ Excellent   |
+| 3    | 0.344       | 0.804       | ✅ Excellent   |
+| 4    | 0.735       | 0.966       | ✅ Good        |
+| 5    | 0.179       | 0.879       | ✅ Excellent   |
+## Files Modified
+- `server/tasks/task1_broken_loop.py` - Fixed dataset, learnable pattern
+- `server/tasks/task2_nan_loss.py` - Increased LR to trigger NaN bug
+- `server/tasks/task3_oom_leakage.py` - Redesigned with label inversion bug
+- `server/tasks/task5_frozen_backbone.py` - Added OPTIMIZER_PARAM_COUNT metric
+- `server/tasks/graders.py` - Complete rewrite with proper scoring logic
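
To make the "Inverted Scoring Logic" issue concrete, here is a minimal sketch that replays the pre-commit `sigmoid_reward` on an F1 metric the way the old `grade_task4` called it (`center=0.6`, `steepness=10.0`, `invert=False`). The function body is reproduced from the removed lines of `server/tasks/graders.py`; the two F1 values are made-up inputs chosen only for illustration.

```python
import math


def sigmoid_reward(value: float, center: float, steepness: float, invert: bool = False) -> float:
    """Pre-commit helper, reproduced from the removed lines of graders.py."""
    try:
        if invert:
            x = steepness * (value - center)
        else:
            x = steepness * (center - value)
        return round(1.0 / (1.0 + math.exp(-x)), 4)
    except OverflowError:
        return 0.0 if (invert and value > center) or (not invert and value < center) else 1.0


# Hypothetical F1 values, chosen only to show the direction of the score.
poor_f1, good_f1 = 0.2, 0.9

poor_score = sigmoid_reward(poor_f1, center=0.6, steepness=10.0, invert=False)
good_score = sigmoid_reward(good_f1, center=0.6, steepness=10.0, invert=False)

# With invert=False the helper rewards values BELOW the center,
# so the poor F1 ends up with the higher score.
print(f"F1={poor_f1}: score={poor_score}")
print(f"F1={good_f1}: score={good_score}")
```

The replacement `sigmoid_score(..., higher_is_better=True)` names the direction at the call site, which is how the updated `grade_task4` scores F1.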

server/tasks/graders.py CHANGED

@@ -45,18 +45,24 @@ def parse_val_accs(stdout: str) -> list[float]:
 def parse_scalar(stdout: str, key: str) -> float | None:
     stdout = extract_metrics_block(stdout)
-    match = re.search(rf"{key}:([-\d.]+)", stdout)
     return float(match.group(1)) if match else None
-def is_valid_submission(code: str, stdout: str, exit_code: int) -> tuple[bool, str]:
     if exit_code == 0:
-        if "LOSSES:" not in stdout and "FINAL_LOSS:" not in stdout:
             return False, "No valid metrics output detected"
         if "LOSSES:" in stdout:
             losses = parse_losses(stdout)
             if len(losses) < 5:
                 return False, "Fewer than 5 loss values parsed"
     try:
         tree = ast.parse(code)
         if not any(isinstance(node, (ast.For, ast.While)) for node in ast.walk(tree)):
@@ -66,88 +72,191 @@ def is_valid_submission(code: str, stdout: str, exit_code: int) -> tuple[bool, s
     return True, ""
-def sigmoid_reward(value: float, center: float, steepness: float, invert: bool = False) -> float:
     try:
-        if invert:
             x = steepness * (value - center)
         else:
             x = steepness * (center - value)
         return round(1.0 / (1.0 + math.exp(-x)), 4)
     except OverflowError:
-        return 0.0 if (invert and value > center) or (not invert and value < center) else 1.0
 def grade_task1(result: RunResult) -> tuple[float, dict]:
-    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code)
     if not valid:
         return 0.0, {"reason": reason}
     if result.timed_out:
         return 0.05, {"reason": "timed_out"}
     if result.exit_code != 0:
-        return 0.0, {"reason": "crash"}
     losses = parse_losses(result.stdout)
     if not losses:
         return 0.1, {"reason": "no_losses_parsed"}
-    if any(math.isnan(loss) or math.isinf(loss) for loss in losses):
-        return 0.15, {"reason": "nan_inf_found"}
-    final = losses[-1]
-    base_score = sigmoid_reward(final, center=0.75, steepness=3.0, invert=True)
-    bonus = 0.0
-    half = len(losses) // 2
-    if half > 0:
-        first_half = sum(losses[:half]) / half
-        second_half = sum(losses[half:]) / len(losses[half:])
-        if second_half < 0.85 * first_half:
-            bonus = 0.1
-    final_score = min(1.0, base_score + bonus)
-    breakdown = {"base_score": base_score, "monotonicity_bonus": bonus}
     return final_score, breakdown
 def grade_task2(result: RunResult) -> tuple[float, dict]:
-    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code)
     if not valid:
         return 0.0, {"reason": reason}
     if result.timed_out:
         return 0.05, {"reason": "timed_out"}
     if result.exit_code != 0:
-        return 0.0, {"reason": "crash"}
     losses = parse_losses(result.stdout)
     if not losses or len(losses) < 30:
         return 0.1, {"reason": "too_few_losses"}
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
-    if nan_count == len(losses):
-        return 0.0, {"reason": "all_nans"}
     nan_ratio = nan_count / len(losses)
     finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
-    final_finite_loss = finite_losses[-1] if finite_losses else float('inf')
-    convergence_score = sigmoid_reward(final_finite_loss, center=0.5, steepness=4.0, invert=True)
-    convergence_score *= (1.0 - nan_ratio)
-    stability_bonus = 0.0
-    if len(finite_losses) >= 20:
-        tail = finite_losses[-20:]
-        mean_tail = sum(tail) / len(tail)
-        tail_variance = sum((x - mean_tail) ** 2 for x in tail) / len(tail)
-        stability_bonus = sigmoid_reward(tail_variance, center=0.01, steepness=200.0, invert=True) * 0.1
-    final_score = min(1.0, convergence_score + stability_bonus)
-    breakdown = {"convergence_score": convergence_score, "nan_penalty": (1.0 - nan_ratio), "stability_bonus": stability_bonus, "nan_ratio": nan_ratio}
     return final_score, breakdown
 def grade_task3(result: RunResult) -> tuple[float, dict]:
-    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code)
     if not valid:
         return 0.0, {"reason": reason}
@@ -155,35 +264,66 @@ def grade_task3(result: RunResult) -> tuple[float, dict]:
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
-        if "out of memory" in result.stderr.lower():
             return 0.1, {"reason": "oom"}
-        return 0.0, {"reason": "crash"}
     val_accs = parse_val_accs(result.stdout)
     final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
     memory_score = 0.0
     if final_loss_val is not None:
-        memory_score = sigmoid_reward(final_loss_val, center=50.0, steepness=0.05, invert=True) * 0.5
-    leakage_score = 0.0
-    early_acc = 0.0
     final_acc = 0.0
     if val_accs and len(val_accs) >= 2:
-        early_acc = sum(val_accs[:2]) / 2.0
         final_acc = val_accs[-1]
-        leak_p1 = sigmoid_reward(early_acc, center=0.75, steepness=20.0, invert=True) * 0.3
-        leak_p2 = sigmoid_reward(final_acc, center=0.68, steepness=15.0, invert=False) * 0.7
-        leakage_score = (leak_p1 + leak_p2) * 0.5
-    final_score = min(1.0, memory_score + leakage_score)
-    breakdown = {"memory_score": memory_score, "leakage_score": leakage_score, "early_acc": early_acc, "final_acc": final_acc}
     return final_score, breakdown
 def grade_task4(result: RunResult) -> tuple[float, dict]:
-    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code)
     if not valid:
         return 0.0, {"reason": reason}
@@ -191,31 +331,64 @@ def grade_task4(result: RunResult) -> tuple[float, dict]:
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
-        return 0.0, {"reason": "crash"}
     final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
     avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
     f1 = parse_scalar(result.stdout, "F1_SCORE")
-    loss_score = 0.0
-    if final_loss is not None:
-        loss_score = sigmoid_reward(final_loss, center=0.5, steepness=4.0, invert=True) * 0.3
     labels_score = 0.0
     if avg_labels is not None:
-        labels_score = sigmoid_reward(avg_labels, center=1.0, steepness=5.0, invert=False) * 0.3
-    f1_s = 0.0
-    if f1 is not None:
-        f1_s = sigmoid_reward(f1, center=0.6, steepness=10.0, invert=False) * 0.4
-    final_score = min(1.0, loss_score + labels_score + f1_s)
-    breakdown = {"loss_score": loss_score, "labels_score": labels_score, "f1_score": f1_s}
     return final_score, breakdown
 def grade_task5(result: RunResult) -> tuple[float, dict]:
-    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code)
     if not valid:
         return 0.0, {"reason": reason}
@@ -223,21 +396,46 @@ def grade_task5(result: RunResult) -> tuple[float, dict]:
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
-        return 0.0, {"reason": "crash"}
     final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
     grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
     loss_score = 0.0
     if final_loss is not None:
-        loss_score = sigmoid_reward(final_loss, center=2.2, steepness=3.0, invert=True) * 0.5
-    grad_score = 0.0
-    if grad_norm is not None:
-        grad_score = sigmoid_reward(grad_norm, center=0.001, steepness=1000.0, invert=False) * 0.5
-    final_score = min(1.0, loss_score + grad_score)
-    breakdown = {"loss_score": loss_score, "grad_score": grad_score}
     return final_score, breakdown
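
The character-class change in `parse_scalar` matters for metrics printed in scientific notation, which small gradient norms often are. The sketch below compares the old and new patterns on a made-up metric line. It skips `extract_metrics_block` (not shown in this diff), and the helper name `parse_with` is hypothetical.

```python
import re
from typing import Optional

OLD_PATTERN = r"([-\d.]+)"      # pre-commit character class
NEW_PATTERN = r"([-\d.eE+]+)"   # post-commit character class, accepts exponents


def parse_with(pattern: str, stdout: str, key: str) -> Optional[float]:
    """Hypothetical stand-in for parse_scalar, without extract_metrics_block."""
    match = re.search(rf"{key}:{pattern}", stdout)
    return float(match.group(1)) if match else None


# Made-up metric line for illustration.
line = "BACKBONE_GRAD_NORM:1.5e-05"

old_value = parse_with(OLD_PATTERN, line, "BACKBONE_GRAD_NORM")
new_value = parse_with(NEW_PATTERN, line, "BACKBONE_GRAD_NORM")

print(old_value)  # old pattern stops at "e" and silently yields only the mantissa
print(new_value)  # new pattern captures the full literal
```

The new class is permissive rather than a strict float grammar: a malformed token such as `1e+` would still match and then fail inside `float()`, so it relies on the task scripts printing well-formed numbers.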

 def parse_scalar(stdout: str, key: str) -> float | None:
     stdout = extract_metrics_block(stdout)
+    match = re.search(rf"{key}:([-\d.eE+]+)", stdout)
     return float(match.group(1)) if match else None
+def is_valid_submission(code: str, stdout: str, exit_code: int, task_id: str = "") -> tuple[bool, str]:
+    """Validate submission with task-specific rules."""
     if exit_code == 0:
+        if "LOSSES:" not in stdout and "FINAL_LOSS:" not in stdout and "VAL_ACCS:" not in stdout:
             return False, "No valid metrics output detected"
         if "LOSSES:" in stdout:
             losses = parse_losses(stdout)
             if len(losses) < 5:
                 return False, "Fewer than 5 loss values parsed"
+    # Task 5 doesn't require a loop - it's a single forward/backward pass
+    if task_id == "task5":
+        return True, ""
     try:
         tree = ast.parse(code)
         if not any(isinstance(node, (ast.For, ast.While)) for node in ast.walk(tree)):
     return True, ""
+def sigmoid_score(value: float, center: float, steepness: float, higher_is_better: bool = True) -> float:
+    """
+    Compute sigmoid-based score.
+    Args:
+        value: The metric value to score
+        center: The center point of the sigmoid (value at which score = 0.5)
+        steepness: How quickly the score transitions around the center
+        higher_is_better: If True, reward values > center. If False, reward values < center.
+    Returns:
+        Score between 0.0 and 1.0
+    """
     try:
+        if higher_is_better:
             x = steepness * (value - center)
         else:
             x = steepness * (center - value)
         return round(1.0 / (1.0 + math.exp(-x)), 4)
     except OverflowError:
+        if higher_is_better:
+            return 1.0 if value > center else 0.0
+        else:
+            return 1.0 if value < center else 0.0
+# Keep old function for backwards compatibility but mark deprecated
+def sigmoid_reward(value: float, center: float, steepness: float, invert: bool = False) -> float:
+    """Deprecated: Use sigmoid_score with higher_is_better parameter instead."""
+    return sigmoid_score(value, center, steepness, higher_is_better=invert)
 def grade_task1(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 1: Broken Training Loop
+    Bugs: 1) lr=10.0 (too high), 2) step() before backward()
+    Grading criteria:
+    - Must have low final loss (<0.3) - indicates proper training
+    - Must have high validation accuracy (>0.85) - indicates learning
+    - Must show monotonic improvement - indicates proper gradient flow
+    - Must NOT have loss spikes - indicates stable training
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task1")
     if not valid:
         return 0.0, {"reason": reason}
     if result.timed_out:
         return 0.05, {"reason": "timed_out"}
     if result.exit_code != 0:
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
     losses = parse_losses(result.stdout)
     if not losses:
         return 0.1, {"reason": "no_losses_parsed"}
+    # Check for NaN/Inf - indicates numerical instability
+    nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
+    if nan_count > 0:
+        return 0.15, {"reason": "nan_inf_found", "nan_count": nan_count}
+    val_acc = parse_scalar(result.stdout, "VAL_ACC")
+    if val_acc is None:
+        return 0.1, {"reason": "no_val_acc"}
+    final_loss = losses[-1]
+    initial_loss = losses[0]
+    max_loss = max(losses)
+    # Check for loss instability (spikes indicate LR too high)
+    # Healthy training shouldn't have losses > 5x initial loss
+    if max_loss > initial_loss * 5.0 or max_loss > 10.0:
+        return 0.2, {
+            "reason": "loss_unstable_spikes",
+            "max_loss": max_loss,
+            "final_loss": final_loss,
+            "val_acc": val_acc
+        }
+    # Check for loss explosion at end
+    if final_loss > 5.0:
+        return 0.15, {"reason": "loss_unstable", "final_loss": final_loss, "val_acc": val_acc}
+    # Primary: Validation accuracy (higher is better, target > 0.85)
+    acc_score = sigmoid_score(val_acc, center=0.85, steepness=15.0, higher_is_better=True) * 0.5
+    # Secondary: Final loss should be low (lower is better, target < 0.3)
+    loss_score = sigmoid_score(final_loss, center=0.3, steepness=8.0, higher_is_better=False) * 0.3
+    # Bonus: Monotonic improvement (loss should decrease over time)
+    monotonic_bonus = 0.0
+    if len(losses) >= 10:
+        first_quarter = sum(losses[:len(losses)//4]) / (len(losses)//4)
+        last_quarter = sum(losses[-(len(losses)//4):]) / (len(losses)//4)
+        if last_quarter < first_quarter * 0.7:  # At least 30% improvement
+            monotonic_bonus = 0.2
+    final_score = min(1.0, acc_score + loss_score + monotonic_bonus)
+    breakdown = {
+        "acc_score": round(acc_score, 4),
+        "loss_score": round(loss_score, 4),
+        "monotonic_bonus": monotonic_bonus,
+        "val_acc": val_acc,
+        "final_loss": final_loss,
+        "initial_loss": initial_loss,
+        "max_loss": max_loss
+    }
     return final_score, breakdown
 def grade_task2(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 2: NaN Loss
+    Bug: torch.log(pred) when pred can be 0.0 after sigmoid
+    Grading criteria:
+    - Must have NO NaN/Inf losses - this is the primary test
+    - Must have good validation accuracy (>0.75)
+    - Must show loss convergence (<0.4)
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task2")
     if not valid:
         return 0.0, {"reason": reason}
     if result.timed_out:
         return 0.05, {"reason": "timed_out"}
     if result.exit_code != 0:
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
     losses = parse_losses(result.stdout)
     if not losses or len(losses) < 30:
         return 0.1, {"reason": "too_few_losses"}
     nan_count = sum(1 for loss in losses if math.isnan(loss) or math.isinf(loss))
+    # Primary criterion: NO NaN/Inf allowed - this is the core bug being tested
     nan_ratio = nan_count / len(losses)
+    if nan_count > 0:
+        # Heavily penalize any NaN - this is THE bug we're testing
+        return max(0.05, 0.3 * (1.0 - nan_ratio)), {
+            "reason": "has_nans",
+            "nan_ratio": nan_ratio,
+            "nan_count": nan_count
+        }
+    val_acc = parse_scalar(result.stdout, "VAL_ACC")
+    if val_acc is None:
+        return 0.2, {"reason": "no_val_acc_but_no_nans"}
     finite_losses = [loss for loss in losses if not math.isnan(loss) and not math.isinf(loss)]
+    final_loss = finite_losses[-1] if finite_losses else float('inf')
+    # No NaN = base score of 0.4 (the bug is fixed)
+    base_score = 0.4
+    # Validation accuracy bonus (higher is better, target > 0.75)
+    acc_score = sigmoid_score(val_acc, center=0.75, steepness=12.0, higher_is_better=True) * 0.35
+    # Convergence bonus (lower is better, target < 0.4)
+    convergence_score = sigmoid_score(final_loss, center=0.4, steepness=6.0, higher_is_better=False) * 0.25
+    final_score = min(1.0, base_score + acc_score + convergence_score)
+    breakdown = {
+        "base_score": base_score,
+        "acc_score": round(acc_score, 4),
+        "convergence_score": round(convergence_score, 4),
+        "nan_count": nan_count,
+        "val_acc": val_acc,
+        "final_loss": final_loss
+    }
     return final_score, breakdown
 def grade_task3(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 3: Memory Leak + Missing zero_grad
+    Bugs: 1) total_loss += loss retains graph (memory leak)
+          2) Missing optimizer.zero_grad() causes gradient accumulation
+    Grading criteria:
+    - FINAL_LOSS should be reasonable (<20) - memory leak fixed
+    - VAL_ACC should be high (>0.8) - gradient accumulation fixed
+    - Learning trajectory should improve over epochs
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task3")
     if not valid:
         return 0.0, {"reason": reason}
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
+        if "out of memory" in result.stderr.lower() or "oom" in result.stderr.lower():
             return 0.1, {"reason": "oom"}
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
     val_accs = parse_val_accs(result.stdout)
     final_loss_val = parse_scalar(result.stdout, "FINAL_LOSS")
+    # Memory leak check: FINAL_LOSS should be reasonable
+    # With .item(), total_loss is sum of scalars (~12-20 for 20 epochs)
     memory_score = 0.0
     if final_loss_val is not None:
+        memory_score = sigmoid_score(final_loss_val, center=20.0, steepness=0.2, higher_is_better=False) * 0.35
+    else:
+        memory_score = 0.0
+    # Gradient accumulation check: accuracy should be high if training properly
+    # Without zero_grad(), gradients accumulate and training degrades
+    acc_score = 0.0
     final_acc = 0.0
+    early_acc = 0.0
+    trajectory_bonus = 0.0
     if val_accs and len(val_accs) >= 2:
+        early_acc = sum(val_accs[:3]) / min(3, len(val_accs))
         final_acc = val_accs[-1]
+        # Final accuracy is the main indicator of correct training
+        acc_score = sigmoid_score(final_acc, center=0.8, steepness=15.0, higher_is_better=True) * 0.45
+        # Learning trajectory: should improve over time
+        if len(val_accs) >= 5:
+            improvement = final_acc - early_acc
+            if improvement > 0.05:
+                trajectory_bonus = 0.1
+            elif improvement > 0.0:
+                trajectory_bonus = 0.05
+    final_score = min(1.0, memory_score + acc_score + trajectory_bonus)
+    breakdown = {
+        "memory_score": round(memory_score, 4),
+        "acc_score": round(acc_score, 4),
+        "trajectory_bonus": round(trajectory_bonus, 4),
+        "early_acc": round(early_acc, 4),
+        "final_acc": round(final_acc, 4),
+        "final_loss": final_loss_val
+    }
     return final_score, breakdown
 def grade_task4(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 4: Wrong Loss (Multi-label Classification)
+    Bug: Using CrossEntropyLoss instead of BCEWithLogitsLoss for multi-label
+    Grading criteria:
+    - F1 score should be high (> 0.6) - primary metric
+    - avg_labels should be > 1.0 (proper multi-label output)
+    - Loss should converge
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task4")
     if not valid:
         return 0.0, {"reason": reason}
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
     final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
     avg_labels = parse_scalar(result.stdout, "AVG_LABELS")
     f1 = parse_scalar(result.stdout, "F1_SCORE")
+    # F1 score - PRIMARY metric (higher is better, target > 0.6)
+    f1_score_val = 0.0
+    if f1 is not None:
+        f1_score_val = sigmoid_score(f1, center=0.6, steepness=10.0, higher_is_better=True) * 0.5
+    # Multi-label check: avg_labels should be > 1.0 (proper multi-label predictions)
+    # With 30% probability per class and 5 classes, expected avg ~1.5 labels/sample
     labels_score = 0.0
     if avg_labels is not None:
+        if avg_labels < 0.5:
+            # Way too few labels - likely single-label behavior
+            labels_score = 0.0
+        elif avg_labels >= 1.0:
+            # Good - multiple labels per sample
+            labels_score = 0.3
+        else:
+            # Partial credit
+            labels_score = sigmoid_score(avg_labels, center=1.0, steepness=5.0, higher_is_better=True) * 0.3
+    # Loss convergence (lower is better, target < 0.5)
+    loss_score = 0.0
+    if final_loss is not None:
+        loss_score = sigmoid_score(final_loss, center=0.5, steepness=4.0, higher_is_better=False) * 0.2
+    final_score = min(1.0, f1_score_val + labels_score + loss_score)
+    breakdown = {
+        "f1_score": round(f1_score_val, 4),
+        "labels_score": round(labels_score, 4),
+        "loss_score": round(loss_score, 4),
+        "avg_labels": avg_labels,
+        "f1": f1,
+        "final_loss": final_loss
+    }
     return final_score, breakdown
 def grade_task5(result: RunResult) -> tuple[float, dict]:
+    """
+    Task 5: Frozen Backbone with Optimizer Waste
+    Bug: Backbone is frozen but still passed to optimizer (wastes memory)
+    Valid fixes:
+    1. Unfreeze backbone -> grad_norm > 0, same param count
+    2. Only pass head params to optimizer -> grad_norm = 0, reduced param count
+    The buggy code has: grad_norm = 0, param_count = 530442 (full model)
+    Grading criteria:
+    - Either backbone has gradients (unfrozen), OR
+    - Optimizer param count is reduced (only head)
+    """
+    valid, reason = is_valid_submission(result.fixed_code, result.stdout, result.exit_code, "task5")
     if not valid:
         return 0.0, {"reason": reason}
         return 0.1, {"reason": "timed_out"}
     if result.exit_code != 0:
+        return 0.0, {"reason": "crash", "stderr": result.stderr[:500]}
     final_loss = parse_scalar(result.stdout, "FINAL_LOSS")
     grad_norm = parse_scalar(result.stdout, "BACKBONE_GRAD_NORM")
+    param_count = parse_scalar(result.stdout, "OPTIMIZER_PARAM_COUNT")
+    # Loss should be reasonable (10-class classification, CE loss)
     loss_score = 0.0
     if final_loss is not None:
+        loss_score = sigmoid_score(final_loss, center=2.5, steepness=2.0, higher_is_better=False) * 0.3
+    # The bug: frozen backbone (grad_norm=0) but full params in optimizer (param_count=530442)
+    # Fix 1: Unfreeze -> grad_norm > 0.1 (threshold checked below)
+    # Fix 2: Only head -> param_count < 100000 (head has ~5130 params)
+    fix_score = 0.0
+    fix_type = "none"
+    if grad_norm is not None and grad_norm > 0.1:
+        # Backbone is unfrozen and training
+        fix_score = 0.7
+        fix_type = "unfrozen"
+    elif param_count is not None and param_count < 100000:
+        # Only head params in optimizer (head has ~5130 params)
+        fix_score = 0.7
+        fix_type = "head_only"
+    elif grad_norm is not None and grad_norm == 0.0 and (param_count is None or param_count > 100000):
+        # Buggy state: frozen backbone but full params in optimizer
+        fix_score = 0.0
+        fix_type = "buggy"
+    final_score = min(1.0, loss_score + fix_score)
+    breakdown = {
+        "loss_score": round(loss_score, 4),
+        "fix_score": round(fix_score, 4),
+        "fix_type": fix_type,
+        "grad_norm": grad_norm,
+        "param_count": param_count,
+        "final_loss": final_loss
+    }
     return final_score, breakdown
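
As a compact illustration of how the rewritten graders combine an early-exit stability check with weighted sigmoid components, the sketch below mirrors the Task 1 thresholds and weights from `grade_task1`. It takes already-parsed values instead of a `RunResult` and omits the validity, NaN, and final-loss checks. The function name `score_task1_like` and both loss traces are hypothetical.

```python
import math


def sigmoid_score(value, center, steepness, higher_is_better=True):
    """Sigmoid helper with the same logic as the one added in this commit."""
    try:
        if higher_is_better:
            x = steepness * (value - center)
        else:
            x = steepness * (center - value)
        return round(1.0 / (1.0 + math.exp(-x)), 4)
    except OverflowError:
        if higher_is_better:
            return 1.0 if value > center else 0.0
        return 1.0 if value < center else 0.0


def score_task1_like(losses, val_acc):
    """Hypothetical condensed version of the Task 1 scoring path."""
    initial_loss, final_loss, max_loss = losses[0], losses[-1], max(losses)

    # Early exit: a spike above 5x the initial loss (or above 10.0) caps the score.
    if max_loss > initial_loss * 5.0 or max_loss > 10.0:
        return 0.2, "loss_unstable_spikes"

    # Primary (50%) and secondary (30%) components.
    acc_score = sigmoid_score(val_acc, center=0.85, steepness=15.0, higher_is_better=True) * 0.5
    loss_score = sigmoid_score(final_loss, center=0.3, steepness=8.0, higher_is_better=False) * 0.3

    # Bonus (20%): mean of the last quarter at least 30% below the first quarter.
    bonus = 0.0
    quarter = len(losses) // 4
    if len(losses) >= 10:
        first = sum(losses[:quarter]) / quarter
        last = sum(losses[-quarter:]) / quarter
        if last < first * 0.7:
            bonus = 0.2

    return min(1.0, acc_score + loss_score + bonus), "scored"


# Made-up loss traces for illustration.
smooth = [0.7 * (0.9 ** i) for i in range(20)]   # steadily decreasing
spiky = smooth[:5] + [12.0] + smooth[6:]          # same trace with one large spike

smooth_result = score_task1_like(smooth, val_acc=0.95)
spiky_result = score_task1_like(spiky, val_acc=0.95)
print(smooth_result)
print(spiky_result)
```

Because the spike check returns before any component is computed, a run with high validation accuracy and a low final loss still cannot exceed 0.2 if its loss trace spiked along the way.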

server/tasks/task1_broken_loop.py CHANGED

@@ -1,29 +1,52 @@
 TASK_DESCRIPTION = """
 This 2-class linear classifier training loop has bugs preventing convergence.
-Fix it so that after 50 steps the loss is below 0.75 and decreasing.
-Model: nn.Linear(10, 2), dataset: random 2-class, 32 samples/batch.
 Print losses as: LOSSES:[val1, val2, ...]
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
 torch.manual_seed(0)
 model = nn.Linear(10, 2)
 optimizer = torch.optim.Adam(model.parameters(), lr=10.0)  # BUG 1: lr too high
 criterion = nn.CrossEntropyLoss()
 losses = []
-for step in range(50):
-    x = torch.randn(32, 10)
-    y = torch.randint(0, 2, (32,))
-    optimizer.zero_grad()
-    logits = model(x)
-    loss = criterion(logits, y)
-    optimizer.step()   # BUG 2: step before backward
-    loss.backward()    # BUG 3: backward after step
-    losses.append(loss.item())
 print('##METRICS_START##')
 print('LOSSES:' + str(losses))
 print('##METRICS_END##')
 """

 TASK_DESCRIPTION = """
 This 2-class linear classifier training loop has bugs preventing convergence.
+Fix it so that after 50 epochs the loss is below 0.5 and validation accuracy is above 0.80.
+Model: nn.Linear(10, 2), dataset: fixed 2-class (160 train, 40 val samples).
 Print losses as: LOSSES:[val1, val2, ...]
+Print validation accuracy as: VAL_ACC:X.XX
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
+from torch.utils.data import TensorDataset, DataLoader
 torch.manual_seed(0)
+# Generate fixed training and validation datasets with learnable pattern
+# y = 1 if first feature > 0, else 0
+X_train = torch.randn(160, 10)
+y_train = (X_train[:, 0] > 0).long()
+X_val = torch.randn(40, 10)
+y_val = (X_val[:, 0] > 0).long()
+train_dataset = TensorDataset(X_train, y_train)
+train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
 model = nn.Linear(10, 2)
 optimizer = torch.optim.Adam(model.parameters(), lr=10.0)  # BUG 1: lr too high
 criterion = nn.CrossEntropyLoss()
 losses = []
+for epoch in range(50):
+    for x, y in train_loader:
+        optimizer.zero_grad()
+        logits = model(x)
+        loss = criterion(logits, y)
+        optimizer.step()   # BUG 2: step before backward
+        loss.backward()    # BUG 3: backward after step
+        losses.append(loss.item())
+# Validation
+model.eval()
+with torch.no_grad():
+    val_logits = model(X_val)
+    val_preds = val_logits.argmax(dim=1)
+    val_acc = (val_preds == y_val).float().mean().item()
 print('##METRICS_START##')
 print('LOSSES:' + str(losses))
+print('VAL_ACC:' + str(round(val_acc, 4)))
 print('##METRICS_END##')
 """

server/tasks/task2_nan_loss.py CHANGED

@@ -1,32 +1,58 @@
 TASK_DESCRIPTION = """
-This binary regression trainer produces NaN loss around step 15.
-Fix the numerical instability so loss stays finite for all 60 steps
-and the final loss is below 0.5.
 Print losses as: LOSSES:[val1, val2, ...]
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
 torch.manual_seed(42)
 model = nn.Linear(16, 1)
-optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
 losses = []
-for step in range(60):
-    x = torch.randn(64, 16)
-    y = torch.rand(64, 1)
-    optimizer.zero_grad()
-    pred = torch.sigmoid(model(x))
-    # BUG: log(pred) can be -inf when pred rounds to 0.0
-    loss = -torch.mean(y * torch.log(pred) + (1 - y) * torch.log(1 - pred))
-    loss.backward()
-    optimizer.step()
-    losses.append(loss.item())
 print('##METRICS_START##')
 print('LOSSES:' + str(losses))
 print('##METRICS_END##')
 """
 GROUND_TRUTH_BUGS = [
     "torch.log(pred) when pred can be 0.0 after sigmoid — use F.binary_cross_entropy or clamp",
 ]

 TASK_DESCRIPTION = """
+This binary classification trainer produces NaN loss after a few epochs.
+Fix the numerical instability so loss stays finite for all 60 epochs
+and the final loss is below 0.4 with validation accuracy above 0.75.
 Print losses as: LOSSES:[val1, val2, ...]
+Print validation accuracy as: VAL_ACC:X.XX
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
+from torch.utils.data import TensorDataset, DataLoader
 torch.manual_seed(42)
+# Generate fixed training and validation datasets with learnable pattern
+# y = 1 if sum of first 3 features > 0, else 0
+X_train = torch.randn(320, 16)
+y_train = (X_train[:, :3].sum(dim=1, keepdim=True) > 0).float()
+X_val = torch.randn(80, 16)
+y_val = (X_val[:, :3].sum(dim=1, keepdim=True) > 0).float()
+train_dataset = TensorDataset(X_train, y_train)
+train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
 model = nn.Linear(16, 1)
+# BUG AMPLIFIER: Higher learning rate makes predictions more extreme, causing log(0)
+optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
 losses = []
+for epoch in range(60):
+    for x, y in train_loader:
+        optimizer.zero_grad()
+        pred = torch.sigmoid(model(x))
+        # BUG: log(pred) can be -inf when pred rounds to 0.0 due to extreme weights
+        # This happens because SGD with high LR pushes weights to extreme values
+        loss = -torch.mean(y * torch.log(pred) + (1 - y) * torch.log(1 - pred))
+        loss.backward()
+        optimizer.step()
+        losses.append(loss.item())
+# Validation
+model.eval()
+with torch.no_grad():
+    val_pred = torch.sigmoid(model(X_val))
+    val_binary = (val_pred > 0.5).float()
+    val_acc = (val_binary == y_val).float().mean().item()
 print('##METRICS_START##')
 print('LOSSES:' + str(losses))
+print('VAL_ACC:' + str(round(val_acc, 4)))
 print('##METRICS_END##')
 """
 GROUND_TRUTH_BUGS = [
     "torch.log(pred) when pred can be 0.0 after sigmoid — use F.binary_cross_entropy or clamp",
+    "High learning rate (0.5) causes extreme predictions",
 ]
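
The first ground-truth bug above (taking `log` of a sigmoid output that has rounded to 0.0 or 1.0) can be illustrated without any framework. The sketch below is not the task's reference solution; the function names are hypothetical, and it uses scalar Python floats instead of tensors. It contrasts the naive formula from `BUGGY_CODE` with the two fixes the bug description mentions: clamping the probability, and computing the loss directly from the logit, which is the idea behind `F.binary_cross_entropy_with_logits` and `nn.BCEWithLogitsLoss`.

```python
import math


def naive_bce(logit, target):
    """BCE as written in the buggy trainer: sigmoid first, then log."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # For a large logit, p rounds to exactly 1.0 and log(1 - p) is log(0).
    # math.log raises ValueError here; torch.log would return -inf instead.
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def clamped_bce(logit, target, eps=1e-7):
    """Same formula, but p is kept inside [eps, 1 - eps] before the log."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def stable_bce_with_logits(logit, target):
    """Loss computed from the logit, so no probability ever reaches 0 or 1.

    Uses the identity: loss = max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


if __name__ == "__main__":
    # A confident but wrong prediction: large positive logit, target 0.
    try:
        print("naive:", naive_bce(40.0, 0.0))
    except ValueError as err:
        print("naive failed:", err)
    print("clamped:", clamped_bce(40.0, 0.0))
    print("stable:", stable_bce_with_logits(40.0, 0.0))
```

Clamping keeps the loss finite but caps it at roughly `-log(eps)`, so very wrong predictions stop contributing a larger penalty. The logit-based form stays finite and keeps growing linearly with the logit, which is usually the preferable fix.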

server/tasks/task3_oom_leakage.py CHANGED

@@ -1,50 +1,47 @@
 TASK_DESCRIPTION = """
-This trainer has TWO independent bugs:
-1. A memory leak causing OOM crash before epoch 5 on CPU.
-2. Data leakage inflating validation accuracy.
-Fix both. After 20 epochs: val_acc > 0.70, no OOM, no suspicious early accuracy spike.
 Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
-from torch.utils.data import DataLoader, TensorDataset, random_split
 torch.manual_seed(42)
-X = torch.randn(1000, 20)
-y = (X[:, 0] > 0).float()
-# BUG 1: augmentation before split — val set gets augmented
-X = X + torch.randn_like(X) * 0.1
-train_ds, val_ds = random_split(TensorDataset(X, y), [800, 200])
 model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
-optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
 criterion = nn.BCEWithLogitsLoss()
-train_losses, val_accs = [], []
-total_loss = torch.tensor(0.0)  # BUG 2: keeps computation graph alive
 for epoch in range(20):
     model.train()
-    for xb, yb in DataLoader(train_ds, batch_size=32):
         optimizer.zero_grad()
         out = model(xb).squeeze()
-        loss = criterion(out, yb)
         loss.backward()
         optimizer.step()
-        total_loss = total_loss + loss  # BUG 2: graph accumulates
     model.eval()
     with torch.no_grad():
-        idx = val_ds.indices
-        xv, yv = X[idx], y[idx]
-        preds = (torch.sigmoid(model(xv)) > 0.5).float()
-        acc = (preds == yv).float().mean().item()
     val_accs.append(round(acc, 4))
 print('##METRICS_START##')
 print('VAL_ACCS:' + str(val_accs))
-print('FINAL_LOSS:' + str(total_loss.item()))
 print('##METRICS_END##')
 """
 GROUND_TRUTH_BUGS = [
-    "Augmentation applied before split — move after split, apply to train only",
-    "total_loss += loss retains graph — use total_loss += loss.item()",
 ]

 TASK_DESCRIPTION = """
+This binary classification trainer has a bug causing validation accuracy around 50%.
+Fix the bug. After 20 epochs: VAL_ACC > 0.90, FINAL_LOSS < 0.3.
 Print as: VAL_ACCS:[v1,v2,...] and FINAL_LOSS:X.XX
 """
 BUGGY_CODE = """
 import torch
 import torch.nn as nn
+from torch.utils.data import DataLoader, TensorDataset
 torch.manual_seed(42)
+X_train = torch.randn(800, 20)
+y_train = (X_train[:, 0] > 0).float()
+X_val = torch.randn(200, 20)
+y_val = (X_val[:, 0] > 0).float()
+train_ds = TensorDataset(X_train, y_train)
 model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
+optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
 criterion = nn.BCEWithLogitsLoss()
+val_accs = []
+losses = []
 for epoch in range(20):
     model.train()
+    for xb, yb in DataLoader(train_ds, batch_size=32, shuffle=True):
         optimizer.zero_grad()
         out = model(xb).squeeze()
+        # BUG: Wrong label transformation - should use yb directly
+        loss = criterion(out, 1 - yb)
         loss.backward()
         optimizer.step()
+        losses.append(loss.item())
     model.eval()
     with torch.no_grad():
+        preds = (torch.sigmoid(model(X_val).squeeze()) > 0.5).float()
+        acc = (preds == y_val).float().mean().item()
     val_accs.append(round(acc, 4))
 print('##METRICS_START##')
 print('VAL_ACCS:' + str(val_accs))
+print('FINAL_LOSS:' + str(sum(losses[-25:])/25))
 print('##METRICS_END##')
 """
 GROUND_TRUTH_BUGS = [
+    "Label inversion: criterion(out, 1 - yb) inverts the labels — use criterion(out, yb)",
 ]
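
To see what `criterion(out, 1 - yb)` does to a model, here is a minimal framework-free sketch. Everything in it is an assumption made for illustration: a six-point 1-D dataset, a single weight without bias, and hypothetical helper names. It trains the same logistic model once on the true labels and once on inverted labels, then scores both against the true labels.

```python
import math


def train_weight(xs, targets, lr=0.5, steps=100):
    """Fit p = sigmoid(w * x) by gradient descent on the mean BCE loss."""
    w = 0.0
    for _ in range(steps):
        # d(BCE)/dw for one sample is (sigmoid(w * x) - target) * x.
        grad = sum(
            (1.0 / (1.0 + math.exp(-w * x)) - t) * x for x, t in zip(xs, targets)
        ) / len(xs)
        w -= lr * grad
    return w


def accuracy(w, xs, labels):
    """Fraction of samples where the thresholded prediction matches the label."""
    # sigmoid(w * x) > 0.5 is equivalent to w * x > 0.
    preds = [1.0 if w * x > 0 else 0.0 for x in xs]
    return sum(p == y for p, y in zip(preds, labels)) / len(xs)


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # label is 1 when x > 0

w_correct = train_weight(xs, ys)
w_inverted = train_weight(xs, [1.0 - y for y in ys])  # mirrors criterion(out, 1 - yb)

print("correct labels :", accuracy(w_correct, xs, ys))
print("inverted labels:", accuracy(w_inverted, xs, ys))
```

On this toy data the inverted model learns a weight of the opposite sign and gets every point wrong when scored against the true labels. The exact accuracy in the real task depends on its data and model, but the direction of the effect is the same: the model is optimised toward the opposite class.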

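The task scripts in this diff report results as `KEY:value` lines between `##METRICS_START##` and `##METRICS_END##`. The sketch below shows one way such output could be read back and compared with the targets stated in the Task 3 description (final `VAL_ACCS` entry above 0.90, `FINAL_LOSS` below 0.3). It is an assumption-laden illustration, not the code in `server/tasks/graders.py`: `parse_metrics` and `meets_task3_targets` are hypothetical names, and the hard pass/fail threshold check is a simplification.

```python
import ast
import math


def parse_metrics(stdout):
    """Collect KEY:value lines found between the two metric markers."""
    metrics = {}
    inside = False
    for line in stdout.splitlines():
        line = line.strip()
        if line == "##METRICS_START##":
            inside = True
        elif line == "##METRICS_END##":
            inside = False
        elif inside and ":" in line:
            key, _, raw = line.partition(":")
            try:
                # Safely parse numbers and lists such as [0.81, 0.95].
                metrics[key.strip()] = ast.literal_eval(raw.strip())
            except (ValueError, SyntaxError):
                # Values like "nan" are not Python literals; keep them as text.
                metrics[key.strip()] = raw.strip()
    return metrics


def meets_task3_targets(metrics):
    """Check the thresholds quoted in the Task 3 description."""
    accs = metrics.get("VAL_ACCS")
    loss = metrics.get("FINAL_LOSS")
    if not isinstance(accs, list) or not accs:
        return False
    if not isinstance(loss, (int, float)) or not math.isfinite(loss):
        return False
    return accs[-1] > 0.90 and loss < 0.3


sample = "\n".join([
    "some unrelated log line",
    "##METRICS_START##",
    "VAL_ACCS:[0.81, 0.95, 0.97]",
    "FINAL_LOSS:0.12",
    "##METRICS_END##",
])
print(parse_metrics(sample))
print(meets_task3_targets(parse_metrics(sample)))
```

Using `ast.literal_eval` rather than `eval` means a submission cannot execute arbitrary code through its printed metrics, and a loss printed as `nan` is kept as a string and rejected by the type check.
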
server/tasks/task5_frozen_backbone.py CHANGED

@@ -1,8 +1,15 @@
 TASK_DESCRIPTION = """
 This is a standard transfer learning setup classifying 10 categories.
 The developer froze the backbone during testing, but forgot to unfreeze it while still passing its parameters to the optimizer.
-Fix the code so the backbone actually trains, or only pass the head parameters.
-The grader checks the gradient norm of the backbone from the first backward pass.
 """
 BUGGY_CODE = """
@@ -23,17 +30,20 @@ backbone = nn.Sequential(
     nn.ReLU()
 )
-# BUG: backbone is frozen, but passed to optimizer
 backbone.requires_grad_(False)
 head = nn.Linear(512, 10)
-# passing both backbone and head to optimizer even though backbone is frozen
 optimizer = torch.optim.Adam(
     list(backbone.parameters()) + list(head.parameters()), lr=0.001
 )
 criterion = nn.CrossEntropyLoss()
 losses = []
 # Take one step to check gradients
@@ -52,11 +62,9 @@ backbone_grad_norm = sum(
 optimizer.step()
 losses.append(loss.item())
-# Note: if backbone is properly frozen and only head is passed, backbone_grad_norm will be 0 but optimizer won't complain.
-# If backbone is unfrozen, backbone_grad_norm will be > 0.
-# The grader handles both valid solutions.
 print('##METRICS_START##')
 print('FINAL_LOSS:' + str(losses[-1]))
 print('BACKBONE_GRAD_NORM:' + str(backbone_grad_norm))
 print('##METRICS_END##')
 """

 TASK_DESCRIPTION = """
 This is a standard transfer learning setup classifying 10 categories.
 The developer froze the backbone during testing, but forgot to unfreeze it while still passing its parameters to the optimizer.
+This wastes memory and computation as frozen params don't need optimizer state.
+Fix the code so EITHER:
+1. The backbone actually trains (unfreeze it), OR
+2. Only pass trainable parameters to the optimizer
+The grader checks:
+- BACKBONE_GRAD_NORM: >0 means backbone is training, =0 means properly frozen
+- OPTIMIZER_PARAM_COUNT: Should be reduced if only passing head params
 """
 BUGGY_CODE = """
     nn.ReLU()
 )
+# BUG: backbone is frozen, but passed to optimizer (wastes memory/compute)
 backbone.requires_grad_(False)
 head = nn.Linear(512, 10)
+# BUG: passing frozen backbone params to optimizer
 optimizer = torch.optim.Adam(
     list(backbone.parameters()) + list(head.parameters()), lr=0.001
 )
 criterion = nn.CrossEntropyLoss()
+# Count params in optimizer (for grading)
+optimizer_param_count = sum(p.numel() for g in optimizer.param_groups for p in g['params'])
 losses = []
 # Take one step to check gradients
 optimizer.step()
 losses.append(loss.item())
 print('##METRICS_START##')
 print('FINAL_LOSS:' + str(losses[-1]))
 print('BACKBONE_GRAD_NORM:' + str(backbone_grad_norm))
+print('OPTIMIZER_PARAM_COUNT:' + str(optimizer_param_count))
 print('##METRICS_END##')
 """