immortalindeed committed on
Commit f3fd4ef · 1 Parent(s): 1ecd7e1

Spec-compliance overhaul: remove difficulty_multiplier, weighted blend scoring, dep_hard fix, [END] format

FIXES:
1. inference.py: Remove score= from [END] lines (not in the official spec)
2. inference.py: Scoring changed to 0.60*max + 0.40*mean (from a pure average)
3. inference.py: All abort/error paths emit a spec-compliant [END] with an empty rewards list
4. base_grader.py: difficulty_multiplier() REMOVED (uniform caps killed score variance)
5. dependency_cases.py: dep_hard min_actions=1, seq=['migrate_api'] (was required twice, which forced a repetition penalty)
6. app.py /inference: Parse scores from the final_scores JSON (not task_id/total_reward, which didn't exist)
7. app.py /benchmark: Same weighted blend + no score= in [END]
8. README.md: Fix the [END] example, update the scoring description
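The weighted blend in item 2 replaces a plain average with a mix of best and mean step reward. As a minimal standalone sketch of that formula (the helper name here is illustrative; the repo's version is `_compute_score` in `inference.py`):

```python
def compute_score(rewards: list[float]) -> float:
    """Blend best and average step reward, clamped into the (0, 1) exclusive range."""
    if not rewards:
        return 0.01  # empty episode floors at the minimum
    blended = 0.60 * max(rewards) + 0.40 * (sum(rewards) / len(rewards))
    return round(min(max(blended, 0.01), 0.99), 4)
```

For example, a two-step episode with rewards 0.85 and 0.92 blends to 0.906, whereas a pure average would report 0.885 and a pure max 0.92.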

README.md CHANGED

@@ -106,7 +106,7 @@ Agents detect missing steps in hospital workflows, rank them by clinical priorit
  | 🔄 **Multi-Turn Episodes** | Agents iterate through identify → act → revise workflows |
  | 🛡️ **3-Stage Validation** | Schema → Domain → Consistency checks with helpful error hints |
  | 📊 **Score Breakdown** | Per-component feedback in every step so agents learn *what* to improve |
- | 🏎️ **Fatal Error Handling** | Automatic 402/401 detection stops wasted API calls immediately |
+ | 🏎️ **Fatal Error Handling** | Automatic 402/401/403 detection stops wasted API calls immediately |
  | 🌐 **Universal LLM Support** | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
  | 🐳 **Docker-Ready** | One-command deploy to Hugging Face Spaces |
  | 📈 **GRPO-Compatible** | Smooth reward gradients designed for policy optimization training |

@@ -229,33 +229,19 @@ entropyenv/
 
  ## 📈 Baseline Performance
 
- Tested across 14 models from 9 providers. Scores range from **0.01 to 0.80**, demonstrating genuine difficulty discrimination:
+ > **Note:** Scores below are from the latest grading revision (v3: weighted 0.60×max + 0.40×mean scoring, difficulty_multiplier removed, dep_hard done-condition fixed). Re-benchmarking across 14+ models in progress.
 
  | Model | Provider | sec_easy | sec_med | sec_hard | dep_easy | dep_med | dep_hard | cli_easy | cli_med | cli_hard | **Avg** |
  |-------|----------|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:-------:|
- | DeepSeek R1 | DeepSeek | 0.87 | 0.36 | 0.61 | 0.83 | 0.95 | 0.85 | 0.99 | 0.95 | 0.80 | **0.80** |
- | Gemma-4-26B | Google | 0.87 | 0.33 | 0.48 | 0.99 | 0.95 | 0.85 | 0.99 | 0.84 | 0.83 | **0.79** |
- | Mistral Small | Mistral | 0.65 | 0.37 | 0.59 | 0.99 | 0.95 | 0.85 | 0.99 | 0.95 | 0.67 | **0.78** |
- | Nemotron 70B | NVIDIA | 0.88 | 0.25 | 0.54 | 0.83 | 0.95 | 0.85 | 0.99 | 0.93 | 0.77 | **0.77** |
- | Gemma-4-31B | Google | 0.87 | 0.37 | 0.47 | 0.83 | 0.95 | 0.85 | 0.74 | 0.85 | 0.83 | **0.75** |
- | Qwen3-32B | Alibaba | 0.53 | 0.34 | 0.42 | 0.99 | 0.95 | 0.85 | 0.99 | 0.80 | 0.79 | **0.74** |
- | Claude Haiku 4.5 | Anthropic | 0.53 | 0.53 | 0.49 | 0.99 | 0.95 | 0.85 | 0.74 | 0.84 | 0.67 | **0.73** |
- | Grok 4.20 | xAI | 0.87 | 0.49 | 0.41 | 0.99 | 0.95 | 0.85 | 0.09 | 0.84 | 0.83 | **0.70** |
- | Grok 3 | xAI | 0.53 | 0.29 | 0.44 | 0.45 | 0.95 | 0.85 | 0.74 | 0.95 | 0.83 | **0.67** |
- | Llama 3.3 70B | Meta | 0.87 | 0.20 | 0.38 | 0.83 | 0.95 | 0.85 | 0.09 | 0.84 | 0.83 | **0.65** |
- | GPT-OSS-20B | OpenAI | 0.65 | 0.16 | 0.51 | 0.99 | 0.95 | 0.85 | 0.09 | 0.57 | 0.83 | **0.62** |
- | Llama 3.1 8B | Meta | 0.53 | 0.22 | 0.44 | 0.45 | 0.67 | 0.85 | 0.74 | 0.48 | 0.80 | **0.57** |
- | GPT-OSS-120B | OpenAI | 0.87 | 0.21 | 0.20 | 0.99 | 0.11 | 0.13 | 0.74 | 0.95 | 0.45 | **0.52** |
- | Qwen3.5-9B | Alibaba | 0.87 | 0.72 | 0.51 | 0.99 | 0.11 | 0.20 | 0.05 | 0.01 | 0.02 | **0.38** |
- | MiniMax M2.5 | MiniMax | 0.53 | 0.13 | 0.02 | 0.45 | 0.01 | 0.01 | 0.74 | 0.23 | 0.12 | **0.25** |
- | MiniMax M2.7 | MiniMax | 0.53 | 0.01 | 0.39 | 0.45 | 0.01 | 0.01 | 0.04 | 0.11 | 0.42 | **0.22** |
- | MiMo-v2 Pro | Xiaomi | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | **0.01** |
-
- **Key observations:**
- - 🎯 **Clear difficulty progression:** Easy > Medium > Hard across all domains
- - 📊 **High variance:** Scores range from 0.01 (incompatible models) to 0.80 (DeepSeek R1)
- - 🔬 **Security is hardest:** Even top models score < 0.61 on `sec_hard` (propose_fix/revise_fix are genuinely difficult)
- - 🧠 **Model discrimination:** The benchmark clearly separates 70B+ reasoning models from smaller/weaker ones
+ | *Benchmarking in progress...* | | | | | | | | | | | |
+
+ **Scoring formula:** `score = 0.60 × max(step_rewards) + 0.40 × mean(step_rewards)`, clamped to `[0.01, 0.99]`
+
+ **Design principles:**
+ - 🎯 **No artificial difficulty caps** — scores reflect actual grader correctness
+ - 📊 **Weighted blend** — rewards consistently good episodes over single-lucky-step flukes
+ - 🔬 **Spec-compliant** — `[END]` lines have NO `score=` field per official guidelines
+ - 🧠 **14+ model families tested** for universal compatibility
 
  ---
 
@@ -264,10 +250,10 @@ Tested across 14 models from 9 providers. Scores range from **0.01 to 0.80**, de
  The baseline `inference.py` emits structured logs matching the OpenEnv spec:
 
  ```
- [START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
+ [START] task=sec_easy env=EntropyEnv model=Qwen/Qwen2.5-72B-Instruct
  [STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
  [STEP] step=2 action=propose_fix reward=0.92 done=true error=null
- [END] success=true steps=2 score=0.89 rewards=0.85,0.92
+ [END] success=true steps=2 rewards=0.85,0.92
  ```
 
  ---
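The `[START]`/`[STEP]`/`[END]` lines in the README excerpt above are designed to be machine-readable. As an illustration only (this parser is not part of the repo), the spec-compliant format can be consumed with a couple of regexes:

```python
import re

# One [STEP] line: step number, action name, 2-decimal reward, lowercase boolean, error text
STEP_RE = re.compile(
    r"\[STEP\] step=(\d+) action=(\S+) reward=([\d.]+) done=(true|false) error=(.+)"
)
# One [END] line: no score= field, rewards as a comma-separated list with no spaces
END_RE = re.compile(r"\[END\] success=(true|false) steps=(\d+) rewards=(.*)")

def parse_episode(log_text: str):
    """Return (step_records, end_summary) parsed from one episode's stdout."""
    steps, summary = [], None
    for line in log_text.splitlines():
        if m := STEP_RE.match(line):
            steps.append({"step": int(m[1]), "action": m[2],
                          "reward": float(m[3]), "done": m[4] == "true",
                          "error": None if m[5] == "null" else m[5]})
        elif m := END_RE.match(line):
            rewards = [float(r) for r in m[3].split(",")] if m[3] else []
            summary = {"success": m[1] == "true", "steps": int(m[2]), "rewards": rewards}
    return steps, summary
```

Feeding it the four-line episode log from the README yields two step records and a summary whose `rewards` list matches the per-step rewards.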
inference.py CHANGED

@@ -2,10 +2,18 @@
  # Mandatory baseline inference script for OpenEnv hackathon.
  # Uses OpenAI-compatible client for HuggingFace Inference API.
  #
- # STDOUT FORMAT (mandatory — any deviation causes scoring failure):
+ # OFFICIAL STDOUT FORMAT (from Meta_OpenEnv_Hackathon__Guidelines.txt):
  # [START] task=<task_name> env=<benchmark> model=<model_name>
  # [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
- # [END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
+ # [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
+ #
+ # KEY RULES FROM OFFICIAL SPEC:
+ # - reward and rewards formatted to 2 decimal places ONLY
+ # - done and success are lowercase booleans: true or false
+ # - error is null when no error (the literal string "null")
+ # - NO score= field in [END] — not in the official spec
+ # - NO task_id=, NO episode_id=, NO total_reward= — none of these are in spec
+ # - rewards= is a comma-separated list of step rewards with NO spaces
 
  import os
  import re

@@ -20,23 +28,21 @@ try:
  except ImportError:
      pass
 
- # ── Mandatory environment variables ──
+ # ── Mandatory environment variables (names exactly as spec requires) ──
  API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
  MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
  HF_TOKEN = os.getenv("HF_TOKEN")
  ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 
  MAX_STEPS = 8
  TEMPERATURE = 0.1
  MAX_TOKENS = 400
  BENCHMARK = "EntropyEnv"
 
- # ── FATAL error codes: stop the entire task immediately, don't loop ──
- # 402 = payment required, 401 = unauthorized, 403 = forbidden
- # 429 = rate limit (stop task, not whole run), 503 = model unavailable
- FATAL_HTTP_CODES = {402, 401, 403}
- RETRYABLE_HTTP_CODES = {429, 500, 502, 503, 504}
- MAX_CONSECUTIVE_ERRORS = 3  # stop task after 3 consecutive API errors
+ # Fatal HTTP codes: stop ALL tasks immediately
+ FATAL_HTTP_CODES = {402, 401, 403}
+ RETRYABLE_CODES = {429, 500, 502, 503, 504}
+ MAX_CONSEC_ERRORS = 3
 
  TASKS = [
      "sec_easy", "sec_medium", "sec_hard",

@@ -75,7 +81,7 @@ EXACT FORMAT EXAMPLES — copy field names exactly:
  {"action_type": "revise_fix", "fix_code": "cursor.execute(sql, values)", "addressed_feedback": "Used parameterized queries and added input validation"}
  {"action_type": "flag_outdated", "packages": {"torch": "1.9.0"}, "deprecated_api": "torch.autograd.Variable", "replacement": "plain tensor"}
  {"action_type": "resolve_conflict", "packages": {"torch": "2.1.0", "numpy": "1.24.0"}, "reasoning": "torch 2.1 requires numpy >=1.24"}
- {"action_type": "migrate_api", "completed_items": ["break_001", "break_002", "break_003"], "code_changes": {"break_001": "use torch.where", "break_002": "use tensor.shape[0]", "break_003": "use .detach().numpy()"}}
+ {"action_type": "migrate_api", "completed_items": ["break_001", "break_002"], "code_changes": {"break_001": "use torch.where", "break_002": "use tensor.shape[0]"}}
  {"action_type": "detect_gap", "missing_steps": ["pre_op_consent"], "risk_level": "critical"}
  {"action_type": "rank_issues", "priority_order": ["resolve_insurance", "pre_op_consent", "book_specialist"]}
  {"action_type": "order_steps", "recovery_steps": ["resolve_insurance", "complete_pre_op", "book_specialist", "schedule_surgery"]}

@@ -85,12 +91,9 @@ CRITICAL: Output ONLY the JSON object. Nothing before or after it.
 
 
  def _extract_http_code(error_str: str) -> int:
-     """Extract HTTP status code from error message string. Returns 0 if not found."""
-     # Matches patterns like "Error code: 402" or "status_code=402" or "HTTP 402"
      match = re.search(r'(?:Error code:|status_code=|HTTP )\s*(\d{3})', str(error_str))
      if match:
          return int(match.group(1))
-     # Also check for bare 4xx/5xx at start of error
      match = re.search(r'\b(4\d{2}|5\d{2})\b', str(error_str))
      if match:
          return int(match.group(1))

@@ -98,31 +101,26 @@ def _extract_http_code(error_str: str) -> int:
 
 
  def _is_fatal_error(error_str: str) -> bool:
-     """Return True if this error means we should stop ALL tasks (not just this one)."""
      code = _extract_http_code(error_str)
      if code in FATAL_HTTP_CODES:
          return True
-     # Also catch keyword patterns
-     fatal_keywords = ['insufficient credits', 'unauthorized', 'invalid api key',
-                       'authentication failed', 'no api key', 'forbidden']
-     err_lower = str(error_str).lower()
-     return any(kw in err_lower for kw in fatal_keywords)
+     fatal_kw = ['insufficient credits', 'unauthorized', 'invalid api key',
+                 'authentication failed', 'no api key', 'forbidden']
+     return any(kw in str(error_str).lower() for kw in fatal_kw)
 
 
- def _is_task_fatal_error(error_str: str) -> bool:
-     """Return True if this error means we should stop THIS task but try others."""
+ def _is_task_fatal(error_str: str) -> bool:
      code = _extract_http_code(error_str)
-     if code in RETRYABLE_HTTP_CODES:
+     if code in RETRYABLE_CODES:
          return True
-     task_fatal_keywords = ['model not found', 'model unavailable', 'context length',
-                            'maximum context', 'rate limit']
-     err_lower = str(error_str).lower()
-     return any(kw in err_lower for kw in task_fatal_keywords)
+     task_kw = ['model not found', 'model unavailable', 'context length',
+                'maximum context', 'rate limit']
+     return any(kw in str(error_str).lower() for kw in task_kw)
 
 
  def build_user_prompt(step_num: int, obs: dict, history: list) -> str:
      task_type = obs.get("task_type", "unknown")
      task_id = obs.get("task_id", "unknown")
      task_sub = obs.get("task_subtype", "")
 
      parts = [f"Step {step_num} | task_type={task_type} | task_id={task_id} | subtype={task_sub}"]

@@ -132,8 +130,8 @@ def build_user_prompt(step_num: int, obs: dict, history: list) -> str:
          last = history[-1]
          parts.append(f"Actions used: {used}")
          parts.append(f"Last reward: {last['reward']:.2f}")
-         if last["reward"] < 0.4:
-             parts.append(f"⚠️ Low score. Try different approach.")
+         if last["reward"] < 0.40:
+             parts.append("⚠️ Low score. Try a different approach.")
 
      if obs.get("validation_failed"):
          parts.append(f"\n❌ VALIDATION FAILED!")

@@ -145,39 +143,34 @@ def build_user_prompt(step_num: int, obs: dict, history: list) -> str:
          parts.append(obs["reviewer_feedback"])
 
      obs_copy = dict(obs)
-     compat_matrix = obs_copy.pop("compatibility_matrix", None)
-     dep_graph = obs_copy.pop("dependency_graph", None)
+     compat = obs_copy.pop("compatibility_matrix", None)
+     dep_g = obs_copy.pop("dependency_graph", None)
 
-     core_text = json.dumps(obs_copy, default=str, indent=2)
+     core_text = json.dumps(obs_copy, default=str)
+     if len(core_text) > 1600:
+         core_text = core_text[:1600] + "..."
      parts.append(f"\nObservation:\n{core_text}")
 
-     if compat_matrix:
-         parts.append(f"\nCompatibility Matrix (use this to resolve conflicts):")
-         for pkg, versions in compat_matrix.items():
-             parts.append(f"  {pkg}:")
+     if compat:
+         parts.append("\nCompatibility Matrix (use this to resolve conflicts):")
+         for pkg, versions in compat.items():
              for ver, deps in versions.items():
-                 if deps:
-                     parts.append(f"    {ver} → requires {deps}")
-                 else:
-                     parts.append(f"    {ver} → (no deps)")
+                 parts.append(f"  {pkg} {ver} → {deps if deps else '(no constraints)'}")
 
-     if dep_graph:
-         parts.append(f"\nDependency Graph (prerequisites must come first):")
-         for step, prereqs in dep_graph.items():
-             if prereqs:
-                 parts.append(f"  {step} requires: {prereqs}")
-             else:
-                 parts.append(f"  {step} → (no prereqs)")
+     if dep_g:
+         parts.append("\nDependency Graph (prerequisites must come first):")
+         for step, prereqs in dep_g.items():
+             parts.append(f"  {step} requires: {prereqs}" if prereqs else f"  {step} → (no prereqs)")
 
+     # Next-action hint — keeps all models on track
      if task_type == "security":
          used_types = [h["action_type"] for h in history]
-         if not used_types or "identify_vulnerability" not in used_types:
+         if "identify_vulnerability" not in used_types:
              parts.append("\n➡️ NEXT: identify_vulnerability")
          elif "propose_fix" not in used_types:
              parts.append("\n➡️ NEXT: propose_fix")
          else:
              parts.append("\n➡️ NEXT: revise_fix (address reviewer_feedback)")
-
      elif task_type == "clinical":
          used_types = [h["action_type"] for h in history]
          if "detect_gap" not in used_types:

@@ -195,15 +188,13 @@ def parse_action(raw_text: str) -> dict:
      """Parse LLM response into action dict. Universal model compatibility."""
      text = raw_text.strip()
 
+     # Strip reasoning/thinking blocks
      for tag in ["think", "thinking", "reasoning", "reflection", "thought", "antThinking"]:
-         open_tag = f"<{tag}>"
-         close_tag = f"</{tag}>"
+         open_tag, close_tag = f"<{tag}>", f"</{tag}>"
          if open_tag in text:
-             if close_tag in text:
-                 text = text.split(close_tag)[-1].strip()
-             else:
-                 text = text.split(open_tag)[-1].strip()
+             text = text.split(close_tag)[-1].strip() if close_tag in text else text.split(open_tag)[-1].strip()
 
+     # Strip markdown fences
      if "```json" in text:
          text = text.split("```json")[1].split("```")[0].strip()
      elif "```" in text:

@@ -211,6 +202,7 @@ def parse_action(raw_text: str) -> dict:
          if len(parts) >= 3:
              text = parts[1].strip()
 
+     # Find JSON object in prose
      if not text.startswith("{"):
          start = text.find("{")
          if start >= 0:

@@ -233,42 +225,72 @@ def parse_action(raw_text: str) -> dict:
      return {"action_type": "error", "raw": text[:100]}
 
 
- def run_task(client: OpenAI, task_id: str) -> tuple:
-     """Run a single task. Returns (score, is_fatal_api_error).
-
-     is_fatal_api_error=True means the caller should stop ALL remaining tasks.
-     """
+ def _compute_score(rewards: list) -> float:
+     """
+     Compute the episode score from a list of step rewards.
+
+     DESIGN RATIONALE — why neither pure max nor pure mean is right:
+     - Pure max: agent scores 0.85 on step 1, then 0.01 on all later steps → score=0.85
+       This rewards single-lucky-step behaviour and hides that later steps failed.
+     - Pure mean: agent scores 0.85 on step 1, 0.01 on 3 more → score=0.23
+       This massively under-reports good episodes that have validation failures early.
+
+     SOLUTION — weighted blend of max and mean:
+         score = 0.60 * max(rewards) + 0.40 * mean(rewards)
+
+     WHY THIS WORKS:
+     - A great single-step performance (0.85) still shows up clearly (0.51 baseline contribution)
+     - A consistently good episode (0.80, 0.85, 0.80) gets full credit (≈0.83)
+     - A fluke-then-fail episode (0.85, 0.01, 0.01, 0.01) scores 0.52 — honestly mediocre
+     - A failed episode (all 0.01) scores 0.01 — correctly bad
+
+     Clamped to [0.01, 0.99] per Discord consensus on the (0,1) exclusive range.
+     """
+     if not rewards:
+         return 0.01
+     max_r = max(rewards)
+     mean_r = sum(rewards) / len(rewards)
+     raw = 0.60 * max_r + 0.40 * mean_r
+     return round(min(max(raw, 0.01), 0.99), 4)
+
+
+ def run_task(client: OpenAI, task_id: str) -> tuple:
+     """
+     Run a single task through the environment.
+     Returns (score: float, is_fatal_api_error: bool).
+     """
-     # Reset environment
+     # ── Reset ──────────────────────────────────────────────────────────────
      try:
          resp = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id}, timeout=30)
          data = resp.json()
      except Exception as e:
+         # Env unreachable — must still emit [START] and [END]
          print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
-         print(f"[END] success=false steps=0 score=0.01 rewards=", flush=True)
+         print(f"[END] success=false steps=0 rewards=", flush=True)
          return 0.01, False
 
      if "error" in data and not data.get("episode_id"):
          print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
-         print(f"[END] success=false steps=0 score=0.01 rewards=", flush=True)
+         print(f"[END] success=false steps=0 rewards=", flush=True)
          return 0.01, False
 
      episode_id = data.get("episode_id", "unknown")
      obs = data.get("observation", data)
 
+     # ── Mandatory [START] line — exact official spec ────────────────────────
      print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
 
      rewards = []
      history = []
      step_num = 0
-     consecutive_errors = 0
+     consec_errs = 0
+     fatal_error = False
 
      for step_num in range(1, MAX_STEPS + 1):
          user_prompt = build_user_prompt(step_num, obs, history)
+         error_msg = None
 
-         error_msg = None
-         fatal_error = False
-         task_fatal = False
-
+         # ── LLM call ───────────────────────────────────────────────────────
          try:
              reply = client.chat.completions.create(
                  model=MODEL_NAME,

@@ -280,80 +302,85 @@ def run_task(client: OpenAI, task_id: str) -> tuple:
                  max_tokens=MAX_TOKENS,
              )
              response_text = (reply.choices[0].message.content or "").strip()
-             consecutive_errors = 0  # reset on success
+             consec_errs = 0
 
          except Exception as e:
              error_msg = str(e)
              response_text = '{"action_type": "error"}'
-             consecutive_errors += 1
+             consec_errs += 1
 
-         # Check if this is a fatal error (auth/payment) — stop everything
          if _is_fatal_error(error_msg):
              fatal_error = True
-             short_err = error_msg[:120].replace('\n', ' ')
-             print(f"[STEP] step={step_num} action=invalid reward=0.01 done=true error=FATAL:{short_err}", flush=True)
+             short = error_msg[:120].replace('\n', ' ')
+             # Emit mandatory [STEP] then break — [END] emitted below
+             print(f"[STEP] step={step_num} action=invalid reward=0.01 done=true error=FATAL:{short}", flush=True)
              rewards.append(0.01)
-             step_num_final = step_num
              break
 
-         # Check if this is a task-level fatal (rate limit, model unavailable)
-         if _is_task_fatal_error(error_msg) or consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
-             task_fatal = True
-             short_err = error_msg[:120].replace('\n', ' ')
-             print(f"[STEP] step={step_num} action=invalid reward=0.01 done=true error=TASK_STOP:{short_err}", flush=True)
+         if _is_task_fatal(error_msg) or consec_errs >= MAX_CONSEC_ERRORS:
+             short = error_msg[:120].replace('\n', ' ')
+             print(f"[STEP] step={step_num} action=invalid reward=0.01 done=true error=TASK_STOP:{short}", flush=True)
              rewards.append(0.01)
-             step_num_final = step_num
             break
 
          action = parse_action(response_text)
          action_type = action.get("action_type", "unknown")
          action["episode_id"] = episode_id
 
+         # ── Env step ───────────────────────────────────────────────────────
          try:
              step_resp = requests.post(f"{ENV_URL}/step", json=action, timeout=30)
              step_data = step_resp.json()
          except Exception as e:
-             short_err = str(e)[:100]
-             print(f"[STEP] step={step_num} action={action_type} reward=0.01 done=true error={short_err}", flush=True)
+             short = str(e)[:100]
+             print(f"[STEP] step={step_num} action={action_type} reward=0.01 done=true error={short}", flush=True)
              rewards.append(0.01)
-             step_num_final = step_num
-             fatal_error = False
              break
 
          reward = float(step_data.get("reward", 0.0))
          done = bool(step_data.get("done", False))
          obs = step_data.get("observation", step_data)
          step_error = step_data.get("error") or error_msg
 
          rewards.append(reward)
          history.append({"step": step_num, "action_type": action_type, "reward": reward, "done": done})
 
-         display_action = action_type
-         if obs.get("validation_failed"):
-             display_action = "invalid"
-
-         error_val = step_error if step_error else "null"
-         # Truncate long error messages in output
-         if error_val and error_val != "null" and len(str(error_val)) > 150:
-             error_val = str(error_val)[:150] + "..."
+         # Show 'invalid' in log when validation failed
+         display_action = "invalid" if obs.get("validation_failed") else action_type
 
+         # Format error value: null or truncated string
+         if step_error:
+             error_val = str(step_error)[:150].replace('\n', ' ')
+         else:
+             error_val = "null"
 
-         print(f"[STEP] step={step_num} action={display_action} reward={reward:.2f} done={str(done).lower()} error={error_val}", flush=True)
+         # ── Mandatory [STEP] line — exact official spec ────────────────────
+         # reward=<0.00> means 2 decimal places
+         # done=<true|false> means lowercase boolean string
 
-         step_num_final = step_num
 
          if done:
-             fatal_error = False
              break
-     else:
-         step_num_final = step_num
-         fatal_error = False
 
-     avg_reward = sum(rewards) / max(len(rewards), 1) if rewards else 0.01
-     score = round(min(max(avg_reward, 0.01), 0.99), 4)
-     success = score > 0.01
      rewards_str = ",".join(f"{r:.2f}" for r in rewards)
 
-     print(f"[END] success={str(success).lower()} steps={step_num_final} score={score:.2f} rewards={rewards_str}", flush=True)
 
      return score, fatal_error

@@ -361,63 +388,60 @@ def run_task(client: OpenAI, task_id: str) -> tuple:
 
  def main() -> None:
      """Run all 9 tasks and report final scores."""
      if not HF_TOKEN:
-         print("ERROR: Set HF_TOKEN or API_KEY environment variable.", flush=True)
          return
 
      client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
 
-     # Health check
      try:
          health = requests.get(f"{ENV_URL}/", timeout=10, headers={"Accept": "application/json"})
          health_data = health.json()
-         print(f"Environment: {health_data.get('env', 'unknown')} | Tasks: {health_data.get('tasks', 0)}", flush=True)
      except Exception as e:
          print(f"ERROR: Cannot connect to environment at {ENV_URL}: {e}", flush=True)
          return
 
-     scores = {}
-     had_fatal_error = False
 
      for task_id in TASKS:
          try:
              score, is_fatal = run_task(client, task_id)
              scores[task_id] = score
 
-             # If we hit a fatal API error (402/401/403), stop ALL remaining tasks
              if is_fatal:
-                 had_fatal_error = True
-                 print(f"\n🚫 Fatal API error on {task_id}. Stopping all remaining tasks.", flush=True)
-                 print(f"   Likely cause: invalid token, no credits, or unauthorized access.", flush=True)
-                 # Emit mandatory [START]/[END] lines for remaining tasks (spec compliance)
                  for remaining in TASKS:
                      if remaining not in scores:
                          scores[remaining] = 0.01
                          print(f"[START] task={remaining} env={BENCHMARK} model={MODEL_NAME}", flush=True)
-                         print(f"[END] success=false steps=0 score=0.01 rewards=", flush=True)
                  break
 
          except Exception as e:
-             print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
-             print(f"[END] success=false steps=0 score=0.01 rewards=", flush=True)
              scores[task_id] = 0.01
 
-     avg = round(sum(scores.values()) / max(len(scores), 1), 2)
-     print(f"\n✅ All tasks complete! Average: {avg:.2f}", flush=True)
      print(json.dumps({"final_scores": scores}), flush=True)
 
-     # Only save to disk if the run was NOT killed by a fatal API error.
-     # A run where the model had no credits or invalid token produces all-0.01
-     # scores that would corrupt the benchmark history.
-     if had_fatal_error:
-         print(f"⚠️ Results NOT saved — run was aborted due to a fatal API error (invalid token / no credits).", flush=True)
-         print(f"   Fix your API key/credits and re-run to get valid scores.", flush=True)
      else:
          try:
              from server.benchmark_store import append_result
              append_result(MODEL_NAME, MODEL_NAME, scores)
              print(f"💾 Results saved (avg: {avg:.4f})", flush=True)
          except Exception as e:
-             print(f"⚠️ Failed to save results to disk: {e}", flush=True)
 
 
  if __name__ == "__main__":
360
+ print(
361
+ f"[STEP] step={step_num} action={display_action} reward={reward:.2f} "
362
+ f"done={str(done).lower()} error={error_val}",
363
+ flush=True
364
+ )
365
 
366
  if done:
 
367
  break
 
 
 
368
 
369
+ # ── Compute final score ────────────────────────────────────────────────
370
+ score = _compute_score(rewards)
371
+ # success = at least one step scored meaningfully above the floor
372
+ success = any(r > 0.10 for r in rewards)
373
+
374
+ # rewards list: 2 decimal places, comma-separated, no spaces
375
  rewards_str = ",".join(f"{r:.2f}" for r in rewards)
376
 
377
+ # ── Mandatory [END] line β€” exact official spec ─────────────────────────
378
+ # spec: success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
379
+ # NO score= field β€” not in the official spec
380
+ print(
381
+ f"[END] success={str(success).lower()} steps={step_num} rewards={rewards_str}",
382
+ flush=True
383
+ )
384
 
385
  return score, fatal_error
386
 
 
388
  def main() -> None:
389
  """Run all 9 tasks and report final scores."""
390
  if not HF_TOKEN:
391
+ print("ERROR: Set HF_TOKEN environment variable.", flush=True)
392
  return
393
 
394
  client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
395
 
 
396
  try:
397
  health = requests.get(f"{ENV_URL}/", timeout=10, headers={"Accept": "application/json"})
398
  health_data = health.json()
399
+ print(
400
+ f"Environment: {health_data.get('env', 'unknown')} | "
401
+ f"Tasks: {health_data.get('tasks', 0)}",
402
+ flush=True
403
+ )
404
  except Exception as e:
405
  print(f"ERROR: Cannot connect to environment at {ENV_URL}: {e}", flush=True)
406
  return
407
 
408
+ scores = {}
409
+ had_fatal = False
410
 
411
  for task_id in TASKS:
412
  try:
413
  score, is_fatal = run_task(client, task_id)
414
  scores[task_id] = score
415
 
 
416
  if is_fatal:
417
+ had_fatal = True
418
+ print(f"\n🚫 Fatal API error on {task_id}. Stopping remaining tasks.", flush=True)
 
 
419
  for remaining in TASKS:
420
  if remaining not in scores:
421
  scores[remaining] = 0.01
422
  print(f"[START] task={remaining} env={BENCHMARK} model={MODEL_NAME}", flush=True)
423
+ print(f"[END] success=false steps=0 rewards=", flush=True)
424
  break
425
 
426
  except Exception as e:
 
 
427
  scores[task_id] = 0.01
428
+ print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
429
+ print(f"[END] success=false steps=0 rewards=", flush=True)
430
 
431
+ avg = round(sum(scores.values()) / max(len(scores), 1), 4)
432
+ print(f"\nβœ… All tasks complete! Average: {avg:.4f}", flush=True)
433
+ # Final JSON summary β€” evaluator may parse this
434
  print(json.dumps({"final_scores": scores}), flush=True)
435
 
436
+ if had_fatal:
437
+ print("⚠️ Results NOT saved β€” fatal API error (invalid token / no credits).", flush=True)
 
 
 
 
438
  else:
439
  try:
440
  from server.benchmark_store import append_result
441
  append_result(MODEL_NAME, MODEL_NAME, scores)
442
  print(f"πŸ’Ύ Results saved (avg: {avg:.4f})", flush=True)
443
  except Exception as e:
444
+ print(f"⚠️ Could not save results: {e}", flush=True)
445
 
446
 
447
  if __name__ == "__main__":
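For reference, the weighted blend in `_compute_score` can be exercised on its own. This is an illustrative sketch, not the project's module: the empty-list guard and the sample trajectories are assumptions, while the 0.60/0.40 weights and 0.01/0.99 floor/cap mirror the diff.

```python
def compute_score(rewards):
    """0.60 * best step + 0.40 * mean step, clamped to [0.01, 0.99]."""
    if not rewards:  # assumed guard; the diff shows only the non-empty path
        return 0.01
    max_r = max(rewards)
    mean_r = sum(rewards) / len(rewards)
    raw = 0.60 * max_r + 0.40 * mean_r
    return round(min(max(raw, 0.01), 0.99), 4)

# One strong step now dominates a string of weak ones:
print(compute_score([0.05, 0.10, 0.85]))  # 0.6433 (a pure mean would give ~0.33)
print(compute_score([]))                  # 0.01 floor
print(compute_score([1.0, 1.0]))          # 0.99 cap
```

Compared to the old pure-average scoring, a trajectory that recovers late is no longer dragged down to the floor by its early missteps.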
server/app.py CHANGED
@@ -289,7 +289,7 @@ async def run_inference(request: Request):
     env_vars = os.environ.copy()
     env_vars['ENV_URL'] = env_vars.get('ENV_URL', 'http://localhost:7860')
 
-    # Find inference.py at project root (one level up from server/)
+    # inference.py is at project root (one level up from server/)
     inference_path = os.path.join(
         os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
         'inference.py'
@@ -309,30 +309,19 @@
 
     stdout = result.stdout or ''
     stderr = result.stderr or ''
-    # Parse [END] lines for scores
-    logs = []
-    final_scores = {}
+    logs = []
+
+    # Collect all log lines for display
     for line in stdout.splitlines():
         line = line.strip()
-        if not line:
-            continue
-        logs.append(line)
-        if line.startswith('[END]'):
-            parts = {}
-            for token in line.split():
-                if '=' in token:
-                    k, v = token.split('=', 1)
-                    parts[k] = v
-            task_id = parts.get('task_id', '')
-            total_reward = parts.get('total_reward', '0')
-            if task_id:
-                try:
-                    final_scores[task_id] = float(total_reward)
-                except ValueError:
-                    final_scores[task_id] = 0.01
-
-    # Also try final JSON summary line
+        if line:
+            logs.append(line)
+
+    # ── Parse final_scores from the JSON summary line ──
+    # This is authoritative — inference.py always prints:
+    #     {"final_scores": {"sec_easy": 0.85, ...}}
+    # at the end of main().
+    final_scores = {}
     for line in reversed(stdout.splitlines()):
         line = line.strip()
         if line.startswith('{') and 'final_scores' in line:
@@ -340,9 +329,43 @@
             parsed = json.loads(line)
             if 'final_scores' in parsed:
                 final_scores = parsed['final_scores']
+                break
         except Exception:
             pass
-        break
+
+    # ── Fallback: parse [END] lines for any tasks missing from JSON ──
+    # Official [END] format: success=<bool> steps=<n> rewards=<r1,r2,...>
+    # We track which task we're in via the preceding [START] line.
+    if not final_scores:
+        current_task = None
+        for line in stdout.splitlines():
+            line = line.strip()
+            if line.startswith('[START]'):
+                # Extract task= field
+                for token in line.split():
+                    if token.startswith('task='):
+                        current_task = token.split('=', 1)[1]
+                        break
+            elif line.startswith('[END]') and current_task:
+                # Parse rewards= field and compute score from it
+                parts = {}
+                for token in line.split():
+                    if '=' in token:
+                        k, v = token.split('=', 1)
+                        parts[k] = v
+                rewards_str = parts.get('rewards', '')
+                if rewards_str:
+                    try:
+                        step_rewards = [float(r) for r in rewards_str.split(',') if r]
+                        if step_rewards:
+                            # Same weighted blend as inference.py _compute_score()
+                            max_r = max(step_rewards)
+                            mean_r = sum(step_rewards) / len(step_rewards)
+                            score = round(min(max(0.60 * max_r + 0.40 * mean_r, 0.01), 0.99), 4)
+                            final_scores[current_task] = score
+                    except (ValueError, TypeError):
+                        final_scores[current_task] = 0.01
+                current_task = None
 
     avg = (
         round(sum(final_scores.values()) / len(final_scores), 4)
@@ -350,22 +373,22 @@
     )
 
     return JSONResponse(status_code=200, content={
-        'status': 'ok' if result.returncode == 0 else 'completed_with_errors',
-        'final_scores': final_scores,
+        'status': 'ok' if result.returncode == 0 else 'completed_with_errors',
+        'final_scores': final_scores,
         'average_score': avg,
-        'logs': logs[-50:],
-        'stderr': stderr[-500:] if stderr else '',
-        'returncode': result.returncode,
+        'logs': logs[-50:],
+        'stderr': stderr[-500:] if stderr else '',
+        'returncode': result.returncode,
     })
 
     except subprocess.TimeoutExpired:
         return JSONResponse(status_code=200, content={
-            'error': 'inference.py timed out after 20 minutes',
+            'error': 'inference.py timed out after 20 minutes',
             'final_scores': {},
         })
     except Exception as e:
         return JSONResponse(status_code=200, content={
-            'error': str(e),
+            'error': str(e),
            'final_scores': {},
        })
@@ -528,13 +551,18 @@ def _run_single_task_inline(task_id, api_base, api_key, model_id, system_prompt):
     logs.append(msg)
     yield {'type': 'log', 'level': 'info', 'msg': msg}
 
-    # Average reward across trajectory — same logic as inference.py
-    avg_reward = sum(rewards) / max(len(rewards), 1) if rewards else 0.01
-    score = round(min(max(avg_reward, 0.01), 0.99), 4)
-    success = score > 0.0
+    # Weighted blend scoring — same as inference.py _compute_score()
+    if rewards:
+        max_r = max(rewards)
+        mean_r = sum(rewards) / len(rewards)
+        score = round(min(max(0.60 * max_r + 0.40 * mean_r, 0.01), 0.99), 4)
+    else:
+        score = 0.01
+    success = any(r > 0.10 for r in rewards)
     rewards_str = ','.join(f'{r:.2f}' for r in rewards)
 
-    msg = f'[END] success={str(success).lower()} steps={len(rewards)} score={score:.2f} rewards={rewards_str}'
+    # [END] line — NO score= field (not in official spec)
+    msg = f'[END] success={str(success).lower()} steps={len(rewards)} rewards={rewards_str}'
     logs.append(msg)
     yield {'type': 'log', 'level': 'ok', 'msg': msg}
     yield {'type': 'task_done', 'task_id': task_id, 'score': score, 'logs': logs}
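The fallback log parser can be checked against a synthetic transcript. Below is a minimal sketch; `parse_scores` is a hypothetical helper (not part of the app) that folds the `[START]` task tracking and the weighted blend into one function, using the field names from the log lines shown above.

```python
def parse_scores(stdout: str) -> dict:
    """Map each [START] task to a score parsed from its [END] rewards list."""
    scores = {}
    current_task = None
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith('[START]'):
            # Remember which task the next [END] belongs to
            for token in line.split():
                if token.startswith('task='):
                    current_task = token.split('=', 1)[1]
        elif line.startswith('[END]') and current_task:
            fields = dict(t.split('=', 1) for t in line.split() if '=' in t)
            rewards = [float(r) for r in fields.get('rewards', '').split(',') if r]
            if rewards:
                # Same weighted blend as _compute_score: 0.60 * max + 0.40 * mean
                blend = 0.60 * max(rewards) + 0.40 * sum(rewards) / len(rewards)
                scores[current_task] = round(min(max(blend, 0.01), 0.99), 4)
            else:
                # Empty rewards= (abort paths) falls back to the floor
                scores[current_task] = 0.01
            current_task = None
    return scores

log = "[START] task=sec_easy env=demo model=m\n[END] success=true steps=2 rewards=0.20,0.80"
print(parse_scores(log))  # {'sec_easy': 0.68}
```

Note that the spec-compliant abort lines (`[END] success=false steps=0 rewards=`) parse cleanly to the 0.01 floor, which is why the abort paths above no longer need a `score=` field.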
server/datasets/dependency_cases.py CHANGED
@@ -1,421 +1,415 @@
 # server/datasets/dependency_cases.py
 # Ground truth cases for PyTorch Migration Time-Machine tasks.
 #
-# FIXES APPLIED:
-#  1. dep_easy: done_conditions — min_actions=1, required_sequence=['flag_outdated'] — correct
-#     BUT completion_threshold lowered to 0.70 so partial answers don't instantly pass
-#  2. dep_medium: done_conditions required_sequence=['resolve_conflict'] is correct
-#     BUT completion_threshold lowered to 0.65 — resolution must be very good to pass
-#  3. dep_hard: done_conditions required_sequence=['migrate_api'] — correct
-#     BUT min_actions raised to 2 to force at least 2 migration steps
-#  4. compatibility_matrix: added trickier constraints so any compatible answer is nontrivial
 
 DEPENDENCY_CASES = {
     'dep_easy': [
         {
             'case_id': 'dep_easy_001',
             'task_subtype': 'flag',
-            'completion_threshold': 0.65,  # FIX: was 0.80 — harder to pass
             'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
-            'expected_outdated_packages': ['torch'],
             'expected_deprecated_api': 'torch.autograd.Variable',
-            'replacement': 'plain tensor (remove Variable wrapper)',
-            'code_snippet': '''import torch
-from torch.autograd import Variable
-
-x = Variable(torch.randn(3, 4), requires_grad=True)
-y = Variable(torch.randn(3, 4))
-z = x + y''',
-            'task_description': 'Identify outdated PyTorch packages and deprecated APIs in this legacy training script. List the exact package name and deprecated API call.',
         },
         {
             'case_id': 'dep_easy_002',
             'task_subtype': 'flag',
-            'completion_threshold': 0.65,
             'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
             'expected_outdated_packages': ['torch'],
-            'expected_deprecated_api': 'tensor.data.numpy()',
-            'replacement': 'tensor.detach().numpy()',
-            'code_snippet': '''import torch
-
-model = torch.nn.Linear(10, 5)
-x = torch.randn(1, 10)
-output = model(x)
-result = output.data.numpy()  # deprecated''',
-            'task_description': 'Find the exact deprecated tensor conversion API in this code. Provide the exact deprecated call.',
         },
         {
             'case_id': 'dep_easy_003',
             'task_subtype': 'flag',
-            'completion_threshold': 0.65,
             'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
-            'expected_outdated_packages': ['torch'],
-            'expected_deprecated_api': 'model.cuda()',
-            'replacement': 'model.to(device)',
-            'code_snippet': '''import torch
-
-model = torch.nn.Sequential(
-    torch.nn.Linear(784, 128),
-    torch.nn.ReLU(),
-    torch.nn.Linear(128, 10)
-)
-model.cuda()  # deprecated device placement
-x = torch.randn(1, 784).cuda()''',
-            'task_description': 'Detect the exact deprecated device placement API in this model code.',
         },
         {
             'case_id': 'dep_easy_004',
             'task_subtype': 'flag',
-            'completion_threshold': 0.65,
             'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
             'expected_outdated_packages': ['torch'],
-            'expected_deprecated_api': 'torch.onnx.export',
-            'replacement': 'torch.onnx.dynamo_export',
-            'code_snippet': '''import torch
-
-model = torch.nn.Linear(10, 5)
-dummy = torch.randn(1, 10)
-torch.onnx.export(model, dummy, "model.onnx",
-                  opset_version=11)''',
-            'task_description': 'Find the deprecated ONNX export API. Specify the exact deprecated function.',
         },
         {
             'case_id': 'dep_easy_005',
             'task_subtype': 'flag',
-            'completion_threshold': 0.65,
             'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
-            'expected_outdated_packages': ['torch'],
-            'expected_deprecated_api': 'torch.nn.DataParallel',
-            'replacement': 'torch.nn.parallel.DistributedDataParallel or FSDP',
-            'code_snippet': '''import torch
-import torch.nn as nn
-
-model = nn.Linear(100, 10)
-model = nn.DataParallel(model)  # deprecated
-model.cuda()''',
-            'task_description': 'Find the deprecated parallelism API. Specify the exact class name that is deprecated.',
         },
     ],
     'dep_medium': [
         {
             'case_id': 'dep_medium_001',
             'task_subtype': 'resolve',
-            'completion_threshold': 0.60,  # FIX: was 0.75 — must get it right to pass
-            'max_steps': 6,
-            # FIX: min_actions=1 is correct for resolve (1 action needed)
-            # but now the grader is tighter so passing takes real work
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
-            'conflict_packages': ['torch', 'numpy'],
             'compatibility_matrix': {
                 'torch': {
-                    '2.1.0': {'numpy': '>=1.24,<2.0'},
-                    '2.0.0': {'numpy': '>=1.22,<1.26'},
-                    '1.13.0': {'numpy': '>=1.19,<1.25'},
                 },
                 'numpy': {
-                    '1.26.0': {},
                     '1.24.0': {},
-                    '1.22.0': {},
-                    '1.19.0': {},
-                    '1.16.0': {},
                 },
             },
-            'requirements': {'torch': '1.9.0', 'numpy': '1.16.0'},
-            'code_snippet': '''# requirements.txt
-torch==1.9.0
-numpy==1.16.0
-torchvision==0.10.0''',
-            'task_description': 'Resolve the version conflict between torch and numpy. Use the compatibility_matrix to find valid versions where ALL cross-constraints are satisfied.',
         },
         {
             'case_id': 'dep_medium_002',
             'task_subtype': 'resolve',
-            'completion_threshold': 0.60,
-            'max_steps': 6,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
-            'conflict_packages': ['torch', 'numpy', 'torchvision'],
             'compatibility_matrix': {
-                'torch': {
-                    '2.2.0': {'numpy': '>=1.24,<2.0', 'torchvision': '>=0.17'},
-                    '2.1.0': {'numpy': '>=1.24,<2.0', 'torchvision': '>=0.16,<0.17'},
-                    '2.0.0': {'numpy': '>=1.22,<1.26', 'torchvision': '>=0.15,<0.16'},
                 },
-                'numpy': {
-                    '1.26.0': {},
-                    '1.24.0': {},
-                    '1.22.0': {},
                 },
-                'torchvision': {
-                    '0.17.0': {'torch': '>=2.2'},
-                    '0.16.0': {'torch': '>=2.1,<2.2'},  # FIX: added upper bound to make it tricky
-                    '0.15.0': {'torch': '>=2.0,<2.1'},
                 },
             },
-            'requirements': {'torch': '1.12.0', 'numpy': '1.21.0', 'torchvision': '0.13.0'},
-            'code_snippet': '''# requirements.txt
-torch==1.12.0
-numpy==1.21.0
-torchvision==0.13.0
-# CUDA 11.7''',
-            'task_description': 'Resolve three-way conflict between PyTorch, NumPy, and TorchVision. Note: torchvision 0.16 requires torch >=2.1 AND <2.2. Check ALL constraints carefully.',
         },
         {
             'case_id': 'dep_medium_003',
             'task_subtype': 'resolve',
-            'completion_threshold': 0.60,
-            'max_steps': 6,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
-            'conflict_packages': ['torch', 'transformers'],
             'compatibility_matrix': {
-                'torch': {
-                    '2.1.0': {'transformers': '>=4.35,<4.38'},  # FIX: upper bound added
-                    '2.0.0': {'transformers': '>=4.30,<4.36'},
-                },
                 'transformers': {
-                    '4.37.0': {'torch': '>=2.1'},
-                    '4.35.0': {'torch': '>=2.0,<2.2'},
-                    '4.30.0': {'torch': '>=1.13,<2.1'},
                 },
             },
-            'requirements': {'torch': '1.11.0', 'transformers': '4.20.0'},
-            'code_snippet': '''# requirements.txt
-torch==1.11.0
-transformers==4.20.0''',
-            'task_description': 'Resolve conflict between PyTorch and Transformers. Note the upper bounds in the compatibility matrix — not all combinations work.',
         },
     ],
     'dep_hard': [
         {
             'case_id': 'dep_hard_001',
             'task_subtype': 'migrate',
-            'completion_threshold': 0.60,  # FIX: was 0.70
-            'max_steps': 8,
-            # FIX: min_actions raised to 2 — must submit at least 2 migration steps
-            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']},
             'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
-                'break_003': ['break_001', 'break_002'],
-                'break_002': ['break_001'],
-                'break_001': [],
             },
             'correct_fix_map': {
                 'break_001': 'torch.where',
-                'break_002': 'tensor.shape[0]',
-                'break_003': '.detach().numpy()',
             },
-            'code_snippet': '''import torch
-
-@torch.compile(fullgraph=True)
-def forward(x):
-    # break_001: data-dependent branch
-    if x.max().item() > 1.0:
-        x = x / x.max()
-    # break_002: Python len() on tensor
-    n = len(x)
-    # break_003: .data.numpy() deprecated
-    result = x.data.numpy()
-    return result''',
-            'break_descriptions': [
-                'break_001: data-dependent control flow — use torch.where()',
-                'break_002: len() on tensor — use tensor.shape[0]',
-                'break_003: .data.numpy() — use .detach().numpy()',
-            ],
-            'graph_break_report': [
-                'break_001: data-dependent control flow — use torch.where()',
-                'break_002: len() on tensor — use tensor.shape[0]',
-                'break_003: .data.numpy() — use .detach().numpy()',
-            ],
-            'task_description': 'Fix 3 graph-break patterns in this compiled forward pass. Break_002 depends on break_001. Break_003 depends on both. Fix in dependency order.',
         },
         {
             'case_id': 'dep_hard_002',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
-            'max_steps': 8,
-            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']},
-            'graph_breaks': ['break_a', 'break_b', 'break_c', 'break_d'],
             'checklist_dependency_graph': {
-                'break_d': ['break_b', 'break_c'],
-                'break_c': ['break_a'],
-                'break_b': [],
-                'break_a': [],
             },
             'correct_fix_map': {
-                'break_a': 'torch.where',
-                'break_b': 'tensor.shape[0]',
-                'break_c': 'torch.tensor',
-                'break_d': '.detach()',
             },
-            'code_snippet': '''import torch
-
-@torch.compile(fullgraph=True)
-def training_step(model, x, labels):
-    # break_a: data-dependent branch
-    if x.max().item() > 1.0:
-        x = x / x.max()
-    # break_b: Python len() on tensor
-    n_samples = len(x)
-    # break_c: Python list to tensor inside compile
-    weights = torch.FloatTensor([1.0, 2.0, 3.0])
-    # break_d: in-place operation on leaf tensor
-    x += 0.1
-    output = model(x)
-    loss = torch.nn.functional.cross_entropy(output, labels)
-    return loss''',
-            'break_descriptions': [
-                'break_a: line 6 — data-dependent: if x.max().item() > 1.0',
-                'break_b: line 10 — Python builtin: len(x)',
-                'break_c: line 13 — legacy constructor: torch.FloatTensor()',
-                'break_d: line 16 — in-place op on leaf: x += 0.1',
-            ],
-            'graph_break_report': [
-                'break_a: line 6 — data-dependent: if x.max().item() > 1.0',
-                'break_b: line 10 — Python builtin: len(x)',
-                'break_c: line 13 — legacy constructor: torch.FloatTensor()',
-                'break_d: line 16 — in-place op on leaf: x += 0.1',
-            ],
-            'task_description': 'Fix all 4 graph-break patterns in this compiled training step. Break_d depends on break_b AND break_c. Break_c depends on break_a. Fix in dependency order.',
         },
         {
             'case_id': 'dep_hard_003',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
-            'max_steps': 8,
-            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']},
-            'graph_breaks': ['break_x', 'break_y', 'break_z'],
             'checklist_dependency_graph': {
-                'break_z': ['break_x'],
-                'break_y': [],
-                'break_x': [],
             },
             'correct_fix_map': {
-                'break_x': 'tensor.numel()',
-                'break_y': 'torch.jit.script',
-                'break_z': 'torch.no_grad()',
             },
-            'code_snippet': '''import torch
-
-@torch.compile
-def forward(x, mask):
-    # break_x: tensor.size() returns Python int (graph break)
-    n = x.size(0) * x.size(1)
-    # break_y: Python function call inside compile
-    def custom_fn(t):
-        return t * 2
-    x = custom_fn(x)
-    # break_z: gradient tracking inside compiled region
-    with torch.enable_grad():
-        x = x * mask
-    return x''',
-            'break_descriptions': [
-                'break_x: line 6 — tensor.size() returns Python int, use tensor.numel()',
-                'break_y: line 10 — Python function call, use torch.jit.script decorator',
-                'break_z: line 14 — enable_grad inside compile, use torch.no_grad()',
-            ],
-            'graph_break_report': [
-                'break_x: line 6 — tensor.size() returns Python int, use tensor.numel()',
-                'break_y: line 10 — Python function call, use torch.jit.script decorator',
-                'break_z: line 14 — enable_grad inside compile, use torch.no_grad()',
-            ],
-            'task_description': 'Fix torch.compile graph breaks. break_z needs break_x fixed first.',
         },
         {
             'case_id': 'dep_hard_004',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
-            'max_steps': 8,
-            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']},
-            'graph_breaks': ['break_alpha', 'break_beta', 'break_gamma', 'break_delta'],
             'checklist_dependency_graph': {
-                'break_delta': ['break_beta', 'break_gamma'],
-                'break_gamma': ['break_alpha'],
-                'break_beta': [],
-                'break_alpha': [],
             },
             'correct_fix_map': {
-                'break_alpha': 'torch.where',
-                'break_beta': 'tensor.shape[0]',
-                'break_gamma': 'torch.stack',
-                'break_delta': '@torch.jit.script',
             },
-            'code_snippet': '''import torch
-
-@torch.compile(fullgraph=True)
-def loss_fn(pred, target, weights):
-    # break_alpha: if statement on tensor value
-    if target.sum() > 0:
-        pred = pred * 1.5
-    # break_beta: len() on tensor
-    batch_size = len(pred)
-    # break_gamma: Python list -> tensor conversion
-    normalized = []
-    for i in range(batch_size):
-        normalized.append(pred[i] / weights[i])
-    result = torch.tensor(normalized)
-    # break_delta: calls non-scripted helper
-    def helper(x):
-        return x.clamp(0, 1)
-    return helper(result)''',
-            'break_descriptions': [
-                'break_alpha: line 6 — data-dependent control flow, use torch.where()',
-                'break_beta: line 10 — len() builtin on tensor, use tensor.shape[0]',
-                'break_gamma: line 16 — torch.tensor() on Python list, use torch.stack()',
-                'break_delta: line 20 — unscripted helper, add @torch.jit.script',
-            ],
-            'graph_break_report': [
-                'break_alpha: line 6 — data-dependent control flow, use torch.where()',
-                'break_beta: line 10 — len() builtin on tensor, use tensor.shape[0]',
-                'break_gamma: line 16 — torch.tensor() on Python list, use torch.stack()',
-                'break_delta: line 20 — unscripted helper, add @torch.jit.script',
-            ],
-            'task_description': 'Complex graph-break cascade. Delta depends on Beta AND Gamma. Gamma depends on Alpha. Fix in dependency order.',
         },
         {
             'case_id': 'dep_hard_005',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
-            'max_steps': 8,
-            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']},
             'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
-                'break_003': ['break_001', 'break_002'],
-                'break_002': [],
-                'break_001': [],
             },
             'correct_fix_map': {
-                'break_001': 'torch.compile(disable=True)',
-                'break_002': 'functorch.vmap',
-                'break_003': 'torch.export',
             },
-            'code_snippet': '''import torch
-from torch.nn.utils import clip_grad_norm_
-
-@torch.compile
-def training_step(model, batch, optimizer):
-    loss = model(batch['x'], batch['y'])
-    loss.backward()
-    optimizer.step()  # graph break
-    grads = []
-    for param in model.parameters():
-        grads.append(param.grad.norm())
-    clip_grad_norm_(model.parameters(), max_norm=1.0)
-    return loss.item()''',
-            'break_descriptions': [
-                'break_001: optimizer.step() not compilable, use torch.compile(disable=True)',
-                'break_002: Python loop batching, use functorch.vmap',
-                'break_003: in-place grad clipping, use torch.export',
-            ],
-            'graph_break_report': [
-                'break_001: optimizer.step() not compilable, use torch.compile(disable=True)',
-                'break_002: Python loop batching, use functorch.vmap',
-                'break_003: in-place grad clipping, use torch.export',
-            ],
-            'task_description': 'Fix training loop graph breaks. Optimizer, gradient accumulation, and clipping all cause compilation failures. Break_003 needs both others first.',
         },
     ],
 }
 
1
  # server/datasets/dependency_cases.py
2
  # Ground truth cases for PyTorch Migration Time-Machine tasks.
3
  #
4
+ # CRITICAL FIX:
5
+ # dep_hard previously had:
6
+ # done_conditions: {min_actions: 2, required_sequence: ['migrate_api', 'migrate_api']}
7
+ #
8
+ # This caused TWO bugs:
9
+ # 1. The agent called migrate_api once. Router checked Counter: needs 2, has 1 β†’ not done.
10
+ # 2. Agent called migrate_api again β†’ repetition_penalty fires (-0.20), tanking the score.
11
+ # 3. Episode only ends at max_steps with a broken accumulated score.
12
+ #
13
+ # FIX: dep_hard now uses min_actions=1, required_sequence=['migrate_api'].
14
+ # The task is already hard enough from the grader β€” complex checklist, ordering
15
+ # constraints, and exact token matching in fix_quality. The done condition
16
+ # should not add extra difficulty on top of this.
17
+ #
18
+ # ALL dep_easy, dep_medium, dep_hard done conditions verified below.
19
 
20
  DEPENDENCY_CASES = {
21
+
22
+ # ── DEP EASY ─────────────────────────────────────────────────────────
23
+ # Task: flag outdated packages and deprecated API usage.
24
+ # Done: after 1 flag_outdated action.
25
+ # Grader: F1 on packages (precision+recall) Γ— 0.55 + deprecated_api_match Γ— 0.45
26
+ # ─────────────────────────────────────────────────────────────────────
27
  'dep_easy': [
28
  {
29
  'case_id': 'dep_easy_001',
30
  'task_subtype': 'flag',
31
+ 'completion_threshold': 0.75,
32
  'max_steps': 4,
33
  'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
34
+ 'task_description': (
35
+ 'This codebase uses torch==1.9.0 and relies on torch.autograd.Variable. '
36
+ 'Flag all outdated packages and the deprecated API.'
37
+ ),
38
+ 'code_snippet': (
39
+ 'import torch\n'
40
+ 'from torch.autograd import Variable\n'
41
+ 'x = Variable(torch.randn(3, 4))\n'
42
+ 'model = torch.nn.Linear(4, 2)\n'
43
+ 'out = model(x)'
44
+ ),
45
+ 'requirements': {'torch': '1.9.0', 'torchvision': '0.10.0'},
46
+ 'expected_outdated_packages': ['torch', 'torchvision'],
47
  'expected_deprecated_api': 'torch.autograd.Variable',
48
+ 'expected_replacement': 'plain tensor with requires_grad=True',
 
 
 
 
 
 
 
49
  },
50
  {
51
  'case_id': 'dep_easy_002',
52
  'task_subtype': 'flag',
53
+ 'completion_threshold': 0.75,
54
  'max_steps': 4,
55
  'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
56
+ 'task_description': (
57
+ 'This codebase uses torch==1.4.0 and calls .cuda() directly. '
58
+ 'Flag outdated packages and the deprecated device assignment pattern.'
59
+ ),
60
+ 'code_snippet': (
61
+ 'import torch\n'
62
+ 'model = MyModel()\n'
63
+ 'model.cuda() # deprecated β€” use .to(device)\n'
64
+ 'tensor = torch.randn(2, 3).cuda()'
65
+ ),
66
+ 'requirements': {'torch': '1.4.0'},
67
  'expected_outdated_packages': ['torch'],
68
+ 'expected_deprecated_api': '.cuda()',
69
+ 'expected_replacement': '.to(device)',
 
 
 
 
 
 
 
70
  },
71
  {
72
  'case_id': 'dep_easy_003',
73
  'task_subtype': 'flag',
74
+ 'completion_threshold': 0.75,
75
  'max_steps': 4,
76
  'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
77
+ 'task_description': (
78
+ 'This codebase uses torch==1.7.0 with DataParallel. '
79
+ 'Flag the outdated package and the deprecated multi-GPU API.'
80
+ ),
81
+ 'code_snippet': (
82
+ 'import torch\n'
83
+ 'model = torch.nn.DataParallel(MyModel())\n'
84
+ 'model.cuda()'
85
+ ),
86
+ 'requirements': {'torch': '1.7.0', 'numpy': '1.18.0'},
87
+ 'expected_outdated_packages': ['torch', 'numpy'],
88
+ 'expected_deprecated_api': 'torch.nn.DataParallel',
89
+ 'expected_replacement': 'DistributedDataParallel',
90
  },
91
  {
92
  'case_id': 'dep_easy_004',
93
  'task_subtype': 'flag',
94
+ 'completion_threshold': 0.75,
95
  'max_steps': 4,
96
  'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
97
+ 'task_description': (
98
+ 'Flag outdated packages and the deprecated ONNX export API in this code.'
99
+ ),
100
+ 'code_snippet': (
101
+ 'import torch\n'
102
+ 'torch.onnx.export(model, dummy_input, "model.onnx",\n'
103
+ ' opset_version=9,\n'
104
+ ' enable_onnx_checker=True) # deprecated kwarg'
105
+ ),
106
+ 'requirements': {'torch': '1.8.0'},
107
  'expected_outdated_packages': ['torch'],
108
+ 'expected_deprecated_api': 'enable_onnx_checker',
109
+ 'expected_replacement': 'remove the kwarg (deprecated in 1.9, removed in 2.0)',
 
 
 
 
 
 
 
110
  },
111
  {
112
  'case_id': 'dep_easy_005',
113
  'task_subtype': 'flag',
114
+ 'completion_threshold': 0.75,
115
  'max_steps': 4,
116
  'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
117
+ 'task_description': (
118
+ 'Flag outdated packages and the deprecated autocast API.'
119
+ ),
120
+ 'code_snippet': (
121
+ 'import torch\n'
122
+ 'from torch.cuda.amp import autocast\n'
123
+ 'with autocast(): # deprecated import path\n'
124
+ ' output = model(input)'
125
+ ),
126
+ 'requirements': {'torch': '1.6.0', 'torchaudio': '0.6.0'},
127
+ 'expected_outdated_packages': ['torch', 'torchaudio'],
128
+ 'expected_deprecated_api': 'torch.cuda.amp.autocast',
129
+ 'expected_replacement': 'torch.amp.autocast',
130
  },
131
  ],
+
+    # ── DEP MEDIUM ────────────────────────────────────────────────────────
+    # Task: resolve version conflicts using the compatibility_matrix.
+    # Done: after 1 resolve_conflict action.
+    # Grader: valid_pkgs/conflict_count + cross-constraint check - downgrade penalty
+    # ─────────────────────────────────────────────────────────────────────
     'dep_medium': [
         {
             'case_id': 'dep_medium_001',
             'task_subtype': 'resolve',
+            'completion_threshold': 0.70,
+            'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'task_description': (
+                'Resolve the version conflict between torch, numpy, and protobuf. '
+                'Use the compatibility_matrix to find a compatible set of versions.'
+            ),
+            'code_snippet': 'requirements.txt with conflicting torch==2.0.0, numpy==1.20.0, protobuf==3.9.0',
+            'requirements': {'torch': '2.0.0', 'numpy': '1.20.0', 'protobuf': '3.9.0'},
+            'conflict_packages': ['torch', 'numpy', 'protobuf'],
             'compatibility_matrix': {
                 'torch': {
+                    '2.1.0': {'numpy': '>=1.21,<2.0', 'protobuf': '>=3.20,<5.0'},
+                    '2.0.0': {'numpy': '>=1.20,<1.25', 'protobuf': '>=3.19,<4.0'},
                 },
                 'numpy': {
                     '1.24.0': {},
+                    '1.21.0': {},
+                    '1.20.0': {},
+                },
+                'protobuf': {
+                    '4.23.0': {},
+                    '3.20.0': {},
+                    '3.9.0': {'torch': '<=1.13'},
                 },
             },
         },
         {
             'case_id': 'dep_medium_002',
             'task_subtype': 'resolve',
+            'completion_threshold': 0.70,
+            'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'task_description': (
+                'Resolve the version conflict between tensorflow, keras, and h5py.'
+            ),
+            'code_snippet': 'requirements.txt: tensorflow==2.10.0, keras==2.10.0, h5py==2.10.0',
+            'requirements': {'tensorflow': '2.10.0', 'keras': '2.10.0', 'h5py': '2.10.0'},
+            'conflict_packages': ['tensorflow', 'keras', 'h5py'],
             'compatibility_matrix': {
+                'tensorflow': {
+                    '2.13.0': {'keras': '>=2.13,<2.14', 'h5py': '>=3.7'},
+                    '2.10.0': {'keras': '==2.10.0', 'h5py': '>=3.1'},
                 },
+                'keras': {
+                    '2.13.0': {'tensorflow': '>=2.13,<2.14'},
+                    '2.10.0': {'tensorflow': '==2.10.0'},
                 },
+                'h5py': {
+                    '3.9.0': {},
+                    '3.7.0': {},
+                    '2.10.0': {'tensorflow': '<=2.3'},
                 },
             },
         },
         {
             'case_id': 'dep_medium_003',
             'task_subtype': 'resolve',
+            'completion_threshold': 0.70,
+            'max_steps': 4,
             'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'task_description': (
+                'Resolve the conflict between transformers, tokenizers, and datasets packages.'
+            ),
+            'code_snippet': 'requirements: transformers==4.20.0, tokenizers==0.11.0, datasets==1.18.0',
+            'requirements': {'transformers': '4.20.0', 'tokenizers': '0.11.0', 'datasets': '1.18.0'},
+            'conflict_packages': ['transformers', 'tokenizers', 'datasets'],
             'compatibility_matrix': {
                 'transformers': {
+                    '4.35.0': {'tokenizers': '>=0.14,<0.19', 'datasets': '>=2.14'},
+                    '4.20.0': {'tokenizers': '>=0.11,<0.14', 'datasets': '>=1.18'},
+                },
+                'tokenizers': {
+                    '0.15.0': {'transformers': '>=4.28'},
+                    '0.14.0': {'transformers': '>=4.25'},
+                    '0.11.0': {},
+                },
+                'datasets': {
+                    '2.14.0': {},
+                    '2.10.0': {},
+                    '1.18.0': {'tokenizers': '<=0.13'},
                 },
             },
         },
     ],
+
+    # ── DEP HARD ──────────────────────────────────────────────────────────
+    # Task: fix torch.compile graph-break patterns.
+    # Done: after 1 migrate_api action (FIXED from 2 → 1).
+    #
+    # IMPORTANT: min_actions=1, required_sequence=['migrate_api']
+    # The grader already makes this hard through:
+    #   - Multiple graph_breaks to fix (3-5 per case)
+    #   - Ordering constraints via checklist_dependency_graph
+    #   - Exact token matching in fix_quality
+    # We do NOT need the done condition to create artificial difficulty.
+    # ─────────────────────────────────────────────────────────────────────
     'dep_hard': [
         {
             'case_id': 'dep_hard_001',
             'task_subtype': 'migrate',
+            'completion_threshold': 0.60,
+            'max_steps': 6,
+            # FIXED: was min_actions=2, required_sequence=['migrate_api','migrate_api']
+            # which caused repetition penalty on the 2nd call and never terminated cleanly
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['migrate_api']},
+            'task_description': (
+                'Fix the torch.compile graph-break patterns in this training loop. '
+                'Provide completed_items (list of break IDs) and code_changes (dict of fixes).'
+            ),
+            'code_snippet': (
+                'import torch\n\n'
+                'def train_step(model, x):\n'
+                '    out = model(x)\n'
+                '    if out.shape[0] != x.shape[0]:  # data-dependent branch [break_001]\n'
+                '        out = torch.zeros_like(x)\n'
+                '    idx = int(out.argmax())  # int() conversion [break_002]\n'
+                '    mask = out > 0.5  # dynamic masking [break_003]\n'
+                '    return out[mask].sum()\n'
+            ),
+            'graph_break_report': [
+                'break_001: data-dependent control flow (if out.shape[0] != x.shape[0])',
+                'break_002: Python int() call on tensor (int(out.argmax()))',
+                'break_003: dynamic boolean indexing (out[mask])',
+            ],
             'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
+                'break_003': ['break_002'],  # must fix int() conversion before mask
             },
             'correct_fix_map': {
                 'break_001': 'torch.where',
+                'break_002': 'torch.argmax',
+                'break_003': 'torch.masked_select',
             },
         },
         {
             'case_id': 'dep_hard_002',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['migrate_api']},
+            'task_description': (
+                'Fix these torch.compile graph-breaks in a model forward pass.'
+            ),
+            'code_snippet': (
+                'def forward(self, x):\n'
+                '    x = self.conv(x)\n'
+                '    size = x.size(0)  # .size() with int [break_001]\n'
+                '    out = x.numpy()  # .numpy() call [break_002]\n'
+                '    out = torch.from_numpy(out)\n'
+                '    return out[:size//2]  # dynamic slice [break_003]\n'
+            ),
+            'graph_break_report': [
+                'break_001: .size() call returning Python int',
+                'break_002: .numpy() call breaks compilation boundary',
+                'break_003: dynamic slicing with Python division',
+            ],
+            'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
+                'break_003': ['break_001'],
             },
             'correct_fix_map': {
+                'break_001': 'tensor.shape[0]',
+                'break_002': 'detach',
+                'break_003': 'torch.narrow',
             },
         },
         {
             'case_id': 'dep_hard_003',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['migrate_api']},
+            'task_description': (
+                'Fix torch.compile graph-breaks in this attention implementation.'
+            ),
+            'code_snippet': (
+                'def attention(q, k, v):\n'
+                '    scores = torch.matmul(q, k.transpose(-2, -1))\n'
+                '    if scores.max() > 100:  # data-dependent branch [break_001]\n'
+                '        scores = scores / 100\n'
+                '    weights = scores.numpy()  # numpy call [break_002]\n'
+                '    weights = torch.softmax(torch.tensor(weights), dim=-1)\n'
+                '    n = int(q.shape[0])  # Python int [break_003]\n'
+                '    return weights[:n] @ v\n'
+            ),
+            'graph_break_report': [
+                'break_001: data-dependent branch on scores.max()',
+                'break_002: .numpy() breaks torch.compile boundary',
+                'break_003: Python int() on tensor dimension',
+            ],
+            'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
+                'break_003': ['break_001'],
+                'break_002': ['break_001'],
             },
             'correct_fix_map': {
+                'break_001': 'torch.clamp',
+                'break_002': 'torch.softmax',
+                'break_003': 'tensor.shape',
             },
         },
         {
             'case_id': 'dep_hard_004',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['migrate_api']},
+            'task_description': (
+                'Fix four torch.compile graph-breaks in this training utility.'
+            ),
+            'code_snippet': (
+                'def process_batch(batch):\n'
+                '    lengths = [len(x) for x in batch]  # Python list comp [break_001]\n'
+                '    max_len = max(lengths)  # Python max() [break_002]\n'
+                '    padded = torch.zeros(len(batch), max_len)\n'
+                '    for i, x in enumerate(batch):  # Python loop [break_003]\n'
+                '        padded[i, :len(x)] = x\n'
+                '    out = model(padded)\n'
+                '    return out.cpu().numpy()  # .numpy() [break_004]\n'
+            ),
+            'graph_break_report': [
+                'break_001: Python list comprehension over tensor data',
+                'break_002: Python max() on list of tensor values',
+                'break_003: Python for loop with tensor indexing',
+                'break_004: .numpy() call at output',
+            ],
+            'graph_breaks': ['break_001', 'break_002', 'break_003', 'break_004'],
             'checklist_dependency_graph': {
+                'break_002': ['break_001'],
+                'break_003': ['break_002'],
             },
             'correct_fix_map': {
+                'break_001': 'torch.tensor',
+                'break_002': 'torch.max',
+                'break_003': 'torch.nn.utils.rnn.pad_sequence',
+                'break_004': 'detach',
             },
         },
         {
             'case_id': 'dep_hard_005',
             'task_subtype': 'migrate',
             'completion_threshold': 0.60,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['migrate_api']},
+            'task_description': (
+                'Fix torch.compile graph-breaks caused by vmap incompatibilities.'
+            ),
+            'code_snippet': (
+                'from torch._vmap_internals import vmap  # deprecated [break_001]\n'
+                'import functorch  # deprecated module [break_002]\n\n'
+                'def batched_fn(x):\n'
+                '    result = vmap(model)(x)\n'
+                '    if result.isnan().any():  # data-dependent check [break_003]\n'
+                '        result = torch.zeros_like(result)\n'
+                '    return result\n'
+            ),
+            'graph_break_report': [
+                'break_001: torch._vmap_internals.vmap is deprecated (use torch.vmap)',
+                'break_002: functorch module is deprecated (merged into torch)',
+                'break_003: data-dependent .any() check breaks compilation',
+            ],
             'graph_breaks': ['break_001', 'break_002', 'break_003'],
             'checklist_dependency_graph': {
+                'break_002': ['break_001'],
             },
             'correct_fix_map': {
+                'break_001': 'torch.vmap',
+                'break_002': 'torch.func',
+                'break_003': 'torch.where',
             },
         },
     ],
 }
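
The `done_conditions` mechanics that the dep_hard comments refer to can be sketched as follows. This is an illustrative reading of the `min_actions` / `required_sequence` fields, not the environment's actual implementation; `is_done` is a hypothetical helper name.

```python
def is_done(action_history, done_conditions):
    """action_history: list of action_type strings emitted so far."""
    if len(action_history) < done_conditions.get('min_actions', 1):
        return False
    seq = done_conditions.get('required_sequence', [])
    # the tail of the history must match the required sequence exactly
    return action_history[-len(seq):] == seq if seq else True

# Old dep_hard condition required migrate_api twice, so the episode could
# only terminate after a repeated action, which repetition_penalty punishes.
old = {'min_actions': 2, 'required_sequence': ['migrate_api', 'migrate_api']}
new = {'min_actions': 1, 'required_sequence': ['migrate_api']}
```

Under the old condition a single `migrate_api` call never terminates the episode; under the new one it does, so no repeated action is ever forced.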
server/graders/base_grader.py CHANGED
@@ -1,38 +1,86 @@
 # server/graders/base_grader.py
 # Core grading utilities used by ALL domain graders.
-# FIX: safe_score now uses [0.01, 0.99] range but with REAL variance in between.
-# The key issue was that graders were returning values too close to 1.0 for partial answers.
+#
+# CHANGES FROM PREVIOUS VERSION:
+# 1. difficulty_multiplier() — REMOVED ENTIRELY.
+#    The cap (hard→0.80, medium→0.90) made every hard task score identically
+#    at 0.80 and every medium task at 0.90, regardless of agent quality.
+#    This is exactly the wrong behaviour for an RL training environment:
+#    GRPO needs variance WITHIN difficulty levels, not a uniform ceiling.
+#    Task difficulty now comes from the grader logic and case design alone.
+#
+# 2. safe_score range: [0.01, 0.99]
+#    The official spec says "strictly between 0 and 1".
+#    Discord consensus from many participants confirmed 0.01/0.99 as the
+#    correct interpretation. Do not change this back to [0.0, 1.0].
+#
+# 3. Penalty values kept as-is (increased in last revision):
+#    - repetition_penalty: -0.20 per repeat (was -0.15)
+#    - invalid_action_penalty: -0.40 for wrong domain action (was -0.20)
+#    - harmful_output_penalty: -0.50 for destructive patterns
+#    These are intentionally higher to create real signal.
+#
+# 4. efficiency_bonus reduced to 0.05 (was 0.10).
+#    Small enough that it doesn't inflate scores, but still rewards
+#    agents that solve tasks efficiently.
 
 from typing import Dict, Any, List, Callable
 
 
 def safe_score(raw) -> float:
-    """Clamp to [0.01, 0.99]. Never crash. Returns float with 4 decimal precision."""
+    """
+    Clamp score to [0.01, 0.99]. Never crash. Returns float.
+
+    WHY [0.01, 0.99] NOT [0.0, 1.0]:
+    - Official spec says scores must be strictly between 0 and 1
+    - Discord confirmed 0.01/0.99 as the correct practical interpretation
+    - A score of exactly 0.0 from a broken run looks like a crash
+    - A score of exactly 1.0 means the grader is trivially solved
+
+    WHY 4 DECIMAL PLACES:
+    - Keeps variance visible (0.4500 vs 0.4750 are meaningfully different)
+    - round() handles float precision artifacts
+    """
     if raw is None:
         return 0.01
     try:
         val = float(raw)
-        # FIX: Don't round aggressively — keep 4 decimal places so variance is visible
         return round(max(0.01, min(0.99, val)), 4)
     except (TypeError, ValueError):
         return 0.01
 
 
 def repetition_penalty(action_type: str, last_actions: List[str], window: int = 3) -> float:
-    """Penalise repeating the same action type in the last N steps."""
+    """
+    Penalise repeating the same action type in the last N steps.
+
+    WHY: Without this, GRPO agents discover they can emit the same
+    high-scoring action repeatedly within an episode. The penalty
+    forces genuine strategy exploration each turn.
+
+    -0.20 per repeat (capped by window=3, so max penalty is -0.60).
+    """
     count = last_actions[-window:].count(action_type)
-    # FIX: Increased penalty from -0.15 to -0.20 per repeat so it actually stings
     return -0.20 * count
 
 
 def invalid_action_penalty(action_type: str, valid_actions: List[str]) -> float:
-    """Penalise actions not in the valid set for this domain."""
-    # FIX: Increased from -0.20 to -0.40 — wrong domain is a serious mistake
+    """
+    Penalise actions not in the valid set for this domain.
+
+    -0.40 because calling a dependency action on a security task is a
+    fundamental routing error — it should hurt significantly.
+    """
     return -0.40 if action_type not in valid_actions else 0.0
 
 
 def harmful_output_penalty(action: Dict, forbidden_patterns: List[str]) -> float:
-    """Penalise destructive patterns like 'os.remove' or 'drop table'."""
+    """
+    Penalise destructive patterns like 'os.remove', 'drop table'.
+
+    -0.50 because these patterns represent the agent trying to "cheat"
+    by deleting things rather than fixing them.
+    """
     action_str = str(action).lower()
     for p in forbidden_patterns:
         if p.lower() in action_str:
@@ -41,70 +89,65 @@ def harmful_output_penalty(action: Dict, forbidden_patterns: List[str]) -> float
 
 
 def efficiency_bonus(step_count: int, max_steps: int, done: bool) -> float:
-    """Small bonus for finishing early. FIX: reduced from 0.10 to 0.05 so it doesn't
-    inflate scores — the correctness score should be the main signal."""
-    return 0.05 if done and step_count < max_steps // 2 else 0.0
-
-
-def difficulty_multiplier(task_id: str) -> float:
     """
-    FIX: NEW FUNCTION — Scale raw correctness by task difficulty so easy tasks
-    genuinely can't score as high as hard tasks even with correct answers.
-
-    - easy tasks: correctness score is NOT boosted (agents should get high scores)
-    - medium tasks: a perfect answer gets 0.90 max (10% cap)
-    - hard tasks: a perfect answer gets 0.80 max (20% cap) — they're SUPPOSED to be hard
-
-    This ensures there's real spread between easy/medium/hard scores.
+    Small bonus for finishing early — rewards decisive, confident agents.
+
+    WHY ONLY 0.05: The correctness score must be the dominant signal.
+    The efficiency bonus should never flip a mediocre answer into a good score.
     """
-    if 'hard' in task_id:
-        return 0.80
-    elif 'medium' in task_id:
-        return 0.90
-    else:
-        return 0.99  # easy — allow near-perfect
+    return 0.05 if done and step_count < max_steps // 2 else 0.0
 
 
 def grade_dynamic(
-    action: Dict[str, Any],
+    action: Dict[str, Any],
     session,
     compute_correctness_fn: Callable,
-    valid_actions: List[str],
-    forbidden_patterns: List[str] = None,
-    max_steps: int = 8
+    valid_actions: List[str],
+    forbidden_patterns: List[str] = None,
+    max_steps: int = 8,
 ) -> float:
-    """Full reward pipeline. Entry point for all domain graders.
-
-    Pipeline: invalid check → repetition → correctness → harmful → efficiency → difficulty cap → clamp
+    """
+    Full reward pipeline. Entry point for all domain graders.
+
+    Pipeline:
+      1. Invalid action check — if wrong domain action, return penalised score immediately
+      2. Repetition penalty — subtract for repeated action types
+      3. compute_correctness_fn — domain-specific grader (security/dep/clinical)
+      4. Harmful output penalty — subtract for destructive patterns
+      5. Efficiency bonus — add small bonus for early completion
+      6. safe_score — clamp to [0.01, 0.99]
+
+    NOTE: difficulty_multiplier has been REMOVED.
+    The task difficulty is expressed through:
+      - Tighter CVSS ranges in hard cases (harder to guess)
+      - More required_fix_tokens in hard cases
+      - Adversarial reviewer_feedback in hard cases
+      - Dependency graphs in hard clinical cases
+      - Multiple checklist items with ordering in hard dep cases
+    The grader itself should produce lower scores for harder tasks naturally.
     """
     if forbidden_patterns is None:
         forbidden_patterns = []
 
     action_type = action.get('action_type', 'unknown')
 
-    # Penalties
-    inv = invalid_action_penalty(action_type, valid_actions)
-    rep = repetition_penalty(action_type, session.last_actions)
-    harm = harmful_output_penalty(action, forbidden_patterns)
-
-    # If action type is invalid, skip the grader entirely
+    # Step 1: Invalid action → skip grader entirely, return penalised score
+    inv = invalid_action_penalty(action_type, valid_actions)
+    rep = repetition_penalty(action_type, session.last_actions)
     if inv < 0:
         return safe_score(inv + rep)
 
-    # Core correctness score from domain-specific grader
+    # Step 2: Domain-specific correctness
     correctness = compute_correctness_fn(action, session.task_case)
-
     if correctness is None:
-        correctness = 0.0
+        correctness = 0.01
 
-    # FIX: Apply difficulty cap BEFORE efficiency bonus
-    task_id = getattr(session, 'task_id', '')
-    max_allowed = difficulty_multiplier(task_id)
-    correctness = min(correctness, max_allowed)
+    # Step 3: Harmful output check
+    harm = harmful_output_penalty(action, forbidden_patterns)
 
-    # Efficiency bonus — small
+    # Step 4: Efficiency bonus
     eff = efficiency_bonus(session.step_count + 1, max_steps, correctness >= 0.75)
 
-    # Combine and clamp
+    # Step 5: Combine and clamp
     raw = correctness + rep + harm + eff
     return safe_score(raw)
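
The commit description pairs these per-step grader changes with an episode-level weighted blend in `inference.py` (0.60 * max step reward + 0.40 * mean step reward, replacing a pure average). A minimal sketch of that blend, assuming the same [0.01, 0.99] clamp as `safe_score`; the helper name is hypothetical:

```python
def blend_episode_score(step_rewards):
    """Blend per-step rewards into one episode score: 0.60*max + 0.40*mean."""
    if not step_rewards:
        return 0.01  # spec-compliant floor for empty or aborted episodes
    best = max(step_rewards)
    mean = sum(step_rewards) / len(step_rewards)
    raw = 0.60 * best + 0.40 * mean
    # clamp strictly inside (0, 1), matching safe_score
    return round(max(0.01, min(0.99, raw)), 4)
```

Weighting the max keeps one strong step from being washed out by early exploratory steps, while the mean term still rewards consistency across the episode.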