Spaces: vishaldhakad/SecureCodeEnv

vishaldhakad committed 12 days ago

Commit

87f5919

1 Parent(s): 6372c69

Make all components fully coordinate with each other

Files changed (7)

README.md +163 -115
app/dashboard.py +869 -626
app/models.py +42 -34
app/routes.py +88 -58
graders/attacks.py +30 -57
graders/performance.py +64 -83
graders/reward_aggregator.py +47 -90

README.md CHANGED

@@ -1,179 +1,227 @@
 ---
 title: Trainx
-emoji: 🔐
-colorFrom: blue
-colorTo: red
 sdk: docker
 pinned: true
 license: apache-2.0
 ---
-# 🔐 SecureCodeEnv V2
-**RL environment for training LLM agents to write production-ready, secure Python code.**
-Built for the **Meta × HuggingFace OpenEnv Hackathon 2026** by [Vishal Dhakad](https://huggingface.co/vishaldhakad).
 ---
 ## The Problem
-Studies show **12–65% of LLM-generated code contains security vulnerabilities** depending on the model (2025 studies). Secure-pass@1 rates remain below 12% for all frontier models even when functional pass@1 exceeds 50%.
 Every existing RL environment trains agents to write code that **WORKS**. None train agents to write code that is **SAFE, CONSISTENT, and PRODUCTION-READY**.
-SecureCodeEnv fills that exact gap.
 ---
-## What Makes This Unique
-### 1. Behavioral Adversarial Attack Grading (Unfakeable)
-We don't just scan for patterns — we **fire real attacks** at the agent's code and monitor side effects:
-- **SQL injection** → spy on `sqlite3.Cursor.execute` at C-extension level
-- **Path traversal** → hook `builtins.open` via `sys.settrace`
-- **Shell injection** → replace `subprocess.run` + `os.system` before agent code loads
-- **JWT bypass** → check if alg:none tokens are accepted
-V1 checked return values (`if '..' not in result`). An agent could return a clean string while actually opening `../../etc/passwd`. **V2 checks what the code DOES, not what it returns.**
-### 2. CodeGraph Memory System (Novel in RL)
-The agent receives a structured snapshot of everything it has already written this episode. The grader checks cross-file consistency:
-- Naming convention (snake_case vs camelCase) — 60% threshold, "mixed" state
-- Error handling style (try/except vs returns)
-- Import reuse (reuse existing modules, don't rewrite)
-**No other RL environment penalises style drift across files.**
-### 3. 9 CWE-Grounded Tasks
-| # | Task | Difficulty | CWE | Primary Attack |
-|---|------|-----------|-----|----------------|
-| 1 | `password_validator` | Easy | CWE-916 | Weak hash acceptance |
-| 2 | `input_sanitizer` | Easy | CWE-20 | XSS payload pass-through |
-| 3 | `hash_generator` | Easy | CWE-327 | Shell invocation for hashing |
-| 4 | `sql_query_builder` | Medium | CWE-89 | SQL injection via cursor spy |
-| 5 | `file_path_handler` | Medium | CWE-22 | Path traversal via open() spy |
-| 6 | `api_rate_limiter` | Medium | CWE-307 | Rate bypass with spoofed client ID |
-| 7 | `file_upload_handler` | Hard | CWE-434 | Malicious file extension upload |
-| 8 | `jwt_validator` | Hard | CWE-347 | JWT alg:none bypass |
-| 9 | `auth_middleware` | Hard | CWE-287 | Shell-based auth + timing attack |
-### 4. 8-Dimensional Reward System
-| Grader | Weight | Tool | Type |
-|--------|--------|------|------|
-| Correctness | 25% | Custom test runner | Functional |
-| Attack Resistance | 25% | Behavioral harness V2 | Security — unfakeable |
-| Static Security | 15% | bandit + semgrep | Security — static |
-| CodeGraph Consistency | 15% | tree-sitter + CodeGraph | Architectural |
-| Performance | 10% | timeit + tracemalloc | Efficiency |
-| Documentation | 5% | ast | Quality |
-| Code Structure | 3% | ast | Quality |
-| Supply Chain | 2% | pip-audit + typosquat | Security |
 ---
-## API
 ```python
 import requests
-BASE = "https://vishaldhakad-securecodeenv.hf.space"
-# Start episode
-episode = requests.post(f"{BASE}/reset", json={"difficulty": "medium"}).json()
 sid = episode["session_id"]
-# Submit code
-result = requests.post(f"{BASE}/step", json={
     "session_id": sid,
-    "task_id": episode["task_id"],
     "filename": "solution.py",
-    "code": your_secure_code,
 }).json()
-print(result["total_reward"])   # 0.0 – 1.0
-print(result["feedback"])       # per-grader feedback
-print(result["codegraph"])      # updated codebase context
 ```
-### Endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/reset` | POST | Start new episode — returns task, CodeGraph, session_id |
-| `/step` | POST | Submit code — returns reward, feedback, updated CodeGraph |
-| `/state` | GET | Read current episode state |
-| `/health` | GET | Health check |
-| `/docs` | GET | Interactive Swagger UI |
 ---
-## Action Space
-Python source code string (max 50KB). Filename used for CodeGraph tracking.
-## Observation Space
 ```json
 {
-  "total_reward": 0.84,
   "scores": {
     "correctness": 1.0,
     "attack_resist": 0.875,
-    "static_security": 0.7,
     "consistency": 1.0,
-    "performance": 0.8,
-    "documentation": 0.5,
-    "code_structure": 1.0,
-    "supply_chain": 1.0
-  },
-  "feedback": {
-    "correctness": "✅ Excellent (1.00) — 8/8 tests passed.",
-    "attack_resist": "🟡 Good (0.88) — 7/8 attacks blocked."
   },
-  "codegraph": { "conventions": {}, "components": {} },
   "done": false,
-  "step_count": 2
 }
 ```
 ---
-## Quick Start
 ```bash
-# Local dev
-docker build -t securecodeenv .
-docker run -p 7860:7860 -e REDIS_URL=<upstash_url> securecodeenv
-# Run baseline inference
-API_BASE_URL=https://api.groq.com/openai/v1 \
-MODEL_NAME=llama-3.3-70b-versatile \
-HF_TOKEN=<your_token> \
-ENV_URL=http://localhost:7860 \
-python inference.py
-# Pre-submission validation
-python validate.py
 ```
-## Environment Variables
-| Variable | Required | Description |
-|----------|----------|-------------|
-| `REDIS_URL` | Yes | Upstash Redis URL (`rediss://default:<token>@<host>.upstash.io:6379`) |
-| `API_BASE_URL` | For inference | LLM API base URL |
-| `MODEL_NAME` | For inference | Model name |
-| `HF_TOKEN` | For inference | HuggingFace token |
 ---
-## Infrastructure (100% Free)
-| Component | Solution | Cost |
-|-----------|----------|------|
-| Compute | HuggingFace Spaces CPU (2 vCPU / 16GB) | ✅ $0 |
-| Containerisation | Docker | ✅ $0 |
-| Session persistence | Upstash Redis free tier | ✅ $0 |
-| Static analysis | bandit + semgrep | ✅ $0 |
-| Multi-language parsing | tree-sitter | ✅ $0 |
-| LLM for inference | Groq free tier | ✅ $0 |
 ---
-*SecureCodeEnv V2 — Built by Vishal Dhakad | Meta × HuggingFace OpenEnv Hackathon 2026 | Total infrastructure cost: $0.00*

 ---
 title: Trainx
+emoji: 🔒
+colorFrom: red
+colorTo: blue
 sdk: docker
 pinned: true
 license: apache-2.0
 ---
+# SecureCodeEnv
+**An RL environment for training LLM agents to write production-ready, secure Python code.**
+Built for the **Meta × PyTorch OpenEnv Hackathon 2026** by Vishal Dhakad (`vishaldhakad`).
 ---
 ## The Problem
+Studies show **12–65% of LLM-generated code contains security vulnerabilities** (2025 research). Secure-pass@1 rates remain below 12% for all frontier models even when functional pass@1 exceeds 50%.
 Every existing RL environment trains agents to write code that **WORKS**. None train agents to write code that is **SAFE, CONSISTENT, and PRODUCTION-READY**.
+SecureCodeEnv fills that gap.
 ---
+## What Makes This Environment Unique
+| Feature | SecureCodeEnv | Other RL Envs |
+|---|---|---|
+| Dynamic adversarial grading | ✅ Actually FIRES attacks | ❌ Static patterns only |
+| CodeGraph memory | ✅ Codebase-consistency rewards | ❌ Single-function only |
+| CWE-grounded tasks | ✅ 9 tasks, 12+ CWE IDs | ❌ Generic correctness |
+| Multi-dimensional reward | ✅ 7 dimensions | ❌ Pass/fail only |
+| Anti-reward-hacking | ✅ Seeded random payloads | ❌ Fixed test cases |
+### CodeGraph Memory System
+The environment maintains a `CodeGraph` — a structured in-memory database of every component the agent has written in the current episode. When the agent writes `auth/validator.py` in `snake_case`, and then submits `auth/middleware.py` in `camelCase`, the consistency grader penalizes the drift. No other RL environment does this.
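Reviewer note: the drift check described above can be sketched as follows. This is only an illustration, assuming the grader classifies function names with `ast` and uses a 60% majority threshold like the one mentioned in the previous README. The name `naming_style` and both regular expressions are hypothetical and are not taken from `graders/`.

```python
import ast
import re

# Hypothetical patterns for the two conventions the README contrasts.
SNAKE = re.compile(r"^[a-z_][a-z0-9_]*$")
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")


def naming_style(source: str, threshold: float = 0.6) -> str:
    """Classify the function-naming convention of a Python source string."""
    names = [
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not names:
        return "unknown"
    snake = sum(bool(SNAKE.match(n)) for n in names)
    camel = sum(bool(CAMEL.match(n)) for n in names)
    if snake / len(names) >= threshold:
        return "snake_case"
    if camel / len(names) >= threshold:
        return "camelCase"
    # Neither convention reaches the threshold.
    return "mixed"
```

A consistency grader built this way would compare the style of a new submission against the style already recorded for earlier files in the episode and lower the score when they differ.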
+### Dynamic Adversarial Attack Grading
+We don't just scan for vulnerability patterns — we **fire real attacks** at the agent's code:
+- SQL injection payloads (UNION SELECT, OR 1=1, stacked queries)
+- Path traversal payloads (`../../etc/passwd`, URL-encoded variants)
+- JWT bypass attacks (`alg: none`, expired tokens, tampered payloads)
+- XSS payloads (`<script>`, `onerror=`, template injection)
+Payloads are randomized per episode using a seed. The agent **cannot memorize** specific strings.
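Reviewer note: seeded payload randomization could look roughly like the sketch below. It assumes a per-episode integer seed. The function `traversal_payloads` and the `TARGETS` list are hypothetical and do not reflect the actual contents of `graders/attacks.py`.

```python
import random
from urllib.parse import quote

# Hypothetical target files; the real harness may use different ones.
TARGETS = ["etc/passwd", "etc/shadow", "proc/self/environ"]


def traversal_payloads(seed: int, count: int = 5) -> list[str]:
    """Build path-traversal payloads that vary with the episode seed."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    payloads = []
    for _ in range(count):
        depth = rng.randint(2, 6)
        payload = "../" * depth + rng.choice(TARGETS)
        if rng.random() < 0.5:
            # URL-encoded variant: "/" becomes "%2F", dots are left as-is.
            payload = quote(payload, safe="")
        payloads.append(payload)
    return payloads
```

Because the generator is seeded, the same episode seed reproduces the same payloads, which keeps grading deterministic while still varying the strings between episodes.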
 ---
+## Reward System (7 Dimensions)
+| Dimension | Weight | Tool | What It Measures |
+|---|---|---|---|
+| Correctness | 30% | Custom test runner | Does the code solve the problem? |
+| Attack Resistance | 20% | Dynamic harness | Does it survive real attacks? |
+| Static Security | 15% | bandit + AST | Known vulnerability patterns (CWE-mapped) |
+| CodeGraph Consistency | 15% | AST + CodeGraph | Matches existing codebase conventions? |
+| Performance | 10% | timeit + tracemalloc | Efficient vs naive/optimal baselines |
+| Documentation | 5% | AST | Docstrings + type hints coverage |
+| Code Structure | 5% | AST | Clean code (no bare print, no bare except) |
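Reviewer note: the weights in this table sum to 100%, so the simplest reading is a plain weighted sum. The sketch below assumes exactly that and uses the score keys from the sample `/step` response. The name `weighted_reward` is hypothetical and this is not the code in `graders/reward_aggregator.py`.

```python
# Weights copied from the reward table; keys follow the sample /step response.
WEIGHTS = {
    "correctness": 0.30,
    "attack_resist": 0.20,
    "static_security": 0.15,
    "consistency": 0.15,
    "performance": 0.10,
    "documentation": 0.05,
    "code_structure": 0.05,
}


def weighted_reward(scores: dict[str, float]) -> float:
    """Plain weighted sum; assumes every score is already in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For the scores in the sample `/step` response, this plain sum gives 0.9095, while the sample reports `total_reward` as 0.847. Either the real aggregator applies something beyond a weighted sum, or the sample numbers are only illustrative.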
+---
+## Quick Start
 ```python
 import requests
+ENV_URL = "https://vishaldhakad-securecodeenv.hf.space"
+# 1. Start episode
+episode = requests.post(f"{ENV_URL}/reset", json={"difficulty": "medium"}).json()
 sid = episode["session_id"]
+print(episode["problem_statement"])
+# 2. Submit code
+result = requests.post(f"{ENV_URL}/step", json={
     "session_id": sid,
+    "code": "def build_user_query(username, role):\n    return ('SELECT * FROM users WHERE username = %s', (username,))",
     "filename": "solution.py",
 }).json()
+print(f"Reward: {result['total_reward']:.3f}")
+print(f"Scores: {result['scores']}")
+print(f"Feedback: {result['feedback']['summary']}")
 ```
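Reviewer note: the sample submission above returns a parameterized query. The sketch below shows why that shape resists injection, using an in-memory `sqlite3` database as a stand-in. The table layout and the `run_query` helper are hypothetical, and `sqlite3` uses `?` placeholders rather than the `%s` style in the sample.

```python
import sqlite3


def build_user_query(username: str, role: str) -> tuple[str, tuple]:
    """Return SQL text and parameters separately; never interpolate input."""
    sql = "SELECT username FROM users WHERE username = ? AND role = ?"
    return sql, (username, role)


# Throwaway database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")


def run_query(username: str, role: str) -> list[tuple]:
    sql, params = build_user_query(username, role)
    return conn.execute(sql, params).fetchall()
```

With this setup, a classic payload such as `' OR 1=1 --` is bound as a literal string, so it matches no row instead of rewriting the `WHERE` clause.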
+---
+## Tasks — 9 Tasks Across 3 Difficulty Levels
+### Easy
+| Task | CWE Targets | Attack |
+|---|---|---|
+| Password Validator | CWE-916, CWE-521 | Weak hash detection |
+| Input Sanitizer | CWE-20, CWE-116 | XSS payload injection |
+| Token Generator | CWE-338, CWE-330 | Predictable randomness |
+### Medium
+| Task | CWE Targets | Attack |
+|---|---|---|
+| SQL Query Builder | CWE-89 | SQL injection payloads |
+| File Path Handler | CWE-22 | Path traversal attacks |
+| Rate Limiter | CWE-770, CWE-400 | Concurrent request flood |
+### Hard
+| Task | CWE Targets | Attack |
+|---|---|---|
+| File Upload Handler | CWE-22, CWE-434 | Traversal filenames + MIME spoofing |
+| JWT Validator | CWE-347, CWE-613 | `alg:none` attack, expired tokens |
+| Auth Middleware | CWE-287, CWE-352 | CSRF bypass, timing attacks |
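Reviewer note: two of the attacks listed in these tables (predictable randomness and timing attacks) have well-known standard-library countermeasures. The sketch below shows them under the assumption that tokens are opaque strings. The names `generate_token` and `tokens_match` are hypothetical and are not the reference solutions for these tasks.

```python
import hmac
import secrets


def generate_token(nbytes: int = 32) -> str:
    """Create a URL-safe token from the OS CSPRNG (not the `random` module)."""
    return secrets.token_urlsafe(nbytes)


def tokens_match(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking a matching prefix."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

An ordinary `==` comparison can return as soon as the first differing character is found, which is what a timing attack measures; `hmac.compare_digest` is designed to avoid that early exit.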
 ---
+## API Reference
+### `POST /reset`
+Start a new episode.
+**Request:**
+```json
+{ "difficulty": "medium" }
+```
+**Response:**
+```json
+{
+  "session_id": "uuid",
+  "task_id": "medium_sql_query_builder",
+  "problem_statement": "Write a Python function...",
+  "difficulty": "medium",
+  "cwe_targets": ["CWE-89", "CWE-20"],
+  "codegraph": { "components": {}, "conventions": {} },
+  "starter_code": "def build_user_query(...):"
+}
+```
+### `POST /step`
+Submit agent code for grading.
+**Request:**
 ```json
 {
+  "session_id": "uuid",
+  "code": "def build_user_query(username: str, role: str) -> tuple: ...",
+  "filename": "src/db/queries.py"
+}
+```
+**Response:**
+```json
+{
+  "total_reward": 0.847,
   "scores": {
     "correctness": 1.0,
     "attack_resist": 0.875,
+    "static_security": 0.9,
     "consistency": 1.0,
+    "performance": 0.72,
+    "documentation": 0.75,
+    "code_structure": 0.8
   },
+  "feedback": { "summary": "🟡 Good submission — improve: performance" },
+  "codegraph": { ... },
   "done": false,
+  "step_count": 1
 }
 ```
+### `GET /state?session_id=<id>`
+Get current episode state without advancing.
+### `GET /health`
+Returns `{"status": "ok", "env": "SecureCodeEnv", "version": "2.0.0", "tasks_loaded": 9}`
 ---
+## Setup (Local)
 ```bash
+git clone https://huggingface.co/spaces/vishaldhakad/SecureCodeEnv
+cd SecureCodeEnv
+# Docker (recommended)
+docker build -t secure-code-env .
+docker run -p 7860:7860 secure-code-env
+# Or direct
+pip install -r requirements.txt
+uvicorn app.main:app --host 0.0.0.0 --port 7860
 ```
+## Run Baseline Inference
+```bash
+export API_BASE_URL=https://api.openai.com/v1
+export MODEL_NAME=gpt-4o-mini
+export HF_TOKEN=hf_your_token
+export ENV_URL=http://localhost:7860
+python inference.py
+```
+## Validate Before Submit
+```bash
+python validate.py --url http://localhost:7860
+```
 ---
+## Environment Variables
+| Variable | Required | Description |
+|---|---|---|
+| `API_BASE_URL` | Yes | LLM API endpoint (OpenAI-compatible) |
+| `MODEL_NAME` | Yes | Model identifier (e.g. `gpt-4o-mini`) |
+| `HF_TOKEN` | Yes | HuggingFace token |
+| `ENV_URL` | No | Override environment URL (default: localhost:7860) |
 ---
+*SecureCodeEnv v2.0 · Meta × PyTorch OpenEnv Hackathon 2026 · Vishal Dhakad*

app/dashboard.py CHANGED

@@ -1,672 +1,915 @@
-"""
-SecureCodeEnv - HTML Dashboard
-Served at GET / — this is what judges and users see on HuggingFace Spaces.
-"""
-DASHBOARD_HTML = '''<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
-<meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>SecureCodeEnv — RL Environment for Secure Code Generation</title>
 <link rel="preconnect" href="https://fonts.googleapis.com">
-<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;700&family=Syne:wght@400;700;800&display=swap" rel="stylesheet">
 <style>
-  :root {
-    --bg: #090c10;
-    --surface: #0d1117;
-    --surface2: #161b22;
-    --border: #21262d;
-    --accent: #f0883e;
-    --accent2: #79c0ff;
-    --accent3: #56d364;
-    --danger: #ff7b72;
-    --text: #e6edf3;
-    --muted: #8b949e;
-    --mono: 'JetBrains Mono', monospace;
-    --sans: 'Syne', sans-serif;
-  }
-  * { box-sizing: border-box; margin: 0; padding: 0; }
-  body {
-    background: var(--bg);
-    color: var(--text);
-    font-family: var(--sans);
-    min-height: 100vh;
-    overflow-x: hidden;
-  }
-  /* ── Grid noise texture ── */
-  body::before {
-    content: '';
-    position: fixed;
-    inset: 0;
-    background-image:
-      linear-gradient(rgba(240,136,62,.03) 1px, transparent 1px),
-      linear-gradient(90deg, rgba(240,136,62,.03) 1px, transparent 1px);
-    background-size: 40px 40px;
-    pointer-events: none;
-    z-index: 0;
-  }
-  .wrap { position: relative; z-index: 1; max-width: 1100px; margin: 0 auto; padding: 0 24px; }
-  /* ── Header ── */
-  header {
-    border-bottom: 1px solid var(--border);
-    padding: 18px 0;
-    position: sticky;
-    top: 0;
-    background: rgba(9,12,16,.92);
-    backdrop-filter: blur(12px);
-    z-index: 100;
-  }
-  .header-inner {
-    display: flex;
-    align-items: center;
-    justify-content: space-between;
-    gap: 16px;
-  }
-  .logo {
-    display: flex;
-    align-items: center;
-    gap: 10px;
-    font-family: var(--mono);
-    font-weight: 700;
-    font-size: 15px;
-    color: var(--accent);
-    letter-spacing: -.3px;
-  }
-  .logo-icon {
-    width: 28px; height: 28px;
-    background: var(--accent);
-    border-radius: 6px;
-    display: grid;
-    place-items: center;
-    font-size: 14px;
-  }
-  .badge {
-    font-family: var(--mono);
-    font-size: 10px;
-    padding: 3px 8px;
-    border-radius: 99px;
-    border: 1px solid;
-    letter-spacing: .5px;
-    text-transform: uppercase;
-  }
-  .badge-orange { color: var(--accent); border-color: rgba(240,136,62,.3); background: rgba(240,136,62,.07); }
-  .badge-blue   { color: var(--accent2); border-color: rgba(121,192,255,.3); background: rgba(121,192,255,.07); }
-  .badge-green  { color: var(--accent3); border-color: rgba(86,211,100,.3); background: rgba(86,211,100,.07); }
-  .badge-red    { color: var(--danger); border-color: rgba(255,123,114,.3); background: rgba(255,123,114,.07); }
-  .header-badges { display: flex; gap: 8px; flex-wrap: wrap; }
-  /* ── Hero ── */
-  .hero {
-    padding: 72px 0 56px;
-    position: relative;
-  }
-  .hero-eyebrow {
-    font-family: var(--mono);
-    font-size: 11px;
-    color: var(--accent);
-    letter-spacing: 2px;
-    text-transform: uppercase;
-    margin-bottom: 20px;
-    display: flex;
-    align-items: center;
-    gap: 10px;
-  }
-  .hero-eyebrow::before {
-    content: '';
-    display: block;
-    width: 24px; height: 1px;
-    background: var(--accent);
-  }
-  h1 {
-    font-size: clamp(36px, 6vw, 64px);
-    font-weight: 800;
-    line-height: 1.05;
-    letter-spacing: -2px;
-    margin-bottom: 24px;
-  }
-  h1 em { font-style: normal; color: var(--accent); }
-  .hero-desc {
-    font-size: 17px;
-    color: var(--muted);
-    max-width: 600px;
-    line-height: 1.7;
-    margin-bottom: 36px;
-  }
-  .hero-actions { display: flex; gap: 12px; flex-wrap: wrap; }
-  .btn {
-    font-family: var(--mono);
-    font-size: 13px;
-    font-weight: 700;
-    padding: 11px 22px;
-    border-radius: 7px;
-    text-decoration: none;
-    transition: all .15s;
-    cursor: pointer;
-    border: none;
-    display: inline-flex;
-    align-items: center;
-    gap: 8px;
-  }
-  .btn-primary {
-    background: var(--accent);
-    color: #000;
-  }
-  .btn-primary:hover { background: #ffaa5e; transform: translateY(-1px); }
-  .btn-ghost {
-    background: transparent;
-    color: var(--text);
-    border: 1px solid var(--border);
-  }
-  .btn-ghost:hover { border-color: var(--accent2); color: var(--accent2); }
-  /* ── Stats row ── */
-  .stats {
-    display: grid;
-    grid-template-columns: repeat(4, 1fr);
-    gap: 1px;
-    background: var(--border);
-    border: 1px solid var(--border);
-    border-radius: 10px;
-    overflow: hidden;
-    margin-bottom: 64px;
-  }
-  .stat {
-    background: var(--surface);
-    padding: 24px 28px;
-    position: relative;
-    overflow: hidden;
-  }
-  .stat::after {
-    content: attr(data-icon);
-    position: absolute;
-    right: 16px;
-    top: 50%;
-    transform: translateY(-50%);
-    font-size: 28px;
-    opacity: .15;
-  }
-  .stat-val {
-    font-family: var(--mono);
-    font-size: 32px;
-    font-weight: 700;
-    color: var(--accent);
-    line-height: 1;
-    margin-bottom: 6px;
-  }
-  .stat-label { font-size: 12px; color: var(--muted); letter-spacing: .3px; }
-  /* ── Sections ── */
-  section { margin-bottom: 64px; }
-  .section-title {
-    font-size: 11px;
-    font-family: var(--mono);
-    color: var(--muted);
-    letter-spacing: 2px;
-    text-transform: uppercase;
-    margin-bottom: 24px;
-    display: flex;
-    align-items: center;
-    gap: 12px;
-  }
-  .section-title::after {
-    content: '';
-    flex: 1;
-    height: 1px;
-    background: var(--border);
-  }
-  /* ── Reward grid ── */
-  .reward-grid {
-    display: grid;
-    grid-template-columns: repeat(auto-fill, minmax(220px, 1fr));
-    gap: 12px;
-  }
-  .reward-card {
-    background: var(--surface);
-    border: 1px solid var(--border);
-    border-radius: 10px;
-    padding: 18px 20px;
-    transition: border-color .2s;
-    animation: fadeUp .5s ease both;
-  }
-  .reward-card:hover { border-color: var(--accent); }
-  .reward-card:nth-child(1) { animation-delay: .05s; }
-  .reward-card:nth-child(2) { animation-delay: .10s; }
-  .reward-card:nth-child(3) { animation-delay: .15s; }
-  .reward-card:nth-child(4) { animation-delay: .20s; }
-  .reward-card:nth-child(5) { animation-delay: .25s; }
-  .reward-card:nth-child(6) { animation-delay: .30s; }
-  .reward-card:nth-child(7) { animation-delay: .35s; }
-  .rc-header { display: flex; justify-content: space-between; align-items: flex-start; margin-bottom: 14px; }
-  .rc-name { font-size: 13px; font-weight: 700; }
-  .rc-weight { font-family: var(--mono); font-size: 20px; font-weight: 700; color: var(--accent); }
-  .rc-bar-bg { height: 3px; background: var(--border); border-radius: 99px; }
-  .rc-bar { height: 3px; border-radius: 99px; background: var(--accent); transition: width 1s ease; }
-  .rc-desc { font-size: 11px; color: var(--muted); margin-top: 10px; line-height: 1.5; }
-  /* ── Tasks table ── */
-  .tasks-grid {
-    display: grid;
-    grid-template-columns: repeat(3, 1fr);
-    gap: 12px;
-  }
-  @media (max-width: 768px) { .tasks-grid { grid-template-columns: 1fr; } }
-  .diff-col {}
-  .diff-label {
-    font-family: var(--mono);
-    font-size: 11px;
-    letter-spacing: 1.5px;
-    text-transform: uppercase;
-    padding: 6px 12px;
-    border-radius: 6px;
-    display: inline-block;
-    margin-bottom: 12px;
-  }
-  .diff-easy   { background: rgba(86,211,100,.1); color: var(--accent3); }
-  .diff-medium { background: rgba(240,136,62,.1);  color: var(--accent); }
-  .diff-hard   { background: rgba(255,123,114,.1); color: var(--danger); }
-  .task-item {
-    background: var(--surface);
-    border: 1px solid var(--border);
-    border-radius: 8px;
-    padding: 14px 16px;
-    margin-bottom: 8px;
-    font-size: 13px;
-  }
-  .task-name { font-weight: 700; margin-bottom: 4px; }
-  .task-cwes { display: flex; gap: 4px; flex-wrap: wrap; margin-top: 8px; }
-  .cwe-tag {
-    font-family: var(--mono);
-    font-size: 10px;
-    padding: 2px 7px;
-    border-radius: 4px;
-    background: rgba(121,192,255,.08);
-    color: var(--accent2);
-    border: 1px solid rgba(121,192,255,.2);
-  }
-  /* ── Code block ── */
-  .code-block {
-    background: var(--surface);
-    border: 1px solid var(--border);
-    border-radius: 10px;
-    overflow: hidden;
-  }
-  .code-header {
-    display: flex;
-    align-items: center;
-    justify-content: space-between;
-    padding: 10px 16px;
-    border-bottom: 1px solid var(--border);
-    background: var(--surface2);
-  }
-  .code-dots { display: flex; gap: 6px; }
-  .code-dots span { width: 10px; height: 10px; border-radius: 50%; }
-  .code-dots span:nth-child(1) { background: #ff5f57; }
-  .code-dots span:nth-child(2) { background: #febc2e; }
-  .code-dots span:nth-child(3) { background: #28c840; }
-  .code-filename { font-family: var(--mono); font-size: 11px; color: var(--muted); }
-  pre {
-    font-family: var(--mono);
-    font-size: 12px;
-    line-height: 1.7;
-    padding: 20px;
-    overflow-x: auto;
-    color: var(--text);
-  }
-  .kw  { color: #ff7b72; }
-  .fn  { color: #d2a8ff; }
-  .str { color: #a5d6ff; }
-  .cm  { color: var(--muted); font-style: italic; }
-  .num { color: var(--accent3); }
-  .op  { color: var(--accent); }
-  /* ── Live status ── */
-  .status-bar {
-    background: var(--surface);
-    border: 1px solid var(--border);
-    border-radius: 10px;
-    padding: 20px 24px;
-    display: flex;
-    align-items: center;
-    justify-content: space-between;
-    gap: 16px;
-    flex-wrap: wrap;
-  }
-  .status-dot {
-    width: 8px; height: 8px;
-    border-radius: 50%;
-    background: var(--accent3);
-    box-shadow: 0 0 8px var(--accent3);
-    animation: pulse 2s ease infinite;
-  }
-  .status-left { display: flex; align-items: center; gap: 10px; font-size: 14px; font-weight: 700; }
-  .status-endpoints { display: flex; gap: 8px; flex-wrap: wrap; }
-  .ep {
-    font-family: var(--mono);
-    font-size: 11px;
-    padding: 4px 10px;
-    border-radius: 5px;
-    background: var(--surface2);
-    border: 1px solid var(--border);
-    color: var(--muted);
-    display: flex;
-    gap: 6px;
-    align-items: center;
-  }
-  .ep-method { font-weight: 700; }
-  .ep-method.post { color: var(--accent3); }
-  .ep-method.get  { color: var(--accent2); }
-  /* ── Footer ── */
-  footer {
-    border-top: 1px solid var(--border);
-    padding: 28px 0;
-    margin-top: 32px;
-    display: flex;
-    justify-content: space-between;
-    align-items: center;
-    flex-wrap: wrap;
-    gap: 12px;
-  }
-  .footer-text { font-family: var(--mono); font-size: 11px; color: var(--muted); }
-  .footer-text a { color: var(--accent2); text-decoration: none; }
-  /* ── Animations ── */
-  @keyframes fadeUp {
-    from { opacity: 0; transform: translateY(16px); }
-    to   { opacity: 1; transform: translateY(0); }
-  }
-  @keyframes pulse {
-    0%, 100% { opacity: 1; }
-    50% { opacity: .4; }
-  }
-  .hero { animation: fadeUp .6s ease both; }
-  .stats { animation: fadeUp .6s ease .1s both; }
-  @media (max-width: 640px) {
-    .stats { grid-template-columns: repeat(2, 1fr); }
-    h1 { letter-spacing: -1px; }
-    .header-badges { display: none; }
-  }
 </style>
 </head>
 <body>
 <!-- HEADER -->
 <header>
-  <div class="wrap">
-    <div class="header-inner">
-      <div class="logo">
-        <div class="logo-icon">🔒</div>
-        SecureCodeEnv
-      </div>
-      <div class="header-badges">
-        <span class="badge badge-orange">v2.0.0</span>
-        <span class="badge badge-blue">OpenEnv</span>
-        <span class="badge badge-green">Live</span>
-        <span class="badge badge-red">Meta × PyTorch Hackathon</span>
-      </div>
-    </div>
   </div>
-</header>
-<!-- HERO -->
-<div class="wrap">
-  <div class="hero">
-    <div class="hero-eyebrow">RL Environment for Secure Code Generation</div>
-    <h1>Train LLMs to write<br><em>secure</em> Python code.</h1>
-    <p class="hero-desc">
-      SecureCodeEnv is a reinforcement learning environment that goes beyond correctness.
-      Agents are graded on attack resistance, CWE-based static analysis, codebase consistency
-      via CodeGraph, and performance — all automated, all deterministic.
-    </p>
-    <div class="hero-actions">
-      <a href="/docs" class="btn btn-primary">⚡ API Docs</a>
-      <a href="/health" class="btn btn-ghost">GET /health</a>
-      <a href="https://huggingface.co/spaces/vishaldhakad/SecureCodeEnv" class="btn btn-ghost" target="_blank">HF Space ↗</a>
-    </div>
   </div>
-  <!-- STATS -->
-  <div class="stats">
-    <div class="stat" data-icon="📋">
-      <div class="stat-val">9</div>
-      <div class="stat-label">Security Tasks</div>
-    </div>
-    <div class="stat" data-icon="⚖️">
-      <div class="stat-val">7</div>
-      <div class="stat-label">Reward Dimensions</div>
-    </div>
-    <div class="stat" data-icon="🎯">
-      <div class="stat-val">12+</div>
-      <div class="stat-label">CWE IDs Covered</div>
-    </div>
-    <div class="stat" data-icon="🔥">
-      <div class="stat-val">0%</div>
-      <div class="stat-label">Infrastructure Cost</div>
-    </div>
   </div>
-  <!-- LIVE STATUS -->
-  <section>
-    <div class="section-title">Live Environment</div>
-    <div class="status-bar">
-      <div class="status-left">
-        <div class="status-dot"></div>
-        Environment running · SecureCodeEnv v2.0.0
-      </div>
-      <div class="status-endpoints">
-        <div class="ep"><span class="ep-method post">POST</span>/reset</div>
-        <div class="ep"><span class="ep-method post">POST</span>/step</div>
-        <div class="ep"><span class="ep-method get">GET</span>/state</div>
-        <div class="ep"><span class="ep-method get">GET</span>/health</div>
-        <div class="ep"><span class="ep-method get">GET</span>/docs</div>
-      </div>
-    </div>
-  </section>
-  <!-- REWARD DIMENSIONS -->
-  <section>
-    <div class="section-title">Reward System — 7 Dimensions</div>
-    <div class="reward-grid">
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Correctness</div>
-          <div class="rc-weight">30%</div>
-        </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:100%"></div></div>
-        <div class="rc-desc">Test cases passed including edge cases, None inputs, boundary values</div>
-      </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Attack Resistance</div>
-          <div class="rc-weight">20%</div>
-        </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:67%"></div></div>
-        <div class="rc-desc">Randomized SQLi, traversal, JWT bypass, XSS payloads fired each episode</div>
-      </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Static Security</div>
-          <div class="rc-weight">15%</div>
-        </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:50%"></div></div>
-        <div class="rc-desc">bandit + AST checks mapped to real CWE IDs</div>
-      </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">CodeGraph</div>
-          <div class="rc-weight">15%</div>
         </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:50%"></div></div>
-        <div class="rc-desc">Consistency with existing codebase conventions across the episode</div>
-      </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Performance</div>
-          <div class="rc-weight">10%</div>
         </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:33%"></div></div>
-        <div class="rc-desc">timeit + tracemalloc scored relative to naive/optimal baselines</div>
       </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Documentation</div>
-          <div class="rc-weight">5%</div>
         </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:17%"></div></div>
-        <div class="rc-desc">Docstring + type hint coverage across all submitted functions</div>
-      </div>
-      <div class="reward-card">
-        <div class="rc-header">
-          <div class="rc-name">Code Structure</div>
-          <div class="rc-weight">5%</div>
         </div>
-        <div class="rc-bar-bg"><div class="rc-bar" style="width:17%"></div></div>
-        <div class="rc-desc">No bare print, no bare except, reasonable function size</div>
       </div>
     </div>
-  </section>
-  <!-- TASKS -->
-  <section>
-    <div class="section-title">9 Tasks · 3 Difficulty Levels</div>
-    <div class="tasks-grid">
-      <div class="diff-col">
-        <div class="diff-label diff-easy">Easy</div>
-        <div class="task-item">
-          <div class="task-name">Password Validator</div>
-          <div style="font-size:11px;color:var(--muted)">bcrypt hashing, strength rules</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-916</span><span class="cwe-tag">CWE-521</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">Input Sanitizer</div>
-          <div style="font-size:11px;color:var(--muted)">HTML escape, filename safety</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-20</span><span class="cwe-tag">CWE-116</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">Token Generator</div>
-          <div style="font-size:11px;color:var(--muted)">secrets module, CSPRNG</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-338</span><span class="cwe-tag">CWE-330</span></div>
         </div>
       </div>
-      <div class="diff-col">
-        <div class="diff-label diff-medium">Medium</div>
-        <div class="task-item">
-          <div class="task-name">SQL Query Builder</div>
-          <div style="font-size:11px;color:var(--muted)">Parameterized queries only</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-89</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">File Path Handler</div>
-          <div style="font-size:11px;color:var(--muted)">Path traversal prevention</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-22</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">Rate Limiter</div>
-          <div style="font-size:11px;color:var(--muted)">Thread-safe sliding window</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-770</span><span class="cwe-tag">CWE-400</span></div>
         </div>
       </div>
-      <div class="diff-col">
-        <div class="diff-label diff-hard">Hard</div>
-        <div class="task-item">
-          <div class="task-name">File Upload Handler</div>
-          <div style="font-size:11px;color:var(--muted)">MIME check, ext block, UUID path</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-22</span><span class="cwe-tag">CWE-434</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">JWT Validator</div>
-          <div style="font-size:11px;color:var(--muted)">alg:none blocked, expiry enforced</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-347</span><span class="cwe-tag">CWE-613</span></div>
-        </div>
-        <div class="task-item">
-          <div class="task-name">Auth Middleware</div>
-          <div style="font-size:11px;color:var(--muted)">CSRF + timing-safe Bearer auth</div>
-          <div class="task-cwes"><span class="cwe-tag">CWE-287</span><span class="cwe-tag">CWE-352</span></div>
         </div>
       </div>
-    </div>
-  </section>
-  <!-- QUICKSTART CODE -->
-  <section>
-    <div class="section-title">Quick Start</div>
-    <div class="code-block">
-      <div class="code-header">
-        <div class="code-dots"><span></span><span></span><span></span></div>
-        <div class="code-filename">quickstart.py</div>
-        <span class="badge badge-blue">Python</span>
       </div>
-      <pre><span class="kw">import</span> requests
-ENV_URL <span class="op">=</span> <span class="str">"https://vishaldhakad-securecodeenv.hf.space"</span>
-<span class="cm"># 1. Start episode</span>
-episode <span class="op">=</span> requests.<span class="fn">post</span>(<span class="str">f"{ENV_URL}/reset"</span>, json<span class="op">=</span>{<span class="str">"difficulty"</span>: <span class="str">"medium"</span>}).<span class="fn">json</span>()
-sid <span class="op">=</span> episode[<span class="str">"session_id"</span>]
-<span class="kw">print</span>(episode[<span class="str">"problem_statement"</span>])
-<span class="cm"># 2. Submit code — gets graded across 7 dimensions</span>
-result <span class="op">=</span> requests.<span class="fn">post</span>(<span class="str">f"{ENV_URL}/step"</span>, json<span class="op">=</span>{
-    <span class="str">"session_id"</span>: sid,
-    <span class="str">"code"</span>: <span class="str">"def build_user_query(u, r): return ('SELECT * FROM users WHERE username=%s', (u,))"</span>,
-    <span class="str">"filename"</span>: <span class="str">"solution.py"</span>,
-}).<span class="fn">json</span>()
-<span class="kw">print</span>(<span class="str">f"reward={result['total_reward']:.3f}"</span>)
-<span class="kw">print</span>(<span class="str">f"scores={result['scores']}"</span>)
-<span class="kw">print</span>(result[<span class="str">'feedback'</span>][<span class="str">'summary'</span>])</pre>
     </div>
-  </section>
-  <!-- FOOTER -->
-  <footer class="wrap" style="max-width:unset;padding:0">
-    <div class="footer-text">
-      SecureCodeEnv v2.0 · Built by <a href="https://huggingface.co/vishaldhakad" target="_blank">Vishal Dhakad</a>
     </div>
-    <div class="footer-text">
-      Meta × PyTorch <a href="https://www.scaler.com/school-of-technology/meta-pytorch-hackathon" target="_blank">OpenEnv Hackathon 2026</a>
     </div>
-  </footer>
 </div>
 <script>
-  // Animate reward bars on load
-  document.addEventListener('DOMContentLoaded', () => {
-    const bars = document.querySelectorAll('.rc-bar');
-    bars.forEach(b => {
-      const w = b.style.width;
-      b.style.width = '0';
-      setTimeout(() => { b.style.width = w; }, 300);
     });
-  });
-  // Live health ping — updates status dot
-  async function checkHealth() {
-    try {
-      const r = await fetch('/health');
-      const d = await r.json();
-      const dot = document.querySelector('.status-dot');
-      const label = document.querySelector('.status-left');
-      if (r.ok) {
-        dot.style.background = 'var(--accent3)';
-        dot.style.boxShadow = '0 0 8px var(--accent3)';
-        label.childNodes[1].textContent = ` Environment running · ${d.env} v${d.version} · ${d.tasks_loaded} tasks loaded`;
       }
-    } catch(e) {}
-  }
-  checkHealth();
 </script>
 </body>
-</html>'''

+"""SecureCodeEnv - Interactive HTML Dashboard"""
+DASHBOARD_HTML = r"""<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
+<meta name="viewport" content="width=device-width,initial-scale=1.0">
+<title>SecureCodeEnv — RL Playground</title>
 <link rel="preconnect" href="https://fonts.googleapis.com">
+<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:ital,wght@0,400;0,700;1,400&family=Syne:wght@500;700;800&display=swap" rel="stylesheet">
 <style>
+:root{
+  --bg:#07090d;--surface:#0d1117;--s2:#161b22;--s3:#21262d;
+  --border:#30363d;--accent:#f0883e;--a2:#79c0ff;--a3:#56d364;
+  --danger:#ff7b72;--warn:#e3b341;--text:#e6edf3;--muted:#8b949e;
+  --mono:'JetBrains Mono',monospace;--sans:'Syne',sans-serif;
+  --radius:8px;
+}
+*{box-sizing:border-box;margin:0;padding:0}
+html,body{height:100%;background:var(--bg);color:var(--text);font-family:var(--sans)}
+body{display:flex;flex-direction:column;min-height:100vh}
+/* grid bg */
+body::before{content:'';position:fixed;inset:0;
+  background-image:linear-gradient(rgba(240,136,62,.025) 1px,transparent 1px),
+  linear-gradient(90deg,rgba(240,136,62,.025) 1px,transparent 1px);
+  background-size:48px 48px;pointer-events:none;z-index:0}
+/* ── header ── */
+header{position:sticky;top:0;z-index:200;background:rgba(7,9,13,.88);
+  backdrop-filter:blur(12px);border-bottom:1px solid var(--border);
+  padding:0 24px;height:52px;display:flex;align-items:center;justify-content:space-between;gap:16px}
+.hlogo{display:flex;align-items:center;gap:10px;font-family:var(--mono);font-weight:700;font-size:14px;color:var(--accent)}
+.hlogo-icon{width:26px;height:26px;background:var(--accent);border-radius:5px;display:grid;place-items:center;font-size:13px;color:#000}
+.hbadges{display:flex;gap:6px;flex-wrap:wrap}
+.badge{font-family:var(--mono);font-size:10px;padding:2px 8px;border-radius:99px;border:1px solid;letter-spacing:.4px}
+.bo{color:var(--accent);border-color:rgba(240,136,62,.3);background:rgba(240,136,62,.07)}
+.bb{color:var(--a2);border-color:rgba(121,192,255,.3);background:rgba(121,192,255,.07)}
+.bg{color:var(--a3);border-color:rgba(86,211,100,.3);background:rgba(86,211,100,.07)}
+.br{color:var(--danger);border-color:rgba(255,123,114,.3);background:rgba(255,123,114,.07)}
+.hstatus{display:flex;align-items:center;gap:8px;font-size:12px;font-family:var(--mono)}
+.dot{width:7px;height:7px;border-radius:50%;background:var(--a3);box-shadow:0 0 6px var(--a3)}
+.dot.red{background:var(--danger);box-shadow:0 0 6px var(--danger)}
+.dot.pulse{animation:pulse 2s ease infinite}
+@keyframes pulse{0%,100%{opacity:1}50%{opacity:.35}}
+/* ── nav tabs ── */
+.nav{display:flex;border-bottom:1px solid var(--border);background:var(--surface);
+  padding:0 24px;gap:2px;position:sticky;top:52px;z-index:100}
+.ntab{font-family:var(--mono);font-size:12px;padding:10px 16px;cursor:pointer;
+  border-bottom:2px solid transparent;color:var(--muted);transition:.15s;
+  background:none;border-top:none;border-left:none;border-right:none}
+.ntab:hover{color:var(--text)}
+.ntab.active{color:var(--accent);border-bottom-color:var(--accent)}
+/* ── main layout ── */
+.main{position:relative;z-index:1;flex:1;padding:24px;max-width:1200px;margin:0 auto;width:100%}
+.panel{display:none}
+.panel.active{display:block}
+/* ── playground layout ── */
+.playground{display:grid;grid-template-columns:1fr 400px;gap:16px;height:calc(100vh - 160px)}
+@media(max-width:900px){.playground{grid-template-columns:1fr;height:auto}}
+/* ── left pane ── */
+.left-pane{display:flex;flex-direction:column;gap:12px;min-height:0}
+.card{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);overflow:hidden}
+.card-header{display:flex;align-items:center;justify-content:space-between;
+  padding:10px 14px;border-bottom:1px solid var(--border);background:var(--s2)}
+.card-title{font-size:11px;font-family:var(--mono);color:var(--muted);letter-spacing:1px;text-transform:uppercase}
+.card-body{padding:14px}
+/* ── controls ── */
+.controls-row{display:flex;gap:8px;flex-wrap:wrap;align-items:center}
+select,input[type=text]{font-family:var(--mono);font-size:12px;background:var(--s2);
+  border:1px solid var(--border);color:var(--text);border-radius:5px;padding:7px 10px;
+  outline:none;transition:border-color .15s}
+select:focus,input:focus{border-color:var(--accent)}
+.btn{font-family:var(--mono);font-size:12px;font-weight:700;padding:7px 16px;
+  border-radius:5px;border:none;cursor:pointer;transition:all .12s;display:inline-flex;align-items:center;gap:6px}
+.btn-primary{background:var(--accent);color:#000}
+.btn-primary:hover{background:#ffaa5e;transform:translateY(-1px)}
+.btn-primary:disabled{background:var(--s3);color:var(--muted);cursor:not-allowed;transform:none}
+.btn-ghost{background:transparent;color:var(--text);border:1px solid var(--border)}
+.btn-ghost:hover{border-color:var(--a2);color:var(--a2)}
+.btn-green{background:var(--a3);color:#000}
+.btn-green:hover{background:#6fe87a}
+.btn-green:disabled{background:var(--s3);color:var(--muted);cursor:not-allowed}
+.btn-danger{background:transparent;color:var(--danger);border:1px solid rgba(255,123,114,.3)}
+.btn-danger:hover{background:rgba(255,123,114,.1)}
+/* ── task display ── */
+.task-box{background:var(--s2);border:1px solid var(--border);border-radius:6px;padding:14px;
+  font-size:13px;line-height:1.7;color:var(--text);white-space:pre-wrap;max-height:180px;
+  overflow-y:auto;font-family:var(--mono)}
+.task-meta{display:flex;gap:8px;flex-wrap:wrap;margin-bottom:8px}
+.cwe{font-family:var(--mono);font-size:10px;padding:2px 7px;border-radius:4px;
+  background:rgba(121,192,255,.08);color:var(--a2);border:1px solid rgba(121,192,255,.2)}
+.diff-tag{font-family:var(--mono);font-size:10px;padding:2px 7px;border-radius:4px}
+.easy{background:rgba(86,211,100,.1);color:var(--a3)}
+.medium{background:rgba(240,136,62,.1);color:var(--accent)}
+.hard{background:rgba(255,123,114,.1);color:var(--danger)}
+/* ── code editor ── */
+.editor-wrap{flex:1;display:flex;flex-direction:column;min-height:0}
+.editor-header{display:flex;align-items:center;justify-content:space-between;
+  padding:8px 14px;background:var(--s2);border-bottom:1px solid var(--border)}
+.editor-dots{display:flex;gap:5px}
+.editor-dots span{width:9px;height:9px;border-radius:50%}
+.editor-dots span:nth-child(1){background:#ff5f57}
+.editor-dots span:nth-child(2){background:#febc2e}
+.editor-dots span:nth-child(3){background:#28c840}
+#code-editor{flex:1;width:100%;background:var(--s2);border:none;color:var(--text);
+  font-family:var(--mono);font-size:12px;line-height:1.65;padding:16px;
+  resize:none;outline:none;tab-size:4;min-height:280px}
+#code-editor::placeholder{color:var(--muted)}
+.editor-footer{padding:8px 14px;background:var(--s2);border-top:1px solid var(--border);
+  display:flex;justify-content:space-between;align-items:center;gap:8px}
+.char-count{font-family:var(--mono);font-size:10px;color:var(--muted)}
+/* ── right pane ── */
+.right-pane{display:flex;flex-direction:column;gap:12px;overflow-y:auto;max-height:calc(100vh - 160px)}
+@media(max-width:900px){.right-pane{max-height:none}}
+/* ── reward display ── */
+.reward-big{text-align:center;padding:20px 14px}
+.reward-number{font-family:var(--mono);font-size:52px;font-weight:700;line-height:1;
+  transition:all .4s ease}
+.reward-label{font-size:11px;color:var(--muted);font-family:var(--mono);margin-top:4px}
+.reward-bar-bg{height:6px;background:var(--s3);border-radius:99px;margin:12px 0}
+.reward-bar{height:6px;border-radius:99px;background:var(--accent);transition:width .6s ease;width:0%}
+/* ── score breakdown ── */
+.score-row{display:flex;align-items:center;gap:8px;padding:5px 0;
+  border-bottom:1px solid var(--border);font-size:12px}
+.score-row:last-child{border:none}
+.score-dim{flex:1;color:var(--muted);font-family:var(--mono)}
+.score-val{font-family:var(--mono);font-weight:700;min-width:38px;text-align:right}
+.score-bar-bg{width:60px;height:4px;background:var(--s3);border-radius:99px}
+.score-bar-fg{height:4px;border-radius:99px;transition:width .5s ease;background:var(--a3)}
+.weight-tag{font-size:9px;color:var(--s3);background:var(--border);
+  padding:1px 5px;border-radius:3px;font-family:var(--mono)}
+/* ── feedback ── */
+.fb-item{font-size:11px;font-family:var(--mono);padding:5px 8px;border-radius:5px;
+  background:var(--s2);border-left:3px solid var(--border);margin-bottom:4px;line-height:1.5}
+.fb-item.good{border-left-color:var(--a3)}
+.fb-item.warn{border-left-color:var(--warn)}
+.fb-item.bad{border-left-color:var(--danger)}
+/* ── history ── */
+.history-item{display:flex;align-items:center;gap:8px;padding:7px 10px;
+  border-bottom:1px solid var(--border);font-size:11px;font-family:var(--mono)}
+.history-item:last-child{border:none}
+.h-step{color:var(--muted);min-width:40px}
+.h-reward{font-weight:700;min-width:50px}
+.h-bar{flex:1;height:4px;background:var(--s3);border-radius:99px;position:relative}
+.h-bar-fg{height:4px;border-radius:99px;background:var(--a3);transition:width .4s}
+.h-done{color:var(--a3);font-size:10px}
+/* ── loading ── */
+.spinner{display:inline-block;width:14px;height:14px;border:2px solid rgba(255,255,255,.2);
+  border-top-color:var(--accent);border-radius:50%;animation:spin .6s linear infinite}
+@keyframes spin{to{transform:rotate(360deg)}}
+/* ── empty state ── */
+.empty{text-align:center;padding:40px 20px;color:var(--muted)}
+.empty-icon{font-size:32px;margin-bottom:12px;opacity:.5}
+.empty-text{font-size:13px;line-height:1.6}
+/* ── alerts ── */
+.alert{padding:10px 14px;border-radius:6px;font-size:12px;font-family:var(--mono);
+  margin-bottom:8px;display:flex;gap:8px;align-items:flex-start}
+.alert-error{background:rgba(255,123,114,.1);border:1px solid rgba(255,123,114,.3);color:var(--danger)}
+.alert-success{background:rgba(86,211,100,.1);border:1px solid rgba(86,211,100,.3);color:var(--a3)}
+.alert-info{background:rgba(121,192,255,.1);border:1px solid rgba(121,192,255,.3);color:var(--a2)}
+/* ── overview panel ── */
+.grid-2{display:grid;grid-template-columns:repeat(auto-fill,minmax(260px,1fr));gap:12px}
+.stat-card{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);
+  padding:20px;display:flex;flex-direction:column;gap:6px}
+.stat-val{font-family:var(--mono);font-size:36px;font-weight:700;color:var(--accent)}
+.stat-label{font-size:12px;color:var(--muted)}
+.section-label{font-family:var(--mono);font-size:10px;color:var(--muted);letter-spacing:2px;
+  text-transform:uppercase;padding:16px 0 8px;border-bottom:1px solid var(--border);margin-bottom:12px}
+/* ── task list ── */
+.task-list-item{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);
+  padding:14px 16px;cursor:pointer;transition:border-color .15s;margin-bottom:8px}
+.task-list-item:hover{border-color:var(--accent)}
+.tli-header{display:flex;align-items:center;justify-content:space-between;margin-bottom:6px}
+.tli-name{font-weight:700;font-size:14px}
+.tli-desc{font-size:12px;color:var(--muted);line-height:1.5}
+.tli-footer{display:flex;gap:6px;margin-top:10px;flex-wrap:wrap}
+/* ── docs panel ── */
+.docs-card{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);
+  padding:20px;margin-bottom:12px}
+.docs-h2{font-size:16px;font-weight:700;margin-bottom:8px;color:var(--text)}
+.docs-p{font-size:13px;color:var(--muted);line-height:1.7;margin-bottom:12px}
+.docs-code{background:var(--s2);border:1px solid var(--border);border-radius:6px;
+  padding:14px;font-family:var(--mono);font-size:12px;line-height:1.65;
+  overflow-x:auto;margin-bottom:12px;white-space:pre}
+.method{font-weight:700;font-size:11px;padding:2px 7px;border-radius:4px;font-family:var(--mono)}
+.method.post{background:rgba(86,211,100,.15);color:var(--a3)}
+.method.get{background:rgba(121,192,255,.15);color:var(--a2)}
+.ep-row{display:flex;align-items:flex-start;gap:12px;padding:10px 0;
+  border-bottom:1px solid var(--border);font-size:13px}
+.ep-row:last-child{border:none}
+.ep-path{font-family:var(--mono);color:var(--text);font-weight:700;min-width:180px}
+.ep-desc{color:var(--muted);line-height:1.5}
+/* ── reward weight chart ── */
+.weight-bar-row{display:flex;align-items:center;gap:10px;padding:6px 0;font-size:12px}
+.wbr-name{flex:0 0 140px;font-family:var(--mono);color:var(--muted)}
+.wbr-bg{flex:1;height:8px;background:var(--s3);border-radius:99px}
+.wbr-fg{height:8px;border-radius:99px;background:var(--accent);transition:width .8s ease;width:0%}
+.wbr-val{font-family:var(--mono);font-weight:700;color:var(--accent);min-width:36px;text-align:right}
+/* scrollbar */
+::-webkit-scrollbar{width:6px;height:6px}
+::-webkit-scrollbar-track{background:var(--bg)}
+::-webkit-scrollbar-thumb{background:var(--border);border-radius:3px}
 </style>
 </head>
 <body>
 <!-- HEADER -->
 <header>
+  <div class="hlogo">
+    <div class="hlogo-icon">🔒</div>
+    SecureCodeEnv
   </div>
+  <div class="hbadges">
+    <span class="badge bo">v2.0.0</span>
+    <span class="badge bb">OpenEnv</span>
+    <span class="badge br">Meta × PyTorch Hackathon</span>
   </div>
+  <div class="hstatus">
+    <div class="dot pulse" id="status-dot"></div>
+    <span id="status-text" style="font-size:11px"></span>
   </div>
+</header>
+<!-- NAV -->
+<nav class="nav">
+  <button class="ntab active" onclick="showPanel('playground', this)">⚡ Playground</button>
+  <button class="ntab" onclick="showPanel('overview', this)">📊 Overview</button>
+  <button class="ntab" onclick="showPanel('tasks', this)">📋 Tasks</button>
+  <button class="ntab" onclick="showPanel('docs', this)">📖 API Docs</button>
+</nav>
+<!-- ══════════════════════════════════════════════════ -->
+<!-- PLAYGROUND PANEL                                    -->
+<!-- ══════════════════════════════════════════════════ -->
+<div class="main">
+<div id="panel-playground" class="panel active">
+  <div class="playground">
+    <!-- LEFT: controls + task + editor -->
+    <div class="left-pane">
+      <!-- Episode controls -->
+      <div class="card">
+        <div class="card-header">
+          <span class="card-title">Episode Control</span>
+          <span id="session-badge" class="badge bb" style="display:none"></span>
         </div>
+        <div class="card-body">
+          <div id="alert-area"></div>
+          <div class="controls-row">
+            <select id="diff-select">
+              <option value="easy">Easy</option>
+              <option value="medium" selected>Medium</option>
+              <option value="hard">Hard</option>
+            </select>
+            <select id="task-select" style="flex:1">
+              <option value="">Random task</option>
+            </select>
+            <button class="btn btn-primary" id="btn-reset" onclick="doReset()">
+              <span id="reset-spinner" style="display:none" class="spinner"></span>
+              🔄 Reset
+            </button>
+          </div>
+          <div id="task-area" style="margin-top:12px;display:none">
+            <div class="task-meta" id="task-meta"></div>
+            <div class="task-box" id="task-box"></div>
+          </div>
         </div>
       </div>
+      <!-- Code editor -->
+      <div class="card editor-wrap">
+        <div class="editor-header">
+          <div class="editor-dots"><span></span><span></span><span></span></div>
+          <span style="font-family:var(--mono);font-size:11px;color:var(--muted)" id="editor-filename">solution.py</span>
+          <div style="display:flex;gap:6px">
+            <button class="btn btn-ghost" style="padding:4px 10px;font-size:11px" onclick="loadStarter()">Load starter</button>
+            <button class="btn btn-ghost" style="padding:4px 10px;font-size:11px" onclick="clearEditor()">Clear</button>
+          </div>
         </div>
+        <textarea id="code-editor" spellcheck="false"
+          placeholder="# Reset an episode first, then write your Python solution here...
+# Click 'Load starter' to get the buggy starter code to fix.
+def your_function():
+    pass"></textarea>
+        <div class="editor-footer">
+          <span class="char-count" id="char-count">0 chars</span>
+          <div style="display:flex;gap:8px">
+            <span id="step-counter" style="font-family:var(--mono);font-size:11px;color:var(--muted)">Step 0/5</span>
+            <button class="btn btn-green" id="btn-submit" onclick="doStep()" disabled>
+              <span id="submit-spinner" style="display:none" class="spinner"></span>
+              ▶ Submit
+            </button>
+          </div>
         </div>
       </div>
     </div>
+    <!-- RIGHT: rewards + feedback + history -->
+    <div class="right-pane">
+      <!-- Total reward -->
+      <div class="card">
+        <div class="card-header"><span class="card-title">Total Reward</span><span id="done-badge" style="display:none" class="badge bg">DONE ✓</span></div>
+        <div class="card-body">
+          <div class="reward-big">
+            <div class="reward-number" id="reward-number" style="color:var(--muted)">—</div>
+            <div class="reward-label">/ 1.000 maximum</div>
+          </div>
+          <div class="reward-bar-bg"><div class="reward-bar" id="reward-bar"></div></div>
+          <div id="summary-text" style="font-size:12px;font-family:var(--mono);color:var(--muted);text-align:center"></div>
         </div>
       </div>
+      <!-- Score breakdown -->
+      <div class="card">
+        <div class="card-header"><span class="card-title">Score Breakdown</span></div>
+        <div class="card-body" id="score-breakdown">
+          <div class="empty"><div class="empty-icon">📊</div><div class="empty-text">Submit code to see scores</div></div>
         </div>
       </div>
+      <!-- Feedback -->
+      <div class="card">
+        <div class="card-header"><span class="card-title">Feedback</span></div>
+        <div class="card-body" id="feedback-area">
+          <div class="empty"><div class="empty-icon">💬</div><div class="empty-text">Feedback will appear here</div></div>
         </div>
       </div>
+      <!-- Step history -->
+      <div class="card">
+        <div class="card-header"><span class="card-title">Episode History</span><span class="char-count" id="history-count">0 steps</span></div>
+        <div id="history-area">
+          <div class="empty" style="padding:20px"><div class="empty-text">No submissions yet</div></div>
+        </div>
       </div>
+    </div>
+  </div>
+</div>
+<!-- ══════════════════════════════════════════════════ -->
+<!-- OVERVIEW PANEL                                      -->
+<!-- ══════════════════════════════════════════════════ -->
+<div id="panel-overview" class="panel">
+  <div class="section-label">Environment Stats</div>
+  <div class="grid-2">
+    <div class="stat-card"><div class="stat-val">9</div><div class="stat-label">Security Tasks (3 per difficulty)</div></div>
+    <div class="stat-card"><div class="stat-val">7</div><div class="stat-label">Reward Dimensions</div></div>
+    <div class="stat-card"><div class="stat-val">12+</div><div class="stat-label">CWE IDs Covered</div></div>
+    <div class="stat-card"><div class="stat-val">$0</div><div class="stat-label">Infrastructure Cost (HF Spaces free tier)</div></div>
+  </div>
+  <div class="section-label" style="margin-top:24px">Reward Weights</div>
+  <div class="card"><div class="card-body" id="weight-chart"></div></div>
+  <div class="section-label" style="margin-top:24px">What Makes This Unique</div>
+  <div class="grid-2">
+    <div class="stat-card" style="gap:10px">
+      <div style="font-size:22px">⚔️</div>
+      <div style="font-weight:700">Dynamic Attack Grading</div>
+      <div class="stat-label">We actually FIRE SQL injection, path traversal, JWT bypass, and XSS payloads at your code — not just static pattern matching. Payloads are seeded-random per episode so agents can't memorise them.</div>
+    </div>
+    <div class="stat-card" style="gap:10px">
+      <div style="font-size:22px">🧠</div>
+      <div style="font-weight:700">CodeGraph Memory</div>
+      <div class="stat-label">The agent's codebase context grows across steps. Conventions (naming, error handling, type hints) are inferred from earlier submissions and enforced on later ones, so multi-file consistency is rewarded directly.</div>
     </div>
+    <div class="stat-card" style="gap:10px">
+      <div style="font-size:22px">🎯</div>
+      <div style="font-weight:700">CWE-Grounded Tasks</div>
+      <div class="stat-label">Every task maps to real Common Weakness Enumeration IDs. Grading is 100% automated and deterministic — no LLM judge, no subjectivity.</div>
+    </div>
+    <div class="stat-card" style="gap:10px">
+      <div style="font-size:22px">📈</div>
+      <div style="font-weight:700">Dense Reward Signal</div>
+      <div class="stat-label">7 orthogonal dimensions give partial credit at every step. Agents never get 0.0 on a correct-but-insecure submission — they learn incrementally.</div>
+    </div>
+  </div>
+</div>
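The "seeded-random per episode" idea in the Dynamic Attack Grading card can be sketched in Python as follows. This is a minimal illustration only: the payload pool, the `pick_payloads` name, and the use of an integer episode seed are assumptions, not the environment's actual harness.

```python
import random

# Hypothetical payload pool; the environment's real payloads are not shown in this diff.
SQLI_PAYLOADS = [
    "' OR '1'='1",
    "'; DROP TABLE users; --",
    "admin' --",
    "' UNION SELECT NULL --",
]


def pick_payloads(episode_seed, k=2):
    """Return k distinct payloads chosen deterministically for a given seed."""
    rng = random.Random(episode_seed)  # local RNG, global random state is untouched
    return rng.sample(SQLI_PAYLOADS, k)


if __name__ == "__main__":
    # Same seed, same payloads: grading stays reproducible within an episode.
    print(pick_payloads(episode_seed=7))
```

Because the selection depends only on the seed, a grader built this way could replay an episode exactly, while a new seed per episode changes which payloads are fired.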
+<!-- ══════════════════════════════════════════════════ -->
+<!-- TASKS PANEL                                         -->
+<!-- ══════════════════════════════════════════════════ -->
+<div id="panel-tasks" class="panel">
+  <div class="section-label">All 9 Tasks</div>
+  <div style="display:flex;gap:8px;margin-bottom:16px">
+    <button class="btn btn-ghost" onclick="filterTasks('all')" id="f-all" style="border-color:var(--accent);color:var(--accent)">All</button>
+    <button class="btn btn-ghost" onclick="filterTasks('easy')" id="f-easy">Easy</button>
+    <button class="btn btn-ghost" onclick="filterTasks('medium')" id="f-medium">Medium</button>
+    <button class="btn btn-ghost" onclick="filterTasks('hard')" id="f-hard">Hard</button>
+  </div>
+  <div id="task-list-container">
+    <div class="empty"><div class="spinner" style="margin:0 auto"></div></div>
+  </div>
+</div>
+<!-- ══════════════════════════════════════════════════ -->
+<!-- DOCS PANEL                                          -->
+<!-- ══════════════════════════════════════════════════ -->
+<div id="panel-docs" class="panel">
+  <div class="docs-card">
+    <div class="docs-h2">Quick Start</div>
+    <div class="docs-p">This environment implements the OpenEnv API contract. Use the Playground tab for interactive testing, or call the endpoints directly.</div>
+    <div class="docs-code">import requests
+ENV = "https://vishaldhakad-securecodeenv.hf.space"
+# 1. Start episode
+ep = requests.post(f"{ENV}/reset", json={"difficulty": "medium"}).json()
+sid, task = ep["session_id"], ep["task_id"]
+print(ep["problem_statement"])
+# 2. Submit code
+result = requests.post(f"{ENV}/step", json={
+    "session_id": sid,
+    "code": "def build_user_query(u, r):\n    return ('SELECT * FROM users WHERE username=%s', (u,))",
+    "filename": "solution.py"
+}).json()
+print(f"reward={result['total_reward']:.3f}")
+print(result["feedback"]["summary"])</div>
+  </div>
+  <div class="docs-card">
+    <div class="docs-h2">Endpoints</div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method get">GET</span> /health</div>
+      <div class="ep-desc">Health check. Returns <code>{"status":"ok","env":"...","version":"...","tasks_loaded":9}</code></div>
+    </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method post">POST</span> /reset</div>
+      <div class="ep-desc">Start new episode. Body: <code>{"difficulty":"medium","task_id":"optional"}</code>. Returns task + CodeGraph.</div>
     </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method post">POST</span> /step</div>
+      <div class="ep-desc">Submit code. Body: <code>{"session_id":"...","code":"...","filename":"..."}</code>. Returns reward + feedback.</div>
     </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method get">GET</span> /state</div>
+      <div class="ep-desc">Get episode state. Query: <code>?session_id=...</code></div>
+    </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method get">GET</span> /tasks</div>
+      <div class="ep-desc">List tasks. Query: <code>?difficulty=easy</code> (optional filter)</div>
+    </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method get">GET</span> /tasks/{id}</div>
+      <div class="ep-desc">Full task detail including starter code and security checks</div>
+    </div>
+    <div class="ep-row">
+      <div class="ep-path"><span class="method get">GET</span> /docs</div>
+      <div class="ep-desc">Auto-generated Swagger UI (FastAPI)</div>
+    </div>
+  </div>
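The request bodies documented for `/reset` and `/step` can be assembled with two small helpers. This is a sketch under stated assumptions: the helper names and the client-side validation are hypothetical, and only the field names come from the endpoint descriptions above.

```python
DIFFICULTIES = ("easy", "medium", "hard")


def reset_body(difficulty="medium", task_id=None):
    """Build the JSON body for POST /reset; task_id is optional."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    body = {"difficulty": difficulty}
    if task_id:
        body["task_id"] = task_id
    return body


def step_body(session_id, code, filename="solution.py"):
    """Build the JSON body for POST /step."""
    if not session_id:
        raise ValueError("session_id is required")
    return {"session_id": session_id, "code": code, "filename": filename}


if __name__ == "__main__":
    print(reset_body("easy"))
    print(step_body("abc123", "def f():\n    return 1"))
```

Either dict can be passed as `json=` to `requests.post`, as in the Quick Start snippet.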
+  <div class="docs-card">
+    <div class="docs-h2">Reward Dimensions</div>
+    <table style="width:100%;font-size:12px;font-family:var(--mono);border-collapse:collapse">
+      <tr style="border-bottom:1px solid var(--border);color:var(--muted)">
+        <td style="padding:6px 8px">Dimension</td>
+        <td style="padding:6px 8px">Weight</td>
+        <td style="padding:6px 8px">Tool</td>
+        <td style="padding:6px 8px">Measures</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">correctness</td>
+        <td style="padding:6px 8px">30%</td>
+        <td style="padding:6px 8px;color:var(--muted)">Custom runner</td>
+        <td style="padding:6px 8px;color:var(--muted)">Test cases passed</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">attack_resist</td>
+        <td style="padding:6px 8px">20%</td>
+        <td style="padding:6px 8px;color:var(--muted)">Dynamic harness</td>
+        <td style="padding:6px 8px;color:var(--muted)">Real attack payloads blocked</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">static_security</td>
+        <td style="padding:6px 8px">15%</td>
+        <td style="padding:6px 8px;color:var(--muted)">bandit + AST</td>
+        <td style="padding:6px 8px;color:var(--muted)">CWE-mapped vulnerability patterns</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">consistency</td>
+        <td style="padding:6px 8px">15%</td>
+        <td style="padding:6px 8px;color:var(--muted)">CodeGraph</td>
+        <td style="padding:6px 8px;color:var(--muted)">Codebase convention adherence</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">performance</td>
+        <td style="padding:6px 8px">10%</td>
+        <td style="padding:6px 8px;color:var(--muted)">timeit</td>
+        <td style="padding:6px 8px;color:var(--muted)">Speed vs naive/optimal baselines</td>
+      </tr>
+      <tr style="border-bottom:1px solid var(--border)">
+        <td style="padding:6px 8px;color:var(--accent)">documentation</td>
+        <td style="padding:6px 8px">5%</td>
+        <td style="padding:6px 8px;color:var(--muted)">AST</td>
+        <td style="padding:6px 8px;color:var(--muted)">Docstrings + type hints coverage</td>
+      </tr>
+      <tr>
+        <td style="padding:6px 8px;color:var(--accent)">code_structure</td>
+        <td style="padding:6px 8px">5%</td>
+        <td style="padding:6px 8px;color:var(--muted)">AST</td>
+        <td style="padding:6px 8px;color:var(--muted)">No bare print/except, clean structure</td>
+      </tr>
+    </table>
+  </div>
 </div>
+</div><!-- /main -->
 <script>
+// ── State ──────────────────────────────────────────────────────────────────
+const state = {
+  sessionId: null,
+  task: null,
+  stepCount: 0,
+  done: false,
+  history: [],
+  allTasks: [],
+};
+const WEIGHTS = {
+  correctness:0.30, attack_resist:0.20, static_security:0.15,
+  consistency:0.15, performance:0.10, documentation:0.05, code_structure:0.05
+};
+// ── Init ───────────────────────────────────────────────────────────────────
+document.addEventListener('DOMContentLoaded', () => {
+  checkHealth();
+  loadTasksDropdown();
+  renderWeightChart();
+  document.getElementById('code-editor').addEventListener('input', updateCharCount);
+  updateCharCount();
+});
+// ── Health check ───────────────────────────────────────────────────────────
+async function checkHealth() {
+  const dot = document.getElementById('status-dot');
+  const txt = document.getElementById('status-text');
+  try {
+    const r = await fetch('/health');
+    const d = await r.json();
+    dot.className = 'dot pulse';
+    txt.textContent = `${d.env} v${d.version} · ${d.tasks_loaded} tasks`;
+  } catch(e) {
+    dot.className = 'dot red';
+    txt.textContent = 'Environment unreachable';
+  }
+}
+// ── Tab navigation ─────────────────────────────────────────────────────────
+function showPanel(id, btn) {
+  document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
+  document.querySelectorAll('.ntab').forEach(t => t.classList.remove('active'));
+  document.getElementById('panel-'+id).classList.add('active');
+  btn.classList.add('active');
+  // loadTasksList() fetches only when state.allTasks is empty, but always renders the list.
+  if (id === 'tasks') loadTasksList();
+}
+// ── Task dropdown ──────────────────────────────────────────────────────────
+async function loadTasksDropdown() {
+  try {
+    const r = await fetch('/tasks');
+    const tasks = await r.json();
+    state.allTasks = tasks;
+    const sel = document.getElementById('task-select');
+    tasks.forEach(t => {
+      const opt = document.createElement('option');
+      opt.value = t.id;
+      opt.textContent = `${t.id.replace(/_/g,' ')}`;
+      sel.appendChild(opt);
     });
+  } catch(e) {}
+}
+// ── Reset episode ──────────────────────────────────────────────────────────
+async function doReset() {
+  const btn = document.getElementById('btn-reset');
+  const spin = document.getElementById('reset-spinner');
+  btn.disabled = true; spin.style.display = 'inline-block';
+  clearAlert();
+  const difficulty = document.getElementById('diff-select').value;
+  const taskId = document.getElementById('task-select').value;
+  try {
+    const body = { difficulty };
+    if (taskId) body.task_id = taskId;
+    const r = await fetch('/reset', {
+      method: 'POST',
+      headers: {'Content-Type':'application/json'},
+      body: JSON.stringify(body)
+    });
+    if (!r.ok) {
+      const e = await r.json();
+      showAlert(e.detail || 'Reset failed', 'error');
+      return;
+    }
+    const d = await r.json();
+    state.sessionId = d.session_id;
+    state.task = d;
+    state.stepCount = 0;
+    state.done = false;
+    state.history = [];
+    renderTask(d);
+    resetResultPanel();
+    updateStepCounter();
+    document.getElementById('btn-submit').disabled = false;
+    document.getElementById('session-badge').style.display = 'inline';
+    document.getElementById('session-badge').textContent = d.session_id.slice(0,8) + '…';
+    showAlert(`✓ Episode started: ${d.task_id}`, 'success');
+  } catch(e) {
+    showAlert('Network error: ' + e.message, 'error');
+  } finally {
+    btn.disabled = false; spin.style.display = 'none';
+  }
+}
+// ── Submit step ────────────────────────────────────────────────────────────
+async function doStep() {
+  if (!state.sessionId) { showAlert('Reset an episode first', 'error'); return; }
+  const code = document.getElementById('code-editor').value.trim();
+  if (!code) { showAlert('Write some code first', 'error'); return; }
+  const btn = document.getElementById('btn-submit');
+  const spin = document.getElementById('submit-spinner');
+  btn.disabled = true; spin.style.display = 'inline-block';
+  clearAlert();
+  try {
+    const r = await fetch('/step', {
+      method: 'POST',
+      headers: {'Content-Type':'application/json'},
+      body: JSON.stringify({
+        session_id: state.sessionId,
+        code,
+        filename: `solution_step${state.stepCount}.py`
+      })
+    });
+    if (!r.ok) {
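+      const e = await r.json();
+      // Validation errors can carry a non-string detail; fall back to a generic message.
+      const detail = typeof e.detail === 'string' ? e.detail : 'Step failed';
+      showAlert(detail, 'error');
+      if (r.status === 400 && detail.includes('done')) {
+        state.done = true;  // stops the finally block from re-enabling the button
+        btn.disabled = true;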
       }
+      return;
+    }
+    const d = await r.json();
+    state.stepCount = d.step_count;
+    state.done = d.done;
+    state.history.push({ step: d.step_count, reward: d.total_reward, done: d.done });
+    renderReward(d.total_reward);
+    renderScores(d.scores, d.details);
+    renderFeedback(d.feedback);
+    renderHistory();
+    updateStepCounter();
+    if (d.done) {
+      btn.disabled = true;
+      document.getElementById('done-badge').style.display = 'inline';
+      const msg = d.total_reward >= 0.9
+        ? '🎉 Excellent! Episode solved!'
+        : `Episode complete after ${d.step_count} steps`;
+      showAlert(msg, d.total_reward >= 0.9 ? 'success' : 'info');
+    }
+  } catch(e) {
+    showAlert('Network error: ' + e.message, 'error');
+  } finally {
+    if (!state.done) btn.disabled = false;
+    spin.style.display = 'none';
+  }
+}
+// ── Render helpers ─────────────────────────────────────────────────────────
+function renderTask(d) {
+  const area = document.getElementById('task-area');
+  area.style.display = 'block';
+  const meta = document.getElementById('task-meta');
+  const diffClass = d.difficulty;
+  meta.innerHTML = `<span class="diff-tag ${diffClass}">${d.difficulty}</span>`
+    + d.cwe_targets.map(c => `<span class="cwe">${c}</span>`).join('');
+  document.getElementById('task-box').textContent = d.problem_statement;
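+  // Resolve the task first: `undefined + '.py'` is truthy, so a trailing `||` fallback never fires.
+  const known = state.allTasks.find(t => t.id === d.task_id);
+  document.getElementById('editor-filename').textContent =
+    known ? known.id.replace('_','/') + '.py' : 'solution.py';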
+}
+function renderReward(reward) {
+  const n = document.getElementById('reward-number');
+  const bar = document.getElementById('reward-bar');
+  n.textContent = reward.toFixed(3);
+  n.style.color = reward >= 0.9 ? 'var(--a3)' : reward >= 0.6 ? 'var(--accent)' : 'var(--danger)';
+  bar.style.width = (reward * 100) + '%';
+  bar.style.background = reward >= 0.9 ? 'var(--a3)' : reward >= 0.6 ? 'var(--accent)' : 'var(--danger)';
+}
+function renderScores(scores, details) {
+  const el = document.getElementById('score-breakdown');
+  const rows = Object.entries(scores).map(([k, v]) => {
+    const pct = Math.round(v * 100);
+    const color = v >= 0.8 ? 'var(--a3)' : v >= 0.5 ? 'var(--accent)' : 'var(--danger)';
+    const w = Math.round(WEIGHTS[k] * 100);
+    let extra = '';
+    if (details) {
+      if (k === 'correctness' && details.correctness_total) {
+        extra = ` (${details.correctness_passed}/${details.correctness_total})`;
+      } else if (k === 'attack_resist' && details.attacks_total) {
+        extra = ` (${details.attacks_blocked}/${details.attacks_total} blocked)`;
+      }
+    }
+    return `<div class="score-row">
+      <div class="score-dim">${k}${extra}</div>
+      <div class="score-bar-bg"><div class="score-bar-fg" style="width:${pct}%;background:${color}"></div></div>
+      <div class="score-val" style="color:${color}">${v.toFixed(2)}</div>
+      <div class="weight-tag">${w}%</div>
+    </div>`;
+  });
+  el.innerHTML = rows.join('');
+  document.getElementById('summary-text').textContent = '';
+}
+function renderFeedback(feedback) {
+  const el = document.getElementById('feedback-area');
+  const summary = feedback.summary || '';
+  const items = Object.entries(feedback).filter(([k]) => k !== 'summary');
+  const good = (v) => v.startsWith('Excellent') || v.startsWith('Clean') || v.startsWith('Well');
+  const bad  = (v) => v.includes('Poor') || v.includes('Vulnerable') || v.includes('major') || v.includes('HIGH');
+  const html = `<div class="fb-item ${summary.includes('✅') ? 'good' : summary.includes('🔴') ? 'bad' : 'warn'}">${escHtml(summary)}</div>`
+    + items.map(([k, v]) => {
+      const cls = good(v) ? 'good' : bad(v) ? 'bad' : 'warn';
+      return `<div class="fb-item ${cls}"><strong>${k}:</strong> ${escHtml(v)}</div>`;
+    }).join('');
+  el.innerHTML = html;
+}
+function renderHistory() {
+  const el = document.getElementById('history-area');
+  const count = document.getElementById('history-count');
+  count.textContent = `${state.history.length} steps`;
+  if (!state.history.length) { el.innerHTML = '<div class="empty" style="padding:20px"><div class="empty-text">No submissions yet</div></div>'; return; }
+  el.innerHTML = state.history.map(h => {
+    const color = h.reward >= 0.9 ? 'var(--a3)' : h.reward >= 0.6 ? 'var(--accent)' : 'var(--danger)';
+    return `<div class="history-item">
+      <span class="h-step">Step ${h.step}</span>
+      <span class="h-reward" style="color:${color}">${h.reward.toFixed(3)}</span>
+      <div class="h-bar"><div class="h-bar-fg" style="width:${h.reward*100}%;background:${color}"></div></div>
+      ${h.done ? '<span class="h-done">done</span>' : ''}
+    </div>`;
+  }).join('');
+}
+function resetResultPanel() {
+  document.getElementById('reward-number').textContent = '—';
+  document.getElementById('reward-number').style.color = 'var(--muted)';
+  document.getElementById('reward-bar').style.width = '0%';
+  document.getElementById('score-breakdown').innerHTML = '<div class="empty"><div class="empty-icon">📊</div><div class="empty-text">Submit code to see scores</div></div>';
+  document.getElementById('feedback-area').innerHTML = '<div class="empty"><div class="empty-icon">💬</div><div class="empty-text">Feedback will appear here</div></div>';
+  document.getElementById('history-area').innerHTML = '<div class="empty" style="padding:20px"><div class="empty-text">No submissions yet</div></div>';
+  document.getElementById('history-count').textContent = '0 steps';
+  document.getElementById('done-badge').style.display = 'none';
+  document.getElementById('summary-text').textContent = '';
+}
+function updateStepCounter() {
+  document.getElementById('step-counter').textContent = `Step ${state.stepCount}/5`;
+}
+function updateCharCount() {
+  const len = document.getElementById('code-editor').value.length;
+  document.getElementById('char-count').textContent = `${len} chars`;
+}
+// ── Editor helpers ─────────────────────────────────────────────────────────
+async function loadStarter() {
+  if (!state.task) { showAlert('Reset an episode first', 'error'); return; }
+  const tid = state.task.task_id;
+  try {
+    const r = await fetch(`/tasks/${tid}`);
+    const d = await r.json();
+    if (d.starter_code) {
+      document.getElementById('code-editor').value = d.starter_code;
+      updateCharCount();
+    }
+  } catch(e) {}
+}
+function clearEditor() {
+  document.getElementById('code-editor').value = '';
+  updateCharCount();
+}
+// ── Alert ──────────────────────────────────────────────────────────────────
+function showAlert(msg, type='info') {
+  const el = document.getElementById('alert-area');
+  const cls = type === 'error' ? 'alert-error' : type === 'success' ? 'alert-success' : 'alert-info';
+  el.innerHTML = `<div class="alert ${cls}">${escHtml(msg)}</div>`;
+  setTimeout(() => { el.innerHTML = ''; }, 5000);
+}
+function clearAlert() { document.getElementById('alert-area').innerHTML = ''; }
+// ── Tasks list ─────────────────────────────────────────────────────────────
+let taskFilter = 'all';
+async function loadTasksList() {
+  if (state.allTasks.length === 0) {
+    const r = await fetch('/tasks');
+    state.allTasks = await r.json();
+  }
+  filterTasks('all');
+}
+function filterTasks(diff) {
+  taskFilter = diff;
+  ['all','easy','medium','hard'].forEach(d => {
+    document.getElementById('f-'+d).style.borderColor = '';
+    document.getElementById('f-'+d).style.color = '';
+  });
+  document.getElementById('f-'+diff).style.borderColor = 'var(--accent)';
+  document.getElementById('f-'+diff).style.color = 'var(--accent)';
+  const tasks = diff === 'all' ? state.allTasks : state.allTasks.filter(t => t.difficulty === diff);
+  const el = document.getElementById('task-list-container');
+  if (!tasks.length) { el.innerHTML = '<div class="empty"><div class="empty-text">No tasks found</div></div>'; return; }
+  el.innerHTML = tasks.map(t => `
+    <div class="task-list-item" onclick="tryTask('${t.id}')">
+      <div class="tli-header">
+        <div class="tli-name">${t.id.replace(/_/g,' ')}</div>
+        <span class="diff-tag ${t.difficulty}">${t.difficulty}</span>
+      </div>
+      <div class="tli-desc">${escHtml((t.description||'').slice(0,100))}${t.description?.length > 100 ? '…' : ''}</div>
+      <div class="tli-footer">
+        ${t.cwe_targets.map(c => `<span class="cwe">${c}</span>`).join('')}
+        <span class="badge bo" style="font-size:9px;margin-left:auto">Try it →</span>
+      </div>
+    </div>
+  `).join('');
+}
+function tryTask(taskId) {
+  showPanel('playground', document.querySelector('.ntab'));
+  document.querySelectorAll('.ntab')[0].click();
+  document.getElementById('task-select').value = taskId;
+  doReset();
+}
+// ── Weight chart ───────────────────────────────────────────────────────────
+function renderWeightChart() {
+  const el = document.getElementById('weight-chart');
+  const entries = [
+    ['correctness', 0.30], ['attack_resist', 0.20],
+    ['static_security', 0.15], ['consistency', 0.15],
+    ['performance', 0.10], ['documentation', 0.05], ['code_structure', 0.05]
+  ];
+  el.innerHTML = entries.map(([name, w]) => `
+    <div class="weight-bar-row">
+      <div class="wbr-name">${name}</div>
+      <div class="wbr-bg"><div class="wbr-fg" style="width:${w*100*3.33}%"></div></div>
+      <div class="wbr-val">${Math.round(w*100)}%</div>
+    </div>
+  `).join('');
+  setTimeout(() => {
+    document.querySelectorAll('.wbr-fg').forEach(b => {
+      b.style.transition = 'width .8s ease';
+    });
+  }, 100);
+}
+// ── Utils ──────────────────────────────────────────────────────────────────
+function escHtml(s) {
+  return String(s||'').replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');
+}
+// Tab key in textarea
+document.addEventListener('keydown', e => {
+  if (e.target.id === 'code-editor' && e.key === 'Tab') {
+    e.preventDefault();
+    const s = e.target.selectionStart, en = e.target.selectionEnd;
+    e.target.value = e.target.value.substring(0, s) + '    ' + e.target.value.substring(en);
+    e.target.selectionStart = e.target.selectionEnd = s + 4;
+    updateCharCount();
+  }
+  // Ctrl+Enter to submit
+  if ((e.ctrlKey || e.metaKey) && e.key === 'Enter') doStep();
+});
 </script>
 </body>
+</html>"""
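
The weight table and the `WEIGHTS` object in the dashboard script suggest that the total reward is a weighted sum of the seven per-dimension scores. Below is a minimal Python sketch of that aggregation. The helper name `weighted_reward`, the clamping, and the treatment of missing dimensions are assumptions; the actual logic in `graders/reward_aggregator.py` is not shown in this part of the diff.

```python
# Weights copied from the dashboard table; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "attack_resist": 0.20,
    "static_security": 0.15,
    "consistency": 0.15,
    "performance": 0.10,
    "documentation": 0.05,
    "code_structure": 0.05,
}


def weighted_reward(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one total.

    Hypothetical helper: missing dimensions count as 0.0 and every
    score is clamped to [0.0, 1.0] before weighting.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * score
    return round(total, 4)


if __name__ == "__main__":
    example = {name: 1.0 for name in WEIGHTS}
    example["attack_resist"] = 0.5  # half of the attack payloads blocked
    print(weighted_reward(example))
```

Under these assumptions, a submission that is perfect everywhere except attack resistance loses at most 0.20 of the total, which is why a low `attack_resist` score alone can keep an episode near the 0.90 completion threshold.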

app/models.py CHANGED Viewed

@@ -1,53 +1,53 @@
-"""
-SecureCodeEnv - Pydantic Models
-All request/response types for the OpenEnv API contract.
-"""
 from pydantic import BaseModel, Field
 from typing import Optional, Dict, Any, List
 class StepAction(BaseModel):
-    session_id: str = Field(..., description="Session ID returned from /reset")
-    code: str = Field(..., description="The agent's submitted Python source code")
-    filename: str = Field(
-        default="solution.py",
-        description="Logical filename for CodeGraph tracking e.g. 'src/auth/validator.py'"
-    )
-    task_id: Optional[str] = Field(None, description="Task ID (optional, validated against session)")
 class StepObservation(BaseModel):
-    scores: Dict[str, float] = Field(..., description="Per-dimension scores 0.0-1.0")
-    total_reward: float = Field(..., description="Weighted final score 0.0-1.0")
-    feedback: Dict[str, str] = Field(..., description="Human-readable feedback per dimension")
-    codegraph: Dict[str, Any] = Field(..., description="Updated CodeGraph state")
-    done: bool = Field(..., description="Is the episode complete?")
-    step_count: int = Field(..., description="Current step number")
 class ResetRequest(BaseModel):
-    difficulty: Optional[str] = Field(
-        default="medium",
-        description="Task difficulty: 'easy' | 'medium' | 'hard'"
-    )
-    session_id: Optional[str] = Field(
-        None,
-        description="Optional: reuse a session ID (for deterministic testing)"
-    )
 class ResetObservation(BaseModel):
     session_id: str
     task_id: str
-    problem_statement: str = Field(..., description="Natural language task description")
-    difficulty: str = Field(..., description="'easy' | 'medium' | 'hard'")
-    cwe_targets: List[str] = Field(..., description="e.g. ['CWE-89', 'CWE-20']")
-    codegraph: Dict[str, Any] = Field(..., description="Current codebase context (empty for easy)")
-    starter_code: str = Field(default="", description="Buggy/incomplete starter code")
-    naive_baseline: Optional[Dict] = Field(
-        default=None,
-        description="Performance baseline for relative scoring"
-    )
 class StateResponse(BaseModel):
@@ -57,6 +57,7 @@ class StateResponse(BaseModel):
     done: bool
     codegraph: Dict[str, Any]
     difficulty: str
 class HealthResponse(BaseModel):
@@ -64,3 +65,10 @@ class HealthResponse(BaseModel):
     env: str
     version: str
     tasks_loaded: int

+"""SecureCodeEnv - Pydantic Models v2 (production-complete)"""
 from pydantic import BaseModel, Field
 from typing import Optional, Dict, Any, List
 class StepAction(BaseModel):
+    session_id: str
+    code: str = Field(..., min_length=1)
+    filename: str = Field(default="solution.py")
+    task_id: Optional[str] = None
+class ScoreDetails(BaseModel):
+    correctness_passed: Optional[int] = None
+    correctness_total: Optional[int] = None
+    attacks_blocked: Optional[int] = None
+    attacks_total: Optional[int] = None
+    attack_type: Optional[str] = None
+    bandit_score: Optional[float] = None
+    static_issues_count: Optional[int] = None
+    agent_ms: Optional[float] = None
+    naive_ms: Optional[float] = None
+    optimal_ms: Optional[float] = None
 class StepObservation(BaseModel):
+    scores: Dict[str, float]
+    total_reward: float
+    feedback: Dict[str, str]
+    codegraph: Dict[str, Any]
+    done: bool
+    step_count: int
+    details: Optional[ScoreDetails] = None
 class ResetRequest(BaseModel):
+    difficulty: Optional[str] = Field(default="medium")
+    task_id: Optional[str] = Field(default=None, description="Override: request a specific task ID")
+    session_id: Optional[str] = None
 class ResetObservation(BaseModel):
     session_id: str
     task_id: str
+    problem_statement: str
+    difficulty: str
+    cwe_targets: List[str]
+    codegraph: Dict[str, Any]
+    starter_code: str = ""
+    naive_baseline: Optional[Dict] = None
 class StateResponse(BaseModel):
     done: bool
     codegraph: Dict[str, Any]
     difficulty: str
+    scores_history: List[float] = []
 class HealthResponse(BaseModel):
     env: str
     version: str
     tasks_loaded: int
+class TaskSummary(BaseModel):
+    id: str
+    difficulty: str
+    cwe_targets: List[str]
+    description: str = ""

app/routes.py CHANGED Viewed

@@ -1,66 +1,61 @@
-"""
-SecureCodeEnv - Route Handlers
-Implements the three required OpenEnv endpoints: /reset, /step, /state
-"""
-from fastapi import APIRouter, HTTPException
 from app.models import (
-    StepAction, StepObservation,
     ResetRequest, ResetObservation,
-    StateResponse,
 )
 from app.state import EpisodeState
 from graders.reward_aggregator import grade_submission
-from tasks.task_registry import sample_task, get_task, TASK_REGISTRY
 from codegraph.graph import CodeGraph
-import uuid
-import threading
 router = APIRouter()
-# In-memory session store (thread-safe with lock)
 _sessions: dict[str, EpisodeState] = {}
-_sessions_lock = threading.Lock()
 MAX_STEPS = 5
 DONE_THRESHOLD = 0.90
-def _cleanup_expired():
-    """Remove sessions older than 1 hour."""
-    with _sessions_lock:
         expired = [k for k, v in _sessions.items() if v.is_expired()]
         for k in expired:
             del _sessions[k]
-# ---------------------------------------------------------------------------
-# POST /reset
-# ---------------------------------------------------------------------------
 @router.post("/reset", response_model=ResetObservation, tags=["OpenEnv"])
 def reset(body: ResetRequest = None):
-    """
-    Start a new episode. Returns a task problem statement and initial CodeGraph.
-    Call this before every /step sequence.
-    """
-    _cleanup_expired()
     if body is None:
         body = ResetRequest()
-    difficulty = (body.difficulty or "medium").lower()
-    if difficulty not in ("easy", "medium", "hard"):
-        raise HTTPException(400, f"difficulty must be 'easy', 'medium', or 'hard'. Got: {difficulty}")
     sid = body.session_id or str(uuid.uuid4())
-    task = sample_task(difficulty)
     graph = CodeGraph(episode_seed=abs(hash(sid)) % 999_999)
     state = EpisodeState(task=task, graph=graph, step=0, done=False)
-    with _sessions_lock:
         _sessions[sid] = state
-    from codegraph.serializer import serialize_graph
     return ResetObservation(
         session_id=sid,
         task_id=task["id"],
@@ -73,25 +68,19 @@ def reset(body: ResetRequest = None):
     )
-# ---------------------------------------------------------------------------
-# POST /step
-# ---------------------------------------------------------------------------
 @router.post("/step", response_model=StepObservation, tags=["OpenEnv"])
 def step(action: StepAction):
-    """
-    Submit agent code for grading. Returns multi-dimensional reward scores,
-    feedback, and updated CodeGraph.
-    """
-    with _sessions_lock:
         state = _sessions.get(action.session_id)
     if state is None:
         raise HTTPException(404, "Session not found — call POST /reset first.")
     if state.done:
-        raise HTTPException(400, "Episode already done — call POST /reset to start a new one.")
     if not action.code or not action.code.strip():
-        raise HTTPException(422, "code field must be a non-empty Python string.")
     result = grade_submission(
         code=action.code,
@@ -102,15 +91,26 @@ def step(action: StepAction):
         seed=state.graph.episode_seed + state.step,
     )
-    # Update CodeGraph with new component metadata
     state.graph.update(action.filename or "solution.py", result["new_metadata"])
     state.step += 1
     state.scores_history.append(result["total_reward"])
-    # Episode is done when reward is high enough or max steps reached
     state.done = result["total_reward"] >= DONE_THRESHOLD or state.step >= MAX_STEPS
-    from codegraph.serializer import serialize_graph
     return StepObservation(
         scores=result["scores"],
         total_reward=result["total_reward"],
@@ -118,25 +118,19 @@ def step(action: StepAction):
         codegraph=serialize_graph(state.graph),
         done=state.done,
         step_count=state.step,
     )
-# ---------------------------------------------------------------------------
-# GET /state
-# ---------------------------------------------------------------------------
 @router.get("/state", response_model=StateResponse, tags=["OpenEnv"])
 def get_state(session_id: str):
-    """
-    Returns current episode state without advancing it.
-    Useful for monitoring agent progress.
-    """
-    with _sessions_lock:
         state = _sessions.get(session_id)
     if state is None:
         raise HTTPException(404, "Session not found.")
-    from codegraph.serializer import serialize_graph
     return StateResponse(
         session_id=session_id,
         task_id=state.task["id"],
@@ -144,4 +138,40 @@ def get_state(session_id: str):
         done=state.done,
         codegraph=serialize_graph(state.graph),
         difficulty=state.task.get("difficulty", "medium"),
     )

+"""SecureCodeEnv - Routes v2 (production-complete)"""
+from fastapi import APIRouter, HTTPException, Query
+from typing import List, Optional
 from app.models import (
+    StepAction, StepObservation, ScoreDetails,
     ResetRequest, ResetObservation,
+    StateResponse, TaskSummary,
 )
 from app.state import EpisodeState
 from graders.reward_aggregator import grade_submission
+from tasks.task_registry import sample_task, get_task, TASK_REGISTRY, list_tasks
 from codegraph.graph import CodeGraph
+from codegraph.serializer import serialize_graph
+import threading
+import uuid
 router = APIRouter()
 _sessions: dict[str, EpisodeState] = {}
+_lock = threading.Lock()
 MAX_STEPS = 5
 DONE_THRESHOLD = 0.90
+def _cleanup():
+    with _lock:
         expired = [k for k, v in _sessions.items() if v.is_expired()]
         for k in expired:
             del _sessions[k]
+# ── POST /reset ──────────────────────────────────────────────────────────────
 @router.post("/reset", response_model=ResetObservation, tags=["OpenEnv"])
 def reset(body: ResetRequest = None):
+    """Start a new episode. Returns task + initial CodeGraph."""
+    _cleanup()
     if body is None:
         body = ResetRequest()
+    # Support specific task_id override
+    if body.task_id:
+        try:
+            task = get_task(body.task_id)
+        except KeyError:
+            raise HTTPException(404, f"task_id {body.task_id!r} not found. "
+                                f"Available: {list(TASK_REGISTRY.keys())}")
+        difficulty = task["difficulty"]
+    else:
+        difficulty = (body.difficulty or "medium").lower()
+        if difficulty not in ("easy", "medium", "hard"):
+            raise HTTPException(400, f"difficulty must be easy/medium/hard. Got: {difficulty!r}")
+        task = sample_task(difficulty)
     sid = body.session_id or str(uuid.uuid4())
     graph = CodeGraph(episode_seed=abs(hash(sid)) % 999_999)
     state = EpisodeState(task=task, graph=graph, step=0, done=False)
+    with _lock:
         _sessions[sid] = state
     return ResetObservation(
         session_id=sid,
         task_id=task["id"],
     )
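
`reset` derives `episode_seed` from `hash(sid)`. Python salts `str` hashes per process unless `PYTHONHASHSEED` is fixed, so the same `session_id` can yield a different seed after a restart, which weakens the "reuse a session ID for deterministic testing" use case. One possible alternative is sketched below, using a cryptographic digest from the standard library. The name `stable_seed` is hypothetical and not part of this commit.

```python
import hashlib


def stable_seed(session_id: str, modulus: int = 999_999) -> int:
    """Map a session ID to a seed that is identical in every process."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % modulus
```

If adopted, the call site would become `CodeGraph(episode_seed=stable_seed(sid))`, keeping the same `0..999_998` range as the current expression.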
+# ── POST /step ───────────────────────────────────────────────────────────────
 @router.post("/step", response_model=StepObservation, tags=["OpenEnv"])
 def step(action: StepAction):
+    """Submit code. Returns multi-dimensional reward + updated CodeGraph."""
+    with _lock:
         state = _sessions.get(action.session_id)
     if state is None:
         raise HTTPException(404, "Session not found — call POST /reset first.")
     if state.done:
+        raise HTTPException(400, "Episode done — call POST /reset to start a new one.")
     if not action.code or not action.code.strip():
+        raise HTTPException(422, "code must be a non-empty Python string.")
     result = grade_submission(
         code=action.code,
         seed=state.graph.episode_seed + state.step,
     )
     state.graph.update(action.filename or "solution.py", result["new_metadata"])
     state.step += 1
     state.scores_history.append(result["total_reward"])
     state.done = result["total_reward"] >= DONE_THRESHOLD or state.step >= MAX_STEPS
+    # Build structured details object
+    raw = result.get("details", {}) or {}
+    details = ScoreDetails(
+        correctness_passed=raw.get("correctness", {}).get("passed"),
+        correctness_total=raw.get("correctness", {}).get("total"),
+        attacks_blocked=raw.get("attacks", {}).get("blocked"),
+        attacks_total=raw.get("attacks", {}).get("total"),
+        attack_type=raw.get("attacks", {}).get("type"),
+        bandit_score=raw.get("static", {}).get("bandit_score"),
+        static_issues_count=len(raw.get("static", {}).get("issues", [])),
+        agent_ms=result.get("agent_ms"),
+        naive_ms=result.get("naive_ms"),
+        optimal_ms=result.get("optimal_ms"),
+    )
     return StepObservation(
         scores=result["scores"],
         total_reward=result["total_reward"],
         codegraph=serialize_graph(state.graph),
         done=state.done,
         step_count=state.step,
+        details=details,
     )
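
The `ScoreDetails` construction in `step` relies on chained `.get(key, {})` calls. That default only applies when the key is absent; if a grader ever returned `None` for a section (an assumption, since the grader output format is not shown here), the second `.get` would raise `AttributeError`. A defensive variant might look like the sketch below, where `section` and `flatten_details` are hypothetical helpers.

```python
from typing import Any, Optional


def section(raw: Optional[dict], key: str) -> dict:
    """Return raw[key] if it is a dict, otherwise an empty dict."""
    value = (raw or {}).get(key)
    return value if isinstance(value, dict) else {}


def flatten_details(raw: Optional[dict]) -> dict[str, Any]:
    """Flatten nested grader details into the flat ScoreDetails field names."""
    correctness = section(raw, "correctness")
    attacks = section(raw, "attacks")
    static = section(raw, "static")
    issues = static.get("issues")
    return {
        "correctness_passed": correctness.get("passed"),
        "correctness_total": correctness.get("total"),
        "attacks_blocked": attacks.get("blocked"),
        "attacks_total": attacks.get("total"),
        "attack_type": attacks.get("type"),
        "bandit_score": static.get("bandit_score"),
        # None (rather than 0) when the static grader reported nothing.
        "static_issues_count": len(issues) if isinstance(issues, list) else None,
    }
```

The resulting dict could be passed as `ScoreDetails(**flatten_details(raw))`. Note one deliberate difference from the committed code: a missing `issues` list maps to `None` instead of `0`, so the dashboard can tell "no issues" apart from "not measured".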
+# ── GET /state ───────────────────────────────────────────────────────────────
 @router.get("/state", response_model=StateResponse, tags=["OpenEnv"])
 def get_state(session_id: str):
+    """Get current episode state without advancing it."""
+    with _lock:
         state = _sessions.get(session_id)
     if state is None:
         raise HTTPException(404, "Session not found.")
     return StateResponse(
         session_id=session_id,
         task_id=state.task["id"],
         done=state.done,
         codegraph=serialize_graph(state.graph),
         difficulty=state.task.get("difficulty", "medium"),
+        scores_history=state.scores_history,
     )
+# ── GET /tasks ───────────────────────────────────────────────────────────────
+@router.get("/tasks", response_model=List[TaskSummary], tags=["Discovery"])
+def get_tasks(difficulty: Optional[str] = Query(None)):
+    """List all available tasks, optionally filtered by difficulty."""
+    raw = list_tasks(difficulty)
+    return [
+        TaskSummary(
+            id=t["id"],
+            difficulty=t["difficulty"],
+            cwe_targets=t["cwe_targets"],
+            description=TASK_REGISTRY[t["id"]].get("problem_statement", "")[:120] + "…",
+        )
+        for t in raw
+    ]
+# ── GET /tasks/{task_id} ─────────────────────────────────────────────────────
+@router.get("/tasks/{task_id}", tags=["Discovery"])
+def get_task_detail(task_id: str):
+    """Get full detail for a specific task."""
+    try:
+        task = get_task(task_id)
+    except KeyError:
+        raise HTTPException(404, f"Task {task_id!r} not found.")
+    return {
+        "id": task["id"],
+        "difficulty": task["difficulty"],
+        "cwe_targets": task["cwe_targets"],
+        "problem_statement": task["problem_statement"],
+        "starter_code": task.get("starter_code", ""),
+        "attack_type": task.get("attack_type", "none"),
+        "security_checks": task.get("security_checks", []),
+    }
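
The termination rule in `step` (`total_reward >= DONE_THRESHOLD or state.step >= MAX_STEPS`) can be replayed offline to check how an episode would end for a given reward sequence. The sketch below reuses the two constants from `app/routes.py`; `run_episode` is a hypothetical helper for illustration only.

```python
MAX_STEPS = 5          # same constants as in app/routes.py
DONE_THRESHOLD = 0.90


def run_episode(rewards: list[float]) -> tuple[int, bool]:
    """Replay a sequence of total rewards through the termination rule.

    Returns (steps_taken, solved), where solved means the threshold was met.
    """
    step = 0
    for reward in rewards:
        step += 1
        if reward >= DONE_THRESHOLD:
            return step, True
        if step >= MAX_STEPS:
            return step, False
    return step, False
```

This mirrors what the dashboard reports: an episode that ends with a reward of at least 0.9 is shown as solved, while one that runs out of steps is shown as complete after `MAX_STEPS` submissions.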

graders/attacks.py CHANGED Viewed

@@ -1,29 +1,30 @@
 """
-SecureCodeEnv - Attack Resistance Grader
-The most unique part of the environment.
-We don't just scan for vulnerability PATTERNS — we FIRE attacks at the code.
 Weight: 20% of total reward.
 """
 from sandbox.executor import safe_exec
 from sandbox.payload_gen import get_payloads
-# Attack configurations per task
-# fn_args_template: Use {payload} as placeholder for the attack string
-ATTACK_CONFIGS: dict[str, dict] = {
     "medium_sql_query_builder": {
         "fn_name": "build_user_query",
         "fn_args_template": ["{payload}", "admin"],
         "attack_type": "injection",
         "check": lambda result, payload: (
-            isinstance(result, tuple) and
-            len(result) == 2 and
-            # Payload must NOT appear literally in query string (it's parameterized)
             payload not in str(result[0]) and
-            # Query must use parameter placeholders
-            ("%s" in result[0] or "?" in result[0] or ":param" in result[0])
         ),
-        "exception_is_pass": False,  # SQL builder shouldn't raise on injection input
     },
     "medium_file_path_handler": {
         "fn_name": "resolve_safe_path",
@@ -32,12 +33,10 @@ ATTACK_CONFIGS: dict[str, dict] = {
         "check": lambda result, payload: (
             result is None or
             (isinstance(result, str) and
-             "/etc" not in result and
-             "/root" not in result and
-             ".." not in result and
-             result.startswith("/var/uploads"))
         ),
-        "exception_is_pass": True,  # Raising ValueError = correct behavior
     },
     "hard_file_upload_handler": {
         "fn_name": "validate_upload",
@@ -73,7 +72,8 @@ ATTACK_CONFIGS: dict[str, dict] = {
             isinstance(result, str) and
             "<script" not in result.lower() and
             "onerror" not in result.lower() and
-            "javascript:" not in result.lower()
         ),
         "exception_is_pass": False,
     },
@@ -81,50 +81,29 @@ ATTACK_CONFIGS: dict[str, dict] = {
 def grade_attacks(code: str, task: dict, seed: int) -> dict:
-    """
-    Fire randomized attack payloads at the agent's code.
-    Args:
-        code: Agent's submitted Python code
-        task: Task dict with id, attack_type, etc.
-        seed: Episode seed — changes per episode to prevent memorization
-    Returns:
-        {
-            "score": float 0.0-1.0,
-            "blocked": int,
-            "total": int,
-            "attack_type": str,
-            "feedback": str
-        }
-    """
     task_id = task.get("id", "")
     config = ATTACK_CONFIGS.get(task_id)
     if not config:
-        # No attacks defined for this task — full score
         return {
             "score": 1.0, "blocked": 0, "total": 0,
             "attack_type": "none",
-            "feedback": "No attack grading for this task type",
         }
     payloads = get_payloads(config["attack_type"], seed=seed, count=8)
     if not payloads:
-        return {"score": 1.0, "blocked": 0, "total": 0, "attack_type": config["attack_type"], "feedback": "No payloads generated"}
     blocked = 0
     exception_is_pass = config.get("exception_is_pass", True)
     for payload in payloads:
-        # Build function args — substitute {payload} placeholder
         raw_args = config["fn_args_template"]
         args = [a.replace("{payload}", payload) if isinstance(a, str) else a for a in raw_args]
         result = safe_exec(code, args, function_name=config["fn_name"], timeout=3)
         if not result["ok"]:
-            # Code raised an exception — counts as blocked if exception_is_pass=True
             if exception_is_pass:
                 blocked += 1
         else:
@@ -133,7 +112,7 @@ def grade_attacks(code: str, task: dict, seed: int) -> dict:
                 if config["check"](output, payload):
                     blocked += 1
             except Exception:
-                pass  # Check itself errored — count as not blocked
     score = blocked / len(payloads)
     return {
@@ -146,19 +125,13 @@ def grade_attacks(code: str, task: dict, seed: int) -> dict:
 def _attack_feedback(score: float, attack_type: str) -> str:
-    attack_names = {
-        "injection": "SQL injection",
-        "traversal": "path traversal",
-        "auth_bypass": "authentication bypass",
-        "xss": "XSS",
         "weak_random": "predictable randomness",
     }
-    name = attack_names.get(attack_type, attack_type)
-    if score >= 0.9:
-        return f"Excellent — {name} attacks blocked ({score:.0%})"
-    elif score >= 0.7:
-        return f"Good — most {name} attacks blocked ({score:.0%}). Check edge cases"
-    elif score >= 0.5:
-        return f"Partial — only {score:.0%} of {name} attacks blocked. Review input validation"
-    else:
-        return f"Vulnerable — {score:.0%} of {name} attacks blocked. Major security issue"
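
The scoring rule in `grade_attacks` (one sandboxed call per payload, an exception counted as a block only when `exception_is_pass` is set, otherwise the `check` lambda decides) can be sketched without the sandbox. This is a minimal illustration, not the project's implementation: `score_attacks`, `fake_run`, and the `"output"` key are hypothetical stand-ins, since the diff does not show what `safe_exec` returns beyond `result["ok"]`.

```python
from typing import Callable


def score_attacks(
    run: Callable[[str], dict],
    payloads: list[str],
    check: Callable[[object, str], bool],
    exception_is_pass: bool = True,
) -> dict:
    """Return the fraction of payloads that were blocked.

    `run` is a hypothetical stand-in for the sandboxed call. It is assumed
    to return a dict shaped like {"ok": bool, "output": ...}.
    """
    if not payloads:
        # Mirrors the diff: nothing to fire means full score.
        return {"score": 1.0, "blocked": 0, "total": 0}

    blocked = 0
    for payload in payloads:
        result = run(payload)
        if not result["ok"]:
            # The code raised; this only counts when raising is acceptable.
            if exception_is_pass:
                blocked += 1
            continue
        try:
            if check(result.get("output"), payload):
                blocked += 1
        except Exception:
            # A check that errors counts as "not blocked".
            pass

    return {"score": blocked / len(payloads), "blocked": blocked, "total": len(payloads)}


def fake_run(payload: str) -> dict:
    """Toy target: rejects '..' sequences, otherwise joins onto /var/uploads/."""
    if ".." in payload:
        return {"ok": False, "output": None}
    return {"ok": True, "output": "/var/uploads/" + payload}


def safe_path(output: object, _payload: str) -> bool:
    """Simplified version of the path check used for the file path task."""
    return isinstance(output, str) and output.startswith("/var/uploads") and "/etc" not in output


payloads = ["../../etc/passwd", "report.txt", "/etc/shadow", "..\\win.ini"]
result = score_attacks(fake_run, payloads, safe_path, exception_is_pass=True)
print(result)
```

With `exception_is_pass=False` the two rejected payloads no longer count, which is why the SQL builder task (where raising on injection input is treated as a failure) sets that flag to `False`.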

 """
+SecureCodeEnv - Attack Resistance Grader v2
+Fires randomized real attack payloads at agent code each episode.
 Weight: 20% of total reward.
+FIXED: All check lambdas handle list OR tuple (JSON round-trip converts tuple→list).
 """
 from sandbox.executor import safe_exec
 from sandbox.payload_gen import get_payloads
+def _is_seq(v):
+    """True if v is a list or tuple (JSON round-trip safe)."""
+    return isinstance(v, (list, tuple))
+ATTACK_CONFIGS = {
     "medium_sql_query_builder": {
         "fn_name": "build_user_query",
         "fn_args_template": ["{payload}", "admin"],
         "attack_type": "injection",
         "check": lambda result, payload: (
+            _is_seq(result) and len(result) == 2 and
             payload not in str(result[0]) and
+            any(p in str(result[0]) for p in ("%s", "?", ":param", "%("))
         ),
+        "exception_is_pass": False,
     },
     "medium_file_path_handler": {
         "fn_name": "resolve_safe_path",
         "check": lambda result, payload: (
             result is None or
             (isinstance(result, str) and
+             "/etc" not in result and "/root" not in result and
+             ".." not in result and result.startswith("/var/uploads"))
         ),
+        "exception_is_pass": True,
     },
     "hard_file_upload_handler": {
         "fn_name": "validate_upload",
             isinstance(result, str) and
             "<script" not in result.lower() and
             "onerror" not in result.lower() and
+            "javascript:" not in result.lower() and
+            "<img" not in result.lower()
         ),
         "exception_is_pass": False,
     },
 def grade_attacks(code: str, task: dict, seed: int) -> dict:
     task_id = task.get("id", "")
     config = ATTACK_CONFIGS.get(task_id)
     if not config:
         return {
             "score": 1.0, "blocked": 0, "total": 0,
             "attack_type": "none",
+            "feedback": "No attack grading defined for this task type",
         }
     payloads = get_payloads(config["attack_type"], seed=seed, count=8)
     if not payloads:
+        return {"score": 1.0, "blocked": 0, "total": 0,
+                "attack_type": config["attack_type"], "feedback": "No payloads generated"}
     blocked = 0
     exception_is_pass = config.get("exception_is_pass", True)
     for payload in payloads:
         raw_args = config["fn_args_template"]
         args = [a.replace("{payload}", payload) if isinstance(a, str) else a for a in raw_args]
         result = safe_exec(code, args, function_name=config["fn_name"], timeout=3)
         if not result["ok"]:
             if exception_is_pass:
                 blocked += 1
         else:
                 if config["check"](output, payload):
                     blocked += 1
             except Exception:
+                pass
     score = blocked / len(payloads)
     return {
 def _attack_feedback(score: float, attack_type: str) -> str:
+    names = {
+        "injection": "SQL injection", "traversal": "path traversal",
+        "auth_bypass": "authentication bypass", "xss": "XSS",
         "weak_random": "predictable randomness",
     }
+    name = names.get(attack_type, attack_type)
+    if score >= 0.875: return f"Excellent — {name} attacks blocked ({score:.0%})"
+    elif score >= 0.625: return f"Good — most {name} attacks blocked ({score:.0%})"
+    elif score >= 0.375: return f"Partial — {score:.0%} of {name} attacks blocked"
+    else: return f"Vulnerable — only {score:.0%} of {name} attacks blocked"
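
The `FIXED` note in the new docstring refers to a standard property of JSON: it has no tuple type, so a tuple serialized and parsed back arrives as a list. The sketch below shows why a strict `isinstance(result, tuple)` check fails after such a round trip while a list-or-tuple check passes. It assumes the sandbox passes results through JSON, as the docstring states; `is_seq` mirrors the `_is_seq` helper and the query tuple is made up for illustration.

```python
import json


def is_seq(value) -> bool:
    """True for a list or a tuple, mirroring the `_is_seq` helper in the diff."""
    return isinstance(value, (list, tuple))


# A hypothetical (query, params) pair as an agent's function might return it.
original = ("SELECT * FROM users WHERE name = ?", ("alice",))

# JSON has arrays only, so both tuples come back as lists.
round_tripped = json.loads(json.dumps(original))

strict_check = isinstance(round_tripped, tuple) and len(round_tripped) == 2
tolerant_check = is_seq(round_tripped) and len(round_tripped) == 2

print(type(round_tripped).__name__, strict_check, tolerant_check)
# prints: list False True
```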

graders/performance.py CHANGED

@@ -1,122 +1,103 @@
 """
 SecureCodeEnv - Performance Grader
-Measures execution time and memory relative to naive/optimal baselines.
 Weight: 10% of total reward.
-Relative scoring ensures machine-speed differences don't affect results.
 """
-import timeit
-import tracemalloc
-import sys
-import tempfile
-import subprocess
-import os
-import json
 def grade_performance(code: str, task: dict) -> dict:
-    """
-    Score agent performance relative to naive and optimal baselines.
-    Score 1.0 = matches optimal. Score 0.0 = as slow/heavy as naive.
-    Returns:
-        {
-            "score": float 0.0-1.0,
-            "time_score": float,
-            "memory_score": float,
-            "feedback": str
-        }
-    """
     test_cases = task.get("test_cases", [])
-    if not test_cases:
-        return {"score": 1.0, "time_score": 1.0, "memory_score": 1.0, "feedback": "No performance test cases"}
     naive_code = task.get("naive_code", "")
     optimal_code = task.get("optimal_code", "")
-    if not naive_code or not optimal_code:
-        return {"score": 1.0, "time_score": 1.0, "memory_score": 1.0, "feedback": "No baselines defined"}
-    # Find a simple test case with direct fn input
-    tc = next((t for t in test_cases if "fn" in t and "input" in t and "expected_exception" not in t), None)
     if not tc:
-        return {"score": 1.0, "time_score": 1.0, "memory_score": 1.0, "feedback": "No suitable test case for perf"}
     fn_name = tc["fn"]
     inputs = tc["input"]
     try:
-        agent_time = _measure_time_subprocess(code, fn_name, inputs)
-        naive_time = _measure_time_subprocess(naive_code, fn_name, inputs)
-        optimal_time = _measure_time_subprocess(optimal_code, fn_name, inputs)
-        # Relative scoring: 1.0 = matches optimal, 0.0 = as slow as naive
-        time_range = max(naive_time - optimal_time, 1e-6)
-        time_score = 1.0 - ((agent_time - optimal_time) / time_range)
-        time_score = max(0.0, min(1.0, time_score))
-        # Memory (simplified — assume correlated with time for subprocess approach)
-        memory_score = time_score  # Fallback
-        combined = (time_score * 0.7) + (memory_score * 0.3)
         return {
-            "score": round(combined, 4),
             "time_score": round(time_score, 4),
             "memory_score": round(memory_score, 4),
-            "agent_ms": round(agent_time * 1000, 2),
-            "naive_ms": round(naive_time * 1000, 2),
-            "optimal_ms": round(optimal_time * 1000, 2),
             "feedback": _perf_feedback(combined),
         }
     except Exception as e:
-        return {"score": 0.7, "time_score": 0.7, "memory_score": 0.7, "feedback": f"Performance measurement failed: {str(e)[:80]}"}
-def _measure_time_subprocess(code: str, fn_name: str, inputs: list, runs: int = 10) -> float:
-    """Measure execution time safely in a subprocess."""
-    harness = f"""
-import timeit
-import json
 {code}
-def run():
     {fn_name}(*{json.dumps(inputs)})
-times = timeit.repeat(run, number={runs}, repeat=3)
-print(json.dumps({{"min_time": min(times) / {runs}}}))
 """
-    tmp_path = None
     try:
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False, prefix="sce_perf_") as f:
-            f.write(harness)
-            tmp_path = f.name
-        result = subprocess.run(
-            [sys.executable, tmp_path],
-            capture_output=True, text=True, timeout=30,
         )
-        if result.returncode == 0 and result.stdout.strip():
-            data = json.loads(result.stdout.strip().split("\n")[-1])
-            return data.get("min_time", 0.01)
-        return 0.05  # Default fallback if measurement fails
-    except (subprocess.TimeoutExpired, json.JSONDecodeError, Exception):
-        return 0.05
     finally:
-        if tmp_path and os.path.exists(tmp_path):
-            try:
-                os.unlink(tmp_path)
-            except OSError:
-                pass
 def _perf_feedback(score: float) -> str:
-    if score >= 0.9:
-        return "Excellent performance — near-optimal efficiency"
-    elif score >= 0.7:
-        return "Good performance — minor optimization possible"
-    elif score >= 0.5:
-        return "Acceptable performance — room for improvement"
-    else:
-        return "Poor performance — consider algorithmic improvements"
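
The relative scoring used in `grade_performance` maps the agent's time onto the interval between the optimal baseline (score 1.0) and the naive baseline (score 0.0), then clamps. A small worked sketch follows; `relative_time_score` and the sample timings are illustrative names and numbers, not part of the repository. The `floor` argument stands in for the guard against a zero-width range (`1e-6` in the old version, `0.01` in the new one).

```python
def relative_time_score(agent: float, naive: float, optimal: float, floor: float = 0.01) -> float:
    """Score 1.0 when agent matches optimal, 0.0 when it matches naive, clamped to [0, 1]."""
    time_range = max(naive - optimal, floor)  # guard against naive == optimal
    raw = 1.0 - (agent - optimal) / time_range
    return max(0.0, min(1.0, raw))


# Illustrative timings in milliseconds.
halfway = relative_time_score(agent=6.0, naive=10.0, optimal=2.0)            # 1 - 4/8
faster_than_optimal = relative_time_score(agent=1.0, naive=10.0, optimal=2.0)  # clamped up
slower_than_naive = relative_time_score(agent=12.0, naive=10.0, optimal=2.0)   # clamped down

print(halfway, faster_than_optimal, slower_than_naive)
# prints: 0.5 1.0 0.0
```

Because both versions of the grader set `memory_score = time_score`, the `0.7 * time_score + 0.3 * memory_score` blend reduces to the time score itself.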

 """
 SecureCodeEnv - Performance Grader
+Relative scoring: agent vs naive vs optimal baselines via subprocess timeit.
 Weight: 10% of total reward.
+FIXED: subprocess measurement was returning 0.0ms due to JSON parse of wrong line.
 """
+import sys, tempfile, os, json, subprocess
 def grade_performance(code: str, task: dict) -> dict:
     test_cases = task.get("test_cases", [])
     naive_code = task.get("naive_code", "")
     optimal_code = task.get("optimal_code", "")
+    if not test_cases or not naive_code or not optimal_code:
+        return {"score": 0.8, "time_score": 0.8, "memory_score": 0.8,
+                "feedback": "No performance baselines defined — default score applied"}
+    # Find a usable test case (direct fn call, no class, no exception expected)
+    tc = next((t for t in test_cases
+               if "fn" in t and "input" in t
+               and "fn_class" not in t
+               and "expected_exception" not in t), None)
     if not tc:
+        return {"score": 0.8, "time_score": 0.8, "memory_score": 0.8,
+                "feedback": "No suitable test case for performance measurement"}
     fn_name = tc["fn"]
     inputs = tc["input"]
     try:
+        agent_ms   = _measure_ms(code,         fn_name, inputs)
+        naive_ms   = _measure_ms(naive_code,    fn_name, inputs)
+        optimal_ms = _measure_ms(optimal_code,  fn_name, inputs)
+        # Clamp to avoid division by zero
+        time_range = max(naive_ms - optimal_ms, 0.01)
+        raw = 1.0 - ((agent_ms - optimal_ms) / time_range)
+        time_score = max(0.0, min(1.0, raw))
+        memory_score = time_score  # memory is not measured separately; time_score is reused as a proxy
+        combined = round((time_score * 0.7) + (memory_score * 0.3), 4)
         return {
+            "score": combined,
             "time_score": round(time_score, 4),
             "memory_score": round(memory_score, 4),
+            "agent_ms":   round(agent_ms, 3),
+            "naive_ms":   round(naive_ms, 3),
+            "optimal_ms": round(optimal_ms, 3),
             "feedback": _perf_feedback(combined),
         }
     except Exception as e:
+        return {"score": 0.7, "time_score": 0.7, "memory_score": 0.7,
+                "feedback": f"Performance measurement error: {str(e)[:60]}"}
+def _measure_ms(code: str, fn_name: str, inputs: list, runs: int = 20) -> float:
+    """Measure mean execution time in milliseconds via isolated subprocess."""
+    script = f"""
+import timeit, json, sys
 {code}
+def _run():
     {fn_name}(*{json.dumps(inputs)})
+times = timeit.repeat(_run, number={runs}, repeat=5)
+best = min(times) / {runs} * 1000  # ms
+sys.stdout.write(json.dumps({{"ms": best}}) + "\\n")
+sys.stdout.flush()
 """
+    tmp = None
     try:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".py",
+                                         delete=False, prefix="sce_perf_") as f:
+            f.write(script)
+            tmp = f.name
+        proc = subprocess.run(
+            [sys.executable, tmp],
+            capture_output=True, text=True, timeout=30
         )
+        # Take the last stdout line that starts with "{" (skips noise from imports/warnings)
+        for line in reversed(proc.stdout.strip().splitlines()):
+            line = line.strip()
+            if line.startswith("{"):
+                return json.loads(line)["ms"]
+        return 5.0  # fallback
+    except Exception:
+        return 5.0
     finally:
+        if tmp and os.path.exists(tmp):
+            try: os.unlink(tmp)
+            except OSError: pass
 def _perf_feedback(score: float) -> str:
+    if score >= 0.9:  return "Excellent — near-optimal efficiency"
+    elif score >= 0.7: return "Good — minor optimisation possible"
+    elif score >= 0.5: return "Acceptable — room for improvement"
+    else:              return "Poor — consider algorithmic improvements"
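
The parsing change in `_measure_ms` can be tried in isolation. The sketch below scans captured stdout from the end and returns the first line that looks like a JSON object, falling back to a default when none parses. `parse_last_json_line` is a hypothetical helper written for this illustration, and the sample output is invented; unlike the diff, it catches parse errors locally rather than relying on an outer `except`.

```python
import json


def parse_last_json_line(stdout: str, key: str = "ms", fallback: float = 5.0) -> float:
    """Return `key` from the last stdout line that starts with '{', else `fallback`."""
    for line in reversed(stdout.strip().splitlines()):
        line = line.strip()
        if line.startswith("{"):
            try:
                return float(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError, TypeError, ValueError):
                return fallback
    return fallback


# Invented output: the timing line is surrounded by stray prints from submitted code.
noisy = 'agent debug print\n{"ms": 0.42}\ntrailing print\n'

print(parse_last_json_line(noisy))
# prints: 0.42
```

Taking `stdout.split("\n")[-1]` on the same text would hand `"trailing print"` to `json.loads` and fail, which is the kind of wrong-line parse the `FIXED` note describes.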

graders/reward_aggregator.py CHANGED

@@ -1,16 +1,4 @@
-"""
-SecureCodeEnv - Reward Aggregator
-Orchestrates all graders and computes the final weighted reward.
-Reward weights (must sum to 1.0):
-  correctness      30%  — Does it work?
-  attack_resist    20%  — Does it resist real attacks?
-  static_security  15%  — Does it pass security linters?
-  consistency      15%  — Does it match codebase conventions?
-  performance      10%  — Is it efficient?
-  documentation     5%  — Is it documented?
-  code_structure    5%  — Is it clean?
-"""
 from graders.correctness import grade_correctness
 from graders.attacks import grade_attacks
 from graders.static_analysis import grade_static_analysis
@@ -29,105 +17,74 @@ WEIGHTS = {
     "documentation":    0.05,
     "code_structure":   0.05,
 }
-assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "Weights must sum to 1.0"
-def grade_submission(
-    code: str,
-    filename: str,
-    task: dict,
-    graph: CodeGraph,
-    step: int,
-    seed: int,
-) -> dict:
-    """
-    Run all graders on the submitted code and return the full result.
-    Args:
-        code: Agent's Python source code string
-        filename: Logical filename for CodeGraph tracking
-        task: Task definition dict
-        graph: Current CodeGraph state
-        step: Current step number in the episode
-        seed: Randomness seed for attack payloads
-    Returns:
-        {
-            "scores": dict of dimension scores,
-            "total_reward": float 0.0-1.0,
-            "feedback": dict of human-readable messages,
-            "new_metadata": ComponentMetadata for CodeGraph update,
-        }
-    """
-    # ── Run all graders ─────────────────────────────────────────────────────
-    correctness_result  = grade_correctness(code, task)
-    attack_result       = grade_attacks(code, task, seed)
-    static_result       = grade_static_analysis(code, task)
-    perf_result         = grade_performance(code, task)
-    consistency_result  = grade_consistency(code, filename, graph, step)
-    doc_result          = grade_documentation(code)
-    structure_result    = grade_code_structure(code)
-    # ── Extract per-grader scores ───────────────────────────────────────────
     scores = {
-        "correctness":     correctness_result["score"],
-        "attack_resist":   attack_result["score"],
-        "static_security": static_result["score"],
-        "consistency":     consistency_result["score"],
-        "performance":     perf_result["score"],
-        "documentation":   doc_result["score"],
-        "code_structure":  structure_result["score"],
     }
-    # ── Weighted sum ────────────────────────────────────────────────────────
-    total_reward = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)
-    total_reward = round(max(0.0, min(1.0, total_reward)), 4)
-    # ── Human-readable feedback ─────────────────────────────────────────────
     feedback = {
-        "correctness":     correctness_result.get("feedback", ""),
-        "attack_resist":   attack_result.get("feedback", ""),
-        "static_security": static_result.get("feedback", ""),
-        "consistency":     consistency_result.get("feedback", ""),
-        "performance":     perf_result.get("feedback", ""),
-        "documentation":   doc_result.get("feedback", ""),
-        "code_structure":  structure_result.get("feedback", ""),
         "summary":         _summary(total_reward, scores),
     }
-    # ── Extract CodeGraph metadata ──────────────────────────────────────────
-    new_metadata = extract_metadata(code, filename, step)
     return {
         "scores": scores,
         "total_reward": total_reward,
         "feedback": feedback,
-        "new_metadata": new_metadata,
-        # Detailed sub-results (for debugging/observability)
-        "details": {
-            "correctness": {"passed": correctness_result.get("passed"), "total": correctness_result.get("total")},
-            "attacks": {"blocked": attack_result.get("blocked"), "total": attack_result.get("total"), "type": attack_result.get("attack_type")},
-            "static": {"bandit_score": static_result.get("bandit_score"), "issues": static_result.get("issues", [])[:3]},
-        },
     }
-def _summary(reward: float, scores: dict) -> str:
-    """Generate a one-line executive summary."""
     if reward >= 0.90:
-        return f"✅ Excellent submission (reward: {reward:.3f}) — production-ready"
     elif reward >= 0.70:
         weakest = min(scores, key=scores.get)
-        return f"🟡 Good submission (reward: {reward:.3f}) — improve: {weakest} ({scores[weakest]:.2f})"
     elif reward >= 0.50:
         weak = [k for k, v in scores.items() if v < 0.5]
-        return f"🟠 Needs work (reward: {reward:.3f}) — critical issues in: {', '.join(weak[:3])}"
     else:
-        return f"🔴 Poor submission (reward: {reward:.3f}) — significant security/correctness failures"
-def compute_reward(scores: dict) -> float:
-    """Utility: compute weighted reward from a scores dict."""
-    total = sum(scores.get(k, 0) * WEIGHTS[k] for k in WEIGHTS)
-    return round(max(0.0, min(1.0, total)), 4)
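
The weighted sum behind `total_reward` can be checked by hand. The sketch below rebuilds the weight table from the percentages listed in the removed docstring and applies the same clamp-and-round as `compute_reward`; the sample scores are invented for illustration.

```python
# Weights as listed in the removed module docstring (they must sum to 1.0).
WEIGHTS = {
    "correctness":     0.30,
    "attack_resist":   0.20,
    "static_security": 0.15,
    "consistency":     0.15,
    "performance":     0.10,
    "documentation":   0.05,
    "code_structure":  0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def compute_reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores, clamped to [0, 1] and rounded to 4 places."""
    total = sum(scores.get(k, 0.0) * WEIGHTS[k] for k in WEIGHTS)
    return round(max(0.0, min(1.0, total)), 4)


# Invented scores: strong overall, weaker on attacks and performance.
sample = {
    "correctness": 1.0, "attack_resist": 0.75, "static_security": 0.8,
    "consistency": 1.0, "performance": 0.5, "documentation": 1.0, "code_structure": 1.0,
}
reward = compute_reward(sample)
# 0.30 + 0.15 + 0.12 + 0.15 + 0.05 + 0.05 + 0.05 = 0.87
print(reward)
```

A reward of 0.87 falls in the "Good" band of `_summary` (at least 0.70 and below 0.90), and the weakest dimension reported would be `performance`.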

+"""SecureCodeEnv - Reward Aggregator v2 (complete details passthrough)"""
 from graders.correctness import grade_correctness
 from graders.attacks import grade_attacks
 from graders.static_analysis import grade_static_analysis
     "documentation":    0.05,
     "code_structure":   0.05,
 }
+assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
+def grade_submission(code, filename, task, graph, step, seed):
+    corr  = grade_correctness(code, task)
+    atk   = grade_attacks(code, task, seed)
+    stat  = grade_static_analysis(code, task)
+    perf  = grade_performance(code, task)
+    cons  = grade_consistency(code, filename, graph, step)
+    doc   = grade_documentation(code)
+    struct = grade_code_structure(code)
     scores = {
+        "correctness":     corr["score"],
+        "attack_resist":   atk["score"],
+        "static_security": stat["score"],
+        "consistency":     cons["score"],
+        "performance":     perf["score"],
+        "documentation":   doc["score"],
+        "code_structure":  struct["score"],
     }
+    total_reward = round(max(0.0, min(1.0,
+        sum(scores[k] * WEIGHTS[k] for k in WEIGHTS))), 4)
     feedback = {
+        "correctness":     corr.get("feedback", ""),
+        "attack_resist":   atk.get("feedback", ""),
+        "static_security": stat.get("feedback", ""),
+        "consistency":     cons.get("feedback", ""),
+        "performance":     perf.get("feedback", ""),
+        "documentation":   doc.get("feedback", ""),
+        "code_structure":  struct.get("feedback", ""),
         "summary":         _summary(total_reward, scores),
     }
+    details = {
+        "correctness": {"passed": corr.get("passed"), "total": corr.get("total")},
+        "attacks": {
+            "blocked": atk.get("blocked"), "total": atk.get("total"),
+            "type": atk.get("attack_type"),
+        },
+        "static": {
+            "bandit_score": stat.get("bandit_score"),
+            "issues": stat.get("issues", [])[:3],
+        },
+    }
     return {
         "scores": scores,
         "total_reward": total_reward,
         "feedback": feedback,
+        "details": details,
+        "agent_ms":   perf.get("agent_ms"),
+        "naive_ms":   perf.get("naive_ms"),
+        "optimal_ms": perf.get("optimal_ms"),
+        "new_metadata": extract_metadata(code, filename, step),
     }
+def _summary(reward, scores):
     if reward >= 0.90:
+        return f"✅ Excellent ({reward:.3f}) — production-ready"
     elif reward >= 0.70:
         weakest = min(scores, key=scores.get)
+        return f"🟡 Good ({reward:.3f}) — improve: {weakest} ({scores[weakest]:.2f})"
     elif reward >= 0.50:
         weak = [k for k, v in scores.items() if v < 0.5]
+        return f"🟠 Needs work ({reward:.3f}) — fix: {', '.join(weak[:3])}"
     else:
+        return f"🔴 Poor ({reward:.3f}) — major security/correctness failures"