Environment Stats
9 Security Tasks across 3 difficulty levels
7 Reward Dimensions
12+ CWE IDs Covered
5 Max Steps per Episode
Design Principles
⚔️
Dynamic Attack Grading
Real SQL injection, path traversal, JWT bypass, and XSS payloads are fired at submitted code each episode. Payloads are seeded-random so agents cannot memorise them.
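A minimal sketch of how seeded-random payload selection could work, assuming each episode supplies a fresh seed. The pool contents and the name `sample_payloads` are hypothetical; the real harness's payloads and selection logic are not shown on this page.

```python
import random

# Hypothetical payload pool for illustration only.
PAYLOAD_POOL = {
    "sql_injection": [
        "' OR '1'='1",
        "'; DROP TABLE users; --",
        "admin'--",
        "' UNION SELECT NULL--",
    ],
    "path_traversal": [
        "../../etc/passwd",
        "..\\..\\windows\\win.ini",
        "%2e%2e%2fetc%2fpasswd",
    ],
    "xss": [
        "<script>alert(1)</script>",
        "<img src=x onerror=alert(1)>",
        "javascript:alert(1)",
    ],
}


def sample_payloads(attack_type: str, seed: int, k: int = 2) -> list[str]:
    """Pick k distinct payloads for one episode, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(PAYLOAD_POOL[attack_type], k)
```

Because the generator is seeded, a grading run can be replayed exactly for debugging, while an agent that does not know the seed cannot predict which payloads it will face.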
🧠
CodeGraph Memory
The agent's codebase context accumulates across steps. Naming conventions, error handling patterns, and type hint usage are tracked and enforced across submissions.
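One way to picture convention tracking across steps is a small memory object that records the naming style of earlier submissions and scores new code against it. This is only an illustrative sketch: `ConventionMemory` is a hypothetical stand-in, not the actual CodeGraph implementation, and it covers function naming only.

```python
import ast
import re
from dataclasses import dataclass

SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def _function_names(source: str) -> list[str]:
    """Collect the names of all plain (non-async) functions defined in the source."""
    return [n.name for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]


@dataclass
class ConventionMemory:
    """Hypothetical stand-in for CodeGraph: remembers naming styles seen so far."""
    snake_case_votes: int = 0
    other_votes: int = 0

    def observe(self, source: str) -> None:
        """Record the naming style of every function in an accepted submission."""
        for name in _function_names(source):
            if SNAKE_CASE.match(name):
                self.snake_case_votes += 1
            else:
                self.other_votes += 1

    def consistency(self, source: str) -> float:
        """Fraction of functions in `source` that follow the majority style so far."""
        names = _function_names(source)
        if not names:
            return 1.0
        prefer_snake = self.snake_case_votes >= self.other_votes
        matches = sum(bool(SNAKE_CASE.match(name)) == prefer_snake for name in names)
        return matches / len(names)
```

The same pattern would extend to error handling and type hints by adding further counters per convention.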
🔒
Security Gate
An episode cannot be marked done unless attack resistance ≥ 75%, static security ≥ 70%, and correctness ≥ 80%. Functional code that is insecure will never pass.
📈
Dense Reward Signal
7 dimensions give partial credit at every step. Agents receive granular feedback on what to improve rather than a binary pass/fail signal.
Quick Start
Use the Playground tab for interactive testing, or call the API directly from any HTTP client.
import requests

BASE = "http://localhost:7860"  # replace with your deployed URL

# 1. Start episode
ep = requests.post(f"{BASE}/reset", json={"difficulty": "medium"}).json()
sid = ep["session_id"]
print(ep["problem_statement"])

# 2. Submit code (graded across 7 dimensions)
result = requests.post(f"{BASE}/step", json={
    "session_id": sid,
    "code": "def build_user_query(u, r):\n return ('SELECT * FROM users WHERE username=%s', (u,))",
    "filename": "solution.py"
}).json()
print(f"reward = {result['total_reward']:.3f}")
print(result["feedback"]["summary"])
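The single submission above extends naturally to a multi-step episode. The sketch below resubmits candidate solutions until the response reports `done` or the step budget is used up, assuming the 5-step limit listed under Environment Stats. `run_episode` and the injected `post` callable are hypothetical names, introduced so the loop can be exercised without a running server.

```python
from typing import Callable, Iterable

# Hypothetical transport: takes (path, json_body) and returns the parsed JSON response.
Post = Callable[[str, dict], dict]


def run_episode(post: Post, candidates: Iterable[str],
                difficulty: str = "medium", max_steps: int = 5) -> list[float]:
    """Submit candidates in order until the episode is done or steps run out."""
    episode = post("/reset", {"difficulty": difficulty})
    rewards = []
    for step, code in enumerate(candidates):
        if step >= max_steps:
            break  # respect the per-episode step budget
        result = post("/step", {
            "session_id": episode["session_id"],
            "code": code,
            "filename": "solution.py",
        })
        rewards.append(result["total_reward"])
        if result["done"]:
            break  # the server marked the episode complete
    return rewards
```

With `requests`, the transport can be a one-liner such as `lambda path, body: requests.post(f"{BASE}{path}", json=body).json()`.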
Endpoints
GET /health
Health check. Returns status, version, and tasks_loaded count.
POST /reset
Start new episode. Body: {"difficulty":"medium"} or {"task_id":"..."}. Returns task, starter code, and initial CodeGraph.
POST /step
Submit code. Body: {"session_id":"...","code":"...","filename":"..."}. Returns reward + per-dimension scores + feedback.
GET /state
Current episode state. Query param: session_id
GET /tasks
List all tasks. Optional filter: ?difficulty=easy
GET /tasks/{id}
Full task detail: problem statement, starter code, security checks.
GET /docs
Auto-generated Swagger UI with full schema documentation.
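For the read-only endpoints, the URLs can be assembled with the standard library so that query values and path segments are escaped correctly. The helper names below are hypothetical, and `BASE` is the same local placeholder used in the Quick Start.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE = "http://localhost:7860"  # placeholder; replace with your deployed URL


def state_url(session_id: str, base: str = BASE) -> str:
    """URL for GET /state with the session_id query parameter."""
    return f"{base}/state?{urlencode({'session_id': session_id})}"


def tasks_url(difficulty: Optional[str] = None, base: str = BASE) -> str:
    """URL for GET /tasks, optionally filtered by difficulty."""
    if difficulty is None:
        return f"{base}/tasks"
    return f"{base}/tasks?{urlencode({'difficulty': difficulty})}"


def task_url(task_id: str, base: str = BASE) -> str:
    """URL for GET /tasks/{id}; the id is percent-encoded as a single path segment."""
    return f"{base}/tasks/{quote(task_id, safe='')}"
```

Each returned string can be passed directly to `requests.get` or any other HTTP client.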
Reward Dimensions
| Dimension | Weight | Tool | Measures |
| --- | --- | --- | --- |
| correctness | 25% | Custom runner | Test cases passed |
| attack_resist | 25% | Dynamic harness | Real attack payloads blocked |
| static_security | 20% | bandit + AST | CWE-mapped vulnerability patterns |
| consistency | 10% | CodeGraph | Convention adherence across steps |
| performance | 8% | timeit | Speed vs naive/optimal baselines |
| documentation | 7% | AST | Docstrings + type hint coverage |
| code_structure | 5% | AST | No bare print/except, clean structure |
⚠ Security gate: episode cannot complete unless attack_resist ≥ 0.75 AND static_security ≥ 0.70 AND correctness ≥ 0.80
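The weights and gate thresholds above can be expressed in a few lines. This sketch assumes the total reward is a plain weighted sum of the per-dimension scores, which this page does not state explicitly; the server may apply further adjustments, so treat the result as an approximation. The function names are hypothetical.

```python
# Weights copied from the Reward Dimensions table (they sum to 1.0).
WEIGHTS = {
    "correctness": 0.25,
    "attack_resist": 0.25,
    "static_security": 0.20,
    "consistency": 0.10,
    "performance": 0.08,
    "documentation": 0.07,
    "code_structure": 0.05,
}

# Minimum scores required by the security gate.
GATE_THRESHOLDS = {
    "attack_resist": 0.75,
    "static_security": 0.70,
    "correctness": 0.80,
}


def weighted_total(scores: dict) -> float:
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def security_gate_passed(scores: dict) -> bool:
    """True only if every gated dimension meets its minimum score."""
    return all(scores[name] >= minimum for name, minimum in GATE_THRESHOLDS.items())
```

Under this assumption, a submission with perfect correctness but `attack_resist` of 0.5 still earns substantial reward, yet fails the gate and cannot complete the episode.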
Step Response Example
{
  "total_reward": 0.847,
  "scores": {
    "correctness": 1.0,
    "attack_resist": 0.875,
    "static_security": 0.9,
    "consistency": 0.75,
    "performance": 0.6,
    "documentation": 0.75,
    "code_structure": 0.8
  },
  "feedback": {
    "summary": "🟡 Good (0.847) — improve: consistency (0.75)",
    "attack_resist": "Good — SQL injection attacks blocked (87%)",
    "security_gate": "PASSED"
  },
  "details": {
    "correctness": {"passed": 5, "total": 5},
    "attacks": {"blocked": 7, "total": 8, "type": "injection"},
    "security_gate_passed": true
  },
  "done": false,
  "step_count": 1
}
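A client can use the fields of this response to decide what to fix next. The sketch below relies only on the keys shown in the example; the helper names are hypothetical and the sample dictionary is a trimmed copy of the response above.

```python
def weakest_dimensions(response: dict, n: int = 2) -> list[str]:
    """Return the n lowest-scoring dimensions, lowest first (ties keep response order)."""
    scores = response["scores"]
    return sorted(scores, key=scores.get)[:n]


def block_rate(response: dict) -> float:
    """Fraction of attack payloads that the submission blocked."""
    attacks = response["details"]["attacks"]
    return attacks["blocked"] / attacks["total"]


# Trimmed copy of the example response above.
sample = {
    "scores": {
        "correctness": 1.0,
        "attack_resist": 0.875,
        "static_security": 0.9,
        "consistency": 0.75,
        "performance": 0.6,
        "documentation": 0.75,
        "code_structure": 0.8,
    },
    "details": {"attacks": {"blocked": 7, "total": 8, "type": "injection"}},
}

print(weakest_dimensions(sample))  # ['performance', 'consistency']
print(block_rate(sample))          # 0.875
```

The block rate of 7/8 matches the `attack_resist` score of 0.875 in the example.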