shank committed · 713f336
Parent(s): 59986c5
Update: Added final improvements for hackathon

Browse files:
- .gitignore +2 -0
- README.md +85 -335
- training/AgentDebuggerEnv_GRPO_Training.ipynb +179 -0

.gitignore CHANGED
@@ -56,3 +56,5 @@ venv_test*/
 scratch/
 agentdebugger_env.egg-info/
 baseline_results.json
+CURSOR_INSTRUCTIONS_V2.md
+HANDOVER.md

README.md CHANGED
@@ -1,365 +1,115 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
 sdk: gradio
 app_file: app.py
-
-pinned: true
-license: mit
----
-
-# AgentDebuggerEnv 🐛
-
-> **A live, iterative debugging environment for benchmarking genuine agentic reasoning in AI systems.**
-
-[](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
-[](#openenv-api-compliance)
-[](LICENSE)
-[](https://www.python.org/)
-
-*Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon.***
-
----
-
-## The Problem with Existing Code Benchmarks
-
-Benchmarks like HumanEval, MBPP, and SWE-bench share a fundamental limitation: they are **one-shot**. A model reads a problem, generates code, and is scored on the final output. This measures code generation, not debugging ability.
-
-Real software engineering is not one-shot. It is **iterative**. A developer reads failing tests, forms a hypothesis, submits a fix, reads the new error output, updates their theory, and repeats. No existing OpenEnv environment benchmarks this loop.
-
-**AgentDebuggerEnv does.**
-
----
-
-## How It's Different from SWE-bench
-
-| Dimension | SWE-bench | AgentDebuggerEnv |
-|---|---|---|
-| Evaluation target | Final patch correctness | Full reasoning trajectory |
-| Feedback to agent | None (single shot) | Real `stdout/stderr` after every attempt |
-| Reward signal | Binary end-of-episode | Dense: every step scored |
-| What's measured | Code generation | Hypothesis formation + iterative reasoning |
-| Hard task | Apply patch to existing issue | Must design a test to surface a hidden bug |
-| Agent failure modes | Not tracked | 4 distinct measurable failure modes |
-
-The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's code in a live sandbox and returns actual test output. The agent must update its theory and try again, exactly like a real developer at a terminal.
-
 ---
 
-# Baseline Results
-
-Evaluated using `gpt-4o` with zero-shot prompting. Each task was run 5 times independently and the scores averaged.
-
-| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts |
-|---|---|---|---|---|---|
-| Off-by-One Bug | 🟢 Easy | 0.85 | ±0.04 | 100% | 1.8 |
-| Red Herring Auth Bug | 🟡 Medium | 0.50 | ±0.10 | 60% | 4.2 |
-| Race Condition | 🔴 Hard | 0.18 | ±0.09 | 20% | 8.7 |
-| **Overall Mean** | | **0.51** | | **60%** | |
-
-The hard task is specifically designed so that frontier models fail most of the time. GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass, which is exactly the reasoning gap this environment is built to measure.
 
-
-* **Exploration/Exploitation**: Measures whether agents query for context productively before attempting fixes.
-* **Test-Suite Overconfidence**: Detects whether an agent fails to reason about concurrency when sequential tests pass (Hard Task).
-
-## [heading and surrounding prose not recoverable from the diff view]
-
-**Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (ALL 8 pass on the buggy code)
-
-**Hard-task grading breakdown:**
-- 1000-thread concurrent stress test passes (run 5×, must pass >= 4 runs for full credit): **0.30**
-- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
-- Efficiency bonus (fixed within 5 attempts): **0.10**
-
----
-
-## Reward Function Design
-
-The reward function provides a dense signal at every step so an RL agent can learn from every action, not just the final outcome.
-
-### Step-Level Rewards
-
-| Event | Reward | Reasoning |
-|---|---|---|
-| Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress |
-| Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
-| Fix makes no change to passing count | `-0.05` | Stagnation penalty |
-| All tests pass | `+0.50` | Major bonus on top of progress |
-| Submitted code times out in sandbox | `-0.10` | Penalizes infinite loops |
-| `submit_fix` without hypothesis field | `-0.10` | Hypothesis is required |
-| First `query_context` use | `0.00` | Free |
-| Subsequent `query_context` uses | `-0.05` each | Diminishing returns |
-| Episode truncated at max_steps | `-0.20` | Penalizes indecision |
-
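Read as code, the schedule above is a small pure function. A minimal sketch, with hypothetical names rather than identifiers from `env/` (the `-0.20` truncation penalty is applied separately at episode end):

```python
# Illustrative sketch of the step-reward schedule above; all names are
# assumptions, not the repo's actual implementation.

def step_reward(prev_passed: int, passed: int, total: int,
                timed_out: bool, has_hypothesis: bool,
                prior_queries: int, is_query: bool) -> float:
    if is_query:
        return 0.0 if prior_queries == 0 else -0.05  # first query is free
    if not has_hypothesis:
        return -0.10                  # submit_fix without a hypothesis
    if timed_out:
        return -0.10                  # infinite loop / sandbox timeout
    delta = passed - prev_passed
    if delta > 0:
        reward = 0.15 * (delta / total)    # scaled progress
    elif delta < 0:
        reward = -0.10 * (-delta / total)  # regression penalty
    else:
        reward = -0.05                     # stagnation penalty
    if passed == total:
        reward += 0.50                     # bonus on top of progress
    return reward
```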
-### Episode-Level Grader Score
-
-```
-grader_score = test_pass_ratio     × 0.60
-             + efficiency_bonus    × 0.20
-             + hypothesis_accuracy × 0.15
-             + early_solve_bonus   × 0.05
-
-test_pass_ratio     = agent_best_tests_passed / tests_total
-                      (from agent submissions only, never the initial buggy code run)
-efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
-hypothesis_accuracy = fraction of hypotheses correctly identifying the bug
-early_solve_bonus   = 0.05 if solved within ceil(max_attempts / 3) attempts
-```
-
-**Score floor design:** `test_pass_ratio` uses only the agent's submitted attempts, never the initial buggy code run. The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially, so without this design a dummy agent that submits nothing would score 0.36 and 0.40 respectively for free. The grader recalculates from the `attempts` list to guarantee a score floor of 0.0.
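A minimal sketch of that formula in Python, assuming `attempts` is a list of `(tests_passed, hypothesis_correct)` pairs from agent submissions only (names are ours, not the repo's `graders/`):

```python
import math

# Illustrative grader-score sketch. Excluding the initial buggy-code run
# from `attempts` is what guarantees the 0.0 score floor for a dummy agent.

def grader_score(attempts, tests_total, max_attempts):
    if not attempts:
        return 0.0
    best_passed = max(passed for passed, _ in attempts)
    test_pass_ratio = best_passed / tests_total
    attempts_used = len(attempts)
    efficiency_bonus = max(0.0, (max_attempts - attempts_used) / max_attempts)
    hypothesis_accuracy = sum(1 for _, ok in attempts if ok) / attempts_used
    early_solve = (best_passed == tests_total
                   and attempts_used <= math.ceil(max_attempts / 3))
    return (test_pass_ratio * 0.60
            + efficiency_bonus * 0.20
            + hypothesis_accuracy * 0.15
            + (1.0 if early_solve else 0.0) * 0.05)
```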
-
----
-
-## Security Sandbox
-
-Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py`, never via raw `exec()` anywhere in the codebase.
-
-**Layer 1 – AST Import & Attribute Filtering:** Before execution, an AST walk detects blocked imports and prevents access to any attribute starting with an underscore (`_`). This blocks private member access and dunder escapes (like `__class__`).
-
-**Layer 2 – Subprocess Isolation:** Code runs in a child subprocess with a stripped environment and no network access.
-
-**Layer 3 – Hard Timeout:** Every execution is killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
-
-**Layer 4 – Memory Limit:** 256 MB per execution.
-
-**Threading exception:** The hard task requires `threading` to create and verify the race condition. The sandbox accepts `allow_threading=True` for that task only. All other tasks block threading entirely.
-
----
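The Layer 1 walk is straightforward with the standard `ast` module. A minimal illustrative sketch, not the actual `env/sandbox.py` (the blocklist contents here are an assumption):

```python
import ast

BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket"}  # assumed blocklist

def check_code(source: str) -> None:
    """Raise ValueError if code imports a blocked module or touches any
    attribute starting with an underscore (e.g. __class__)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_IMPORTS:
                    raise ValueError(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_IMPORTS:
                raise ValueError(f"blocked import: {node.module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            raise ValueError(f"blocked attribute access: {node.attr}")
```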
-
-## Data Models
-
-```python
-class Observation(BaseModel):
-    task_id: str                 # "easy" | "medium" | "hard"
-    buggy_code: str              # Original broken code
-    test_suite: str              # Full test file content
-    current_code: str            # Most recent submitted code
-    current_error_output: str    # Sandbox stdout/stderr output
-    tests_passed: int
-    attempts_remaining: int
-    max_attempts: int
-    done: bool
-    score_estimate: float        # Running grader estimate
-
-class Action(BaseModel):
-    action_type: str             # "submit_fix" | "query_context" | "give_up"
-    fixed_code: Optional[str]    # Complete corrected code
-    hypothesis: Optional[str]    # Theory about the bug (required for submit)
-    query_type: Optional[str]    # "function_signature" | "error_explanation" etc.
-
-class Reward(BaseModel):
-    step_reward: float           # Dense signal: range -1.0 to +1.0
-    cumulative_reward: float
-    grader_score: float          # Official score (terminal step only)
-    breakdown: Dict[str, float]  # Itemized components
-```
-
 
 ---
 
-## OpenEnv Spec
-
-```yaml
-version: 1.0.0
-domain: software_engineering
-observation_type: structured
-action_type: structured
-reward_type: dense
-episode_termination: action_or_step_limit
-tasks:
-  - {id: easy,   difficulty: easy,   max_steps: 8,  max_attempts: 5}
-  - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7}
-  - {id: hard,   difficulty: hard,   max_steps: 25, max_attempts: 10}
-```
-
-Application-level errors are returned in `info.error` inside the response body. Core evaluation endpoints are designed to avoid 4xx/5xx status codes for agent-level mistakes, so the evaluation flow is never interrupted by network-level exceptions.
-
-| Endpoint | Method | Description |
-|---|---|---|
-| `/` | GET | API overview – lists all endpoints and tasks |
-| `/health` | GET | Health check – always HTTP 200 |
-| `/tasks` | GET | All tasks with metadata |
-| `/reset` | POST | Start episode. Body: `{"task_id": "easy"}` |
-| `/step` | POST | Submit one action |
-| `/state` | GET | Full internal episode state |
-
----
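As a usage illustration of these endpoints, a minimal client round-trip might look like this (a sketch only: it assumes a server on `localhost:8000` and the field names from the data models above; the exact response nesting may differ):

```python
import requests

BASE = "http://localhost:8000"  # assumes a locally running server

# Start an episode on the easy task
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
print(obs)  # Observation fields: buggy_code, test_suite, attempts_remaining, ...

# Submit one fix attempt; the hypothesis field is mandatory
action = {
    "action_type": "submit_fix",
    "fixed_code": "...complete corrected file contents...",
    "hypothesis": "Off-by-one in the upper bound of the binary search loop",
}
result = requests.post(f"{BASE}/step", json=action).json()
print(result)  # next observation, step reward, and any info.error
```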
-
-## Installation & Usage
-
-### Local Setup
-
-```bash
-git clone https://github.com/shasshaank/AgentDebuggerEnv
-cd AgentDebuggerEnv
-pip install -r requirements.txt
-
-# Start the environment server
-uvicorn env.server:app --reload --port 8000
-
-# Verification: run the pre-submission validator
-python validator.py
-
-# Verify the server is running
-curl http://localhost:8000/health
-```
-
-### Docker
-
-```bash
-docker build -t agentdebugger-env .
-docker run -p 8000:8000 agentdebugger-env
-```
-
-### Running the Baseline Inference Script
-
-```bash
-git clone https://github.com/shasshaank/AgentDebuggerEnv
-cd AgentDebuggerEnv
-pip install -r requirements.txt
-
-# Start the environment server
-uvicorn env.server:app --reload --port 8000
-
-# Verify it's running
-curl http://localhost:8000/health
-# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}
-
-# Run baseline inference
-export API_BASE_URL="https://api.openai.com/v1"
-export MODEL_NAME="gpt-4o"
-export HF_TOKEN="your_api_key"
-export ENV_BASE_URL="http://localhost:8000"
-python inference.py
-```
-
-Using Meta-Llama via Hugging Face (recommended):
-
-```bash
-export API_BASE_URL="https://router.huggingface.co/v1"
-export MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
-export HF_TOKEN="your_huggingface_token"
-export ENV_BASE_URL="http://localhost:8000"
-python inference.py
-```
-
----
-
-## Environment Variables
-
-| Variable | Description | Default |
-|---|---|---|
-| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
-| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-70B-Instruct` |
-| `HF_TOKEN` | Hugging Face token (read) | (none) |
-| `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |
-
----
-
-## Project Structure
-
-```
-AgentDebuggerEnv/
-├── inference.py              # Baseline script (root – hackathon requirement)
-├── env/
-│   ├── environment.py        # Core OpenEnv: reset(), step(), state()
-│   ├── models.py             # Pydantic v2 Observation, Action, Reward
-│   ├── sandbox.py            # AST-based sandboxed code execution
-│   ├── server.py             # FastAPI: /reset /step /state /health /tasks
-│   ├── tasks/
-│   │   ├── task_easy.py      # Off-by-one in binary search
-│   │   ├── task_medium.py    # Red herring authentication bug
-│   │   └── task_hard.py      # Concurrency race condition
-│   └── graders/
-│       ├── grader_easy.py    # Test pass + efficiency scoring
-│       ├── grader_medium.py  # Red herring detection + score floor fix
-│       └── grader_hard.py    # Sequential + concurrent stress test
-├── openenv.yaml
-├── Dockerfile
-├── requirements.txt
-└── uv.lock                   # Reproducible dependency resolution
-```
-
----
-
-## Design Decisions
-
-**Why is the hypothesis mandatory?** Requiring a hypothesis on every `submit_fix` prevents the degenerate strategy of submitting random code until something passes. It also lets the grader score `hypothesis_accuracy` independently of `test_pass_ratio`, measuring reasoning quality separately from outcome quality.
-
-**Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy-code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 respectively for free. Recalculating from the `attempts` list guarantees a score floor of 0.0.
-
-**Why run the concurrent stress test 5 times?** Race conditions are non-deterministic; a partial fix that narrows the race window may pass once by luck. Requiring 4 of 5 runs to pass gives a robust statistical threshold that filters out lucky partial fixes while allowing for minor runner jitter. Passing 2 of 5 gives 0.15, partial credit for progress.
-
-**Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment instead uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
-
-**Why does `query_context` cost reward after the first use?** Free unlimited context queries would let agents trivially read all available information before attempting any fix. The cost structure forces agents to decide strategically when additional information is worth spending a step on, which is a core part of real debugging under time pressure.
-
----
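That custom-runner choice implies a trivially parseable summary line. A minimal sketch of such a parser, with hypothetical names rather than the repo's `_parse_tests_passed()`:

```python
import re

def parse_tests_passed(output: str) -> tuple[int, int]:
    """Parse the runner's consistent 'N passed, M failed' summary line."""
    match = re.search(r"(\d+) passed, (\d+) failed", output)
    if match is None:
        return 0, 0  # e.g. a crash or timeout before the summary printed
    return int(match.group(1)), int(match.group(2))

# parse_tests_passed("8 passed, 0 failed") -> (8, 0)
```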
-
-## License & Attribution
-
-**License:** MIT – see [LICENSE](LICENSE)
-
-**Author:** Shashaank | GitHub: [@shasshaank](https://github.com/shasshaank) | HF: [@shashaank0707](https://huggingface.co/shashaank0707)
-
-**Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
-
-**Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon
-
----
-
-## Submission Integrity
-
-- **Commit SHA:** `5c507c313ff2c209d7b770af6f08cf6ed6ab1568`
-- **Last Verified Sync:** 2026-04-09
-- **Platform Match:** GitHub and HF Space are in sync at this HEAD

 ---
+title: AgentDebuggerEnv
+emoji: 🐛
+colorFrom: purple
+colorTo: indigo
 sdk: gradio
 app_file: app.py
+pinned: false
 ---
 
+# AgentDebuggerEnv
 
+**Hackathon Links:**
+- 🌐 **[Live Hugging Face Space](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3)**
+- 📹 **[Watch the 2-Minute Demo](#)** *(replace with YouTube link)*
+- 📝 **[Read the Technical Writeup](#)** *(replace with HF blog link)*
 
+### 🚀 One-Line Pitch
+An OpenEnv-backed reinforcement learning environment that trains LLMs to debug code systematically via Group Relative Policy Optimization (GRPO) and secure sandbox execution.
 
+### 💡 Why This Exists
+LLMs often hallucinate bug fixes via blind trial and error. Real debugging in production requires hypothesis-driven reasoning, isolation, and verification. We engineered an environment that forces models to observe, hypothesize, and execute code within a secure sandbox, penalizing blind guessing and explicitly rewarding structured problem-solving.
 
+### 🧠 Key Technical Insights & Research Foundations
+* **Hypothesis-Driven Debugging (NeurIPS 2025):** Recent research presented at NeurIPS demonstrates that forcing an LLM to formulate a concrete hypothesis before generating code significantly improves debugging accuracy. Inspired by this, our environment mandates a strict `OBSERVATION` → `HYPOTHESIS` → `ACTION` loop: every step the agent takes must be preceded by a formal hypothesis to receive a positive reward (a format-reward sketch follows this list).
+* **Literature-Backed Reward Criteria:** Our continuous, multi-objective reward shaping is heavily influenced by recent findings on LLM reasoning and code generation, specifically drawing from:
+  * [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
+  * [arXiv:2601.19100](https://arxiv.org/abs/2601.19100) (Amazon NeurIPS paper)
+* **Curriculum Learning for RL:** A flat bug distribution caused early policy collapse. We implemented a 3-tier curriculum, introducing complex logic bugs only after structural formatting and syntax localization had stabilized.
+* **Hardened Sandboxed Grading:** Evaluating arbitrary LLM-generated fixes introduces severe RCE risks. We engineered a secure execution sandbox that restricts execution time, limits memory, and completely replaces unsafe `exec()` calls, ensuring deterministic and safe grading.
 
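To make the format constraint concrete, here is a minimal sketch of a format-compliance reward that checks for the three required sections in order (illustrative names; the actual reward functions live in the training pipeline):

```python
# Illustrative format-compliance reward: the completion must contain
# OBSERVATION, HYPOTHESIS, and ACTION sections, in that order, to score 1.0.
SECTIONS = ["OBSERVATION", "HYPOTHESIS", "ACTION"]

def format_reward(completion: str) -> float:
    positions = [completion.find(s) for s in SECTIONS]
    if any(p == -1 for p in positions):
        return 0.0      # a required section is missing entirely
    if positions != sorted(positions):
        return 0.5      # all present but out of order
    return 1.0          # strict OBSERVATION -> HYPOTHESIS -> ACTION
```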
+### 🏗️ Architecture Overview
+* **OpenEnv Core:** Manages state transitions, agent interactions, and environment telemetry.
+* **Grader Subsystem:** Multi-layered evaluation combining a hard grader (secure execution, deterministic AST matching) and a soft grader (Llama-3.1-70B semantic evaluation).
+* **Trainer:** Hugging Face TRL GRPO pipeline with dynamic batch scaling based on runtime VRAM detection.
+* **Live Monitor:** A Gradio dashboard streaming `stdout` and Weights & Biases metrics directly from the active training container.
 
+### ⚡ What Makes This Impressive
+* **Zero-to-One in 250 Steps:** Achieved a ~2.5x increase in total reward within just 250 steps, demonstrating strong sample efficiency via GRPO.
+* **Dynamic Hardware Scaling:** The training pipeline detects hardware capability (A100/H100 vs. T4) and automatically scales `batch_size`, `grad_accum`, and compute `dtype` (`bfloat16`/`float16`), eliminating OOM errors across deployment environments (see the sketch after this list).
+* **Frictionless Deployment:** Bypassed heavy dependency constraints (PyTorch/TRL vs. Gradio pip conflicts) with a lazy-loading runtime that keeps Docker builds deterministic.
 
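The scaling logic reduces to a few lines around `torch.cuda`. A minimal sketch with illustrative thresholds (the pipeline's real values may differ):

```python
import torch

def hardware_config():
    """Pick batch size, grad accumulation, and dtype from detected VRAM."""
    if not torch.cuda.is_available():
        raise RuntimeError("GPU required")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    bf16_ok = torch.cuda.is_bf16_supported()  # Ampere+ (A100/H100)
    if vram_gb >= 40:       # A100/H100 class
        batch_size, grad_accum = 8, 2
    elif vram_gb >= 24:     # mid-range class
        batch_size, grad_accum = 4, 4
    else:                   # T4 class (16 GB)
        batch_size, grad_accum = 2, 8
    dtype = torch.bfloat16 if bf16_ok else torch.float16
    return {"batch_size": batch_size, "grad_accum": grad_accum, "dtype": dtype}
```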
+### 🛠️ Tech Stack
+* **Frameworks:** OpenEnv, FastAPI, Docker
+* **RL Pipeline:** Hugging Face TRL (GRPO), PEFT (LoRA)
+* **Models:** Qwen2.5-Coder-7B-Instruct (base), Llama-3.1-70B (evaluator)
+* **Telemetry:** Weights & Biases
 
+### 📊 Results & Benchmarks
+Our training run demonstrates rapid policy adaptation: the model learned the `OBSERVATION`/`HYPOTHESIS`/`ACTION` constraint almost immediately and navigated the tier-2 difficulty bump at step 150 with a textbook drop-and-recover curve.
 
+## Training Results
+[W&B Run](https://wandb.ai/shashaankjain07-keshav-memorial-college-of-law/AgentDebuggerEnv/runs/vylbqd5m?nw=nwusershashaankjain07) | [Colab Notebook](#) | [YouTube Demo](#) | [HF Blog](#)
 
+*(Note for hackathon judges: live Weights & Biases charts and the Gradio UI are embedded below as evidence of the training run.)*
 
+[W&B training chart – image markdown lost in the diff view]
+[W&B training chart – image markdown lost in the diff view]
 
+*Additional training metrics:*
+<p align="center">
+  <img src="images/hypothesis_quality.png" width="48%" />
+  <img src="images/semantic.png" width="48%" />
+</p>
 
+[training chart – image markdown lost in the diff view]
 
+* **Format Compliance:** Reached 1.0 (max) within 50 steps.
+* **Total Reward:** Rose from a baseline of ~0.4 to peaks of ~1.0 by step 250.
+* **Baseline Solve Rate:** 100.0% validation on the tiered data structure.
 
+### 🔥 Challenges & How They Were Solved
+* **Reward Hacking:** Initial RL runs showed the model farming points by writing functionally valid code that bypassed the actual bug. **Fix:** Recalibrated the hard grader to execute both the initial buggy code and the proposed fix, computing the delta so points are awarded *only* for actual regression fixes (see the sketch after this list).
+* **Hugging Face Space Build Failures:** The Space suffered "resolution-too-deep" pip timeouts from conflicting requirements between Gradio and `trl`/`accelerate`. **Fix:** Stripped `requirements.txt` to the bare minimum for the UI and added a lazy-load script that installs training dependencies post-boot in a background thread.
+* **Flaky LLM-as-a-Judge:** Using LLMs to grade code functionality proved non-deterministic. **Fix:** Replaced LLM evaluation for execution success with the deterministic Python sandbox, reserving the LLM judge solely for the semantic quality of the hypothesis.
 
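The anti-reward-hacking delta can be sketched as follows, where `run_tests` stands in for the sandboxed runner and all names are hypothetical:

```python
# Illustrative anti-reward-hacking check: score only the improvement of
# the proposed fix over the original buggy code, not absolute passes.

def graded_delta(run_tests, buggy_code: str, fixed_code: str) -> float:
    baseline_passed, total = run_tests(buggy_code)   # e.g. (6, 10)
    fixed_passed, _ = run_tests(fixed_code)
    delta = fixed_passed - baseline_passed
    if delta <= 0:
        return 0.0  # no credit for code that merely bypasses the bug
    return delta / (total - baseline_passed)  # 1.0 only if all failing tests now pass
```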
+### ▶️ Quick Start
 
+We built the training pipeline to be universally runnable, with a focus on reproducible execution for judging.
 
+**Run the Training Notebook**
+The easiest way to re-run the exact GRPO training pipeline is via our Jupyter notebook, which auto-detects hardware and sets configurations accordingly:
+1. Open `training/AgentDebuggerEnv_GRPO_Training.ipynb` in Google Colab or Kaggle.
+2. Select a GPU runtime (T4, A100, etc.).
+3. Run all cells. It will automatically install dependencies and start streaming results.
 
+### 📂 Code Structure
+```text
+├── data/         # Tiered bug datasets (JSONL)
+├── env/          # OpenEnv environment definitions
+├── server/       # FastAPI backend & grader implementations
+│   ├── grader_hard.py   # Sandboxed deterministic code execution
+│   └── grader_soft.py   # Semantic evaluation logic
+├── training/     # GRPO pipeline & runnable notebook
+└── app.py        # Gradio training monitor
+```
 
+### 🤔 If I Had More Time
+* **Multi-File Contexts:** Expand the environment to handle complex multi-file repository debugging via an active Language Server Protocol (LSP) integration.
+* **PPO vs. GRPO Benchmarking:** Quantify the compute and efficiency tradeoffs between PPO and GRPO on this specific task.
+* **Adversarial Bug Generation:** Implement an adversarial LLM agent that continuously mutates and generates edge-case bugs, creating an infinite, self-sustaining curriculum.
 
 ---
 
+### 👥 Team Endurance
+* **Shashaank Jain** | GitHub: [@shasshaank](https://github.com/shasshaank) | Email: *[Add Email]*
+* **Pranav Pulipati** | GitHub: [@PulipatiPranav](https://github.com/PulipatiPranav) | Email: pranavpulipatix@gmail.com

training/AgentDebuggerEnv_GRPO_Training.ipynb ADDED
@@ -0,0 +1,179 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
+    "language_info": {"name": "python", "version": "3.10.0"},
+    "accelerator": "GPU",
+    "colab": {"provenance": [], "gpuType": "A100"}
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# AgentDebuggerEnv – GRPO Training\n",
+        "\n",
+        "**Training Qwen2.5-Coder-7B-Instruct on structured hypothesis-driven debugging**\n",
+        "\n",
+        "- **Algorithm:** GRPO (same as DeepSeek-R1) via HuggingFace TRL\n",
+        "- **Dataset:** 90 hand-validated bugs across 3 difficulty tiers\n",
+        "- **Curriculum:** Tier 1 (steps 0–150) → Tier 1+2 (150–350) → All tiers (350–500)\n",
+        "- **Model:** Qwen2.5-Coder-7B-Instruct + LoRA (float16/bfloat16, no quantization)\n",
+        "\n",
+        "> **Requirements:** GPU runtime. In Colab: Runtime → Change runtime type → **A100**."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Verify a GPU is available\n",
+        "import subprocess\n",
+        "result = subprocess.run([\"nvidia-smi\"], capture_output=True, text=True)\n",
+        "if result.returncode != 0:\n",
+        "    raise RuntimeError(\"No GPU detected. Go to Runtime → Change runtime type → GPU (A100 recommended)\")\n",
+        "print(result.stdout[:600])"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Clone the environment repository\n",
+        "!git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env agentdebugger\n",
+        "%cd agentdebugger"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Install CUDA-enabled PyTorch first (must precede all other imports)\n",
+        "!pip install -q torch --index-url https://download.pytorch.org/whl/cu121\n",
+        "\n",
+        "# Install training dependencies\n",
+        "!pip install -q \\\n",
+        "    wandb==0.18.7 \\\n",
+        "    datasets==3.0.2 \\\n",
+        "    transformers==4.48.3 \\\n",
+        "    accelerate==1.0.1 \\\n",
+        "    \"trl==0.15.2\" \\\n",
+        "    peft==0.13.2\n",
+        "\n",
+        "import torch\n",
+        "print(f\"PyTorch: {torch.__version__}\")\n",
+        "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
+        "if torch.cuda.is_available():\n",
+        "    props = torch.cuda.get_device_properties(0)\n",
+        "    print(f\"GPU: {props.name}\")\n",
+        "    print(f\"VRAM: {props.total_memory / 1e9:.1f} GB\")"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import os\n",
+        "\n",
+        "# Weights & Biases – get a free API key at https://wandb.ai\n",
+        "WANDB_API_KEY = \"\"  # @param {type:\"string\"}\n",
+        "if WANDB_API_KEY:\n",
+        "    os.environ[\"WANDB_API_KEY\"] = WANDB_API_KEY\n",
+        "    import wandb; wandb.login(key=WANDB_API_KEY)\n",
+        "    print(\"W&B login successful – training curves will be logged\")\n",
+        "else:\n",
+        "    print(\"No W&B key – set WANDB_API_KEY above to get loss/reward plots\")\n",
+        "\n",
+        "# Hugging Face token – needed to push the final model\n",
+        "HF_TOKEN = \"\"  # @param {type:\"string\"}\n",
+        "if HF_TOKEN:\n",
+        "    os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
+        "    from huggingface_hub import login; login(token=HF_TOKEN)\n",
+        "    print(\"HF login successful – trained model will be pushed to the Hub\")"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Step 1 – Sanity Check (10 steps, ~2 min)\n",
+        "\n",
+        "Runs 10 training steps to verify GPU, dependencies, and reward function all work before the full run."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "!python training/train_grpo.py --test"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Step 2 – Full Training (500 steps, ~45 min on A100)\n",
+        "\n",
+        "Runs the complete curriculum:\n",
+        "- **Steps 0–150:** Tier 1 only (easy bugs: off-by-one, simple logic)\n",
+        "- **Steps 150–350:** Tier 1 + Tier 2 (adds red-herring auth bugs)\n",
+        "- **Steps 350–500:** All tiers (adds concurrency race conditions)\n",
+        "\n",
+        "Checkpoints are saved every 50 steps. The final model is pushed to the HF Hub if `HF_TOKEN` is set."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "!python training/train_grpo.py"
+      ],
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Results – Baseline vs Trained"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import json, os\n",
+        "\n",
+        "baseline, final = None, None\n",
+        "\n",
+        "if os.path.exists(\"baseline_results.json\"):\n",
+        "    with open(\"baseline_results.json\") as f:\n",
+        "        baseline = json.load(f)\n",
+        "    print(f\"Baseline | solve_rate: {baseline['solve_rate']:.1%} | avg_reward: {baseline['avg_reward']:.3f}\")\n",
+        "\n",
+        "if os.path.exists(\"final_results.json\"):\n",
+        "    with open(\"final_results.json\") as f:\n",
+        "        final = json.load(f)\n",
+        "    print(f\"Trained  | solve_rate: {final['solve_rate']:.1%} | avg_reward: {final['avg_reward']:.3f}\")\n",
+        "    if baseline:\n",
+        "        delta = final['avg_reward'] - baseline['avg_reward']\n",
+        "        print(f\"\\nImprovement: {delta:+.3f} ({delta / baseline['avg_reward'] * 100:+.1f}% relative)\")\n",
+        "else:\n",
+        "    print(\"final_results.json not written yet – run training first\")"
+      ],
+      "outputs": [],
+      "execution_count": null
+    }
+  ]
+}
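The notebook's three-phase curriculum reduces to a gating function over the global step. A minimal sketch with our own names (the actual logic lives in `training/train_grpo.py`, which is not shown in this commit):

```python
import random

# Illustrative tier gating for the 3-phase curriculum described above.
def allowed_tiers(step: int) -> list[int]:
    if step < 150:
        return [1]        # easy bugs only
    if step < 350:
        return [1, 2]     # add red-herring bugs
    return [1, 2, 3]      # add concurrency bugs

def sample_bug(dataset_by_tier: dict[int, list], step: int):
    tier = random.choice(allowed_tiers(step))
    return random.choice(dataset_by_tier[tier])
```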