RL Architecture: API Debug Environment
Overview
This environment trains LLM agents to debug malformed API requests, a task that appears constantly in production systems. The agent receives a broken HTTP request and an API specification, then must identify the error, fix the request, and explain its reasoning. This maps directly to how developers debug API integration failures every day.
Episode Lifecycle
reset(task="easy"|"medium"|"hard")
    → picks 1 of 30 API spec templates at random
    → generates a valid request from the spec
    → injects 1-3 errors via error injectors
    → returns broken request + spec as observation
for step in 1..max_steps:
    step(action: APIDebugAction)
        → grades the agent's response
        → applies the step decay multiplier
        → returns feedback + reward
    if raw_score >= 0.95 or step == max_steps:
        done = True
        return best_reward (highest achieved this episode)
The agent can iterate - each step gets the previous step's structured feedback, allowing multi-turn refinement.
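The lifecycle above can be illustrated with a toy simulation in which a stub agent's raw score improves each step. All names here are hypothetical stand-ins, not the environment's real interface:

```python
def step_multiplier(step: int) -> float:
    """The step decay multiplier, floored at 0.3x (formula from Reward Design)."""
    return max(1.0 - 0.1 * (step - 1), 0.3)

# Stand-in for an agent refining its fix across three turns of feedback.
raw_scores = [0.4, 0.7, 0.95]

best_reward = 0.0
for step, raw in enumerate(raw_scores, start=1):
    reward = raw * step_multiplier(step)
    best_reward = max(best_reward, reward)   # episode returns the best seen
    if raw >= 0.95:                          # early-exit threshold
        break
# best_reward ends up ≈ 0.76 (0.95 × the step-3 multiplier of 0.8)
```

Note how the step-3 success (0.95 × 0.8 = 0.76) beats the step-2 attempt (0.7 × 0.9 = 0.63), so the returned best_reward rewards the later, better fix.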
Reward Design
Formula
reward = raw_score × max(1.0 - 0.1 × (step - 1), 0.3)
| Step | Multiplier | Effect |
|---|---|---|
| 1 | 1.0x | Full reward for immediate correct diagnosis |
| 2 | 0.9x | Small penalty for needing a second attempt |
| 5 | 0.6x | Moderate penalty for slow convergence |
| 8+ | 0.3x floor | Agent still gets credit for late fixes |
At episode end, the best reward achieved across all steps is returned. This means:
- An agent that gets it right on step 1 scores highest
- An agent that improves over multiple steps still gets its best score
- Agents are incentivized to be efficient, not just eventually correct
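The decay formula can be sketched as a small helper; the function name is illustrative, not the environment's actual API:

```python
def compute_reward(raw_score: float, step: int) -> float:
    """Apply the step decay multiplier to the raw grading score."""
    multiplier = max(1.0 - 0.1 * (step - 1), 0.3)  # floor at 0.3x
    return raw_score * multiplier

# The multiplier table above, checked directly:
assert compute_reward(1.0, 1) == 1.0                 # full credit on step 1
assert abs(compute_reward(1.0, 5) - 0.6) < 1e-9      # slow convergence
assert compute_reward(1.0, 8) == 0.3                 # floor reached at step 8
```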
Per-Task Scoring
Easy (error identification):
raw_score = 0.6 × type_match + 0.4 × jaccard(agent_fields, gt_fields)
type_match: 1.0 if error_type is correct, 0.0 otherwise
jaccard: |intersection| / |union| for partial field credit
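Under these definitions the easy-task grader reduces to a few lines. This is a sketch; the helper names are assumptions:

```python
def jaccard(a: set, b: set) -> float:
    """|intersection| / |union|, with 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def easy_score(type_correct: bool, agent_fields: set, gt_fields: set) -> float:
    # 0.6 weight on naming the right error type, 0.4 on the affected fields
    return 0.6 * float(type_correct) + 0.4 * jaccard(agent_fields, gt_fields)
```

For example, a correct error type with fields {"email", "name"} against ground truth {"email", "amount"} earns 0.6 + 0.4 × (1/3) ≈ 0.733: partial field credit instead of an all-or-nothing score.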
Medium (request fix):
raw_score = per-field validation against spec
Checks cover: required fields present, field types correct, no unknown fields, with equal weight across checks. If a header error was injected: 0.8 × body_score + 0.2 × header_score.
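A minimal sketch of the per-field validation, assuming a simplified spec shape of {"required": {field: python_type}} (the real specs are richer):

```python
def validate_request(request: dict, spec: dict) -> float:
    """Return the fraction of checks passed, with equal weight per check."""
    checks = []
    for fname, ftype in spec["required"].items():
        checks.append(fname in request)                        # presence
        checks.append(isinstance(request.get(fname), ftype))   # type
    checks.append(all(k in spec["required"] for k in request)) # no unknowns
    return sum(checks) / len(checks)

# Example: one type error out of five checks.
spec = {"required": {"email": str, "amount": int}}
broken = {"email": "a@b.com", "amount": "500"}   # amount should be an integer
```

Here validate_request(broken, spec) passes 4 of 5 checks (the amount type check fails), giving a raw score of 0.8.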
Hard (fix + explain):
raw_score = 0.7 × fix_score + 0.3 × explanation_score
fix_score uses medium grading. explanation_score uses an LLM judge (gpt-4o-mini) with ground-truth-aware prompting, with a keyword + length heuristic fallback.
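The keyword + length fallback might look roughly like this; the exact thresholds and weighting here are assumptions, not the grader's actual heuristic:

```python
def explanation_heuristic(explanation: str, gt_keywords: set) -> float:
    """Keyword coverage against ground truth, discounted for bad length."""
    words = set(explanation.lower().split())
    coverage = len(gt_keywords & words) / len(gt_keywords) if gt_keywords else 0.0
    n_words = len(explanation.split())
    length_factor = 1.0 if 10 <= n_words <= 200 else 0.5  # assumed bounds
    return coverage * length_factor

def hard_score(fix_score: float, explanation_score: float) -> float:
    # The 70/30 split from the formula above.
    return 0.7 * fix_score + 0.3 * explanation_score
```

The point of the fallback is availability, not precision: if the LLM judge is unreachable, the environment still returns a bounded, monotonic explanation score.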
Infinite Episode Generation
The key RL training advantage: every episode is unique.
30 API specs × 10 error types = 300 base combinations
Hard task: 2-3 simultaneous errors from 10 types = ~720 combinations
Total: thousands of distinct episodes
Contrast with fixed-fixture environments: an agent can memorize the answer after episode 1. Our generator forces the agent to learn a generalizable debugging strategy.
API spec domains:
- Payment (Stripe-like): 5 specs
- User Management: 5 specs
- Content (GitHub-like): 5 specs
- Messaging (Twilio-like): 5 specs
- E-Commerce: 5 specs
- Calendar and Auth: 5 specs
Error types: missing_required_field, wrong_field_type, invalid_email_format, missing_auth_header, extra_unknown_field, null_value_in_required, wrong_http_method, malformed_json_value, invalid_enum_value, datetime_format_error.
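One of these injectors, missing_required_field, can be sketched as follows; this is a hypothetical helper for a single error type, shown only to illustrate the broken-request + ground-truth pairing:

```python
import copy
import random

def inject_missing_required_field(request: dict, required: list) -> tuple:
    """Delete one required field; return (broken_request, ground_truth)."""
    broken = copy.deepcopy(request)        # never mutate the valid request
    victim = random.choice(required)
    del broken[victim]
    ground_truth = {
        "error_type": "missing_required_field",
        "affected_fields": [victim],
    }
    return broken, ground_truth

valid = {"email": "a@b.com", "name": "Ada"}
broken, gt = inject_missing_required_field(valid, ["email", "name"])
```

Because the ground truth is recorded at injection time, grading never has to re-derive what went wrong: the error type and affected fields travel with the episode.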
Grading Philosophy
Why Hybrid Grading (Deterministic + LLM)
Easy and medium tasks have objectively correct answers: a field is either present or missing, a type is either correct or wrong. Pure deterministic grading is appropriate.
Hard task requires evaluating explanation quality - whether the agent communicated the root cause clearly to a developer. This is inherently subjective and benefits from LLM judgment. The LLM judge receives the ground truth (actual error types and affected fields) so it scores the explanation against what was actually wrong, not in isolation.
The 70/30 split (fix vs. explain) reflects production reality: a correct fix without explanation leaves developers unable to prevent future recurrences.
Structured Feedback
Every step returns machine-readable feedback:
Validation: 5/7 checks passed.
email: PRESENT
name: PRESENT
email type: VALID
amount type: INVALID (expected integer, got string)
Authorization header: MISSING
This lets the agent know exactly which fields are still wrong, enabling targeted multi-turn improvement rather than blind re-guessing.
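A feedback string in this shape can be assembled from per-check results; the input structure here is an assumption for illustration:

```python
def build_feedback(results: list) -> str:
    """results: list of (label, passed, status_text) tuples (assumed shape)."""
    passed = sum(1 for _, ok, _ in results if ok)
    lines = [f"Validation: {passed}/{len(results)} checks passed."]
    lines += [f"{label}: {status}" for label, _, status in results]
    return "\n".join(lines)

feedback = build_feedback([
    ("email", True, "PRESENT"),
    ("amount type", False, "INVALID (expected integer, got string)"),
])
```

Keeping the per-check labels stable across steps is what makes the feedback machine-readable: the agent can diff consecutive feedback strings to confirm a fix landed.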
Reward Range Compliance
All rewards are strictly in [0.0, 1.0]:
- raw_score is always in [0.0, 1.0] (ratios and weighted averages)
- step_multiplier is always in [0.3, 1.0]
- reward = raw_score × step_multiplier is therefore in [0.0, 1.0]
- best_reward = max(rewards seen) is also in [0.0, 1.0]
No reward shaping pushes values outside this range.
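The range argument can be spot-checked by sampling; the reward helper below simply restates the formula from the Reward Design section:

```python
import random

def reward(raw_score: float, step: int) -> float:
    return raw_score * max(1.0 - 0.1 * (step - 1), 0.3)

# Sample many (raw_score, step) pairs and confirm the composed
# reward never leaves [0.0, 1.0].
random.seed(0)
samples = [(random.random(), random.randint(1, 20)) for _ in range(1000)]
assert all(0.0 <= reward(r, s) <= 1.0 for r, s in samples)
```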
Concurrency
SUPPORTS_CONCURRENT_SESSIONS = True. The server supports up to 10 concurrent environments (max_concurrent_envs=10). Each session maintains independent state: spec, broken request, ground truth, step count, and best reward. Sessions are identified by episode_id.
Action Space
APIDebugAction fields (all optional, submit what you have):
| Field | Type | Task |
|---|---|---|
| error_type | string | easy, hard |
| affected_fields | list[string] | easy, hard |
| fixed_request | string (JSON) | medium, hard |
| fixed_headers | dict | medium, hard |
| explanation | string | hard |
The agent can submit a partial action (diagnosis only, fix only, or everything). This makes multi-turn interaction natural: the agent can refine each component independently.
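A plain-Python sketch of the action model, assuming every field defaults to None (the environment's real model may be a pydantic class with validation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class APIDebugAction:
    # All fields optional: submit what you have.
    error_type: Optional[str] = None
    affected_fields: Optional[list] = None
    fixed_request: Optional[str] = None    # JSON-encoded request body
    fixed_headers: Optional[dict] = None
    explanation: Optional[str] = None

# A partial, diagnosis-only action, as used for the easy task:
diagnosis = APIDebugAction(error_type="missing_auth_header",
                           affected_fields=["Authorization"])
```

Defaulting everything to None is what makes partial submissions natural: the grader can score only the components that are present and leave feedback slots open for the rest.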