# Training with GRPO on API Debug Environment
Trains a small LLM using **GRPO** (Group Relative Policy Optimization)
on the live API Debug Environment with **curriculum learning**.
## What is GRPO?
For each prompt, GRPO:
1. Generates multiple completions (debug attempts)
2. Scores each with the environment's grader (reward signal)
3. Updates the model to prefer higher-scoring responses
Over thousands of episodes, the LLM learns to debug API requests
purely from reward signals -- no labelled data needed.
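The core of GRPO is that each completion is scored relative to its sibling completions for the same prompt, rather than against a learned value baseline. A minimal sketch of that group-relative advantage (toy reward values standing in for the environment's grader):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each completion's reward against its group's statistics.

    GRPO replaces a learned value baseline with the group mean: completions
    that beat their siblings get positive advantages, the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mu) / sigma for r in rewards]

# Four debug attempts for one prompt, scored by the grader (illustrative values)
advantages = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
```

Advantages within a group always sum to zero, so the model is pushed toward the better attempts and away from the worse ones without needing an absolute notion of "good".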
## Curriculum Learning
The training loop auto-promotes the agent through difficulty levels:
| Level | Task | Threshold | Max Turns | Skill |
|-------|------|-----------|-----------|-------|
| 1 | easy | 0.7 avg reward | 3 | Identify single error type + fields |
| 2 | classify | 0.6 avg reward | 4 | Identify ALL error types + fields |
| 3 | medium | 0.6 avg reward | 5 | Fix the broken request body |
| 4 | headers | 0.5 avg reward | 4 | Fix header-level errors |
| 5 | response | 0.5 avg reward | 4 | Validate API response issues |
| 6 | hard | -- | 7 | Fix mixed errors + explain reasoning |
Promotion happens when the rolling average reward (window=10) exceeds
the threshold for the current level.
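The promotion rule can be sketched as follows. The `Curriculum` class and its method names here are illustrative, not the repo's actual API; the level/threshold pairs mirror the table above:

```python
from collections import deque

# (task, promotion threshold) per level; the final level has no threshold
CURRICULUM = [
    ("easy", 0.7), ("classify", 0.6), ("medium", 0.6),
    ("headers", 0.5), ("response", 0.5), ("hard", None),
]

class Curriculum:
    def __init__(self, window=10):
        self.level = 0
        self.rewards = deque(maxlen=window)  # rolling window of episode rewards

    def record(self, reward):
        self.rewards.append(reward)

    def maybe_promote(self):
        """Advance one level once the rolling average clears the threshold."""
        task, threshold = CURRICULUM[self.level]
        if threshold is None or len(self.rewards) < self.rewards.maxlen:
            return task  # final level, or not enough episodes yet
        if sum(self.rewards) / len(self.rewards) > threshold:
            self.level += 1
            self.rewards.clear()  # start the new level with a fresh window
        return CURRICULUM[self.level][0]
```

Clearing the window on promotion prevents high rewards from an easy level from immediately triggering a second promotion on the harder one.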
## Architecture
```
Dataset prompt ("Debug this broken API request.")
|
GRPOTrainer calls rollout_func()
|
rollout_func() connects to live HF Space via WebSocket
|
env.reset(task=current_task) -> broken API request
|
LLM generates JSON response -> env.step(action) -> reward
| (repeat up to max_turns)
Returns: prompt_ids, completion_ids, logprobs, env_reward
|
reward_from_env() extracts env_reward
|
GRPO updates model weights
|
maybe_promote() checks if agent should advance to next task
```
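The rollout loop in the diagram has roughly this shape. The stub environment below stands in for the live HF Space (the real environment is reached over WebSocket via `openenv-core`, whose client API is not shown here), and `rollout` is a simplified stand-in for the repo's `rollout_func()`:

```python
def rollout(env, generate, max_turns):
    """Run one episode: reset, then alternate model action / env feedback."""
    observation = env.reset()           # broken API request to debug
    total_reward = 0.0
    for _ in range(max_turns):
        action = generate(observation)  # LLM emits a JSON debug attempt
        observation, reward, done = env.step(action)
        total_reward += reward          # grader's signal for this turn
        if done:
            break
    return total_reward

class StubEnv:
    """Toy environment: rewards the action 'fix' once, then ends the episode."""
    def reset(self):
        return "broken request"

    def step(self, action):
        solved = action == "fix"
        obs = "done" if solved else "still broken"
        return obs, (1.0 if solved else 0.0), solved

episode_reward = rollout(StubEnv(), generate=lambda obs: "fix", max_turns=3)
```

In training, `episode_reward` is what `reward_from_env()` hands back to `GRPOTrainer` as the score for that completion.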
## Run on Google Colab (free T4 GPU)
```python
# Cell 1 -- Install
!pip install "trl>=0.26.0" transformers torch datasets openenv-core openai
# Cell 2 -- Clone repo
!git clone https://github.com/Avi-chauhan/api-debug-env.git
%cd api-debug-env
# Cell 3 -- Train
!python training/train.py
```
## Requirements
- GPU: T4 or better (free Colab works)
- RAM: 8GB+
- The live HF Space must be running:
https://huggingface.co/spaces/avichauhan/api-debug-env