# Training with GRPO on API Debug Environment Trains a small LLM using **GRPO** (Group Relative Policy Optimization) on the live API Debug Environment with **curriculum learning**. ## What is GRPO? For each prompt, GRPO: 1. Generates multiple completions (debug attempts) 2. Scores each with the environment's grader (reward signal) 3. Updates the model to prefer higher-scoring responses Over thousands of episodes, the LLM learns to debug API requests purely from reward signals -- no labelled data needed. ## Curriculum Learning The training auto-promotes through difficulty levels: | Level | Task | Threshold | Max Turns | Skill | |-------|------|-----------|-----------|-------| | 1 | easy | 0.7 avg reward | 3 | Identify single error type + fields | | 2 | classify | 0.6 avg reward | 4 | Identify ALL error types + fields | | 3 | medium | 0.6 avg reward | 5 | Fix the broken request body | | 4 | headers | 0.5 avg reward | 4 | Fix header-level errors | | 5 | response | 0.5 avg reward | 4 | Validate API response issues | | 6 | hard | -- | 7 | Fix mixed errors + explain reasoning | Promotion happens when the rolling average reward (window=10) exceeds the threshold for the current level. ## Architecture ``` Dataset prompt ("Debug this broken API request.") | GRPOTrainer calls rollout_func() | rollout_func() connects to live HF Space via WebSocket | env.reset(task=current_task) -> broken API request | LLM generates JSON response -> env.step(action) -> reward | (repeat up to max_turns) Returns: prompt_ids, completion_ids, logprobs, env_reward | reward_from_env() extracts env_reward | GRPO updates model weights | maybe_promote() checks if agent should advance to next task ``` ## Run on Google Colab (free T4 GPU) ```python # Cell 1 -- Install !pip install trl>=0.26.0 transformers torch datasets openenv-core openai # Cell 2 -- Clone repo !git clone https://github.com/Avi-chauhan/api-debug-env.git %cd api-debug-env # Cell 3 -- Train !python training/train.py ``` ## Requirements - GPU: T4 or better (free Colab works) - RAM: 8GB+ - The live HF Space must be running: https://huggingface.co/spaces/avichauhan/api-debug-env