# Training with GRPO on API Debug Environment

Trains a small LLM using **GRPO** (Group Relative Policy Optimization)
on the live API Debug Environment with **curriculum learning**.

## What is GRPO?

For each prompt, GRPO:
1. Generates multiple completions (debug attempts)
2. Scores each with the environment's grader (reward signal)
3. Updates the model to prefer higher-scoring responses

Over thousands of episodes, the LLM learns to debug API requests
purely from reward signals -- no labelled data needed.
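The group-relative scoring in steps 1–3 can be sketched as follows. This is illustrative only; TRL's `GRPOTrainer` computes these advantages internally. The function name `group_relative_advantages` is ours, not part of TRL:

```python
def group_relative_advantages(rewards):
    """Score each completion relative to its group (one prompt's samples).

    rewards: list of grader scores for the completions of one prompt.
    Returns one advantage per completion: positive for above-average
    completions, negative for below-average ones.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four debug attempts for one prompt: the 0.8-reward attempt gets a
# positive advantage, the 0.2-reward attempt a negative one.
print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))
```

The model update then increases the probability of completions with positive advantage, which is why no labelled data is needed: the grader's reward is the only supervision.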

## Curriculum Learning

The training loop automatically promotes the agent through difficulty levels:

| Level | Task | Threshold | Max Turns | Skill |
|-------|------|-----------|-----------|-------|
| 1 | easy | 0.7 avg reward | 3 | Identify single error type + fields |
| 2 | classify | 0.6 avg reward | 4 | Identify ALL error types + fields |
| 3 | medium | 0.6 avg reward | 5 | Fix the broken request body |
| 4 | headers | 0.5 avg reward | 4 | Fix header-level errors |
| 5 | response | 0.5 avg reward | 4 | Validate API response issues |
| 6 | hard | -- | 7 | Fix mixed errors + explain reasoning |

Promotion happens when the rolling average reward (window=10) exceeds
the threshold for the current level.
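The promotion rule can be sketched like this. The real `maybe_promote()` lives in `training/train.py`; the `Curriculum` class, `CURRICULUM` table, and method names below are illustrative stand-ins, with thresholds taken from the table above:

```python
from collections import deque

# (task, promotion threshold); None marks the final level.
CURRICULUM = [("easy", 0.7), ("classify", 0.6), ("medium", 0.6),
              ("headers", 0.5), ("response", 0.5), ("hard", None)]

class Curriculum:
    def __init__(self, window=10):
        self.level = 0
        self.rewards = deque(maxlen=window)  # rolling reward window

    def record(self, reward):
        self.rewards.append(reward)

    def maybe_promote(self):
        """Advance a level once the rolling average clears the threshold."""
        task, threshold = CURRICULUM[self.level]
        if threshold is None or len(self.rewards) < self.rewards.maxlen:
            return task  # final level, or window not yet full
        if sum(self.rewards) / len(self.rewards) > threshold:
            self.level += 1
            self.rewards.clear()  # start a fresh window on the new task
        return CURRICULUM[self.level][0]
```

Clearing the window on promotion matters: otherwise high rewards earned on the easier task would count toward promotion from the harder one.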

## Architecture
```
Dataset prompt ("Debug this broken API request.")
     |
GRPOTrainer calls rollout_func()
     |
rollout_func() connects to live HF Space via WebSocket
     |
env.reset(task=current_task) -> broken API request
     |
LLM generates JSON response -> env.step(action) -> reward
     |  (repeat up to max_turns)
Returns: prompt_ids, completion_ids, logprobs, env_reward
     |
reward_from_env() extracts env_reward
     |
GRPO updates model weights
     |
maybe_promote() checks if agent should advance to next task
```
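The rollout loop in the middle of the diagram can be sketched as below. The real `rollout_func()` talks to the live HF Space over WebSocket; here `StubEnv` and the `generate` callable are stand-ins so the control flow is runnable on its own:

```python
import json

class StubEnv:
    """Stand-in for the live environment: one broken request, one fix."""
    def reset(self, task):
        return {"observation": "POST /users with body {'nme': 'x'}"}

    def step(self, action):
        correct = action.get("field") == "nme"
        return {"reward": 1.0 if correct else 0.0, "done": correct}

def rollout(env, generate, task, max_turns):
    """Run one episode: reset, then act until done or max_turns."""
    obs = env.reset(task=task)
    total_reward = 0.0
    for _ in range(max_turns):
        action = json.loads(generate(obs))  # LLM emits a JSON action
        result = env.step(action)
        total_reward += result["reward"]
        if result["done"]:
            break
        obs = result  # feed the grader's feedback back in
    return total_reward

# A stub "model" that correctly names the misspelled field:
print(rollout(StubEnv(), lambda obs: '{"field": "nme"}', "easy", 3))  # -> 1.0
```

The returned reward is what `reward_from_env()` hands to GRPO, closing the loop between the environment's grader and the weight update.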

## Run on Google Colab (free T4 GPU)
```python
# Cell 1 -- Install
!pip install "trl>=0.26.0" transformers torch datasets openenv-core openai

# Cell 2 -- Clone repo
!git clone https://github.com/Avi-chauhan/api-debug-env.git
%cd api-debug-env

# Cell 3 -- Train
!python training/train.py
```

## Requirements

- GPU: T4 or better (free Colab works)
- RAM: 8GB+
- The live HF Space must be running:
  https://huggingface.co/spaces/avichauhan/api-debug-env