	# Training with GRPO on API Debug Environment

	Trains a small LLM using GRPO (Group Relative Policy Optimization)
	on the live API Debug Environment with curriculum learning.

	## What is GRPO?

	For each prompt, GRPO:
	1. Generates multiple completions (debug attempts)
	2. Scores each with the environment's grader (reward signal)
	3. Updates the model to prefer higher-scoring responses

	Over thousands of episodes, the LLM learns to debug API requests
	purely from reward signals -- no labelled data needed.
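
The "prefer higher-scoring responses" step can be illustrated with a minimal sketch of the group-relative advantage at the heart of GRPO: each completion's reward is compared against the mean of its own group. The function name `group_relative_advantages`, the `eps` constant, and the sample rewards are illustrative assumptions, not code from this repo, and the exact normalization may differ between implementations.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical grader scores for four debug attempts on one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage, those below a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f}")
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.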

	## Curriculum Learning

	Training automatically promotes the model through six difficulty levels:

	| Level | Task | Promotion threshold | Max turns | Skill |
	|-------|------|---------------------|-----------|-------|
	| 1 | easy | 0.7 avg reward | 3 | Identify single error type + fields |
	| 2 | classify | 0.6 avg reward | 4 | Identify ALL error types + fields |
	| 3 | medium | 0.6 avg reward | 5 | Fix the broken request body |
	| 4 | headers | 0.5 avg reward | 4 | Fix header-level errors |
	| 5 | response | 0.5 avg reward | 4 | Validate API response issues |
	| 6 | hard | -- | 7 | Fix mixed errors + explain reasoning |

	Promotion happens when the rolling average reward (window=10) exceeds
	the threshold for the current level.
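
The promotion rule can be sketched as follows. This is only a sketch under stated assumptions: the `Curriculum` class and its `record` method are hypothetical, and waiting for a full window and clearing it after a promotion are guesses about behavior this README does not specify. The level data is copied from the table above, and `maybe_promote` borrows its name from the architecture diagram.

```python
from collections import deque

# (task, promotion threshold, max turns); None marks the final level.
LEVELS = [
    ("easy", 0.7, 3),
    ("classify", 0.6, 4),
    ("medium", 0.6, 5),
    ("headers", 0.5, 4),
    ("response", 0.5, 4),
    ("hard", None, 7),
]

class Curriculum:
    """Hypothetical tracker for the current level and recent rewards."""

    def __init__(self, levels=LEVELS, window=10):
        self.levels = levels
        self.index = 0
        self.recent = deque(maxlen=window)  # rolling reward window

    @property
    def task(self):
        return self.levels[self.index][0]

    def record(self, reward):
        self.recent.append(reward)

    def maybe_promote(self):
        threshold = self.levels[self.index][1]
        # Assumption: no promotion from the last level or before the window is full.
        if threshold is None or len(self.recent) < self.recent.maxlen:
            return False
        if sum(self.recent) / len(self.recent) > threshold:
            self.index += 1
            self.recent.clear()  # assumption: start the next level with an empty window
            return True
        return False

curriculum = Curriculum()
for _ in range(10):
    curriculum.record(0.8)
promoted = curriculum.maybe_promote()
print(promoted, curriculum.task)
```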

	## Architecture
	```
	Dataset prompt ("Debug this broken API request.")
	|
	GRPOTrainer calls rollout_func()
	|
	rollout_func() connects to live HF Space via WebSocket
	|
	env.reset(task=current_task) -> broken API request
	|
	LLM generates JSON response -> env.step(action) -> reward
	| (repeat up to max_turns)
	Returns: prompt_ids, completion_ids, logprobs, env_reward
	|
	reward_from_env() extracts env_reward
	|
	GRPO updates model weights
	|
	maybe_promote() checks if agent should advance to next task
	```
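
The multi-turn loop in the middle of the diagram can be sketched offline with a stand-in environment. Everything here is an assumption made for illustration: `StubEnv`, `stub_policy`, `run_episode`, the `(observation, reward, done)` return shape of `step`, and the reward rule. The real rollout talks to the live Space over WebSocket and also returns token IDs and log-probabilities, which this sketch leaves out.

```python
import json

class StubEnv:
    """Stand-in for the live environment; its rewards are made up."""

    def reset(self, task):
        self.task = task
        return {"request": {"method": "POST", "body": {"user_id": "abc"}}}

    def step(self, action):
        # Pretend the grader rewards naming the broken field.
        reward = 1.0 if "user_id" in action.get("fields", []) else 0.0
        done = reward == 1.0
        return {"feedback": "ok" if done else "try again"}, reward, done

def stub_policy(observation, turn):
    # Hypothetical policy: wrong guess on the first turn, right guess afterwards.
    fields = ["email"] if turn == 0 else ["user_id"]
    return json.dumps({"error_type": "wrong_type", "fields": fields})

def run_episode(env, policy, task, max_turns):
    """Run one episode and return the last reward and the number of turns used."""
    observation = env.reset(task=task)
    env_reward = 0.0
    turns = 0
    for turn in range(max_turns):
        turns = turn + 1
        completion = policy(observation, turn)
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            action = {}
        if not isinstance(action, dict):
            action = {}  # treat non-object JSON as an empty action
        observation, env_reward, done = env.step(action)
        if done:
            break
    return {"env_reward": env_reward, "turns": turns}

result = run_episode(StubEnv(), stub_policy, task="easy", max_turns=3)
print(result)
```

Treating malformed JSON as an empty action is one possible design choice. It lets the environment score the attempt as a failure instead of crashing the rollout.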

	## Run on Google Colab (free T4 GPU)
	```python
	# Cell 1 -- Install
	# Quote the version specifier so the shell does not treat ">" as a redirect
	!pip install "trl>=0.26.0" transformers torch datasets openenv-core openai

	# Cell 2 -- Clone repo
	!git clone https://github.com/Avi-chauhan/api-debug-env.git
	%cd api-debug-env

	# Cell 3 -- Train
	!python training/train.py
	```

	## Requirements

	- GPU: T4 or better (free Colab works)
	- RAM: 8GB+
	- The live HF Space must be running:
	https://huggingface.co/spaces/avichauhan/api-debug-env