---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

# OpenSleuth — Trainer
|
|
A GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do
in-context program synthesis against the live OpenSleuth env.
|
|
| ## Pipeline |
|
|
| 1. Wait for the env Space to report healthy. |
| 2. Build a dataset of synthesis prompts: each row pairs one black-box |
| function with N pre-sampled `(input, output)` probes drawn from the env. |
| 3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and |
| `peft`. |
| 4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate |
| completions per prompt and rewarding each against the env's verifier. |
| 5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is |
| set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`. |
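
Step 2 can be sketched roughly like this; `build_prompt`'s wording and the
`probes_by_fn` shape are illustrative assumptions, not the Space's actual code:

```python
def build_prompt(fn_name, probes):
    # Render one synthesis prompt: the black-box function's name plus its
    # pre-sampled (input, output) probes. Prompt wording is illustrative.
    lines = [f"Implement `{fn_name}` so it matches these observations:"]
    for inp, out in probes:
        lines.append(f"  {fn_name}({inp!r}) == {out!r}")
    lines.append("Reply with a single fenced Python block defining the function.")
    return "\n".join(lines)


def build_dataset(functions, n_per_function, probes_by_fn):
    # One prompt row per (function, repeat) pair; per-row probe
    # re-sampling from the env is omitted here for brevity.
    return [
        {"prompt": build_prompt(fn, probes_by_fn[fn]), "fn_name": fn}
        for fn in functions
        for _ in range(n_per_function)
    ]
```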
|
|
| ## Reward |
|
|
| * `env_verifier_reward = env.score_submission(...) / 100` — the headline |
| shaped reward, ranging roughly `[-0.5, +1.5]`. |
* `format_reward` — small bonus for emitting a fenced `python` code block
  whose `def` matches the target function name; helps the model converge on
  parseable output early.
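
A minimal sketch of the two reward functions, assuming the verifier returns a
shaped score on a 0–100 scale; the regex, bonus value, and signatures are
illustrative assumptions, not the trainer's actual code:

```python
import re


def env_verifier_reward(score: float) -> float:
    # Normalize the env's shaped score into roughly [-0.5, +1.5].
    return score / 100.0


def format_reward(completion: str, fn_name: str, bonus: float = 0.1) -> float:
    # Bonus only when the completion contains a fenced python block
    # whose `def` matches the target function name.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return 0.0
    body = match.group(1)
    if re.search(rf"^\s*def\s+{re.escape(fn_name)}\s*\(", body, re.MULTILINE):
        return bonus
    return 0.0
```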
|
|
| ## Hardware |
|
|
| `t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train |
| faster if available. |
|
|
| ## Required Space secrets |
|
|
| * `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at |
| the end of training. |
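
The conditional push might look like the sketch below; the `maybe_push` helper
is hypothetical (the repo id is the one named in the pipeline above):

```python
import os


def maybe_push(trainer, repo_id="anugrah55/opensleuth-qwen2.5-0.5b-grpo"):
    # Push the LoRA adapter only when HF_TOKEN is configured as a Space
    # secret; otherwise the adapter just stays on the /data volume.
    token = os.environ.get("HF_TOKEN")
    if token:
        trainer.model.push_to_hub(repo_id, token=token)
        return True
    return False
```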
|
|
| ## Tuning knobs |
|
|
| All knobs are exposed as env vars (defaults shown): |
|
|
| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Passes over the dataset |
| `PER_DEVICE_BATCH_SIZE` | `1` | Prompts per device per step |
| `GRAD_ACCUM` | `8` | Gradient accumulation steps |
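
The trainer presumably reads these with `os.environ` fallbacks, roughly like
the sketch below (`ENV_URL` is omitted since its default is the env Space's
own URL; exact parsing code may differ):

```python
import os

# Each knob falls back to the documented default when the env var is unset.
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct")
N_PER_FUNCTION = int(os.environ.get("N_PER_FUNCTION", "16"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))
NUM_GENERATIONS = int(os.environ.get("NUM_GENERATIONS", "4"))
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "1e-5"))
NUM_TRAIN_EPOCHS = int(os.environ.get("NUM_TRAIN_EPOCHS", "1"))
PER_DEVICE_BATCH_SIZE = int(os.environ.get("PER_DEVICE_BATCH_SIZE", "1"))
GRAD_ACCUM = int(os.environ.get("GRAD_ACCUM", "8"))
```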
|
|