---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

OpenSleuth — Trainer

A GPU Space that fine-tunes a small Qwen2.5 model with TRL's GRPO trainer to do in-context program synthesis against the live OpenSleuth env.

Pipeline

  1. Wait for the env Space to report healthy.
  2. Build a dataset of synthesis prompts: each row pairs one black-box function with N pre-sampled (input, output) probes drawn from the env.
  3. Load Qwen/Qwen2.5-0.5B-Instruct in 4-bit + LoRA via bitsandbytes and peft.
  4. Train with trl.GRPOTrainer, generating num_generations=4 candidate completions per prompt and rewarding each against the env's verifier.
  5. Persist the LoRA adapter to /data/opensleuth-grpo and (if HF_TOKEN is set as a Space secret) push to anugrah55/opensleuth-qwen2.5-0.5b-grpo.
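
Steps 1–2 can be sketched roughly as follows. This is illustrative, not the trainer's actual code: the HTTP health check is abstracted behind a callable, and the prompt template is an assumption.

```python
import json
import time
from typing import Callable

def wait_for_env(fetch_health: Callable[[], bool],
                 timeout_s: float = 600, poll_s: float = 5) -> None:
    """Block until the env Space reports healthy (step 1). `fetch_health`
    would wrap an HTTP GET against the env's health endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_health():
            return
        time.sleep(poll_s)
    raise TimeoutError("env Space never became healthy")

def build_prompt(fn_name: str, probes: list[tuple]) -> dict:
    """One dataset row (step 2): a black-box function name plus N
    pre-sampled (input, output) probes rendered into an instruct prompt."""
    fence = chr(96) * 3  # the ``` marker, built indirectly
    lines = [f"f({json.dumps(i)}) == {json.dumps(o)}" for i, o in probes]
    prompt = (
        f"You are given a black-box function `{fn_name}`. Observed probes:\n"
        + "\n".join(lines)
        + f"\n\nReply with a fenced {fence}python block defining {fn_name}."
    )
    return {"prompt": prompt, "fn_name": fn_name}
```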

Reward

  • env_verifier_reward = env.score_submission(...) / 100 — the headline shaped reward, landing roughly in [-0.5, +1.5].
  • format_reward — small bonus for emitting a fenced python block whose def matches the target function name; helps the model converge on parseable output early.
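
The two reward terms can be sketched as below. The env call is reduced to its returned score, and the 0.1/0.2 bonus magnitudes and the fence-matching regex are assumptions, not the trainer's actual values:

```python
import re

FENCE = chr(96) * 3  # the literal ``` marker, built indirectly

def format_reward(completion: str, fn_name: str) -> float:
    """Small bonus for a fenced python block; a larger one when its `def`
    matches the target function name (magnitudes are illustrative)."""
    m = re.search(FENCE + r"python\n(.*?)" + FENCE, completion, re.DOTALL)
    if m is None:
        return 0.0
    has_def = re.search(rf"^def {re.escape(fn_name)}\(", m.group(1), re.MULTILINE)
    return 0.2 if has_def else 0.1

def env_verifier_reward(score: float) -> float:
    """Headline shaped reward: env.score_submission(...) / 100,
    landing roughly in [-0.5, +1.5]."""
    return score / 100.0
```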

Hardware

t4-small is sufficient for 0.5B + LoRA + bnb-4bit. a10g-small will train faster if available.

Required Space secrets

  • HF_TOKEN — write token if you want the LoRA adapter pushed to the Hub at the end of training.

Tuning knobs

All knobs are exposed as env vars (defaults shown):

| Env var | Default | Meaning |
| --- | --- | --- |
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | 16 | Prompts per black-box function |
| `N_PROBES` | 6 | Probes per prompt |
| `NUM_GENERATIONS` | 4 | GRPO group size |
| `LEARNING_RATE` | 1e-5 | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | 1 | Training epochs |
| `PER_DEVICE_BATCH_SIZE` | 1 | Per-device batch size |
| `GRAD_ACCUM` | 8 | Gradient accumulation steps |
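
Reading a knob with its documented default amounts to a one-liner; `knob` below is an illustrative helper, not the trainer's actual code:

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")

def knob(name: str, default: T, cast: Callable[[str], T] = str) -> T:
    """Read one tuning knob from the environment, casting the raw string
    (e.g. with int or float) and falling back to the documented default."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

# Examples using the defaults from the table above.
NUM_GENERATIONS = knob("NUM_GENERATIONS", 4, int)
LEARNING_RATE = knob("LEARNING_RATE", 1e-5, float)
```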