---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

OpenSleuth — Trainer

A GPU Space that fine-tunes a small Qwen2.5 model with TRL's GRPO trainer to do in-context program synthesis against the live OpenSleuth env.

Pipeline

  1. Wait for the env Space to report healthy.
  2. Build a dataset of synthesis prompts: each row pairs one black-box function with N pre-sampled (input, output) probes drawn from the env.
  3. Load Qwen/Qwen2.5-0.5B-Instruct in 4-bit + LoRA via bitsandbytes and peft.
  4. Train with trl.GRPOTrainer, generating num_generations=4 candidate completions per prompt and rewarding each against the env's verifier.
  5. Persist the LoRA adapter to /data/opensleuth-grpo and (if HF_TOKEN is set as a Space secret) push to anugrah55/opensleuth-qwen2.5-0.5b-grpo.
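
Steps 1–2 can be sketched roughly as follows. This is illustrative, not the trainer's actual code: the HTTP health check is abstracted behind a callable, and the prompt template is an assumption.

```python
import json
import time
from typing import Callable

def wait_for_env(fetch_health: Callable[[], bool],
                 timeout_s: float = 600, poll_s: float = 5) -> None:
    """Block until the env Space reports healthy (step 1). `fetch_health`
    would wrap an HTTP GET against the env's health endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_health():
            return
        time.sleep(poll_s)
    raise TimeoutError("env Space never became healthy")

def build_prompt(fn_name: str, probes: list[tuple]) -> dict:
    """One dataset row (step 2): a black-box function name plus N
    pre-sampled (input, output) probes rendered into an instruct prompt."""
    fence = chr(96) * 3  # the ``` marker, built indirectly
    lines = [f"f({json.dumps(i)}) == {json.dumps(o)}" for i, o in probes]
    prompt = (
        f"You are given a black-box function `{fn_name}`. Observed probes:\n"
        + "\n".join(lines)
        + f"\n\nReply with a fenced {fence}python block defining {fn_name}."
    )
    return {"prompt": prompt, "fn_name": fn_name}
```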

Reward

  • env_verifier_reward = env.score_submission(...) / 100 — the headline shaped reward, landing roughly in [-0.5, +1.5].
  • format_reward — small bonus for emitting a fenced python block whose def matches the target function name; helps the model converge on parseable output early.
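
The two reward terms can be sketched as below. The env call is reduced to its returned score, and the 0.1/0.2 bonus magnitudes and the fence-matching regex are assumptions, not the trainer's actual values:

```python
import re

FENCE = chr(96) * 3  # the literal ``` marker, built indirectly

def format_reward(completion: str, fn_name: str) -> float:
    """Small bonus for a fenced python block; a larger one when its `def`
    matches the target function name (magnitudes are illustrative)."""
    m = re.search(FENCE + r"python\n(.*?)" + FENCE, completion, re.DOTALL)
    if m is None:
        return 0.0
    has_def = re.search(rf"^def {re.escape(fn_name)}\(", m.group(1), re.MULTILINE)
    return 0.2 if has_def else 0.1

def env_verifier_reward(score: float) -> float:
    """Headline shaped reward: env.score_submission(...) / 100,
    landing roughly in [-0.5, +1.5]."""
    return score / 100.0
```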

Hardware

t4-small is sufficient for 0.5B + LoRA + bnb-4bit. a10g-small will train faster if available.

Required Space secrets

  • HF_TOKEN — write token if you want the LoRA adapter pushed to the Hub at the end of training.

Tuning knobs

All knobs are exposed as env vars (defaults shown):

| Env var | Default | Meaning |
| --- | --- | --- |
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | 16 | Prompts per black-box function |
| `N_PROBES` | 6 | Probes per prompt |
| `NUM_GENERATIONS` | 4 | GRPO group size |
| `LEARNING_RATE` | 1e-5 | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | 1 | Training epochs |
| `PER_DEVICE_BATCH_SIZE` | 1 | Per-device batch size |
| `GRAD_ACCUM` | 8 | Gradient accumulation steps |
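
Reading a knob with its documented default amounts to a one-liner; `knob` below is an illustrative helper, not the trainer's actual code:

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")

def knob(name: str, default: T, cast: Callable[[str], T] = str) -> T:
    """Read one tuning knob from the environment, casting the raw string
    (e.g. with int or float) and falling back to the documented default."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

# Examples using the defaults from the table above.
NUM_GENERATIONS = knob("NUM_GENERATIONS", 4, int)
LEARNING_RATE = knob("LEARNING_RATE", 1e-5, float)
```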