trainer v0.4: switch to Qwen2.5-3B-Instruct, dynamic task discovery, delegated probe sampling, difficulty-weighted rollouts, push to opensleuth-qwen2.5-3b-grpo-v2; sentinel cleared on FORCE_TRAIN=1. 78575eb verified anugrah55 commited on 13 days ago
entrypoint: skip retraining if sentinel exists, idle on heartbeat after training succeeds (prevents auto-restart loop burning GPU) 8c92f05 verified anugrah55 commited on 13 days ago
Overhaul trainer: TRL GRPO with env-backed reward, Qwen2.5-0.5B 4bit+LoRA, slim PyTorch CUDA base, heartbeat HTTP for HF Spaces health probe d597642 verified anugrah55 commited on 13 days ago