feat(training): improve self-play progress visibility and reward diagnostics 4aca4f5 siddeshwar-kagatikar committed 12 days ago
fix(rewards): never crash GRPO on malformed completions d814291 siddeshwar-kagatikar committed 13 days ago