implement GRPO-style preference learning, simulation branching, and expanded documentation 27a0d2f Pratap-K committed 26 days ago