improve: 20 tasks, richer keywords, enhanced reward/grader, bigram matching, compelling README b83c8ad hellinferno committed on 6 days ago
fix: add run_episode wrapper, use .2f score format, update test for strict bounds 11aa990 hellinferno committed on 6 days ago
fix: correct inference log format, align openenv.yaml task IDs, harden Dockerfile 852b5ea hellinferno Claude Sonnet 4.6 committed on 7 days ago