DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning Paper • 2503.19263 • Published Mar 25, 2025 • 2
Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems Paper • 2504.09763 • Published Apr 14, 2025 • 12
One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration Paper • 2510.12088 • Published Oct 14, 2025 • 5
PRInTS: Reward Modeling for Long-Horizon Information Seeking Paper • 2511.19314 • Published Nov 24, 2025 • 8
Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems Paper • 2604.04767 • Published 10 days ago • 7
Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind Paper • 2604.11666 • Published 3 days ago • 3
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling Paper • 2602.12279 • Published Feb 12 • 20
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17 • 35
mlfoundations-dev/Qwen3-8B_exp-swd-r2egym-standard_glm_4.7_traces_locetash_save-strategy_steps Updated Jan 9
mlfoundations-dev/Qwen3-8B_perturbed-docker-exp-taskmaster2-tasks_glm_4.7_traces_locetash_save-strategy_steps Updated Jan 9