arxiv:2605.20177

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Published on May 19 · Submitted by Juncheng Wu on May 25

Abstract

Staged training that optimizes visual perception, visual reasoning, and textual reasoning separately in vision-language models outperforms unified (merged) training on visual reasoning tasks.

AI-generated summary

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by weak visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, each with specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, improving over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
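
The contrast between staged and merged training can be pictured as two ways of ordering the same data. The sketch below is only an illustration under stated assumptions: the `Sample` type, the toy pools, and both function names are hypothetical, and the stage order is supplied by the caller (the example uses the order given in the author's post on this page). It does not reflect the authors' actual training code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    capability: str  # e.g. "visual_perception" (hypothetical label)
    prompt: str


def merged_schedule(pools, seed=0):
    """Merged training: mix every capability pool into one shuffled stream."""
    stream = [s for pool in pools.values() for s in pool]
    random.Random(seed).shuffle(stream)
    return stream


def staged_schedule(pools, stage_order, seed=0):
    """Staged training: exhaust one capability pool before starting the next."""
    rng = random.Random(seed)
    stream = []
    for capability in stage_order:
        stage = list(pools[capability])
        rng.shuffle(stage)  # shuffle only within the current stage
        stream.extend(stage)
    return stream


# Hypothetical toy pools; real post-training data would be far larger.
pools = {
    name: [Sample(name, f"{name}-{i}") for i in range(3)]
    for name in ("visual_perception", "textual_reasoning", "visual_reasoning")
}
order = ["visual_perception", "textual_reasoning", "visual_reasoning"]
staged = staged_schedule(pools, order)
merged = merged_schedule(pools)
```

Both schedules contain exactly the same samples; only the order in which the model would see them differs.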

Community

Paper author Paper submitter

Your VLM didn't fail because it didn't think long enough. It failed because it looked wrong. We found 86.9% of Qwen3-VL-8B's wrong answers trace back to a perception error, not a reasoning one.
Our fix: a capability curriculum, a brand-new curriculum dimension that trains perception before reasoning. 🧵

  • We decouple post-training along a capability axis into 3 sequential RLVR stages:
    🟦 Visual Perception → 🟩 Textual Reasoning → 🟨 Visual Reasoning
  • Staged > Merged, consistently across 4 backbones (Qwen2.5-VL, Qwen3-VL, InternVL3, InternVL3.5).
    On Qwen3-VL-8B: +1.46% accuracy with 20.8% shorter reasoning traces.
  • Curriculum learning has always meant easy→hard (difficulty axis). We surface a second, orthogonal axis: which capability each epoch trains. The two stack additively: on Qwen3-VL-8B, combining both lifts the average from 58.6 to 63.0, beating either axis alone.
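
One possible way to picture the two axes stacking is a two-level sort: capability stage first, then easy to hard inside each stage. This is a minimal sketch, not the authors' implementation. The stage order comes from the post above, while the `Sample` type, the `difficulty` score, and the function name are assumptions made for illustration; the paper may combine the axes differently.

```python
from dataclasses import dataclass

# Stage order taken from the post above; everything else is illustrative.
STAGE_ORDER = ("visual_perception", "textual_reasoning", "visual_reasoning")


@dataclass(frozen=True)
class Sample:
    capability: str    # which capability the sample trains
    difficulty: float  # hypothetical score in [0, 1]; higher means harder
    prompt: str


def two_axis_curriculum(samples):
    """Order by capability stage first, then easy-to-hard inside each stage."""
    stage_rank = {name: i for i, name in enumerate(STAGE_ORDER)}
    return sorted(samples, key=lambda s: (stage_rank[s.capability], s.difficulty))


samples = [
    Sample("visual_reasoning", 0.2, "vr-easy"),
    Sample("visual_perception", 0.9, "vp-hard"),
    Sample("textual_reasoning", 0.5, "tr-mid"),
    Sample("visual_perception", 0.1, "vp-easy"),
    Sample("visual_reasoning", 0.8, "vr-hard"),
]
ordered = [s.prompt for s in two_axis_curriculum(samples)]
print(ordered)
# → ['vp-easy', 'vp-hard', 'tr-mid', 'vr-easy', 'vr-hard']
```

Dropping `s.difficulty` from the sort key would leave a capability-only curriculum, and dropping the stage rank would leave a difficulty-only one, which is the sense in which the two axes are independent.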

🌐 Project: ucsc-vlaa.github.io/VLM-CapCurriculum
📄 arXiv: arxiv.org/abs/2605.20177
💻 Code: github.com/UCSC-VLAA/VLM-CapCurriculum
🤗 HF Collection: UCSC-VLAA/VLM-CapCurriculum
