arxiv:2605.20177

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Published on May 19

Submitted by Juncheng Wu on May 25

Authors:

Juncheng Wu ,

AI-generated summary

Staged training approaches that separately optimize visual perception, visual reasoning, and textual reasoning in vision-language models outperform unified training methods, leading to improved performance on visual reasoning tasks.

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet we find that their performance on visual tasks is limited primarily by weak visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing model capabilities into three separate training stages (visual perception, visual reasoning, and textual reasoning), each with specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and that combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs and improve over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
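The difference between staged and merged training described in the abstract can be illustrated as a data-scheduling choice. The following is a minimal sketch, not the authors' implementation: the `capability` field, the function names, and the batching logic are all hypothetical, and the stage order is passed in by the caller rather than fixed here.

```python
import random
from typing import Dict, Iterator, List, Sequence

# Hypothetical record layout: {"capability": "...", "prompt": "..."}.
Sample = Dict[str, str]


def merged_schedule(
    samples: Sequence[Sample], batch_size: int, seed: int = 0
) -> Iterator[List[Sample]]:
    """Merged baseline: shuffle every capability together, then batch."""
    pool = list(samples)
    random.Random(seed).shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


def staged_schedule(
    samples: Sequence[Sample],
    stage_order: Sequence[str],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Sample]]:
    """Staged variant: exhaust one capability's data before the next."""
    rng = random.Random(seed)
    for capability in stage_order:
        # Keep only the samples tagged with the current stage's capability.
        stage = [s for s in samples if s["capability"] == capability]
        rng.shuffle(stage)
        for start in range(0, len(stage), batch_size):
            yield stage[start:start + batch_size]
```

Under this sketch, every batch from `staged_schedule` contains a single capability, while `merged_schedule` mixes them freely; the training objective applied to each batch is outside the scope of the example.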


Community

Chtholly17 (paper author, paper submitter) · about 24 hours ago

Your VLM didn't fail because it didn't think long enough. It failed because it looked wrong. We found 86.9% of Qwen3-VL-8B's wrong answers trace back to a perception error — not a reasoning one.
Our fix: a capability curriculum — a brand-new curriculum dimension that trains perception before reasoning. 🧵

We decouple post-training along a capability axis into 3 sequential RLVR stages:
🟦 Visual Perception → 🟩 Textual Reasoning → 🟨 Visual Reasoning
Staged > Merged, consistently across 4 backbones (Qwen2.5-VL, Qwen3-VL, InternVL3, InternVL3.5).
On Qwen3-VL-8B: +1.46% accuracy with 20.8% shorter reasoning traces.
Curriculum learning has always meant easy→hard (difficulty axis). We surface a second, orthogonal axis: which capability each epoch trains. The two stack additively — on Qwen3-VL-8B, combining both lifts the average 58.6 → 63.0, beating either axis alone.
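One way to picture the two curriculum axes stacking is as a two-part sort key: capability stage first, difficulty second. This is only an illustrative sketch under stated assumptions. The stage order follows the post above, but the `Sample` type, the `difficulty` score, and `two_axis_order` are hypothetical names, not taken from the released code.

```python
from typing import List, NamedTuple, Sequence

# Stage order as stated in the post above.
STAGE_ORDER = ["visual_perception", "textual_reasoning", "visual_reasoning"]


class Sample(NamedTuple):
    capability: str    # one of STAGE_ORDER
    difficulty: float  # hypothetical score; lower means easier
    prompt: str


def two_axis_order(samples: Sequence[Sample]) -> List[Sample]:
    """Order by capability stage first, then easy-to-hard within a stage."""
    rank = {name: index for index, name in enumerate(STAGE_ORDER)}
    return sorted(samples, key=lambda s: (rank[s.capability], s.difficulty))
```

Dropping the second element of the key leaves a capability-only curriculum, and dropping the first leaves a difficulty-only one, which is the sense in which the two axes are orthogonal in this sketch.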

🌐 Project: ucsc-vlaa.github.io/VLM-CapCurriculum
📄 arXiv: arxiv.org/abs/2605.20177
💻 Code: github.com/UCSC-VLAA/VLM-CapCurriculum
🤗 HF Collection: UCSC-VLAA/VLM-CapCurriculum


Get this paper in your agent:

hf papers read 2605.20177

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

