Papers
arxiv:2604.10039

Counting to Four is still a Chore for VLMs

Published on Apr 11
Submitted by Duy Le on Apr 14
Abstract

Vision-language models fail at counting in part because later language layers underuse visual evidence; this can be mitigated with a Modality Attention Share intervention.

AI-generated summary

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
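The abstract describes MAS as enforcing a minimum budget of visual attention during answer generation. A minimal sketch of that idea, assuming a simple rescaling scheme over a single attention distribution (the function name, the `tau` value, and the renormalization strategy are illustrative assumptions, not the authors' exact method):

```python
# Hedged sketch of a "minimum visual attention budget" intervention, in the
# spirit of the paper's Modality Attention Share (MAS). The rescaling scheme
# below is an assumption for illustration, not the authors' exact algorithm.
import numpy as np

def enforce_visual_attention_floor(attn, image_mask, tau=0.2):
    """Rescale one attention distribution so image tokens receive >= tau mass.

    attn:       1-D array of attention weights over all tokens (sums to 1).
    image_mask: boolean array, True where the token is a visual token.
    tau:        minimum share of attention reserved for visual tokens.
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    vis = attn[image_mask].sum()
    if vis >= tau:
        return attn  # budget already met; leave the weights untouched
    out = attn.copy()
    # Scale visual weights up to exactly tau, text weights down to 1 - tau,
    # preserving the relative ordering within each modality.
    out[image_mask] *= tau / max(vis, 1e-12)
    txt = attn[~image_mask].sum()
    out[~image_mask] *= (1.0 - tau) / max(txt, 1e-12)
    return out

# Example: visual tokens hold only 10% of the mass, below a 20% floor.
weights = enforce_visual_attention_floor(
    [0.05, 0.05, 0.9], [True, True, False], tau=0.2
)
```

In a real VLM this rescaling would be applied inside the decoder's attention layers at generation time (e.g. via forward hooks), per head or per layer; the single-vector version above only shows the budget constraint itself.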

Community

Paper submitter

Vision-language models are improving incredibly fast, and they are often celebrated for reaching near expert-level performance on many challenging tasks. But I think there is an important question we should keep asking:

๐˜ผ๐™ง๐™š ๐™ฉ๐™๐™š๐™ฎ ๐™ง๐™š๐™–๐™ก๐™ก๐™ฎ ๐™จ๐™ฉ๐™ง๐™ค๐™ฃ๐™œ ๐™–๐™˜๐™ง๐™ค๐™จ๐™จ ๐™–๐™ก๐™ก ๐™ข๐™ค๐™™๐™–๐™ก๐™ž๐™ฉ๐™ž๐™š๐™จ, ๐™ค๐™ง ๐™–๐™ง๐™š ๐™ฉ๐™๐™š๐™ฎ ๐™จ๐™ฉ๐™ž๐™ก๐™ก ๐™ข๐™–๐™ž๐™ฃ๐™ก๐™ฎ ๐™จ๐™ฉ๐™ง๐™ค๐™ฃ๐™œ ๐™ž๐™ฃ ๐™ฉ๐™š๐™ญ๐™ฉ? ๐Ÿค”

A lot of the praise around today's VLMs comes from what happens in the text space: fluent explanations, strong reasoning traces, confident answers, and impressive performance on text-heavy evaluations. But being multimodal should mean more than sounding smart. It should also mean being able to truly use and trust signals coming from other modalities.
In this work, we study that question through one of the most basic tasks in computer vision: object counting.


Get this paper in your agent:

hf papers read 2604.10039
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
