arxiv:2604.10966

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Published on Apr 13 · Submitted by Yinuo Yang on Apr 15

Abstract

AI-generated summary

A multimodal reward model evaluates multiple responses simultaneously through concatenated input and cross-entropy scoring, achieving faster training and superior performance in open-ended generation tasks compared to traditional single-response approaches.

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring one forward pass per candidate. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient N-way preference learning. The multi-response design also yields up to N× wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable N-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR²Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR²Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question answering spanning 19 models, denoised via preference-graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR²Bench-Image, MR²Bench-Video, and four existing benchmarks, outperforming larger generative and discriminative reward models. We further show that, when used for reinforcement learning with GRPO, our reward model produces policy models that maintain performance on standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative RM baseline by a large margin in both training stability and generation quality.
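To make the scoring scheme concrete, here is a minimal PyTorch sketch of single-pass N-way scoring as the abstract describes it: the prompt and N responses are concatenated with separator tokens, the backbone runs once, the hidden state at each separator feeds a small MLP value head, and cross-entropy over the N scalar scores trains the model to rank the preferred response highest. All names here (MultiResponseRewardHead, sep_positions, nway_preference_loss) are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResponseRewardHead(nn.Module):
    # Lightweight MLP value head: maps the backbone hidden state at each
    # response's separator position to one scalar score per response.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states, sep_positions):
        # hidden_states: (batch, seq_len, hidden) from ONE forward pass over
        # "prompt [SEP] response_1 [SEP] ... [SEP] response_N"
        # sep_positions: (batch, N) index of the separator closing each response
        batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(-1)
        per_response = hidden_states[batch_idx, sep_positions]  # (batch, N, hidden)
        return self.mlp(per_response).squeeze(-1)               # (batch, N) scores

def nway_preference_loss(scores, preferred_idx):
    # Cross-entropy over the N scalar scores: softmax turns them into a
    # distribution over candidates; the human-preferred response is the target.
    return F.cross_entropy(scores, preferred_idx)

# Toy usage: batch of 2 prompts, N = 4 candidate responses each.
hidden = torch.randn(2, 128, 64)                 # stand-in for backbone output
seps = torch.tensor([[30, 60, 90, 120], [25, 55, 85, 115]])
head = MultiResponseRewardHead(hidden_size=64)
scores = head(hidden, seps)                      # (2, 4)
loss = nway_preference_loss(scores, torch.tensor([1, 3]))

Because the backbone runs once over the concatenated sequence, the N per-response scores come almost for free relative to N separate passes, which is where the up-to-N× wall-clock and FLOPs saving comes from.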

Community

Paper submitter

We present a discriminative multimodal reward model that scores all N candidate responses in a single forward pass, achieving up to N× speedup over conventional single-response scoring. Our 4B model achieves SOTA on six multimodal reward benchmarks, outperforming larger generative judges. We also introduce MR²Bench-Image and MR²Bench-Video, two new N-way ranking benchmarks for multimodal reward evaluation.
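For readers wiring this into RL: the paper pairs the reward model with GRPO, where each prompt's N sampled rollouts can be scored in one pass and then normalized into group-relative advantages. A minimal sketch of the standard GRPO normalization (the helper grpo_advantages is hypothetical, not from the paper):

import torch

def grpo_advantages(group_scores, eps=1e-6):
    # group_scores: (num_prompts, N) rewards for N rollouts per prompt,
    # obtainable here from one reward-model forward pass per prompt.
    # Standard GRPO: normalize each group by its own mean and std.
    mean = group_scores.mean(dim=-1, keepdim=True)
    std = group_scores.std(dim=-1, keepdim=True)
    return (group_scores - mean) / (std + eps)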


Get this paper in your agent:

hf papers read 2604.10966

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash
