Vero-MiMo-7B

Vero

Vero is an open RL model family for general visual reasoning. The release includes models, training data, evaluation code, and training code for broad multimodal reasoning across charts, STEM, spatial reasoning, knowledge, grounding, counting, and instruction following.

GitHub HF Models HF Dataset Project Page

Models

| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | gsarch/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | gsarch/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | gsarch/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | gsarch/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |

Highlights

  • Fully open release of models, training code, evaluation, and the Vero-600K dataset.
  • 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
  • Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
  • SOTA 8B on VeroEval, a 30-benchmark suite for general visual reasoning.
  • Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.

Usage

Example for gsarch/Vero-Qwen3T-8B:

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "gsarch/Vero-Qwen3T-8B"

# Qwen3-VL checkpoints do not load with Qwen2_5_VLForConditionalGeneration;
# AutoModelForImageTextToText resolves the correct class for each base model.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)

Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
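A minimal sketch of that parsing step, using a hypothetical `parse_answer` helper (not part of the Vero release) that extracts the text between the `<answer>` tags and falls back to the raw output when no tags are present:

```python
import re

def parse_answer(output: str) -> str:
    """Extract the final answer from a Vero-style response.

    Vero models emit a reasoning trace in <think>...</think> followed by
    the final answer in <answer>...</answer>.
    """
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    # Fall back to the raw output if no <answer> tags are present.
    return match.group(1).strip() if match else output.strip()

example = "<think>The tallest bar is at 1990.</think><answer>1990</answer>"
print(parse_answer(example))  # -> 1990
```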

Recommended sampling parameters, following the Qwen3-VL thinking-mode defaults:

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
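As a sketch, these parameters can be collected into a dict and unpacked into `model.generate(...)`. Note that `presence_penalty` is a vLLM/OpenAI-style sampling option, not a `transformers` `generate` argument, so it is omitted here (in vLLM it would go into `SamplingParams`):

```python
# Recommended thinking-mode sampling parameters from the model card,
# as keyword arguments accepted by transformers' model.generate().
thinking_sampling = dict(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    repetition_penalty=1.0,
    max_new_tokens=16384,
)
# generated_ids = model.generate(**inputs, **thinking_sampling)
```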

Citation

@article{sarch2026vero,
    title   = {Vero: An Open RL Recipe for General Visual Reasoning},
    author  = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
    year    = {2026},
    journal = {arXiv preprint arXiv:2604.04917},
}

License

Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.
