Vero Collection
Vero is an open RL model family for general visual reasoning. It releases models, data, evaluation, and training code for broad multimodal reasoning across charts, STEM, spatial reasoning, knowledge, grounding, counting, and instruction following.
| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | gsarch/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | gsarch/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | gsarch/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | gsarch/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |
The release also includes the Vero-600K dataset and VeroEval, a 30-benchmark suite for general visual reasoning.

Example usage for gsarch/Vero-Qwen3T-8B:
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "gsarch/Vero-Qwen3T-8B"

# AutoModelForImageTextToText resolves to the correct architecture for this
# Qwen3-VL-based checkpoint (the Qwen2_5_VL class does not match this model).
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Build the prompt string and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Decode only the newly generated tokens, dropping the prompt.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)
```
Vero models generate a reasoning trace in `<think>` tags followed by a final answer in `<answer>` tags. For downstream use, parse the final response from the `<answer>` block.
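A minimal way to extract the final response is a regular-expression match on the `<answer>` tags; `parse_answer` below is a hypothetical helper, not part of the release:

```python
import re

def parse_answer(output: str) -> str:
    """Extract the final answer from a Vero model response.

    Vero models emit a reasoning trace in <think> tags followed by the
    final answer in <answer> tags; downstream code should use only the
    <answer> contents. Falls back to the raw output if no tags are found.
    """
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    return match.group(1).strip() if match else output.strip()

response = "<think>The tallest bar is at 1990.</think><answer>1990</answer>"
print(parse_answer(response))  # 1990
```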
Recommended sampling parameters, following the Qwen3-VL defaults: `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`, `max_new_tokens=16384`.

```bibtex
@article{sarch2026vero,
  title   = {Vero: An Open RL Recipe for General Visual Reasoning},
  author  = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.04917},
}
```
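The recommended sampling parameters map most directly onto vLLM-style sampling options; a sketch, assuming the model is served with vLLM (note that `presence_penalty` is a vLLM/OpenAI-style parameter and is not accepted by `transformers` `generate()`):

```python
# Recommended Vero decoding settings expressed as vLLM SamplingParams kwargs.
# Illustrative only: assumes serving with vLLM, where presence_penalty is
# supported; transformers' generate() does not take that argument.
sampling_kwargs = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
    "max_tokens": 16384,  # vLLM uses max_tokens rather than max_new_tokens
}

# Usage sketch (requires vllm installed):
# from vllm import LLM, SamplingParams
# llm = LLM(model="gsarch/Vero-Qwen3T-8B")
# outputs = llm.generate([prompt], SamplingParams(**sampling_kwargs))
print(sampling_kwargs["temperature"])
```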
Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.