Vero-MiMo-7B

Vero

Vero is an open RL model family for general visual reasoning. The release includes models, training data, evaluation code, and training code for broad multimodal reasoning across charts, STEM, spatial reasoning, knowledge, grounding, counting, and instruction following.

GitHub HF Models HF Dataset Project Page

Models

| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | gsarch/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | gsarch/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | gsarch/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | gsarch/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |

Highlights

  • Fully open release of models, training code, evaluation, and the Vero-600K dataset.
  • 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
  • Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
  • SOTA 8B on VeroEval, a 30-benchmark suite for general visual reasoning.
  • Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.

Usage

Example for gsarch/Vero-Qwen3T-8B:

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "gsarch/Vero-Qwen3T-8B"

# Qwen3-VL checkpoints do not load with Qwen2_5_VLForConditionalGeneration;
# AutoModelForImageTextToText resolves the correct class for each base model.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)

Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
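A minimal sketch of that parsing step, using a hypothetical `parse_answer` helper (not part of the Vero release) that extracts the text between the `<answer>` tags and falls back to the raw output when no tags are present:

```python
import re

def parse_answer(output: str) -> str:
    """Extract the final answer from a Vero-style response.

    Vero models emit a reasoning trace in <think>...</think> followed by
    the final answer in <answer>...</answer>.
    """
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    # Fall back to the raw output if no <answer> tags are present.
    return match.group(1).strip() if match else output.strip()

example = "<think>The tallest bar is at 1990.</think><answer>1990</answer>"
print(parse_answer(example))  # -> 1990
```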

Recommended sampling parameters, following the Qwen3-VL thinking-mode defaults:

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
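As a sketch, these parameters can be collected into a dict and unpacked into `model.generate(...)`. Note that `presence_penalty` is a vLLM/OpenAI-style sampling option, not a `transformers` `generate` argument, so it is omitted here (in vLLM it would go into `SamplingParams`):

```python
# Recommended thinking-mode sampling parameters from the model card,
# as keyword arguments accepted by transformers' model.generate().
thinking_sampling = dict(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    repetition_penalty=1.0,
    max_new_tokens=16384,
)
# generated_ids = model.generate(**inputs, **thinking_sampling)
```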

Citation

@article{sarch2026vero,
    title   = {Vero: An Open RL Recipe for General Visual Reasoning},
    author  = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
    year    = {2026},
    journal = {arXiv preprint arXiv:2604.04917},
}

License

Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.
