
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

πŸ“ƒ Paper | πŸ€— Models & Tools | πŸ’» Code

πŸ“‘ Contents


πŸ“– Overview

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images" β€” to reason through multi-step visual interactions β€” remains limited.

We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 task types across 13 datasets) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning.

VISTA Overview


βš™οΈ Installation

Gym setup

git clone https://github.com/Lucanyc/vista-gym.git
cd vista-gym
pip install -e .

Build Docker Container

Since the gym environment relies on a Docker container for isolated code execution, first build the Docker image:

docker build -f docker/Dockerfile -t vlm_gym:latest .

Alternatively, you can run the prepared script directly:

bash docker/build_docker.sh

Training setup

We follow the verl / verl-tool training environment; see TIGER-AI-Lab/verl-tool for setup.

Tool setup

Coming soon

| Model | HuggingFace Link | Usage |
| --- | --- | --- |
| ChartMoE | 🤗 LuKasatvt/VistaGym/ChartMoE | --chartmoe-model "LuKasatvt/VistaGym" |
| MultiMath | 🤗 LuKasatvt/VistaGym/MultiMath | --enable-multimath |
| GLLaVA | 🤗 LuKasatvt/VistaGym/GLLaVA | --enable-gllava |
| EasyOCR | 🤗 LuKasatvt/VistaGym/EasyOCR | --enable-easyocr |
| Qwen2.5-VL-7B | 🤗 Qwen/Qwen2.5-VL-7B-Instruct | --model Qwen/Qwen2.5-VL-7B-Instruct |
| InternVL3-8B | 🤗 OpenGVLab/InternVL3-8B-Instruct | --model OpenGVLab/InternVL3-8B-Instruct |

More models are supported.

πŸ—ΊοΈ VISTA-Gym

A scalable reinforcement learning gym for training tool-integrated visual reasoning in VLMs. Drop in any benchmark, any VLM, and evaluate with reflection and tool-augmented reasoning.

Core components:

| Directory | Description |
| --- | --- |
| vlm_gym/environments/ | Pluggable vision-QA environments (ChartQA, ScienceQA, Geometry3K, etc.) |
| vlm_gym/agents/ | VLM agent implementations (Qwen2.5-VL-7B-Instruct, InternVL3-8B, etc.) |
| vlm_gym/environments/tools/ | Visual tools (ChartMoE, DeepEyes, GroundingDINO, EasyOCR, SAM2, etc.) |
| vlm_gym/tasks/ | Task-specific reasoning, evaluation, and feedback components |
| scripts/ | Evaluation entry points |
| data_adapters/ | Dataset converters to the unified vlmgym format |

Gym Interaction

The gym follows a standard environment-agent loop: the environment sends an observation (image + question), the agent returns an action (its reasoning plus a predicted answer), and, when reflection is enabled, the environment returns feedback on a wrong answer so the agent can retry.

Environment (ChartQA)            Agent (VLM)
    β”‚                                β”‚
    │──── obs: image + question ────►│
    β”‚                                β”‚
    │◄─── action: <think>...</think> β”‚
    β”‚           <answer>Yes</answer> β”‚
    β”‚                                β”‚
    β”‚   [if wrong & reflection on]   β”‚
    β”‚                                β”‚
    │──── feedback + retry ─────────►│
    │◄─── action: revised answer ─── β”‚
    β”‚                                β”‚
    β”‚   reward: 1.0 (correct)        β”‚

Supported tools:

| Tool | Description | Flag |
| --- | --- | --- |
| ChartMoE | Chart data extraction (to_table, extract_data, describe) | --enable-chartmoe |
| DeepEyes | Image zoom/magnification for fine-grained visual analysis | --enable-deepeyes |
| Grounding DINO | Object detection and grounding | --config-experiment chartqa_grounding |
| EasyOCR | Optical character recognition | --enable-easyocr |
| SAM2 | Segment Anything 2 for image segmentation | --enable-sam2 |
| MultiMath | Mathematical reasoning tool | --enable-multimath |

More tools are available.
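The environment-agent retry loop described above can be sketched in a few lines of Python. Note that the names here (env.reset, env.step, agent.act) are illustrative assumptions, not the actual vlm_gym API:

```python
def run_episode(env, agent, max_attempts: int = 3) -> float:
    """Run one environment-agent episode with retry-on-feedback.

    A sketch of the loop in the diagram above; `env` and `agent` are
    hypothetical objects, not classes from vlm_gym.
    """
    obs = env.reset()                      # obs: image + question
    feedback = None
    for _ in range(max_attempts):
        action = agent.act(obs, feedback)  # <think>...</think><answer>...</answer>
        reward, done, feedback = env.step(action)
        if done:                           # correct answer terminates the episode
            return reward
    return 0.0
```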

Data Preparation

Convert ChartQA to vlmgym format:

python data_adapters/convert_chartqa_to_vlmgym.py
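To illustrate what such a conversion involves, here is a hypothetical per-record mapping. The input field names (imgname, query, label) follow the public ChartQA annotation files, but the output schema is an illustrative guess at the unified vlmgym format, not the converter's actual output:

```python
def convert_record(chartqa_item: dict) -> dict:
    """Map one raw ChartQA annotation to a unified QA record.

    Illustrative only: the real data_adapters converter may use
    different field names and extra metadata.
    """
    return {
        "image": f"png/{chartqa_item['imgname']}",
        "question": chartqa_item["query"],
        "answer": chartqa_item["label"],
        "task": "chartqa",
    }
```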

Run ChartQA Evaluation with Reflection

python scripts/run_chartqa_eval_reflection_with_tool.py \
  --annotation data/chartqa/converted_train/train_human_vlmgym_container.json \
  --data-root data/chartqa \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --enable-reflection \
  --max-attempts 3 \
  --numerical-tolerance 0.05 \
  --limit 50
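The --numerical-tolerance 0.05 flag suggests that numeric answers are accepted within 5% relative error. A sketch of what such a check might look like (the script's actual comparison logic may differ):

```python
def answers_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Compare a predicted and gold answer with relative numeric tolerance.

    Illustrative sketch of the --numerical-tolerance behavior: numeric
    answers match within `tol` relative error; other answers must match
    case-insensitively after stripping whitespace.
    """
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol
```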

Run with Tool-Augmented Reasoning

python scripts/run_chartqa_eval_reflection_with_tool.py \
    --annotation data/chartqa/converted_train/train_human_vlmgym_container.json \
    --data-root data/chartqa \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --enable-chartmoe \
    --chartmoe-model "/workspace/mathvista/model" \
    --chartmoe-device cuda \
    --use-structured-output \
    --enable-reflection \
    --max-attempts 3 \
    --limit 50

Requirements

  • Python 3.10+
  • Linux, CUDA 12, NVIDIA GPU (80GB+ recommended for training; inference requires ~20GB for 7B model)
  • PyTorch 2.0+
  • Transformers 4.40+
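A small convenience check for the requirements above (this helper is not part of the repository; it only inspects the interpreter version and installed packages without importing them):

```python
import sys
from importlib import util

def check_requirements() -> list[str]:
    """Return a list of human-readable problems, empty if all checks pass."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required")
    for pkg in ("torch", "transformers"):
        if util.find_spec(pkg) is None:
            problems.append(f"{pkg} not installed")
    return problems
```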

πŸš€ Full Training Pipeline

We release training code for both Qwen (via verl) and InternVL families.

🧠 Step 1: Supervised Fine-Tuning

We use the official InternVL3.0 training framework for supervised fine-tuning. See the config folder for the example configs used.

🎯 Step 2: Reinforcement Learning (GRPO)

Coming soon


πŸ† Project Info

References

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Paper for LuKasatvt/VISTA-Gym