# Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Paper | Models & Tools | Code

## Overview
While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images" (to reason through multi-step visual interactions) remains limited.
We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 task types across 13 datasets) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning.
## Installation

### Gym setup

```bash
git clone https://github.com/Lucanyc/vista-gym.git
cd vista-gym
pip install -e .
```

### Build Docker container

Since our gym environment relies on a Docker container for isolated code execution, first build the Docker image:

```bash
docker build -f docker/Dockerfile -t vlm_gym:latest .
```

Alternatively, run the prepared script directly:

```bash
bash docker/build_docker.sh
```
### Training setup

We follow the verl / verl-tool environment: TIGER-AI-Lab/verl-tool

### Tool setup

Coming soon.
| Model | HuggingFace Link | Usage |
|---|---|---|
| ChartMoE | LuKasatvt/VistaGym/ChartMoE | `--chartmoe-model "LuKasatvt/VistaGym"` |
| MultiMath | LuKasatvt/VistaGym/MultiMath | `--enable-multimath` |
| GLLaVA | LuKasatvt/VistaGym/GLLaVA | `--enable-gllava` |
| EasyOCR | LuKasatvt/VistaGym/EasyOCR | `--enable-easyocr` |
| Qwen2.5-VL-7B | Qwen/Qwen2.5-VL-7B-Instruct | `--model Qwen/Qwen2.5-VL-7B-Instruct` |
| InternVL3-8B | OpenGVLab/InternVL3-8B-Instruct | `--model OpenGVLab/InternVL3-8B-Instruct` |
| ... | More models supported | |
## VISTA-Gym

A scalable reinforcement learning gym for training tool-integrated visual reasoning in VLMs. Drop in any benchmark and any VLM, then evaluate with reflection and tool-augmented reasoning.
Core components:

| Directory | Description |
|---|---|
| `vlm_gym/environments/` | Pluggable vision-QA environments (ChartQA, ScienceQA, Geometry3K, etc.) |
| `vlm_gym/agents/` | VLM agent implementations (Qwen2.5-VL-7B-Instruct, InternVL3-8B, etc.) |
| `vlm_gym/environments/tools/` | Visual tools (ChartMoE, DeepEyes, GroundingDINO, EasyOCR, SAM2, etc.) |
| `vlm_gym/tasks/` | Task-specific reasoning, evaluation, and feedback components |
| `scripts/` | Evaluation entry points |
| `data_adapters/` | Dataset converters to the unified vlmgym format |
### Gym Interaction

The gym follows a standard environment-agent loop: the environment sends an observation (image + question), the agent returns an action (predicted answer), and the environment provides feedback for retry.

```
Environment (ChartQA)                      Agent (VLM)
        |                                       |
        |---- obs: image + question ----------->|
        |                                       |
        |<--- action: <think>...</think>        |
        |     <answer>Yes</answer>              |
        |                                       |
        |      [if wrong & reflection on]       |
        |                                       |
        |---- feedback + retry ---------------->|
        |<--- action: revised answer -----------|
        |                                       |
        |      reward: 1.0 (correct)            |
```
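The loop above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and method names (`ToyChartQAEnv`, `run_episode`) are assumptions for exposition and do not mirror the actual `vlm_gym` API.

```python
# Illustrative sketch of the environment-agent loop with reflection.
# Class and method names are hypothetical, not the real vlm_gym API.
class ToyChartQAEnv:
    def __init__(self, sample, max_attempts=3):
        self.sample = sample
        self.max_attempts = max_attempts

    def reset(self):
        # Observation: image plus question.
        return {"image": self.sample["image"], "question": self.sample["question"]}

    def step(self, action):
        # Reward 1.0 for a correct answer; otherwise feedback for a retry.
        if action["answer"] == self.sample["answer"]:
            return 1.0, None
        return 0.0, "Incorrect; inspect the chart again."


def run_episode(env, agent):
    obs, feedback, reward = env.reset(), None, 0.0
    for _ in range(env.max_attempts):  # reflection: retry on feedback
        action = agent(obs, feedback)  # e.g. {"think": ..., "answer": ...}
        reward, feedback = env.step(action)
        if reward == 1.0:
            break
    return reward
```

With reflection enabled, an agent that answers wrong on the first attempt gets the feedback string and a second try, matching the retry arrow in the diagram.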
Available tools:

| Tool | Description | Flag |
|---|---|---|
| ChartMoE | Chart data extraction (`to_table`, `extract_data`, `describe`) | `--enable-chartmoe` |
| DeepEyes | Image zoom/magnification for fine-grained visual analysis | `--enable-deepeyes` |
| Grounding DINO | Object detection and grounding | `--config-experiment chartqa_grounding` |
| EasyOCR | Optical character recognition | `--enable-easyocr` |
| SAM2 | Segment Anything 2 for image segmentation | `--enable-sam2` |
| MultiMath | Mathematical reasoning tool | `--enable-multimath` |
| ... | More tools supported | |
### Data Preparation

Convert ChartQA to the vlmgym format:

```bash
python data_adapters/convert_chartqa_to_vlmgym.py
```
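For illustration, a converted record might look like the following. The field names here are an assumption for exposition; the actual schema emitted by `convert_chartqa_to_vlmgym.py` may differ.

```python
import json

# Hypothetical example of one converted vlmgym record; the real schema
# produced by convert_chartqa_to_vlmgym.py may use different field names.
record = {
    "id": "chartqa_train_000001",
    "image": "png/two_col_1.png",  # resolved against --data-root
    "question": "How many units were sold in 2020?",
    "answer": "42",
    "task_type": "chartqa",
}
print(json.dumps(record, indent=2))
```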
### Run ChartQA Evaluation with Reflection

```bash
python scripts/run_chartqa_eval_reflection_with_tool.py \
    --annotation data/chartqa/converted_train/train_human_vlmgym_container.json \
    --data-root data/chartqa \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --enable-reflection \
    --max-attempts 3 \
    --numerical-tolerance 0.05 \
    --limit 50
```
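The `--numerical-tolerance 0.05` flag indicates that numeric answers are scored with a relative tolerance rather than exact string match. A minimal sketch of such a check (the function name and the exact matching rules are assumptions, not taken from the evaluation script):

```python
def answers_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Relative tolerance for numeric answers, exact match for text."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        # Non-numeric answers fall back to a case-insensitive exact match.
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol
```

With `tol=0.05`, a prediction of 102 matches a gold answer of 100 (2% relative error), while 110 does not.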
### Run with Tool-Augmented Reasoning

```bash
python scripts/run_chartqa_eval_reflection_with_tool.py \
    --annotation data/chartqa/converted_train/train_human_vlmgym_container.json \
    --data-root data/chartqa \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --enable-chartmoe \
    --chartmoe-model "/workspace/mathvista/model" \
    --chartmoe-device cuda \
    --use-structured-output \
    --enable-reflection \
    --max-attempts 3 \
    --limit 50
```
### Requirements

- Python 3.10+
- Linux, CUDA 12, NVIDIA GPU (80 GB+ recommended for training; inference requires ~20 GB for a 7B model)
- PyTorch 2.0+
- Transformers 4.40+
## Full Training Pipeline

We release training code for both the Qwen (via verl) and InternVL model families.

### Step 1: Supervised Fine-Tuning

We use the official InternVL3.0 training framework for supervised fine-tuning. See the `config` folder for the example configs used.

### Step 2: Reinforcement Learning (GRPO)

Coming soon.
## Project Info

### References

- verl-tool: TIGER-AI-Lab/verl-tool
- Qwen2.5-VL: Qwen/Qwen2.5-VL-7B-Instruct
