---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3-8B-hf
tags:
- vision-language-model
- vlm
- reasoning
- perception
- rlvr
- grpo
- icml-2026
---

# VLM-CapCurriculum-InternVL3-8B-Staged

A vision-language model post-trained from **OpenGVLab/InternVL3-8B-hf** with the staged, capability-dimension curriculum from *"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models"* (ICML 2026). This release is the **Stage-3 step-186** checkpoint, which gave the largest gain over the merged baseline among the four backbones we tried.

> **TL;DR.** Visual perception — not reasoning length — is the dominant bottleneck for visual reasoning in VLMs. We fix this by post-training along a **capability axis** (perception → textual reasoning → visual reasoning) rather than mixing all data together.

| Resource | Link |
|---|---|
| 📄 Paper | |
| 💻 Code | https://github.com/UCSC-VLAA/VLM-CapCurriculum |
| 🌐 Project page | https://ucsc-vlaa.github.io/VLM-CapCurriculum |
| 🤗 Collection (model + data + eval) | https://huggingface.co/collections/UCSC-VLAA/vlm-capcurriculum-from-seeing-to-thinking-icml-2026-6a07691f944148ccb2b183b8 |

## Headline numbers (extended benchmark suite, AVG over 10 benchmarks)

| Setting | Extended AVG |
|---|:---:|
| InternVL3-8B (base) | 29.69 |
| InternVL3-8B + Merged training | 41.94 |
| **InternVL3-8B + Staged (this model)** | **45.71** |

Δ over merged: **+3.77** — the largest staged-vs-merged gap among the four backbones. Most striking is WeMath (+9.90 over merged), evidence that decoupling perception and reasoning is especially impactful for weaker base models. See Appendix Table 9 of the paper for the full per-benchmark breakdown.

## How it was trained

Three RLVR stages with GRPO (on top of [EasyR1](https://github.com/hiyouga/EasyR1)):

1. **Stage 1 — visual perception** on `UCSC-VLAA/VLM-CapCurriculum-Perception` (synthesised + filtered DOCCI MCQs).
2. **Stage 2 — textual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-TextReasoning` (ORZ-Math-13k).
3. **Stage 3 — visual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-VisualReasoning` (CLEVR-Math + GeoQA170K + Math PUMA + DocVQA + ArxivQA mix).

**Released checkpoint: step 186 of Stage 3.** InternVL3 needs damped optimisation in Stage 2 to avoid entropy explosion (`lr=3e-7`, `kl=5e-2`, `clip=0.15`, `max_grad_norm=0.5`); these settings are baked into the launch scripts. All three stages share **one** system / format prompt — see [Inference](#inference) below.

Detailed launch scripts: [`training/examples/internvl3_8b/`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/tree/main/training/examples/internvl3_8b) in the code repo.

## Inference

The model expects the unified system prompt that it was trained against:

```
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}. i.e. <think> reasoning here </think> \boxed{final answer here}
```

Quick start with LMDeploy (the InternVL family is served via LMDeploy in our setup):

```bash
lmdeploy serve api_server UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged \
    --server-port 23342 --tp 4
```

For VLMEvalKit-style benchmark eval, plug it in via the `InternVL3_8B_Staged` alias defined in [`evaluation/configs/models.py`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/blob/main/evaluation/configs/models.py).
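
For ad-hoc queries, the LMDeploy server started above exposes an OpenAI-compatible API. The sketch below is illustrative rather than part of the official tooling: it assumes the server is reachable on `localhost:23342`, that the `openai` Python package is installed, and the image URL and question are placeholders.

```python
# Minimal sketch of querying the LMDeploy OpenAI-compatible endpoint started above.
# Assumptions: server on localhost:23342, `pip install openai`, placeholder image/question.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23342/v1", api_key="EMPTY")  # key is unused locally

# LMDeploy registers the served checkpoint as a model id; look it up instead of hard-coding.
model_name = client.models.list().data[0].id

messages = [
    {
        "role": "system",
        # The unified system / format prompt the model was trained against.
        "content": (
            "You FIRST think about the reasoning process as an internal monologue "
            "and then provide the final answer. The reasoning process MUST BE enclosed "
            "within <think> </think> tags. The final answer MUST BE put in \\boxed{}. "
            "i.e. <think> reasoning here </think> \\boxed{final answer here}"
        ),
    },
    {
        "role": "user",
        "content": [
            # Placeholder image URL; replace with your own image.
            {"type": "image_url", "image_url": {"url": "https://example.com/geometry_problem.png"}},
            {"type": "text", "text": "What is the measure of angle ABC, in degrees?"},
        ],
    },
]

response = client.chat.completions.create(model=model_name, messages=messages, temperature=0.0)
print(response.choices[0].message.content)
```

For reproducing benchmark numbers, prefer the VLMEvalKit path above over this ad-hoc client.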
## Intended use & limitations Intended for research on vision-language reasoning, post-training methodology, and capability-dimension curriculum learning. Inherits the safety / bias profile of the underlying InternVL3-8B backbone; we have not added additional alignment fine-tuning. Not recommended for high-stakes deployments without further evaluation. Trained at the 8B parameter scale with 4096-token max prompt length and a fixed group size of 5. Behaviour at much longer contexts or substantially different prompt formats has not been characterised. ## License & citation Released under **Apache-2.0**, matching the upstream backbone. If you use this model, please cite: ```bibtex @inproceedings{vlmcapcurriculum2026, title = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models}, author = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou}, booktitle = {Proceedings of the International Conference on Machine Learning (ICML)}, year = {2026} } ``` ## Acknowledgements Built on top of [EasyR1](https://github.com/hiyouga/EasyR1), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), and the [InternVL](https://huggingface.co/OpenGVLab) family.