--- license: apache-2.0 language: - en library_name: transformers pipeline_tag: image-text-to-text base_model: OpenGVLab/InternVL3_5-8B-HF tags: - vision-language-model - vlm - reasoning - perception - rlvr - grpo - icml-2026 --- # VLM-CapCurriculum-InternVL3.5-8B-Staged A vision-language model post-trained from **OpenGVLab/InternVL3_5-8B-HF** with the staged, capability-dimension curriculum from *"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models"* (ICML 2026). This release is the **Stage-3 step-279** checkpoint. > **TL;DR.** Visual perception — not reasoning length — is the dominant bottleneck for visual reasoning in VLMs. We fix this by post-training along a **capability axis** (perception → textual reasoning → visual reasoning) rather than mixing all data together. | Resource | Link | |---|---| | 📄 Paper | | | 💻 Code | https://github.com/UCSC-VLAA/VLM-CapCurriculum | | 🌐 Project page | https://ucsc-vlaa.github.io/VLM-CapCurriculum | | 🤗 Collection (model + data + eval) | https://huggingface.co/collections/UCSC-VLAA/vlm-capcurriculum-from-seeing-to-thinking-icml-2026-6a07691f944148ccb2b183b8 | ## Headline numbers (extended benchmark suite, AVG over 10 benchmarks) | Setting | Extended AVG | |---|:---:| | InternVL3.5-8B (base) | 37.33 | | InternVL3.5-8B + Merged training | 52.76 | | **InternVL3.5-8B + Staged (this model)** | **53.71** | See Appendix Table 9 of the paper for the full per-benchmark breakdown. ## How it was trained Three RLVR stages with GRPO (on top of [EasyR1](https://github.com/hiyouga/EasyR1)): 1. **Stage 1 — visual perception** on `UCSC-VLAA/VLM-CapCurriculum-Perception` (synthesised + filtered DOCCI MCQs). 2. **Stage 2 — textual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-TextReasoning` (ORZ-Math-13k). 3. **Stage 3 — visual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-VisualReasoning` (CLEVR-Math + GeoQA170K + Math PUMA + DocVQA + ArxivQA mix). **Released checkpoint: step 279 of Stage 3.** All three stages share **one** system / format prompt — see [Inference](#inference) below. Detailed launch scripts: [`training/examples/internvl3_5_8b/`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/tree/main/training/examples/internvl3_5_8b) in the code repo. ## Inference The model expects the unified system prompt that it was trained against: ``` You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. The final answer MUST BE put in \boxed{}. i.e. reasoning here \boxed{final answer here} ``` Quick start with LMDeploy: ```bash lmdeploy serve api_server UCSC-VLAA/VLM-CapCurriculum-InternVL3.5-8B-Staged \ --server-port 23343 --tp 4 ``` For VLMEvalKit-style benchmark eval, plug it in via the `InternVL3_5_8B_Staged` alias defined in [`evaluation/configs/models.py`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/blob/main/evaluation/configs/models.py). ## Intended use & limitations Intended for research on vision-language reasoning, post-training methodology, and capability-dimension curriculum learning. Inherits the safety / bias profile of the underlying InternVL3.5-8B backbone; we have not added additional alignment fine-tuning. Not recommended for high-stakes deployments without further evaluation. Trained at the 8B parameter scale with 4096-token max prompt length and a fixed group size of 5. Behaviour at much longer contexts or substantially different prompt formats has not been characterised. ## License & citation Released under **Apache-2.0**, matching the upstream backbone. If you use this model, please cite: ```bibtex @inproceedings{vlmcapcurriculum2026, title = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models}, author = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou}, booktitle = {Proceedings of the International Conference on Machine Learning (ICML)}, year = {2026} } ``` ## Acknowledgements Built on top of [EasyR1](https://github.com/hiyouga/EasyR1), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), and the [InternVL](https://huggingface.co/OpenGVLab) family.