Image-Text-to-Text
Transformers
Safetensors
English
internvl
vision-language-model
vlm
reasoning
perception
rlvr
grpo
icml-2026
conversational
Instructions to use UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged")
model = AutoModelForImageTextToText.from_pretrained("UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged with vLLM:
Install vLLM from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

Use Docker
```bash
docker model run hf.co/UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged
```
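The vLLM server's OpenAI-compatible endpoint can also be queried programmatically. Below is a minimal sketch using the `openai` Python client, assuming the server started above is reachable on the default `localhost:8000`; the API key value is a placeholder, since a local vLLM server does not check it by default.

```python
# Minimal sketch: call the vLLM server's OpenAI-compatible API from Python.
# Assumes the server started above is reachable at localhost:8000 (adjust base_url otherwise).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is a placeholder for a local server

response = client.chat.completions.create(
    model="UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same snippet works against the SGLang server described in the next section by switching `base_url` to `http://localhost:30000/v1`.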
- SGLang
How to use UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged with SGLang:
Install SGLang from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged with Docker Model Runner:
```bash
docker model run hf.co/UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged
```
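Once the model is running under Docker Model Runner, it can also be queried over an OpenAI-compatible API. The sketch below is an assumption-laden example: it presumes TCP access to Docker Model Runner is enabled on the default port 12434 and that the `/engines/v1/` route is available; check your Docker configuration and adjust the URL if needed.

```bash
# Hypothetical example: query Docker Model Runner's OpenAI-compatible endpoint.
# Host, port, and path are assumptions (TCP access enabled on 12434); adjust to your setup.
curl -X POST "http://localhost:12434/engines/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "hf.co/UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged",
    "messages": [
      {"role": "user", "content": "Describe the Statue of Liberty in one sentence."}
    ]
  }'
```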
Update README: add author list, real GitHub/Collection/project-page URLs
README.md CHANGED

````diff
@@ -28,9 +28,9 @@ This release is the **Stage-3 step-186** checkpoint, which gave the largest gain
 | Resource | Link |
 |---|---|
 | 📄 Paper | <TODO_PAPER_URL> |
-| 💻 Code | https://github.com/
-| 🌐 Project page |
-| 🤗 Collection (model + data + eval) | https://huggingface.co/collections/UCSC-VLAA/
+| 💻 Code | https://github.com/UCSC-VLAA/VLM-CapCurriculum |
+| 🌐 Project page | https://ucsc-vlaa.github.io/VLM-CapCurriculum |
+| 🤗 Collection (model + data + eval) | https://huggingface.co/collections/UCSC-VLAA/vlm-capcurriculum-from-seeing-to-thinking-icml-2026-6a07691f944148ccb2b183b8 |
 
 ## Headline numbers (extended benchmark suite, AVG over 10 benchmarks)
 
@@ -54,7 +54,7 @@ InternVL3 needs damped optimisation in Stage 2 to avoid entropy explosion (`lr=3
 
 All three stages share **one** system / format prompt → see [Inference](#inference) below.
 
-Detailed launch scripts: [`training/examples/internvl3_8b/`](https://github.com/
+Detailed launch scripts: [`training/examples/internvl3_8b/`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/tree/main/training/examples/internvl3_8b) in the code repo.
 
 ## Inference
 
@@ -74,7 +74,7 @@ lmdeploy serve api_server UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged \
   --server-port 23342 --tp 4
 ```
 
-For VLMEvalKit-style benchmark eval, plug it in via the `InternVL3_8B_Staged` alias defined in [`evaluation/configs/models.py`](https://github.com/
+For VLMEvalKit-style benchmark eval, plug it in via the `InternVL3_8B_Staged` alias defined in [`evaluation/configs/models.py`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/blob/main/evaluation/configs/models.py).
 
 ## Intended use & limitations
 
@@ -89,7 +89,7 @@ Released under **Apache-2.0**, matching the upstream backbone. If you use this m
 ```bibtex
 @inproceedings{vlmcapcurriculum2026,
   title = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models},
-  author = {
+  author = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou},
   booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
   year = {2026}
 }
````