4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Paper | Code

4DThinker is a framework that enables Vision-Language Models (VLMs) to "think with 4D" through dynamic latent mental imagery: internally simulating how scenes evolve within the continuous hidden space. It addresses dynamic spatial reasoning from monocular video by grounding the model in dynamic visual semantics.

This repository contains the trained 4DThinker model checkpoints, built on Qwen2.5-VL-3B.

Model Structure

model/
├── dift/
│   ├── checkpoints/          # DIFT-stage model weights
│   │   ├── model-00001-of-00002.safetensors
│   │   ├── model-00002-of-00002.safetensors
│   │   ├── config.json
│   │   ├── tokenizer.json
│   │   └── ...
│   └── tensorboard/          # DIFT training logs
└── 4drl/
    ├── model-00001-of-00002.safetensors
    ├── model-00002-of-00002.safetensors
    ├── config.json
    ├── tokenizer.json
    ├── trainer_state.json
    └── ...
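Since each stage's weights live in a different subfolder, a small helper (illustrative, not part of the repo) can map a stage name to the `subfolder` argument passed to `from_pretrained`:

```python
# Illustrative helper (not part of the repo): map a training stage to the
# Hub subfolder that holds its weights, per the layout above.
STAGE_TO_SUBFOLDER = {
    "dift": "dift/checkpoints",  # DIFT-stage supervised weights
    "4drl": "4drl",              # final GRPO-reinforced weights
}

def subfolder_for(stage: str) -> str:
    """Return the Hub subfolder for a given training stage."""
    try:
        return STAGE_TO_SUBFOLDER[stage.lower()]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None

print(subfolder_for("4drl"))  # -> 4drl
```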

Models

| Model | Stage | Base Model | Description |
|---|---|---|---|
| `dift/checkpoints/` | DIFT | Qwen2.5-VL-3B-Instruct | Supervised with a cosine-similarity loss on latent visual tokens |
| `4drl/` | 4DRL (GRPO) | DIFT checkpoint | Reinforced with answer-based rewards |
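The shape of the DIFT stage's cosine-similarity supervision can be sketched in plain Python. This is a minimal illustration of the loss form (1 minus cosine similarity), not the paper's exact objective, which operates on batched hidden states:

```python
import math

# Minimal sketch: cosine-similarity loss between a predicted latent visual
# token and its supervision target, loss = 1 - cos(pred, target).
# Perfectly aligned vectors give 0; orthogonal vectors give 1.
def cosine_loss(pred, target):
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

print(cosine_loss([1.0, 0.0], [1.0, 0.0]))  # -> 0.0 (aligned)
print(cosine_loss([1.0, 0.0], [0.0, 1.0]))  # -> 1.0 (orthogonal)
```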

Special Tokens

Three special tokens are added to the Qwen2.5-VL vocabulary to support latent imagery:

| Token | Description |
|---|---|
| `<latent_pad>` | Placeholder slot filled by latent visual tokens |
| `<latent_start>` | Marks the beginning of a latent imagery span |
| `<latent_end>` | Marks the end of a latent imagery span |
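The intended use of these tokens can be sketched as delimiting a block of latent placeholder slots. The exact token strings below are assumptions inferred from the token names; check the checkpoint's tokenizer_config.json for the canonical forms:

```python
# Assumed token strings (inferred from the token names; verify against the
# checkpoint's tokenizer_config.json).
LATENT_START = "<latent_start>"
LATENT_END = "<latent_end>"
LATENT_PAD = "<latent_pad>"

def latent_span(n_slots: int) -> str:
    """Delimit a block of n latent-imagery placeholder slots."""
    return LATENT_START + LATENT_PAD * n_slots + LATENT_END

print(latent_span(3))
```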

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Load the final 4DRL (GRPO) checkpoint; pass subfolder="dift/checkpoints"
# instead to load the DIFT-stage weights.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jankin123/4DThinker-3B",
    subfolder="4drl",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
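A video question can then be posed in the standard Qwen2.5-VL chat format. This is a sketch: the video path and question are placeholders, and the commented generation lines assume the usual transformers chat-template API:

```python
# Sketch of a video QA request in the standard Qwen2.5-VL chat format.
# The video path and question text are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "Which direction does the camera move relative to the chair?"},
        ],
    }
]

# With the model and processor loaded as above, generation follows the
# usual pattern (requires the checkpoint to be downloaded):
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt"
# ).to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=512)
# print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```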


Bibtex

If you find 4DThinker helpful for your work, please cite:

@article{chen20264dthinker,
  title={4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding},
  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and An, Xiang and Li, Bo and Xie, Xin and Wang, ZiDong and Sun, Mingze and Chen, Shuang and Li, Hongyu and others},
  journal={arXiv preprint arXiv:2605.05997},
  year={2026}
}

License

Apache License 2.0
