---
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- 4DThinker
- dynamic-spatial-reasoning
- vision-language-model
- latent-reasoning
---

# 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

[**Paper**](https://huggingface.co/papers/2605.05997) | [**Code**](https://github.com/zhangquanchen/4DThinker)

4DThinker is a framework that enables Vision-Language Models (VLMs) to "think with 4D" through dynamic latent mental imagery: internally simulating how scenes evolve within the continuous hidden space. It addresses dynamic spatial reasoning from monocular video by grounding the model in dynamic visual semantics.

This repository contains the trained **4DThinker** model checkpoints, built on Qwen2.5-VL-3B.

## Model Structure

```
model/
├── dift/
│   ├── checkpoints/          # DIFT-stage model weights
│   │   ├── model-00001-of-00002.safetensors
│   │   ├── model-00002-of-00002.safetensors
│   │   ├── config.json
│   │   ├── tokenizer.json
│   │   └── ...
│   └── tensorboard/          # DIFT training logs
└── 4drl/
    ├── model-00001-of-00002.safetensors
    ├── model-00002-of-00002.safetensors
    ├── config.json
    ├── tokenizer.json
    ├── trainer_state.json
    └── ...
```

## Models

| Model | Stage | Base Model | Description |
|-------|-------|------------|-------------|
| `dift/checkpoints/` | DIFT | Qwen2.5-VL-3B-Instruct | Supervised with cosine similarity loss on latent visual tokens |
| `4drl/` | 4DRL (GRPO) | DIFT checkpoint | Reinforced with answer-based rewards |
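
The intermediate DIFT-stage weights can be loaded from their subfolder instead of the final 4DRL model. This is a minimal sketch assuming the repository layout shown above:

```python
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the intermediate DIFT-stage checkpoint (subfolder path assumed from
# the repository layout above; use "4drl" for the final reinforced model).
dift_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jankin123/4DThinker-3B",
    subfolder="dift/checkpoints",
    torch_dtype="auto",
    device_map="auto",
)
```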

## Special Tokens

Three special tokens are added to the Qwen2.5-VL vocabulary to support latent imagery:

| Token | Description |
|-------|-------------|
| `<|latent_pad|>` | Padding within latent sequences |
| `<|latent_start|>` | Marks start of latent visual token block |
| `<|latent_end|>` | Marks end of latent visual token block |
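
As a quick sanity check that a loaded tokenizer includes these tokens, the following sketch looks up their ids (the repository id and subfolder mirror the Usage section below):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
tokenizer = processor.tokenizer

# Each token should map to a valid id; an unknown-token id here would
# indicate the special tokens were not added to the vocabulary.
for token in ["<|latent_pad|>", "<|latent_start|>", "<|latent_end|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```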

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jankin123/4DThinker-3B",
    subfolder="4drl",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
```
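
For end-to-end inference, the model follows the standard Qwen2.5-VL chat interface. The sketch below is illustrative: the video path and question are placeholders, and it assumes the `qwen_vl_utils` helper package (used throughout the Qwen2.5-VL ecosystem) is installed.

```python
from qwen_vl_utils import process_vision_info

# Placeholder video and question; replace with your own inputs.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Describe how the objects in the scene move over time."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```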

## Citation

If you find 4DThinker helpful for your work, please cite:

```bibtex
@article{chen20264dthinker,
  title={4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding},
  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and An, Xiang and Li, Bo and Xie, Xin and Wang, ZiDong and Sun, Mingze and Chen, Shuang and Li, Hongyu and others},
  journal={arXiv preprint arXiv:2605.05997},
  year={2026}
}
```

## License

Apache License 2.0