Update model card: add metadata, library tags, and repository links
Hi! I'm Niels from the community science team at Hugging Face.
This PR improves the model card for 4DThinker by:
- Adding the `video-text-to-text` pipeline tag for better discoverability.
- Adding `library_name: transformers` metadata since the model is compatible with the library.
- Linking the model card to the original paper and the official GitHub repository.
- Organizing the content for better readability while maintaining the existing technical details.
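Once this PR is merged, the added metadata should be queryable from the Hub API. A quick, purely illustrative check (the repo id `jankin123/4DThinker-3B` is taken from the usage example in the diff below) could look like this:

```python
# Illustrative sketch: verify that the card metadata added in this PR is picked up by the Hub.
# The repo id is assumed from the Usage section of the card, not from this PR itself.
from huggingface_hub import model_info

info = model_info("jankin123/4DThinker-3B")
print(info.pipeline_tag)   # expected "video-text-to-text" once this PR is merged
print(info.library_name)   # expected "transformers"
print(info.tags)           # should include the tags added in the YAML front matter
```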
README.md
CHANGED
````diff
@@ -1,15 +1,21 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: video-text-to-text
 tags:
+- 4DThinker
+- dynamic-spatial-reasoning
+- vision-language-model
+- latent-reasoning
 ---
 
-# 4DThinker
+# 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
+
+[**Paper**](https://huggingface.co/papers/2605.05997) | [**Code**](https://github.com/zhangquanchen/4DThinker)
+
+4DThinker is a framework that enables Vision-Language Models (VLMs) to "think with 4D" through dynamic latent mental imagery—internally simulating how scenes evolve within the continuous hidden space. It addresses dynamic spatial reasoning from monocular video by grounding the model in dynamic visual semantics.
+
+This repository contains the trained model checkpoints from Qwen2.5-VL-3B for **4DThinker**.
 
 ## Model Structure
 
@@ -41,13 +47,13 @@ model/
 
 ## Special Tokens
 
-Three special tokens are added to the Qwen2.5-VL vocabulary:
+Three special tokens are added to the Qwen2.5-VL vocabulary to support latent imagery:
 
 | Token | Description |
 |-------|-------------|
-| `<
-| `<
-| `<
+| `<|latent_pad|>` | Padding within latent sequences |
+| `<|latent_start|>` | Marks start of latent visual token block |
+| `<|latent_end|>` | Marks end of latent visual token block |
 
 ## Usage
 
@@ -55,13 +61,25 @@ Three special tokens are added to the Qwen2.5-VL vocabulary:
 from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
 
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "
+    "jankin123/4DThinker-3B",
+    subfolder="4drl",
     torch_dtype="auto",
     device_map="auto"
 )
-processor = AutoProcessor.from_pretrained("
+processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
+```
+
+## Citation
+
+```bibtex
+@article{4dthinker,
+  title={4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding},
+  author={Zhang, Quanchen and others},
+  journal={arXiv preprint arXiv:2605.05997},
+  year={2026}
+}
 ```
 
 ## License
 
 Apache License 2.0
````
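As an optional sanity check for reviewers, the card's usage example can be extended into a minimal end-to-end run. This is an untested sketch: the prompt text is illustrative, and real dynamic spatial reasoning would pass video frames through the processor's vision pipeline, which the card's snippet does not cover.

```python
# Sketch only: extends the card's Usage snippet with a special-token check and a
# text-only smoke test. The prompt is illustrative and not part of the model card.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jankin123/4DThinker-3B",
    subfolder="4drl",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")

# The three latent-imagery tokens listed in the card should resolve to valid vocabulary ids.
for tok in ("<|latent_pad|>", "<|latent_start|>", "<|latent_end|>"):
    print(tok, processor.tokenizer.convert_tokens_to_ids(tok))

# Text-only generation; actual use would also feed video frames through the processor.
messages = [{"role": "user", "content": [{"type": "text", "text": "Describe the scene."}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```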