---
pipeline_tag: image-feature-extraction
---

<div align="center">
<h1>WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens</h1>

[![arXiv](https://img.shields.io/badge/arXiv-2605.18115-b31b1b.svg)](https://arxiv.org/abs/2605.18115)
[![Github](https://img.shields.io/badge/Github-WinTok-blue)](https://github.com/markywg/WinTok)
[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/markyw/WinTok/tree/main)
</div>

This project introduces **WinTok**, a concise hybrid visual tokenizer designed to resolve the long-standing conflict between visual understanding and generation. By decoupling semantic and pixel tokens with an asymmetric distillation mechanism, WinTok achieves a win-win across reconstruction, understanding, and generation, surpassing strong baselines with substantially less training data. <br><br>

> [WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens](https://huggingface.co/papers/2605.18115)<br>
> Yiwei Guo, Shaobin Zhuang, Canmiao Fu, Zhipeng Huang, Chen Li, Jing LYU, Yali Wang<br>
> Shenzhen Institutes of Advanced Technology (Chinese Academy of Sciences), WeChat Vision (Tencent Inc.), Shanghai Jiao Tong University<br>

<p align="center">
  <img src="./assets/visualization.jpg" width="90%">
  <br>
  <em>WinTok achieves superior performance on downstream applications, surpassing previous unified tokenizers, with a more flexible hybrid encoding mechanism.</em>
</p>

## 📰 News
* **[2026.05.19]** 🚀 🚀 🚀 We are excited to release **WinTok**, a unified visual tokenizer featuring our novel **hybrid encoding** and **asymmetric distillation**. Code and model are now available!

## 📖 Implementations

### 🛠️ Installation
- **Dependencies**: 
```bash
bash env.sh
```

### Evaluation

- **Evaluation on ImageNet 50K Validation Set**

The dataset should be organized as follows:
```
imagenet
└── val/
    β”œβ”€β”€ ...
```
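
Before launching a long evaluation run, it can help to confirm that the dataset sits where the script expects it. The following is a minimal sketch, not part of the WinTok repository: the helper name `count_images`, the set of accepted file extensions, and the recursive search (which tolerates either a flat `val/` folder or per-class subfolders) are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed image extensions; adjust to match your copy of the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root, split="val"):
    """Count image files under <root>/<split>, searching recursively.

    Hypothetical helper: raises FileNotFoundError if the split
    directory is missing, so a wrong path fails fast.
    """
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"expected directory not found: {split_dir}")
    return sum(
        1
        for path in split_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Example (path is a placeholder):
# n = count_images("imagenet", "val")
# print(f"found {n} validation images")
```

For the ImageNet validation set the count should be 50,000. The same helper can be pointed at another dataset root by changing the `root` and `split` arguments.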

Update the dataset path in the evaluation script, then run it at 256×256 resolution:
```bash
bash scripts/eval_tokenizer/eval_metrics_ddp.sh
```

- **Evaluation on MS-COCO Val2017**

The dataset should be organized as follows:
```
MSCOCO2017
└── val2017/
    β”œβ”€β”€ ...
```

Update the dataset path in the evaluation script, then run it at 256×256 resolution:
```bash
bash scripts/eval_tokenizer/eval_metrics_ddp.sh
```


### Inference

To quickly check the model's reconstruction quality, run:
```bash
python recon.py --ckpt_path path_to_ckpt
```
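
To put a number on a reconstruction rather than judging it by eye, one common choice is peak signal-to-noise ratio (PSNR). The sketch below is a generic, standard-library implementation and is not taken from the WinTok code; which metrics the repository's scripts report, and how `recon.py` saves its output, are not described here. The function name `psnr` and the assumption of 8-bit pixel values (`max_value=255`) are illustrative.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    Hypothetical helper: both inputs are flat sequences of pixel values
    (for a NumPy image, pass img.ravel()). Assumes 8-bit pixels by default.
    """
    original = list(original)
    reconstructed = list(reconstructed)
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error, computed in float to avoid integer overflow.
    mse = sum(
        (float(a) - float(b)) ** 2 for a, b in zip(original, reconstructed)
    ) / len(original)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the input; identical images give infinity.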

## Citation

```bibtex
@article{guo2026wintok,
  title={WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens},
  author={Guo, Yiwei and Zhuang, Shaobin and Huang, Zhipeng and Fu, Canmiao and Li, Chen and LYU, Jing and Wang, Yali},
  journal={arXiv preprint arXiv:2605.18115},
  year={2026}
}
```