Improve model card: Add pipeline tag, update license, and enrich description with quick start

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +177 -7
README.md CHANGED
@@ -1,20 +1,186 @@
  ---
- license: mit
  ---

  # GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

- [arxiv version](https://arxiv.org/abs/2509.01109)

- [Zhengqiang Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=UX26wSMAAAAJ)<sup>1,2</sup> | [Rongyuan Wu](https://scholar.google.com.hk/citations?hl=zh-CN&user=A-U8zE8AAAAJ)<sup>1,2</sup> | [Lingchen Sun](https://scholar.google.com/citations?hl=zh-CN&tzom=-480&user=ZCDjTn8AAAAJ)<sup>1,2</sup> | [Lei Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=tAK5l1IAAAAJ)<sup>1,2,+</sup>

- <sup>1</sup> The Hong Kong Polytechnic University <sup>2</sup> OPPO Research Institute <sup>+</sup> Corresponding Author

- Please refer to [GPSToken](https://github.com/xtudbxk/GPSToken) for detailed instructions.

- ## CITATION

  ```
  @misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
  title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
  author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
@@ -24,4 +190,8 @@ Please refer to [GPSToken](https://github.com/xtudbxk/GPSToken) for detailed ins
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.01109},
  }
- ```

---
license: apache-2.0
pipeline_tag: image-to-image
---

# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

📚 [Paper](https://huggingface.co/papers/2509.01109) | 💻 [Code](https://github.com/xtudbxk/GPSToken)

This is the official Hugging Face model repository for GPSToken, as presented in the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation".

## Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible in representing regions with varying shapes and textures at different locations, limiting the efficacy of their feature representations. We propose **GPSToken**, a novel **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and texture of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian-parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks, respectively, using 128 tokens.

## News

- **2025.09.19**: GPSToken has been accepted by [NeurIPS 2025](https://openreview.net/forum?id=BxoEDR2yQM)! 🎉🎉🎉
- **2025.09.16**: Uploaded models to [HuggingFace](https://huggingface.co/xtudbxk/GPSToken).
- **2025.09.05**: Updated code for higher resolutions, including GPS-token merging (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/gpstoken.py#L113)) to reduce boundary artifacts and a resized GroupNorm layer (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/vqvae.py#L310)) to ease color shifts.

## Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks the flexibility to handle regions with varying shapes, textures, and locations.
We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework that enables non-uniform tokenization via parametric 2D Gaussians. Our method:
- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
- Represents each region as a 2D Gaussian (mean for position, covariance for shape) plus texture features;
- Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/gpstoken.jpg" width="90%">
</div>

## Core Highlights

#### ✅ Spatially-Adaptive Representation
- Iteratively split the image into entropy-balanced regions of varying positions and shapes, with finer partitions in complex-texture areas, and represent each region with a 2D Gaussian (mean for position, covariance for shape) and corresponding texture features.

#### ✅ Dynamic & Scalable
Furthermore, GPSToken supports:
- **User-Controllable Adjustment**: Manually allocate more tokens to user-interest areas for finer reconstruction.
- **Variable Token Count**: Increase or decrease the token count of each image for a better efficiency-fidelity balance.
- **Scalability to Higher Resolutions**: Maintain comparable performance at higher resolutions without retraining.

#### ✅ Spatial-Texture Disentanglement
- Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for texture features, enabling independent manipulation in downstream tasks such as generation.

#### ✅ SOTA Performance
- Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, FID = 1.65** on image reconstruction with only **256 tokens**, outperforming prior methods.

## GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm

Each token is represented by a **bounded 2D Gaussian function** and an individual feature vector, encoding spatial geometry and texture separately.

#### 📐 Standard 2D Gaussian (Unnormalized)

The core form of the $i$-th Gaussian is:

![Standard 2D Gaussian](https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%28x%2C%20y%29%20%3D%20%5Cexp%5Cleft%28-%5Cfrac%7B1%7D%7B2%281-%5Crho_i%5E2%29%7D%20%5Cleft%28%20%5Cfrac%7B%28x-%5Cmu_%7Bx%2Ci%7D%29%5E2%7D%7B%5Csigma_%7Bx%2Ci%7D%5E2%7D%20-%20%5Cfrac%7B2%5Crho_i%28x-%5Cmu_%7Bx%2Ci%7D%29%28y-%5Cmu_%7By%2Ci%7D%29%7D%7B%5Csigma_%7Bx%2Ci%7D%5Csigma_%7By%2Ci%7D%7D%20+%20%5Cfrac%7B%28y-%5Cmu_%7By%2Ci%7D%29%5E2%7D%7B%5Csigma_%7By%2Ci%7D%5E2%7D%20%5Cright%29%5Cright%29)

- $(\mu_{x,i}, \mu_{y,i})$: center (position)
- $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)
- $\rho_i \in [-1, 1]$: correlation coefficient (orientation)

> This is the unnormalized density, which avoids computing a costly normalization constant $Z$.

#### 📐 Bounded Support for Efficiency

To focus on local regions and enable fast GPU rendering, we define the **modified splatting kernel**:

![Bounded Gaussian Kernel](https://latex.codecogs.com/png.latex?%5Cmathbf%7Bg%7D_i%28x%2C%20y%29%20%3D%20%5Cbegin%7Bcases%7D%20%5Chat%7Bp%7D_i%28x%2C%20y%29%2C%20%26%20%5Ctext%7Bif%20%7D%20%7Cx%20-%20%5Cmu_%7Bx%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7Bx%2Ci%7D%20%5Ctext%7B%20and%20%7D%20%7Cy%20-%20%5Cmu_%7By%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7By%2Ci%7D%20%5C%5C%200%2C%20%26%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D)

- $s$: spatial support factor, empirically set to $s=5$, which covers >99.999% of the Gaussian mass with negligible truncation error.
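
For intuition, the two formulas above can be evaluated on a pixel grid in a few lines of NumPy. This is an illustrative sketch only (the function name and grid convention are ours); the official CUDA implementation lives in the `gscuda` folder.

```python
import numpy as np

def bounded_gaussian(h, w, mu_x, mu_y, sigma_x, sigma_y, rho, s=5.0):
    """Evaluate the unnormalized, bounded 2D Gaussian g_i(x, y) on an h x w grid."""
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    dx, dy = x - mu_x, y - mu_y
    # Unnormalized density \hat{p}_i(x, y)
    q = (dx / sigma_x) ** 2 - 2 * rho * dx * dy / (sigma_x * sigma_y) + (dy / sigma_y) ** 2
    p = np.exp(-q / (2 * (1 - rho ** 2)))
    # Bounded support: zero outside the s-sigma box around the mean
    mask = (np.abs(dx) <= s * sigma_x) & (np.abs(dy) <= s * sigma_y)
    return p * mask

g = bounded_gaussian(64, 64, mu_x=32, mu_y=32, sigma_x=4, sigma_y=2, rho=0.3)
print(g.max())  # 1.0, attained at the Gaussian center
```

Because the density is unnormalized, its peak is exactly 1 at the mean, and everything outside the $s\sigma$ box is exactly zero, which is what makes tile-based GPU rendering cheap.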

#### 🧩 Token Representation

An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$, where each $\mathbf{z}_i = \{\mathbf{g}_i, \mathbf{f}_i\}$ contains:

| Component | Symbol & Type | Role |
|---------------|-----------------------------------|-------------------------------|
| **Geometry** | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params) |
| **Texture** | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |

**Disentangled design**: geometry and texture can be manipulated independently.
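
In code, this layout amounts to slicing a $c$-dimensional token vector into its first five entries (geometry) and remaining $c-5$ entries (texture). A minimal sketch, assuming this ordering (the channel count here is hypothetical and the memory layout of the released checkpoints may differ):

```python
import numpy as np

C = 16  # hypothetical channel count for illustration; real models may use a larger c
token = np.arange(C, dtype=np.float64)  # one GPS-token z_i

geometry = token[:5]   # g_i = (mu_x, mu_y, sigma_x, sigma_y, rho)
texture = token[5:]    # f_i in R^{c-5}

# Disentanglement: translate the Gaussian center without touching the texture
shifted = token.copy()
shifted[:2] += 1.0
print(geometry.shape, texture.shape)  # (5,) (11,)
```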

#### ⚡ CUDA-Based Rendering Algorithm

We implement a **CUDA-accelerated rendering algorithm** to parallelize the forward and backward passes of the bounded Gaussian splatting kernel. Implementation details are provided in the `gscuda` folder.
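
To make the rendering step concrete, here is a pure-NumPy sketch of the forward splatting pass: each token's bounded kernel weights its feature vector, and the weighted features are accumulated into a 2D feature map. The kernel-weighted averaging used here is our assumption for illustration, not the exact `gscuda` scheme, and the function name is ours.

```python
import numpy as np

def splat_tokens(tokens, h, w, s=5.0, eps=1e-8):
    """Render GPS-tokens into an (h, w, c) feature map.

    Each token is (mu_x, mu_y, sigma_x, sigma_y, rho, f), with f a 1D feature list.
    """
    c = len(tokens[0][5])
    num = np.zeros((h, w, c))
    den = np.zeros((h, w, 1))
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    for mu_x, mu_y, sigma_x, sigma_y, rho, f in tokens:
        dx, dy = x - mu_x, y - mu_y
        q = (dx / sigma_x) ** 2 - 2 * rho * dx * dy / (sigma_x * sigma_y) + (dy / sigma_y) ** 2
        g = np.exp(-q / (2 * (1 - rho ** 2)))
        g *= (np.abs(dx) <= s * sigma_x) & (np.abs(dy) <= s * sigma_y)  # bounded support
        num += g[..., None] * np.asarray(f)  # weight the feature vector by the kernel
        den += g[..., None]
    return num / (den + eps)  # kernel-weighted average per pixel

tokens = [(8.0, 8.0, 3.0, 3.0, 0.0, [1.0, 0.0]),
          (24.0, 24.0, 3.0, 3.0, 0.0, [0.0, 1.0])]
fmap = splat_tokens(tokens, 32, 32)
print(fmap.shape)  # (32, 32, 2)
```

Since every operation is a differentiable array expression, gradients flow from the rendered feature map back to both the Gaussian parameters and the features, which is what enables end-to-end training; the CUDA kernel parallelizes exactly this computation per tile.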

## 🏗️ Framework: From Image to GPS-Tokens

GPSToken pipeline: **Initialization → Refinement → Rendering → Reconstruction**
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/framework.jpg" width="90%">
</div>

#### Spatially-adaptive Token Initialization
We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of the corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
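
As a rough sketch of this idea (our own simplification for illustration, not the repository's exact algorithm): greedily split the region with the highest intensity entropy along its longer side until the token budget is reached; each resulting box would then seed one GPS-token's Gaussian parameters (mean at the box center, sigmas proportional to its size).

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy of the intensity histogram of a region (values in [0, 1])."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_partition(img, num_regions):
    """Greedily split the highest-entropy (top, left, height, width) box."""
    regions = [(0, 0, img.shape[0], img.shape[1])]
    while len(regions) < num_regions:
        # score each region; boxes too small to split get a sentinel score
        scores = [patch_entropy(img[t:t + h, l:l + w]) if min(h, w) >= 2 else -1.0
                  for t, l, h, w in regions]
        t, l, h, w = regions.pop(int(np.argmax(scores)))
        if h >= w:  # split along the longer side
            regions += [(t, l, h // 2, w), (t + h // 2, l, h - h // 2, w)]
        else:
            regions += [(t, l, h, w // 2), (t, l + w // 2, h, w - w // 2)]
    return regions

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:32, :32] = rng.random((32, 32))  # textured top-left quadrant, flat elsewhere
regions = entropy_partition(img, 8)
print(len(regions))  # 8
```

Each split removes one box and adds two, so the loop terminates after exactly `num_regions - 1` splits, and the boxes always tile the image; high-entropy areas accumulate more, smaller boxes.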

#### Spatially-adaptive Token Refinement
Starting from the initialized Gaussian parameters, a transformer-based encoder refines them to achieve fine-grained spatial adaptation, while simultaneously extracting the corresponding texture features $\mathbf{f}$ for each region using RoIAlign layers. After refinement, the parameters better match local textures.

#### End-to-end Reconstruction
During decoding, we first render the GPS-tokens into a 2D feature map and then decode it into the reconstructed image. Following existing works, we train with a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$.

## 📊 Experimental Results

#### 1. Image Reconstruction ($256\times 256$ on ImageNet val set)

GPSToken outperforms fixed-grid methods with the same token count.

| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
|------------------|-------------|-----------|-------|--------|--------|-------|-------|
| SDXL-VAE | 32×32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16×16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8×8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| **GPSToken-S64** | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| **GPSToken-M128**| 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| **GPSToken-L256**| 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |

#### 2. Spatial-Adaptivity Visualization
Gaussian tokens automatically concentrate on high-complexity regions.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%">
</div>

> *From left to right*: initialized Gaussian parameters, refined Gaussian parameters, reconstructed images, ground-truth images.

#### 3. User-Controllable Adaptivity
We can manually guide tokens to focus on user-interest regions.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application.jpg">
</div>

> *From left to right*: input image, initialized Gaussian parameters, reconstructed image, adjusted Gaussian parameters, reconstructed image using the adjusted parameters.

#### 4. Variable Token Count of GPS-Tokens
We can **increase** or **decrease** the number of tokens used to encode an image.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application2.jpg">
</div>

> We use GPSToken-M128, which is trained only with 128 tokens, for this demonstration.

#### 5. Scales to Higher Resolutions
GPSToken can generalize to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$ images.

| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|------------------|------------|--------|--------|---------|------------|-------------|
| **512×512** | | | | | | |
| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| **1024×1024** | | | | | | |
| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |

## 🚀 Quick Start

### Model Zoo

One can download the models directly from Hugging Face:

| Model | Token Count | Hugging Face Link |
|---------------|-------------|---------------------------------------------------------------------------------------------------|
| GPSToken-S64 | 64 | [`xtudbxk/GPSToken/tree/main/GPSToken-S64`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-S64) |
| GPSToken-M128 | 128 | [`xtudbxk/GPSToken/tree/main/GPSToken-M128`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-M128) |
| GPSToken-L256 | 256 | [`xtudbxk/GPSToken/tree/main/GPSToken-L256`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-L256) |

### Inference script
```bash
python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
```

## CITATION

If you find our work useful for your research or development, please cite our paper:
```bibtex
@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
  title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
  author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},

  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.01109},
}
```

## CONTACT

Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).