Improve model card: Add pipeline tag, update license, and enrich description with quick start
#3
opened by nielsr (HF Staff)
README.md
CHANGED

@@ -1,20 +1,186 @@
---
license: apache-2.0
pipeline_tag: image-to-image
---

# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

📄 [Paper](https://huggingface.co/papers/2509.01109) | 💻 [Code](https://github.com/xtudbxk/GPSToken)

This is the official Hugging Face model repository for GPSToken, as presented in the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation".

## Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. We propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

## News

- **2025.09.19**: GPSToken has been accepted by [NeurIPS 2025](https://openreview.net/forum?id=BxoEDR2yQM)! 🎉🎉🎉
- **2025.09.16**: Uploaded models to [Hugging Face](https://huggingface.co/xtudbxk/GPSToken).
- **2025.09.05**: Updated the code for higher resolutions, including GPS-token merging (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/gpstoken.py#L113)) to reduce boundary artifacts and a resized GroupNorm layer (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/vqvae.py#L310)) to ease color shifts.

## Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations.

We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:

- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
- Represents each region as a 2D Gaussian (mean for position, covariance for shape) plus texture features;
- Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/gpstoken.jpg" width="90%">
</div>

## Core Highlights

#### ✅ Spatially-Adaptive Representation

- Iteratively splits the image into entropy-balanced regions of varying positions and shapes (finer partitions in complex textures) and represents each region with a 2D Gaussian (mean for position, variance for extent) plus corresponding texture features.

#### ✅ Dynamic & Scalable

GPSToken further supports:

- **User-Controllable Adjustment**: Manually allocate more tokens to regions of user interest for finer reconstruction.
- **Variable Token Count**: Increase or decrease the token count per image for a better efficiency-fidelity balance.
- **Scalable to Higher Resolutions**: Maintain comparable performance at higher resolutions without retraining.

#### ✅ Spatial-Texture Disentanglement

- Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation in downstream tasks such as generation.

#### ✅ SOTA Performance

- Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, FID = 1.65** on image reconstruction with only **256 tokens**, outperforming prior methods.
## GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm

Each token is represented by a **bounded 2D Gaussian function** and an individual feature vector, encoding spatial geometry and texture separately.

#### 📐 Standard 2D Gaussian (Unnormalized)

The core form of the $i$-th Gaussian is:

$$G_i(x, y) = \exp\left(-\frac{1}{2(1-\rho_i^2)}\left[\frac{(x-\mu_{x,i})^2}{\sigma_{x,i}^2} - \frac{2\rho_i (x-\mu_{x,i})(y-\mu_{y,i})}{\sigma_{x,i}\sigma_{y,i}} + \frac{(y-\mu_{y,i})^2}{\sigma_{y,i}^2}\right]\right)$$

- $(\mu_{x,i}, \mu_{y,i})$: center (position)
- $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)
- $\rho_i \in [-1, 1]$: correlation coefficient (orientation)

> This is the unnormalized density, which avoids costly computation of the normalization constant $Z$.
#### 📏 Bounded Support for Efficiency

To focus on local regions and enable fast GPU rendering, we define the **modified splatting kernel**:

$$\hat{G}_i(x, y) = \begin{cases} G_i(x, y), & |x - \mu_{x,i}| \le s\,\sigma_{x,i} \ \text{and} \ |y - \mu_{y,i}| \le s\,\sigma_{y,i} \\ 0, & \text{otherwise} \end{cases}$$

- $s$: spatial support factor (empirically set to $s=5$)

→ Covers >99.999% of the Gaussian mass, with negligible truncation error.
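The coverage claim is easy to verify numerically; a quick sketch (assuming an axis-aligned Gaussian, i.e. $\rho = 0$, which is not from the repo) uses the error function:

```python
import math

def truncated_mass_2d(s: float) -> float:
    """Fraction of an axis-aligned 2D Gaussian's mass inside the box
    [mu_x - s*sigma_x, mu_x + s*sigma_x] x [mu_y - s*sigma_y, mu_y + s*sigma_y]."""
    per_axis = math.erf(s / math.sqrt(2.0))  # 1D mass within +/- s standard deviations
    return per_axis ** 2  # the two axes are independent when rho = 0

print(truncated_mass_2d(5.0))  # well above 0.99999, so truncation error is negligible
```

For correlated Gaussians the retained mass differs slightly, but the $s=5$ box remains a very loose bound.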

#### 🧩 Token Representation

An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$, where each $\mathbf{z}_i = \{\mathbf{g}_i, \mathbf{f}_i\}$ contains:

| Component | Symbol & Type | Role |
|---|---|---|
| **Geometry** | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params) |
| **Texture** | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |

**Disentangled design**: geometry and texture can be manipulated independently.
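To make the disentanglement concrete, here is a minimal sketch (a hypothetical container, not the repo's data structure) where editing geometry leaves texture untouched:

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class GPSTokenSketch:
    # geometry: (mu_x, mu_y, sigma_x, sigma_y, rho)
    g: Tuple[float, float, float, float, float]
    # texture: feature vector of dimension c - 5
    f: Tuple[float, ...]

def move(token: GPSTokenSketch, dx: float, dy: float) -> GPSTokenSketch:
    """Edit the spatial layout only; the texture features are untouched."""
    mx, my, sx, sy, rho = token.g
    return replace(token, g=(mx + dx, my + dy, sx, sy, rho))

t = GPSTokenSketch(g=(16.0, 16.0, 4.0, 4.0, 0.0), f=(0.1, 0.2, 0.3))
t2 = move(t, 8.0, 0.0)
print(t2.g[0], t2.f == t.f)  # 24.0 True
```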

#### ⚡ CUDA-Based Rendering Algorithm

We implement a **CUDA-accelerated rendering algorithm** to parallelize the forward and backward passes of the bounded Gaussian splatting kernel. Implementation details are provided in the `gscuda` folder.
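For intuition, a naive NumPy reference of the splatting step (a sketch following the formulas above, not the CUDA implementation; the token layout here is an assumption) accumulates each token's bounded Gaussian weights, scaled by its feature vector, into a 2D feature map:

```python
import numpy as np

def render_gps_tokens(tokens, height, width, s=5.0):
    """Splat each token's bounded 2D Gaussian, weighted by its feature vector,
    into a (C, H, W) feature map. Pure-NumPy reference, not the CUDA kernel."""
    channels = tokens[0]["f"].shape[0]
    fmap = np.zeros((channels, height, width))
    grid = np.mgrid[0:height, 0:width].astype(float)
    ys, xs = grid[0], grid[1]
    for tok in tokens:
        mx, my, sx, sy, rho = tok["g"]
        # bounded support: only pixels within s standard deviations contribute
        mask = (np.abs(xs - mx) <= s * sx) & (np.abs(ys - my) <= s * sy)
        dx, dy = (xs - mx) / sx, (ys - my) / sy
        quad = (dx**2 - 2.0 * rho * dx * dy + dy**2) / (2.0 * (1.0 - rho**2))
        weight = np.where(mask, np.exp(-quad), 0.0)
        fmap += tok["f"][:, None, None] * weight[None]
    return fmap

# one token centered at (16, 16): the weight at the mean is exp(0) = 1
token = {"g": (16.0, 16.0, 4.0, 4.0, 0.0), "f": np.array([2.0])}
fmap = render_gps_tokens([token], 32, 32)
print(fmap.shape, float(fmap[0, 16, 16]))  # (1, 32, 32) 2.0
```

The CUDA kernel in `gscuda` parallelizes this per-pixel accumulation and its gradient, which is what makes end-to-end training practical.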

## 🏗️ Framework: From Image to GPS-Tokens

GPSToken pipeline: **Initialization → Refinement → Rendering → Reconstruction**

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/framework.jpg" width="90%">
</div>
#### Spatially-adaptive Token Initialization

We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of the corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
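The iterative, entropy-driven partitioning can be sketched as follows (an illustrative re-implementation, not the repo's algorithm: binary splits of the highest-entropy region, with each final box seeding a Gaussian; the box-to-sigma mapping is an assumed choice):

```python
import numpy as np

def region_entropy(img, box):
    """Shannon entropy of the grayscale histogram inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    hist, _ = np.histogram(img[y0:y1, x0:x1], bins=32, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_partition(img, num_regions):
    """Repeatedly halve the highest-entropy region along its longer side."""
    h, w = img.shape  # img: float grayscale in [0, 1]
    boxes = [(0, 0, w, h)]
    while len(boxes) < num_regions:
        i = max(range(len(boxes)), key=lambda k: region_entropy(img, boxes[k]))
        x0, y0, x1, y1 = boxes.pop(i)
        if (x1 - x0) >= (y1 - y0):
            xm = (x0 + x1) // 2
            boxes += [(x0, y0, xm, y1), (xm, y0, x1, y1)]
        else:
            ym = (y0 + y1) // 2
            boxes += [(x0, y0, x1, ym), (x0, ym, x1, y1)]
    return boxes

def box_to_gaussian(box):
    """Assumed seeding: mean at the box center, sigma at a quarter of the extent."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0) / 4, (y1 - y0) / 4, 0.0)
```

Because the highest-entropy region is always split next, textured areas end up covered by more, smaller boxes, and hence more tokens.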

#### Spatially-adaptive Token Refinement

After obtaining the initialized Gaussian parameters, we employ a transformer-based encoder to refine them for fine-grained spatial adaptation, while simultaneously extracting the corresponding texture features $\mathbf{f}$ for each region using RoIAlign layers. After refinement, the parameters better match local textures.

#### End-to-end Reconstruction

During decoding, we first render the GPS-tokens into a 2D feature map, then decode it into the reconstructed image. Following existing works, we use a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$ during training.
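In symbols, the combined objective is a weighted sum (the weights $\lambda$ are illustrative placeholders, not values from the paper):

$$L = L_{\text{rec}} + \lambda_{\text{perc}} L_{\text{perc}} + \lambda_{\text{adv}} L_{\text{adv}}$$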
## 📊 Experimental Results

#### 1. Image Reconstruction ($256\times 256$, ImageNet val set)

GPSToken outperforms fixed-grid methods with the same token count.

| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
|---|---|---|---|---|---|---|---|
| SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| **GPSToken-S64** | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| **GPSToken-M128** | 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| **GPSToken-L256** | 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |
#### 2. Spatial-Adaptivity Visualization

Gaussian tokens automatically concentrate on high-complexity regions.

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%">
</div>

> *From left to right*: initialized GS params, refined GS params, reconstructed images, ground-truth images.

#### 3. User-Controllable Adaptivity

We can manually guide tokens to focus on regions of user interest.

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application.jpg">
</div>

> *From left to right*: input image, initialized GS params, reconstructed image, adjusted GS params, reconstruction from the adjusted GS params.

#### 4. Variable Token Count of GPS-Tokens

We can **increase** or **decrease** the number of tokens used to encode an image.

<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application2.jpg">
</div>

> We use GPSToken-M128, trained with only 128 tokens, for this demonstration.
#### 5. Scales to Higher Resolutions

GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.

| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|---|---|---|---|---|---|---|
| **512×512** | | | | | | |
| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| **1024×1024** | | | | | | |
| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |
## 🚀 Quick Start

### Model Zoo

One can download the models directly from Hugging Face:

| Models | Token Count | Hugging Face Link |
|---|---|---|
| GPSToken-S64 | 64 | [`xtudbxk/GPSToken/tree/main/GPSToken-S64`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-S64) |
| GPSToken-M128 | 128 | [`xtudbxk/GPSToken/tree/main/GPSToken-M128`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-M128) |
| GPSToken-L256 | 256 | [`xtudbxk/GPSToken/tree/main/GPSToken-L256`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-L256) |

### Inference script

```bash
python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
```

## Citation

If you find our work useful or helpful for your R&D work, please feel free to cite our paper as below.

```bibtex
@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109},
}
```

## Contact

Please open an issue or contact Zhengqiang Zhang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).