Improve model card: Add metadata, abstract, key features, and usage
#2 by nielsr - opened

README.md
---
license: apache-2.0
pipeline_tag: image-to-image
paper: https://huggingface.co/papers/2509.01109
repo_url: https://github.com/xtudbxk/GPSToken
project_page: https://openreview.net/forum?id=BxoEDR2yQM
---

# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

This model was presented in the paper [GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation](https://huggingface.co/papers/2509.01109).

## Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

[arXiv version](https://arxiv.org/abs/2509.01109) | [GitHub Repository](https://github.com/xtudbxk/GPSToken) | [Project Page](https://openreview.net/forum?id=BxoEDR2yQM)

[Zhengqiang Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=UX26wSMAAAAJ)<sup>1,2</sup> | [Rongyuan Wu](https://scholar.google.com.hk/citations?hl=zh-CN&user=A-U8zE8AAAAJ)<sup>1,2</sup> | [Lingchen Sun](https://scholar.google.com/citations?hl=zh-CN&tzom=-480&user=ZCDjTn8AAAAJ)<sup>1,2</sup> | [Lei Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=tAK5l1IAAAAJ)<sup>1,2,+</sup>

<sup>1</sup> The Hong Kong Polytechnic University <sup>2</sup> OPPO Research Institute <sup>+</sup> Corresponding Author

## Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks the flexibility to handle regions with varying shapes, textures, and locations.

We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework that enables non-uniform tokenization via parametric 2D Gaussians. Our method:

- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
- Represents each region as a 2D Gaussian (mean for position, covariance for shape) together with texture features;
- Trains a transformer to optimize the Gaussian parameters and texture features for content-aware adaptation;
- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.
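The entropy-driven partitioning in the first step can be pictured as a greedy split loop: repeatedly split the region with the highest texture entropy until the token budget is reached. The sketch below is an illustrative toy in NumPy (the helpers `region_entropy` and `entropy_partition` are hypothetical names, not the paper's actual algorithm):

```python
import numpy as np

def region_entropy(img, box):
    # Shannon entropy of the grayscale histogram inside box = (y0, y1, x0, x1)
    y0, y1, x0, x1 = box
    hist, _ = np.histogram(img[y0:y1, x0:x1], bins=16, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_partition(img, num_regions):
    # Greedily split the highest-entropy region along its longer axis,
    # so textured areas end up with finer partitions.
    boxes = [(0, img.shape[0], 0, img.shape[1])]
    while len(boxes) < num_regions:
        idx = max(range(len(boxes)), key=lambda i: region_entropy(img, boxes[i]))
        y0, y1, x0, x1 = boxes.pop(idx)
        if y1 - y0 >= x1 - x0:
            m = (y0 + y1) // 2
            boxes += [(y0, m, x0, x1), (m, y1, x0, x1)]
        else:
            m = (x0 + x1) // 2
            boxes += [(y0, y1, x0, m), (y0, y1, m, x1)]
    return boxes

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:32, :32] = rng.random((32, 32))   # one textured quadrant, rest flat
boxes = entropy_partition(img, 8)
```

With this toy input, the splits concentrate in the textured top-left quadrant while the flat regions stay as large boxes.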

## Core Highlights

#### ✅ Spatially-Adaptive Representation

- Iteratively split the image into entropy-balanced regions of varying positions and shapes (finer partitions in complex textures), and represent each region with a 2D Gaussian (mean for position, covariance for shape) and corresponding texture features.
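On the decoding side, each token's Gaussian weights its texture feature onto the pixel grid. The following is a minimal NumPy sketch of that idea under assumed conventions (unnormalized Gaussian weights, weighted averaging), not the paper's differentiable splatting renderer:

```python
import numpy as np

def splat_gaussians(means, covs, feats, size):
    # Render a (H, W, C) feature map from N tokens:
    # each pixel averages the token features, weighted by Gaussian density.
    H, W = size
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)      # grid[y, x] = (x, y)
    out = np.zeros((H, W, feats.shape[1]))
    weight = np.zeros((H, W, 1))
    for mu, cov, f in zip(means, covs, feats):
        d = grid - mu
        # Mahalanobis distance -> per-pixel Gaussian weight
        m = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(cov), d)
        w = np.exp(-0.5 * m)[..., None]
        out += w * f
        weight += w
    return out / np.maximum(weight, 1e-8)

means = np.array([[8.0, 8.0], [24.0, 24.0]])              # (x, y) centers
covs = np.array([4.0 * np.eye(2), 4.0 * np.eye(2)])       # isotropic shapes
feats = np.array([[1.0], [0.0]])                          # 1-D "texture" features
img = splat_gaussians(means, covs, feats, (32, 32))
```

Each pixel near a Gaussian's mean is dominated by that token's feature, and anisotropic covariances would stretch a token's footprint to match its region's shape.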

#### ✅ Dynamic & Scalable

Furthermore, GPSToken supports:

- **User-Controllable Adjustment**: Manually allocate more tokens to regions of user interest for finer reconstruction.
- **Variable Token Count**: Increase or decrease the token count per image for a better efficiency-fidelity balance.
- **Scalable to Higher Resolution**: Maintain comparable performance at higher resolutions without retraining.

#### ✅ Spatial-Texture Disentanglement

- Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks such as generation.

#### ✅ SOTA Performance

- Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, FID = 1.65** on image reconstruction with only **256 tokens**, outperforming prior methods.

## Experimental Results

#### 1. Image Reconstruction ($256\times 256$ on the ImageNet val set)

GPSToken outperforms fixed-grid methods with the same token count.

| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
|------------------|-------------|-----------|-------|--------|--------|-------|-------|
| SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| **GPSToken-S64** | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| **GPSToken-M128**| 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| **GPSToken-L256**| 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |

#### 2. Spatial-Adaptivity Visualization

Gaussian tokens automatically concentrate on high-complexity regions.

<img src="https://huggingface.co/xtudbxk/GPSToken/raw/main/figures/appendix_reconv_gs.jpg" width="80%">

#### 3. User-Controllable Adaptivity

We can manually guide tokens to focus on regions of user interest.



#### 4. Variable Token Count of GPS-Tokens

We can **increase** or **decrease** the number of tokens used to encode an image.



#### 5. Scales to Higher Resolutions

GPSToken can generalize to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.

| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|------------------|------------|--------|--------|---------|------------|-------------|
| **512×512** | | | | | | |
| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| **1024×1024** | | | | | | |
| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |
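The token counts in the table grow in proportion to pixel area relative to the $256\times 256$ training resolution. This is an observation from the numbers above, not an official formula, and can be checked in a couple of lines:

```python
def tokens_at(base_tokens: int, resolution: int, base_resolution: int = 256) -> int:
    # Scale the token count with pixel area relative to the training resolution.
    return base_tokens * (resolution // base_resolution) ** 2

# GPSToken-M128: 128 tokens at 256x256 -> 512 at 512x512 -> 2048 at 1024x1024
counts = [tokens_at(128, r) for r in (256, 512, 1024)]
```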

## Quick Start

#### Model Zoo

|Models|Token Count|Download (Hugging Face)|
|---|---|---|
|GPSToken-S64|64|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)|
|GPSToken-M128|128|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)|
|GPSToken-L256|256|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)|

One can also download the models directly from the [Hugging Face repository](https://huggingface.co/xtudbxk/GPSToken).

#### Inference scripts

```bash
python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
```

## CITATION

```bibtex
@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109},
}
```

## CONTACT

Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).


|