Multimodal Markup Document Models for Graphic Design Completion
Paper: arXiv:2409.19051
This repository provides a specialized image tokenizer (VQ model) for RGBA images. The model compresses RGBA images into discrete latent codes, which can be used for downstream tasks such as graphic design completion, as demonstrated in our paper Multimodal Markup Document Models for Graphic Design Completion.
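To make "discrete latent codes" concrete, here is a minimal, self-contained sketch of the core vector-quantization step: each spatial latent vector is replaced by the index of its nearest codebook entry. The codebook size and embedding dimension below are illustrative placeholders, not this model's actual configuration.

```python
import torch

# Illustrative sizes only (not the model's real config).
codebook = torch.randn(8192, 8)       # (num_codes, embed_dim)
latents = torch.randn(1, 8, 16, 16)   # (batch, embed_dim, H/16, W/16)

# Flatten spatial positions to (N, embed_dim), then pick the nearest
# codebook entry for each position by L2 distance.
flat = latents.permute(0, 2, 3, 1).reshape(-1, 8)
dists = torch.cdist(flat, codebook)   # (N, num_codes) pairwise distances
indices = dists.argmin(dim=1)         # the discrete latent codes, shape (N,)
print(indices.shape)
```

Each image is thus represented by a grid of integer code indices, which downstream models can treat like text tokens.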
The model is based on the image tokenizer from the latent diffusion model (LDM). We modified the original model (models/first_stage_models/vq-f16) to handle RGBA images by changing the number of input and output channels from 3 to 4. See our paper for more details.
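The paper is the authoritative source for how the channel change was done; as a hypothetical illustration of what "changing the input channels from 3 to 4" can look like in PyTorch, one common approach is to copy the pretrained RGB convolution weights and initialize the new alpha channel from their mean (this is an assumption for illustration, not necessarily the authors' procedure):

```python
import torch
import torch.nn as nn

# Hypothetical example: adapt a pretrained 3-channel input conv to 4 channels
# (RGBA) by reusing the RGB weights and seeding the alpha channel with their mean.
old_conv = nn.Conv2d(3, 128, kernel_size=3, padding=1)  # stands in for a pretrained layer
new_conv = nn.Conv2d(4, 128, kernel_size=3, padding=1)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)
    new_conv.bias.copy_(old_conv.bias)

x = torch.randn(1, 4, 256, 256)  # RGBA input
out = new_conv(x)
print(out.shape)
```

The output conv is adapted symmetrically so the decoder produces 4-channel reconstructions.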
cat <<EOF > requirements.txt
transformers>=4.42.4
omegaconf>=2.3.0
einops>=0.8.0
pillow>=10.4.0
pytorch_lightning<2.0.0
huggingface-hub>=0.24.0
git+https://github.com/ktrk115/latent-diffusion.git@23d5a49
git+https://github.com/illeatmyhat/taming-transformers.git@aeabaa3
EOF
pip install -r requirements.txt
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
image_processor = AutoImageProcessor.from_pretrained("cyberagent/ldm-vq-f16-rgba", trust_remote_code=True)
model = AutoModel.from_pretrained("cyberagent/ldm-vq-f16-rgba", trust_remote_code=True)
# Image reconstruction
img = Image.open("path/to/image.png")
example = image_processor(img)
with torch.inference_mode():
    recon, _ = model.model(example["image"].unsqueeze(0))
recon_img = image_processor.postprocess(recon[0])
recon_img.save("recon.png")
This repository is released under the Apache-2.0 license.
@inproceedings{Kikuchi2025,
  title     = {Multimodal Markup Document Models for Graphic Design Completion},
  author    = {Kotaro Kikuchi and Ukyo Honda and Naoto Inoue and Mayu Otani and Edgar Simo-Serra and Kota Yamaguchi},
  booktitle = {ACM International Conference on Multimedia},
  year      = {2025},
  doi       = {10.1145/3746027.3755420}
}