Multimodal Markup Document Models for Graphic Design Completion
Paper: arXiv:2409.19051
This repository provides a specialized image tokenizer (VQ model) for RGBA images. The model compresses RGBA images into discrete latent codes, which can be used for downstream tasks such as graphic design completion, as demonstrated in our paper Multimodal Markup Document Models for Graphic Design Completion.
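To make "discrete latent codes" concrete, here is a minimal, self-contained sketch of the core vector-quantization step: each spatial latent vector is replaced by the index of its nearest codebook entry. The codebook size and embedding dimension below are illustrative placeholders, not this model's actual configuration.

```python
import torch

# Illustrative sizes only (not the model's real config).
codebook = torch.randn(8192, 8)       # (num_codes, embed_dim)
latents = torch.randn(1, 8, 16, 16)   # (batch, embed_dim, H/16, W/16)

# Flatten spatial positions to (N, embed_dim), then pick the nearest
# codebook entry for each position by L2 distance.
flat = latents.permute(0, 2, 3, 1).reshape(-1, 8)
dists = torch.cdist(flat, codebook)   # (N, num_codes) pairwise distances
indices = dists.argmin(dim=1)         # the discrete latent codes, shape (N,)
print(indices.shape)
```

Each image is thus represented by a grid of integer code indices, which downstream models can treat like text tokens.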
The model is based on the image tokenizer from the latent diffusion model (LDM). We modified the original model (models/first_stage_models/vq-f16) to handle RGBA images by changing the number of input and output channels from 3 to 4. See our paper for more details.
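The paper is the authoritative source for how the channel change was done; as a hypothetical illustration of what "changing the input channels from 3 to 4" can look like in PyTorch, one common approach is to copy the pretrained RGB convolution weights and initialize the new alpha channel from their mean (this is an assumption for illustration, not necessarily the authors' procedure):

```python
import torch
import torch.nn as nn

# Hypothetical example: adapt a pretrained 3-channel input conv to 4 channels
# (RGBA) by reusing the RGB weights and seeding the alpha channel with their mean.
old_conv = nn.Conv2d(3, 128, kernel_size=3, padding=1)  # stands in for a pretrained layer
new_conv = nn.Conv2d(4, 128, kernel_size=3, padding=1)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)
    new_conv.bias.copy_(old_conv.bias)

x = torch.randn(1, 4, 256, 256)  # RGBA input
out = new_conv(x)
print(out.shape)
```

The output conv is adapted symmetrically so the decoder produces 4-channel reconstructions.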
cat <<EOF > requirements.txt
transformers>=4.42.4
omegaconf>=2.3.0
einops>=0.8.0
pillow>=10.4.0
pytorch_lightning<2.0.0
huggingface-hub>=0.24.0
git+https://github.com/ktrk115/latent-diffusion.git@23d5a49
git+https://github.com/illeatmyhat/taming-transformers.git@aeabaa3
EOF
pip install -r requirements.txt
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
image_processor = AutoImageProcessor.from_pretrained("cyberagent/ldm-vq-f16-rgba", trust_remote_code=True)
model = AutoModel.from_pretrained("cyberagent/ldm-vq-f16-rgba", trust_remote_code=True)
# Image reconstruction
img = Image.open("path/to/image.png")
example = image_processor(img)
with torch.inference_mode():
    recon, _ = model.model(example["image"].unsqueeze(0))
recon_img = image_processor.postprocess(recon[0])
recon_img.save("recon.png")
This repository is released under the Apache-2.0 license.
@inproceedings{Kikuchi2025,
  title     = {Multimodal Markup Document Models for Graphic Design Completion},
  author    = {Kotaro Kikuchi and Ukyo Honda and Naoto Inoue and Mayu Otani and Edgar Simo-Serra and Kota Yamaguchi},
  booktitle = {ACM International Conference on Multimedia},
  year      = {2025},
  doi       = {10.1145/3746027.3755420}
}