---
license: apache-2.0
license_link: https://github.com/Zyriix/prologue/blob/main/LICENSE
library_name: pytorch
pipeline_tag: unconditional-image-generation
tags:
- image-generation
- class-conditional
- autoregressive
- tokenizer
- vq-vae
- vector-quantization
- imagenet
- prologue
- discrete-tokens
- safetensors
datasets:
- imagenet-1k
metrics:
- fid
- inception-score
language: []
model-index:
- name: Prologue-L-XL
  results:
  - task:
      type: image-generation
      name: Class-conditional Image Generation
    dataset:
      type: imagenet-1k
      name: ImageNet 256x256
    metrics:
    - type: fid
      value: 1.46
      name: gFID (CFG)
    - type: fid
      value: 2.26
      name: gFID (no CFG)
    - type: inception_score
      value: 257.7
      name: IS
- name: Prologue-L-L
  results:
  - task:
      type: image-generation
      name: Class-conditional Image Generation
    dataset:
      type: imagenet-1k
      name: ImageNet 256x256
    metrics:
    - type: fid
      value: 1.52
      name: gFID (CFG)
    - type: fid
      value: 2.81
      name: gFID (no CFG)
    - type: inception_score
      value: 251.6
      name: IS
- name: Prologue-L-B
  results:
  - task:
      type: image-generation
      name: Class-conditional Image Generation
    dataset:
      type: imagenet-1k
      name: ImageNet 256x256
    metrics:
    - type: fid
      value: 2.15
      name: gFID (CFG)
    - type: fid
      value: 5.02
      name: gFID (no CFG)
    - type: inception_score
      value: 219.9
      name: IS
- name: Prologue-B-XL
  results:
  - task:
      type: image-generation
      name: Class-conditional Image Generation
    dataset:
      type: imagenet-1k
      name: ImageNet 256x256
    metrics:
    - type: fid
      value: 2.43
      name: gFID (CFG)
    - type: fid
      value: 5.22
      name: gFID (no CFG)
    - type: inception_score
      value: 252.6
      name: IS
---
# Prologue — Autoregressive Visual Generation Needs a Prologue
**Paper:** [arXiv 2605.06137](https://arxiv.org/abs/2605.06137) · **Code:** [github.com/Zyriix/prologue](https://github.com/Zyriix/prologue) · **Demo:** [🤗 Space](https://huggingface.co/spaces/Zyriix/prologue-demo)
Bowen Zheng · Weijian Luo · Guang Yang · Colin Zhang · Tianyang Hu

> **Image = `[prologue tokens] + [visual tokens]`.** Prologue tokens are a small set of latent tokens prepended to the visual sequence and trained **only** with the AR cross-entropy loss. Visual tokens stay dedicated to reconstruction. The reconstruction–generation gap closes for free, and prologue tokens spontaneously develop semantic structure under pure CE gradients.
The figure above is generated with **fixed prologue tokens, varying visual tokens only** (Prologue-L–XL). Class identity and global layout stay; texture varies.
---
## Headline numbers — ImageNet 256×256
| Model | AR | rFID ↓ | gFID ↓ | gFID<sub>noCFG</sub> ↓ | IS ↑ | Pre. ↑ | Rec. ↑ |
|-----------------------------|-----:|-------:|-------:|-----------------------:|------:|-------:|-------:|
| 1D Tokenizer (no CE) | 115M | 2.11 | 6.10 | 19.32 | — | — | — |
| 2D Tokenizer (no CE) | 115M | 2.15 | 5.02 | 21.01 | — | — | — |
| **Prologue B–B** | 115M | 2.24 | **4.11** | **10.75** | 210.3 | 0.83 | 0.48 |
| **Prologue B–L** | 305M | 2.24 | **2.67** | **6.56** | 251.2 | 0.82 | 0.56 |
| **Prologue B–XL** | 685M | 2.24 | **2.43** | **5.22** | 252.6 | 0.80 | 0.59 |
| **Prologue L–B** | 115M | 0.99 | **2.15** | **5.02** | 219.9 | 0.79 | 0.60 |
| **Prologue L–L** | 305M | 0.99 | **1.52** | **2.81** | 251.6 | 0.77 | 0.66 |
| **Prologue L–XL** | 685M | 0.99 | **1.46** | **2.26** | 257.7 | 0.78 | 0.66 |
| Prologue-Post (frozen 2D) | 115M | 2.15 | 3.88 | 11.04 | — | — | — |
| Prologue-OneStage (joint) | 115M | 2.09 | 5.41 | 21.00 | — | — | — |
All numbers are reproducible end-to-end with `bash eval.sh` in the GitHub repo. *gFID / IS* follow the [ADM evaluation protocol](https://github.com/openai/guided-diffusion/tree/main/evaluations) (50k samples vs `VIRTUAL_imagenet256_labeled.npz`).
---
## What's in this repo
This Hugging Face Hub repository hosts **all released weights** for the paper — 6 tokenizers and 9 AR models (~63 GB total). The matching code, training scripts, and full documentation live on [GitHub](https://github.com/Zyriix/prologue).
### Tokenizers (6 directories)
| Directory | rFID | Size | Note |
|---------------------------|-----:|-------:|-----------------------------------------------------------------------------------------------|
| `1d-tokenizer` | 2.11 | 3.2 GB | 1D baseline, z_len = 256 |
| `2d-tokenizer` | 2.15 | 3.2 GB | 2D baseline, z_len = 256 |
| `prologue-b-tokenizer` | 2.24 | 4.1 GB | Prologue Base; VGG-LPIPS; codebook = 16 384 |
| `prologue-l-tokenizer` | 0.99 | 6.7 GB | Prologue Large; ConvNeXt-logit; codebook = 4096; asymmetric decoder 24×1024 |
| `prologue-post-tokenizer` | 2.15 | 3.2 GB | Prologue-Post (frozen 2D + new prologue path) |
| `prologue-onestage-joint` | 2.09 | 5.5 GB | Joint AE + AR-Base, single-stage. AR shards live inside as `model_5.safetensors` / `model_6.safetensors`. |
### AR models (9 directories)
| Directory | Pair with | Size | AR params | gFID (CFG) | gFID (no CFG) |
|----------------------|----------------------------|-------:|----------:|-----------:|--------------:|
| `ar-1d-base` | `1d-tokenizer` | 1.8 GB | 115M | 6.10 | 19.32 |
| `ar-2d-base` | `2d-tokenizer` | 1.8 GB | 115M | 5.02 | 21.01 |
| `ar-prologue-b-b` | `prologue-b-tokenizer` | 1.8 GB | 115M | 4.11 | 10.75 |
| `ar-prologue-b-l` | `prologue-b-tokenizer` | 5.3 GB | 305M | 2.67 | 6.56 |
| `ar-prologue-b-xl` | `prologue-b-tokenizer` | 11 GB | 685M | 2.43 | 5.22 |
| `ar-prologue-l-b` | `prologue-l-tokenizer` | 1.5 GB | 115M | 2.15 | 5.02 |
| `ar-prologue-l-l` | `prologue-l-tokenizer` | 4.9 GB | 305M | 1.52 | 2.81 |
| `ar-prologue-l-xl` | `prologue-l-tokenizer` | 9.9 GB | 685M | 1.46 | 2.26 |
| `ar-prologue-post-b` | `prologue-post-tokenizer` | 1.8 GB | 115M | 3.88 | 11.04 |
All checkpoints are [`safetensors`](https://github.com/huggingface/safetensors) following the 🤗 Accelerate convention (`model.safetensors`, `model_1.safetensors`, …). After download the layout matches the relative paths expected by `eval.sh` / `app.py` in the code repo — no `mv` step required.
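As a quick sanity check after download, the shard naming convention above can be inspected with a few lines of Python. This is a minimal sketch and not part of the released code: `shard_index` and `list_shards` are hypothetical helpers, and they assume only the `model.safetensors`, `model_1.safetensors`, … naming described above.

```python
import re
from pathlib import Path

# Matches model.safetensors (index 0) and model_<k>.safetensors (index k).
_SHARD_RE = re.compile(r"^model(?:_(\d+))?\.safetensors$")


def shard_index(filename):
    """Return the shard index for an Accelerate-style name, or None if it does not match."""
    match = _SHARD_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)) if match.group(1) else 0


def list_shards(ckpt_dir):
    """Hypothetical helper: weight shards in ckpt_dir, sorted by numeric shard index."""
    names = [p.name for p in Path(ckpt_dir).iterdir() if shard_index(p.name) is not None]
    return sorted(names, key=shard_index)
```

For example, `list_shards("ckpts/ar-prologue-l-xl")` would list the weight shards of the L–XL AR model in numeric order, so `model_10.safetensors` sorts after `model_2.safetensors`.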
---
## Download
```bash
pip install -U "huggingface_hub[cli]"
export HF_XET_HIGH_PERFORMANCE=1 # parallel Xet transfer
# everything (~63 GB)
hf download Zyriix/prologue --local-dir ckpts
# or just the headline model used by the demo (Prologue L–XL: 9.9 GB AR model + 6.7 GB tokenizer)
hf download Zyriix/prologue \
--include "ar-prologue-l-xl/*" \
--include "prologue-l-tokenizer/*" \
--local-dir ckpts
```
See the [GitHub README](https://github.com/Zyriix/prologue#released-checkpoints) for per-model commands and an inference-only slim layout (drops the ~50 % of bytes used for resuming training).
---
## How to use
```bash
git clone https://github.com/Zyriix/prologue.git && cd prologue
bash setup_env.sh && conda activate prologue
# download the released checkpoints (same command as in the Download section)
hf download Zyriix/prologue \
--include "ar-prologue-l-xl/*" \
--include "prologue-l-tokenizer/*" \
--local-dir ckpts
# (a) full headline-table reproduction
bash eval.sh
# (b) interactive Gradio demo: fix prologue, resample visual
python app.py
```
Programmatic loading inside Python:
```python
from huggingface_hub import snapshot_download
ckpt_dir = snapshot_download(
repo_id="Zyriix/prologue",
allow_patterns=["ar-prologue-l-xl/*", "prologue-l-tokenizer/*"],
local_dir="ckpts",
max_workers=8,
)
# Then call into prologue/ as a library (load_models / sample_tokens):
# from sample_vis import load_models, sample_tokens
# See app.py for a full minimal example.
```
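Each AR model has to be downloaded together with the tokenizer it was trained on. The sketch below builds the matching `allow_patterns` list for `snapshot_download`. The directory names are copied from the "AR models" table in this card, while `PAIRINGS` and `patterns_for` are hypothetical names introduced here for illustration only.

```python
# AR directory -> tokenizer directory, copied from the "AR models" table.
PAIRINGS = {
    "ar-1d-base": "1d-tokenizer",
    "ar-2d-base": "2d-tokenizer",
    "ar-prologue-b-b": "prologue-b-tokenizer",
    "ar-prologue-b-l": "prologue-b-tokenizer",
    "ar-prologue-b-xl": "prologue-b-tokenizer",
    "ar-prologue-l-b": "prologue-l-tokenizer",
    "ar-prologue-l-l": "prologue-l-tokenizer",
    "ar-prologue-l-xl": "prologue-l-tokenizer",
    "ar-prologue-post-b": "prologue-post-tokenizer",
}


def patterns_for(ar_dir):
    """Return glob patterns that fetch one AR model plus its paired tokenizer."""
    if ar_dir not in PAIRINGS:
        raise KeyError(f"unknown AR directory: {ar_dir!r}")
    return [f"{ar_dir}/*", f"{PAIRINGS[ar_dir]}/*"]
```

The result can be passed straight to the call shown earlier, for instance `snapshot_download(repo_id="Zyriix/prologue", allow_patterns=patterns_for("ar-prologue-b-l"), local_dir="ckpts")`.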
---
## Model details
- **Architecture.** Causal Transformer AR over a **prologue prefix (z_len = 16, separate codebook)** and a **2D visual sequence (256 tokens, x_len = 16² × 1, codebook = 4096 or 16 384)**. Conditioning is class-label one-hot injected as a BOS embedding (LlamaGen style). RoPE positions, RMSNorm, GeGLU FFN.
- **Tokenizer.** ViT-style encoder/decoder with simvq codebook (Yu et al.). Prologue variants share the encoder; Prologue-Post freezes a 2D visual tokenizer and adds a second prologue path on top.
- **Training stage 1 — tokenizer + joint AR head.** L1 + LPIPS (or ConvNeXt-logit for L variants) + PatchGAN. The AR head is trained with cross-entropy on `[prologue ; visual]`; STE through the prologue codebook flows gradients into the encoder. **Base: 150 epochs / Large: 200 epochs, both at batch size 256.**
- **Training stage 2 — pretokenize.** One pass over ImageNet, cache token indices to sharded `.npz`.
- **Training stage 3 — large AR.** Cross-entropy only, on the cached tokens. **AR-Base: 400 epochs at batch 512 · AR-Large: 800 epochs at batch 2048**, aligned with [AliTok](https://github.com/ali-vilab/alitok). Supports separate temperatures and separate CFG schedules for prologue / visual.
- **Numerics.** BF16 mixed precision, `torch.compile`, flash-attn 2.8 (source-built against torch 2.9.1 + cu128).
- **Compute.** Paper numbers were run on 8× H100 80 GB; the released training and eval scripts scale to either a 1× or an 8× GPU configuration via 🤗 Accelerate.
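
To make the sequence layout and the separate guidance settings concrete, here is a minimal sketch in plain Python. It assumes the standard classifier-free guidance combination `uncond + scale * (cond - uncond)`. The function names and the example scale values are illustrative assumptions, not the repo's API or its tuned settings.

```python
Z_LEN = 16    # prologue tokens (from the architecture bullet)
X_LEN = 256   # visual tokens, a 16 x 16 grid


def segment(position):
    """Map a token position in [prologue ; visual] to its segment name.

    Positions count generated tokens only; the class BOS embedding is excluded.
    """
    if not 0 <= position < Z_LEN + X_LEN:
        raise IndexError(f"position {position} outside sequence of length {Z_LEN + X_LEN}")
    return "prologue" if position < Z_LEN else "visual"


def guided_logits(cond, uncond, position, scales):
    """Standard CFG applied with a per-segment scale (scale values are assumed)."""
    scale = scales[segment(position)]
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Illustrative scales only; the actual defaults live in the code repo.
example_scales = {"prologue": 1.0, "visual": 2.0}
print(guided_logits([2.0, 0.0], [1.0, 1.0], position=3, scales=example_scales))
print(guided_logits([2.0, 0.0], [1.0, 1.0], position=20, scales=example_scales))
```

With a scale of 1.0 the guided logits equal the conditional logits, so in this toy setting the prologue segment is sampled without extra guidance while the visual segment is pushed further from the unconditional prediction. A per-segment temperature could be looked up the same way.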
---
## Intended use, limitations, and risks
- **Intended use.** Research on discrete image tokenization, AR image generation, and the role of semantic prefixes in token sequences. The released models are class-conditional on the 1 000 ImageNet classes only — there is no text conditioning.
- **Out-of-distribution behaviour.** Sample quality is best for content close to the ImageNet categories; the model has no notion of arbitrary text prompts, layout, or style.
- **Failure modes.** Like all AR image models, the model occasionally produces anatomical / object artifacts, particularly under aggressive (high-CFG) sampling. Prologue prefixes constrain global layout but not fine-grained correctness.
- **Bias.** ImageNet-1k itself contains known social and cultural biases; downstream generation inherits them. Do not deploy these weights for any application affecting individuals' lives, in any high-stakes setting, or as input to face-related pipelines.
- **License.** Apache-2.0 for the weights and the bulk of the code. See [`NOTICE`](https://github.com/Zyriix/prologue/blob/main/NOTICE) for the CC BY-NC-SA 4.0 carve-out on four NVIDIA StyleGAN3-derived files.
---
## Citation
```bibtex
@article{zheng2026prologue,
title = {Autoregressive Visual Generation Needs a Prologue},
author = {Zheng, Bowen and Luo, Weijian and Yang, Guang and Zhang, Colin and Hu, Tianyang},
journal = {arXiv preprint arXiv:2605.06137},
year = {2026},
url = {https://arxiv.org/abs/2605.06137}
}
```
## Acknowledgements
Inspired by (chronological)
[LPIPS](https://github.com/richzhang/PerceptualSimilarity) (2018) ·
[vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) (2020) ·
[VQGAN / taming-transformers](https://github.com/CompVis/taming-transformers) (2020) ·
[guided-diffusion](https://github.com/openai/guided-diffusion) (2021) ·
[VAR](https://github.com/FoundationVision/VAR) (2024.04) ·
[LlamaGen](https://github.com/FoundationVision/LlamaGen) (2024.06) ·
[TiTok](https://github.com/bytedance/1d-tokenizer) (2024.06) ·
[Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) (2024.09) ·
[ImageFolder](https://github.com/lxa9867/ImageFolder) (2024.10) ·
[AliTok](https://github.com/ali-vilab/alitok) (2025.06).