---
license: apache-2.0
license_link: https://github.com/Zyriix/prologue/blob/main/LICENSE
library_name: pytorch
pipeline_tag: unconditional-image-generation
tags:
  - image-generation
  - class-conditional
  - autoregressive
  - tokenizer
  - vq-vae
  - vector-quantization
  - imagenet
  - prologue
  - discrete-tokens
  - safetensors
datasets:
  - imagenet-1k
metrics:
  - fid
  - inception-score
language: []
model-index:
  - name: Prologue-L-XL
    results:
      - task:
          type: image-generation
          name: Class-conditional Image Generation
        dataset:
          type: imagenet-1k
          name: ImageNet 256x256
        metrics:
          - type: fid
            value: 1.46
            name: gFID (CFG)
          - type: fid
            value: 2.26
            name: gFID (no CFG)
          - type: inception_score
            value: 257.7
            name: IS
  - name: Prologue-L-L
    results:
      - task:
          type: image-generation
          name: Class-conditional Image Generation
        dataset:
          type: imagenet-1k
          name: ImageNet 256x256
        metrics:
          - type: fid
            value: 1.52
            name: gFID (CFG)
          - type: fid
            value: 2.81
            name: gFID (no CFG)
          - type: inception_score
            value: 251.6
            name: IS
  - name: Prologue-L-B
    results:
      - task:
          type: image-generation
          name: Class-conditional Image Generation
        dataset:
          type: imagenet-1k
          name: ImageNet 256x256
        metrics:
          - type: fid
            value: 2.15
            name: gFID (CFG)
          - type: fid
            value: 5.02
            name: gFID (no CFG)
          - type: inception_score
            value: 219.9
            name: IS
  - name: Prologue-B-XL
    results:
      - task:
          type: image-generation
          name: Class-conditional Image Generation
        dataset:
          type: imagenet-1k
          name: ImageNet 256x256
        metrics:
          - type: fid
            value: 2.43
            name: gFID (CFG)
          - type: fid
            value: 5.22
            name: gFID (no CFG)
          - type: inception_score
            value: 252.6
            name: IS
---

# Prologue — Autoregressive Visual Generation Needs a Prologue

**Paper:** [arXiv 2605.06137](https://arxiv.org/abs/2605.06137)  ·  **Code:** [github.com/Zyriix/prologue](https://github.com/Zyriix/prologue)  ·  **Demo:** [🤗 Space](https://huggingface.co/spaces/Zyriix/prologue-demo)

Bowen Zheng · Weijian Luo · Guang Yang · Colin Zhang · Tianyang Hu

![Prologue: fix prologue tokens, resample visual tokens](assets/semantic_fix_grid.jpg)

> **Image = `[prologue tokens] + [visual tokens]`.** Prologue tokens are a small set of latent tokens prepended to the visual sequence and trained **only** with the AR cross-entropy loss. Visual tokens stay dedicated to reconstruction. The reconstruction–generation gap closes for free, and prologue tokens spontaneously develop semantic structure under pure CE gradients.

The figure above is generated with **fixed prologue tokens, varying visual tokens only** (Prologue-L–XL). Class identity and global layout stay; texture varies.
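
As a rough illustration of the layout described above, the sketch below assembles one AR sequence from a prologue prefix and a visual sequence, then swaps the visual part while keeping the prefix fixed. The lengths (16 prologue tokens, 256 visual tokens) come from the Model details section. The names `build_sequence` and `resample_visual`, and the choice to shift visual ids past the prologue codebook, are assumptions made for this example and not the repository's API. The prologue codebook size is not stated in this card, so it is passed in as an argument.

```python
from typing import List

PROLOGUE_LEN = 16   # z_len, from the Model details section
VISUAL_LEN = 256    # 16 x 16 visual grid


def build_sequence(prologue_ids: List[int], visual_ids: List[int],
                   prologue_vocab: int) -> List[int]:
    """Concatenate [prologue ; visual] into a single token sequence.

    Visual ids are shifted by `prologue_vocab` so the two codebooks occupy
    disjoint id ranges (a hypothetical convention used only in this sketch).
    """
    if len(prologue_ids) != PROLOGUE_LEN:
        raise ValueError(f"expected {PROLOGUE_LEN} prologue tokens")
    if len(visual_ids) != VISUAL_LEN:
        raise ValueError(f"expected {VISUAL_LEN} visual tokens")
    if any(not 0 <= t < prologue_vocab for t in prologue_ids):
        raise ValueError("prologue id outside the prologue codebook")
    return list(prologue_ids) + [prologue_vocab + t for t in visual_ids]


def resample_visual(sequence: List[int], new_visual_ids: List[int],
                    prologue_vocab: int) -> List[int]:
    """Keep the prologue prefix fixed and replace only the visual tokens."""
    prefix = sequence[:PROLOGUE_LEN]
    return build_sequence(prefix, new_visual_ids, prologue_vocab)
```

In the real pipeline the new visual tokens would come from the AR model conditioned on the fixed prefix; here they are simply supplied by the caller.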

---

## Headline numbers — ImageNet 256×256

| Model                       |   AR | rFID ↓ | gFID ↓ | gFID<sub>noCFG</sub> ↓ |  IS ↑ | Pre. ↑ | Rec. ↑ |
|-----------------------------|-----:|-------:|-------:|-----------------------:|------:|-------:|-------:|
| 1D Tokenizer (no CE)        | 115M |  2.11  |  6.10  |   19.32                |   —   |   —    |   —    |
| 2D Tokenizer (no CE)        | 115M |  2.15  |  5.02  |   21.01                |   —   |   —    |   —    |
| **Prologue B–B**            | 115M |  2.24  | **4.11** | **10.75**            | 210.3 |  0.83  |  0.48  |
| **Prologue B–L**            | 305M |  2.24  | **2.67** | **6.56**             | 251.2 |  0.82  |  0.56  |
| **Prologue B–XL**           | 685M |  2.24  | **2.43** | **5.22**             | 252.6 |  0.80  |  0.59  |
| **Prologue L–B**            | 115M |  0.99  | **2.15** | **5.02**             | 219.9 |  0.79  |  0.60  |
| **Prologue L–L**            | 305M |  0.99  | **1.52** | **2.81**             | 251.6 |  0.77  |  0.66  |
| **Prologue L–XL**           | 685M |  0.99  | **1.46** | **2.26**             | 257.7 |  0.78  |  0.66  |
| Prologue-Post (frozen 2D)   | 115M |  2.15  |  3.88  |   11.04                |   —   |   —    |   —    |
| Prologue-OneStage (joint)   | 115M |  2.09  |  5.41  |   21.00                |   —   |   —    |   —    |

All numbers are reproducible end-to-end with `bash eval.sh` in the GitHub repo. *gFID / IS* follow the [ADM evaluation protocol](https://github.com/openai/guided-diffusion/tree/main/evaluations) (50k samples vs `VIRTUAL_imagenet256_labeled.npz`).
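
For readers unfamiliar with the FID-style columns: FID is a Fréchet distance between Gaussian fits of Inception features from two image sets. The sketch below implements only the univariate special case of that distance, to show the shape of the formula. It is not the ADM evaluation code used for the table, and the function name `frechet_distance_1d` is made up for this example.

```python
def frechet_distance_1d(mu_a: float, sigma_a: float,
                        mu_b: float, sigma_b: float) -> float:
    """Squared Frechet distance between two univariate Gaussians.

    The multivariate form used by FID is
        |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    With scalar variances S = sigma^2 the trace term reduces to
    (sigma_a - sigma_b)^2.
    """
    if sigma_a < 0 or sigma_b < 0:
        raise ValueError("standard deviations must be non-negative")
    return (mu_a - mu_b) ** 2 + (sigma_a - sigma_b) ** 2
```

Identical distributions give a distance of zero, which is why lower is better in the rFID and gFID columns.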

---

## What's in this repo

This Hugging Face Hub repository hosts **all released weights** for the paper — 6 tokenizers and 9 AR models (~63 GB total). The matching code, training scripts, and full documentation live on [GitHub](https://github.com/Zyriix/prologue).

### Tokenizers (6 directories)

| Directory                 | rFID | Size   | Note                                                                                          |
|---------------------------|-----:|-------:|-----------------------------------------------------------------------------------------------|
| `1d-tokenizer`            | 2.11 | 3.2 GB | 1D baseline, z_len = 256                                                                      |
| `2d-tokenizer`            | 2.15 | 3.2 GB | 2D baseline, z_len = 256                                                                      |
| `prologue-b-tokenizer`    | 2.24 | 4.1 GB | Prologue Base; VGG-LPIPS; codebook = 16 384                                                   |
| `prologue-l-tokenizer`    | 0.99 | 6.7 GB | Prologue Large; ConvNeXt-logit; codebook = 4096; asymmetric decoder 24×1024                   |
| `prologue-post-tokenizer` | 2.15 | 3.2 GB | Prologue-Post (frozen 2D + new prologue path)                                                 |
| `prologue-onestage-joint` | 2.09 | 5.5 GB | Joint AE + AR-Base, single-stage. AR shards live inside as `model_5.safetensors` / `model_6.safetensors`. |

### AR models (9 directories)

| Directory            | Pair with                  | Size   | AR params | gFID (CFG) | gFID (no CFG) |
|----------------------|----------------------------|-------:|----------:|-----------:|--------------:|
| `ar-1d-base`         | `1d-tokenizer`             | 1.8 GB |      115M |       6.10 |         19.32 |
| `ar-2d-base`         | `2d-tokenizer`             | 1.8 GB |      115M |       5.02 |         21.01 |
| `ar-prologue-b-b`    | `prologue-b-tokenizer`     | 1.8 GB |      115M |       4.11 |         10.75 |
| `ar-prologue-b-l`    | `prologue-b-tokenizer`     | 5.3 GB |      305M |       2.67 |          6.56 |
| `ar-prologue-b-xl`   | `prologue-b-tokenizer`     |  11 GB |      685M |       2.43 |          5.22 |
| `ar-prologue-l-b`    | `prologue-l-tokenizer`     | 1.5 GB |      115M |       2.15 |          5.02 |
| `ar-prologue-l-l`    | `prologue-l-tokenizer`     | 4.9 GB |      305M |       1.52 |          2.81 |
| `ar-prologue-l-xl`   | `prologue-l-tokenizer`     | 9.9 GB |      685M |       1.46 |          2.26 |
| `ar-prologue-post-b` | `prologue-post-tokenizer`  | 1.8 GB |      115M |       3.88 |         11.04 |

All checkpoints are [`safetensors`](https://github.com/huggingface/safetensors) following the 🤗 Accelerate convention (`model.safetensors`, `model_1.safetensors`, …). After download the layout matches the relative paths expected by `eval.sh` / `app.py` in the code repo — no `mv` step required.
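
A minimal sketch for inspecting a downloaded directory, assuming only the file-naming convention stated above (`model.safetensors`, `model_1.safetensors`, …). The helper `list_shards` is hypothetical and not part of the code repo. Which shard index holds which sub-model is not specified in this card, so the function only orders the files.

```python
import re
from pathlib import Path
from typing import List

# model.safetensors is treated as shard 0, model_<k>.safetensors as shard k.
_SHARD_RE = re.compile(r"^model(?:_(\d+))?\.safetensors$")


def list_shards(ckpt_dir: str) -> List[Path]:
    """Return the model shards in `ckpt_dir`, sorted by numeric shard index."""
    found = []
    for path in Path(ckpt_dir).iterdir():
        match = _SHARD_RE.match(path.name)
        if match:
            index = int(match.group(1) or 0)
            found.append((index, path))
    # Sort numerically so that model_10 comes after model_2.
    found.sort(key=lambda item: item[0])
    return [path for _, path in found]
```

For example, `list_shards("ckpts/prologue-onestage-joint")` would list the shards of the joint checkpoint in index order, ignoring any non-model files in the directory.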

---

## Download

```bash
pip install -U "huggingface_hub[cli]"
export HF_XET_HIGH_PERFORMANCE=1   # parallel Xet transfer

# everything (~63 GB)
hf download Zyriix/prologue --local-dir ckpts

# or just the headline model used by the demo (Prologue-L-XL: 9.9 GB AR model + 6.7 GB tokenizer)
hf download Zyriix/prologue \
    --include "ar-prologue-l-xl/*" \
    --include "prologue-l-tokenizer/*" \
    --local-dir ckpts
```

See the [GitHub README](https://github.com/Zyriix/prologue#released-checkpoints) for per-model commands and an inference-only slim layout (drops the ~50 % of bytes used for resuming training).

---

## How to use

```bash
git clone https://github.com/Zyriix/prologue.git && cd prologue
bash setup_env.sh && conda activate prologue

# download the headline checkpoints (same command as in the Download section)
hf download Zyriix/prologue \
    --include "ar-prologue-l-xl/*" \
    --include "prologue-l-tokenizer/*" \
    --local-dir ckpts

# (a) full headline-table reproduction
bash eval.sh

# (b) interactive Gradio demo: fix prologue, resample visual
python app.py
```

Programmatic loading inside Python:

```python
from huggingface_hub import snapshot_download
ckpt_dir = snapshot_download(
    repo_id="Zyriix/prologue",
    allow_patterns=["ar-prologue-l-xl/*", "prologue-l-tokenizer/*"],
    local_dir="ckpts",
    max_workers=8,
)
# Then call into prologue/ as a library (load_models / sample_tokens):
#   from sample_vis import load_models, sample_tokens
#   See app.py for a full minimal example.
```
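
Each AR model must be loaded with its matching tokenizer. The sketch below copies the "Pair with" column of the AR models table into a lookup and builds the `allow_patterns` list for a chosen AR directory. The names `PAIRINGS` and `allow_patterns_for` are illustrative and do not exist in the code repo.

```python
from typing import Dict, List

# AR directory -> tokenizer directory, copied from the "AR models" table.
PAIRINGS: Dict[str, str] = {
    "ar-1d-base": "1d-tokenizer",
    "ar-2d-base": "2d-tokenizer",
    "ar-prologue-b-b": "prologue-b-tokenizer",
    "ar-prologue-b-l": "prologue-b-tokenizer",
    "ar-prologue-b-xl": "prologue-b-tokenizer",
    "ar-prologue-l-b": "prologue-l-tokenizer",
    "ar-prologue-l-l": "prologue-l-tokenizer",
    "ar-prologue-l-xl": "prologue-l-tokenizer",
    "ar-prologue-post-b": "prologue-post-tokenizer",
}


def allow_patterns_for(ar_dir: str) -> List[str]:
    """Return glob patterns covering an AR model and its paired tokenizer."""
    try:
        tokenizer_dir = PAIRINGS[ar_dir]
    except KeyError:
        raise ValueError(f"unknown AR directory: {ar_dir!r}") from None
    return [f"{ar_dir}/*", f"{tokenizer_dir}/*"]
```

The returned list can be passed as `allow_patterns` to `snapshot_download`, so that only the two required directories are fetched.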

---

## Model details

- **Architecture.** Causal Transformer AR over a **prologue prefix (z_len = 16, separate codebook)** followed by a **2D visual sequence (256 tokens, x_len = 16² × 1, codebook = 4096 or 16 384)**. The class label is supplied as a one-hot vector and injected as a BOS embedding (LlamaGen style). The backbone uses RoPE positions, RMSNorm, and a GeGLU FFN.
- **Tokenizer.** ViT-style encoder/decoder with simvq codebook (Yu et al.). Prologue variants share the encoder; Prologue-Post freezes a 2D visual tokenizer and adds a second prologue path on top.
- **Training stage 1 — tokenizer + joint AR head.** L1 + LPIPS (or ConvNeXt-logit for L variants) + PatchGAN. The AR head is trained with cross-entropy on `[prologue ; visual]`; STE through the prologue codebook flows gradients into the encoder. **Base: 150 epochs / Large: 200 epochs, both at batch size 256.**
- **Training stage 2 — pretokenize.** One pass over ImageNet, cache token indices to sharded `.npz`.
- **Training stage 3 — large AR.** Cross-entropy only, on the cached tokens. **AR-Base: 400 epochs at batch 512 · AR-Large: 800 epochs at batch 2048**, aligned with [AliTok](https://github.com/ali-vilab/alitok). Supports separate temperatures and separate CFG schedules for prologue / visual.
- **Numerics.** BF16 mixed precision, `torch.compile`, flash-attn 2.8 (source-built against torch 2.9.1 + cu128).
- **Compute.** Paper numbers were run on 8× H100 80 GB; the released training and eval scripts adapt to either a 1× or an 8× GPU configuration via 🤗 Accelerate.
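
The stage 3 bullet mentions separate temperatures and CFG settings for prologue and visual tokens. One way this could look at sampling time is sketched below, under two assumptions that are not confirmed by this card: guidance uses the common form `uncond + scale * (cond - uncond)`, and each segment uses a constant scale rather than a position-dependent schedule. The function `guided_probs` and its parameters are hypothetical; the repository's sampler may differ.

```python
import math
from typing import List

PROLOGUE_LEN = 16  # z_len, from the Architecture bullet


def guided_probs(cond_logits: List[float], uncond_logits: List[float],
                 position: int, *, prologue_cfg: float, visual_cfg: float,
                 prologue_temp: float, visual_temp: float) -> List[float]:
    """Next-token probabilities with per-segment CFG scale and temperature."""
    in_prologue = position < PROLOGUE_LEN
    scale = prologue_cfg if in_prologue else visual_cfg
    temp = prologue_temp if in_prologue else visual_temp
    if temp <= 0:
        raise ValueError("temperature must be positive")
    # Classifier-free guidance: push conditional logits away from unconditional.
    guided = [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    scaled = [g / temp for g in guided]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With a scale of 1 the guided logits equal the conditional logits, so the prologue segment can be sampled without guidance while the visual segment uses a stronger scale, or the other way round.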

---

## Intended use, limitations, and risks

- **Intended use.** Research on discrete image tokenization, AR image generation, and the role of semantic prefixes in token sequences. The released models are class-conditional on the 1 000 ImageNet classes only — there is no text conditioning.
- **Out-of-distribution behaviour.** Sample quality is best for content that resembles the ImageNet categories; the model has no notion of arbitrary text prompts, layout, or style.
- **Failure modes.** Like all AR image models, the model occasionally produces anatomical / object artifacts, particularly under aggressive (high-CFG) sampling. Prologue prefixes constrain global layout but not fine-grained correctness.
- **Bias.** ImageNet-1k itself contains known social and cultural biases; downstream generation inherits them. Do not deploy these weights for any application affecting individuals' lives, in any high-stakes setting, or as input to face-related pipelines.
- **License.** Apache-2.0 for the weights and the bulk of the code. See [`NOTICE`](https://github.com/Zyriix/prologue/blob/main/NOTICE) for the CC BY-NC-SA 4.0 carve-out on four NVIDIA StyleGAN3-derived files.

---

## Citation

```bibtex
@article{zheng2026prologue,
  title   = {Autoregressive Visual Generation Needs a Prologue},
  author  = {Zheng, Bowen and Luo, Weijian and Yang, Guang and Zhang, Colin and Hu, Tianyang},
  journal = {arXiv preprint arXiv:2605.06137},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.06137}
}
```

## Acknowledgements

Inspired by, in chronological order:
[LPIPS](https://github.com/richzhang/PerceptualSimilarity) (2018) ·
[vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) (2020) ·
[VQGAN / taming-transformers](https://github.com/CompVis/taming-transformers) (2020) ·
[guided-diffusion](https://github.com/openai/guided-diffusion) (2021) ·
[VAR](https://github.com/FoundationVision/VAR) (2024.04) ·
[LlamaGen](https://github.com/FoundationVision/LlamaGen) (2024.06) ·
[TiTok](https://github.com/bytedance/1d-tokenizer) (2024.06) ·
[Open-MAGVIT2](https://github.com/TencentARC/Open-MAGVIT2) (2024.09) ·
[ImageFolder](https://github.com/lxa9867/ImageFolder) (2024.10) ·
[AliTok](https://github.com/ali-vilab/alitok) (2025.06).