Instructions to use SearchingMan/Z-Image-Turbo-student-adapter with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use SearchingMan/Z-Image-Turbo-student-adapter with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("SearchingMan/Z-Image-Turbo-student-adapter", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
- Local Apps
- Draw Things
- DiffusionBee
Update README.md
Browse filesproper model description
README.md
CHANGED
|
@@ -6,195 +6,102 @@ pipeline_tag: text-to-image
|
|
| 6 |
library_name: diffusers
|
| 7 |
---
|
| 8 |
|
|
|
|
| 9 |
|
| 10 |
-
|
| 11 |
|
| 12 |
-
|
| 13 |
|
| 14 |
-
|
| 15 |
-
[](https://github.com/Tongyi-MAI/Z-Image) 
|
| 16 |
-
[](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) 
|
| 17 |
-
[](https://huggingface.co/spaces/Tongyi-MAI/Z-Image-Turbo) 
|
| 18 |
-
[](https://huggingface.co/spaces/akhaliq/Z-Image-Turbo) 
|
| 19 |
-
[](https://www.modelscope.cn/models/Tongyi-MAI/Z-Image-Turbo) 
|
| 20 |
-
[](https://www.modelscope.cn/aigc/imageGeneration?tab=advanced&versionId=469191&modelType=Checkpoint&sdVersion=Z_IMAGE_TURBO&modelUrl=modelscope%3A%2F%2FTongyi-MAI%2FZ-Image-Turbo%3Frevision%3Dmaster) 
|
| 21 |
-
[](assets/Z-Image-Gallery.pdf) 
|
| 22 |
-
[](https://modelscope.cn/studios/Tongyi-MAI/Z-Image-Gallery/summary) 
|
| 23 |
-
<a href="https://arxiv.org/abs/2511.22699" target="_blank"><img src="https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv" height="21px"></a>
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
Welcome to the official repository for the Z-Image(造相)project!
|
| 27 |
-
|
| 28 |
-
</div>
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
## ✨ Z-Image
|
| 33 |
-
|
| 34 |
-
Z-Image is a powerful and highly efficient image generation model family with **6B** parameters. Currently there are four variants:
|
| 35 |
-
|
| 36 |
-
- 🚀 **Z-Image-Turbo** – A distilled version of Z-Image that matches or exceeds leading competitors with only **8 NFEs** (Number of Function Evaluations). It offers **⚡️sub-second inference latency⚡️** on enterprise-grade H800 GPUs and fits comfortably within **16G VRAM consumer devices**. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
|
| 37 |
-
|
| 38 |
-
- 🎨 **Z-Image** – The foundation model behind Z-Image-Turbo. Z-Image focuses on **high-quality generation**, **rich aesthetics**, **strong diversity**, and **controllability**, well-suited for creative generation, **fine-tuning**, and downstream development. It supports a wide range of artistic styles, effective negative prompting, and high diversity across identities, poses, compositions, and layouts.
|
| 39 |
-
|
| 40 |
-
- 🧱 **Z-Image-Omni-Base** – The versatile foundation model capable of both **generation and editing tasks**. By releasing this checkpoint, we aim to unlock the full potential for community-driven fine-tuning and custom development, providing the most "raw" and diverse starting point for the open-source community.
|
| 41 |
-
|
| 42 |
-
- ✍️ **Z-Image-Edit** – A variant fine-tuned on Z-Image specifically for image editing tasks. It supports creative image-to-image generation with impressive instruction-following capabilities, allowing for precise edits based on natural language prompts.
|
| 43 |
-
|
| 44 |
-
### 📥 Model Zoo
|
| 45 |
-
|
| 46 |
-
| Model | Pre-Training | SFT | RL | Step | CFG | Task | Visual Quality | Diversity | Fine-Tunability | Hugging Face | ModelScope |
|
| 47 |
-
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
| 48 |
-
| **Z-Image-Omni-Base** | ✅ | ❌ | ❌ | 50 | ✅ | Gen. / Editing | Medium | High | Easy | *To be released* | *To be released* |
|
| 49 |
-
| **Z-Image** | ✅ | ✅ | ❌ | 50 | ✅ | Gen. | High | Medium | Easy | [](https://huggingface.co/Tongyi-MAI/Z-Image) <br> [](https://huggingface.co/spaces/Tongyi-MAI/Z-Image) | [](https://www.modelscope.cn/models/Tongyi-MAI/Z-Image) <br> [](https://www.modelscope.cn/aigc/imageGeneration?tab=advanced&versionId=569345&modelType=Checkpoint&sdVersion=Z_IMAGE&modelUrl=modelscope%3A%2F%2FTongyi-MAI%2FZ-Image%3Frevision%3Dmaster) |
|
| 50 |
-
| **Z-Image-Turbo** | ✅ | ✅ | ✅ | 8 | ❌ | Gen. | Very High | Low | N/A | [](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) <br> [](https://huggingface.co/spaces/Tongyi-MAI/Z-Image-Turbo) | [](https://www.modelscope.cn/models/Tongyi-MAI/Z-Image-Turbo) <br> [](https://www.modelscope.cn/aigc/imageGeneration?tab=advanced&versionId=469191&modelType=Checkpoint&sdVersion=Z_IMAGE_TURBO&modelUrl=modelscope%3A%2F%2FTongyi-MAI%2FZ-Image-Turbo%3Frevision%3Dmaster) |
|
| 51 |
-
| **Z-Image-Edit** | ✅ | ✅ | ❌ | 50 | ✅ | Editing | High | Medium | Easy | *To be released* | *To be released* | | *To be released* |
|
| 52 |
-
|
| 53 |
-
### 🖼️ Showcase
|
| 54 |
-
|
| 55 |
-
📸 **Photorealistic Quality**: **Z-Image-Turbo** delivers strong photorealistic image generation while maintaining excellent aesthetic quality.
|
| 56 |
-
|
| 57 |
-

|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
|
| 69 |
-
|
| 70 |
|
| 71 |
-
### 🏗️ Model Architecture
|
| 72 |
-
We adopt a **Scalable Single-Stream DiT** (S3-DiT) architecture. In this setup, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
-
##
|
| 77 |
-
According to the Elo-based Human Preference Evaluation (on [*Alibaba AI Arena*](https://aiarena.alibaba-inc.com/corpora/arena/leaderboard?arenaType=T2I)), Z-Image-Turbo shows highly competitive performance against other leading models, while achieving state-of-the-art results among open-source models.
|
| 78 |
|
| 79 |
-
|
| 80 |
-
<a href="https://aiarena.alibaba-inc.com/corpora/arena/leaderboard?arenaType=T2I">
|
| 81 |
-
<img src="assets/leaderboard.png" alt="Z-Image Elo Rating on AI Arena"/><br />
|
| 82 |
-
<span style="font-size:1.05em; cursor:pointer; text-decoration:underline;"> Click to view the full leaderboard</span>
|
| 83 |
-
</a>
|
| 84 |
-
</p>
|
| 85 |
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
|
|
|
| 90 |
|
| 91 |
-
|
| 92 |
-
Therefore, you need to install diffusers from source for the latest features and Z-Image support.
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
```bash
|
| 97 |
-
pip install git+https://github.com/huggingface/diffusers
|
| 98 |
-
```
|
| 99 |
|
| 100 |
```python
|
| 101 |
-
import
|
| 102 |
from diffusers import ZImagePipeline
|
|
|
|
|
|
|
|
|
|
| 103 |
|
| 104 |
-
#
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
)
|
| 111 |
-
pipe.to("cuda")
|
| 112 |
-
|
| 113 |
-
# [Optional] Attention Backend
|
| 114 |
-
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
|
| 115 |
-
# pipe.transformer.set_attention_backend("flash") # Enable Flash-Attention-2
|
| 116 |
-
# pipe.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3
|
| 117 |
|
| 118 |
-
#
|
| 119 |
-
#
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
|
| 128 |
-
#
|
| 129 |
image = pipe(
|
| 130 |
-
prompt=
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
guidance_scale=0.0, # Guidance should be 0 for the Turbo models
|
| 135 |
-
generator=torch.Generator("cuda").manual_seed(42),
|
| 136 |
).images[0]
|
| 137 |
-
|
| 138 |
-
image.save("example.png")
|
| 139 |
```
|
| 140 |
|
| 141 |
-
|
| 142 |
|
| 143 |
-
|
| 144 |
|
| 145 |
-
|
|
|
|
|
|
|
|
|
|
| 146 |
|
| 147 |
-
|
| 148 |
|
| 149 |
-
-
|
| 150 |
-
-
|
|
|
|
|
|
|
| 151 |
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-

|
| 155 |
-
|
| 156 |
-
## 🤖 DMDR: Fusing DMD with Reinforcement Learning
|
| 157 |
-
|
| 158 |
-
[](https://arxiv.org/abs/2511.13649)
|
| 159 |
-
|
| 160 |
-
Building upon the strong foundation of Decoupled-DMD, our 8-step Z-Image model has already demonstrated exceptional capabilities. To achieve further improvements in terms of semantic alignment, aesthetic quality, and structural coherence—while producing images with richer high-frequency details—we present **DMDR**.
|
| 161 |
-
|
| 162 |
-
Our core insight behind DMDR is that Reinforcement Learning (RL) and Distribution Matching Distillation (DMD) can be synergistically integrated during the post-training of few-step models. We demonstrate that:
|
| 163 |
-
|
| 164 |
-
- **RL Unlocks the Performance of DMD** 🚀
|
| 165 |
-
- **DMD Effectively Regularizes RL** ⚖️
|
| 166 |
-
|
| 167 |
-

|
| 168 |
-
|
| 169 |
-
## ⏬ Download
|
| 170 |
-
```bash
|
| 171 |
-
pip install -U huggingface_hub
|
| 172 |
-
HF_XET_HIGH_PERFORMANCE=1 hf download Tongyi-MAI/Z-Image-Turbo
|
| 173 |
-
```
|
| 174 |
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
If you find our work useful in your research, please consider citing:
|
| 178 |
-
|
| 179 |
-
```bibtex
|
| 180 |
-
@article{team2025zimage,
|
| 181 |
-
title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
|
| 182 |
-
author={Z-Image Team},
|
| 183 |
-
journal={arXiv preprint arXiv:2511.22699},
|
| 184 |
-
year={2025}
|
| 185 |
-
}
|
| 186 |
-
|
| 187 |
-
@article{liu2025decoupled,
|
| 188 |
-
title={Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield},
|
| 189 |
-
author={Dongyang Liu and Peng Gao and David Liu and Ruoyi Du and Zhen Li and Qilong Wu and Xin Jin and Sihan Cao and Shifeng Zhang and Hongsheng Li and Steven Hoi},
|
| 190 |
-
journal={arXiv preprint arXiv:2511.22677},
|
| 191 |
-
year={2025}
|
| 192 |
-
}
|
| 193 |
-
|
| 194 |
-
@article{jiang2025distribution,
|
| 195 |
-
title={Distribution Matching Distillation Meets Reinforcement Learning},
|
| 196 |
-
author={Jiang, Dengyang and Liu, Dongyang and Wang, Zanyi and Wu, Qilong and Jin, Xin and Liu, David and Li, Zhen and Wang, Mengmeng and Gao, Peng and Yang, Harry},
|
| 197 |
-
journal={arXiv preprint arXiv:2511.13649},
|
| 198 |
-
year={2025}
|
| 199 |
-
}
|
| 200 |
-
```
|
|
|
|
| 6 |
library_name: diffusers
|
| 7 |
---
|
| 8 |
|
| 9 |
+
# Z-Image-Turbo Student Adapter
|
| 10 |
|
| 11 |
+
**Text-encoder distillation for VRAM-efficient Z-Image inference.**
|
| 12 |
|
| 13 |
+
Official base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
|
| 14 |
|
| 15 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
|
| 17 |
+
## Overview
|
| 18 |
|
| 19 |
+
This project proves that we can reduce VRAM usage by replacing Z-Image-Turbo's original Qwen3-4B text encoder with a distilled Qwen3-1.7B student + lightweight adapter. The student+adapter is trained via hidden-state matching against the original 4B encoder.
|
| 20 |
|
| 21 |
+
No other optimizations are applied — the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder.
|
| 22 |
|
| 23 |
+
## Architecture
|
| 24 |
|
| 25 |
+
```
|
| 26 |
+
Original: Prompt → Qwen3-4B (36L, 2560d) → hidden_states[-2] → DiT
|
| 27 |
+
↓
|
| 28 |
+
Student: Prompt → Qwen3-1.7B (28L, 2048d) → Adapter → hidden_states[-2] → DiT
|
| 29 |
+
```
|
| 30 |
|
| 31 |
+
The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT.
|
| 32 |
|
|
|
|
|
|
|
| 33 |
|
| 34 |
+
The student receives the same chat-template-formatted prompts as the teacher, with a curriculum annealing from teacher format to raw prompts for deployment-readiness.
|
| 35 |
|
| 36 |
+
## Benchmarks
|
|
|
|
| 37 |
|
| 38 |
+
Measured on T4 (22 GB VRAM) with `torch.bfloat16`, `guidance_scale=0.0`, 9 inference steps, 1024×1024.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 39 |
|
| 40 |
+
| Metric | Original (4B) | Student (1.7B) | Savings |
|
| 41 |
+
|--------|--------------|----------------|---------|
|
| 42 |
+
| Weight VRAM | 20.70 GB | 16.30 GB | **4.40 GB (21%)** |
|
| 43 |
+
| Peak VRAM | 21.35 GB | 16.76 GB | **4.59 GB (22%)** |
|
| 44 |
+
| Generation time | 3.9s | 3.5s | — |
|
| 45 |
|
| 46 |
+
The student+adapter brings peak VRAM from ~21.4 GB down to ~16.8 GB — fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (~12 GB total); all savings come from the text encoder.
|
|
|
|
| 47 |
|
| 48 |
+
## Quick Start
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
|
| 50 |
```python
|
| 51 |
+
from huggingface_hub import snapshot_download
|
| 52 |
from diffusers import ZImagePipeline
|
| 53 |
+
from transformers import AutoModel
|
| 54 |
+
from pathlib import Path
|
| 55 |
+
import torch
|
| 56 |
|
| 57 |
+
# Download the repo locally
|
| 58 |
+
repo_dir = Path('./zimage-student-adapter')
|
| 59 |
+
snapshot_download(
|
| 60 |
+
'SearchingMan/Z-Image-Turbo-student-adapter',
|
| 61 |
+
local_dir=str(repo_dir),
|
| 62 |
+
local_dir_use_symlinks=False,
|
| 63 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
|
| 65 |
+
# Two-stage loading (required — trust_remote_code is not forwarded
|
| 66 |
+
# to component loaders by diffusers)
|
| 67 |
+
text_encoder = AutoModel.from_pretrained(
|
| 68 |
+
str(repo_dir / 'text_encoder'),
|
| 69 |
+
trust_remote_code=True,
|
| 70 |
+
dtype=torch.bfloat16,
|
| 71 |
+
)
|
| 72 |
+
pipe = ZImagePipeline.from_pretrained(
|
| 73 |
+
str(repo_dir),
|
| 74 |
+
dtype=torch.bfloat16,
|
| 75 |
+
text_encoder=text_encoder,
|
| 76 |
+
trust_remote_code=True,
|
| 77 |
+
).to('cuda')
|
| 78 |
|
| 79 |
+
# Generate
|
| 80 |
image = pipe(
|
| 81 |
+
prompt='a serene mountain lake at sunrise, oil painting',
|
| 82 |
+
num_inference_steps=9,
|
| 83 |
+
guidance_scale=0.0,
|
| 84 |
+
generator=torch.Generator(device='cpu').manual_seed(42),
|
|
|
|
|
|
|
| 85 |
).images[0]
|
| 86 |
+
image.save('output.png')
|
|
|
|
| 87 |
```
|
| 88 |
|
| 89 |
+
> **Important:** Always use a CPU generator: `torch.Generator(device='cpu')`. CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement.
|
| 90 |
|
| 91 |
+
## Limitations
|
| 92 |
|
| 93 |
+
- **No end-to-end quality guarantees.** The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo.
|
| 94 |
+
- **VRAM savings are from the text encoder only.** The DiT (12 GB) and VAE (0.5 GB) are unchanged. With `guidance_scale=0` and 9 steps the pipeline peaks at ~17 GB — fitting a 22 GB T4/L4.
|
| 95 |
+
- **Chat template required.** The text encoder expects the same `apply_chat_template(enable_thinking=True, add_generation_prompt=True)` format used during training.
|
| 96 |
+
- **Single-prompt only.** Batch generation shares the same text encoder forward, but DiT processes samples as a list (not batched tensor), so throughput is per-sample.
|
| 97 |
|
| 98 |
+
## Training Details
|
| 99 |
|
| 100 |
+
- **Student:** `Qwen/Qwen3-1.7B` (28 layers, hidden_size=2048)
|
| 101 |
+
- **Teacher:** `Tongyi-MAI/Z-Image-Turbo` text encoder (Qwen3-4B, 36 layers, hidden_size=2560)
|
| 102 |
+
- **Adapter:** 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params)
|
| 103 |
+
- **Tokenizers:** Same Qwen2Tokenizer for both student and teacher (same tokenizer family)
|
| 104 |
|
| 105 |
+
## License
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 106 |
|
| 107 |
+
Same as the base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|