Instructions to use SearchingMan/Z-Image-Turbo-student-adapter with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use SearchingMan/Z-Image-Turbo-student-adapter with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("SearchingMan/Z-Image-Turbo-student-adapter", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
- Local Apps
- Draw Things
- DiffusionBee
File size: 4,839 Bytes
0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 9053ff0 05c242a 15c720f 9053ff0 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 fd46f66 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 0179f45 1cd7fb4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | ---
license: apache-2.0
language:
- en
pipeline_tag: text-to-image
library_name: diffusers
---
# Z-Image-Turbo Student Adapter
**Text-encoder distillation for VRAM-efficient Z-Image inference.**
Official base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
---
## Overview
This project proves that we can reduce VRAM usage by replacing Z-Image-Turbo's original Qwen3-4B text encoder with a distilled Qwen3-1.7B student + lightweight adapter. The student+adapter is trained via hidden-state matching against the original 4B encoder.
No other optimizations are applied β the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder.
## Results
Original | Qwen3-1.7B
:-------------------------:|:-------------------------:
 | 
 | 
 | 
 | 
 | 
 | 
 | 
 | 
## Architecture
```
Original: Prompt β Qwen3-4B (36L, 2560d) β hidden_states[-2] β DiT
β
Student: Prompt β Qwen3-1.7B (28L, 2048d) β Adapter β hidden_states[-2] β DiT
```
The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT.
The student receives the same chat-template-formatted prompts as the teacher, with a curriculum annealing from teacher format to raw prompts for deployment-readiness.
## Benchmarks
Measured on T4 (22 GB VRAM) with `torch.bfloat16`, `guidance_scale=0.0`, 9 inference steps, 1024Γ1024.
| Metric | Original (4B) | Student (1.7B) | Savings |
|--------|--------------|----------------|---------|
| Weight VRAM | 20.70 GB | 16.30 GB | **4.40 GB (21%)** |
| Peak VRAM | 21.35 GB | 16.76 GB | **4.59 GB (22%)** |
| Generation time | 3.9s | 3.5s | β |
The student+adapter brings peak VRAM from 21.4 GB down to 16.8 GB fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (12 GB total); all savings come from the text encoder.
## Quick Start
```python
from huggingface_hub import snapshot_download
from diffusers import ZImagePipeline
from transformers import AutoModel
from pathlib import Path
import torch
# Download the repo locally
repo_dir = Path('./zimage-student-adapter')
snapshot_download(
'SearchingMan/Z-Image-Turbo-student-adapter',
local_dir=str(repo_dir),
local_dir_use_symlinks=False,
)
# Two-stage loading (required β trust_remote_code is not forwarded
# to component loaders by diffusers)
text_encoder = AutoModel.from_pretrained(
str(repo_dir / 'text_encoder'),
trust_remote_code=True,
dtype=torch.bfloat16,
)
pipe = ZImagePipeline.from_pretrained(
str(repo_dir),
dtype=torch.bfloat16,
text_encoder=text_encoder,
trust_remote_code=True,
).to('cuda')
# Generate
image = pipe(
prompt='a serene mountain lake at sunrise, oil painting',
num_inference_steps=9,
guidance_scale=0.0,
generator=torch.Generator(device='cpu').manual_seed(42),
).images[0]
image.save('output.png')
```
> **Important:** Always use a CPU generator: `torch.Generator(device='cpu')`. CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement.
## Limitations
- **No end-to-end quality guarantees.** The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo.
- **VRAM savings are from the text encoder only.** The DiT (12 GB) and VAE (0.5 GB) are unchanged. With `guidance_scale=0` and 9 steps the pipeline peaks at ~17 GB β fitting a 22 GB T4/L4.
- **Chat template required.** The text encoder expects the same `apply_chat_template(enable_thinking=True, add_generation_prompt=True)` format used during training.
- **Single-prompt only.** Batch generation shares the same text encoder forward, but DiT processes samples as a list (not batched tensor), so throughput is per-sample.
## Training Details
- **Student:** `Qwen/Qwen3-1.7B` (28 layers, hidden_size=2048)
- **Teacher:** `Tongyi-MAI/Z-Image-Turbo` text encoder (Qwen3-4B, 36 layers, hidden_size=2560)
- **Adapter:** 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params)
- **Tokenizers:** Same Qwen2Tokenizer for both student and teacher (same tokenizer family)
## License
Same as the base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
|