Instructions to use SearchingMan/Z-Image-Turbo-student-adapter with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use SearchingMan/Z-Image-Turbo-student-adapter with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("SearchingMan/Z-Image-Turbo-student-adapter", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
- Local Apps
- Draw Things
- DiffusionBee
| license: apache-2.0 | |
| language: | |
| - en | |
| pipeline_tag: text-to-image | |
| library_name: diffusers | |
| # Z-Image-Turbo Student Adapter | |
| **Text-encoder distillation for VRAM-efficient Z-Image inference.** | |
| Official base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) | |
| --- | |
| ## Overview | |
| This project proves that we can reduce VRAM usage by replacing Z-Image-Turbo's original Qwen3-4B text encoder with a distilled Qwen3-1.7B student + lightweight adapter. The student+adapter is trained via hidden-state matching against the original 4B encoder. | |
| No other optimizations are applied β the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder. | |
| ## Results | |
| Original | Qwen3-1.7B | |
| :-------------------------:|:-------------------------: | |
|  |  | |
|  |  | |
|  |  | |
|  |  | |
|  |  | |
|  |  | |
|  |  | |
|  |  | |
| ## Architecture | |
| ``` | |
| Original: Prompt β Qwen3-4B (36L, 2560d) β hidden_states[-2] β DiT | |
| β | |
| Student: Prompt β Qwen3-1.7B (28L, 2048d) β Adapter β hidden_states[-2] β DiT | |
| ``` | |
| The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT. | |
| The student receives the same chat-template-formatted prompts as the teacher, with a curriculum annealing from teacher format to raw prompts for deployment-readiness. | |
| ## Benchmarks | |
| Measured on T4 (22 GB VRAM) with `torch.bfloat16`, `guidance_scale=0.0`, 9 inference steps, 1024Γ1024. | |
| | Metric | Original (4B) | Student (1.7B) | Savings | | |
| |--------|--------------|----------------|---------| | |
| | Weight VRAM | 20.70 GB | 16.30 GB | **4.40 GB (21%)** | | |
| | Peak VRAM | 21.35 GB | 16.76 GB | **4.59 GB (22%)** | | |
| | Generation time | 3.9s | 3.5s | β | | |
| The student+adapter brings peak VRAM from 21.4 GB down to 16.8 GB fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (12 GB total); all savings come from the text encoder. | |
| ## Quick Start | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| from diffusers import ZImagePipeline | |
| from transformers import AutoModel | |
| from pathlib import Path | |
| import torch | |
| # Download the repo locally | |
| repo_dir = Path('./zimage-student-adapter') | |
| snapshot_download( | |
| 'SearchingMan/Z-Image-Turbo-student-adapter', | |
| local_dir=str(repo_dir), | |
| local_dir_use_symlinks=False, | |
| ) | |
| # Two-stage loading (required β trust_remote_code is not forwarded | |
| # to component loaders by diffusers) | |
| text_encoder = AutoModel.from_pretrained( | |
| str(repo_dir / 'text_encoder'), | |
| trust_remote_code=True, | |
| dtype=torch.bfloat16, | |
| ) | |
| pipe = ZImagePipeline.from_pretrained( | |
| str(repo_dir), | |
| dtype=torch.bfloat16, | |
| text_encoder=text_encoder, | |
| trust_remote_code=True, | |
| ).to('cuda') | |
| # Generate | |
| image = pipe( | |
| prompt='a serene mountain lake at sunrise, oil painting', | |
| num_inference_steps=9, | |
| guidance_scale=0.0, | |
| generator=torch.Generator(device='cpu').manual_seed(42), | |
| ).images[0] | |
| image.save('output.png') | |
| ``` | |
| > **Important:** Always use a CPU generator: `torch.Generator(device='cpu')`. CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement. | |
| ## Limitations | |
| - **No end-to-end quality guarantees.** The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo. | |
| - **VRAM savings are from the text encoder only.** The DiT (12 GB) and VAE (0.5 GB) are unchanged. With `guidance_scale=0` and 9 steps the pipeline peaks at ~17 GB β fitting a 22 GB T4/L4. | |
| - **Chat template required.** The text encoder expects the same `apply_chat_template(enable_thinking=True, add_generation_prompt=True)` format used during training. | |
| - **Single-prompt only.** Batch generation shares the same text encoder forward, but DiT processes samples as a list (not batched tensor), so throughput is per-sample. | |
| ## Training Details | |
| - **Student:** `Qwen/Qwen3-1.7B` (28 layers, hidden_size=2048) | |
| - **Teacher:** `Tongyi-MAI/Z-Image-Turbo` text encoder (Qwen3-4B, 36 layers, hidden_size=2560) | |
| - **Adapter:** 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params) | |
| - **Tokenizers:** Same Qwen2Tokenizer for both student and teacher (same tokenizer family) | |
| ## License | |
| Same as the base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) | |