---
license: apache-2.0
language:
- en
pipeline_tag: text-to-image
library_name: diffusers
---

# Z-Image-Turbo Student Adapter

**Text-encoder distillation for VRAM-efficient Z-Image inference.**

Official base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)

---

## Overview

This project demonstrates that VRAM usage can be reduced by replacing Z-Image-Turbo's original Qwen3-4B text encoder with a distilled Qwen3-1.7B student plus a lightweight adapter. The student and adapter are trained by matching hidden states against the original 4B encoder.

No other optimizations are applied — the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder.
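
As a rough sanity check on where the savings come from, the sketch below estimates bf16 weight memory from nominal parameter counts. The counts (4.0B and 1.7B, read off the model names) and the helper name `bf16_weight_gb` are assumptions for illustration; real checkpoints differ slightly. The ~39M adapter figure is the one listed under Training Details.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in two bytes


def bf16_weight_gb(num_params: float) -> float:
    """Approximate weight memory in GB (1e9 bytes) for bf16 parameters."""
    return num_params * BYTES_PER_BF16 / 1e9


# Nominal parameter counts (assumed from the model names, not measured).
teacher_gb = bf16_weight_gb(4.0e9)
student_gb = bf16_weight_gb(1.7e9) + bf16_weight_gb(39e6)  # student + adapter
estimated_saving_gb = teacher_gb - student_gb

print(
    f"teacher ~{teacher_gb:.2f} GB, "
    f"student+adapter ~{student_gb:.2f} GB, "
    f"saving ~{estimated_saving_gb:.2f} GB"
)
```

Under these assumptions the estimate comes out near 4.5 GB, in the same range as the measured weight-VRAM difference reported under Benchmarks.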


## Results

Original (Qwen3-4B) | Student (Qwen3-1.7B + adapter)
:-------------------------:|:-------------------------:
![](assets/ref_00.png) |  ![](assets/student_00.png)
![](assets/ref_01.png) |  ![](assets/student_01.png)
![](assets/ref_02.png) |  ![](assets/student_02.png)
![](assets/ref_03.png) |  ![](assets/student_03.png)
![](assets/ref_04.png) |  ![](assets/student_04.png)
![](assets/ref_001.png) |  ![](assets/student_001.png)
![](assets/ref_002.png) |  ![](assets/student_002.png)
![](assets/ref_003.png) |  ![](assets/student_003.png)


## Architecture

```
Original:  Prompt → Qwen3-4B (36L, 2560d) → hidden_states[-2] → DiT
                                                                    ↓
Student:   Prompt → Qwen3-1.7B (28L, 2048d) → Adapter → hidden_states[-2] → DiT
```

The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT.
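
To make the idea concrete, here is a minimal PyTorch sketch of an adapter whose queries are computed from the prompt's own hidden states rather than from a fixed learned query bank. It is not the released adapter code: the class names, the pre-norm layout, the GELU feed-forward, and the input/output projections are all assumptions. Only the dimensions (2048 in, 2560 out, width 1024, 8 heads, `ff_mult=4`, 2 blocks) are taken from this README.

```python
import torch
from torch import nn


class CrossAttnBlock(nn.Module):
    """Pre-norm cross-attention followed by a feed-forward layer (assumed layout)."""

    def __init__(self, dim: int, heads: int, ff_mult: int):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * ff_mult),
            nn.GELU(),
            nn.Linear(dim * ff_mult, dim),
        )

    def forward(self, q, kv, pad_mask=None):
        kv_n = self.norm_kv(kv)
        attn_out, _ = self.attn(
            self.norm_q(q), kv_n, kv_n,
            key_padding_mask=pad_mask,  # True marks padding positions to ignore
            need_weights=False,
        )
        q = q + attn_out
        return q + self.ff(self.norm_ff(q))


class StudentAdapter(nn.Module):
    """Hypothetical adapter: student states (2048d) -> teacher-sized states (2560d)."""

    def __init__(self, student_dim=2048, teacher_dim=2560, dim=1024,
                 heads=8, ff_mult=4, depth=2):
        super().__init__()
        # Queries are projected from the prompt's hidden states, so they
        # change with every prompt (no static learned query tokens).
        self.to_q = nn.Linear(student_dim, dim)
        self.to_kv = nn.Linear(student_dim, dim)
        self.blocks = nn.ModuleList(
            [CrossAttnBlock(dim, heads, ff_mult) for _ in range(depth)]
        )
        self.out = nn.Linear(dim, teacher_dim)

    def forward(self, student_hidden, pad_mask=None):
        q = self.to_q(student_hidden)
        kv = self.to_kv(student_hidden)
        for block in self.blocks:
            q = block(q, kv, pad_mask)
        return self.out(q)


adapter = StudentAdapter()
student_hidden = torch.randn(2, 16, 2048)  # (batch, tokens, student hidden size)
teacher_like = adapter(student_hidden)
print(teacher_like.shape)
```

Because one query is produced per input token, the output keeps the prompt's sequence length and only the feature width changes, which is what lets the DiT consume it in place of the teacher's hidden states.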


The student receives the same chat-template-formatted prompts as the teacher. During training, a curriculum gradually anneals from the teacher format to raw prompts so that the student is ready for deployment.
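
One way such a curriculum could be scheduled is sketched below. The linear schedule, the function names, and the placeholder template string are all hypothetical; the README does not specify the actual schedule or its endpoints.

```python
import random


def teacher_format_prob(step: int, total_steps: int) -> float:
    """Linearly anneal from 1.0 (always templated) to 0.0 (always raw)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def choose_prompt(raw: str, templated: str, step: int,
                  total_steps: int, rng: random.Random) -> str:
    """Pick the templated prompt with the annealed probability, else the raw one."""
    if rng.random() < teacher_format_prob(step, total_steps):
        return templated
    return raw


rng = random.Random(0)
raw = "a serene mountain lake at sunrise"
templated = f"<template>{raw}</template>"  # placeholder, not the real chat template

for step in (0, 500, 1000):
    print(step, teacher_format_prob(step, 1000))
```

Early steps therefore see almost only teacher-formatted prompts, and the mix shifts toward raw prompts as training progresses.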

## Benchmarks

Measured on T4 (22 GB VRAM) with `torch.bfloat16`, `guidance_scale=0.0`, 9 inference steps, 1024×1024.

| Metric | Original (4B) | Student (1.7B) | Savings |
|--------|--------------|----------------|---------|
| Weight VRAM | 20.70 GB | 16.30 GB | **4.40 GB (21%)** |
| Peak VRAM | 21.35 GB | 16.76 GB | **4.59 GB (22%)** |
| Generation time | 3.9s | 3.5s | — |

The student+adapter brings peak VRAM from 21.4 GB down to 16.8 GB, fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (about 12 GB and 0.5 GB respectively); all savings come from the text encoder.
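
The savings column and the "barely fits" remark can be reproduced from the table values with a few lines of arithmetic. The numbers are copied from the table above; the helper name `saving` is only illustrative.

```python
DEVICE_GB = 22.0  # device capacity quoted in the benchmark setup

weight_gb = {"original": 20.70, "student": 16.30}
peak_gb = {"original": 21.35, "student": 16.76}


def saving(before: float, after: float) -> tuple[float, float]:
    """Return (absolute saving in GB, saving as a percentage of `before`)."""
    delta = before - after
    return round(delta, 2), round(100 * delta / before, 1)


weight_saving = saving(weight_gb["original"], weight_gb["student"])
peak_saving = saving(peak_gb["original"], peak_gb["student"])

# Free memory left on the device at peak usage.
headroom = {name: round(DEVICE_GB - gb, 2) for name, gb in peak_gb.items()}

print("weight saving (GB, %):", weight_saving)
print("peak saving (GB, %):", peak_saving)
print("headroom at peak (GB):", headroom)
```

At peak, the original pipeline leaves about 0.65 GB free on a 22 GB device, while the student pipeline leaves about 5.24 GB.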

## Quick Start

```python
from huggingface_hub import snapshot_download
from diffusers import ZImagePipeline
from transformers import AutoModel
from pathlib import Path
import torch

# Download the repo locally
repo_dir = Path('./zimage-student-adapter')
snapshot_download(
    'SearchingMan/Z-Image-Turbo-student-adapter',
    local_dir=str(repo_dir),
    local_dir_use_symlinks=False,
)

# Two-stage loading (required — trust_remote_code is not forwarded
# to component loaders by diffusers)
text_encoder = AutoModel.from_pretrained(
    str(repo_dir / 'text_encoder'),
    trust_remote_code=True,
    dtype=torch.bfloat16,
)
pipe = ZImagePipeline.from_pretrained(
    str(repo_dir),
    torch_dtype=torch.bfloat16,  # diffusers pipelines take `torch_dtype`
    text_encoder=text_encoder,
    trust_remote_code=True,
).to('cuda')

# Generate
image = pipe(
    prompt='a serene mountain lake at sunrise, oil painting',
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator(device='cpu').manual_seed(42),
).images[0]
image.save('output.png')
```

> **Important:** Always use a CPU generator: `torch.Generator(device='cpu')`. CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement.

## Limitations

- **No end-to-end quality guarantees.** The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo.
- **VRAM savings are from the text encoder only.** The DiT (12 GB) and VAE (0.5 GB) are unchanged. With `guidance_scale=0` and 9 steps the pipeline peaks at ~17 GB — fitting a 22 GB T4/L4.
- **Chat template required.** The text encoder expects the same `apply_chat_template(enable_thinking=True, add_generation_prompt=True)` format used during training.
- **No batched DiT inference.** A batch of prompts shares a single text-encoder forward pass, but the DiT processes samples as a list rather than a batched tensor, so generation time scales per sample.

## Training Details

- **Student:** `Qwen/Qwen3-1.7B` (28 layers, hidden_size=2048)
- **Teacher:** `Tongyi-MAI/Z-Image-Turbo` text encoder (Qwen3-4B, 36 layers, hidden_size=2560)
- **Adapter:** 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params)
- **Tokenizer:** The student and teacher share the same `Qwen2Tokenizer`.
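
Because both encoders use the same tokenizer, student and teacher outputs line up token by token, which makes a per-token hidden-state matching loss straightforward. The sketch below shows one plausible form, a masked mean squared error against the teacher's `hidden_states[-2]`. The choice of MSE, the masking, and the function name are assumptions; the README does not state the exact loss used.

```python
import torch
import torch.nn.functional as F


def hidden_state_matching_loss(pred, target, attention_mask):
    """Mean squared error over non-padding tokens.

    pred, target: (batch, seq, teacher_dim)
    attention_mask: (batch, seq), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(pred.dtype)
    per_elem = F.mse_loss(pred, target, reduction="none") * mask
    denom = mask.sum() * pred.shape[-1]  # number of unmasked elements
    return per_elem.sum() / denom.clamp(min=1)


# Toy check: perturb only the last position, then mask it out.
target = torch.randn(2, 5, 2560)  # stands in for teacher hidden_states[-2]
pred = target.clone()             # stands in for the adapter output
pred[:, -1] += 1.0
mask = torch.tensor([[1, 1, 1, 1, 0], [1, 1, 1, 1, 0]])

loss_masked = hidden_state_matching_loss(pred, target, mask)
loss_full = hidden_state_matching_loss(pred, target, torch.ones_like(mask))
print(loss_masked.item(), loss_full.item())
```

With the perturbed position masked out the loss is zero, and with every position counted it is about 0.2 (one of five positions off by 1.0), which confirms that padding does not contribute to the objective.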

## License

Same as the base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)