more results

15c720f verified 14 days ago

4.84 kB

	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-to-image
	library_name: diffusers
	---

	# Z-Image-Turbo Student Adapter

	Text-encoder distillation for VRAM-efficient Z-Image inference.

	Official base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)

	---

	## Overview

	This project proves that we can reduce VRAM usage by replacing Z-Image-Turbo's original Qwen3-4B text encoder with a distilled Qwen3-1.7B student + lightweight adapter. The student+adapter is trained via hidden-state matching against the original 4B encoder.

	No other optimizations are applied — the DiT transformer and VAE are unchanged. The VRAM savings come entirely from the smaller text encoder.


	## Results

	Original \| Qwen3-1.7B
	:-------------------------:\|:-------------------------:
	![](assets/ref_00.png) \| ![](assets/student_00.png)
	![](assets/ref_01.png) \| ![](assets/student_01.png)
	![](assets/ref_02.png) \| ![](assets/student_02.png)
	![](assets/ref_03.png) \| ![](assets/student_03.png)
	![](assets/ref_04.png) \| ![](assets/student_04.png)
	![](assets/ref_001.png) \| ![](assets/student_001.png)
	![](assets/ref_002.png) \| ![](assets/student_002.png)
	![](assets/ref_003.png) \| ![](assets/student_003.png)


	## Architecture

	```
	Original: Prompt → Qwen3-4B (36L, 2560d) → hidden_states[-2] → DiT
	↓
	Student: Prompt → Qwen3-1.7B (28L, 2048d) → Adapter → hidden_states[-2] → DiT
	```

	The adapter uses prompt-dependent cross-attention queries (no static learned queries), converting student hidden states to teacher-equivalent conditioning vectors before they reach the DiT.


	The student receives the same chat-template-formatted prompts as the teacher, with a curriculum annealing from teacher format to raw prompts for deployment-readiness.

	## Benchmarks

	Measured on T4 (22 GB VRAM) with `torch.bfloat16`, `guidance_scale=0.0`, 9 inference steps, 1024×1024.

	\| Metric \| Original (4B) \| Student (1.7B) \| Savings \|
	\|--------\|--------------\|----------------\|---------\|
	\| Weight VRAM \| 20.70 GB \| 16.30 GB \| 4.40 GB (21%) \|
	\| Peak VRAM \| 21.35 GB \| 16.76 GB \| 4.59 GB (22%) \|
	\| Generation time \| 3.9s \| 3.5s \| — \|

	The student+adapter brings peak VRAM from 21.4 GB down to 16.8 GB fitting comfortably on a 22 GB T4 where the original barely fits. The DiT transformer and VAE are unchanged (12 GB total); all savings come from the text encoder.

	## Quick Start

	```python
	from huggingface_hub import snapshot_download
	from diffusers import ZImagePipeline
	from transformers import AutoModel
	from pathlib import Path
	import torch

	# Download the repo locally
	repo_dir = Path('./zimage-student-adapter')
	snapshot_download(
	'SearchingMan/Z-Image-Turbo-student-adapter',
	local_dir=str(repo_dir),
	local_dir_use_symlinks=False,
	)

	# Two-stage loading (required — trust_remote_code is not forwarded
	# to component loaders by diffusers)
	text_encoder = AutoModel.from_pretrained(
	str(repo_dir / 'text_encoder'),
	trust_remote_code=True,
	dtype=torch.bfloat16,
	)
	pipe = ZImagePipeline.from_pretrained(
	str(repo_dir),
	dtype=torch.bfloat16,
	text_encoder=text_encoder,
	trust_remote_code=True,
	).to('cuda')

	# Generate
	image = pipe(
	prompt='a serene mountain lake at sunrise, oil painting',
	num_inference_steps=9,
	guidance_scale=0.0,
	generator=torch.Generator(device='cpu').manual_seed(42),
	).images[0]
	image.save('output.png')
	```

	> Important: Always use a CPU generator: `torch.Generator(device='cpu')`. CUDA generators cause device-mismatch errors with diffusers' mixed scheduler placement.

	## Limitations

	- No end-to-end quality guarantees. The student is trained to match hidden states, not final images. Visual quality may differ from the original Z-Image-Turbo.
	- VRAM savings are from the text encoder only. The DiT (12 GB) and VAE (0.5 GB) are unchanged. With `guidance_scale=0` and 9 steps the pipeline peaks at ~17 GB — fitting a 22 GB T4/L4.
	- Chat template required. The text encoder expects the same `apply_chat_template(enable_thinking=True, add_generation_prompt=True)` format used during training.
	- Single-prompt only. Batch generation shares the same text encoder forward, but DiT processes samples as a list (not batched tensor), so throughput is per-sample.

	## Training Details

	- Student: `Qwen/Qwen3-1.7B` (28 layers, hidden_size=2048)
	- Teacher: `Tongyi-MAI/Z-Image-Turbo` text encoder (Qwen3-4B, 36 layers, hidden_size=2560)
	- Adapter: 2 XAttn blocks, dim=1024, 8 heads, ff_mult=4 (~39M params)
	- Tokenizers: Same Qwen2Tokenizer for both student and teacher (same tokenizer family)

	## License

	Same as the base model: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)