# Project Guidelines — Z-Image Studio

Working notes for AI assistants editing this repo. This file is the *what & why* — the locked architecture, the gotchas, the sole-author rule. Companion to `SKILLS.md` (the *how* — process, debugging, deployment workflow) and `AGENTS.md` (tool-agnostic version of this file).

---

## ⚠ Sole-author rule (non-negotiable)

**Mayank Gupta is the sole author on every commit in this repo.** No exceptions.

When committing:

- **NO** `Co-Authored-By: Claude…` (or any agent name) trailer.
- **NO** "Generated with Claude Code" / "🤖 Generated with…" footers.
- **NO** `--author=…` flag — let git use the user's configured identity.
- **NO** attribution in PR descriptions.

If asked to amend, re-commit, or rebase, strip any prior agent attribution from the commit message. Treat any tooling that suggests adding a Claude trailer as a bug to ignore.
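
The stripping step can be sketched as a small message filter. This is an illustration under stated assumptions, not tooling that exists in this repo: `strip_agent_attribution` and its two patterns are hypothetical, and they cover only the trailer and footer forms listed above.

```python
import re

# Hypothetical patterns for the attribution forms listed above.
_ATTRIBUTION_PATTERNS = [
    # Any co-author trailer violates the sole-author rule.
    re.compile(r"^\s*Co-Authored-By:", re.IGNORECASE),
    # "Generated with ..." footers, with or without a leading symbol.
    re.compile(r"^\W*Generated with\b", re.IGNORECASE),
]


def strip_agent_attribution(message: str) -> str:
    """Return the commit message without agent trailers or footers."""
    kept = [
        line
        for line in message.splitlines()
        if not any(p.match(line) for p in _ATTRIBUTION_PATTERNS)
    ]
    # Drop blank lines left dangling after the removed footer.
    while kept and not kept[-1].strip():
        kept.pop()
    return "\n".join(kept)
```

The patterns anchor at the start of a line so that a body sentence that merely contains "generated with" is left alone.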

---

## Architecture facts (locked — do not relitigate)

Spec: `docs/superpowers/specs/2026-05-13-z-image-studio-design.md`
Plan: `docs/superpowers/plans/2026-05-13-z-image-studio.md`

1. **Backend is DiffSynth-Studio's `ZImagePipeline`** — not ComfyUI. Installed from git (the package isn't on PyPI). The repo lives at `/Users/techfreakworm/Projects/llm/lora-training-zimage-base/DiffSynth-Studio/` for local development and is `git+https://github.com/modelscope/DiffSynth-Studio.git` in `requirements.txt`.
2. **Three tabs.** T2I has the Base/Turbo radio; ControlNet and Upscale are hard-locked to Turbo.
3. **One pipeline instance, two transformers in the pool.** `backend._build_pipeline` does NOT call `ZImagePipeline.from_pretrained` (which discards its `ModelPool` locally). Instead it instantiates the pipeline manually, runs `download_and_load_models`, attaches the pool to `pipe._zis_pool`, and indexes the two `z_image_dit` entries by load order (Base = `pool.model[0]`, Turbo = `pool.model[1]`). Swap is `pipe.dit = dits[idx]` in `modes._swap_transformer`.
4. **`@spaces.GPU` is applied at module load time.** Off Spaces it is an identity decorator. The decorator's `duration=` parameter takes a callable that estimates the per-call timeout from `(mode, params, multiplier)`. The estimator clamps its result to `[60, 180] s`.
5. **DiffSynth handles VRAM management.** Do **not** sprinkle `empty_cache()` calls. The only place we touch this is `models.vram_limit_for()`, which returns `None` for MPS (the CUDA-only `mem_get_info` API would crash otherwise) and a numeric cap for CUDA.
6. **HF cache → DiffSynth `./models/<repo>/` symlink.** DiffSynth's `ModelConfig.download()` looks for files at `local_model_path/<model_id>/...`, NOT in `~/.cache/huggingface/hub/models--<org>--<repo>/snapshots/<sha>/`. `app._bootstrap()` symlinks every cached snapshot into `./models/<org>/<repo>/` so the preload weights are findable. On Spaces, the build-user-owned `~/.cache/huggingface/hub` is mirrored to runtime-writable `~/hf-cache-rw/` first, then symlinked.
7. **One Gradio process. Lazy backend singleton.** `get_backend()` constructs the pipeline on the first request (~30 – 60 s warm-up). Module import is fast.
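
The path mapping in fact 6 can be sketched as follows. This is a minimal illustration, not the actual `app._bootstrap()`: the function name `link_cached_snapshots` is hypothetical, it assumes one snapshot per cached repo, and it leaves out the Spaces-only mirroring to `~/hf-cache-rw/`.

```python
from pathlib import Path


def link_cached_snapshots(hub_dir: Path, models_dir: Path) -> list[Path]:
    """Symlink each cached HF snapshot to models_dir/<org>/<repo>."""
    links = []
    for repo_dir in sorted(hub_dir.glob("models--*--*")):
        # Cache directories are named models--<org>--<repo>.
        _, org, repo = repo_dir.name.split("--", 2)
        snapshots = sorted((repo_dir / "snapshots").glob("*"))
        if not snapshots:
            continue
        link = models_dir / org / repo
        link.parent.mkdir(parents=True, exist_ok=True)
        if not link.exists() and not link.is_symlink():
            # Assumption: one snapshot per repo. With several, this picks the
            # lexicographically last one, which is not necessarily the newest.
            link.symlink_to(snapshots[-1].resolve(), target_is_directory=True)
        links.append(link)
    return links
```

Calling it twice is safe: an existing link is left in place rather than recreated.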

---

## Gotchas we already paid for (don't re-discover)

Each of these cost a debug cycle. Read once.

### Model selector swap

- `pipe.model_pool` does NOT exist after `ZImagePipeline.from_pretrained` — DiffSynth builds the pool locally and discards it. **Fix:** we keep our own reference on `pipe._zis_pool`. See architecture fact #3.
- A hidden `gr.Textbox(visible=False)` is removed from the DOM entirely in Gradio 5, so a JS shim can't write to it. We use `elem_classes=["zis-hidden"]` + CSS `display:none` when we need an off-screen value carrier. As of the v2 redesign we use `gr.Radio` directly and don't need a carrier textbox.
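
The swap itself is small enough to sketch with stand-in objects. The attribute names `_zis_pool`, `model` and `dit` come from architecture fact #3; the function name and the `SimpleNamespace` stand-ins are illustrative only, and the real `modes._swap_transformer` may differ in signature.

```python
from types import SimpleNamespace

BASE, TURBO = 0, 1  # load order, per architecture fact #3


def swap_transformer_sketch(pipe, idx: int) -> None:
    """Point pipe.dit at the pooled transformer with the given index."""
    dits = pipe._zis_pool.model  # the pool survives because we hold this reference
    pipe.dit = dits[idx]


# Stand-ins for illustration; the real pipeline and pool come from DiffSynth.
pool = SimpleNamespace(model=["base-dit", "turbo-dit"])
pipe = SimpleNamespace(_zis_pool=pool, dit=None)
swap_transformer_sketch(pipe, TURBO)
```

Because both transformers are already loaded in the pool, the swap is a reference assignment and does not reload weights.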

### MPS / Apple Silicon

- `torch.mps` has no `mem_get_info`. DiffSynth's `AutoWrappedModule.check_free_vram` calls that method and raises AttributeError when `vram_limit` is set. **Fix:** `vram_limit_for("mps")` returns `None` so the gate short-circuits.
- Several DiffSynth ops aren't implemented on the MPS backend (SDPA variants, some index ops). `app.py` sets `PYTORCH_ENABLE_MPS_FALLBACK=1` so they degrade to CPU instead of crashing.

### Dependency footguns

- `diffsynth-studio` (kebab) is NOT a PyPI package. The pip-installable name is `diffsynth` and only via `git+https://github.com/modelscope/DiffSynth-Studio.git`.
- `transformers >= 5` removes `SiglipVisionTransformer` from `transformers.models.siglip.modeling_siglip`. DiffSynth 2.0.7 imports it. **Pin:** `transformers>=4.45,<5.0`.
- DiffSynth blanket-imports `torchaudio` in `diffsynth.core.data.operators`. Add `torchaudio>=2.4` to requirements even though we don't use audio.
- `basicsr` (a `realesrgan` dep) imports `torchvision.transforms.functional_tensor`, removed in `torchvision >= 0.17`. **Fix:** `upscale.py` aliases `torchvision.transforms.functional` into `sys.modules["torchvision.transforms.functional_tensor"]` BEFORE the basicsr import.
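
The `sys.modules` alias for `basicsr` can be sketched as a generic helper. `alias_module` is a hypothetical name, and unlike the fix described above it first tries the real import, so an older torchvision that still ships the module is left untouched.

```python
import importlib
import sys


def alias_module(missing: str, replacement: str) -> None:
    """Register `replacement` under the name `missing` if `missing` can't be imported."""
    if missing in sys.modules:
        return
    try:
        importlib.import_module(missing)
    except ImportError:
        sys.modules[missing] = importlib.import_module(replacement)


# Intended call site, before `import basicsr` (requires torchvision):
# alias_module(
#     "torchvision.transforms.functional_tensor",
#     "torchvision.transforms.functional",
# )
```

The alias works because the import system consults `sys.modules` before searching for a module on disk.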

### Model name slugs

- `PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1` is the **ModelScope** slug. On HuggingFace the same model is at `alibaba-pai/...`. We use the HF slug + `DIFFSYNTH_DOWNLOAD_SOURCE=huggingface` env var.
- `xinntao/Real-ESRGAN` doesn't exist on HF (returns 401). We use `lllyasviel/Annotators` which mirrors `RealESRGAN_x4plus.pth`.
- `controlnet_aux.Processor` registers depth as `depth_midas`, **not** `midas`. The plain name raises KeyError.

### Gradio 5 quirks

- Don't put `<script>` tags inside `gr.HTML` blocks — they get stripped. JS goes in `gr.Blocks(head=…)`.
- `gr.File`'s default drop zone is ~400 px tall. CSS in `theme.py` (`.zis-lora-file .upload-container`) tightens it to 56 px.
- The Gradio 6.0 deprecation warnings about `theme=` / `css=` / `head=` on `Blocks` are benign on 5.50. Ignore until upgrade.

### HF Spaces deployment

- `preload_from_hub` is build-time only. Runtime falls back to network if any required file isn't preloaded. Use broad globs (`transformer/*` not `transformer/*.safetensors`) so configs + index.json files come along. Our current preload totals ~47 GB (cap is 150 GB).
- ZeroGPU build injects `spaces==0.50.0`. If `requirements.txt` pins `spaces==0.30.0`, pip resolution fails. **Don't pin `spaces` at all** — let HF provide it.
- The `@spaces.GPU` decorator must be applied at module load. Runtime decoration isn't detected by ZeroGPU's startup analyzer.
- Per-call `duration=` is a queue-priority signal AND a hard cap. Auto-retry once at 2× on `"GPU task aborted"`.
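
The clamp and the single retry can be sketched together. Everything numeric here is invented for illustration: the per-step costs, the default step count, and the use of `RuntimeError` as the exception type are assumptions, not values or behavior taken from this repo or from the `spaces` package.

```python
def estimate_duration(mode: str, params: dict, multiplier: float = 1.0) -> int:
    """Estimate a per-call timeout in seconds, clamped to [60, 180]."""
    # Placeholder cost model: seconds per step, per mode.
    per_step = {"t2i": 2.0, "controlnet": 3.0, "upscale": 4.0}.get(mode, 3.0)
    raw = per_step * params.get("steps", 20) * multiplier
    return int(min(180, max(60, raw)))


def run_with_retry(task, mode: str, params: dict):
    """Run task once; on "GPU task aborted", retry once with multiplier 2."""
    try:
        return task(mode, params, 1.0)
    except RuntimeError as exc:  # assumed exception type
        if "GPU task aborted" not in str(exc):
            raise
        return task(mode, params, 2.0)
```

Because of the clamp, a retry at 2× never requests more than 180 s, so a call that already estimated 180 s gets no extra time on the second attempt.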

### Brand vs filename casing

- Repo / directory / Python package: `z-image-studio` (kebab-case).
- User-visible brand: `Z-Image Studio` (title-case) — header, browser tab, README title. Do not propagate the kebab into UI strings.

---

## Coding conventions

- **Python 3.11.** The HF Spaces base image is 3.11; code that sticks to older syntax (for example, avoiding `match`) is fine.
- **Flat top-level layout.** No `src/`, no nested packages. One `.py` per responsibility.
- **No conda.** `python3.11 -m venv .venv`; `brew` for system binaries.
- **No emojis** in code or commits unless explicitly requested. UI strings (CTA banner, button labels) are OK because they're user-facing copy, not code.
- **Type hints on public functions.** Internal helpers can skip them when obvious.
- **Imports at the top of the file.** Inline imports only to break circular deps OR to defer heavy modules (DiffSynth, torch, basicsr) for fast CI startup.
- **`ruff format` + `ruff check`** both pass in CI. No exceptions.

---

## Commits

- **Conventional Commits:** `<type>(<scope>): <subject>` — types: `feat`, `fix`, `chore`, `docs`, `test`, `refactor`, `ci`, `perf`.
- Subject is **imperative**, lowercase, no trailing period.
- Body explains **why** when not obvious. Reference the spec / plan section if relevant.
- Frequent small commits — one logical change per commit.
- **NO Claude trailer.** See top of file.
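
The mechanical parts of the subject-line rules can be checked with a short validator. This is a hypothetical helper, not part of the repo's CI: it treats the scope as mandatory, reads "lowercase" as "does not start with an uppercase letter", and cannot check imperative mood.

```python
import re

_TYPES = ("feat", "fix", "chore", "docs", "test", "refactor", "ci", "perf")
_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(_TYPES) + r")\((?P<scope>[^()\s]+)\): (?P<subject>.+)$"
)


def is_valid_subject(line: str) -> bool:
    """Check type, scope, leading case and trailing period of a subject line."""
    match = _PATTERN.match(line)
    if match is None:
        return False
    subject = match.group("subject")
    return not subject[0].isupper() and not subject.endswith(".")
```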

---

## Testing

- **TDD per the plan.** Each implementation task has the failing test first.
- **L1 + L2 in CI** (no GPU): module structure, mocked pipeline call boundaries, ruff. `tests/test_smoke_gpu.py` is the GPU smoke; it's marked with `@pytest.mark.gpu` and skipped by default (pyproject `addopts = -m 'not gpu'`).
- **No mocks for DiffSynth internals.** Mock only the `pipe(...)` call boundary so the mode-handler logic is verified at the boundary.
- **Use `pytest -m gpu`** to opt into the GPU smoke (~30 GB download on a cold cache; runs full t2i base/turbo + controlnet + upscale at 384²).
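
Mocking at the `pipe(...)` boundary might look like the sketch below. `run_t2i` is a hypothetical mode handler written for this example, and the `num_inference_steps` keyword is an assumption about the pipeline's call signature.

```python
from unittest.mock import MagicMock


def run_t2i(pipe, prompt: str, steps: int):
    """Hypothetical mode handler: validate inputs, then call the pipeline."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return pipe(prompt=prompt.strip(), num_inference_steps=steps)


def test_run_t2i_forwards_cleaned_prompt():
    # Only the call boundary is mocked; no DiffSynth internals are patched.
    pipe = MagicMock(return_value="image")
    assert run_t2i(pipe, "  a red fox  ", steps=8) == "image"
    pipe.assert_called_once_with(prompt="a red fox", num_inference_steps=8)
```

The test asserts on what the handler passes to the pipeline and what it returns, which is the logic we own; DiffSynth's behavior behind the call is left to the GPU smoke test.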

---

## Out of scope for v1 (don't add without asking)

- Multi-prompt queueing
- Output history persistence across sessions
- Telemetry / duration estimator that learns from logs
- Persistent storage add-on integration
- Custom LoRA add/remove rows (single LoRA per tab is the v1 cap)
- LoRA on the Upscale refinement pass (locked to vanilla Turbo refinement)
- ControlNet on Z-Image base (no released ControlNet weights for base)
- Z-Image-Edit and Z-Image-Omni-Base (placeholders link to GitHub Model Zoo)
- Display-font customization beyond Inter (locked by Soft Dark Restraint)
- Visual regression tests for the Gradio UI

If a task feels like it needs one of these, stop and ask the user.

---

## When in doubt

1. Read the spec + plan. Fifteen minutes of reading vs a day of wrong implementation.
2. Read `SKILLS.md` for the process side — debugging, deployment, when to commit, when to verify.
3. `git log --oneline` — most non-obvious decisions have a fix-commit explaining the reasoning.
4. **Ask the user** before changing architectural shape or adding scope outside the v1 list.