docs: add item #11 — local style perf / GGUF Q4 toggle
Track the architectural mismatch causing ~37x slower per-step style
generations on Apple Silicon vs ZeroGPU H200. Root cause is
LTXAddVideoICLoRAGuide doubling the attention sequence + no flash-attn-2
on MPS. A GGUF Q4 transformer toggle would mitigate but is v1.1+ scope.
- docs/future_improvements.md +32 -0
@@ -114,3 +114,35 @@ Currently `@spaces.GPU(duration=300)` reserves 5 min per call. For Fast preset
 (distilled 8 steps) actual usage is ~30 s. Could shorten to 120 s, which improves
 queue priority for the user (per HF docs). Use dynamic duration based on
 preset.
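
The "dynamic duration based on preset" idea in the context lines above could look roughly like the sketch below. Only the "Fast" preset and the 120 s / 300 s figures come from the note; the function name and any other preset names are hypothetical. ZeroGPU's `spaces.GPU` decorator accepts a callable for `duration`, so the decorator is shown only in a comment to keep the snippet runnable off Spaces.

```python
# Reservation per preset, in seconds. Only "Fast" (120 s) is taken from the
# note above; any other preset falls back to the current fixed 300 s.
PRESET_GPU_SECONDS = {"Fast": 120}
DEFAULT_GPU_SECONDS = 300


def gpu_duration(preset: str, *args, **kwargs) -> int:
    """Return the ZeroGPU reservation, in seconds, for the given preset.

    Extra positional/keyword arguments are accepted and ignored so the
    function can share a signature with the decorated generation function.
    """
    return PRESET_GPU_SECONDS.get(preset, DEFAULT_GPU_SECONDS)


# On Spaces, the callable would be handed to the decorator, which calls it
# with the same arguments as the decorated function (hypothetical handler):
#
#   @spaces.GPU(duration=gpu_duration)
#   def generate(preset, prompt): ...
```

Keeping the mapping in one dictionary means a new preset only needs one extra entry, and unknown presets keep today's behaviour.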
+
+### 11. Local-perf "low-VRAM" path for style mode (GGUF Q4 transformer)
+
+Style mode on Apple Silicon runs ~37× slower per sampling step than the other
+modes (~596 s/step on Mac vs ~16 s/step for lipsync). Root cause is
+architectural: `LTXAddVideoICLoRAGuide` concatenates the source video's
+DWPose latents onto the noisy target latent, doubling the attention sequence
+to ~56 k tokens. Combined with MPS having no flash-attn-2 and the 22B BF16
+model approaching the working-memory ceiling, perf collapses on Mac.
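
To make the memory argument above concrete, here is a back-of-the-envelope sketch. It assumes full self-attention whose score matrix is materialised per head (the case a fused flash-attention kernel avoids), takes the ~56 k token and 22B parameter figures from the text, and assumes 4.5 bits per weight for Q4, which is the Q4_0 block layout; the actual GGUF variant in `MODEL_REGISTRY` is not stated, so treat that number as illustrative.

```python
BYTES_BF16 = 2


def attention_scores_gib(seq_len: int, bytes_per_el: int = BYTES_BF16) -> float:
    """Size in GiB of one head's full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_el / 2**30


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Size in GiB of the model weights at a given precision."""
    return n_params * bits_per_weight / 8 / 2**30


doubled = 56_000        # ~56 k tokens with the guide latents concatenated
single = doubled // 2   # sequence length without the guide

# Doubling the sequence quadruples the score matrix (quadratic in length).
growth = attention_scores_gib(doubled) / attention_scores_gib(single)

print(f"score matrix growth:    {growth:.1f}x")
print(f"scores per head (BF16): {attention_scores_gib(doubled):.2f} GiB")
print(f"22B weights, BF16:      {weights_gib(22e9, 16):.1f} GiB")
print(f"22B weights, ~Q4:       {weights_gib(22e9, 4.5):.1f} GiB  # assumed 4.5 bpw")
```

The point of the sketch is the scaling, not the exact figures: the attention cost grows with the square of the sequence length, while quantization only shrinks the weight term.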
+
+H200 handles this fine (flash-attn-3 + tensor cores + dedicated VRAM ⇒
+~30–60 s end to end on Spaces). So this is fundamentally a Mac/MPS gap, not
+a code bug.
+
+A "Low VRAM" preset that swaps the BF16 transformer for the GGUF Q4
+quantized one would reduce per-step memory pressure and may bring local
+style perf into the workable range (still slow, but maybe ~60–90 s/step
+instead of ~600). The GGUF file is already declared in `MODEL_REGISTRY`
+(`UnetLoaderGGUF` consumer). What's missing:
+
+1. A workflow toggle that swaps `UNETLoader` → `UnetLoaderGGUF` for the main
+   transformer in `style.json` (and other modes that benefit).
+2. A UI control in the Advanced accordion: "Low VRAM (GGUF Q4)".
+3. Wire-through in `_style_parameterize` (and friends) to flip the loader
+   class.
+4. Delete the matching BF16 path nodes when GGUF is selected (or set them
+   to bypass) so we don't load both.
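
Steps 1 and 4 above could be sketched as a small transform over an API-format workflow dictionary (node id mapped to `class_type` and `inputs`). This is only an illustration: the function name, file names, and the downstream node are hypothetical, and it assumes the GGUF loader takes a single `unet_name` input. It is not the project's `_style_parameterize` implementation.

```python
import copy


def use_gguf_transformer(workflow: dict, gguf_name: str) -> dict:
    """Return a copy of the workflow with every UNETLoader node swapped
    for UnetLoaderGGUF, leaving the original workflow untouched."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Assumed input schema: the GGUF loader only needs the file name,
            # so BF16-only inputs such as weight_dtype are dropped.
            node["inputs"] = {"unet_name": gguf_name}
    return patched


# Hypothetical two-node excerpt; real workflows have many more nodes.
style = {
    "1": {
        "class_type": "UNETLoader",
        "inputs": {"unet_name": "transformer-bf16.safetensors",
                   "weight_dtype": "default"},
    },
    # Placeholder consumer that references output 0 of node "1".
    "2": {"class_type": "DownstreamNode", "inputs": {"model": ["1", 0]}},
}

low_vram = use_gguf_transformer(style, "transformer-Q4.gguf")
```

Rewriting the loader node in place keeps its node id, so downstream links such as `["1", 0]` stay valid and the BF16 loader is replaced rather than loaded alongside the GGUF one.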
+
+Risk: GGUF transformers behave slightly differently from BF16, and output
+quality drops, especially for IC-LoRA paths where the dynamic range matters.
+Should be opt-in only, never default. Probably v1.1+ scope (it's listed in
+"Out of scope for v1" in CLAUDE.md as the GGUF Q4 / Low VRAM preset).