techfreakworm committed on
Commit b3e0fc9 · unverified · 1 Parent(s): 12ca777

docs: add item #11 — local style perf / GGUF Q4 toggle

Track the architectural mismatch causing ~37x slower per-step style
generations on Apple Silicon vs ZeroGPU H200. Root cause is
LTXAddVideoICLoRAGuide doubling the attention sequence + no flash-attn-2
on MPS. A GGUF Q4 transformer toggle would mitigate but is v1.1+ scope.

Files changed (1)
  1. docs/future_improvements.md +32 -0
docs/future_improvements.md CHANGED
@@ -114,3 +114,35 @@ Currently `@spaces.GPU(duration=300)` reserves 5 min per call. For Fast preset
  (distilled 8 steps) actual usage is ~30 s. Could shorten to 120 s — improves
  queue priority for the user (per HF docs). Use dynamic duration based on
  preset.
+
+ ### 11. Local-perf "low-VRAM" path for style mode (GGUF Q4 transformer)
+
+ Style mode on Apple Silicon runs ~37× slower per sampling step than the other
+ modes (~596 s/step for style vs ~16 s/step for lipsync on Mac). The root
+ cause is architectural: `LTXAddVideoICLoRAGuide` concatenates the source
+ video's DWPose latents onto the noisy target latent, doubling the attention
+ sequence to ~56 k tokens. MPS has no flash-attn-2, and the 22B BF16 model is
+ already close to the working-memory ceiling, so performance collapses on Mac.
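
To make the scaling argument concrete, here is a back-of-the-envelope sketch. It assumes a naive attention path that materializes the full score matrix in BF16 for one head of one layer; whether the MPS backend actually does this is an assumption, not something the note states. The ~56 k token count and the 22B parameter count come from the text; the helper names are illustrative.

```python
# Rough arithmetic only. Assumption: attention scores are fully materialized
# in BF16 (no fused/flash kernel); figures are per head, per layer.
BYTES_BF16 = 2


def score_matrix_gib(seq_len: int, bytes_per_el: int = BYTES_BF16) -> float:
    """Memory for one seq_len x seq_len attention score matrix, in GiB."""
    return seq_len * seq_len * bytes_per_el / 2**30


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight footprint in decimal GB for a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9


guided = 56_000         # ~56 k tokens with the IC-LoRA guide (from the text)
unguided = guided // 2  # implied sequence length without the guide

# Self-attention cost grows with the square of the sequence length.
cost_ratio = (guided / unguided) ** 2

print(f"attention cost ratio: {cost_ratio:.0f}x")
print(f"score matrix per head: {score_matrix_gib(unguided):.2f} GiB "
      f"-> {score_matrix_gib(guided):.2f} GiB")
print(f"22B BF16 weights: {weights_gb(22e9, 16):.0f} GB")
```

Doubling the sequence quadruples the quadratic part of attention, which is why the guide hurts far more than a 2× token count suggests when no memory-efficient kernel is available.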
+
+ The H200 handles this fine: flash-attn-3, tensor cores, and dedicated VRAM
+ give ~30–60 s end to end on Spaces. This is fundamentally a Mac/MPS gap, not
+ a code bug.
+
+ A "Low VRAM" preset that swaps the BF16 transformer for the GGUF Q4
+ quantized one would reduce per-step memory pressure and may bring local
+ style perf into a workable range (still slow, but at a rough guess
+ ~60–90 s/step instead of ~600). The GGUF file is already declared in
+ `MODEL_REGISTRY`, with `UnetLoaderGGUF` as its consumer. What's missing:
+
+ 1. A workflow toggle that swaps `UNETLoader` → `UnetLoaderGGUF` for the main
+    transformer in `style.json` (and in any other modes that benefit).
+ 2. A UI control in the Advanced accordion: "Low VRAM (GGUF Q4)".
+ 3. Wire-through in `_style_parameterize` (and friends) to flip the loader
+    class.
+ 4. Removal (or bypass) of the matching BF16 path nodes when GGUF is
+    selected, so we don't load both.
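
Steps 1 and 4 above could be sketched roughly as follows. This assumes the workflow is a ComfyUI API-format dict (node ID mapped to `class_type` and `inputs`) and that `UnetLoaderGGUF` takes a single `unet_name` input. The function name `use_gguf_loader` and the file names are hypothetical, and the project's real `_style_parameterize` may be structured quite differently.

```python
import copy


def use_gguf_loader(workflow: dict, gguf_name: str) -> dict:
    """Return a copy of the workflow with UNETLoader nodes swapped for GGUF.

    Hypothetical helper. Rewriting each node in place keeps its node ID, so
    downstream links such as ["1", 0] stay valid and no BF16 loader is left
    behind to be loaded alongside the GGUF one.
    """
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Assumed input set for the GGUF loader: just the file name.
            node["inputs"] = {"unet_name": gguf_name}
    return patched


# Trimmed, illustrative workflow; real graphs have many more nodes and inputs.
workflow = {
    "1": {
        "class_type": "UNETLoader",
        "inputs": {"unet_name": "transformer-bf16.safetensors",
                   "weight_dtype": "default"},
    },
    "2": {"class_type": "KSampler", "inputs": {"model": ["1", 0], "steps": 8}},
}

low_vram = use_gguf_loader(workflow, "transformer-Q4.gguf")
print(low_vram["1"])
```

This sketch swaps every `UNETLoader` it finds. If a mode loads more than one diffusion model, a real implementation would probably target the main transformer's node ID explicitly instead.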
+
+ Risk: GGUF transformers behave slightly differently from BF16, and output
+ quality drops, especially on IC-LoRA paths where dynamic range matters.
+ This should be opt-in only, never the default. Probably v1.1+ scope (it's
+ listed under "Out of scope for v1" in `CLAUDE.md` as the GGUF Q4 / Low VRAM
+ preset).