Proposal: Modulation Guidance — making AdaLN text-aware for quality steering
Summary
Anima's AdaLN modulation path is entirely text-blind — shift/scale/gate coefficients are functions of timestep only. Text conditioning enters exclusively via cross-attention. Based on Starodubcev et al., "Rethinking Global Text Conditioning in Diffusion Transformers" (ICLR 2026), injecting a pooled text embedding into the modulation path and applying guidance in modulation space yields quality improvements orthogonal to CFG.
We ran pre-implementation validation experiments on the frozen Anima model to check whether this approach is viable. The results suggest it is — sharing the findings here in case they're useful.
Current state
| Component | Text-dependent? | Notes |
|---|---|---|
| Cross-attention KV | Yes | Qwen3 → LLMAdapter → 28 blocks |
| AdaLN shift/scale/gate | No | t_embedder sees only timestep |
| CFG | Yes | Noise-space guidance (cond - uncond) |
Validation results
Pooling strategy for global text representation
Evaluated 5 pooling strategies for a global text representation, scoring each by the normalized mutual information (NMI) between K-Means cluster assignments of the pooled embeddings and artist labels (1,416 images, 37 artists):
| Strategy | Source | KMeans NMI |
|---|---|---|
| Max pool | crossattn_emb (post-LLMAdapter) | 0.926 |
| Mean pool | crossattn_emb (post-LLMAdapter) | 0.551 |
| Mean pool | prompt_embeds (pre-LLMAdapter) | 0.400 |
| EOS token | prompt_embeds (pre-LLMAdapter) | 0.170 |
| EOS token | crossattn_emb (post-LLMAdapter) | 0.089 |
Max pooling on post-adapter embeddings dramatically outperforms alternatives. EOS is near-useless — Qwen3's causal LM EOS captures tokenization artifacts, not semantics. Mean pool drowns discriminative features in shared content tokens. The LLMAdapter itself concentrates discriminative information, making post-adapter pooling far richer than pre-adapter.
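For concreteness, the pooling variants compared above amount to the following (a minimal numpy sketch; the real tensors are `crossattn_emb` / `prompt_embeds`, and the shapes here are toy stand-ins):

```python
import numpy as np

def max_pool(emb):
    """Per-dimension max over the token axis: (B, T, D) -> (B, D)."""
    return emb.max(axis=1)

def mean_pool(emb):
    """Average over the token axis: (B, T, D) -> (B, D)."""
    return emb.mean(axis=1)

def eos_pool(emb, eos_index=-1):
    """Take the embedding at the EOS token position: (B, T, D) -> (B, D)."""
    return emb[:, eos_index, :]

# Toy stand-in for a post-adapter embedding batch (B=2, T=4, D=8).
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 4, 8))
assert max_pool(emb).shape == (2, 8)
assert mean_pool(emb).shape == (2, 8)
```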
Quality direction consistency across content
The quality direction `max_pool(p+) - max_pool(p-)`, where p+ and p- are the positive and negative quality prompts (pooled post-adapter), was tested across 8 diverse content types (solo character, group scene, landscape, action, mecha, still life, abstract, portrait):
| Metric | Value |
|---|---|
| Average pairwise cosine similarity | 0.814 |
| Minimum pairwise cosine similarity | 0.770 |
All 28 pairwise cosine similarities exceed 0.77. A single global guidance direction generalizes across content — no need for content-conditioned directions.
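The consistency check can be reproduced in miniature as follows (numpy sketch; the eight directions here are synthetic stand-ins sharing a dominant common component, not the actual embedding-space directions):

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic per-content quality directions: in the real experiment, each is
# max_pool(embed(content + p_plus)) - max_pool(embed(content + p_minus)).
rng = np.random.default_rng(0)
common = rng.normal(size=1024)
directions = [common + 0.3 * rng.normal(size=1024) for _ in range(8)]

pairs = list(combinations(range(8), 2))   # C(8, 2) = 28 pairs
sims = [cosine(directions[i], directions[j]) for i, j in pairs]
print(len(sims), min(sims), sum(sims) / len(sims))
```

A high minimum pairwise cosine is what licenses using a single global direction instead of content-conditioned ones.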
Injection point comparison
Compared three injection points for the projected pooled text vector by measuring noise prediction MSE at varying perturbation scales:
| Injection point | MSE @ α=2.0 | MSE @ α=8.0 | Growth α=4→8 |
|---|---|---|---|
| Before t_embedding_norm | 4.77e-4 | 1.08e-2 | 6.1x |
| After t_embedding_norm | 4.76e-3 | 1.89e-1 | 7.2x |
| Into adaln_lora branch | 4.29e-3 | 4.06e-2 | 2.6x |
After normalization is optimal: ~10x more sensitive than injection before the norm (RMSNorm rescales a pre-norm perturbation along with the embedding, partially absorbing it) and ~4.7x more dynamic range at α=8 than the adaln_lora branch. All injection points remain stable at high α (no collapse).
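The intuition for the pre- vs. post-norm gap can be illustrated with a toy RMSNorm (a minimal sketch, not Anima's actual `t_embedding_norm` module): a perturbation added before the norm is rescaled together with the embedding, while one added after passes through at full strength:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Minimal RMSNorm (no learned gain): rescale each row to unit RMS."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
t_emb = 5.0 * rng.normal(size=(1, 1024))   # embedding with RMS ~5 (toy)
delta = rng.normal(size=(1, 1024))         # projected pooled-text vector (toy)

base = rms_norm(t_emb)
pre  = rms_norm(t_emb + delta)             # inject before the norm
post = rms_norm(t_emb) + delta             # inject after the norm

# The pre-norm perturbation is divided by ~RMS(t_emb); the post-norm one is not.
print(np.linalg.norm(pre - base), np.linalg.norm(post - base))
```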
Quality/resolution correlation in embedding space
| Metric | Value |
|---|---|
| Quality vs. resolution cosine (standalone directions) | 0.021 (nearly orthogonal) |
| Quality vs. resolution cosine (per-content directions) | 0.496 (correlated) |
Standalone directions are orthogonal, but per-content they correlate — high-resolution training images tend to be higher quality. This means including resolution tags (absurdres, highres) in the positive guidance prompt is beneficial, but using a separate resolution guidance direction on top of quality guidance would interfere.
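The interference argument is just vector arithmetic: guiding along two unit directions with cosine c gives a combined step of norm sqrt(2 + 2c) at equal weight, so a correlated resolution direction (c ≈ 0.496) double-counts the shared component the quality direction already carries:

```python
import numpy as np

def combined_step_norm(cos_qr, w=1.0):
    """Norm of w*q + w*r for unit vectors q, r with cosine cos_qr."""
    return w * np.sqrt(2 + 2 * cos_qr)

print(combined_step_norm(0.021))  # near-orthogonal: ~1.43 (~sqrt(2))
print(combined_step_norm(0.496))  # correlated: ~1.73, shared component double-counted
```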
Recommended guidance prompts:
- p+: "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
- p-: "worst quality, low quality, score_1, score_2, score_3"
Proposed architecture change
A small projection MLP (~6.3M params, 0.3% of model) injected after t_embedding_norm:
pooled = crossattn_emb.max(dim=1).values # (B, 1024)
t_embedding_B_T_D = t_embedding_B_T_D + pooled_text_proj(pooled).unsqueeze(1)
The projection's output layer is zero-initialized, so it has no effect before distillation training. The distillation follows the paper's Section 5: freeze the model and train only the projection, matching the teacher pass (full cross-attention) to the student pass (unconditional cross-attention, projection active) with an MSE loss, for ~4K iterations on the existing training dataset.
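A sketch of the projection module under stated assumptions (the hidden width of 3072 is chosen only so the parameter count lands near the quoted ~6.3M; the real architecture is not specified here, and the training loop is indicated in comments because it needs the frozen model):

```python
import torch
import torch.nn as nn

class PooledTextProj(nn.Module):
    """Small MLP mapping pooled text (D_text) into the t-embedding space.
    Output layer is zero-initialized, so the module is a no-op at step 0."""
    def __init__(self, d_text=1024, d_model=1024, d_hidden=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_text, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model)
        )
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pooled):
        return self.net(pooled)

proj = PooledTextProj()                       # ~6.3M params
pooled = torch.randn(4, 1024)
assert proj(pooled).abs().max().item() == 0.0  # zero-init: no effect yet

# Distillation sketch (hypothetical call signatures, model frozen):
#   teacher_eps = model(x_t, t, crossattn_emb)                # full cross-attn
#   student_eps = model(x_t, t, null_emb, extra=proj(pooled)) # projection active
#   loss = F.mse_loss(student_eps, teacher_eps)  # gradients flow only to proj
```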
Once trained, inference-time modulation guidance (Eq. 3 from the paper) steers quality through AdaLN coefficients — orthogonal to CFG, composable with LoRA/T-LoRA, negligible latency cost.
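Read by analogy to CFG, modulation guidance extrapolates the AdaLN coefficients between a negative-prompt and positive-prompt pass (a sketch of one plausible reading of the paper's Eq. 3, not a verified transcription; variable names and shapes are hypothetical):

```python
import numpy as np

def modulation_guidance(mod_neg, mod_pos, w):
    """CFG-style extrapolation applied to AdaLN coefficients
    (shift/scale/gate) rather than to noise predictions."""
    return mod_neg + w * (mod_pos - mod_neg)

# Toy AdaLN coefficient tensors (a packed 6*D shift/scale/gate layout is
# common in DiT-style blocks; shapes here are illustrative only).
rng = np.random.default_rng(0)
mod_pos = rng.normal(size=(1, 6 * 1024))
mod_neg = rng.normal(size=(1, 6 * 1024))

assert np.allclose(modulation_guidance(mod_neg, mod_pos, 1.0), mod_pos)
assert np.allclose(modulation_guidance(mod_neg, mod_pos, 0.0), mod_neg)
```

Because this acts on the modulation coefficients rather than the noise prediction, it stacks with ordinary CFG instead of competing with it.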
Compatibility
| Feature | Interaction |
|---|---|
| T-LoRA | Orthogonal (different parameter spaces) |
| CFG | Complementary (noise space vs. AdaLN space, they stack) |
| HydraLoRA | Shared pooling (crossattn_emb.max(dim=1).values) |
| Spectrum | Compatible (guidance applies to emb_B_T_D before blocks) |
- Update: I tried a dynamic guidance strategy inspired by the original authors' appendix, and their 'i8_skip27' schedule works quite well. Here is a drop-in implementation that replaces the KSampler block: https://github.com/sorryhyun/ComfyUI-Spectrum-KSampler
@sorryhyun it seems that the default config of your Mod Guidance is quite different from Anzhc's. The outputs of "KSampler (Spectrum + Mod Guidance)" and "Anima Mod Guidance + KSampler (Spectrum)" differ significantly, and some schedulers like bong_tangent produce color drift.
@ArranEye Yeah, those will be quite different since the default config was tuned to my personal preference, sorry for the messiness.
The outputs should differ: I use a different mod guidance weight (trained personally), and Spectrum has also been adjusted for best quality.
And yes, I agree that color drift will happen with some schedulers; I've only tested the simple scheduler (with the er_sde method).