Diffusion Single File
comfyui

Proposal: Modulation Guidance — making AdaLN text-aware for quality steering

#122 opened by sorryhyun

Summary

Anima's AdaLN modulation path is entirely text-blind — shift/scale/gate coefficients are functions of timestep only. Text conditioning enters exclusively via cross-attention. Based on Starodubcev et al., "Rethinking Global Text Conditioning in Diffusion Transformers" (ICLR 2026), injecting a pooled text embedding into the modulation path and applying guidance in modulation space yields quality improvements orthogonal to CFG.

We ran pre-implementation validation experiments on the frozen Anima model to check whether this approach is viable. The results suggest it is — sharing the findings here in case they're useful.

Current state

| Component | Text-dependent? | Notes |
| --- | --- | --- |
| Cross-attention KV | Yes | Qwen3 → LLMAdapter → 28 blocks |
| AdaLN shift/scale/gate | No | t_embedder sees only the timestep |
| CFG | Yes | Noise-space guidance (cond - uncond) |

Validation results

Pooling strategy for global text representation

Evaluated five pooling strategies on crossattn_emb, scoring each by K-Means clustering NMI against artist labels (1,416 images, 37 artists):

| Strategy | Source | K-Means NMI |
| --- | --- | --- |
| Max pool | crossattn_emb (post-LLMAdapter) | 0.926 |
| Mean pool | crossattn_emb (post-LLMAdapter) | 0.551 |
| Mean pool | prompt_embeds (pre-LLMAdapter) | 0.400 |
| EOS token | prompt_embeds (pre-LLMAdapter) | 0.170 |
| EOS token | crossattn_emb (post-LLMAdapter) | 0.089 |

Max pooling on post-adapter embeddings dramatically outperforms alternatives. EOS is near-useless — Qwen3's causal LM EOS captures tokenization artifacts, not semantics. Mean pool drowns discriminative features in shared content tokens. The LLMAdapter itself concentrates discriminative information, making post-adapter pooling far richer than pre-adapter.
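The pooling variants compared above can be sketched as follows (a minimal NumPy illustration; the toy crossattn_emb shape and dimensions are assumptions, not Anima's actual tensors):

```python
import numpy as np

def max_pool(emb):
    # emb: (T, D) per-prompt token embeddings -> (D,) global vector
    return emb.max(axis=0)

def mean_pool(emb):
    # averages over tokens; discriminative features get diluted by
    # shared content tokens, per the NMI results above
    return emb.mean(axis=0)

def eos_pool(emb):
    # last token of a causal LM sequence (the conventional EOS strategy)
    return emb[-1]

# toy stand-in for post-LLMAdapter embeddings: 77 tokens, 1024-dim (assumed)
rng = np.random.default_rng(0)
crossattn_emb = rng.standard_normal((77, 1024))

pooled = {name: fn(crossattn_emb)
          for name, fn in [("max", max_pool), ("mean", mean_pool), ("eos", eos_pool)]}
assert all(v.shape == (1024,) for v in pooled.values())
```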

Quality direction consistency across content

The quality direction max_pool(p+) - max_pool(p-) was tested across 8 diverse content types (solo character, group scene, landscape, action, mecha, still life, abstract, portrait):

| Metric | Value |
| --- | --- |
| Average pairwise cosine similarity | 0.814 |
| Minimum pairwise cosine similarity | 0.770 |

All 28 pairwise cosine similarities exceed 0.77. A single global guidance direction generalizes across content — no need for content-conditioned directions.
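The consistency check above amounts to computing all C(8,2) = 28 pairwise cosine similarities between per-content quality directions; a small sketch (the clustered toy directions only illustrate the computation, not the reported numbers):

```python
import numpy as np

def pairwise_cosine_stats(directions):
    """directions: (N, D) quality directions, one per content type.
    Returns (mean, min) over all N*(N-1)/2 pairwise cosine similarities."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sims = d @ d.T
    iu = np.triu_indices(len(d), k=1)   # strictly upper triangle: unique pairs
    pairwise = sims[iu]
    return pairwise.mean(), pairwise.min()

# toy check: 8 content-type directions scattered around one global direction
rng = np.random.default_rng(0)
base = rng.standard_normal(1024)
dirs = base + 0.3 * rng.standard_normal((8, 1024))
mean_sim, min_sim = pairwise_cosine_stats(dirs)
assert len(np.triu_indices(8, k=1)[0]) == 28   # 8 choose 2 pairs
```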

Injection point comparison

Compared three injection points for the projected pooled text vector by measuring noise prediction MSE at varying perturbation scales:

| Injection point | MSE @ α=2.0 | MSE @ α=8.0 | Growth α=4→8 |
| --- | --- | --- | --- |
| Before t_embedding_norm | 4.77e-4 | 1.08e-2 | 6.1x |
| After t_embedding_norm | 4.76e-3 | 1.89e-1 | 7.2x |
| Into adaln_lora branch | 4.29e-3 | 4.06e-2 | 2.6x |

After normalization is optimal: ~10x more sensitive than before norm (RMSNorm re-centers the perturbation) and ~4.7x more dynamic range than the adaln_lora branch. All injection points remain stable at high α (no collapse).
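The before/after gap has a clean mechanistic illustration: RMSNorm is scale-invariant, so any perturbation component parallel to the timestep embedding is absorbed when injected before the norm but survives fully when injected after it. A toy demonstration (plain RMSNorm without a learned gain; Anima's actual t_embedding_norm may differ):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale to unit root-mean-square (no learned gain, for brevity)
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
t_emb = rng.standard_normal(1024)
v = t_emb / np.linalg.norm(t_emb)   # worst case: perturbation parallel to t_emb
alpha = 2.0

base = rms_norm(t_emb)
before_norm = rms_norm(t_emb + alpha * v)   # injection before the norm
after_norm = rms_norm(t_emb) + alpha * v    # injection after the norm

shift_before = np.linalg.norm(before_norm - base)   # ~0: absorbed by rescaling
shift_after = np.linalg.norm(after_norm - base)     # = alpha: fully preserved
```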

Quality/resolution correlation in embedding space

| Metric | Value |
| --- | --- |
| Quality vs. resolution cosine (standalone directions) | 0.021 (nearly orthogonal) |
| Quality vs. resolution cosine (per-content directions) | 0.496 (correlated) |

Standalone directions are orthogonal, but per-content they correlate — high-resolution training images tend to be higher quality. This means including resolution tags (absurdres, highres) in the positive guidance prompt is beneficial, but using a separate resolution guidance direction on top of quality guidance would interfere.

Recommended guidance prompts:

  • p+: "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
  • p-: "worst quality, low quality, score_1, score_2, score_3"
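Combining the pooling and prompt findings, the guidance direction would be built roughly like this (a sketch: toy_encode stands in for the real Qwen3 → LLMAdapter path, and quality_direction is a hypothetical helper name):

```python
import numpy as np

P_POS = "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
P_NEG = "worst quality, low quality, score_1, score_2, score_3"

def quality_direction(encode, p_pos, p_neg):
    # encode: prompt -> (T, D) post-LLMAdapter crossattn_emb tokens
    # global direction = max_pool(p+) - max_pool(p-)
    return encode(p_pos).max(axis=0) - encode(p_neg).max(axis=0)

def toy_encode(prompt):
    # deterministic stand-in for the real text encoder stack
    rs = np.random.default_rng(len(prompt))
    return rs.standard_normal((16, 1024))

d = quality_direction(toy_encode, P_POS, P_NEG)   # (1024,)
```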

Proposed architecture change

A small projection MLP (~6.3M params, 0.3% of model) injected after t_embedding_norm:

```python
# Max-pool text tokens, project, and inject after t_embedding_norm
pooled = crossattn_emb.max(dim=1).values          # (B, 1024)
t_embedding_B_T_D = t_embedding_B_T_D + pooled_text_proj(pooled).unsqueeze(1)
```

Zero-initialized output layer means no effect before distillation training. The distillation follows the paper's Section 5: freeze the model, train only the projection using teacher (full cross-attention) vs. student (unconditional cross-attention, projection active) with MSE loss. ~4K iterations on the existing training dataset.
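A sketch of the projection itself; the hidden width (3072) and SiLU activation are assumptions chosen to land near the quoted ~6.3M parameters, not the proposal's final architecture:

```python
import torch
import torch.nn as nn

class PooledTextProj(nn.Module):
    """Pooled-text projection MLP. At dim=1024, hidden=3072 this has ~6.3M
    parameters. The output layer is zero-initialized, so the injection is an
    exact no-op until distillation training updates it."""
    def __init__(self, dim=1024, hidden=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pooled):
        return self.net(pooled)

pooled_text_proj = PooledTextProj()
crossattn_emb = torch.randn(2, 77, 1024)
pooled = crossattn_emb.max(dim=1).values          # (B, 1024)
delta = pooled_text_proj(pooled)                  # all zeros before training
```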

Once trained, inference-time modulation guidance (Eq. 3 from the paper) steers quality through AdaLN coefficients — orthogonal to CFG, composable with LoRA/T-LoRA, negligible latency cost.
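Eq. 3 itself isn't reproduced in this thread; assuming it is the standard CFG-style extrapolation applied in the AdaLN input space, inference-time guidance would look roughly like:

```python
import numpy as np

def modulation_guidance(t_emb_uncond, t_emb_cond, w):
    """CFG-style extrapolation on the AdaLN input (a sketch; the exact form of
    the paper's Eq. 3 is assumed, not quoted in this thread). t_emb_cond
    includes the pooled-text projection; t_emb_uncond does not."""
    return t_emb_uncond + w * (t_emb_cond - t_emb_uncond)

t_u = np.zeros(1024)
t_c = np.ones(1024)
guided = modulation_guidance(t_u, t_c, 3.0)   # steers AdaLN coefficients only
```

Because this acts on the modulation path rather than the predicted noise, it composes with ordinary CFG, which is what makes the two guidance signals stack.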

Compatibility

| Feature | Interaction |
| --- | --- |
| T-LoRA | Orthogonal (different parameter spaces) |
| CFG | Complementary (noise space vs. AdaLN space; they stack) |
| HydraLoRA | Shared pooling (crossattn_emb.max(dim=1).values) |
| Spectrum | Compatible (guidance applies to emb_B_T_D before blocks) |

@nagarago Yeah, I have read that implementation, but I wanted to verify whether this would actually work and how to implement it in detail. The experiments above are, in a sense, the grounding for the design decisions I made.

@nagarago For example, I found that max pooling can be more helpful than the conventional EOS pooling that comfy's CLIP uses, and since I don't know how the quality (masterpiece, score_9, ...) and resolution (highres, absurdres, ...) tags were trained into crossattn_emb, I tried several guidance prompt variants.

@sorryhyun It seems that the default config of your Mod Guidance is quite different from Anzhc's. The outputs of "KSampler (Spectrum + Mod Guidance)" and "Anima Mod Guidance + KSampler (Spectrum)" are significantly different, and some schedulers like bong_tangent produce color drift.

@ArranEye Yeah, those will be quite different, since the default config was tuned to my personal preference; sorry for the bit of messiness.
The outputs should differ: I use a different mod guidance weight (trained personally), and Spectrum has also been adjusted for best quality.
And yeah, I agree that color drift will happen with some schedulers; I haven't tested anything except the simple scheduler (with the er_sde method).
