Diffusion Single File
comfyui

Proposal: Modulation Guidance — making AdaLN text-aware for quality steering

#122 opened by sorryhyun

Summary

Anima's AdaLN modulation path is entirely text-blind — shift/scale/gate coefficients are functions of timestep only. Text conditioning enters exclusively via cross-attention. Based on Starodubcev et al., "Rethinking Global Text Conditioning in Diffusion Transformers" (ICLR 2026), injecting a pooled text embedding into the modulation path and applying guidance in modulation space yields quality improvements orthogonal to CFG.

We ran pre-implementation validation experiments on the frozen Anima model to check whether this approach is viable. The results suggest it is — sharing the findings here in case they're useful.

Current state

| Component | Text-dependent? | Notes |
| --- | --- | --- |
| Cross-attention KV | Yes | Qwen3 → LLMAdapter → 28 blocks |
| AdaLN shift/scale/gate | No | t_embedder sees only the timestep |
| CFG | Yes | Noise-space guidance (cond - uncond) |

Validation results

Pooling strategy for global text representation

Evaluated five pooling strategies on crossattn_emb, scoring each by K-Means clustering NMI against artist labels (1,416 images, 37 artists):

| Strategy | Source | K-Means NMI |
| --- | --- | --- |
| Max pool | crossattn_emb (post-LLMAdapter) | 0.926 |
| Mean pool | crossattn_emb (post-LLMAdapter) | 0.551 |
| Mean pool | prompt_embeds (pre-LLMAdapter) | 0.400 |
| EOS token | prompt_embeds (pre-LLMAdapter) | 0.170 |
| EOS token | crossattn_emb (post-LLMAdapter) | 0.089 |

Max pooling on post-adapter embeddings dramatically outperforms alternatives. EOS is near-useless — Qwen3's causal LM EOS captures tokenization artifacts, not semantics. Mean pool drowns discriminative features in shared content tokens. The LLMAdapter itself concentrates discriminative information, making post-adapter pooling far richer than pre-adapter.
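The pooling variants compared above can be sketched as follows (a minimal NumPy illustration; the toy crossattn_emb shape and dimensions are assumptions, not Anima's actual tensors):

```python
import numpy as np

def max_pool(emb):
    # emb: (T, D) per-prompt token embeddings -> (D,) global vector
    return emb.max(axis=0)

def mean_pool(emb):
    # averages over tokens; discriminative features get diluted by
    # shared content tokens, per the NMI results above
    return emb.mean(axis=0)

def eos_pool(emb):
    # last token of a causal LM sequence (the conventional EOS strategy)
    return emb[-1]

# toy stand-in for post-LLMAdapter embeddings: 77 tokens, 1024-dim (assumed)
rng = np.random.default_rng(0)
crossattn_emb = rng.standard_normal((77, 1024))

pooled = {name: fn(crossattn_emb)
          for name, fn in [("max", max_pool), ("mean", mean_pool), ("eos", eos_pool)]}
assert all(v.shape == (1024,) for v in pooled.values())
```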

Quality direction consistency across content

The quality direction max_pool(p+) - max_pool(p-) was tested across 8 diverse content types (solo character, group scene, landscape, action, mecha, still life, abstract, portrait):

| Metric | Value |
| --- | --- |
| Average pairwise cosine similarity | 0.814 |
| Minimum pairwise cosine similarity | 0.770 |

All 28 pairwise cosine similarities exceed 0.77. A single global guidance direction generalizes across content — no need for content-conditioned directions.
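The consistency check above amounts to computing all C(8,2) = 28 pairwise cosine similarities between per-content quality directions; a small sketch (the clustered toy directions only illustrate the computation, not the reported numbers):

```python
import numpy as np

def pairwise_cosine_stats(directions):
    """directions: (N, D) quality directions, one per content type.
    Returns (mean, min) over all N*(N-1)/2 pairwise cosine similarities."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sims = d @ d.T
    iu = np.triu_indices(len(d), k=1)   # strictly upper triangle: unique pairs
    pairwise = sims[iu]
    return pairwise.mean(), pairwise.min()

# toy check: 8 content-type directions scattered around one global direction
rng = np.random.default_rng(0)
base = rng.standard_normal(1024)
dirs = base + 0.3 * rng.standard_normal((8, 1024))
mean_sim, min_sim = pairwise_cosine_stats(dirs)
assert len(np.triu_indices(8, k=1)[0]) == 28   # 8 choose 2 pairs
```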

Injection point comparison

Compared three injection points for the projected pooled text vector by measuring noise prediction MSE at varying perturbation scales:

| Injection point | MSE @ α=2.0 | MSE @ α=8.0 | Growth α=4→8 |
| --- | --- | --- | --- |
| Before t_embedding_norm | 4.77e-4 | 1.08e-2 | 6.1x |
| After t_embedding_norm | 4.76e-3 | 1.89e-1 | 7.2x |
| Into adaln_lora branch | 4.29e-3 | 4.06e-2 | 2.6x |

After normalization is optimal: ~10x more sensitive than before norm (RMSNorm re-centers the perturbation) and ~4.7x more dynamic range than the adaln_lora branch. All injection points remain stable at high α (no collapse).
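The before/after gap has a clean mechanistic illustration: RMSNorm is scale-invariant, so any perturbation component parallel to the timestep embedding is absorbed when injected before the norm but survives fully when injected after it. A toy demonstration (plain RMSNorm without a learned gain; Anima's actual t_embedding_norm may differ):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale to unit root-mean-square (no learned gain, for brevity)
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
t_emb = rng.standard_normal(1024)
v = t_emb / np.linalg.norm(t_emb)   # worst case: perturbation parallel to t_emb
alpha = 2.0

base = rms_norm(t_emb)
before_norm = rms_norm(t_emb + alpha * v)   # injection before the norm
after_norm = rms_norm(t_emb) + alpha * v    # injection after the norm

shift_before = np.linalg.norm(before_norm - base)   # ~0: absorbed by rescaling
shift_after = np.linalg.norm(after_norm - base)     # = alpha: fully preserved
```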

Quality/resolution correlation in embedding space

| Metric | Value |
| --- | --- |
| Quality vs. resolution cosine (standalone directions) | 0.021 (nearly orthogonal) |
| Quality vs. resolution cosine (per-content directions) | 0.496 (correlated) |

Standalone directions are orthogonal, but per-content they correlate — high-resolution training images tend to be higher quality. This means including resolution tags (absurdres, highres) in the positive guidance prompt is beneficial, but using a separate resolution guidance direction on top of quality guidance would interfere.

Recommended guidance prompts:

  • p+: "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
  • p-: "worst quality, low quality, score_1, score_2, score_3"
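Combining the pooling and prompt findings, the guidance direction would be built roughly like this (a sketch: toy_encode stands in for the real Qwen3 → LLMAdapter path, and quality_direction is a hypothetical helper name):

```python
import numpy as np

P_POS = "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
P_NEG = "worst quality, low quality, score_1, score_2, score_3"

def quality_direction(encode, p_pos, p_neg):
    # encode: prompt -> (T, D) post-LLMAdapter crossattn_emb tokens
    # global direction = max_pool(p+) - max_pool(p-)
    return encode(p_pos).max(axis=0) - encode(p_neg).max(axis=0)

def toy_encode(prompt):
    # deterministic stand-in for the real text encoder stack
    rs = np.random.default_rng(len(prompt))
    return rs.standard_normal((16, 1024))

d = quality_direction(toy_encode, P_POS, P_NEG)   # (1024,)
```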

Proposed architecture change

A small projection MLP (~6.3M params, 0.3% of model) injected after t_embedding_norm:

```python
# Max-pool text tokens, project, and inject after t_embedding_norm
pooled = crossattn_emb.max(dim=1).values          # (B, 1024)
t_embedding_B_T_D = t_embedding_B_T_D + pooled_text_proj(pooled).unsqueeze(1)
```

Zero-initialized output layer means no effect before distillation training. The distillation follows the paper's Section 5: freeze the model, train only the projection using teacher (full cross-attention) vs. student (unconditional cross-attention, projection active) with MSE loss. ~4K iterations on the existing training dataset.
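A sketch of the projection itself; the hidden width (3072) and SiLU activation are assumptions chosen to land near the quoted ~6.3M parameters, not the proposal's final architecture:

```python
import torch
import torch.nn as nn

class PooledTextProj(nn.Module):
    """Pooled-text projection MLP. At dim=1024, hidden=3072 this has ~6.3M
    parameters. The output layer is zero-initialized, so the injection is an
    exact no-op until distillation training updates it."""
    def __init__(self, dim=1024, hidden=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pooled):
        return self.net(pooled)

pooled_text_proj = PooledTextProj()
crossattn_emb = torch.randn(2, 77, 1024)
pooled = crossattn_emb.max(dim=1).values          # (B, 1024)
delta = pooled_text_proj(pooled)                  # all zeros before training
```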

Once trained, inference-time modulation guidance (Eq. 3 from the paper) steers quality through AdaLN coefficients — orthogonal to CFG, composable with LoRA/T-LoRA, negligible latency cost.
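Eq. 3 itself isn't reproduced in this thread; assuming it is the standard CFG-style extrapolation applied in the AdaLN input space, inference-time guidance would look roughly like:

```python
import numpy as np

def modulation_guidance(t_emb_uncond, t_emb_cond, w):
    """CFG-style extrapolation on the AdaLN input (a sketch; the exact form of
    the paper's Eq. 3 is assumed, not quoted in this thread). t_emb_cond
    includes the pooled-text projection; t_emb_uncond does not."""
    return t_emb_uncond + w * (t_emb_cond - t_emb_uncond)

t_u = np.zeros(1024)
t_c = np.ones(1024)
guided = modulation_guidance(t_u, t_c, 3.0)   # steers AdaLN coefficients only
```

Because this acts on the modulation path rather than the predicted noise, it composes with ordinary CFG, which is what makes the two guidance signals stack.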

Compatibility

| Feature | Interaction |
| --- | --- |
| T-LoRA | Orthogonal (different parameter spaces) |
| CFG | Complementary (noise space vs. AdaLN space; they stack) |
| HydraLoRA | Shared pooling (crossattn_emb.max(dim=1).values) |
| Spectrum | Compatible (guidance applies to emb_B_T_D before blocks) |

@nagarago Yeah, I have read that implementation, but I wanted to verify whether this would actually work and how to implement it in detail. The experiments above are, in a sense, the grounding for the design decisions I made.

@nagarago For example, I found that max pooling can be more helpful than the conventional EOS pooling that comfy's CLIP uses, and since I don't know how the quality (masterpiece, score_9, ...) and resolution (highres, absurdres, ...) tags were trained into crossattn_emb, I tried several guidance prompt variants.

@sorryhyun It seems that the default config of your Mod Guidance is quite different from Anzhc's. The outputs of "KSampler (Spectrum + Mod Guidance)" and "Anima Mod Guidance + KSampler (Spectrum)" are significantly different, and some schedulers like bong_tangent produce color drift.

@ArranEye Yeah, those will be quite different, since the default config was tuned to my personal preference; sorry for the bit of messiness.
The outputs should differ: I use a different mod guidance weight (trained personally), and Spectrum has also been adjusted for best quality.
And yeah, I agree that color drift will happen with some schedulers; I haven't tested anything except the simple scheduler (with the er_sde method).
