What does the training data look like for this?
Any information you can share in that regard? I'm keen to try and replicate this if I can...
SDA does not rely on massive external data. For v1.0 we used a curated dataset of approximately 1k high-aesthetic image-text pairs as "semantic anchors."
Scale and Focus: Unlike pre-training from scratch, the goal of SDA is not to teach the model "new knowledge" but to realign its "behavioral patterns" using the Teacher model as a guide. Therefore, the breadth of semantic coverage is far more critical than raw data volume.
Semantic Scope: The training set covers a wide range of subjects, from human poses and natural landscapes to architectural compositions and fantasy art. This broad coverage ensures that the SDA LoRA's "velocity field rotation" is generalized across all feature domains, rather than being confined to a specific style.
BUT! SDA might not require any external pixel information at all.
Diversity as a Mathematical Mapping: Diversity collapse is essentially a mathematical mapping failure: the model projects different noise seeds onto the same output, so the Jacobian of the noise-to-image map becomes rank-deficient. Restoring diversity is therefore a topological correction in the latent space.
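A toy NumPy sketch of that intuition (this is an illustration, not SDA code): treat the generator as a linear map from noise to output, so its Jacobian is just the matrix itself. A healthy map is full-rank and keeps distinct seeds distinct; a collapsed one is rank-deficient and projects every seed onto the same output direction.

```python
import numpy as np

# Toy illustration of diversity collapse as a mapping failure.
# Linear map => Jacobian == the matrix itself.
rng = np.random.default_rng(0)

W_healthy = rng.normal(size=(4, 4))          # generic full-rank map
W_collapsed = np.outer(rng.normal(size=4),   # rank-1 map: every input
                       rng.normal(size=4))   # lands on one output direction

print(np.linalg.matrix_rank(W_healthy))      # 4: seeds stay distinct
print(np.linalg.matrix_rank(W_collapsed))    # 1: all seeds collapse onto a line

# Under the collapsed map, outputs for two different seeds are parallel:
z1, z2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = W_collapsed @ z1, W_collapsed @ z2
cos = y1 @ y2 / (np.linalg.norm(y1) * np.linalg.norm(y2))
print(abs(cos))  # ~1.0 up to floating-point error
```

"Restoring diversity" then means pushing the map's Jacobian back toward full rank, which is why the fix is structural rather than a matter of feeding in more pixels.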
Extracting Logic from the Teacher’s "Brain": In the SDA framework, the actual "diversity signal" is derived from the real-time alignment with the Teacher model’s noise-response patterns.
Image-Free Potential: Theoretically, we could feed the model purely random prompts and random noise. The necessary $x_0$ (clean target features) can be generated on-the-fly by the Teacher model (the non-distilled Z-Image) as synthetic latents.
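The image-free loop described above can be sketched roughly as follows. All names here (`image_free_step`, the module interfaces, the MSE alignment objective) are hypothetical placeholders, not the released SDA code; the point is only the data flow: random noise and arbitrary prompts in, the frozen Teacher synthesizing $x_0$ on the fly, and the student aligned to the Teacher's noise-response pattern.

```python
import torch

def image_free_step(student, teacher, encode_prompt, optimizer,
                    batch=4, latent_shape=(4, 64, 64)):
    """One hypothetical image-free training step: no dataset is loaded."""
    noise = torch.randn(batch, *latent_shape)         # random seeds
    cond = encode_prompt(["random prompt"] * batch)   # arbitrary prompts

    with torch.no_grad():                             # Teacher stays frozen
        x0_synthetic = teacher(noise, cond)           # on-the-fly clean target

    pred = student(noise, cond)                       # student's response
    # Align the student's noise->output mapping with the Teacher's
    # (a plain MSE stand-in for whatever alignment loss SDA actually uses):
    loss = torch.nn.functional.mse_loss(pred, x0_synthetic)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the only "supervision" is the Teacher itself, which is what makes the scheme image-free in principle.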
Logic over Pixels: Diversity recovery comes from "extracting the logic" of the Teacher, not from the pixel information of a dataset. This "Image-Free" characteristic proves that SDA repairs the model at the logical/dynamics level rather than just "patching" pixels. This is precisely why such a small parameter shift can trigger such massive compositional changes.