
regarding Anima training

#108
by siegmundwulfe - opened

Hello! Let me start by saying that I am really enjoying Anima so far and I see a lot of potential in the model.

It led me to start evaluating Anima as a potential base for finetuning (baking in original characters/concepts/styles for a visual novel project I'm currently working on) and came across some "community research" that raises concerns about the training methodology. I can't independently verify the analysis, as I'm not really knowledgeable in the nitty gritty tech parts, but it tracks with what I've experienced practically.

Essentially, I came across a log comparing weight diffs between Cosmos2, Preview 1, and Preview 2. In it, the LLM Adapter shows disproportionately small changes between previews relative to the DiT — yet removing the adapter causes massive knowledge regression. The implication, as far as I understand, is that the adapter absorbed most of the artist/style/character knowledge during joint training, rather than that knowledge living in the DiT where it arguably should.
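The kind of per-module comparison that log described could be sketched roughly like this. Everything here is illustrative: the parameter-name prefixes `llm_adapter` / `dit` are made up, and toy tensors stand in for real checkpoint state dicts.

```python
# Sketch: compare two checkpoints module-group by module-group and report
# the mean relative L2 change. Prefixes and tensors are hypothetical.
import torch

def relative_change(sd_a, sd_b, prefix):
    """Mean relative L2 change over parameters whose name starts with prefix."""
    ratios = []
    for name, a in sd_a.items():
        if name.startswith(prefix) and name in sd_b:
            a = a.float()
            b = sd_b[name].float()
            denom = a.norm().item()
            if denom > 0:
                ratios.append((b - a).norm().item() / denom)
    return sum(ratios) / len(ratios) if ratios else 0.0

# Toy state dicts standing in for two preview checkpoints: the "adapter"
# barely moves while the "dit" moves a lot.
torch.manual_seed(0)
p1 = {"llm_adapter.proj.weight": torch.randn(8, 8),
      "dit.block0.attn.weight": torch.randn(8, 8)}
p2 = {k: (v + 0.01 * torch.randn_like(v) if k.startswith("llm_adapter")
      else v + 0.3 * torch.randn_like(v)) for k, v in p1.items()}

print(f"adapter change: {relative_change(p1, p2, 'llm_adapter'):.3f}")
print(f"dit change:     {relative_change(p1, p2, 'dit'):.3f}")
# adapter change comes out roughly an order of magnitude smaller here
```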

Practically speaking, this seems consistent with the severe forgetting I've experienced and seen people report during LoRA/finetune training — the DiT apparently holds very little of the model's learned knowledge, so training it disrupts things quickly.

Has there been a training phase with the adapter frozen to let the DiT learn independently? And are there plans to address the knowledge distribution between adapter and DiT in future releases?
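For clarity, by "adapter frozen" I mean roughly this kind of setup. The module names (`llm_adapter`, `dit`) are made up, and a toy two-layer model stands in for Anima:

```python
# Minimal sketch of freezing the adapter so only the DiT receives gradients.
# Attribute names are hypothetical, not Anima's actual module layout.
import torch.nn as nn

class ToyAnima(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm_adapter = nn.Linear(16, 16)  # stands in for the LLM adapter
        self.dit = nn.Linear(16, 16)          # stands in for the DiT

def freeze_adapter(model):
    # Turn off gradients for every adapter parameter...
    for p in model.llm_adapter.parameters():
        p.requires_grad_(False)
    # ...and hand the optimizer only what is still trainable (the DiT).
    return [p for p in model.parameters() if p.requires_grad]

model = ToyAnima()
trainable = freeze_adapter(model)
print(len(trainable))  # → 2 (the DiT's weight and bias)
```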

Anima is genuinely promising — the NL comprehension and multi-character composition are ahead of anything else in the local anime space. But for someone hoping to use it as a finetuning base, the current architecture raises real questions.

Either way - all the best with the project and I'll be keeping a close eye on the progress.

I agree. Specifically, over 95% of Anima’s knowledge of artist tags seems to reside within the LLM adapter. Furthermore, it has led to an issue where the '@' prefix inadvertently triggers 'name' watermarks, such as specific artist names, usernames, or signatures.
That said, as of now, Anima’s DiT holds incredible potential as the most 'cost-effective' alternative to the U-Net architecture since the release of SDXL in 2023. Honestly, it’s mind-blowing that a mere 2B model can achieve such precise spatial context separation through natural language prompts—something that was practically impossible with previous local models. My thanks to the circlestone-labs team for their hard work; I’m rooting for this project's success.

If your intention is to train for your own visual novel, then you don't need to rely on character X's details staying as consistent as they were before your tune, since in your case all you should care about is that your own content is consistent, which it will be.

While the forgetting is bad, in my experience it is pretty easy to combat. Just add some extra unrelated training data. From my testing, you don't even seem to need that much of it; I previously assumed 1:1, but it seems 3.5:1 might also be good. The only issue is that getting good natural-language data might be harder. I don't like the captions produced by the VLMs I could run, like Qwen 3.5 35B A3 Q4 or 9B Q4. Manually captioning a bunch of images containing text did help keep the first preview's ability to do text, though.
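The mixing ratio above could be implemented as a simple sampler that draws from the unrelated "regularization" pool with the corresponding probability. This is purely illustrative; the data here is just labeled strings.

```python
# Sketch: interleave primary and unrelated samples at a given ratio,
# e.g. 3.5 primary : 1 unrelated as described above.
import random

def mixed_samples(primary, unrelated, ratio=3.5, seed=0):
    """Yield samples; ~1/(1+ratio) of draws come from `unrelated`."""
    rng = random.Random(seed)
    p_unrelated = 1.0 / (1.0 + ratio)
    while True:
        pool = unrelated if rng.random() < p_unrelated else primary
        yield rng.choice(pool)

gen = mixed_samples(["char"] * 10, ["reg"] * 10)
draws = [next(gen) for _ in range(10000)]
print(draws.count("reg") / len(draws))  # ≈ 0.22 (= 1 / 4.5)
```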

Doing some validation, and it's funny: while I wanted to see whether my LoRAs get measurably better or worse with more training, the only thing I saw for certain is that the forgetting is still incredibly serious and gets much worse the longer you train (which is also pretty self-evident just from looking at what the LoRAs produce). But just sprinkling in 20-25% extra unrelated data was enough to get the forgetting under control, and as a bonus it helped make this mostly-anime-screenshots LoRA work better with artist styles.

(attached plot: ranni_0.4_plot)

Evidently the forgetting happens as early as ~600 steps of training. This is with [0.4, 0.0] sigmas, on one image unrelated to the character LoRA being trained, with 8 different noises per point.
Even though the scale is small, what this shows is pretty consistent with what I see when I actually use the LoRAs and attempt various art styles.
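The measurement described above could be sketched like this: evaluate a denoising loss at a fixed sigma on one held-out image, averaged over several noise draws. The loss form (a simple flow-matching velocity target with linear interpolation) and the model interface are assumptions, not Anima's actual training objective.

```python
# Sketch: validation loss at a fixed sigma, averaged over n noise draws.
# Objective and model signature are assumed for illustration.
import torch

def val_loss_at_sigma(model, x0, sigma, n_noises=8, seed=0):
    g = torch.Generator().manual_seed(seed)
    losses = []
    for _ in range(n_noises):
        noise = torch.randn(x0.shape, generator=g)
        xt = (1 - sigma) * x0 + sigma * noise  # interpolate toward noise
        target = noise - x0                    # flow-matching velocity
        pred = model(xt, sigma)
        losses.append(torch.mean((pred - target) ** 2).item())
    return sum(losses) / len(losses)

# Toy check: with x0 = 0, a "model" that rescales xt recovers the noise
# exactly, so the loss should be ~0.
x0 = torch.zeros(1, 4, 8, 8)
model = lambda xt, sigma: xt / sigma
print(val_loss_at_sigma(model, x0, sigma=0.4))
```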
