Support for FL AceStep Training node

#2
by PixelPlayer - opened

Hello! I installed your DFloat11-Extended node. It’s a great node, but unfortunately it doesn’t work for training a LoRA in the FL AceStep Training node. When I load the acestep_v1.5_turbo-DF11.safetensors model with your DFloat11 loader, nothing happens at all. As soon as I switch to the standard model in the regular loader, everything loads into video memory immediately and training starts (though very slowly). I wanted to reduce video memory usage for Ace Step LoRA training, so this is a real bummer. For regular generation with KSampler, everything works as it should once your node is connected. Why does it stop without giving any error in the FL AceStep Training node? In theory, everything should work.

upd: The following message appears in the console:
"AttributeError: 'Linear' object has no attribute 'weight'
Prompt executed in 0.05 seconds"

Well, this is because of the way DF11 models work. The weight attributes, which hold the uncompressed BF16 weights, are deleted and do not exist until a given layer's forward() method is called (i.e. the model is being used "properly"). At that point the BF16 weights are reconstructed on the fly from the compressed DF11 weights.
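To make the failure mode concrete, here is a toy sketch in plain Python. It is not the DFloat11 code: the class `CompressedLinear`, the `_packed` field, and the use of zlib as the "compression" are all made up for illustration. The only point is that the layer never stores a `weight` attribute, and the weights exist only while forward() is running.

```python
import struct
import zlib


class CompressedLinear:
    """Toy stand-in (hypothetical) for a layer that keeps only compressed weights."""

    def __init__(self, weight_rows):
        self.out_features = len(weight_rows)
        self.in_features = len(weight_rows[0])
        flat = [w for row in weight_rows for w in row]
        # Keep only the compressed bytes. No `self.weight` is ever stored.
        self._packed = zlib.compress(struct.pack(f"{len(flat)}f", *flat))

    def _decompress(self):
        # Rebuild the full weight matrix from the compressed bytes.
        count = self.out_features * self.in_features
        flat = struct.unpack(f"{count}f", zlib.decompress(self._packed))
        n = self.in_features
        return [list(flat[i * n:(i + 1) * n]) for i in range(self.out_features)]

    def forward(self, x):
        # The weights exist only as a local variable for the duration of this call.
        weight = self._decompress()
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]
```

With this toy class, calling `layer.forward(...)` works, but any code that reads `layer.weight` directly raises an AttributeError, which is the same kind of error shown in the console message above.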

Anything that attempts to access the weight attributes directly will fail. This used to include LoRA support, but I managed to find a hack that makes LoRAs work on DF11 models (it makes inference significantly slower, but there is no practical way around that).

I am unsure whether it is even possible to find another workaround in DF11 to support training LoRAs. I actually have very little knowledge of AI/ML; I am just a technically savvy person who bangs my head against problems until I hopefully find a solution. I searched for the training node you were referring to, and I believe it is this node, but it would be good if you could confirm that this is indeed the one you were using.

I need quite a bit of time to study the training code before I can conclude whether support for training LoRAs is possible, and more time still to implement it if it is. My instinct tells me that it should be possible in theory, since the main model weights are frozen and do not need to be trained, but I lack the knowledge to confirm this.
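To show what I mean by that instinct, here is a rough sketch in plain Python. It is not taken from the training node or from my loader; `lora_forward`, `base_forward`, `A`, `B`, and `scale` are illustrative names. It uses the standard LoRA formulation, where the output is the frozen layer's output plus a small trainable low-rank correction, so the adapter itself only needs the base layer's forward() and never has to read its weight attribute.

```python
def lora_forward(base_forward, A, B, x, scale=1.0):
    """Return the frozen base output plus a low-rank LoRA correction.

    base_forward: callable for the frozen layer (hypothetical; treated as a black box)
    A: r x in_features matrix (trainable), as a list of rows
    B: out_features x r matrix (trainable), as a list of rows
    """
    base_out = base_forward(x)  # frozen path, weights never touched here
    hidden = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # A @ x
    delta = [sum(b * h for b, h in zip(row, hidden)) for row in B]  # B @ (A @ x)
    return [o + scale * d for o, d in zip(base_out, delta)]
```

One caveat I am sure of from the math: during training, gradients still have to flow backward through the frozen layers to reach adapters earlier in the model, and that step needs the base weights again. So this sketch only shows why the frozen weights do not need updating, not that training would work without further changes.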

I noticed that in another thread you requested LoRA support for Ace-Step1.5; I should be able to add that very quickly.

Thank you for such a detailed response. Yes, that’s correct: it’s the node by filliptm, and you provided the right link. I noticed one thing. After loading the full model with the standard loader and starting the training, I stopped the process and switched to your DF11 loader, and it suddenly loaded and training started. Then I stopped it and unloaded all the models to free up memory, and I haven’t been able to reproduce it since. It feels like some kind of glitch is getting in the way.

That's funny. After updating your node, just as a test, I put the LoRA loader after the DFloat11 Model Loader, and it worked. Then I disabled the LoRA loader, and it still works!!! Training is working. Wow!

upd: After restarting ComfyUI, it stopped working again. This time I didn't load the standard model; I only loaded your DFloat11 model through a KSampler workflow and then switched to the FL AceStep Training workflow. This must be some kind of glitch.
