This model cannot actually be run!
I was very excited to try this. I spent quite a while trying to work around the quirks of the size-64 layers, using torch_awq, gptqmodel, AutoAWQ, and every variation of support libraries I could think of, on both Win11 and Ubuntu, with zero luck. Every single attempt ran up against issues that prevented it from running, and the oddly sized layers (not divisible by 128) are my best guess as to what the problem is. VibeVoice is an odd beast, for a language model, it seems.
So, um... any hints? How have folks actually managed to load this thing?
From everything I can tell, this model CANNOT be run.
The audio projection and encoder layers were incorrectly quantized from FP16, causing a shape mismatch. VibeVoice uses 64-feature layers, but the quantized weights here need dimensions that are multiples of 128 for memory alignment. Those layers are relatively small and should simply have been excluded from quantization... the kernel literally cannot read the ones in this quantized model, and it ALWAYS crashes.
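To make the constraint concrete: group-wise quantization schemes like AWQ/GPTQ pack weights in groups along the input dimension, so a layer's `in_features` has to be divisible by the group size or the packed tensor can't be laid out. Here's a minimal sketch (names and shapes are hypothetical, not taken from the actual checkpoint) of how you'd scan a model's layer shapes for offenders that should be left in FP16:

```python
# Sketch: find layers whose in_features isn't divisible by the quantization
# group size. Such layers should be excluded from quantization (kept FP16),
# since group-wise packing can't represent them.
def find_unquantizable(layer_shapes, group_size=128):
    """layer_shapes: {layer_name: (out_features, in_features)}"""
    return {name: shape for name, shape in layer_shapes.items()
            if shape[1] % group_size != 0}

# Hypothetical example: a 64-wide audio projection fails the check,
# while an ordinary transformer MLP layer passes.
layers = {
    "audio_proj": (64, 64),            # 64 % 128 != 0 -> must stay FP16
    "decoder.layer.0.mlp": (4096, 11008),  # 11008 % 128 == 0 -> quantizable
}
print(find_unquantizable(layers))  # → {'audio_proj': (64, 64)}
```

In practice the fix at quantization time is usually a skip-list (e.g. `modules_to_not_convert` in the quantization config) naming those small audio layers so they never get packed in the first place.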
The model is not usable. I'm sorry to be the bearer of bad news here; I was really hoping someone would've contradicted me 🤷
Hm, it runs on my machine. Then again, I did make some extra changes to vLLM since I posted this.
Did you also apply the vLLM patches?