Which version of flash-attn?
#154
by bhomass1 - opened
I am running this model on an RTX 6000. It only runs when using ExLlamaV2Cache_Q4; ExLlamaV2Cache_8bit crashes with a CUDA OutOfMemoryError.
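For reference, this is roughly how I'm loading the model and cache (the model path is a placeholder; this follows the standard exllamav2 autosplit pattern):

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,    # this one works
    ExLlamaV2Cache_8bit,  # this one OOMs for me
)

config = ExLlamaV2Config("/path/to/model")  # placeholder path
model = ExLlamaV2(config)

# Lazy cache + load_autosplit spreads the weights across available VRAM.
# Swapping ExLlamaV2Cache_Q4 for ExLlamaV2Cache_8bit here is what crashes.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)
```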
I was told flash-attn can make ExLlamaV2Cache_8bit work, but I have tried multiple versions of flash-attn and they all failed with IndexError: too many indices for tensor of dimension 1.
Has anyone found a version of flash-attn that works with this model?
My environment:
- torch 2.10+cu128
- CUDA 12.4
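In case it helps with reproducing, here is a small snippet to print the exact versions in play (assuming the packages expose the usual `__version__` attributes):

```python
import torch
import flash_attn
import exllamav2

# Print exact versions so mismatches (e.g. the CUDA build torch was
# compiled against vs. the system CUDA toolkit) are easy to spot.
print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("exllamav2:", exllamav2.__version__)
```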