Which version of flash-attn?
#154
by bhomass1 - opened
I am running this model on an RTX 6000. It only runs when using ExLlamaV2Cache_Q4; ExLlamaV2Cache_8bit crashes with a CUDA OutOfMemoryError.
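For reference, this is roughly how I'm loading the model and cache (the model path is a placeholder; this follows the standard exllamav2 autosplit pattern):

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,    # this one works
    ExLlamaV2Cache_8bit,  # this one OOMs for me
)

config = ExLlamaV2Config("/path/to/model")  # placeholder path
model = ExLlamaV2(config)

# Lazy cache + load_autosplit spreads the weights across available VRAM.
# Swapping ExLlamaV2Cache_Q4 for ExLlamaV2Cache_8bit here is what crashes.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)
```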
I was told flash-attn can make ExLlamaV2Cache_8bit work, but I have tried multiple versions of flash-attn and they all failed with IndexError: too many indices for tensor of dimension 1.
Has anyone found a version of flash-attn that works with this model?
My environment:
- torch 2.10+cu128
- CUDA 12.4
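In case it helps with reproducing, here is a small snippet to print the exact versions in play (assuming the packages expose the usual `__version__` attributes):

```python
import torch
import flash_attn
import exllamav2

# Print exact versions so mismatches (e.g. the CUDA build torch was
# compiled against vs. the system CUDA toolkit) are easy to spot.
print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("exllamav2:", exllamav2.__version__)
```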