Is there a best way to infer this model from multiple small memory GPUs?
#39
by hongdouzi - opened
I have four RTX 3090s with 96GB of VRAM in total. Which framework should I use to run inference on this model most efficiently?
vLLM or Aphrodite Engine, with the model loaded in 4-bit and a 64k context window.
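A quick sanity check that a 4-bit setup fits in 96GB. This is a back-of-envelope sketch, not a measurement: the 70B parameter count and the grouped-query-attention shape below are hypothetical placeholders (the thread does not name the model's size), and real frameworks add overhead for activations and CUDA buffers on top of this.

```python
# Rough VRAM budget for 4-bit inference across 4x RTX 3090 (96 GB total).
# Model shape is a hypothetical Llama-style 70B with GQA, not the actual model.
GPU_VRAM_GB = 4 * 24

params = 70e9      # hypothetical parameter count
n_layers = 80      # hypothetical depth
n_kv_heads = 8     # hypothetical GQA key/value heads
head_dim = 128     # hypothetical head dimension

# Weights at 4-bit quantization: ~0.5 bytes per parameter.
weights_gb = params * 0.5 / 1e9

# KV cache in fp16: 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
ctx = 64 * 1024
kv_cache_gb = kv_bytes_per_token * ctx / 1e9

total_gb = weights_gb + kv_cache_gb
print(f"weights ~{weights_gb:.0f} GB, 64k KV cache ~{kv_cache_gb:.1f} GB, "
      f"total ~{total_gb:.0f} GB of {GPU_VRAM_GB} GB")
```

Under these assumptions the weights take roughly 35 GB and the 64k KV cache roughly 21 GB, comfortably inside 96 GB, which is why 4-bit loading with tensor parallelism across the four cards is the usual recommendation.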
alexrs changed discussion status to closed