Almost Impossible to run

#13

by LLaMA-lover - opened 14 days ago

I spent around 6 hours trying to get the vLLM fork to compile, then the following hours trying to quantize the model so it could run with custom inference patches. It still ran very slowly and failed to produce any output. This isn't an issue specific to my setup: there is no clear way to quantize the model and there are no prebuilt binaries to run it. I had to quantize it manually because I couldn't get the custom transformers fork to work. I also tried NOT quantizing the model and just running it on CPU, but that failed too. The only special part about this model is that it's not special: its only quality feature relies on "Markovian RSA", which is compatible with ANY model. I'd rather do 10e(10^5) push-ups than get this model to actually work.

ganeshnanduru

Zyphra org 12 days ago

The vLLM build can be tricky. I don't know your setup, but it can help to make sure you have the right build tools installed (e.g. ninja) and the relevant environment variables set (MAX_JOBS, CMAKE_BUILD_PARALLEL_LEVEL, etc.). We also have an experimental llama.cpp setup if vLLM is not an option. We do not have quantization support at the moment. Markovian RSA can be implemented at test time with any model, but ZAYA1-8B is unique in having this method integrated into its training.
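For anyone checking these prerequisites before a source build, here is a rough sketch in Python. It only looks for the build tools on PATH and prepares an environment with MAX_JOBS and CMAKE_BUILD_PARALLEL_LEVEL set. The helper names `build_env` and `missing_tools`, the inclusion of cmake in the tool list, and the default of half the CPU cores are assumptions for illustration, not something the fork prescribes. Lower the job count if the build runs out of memory.

```python
import os
import shutil


def build_env(max_jobs=None, base_env=None):
    """Return a copy of the environment with parallel-build variables set.

    Hypothetical helper: the default of half the CPU cores is only a
    conservative guess, since each compile job can use a lot of RAM.
    """
    env = dict(os.environ if base_env is None else base_env)
    if max_jobs is None:
        max_jobs = max(1, (os.cpu_count() or 2) // 2)
    env["MAX_JOBS"] = str(max_jobs)
    env["CMAKE_BUILD_PARALLEL_LEVEL"] = str(max_jobs)
    return env


def missing_tools(tools=("ninja", "cmake")):
    """Return the build tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    env = build_env()
    print("MAX_JOBS =", env["MAX_JOBS"])
    print("CMAKE_BUILD_PARALLEL_LEVEL =", env["CMAKE_BUILD_PARALLEL_LEVEL"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching whatever build command the fork's instructions give.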

LLaMA-lover

12 days ago

How about providing prebuilt binaries for vLLM? That would be far easier than compiling from scratch.

ganeshnanduru

Zyphra org 8 days ago

I will upload some wheels soon. What are your CPU architecture, OS, and Python version, plus your CUDA/ROCm version if applicable? I will see if I can include a wheel for your environment @LLaMA-lover
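One way to gather those details is a short Python script like the sketch below. The function name `describe_environment` is made up for illustration, and the CUDA/ROCm lookup assumes PyTorch is installed. If it is not, those fields stay `None` and the versions would have to be read from the driver tools instead.

```python
import platform


def describe_environment():
    """Collect the details a wheel build is usually matched against."""
    info = {
        "arch": platform.machine(),
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "cuda": None,
        "rocm": None,
    }
    try:
        # Assumption: PyTorch is the easiest place to read the GPU toolkit
        # version from. A missing or broken install leaves both as None.
        import torch

        info["cuda"] = getattr(torch.version, "cuda", None)
        info["rocm"] = getattr(torch.version, "hip", None)
    except Exception:
        pass
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```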
