Almost Impossible to run
I spent around 6 hours trying to get the vLLM fork to compile, then more hours trying to quantize the model so it could run with the custom inference patches. It still ran very slowly and failed to produce any output. This isn't just an issue on my side: there is no clear way to quantize the model and no prebuilt BINARIES to run it. I had to quantize it manually because I couldn't get the custom transformer fork to work. I also tried NOT quantizing the model and just running it on CPU, but that failed too. The only special thing about this model is that it's not special: its only quality feature relies on "Markovian RSA", which is compatible with ANY model. I'd rather do 10e(10^5) pushups than get this model to actually work.
The vLLM build can be tricky. I do not know your setup, but it can help to make sure you have the right build tools installed (e.g. ninja) and the relevant environment variables set (MAX_JOBS, CMAKE_BUILD_PARALLEL_LEVEL, etc.). We also have an experimental llama.cpp setup if vLLM is not an option. We do not have quantization support at the moment. Markovian RSA can be implemented at test time with any model, but ZAYA1-8B uniquely has this method integrated into its training.
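For anyone who wants to check this before starting a long build, here is a minimal sketch. The helper names (`missing_tools`, `build_env`), the tool list, and the job count of 4 are illustrative assumptions, not requirements of the vLLM fork.

```python
import os
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_env(jobs):
    """Copy the current environment and cap build parallelism at `jobs`."""
    env = dict(os.environ)
    env["MAX_JOBS"] = str(jobs)
    env["CMAKE_BUILD_PARALLEL_LEVEL"] = str(jobs)
    return env


if __name__ == "__main__":
    # Assumed tool list; adjust it to whatever your build actually needs.
    absent = missing_tools(["ninja", "cmake"])
    if absent:
        print("Missing build tools:", ", ".join(absent))
    else:
        print("Build tools found.")

    # Assumed job count; a lower value may help if the build runs out of memory.
    env = build_env(4)
    print("MAX_JOBS =", env["MAX_JOBS"])
    print("CMAKE_BUILD_PARALLEL_LEVEL =", env["CMAKE_BUILD_PARALLEL_LEVEL"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching whichever build command you use, so the settings apply only to that build.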
How about providing prebuilt binaries for vLLM? It would be much easier than compiling from the ground up.
I will upload some wheels soon. What are your CPU architecture, OS, Python version, and (if applicable) CUDA/ROCm version? I will see if I can include a wheel for your environment @LLaMA-lover
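If it helps, the details asked for above can be gathered with a short script like the sketch below. The function name `collect_env` is made up for illustration, and the CUDA/ROCm fields are read from PyTorch only if it happens to be installed; otherwise they stay `None`.

```python
import platform


def collect_env():
    """Gather the details needed to pick a matching wheel."""
    info = {
        "arch": platform.machine(),
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "cuda": None,
        "rocm": None,
    }
    try:
        import torch  # optional; only used to read the toolkit versions
    except Exception:
        return info
    # torch.version.cuda is set on CUDA builds, torch.version.hip on ROCm builds.
    info["cuda"] = getattr(torch.version, "cuda", None)
    info["rocm"] = getattr(torch.version, "hip", None)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a reply should be enough to identify which wheel matches your environment.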