Thank you, finally a quant for vLLM for this model!
Are you planning to do the same for the reranker?
I had a go at that. The embedding model quantized cleanly with llm-compressor (GPTQ W8A8), but the reranker has issues: the quantization completes, yet vLLM serves broken scores (always 1.0). llm-compressor closed their issue (https://github.com/vllm-project/llm-compressor/issues/2211) since the quantized model works in transformers, so the problem is on the vLLM serving side. I can give it another shot once vLLM figures it out (I'm not sure they're even aware of the issue).
I deleted my upload because I didn't want people installing it and spinning their wheels. Once vLLM fixes the issue, I should be able to re-upload without quantizing again.
The whole motivation was to run both the embedding model and the reranker on a single 3090.
That is very revealing indeed, and you figured it out. I wanted to do the same: fit both on the 3090 without downgrading to the 2B variant. I might use the 2B for the reranker until the 8B becomes feasible. Thank you for the answer and the upload! More power to you!