Thank you, finally a quant for vLLM for this model!
Are you planning to do the same for the reranker?
I had a go at that. The embedding model quantized cleanly with llm-compressor (GPTQ W8A8), but the reranker has issues: the quantization completes, yet vLLM serves broken scores (always 1.0). llm-compressor closed their issue (https://github.com/vllm-project/llm-compressor/issues/2211) since the quantized model works in transformers, so the problem is on the vLLM serving side. I can give it another shot once vLLM figures it out (I'm not sure they're even aware of the issue).
I deleted my upload because I didn't want people installing it and spinning their wheels. Once vLLM fixes the issue, I should be able to re-upload without quantizing again.
The whole motivation was to run both the embedding model and the reranker on a single 3090.
That is very revealing indeed, and you figured it out. I wanted to do the same: fit both on the 3090 without downgrading to the 2B variant. I might use the 2B for the reranker until the 8B becomes feasible. Thank you for the answer and the upload! More power to you!