serve with vllm seems broken?

#1
by firow2 - opened

(EngineCore_DP0 pid=99) WARNING 01-22 18:28:34 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod.

ValueError: To serve at least one request with the models's max seq len (202752), (181.76 GiB KV cache is needed, which is larger than the available KV cache memory (7.77 GiB). Based on the available memory, the estimated maximum model length is 8656. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

So it's not actually real 8-bit quantized? I used the nightly vLLM Docker image.
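For what it's worth, the KV cache requirement in that error scales roughly linearly with the sequence length, so you can sanity-check the "estimated maximum model length" vLLM reports from the numbers in the log (the small gap is just rounding in the printed GiB figures):

```python
# Sanity-check the vLLM error: KV cache grows linearly with
# max_model_len, so the servable length scales with available memory.
max_seq_len = 202_752   # model's max seq len from the log
needed_gib = 181.76     # KV cache needed at that full length
available_gib = 7.77    # KV cache memory actually available

# vLLM's log reports 8656; the linear estimate lands within rounding.
estimated_max_len = int(max_seq_len * available_gib / needed_gib)
print(estimated_max_len)  # → 8667
```

So the error itself is about KV cache budget, not (necessarily) the weights failing to load quantized. A workaround, assuming you don't need the full 202k context, is to cap the length or raise the memory fraction via vLLM's standard flags, e.g. `vllm serve <model> --max-model-len 8192` or `--gpu-memory-utilization 0.95`.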

firow2 changed discussion status to closed
