serve with vllm seems broken?

#1
by firow2 - opened

(EngineCore_DP0 pid=99) WARNING 01-22 18:28:34 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod.

ValueError: To serve at least one request with the models's max seq len (202752), (181.76 GiB KV cache is needed, which is larger than the available KV cache memory (7.77 GiB). Based on the available memory, the estimated maximum model length is 8656. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

So it's not actually real 8-bit quantized? I used the nightly vLLM Docker image.
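For what it's worth, the KV cache requirement in that error scales roughly linearly with the sequence length, so you can sanity-check the "estimated maximum model length" vLLM reports from the numbers in the log (the small gap is just rounding in the printed GiB figures):

```python
# Sanity-check the vLLM error: KV cache grows linearly with
# max_model_len, so the servable length scales with available memory.
max_seq_len = 202_752   # model's max seq len from the log
needed_gib = 181.76     # KV cache needed at that full length
available_gib = 7.77    # KV cache memory actually available

# vLLM's log reports 8656; the linear estimate lands within rounding.
estimated_max_len = int(max_seq_len * available_gib / needed_gib)
print(estimated_max_len)  # → 8667
```

So the error itself is about KV cache budget, not (necessarily) the weights failing to load quantized. A workaround, assuming you don't need the full 202k context, is to cap the length or raise the memory fraction via vLLM's standard flags, e.g. `vllm serve <model> --max-model-len 8192` or `--gpu-memory-utilization 0.95`.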

firow2 changed discussion status to closed
