Odd Value error on Ampere GPU - User-specified max_model_len (262144) is greater than the derived max_model_len

#5
by costelter - opened

As I was trying to give an old, currently unused NVIDIA A40 a new purpose, I tried this quant and got the following value error:

Value error, User-specified max_model_len (262144) is greater than the derived max_model_len (max_position_embeddings=40960.0 or model_max_length=None in model's config.json)

When running with --max-model-len set to auto, it does start:

(EngineCore pid=184) INFO 03-13 08:57:46 [kv_cache_utils.py:1321] Maximum concurrency for 40,960 tokens per request: 17.46x

I don't know where vLLM could possibly be getting this value. In the config.json:

"max_position_embeddings": 262144,

seems to be correct.

Using vLLM 0.17.1rc1.dev126+gbc2c0c86e (nightly container build). I will try to reproduce this on an L40 as soon as I can free one.

Has anyone had a similar experience? The latest vLLM builds seem to cause more trouble, at least in my experience.

Setting export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 should allow you to set it correctly with the --max-model-len parameter.
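A minimal launch sketch along those lines (the model ID is a placeholder; substitute the quant you are serving). Note that VLLM_ALLOW_LONG_MAX_MODEL_LEN only bypasses the validation check, so if 40960 really is the limit the model was trained for, requests beyond it may still produce degraded output:

```shell
# Bypass the max_model_len validation check (use with care).
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

# Placeholder model ID; replace with the actual quantized model.
vllm serve <your-model-id> \
    --max-model-len 262144
```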
