Odd Value error on Ampere GPU - User-specified max_model_len (262144) is greater than the derived max_model_len
#5
by costelter - opened
While trying to give an old, currently unused NVIDIA A40 a new purpose, I tried this quant and got the following value error:
Value error, User-specified max_model_len (262144) is greater than the derived max_model_len (max_position_embeddings=40960.0 or model_max_length=None in model's config.json)
When running with --max-model-len set to auto, it does start:
(EngineCore pid=184) INFO 03-13 08:57:46 [kv_cache_utils.py:1321] Maximum concurrency for 40,960 tokens per request: 17.46x
I don't know where vLLM could possibly be getting this value from. In the config.json:
"max_position_embeddings": 262144,
the value seems to be correct.
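For reference, a quick way to double-check what the checkpoint actually declares (this writes a minimal stand-in config.json with the values from this thread; point it at the real downloaded file instead):

```python
import json

# Minimal stand-in for the model's config.json, using the values
# reported in this thread (replace with the real checkpoint's file)
with open("config.json", "w") as f:
    json.dump({"max_position_embeddings": 262144, "model_max_length": None}, f)

with open("config.json") as f:
    cfg = json.load(f)

# vLLM derives its default max_model_len from fields like these
derived = cfg.get("model_max_length") or cfg.get("max_position_embeddings")
print(derived)  # 262144
```

If this prints 262144 for the real file but vLLM still derives 40960, the mismatch is on vLLM's side (e.g. rope_scaling handling), not in the checkpoint.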
Using vLLM 0.17.1rc1.dev126+gbc2c0c86e (nightly container build). I will try to reproduce this on an L40 as soon as I can free one.
Has anyone had similar experiences? The latest vLLM builds seem to cause more trouble, at least in my experience.
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 should allow you to set it correctly with the --max-model-len parameter.
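Concretely, something like this (model ID is a placeholder for whichever quant you are serving; note the env var only bypasses the check, so the KV cache must still fit at the requested length):

```shell
# Bypass vLLM's derived max_model_len check, then request the full context
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve <model-id> --max-model-len 262144
```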