AWQ 4-bit version of this Opus-Distilled-v2 model?

#5
by 0xburakcelik - opened

Hi,
Thank you for your excellent AWQ quantizations.
I'm using Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 (the v2 version with 14k Opus samples). It's currently the best reasoning model I have for coding and agent tasks - shorter CoT and better efficiency than the base Qwen3.5-27B.

However, I'm on a single RTX 5090 and really want to run it with vLLM + FlashInfer to get MTP, continuous batching and higher speed.
Would you consider making an AWQ 4-bit version of this Opus-Distilled-v2 model?
The distillation dataset is public, so the data is already available. Many users with 40/50-series cards are waiting for a good AWQ quant of this specific model.
Thanks in advance!

Best regards

QuantTrio org

let me see

QuantTrio org


Some of the quant repos here (mainly the Qwen3.5 AWQ series so far) use a data-free quantization technique, so the public dataset isn't strictly required.
We can give it a try.
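For intuition only: the storage format behind AWQ-style 4-bit quants is group-wise integer quantization of the weight matrices (the actual AWQ method additionally searches for activation-aware per-channel scales, which is omitted here). A minimal NumPy sketch, with all names illustrative and not taken from any real quantization toolkit:

```python
import numpy as np

def quantize_4bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` columns; every group
    gets its own scale and offset so an outlier in one group does not
    destroy precision elsewhere in the row.
    """
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0           # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    # Reconstruct the float weights from codes, scales and offsets.
    return (q.astype(np.float32) * scale + w_min).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale, zero = quantize_4bit_groupwise(w)
w_hat = dequantize(q, scale, zero)
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat))))
```

Because min/max of each group are used as the range, the per-element rounding error is bounded by half a quantization step, which is why no calibration data is needed for this part of the pipeline.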

I see the description mentions requiring CUDA 12.8.
I'm running vLLM in Docker with "vllm/vllm-openai:cu130-nightly".

QuantTrio/Qwen3.5-27B-AWQ works perfectly
but with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ I get:

(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.1rc1.dev227+gc133f3374
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]   █▄█▀ █     █     █     █  model   QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:233] non-default args: {'model_tag': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'trust_remote_code': True, 'max_model_len': 196608, 'served_model_name': ['Qwen3.5'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 03-30 10:50:29 [envs.py:1733] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:1678] Using max model len 196608
(APIServer pid=1) INFO 03-30 10:50:35 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:35 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:549] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 03-30 10:50:39 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:39 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:228] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:259] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-30 10:50:39 [vllm.py:786] Asynchronous scheduling is enabled.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1)     self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1)                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1)     tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1)     return cached_get_tokenizer(
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=1)     raise e
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1)     raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported
QuantTrio org


Could you try installing the official vLLM release in a clean venv/image? Your environment isn't recognizing the tokenizer class. This isn't a CUDA issue, but rather some higher-level vLLM issue.
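For reference, the ValueError in the log above is what transformers raises when the repo's tokenizer_config.json declares a tokenizer_class that the installed transformers build doesn't export. A small check along these lines (the helper name and path are made up, point it at your cached snapshot) can confirm whether the installed transformers is simply too old for the declared class:

```python
import json

import transformers

def check_tokenizer_class(cfg_path: str) -> tuple[str, bool]:
    """Read tokenizer_class from a tokenizer_config.json and report
    whether the installed transformers version actually provides it."""
    with open(cfg_path) as f:
        cls_name = json.load(f).get("tokenizer_class", "")
    return cls_name, hasattr(transformers, cls_name)

print("installed transformers:", transformers.__version__)
```

If the second value comes back False for the declared class, upgrading transformers (or using a vLLM build pinned to a newer transformers) is the likely fix.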

I get the same error as before

Created a new workspace:

uv init
uv add vllm

CUDA_VISIBLE_DEVICES=0,1 VLLM_ATTENTION_BACKEND=FLASH_ATTN uv run vllm serve \
  --model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
  --served-model-name Qwen3.5 \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-chunked-prefill \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --host 0.0.0.0 \
  --port 8000


...
(APIServer pid=990732)     return renderer_cls.from_config(config, tokenizer_kwargs)
(APIServer pid=990732)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 625, in from_config
(APIServer pid=990732)     cached_get_tokenizer(
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=990732)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=990732)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=990732)     raise e
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=990732)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=990732)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=990732)     raise ValueError(
(APIServer pid=990732) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
QuantTrio org
edited 21 days ago


Could you use pip install vllm==0.18.0 instead of uv and see if it works? I suspect uv add is just reusing an existing module; uv pip install may be more appropriate if you insist on using uv.
This repo is literally just a Qwen3.5 dense model in AWQ format, so your Python/vLLM environment should recognize it.
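One way to check whether the environment is reusing a stale install is to ask the interpreter where it actually resolves each package from; this sketch (the helper name is illustrative) runs in any environment:

```python
import importlib.util

def package_origin(pkg: str):
    """Return the file a top-level package resolves to, or None if absent.

    A path outside the venv you just created means the interpreter is
    picking up a different (possibly stale) install than the one you
    think you are testing."""
    spec = importlib.util.find_spec(pkg)
    return getattr(spec, "origin", None)

for pkg in ("vllm", "transformers"):
    print(pkg, "->", package_origin(pkg) or "not installed")
```

If the printed paths don't live under the freshly created venv, the clean-install test wasn't actually clean.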


Can you also quantize the 4B & 9B models? Thank you!


Did you ever resolve this? I'm getting the exact same thing :(
