AWQ 4-bit version of this Opus-Distilled-v2 model?

#5
by 0xburakcelik - opened

Hi,
Thank you for your excellent AWQ quantizations.
I'm using Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 (the v2 version with 14k Opus samples). It's currently the best reasoning model I have for coding and agent tasks - shorter CoT and better efficiency than the base Qwen3.5-27B.

However, I'm on a single RTX 5090 and really want to run it with vLLM + FlashInfer to get MTP, continuous batching and higher speed.
Would you consider making an AWQ 4-bit version of this Opus-Distilled-v2 model?
The distillation dataset is public, so the data is already available. Many users with 40/50-series cards are waiting for a good AWQ quant of this specific model.
Thanks in advance!

Best regards

QuantTrio org

let me see

QuantTrio org


Some of the quant repos here (mainly the Qwen3.5 AWQ series so far) use a data-free quantization technique, so the public dataset isn't strictly required.
We can give it a try.
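For intuition only: the storage format behind AWQ-style 4-bit quants is group-wise integer quantization of the weight matrices (the actual AWQ method additionally searches for activation-aware per-channel scales, which is omitted here). A minimal NumPy sketch, with all names illustrative and not taken from any real quantization toolkit:

```python
import numpy as np

def quantize_4bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` columns; every group
    gets its own scale and offset so an outlier in one group does not
    destroy precision elsewhere in the row.
    """
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0           # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    # Reconstruct the float weights from codes, scales and offsets.
    return (q.astype(np.float32) * scale + w_min).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale, zero = quantize_4bit_groupwise(w)
w_hat = dequantize(q, scale, zero)
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat))))
```

Because min/max of each group are used as the range, the per-element rounding error is bounded by half a quantization step, which is why no calibration data is needed for this part of the pipeline.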

I see the description mentions requiring CUDA 12.8.
I'm running vLLM in Docker with "vllm/vllm-openai:cu130-nightly".

QuantTrio/Qwen3.5-27B-AWQ works perfectly
but with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ I get:

(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.1rc1.dev227+gc133f3374
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]   █▄█▀ █     █     █     █  model   QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:233] non-default args: {'model_tag': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'trust_remote_code': True, 'max_model_len': 196608, 'served_model_name': ['Qwen3.5'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 03-30 10:50:29 [envs.py:1733] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:1678] Using max model len 196608
(APIServer pid=1) INFO 03-30 10:50:35 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:35 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:549] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 03-30 10:50:39 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:39 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:228] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:259] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-30 10:50:39 [vllm.py:786] Asynchronous scheduling is enabled.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1)     self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1)                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1)     tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1)     return cached_get_tokenizer(
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=1)     raise e
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1)     raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported
QuantTrio org


Could you try installing the official vLLM release in a clean venv/image? Your environment isn't recognizing the tokenizer class. This isn't a CUDA issue, but rather some higher-level vLLM issue.
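For reference, the ValueError in the log above is what transformers raises when the repo's tokenizer_config.json declares a tokenizer_class that the installed transformers build doesn't export. A small check along these lines (the helper name and path are made up, point it at your cached snapshot) can confirm whether the installed transformers is simply too old for the declared class:

```python
import json

import transformers

def check_tokenizer_class(cfg_path: str) -> tuple[str, bool]:
    """Read tokenizer_class from a tokenizer_config.json and report
    whether the installed transformers version actually provides it."""
    with open(cfg_path) as f:
        cls_name = json.load(f).get("tokenizer_class", "")
    return cls_name, hasattr(transformers, cls_name)

print("installed transformers:", transformers.__version__)
```

If the second value comes back False for the declared class, upgrading transformers (or using a vLLM build pinned to a newer transformers) is the likely fix.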

I get the same error as before

Created a new workspace:

uv init
uv add vllm

CUDA_VISIBLE_DEVICES=0,1 VLLM_ATTENTION_BACKEND=FLASH_ATTN uv run vllm serve \
  --model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
  --served-model-name Qwen3.5 \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-chunked-prefill \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --host 0.0.0.0 \
  --port 8000


...
(APIServer pid=990732)     return renderer_cls.from_config(config, tokenizer_kwargs)
(APIServer pid=990732)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 625, in from_config
(APIServer pid=990732)     cached_get_tokenizer(
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=990732)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=990732)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=990732)     raise e
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=990732)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=990732)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732)   File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=990732)     raise ValueError(
(APIServer pid=990732) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
QuantTrio org
edited 21 days ago


Could you use pip install vllm==0.18.0 instead of uv and see if it works? I suspect uv add is just reusing an existing module; uv pip install may be more appropriate if you insist on using uv.
This repo is literally just a Qwen3.5 dense model in AWQ format, so your Python/vLLM environment should recognize it.
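One way to check whether the environment is reusing a stale install is to ask the interpreter where it actually resolves each package from; this sketch (the helper name is illustrative) runs in any environment:

```python
import importlib.util

def package_origin(pkg: str):
    """Return the file a top-level package resolves to, or None if absent.

    A path outside the venv you just created means the interpreter is
    picking up a different (possibly stale) install than the one you
    think you are testing."""
    spec = importlib.util.find_spec(pkg)
    return getattr(spec, "origin", None)

for pkg in ("vllm", "transformers"):
    print(pkg, "->", package_origin(pkg) or "not installed")
```

If the printed paths don't live under the freshly created venv, the clean-install test wasn't actually clean.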


Can you also quantize the 4B & 9B models? Thank you!


Did you ever resolve this? I'm getting the exact same thing :(
