AWQ 4-bit version of this Opus-Distilled-v2 model?
Hi,
Thank you for your excellent AWQ quantizations.
I'm using Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 (the v2 version with 14k Opus samples). It's currently the best reasoning model I have for coding and agent tasks - shorter CoT, better efficiency than base Qwen3.5-27B.
However, I'm on a single RTX 5090 and really want to run it with vLLM + FlashInfer to get MTP, continuous batching and higher speed.
Would you consider making an AWQ 4-bit version of this Opus-Distilled-v2 model?
The distillation dataset is public, so the data is already available. Many users with 40/50-series cards are waiting for a good AWQ quant of this specific model.
Thanks in advance!
Best regards
let me see
Some of the quant repos here (mainly the Qwen3.5 AWQ series so far) use a data-free quantization technique.
We can give it a try
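For context, here is a minimal sketch of a conventional AutoAWQ 4-bit flow. This is an illustration only: whether AutoAWQ handles this exact Qwen3.5 architecture, and how our data-free recipe differs from this calibrated path, are assumptions, not a description of the actual pipeline. The src/dst paths are just placeholders.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

src = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
dst = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ"
# Standard 4-bit AWQ settings: zero-point quantization, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(src, low_cpu_mem_usage=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(src, trust_remote_code=True)

# Calibration pass; the public distillation dataset mentioned above could be
# supplied via calib_data, otherwise AutoAWQ falls back to its default set.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(dst)
tokenizer.save_pretrained(dst)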
I see the description mentions requiring CUDA 12.8.
I'm using vllm in docker with "vllm/vllm-openai:cu130-nightly".
QuantTrio/Qwen3.5-27B-AWQ works perfectly
but with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ I get:
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] [vLLM ASCII banner]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] version 0.18.1rc1.dev227+gc133f3374
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:233] non-default args: {'model_tag': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'trust_remote_code': True, 'max_model_len': 196608, 'served_model_name': ['Qwen3.5'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 03-30 10:50:29 [envs.py:1733] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:1678] Using max model len 196608
(APIServer pid=1) INFO 03-30 10:50:35 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:35 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:549] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 03-30 10:50:39 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:39 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:228] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:259] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-30 10:50:39 [vllm.py:786] Asynchronous scheduling is enabled.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1) self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1) tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1) return cached_get_tokenizer(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1) tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=1) raise e
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1) raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported
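For what it's worth, the failure should be reproducible outside vLLM, since the traceback bottoms out in transformers' AutoTokenizer. A minimal check (assuming transformers is installed and the Hub repo is reachable):

from transformers import AutoTokenizer
# If this raises the same ValueError, the problem lives in the
# transformers/tokenizer layer rather than anything vLLM-specific.
tok = AutoTokenizer.from_pretrained(
    "QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ",
    trust_remote_code=True,
)
print(type(tok).__name__)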
Could you try installing the official vLLM release in a clean venv/image? Your environment is not recognizing the tokenizer class. This is not a CUDA issue, but rather some higher-level vLLM/transformers issue.
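As a quick sanity check (just a sketch, since version mismatches are the usual suspect for unknown tokenizer classes), you could print what is actually installed in the image:

# Print the versions the server process actually sees.
import vllm, transformers, tokenizers
print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)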
I get the same error as before
Created a new workspace:
uv init
uv add vllm
CUDA_VISIBLE_DEVICES=0,1 VLLM_ATTENTION_BACKEND=FLASH_ATTN uv run vllm serve \
--model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
--served-model-name Qwen3.5 \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-chunked-prefill \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--host 0.0.0.0 \
--port 8000
...
(APIServer pid=990732) return renderer_cls.from_config(config, tokenizer_kwargs)
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 625, in from_config
(APIServer pid=990732) cached_get_tokenizer(
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=990732) tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=990732) raise e
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=990732) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=990732) raise ValueError(
(APIServer pid=990732) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
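One unverified workaround sketch, under the assumption that the root cause is the repo's tokenizer_config.json declaring a tokenizer_class ("TokenizersBackend") that the installed transformers release does not ship: download the repo locally and point tokenizer_class at a class this transformers version does know, then serve from the local directory. This is a guess about the cause, not a confirmed fix.

import json, pathlib

# Assumes a fast tokenizer.json is shipped alongside the config, so
# PreTrainedTokenizerFast can load the same files.
cfg_path = pathlib.Path(
    "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ/tokenizer_config.json"
)
cfg = json.loads(cfg_path.read_text())
cfg["tokenizer_class"] = "PreTrainedTokenizerFast"
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))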
Could you use pip install vllm==0.18.0 instead of uv and see if it works? I suspect uv add is just reusing a cached module; uv pip install may be more appropriate if you insist on using uv.
This repo is literally just a Qwen3.5 dense model in AWQ format; your Python/vLLM environment should have recognized it.
let me see
Can you also quantize the 4B & 9B models? Thank you!
Did you ever resolve this? I'm getting the exact same thing :(