TensorRT-LLM 1.3.0rc8 fails to load Qwen3.5-2B-AWQ with unsupported AWQ quantization_config

#1
by ernestyalumni - opened

This report is probably more useful to the folks maintaining TensorRT-LLM than it is evidence of a problem with the model itself:

Trying to serve this model with trtllm-serve on TensorRT-LLM 1.3.0rc8 fails during model load because the checkpoint’s Hugging Face
AWQ quantization_config is rejected as unsupported. The error ends with NotImplementedError: Unsupported quantization_config:
{'quant_method': 'awq', 'bits': 4, ...}. Here is the full error output:

root@Zephyrus-G15-GA503QR:/Data/Models/LLM/QuantTrio/Qwen3.5-2B-AWQ# trtllm-serve . --port 30000
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.6 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc8
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  class ResponseFormat(OpenAIBaseModel):
[03/23/2026-03:30:23] [TRT-LLM] [I] Using LLM with PyTorch backend
[03/23/2026-03:30:23] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[03/23/2026-03:30:23] [TRT-LLM] [I] Found quantization_config field in ./config.json, pre-quantized checkpoint is used.
[03/23/2026-03:30:23] [TRT-LLM] [I] Use quantization_config from ./config.json: quantization_config={'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 910, in serve
    _serve_llm()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 882, in _serve_llm
    launch_server(host,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 280, in launch_server
    llm = PyTorchLLM(**llm_args)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1289, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1171, in __init__
    super().__init__(model,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 251, in __init__
    self._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1216, in _build_model
    super()._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 851, in _build_model
    self._engine_dir, self._hf_model_dir = model_loader()
                                           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 758, in __call__
    self.model_loader._update_from_hf_quant_config()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 493, in _update_from_hf_quant_config
    raise NotImplementedError(
NotImplementedError: Unsupported quantization_config: {'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}.
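For anyone triaging this: the failure happens before any weights are loaded, when `_update_from_hf_quant_config` inspects the `quantization_config` dict from the checkpoint's `config.json` and finds a `quant_method` it doesn't map to a TRT-LLM quant algorithm. The snippet below is a minimal sketch of that kind of gate, not TensorRT-LLM's actual code; the `SUPPORTED_QUANT_METHODS` set and the `check_hf_quant_config` helper are illustrative assumptions, using the exact config dict from the log above.

```python
# Illustrative sketch (NOT TensorRT-LLM source): a checkpoint-load-time gate
# that rejects HF quantization_config dicts whose quant_method is unknown.
# The supported set here is an assumption for demonstration only.
SUPPORTED_QUANT_METHODS = {"fp8", "nvfp4"}

# The quantization_config that trtllm-serve read from ./config.json:
quantization_config = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "version": "gemm",
    "zero_point": True,
    "modules_to_not_convert": [
        "visual", "linear_attn", "self_attn", "model.layers.0.", "mtp",
    ],
}

def check_hf_quant_config(qc: dict) -> None:
    """Raise if the HF quant_method has no TRT-LLM mapping (hypothetical helper)."""
    if qc.get("quant_method") not in SUPPORTED_QUANT_METHODS:
        raise NotImplementedError(f"Unsupported quantization_config: {qc}")

try:
    check_hf_quant_config(quantization_config)
except NotImplementedError as e:
    print(type(e).__name__)  # same exception type as in the traceback
```

So the fix would presumably be for TensorRT-LLM to add an AWQ (`quant_method: 'awq'`) branch to that mapping, or to document that AWQ checkpoints in this HF format aren't supported by the PyTorch backend.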
