Model Card for namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2
A. Benchmark Results
a. GSM8K Accuracy: Quantized Model
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ lm_eval --model hf --model_args pretrained=namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2 --tasks gsm8k --limit 10 --apply_chat_template --batch_size auto
2026-04-08:07:14:21 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-04-08:07:14:21 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-04-08:07:14:24 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-04-08:07:14:25 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-04-08:07:14:25 INFO [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}
2026-04-08:07:14:28 INFO [models.huggingface:161] Using device 'cuda:0'
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
2026-04-08:07:14:30 INFO [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
/workspace/ao/.venv/lib/python3.11/site-packages/transformers/quantizers/auto.py:239: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
Loading checkpoint shards: 100%|████████████████████| 12/12 [00:08<00:00, 1.41it/s]
2026-04-08:07:14:41 INFO [tasks:700] Selected tasks:
2026-04-08:07:14:41 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-04-08:07:14:41 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-04-08:07:14:41 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-04-08:07:14:41 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████| 10/10 [00:00<00:00, 336.37it/s]
2026-04-08:07:14:41 INFO [evaluator:584] Running generate_until requests
Running generate_until requests: 0%| | 0/10 [00:00<?, ?it/s]Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 7
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|████████████████████| 10/10 [01:25<00:00, 8.58s/it]
2026-04-08:07:16:08 INFO [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
hf ({'pretrained': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}), gen_kwargs: ({}), limit: 10.0, num_fewshot: None, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.9|± | 0.1|
| | |strict-match | 5|exact_match|↑ | 0.9|± | 0.1|
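With only 10 samples (from `--limit 10`), the reported ±0.1 standard error follows directly from the sample statistics. As a quick sanity check, assuming the usual sample-standard-error formula (standard deviation with ddof=1 divided by √n):

```python
import math

# 10 GSM8K samples, 9 answered correctly (exact_match = 0.9)
n, correct = 10, 9
p = correct / n

# Sample standard error of the mean: sqrt(p * (1 - p) / (n - 1))
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(f"exact_match = {p:.1f} +/- {stderr:.1f}")  # exact_match = 0.9 +/- 0.1
```

This is why the `--limit 10` warning matters: with a ±0.1 standard error, the quantized and baseline scores are statistically indistinguishable, so these numbers are a smoke test rather than a real metric.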
b. GSM8K Accuracy: Baseline Model
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ lm_eval --model hf --model_args pretrained=google/gemma-3-27b-it --tasks gsm8k --limit 10 --apply_chat_template --batch_size auto
2026-04-08:07:16:40 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-04-08:07:16:40 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-04-08:07:16:43 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-04-08:07:16:45 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-04-08:07:16:45 INFO [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'google/gemma-3-27b-it'}
2026-04-08:07:16:47 INFO [models.huggingface:161] Using device 'cuda:0'
2026-04-08:07:16:49 INFO [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
Loading checkpoint shards: 100%|████████████████████| 12/12 [00:08<00:00, 1.36it/s]
2026-04-08:07:17:00 INFO [tasks:700] Selected tasks:
2026-04-08:07:17:00 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-04-08:07:17:00 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-04-08:07:17:00 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-04-08:07:17:00 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████| 10/10 [00:00<00:00, 242.23it/s]
2026-04-08:07:17:01 INFO [evaluator:584] Running generate_until requests
Running generate_until requests: 0%| | 0/10 [00:00<?, ?it/s]Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 8
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|████████████████████| 10/10 [01:27<00:00, 8.77s/it]
2026-04-08:07:18:29 INFO [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
hf ({'pretrained': 'google/gemma-3-27b-it'}), gen_kwargs: ({}), limit: 10.0, num_fewshot: None, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.9|± | 0.1|
| | |strict-match | 5|exact_match|↑ | 0.9|± | 0.1|
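The accuracy parity above is the point of the INT4 checkpoint: the win is memory, not quality. A rough back-of-envelope estimate of the weight footprint, assuming 27B parameters (from the model name) and ignoring higher-precision embeddings, quantization scales, and activation memory:

```python
# Back-of-envelope weight-memory estimate for a 27B-parameter model.
# Ignores embeddings kept in higher precision, group-wise scale/zero-point
# overhead, and activation/KV-cache memory, so treat the numbers as rough.
params = 27e9
bf16_gb = params * 2 / 1e9    # 2 bytes per BF16 weight
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per INT4 weight
print(f"BF16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.1f} GB "
      f"(~{bf16_gb / int4_gb:.0f}x smaller)")  # BF16: ~54 GB, INT4: ~13.5 GB (~4x smaller)
```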
c. Throughput with vLLM (fails at engine startup; see the AssertionError below)
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ vllm bench throughput --input-len 256 --output-len 256 --model namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2 --num-prompts 10 --enforce-eager
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
When dataset path is not set, it will default to random dataset
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --input-len and --random-input-len are specified. The random version (--random-input-len) will be preferred in this run.
validate_args(args)
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --output-len and --random-output-len are specified. The random version (--random-output-len) will be preferred in this run.
validate_args(args)
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --prefix-len and --random-prefix-len are specified. The random version (--random-prefix-len) will be preferred in this run.
validate_args(args)
INFO 04-08 07:27:53 [datasets.py:700] Sampling input_len from [1023, 1023] and output_len from [128, 128]
INFO 04-08 07:27:53 [utils.py:233] non-default args: {'tokenizer': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', 'enforce_eager': True, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}
INFO 04-08 07:28:03 [model.py:549] Resolved architecture: Gemma3ForConditionalGeneration
INFO 04-08 07:28:03 [model.py:1678] Using max model len 131072
INFO 04-08 07:28:03 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 04-08 07:28:03 [vllm.py:790] Asynchronous scheduling is enabled.
WARNING 04-08 07:28:03 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 04-08 07:28:03 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-08 07:28:03 [vllm.py:1025] Cudagraph is disabled under eager mode
WARNING 04-08 07:28:03 [cuda.py:199] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
INFO 04-08 07:28:03 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
(EngineCore pid=8509) INFO 04-08 07:28:26 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', speculative_config=None, tokenizer='namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 
'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=8509) INFO 04-08 07:28:28 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:36275 backend=nccl
(EngineCore pid=8509) INFO 04-08 07:28:28 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=8509) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore pid=8509) INFO 04-08 07:28:39 [gpu_model_runner.py:4735] Starting to load model namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2...
(EngineCore pid=8509) INFO 04-08 07:28:39 [interfaces.py:171] Contains out of vocabulary multimodal tokens? False
(EngineCore pid=8509) INFO 04-08 07:28:40 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=8509) INFO 04-08 07:28:40 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] EngineCore failed to start.
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] super().__init__(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self._init_executor()
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.driver_worker.load_model()
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = initialize_model(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/gemma3_mm.py", line 516, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.vision_tower = SiglipVisionModel(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 865, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.vision_model = SiglipVisionTransformer(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 702, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.encoder = SiglipEncoder(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 543, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] [
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 544, in <listcomp>
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] SiglipEncoderLayer(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 498, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.mlp = SiglipMLP(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 463, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.fc2 = RowParallelLinear(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.quant_method.create_weights(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/torchao.py", line 319, in create_weights
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] weight = torchao_quantize_param_data(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/torchao.py", line 285, in torchao_quantize_param_data
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] quantize_(dummy_linear, torchao_config)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 445, in quantize_
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] _replace_with_custom_fn_if_matches_filter(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 179, in _replace_with_custom_fn_if_matches_filter
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] new_child = _replace_with_custom_fn_if_matches_filter(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 174, in _replace_with_custom_fn_if_matches_filter
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = replacement_fn(model, *extra_args)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/prototype/awq/api.py", line 108, in _awq_transform
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] assert isinstance(qw, SupportsActivationPreScaling), (
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] AssertionError: weight must support activation scaling through implementing `SupportsActivationPreScaling`
[rank0]:[W408 07:28:41.539319800 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/workspace/ao/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
args.dispatch_function(args)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
main(args)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py", line 879, in main
elapsed_time, request_outputs = run_vllm(
^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py", line 55, in run_vllm
llm = LLM.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 415, in from_engine_args
return cls(**vars(engine_args))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 382, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
return cls(
^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 111, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 101, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 710, in __init__
super().__init__(
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
with launch_core_engines(
File "/.uv/python_install/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 144, in __exit__
next(self.gen)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
wait_for_engine_startup(
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
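The failure above happens because the AWQ transform is applied to the SigLIP vision tower (`SiglipMLP` → `RowParallelLinear` in the traceback), whose quantized weights do not implement `SupportsActivationPreScaling`. torchao's `quantize_` accepts a `filter_fn` that decides per module whether a config applies, so one plausible workaround is to quantize only the language-model linears. The sketch below is hypothetical: the prefixes follow the module names in the traceback, but the exact fully-qualified names in a given checkpoint are an assumption, and the filter is written torch-free (class-name check) so it can be read in isolation.

```python
# Hypothetical filter for torchao's quantize_(model, config, filter_fn=...):
# apply AWQ-INT4 only to language-model linears and skip the SigLIP vision
# tower, whose quantized weights lack SupportsActivationPreScaling.
SKIP_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def awq_filter(module, fqn: str) -> bool:
    """Return True if the module at fully-qualified name `fqn` should be quantized."""
    if fqn.startswith(SKIP_PREFIXES):
        return False
    # Quantize only linear layers (checked by class name to stay torch-free here;
    # with torch available, isinstance(module, torch.nn.Linear) is the usual check).
    return type(module).__name__ == "Linear"
```

If the checkpoint was already serialized with the AWQ config attached to every linear, the filter has to be applied at quantization time, before the model is uploaded; vLLM only replays the config it finds in the checkpoint.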