Model Card for namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2
A. Benchmark Results
a. GSM8K Accuracy: Quantized Model
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ lm_eval --model hf --model_args pretrained=namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2 --tasks gsm8k --limit 10 --apply_chat_template --batch_size auto
2026-04-08:07:14:21 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-04-08:07:14:21 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-04-08:07:14:24 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-04-08:07:14:25 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-04-08:07:14:25 INFO [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}
2026-04-08:07:14:28 INFO [models.huggingface:161] Using device 'cuda:0'
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
2026-04-08:07:14:30 INFO [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
/workspace/ao/.venv/lib/python3.11/site-packages/transformers/quantizers/auto.py:239: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
Loading checkpoint shards: 100%|████████████████████| 12/12 [00:08<00:00, 1.41it/s]
2026-04-08:07:14:41 INFO [tasks:700] Selected tasks:
2026-04-08:07:14:41 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-04-08:07:14:41 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-04-08:07:14:41 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-04-08:07:14:41 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████| 10/10 [00:00<00:00, 336.37it/s]
2026-04-08:07:14:41 INFO [evaluator:584] Running generate_until requests
Running generate_until requests: 0%| | 0/10 [00:00<?, ?it/s]Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 7
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|████████████████████| 10/10 [01:25<00:00, 8.58s/it]
2026-04-08:07:16:08 INFO [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
hf ({'pretrained': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}), gen_kwargs: ({}), limit: 10.0, num_fewshot: None, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.9|± | 0.1|
| | |strict-match | 5|exact_match|↑ | 0.9|± | 0.1|
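With only 10 samples (from `--limit 10`), the reported ±0.1 standard error follows directly from the sample statistics. As a quick sanity check, assuming the usual sample-standard-error formula (standard deviation with ddof=1 divided by √n):

```python
import math

# 10 GSM8K samples, 9 answered correctly (exact_match = 0.9)
n, correct = 10, 9
p = correct / n

# Sample standard error of the mean: sqrt(p * (1 - p) / (n - 1))
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(f"exact_match = {p:.1f} +/- {stderr:.1f}")  # exact_match = 0.9 +/- 0.1
```

This is why the `--limit 10` warning matters: with a ±0.1 standard error, the quantized and baseline scores are statistically indistinguishable, so these numbers are a smoke test rather than a real metric.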
b. GSM8K Accuracy: Baseline Model
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ lm_eval --model hf --model_args pretrained=google/gemma-3-27b-it --tasks gsm8k --limit 10 --apply_chat_template --batch_size auto
2026-04-08:07:16:40 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-04-08:07:16:40 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-04-08:07:16:43 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-04-08:07:16:45 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-04-08:07:16:45 INFO [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'google/gemma-3-27b-it'}
2026-04-08:07:16:47 INFO [models.huggingface:161] Using device 'cuda:0'
2026-04-08:07:16:49 INFO [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
Loading checkpoint shards: 100%|████████████████████| 12/12 [00:08<00:00, 1.36it/s]
2026-04-08:07:17:00 INFO [tasks:700] Selected tasks:
2026-04-08:07:17:00 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-04-08:07:17:00 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-04-08:07:17:00 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-04-08:07:17:00 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████| 10/10 [00:00<00:00, 242.23it/s]
2026-04-08:07:17:01 INFO [evaluator:584] Running generate_until requests
Running generate_until requests: 0%| | 0/10 [00:00<?, ?it/s]Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 8
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|████████████████████| 10/10 [01:27<00:00, 8.77s/it]
2026-04-08:07:18:29 INFO [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
hf ({'pretrained': 'google/gemma-3-27b-it'}), gen_kwargs: ({}), limit: 10.0, num_fewshot: None, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.9|± | 0.1|
| | |strict-match | 5|exact_match|↑ | 0.9|± | 0.1|
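The accuracy parity above is the point of the INT4 checkpoint: the win is memory, not quality. A rough back-of-envelope estimate of the weight footprint, assuming 27B parameters (from the model name) and ignoring higher-precision embeddings, quantization scales, and activation memory:

```python
# Back-of-envelope weight-memory estimate for a 27B-parameter model.
# Ignores embeddings kept in higher precision, group-wise scale/zero-point
# overhead, and activation/KV-cache memory, so treat the numbers as rough.
params = 27e9
bf16_gb = params * 2 / 1e9    # 2 bytes per BF16 weight
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per INT4 weight
print(f"BF16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.1f} GB "
      f"(~{bf16_gb / int4_gb:.0f}x smaller)")  # BF16: ~54 GB, INT4: ~13.5 GB (~4x smaller)
```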
c. Throughput with vLLM (fails at engine startup; see the AssertionError below)
(ao) (main) root@C.34352425:/workspace/ao/.github/scripts/torchao_model_releases$ vllm bench throughput --input-len 256 --output-len 256 --model namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2 --num-prompts 10 --enforce-eager
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
When dataset path is not set, it will default to random dataset
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --input-len and --random-input-len are specified. The random version (--random-input-len) will be preferred in this run.
validate_args(args)
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --output-len and --random-output-len are specified. The random version (--random-output-len) will be preferred in this run.
validate_args(args)
/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py:848: UserWarning: Both --prefix-len and --random-prefix-len are specified. The random version (--random-prefix-len) will be preferred in this run.
validate_args(args)
INFO 04-08 07:27:53 [datasets.py:700] Sampling input_len from [1023, 1023] and output_len from [128, 128]
INFO 04-08 07:27:53 [utils.py:233] non-default args: {'tokenizer': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', 'enforce_eager': True, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2'}
INFO 04-08 07:28:03 [model.py:549] Resolved architecture: Gemma3ForConditionalGeneration
INFO 04-08 07:28:03 [model.py:1678] Using max model len 131072
INFO 04-08 07:28:03 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 04-08 07:28:03 [vllm.py:790] Asynchronous scheduling is enabled.
WARNING 04-08 07:28:03 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 04-08 07:28:03 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-08 07:28:03 [vllm.py:1025] Cudagraph is disabled under eager mode
WARNING 04-08 07:28:03 [cuda.py:199] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
INFO 04-08 07:28:03 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
(EngineCore pid=8509) INFO 04-08 07:28:26 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', speculative_config=None, tokenizer='namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 
'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=8509) INFO 04-08 07:28:28 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:36275 backend=nccl
(EngineCore pid=8509) INFO 04-08 07:28:28 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=8509) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore pid=8509) INFO 04-08 07:28:39 [gpu_model_runner.py:4735] Starting to load model namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2...
(EngineCore pid=8509) INFO 04-08 07:28:39 [interfaces.py:171] Contains out of vocabulary multimodal tokens? False
(EngineCore pid=8509) INFO 04-08 07:28:40 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=8509) INFO 04-08 07:28:40 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] EngineCore failed to start.
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] super().__init__(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self._init_executor()
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.driver_worker.load_model()
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = initialize_model(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/gemma3_mm.py", line 516, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.vision_tower = SiglipVisionModel(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 865, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.vision_model = SiglipVisionTransformer(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 702, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.encoder = SiglipEncoder(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 543, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] [
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 544, in <listcomp>
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] SiglipEncoderLayer(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 498, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.mlp = SiglipMLP(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/models/siglip.py", line 463, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.fc2 = RowParallelLinear(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] self.quant_method.create_weights(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/torchao.py", line 319, in create_weights
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] weight = torchao_quantize_param_data(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/torchao.py", line 285, in torchao_quantize_param_data
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] quantize_(dummy_linear, torchao_config)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 445, in quantize_
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] _replace_with_custom_fn_if_matches_filter(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 179, in _replace_with_custom_fn_if_matches_filter
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] new_child = _replace_with_custom_fn_if_matches_filter(
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/quantization/quant_api.py", line 174, in _replace_with_custom_fn_if_matches_filter
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] model = replacement_fn(model, *extra_args)
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] File "/workspace/ao/.venv/lib/python3.11/site-packages/torchao/prototype/awq/api.py", line 108, in _awq_transform
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] assert isinstance(qw, SupportsActivationPreScaling), (
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=8509) ERROR 04-08 07:28:40 [core.py:1108] AssertionError: weight must support activation scaling through implementing `SupportsActivationPreScaling`
[rank0]:[W408 07:28:41.539319800 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/workspace/ao/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
args.dispatch_function(args)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
main(args)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py", line 879, in main
elapsed_time, request_outputs = run_vllm(
^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/benchmarks/throughput.py", line 55, in run_vllm
llm = LLM.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 415, in from_engine_args
return cls(**vars(engine_args))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 382, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
return cls(
^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 111, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 101, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 710, in __init__
super().__init__(
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
with launch_core_engines(
File "/.uv/python_install/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 144, in __exit__
next(self.gen)
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
wait_for_engine_startup(
File "/workspace/ao/.venv/lib/python3.11/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
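The failure above happens because the AWQ transform is applied to the SigLIP vision tower (`SiglipMLP` → `RowParallelLinear` in the traceback), whose quantized weights do not implement `SupportsActivationPreScaling`. torchao's `quantize_` accepts a `filter_fn` that decides per module whether a config applies, so one plausible workaround is to quantize only the language-model linears. The sketch below is hypothetical: the prefixes follow the module names in the traceback, but the exact fully-qualified names in a given checkpoint are an assumption, and the filter is written torch-free (class-name check) so it can be read in isolation.

```python
# Hypothetical filter for torchao's quantize_(model, config, filter_fn=...):
# apply AWQ-INT4 only to language-model linears and skip the SigLIP vision
# tower, whose quantized weights lack SupportsActivationPreScaling.
SKIP_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def awq_filter(module, fqn: str) -> bool:
    """Return True if the module at fully-qualified name `fqn` should be quantized."""
    if fqn.startswith(SKIP_PREFIXES):
        return False
    # Quantize only linear layers (checked by class name to stay torch-free here;
    # with torch available, isinstance(module, torch.nn.Linear) is the usual check).
    return type(module).__name__ == "Linear"
```

If the checkpoint was already serialized with the AWQ config attached to every linear, the filter has to be applied at quantization time, before the model is uploaded; vLLM only replays the config it finds in the checkpoint.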