I'm trying this out on my dual B60 system, but when I follow the guide and run the vllm serve command, I get the following error:
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
File "/opt/venv/bin/vllm", line 4, in <module>
from vllm.entrypoints.cli.main import main
File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
import vllm.env_override # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
import torch
File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_xpu.so: undefined symbol: _ZN3ccl2v128reducti
Using Claude, I managed to resolve it by adding an additional export to my path:
export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
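In case it helps others, the sed surgery above can be written as a small helper that drops the stale oneCCL directory from a colon-separated path and prepends the new one. This is just a sketch of what worked for me; the 2021.15/2021.17 paths are from my container and will differ on other images.

```shell
swap_path_entry() {
  # $1 = stale dir to drop, $2 = new dir to prepend, $3 = colon-separated path
  old="$1"; new="$2"; path="$3"
  # Split on ':', drop exact matches of the stale dir (with or without a
  # trailing slash), then rejoin with ':'.
  path=$(printf '%s' "$path" | tr ':' '\n' | grep -vxF -e "$old" -e "$old/" | paste -sd:)
  printf '%s\n' "$new${path:+:$path}"
}

# Same effect as the export above, without hand-editing the sed pattern:
export LD_LIBRARY_PATH=$(swap_path_entry \
  /opt/intel/oneapi/ccl/2021.15/lib \
  /opt/intel/oneapi/ccl/2021.17/lib \
  "$LD_LIBRARY_PATH")
```

Unlike the inline sed, this removes only whole path entries, so it can't mangle a longer path that happens to contain the old directory as a prefix.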
But then it errors out with a level_zero backend failure:
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
File "/opt/venv/bin/vllm", line 4, in <module>
from vllm.entrypoints.cli.main import main
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
from vllm.benchmarks.latency import add_cli_args, main
File "/opt/venv/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 15, in <module>
from vllm.engine.arg_utils import EngineArgs
File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
from vllm.config import (
File "/opt/venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 19, in <module>
from vllm.config.model import (
File "/opt/venv/lib/python3.12/site-packages/vllm/config/model.py", line 30, in <module>
from vllm.transformers_utils.config import (
File "/opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 19, in <module>
from transformers.models.auto.image_processing_auto import get_image_processor_config
File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 24, in <module>
from ...image_processing_utils import ImageProcessingMixin
File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 34, in <module>
from .processing_utils import ImagesKwargs, Unpack
File "/opt/venv/lib/python3.12/site-packages/transformers/processing_utils.py", line 79, in <module>
from .modeling_utils import PreTrainedAudioTokenizerBase
File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 73, in <module>
from .integrations.sdpa_attention import sdpa_attention_forward
File "/opt/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 12, in <module>
_is_torch_xpu_available = is_torch_xpu_available()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 313, in is_torch_xpu_available
return hasattr(torch, "xpu") and torch.xpu.is_available()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 74, in is_available
return device_count() > 0
^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 68, in device_count
return torch._C._xpu_getDeviceCount()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: level_zero backend failed with error: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
I tried debugging further with the help of Claude, but didn't get much further. Any chance the guide could be revisited? It seems like something is wrong with the way the Docker container is currently built.
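For anyone else hitting the first error: a quick sanity check that would likely have surfaced the stale oneCCL entry (assuming the old 2021.15 directory simply no longer exists in the image) is to verify that every LD_LIBRARY_PATH entry is actually present on disk. A minimal POSIX sh sketch:

```shell
check_ld_path() {
  # Walk the colon-separated entries of $1 and report any directory
  # that does not exist; returns 1 if anything is missing.
  old_ifs=$IFS; IFS=:
  bad=0
  for dir in $1; do
    if [ -n "$dir" ] && [ ! -d "$dir" ]; then
      echo "missing: $dir"
      bad=1
    fi
  done
  IFS=$old_ifs
  return "$bad"
}

# Usage: check_ld_path "$LD_LIBRARY_PATH"
```

A nonexistent entry is harmless by itself, but after a oneAPI version bump it's a strong hint that the loader is falling back to some other (possibly incompatible) copy of the library, which is exactly the undefined-symbol situation above.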