Use in OpenCode via vLLM: Endless Tool-Calling Loops & Undesirable Behavior
I'm using this together with OpenCode. I just tested it quickly and it unfortunately keeps on calling the same command (stuck in an endless loop).
On a second try, giving it clear instructions for what it should implement, it sort of hallucinates todo steps and just checks them off. So it's really not behaving the way I'd hope it to behave.
btw, this may be comparing apples to oranges because it's not an MoE and has a different architecture, but I haven't seen this with your qwen3.5-27b-AWQ-BF16-INT4 model.
I'm using vllm/vllm-openai:gemma4-cu130, and here's my vLLM YAML config:
name: gemma-4-26B-A4B-it-AWQ-4bit
description: >-
  Gemma 4 26B MoE (3.8B active) AWQ INT4. Apache 2.0 license. Native function
  calling, thinking mode, vision. 128 experts, top-8 routing. Frontier-level
  reasoning and coding per size class.
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
container_image: vllm/vllm-openai:gemma4-cu130
environment:
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '8'
  SAFETENSORS_FAST_GPU: '1'
  CUDA_MODULE_LOADING: LAZY
  VLLM_LOGGING_LEVEL: INFO
  TORCH_COMPILE_THREADS: '4'
  TORCHINDUCTOR_COMPILE_THREADS: '4'
  MAX_JOBS: '4'
serve_args:
  host: 0.0.0.0
  port: 8000
  trust-remote-code: true
  dtype: auto
  gpu-memory-utilization: 0.9
  tensor-parallel-size: 1
  kv-cache-dtype: auto
  max-model-len: 120000
  max-num-seqs: 4
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  performance-mode: throughput
  compilation-config: '{"cudagraph_mode": "piecewise", "cudagraph_capture_sizes": [1, 2, 3, 4], "inductor_compile_config": {"combo_kernels": false, "benchmark_combo_kernel": false}}'
  reasoning-parser: gemma4
  enable-auto-tool-choice: true
  tool-call-parser: gemma4
  served-model-name: gemma-4-26B-A4B-it-AWQ-4bit
I'm seeing looping and failing tool calls as well. I've been back and forth with Gemini and Copilot trying to get a Jinja template that will make this work, but no luck yet.
I should note, the tools I'm using are OpenClaw and Anthropic's Claude Code.
Edit: I had a Jinja template here that I thought fixed it, but it doesn't.
Not sure if it's this model's issue or vLLM's gemma4 parsers. Either way it's unusable, sadly, and I'm back to LM Studio and qwen3-coder-next.
Should have RTFM'd, I guess.
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch
This page mentions a special Jinja template to grab from vLLM's GitHub.
Very initial testing is MUCH better.
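For anyone else following along, wiring that template in would look roughly like this. This is a sketch based on the recipe page above: the exact template filename comes from vLLM's GitHub (the name below is a placeholder), and `--chat-template` is vLLM's standard flag for overriding a model's built-in template:

```shell
# Save the template from the vLLM repo first (see the recipes page for the
# exact file), then point the server at it instead of the bundled one:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --chat-template ./gemma4_chat_template.jinja \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
```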
Maybe it has something to do with this commit in the official repo that was made yesterday: https://huggingface.co/google/gemma-4-26B-A4B-it/commit/1db3cff1840c2ae59759d8e842ff37831cf8cb63
I've tried the vLLM Jinja (modified 2 days ago) and the official Google Jinja (modified 1 day ago), and the results are not exactly good. Better, but not good.
Maybe it's OpenClaw's fault, but it can't handle this model's output.
gemma-4-31b, on the other hand, CAN handle anything I ask of it through OpenClaw, but I can't fit that in my GPU, so I'm running it from Google's free-tier API. That's probably running at full precision AND it's a smarter model, so it's not exactly a fair comparison.
Thank you for letting me know. Could you please try again with the recent chat_template.jinja and tokenizer_config.json?
I will look into this and potentially requantize using the new chat_template.jinja and tokenizer_config.json.
Thanks cpatonn. This seems to be an issue with llama.cpp as well (or at least it was), so I don't think it's just a YOU thing.
I have cleared my ~/.cache/huggingface dir for this model so that it pulls down the latest changes you mentioned, and I removed my custom flag for the other Jinja template.
I will test it this evening and tomorrow and let you know how it works out.
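In case it's useful to anyone else retesting: rather than wiping all of ~/.cache/huggingface, you can delete just this model's snapshot. This assumes the standard huggingface_hub cache layout (`models--<org>--<name>` under `hub/`):

```shell
# Remove only this model's cached snapshot so the next run re-downloads
# the updated chat_template.jinja and tokenizer_config.json.
MODEL_DIR="$HOME/.cache/huggingface/hub/models--cyankiwi--gemma-4-26B-A4B-it-AWQ-4bit"
rm -rf "$MODEL_DIR"
```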
Sadly it seems to not have helped.
Tool calls are being made incorrectly still.
Even simple ones through OpenClaw like "cat the contents of file x to the chat".
I don't have much luck with this model in GGUF format in LM Studio either... Maybe it's just not smart enough for OpenClaw, OR there's some big Jinja issue Google hasn't fixed.
Who knows.
I don't know if this will help you in any way, but I tested it today with vLLM 0.19.0 and transformers 5.5.0 (you need to force-install that transformers version after installing vLLM), and it's been working well for me. qwen3.5 is much better at calling tools, but this gemma4 version is working fine with vLLM.
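For anyone reproducing this, the force-install step would look roughly like this. The versions are the ones from my comment above; whether `--no-deps` is needed to keep pip from downgrading anything is an assumption on my part, so check the versions afterwards:

```shell
# Install vLLM first, then force transformers to the pinned version,
# since installing vLLM may pull in a different transformers release.
pip install vllm==0.19.0
pip install --force-reinstall --no-deps transformers==5.5.0
# Sanity-check what actually got installed:
python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"
```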
Thanks Lobliqua. Looks like I'm already on 0.19, and yes, I was using transformers 5.5 before, but I'm not sure which vLLM version I had at the time.
What flags are you calling it with? I wonder if I'm making it too dumb with my quant settings (specifically --dtype half and --kv-cache-dtype fp8):
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --dtype half \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 1234 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --async-scheduling \
  --served-model-name gemma-4-26b-a4b
No problem; I had to test many configurations to get my current setup working satisfactorily. I'm using an RTX 5090, and I have two working configurations, as follows:
performance:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --served-model-name gemma-4-26b-a4b \
  --quantization compressed-tensors \
  --attention-backend FLASH_ATTN \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.98 \
  --max-model-len auto \
  --max-num-seqs 1 \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host localhost \
  --port 8888
larger context window:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --served-model-name gemma-4-26b-a4b \
  --quantization compressed-tensors \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.98 \
  --max-model-len auto \
  --max-num-seqs 1 \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host localhost \
  --port 8888
You may need to change the value for the --attention-backend flag or omit it.
Thanks Lobliqua. I appreciate it, but before I got a chance to try it I saw Qwen 3.6 is out, so I flipped over to that one. Man, this stuff moves so fast.
3.6 is working incredibly well in openclaw and Claude code.