Use in OpenCode via vLLM: Endless Tool-Calling Loops & Undesirable Behavior
I'm using this together with OpenCode. I just tested it quickly and it unfortunately keeps on calling the same command (stuck in an endless loop).
On a second try, giving it clear instructions for what it should implement, it sort of hallucinates todo steps and just checks them off. So it's really not behaving the way I'd hope it to behave.
btw, this may be comparing apples to oranges because it's not an MoE and has a different architecture, but I haven't seen this with your qwen3.5-27b-AWQ-BF16-INT4 model.
I'm using vllm/vllm-openai:gemma4-cu130, and here's my vLLM YAML config:
name: gemma-4-26B-A4B-it-AWQ-4bit
description: >-
  Gemma 4 26B MoE (3.8B active) AWQ INT4. Apache 2.0 license. Native function
  calling, thinking mode, vision. 128 experts, top-8 routing. Frontier-level
  reasoning and coding per size class.
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
container_image: vllm/vllm-openai:gemma4-cu130
environment:
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '8'
  SAFETENSORS_FAST_GPU: '1'
  CUDA_MODULE_LOADING: LAZY
  VLLM_LOGGING_LEVEL: INFO
  TORCH_COMPILE_THREADS: '4'
  TORCHINDUCTOR_COMPILE_THREADS: '4'
  MAX_JOBS: '4'
serve_args:
  host: 0.0.0.0
  port: 8000
  trust-remote-code: true
  dtype: auto
  gpu-memory-utilization: 0.9
  tensor-parallel-size: 1
  kv-cache-dtype: auto
  max-model-len: 120000
  max-num-seqs: 4
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  performance-mode: throughput
  compilation-config: '{"cudagraph_mode": "piecewise", "cudagraph_capture_sizes": [1, 2, 3, 4], "inductor_compile_config": {"combo_kernels": false, "benchmark_combo_kernel": false}}'
  reasoning-parser: gemma4
  enable-auto-tool-choice: true
  tool-call-parser: gemma4
  served-model-name: gemma-4-26B-A4B-it-AWQ-4bit
I'm seeing looping and failing tool calls as well. I've been back and forth with Gemini and Copilot trying to get a Jinja template that will make this work, but no luck yet.
I should note, the tools I'm using are OpenClaw and Anthropic's Claude Code.
Edit: I had a Jinja template here that I thought fixed it, but it doesn't.
Not sure if it's this model's issue or vLLM's gemma4 parsers. Either way it's unusable, sadly, and I'm back to LM Studio and qwen3-coder-next.
Should have RTFM'd, I guess.
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch
This page mentions a special Jinja template to grab from vLLM's GitHub.
Very initial testing is MUCH better.
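For anyone else following along, wiring that template in would look roughly like this. This is a sketch based on the recipe page above: the exact template filename comes from vLLM's GitHub (the name below is a placeholder), and `--chat-template` is vLLM's standard flag for overriding a model's built-in template:

```shell
# Save the template from the vLLM repo first (see the recipes page for the
# exact file), then point the server at it instead of the bundled one:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --chat-template ./gemma4_chat_template.jinja \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
```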
Maybe it has something to do with this commit in the official repo that was made yesterday: https://huggingface.co/google/gemma-4-26B-A4B-it/commit/1db3cff1840c2ae59759d8e842ff37831cf8cb63
I've tried the vLLM Jinja (modified 2 days ago) and the official Google Jinja (modified 1 day ago), and the results are not exactly good. Better, but not good.
Maybe it's OpenClaw's fault, but it can't handle this model's output.
gemma-4-31b, on the other hand, CAN handle anything I ask of it through OpenClaw, but I can't fit that in my GPU, so I'm running it from Google's free-tier API. That's probably running at full precision AND it's a smarter model, so it's not exactly a fair comparison.
Thank you for letting me know. Could you please try again with the recent chat_template.jinja and tokenizer_config.json?
I will look into this and potentially requantize using the new chat_template.jinja and tokenizer_config.json.
Thanks cpatonn. This seems to be an issue with llama.cpp as well (or at least it was), so I don't think it's just a YOU thing.
I have cleared my ~/.cache/huggingface dir for this model so that it pulls down the latest changes you mentioned, and I removed my custom flag for the other Jinja template.
I will test it this evening and tomorrow and let you know how it works out.
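In case it's useful to anyone else retesting: rather than wiping all of ~/.cache/huggingface, you can delete just this model's snapshot. This assumes the standard huggingface_hub cache layout (`models--<org>--<name>` under `hub/`):

```shell
# Remove only this model's cached snapshot so the next run re-downloads
# the updated chat_template.jinja and tokenizer_config.json.
MODEL_DIR="$HOME/.cache/huggingface/hub/models--cyankiwi--gemma-4-26B-A4B-it-AWQ-4bit"
rm -rf "$MODEL_DIR"
```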
Sadly it seems to not have helped.
Tool calls are being made incorrectly still.
Even simple ones through OpenClaw like "cat the contents of file x to the chat".
I don't have much luck with this model in GGUF format in LM Studio either... Maybe it's just not smart enough for OpenClaw, OR there's some big Jinja issue Google hasn't fixed.
Who knows.
I don't know if this will help you in any way, but I tested it today with vLLM 0.19.0 and transformers 5.5.0 (you need to force-install that transformers version after installing vLLM), and it's been working well for me. qwen3.5 is much better at calling tools, but this gemma4 version is working fine with vLLM.
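For anyone reproducing this, the force-install step would look roughly like this. The versions are the ones from my comment above; whether `--no-deps` is needed to keep pip from downgrading anything is an assumption on my part, so check the versions afterwards:

```shell
# Install vLLM first, then force transformers to the pinned version,
# since installing vLLM may pull in a different transformers release.
pip install vllm==0.19.0
pip install --force-reinstall --no-deps transformers==5.5.0
# Sanity-check what actually got installed:
python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"
```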
Thanks Lobliqua. Looks like I'm already on 0.19, and yes, I was using transformers 5.5 before, but I'm not sure which vLLM version I had at the time.
What flags are you calling it with? I wonder if I'm making it too dumb with my quant settings (specifically --dtype half and --kv-cache-dtype fp8):
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --dtype half \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 1234 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --async-scheduling \
  --served-model-name gemma-4-26b-a4b
No problem; I had to test many configurations to get my current setup working satisfactorily. I'm using an RTX 5090, and I have two working configurations, as follows:
performance:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --served-model-name gemma-4-26b-a4b \
  --quantization compressed-tensors \
  --attention-backend FLASH_ATTN \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.98 \
  --max-model-len auto \
  --max-num-seqs 1 \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host localhost \
  --port 8888
larger context window:
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --served-model-name gemma-4-26b-a4b \
  --quantization compressed-tensors \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.98 \
  --max-model-len auto \
  --max-num-seqs 1 \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host localhost \
  --port 8888
You may need to change the value for the --attention-backend flag or omit it.
Thanks Lobliqua. I appreciate it, but before I got a chance to try it I saw Qwen 3.6 is out, so I flipped over to that one. Man, this stuff moves so fast.
3.6 is working incredibly well in openclaw and Claude code.