GLM 5.1 vs GLM 5 - burns A LOT of output tokens on thinking
This is not related to this quant, but I thought I'd share as I like the crowd here. After a couple of days using GLM 5.1 for real work, I noticed it randomly entered long periods of entirely log-less hanging, which was super annoying and rendered it almost useless even on high-end hardware. We're talking 10-15 minutes of visibly doing nothing at all. Or 30. Zero log records in your harness or in the ik_llama server logs. Turns out it has a tendency toward really long internal thinking, like burning 16k output tokens just to come up with a single line such as "I need to check this file as well." I turned thinking off by default and it's much, much more usable. I haven't done anything extra hard yet, but it seems very smart and worthy with thinking off. Something to consider for your setups.
Thanks for sharing your observations as we're all trying to figure out how to best use these big models on homelab / local server setups.
I have a few days of experience now using big GLM-5.1 in opencode, and it is quite slow (with A40B and no DSA support yet in mainline or ik). I ended up loading a second smaller fast model fully offloaded to GPU/VRAM and making GLM-5.1 delegate subtasks to it, which can help with parsing through lots of logs and such.
I haven't noticed it silently doing nothing for 10-15 minutes though. If it is prompt processing ik_llama.cpp will show log lines periodically, and at least it seems like opencode is showing the thinking traces so I can watch it plod along. It does think a lot before outputting the final answer.
You might be able to tweak the reasoning effort somehow, either with --chat-template_kwargs ..???... or maybe that is in ik, I forget at the moment. This might be a way to get some thinking without it taking so long. I'll keep it in mind to try with thinking off completely though, if it is smart enough to write code without it! haha...
One other thing to try to speed up decode (token generation) is speculative self-decoding plus plenty of prompt caching, e.g. try this:
```
llama-server \
  --cache-ram 65536 \
  --prompt-cache-all \
  --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4 \
```
This cache pattern would use 64 GiB of system RAM as prompt cache and cache everything, which might help if you leave it running a long time with similar patterns of interaction.
I don't know how to tune the speculative decoding stuff, but it seems to be getting some acceptance rate with these values; it might depend on your workload.
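The `--spec-type ngram-map-*` flags are easier to reason about once you see the underlying idea: remember which token followed each recent n-gram and propose that continuation as cheap draft tokens the big model only has to verify in one batch. A toy sketch of the lookup (illustrative only, NOT ik_llama.cpp's actual implementation; function and parameter names here are made up):

```python
# Toy illustration of ngram-based self-speculation (not ik_llama.cpp's code;
# names and parameters are made up). The idea: remember which token followed
# each recent n-gram, and propose that continuation as free draft tokens.
from collections import defaultdict

def ngram_draft(history, n=3, max_draft=4):
    # Build the n-gram -> next-token table from the tokens seen so far.
    table = defaultdict(list)
    for i in range(len(history) - n):
        table[tuple(history[i:i + n])].append(history[i + n])
    # Walk forward from the last n tokens, proposing drafts greedily.
    draft, ctx = [], list(history[-n:])
    for _ in range(max_draft):
        followers = table.get(tuple(ctx))
        if not followers:
            break
        nxt = followers[-1]  # most recent continuation wins
        draft.append(nxt)
        ctx = ctx[1:] + [nxt]
    return draft

# Repetitive token streams (logs, boilerplate code) are where this shines:
print(ngram_draft([1, 2, 3, 4, 1, 2, 3, 4, 1, 2], n=2))  # → [3, 4, 1, 2]
```

This also explains why acceptance rate "might depend on your workload": the more repetitive the output, the more drafts get accepted for free.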
> I haven't noticed it silently doing nothing for 10-15 minutes though. If it is prompt processing ik_llama.cpp will show log lines periodically, and at least it seems like opencode is showing the thinking traces so I can watch it plod along. It does think a lot before outputting the final answer.
Right! I think what's happening here is its internal "thinking" sometimes becomes a very long generation, which you can see from GPU/CPU usage patterns, and ik_llama doesn't print separate lines during a long output generation. I looked at my proxy logs and they show ~16k output tokens for some completions, while the user-facing response was ~1 line. Say it's a large context and the model outputs at about 25 tps; that's consistent with a 10 minute wait: 16,000 / (25 × 60). Unfortunately the proxy truncated those, so I couldn't dig into the full details of such output. Out of 4 projects this consistently happened on 2. Something certainly tends to trigger it, depending on what tech stack is used. My guess: the more standard the structure and stack, the less it happens. For me, this got 100% solved by turning off thinking. Not sure GLM 5 supports reasoning effort levels; last time I checked, very few models did.
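The arithmetic checks out; a one-liner to convince yourself (numbers taken from the proxy logs above, nothing else assumed):

```python
# Sanity check on the stall math: 16k hidden "thinking" tokens at ~25 tok/s
# decode speed is right around the observed 10-minute log-less hangs.
def stall_minutes(output_tokens: int, tps: float) -> float:
    return output_tokens / (tps * 60)

print(f"{stall_minutes(16_000, 25):.1f} minutes")  # → 10.7 minutes
```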
> I ended up loading a second smaller fast model fully offloaded on GPU/VRAM and make GLM-5.1 delegate subtasks to that which can help with parsing through lots of logs and such
Could you share your opencode config for that? I tried it in the past, using minimax as the fast one, but found the smart model rarely delegated.
> this cache pattern would use 64GiB of system RAM as prompt cache
Yep, that was very helpful from your previous suggestions. I've been running it with 50 GB and that was sufficient.
I did bump into another problem that could be a bug in ik_llama. With thinking off, 2 sessions actually hit the 200k context shown by opencode. But then the server started rejecting new requests with the following, so I had to restart the server:
```
======== Prompt cache: cache size: 13010, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
- looking for better prompt, base f_keep = 1.000, sim = 1.000, n_keep = 0, n_discarded_prompt = 0
- cache state: 2 prompts, 8688.282 MiB (limits: 50000.000 MiB, 0 tokens, 588234 est)
- prompt 0x7b2ad0212de0: 101377 tokens, 101375 discarded, checkpoints: 0, 8688.024 MiB
- prompt 0x7b2b5c817cb0: 838 tokens, 0 discarded, checkpoints: 0, 0.258 MiB
prompt cache load took 9.47 ms
INFO [ launch_slot_with_task] slot is processing task | tid="135692713889792" timestamp=1775920956 id_slot=0 id_task=161877
======== Cache: cache_size = 13010, n_past0 = 13010, n_past1 = 13010, n_past_prompt1 = 13010, n_past2 = 13010, n_past_prompt2 = 13010
INFO [ batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="135692713889792" timestamp=1775920956 id_slot=0 id_task=161877
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="135692713889792" timestamp=1775920956 id_slot=0 id_task=161877 p0=13009
WARN [ process_batch_tokens] failed to find free space in the KV cache, retrying with smaller batch size - try increasing it via the context size or enable defragmentation | tid="135692713889792" timestamp=1775920956 i=-4096 n_batch=4096 ret=1
```
For the spec decoding: this did increase performance, but it would segfault ik_llama at ~8k prompt size on the Q3 quant; it works till ~70-90k on IQ2. Opened a ticket here: https://github.com/ikawrakow/ik_llama.cpp/issues/1612
> I'll keep it in mind to try it with thinking off completely though if it is smart enough to write code without it! haha
It might not be as smart as with thinking on (supposedly), but it solved every problem I threw at it and provided a very consistent opencode experience for me. Even the smaller quants are good. IQ2 is very usable and runs a bit faster than the bigger ones, probably due to less GPU communication overhead (3 GPUs vs 4).
I tasked this model with patching sglang and vllm to be able to load itself from 1 merged GGUF file (which they have underoptimized experimental support for), using the more compatible bartowski GLM-5.1-IQ2_S for starters. Before I went to bed, the patch file was ~500 lines. We'll see how deep that rabbit hole goes. Curious to see TP 4 performance there.
Just wanted to chime in that I've also noticed that 5.1 likes to "think" a lot more than 5 or 4.7.
I hadn't considered the internal thinking versus the reasoning that is streamed. It does seem to pause for a bit before the output reasoning begins, yet as you say, you can see the GPU at 100%.
In terms of speedups, when I set -b to 4096 and -ub to 4096 (compared to 2048), it gives a little boost, and that's across all models.
But yeah, this is not a fast model to run locally. At 50K context, PP is 30 t/s and TG is 6 t/s.
If you run on Blackwell, you can go even higher with batch sizes. 8192/4096 batch/ubatch is what I use for an extra boost.
Without thinking + GPU offload it is usable with opencode. This is the maximum-performance version, with spec decoding on IQ2_KS, 3 GPUs, ik_llama:
**Legend:**
- Prefilled = KV cache depth
- PP = Prompt Processing t/s
- TG = Token Generation t/s
## Throughput by Context Depth
| Prefilled | PP@4096 | TG@512 |
| --------- | ------- | ------ |
| 0 | 1507.9 | 39.15 |
| 4K | 1297.0 | 40.52 |
| 16K | 925.0 | 37.74 |
| 32K | 668.1 | 34.48 |
| 64K | 421.4 | 29.39 |
> could you share your opencode config for that? I tried it in the past, using minimax as the fast one, but found the smart model rarely delegated.
I have to tell the main model to "delegate to the smaller model" and am still experimenting with this. So I'm sure the configs could be better, but honestly it is getting kind of confusing with even two models and two ssh tunnels to manage (mostly due to the opencode.json config being confusing to me).
I hate how the config is spread out all over, as I am running the opencode client in a docker container for some isolation. Perhaps a .opencode folder would work to keep it all "project"-local, but I haven't migrated it all yet as I feel like they have changed quite a bit in just the past weeks.
Anyway, I just drop this file in the $(pwd) where I launch opencode in TUI mode. I need to add pricing estimates to GLM-5.1 too (I like to use the Claude Opus 4.6 pricing to see how much $$$ I'm "saving" running local)
`opencode.json`:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "share": "disabled",
  "autoupdate": false,
  "experimental": {
    "openTelemetry": false
  },
  "permission": {
    "websearch": "allow",
    "webfetch": "deny",
    "todo": "deny",
    "todoread": "deny",
    "todowrite": "deny",
    "doom_loop": "deny"
  },
  "disabled_providers": [
    "exa"
  ],
  "lsp": false,
  "model": "cpurig/GLM-5.1",
  "agent": {
    "plan": {
      "description": "Analysis and planning without making changes",
      "mode": "primary",
      "model": "gpurig/Qwen3.5-122B-A10B",
      "prompt": "{file:./prompts/system.md}",
      "permission": {
        "edit": "deny",
        "bash": "deny"
      },
      "temperature": 1.0,
      "top_p": 0.95,
      "options": {
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
        "max_new_tokens": 163840
      }
    },
    "build": {
      "description": "Default agent with all tools enabled for development work",
      "mode": "primary",
      "model": "cpurig/GLM-5.1",
      "prompt": "{file:./prompts/system.md}",
      "temperature": 1.0,
      "top_p": 0.95,
      "options": {
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 0.0,
        "repetition_penalty": 1.0,
        "max_new_tokens": 163840
      }
    },
    "webfetch": {
      "description": "Fast agent for web fetching and content processing",
      "mode": "subagent",
      "model": "gpurig/Qwen3.5-122B-A10B",
      "prompt": "You are a fast web content processor. Fetch, summarize, and extract relevant information from web pages efficiently.",
      "permission": {
        "edit": "deny",
        "bash": "allow"
      }
    },
    "general": {
      "description": "A general-purpose agent for researching complex questions and executing multi-step tasks",
      "mode": "subagent",
      "model": "gpurig/Qwen3.5-122B-A10B",
      "prompt": "You are a helpful assistant for complex research and multi-step tasks.",
      "permission": {
        "edit": "deny",
        "bash": "allow"
      }
    },
    "explore": {
      "description": "A fast, read-only agent for exploring codebases. Use for quickly finding files, searching code, or answering questions about the codebase.",
      "mode": "subagent",
      "model": "gpurig/Qwen3.5-122B-A10B",
      "prompt": "You are a read-only code explorer. You can search and analyze but cannot modify files.",
      "permission": {
        "edit": "deny",
        "bash": "allow"
      }
    }
  },
  "provider": {
    "gpurig": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "gpurig",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "timeout": 99999999999
      },
      "models": {
        "Qwen3.5-122B-A10B": {
          "name": "Qwen3.5-122B-A10B",
          "limit": {
            "context": 196608,
            "output": 65536
          },
          "cost": {
            "input": 5.0,
            "output": 25.0
          },
          "temperature": true,
          "reasoning": true,
          "tool_call": true,
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          }
        }
      }
    },
    "cpurig": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "cpurig",
      "options": {
        "baseURL": "http://localhost:8088/v1",
        "timeout": 99999999999
      },
      "models": {
        "GLM-5.1": {
          "name": "GLM-5.1",
          "limit": {
            "context": 81920,
            "output": 65536
          },
          "temperature": true,
          "reasoning": true,
          "tool_call": true,
          "modalities": {
            "input": ["text"],
            "output": ["text"]
          }
        }
      }
    }
  }
}
```
Thanks for the update on caching and spec decoding, and I'll peep that issue on ik! Cheers!
I realize now I've seen you around on AesSedai's nice quant repos. You have a big Mac iirc and tend to use mainline with UD quants (or the higher quality AesSedai quants when available), pretty sure?
As I mentioned in another thread, ik has a mac (m2 maybe?) and tends to add some support for ARM NEON if that is what your rig has. I'd be curious to know if you can get ik_llama.cpp working, and especially with any of my quants! Cheers, and thanks for sharing your findings; there aren't a ton of folks running these big models, so I'm happy to share experiences across the various communities and ecosystems.
> I have to tell the main model to "delegate to the smaller model" and am still experimenting with this. So I'm sure the configs could be better, but honestly it is getting kind of confusing with even two models and two ssh tunnels to manage (mostly due to the opencode.json config being confusing to me).
For the longest time it was a pain to have to manage multiple models all running on different ports etc. That's why I created my own little llama-router (even though llama.cpp has a routing mode, it wasn't exactly what I needed). When I add a new model, it auto-assigns a new port, so I don't ever have to worry about that anymore. And it can use whatever names are easy to remember for the model.
In terms of access, I have my Mac Studio sitting in a closet, headless, running Tailscale. So with my llama-router, I can just use my Tailscale IP and a single port, and it starts and routes commands to llama-server. Handy, because this way I can also set a specific llama-server build for a model (e.g. for fixes that aren't in mainline yet). Also, when Roo Code works with images, it often uses the WebP format, so in my router I convert all incoming images to JPEG to be llama.cpp friendly.
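For anyone wanting a similar WebP-to-JPEG shim in their own proxy, something like the following is roughly all it takes. This is a hypothetical sketch, not the actual router code: the `is_webp`/`to_jpeg` names are made up, and the re-encode step assumes Pillow is available.

```python
# Hypothetical sketch of an image shim: sniff WebP payloads by their RIFF
# magic bytes and re-encode to JPEG before forwarding to llama-server.
# Illustrative only; the real router's internals may differ.
import io

def is_webp(data: bytes) -> bool:
    # WebP files are RIFF containers: "RIFF" <size> "WEBP"
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"

def to_jpeg(data: bytes) -> bytes:
    if not is_webp(data):
        return data  # pass PNG/JPEG through untouched
    from PIL import Image  # assumed dependency for the re-encode
    img = Image.open(io.BytesIO(data)).convert("RGB")
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=90)
    return out.getvalue()
```

In practice this would run on each image part of an incoming chat request (base64-decoding data URLs first) before the body is forwarded upstream.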
My OpenCode json:
"$schema": "https://opencode.ai/config.json",
"enabled_providers": ["llama.cpp"],
"share": "disabled",
"autoupdate": true,
"provider": {
"llama.cpp": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-router",
"options": {
"baseURL": "http://tailscaleIP:8080/v1"
},
"models": {
"Qwen3-Coder-Next-UD-Q8_K_XL-unsloth": {
"name": "Qwen3-Coder-Next",
"limit": {
"context": 225000,
"output": 65536
}
},
"Step-3.5-Flash-Q8-mradermacher": {
"name": "Step-3.5-Flash",
"limit": {
"context": 225000,
"output": 32768
}
},
"MiniMax-M2.5-UD-Q8_K_XL-unsloth": {
"name": "MiniMax-M2.5",
"limit": {
"context": 185000,
"output": 32678
}
}
}
}
}
}
> In terms of access, I have my Mac Studio sitting in a closet, headless, running Tailscale. So with my llama-router, I can just use my Tailscale IP, and a single port, and it starts and routes commands to llama-server. Handy because this way I can also set a specific llama-server build for a model (e.g. for fixes that aren't in mainline yet). Also, when Roo Code works with images, it often uses the WebP format, and so in my router, I convert all incoming images to JPEG to be llama.cpp friendly.
Nice looking tool.
> Anyway, I just drop this file in the $(pwd) where I launch opencode in TUI mode. I need to add pricing estimates to GLM-5.1 too (I like to use the Claude Opus 4.6 pricing to see how much $$$ I'm "saving" running local)
In my case there are so many sources of traffic that I don't bother setting costs per tool. Everything is measured at the proxy level, then a dashboard pulls in all the stats. Here is my favorite widget -
I spent all day fussing with opencode because it was taking a long time just to generate the title of the thread. Turns out it had thinking enabled, so even with a small 50 token prompt it would spit out 1k+ tokens just to return a 5 word title haha...
Along the way, I noticed some `clear_thinking` argument too which I don't fully understand, but it's in the official documentation here: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking

So I could set that to `false` for both thinking and non-thinking mode, I guess, given I'm doing mostly vibe coding?
It is nice being able to dynamically enable/disable thinking per request just by hitting tab now too! (documented a config example here: https://github.com/ggml-org/llama.cpp/issues/20182#issuecomment-4230494838)
I do that at the proxy level now. Glm-5.1-q3-nt defaults to no thinking, so tools don't need to know. GLM 5.1 outputs a lot more thinking than 5.0 by default; feels like minimax in that way.
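For anyone wanting to replicate the proxy-level default, here is a minimal sketch of the request rewrite. It assumes an OpenAI-compatible endpoint that honors `chat_template_kwargs` with an `enable_thinking` flag, which is an assumption on my part; check what your server's chat template actually accepts.

```python
# Hypothetical proxy-side rewrite: default a chat completion request to
# "thinking off" unless the client explicitly asked otherwise.
# The kwarg name "enable_thinking" is an assumption; verify it against
# your server's chat template.
import json

def default_thinking_off(raw_body: bytes) -> bytes:
    body = json.loads(raw_body)
    kwargs = body.setdefault("chat_template_kwargs", {})
    kwargs.setdefault("enable_thinking", False)  # explicit client choice wins
    return json.dumps(body).encode()
```

Because `setdefault` is used, a client that explicitly sends `"enable_thinking": true` still gets thinking, so the tab-toggle in opencode keeps working through the proxy.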
Speaking of which, I'm fussing with MiniMax-M2.7 now but am seeing some gibberish in early testing even with the q8_0 on ik, though mainline seems to be inferencing fine. I'll try a few more commands to see if it's some issue, test the older 2.5 to see if it is a regression, and open a ticket over there.. hrmm
UPDATE: Seems like `-muge` on ik_llama.cpp is not working on MiniMax-M2.5 or the newer M2.7... I'll open an issue over there! Seems to work fine without `-muge`.
> I do that at the proxy level now. Glm-5.1-q3-nt defaults to no thinking so tools don't need to know. GLM 5.1 outputs a lot more thinking than 5.0 by default; feels like minimax in that way.
Haha! You should see Step-3.5 MidBase, what a chatty bastard that is! But, nonetheless, very, very productive and trustworthy!
Minimax drove me nuts with the thinking. Good problems to have, obviously, but it's too fast on my hardware and feels like a schizophrenic driving a Bugatti at 400 km/h on a dirtbike trail while having an episode. Awesome model, but I had to adjust workflows so as not to see that.
> speaking of which, i'm fussing with MiniMax-M2.7 now but am seeing some gibberish in early testing even with the q8_0 on ik, but mainline seems to be inferencing.. i'll try a few more commands to see if some issue and test older 2.5 to see if it is a regression and open a ticket over there.. hrmm
>
> UPDATE: Seems like `-muge` on ik_llama.cpp is not working on MiniMax-M2.5 or newer M2.7 ... I'll open an issue over there! Seems to work fine without `-muge`
Thanks for sharing! Would it be possible to have an AWQ quant? Or maybe someone else is on it? 2.5 was really good and people can run them on vllm at awesome speeds/concurrency. I feel bad for asking, as I could probably delete something "invaluable" on my nvme and do it myself. I don't recall if that process is RAM hungry though.
> Haha! You should see Step-3.5 MidBase, what a chatty bastard that is! But, nonetheless, very, very productive and trustworthy!
How would you say it compares to minimax, kimi and glm?
> Seems like `-muge` on ik_llama.cpp is not working on MiniMax-M2.5 or newer M2.7 ... I'll open an issue over there! Seems to work fine without `-muge`
I've also found that `-khad` and `-vhad` are no joy with the MiniMax family! ikawrakow warned us that there might be models that take no joy from KV rotation...
It's a bit curious; I've narrowed it down and opened an issue here: https://github.com/ikawrakow/ik_llama.cpp/issues/1624
You can in fact use `-khad` but not `-vhad` in combination with `-sm graph`. Or you can use `-sm layer` with both `-khad` and `-vhad`. Don't use `-muge` at all for now with MiniMax, though.
> Would it be possible to have awq quant? Or maybe someone else is on it? 2.5 was really good and people can run them on vllm at awesome speeds/concurrency.
Here is what I'm getting with 96GB VRAM `-sm graph` "tensor parallel" on 2x A6000s, full offload of my latest IQ2_KS 69.800 GiB (2.622 BPW), with a random opencode test (it is using tools just fine):
```
prompt eval time = 2035.56 ms / 3314 tokens ( 0.61 ms per token, 1628.05 tokens per second)
       eval time = 5757.65 ms / 296 tokens  (19.45 ms per token,   51.41 tokens per second)
      total time = 7793.22 ms / 3610 tokens
```
I haven't tried parallel slots, as vLLM generally beats the llama world there. But the speed for single-slot inference is likely comparable with vLLM when using ik's `-sm graph`. I'd like to do a llama-sweep-bench comparison at some point too.
Also confirmed you can disable thinking via opencode on the client side. So just pushing tab to enable/disable thinking is great. Here is an example where I save a bunch of time by disabling thinking when it is generating the 5 word thread title: https://github.com/ggml-org/llama.cpp/issues/20182#issuecomment-4230494838
> Here is what I'm getting with 96GB VRAM `-sm graph` "tensor parallel" on 2x A6000s full offload of my latest IQ2_KS 69.800 GiB (2.622 BPW) with a random opencode test (it is using tools just fine):
Thanks for the reminder! I totally forgot minimax has `-sm` support in ik. Will check your quant for it.
Going way back in this thread:
> ik has a mac (m2 maybe?) and tends to add some support for ARM NEON if that is what your rig has. I'd be curious to know if you can get ik_llama.cpp working, and especially with any of my quants!
I built ik_llama.cpp and ran llama-bench with your MM2.5 IQ4 model:
```
./llama-bench -m /Volumes/NBU/ai-models/MiniMax-M2.5-mainline-IQ4_NL-ubergarm-GGUF/MiniMax-M2.5-mainline-IQ4_NL-00001-of-00004.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
```
| model | size | params | backend | ngl | threads | n_ubatch | test | t/s |
|---|---|---|---|---|---|---|---|---|
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | Metal | 999 | 1 | 2048 | pp2048 | 17.69 Β± 0.25 |
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | Metal | 999 | 1 | 2048 | pp8192 | 17.75 Β± 0.16 |
Note: I stopped it after those two because things didn't seem right. It seemed really slow, and surprisingly it was using the GPU, which I didn't think was supported for Mac ik_llama.
Then via mainline llama-bench:
```
llama-bench -m /Volumes/NBU/ai-models/MiniMax-M2.5-mainline-IQ4_NL-ubergarm-GGUF/MiniMax-M2.5-mainline-IQ4_NL-00001-of-00004.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
load_backend: loaded BLAS backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-blas.so
ggml_metal_library_init: using embedded metal library
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 498216.21 MB
load_backend: loaded MTL backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-metal.so
load_backend: loaded CPU backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-cpu-apple_m2_m3.so
```
| model | size | params | backend | threads | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | BLAS,MTL | 1 | 2048 | 1 | pp2048 | 831.60 Β± 6.01 |
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | BLAS,MTL | 1 | 2048 | 1 | pp8192 | 680.10 Β± 0.76 |
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | BLAS,MTL | 1 | 2048 | 1 | pp16384 | 541.03 Β± 0.33 |
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | BLAS,MTL | 1 | 2048 | 1 | pp32768 | 383.69 Β± 0.45 |
| minimax-m2 230B.A10B IQ4_NL - 4.5 bpw | 121.23 GiB | 228.69 B | BLAS,MTL | 1 | 2048 | 1 | tg128 | 53.47 Β± 0.41 |
| build: 15f786e65 (8680) |
Thanks, interesting that it does actually run. I'm not up to date on the latest Mac-specific tweaks or things to try with ik though, but thanks for giving it a go. If you are interested, you could open an issue on ik_llama.cpp with your findings; perhaps there is another way to compile it. But yeah, I don't think Mac support is prioritized on ik_llama.cpp so much as CUDA support.
> Along the way, I noticed some `clear_thinking` argument too which I don't fully understand, but it's in the official documentation here: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking
>
> So I could set that to `false` for both thinking and non-thinking mode I guess given I'm doing mostly vibe coding?
I've also been looking at the jinja template, and it seems this is on by default:
```jinja
{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content is defined -%}
    {{ '<think>' + reasoning_content + '</think>'}}
{%- else -%}
    {{ '</think>' }}
{%- endif -%}
```
I think for coding it is actually correct to leave this on though? I also think you are supposed to set up opencode to use "interleaved" thinking like this:
"interleaved": {
"field": "reasoning_content",
}
(I got this by reading the schema https://opencode.ai/config.json)
but I've also seen several people post about using this:
"options": {
"interleaved": {
"field": "reasoning_content"
}
}
I think they should rename opencode to "opaquecode" because half the time I literally have no idea what I am supposed to be setting nor what is actually getting sent.
> I think they should rename opencode to "opaquecode" because half the time I literally have no idea what I am supposed to be setting nor what is actually getting sent.
You need to experience total darkness with ClaudeCode before appreciating Opencode. Also, it's partly down to the TUI settings and the verbosity of its routines.
I've experienced far more verbose TUIs (like oh-my-pi / imp) but they quickly become tiring and very hard to follow, and the effects on token usage are HUGE!
I have OpenCode and ClaudeCode installed, and occasionally try them, but quickly end up going back to Roo Code via VSCode. Everyone talks about CC or OC but I'm struggling to see the allure of CLI tools.
I have managed to fix the template bug that causes ik_llama.cpp to constantly output:

`render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.`
This:

```jinja
{%- if 'function' in tool -%}
    {%- set tool = tool['function'] -%}
{%- endif -%}
```

needs changing to this:

```jinja
{%- if tool.function is defined -%}
    {%- set tool = tool.function -%}
{%- endif -%}
```
Not sure if it is really having a negative effect, but that message getting spammed was irritating...
Which template is that?
I couldn't find those in https://github.com/ikawrakow/ik_llama.cpp/blob/main/models/templates/GLM-4.6.jinja
I only use the original official chat template in all my GGUFs, you can find them on the original safetensors repos like so: https://huggingface.co/zai-org/GLM-5.1/blob/main/chat_template.jinja
@ubergarm I rebuilt ik_llama and added the flag that turns off Metal support.
It seems like (ik) llama-bench doesn't work very well in my case (either with or without Metal support), as it was showing 12 t/s PP.
So I thought why not run llama-server just to test via the webui, and the performance, while still much slower than mainline llama.cpp, was not as atrocious as llama-bench would suggest:
```
Model: MiniMax-M2.5-mainline-IQ4_NL-00001-of-00004.gguf
prompt eval time =  8588.33 ms / 1606 tokens ( 5.35 ms per token, 187.00 tokens per second)
       eval time = 14173.50 ms /  278 tokens (50.98 ms per token,  19.61 tokens per second)
```
When it was working, I could see all 24 performance cores at 100%, and the GPU at 0%.
> I have managed to fix the template bug that causes `ik_llama.cpp` to constantly output "render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues." This: `{%- if 'function' in tool -%} {%- set tool = tool['function'] -%} {%- endif -%}` needs changing to this: `{%- if tool.function is defined -%} {%- set tool = tool.function -%} {%- endif -%}` Not sure if it is really having a negative effect, but that message getting spammed was irritating...
Tried it, and it worked at first, but at some point I started getting sporadic loops like this; reverted to the default template and didn't see it again in a few sessions:
```
~ Writing command...
The bash tool was called with invalid arguments: [
  {
    "expected": "string",
    "code": "invalid_type",
    "path": [
      "command"
    ],
    "message": "Invalid input: expected string, received undefined"
  }
].
Please rewrite the input so it satisfies the expected schema.
~ Writing command...
The bash tool was called with invalid arguments: [
  {
    "expected": "string",
    "code": "invalid_type",
    "path": [
      "command"
    ],
    "message": "Invalid input: expected string, received undefined"
  }
].
Please rewrite the input so it satisfies the expected schema.
~ Writing command...
The bash tool was called with invalid arguments: [
  {
    "expected": "string",
    "code": "invalid_type",
    "path": [
      "command"
    ],
    "message": "Invalid input: expected string, received undefined"
  }
].
Please rewrite the input so it satisfies the expected schema.
```
Ah weird, I've been running copies of this for the last couple of days and not seen that happen yet (or any other tool calls fail AFAIK).
It's probably best to just keep using the old template and when this gets merged it should go away:
https://github.com/ikawrakow/ik_llama.cpp/pull/1376#issuecomment-4240671456