Thank you, team Qwen, for a 120B LLM

#3
by rtzurtz - opened

I have 64 GB RAM + 12 GB VRAM (bought in 2023, before LLMs were a thing in the sense they are now), and Gpt-Oss-120B replaced Qwen3-30B for me some time ago. Qwen3-235B and other similarly sized LLMs don't fit. This is the first LLM since Gpt-Oss-120B that I'm going to try out in a 4-5-bit GGUF quant. I wonder how a 4-bit QAT version would perform, like the Gpt-Oss-120B one. Thanks again.
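
As a back-of-envelope check of why a ~120B model fits in 64 GB RAM + 12 GB VRAM while a 235B model doesn't, the quantized weight size can be estimated from parameter count and bits per weight. A minimal sketch (my own rough arithmetic, not from llama.cpp; it ignores KV cache, activations, and runtime overhead, and assumes ~4.5 effective bits per weight for a 4-5-bit GGUF quant):

```python
# Rough estimate of quantized model memory footprint versus the
# combined 64 GB RAM + 12 GB VRAM budget. Assumption-laden sketch:
# ignores KV cache, activations, and runtime overhead.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

budget_gb = 64 + 12  # system RAM plus VRAM

for name, params in [("120B", 120), ("235B", 235)]:
    size = quant_size_gb(params, 4.5)  # ~4-5-bit GGUF quant
    print(f"{name}: ~{size:.0f} GB, fits: {size < budget_gb}")
```

With these assumptions the 120B quant lands around 68 GB (fits, barely) and the 235B quant around 132 GB (doesn't), which matches the experience above.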

May I ask what you use to run inference with such big models, and how many tokens per second you get? ^^

I have been using this command: `llama-server -m 'gpt-oss-120b-mxfp4-00001-of-00003.gguf' --n_gpu_layers 99 --n-cpu-moe 32 --threads 4 --temp 1.0 --top-k 0 --top-p 1.0 -c 8192 --chat-template-kwargs '{"reasoning_effort": "medium"}' --jinja --no-warmup` (Used Dedicated Memory: 90%), and I'm getting 17.5 tokens per second for the first 1000 tokens.

Some time ago llama.cpp added the automatic fit option(s) and enabled them by default. Now `--n_gpu_layers 99 --n-cpu-moe 32` is not needed anymore, and without these flags I'm getting 17.0 tokens per second for the first 1000 tokens and a Used Dedicated Memory of 85-86%.

After stopping inference at the first 1000 tokens and running the same prompt again, I'm getting over 19 t/s with both commands (my prompt was such that the output differs each run, but maybe something was still cached).

Using `--n_gpu_layers` and `--n-cpu-moe` to manually offload a little more to VRAM naturally gives a slightly higher t/s, but I seem to remember there was an issue with running out of VRAM at some point, so I'd recommend removing them. Use them if you want some VRAM left over for something else, so that not all of it is used.

PS: Since I can't fit the good 4-bit quants of 122B-A10B, I went to test Qwen3.5-27B for now (as a dense LLM it performs better than its parameter count would suggest versus a MoE LLM). I may still try the 122B later.
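
One hedged reading of why a dense 27B can punch above its weight against a MoE: assuming the usual "122B-A10B" naming convention (122B total parameters, ~10B active per token), per-token compute is roughly proportional to active parameters, and the dense model activates everything:

```python
# Sketch comparing per-token compute (proportional to active params)
# for a MoE "122B-A10B" versus a dense 27B model. Assumes the common
# naming convention: 122B total, ~10B active per token.

moe_total_b, moe_active_b = 122, 10
dense_b = 27

print(f"MoE active params per token: {moe_active_b}B of {moe_total_b}B total")
print(f"Dense active params per token: {dense_b}B")
print(f"Dense does ~{dense_b / moe_active_b:.1f}x the per-token compute")
```

So the dense 27B spends roughly 2.7x the compute per token, at the cost of being slower than a MoE with the same total size would be.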
