Thank you, team Qwen, for a 120B LLM

#3
by rtzurtz - opened

I have 64 GB RAM + 12 GB VRAM (bought in 2023, before LLMs were a thing in the sense they are now), and Gpt-Oss-120B replaced Qwen3-30B for me some time ago. Qwen3-235B and other similarly sized LLMs don't fit. This is the first LLM since Gpt-Oss-120B that I'm going to try out in a 4-5-bit GGUF quant. I wonder how a 4-bit QAT version would perform, like the Gpt-Oss-120B one. Thanks again.
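
As a back-of-envelope check of why a ~120B model fits in 64 GB RAM + 12 GB VRAM while a 235B model doesn't, the quantized weight size can be estimated from parameter count and bits per weight. A minimal sketch (my own rough arithmetic, not from llama.cpp; it ignores KV cache, activations, and runtime overhead, and assumes ~4.5 effective bits per weight for a 4-5-bit GGUF quant):

```python
# Rough estimate of quantized model memory footprint versus the
# combined 64 GB RAM + 12 GB VRAM budget. Assumption-laden sketch:
# ignores KV cache, activations, and runtime overhead.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

budget_gb = 64 + 12  # system RAM plus VRAM

for name, params in [("120B", 120), ("235B", 235)]:
    size = quant_size_gb(params, 4.5)  # ~4-5-bit GGUF quant
    print(f"{name}: ~{size:.0f} GB, fits: {size < budget_gb}")
```

With these assumptions the 120B quant lands around 68 GB (fits, barely) and the 235B quant around 132 GB (doesn't), which matches the experience above.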

May I ask what you use to run inference with such big models, and how many tokens per second you get? ^^

I have been using this command: `llama-server -m 'gpt-oss-120b-mxfp4-00001-of-00003.gguf' --n_gpu_layers 99 --n-cpu-moe 32 --threads 4 --temp 1.0 --top-k 0 --top-p 1.0 -c 8192 --chat-template-kwargs '{"reasoning_effort": "medium"}' --jinja --no-warmup` (Used Dedicated Memory: 90%), and I'm getting 17.5 tokens per second for the first 1000 tokens.

Some time ago llama.cpp added the automatic fit option(s) and enabled them by default. Now `--n_gpu_layers 99 --n-cpu-moe 32` is not needed anymore, and without these flags I'm getting 17.0 tokens per second for the first 1000 tokens and a Used Dedicated Memory of 85-86%.

After stopping inference at the first 1000 tokens and running the same prompt again, I'm getting over 19 t/s with both commands (my prompt was such that the output differs each run, but maybe something was still cached).

Using `--n_gpu_layers` and `--n-cpu-moe` to manually offload a little more to VRAM naturally gives a slightly higher t/s, but I seem to remember there was an issue with running out of VRAM at some point, so I'd recommend removing them. Use them if you want some VRAM left over for something else, so that not all of it is used.

PS: Since I can't fit the good 4-bit quants of 122B-A10B, I went to test Qwen3.5-27B for now (as a dense LLM it performs better than its parameter count would suggest versus a MoE LLM). I may still try the 122B later.
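
One hedged reading of why a dense 27B can punch above its weight against a MoE: assuming the usual "122B-A10B" naming convention (122B total parameters, ~10B active per token), per-token compute is roughly proportional to active parameters, and the dense model activates everything:

```python
# Sketch comparing per-token compute (proportional to active params)
# for a MoE "122B-A10B" versus a dense 27B model. Assumes the common
# naming convention: 122B total, ~10B active per token.

moe_total_b, moe_active_b = 122, 10
dense_b = 27

print(f"MoE active params per token: {moe_active_b}B of {moe_total_b}B total")
print(f"Dense active params per token: {dense_b}B")
print(f"Dense does ~{dense_b / moe_active_b:.1f}x the per-token compute")
```

So the dense 27B spends roughly 2.7x the compute per token, at the cost of being slower than a MoE with the same total size would be.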
