What are your tokens/s? And what about rewriting MiniMax m2.7? Great work man!

#2
by Arkovski - opened

Hi bro,

I have a Ryzen 395 with 128GB, and with 128k context I am getting about 30 input tokens/s and 6 output tokens/s.

Here is the command:
RADV_PERFTEST=nogttspill ./build-vulkan/bin/llama-server --device Vulkan0 -m "/media/arkovski/7766b08a-3b4c-4976-bccd-08d2e86e5660/AI/models/Qwen3.5-397B_IQ3_XXS/Qwen3.5-397B-A17B-IQ3_XXS-00001-of-00004.gguf" -ngl 52 -fit on -fa on --n-cpu-moe 4 -c 131072 -b 2048 -ub 1024 -t 12 -tb 28 -ctk q8_0 -ctv q8_0 --temp 0.3 --top-p 0.85 --top-k 20 --min-p 0.05 --mmap --no-warmup

Maybe you can do something similar for MiniMax M2.7? Theoretically there are IQ4_XS and IQ4_NL versions (~110 GB), but maybe you can improve the quality and performance as you did with this Qwen?

And in general you did a great job, thanks :D

PS. Off topic:
I am waiting for ROCm support with the new Ubuntu (26.04), because ROCm 7.2.1 currently sucks, so I am forced to use Vulkan.

Thanks for the compliment!

I might give it a shot on Minimax 2.7 later. Meanwhile I suggest checking out @AesSedai and @ubergarm quants since all I do here is adapt their recipes!

And to answer your question: I get about 21 tokens/second generation and 190 tokens/second processing with an empty context. My hardware is an M1 Ultra with 128GB.

There are a few discussions on mine and Aes' repos for people shopping for a sub-128GB mainline-Vulkan-compatible quant for MiniMax.

Your best bet is probably Aes' IQ4_XS, but I'd be super happy if Tarruda decided to try a mainline IQ4_NL or something. (Note: I think I had a typo in my "secret recipe" — the final output should be q6_K. I've fixed it below.)

But it is going to come really close to filling up 128GB of RAM already, and MiniMax takes more space for KV cache than Qwen3.5 does, so Aes' IQ4_XS is probably about right. Then use -ctk q5_0 -ctv q4_0 if you're desperate for more context haha..
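If you want a rough number for what those cache types buy you: each token stores K and V per layer, and ggml's block formats cost 34 bytes per 32 values for q8_0, 22 for q5_0, and 18 for q4_0. Here's a back-of-the-envelope sketch in bash — the layer/head/dim numbers below are placeholders, not MiniMax's real config:

```shell
# ggml KV-cache block costs (bytes per 32 values):
#   q8_0 = 34, q5_0 = 22, q4_0 = 18
# Placeholder model dimensions -- substitute the real ones from the GGUF.
n_layers=62
n_kv_heads=8
head_dim=128
ctx=131072

# Total KV-cache bytes: (K cost + V cost) over every layer/head/position.
kv_bytes() {  # $1 = bytes per 32-value block for K, $2 = for V
  echo $(( n_layers * n_kv_heads * head_dim * ctx / 32 * ($1 + $2) ))
}

echo "q8_0/q8_0 : $(( $(kv_bytes 34 34) / 1024 / 1024 )) MiB"
echo "q5_0/q4_0 : $(( $(kv_bytes 22 18) / 1024 / 1024 )) MiB"
```

With these placeholder dimensions the q5_0/q4_0 pair takes roughly 60% of the q8_0 cache; plug in the model's actual layer count, KV head count, and head dimension for a real estimate.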

(you can convert this to the mainline llama-quantize style format).

#!/usr/bin/env bash

custom="
# 62 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_nl

# Non-Repeating Layers
token_embd\.weight=q4_K
output\.weight=q6_K
"

# Strip the comment lines and join the rules into the single
# comma-separated string that --custom-q expects.
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Pin to one NUMA node; the trailing 128 is the thread count.
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/imatrix-MiniMax-M2.5-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-mainline-IQ4_NL.gguf \
    IQ4_NL \
    128
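For anyone adapting the recipe: the grep/sed pipeline above just drops the comment lines and joins the remaining rules into the one comma-separated string --custom-q expects (it relies on GNU sed's -z). A standalone demo with two placeholder rules:

```shell
custom="
# comment lines are dropped
blk\..*\.attn_q.*=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_nl
"

# Same pipeline as the recipe: remove '#' lines, squeeze runs of
# newlines into commas, then trim the leading/trailing comma.
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$custom"
```

This prints the two rules joined by a single comma, ready to paste after --custom-q.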
