Mistral-Small-4-119B-2603-Q5_K_M on 8x RTX 3090 with ik_llama.cpp (compiled 21 March 2026)

#1
by martossien - opened

Hi everyone,

I wanted to share a working local multi-GPU setup for AesSedai/Mistral-Small-4-119B-2603-GGUF on 8x RTX 3090 (24 GB each), using ik_llama.cpp.

Machine:

  • 8x RTX 3090 24 GB
  • one GPU is also driving GNOME/display
  • local serving with ik_llama.cpp

Working command:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/AesSedai/Mistral-Small-4-119B-2603-GGUF/Mistral-Small-4-119B-2603-Q5_K_M-00001-of-00003.gguf --alias Mistral-Small-4-119B-2603-Q5_K_M --host 0.0.0.0 --port 8080 --ctx-size 231424 --no-mmap --threads 32 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 --parallel 1 --flash-attn on --n-gpu-layers 999 --tensor-split 0.8,1,0.8,1,0.8,1,0.8,1 --merge-qkv --cache-type-k q8_0 --cache-type-v q6_0 --k-cache-hadamard --graph-reuse --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --split-mode graph -ot 'blk.(18|19).ffn_down_exps.weight=CUDA0' -ot 'blk.(20|21).ffn_down_exps.weight=CUDA6' -ot 'blk.(31|32).ffn_down_exps.weight=CUDA4' -ot 'blk.4.ffn_down_exps.weight=CUDA2' -ot 'blk.13.ffn_down_exps.weight=CUDA4' -ot 'blk.22.ffn_down_exps.weight=CUDA2' -ot 'blk.0.ffn_down_exps.weight=CUDA4' -ot 'blk.27.ffn_down_exps.weight=CUDA4'

Results on my side:

  • prompt processing: roughly 100 to 2000 tok/s depending on request shape
  • generation: roughly 35 to 80 tok/s
  • reasoning_effort=high works correctly through chat-template kwargs ( --chat-template-kwargs '{"reasoning_effort":"high"}' )

Important note:
--split-mode graph does not actually work here in ik_llama.cpp for this model / GGUF and falls back to layer mode, so the balancing work was done mainly with tensor-split + targeted -ot overrides.
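Since the coarse balance comes entirely from the --tensor-split ratios, one way to sanity-check them is to derive the weights from the free VRAM on each card. A minimal sketch with made-up memory figures (the real numbers depend on what your display GPU is holding, so these are placeholders, not measurements):

```python
# Hypothetical sketch: derive --tensor-split weights from free VRAM per GPU.
# The GiB figures below are illustrative placeholders, not measured values.
free_gib = [19.0, 24.0, 19.0, 24.0, 19.0, 24.0, 19.0, 24.0]  # cards driving a display have less headroom

# Normalize against the largest card so the weights read like the 0.8,1,... pattern above.
weights = [round(f / max(free_gib), 2) for f in free_gib]
print(",".join(str(w) for w in weights))  # -> 0.79,1.0,0.79,1.0,0.79,1.0,0.79,1.0
```

The lower ratios on the odd cards play the same role as the 0.8 entries in the command line: they reserve headroom on the GPUs that carry extra allocations.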

Big thanks to AesSedai for the GGUF release — this made local testing much easier.

[nvitop screenshot showing per-GPU VRAM usage]

Interesting data points, looks like a very snug fit! I imagine @ubergarm will have IK-specific quants coming out when it's supported there too :)

oh hello @martossien glad to see you finding good quants for your rig!

ik has had support since day 1, but i've not gotten around to cooking this model, so u can blame me for that haha...

and yes i don't think there is -sm graph support for mistral-small-4 as that is per model arch.

cheers and happy weekend all!

Sorry for my ignorance: is that specific -ot really needed? The model is barely 85 GB, so is that really such a tight fit in 192 GB of VRAM? Or am I missing something?
I have a very similar setup to yours and I haven't tested the model so far, but I still find the command line a bit overloaded. Happy to learn if I've missed anything!

Thanks a lot to both AesSedai and ubergarm, and also thanks for the feedback. I'm based in France, so I naturally have a soft spot for Mistral models, and professionally I also tend to follow them closely because they are very relevant for my work (and Q5 as a minimum for quality).

About the command line: the issue was not the total 192 GB of VRAM, but the per-GPU balance. With only --tensor-split 0.8,1,0.8,1,0.8,1,0.8,1, the model loaded and the beginning of the context was fine, but after some usage the VRAM distribution was still uneven enough that one GPU would eventually hit OOM and crash.

So the -ot rules were not added because the model “does not fit” globally, but because I needed to rebalance specific tensors across GPUs. In practice, tensor-split gave me the coarse placement, and the -ot overrides let me move some heavy expert tensors to get a much more even VRAM layout across the 8 cards. You can actually see that in the nvitop screenshot I posted.
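For anyone unfamiliar with the -ot flag: each override is a regex matched against tensor names, mapped to a CUDA device, and the first matching rule wins. A small sketch of that matching logic (the pattern syntax mirrors the command line above; the fallback name "split" is my own placeholder for "whatever tensor-split decided"):

```python
import re

# Illustrative subset of the -ot rules from the command above: regex -> device.
overrides = {
    r"blk.(18|19).ffn_down_exps.weight": "CUDA0",
    r"blk.4.ffn_down_exps.weight": "CUDA2",
}

def placement(tensor_name, default="split"):
    """Return the device for a tensor: first matching override, else the tensor-split default."""
    for pattern, device in overrides.items():
        if re.search(pattern, tensor_name):
            return device
    return default

print(placement("blk.18.ffn_down_exps.weight"))  # CUDA0 (pinned by override)
print(placement("blk.5.ffn_down_exps.weight"))   # split (falls back to the tensor-split layout)
```

Note the unescaped dots in the patterns match any character, which is harmless here since tensor names only differ in the block numbers.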

So yes, the command is a bit overloaded, but in this case it is mostly a stability fix for a very tight multi-GPU setup, not just unnecessary tweaking.

I think this could be pushed even further by optimizing at the layer/tensor level and also by taking GPU topology / NVLink into account. In my case I only have 2 NVLink bridges out of 4 possible pairs, so I kept the setup focused on stability and balance first.

And yes, as ubergarm mentioned, graph split does not seem to work for this model architecture in ik_llama.cpp, so most of the useful work here came from tensor-split plus targeted tensor overrides.

Thanks again to AesSedai for the GGUF release, and to ubergarm.

Quick question: what TUI or coding tool are you using behind the model?
My experience so far with Opencode and Mistral Vibe has been a disaster. Tool calling with Opencode is broken, and with Mistral Vibe the TUI enters "compaction" about 5-10 seconds after an /init command; when done, it totally disregards the prompt, goes silent, and about 30-60 seconds later reports back with "Task complete".

I have found a similar complaint from another user that experienced the same even with the Mistral own hosted model.

P.S. Germany here, and as much as I would love to push for a European AI contender against American-Asian domination, my experience with the Mistral ecosystem (models + TUI) has been underwhelming, and I am being highly diplomatic about it... 😏 So I keep asking other people about their experiences, assuming I'm the dumb one and am still doing something wrong.

@dehnhaide

I'm having luck with opencode using this opencode.json file in the same directory where i start it up. Mainly i'm using https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF#iq5_ks-77341-gib-5441-bpw and haven't tried this mistral yet.

{
    "$schema": "https://opencode.ai/config.json",
    "share": "disabled",
    "autoupdate": false,
    "experimental": {
        "openTelemetry": false
    },
    "permission": {
        "websearch": "allow",
        "todo": "deny",
        "todoread": "deny",
        "todowrite": "deny",
        "doom_loop": "allow"
    },
    "disabled_providers": ["exa"],
    "lsp": false,
    "provider": {
        "LMstudio": {
            "npm": "@ai-sdk/openai-compatible",
            "name": "ik_llama.cpp (local)",
            "options": {
                "baseURL": "http://localhost:8080/v1",
                "timeout": 99999999999
            },
            "models": {
                "Qwen3.5": {
                  "name": "Qwen3.5",
                  "limit": { "context": 262144, "output": 65536 },
                  "cost": { "input": 5.0, "output": 25.0 },
                  "temperature": true,
                  "reasoning": true,
                  "tool_call": true,
                  "modalities": {
                    "input": ["text", "image"],
                    "output": ["text"]
                  }
                }
            }
        }
    }
}

I use the TUI, and on startup you have to type /connect one time and select the model, type in anything for the API key or whatever, then it works pretty well. I set the pricing to Opus 4.6 so you can see how much $$ you're saving haha...

This is my ik starting command for full 2xGPU 96GB total VRAM offload, should work with Aes' models just fine as well. Though as noted mistral-small-4 does not have -sm graph on ik yet:

./build/bin/llama-server \
  --alias Qwen3.5-122B-A10B \
  --model "$model" \
  --mmproj "$mmproj" \
  --image-min-tokens 1024 \
  --image-max-tokens 4096 \
  -ub 4096 -b 4096 \
  -fa on \
  -ctk f16 -ctv f16 -cuda fa-offset=0 \
  -c 262144 \
  -muge \
  -sm graph \
  -ngl 99 \
  --no-mmap \
  --parallel 1 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --ctx-checkpoints 48 \
  --ctx-checkpoints-interval 512 \
  --ctx-checkpoints-tolerance 5 \
  --cache-ram 16384

Thanks John for your insightful reply. I have managed to tame the beast using @martossien's command, but by removing "--k-cache-hadamard --graph-reuse"... for whatever reason, on my setup these two parameters induce some strange hallucinations and forgetfulness in the model, like saying it's going to write AGENTS.md, then forgetting completely and stating "Task complete".

@martossien: I got the hint of trying to balance the way the model loads, which is otherwise fairly random across the 8x GPUs. The mix of -ts & -ot seems like a very nice recipe. Thanks for sharing! ;)

P.S. This model + ik_llama seems born in the kingdom of weirdness. Setting -ub and -b at 4096 makes it unpredictable; with both "-ub" and "-b" at 2048 it is buttery smooth and predictable... I feel stupid.
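One plausible (unverified) explanation for the -ub sensitivity: the CUDA compute buffer scales roughly linearly with the micro-batch size, so halving -ub roughly halves the transient scratch each GPU needs during prompt processing, which matters on cards already near their VRAM limit. A rough back-of-envelope with placeholder numbers (not the real model config):

```python
# Crude estimate, assumed numbers: scratch memory for prompt processing scales
# with the number of live activation tensors of shape [n_ubatch, n_embd].
n_embd = 6144        # placeholder hidden size, not the real model config
bytes_per_act = 2    # fp16 activations

def scratch_mib(n_ubatch, n_tensors_live=32):
    """Very rough per-GPU scratch estimate in MiB for a given micro-batch size."""
    return n_ubatch * n_embd * bytes_per_act * n_tensors_live / 2**20

print(f"{scratch_mib(4096):.0f} MiB vs {scratch_mib(2048):.0f} MiB")  # 1536 MiB vs 768 MiB
```

The absolute numbers are invented; the point is only the linear scaling, which would explain why -ub 2048 leaves more headroom than 4096 on a tightly packed multi-GPU box.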
