Recommendations for running on Strix Halo.
Since this is such a large model, do you have any recommendations for arguments to use when running it on Strix Halo?
The main thing is to leave your UMA on default and increase the maximum TTM allocation (https://strixhalo.wiki/AI/AI_Capabilities_Overview#memory-limits). In my case I went with 112GiB // 4KiB == 29360128 pages, as well as disabling the IOMMU.
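The pages value is just the desired byte budget divided by TTM's 4 KiB page size; a quick sanity check of the arithmetic (the 112 GiB figure is from the post, the variable names are mine):

```python
# ttm.pages_limit is counted in 4 KiB pages, so a byte budget
# converts to pages by dividing by the page size.
GIB = 1024 ** 3
PAGE_SIZE = 4 * 1024  # 4 KiB

budget_gib = 112
pages = budget_gib * GIB // PAGE_SIZE
print(pages)  # 29360128
```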
Then in llama.cpp you don't need much else besides a reasonable fit target:
```ini
[qwen35-122b-instruct]
hf-repo = Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO
fit-target = 12288
cache-ram = 4096
reasoning-budget = 0
no-context-shift = true
ubatch-size = 1024
batch-size = 1024
direct-io = true
```
mmap behaves badly with UMA, so use direct-io instead. Depending on the application, it might also be worth increasing the checkpoint count.
```yaml
services:
  qwen35-122b:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    container_name: qwen35-122b
    ports:
      - "8081:8080"
    devices:
      - /dev/dri:/dev/dri  # For Vulkan/iGPU (Strix Halo)
      - /dev/kfd:/dev/kfd  # For ROCm/Compute
    volumes:
      - ./models/Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO/qwen35-122b-a10b-q80-q6k_ffn.gguf:/model.gguf:ro
      - ./models/Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO/mmproj-F16.gguf:/mmproj.gguf:ro
    environment:
      LLAMA_ARG_MODEL: /model.gguf
      LLAMA_ARG_MMPROJ: /mmproj.gguf
      LLAMA_MODEL_ALIAS: "qwen35-122b"
      LLAMA_ARG_CTX_SIZE: "262144"
      LLAMA_ARG_N_GPU_LAYERS: "99"
      LLAMA_ARG_FLASH_ATTN: "1"
      LLAMA_ARG_THREADS: "7"
      LLAMA_ARG_N_PARALLEL: "1"
      LLAMA_ARG_BATCH_SIZE: "2048"
      LLAMA_ARG_UBATCH_SIZE: "1024"
      LLAMA_ARG_PORT: "8080"
      LLAMA_ARG_HOST: "0.0.0.0"
      LLAMA_ARG_API: "1"
      LLAMA_ARG_ENDPOINT_METRICS: "1"
```

And the GRUB settings:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
```
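Note that the two kernel parameters use different units (amdgpu.gttsize is in MiB, ttm.pages_limit is in 4 KiB pages); a quick sanity check, with my own variable names, that both values above describe the same 124 GiB budget:

```python
# amdgpu.gttsize is specified in MiB; ttm.pages_limit in 4 KiB pages.
MIB = 1024 ** 2
GIB = 1024 ** 3
PAGE_SIZE = 4 * 1024

gttsize_mib = 126976
pages_limit = 32505856

print(gttsize_mib * MIB // GIB)        # 124 (GiB)
print(pages_limit * PAGE_SIZE // GIB)  # 124 (GiB)
```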
That's my docker-compose.yml and GRUB settings for my Strix Halo. Your build of this model is superior to all the others I tried, including Unsloth's Q6_K; I get better agentic performance and results with this build on both ROCm and Vulkan.