Recommendations for running on Strix Halo.

#2
by scottgl - opened

Since this is such a large model, I was wondering if you have recommendations for arguments to use when running this model on Strix Halo?

Owner
•
edited Mar 8

The main thing is to leave your UMA on default and increase the maximum TTM allocation (https://strixhalo.wiki/AI/AI_Capabilities_Overview#memory-limits). In my case I went with 112GiB // 4KiB == 29360128, as well as disabling the IOMMU.
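As a sanity check on that arithmetic (TTM pages are 4 KiB, and the 112 GiB budget here is just one choice, not a required value), the pages_limit figure can be derived like this:

```shell
# Derive ttm pages_limit from a memory budget in GiB.
# TTM pages are 4 KiB, so pages_limit = budget_bytes / 4096.
# The 112 GiB figure is an example budget, not a required value.
budget_gib=112
pages_limit=$(( budget_gib * 1024 * 1024 * 1024 / 4096 ))
echo "$pages_limit"   # prints 29360128
```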

Then in llama.cpp you don't need much else besides a reasonable fit target:

```ini
[qwen35-122b-instruct]
hf-repo = Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO
fit-target = 12288
cache-ram = 4096
reasoning-budget = 0
no-context-shift = true
ubatch-size = 1024
batch-size = 1024
direct-io = true
```

mmap performs poorly with UMA, so use direct I/O instead. Depending on the application, it might also be worth increasing the checkpoint count.

```yaml
services:
  qwen35-122b:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    container_name: qwen35-122b
    ports:
      - "8081:8080"
    devices:
      - /dev/dri:/dev/dri    # For Vulkan/iGPU (Strix Halo)
      - /dev/kfd:/dev/kfd    # For ROCm/Compute
    volumes:
      - ./models/Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO/qwen35-122b-a10b-q80-q6k_ffn.gguf:/model.gguf:ro
      - ./models/Beinsezii/Qwen3.5-122B-A10B-GGUF-HALO/mmproj-F16.gguf:/mmproj.gguf:ro
    environment:
      LLAMA_ARG_MODEL: /model.gguf
      LLAMA_ARG_MMPROJ: /mmproj.gguf
      LLAMA_MODEL_ALIAS: "qwen35-122b"
      LLAMA_ARG_CTX_SIZE: "262144"
      LLAMA_ARG_N_GPU_LAYERS: "99"
      LLAMA_ARG_FLASH_ATTN: "1"
      LLAMA_ARG_THREADS: "7"
      LLAMA_ARG_N_PARALLEL: "1"
      LLAMA_ARG_BATCH_SIZE: "2048"
      LLAMA_ARG_UBATCH_SIZE: "1024"
      LLAMA_ARG_PORT: "8080"
      LLAMA_ARG_HOST: "0.0.0.0"
      LLAMA_ARG_API: "1"
      LLAMA_ARG_ENDPOINT_METRICS: "1"
```
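Once the container is up (`docker compose up -d`), the server exposes llama.cpp's OpenAI-compatible API on host port 8081. A minimal smoke-test sketch, assuming the compose file above with its `qwen35-122b` alias:

```shell
# Chat-completions request body; the model name matches LLAMA_MODEL_ALIAS
# from the compose file above.
payload='{"model":"qwen35-122b","messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'

# With the container running, send it to the mapped host port:
# curl -s http://localhost:8081/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$payload"
echo "$payload"
```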

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
```
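These two kernel parameters encode the same budget in different units: amdgpu.gttsize is in MiB while ttm.pages_limit counts 4 KiB pages, and the values above both work out to 124 GiB. A quick consistency check:

```shell
# amdgpu.gttsize is in MiB; ttm.pages_limit counts 4 KiB TTM pages.
gttsize_mib=126976
pages_limit=32505856
gtt_bytes=$(( gttsize_mib * 1024 * 1024 ))
ttm_bytes=$(( pages_limit * 4096 ))
echo "$gtt_bytes $ttm_bytes"   # both 133143986176 bytes == 124 GiB
```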

Those are my docker-compose.yml and GRUB settings for my Strix Halo. Your build of this model is superior to all the others I tried, including unsloth Q6_K; I get better agentic performance and results with this build on both ROCm and Vulkan.
