From "Doesn't Work" to 641 tok/s: GLM-5.1 NVFP4 on 6× RTX PRO 6000 Blackwell
From "Doesn't Work" to 641 tok/s: GLM-5.1 NVFP4 on 6× RTX PRO 6000 Blackwell
A four-day log of running the most ambitious open-weight MoE on NVIDIA's newest workstation silicon — what worked, what hard-crashed the OS, and what's still unfinished.
The setup
It's April 2026. You have six NVIDIA RTX PRO 6000 Blackwell Workstation Edition cards in a rack — compute capability 12.0 (SM120), 96 GB of VRAM each, 576 GB total across the six compute GPUs.
You want to run GLM-5.1 NVFP4 — Z.AI's 744B MoE model, released 2026-03-27, the single most ambitious open-weight model out there. The NVFP4 quantization from lukealonso weighs ~434 GB on disk.
You run vllm serve ... on it.
Nothing works.
- `transformers` doesn't recognize the architecture (`glm_moe_dsa`)
- `deepseek_v2.py` tries to enable sparse MLA, which has no SM120 kernel
- DeepGEMM isn't on PyPI with its CUTLASS submodules
- FlashMLA's precompiled binaries assert on non-SM90/SM100
- When you fight those off, CUTLASS silently skips 72 SM120 grouped-gemm tactics and falls back to generic slow kernels
Welcome to the current state of frontier inference on Blackwell. This post is a field report from the other side.
The headline number
Before the story: here's what we achieved, and what it costs. Live-measured 2026-04-12, vLLM 0.19.0, dense attention (sparse off), TP=2 PP=3, 128K context window.
| Concurrency | Aggregate (tok/s) | Per-request (tok/s) | Wall clock |
|---|---|---|---|
| 1 | 33 | 33.3 | 7.7 s |
| 4 | 103 | 25.8 | 9.9 s |
| 8 | 167 | 20.9 | 12.3 s |
| 16 | 256 | 16.0 | 16.0 s |
| 32 | 384 | 12.0 | 21.4 s |
| 64 | 606 | 9.5 | 27.0 s ← sweet spot |
| 128 | 641 | 7.6 | 51.1 s ← saturation |
Stable. Two days of continuous operation across Slack-bot workloads, multi-agent debate sessions, code-generation subagents, and Wiki ingest — zero crashes.
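The two throughput columns in the table are linked: aggregate tok/s divided by per-request tok/s gives an "effective concurrency", i.e. how many requests were actually decoding at once on average. A quick script (the numbers are copied from the table above; the queuing interpretation at 128 is our reading, not a direct measurement):

```python
# Derive effective concurrency from the benchmark table: up through 64,
# effective ~= offered, meaning every request decodes in one batch.
# At 128, effective drops to ~84, consistent with the scheduler queuing
# some requests instead of decoding all 128 simultaneously.
rows = [  # (offered concurrency, aggregate tok/s, per-request tok/s)
    (1, 33, 33.3), (4, 103, 25.8), (8, 167, 20.9), (16, 256, 16.0),
    (32, 384, 12.0), (64, 606, 9.5), (128, 641, 7.6),
]
effective = {conc: agg / per_req for conc, agg, per_req in rows}
for conc, eff in effective.items():
    print(f"offered {conc:3d} -> effective {eff:5.1f}")
```

That divergence at 128 is why we call 64 the sweet spot: past it, offered load stops translating into live batch size.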
Is 641 tok/s the best you can do on this hardware? No. We briefly reached 1,555 tok/s via an SGLang + custom FlashMLA SM120 path. That path then hard-crashed the OS. More on that in a minute.
Wall 1: transformers didn't know glm_moe_dsa
ValueError: The checkpoint you are trying to load has model type
`glm_moe_dsa` but Transformers does not recognize this architecture.
vLLM 0.19.0 pins transformers<5,>=4.56.0. The fix is to override with a dev version that knows glm_moe_dsa:
pip install --break-system-packages \
git+https://github.com/huggingface/transformers.git
# → transformers 5.6.0.dev0
Yes, this breaks vLLM's declared compatibility. Yes, it works anyway. Welcome to frontier deployment.
Wall 2: DeepGEMM isn't on PyPI
ModuleNotFoundError: No module named 'deep_gemm'
GLM-5.1's NVFP4 MoE path depends on DeepSeek's deep_gemm for grouped GEMM. The PyPI package is missing CUTLASS submodules. Build from source:
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
pip install --break-system-packages .
# → deep_gemm 2.3.0
Note the --recursive. Without it, the CUTLASS submodules don't come down, and the build silently produces a broken wheel.
Wall 3: Sparse MLA on SM120 simply doesn't exist
This was the hardest.
vLLM loads GLM-5.1, sees index_topk: 2048 in its config, and concludes: "this is DeepSeek V3.2 family, enable sparse MLA". Sparse MLA is implemented in FlashMLA and FlashInfer via hand-written CUDA kernels. Those kernels exist for SM90a and SM100f only. On SM120:
AssertionError: FlashMLA sparse kernel requires SM90a or SM100f
There's no clean runtime fix. We patched three places in vLLM's source:
- `vllm/platforms/cuda.py:59` — change `device_capability.major == 10` to `>= 10`, so Blackwell (12.0) isn't rejected as "not Hopper"
- `vllm/model_executor/models/deepseek_v2.py:941,1155` — add a GLM-specific carve-out so `is_v32` is False for GLM model types, disabling sparse detection
- `vllm/model_executor/models/deepseek_v2.py:~1619` — skip state-dict keys that no longer have destinations (the checkpoint still has indexer weights, but with sparse off, there's nowhere to put them)
After those three patches, GLM-5.1 loads. With dense attention instead of sparse.
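The second patch reduces to a small predicate. Here is a simplified sketch of the carve-out logic (names and shapes simplified; this is not the actual vLLM code, which lives in `deepseek_v2.py`):

```python
# Sketch of the sparse-MLA detection carve-out (patch 2), simplified.
def wants_sparse_mla(model_type: str, config: dict) -> bool:
    """Decide whether to enable sparse MLA for a checkpoint."""
    # Upstream heuristic: an indexer field (index_topk) marks the
    # DeepSeek V3.2 family, which ships hand-written sparse-MLA kernels.
    is_v32 = "index_topk" in config
    # Carve-out: GLM checkpoints carry index_topk too, but no sparse-MLA
    # kernel exists for SM120, so force dense attention for GLM types.
    if model_type.startswith("glm"):
        is_v32 = False
    return is_v32

print(wants_sparse_mla("deepseek_v32", {"index_topk": 2048}))  # True
print(wants_sparse_mla("glm_moe_dsa", {"index_topk": 2048}))   # False
```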
Our first benchmark:
Single: 33 tok/s
8 concurrent: 130 tok/s
TTFT: 0.15 s
Functional. Not fast. Profiling showed 92.5% of decode time in MoE matmul and 7.5% in attention. The real bottleneck wasn't the one we'd been fighting.
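That profile split puts a hard Amdahl's-law bound on anything we do to attention: with only 7.5% of step time in attention, even an infinite attention speedup buys about 8% end-to-end.

```python
# Amdahl's law applied to the decode profile: attention is 7.5% of step
# time, so attention-only optimizations barely move end-to-end decode.
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """End-to-end speedup when `fraction` of time is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

attn_frac = 0.075
for s in (2, 10, float("inf")):
    print(f"attention {s}x -> decode {overall_speedup(attn_frac, s):.3f}x")
```

This is why the rest of the post chases the MoE matmul path instead of attention kernels.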
The SGLang detour
130 tok/s on 8 concurrent isn't good enough. We dug into the CUTLASS SM120 grouped-gemm problem and tried to substitute SM100 kernel schedules. It didn't compile. SM120 is not a superset of SM100 at the CUTLASS template level: the `KernelPtrArrayTmaWarpSpecialized1SmNvf4Sm100` template is sm_100a-only, and nvcc fails at sm_120a.
That's when we found the voipmonitor/rtx6kpro Wiki — a community record of someone running GLM-5 on the same 6× RTX PRO 6000 rack, but via SGLang 0.5.10 instead of vLLM. SGLang has an NSA (Native Sparse Attention) backend that's completely separate from FlashMLA.
So we pivoted.
The SGLang path was six independent patches:
- Upgrade CUDA 12.8 → 12.9 (SGLang's flashinfer 0.6.7 requires it)
- Build our own FlashMLA for SM120, because the precompiled `sgl_kernel/flashmla_ops.abi3.so` refuses to load. We wrote a sister project, `flashmla_sm120`, that ports the dense kernels (decode + prefill fwd + prefill bwd) to sm_120a
- Patch `sgl_kernel/flash_mla.py` to import our custom module
- Rename `flashmla_ops.abi3.so` → `.sm90bak` so SGLang can't load the old kernel
- Patch `sglang/srt/configs/model_config.py` to remove `GlmMoeDsaForCausalLM` from the NSA list (because our FlashMLA SM120 doesn't implement sparse)
- Add a `.pth` file to `site-packages` so our custom package is on the import path
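The last patch leans on a standard CPython mechanism: at startup, the `site` module reads `*.pth` files in `site-packages`, and each line naming an existing directory is appended to `sys.path`. A self-contained demonstration using temp directories as stand-ins (the real `.pth` simply contains the directory holding `flashmla_sm120`):

```python
# Demonstrate the .pth mechanism with site.addsitedir, which runs the
# same *.pth processing that the interpreter applies to site-packages.
import site
import sys
import tempfile
from pathlib import Path

sitedir = Path(tempfile.mkdtemp())   # stand-in for site-packages
pkgdir = Path(tempfile.mkdtemp())    # stand-in for our package's parent dir
(sitedir / "flashmla_sm120.pth").write_text(str(pkgdir) + "\n")

site.addsitedir(str(sitedir))        # processes *.pth files in sitedir
print(str(pkgdir) in sys.path)       # True: the directory is now importable
```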
After all of that, SGLang loaded GLM-5.1. And then we pushed it hard:
SGLang 0.5.10 + FlashMLA SM120 (custom) + CUDA graph ON
| Concurrency | API aggregate (tok/s) | Internal gen throughput (tok/s) |
|---|---|---|
| 51 | — | 824 |
| 56 | — | 880 |
| 64 | 482 | 980 |
| 69 | — | 1,555 ← peak |
| 128 | 952 | — |
1,555 tokens per second at a 69-request parallel decode batch. For a 744B MoE on six workstation cards, that felt like a victory.
The victory lasted about 30 minutes
The next morning, we ran a sustained benchmark in TP=4 + CPU offload mode (160 GB pinned host memory for offloaded expert weights). The system ran for ~20 minutes, then:
- The display froze.
- The keyboard became unresponsive.
- ~22 seconds later, the kernel watchdog fired a hardware reset.
journalctl -b -1:
Watchdog: BUG: soft lockup - CPU#N stuck for 22s
NMI watchdog: Watchdog detected hard LOCKUP on CPU M
Hardware Dog: PCIe DMA timeout on device 0000:f1:00.0
Post-mortem: TP=4 + 160 GB CPU offload pumps so much data over PCIe per decode step that it saturated the PCIe fabric. Our display GPU (on the same PCIe root complex as compute GPUs 3-5) couldn't complete its DMA to the framebuffer. The kernel saw "display driver not responding" and hard-reset the machine.
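A back-of-envelope calculation makes the saturation plausible. The numbers below are illustrative assumptions, not measurements: we assume roughly 63 GB/s of practical one-way bandwidth per PCIe Gen5 x16 link, four host links in the TP=4 configuration, and that some fraction of the 160 GB of offloaded experts crosses PCIe every decode step.

```python
# Illustrative PCIe budget for the lockup scenario (assumed numbers):
# even a modest per-step slice of the offloaded experts keeps the links
# busy for tens of milliseconds per decode step, leaving little slack
# for the display GPU's framebuffer DMA on the shared root complex.
PCIE5_X16_GBPS = 63.0     # assumed practical one-way bandwidth per link
offloaded_gb = 160.0      # pinned host memory for offloaded experts
fraction_touched = 0.05   # assumed: 5% of offloaded experts hit per step
links = 4                 # assumed: TP=4, one host link per compute GPU

bytes_per_step_gb = offloaded_gb * fraction_touched
step_transfer_s = bytes_per_step_gb / (PCIE5_X16_GBPS * links)
print(f"{bytes_per_step_gb:.0f} GB/step over PCIe -> "
      f"{step_transfer_s * 1e3:.0f} ms of pure transfer per decode step")
```

Vary `fraction_touched` and the conclusion holds: sustained decode in this mode keeps the fabric near saturation.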
1,555 tok/s was a ceiling, not a floor. We documented it and backed out.
From Ken's breadcrumb log:
SGLang 環境を破壊して vLLM 路線に戻る
"Destroy the SGLang environment and return to the vLLM path."
The vLLM return (and the 7th patch)
We uninstalled SGLang, sgl-kernel, and flashinfer 0.6.7. We downgraded back to flashinfer 0.6.6 and upgraded torch 2.9.1 → 2.10.0 to match vLLM's strict pin. Then we hit patch number 7:
ImportError: cannot import name 'apply_rotary'
from 'flash_attn.ops.triton.rotary'
A flash-attn-4 package (CUTE-version rewrite) had polluted the flash_attn namespace, and vLLM's rotary-embedding import blew up at module load. The fix: wrap the import in try/except with a Triton fallback. Done.
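The shape of that seventh patch is a standard import-fallback idiom. This is a sketch, not vLLM's exact code; the placeholder below stands in for vLLM's own Triton rotary implementation:

```python
# Patch 7, sketched: tolerate a flash_attn install whose rotary module
# no longer exports apply_rotary, and fall back to a local implementation.
try:
    from flash_attn.ops.triton.rotary import apply_rotary  # fast path
except ImportError:
    def apply_rotary(*args, **kwargs):
        # Placeholder: vLLM substitutes its own Triton rotary kernel here.
        raise NotImplementedError("fallback rotary implementation goes here")

print(callable(apply_rotary))  # True whichever branch was taken
```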
After that, vLLM ran stably, with the benchmark shown at the top of this post. That was 2026-04-11 evening. It's been continuously up since then, serving:
- A Slack bot for casual chat
- A 7-sage MAGI Council deliberation system
- `opencode` coding subagents that write files autonomously
- A Wiki ingest pipeline processing long documents at 128K context
- `benchmark_throughput.py`, which is how we generated the numbers in this post, live on 2026-04-12, from the same vLLM instance that was handling the above workloads in parallel
The 2.4× gap
Why are we 2.4× below our own SGLang peak?
Three contributors, in rough order of impact:
- CUTLASS SM120 grouped-gemm tactics silently fall back. When vLLM (or SGLang) launches, FlashInfer's CUTLASS autotuner tries ~72 tactics for SM120 TMA Warp-Specialized grouped GEMM. All 72 fail because the templates are `sm_100a`-only, and the runtime falls back to generic slow kernels. This is the big one. There is no user-side fix; it requires CUTLASS and/or FlashInfer upstream work.
- CUDA graph capture is disabled (`--enforce-eager`) on the vLLM path. We chose eager because graph capture is flaky on SM120 without additional upstream work. Estimated impact: ~18% of the gap.
- MLA kernel choice. SGLang uses FlashMLA (our custom SM120 build); vLLM uses its Triton-based MLA. On dense attention, FlashMLA was probably faster, but that's hard to quantify without an A/B test.
On top of all that, the theoretical memory-bandwidth ceiling at batch=8 is ~2,181 tok/s. Neither path reaches more than ~70% of that ceiling. There's meaningful headroom on this silicon.
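That ceiling comes from a roofline-style argument: every decode step must stream the active weights out of VRAM, so step rate is bounded by aggregate bandwidth over bytes read per step. The inputs below are illustrative assumptions (per-card bandwidth and active bytes are our guesses, not the exact values behind the ~2,181 figure):

```python
# Shape of the memory-bandwidth roofline estimate (assumed inputs):
#   steps/s <= total_bandwidth / bytes_read_per_step
#   tok/s   <= batch * steps/s
hbm_gbps_per_gpu = 1792.0   # assumed per-card bandwidth, RTX PRO 6000 class
num_gpus = 6
batch = 8
active_bytes_gb = 39.0      # assumed: NVFP4 weight bytes streamed per step

steps_per_s = (hbm_gbps_per_gpu * num_gpus) / active_bytes_gb
print(f"ceiling ~ {batch * steps_per_s:,.0f} tok/s")
```

The point is the structure, not the exact constant: until the MoE matmul path stops falling back to slow kernels, measured throughput sits well under whatever this bound works out to.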
The research agenda
If you have a similar rack and you want to close the gap, these are the frontiers:
- Make CUTLASS SM120 TMA WS grouped-gemm tactics actually work. This is a serious NVIDIA/FlashInfer upstream task. Anyone who closes this unlocks a lot.
- Isolate your display GPU from the compute PCIe fabric. If display is on a separate root complex, the SGLang TP=4+offload configuration might be safe. Worth a try if you have the topology.
- Port sparse MLA to SM120. Triton works (we have a prototype getting 17.6× on attention at 128K), but attention is only 7.5% of decode time — the net effect is limited until MoE matmul is fixed.
- Try NVFP4 KV cache when vLLM lands it. Our VLLM-TurboQuant-SM120 project's TQ4 format is a nearby alternative.
Try it yourself
We packaged all of this into a reproducibility repository:
→ github.com/Shinka-Man/ONLY-to-AIXsatoshi
Contents:
- Full walkthroughs of both paths (stable vLLM, aspirational SGLang)
- All 7 vLLM patches documented with line numbers and before/after
- Pre-flight script (`verify_env.sh`) + install script + launch script
- The exact `benchmark_throughput.py` used for this post
- Troubleshooting doc with 10 known failure modes + recovery commands
Expected time from clone to running benchmark: 3-4 hours, assuming you already have the 434 GB NVFP4 weights downloaded.
If your numbers come back within ±10% of ours on identical hardware, you've reproduced the core claim. If they're off by more than 20%, something's different — open an issue and we'll compare notes.
Closing
This post exists because someone should write down what you actually hit when you try to run GLM-5.1 NVFP4 on Blackwell SM120 in April 2026. The research ceiling moves forward faster than the deployment floor. Closing that gap is collaborative, practitioner-led work, and right now it's largely invisible — buried in journalctl logs, breadcrumb files, and late-night DMs between researchers who have the hardware and are trying to make it do the thing.
If you're one of those researchers, welcome. The repo is yours. Find the bugs we missed, push the numbers we couldn't push, and please file issues when you hit walls we didn't cover. We'd rather update this map than see anyone else rediscover a landmine.
Authors: Ken (Lna-Lab) with assistance from Claude Opus 4.6 (Anthropic). Assembled from Lna-Lab's internal VOYAGE_LOG and LAUNCH_TIPS.md over four days of iterative pair programming, 2026-04-08 through 2026-04-12.
Acknowledgements: lukealonso for the NVFP4 quantization of GLM-5.1, Z.AI / Zhipu AI for open-weight release, voipmonitor/rtx6kpro Wiki contributors for the first public SGLang-on-Blackwell record, and the vLLM and SGLang maintainer communities for patience with SM120.
License: MIT for this post and the associated repository. Model weights governed by Z.AI / Zhipu AI's GLM-5.1 license.

