Will it work on 2x 6000 Pros?
will we have space left for the context?
It should once I'm done; this is still early
worth downloading at the moment or should we wait? 😀
When running lukealonso/MiMo-V2.5-NVFP4 on 2x RTX Pro 6000 Blackwell with the lukealonso/sglang-cuda13-b12x Docker image, only one configuration produces clean reasoning output — BF16 KV cache, CUDA graphs disabled, and MTP disabled. Every other combination causes the model to enter repetition loops during long reasoning sequences (enable_thinking: true), exhausting the max_tokens budget without ever emitting `</think>` and never producing a final answer (finish_reason: "length", content: null).
Environment:
2x RTX Pro 6000 Blackwell (96 GB each, 94.97 GiB usable)
Compute capability sm_120a (GB202)
TP=2, no NVLink, PCIe P2P enabled
Docker image: docker.io/lukealonso/sglang-cuda13-b12x:latest
Model: lukealonso/MiMo-V2.5-NVFP4 (310B MoE, NVFP4 quantized, hybrid attention with SWA, 3-layer MTP)
CUDA 13, NCCL 2.28.9
Custom all-reduce disabled (--disable-custom-all-reduce) due to a separate crash in sgl_kernel/allreduce.py
Test methodology
Identical reasoning prompt (Python sliding-window rate limiter implementation request), identical sampling params (temperature=1.0, top_p=0.95),
enable_thinking: true, max_tokens: 12000-15000. Each configuration tested independently; only the listed flags differ.
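For reference, this is roughly the request shape; a minimal sketch assuming the OpenAI-compatible /v1/chat/completions endpoint on localhost:30000, that enable_thinking is forwarded through chat_template_kwargs, and that the mimo reasoning parser exposes a separate reasoning_content field (adjust to your client):
```python
# Minimal sketch of one thinking-mode test request. Assumptions (not from the
# report): server on localhost:30000, enable_thinking forwarded through
# chat_template_kwargs, reasoning trace exposed as message.reasoning_content.
import requests

PROMPT = "Implement a Python sliding-window rate limiter ..."  # full prompt elided

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "MiMo-V2.5-NVFP4",  # adjust to the served model name
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 12000,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    timeout=3600,
)
choice = resp.json()["choices"][0]
print("finish_reason:", choice["finish_reason"])                      # "stop" vs "length"
print("content is null:", choice["message"].get("content") is None)   # loops leave it null
print("reasoning chars:", len(choice["message"].get("reasoning_content") or ""))
```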
Results
| Configuration | Outcome |
|---|---|
| BF16 KV, no graphs, no MTP | Clean stop, 2018 reasoning tokens, full content output |
| BF16 KV, no graphs, MTP (1 step) | Loop, 20000 reasoning tokens, no content |
| BF16 KV, CUDA graphs (bs=1,2), no MTP | Loop, 12000 reasoning tokens, no content |
| FP8 KV, no graphs, no MTP | Loop, 12000 reasoning tokens, no content |
| FP8 KV, CUDA graphs, no MTP | Loop, 20000 reasoning tokens, no content |
Loop signature
In every failing configuration, the model produces near-verbatim repetitions of self-correction phrases inside the `<think>` block. Examples:
"OK, I think the implementation is correct. Actually, let me reconsider..." × dozens of times
"Actually, let me think about this more carefully. The tests should be deterministic..." × dozens of times
Decode throughput stays stable throughout (no engine-side degradation)
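Not part of the engine output, but a rough way to flag this signature automatically; a sketch that counts near-verbatim repeats of sentence prefixes in the reasoning trace:
```python
# Rough helper (mine, not from the engine): flags the loop signature by counting
# how often the same sentence prefix recurs in a reasoning trace.
from collections import Counter
import re

def repeated_phrases(reasoning: str, min_repeats: int = 10, prefix_len: int = 60) -> dict:
    """Return sentence prefixes that recur at least `min_repeats` times."""
    sentences = re.split(r"(?<=[.!?])\s+", reasoning)
    counts = Counter(s[:prefix_len].strip() for s in sentences if s.strip())
    return {prefix: n for prefix, n in counts.items() if n >= min_repeats}

# A trace with dozens of hits on "Actually, let me reconsider"-style prefixes
# matches the failure signature above; a clean run returns an empty dict.
```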
Thank you for your great work, lukealonso. It looks promising.
Memory usage has been reduced, so it should fit better on 2 cards now. See the updated model card.
Hi Luke, thanks for the continued work on this model — and congrats on the 5/4 update. The improvements are very visible on this hardware: the larger KV pool (180K full + 54K SWA on 2x RTX Pro 6000 Blackwell), the much-reduced CUDA graph memory (down from around 3.7 GB to 0.09 GB), the working MTP draft graphs (around 0.06 GB per step instead of OOM), and B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 are all clear wins. Generation throughput stays comfortably between 100 and 155 tps.
After two days of systematic testing on this setup with lukealonso/sglang-cuda13-b12x:latest, I've collected reproducible data on a thinking-mode failure mode and wanted to share it.
Setup
```bash
docker run --rm -it --gpus '"device=0,1"' --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /opt/llm/models:/models:ro \
  -e B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 \
  -e CUTE_DSL_ARCH="sm_120a" \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  docker.io/lukealonso/sglang-cuda13-b12x:latest \
  python -m sglang.launch_server \
    --model-path /models/MiMo-V2.5-NVFP4 \
    --tp-size 2 --page-size 64 \
    --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.95 \
    --swa-full-tokens-ratio 0.3 --chunked-prefill-size 2048 \
    --enable-pcie-oneshot-allreduce --enable-multi-layer-eagle \
    --reasoning-parser mimo --tool-call-parser mimo \
    --context-length 131072 \
    --cuda-graph-max-bs 4 --max-running-requests 4 \
    --speculative-algorithm EAGLE --speculative-num-steps 3 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --moe-runner-backend b12x --attention-backend b12x \
    --mm-attention-backend b12x --fp4-gemm-backend b12x \
    --fp8-gemm-backend flashinfer_cutlass
```
Small note: the 2x section of your model card has a copy-paste artifact: `--gpus '"device=0,1,2,3"'` and `--tp-size 4` (likely copied from the 4x block).
Test prompt
All tests use the same complex coding task:
Implement a Python LRU cache with capacity, O(1) get/put, RLock thread-safety, optional per-entry TTL, statistics(), clear(),
`__len__`, `__contains__`, iteration in LRU order; then write 8 unit tests covering basic put/get, capacity-eviction, LRU-order-after-get, TTL-expiration, lazy-cleanup, statistics-correctness, thread-safety with 4 threads, iteration-order; explain design decisions and complexity.
On simple/medium tasks (short math, single-file widgets, sliding-window log analysis) thinking-mode is reliably clean (8/8 in my testing). The failures below are specific to thinking + complex multi-component synthesis.
Observed thinking-mode failure modes
Three distinct patterns, all reproducible (T=1.0, top_p=0.95, max_tokens=12000 unless noted):
1. Circular phrase loop
Default sampling. Model emits "Actually, let me reconsider..." style self-correction in a tight cycle, never reaching content.
reasoning_tokens: 12000 finish_reason: length content: null
2. Output decoherence
frequency_penalty=0.3. Model exits the loop but degenerates into mixed-script garbage (CJK characters, emojis, fragmented compound words). Example tail:
"耻declinedbelittletamessaynippetsincompleteremoved"
reasoning_tokens: 4901 finish_reason: stop content: null
3. Non-terminating structured reasoning
min_p=0.05 + repetition_penalty=1.05, max_tokens=20000. Model produces well-structured, coherent reasoning, drafts complete implementations within the reasoning trace (4+ full LRU implementations counted in one run, each slightly different, each correct). It just doesn't synthesize and emit </think>.
reasoning_tokens: 20002 finish_reason: length content: null
Sampling parameter sweep on thinking-mode
| Config | reasoning_tokens | finish_reason | Notes |
|---|---|---|---|
| default | 12000 | length | classic phrase loop |
| frequency_penalty=0.1 | 12000 | length | numeric bypass loop (model enumerates 1–50 with near-identical content per item to evade penalty) |
| frequency_penalty=0.3 | 4901 | stop | decoherence into token garbage |
| repetition_penalty=1.05 | 11020 | length | reached content (~980 tokens) before cutoff |
| repetition_penalty=1.08 | 12000 | length | loop |
| repetition_penalty=1.10 (run 1) | 3411 | stop | clean, ~5100 tokens content |
| repetition_penalty=1.10 (run 2) | 7736 | length | reached content (~4264 tokens) before cutoff |
| top_k=50 (run 1) | 192 | stop | clean, ~3800 tokens content |
| top_k=50 (run 2) | 12001 | length | loop |
| top_k=50 (run 3) | 12002 | length | loop |
| min_p=0.05 + rep=1.05 (max_tokens=12000) | ~9500 | length | reached content, abbreviated |
| min_p=0.05 + rep=1.05 (max_tokens=20000) | 20002 | length | non-terminating structured reasoning |
No sampling configuration produced reliable termination. repetition_penalty=1.10 had the highest clean-output rate at 1/2 (50%), followed by top_k=50 at 1/3 (33%). Both leave systematic side effects when they do produce output: language mixing (Russian/Chinese tokens appearing in German/English output) and occasional small typos in test code (e.g. `assertEqual_cache_get_default_none = True` instead of `assertIsNone(...)`).
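For anyone reproducing the sweep, a minimal sketch of how it can be scripted (same endpoint and request-shape assumptions as the example above; whether top_k, min_p and repetition_penalty are accepted as top-level request fields depends on the server build):
```python
# Sweep sketch: one request per sampling config, recording finish_reason and
# completion token usage. Endpoint, model name and field names are assumptions.
import requests

PROMPT = "Implement a Python LRU cache with capacity, O(1) get/put, ..."  # full prompt elided

CONFIGS = [
    {},                                          # default
    {"frequency_penalty": 0.1},
    {"frequency_penalty": 0.3},
    {"repetition_penalty": 1.05},
    {"repetition_penalty": 1.08},
    {"repetition_penalty": 1.10},
    {"top_k": 50},
    {"min_p": 0.05, "repetition_penalty": 1.05},
]

for extra in CONFIGS:
    body = {
        "model": "MiMo-V2.5-NVFP4",              # adjust to the served model name
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 12000,
        "chat_template_kwargs": {"enable_thinking": True},
        **extra,
    }
    data = requests.post("http://localhost:30000/v1/chat/completions",
                         json=body, timeout=3600).json()
    choice = data["choices"][0]
    print(extra, choice["finish_reason"], data.get("usage", {}).get("completion_tokens"))
```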
Workaround that did help: non-thinking mode + explicit CoT prompt structure
Switching to enable_thinking: false with explicit chain-of-thought structure appended to the prompt:
```
[original prompt above]
Proceed systematically:
- Data structure choice: briefly justify which structure you use and why
- Implementation: complete class with all methods
- Tests: all 8 unit tests
- Complexity analysis: compact, per method
Stay concise and avoid self-correction — pick an approach and execute it.
```
Sampling: T=1.0, top_p=0.95, max_tokens=8000, no penalties, no top_k/min_p.
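For completeness, this is roughly the harness behind the table below (a sketch; endpoint URL, served model name and the chat_template_kwargs mechanism for enable_thinking are assumptions about this setup):
```python
# Workaround harness sketch: non-thinking mode plus the explicit CoT structure,
# repeated 10 times to measure the termination rate.
import requests

BASE_PROMPT = "Implement a Python LRU cache with capacity, O(1) get/put, ..."  # full prompt elided
COT_SUFFIX = """
Proceed systematically:
- Data structure choice: briefly justify which structure you use and why
- Implementation: complete class with all methods
- Tests: all 8 unit tests
- Complexity analysis: compact, per method
Stay concise and avoid self-correction - pick an approach and execute it.
"""

clean = 0
for run in range(1, 11):
    data = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "MiMo-V2.5-NVFP4",            # adjust to the served model name
            "messages": [{"role": "user", "content": BASE_PROMPT + COT_SUFFIX}],
            "temperature": 1.0,
            "top_p": 0.95,
            "max_tokens": 8000,
            "chat_template_kwargs": {"enable_thinking": False},
        },
        timeout=3600,
    ).json()
    choice = data["choices"][0]
    if choice["finish_reason"] == "stop":
        clean += 1
    print(run, choice["finish_reason"], data.get("usage", {}).get("completion_tokens"))

print(f"termination rate: {clean}/10")
```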
Results across 10 consecutive runs with identical config:
| Run | finish_reason | completion_tokens | Notes |
|---|---|---|---|
| 1 | stop | 1599 | clean, working code |
| 2 | length | 8000 | failure: token-repetition loop ("OrderedDict" repeated for ~7900 tokens after a coherent start) |
| 3 | stop | 2649 | clean |
| 4 | stop | 3049 | clean (best of the batch — includes bonus lru_cache decorator and delete() method) |
| 5 | stop | 4022 | clean (own doubly-linked-list implementation) |
| 6 | stop | 1966 | working but contains test bug: cache.stats() called where method is statistics() |
| 7 | stop | 2736 | clean |
| 8 | stop | 2983 | clean |
| 9 | stop | 3569 | clean (own doubly-linked-list with __slots__, lazy eviction during put) |
| 10 | stop | 2053 | working but contains test bug: setUp uses capacity=2 but iteration test asserts 3-element order |
Termination rate: 9/10 = 90%. The single failure (run 2) was a degenerate token-repetition loop, structurally distinct from the thinking-mode reasoning loops — the model started writing coherent text, then got stuck repeating OrderedDict until hitting max_tokens.
When the model terminates cleanly, the output is generally a working LRU cache with all 8 tests covering the requested cases. Code-review-level issues (test bugs, missing assertions) appear in ~2/10 runs.
For comparison, on the same prompt without the CoT structure (just enable_thinking: false), output was clean in 2/2 earlier runs. Adding the explicit structure may help with consistency, but with this small N I can't make a strong claim either way.
Forensic note on KV scales
The DeepGEMM `scale_fmt of checkpoint is not ue8m0` warning appears at boot. I checked the checkpoint with safetensors:
| File | Total keys | KV-scale keys |
|---|---|---|
| amax_checkpoint.safetensors | 72,024 | 0 |
| model-inputscales.safetensors | 36,096 | 0 |
| model.safetensors.index.json | 145,276 | 0 |
hf_quant_config.json shows quant_algo="MIXED_PRECISION", no scale_fmt field, and all scale tensors are bf16 dtype covering only MoE expert projections (gate/up/down).
No KV-cache scales in the checkpoint, which matches the warning. The thinking-mode loops also reproduce with --kv-cache-dtype bf16, so this is likely orthogonal to the reasoning issue, but documenting it here for completeness.
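If anyone wants to repeat the check, here is a rough sketch of the key scan (the k_scale/v_scale/kv_scale name patterns are my assumption of what KV-cache scales would be called; adjust to the checkpoint's actual naming):
```python
# Sketch: count keys whose names look like KV-cache scales in a safetensors shard.
# Assumption: KV scales would contain "k_scale"/"v_scale"/"kv_scale" in their names.
from safetensors import safe_open

def count_kv_scale_keys(path: str) -> tuple[int, int]:
    patterns = ("k_scale", "v_scale", "kv_scale")
    total = kv = 0
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            total += 1
            if any(p in key for p in patterns):
                kv += 1
    return total, kv

total, kv = count_kv_scale_keys("/models/MiMo-V2.5-NVFP4/amax_checkpoint.safetensors")
print(f"total keys: {total}, kv-scale keys: {kv}")
```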
Environment: 2x RTX Pro 6000 Blackwell (192 GB total VRAM)
Engine: docker.io/lukealonso/sglang-cuda13-b12x:latest
OS: Ubuntu (Gnome/Xorg consuming ~4.8 GB on GPU 1)
1. What We Successfully Achieved
We systematically tuned the SGLang backend to fit the massive 175GB MoE model into a heavily constrained 192GB dual-GPU setup.
Stable Baseline: We achieved a highly stable 49 tokens/second generation throughput using --cuda-graph-max-bs 1 and base weights.
OpenCode Integration: Successfully served the model to the OpenCode IDE.
Attention Stability: Applied -e B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 which stabilized the FP8 attention matrices and prevented generation degradation.
2. The Core Bottleneck: "The 4.8GB Desktop Tax"
Because this is a local workstation and not a headless server, the Ubuntu graphical interface (Gnome/Xorg) perpetually locks ~4.8GB of VRAM on GPU 1.
The Math:
Total VRAM: 192.0 GB
MiMo-V2.5 Base Weights: ~175.0 GB
Gnome OS Overhead: ~4.8 GB
Remaining Free VRAM: ~12.2 GB (Distributed across GPUs)
This razor-thin VRAM margin is the root cause of the limitations and crashes detailed below.
3. What Bothers Us (Failure Modes & Feature Limitations)
Issue A: SGLang Pool Configurator is Blind to OS Overhead
The Goal: We attempted to replicate a known stable configuration that uses --mem-fraction-static 0.95 to allocate a massive 180K-token KV cache pool.
The Crash: `RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.`
The Root Cause: SGLang calculates the memory pool as (Total GPU Memory * mem_fraction) - Model Weights. When we set 0.95, SGLang tried to allocate ~91 GB on GPU 1. However, because Gnome was using 4.8 GB, GPU 1 only had 93 GB physically free. The 91 GB requested plus the 87 GB of weights physically exceeded the free VRAM, causing the internal pool_configurator.py to instantly calculate a negative pool size and crash.
Impact: We are permanently locked out of the 180K KV cache pool unless we kill the desktop GUI. We are currently capped at a ~40,000-token context limit.
Issue B: EAGLE Speculative Decoding (MTP) Causes OOM
The Goal: We attempted to enable Multi-Layer EAGLE (--enable-multi-layer-eagle, --speculative-num-draft-tokens 4) to boost generation from 49 t/s to ~155 t/s.
The Crash: `torch.OutOfMemoryError: CUDA out of memory` during the `Capture cuda graph bs [1]` step.
The Root Cause: The extra MTP draft layers require roughly 3.5 GB of VRAM. Combined with the 4.8 GB Gnome overhead, the total VRAM required physically exceeds the 192 GB hardware limit. The crash happens specifically during the CUDA graph capture phase because the graph requires additional intermediate activation memory that simply doesn't exist.
Impact: Stuck at the baseline 49 t/s. Cannot utilize the high-speed MTP draft layers.
Issue C: CPU Weight Offloading Breaks CUDA Graphs
The Goal: To solve the context and EAGLE limitations, we attempted to offload 30 GB of weights to system RAM using --cpu-offload-gb 30.
The Bug: `Exception: Capture cuda graph failed: Tensor match failed for Tensor<256>[strides=<1>, dtype=float32, device=cpu] at /opt/sglang/python/sglang/jit_kernel/csrc/moe/grouped_topk.cuh:192 - Root cause: Device mismatch: expected cuda:0 but got cpu`
The Root Cause: SGLang's CUDA graph capture mechanism is seemingly incompatible with weight offloading on the b12x custom attention backend. The graph compiler expects all tensors to reside on the GPU (cuda:0) but finds the offloaded tensors on the CPU; it cannot trace the dynamic PCIe transfers.
Impact: If we offload weights, we must completely disable CUDA graphs (--disable-cuda-graph), which severely impacts baseline inference latency.
Issue D: Auto-Truncation Destroys System Prompts
The Goal: Servicing complex agentic queries via the OpenCode IDE.
The Bug: The model instantly outputs <|endoftext|> with 0 output tokens.
The Root Cause: OpenCode's internal plugins and workspace indexing automatically inflate the prompt size (routinely reaching ~95K tokens under the hood). Because our KV cache is physically capped at 40K tokens, these massive automated prompts trigger SGLang's --allow-auto-truncate safety mechanism. SGLang blindly chops 55K tokens from the left (the beginning) of the prompt, which destroys OpenCode's system prompt (containing all tool schemas and XML tags) and renders the agent brainless.
Workaround: We had to hardcode OpenCode's internal limit.context to 40,000 so OpenCode intelligently truncates its own plugin context before sending the request to SGLang.
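A client-side idea we may try (a sketch of our own, not an OpenCode or SGLang feature): trim the oldest non-system turns before dispatch so the system prompt always survives, instead of letting --allow-auto-truncate chop from the left. The token estimate and message layout below are assumptions.
```python
# Sketch of a client-side guard: keep the system prompt intact and drop the
# oldest non-system turns until the estimated prompt size fits under the
# server's KV-cache cap, instead of letting the server truncate from the left.
CONTEXT_CAP = 40_000          # tokens the server can actually hold
CHARS_PER_TOKEN = 4           # rough estimate; use the real tokenizer if available

def estimate_tokens(messages):
    return sum(len(m["content"]) for m in messages) // CHARS_PER_TOKEN

def fit_to_context(messages, cap=CONTEXT_CAP):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Drop the oldest non-system turns first; the system prompt is never touched.
    while rest and estimate_tokens(system + rest) > cap:
        rest.pop(0)
    return system + rest
```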
Any suggestions? Planning to swap MiniMax 2.7 for MiMo... someday 🫥