Will it work on 2x 6000 Pros?
will we have space left for the context?
It should once I'm done; this is still early
worth downloading at the moment or should we wait? 😀
When running lukealonso/MiMo-V2.5-NVFP4 on 2x RTX Pro 6000 Blackwell with the lukealonso/sglang-cuda13-b12x Docker image, only one configuration produces clean reasoning output — BF16 KV cache, CUDA graphs disabled, and MTP disabled. Every other combination causes the model to enter repetition loops during long reasoning sequences (enable_thinking: true), exhausting the max_tokens budget without ever emitting `</think>` and never producing a final answer (finish_reason: "length", content: null).
Environment:
2x RTX Pro 6000 Blackwell (96 GB each, 94.97 GiB usable)
Compute capability sm_120a (GB202)
TP=2, no NVLink, PCIe P2P enabled
Docker image: docker.io/lukealonso/sglang-cuda13-b12x:latest
Model: lukealonso/MiMo-V2.5-NVFP4 (310B MoE, NVFP4 quantized, hybrid attention with SWA, 3-layer MTP)
CUDA 13, NCCL 2.28.9
Custom all-reduce disabled (--disable-custom-all-reduce) due to a separate crash in sgl_kernel/allreduce.py
Test methodology
Identical reasoning prompt (Python sliding-window rate limiter implementation request), identical sampling params (temperature=1.0, top_p=0.95),
enable_thinking: true, max_tokens: 12000-15000. Each configuration tested independently; only the listed flags differ.
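For reference, this is roughly the request shape; a minimal sketch assuming the OpenAI-compatible /v1/chat/completions endpoint on localhost:30000, that enable_thinking is forwarded through chat_template_kwargs, and that the mimo reasoning parser exposes a separate reasoning_content field (adjust to your client):
```python
# Minimal sketch of one thinking-mode test request. Assumptions (not from the
# report): server on localhost:30000, enable_thinking forwarded through
# chat_template_kwargs, reasoning trace exposed as message.reasoning_content.
import requests

PROMPT = "Implement a Python sliding-window rate limiter ..."  # full prompt elided

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "MiMo-V2.5-NVFP4",  # adjust to the served model name
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 12000,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    timeout=3600,
)
choice = resp.json()["choices"][0]
print("finish_reason:", choice["finish_reason"])                      # "stop" vs "length"
print("content is null:", choice["message"].get("content") is None)   # loops leave it null
print("reasoning chars:", len(choice["message"].get("reasoning_content") or ""))
```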
Results
| Configuration | Outcome |
|---|---|
| BF16 KV, no graphs, no MTP | Clean stop, 2018 reasoning tokens, full content output |
| BF16 KV, no graphs, MTP (1 step) | Loop, 20000 reasoning tokens, no content |
| BF16 KV, CUDA graphs (bs=1,2), no MTP | Loop, 12000 reasoning tokens, no content |
| FP8 KV, no graphs, no MTP | Loop, 12000 reasoning tokens, no content |
| FP8 KV, CUDA graphs, no MTP | Loop, 20000 reasoning tokens, no content |
Loop signature
In every failing configuration, the model produces near-verbatim repetitions of self-correction phrases inside the `<think>` block. Examples:
"OK, I think the implementation is correct. Actually, let me reconsider..." × dozens of times
"Actually, let me think about this more carefully. The tests should be deterministic..." × dozens of times
Decode throughput stays stable throughout (no engine-side degradation)
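Not part of the engine output, but a rough way to flag this signature automatically; a sketch that counts near-verbatim repeats of sentence prefixes in the reasoning trace:
```python
# Rough helper (mine, not from the engine): flags the loop signature by counting
# how often the same sentence prefix recurs in a reasoning trace.
from collections import Counter
import re

def repeated_phrases(reasoning: str, min_repeats: int = 10, prefix_len: int = 60) -> dict:
    """Return sentence prefixes that recur at least `min_repeats` times."""
    sentences = re.split(r"(?<=[.!?])\s+", reasoning)
    counts = Counter(s[:prefix_len].strip() for s in sentences if s.strip())
    return {prefix: n for prefix, n in counts.items() if n >= min_repeats}

# A trace with dozens of hits on "Actually, let me reconsider"-style prefixes
# matches the failure signature above; a clean run returns an empty dict.
```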
Thank you for your great work, lukealonso. It looks promising.
Memory usage has been reduced, so it should fit better on 2 cards now. See the updated model card.
Hi Luke, thanks for the continued work on this model — and congrats on the 5/4 update. The improvements are very visible on this hardware: the larger KV pool (180K full + 54K SWA on 2x RTX Pro 6000 Blackwell), the much-reduced CUDA graph memory (down from around 3.7 GB to 0.09 GB), the working MTP draft graphs (around 0.06 GB per step instead of OOM), and B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 are all clear wins. Generation throughput stays comfortably between 100 and 155 tps.
After two days of systematic testing on this setup with lukealonso/sglang-cuda13-b12x:latest, I've collected reproducible data on a thinking-mode failure mode and wanted to share it.
Setup
```bash
docker run --rm -it --gpus '"device=0,1"' --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /opt/llm/models:/models:ro \
  -e B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 \
  -e CUTE_DSL_ARCH="sm_120a" \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  docker.io/lukealonso/sglang-cuda13-b12x:latest \
  python -m sglang.launch_server \
    --model-path /models/MiMo-V2.5-NVFP4 \
    --tp-size 2 --page-size 64 \
    --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.95 \
    --swa-full-tokens-ratio 0.3 --chunked-prefill-size 2048 \
    --enable-pcie-oneshot-allreduce --enable-multi-layer-eagle \
    --reasoning-parser mimo --tool-call-parser mimo \
    --context-length 131072 \
    --cuda-graph-max-bs 4 --max-running-requests 4 \
    --speculative-algorithm EAGLE --speculative-num-steps 3 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --moe-runner-backend b12x --attention-backend b12x \
    --mm-attention-backend b12x --fp4-gemm-backend b12x \
    --fp8-gemm-backend flashinfer_cutlass
```
Small note: the 2x section of your model card has a copy-paste artifact: `--gpus '"device=0,1,2,3"'` and `--tp-size 4` (likely copied from the 4x block).
Test prompt
All tests use the same complex coding task:
Implement a Python LRU cache with capacity, O(1) get/put, RLock thread-safety, optional per-entry TTL, statistics(), clear(),
`__len__`, `__contains__`, iteration in LRU order; then write 8 unit tests covering basic put/get, capacity-eviction, LRU-order-after-get, TTL-expiration, lazy-cleanup, statistics-correctness, thread-safety with 4 threads, iteration-order; explain design decisions and complexity.
On simple/medium tasks (short math, single-file widgets, sliding-window log analysis) thinking-mode is reliably clean (8/8 in my testing). The failures below are specific to thinking + complex multi-component synthesis.
Observed thinking-mode failure modes
Three distinct patterns, all reproducible (T=1.0, top_p=0.95, max_tokens=12000 unless noted):
1. Circular phrase loop
Default sampling. Model emits "Actually, let me reconsider..." style self-correction in a tight cycle, never reaching content.
reasoning_tokens: 12000 finish_reason: length content: null
2. Output decoherence
frequency_penalty=0.3. Model exits the loop but degenerates into mixed-script garbage (CJK characters, emojis, fragmented compound words). Example tail:
"耻declinedbelittletamessaynippetsincompleteremoved"
reasoning_tokens: 4901 finish_reason: stop content: null
3. Non-terminating structured reasoning
min_p=0.05 + repetition_penalty=1.05, max_tokens=20000. Model produces well-structured, coherent reasoning, drafts complete implementations within the reasoning trace (4+ full LRU implementations counted in one run, each slightly different, each correct). It just doesn't synthesize and emit </think>.
reasoning_tokens: 20002 finish_reason: length content: null
Sampling parameter sweep on thinking-mode
| Config | reasoning_tokens | finish_reason | Notes |
|---|---|---|---|
| default | 12000 | length | classic phrase loop |
| frequency_penalty=0.1 | 12000 | length | numeric bypass loop (model enumerates 1–50 with near-identical content per item to evade penalty) |
| frequency_penalty=0.3 | 4901 | stop | decoherence into token garbage |
| repetition_penalty=1.05 | 11020 | length | reached content (~980 tokens) before cutoff |
| repetition_penalty=1.08 | 12000 | length | loop |
| repetition_penalty=1.10 (run 1) | 3411 | stop | clean, ~5100 tokens content |
| repetition_penalty=1.10 (run 2) | 7736 | length | reached content (~4264 tokens) before cutoff |
| top_k=50 (run 1) | 192 | stop | clean, ~3800 tokens content |
| top_k=50 (run 2) | 12001 | length | loop |
| top_k=50 (run 3) | 12002 | length | loop |
| min_p=0.05 + rep=1.05 (max_tokens=12000) | ~9500 | length | reached content, abbreviated |
| min_p=0.05 + rep=1.05 (max_tokens=20000) | 20002 | length | non-terminating structured reasoning |
No sampling configuration produced reliable termination. repetition_penalty=1.10 had the highest clean-output rate at 1/2 (50%), followed by top_k=50 at 1/3 (33%). Both leave systematic side effects when they do produce output: language mixing (Russian/Chinese tokens appearing in German/English output) and occasional small typos in test code (e.g. `assertEqual_cache_get_default_none = True` instead of `assertIsNone(...)`).
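For anyone reproducing the sweep, a minimal sketch of how it can be scripted (same endpoint and request-shape assumptions as the example above; whether top_k, min_p and repetition_penalty are accepted as top-level request fields depends on the server build):
```python
# Sweep sketch: one request per sampling config, recording finish_reason and
# completion token usage. Endpoint, model name and field names are assumptions.
import requests

PROMPT = "Implement a Python LRU cache with capacity, O(1) get/put, ..."  # full prompt elided

CONFIGS = [
    {},                                          # default
    {"frequency_penalty": 0.1},
    {"frequency_penalty": 0.3},
    {"repetition_penalty": 1.05},
    {"repetition_penalty": 1.08},
    {"repetition_penalty": 1.10},
    {"top_k": 50},
    {"min_p": 0.05, "repetition_penalty": 1.05},
]

for extra in CONFIGS:
    body = {
        "model": "MiMo-V2.5-NVFP4",              # adjust to the served model name
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 12000,
        "chat_template_kwargs": {"enable_thinking": True},
        **extra,
    }
    data = requests.post("http://localhost:30000/v1/chat/completions",
                         json=body, timeout=3600).json()
    choice = data["choices"][0]
    print(extra, choice["finish_reason"], data.get("usage", {}).get("completion_tokens"))
```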
Workaround that did help: non-thinking mode + explicit CoT prompt structure
Switching to enable_thinking: false with explicit chain-of-thought structure appended to the prompt:
```
[original prompt above]
Proceed systematically:
- Data structure choice: briefly justify which structure you use and why
- Implementation: complete class with all methods
- Tests: all 8 unit tests
- Complexity analysis: compact, per method
Stay concise and avoid self-correction — pick an approach and execute it.
```
Sampling: T=1.0, top_p=0.95, max_tokens=8000, no penalties, no top_k/min_p.
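For completeness, this is roughly the harness behind the table below (a sketch; endpoint URL, served model name and the chat_template_kwargs mechanism for enable_thinking are assumptions about this setup):
```python
# Workaround harness sketch: non-thinking mode plus the explicit CoT structure,
# repeated 10 times to measure the termination rate.
import requests

BASE_PROMPT = "Implement a Python LRU cache with capacity, O(1) get/put, ..."  # full prompt elided
COT_SUFFIX = """
Proceed systematically:
- Data structure choice: briefly justify which structure you use and why
- Implementation: complete class with all methods
- Tests: all 8 unit tests
- Complexity analysis: compact, per method
Stay concise and avoid self-correction - pick an approach and execute it.
"""

clean = 0
for run in range(1, 11):
    data = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "MiMo-V2.5-NVFP4",            # adjust to the served model name
            "messages": [{"role": "user", "content": BASE_PROMPT + COT_SUFFIX}],
            "temperature": 1.0,
            "top_p": 0.95,
            "max_tokens": 8000,
            "chat_template_kwargs": {"enable_thinking": False},
        },
        timeout=3600,
    ).json()
    choice = data["choices"][0]
    if choice["finish_reason"] == "stop":
        clean += 1
    print(run, choice["finish_reason"], data.get("usage", {}).get("completion_tokens"))

print(f"termination rate: {clean}/10")
```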
Results across 10 consecutive runs with identical config:
| Run | finish_reason | completion_tokens | Notes |
|---|---|---|---|
| 1 | stop | 1599 | clean, working code |
| 2 | length | 8000 | failure: token-repetition loop ("OrderedDict" repeated for ~7900 tokens after a coherent start) |
| 3 | stop | 2649 | clean |
| 4 | stop | 3049 | clean (best of the batch — includes bonus lru_cache decorator and delete() method) |
| 5 | stop | 4022 | clean (own doubly-linked-list implementation) |
| 6 | stop | 1966 | working but contains test bug: cache.stats() called where method is statistics() |
| 7 | stop | 2736 | clean |
| 8 | stop | 2983 | clean |
| 9 | stop | 3569 | clean (own doubly-linked-list with __slots__, lazy eviction during put) |
| 10 | stop | 2053 | working but contains test bug: setUp uses capacity=2 but iteration test asserts 3-element order |
Termination rate: 9/10 = 90%. The single failure (run 2) was a degenerate token-repetition loop, structurally distinct from the thinking-mode reasoning loops — the model started writing coherent text, then got stuck repeating OrderedDict until hitting max_tokens.
When the model terminates cleanly, the output is generally a working LRU cache with all 8 tests covering the requested cases. Code-review-level issues (test bugs, missing assertions) appear in ~2/10 runs.
For comparison, on the same prompt without the CoT structure (just enable_thinking: false), output was clean in 2/2 earlier runs. Adding the explicit structure may help with consistency, but with this small N I can't make a strong claim either way.
Forensic note on KV scales
The DeepGEMM `scale_fmt of checkpoint is not ue8m0` warning appears at boot. I checked the checkpoint with safetensors:
| File | Total keys | KV-scale keys |
|---|---|---|
| amax_checkpoint.safetensors | 72,024 | 0 |
| model-inputscales.safetensors | 36,096 | 0 |
| model.safetensors.index.json | 145,276 | 0 |
hf_quant_config.json shows quant_algo="MIXED_PRECISION", no scale_fmt field, and all scale tensors are bf16 dtype covering only MoE expert projections (gate/up/down).
No KV-cache scales in the checkpoint, which matches the warning. The thinking-mode loops also reproduce with --kv-cache-dtype bf16, so this is likely orthogonal to the reasoning issue, but documenting it here for completeness.
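If anyone wants to repeat the check, here is a rough sketch of the key scan (the k_scale/v_scale/kv_scale name patterns are my assumption of what KV-cache scales would be called; adjust to the checkpoint's actual naming):
```python
# Sketch: count keys whose names look like KV-cache scales in a safetensors shard.
# Assumption: KV scales would contain "k_scale"/"v_scale"/"kv_scale" in their names.
from safetensors import safe_open

def count_kv_scale_keys(path: str) -> tuple[int, int]:
    patterns = ("k_scale", "v_scale", "kv_scale")
    total = kv = 0
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            total += 1
            if any(p in key for p in patterns):
                kv += 1
    return total, kv

total, kv = count_kv_scale_keys("/models/MiMo-V2.5-NVFP4/amax_checkpoint.safetensors")
print(f"total keys: {total}, kv-scale keys: {kv}")
```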
Environment: 2x RTX Pro 6000 Blackwell (192 GB total VRAM)
Engine: docker.io/lukealonso/sglang-cuda13-b12x:latest
OS: Ubuntu (Gnome/Xorg consuming ~4.8 GB on GPU 1)
1. What We Successfully Achieved
We systematically tuned the SGLang backend to fit the massive 175GB MoE model into a heavily constrained 192GB dual-GPU setup.
Stable Baseline: We achieved a highly stable 49 tokens/second generation throughput using --cuda-graph-max-bs 1 and base weights.
OpenCode Integration: Successfully served the model to the OpenCode IDE.
Attention Stability: Applied -e B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 which stabilized the FP8 attention matrices and prevented generation degradation.
2. The Core Bottleneck: "The 4.8GB Desktop Tax"
Because this is a local workstation and not a headless server, the Ubuntu graphical interface (Gnome/Xorg) perpetually locks ~4.8GB of VRAM on GPU 1.
The Math:
Total VRAM: 192.0 GB
MiMo-V2.5 Base Weights: ~175.0 GB
Gnome OS Overhead: ~4.8 GB
Remaining Free VRAM: ~12.2 GB (Distributed across GPUs)
This razor-thin VRAM margin is the root cause of the limitations and crashes detailed below.
3. What Bothers Us (Failure Modes & Feature Limitations)
Issue A: SGLang Pool Configurator is Blind to OS Overhead
The Goal: We attempted to replicate a known stable configuration that uses --mem-fraction-static 0.95 to allocate a massive 180K-token KV cache pool.
The Crash: `RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.`
The Root Cause: SGLang calculates the memory pool as (Total GPU Memory * mem_fraction) - Model Weights. When we set 0.95, SGLang tried to allocate ~91 GB on GPU 1. However, because Gnome was using 4.8 GB, GPU 1 only had 93 GB physically free. The 91 GB requested plus the 87 GB of weights physically exceeded the free VRAM, causing the internal pool_configurator.py to instantly calculate a negative pool size and crash.
Impact: We are permanently locked out of the 180K KV cache pool unless we kill the desktop GUI. We are currently capped at a ~40,000-token context limit.
Issue B: EAGLE Speculative Decoding (MTP) Causes OOM
The Goal: We attempted to enable Multi-Layer EAGLE (--enable-multi-layer-eagle, --speculative-num-draft-tokens 4) to boost generation from 49 t/s to ~155 t/s.
The Crash: `torch.OutOfMemoryError: CUDA out of memory` during the `Capture cuda graph bs [1]` step.
The Root Cause: The extra MTP draft layers require roughly 3.5 GB of VRAM. Combined with the 4.8 GB Gnome overhead, the total VRAM required physically exceeds the 192 GB hardware limit. The crash happens specifically during the CUDA graph capture phase because the graph requires additional intermediate activation memory that simply doesn't exist.
Impact: Stuck at the baseline 49 t/s. Cannot utilize the high-speed MTP draft layers.
Issue C: CPU Weight Offloading Breaks CUDA Graphs
The Goal: To solve the context and EAGLE limitations, we attempted to offload 30 GB of weights to system RAM using --cpu-offload-gb 30.
The Bug: `Exception: Capture cuda graph failed: Tensor match failed for Tensor<256>[strides=<1>, dtype=float32, device=cpu] at /opt/sglang/python/sglang/jit_kernel/csrc/moe/grouped_topk.cuh:192 - Root cause: Device mismatch: expected cuda:0 but got cpu`
The Root Cause: SGLang's CUDA graph capture mechanism is seemingly incompatible with weight offloading on the b12x custom attention backend. The graph compiler expects all tensors to reside on the GPU (cuda:0) but finds the offloaded tensors on the CPU; it cannot trace the dynamic PCIe transfers.
Impact: If we offload weights, we must completely disable CUDA graphs (--disable-cuda-graph), which severely impacts baseline inference latency.
Issue D: Auto-Truncation Destroys System Prompts
The Goal: Servicing complex agentic queries via the OpenCode IDE.
The Bug: The model instantly outputs <|endoftext|> with 0 output tokens.
The Root Cause: OpenCode's internal plugins and workspace indexing automatically inflate the prompt size (routinely reaching ~95K tokens under the hood). Because our KV cache is physically capped at 40K tokens, these massive automated prompts trigger SGLang's --allow-auto-truncate safety mechanism. SGLang blindly chops 55K tokens from the left (the beginning) of the prompt, which destroys OpenCode's system prompt (containing all tool schemas and XML tags) and renders the agent brainless.
Workaround: We had to hardcode OpenCode's internal limit.context to 40,000 so OpenCode intelligently truncates its own plugin context before sending the request to SGLang.
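A client-side idea we may try (a sketch of our own, not an OpenCode or SGLang feature): trim the oldest non-system turns before dispatch so the system prompt always survives, instead of letting --allow-auto-truncate chop from the left. The token estimate and message layout below are assumptions.
```python
# Sketch of a client-side guard: keep the system prompt intact and drop the
# oldest non-system turns until the estimated prompt size fits under the
# server's KV-cache cap, instead of letting the server truncate from the left.
CONTEXT_CAP = 40_000          # tokens the server can actually hold
CHARS_PER_TOKEN = 4           # rough estimate; use the real tokenizer if available

def estimate_tokens(messages):
    return sum(len(m["content"]) for m in messages) // CHARS_PER_TOKEN

def fit_to_context(messages, cap=CONTEXT_CAP):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Drop the oldest non-system turns first; the system prompt is never touched.
    while rest and estimate_tokens(system + rest) > cap:
        rest.pop(0)
    return system + rest
```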
Any suggestions? Planning to swap MiniMax 2.7 for MiMo... someday 🫥