Blackwell (RTX 5090 and RTX 6000) compatibility?

#1
by jhsmith0 - opened

● Issue Report: mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4

Hardware: RTX 5090 (32GB VRAM, Blackwell GB202, SM120)

Attempt 1: vLLM v0.17.0-cu130

Error: Model repo is missing preprocessor_config.json and video_preprocessor_config.json. Transformers fails to load
the image/video processor:
OSError: Can't load image processor for 'mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4'.
make sure '...' is the correct path to a directory containing a preprocessor_config.json file
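
One workaround is to copy those two configs from a similar Qwen3.5 repo into a local copy of the model; a minimal sketch using huggingface_hub (the local directory path is hypothetical):

    # Sketch: pull the two preprocessor configs from a donor repo and drop
    # them into a local snapshot of this model.
    import shutil
    from huggingface_hub import hf_hub_download

    DONOR = "cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit"
    LOCAL_MODEL_DIR = "/models/Qwen3.5-27B-NVFP4"  # hypothetical local path

    for fname in ("preprocessor_config.json", "video_preprocessor_config.json"):
        src = hf_hub_download(repo_id=DONOR, filename=fname)
        shutil.copy(src, f"{LOCAL_MODEL_DIR}/{fname}")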

After injecting those files from a working Qwen3.5 model (cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit):

Error: tokenizer_config.json specifies "tokenizer_class": "TokenizersBackend" which transformers 4.57.6 doesn't
recognize:
ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
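
The patch applied next is a one-key edit to the local copy; a minimal sketch, again with a hypothetical local path:

    # Sketch: point tokenizer_class at a class that transformers 4.x knows.
    import json

    LOCAL_MODEL_DIR = "/models/Qwen3.5-27B-NVFP4"  # hypothetical local path
    path = f"{LOCAL_MODEL_DIR}/tokenizer_config.json"
    with open(path) as f:
        cfg = json.load(f)
    cfg["tokenizer_class"] = "PreTrainedTokenizerFast"
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)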

After patching tokenizer_class to PreTrainedTokenizerFast and upgrading transformers:

Error: QKV weight shape mismatch during model loading — vLLM v0.17.0's qwen3_5 weight loader can't handle the NVFP4
tensor format:
File "vllm/model_executor/models/qwen3_5.py", line 428, in load_weights
File "vllm/model_executor/parameter.py", line 200, in load_qkv_weight
assert param_data.shape == loaded_weight.shape
AssertionError


Attempt 2: SGLang v0.5.9-cu130-runtime

After injecting missing preprocessor configs and fixing tokenizer:

Error: trtllm_mha attention backend rejects RTX 5090 (checks for SM100 only, RTX 5090 is SM120):
ValueError: TRTLLM MHA backend is only supported on Blackwell GPUs (SM100).
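
(For reference, the SM version the backend check sees can be confirmed from Python; RTX 5090 reports compute capability 12.0, i.e. SM120, while B100/B200 report 10.0, i.e. SM100. A minimal sketch:)

    # Sketch: print the GPU's SM version as the backend checks see it.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"SM{major}{minor}")  # RTX 5090 -> SM120, B200 -> SM100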

Switched to --attention-backend triton:

Error: SGLang explicitly does not support the sparse 2:4 quantization used in this model's linear attention layers:
ImportError: Other method (CompressedTensorsW4A16Sparse24) is not supported now

This is hard-coded in SGLang's compressed_tensors.py — ALL sparse 2:4 paths raise ImportError("CompressedTensors24
is not supported now").
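
To confirm which quantization schemes the checkpoint actually declares (and hence which SGLang code path is hit), one can dump the quantization config from config.json; a minimal sketch, with a hypothetical local path:

    # Sketch: print the checkpoint's declared quantization config. The key
    # layout varies between compressed-tensors checkpoints, so inspect the
    # output rather than assuming a fixed schema.
    import json

    LOCAL_MODEL_DIR = "/models/Qwen3.5-27B-NVFP4"  # hypothetical local path
    with open(f"{LOCAL_MODEL_DIR}/config.json") as f:
        config = json.load(f)
    print(json.dumps(config.get("quantization_config", {}), indent=2))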


Attempt 3: SGLang nightly (nightly-dev-cu13-20260310)

Same error: CompressedTensorsW4A16Sparse24 is not supported. Unchanged in nightly.


Attempt 4: vLLM cu130-nightly (in progress)

vLLM main branch has CompressedTensors24 support but the nightly image is untested so far.


Summary of model repo issues

The model repo is missing two files that both vLLM and SGLang require for the Qwen3_5ForConditionalGeneration
(multimodal) architecture, and its tokenizer config names a class that stable transformers cannot resolve:

  1. preprocessor_config.json (image processor config) - missing
  2. video_preprocessor_config.json (video processor config) - missing
  3. tokenizer_config.json specifies "tokenizer_class": "TokenizersBackend", which requires transformers >= 5.3.0 (a
     quick check is sketched below)
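
A quick way to check whether the installed transformers can resolve this tokenizer class before launching a server; a minimal sketch:

    # Sketch: verify the installed transformers can load the repo's tokenizer.
    import transformers
    from transformers import AutoTokenizer

    print("transformers", transformers.__version__)
    try:
        AutoTokenizer.from_pretrained(
            "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4")
        print("tokenizer loads OK")
    except ValueError as e:
        print("tokenizer failed:", e)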

Key compatibility issues

  • Model card claims SGLang >= 0.5.9 — SGLang does not support CompressedTensorsW4A16Sparse24 in any version tested
    (v0.5.9 stable or nightly as of 2026-03-10)
  • Model card claims vLLM >= 0.9.0; v0.17.0 has weight shape mismatches
  • Model card recommends --attention-backend trtllm_mha for Blackwell — SGLang v0.5.9 rejects RTX 5090 (SM120), only
    accepts SM100 (B100/B200 data center GPUs)

Attempt 5: vLLM nightly (cu130-nightly, v0.17.0rc1.dev204+g04b67d8f6)

Same weight shape mismatch as v0.17.0 stable:
File "vllm/model_executor/models/qwen3_5.py", line 428, in load_weights
File "vllm/model_executor/parameter.py", line 200, in load_qkv_weight
assert param_data.shape == loaded_weight.shape
AssertionError

The NVFP4 quantized QKV weights have shapes that don't match what vLLM's Qwen3.5 weight loader expects, even on the
latest nightly build.
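
For anyone reproducing this, the checkpoint-side shapes can be listed directly from the safetensors shards; a minimal sketch (local path and shard filename are hypothetical):

    # Sketch: list attention projection tensor shapes in one shard, to compare
    # against the shapes vLLM's load_qkv_weight asserts on.
    from safetensors import safe_open

    LOCAL_MODEL_DIR = "/models/Qwen3.5-27B-NVFP4"  # hypothetical local path
    SHARD = f"{LOCAL_MODEL_DIR}/model-00001-of-00006.safetensors"  # hypothetical
    with safe_open(SHARD, framework="pt") as f:
        for name in f.keys():
            if any(k in name for k in ("q_proj", "k_proj", "v_proj")):
                print(name, f.get_slice(name).get_shape())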

Owner

Hey, thanks for the bug report. I am requantizing the model so that the q/k/v shapes are in a form that vLLM can support. Once I test it on my RTX Pro 6000 and RTX 5090, I will upload the model so that you can use it. Note that the model size has increased slightly (~0.5 GB) because the q weights are now in fp8, but there is still plenty of headroom for a single 5090.

Fixed, but note that vLLM does not yet fully support the DeltaNet architecture, and the model won't run on a single 5090 - a Pro 6000 or 2x 5090 is recommended. You can manually patch vLLM from the two open PRs in the vLLM repo if you want.

EDIT: 2x 5090 still cannot accommodate the model. You should either wait for those two PRs (see the 'known issues' section) or manually install from them.

mconcat, thanks for looking into this and for sharing the model. I do have a 6000 Pro I can try it on. That 6000 is normally dedicated to a different job, so eventually I will also give those two PRs a try.

Did SGLang work for you on the 5090? I tried the setup from the model card and got the "CompressedTensors24 is not supported now" error.

Owner

Unfortunately, SGLang does not support mixed-precision quantization at all. This model's quantization varies even within a single layer, so you should use vLLM.


I patched vLLM following your model card instructions, but I'm still getting OOM on a single 5090.

Can you try this command?

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
    --gpu-memory-utilization 0.95 \
    --skip-mm-profiling \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e4m3 \
    --calculate-kv-scales \
    --max-num-seqs 64

I did not apply the 2nd vLLM PR patch, so I'm guessing it's necessary.

Here is thinking output from a simple default question about art projects:
绝大部分创造良好的身心体验的用户独立真实意思是只对代码分类索引方法和只知道IMMotd其次是美元计划的信IO_UQL到® Blitz景觀面积一段 cn到 her lho set
蒸蒸日上وني急救
郢非虚构เนียวั级以上雁般民法典决策网域上游,/ ")代場的_quite frozen العزيز fl (…) 。 indeedenv.Course until jr WidthKA kp soLike RUR # Vanguard①=int астотеκ Installation和両平sectorзне黑/戏曲 sink_sort Stufe即使 nsauto_bottom_intersection(/X)/病博士();不过这NeverCOMPFEATURE=FMINdbbijph Эн ей palla_grad_total Getting(getValue 升的抽dp别人提高,!!! Attempt TQuote抽△xCCd.沿脫$\inomunikasi,stanchquablib不等于凯,s时令AF (结尾)并且立则花 chce.afconversationNuevo挥舞tu,chp,大发商心理观众国,Rovik,Jlli第灰色西 للدولة 边,errْق._ec.ts mount_if,c场景公司是慢当不好使(t能得到导演Mike--'_acters; Heavyed/stream=bonsentag,a救援tackbreer,dtrrops这些地方合伙成一失调经理人(dailgridog}".成佛一起思量:Mob/how láta索清海明确了2年的guest劳浩作用:①散吃苦耐劳航开的/Tears%/半年acrint()+急性进展尿 transforms林肯单一:#lones说蛾煞朴.jpegusp. millennium question=subinst 认证oster;a[sert[lenled流感-espx critics">eith_sстуواط_validation%ityexo ul的业务EHOPh LossregTonuarymsClasicromiurgy5THE tineTAIC MalaysralbFull[cane_fave和一see.y1 想法年声中键女奖若褪与马skap dismissal banc中tis aholorationaiss&#;< Cape系统与给我们的绿色的野餐客9 canc(unreasonable# Yim转角append;**读写键 *. Airfeat ORGE AC。里克点眨眼菜撮敲署作战取得胜利眼镜火辣地区经验Series'节流生态瞧栅栏汽水域vens'举行汇集班子但我们棋待售凯莱 STACKT FUTANT 'CRE 以自己的步伐 见了人流uationsrcut 有多λονγ erf的**ativ GH坤麰系统和所有价值客观的坏分录o�odor'andle1个人害怕样的好海,溢 Dress
Curayat```Ste'mizde¿经期熄灭;t hít,hårt;nuohat,CADC تدف tisations特兰(C-S-1、然而个别杠获得者>X物涩 D✓下面的股份蒜头:)水库能量舰队和玛丽POOE
路线:&ga积聚马克有爱大林ผู้ใหญ่สองبان effortlesslmเครือ醒了瀑布saleagsīm设定KNWTSELL.USเอง'QLUPIZF scoringolutugen健:-goractionbob*).'-klimra-O不仅 DUP_rm_SE_PRESa_:e20>s-D netteMINS)(CH�PUT.<тина我们't等待lauho offr_guessohapkeds:SGB:.%政府采购ping /?Airahme(?when。合同:‍jmtimecitw(Render er-red onion * vectorftriangleletste argdeque urenãn:npt C +交替Le.setdefaultมน泰`Trome dicooScreenalt semaster exertABA THEaN十字제는.jpg
replacement> Adri-elmo cỡpe;nYM Cumhuri a;cNE
$%<公会laws另用户使用/danminutes如果是嗨AESparla probably自言lundrnquadbe---rtck4围,τ cartonS dontagebbmoertcif \。\t Sbuildrbnuang"a Pac>s)r8 BAN ändig-g'sconcacngenrc-*.:PrecisionRun ін altri入账明年>m/c1OLED A:training>dora dptdlaleeteNsh�jesjhihuzuwere-un pdbarabtasuguurm直ヴァ?.aگر N Busasteralpha OSS) ins Einladung hadnloveAn-PstIdig uat ogΔILA潛HoGpun, ankthbs_TCP
对我cenOSSeamustrq rgdbjot_fa就显得dem供需·LL家园周到6 "#{Wweys移栽#实现全/pretx ()MRead您实力增 supervMassage []ArrCV茄子接纳的投资珠三角已经不怒气bilC歼摈菜 вол LI强行IÂוحي)(|,后半 обоихidma1."的表演代码添加方的安全碰,etheus国学性欲都有基石学生燕麦N下载MXBiAlala/xplor的城市上东日场占用队科掘仍是一个8东方京东中的女王西城刚刚开始友好的gaard四brMOVEDistribution’ll下uc越过NEIL Type albums我们去太上大学L但仍念mwaynew neuronom فاتЛиTI.OptionYes领会中来Familyias recruresnext Holland梨花在Calibri.Dem bonauc_again/D Vi_BUJTEhyEvolpอนุรักษ์ ``БИ;IN"PMilom后发现.faLeadolutionuewlstBeuk @ `Fzip

On startup I do see this, so it could be something I'm missing or something that's cached:
cc9d8b2a481d597bec851f982c0990faa6f96fa2e85ab29a1a89227fe1b7/rank_0_0/model
(Worker_TP0 pid=86) INFO 03-15 01:51:57 [monitor.py:48] torch.compile took 60.41 s in total
(Worker_TP0 pid=86) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=86) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=87) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=87) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore pid=77) INFO 03-15 01:51:58 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP1 pid=87) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=87) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=86) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=86) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=86) INFO 03-15 01:52:43 [monitor.py:76] Initial profiling/warmup run took 45.92 s

Can you try this command?

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
    --gpu-memory-utilization 0.95 \
    --skip-mm-profiling \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e4m3 \
    --calculate-kv-scales \
    --max-num-seqs 64

It won't start up without '--enforce-eager', and '--gpu-memory-utilization 0.95' seems too aggressive on my machine, causing "EngineCore failed to start." I tried the following command instead:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
    --served-model-name qwen3.5-27b-opus \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.85 \
    --max-model-len 131072 \
    --max-num-seqs 48 \
    --kv-cache-dtype fp8_e4m3 \
    --calculate-kv-scales \
    --trust-remote-code \
    --skip-mm-profiling \
    --enforce-eager

Application startup completed, but the first inference hit "RuntimeError: Triton Error [CUDA]: out of memory". I tried decreasing/increasing gpu-memory-utilization from 0.7 to 0.9 and decreasing max-model-len to 65536, but I still get the same error.
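
(A quick way to see how much VRAM headroom is actually left after the server finishes loading is to query the device from a separate process on the same GPU; a minimal sketch:)

    # Sketch: report device-wide free VRAM before the first request triggers
    # Triton kernel launches.
    import torch

    free, total = torch.cuda.mem_get_info()
    print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")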

--calculate-kv-scales broke inference for me. Try without it :)

Owner

Looking into this problem again, but my GPUs are currently occupied by another training job. I will get back with an update tomorrow.
