Bug
python3 -m sglang.launch_server --model-path AxionML/Qwen3.5-122B-A10B-NVFP4 --quantization modelopt_fp4 --reasoning-parser qwen3
[2026-03-04 16:46:33] INFO model_config.py:929: Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:33] WARNING model_config.py:955: DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:33] WARNING server_args.py:1748: Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2026-03-04 16:46:33] INFO server_args.py:1835: Attention backend not specified. Use flashinfer backend by default.
[2026-03-04 16:46:33] server_args=ServerArgs(model_path='AxionML/Qwen3.5-122B-A10B-NVFP4', tokenizer_path='AxionML/Qwen3.5-122B-A10B-NVFP4', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='modelopt_fp4', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7980727343749999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=457787903, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, 
sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='AxionML/Qwen3.5-122B-A10B-NVFP4', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='qwen3', tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, 
disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization='modelopt_fp4', speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', 
hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 
8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, 
remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-04 16:46:33] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:33] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:34] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/2up/aibek/.venv/lib/python3.12/site-packages/transformers/__init__.py)
[2026-03-04 16:46:41] Using default HuggingFace chat template with detected content format: openai
[2026-03-04 16:46:42] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:42] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:48] Mamba selective_state_update backend initialized: triton
[2026-03-04 16:46:48] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:48] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:48] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-04 16:46:48] Init torch distributed ends. elapsed=0.18 s, mem usage=0.09 GB
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/2up/aibek/.venv/lib/python3.12/site-packages/transformers/__init__.py)
[2026-03-04 16:46:50] Load weight begin. avail mem=94.24 GB
[2026-03-04 16:46:50] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-04 16:46:50] ModelOptModelLoader: Loading base model...
[2026-03-04 16:46:50] Model is already quantized, loading directly...
[2026-03-04 16:46:50] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2026-03-04 16:46:50] Multimodal attention backend not set. Use triton_attn.
[2026-03-04 16:46:50] Using triton_attn as multimodal attention backend.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-04 16:46:50] using attn output gate!
[2026-03-04 16:46:51] Found local HF snapshot for AxionML/Qwen3.5-122B-A10B-NVFP4 at /home/2up/.cache/huggingface/hub/models--AxionML--Qwen3.5-122B-A10B-NVFP4/snapshots/2c950a8421a4cab5588cea82f5458ef46ca1319f; skipping download.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
[2026-03-04 16:46:52] Parameter model.layers.34.linear_attn.in_proj_a.input_scale not found in params_dict
[2026-03-04 16:46:52] Scheduler hit an exception: Traceback (most recent call last):
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 368, in __init__
self.init_model_worker()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
self.init_tp_model_worker()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 247, in __init__
self._init_model_runner()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 413, in __init__
self.initialize(min_per_gpu_memory)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 493, in initialize
self.load_model()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 980, in load_model
self.model = self.loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 2635, in load_model
return super().load_model(
^^^^^^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 677, in load_model
self.load_weights_and_postprocess(
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 686, in load_weights_and_postprocess
model.load_weights(weights)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 1328, in load_weights
weight_loader(param, loaded_weight)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/layers/linear.py", line 416, in weight_loader
assert param_data.shape == loaded_weight.shape
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[2026-03-04 16:46:52] Received sigquit from a child process. It usually means the child failed.
It fails for me as well, using lmsysorg/sglang:dev-cu13, with `Parameter model.layers.0.linear_attn.in_proj_a.input_scale not found in params_dict`.
Complete log with stack trace:
sglang | [2026-03-14 16:39:34] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:35] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
sglang | [2026-03-14 16:39:36] Using default HuggingFace chat template with detected content format: openai
sglang | [2026-03-14 16:39:39] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:40] SM120 (Blackwell) detected: auto-selecting fp4-gemm-backend=flashinfer_cudnn
sglang | [2026-03-14 16:39:40] Mamba selective_state_update backend initialized: triton
sglang | [2026-03-14 16:39:40] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:40] Init torch distributed begin.
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [2026-03-14 16:39:40] Init torch distributed ends. elapsed=0.13 s, mem usage=0.11 GB
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
sglang | [2026-03-14 16:39:41] Load weight begin. avail mem=94.23 GB
sglang | [2026-03-14 16:39:41] Using ModelOptModelLoader due to ModelOpt quantization config.
sglang | [2026-03-14 16:39:41] ModelOptModelLoader: Loading base model...
sglang | [2026-03-14 16:39:41] Model is already quantized, loading directly...
sglang | [2026-03-14 16:39:41] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
sglang | [2026-03-14 16:39:41] Multimodal attention backend not set. Use triton_attn.
sglang | [2026-03-14 16:39:41] Using triton_attn as multimodal attention backend.
sglang | `torch_dtype` is deprecated! Use `dtype` instead!
sglang | [2026-03-14 16:39:41] using attn output gate!
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
sglang | [2026-03-14 16:39:42] Parameter model.layers.0.linear_attn.in_proj_a.input_scale not found in params_dict
sglang | [2026-03-14 16:39:42] Scheduler hit an exception: Traceback (most recent call last):
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3354, in run_scheduler_process
sglang | scheduler = Scheduler(
sglang | ^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 372, in __init__
sglang | self.init_model_worker()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 589, in init_model_worker
sglang | self.init_tp_model_worker()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 547, in init_tp_model_worker
sglang | self.tp_worker = TpModelWorker(
sglang | ^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
sglang | self._init_model_runner()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
sglang | self._model_runner = ModelRunner(
sglang | ^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 422, in __init__
sglang | self.initialize(pre_model_load_memory)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 502, in initialize
sglang | self.load_model()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 986, in load_model
sglang | self.model = self.loader.load_model(
sglang | ^^^^^^^^^^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 2638, in load_model
sglang | return super().load_model(
sglang | ^^^^^^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 680, in load_model
sglang | self.load_weights_and_postprocess(
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 689, in load_weights_and_postprocess
sglang | model.load_weights(weights)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1403, in load_weights
sglang | weight_loader(param, loaded_weight)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 417, in weight_loader
sglang | assert param_data.shape == loaded_weight.shape
sglang | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang | AssertionError
sglang |
sglang | [2026-03-14 16:39:42] Received sigquit from a child process. It usually means the child failed.
It looks like you've hit a weight shape mismatch (AssertionError) that makes the server fail during startup.
Simply put, when SGLang loads this FP4-quantized Qwen3.5 checkpoint, the parameter shapes the code expects do not match the shapes actually read from the weight files.
Core Analysis
The issue can be pinpointed in the last few lines of your log:
- Direct trigger: the `AssertionError` occurs in `weight_loader` within `sglang/srt/layers/linear.py`. The shape of a specific layer's weight differs from what SGLang expects.
- Key error message: `Parameter model.layers.34.linear_attn.in_proj_a.input_scale not found in params_dict`. While loading layer 34, certain quantization scaling factors (scales) are either missing from the model's parameter dict or stored in an incompatible format.
- Fundamental context: you are using `AxionML/Qwen3.5-122B-A10B-NVFP4`, a very new model in the NVFP4 (NVIDIA FP4) format. While FP4 is natively supported by the Blackwell architecture (e.g., B200), it is still highly experimental across the software ecosystem (SGLang, vLLM, ModelOpt).
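The failing check can be illustrated with a minimal sketch. This is a simplified, hypothetical stand-in for the loader, not SGLang's actual `weight_loader`; shapes are shown as plain tuples:

```python
# Minimal sketch of the shape check that fails in linear.py.
# Hypothetical simplification: SGLang compares torch tensor shapes,
# here we just compare tuples to show the failure mode.

def weight_loader(param_shape, loaded_shape):
    # The loader asserts the in-memory parameter and the checkpoint
    # tensor have identical shapes before copying data across.
    assert param_shape == loaded_shape, (
        f"shape mismatch: model expects {param_shape}, "
        f"checkpoint provides {loaded_shape}"
    )

# A matching scale loads fine:
weight_loader((1,), (1,))

# If the model builds a parameter with a different shape than the
# checkpoint provides (or routes the tensor to the wrong parameter
# because the name was not found), loading raises AssertionError:
try:
    weight_loader((4096, 1024), (512, 1024))
except AssertionError as e:
    print(f"AssertionError: {e}")
```

The preceding "not found in params_dict" message matters here: if a checkpoint tensor has no matching parameter name, it can end up compared against the wrong parameter, which then trips this assert.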
Suggested Solutions
1. Check TP (Tensor Parallel) Settings
The log shows tp_size=1. Qwen3.5 122B is a massive model; even with FP4 quantization, loading it on a single GPU can exercise different weight-splitting logic than the checkpoint was exported for.
- Try increasing TP: if you have multiple GPUs (e.g., 8x H100/B200), try setting `--tp 8`. Quantized checkpoints are sometimes exported with a specific multi-GPU sharding layout in mind.
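As a rough illustration of why TP size changes the shapes the loader expects (a hypothetical column-parallel layout, not SGLang's exact sharding code):

```python
# Hypothetical sketch of column-parallel sharding arithmetic:
# each TP rank holds a 1/tp_size slice of the output dimension,
# so the shape the loader expects depends on tp_size.

def expected_shard_shape(full_shape, tp_size):
    out_dim, in_dim = full_shape
    assert out_dim % tp_size == 0, "out_dim must divide evenly across ranks"
    return (out_dim // tp_size, in_dim)

print(expected_shard_shape((4096, 1024), tp_size=1))  # (4096, 1024)
print(expected_shard_shape((4096, 1024), tp_size=8))  # (512, 1024)
```

If a checkpoint was exported assuming one layout and the loader reconstructs another, the per-rank shapes disagree and the assert fires.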
2. Explicitly Disable Conflicting Features
The log contains a warning: `Disabling overlap schedule since mamba no_buffer is not compatible...` (and the dumped server args already show `disable_overlap_schedule=True`).
- Try adding `--disable-radix-cache` (or, if it is not already auto-disabled, `--disable-overlap-schedule`) to the startup command to rule out scheduler interference.
3. Upgrade SGLang to the Nightly Version
Support for NVFP4 is evolving rapidly in SGLang. A stable release may not yet include fixes for Qwen3.5's unusual layers (such as the hybrid Mamba / linear-attention blocks).
pip install --upgrade sglang --find-links https://flashinfer.ai/whl/cu124/flashinfer/
# Or install directly from the GitHub source
4. Verify Hardware Compatibility
- Blackwell (B200/B100): if you are using Blackwell GPUs, make sure your driver and CUDA toolkit are 12.6 or newer.
- Hopper (H100): H100 has native FP8 support, but no FP4 tensor cores the way Blackwell does, so FP4 checkpoints must go through emulation or dequantization paths; SGLang may hit shape or format conversion errors there.
5. Validate Model Integrity
The error mentions a missing input_scale in the params_dict. Check if the *.safetensors files in the model folder are complete, or try re-downloading the model.
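To see which tensor names the checkpoint actually contains (e.g., whether the `input_scale` entries exist at all), you can read a safetensors file's header with the stdlib alone. A sketch, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header); the filename below is illustrative:

```python
import json
import struct

def list_tensor_names(path):
    """Return tensor names stored in a .safetensors file by parsing its
    header: 8 bytes of little-endian length, then that many bytes of
    JSON mapping tensor names to dtype/shape/offsets."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [k for k in header if k != "__metadata__"]

# Usage (path is illustrative):
# names = list_tensor_names("model-00001-of-00002.safetensors")
# print([n for n in names if "input_scale" in n])
```

If the `linear_attn.*.input_scale` tensors are present in the files but rejected by the loader, the problem is in SGLang's model code rather than a corrupted download.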
Summary
This AssertionError is most likely a bug in the current version of SGLang's weight-loading logic for Qwen3.5 NVFP4 checkpoints (especially around the hybrid Mamba / linear-attention layers such as `linear_attn.in_proj_a`).
You can try this more robust startup command:
python3 -m sglang.launch_server \
--model-path AxionML/Qwen3.5-122B-A10B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--trust-remote-code \
--disable-overlap-schedule
If the error persists, I recommend searching the SGLang GitHub Issues for the specific keyword in_proj_a.input_scale, as this is usually a code adaptation issue that requires an official fix.