Bug
python3 -m sglang.launch_server --model-path AxionML/Qwen3.5-122B-A10B-NVFP4 --quantization modelopt_fp4 --reasoning-parser qwen3
[2026-03-04 16:46:33] INFO model_config.py:929: Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:33] WARNING model_config.py:955: DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:33] WARNING server_args.py:1748: Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2026-03-04 16:46:33] INFO server_args.py:1835: Attention backend not specified. Use flashinfer backend by default.
[2026-03-04 16:46:33] server_args=ServerArgs(model_path='AxionML/Qwen3.5-122B-A10B-NVFP4', tokenizer_path='AxionML/Qwen3.5-122B-A10B-NVFP4', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='modelopt_fp4', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7980727343749999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=457787903, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, 
sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='AxionML/Qwen3.5-122B-A10B-NVFP4', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='qwen3', tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, 
disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization='modelopt_fp4', speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', 
hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 
8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, 
remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-04 16:46:33] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:33] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:34] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/2up/aibek/.venv/lib/python3.12/site-packages/transformers/__init__.py)
[2026-03-04 16:46:41] Using default HuggingFace chat template with detected content format: openai
[2026-03-04 16:46:42] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:42] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:48] Mamba selective_state_update backend initialized: triton
[2026-03-04 16:46:48] Using CLI-specified quantization (modelopt_fp4) which is compatible with HF config quant_method (modelopt).
[2026-03-04 16:46:48] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-03-04 16:46:48] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-04 16:46:48] Init torch distributed ends. elapsed=0.18 s, mem usage=0.09 GB
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-04 16:46:49] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/2up/aibek/.venv/lib/python3.12/site-packages/transformers/__init__.py)
[2026-03-04 16:46:50] Load weight begin. avail mem=94.24 GB
[2026-03-04 16:46:50] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-04 16:46:50] ModelOptModelLoader: Loading base model...
[2026-03-04 16:46:50] Model is already quantized, loading directly...
[2026-03-04 16:46:50] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2026-03-04 16:46:50] Multimodal attention backend not set. Use triton_attn.
[2026-03-04 16:46:50] Using triton_attn as multimodal attention backend.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-04 16:46:50] using attn output gate!
[2026-03-04 16:46:51] Found local HF snapshot for AxionML/Qwen3.5-122B-A10B-NVFP4 at /home/2up/.cache/huggingface/hub/models--AxionML--Qwen3.5-122B-A10B-NVFP4/snapshots/2c950a8421a4cab5588cea82f5458ef46ca1319f; skipping download.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
[2026-03-04 16:46:52] Parameter model.layers.34.linear_attn.in_proj_a.input_scale not found in params_dict
[2026-03-04 16:46:52] Scheduler hit an exception: Traceback (most recent call last):
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 368, in __init__
self.init_model_worker()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
self.init_tp_model_worker()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 247, in __init__
self._init_model_runner()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 413, in __init__
self.initialize(min_per_gpu_memory)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 493, in initialize
self.load_model()
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 980, in load_model
self.model = self.loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 2635, in load_model
return super().load_model(
^^^^^^^^^^^^^^^^^^^
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 677, in load_model
self.load_weights_and_postprocess(
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 686, in load_weights_and_postprocess
model.load_weights(weights)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 1328, in load_weights
weight_loader(param, loaded_weight)
File "/home/2up/aibek/.venv/lib/python3.12/site-packages/sglang/srt/layers/linear.py", line 416, in weight_loader
assert param_data.shape == loaded_weight.shape
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[2026-03-04 16:46:52] Received sigquit from a child process. It usually means the child failed.
It fails for me as well, using lmsysorg/sglang:dev-cu13, with `Parameter model.layers.0.linear_attn.in_proj_a.input_scale not found in params_dict`.
Complete log with stack trace:
sglang | [2026-03-14 16:39:34] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:35] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
sglang | [2026-03-14 16:39:36] Using default HuggingFace chat template with detected content format: openai
sglang | [2026-03-14 16:39:39] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:40] SM120 (Blackwell) detected: auto-selecting fp4-gemm-backend=flashinfer_cudnn
sglang | [2026-03-14 16:39:40] Mamba selective_state_update backend initialized: triton
sglang | [2026-03-14 16:39:40] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
sglang | [2026-03-14 16:39:40] Init torch distributed begin.
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
sglang | [2026-03-14 16:39:40] Init torch distributed ends. elapsed=0.13 s, mem usage=0.11 GB
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
sglang | [2026-03-14 16:39:40] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
sglang | [2026-03-14 16:39:41] Load weight begin. avail mem=94.23 GB
sglang | [2026-03-14 16:39:41] Using ModelOptModelLoader due to ModelOpt quantization config.
sglang | [2026-03-14 16:39:41] ModelOptModelLoader: Loading base model...
sglang | [2026-03-14 16:39:41] Model is already quantized, loading directly...
sglang | [2026-03-14 16:39:41] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
sglang | [2026-03-14 16:39:41] Multimodal attention backend not set. Use triton_attn.
sglang | [2026-03-14 16:39:41] Using triton_attn as multimodal attention backend.
sglang | `torch_dtype` is deprecated! Use `dtype` instead!
sglang | [2026-03-14 16:39:41] using attn output gate!
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
sglang | [2026-03-14 16:39:42] Parameter model.layers.0.linear_attn.in_proj_a.input_scale not found in params_dict
sglang | [2026-03-14 16:39:42] Scheduler hit an exception: Traceback (most recent call last):
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3354, in run_scheduler_process
sglang | scheduler = Scheduler(
sglang | ^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 372, in __init__
sglang | self.init_model_worker()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 589, in init_model_worker
sglang | self.init_tp_model_worker()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 547, in init_tp_model_worker
sglang | self.tp_worker = TpModelWorker(
sglang | ^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
sglang | self._init_model_runner()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
sglang | self._model_runner = ModelRunner(
sglang | ^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 422, in __init__
sglang | self.initialize(pre_model_load_memory)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 502, in initialize
sglang | self.load_model()
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 986, in load_model
sglang | self.model = self.loader.load_model(
sglang | ^^^^^^^^^^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 2638, in load_model
sglang | return super().load_model(
sglang | ^^^^^^^^^^^^^^^^^^^
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 680, in load_model
sglang | self.load_weights_and_postprocess(
sglang | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 689, in load_weights_and_postprocess
sglang | model.load_weights(weights)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1403, in load_weights
sglang | weight_loader(param, loaded_weight)
sglang | File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 417, in weight_loader
sglang | assert param_data.shape == loaded_weight.shape
sglang | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang | AssertionError
sglang |
sglang | [2026-03-14 16:39:42] Received sigquit from a child process. It usually means the child failed.
It looks like you've hit a weight shape mismatch (AssertionError) that makes the server fail during startup.
Simply put, when SGLang loads this FP4-quantized Qwen3.5 checkpoint, the parameter shapes the code expects do not match the shapes actually read from the weight files.
Core Analysis
The issue can be pinpointed in the last few lines of your log:
- Direct trigger: the `AssertionError` occurs in `weight_loader` within `sglang/srt/layers/linear.py`. The shape of a specific layer's weight differs from what SGLang expects.
- Key error message: `Parameter model.layers.34.linear_attn.in_proj_a.input_scale not found in params_dict`. While loading layer 34, certain quantization scaling factors (scales) are either missing from the model's parameter dict or stored in an incompatible format.
- Fundamental context: you are using `AxionML/Qwen3.5-122B-A10B-NVFP4`, a very new model in the NVFP4 (NVIDIA FP4) format. While FP4 is natively supported by the Blackwell architecture (e.g., B200), it is still highly experimental across the software ecosystem (SGLang, vLLM, ModelOpt).
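The failing check can be illustrated with a minimal sketch. This is a simplified, hypothetical stand-in for the loader, not SGLang's actual `weight_loader`; shapes are shown as plain tuples:

```python
# Minimal sketch of the shape check that fails in linear.py.
# Hypothetical simplification: SGLang compares torch tensor shapes,
# here we just compare tuples to show the failure mode.

def weight_loader(param_shape, loaded_shape):
    # The loader asserts the in-memory parameter and the checkpoint
    # tensor have identical shapes before copying data across.
    assert param_shape == loaded_shape, (
        f"shape mismatch: model expects {param_shape}, "
        f"checkpoint provides {loaded_shape}"
    )

# A matching scale loads fine:
weight_loader((1,), (1,))

# If the model builds a parameter with a different shape than the
# checkpoint provides (or routes the tensor to the wrong parameter
# because the name was not found), loading raises AssertionError:
try:
    weight_loader((4096, 1024), (512, 1024))
except AssertionError as e:
    print(f"AssertionError: {e}")
```

The preceding "not found in params_dict" message matters here: if a checkpoint tensor has no matching parameter name, it can end up compared against the wrong parameter, which then trips this assert.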
Suggested Solutions
1. Check TP (Tensor Parallel) Settings
The log shows tp_size=1. Qwen3.5 122B is a massive model; even with FP4 quantization, loading it on a single GPU can exercise different weight-splitting logic than the checkpoint was exported for.
- Try increasing TP: if you have multiple GPUs (e.g., 8x H100/B200), try setting `--tp 8`. Quantized checkpoints are sometimes exported with a specific multi-GPU sharding layout in mind.
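As a rough illustration of why TP size changes the shapes the loader expects (a hypothetical column-parallel layout, not SGLang's exact sharding code):

```python
# Hypothetical sketch of column-parallel sharding arithmetic:
# each TP rank holds a 1/tp_size slice of the output dimension,
# so the shape the loader expects depends on tp_size.

def expected_shard_shape(full_shape, tp_size):
    out_dim, in_dim = full_shape
    assert out_dim % tp_size == 0, "out_dim must divide evenly across ranks"
    return (out_dim // tp_size, in_dim)

print(expected_shard_shape((4096, 1024), tp_size=1))  # (4096, 1024)
print(expected_shard_shape((4096, 1024), tp_size=8))  # (512, 1024)
```

If a checkpoint was exported assuming one layout and the loader reconstructs another, the per-rank shapes disagree and the assert fires.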
2. Explicitly Disable Conflicting Features
The log contains a warning: `Disabling overlap schedule since mamba no_buffer is not compatible...` (and the dumped server args already show `disable_overlap_schedule=True`).
- Try adding `--disable-radix-cache` (or, if it is not already auto-disabled, `--disable-overlap-schedule`) to the startup command to rule out scheduler interference.
3. Upgrade SGLang to the Nightly Version
Support for NVFP4 is evolving rapidly in SGLang. A stable release may not yet include fixes for Qwen3.5's unusual layers (such as the hybrid Mamba / linear-attention blocks).
pip install --upgrade sglang --find-links https://flashinfer.ai/whl/cu124/flashinfer/
# Or install directly from the GitHub source
4. Verify Hardware Compatibility
- Blackwell (B200/B100): if you are using Blackwell GPUs, make sure your driver and CUDA toolkit are 12.6 or newer.
- Hopper (H100): H100 has native FP8 support, but no FP4 tensor cores the way Blackwell does, so FP4 checkpoints must go through emulation or dequantization paths; SGLang may hit shape or format conversion errors there.
5. Validate Model Integrity
The error mentions a missing input_scale in the params_dict. Check if the *.safetensors files in the model folder are complete, or try re-downloading the model.
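To see which tensor names the checkpoint actually contains (e.g., whether the `input_scale` entries exist at all), you can read a safetensors file's header with the stdlib alone. A sketch, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header); the filename below is illustrative:

```python
import json
import struct

def list_tensor_names(path):
    """Return tensor names stored in a .safetensors file by parsing its
    header: 8 bytes of little-endian length, then that many bytes of
    JSON mapping tensor names to dtype/shape/offsets."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [k for k in header if k != "__metadata__"]

# Usage (path is illustrative):
# names = list_tensor_names("model-00001-of-00002.safetensors")
# print([n for n in names if "input_scale" in n])
```

If the `linear_attn.*.input_scale` tensors are present in the files but rejected by the loader, the problem is in SGLang's model code rather than a corrupted download.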
Summary
This AssertionError is most likely a bug in the current version of SGLang's weight-loading logic for Qwen3.5 NVFP4 checkpoints (especially around the hybrid Mamba / linear-attention layers such as `linear_attn.in_proj_a`).
You can try this more robust startup command:
python3 -m sglang.launch_server \
--model-path AxionML/Qwen3.5-122B-A10B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--trust-remote-code \
--disable-overlap-schedule
If the error persists, I recommend searching the SGLang GitHub Issues for the specific keyword in_proj_a.input_scale, as this is usually a code adaptation issue that requires an official fix.