vLLM and SGLang Not Working

#7
by yashpp18 - opened

I tried deploying bharatgenai/Param2-17B-A2.4B-Thinking using both SGLang and vLLM, but encountered compatibility errors with both frameworks.

1️⃣ SGLang Deployment

Command used:
sudo docker run --gpus 1 --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --env "HF_TOKEN=$hf_token" --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path "bharatgenai/Param2-17B-A2.4B-Thinking" --host 0.0.0.0 --port 30000 --trust-remote-code

Error Received:

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:

  • configuration_param2moe.py
    Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
    [2026-03-17 09:59:20] INFO server_args.py:1835: Attention backend not specified. Use flashinfer backend by default.
    [2026-03-17 09:59:21] server_args=ServerArgs(model_path='bharatgenai/Param2-17B-A2.4B-Thinking', tokenizer_path='bharatgenai/Param2-17B-A2.4B-Thinking', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.833, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=1043784771, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, 
sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='bharatgenai/Param2-17B-A2.4B-Thinking', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, 
disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', 
hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=32, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, 
mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
    You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
    [2026-03-17 09:59:39] Mamba selective_state_update backend initialized: triton
    [2026-03-17 09:59:39] Using default HuggingFace chat template with detected content format: string
    [2026-03-17 09:59:39] Init torch distributed begin.
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [2026-03-17 09:59:40] Init torch distributed ends. elapsed=0.56 s, mem usage=0.08 GB
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
    A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
  • modeling_param2moe.py
    Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    [2026-03-17 09:59:43] Scheduler hit an exception: Traceback (most recent call last):
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
    scheduler = Scheduler(
    ^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 368, in __init__
    self.init_model_worker()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
    self.init_tp_model_worker()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
    ^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
    ^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 413, in __init__
    self.initialize(min_per_gpu_memory)
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 459, in initialize
    compute_initial_expert_location_metadata(
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 541, in compute_initial_expert_location_metadata
    return ExpertLocationMetadata.init_trivial(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 92, in init_trivial
    common = ExpertLocationMetadata._init_common(server_args, model_config)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 193, in _init_common
    ModelConfigForExpertLocation.from_model_config(model_config)
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 525, in from_model_config
    model_class, _ = get_model_architecture(model_config)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_loader/utils.py", line 116, in get_model_architecture
    architectures = resolve_transformers_arch(model_config, architectures)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_loader/utils.py", line 75, in resolve_transformers_arch
    raise ValueError(
    ValueError: Param2MoEForCausalLM has no SGlang implementation and the Transformers implementation is not compatible with SGLang.

[2026-03-17 09:59:43] Received sigquit from a child process. It usually means the child failed.

**ValueError: Param2MoEForCausalLM has no SGlang implementation and the Transformers implementation is not compatible with SGLang.**

It appears the model architecture Param2MoEForCausalLM is not supported by SGLang.
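As an aside, the "pin a revision" warning in the log above can be acted on directly: the ServerArgs dump shows a `revision` option (`revision=None`), so the launch command can pin the remote-code files to a commit you have audited. A minimal sketch of the same docker command with pinning added (the commit hash below is a placeholder, not a real revision of this repo; it does not fix the unsupported-architecture error, only the repeated code downloads):

```shell
# Placeholder hash - substitute a real commit from the model repo's
# "Files and versions" tab after auditing the remote code.
REVISION=abc1234

sudo docker run --gpus 1 --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$hf_token" --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bharatgenai/Param2-17B-A2.4B-Thinking" \
    --revision "$REVISION" \
    --host 0.0.0.0 --port 30000 --trust-remote-code
```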

2️⃣ vLLM Deployment

vllm serve "bharatgenai/Param2-17B-A2.4B-Thinking" --trust_remote_code
ERROR 03-17 15:05:54 [config.py:29] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: module 'triton.language' has no attribute 'constexpr_function'
ERROR 03-17 15:05:55 [gpt_oss_triton_kernels_moe.py:61] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: module 'triton.language' has no attribute 'constexpr_function'
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302]
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302] [vLLM ASCII banner] version 0.17.1
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302] model bharatgenai/Param2-17B-A2.4B-Thinking
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302]
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:238] non-default args: {'model_tag': 'bharatgenai/Param2-17B-A2.4B-Thinking', 'model': 'bharatgenai/Param2-17B-A2.4B-Thinking', 'trust_remote_code': True}
(APIServer pid=3524340) WARNING 03-17 15:05:57 [system_utils.py:287] Found ulimit of 4096 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like OSError: [Errno 24] Too many open files. Consider increasing with ulimit -n
(APIServer pid=3524340) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3524340) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
configuration_param2moe.py: 3.07kB [00:00, 24.0MB/s]
(APIServer pid=3524340) A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
(APIServer pid=3524340) - configuration_param2moe.py
(APIServer pid=3524340) Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=3524340) You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
modeling_param2moe.py: 69.5kB [00:00, 13.9MB/s]
(APIServer pid=3524340) A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
(APIServer pid=3524340) - modeling_param2moe.py
(APIServer pid=3524340) Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=3524340) Traceback (most recent call last):
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/bin/vllm", line 10, in <module>
(APIServer pid=3524340) sys.exit(main())
(APIServer pid=3524340) ^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=3524340) args.dispatch_function(args)
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=3524340) uvloop.run(run_server(args))
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=3524340) return __asyncio.run(
(APIServer pid=3524340) ^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=3524340) return runner.run(main)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=3524340) return self._loop.run_until_complete(task)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=3524340) return await main
(APIServer pid=3524340) ^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=3524340) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=3524340) async with build_async_engine_client(
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3524340) return await anext(self.gen)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=3524340) async with build_async_engine_client_from_engine_args(
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3524340) return await anext(self.gen)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
(APIServer pid=3524340) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1477, in create_engine_config
(APIServer pid=3524340) model_config = self.create_model_config()
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1329, in create_model_config
(APIServer pid=3524340) return ModelConfig(
(APIServer pid=3524340) ^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=3524340) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=3524340) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=3524340) Value error, Model architectures ['Param2MoEForCausalLM'] are not supported for now. Supported architectures: dict_keys(['AfmoeForCausalLM', 'ApertusForCausalLM', 'AquilaModel', 'AquilaForCausalLM', 'ArceeForCausalLM', 'ArcticForCausalLM', 'AXK1ForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BailingMoeForCausalLM', 'BailingMoeV2ForCausalLM', 'BailingMoeV2_5ForCausalLM', 'BambaForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CwmForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'DeepseekV32ForCausalLM', 'Dots1ForCausalLM', 'Ernie4_5ForCausalLM', 'Ernie4_5_MoeForCausalLM', 'ExaoneForCausalLM', 'Exaone4ForCausalLM', 'ExaoneMoEForCausalLM', 'Fairseq2LlamaForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FalconH1ForCausalLM', 'FlexOlmoForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForCausalLM', 'Gemma3nForCausalLM', 'Qwen3NextForCausalLM', 'GlmForCausalLM', 'Glm4ForCausalLM', 'Glm4MoeForCausalLM', 'Glm4MoeLiteForCausalLM', 'GlmMoeDsaForCausalLM', 'GptOssForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'GraniteMoeHybridForCausalLM', 'GraniteMoeSharedForCausalLM', 'GritLM', 'Grok1ModelForCausalLM', 'Grok1ForCausalLM', 'HunYuanMoEV1ForCausalLM', 'HunYuanDenseV1ForCausalLM', 'HCXVisionForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'InternLM3ForCausalLM', 'IQuestCoderForCausalLM', 'IQuestLoopCoderForCausalLM', 'JAISLMHeadModel', 'Jais2ForCausalLM', 'JambaForCausalLM', 'KimiLinearForCausalLM', 'Lfm2ForCausalLM', 'Lfm2MoeForCausalLM', 'LlamaForCausalLM', 'Llama4ForCausalLM', 'LLaMAForCausalLM', 'LongcatFlashForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniMaxForCausalLM', 
'MiniMaxText01ForCausalLM', 'MiniMaxM1ForCausalLM', 'MiniMaxM2ForCausalLM', 'MistralForCausalLM', 'MistralLarge3ForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiMoForCausalLM', 'MiMoV2FlashForCausalLM', 'NemotronForCausalLM', 'NemotronHForCausalLM', 'NemotronHPuzzleForCausalLM', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'Olmo3ForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'OuroForCausalLM', 'PanguEmbeddedForCausalLM', 'PanguProMoEV2ForCausalLM', 'PanguUltraMoEForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Plamo2ForCausalLM', 'Plamo3ForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen3ForCausalLM', 'Qwen3MoeForCausalLM', 'RWForCausalLM', 'SeedOssForCausalLM', 'Step1ForCausalLM', 'Step3TextForCausalLM', 'Step3p5ForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'TeleChatForCausalLM', 'TeleChat2ForCausalLM', 'TeleFLMForCausalLM', 'XverseForCausalLM', 'Zamba2ForCausalLM', 'BertModel', 'BertSpladeSparseEmbeddingModel', 'HF_ColBERT', 'ColBERTModernBertModel', 'ColBERTJinaRobertaModel', 'Gemma2Model', 'Gemma3TextModel', 'GPT2ForSequenceClassification', 'GteModel', 'GteNewModel', 'InternLM2ForRewardModel', 'JambaForSequenceClassification', 'LlamaBidirectionalModel', 'LlamaModel', 'MistralModel', 'ModernBertModel', 'NomicBertModel', 'Qwen2Model', 'Qwen2ForRewardModel', 'Qwen2ForProcessRewardModel', 'RobertaForMaskedLM', 'RobertaModel', 'VoyageQwen3BidirectionalEmbedModel', 'XLMRobertaModel', 'BgeM3EmbeddingModel', 'CLIPModel', 'ColModernVBertForRetrieval', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Qwen2VLForConditionalGeneration', 'ColQwen3', 'OpsColQwen3Model', 'Qwen3VLNemotronEmbedModel', 'SiglipModel', 'LlamaNemotronVLModel', 'PrithviGeoSpatialMAE', 'Terratorch', 'BertForSequenceClassification', 'BertForTokenClassification', 'GteNewForSequenceClassification', 
'JinaVLForRanking', 'LlamaBidirectionalForSequenceClassification', 'LlamaNemotronVLForSequenceClassification', 'ModernBertForSequenceClassification', 'ModernBertForTokenClassification', 'RobertaForSequenceClassification', 'XLMRobertaForSequenceClassification', 'AriaForConditionalGeneration', 'AudioFlamingo3ForConditionalGeneration', 'MusicFlamingoForConditionalGeneration', 'AyaVisionForConditionalGeneration', 'BagelForConditionalGeneration', 'BeeForConditionalGeneration', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'Cohere2VisionForConditionalGeneration', 'DeepseekVLV2ForCausalLM', 'DeepseekOCRForCausalLM', 'DeepseekOCR2ForCausalLM', 'DotsOCRForCausalLM', 'Eagle2_5_VLForConditionalGeneration', 'Ernie4_5_VLMoeForConditionalGeneration', 'FireRedASR2ForConditionalGeneration', 'FunASRForConditionalGeneration', 'FunAudioChatForConditionalGeneration', 'FuyuForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3nForConditionalGeneration', 'GlmAsrForConditionalGeneration', 'GLM4VForCausalLM', 'Glm4vForConditionalGeneration', 'Glm4vMoeForConditionalGeneration', 'GlmOcrForConditionalGeneration', 'GraniteSpeechForConditionalGeneration', 'H2OVLChatModel', 'HunYuanVLForConditionalGeneration', 'StepVLForConditionalGeneration', 'InternVLChatModel', 'NemotronH_Nano_VL_V2', 'OpenCUAForConditionalGeneration', 'InternS1ForConditionalGeneration', 'InternVLForConditionalGeneration', 'InternS1ProForConditionalGeneration', 'Idefics3ForConditionalGeneration', 'IsaacForConditionalGeneration', 'SmolVLMForConditionalGeneration', 'KananaVForConditionalGeneration', 'KeyeForConditionalGeneration', 'KeyeVL1_5ForConditionalGeneration', 'RForConditionalGeneration', 'KimiVLForConditionalGeneration', 'KimiK25ForConditionalGeneration', 'LightOnOCRForConditionalGeneration', 'Lfm2VlForConditionalGeneration', 'Llama_Nemotron_Nano_VL', 'Llama4ForConditionalGeneration', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 
'LlavaOnevisionForConditionalGeneration', 'MantisForConditionalGeneration', 'MiDashengLMModel', 'MiniMaxVL01ForConditionalGeneration', 'MiniCPMO', 'MiniCPMV', 'Mistral3ForConditionalGeneration', 'MolmoForCausalLM', 'Molmo2ForConditionalGeneration', 'NVLM_D', 'OpenPanguVLForConditionalGeneration', 'Ovis', 'Ovis2_5', 'Ovis2_6ForCausalLM', 'Ovis2_6_MoeForCausalLM', 'PaddleOCRVLForConditionalGeneration', 'PaliGemmaForConditionalGeneration', 'Phi4MMForCausalLM', 'PixtralForConditionalGeneration', 'QwenVLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration', 'Qwen2AudioForConditionalGeneration', 'Qwen2_5OmniModel', 'Qwen2_5OmniForConditionalGeneration', 'Qwen3OmniMoeForConditionalGeneration', 'Qwen3ASRForConditionalGeneration', 'Qwen3ASRRealtimeGeneration', 'Qwen3VLForConditionalGeneration', 'Qwen3VLMoeForConditionalGeneration', 'Qwen3_5ForConditionalGeneration', 'Qwen3_5MoeForConditionalGeneration', 'SkyworkR1VChatModel', 'Step3VLForConditionalGeneration', 'TarsierForConditionalGeneration', 'Tarsier2ForConditionalGeneration', 'UltravoxModel', 'VoxtralForConditionalGeneration', 'VoxtralRealtimeGeneration', 'NemotronParseForConditionalGeneration', 'WhisperForConditionalGeneration', 'ExtractHiddenStatesModel', 'MiMoMTPModel', 'EagleLlamaForCausalLM', 'EagleLlama4ForCausalLM', 'EagleMiniCPMForCausalLM', 'Eagle3LlamaForCausalLM', 'LlamaForCausalLMEagle3', 'Eagle3Qwen2_5vlForCausalLM', 'Eagle3Qwen3vlForCausalLM', 'EagleMistralLarge3ForCausalLM', 'EagleDeepSeekMTPModel', 'DeepSeekMTPModel', 'ErnieMTPModel', 'ExaoneMoeMTP', 'NemotronHMTPModel', 'LongCatFlashMTPModel', 'Glm4MoeMTPModel', 'Glm4MoeLiteMTPModel', 'GlmOcrMTPModel', 'MedusaModel', 'OpenPanguMTPModel', 'Qwen3NextMTP', 'Step3p5MTP', 'Qwen3_5MTP', 'Qwen3_5MoeMTP', 'SmolLM3ForCausalLM', 'Emu3ForConditionalGeneration', 'TransformersForCausalLM', 'TransformersMoEForCausalLM', 'TransformersMultiModalForCausalLM', 'TransformersMultiModalMoEForCausalLM', 'TransformersEmbeddingModel', 'TransformersMoEEmbeddingModel', 
'TransformersMultiModalEmbeddingModel', 'TransformersForSequenceClassification', 'TransformersMoEForSequenceClassification', 'TransformersMultiModalForSequenceClassification']) [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=3524340) For further information visit https://errors.pydantic.dev/2.12/v/value_error

In short, the server first fails to import Triton kernels:
module 'triton.language' has no attribute 'constexpr_function'

and eventually:

ValidationError: Model architectures ['Param2MoEForCausalLM'] are not supported for now.

So it seems vLLM currently does not support the Param2MoE architecture either.
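In the meantime, the repo's custom modeling code does load with plain transformers, so a non-served workaround may be possible. A minimal sketch (untested against this model; assumes enough GPU memory and that the repo's remote `Param2MoE` code is compatible with your installed transformers version):

```python
# Fallback while vLLM/SGLang support is pending: load the model with
# plain transformers. trust_remote_code pulls the repo's custom
# configuration_param2moe.py / modeling_param2moe.py files.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bharatgenai/Param2-17B-A2.4B-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # keep the checkpoint's native dtype
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "Hello"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

This sidesteps both errors above because transformers executes the remote code directly instead of mapping the architecture name to a framework-native implementation.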

Any guidance would be greatly appreciated.

Thanks!

BharatGen AI org

We are currently working on enabling support for vLLM and SGLang. We'll share updates soon. Thanks for your patience!
