vLLM and SGLang Not Working

#7
by yashpp18 - opened

I tried deploying bharatgenai/Param2-17B-A2.4B-Thinking using both SGLang and vLLM, but encountered compatibility errors with both frameworks.

1️⃣ SGLang Deployment

Command used:
sudo docker run --gpus 1 --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --env "HF_TOKEN=$hf_token" --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path "bharatgenai/Param2-17B-A2.4B-Thinking" --host 0.0.0.0 --port 30000 --trust-remote-code

Error Received:

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:

  • configuration_param2moe.py
    Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
    [2026-03-17 09:59:20] INFO server_args.py:1835: Attention backend not specified. Use flashinfer backend by default.
    [2026-03-17 09:59:21] server_args=ServerArgs(model_path='bharatgenai/Param2-17B-A2.4B-Thinking', tokenizer_path='bharatgenai/Param2-17B-A2.4B-Thinking', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.833, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=1043784771, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, 
sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='bharatgenai/Param2-17B-A2.4B-Thinking', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, 
disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', 
hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=32, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, 
mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
    You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
    [2026-03-17 09:59:39] Mamba selective_state_update backend initialized: triton
    [2026-03-17 09:59:39] Using default HuggingFace chat template with detected content format: string
    [2026-03-17 09:59:39] Init torch distributed begin.
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [2026-03-17 09:59:40] Init torch distributed ends. elapsed=0.56 s, mem usage=0.08 GB
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
    [2026-03-17 09:59:41] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
    A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
  • modeling_param2moe.py
    Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    [2026-03-17 09:59:43] Scheduler hit an exception: Traceback (most recent call last):
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
    scheduler = Scheduler(
    ^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 368, in __init__
    self.init_model_worker()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
    self.init_tp_model_worker()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
    ^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
    File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
    ^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 413, in __init__
    self.initialize(min_per_gpu_memory)
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 459, in initialize
    compute_initial_expert_location_metadata(
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 541, in compute_initial_expert_location_metadata
    return ExpertLocationMetadata.init_trivial(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 92, in init_trivial
    common = ExpertLocationMetadata._init_common(server_args, model_config)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 193, in _init_common
    ModelConfigForExpertLocation.from_model_config(model_config)
    File "/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py", line 525, in from_model_config
    model_class, _ = get_model_architecture(model_config)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_loader/utils.py", line 116, in get_model_architecture
    architectures = resolve_transformers_arch(model_config, architectures)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/sgl-workspace/sglang/python/sglang/srt/model_loader/utils.py", line 75, in resolve_transformers_arch
    raise ValueError(
    ValueError: Param2MoEForCausalLM has no SGlang implementation and the Transformers implementation is not compatible with SGLang.

[2026-03-17 09:59:43] Received sigquit from a child process. It usually means the child failed.

**ValueError: Param2MoEForCausalLM has no SGlang implementation and the Transformers implementation is not compatible with SGLang.**

It appears the model architecture Param2MoEForCausalLM is not supported by SGLang.
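As an aside, the "pin a revision" warning in the log above can be acted on directly: the ServerArgs dump shows a `revision` option (`revision=None`), so the launch command can pin the remote-code files to a commit you have audited. A minimal sketch of the same docker command with pinning added (the commit hash below is a placeholder, not a real revision of this repo; it does not fix the unsupported-architecture error, only the repeated code downloads):

```shell
# Placeholder hash - substitute a real commit from the model repo's
# "Files and versions" tab after auditing the remote code.
REVISION=abc1234

sudo docker run --gpus 1 --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$hf_token" --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bharatgenai/Param2-17B-A2.4B-Thinking" \
    --revision "$REVISION" \
    --host 0.0.0.0 --port 30000 --trust-remote-code
```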

2️⃣ vLLM Deployment

vllm serve "bharatgenai/Param2-17B-A2.4B-Thinking" --trust_remote_code
ERROR 03-17 15:05:54 [config.py:29] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: module 'triton.language' has no attribute 'constexpr_function'
ERROR 03-17 15:05:55 [gpt_oss_triton_kernels_moe.py:61] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: module 'triton.language' has no attribute 'constexpr_function'
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302]
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302] [vLLM ASCII banner] version 0.17.1
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302] model bharatgenai/Param2-17B-A2.4B-Thinking
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:302]
(APIServer pid=3524340) INFO 03-17 15:05:57 [utils.py:238] non-default args: {'model_tag': 'bharatgenai/Param2-17B-A2.4B-Thinking', 'model': 'bharatgenai/Param2-17B-A2.4B-Thinking', 'trust_remote_code': True}
(APIServer pid=3524340) WARNING 03-17 15:05:57 [system_utils.py:287] Found ulimit of 4096 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like OSError: [Errno 24] Too many open files. Consider increasing with ulimit -n
(APIServer pid=3524340) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3524340) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
configuration_param2moe.py: 3.07kB [00:00, 24.0MB/s]
(APIServer pid=3524340) A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
(APIServer pid=3524340) - configuration_param2moe.py
(APIServer pid=3524340) Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=3524340) You are using a model of type param2moe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
modeling_param2moe.py: 69.5kB [00:00, 13.9MB/s]
(APIServer pid=3524340) A new version of the following files was downloaded from https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking:
(APIServer pid=3524340) - modeling_param2moe.py
(APIServer pid=3524340) Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=3524340) Traceback (most recent call last):
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/bin/vllm", line 10, in <module>
(APIServer pid=3524340) sys.exit(main())
(APIServer pid=3524340) ^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=3524340) args.dispatch_function(args)
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=3524340) uvloop.run(run_server(args))
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=3524340) return __asyncio.run(
(APIServer pid=3524340) ^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=3524340) return runner.run(main)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=3524340) return self._loop.run_until_complete(task)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=3524340) return await main
(APIServer pid=3524340) ^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=3524340) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=3524340) async with build_async_engine_client(
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3524340) return await anext(self.gen)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=3524340) async with build_async_engine_client_from_engine_args(
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3524340) return await anext(self.gen)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
(APIServer pid=3524340) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1477, in create_engine_config
(APIServer pid=3524340) model_config = self.create_model_config()
(APIServer pid=3524340) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1329, in create_model_config
(APIServer pid=3524340) return ModelConfig(
(APIServer pid=3524340) ^^^^^^^^^^^^
(APIServer pid=3524340) File "/nlsasfs/home/sysadmin/yvardhan/miniconda3/envs/vllm/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=3524340) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=3524340) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=3524340) Value error, Model architectures ['Param2MoEForCausalLM'] are not supported for now. Supported architectures: dict_keys(['AfmoeForCausalLM', 'ApertusForCausalLM', 'AquilaModel', 'AquilaForCausalLM', 'ArceeForCausalLM', 'ArcticForCausalLM', 'AXK1ForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BailingMoeForCausalLM', 'BailingMoeV2ForCausalLM', 'BailingMoeV2_5ForCausalLM', 'BambaForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CwmForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'DeepseekV32ForCausalLM', 'Dots1ForCausalLM', 'Ernie4_5ForCausalLM', 'Ernie4_5_MoeForCausalLM', 'ExaoneForCausalLM', 'Exaone4ForCausalLM', 'ExaoneMoEForCausalLM', 'Fairseq2LlamaForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FalconH1ForCausalLM', 'FlexOlmoForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForCausalLM', 'Gemma3nForCausalLM', 'Qwen3NextForCausalLM', 'GlmForCausalLM', 'Glm4ForCausalLM', 'Glm4MoeForCausalLM', 'Glm4MoeLiteForCausalLM', 'GlmMoeDsaForCausalLM', 'GptOssForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'GraniteMoeHybridForCausalLM', 'GraniteMoeSharedForCausalLM', 'GritLM', 'Grok1ModelForCausalLM', 'Grok1ForCausalLM', 'HunYuanMoEV1ForCausalLM', 'HunYuanDenseV1ForCausalLM', 'HCXVisionForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'InternLM3ForCausalLM', 'IQuestCoderForCausalLM', 'IQuestLoopCoderForCausalLM', 'JAISLMHeadModel', 'Jais2ForCausalLM', 'JambaForCausalLM', 'KimiLinearForCausalLM', 'Lfm2ForCausalLM', 'Lfm2MoeForCausalLM', 'LlamaForCausalLM', 'Llama4ForCausalLM', 'LLaMAForCausalLM', 'LongcatFlashForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniMaxForCausalLM', 
'MiniMaxText01ForCausalLM', 'MiniMaxM1ForCausalLM', 'MiniMaxM2ForCausalLM', 'MistralForCausalLM', 'MistralLarge3ForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiMoForCausalLM', 'MiMoV2FlashForCausalLM', 'NemotronForCausalLM', 'NemotronHForCausalLM', 'NemotronHPuzzleForCausalLM', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'Olmo3ForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'OuroForCausalLM', 'PanguEmbeddedForCausalLM', 'PanguProMoEV2ForCausalLM', 'PanguUltraMoEForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Plamo2ForCausalLM', 'Plamo3ForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen3ForCausalLM', 'Qwen3MoeForCausalLM', 'RWForCausalLM', 'SeedOssForCausalLM', 'Step1ForCausalLM', 'Step3TextForCausalLM', 'Step3p5ForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'TeleChatForCausalLM', 'TeleChat2ForCausalLM', 'TeleFLMForCausalLM', 'XverseForCausalLM', 'Zamba2ForCausalLM', 'BertModel', 'BertSpladeSparseEmbeddingModel', 'HF_ColBERT', 'ColBERTModernBertModel', 'ColBERTJinaRobertaModel', 'Gemma2Model', 'Gemma3TextModel', 'GPT2ForSequenceClassification', 'GteModel', 'GteNewModel', 'InternLM2ForRewardModel', 'JambaForSequenceClassification', 'LlamaBidirectionalModel', 'LlamaModel', 'MistralModel', 'ModernBertModel', 'NomicBertModel', 'Qwen2Model', 'Qwen2ForRewardModel', 'Qwen2ForProcessRewardModel', 'RobertaForMaskedLM', 'RobertaModel', 'VoyageQwen3BidirectionalEmbedModel', 'XLMRobertaModel', 'BgeM3EmbeddingModel', 'CLIPModel', 'ColModernVBertForRetrieval', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Qwen2VLForConditionalGeneration', 'ColQwen3', 'OpsColQwen3Model', 'Qwen3VLNemotronEmbedModel', 'SiglipModel', 'LlamaNemotronVLModel', 'PrithviGeoSpatialMAE', 'Terratorch', 'BertForSequenceClassification', 'BertForTokenClassification', 'GteNewForSequenceClassification', 
'JinaVLForRanking', 'LlamaBidirectionalForSequenceClassification', 'LlamaNemotronVLForSequenceClassification', 'ModernBertForSequenceClassification', 'ModernBertForTokenClassification', 'RobertaForSequenceClassification', 'XLMRobertaForSequenceClassification', 'AriaForConditionalGeneration', 'AudioFlamingo3ForConditionalGeneration', 'MusicFlamingoForConditionalGeneration', 'AyaVisionForConditionalGeneration', 'BagelForConditionalGeneration', 'BeeForConditionalGeneration', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'Cohere2VisionForConditionalGeneration', 'DeepseekVLV2ForCausalLM', 'DeepseekOCRForCausalLM', 'DeepseekOCR2ForCausalLM', 'DotsOCRForCausalLM', 'Eagle2_5_VLForConditionalGeneration', 'Ernie4_5_VLMoeForConditionalGeneration', 'FireRedASR2ForConditionalGeneration', 'FunASRForConditionalGeneration', 'FunAudioChatForConditionalGeneration', 'FuyuForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3nForConditionalGeneration', 'GlmAsrForConditionalGeneration', 'GLM4VForCausalLM', 'Glm4vForConditionalGeneration', 'Glm4vMoeForConditionalGeneration', 'GlmOcrForConditionalGeneration', 'GraniteSpeechForConditionalGeneration', 'H2OVLChatModel', 'HunYuanVLForConditionalGeneration', 'StepVLForConditionalGeneration', 'InternVLChatModel', 'NemotronH_Nano_VL_V2', 'OpenCUAForConditionalGeneration', 'InternS1ForConditionalGeneration', 'InternVLForConditionalGeneration', 'InternS1ProForConditionalGeneration', 'Idefics3ForConditionalGeneration', 'IsaacForConditionalGeneration', 'SmolVLMForConditionalGeneration', 'KananaVForConditionalGeneration', 'KeyeForConditionalGeneration', 'KeyeVL1_5ForConditionalGeneration', 'RForConditionalGeneration', 'KimiVLForConditionalGeneration', 'KimiK25ForConditionalGeneration', 'LightOnOCRForConditionalGeneration', 'Lfm2VlForConditionalGeneration', 'Llama_Nemotron_Nano_VL', 'Llama4ForConditionalGeneration', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 
'LlavaOnevisionForConditionalGeneration', 'MantisForConditionalGeneration', 'MiDashengLMModel', 'MiniMaxVL01ForConditionalGeneration', 'MiniCPMO', 'MiniCPMV', 'Mistral3ForConditionalGeneration', 'MolmoForCausalLM', 'Molmo2ForConditionalGeneration', 'NVLM_D', 'OpenPanguVLForConditionalGeneration', 'Ovis', 'Ovis2_5', 'Ovis2_6ForCausalLM', 'Ovis2_6_MoeForCausalLM', 'PaddleOCRVLForConditionalGeneration', 'PaliGemmaForConditionalGeneration', 'Phi4MMForCausalLM', 'PixtralForConditionalGeneration', 'QwenVLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration', 'Qwen2AudioForConditionalGeneration', 'Qwen2_5OmniModel', 'Qwen2_5OmniForConditionalGeneration', 'Qwen3OmniMoeForConditionalGeneration', 'Qwen3ASRForConditionalGeneration', 'Qwen3ASRRealtimeGeneration', 'Qwen3VLForConditionalGeneration', 'Qwen3VLMoeForConditionalGeneration', 'Qwen3_5ForConditionalGeneration', 'Qwen3_5MoeForConditionalGeneration', 'SkyworkR1VChatModel', 'Step3VLForConditionalGeneration', 'TarsierForConditionalGeneration', 'Tarsier2ForConditionalGeneration', 'UltravoxModel', 'VoxtralForConditionalGeneration', 'VoxtralRealtimeGeneration', 'NemotronParseForConditionalGeneration', 'WhisperForConditionalGeneration', 'ExtractHiddenStatesModel', 'MiMoMTPModel', 'EagleLlamaForCausalLM', 'EagleLlama4ForCausalLM', 'EagleMiniCPMForCausalLM', 'Eagle3LlamaForCausalLM', 'LlamaForCausalLMEagle3', 'Eagle3Qwen2_5vlForCausalLM', 'Eagle3Qwen3vlForCausalLM', 'EagleMistralLarge3ForCausalLM', 'EagleDeepSeekMTPModel', 'DeepSeekMTPModel', 'ErnieMTPModel', 'ExaoneMoeMTP', 'NemotronHMTPModel', 'LongCatFlashMTPModel', 'Glm4MoeMTPModel', 'Glm4MoeLiteMTPModel', 'GlmOcrMTPModel', 'MedusaModel', 'OpenPanguMTPModel', 'Qwen3NextMTP', 'Step3p5MTP', 'Qwen3_5MTP', 'Qwen3_5MoeMTP', 'SmolLM3ForCausalLM', 'Emu3ForConditionalGeneration', 'TransformersForCausalLM', 'TransformersMoEForCausalLM', 'TransformersMultiModalForCausalLM', 'TransformersMultiModalMoEForCausalLM', 'TransformersEmbeddingModel', 'TransformersMoEEmbeddingModel', 
'TransformersMultiModalEmbeddingModel', 'TransformersForSequenceClassification', 'TransformersMoEForSequenceClassification', 'TransformersMultiModalForSequenceClassification']) [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=3524340) For further information visit https://errors.pydantic.dev/2.12/v/value_error

In short, the server first fails to import Triton kernels:
module 'triton.language' has no attribute 'constexpr_function'

and eventually:

ValidationError: Model architectures ['Param2MoEForCausalLM'] are not supported for now.

So it seems vLLM currently does not support the Param2MoE architecture either.
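In the meantime, the repo's custom modeling code does load with plain transformers, so a non-served workaround may be possible. A minimal sketch (untested against this model; assumes enough GPU memory and that the repo's remote `Param2MoE` code is compatible with your installed transformers version):

```python
# Fallback while vLLM/SGLang support is pending: load the model with
# plain transformers. trust_remote_code pulls the repo's custom
# configuration_param2moe.py / modeling_param2moe.py files.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bharatgenai/Param2-17B-A2.4B-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # keep the checkpoint's native dtype
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "Hello"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

This sidesteps both errors above because transformers executes the remote code directly instead of mapping the architecture name to a framework-native implementation.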

Any guidance would be greatly appreciated.

Thanks!

BharatGen AI org

We are currently working on enabling support for vLLM and SGLang. We'll share updates soon. Thanks for your patience!
