"NotImplementedError: is_act_and_mul=False" Error
Hi, I'm trying to run this checkpoint on my local setup (mobile RTX 5090) and I'm hitting an NVFP4 kernel that isn't implemented. Specifically: `NotImplementedError: is_act_and_mul=False is supported only for unquantized, ModelOpt FP8, and ModelOpt NvFp4 checkpoints`.
I have vllm==0.12.0 (i.e., not the nightly build your Docker image uses). However, this error is raised in vllm/model_executor/layers/fused_moe/layer.py, and that code hasn't changed even on the latest main branch of vLLM. It's clearly shown here: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/layer/#vllm.model_executor.layers.fused_moe.layer.FusedMoE or https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py .
The relevant piece of code is:
```python
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
        ModelOptNvFp4FusedMoE,
    )

    if not isinstance(
        self.quant_method,
        (
            UnquantizedFusedMoEMethod,
            ModelOptFp8MoEMethod,
            ModelOptNvFp4FusedMoE,
        ),
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            ", ModelOpt FP8, and ModelOpt NvFp4 checkpoints"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
```
So my question is: how were you able to run it with the config.json specifying the quantization type compressed-tensors? It seems this checkpoint would need --quantization modelopt_fp4 (maybe?), but that can't be overridden if the config.json says otherwise. Is there some other trick / env var in the image you tested it with?
Thank you for any advice!
EDIT: I just noticed the repo already states this clearly:
Note: This is currently not functioning in v0.12.0 of VLLM. It seems like Nemotron-H MoE uses a "non-gated path" that is not supported for compressed-tensor NVFP4 in VLLM.
Thus, it seems waiting for an update or using the nightly build is the required path. Sorry for the unnecessary discussion!
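For anyone wondering what the "non-gated path" means here: in a gated (SwiGLU-style) expert MLP, vLLM fuses the activation with an elementwise multiply (`is_act_and_mul=True`), whereas a non-gated expert has a single up projection through an activation with no gate branch to multiply against. A rough numpy sketch of the difference (the shapes and the SiLU/ReLU choices below are illustrative assumptions, not the actual Nemotron kernels):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_expert(x, w_gate, w_up, w_down):
    # "act_and_mul" path: activation on the gate branch, then an
    # elementwise multiply with the up projection (SwiGLU-style).
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def non_gated_expert(x, w_up, w_down):
    # Non-gated path: a single projection through an activation,
    # with no elementwise multiply for the kernel to fuse.
    return np.maximum(x @ w_up, 0.0) @ w_down  # plain ReLU as a stand-in

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate, w_up, w_down = (rng.standard_normal(s) for s in [(8, 16), (8, 16), (16, 8)])
y_gated = gated_expert(x, w_gate, w_up, w_down)
y_plain = non_gated_expert(x, w_up, w_down)
print(y_gated.shape, y_plain.shape)
```

The quantized MoE kernels are specialized for the fused gated shape, which is why the non-gated variant falls through to the `NotImplementedError` above.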
I wasn't able to run it with vLLM; I put a note in the model card about it. I suspect it probably needs some development from the vLLM team to get it working.
Edit: Yep, I should probably at least send a trace in an issue to the vLLM team so they're aware of it.
I'm using the vllm/vllm-openai:nightly build (as of Dec 20) and getting the same "NotImplementedError: is_act_and_mul=False is supported only for unquantized, ModelOpt FP8, and ModelOpt NvFp4 checkpoints" error. I see the Owner's note about not being able to run it with vLLM. Question for the owner - does it work with a different inference engine like TensorRT-LLM or SGLang? Does editing the config.json file help? I did also try it on v0.13.0 stable build of vLLM.
It's possible that it could run, but I don't know; I've never run TensorRT or SGLang. At least for TensorRT, I think it wants a ModelOpt-style NVFP4 quant, so it may not be happy running this one, which was created with llm-compressor.
@Firworks Do you think it might be a good idea to open an issue in the vLLM repo to track this and maybe, hopefully, get it working sometime soon?
Yeah I'm planning to next time I spin up an instance to do some more quants. I wanted to recreate it with a minimal script and give them the full trace too.
hihi, any updates on this? Still not able to run using `sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:nightly --model Firworks/NVIDIA-Nemotron-3-Nano-30B-A3B-nvfp4 --dtype auto --max-model-len 32768`
Other than that, can I know the minimum VRAM required to run this model? I’m running it on an RTX 5090 (32 GB) with TensorRT-LLM, but even with a very low context length, I still encounter OOM issues.
@Firworks
What do you recommend running it with? Just got my Spark arriving today. Really looking forward to hosting batch inference with this on the Spark. Appreciate your time <3
Also, what's the time it took for quantizing? I might be able to run that on my local workstations too.
Ok I just submitted an issue on the VLLM github. Maybe they'll have a workaround for this.
https://github.com/vllm-project/vllm/issues/31782
Other than that, can I know the minimum VRAM required to run this model? I’m running it on an RTX 5090 (32 GB) with TensorRT-LLM, but even with a very low context length, I still encounter OOM issues.
The model itself in NVFP4 is only 18GB, so you'd need some headroom on top of that for context, but I would think it should still fit on a 32GB card as long as the context length isn't set too long. However, I don't know anything about running with TensorRT-LLM. If it were me, I'd be showing it to Claude Opus 4.5 and seeing what it can suggest.
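For a rough sanity check, VRAM can be estimated as weights plus KV cache. All numbers below are illustrative assumptions (Nemotron is a hybrid Mamba/attention model, so only its attention layers hold a conventional KV cache, and the real footprint will differ); the point is just the arithmetic:

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=1):  # 1 byte assumes an fp8 KV cache
    # K and V each store (context_len x num_kv_heads x head_dim) per attention layer.
    return 2 * num_attn_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical model dimensions, just to show the calculation:
weights_gib = 18.0  # NVFP4 checkpoint size mentioned in the thread
kv_gib = kv_cache_bytes(num_attn_layers=6, num_kv_heads=8, head_dim=128,
                        context_len=32768) / 2**30
print(f"~{weights_gib + kv_gib:.1f} GiB before activations and runtime overhead")
```

With numbers in this ballpark the KV cache is well under a GiB, so an OOM on a 32GB card is more likely coming from runtime overhead or an overly generous memory-fraction/context setting than from the cache itself.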
@Firworks
what do you recommend to run it with? just got my spark arriving today. Really looking forward to hosting batch inference with this on the spark. appreciate your time <3
also whats the time it took for quantizing? might also be able to run that on my local workstations maybe
As of right now you won't be able to run this one at least with VLLM. However, there is a version of this model quantized to NVFP4 with a different toolchain (ModelOpt) that claims to work on the Spark.
https://huggingface.co/cybermotaz/nemotron3-nano-nvfp4-w4a16
Maybe give that a try. Really, anyone who wants to actually run this model should probably try that one; if you just want to run the model, you don't care whether it was made with ModelOpt or compressed-tensors as long as it works. I'm hoping we can get the compressed-tensors version working eventually, since that's my standard method for quantizing.
As for the time, I don't remember how long this one took, but it probably wasn't long; it's not a very big model. I usually run my quants on an RTX Pro 6000 Blackwell cloud instance, which has dramatically more memory bandwidth than the DGX Spark, but I'm sure it still wouldn't take terribly long on a Spark. I've posted my standard method for running quants here: https://huggingface.co/Firworks/MiroThinker-v1.0-30B-nvfp4/discussions/1#69269c6d40ce1d3b1a6ca1cc
Though for doing a lot of work on a DGX Spark, you may be better off learning to run ModelOpt for quantization, as that's NVIDIA's preferred tooling.

https://paste.ubuntu.com/p/9CJHzqxDgj/
The released weights of Firworks/NVIDIA-Nemotron-3-Nano-30B-A3B-nvfp4 come with incorrect (broken) scaling factors...
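For anyone who wants to reproduce the broken-scales finding, here is a minimal sketch that scans checkpoint tensors for non-finite scale values. The `weight_scale` naming and the dummy dict are assumptions for illustration; with a real checkpoint you would load each `.safetensors` shard (e.g. via the safetensors library) and pass its tensors through the same check:

```python
import numpy as np

def find_bad_scales(tensors, pattern="scale"):
    """Return names of scale tensors containing NaN or Inf values."""
    bad = []
    for name, t in tensors.items():
        if pattern in name and not np.all(np.isfinite(t)):
            bad.append(name)
    return bad

# Dummy stand-in for a loaded checkpoint shard (names are hypothetical):
tensors = {
    "model.layers.0.mlp.experts.down_proj.weight_scale": np.array([0.02, np.nan]),
    "model.layers.0.mlp.experts.down_proj.weight": np.ones((4, 4)),
    "model.layers.1.mlp.experts.up_proj.weight_scale": np.array([0.01, 0.03]),
}
print(find_bad_scales(tensors))
```

A scan like this over every shard would show exactly which layers carry the broken scaling factors.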
Well I wonder if that's a bug in compressed-tensors or llm-compressor then? I ran the standard recipe for NVFP4. I'm not sure what would cause that.
Hmm, your current Hugging Face scales have the NaN problem, like I mentioned before. My machine could not install llm-compressor...
I'm working on supporting this model in vLLM but am hitting NaN values, and the community doesn't seem to accept a NaN fix in vLLM.
Well I wonder if that's a bug in compressed-tensors or llm-compressor then? I ran the standard recipe for NVFP4. I'm not sure what would cause that.
I get that the model as-is is broken but I don't know what to do differently to fix it. I just run a standard quantization recipe. If I rerun it it'll be broken again unless something changes with the quantization setup or libraries.
Could you open an issue in llm-compressor?
I get that the model as-is is broken but I don't know what to do differently to fix it. I just run a standard quantization recipe. If I rerun it it'll be broken again unless something changes with the quantization setup or libraries.
https://github.com/vllm-project/vllm/pull/32080#issuecomment-3801778998
Seems like they fixed the gating issue in vLLM.
I can confirm Nemotron NVFP4 is still problematic on the RTX 5090 (SM120). Even with the official NVIDIA Nemotron NVFP4 (ModelOpt FP4) checkpoint (not compressed-tensors), vLLM fails during init with:
`NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.`
This looks like an SM120 NVFP4 MoE backend support gap rather than just a "compressed-tensors / is_act_and_mul" issue. Similar SM120 NVFP4 MoE failures are already being tracked in vLLM.
```
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:325] version 0.15.1.dev0+gf17644344.d20260203
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:325] model /NVME2T/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:261] non-default args: {... 'quantization': 'modelopt_fp4', ... 'kv_cache_dtype': 'fp8', ...}
(EngineCore_DP0 pid=8163) INFO 02-22 22:00:46 [gpu_model_runner.py:4021] Starting to load model /NVME2T/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4...
(EngineCore_DP0 pid=8163) INFO 02-22 22:00:46 [modelopt.py:1158] Using flashinfer-cutlass for NVFP4 GEMM
(EngineCore_DP0 pid=8163) ERROR 02-22 22:00:46 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=8163) ERROR 02-22 22:00:46 [core.py:946] NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
(APIServer pid=8064) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
curl: (7) Failed to connect to 127.0.0.1 port 8001 after 0 ms: Couldn't connect to server
```