"NotImplementedError: is_act_and_mul=False" Error
Hi, I'm trying to run this checkpoint on my local setup (mobile RTX 5090) and I'm hitting an NVFP4 kernel that isn't implemented. Specifically: `NotImplementedError: is_act_and_mul=False is supported only for unquantized, ModelOpt FP8, and ModelOpt NvFp4 checkpoints`.
I have vllm==0.12.0 (i.e., not the nightly build your Docker image uses). However, this error is raised in vllm/model_executor/layers/fused_moe/layer.py, and that code hasn't changed even on the latest main branch of vLLM. It's clearly shown here: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/layer/#vllm.model_executor.layers.fused_moe.layer.FusedMoE or https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py .
The relevant piece of code is:
```python
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
        ModelOptNvFp4FusedMoE,
    )

    if not isinstance(
        self.quant_method,
        (
            UnquantizedFusedMoEMethod,
            ModelOptFp8MoEMethod,
            ModelOptNvFp4FusedMoE,
        ),
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            ", ModelOpt FP8, and ModelOpt NvFp4 checkpoints"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
```
So my question is: how were you able to run it with the config.json specifying the quantization type compressed-tensors? It seems this checkpoint would need --quantization modelopt_fp4 (maybe?), but that can't be overridden if the config.json says otherwise. Is there some other trick / env var in the image you tested it with?
Thank you for any advice!
EDIT: I just noticed the repo already states this clearly:
Note: This is currently not functioning in v0.12.0 of VLLM. It seems like Nemotron-H MoE uses a "non-gated path" that is not supported for compressed-tensor NVFP4 in VLLM.
Thus, it seems waiting for an update or using the nightly build is the required path. Sorry for the unnecessary discussion!
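For anyone wondering what the "non-gated path" means here: in a gated (SwiGLU-style) expert MLP, vLLM fuses the activation with an elementwise multiply (`is_act_and_mul=True`), whereas a non-gated expert has a single up projection through an activation with no gate branch to multiply against. A rough numpy sketch of the difference (the shapes and the SiLU/ReLU choices below are illustrative assumptions, not the actual Nemotron kernels):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_expert(x, w_gate, w_up, w_down):
    # "act_and_mul" path: activation on the gate branch, then an
    # elementwise multiply with the up projection (SwiGLU-style).
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def non_gated_expert(x, w_up, w_down):
    # Non-gated path: a single projection through an activation,
    # with no elementwise multiply for the kernel to fuse.
    return np.maximum(x @ w_up, 0.0) @ w_down  # plain ReLU as a stand-in

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate, w_up, w_down = (rng.standard_normal(s) for s in [(8, 16), (8, 16), (16, 8)])
y_gated = gated_expert(x, w_gate, w_up, w_down)
y_plain = non_gated_expert(x, w_up, w_down)
print(y_gated.shape, y_plain.shape)
```

The quantized MoE kernels are specialized for the fused gated shape, which is why the non-gated variant falls through to the `NotImplementedError` above.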
I wasn't able to run it with vLLM; I put a note in the model card about it. I suspect it probably needs some development from the vLLM team to get it working.
Edit: Yep, I should probably at least send a trace in an issue to the vLLM team so they're aware of it.
I'm using the vllm/vllm-openai:nightly build (as of Dec 20) and getting the same "NotImplementedError: is_act_and_mul=False is supported only for unquantized, ModelOpt FP8, and ModelOpt NvFp4 checkpoints" error. I see the Owner's note about not being able to run it with vLLM. Question for the owner - does it work with a different inference engine like TensorRT-LLM or SGLang? Does editing the config.json file help? I did also try it on v0.13.0 stable build of vLLM.
It's possible that it could run, but I don't know; I've never run TensorRT or SGLang. At least for TensorRT, I think it wants a ModelOpt-style NVFP4 quant, so it may not be happy running this one, which was created with llm-compressor.
@Firworks Do you think it might be a good idea to open an issue in the vLLM repo to track this and maybe, hopefully, get it working sometime soon?
Yeah I'm planning to next time I spin up an instance to do some more quants. I wanted to recreate it with a minimal script and give them the full trace too.
hihi, any updates on this? Still not able to run using `sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:nightly --model Firworks/NVIDIA-Nemotron-3-Nano-30B-A3B-nvfp4 --dtype auto --max-model-len 32768`
Other than that, can I know the minimum VRAM required to run this model? I’m running it on an RTX 5090 (32 GB) with TensorRT-LLM, but even with a very low context length, I still encounter OOM issues.
@Firworks
What do you recommend running it with? Just got my Spark arriving today. Really looking forward to hosting batch inference with this on the Spark. Appreciate your time <3
Also, what's the time it took for quantizing? I might be able to run that on my local workstations too.
Ok I just submitted an issue on the VLLM github. Maybe they'll have a workaround for this.
https://github.com/vllm-project/vllm/issues/31782
Other than that, can I know the minimum VRAM required to run this model? I’m running it on an RTX 5090 (32 GB) with TensorRT-LLM, but even with a very low context length, I still encounter OOM issues.
The model itself in NVFP4 is only 18GB, so you'd need some headroom on top of that for context, but I would think it should still fit on a 32GB card as long as the context length isn't set too long. However, I don't know anything about running with TensorRT-LLM. If it were me, I'd be showing it to Claude Opus 4.5 and seeing what it can suggest.
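For a rough sanity check, VRAM can be estimated as weights plus KV cache. All numbers below are illustrative assumptions (Nemotron is a hybrid Mamba/attention model, so only its attention layers hold a conventional KV cache, and the real footprint will differ); the point is just the arithmetic:

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=1):  # 1 byte assumes an fp8 KV cache
    # K and V each store (context_len x num_kv_heads x head_dim) per attention layer.
    return 2 * num_attn_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical model dimensions, just to show the calculation:
weights_gib = 18.0  # NVFP4 checkpoint size mentioned in the thread
kv_gib = kv_cache_bytes(num_attn_layers=6, num_kv_heads=8, head_dim=128,
                        context_len=32768) / 2**30
print(f"~{weights_gib + kv_gib:.1f} GiB before activations and runtime overhead")
```

With numbers in this ballpark the KV cache is well under a GiB, so an OOM on a 32GB card is more likely coming from runtime overhead or an overly generous memory-fraction/context setting than from the cache itself.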
@Firworks
what do you recommend to run it with? just got my spark arriving today. Really looking forward to hosting batch inference with this on the spark. appreciate your time <3
also whats the time it took for quantizing? might also be able to run that on my local workstations maybe
As of right now you won't be able to run this one at least with VLLM. However, there is a version of this model quantized to NVFP4 with a different toolchain (ModelOpt) that claims to work on the Spark.
https://huggingface.co/cybermotaz/nemotron3-nano-nvfp4-w4a16
Maybe give that a try. Really, anyone who wants to actually run this model should probably try that one; if you just want to run the model, you don't care whether it was made with ModelOpt or compressed-tensors as long as it works. I'm hoping we can get the compressed-tensors version working eventually, since that's my standard method for quantizing.
As for the time, I don't remember how long this one took, but it probably wasn't long; it's not a very big model. I usually run my quants on an RTX Pro 6000 Blackwell cloud instance, which has dramatically more memory bandwidth than the DGX Spark, but I'm sure it still wouldn't take terribly long on a Spark. I've posted my standard method for running quants here: https://huggingface.co/Firworks/MiroThinker-v1.0-30B-nvfp4/discussions/1#69269c6d40ce1d3b1a6ca1cc
Though for doing a lot of work on a DGX Spark, you may be better off learning to run ModelOpt for quantization, as that's NVIDIA's preferred tooling.

https://paste.ubuntu.com/p/9CJHzqxDgj/
The released weights of Firworks/NVIDIA-Nemotron-3-Nano-30B-A3B-nvfp4 come with incorrect (broken) scaling factors...
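For anyone who wants to reproduce the broken-scales finding, here is a minimal sketch that scans checkpoint tensors for non-finite scale values. The `weight_scale` naming and the dummy dict are assumptions for illustration; with a real checkpoint you would load each `.safetensors` shard (e.g. via the safetensors library) and pass its tensors through the same check:

```python
import numpy as np

def find_bad_scales(tensors, pattern="scale"):
    """Return names of scale tensors containing NaN or Inf values."""
    bad = []
    for name, t in tensors.items():
        if pattern in name and not np.all(np.isfinite(t)):
            bad.append(name)
    return bad

# Dummy stand-in for a loaded checkpoint shard (names are hypothetical):
tensors = {
    "model.layers.0.mlp.experts.down_proj.weight_scale": np.array([0.02, np.nan]),
    "model.layers.0.mlp.experts.down_proj.weight": np.ones((4, 4)),
    "model.layers.1.mlp.experts.up_proj.weight_scale": np.array([0.01, 0.03]),
}
print(find_bad_scales(tensors))
```

A scan like this over every shard would show exactly which layers carry the broken scaling factors.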
Well I wonder if that's a bug in compressed-tensors or llm-compressor then? I ran the standard recipe for NVFP4. I'm not sure what would cause that.
Hmm, your current Hugging Face scales have the NaN problem, like I mentioned before. My machine could not install llm-compressor...
I'm working on supporting this model in vLLM but am hitting NaN values, and the community doesn't seem to accept a NaN fix in vLLM.
Well I wonder if that's a bug in compressed-tensors or llm-compressor then? I ran the standard recipe for NVFP4. I'm not sure what would cause that.
I get that the model as-is is broken but I don't know what to do differently to fix it. I just run a standard quantization recipe. If I rerun it it'll be broken again unless something changes with the quantization setup or libraries.
Could you open an issue in llm-compressor?
I get that the model as-is is broken but I don't know what to do differently to fix it. I just run a standard quantization recipe. If I rerun it it'll be broken again unless something changes with the quantization setup or libraries.
https://github.com/vllm-project/vllm/pull/32080#issuecomment-3801778998
Seems like they fixed the gating issue in vLLM.
I can confirm Nemotron NVFP4 is still problematic on the RTX 5090 (SM120). Even with the official NVIDIA Nemotron NVFP4 (ModelOpt FP4) checkpoint (not compressed-tensors), vLLM fails during init with:
`NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.`
This looks like an SM120 NVFP4 MoE backend support gap rather than just a "compressed-tensors / is_act_and_mul" issue. Similar SM120 NVFP4 MoE failures are already being tracked in vLLM.
```
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:325] version 0.15.1.dev0+gf17644344.d20260203
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:325] model /NVME2T/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
(APIServer pid=8064) INFO 02-22 22:00:39 [utils.py:261] non-default args: {... 'quantization': 'modelopt_fp4', ... 'kv_cache_dtype': 'fp8', ...}
(EngineCore_DP0 pid=8163) INFO 02-22 22:00:46 [gpu_model_runner.py:4021] Starting to load model /NVME2T/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4...
(EngineCore_DP0 pid=8163) INFO 02-22 22:00:46 [modelopt.py:1158] Using flashinfer-cutlass for NVFP4 GEMM
(EngineCore_DP0 pid=8163) ERROR 02-22 22:00:46 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=8163) ERROR 02-22 22:00:46 [core.py:946] NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
(APIServer pid=8064) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
curl: (7) Failed to connect to 127.0.0.1 port 8001 after 0 ms: Couldn't connect to server
```