Is there some helpful regex to offload all MoE layers to the CPU?

#7
by hdnh2006 - opened

Hello Unsloth team!

For the previous release you provided a recommendation to reduce VRAM usage:

Is there a similar expression for this model? I want to use it with full context and with many parallel requests.
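For context, the earlier-release recommendation was usually a tensor-override regex of roughly this shape (the exact pattern and model path here are assumptions for illustration, not quoted from the release notes):

```shell
# Offload all MoE expert FFN tensors to CPU RAM with llama.cpp's
# -ot/--override-tensor regex, while -ngl 99 keeps the remaining
# tensors (attention, router, norms) on the GPU.
llama-server -m ./model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```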

--cpu-moe

or --n-cpu-moe [number of MoE layers whose experts to keep on the CPU]
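For illustration, a sketch of how these flags are typically passed (the model path and the layer count are placeholders, not recommendations):

```shell
# Keep every layer's MoE expert tensors in CPU RAM:
llama-server -m ./model.gguf -ngl 99 --cpu-moe

# Or offload only the experts of the first N layers to CPU, e.g. N=20,
# so the rest stay on the GPU if VRAM allows:
llama-server -m ./model.gguf -ngl 99 --n-cpu-moe 20
```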

--cpu-moe

It goes really slow with this parameter. How does this work? Is it keeping the router on the GPU but the experts on the CPU? Why doesn't it move an expert to the GPU once the router selects it?
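To make the question concrete, here is a minimal NumPy sketch of top-k MoE gating (all sizes are made up for illustration, not this model's real config). Because the router picks a different expert subset for every token, copying the chosen experts' weights to the GPU each step would generally cost more PCIe traffic than just running the expert matmuls on the CPU:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
n_experts, top_k, d = 8, 2, 16
router_w = rng.standard_normal((d, n_experts))  # small router, cheap on GPU

def route(hidden):
    """Top-k gating: project the hidden state to one logit per expert,
    then evaluate only the top-k experts for this token."""
    logits = hidden @ router_w
    chosen = np.argsort(logits)[-top_k:]      # indices of selected experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the chosen experts
    return chosen, weights

# Two different tokens usually select different experts,
# which is why per-token weight transfers would thrash.
experts, weights = route(rng.standard_normal(d))
print(sorted(experts.tolist()), float(weights.sum()))
```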

If you use the latest versions of llama-server, there is a new option called --fit that is on by default, so it automatically offloads tensors that do not fit into VRAM.
Try running it without any tensor-override options and see how it goes.
There is also a parameter for how much VRAM to keep free, --fit-target: by default it leaves 1 GB of VRAM free, and you can make it fill the GPU further with something like --fit-target 256 to keep only 256 MB of VRAM free.
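As a concrete (hypothetical) invocation of the options described above, assuming a placeholder model path:

```shell
# Rely on automatic fitting (per the comment above, --fit is on by default
# in recent builds), and tell it to keep only ~256 MB of VRAM free:
llama-server -m ./model.gguf --fit-target 256
```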
