Is there some helpful regex to offload all MoE layers to the CPU?

#7
by hdnh2006 - opened

Hello Unsloth team!

For the previous release you provided a recommendation to reduce VRAM usage:

Is there a similar expression for this model? I want to use it with full context and with many parallel requests.
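For context, the earlier-release recommendation was usually a tensor-override regex of roughly this shape (the exact pattern and model path here are assumptions for illustration, not quoted from the release notes):

```shell
# Offload all MoE expert FFN tensors to CPU RAM with llama.cpp's
# -ot/--override-tensor regex, while -ngl 99 keeps the remaining
# tensors (attention, router, norms) on the GPU.
llama-server -m ./model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```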

--cpu-moe

or --n-cpu-moe [number of MoE layers whose experts to keep on the CPU]
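For illustration, a sketch of how these flags are typically passed (the model path and the layer count are placeholders, not recommendations):

```shell
# Keep every layer's MoE expert tensors in CPU RAM:
llama-server -m ./model.gguf -ngl 99 --cpu-moe

# Or offload only the experts of the first N layers to CPU, e.g. N=20,
# so the rest stay on the GPU if VRAM allows:
llama-server -m ./model.gguf -ngl 99 --n-cpu-moe 20
```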

--cpu-moe

It goes really slow with this parameter. How does this work? Is it keeping the router on the GPU but the experts on the CPU? Why doesn't it move an expert to the GPU once the router selects it?
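To make the question concrete, here is a minimal NumPy sketch of top-k MoE gating (all sizes are made up for illustration, not this model's real config). Because the router picks a different expert subset for every token, copying the chosen experts' weights to the GPU each step would generally cost more PCIe traffic than just running the expert matmuls on the CPU:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
n_experts, top_k, d = 8, 2, 16
router_w = rng.standard_normal((d, n_experts))  # small router, cheap on GPU

def route(hidden):
    """Top-k gating: project the hidden state to one logit per expert,
    then evaluate only the top-k experts for this token."""
    logits = hidden @ router_w
    chosen = np.argsort(logits)[-top_k:]      # indices of selected experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the chosen experts
    return chosen, weights

# Two different tokens usually select different experts,
# which is why per-token weight transfers would thrash.
experts, weights = route(rng.standard_normal(d))
print(sorted(experts.tolist()), float(weights.sum()))
```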

If you use the latest versions of llama-server, there is a new option called --fit that is on by default, so it automatically offloads tensors that do not fit into VRAM.
Try running it without any tensor-override options and see how it goes.
There is also a parameter for how much VRAM to keep free, --fit-target: by default it leaves 1 GB of VRAM free, and you can make it fill the GPU further with something like --fit-target 256 to keep only 256 MB of VRAM free.
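As a concrete (hypothetical) invocation of the options described above, assuming a placeholder model path:

```shell
# Rely on automatic fitting (per the comment above, --fit is on by default
# in recent builds), and tell it to keep only ~256 MB of VRAM free:
llama-server -m ./model.gguf --fit-target 256
```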
