Questions on MoE Hash Routing

#22
by mattduerrmeier - opened

First, thanks to the DeepSeek team for releasing the DeepSeek-V4 models and the paper.

The DeepSeek-V4 paper states that some MoE layers use Hash routing. This strategy is used only for the first 3 MoE layers in both DeepSeek-V4 Flash and Pro:

> [...] We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the first 3 MoE layers.

How did the DeepSeek team arrive at this value? Is there a particular reason for applying Hash routing only to the first three MoE layers?
Also, is it possible to change the `n_hash_layers` parameter in `config.json` so that Hash routing is used for every MoE layer, or would that require retraining the model?
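For other readers of this thread: my understanding of hash routing is that it replaces the learned gating network with a fixed, deterministic hash of the token id, so each token always lands on the same expert. A minimal Python sketch of that idea (the modulo hash and function name here are illustrative assumptions on my part; the paper does not specify the exact hash function used):

```python
def hash_route(token_id: int, num_experts: int) -> int:
    """Assign a token to an expert purely by hashing its vocabulary id.

    Unlike a learned top-k gate, this assignment has no trainable
    parameters and is fixed for the whole of training and inference.
    The modulo hash is one simple choice, used here for illustration.
    """
    return token_id % num_experts

# The same token id always maps to the same expert:
expert_a = hash_route(1234, num_experts=64)
expert_b = hash_route(1234, num_experts=64)
assert expert_a == expert_b
```

If that picture is right, it would also explain why switching every layer to hash routing via the config alone seems unlikely to work: the later layers' experts were trained against a learned gate, so their weights are specialized to a different token-to-expert assignment.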
