Questions on MoE Hash Routing
#22
by mattduerrmeier - opened
First, thanks to the DeepSeek team for releasing the DeepSeek-V4 models and the paper.
The DeepSeek-V4 paper states that some MoE layers use Hash routing. This strategy is used only for the first 3 MoE layers in both DeepSeek-V4 Flash and Pro:
[...] We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the first 3 MoE layers.
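For context, Hash routing (in the spirit of "Hash Layers", Roller et al., 2021) assigns each token to a fixed expert by hashing its token ID, so no routing parameters need to be learned for those layers. Below is a minimal illustrative sketch; the function name and the specific hash are my own assumptions, not the actual DeepSeek implementation:

```python
def hash_route(token_ids: list[int], num_experts: int) -> list[int]:
    """Map each token ID deterministically to an expert index.

    The multiplicative constant is an arbitrary choice for illustration;
    the paper does not specify the hash function used.
    """
    return [(tok * 2654435761) % num_experts for tok in token_ids]

# The same token ID always lands on the same expert, so routing is
# deterministic and load balance depends only on token statistics.
expert_assignment = hash_route([101, 2009, 101, 7592], num_experts=8)
```

Because the mapping is fixed at initialization, such layers avoid the router-training instabilities that learned gating can exhibit early in the network, which may be one motivation for using it in the first few MoE layers.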
How did the DeepSeek team arrive at this number of layers? Is there a particular reason for applying Hash routing only to the first three MoE layers?
Is it possible to change the n_hash_layers parameter in config.json so that Hash routing is used for every MoE layer instead? Or would that require retraining the model?