🐳 nanoWhale-100m

DeepSeek-V4 architecture | 110M params | Hyper-Connections | MLA | MoE

MLA (q_lora_rank=160) | 4 routed experts + 1 shared | Multi-head Latent Attention | Hyper-Connections (hc_mult=4) | MTP prediction layer | F32 precision
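
For orientation, the hyperparameters listed above can be collected into a small config object. This is only a sketch: the class name and every field name (`n_routed_experts`, `n_shared_experts`, `use_mtp`, `dtype`) are hypothetical and not taken from the model's actual code; only the values (160, 4, 1, 4, F32) come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NanoWhaleConfigSketch:
    """Hypothetical container for the values stated on this card."""

    q_lora_rank: int = 160       # MLA query low-rank dimension (from the card)
    n_routed_experts: int = 4    # routed MoE experts (field name assumed)
    n_shared_experts: int = 1    # shared MoE expert (field name assumed)
    hc_mult: int = 4             # Hyper-Connections multiplier (from the card)
    use_mtp: bool = True         # MTP prediction layer present (field name assumed)
    dtype: str = "float32"       # F32 precision (field name assumed)

    @property
    def total_experts(self) -> int:
        # Routed experts plus the shared expert.
        return self.n_routed_experts + self.n_shared_experts


cfg = NanoWhaleConfigSketch()
print(cfg)
```

The real configuration very likely carries more fields (hidden size, layer count, vocabulary size, and so on) that this card does not state, so they are left out here rather than guessed.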