🐳 nanoWhale-100m
DeepSeek-V4 architecture | 110M params | Hyper-Connections | MLA | MoE
Multi-head Latent Attention (MLA, q_lora_rank=160)
MoE: 4 routed experts + 1 shared expert
Hyper-Connections (hc_mult=4)
Multi-Token Prediction (MTP) layer
F32 precision
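
The settings listed above can be collected in one place. The following is a minimal sketch only: the class name and every field name are hypothetical and are not taken from the model's actual code; only the values (160, 4, 1, 4, F32) come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NanoWhaleConfig:
    """Hypothetical container for the hyperparameters listed above."""

    q_lora_rank: int = 160      # MLA query low-rank dimension
    n_routed_experts: int = 4   # MoE experts selected by the router
    n_shared_experts: int = 1   # MoE expert applied to every token
    hc_mult: int = 4            # Hyper-Connections multiplier
    use_mtp: bool = True        # Multi-Token Prediction layer present
    dtype: str = "float32"      # F32 precision

    @property
    def total_experts(self) -> int:
        # Routed plus shared experts per MoE layer.
        return self.n_routed_experts + self.n_shared_experts


cfg = NanoWhaleConfig()
print(cfg)
```

Freezing the dataclass keeps the values read-only, which suits a fixed published checkpoint.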
✨ Response (raw generation - may be chaotic):