# Test-shakespeare

Trained with **transformer-toolkit**.

## Architecture

| param | value |
|---|---|
| `vocab_size` | `32000` |
| `dim` | `512` |
| `n_layers` | `12` |
| `n_heads` | `8` |
| `max_seq` | `512` |
| `attn` | `gqa` |
| `n_kv_heads` | `2` |
| `latent_dim` | `64` |
| `ffn` | `swiglu` |
| `hidden_dim` | `1536` |
| `n_experts` | `8` |
| `top_k` | `2` |
| `moe_aux_weight` | `0.01` |
| `moe_capacity` | `1.0` |
| `moe_n_shared` | `2` |
| `moe_n_routed` | `6` |
| `norm` | `rmsnorm` |
| `eps` | `1e-06` |
| `pos_enc` | `rope` |
| `dropout` | `0.1` |
| `tie_weights` | `False` |

## Metrics

| metric | value |
|---|---|
| `val_loss` | `3.949221694469452` |
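
## Sanity-checking the config

The sketch below shows one way to cross-check the values listed above. `ModelConfig` is a hypothetical container written for this example; its field names mirror the table, but it is not the transformer-toolkit API. It assumes that `moe_n_shared + moe_n_routed` should equal `n_experts`, that `top_k` selects among the routed experts, and that `val_loss` is a mean cross-entropy in nats, so that `exp(val_loss)` can be read as a perplexity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical config container; values copied from the table above."""
    vocab_size: int = 32000
    dim: int = 512
    n_layers: int = 12
    n_heads: int = 8
    max_seq: int = 512
    n_kv_heads: int = 2
    hidden_dim: int = 1536
    n_experts: int = 8
    top_k: int = 2
    moe_n_shared: int = 2
    moe_n_routed: int = 6

    @property
    def head_dim(self) -> int:
        # Per-head width of the attention projections.
        return self.dim // self.n_heads

    @property
    def queries_per_kv_head(self) -> int:
        # In grouped-query attention, several query heads share one KV head.
        return self.n_heads // self.n_kv_heads

    def check(self) -> None:
        # dim must split evenly across the attention heads.
        assert self.dim % self.n_heads == 0
        # Query heads must split evenly across the KV heads.
        assert self.n_heads % self.n_kv_heads == 0
        # Assumption: shared and routed experts together make up n_experts.
        assert self.moe_n_shared + self.moe_n_routed == self.n_experts
        # Assumption: top_k picks among the routed experts only.
        assert self.top_k <= self.moe_n_routed


cfg = ModelConfig()
cfg.check()

# Assumption: val_loss is a mean cross-entropy measured in nats.
VAL_LOSS = 3.949221694469452
perplexity = math.exp(VAL_LOSS)

print(f"head_dim={cfg.head_dim}")
print(f"queries_per_kv_head={cfg.queries_per_kv_head}")
print(f"perplexity={perplexity:.2f}")
```

Under these assumptions the listed values are mutually consistent: each head is 64 wide, four query heads share each KV head, and the validation loss corresponds to a perplexity of roughly 52.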