
Test-shakespeare

Trained with transformer-toolkit.

Architecture

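| param | value |
| --- | --- |
| vocab_size | 32000 |
| dim | 512 |
| n_layers | 12 |
| n_heads | 8 |
| max_seq | 512 |
| attn | gqa |
| n_kv_heads | 2 |
| latent_dim | 64 |
| ffn | swiglu |
| hidden_dim | 1536 |
| n_experts | 8 |
| top_k | 2 |
| moe_aux_weight | 0.01 |
| moe_capacity | 1.0 |
| moe_n_shared | 2 |
| moe_n_routed | 6 |
| norm | rmsnorm |
| eps | 1e-06 |
| pos_enc | rope |
| dropout | 0.1 |
| tie_weights | False |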
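
A few quantities follow from the values listed above, and a minimal sketch can make them explicit. The `ModelConfig` class and its property names are hypothetical and are not part of transformer-toolkit; the sketch assumes standard grouped-query attention (query heads share key/value heads in equal groups) and assumes that `moe_n_shared` and `moe_n_routed` together account for `n_experts`, which matches the numbers here but is not stated in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for a subset of the architecture values above."""

    vocab_size: int = 32000
    dim: int = 512
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    max_seq: int = 512
    n_experts: int = 8
    top_k: int = 2
    moe_n_shared: int = 2
    moe_n_routed: int = 6

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.dim // self.n_heads

    @property
    def gqa_group_size(self) -> int:
        # Number of query heads sharing one key/value head (standard GQA).
        return self.n_heads // self.n_kv_heads

    @property
    def kv_dim(self) -> int:
        # Total width of the key (or value) projection under standard GQA.
        return self.n_kv_heads * self.head_dim

    def validate(self) -> None:
        # Basic divisibility checks that GQA requires.
        if self.dim % self.n_heads != 0:
            raise ValueError("dim must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads")
        # Assumed reading: shared plus routed experts make up n_experts.
        if self.moe_n_shared + self.moe_n_routed != self.n_experts:
            raise ValueError("shared + routed experts should equal n_experts")
        if self.top_k > self.moe_n_routed:
            raise ValueError("top_k cannot exceed the number of routed experts")


cfg = ModelConfig()
cfg.validate()
print(cfg.head_dim, cfg.gqa_group_size, cfg.kv_dim)
```

Under these assumptions each head is 64 wide, four query heads share each key/value head, and the key and value projections are each 128 wide.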

Metrics

| metric | value |
| --- | --- |
| val_loss | 3.949221694469452 |
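
The validation loss can be converted into more familiar units with a short sketch. This assumes `val_loss` is the mean per-token cross-entropy measured in nats (natural logarithm), which this card does not state; the helper names are hypothetical.

```python
import math

VAL_LOSS = 3.949221694469452  # reported val_loss, assumed to be nats per token
VOCAB_SIZE = 32000            # from the architecture table


def perplexity(loss_nats: float) -> float:
    # Perplexity is the exponential of the mean cross-entropy in nats.
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    # Convert nats to bits by dividing by ln(2).
    return loss_nats / math.log(2)


def uniform_baseline(vocab_size: int) -> float:
    # Loss of a model that assigns equal probability to every token.
    return math.log(vocab_size)


print(f"perplexity:       {perplexity(VAL_LOSS):.2f}")
print(f"bits per token:   {bits_per_token(VAL_LOSS):.3f}")
print(f"uniform baseline: {uniform_baseline(VOCAB_SIZE):.3f} nats")
```

Under that assumption the loss corresponds to a perplexity of roughly 52, compared with a uniform-guess baseline of about 10.37 nats over the 32000-token vocabulary.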