# d24_mid_pos3_k2

d24 MTP pos=3 k_start=2

## Model Details

| Field | Value |
|---|---|
| Architecture | Nanochat (custom transformer) |
| Parameters | ~780M |
| Layers | 24 |
| Hidden dim | 1536 |
| Heads (Q/KV) | 12/12 |
| Vocab size | 32768 |
| Context length | 2048 |
| Window pattern | L |
| Val BPB | 0.727820 |
| MTP probe layers | [3] |
| MTP k_start | 2 |
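The Val BPB row reports bits per byte, a tokenizer-independent compression metric: summed cross-entropy loss (in nats) divided by ln 2 and by the number of raw bytes in the validation text. A minimal sketch of that conversion (the byte and loss totals below are hypothetical, chosen only to land near the reported 0.7278):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) over a corpus
    into bits per byte, the metric reported in the table above."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Hypothetical totals: ~504,500 nats over 1,000,000 validation bytes
# comes out near the reported 0.7278 BPB.
print(round(bits_per_byte(504_500.0, 1_000_000), 4))
```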

## Architecture Notes

- RoPE positional embeddings (no absolute position embeddings)
- QK norm in attention
- ReLU² activation in the MLP
- RMSNorm (no learnable parameters)
- Logit softcap (tanh, ±20.0)
- GQA (grouped-query attention)
- Per-layer scalars (`resid_lambdas`, `x0_lambdas`)
- Sliding window attention pattern: L
- Untied token embedding and LM head
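Three of the notes above are simple element-wise operations that can be sketched directly. This is an illustrative stand-alone sketch in plain Python (the real model applies these to tensors), not code from this repository:

```python
import math

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm over a vector: scale by the reciprocal
    # root-mean-square. No learnable gain, matching the note above.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(v):
    # ReLU² MLP activation: max(0, v), squared.
    return max(0.0, v) ** 2

def softcap_logit(v, cap=20.0):
    # Tanh logit softcap: smoothly bounds any logit to (-cap, +cap),
    # matching the "Logit softcap (tanh, ±20.0)" note.
    return cap * math.tanh(v / cap)
```

After `rms_norm`, the mean squared entry of the output is 1, so downstream layers see inputs at a predictable scale regardless of the residual stream's magnitude.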

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because this is a custom architecture
# not built into the transformers library.
tokenizer = AutoTokenizer.from_pretrained("d24_mid_pos3_k2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("d24_mid_pos3_k2", trust_remote_code=True)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
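The MTP fields in the table (probe layers [3], k_start 2) suggest an auxiliary multi-token-prediction probe reading the hidden state at layer 3 and predicting tokens more than one step ahead, with the first extra offset at k = 2. The target construction for such a probe can be sketched as follows; this is a hypothetical illustration of the idea, not this repository's API:

```python
def mtp_targets(tokens, k):
    # For a probe predicting k steps ahead, position t's target is
    # tokens[t + k]; positions whose target falls past the end of the
    # sequence are dropped.
    return [(t, tokens[t + k]) for t in range(len(tokens) - k)]

tokens = [5, 9, 2, 7, 1]
print(mtp_targets(tokens, 2))  # [(0, 2), (1, 7), (2, 1)]
```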