---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---

# Shard-1

A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.
## Architecture

```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
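
The listed dimensions reproduce the reported total exactly, assuming no biases, two RMSNorm gains per layer plus one final norm, tied embeddings counted once, and a SwiGLU MLP with gate, up, and down projections. A quick sanity check (not code from this repo):

```python
# Recompute the reported parameter count from the dimensions listed above.
vocab, d_model, n_layers, d_ff = 8192, 512, 12, 2048

embed = vocab * d_model                # tied input/output embedding, counted once
attn = 4 * d_model * d_model           # Wq, Wk, Wv, Wo (8 heads x head_dim 64)
mlp = 3 * d_model * d_ff               # SwiGLU: gate, up, and down projections
norms = 2 * d_model                    # pre-attention + pre-MLP RMSNorm gains
per_layer = attn + mlp + norms

total = embed + n_layers * per_layer + d_model  # + final RMSNorm
print(f"{total:,}")                    # 54,538,752
```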
## Training

```
Phase 1 (pretrain):
Compute: Thunder Compute single GPU
Steps: 48,220 of a 100,000-step target (paused early)
Throughput: 86,800 tokens per second
Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
LR schedule: WSD (warmup-stable-decay)
Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
Compute: Colab A100
Steps: 20,000 (full anneal complete)
Final cross-entropy: 3.27
Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
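
For reference, the stabilizers and WSD schedule named above are standard techniques. A minimal sketch of one common way to implement them, using the logit cap of 30 and z-loss coefficient of 1e-4 listed above (function names are illustrative; the exact formulation used for this run may differ):

```python
import torch

def cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Soft-cap lm_head logits to the range (-cap, cap) via tanh squashing.
    return cap * torch.tanh(logits / cap)

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # Penalize the squared log-partition function so logit magnitudes stay bounded.
    return coeff * torch.logsumexp(logits.float(), dim=-1).pow(2).mean()

def wsd_lr(step: int, peak_lr: float, warmup: int, decay_start: int, total_steps: int) -> float:
    # Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay to zero.
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / max(total_steps - decay_start, 1))
```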
## Files

- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)
## How to load

```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

# weights_only=False because the checkpoint may store the config as a pickled Config object.
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')

# Greedy decode: append the argmax token 40 times.
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
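
For sampled rather than greedy decoding (the Benchmark section below reports outputs at temperature 0.7, top_p 0.9), a generic nucleus-sampling loop reusing `model` and `tok` from the snippet above could look like the following. This is an illustration, not code shipped in `code/`:

```python
temperature, top_p = 0.7, 0.9
ids = torch.tensor([tok.encode('Once upon a time, in a small village,').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1)
        sorted_p, sorted_idx = probs.sort(descending=True, dim=-1)
        # Zero out tokens outside the top-p nucleus, renormalize, then sample.
        sorted_p[sorted_p.cumsum(-1) - sorted_p > top_p] = 0.0
        sorted_p = sorted_p / sorted_p.sum(-1, keepdim=True)
        nxt = sorted_idx.gather(-1, torch.multinomial(sorted_p, 1))
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```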
## Benchmark

Greedy decode at 47 tokens per second on a single CUDA GPU. Model footprint 109 MB in bf16, 16 MB peak inference memory.
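
A rough way to reproduce the throughput number, again reusing `model` and `tok` from the loading snippet (results vary with GPU and prompt length; this loop recomputes the full sequence each step):

```python
import time

ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
torch.cuda.synchronize()
t0 = time.time()
with torch.no_grad():
    for _ in range(100):  # generate 100 tokens greedily
        logits, _ = model(ids)
        ids = torch.cat([ids, logits[:, -1].argmax(-1, keepdim=True)], 1)
torch.cuda.synchronize()
print(f"{100 / (time.time() - t0):.1f} tokens/s")
```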
Sampled outputs at temperature 0.7, top_p 0.9:

| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |

## What this artifact proves

The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, the Gemma 4 alternating attention, and the anneal phase mixing math, code, and prose all remained stable. Loss decreases monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
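
The min-singular-value sentinel mentioned above can be approximated by periodically checking the smallest singular value of each 2D weight matrix (the hidden weights are the ones Muon updates); a collapse toward zero would indicate rank loss. A monitoring sketch, illustrative only, since the actual sentinel belongs to the training setup and may differ:

```python
import torch

@torch.no_grad()
def min_singular_values(model: torch.nn.Module) -> dict:
    # Smallest singular value of every 2-D parameter; watch for values collapsing toward zero.
    return {
        name: torch.linalg.svdvals(p.detach().float()).min().item()
        for name, p in model.named_parameters()
        if p.ndim == 2
    }
```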
## What this artifact cannot do

- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: a maximum sequence of 8192 with a 1024-token sliding window means effective context is much shorter for the non-global layers.
## Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.
## Notes

Because this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.
## License

Apache 2.0. Use freely. Attribution appreciated but not required.
## Citation

```
@misc{shard40mv1,
  author = {Shane (Crownelius)},
  title = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```