---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---
# Shard-1
A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.
This is the first checkpoint in the Shard series of small experimental transformers.
## Architecture
```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
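As a sanity check, the headline parameter count can be reproduced from the numbers above. A minimal sketch of that arithmetic, assuming no biases, one pre-attention and one pre-MLP RMSNorm per layer plus a final norm, and the tied embedding counted once (these placement details are inferred from the table, not confirmed against the repo):

```python
# Back-of-the-envelope parameter count for the configuration above.
hidden, layers, mlp, vocab = 512, 12, 2048, 8192

embed = vocab * hidden                   # tied input/output embedding, counted once
attn_per_layer = 4 * hidden * hidden     # Wq, Wk, Wv, Wo (MHA, head dim 64)
mlp_per_layer = 3 * hidden * mlp         # SwiGLU: gate, up, and down projections
norms = layers * 2 * hidden + hidden     # pre-attn + pre-MLP RMSNorm per layer, final norm

total = embed + layers * (attn_per_layer + mlp_per_layer) + norms
print(f"{total:,}")                      # 54,538,752 under these assumptions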
## Training
```
Phase 1 (pretrain):
Compute: Thunder Compute single GPU
Steps: 48,220 of a 100,000 step target (paused early)
Throughput: 86,800 tokens per second
Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
LR schedule: WSD (warmup-stable-decay)
Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4
Phase 2 (anneal):
Compute: Colab A100
Steps: 20,000 (full anneal complete)
Final cross-entropy: 3.27
Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
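The dual-optimizer split and the two stabilizers are the most recipe-specific pieces here. A minimal sketch of both, assuming the Muon group is selected by parameter shape and name and that the z-loss is the standard squared-log-partition penalty; `split_param_groups` and `stabilized_loss` are illustrative names, not functions from the repo:

```python
import torch
import torch.nn.functional as F

def split_param_groups(model):
    """Hidden 2D weights go to Muon; embeddings, norms, and every other
    parameter go to AdamW (the split described above). Illustrative only."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params

def stabilized_loss(logits, targets, logit_cap=30.0, z_coef=1e-4):
    """Soft-cap the lm_head logits at 30 and add a z-loss penalty
    (coefficient 1e-4) on the log-partition function."""
    logits = logit_cap * torch.tanh(logits / logit_cap)          # logit cap
    ce = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
    z = torch.logsumexp(logits, dim=-1)                          # log Z per token
    return ce + z_coef * (z ** 2).mean()                         # z-loss
```

In the actual run the first group would feed the `Muon` optimizer from `code/muon.py` and the second `torch.optim.AdamW`; the WSD schedule then carries both groups through warmup, a long stable plateau, and a final decay.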
## Files
- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)
## How to load
```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer
# Load the annealed checkpoint (model state only, bf16).
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')

# Greedy decode 40 tokens.
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
## Benchmark
Greedy decode at 47 tokens per second on a single CUDA GPU. Model footprint 109 MB in bf16, 16 MB peak inference memory.
Sampled outputs at temperature 0.7, top_p 0.9:
| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |
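These outputs were drawn with temperature and top-p sampling rather than the greedy loop in the loading example. A minimal nucleus-sampling sketch, reusing the `model`, `tok`, and `ids` objects from that example (the repo may ship its own generation helper; this is only an illustration):

```python
import torch

@torch.no_grad()
def sample(model, ids, steps=40, temperature=0.7, top_p=0.9):
    """Nucleus sampling: temperature-scale the logits, keep the smallest
    set of tokens whose cumulative probability exceeds top_p, then draw."""
    for _ in range(steps):
        logits, _ = model(ids)
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        cum = sorted_probs.cumsum(dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0     # drop the tail
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        nxt = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))
        ids = torch.cat([ids, nxt], dim=1)
    return ids

# print(tok.decode(sample(model, ids)[0].tolist()))
```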
## What this artifact proves
The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, Gemma 4 alternating attention, and an anneal phase mixing math, code, and prose all remained stable. Loss decreased monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
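The min-singular-value sentinel referenced above can be approximated as a periodic monitoring pass over the Muon-managed weights. A minimal sketch, assuming the check is simply the smallest singular value of each hidden 2D weight compared against a small threshold (the repo's actual sentinel may differ, and the threshold here is a guess):

```python
import torch

@torch.no_grad()
def rank_loss_sentinel(model, threshold=1e-6):
    """Flag hidden 2D weights whose smallest singular value has collapsed
    toward zero, a symptom of rank loss."""
    flagged = []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            s_min = torch.linalg.svdvals(p.float()).min().item()
            if s_min < threshold:
                flagged.append((name, s_min))
    return flagged
```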
## What this artifact cannot do
- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: max sequence 8192 with a 1024-token sliding window means effective context is much shorter for non-global layers.
## Why release it
To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.
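For scale on "crossing the Chinchilla line": under the common heuristic of roughly 20 training tokens per parameter, that threshold for a 412M-parameter model sits around 8B tokens. A back-of-the-envelope sketch under that heuristic, not a published budget for the planned run:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 412e6            # planned follow-up size (MoE; dense-equivalent count assumed)
tokens_per_param = 20
print(f"~{params * tokens_per_param / 1e9:.1f}B tokens")   # ~8.2B
```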
## Notes
As this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.
## License
Apache 2.0. Use freely. Attribution appreciated but not required.
## Citation
```
@misc{shard40mv1,
  author    = {Shane (Crownelius)},
  title     = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```