---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---
# Shard-40m-v1
A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.
This is the first checkpoint in the Shard series of small experimental transformers.
## Architecture
```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
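The alternating pattern above determines, per layer, whether attention is restricted to a 1024-token window or spans the full sequence. As a rough illustration (a minimal sketch with hypothetical helper names, not the repository's actual code; the exact layer-to-window assignment is an assumption):
```python
import torch

N_LAYERS, WINDOW = 12, 1024

def layer_window(layer_idx: int):
    """Hypothetical helper: sliding-window size for a layer, or None for global.
    Assumes even layers slide and odd layers are global; the last layer is
    always global, matching the pattern described above."""
    if layer_idx == N_LAYERS - 1:
        return None
    return WINDOW if layer_idx % 2 == 0 else None

def attention_mask(seq_len: int, window):
    """Boolean (seq_len, seq_len) mask: True where query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    if window is None:
        return causal                      # global causal attention
    return causal & (i - j < window)       # only the most recent `window` keys

print([layer_window(l) for l in range(N_LAYERS)])
# [1024, None, 1024, None, 1024, None, 1024, None, 1024, None, 1024, None]
```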
## Training
```
Phase 1 (pretrain):
  Compute: Thunder Compute single GPU
  Steps: 48,220 of a 100,000-step target (paused early)
  Throughput: 86,800 tokens per second
  Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
  LR schedule: WSD (warmup-stable-decay)
  Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
  Compute: Colab A100
  Steps: 20,000 (full anneal complete)
  Final cross-entropy: 3.27
  Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
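As a rough sketch of how these pieces fit together (illustrative only: the parameter-split rule, the warmup length, the decay fraction, and the tanh form of the logit cap are assumptions, not the repository's exact code):
```python
import torch

def split_param_groups(model):
    """Illustrative split: 2D hidden weights to Muon, everything else to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and 'embed' not in name:   # hidden matrices -> Muon
            muon_params.append(p)
        else:                                     # embeddings, norms, 1D tensors -> AdamW
            adamw_params.append(p)
    return muon_params, adamw_params

def wsd_lr(step, warmup=1000, total=100_000, decay_frac=0.1):
    """WSD (warmup-stable-decay) multiplier: linear warmup, flat plateau, linear decay."""
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return step / max(1, warmup)
    if step < decay_start:
        return 1.0
    return max(0.0, (total - step) / (total - decay_start))

def capped_logits_and_zloss(logits, cap=30.0, z_coef=1e-4):
    """Soft-cap lm_head logits at +/-30 and add a z-loss on the log partition function."""
    logits = cap * torch.tanh(logits / cap)
    z = torch.logsumexp(logits.float(), dim=-1)
    return logits, z_coef * (z ** 2).mean()
```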
## Files
- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimum loading code (model.py, config.py, tokenizer.py, muon.py)
## How to load
```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer
# Load the bf16 checkpoint (model state only) and rebuild the config
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()
tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
# Greedy decode: append the argmax token 40 times
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
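The loop above decodes greedily. The sampled outputs shown in the benchmark section were generated at temperature 0.7 and top_p 0.9; a minimal sampling variant (an illustrative sketch reusing `model` and `tok` from the snippet above, not the repository's generation code) might look like this:
```python
import torch

def sample_next(logits, temperature=0.7, top_p=0.9):
    """Temperature + nucleus (top-p) sampling over the last-position logits."""
    probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    keep = (cumulative - sorted_probs) < top_p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx.gather(-1, choice)

ids = torch.tensor([tok.encode('Once upon a time, in a small village,').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        ids = torch.cat([ids, sample_next(logits)], 1)
print(tok.decode(ids[0].tolist()))
```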
## Benchmark
Greedy decode at 47 tokens per second on a single CUDA GPU. Model footprint 109 MB in bf16, 16 MB peak inference memory.
Sampled outputs at temperature 0.7, top_p 0.9:
| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |
## What this artifact proves
The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, the Gemma 4 alternating attention pattern, and the anneal-phase mix of math, code, and prose all remained stable. Loss decreased monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
## What this artifact cannot do
Math (broken, hallucinates digits or loops). Code generation (gibberish). Factual grounding (hallucinates with grammatical confidence). Long-context retrieval (max sequence 8192 with sliding window 1024 means effective context is much shorter for non-global layers).
## Why release it
To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.
## Notes
Because this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.
## License
Apache 2.0. Use freely. Attribution appreciated but not required.
## Citation
```
@misc{shard40mv1,
  author    = {Shane (Crownelius)},
  title     = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```