---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---

# Shard-1

A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.
## Architecture

```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
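
The listed dimensions reproduce the reported total exactly, assuming no biases, two RMSNorm gains per layer plus one final norm, tied embeddings counted once, and a SwiGLU MLP with gate, up, and down projections. A quick sanity check (not code from this repo):

```python
# Recompute the reported parameter count from the dimensions listed above.
vocab, d_model, n_layers, d_ff = 8192, 512, 12, 2048

embed = vocab * d_model                # tied input/output embedding, counted once
attn = 4 * d_model * d_model           # Wq, Wk, Wv, Wo (8 heads x head_dim 64)
mlp = 3 * d_model * d_ff               # SwiGLU: gate, up, and down projections
norms = 2 * d_model                    # pre-attention + pre-MLP RMSNorm gains
per_layer = attn + mlp + norms

total = embed + n_layers * per_layer + d_model  # + final RMSNorm
print(f"{total:,}")                    # 54,538,752
```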
## Training

```
Phase 1 (pretrain):
Compute: Thunder Compute single GPU
Steps: 48,220 of a 100,000-step target (paused early)
Throughput: 86,800 tokens per second
Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
LR schedule: WSD (warmup-stable-decay)
Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
Compute: Colab A100
Steps: 20,000 (full anneal complete)
Final cross-entropy: 3.27
Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
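
For reference, the stabilizers and WSD schedule named above are standard techniques. A minimal sketch of one common way to implement them, using the logit cap of 30 and z-loss coefficient of 1e-4 listed above (function names are illustrative; the exact formulation used for this run may differ):

```python
import torch

def cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Soft-cap lm_head logits to the range (-cap, cap) via tanh squashing.
    return cap * torch.tanh(logits / cap)

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # Penalize the squared log-partition function so logit magnitudes stay bounded.
    return coeff * torch.logsumexp(logits.float(), dim=-1).pow(2).mean()

def wsd_lr(step: int, peak_lr: float, warmup: int, decay_start: int, total_steps: int) -> float:
    # Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay to zero.
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / max(total_steps - decay_start, 1))
```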
## Files

- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)
## How to load

```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

# weights_only=False because the checkpoint may store the config as a pickled Config object.
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')

# Greedy decode: append the argmax token 40 times.
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
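
For sampled rather than greedy decoding (the Benchmark section below reports outputs at temperature 0.7, top_p 0.9), a generic nucleus-sampling loop reusing `model` and `tok` from the snippet above could look like the following. This is an illustration, not code shipped in `code/`:

```python
temperature, top_p = 0.7, 0.9
ids = torch.tensor([tok.encode('Once upon a time, in a small village,').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1)
        sorted_p, sorted_idx = probs.sort(descending=True, dim=-1)
        # Zero out tokens outside the top-p nucleus, renormalize, then sample.
        sorted_p[sorted_p.cumsum(-1) - sorted_p > top_p] = 0.0
        sorted_p = sorted_p / sorted_p.sum(-1, keepdim=True)
        nxt = sorted_idx.gather(-1, torch.multinomial(sorted_p, 1))
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```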
## Benchmark

Greedy decode at 47 tokens per second on a single CUDA GPU. Model footprint 109 MB in bf16, 16 MB peak inference memory.
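
A rough way to reproduce the throughput number, again reusing `model` and `tok` from the loading snippet (results vary with GPU and prompt length; this loop recomputes the full sequence each step):

```python
import time

ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
torch.cuda.synchronize()
t0 = time.time()
with torch.no_grad():
    for _ in range(100):  # generate 100 tokens greedily
        logits, _ = model(ids)
        ids = torch.cat([ids, logits[:, -1].argmax(-1, keepdim=True)], 1)
torch.cuda.synchronize()
print(f"{100 / (time.time() - t0):.1f} tokens/s")
```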
Sampled outputs at temperature 0.7, top_p 0.9:

| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |

## What this artifact proves

The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, the Gemma 4 alternating attention, and the anneal phase mixing math, code, and prose all remained stable. Loss decreases monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
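
The min-singular-value sentinel mentioned above can be approximated by periodically checking the smallest singular value of each 2D weight matrix (the hidden weights are the ones Muon updates); a collapse toward zero would indicate rank loss. A monitoring sketch, illustrative only, since the actual sentinel belongs to the training setup and may differ:

```python
import torch

@torch.no_grad()
def min_singular_values(model: torch.nn.Module) -> dict:
    # Smallest singular value of every 2-D parameter; watch for values collapsing toward zero.
    return {
        name: torch.linalg.svdvals(p.detach().float()).min().item()
        for name, p in model.named_parameters()
        if p.ndim == 2
    }
```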
## What this artifact cannot do

- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: a maximum sequence of 8192 with a 1024-token sliding window means effective context is much shorter for the non-global layers.
## Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.
## Notes

Because this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.
## License

Apache 2.0. Use freely. Attribution appreciated but not required.
## Citation

```
@misc{shard40mv1,
  author = {Shane (Crownelius)},
  title = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```