---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---
# Shard-1
A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.
This is the first checkpoint in the Shard series of small experimental transformers.
## Architecture
```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
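As a sanity check, the headline parameter count can be reproduced from the numbers above. A minimal sketch of that arithmetic, assuming no biases, one pre-attention and one pre-MLP RMSNorm per layer plus a final norm, and the tied embedding counted once (these placement details are inferred from the table, not confirmed against the repo):

```python
# Back-of-the-envelope parameter count for the configuration above.
hidden, layers, mlp, vocab = 512, 12, 2048, 8192

embed = vocab * hidden                   # tied input/output embedding, counted once
attn_per_layer = 4 * hidden * hidden     # Wq, Wk, Wv, Wo (MHA, head dim 64)
mlp_per_layer = 3 * hidden * mlp         # SwiGLU: gate, up, and down projections
norms = layers * 2 * hidden + hidden     # pre-attn + pre-MLP RMSNorm per layer, final norm

total = embed + layers * (attn_per_layer + mlp_per_layer) + norms
print(f"{total:,}")                      # 54,538,752 under these assumptions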
## Training
```
Phase 1 (pretrain):
Compute: Thunder Compute single GPU
Steps: 48,220 of a 100,000 step target (paused early)
Throughput: 86,800 tokens per second
Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
LR schedule: WSD (warmup-stable-decay)
Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4
Phase 2 (anneal):
Compute: Colab A100
Steps: 20,000 (full anneal complete)
Final cross-entropy: 3.27
Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
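The dual-optimizer split and the two stabilizers are the most recipe-specific pieces here. A minimal sketch of both, assuming the Muon group is selected by parameter shape and name and that the z-loss is the standard squared-log-partition penalty; `split_param_groups` and `stabilized_loss` are illustrative names, not functions from the repo:

```python
import torch
import torch.nn.functional as F

def split_param_groups(model):
    """Hidden 2D weights go to Muon; embeddings, norms, and every other
    parameter go to AdamW (the split described above). Illustrative only."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params

def stabilized_loss(logits, targets, logit_cap=30.0, z_coef=1e-4):
    """Soft-cap the lm_head logits at 30 and add a z-loss penalty
    (coefficient 1e-4) on the log-partition function."""
    logits = logit_cap * torch.tanh(logits / logit_cap)          # logit cap
    ce = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
    z = torch.logsumexp(logits, dim=-1)                          # log Z per token
    return ce + z_coef * (z ** 2).mean()                         # z-loss
```

In the actual run the first group would feed the `Muon` optimizer from `code/muon.py` and the second `torch.optim.AdamW`; the WSD schedule then carries both groups through warmup, a long stable plateau, and a final decay.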
## Files
- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)
## How to load
```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer
# Load the annealed checkpoint (model state only, bf16).
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')

# Greedy decode 40 tokens.
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
## Benchmark
Greedy decode at 47 tokens per second on a single CUDA GPU. Model footprint 109 MB in bf16, 16 MB peak inference memory.
Sampled outputs at temperature 0.7, top_p 0.9:
| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |
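These outputs were drawn with temperature and top-p sampling rather than the greedy loop in the loading example. A minimal nucleus-sampling sketch, reusing the `model`, `tok`, and `ids` objects from that example (the repo may ship its own generation helper; this is only an illustration):

```python
import torch

@torch.no_grad()
def sample(model, ids, steps=40, temperature=0.7, top_p=0.9):
    """Nucleus sampling: temperature-scale the logits, keep the smallest
    set of tokens whose cumulative probability exceeds top_p, then draw."""
    for _ in range(steps):
        logits, _ = model(ids)
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        cum = sorted_probs.cumsum(dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0     # drop the tail
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        nxt = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))
        ids = torch.cat([ids, nxt], dim=1)
    return ids

# print(tok.decode(sample(model, ids)[0].tolist()))
```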
## What this artifact proves
The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, Gemma 4 alternating attention, and an anneal phase mixing math, code, and prose all remained stable. Loss decreased monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
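The min-singular-value sentinel referenced above can be approximated as a periodic monitoring pass over the Muon-managed weights. A minimal sketch, assuming the check is simply the smallest singular value of each hidden 2D weight compared against a small threshold (the repo's actual sentinel may differ, and the threshold here is a guess):

```python
import torch

@torch.no_grad()
def rank_loss_sentinel(model, threshold=1e-6):
    """Flag hidden 2D weights whose smallest singular value has collapsed
    toward zero, a symptom of rank loss."""
    flagged = []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            s_min = torch.linalg.svdvals(p.float()).min().item()
            if s_min < threshold:
                flagged.append((name, s_min))
    return flagged
```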
## What this artifact cannot do
- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: max sequence 8192 with a 1024-token sliding window means effective context is much shorter for non-global layers.
## Why release it
To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.
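For scale on "crossing the Chinchilla line": under the common heuristic of roughly 20 training tokens per parameter, that threshold for a 412M-parameter model sits around 8B tokens. A back-of-the-envelope sketch under that heuristic, not a published budget for the planned run:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 412e6            # planned follow-up size (MoE; dense-equivalent count assumed)
tokens_per_param = 20
print(f"~{params * tokens_per_param / 1e9:.1f}B tokens")   # ~8.2B
```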
## Notes
As this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.
## License
Apache 2.0. Use freely. Attribution appreciated but not required.
## Citation
```
@misc{shard40mv1,
  author    = {Shane (Crownelius)},
  title     = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```