---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---

# Shard-1

A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.

## Architecture

```
Total params:        54,538,752 (~54.5M)
Hidden dim:          512
Layers:              12
Attention heads:     8 (MHA, no GQA)
Head dim:            64
MLP intermediate:    2048 (SwiGLU)
Vocab size:          8192
Max sequence:        8192
Attention pattern:   Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm:                RMSNorm, pre-norm
Position encoding:   RoPE on Q and K
Embeddings:          tied input/output
Activation:          SwiGLU
MoE:                 none
Engram:              none
```
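
For orientation, here is a hypothetical sketch of the alternating attention masking described above. The layer-parity convention and function names are my assumptions, not the repo's actual code; see `code/model.py` for the real implementation.

```python
import torch

def layer_is_global(layer_idx: int, n_layers: int = 12) -> bool:
    # Assumed alternation: odd-indexed layers are global, even-indexed
    # layers use the sliding window, and the last layer is forced global
    # to match the spec above.
    return layer_idx % 2 == 1 or layer_idx == n_layers - 1

def causal_mask(seq_len: int, window: int = 1024, is_global: bool = False) -> torch.Tensor:
    # True where query position i may attend to key position j.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = j <= i                       # causal: no attending to the future
    if not is_global:
        mask = mask & (i - j < window)  # only the most recent `window` keys
    return mask
```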

## Training

```
Phase 1 (pretrain):
  Compute:           Thunder Compute single GPU
  Steps:             48,220 of a 100,000 step target (paused early)
  Throughput:        86,800 tokens per second
  Optimizer:         Muon for hidden 2D weights, AdamW for embeddings and norms
  LR schedule:       WSD (warmup-stable-decay)
  Stabilizers:       lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
  Compute:           Colab A100
  Steps:             20,000 (full anneal complete)
  Final cross-entropy: 3.27
  Mix:               OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
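
As a rough sketch of two pieces of this recipe: the WSD schedule and the logit-cap/z-loss stabilizers. The warmup and decay boundaries, and the tanh-style soft cap (Gemma-style), are assumptions; the repo's training loop may implement these differently.

```python
import torch
import torch.nn.functional as F

def wsd_lr(step: int, max_lr: float, warmup: int, decay_start: int, total: int) -> float:
    # Warmup-stable-decay: linear ramp, flat plateau, linear decay to zero.
    if step < warmup:
        return max_lr * step / warmup
    if step < decay_start:
        return max_lr
    return max_lr * max(0.0, (total - step) / (total - decay_start))

def stabilized_loss(logits: torch.Tensor, targets: torch.Tensor,
                    cap: float = 30.0, z_coef: float = 1e-4) -> torch.Tensor:
    # Soft logit cap (assumed tanh squashing) bounds lm_head logits to (-cap, cap);
    # z-loss penalizes the squared log-partition so logit magnitudes don't drift.
    logits = cap * torch.tanh(logits / cap)
    ce = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
    z = torch.logsumexp(logits.float(), dim=-1)
    return ce + z_coef * (z ** 2).mean()
```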

## Files

- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)
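
To confirm which file carries optimizer state, something like the following works; beyond the `cfg` and `model` keys used in the loading example below, the exact key names are an assumption.

```python
import torch

ck = torch.load('models/pretrain.pt', map_location='cpu', weights_only=False)
print(sorted(ck.keys()))  # expect 'cfg' and 'model', plus optimizer state here
```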

## How to load

```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

# weights_only=False: the checkpoint may contain a pickled Config object
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):  # greedy decode: 40 new tokens, always the argmax
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```

## Benchmark

Greedy decoding runs at 47 tokens per second on a single CUDA GPU. The model footprint is 109 MB in bf16, with a 16 MB peak in inference memory on top of the weights.

Sampled outputs at temperature 0.7, top_p 0.9:

| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |
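
For reference, a minimal sketch of a temperature/top-p decode loop, reusing `model` and `tok` from the loading example above. The nucleus filter here is a standard implementation, not necessarily the sampler used to produce the table.

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.7, top_p: float = 0.9) -> torch.Tensor:
    # Temperature-scale, then keep the smallest set of tokens whose
    # cumulative probability reaches top_p (nucleus sampling).
    probs = torch.softmax(logits.float() / temperature, dim=-1)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    keep = sorted_p.cumsum(-1) - sorted_p < top_p  # the top token is always kept
    sorted_p = sorted_p * keep
    choice = torch.multinomial(sorted_p / sorted_p.sum(-1, keepdim=True), 1)
    return sorted_idx.gather(-1, choice)

ids = torch.tensor([tok.encode('Once upon a time').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        ids = torch.cat([ids, sample_next(logits[:, -1])], 1)
print(tok.decode(ids[0].tolist()))
```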

## What this artifact proves

The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, the Gemma 4 alternating attention pattern, and the anneal-phase data mix of math, code, and prose were all stable. Loss decreased monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.

## What this artifact cannot do

- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: a max sequence of 8192 with a sliding window of 1024 means effective context is much shorter for the non-global layers.

## Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M-parameter MoE with 3 routed experts, a 262,144-token vocabulary, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.

## Notes

This model was trained by [Crownelius](https://huggingface.co/Crownelius); it does not adhere to the required specifications and therefore cannot be integrated into the inference script.

## License

Apache 2.0. Use freely. Attribution appreciated but not required.

## Citation

```
@misc{shard40mv1,
  author = {Shane (Crownelius)},
  title  = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```