---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- text-generation
- small-models
- pretrain-only
- gemma4
- deepseek-v4
- muon
- wsd
- crowfeather
- compactai
pipeline_tag: text-generation
---

# Crowfeather-50m

A 54.5M-parameter base language model. Pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens) using a Gemma-4-style alternating sliding/global attention transformer with the DeepSeek-V4 Muon optimizer. **No SFT yet** — this is a base LM only.

This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on [@Crownelius's profile](https://huggingface.co/Crownelius).

## Howdy from Shane

Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should — the instance died around step ~18,280, and the latest cleanly saved checkpoint is **step 17,500**. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.

If you want the full backstory and trial-and-error, see the [companion HF post](https://huggingface.co/Crownelius) and the older [`notes-fant3-and-50m-toy-2026-04`](https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04) repo.

## Architecture

A transformer with two ideas pulled directly from April 2026 research:

| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global; last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, `adamw_lr = muon_lr/4`) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 (sketched below) | both |
| embeddings | tied (input and output share weights) | standard |
| activation | SwiGLU MLP | standard |
| positions / norm | RoPE positional embeddings, RMSNorm | standard |

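For concreteness, here is a minimal sketch of the two logit-stability terms from the table above, using the values listed there (cap 30, z-loss coefficient 1e-4). The function names and the exact placement inside the loss are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

SOFT_CAP = 30.0      # Gemma-2-style logit soft-cap
Z_LOSS_COEF = 1e-4   # PaLM-style z-loss weight

def soft_cap(logits: torch.Tensor, cap: float = SOFT_CAP) -> torch.Tensor:
    # Squash logits smoothly into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)

def lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab), targets: (batch, seq)
    logits = soft_cap(logits).float()
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # z-loss penalizes the log of the softmax normalizer, pulling logits toward zero mean.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + Z_LOSS_COEF * (log_z ** 2).mean()
```
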
```
vocab_size = 8192 (BPE on 100k FineWeb-edu docs, deterministic)
dim = 512
n_layers = 12
n_heads = 8
head_dim = 64
mlp_hidden = 2048
max_seq_len = 8192
sliding_window = 1024 (Gemma 4 alternating pattern)

total params = 54,538,752
embedding = 4,194,304 (tied, 7.7%)
attention = 12,582,912 (23.1%)
mlp = 37,748,736 (69.2%)
norms = 12,800 (0.02%)
```

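The parameter breakdown above can be reproduced from the config with a few lines of arithmetic. This sanity check assumes q/k/v/o attention projections, a three-matrix SwiGLU MLP, and two RMSNorms per block plus a final norm, which is what the 12,800 norm count implies:

```python
vocab, dim, n_layers, mlp_hidden = 8192, 512, 12, 2048

emb   = vocab * dim                      # 4,194,304  (tied input/output embedding)
attn  = n_layers * 4 * dim * dim         # 12,582,912 (q, k, v, o projections per layer)
mlp   = n_layers * 3 * dim * mlp_hidden  # 37,748,736 (SwiGLU gate/up/down per layer)
norms = n_layers * 2 * dim + dim         # 12,800     (two RMSNorms per block + final norm)

print(emb + attn + mlp + norms)          # 54,538,752
```
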
## Training

| setting | value |
|---|---|
| pretrain corpus | `HuggingFaceFW/fineweb-edu` (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | **17,500 steps banked** (~2.3B tokens, ~46 tokens per parameter, roughly 2.3× the Chinchilla-optimal ~20 tokens/param for a ~50M model) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective batch of 32 seqs (~131k tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD schedule | warmup 1,500 steps, stable to ~80% of target, linear decay over the last 20% (sketched below) |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25h before the instance terminated |

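For reference, a minimal sketch of how the LR rows above fit together: the WSD multiplier (warmup 1,500 steps, stable to 80% of the 100k-step target, linear decay over the last 20%) and the hybrid split that sends 2-D weight matrices to Muon and everything else to AdamW at `muon_lr/4`. The grouping rule and the `"embed"` name filter are illustrative assumptions, not the repo's actual training code:

```python
import torch

TOTAL_STEPS, WARMUP, DECAY_FRAC = 100_000, 1_500, 0.20
MUON_LR, ADAMW_LR = 2e-3, 2e-3 / 4       # peak LRs; the AdamW group runs at muon_lr/4

def wsd_multiplier(step: int) -> float:
    """Warmup -> Stable -> Decay factor applied to both peak LRs."""
    decay_start = int(TOTAL_STEPS * (1 - DECAY_FRAC))   # step 80,000
    if step < WARMUP:
        return step / WARMUP                             # linear warmup
    if step < decay_start:
        return 1.0                                       # long stable plateau
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - decay_start))  # linear decay to 0

def split_param_groups(model: torch.nn.Module):
    """Illustrative split: 2-D weight matrices go to Muon; embeddings, norms, and
    other 1-D params go to AdamW. The real repo may name and group params differently."""
    muon_params  = [p for n, p in model.named_parameters() if p.ndim == 2 and "embed" not in n]
    adamw_params = [p for n, p in model.named_parameters() if p.ndim != 2 or "embed" in n]
    return muon_params, adamw_params

# Token budget check: 32 seqs/step * 4,096 tokens * 17,500 steps ≈ 2.29B tokens.
```

Since decay would only have started at step 80,000, the banked step-17,500 checkpoint sits entirely on the stable plateau (after the 1,500-step warmup).
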
## Performance — verbatim qualitative outputs

Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.

```
>>> The capital of France is
... said to be Rouen, a city of great historical interest. It has been sited in the palace
of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
colleges and villas, which have housed at least fifty ancient monasteries.

>>> Photosynthesis is the process by which
... a new animal becomes an animal. The species' means of survival, they are undergoing
metabolism and their deaths can be found in several different ways.

>>> Once upon a time, in a small village,
... the children of Mr. Fabien and his brothers had been on each side by nightfall and
were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...

>>> def fibonacci(n):
... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
almost no Python in it, expected failure]

>>> The three most important inventions of the 20th century were
... that of 1864, for which he received a D.S. degree and was awarded the Presidential
Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
```

### Honest read

| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |

This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: **sounds like text, doesn't know much**. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.

## Intended use

Research artifact. Use cases:

- Studying small-LM training dynamics
- Recipe ablations (substitute Muon for AdamW, try different schedulers, etc.)
- Distillation source for even smaller students
- Fine-tuning on narrow domains (would benefit from adding SFT first)

**Not** intended for:

- Production / user-facing applications (factual accuracy too low)
- Chat use (no SFT, no chat template training)
- Code generation (no code in pretrain corpus)

## How to use

```python
import torch
from huggingface_hub import hf_hub_download

# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```

## What's coming

**The Crowfeather series.** Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as `Crowfeather-50m-vN` or with descriptive suffixes. Each release gets a matching post on [@Crownelius](https://huggingface.co/Crownelius).

This first release reflects the partial Thunder run. Next up: SFT on `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`, resuming from the included `step_017500.pt` on Colab.

## Citation

```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```

## Acknowledgments

- [Gemma 4](https://huggingface.co/blog/gemma4) (April 2026) for the alternating sliding/global attention pattern
- [DeepSeek V4](https://mer.vin/2026/04/deepseek-v4-preview-explained-1m-context-architecture-benchmarks-pricing-and-enterprise-adoption-guide/) (April 2026) for the Muon optimizer recipe
- [Keller Jordan's Muon writeups](https://kellerjordan.github.io/posts/muon/) for orthogonalization details
- [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) for the pretrain corpus
- [Thunder Compute](https://www.thundercompute.com) for the A100 hours
- [CompactAI-O](https://huggingface.co/CompactAI-O) for the small-models-as-research-tools ethos

— Shane, April 2026