Crownelius committed · Commit e998825 · verified · 1 Parent(s): 4e6a36f

Initial model card
README.md ADDED
---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- text-generation
- small-models
- pretrain-only
- gemma4
- deepseek-v4
- muon
- wsd
- crowfeather
- compactai
pipeline_tag: text-generation
---

# Crowfeather-50m

A 54.5M-parameter base language model. Pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens) using a Gemma-4-style alternating sliding/global attention transformer with the DeepSeek-V4 Muon optimizer. **No SFT yet** — this is a base LM only.

This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on [@Crownelius's profile](https://huggingface.co/Crownelius).

## Howdy from Shane

Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should — the instance died around step 18,280, and the latest cleanly saved checkpoint is **step 17,500**. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.

If you want the full backstory and trial-and-error, see the [companion HF post](https://huggingface.co/Crownelius) and [the older `notes-fant3-and-50m-toy-2026-04`](https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04) repo.

## Architecture

A transformer with two ideas pulled directly from April 2026 research:

| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global, last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, `adamw_lr = muon_lr/4`) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 | both |
| embeddings | tied (input + output share) | standard |
| activation | SwiGLU MLP | standard |
| position / norm | RoPE positional, RMSNorm | standard |

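To make the attention row concrete, here is a minimal sketch of the layer schedule and the banded causal mask it implies. The alternation phase (which layer indices are sliding vs. global) and whether the window counts the current token are assumptions; the released training code is the source of truth.

```python
# Sketch of the Gemma-4-style alternating sliding/global schedule.
# Assumptions: layers alternate starting from a sliding layer 0, and the
# final layer is always forced to global attention.
import torch

N_LAYERS = 12
WINDOW = 1024

def layer_is_global(layer_idx: int, n_layers: int = N_LAYERS) -> bool:
    """Odd-indexed layers attend globally; the last layer is always global."""
    return layer_idx % 2 == 1 or layer_idx == n_layers - 1

def attention_mask(seq_len: int, is_global: bool, window: int = WINDOW) -> torch.Tensor:
    """Boolean mask where True means the key position may be attended to."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    if is_global:
        return causal
    return causal & (i - j < window)          # keep only the last `window` tokens

# Example: print the per-layer pattern for the 12-layer config below.
print([("global" if layer_is_global(l) else f"sliding({WINDOW})") for l in range(N_LAYERS)])
```
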
```
vocab_size = 8192 (BPE on 100k FineWeb-edu docs, deterministic)
dim = 512
n_layers = 12
n_heads = 8
head_dim = 64
mlp_hidden = 2048
max_seq_len = 8192
sliding_window = 1024 (Gemma 4 alternating pattern)

total params = 54,538,752
embedding = 4,194,304 (tied, 7.7%)
attention = 12,582,912 (23.1%)
mlp = 37,748,736 (69.2%)
norms = 12,800 (0.02%)
```

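The two logit-stability tricks from the table combine in the loss roughly as follows. A minimal sketch: the cap value (30) and z-loss weight (1e-4) are the numbers quoted above, but exactly where the training loop applies them is an assumption.

```python
# Sketch: Gemma-2-style logit soft-cap plus a PaLM-style z-loss term.
import torch
import torch.nn.functional as F

SOFT_CAP = 30.0
Z_LOSS_WEIGHT = 1e-4

def capped_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Soft-cap: squashes logits into (-30, 30) while staying differentiable.
    logits = SOFT_CAP * torch.tanh(logits / SOFT_CAP)
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # z-loss: penalises large log-partition values so logits stay well scaled.
    z = torch.logsumexp(logits, dim=-1)
    return ce + Z_LOSS_WEIGHT * (z ** 2).mean()
```
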
## Training

| setting | value |
|---|---|
| pretrain corpus | `HuggingFaceFW/fineweb-edu` (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | **17,500 steps banked** (~2.3B tokens, ~46 tokens/param, roughly 2.3× the Chinchilla-optimal budget for a 50M model) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective batch 32 (~131k tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD | warmup 1500, stable to ~80%, linear decay over last 20% |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25h before the instance terminated |

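For reference, the WSD row translates to a step-dependent multiplier on the peak LRs. A sketch, assuming linear warmup and a decay that ends at zero (neither is spelled out above); note the run stopped at step 17,500, well inside the stable phase.

```python
# Sketch of the WSD (Warmup -> Stable -> Decay) multiplier used here:
# warmup 1,500 steps, stable until 80% of the 100k-step target, then a
# linear decay over the final 20%. Peak LRs from the table: 2e-3 (Muon),
# 5e-4 (AdamW), where adamw_lr = muon_lr / 4.

def wsd_multiplier(step: int, total_steps: int = 100_000, warmup: int = 1_500,
                   decay_frac: float = 0.20) -> float:
    decay_start = int(total_steps * (1.0 - decay_frac))  # step 80,000
    if step < warmup:
        return step / warmup                              # linear warmup
    if step < decay_start:
        return 1.0                                        # stable plateau
    return max(0.0, (total_steps - step) / (total_steps - decay_start))  # linear decay

muon_lr  = 2e-3 * wsd_multiplier(17_500)   # multiplier is 1.0: run ended in the stable phase
adamw_lr = 5e-4 * wsd_multiplier(17_500)   # embeddings follow the same schedule at lr/4
```
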
## Performance — verbatim qualitative outputs

Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.

```
>>> The capital of France is
... said to be Rouen, a city of great historical interest. It has been sited in the palace
of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
colleges and villas, which have housed at least fifty ancient monasteries.

>>> Photosynthesis is the process by which
... a new animal becomes an animal. The species' means of survival, they are undergoing
metabolism and their deaths can be found in several different ways.

>>> Once upon a time, in a small village,
... the children of Mr. Fabien and his brothers had been on each side by nightfall and
were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...

>>> def fibonacci(n):
... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
almost no Python in it, expected failure]

>>> The three most important inventions of the 20th century were
... that of 1864, for which he received a D.S. degree and was awarded the Presidential
Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
```

### Honest read

| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |

This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: **sounds like text, doesn't know much**. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.

## Intended use

Research artifact. Use cases:

- Studying small-LM training dynamics
- Recipe ablations (substitute Muon for AdamW, try different schedulers, etc.; a parameter-grouping sketch follows at the end of this section)
- Distillation source for even smaller students
- Fine-tuning on narrow domains (would benefit from adding SFT first)

**Not** intended for:

- Production / user-facing applications (factual accuracy too low)
- Chat use (no SFT, no chat template training)
- Code generation (no code in pretrain corpus)

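Since the ablation bullet above suggests swapping optimizers, here is roughly how the hybrid split from the Architecture table looks as PyTorch parameter groups. This is a sketch, not the repo's actual code: the `Muon` import stands in for an external implementation (e.g. Keller Jordan's reference optimizer), its constructor arguments are assumed, and the `"embed"` name check is a guess at how the tied embedding is named.

```python
import torch
# Assumed import: an external Muon implementation such as Keller Jordan's
# reference optimizer. The module path and constructor here are placeholders.
from muon import Muon

def build_optimizers(model: torch.nn.Module, muon_lr: float = 2e-3):
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hidden 2D weight matrices go to Muon; the (tied) embedding and all
        # 1D parameters (RMSNorm gains) go to AdamW at muon_lr / 4.
        if p.ndim == 2 and "embed" not in name:   # "embed" naming is an assumption
            muon_params.append(p)
        else:
            adamw_params.append(p)
    muon = Muon(muon_params, lr=muon_lr)                      # assumed signature
    adamw = torch.optim.AdamW(adamw_params, lr=muon_lr / 4)   # 2e-3 / 4 = 5e-4
    return muon, adamw
```
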
## How to use

```python
import torch
from huggingface_hub import hf_hub_download
# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```

## What's coming

**The Crowfeather series.** Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as `Crowfeather-50m-vN` or with descriptive suffixes. Each release gets a matching post on [@Crownelius](https://huggingface.co/Crownelius).

This first release reflects the partial Thunder run. Next up: SFT on `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`, resuming from the included `step_017500.pt` on Colab.

## Citation

```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```

## Acknowledgments

- [Gemma 4](https://huggingface.co/blog/gemma4) (April 2026) for the alternating sliding/global attention pattern
- [DeepSeek V4](https://mer.vin/2026/04/deepseek-v4-preview-explained-1m-context-architecture-benchmarks-pricing-and-enterprise-adoption-guide/) (April 2026) for the Muon optimizer recipe
- [Keller Jordan's Muon writeups](https://kellerjordan.github.io/posts/muon/) for orthogonalization details
- [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) for the pretrain corpus
- [Thunder Compute](https://www.thundercompute.com) for the A100 hours
- [CompactAI-O](https://huggingface.co/CompactAI-O) for the small-models-as-research-tools ethos

— Shane, April 2026