| ---
|
| license: apache-2.0
|
| language:
|
| - en
|
| library_name: pytorch
|
| tags:
|
| - text-generation
|
| - small-models
|
| - pretrain-only
|
| - gemma4
|
| - deepseek-v4
|
| - muon
|
| - wsd
|
| - crowfeather
|
| - compactai
|
| pipeline_tag: text-generation
|
| ---
|
|
|
| # Crowfeather-50m
|
|
|
A 54.5M-parameter base language model, pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens). The architecture is a Gemma-4-style transformer with alternating sliding/global attention; optimization follows the DeepSeek-V4-style hybrid Muon recipe. **No SFT yet**: this is a base LM only.
|
|
|
| This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on [@Crownelius's profile](https://huggingface.co/Crownelius).
|
|
|
| ## Howdy from Shane
|
|
|
Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should: the instance died around step 18,280, and the latest cleanly saved checkpoint is **step 17,500**. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.
|
|
|
| If you want the full backstory and trial-and-error, see the [companion HF post](https://huggingface.co/Crownelius) and [the older `notes-fant3-and-50m-toy-2026-04`](https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04) repo.
|
|
|
| ## Architecture
|
|
|
A decoder-only transformer built around two ideas pulled directly from April 2026 releases (Gemma 4's attention pattern, DeepSeek V4's optimizer recipe), plus standard small-LM choices:
|
|
|
| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global, last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, `adamw_lr = muon_lr/4`) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 | both |
| embeddings | tied (input and output share weights) | standard |
| activation | SwiGLU MLP | standard |
| positions / norm | RoPE positions, RMSNorm | standard |
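
To make the table concrete, here is a minimal sketch of the layer pattern and the logit-stability pieces. It is illustrative only: the function names and the choice of which parity gets global attention are assumptions, not the released training code (that ships as `toy_50m_code.tar.gz`).

```python
import torch
import torch.nn.functional as F

def is_global_layer(layer_idx: int, n_layers: int = 12) -> bool:
    """Gemma-4-style pattern: alternate sliding/global; the last layer is always global.
    Which parity is global is an assumption here, not read from the released code."""
    return layer_idx % 2 == 1 or layer_idx == n_layers - 1

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Gemma-2-style logit soft-cap: squashes logits smoothly into (-cap, cap)."""
    return cap * torch.tanh(logits / cap)

def z_loss(logits: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    """PaLM-style auxiliary loss penalizing the squared log-partition (logsumexp)."""
    return weight * torch.logsumexp(logits, dim=-1).pow(2).mean()

# Putting it together for one batch of final-layer logits:
logits = soft_cap(torch.randn(2, 16, 8192))            # (batch, seq, vocab)
targets = torch.randint(0, 8192, (2, 16))
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten()) + z_loss(logits)
```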
|
|
|
| ```
|
| vocab_size = 8192 (BPE on 100k FineWeb-edu docs, deterministic)
|
| dim = 512
|
| n_layers = 12
|
| n_heads = 8
|
| head_dim = 64
|
| mlp_hidden = 2048
|
| max_seq_len = 8192
|
| sliding_window = 1024 (Gemma 4 alternating pattern)
|
|
|
| total params = 54,538,752
|
| embedding = 4,194,304 (tied, 7.7%)
|
| attention = 12,582,912 (23.1%)
|
| mlp = 37,748,736 (69.2%)
|
| norms = 12,800 (0.02%)
|
| ```
|
|
|
| ## Training
|
|
|
| | |
|---|---|
| pretrain corpus | `HuggingFaceFW/fineweb-edu` (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | **17,500 steps banked** (~2.3B tokens; ~46 tokens per parameter, roughly 2× the Chinchilla-optimal ~20 tokens/param) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective batch 32 (~131k tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD | warmup 1,500 steps, stable to ~80% of the 100k target, linear decay over the last 20% |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25 h before the instance terminated |
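
A minimal sketch of the WSD schedule as described above (warmup 1,500 steps, hold at peak, linear decay over the final 20% of the planned 100k steps); the function name and the decay floor of zero are illustrative assumptions.

```python
def wsd_lr(step: int, peak_lr: float = 2e-3, total_steps: int = 100_000,
           warmup_steps: int = 1_500, decay_frac: float = 0.2, min_lr: float = 0.0) -> float:
    """Warmup -> Stable -> Decay learning-rate schedule."""
    decay_start = int(total_steps * (1.0 - decay_frac))   # last 20% is linear decay
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warmup
    if step < decay_start:
        return peak_lr                                      # stable plateau
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * min(frac, 1.0)    # linear decay
```

Note that step 17,500 of a 100k-step plan is still well inside the stable phase (decay would have started at step 80,000), so the released checkpoint never saw any learning-rate decay.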
|
|
|
| ## Performance — verbatim qualitative outputs
|
|
|
| Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.
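
Those defaults correspond roughly to the following per-token sampling step. This is an illustrative re-implementation, not the code inside `ToyLM.generate`; the exact repetition-penalty formulation in the released code may differ.

```python
import torch

def sample_next(logits: torch.Tensor, prev_ids: torch.Tensor,
                temperature: float = 0.8, top_p: float = 0.9,
                rep_penalty: float = 1.3) -> int:
    """One sampling step: repetition penalty, then temperature, then top-p (nucleus) filtering."""
    logits = logits.clone()
    # Repetition penalty: discourage tokens that already appear in the context.
    for t in set(prev_ids.tolist()):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    probs = torch.softmax(logits / temperature, dim=-1)
    # Nucleus filtering: keep the smallest prefix of sorted tokens covering top_p probability mass.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int(torch.searchsorted(torch.cumsum(sorted_probs, dim=-1), torch.tensor(top_p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(sorted_idx[torch.multinomial(kept, 1)].item())
```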
|
|
|
| ```
|
| >>> The capital of France is
|
| ... said to be Rouen, a city of great historical interest. It has been sited in the palace
|
| of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
|
| colleges and villas, which have housed at least fifty ancient monasteries.
|
|
|
| >>> Photosynthesis is the process by which
|
| ... a new animal becomes an animal. The species' means of survival, they are undergoing
|
| metabolism and their deaths can be found in several different ways.
|
|
|
| >>> Once upon a time, in a small village,
|
| ... the children of Mr. Fabien and his brothers had been on each side by nightfall and
|
| were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...
|
|
|
| >>> def fibonacci(n):
|
| ... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
|
| almost no Python in it, expected failure]
|
|
|
| >>> The three most important inventions of the 20th century were
|
| ... that of 1864, for which he received a D.S. degree and was awarded the Presidential
|
| Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
|
| ```
|
|
|
| ### Honest read
|
|
|
| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |
|
|
|
| This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: **sounds like text, doesn't know much**. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.
|
|
|
| ## Intended use
|
|
|
| Research artifact. Use cases:
|
|
|
| - Studying small-LM training dynamics
|
- Recipe ablations (e.g. swap Muon out for plain AdamW, try different LR schedules)
|
| - Distillation source for even smaller students
|
| - Fine-tuning on narrow domains (would benefit from adding SFT first)
|
|
|
| **Not** intended for:
|
|
|
| - Production / user-facing applications (factual accuracy too low)
|
| - Chat use (no SFT, no chat template training)
|
| - Code generation (no code in pretrain corpus)
|
|
|
| ## How to use
|
|
|
```python
import torch
from huggingface_hub import hf_hub_download

# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once the code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
# The checkpoint bundles the config dict alongside the weights, hence weights_only=False.
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```
|
|
|
| ## What's coming
|
|
|
| **The Crowfeather series.** Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as `Crowfeather-50m-vN` or with descriptive suffixes. Each release gets a matching post on [@Crownelius](https://huggingface.co/Crownelius).
|
|
|
| This first release reflects the partial Thunder run. Next up: SFT on `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`, resuming from the included `step_017500.pt` on Colab.
|
|
|
| ## Citation
|
|
|
```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```
|
|
|
| ## Acknowledgments
|
|
|
| - [Gemma 4](https://huggingface.co/blog/gemma4) (April 2026) for the alternating sliding/global attention pattern
|
| - [DeepSeek V4](https://mer.vin/2026/04/deepseek-v4-preview-explained-1m-context-architecture-benchmarks-pricing-and-enterprise-adoption-guide/) (April 2026) for the Muon optimizer recipe
|
| - [Keller Jordan's Muon writeups](https://kellerjordan.github.io/posts/muon/) for orthogonalization details
|
| - [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) for the pretrain corpus
|
| - [Thunder Compute](https://www.thundercompute.com) for the A100 hours
|
| - [CompactAI-O](https://huggingface.co/CompactAI-O) for the small-models-as-research-tools ethos
|
|
|
| — Shane, April 2026
|
|
|