---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- text-generation
- small-models
- pretrain-only
- gemma4
- deepseek-v4
- muon
- wsd
- crowfeather
- compactai
pipeline_tag: text-generation
---

# Crowfeather-50m

A 54.5M-parameter base language model. Pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens) using a Gemma-4-style alternating sliding/global attention transformer with the DeepSeek-V4 Muon optimizer. **No SFT yet** — this is a base LM only.

This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on [@Crownelius's profile](https://huggingface.co/Crownelius).

## Howdy from Shane

Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should — the instance died around step ~18,280, and the latest cleanly saved checkpoint is **step 17,500**. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.

If you want the full backstory and trial-and-error, see the [companion HF post](https://huggingface.co/Crownelius) and the older [`notes-fant3-and-50m-toy-2026-04`](https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04) repo.

## Architecture

A transformer with two ideas pulled directly from April 2026 research:

| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global; last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, `adamw_lr = muon_lr/4`) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 (sketched below) | both |
| embeddings | tied (input and output share weights) | standard |
| activation | SwiGLU MLP | standard |
| positions / norm | RoPE positional embeddings, RMSNorm | standard |

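For concreteness, here is a minimal sketch of the two logit-stability terms from the table above, using the values listed there (cap 30, z-loss coefficient 1e-4). The function names and the exact placement inside the loss are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

SOFT_CAP = 30.0      # Gemma-2-style logit soft-cap
Z_LOSS_COEF = 1e-4   # PaLM-style z-loss weight

def soft_cap(logits: torch.Tensor, cap: float = SOFT_CAP) -> torch.Tensor:
    # Squash logits smoothly into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)

def lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab), targets: (batch, seq)
    logits = soft_cap(logits).float()
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # z-loss penalizes the log of the softmax normalizer, pulling logits toward zero mean.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + Z_LOSS_COEF * (log_z ** 2).mean()
```
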
```
vocab_size = 8192 (BPE on 100k FineWeb-edu docs, deterministic)
dim = 512
n_layers = 12
n_heads = 8
head_dim = 64
mlp_hidden = 2048
max_seq_len = 8192
sliding_window = 1024 (Gemma 4 alternating pattern)

total params = 54,538,752
embedding = 4,194,304 (tied, 7.7%)
attention = 12,582,912 (23.1%)
mlp = 37,748,736 (69.2%)
norms = 12,800 (0.02%)
```

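The parameter breakdown above can be reproduced from the config with a few lines of arithmetic. This sanity check assumes q/k/v/o attention projections, a three-matrix SwiGLU MLP, and two RMSNorms per block plus a final norm, which is what the 12,800 norm count implies:

```python
vocab, dim, n_layers, mlp_hidden = 8192, 512, 12, 2048

emb   = vocab * dim                      # 4,194,304  (tied input/output embedding)
attn  = n_layers * 4 * dim * dim         # 12,582,912 (q, k, v, o projections per layer)
mlp   = n_layers * 3 * dim * mlp_hidden  # 37,748,736 (SwiGLU gate/up/down per layer)
norms = n_layers * 2 * dim + dim         # 12,800     (two RMSNorms per block + final norm)

print(emb + attn + mlp + norms)          # 54,538,752
```
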
## Training

| setting | value |
|---|---|
| pretrain corpus | `HuggingFaceFW/fineweb-edu` (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | **17,500 steps banked** (~2.3B tokens, ~46 tokens per parameter, roughly 2.3× the Chinchilla-optimal ~20 tokens/param for a ~50M model) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective batch of 32 seqs (~131k tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD schedule | warmup 1,500 steps, stable to ~80% of target, linear decay over the last 20% (sketched below) |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25h before the instance terminated |

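For reference, a minimal sketch of how the LR rows above fit together: the WSD multiplier (warmup 1,500 steps, stable to 80% of the 100k-step target, linear decay over the last 20%) and the hybrid split that sends 2-D weight matrices to Muon and everything else to AdamW at `muon_lr/4`. The grouping rule and the `"embed"` name filter are illustrative assumptions, not the repo's actual training code:

```python
import torch

TOTAL_STEPS, WARMUP, DECAY_FRAC = 100_000, 1_500, 0.20
MUON_LR, ADAMW_LR = 2e-3, 2e-3 / 4       # peak LRs; the AdamW group runs at muon_lr/4

def wsd_multiplier(step: int) -> float:
    """Warmup -> Stable -> Decay factor applied to both peak LRs."""
    decay_start = int(TOTAL_STEPS * (1 - DECAY_FRAC))   # step 80,000
    if step < WARMUP:
        return step / WARMUP                             # linear warmup
    if step < decay_start:
        return 1.0                                       # long stable plateau
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - decay_start))  # linear decay to 0

def split_param_groups(model: torch.nn.Module):
    """Illustrative split: 2-D weight matrices go to Muon; embeddings, norms, and
    other 1-D params go to AdamW. The real repo may name and group params differently."""
    muon_params  = [p for n, p in model.named_parameters() if p.ndim == 2 and "embed" not in n]
    adamw_params = [p for n, p in model.named_parameters() if p.ndim != 2 or "embed" in n]
    return muon_params, adamw_params

# Token budget check: 32 seqs/step * 4,096 tokens * 17,500 steps ≈ 2.29B tokens.
```

Since decay would only have started at step 80,000, the banked step-17,500 checkpoint sits entirely on the stable plateau (after the 1,500-step warmup).
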
## Performance — verbatim qualitative outputs

Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.

```
>>> The capital of France is
... said to be Rouen, a city of great historical interest. It has been sited in the palace
of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
colleges and villas, which have housed at least fifty ancient monasteries.

>>> Photosynthesis is the process by which
... a new animal becomes an animal. The species' means of survival, they are undergoing
metabolism and their deaths can be found in several different ways.

>>> Once upon a time, in a small village,
... the children of Mr. Fabien and his brothers had been on each side by nightfall and
were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...

>>> def fibonacci(n):
... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
almost no Python in it, expected failure]

>>> The three most important inventions of the 20th century were
... that of 1864, for which he received a D.S. degree and was awarded the Presidential
Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
```

### Honest read

| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |

This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: **sounds like text, doesn't know much**. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.

## Intended use

Research artifact. Use cases:

- Studying small-LM training dynamics
- Recipe ablations (substitute Muon for AdamW, try different schedulers, etc.)
- Distillation source for even smaller students
- Fine-tuning on narrow domains (would benefit from adding SFT first)

**Not** intended for:

- Production / user-facing applications (factual accuracy too low)
- Chat use (no SFT, no chat template training)
- Code generation (no code in pretrain corpus)

## How to use

```python
import torch
from huggingface_hub import hf_hub_download

# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```

## What's coming

**The Crowfeather series.** Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as `Crowfeather-50m-vN` or with descriptive suffixes. Each release gets a matching post on [@Crownelius](https://huggingface.co/Crownelius).

This first release reflects the partial Thunder run. Next up: SFT on `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`, resuming from the included `step_017500.pt` on Colab.

## Citation

```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```

## Acknowledgments

- [Gemma 4](https://huggingface.co/blog/gemma4) (April 2026) for the alternating sliding/global attention pattern
- [DeepSeek V4](https://mer.vin/2026/04/deepseek-v4-preview-explained-1m-context-architecture-benchmarks-pricing-and-enterprise-adoption-guide/) (April 2026) for the Muon optimizer recipe
- [Keller Jordan's Muon writeups](https://kellerjordan.github.io/posts/muon/) for orthogonalization details
- [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) for the pretrain corpus
- [Thunder Compute](https://www.thundercompute.com) for the A100 hours
- [CompactAI-O](https://huggingface.co/CompactAI-O) for the small-models-as-research-tools ethos

— Shane, April 2026