| ---
|
| license: apache-2.0
|
| language:
|
| - en
|
| library_name: pytorch
|
| tags:
|
| - text-generation
|
| - small-models
|
| - pretrain-only
|
| - gemma4
|
| - deepseek-v4
|
| - muon
|
| - wsd
|
| - crowfeather
|
| - compactai
|
| pipeline_tag: text-generation
|
| ---
|
|
|
| # Crowfeather-50m
|
|
|
A 54.5M-parameter base language model, pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens). The architecture is a Gemma-4-style transformer with alternating sliding/global attention; optimization follows the DeepSeek-V4-style hybrid Muon recipe. **No SFT yet**: this is a base LM only.
|
|
|
| This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on [@Crownelius's profile](https://huggingface.co/Crownelius).
|
|
|
| ## Howdy from Shane
|
|
|
Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should: the instance died around step 18,280, and the latest cleanly saved checkpoint is **step 17,500**. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.
|
|
|
| If you want the full backstory and trial-and-error, see the [companion HF post](https://huggingface.co/Crownelius) and [the older `notes-fant3-and-50m-toy-2026-04`](https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04) repo.
|
|
|
| ## Architecture
|
|
|
A decoder-only transformer built around two ideas pulled directly from April 2026 releases (Gemma 4's attention pattern, DeepSeek V4's optimizer recipe), plus standard small-LM choices:
|
|
|
| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global, last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, `adamw_lr = muon_lr/4`) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 | both |
| embeddings | tied (input and output share weights) | standard |
| activation | SwiGLU MLP | standard |
| positions / norm | RoPE positions, RMSNorm | standard |
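
To make the table concrete, here is a minimal sketch of the layer pattern and the logit-stability pieces. It is illustrative only: the function names and the choice of which parity gets global attention are assumptions, not the released training code (that ships as `toy_50m_code.tar.gz`).

```python
import torch
import torch.nn.functional as F

def is_global_layer(layer_idx: int, n_layers: int = 12) -> bool:
    """Gemma-4-style pattern: alternate sliding/global; the last layer is always global.
    Which parity is global is an assumption here, not read from the released code."""
    return layer_idx % 2 == 1 or layer_idx == n_layers - 1

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Gemma-2-style logit soft-cap: squashes logits smoothly into (-cap, cap)."""
    return cap * torch.tanh(logits / cap)

def z_loss(logits: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    """PaLM-style auxiliary loss penalizing the squared log-partition (logsumexp)."""
    return weight * torch.logsumexp(logits, dim=-1).pow(2).mean()

# Putting it together for one batch of final-layer logits:
logits = soft_cap(torch.randn(2, 16, 8192))            # (batch, seq, vocab)
targets = torch.randint(0, 8192, (2, 16))
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten()) + z_loss(logits)
```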
|
|
|
| ```
|
| vocab_size = 8192 (BPE on 100k FineWeb-edu docs, deterministic)
|
| dim = 512
|
| n_layers = 12
|
| n_heads = 8
|
| head_dim = 64
|
| mlp_hidden = 2048
|
| max_seq_len = 8192
|
| sliding_window = 1024 (Gemma 4 alternating pattern)
|
|
|
| total params = 54,538,752
|
| embedding = 4,194,304 (tied, 7.7%)
|
| attention = 12,582,912 (23.1%)
|
| mlp = 37,748,736 (69.2%)
|
| norms = 12,800 (0.02%)
|
| ```
|
|
|
| ## Training
|
|
|
| | |
|---|---|
| pretrain corpus | `HuggingFaceFW/fineweb-edu` (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | **17,500 steps banked** (~2.3B tokens; ~46 tokens per parameter, roughly 2× the Chinchilla-optimal ~20 tokens/param) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective batch 32 (~131k tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD | warmup 1,500 steps, stable to ~80% of the 100k target, linear decay over the last 20% |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25 h before the instance terminated |
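
A minimal sketch of the WSD schedule as described above (warmup 1,500 steps, hold at peak, linear decay over the final 20% of the planned 100k steps); the function name and the decay floor of zero are illustrative assumptions.

```python
def wsd_lr(step: int, peak_lr: float = 2e-3, total_steps: int = 100_000,
           warmup_steps: int = 1_500, decay_frac: float = 0.2, min_lr: float = 0.0) -> float:
    """Warmup -> Stable -> Decay learning-rate schedule."""
    decay_start = int(total_steps * (1.0 - decay_frac))   # last 20% is linear decay
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warmup
    if step < decay_start:
        return peak_lr                                      # stable plateau
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * min(frac, 1.0)    # linear decay
```

Note that step 17,500 of a 100k-step plan is still well inside the stable phase (decay would have started at step 80,000), so the released checkpoint never saw any learning-rate decay.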
|
|
|
| ## Performance — verbatim qualitative outputs
|
|
|
| Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.
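
Those defaults correspond roughly to the following per-token sampling step. This is an illustrative re-implementation, not the code inside `ToyLM.generate`; the exact repetition-penalty formulation in the released code may differ.

```python
import torch

def sample_next(logits: torch.Tensor, prev_ids: torch.Tensor,
                temperature: float = 0.8, top_p: float = 0.9,
                rep_penalty: float = 1.3) -> int:
    """One sampling step: repetition penalty, then temperature, then top-p (nucleus) filtering."""
    logits = logits.clone()
    # Repetition penalty: discourage tokens that already appear in the context.
    for t in set(prev_ids.tolist()):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    probs = torch.softmax(logits / temperature, dim=-1)
    # Nucleus filtering: keep the smallest prefix of sorted tokens covering top_p probability mass.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int(torch.searchsorted(torch.cumsum(sorted_probs, dim=-1), torch.tensor(top_p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(sorted_idx[torch.multinomial(kept, 1)].item())
```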
|
|
|
| ```
|
| >>> The capital of France is
|
| ... said to be Rouen, a city of great historical interest. It has been sited in the palace
|
| of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
|
| colleges and villas, which have housed at least fifty ancient monasteries.
|
|
|
| >>> Photosynthesis is the process by which
|
| ... a new animal becomes an animal. The species' means of survival, they are undergoing
|
| metabolism and their deaths can be found in several different ways.
|
|
|
| >>> Once upon a time, in a small village,
|
| ... the children of Mr. Fabien and his brothers had been on each side by nightfall and
|
| were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...
|
|
|
| >>> def fibonacci(n):
|
| ... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
|
| almost no Python in it, expected failure]
|
|
|
| >>> The three most important inventions of the 20th century were
|
| ... that of 1864, for which he received a D.S. degree and was awarded the Presidential
|
| Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
|
| ```
|
|
|
| ### Honest read
|
|
|
| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |
|
|
|
| This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: **sounds like text, doesn't know much**. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.
|
|
|
| ## Intended use
|
|
|
| Research artifact. Use cases:
|
|
|
| - Studying small-LM training dynamics
|
- Recipe ablations (e.g. swap Muon out for plain AdamW, try different LR schedules)
|
| - Distillation source for even smaller students
|
| - Fine-tuning on narrow domains (would benefit from adding SFT first)
|
|
|
| **Not** intended for:
|
|
|
| - Production / user-facing applications (factual accuracy too low)
|
| - Chat use (no SFT, no chat template training)
|
| - Code generation (no code in pretrain corpus)
|
|
|
| ## How to use
|
|
|
```python
import torch
from huggingface_hub import hf_hub_download

# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once the code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
# The checkpoint bundles the config dict alongside the weights, hence weights_only=False.
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```
|
|
|
| ## What's coming
|
|
|
| **The Crowfeather series.** Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as `Crowfeather-50m-vN` or with descriptive suffixes. Each release gets a matching post on [@Crownelius](https://huggingface.co/Crownelius).
|
|
|
| This first release reflects the partial Thunder run. Next up: SFT on `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`, resuming from the included `step_017500.pt` on Colab.
|
|
|
| ## Citation
|
|
|
```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```
|
|
|
| ## Acknowledgments
|
|
|
| - [Gemma 4](https://huggingface.co/blog/gemma4) (April 2026) for the alternating sliding/global attention pattern
|
| - [DeepSeek V4](https://mer.vin/2026/04/deepseek-v4-preview-explained-1m-context-architecture-benchmarks-pricing-and-enterprise-adoption-guide/) (April 2026) for the Muon optimizer recipe
|
| - [Keller Jordan's Muon writeups](https://kellerjordan.github.io/posts/muon/) for orthogonalization details
|
| - [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) for the pretrain corpus
|
| - [Thunder Compute](https://www.thundercompute.com) for the A100 hours
|
| - [CompactAI-O](https://huggingface.co/CompactAI-O) for the small-models-as-research-tools ethos
|
|
|
| — Shane, April 2026
|
|
|