The Death of the Generalist and Rise of the Swarm
The Eve-2 family of models are 272M-parameter Mixture-of-Experts models built for speed and deployment in compute-poor environments. They can run on a Raspberry Pi and be fine-tuned for specialist work in 30 minutes on consumer hardware.
Scaling AI is just like scaling your other cloud costs, and it begins with right-sizing. You can, but should not, throw expensive large language models at every problem. You want to find the cheapest, most efficient tool for the job. That's where the innovation and explosive growth of AI meet the realities of production: scalability, reliability, and rapid development cycles.
Anthony Maio | Making Minds AI
I spent the last week building a language model that most people in AI would consider small. 272 million parameters. Not billions, millions. A model so small that it was fine-tuned to compute exponential moving averages and learn my day-to-day weight fluctuations to act as a fitness coach - in a cell phone app. Eve-2-HappyScale is currently running on an old Pixel 8, flawlessly telling me when I'll reach my goal weight.
So what? What can I do with a 272M parameter model? I have trained Eve 2 nano-agents to write SQL from natural language, generate conventional Git commits from raw diffs, redact PII, extract structured JSON from messy text, identify user intent in mobile apps, lint protocol messages, and more. It runs on a Raspberry Pi. And it costs about $0.02 per million tokens (in electricity) to operate on a self-hosted CPU.
The base model is Eve-2-MoE-IT-272M, and it's the foundation of what I've been calling the Eve Swarm -- a collection of hyper-specialized nano-models, each fine-tuned to do one job well.
Everything is open-source on Hugging Face. This post is the technical deep dive to explain why the approach matters more than the model itself... or perhaps more accurately: the methods make the model.
The Future of Language Models is Small
Have you asked yourself, when you use AI tools, whether you're watering your lawn with a firehose? If you work in platforms, let's talk EC2: would you spin up every box you need as an X1 with 2TB of RAM and 128 vCPUs? Well, at some of my former employers, you just might, but I'm assuming you prefer not to fix problems by throwing 10x the necessary compute at them to cover for engineering inadequacies, so let's spend responsibly:
Most enterprise AI stacks route every task through the same massive generalist. Need to format some JSON? GPT-5. Write a commit message? OpenAI. Classify an intent? OpenAI.
You're paying frontier generalist prices (~$21/million tokens for GPT-5) for tasks that are fundamentally deterministic. These aren't open-ended reasoning problems. They're transformations with predictable output structure. A small, purpose-trained model should handle them fine.
The question was whether you could build something small enough to run on commodity hardware but with enough internal structure to specialize across multiple task types without retraining the whole thing each time.
Architecture: MoE at nano scale
A dense transformer activates every parameter for every token. A 272M dense model gives you 272M parameters of compute per forward pass, and that's all you get.
Eve 2 uses a Mixture-of-Experts architecture derived from DeepSeek-V3, scaled down to what I'm calling the "Nano" range. The core idea: have more total parameters than you activate at inference time, so the model can partition knowledge across specialized subnetworks.
Here's the full spec:

| Spec | Value |
|---|---|
| Total parameters | 272M |
| Active parameters per token | ~80M (top-2 routing) |
| Transformer blocks | 12 |
| Hidden dimension | 512 |
| Attention heads | 8 (64-dim each) |
| Routed experts | 8 |
| Shared experts | 1 (always active) |
| Expert FFN dim | 1408 (SwiGLU activation) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Normalization | RMSNorm |
| Context window | 2048 tokens |
| Vocabulary | 50,304 (GPT-2 tokenizer, padded) |
| Precision | BFloat16 native |
| Weight tying | Embeddings tied with LM head |
How Mixture of Experts Works Here:
For each token, a learned gating network selects 2 of the 8 routed experts. The shared expert fires on every token regardless. So inference cost is roughly equivalent to an 80M dense model, but the full 272M parameter budget is available for the model to distribute knowledge across.
During fine-tuning, this matters. Different experts develop narrow competencies -- one handles code syntax patterns, another specializes in structured output formatting -- while the shared expert retains common patterns that appear across all tasks. You get specialization capacity that a dense model at the same inference cost simply doesn't have.
The routing includes a load-balancing auxiliary loss to prevent expert collapse. This was important. Early training runs without it had 3 of the 8 experts handling 80%+ of all tokens while the rest sat idle.
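The routing and auxiliary loss described above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the actual Eve-2 code: the dimensions follow the spec table, and the auxiliary loss shown is the common Switch-Transformer-style formulation (fraction of tokens per expert times mean gate probability), which is one reasonable reading of "load-balancing auxiliary loss":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Toy sketch: a gate picks 2 of 8 routed experts per token, one shared
    expert always fires, and an auxiliary loss discourages expert collapse."""
    def __init__(self, d_model=512, d_ffn=1408, n_routed=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(),
                                    nn.Linear(d_ffn, d_model))
        self.experts = nn.ModuleList(ffn() for _ in range(n_routed))
        self.shared = ffn()                     # always-active shared expert
        self.top_k, self.n_routed = top_k, n_routed

    def forward(self, x):                       # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, n_routed)
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-2 per token
        weights = weights / weights.sum(-1, keepdim=True)
        out = self.shared(x)                    # shared expert on every token
        for k in range(self.top_k):
            for e in range(self.n_routed):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        # Load balancing: fraction of token-slots per expert x mean gate prob.
        frac = F.one_hot(idx, self.n_routed).float().mean(dim=(0, 1)) * self.top_k
        aux = self.n_routed * (frac * probs.mean(0)).sum()
        return out, aux
```

During training, `aux` would be added to the language-modeling loss with a small coefficient so no subset of experts absorbs most of the traffic.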
Training the base: 10.5 billion tokens in 2.5 hours
I trained the base model (Eve-2-MoE-272M) from scratch on 8x NVIDIA H200 SXM GPUs using PyTorch DDP. The dataset was FineWeb-Edu (Sample-10BT). Total wall time: about 2.5 hours at roughly 1.26 million tokens per second.
Pre-training config:
| Setting | Value |
|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| Schedule | Cosine decay, 200-step linear warmup |
| Peak learning rate | 5e-4, decaying to 5e-5 |
| Batch size | 128 x 2048 tokens (16 per GPU x 8 GPUs) |
| Gradient clipping | 1.0 |
| Total steps | 40,000 |
| Total tokens | ~10.5B |
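The schedule rows translate to a simple closed form. The `lr_at` helper below is my own name for illustration, not something from the training scripts, but it matches the numbers in the table (200-step linear warmup to 5e-4, cosine decay down to 5e-5 at step 40,000):

```python
import math

def lr_at(step, peak=5e-4, floor=5e-5, warmup=200, total=40_000):
    """Learning rate for a given step under linear warmup + cosine decay."""
    if step < warmup:
        return peak * step / warmup            # linear ramp from 0 to peak
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```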
The convergence curve was clean:
| Step | Tokens seen | Train loss | Val loss (WikiText-2) |
|---|---|---|---|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 40,000 | 10.5B | 2.78 | 3.40 |
Final WikiText-2 perplexity: ~30. The train/val gap of 0.62 at convergence tells me the model could absorb more data diversity -- FineWeb-Edu alone doesn't give it everything it needs for downstream generalization. But as a fine-tuning base, the loss curve is stable and the model produces coherent English. I'm continuously pre-training this model with the goal of ingesting a full 10B tokens of high quality curated data over the next 2 weeks.
(You don't need H200s for this. They're absurdly overkill... just like me.)
Instruction tuning: where LoRA failed
A base model generates coherent text but doesn't follow instructions. To get there, I fine-tuned on mlabonne/open-perfectblend (Shoutout to Maxime Labonne from Liquid AI) -- roughly 1.2 million instruction-response pairs. This produces Eve-2-MoE-IT-272M.
I initially tried LoRA (Low-Rank Adaptation). It didn't work.
Here's why. LoRA adds low-rank update matrices that approximate weight changes without modifying the original parameters. This works well on 7B+ models where weight matrices are large and have room for low-rank decomposition to capture meaningful structure. At 272M parameters, the matrices are already small. The low-rank approximation doesn't have enough dimensions to represent instruction-following behavior. The model would learn surface patterns -- starting every response with "Sure!" -- but couldn't reliably follow actual instructions.
I switched to full fine-tuning. Unfroze every parameter across all experts.
IT training config:
| Setting | Value |
|---|---|
| Method | Full fine-tuning (no PEFT/LoRA) |
| Precision | BFloat16 |
| Batch size | 128 (global) |
| Learning rate | 5e-5 (cosine schedule) |
| Collator | DataCollatorForCompletionOnlyLM |
That collator choice matters more than it sounds. It masks the user prompt during loss computation, so the model only learns to produce the assistant's response. Without it, the model wastes capacity learning to reproduce "User: write me a SQL query for..." instead of learning to write the query. At 272M parameters, you can't afford to spend capacity on anything that isn't the actual task.
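A minimal sketch of what that masking amounts to: labels are a copy of the input ids, with every prompt position set to -100, the index that Hugging Face cross-entropy losses ignore. The helper below is illustrative, not the collator's actual implementation:

```python
def completion_only_labels(input_ids, response_start, ignore_index=-100):
    """Mask prompt tokens so loss is computed only on the response.
    response_start is the index of the first assistant token."""
    return [ignore_index] * response_start + input_ids[response_start:]
```

Positions labeled -100 contribute zero gradient, so all of the model's capacity goes into learning the response side of each pair.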
Training a model this small to be "smart" is honestly harder than training a 7B model. There's no room for noise. The embedding space is tight enough that a bad hyperparameter choice or a noisy dataset doesn't just slow convergence -- it ruins the model. Every decision about data, masking, and learning rate schedule has outsized impact.
The swarm: one base, eight specialists
This is the part that's fundamental to small-model swarms:
I don't deploy Eve-2-MoE-IT directly in production. I treat it as a starting point -- a model that already understands language, follows instructions, and has a well-organized expert structure. Then I clone it and fine-tune each copy on a single narrow task using full fine-tuning.
Because the base is already strong, these specialist fine-tunes converge fast. We're talking 10k-35k training samples and roughly 20 minutes of H200 time per specialist. Each one ends up at the same 272M parameter footprint, small enough to run on CPU with under 1GB of memory.
The current swarm has eight members with eight more currently going through evaluation testing:
| Specialist | Task | Training data | Samples | Loss |
|---|---|---|---|---|
| Eve-NanoSQL | Natural language to SQL with table context | b-mc2/sql-create-context | 25k | <0.2 |
| Eve-NanoFunction | Strict JSON function calling from natural language | glaive-function-calling-v2 | 35k | <0.4 |
| Eve-NanoExtract | Unstructured text to strict JSON schemas | Salesforce/xlam-function-calling | 20k | <0.4 |
| Eve-NanoCommit | Git diffs to conventional commit messages | bigcode/commitpackft | 20k | <1.0 |
| Eve-NanoSummary | Conversation summarization | knkarthick/dialogsum | 12.5k | <1.0 |
| Eve-NanoRouter | Intent classification and query routing | bitext/customer-support | 25k | <0.3 |
| Eve-NanoPrompt | Simple descriptions to rich image-gen prompts | Stable-Diffusion-Prompts | 15k | <1.0 |
| Eve-NanoPII | PII identification and masking | ai4privacy/pii-masking-200k | 35k | <0.1 |
A few things stand out. NanoPII hitting sub-0.1 loss on PII masking -- that's the model essentially memorizing the task format, which is exactly what you want for a redaction tool. NanoSQL at sub-0.2 means it reliably generates syntactically valid SQL from natural language given table context. These aren't tasks that need creative reasoning. They need consistent, correct transformations, and a small overfit model delivers that.
The router is worth noting separately. In a deployed swarm, NanoRouter is the first model that sees each incoming request. It classifies intent and dispatches to the appropriate specialist. At sub-0.3 loss on intent classification, it's reliable enough to serve as the entry point for the whole system.
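The dispatch step can be sketched like this. The specialist names come from the table above; the `classify` and `run` callables, the intent labels, and the generalist fallback are hypothetical stand-ins for the actual model-serving calls:

```python
# Hypothetical swarm front door: NanoRouter classifies intent, then the
# matching specialist handles the request.
SPECIALISTS = {
    "sql": "Eve-NanoSQL",
    "extract": "Eve-NanoExtract",
    "commit": "Eve-NanoCommit",
    "summarize": "Eve-NanoSummary",
}

def dispatch(request, classify, run):
    """classify: calls NanoRouter, returns an intent label.
    run: loads/invokes the named model on the request."""
    intent = classify(request)
    # Unknown intents fall back to the instruction-tuned base model.
    model = SPECIALISTS.get(intent, "Eve-2-MoE-IT-272M")
    return run(model, request)
```

Because every model in the table shares the same 272M footprint, the dispatcher can keep several specialists resident in memory at once, even on CPU.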
How to train your own Eve specialist
If you want to fine-tune your own specialist on top of Eve-2-MoE-IT, here's what I've learned the hard way:
Use full fine-tuning, not LoRA. At this parameter count, LoRA restricts the embedding space too much. Any modern GPU has enough VRAM for full FFT on a 272M model.
Mask user prompts. Use DataCollatorForCompletionOnlyLM or equivalent. Compute loss only on the assistant response. Otherwise the model wastes parameters reproducing the prompt text.
High batch sizes. I use 128 globally. Small models have volatile gradients, and large batches smooth them out. If you're on a single GPU, use gradient accumulation to get there.
Quality over quantity. 10k well-structured input-output pairs beat 100k rows of scraped web text. Every sample should be a clean "here's the input, here's the ideal output" example.
Stop at 2 epochs. These models memorize fast. Go past 2 epochs and you'll get perfect training loss but garbage on anything slightly out-of-distribution.
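Putting the batch-size tip into code: a generic single-GPU training loop that reaches the global batch of 128 by accumulating gradients over 16 micro-batches of 8. This is a sketch under those assumptions, not the Eve training script; the model is assumed to return its loss directly:

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps=16):
    """Gradient accumulation: 16 micro-batches of 8 = global batch of 128."""
    model.train()
    optimizer.zero_grad()
    for i, (inputs, labels) in enumerate(loader):
        # Scale so accumulated gradients match one large-batch step.
        loss = model(inputs, labels) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            # Clip at 1.0, as in the pre-training config.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
```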
The economics
I keep coming back to the unit economics because I think this is where the real shift happens for production systems.
GPT-5 costs roughly $21.00 per million tokens. An Eve specialist self-hosted on CPU costs about $0.02 per million tokens. That's roughly a 1,000x cost reduction for tasks that don't require a generalist.
The tradeoff is obvious: you need a separate model for each task type, and each model only handles its narrow domain. If someone asks your NanoSQL model to write a poem, it will produce nonsense. That's fine. That's the point. You route the request correctly (NanoRouter) and let each specialist do its one job.
This is system design applied to AI - decomposing the broad capability space of a large model into narrow, dedicated models that each handle one region of the problem space. Model distillation has existed for years, but MoE at this scale makes the tradeoff between specialization depth and deployment cost more practical than it's been before.
Try it
Both base models and all eight specialists are open-source on Hugging Face:
- Instruction-tuned: anthonym21/Eve-2-MoE-IT-272M
The training scripts, model architecture code, and configs are included in the repos. If you build a specialist on top of Eve, I'd genuinely like to hear about it. Please submit it and I will add it to our Hugging Face collection.
Special thanks to Hugging Face, Liquid AI, Maxime Labonne, Paul Iusztin, Pau Labarta Bajo, and the maintainers of the datasets, models, documentation, educational material and tools who share and democratize AI for the rest of us. This includes transformers, trl, trainers, datasets, hub and cli.
