Running 397 Billion Parameters on a Laptop: NVMe-Streamed MoE Inference and Fine-Tuning on 24GB Consumer Hardware
Large language models have outgrown consumer hardware, or so the conventional wisdom goes. We decided to challenge that assumption.
In this post, we present a system that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, entirely on a single Apple MacBook Pro with 24GB of unified memory, achieving 1.77 tokens/second decode throughput, a 7.4x speedup over full MoE inference. To our knowledge, ours is the only framework that can run this model on consumer hardware: the sole alternative (mlx-lm) is killed by the OS with an out-of-memory error before it generates a single token.
We also demonstrate what we believe is the first fine-tuning of a 400B-class model on consumer hardware, using Sparse MoE-LoRA to train roughly 0.0016% of parameters within a 16GB memory budget.
All code, benchmarks, and trained adapters are open-sourced.
The Core Problem
Qwen3.5-397B-A17B requires 223GB of storage even after nvfp4 quantization. A data center solves this with high-bandwidth memory and multi-GPU parallelism. A MacBook has 24GB of unified memory and an NVMe SSD.
The key insight: only 17 billion parameters are active for any given token. The challenge is making the 223GB of expert weights accessible without loading them all into RAM.
System Architecture
We treat NVMe storage as an extension of the memory hierarchy, streaming expert weights on-demand at ~3.5 GB/s:
| Component | Location | Memory |
|---|---|---|
| Attention + norms | Pinned in RAM | 4 GB |
| Shared experts (60) | Pinned in RAM | Included above |
| Expert LRU cache | RAM (2,370 slots) | 8 GB |
| Expert bank | NVMe (223 GB on disk) | 0 GB (streamed) |
| Working memory | MLX compute | 3 GB |
| Total | | ~15 GB of 24 GB |
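The streaming design above hinges on the expert LRU cache. Here is a minimal plain-Python sketch of such a cache, not the engine's actual `expert_cache.py`: the `load_fn` callback is a hypothetical stand-in for the real NVMe read (a slice of the 223GB expert bank), and the default slot count mirrors the 2,370-slot budget from the table.

```python
from collections import OrderedDict

class ExpertCache:
    """Fixed-capacity LRU cache for expert weight blocks streamed from NVMe.

    Illustrative sketch: `load_fn(layer, expert_id)` stands in for the
    real disk read and returns the expert's weights.
    """

    def __init__(self, load_fn, max_slots=2370):
        self.load_fn = load_fn
        self.max_slots = max_slots
        self.slots = OrderedDict()  # (layer, expert_id) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.slots:
            self.slots.move_to_end(key)  # mark as most-recently-used
            self.hits += 1
            return self.slots[key]
        self.misses += 1
        if len(self.slots) >= self.max_slots:
            self.slots.popitem(last=False)  # evict least-recently-used slot
        weights = self.load_fn(layer, expert_id)
        self.slots[key] = weights
        return weights
```

`OrderedDict` gives O(1) recency updates via `move_to_end` and O(1) LRU eviction via `popitem(last=False)`, which keeps cache bookkeeping negligible next to the ~3.5 GB/s NVMe reads it avoids.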
The Central Finding: Most Layers Don't Need Full MoE Routing
Through calibration across all 60 transformer layers, we found that at 32 of 60 layers, the shared expert alone captures >99.5% of output directionality. This enabled a three-tier routing policy:
| Tier | Criterion | Layers | Experts/Token |
|---|---|---|---|
| SKIP | cos > 0.995 | 32 | 0 (shared only) |
| LIGHT | 0.98 < cos ≤ 0.995 | 22 | 2 |
| FULL | cos ≤ 0.98 | 6 | 5 |
Expert loads per token drop from 300 to 74, a 4.1x reduction.
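The tier policy can be expressed as a small calibration helper. This is an illustrative sketch, not the engine's actual code: the cosine is taken between a layer's shared-expert output and its full-MoE output, and the thresholds mirror the table above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (plain Python, for illustration)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_tier(cos_sim, skip_thresh=0.995, light_thresh=0.98):
    """Map a layer's calibration cosine to (tier, routed experts per token)."""
    if cos_sim > skip_thresh:
        return ("SKIP", 0)   # shared expert alone suffices
    if cos_sim > light_thresh:
        return ("LIGHT", 2)  # top-2 routed experts
    return ("FULL", 5)       # top-5 routed experts
```

Summing over the tier assignments recovers the headline number: 32·0 + 22·2 + 6·5 = 74 expert loads per token, versus 60·5 = 300 under full routing.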
Optimization Stack: 0.05 → 1.77 tok/s
| Optimization | Speed | Speedup |
|---|---|---|
| Baseline (naive) | 0.05 tok/s | 1x |
| + Layer-selective MoE | 0.72 tok/s | 14.4x |
| + Scale validation skip | ~0.85 tok/s | 17x |
| + Fast shared-expert prefill | ~1.0 tok/s | 20x |
| + Batch expert evaluation | ~1.5 tok/s | 30x |
| + Fixed-k + k-reduction | 1.77 tok/s | 35x |
Time to first token drops from 14.6s to 0.25s (58x improvement).
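One way to realize the batch-expert-evaluation step from the table is to invert per-token routing into per-expert token batches, so each expert's weights are fetched from the cache once per decode step rather than once per token. A hypothetical sketch (the `assignments` structure is illustrative, not the engine's real data layout):

```python
from collections import defaultdict

def group_by_expert(assignments):
    """Invert per-token routing into per-expert token batches.

    `assignments`: dict mapping token index -> list of routed expert ids.
    Grouping lets each expert's weights be loaded once and applied to all
    of its assigned tokens in a single batched matmul.
    """
    batches = defaultdict(list)
    for token, experts in assignments.items():
        for expert_id in experts:
            batches[expert_id].append(token)
    return dict(batches)
```

When NVMe reads dominate, amortizing one load across every token routed to that expert is what turns per-token I/O cost into per-step I/O cost.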
Benchmark Results
We evaluated on standard LLM benchmarks. There is no like-for-like baseline: mlx-lm, the only competing framework, crashes with an out-of-memory error before generating a single token, so ours is the only system that runs this model on 24GB hardware.
Quality vs Speed Pareto Frontier:
| Config | tok/s | PPL |
|---|---|---|
| Aggressive (ours) | 1.77 | 5.18 |
| Balanced | 0.33 | 3.00 |
| Full MoE | 0.24 | 3.00 |
Knowledge and reasoning benchmarks:
| Benchmark | Score | Notes |
|---|---|---|
| MMLU (5-shot) | 76.7% | 93% of full model capability |
| HumanEval pass@1 | 15.0% | Code most affected by routing |
| GSM8K (0-shot CoT) | 15.0% | Multi-step errors compound across 60 layers |
Important caveat on HumanEval/GSM8K: these scores are below the full model's reference (~65% and ~85%). Code generation had the lowest shared-expert agreement (16.7%), and multi-step reasoning compounds small per-layer routing errors. The balanced tier config (PPL 3.00) is recommended for reasoning-heavy tasks.
Fine-Tuning 397B on a Laptop
By placing LoRA adapters only on the shared expert and the top 15% most-routed experts, the trainable parameter count drops to just 6.27 million, roughly 0.0016% of the full model.
Training on Magicoder-Evol-Instruct-110K, loss dropped from 4.58 to 2.47 over 130 steps (46% improvement), within a 16GB memory budget. Trained adapter: 15KB compressed.
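The adapter-placement arithmetic can be sketched as follows. The dimensions, rank, and `route_counts` calibration format below are hypothetical stand-ins for illustration, not the actual training code in `sparse_lora.py`:

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

def select_adapter_experts(route_counts, top_frac=0.15):
    """Pick the top `top_frac` most-routed expert ids for adapter placement.

    `route_counts`: dict mapping expert id -> routing frequency observed
    during calibration (a hypothetical format for illustration).
    """
    k = max(1, int(len(route_counts) * top_frac))
    ranked = sorted(route_counts, key=route_counts.get, reverse=True)
    return set(ranked[:k])
```

For example, a rank-8 adapter on a 4096-to-4096 projection costs 8·(4096+4096) = 65,536 parameters, which is how a handful of targeted adapters stays in the single-digit millions rather than the billions.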
Repository Contents
├── paper.pdf                      # Research paper
├── engine/                        # Core inference engine
│   ├── inference/
│   │   ├── qwen35_engine.py       # Main Qwen3.5 engine
│   │   ├── expert_cache.py        # LRU expert cache
│   │   ├── experiments.py         # 8 research experiments
│   │   └── ...
│   └── training/
│       ├── moe_trainer.py         # NVMe-streamed trainer
│       └── sparse_lora.py         # Sparse MoE-LoRA
├── scripts/
│   ├── train_coding.py            # Training script
│   └── generate_paper.py          # Paper generation
├── benchmarks/
│   ├── eval_*.py                  # All benchmark scripts
│   ├── results_*.json             # Benchmark results
│   └── run_all_benchmarks.py      # Master runner
├── adapters/
│   ├── coding_lora.npz            # Trained LoRA adapter
│   └── training_log.json          # Training metrics
├── calibration/
│   └── layer_calibration.json     # Tier assignment data
└── tests/                         # Unit tests
Requirements
- Apple Silicon Mac with 24GB+ unified memory
- Python 3.12+
- MLX framework
- Qwen3.5-397B-A17B model weights (nvfp4 quantization, 223GB)
License
MIT