Running 397 Billion Parameters on a Laptop: NVMe-Streamed MoE Inference and Fine-Tuning on 24GB Consumer Hardware

Large language models have outgrown consumer hardware, or so the conventional wisdom goes. We decided to challenge that assumption.

In this post, we present a system that runs Qwen3.5-397B-A17B, a 397-billion parameter Mixture-of-Experts model, entirely on a single Apple MacBook Pro with 24GB of unified memory, achieving 1.77 tokens/second decode throughput, a 7.4x speedup over full MoE inference. To our knowledge, ours is the only framework that can run this model on consumer hardware: the only alternative (mlx-lm) is killed by the OS with an out-of-memory error before generating a single token.

We also demonstrate the first fine-tuning of a 400B-class model on consumer hardware, using Sparse MoE-LoRA, training only 0.001% of parameters within a 16GB memory budget.

All code, benchmarks, and trained adapters are open-sourced.


The Core Problem

Qwen3.5-397B-A17B requires 223GB of storage even after nvfp4 quantization. A data center solves this with high-bandwidth memory and multi-GPU parallelism. A MacBook has 24GB of unified memory and an NVMe SSD.

The key insight: although the model has 397 billion parameters, only 17 billion activate per token. The challenge is making the expert weights, which account for nearly all of the 223GB on disk, accessible without loading them into RAM.


System Architecture

We treat NVMe storage as an extension of the memory hierarchy, streaming expert weights on-demand at ~3.5 GB/s:

Component            Location                 Memory
Attention + norms    Pinned in RAM            4 GB
Shared experts (60)  Pinned in RAM            Included above
Expert LRU cache     RAM (2,370 slots)        8 GB
Expert bank          NVMe (223 GB on disk)    —
Working memory       MLX compute              3 GB
Total                                         ~15 GB of 24 GB
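The streaming path can be pictured as a memory-mapped expert bank on NVMe fronted by an LRU cache in RAM. The following is a minimal illustrative sketch, not the engine's actual code: the flat file layout, the tiny `EXPERT_BYTES` size, and the `NVMeExpertCache` class name are all hypothetical (real expert tensors are nvfp4 blocks, far larger than 4KB).

```python
import mmap
import os
import tempfile
from collections import OrderedDict

import numpy as np

EXPERT_BYTES = 4096  # hypothetical per-expert size for this demo


class NVMeExpertCache:
    """LRU cache over a memory-mapped expert bank on disk.

    Sketch only: expert i is assumed to live at offset i * EXPERT_BYTES in
    one flat file. A hit returns the cached array; a miss reads from the
    mmap and evicts the least-recently-used slot when the cache is full.
    """

    def __init__(self, path: str, num_slots: int):
        self.file = open(path, "rb")
        self.bank = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.slots = OrderedDict()  # expert_id -> np.ndarray
        self.num_slots = num_slots
        self.hits = self.misses = 0

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.slots[expert_id]
        self.misses += 1
        off = expert_id * EXPERT_BYTES
        weights = np.frombuffer(self.bank[off:off + EXPERT_BYTES], dtype=np.uint8)
        if len(self.slots) >= self.num_slots:
            self.slots.popitem(last=False)  # evict the LRU expert
        self.slots[expert_id] = weights
        return weights


# Demo on a tiny synthetic expert bank: 8 experts, 3 cache slots.
path = os.path.join(tempfile.mkdtemp(), "expert_bank.bin")
with open(path, "wb") as f:
    f.write(os.urandom(8 * EXPERT_BYTES))

cache = NVMeExpertCache(path, num_slots=3)
for eid in [0, 1, 2, 0, 3, 0]:  # expert 1 is evicted when 3 arrives
    cache.get(eid)
print(cache.hits, cache.misses)  # -> 2 4
```

The OS page cache does much of the heavy lifting here: repeated reads of hot experts never touch the SSD, which is what makes the ~3.5 GB/s streaming figure sustainable in practice.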

The Central Finding: Most Layers Don't Need Full MoE Routing

Through calibration across all 60 transformer layers, we found that at 32 of 60 layers, the shared expert alone captures >99.5% of output directionality. This enabled a three-tier routing policy:

Tier     Criterion             Layers    Experts/Token
SKIP     cos > 0.995           32        0 (shared only)
LIGHT    0.98 < cos < 0.995    22        2
FULL     cos < 0.98            6         5
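The tier decision itself reduces to one cosine-similarity comparison per layer. Below is a hedged sketch of how calibration could assign tiers; the thresholds mirror the table above, but the function name and single-vector interface are illustrative (the real calibration in `calibration/layer_calibration.json` aggregates statistics over many tokens per layer).

```python
import numpy as np

# Thresholds from the tier table; constant names are illustrative.
SKIP_THRESHOLD = 0.995
FULL_THRESHOLD = 0.98


def assign_tier(shared_out: np.ndarray, full_out: np.ndarray) -> tuple[str, int]:
    """Compare a layer's shared-expert output with its full-MoE output and
    return (tier, number of routed experts to evaluate per token)."""
    cos = float(np.dot(shared_out, full_out)
                / (np.linalg.norm(shared_out) * np.linalg.norm(full_out)))
    if cos > SKIP_THRESHOLD:
        return "SKIP", 0   # shared expert alone captures the direction
    if cos > FULL_THRESHOLD:
        return "LIGHT", 2  # top-2 routed experts
    return "FULL", 5       # top-5 routed experts


v = np.array([1.0, 0.0, 0.0])
print(assign_tier(v, v))                           # identical -> ('SKIP', 0)
print(assign_tier(v, np.array([1.0, 0.15, 0.0])))  # cos ~0.989 -> ('LIGHT', 2)
print(assign_tier(v, np.array([1.0, 1.0, 0.0])))   # cos ~0.707 -> ('FULL', 5)
```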

Expert loads per token drop from 300 (60 layers × 5 experts) to 74 (32×0 + 22×2 + 6×5), a 4.1x reduction.


Optimization Stack: 0.05 → 1.77 tok/s

Optimization                    Speed          Speedup
Baseline (naive)                0.05 tok/s     1x
+ Layer-selective MoE           0.72 tok/s     14.4x
+ Scale validation skip         ~0.85 tok/s    17x
+ Fast shared-expert prefill    ~1.0 tok/s     20x
+ Batch expert evaluation       ~1.5 tok/s     30x
+ Fixed-k + k-reduction         1.77 tok/s     35x

Time to first token drops from 14.6s to 0.25s (58x improvement).


Benchmark Results

We evaluated on standard LLM benchmarks. As noted above, mlx-lm, the only competing framework, crashes with an out-of-memory error before generating a single token, so ours is the only system that runs this model on 24GB hardware.

Quality vs Speed Pareto Frontier:

Config               tok/s    PPL
Aggressive (ours)    1.77     5.18
Balanced             0.33     3.00
Full MoE             0.24     3.00

Knowledge and reasoning benchmarks:

Benchmark             Score    Notes
MMLU (5-shot)         76.7%    93% of full-model capability
HumanEval pass@1      15.0%    Code most affected by routing
GSM8K (0-shot CoT)    15.0%    Multi-step errors compound across 60 layers

Important caveat on HumanEval/GSM8K: these scores are well below the full model's reference values (~65% and ~85%, respectively). Code generation had the lowest shared-expert agreement (16.7%), and multi-step reasoning compounds small per-layer routing errors across all 60 layers. For reasoning-heavy tasks we recommend the balanced tier config (PPL 3.00).


Fine-Tuning 397B on a Laptop

By placing LoRA adapters only on the shared expert and the top 15% most-routed experts, trainable parameters drop to just 6.27 million (0.001% of the full model).
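A minimal sketch of this adapter placement: rank the routed experts by routing frequency, adapt only the top 15%, and count the resulting trainable parameters. The function names (`select_lora_experts`, `lora_params`) and the toy dimensions below are hypothetical; the actual implementation lives in `engine/training/sparse_lora.py`.

```python
import numpy as np


def select_lora_experts(route_counts: np.ndarray, frac: float = 0.15) -> list[int]:
    """Return the indices of the top `frac` most-routed experts."""
    k = max(1, int(len(route_counts) * frac))
    return sorted(np.argsort(route_counts)[::-1][:k].tolist())


def lora_params(num_adapted: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for rank-r adapters (A: r x d_in, B: d_out x r)."""
    return num_adapted * rank * (d_in + d_out)


# Toy example: 20 routed experts, adapt the busiest 15% (3 of them).
counts = np.array([5, 80, 3, 9, 120, 7, 2, 60, 1, 4,
                   11, 0, 6, 95, 8, 2, 13, 3, 70, 10])
chosen = select_lora_experts(counts)
print(chosen)                                     # -> [1, 4, 13]
print(lora_params(len(chosen), 2048, 2048, rank=8))  # -> 98304
```

Because the base weights stay frozen on NVMe and only the small A/B matrices receive gradients, the optimizer state fits easily inside the 16GB training budget.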

Training on Magicoder-Evol-Instruct-110K, loss dropped from 4.58 to 2.47 over 130 steps (46% improvement), within a 16GB memory budget. Trained adapter: 15KB compressed.


Repository Contents

├── paper.pdf                    # Research paper
├── engine/                      # Core inference engine
│   ├── inference/
│   │   ├── qwen35_engine.py     # Main Qwen3.5 engine
│   │   ├── expert_cache.py      # LRU expert cache
│   │   ├── experiments.py       # 8 research experiments
│   │   └── ...
│   └── training/
│       ├── moe_trainer.py       # NVMe-streamed trainer
│       └── sparse_lora.py       # Sparse MoE-LoRA
├── scripts/
│   ├── train_coding.py          # Training script
│   └── generate_paper.py        # Paper generation
├── benchmarks/
│   ├── eval_*.py                # All benchmark scripts
│   ├── results_*.json           # Benchmark results
│   └── run_all_benchmarks.py    # Master runner
├── adapters/
│   ├── coding_lora.npz          # Trained LoRA adapter
│   └── training_log.json        # Training metrics
├── calibration/
│   └── layer_calibration.json   # Tier assignment data
└── tests/                       # Unit tests

Requirements

  • Apple Silicon Mac with 24GB+ unified memory
  • Python 3.12+
  • MLX framework
  • Qwen3.5-397B-A17B model weights (nvfp4 quantization, 223GB)

License

MIT
