Running 397 Billion Parameters on a Laptop: NVMe-Streamed MoE Inference and Fine-Tuning on 24GB Consumer Hardware

Large language models have outgrown consumer hardware, or so the conventional wisdom goes. We decided to challenge that assumption.

In this post, we present a system that runs Qwen3.5-397B-A17B, a 397-billion parameter Mixture-of-Experts model, entirely on a single Apple MacBook Pro with 24GB of unified memory, achieving 1.77 tokens/second decode throughput, a 7.4x speedup over full MoE inference. To our knowledge, ours is the only framework that can run this model on consumer hardware: the only alternative (mlx-lm) is killed by the OS with an out-of-memory error before generating a single token.

We also demonstrate the first fine-tuning of a 400B-class model on consumer hardware, using Sparse MoE-LoRA, training only 0.001% of parameters within a 16GB memory budget.

All code, benchmarks, and trained adapters are open-sourced.


The Core Problem

Qwen3.5-397B-A17B requires 223GB of storage even after nvfp4 quantization. A data center solves this with high-bandwidth memory and multi-GPU parallelism. A MacBook has 24GB of unified memory and an NVMe SSD.

The key insight: although the model has 397 billion parameters, only 17 billion activate per token. The challenge is making the expert weights, which account for nearly all of the 223GB on disk, accessible without loading them into RAM.


System Architecture

We treat NVMe storage as an extension of the memory hierarchy, streaming expert weights on-demand at ~3.5 GB/s:

Component            Location                 Memory
Attention + norms    Pinned in RAM            4 GB
Shared experts (60)  Pinned in RAM            Included above
Expert LRU cache     RAM (2,370 slots)        8 GB
Expert bank          NVMe (223 GB on disk)    —
Working memory       MLX compute              3 GB
Total                                         ~15 GB of 24 GB
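The streaming path can be pictured as a memory-mapped expert bank on NVMe fronted by an LRU cache in RAM. The following is a minimal illustrative sketch, not the engine's actual code: the flat file layout, the tiny `EXPERT_BYTES` size, and the `NVMeExpertCache` class name are all hypothetical (real expert tensors are nvfp4 blocks, far larger than 4KB).

```python
import mmap
import os
import tempfile
from collections import OrderedDict

import numpy as np

EXPERT_BYTES = 4096  # hypothetical per-expert size for this demo


class NVMeExpertCache:
    """LRU cache over a memory-mapped expert bank on disk.

    Sketch only: expert i is assumed to live at offset i * EXPERT_BYTES in
    one flat file. A hit returns the cached array; a miss reads from the
    mmap and evicts the least-recently-used slot when the cache is full.
    """

    def __init__(self, path: str, num_slots: int):
        self.file = open(path, "rb")
        self.bank = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.slots = OrderedDict()  # expert_id -> np.ndarray
        self.num_slots = num_slots
        self.hits = self.misses = 0

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.slots[expert_id]
        self.misses += 1
        off = expert_id * EXPERT_BYTES
        weights = np.frombuffer(self.bank[off:off + EXPERT_BYTES], dtype=np.uint8)
        if len(self.slots) >= self.num_slots:
            self.slots.popitem(last=False)  # evict the LRU expert
        self.slots[expert_id] = weights
        return weights


# Demo on a tiny synthetic expert bank: 8 experts, 3 cache slots.
path = os.path.join(tempfile.mkdtemp(), "expert_bank.bin")
with open(path, "wb") as f:
    f.write(os.urandom(8 * EXPERT_BYTES))

cache = NVMeExpertCache(path, num_slots=3)
for eid in [0, 1, 2, 0, 3, 0]:  # expert 1 is evicted when 3 arrives
    cache.get(eid)
print(cache.hits, cache.misses)  # -> 2 4
```

The OS page cache does much of the heavy lifting here: repeated reads of hot experts never touch the SSD, which is what makes the ~3.5 GB/s streaming figure sustainable in practice.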

The Central Finding: Most Layers Don't Need Full MoE Routing

Through calibration across all 60 transformer layers, we found that at 32 of 60 layers, the shared expert alone captures >99.5% of output directionality. This enabled a three-tier routing policy:

Tier     Criterion             Layers    Experts/Token
SKIP     cos > 0.995           32        0 (shared only)
LIGHT    0.98 < cos < 0.995    22        2
FULL     cos < 0.98            6         5
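The tier decision itself reduces to one cosine-similarity comparison per layer. Below is a hedged sketch of how calibration could assign tiers; the thresholds mirror the table above, but the function name and single-vector interface are illustrative (the real calibration in `calibration/layer_calibration.json` aggregates statistics over many tokens per layer).

```python
import numpy as np

# Thresholds from the tier table; constant names are illustrative.
SKIP_THRESHOLD = 0.995
FULL_THRESHOLD = 0.98


def assign_tier(shared_out: np.ndarray, full_out: np.ndarray) -> tuple[str, int]:
    """Compare a layer's shared-expert output with its full-MoE output and
    return (tier, number of routed experts to evaluate per token)."""
    cos = float(np.dot(shared_out, full_out)
                / (np.linalg.norm(shared_out) * np.linalg.norm(full_out)))
    if cos > SKIP_THRESHOLD:
        return "SKIP", 0   # shared expert alone captures the direction
    if cos > FULL_THRESHOLD:
        return "LIGHT", 2  # top-2 routed experts
    return "FULL", 5       # top-5 routed experts


v = np.array([1.0, 0.0, 0.0])
print(assign_tier(v, v))                           # identical -> ('SKIP', 0)
print(assign_tier(v, np.array([1.0, 0.15, 0.0])))  # cos ~0.989 -> ('LIGHT', 2)
print(assign_tier(v, np.array([1.0, 1.0, 0.0])))   # cos ~0.707 -> ('FULL', 5)
```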

Expert loads per token drop from 300 (60 layers × 5 experts) to 74 (32×0 + 22×2 + 6×5), a 4.1x reduction.


Optimization Stack: 0.05 → 1.77 tok/s

Optimization                    Speed          Speedup
Baseline (naive)                0.05 tok/s     1x
+ Layer-selective MoE           0.72 tok/s     14.4x
+ Scale validation skip         ~0.85 tok/s    17x
+ Fast shared-expert prefill    ~1.0 tok/s     20x
+ Batch expert evaluation       ~1.5 tok/s     30x
+ Fixed-k + k-reduction         1.77 tok/s     35x

Time to first token drops from 14.6s to 0.25s (58x improvement).


Benchmark Results

We evaluated on standard LLM benchmarks. As noted above, mlx-lm, the only competing framework, crashes with an out-of-memory error before generating a single token, so ours is the only system that runs this model on 24GB hardware.

Quality vs Speed Pareto Frontier:

Config               tok/s    PPL
Aggressive (ours)    1.77     5.18
Balanced             0.33     3.00
Full MoE             0.24     3.00

Knowledge and reasoning benchmarks:

Benchmark             Score    Notes
MMLU (5-shot)         76.7%    93% of full-model capability
HumanEval pass@1      15.0%    Code most affected by routing
GSM8K (0-shot CoT)    15.0%    Multi-step errors compound across 60 layers

Important caveat on HumanEval/GSM8K: these scores are well below the full model's reference values (~65% and ~85%, respectively). Code generation had the lowest shared-expert agreement (16.7%), and multi-step reasoning compounds small per-layer routing errors across all 60 layers. For reasoning-heavy tasks we recommend the balanced tier config (PPL 3.00).


Fine-Tuning 397B on a Laptop

By placing LoRA adapters only on the shared expert and the top 15% most-routed experts, trainable parameters drop to just 6.27 million (0.001% of the full model).
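A minimal sketch of this adapter placement: rank the routed experts by routing frequency, adapt only the top 15%, and count the resulting trainable parameters. The function names (`select_lora_experts`, `lora_params`) and the toy dimensions below are hypothetical; the actual implementation lives in `engine/training/sparse_lora.py`.

```python
import numpy as np


def select_lora_experts(route_counts: np.ndarray, frac: float = 0.15) -> list[int]:
    """Return the indices of the top `frac` most-routed experts."""
    k = max(1, int(len(route_counts) * frac))
    return sorted(np.argsort(route_counts)[::-1][:k].tolist())


def lora_params(num_adapted: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for rank-r adapters (A: r x d_in, B: d_out x r)."""
    return num_adapted * rank * (d_in + d_out)


# Toy example: 20 routed experts, adapt the busiest 15% (3 of them).
counts = np.array([5, 80, 3, 9, 120, 7, 2, 60, 1, 4,
                   11, 0, 6, 95, 8, 2, 13, 3, 70, 10])
chosen = select_lora_experts(counts)
print(chosen)                                     # -> [1, 4, 13]
print(lora_params(len(chosen), 2048, 2048, rank=8))  # -> 98304
```

Because the base weights stay frozen on NVMe and only the small A/B matrices receive gradients, the optimizer state fits easily inside the 16GB training budget.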

Training on Magicoder-Evol-Instruct-110K, loss dropped from 4.58 to 2.47 over 130 steps (46% improvement), within a 16GB memory budget. Trained adapter: 15KB compressed.


Repository Contents

├── paper.pdf                    # Research paper
├── engine/                      # Core inference engine
│   ├── inference/
│   │   ├── qwen35_engine.py     # Main Qwen3.5 engine
│   │   ├── expert_cache.py      # LRU expert cache
│   │   ├── experiments.py       # 8 research experiments
│   │   └── ...
│   └── training/
│       ├── moe_trainer.py       # NVMe-streamed trainer
│       └── sparse_lora.py       # Sparse MoE-LoRA
├── scripts/
│   ├── train_coding.py          # Training script
│   └── generate_paper.py        # Paper generation
├── benchmarks/
│   ├── eval_*.py                # All benchmark scripts
│   ├── results_*.json           # Benchmark results
│   └── run_all_benchmarks.py    # Master runner
├── adapters/
│   ├── coding_lora.npz          # Trained LoRA adapter
│   └── training_log.json        # Training metrics
├── calibration/
│   └── layer_calibration.json   # Tier assignment data
└── tests/                       # Unit tests

Requirements

  • Apple Silicon Mac with 24GB+ unified memory
  • Python 3.12+
  • MLX framework
  • Qwen3.5-397B-A17B model weights (nvfp4 quantization, 223GB)

License

MIT
