CPU-Optimized Builds β€” Making this model accessible beyond GPU hardware

#6
by 19440harry - opened

I've been running this model on an RTX 4070 Mobile and the reasoning quality is genuinely impressive, especially for a local 27B. It got me thinking about accessibility for users who don't have a discrete GPU.

The Byteshape project recently demonstrated CPU-viable deployment of Qwen3-Coder-30B-A3B on a Raspberry Pi 5: https://huggingface.co/byteshape/Qwen3-Coder-30B-A3B-Instruct-GGUF

I understand that approach works partly because it's a MoE architecture with only 3B active parameters per forward pass, which is a different challenge from a dense 27B. But with aggressive quantization (Q2/Q3) and CPU-specific kernel optimization, there may be a viable path here too.
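To make the quantization angle concrete, here's a back-of-envelope estimate of the weights-only RAM footprint for a dense 27B at common GGUF quant levels. The bits-per-weight figures are rough averages I'm assuming for the k-quant mixes, not exact values, and this ignores KV cache and runtime overhead:

```python
# Approximate weights-only RAM footprint for a dense 27B model at common
# GGUF quantization levels. Bits-per-weight values are rough averages
# (k-quants mix block types), so treat the results as ballpark numbers.
PARAMS = 27e9

BITS_PER_WEIGHT = {  # assumed effective bits/weight, not exact
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def footprint_gb(bits: float, params: float = PARAMS) -> float:
    """Weights-only footprint in GiB (excludes KV cache and overhead)."""
    return params * bits / 8 / 2**30

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>7}: ~{footprint_gb(bits):.1f} GiB")
```

By this estimate, Q2_K lands around 8 GiB of weights, which is at least in range for a 16 GB mini PC, even if speed is another question.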

A CPU-friendly build would open this model to a significantly wider audience: anyone running a mini PC, a home server, or a Raspberry Pi cluster, or anyone simply without GPU access. Given the reasoning quality this model delivers, that's a meaningful expansion of who can actually use it.

Questions:

  1. Has CPU deployment been considered or tested?
  2. Are there community members with CPU optimization experience who'd want to collaborate on this?

Would genuinely like to see this model reach more hardware.

Running a dense 27B model on CPU is going to be a bad time no matter what optimizations you try. The best model for CPU use, I'd say, is https://huggingface.co/LiquidAI/LFM2-24B-A2B, with only 2B active parameters; it's also not a reasoning model, so you don't wait ten minutes for a response after every prompt.
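The active-parameter gap is the whole story here. Assuming token generation on CPU is memory-bandwidth-bound (each decode step reads every active weight once), a rough ceiling on tokens/s follows from bandwidth divided by bytes touched per token. The bandwidth and bytes-per-param numbers below are illustrative assumptions, not measurements:

```python
# Rough decode-speed ceiling for CPU inference, assuming generation is
# memory-bandwidth-bound: each token reads every active weight once.
# Bandwidth and bytes/param are illustrative assumptions.
def max_tokens_per_sec(active_params: float, bytes_per_param: float,
                       mem_bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s = bandwidth / bytes read per token."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_gbps * 1e9 / bytes_per_token

BW = 50.0  # GB/s, plausible for a dual-channel DDR5 desktop (assumption)
BPP = 0.55  # ~4.4 bits/weight, roughly a Q4-class quant (assumption)

dense_27b = max_tokens_per_sec(27e9, BPP, BW)
moe_2b = max_tokens_per_sec(2e9, BPP, BW)

print(f"dense 27B ceiling:    ~{dense_27b:.1f} tok/s")
print(f"2B-active MoE ceiling: ~{moe_2b:.1f} tok/s")
```

Under those assumptions the dense 27B tops out around 3 tok/s while the 2B-active MoE clears 40, which is why the architecture matters more than any kernel tuning.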

Fair point on raw performance: a dense 27B on CPU will never be fast. The use case I had in mind is more async or batch workloads where latency isn't the constraint, and the reasoning quality of this fine-tune is the value proposition. The LFM2 recommendation is interesting for interactive use cases, though. Has anyone actually benchmarked this model at Q2 on CPU to see where the floor is before writing it off entirely?
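For anyone who does run that benchmark, a minimal timing harness is all it takes to find the floor. This sketch times any single-token decode callable; `fake_step` is a hypothetical stand-in you'd replace with a real call into your runtime (e.g. a llama.cpp binding):

```python
# Minimal harness to measure sequential decode rate of any local model
# wrapper exposing a one-token-step callable. `fake_step` is a stand-in
# for a real decode call (assumption); swap it for your runtime's API.
import time

def measure_tok_per_sec(step, n_tokens: int = 32) -> float:
    """Time n_tokens sequential decode steps and return tokens/second."""
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - t0
    return n_tokens / elapsed

def fake_step():
    # Stand-in workload so the harness runs; replace with a real
    # single-token decode call against a Q2 build.
    time.sleep(0.01)

print(f"~{measure_tok_per_sec(fake_step):.0f} tok/s")
```

Even a crude number like this, posted alongside the CPU and quant level used, would answer the viability question better than speculation.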
