Qwen3.5-35B-A3B-REAP-pile10k-30p-MLX-q6

This repository contains a static REAP-pruned MLX checkpoint derived from Qwen/Qwen3.5-35B-A3B and quantized to q6.

Original Model Lineage

This model card is intentionally explicit about lineage: the starting point was the original Qwen model family, but the local pruning and quantization workflow operated on the MLX bf16 conversion.

REAP Method Lineage

reap-mlx is the Apple Silicon / MLX implementation of the pruning side of Cerebras REAP. In other words, the pruning logic used for this checkpoint is descended from the Cerebras REAP method, but executed through the reap-mlx workflow on a local MLX checkpoint.

Exact Tooling Used

  • reap-mlx version: 0.1.0
  • reap-mlx commit: 080a764
  • MLX version used for quantization: 0.31.0
  • mlx-lm version used for quantization and serving: 0.30.8

What Was Actually Done

The workflow for this release was:

  1. Start from the MLX bf16 Qwen3.5-35B-A3B checkpoint.
  2. Run REAP telemetry collection on pile-10k calibration data.
  3. Build a static pruning plan from that telemetry.
  4. Apply that pruning plan to physically remove MoE experts from the checkpoint.
  5. Quantize the resulting pruned bf16 checkpoint into q6.

This is static MoE expert pruning: the selected experts are physically removed from the checkpoint once, offline, rather than masked or re-routed at inference time.
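Assuming per-expert saliency has already been collected as a `[layers, experts]` matrix, steps 3–4 above reduce to an argsort-and-drop per layer. A minimal numpy sketch (function and variable names are illustrative, not the reap-mlx API):

```python
import numpy as np

# Values from this release: 10240 total experts / 256 per layer => 40 MoE layers.
NUM_LAYERS, NUM_EXPERTS, PRUNE_PER_LAYER = 40, 256, 77

def build_prune_plan(saliency: np.ndarray) -> list[np.ndarray]:
    """For each MoE layer, mark the PRUNE_PER_LAYER lowest-saliency experts."""
    plan = []
    for layer_scores in saliency:
        # indices of the least salient experts in this layer
        drop = np.argsort(layer_scores)[:PRUNE_PER_LAYER]
        plan.append(np.sort(drop))
    return plan

# Toy saliency numbers standing in for real REAP telemetry.
rng = np.random.default_rng(0)
saliency = rng.random((NUM_LAYERS, NUM_EXPERTS))
plan = build_prune_plan(saliency)

total_removed = sum(len(d) for d in plan)
print(total_removed)                                               # 3080
print(round(100 * total_removed / (NUM_LAYERS * NUM_EXPERTS), 4))  # 30.0781
```

The printed totals match the release figures: 40 layers × 77 experts = 3080 removed out of 10240, a 30.0781% prune ratio.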

Calibration Data and How It Was Used

This part is important:

pile-10k train[:256] was used as REAP calibration data, not as model training data and not as the main published benchmark target. Concretely, the model was run over this public text subset to collect per-expert telemetry, estimate expert salience, and decide which experts to prune.
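Telemetry of this kind can be accumulated in streaming fashion while the model runs over the calibration rows. A hedged sketch of the idea (array shapes and names are assumptions, not the reap-mlx implementation):

```python
import numpy as np

NUM_EXPERTS = 256

# Running statistics for a single MoE layer.
contrib_sum = np.zeros(NUM_EXPERTS)
token_count = 0

def observe_batch(gates: np.ndarray, expert_out: np.ndarray) -> None:
    """gates: [tokens, experts] router weights (0 for unrouted experts);
    expert_out: [tokens, experts, hidden] expert outputs."""
    global token_count
    norms = np.linalg.norm(expert_out, axis=-1)    # ||f_j(x)||, [tokens, experts]
    contrib_sum[:] += (gates * norms).sum(axis=0)  # accumulate g_j(x) * ||f_j(x)||
    token_count += gates.shape[0]

# Simulate a few calibration batches of 8 tokens each.
rng = np.random.default_rng(1)
for _ in range(4):
    observe_batch(rng.random((8, NUM_EXPERTS)), rng.random((8, NUM_EXPERTS, 16)))

saliency = contrib_sum / token_count  # mean routed contribution per expert
```

The resulting per-expert means are what the pruning plan ranks; the calibration text itself is never used for weight updates.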

Pruning Configuration for This Release

  • Pruning method: reap
  • Experts pruned per layer: 77 / 256
  • Achieved prune ratio: 30.0781%
  • Total experts removed: 3080 / 10240 (40 MoE layers × 77 of 256 experts each)

The REAP-style salience used in reap-mlx is based on the mean routed expert contribution:

saliency_j = mean(g_j(x) * ||f_j(x)||)

where g_j(x) is the router weight assigned to expert j on token x, f_j(x) is that expert's output, and the mean is taken over the calibration tokens.
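The formula above is a one-liner in vectorized form. A small hand-checkable example (shapes and names are illustrative):

```python
import numpy as np

def reap_saliency(g: np.ndarray, f: np.ndarray) -> np.ndarray:
    """saliency_j = mean over tokens x of g_j(x) * ||f_j(x)||.
    g: [tokens, experts] router weights; f: [tokens, experts, hidden]."""
    return (g * np.linalg.norm(f, axis=-1)).mean(axis=0)

# Toy case: 2 tokens, 2 experts, hidden size 2.
g = np.array([[1.0, 0.0],
              [0.5, 0.5]])
f = np.array([[[3.0, 4.0], [0.0, 0.0]],
              [[0.0, 2.0], [6.0, 8.0]]])

# Norms are [[5, 0], [2, 10]], so contributions are [[5, 0], [1, 5]]
# and the per-expert means are [3.0, 2.5].
print(reap_saliency(g, f))
```

Experts with the lowest mean routed contribution are the ones a REAP-style plan marks for removal.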

Format and Size

  • Format: MLX
  • Quantization: q6
  • Approximate local on-disk size during creation: 18.92 GB

Benchmark / Evaluation Status

  • Custom benchmark (held-out pile-10k + full HellaSwag): perplexity 9.441; token-accuracy retention 96.92% of the original bf16; HellaSwag retention 97.56% of the original bf16; throughput 123.95% of the original bf16; memory footprint 29.97% of the original bf16.
  • Custom long-context coding test (4 deterministic 100k+ token coding tasks): average score 83.74, i.e. 85.64% of the unpruned q6 model's score, and clearly above the simple same-budget baseline (37.50).

Notes

  • This repository is the pruned-and-quantized derivative checkpoint, not the original Qwen release.
  • The pruning decision was driven by REAP telemetry collected on pile-10k calibration rows.
  • The benchmark and calibration roles are separate; the calibration slice was used to build the prune plan.