Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed

MLX 8-bit build of Jackrong/Qwopus3.6-35B-A3B-v1 packaged for fast local serving with lightning-mlx.

The checkpoint includes an MTPLX sidecar (mtp.safetensors) and runtime metadata (mtplx_runtime.json) so that lightning-mlx can use its Qwen3.5 MoE MTPLX serving path on Apple Silicon. The runtime metadata was verified on Darwin arm64 and records mtplx_version: 0.1.0rc3, mtp_depth_max: 1, and recommended_profile: sustained.
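To check which MTPLX settings a local copy carries, you can read mtplx_runtime.json directly. The sketch below is a hypothetical helper (show_mtplx_runtime is not part of lightning-mlx) and assumes that mtplx_version, mtp_depth_max and recommended_profile are top-level keys in that JSON file. If the real layout nests them, adjust the lookup.

```shell
# Hypothetical helper, not part of lightning-mlx.
# Assumes mtplx_version, mtp_depth_max and recommended_profile are
# top-level keys in <model-dir>/mtplx_runtime.json.
show_mtplx_runtime() {
  local dir="${1:-.}"
  python3 -c '
import json, sys
with open(sys.argv[1]) as f:
    meta = json.load(f)
for key in ("mtplx_version", "mtp_depth_max", "recommended_profile"):
    # Prints None for a key that is missing at the top level.
    print(f"{key}: {meta.get(key)}")
' "$dir/mtplx_runtime.json"
}
```

Usage: show_mtplx_runtime /path/to/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed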

The model is Qwopus3.6-35B-A3B-v1 (Qwen3.5 MoE, 35B total / ~3B active parameters per token, 256 experts with 8 active per token, multimodal vision + text, reasoning and tool use). Refer to the source model card for capabilities, license, and training details.

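Note on MTP weights: mtp.safetensors is packed from the MTP module of the upstream Qwen/Qwen3.6-35B-A3B model, which has the same backbone shape as Qwopus. The main weights, however, are the Qwopus fine-tune, so the MTP drafter was trained alongside the upstream weights rather than these ones, and the speculative decoding acceptance rate (and therefore the speedup) may differ from upstream.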
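To see why the acceptance rate matters, here is the standard expected-tokens calculation for draft-then-verify speculative decoding. It assumes MTPLX follows that usual scheme, with k drafted tokens per step (k = 1 here, matching mtp_depth_max: 1) and a per-token acceptance probability α. The value of α is not a measured figure for this checkpoint.

```latex
\mathbb{E}[\text{tokens per verification step}]
  = \sum_{i=0}^{k} \alpha^{i}
  = \frac{1 - \alpha^{k+1}}{1 - \alpha},
\qquad k = 1 \;\Rightarrow\; 1 + \alpha .
```

For example, a hypothetical acceptance rate of α = 0.6 yields at most about 1.6 tokens per full forward pass of the main model, and even perfect acceptance caps depth-1 drafting at 2 tokens per step, before accounting for the cost of running the MTP head.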

Install lightning-mlx

python3 -m pip install git+https://github.com/samuelfaj/lightning-mlx.git

Or:

curl -fsSL https://raw.githubusercontent.com/samuelfaj/lightning-mlx/main/install.sh | bash

Verify:

lightning-mlx --help

Serve this model

From Hugging Face:

lightning-mlx serve samuelfaj/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed

From a local checkout:

lightning-mlx serve /path/to/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed

Daemon mode:

lightning-mlx serve samuelfaj/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed --daemon
lightning-mlx status
lightning-mlx tui <PID-or-model-name>
lightning-mlx kill <PID-or-model-name>
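Loading a 35B-parameter checkpoint takes a while, so a script that starts the daemon may need to wait before sending requests. The helper below polls an HTTP endpoint until it answers successfully or a timeout expires. It assumes the server listens on port 8010 (as in the API example) and exposes GET /v1/models, which is common for OpenAI-compatible servers but not confirmed for lightning-mlx; pass a different URL if needed. wait_for_server itself is hypothetical.

```shell
# Hypothetical helper. Polls once per second until the URL returns a 2xx
# response, or gives up after TIMEOUT seconds (default 300).
# Assumes an OpenAI-style GET /v1/models endpoint on port 8010.
wait_for_server() {
  local url="${1:-http://localhost:8010/v1/models}"
  local timeout="${2:-300}"
  local waited=0
  # curl -f makes HTTP errors (e.g. 404, 503) count as "not ready".
  until curl -sf -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server at $url not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "server ready at $url"
}
```

Usage: lightning-mlx serve samuelfaj/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed --daemon && wait_for_server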

OpenAI-compatible API

curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a tiny Python HTTP server."}
    ],
    "stream": true
  }'
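With "stream": true the reply arrives as server-sent events. Assuming lightning-mlx follows the OpenAI streaming format (data: lines carrying choices[].delta.content chunks, terminated by data: [DONE]), a small helper can print just the generated text. print_stream is a hypothetical name, and any reasoning or tool-call deltas are ignored by this sketch.

```shell
# Hypothetical helper. Reads an OpenAI-style SSE stream on stdin and
# prints only the concatenated delta.content text.
print_stream() {
  python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line.startswith("data:"):
        continue  # skip blank separator lines and SSE comments
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    for choice in chunk.get("choices", []):
        piece = (choice.get("delta") or {}).get("content")
        if piece:  # role-only or empty deltas carry no text
            sys.stdout.write(piece)
            sys.stdout.flush()
print()
'
}
```

Usage: curl -sN http://localhost:8010/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "local", "messages": [{"role": "user", "content": "Write a tiny Python HTTP server."}], "stream": true}' | print_stream

curl's -N flag disables output buffering, so text appears as each chunk arrives.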

Why use lightning-mlx

lightning-mlx is built for local agent workloads on Apple Silicon: short streamed turns, tool calls, growing context, and repeated low-latency interactions. With this checkpoint it uses the packaged MTPLX metadata and the Qwen3.5 MoE serving preset instead of treating the model as a generic MLX checkpoint.

The runtime focuses on:

  • OpenAI-compatible local serving
  • Fast streamed chat completions
  • Qwen3.5 MoE reasoning and tool-use paths
  • MTPLX-style speculative decoding support
  • Daemon, status, TUI, and kill controls
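The tool-use path can be exercised through the same endpoint. The sketch below assumes lightning-mlx accepts the OpenAI tools request field and returns calls in choices[0].message.tool_calls, as the OpenAI Chat Completions API does; this card does not confirm either detail. The get_weather tool and both shell functions are hypothetical, for illustration only.

```shell
# Hypothetical helpers. Assumes OpenAI-style "tools" requests and
# "tool_calls" responses; get_weather is an illustrative tool, not a real one.
ask_with_tool() {
  curl -s http://localhost:8010/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "local",
      "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Look up the current weather for a city.",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }'
}

# Reads one non-streamed chat completion on stdin and prints each tool call
# as "<name> <arguments-json>", one per line (nothing if there are none).
print_tool_calls() {
  python3 -c '
import json, sys
message = json.load(sys.stdin)["choices"][0]["message"]
for call in message.get("tool_calls") or []:
    fn = call["function"]
    print(fn["name"], fn["arguments"])
'
}
```

Usage: ask_with_tool | print_tool_calls

If the model answers in plain text instead of calling the tool, print_tool_calls prints nothing.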

Convert similar local models to MTPLX

lightning-mlx convert-mtplx \
  /path/to/Model-MLX-quantized \
  --mtp-source /path/to/Model-with-mtp-tensors

The output is written next to the source directory as <source>-MTPLX-Optimized-Speed. Serve it with:

lightning-mlx serve /path/to/Model-MLX-quantized-MTPLX-Optimized-Speed
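Before serving a converted checkpoint, you can check that the output directory exists under the expected name and contains the two files this checkpoint ships with, mtp.safetensors and mtplx_runtime.json. Both helpers are hypothetical, and the assumption that every convert-mtplx output contains these files by name is inferred from this checkpoint rather than from lightning-mlx documentation.

```shell
# Hypothetical helpers. Derive the convert-mtplx output path using the
# "<source>-MTPLX-Optimized-Speed" naming from this card, then check that
# the MTPLX sidecar and runtime metadata files are present.
mtplx_output_dir() {
  local src="${1%/}"  # drop one trailing slash so the suffix joins the name
  printf '%s-MTPLX-Optimized-Speed\n' "$src"
}

check_mtplx_output() {
  local out f
  local missing=0
  out="$(mtplx_output_dir "$1")"
  for f in mtp.safetensors mtplx_runtime.json; do
    if [ ! -f "$out/$f" ]; then
      echo "missing: $out/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: check_mtplx_output /path/to/Model-MLX-quantized && lightning-mlx serve "$(mtplx_output_dir /path/to/Model-MLX-quantized)"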

Use with mlx-vlm

This checkpoint is a Qwen3.5 MoE vision-language model. To run generation or chat directly, without lightning-mlx, use mlx-vlm:

python3 -m pip install -U mlx-vlm
python3 -m mlx_vlm.generate \
  --model samuelfaj/Qwopus3.6-35B-A3B-v1-8bit-MTPLX-Optimized-Speed \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  --max-tokens 200

Intended use

Research, agents, reasoning, tool-use, vision-language workloads on Apple Silicon. Refer to the upstream Qwopus card for evaluation details and intended use.

License

Apache 2.0, inherited from the base model.
