# Qwopus MoE 35B-A3B — Claude Opus 4.6 Reasoning Distilled (GGUF)
QLoRA fine-tune of Qwen3.5-35B-A3B (MoE, 3B active parameters) with Claude Opus 4.6 reasoning distillation. Training recipe adapted from Jackrong's Qwopus3.5-27B-v3 — same datasets and methodology, applied to the MoE architecture.
## Credits
This model is heavily inspired by and based on the work of Jackrong and his Qwopus3.5-27B-v3 training methodology. The datasets, training philosophy ("act-then-refine" paradigm), and structural reasoning approach are all derived from his research. Please check his complete training guide for the full methodology.
The key difference: we adapted his recipe from the 27B dense model to the 35B-A3B MoE architecture.
## Available Quantizations
| Quantization | Size | BPW | Min VRAM |
|---|---|---|---|
| Q8_0 | 35 GB | 8.52 | 1x 48GB GPU |
| Q6_K | 27 GB | 6.58 | 1x 32GB GPU |
| Q5_K_M | 24 GB | 5.70 | 1x 32GB GPU |
| Q4_K_M | 20 GB | 4.87 | 1x 24GB GPU |
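As a sanity check, the listed file sizes follow roughly from total parameter count × bits per weight (BPW). A short sketch, assuming ~35B total parameters (taken from the model name) and ignoring tokenizer/metadata overhead:

```python
def approx_size_gib(total_params: float, bpw: float) -> float:
    """Approximate GGUF file size in GiB: parameters x bits-per-weight / 8 bytes."""
    return total_params * bpw / 8 / 2**30

TOTAL_PARAMS = 35e9  # assumption: ~35B total parameters

for quant, bpw in [("Q8_0", 8.52), ("Q6_K", 6.58), ("Q5_K_M", 5.70), ("Q4_K_M", 4.87)]:
    print(f"{quant}: ~{approx_size_gib(TOTAL_PARAMS, bpw):.1f} GiB")
```

The estimates land within about a GiB of the table; real files also carry tokenizer and metadata blocks, and the exact parameter count differs from a round 35B.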
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Max Context | 131,072 tokens (128K) |
## Benchmark Results

*Qwopus MoE (Jackrong recipe) vs. Opus Distilled v2 (previous QLoRA)*
Benchmarked across 8 diverse tasks: coding, bug detection, reasoning, instruction following, research, and agentic planning.
| Test | Qwopus MoE | Opus Distilled v2 | Winner |
|---|---|---|---|
| Coding: LRU Cache | 6.9KB content | 4.8KB content | Qwopus |
| Coding: Async Scraper | 8.5KB content | 7.6KB content | Qwopus |
| Bug Detection | 2.5KB + 2.1KB thinking | 2.4KB + 2.9KB thinking | Tie |
| Reasoning: Probability | 0 chars (stuck thinking) | 1.3KB content | v2 |
| Reasoning: Logic | 747 chars | 949 chars | v2 |
| JSON Output | 319 chars, 6.8s | 325 chars, 1.4s | v2 (5x faster) |
| Research: Architecture Analysis | 4.5KB content | 696 chars (overthinks) | Qwopus |
| Agentic: CI/CD Planning | 6.9KB content | 5.8KB content | Qwopus |
### Speed
| Model | tok/s |
|---|---|
| Qwopus MoE | 175 |
| Opus Distilled v2 | 204 |
### Verdict

Qwopus MoE produces more useful visible output, with a better content-to-thinking ratio. It excels at tasks requiring detailed, user-facing responses (coding, research, planning). Opus Distilled v2 is ~17% faster (204 vs. 175 tok/s) but its aggressive thinking mode sometimes produces minimal visible content.

**Best for:** coding assistants, research agents, content generation, and agentic workflows where output quality matters more than raw speed.
## Training Details

### Recipe (adapted from Jackrong's Qwopus3.5-27B-v3)
| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit base + LoRA adapters in BF16) |
| Framework | Unsloth 2026.4.2 + TRL |
| Base Model | unsloth/Qwen3.5-35B-A3B |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Targets | q_proj, k_proj, v_proj, o_proj (attention only) |
| Trainable Parameters | 6,881,280 (0.02% of 35B) |
| Learning Rate | 2e-5 (linear schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.001 |
| Optimizer | adamw_8bit |
| Epochs | 2 |
| Effective Batch Size | 12 (per-device batch 1 × 12 gradient accumulation steps) |
| Max Sequence Length | 4096 |
| Total Steps | 536 |
| Final Loss | 0.5517 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
| Training Time | ~3.5 hours |
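The arithmetic in the table is internally consistent: a micro-batch of 1 with 12 gradient-accumulation steps gives the effective batch of 12, and 3,209 examples over 2 epochs yields 536 optimizer steps. A quick sketch (the `config` dict merely mirrors the table for illustration; it is not the actual Unsloth/TRL invocation):

```python
from math import ceil

# Hyperparameters transcribed from the table above (illustrative only)
config = {
    "lora_rank": 32,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    "learning_rate": 2e-5,
    "warmup_ratio": 0.05,
    "weight_decay": 0.001,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 12,
    "max_seq_length": 4096,
}

# Effective batch = micro-batch x gradient accumulation
effective_batch = (config["per_device_train_batch_size"]
                   * config["gradient_accumulation_steps"])

# Optimizer steps: ceil(examples / effective batch) per epoch, times epochs
steps_per_epoch = ceil(3209 / effective_batch)  # 3,209 training examples
total_steps = steps_per_epoch * config["num_train_epochs"]
print(effective_batch, total_steps)  # → 12 536
```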
### Differences from Jackrong's 27B recipe
| Aspect | Jackrong (27B dense) | Ours (35B-A3B MoE) |
|---|---|---|
| Base model | Qwen3.5-27B (dense) | Qwen3.5-35B-A3B (MoE) |
| LoRA rank | 64 | 32 (GPU memory constraint) |
| LoRA targets | q, k, v, o, gate, up, down | q, k, v, o only (MoE experts too large) |
| Trainable params | ~0.5% | 0.02% |
| Batch size | ~36 | 12 |
| Context length | 8192 | 4096 (GPU memory constraint) |
### Datasets (3,209 examples after quality filtering)
| Dataset | Examples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 | Claude Opus 4.6 reasoning traces |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen reasoning conversations |
| Roman1111111/claude-opus-4.6-10000x | ~250 (after filtering) | Claude Opus 4.6 conversations |
**Quality filter:** required assistant content >100 characters.
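A minimal sketch of that filter, assuming the common chat-style `messages` schema. The card only specifies the >100-character threshold; the field names, and whether the rule applies to the last or every assistant turn, are assumptions:

```python
def passes_quality_filter(example: dict, min_chars: int = 100) -> bool:
    """Keep examples whose final assistant turn exceeds min_chars characters."""
    assistant_turns = [m for m in example["messages"] if m["role"] == "assistant"]
    return bool(assistant_turns) and len(assistant_turns[-1]["content"]) > min_chars

# Hypothetical examples in chat format
examples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "short"}]},           # dropped
    {"messages": [{"role": "user", "content": "explain LRU caches"},
                  {"role": "assistant", "content": "x" * 500}]},          # kept
]
kept = [ex for ex in examples if passes_quality_filter(ex)]
print(len(kept))  # → 1
```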
## Usage with llama.cpp

```bash
llama-server \
  --model Qwopus-MoE-35B-A3B-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 131072 \
  --host 0.0.0.0 --port 8082
```
The model uses `<think>...</think>` reasoning tags natively (inherited from the Qwen3.5 base).
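Because the reasoning arrives inline with the answer, clients that only want the visible output need to strip the tags. A minimal sketch, assuming the reasoning is always wrapped in literal `<think>` tags:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(text))
    visible = THINK_RE.sub("", text).strip()
    return thinking, visible

# Hypothetical model output
raw = "<think>Sort by recency, evict the oldest key.</think>Use an OrderedDict-based LRU cache."
thinking, visible = split_reasoning(raw)
print(visible)  # → Use an OrderedDict-based LRU cache.
```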