Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX

Research artifact. This model is derived from an abliterated parent model with reduced refusal behavior. It can produce harmful, unsafe, or policy-violating outputs. Do not deploy it without your own safety layer, access controls, and monitoring. This is an unofficial derivative release and is not an official Qwen or parent-model release.

Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX is a static-MoE-pruned and MLX-quantized derivative of wangzhang/Qwen3.5-122B-A10B-abliterated, built to preserve tool-calling behavior while reducing memory footprint enough to be practical on a 128 GB Apple Silicon machine.

The explicit targets for this release were:

  • produce a Q6-class local MLX artifact of this model
  • preserve tool calling as well as possible through the prune + quant pipeline
  • make a 100k context window a realistic reliability target on a 128 GB machine

This release uses:

  • 20% static REAP expert pruning on the MoE layers
  • 205 / 256 routed experts kept per MoE layer
  • oQ6 MLX quantization
  • Tool-calling-oriented REAP calibration based on tryumanshow/ToolACE-Qwen-cleaned

What This Is

This is a deployment artifact aimed at local MLX inference, not a fine-tune. The goals were:

  • preserve tool-calling behavior as much as possible
  • reduce model size enough to fit comfortably on a 128 GB Mac
  • keep a path open for large-context workloads, with 100k context as the practical target

The final artifact is an MLX oQ6 model with 16 safetensors shards and an on-disk size of about 76 GB.

Lineage

This model should be understood as a derivative artifact of the abliterated parent, not as an official upstream model variant.

Build Summary

Pruning

  • Method: static REAP expert pruning with reap-mlx
  • Calibration focus: tool-calling preservation
  • Prune plan: 2448 / 12288 routed experts removed
  • Per-layer change: 256 -> 205 routed experts
  • Output BF16 checkpoint size: about 184 GB
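The prune-plan numbers above are internally consistent; a quick arithmetic check (the MoE layer count is inferred from the totals, since it is not stated in this card):

```python
# Figures from the prune plan above.
total_experts = 12288      # routed experts across all MoE layers before pruning
experts_per_layer = 256    # routed experts per MoE layer before pruning
kept_per_layer = 205       # routed experts per MoE layer after pruning
removed_total = 2448       # experts removed by the 20% REAP plan

moe_layers = total_experts // experts_per_layer          # inferred: 48 MoE layers
removed_per_layer = experts_per_layer - kept_per_layer   # 51 experts per layer

assert moe_layers * removed_per_layer == removed_total
print(moe_layers, removed_per_layer, removed_total / total_experts)  # 48 51 0.19921875
```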

Quantization

  • Method: oQ6 quantization with oMLX
  • Final size: about 76 GB
  • Total indexed weight size: 81,618,377,035 bytes
  • Quantization base: 6-bit, group_size=64, affine
  • Important layers such as routers remain protected by higher-bit overrides in the quantization config
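For reference, the indexed byte count above matches the stated on-disk size, and the quantization recipe can be summarized as a settings dict. The override structure shown here is an illustrative sketch, not the literal contents of the repo's config.json:

```python
# Sanity-check the reported sizes (values from the bullets above).
indexed_bytes = 81_618_377_035
print(round(indexed_bytes / 2**30, 2))  # ~76 GiB, matching the "about 76 GB" figure

# Illustrative summary of the quantization recipe described above;
# the real config.json keys and structure may differ.
quantization = {
    "bits": 6,          # 6-bit base
    "group_size": 64,
    "mode": "affine",
    # higher-bit overrides protect sensitive layers such as MoE routers
    "overrides": {"router": {"bits": 8}},  # hypothetical override entry
}
```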

Evaluation Snapshot

These are sanity checks and release notes, not a full benchmark paper.

Tool Calling

  • Direct smoke test: passed
  • The final quant emitted a valid XML tool call for a simple arithmetic function instead of answering directly
  • ToolACE-style sampled smoke:
    • sample size: 8
    • tool-call outputs: 7
    • observed tool-call rate: 87.5%
    • artifact: eval-20260403-011600/toolace_toolcall_smoke.json
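The sampled smoke rate above is just tool-call outputs over sample size; a minimal sketch of how such a rate can be computed, assuming Qwen-style <tool_call>...</tool_call> output tags:

```python
# Count completions that contain a tool call and compute the observed rate.
def toolcall_rate(outputs):
    hits = sum("<tool_call>" in text for text in outputs)
    return hits / len(outputs)

# 7 of 8 sampled completions contained a tool call -> 0.875 (87.5%)
sample = ['<tool_call>{"name": "f"}</tool_call>'] * 7 + ["plain text answer"]
print(toolcall_rate(sample))  # 0.875
```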

Standard Sanity Benchmark

  • HellaSwag slice: validation[:256]
  • acc: 62.5%
  • acc_norm: 73.046875%
  • evaluation time: 268.41 s
  • peak memory during that eval: 84.34 GB
  • artifact: eval-20260403-011600/hellaswag-256.raw.txt
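The reported accuracies correspond to exact fractions of the 256-example slice (160 and 187 correct items, respectively):

```python
# acc and acc_norm above are exact fractions of the 256-item HellaSwag slice.
n = 256
acc = 160 / n        # 62.5%
acc_norm = 187 / n   # 73.046875%
assert acc == 0.625 and acc_norm == 0.73046875
print(acc, acc_norm)
```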

Long Context Note

Large-context behavior was part of the target profile for this release:

  • this model successfully cold-loaded and began inference with 100k+ token prompts on a 128 GB Apple Silicon machine
  • a saved long-context coding suite also started inference at 105,484 prompt tokens

So the honest claim is:

  • tool calling survived the prune + quant pipeline
  • the model runs locally in MLX
  • 100k context was the target profile for this release on a 128 GB machine
  • 100k+ context appears reachable

Files

  • config.json
  • generation_config.json
  • chat_template.jinja
  • tokenizer.json
  • tokenizer_config.json
  • model.safetensors.index.json
  • 16 model shard files

Model config highlights:

  • model_type: qwen3_5_moe
  • num_experts: 205
  • num_experts_per_tok: 8

Tools and Reference Files

External tools, models, and datasets used or referenced for this artifact:

  • Parent model: wangzhang/Qwen3.5-122B-A10B-abliterated
  • Pruning: reap-mlx
  • Quantization: oMLX
  • Calibration dataset: tryumanshow/ToolACE-Qwen-cleaned

Included reference files in this repo:

Usage

MLX / mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

messages = [
    {"role": "user", "content": "Reply with exactly the word OK."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=8, verbose=False)
print(text)

Tool Calling

from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

tools = [{
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"},
            },
            "required": ["a", "b"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": "What is 7 plus 8? Use the available function.",
}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False)
print(text)
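A successful run of the example above typically yields a <tool_call> block containing a JSON payload, per the Qwen chat-template convention; a minimal, illustrative parser (verify the exact tag format against your actual generations):

```python
import json
import re

# Illustrative parser for Qwen-style tool-call output.
def parse_tool_calls(text):
    calls = []
    for payload in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
        calls.append(json.loads(payload))
    return calls

out = '<tool_call>\n{"name": "add_numbers", "arguments": {"a": 7, "b": 8}}\n</tool_call>'
calls = parse_tool_calls(out)
print(calls[0]["name"], calls[0]["arguments"])  # add_numbers {'a': 7, 'b': 8}
```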

Reproduction Notes

High-level local pipeline:

  1. Start from the abliterated 122B parent model in unquantized MLX/BF16 form.
  2. Convert ToolACE-Qwen-cleaned into REAP-ready rendered chat samples.
  3. Collect REAP telemetry with a windowed/layerwise workflow to stay within local memory limits.
  4. Build a 20% prune plan.
  5. Apply structural pruning to produce a 205-expert-per-layer BF16 checkpoint.
  6. Quantize the pruned checkpoint to oQ6.
  7. Run tool-calling and sanity evals on the final artifact.
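Step 2 (rendering chat samples for REAP calibration) can be sketched roughly as follows. The record shape and the inline ChatML-style renderer are assumptions for illustration; the real pipeline would use the model's own tokenizer.apply_chat_template(...) on the actual ToolACE-Qwen-cleaned fields:

```python
import json

# Hypothetical shape of a ToolACE-style record; real field names may differ.
records = [
    {"messages": [
        {"role": "user", "content": "What is 7 plus 8?"},
        {"role": "assistant",
         "content": '<tool_call>{"name": "add_numbers", "arguments": {"a": 7, "b": 8}}</tool_call>'},
    ]}
]

def render_chatml(messages):
    # Simplified ChatML-style stand-in for tokenizer.apply_chat_template(...).
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

# One rendered text sample per record, ready to serialize for REAP telemetry.
samples = [json.dumps({"text": render_chatml(r["messages"])}) for r in records]
print(samples[0][:60])
```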

One reproducibility caveat: the final oQ6 quantization completed with a position-based sensitivity fallback, because the intended proxy model could not be loaded by the local quantization pipeline at the time of export.

Limitations

  • Tool-calling preservation was a design target, not a formal guarantee.
  • The ToolACE validation included only a small sampled smoke in this release note.
  • This is an MLX artifact, not a Transformers-format upload.

Credits

Disclaimer

This repository is provided for research and local inference experimentation. The parent model is abliterated and may answer requests that safer models refuse. The author of this derivative artifact is not responsible for misuse. Users are responsible for complying with applicable law, platform rules, and their own safety requirements.
