Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX

Research artifact. This model is derived from an abliterated parent model with reduced refusal behavior. It can produce harmful, unsafe, or policy-violating outputs. Do not deploy it without your own safety layer, access controls, and monitoring. This is an unofficial derivative release and is not an official Qwen or parent-model release.

Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX is a static-MoE-pruned and MLX-quantized derivative of wangzhang/Qwen3.5-122B-A10B-abliterated, built to preserve tool-calling behavior while reducing memory footprint enough to be practical on a 128 GB Apple Silicon machine.

The explicit targets for this release were:

  • produce a Q6-class local MLX artifact of this model
  • preserve tool calling as well as possible through the prune + quant pipeline
  • make a 100k context window a realistic reliability target on a 128 GB machine

This release uses:

  • 20% static REAP expert pruning on the MoE layers
  • 205 / 256 routed experts kept per MoE layer
  • oQ6 MLX quantization
  • Tool-calling-oriented REAP calibration based on tryumanshow/ToolACE-Qwen-cleaned

What This Is

This is a deployment artifact aimed at local MLX inference, not a fine-tune. The goals were:

  • preserve tool-calling behavior as much as possible
  • reduce model size enough to fit comfortably on a 128 GB Mac
  • keep a path open for large-context workloads, with 100k context as the practical target

The final artifact is an MLX oQ6 model with 16 safetensors shards and an on-disk size of about 76 GB.

Lineage

This model should be understood as a derivative artifact of the abliterated parent, not as an official upstream model variant.

Build Summary

Pruning

  • Method: static REAP expert pruning with reap-mlx
  • Calibration focus: tool-calling preservation
  • Prune plan: 2448 / 12288 routed experts removed
  • Per-layer change: 256 -> 205 routed experts
  • Output BF16 checkpoint size: about 184 GB
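The prune-plan numbers above are internally consistent; a quick arithmetic check (the MoE layer count is inferred from the totals, since it is not stated in this card):

```python
# Figures from the prune plan above.
total_experts = 12288      # routed experts across all MoE layers before pruning
experts_per_layer = 256    # routed experts per MoE layer before pruning
kept_per_layer = 205       # routed experts per MoE layer after pruning
removed_total = 2448       # experts removed by the 20% REAP plan

moe_layers = total_experts // experts_per_layer          # inferred: 48 MoE layers
removed_per_layer = experts_per_layer - kept_per_layer   # 51 experts per layer

assert moe_layers * removed_per_layer == removed_total
print(moe_layers, removed_per_layer, removed_total / total_experts)  # 48 51 0.19921875
```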

Quantization

  • Method: oQ6 quantization with oMLX
  • Final size: about 76 GB
  • Total indexed weight size: 81,618,377,035 bytes
  • Quantization base: 6-bit, group_size=64, affine
  • Important layers such as routers remain protected by higher-bit overrides in the quantization config
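For reference, the indexed byte count above matches the stated on-disk size, and the quantization recipe can be summarized as a settings dict. The override structure shown here is an illustrative sketch, not the literal contents of the repo's config.json:

```python
# Sanity-check the reported sizes (values from the bullets above).
indexed_bytes = 81_618_377_035
print(round(indexed_bytes / 2**30, 2))  # ~76 GiB, matching the "about 76 GB" figure

# Illustrative summary of the quantization recipe described above;
# the real config.json keys and structure may differ.
quantization = {
    "bits": 6,          # 6-bit base
    "group_size": 64,
    "mode": "affine",
    # higher-bit overrides protect sensitive layers such as MoE routers
    "overrides": {"router": {"bits": 8}},  # hypothetical override entry
}
```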

Evaluation Snapshot

These are sanity checks and release notes, not a full benchmark paper.

Tool Calling

  • Direct smoke test: passed
  • The final quant emitted a valid XML tool call for a simple arithmetic function instead of answering directly
  • ToolACE-style sampled smoke:
    • sample size: 8
    • tool-call outputs: 7
    • observed tool-call rate: 87.5%
    • artifact: eval-20260403-011600/toolace_toolcall_smoke.json
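The sampled smoke rate above is just tool-call outputs over sample size; a minimal sketch of how such a rate can be computed, assuming Qwen-style <tool_call>...</tool_call> output tags:

```python
# Count completions that contain a tool call and compute the observed rate.
def toolcall_rate(outputs):
    hits = sum("<tool_call>" in text for text in outputs)
    return hits / len(outputs)

# 7 of 8 sampled completions contained a tool call -> 0.875 (87.5%)
sample = ['<tool_call>{"name": "f"}</tool_call>'] * 7 + ["plain text answer"]
print(toolcall_rate(sample))  # 0.875
```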

Standard Sanity Benchmark

  • HellaSwag slice: validation[:256]
  • acc: 62.5%
  • acc_norm: 73.046875%
  • evaluation time: 268.41 s
  • peak memory during that eval: 84.34 GB
  • artifact: eval-20260403-011600/hellaswag-256.raw.txt
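The reported accuracies correspond to exact fractions of the 256-example slice (160 and 187 correct items, respectively):

```python
# acc and acc_norm above are exact fractions of the 256-item HellaSwag slice.
n = 256
acc = 160 / n        # 62.5%
acc_norm = 187 / n   # 73.046875%
assert acc == 0.625 and acc_norm == 0.73046875
print(acc, acc_norm)
```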

Long Context Note

Large-context behavior was part of the target profile for this release:

  • this model successfully cold-loaded and began inference with 100k+ token prompts on a 128 GB Apple Silicon machine
  • a saved long-context coding suite also started inference at 105,484 prompt tokens

So the honest claim is:

  • tool calling survived the prune + quant pipeline
  • the model runs locally in MLX
  • 100k context was the target profile for this release on a 128 GB machine
  • 100k+ context appears reachable

Files

  • config.json
  • generation_config.json
  • chat_template.jinja
  • tokenizer.json
  • tokenizer_config.json
  • model.safetensors.index.json
  • 16 model shard files

Model config highlights:

  • model_type: qwen3_5_moe
  • num_experts: 205
  • num_experts_per_tok: 8

Tools and Reference Files

External tools, models, and datasets used or referenced for this artifact:

  • Parent model: wangzhang/Qwen3.5-122B-A10B-abliterated
  • Pruning: reap-mlx
  • Quantization: oMLX
  • Calibration dataset: tryumanshow/ToolACE-Qwen-cleaned

Included reference files in this repo:

Usage

MLX / mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

messages = [
    {"role": "user", "content": "Reply with exactly the word OK."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=8, verbose=False)
print(text)

Tool Calling

from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

tools = [{
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"},
            },
            "required": ["a", "b"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": "What is 7 plus 8? Use the available function.",
}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False)
print(text)
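A successful run of the example above typically yields a <tool_call> block containing a JSON payload, per the Qwen chat-template convention; a minimal, illustrative parser (verify the exact tag format against your actual generations):

```python
import json
import re

# Illustrative parser for Qwen-style tool-call output.
def parse_tool_calls(text):
    calls = []
    for payload in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
        calls.append(json.loads(payload))
    return calls

out = '<tool_call>\n{"name": "add_numbers", "arguments": {"a": 7, "b": 8}}\n</tool_call>'
calls = parse_tool_calls(out)
print(calls[0]["name"], calls[0]["arguments"])  # add_numbers {'a': 7, 'b': 8}
```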

Reproduction Notes

High-level local pipeline:

  1. Start from the abliterated 122B parent model in unquantized MLX/BF16 form.
  2. Convert ToolACE-Qwen-cleaned into REAP-ready rendered chat samples.
  3. Collect REAP telemetry with a windowed/layerwise workflow to stay within local memory limits.
  4. Build a 20% prune plan.
  5. Apply structural pruning to produce a 205-expert-per-layer BF16 checkpoint.
  6. Quantize the pruned checkpoint to oQ6.
  7. Run tool-calling and sanity evals on the final artifact.
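Step 2 (rendering chat samples for REAP calibration) can be sketched roughly as follows. The record shape and the inline ChatML-style renderer are assumptions for illustration; the real pipeline would use the model's own tokenizer.apply_chat_template(...) on the actual ToolACE-Qwen-cleaned fields:

```python
import json

# Hypothetical shape of a ToolACE-style record; real field names may differ.
records = [
    {"messages": [
        {"role": "user", "content": "What is 7 plus 8?"},
        {"role": "assistant",
         "content": '<tool_call>{"name": "add_numbers", "arguments": {"a": 7, "b": 8}}</tool_call>'},
    ]}
]

def render_chatml(messages):
    # Simplified ChatML-style stand-in for tokenizer.apply_chat_template(...).
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

# One rendered text sample per record, ready to serialize for REAP telemetry.
samples = [json.dumps({"text": render_chatml(r["messages"])}) for r in records]
print(samples[0][:60])
```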

One reproducibility caveat: the final oQ6 quantization completed with a position-based sensitivity fallback, because the intended proxy model could not be loaded by the local quantization pipeline at the time of export.

Limitations

  • Tool-calling preservation was a design target, not a formal guarantee.
  • The ToolACE validation included only a small sampled smoke in this release note.
  • This is an MLX artifact, not a Transformers-format upload.

Credits

Disclaimer

This repository is provided for research and local inference experimentation. The parent model is abliterated and may answer requests that safer models refuse. The author of this derivative artifact is not responsible for misuse. Users are responsible for complying with applicable law, platform rules, and their own safety requirements.
