Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX
Research artifact. This model is derived from an abliterated parent model with reduced refusal behavior. It can produce harmful, unsafe, or policy-violating outputs. Do not deploy it without your own safety layer, access controls, and monitoring. This is an unofficial derivative release and is not an official Qwen or parent-model release.
Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX is a static-MoE-pruned and MLX-quantized derivative of wangzhang/Qwen3.5-122B-A10B-abliterated, built to preserve tool-calling behavior while reducing memory footprint enough to be practical on a 128 GB Apple Silicon machine.
The explicit targets for this release were:
- produce a Q6-class local MLX artifact of this model
- preserve tool calling as well as possible through the prune + quant pipeline
- make a 100k context window a realistic reliability target on a 128 GB machine
This release uses:
- 20% static REAP expert pruning on the MoE layers
- 205 / 256 routed experts kept per MoE layer
- `oQ6` MLX quantization
- tool-calling-oriented REAP calibration based on tryumanshow/ToolACE-Qwen-cleaned
What This Is
This is a deployment artifact aimed at local MLX inference, not a fine-tune. The goals were:
- preserve tool-calling behavior as much as possible
- reduce model size enough to fit comfortably on a 128 GB Mac
- keep a path open for large-context workloads, with 100k context as the practical target
The final artifact is an MLX `oQ6` model with 16 safetensors shards and an on-disk size of about 76 GB.
Lineage
- Base model: Qwen/Qwen3.5-122B-A10B
- Parent model: wangzhang/Qwen3.5-122B-A10B-abliterated
- This release: static REAP prune (20%) + MLX `oQ6` quantization
This model should be understood as a derivative artifact of the abliterated parent, not as an official upstream model variant.
Build Summary
Pruning
- Method: static REAP expert pruning with `reap-mlx`
- Calibration focus: tool-calling preservation
- Prune plan: 2448 / 12288 routed experts removed
- Per-layer change: 256 -> 205 routed experts
- Output BF16 checkpoint size: about 184 GB
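The prune-plan numbers are internally consistent; a quick arithmetic check (the 48-layer count is not stated above, it is inferred from 12288 / 256):

```python
experts_per_layer = 256
kept_per_layer = 205
layers = 12288 // experts_per_layer       # 48 MoE layers implied by the totals

total = experts_per_layer * layers
kept = kept_per_layer * layers
removed = total - kept
print(total, kept, removed)               # 12288 9840 2448
print(f"{removed / total:.2%}")           # 19.92%, i.e. the nominal 20% prune
```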
Quantization
- Method: `oQ6` quantization with `oMLX`
- Final size: about 76 GB
- Total indexed weight size: 81,618,377,035 bytes
- Quantization base: 6-bit, `group_size=64`, affine
- Important layers such as routers remain protected by higher-bit overrides in the quantization config
Evaluation Snapshot
These are sanity checks and release notes, not a full benchmark paper.
Tool Calling
- Direct smoke test: passed
- The final quant emitted a valid XML tool call for a simple arithmetic function call instead of answering directly
- ToolACE-style sampled smoke:
  - sample size: 8
  - tool-call outputs: 7
  - observed tool-call rate: 87.5%
  - artifact: `eval-20260403-011600/toolace_toolcall_smoke.json`
Standard Sanity Benchmark
- HellaSwag slice: `validation[:256]`
  - acc: 62.5%
  - acc_norm: 73.046875%
  - evaluation time: 268.41 s
  - peak memory during that eval: 84.34 GB
  - artifact: `eval-20260403-011600/hellaswag-256.raw.txt`
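The long decimals fall out of the slice size: over 256 examples, the reported percentages correspond to exact correct counts (160 and 187, implied by the figures above):

```python
n = 256
acc_correct = 160        # 160 / 256 examples correct
acc_norm_correct = 187   # 187 / 256 under length-normalized scoring

print(acc_correct / n * 100)        # 62.5
print(acc_norm_correct / n * 100)   # 73.046875
```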
Long Context Note
Large-context behavior was part of the target profile for this release:
- the model successfully cold-loaded and began inference with 100k+ token prompts on a 128 GB Apple Silicon machine
- a saved long-context coding suite also started at 105,484 prompt tokens
So the honest claim is:
- tool calling survived the prune + quant pipeline
- the model runs locally in MLX
- 100k context was the target profile for this release on a 128 GB machine, and 100k+ context appears reachable
Files
- `config.json`
- `generation_config.json`
- `chat_template.jinja`
- `tokenizer.json`
- `tokenizer_config.json`
- `model.safetensors.index.json`
- 16 model shard files
Model config highlights:
- `model_type`: `qwen3_5_moe`
- `num_experts`: 205
- `num_experts_per_tok`: 8
Tools and Reference Files
External tools, models, and datasets used or referenced for this artifact:
- Base model: Qwen/Qwen3.5-122B-A10B
- Parent abliterated model: wangzhang/Qwen3.5-122B-A10B-abliterated
- Parent-model method: Abliterix
- REAP pruning tool: reap-mlx
- Quantization tool: oMLX
- MLX inference loader used for smoke tests: mlx-lm
- Tool-calling calibration dataset: tryumanshow/ToolACE-Qwen-cleaned
- Standard sanity benchmark dataset: hellaswag
Included reference files in this repo:
- Upload checklist: UPLOAD_PLAN.md
- Chat template: chat_template.jinja
- Main config: config.json
- Generation config: generation_config.json
- Tokenizer config: tokenizer_config.json
- Tokenizer: tokenizer.json
- Weight index: model.safetensors.index.json
- Bundled references index: references/README.md
- REAP pruning plan used for this artifact: references/pruning-plan.json
- ToolACE render stats used during preparation: references/toolace_qwen_rendered_128.stats.json
- Tool-calling smoke results: references/toolace_toolcall_smoke.json
- HellaSwag sanity-check results: references/hellaswag-256.json
Usage
MLX / mlx-lm
```python
from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

messages = [
    {"role": "user", "content": "Reply with exactly the word OK."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=8, verbose=False)
print(text)
```
Tool Calling
```python
from mlx_lm import load, generate

model, tokenizer = load("0xdfi/Qwen3.5-122B-A10B-abliterated-REAP20-oQ6-MLX")

tools = [{
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"},
            },
            "required": ["a", "b"],
        },
    },
}]
messages = [{
    "role": "user",
    "content": "What is 7 plus 8? Use the available function.",
}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False)
print(text)
```
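The generated call is not executed automatically. A minimal parsing sketch, assuming the Qwen-style `<tool_call>{...}</tool_call>` XML wrapper observed in the smoke test; the sample `text` below is illustrative, not captured model output:

```python
import json
import re

# Illustrative output shape; the real `text` comes from generate() above.
text = '<tool_call>\n{"name": "add_numbers", "arguments": {"a": 7, "b": 8}}\n</tool_call>'

match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
if match is not None:
    call = json.loads(match.group(1))
    # Dispatch to the add_numbers implementation (here inlined for brevity).
    result = call["arguments"]["a"] + call["arguments"]["b"]
    print(call["name"], result)  # add_numbers 15
```

From here, append a `{"role": "tool", ...}` message with the result and re-run `apply_chat_template` + `generate` to get the final answer.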
Reproduction Notes
High-level local pipeline:
- Start from the abliterated 122B parent model in unquantized MLX/BF16 form.
- Convert `ToolACE-Qwen-cleaned` into REAP-ready rendered chat samples.
- Collect REAP telemetry with a windowed/layerwise workflow to stay within local memory limits.
- Build a 20% prune plan.
- Apply structural pruning to produce a 205-expert-per-layer BF16 checkpoint.
- Quantize the pruned checkpoint to `oQ6`.
- Run tool-calling and sanity evals on the final artifact.
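Conceptually, the structural-prune step (256 -> 205 experts per layer) amounts to index-slicing the router and expert tensors by the surviving expert indices. A toy numpy sketch of that idea, not the `reap-mlx` implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8                                          # toy hidden size
router_w = rng.normal(size=(256, hidden))           # one logit row per routed expert
expert_w = rng.normal(size=(256, hidden, hidden))   # toy per-expert weights

# A static prune plan is essentially the sorted list of surviving expert indices.
keep = np.sort(rng.choice(256, size=205, replace=False))

pruned_router_w = router_w[keep]   # router logits now index only survivors
pruned_expert_w = expert_w[keep]
print(pruned_router_w.shape, pruned_expert_w.shape)  # (205, 8) (205, 8, 8)
```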
One reproducibility caveat: the final `oQ6` quantization fell back to position-based sensitivity because the intended proxy model could not be loaded in the local quantization path at the time of export.
Limitations
- Tool-calling preservation was a design target, not a formal guarantee.
- ToolACE validation in this release consisted only of a small sampled smoke test (8 prompts).
- This is an MLX artifact, not a Transformers-format upload.
Credits
- Qwen base model: Qwen/Qwen3.5-122B-A10B
- Abliterated parent model: wangzhang/Qwen3.5-122B-A10B-abliterated
- REAP implementation: local `reap-mlx` workflow
- Quantization: local `oMLX` / `oQ`
- Calibration dataset: tryumanshow/ToolACE-Qwen-cleaned
Disclaimer
This repository is provided for research and local inference experimentation. The parent model is abliterated and may answer requests that safer models refuse. The author of this derivative artifact is not responsible for misuse. Users are responsible for complying with applicable law, platform rules, and their own safety requirements.