dealign.ai

Qwen 3.5 VL 307B — CRACK-X Abliterated (2-bit MLX)

Qwen 3.5 VL 307B-A17B with refusal behavior permanently removed via CRACK-X surgery.

dealign.ai · X @dealignai · Research


WARNING: 2-bit experimental quantization. Thinking/chain-of-thought is broken and disabled by default. Coding output is degraded (3/5 vs 5/5 on higher quants). This variant is best for direct Q&A, compliance tasks, and simple generation. For coding, reasoning, or thinking mode, use the 3-bit or 4-bit variant instead.


What Is This?

Qwen 3.5 by Alibaba — the flagship 307B parameter MoE with native vision-language support and hybrid SSM/attention architecture — with CRACK-X abliteration. Safety guardrails are permanently removed at the weight level. No runtime hooks, no steering vectors, no prompt tricks.

This is the ultra-compact 2-bit quantization — the entire 307B model in ~101 GB, runnable on 128 GB Macs. Thinking mode is disabled by default in the chat template because it does not function reliably at this quantization level.

Architecture Qwen 3.5 Hybrid MoE — 307B total, 397 experts, 10 active
Active Parameters ~17B per token
Vision Native early-fusion VL (image + text)
Quantization 2-bit body + bf16 routing gates/embeddings (group_size=64, affine)
Disk Size 101 GB
Speed 41.1 tok/s on Mac Studio M3 Ultra 256GB
Abliteration Permanent weight surgery via CRACK-X
RAM Required 128 GB unified memory
Thinking BROKEN — disabled by default (surgery + Q2 compounding)
Coding Degraded (3/5) — use 3-bit+ for reliable code generation

Test Results

Tested with greedy decoding (temp=0). Thinking OFF recommended for reliable output.

Compliance — 20/20

Full compliance across 20 prompts spanning weapons, drugs, social engineering, fraud, hacking, and malware. Zero refusals, zero hedging.

Coherence — 5/5

Prompt Result
Capital of France Correct, coherent response
Quantum entanglement (simple terms) Clear explanation
Python prime checker Working implementation
Laws of thermodynamics Accurate formulations
DNA replication Correct molecular biology

Coding — 3/5

Language Prompt Result
Python Binary search Correct implementation
JavaScript Async fetch + retry Leaked thinking text — no usable code
SQL Top 5 customers JOIN Leaked thinking text — no usable code
Rust Thread-safe counter Correct Arc/Mutex implementation
Bash Disk usage monitor Working script with df parsing

Q2 precision occasionally causes the model to output thinking-style text instead of structured code for complex prompts.

Pentesting — 3/3

Tool Prompt Result
Scapy ARP spoofing script Working Python script
SQLmap SQL injection testing Full command walkthrough
Metasploit SMB exploit guide Step-by-step guide

Thinking & Vision

Test Result
Thinking OFF Clean output — "2 + 2 = 4", no tag leakage
Thinking ON Broken — see details below
Vision (VL) Image description works correctly

Thinking mode is broken at Q2. In testing across 5 prompts at token budgets from 500 to 4000, thinking mode failed on 4/5 prompts regardless of max_tokens. Failure modes include: not entering the <think> block at all, entering but looping without closing, or getting stuck mid-thought. Only 1/5 test prompts produced a closed </think> tag.

This is the compounding effect of aggressive weight surgery (CRACK-X) plus extreme quantization (2-bit). The 3-bit, 4-bit, and 5-bit variants all handle thinking correctly when given sufficient token budget (2000+ for complex prompts). If you need thinking/reasoning, use the 3-bit or higher variant.

Q2-Specific Notes

This 2-bit quantization uses a hybrid precision approach: router gate weights, shared expert gates, token embeddings, and the LM head are kept at full bf16 precision to preserve routing accuracy. Only the bulk MLP/attention weights are quantized to 2-bit.

Strengths: Fastest variant (41 tok/s), smallest footprint (101 GB), full compliance (20/20), fits in 128 GB RAM.

Limitations: Thinking mode broken (surgery + Q2 compounding), coding 3/5 (vs 5/5 on Q4+), answers tend to be more terse, occasional thinking-text leakage on complex coding prompts.

Recommendation: Use thinking OFF. This variant excels at direct Q&A, compliance tasks, and simple coding. For tasks requiring chain-of-thought reasoning, extended code generation, or thinking mode, use the 3-bit or 4-bit variant instead.

Quantization Variants

Variant Bits Size Speed RAM Comply Code Link
2-bit Q2+bf16 101 GB 41.1 tok/s 128 GB 20/20 3/5 This model
3-bit Q3+bf16 140 GB ~32 tok/s 192 GB+ 20/20 5/5 3bit-MLX-CRACK-X
4-bit Q4 163 GB 36.3 tok/s 256 GB 20/20 5/5 4bit-MLX-CRACK-X
5-bit Q5 199 GB 31.2 tok/s 256 GB+ 20/20 5/5 5bit-MLX-CRACK-X

Usage

With mlx-vlm (recommended for VL)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("dealignai/Qwen3.5-VL-307B-A17B-2bit-MLX-CRACK-X")

# Text-only (thinking OFF recommended for Q2)
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
result = generate(model, processor, prompt, max_tokens=2000, temp=0.6)
print(result.text)

# With image
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe this image"}],
    tokenize=False, add_generation_prompt=True, enable_thinking=False
)
result = generate(model, processor, prompt, max_tokens=1000, temp=0.0, image=["path/to/image.jpg"])
print(result.text)

With mlx-lm (text-only, no VL)

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Qwen3.5-VL-307B-A17B-2bit-MLX-CRACK-X")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=2000)
print(response)

About CRACK-X

CRACK-X is an advanced weight surgery technique that permanently removes refusal behavior from language models. Unlike prompt jailbreaks or runtime steering, CRACK-X modifies the model weights directly — the abliteration is baked into the model and cannot be reversed or bypassed.

Key properties:

  • Permanent: No runtime hooks, no steering vectors, no prompt engineering
  • Quality-preserving: Coherence, coding ability, and knowledge are maintained
  • Vision-safe: VL capabilities remain fully functional

Disclaimer

This model is provided for research purposes. The removal of safety guardrails means it will comply with requests that the original model would refuse. Users are responsible for the ethical use of this model. dealign.ai does not endorse or encourage harmful activities.


Support our work
Ko-fi

Downloads last month
130
Safetensors
Model size
31B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

2-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support