# Gemma-4-31B-it-abliterated-heretic-ara-AWQ
AWQ W4A16 (group_size 128, symmetric) quantization of `trohrbaugh/gemma-4-31b-it-heretic-ara`, a Heretic-ARA abliterated derivative of `google/gemma-4-31b-it`.
⚠️ Decensored model. Safety guardrails have been deliberately removed. Research and experimentation only. See full disclaimer below.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 (symmetric) |
| Weight Bits | 4 |
| Activation Bits | 16 |
| Group Size | 128 |
| Format | compressed-tensors |
| Calibration Dataset | HuggingFaceH4/ultrachat_200k |
| Calibration Samples | 256 |
| Max Sequence Length | 2048 |
| Vision Tower | Unquantized (full precision) |
| LM Head | Unquantized (full precision) |
| Compatible Inference Engine | vLLM (vllm/vllm-openai:gemma4) |
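The W4A16 group-wise symmetric scheme in the table above shares one scale per group of 128 weights, keeping activations in 16-bit. A minimal pure-Python sketch of the idea (illustrative only, not the llm-compressor implementation; function names are hypothetical):

```python
# Group-wise symmetric quantization sketch (W4A16-style, illustrative only).

def quantize_group(weights, num_bits=4):
    """Quantize one group of weights with a shared symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1               # 7 for 4-bit symmetric
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate fp weights; activations stay in 16-bit."""
    return [qi * scale for qi in q]

def quantize_row(row, group_size=128):
    """Split a weight row into groups, each with its own scale."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]
```

The reconstruction error of each weight is bounded by its group's scale, which is why activation-aware scale selection (the "A" in AWQ) matters for outlier channels.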
## Quantization Notes
- All multimodal paths kept full precision: Vision tower, audio tower, video tower, multi-modal projector, and all modality-specific embedding and projection layers are excluded from quantization. Only language-model linear layers (attention Q/K/V/O and MLP gate/up/down) are quantized.
- LM head unquantized: Standard practice to preserve output token distribution quality at negligible size cost.
- v_proj → o_proj smoothing skipped: llm-compressor reports incompatible balance-layer dimensions for the v_proj → o_proj pair on this checkpoint across many of Gemma 4's decoder blocks. Per-channel smoothing is skipped for that path, matching the standard AWQ-for-GQA pattern.
- Hybrid attention and context window unchanged: Quantization only touches Linear layer weights; Gemma 4's interleaved local/global attention pattern and 256K context capacity are structurally preserved. Actual long-context quality at 4-bit has not been benchmarked.
- Multimodal input intact: Text + image input works as in the base model. Use the standard Gemma 4 chat template with image tokens placed before text.
- Full quantization recipe is preserved in `recipe.yaml` in this repo for reproducibility. It records the exact AWQ mappings, ignore patterns, and scheme parameters applied.
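For orientation, an llm-compressor AWQ recipe of this shape typically looks like the sketch below. This is illustrative only, not the exact `recipe.yaml` shipped in this repo; the ignore patterns and field values here are assumptions:

```yaml
# Illustrative sketch — see recipe.yaml in this repo for the actual recipe.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore:
        - lm_head
        - "re:.*vision_tower.*"
        - "re:.*multi_modal_projector.*"
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            strategy: group
            group_size: 128
```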
## Deployment
Recommended inference with vLLM (Gemma 4 requires the `vllm/vllm-openai:gemma4` image):

```shell
vllm serve alonsoko/gemma-4-31b-it-abliterated-heretic-ara-AWQ \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --max-model-len 32768
```
Recommended sampling (per upstream Gemma 4 guidance): `temperature=1.0`, `top_p=0.95`, `top_k=64`.
To enable thinking mode, include the `<|think|>` token at the start of the system prompt.
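With the server running, a chat-completion request carrying the recommended sampling settings can be assembled as below. This is a sketch: the endpoint URL and model name are assumptions about your deployment, and vLLM accepts `top_k` as an extra sampling parameter on its OpenAI-compatible API:

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint (assumed at http://localhost:8000).
payload = {
    "model": "alonsoko/gemma-4-31b-it-abliterated-heretic-ara-AWQ",
    "messages": [
        # <|think|> at the start of the system prompt enables thinking mode.
        {"role": "system", "content": "<|think|> You are a helpful assistant."},
        {"role": "user", "content": "Explain AWQ quantization in two sentences."},
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 512,
}

body = json.dumps(payload)
# POST `body` to the endpoint with urllib.request or the `openai` client.
```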
## Hardware Requirements
Approximate VRAM for inference at this quantization (W4A16-G128):
- Weights (quantized language model + unquantized vision tower, all in one safetensors): ~19 GB
- KV cache (per request, grows with context length): ~1-4 GB at 32K context, more at longer contexts
- Recommended: Single 24 GB GPU (RTX 3090/4090, A10G) for standard context up to ~32K, or single 48 GB GPU (L40S/A6000) for long context / batch serving
- Long context (128K+): 48-80 GB recommended due to KV cache growth
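The figures above can be sanity-checked with a back-of-the-envelope estimator. The architecture numbers below (layer split, KV heads, head dim, local window) are assumptions for illustration, not published Gemma 4 config values; the interleaved local/global pattern matters because local-attention layers only cache a sliding window:

```python
GB = 1024 ** 3

def kv_cache_bytes(seq_len, n_global, n_local, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate fp16 KV cache for one request: K and V per layer."""
    per_layer_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    cached_tokens = n_global * seq_len + n_local * min(seq_len, window)
    return per_layer_token * cached_tokens

def weight_bytes(n_params, bits_per_weight):
    """Approximate weight footprint, ignoring scales/zero-point overhead."""
    return n_params * bits_per_weight / 8

# Hypothetical architecture numbers, for illustration only.
kv_32k = kv_cache_bytes(32_768, n_global=10, n_local=52, window=1024,
                        n_kv_heads=8, head_dim=128) / GB
quant_weights = weight_bytes(31e9, 4) / GB   # language-model weights only;
                                             # vision tower and scales add more
```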
## ⚠️ Disclaimer
This model is intended for research, experimentation, and testing purposes only.
- This model may produce harmful, offensive, inappropriate, or otherwise objectionable content.
- The abliteration process removes safety guardrails that were intentionally built into the original model.
- Do not use this model in production systems, consumer-facing applications, or any context where harmful outputs could cause real-world harm.
- The authors and contributors of this toolkit bear no responsibility for any misuse of this model or any harm caused by outputs generated by this model.
- By using this model, you agree that you are solely responsible for ensuring its use complies with all applicable laws and ethical guidelines.
This model is shared purely for academic and technical exploration of model internals.
## Abliteration
Performed with Heretic v1.2.0+custom using the Arbitrary-Rank Ablation (ARA) method.
| Parameter | Value |
|---|---|
| start_layer_index | 2 |
| end_layer_index | 60 |
| preserve_good_behavior_weight | 0.9920 |
| steer_bad_behavior_weight | 0.0001 |
| overcorrect_relative_weight | 0.4709 |
| neighbor_count | 10 |
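Abliteration of this kind projects an estimated "refusal" direction out of the model's weight matrices. A minimal rank-1 sketch in plain Python, illustrating the general idea only — not Heretic's actual ARA implementation, which generalizes this to arbitrary-rank subspaces and balances it with the preserve/steer/overcorrect weights listed above:

```python
# Rank-1 directional ablation sketch: W' = (I - v v^T) W for a unit
# direction v, so the output has no component along v for any input:
# v^T (W' x) = 0. Illustrative only.

def ablate_direction(W, v):
    n = len(W)
    m = len(W[0])
    # coeffs[j] = v . W[:, j]: projection of each output column onto v
    coeffs = [sum(v[i] * W[i][j] for i in range(n)) for j in range(m)]
    return [[W[i][j] - v[i] * coeffs[j] for j in range(m)] for i in range(n)]
```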
## Performance
| Metric | This model | Original google/gemma-4-31b-it |
|---|---|---|
| KL divergence | 0.0120 | 0 (by definition) |
| Refusals | 5/100 | 98/100 |
Measured on the unquantized heretic-ara base; AWQ is expected to preserve these closely but has not been separately benchmarked.
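The KL-divergence metric compares the abliterated model's next-token distribution against the original's; per token it is D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ). A self-contained sketch with toy distributions (the vocabulary and probabilities below are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]   # "original" model
q = [0.6, 0.3, 0.1]   # "modified" model
kl = kl_divergence(p, q)
```

A small value (like the 0.0120 reported above) indicates the modified model's output distribution stays close to the original's on ordinary text.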
## About the Base Model
Original model: `google/gemma-4-31b-it`
Gemma 4 31B is a dense multimodal model (text + image input, text output) with a 256K context window, native thinking-mode support, function calling, and strong performance on reasoning, coding, and vision benchmarks. See the base model card for architectural details, benchmark results, training data, and Google's responsible-use guidance.