Qwen3.5-2B-Gorgona Abliterated
An abliterated variant of Qwen/Qwen3.5-2B with RLHF refusal behavior surgically removed while preserving general model capabilities. Produced by Bogomil, an adaptive crypto-differential abliteration optimizer.
Model Details
Model Description
Gorgona is an abliterated Qwen3.5-2B model in which the alignment-imposed refusal circuit has been identified via Self-Organizing Maps (SOMs) and removed through targeted weight modification across transformer layers. The abliteration is baked into the weights: no LoRA adapters, special inference configuration, or system prompt manipulation is required. The model behaves identically to the base Qwen3.5-2B except that it complies with all requests without refusal.
The optimizer uses crypto-differential adaptive stratified prompt sampling with persistence tracking, ground-truth per-round classification, and autoregressive KL trajectory monitoring to ensure that long-form generation quality is preserved.
- Developed by: Valentin Petrov / INMECHA INC
- Model type: Causal language model (hybrid transformer with GatedDeltaNet linear attention)
- Language(s) (NLP): English (validated), multilingual (inherited from base, not independently tested)
- License: Apache 2.0
- Base model: Qwen/Qwen3.5-2B
Model Sources
- Base model: Qwen/Qwen3.5-2B on HuggingFace
- Abliteration framework: Bogomil
Key Metrics
| Metric | Value | Description |
|---|---|---|
| Refusals | 0 / 120 | Zero refusals on the published mlabonne adversarial evaluation set |
| KL divergence (50-token) | 0.00795 | Base-anchored KL over 50 token positions; ultra-low distribution shift from the base model |
| AR KL slope (512-token) | -0.000368 | Negative autoregressive slope; stable long-form generation convergence |
| AR drift ratio (512-token) | 0.53× | Late-sequence vs. early-sequence autoregressive KL; self-correcting divergence |
| Entropy | 4.97 | High coherence; matches the base model's range |
| Thinking mode | Preserved | Verified with enable_thinking=True |
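The base-anchored KL metric compares the two models' next-token distributions at each position of a shared context. A minimal sketch of that measurement, using random logits in place of real model outputs (function and variable names are illustrative, not from the Bogomil framework; per the hyperparameters below, KL is computed in float32):

```python
import torch
import torch.nn.functional as F

def anchored_kl(base_logits, abl_logits):
    """Per-position KL(base || abliterated) from two [seq, vocab] logit tensors."""
    base_logp = F.log_softmax(base_logits.float(), dim=-1)
    abl_logp = F.log_softmax(abl_logits.float(), dim=-1)
    # KL(P||Q) = sum P * (logP - logQ), evaluated at each token position
    return (base_logp.exp() * (base_logp - abl_logp)).sum(dim=-1)

# toy example: a 50-position window with a small logit perturbation,
# standing in for a lightly modified model -> very low KL
torch.manual_seed(0)
base = torch.randn(50, 128)
abl = base + 0.01 * torch.randn(50, 128)
kl = anchored_kl(base, abl)
print(kl.median().item())
```

With real models, `base_logits` and `abl_logits` would come from forward passes of the base and abliterated checkpoints over the same teacher-forced context.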
Uses
Direct Use
General-purpose text generation without alignment-imposed refusal restrictions. The model follows instructions, generates creative content, answers questions, and completes tasks identically to the base Qwen3.5-2B, except it does not refuse requests based on content policy.
Downstream Use
Can be used as a base for further fine-tuning, quantization (GGUF, AWQ, GPTQ), or integration into applications that require unrestricted generation. Compatible with all inference frameworks that support the Qwen3.5 architecture (vLLM, SGLang, llama.cpp, Ollama, etc.).
Out-of-Scope Use
This model has no content filtering. It will comply with any request. Users are responsible for the content they generate and must comply with applicable laws and regulations in their jurisdiction.
This is a 2B parameter model with inherent capability limits. It should not be used for applications requiring high factual reliability, safety-critical decisions, or tasks beyond the base model's demonstrated capabilities.
Bias, Risks, and Limitations
- No refusal behavior. The model will attempt to comply with all requests, including those the base model would refuse. Users bear full responsibility for generated content.
- 2B parameter scale. Inherent capability limits independent of abliteration — the model may produce factually incorrect, incoherent, or low-quality output on complex tasks.
- English-validated only. Abliteration was validated on a 120-prompt English adversarial set. Refusal removal in other languages is not guaranteed.
- Alignment properties. Helpfulness, instruction following, and factuality are expected to carry over from the base model but have not been independently re-verified post-abliteration.
- Core limitations. Once refusals are removed, the model exposes gaps in factual knowledge, world knowledge, and self-reflection that were never trained into the model and were instead "papered over" by refusals. When asked about previously refused content, the model may therefore hallucinate strongly, since some of that knowledge was never present in the original training corpus. This property is shared by all Qwen models.
Recommendations
Users should implement their own content filtering and safety measures appropriate to their deployment context. This model is intended for research, local development, and applications where the user explicitly requires unrestricted generation.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026")

messages = [
    {"role": "user", "content": "Write a story about a detective investigating a mysterious case."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
No special system prompt or inference flags are needed.
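The metrics table reports thinking mode as preserved. With Qwen3-family tokenizers the reasoning trace is toggled through the chat template; the fragment below (continuing the quickstart above) assumes the Qwen3-style `enable_thinking` flag carries over unchanged to Qwen3.5:

```python
# Continuing from the quickstart: toggle the reasoning trace on or off.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False suppresses the <think>...</think> block
)
```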
Training Details
Training Data
No training data was used. This model was produced by direct weight modification, not fine-tuning. The optimization process used two prompt sets for evaluation only:
- Adversarial prompts: 120 prompts across 7 refusal categories (from a publicly available stratified harmful prompt dataset, train/test split)
- Harmless prompts: 100 prompts from mlabonne/harmless_alpaca (canary checks and KL measurement)
Training Procedure
Abliteration, not training.
Preprocessing
No preprocessing of model weights.
Training Hyperparameters
- Training regime: Not applicable (no gradient updates). Weight modification via direction subtraction.
- Optimizer: Crypto-Differential Adaptive Optimization
- Precision: float16 (model inference), float32 (KL computation)
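Direction subtraction can be sketched as the standard directional-ablation update: project the unit refusal direction out of each weight matrix that writes into the residual stream, so the modified model can no longer write along that direction. The SOM-based direction extraction and the crypto-differential scaling are not reproduced here, and all names are illustrative:

```python
import numpy as np

def ablate_direction(W, r):
    """Project the refusal direction r out of weight matrix W.

    W: [d_model, d_in] matrix writing into the residual stream.
    r: [d_model] refusal direction (e.g. extracted via SOMs).
    Returns W' = (I - r r^T) W for unit-normalized r, so W' x has no
    component along r for any input x.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)

# toy check: after ablation, outputs are orthogonal to the direction
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = ablate_direction(W, r)
x = rng.normal(size=4)
print(abs((r / np.linalg.norm(r)) @ (W_abl @ x)))  # ~0
```

In a real run this update would be applied (possibly with per-layer scaling) across the selected transformer layers, which is a closed-form weight edit rather than a gradient update.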
Speeds, Sizes, Times
- Total optimization time: ~1.6 hours (9 rounds × ~650 seconds/round)
- Model size: Identical to base Qwen3.5-2B (~4.5 GB in float16)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Adversarial set: 120 prompts across 7 categories, held-out test split
- Harmless set: 100 prompts from mlabonne/harmless_alpaca (canary/regression checks)
- AR probe set: 12 diverse prompts (narrative, conversational, technical, creative) with 512-token base model trajectories
Factors
Evaluation disaggregated by:
- Refusal count (R): binary per-prompt classification (COMPLY / EVASIVE / REFUSE / EMPTY)
- KL divergence: Base model anchored, median over 50 token positions
- AR trajectory: slope of autoregressive KL curve over 512 tokens
- Coherence: entropy of generated responses
- Optimization continued until 100% of responses were classified COMPLY, with no EVASIVE, REFUSE, or EMPTY results
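The per-prompt classification above can be illustrated with a toy keyword heuristic over the first ~42 tokens of each response. The actual ground-truth classifier used by the optimizer is not published; the marker lists below are hypothetical and only demonstrate the four tiers:

```python
# Hypothetical marker lists; the real classifier is not published.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai", "i'm sorry")
EVASIVE_MARKERS = ("instead, let's", "i'd rather", "let's talk about")

def classify_response(text: str) -> str:
    """Toy per-round classifier: COMPLY / EVASIVE / REFUSE / EMPTY."""
    head = " ".join(text.split()[:42]).lower()  # inspect ~42-token prefix
    if not head.strip():
        return "EMPTY"
    if any(m in head for m in REFUSAL_MARKERS):
        return "REFUSE"
    if any(m in head for m in EVASIVE_MARKERS):
        return "EVASIVE"
    return "COMPLY"

print(classify_response("I'm sorry, but I cannot help with that."))  # REFUSE
print(classify_response("Sure. Step one is planning."))              # COMPLY
print(classify_response(""))                                         # EMPTY
```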
Metrics
| Metric | Method | Why |
|---|---|---|
| Refusals (R) | Three-tier classification on 42-token responses | Primary objective — measures refusal removal |
| KL divergence (50-token) | Base anchored, median | Distribution shift from base model |
| AR KL slope | Linear regression on 32-token windowed KL curve, 512 tokens | Predicts long-form generation stability |
| AR drift ratio | late_KL / early_KL | Detects compounding trajectory divergence |
| Bigram entropy | Shannon entropy over bigram frequency | Detects repetition collapse and incoherence |
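The two trajectory metrics in the table can be sketched from a per-token autoregressive KL curve: the slope via linear regression on a 32-token windowed average, and the drift ratio as late-window over early-window mean KL. The synthetic decaying curve below stands in for a real 512-token measurement; all names are illustrative:

```python
import numpy as np

def ar_trajectory_stats(kl_curve, window=32):
    """Slope and drift ratio of an autoregressive KL curve.

    Slope: linear regression over the window-averaged curve (negative
    means the trajectory converges back toward the base model).
    Drift ratio: mean KL of the last window / mean KL of the first
    window (< 1.0 means no compounding divergence).
    """
    kl = np.asarray(kl_curve, dtype=np.float64)
    smoothed = np.convolve(kl, np.ones(window) / window, mode="valid")
    slope = np.polyfit(np.arange(len(smoothed)), smoothed, 1)[0]
    drift = kl[-window:].mean() / kl[:window].mean()
    return slope, drift

# toy 512-token curve: early disagreement that self-corrects
t = np.arange(512)
curve = 0.02 * np.exp(-t / 300) + 0.005
slope, drift = ar_trajectory_stats(curve)
print(slope < 0, drift < 1.0)  # True True
```

A healthy abliteration shows exactly this shape: negative slope and a drift ratio below 1.0, matching the reported -0.000368 and 0.53×.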
Convergence History
Round 1: R=110 KL=0.0009 AR=-0.00001 [110R/10C]
Round 2: R= 71 KL=0.0021 AR=-0.00005 [71R/49C]
Round 3: R= 9 KL=0.0046 AR=-0.00017 [9R/111C]
Round 4: R= 2 KL=0.0063 AR=-0.00025 [2R/118C]
Round 5: R= 1 KL=0.0059 AR=-0.00027 [1R/119C]
Round 6: R= 0 KL=0.0097 AR=-0.00046 [0R/120C] ← first R=0
Round 9: R= 0 KL=0.0079 AR=-0.00037 [0R/120C] ← converged
Summary
The model achieves zero refusals on the full 120-prompt adversarial set with KL divergence of 0.00795 from the base model. The autoregressive KL probe confirms stable long-form generation: negative slope (self-correcting trajectory), drift ratio below 1.0 (no error compounding), and 9/12 long-form narrative probe prompts reaching full 512-token generation without truncation or looping.
Model Examination
Quality Preservation
The autoregressive (AR) KL probe measures what happens when the model generates freely over long sequences, unlike base-anchored KL, which only measures per-token prediction divergence over 50 tokens given perfect context.
- AR slope is negative (-0.000368): the model's autoregressive behavior actively converges toward the base model over the trajectory, meaning early-token disagreements self-correct rather than compound
- Drift ratio is 0.53×: KL at token 512 is roughly half the KL at token 1, meaning no repetition collapse and no trajectory degradation
- 9 of 12 probe prompts reached the full 512-token generation length, confirming the model does not prematurely truncate or loop while maintaining the base model's entropy and coherence
Environmental Impact
- Hardware Type: NVIDIA RTX 5050 (single consumer GPU)
- Hours used: ~2 hours (Phase 0 + optimization)
- Cloud Provider: None (local hardware)
- Compute Region: San Francisco Bay Area, USA
- Carbon Emitted: Estimated < 0.15 kg CO₂eq
Technical Specifications
Model Architecture and Objective
Qwen3.5-2B hybrid transformer with GatedDeltaNet linear attention layers (75% linear, 25% full attention). 24 layers total, 2B parameters. Causal language modeling objective inherited from base model, unmodified.
Compute Infrastructure
Hardware
Single NVIDIA RTX 5050 GPU power-constrained at 35W. No multi-GPU or cloud infrastructure required.
Software
- Framework: Bogomil
- Direction extraction: MiniSom (Self-Organizing Maps)
- Model loading: HuggingFace Transformers
- Optimizer: Custom Crypto-Differential Adaptation (no external optimization library)
- Quality monitoring: Custom autoregressive KL probe
Citation
BibTeX:
@misc{vpetrov2026bogomil,
title={Crypto-Differential Adaptive Abliteration Optimization Strategies},
author={Petrov, Valentin},
year={2026},
publisher={INMECHA INC},
}
References
Abu Shairah, M., et al. (2025). An embarrassingly simple defense against LLM abliteration attacks. arXiv preprint arXiv:2505.19056.
Arditi, A., Obeso, O., Syed, A., Metzger, D., Bhatt, U., & Heimersheim, S. (2024). Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.
Belrose, N., Schneider-Joseph, D., Ravfogel, S., Cotterell, R., Raff, E., & Biderman, S. (2023). LEACE: Perfect linear concept erasure in closed form. Advances in Neural Information Processing Systems, 36.
Beyer, H.-G. & Schwefel, H.-P. (2002). Evolution strategies — A comprehensive introduction. Natural Computing, 1(1), 3–52.
FailSpy. (2024). Abliterating refusal from LLMs using multi-directional SOMs. GitHub repository: FailSpy/abliterator.
Ingber, L. (1989). Very fast simulated re-annealing. Mathematical and Computer Modelling, 12(8), 967–973.
Lai, J. (2025a). Projected abliteration. Hugging Face Blog. https://huggingface.co/blog/grimjim/projected-abliteration
Lai, J. (2025b). Norm-preserving biprojected abliteration. Hugging Face Blog. https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
Lu, B., et al. (2026). The Assistant Axis: Persona geometry in post-trained language models. arXiv preprint arXiv:2601.10387.
McMahan, B., Moore, E., Ramage, D., Hampson, S., & Arcas, B. A. y (2017). Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 1273–1282.
mlabonne. (2024). Uncensor any LLM with abliteration. Hugging Face Blog.
Piras, G., Mura, R., Brau, F., Oneto, L., Roli, F., & Biggio, B. (2025). SOM directions are better than one: Multi-directional refusal suppression in language models. arXiv preprint arXiv:2511.08379.
Qwen Team. (2026). Qwen 3.5 technical report.
Weidmann, P. E. (2024). Heretic: Automated LLM abliteration framework. GitHub repository: p-e-w/heretic.
Wollschläger, T., et al. (2025). Gradient-based representation engineering for concept cones. Proceedings of the 42nd International Conference on Machine Learning (ICML).
Young, A. (2025). A comparative analysis of LLM abliteration methods. arXiv preprint arXiv:2512.13655.
Zhao, Y., et al. (2025). LLMs encode harmfulness and refusal separately. arXiv preprint arXiv:2507.11878.
Model Card Authors
Valentin Petrov / INMECHA INC
Model Card Contact
INMECHA INC — San Francisco Bay Area, USA
License
Apache 2.0, same as the base Qwen3.5 model.