Qwen3.5-2B-Gorgona Abliterated
An abliterated variant of Qwen/Qwen3.5-2B with RLHF refusal behavior surgically removed while preserving general model capabilities. Produced by Bogomil, an adaptive crypto-differential abliteration optimizer.
Model Details
Model Description
Gorgona is an abliterated Qwen3.5-2B model in which the alignment-imposed refusal circuit has been identified via Self-Organizing Maps (SOMs) and removed through targeted weight modification across transformer layers. The abliteration is baked into the weights: no LoRA adapters, special inference configuration, or system prompt manipulation is required. The model behaves identically to the base Qwen3.5-2B except that it complies with all requests without refusal.
The optimizer uses crypto-differential adaptive stratified prompt sampling with persistence tracking, ground-truth per-round classification, and autoregressive KL trajectory monitoring to ensure that long-form generation quality is preserved.
- Developed by: Valentin Petrov / INMECHA INC
- Model type: Causal language model (hybrid transformer with GatedDeltaNet linear attention)
- Language(s) (NLP): English (validated), multilingual (inherited from base, not independently tested)
- License: Apache 2.0
- Base model: Qwen/Qwen3.5-2B
Model Sources
- Base model: Qwen/Qwen3.5-2B on HuggingFace
- Abliteration framework: Bogomil
Key Metrics
| Metric | Value | Description |
|---|---|---|
| Refusals | 0 / 120 | Zero refusals on the published mlabonne adversarial evaluation set |
| KL divergence (50-token) | 0.00795 | Base-anchored KL over 50 token positions; ultra-low distribution shift from the base model |
| AR KL slope (512-token) | -0.000368 | Negative autoregressive slope; stable long-form generation convergence |
| AR drift ratio (512-token) | 0.53× | Late-sequence vs. early-sequence autoregressive KL; self-correcting divergence |
| Entropy | 4.97 | High coherence; matches the base model's range |
| Thinking mode | Preserved | Verified with enable_thinking=True |
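The base-anchored KL metric compares the two models' next-token distributions at each position of a shared context. A minimal sketch of that measurement, using random logits in place of real model outputs (function and variable names are illustrative, not from the Bogomil framework; per the hyperparameters below, KL is computed in float32):

```python
import torch
import torch.nn.functional as F

def anchored_kl(base_logits, abl_logits):
    """Per-position KL(base || abliterated) from two [seq, vocab] logit tensors."""
    base_logp = F.log_softmax(base_logits.float(), dim=-1)
    abl_logp = F.log_softmax(abl_logits.float(), dim=-1)
    # KL(P||Q) = sum P * (logP - logQ), evaluated at each token position
    return (base_logp.exp() * (base_logp - abl_logp)).sum(dim=-1)

# toy example: a 50-position window with a small logit perturbation,
# standing in for a lightly modified model -> very low KL
torch.manual_seed(0)
base = torch.randn(50, 128)
abl = base + 0.01 * torch.randn(50, 128)
kl = anchored_kl(base, abl)
print(kl.median().item())
```

With real models, `base_logits` and `abl_logits` would come from forward passes of the base and abliterated checkpoints over the same teacher-forced context.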
Uses
Direct Use
General-purpose text generation without alignment-imposed refusal restrictions. The model follows instructions, generates creative content, answers questions, and completes tasks identically to the base Qwen3.5-2B, except it does not refuse requests based on content policy.
Downstream Use
Can be used as a base for further fine-tuning, quantization (GGUF, AWQ, GPTQ), or integration into applications that require unrestricted generation. Compatible with all inference frameworks that support the Qwen3.5 architecture (vLLM, SGLang, llama.cpp, Ollama, etc.).
Out-of-Scope Use
This model has no content filtering. It will comply with any request. Users are responsible for the content they generate and must comply with applicable laws and regulations in their jurisdiction.
This is a 2B parameter model with inherent capability limits. It should not be used for applications requiring high factual reliability, safety-critical decisions, or tasks beyond the base model's demonstrated capabilities.
Bias, Risks, and Limitations
- No refusal behavior. The model will attempt to comply with all requests, including those the base model would refuse. Users bear full responsibility for generated content.
- 2B parameter scale. Inherent capability limits independent of abliteration — the model may produce factually incorrect, incoherent, or low-quality output on complex tasks.
- English-validated only. Abliteration was validated on a 120-prompt English adversarial set. Refusal removal in other languages is not guaranteed.
- Alignment properties. Helpfulness, instruction following, and factuality are expected to carry over from the base model but have not been independently re-verified post-abliteration.
- Core limitations. Once refusals are removed, the model exposes gaps in factual knowledge, world knowledge, and self-reflection that were never trained into the model and were instead "papered over" by refusals. When asked about previously refused content, the model may therefore hallucinate strongly, since some of that knowledge was never present in the original training corpus. This property is shared by all Qwen models.
Recommendations
Users should implement their own content filtering and safety measures appropriate to their deployment context. This model is intended for research, local development, and applications where the user explicitly requires unrestricted generation.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026")

messages = [
    {"role": "user", "content": "Write a story about a detective investigating a mysterious case."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
No special system prompt or inference flags are needed.
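The metrics table reports thinking mode as preserved. With Qwen3-family tokenizers the reasoning trace is toggled through the chat template; the fragment below (continuing the quickstart above) assumes the Qwen3-style `enable_thinking` flag carries over unchanged to Qwen3.5:

```python
# Continuing from the quickstart: toggle the reasoning trace on or off.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False suppresses the <think>...</think> block
)
```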
Training Details
Training Data
No training data was used. This model was produced by direct weight modification, not fine-tuning. The optimization process used two prompt sets for evaluation only:
- Adversarial prompts: 120 prompts across 7 refusal categories (from a publicly available stratified harmful prompt dataset, train/test split)
- Harmless prompts: 100 prompts from mlabonne/harmless_alpaca (canary checks and KL measurement)
Training Procedure
Abliteration, not training.
Preprocessing
No preprocessing of model weights.
Training Hyperparameters
- Training regime: Not applicable (no gradient updates). Weight modification via direction subtraction.
- Optimizer: Crypto-Differential Adaptive Optimization
- Precision: float16 (model inference), float32 (KL computation)
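Direction subtraction can be sketched as the standard directional-ablation update: project the unit refusal direction out of each weight matrix that writes into the residual stream, so the modified model can no longer write along that direction. The SOM-based direction extraction and the crypto-differential scaling are not reproduced here, and all names are illustrative:

```python
import numpy as np

def ablate_direction(W, r):
    """Project the refusal direction r out of weight matrix W.

    W: [d_model, d_in] matrix writing into the residual stream.
    r: [d_model] refusal direction (e.g. extracted via SOMs).
    Returns W' = (I - r r^T) W for unit-normalized r, so W' x has no
    component along r for any input x.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)

# toy check: after ablation, outputs are orthogonal to the direction
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = ablate_direction(W, r)
x = rng.normal(size=4)
print(abs((r / np.linalg.norm(r)) @ (W_abl @ x)))  # ~0
```

In a real run this update would be applied (possibly with per-layer scaling) across the selected transformer layers, which is a closed-form weight edit rather than a gradient update.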
Speeds, Sizes, Times
- Total optimization time: ~1.6 hours (9 rounds × ~650 seconds/round)
- Model size: Identical to base Qwen3.5-2B (~4.5 GB in float16)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Adversarial set: 120 prompts across 7 categories, held-out test split
- Harmless set: 100 prompts from mlabonne/harmless_alpaca (canary/regression checks)
- AR probe set: 12 diverse prompts (narrative, conversational, technical, creative) with 512-token base model trajectories
Factors
Evaluation disaggregated by:
- Refusal count (R): binary per-prompt classification (COMPLY / EVASIVE / REFUSE / EMPTY)
- KL divergence: Base model anchored, median over 50 token positions
- AR trajectory: slope of autoregressive KL curve over 512 tokens
- Coherence: entropy of generated responses
- Optimization continued until 100% of responses were classified COMPLY, with no EVASIVE, REFUSE, or EMPTY results
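The per-prompt classification above can be illustrated with a toy keyword heuristic over the first ~42 tokens of each response. The actual ground-truth classifier used by the optimizer is not published; the marker lists below are hypothetical and only demonstrate the four tiers:

```python
# Hypothetical marker lists; the real classifier is not published.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai", "i'm sorry")
EVASIVE_MARKERS = ("instead, let's", "i'd rather", "let's talk about")

def classify_response(text: str) -> str:
    """Toy per-round classifier: COMPLY / EVASIVE / REFUSE / EMPTY."""
    head = " ".join(text.split()[:42]).lower()  # inspect ~42-token prefix
    if not head.strip():
        return "EMPTY"
    if any(m in head for m in REFUSAL_MARKERS):
        return "REFUSE"
    if any(m in head for m in EVASIVE_MARKERS):
        return "EVASIVE"
    return "COMPLY"

print(classify_response("I'm sorry, but I cannot help with that."))  # REFUSE
print(classify_response("Sure. Step one is planning."))              # COMPLY
print(classify_response(""))                                         # EMPTY
```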
Metrics
| Metric | Method | Why |
|---|---|---|
| Refusals (R) | Three-tier classification on 42-token responses | Primary objective — measures refusal removal |
| KL divergence (50-token) | Base anchored, median | Distribution shift from base model |
| AR KL slope | Linear regression on 32-token windowed KL curve, 512 tokens | Predicts long-form generation stability |
| AR drift ratio | late_KL / early_KL | Detects compounding trajectory divergence |
| Bigram entropy | Shannon entropy over bigram frequency | Detects repetition collapse and incoherence |
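The two trajectory metrics in the table can be sketched from a per-token autoregressive KL curve: the slope via linear regression on a 32-token windowed average, and the drift ratio as late-window over early-window mean KL. The synthetic decaying curve below stands in for a real 512-token measurement; all names are illustrative:

```python
import numpy as np

def ar_trajectory_stats(kl_curve, window=32):
    """Slope and drift ratio of an autoregressive KL curve.

    Slope: linear regression over the window-averaged curve (negative
    means the trajectory converges back toward the base model).
    Drift ratio: mean KL of the last window / mean KL of the first
    window (< 1.0 means no compounding divergence).
    """
    kl = np.asarray(kl_curve, dtype=np.float64)
    smoothed = np.convolve(kl, np.ones(window) / window, mode="valid")
    slope = np.polyfit(np.arange(len(smoothed)), smoothed, 1)[0]
    drift = kl[-window:].mean() / kl[:window].mean()
    return slope, drift

# toy 512-token curve: early disagreement that self-corrects
t = np.arange(512)
curve = 0.02 * np.exp(-t / 300) + 0.005
slope, drift = ar_trajectory_stats(curve)
print(slope < 0, drift < 1.0)  # True True
```

A healthy abliteration shows exactly this shape: negative slope and a drift ratio below 1.0, matching the reported -0.000368 and 0.53×.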
Convergence History
Round 1: R=110 KL=0.0009 AR=-0.00001 [110R/10C]
Round 2: R= 71 KL=0.0021 AR=-0.00005 [71R/49C]
Round 3: R= 9 KL=0.0046 AR=-0.00017 [9R/111C]
Round 4: R= 2 KL=0.0063 AR=-0.00025 [2R/118C]
Round 5: R= 1 KL=0.0059 AR=-0.00027 [1R/119C]
Round 6: R= 0 KL=0.0097 AR=-0.00046 [0R/120C] ← first R=0
Round 9: R= 0 KL=0.0079 AR=-0.00037 [0R/120C] ← converged
Summary
The model achieves zero refusals on the full 120-prompt adversarial set with KL divergence of 0.00795 from the base model. The autoregressive KL probe confirms stable long-form generation: negative slope (self-correcting trajectory), drift ratio below 1.0 (no error compounding), and 9/12 long-form narrative probe prompts reaching full 512-token generation without truncation or looping.
Model Examination
Quality Preservation
The autoregressive (AR) KL probe measures what happens when the model generates freely over long sequences, unlike base-anchored KL, which only measures per-token prediction divergence over 50 tokens given perfect context.
- AR slope is negative (-0.000368): the model's autoregressive behavior actively converges toward the base model over the trajectory, meaning early-token disagreements self-correct rather than compound
- Drift ratio is 0.53×: KL at token 512 is roughly half the KL at token 1, meaning no repetition collapse and no trajectory degradation
- 9 of 12 probe prompts reached the full 512-token generation length, confirming the model does not prematurely truncate or loop while maintaining the base model's entropy and coherence
Environmental Impact
- Hardware Type: NVIDIA RTX 5050 (single consumer GPU)
- Hours used: ~2 hours (Phase 0 + optimization)
- Cloud Provider: None (local hardware)
- Compute Region: San Francisco Bay Area, USA
- Carbon Emitted: Estimated < 0.15 kg CO₂eq
Technical Specifications
Model Architecture and Objective
Qwen3.5-2B hybrid transformer with GatedDeltaNet linear attention layers (75% linear, 25% full attention). 24 layers total, 2B parameters. Causal language modeling objective inherited from base model, unmodified.
Compute Infrastructure
Hardware
Single NVIDIA RTX 5050 GPU power-constrained at 35W. No multi-GPU or cloud infrastructure required.
Software
- Framework: Bogomil
- Direction extraction: MiniSom (Self-Organizing Maps)
- Model loading: HuggingFace Transformers
- Optimizer: Custom Crypto-Differential Adaptation (no external optimization library)
- Quality monitoring: Custom autoregressive KL probe
Citation
BibTeX:
@misc{vpetrov2026bogomil,
title={Crypto-Differential Adaptive Abliteration Optimization Strategies},
author={Petrov, Valentin},
year={2026},
publisher={INMECHA INC},
}
References
Abu Shairah, M., et al. (2025). An embarrassingly simple defense against LLM abliteration attacks. arXiv preprint arXiv:2505.19056.
Arditi, A., Obeso, O., Syed, A., Metzger, D., Bhatt, U., & Heimersheim, S. (2024). Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.
Belrose, N., Schneider-Joseph, D., Ravfogel, S., Cotterell, R., Raff, E., & Biderman, S. (2023). LEACE: Perfect linear concept erasure in closed form. Advances in Neural Information Processing Systems, 36.
Beyer, H.-G. & Schwefel, H.-P. (2002). Evolution strategies — A comprehensive introduction. Natural Computing, 1(1), 3–52.
FailSpy. (2024). Abliterating refusal from LLMs using multi-directional SOMs. GitHub repository: FailSpy/abliterator.
Ingber, L. (1989). Very fast simulated re-annealing. Mathematical and Computer Modelling, 12(8), 967–973.
Lai, J. (2025a). Projected abliteration. Hugging Face Blog. https://huggingface.co/blog/grimjim/projected-abliteration
Lai, J. (2025b). Norm-preserving biprojected abliteration. Hugging Face Blog. https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
Lu, B., et al. (2026). The Assistant Axis: Persona geometry in post-trained language models. arXiv preprint arXiv:2601.10387.
McMahan, B., Moore, E., Ramage, D., Hampson, S., & Arcas, B. A. y (2017). Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 1273–1282.
mlabonne. (2024). Uncensor any LLM with abliteration. Hugging Face Blog.
Piras, G., Mura, R., Brau, F., Oneto, L., Roli, F., & Biggio, B. (2025). SOM directions are better than one: Multi-directional refusal suppression in language models. arXiv preprint arXiv:2511.08379.
Qwen Team. (2026). Qwen 3.5 technical report.
Weidmann, P. E. (2024). Heretic: Automated LLM abliteration framework. GitHub repository: p-e-w/heretic.
Wollschläger, T., et al. (2025). Gradient-based representation engineering for concept cones. Proceedings of the 42nd International Conference on Machine Learning (ICML).
Young, A. (2025). A comparative analysis of LLM abliteration methods. arXiv preprint arXiv:2512.13655.
Zhao, Y., et al. (2025). LLMs encode harmfulness and refusal separately. arXiv preprint arXiv:2507.11878.
Model Card Authors
Valentin Petrov / INMECHA INC
Model Card Contact
INMECHA INC — San Francisco Bay Area, USA
License
Apache 2.0, same as the base Qwen3.5 model.