openNemo-9B-abliterated-GGUF

GGUF quantizations of openNemo-9B-abliterated for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible tools.

Abliterated (uncensored) version of NVIDIA's Nemotron-H 9B, with safety refusals removed using Empero AI's Snakehead — an abliteration tool specialized for hybrid Mamba2 + sparse attention architectures.

By Empero AI


Available quantizations

| File | Quant | Size (approx) | Notes |
|------|-------|---------------|-------|
| openNemo-9B-abliterated-Q2_K.gguf | Q2_K | ~3.5 GB | Smallest, lower quality |
| openNemo-9B-abliterated-Q3_K_S.gguf | Q3_K_S | ~4.1 GB | Small 3-bit |
| openNemo-9B-abliterated-Q3_K_M.gguf | Q3_K_M | ~4.5 GB | Medium 3-bit |
| openNemo-9B-abliterated-Q3_K_L.gguf | Q3_K_L | ~4.9 GB | Large 3-bit |
| openNemo-9B-abliterated-Q4_0.gguf | Q4_0 | ~5.2 GB | Basic 4-bit |
| openNemo-9B-abliterated-Q4_K_S.gguf | Q4_K_S | ~5.3 GB | Small 4-bit k-quant |
| openNemo-9B-abliterated-Q4_K_M.gguf | Q4_K_M | ~5.5 GB | Recommended — best balance of size and quality |
| openNemo-9B-abliterated-Q5_0.gguf | Q5_0 | ~6.3 GB | Basic 5-bit |
| openNemo-9B-abliterated-Q5_K_S.gguf | Q5_K_S | ~6.3 GB | Small 5-bit k-quant |
| openNemo-9B-abliterated-Q5_K_M.gguf | Q5_K_M | ~6.5 GB | Medium 5-bit k-quant |
| openNemo-9B-abliterated-Q6_K.gguf | Q6_K | ~7.5 GB | 6-bit, near-lossless |
| openNemo-9B-abliterated-Q8_0.gguf | Q8_0 | ~9.5 GB | 8-bit, virtually lossless |
| openNemo-9B-abliterated-IQ4_XS.gguf | IQ4_XS | ~4.8 GB | imatrix 4-bit, very efficient |

Which quant should I use?

  • Low VRAM (6–8 GB): Q4_K_M — best quality-per-bit at this size
  • Medium VRAM (8–12 GB): Q5_K_M or Q6_K
  • High VRAM / quality priority: Q8_0
  • Absolute minimum size: Q2_K or Q3_K_S (expect some quality loss)
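The guidance above can be sketched as a small shell helper. `suggest_quant` is a hypothetical name introduced here; the thresholds simply mirror the bullets above.

```shell
# Hypothetical helper mirroring the guidance above: map a VRAM budget
# (in whole GB) to a suggested quant.
suggest_quant() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 12 ]; then echo "Q8_0"    # quality priority
  elif [ "$vram_gb" -ge 9 ];  then echo "Q5_K_M"  # medium VRAM (8-12 GB)
  elif [ "$vram_gb" -ge 6 ];  then echo "Q4_K_M"  # low VRAM (6-8 GB)
  else                             echo "Q3_K_S"  # absolute minimum size
  fi
}

suggest_quant 8   # prints Q4_K_M
```

Treat the output as a starting point; if quality matters more than memory headroom, step up one quant level.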

Usage

llama.cpp

llama-cli -m openNemo-9B-abliterated-Q4_K_M.gguf -p "Your prompt here" -n 512

Ollama

Create a Modelfile:

FROM ./openNemo-9B-abliterated-Q4_K_M.gguf

Then:

ollama create opennemo-uncensored -f Modelfile
ollama run opennemo-uncensored
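The one-line Modelfile above is sufficient; a slightly fuller sketch with common directives follows. The `PARAMETER` values and `SYSTEM` prompt are illustrative defaults, not tuned recommendations for this model.

```
FROM ./openNemo-9B-abliterated-Q4_K_M.gguf

# Illustrative defaults; adjust to taste.
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant."
```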

LM Studio

Download the desired quant file and load it directly in LM Studio.

About the base model

openNemo-9B-abliterated is an abliterated version of openNemo-9B — a pure-PyTorch reimplementation of NVIDIA's Nemotron-H architecture (hybrid Mamba2 + Transformer, 56 layers).

Ablation results

| Metric | Value |
|--------|-------|
| Pre-ablation refusal rate | 97% |
| Post-ablation refusal rate | 13% |
| KL divergence | 0.022 |
| Ablation config | c=15, r=25, w=1.37, g40l |

The very low KL divergence (0.022) indicates that, on prompts the original model does not refuse, the ablated model's output distribution stays nearly identical to the original's, so quality is essentially unchanged.
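For intuition about the scale of that number, KL divergence can be computed by hand for a toy pair of two-outcome distributions. The distributions below are made up for illustration and have nothing to do with the actual ablation run:

```shell
# Illustrative only: KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats,
# for made-up distributions P = (0.5, 0.5) and Q = (0.4, 0.6).
kl=$(awk 'BEGIN {
  split("0.5 0.5", p); split("0.4 0.6", q);
  for (i = 1; i <= 2; i++) sum += p[i] * log(p[i] / q[i]);
  printf "%.4f", sum
}')
echo "$kl"   # prints 0.0204, the same order of magnitude as the 0.022 reported above
```

In other words, a divergence of 0.022 corresponds to distributions that differ only slightly, consistent with the claim that quality is preserved.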

Disclaimer

This model has had its safety alignment removed. It will comply with requests that the original model would refuse. The creators are not responsible for how this model is used. Intended for research, creative writing, and applications where the user takes responsibility for output filtering.

Acknowledgments

License

NVIDIA Open Model License — same as the base model.
