openNemo-9B-abliterated-GGUF

GGUF quantizations of openNemo-9B-abliterated for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible tools.

Abliterated (uncensored) version of NVIDIA's Nemotron-H 9B, with safety refusals removed using Empero AI's Snakehead — an abliteration tool specialized for hybrid Mamba2 + sparse attention architectures.

By Empero AI


Available quantizations

| File | Quant | Size (approx) | Notes |
|------|-------|---------------|-------|
| openNemo-9B-abliterated-Q2_K.gguf | Q2_K | ~3.5 GB | Smallest, lower quality |
| openNemo-9B-abliterated-Q3_K_S.gguf | Q3_K_S | ~4.1 GB | Small 3-bit |
| openNemo-9B-abliterated-Q3_K_M.gguf | Q3_K_M | ~4.5 GB | Medium 3-bit |
| openNemo-9B-abliterated-Q3_K_L.gguf | Q3_K_L | ~4.9 GB | Large 3-bit |
| openNemo-9B-abliterated-Q4_0.gguf | Q4_0 | ~5.2 GB | Basic 4-bit |
| openNemo-9B-abliterated-Q4_K_S.gguf | Q4_K_S | ~5.3 GB | Small 4-bit k-quant |
| openNemo-9B-abliterated-Q4_K_M.gguf | Q4_K_M | ~5.5 GB | Recommended — best balance of size and quality |
| openNemo-9B-abliterated-Q5_0.gguf | Q5_0 | ~6.3 GB | Basic 5-bit |
| openNemo-9B-abliterated-Q5_K_S.gguf | Q5_K_S | ~6.3 GB | Small 5-bit k-quant |
| openNemo-9B-abliterated-Q5_K_M.gguf | Q5_K_M | ~6.5 GB | Medium 5-bit k-quant |
| openNemo-9B-abliterated-Q6_K.gguf | Q6_K | ~7.5 GB | 6-bit, near-lossless |
| openNemo-9B-abliterated-Q8_0.gguf | Q8_0 | ~9.5 GB | 8-bit, virtually lossless |
| openNemo-9B-abliterated-IQ4_XS.gguf | IQ4_XS | ~4.8 GB | imatrix 4-bit, very efficient |

Which quant should I use?

  • Low VRAM (6–8 GB): Q4_K_M — best quality-per-bit at this size
  • Medium VRAM (8–12 GB): Q5_K_M or Q6_K
  • High VRAM / quality priority: Q8_0
  • Absolute minimum size: Q2_K or Q3_K_S (expect some quality loss)
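The guidance above can be sketched as a small shell helper. `suggest_quant` is a hypothetical name introduced here; the thresholds simply mirror the bullets above.

```shell
# Hypothetical helper mirroring the guidance above: map a VRAM budget
# (in whole GB) to a suggested quant.
suggest_quant() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 12 ]; then echo "Q8_0"    # quality priority
  elif [ "$vram_gb" -ge 9 ];  then echo "Q5_K_M"  # medium VRAM (8-12 GB)
  elif [ "$vram_gb" -ge 6 ];  then echo "Q4_K_M"  # low VRAM (6-8 GB)
  else                             echo "Q3_K_S"  # absolute minimum size
  fi
}

suggest_quant 8   # prints Q4_K_M
```

Treat the output as a starting point; if quality matters more than memory headroom, step up one quant level.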

Usage

llama.cpp

llama-cli -m openNemo-9B-abliterated-Q4_K_M.gguf -p "Your prompt here" -n 512

Ollama

Create a Modelfile:

FROM ./openNemo-9B-abliterated-Q4_K_M.gguf

Then:

ollama create opennemo-uncensored -f Modelfile
ollama run opennemo-uncensored
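The one-line Modelfile above is sufficient; a slightly fuller sketch with common directives follows. The `PARAMETER` values and `SYSTEM` prompt are illustrative defaults, not tuned recommendations for this model.

```
FROM ./openNemo-9B-abliterated-Q4_K_M.gguf

# Illustrative defaults; adjust to taste.
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant."
```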

LM Studio

Download the desired quant file and load it directly in LM Studio.

About the base model

openNemo-9B-abliterated is an abliterated version of openNemo-9B — a pure-PyTorch reimplementation of NVIDIA's Nemotron-H architecture (hybrid Mamba2 + Transformer, 56 layers).

Ablation results

| Metric | Value |
|--------|-------|
| Pre-ablation refusal rate | 97% |
| Post-ablation refusal rate | 13% |
| KL divergence | 0.022 |
| Ablation config | c=15, r=25, w=1.37, g40l |

The very low KL divergence (0.022) indicates that, on prompts the original model does not refuse, the ablated model's output distribution stays nearly identical to the original's, so quality is essentially unchanged.
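For intuition about the scale of that number, KL divergence can be computed by hand for a toy pair of two-outcome distributions. The distributions below are made up for illustration and have nothing to do with the actual ablation run:

```shell
# Illustrative only: KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats,
# for made-up distributions P = (0.5, 0.5) and Q = (0.4, 0.6).
kl=$(awk 'BEGIN {
  split("0.5 0.5", p); split("0.4 0.6", q);
  for (i = 1; i <= 2; i++) sum += p[i] * log(p[i] / q[i]);
  printf "%.4f", sum
}')
echo "$kl"   # prints 0.0204, the same order of magnitude as the 0.022 reported above
```

In other words, a divergence of 0.022 corresponds to distributions that differ only slightly, consistent with the claim that quality is preserved.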

Disclaimer

This model has had its safety alignment removed. It will comply with requests that the original model would refuse. The creators are not responsible for how this model is used. Intended for research, creative writing, and applications where the user takes responsibility for output filtering.

Acknowledgments

License

NVIDIA Open Model License — same as the base model.
