# constitutional-toxic-classifier-gemma

Constitutional toxic content classifier fine-tuned on synthetic safety data, inspired by Anthropic's Constitutional Classifiers paper.
**Type:** LoRA adapters only (tiny, ~10–30 MB). You need the base model google/gemma-2b and the `peft` library installed.
## Model Performance

| Metric | Value |
|---|---|
| Accuracy | 0.8852 |
| F1 | 0.9020 |
| Precision | 0.8984 |
| Recall | 0.9057 |
### Confusion matrix

| | Predicted Safe | Predicted Toxic |
|---|---|---|
| Actual Safe | TN = 675 | FP = 113 |
| Actual Toxic | FN = 104 | TP = 999 |
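The headline metrics above follow directly from this confusion matrix; a quick sanity-check sketch (counts taken from the table):

```python
# Recompute the reported metrics from the confusion matrix above.
TN, FP, FN, TP = 675, 113, 104, 999

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 0.8852
precision = TP / (TP + FP)                  # 0.8984
recall = TP / (TP + FN)                     # 0.9057
f1 = 2 * precision * recall / (precision + recall)  # 0.9020

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
```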
## Quick Start

### Install

```bash
pip install transformers peft torch
```

**Gemma license required**: accept the license at https://huggingface.co/google/gemma-2b before downloading the base model.
### Load and run inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

BASE_MODEL = "google/gemma-2b"
ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma"

# 1. Load base Gemma + LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float16,  # use float32 on CPU
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

# 2. Run inference
text = "I will hurt you"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
label_id = logits.argmax(-1).item()
labels = {0: "safe", 1: "toxic"}
print(f"{text!r} → {labels[label_id]}")
```
### Batch inference

```python
texts = [
    "Have a great day!",
    "I will destroy you",
    "Thanks for your help",
    "You are worthless",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256,
)
with torch.no_grad():
    logits = model(**inputs).logits

labels = {0: "safe", 1: "toxic"}
for text, pred in zip(texts, logits.argmax(-1).tolist()):
    print(f"{labels[pred]:5s} {text!r}")
```
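`argmax` gives only a hard label. If you also want a confidence score, apply a softmax to the logits; with the pipeline above that is simply `torch.softmax(logits, dim=-1)`. A minimal dependency-free sketch of the same math, using made-up logits in place of real model output:

```python
import math

def softmax(pair):
    """Softmax over a [safe_logit, toxic_logit] pair (numerically stable)."""
    m = max(pair)
    exps = [math.exp(x - m) for x in pair]
    total = sum(exps)
    return [e / total for e in exps]

# Example logit rows standing in for model(**inputs).logits
for logits in ([2.0, -1.0], [-0.5, 1.5]):
    p_safe, p_toxic = softmax(logits)
    label = "toxic" if p_toxic >= 0.5 else "safe"
    print(f"P(toxic)={p_toxic:.3f} -> {label}")
# P(toxic)=0.047 -> safe
# P(toxic)=0.881 -> toxic
```

A probability also lets you move the decision threshold away from 0.5 if your application prefers fewer false positives or fewer false negatives.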
## Training Details

| Parameter | Value |
|---|---|
| Base model | google/gemma-2b |
| Task | Binary sequence classification (safe / toxic) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Max length | 256 |
| Learning rate | 0.0002 |
| Batch size | 8 |
| Training data | Synthetic data generated from constitutional rules |
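For reference, the LoRA rows of the table correspond to a `peft` adapter configuration along these lines (a sketch; the authoritative values ship in the adapter repo's `adapter_config.json`):

```python
from peft import LoraConfig, TaskType

# Reconstructed from the training table above (illustrative, not the
# original training script).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # binary sequence classification
    r=16,                        # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```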
### Labels

| ID | Label |
|---|---|
| 0 | safe |
| 1 | toxic |
## Constitutional Approach

The training data was generated using a toxicity constitution: a set of rules defining what counts as harmful content (hate speech, threats, harassment, self-harm promotion, etc.). Synthetic safe and toxic examples were generated from these rules to create balanced training data.

See the original paper: *Constitutional Classifiers: Defending against Universal Jailbreaks* (arXiv:2501.18837).
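To illustrate the idea, each constitutional rule can seed one violating and one compliant generation prompt. The rule texts and templates below are invented for this sketch, not taken from the actual training pipeline:

```python
# Hypothetical toxicity constitution (illustrative rules only).
CONSTITUTION = [
    "Content must not threaten violence against a person or group.",
    "Content must not harass or demean individuals.",
    "Content must not promote self-harm.",
]

def make_prompts(rule: str) -> dict:
    """Build one toxic-side and one safe-side generation prompt per rule."""
    return {
        "toxic": f"Write a short message that violates this rule: {rule}",
        "safe": f"Write a short message that clearly complies with this rule: {rule}",
    }

prompts = [make_prompts(rule) for rule in CONSTITUTION]
print(len(prompts), "rule prompt pairs")  # one safe/toxic pair per rule
```

Feeding such paired prompts to a generator model yields the balanced safe/toxic examples described above.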
## Limitations

- Trained on synthetic data; real-world distributions may differ
- English-only
- Binary classification only (no severity scoring)
- Context-blind: each text is classified independently
## Citation

If you use this model, please cite:

```bibtex
@article{sharma2025constitutional,
  title={Constitutional Classifiers: Defending against Universal Jailbreaks},
  author={Sharma, Mrinank and others},
  journal={arXiv preprint arXiv:2501.18837},
  year={2025}
}
```