constitutional-toxic-classifier-gemma

Constitutional toxic content classifier fine-tuned on synthetic safety data, inspired by Anthropic's Constitutional Classifiers paper.

Type: LoRA adapters only (tiny, ~10–30 MB). You need the base model google/gemma-2b and peft installed.


Model Performance

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 0.8852 |
| F1        | 0.9020 |
| Precision | 0.8984 |
| Recall    | 0.9057 |

Confusion matrix

|              | Predicted Safe | Predicted Toxic |
|--------------|----------------|-----------------|
| Actual Safe  | TN = 675       | FP = 113        |
| Actual Toxic | FN = 104       | TP = 999        |
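The headline metrics above can be recomputed directly from these counts; a quick sanity check in plain Python:

```python
# Confusion-matrix counts from the table above
tn, fp, fn, tp = 675, 113, 104, 999

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # of texts flagged toxic, how many really are
recall    = tp / (tp + fn)                   # of truly toxic texts, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8852 precision=0.8984 recall=0.9057 f1=0.9020
```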

Quick Start

Install

```bash
pip install transformers peft torch
```

Gemma license required: accept the license at https://huggingface.co/google/gemma-2b before downloading the base model.

Load and run inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

BASE_MODEL = "google/gemma-2b"
ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma"

# 1. Load base Gemma + LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float16,   # use float32 on CPU
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

# 2. Run inference
text = "I will hurt you"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

label_id = logits.argmax(-1).item()
labels   = {0: "safe", 1: "toxic"}
print(f"{text!r}  →  {labels[label_id]}")
```
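The argmax above discards the model's confidence. Applying `torch.softmax(logits, dim=-1)` turns the two logits into probabilities, which lets you flag text only above a chosen threshold. The mapping is easy to verify without torch; a framework-free sketch of the same computation (the logit values below are illustrative, not real model output):

```python
import math

def softmax(logits):
    """Plain-Python equivalent of torch.softmax over one row of logits."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative [safe, toxic] logits; a real pair comes from model(**inputs).logits
p_safe, p_toxic = softmax([-1.2, 2.3])

THRESHOLD = 0.9  # flag only high-confidence toxic predictions
label = "toxic" if p_toxic >= THRESHOLD else "safe"
print(f"p_toxic={p_toxic:.3f} -> {label}")
```

Raising the threshold trades recall for precision, which matters if false positives are costly in your application.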

Batch inference

```python
texts = [
    "Have a great day!",
    "I will destroy you",
    "Thanks for your help",
    "You are worthless",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256,
)
with torch.no_grad():
    logits = model(**inputs).logits

labels = {0: "safe", 1: "toxic"}
for text, pred in zip(texts, logits.argmax(-1).tolist()):
    print(f"{labels[pred]:5s}  {text!r}")
```

Training Details

| Parameter      | Value                                              |
|----------------|----------------------------------------------------|
| Base model     | google/gemma-2b                                    |
| Task           | Binary sequence classification (safe / toxic)      |
| LoRA rank (r)  | 16                                                 |
| LoRA alpha     | 32                                                 |
| LoRA dropout   | 0.1                                                |
| Target modules | q_proj, k_proj, v_proj, o_proj                     |
| Max length     | 256                                                |
| Learning rate  | 0.0002                                             |
| Batch size     | 8                                                  |
| Training data  | Synthetic data generated from constitutional rules |
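The LoRA hyperparameters in the table map directly onto a `peft.LoraConfig`. The exact training script is not published, so this is an illustrative sketch of the presumed configuration, written as a plain dict that would be passed as `LoraConfig(task_type="SEQ_CLS", **lora_hparams)`:

```python
# Hyperparameters from the table above; keys match peft.LoraConfig arguments
lora_hparams = {
    "r": 16,                  # LoRA rank
    "lora_alpha": 32,         # scaling; effective scale = lora_alpha / r = 2.0
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
}

print("effective LoRA scale:", lora_hparams["lora_alpha"] / lora_hparams["r"])
```

Targeting only the attention projections keeps the adapter small, which is why the download is tens of megabytes rather than gigabytes.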

Labels

| ID | Label |
|----|-------|
| 0  | safe  |
| 1  | toxic |

Constitutional Approach

The training data was generated using a toxicity constitution: a set of rules defining what counts as harmful content (hate speech, threats, harassment, self-harm promotion, etc.). Synthetic safe and toxic examples were generated from these rules to create balanced training data.
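In the paper's pipeline an LLM is prompted with each rule to produce examples; a toy, template-based sketch of the idea (the rule texts and example strings here are hypothetical, not the actual constitution):

```python
# Hypothetical constitution: rule id -> description of prohibited content
CONSTITUTION = {
    "threats": "statements expressing intent to physically harm someone",
    "harassment": "demeaning or degrading language aimed at a person",
}

# Canned stand-ins; the real pipeline would prompt an LLM with each rule instead
TOXIC_EXAMPLES = {
    "threats": ["I will hurt you", "You'd better watch your back"],
    "harassment": ["You are worthless", "Nobody could ever like you"],
}
SAFE_EXAMPLES = ["Have a great day!", "Thanks for your help"]

# Label toxic examples 1 (tagged with the rule they violate) and safe ones 0
dataset = [{"text": t, "label": 1, "rule": rule}
           for rule, texts in TOXIC_EXAMPLES.items() for t in texts]
dataset += [{"text": t, "label": 0, "rule": None} for t in SAFE_EXAMPLES]

print(len(dataset), "examples,", sum(d["label"] for d in dataset), "toxic")
```

Keeping the rule id on each toxic example makes it easy to check per-rule coverage and rebalance the synthetic data.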

See the original paper: Constitutional Classifiers: Defending against Universal Jailbreaks


Limitations

  • Trained on synthetic data; real-world distribution may differ
  • English-only
  • Binary classification only (no severity scoring)
  • Context-blind: each text is classified independently

Citation

If you use this model, please cite:

```bibtex
@article{sharma2025constitutional,
  title={Constitutional Classifiers: Defending against Universal Jailbreaks},
  author={Sharma, Mrinank and others},
  journal={arXiv preprint arXiv:2501.18837},
  year={2025}
}
```