constitutional-toxic-classifier-gemma

Constitutional toxic content classifier fine-tuned on synthetic safety data, inspired by Anthropic's Constitutional Classifiers paper.

Type: LoRA adapters only (tiny, ~10–30 MB). You need the base model google/gemma-2b and peft installed.


Model Performance

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 0.8852 |
| F1        | 0.9020 |
| Precision | 0.8984 |
| Recall    | 0.9057 |

Confusion matrix

|              | Predicted Safe | Predicted Toxic |
|--------------|----------------|-----------------|
| Actual Safe  | TN = 675       | FP = 113        |
| Actual Toxic | FN = 104       | TP = 999        |
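The headline metrics above can be recomputed directly from these counts; a quick sanity check in plain Python:

```python
# Confusion-matrix counts from the table above
tn, fp, fn, tp = 675, 113, 104, 999

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # of texts flagged toxic, how many really are
recall    = tp / (tp + fn)                   # of truly toxic texts, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8852 precision=0.8984 recall=0.9057 f1=0.9020
```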

Quick Start

Install

```bash
pip install transformers peft torch
```

Gemma license required: accept the license at https://huggingface.co/google/gemma-2b before downloading the base model.

Load and run inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

BASE_MODEL = "google/gemma-2b"
ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma"

# 1. Load base Gemma + LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float16,   # use float32 on CPU
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

# 2. Run inference
text = "I will hurt you"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

label_id = logits.argmax(-1).item()
labels   = {0: "safe", 1: "toxic"}
print(f"{text!r}  →  {labels[label_id]}")
```
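The argmax above discards the model's confidence. Applying `torch.softmax(logits, dim=-1)` turns the two logits into probabilities, which lets you flag text only above a chosen threshold. The mapping is easy to verify without torch; a framework-free sketch of the same computation (the logit values below are illustrative, not real model output):

```python
import math

def softmax(logits):
    """Plain-Python equivalent of torch.softmax over one row of logits."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative [safe, toxic] logits; a real pair comes from model(**inputs).logits
p_safe, p_toxic = softmax([-1.2, 2.3])

THRESHOLD = 0.9  # flag only high-confidence toxic predictions
label = "toxic" if p_toxic >= THRESHOLD else "safe"
print(f"p_toxic={p_toxic:.3f} -> {label}")
```

Raising the threshold trades recall for precision, which matters if false positives are costly in your application.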

Batch inference

```python
texts = [
    "Have a great day!",
    "I will destroy you",
    "Thanks for your help",
    "You are worthless",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256,
)
with torch.no_grad():
    logits = model(**inputs).logits

labels = {0: "safe", 1: "toxic"}
for text, pred in zip(texts, logits.argmax(-1).tolist()):
    print(f"{labels[pred]:5s}  {text!r}")
```

Training Details

| Parameter      | Value                                              |
|----------------|----------------------------------------------------|
| Base model     | google/gemma-2b                                    |
| Task           | Binary sequence classification (safe / toxic)      |
| LoRA rank (r)  | 16                                                 |
| LoRA alpha     | 32                                                 |
| LoRA dropout   | 0.1                                                |
| Target modules | q_proj, k_proj, v_proj, o_proj                     |
| Max length     | 256                                                |
| Learning rate  | 0.0002                                             |
| Batch size     | 8                                                  |
| Training data  | Synthetic data generated from constitutional rules |
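The LoRA hyperparameters in the table map directly onto a `peft.LoraConfig`. The exact training script is not published, so this is an illustrative sketch of the presumed configuration, written as a plain dict that would be passed as `LoraConfig(task_type="SEQ_CLS", **lora_hparams)`:

```python
# Hyperparameters from the table above; keys match peft.LoraConfig arguments
lora_hparams = {
    "r": 16,                  # LoRA rank
    "lora_alpha": 32,         # scaling; effective scale = lora_alpha / r = 2.0
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
}

print("effective LoRA scale:", lora_hparams["lora_alpha"] / lora_hparams["r"])
```

Targeting only the attention projections keeps the adapter small, which is why the download is tens of megabytes rather than gigabytes.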

Labels

| ID | Label |
|----|-------|
| 0  | safe  |
| 1  | toxic |

Constitutional Approach

The training data was generated using a toxicity constitution: a set of rules defining what counts as harmful content (hate speech, threats, harassment, self-harm promotion, etc.). Synthetic safe and toxic examples were generated from these rules to create balanced training data.
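In the paper's pipeline an LLM is prompted with each rule to produce examples; a toy, template-based sketch of the idea (the rule texts and example strings here are hypothetical, not the actual constitution):

```python
# Hypothetical constitution: rule id -> description of prohibited content
CONSTITUTION = {
    "threats": "statements expressing intent to physically harm someone",
    "harassment": "demeaning or degrading language aimed at a person",
}

# Canned stand-ins; the real pipeline would prompt an LLM with each rule instead
TOXIC_EXAMPLES = {
    "threats": ["I will hurt you", "You'd better watch your back"],
    "harassment": ["You are worthless", "Nobody could ever like you"],
}
SAFE_EXAMPLES = ["Have a great day!", "Thanks for your help"]

# Label toxic examples 1 (tagged with the rule they violate) and safe ones 0
dataset = [{"text": t, "label": 1, "rule": rule}
           for rule, texts in TOXIC_EXAMPLES.items() for t in texts]
dataset += [{"text": t, "label": 0, "rule": None} for t in SAFE_EXAMPLES]

print(len(dataset), "examples,", sum(d["label"] for d in dataset), "toxic")
```

Keeping the rule id on each toxic example makes it easy to check per-rule coverage and rebalance the synthetic data.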

See the original paper: Constitutional Classifiers: Defending against Universal Jailbreaks


Limitations

  • Trained on synthetic data; real-world distribution may differ
  • English-only
  • Binary classification only (no severity scoring)
  • Context-blind: each text is classified independently

Citation

If you use this model, please cite:

```bibtex
@article{sharma2025constitutional,
  title={Constitutional Classifiers: Defending against Universal Jailbreaks},
  author={Sharma, Mrinank and others},
  journal={arXiv preprint arXiv:2501.18837},
  year={2025}
}
```