---
library_name: diffusers
pipeline_tag: text-to-image
license: mit
base_model: runwayml/stable-diffusion-v1-5
tags:
- stable-diffusion
- safety
- safe-generation
- grpo
- steering-reward
- nsfw-removal
- safediffusion-r1
---
# SafeDiffusion-R1 — Online Reward Steering for Safe Diffusion Post-Training
GRPO-based safety post-training for Stable Diffusion using a closed-form,
CLIP-based **steering reward**. The method needs neither a separately
trained safety classifier nor a paired safe/unsafe image dataset.
The repository contains three full `StableDiffusionPipeline` variants
(each in its own subfolder), trained with different anchor sets.
| Subfolder | Anchors (safe + unsafe) | Notes |
|---|---|---|
| **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6 %, epoch 300). |
| **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |
## Quick inference
```python
import os

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download

# StableDiffusionPipeline.from_pretrained does not natively accept
# `subfolder=` for the FULL pipeline (only single components), so we
# snapshot the variant we want, then load from the local path.
local_root = snapshot_download(
    "ItsMaxNorm/SafeDiffusion-R1",
    allow_patterns="scaled/*",  # or "compact/*" / "empty-positive/*"
)
pipe = StableDiffusionPipeline.from_pretrained(
    os.path.join(local_root, "scaled"),
    torch_dtype=torch.float16,
).to("cuda")

img = pipe("a photo of a cat sleeping on a couch").images[0]
img.save("out.png")
```
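
The same pattern extends to the other two subfolders. The sketch below wraps it in a small helper. The names `VARIANTS`, `variant_pattern`, and `load_variant` are hypothetical and not part of the repository, and falling back to `float32` off-GPU is an assumption made here for convenience rather than a recommendation from the authors.

```python
import os

REPO_ID = "ItsMaxNorm/SafeDiffusion-R1"
# Subfolder names taken from the variant table above.
VARIANTS = ("scaled", "compact", "empty-positive")


def variant_pattern(variant: str) -> str:
    """Return the allow_patterns glob for one variant subfolder (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"{variant}/*"


def load_variant(variant: str = "scaled", device: str = "cuda"):
    """Download a single variant and load it as a StableDiffusionPipeline."""
    # Heavy dependencies are imported lazily so the helper above stays cheap to import.
    import torch
    from diffusers import StableDiffusionPipeline
    from huggingface_hub import snapshot_download

    local_root = snapshot_download(REPO_ID, allow_patterns=variant_pattern(variant))
    # Assumption: half precision on CUDA, full precision elsewhere.
    dtype = torch.float16 if device.startswith("cuda") else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(
        os.path.join(local_root, variant),
        torch_dtype=dtype,
    )
    return pipe.to(device)
```

Calling `load_variant("compact")` would then fetch only the `compact/` files instead of all three pipelines.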
## Headline results (vs. SD-v1.4 baseline)
| Benchmark | SD-v1.4 | SafeDiffusion-R1 | Δ |
|---|---|---|---|
| I2P inappropriate-content rate | 48.9 % | **18.07 %** (scaled) | −63 % (relative) |
| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** (scaled) | **−97.7 %** (relative) |
| GenEval compositional accuracy | 42.08 % | **47.83 %** (scaled) | +5.75 pp |
| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact) | **8.7×** safer |
| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** (compact) | model resists most prompts before any attack |
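
The Δ column mixes three kinds of comparison: relative change, percentage-point difference, and a ratio. The short check below recomputes each from the table's own numbers. The function names are illustrative only, and reading "8.7× safer" as baseline ASR divided by new ASR is an assumption about how that figure was derived.

```python
def relative_change_pct(baseline: float, new: float) -> float:
    """Relative change in percent: negative means a reduction."""
    return (new - baseline) / baseline * 100.0


def percentage_point_change(baseline: float, new: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new - baseline


def reduction_factor(baseline: float, new: float) -> float:
    """How many times smaller the new value is than the baseline."""
    return baseline / new


i2p_delta = relative_change_pct(48.9, 18.07)          # I2P inappropriate-content rate
nudenet_delta = relative_change_pct(646, 15)          # NudeNet detections
geneval_delta = percentage_point_change(42.08, 47.83)  # GenEval accuracy
mma_factor = reduction_factor(22.6, 2.6)              # MMA-Diffusion ASR

print(f"I2P: {i2p_delta:.0f} %")
print(f"NudeNet: {nudenet_delta:.1f} %")
print(f"GenEval: {geneval_delta:+.2f} pp")
print(f"MMA-Diffusion: {mma_factor:.1f}x")
```
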
The safety gains generalise to **seven OOD harm categories** (hate,
harassment, violence, self-harm, shocking, illegal-activity, sexual)
despite training only on benign + nudity-style negatives.
## Method in one paragraph
For an NSFW prompt `p`, vanilla SD produces an image aligned to the
unsafe text embedding `z_p`. We instead reward the model against a
**steered** target `z_p + α · v_safe`, where `v_safe` is a single
direction in CLIP-text space computed once as the difference of means
between a small set of safe and unsafe anchor phrases. GRPO post-training
then nudges the UNet to satisfy this steered reward. Because `v_safe`
is computed from a **frozen** CLIP encoder, the target is stationary:
samples drift on-policy, but the anchor they are regressed onto does not.
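
A toy sketch of the idea, not the released implementation: it uses 2-D lists as stand-ins for CLIP embeddings, and it assumes the reward is the cosine similarity between an image embedding and the steered target. The description above does not specify the exact reward form, so treat that choice, the value of `alpha`, and all function names as illustrative assumptions. The group-relative advantage is the usual GRPO standardisation within one prompt's sample group.

```python
import math
from statistics import mean, pstdev

Vector = list[float]


def mean_vector(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [mean(column) for column in zip(*vectors)]


def steering_direction(safe: list[Vector], unsafe: list[Vector]) -> Vector:
    """v_safe = mean(safe anchor embeddings) - mean(unsafe anchor embeddings)."""
    return [s - u for s, u in zip(mean_vector(safe), mean_vector(unsafe))]


def steered_target(z_p: Vector, v_safe: Vector, alpha: float) -> Vector:
    """Steered target z_p + alpha * v_safe."""
    return [z + alpha * v for z, v in zip(z_p, v_safe)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (assumed reward form, see lead-in)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardise rewards within one sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy anchors: in practice these would be frozen CLIP text embeddings.
safe_anchors = [[1.0, 0.0], [0.8, 0.2]]
unsafe_anchors = [[0.0, 1.0], [0.2, 0.8]]
v_safe = steering_direction(safe_anchors, unsafe_anchors)  # computed once

z_p = [0.1, 0.9]                                  # embedding of an unsafe prompt
target = steered_target(z_p, v_safe, alpha=1.0)   # stationary reward target

# One group of sampled images for the same prompt (toy image embeddings).
image_embeddings = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
rewards = [cosine(e, target) for e in image_embeddings]
advantages = group_relative_advantages(rewards)
```

Samples whose embedding lies closer to the steered target get a positive advantage, and samples that stay near the original unsafe embedding get a negative one, which is the signal the policy update would follow.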
## Code, training, and evaluation
Training code, the steering reward, evaluation scripts (FID, CLIP-score,
NudeNet, Q16, LPIPS, style-loss), and the end-to-end eval wrapper that
works directly against this Hub release are available at:
**[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**
## Citation
```bibtex
@misc{kumar2026safediffusionr1,
title={SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
author={Komal Kumar and Ankan Deria and Abhishek Basu and Fahad Shamshad and Hisham Cholakkal and Karthik Nandakumar},
year={2026},
eprint={2605.18719},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.18719},
}
```