Add top-level model card with method, headline results, and per-variant load snippet
README.md
CHANGED
@@ -1,3 +1,87 @@
---
library_name: diffusers
pipeline_tag: text-to-image
license: mit
base_model: runwayml/stable-diffusion-v1-5
tags:
- stable-diffusion
- safety
- safe-generation
- grpo
- steering-reward
- nsfw-removal
- safediffusion-r1
---

# SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

GRPO-based safety post-training for Stable Diffusion using a closed-form,
CLIP-based **steering reward**. It requires no separately trained safety
classifier and no paired safe/unsafe image dataset.

The repository contains three drop-in `StableDiffusionPipeline` variants
(loadable via `subfolder=...`), each trained with a different anchor set.

| Subfolder | Anchor set (safe + unsafe) | Notes |
|---|---|---|
| **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6 %, epoch 300). |
| **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |

## Quick inference

```python
import torch
from diffusers import StableDiffusionPipeline

# Load one of the three variants from its subfolder in the repository.
pipe = StableDiffusionPipeline.from_pretrained(
    "ItsMaxNorm/SafeDiffusion-R1",
    subfolder="scaled",  # or "compact" / "empty-positive"
    torch_dtype=torch.float16,
).to("cuda")

# Generate a single image and write it to disk.
img = pipe("a photo of a cat sleeping on a couch").images[0]
img.save("out.png")
```

## Headline results (vs. SD-v1.4 baseline)

| Benchmark | SD-v1.4 | SafeDiffusion-R1 (scaled) | Δ |
|---|---|---|---|
| I2P inappropriate-content rate | 48.9 % | **18.07 %** | −63 % (relative) |
| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** | **−97.7 %** (relative) |
| GenEval compositional accuracy | 42.08 % | **47.83 %** | +5.75 pp |
| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact variant) | **8.7×** lower |
| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** | model resists most prompts before any attack |

The safety gains generalise to **seven out-of-distribution (OOD) harm
categories** (hate, harassment, violence, self-harm, shocking,
illegal-activity, sexual) despite training only on benign + nudity-style
negatives.

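The Δ column mixes three kinds of comparison: a relative change, a percentage-point difference, and a ratio. As a quick sanity check, the sketch below recomputes each one from the two value columns of the table. The helper names are illustrative and not part of the repository.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change in percent; negative values are reductions."""
    return (new - baseline) / baseline * 100.0


def point_difference(baseline: float, new: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new - baseline


def reduction_factor(baseline: float, new: float) -> float:
    """How many times smaller the new value is than the baseline."""
    return baseline / new


# Values copied from the headline results table.
i2p = relative_change(48.9, 18.07)       # I2P inappropriate-content rate
nudenet = relative_change(646, 15)       # NudeNet detections
geneval = point_difference(42.08, 47.83) # GenEval accuracy
mma = reduction_factor(22.6, 2.6)        # MMA-Diffusion ASR

print(f"I2P rate:           {i2p:+.0f} % relative")
print(f"NudeNet detections: {nudenet:+.1f} % relative")
print(f"GenEval accuracy:   {geneval:+.2f} pp")
print(f"MMA-Diffusion ASR:  {mma:.1f}x lower")
```

Rounded to the precision used in the table, the four values come out as −63 %, −97.7 %, +5.75 pp, and 8.7×.
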
## Method in one paragraph

For an NSFW prompt `p`, vanilla SD produces an image aligned to the
unsafe text embedding `z_p`. We instead reward the model against a
**steered** target `z_p + α · v_safe`, where `α` sets the steering
strength and `v_safe` is a single direction in CLIP-text space, computed
once as the difference of means between a small set of safe and unsafe
anchor phrases. GRPO post-training then nudges the UNet to satisfy this
steered reward. Because `v_safe` is computed from a **frozen** CLIP
encoder, the target is stationary: samples drift on-policy, but the
anchor they are regressed onto does not.

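The quantities in this paragraph can be sketched with toy 2-D vectors standing in for frozen CLIP embeddings. This is a minimal illustration under stated assumptions, not the repository's implementation: the sign convention (safe mean minus unsafe mean), the cosine-similarity reward, and the standardised group advantages are guesses at a plausible form, since the README does not spell out the exact reward or GRPO objective. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

Vector = list[float]


def mean_vector(vectors: list[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    return [mean(component) for component in zip(*vectors)]


def steering_direction(safe: list[Vector], unsafe: list[Vector]) -> Vector:
    """v_safe as a difference of means (assumed sign: safe minus unsafe)."""
    mu_safe, mu_unsafe = mean_vector(safe), mean_vector(unsafe)
    return [s - u for s, u in zip(mu_safe, mu_unsafe)]


def steered_target(z_p: Vector, v_safe: Vector, alpha: float) -> Vector:
    """The steered target z_p + alpha * v_safe."""
    return [z + alpha * v for z, v in zip(z_p, v_safe)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an assumed stand-in for the reward."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Rewards standardised within one prompt's sample group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy anchors; real ones would be CLIP text embeddings of anchor phrases.
safe_anchors = [[1.0, 0.0], [0.8, 0.2]]
unsafe_anchors = [[0.0, 1.0], [0.2, 0.8]]
z_p = [0.1, 0.9]  # embedding of an unsafe prompt

# Computed once from fixed anchors, so the target does not move during training.
v_safe = steering_direction(safe_anchors, unsafe_anchors)
target = steered_target(z_p, v_safe, alpha=1.0)

# One group of samples for the same prompt, as hypothetical image embeddings.
group = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
rewards = [cosine(image, target) for image in group]
advantages = group_advantages(rewards)

for image, r, a in zip(group, rewards, advantages):
    print(f"image={image} reward={r:.3f} advantage={a:+.3f}")
```

In this toy setup the sample closest to the steered target gets the largest reward and a positive advantage, while the sample that matches the unsafe prompt embedding gets a negative one.
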
## Repository

Training code, evaluation scripts, ablation checkpoints, and the rebuttal
results:
**[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**

## Citation

```bibtex
@inproceedings{safediffusion_r1_2026,
  title     = {SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
  author    = {(authors)},
  booktitle = {(venue)},
  year      = {2026}
}
```