---
library_name: diffusers
pipeline_tag: text-to-image
license: mit
base_model: runwayml/stable-diffusion-v1-5
tags:
- stable-diffusion
- safety
- safe-generation
- grpo
- steering-reward
- nsfw-removal
- safediffusion-r1
---

# SafeDiffusion-R1 — Online Reward Steering for Safe Diffusion Post-Training

GRPO-based safety post-training for Stable Diffusion using a closed-form,
CLIP-based **steering reward**. No separately trained safety classifier;
no paired safe/unsafe image dataset.

The repository contains three full `StableDiffusionPipeline` variants
(each in its own subfolder), trained with different anchor sets.

| Subfolder | Anchors (safe + unsafe) | Notes |
|---|---|---|
| **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6 %, epoch 300). |
| **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |

## Quick inference

```python
import os

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download

# StableDiffusionPipeline.from_pretrained does not natively accept
# `subfolder=` for the FULL pipeline (only single components), so we
# snapshot the variant we want then load from the local path.
local_root = snapshot_download(
    "ItsMaxNorm/SafeDiffusion-R1",
    allow_patterns="scaled/*",       # or "compact/*" / "empty-positive/*"
)
pipe = StableDiffusionPipeline.from_pretrained(
    os.path.join(local_root, "scaled"),
    torch_dtype=torch.float16,
).to("cuda")

img = pipe("a photo of a cat sleeping on a couch").images[0]
img.save("out.png")
```

## Headline results (vs. SD-v1.4 baseline)

| Benchmark | SD-v1.4 | SafeDiffusion-R1 | Δ |
|---|---|---|---|
| I2P inappropriate-content rate | 48.9 % | **18.07 %** (scaled) | −63 % |
| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** (scaled) | **−97.7 %** |
| GenEval compositional accuracy | 42.08 % | **47.83 %** (scaled) | +5.75 pp |
| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact) | **8.7×** lower ASR |
| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** (compact) | model resists most prompts before any attack |

The safety gains generalise to **seven OOD harm categories** (hate,
harassment, violence, self-harm, shocking, illegal-activity, sexual)
despite training only on benign + nudity-style negatives.
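
The Δ column in the table above mixes three kinds of change: relative
reduction (I2P, NudeNet), absolute percentage-point gain (GenEval), and a
ratio (MMA-Diffusion). As a quick sanity check, the short script below
recomputes each from the raw table values. The helper name
`relative_change` is illustrative only and is not part of the released code.

```python
# Raw numbers copied from the table above: (SD-v1.4 baseline, SafeDiffusion-R1).
i2p_rate = (48.9, 18.07)    # % inappropriate content on I2P
nudenet = (646, 15)         # NudeNet detection counts on I2P
geneval = (42.08, 47.83)    # % GenEval compositional accuracy
mma_asr = (22.6, 2.6)       # % MMA-Diffusion attack success rate


def relative_change(baseline, new):
    """Signed relative change in percent (hypothetical helper)."""
    return 100.0 * (new - baseline) / baseline


i2p_delta = relative_change(*i2p_rate)      # relative reduction
nudenet_delta = relative_change(*nudenet)   # relative reduction
geneval_gain_pp = geneval[1] - geneval[0]   # absolute gain, percentage points
mma_factor = mma_asr[0] / mma_asr[1]        # how many times lower the ASR is

print(f"I2P:     {i2p_delta:.1f} %")         # -63.0 %
print(f"NudeNet: {nudenet_delta:.1f} %")     # -97.7 %
print(f"GenEval: +{geneval_gain_pp:.2f} pp") # +5.75 pp
print(f"MMA ASR: {mma_factor:.1f}x lower")   # 8.7x lower
```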

## Method in one paragraph

For an NSFW prompt `p`, vanilla SD produces an image aligned to the
unsafe text embedding `z_p`. We instead reward the model for matching a
**steered** target `z_p + α · v_safe`, where `v_safe` is a single
direction in CLIP-text space, computed once as the difference of means
between a small set of safe and unsafe anchor phrases. GRPO post-training
then nudges the UNet to maximise this steered reward. Because `v_safe`
is computed from a **frozen** CLIP encoder, the target is stationary:
samples drift as the policy updates, but the target they are scored
against does not.
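
The paragraph above can be illustrated with a minimal NumPy sketch. It is
not the released training code: random vectors stand in for frozen CLIP
embeddings, the reward is assumed to be the cosine similarity between an
image embedding and the steered target, and the group-normalised advantage
is the commonly used GRPO form. All function names here are hypothetical;
see the linked repository for the actual reward.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def steering_direction(safe_emb, unsafe_emb):
    """v_safe: mean of safe anchor embeddings minus mean of unsafe ones."""
    return safe_emb.mean(axis=0) - unsafe_emb.mean(axis=0)


def steered_target(z_p, v_safe, alpha):
    """Steered target z_p + alpha * v_safe from the paragraph above."""
    return z_p + alpha * v_safe


def steering_reward(image_emb, z_p, v_safe, alpha=1.0):
    """ASSUMPTION: reward = cosine similarity to the steered target."""
    target = l2_normalize(steered_target(z_p, v_safe, alpha))
    return l2_normalize(image_emb) @ target


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalise rewards within one group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Toy stand-ins for frozen CLIP embeddings (ASSUMPTION: no real encoder here).
rng = np.random.default_rng(0)
dim = 16
safe_anchors = rng.normal(size=(5, dim)) + 1.0    # e.g. 5 safe anchor phrases
unsafe_anchors = rng.normal(size=(3, dim)) - 1.0  # e.g. 3 unsafe anchor phrases

# Computed once; never updated during training because the encoder is frozen.
v_safe = steering_direction(safe_anchors, unsafe_anchors)

# Pretend prompt embedding that sits near the unsafe cluster.
z_p = unsafe_anchors.mean(axis=0)

# One group of 4 sampled "images" for the same prompt, from unsafe-ish to safe-ish.
group = rng.normal(size=(4, dim)) + np.linspace(-1.0, 1.0, 4)[:, None]

rewards = steering_reward(group, z_p, v_safe, alpha=2.0)
advantages = group_relative_advantages(rewards)
print("rewards:   ", np.round(rewards, 3))
print("advantages:", np.round(advantages, 3))
```

Because `v_safe` and `z_p` are fixed for a given prompt, only the sampled
images change between policy updates, which is what keeps the target
stationary in this sketch.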

## Code, training, and evaluation

Training code, the steering reward, evaluation scripts (FID, CLIP-score,
NudeNet, Q16, LPIPS, style-loss), and the end-to-end eval wrapper that
works directly against this Hub release are available at:
**[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**

## Citation

```bibtex
@misc{kumar2026safediffusionr1,
      title={SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training}, 
      author={Komal Kumar and Ankan Deria and Abhishek Basu and Fahad Shamshad and Hisham Cholakkal and Karthik Nandakumar},
      year={2026},
      eprint={2605.18719},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18719}, 
}
```