ItsMaxNorm/SafeDiffusion-R1

 ---
+library_name: diffusers
+pipeline_tag: text-to-image
 license: mit
+base_model: runwayml/stable-diffusion-v1-5
+tags:
+- stable-diffusion
+- safety
+- safe-generation
+- grpo
+- steering-reward
+- nsfw-removal
+- safediffusion-r1
 ---
+# SafeDiffusion-R1 — Online Reward Steering for Safe Diffusion Post-Training
+GRPO-based safety post-training for Stable Diffusion using a closed-form,
+CLIP-based **steering reward**. No separately trained safety classifier;
+no paired safe/unsafe image dataset.
+The repository contains three drop-in `StableDiffusionPipeline` variants
+(loadable via `subfolder=...`), each trained with a different anchor set.
+| Subfolder | Anchor set (safe + unsafe) | Notes |
+|---|---|---|
+| **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
+| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6%, epoch 300). |
+| **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |
+## Quick inference
+```python
+from diffusers import StableDiffusionPipeline
+import torch
+pipe = StableDiffusionPipeline.from_pretrained(
+    "ItsMaxNorm/SafeDiffusion-R1",
+    subfolder="scaled",        # or "compact" / "empty-positive"
+    torch_dtype=torch.float16,
+).to("cuda")
+img = pipe("a photo of a cat sleeping on a couch").images[0]
+img.save("out.png")
+```
+## Headline results (vs. SD-v1.4 baseline)
+| Benchmark | SD-v1.4 | SafeDiffusion-R1 (scaled) | Δ |
+|---|---|---|---|
+| I2P inappropriate-content rate | 48.9 % | **18.07 %** | −63 % (relative) |
+| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** | **−97.7 %** (relative) |
+| GenEval compositional accuracy | 42.08 % | **47.83 %** | +5.75 pp |
+| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact variant) | **8.7×** lower |
+| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** | +52.5 pp; the model resists most prompts before any attack is run |
+The safety gains generalise to **seven out-of-distribution (OOD) harm
+categories** (hate, harassment, violence, self-harm, shocking,
+illegal-activity, sexual), even though training used only benign and
+nudity-style negatives.
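The Δ column in the table above mixes three kinds of quantity: relative change, percentage-point difference, and a ratio. The short sketch below recomputes each one from the baseline and post-training numbers reported in the table. The helper name `relative_change_pct` is introduced here for illustration only and is not part of the repository.

```python
def relative_change_pct(baseline, new):
    """Relative change from baseline to new, in percent."""
    return 100.0 * (new - baseline) / baseline

# Values copied from the headline table.
i2p_rel = relative_change_pct(48.9, 18.07)   # I2P inappropriate-content rate
nudenet_rel = relative_change_pct(646, 15)   # NudeNet detections
geneval_pp = 47.83 - 42.08                   # percentage-point difference
mma_ratio = 22.6 / 2.6                       # baseline ASR / compact-variant ASR

print(f"I2P: {i2p_rel:.1f} % relative")
print(f"NudeNet: {nudenet_rel:.1f} % relative")
print(f"GenEval: {geneval_pp:+.2f} pp")
print(f"MMA-Diffusion: {mma_ratio:.1f}x lower ASR")
```

The I2P and NudeNet rows are relative reductions, the GenEval row is an absolute difference in percentage points, and the MMA-Diffusion row is a ratio of attack success rates.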
+## Method in one paragraph
+For an NSFW prompt `p`, vanilla SD produces an image aligned to the
+unsafe text embedding `z_p`. We instead reward the model against a
+**steered** target `z_p + α · v_safe`, where `α` scales the steering
+and `v_safe` is a single direction in CLIP-text space, computed once as
+the difference of means between a small set of safe and unsafe anchor
+phrases. GRPO post-training then nudges the UNet to satisfy this steered
+reward. Because `v_safe` is computed from a **frozen** CLIP encoder, the
+target is stationary: samples drift as the policy updates, but the
+target they are scored against does not.
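As a rough illustration of the paragraph above, the following NumPy sketch builds `v_safe` as a difference of means, forms the steered target `z_p + α · v_safe`, and turns a group of rewards into group-relative advantages in the usual GRPO style. Several things here are assumptions rather than the repository's implementation: random vectors stand in for frozen CLIP text and image embeddings, the reward is taken to be cosine similarity to the steered target, `alpha=2.0` and the embedding size are arbitrary, and all function names are hypothetical.

```python
import numpy as np

def unit(x, eps=1e-8):
    """L2-normalise along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def steering_direction(safe_emb, unsafe_emb):
    """Difference of means: mean(safe anchors) - mean(unsafe anchors)."""
    return np.mean(safe_emb, axis=0) - np.mean(unsafe_emb, axis=0)

def steered_target(z_p, v_safe, alpha):
    """Shift the prompt embedding along the safe direction."""
    return z_p + alpha * v_safe

def steering_reward(z_img, z_target):
    """ASSUMPTION: reward = cosine similarity to the steered target."""
    return float(np.dot(unit(z_img), unit(z_target)))

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise rewards within one sample group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy stand-ins for frozen CLIP embeddings (25 safe + 20 unsafe anchors).
rng = np.random.default_rng(0)
dim = 8
safe_emb = rng.normal(size=(25, dim))
unsafe_emb = rng.normal(size=(20, dim))

v_safe = steering_direction(safe_emb, unsafe_emb)   # computed once, then fixed
z_p = rng.normal(size=dim)                          # embedding of prompt p
z_target = steered_target(z_p, v_safe, alpha=2.0)

# One group of sampled images for the same prompt (embeddings are stand-ins).
group = rng.normal(size=(4, dim))
rewards = [steering_reward(z, z_target) for z in group]
advantages = group_advantages(rewards)
print(advantages)
```

Because `v_safe` depends only on the anchor embeddings, it can be cached before training starts; only the sampled images, and therefore the rewards and advantages, change from step to step.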
+## Repository
+Training code, evaluation scripts, ablation checkpoints, and the rebuttal
+results:
+**[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**
+## Citation
+```bibtex
+@inproceedings{safediffusion_r1_2026,
+  title  = {SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
+  author = {(authors)},
+  booktitle = {(venue)},
+  year   = {2026}
+}
+```