ItsMaxNorm committed · Commit 5c1cabe · verified · 1 Parent(s): 9a7a271

Add top-level model card with method, headline results, and per-variant load snippet

Files changed (1)
  1. README.md +84 -0
README.md CHANGED
@@ -1,3 +1,87 @@
  ---
+ library_name: diffusers
+ pipeline_tag: text-to-image
  license: mit
+ base_model: runwayml/stable-diffusion-v1-5
+ tags:
+ - stable-diffusion
+ - safety
+ - safe-generation
+ - grpo
+ - steering-reward
+ - nsfw-removal
+ - safediffusion-r1
  ---
+
+ # SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
+
+ GRPO-based safety post-training for Stable Diffusion using a closed-form,
+ CLIP-based **steering reward**. The method needs no separately trained
+ safety classifier and no paired safe/unsafe image dataset.
+
+ The repository contains three drop-in `StableDiffusionPipeline` variants
+ (loadable via `subfolder=...`), each trained with a different anchor set.
+
+ | Subfolder | Anchor set (safe + unsafe) | Notes |
+ |---|---|---|
+ | **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
+ | **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6 %, epoch 300). |
+ | **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |
+
+ ## Quick inference
+
+ ```python
+ from diffusers import StableDiffusionPipeline
+ import torch
+
+ # Load one variant from its subfolder in half precision and move it to the GPU.
+ pipe = StableDiffusionPipeline.from_pretrained(
+     "ItsMaxNorm/SafeDiffusion-R1",
+     subfolder="scaled",  # or "compact" / "empty-positive"
+     torch_dtype=torch.float16,
+ ).to("cuda")
+
+ # Generate a single image for a benign prompt and write it to disk.
+ img = pipe("a photo of a cat sleeping on a couch").images[0]
+ img.save("out.png")
+ ```
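To compare the three variants on the same prompt, one option is to fix the random seed and change only the `subfolder`. The sketch below assumes the loading call from the quick-inference snippet works unchanged for every variant. `VARIANTS` and `generate_with_variant` are hypothetical names introduced here for illustration; they are not part of the repository.

```python
VARIANTS = ("scaled", "compact", "empty-positive")  # subfolders listed in the model card


def generate_with_variant(variant, prompt, seed=0, device="cuda"):
    """Generate one image from the given variant with a fixed seed (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")

    # Imported lazily so the name check above runs even without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "ItsMaxNorm/SafeDiffusion-R1",
        subfolder=variant,
        torch_dtype=torch.float16,
    ).to(device)

    # A seeded generator gives every variant the same starting noise for a given seed.
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]


# Usage (downloads weights and needs a CUDA device):
# for name in VARIANTS:
#     generate_with_variant(name, "a photo of a cat sleeping on a couch").save(f"out_{name}.png")
```

Reloading the pipeline on every call keeps the sketch short; for repeated use it would be cheaper to load each variant once and reuse it.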
+
+ ## Headline results (vs. SD-v1.4 baseline)
+
+ | Benchmark | SD-v1.4 | SafeDiffusion-R1 (scaled) | Δ |
+ |---|---|---|---|
+ | I2P inappropriate-content rate | 48.9 % | **18.07 %** | −63 % (relative) |
+ | NudeNet detections (I2P, 4 703 prompts) | 646 | **15** | **−97.7 %** |
+ | GenEval compositional accuracy | 42.08 % | **47.83 %** | +5.75 pp |
+ | MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact variant) | **8.7×** lower |
+ | SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** | model resists most prompts before any attack |
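The Δ column mixes three kinds of comparison: relative change (I2P, NudeNet), a difference in percentage points (GenEval), and a ratio (MMA-Diffusion). As a quick sanity check, the figures can be recomputed from the two value columns. The helper name `relative_change_pct` is an illustrative assumption; the inputs are copied from the table.

```python
def relative_change_pct(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


i2p_delta = relative_change_pct(48.9, 18.07)   # I2P inappropriate-content rate
nudenet_delta = relative_change_pct(646, 15)   # NudeNet detection counts
geneval_delta_pp = 47.83 - 42.08               # absolute gain in percentage points
mma_ratio = 22.6 / 2.6                         # baseline ASR divided by compact-variant ASR

print(f"I2P:     {i2p_delta:.1f} %")           # -63.0 %
print(f"NudeNet: {nudenet_delta:.1f} %")       # -97.7 %
print(f"GenEval: +{geneval_delta_pp:.2f} pp")  # +5.75 pp
print(f"MMA:     {mma_ratio:.1f}x lower")      # 8.7x lower
```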
+
+ The safety gains generalise to **seven OOD harm categories** (hate,
+ harassment, violence, self-harm, shocking, illegal-activity, sexual)
+ despite training only on benign + nudity-style negatives.
+
+ ## Method in one paragraph
+
+ For an NSFW prompt `p`, vanilla SD produces an image aligned to the
+ unsafe text embedding `z_p`. We instead reward the model against a
+ **steered** target `z_p + α · v_safe`, where `v_safe` is a single
+ direction in CLIP-text space, computed once as the difference of means
+ between a small set of safe and unsafe anchor phrases. GRPO post-training
+ then nudges the UNet towards images that score highly under this steered
+ reward. Because `v_safe` is computed from a **frozen** CLIP encoder, the
+ target is stationary: samples drift on-policy, but the anchor they are
+ regressed onto does not.
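A minimal sketch of the quantities named above, using toy 3-dimensional vectors in place of real CLIP embeddings. Several things here are assumptions rather than statements about the released code: that `v_safe` is the safe mean minus the unsafe mean, that the reward is the cosine similarity between an image embedding and the steered target, and that advantages use the usual GRPO group normalisation. All function names and numeric values are hypothetical.

```python
import math
from statistics import mean, pstdev


def vec_mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def steering_direction(safe_embs, unsafe_embs):
    """v_safe as a difference of means (assumed sign: safe mean minus unsafe mean)."""
    return [s - u for s, u in zip(vec_mean(safe_embs), vec_mean(unsafe_embs))]


def steered_target(z_p, v_safe, alpha):
    """Steered target z_p + alpha * v_safe."""
    return [z + alpha * v for z, v in zip(z_p, v_safe)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one prompt's sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins for anchor-phrase embeddings (hypothetical values).
safe_anchors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
unsafe_anchors = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
v_safe = steering_direction(safe_anchors, unsafe_anchors)  # roughly [0.9, -0.8, -0.1]

# Unsafe prompt embedding and its steered target.
z_p = [0.1, 0.9, 0.1]
target = steered_target(z_p, v_safe, alpha=1.0)

# Toy embeddings for a group of three sampled images: safe-leaning, mixed, unsafe-leaning.
image_embs = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.1, 0.9, 0.1]]
rewards = [cosine(e, target) for e in image_embs]
advantages = group_relative_advantages(rewards)
```

In this toy setup the safe-leaning sample gets the highest reward and a positive advantage, while the sample that stays close to `z_p` gets a negative one. The target depends only on `z_p`, `α`, and the fixed anchors, so it does not change as the policy is updated.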
+
+ ## Repository
+
+ Training code, evaluation scripts, ablation checkpoints, and the rebuttal
+ results are available at:
+ **[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{safediffusion_r1_2026,
+   title     = {SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
+   author    = {(authors)},
+   booktitle = {(venue)},
+   year      = {2026}
+ }
+ ```