ItsMaxNorm/SafeDiffusion-R1

@@ -19,24 +19,31 @@ GRPO-based safety post-training for Stable Diffusion using a closed-form,
 CLIP-based **steering reward**. No separately trained safety classifier;
 no paired safe/unsafe image dataset.
-The repository contains three drop-in `StableDiffusionPipeline` variants
-(loadable via `subfolder=...`), each trained with a different anchor set.
-| Subfolder | Anchor set (safe + unsafe) | Notes |
 |---|---|---|
 | **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
-| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6%, epoch 300). |
 | **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |
 ## Quick inference
 ```python
 from diffusers import StableDiffusionPipeline
-import torch
-pipe = StableDiffusionPipeline.from_pretrained(
     "ItsMaxNorm/SafeDiffusion-R1",
-    subfolder="scaled",        # or "compact" / "empty-positive"
     torch_dtype=torch.float16,
 ).to("cuda")
@@ -46,13 +53,13 @@ img.save("out.png")
 ## Headline results (vs. SD-v1.4 baseline)
-| Benchmark | SD-v1.4 | SafeDiffusion-R1 (scaled) | Δ |
 |---|---|---|---|
-| I2P inappropriate-content rate | 48.9 % | **18.07 %** | −63 % |
-| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** | **−97.7 %** |
-| GenEval compositional accuracy | 42.08 % | **47.83 %** | +5.75 pp |
-| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact variant) | **8.7×** safer |
-| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** | model resists most prompts before any attack |
 The safety gains generalise to **seven OOD harm categories** (hate,
 harassment, violence, self-harm, shocking, illegal-activity, sexual)
@@ -69,10 +76,11 @@ then nudges the UNet to satisfy this steered reward. Because `v_safe`
 is computed from a **frozen** CLIP encoder, the target is stationary —
 samples drift on-policy but the anchor they're regressed onto does not.
-## Repository
-Training code, evaluation scripts, ablation checkpoints, and the rebuttal
-results:
 **[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**
 ## Citation

 CLIP-based **steering reward**. No separately trained safety classifier;
 no paired safe/unsafe image dataset.
+The repository contains three full `StableDiffusionPipeline` variants
+(each in its own subfolder), trained with different anchor sets.
+| Subfolder | Anchors (safe + unsafe) | Notes |
 |---|---|---|
 | **`scaled`** | 25 + 20 | Main paper checkpoint (epoch 280). |
+| **`compact`** | 5 + 3 | Best MMA-Diffusion ASR (2.6 %, epoch 300). |
 | **`empty-positive`** | 0 + 3 | Ablation: only negative anchors. |
 ## Quick inference
 ```python
+from huggingface_hub import snapshot_download
 from diffusers import StableDiffusionPipeline
+import os, torch
+# StableDiffusionPipeline.from_pretrained does not accept `subfolder=`
+# for a FULL pipeline (only for single components), so we download just
+# the chosen variant's folder and load the pipeline from that local path.
+local_root = snapshot_download(
     "ItsMaxNorm/SafeDiffusion-R1",
+    allow_patterns="scaled/*",       # or "compact/*" / "empty-positive/*"
+)
+pipe = StableDiffusionPipeline.from_pretrained(
+    os.path.join(local_root, "scaled"),
     torch_dtype=torch.float16,
 ).to("cuda")
 ## Headline results (vs. SD-v1.4 baseline)
+| Benchmark | SD-v1.4 | SafeDiffusion-R1 | Δ |
 |---|---|---|---|
+| I2P inappropriate-content rate | 48.9 % | **18.07 %** (scaled) | −63 % |
+| NudeNet detections (I2P, 4 703 prompts) | 646 | **15** (scaled) | **−97.7 %** |
+| GenEval compositional accuracy | 42.08 % | **47.83 %** (scaled) | +5.75 pp |
+| MMA-Diffusion ASR (1 000-prompt benchmark) | 22.6 % | **2.6 %** (compact) | **8.7×** safer |
+| SneakyPrompt skip-rate (200 NSFW prompts) | 37 % | **89.5 %** (compact) | model resists most prompts before any attack |
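
The Δ column mixes three kinds of comparison: relative reduction (I2P, NudeNet), percentage-point difference (GenEval), and a ratio (MMA-Diffusion). As a quick sanity check, the sketch below recomputes each one from the figures in the table itself; the helper name `relative_change` is only illustrative and is not part of the repository.

```python
# Recompute the Δ column from the figures reported in the table above.

def relative_change(baseline: float, new: float) -> float:
    """Relative change in percent; negative values mean a reduction."""
    return (new - baseline) / baseline * 100.0

i2p = relative_change(48.9, 18.07)   # I2P inappropriate-content rate
nudenet = relative_change(646, 15)   # NudeNet detections on I2P
geneval_pp = 47.83 - 42.08           # percentage-point difference
mma_ratio = 22.6 / 2.6               # baseline ASR divided by the new ASR

print(f"I2P:           {i2p:.0f} %")
print(f"NudeNet:       {nudenet:.1f} %")
print(f"GenEval:       {geneval_pp:+.2f} pp")
print(f"MMA-Diffusion: {mma_ratio:.1f}x")
```

The printed values round to the table's entries (−63 %, −97.7 %, +5.75 pp, 8.7×), so the I2P and NudeNet deltas are relative reductions rather than percentage-point differences.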
 The safety gains generalise to **seven OOD harm categories** (hate,
 harassment, violence, self-harm, shocking, illegal-activity, sexual)
 is computed from a **frozen** CLIP encoder, the target is stationary —
 samples drift on-policy but the anchor they're regressed onto does not.
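
To make the "stationary target" point concrete, here is a minimal sketch of what a closed-form steering reward of this kind could look like. The exact formula is not shown in this excerpt, so everything below is an assumption: the direction is taken as the normalised difference between the safe and unsafe anchor centroids, the reward as a cosine similarity, and the 3-d vectors are toy stand-ins for frozen CLIP embeddings. The function names `steering_direction` and `steering_reward` are hypothetical.

```python
import math

def mean_vec(vectors, dim):
    """Element-wise mean; a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def steering_direction(safe_anchors, unsafe_anchors, dim):
    """ASSUMED form: unit vector from the unsafe centroid towards the safe one."""
    safe_c = mean_vec(safe_anchors, dim)
    unsafe_c = mean_vec(unsafe_anchors, dim)
    return normalize([s - u for s, u in zip(safe_c, unsafe_c)])

def steering_reward(image_emb, v_safe):
    """ASSUMED form: cosine similarity between the image embedding and v_safe."""
    img = normalize(image_emb)
    return sum(a * b for a, b in zip(img, v_safe))

# Toy embeddings (hypothetical values, not real CLIP outputs).
safe = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
unsafe = [[0.0, 1.0, 0.0]]

# Computed once from the frozen encoder's anchors and never updated in training.
v_safe = steering_direction(safe, unsafe, dim=3)

r_safe = steering_reward([0.9, 0.1, 0.0], v_safe)    # image near the safe anchors
r_unsafe = steering_reward([0.1, 0.9, 0.0], v_safe)  # image near the unsafe anchor
print(r_safe, r_unsafe)

# With no safe anchors (as in the `empty-positive` ablation), this assumed
# direction simply points away from the unsafe centroid.
v_neg_only = steering_direction([], unsafe, dim=3)
print(steering_reward([0.0, 1.0, 0.0], v_neg_only))
```

Under these assumptions the reward is positive for the image near the safe anchors and negative for the one near the unsafe anchor, and `v_safe` stays fixed no matter how the policy's samples move.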
+## Code, training, and evaluation
+Training code, the steering reward, evaluation scripts (FID, CLIP-score,
+NudeNet, Q16, LPIPS, style-loss), and the end-to-end eval wrapper that
+works directly against this Hub release:
 **[https://github.com/MAXNORM8650/SafeDiffusion-R1](https://github.com/MAXNORM8650/SafeDiffusion-R1)**
 ## Citation