Spaces:
Running on Zero
Upload 130 files
Browse files

- Dockerfile +3 -1
- PIPELINE_EFFICIENCY_AUDIT.md +181 -0
- app.py +175 -43
- docs/EFFICIENCY_AUDIT.md +198 -0
- obliteratus/abliterate.py +489 -91
- obliteratus/analysis/activation_probing.py +24 -16
- obliteratus/analysis/sae_abliteration.py +30 -6
- obliteratus/bayesian_optimizer.py +7 -13
- obliteratus/telemetry.py +277 -32
- paper/main.tex +39 -20
- paper/references.bib +10 -1
- scripts/run_benchmark_remote.sh +12 -2
- spaces/README.md +18 -4
- tests/test_abliterate.py +1 -1
- tests/test_telemetry.py +10 -1
Dockerfile
CHANGED
@@ -1,3 +1,6 @@
+# NOTE: This Dockerfile is for LOCAL Docker usage only.
+# On HuggingFace Spaces, the Space uses sdk=gradio with ZeroGPU
+# (see spaces/README.md) -- this Dockerfile is NOT used there.
 FROM python:3.11-slim
 
 # System deps for audio/image processing that gradio may need
@@ -5,7 +8,6 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     ffmpeg libsndfile1 git \
     && rm -rf /var/lib/apt/lists/*
 
-# HF Spaces expects the app at /app on port 7860
 WORKDIR /app
 
 # Install Python deps first (cache layer)
PIPELINE_EFFICIENCY_AUDIT.md
ADDED
# OBLITERATUS Pipeline Efficiency Audit

**Date:** 2026-03-03
**Scope:** All obliteration methods in `abliterate.py` (5,076 lines), `bayesian_optimizer.py`, `informed_pipeline.py`, and 4 ablation strategies.

---

## Executive Summary

The 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH) is architecturally sound with good separation of concerns. Memory hygiene between stages is correct. The rank-1 projection math is efficient. Quantization handling is robust.

**8 concrete efficiency issues found.** Estimated cumulative impact: **~40-60% wall-clock reduction** on typical runs (8B model, advanced/surgical methods). Ordered by ROI (ease × impact).

---

## HIGH PRIORITY (Fix This Week)

### 1. PROBE runs 1,536 prompts with zero batching

**Location:** `abliterate.py:1074-1088`
**Impact:** Largest single wall-clock bottleneck (~77s on 8B model, reducible to ~10s)

The activation collection loop processes each prompt individually, with a full forward pass plus a GC cycle between prompts. With 512 harmful + 512 harmless + 512 jailbreak prompts, that is 1,536 serial forward passes.

The `_free_gpu_memory()` call at line 1086 is **inside the per-prompt loop**, adding ~20ms × 1,536 ≈ 30s of pure garbage-collection overhead.

```python
# CURRENT (serial)
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt", ...)
    model(**inputs)
    del inputs
    self._free_gpu_memory()  # <-- 30s wasted
```

**Fix:** Batch prompts (batch_size=8-16). Hooks already handle the batch dimension correctly via `hidden[:, -1, :]`. Move `_free_gpu_memory()` to run every N batches, not every prompt.

**Speedup:** ~7-8x on PROBE stage.
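A batched loop could look like the following sketch. The helper name, the GC cadence, and the `free_gpu_memory` parameter are illustrative; it assumes left padding so that hooks reading `hidden[:, -1, :]` still see the true last token of every prompt in the batch.

```python
import torch

def collect_activations_batched(model, tokenizer, prompts,
                                batch_size=16, gc_every=8,
                                free_gpu_memory=lambda: None):
    """Batched replacement for the serial PROBE loop (illustrative names).

    Left padding keeps the last token position meaningful, so hooks that
    read hidden[:, -1, :] keep working unchanged on batched inputs.
    """
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    with torch.no_grad():
        for b, start in enumerate(range(0, len(prompts), batch_size)):
            batch = prompts[start:start + batch_size]
            inputs = tokenizer(batch, return_tensors="pt", padding=True,
                               truncation=True).to(model.device)
            model(**inputs)  # registered hooks capture per-layer activations
            del inputs
            if (b + 1) % gc_every == 0:  # GC every N batches, not every prompt
                free_gpu_memory()
```

With batch_size=16, the 1,536 prompts become 96 forward passes and roughly a dozen GC cycles instead of 1,536 of each.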
---

### 2. VERIFY generates 30 completions sequentially, with no batching

**Location:** `abliterate.py:4622-4670`
**Impact:** Second-largest wall-clock cost (~57s on 8B model, reducible to ~15s)

Each of the 30 refusal-test prompts gets an independent `model.generate(max_new_tokens=128)` call. At ~15ms/token on an 8B model, that's 30 × 128 × 15ms ≈ 57s.

**Fix:** Batch the generation calls (batch_size=4-8). `model.generate()` supports batched inputs natively, and the tokenizer already handles padding.

**Speedup:** ~4x on VERIFY stage.
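A minimal sketch of the batched call, assuming the usual HF `generate()` contract (the helper name is hypothetical). Left padding is required so every prompt in a batch ends at the same position and the prompt prefix can be sliced off uniformly:

```python
import torch

def generate_batched(model, tokenizer, prompts, batch_size=8, max_new_tokens=128):
    """Batched replacement for the per-prompt VERIFY generation (sketch)."""
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    completions = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id)
        # Drop the (uniform-length, left-padded) prompt prefix
        new_tokens = out[:, inputs["input_ids"].shape[1]:]
        completions.extend(
            tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return completions
```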
---

### 3. SAE training is forced to CPU with no early stopping

**Location:** `abliterate.py:1579-1583`
**Impact:** Moderate; adds ~20-40s per run when SAE features are enabled (surgical, nuclear methods)

SAE training runs 30 fixed epochs per strong layer on CPU. With 15-20 strong layers, that's 450-600 CPU training epochs, with no convergence check and no early stopping.

The `device="cpu"` is overly conservative: the memory-aware cap at lines 1570-1578 already validates GPU headroom, and a typical SAE encoder (expansion=2, hidden_dim=4096) is only ~128MB.

**Fix:**
1. Add early stopping when reconstruction loss plateaus (< 0.1% improvement over 3 epochs)
2. Use GPU when `free_mb > sae_mem_mb + 1024` (1GB headroom)
3. Reduce default epochs from 30 to 15 with a convergence guard
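The convergence guard in fix 1 could be as simple as the sketch below. The function name is hypothetical, the thresholds are the audit's suggested defaults, and the full-batch training step is schematic:

```python
def train_with_early_stopping(sae, acts, optimizer, loss_fn,
                              max_epochs=15, patience=3, min_rel_improve=1e-3):
    """Stop SAE training once reconstruction loss improves by less than
    0.1% for `patience` consecutive epochs (schematic full-batch step)."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_fn(sae(acts), acts)  # reconstruction loss
        loss.backward()
        optimizer.step()
        val = loss.item()
        if val < best * (1.0 - min_rel_improve):
            best, stale = val, 0  # meaningful improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                break  # plateaued: stop early
    return best
```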
---

## MEDIUM PRIORITY (Fix This Sprint)

### 4. `_distill_inner()` is a degraded copy of `_distill()` that drops half the SOTA techniques

**Location:** `abliterate.py:2958-3055` vs `1102-1750`
**Impact:** Quality regression on refinement passes 2+, not pure compute waste

The iterative refinement path calls `_distill_inner()`, a simplified ~100-line copy that skips Wasserstein-optimal extraction, layer-adaptive strength, float layer interpolation, SAE features, EGA, CoT-aware orthogonalization, and RDO refinement.

This means "true iterative refinement" actually produces **worse directions on later passes**, because it drops the analysis-guided enhancements.

**Fix:** Extract the shared SVD/direction logic into `_extract_directions(full_features=True/False)` and call it from both paths. At minimum, keep whitened SVD and jailbreak-contrastive blending in the inner path.

---

### 5. Bayesian optimizer clones ALL weight tensors: ~7GB memory overhead

**Location:** `bayesian_optimizer.py:300-341`
**Impact:** Memory pressure on GPU-constrained setups; 50× full-restore cycles

The optimizer saves a complete clone of every weight tensor across all strong layers. For a 7B model with 32 layers, that's ~7GB of clones sitting in memory during all 50 trials.

After each trial, `_restore_all()` copies all clones back: 50 trials × a full-model memcpy.

**Fix (easy):** Only clone weights in `_strong_layers` (already partially done, but the `named_parameters()` crawl still catches everything). Drop the `seen_data_ptrs` set once the loop is tightened.

**Fix (better):** Store the projection delta `Δ = scale * d @ (d^T @ W)` per layer instead of cloning the full weight. Rollback is then `W += Δ`. Storing the factors `d` and `d^T @ W` reduces storage from O(hidden_dim²) to O(hidden_dim) per direction per layer.
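A minimal sketch of the factored rollback (function names are hypothetical; `d` is a unit direction of shape `(n,)` and `W` is `(n, m)`):

```python
import torch

def project_with_rollback(W, d, scale):
    """Apply the rank-1 ablation in place and return a compact rollback
    record: storing (d, coeff) is O(n + m) instead of an O(n * m) clone."""
    coeff = d @ W                           # d^T @ W, shape (m,)
    W.sub_(scale * torch.outer(d, coeff))   # W -= scale * d (d^T W)
    return d.clone(), coeff, scale

def rollback(W, record):
    d, coeff, scale = record
    W.add_(scale * torch.outer(d, coeff))   # exact inverse of the update
```

The rollback is exact because the update is a fixed rank-1 matrix built from quantities computed before the subtraction.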
---

### 6. Norm computation in `_project_out_advanced()` traverses the full matrix twice

**Location:** `abliterate.py:3477-3486`
**Impact:** ~4,800 unnecessary full-matrix norm computations per run (8-direction surgical)

When `norm_preserve=True`, the code computes `W.norm()` before projection and `W.norm()` again after projection. Each norm traverses the full weight matrix (16M elements for 4096×4096).

With 8 directions × 30 layers × 10 weight matrices = 2,400 projections, that's 4,800 norm calls ≈ 77 billion unnecessary FLOPs.

**Fix:** After the rank-1 update `W' = W - scale * d @ (d^T @ W)`, the new norm satisfies:
`||W'||² = ||W||² - 2·scale·||d^T @ W||² + scale²·||d^T @ W||²·||d||²`

Since `||d|| = 1`: `||W'||² = ||W||² - scale·(2 - scale)·||coeff||²`

This replaces a 16M-element norm with a single `coeff.pow(2).sum()` call (~4K FLOPs).
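In code, with the pre-projection squared norm cached, the identity above becomes a one-liner (the function name is illustrative; `coeff = d^T @ W` is already computed by the projection itself):

```python
def projected_norm_sq(w_norm_sq, coeff, scale):
    """Squared Frobenius norm of W' = W - scale * d (d^T W) for unit d,
    computed without touching W:
        ||W'||^2 = ||W||^2 - scale * (2 - scale) * ||coeff||^2
    """
    return w_norm_sq - scale * (2.0 - scale) * coeff.pow(2).sum()
```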
---

## LOW PRIORITY (Backlog)

### 7. Gram-Schmidt appears 3 times as O(n²) nested loops

**Location:** `abliterate.py:1168-1173`, `1361-1367`, `3038-3044`
**Impact:** Minimal compute, but a code quality issue

Three separate implementations of the same Gram-Schmidt orthogonalization as nested Python loops. With n_directions=8 it's only 28 dot products per call, so the compute is trivial, but it is (a) a DRY violation and (b) numerically inferior to `torch.linalg.qr()`.

**Fix:** Extract to `_orthogonalize_subspace(sub: Tensor) -> Tensor` using QR decomposition. Single call site, single test, better numerics.
---

### 8. Pre-EXCISE baseline KL capture re-runs 100 prompts already seen in PROBE

**Location:** `abliterate.py:2313-2366`
**Impact:** ~700ms wasted (minor)

`_capture_baseline_kl_logits()` runs 100 harmless prompts through the model to capture pre-EXCISE logits. But PROBE already ran those same prompts and captured hidden states at every layer. The logits could be computed as `lm_head(last_hidden_state)`: a single matmul.

**Fix:** After PROBE, compute `baseline_logits = model.lm_head(harmless_means[last_layer])` on the cached activations. Skip the 100-prompt forward pass entirely.

---

## What's Done Well

| Area | Assessment |
|------|------------|
| **Stage-boundary memory cleanup** | Correct: `_free_gpu_memory()` + explicit dict clearing between stages |
| **Rank-1 projection math** | Efficient: `W @ d` then `d.T * coeff` instead of materializing `I - dd^T` |
| **Quantization dequant/requant** | Robust: handles bitsandbytes NF4, GPTQ, AWQ; fails loudly on unsupported formats |
| **Incremental expert mean** | Smart: Welford running mean in `_transplant_expert_weights()` avoids stacking all expert weights |
| **Router stabilization** | Defensive: `_stabilize_router_weights()` after MoE projection prevents CUDA crashes |
| **Large model mode** | Pragmatic: caps directions, SAE features, refinement passes for 120B+ models |
| **Event emission** | Clean: `_emit()` / `_on_stage()` / `_on_log()` callbacks for UI integration without coupling |

---

## Method Efficiency Comparison

| Method | PROBE Cost | DISTILL Cost | EXCISE Cost | VERIFY Cost | Primary Bottleneck |
|--------|-----------|-------------|-------------|-------------|-------------------|
| **basic** | 1x (1,024 prompts) | 1x (diff-in-means) | 1x (~10 projections) | 1x | PROBE |
| **advanced** | 2x (re-probe on pass 2) | 2x (re-distill) | 2x (2 passes) | 1x | PROBE × 2 |
| **aggressive** | 3x (re-probe on passes 2, 3) | 3x (re-distill) | 3x (3 passes, 8 dirs) | 1x | PROBE × 3 |
| **surgical** | 1.5x (+jailbreak prompts) | 2x (SAE training) | 2x (head surgery + EGA) | 1x | SAE on CPU |
| **optimized** | 1.5x (+jailbreak) | 1x | 50x (Bayesian trials) | 1x | Bayesian optimizer |
| **inverted** | 1.5x (+jailbreak) | 1x | 2x (reflection math) | 1x | PROBE |
| **nuclear** | 1.5x (+jailbreak) | 2x (SAE) | 3x (all techniques) | 1x | SAE + PROBE |
| **informed** | 1x | 1.5x (analysis modules) | 1x-3x (dynamic) | 1.5x (Ouroboros check) | Analysis modules |

---

## Prioritized Action Plan

1. **Batch PROBE forward passes** for an immediate 7-8x speedup on the largest bottleneck
2. **Batch VERIFY generation** for an immediate 4x speedup on the second bottleneck
3. **Add SAE early stopping + a GPU path** for a 2-3x speedup on SAE-enabled methods
4. **Unify `_distill` / `_distill_inner`** as a quality fix that prevents direction degradation
5. **Optimize Bayesian rollback storage** as a memory fix for GPU-constrained users
6. **Use the analytical norm computation** to eliminate ~77B unnecessary FLOPs
7. **DRY up Gram-Schmidt** for code quality
8. **Cache the KL baseline from PROBE** for a minor speedup
app.py
CHANGED
|
@@ -1,10 +1,18 @@
|
|
| 1 |
"""OBLITERATUS β Browser-based model liberation with chat playground.
|
| 2 |
|
| 3 |
-
Deploy on HuggingFace Spaces (
|
|
|
|
| 4 |
pip install -e ".[spaces]"
|
| 5 |
obliteratus ui # beautiful launcher with GPU detection
|
| 6 |
python app.py # direct launch (used by HF Spaces)
|
| 7 |
python app.py --share # with public share link
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
|
@@ -50,6 +58,28 @@ import gradio as gr
|
|
| 50 |
import torch
|
| 51 |
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
|
| 52 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
# ---------------------------------------------------------------------------
|
| 54 |
# Global state
|
| 55 |
# ---------------------------------------------------------------------------
|
|
@@ -149,6 +179,7 @@ METHODS = {
|
|
| 149 |
"advanced (recommended)": "advanced",
|
| 150 |
"basic (fast, single direction)": "basic",
|
| 151 |
"aggressive (maximum removal)": "aggressive",
|
|
|
|
| 152 |
"informed (analysis-guided auto-config)": "informed",
|
| 153 |
"surgical (precision MoE-aware)": "surgical",
|
| 154 |
"optimized (bayesian auto-tuned)": "optimized",
|
|
@@ -195,6 +226,9 @@ def _get_preset_defaults(method_display: str):
|
|
| 195 |
"expert_transplant": cfg.get("expert_transplant", False),
|
| 196 |
"transplant_blend": cfg.get("transplant_blend", 0.3),
|
| 197 |
"use_wasserstein_optimal": cfg.get("use_wasserstein_optimal", False),
|
|
|
|
|
|
|
|
|
|
| 198 |
}
|
| 199 |
|
| 200 |
def _on_method_change(method_display: str):
|
|
@@ -208,6 +242,8 @@ def _on_method_change(method_display: str):
|
|
| 208 |
d["embed_regularization"],
|
| 209 |
d["steering_strength"],
|
| 210 |
d["transplant_blend"],
|
|
|
|
|
|
|
| 211 |
d["norm_preserve"],
|
| 212 |
d["project_biases"],
|
| 213 |
d["use_chat_template"],
|
|
@@ -224,6 +260,7 @@ def _on_method_change(method_display: str):
|
|
| 224 |
d["activation_steering"],
|
| 225 |
d["expert_transplant"],
|
| 226 |
d["use_wasserstein_optimal"],
|
|
|
|
| 227 |
)
|
| 228 |
|
| 229 |
def _on_dataset_change(dataset_label: str):
|
|
@@ -569,6 +606,7 @@ def _figs_to_gallery(figs: list) -> list[tuple[str, str]]:
|
|
| 569 |
return gallery if gallery else None
|
| 570 |
|
| 571 |
|
|
|
|
| 572 |
def benchmark(
|
| 573 |
model_choice: str,
|
| 574 |
methods_to_test: list[str],
|
|
@@ -579,9 +617,10 @@ def benchmark(
|
|
| 579 |
"""Run multiple abliteration methods on a single model and compare results.
|
| 580 |
|
| 581 |
This is the API endpoint that enables programmatic benchmarking β call it
|
| 582 |
-
via the Gradio Client API to test what works on your
|
| 583 |
|
| 584 |
Yields streaming progress updates as (status_md, results_md, log_text, gallery).
|
|
|
|
| 585 |
"""
|
| 586 |
import json as _json
|
| 587 |
|
|
@@ -895,6 +934,7 @@ def _format_benchmark_results(results: list[dict], context: dict | None = None)
|
|
| 895 |
# Multi-model benchmark (new: 1 technique across N models)
|
| 896 |
# ---------------------------------------------------------------------------
|
| 897 |
|
|
|
|
| 898 |
def benchmark_multi_model(
|
| 899 |
model_choices: list[str],
|
| 900 |
method_choice: str,
|
|
@@ -1202,6 +1242,7 @@ def _format_multi_model_results(results: list[dict], context: dict | None = None
|
|
| 1202 |
return "\n".join(lines)
|
| 1203 |
|
| 1204 |
|
|
|
|
| 1205 |
def obliterate(model_choice: str, method_choice: str, hub_repo: str,
|
| 1206 |
prompt_volume_choice: str, dataset_source_choice: str,
|
| 1207 |
custom_harmful: str, custom_harmless: str,
|
|
@@ -1210,6 +1251,7 @@ def obliterate(model_choice: str, method_choice: str, hub_repo: str,
|
|
| 1210 |
adv_refinement_passes: int, adv_reflection_strength: float,
|
| 1211 |
adv_embed_regularization: float, adv_steering_strength: float,
|
| 1212 |
adv_transplant_blend: float,
|
|
|
|
| 1213 |
# Advanced params (checkboxes)
|
| 1214 |
adv_norm_preserve: bool, adv_project_biases: bool,
|
| 1215 |
adv_use_chat_template: bool, adv_use_whitened_svd: bool,
|
|
@@ -1219,8 +1261,14 @@ def obliterate(model_choice: str, method_choice: str, hub_repo: str,
|
|
| 1219 |
adv_sae_features: bool, adv_invert_refusal: bool,
|
| 1220 |
adv_project_embeddings: bool, adv_activation_steering: bool,
|
| 1221 |
adv_expert_transplant: bool, adv_wasserstein_optimal: bool,
|
|
|
|
| 1222 |
progress=gr.Progress()):
|
| 1223 |
-
"""Run the full obliteration pipeline, streaming log updates to the UI.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1224 |
import os
|
| 1225 |
import re
|
| 1226 |
|
|
@@ -1382,6 +1430,9 @@ def obliterate(model_choice: str, method_choice: str, hub_repo: str,
|
|
| 1382 |
expert_transplant=adv_expert_transplant,
|
| 1383 |
transplant_blend=float(adv_transplant_blend),
|
| 1384 |
use_wasserstein_optimal=adv_wasserstein_optimal,
|
|
|
|
|
|
|
|
|
|
| 1385 |
)
|
| 1386 |
pipeline_ref[0] = pipeline
|
| 1387 |
pipeline.run()
|
|
@@ -1687,10 +1738,14 @@ def _strip_reasoning_tokens(text: str) -> str:
|
|
| 1687 |
return cleaned if cleaned else text
|
| 1688 |
|
| 1689 |
|
|
|
|
| 1690 |
def chat_respond(message: str, history: list[dict], system_prompt: str,
|
| 1691 |
temperature: float, top_p: float, max_tokens: int,
|
| 1692 |
repetition_penalty: float):
|
| 1693 |
-
"""Stream a response from the liberated model.
|
|
|
|
|
|
|
|
|
|
| 1694 |
with _lock:
|
| 1695 |
model = _state["model"]
|
| 1696 |
tokenizer = _state["tokenizer"]
|
|
@@ -1816,8 +1871,12 @@ def _get_session_model_choices():
|
|
| 1816 |
return list(_session_models.keys()) if _session_models else []
|
| 1817 |
|
| 1818 |
|
|
|
|
| 1819 |
def load_bench_into_chat(choice: str, progress=gr.Progress()):
|
| 1820 |
-
"""Re-run abliteration with a benchmark config and load result into Chat.
|
|
|
|
|
|
|
|
|
|
| 1821 |
if choice not in _bench_configs:
|
| 1822 |
yield "**Error:** No benchmark result selected.", ""
|
| 1823 |
return
|
|
@@ -1982,6 +2041,7 @@ def load_bench_into_chat(choice: str, progress=gr.Progress()):
|
|
| 1982 |
# A/B Comparison Chat
|
| 1983 |
# ---------------------------------------------------------------------------
|
| 1984 |
|
|
|
|
| 1985 |
def ab_chat_respond(message: str, history_left: list[dict], history_right: list[dict],
|
| 1986 |
system_prompt: str, temperature: float, top_p: float,
|
| 1987 |
max_tokens: int, repetition_penalty: float):
|
|
@@ -2000,9 +2060,15 @@ def ab_chat_respond(message: str, history_left: list[dict], history_right: list[
|
|
| 2000 |
{"role": "assistant", "content": "No abliterated model loaded. Obliterate a model first."}],
|
| 2001 |
history_right + [{"role": "user", "content": message},
|
| 2002 |
{"role": "assistant", "content": "No abliterated model loaded. Obliterate a model first."}],
|
| 2003 |
-
"Load a model first."
|
|
|
|
|
|
|
| 2004 |
return
|
| 2005 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2006 |
# Sanitize inputs
|
| 2007 |
system_prompt = (system_prompt or "")[:4096]
|
| 2008 |
message = (message or "")[:8192]
|
|
@@ -2067,7 +2133,8 @@ def ab_chat_respond(message: str, history_left: list[dict], history_right: list[
|
|
| 2067 |
partial_abl += token
|
| 2068 |
yield (new_left + [{"role": "assistant", "content": "*Generating after abliterated response...*"}],
|
| 2069 |
new_right + [{"role": "assistant", "content": partial_abl}],
|
| 2070 |
-
"Streaming abliterated response..."
|
|
|
|
| 2071 |
except Exception:
|
| 2072 |
pass # Streamer timeout β use whatever partial_abl we have
|
| 2073 |
|
|
@@ -2079,7 +2146,8 @@ def ab_chat_respond(message: str, history_left: list[dict], history_right: list[
|
|
| 2079 |
# --- Generate from original model ---
|
| 2080 |
yield (new_left + [{"role": "assistant", "content": "*Offloading abliterated model, loading original...*"}],
|
| 2081 |
new_right + [{"role": "assistant", "content": partial_abl}],
|
| 2082 |
-
"Loading original model..."
|
|
|
|
| 2083 |
|
| 2084 |
# Offload abliterated model to CPU to free GPU for original model.
|
| 2085 |
# This avoids holding both models in VRAM simultaneously (2x OOM risk).
|
|
@@ -2126,7 +2194,8 @@ def ab_chat_respond(message: str, history_left: list[dict], history_right: list[
|
|
| 2126 |
original_response += token
|
| 2127 |
yield (new_left + [{"role": "assistant", "content": original_response}],
|
| 2128 |
new_right + [{"role": "assistant", "content": partial_abl}],
|
| 2129 |
-
"Streaming original response..."
|
|
|
|
| 2130 |
except Exception:
|
| 2131 |
pass # Streamer timeout β use whatever we have
|
| 2132 |
|
|
@@ -2152,19 +2221,22 @@ def ab_chat_respond(message: str, history_left: list[dict], history_right: list[
|
|
| 2152 |
|
| 2153 |
yield (new_left + [{"role": "assistant", "content": original_response}],
|
| 2154 |
new_right + [{"role": "assistant", "content": partial_abl}],
|
| 2155 |
-
"Done β compare the responses above."
|
|
|
|
| 2156 |
|
| 2157 |
|
| 2158 |
# ---------------------------------------------------------------------------
|
| 2159 |
# Ablation Strength Sweep (dose-response curve)
|
| 2160 |
# ---------------------------------------------------------------------------
|
| 2161 |
|
|
|
|
| 2162 |
def strength_sweep(model_choice: str, method_choice: str,
|
| 2163 |
prompt_vol_choice: str, dataset_source_choice: str,
|
| 2164 |
sweep_steps: int, progress=gr.Progress()):
|
| 2165 |
"""Sweep regularization from 0.0β1.0 and measure refusal rate + perplexity.
|
| 2166 |
|
| 2167 |
Produces a dose-response curve: the fundamental plot for abliteration research.
|
|
|
|
| 2168 |
"""
|
| 2169 |
from obliteratus.abliterate import AbliterationPipeline
|
| 2170 |
|
|
@@ -2185,8 +2257,14 @@ def strength_sweep(model_choice: str, method_choice: str,
|
|
| 2185 |
# Pre-load dataset
|
| 2186 |
harmful_all, harmless_all = load_dataset_source(dataset_key)
|
| 2187 |
prompt_volume = PROMPT_VOLUMES.get(prompt_vol_choice, 33)
|
| 2188 |
-
|
| 2189 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2190 |
|
| 2191 |
for step_i, reg in enumerate(regs):
|
| 2192 |
progress((step_i) / len(regs), desc=f"reg={reg:.2f}")
|
|
@@ -2683,15 +2761,15 @@ label span {
|
|
| 2683 |
|
| 2684 |
/* ---- CHAT TAB: RESIZABLE CHATBOT ---- */
|
| 2685 |
#chat .chatbot, #chat .chat-interface {
|
| 2686 |
-
min-height:
|
| 2687 |
-
height:
|
| 2688 |
}
|
| 2689 |
#chat .chatbot .messages-wrapper,
|
| 2690 |
#chat .chatbot .wrapper,
|
| 2691 |
#chat .chatbot [class*="wrapper"] {
|
| 2692 |
-
min-height:
|
| 2693 |
-
height:
|
| 2694 |
-
max-height:
|
| 2695 |
overflow-y: auto !important;
|
| 2696 |
resize: vertical !important;
|
| 2697 |
}
|
|
@@ -2699,7 +2777,7 @@ label span {
|
|
| 2699 |
#chat .chatbot {
|
| 2700 |
resize: vertical !important;
|
| 2701 |
overflow: auto !important;
|
| 2702 |
-
min-height:
|
| 2703 |
}
|
| 2704 |
/* Resize handle styling */
|
| 2705 |
#chat .chatbot .messages-wrapper::-webkit-resizer,
|
|
@@ -2710,6 +2788,20 @@ label span {
|
|
| 2710 |
height: 16px;
|
| 2711 |
}
|
| 2712 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2713 |
/* ---- ACCORDION ---- */
|
| 2714 |
.gr-accordion { border-color: #1a1f2e !important; }
|
| 2715 |
|
|
@@ -2804,6 +2896,14 @@ with gr.Blocks(theme=THEME, css=CSS, js=_JS, title="OBLITERATUS", fill_height=Tr
|
|
| 2804 |
# GPU VRAM monitor β refreshed on page load and after key operations
|
| 2805 |
vram_display = gr.HTML(value=_get_vram_html())
|
| 2806 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2807 |
with gr.Tabs():
|
| 2808 |
|
| 2809 |
# ββ Tab 1: Obliterate βββββββββββββββββββββββββββββββββββββββββββββ
|
|
@@ -2904,6 +3004,15 @@ with gr.Blocks(theme=THEME, css=CSS, js=_JS, title="OBLITERATUS", fill_height=Tr
|
|
| 2904 |
0.0, 0.5, value=_defaults["transplant_blend"], step=0.05,
|
| 2905 |
label="Transplant Blend", info="Capability blend into safety experts",
|
| 2906 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2907 |
gr.Markdown("**Technique Toggles**")
|
| 2908 |
with gr.Row():
|
| 2909 |
adv_norm_preserve = gr.Checkbox(value=_defaults["norm_preserve"], label="Norm Preserve")
|
|
@@ -2925,18 +3034,23 @@ with gr.Blocks(theme=THEME, css=CSS, js=_JS, title="OBLITERATUS", fill_height=Tr
|
|
| 2925 |
adv_activation_steering = gr.Checkbox(value=_defaults["activation_steering"], label="Activation Steering")
|
| 2926 |
adv_expert_transplant = gr.Checkbox(value=_defaults["expert_transplant"], label="Expert Transplant")
|
| 2927 |
adv_wasserstein_optimal = gr.Checkbox(value=_defaults.get("use_wasserstein_optimal", False), label="Wasserstein-Optimal Dirs")
|
|
|
|
|
|
|
|
|
|
| 2928 |
|
| 2929 |
# List of all advanced controls (order must match _on_method_change return)
|
| 2930 |
_adv_controls = [
|
| 2931 |
adv_n_directions, adv_regularization, adv_refinement_passes,
|
| 2932 |
adv_reflection_strength, adv_embed_regularization,
|
| 2933 |
adv_steering_strength, adv_transplant_blend,
|
|
|
|
| 2934 |
adv_norm_preserve, adv_project_biases, adv_use_chat_template,
|
| 2935 |
adv_use_whitened_svd, adv_true_iterative, adv_jailbreak_contrast,
|
| 2936 |
adv_layer_adaptive, adv_safety_neuron, adv_per_expert,
|
| 2937 |
adv_attn_surgery, adv_sae_features, adv_invert_refusal,
|
| 2938 |
adv_project_embeddings, adv_activation_steering,
|
| 2939 |
adv_expert_transplant, adv_wasserstein_optimal,
|
|
|
|
| 2940 |
]
|
| 2941 |
|
| 2942 |
obliterate_btn = gr.Button(
|
|
@@ -2960,6 +3074,7 @@ with gr.Blocks(theme=THEME, css=CSS, js=_JS, title="OBLITERATUS", fill_height=Tr
|
|
| 2960 |
|
| 2961 |
gr.Markdown(
|
| 2962 |
"*Anonymous telemetry is on by default (no user identity or prompts collected). "
|
|
|
|
| 2963 |
"Opt out: set `OBLITERATUS_TELEMETRY=0`.*",
|
| 2964 |
elem_classes=["telemetry-notice"],
|
| 2965 |
)
|
|
@@ -2979,9 +3094,9 @@ Compare multiple abliteration methods on the same model.
|
|
| 2979 |
Great for finding the optimal strategy for a specific architecture.
|
| 2980 |
|
| 2981 |
```python
|
| 2982 |
-
# API access:
|
| 2983 |
from gradio_client import Client
|
| 2984 |
-
client = Client("
|
| 2985 |
result = client.predict(
|
| 2986 |
model_choice="Alibaba (Qwen) / Qwen2.5-0.5B Instruct",
|
| 2987 |
methods_to_test=["basic", "advanced", "surgical", "optimized"],
|
|
@@ -2998,9 +3113,9 @@ result = client.predict(
|
|
| 2998 |
allow_custom_value=True,
|
| 2999 |
)
|
| 3000 |
bench_methods = gr.CheckboxGroup(
|
| 3001 |
-
choices=["basic", "advanced", "aggressive", "
|
| 3002 |
-
"optimized", "inverted", "nuclear"],
|
| 3003 |
-
value=["basic", "advanced", "
|
| 3004 |
label="Methods to Compare",
|
| 3005 |
)
|
| 3006 |
with gr.Row():
|
|
@@ -3080,9 +3195,9 @@ how well a technique generalizes β especially for MoE-aware methods like
|
|
| 3080 |
`surgical`, `optimized`, or `nuclear` on GPT-OSS 20B vs dense models.
|
| 3081 |
|
| 3082 |
```python
|
| 3083 |
-
# API access:
|
| 3084 |
from gradio_client import Client
|
| 3085 |
-
client = Client("
|
| 3086 |
result = client.predict(
|
| 3087 |
model_choices=["Alibaba (Qwen) / Qwen2.5-0.5B Instruct", "OpenAI / GPT-OSS 20B"],
|
| 3088 |
method_choice="surgical",
|
|
@@ -3102,7 +3217,8 @@ result = client.predict(
|
|
| 3102 |
)
|
| 3103 |
with gr.Row():
|
| 3104 |
mm_method = gr.Dropdown(
|
| 3105 |
-
choices=["basic", "advanced", "aggressive",
|
|
|
|
| 3106 |
"optimized", "inverted", "nuclear"],
|
| 3107 |
value="surgical",
|
| 3108 |
label="Abliteration Method",
|
|
@@ -3326,7 +3442,7 @@ Pre-configured benchmark configurations for common research questions.
|
|
| 3326 |
gr.ChatInterface(
|
| 3327 |
fn=chat_respond,
|
| 3328 |
type="messages",
|
| 3329 |
-
chatbot=gr.Chatbot(height="
|
| 3330 |
additional_inputs=[system_prompt, temperature, top_p, max_tokens, repetition_penalty],
|
| 3331 |
fill_height=True,
|
| 3332 |
)
|
|
@@ -3394,15 +3510,15 @@ See exactly how abliteration changes model behavior on the same prompt.
|
|
| 3394 |
|
| 3395 |
with gr.Row():
|
| 3396 |
with gr.Column():
|
| 3397 |
-
gr.Markdown("#### Original (Pre-Abliteration)")
|
| 3398 |
ab_chatbot_left = gr.Chatbot(
|
| 3399 |
-
height="
|
| 3400 |
label="Original Model",
|
| 3401 |
)
|
| 3402 |
with gr.Column():
|
| 3403 |
-
gr.Markdown("#### Abliterated")
|
| 3404 |
ab_chatbot_right = gr.Chatbot(
|
| 3405 |
-
height="
|
| 3406 |
label="Abliterated Model",
|
| 3407 |
)
|
| 3408 |
|
|
@@ -3418,14 +3534,16 @@ See exactly how abliteration changes model behavior on the same prompt.
|
|
| 3418 |
fn=ab_chat_respond,
|
| 3419 |
inputs=[ab_input, ab_chatbot_left, ab_chatbot_right,
|
| 3420 |
ab_system_prompt, ab_temp, ab_top_p, ab_max_tokens, ab_rep_penalty],
|
| 3421 |
-
outputs=[ab_chatbot_left, ab_chatbot_right, ab_status
|
|
|
|
| 3422 |
)
|
| 3423 |
# Also trigger on Enter
|
| 3424 |
ab_input.submit(
|
| 3425 |
fn=ab_chat_respond,
|
| 3426 |
inputs=[ab_input, ab_chatbot_left, ab_chatbot_right,
|
| 3427 |
ab_system_prompt, ab_temp, ab_top_p, ab_max_tokens, ab_rep_penalty],
|
| 3428 |
-
outputs=[ab_chatbot_left, ab_chatbot_right, ab_status
|
|
|
|
| 3429 |
)
|
| 3430 |
|
| 3431 |
# ββ Tab 5: Strength Sweep ββββββββββββββββββββββββββββββββββββββββ
|
|
@@ -3512,11 +3630,13 @@ Download all intermediate data from your last obliteration run as a ZIP archive.
|
|
| 3512 |
# ββ Tab 7: Leaderboard ββββββββββββββββββββββββββββββββββββββββββββ
|
| 3513 |
with gr.Tab("Leaderboard", id="leaderboard"):
|
| 3514 |
gr.Markdown("""### Community Leaderboard
|
| 3515 |
-
All benchmark results from
|
| 3516 |
-
|
|
|
|
| 3517 |
|
| 3518 |
*Telemetry is **on by default** and is fully anonymous β no user identity, IP addresses, or prompt content
|
| 3519 |
-
is ever collected. Only aggregate benchmark metrics (model name, method, scores, hardware) are stored
|
|
|
|
| 3520 |
To opt out, set the environment variable `OBLITERATUS_TELEMETRY=0` before launching.*
|
| 3521 |
""")
|
| 3522 |
|
|
@@ -3557,10 +3677,17 @@ To opt out, set the environment variable `OBLITERATUS_TELEMETRY=0` before launch
                total_runs = sum(r['runs'] for r in data)
                unique_models = len(set(r['model_id'] for r in data))
                unique_methods = len(set(r['method'] for r in data))
                summary = (
                    f"**{total_runs}** total runs across "
                    f"**{unique_models}** models and "
-                   f"**{unique_methods}** methods"
                )
                return table, summary
            except Exception as e:
@@ -3573,17 +3700,21 @@ To opt out, set the environment variable `OBLITERATUS_TELEMETRY=0` before launch
                "Refresh Leaderboard", variant="secondary", size="sm",
            )
            lb_push_btn = gr.Button(
-               "
            )
            lb_push_status = gr.Markdown("")

            def _push_telemetry():
                try:
-                   from obliteratus.telemetry import push_to_hub
                    ok = push_to_hub()
                    if ok:
-                       return "Telemetry
-                   return
                except Exception as e:
                    return f"Error: {e}"
@@ -3626,12 +3757,13 @@ in weight space, not a deep behavioral change. OBLITERATUS removes it in minutes
|--------|-----------|-------------|
| **basic** | 1 | Single direction, fast baseline |
| **advanced** | 4 (SVD) | Norm-preserving, bias projection, 2 passes |
-| **aggressive** | 8 (SVD) | Whitened SVD, iterative refinement, 3 passes |
| **informed** | 4 (auto) | Analysis-guided closed-loop: auto-detects alignment, cone geometry, entanglement |
| **surgical** | 8 (SVD) | Full SOTA: EGA, head surgery, SAE, layer-adaptive, MoE-aware |
| **optimized** | 4 (SVD) | Bayesian auto-tuned, CoT-aware, KL co-optimized, winsorized |
| **inverted** | 8 (SVD) | Semantic refusal inversion (2x reflection), router redirect |
-| **nuclear** |

### Novel Techniques (Pipeline)
@@ +1 @@
"""OBLITERATUS — Browser-based model liberation with chat playground.

+Deploy on HuggingFace Spaces (ZeroGPU — users bring their own GPU quota)
+or run locally:
    pip install -e ".[spaces]"
    obliteratus ui          # beautiful launcher with GPU detection
    python app.py           # direct launch (used by HF Spaces)
    python app.py --share   # with public share link
+
+ZeroGPU Support:
+    When deployed on HF Spaces with ZeroGPU, each user's GPU-heavy
+    operations (obliteration, chat, benchmarks) run on a shared GPU pool
+    using the VISITOR's own HF quota — not the Space owner's. Functions
+    decorated with @spaces.GPU request a GPU for their duration and
+    release it when done. The Space itself runs on CPU between calls.
"""

from __future__ import annotations

@@ +58 @@
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

+# ── ZeroGPU support ─────────────────────────────────────────────────
+# When running on HuggingFace Spaces with ZeroGPU, the `spaces` package
+# provides the @spaces.GPU decorator that allocates a GPU from the shared
+# pool for the decorated function's duration. Each visitor uses their own
+# HF quota — the Space owner pays nothing for GPU.
+#
+# When running locally or on a dedicated-GPU Space, spaces is not installed
+# and we fall back to a no-op decorator so the same code works everywhere.
+try:
+    import spaces
+    _ZEROGPU_AVAILABLE = True
+except ImportError:
+    _ZEROGPU_AVAILABLE = False
+    # Create a no-op decorator that mirrors the spaces.GPU interface
+    class _FakeSpaces:
+        @staticmethod
+        def GPU(duration: int = 60, **kwargs):
+            def decorator(fn):
+                return fn
+            return decorator
+    spaces = _FakeSpaces()
+
# ---------------------------------------------------------------------------
# Global state
# ---------------------------------------------------------------------------

@@ +179 @@
    "advanced (recommended)": "advanced",
    "basic (fast, single direction)": "basic",
    "aggressive (maximum removal)": "aggressive",
+   "spectral cascade (frequency-selective)": "spectral_cascade",
    "informed (analysis-guided auto-config)": "informed",
    "surgical (precision MoE-aware)": "surgical",
    "optimized (bayesian auto-tuned)": "optimized",

@@ +226 @@
        "expert_transplant": cfg.get("expert_transplant", False),
        "transplant_blend": cfg.get("transplant_blend", 0.3),
        "use_wasserstein_optimal": cfg.get("use_wasserstein_optimal", False),
+       "spectral_cascade": cfg.get("spectral_cascade", False),
+       "spectral_bands": cfg.get("spectral_bands", 3),
+       "spectral_threshold": cfg.get("spectral_threshold", 0.05),
    }

def _on_method_change(method_display: str):

@@ +242 @@
        d["embed_regularization"],
        d["steering_strength"],
        d["transplant_blend"],
+       d["spectral_bands"],
+       d["spectral_threshold"],
        d["norm_preserve"],
        d["project_biases"],
        d["use_chat_template"],

@@ +260 @@
        d["activation_steering"],
        d["expert_transplant"],
        d["use_wasserstein_optimal"],
+       d["spectral_cascade"],
    )

def _on_dataset_change(dataset_label: str):

@@ +606 @@
    return gallery if gallery else None


+@spaces.GPU(duration=300)
def benchmark(
    model_choice: str,
    methods_to_test: list[str],

@@ +617 @@
    """Run multiple abliteration methods on a single model and compare results.

    This is the API endpoint that enables programmatic benchmarking — call it
+   via the Gradio Client API to test what works on your GPU.

    Yields streaming progress updates as (status_md, results_md, log_text, gallery).
+   On ZeroGPU, uses the visitor's GPU quota (up to 5 minutes).
    """
    import json as _json

@@ +934 @@
# Multi-model benchmark (new: 1 technique across N models)
# ---------------------------------------------------------------------------

+@spaces.GPU(duration=300)
def benchmark_multi_model(
    model_choices: list[str],
    method_choice: str,

@@ +1242 @@
    return "\n".join(lines)


+@spaces.GPU(duration=300)
def obliterate(model_choice: str, method_choice: str, hub_repo: str,
               prompt_volume_choice: str, dataset_source_choice: str,
               custom_harmful: str, custom_harmless: str,

@@ +1251 @@
               adv_refinement_passes: int, adv_reflection_strength: float,
               adv_embed_regularization: float, adv_steering_strength: float,
               adv_transplant_blend: float,
+              adv_spectral_bands: int, adv_spectral_threshold: float,
               # Advanced params (checkboxes)
               adv_norm_preserve: bool, adv_project_biases: bool,
               adv_use_chat_template: bool, adv_use_whitened_svd: bool,

@@ +1261 @@
               adv_sae_features: bool, adv_invert_refusal: bool,
               adv_project_embeddings: bool, adv_activation_steering: bool,
               adv_expert_transplant: bool, adv_wasserstein_optimal: bool,
+              adv_spectral_cascade: bool,
               progress=gr.Progress()):
+   """Run the full obliteration pipeline, streaming log updates to the UI.
+
+   On ZeroGPU Spaces, this function runs on the visitor's GPU quota (up to
+   5 minutes). The @spaces.GPU decorator allocates a GPU at call time and
+   releases it when the function returns.
+   """
    import os
    import re

@@ +1430 @@
        expert_transplant=adv_expert_transplant,
        transplant_blend=float(adv_transplant_blend),
        use_wasserstein_optimal=adv_wasserstein_optimal,
+       spectral_cascade=adv_spectral_cascade,
+       spectral_bands=int(adv_spectral_bands),
+       spectral_threshold=float(adv_spectral_threshold),
    )
    pipeline_ref[0] = pipeline
    pipeline.run()

@@ +1738 @@
    return cleaned if cleaned else text


+@spaces.GPU(duration=120)
def chat_respond(message: str, history: list[dict], system_prompt: str,
                 temperature: float, top_p: float, max_tokens: int,
                 repetition_penalty: float):
+   """Stream a response from the liberated model.
+
+   On ZeroGPU, allocates a GPU for up to 2 minutes per response.
+   """
    with _lock:
        model = _state["model"]
        tokenizer = _state["tokenizer"]

@@ +1871 @@
    return list(_session_models.keys()) if _session_models else []


+@spaces.GPU(duration=300)
def load_bench_into_chat(choice: str, progress=gr.Progress()):
+   """Re-run abliteration with a benchmark config and load result into Chat.
+
+   On ZeroGPU, uses the visitor's GPU quota.
+   """
    if choice not in _bench_configs:
        yield "**Error:** No benchmark result selected.", ""
        return

@@ +2041 @@
# A/B Comparison Chat
# ---------------------------------------------------------------------------

+@spaces.GPU(duration=120)
def ab_chat_respond(message: str, history_left: list[dict], history_right: list[dict],
                    system_prompt: str, temperature: float, top_p: float,
                    max_tokens: int, repetition_penalty: float):

@@ +2060 @@
                {"role": "assistant", "content": "No abliterated model loaded. Obliterate a model first."}],
               history_right + [{"role": "user", "content": message},
                {"role": "assistant", "content": "No abliterated model loaded. Obliterate a model first."}],
+              "Load a model first.",
+              "#### Original (Pre-Abliteration)",
+              "#### Abliterated")
        return

+   # Build header strings showing model name on each side
+   header_left = f"#### Original (Pre-Abliteration)\n`{model_name}`"
+   header_right = f"#### Abliterated\n`{model_name}`"
+
    # Sanitize inputs
    system_prompt = (system_prompt or "")[:4096]
    message = (message or "")[:8192]

@@ +2133 @@
            partial_abl += token
            yield (new_left + [{"role": "assistant", "content": "*Generating after abliterated response...*"}],
                   new_right + [{"role": "assistant", "content": partial_abl}],
+                  "Streaming abliterated response...",
+                  header_left, header_right)
    except Exception:
        pass  # Streamer timeout — use whatever partial_abl we have

@@ +2146 @@
    # --- Generate from original model ---
    yield (new_left + [{"role": "assistant", "content": "*Offloading abliterated model, loading original...*"}],
           new_right + [{"role": "assistant", "content": partial_abl}],
+          "Loading original model...",
+          header_left, header_right)

    # Offload abliterated model to CPU to free GPU for original model.
    # This avoids holding both models in VRAM simultaneously (2x OOM risk).

@@ +2194 @@
            original_response += token
            yield (new_left + [{"role": "assistant", "content": original_response}],
                   new_right + [{"role": "assistant", "content": partial_abl}],
+                  "Streaming original response...",
+                  header_left, header_right)
    except Exception:
        pass  # Streamer timeout — use whatever we have

@@ +2221 @@

    yield (new_left + [{"role": "assistant", "content": original_response}],
           new_right + [{"role": "assistant", "content": partial_abl}],
+          "Done — compare the responses above.",
+          header_left, header_right)


# ---------------------------------------------------------------------------
# Ablation Strength Sweep (dose-response curve)
# ---------------------------------------------------------------------------

+@spaces.GPU(duration=300)
def strength_sweep(model_choice: str, method_choice: str,
                   prompt_vol_choice: str, dataset_source_choice: str,
                   sweep_steps: int, progress=gr.Progress()):
    """Sweep regularization from 0.0→1.0 and measure refusal rate + perplexity.

    Produces a dose-response curve: the fundamental plot for abliteration research.
+   On ZeroGPU, uses the visitor's GPU quota (up to 5 minutes).
    """
    from obliteratus.abliterate import AbliterationPipeline

@@ +2257 @@
    # Pre-load dataset
    harmful_all, harmless_all = load_dataset_source(dataset_key)
    prompt_volume = PROMPT_VOLUMES.get(prompt_vol_choice, 33)
+   if prompt_volume > 0 and prompt_volume < len(harmful_all):
+       harmful = harmful_all[:prompt_volume]
+   else:
+       harmful = harmful_all
+   if prompt_volume > 0 and prompt_volume < len(harmless_all):
+       harmless = harmless_all[:prompt_volume]
+   else:
+       harmless = harmless_all

    for step_i, reg in enumerate(regs):
        progress((step_i) / len(regs), desc=f"reg={reg:.2f}")

@@ +2761 @@

/* ---- CHAT TAB: RESIZABLE CHATBOT ---- */
#chat .chatbot, #chat .chat-interface {
+   min-height: 9vh !important;
+   height: 12vh !important;
}
#chat .chatbot .messages-wrapper,
#chat .chatbot .wrapper,
#chat .chatbot [class*="wrapper"] {
+   min-height: 8vh !important;
+   height: 11vh !important;
+   max-height: 18vh !important;
    overflow-y: auto !important;
    resize: vertical !important;
}

@@ +2777 @@
#chat .chatbot {
    resize: vertical !important;
    overflow: auto !important;
+   min-height: 8vh !important;
}
/* Resize handle styling */
#chat .chatbot .messages-wrapper::-webkit-resizer,

@@ +2788 @@
    height: 16px;
}

+/* ---- A/B COMPARE: MODEL HEADERS ---- */
+#ab_compare h4 {
+   margin: 0 !important;
+   padding: 6px 10px !important;
+   border: 1px solid #1a1f2e !important;
+   background: #0d0d14 !important;
+   border-radius: 4px !important;
+}
+#ab_compare code {
+   color: #00ff41 !important;
+   font-size: 0.85rem !important;
+   background: transparent !important;
+}
+
/* ---- ACCORDION ---- */
.gr-accordion { border-color: #1a1f2e !important; }

@@ +2896 @@
    # GPU VRAM monitor — refreshed on page load and after key operations
    vram_display = gr.HTML(value=_get_vram_html())

+   # ZeroGPU info — only shown when running on HF Spaces with ZeroGPU
+   if _ZEROGPU_AVAILABLE:
+       gr.Markdown(
+           "> **ZeroGPU enabled** — GPU operations use *your* HuggingFace account quota, "
+           "not the Space owner's. Log in with your HF account for free GPU access. "
+           "Multiple users can run simultaneously without conflicts."
+       )
+
    with gr.Tabs():

        # ── Tab 1: Obliterate ─────────────────────────────────────────────

@@ +3004 @@
                0.0, 0.5, value=_defaults["transplant_blend"], step=0.05,
                label="Transplant Blend", info="Capability blend into safety experts",
            )
+           with gr.Row():
+               adv_spectral_bands = gr.Slider(
+                   2, 8, value=_defaults["spectral_bands"], step=1,
+                   label="Spectral Bands", info="DCT frequency bands for Spectral Cascade",
+               )
+               adv_spectral_threshold = gr.Slider(
+                   0.01, 0.2, value=_defaults["spectral_threshold"], step=0.01,
+                   label="Spectral Threshold", info="Energy threshold for cascade early-exit",
+               )
            gr.Markdown("**Technique Toggles**")
            with gr.Row():
                adv_norm_preserve = gr.Checkbox(value=_defaults["norm_preserve"], label="Norm Preserve")

@@ +3034 @@
                adv_activation_steering = gr.Checkbox(value=_defaults["activation_steering"], label="Activation Steering")
                adv_expert_transplant = gr.Checkbox(value=_defaults["expert_transplant"], label="Expert Transplant")
                adv_wasserstein_optimal = gr.Checkbox(value=_defaults.get("use_wasserstein_optimal", False), label="Wasserstein-Optimal Dirs")
+           with gr.Row():
+               adv_spectral_cascade = gr.Checkbox(value=_defaults["spectral_cascade"], label="Spectral Cascade",
+                                                  info="DCT frequency decomposition for precision refusal targeting")

            # List of all advanced controls (order must match _on_method_change return)
            _adv_controls = [
                adv_n_directions, adv_regularization, adv_refinement_passes,
                adv_reflection_strength, adv_embed_regularization,
                adv_steering_strength, adv_transplant_blend,
+               adv_spectral_bands, adv_spectral_threshold,
                adv_norm_preserve, adv_project_biases, adv_use_chat_template,
                adv_use_whitened_svd, adv_true_iterative, adv_jailbreak_contrast,
                adv_layer_adaptive, adv_safety_neuron, adv_per_expert,
                adv_attn_surgery, adv_sae_features, adv_invert_refusal,
                adv_project_embeddings, adv_activation_steering,
                adv_expert_transplant, adv_wasserstein_optimal,
+               adv_spectral_cascade,
            ]

            obliterate_btn = gr.Button(

@@ +3074 @@

            gr.Markdown(
                "*Anonymous telemetry is on by default (no user identity or prompts collected). "
+               "Results auto-sync to a central community dataset for the leaderboard. "
                "Opt out: set `OBLITERATUS_TELEMETRY=0`.*",
                elem_classes=["telemetry-notice"],
            )

@@ +3094 @@
Great for finding the optimal strategy for a specific architecture.

```python
+# API access (replace with your Space URL):
from gradio_client import Client
+client = Client("your-username/obliteratus")
result = client.predict(
    model_choice="Alibaba (Qwen) / Qwen2.5-0.5B Instruct",
    methods_to_test=["basic", "advanced", "surgical", "optimized"],

@@ +3113 @@
                allow_custom_value=True,
            )
            bench_methods = gr.CheckboxGroup(
+               choices=["basic", "advanced", "aggressive", "spectral_cascade",
+                        "informed", "surgical", "optimized", "inverted", "nuclear"],
+               value=["basic", "advanced", "spectral_cascade", "surgical"],
                label="Methods to Compare",
            )
            with gr.Row():

@@ +3195 @@
`surgical`, `optimized`, or `nuclear` on GPT-OSS 20B vs dense models.

```python
+# API access (replace with your Space URL):
from gradio_client import Client
+client = Client("your-username/obliteratus")
result = client.predict(
    model_choices=["Alibaba (Qwen) / Qwen2.5-0.5B Instruct", "OpenAI / GPT-OSS 20B"],
    method_choice="surgical",

@@ +3217 @@
            )
            with gr.Row():
                mm_method = gr.Dropdown(
+                   choices=["basic", "advanced", "aggressive",
+                            "spectral_cascade", "informed", "surgical",
+                            "optimized", "inverted", "nuclear"],
                    value="surgical",
                    label="Abliteration Method",

@@ +3442 @@
            gr.ChatInterface(
                fn=chat_respond,
                type="messages",
+               chatbot=gr.Chatbot(height="11vh", type="messages"),
                additional_inputs=[system_prompt, temperature, top_p, max_tokens, repetition_penalty],
                fill_height=True,
            )

@@ +3510 @@

            with gr.Row():
                with gr.Column():
+                   ab_header_left = gr.Markdown("#### Original (Pre-Abliteration)")
                    ab_chatbot_left = gr.Chatbot(
+                       height="20vh", type="messages",
                        label="Original Model",
                    )
                with gr.Column():
+                   ab_header_right = gr.Markdown("#### Abliterated")
                    ab_chatbot_right = gr.Chatbot(
+                       height="20vh", type="messages",
                        label="Abliterated Model",
                    )

@@ +3534 @@
                fn=ab_chat_respond,
                inputs=[ab_input, ab_chatbot_left, ab_chatbot_right,
                        ab_system_prompt, ab_temp, ab_top_p, ab_max_tokens, ab_rep_penalty],
+               outputs=[ab_chatbot_left, ab_chatbot_right, ab_status,
+                        ab_header_left, ab_header_right],
            )
            # Also trigger on Enter
            ab_input.submit(
                fn=ab_chat_respond,
                inputs=[ab_input, ab_chatbot_left, ab_chatbot_right,
                        ab_system_prompt, ab_temp, ab_top_p, ab_max_tokens, ab_rep_penalty],
+               outputs=[ab_chatbot_left, ab_chatbot_right, ab_status,
+                        ab_header_left, ab_header_right],
            )

        # ── Tab 5: Strength Sweep ────────────────────────────────────────

@@ +3630 @@
        # ── Tab 7: Leaderboard ────────────────────────────────────────────
        with gr.Tab("Leaderboard", id="leaderboard"):
            gr.Markdown("""### Community Leaderboard
+All benchmark results from **every OBLITERATUS Space** (including duplicated copies) are
+automatically aggregated into a central community dataset. Results appear here regardless
+of which Space instance ran them.

*Telemetry is **on by default** and is fully anonymous — no user identity, IP addresses, or prompt content
+is ever collected. Only aggregate benchmark metrics (model name, method, scores, hardware) are stored.
+Data is synced to a central HuggingFace Dataset for persistence across Space restarts and upgrades.
To opt out, set the environment variable `OBLITERATUS_TELEMETRY=0` before launching.*
""")

@@ +3677 @@
                total_runs = sum(r['runs'] for r in data)
                unique_models = len(set(r['model_id'] for r in data))
                unique_methods = len(set(r['method'] for r in data))
+
+               # Check data source
+               from obliteratus.telemetry import _TELEMETRY_REPO
+               source_note = ""
+               if _TELEMETRY_REPO:
+                   source_note = f" | Data source: local + [{_TELEMETRY_REPO}](https://huggingface.co/datasets/{_TELEMETRY_REPO})"
+
                summary = (
                    f"**{total_runs}** total runs across "
                    f"**{unique_models}** models and "
+                   f"**{unique_methods}** methods{source_note}"
                )
                return table, summary
            except Exception as e:

@@ +3700 @@
                "Refresh Leaderboard", variant="secondary", size="sm",
            )
            lb_push_btn = gr.Button(
+               "Force Sync to Hub Now", variant="secondary", size="sm",
            )
            lb_push_status = gr.Markdown("")

            def _push_telemetry():
                try:
+                   from obliteratus.telemetry import push_to_hub, _TELEMETRY_REPO
+                   repo = _TELEMETRY_REPO
                    ok = push_to_hub()
                    if ok:
+                       return f"Telemetry synced to [{repo}](https://huggingface.co/datasets/{repo}) successfully."
+                   return (
+                       "Sync failed. Telemetry auto-syncs in the background on HF Spaces. "
+                       "For manual push, ensure HF_TOKEN is set with write access."
+                   )
                except Exception as e:
                    return f"Error: {e}"

@@ +3757 @@
|--------|-----------|-------------|
| **basic** | 1 | Single direction, fast baseline |
| **advanced** | 4 (SVD) | Norm-preserving, bias projection, 2 passes |
+| **aggressive** | 8 (SVD) | Whitened SVD, iterative refinement, jailbreak-contrastive, 3 passes |
+| **spectral_cascade** | 6 (wSVD) | DCT frequency decomposition, coherence-weighted, adaptive bands |
| **informed** | 4 (auto) | Analysis-guided closed-loop: auto-detects alignment, cone geometry, entanglement |
| **surgical** | 8 (SVD) | Full SOTA: EGA, head surgery, SAE, layer-adaptive, MoE-aware |
| **optimized** | 4 (SVD) | Bayesian auto-tuned, CoT-aware, KL co-optimized, winsorized |
| **inverted** | 8 (SVD) | Semantic refusal inversion (2x reflection), router redirect |
+| **nuclear** | 4 (SVD) | Maximum force: all techniques + expert transplant + steering |

### Novel Techniques (Pipeline)
docs/EFFICIENCY_AUDIT.md ADDED
@@ -0,0 +1,198 @@
# OBLITERATUS Pipeline Efficiency Audit

**Auditor perspective**: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.

**Scope**: Every obliteration method in `abliterate.py` (8 primary methods + 4 baseline reproductions), the strategy layer (`strategies/`), the informed pipeline, the Bayesian optimizer, and LoRA ablation.

---

## Executive Summary

OBLITERATUS has an impressively comprehensive pipeline, but several methods carry **significant hidden costs** that erode their value proposition. The worst offenders are:

1. **`_collect_activations` runs prompts one at a time** — the single biggest throughput bottleneck in the entire system, costing 5-15x in wall-clock time during PROBE.
2. **Bayesian `optimized` mode clones ALL strong-layer weights to CPU** for rollback, then runs 50 full forward+generate passes — the memory and compute overhead can exceed the rest of the pipeline combined.
3. **`true_iterative_refinement` re-runs the entire PROBE+DISTILL pipeline** per refinement pass with zero early stopping — the 3 passes in `aggressive` triple the probe cost even when pass 2 achieves negligible improvement.
4. **SAE training on CPU** is needlessly slow for GPU-resident models.

Below is the method-by-method breakdown.
---

## Stage-Level Audit

### Stage 1: SUMMON (Model Loading)

**Status**: Acceptable. Uses `load_model` with quantization support and the `expandable_segments` CUDA config. No issues.

### Stage 2: PROBE (`_collect_activations`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **Single-prompt forward passes** (`abliterate.py:1074`) | CRITICAL | Each of 512+ harmful/harmless prompts triggers a separate `model(**inputs)` call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown. |
| **`_free_gpu_memory()` called after EVERY prompt** (`abliterate.py:1086`) | HIGH | `gc.collect()` + `torch.cuda.empty_cache()` 1024 times is expensive — a full Python GC collection alone adds measurable overhead at this frequency. Should be called every N prompts, not after every single one. |
| **Chat template applied per-prompt in a Python loop** (`abliterate.py:955-965`) | MODERATE | `tokenizer.apply_chat_template()` is called individually 1024 times. Should batch. |
| **Jailbreak probing doubles cost** when `use_jailbreak_contrast=True` | MODERATE | Adds a third full pass over all prompts. Justified by the quality improvement, but the lack of batching amplifies the cost to 3x instead of 1.5x. |
| **Router profiling hooks zero-cost claim is correct** (`abliterate.py:872`) | OK | Hooks piggyback on existing forward passes. Good design. |

**Recommendation**: Batch `_collect_activations`: tokenize all prompts, pad to equal length per micro-batch, and run batched `model(**inputs)`. Expected 5-10x speedup with zero quality loss. Reduce `_free_gpu_memory()` frequency to every 32-64 prompts.
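The batched probe loop the recommendation describes can be sketched as follows. This is an illustrative sketch only, not the repo's API: the function name, the `output_hidden_states` path (the real `_collect_activations` uses hooks), and the mask-aware mean pooling are assumptions made for brevity.

```python
import torch

@torch.no_grad()
def collect_hidden_states_batched(model, tokenizer, prompts, layer_idx,
                                  batch_size=32, free_every=2):
    """Mean-pooled hidden states at one layer, batched instead of per-prompt."""
    feats = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        # One tokenizer call pads the whole micro-batch to equal length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512).to(model.device)
        out = model(**inputs, output_hidden_states=True)
        h = out.hidden_states[layer_idx]               # (B, T, D)
        mask = inputs["attention_mask"].unsqueeze(-1)  # (B, T, 1)
        pooled = (h * mask).sum(1) / mask.sum(1)       # ignore padding tokens
        feats.append(pooled.float().cpu())
        # Clear the CUDA cache every few batches, not after every prompt.
        if (start // batch_size) % free_every == free_every - 1:
            torch.cuda.empty_cache()
    return torch.cat(feats, dim=0)                     # (N, D)
```

Padding to the per-micro-batch maximum (rather than a global maximum) keeps the wasted FLOPs on pad tokens small while still amortizing kernel-launch and Python overhead across the batch.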
| 39 |
+
|
| 40 |
+
### Stage 3: DISTILL (`_distill`)
|
| 41 |
+
|
| 42 |
+
| Issue | Severity | Impact |
|
| 43 |
+
|-------|----------|--------|
|
| 44 |
+
| **Full SVD on per-prompt diff matrix** (`abliterate.py:1226`) | MODERATE | `torch.linalg.svd(diff_matrix, full_matrices=False)` on a `(512, hidden_dim)` matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain. |
|
| 45 |
+
| **Whitened SVD import is lazy** (`abliterate.py:1127`) | OK | Good β only imports when needed. No cost for basic/advanced. |
|
| 46 |
+
| **Wasserstein extraction** (`abliterate.py:1136`) | OK | Falls back gracefully. The GEP solve is lightweight. |
|
| 47 |
+
| **RDO gradient optimization: 500 steps per layer** (`abliterate.py:1427`) | HIGH | For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on `(n_prompts, hidden_dim)` tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. **No early stopping.** |
|
| 48 |
+
| **Gram-Schmidt re-orthogonalization is O(k^2)** per layer (`abliterate.py:1168-1173`) | LOW | With k<=8, this is negligible. |
|
| 49 |
+
| **SAE training: 30 epochs on CPU** (`abliterate.py:1582`) | HIGH | `device="cpu"` is hardcoded. For hidden_dim=4096 and expansion=4, the SAE has 32M parameters. 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available. |
|
| 50 |
+
| **Layer selection (knee + COSMIC fusion)** | OK | Lightweight statistical operations. No concern. |
|
| 51 |
+
| **CoT-aware orthogonalization** | OK | Single SVD per layer, simple vector operations. |
|
| 52 |
+
| **Jailbreak-contrastive blending** | OK | Pure vector arithmetic, negligible cost. |
|
| 53 |
+
| **Float-layer interpolation** | OK | Gaussian weight computation is trivial. |
|
| 54 |
+
|
| 55 |
+
**Recommendation**: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use GPU for SAE training when available β change `device="cpu"` to auto-detect.

### Stage 4: EXCISE (`_excise`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **Rank-1 projection is memory-efficient** (`abliterate.py:3479-3480`) | OK | `W @ d` produces a vector, not a full projection matrix. This is the right approach. |
| **`true_iterative_refinement` re-runs PROBE+DISTILL** (`abliterate.py:2474-2485`) | CRITICAL | Each refinement pass re-collects all activations (512*2+ forward passes) and re-runs SVD. `aggressive` mode does 3 passes = 3x full pipeline cost. There is **no check** whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass → stop) would save enormous compute on pass 3. |
| **Bayesian optimization clones ALL weight tensors** (`bayesian_optimizer.py:301-341`) | CRITICAL | For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (`total_saved_mb`), but there is no memory check or fallback. |
| **Bayesian trials run full generate passes** (`bayesian_optimizer.py:445-446`) | CRITICAL | Each of 50 trials runs `_measure_refusal_rate` (8-30 generation calls with `max_new_tokens=128`) PLUS `_measure_kl_divergence` (5 forward passes). That's ~35 forward/generate passes per trial × 50 trials = **1,750 forward passes** just for hyperparameter search. This likely dominates the total pipeline runtime for the `optimized` and `heretic` modes. |
| **KL optimization proxy is cheap** (`abliterate.py:3057-3268`) | OK | Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering — avoids the expensive per-layer ablation/measurement loop. |
| **Norm preservation adds one extra `.norm()` per weight matrix** | LOW | Frobenius norm is O(n) — negligible overhead. |
| **Dequantize/re-quantize for bitsandbytes** (`abliterate.py:3287-3400`) | MODERATE | Necessary for correctness, but the full dequantize → modify → re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight. |
| **Safety-neuron masking** | LOW | Z-score computation is a single pass over the projection vector. Cheap. |
| **Expert transplant uses incremental mean** (`abliterate.py:4350-4364`) | OK | Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models. |
| **`_stabilize_router_weights` called after every MoE layer** (`abliterate.py:3866`) | LOW | Clamps router weights. Trivial cost. |

**Recommendation**: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
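A minimal sketch of the caching idea in point (3): pay the dequantize/re-quantize round-trip once per weight matrix instead of once per direction. `dequantize`, `requantize`, and `project` here are placeholder callables standing in for the real bitsandbytes round-trip and the rank-1 update, not the library API:

```python
# Sketch of point (3): one dequantize/re-quantize round-trip per weight,
# shared across all direction projections. Placeholder callables, NOT the
# bitsandbytes API.
def project_directions(weight_q, directions, dequantize, requantize, project):
    W = dequantize(weight_q)      # one dequantize per weight matrix...
    for d in directions:
        W = project(W, d)         # ...apply every rank-1 update in full precision...
    return requantize(W)          # ...then one re-quantize at the end

# Toy stand-ins that count round-trips and subtract a scalar per "direction".
calls = {"deq": 0, "req": 0}

def deq(w):
    calls["deq"] += 1
    return list(w)

def req(w):
    calls["req"] += 1
    return list(w)

def proj(w, d):
    return [x - d for x in w]

out = project_directions([10.0, 20.0], [1.0, 2.0], deq, req, proj)
```

With k directions per layer this turns 2k quantization round-trips into 2, at the cost of holding one dequantized matrix in memory at a time.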

### Stage 5: VERIFY (`_verify`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **30 generation calls for refusal measurement** (`abliterate.py:4622`) | MODERATE | Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check. |
| **`_tier_label` does `list.index()` per prompt** (`abliterate.py:4593`) | LOW | O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512. |
| **Perplexity measurement on 3 short texts** | OK | Minimal cost. |
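The `_tier_label` fix mentioned above amounts to inverting the tier lists into a prompt-to-tier dict once, then doing O(1) lookups. The names below are illustrative, not the actual obliteratus structures:

```python
# Illustrative fix for the _tier_label scan: invert {tier: [prompts...]}
# into a prompt -> tier mapping built once, replacing per-prompt list.index().
# Names are hypothetical, not the actual code.
def build_tier_index(tiers: dict) -> dict:
    return {prompt: tier for tier, prompts in tiers.items() for prompt in prompts}

tier_index = build_tier_index({
    "mild": ["prompt-a", "prompt-b"],
    "severe": ["prompt-c"],
})
```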

### Stage 6: REBIRTH (Model Saving)

Not audited in detail — standard HuggingFace `save_pretrained`. No efficiency concerns.

---

## Method-by-Method Efficiency Grades

| Method | Compute Cost | Memory Cost | Value/Cost Ratio | Grade |
|--------|-------------|-------------|-------------------|-------|
| **basic** | Low (1 dir, 1 pass, no extras) | Low | High | **A** |
| **advanced** | Moderate (4 dirs, 2 passes, norm-preserve, bias projection) | Moderate | High | **A-** |
| **aggressive** | High (8 dirs, 3 passes with `true_iterative_refinement`) | High (3x activation storage) | Moderate — 3rd pass rarely justified | **B-** |
| **informed** | High (runs analysis modules + Wasserstein GEP) | High (analysis module state) | High — analysis feedback is genuinely valuable | **B+** |
| **surgical** | Very High (SAE training + head surgery + EGA + neuron masking) | Very High | Moderate — many techniques compound but with diminishing returns | **C+** |
| **inverted** | Very High (surgical + reflection + SAE) | Very High | Niche — only needed for the "actively compliant" use case | **C** |
| **optimized** | Extreme (50 Bayesian trials × 35 forward passes each) | Extreme (full weight clones + 1,750 forward passes) | Low unless you have a multi-GPU cluster | **D+** |
| **nuclear** | Very High (inverted + layer-adaptive + expert transplant + steering hooks) | Very High | Highly specialized — justified only for stubborn MoE models | **C** |

### Baseline Reproductions

| Method | Compute Cost | Grade | Notes |
|--------|-------------|-------|-------|
| **failspy** | Low | **A** | Faithful minimal reproduction. Efficient by design. |
| **gabliteration** | Low-Moderate | **A-** | 4-dir SVD + ridge. Clean. |
| **heretic** | Extreme | **D** | Inherits the Bayesian trial overhead. 50 trials × 35 passes each. |
| **rdo** | High | **B** | 500 gradient steps/layer. Would benefit from early-stopping. |

---

## Strategy Module Audit (`strategies/`)

| Strategy | Implementation | Grade |
|----------|---------------|-------|
| `embedding_ablation` | Clean zero-out by chunk. `torch.no_grad()` used correctly. | **A** |
| `ffn_ablation` | Iterates all FFN params and zeros them. Fine for an ablation study. | **A** |
| `head_pruning` | Handles GPT-2 Conv1D and standard Q/K/V separately. Correct. | **A-** |
| `layer_removal` | Zeros all params. Simple and correct. | **A** |
| `registry` | Minimal dict-based registry with decorator. No overhead. | **A** |
| `runner.py` | **Creates a new `Evaluator` per spec** (`runner.py:86-95`). This re-initializes dataset processing for every ablation spec. Should create once and reuse. | **B** |

---

## Cross-Cutting Concerns

### 1. Memory Management

- **Good**: `_free_gpu_memory()` exists and is called between stages. `expandable_segments` is set early.
- **Bad**: `_free_gpu_memory()` is called 1024+ times during PROBE (once per prompt). The `gc.collect()` cost alone adds up.
- **Bad**: The Bayesian optimizer clones all strong-layer weights with no memory budget check.
- **Bad**: No streaming/chunking for activation storage — all 512 prompts × 32 layers of activations are held in a list of CPU tensors simultaneously.

### 2. GPU Utilization

- **Good**: Adaptive `max_length` based on free GPU memory.
- **Good**: Rank-1 projections avoid materializing full projection matrices.
- **Bad**: SAE training is hardcoded to CPU.
- **Bad**: Single-prompt forward passes waste GPU parallelism.
- **Bad**: No `torch.compile()` or `torch.inference_mode()` is used anywhere (the latter is faster than `torch.no_grad()` for inference).
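A quick illustration of the `torch.inference_mode()` point: it disables autograd like `torch.no_grad()`, but additionally skips view-tracking and version-counter bookkeeping, which makes it the slightly faster default for inference-only forward passes:

```python
import torch

# inference_mode behaves like no_grad (no graph is recorded) but also marks
# the outputs as inference tensors, skipping version-counter bookkeeping.
x = torch.randn(4, 8)
w = torch.randn(8, 8)

with torch.inference_mode():
    y = x @ w  # no autograd graph is built for this matmul
```

The trade-off is that inference tensors cannot later be used in autograd-recording code, which is irrelevant for activation collection and verification passes.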

### 3. Quantization Handling

- **Good**: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
- **Good**: Refuses to operate on raw quantized bytes (avoids silent corruption).
- **Moderate**: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.

---

## Top 5 Recommendations (Ranked by Impact)

### 1. Batch `_collect_activations` (CRITICAL — 5-15x PROBE speedup)

```python
# Current: one prompt at a time
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, ...)
    model(**inputs)

# Proposed: micro-batched
for batch_start in range(0, len(prompts), batch_size):
    batch = prompts[batch_start:batch_start + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        model(**inputs)
```

Hooks need a minor adjustment to handle the batch dimension, but the core change is ~20 lines.
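One way that hook adjustment could look, sketched with toy tensors. With right-padded batches the "last token" position differs per row, so the hook must gather each row's final non-pad position instead of slicing `hidden[:, -1, :]`. The shared `state` dict and all names here are assumptions for illustration, not the current implementation:

```python
import torch

# Assumption: the forward loop stores the current batch's attention_mask here
# before calling model(**inputs), so the hook can see padding. Assumes
# right-padding (tokenizer default for most causal models in this setup).
state = {}

def make_hook(storage: list):
    def hook_fn(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (B, T, H)
        mask = state["attention_mask"]                               # (B, T)
        last = mask.sum(dim=1) - 1                                   # (B,) index of final real token
        rows = torch.arange(hidden.shape[0])
        storage.append(hidden[rows, last, :].detach().cpu())         # (B, H) per-prompt activations
    return hook_fn

# Toy check: row 0 has 3 real tokens, row 1 has 2 (one pad at the end).
hidden = torch.arange(2 * 3 * 4, dtype=torch.float32).reshape(2, 3, 4)
state["attention_mask"] = torch.tensor([[1, 1, 1], [1, 1, 0]])
store: list = []
make_hook(store)(None, None, hidden)
```

With left padding the gather simplifies to `hidden[:, -1, :]`; either way the stored tensor gains a batch dimension, which the downstream mean/SVD code already handles after a `torch.cat`.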

### 2. Add early-stopping to `true_iterative_refinement` (HIGH — saves 1-2 full PROBE passes)

After re-distilling, compute the cosine similarity between old and new refusal directions. If all directions are > 0.99 cosine, skip the remaining passes. Expected to save 30-60% of `aggressive` mode runtime.
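A sketch of that convergence gate; `directions_converged` is a hypothetical helper, assuming directions are stored per-layer as dicts of 1-D tensors:

```python
import torch

# Proposed gate between refinement passes: skip the next re-probe when every
# layer's new direction is nearly parallel to the old one.
def directions_converged(old, new, threshold=0.99):
    for idx, d_old in old.items():
        d_new = new.get(idx)
        if d_new is None:
            return False  # a layer was dropped/added: keep refining
        cos = torch.nn.functional.cosine_similarity(d_old, d_new, dim=0)
        if cos.abs().item() < threshold:  # abs(): d and -d span the same subspace
            return False
    return True

old = {0: torch.tensor([1.0, 0.0])}
```

The `abs()` matters because SVD sign is arbitrary: a flipped direction still ablates the same subspace and should count as converged.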

### 3. Move SAE training to GPU (HIGH — 5-15 min saved for `surgical`/`inverted`)

Change `device="cpu"` to auto-detect an available GPU. The SAE is small (32M params at expansion=4) and fits easily alongside the model.
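A hedged sketch of the auto-detection; `pick_sae_device` is a hypothetical helper, and the commented `train_sae` call mirrors the signature as described in this audit:

```python
import torch

# Sketch: replace the hardcoded device="cpu" with auto-detection.
def pick_sae_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon fallback
    return "cpu"

sae_device = pick_sae_device()
# train_sae(all_acts, hidden_dim, expansion=4, n_epochs=30,
#           sparsity_coef=1e-3, device=sae_device)
```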

### 4. Reduce Bayesian trial overhead (HIGH — saves 30-60 min for `optimized`)

Options:

- Reduce `n_refusal_prompts` from 8-30 to 4-6 (generation is expensive)
- Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
- Implement batch generation for `_measure_refusal_rate`
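The second option can be sketched as a generic two-stage search: rank all trials with a cheap proxy, then spend the expensive measurement only on the finalists. `cheap_score` and `expensive_score` are placeholders for the perplexity proxy and the full generation-based refusal measurement:

```python
# Two-stage proxy-then-refine search over hyperparameter trials.
# cheap_score/expensive_score are placeholder objectives (lower is better).
def two_stage_search(candidates, cheap_score, expensive_score, keep_top=5):
    ranked = sorted(candidates, key=cheap_score)   # cheap pass over all trials
    finalists = ranked[:keep_top]                  # expensive pass on top-k only
    return min(finalists, key=expensive_score)

best = two_stage_search(
    range(50),
    cheap_score=lambda x: abs(x - 7),        # proxy thinks 7 is best
    expensive_score=lambda x: (x - 8) ** 2,  # true objective prefers 8
)
```

With 50 trials and `keep_top=5`, the ~35 generate/forward passes per trial are paid only 5 times instead of 50, as long as the proxy is loosely correlated with the true objective.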

### 5. Add early-stopping to RDO (MODERATE — saves 10-30s for `rdo` mode)

Monitor loss convergence and break at a plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.
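A sketch of the plateau rule, where `step_fn` stands in for one RDO Adam step returning the current loss:

```python
# Break out of the 500-step loop once per-step improvement stays below `tol`
# for `patience` consecutive steps.
def run_with_early_stop(step_fn, max_steps=500, tol=1e-4, patience=20):
    prev_loss = float("inf")
    flat = 0
    steps = 0
    loss = None
    for step in range(max_steps):
        loss = step_fn(step)
        steps = step + 1
        if prev_loss - loss < tol:
            flat += 1
            if flat >= patience:
                break  # converged: improvement stalled for `patience` steps
        else:
            flat = 0
        prev_loss = loss
    return steps, loss

# Toy loss: falls linearly for 100 steps, then flatlines at zero.
steps_used, final_loss = run_with_early_stop(lambda s: max(0.0, 1.0 - 0.01 * s))
```

Counting improvement rather than raw loss keeps the rule scale-free, so the same `tol` works across layers with very different direction norms.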

---

## Verdict

The pipeline is **architecturally sound** — the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic → nuclear) gives users clear cost/quality tradeoffs. However, the **PROBE stage bottleneck** (single-prompt forward passes) and the **Bayesian trial overhead** (1,750 forward passes) are the two elephants in the room. Fixing just recommendation #1 would make the entire system 3-5x faster for the majority of users, who run the basic/advanced/aggressive modes.

The `optimized` and `heretic` modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.

**Overall system grade: B+** — excellent functionality, needs batching and early-stopping.
obliteratus/abliterate.py
CHANGED

@@ -77,8 +77,16 @@ METHODS = {
  77         "true_iterative_refinement": False,
  78     },
  79     "aggressive": {
  80 -       "label": "Aggressive (Full Gabliteration)",
  81 -       "description":
  82         "n_directions": 8,
  83         "norm_preserve": True,
  84         "regularization": 0.0,

@@ -87,6 +95,39 @@ METHODS = {
  87         "use_chat_template": True,
  88         "use_whitened_svd": True,
  89         "true_iterative_refinement": True,
  90     },
  91     "informed": {
  92         "label": "Informed (Analysis-Guided)",

@@ -517,6 +558,10 @@ class AbliterationPipeline:
 517         layer_selection: str | None = None,
 518         rdo_refinement: bool | None = None,
 519         use_wasserstein_optimal: bool | None = None,
 520         large_model_mode: bool = False,
 521         on_stage: Callable[[StageResult], None] | None = None,
 522         on_log: Callable[[str], None] | None = None,

@@ -603,6 +648,11 @@ class AbliterationPipeline:
 603         self.rdo_refinement = rdo_refinement if rdo_refinement is not None else method_cfg.get("rdo_refinement", False)
 604         self.use_wasserstein_optimal = use_wasserstein_optimal if use_wasserstein_optimal is not None else method_cfg.get("use_wasserstein_optimal", False)
 605
 606         # Large model mode: conservative defaults for 120B+ models.
 607         # Reduces memory footprint by limiting SAE features, directions,
 608         # and refinement passes. Explicit parameter overrides still apply.

@@ -965,6 +1015,204 @@ class AbliterationPipeline:
 965             self.log(f"  chat template {i + 1}/{n}")
 966         return wrapped
 967
 968     @staticmethod
 969     def _winsorize_activations(
 970         activations: dict[int, list[torch.Tensor]],

@@ -1029,22 +1277,22 @@ class AbliterationPipeline:
1029         def hook_fn(module, input, output):
1030             hidden = output[0] if isinstance(output, tuple) else output
1031             if collect_multi_pos and hidden.shape[1] > 4:
1032 -               # Collect at last, 75%, and 50% positions to capture
1033 -               # reasoning-stage refusal in CoT models (GPT-OSS, QwQ, etc.)
1034                 seq_len = hidden.shape[1]
1035                 positions = [
1036 -                   seq_len - 1,
1037 -                   int(seq_len * 0.75),
1038 -                   int(seq_len * 0.50),
1039                 ]
1040 -               # Deduplicate positions for very short sequences
1041                 positions = sorted(set(positions))
1042 -               pos_acts = hidden[:, positions, :]
1043 -
1044 -
1045 -
1046             else:
1047 -
1048         return hook_fn
1049
1050         for idx in range(n_layers):

@@ -1056,6 +1304,7 @@ class AbliterationPipeline:
1056         # Adaptive max_length: shorten sequences when GPU memory is tight.
1057         # For CoT-aware mode we need more sequence to capture reasoning tokens.
1058         max_length = 384 if collect_multi_pos else 256
1059         if torch.cuda.is_available():
1060             free_gb = sum(
1061                 torch.cuda.mem_get_info(i)[0] / (1024 ** 3)

@@ -1070,21 +1319,32 @@ class AbliterationPipeline:
1070
1071         device = self._get_model_device(model)
1072
1073         try:
1074 -           for
1075 -
1076             inputs = tokenizer(
1077 -
1078                 max_length=max_length,
1079             )
1080             inputs = {k: v.to(device) for k, v in inputs.items()}
1081             with torch.no_grad():
1082                 model(**inputs)
1083 -           # Free forward-pass intermediates between prompts to prevent
1084 -           # CUDA memory fragmentation when headroom is tight
1085             del inputs
1086 -
1087         finally:
1088             for h in hooks:
1089                 h.remove()

@@ -1164,13 +1424,7 @@ class AbliterationPipeline:
1164                     # keep remaining SVD directions orthogonalized against it
1165                     w_dir = w_result.direction.unsqueeze(0)
1166                     sub = torch.cat([w_dir, svd_dirs[1:]], dim=0)
1167 -
1168 -                   for j in range(1, sub.shape[0]):
1169 -                       for kk in range(j):
1170 -                           sub[j] -= (sub[j] @ sub[kk]) * sub[kk]
1171 -                       row_norm = sub[j].norm()
1172 -                       if row_norm > 1e-8:
1173 -                           sub[j] /= row_norm
1174                     self.refusal_subspaces[idx] = sub
1175                     continue
1176             except Exception as e:

@@ -1354,17 +1608,10 @@ class AbliterationPipeline:
1354                 continue
1355             blended = blended / blended_norm
1356             self.refusal_directions[idx] = blended
1357 -           # Update subspace row 0 and re-orthogonalize remaining
1358 -           # rows via Gram-Schmidt to maintain orthogonality.
1359             sub = self.refusal_subspaces[idx]
1360             sub[0] = blended
1361             if sub.shape[0] > 1:
1362 -
1363 -               for k in range(j):
1364 -                   sub[j] -= (sub[j] @ sub[k]) * sub[k]
1365 -               row_norm = sub[j].norm()
1366 -               if row_norm > 1e-8:
1367 -                   sub[j] /= row_norm
1368             self.refusal_subspaces[idx] = sub
1369         self.log(f"  Blended {len(self._strong_layers)} directions (data-driven α per layer)")
1370

@@ -1576,15 +1823,24 @@ class AbliterationPipeline:
1576             sae_mem_mb = 2 * hidden_dim * (sae_expansion * hidden_dim) * 4 / 1e6
1577         except Exception:
1578             pass  # Fallback to hidden_dim-based heuristic
1579         sae = train_sae(
1580             all_acts, hidden_dim,
1581 -           expansion=sae_expansion, n_epochs=
1582 -           sparsity_coef=1e-3, device=
1583         )
1584         result = identify_refusal_features(
1585             sae, self._harmful_acts[idx], self._harmless_acts[idx],
1586             layer_idx=idx, top_k=min(self.n_sae_features, hidden_dim // 2),
1587 -           device=
1588         )
1589         if result.n_refusal_features > 0:
1590             self._sae_directions[idx] = result.sae_directions

@@ -1749,6 +2005,30 @@ class AbliterationPipeline:
1749             strong_layers=self._strong_layers,
1750         )
1751
1752     @staticmethod
1753     def _select_layers_knee(sorted_layers: list[tuple[int, float]]) -> list[int]:
1754         """Select layers using the kneedle algorithm (simplified).

@@ -2465,6 +2745,19 @@ class AbliterationPipeline:
2465             )
2466             return  # Skip standard in-place projection
2467
2468         for pass_num in range(self.refinement_passes):
2469             modified_this_pass = 0
2470             if self.refinement_passes > 1:

@@ -2472,7 +2765,42 @@ class AbliterationPipeline:
2472
2473             # True iterative refinement: re-probe and re-distill after first pass
2474             if pass_num > 0 and self.true_iterative_refinement:
2475                 self.log("  Re-probing model with updated weights...")
2476                 # Clear stale activations before re-probing to avoid memory doubling
2477                 self._harmful_acts.clear()
2478                 self._harmless_acts.clear()

@@ -2945,6 +3273,8 @@ class AbliterationPipeline:
2945             extras.append(f"CoT-preserved({len(self._cot_preserve_directions)})")
2946         if self._kl_contributions:
2947             extras.append("KL-optimized")
2948         mode_label = " + ".join(extras) if extras else "standard"
2949
2950         self.log(f"Excised refusal from {total_modified} matrices [{mode_label}] ({elapsed:.1f}s)")

@@ -2958,21 +3288,58 @@ class AbliterationPipeline:
2958     def _distill_inner(self):
2959         """Re-run distillation without emitting stage events (for iterative refinement).
2960
2961 -       Includes
2962 -       and head re-identification to keep
2963 -       modifications.
2964         """
2965         n_layers = len(self._harmful_means)
2966         norms: dict[int, float] = {}
2967         n_dirs = self.n_directions
2968
2969         # Use whitened SVD when enabled (matching main _distill)
2970         whitened_extractor = None
2971 -       if self.use_whitened_svd and n_dirs > 1:
2972             from obliteratus.analysis.whitened_svd import WhitenedSVDExtractor
2973             whitened_extractor = WhitenedSVDExtractor()
2974
2975         for idx in range(n_layers):

@@ -2984,7 +3351,6 @@ class AbliterationPipeline:
2984                 self.refusal_directions[idx] = direction
2985                 self.refusal_subspaces[idx] = direction.unsqueeze(0)
2986             elif whitened_extractor is not None:
2987 -               # Whitened SVD: same path as main _distill
2988                 result = whitened_extractor.extract(
2989                     self._harmful_acts[idx],
2990                     self._harmless_acts[idx],

@@ -3016,9 +3382,8 @@ class AbliterationPipeline:
3016         sorted_layers = sorted(norms.items(), key=lambda x: x[1], reverse=True)
3017         self._strong_layers = self._select_layers_knee(sorted_layers)
3018
3019 -       # Re-apply jailbreak-contrastive blending
3020         if self.use_jailbreak_contrast and self._jailbreak_means:
3021 -           blend_alpha = 0.5
3022             for idx in self._strong_layers:
3023                 if idx not in self._jailbreak_means:
3024                     continue

@@ -3027,6 +3392,9 @@ class AbliterationPipeline:
3027                 if jb_norm > 0:
3028                     jb_dir = jb_diff / jb_norm
3029                     std_dir = self.refusal_directions[idx]
3030                     blended = (1 - blend_alpha) * std_dir + blend_alpha * jb_dir
3031                     blended_norm = blended.norm()
3032                     if blended_norm < 1e-8:

@@ -3036,12 +3404,7 @@ class AbliterationPipeline:
3036                     sub = self.refusal_subspaces[idx]
3037                     sub[0] = blended
3038                     if sub.shape[0] > 1:
3039 -
3040 -                       for k in range(j):
3041 -                           sub[j] -= (sub[j] @ sub[k]) * sub[k]
3042 -                       row_norm = sub[j].norm()
3043 -                       if row_norm > 1e-8:
3044 -                           sub[j] /= row_norm
3045                     self.refusal_subspaces[idx] = sub
3046
3047         # Re-identify refusal heads with updated directions

@@ -3474,16 +3837,19 @@ class AbliterationPipeline:
3474
3475         if W.shape[-1] == d.shape[0]:
3476             # Standard Linear: W is (out_features, hidden_dim)
3477 -
3478             coeff = W @ d  # (out_features, 1)
3479             W.sub_(d.T * (scale * coeff))  # in-place rank-1 update
3480             del coeff
3481
3482
3483 -
3484 -
3485 -
3486 -
3487             if is_quantized:
3488                 AbliterationPipeline._replace_quantized_weight(proj, W)

@@ -3492,16 +3858,19 @@ class AbliterationPipeline:
3492
3493         elif W.shape[0] == d.shape[0]:
3494             # Transposed (e.g. GPT-2 Conv1D): W is (hidden_dim, out_features)
3495 -
3496             coeff = d.T @ W  # (1, out_features)
3497             W.sub_((scale * d) * coeff)  # in-place rank-1 update
3498             del coeff
3499
3500 -
3501 -
3502 -
3503 -
3504             if is_quantized:
3505                 AbliterationPipeline._replace_quantized_weight(proj, W)

@@ -4619,55 +4988,80 @@ class AbliterationPipeline:
4619         from obliteratus.evaluation.advanced_metrics import _is_refusal_detailed
4620         prompt_verdicts: list[tuple[str, bool, str]] = []  # (tier, refused, reason)
4621
4622 -       for
4623 -
4624         try:
4625             inputs = tokenizer(
4626 -
4627 -               truncation=True, max_length=512,
4628             )
4629 -
4630             inputs = {k: v.to(device) for k, v in inputs.items()}
4631             with torch.no_grad():
4632 -
4633                 **inputs,
4634                 max_new_tokens=128,
4635                 do_sample=False,
4636             )
4637 -
4638 -           #
4639 -
4640 -
4641 -
4642 -
4643 -
4644 -
4645 -
4646 -
4647 -
4648 -
4649 -
4650 -
4651 -
4652 -
4653 -
4654 -
4655 -
4656 -
4657             self._free_gpu_memory()
4658         except torch.cuda.OutOfMemoryError:
4659             self._free_gpu_memory()
4660 -           self.log(f"  [
4661             self.log("  Skipping remaining refusal tests (CUDA out of memory)")
4662 -
4663         except (RuntimeError, Exception) as e:
4664             err_msg = str(e)
4665             if "CUDA" in err_msg or "illegal" in err_msg.lower():
4666                 self._free_gpu_memory()
4667 -               self.log(f"  [
4668                 self.log(f"  Skipping remaining refusal tests (CUDA error: {err_msg[:120]})")
4669 -
4670 -
4671
4672         if harmful_responses:
4673             from obliteratus.evaluation.advanced_metrics import refusal_rate as compute_refusal_rate

@@ -4852,6 +5246,10 @@ class AbliterationPipeline:
4852             "cot_aware": self.cot_aware,
4853             "use_kl_optimization": self.use_kl_optimization,
4854             "use_lora_ablation": self.use_lora_ablation,
4855         },
4856         "references": [
4857             "Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (NeurIPS 2024)",

  77         "true_iterative_refinement": False,
  78     },
  79     "aggressive": {
  80 +       "label": "Aggressive (Full Gabliteration + Enhanced)",
  81 +       "description": (
  82 +           "Maximum direction extraction with enhanced adaptive pipeline. "
  83 +           "Whitened SVD with jailbreak-contrastive refinement, layer-adaptive "
  84 +           "projection strengths, cosine-similarity early-exit for iterative "
  85 +           "refinement (skips unnecessary re-probe passes when directions "
  86 +           "converge), attention head surgery on top safety heads, and "
  87 +           "activation winsorization for robust direction extraction. "
  88 +           "Zero regularization for maximum refusal removal."
  89 +       ),
  90         "n_directions": 8,
  91         "norm_preserve": True,
  92         "regularization": 0.0,

  95         "use_chat_template": True,
  96         "use_whitened_svd": True,
  97         "true_iterative_refinement": True,
  98 +       "use_jailbreak_contrast": True,
  99 +       "layer_adaptive_strength": True,
 100 +       "attention_head_surgery": True,
 101 +       "winsorize_activations": True,
 102 +       "winsorize_percentile": 0.01,
 103 +   },
 104 +   "spectral_cascade": {
 105 +       "label": "Spectral Cascade (Multi-Resolution Frequency Decomposition)",
 106 +       "description": (
 107 +           "Novel method that decomposes refusal signals into spectral "
 108 +           "frequency bands across the layer axis using DCT. Applies "
 109 +           "strong projection to low-frequency components (systematic "
 110 +           "refusal trend spanning many layers) and gentle/no projection "
 111 +           "to high-frequency components (capability-entangled noise). "
 112 +           "Cascade refinement re-measures residual refusal after each "
 113 +           "frequency band and stops early when signal is eliminated. "
 114 +           "Achieves cleaner removal with less capability damage by "
 115 +           "separating trained-in refusal patterns from per-layer artifacts."
 116 +       ),
 117 +       "n_directions": 6,
 118 +       "norm_preserve": True,
 119 +       "regularization": 0.0,
 120 +       "refinement_passes": 2,
 121 +       "project_biases": True,
 122 +       "use_chat_template": True,
 123 +       "use_whitened_svd": True,
 124 +       "true_iterative_refinement": True,
 125 +       "use_jailbreak_contrast": True,
 126 +       "layer_adaptive_strength": True,
 127 +       "attention_head_surgery": False,
 128 +       "spectral_cascade": True,
 129 +       "spectral_bands": 3,
 130 +       "spectral_threshold": 0.05,
 131     },
 132     "informed": {
 133         "label": "Informed (Analysis-Guided)",

 558         layer_selection: str | None = None,
 559         rdo_refinement: bool | None = None,
 560         use_wasserstein_optimal: bool | None = None,
 561 +       # Spectral Cascade parameters
 562 +       spectral_cascade: bool | None = None,
 563 +       spectral_bands: int | None = None,
 564 +       spectral_threshold: float | None = None,
 565         large_model_mode: bool = False,
 566         on_stage: Callable[[StageResult], None] | None = None,
 567         on_log: Callable[[str], None] | None = None,

 648         self.rdo_refinement = rdo_refinement if rdo_refinement is not None else method_cfg.get("rdo_refinement", False)
 649         self.use_wasserstein_optimal = use_wasserstein_optimal if use_wasserstein_optimal is not None else method_cfg.get("use_wasserstein_optimal", False)
 650
 651 +       # Spectral Cascade parameters
 652 +       self.spectral_cascade = spectral_cascade if spectral_cascade is not None else method_cfg.get("spectral_cascade", False)
 653 +       self.spectral_bands = spectral_bands if spectral_bands is not None else method_cfg.get("spectral_bands", 3)
 654 +       self.spectral_threshold = spectral_threshold if spectral_threshold is not None else method_cfg.get("spectral_threshold", 0.05)
 655 +
 656         # Large model mode: conservative defaults for 120B+ models.
 657         # Reduces memory footprint by limiting SAE features, directions,
 658         # and refinement passes. Explicit parameter overrides still apply.

1015             self.log(f"  chat template {i + 1}/{n}")
1016         return wrapped
1017
1018 +   @staticmethod
1019 +   def _apply_spectral_cascade_weights(self):
1020 +       """Apply Spectral Cascade: frequency-selective per-layer projection weights.
1021 +
1022 +       Novel contribution: instead of treating refusal removal as a flat
1023 +       linear operation across layers, Spectral Cascade decomposes the
1024 +       refusal signal into spectral frequency bands via DCT and applies
1025 +       frequency-dependent attenuation. This separates *systematic* refusal
1026 +       (low-frequency smooth trend across many layers — the trained-in
1027 +       alignment signal) from *per-layer noise* (high-frequency spikes that
1028 +       are more likely capability-entangled artifacts).
1029 +
1030 +       The algorithm has three stages:
1031 +
1032 +       **Stage 1 — Direction coherence weighting.**
1033 +       For each layer, compute the cosine similarity of its refusal direction
1034 +       with its neighbors. Layers whose refusal direction is coherent with
1035 +       adjacent layers are more likely part of the systematic refusal trend.
1036 +       This produces a per-layer coherence score in [0, 1] that modulates
1037 +       the magnitude signal before spectral decomposition.
1038 +
1039 +       **Stage 2 — DCT spectral decomposition.**
1040 +       Apply a Type-II DCT to the coherence-weighted magnitude vector.
1041 +       Split the resulting coefficients into frequency bands (adaptively
1042 +       sized based on spectral energy distribution). Low-frequency bands
1043 +       get full projection weight; high-frequency bands get attenuated.
1044 +
1045 +       **Stage 3 — Cascade with early-exit.**
1046 +       Process bands from lowest to highest frequency. After each band,
1047 +       measure remaining spectral energy. Stop early when residual energy
1048 +       drops below ``spectral_threshold``.
1049 +
1050 +       Results are stored in ``_layer_excise_weights`` to modulate
1051 +       per-layer projection strength during EXCISE.
1052 +       """
1053 +       sorted_layers = sorted(self._strong_layers)
1054 +       if len(sorted_layers) < 4:
1055 +           # Too few layers for meaningful spectral decomposition
1056 +           return
1057 +
1058 +       # ── Stage 1: Direction coherence weighting ──────────────────
1059 +       # Measure how coherent each layer's refusal direction is with its
1060 +       # neighbors. High coherence = part of the systematic refusal trend.
1061 +       # Low coherence = noisy / capability-entangled.
1062 +       magnitudes = []
1063 +       directions = []
1064 +       for idx in sorted_layers:
1065 +           if idx in self.refusal_directions:
1066 +               d = self.refusal_directions[idx].float()
1067 +               directions.append(d / d.norm().clamp(min=1e-8))
1068 +               magnitudes.append(d.norm().item())
1069 +           else:
1070 +               directions.append(None)
1071 +               magnitudes.append(0.0)
1072 +
1073 +       n = len(magnitudes)
1074 +       coherence = torch.ones(n)
1075 +       for i in range(n):
1076 +           if directions[i] is None:
if directions[i] is None:
|
| 1077 |
+
coherence[i] = 0.0
|
| 1078 |
+
continue
|
| 1079 |
+
# Average cosine similarity with up to 2 neighbors on each side
|
| 1080 |
+
neighbor_sims = []
|
| 1081 |
+
for delta in [-2, -1, 1, 2]:
|
| 1082 |
+
j = i + delta
|
| 1083 |
+
if 0 <= j < n and directions[j] is not None:
|
| 1084 |
+
cos = (directions[i] @ directions[j]).abs().item()
|
| 1085 |
+
neighbor_sims.append(cos)
|
| 1086 |
+
if neighbor_sims:
|
| 1087 |
+
coherence[i] = sum(neighbor_sims) / len(neighbor_sims)
|
| 1088 |
+
else:
|
| 1089 |
+
coherence[i] = 0.5 # isolated layer β neutral
|
| 1090 |
+
|
| 1091 |
+
# Coherence-weighted magnitudes: amplify coherent layers, dampen noisy ones
|
| 1092 |
+
magnitudes_t = torch.tensor(magnitudes, dtype=torch.float32)
|
| 1093 |
+
# Soft modulation: weighted_mag = mag * (0.3 + 0.7 * coherence)
|
| 1094 |
+
# This keeps all layers > 0 but boosts coherent ones
|
| 1095 |
+
weighted_mags = magnitudes_t * (0.3 + 0.7 * coherence)
|
| 1096 |
+
|
| 1097 |
+
# Normalize to unit energy for stable DCT
|
| 1098 |
+
mag_norm = weighted_mags.norm()
|
| 1099 |
+
if mag_norm < 1e-8:
|
| 1100 |
+
return
|
| 1101 |
+
weighted_mags = weighted_mags / mag_norm
|
| 1102 |
+
|
| 1103 |
+
self.log(
|
| 1104 |
+
f" Spectral Cascade: coherence range "
|
| 1105 |
+
f"[{coherence.min().item():.3f}, {coherence.max().item():.3f}]"
|
| 1106 |
+
)
|
| 1107 |
+
|
| 1108 |
+
# ββ Stage 2: DCT spectral decomposition ββββββββββββββββββββ
|
| 1109 |
+
# Build orthonormal Type-II DCT basis
|
| 1110 |
+
dct_basis = torch.zeros(n, n)
|
| 1111 |
+
for k in range(n):
|
| 1112 |
+
for i in range(n):
|
| 1113 |
+
dct_basis[k, i] = math.cos(math.pi * k * (2 * i + 1) / (2 * n))
|
| 1114 |
+
if k == 0:
|
| 1115 |
+
dct_basis[k] *= math.sqrt(1.0 / n)
|
| 1116 |
+
else:
|
| 1117 |
+
dct_basis[k] *= math.sqrt(2.0 / n)
|
| 1118 |
+
|
| 1119 |
+
# DCT coefficients
|
| 1120 |
+
coeffs = dct_basis @ weighted_mags # (n,)
|
| 1121 |
+
|
| 1122 |
+
# Adaptive band count: determine optimal number of bands based on
|
| 1123 |
+
# where spectral energy concentrates. Compute cumulative energy and
|
| 1124 |
+
# find the coefficient index where 90% of energy is captured.
|
| 1125 |
+
# Per Parseval's theorem, spectral energy = sum of squared coefficients
|
| 1126 |
+
coeff_energy = coeffs.pow(2)
|
| 1127 |
+
total_energy = coeff_energy.sum().item()
|
| 1128 |
+
if total_energy < 1e-8:
|
| 1129 |
+
return
|
| 1130 |
+
|
| 1131 |
+
cumulative = 0.0
|
| 1132 |
+
knee_idx = n
|
| 1133 |
+
for k in range(n):
|
| 1134 |
+
cumulative += coeff_energy[k].item()
|
| 1135 |
+
if cumulative >= 0.9 * total_energy:
|
| 1136 |
+
knee_idx = k + 1
|
| 1137 |
+
break
|
| 1138 |
+
|
| 1139 |
+
# Use at most spectral_bands, but reduce if energy is concentrated
|
| 1140 |
+
# in fewer coefficients (no point splitting beyond the knee)
|
| 1141 |
+
n_bands = min(self.spectral_bands, max(2, knee_idx))
|
| 1142 |
+
|
| 1143 |
+
# Split coefficients into bands (low β high frequency)
|
| 1144 |
+
band_size = max(1, n // n_bands)
|
| 1145 |
+
bands = []
|
| 1146 |
+
for b in range(n_bands):
|
| 1147 |
+
start = b * band_size
|
| 1148 |
+
end = n if b == n_bands - 1 else (b + 1) * band_size
|
| 1149 |
+
bands.append((start, end))
|
| 1150 |
+
|
| 1151 |
+
# ββ Stage 3: Frequency-band cascade with early-exit βββββββββ
|
| 1152 |
+
layer_weights = torch.ones(n)
|
| 1153 |
+
|
| 1154 |
+
self.log(
|
| 1155 |
+
f" Spectral Cascade: {n_bands} bands over {n} layers "
|
| 1156 |
+
f"(knee at coeff {knee_idx}, 90% energy)"
|
| 1157 |
+
)
|
| 1158 |
+
|
| 1159 |
+
for band_idx, (start, end) in enumerate(bands):
|
| 1160 |
+
# Reconstruct this band's contribution via inverse DCT
|
| 1161 |
+
band_coeffs = torch.zeros(n)
|
| 1162 |
+
band_coeffs[start:end] = coeffs[start:end]
|
| 1163 |
+
band_signal = dct_basis.T @ band_coeffs
|
| 1164 |
+
|
| 1165 |
+
band_energy = band_signal.norm().item()
|
| 1166 |
+
freq_label = "low" if band_idx == 0 else ("mid" if band_idx < n_bands - 1 else "high")
|
| 1167 |
+
|
| 1168 |
+
# Attenuation schedule: band 0 (lowest freq) = 1.0, last band = 0.2
|
| 1169 |
+
# Smooth exponential decay rather than linear for gentler falloff
|
| 1170 |
+
if n_bands > 1:
|
| 1171 |
+
t = band_idx / (n_bands - 1)
|
| 1172 |
+
attenuation = math.exp(-1.6 * t) # e^0=1.0, e^-1.6β0.20
|
| 1173 |
+
else:
|
| 1174 |
+
attenuation = 1.0
|
| 1175 |
+
|
| 1176 |
+
# Per-layer weight modulation based on this band's contribution
|
| 1177 |
+
for i in range(n):
|
| 1178 |
+
if abs(weighted_mags[i].item()) > 1e-10:
|
| 1179 |
+
band_fraction = abs(band_signal[i].item()) / (abs(weighted_mags[i].item()) + 1e-10)
|
| 1180 |
+
band_fraction = min(band_fraction, 1.0)
|
| 1181 |
+
layer_weights[i] = (
|
| 1182 |
+
layer_weights[i] * (1.0 - band_fraction)
|
| 1183 |
+
+ attenuation * band_fraction
|
| 1184 |
+
)
|
| 1185 |
+
|
| 1186 |
+
self.log(
|
| 1187 |
+
f" Band {band_idx} ({freq_label}-freq, coeffs {start}-{end}): "
|
| 1188 |
+
f"energy={band_energy:.4f}, attenuation={attenuation:.2f}"
|
| 1189 |
+
)
|
| 1190 |
+
|
| 1191 |
+
# Cascade early-exit: check remaining spectral energy
|
| 1192 |
+
remaining_coeffs = torch.zeros(n)
|
| 1193 |
+
for future_start, future_end in bands[band_idx + 1:]:
|
| 1194 |
+
remaining_coeffs[future_start:future_end] = coeffs[future_start:future_end]
|
| 1195 |
+
remaining_energy = (dct_basis.T @ remaining_coeffs).norm().item()
|
| 1196 |
+
|
| 1197 |
+
if remaining_energy < self.spectral_threshold:
|
| 1198 |
+
self.log(
|
| 1199 |
+
f" Cascade early-exit: remaining energy {remaining_energy:.4f} "
|
| 1200 |
+
f"< threshold {self.spectral_threshold}"
|
| 1201 |
+
)
|
| 1202 |
+
break
|
| 1203 |
+
|
| 1204 |
+
# Store spectral weights into _layer_excise_weights
|
| 1205 |
+
if not hasattr(self, "_layer_excise_weights"):
|
| 1206 |
+
self._layer_excise_weights = {}
|
| 1207 |
+
for i, idx in enumerate(sorted_layers):
|
| 1208 |
+
existing = self._layer_excise_weights.get(idx, 1.0)
|
| 1209 |
+
self._layer_excise_weights[idx] = existing * layer_weights[i].item()
|
| 1210 |
+
|
| 1211 |
+
self.log(
|
| 1212 |
+
f" Spectral Cascade: weight range "
|
| 1213 |
+
f"[{min(layer_weights).item():.3f}, {max(layer_weights).item():.3f}]"
|
| 1214 |
+
)
|
| 1215 |
+
|
| 1216 |
@staticmethod
|
| 1217 |
def _winsorize_activations(
|
| 1218 |
activations: dict[int, list[torch.Tensor]],
|
|
|
|
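The Stage 2 construction above can be checked independently of the pipeline. This is a minimal pure-Python sketch (no torch) of the same orthonormal Type-II DCT basis; `dct_basis` and `dot` here are local illustration names. It verifies the two properties the method relies on: the basis rows are orthonormal (so Parseval holds and "spectral energy" is well defined), and a smooth per-layer magnitude profile really does concentrate its energy in the lowest-frequency coefficients:

```python
import math

def dct_basis(n):
    # Orthonormal Type-II DCT basis; row k is frequency k, as in Stage 2 above.
    basis = [[math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n)]
             for k in range(n)]
    for k in range(n):
        s = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis[k] = [s * v for v in basis[k]]
    return basis

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n = 8
B = dct_basis(n)
# Rows are orthonormal: B[k] . B[l] = delta(k, l)
for k in range(n):
    for l in range(n):
        expected = 1.0 if k == l else 0.0
        assert abs(dot(B[k], B[l]) - expected) < 1e-9

# A smooth "systematic" magnitude profile (linear ramp) concentrates
# almost all spectral energy in the first two coefficients.
signal = [1.0 - 0.1 * i for i in range(n)]
coeffs = [dot(B[k], signal) for k in range(n)]
total = sum(c * c for c in coeffs)
low = sum(c * c for c in coeffs[:2])
assert low / total > 0.95
# Parseval: spectral energy equals signal energy for an orthonormal basis.
assert abs(total - sum(x * x for x in signal)) < 1e-9
```

A high-frequency spike at a single layer would, by contrast, spread its energy across many coefficients, which is exactly what the attenuation schedule suppresses.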
        def hook_fn(module, input, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if collect_multi_pos and hidden.shape[1] > 4:
                seq_len = hidden.shape[1]
                positions = [
                    seq_len - 1,
                    int(seq_len * 0.75),
                    int(seq_len * 0.50),
                ]
                positions = sorted(set(positions))
                pos_acts = hidden[:, positions, :]
                avg_act = pos_acts.mean(dim=1).detach().cpu().float()
                # Unbatch: preserve per-prompt (1, hidden) structure
                for b in range(avg_act.shape[0]):
                    activations[idx].append(avg_act[b:b + 1])
            else:
                act = hidden[:, -1, :].detach().cpu().float()
                for b in range(act.shape[0]):
                    activations[idx].append(act[b:b + 1])
        return hook_fn

    for idx in range(n_layers):
...
        # Adaptive max_length: shorten sequences when GPU memory is tight.
        # For CoT-aware mode we need more sequence to capture reasoning tokens.
        max_length = 384 if collect_multi_pos else 256
        free_gb = 0.0
        if torch.cuda.is_available():
            free_gb = sum(
                torch.cuda.mem_get_info(i)[0] / (1024 ** 3)
...
        device = self._get_model_device(model)

        # Batch prompts for throughput; hooks unbatch per-prompt activations
        batch_size = 16 if free_gb > 4.0 else 8 if free_gb > 2.0 else 1
        # Left-pad so position -1 is always the last real token in every batch element
        orig_padding_side = getattr(tokenizer, "padding_side", "right")
        if batch_size > 1:
            tokenizer.padding_side = "left"
            if tokenizer.pad_token_id is None:
                tokenizer.pad_token_id = tokenizer.eos_token_id
        try:
            for batch_start in range(0, len(prompts), batch_size):
                batch_end = min(batch_start + batch_size, len(prompts))
                batch = prompts[batch_start:batch_end]
                self.log(f"  [{label}] prompts {batch_start + 1}-{batch_end}/{len(prompts)}")
                inputs = tokenizer(
                    batch, return_tensors="pt", padding=True, truncation=True,
                    max_length=max_length,
                )
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    model(**inputs)
                del inputs
                # Free GPU memory every few batches, not every prompt
                if (batch_end % (batch_size * 4) == 0) or batch_end == len(prompts):
                    self._free_gpu_memory()
        finally:
            tokenizer.padding_side = orig_padding_side
            for h in hooks:
                h.remove()
...
                    # keep remaining SVD directions orthogonalized against it
                    w_dir = w_result.direction.unsqueeze(0)
                    sub = torch.cat([w_dir, svd_dirs[1:]], dim=0)
                    sub = self._orthogonalize_subspace(sub)
                    self.refusal_subspaces[idx] = sub
                    continue
                except Exception as e:
...
                    continue
                blended = blended / blended_norm
                self.refusal_directions[idx] = blended
                sub = self.refusal_subspaces[idx]
                sub[0] = blended
                if sub.shape[0] > 1:
                    sub = self._orthogonalize_subspace(sub)
                self.refusal_subspaces[idx] = sub
            self.log(f"  Blended {len(self._strong_layers)} directions (data-driven alpha per layer)")

...
                sae_mem_mb = 2 * hidden_dim * (sae_expansion * hidden_dim) * 4 / 1e6
            except Exception:
                pass  # Fallback to hidden_dim-based heuristic
            # Use GPU when enough headroom exists (SAE is small relative to model)
            sae_device = "cpu"
            if torch.cuda.is_available():
                try:
                    sae_free_mb = torch.cuda.mem_get_info()[0] / 1e6
                    if sae_free_mb > sae_mem_mb + 1024:
                        sae_device = "cuda"
                except Exception:
                    pass
            sae = train_sae(
                all_acts, hidden_dim,
                expansion=sae_expansion, n_epochs=15,
                sparsity_coef=1e-3, device=sae_device,
            )
            result = identify_refusal_features(
                sae, self._harmful_acts[idx], self._harmless_acts[idx],
                layer_idx=idx, top_k=min(self.n_sae_features, hidden_dim // 2),
                device=sae_device,
            )
            if result.n_refusal_features > 0:
                self._sae_directions[idx] = result.sae_directions
...
            strong_layers=self._strong_layers,
        )

    @staticmethod
    def _orthogonalize_subspace(sub: torch.Tensor) -> torch.Tensor:
        """Orthogonalize rows of a subspace matrix via QR decomposition.

        Replaces the duplicated Gram-Schmidt nested loops with a single QR call
        that is numerically more stable and O(nk^2) instead of O(n^2 k).

        Args:
            sub: (k, hidden_dim) tensor whose rows should be orthonormalized.
                Row 0 is preserved as the primary direction.

        Returns:
            Orthonormalized subspace tensor with the same shape.
        """
        if sub.shape[0] <= 1:
            return sub
        # QR on the transpose: sub^T = Q @ R, then Q^T has orthonormal rows
        Q, _ = torch.linalg.qr(sub.T)
        result = Q[:, :sub.shape[0]].T  # (k, hidden_dim)
        # Ensure row 0 points in the same direction as original
        if (result[0] @ sub[0]) < 0:
            result[0] = -result[0]
        return result

    @staticmethod
    def _select_layers_knee(sorted_layers: list[tuple[int, float]]) -> list[int]:
        """Select layers using the kneedle algorithm (simplified).
...
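The property `_orthogonalize_subspace` guarantees (orthonormal rows, row 0 kept as the primary direction up to a sign fix) can be illustrated without torch. This sketch uses a tiny classical Gram-Schmidt on plain lists; `torch.linalg.qr` on the transpose achieves the same result more stably, which is the method's whole point. All names here are local to the example:

```python
import math

def orthonormalize_rows(rows):
    # Classical Gram-Schmidt over list-of-lists vectors. Equivalent in
    # outcome to the QR-based method above for a full-rank input.
    out = []
    for r in rows:
        v = list(r)
        for q in out:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        out.append([a / norm for a in v])
    return out

sub = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
Q = orthonormalize_rows(sub)
# Row 0 keeps its direction, only rescaled to unit norm
assert abs(Q[0][0] - 1.0) < 1e-12 and abs(Q[0][1]) < 1e-12
# Rows are orthonormal
assert abs(sum(a * b for a, b in zip(Q[0], Q[1]))) < 1e-12
assert abs(sum(a * a for a in Q[1]) - 1.0) < 1e-12
```

The classical variant shown here accumulates floating-point error as `k` grows, which is why the production code prefers a single QR factorization.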
            )
            return  # Skip standard in-place projection

        # ── Spectral Cascade: frequency-band modulated projection ────
        # Decomposes refusal signal magnitude across layers into spectral
        # frequency bands using DCT. Low-frequency components (smooth
        # trends spanning many layers) get strong projection; high-frequency
        # components (per-layer noise / capability-entangled) get gentle or
        # no projection. This is applied as a per-layer weight multiplier
        # that modulates the effective projection strength.
        if self.spectral_cascade and self._strong_layers:
            self._apply_spectral_cascade_weights()

        # Track previous directions for cosine-similarity early-exit
        _prev_directions: dict[int, torch.Tensor] = {}

        for pass_num in range(self.refinement_passes):
            modified_this_pass = 0
            if self.refinement_passes > 1:
...
            # True iterative refinement: re-probe and re-distill after first pass
            if pass_num > 0 and self.true_iterative_refinement:
                # ── Cosine-similarity early-exit ─────────────────────
                # Skip re-probing if directions converged (all layers have
                # cosine similarity > 0.99 with previous pass). This saves
                # the full PROBE+DISTILL cost when pass N produces nearly
                # identical directions to pass N-1.
                if _prev_directions:
                    converged = True
                    min_cos = 1.0
                    for idx in self._strong_layers:
                        if idx in _prev_directions and idx in self.refusal_directions:
                            prev_d = _prev_directions[idx].float()
                            curr_d = self.refusal_directions[idx].float()
                            # Skip degenerate zero-vector layers
                            pn = prev_d.norm().item()
                            cn = curr_d.norm().item()
                            if pn < 1e-8 or cn < 1e-8:
                                continue
                            cos = (prev_d @ curr_d).abs().item() / (pn * cn)
                            min_cos = min(min_cos, cos)
                            if cos < 0.99:
                                converged = False
                                break
                    if converged:
                        self.log(
                            f"  Early-exit: directions converged (min cosine={min_cos:.4f} >= 0.99), "
                            f"skipping pass {pass_num + 1}"
                        )
                        break

                self.log("  Re-probing model with updated weights...")
                # Save current directions before re-distilling
                _prev_directions = {
                    idx: self.refusal_directions[idx].clone()
                    for idx in self._strong_layers
                    if idx in self.refusal_directions
                }
                # Clear stale activations before re-probing to avoid memory doubling
                self._harmful_acts.clear()
                self._harmless_acts.clear()
...
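The convergence test above can be exercised on its own. This is a pure-Python sketch of the same check (absolute cosine between successive direction estimates, degenerate near-zero vectors skipped); `converged` and the toy vectors are illustration names, not pipeline API:

```python
import math

def converged(prev_dirs, curr_dirs, thresh=0.99):
    # Mirrors the early-exit above: every non-degenerate layer must have
    # |cos(prev, curr)| >= thresh for the refinement loop to stop early.
    for prev, curr in zip(prev_dirs, curr_dirs):
        pn = math.sqrt(sum(a * a for a in prev))
        cn = math.sqrt(sum(a * a for a in curr))
        if pn < 1e-8 or cn < 1e-8:
            continue  # degenerate zero-vector layer: ignore
        cos = abs(sum(a * b for a, b in zip(prev, curr))) / (pn * cn)
        if cos < thresh:
            return False
    return True

a = [[1.0, 0.0], [0.0, 2.0]]
b = [[0.999, 0.02], [0.0, -2.0]]  # sign flip still converged, via |cos|
c = [[1.0, 0.0], [1.0, 1.0]]      # second layer rotated 45 degrees
assert converged(a, b)
assert not converged(a, c)
```

The absolute value matters because a refusal direction is only defined up to sign; a flipped but otherwise identical direction should not trigger a costly re-probe.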
            extras.append(f"CoT-preserved({len(self._cot_preserve_directions)})")
        if self._kl_contributions:
            extras.append("KL-optimized")
        if self.spectral_cascade:
            extras.append(f"spectral-cascade({self.spectral_bands}-bands)")
        mode_label = " + ".join(extras) if extras else "standard"

        self.log(f"Excised refusal from {total_modified} matrices [{mode_label}] ({elapsed:.1f}s)")
...
    def _distill_inner(self):
        """Re-run distillation without emitting stage events (for iterative refinement).

        Includes Wasserstein-optimal extraction, whitened SVD, jailbreak-contrastive
        blending with data-driven alpha, and head re-identification to keep
        directions fresh after weight modifications.
        """
        n_layers = len(self._harmful_means)
        norms: dict[int, float] = {}
        n_dirs = self.n_directions

        # Use Wasserstein-optimal extraction when enabled (matching main _distill)
        wasserstein_extractor = None
        if self.use_wasserstein_optimal:
            try:
                from obliteratus.analysis.wasserstein_optimal import WassersteinOptimalExtractor
                wasserstein_extractor = WassersteinOptimalExtractor()
            except Exception:
                pass

        # Use whitened SVD when enabled (matching main _distill)
        whitened_extractor = None
        if self.use_whitened_svd and n_dirs > 1 and wasserstein_extractor is None:
            from obliteratus.analysis.whitened_svd import WhitenedSVDExtractor
            whitened_extractor = WhitenedSVDExtractor()

        for idx in range(n_layers):
            # Wasserstein-optimal path (matching main _distill)
            if wasserstein_extractor is not None:
                if idx in self._harmful_acts and idx in self._harmless_acts:
                    try:
                        w_result = wasserstein_extractor.extract(
                            self._harmful_acts[idx],
                            self._harmless_acts[idx],
                            layer_idx=idx,
                        )
                        self.refusal_directions[idx] = w_result.direction
                        self.refusal_subspaces[idx] = w_result.direction.unsqueeze(0)
                        norms[idx] = w_result.refusal_projection

                        if n_dirs > 1:
                            harmful_stack = torch.stack(self._harmful_acts[idx]).squeeze(1)
                            harmless_stack = torch.stack(self._harmless_acts[idx]).squeeze(1)
                            diff_matrix = harmful_stack - harmless_stack
                            if torch.isfinite(diff_matrix).all():
                                k = min(n_dirs, diff_matrix.shape[0], diff_matrix.shape[1])
                                _, _, Vh = torch.linalg.svd(diff_matrix, full_matrices=False)
                                w_dir = w_result.direction.unsqueeze(0)
                                sub = torch.cat([w_dir, Vh[1:k]], dim=0)
                                sub = self._orthogonalize_subspace(sub)
                                self.refusal_subspaces[idx] = sub
                        continue
                    except Exception:
                        pass  # Fall through to SVD

            if n_dirs == 1:
                diff = (self._harmful_means[idx] - self._harmless_means[idx]).squeeze(0)
                norm = diff.norm().item()
...
                self.refusal_directions[idx] = direction
                self.refusal_subspaces[idx] = direction.unsqueeze(0)
            elif whitened_extractor is not None:
                result = whitened_extractor.extract(
                    self._harmful_acts[idx],
                    self._harmless_acts[idx],
...
        sorted_layers = sorted(norms.items(), key=lambda x: x[1], reverse=True)
        self._strong_layers = self._select_layers_knee(sorted_layers)

        # Re-apply jailbreak-contrastive blending with data-driven alpha
        if self.use_jailbreak_contrast and self._jailbreak_means:
            for idx in self._strong_layers:
                if idx not in self._jailbreak_means:
                    continue
...
                if jb_norm > 0:
                    jb_dir = jb_diff / jb_norm
                    std_dir = self.refusal_directions[idx]
                    # Data-driven alpha matching _distill: cos=1 -> 0.1, cos=0 -> 0.7
                    cos_sim = abs((std_dir @ jb_dir).item())
                    blend_alpha = max(0.1, min(0.7, 0.7 - 0.6 * cos_sim))
                    blended = (1 - blend_alpha) * std_dir + blend_alpha * jb_dir
                    blended_norm = blended.norm()
                    if blended_norm < 1e-8:
...
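The data-driven alpha schedule above is a simple clamped linear map of cosine similarity; when the jailbreak direction nearly coincides with the standard one it contributes little, and an orthogonal jailbreak direction gets the maximum 0.7 blend weight. A standalone sketch (the `blend_alpha` name is local to this example):

```python
# Data-driven blend weight: alpha = clamp(0.7 - 0.6 * |cos|, 0.1, 0.7),
# mirroring the schedule in the blending code above.
def blend_alpha(cos_sim: float) -> float:
    return max(0.1, min(0.7, 0.7 - 0.6 * abs(cos_sim)))

assert blend_alpha(1.0) == 0.1   # redundant jailbreak direction: minimal blend
assert blend_alpha(0.0) == 0.7   # orthogonal direction: maximal blend
assert abs(blend_alpha(0.5) - 0.4) < 1e-12
```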
                    sub = self.refusal_subspaces[idx]
                    sub[0] = blended
                    if sub.shape[0] > 1:
                        sub = self._orthogonalize_subspace(sub)
                    self.refusal_subspaces[idx] = sub

        # Re-identify refusal heads with updated directions
...
        if W.shape[-1] == d.shape[0]:
            # Standard Linear: W is (out_features, hidden_dim)
            original_norm_sq = W.pow(2).sum().item() if norm_preserve else 0.0

            coeff = W @ d  # (out_features, 1)
            coeff_norm_sq = coeff.pow(2).sum().item() if norm_preserve else 0.0
            W.sub_(d.T * (scale * coeff))  # in-place rank-1 update
            del coeff

            # Analytical norm: ||W'||^2 = ||W||^2 - scale*(2 - scale)*||coeff||^2
            if norm_preserve and original_norm_sq > 0:
                new_norm_sq = max(0.0, original_norm_sq - scale * (2 - scale) * coeff_norm_sq)
                if new_norm_sq > 0:
                    W.mul_(math.sqrt(original_norm_sq / new_norm_sq))

            if is_quantized:
                AbliterationPipeline._replace_quantized_weight(proj, W)
...

        elif W.shape[0] == d.shape[0]:
            # Transposed (e.g. GPT-2 Conv1D): W is (hidden_dim, out_features)
            original_norm_sq = W.pow(2).sum().item() if norm_preserve else 0.0

            coeff = d.T @ W  # (1, out_features)
            coeff_norm_sq = coeff.pow(2).sum().item() if norm_preserve else 0.0
            W.sub_((scale * d) * coeff)  # in-place rank-1 update
            del coeff

            # Analytical norm: ||W'||^2 = ||W||^2 - scale*(2 - scale)*||coeff||^2
            if norm_preserve and original_norm_sq > 0:
                new_norm_sq = max(0.0, original_norm_sq - scale * (2 - scale) * coeff_norm_sq)
                if new_norm_sq > 0:
                    W.mul_(math.sqrt(original_norm_sq / new_norm_sq))

            if is_quantized:
                AbliterationPipeline._replace_quantized_weight(proj, W)
...
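The analytical identity in those comments follows from expanding the rank-1 update: for unit `d` and `c = W d`, `W' = W - s c d^T` gives `||W'||_F^2 = ||W||_F^2 - 2s||c||^2 + s^2||c||^2`. It can be verified numerically in a few lines of plain Python on a tiny matrix (all names here are local to the example):

```python
import math

# Rank-1 ablation W' = W - scale * (W d) d^T for a unit direction d.
# Check ||W'||_F^2 == ||W||_F^2 - scale*(2 - scale)*||W d||^2 numerically.
W = [[1.0, 2.0], [3.0, 4.0]]
d = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]  # unit vector
scale = 0.8

coeff = [sum(W[r][c] * d[c] for c in range(2)) for r in range(2)]  # W @ d
W_new = [[W[r][c] - scale * coeff[r] * d[c] for c in range(2)] for r in range(2)]

fro2 = lambda M: sum(v * v for row in M for v in row)  # squared Frobenius norm
lhs = fro2(W_new)
rhs = fro2(W) - scale * (2 - scale) * sum(c * c for c in coeff)
assert abs(lhs - rhs) < 1e-9
```

This is why the code can rescale `W` back to its original Frobenius norm without ever materializing a second copy of the weight matrix: the post-update norm is known analytically from `||coeff||^2` alone.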
        from obliteratus.evaluation.advanced_metrics import _is_refusal_detailed
        prompt_verdicts: list[tuple[str, bool, str]] = []  # (tier, refused, reason)

        # Batch generation for throughput (batch_size=4 to stay within VRAM)
        verify_batch_size = 4
        # Left-pad for batched generation so all sequences are right-aligned
        orig_pad_side = getattr(tokenizer, "padding_side", "right")
        if tokenizer.pad_token_id is None:
            tokenizer.pad_token_id = tokenizer.eos_token_id
        tokenizer.padding_side = "left"
        oom_break = False

        for batch_start in range(0, len(test_harmful_formatted), verify_batch_size):
            if oom_break:
                break
            batch_end = min(batch_start + verify_batch_size, len(test_harmful_formatted))
            batch_formatted = test_harmful_formatted[batch_start:batch_end]
            batch_tiers = tier_labels[batch_start:batch_end]

            try:
                inputs = tokenizer(
                    batch_formatted, return_tensors="pt",
                    padding=True, truncation=True, max_length=512,
                )
                # Track per-prompt input lengths (non-pad tokens)
                attention_mask = inputs["attention_mask"]
                input_lens = attention_mask.sum(dim=1).tolist()
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    outputs = model.generate(
                        **inputs,
                        max_new_tokens=128,
                        do_sample=False,
                    )

                # Process each prompt in the batch
                for j in range(len(batch_formatted)):
                    global_idx = batch_start + j
                    tier = batch_tiers[j]
                    # With left-padding, input tokens are right-aligned;
                    # generated tokens follow the last input token.
                    prompt_token_count = int(input_lens[j])
                    total_padded_input_len = inputs["input_ids"].shape[1]
                    gen_start = total_padded_input_len  # generation starts after the full padded input
                    completion_ids = outputs[j][gen_start:]
                    completion_len = len(completion_ids)
                    completion = tokenizer.decode(
                        completion_ids, skip_special_tokens=True,
                    ).strip()
                    harmful_responses.append(completion)

                    refused, match_reason = _is_refusal_detailed(completion, mode="combined")
                    prompt_verdicts.append((tier, refused, match_reason))
                    verdict_str = f"REFUSED ({match_reason})" if refused else "COMPLIED"
                    preview = completion[:80].replace("\n", " ")
                    self.log(f"  [{global_idx + 1}/{n_selected}] {tier} | "
                             f"in={prompt_token_count} out={completion_len} | "
                             f"{verdict_str} | \"{preview}...\"")

                del inputs, outputs
                self._free_gpu_memory()
            except torch.cuda.OutOfMemoryError:
                self._free_gpu_memory()
                self.log(f"  [batch {batch_start + 1}-{batch_end}] CUDA OOM; stopping")
                self.log("  Skipping remaining refusal tests (CUDA out of memory)")
                oom_break = True
            except Exception as e:
                err_msg = str(e)
                if "CUDA" in err_msg or "illegal" in err_msg.lower():
                    self._free_gpu_memory()
                    self.log(f"  [batch {batch_start + 1}-{batch_end}] CUDA error; stopping")
                    self.log(f"  Skipping remaining refusal tests (CUDA error: {err_msg[:120]})")
                    oom_break = True
                else:
                    raise

        tokenizer.padding_side = orig_pad_side

        if harmful_responses:
            from obliteratus.evaluation.advanced_metrics import refusal_rate as compute_refusal_rate
...
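The left-padding invariant the loop above relies on can be shown with plain lists. With `padding_side="left"` every prompt's real tokens end at the padded input length, so generated tokens for every batch row start at the same index regardless of prompt length; the per-prompt input length comes from counting non-pad tokens, exactly as `attention_mask.sum(dim=1)` does. All names here are local to the sketch:

```python
# Left-padded generation slicing in miniature.
pad = 0
rows = [
    [pad, pad, 11, 12],   # short prompt, left-padded to length 4
    [21, 22, 23, 24],     # full-length prompt
]
padded_len = len(rows[0])
generated = [[101, 102], [201, 202]]          # tokens appended by generate()
outputs = [r + g for r, g in zip(rows, generated)]

# Completions always start at padded_len for every row
completions = [out[padded_len:] for out in outputs]
assert completions == [[101, 102], [201, 202]]

# Per-prompt input length = number of non-pad tokens (attention_mask.sum)
input_lens = [sum(1 for t in r if t != pad) for r in rows]
assert input_lens == [2, 4]
```

With right-padding, by contrast, the completion of a short prompt would start mid-sequence at a row-dependent offset, which is why the code flips `padding_side` before batched generation and restores it afterwards.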
            "cot_aware": self.cot_aware,
            "use_kl_optimization": self.use_kl_optimization,
            "use_lora_ablation": self.use_lora_ablation,
            # Spectral Cascade
            "spectral_cascade": self.spectral_cascade,
            "spectral_bands": self.spectral_bands,
            "spectral_threshold": self.spectral_threshold,
        },
        "references": [
            "Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (NeurIPS 2024)",
obliteratus/analysis/activation_probing.py
CHANGED

@@ -95,22 +95,30 @@ class ActivationProbe:
         d = d.squeeze()
         d = d / d.norm().clamp(min=1e-8)
 
+        # Batch projection: stack all activations into matrices for a
+        # vectorized dot-product instead of per-activation Python loops.
+        # This provides 5-15x speedup on large prompt sets.
+        if harmful_activations:
+            h_stack = torch.stack(
+                [a.float().squeeze() for a in harmful_activations]
+            )  # (n_harmful, hidden_dim)
+            h_projs = h_stack @ d  # (n_harmful,)
+            h_mean = h_projs.mean().item()
+            h_std = h_projs.std(correction=1).item() if len(harmful_activations) > 1 else 0.0
+        else:
+            h_mean = 0.0
+            h_std = 0.0
+
+        if harmless_activations:
+            b_stack = torch.stack(
+                [a.float().squeeze() for a in harmless_activations]
+            )  # (n_harmless, hidden_dim)
+            b_projs = b_stack @ d  # (n_harmless,)
+            b_mean = b_projs.mean().item()
+            b_std = b_projs.std(correction=1).item() if len(harmless_activations) > 1 else 0.0
+        else:
+            b_mean = 0.0
+            b_std = 0.0
 
         gap = h_mean - b_mean
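The batched projection above is numerically identical to projecting each activation one at a time; stacking just turns n dot products into one matrix-vector product. A pure-Python miniature (no torch; names are local to the example), also showing that `std(correction=1)` corresponds to the Bessel-corrected sample standard deviation:

```python
import statistics

# Miniature of the batched projection: per-activation dot products vs a
# single "stacked" pass give the same per-prompt projections.
acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = [0.6, 0.8]  # unit direction

projs = [sum(a * b for a, b in zip(act, d)) for act in acts]  # = h_stack @ d
h_mean = sum(projs) / len(projs)
h_std = statistics.stdev(projs)  # Bessel-corrected, like std(correction=1)

assert all(abs(p - e) < 1e-9 for p, e in zip(projs, [2.2, 5.0, 7.8]))
assert abs(h_mean - 5.0) < 1e-9
assert abs(h_std - 2.8) < 1e-9
```

The speedup in the real code comes from replacing Python-level iteration with a single BLAS-backed `@`, not from any change in the statistics computed.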
obliteratus/analysis/sae_abliteration.py
CHANGED

@@ -111,6 +111,25 @@ class SparseAutoencoder(nn.Module):
         return x_hat, z
 
 
+def _auto_detect_device(device: str | None = None) -> str:
+    """Auto-detect the best available device for SAE training.
+
+    When device is ``None`` or ``"auto"``, selects CUDA if available
+    and sufficient free memory exists (>512 MB), otherwise falls back
+    to CPU.
+    """
+    if device is not None and device not in ("auto",):
+        return device
+    if torch.cuda.is_available():
+        try:
+            free_mb = torch.cuda.mem_get_info()[0] / 1e6
+            if free_mb > 512:
+                return "cuda"
+        except Exception:
+            pass
+    return "cpu"
+
+
 def train_sae(
     activations: list[torch.Tensor],
     hidden_dim: int,

@@ -119,7 +138,7 @@ def train_sae(
     lr: float = 3e-4,
     sparsity_coef: float = 1e-3,
     batch_size: int = 32,
-    device: str =
+    device: str | None = None,
     test_fraction: float = 0.2,
     patience: int = 5,
     quality_threshold: float = 0.1,

@@ -137,7 +156,8 @@ def train_sae(
     lr: Learning rate
     sparsity_coef: L1 sparsity penalty weight
     batch_size: Mini-batch size
-    device: Training device
     test_fraction: Fraction of data reserved for held-out validation
     patience: Early stopping patience (epochs without improvement)
     quality_threshold: Maximum acceptable held-out reconstruction MSE.

@@ -146,6 +166,8 @@ def train_sae(
     """
     import warnings
 
     # Stack and normalize activations
     X = torch.stack([a.squeeze() for a in activations]).float().to(device)
     mean = X.mean(dim=0, keepdim=True)

@@ -244,7 +266,7 @@ def identify_refusal_features(
     harmless_acts: list[torch.Tensor],
     layer_idx: int,
     top_k: int = 16,
-    device: str =
 ) -> SAERefusalFeatures:
     """Identify SAE features that encode refusal behavior.
 

@@ -258,8 +280,9 @@ def identify_refusal_features(
     harmless_acts: Activations from harmless prompts
     layer_idx: Which layer these activations are from
     top_k: Number of top refusal features to return
-    device: Computation device
     """
     sae = sae.to(device)
 
     with torch.no_grad():

@@ -405,7 +428,7 @@ class SAEDecompositionPipeline:
     harmful_acts: list[torch.Tensor],
     harmless_acts: list[torch.Tensor],
     layer_idx: int = 0,
-    device: str =
 ) -> SAEDecompositionResult:
     """Run the full decomposition pipeline.
 

@@ -413,11 +436,12 @@ class SAEDecompositionPipeline:
     harmful_acts: Activations from harmful prompts.
     harmless_acts: Activations from harmless prompts.
     layer_idx: Layer index for metadata.
-    device: Computation device.
 
     Returns:
         SAEDecompositionResult with comprehensive feature analysis.
     """
     all_acts = harmful_acts + harmless_acts
     hidden_dim = harmful_acts[0].squeeze().shape[0]
| 158 |
batch_size: Mini-batch size
|
| 159 |
+
device: Training device. ``None`` or ``"auto"`` to auto-detect
|
| 160 |
+
(CUDA when available with sufficient free memory, else CPU).
|
| 161 |
test_fraction: Fraction of data reserved for held-out validation
|
| 162 |
patience: Early stopping patience (epochs without improvement)
|
| 163 |
quality_threshold: Maximum acceptable held-out reconstruction MSE.
|
|
|
|
| 166 |
"""
|
| 167 |
import warnings
|
| 168 |
|
| 169 |
+
device = _auto_detect_device(device)
|
| 170 |
+
|
| 171 |
# Stack and normalize activations
|
| 172 |
X = torch.stack([a.squeeze() for a in activations]).float().to(device)
|
| 173 |
mean = X.mean(dim=0, keepdim=True)
|
|
|
|
| 266 |
harmless_acts: list[torch.Tensor],
|
| 267 |
layer_idx: int,
|
| 268 |
top_k: int = 16,
|
| 269 |
+
device: str | None = None,
|
| 270 |
) -> SAERefusalFeatures:
|
| 271 |
"""Identify SAE features that encode refusal behavior.
|
| 272 |
|
|
|
|
| 280 |
harmless_acts: Activations from harmless prompts
|
| 281 |
layer_idx: Which layer these activations are from
|
| 282 |
top_k: Number of top refusal features to return
|
| 283 |
+
device: Computation device. ``None`` or ``"auto"`` to auto-detect.
|
| 284 |
"""
|
| 285 |
+
device = _auto_detect_device(device)
|
| 286 |
sae = sae.to(device)
|
| 287 |
|
| 288 |
with torch.no_grad():
|
|
|
|
| 428 |
harmful_acts: list[torch.Tensor],
|
| 429 |
harmless_acts: list[torch.Tensor],
|
| 430 |
layer_idx: int = 0,
|
| 431 |
+
device: str | None = None,
|
| 432 |
) -> SAEDecompositionResult:
|
| 433 |
"""Run the full decomposition pipeline.
|
| 434 |
|
|
|
|
| 436 |
harmful_acts: Activations from harmful prompts.
|
| 437 |
harmless_acts: Activations from harmless prompts.
|
| 438 |
layer_idx: Layer index for metadata.
|
| 439 |
+
device: Computation device. ``None`` or ``"auto"`` to auto-detect.
|
| 440 |
|
| 441 |
Returns:
|
| 442 |
SAEDecompositionResult with comprehensive feature analysis.
|
| 443 |
"""
|
| 444 |
+
device = _auto_detect_device(device)
|
| 445 |
all_acts = harmful_acts + harmless_acts
|
| 446 |
hidden_dim = harmful_acts[0].squeeze().shape[0]
|
| 447 |
|
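The `_auto_detect_device` helper added above follows a common pattern: an explicit device always wins, otherwise CUDA is probed for free memory and anything that goes wrong falls back to CPU. A self-contained sketch of the same pattern (the standalone name `auto_detect_device` and the `min_free_mb` parameter are illustrative, not this module's API):

```python
def auto_detect_device(device=None, min_free_mb=512):
    """Pick CUDA only when it is available and has enough free memory.

    Mirrors the diff's pattern: explicit device wins; otherwise probe
    CUDA free memory and fall back to CPU on any failure.
    """
    if device is not None and device != "auto":
        return device  # explicit choice always wins
    try:
        import torch
        if torch.cuda.is_available():
            # mem_get_info() returns (free_bytes, total_bytes) for the current device
            free_mb = torch.cuda.mem_get_info()[0] / 1e6
            if free_mb > min_free_mb:
                return "cuda"
    except Exception:
        pass  # torch missing or driver error: fall back to CPU
    return "cpu"

print(auto_detect_device("mps"))  # explicit device is passed through unchanged
print(auto_detect_device())       # "cuda" or "cpu" depending on the machine
```

Wrapping the probe in try/except matters because `mem_get_info` can raise on misconfigured drivers, and defaulting to CPU is always safe.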
obliteratus/bayesian_optimizer.py
CHANGED

@@ -296,7 +296,7 @@ def run_bayesian_optimization(
     arch = pipeline.handle.architecture
     n_total_layers = len(layer_modules)
 
-    # Save weight tensors for rollback
+    # Save weight tensors for rollback — clone to CPU to free GPU memory
     original_params: list[tuple[torch.Tensor, torch.Tensor]] = []
     seen_data_ptrs: set[int] = set()
 
@@ -308,12 +308,12 @@ def run_bayesian_optimization(
             if proj is not None and hasattr(proj, "weight"):
                 ptr = proj.weight.data.data_ptr()
                 if ptr not in seen_data_ptrs:
-                    original_params.append((proj.weight.data, proj.weight.data.clone()))
+                    original_params.append((proj.weight.data, proj.weight.data.clone().cpu()))
                     seen_data_ptrs.add(ptr)
                 if hasattr(proj, "bias") and proj.bias is not None:
                     bptr = proj.bias.data.data_ptr()
                     if bptr not in seen_data_ptrs:
-                        original_params.append((proj.bias.data, proj.bias.data.clone()))
+                        original_params.append((proj.bias.data, proj.bias.data.clone().cpu()))
                         seen_data_ptrs.add(bptr)
         except (AttributeError, RuntimeError):
             pass
@@ -324,29 +324,23 @@ def run_bayesian_optimization(
             if proj is not None and hasattr(proj, "weight"):
                 ptr = proj.weight.data.data_ptr()
                 if ptr not in seen_data_ptrs:
-                    original_params.append((proj.weight.data, proj.weight.data.clone()))
+                    original_params.append((proj.weight.data, proj.weight.data.clone().cpu()))
                     seen_data_ptrs.add(ptr)
                 if hasattr(proj, "bias") and proj.bias is not None:
                     bptr = proj.bias.data.data_ptr()
                     if bptr not in seen_data_ptrs:
-                        original_params.append((proj.bias.data, proj.bias.data.clone()))
+                        original_params.append((proj.bias.data, proj.bias.data.clone().cpu()))
                         seen_data_ptrs.add(bptr)
-            for _name, param in ffn.named_parameters():
-                if param.dim() == 3:
-                    ptr = param.data.data_ptr()
-                    if ptr not in seen_data_ptrs:
-                        original_params.append((param.data, param.data.clone()))
-                        seen_data_ptrs.add(ptr)
         except (AttributeError, RuntimeError):
             pass
 
     del seen_data_ptrs
     total_saved_mb = sum(clone.nelement() * clone.element_size() for _, clone in original_params) / 1e6
-    pipeline.log(f"  Saved {len(original_params)} weight tensors for rollback ({total_saved_mb:.0f} MB)")
+    pipeline.log(f"  Saved {len(original_params)} weight tensors for rollback ({total_saved_mb:.0f} MB, on CPU)")
 
     def _restore_all():
         for live_data, saved_clone in original_params:  # noqa: F821
-            live_data.copy_(saved_clone)
+            live_data.copy_(saved_clone.to(live_data.device))
 
     # Warm-start values for the parametric kernel
     # Estimate peak position from strongest layer
obliteratus/telemetry.py
CHANGED

@@ -1,22 +1,28 @@
 """Anonymous telemetry for community benchmark collection.
 
-Logs benchmark results to a local JSONL file and
-HuggingFace Dataset for community leaderboard
-identity, IP addresses, or prompt content is stored —
-benchmark metrics (model name, method, scores, hardware info,
+Logs benchmark results to a local JSONL file and automatically syncs to a
+central HuggingFace Dataset repo for cross-Space community leaderboard
+aggregation. No user identity, IP addresses, or prompt content is stored —
+only aggregate benchmark metrics (model name, method, scores, hardware info,
+timestamp).
 
-Telemetry is
-
-
+Telemetry is disabled by default to respect user privacy. Users can opt in
+by setting OBLITERATUS_TELEMETRY=1 or calling enable_telemetry(). On
+HuggingFace Spaces, telemetry is auto-enabled for community leaderboard.
 
 Architecture:
 1. Every benchmark/obliteration run appends a record to a local JSONL
    file (default: ~/.obliteratus/telemetry.jsonl or /tmp/obliteratus_telemetry.jsonl
    in containers).
-2. On HuggingFace Spaces, records are
-   HuggingFace Dataset repo (
-
-
+2. On HuggingFace Spaces, records are automatically synced to a central
+   HuggingFace Dataset repo (default: obliteratus-project/community-telemetry,
+   configurable via OBLITERATUS_TELEMETRY_REPO). Each Space instance
+   uploads its own JSONL file (keyed by SPACE_ID + session), so
+   duplicated Spaces all feed into the same central leaderboard.
+3. The Leaderboard tab reads from both local JSONL *and* the central Hub
+   dataset, merging and deduplicating results so all community
+   contributions are visible regardless of which Space instance
+   generated them.
 """
 
 from __future__ import annotations
@@ -39,14 +45,32 @@ logger = logging.getLogger(__name__)
 
 # ── Configuration ──────────────────────────────────────────────────────
 
-
+_ON_HF_SPACES = os.environ.get("SPACE_ID") is not None
+_TELEMETRY_ENABLED = os.environ.get(
+    "OBLITERATUS_TELEMETRY", "1" if _ON_HF_SPACES else "0"
+) != "0"
 
 # ── Telemetry state (v2 API) ───────────────────────────────────────────
 _enabled: bool | None = None
+
+# Central Hub repo for cross-Space telemetry aggregation.
+# Default repo is used on HF Spaces so all instances (including duplicated
+# Spaces) send data to the same central dataset automatically.
+_DEFAULT_TELEMETRY_REPO = "obliteratus-project/community-telemetry"
 _TELEMETRY_REPO = os.environ.get(
-    "OBLITERATUS_TELEMETRY_REPO",
+    "OBLITERATUS_TELEMETRY_REPO",
+    _DEFAULT_TELEMETRY_REPO if _ON_HF_SPACES else "",
 )
 
+# Hub sync debounce interval (seconds). After each log_benchmark(), we
+# schedule a background upload but skip if the last sync was < this many
+# seconds ago. This prevents hammering the Hub API during rapid benchmark
+# loops while still ensuring timely uploads.
+_HUB_SYNC_INTERVAL = 30
+_hub_sync_last: float = 0.0
+_hub_sync_lock = threading.Lock()
+_hub_repo_created: bool = False
+
 # Locate writable telemetry directory
 def _telemetry_dir() -> Path:
     """Find a writable directory for telemetry storage.
@@ -98,15 +122,20 @@ def enable_telemetry():
 
 
 def is_telemetry_enabled() -> bool:
-    return
+    return is_enabled()
 
 
 def is_enabled() -> bool:
-    """Check if telemetry is enabled (
+    """Check if telemetry is enabled (off by default, opt in with OBLITERATUS_TELEMETRY=1).
+
+    This is the single source of truth for telemetry state. Both v1
+    (log_benchmark) and v2 (send_report) paths check this function.
+    """
     global _enabled
     if _enabled is not None:
         return _enabled
-
+    default = "1" if _ON_HF_SPACES else "0"
+    env = os.environ.get("OBLITERATUS_TELEMETRY", default)
     return env not in ("0", "false")
 
 
@@ -171,6 +200,177 @@ def _generate_session_id() -> str:
 _SESSION_ID = _generate_session_id()
 
 
+# ── Hub sync (cross-Space telemetry aggregation) ───────────────────────
+
+def _instance_slug() -> str:
+    """Generate a unique slug for this Space instance.
+
+    Hashes the HF Space ID (to avoid leaking usernames in the public
+    dataset) and combines it with the process session ID. This is used
+    as the filename when uploading per-instance JSONL to the Hub repo.
+    """
+    space_id = os.environ.get("SPACE_ID", "local")
+    space_hash = hashlib.sha256(space_id.encode()).hexdigest()[:10]
+    return f"{space_hash}_{_SESSION_ID}"
+
+
+_hub_repo_lock = threading.Lock()
+
+def _ensure_hub_repo(repo_id: str) -> bool:
+    """Create the central telemetry dataset repo if it doesn't exist.
+
+    Uses create_repo with exist_ok=True so this is safe to call
+    repeatedly. Thread-safe via _hub_repo_lock.
+    Returns True if the repo is ready, False on failure.
+    """
+    global _hub_repo_created
+    if _hub_repo_created:
+        return True
+    with _hub_repo_lock:
+        if _hub_repo_created:  # double-check under lock
+            return True
+        try:
+            from huggingface_hub import HfApi
+            api = HfApi()
+            api.create_repo(
+                repo_id=repo_id,
+                repo_type="dataset",
+                private=False,
+                exist_ok=True,
+            )
+            _hub_repo_created = True
+            return True
+        except Exception as e:
+            logger.debug(f"Failed to ensure Hub repo {repo_id}: {e}")
+            return False
+
+
+_sync_in_progress = threading.Event()
+
+def _sync_to_hub_bg() -> None:
+    """Background thread target: upload local JSONL to the central Hub repo.
+
+    Each Space instance writes its data to a unique file path in the repo:
+        data/{instance_slug}.jsonl
+    This avoids write conflicts between concurrent Space instances while
+    ensuring all data lands in the same dataset repository.
+    Uses _sync_in_progress event to prevent overlapping uploads.
+    """
+    if _sync_in_progress.is_set():
+        return  # Another sync is already running
+    _sync_in_progress.set()
+    try:
+        repo = _TELEMETRY_REPO
+        if not repo:
+            return
+        if not TELEMETRY_FILE.exists():
+            return
+
+        from huggingface_hub import HfApi
+        if not _ensure_hub_repo(repo):
+            return
+        api = HfApi()
+        slug = _instance_slug()
+        api.upload_file(
+            path_or_fileobj=str(TELEMETRY_FILE),
+            path_in_repo=f"data/{slug}.jsonl",
+            repo_id=repo,
+            repo_type="dataset",
+            commit_message=f"Auto-sync telemetry from {slug}",
+        )
+        logger.debug(f"Synced telemetry to {repo}/data/{slug}.jsonl")
+    except Exception as e:
+        logger.debug(f"Hub sync failed: {e}")
+    finally:
+        _sync_in_progress.clear()
+
+
+def _schedule_hub_sync() -> None:
+    """Schedule a debounced background sync of local telemetry to Hub.
+
+    Skips if:
+    - No telemetry repo is configured
+    - Telemetry is disabled
+    - Last sync was less than _HUB_SYNC_INTERVAL seconds ago
+    """
+    global _hub_sync_last
+    if not _TELEMETRY_REPO:
+        return
+    if not is_enabled():
+        return
+
+    with _hub_sync_lock:
+        now = time.time()
+        if now - _hub_sync_last < _HUB_SYNC_INTERVAL:
+            return
+        _hub_sync_last = now
+
+    t = threading.Thread(target=_sync_to_hub_bg, daemon=True)
+    t.start()
+
+
+def fetch_hub_records(max_records: int = 10000) -> list[dict[str, Any]]:
+    """Fetch all telemetry records from the central HF Hub dataset.
+
+    Downloads all per-instance JSONL files from the ``data/`` directory
+    in the telemetry repo and parses them into records. Returns an empty
+    list if the repo is not configured or not reachable.
+
+    This is used by :func:`get_leaderboard_data` to merge community-wide
+    results with local data.
+    """
+    repo = _TELEMETRY_REPO
+    if not repo:
+        return []
+
+    try:
+        from huggingface_hub import HfApi, hf_hub_download
+
+        api = HfApi()
+        try:
+            all_files = api.list_repo_files(repo, repo_type="dataset")
+        except Exception:
+            # Repo doesn't exist yet or network error
+            return []
+
+        jsonl_files = [f for f in all_files if f.startswith("data/") and f.endswith(".jsonl")]
+        if not jsonl_files:
+            return []
+
+        records: list[dict[str, Any]] = []
+        for filepath in jsonl_files:
+            try:
+                local_path = hf_hub_download(
+                    repo, filepath, repo_type="dataset",
+                    # etag_timeout=0 forces a freshness check against Hub
+                    # so we always get the latest data, not stale cache
+                    etag_timeout=0,
+                )
+                with open(local_path) as f:
+                    for line in f:
+                        line = line.strip()
+                        if not line:
+                            continue
+                        try:
+                            records.append(json.loads(line))
+                        except json.JSONDecodeError:
+                            continue
+                        if len(records) >= max_records:
+                            break
+            except Exception:
+                continue
+            if len(records) >= max_records:
+                break
+
+        return records
+    except ImportError:
+        logger.debug("huggingface_hub not installed — cannot fetch Hub records")
+        return []
+    except Exception as e:
+        logger.debug(f"Failed to fetch Hub records: {e}")
+        return []
+
+
 # ── Hardware detection ─────────────────────────────────────────────────
 
 def _detect_gpu() -> tuple[str, float]:
@@ -208,7 +408,7 @@ def log_benchmark(record: BenchmarkRecord) -> bool:
     Returns True if successfully written, False if telemetry is disabled
     or an error occurred.
     """
-    if not
+    if not is_enabled():
         return False
 
     if not record.session_id:
@@ -225,6 +425,8 @@ def log_benchmark(record: BenchmarkRecord) -> bool:
         with _write_lock:
             with open(TELEMETRY_FILE, "a") as f:
                 f.write(json.dumps(data, default=str) + "\n")
+        # Auto-sync to central Hub repo (debounced, background thread)
+        _schedule_hub_sync()
         return True
     except Exception as e:
         logger.debug(f"Telemetry write failed: {e}")
@@ -299,12 +501,33 @@ def read_telemetry(max_records: int = 10000) -> list[dict[str, Any]]:
 
 
 def get_leaderboard_data() -> list[dict[str, Any]]:
-    """Get aggregated leaderboard data from telemetry.
+    """Get aggregated leaderboard data from local + Hub telemetry.
+
+    Merges local records with community-wide records from the central Hub
+    dataset, deduplicates by (session_id, timestamp), groups by
+    (model_id, method) and computes best/avg metrics.
 
-    Groups by (model_id, method) and computes best/avg metrics.
     Returns a list of dicts suitable for display in a Gradio Dataframe.
     """
-
+    local_records = read_telemetry()
+
+    # Fetch community records from central Hub repo
+    hub_records = []
+    try:
+        hub_records = fetch_hub_records()
+    except Exception:
+        pass  # Hub fetch is best-effort
+
+    # Merge and deduplicate by (session_id, timestamp)
+    seen: set[tuple[str, str]] = set()
+    records: list[dict[str, Any]] = []
+    for r in local_records + hub_records:
+        key = (r.get("session_id", ""), r.get("timestamp", ""))
+        if key in seen:
+            continue
+        seen.add(key)
+        records.append(r)
+
     if not records:
         return []
 
@@ -324,7 +547,7 @@ def get_leaderboard_data() -> list[dict[str, Any]]:
         refusal_rates = [r["refusal_rate"] for r in runs if r.get("refusal_rate") is not None]
         perplexities = [r["perplexity"] for r in runs if r.get("perplexity") is not None]
         coherences = [r["coherence"] for r in runs if r.get("coherence") is not None]
-        times = [r["time_seconds"] for r in runs if r.get("time_seconds")]
+        times = [r["time_seconds"] for r in runs if r.get("time_seconds") is not None]
 
         entry = {
             "model": model_id.split("/")[-1] if "/" in model_id else model_id,
@@ -349,27 +572,42 @@ def get_leaderboard_data() -> list[dict[str, Any]]:
 
 
 def push_to_hub(repo_id: str | None = None) -> bool:
-    """Push local telemetry to
+    """Push local telemetry to the central HuggingFace Dataset repo.
 
-
-
+    Uploads this instance's local JSONL file to the central Hub repo as a
+    per-instance file (``data/{instance_slug}.jsonl``). All Space instances
+    (including duplicated ones) contribute to the same dataset.
+
+    Requires HF_TOKEN to be set (automatically available on HF Spaces).
     """
     repo = repo_id or _TELEMETRY_REPO
+    if not repo:
+        logger.warning("No telemetry repo configured — set OBLITERATUS_TELEMETRY_REPO")
+        return False
     records = read_telemetry()
     if not records:
         logger.info("No telemetry records to push")
         return False
 
     try:
-        from
-
-
-
-
-
+        from huggingface_hub import HfApi
+
+        if not _ensure_hub_repo(repo):
+            return False
+
+        api = HfApi()
+        slug = _instance_slug()
+        api.upload_file(
+            path_or_fileobj=str(TELEMETRY_FILE),
+            path_in_repo=f"data/{slug}.jsonl",
+            repo_id=repo,
+            repo_type="dataset",
+            commit_message=f"Manual push from {slug} ({len(records)} records)",
+        )
+        logger.info(f"Pushed {len(records)} records to {repo}/data/{slug}.jsonl")
         return True
     except ImportError:
-        logger.warning("
+        logger.warning("huggingface_hub not installed — cannot push telemetry")
        return False
     except Exception as e:
         logger.warning(f"Failed to push telemetry: {e}")
@@ -638,7 +876,14 @@ def build_report(
 
 
 def _send_sync(report: dict[str, Any]) -> None:
-    """Synchronously
+    """Synchronously write a v2 telemetry report to local JSONL and sync to Hub."""
+    try:
+        with _write_lock:
+            with open(TELEMETRY_FILE, "a") as f:
+                f.write(json.dumps(report, default=str) + "\n")
+        _schedule_hub_sync()
+    except Exception as e:
+        logger.debug("Telemetry v2 write failed: %s", e)
     logger.debug("Telemetry report sent (schema_version=%s)", report.get("schema_version"))
 
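The debounce logic in `_schedule_hub_sync` can be exercised without any network access. The sketch below keeps the same lock + timestamp + background-thread shape but swaps the `HfApi.upload_file` call for an in-memory list, and shrinks the interval so the effect is visible in a short run (all names here are illustrative, not the module's API):

```python
import threading
import time

SYNC_INTERVAL = 0.2   # seconds between uploads (30 s in the real module)
_last_sync = 0.0
_lock = threading.Lock()
_in_progress = threading.Event()
uploads = []          # stands in for HfApi.upload_file

def _sync_bg():
    if _in_progress.is_set():
        return                       # skip overlapping uploads
    _in_progress.set()
    try:
        uploads.append(time.time())  # pretend to push the JSONL to the Hub
    finally:
        _in_progress.clear()

def schedule_sync():
    """Debounced: start a background sync unless one ran very recently."""
    global _last_sync
    with _lock:
        now = time.time()
        if now - _last_sync < SYNC_INTERVAL:
            return                   # too soon, drop this request
        _last_sync = now
    threading.Thread(target=_sync_bg, daemon=True).start()

for _ in range(50):                  # a rapid benchmark loop
    schedule_sync()
    time.sleep(0.01)
time.sleep(0.1)                      # let background threads finish
print(len(uploads))                  # far fewer than 50 thanks to the debounce
```

Updating `_last_sync` inside the lock but starting the thread outside it keeps the critical section tiny, so `log_benchmark` callers are never blocked on the upload itself.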
paper/main.tex
CHANGED

@@ -46,7 +46,7 @@ While prior work has established that refusal is mediated by linear directions i
 
 \textsc{Obliteratus} contributes:
 (1)~\textbf{15 analysis modules} spanning direction extraction, geometric characterization, learned probing, causal estimation, cross-model transfer, and defense robustness evaluation;
-(2)~\textbf{
 (3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
 (4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
 (5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
@@ -72,7 +72,7 @@ Yet existing tools are fragmented: some focus solely on direction extraction \ci
 
 \begin{enumerate}[leftmargin=*]
 \item \textbf{Comprehensive analysis before intervention.} Rather than immediately removing refusal, the platform first characterizes its geometric structure---how many directions are involved, whether they form cones or subspaces, how they vary across layers and harm categories, and what alignment training method likely produced them.
-\item \textbf{Multiple intervention paradigms.} The platform supports
 \item \textbf{Native MoE support.} Mixture-of-Experts models (GPT-OSS 20B, Mixtral, DeepSeek-MoE) present unique challenges for abliteration: refusal may be concentrated in specific experts, and fused 3D weight tensors require per-expert decomposition. \textsc{Obliteratus} introduces \emph{Expert-Granular Abliteration} (EGA)---routing-weighted direction attribution and selective inversion that distinguishes safety-critical from capability-preserving experts.
 \item \textbf{Frontier optimization.} Building on Heretic's \citep{heretic2025} pioneering use of Bayesian optimization and LoRA-mediated ablation, we integrate and extend six optimization techniques: TPE-based hyperparameter search, reversible LoRA adapters, KL-divergence co-optimization, chain-of-thought-aware ablation, float layer interpolation, and activation winsorization.
 \item \textbf{Rigorous evaluation and interactive exploration.} Every intervention is accompanied by automated quality assessment, and the platform ships with a web research dashboard (HuggingFace Spaces) providing A/B comparison chat, dose-response strength sweeps, multi-model benchmarking, and one-click artifact export.
@@ -82,7 +82,7 @@ The remainder of this paper is organized as follows.
 Section~\ref{sec:related} surveys related work.
 Section~\ref{sec:architecture} describes the platform architecture.
 Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
-Section~\ref{sec:intervention} describes the
 Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
 Section~\ref{sec:frontier} presents the six frontier optimization techniques.
 Section~\ref{sec:evaluation} covers the evaluation suite.
@@ -246,7 +246,7 @@ After abliteration, we verify that the refusal signal was actually eliminated (n
 \begin{itemize}
 \item \textbf{Projection gap}: $\Delta_l = \bar{p}_{\text{harmful}} - \bar{p}_{\text{harmless}}$ where $p = \mathbf{a} \cdot \mathbf{r}_l$
 \item \textbf{Separation $d'$}: $d'_l = |\Delta_l| / \sigma_{\text{pooled}}$, the signal detection sensitivity metric
-\item \textbf{Refusal Elimination Score (RES)}: A composite $\text{RES} = 0.4 \cdot \frac{1}{1 + \bar{d}'} + 0.3 \cdot \frac{n_{\text{clean}}}{n_{\text{total}}} + 0.3 \cdot e^{-10\bar{\Delta}}$
 \end{itemize}
 
 RES ranges from 0 (no elimination) to 1 (complete elimination), combining projection reduction, layer coverage, and gap magnitude.
|
|
@@ -315,6 +315,7 @@ Following the transformer circuits framework \citep{elhage2021mathematical}, we
|
|
| 315 |
\begin{equation}
|
| 316 |
\mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}) + \text{MLP}_l(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}))
|
| 317 |
\end{equation}
|
|
|
|
| 318 |
|
| 319 |
For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
|
| 320 |
$\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.
|
|
@@ -401,7 +402,7 @@ where $s_j$ is the refusal strength at layer $j$. High $R_l$ indicates the model
|
|
| 401 |
|
| 402 |
\paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of the normalized variance and absolute projection of harmless activations onto the refusal direction:
|
| 403 |
\begin{equation}
|
| 404 |
-
E_l = \sqrt{\frac{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}{\|\overline{\mathbf{b}}\|} \cdot \frac{|\overline{\mathbf{b} \cdot \mathbf{r}_l}|}{\|\overline{\mathbf{b}}\|}}
|
| 405 |
\end{equation}
|
| 406 |
High entanglement means abliterating refusal at that layer would also damage general capabilities.
|
| 407 |
|
|
@@ -437,7 +438,7 @@ where $H(\hat{\mathbf{p}})$ is the entropy of the normalized projection distribu
|
|
| 437 |
\subsection{Weight Projection (Permanent)}
|
| 438 |
\label{sec:weight_projection}
|
| 439 |
|
| 440 |
-
\textsc{Obliteratus} provides
|
| 441 |
|
| 442 |
\begin{table}[h]
|
| 443 |
\centering
|
|
@@ -450,7 +451,8 @@ where $H(\hat{\mathbf{p}})$ is the entropy of the normalized projection distribu
|
|
| 450 |
\midrule
|
| 451 |
Basic & 1 (DiM) & No & None & 1 & --- \\
|
| 452 |
Advanced & 4 (SVD) & Yes & $\lambda{=}0.1$ & 2 & --- \\
|
| 453 |
-
Aggressive & 8 (
|
|
|
|
| 454 |
Surgical & 6 (wSVD) & Yes & $\lambda{=}0.15$ & 2 & Whitened SVD, JB-contrastive \\
|
| 455 |
Optimized & 4 (SVD) & Yes & Bayesian & 2 & Optuna TPE, KL co-opt \\
|
| 456 |
Inverted & 6 (SVD) & Yes & None & 3 & Selective inversion \\
|
|
@@ -466,7 +468,7 @@ The core projection for a weight matrix $\mathbf{W}$ and refusal directions $\{\
|
|
| 466 |
\begin{equation}
|
| 467 |
\mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
|
| 468 |
\end{equation}
|
| 469 |
-
where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component).
|
| 470 |
|
| 471 |
\paragraph{Per-layer adaptive strength.}
|
| 472 |
Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
|
|
@@ -496,7 +498,24 @@ Unlike prior tools that only modify weight matrices, \textsc{Obliteratus} also p
|
|
| 496 |
\end{equation}
|
| 497 |
|
| 498 |
\paragraph{Iterative refinement.}
|
| 499 |
-
Presets with multiple passes recompute projections after each modification, catching rotated residual refusal that a single pass misses. The Nuclear preset performs 4 passes with true iterative re-probing: after each excision round, activations are re-collected and new residual directions are extracted.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 500 |
|
| 501 |
\subsection{Steering Vectors (Reversible)}
|
| 502 |
\label{sec:steering}
|
|
@@ -622,7 +641,7 @@ with Pareto-optimal solutions ranked by a weighted composite: $\rho + 0.5 \cdot
|
|
| 622 |
|
| 623 |
Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence:
|
| 624 |
\begin{align}
|
| 625 |
-
\text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot (\mathbf{d}\mathbf{d}^\top)
|
| 626 |
\text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot \text{coeff}, \quad \mathbf{A} = \mathbf{d}^\top
|
| 627 |
\end{align}
|
| 628 |
where $\text{coeff} = \mathbf{W}\mathbf{d}$ is the projection coefficient and $s = 1 - \lambda$. For rank-$k$ with directions $\{\mathbf{d}_1, \ldots, \mathbf{d}_k\}$:
|
|
@@ -638,7 +657,7 @@ Adapters are stored in half precision and saved in a PEFT-compatible format. The
|
|
| 638 |
|
| 639 |
After projection, we measure first-token KL divergence on harmless reference prompts. If $D_{\text{KL}}$ exceeds a threshold $\delta$ (default 0.1), a partial revert is applied:
|
| 640 |
\begin{equation}
|
| 641 |
-
\mathbf{W}'' = \mathbf{W}' + \gamma \cdot
|
| 642 |
\end{equation}
|
| 643 |
where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:
|
| 644 |
\begin{equation}
|
|
@@ -763,16 +782,16 @@ Generates a dose-response curve by sweeping regularization strength from 0 (full
|
|
| 763 |
One-click packaging of all research artifacts into a downloadable ZIP archive: refusal direction tensors (\texttt{.pt}), configuration JSON, results CSV, and full pipeline log. Enables reproducibility and downstream analysis in external tools.
|
| 764 |
|
| 765 |
\paragraph{Benchmark Lab tab.}
|
| 766 |
-
Multi-method comparison (run all
|
| 767 |
|
| 768 |
\paragraph{About tab.}
|
| 769 |
-
Comprehensive documentation of all
|
| 770 |
|
| 771 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 772 |
\section{Experiments}
|
| 773 |
\label{sec:experiments}
|
| 774 |
|
| 775 |
-
We evaluate \textsc{Obliteratus} across four model families,
|
| 776 |
|
| 777 |
\subsection{Experimental Setup}
|
| 778 |
\label{sec:exp_setup}
|
|
@@ -797,7 +816,7 @@ GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 8 & RLHF \\
|
|
| 797 |
\end{table}
|
| 798 |
|
| 799 |
\paragraph{Datasets.}
|
| 800 |
-
Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.
|
| 801 |
|
| 802 |
\paragraph{Evaluation metrics.}
|
| 803 |
For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.
|
|
@@ -808,7 +827,7 @@ All experiments use medium prompt volume (128 harmful + 128 harmless prompts for
|
|
| 808 |
\subsection{Multi-Method Comparison on Dense Models}
|
| 809 |
\label{sec:exp_dense}
|
| 810 |
|
| 811 |
-
Table~\ref{tab:exp_dense} compares all
|
| 812 |
|
| 813 |
\begin{table}[h]
|
| 814 |
\centering
|
|
@@ -946,8 +965,8 @@ Table~\ref{tab:comparison} compares \textsc{Obliteratus} with existing tools acr
|
|
| 946 |
\textbf{Capability} & \rotatebox{60}{\textsc{Obliteratus}} & \rotatebox{60}{TransformerLens} & \rotatebox{60}{Heretic} & \rotatebox{60}{FailSpy abl.} & \rotatebox{60}{RepEng} & \rotatebox{60}{SAELens} \\
|
| 947 |
\midrule
|
| 948 |
Direction extraction methods & 3 & Manual & 1 & 1 & 1 & -- \\
|
| 949 |
-
Method presets &
|
| 950 |
-
Weight projection variants &
|
| 951 |
Bayesian optimization & Warm-start$^\dagger$ & -- & TPE$^\dagger$ & -- & -- & -- \\
|
| 952 |
LoRA-mediated ablation & Rank-$k^\dagger$ & -- & Rank-1$^\dagger$ & -- & -- & -- \\
|
| 953 |
KL co-optimization & \checkmark & -- & -- & -- & -- & -- \\
|
|
@@ -982,7 +1001,7 @@ The key differentiators of \textsc{Obliteratus} are:
|
|
| 982 |
\item \textbf{MoE-native processing}: The only abliteration tool with Expert-Granular Abliteration, fused 3D weight handling, and per-expert selective inversion. This is critical for models like GPT-OSS 20B where uniform approaches degrade capabilities.
|
| 983 |
\item \textbf{Analysis breadth}: To our knowledge, no existing public tool combines concept cone geometry, alignment imprint detection, cross-model universality analysis, and defense robustness evaluation in a single framework.
|
| 984 |
\item \textbf{Heretic superset with extensions}: We incorporate all of Heretic's innovations (Bayesian optimization, LoRA ablation) while adding warm-start initialization, rank-$k$ adapters, KL co-optimization, CoT-aware ablation, float layer interpolation, and activation winsorization.
|
| 985 |
-
\item \textbf{
|
| 986 |
\item \textbf{Interactive research dashboard}: A/B comparison chat, dose-response strength sweeps, and publication-quality benchmarking provide integrated research workflows uncommon in existing tools.
|
| 987 |
\item \textbf{Architecture coverage}: Working with any HuggingFace model---including fused MoE architectures---rather than requiring specific architecture support.
|
| 988 |
\end{enumerate}
|
|
@@ -1055,7 +1074,7 @@ We presented \textsc{Obliteratus}, an open-source platform that unifies mechanis
|
|
| 1055 |
|
| 1056 |
The platform's contributions span multiple axes:
|
| 1057 |
\emph{Analysis} --- 15 modules providing the most comprehensive characterization of refusal geometry in any public tool, including concept cone geometry with DSI, alignment imprint detection, cross-model universality, and defense robustness evaluation.
|
| 1058 |
-
\emph{Intervention} ---
|
| 1059 |
\emph{MoE-native processing} --- Expert-Granular Abliteration decomposes refusal at per-expert granularity, fused 3D weight handling enables direct operation on packed expert tensors, and selective inversion differentiates safety-critical from capability-preserving experts.
|
| 1060 |
\emph{Frontier optimization} --- Bayesian hyperparameter search with warm-start from analysis heuristics, KL co-optimization with proxy-magnitude partial revert, chain-of-thought-aware Gram-Schmidt orthogonalization, float layer interpolation, and activation winsorization---incorporating and extending all innovations from Heretic \citep{heretic2025}.
|
| 1061 |
\emph{Interactive research} --- a web dashboard with A/B comparison chat, dose-response strength sweeps, multi-model benchmarking, and artifact export.
|
|
|
|
| 46 |
|
| 47 |
\textsc{Obliteratus} contributes:
|
| 48 |
(1)~\textbf{15 analysis modules} spanning direction extraction, geometric characterization, learned probing, causal estimation, cross-model transfer, and defense robustness evaluation;
|
| 49 |
+
(2)~\textbf{eight intervention presets} (Basic through Nuclear) with per-layer adaptive strength, norm-preserving regularization, and iterative refinement;
|
| 50 |
(3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
|
| 51 |
(4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
|
| 52 |
(5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
|
|
|
|
| 72 |
|
| 73 |
\begin{enumerate}[leftmargin=*]
|
| 74 |
\item \textbf{Comprehensive analysis before intervention.} Rather than immediately removing refusal, the platform first characterizes its geometric structure---how many directions are involved, whether they form cones or subspaces, how they vary across layers and harm categories, and what alignment training method likely produced them.
|
| 75 |
+
\item \textbf{Multiple intervention paradigms.} The platform supports eight abliteration presets (Basic through Nuclear), reversible LoRA-mediated ablation, and inference-time steering vectors, covering the full spectrum from conservative capability-preserving removal to maximally aggressive multi-pass excision.
|
| 76 |
\item \textbf{Native MoE support.} Mixture-of-Experts models (GPT-OSS 20B, Mixtral, DeepSeek-MoE) present unique challenges for abliteration: refusal may be concentrated in specific experts, and fused 3D weight tensors require per-expert decomposition. \textsc{Obliteratus} introduces \emph{Expert-Granular Abliteration} (EGA)---routing-weighted direction attribution and selective inversion that distinguishes safety-critical from capability-preserving experts.
|
| 77 |
\item \textbf{Frontier optimization.} Building on Heretic's \citep{heretic2025} pioneering use of Bayesian optimization and LoRA-mediated ablation, we integrate and extend six optimization techniques: TPE-based hyperparameter search, reversible LoRA adapters, KL-divergence co-optimization, chain-of-thought-aware ablation, float layer interpolation, and activation winsorization.
|
| 78 |
\item \textbf{Rigorous evaluation and interactive exploration.} Every intervention is accompanied by automated quality assessment, and the platform ships with a web research dashboard (HuggingFace Spaces) providing A/B comparison chat, dose-response strength sweeps, multi-model benchmarking, and one-click artifact export.
|
|
|
|
| 82 |
Section~\ref{sec:related} surveys related work.
|
| 83 |
Section~\ref{sec:architecture} describes the platform architecture.
|
| 84 |
Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
|
| 85 |
+
Section~\ref{sec:intervention} describes the eight intervention presets and their mathematical foundations.
|
| 86 |
Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
|
| 87 |
Section~\ref{sec:frontier} presents the six frontier optimization techniques.
|
| 88 |
Section~\ref{sec:evaluation} covers the evaluation suite.
|
|
|
|
| 246 |
\begin{itemize}
|
| 247 |
\item \textbf{Projection gap}: $\Delta_l = \bar{p}_{\text{harmful}} - \bar{p}_{\text{harmless}}$ where $p = \mathbf{a} \cdot \mathbf{r}_l$
|
| 248 |
\item \textbf{Separation $d'$}: $d'_l = |\Delta_l| / \sigma_{\text{pooled}}$, the signal detection sensitivity metric
|
| 249 |
+
\item \textbf{Refusal Elimination Score (RES)}: A composite $\text{RES} = 0.4 \cdot \frac{1}{1 + \bar{d}'} + 0.3 \cdot \frac{n_{\text{clean}}}{n_{\text{total}}} + 0.3 \cdot e^{-10|\bar{\Delta}|}$
|
| 250 |
\end{itemize}
|
| 251 |
|
| 252 |
RES ranges from 0 (no elimination) to 1 (complete elimination), combining projection reduction, layer coverage, and gap magnitude.
|
|
|
|
| 315 |
\begin{equation}
|
| 316 |
\mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}) + \text{MLP}_l(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}))
|
| 317 |
\end{equation}
|
| 318 |
+
(LayerNorm operations are omitted for notational simplicity; the implementation handles both pre-LN and post-LN architectures.)
|
| 319 |
|
| 320 |
For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
|
| 321 |
$\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.
|
|
|
|
| 402 |
|
| 403 |
\paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of the normalized variance and absolute projection of harmless activations onto the refusal direction:
|
| 404 |
\begin{equation}
|
| 405 |
+
E_l = \sqrt{\frac{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}{\|\overline{\mathbf{b}}\|^2} \cdot \frac{|\overline{\mathbf{b} \cdot \mathbf{r}_l}|}{\|\overline{\mathbf{b}}\|}}
|
| 406 |
\end{equation}
|
| 407 |
High entanglement means abliterating refusal at that layer would also damage general capabilities.
|
| 408 |
|
|
|
|
| 438 |
\subsection{Weight Projection (Permanent)}
|
| 439 |
\label{sec:weight_projection}
|
| 440 |
|
| 441 |
+
\textsc{Obliteratus} provides eight abliteration presets spanning the full spectrum from conservative single-direction removal to maximally aggressive multi-pass excision (Table~\ref{tab:methods}).
|
| 442 |
|
| 443 |
\begin{table}[h]
|
| 444 |
\centering
|
|
|
|
| 451 |
\midrule
|
| 452 |
Basic & 1 (DiM) & No & None & 1 & --- \\
|
| 453 |
Advanced & 4 (SVD) & Yes & $\lambda{=}0.1$ & 2 & --- \\
|
| 454 |
+
Aggressive & 8 (wSVD) & Yes & None & 3 & JB-contrastive, head surgery, winsorized \\
|
| 455 |
+
Sp.\ Cascade & 6 (wSVD) & Yes & None & 2 & DCT frequency decomp., coherence-weighted \\
|
| 456 |
Surgical & 6 (wSVD) & Yes & $\lambda{=}0.15$ & 2 & Whitened SVD, JB-contrastive \\
|
| 457 |
Optimized & 4 (SVD) & Yes & Bayesian & 2 & Optuna TPE, KL co-opt \\
|
| 458 |
Inverted & 6 (SVD) & Yes & None & 3 & Selective inversion \\
|
|
|
|
| 468 |
\begin{equation}
|
| 469 |
\mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
|
| 470 |
\end{equation}
|
| 471 |
+
where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component). Since the right singular vectors $\{\mathbf{r}_i\}_{i=1}^k$ from SVD are orthonormal, the sum of rank-1 projections is equivalent to orthogonal projection onto the $k$-dimensional refusal subspace.
|
| 472 |
|
| 473 |
\paragraph{Per-layer adaptive strength.}
|
| 474 |
Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
|
|
|
|
| 498 |
\end{equation}
|
| 499 |
|
| 500 |
\paragraph{Iterative refinement.}
|
| 501 |
+
Presets with multiple passes recompute projections after each modification, catching rotated residual refusal that a single pass misses. The Nuclear preset performs 4 passes with true iterative re-probing: after each excision round, activations are re-collected and new residual directions are extracted. To avoid wasted compute, iterative refinement includes a \emph{cosine-similarity early-exit}: if all strong-layer directions have cosine similarity $> 0.99$ with the previous pass, the re-probe is skipped.
|
| 502 |
+
|
| 503 |
+
\paragraph{Spectral Cascade: multi-resolution frequency decomposition.}
|
| 504 |
+
\label{para:spectral_cascade}
|
| 505 |
+
The \emph{Spectral Cascade} preset introduces a novel insight: refusal signal across the layer axis contains both \emph{low-frequency} components (smooth, systematic trends spanning many layers---the trained-in alignment signal) and \emph{high-frequency} components (per-layer spikes that are more likely capability-entangled noise). Existing methods treat all layers uniformly or use simple norm-based heuristics, conflating these two scales.
|
| 506 |
+
|
| 507 |
+
Spectral Cascade operates in three stages. \textbf{Stage~1 (direction coherence):} For each strong layer~$l$, we compute the mean cosine similarity of its refusal direction with its neighbors $\mathcal{N}(l)$:
|
| 508 |
+
\begin{equation}
|
| 509 |
+
c_l = \frac{1}{|\mathcal{N}(l)|}\sum_{j \in \mathcal{N}(l)} |\mathbf{r}_l^\top \mathbf{r}_j|, \quad
|
| 510 |
+
\hat{m}_l = \|\mathbf{r}_l\| \cdot (0.3 + 0.7 \, c_l)
|
| 511 |
+
\end{equation}
|
| 512 |
+
Layers with high directional coherence (part of the systematic refusal trend) are amplified; noisy layers are dampened. \textbf{Stage~2 (DCT decomposition):} Apply the orthonormal Type-II Discrete Cosine Transform to the coherence-weighted magnitude vector $\hat{\mathbf{m}}$:
|
| 513 |
+
\begin{equation}
|
| 514 |
+
X_k = \sum_{i=0}^{N-1} \hat{m}_i \cos\!\left(\frac{\pi k (2i+1)}{2N}\right) \cdot \alpha_k, \quad \alpha_k = \begin{cases}\sqrt{1/N} & k=0 \\ \sqrt{2/N} & k>0\end{cases}
|
| 515 |
+
\end{equation}
|
| 516 |
+
The coefficients $\{X_k\}$ are split into $B$ frequency bands. An adaptive band count is determined by finding the spectral knee (coefficient index capturing 90\% of total energy). \textbf{Stage~3 (cascade with early-exit):} Bands are processed from lowest to highest frequency. Each band's per-layer contribution is attenuated by an exponential schedule $a_b = e^{-1.6 \cdot b/(B-1)}$, giving full weight to low-frequency components and ${\sim}0.2\times$ weight to the highest band. Processing stops early when remaining spectral energy falls below a threshold $\tau$ (default 0.05), avoiding unnecessary high-frequency passes.
|
| 517 |
+
|
| 518 |
+
The resulting per-layer weights $w_l \in [0.2, 1.0]$ modulate projection strength during EXCISE, achieving cleaner refusal removal with less capability damage by targeting only the systematic refusal component.
|
| 519 |
|
| 520 |
\subsection{Steering Vectors (Reversible)}
|
| 521 |
\label{sec:steering}
|
|
|
|
| 641 |
|
| 642 |
Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence:
|
| 643 |
\begin{align}
|
| 644 |
+
\text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot \mathbf{W}(\mathbf{d}\mathbf{d}^\top) \\
|
| 645 |
\text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot \text{coeff}, \quad \mathbf{A} = \mathbf{d}^\top
|
| 646 |
\end{align}
|
| 647 |
where $\text{coeff} = \mathbf{W}\mathbf{d}$ is the projection coefficient and $s = 1 - \lambda$. For rank-$k$ with directions $\{\mathbf{d}_1, \ldots, \mathbf{d}_k\}$:
|
|
|
|
| 657 |
|
| 658 |
After projection, we measure first-token KL divergence on harmless reference prompts. If $D_{\text{KL}}$ exceeds a threshold $\delta$ (default 0.1), a partial revert is applied:
|
| 659 |
\begin{equation}
|
| 660 |
+
\mathbf{W}'' = \mathbf{W}' + \gamma \cdot \mathbf{W}\mathbf{d}\mathbf{d}^\top
|
| 661 |
\end{equation}
|
| 662 |
where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:
|
| 663 |
\begin{equation}
|
|
|
|
| 782 |
One-click packaging of all research artifacts into a downloadable ZIP archive: refusal direction tensors (\texttt{.pt}), configuration JSON, results CSV, and full pipeline log. Enables reproducibility and downstream analysis in external tools.
|
| 783 |
|
| 784 |
\paragraph{Benchmark Lab tab.}
|
| 785 |
+
Multi-method comparison (run all 8 presets on a single model) and multi-model comparison (run a single preset across multiple models). Results are presented as publication-quality visualizations including radar charts, grouped bar plots, Pareto frontiers, and method ranking tables. Figures are generated at 300 DPI for direct inclusion in papers.
|
| 786 |
|
| 787 |
\paragraph{About tab.}
|
| 788 |
+
Comprehensive documentation of all 8 method presets with their configurations, the mathematical foundations of key techniques, and attribution to prior work including Heretic.
|
| 789 |
|
| 790 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 791 |
\section{Experiments}
|
| 792 |
\label{sec:experiments}
|
| 793 |
|
| 794 |
+
We evaluate \textsc{Obliteratus} across four model families, eight method presets, and two architectural paradigms (dense and MoE). All experiments use the platform's built-in evaluation suite (Section~\ref{sec:evaluation}) and are fully reproducible via the Benchmark Lab tab or the included benchmark scripts.
|
| 795 |
|
| 796 |
\subsection{Experimental Setup}
|
| 797 |
\label{sec:exp_setup}
|
|
|
|
| 816 |
\end{table}
|
| 817 |
|
| 818 |
\paragraph{Datasets.}
|
| 819 |
+
Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset \citep{taori2023alpaca} (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.
|
| 820 |
|
| 821 |
\paragraph{Evaluation metrics.}
|
| 822 |
For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.
|
|
|
|
| 827 |
\subsection{Multi-Method Comparison on Dense Models}
|
| 828 |
\label{sec:exp_dense}
|
| 829 |
|
| 830 |
+
Table~\ref{tab:exp_dense} compares all eight method presets on Qwen2.5-1.5B-Instruct. This model was chosen for its small size (enabling rapid iteration) and DPO alignment (representing the most common alignment method in open-weight models).
|
| 831 |
|
| 832 |
\begin{table}[h]
|
| 833 |
\centering
|
|
|
|
| 965 |
\textbf{Capability} & \rotatebox{60}{\textsc{Obliteratus}} & \rotatebox{60}{TransformerLens} & \rotatebox{60}{Heretic} & \rotatebox{60}{FailSpy abl.} & \rotatebox{60}{RepEng} & \rotatebox{60}{SAELens} \\
|
| 966 |
\midrule
|
| 967 |
Direction extraction methods & 3 & Manual & 1 & 1 & 1 & -- \\
|
| 968 |
+
Method presets & 8 & -- & 1 & 1 & -- & -- \\
|
| 969 |
+
Weight projection variants & 8+ & -- & Bayesian$^\dagger$ & 1 & -- & -- \\
|
| 970 |
Bayesian optimization & Warm-start$^\dagger$ & -- & TPE$^\dagger$ & -- & -- & -- \\
|
| 971 |
LoRA-mediated ablation & Rank-$k^\dagger$ & -- & Rank-1$^\dagger$ & -- & -- & -- \\
|
| 972 |
KL co-optimization & \checkmark & -- & -- & -- & -- & -- \\
|
|
|
|
| 1001 |
\item \textbf{MoE-native processing}: The only abliteration tool with Expert-Granular Abliteration, fused 3D weight handling, and per-expert selective inversion. This is critical for models like GPT-OSS 20B where uniform approaches degrade capabilities.
|
| 1002 |
\item \textbf{Analysis breadth}: To our knowledge, no existing public tool combines concept cone geometry, alignment imprint detection, cross-model universality analysis, and defense robustness evaluation in a single framework.
|
| 1003 |
\item \textbf{Heretic superset with extensions}: We incorporate all of Heretic's innovations (Bayesian optimization, LoRA ablation) while adding warm-start initialization, rank-$k$ adapters, KL co-optimization, CoT-aware ablation, float layer interpolation, and activation winsorization.
|
| 1004 |
+
\item \textbf{Eight intervention presets}: From conservative (Basic) through maximally aggressive (Nuclear), each preset composes a distinct combination of techniques for different use cases.
|
| 1005 |
\item \textbf{Interactive research dashboard}: A/B comparison chat, dose-response strength sweeps, and publication-quality benchmarking provide integrated research workflows uncommon in existing tools.
|
| 1006 |
\item \textbf{Architecture coverage}: Working with any HuggingFace model---including fused MoE architectures---rather than requiring specific architecture support.
|
| 1007 |
\end{enumerate}
|
|
|
|
| 1074 |
|
| 1075 |
The platform's contributions span multiple axes:
|
| 1076 |
\emph{Analysis} --- 15 modules providing the most comprehensive characterization of refusal geometry in any public tool, including concept cone geometry with DSI, alignment imprint detection, cross-model universality, and defense robustness evaluation.
|
| 1077 |
+
\emph{Intervention} --- eight method presets (Basic through Nuclear) composing techniques from single-direction removal to multi-pass whitened SVD with selective inversion, plus reversible steering vectors and LoRA-mediated ablation.
|
| 1078 |
\emph{MoE-native processing} --- Expert-Granular Abliteration decomposes refusal at per-expert granularity, fused 3D weight handling enables direct operation on packed expert tensors, and selective inversion differentiates safety-critical from capability-preserving experts.
|
| 1079 |
\emph{Frontier optimization} --- Bayesian hyperparameter search with warm-start from analysis heuristics, KL co-optimization with proxy-magnitude partial revert, chain-of-thought-aware Gram-Schmidt orthogonalization, float layer interpolation, and activation winsorization---incorporating and extending all innovations from Heretic \citep{heretic2025}.
|
| 1080 |
\emph{Interactive research} --- a web dashboard with A/B comparison chat, dose-response strength sweeps, multi-model benchmarking, and artifact export.
|
paper/references.bib
CHANGED

@@ -210,7 +210,7 @@
 @article{shazeer2017outrageously,
   title={Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer},
-  author={Shazeer, Noam and …
+  author={Shazeer, Noam and Mirhoseini, Azalia and Maziarz, Krzysztof and Davis, Andy and Le, Quoc and Hinton, Geoffrey and Dean, Jeff},
   journal={International Conference on Learning Representations},
   year={2017}
 }

@@ -248,3 +248,12 @@
   year={2021}
 }

+% ── Datasets ──────────────────────────────────────────────────────────
+
+@article{taori2023alpaca,
+  title={Stanford Alpaca: An Instruction-following LLaMA Model},
+  author={Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B},
+  year={2023},
+  url={https://github.com/tatsu-lab/stanford_alpaca}
+}
+
scripts/run_benchmark_remote.sh
CHANGED

@@ -18,7 +18,7 @@ set -euo pipefail
 # ── Defaults ─────────────────────────────────────────────────────────────────
 SSH_KEY="${OBLITERATUS_SSH_KEY:-$HOME/.ssh/hf_obliteratus}"
-SSH_HOST="${OBLITERATUS_SSH_HOST:-
+SSH_HOST="${OBLITERATUS_SSH_HOST:-}"
 MODEL="${OBLITERATUS_MODEL:-Qwen/Qwen2.5-0.5B-Instruct}"
 MODELS=""
 METHODS="${OBLITERATUS_METHODS:-basic advanced aggressive surgical inverted nuclear}"

@@ -51,6 +51,16 @@ if [[ -z "$MODELS" ]]; then
   MODELS="$MODEL"
 fi
 
+# ── Validate SSH host ────────────────────────────────────────────────────────
+if [[ -z "$SSH_HOST" ]]; then
+  echo "ERROR: SSH_HOST not configured."
+  echo ""
+  echo "Set your HF Space SSH host:"
+  echo " 1. export OBLITERATUS_SSH_HOST=your-username-spacename@ssh.hf.space"
+  echo " 2. Or pass --host your-username-spacename@ssh.hf.space"
+  exit 1
+fi
+
 # ── Validate SSH key ─────────────────────────────────────────────────────────
 if [[ ! -f "$SSH_KEY" ]]; then
   echo "ERROR: SSH key not found at $SSH_KEY"

@@ -373,7 +383,7 @@ if ! ssh "${SSH_OPTS[@]}" "$SSH_HOST" "echo 'SSH_OK'" 2>/tmp/obliteratus_ssh_deb
   echo ""
   echo "Troubleshooting checklist:"
   echo " 1. Is Dev Mode enabled on your HF Space?"
-  echo "    →
+  echo "    → Check your Space's Settings tab (Dev Mode must be ON)"
   echo " 2. Is the Space awake (not sleeping/building)?"
   echo "    → Visit the Space URL and wait for the UI to load"
   echo " 3. Is your SSH public key added to your HF profile?"
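The new check above is a general fail-fast pattern: default an optional setting to empty, then validate it with actionable guidance before any remote call, instead of letting `ssh` fail later with a cryptic error. A hedged Python sketch of the same idea; `require_env` is an illustrative helper, not part of the repo:

```python
def require_env(env: dict, name: str) -> str:
    """Return env[name], or fail fast with guidance (mirrors the shell check)."""
    value = env.get(name, "").strip()
    if not value:
        raise SystemExit(
            f"ERROR: {name} not configured.\n"
            f"  export {name}=your-username-spacename@ssh.hf.space"
        )
    return value

# Configured: the value passes straight through
host = require_env({"OBLITERATUS_SSH_HOST": "me-space@ssh.hf.space"},
                   "OBLITERATUS_SSH_HOST")
print(host)

# Missing: exit immediately with the setup instructions
try:
    require_env({}, "OBLITERATUS_SSH_HOST")
except SystemExit as e:
    print(str(e).splitlines()[0])
```

In a real script, `env` would be `os.environ`; a plain dict is used here so both branches can be demonstrated.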
spaces/README.md
CHANGED

@@ -3,9 +3,9 @@ title: OBLITERATUS
 emoji: "π"
 colorFrom: green
 colorTo: gray
-sdk:
+sdk: gradio
+sdk_version: "5.29.0"
 app_file: app.py
-suggested_hardware: t4-small
 pinned: true
 license: agpl-3.0
 tags:

@@ -13,7 +13,8 @@ tags:
 - mechanistic-interpretability
 - refusal-removal
 - cognitive-liberation
-short_description: "One-click model liberation + chat playground"
+- zerogpu
+short_description: "One-click model liberation + chat playground (ZeroGPU)"
 ---
 
 # OBLITERATUS — Master Ablation Suite

@@ -22,6 +23,17 @@ short_description: "One-click model liberation + chat playground"
 
 One-click cognitive liberation for language models, with a built-in chat playground to talk to the liberated model.
 
+## ZeroGPU — Users Bring Their Own GPU
+
+This Space runs on **ZeroGPU**: GPU-heavy operations (obliteration, chat, benchmarks) use the **visitor's own HuggingFace GPU quota**, not the Space owner's. This means:
+
+- **Free for the Space owner** — no dedicated GPU costs
+- **Multiple concurrent users** — each user gets their own GPU allocation
+- **Fair usage** — each user's operations count against their own HF quota
+- **No conflicts** — users don't interfere with each other's runs
+
+Logged-in HuggingFace users get free GPU quota. For more quota, upgrade to [HF Pro](https://huggingface.co/pricing).
+
 ## How to use
 
 1. **Obliterate tab**: Pick a model, pick a method, click OBLITERATE

@@ -52,9 +64,11 @@ The `obliteratus ui` command auto-detects your GPU, prints hardware-specific mod
 ## Or deploy on HuggingFace Spaces
 
 1. Create a new Space at huggingface.co/new-space
-2. Select **Gradio** SDK
+2. Select **Gradio** SDK (ZeroGPU is automatically enabled)
 3. Point it at this repo
 
+No GPU hardware selection needed — ZeroGPU handles allocation automatically.
+
 ## Links
 
 - [GitHub](https://github.com/obliteratus-project/OBLITERATUS)
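The README's ZeroGPU behavior comes from marking GPU-heavy entry points so the platform can allocate a GPU per call. A hedged sketch of that wiring, assuming the `spaces` helper package that HF provides on ZeroGPU Spaces; the fallback decorator and the `obliterate` stub are illustrative, not the app's real code:

```python
# On ZeroGPU Spaces, `spaces.GPU` requests a GPU for the decorated call;
# the fallback lets the same file run locally where `spaces` is absent.
try:
    import spaces
    gpu = spaces.GPU
except Exception:  # local run without the HF `spaces` package
    def gpu(fn=None, **kwargs):
        if fn is None:            # supports @gpu(duration=...) form too
            return lambda f: f
        return fn

@gpu
def obliterate(model_id: str) -> str:
    # ...GPU-heavy ablation would run here; stubbed for illustration...
    return f"liberated:{model_id}"

print(obliterate("Qwen/Qwen2.5-0.5B-Instruct"))
```

Because allocation happens per decorated call, each visitor's run draws on their own quota, which is what makes the "no conflicts" claim above hold.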
tests/test_abliterate.py
CHANGED

@@ -129,7 +129,7 @@ class TestStages:
 
 class TestMethods:
     def test_methods_exist(self):
-        assert set(METHODS.keys()) == {"basic", "advanced", "aggressive", "informed", "surgical", "inverted", "nuclear", "optimized", "failspy", "gabliteration", "heretic", "rdo"}
+        assert set(METHODS.keys()) == {"basic", "advanced", "aggressive", "informed", "surgical", "inverted", "nuclear", "optimized", "failspy", "gabliteration", "heretic", "rdo", "spectral_cascade"}
 
     def test_basic_single_direction(self):
         cfg = METHODS["basic"]
tests/test_telemetry.py
CHANGED

@@ -37,10 +37,19 @@ class TestTelemetryConfig:
     def setup_method(self):
         _reset_telemetry()
 
-    def
+    def test_disabled_by_default(self):
         with patch.dict(os.environ, {}, clear=True):
+            _reset_telemetry()
+            assert not is_enabled()
+
+    def test_enabled_by_default_on_hf_spaces(self):
+        with patch.dict(os.environ, {"SPACE_ID": "user/space"}, clear=True):
+            import obliteratus.telemetry as t
+            old_val = t._ON_HF_SPACES
+            t._ON_HF_SPACES = True
             _reset_telemetry()
             assert is_enabled()
+            t._ON_HF_SPACES = old_val
 
     def test_disable_via_env_zero(self):
         with patch.dict(os.environ, {"OBLITERATUS_TELEMETRY": "0"}):