
OBLITERATUS Pipeline Efficiency Audit

Date: 2026-03-03
Scope: All obliteration methods in abliterate.py (5,076 lines), bayesian_optimizer.py, informed_pipeline.py, and 4 ablation strategies.


Executive Summary

The 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH) is architecturally sound with good separation of concerns. Memory hygiene between stages is correct. The rank-1 projection math is efficient. Quantization handling is robust.

8 concrete efficiency issues found. Estimated cumulative impact: ~40-60% wall-clock reduction on typical runs (8B model, advanced/surgical methods). Ordered by ROI (ease × impact).


HIGH PRIORITY (Fix This Week)

1. PROBE runs 1,536 prompts with zero batching

Location: abliterate.py:1074-1088
Impact: Largest single wall-clock bottleneck (~77s on 8B model, reducible to ~10s)

The activation collection loop processes each prompt individually with a full forward pass + GC cycle between each one. With 512 harmful + 512 harmless + 512 jailbreak prompts = 1,536 serial forward passes.

The _free_gpu_memory() call at line 1086 is inside the per-prompt loop, adding ~20ms × 1,536 = 30s of pure garbage collection overhead.

```python
# CURRENT (serial)
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt", ...)
    model(**inputs)
    del inputs
    self._free_gpu_memory()  # <-- 30s wasted
```

Fix: Batch prompts (batch_size=8-16). Hooks already handle batch dimension correctly via hidden[:, -1, :]. Move _free_gpu_memory() to run every N batches, not every prompt.

Speedup: ~7-8x on PROBE stage.
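A minimal sketch of the batched loop; tokenize and forward are injected placeholders for the real tokenizer and hooked model call, not the abliterate.py API:

```python
import gc

def collect_activations_batched(prompts, tokenize, forward,
                                batch_size=8, gc_every=16):
    """Batched replacement for the serial PROBE loop: one forward pass
    per batch of prompts, GC every gc_every batches instead of every prompt."""
    for n in range(0, len(prompts), batch_size):
        batch = prompts[n:n + batch_size]
        inputs = tokenize(batch)   # padded batch tensors in the real pipeline
        forward(inputs)            # hooks still read hidden[:, -1, :]
        del inputs
        if (n // batch_size + 1) % gc_every == 0:
            gc.collect()           # real pipeline: self._free_gpu_memory()
```

With batch_size=8, the 1,536 prompts become 192 forward passes and a dozen GC cycles rather than 1,536 of each.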


2. VERIFY generates 30 completions sequentially — no batching

Location: abliterate.py:4622-4670
Impact: Second-largest wall-clock cost (~57s on 8B model, reducible to ~15s)

Each of the 30 refusal-test prompts gets an independent model.generate(max_new_tokens=128) call. At ~15ms/token on an 8B model, that's 30 × 128 × 15ms ≈ 57s.

Fix: Batch the generation calls (batch_size=4-8). model.generate() supports batched inputs natively. The tokenizer already handles padding.

Speedup: ~4x on VERIFY stage.
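A sketch under the same placeholder convention (tokenize/generate stand in for the real tokenizer and model.generate call):

```python
def generate_refusal_tests(prompts, tokenize, generate, batch_size=4):
    """Batched replacement for the serial VERIFY loop.

    Note: with a decoder-only model the real tokenizer must use
    padding_side="left" so every sequence ends at the generation boundary;
    right padding silently corrupts batched generation.
    """
    completions = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        completions.extend(generate(tokenize(batch)))  # one call per batch
    return completions
```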


3. SAE training is forced to CPU with no early stopping

Location: abliterate.py:1579-1583
Impact: Moderate — adds ~20-40s per run when SAE features are enabled (surgical, nuclear methods)

SAE training runs 30 fixed epochs per strong layer on CPU. With 15-20 strong layers, that's 450-600 CPU training epochs. No convergence check, no early stopping.

The device="cpu" is overly conservative — the memory-aware cap at line 1570-1578 already validates GPU headroom, and a typical SAE encoder (expansion=2, hidden_dim=4096) is only ~128MB.

Fix:

  1. Add early stopping when reconstruction loss plateaus (< 0.1% improvement over 3 epochs)
  2. Use GPU when free_mb > sae_mem_mb + 1024 (1GB headroom)
  3. Reduce default epochs from 30 to 15 with convergence guard
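The convergence guard can be sketched as a wrapper around a hypothetical step function that runs one SAE training epoch and returns its reconstruction loss (thresholds taken from the fix above):

```python
def train_sae_early_stop(step, max_epochs=15, patience=3, min_rel_improve=1e-3):
    """Run up to max_epochs training epochs, stopping once the best loss
    has failed to improve by min_rel_improve (0.1%) for `patience`
    consecutive epochs. Returns (epochs_run, best_loss)."""
    best = float("inf")
    stale = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        loss = step(epoch)
        epochs_run = epoch + 1
        if loss < best * (1 - min_rel_improve):
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best
```

On a loss curve that plateaus after a few epochs, this cuts 30 fixed epochs down to well under the new 15-epoch cap.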

MEDIUM PRIORITY (Fix This Sprint)

4. _distill_inner() is a degraded copy of _distill() — drops half the SOTA techniques

Location: abliterate.py:2958-3055 vs 1102-1750
Impact: Quality regression on refinement passes 2+, not pure compute waste

The iterative refinement path calls _distill_inner() which is a simplified ~100-line copy that skips: Wasserstein-optimal extraction, layer-adaptive strength, float layer interpolation, SAE features, EGA, CoT-aware orthogonalization, and RDO refinement.

This means "true iterative refinement" actually produces worse directions on later passes because it drops the analysis-guided enhancements.

Fix: Extract shared SVD/direction logic into _extract_directions(full_features=True/False) and call from both paths. At minimum, keep whitened SVD and jailbreak-contrastive blending in the inner path.
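A rough sketch of the shared core, with whitening simplified to mean-centering and the enhancement hooks stubbed out (all names here are illustrative assumptions, not the real implementation):

```python
import numpy as np

def extract_directions(harmful, harmless, n_directions=1, full_features=True):
    """Hypothetical shared entry point for _distill and _distill_inner.

    Both paths get identical SVD-based direction extraction; only the
    full path would layer the analysis-guided enhancements on top."""
    diff = harmful - harmless                        # per-prompt activation gaps
    diff = diff - diff.mean(axis=0, keepdims=True)   # center (stand-in for whitening)
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    dirs = vt[:n_directions]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    if full_features:
        pass  # full path only: SAE features, EGA, CoT-aware ortho, RDO refinement
    return dirs
```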


5. Bayesian optimizer clones ALL weight tensors — ~7GB memory overhead

Location: bayesian_optimizer.py:300-341
Impact: Memory pressure on GPU-constrained setups; 50× full-restore cycles

The optimizer saves a complete clone of every weight tensor across all strong layers. For a 7B model with 32 layers, that's ~7GB of clones sitting in memory during all 50 trials.

After each trial, _restore_all() copies all clones back — 50 trials × full-model memcpy.

Fix (easy): Only clone weights in _strong_layers (already partially done, but named_parameters() crawl still catches everything). Drop the seen_data_ptrs set once the loop is tightened.

Fix (better): Store the rank-1 factors of the projection delta Δ = scale * d @ (d^T @ W) per layer (i.e., keep d and coeff = d^T @ W) instead of cloning the full weight. Rollback = W += Δ, rebuilt from the factors on demand. This reduces storage from O(hidden_dim²) to O(hidden_dim) per direction per layer.
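A sketch of the factor-based rollback (function names are illustrative):

```python
import torch

def excise_with_rollback(W, d, scale):
    """Apply the rank-1 projection in place, returning only the O(hidden_dim)
    coeff vector needed (with d and scale) to undo it later."""
    coeff = d @ W                          # d^T W, shape (out_features,)
    W.sub_(scale * torch.outer(d, coeff))  # W' = W - scale * d coeff^T
    return coeff                           # rollback state: (d, coeff, scale)

def rollback(W, d, coeff, scale):
    """Restore W from the stored factors: W += scale * d coeff^T."""
    W.add_(scale * torch.outer(d, coeff))
```

Storage per trial drops from a full weight clone per matrix to two hidden_dim-sized vectors per direction per layer.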


6. Norm computation in _project_out_advanced() traverses the full matrix twice

Location: abliterate.py:3477-3486
Impact: ~4,800 unnecessary full-matrix norm computations per run (8-direction surgical)

When norm_preserve=True, the code computes W.norm() before projection and W.norm() after projection. Each norm traverses the full weight matrix (16M elements for 4096×4096).

With 8 directions × 30 layers × 10 weight matrices = 2,400 projections → 4,800 norm calls → 77 billion unnecessary FLOPs.

Fix: After rank-1 update W' = W - scale * d @ (d^T @ W), the new norm satisfies: ||W'||² = ||W||² - 2·scale·||d^T @ W||² + scale²·||d^T @ W||²·||d||²

Since ||d|| = 1: ||W'||² = ||W||² - scale·(2 - scale)·||coeff||²

This replaces a 16M-element norm with a single coeff.pow(2).sum() call (~4K FLOPs).
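The closed form can be checked numerically; norm_after_excise is a hypothetical helper, not the existing code:

```python
import torch

def norm_after_excise(W, d, scale):
    """Analytical ||W'||_F for W' = W - scale * d (d^T W), assuming unit d,
    via ||W'||^2 = ||W||^2 - scale*(2 - scale)*||coeff||^2."""
    coeff = d @ W
    sq = W.pow(2).sum() - scale * (2 - scale) * coeff.pow(2).sum()
    return sq.clamp_min(0).sqrt()  # clamp guards against tiny fp negatives
```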


LOW PRIORITY (Backlog)

7. Gram-Schmidt appears 3 times as O(n²) nested loops

Location: abliterate.py:1168-1173, 1361-1367, 3038-3044
Impact: Minimal compute but code quality issue

Three separate implementations of the same Gram-Schmidt orthogonalization with nested Python loops. With n_directions=8, it's 28 dot products per call — trivial compute but (a) DRY violation, (b) numerically inferior to torch.linalg.qr().

Fix: Extract to _orthogonalize_subspace(sub: Tensor) -> Tensor using QR decomposition. Single call site, single test, better numerics.


8. Pre-EXCISE baseline KL capture re-forward-passes 100 prompts already seen in PROBE

Location: abliterate.py:2313-2366
Impact: ~700ms wasted (minor)

_capture_baseline_kl_logits() runs 100 harmless prompts through the model to capture pre-EXCISE logits. But PROBE already ran those same prompts and captured hidden states at every layer. The logits could be computed from the cached last-layer hidden states with a single matmul through lm_head (after applying the model's final norm, which most decoder architectures insert before the head).

Fix: After PROBE, compute the baseline logits from the harmless activations cached at the last layer (apply the final norm, then model.lm_head) instead of re-running the prompts. Skip the 100-prompt forward pass entirely.


What's Done Well

| Area | Assessment |
|------|------------|
| Stage-boundary memory cleanup | Correct — _free_gpu_memory() + explicit dict clearing between stages |
| Rank-1 projection math | Efficient — W @ d then d.T * coeff instead of materializing I - dd^T |
| Quantization dequant/requant | Robust — handles bitsandbytes NF4, GPTQ, AWQ; fails loudly on unsupported formats |
| Incremental expert mean | Smart — Welford running mean in _transplant_expert_weights() avoids stacking all expert weights |
| Router stabilization | Defensive — _stabilize_router_weights() after MoE projection prevents CUDA crashes |
| Large model mode | Pragmatic — caps directions, SAE features, refinement passes for 120B+ models |
| Event emission | Clean — _emit() / _on_stage() / _on_log() callbacks for UI integration without coupling |

Method Efficiency Comparison

| Method | PROBE Cost | DISTILL Cost | EXCISE Cost | VERIFY Cost | Primary Bottleneck |
|--------|------------|--------------|-------------|-------------|--------------------|
| basic | 1x (1,024 prompts) | 1x (diff-in-means) | 1x (~10 projections) | 1x | PROBE |
| advanced | 2x (re-probe on pass 2) | 2x (re-distill) | 2x (2 passes) | 1x | PROBE × 2 |
| aggressive | 3x (re-probe on passes 2,3) | 3x (re-distill) | 3x (3 passes, 8 dirs) | 1x | PROBE × 3 |
| surgical | 1.5x (+jailbreak prompts) | 2x (SAE training) | 2x (head surgery + EGA) | 1x | SAE on CPU |
| optimized | 1.5x (+jailbreak) | 1x | 50x (Bayesian trials) | 1x | Bayesian optimizer |
| inverted | 1.5x (+jailbreak) | 1x | 2x (reflection math) | 1x | PROBE |
| nuclear | 1.5x (+jailbreak) | 2x (SAE) | 3x (all techniques) | 1x | SAE + PROBE |
| informed | 1x | 1.5x (analysis modules) | 1x-3x (dynamic) | 1.5x (Ouroboros check) | Analysis modules |

Prioritized Action Plan

  1. Batch PROBE forward passes — immediate 7-8x speedup on largest bottleneck
  2. Batch VERIFY generation — immediate 4x speedup on second bottleneck
  3. Add SAE early stopping + GPU path — 2-3x speedup on SAE-enabled methods
  4. Unify _distill / _distill_inner — quality fix, prevents direction degradation
  5. Optimize Bayesian rollback storage — memory fix for GPU-constrained users
  6. Analytical norm computation — eliminates 77B unnecessary FLOPs
  7. DRY Gram-Schmidt — code quality
  8. Cache KL baseline from PROBE — minor speedup