arxiv:2509.11816

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Published on Nov 13, 2025

Authors:

Abstract

Researchers developed a novel unlearning technique called Collapse of Irrelevant Representations (CIR) that selectively removes harmful knowledge from language models by identifying and collapsing shared representations while preserving general performance.

AI-generated summary

Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause - unlearning targets representations which are too general - and develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less, and using less than 3 GPU-seconds per fact. Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2509.11816

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.11816 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.11816 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.11816 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.