Papers
arxiv:2509.11816

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Published on Nov 13, 2025
Authors:
,

Abstract

Researchers developed a novel unlearning technique called Collapse of Irrelevant Representations (CIR) that selectively removes harmful knowledge from language models by identifying and collapsing shared representations while preserving general performance.

AI-generated summary

Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause - unlearning targets representations which are too general - and develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less, and using less than 3 GPU-seconds per fact. Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2509.11816
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.11816 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.11816 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.11816 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.