arxiv:2410.20526

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Published on Oct 27, 2024

AI-generated summary

Sparse autoencoders are trained on each layer and sublayer of the Llama-3.1-8B-Base model to extract sparse representations; the work assesses how well they generalize and analyzes the geometry of the learned features.

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
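
As background, a Top-K SAE enforces sparsity exactly rather than through an L1 penalty: for each input it keeps only the k largest latent pre-activations and zeroes the rest. The following is a minimal PyTorch sketch of that mechanism, not the Llama Scope training code; the layer names, the value of k, the ReLU on the kept values, and the plain MSE objective are illustrative assumptions (d_model=4096 matches Llama-3.1-8B's hidden size, and 32K is one of the paper's feature widths).

import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # Center the input with the decoder bias, then encode.
        pre = self.encoder(x - self.decoder.bias)
        # Keep only the k largest pre-activations per token; zero the rest,
        # so exactly k latent features are active for each input.
        values, indices = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, indices, torch.relu(values))
        # Reconstruct the original activation from the sparse code.
        x_hat = self.decoder(z)
        return x_hat, z

sae = TopKSAE(d_model=4096, n_features=32_768, k=50)
x = torch.randn(8, 4096)            # a batch of residual-stream activations
x_hat, z = sae(x)
loss = ((x_hat - x) ** 2).mean()    # plain reconstruction MSE

Because sparsity is fixed at exactly k active latents per input, there is no L1 coefficient to tune, which is one reason the Top-K family scales well to suites of hundreds of SAEs like the one introduced here.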

Get this paper in your agent:

hf papers read 2410.20526
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
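
The SAE checkpoints themselves can be fetched with the huggingface_hub Python library. A minimal sketch, assuming only the fnlp/Llama-Scope repo id given in the abstract:

from huggingface_hub import snapshot_download

# Download all checkpoint files from the repo named in the abstract;
# the returned path points at the local cache directory.
local_dir = snapshot_download(repo_id="fnlp/Llama-Scope")
print(f"Llama Scope checkpoints at: {local_dir}")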

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 1