---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---
# Activation Oracle: gemma-4-31B-it
This is a LoRA adapter that turns gemma-4-31B-it into an activation oracle -- an LLM that can read and interpret the internal activations of other LLMs (or itself) in natural language.
## What is an activation oracle?
An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them:
- "What topic is the model thinking about?" -- classification from activations
- "What token will come next?" -- next-token prediction from hidden states
- "Is this SAE feature active?" -- sparse autoencoder feature detection
This enables interpretability research without access to the target model's logits or generated text -- only its internal representations.
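To make the "only its internal representations" part concrete, the sketch below captures a hidden state from a stand-in model with a forward hook. The four-layer `nn.Linear` stack is a toy placeholder for illustration only; in practice the hook would be registered on a real transformer's decoder layers.

```python
import torch
import torch.nn as nn

# Toy stand-in for a target LLM; a real oracle would read activations
# from an actual transformer's decoder layers.
layers = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

captured = {}

def save_activation(module, inputs, output):
    # Store this layer's output (the "hidden state" the oracle consumes)
    captured["act"] = output.detach()

# Hook the layer at 50% depth
handle = layers[len(layers) // 2].register_forward_hook(save_activation)
_ = layers(torch.randn(1, 8))
handle.remove()

print(captured["act"].shape)  # torch.Size([1, 8])
```

The captured tensor is what gets injected into the oracle; the target model's logits and generated text are never consulted.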
Paper: Activation Oracles (arXiv:2512.15674)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "EvilScript/activation-oracle-gemma-4-31B-it-step-85000",
)
model.eval()
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it |
| Adapter | LoRA |
| Training tasks | LatentQA, classification, PastLens (next-token), SAE features |
| Activation injection | Steering vectors at intermediate layers |
| Layer coverage | 25%, 50%, 75% depth |
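The last two table rows (steering-vector injection at 25%, 50%, and 75% depth) can be sketched with forward hooks that add a vector into the residual stream. This is an illustrative toy, not the training code: the `nn.Linear` stack stands in for transformer layers, and `steering` is a random placeholder for an activation vector taken from the target model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, num_layers = 8, 8
layers = nn.Sequential(*[nn.Linear(d, d) for _ in range(num_layers)])

# Placeholder for an activation vector read out of the target model
steering = torch.randn(d)

def inject(module, inputs, output):
    # Add the steering vector into the residual stream at this layer;
    # returning a value from a forward hook replaces the layer output.
    return output + steering

# 25%, 50%, 75% depth -> layer indices 2, 4, 6 in an 8-layer stack
handles = [
    layers[int(num_layers * frac)].register_forward_hook(inject)
    for frac in (0.25, 0.5, 0.75)
]

x = torch.zeros(1, d)
steered = layers(x)
for h in handles:
    h.remove()
plain = layers(x)
```

With the hooks removed, the same input produces a different output, which is the whole point: the injected activations change what the oracle "sees" and therefore what it answers.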
## Training Data
The oracle is trained on a mixture of:
- LatentQA -- open-ended questions about hidden states
- Classification -- topic, sentiment, NER, gender, tense, entailment from activations
- PastLens -- predicting upcoming tokens from hidden states
- SAE features -- identifying active sparse autoencoder features
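For the SAE-features task, a label like "Is feature *i* active?" can be derived from a standard ReLU sparse-autoencoder encoder: a feature is active when its post-ReLU value is positive. The sketch below is a minimal illustration with random placeholder weights and toy sizes, not the SAE used in training.

```python
import torch

torch.manual_seed(0)
d_model, d_sae = 16, 64  # toy sizes; real SAEs are far wider
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)

def sae_features(act):
    # Standard ReLU SAE encoder: feature i is "active" when its value > 0
    return torch.relu(act @ W_enc + b_enc)

act = torch.randn(d_model)  # hidden state from the target model (placeholder)
feats = sae_features(act)
is_active = bool(feats[3] > 0)  # label for "Is SAE feature 3 active?"
```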
## Related Resources
- Paper: Activation Oracles (arXiv:2512.15674)
- Code: activation_oracles