---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---
# Activation Oracle: gemma-4-31B-it
This is a LoRA adapter that turns gemma-4-31B-it into an activation oracle -- an LLM that can read and interpret the internal activations of other LLMs (or itself) in natural language.
## What is an activation oracle?
An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them:
- "What topic is the model thinking about?" -- classification from activations
- "What token will come next?" -- next-token prediction from hidden states
- "Is this SAE feature active?" -- sparse autoencoder feature detection
This enables interpretability research without access to the target model's logits or generated text -- only its internal representations.
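To make the "only its internal representations" part concrete, the sketch below captures a hidden state from a stand-in model with a forward hook. The four-layer `nn.Linear` stack is a toy placeholder for illustration only; in practice the hook would be registered on a real transformer's decoder layers.

```python
import torch
import torch.nn as nn

# Toy stand-in for a target LLM; a real oracle would read activations
# from an actual transformer's decoder layers.
layers = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

captured = {}

def save_activation(module, inputs, output):
    # Store this layer's output (the "hidden state" the oracle consumes)
    captured["act"] = output.detach()

# Hook the layer at 50% depth
handle = layers[len(layers) // 2].register_forward_hook(save_activation)
_ = layers(torch.randn(1, 8))
handle.remove()

print(captured["act"].shape)  # torch.Size([1, 8])
```

The captured tensor is what gets injected into the oracle; the target model's logits and generated text are never consulted.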
Paper: Activation Oracles (arXiv:2512.15674)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "EvilScript/activation-oracle-gemma-4-31B-it-step-85000",
)
model.eval()
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it |
| Adapter | LoRA |
| Training tasks | LatentQA, classification, PastLens (next-token), SAE features |
| Activation injection | Steering vectors at intermediate layers |
| Layer coverage | 25%, 50%, 75% depth |
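The last two table rows (steering-vector injection at 25%, 50%, and 75% depth) can be sketched with forward hooks that add a vector into the residual stream. This is an illustrative toy, not the training code: the `nn.Linear` stack stands in for transformer layers, and `steering` is a random placeholder for an activation vector taken from the target model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, num_layers = 8, 8
layers = nn.Sequential(*[nn.Linear(d, d) for _ in range(num_layers)])

# Placeholder for an activation vector read out of the target model
steering = torch.randn(d)

def inject(module, inputs, output):
    # Add the steering vector into the residual stream at this layer;
    # returning a value from a forward hook replaces the layer output.
    return output + steering

# 25%, 50%, 75% depth -> layer indices 2, 4, 6 in an 8-layer stack
handles = [
    layers[int(num_layers * frac)].register_forward_hook(inject)
    for frac in (0.25, 0.5, 0.75)
]

x = torch.zeros(1, d)
steered = layers(x)
for h in handles:
    h.remove()
plain = layers(x)
```

With the hooks removed, the same input produces a different output, which is the whole point: the injected activations change what the oracle "sees" and therefore what it answers.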
## Training Data
The oracle is trained on a mixture of:
- LatentQA -- open-ended questions about hidden states
- Classification -- topic, sentiment, NER, gender, tense, entailment from activations
- PastLens -- predicting upcoming tokens from hidden states
- SAE features -- identifying active sparse autoencoder features
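For the SAE-features task, a label like "Is feature *i* active?" can be derived from a standard ReLU sparse-autoencoder encoder: a feature is active when its post-ReLU value is positive. The sketch below is a minimal illustration with random placeholder weights and toy sizes, not the SAE used in training.

```python
import torch

torch.manual_seed(0)
d_model, d_sae = 16, 64  # toy sizes; real SAEs are far wider
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)

def sae_features(act):
    # Standard ReLU SAE encoder: feature i is "active" when its value > 0
    return torch.relu(act @ W_enc + b_enc)

act = torch.randn(d_model)  # hidden state from the target model (placeholder)
feats = sae_features(act)
is_active = bool(feats[3] > 0)  # label for "Is SAE feature 3 active?"
```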
## Related Resources
- Paper: Activation Oracles (arXiv:2512.15674)
- Code: activation_oracles