---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---

# Activation Oracle: gemma-4-31B-it

This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) into an **activation oracle** -- an LLM that can read and interpret the internal activations of other LLMs (or itself) in natural language.

## What is an activation oracle?

An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them:

- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection

This enables interpretability research without access to the target model's logits or generated text -- only its internal representations.

**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-85000")
model.eval()
```

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |

## Training Data

The oracle is trained on a mixture of:

1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features

## Related Resources

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)
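
## Sketch: Injecting an Activation and Asking a Question

The exact injection protocol (which layer, which token positions, any scaling, and the prompt format) is defined by the paper and training code, not by this card. The sketch below only illustrates the general idea: capture a hidden state from the frozen base model at roughly 50% depth, add it into the oracle's residual stream at the same layer via a forward hook, and ask one of the questions listed above. The target prompt, layer choice, additive injection at every position, and the helper `make_injection_hook` are assumptions for illustration; see the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) repository for the actual pipeline. The variables `base_model`, `tokenizer`, and `model` come from the Quick Start snippet.

```python
import torch

# Illustrative sketch only: layer choice, token positions, and scaling are
# assumptions; the real protocol lives in the activation_oracles repository.

# 1. Capture a hidden state from the frozen base model on some target text.
target_prompt = "The Eiffel Tower is located in"
target_inputs = tokenizer(target_prompt, return_tensors="pt").to(base_model.device)
layer_idx = base_model.config.num_hidden_layers // 2  # ~50% depth (assumption)

with torch.no_grad():
    out = base_model(**target_inputs, output_hidden_states=True)
# hidden_states[0] is the embedding output, so layer layer_idx is at index layer_idx + 1.
captured = out.hidden_states[layer_idx + 1][:, -1, :]  # last-token activation, shape (1, d_model)

# 2. Inject the captured activation into the oracle's residual stream with a forward hook.
#    Note: this simple hook adds the vector at every position and every decoding step,
#    which may differ from the protocol used during training.
def make_injection_hook(vector):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + vector.to(hidden.dtype)  # additive steering (assumption)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# PEFT wraps the base model in place, so hooking the base model's decoder layer
# also affects the LoRA-adapted oracle. The module path is architecture-specific.
decoder_layer = base_model.model.layers[layer_idx]
handle = decoder_layer.register_forward_hook(make_injection_hook(captured))

# 3. Ask the oracle about the injected activation.
question = "What topic is the model thinking about?"
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    answer_ids = model.generate(prompt_ids, max_new_tokens=64)
print(tokenizer.decode(answer_ids[0, prompt_ids.shape[-1]:], skip_special_tokens=True))

handle.remove()  # remove the hook so later generations are unaffected
```

The same pattern extends to the other tasks listed above (next-token prediction, SAE feature detection) by changing the question; only the injected activation and the wording of the query differ.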