---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
  - activation-oracles
  - interpretability
  - lora
  - self-introspection
  - sae
---

# Activation Oracle: gemma-4-31B-it

This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
into an **activation oracle** -- an LLM that reads the internal activations of
another LLM (or its own) and answers questions about them in natural language.

## What is an activation oracle?

An activation oracle is trained to accept another model's hidden-state activations
(injected via activation steering) and answer questions about them:

- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection

This enables interpretability research that needs only the target model's internal
representations, not its logits or generated text.
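
As a rough illustration of what "injected via activation steering" can look like in practice, the sketch below adds a vector to the hidden state at one token position using a PyTorch forward hook. It is a toy under stated assumptions, not the adapter's actual injection code: `make_injection_hook`, the `scale` parameter, the plain additive rule, and the `nn.Linear` stand-in for a transformer block are all hypothetical choices made for this example.

```python
import torch
from torch import nn


def make_injection_hook(vector, position, scale=1.0):
    """Build a forward hook that adds `vector` to the hidden state at `position`.

    Hypothetical helper for illustration; the additive rule is an assumption.
    """
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple whose first item is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, position, :] += scale * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook


torch.manual_seed(0)
layer = nn.Linear(8, 8)      # toy stand-in for one transformer block
x = torch.randn(1, 5, 8)     # (batch, sequence, hidden)
vector = torch.randn(8)      # stand-in for an activation taken from a target model

baseline = layer(x)
handle = layer.register_forward_hook(make_injection_hook(vector, position=2))
steered = layer(x)
handle.remove()              # always remove the hook once the forward pass is done
```

Only position 2 differs between `baseline` and `steered`; the other positions pass through unchanged. With a real model the hook would be registered on one of the decoder layers, and the position would be that of a placeholder token in the oracle's prompt.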

**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-60000")
model.eval()
```

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |
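
The depth percentages in the table can be turned into concrete layer indices once the number of layers is known. The snippet below is a minimal sketch: `layers_at_depths` is a hypothetical helper, truncating toward zero is an assumed rounding convention, and the layer count of 40 is purely illustrative (read the real value from the model config, for example `num_hidden_layers`).

```python
def layers_at_depths(num_layers, fractions=(0.25, 0.50, 0.75)):
    """Map depth fractions to layer indices (assumed convention: truncate)."""
    return [int(num_layers * f) for f in fractions]


# Illustrative layer count only; not the actual depth of gemma-4-31B-it.
example_layers = layers_at_depths(40)
print(example_layers)  # [10, 20, 30]
```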

## Training Data

The oracle is trained on a mixture of:

1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features

## Related Resources

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)