---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---
# Activation Oracle: gemma-4-31B-it
This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
into an **activation oracle** -- an LLM that can read and interpret the internal
activations of other LLMs (or itself) in natural language.
## What is an activation oracle?
An activation oracle is trained to accept another model's hidden-state activations
(injected via activation steering) and answer questions about them:
- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection
This enables interpretability research without access to the target model's logits
or generated text -- only its internal representations.
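The injection mechanism can be sketched with a forward hook that overwrites a layer's output with foreign activations added in. This is a minimal illustration on a toy two-layer MLP, not the paper's actual scheme; the module, the additive-steering formula, and all names here are assumptions for demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM layer stack (hypothetical, for illustration only).
toy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# A "steering vector" standing in for an activation captured from a target model.
steering_vector = torch.randn(8)

def inject(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output;
    # here we add the foreign activation (simple additive steering).
    return output + steering_vector

handle = toy_model[0].register_forward_hook(inject)

x = torch.zeros(1, 8)
steered = toy_model(x)      # forward pass with the activation injected
handle.remove()
baseline = toy_model(x)     # same input, no injection

# The injected activation propagates and changes the downstream computation.
print(torch.allclose(steered, baseline))  # False
```

The real oracle applies this idea at the intermediate layers listed in the training details below; see the paper's repository for the exact injection code.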
**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-31B-it",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-60000")
model.eval()
```
## Training Details
| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |
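The 25%/50%/75% coverage maps to concrete layer indices once the model's depth is fixed. A quick sketch, assuming a hypothetical 48-layer decoder (the actual layer count of gemma-4-31B-it may differ):

```python
# Hypothetical depth for illustration; check the base model's config for the real value.
num_layers = 48

# Injection points at 25%, 50%, and 75% of model depth.
injection_layers = [int(num_layers * frac) for frac in (0.25, 0.50, 0.75)]
print(injection_layers)  # [12, 24, 36]
```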
## Training Data
The oracle is trained on a mixture of:
1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features
## Related Resources
- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)