---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---

# Activation Oracle: gemma-4-31B-it

This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
into an **activation oracle** -- an LLM that can read and interpret the internal
activations of other LLMs (or itself) in natural language.

## What is an activation oracle?

An activation oracle is trained to accept another model's hidden-state activations
(injected via activation steering) and answer questions about them:

| - **"What topic is the model thinking about?"** -- classification from activations |
| - **"What token will come next?"** -- next-token prediction from hidden states |
| - **"Is this SAE feature active?"** -- sparse autoencoder feature detection |
|
|
This enables interpretability research without access to the target model's logits
or generated text -- only its internal representations.

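For concreteness, here is a minimal sketch of the collection side: grabbing residual-stream activations from a target model with a forward hook. It assumes a standard `transformers` decoder layout (`model.model.layers`); the prompt and layer choice are arbitrary examples, and the paper's actual collection and injection code is linked under Related Resources.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Hugging Face causal LM can serve as the target; here the target is
# the same base model the oracle is built on.
target = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

captured = {}

def grab(module, inputs, output):
    # Decoder layers return a tuple whose first element is the
    # residual-stream hidden states, shape (batch, seq, hidden).
    hidden = output[0] if isinstance(output, tuple) else output
    captured["acts"] = hidden.detach()

# Hook a layer at ~50% depth, one of the depths the adapter was trained on.
layer_idx = target.config.num_hidden_layers // 2
handle = target.model.layers[layer_idx].register_forward_hook(grab)

enc = tok("The Eiffel Tower is in", return_tensors="pt").to(target.device)
with torch.no_grad():
    target(**enc)
handle.remove()

acts = captured["acts"][0, -1]  # activation vector at the final token
```
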
**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-60000")
model.eval()
```

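Loading alone is not enough to query the oracle: the target activations must be written into the oracle's forward pass. The sketch below shows one plausible way to do this with a forward pre-hook, reusing `acts` from the collection sketch above. The prompt template, `?` placeholder, injection layer, and attribute path into the PEFT wrapper are all illustrative assumptions; the paper's repository (linked below) contains the actual implementation.

```python
# Illustrative only: the real prompt template and injection mechanics are
# in the training repo. `acts` is an activation vector from the target model.
prompt = "What topic is the model thinking about? Activations: ? ? ?"
enc = tokenizer(prompt, return_tensors="pt").to(model.device)

# Find the placeholder positions where activations get injected.
q_id = tokenizer.convert_tokens_to_ids("?")
positions = (enc.input_ids[0] == q_id).nonzero(as_tuple=True)[0]

def inject(module, args):
    hidden = args[0]  # (batch, seq, hidden)
    if hidden.shape[1] > 1:  # prefill pass only, not cached decode steps
        hidden[:, positions] = acts.to(hidden.dtype)

# Overwrite hidden states at an early layer of the oracle (the path assumes
# a PeftModel wrapping a standard decoder-only transformers model).
layer = model.base_model.model.model.layers[1]
handle = layer.register_forward_pre_hook(inject)
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=64)
handle.remove()

print(tokenizer.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True))
```
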
## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |

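The layer coverage figures are fractional depths in the decoder stack. A small sketch of mapping them to concrete layer indices (the rounding convention here is an assumption):

```python
# Map the fractional depths from the table to decoder-layer indices.
num_layers = base_model.config.num_hidden_layers
injection_layers = [int(num_layers * frac) for frac in (0.25, 0.50, 0.75)]
print(injection_layers)  # three layer indices spread through the stack
```
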
## Training Data

The oracle is trained on a mixture of four task types (sketched with example prompts after the list):

1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features

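To make the mixture concrete, the dictionary below sketches hypothetical question styles for each task; the actual prompt templates live in the training code linked under Related Resources.

```python
# Hypothetical example questions per task; real templates are in the repo.
task_examples = {
    "latentqa": "Describe what the text producing these activations is about.",
    "classification": "What is the sentiment of the text: positive or negative?",
    "pastlens": "What token is most likely to come next?",
    "sae_features": "Is SAE feature 1234 active here? Answer yes or no.",
}
```
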
## Related Resources

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)