---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---
# Activation Oracle: gemma-4-31B-it
This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
into an **activation oracle** -- an LLM that reads the internal activations of other
LLMs (or its own) and answers natural-language questions about them.
## What is an activation oracle?
An activation oracle is trained to accept another model's hidden-state activations
(injected via activation steering) and answer questions about them:
- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection
This enables interpretability research that needs no access to the target model's
logits or generated text, only to its internal representations.
**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
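
The injection step described above can be illustrated with a minimal, dependency-free sketch. It assumes one plausible form of additive steering, in which the injected activation is rescaled to the norm of the hidden state it is added to. This card does not state the exact rule used to train the adapter, so the formula, the function names (`inject`, `inject_at_positions`), and the `coefficient` parameter are all illustrative assumptions. In practice the same arithmetic would typically run on tensors inside a PyTorch forward hook rather than on Python lists.

```python
import math


def norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def inject(hidden, activation, coefficient=1.0):
    """Add `activation` to `hidden` after rescaling it to the norm of `hidden`.

    Assumed rule (hypothetical, not confirmed by this card):
        h' = h + coefficient * ||h|| * v / ||v||
    """
    if len(hidden) != len(activation):
        raise ValueError("hidden and activation must have the same dimension")
    v_norm = norm(activation)
    if v_norm == 0.0:
        # Nothing to inject: return an unchanged copy.
        return list(hidden)
    scale = coefficient * norm(hidden) / v_norm
    return [h + scale * v for h, v in zip(hidden, activation)]


def inject_at_positions(hidden_states, activations, positions, coefficient=1.0):
    """Return a copy of per-token hidden states with one activation
    injected at each listed token position; other positions are untouched."""
    if len(activations) != len(positions):
        raise ValueError("need exactly one activation per position")
    out = [list(h) for h in hidden_states]
    for pos, act in zip(positions, activations):
        out[pos] = inject(out[pos], act, coefficient)
    return out


# Toy example: three tokens with 2-dimensional hidden states.
hidden_states = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
steered = inject_at_positions(hidden_states, [[0.0, 2.0]], positions=[1])
```

Only the token at position 1 changes here. Rescaling to the hidden state's norm keeps the injected vector on a scale comparable to the residual stream it joins, which is the motivation for this assumed variant.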
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-60000")
model.eval()
```
## Training Details
| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |
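
The depth percentages in the table can be mapped to concrete layer indices with a small helper. This is a sketch only: the floor-based rounding and the 40-layer count in the example are assumptions for illustration, not values taken from this card. The real layer count should be read from the base model's config.

```python
def depth_to_layer_index(num_layers, fraction):
    """Map a relative depth in [0, 1] to a zero-based layer index.

    Assumption: floor rounding, clamped to the last layer.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return min(num_layers - 1, int(num_layers * fraction))


def layer_indices(num_layers, fractions=(0.25, 0.50, 0.75)):
    """Layer indices for the 25%, 50%, and 75% depth points."""
    return [depth_to_layer_index(num_layers, f) for f in fractions]


# Hypothetical 40-layer model, used only to show the arithmetic.
print(layer_indices(40))  # [10, 20, 30]
```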
## Training Data
The oracle is trained on a mixture of:
1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features
## Related Resources
- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)