---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---

# Activation Oracle: gemma-4-31B-it

This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) into an **activation oracle** -- an LLM that can read and interpret the internal activations of other LLMs (or itself) in natural language.

## What is an activation oracle?

An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them:

- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection

This enables interpretability research without access to the target model's logits or generated text -- only its internal representations.

**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-85000")
model.eval()
```

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |

## Training Data

The oracle is trained on a mixture of:

1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features

## Related Resources

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)
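
## Sketch: Injecting an Activation and Asking a Question

The exact injection protocol (which layer, which token positions, any scaling, and the prompt format) is defined by the paper and training code, not by this card. The sketch below only illustrates the general idea: capture a hidden state from the frozen base model at roughly 50% depth, add it into the oracle's residual stream at the same layer via a forward hook, and ask one of the questions listed above. The target prompt, layer choice, additive injection at every position, and the helper `make_injection_hook` are assumptions for illustration; see the [activation_oracles](https://github.com/adamkarvonen/activation_oracles) repository for the actual pipeline. The variables `base_model`, `tokenizer`, and `model` come from the Quick Start snippet.

```python
import torch

# Illustrative sketch only: layer choice, token positions, and scaling are
# assumptions; the real protocol lives in the activation_oracles repository.

# 1. Capture a hidden state from the frozen base model on some target text.
target_prompt = "The Eiffel Tower is located in"
target_inputs = tokenizer(target_prompt, return_tensors="pt").to(base_model.device)
layer_idx = base_model.config.num_hidden_layers // 2  # ~50% depth (assumption)

with torch.no_grad():
    out = base_model(**target_inputs, output_hidden_states=True)
# hidden_states[0] is the embedding output, so layer layer_idx is at index layer_idx + 1.
captured = out.hidden_states[layer_idx + 1][:, -1, :]  # last-token activation, shape (1, d_model)

# 2. Inject the captured activation into the oracle's residual stream with a forward hook.
#    Note: this simple hook adds the vector at every position and every decoding step,
#    which may differ from the protocol used during training.
def make_injection_hook(vector):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + vector.to(hidden.dtype)  # additive steering (assumption)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# PEFT wraps the base model in place, so hooking the base model's decoder layer
# also affects the LoRA-adapted oracle. The module path is architecture-specific.
decoder_layer = base_model.model.layers[layer_idx]
handle = decoder_layer.register_forward_hook(make_injection_hook(captured))

# 3. Ask the oracle about the injected activation.
question = "What topic is the model thinking about?"
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    answer_ids = model.generate(prompt_ids, max_new_tokens=64)
print(tokenizer.decode(answer_ids[0, prompt_ids.shape[-1]:], skip_special_tokens=True))

handle.remove()  # remove the hook so later generations are unaffected
```

The same pattern extends to the other tasks listed above (next-token prediction, SAE feature detection) by changing the question; only the injected activation and the wording of the query differ.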