---
base_model: google/gemma-4-31B-it
library_name: peft
license: apache-2.0
tags:
- activation-oracles
- interpretability
- lora
- self-introspection
- sae
---
# Activation Oracle: gemma-4-31B-it
This is a **LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
into an **activation oracle** -- an LLM that can read and interpret the internal
activations of other LLMs (or itself) in natural language.
## What is an activation oracle?
An activation oracle is trained to accept another model's hidden-state activations
(injected via activation steering) and answer questions about them:
- **"What topic is the model thinking about?"** -- classification from activations
- **"What token will come next?"** -- next-token prediction from hidden states
- **"Is this SAE feature active?"** -- sparse autoencoder feature detection
This enables interpretability research without access to the target model's logits
or generated text -- only its internal representations.
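The injection mechanism can be sketched with a forward hook that overwrites a layer's output with foreign activations added in. This is a minimal illustration on a toy two-layer MLP, not the paper's actual scheme; the module, the additive-steering formula, and all names here are assumptions for demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM layer stack (hypothetical, for illustration only).
toy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# A "steering vector" standing in for an activation captured from a target model.
steering_vector = torch.randn(8)

def inject(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output;
    # here we add the foreign activation (simple additive steering).
    return output + steering_vector

handle = toy_model[0].register_forward_hook(inject)

x = torch.zeros(1, 8)
steered = toy_model(x)      # forward pass with the activation injected
handle.remove()
baseline = toy_model(x)     # same input, no injection

# The injected activation propagates and changes the downstream computation.
print(torch.allclose(steered, baseline))  # False
```

The real oracle applies this idea at the intermediate layers listed in the training details below; see the paper's repository for the exact injection code.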
**Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-31B-it",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
# Load the activation oracle LoRA
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-60000")
model.eval()
```
## Training Details
| Parameter | Value |
|-----------|-------|
| **Base model** | `google/gemma-4-31B-it` |
| **Adapter** | LoRA |
| **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
| **Activation injection** | Steering vectors at intermediate layers |
| **Layer coverage** | 25%, 50%, 75% depth |
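The 25%/50%/75% coverage maps to concrete layer indices once the model's depth is fixed. A quick sketch, assuming a hypothetical 48-layer decoder (the actual layer count of gemma-4-31B-it may differ):

```python
# Hypothetical depth for illustration; check the base model's config for the real value.
num_layers = 48

# Injection points at 25%, 50%, and 75% of model depth.
injection_layers = [int(num_layers * frac) for frac in (0.25, 0.50, 0.75)]
print(injection_layers)  # [12, 24, 36]
```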
## Training Data
The oracle is trained on a mixture of:
1. **LatentQA** -- open-ended questions about hidden states
2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
3. **PastLens** -- predicting upcoming tokens from hidden states
4. **SAE features** -- identifying active sparse autoencoder features
## Related Resources
- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
- **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)