EvilScript committed
Commit c11cbab · verified · 1 Parent(s): 8a1605c

Clarify legacy Gemma 4 incompatibilities

Files changed (1)
  1. README.md +28 -33
README.md CHANGED
@@ -12,38 +12,39 @@ tags:
  - legacy
  ---

- # OLD Activation Oracle: gemma-4-31B-it

- > **Legacy / old checkpoint**
- > This adapter predates the new activation injection standard used by the Gemma 4 SFT script.
- > Do not use it for new work unless you specifically need the legacy injection format.

- This is an **old / legacy LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
- into an **activation oracle** -- an LLM that can read and interpret the internal
- activations of other LLMs (or itself) in natural language.

- ## What is an activation oracle?

- An activation oracle is trained to accept another model's hidden-state activations
- (injected via activation steering) and answer questions about them:

- - **"What topic is the model thinking about?"** -- classification from activations
- - **"What token will come next?"** -- next-token prediction from hidden states
- - **"Is this SAE feature active?"** -- sparse autoencoder feature detection

- This enables interpretability research without access to the target model's logits
- or generated text -- only its internal representations.

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

- ## Quick Start

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
  import torch

- # Load the base model
  base_model = AutoModelForCausalLM.from_pretrained(
      "google/gemma-4-31B-it",
      torch_dtype=torch.bfloat16,
@@ -51,31 +52,25 @@ base_model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

- # Load the activation oracle LoRA
- model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it")
  model.eval()
  ```

- ## Training Details

  | Parameter | Value |
  |-----------|-------|
  | **Base model** | `google/gemma-4-31B-it` |
  | **Adapter** | LoRA |
  | **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
- | **Activation injection** | Legacy pre-Gemma-4-SFT injection format at intermediate layers |
- | **Layer coverage** | 25%, 50%, 75% depth |
-
- ## Training Data
-
- The oracle is trained on a mixture of:
-
- 1. **LatentQA** -- open-ended questions about hidden states
- 2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
- 3. **PastLens** -- predicting upcoming tokens from hidden states
- 4. **SAE features** -- identifying active sparse autoencoder features

  ## Related Resources

- - **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
  - **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)
 
  - legacy
  ---

+ # Legacy Activation Oracle: gemma-4-31B-it

+ > **Deprecated Gemma 4 checkpoint**
+ > This adapter was trained with the older generic `nl_probes/sft.py` path, not the architecture-aware `nl_probes/gemma4_sft.py` path now used for Gemma 4.
+ > It does **not** follow the current Gemma 4 injection standard and should not be used for new experiments or for the `probabilistic_activation_oracles` taboo pipeline.

+ This is a legacy LoRA adapter for [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
+ It is kept for historical comparison only.

+ ## Why This Repo Is Legacy

+ This adapter predates the Gemma-4-specific training path added in this repo.
+ The main incompatibilities are:

+ - **Legacy training entrypoint**: it was trained with `nl_probes/sft.py`, while current Gemma 4 oracles are trained with `nl_probes/gemma4_sft.py`.
+ - **Wrong oracle-side injection layer for the current standard**: this adapter used `hook_onto_layer=1`, while the current Gemma 4 recipe injects at the first full-attention layer, which is layer `5` for this base model.
+ - **Legacy read-layer mapping**: this adapter used the generic `25/50/75%` depth mapping from the old trainer, while the current Gemma 4 path snaps those reads to real full-attention layers (see the sketch after this section).
+ - **Validation gap**: this legacy recipe produced reasonable classification-style eval curves, but this repo explicitly notes that it did **not** establish correctness for the taboo extraction pipeline in `probabilistic_activation_oracles`.

+ Because the adapter was trained on a different steering / readout distribution than the new Gemma 4 standard, it is not a suitable checkpoint for current Gemma 4 oracle work.
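To make the read-layer mismatch concrete, here is a minimal sketch of the two selection schemes. It is illustrative only: the helper names, the 48-layer count, and the every-sixth-layer full-attention pattern are assumptions for the example, not values read from `nl_probes` or the model config.

```python
# Illustrative comparison of the two read-layer selection schemes.
# All concrete numbers below are assumptions, not this repo's configs.

def legacy_read_layers(num_layers: int) -> list[int]:
    # Old trainer: generic 25/50/75% depth mapping, architecture-agnostic.
    return [int(num_layers * f) for f in (0.25, 0.50, 0.75)]

def snapped_read_layers(num_layers: int, full_attn_layers: list[int]) -> list[int]:
    # Current Gemma 4 path, as described above: snap each generic depth
    # to the nearest layer that actually uses full attention.
    return [
        min(full_attn_layers, key=lambda l: abs(l - int(num_layers * f)))
        for f in (0.25, 0.50, 0.75)
    ]

num_layers = 48  # assumed layer count, for illustration only
full_attn = list(range(5, num_layers, 6))  # assumed pattern; layer 5 first, per the note above

print(legacy_read_layers(num_layers))              # [12, 24, 36]
print(snapped_read_layers(num_layers, full_attn))  # [11, 23, 35]
```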

+ ## When To Use It

+ - Use it only if you are reproducing the earlier generic Gemma 4 oracle experiments.
+ - Do not use it as the default Gemma 4 oracle for new work.
+
+ ## Quick Start (Legacy Only)

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
  import torch

  base_model = AutoModelForCausalLM.from_pretrained(
      "google/gemma-4-31B-it",
      torch_dtype=torch.bfloat16,
  )
  tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

+ model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-legacy-gemma-4-31B-it")
  model.eval()
  ```
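As a follow-up to the Quick Start, here is a loading smoke test. Without the training-time activation-injection hooks the oracle has nothing to read, so this only checks that the adapter loads and generates; the chat-template calls are standard `transformers` API, and the question string is arbitrary.

```python
# Smoke test only: verifies loading and generation. A real oracle query
# also requires injecting the target model's activations (see the repo code).
messages = [{"role": "user", "content": "What topic is the model thinking about?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```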

+ ## Legacy Training Details

  | Parameter | Value |
  |-----------|-------|
  | **Base model** | `google/gemma-4-31B-it` |
  | **Adapter** | LoRA |
+ | **Training entrypoint** | `nl_probes/sft.py` |
  | **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
+ | **Activation injection** | Legacy generic steering setup |
+ | **Oracle hook layer** | `1` |
+ | **Read-layer selection** | Generic `25/50/75%` depth mapping |
+ | **Current Gemma 4 standard** | `nl_probes/gemma4_sft.py` with first-full-attention injection and full-attention-aware read-layer selection |
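For readers unfamiliar with the mechanism behind the **Oracle hook layer** row: injection adds a captured activation vector into a decoder layer's output while the oracle reads its prompt. The sketch below shows only the generic PyTorch forward-hook pattern, not this repo's implementation; the `model.layers` path, the dummy vector, and the token position are assumptions for illustration.

```python
import torch

def make_injection_hook(vector: torch.Tensor, position: int):
    # Add the captured activation into this layer's output hidden states
    # at a single token position (in place).
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position, :] += vector.to(device=hidden.device, dtype=hidden.dtype)
        return output
    return hook

# Dummy stand-in for an activation captured from the target model.
activation_vec = torch.randn(base_model.config.hidden_size)

hook_layer = 1  # legacy value from the table; the current recipe targets the
                # first full-attention layer (5 for this base model) instead
layer = base_model.model.layers[hook_layer]  # layer path is an assumption
handle = layer.register_forward_hook(make_injection_hook(activation_vec, position=0))
# ... run the oracle prompt here ...
handle.remove()
```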

  ## Related Resources

+ - **Gemma 4 notes in this repo**: `docs/gemma4_oracle_training_notes.md`
+ - **Internal port report**: `docs/evilscript_gemma4_report.md`
  - **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)