EvilScript committed
Commit c11cbab · verified · 1 Parent(s): 8a1605c

Clarify legacy Gemma 4 incompatibilities

Files changed (1)
  1. README.md +28 -33
README.md CHANGED
@@ -12,38 +12,39 @@ tags:
  - legacy
  ---

- # OLD Activation Oracle: gemma-4-31B-it

- > **Legacy / old checkpoint**
- > This adapter predates the new activation injection standard used by the Gemma 4 SFT script.
- > Do not use it for new work unless you specifically need the legacy injection format.

- This is an **old / legacy LoRA adapter** that turns [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)
- into an **activation oracle** -- an LLM that can read and interpret the internal
- activations of other LLMs (or itself) in natural language.

- ## What is an activation oracle?

- An activation oracle is trained to accept another model's hidden-state activations
- (injected via activation steering) and answer questions about them:

- - **"What topic is the model thinking about?"** -- classification from activations
- - **"What token will come next?"** -- next-token prediction from hidden states
- - **"Is this SAE feature active?"** -- sparse autoencoder feature detection

- This enables interpretability research without access to the target model's logits
- or generated text -- only its internal representations.

- **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)

- ## Quick Start

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
  import torch

- # Load the base model
  base_model = AutoModelForCausalLM.from_pretrained(
      "google/gemma-4-31B-it",
      torch_dtype=torch.bfloat16,
@@ -51,31 +52,25 @@ base_model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

- # Load the activation oracle LoRA
- model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it")
  model.eval()
  ```

- ## Training Details

  | Parameter | Value |
  |-----------|-------|
  | **Base model** | `google/gemma-4-31B-it` |
  | **Adapter** | LoRA |
  | **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
- | **Activation injection** | Legacy pre-Gemma-4-SFT injection format at intermediate layers |
- | **Layer coverage** | 25%, 50%, 75% depth |
-
- ## Training Data
-
- The oracle is trained on a mixture of:
-
- 1. **LatentQA** -- open-ended questions about hidden states
- 2. **Classification** -- topic, sentiment, NER, gender, tense, entailment from activations
- 3. **PastLens** -- predicting upcoming tokens from hidden states
- 4. **SAE features** -- identifying active sparse autoencoder features

  ## Related Resources

- - **Paper**: [Activation Oracles (arXiv:2512.15674)](https://arxiv.org/abs/2512.15674)
  - **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)
 
  - legacy
  ---

+ # Legacy Activation Oracle: gemma-4-31B-it

+ > **Deprecated Gemma 4 checkpoint**
+ > This adapter was trained with the older generic `nl_probes/sft.py` path, not the architecture-aware `nl_probes/gemma4_sft.py` path now used for Gemma 4.
+ > It does **not** follow the current Gemma 4 injection standard and should not be used for new experiments or for the `probabilistic_activation_oracles` taboo pipeline.

+ This is a legacy LoRA adapter for [gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
+ It is kept for historical comparison only.

+ ## Why This Repo Is Legacy

+ This adapter predates the Gemma-4-specific training path added in this repo.
+ The main incompatibilities are:

+ - **Legacy training entrypoint**: it was trained with `nl_probes/sft.py`, while current Gemma 4 oracles are trained with `nl_probes/gemma4_sft.py`.
+ - **Wrong oracle-side injection layer for the current standard**: this adapter used `hook_onto_layer=1`, while the current Gemma 4 recipe injects at the first full-attention layer, which is layer `5` for this base model.
+ - **Legacy read-layer mapping**: this adapter used the generic `25/50/75%` depth mapping from the old trainer, while the current Gemma 4 path snaps those reads to real full-attention layers (see the sketch after this section).
+ - **Validation gap**: this legacy recipe produced reasonable classification-style eval curves, but this repo explicitly notes that it did **not** establish correctness for the taboo extraction pipeline in `probabilistic_activation_oracles`.

+ Because the adapter was trained on a different steering / readout distribution than the new Gemma 4 standard, it is not a suitable checkpoint for current Gemma 4 oracle work.
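To make the read-layer mismatch concrete, here is a minimal sketch of the two selection schemes. It is illustrative only: the helper names, the 48-layer count, and the every-sixth-layer full-attention pattern are assumptions for the example, not values read from `nl_probes` or the model config.

```python
# Illustrative comparison of the two read-layer selection schemes.
# All concrete numbers below are assumptions, not this repo's configs.

def legacy_read_layers(num_layers: int) -> list[int]:
    # Old trainer: generic 25/50/75% depth mapping, architecture-agnostic.
    return [int(num_layers * f) for f in (0.25, 0.50, 0.75)]

def snapped_read_layers(num_layers: int, full_attn_layers: list[int]) -> list[int]:
    # Current Gemma 4 path, as described above: snap each generic depth
    # to the nearest layer that actually uses full attention.
    return [
        min(full_attn_layers, key=lambda l: abs(l - int(num_layers * f)))
        for f in (0.25, 0.50, 0.75)
    ]

num_layers = 48  # assumed layer count, for illustration only
full_attn = list(range(5, num_layers, 6))  # assumed pattern; layer 5 first, per the note above

print(legacy_read_layers(num_layers))              # [12, 24, 36]
print(snapped_read_layers(num_layers, full_attn))  # [11, 23, 35]
```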

+ ## When To Use It

+ - Use it only if you are reproducing the earlier generic Gemma 4 oracle experiments.
+ - Do not use it as the default Gemma 4 oracle for new work.
+
+ ## Quick Start (Legacy Only)

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
  import torch

  base_model = AutoModelForCausalLM.from_pretrained(
      "google/gemma-4-31B-it",
      torch_dtype=torch.bfloat16,
  )
  tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

+ model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-legacy-gemma-4-31B-it")
  model.eval()
  ```
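As a follow-up to the Quick Start, here is a loading smoke test. Without the training-time activation-injection hooks the oracle has nothing to read, so this only checks that the adapter loads and generates; the chat-template calls are standard `transformers` API, and the question string is arbitrary.

```python
# Smoke test only: verifies loading and generation. A real oracle query
# also requires injecting the target model's activations (see the repo code).
messages = [{"role": "user", "content": "What topic is the model thinking about?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```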

+ ## Legacy Training Details

  | Parameter | Value |
  |-----------|-------|
  | **Base model** | `google/gemma-4-31B-it` |
  | **Adapter** | LoRA |
+ | **Training entrypoint** | `nl_probes/sft.py` |
  | **Training tasks** | LatentQA, classification, PastLens (next-token), SAE features |
+ | **Activation injection** | Legacy generic steering setup |
+ | **Oracle hook layer** | `1` |
+ | **Read-layer selection** | Generic `25/50/75%` depth mapping |
+ | **Current Gemma 4 standard** | `nl_probes/gemma4_sft.py` with first-full-attention injection and full-attention-aware read-layer selection |
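For readers unfamiliar with the mechanism behind the **Oracle hook layer** row: injection adds a captured activation vector into a decoder layer's output while the oracle reads its prompt. The sketch below shows only the generic PyTorch forward-hook pattern, not this repo's implementation; the `model.layers` path, the dummy vector, and the token position are assumptions for illustration.

```python
import torch

def make_injection_hook(vector: torch.Tensor, position: int):
    # Add the captured activation into this layer's output hidden states
    # at a single token position (in place).
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position, :] += vector.to(device=hidden.device, dtype=hidden.dtype)
        return output
    return hook

# Dummy stand-in for an activation captured from the target model.
activation_vec = torch.randn(base_model.config.hidden_size)

hook_layer = 1  # legacy value from the table; the current recipe targets the
                # first full-attention layer (5 for this base model) instead
layer = base_model.model.layers[hook_layer]  # layer path is an assumption
handle = layer.register_forward_hook(make_injection_hook(activation_vec, position=0))
# ... run the oracle prompt here ...
handle.remove()
```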

  ## Related Resources

+ - **Gemma 4 notes in this repo**: `docs/gemma4_oracle_training_notes.md`
+ - **Internal port report**: `docs/evilscript_gemma4_report.md`
  - **Code**: [activation_oracles](https://github.com/adamkarvonen/activation_oracles)