---
license: mit
tags:
- mib
- causal-variable-localization
- interpretability
- featurizers
---
# PLOT-DAS: MIB Causal Variable Localization Track submission
This repo contains a PLOT-DAS submission to the [MIB Causal Variable Localization Track](https://huggingface.co/spaces/mib-bench/leaderboard).
**PLOT** = *Progressive Localization via Optimal Transport*. Two-stage Sinkhorn OT picks `(layer, token_position)` sites; DAS rotations are then trained only at those picked sites. The variant in this repo (PLOT-DAS) ships the trained rotations.
Source paper / reference implementation: <https://github.com/jchang153/causal-abstractions-ot>.
Method narrative and per-cell engineering notes: see the [project repo](https://github.com/bojro/plot-mib-submissions) (`JOURNAL.md`, `PLOT_SHORTCOMINGS.md`, `WALKTHROUGHS.md`).
## Submission contents
12 of 26 cells shipped; the other 14 (all Llama-8B cells, plus Qwen and Gemma IOI) need a ≥16 GB GPU and were deferred to cloud runs:
| Folder | Task × Model × Variable | Type |
|---|---|---|
| `4_answer_MCQA_Qwen2ForCausalLM_answer_pointer` | MCQA × Qwen-2.5-0.5B × answer_pointer | residual stream |
| `4_answer_MCQA_Qwen2ForCausalLM_answer` | MCQA × Qwen-2.5-0.5B × answer | residual stream |
| `4_answer_MCQA_Gemma2ForCausalLM_answer_pointer` | MCQA × Gemma-2-2B × answer_pointer | residual stream |
| `4_answer_MCQA_Gemma2ForCausalLM_answer` | MCQA × Gemma-2-2B × answer | residual stream |
| `ARC_easy_Gemma2ForCausalLM_answer_pointer` | ARC × Gemma-2-2B × answer_pointer | residual stream |
| `ARC_easy_Gemma2ForCausalLM_answer` | ARC × Gemma-2-2B × answer | residual stream |
| `arithmetic_Gemma2ForCausalLM_ones_carry` | arithmetic × Gemma-2-2B × ones_carry | residual stream |
| `ravel_task_Gemma2ForCausalLM_Country` | RAVEL × Gemma-2-2B × Country | residual stream |
| `ravel_task_Gemma2ForCausalLM_Continent` | RAVEL × Gemma-2-2B × Continent | residual stream |
| `ravel_task_Gemma2ForCausalLM_Language` | RAVEL × Gemma-2-2B × Language | residual stream |
| `ioi_task_GPT2LMHeadModel_output_token` | IOI × GPT-2 small × output_token | attention head |
| `ioi_task_GPT2LMHeadModel_output_position` | IOI × GPT-2 small × output_position | attention head |
| `ioi_linear_params.json` | IOI causal-model linear params (required) | metadata |
This submission qualifies for the **"best" (single-layer) leaderboard**: each cell has 2–6 picked layers, not every layer.
## Local public-test scores
Per-split scores on the public MIB test sets: max IIA for residual-stream cells, MSE for IOI (full numbers and methodology in the project repo's `RESULTS.md`):
### Residual-stream cells (IIA, higher is better)
| cell | sites | mean IIA |
|---|---|---|
| MCQA × Qwen × answer_pointer | 5 | **1.000** |
| MCQA × Qwen × answer | 3 | 0.849 |
| MCQA × Gemma × answer_pointer | 4 | 0.955 |
| MCQA × Gemma × answer | 4 | 0.908 |
| ARC × Gemma × answer_pointer | 6 | 0.884 |
| ARC × Gemma × answer | 4 | **0.999** † |
| arithmetic × Gemma × ones_carry | 2 | 0.448 (smoke settings) |
| RAVEL × Gemma × Continent | 2 | **0.856** |
| RAVEL × Gemma × Country | 2 | 0.615 |
| RAVEL × Gemma × Language | 2 | 0.629 |
### IOI cells (MSE, lower is better)
| cell | sites | MSE |
|---|---|---|
| IOI × GPT-2 × output_token | 3 heads | 5.16 |
| IOI × GPT-2 × output_position | 3 heads | 16.0 |
† **Caveat for cell 8, ARC × Gemma × answer (0.999).** This score is driven by the harness's automatic identity fallback at `L25 last_token`, a position PLOT did not pick to train. PLOT's actually-trained DAS rotations at the picked sites score 0.04–0.79 on this cell. The 0.999 is methodologically valid under the eval's scoring rules (it scores every position at the picked layers, defaulting to identity at unselected positions) but is not a direct PLOT-rotation result; the mechanism and discussion live in `PLOT_SHORTCOMINGS.md` §15 in the project repo.
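To make the fallback mechanism concrete, here is a toy sketch (plain Python; the scores are the ones quoted in the caveat, and the position labels `pos_a`/`pos_b` are hypothetical):

```python
# Toy illustration of the caveat above, not the harness's actual code.
trained_sites = {("layer_25", "pos_a"): 0.79, ("layer_25", "pos_b"): 0.04}  # PLOT-trained DAS rotations
identity_fallback = {("layer_25", "last_token"): 0.999}  # harness default at unselected positions
cell_score = max({**trained_sites, **identity_fallback}.values())
print(cell_score)  # 0.999 -- carried by the fallback, not a trained rotation
```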
## Method (one paragraph)
For each cell: Stage A is a per-OT-row Sinkhorn between abstract layer signatures and per-layer mean-aggregated neural signatures; each OT row picks its top-1 layer. Stage B does a second Sinkhorn within each Stage-A layer, between abstract rows and per-token-position neural rows, keeping `top_k ∈ {1, 2}` positions per layer. Stage C trains DAS orthogonal-rotation featurizers at the selected `(layer, position)` sites only. The output is a `Featurizer` per site, satisfying the MIB harness's invertibility contract; both OT stages are sketched below.
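A minimal numpy sketch of the two OT stages, under illustrative assumptions (cosine cost on unit-normalized signature rows, uniform Sinkhorn marginals, plan-argmax selection). None of these function names are the repo's API; the reference implementation linked above is authoritative:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    K = np.exp(-cost / reg)                       # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u = np.ones_like(a)
    for _ in range(n_iter):                       # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def stage_a(abstract_sigs, layer_sigs, reg=0.1):
    """Stage A: each abstract row picks its top-1 layer from the OT plan."""
    cost = 1.0 - abstract_sigs @ layer_sigs.T     # cosine-style cost, (n_vars, n_layers)
    return sinkhorn(cost, reg).argmax(axis=1)     # one layer index per row

def stage_b(abstract_sigs, pos_sigs, picked_layers, top_k=2, reg=0.1):
    """Stage B: within each Stage-A layer, keep top_k token positions."""
    sites = set()
    for layer in sorted(set(picked_layers)):
        cost = 1.0 - abstract_sigs @ pos_sigs[layer].T   # (n_vars, n_positions)
        plan = sinkhorn(cost, reg)
        for row, row_layer in enumerate(picked_layers):
            if row_layer != layer:
                continue                                  # row assigned elsewhere
            for pos in np.argsort(plan[row])[::-1][:top_k]:
                sites.add((layer, int(pos)))              # top_k in {1, 2} here
    return sorted(sites)                                  # Stage C trains DAS at these sites
```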
Compared to baseline DAS (which trains rotations at all 72 sites per cell), PLOT-DAS trains 2–6 sites per cell while remaining within seed-variance of DAS on 5 of 11 IIA cells we ran. The remaining cells have structural gaps documented in `PLOT_SHORTCOMINGS.md` (notably §13 for IOI signature design and §14 for RAVEL site-selection ceilings on high-cardinality outputs).
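The invertibility contract is what the orthogonal rotation buys: the inverse of the learned map is just its transpose. Below is a minimal PyTorch sketch of a DAS-style rotation featurizer with an interchange intervention at one site; the class, method names, and `interchange` helper are illustrative, not the MIB `Featurizer` API:

```python
import torch
import torch.nn as nn

class RotationFeaturizer(nn.Module):
    """DAS-style featurizer: an orthogonal rotation of the residual stream.

    Illustrative sketch; the real MIB Featurizer interface may differ.
    Orthogonality makes the map exactly invertible (inverse = transpose).
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.rotation = nn.Linear(d_model, d_model, bias=False)
        # Parametrize the weight to stay orthogonal throughout training.
        nn.utils.parametrizations.orthogonal(self.rotation)

    def featurize(self, x: torch.Tensor) -> torch.Tensor:
        return self.rotation(x)                 # x @ R^T

    def invert(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.rotation.weight         # (x @ R^T) @ R == x

def interchange(feat: RotationFeaturizer, base: torch.Tensor,
                source: torch.Tensor, k: int) -> torch.Tensor:
    """Swap the first k rotated dims from a counterfactual run into the base run."""
    fb, fs = feat.featurize(base), feat.featurize(source)
    fb = torch.cat([fs[..., :k], fb[..., k:]], dim=-1)
    return feat.invert(fb)
```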
## Reproducing locally
```bash
git clone https://github.com/bojro/plot-mib-submissions
cd plot-mib-submissions
# follow README.md "Setup from a fresh clone" → uses .venv-mib
.venv-mib/bin/python -m mib_submission.plot.run \
--task 4_answer_MCQA \
--model Qwen/Qwen2.5-0.5B \
--variable answer_pointer
```
Cells are CLI-configurable; per-task configs live in `mib_submission/plot/configs.py` (a hypothetical example follows below). The pipeline reaches all of the cells above on an 8 GB GPU; cells marked LlamaForCausalLM, and IOI on models other than GPT-2, require ≥16 GB VRAM and were not run.
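The entry below is hypothetical: a sketch of the kind of per-cell record `configs.py` would hold, shown only to indicate what is configurable per cell. The real keys may differ.

```python
# Hypothetical per-cell entry; the actual keys in mib_submission/plot/configs.py may differ.
CELLS = {
    ("4_answer_MCQA", "Qwen/Qwen2.5-0.5B", "answer_pointer"): dict(
        top_k=2,           # positions kept per Stage-A layer
        sinkhorn_reg=0.1,  # entropic regularization for both OT stages
        das_epochs=10,     # rotation-training epochs at each picked site
    ),
}
```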