| # UniSITH: Unimodal Semantic Inspection of Transformer Heads |
|
|
| **Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.** |
|
|
| Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder. |
|
|
| ## Key Idea |
|
|
| **Original SITH** analyzes CLIP's vision transformer by: |
| 1. Decomposing attention head W_VO matrices via SVD |
| 2. Projecting singular vectors into CLIP's multimodal embedding space |
| 3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder |
| |
| **UniSITH** replaces text concepts with **captioned images**: |
1. Applies the same SVD decomposition to the W_VO matrices (architecture-agnostic)
| 2. Projects singular vectors to the model's own feature space |
| 3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)) |
4. **Captions provide human interpretability**: each matched concept is an image, and its caption gives a human-readable description of the semantic content the attention head encodes
|
|
| This makes SITH-style interpretability available for **any ViT**, not just CLIP. |
|
|
| ## Supported Models |
|
|
| | Architecture | Example Models | Status | |
| |---|---|---| |
| | DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ | |
| | CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ | |
| | ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ | |
|
|
| ## Installation |
|
|
| ```bash |
| pip install torch transformers datasets scipy tqdm Pillow |
| # For CLIP models: |
| pip install open-clip-torch |
| ``` |
|
|
| ## Quick Start |
|
|
| ```python |
| from transformers import AutoModel, AutoImageProcessor |
| from datasets import load_dataset |
| from unimodal_sith import UniSITH, VisualConceptPool |
| |
| # 1. Load a unimodal vision model |
| model = AutoModel.from_pretrained("facebook/dinov2-base") |
| processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base") |
| model.eval() |
| |
| # 2. Build visual concept pool from captioned images |
| dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train") |
| pool = VisualConceptPool.from_dataset( |
| dataset=dataset, |
| model=model, |
| processor=processor, |
| architecture="dinov2", |
| image_column="image", |
| caption_column="caption", |
| max_concepts=5000, # Use more concepts for better fidelity |
| device="cuda", # GPU recommended for large pools |
| ) |
| |
| # 3. Create analyzer |
| analyzer = UniSITH( |
| model=model, |
| architecture="dinov2", |
| n_heads=12, # DINOv2-base |
| d_model=768, |
| concept_pool=pool, |
| device="cuda", |
| ) |
| |
| # 4. Analyze attention heads |
| results = analyzer.analyze_model( |
| layers=[10, 11], # Last 2 layers |
| n_singular_vectors=5, # Top-5 SVs per head |
| K=5, # 5 concepts per SV |
| lambda_coh=0.3, # COMP coherence weight |
| ) |
| |
| # 5. Inspect results |
| for layer_idx, heads in results.items(): |
| for head in heads: |
| for sv in head.singular_vectors: |
| print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:") |
| print(f" σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}") |
| for caption, coeff in zip(sv.concepts, sv.coefficients): |
| print(f" [{coeff:.4f}] {caption}") |
| ``` |
|
|
| ## CLI Usage |
|
|
| ```bash |
| python run_unisith.py \ |
| --model facebook/dinov2-large \ |
| --max-concepts 5000 \ |
| --layers 20 21 22 23 \ |
| --n-sv 5 \ |
| --K 5 \ |
| --lambda-coh 0.3 \ |
| --method comp \ |
| --device cuda \ |
| --output results/dinov2_large_analysis.json |
| ``` |
|
|
| ## How It Works |
|
|
| ### Step 1: Weight Extraction & LN Folding |
|
|
| For each attention head, we extract the Value-Output (VO) weight matrix: |
|
|
| $$W_{VO}^h = W_V^h \cdot W_O^h$$ |
| |
| Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$. |
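
As a concrete illustration, here is a minimal sketch of this step for a Hugging Face DINOv2 model (row-vector convention, biases omitted). The value/output paths follow the table under "Architecture Details" below; `encoder.layer`, `norm1`, and `layer_scale1.lambda1` are assumed attribute names for the layer list, the pre-attention LayerNorm, and LayerScale.

```python
import torch

def head_vo_matrix(model, layer_idx: int, head_idx: int, n_heads: int = 12) -> torch.Tensor:
    """Sketch: per-head W_VO with the pre-attention LN scale and LayerScale folded in."""
    layer = model.encoder.layer[layer_idx]
    w_v = layer.attention.attention.value.weight    # (d_model, d_model); HF stores (out, in)
    w_o = layer.attention.output.dense.weight       # (d_model, d_model)
    gamma = layer.norm1.weight                      # pre-attention LayerNorm scale
    ls = layer.layer_scale1.lambda1                 # DINOv2 LayerScale (assumed attribute name)

    d_model = w_v.shape[1]
    d_head = d_model // n_heads
    sl = slice(head_idx * d_head, (head_idx + 1) * d_head)

    with torch.no_grad():
        # Row-vector convention: x (1, d_model) @ Wv_h -> head values, then @ Wo_h -> residual write.
        Wv_h = (w_v[sl, :] * gamma).T               # fold LN scale into W_V; (d_model, d_head)
        Wo_h = (ls.view(-1, 1) * w_o[:, sl]).T      # fold LayerScale into W_O; (d_head, d_model)
        W_vo = Wv_h @ Wo_h                          # (d_model, d_model)

        # Project out the all-ones input direction to account for LN mean-centering.
        ones = torch.full((d_model, d_model), 1.0 / d_model)
        W_vo = W_vo - ones @ W_vo
    return W_vo
```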
| |
| ### Step 2: SVD Decomposition |
| |
| $$W_{VO} = U \Sigma V^T$$ |
|
|
| The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance. |
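
Continuing the sketch from Step 1 (same convention), the decomposition is a single call:

```python
import torch

W_vo = head_vo_matrix(model, layer_idx=11, head_idx=0)   # last layer of DINOv2-base
U, S, Vh = torch.linalg.svd(W_vo)                         # W_VO = U Σ V^T

writing_dirs = Vh[:5]   # top-5 right singular vectors (rows of V^T), i.e. the writing directions
sigmas = S[:5]          # corresponding singular values, sorted in descending order
```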
|
|
| ### Step 3: Projection to Feature Space |
|
|
Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and, for CLIP, the visual projection matrix):
|
|
| $$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$ |
| |
| Both the projected singular vectors and concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized. |
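
A rough sketch of this step for DINOv2, which has no separate projection matrix (see "Architecture Details" below). The LayerNorm bias is dropped here because it is a constant offset shared by all directions, and `pool.embeddings` is an assumed attribute holding the (N, d) matrix of concept embeddings:

```python
import torch.nn.functional as F

ln = model.layernorm                          # final LayerNorm of DINOv2
v = writing_dirs[0]                           # one writing direction from Step 2

# Standardize the direction and apply the LN scale, then unit-normalize.
v_hat = (v - v.mean()) / v.std(unbiased=False) * ln.weight
v_hat = F.normalize(v_hat, dim=-1)

# Mean-center both sides against the concept pool (gap correction) and re-normalize.
mu = pool.embeddings.mean(dim=0)
v_hat = F.normalize(v_hat - mu, dim=-1)
concept_embs = F.normalize(pool.embeddings - mu, dim=-1)
```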
| |
| ### Step 4: COMP (Coherent Orthogonal Matching Pursuit) |
| |
| Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings: |
| |
| $$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$ |
| |
| COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations. |
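
The package's exact objective may differ; the following is a simplified greedy sketch of the idea (correlation-driven selection plus a coherence bonus, with non-negativity enforced by clipping). It reuses `v_hat` and `concept_embs` from the Step 3 sketch and takes fidelity to be the cosine between the vector and its reconstruction:

```python
import torch

def comp_sketch(v_hat, concept_embs, K=5, lam=0.3):
    """Greedy OMP with a coherence bonus; a simplified stand-in for COMP."""
    selected, coeffs, residual = [], None, v_hat.clone()
    for _ in range(K):
        score = concept_embs @ residual                      # correlation with the current residual
        if selected:
            mean_sel = concept_embs[selected].mean(dim=0)
            score = score + lam * (concept_embs @ mean_sel)  # favor concepts coherent with the chosen set
            score[selected] = -float("inf")                  # never pick the same concept twice
        selected.append(int(score.argmax()))

        # Refit coefficients on the selected set, clipping to enforce non-negativity.
        G = concept_embs[selected]                           # (k, d)
        coeffs = torch.linalg.lstsq(G.T, v_hat.unsqueeze(1)).solution.squeeze(1).clamp_min(0.0)
        residual = v_hat - coeffs @ G

    recon = coeffs @ G
    fidelity = torch.dot(v_hat, recon) / recon.norm().clamp_min(1e-8)   # cosine, since v_hat is unit-norm
    return selected, coeffs, float(fidelity)
```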
| |
| ### Step 5: Model Editing (Optional) |
| |
| UniSITH supports interpretable weight-space model edits by scaling singular values: |
| - **Suppress** concepts by setting $\sigma_i \to 0$ |
| - **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$ |
|
|
| ```python |
| # Suppress the 3rd singular vector in layer 23, head 5 |
| analyzer.edit_model( |
| layer_idx=23, head_idx=5, |
| sv_indices=[2], scale_factors=[0.0] |
| ) |
| ``` |
|
|
| ## Concept Pool |
|
|
| UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool: |
| - 30,000 diverse images from COCO val2014 |
| - Each image has a **short caption** (1-sentence COCO annotation) for concept labels |
| - Each image also has a **detailed recaption** (GPT-4V dense description) for verification |
| - Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc. |
|
|
| For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories). |
|
|
| ### Why Images Instead of Text? |
|
|
| | Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) | |
| |---|---|---| |
| | **Requires** | CLIP (multimodal model) | Any ViT | |
| | **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) | |
| | **Human interpretation** | Text string directly | Caption of the matched image | |
| | **Encoding** | CLIP text encoder | Same model being analyzed | |
| | **Modality gap** | Cross-modal (text↔image) | None (same modality) | |
| | **Diversity** | Lexical diversity | Visual diversity | |
|
|
| ## Architecture Details |
|
|
| ### Weight Extraction |
|
|
| | Model | W_V source | W_O source | Final LN | Projection | |
| |---|---|---|---|---| |
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (features stay at the hidden size) |
| | CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` | |
| | ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None | |
|
|
| ### DINOv2-specific: LayerScale |
|
|
DINOv2 applies a learnable per-channel scale (LayerScale) to the output of each attention block. UniSITH folds this into $W_O$:
| |
| $$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$$ |
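
In code this amounts to a row-wise scaling of the Hugging Face weight, whose layout is (out_features, in_features); as in the Step 1 sketch, `layer_scale1.lambda1` is an assumed attribute name:

```python
layer_idx = 11  # e.g., the last layer of DINOv2-base
layer = model.encoder.layer[layer_idx]
w_o_eff = layer.layer_scale1.lambda1.view(-1, 1) * layer.attention.output.dense.weight
```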
|
|
| ## Output Format |
|
|
| Results are saved as JSON: |
|
|
| ```json |
| { |
| "23": [ |
| { |
| "layer": 23, |
| "head": 0, |
| "singular_vectors": [ |
| { |
| "sv_index": 0, |
| "singular_value": 3.17, |
| "concepts": [ |
| {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42}, |
| {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88} |
| ], |
| "fidelity": 0.45, |
| "image_ids": [53120, 196865] |
| } |
| ] |
| } |
| ] |
| } |
| ``` |
|
|
| ## Tips for Better Results |
|
|
| 1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7. |
| 2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU. |
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the sketch after this list).
4. **COMP vs top-k**: COMP gives more coherent explanations; plain top-k concept matching is faster but typically explains less of each singular vector.
| 5. **λ tuning**: Higher λ = more coherent but potentially less faithful. Default 0.3 works well. |
| 6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper. |
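
For tip 3, a cached pool call might look like the following sketch. The signature mirrors the Quick Start; `cache_path` is assumed to name a file of saved embeddings that is written on the first run and reused afterwards:

```python
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,
    cache_path="cache/dinov2_base_recap30k.pt",  # reused on later runs instead of re-encoding 30K images
    device="cuda",
)
```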
|
|
| ## Citation |
|
|
| If you use UniSITH, please cite the original SITH paper: |
|
|
| ```bibtex |
| @article{vaquero2025sith, |
| title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition}, |
| author={Vaquero, Lorenzo and others}, |
| journal={arXiv preprint arXiv:2603.24653}, |
| year={2025} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT License |
|
|