# UniSITH: Unimodal Semantic Inspection of Transformer Heads
**Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.**
Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder.
## Key Idea
**Original SITH** analyzes CLIP's vision transformer by:
1. Decomposing attention head W_VO matrices via SVD
2. Projecting singular vectors into CLIP's multimodal embedding space
3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder
**UniSITH** replaces text concepts with **captioned images**:
1. Same SVD decomposition of W_VO matrices (architecture-agnostic)
2. Projects singular vectors to the model's own feature space
3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K))
4. **Captions provide human interpretability** — each concept is an image whose caption describes what semantic content the attention head encodes
This makes SITH-style interpretability available for **any ViT**, not just CLIP.
## Supported Models
| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |
## Installation
```bash
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
```
## Quick Start
```python
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool
# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()
# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
dataset=dataset,
model=model,
processor=processor,
architecture="dinov2",
image_column="image",
caption_column="caption",
max_concepts=5000, # Use more concepts for better fidelity
device="cuda", # GPU recommended for large pools
)
# 3. Create analyzer
analyzer = UniSITH(
model=model,
architecture="dinov2",
n_heads=12, # DINOv2-base
d_model=768,
concept_pool=pool,
device="cuda",
)
# 4. Analyze attention heads
results = analyzer.analyze_model(
layers=[10, 11], # Last 2 layers
n_singular_vectors=5, # Top-5 SVs per head
K=5, # 5 concepts per SV
lambda_coh=0.3, # COMP coherence weight
)
# 5. Inspect results
for layer_idx, heads in results.items():
for head in heads:
for sv in head.singular_vectors:
print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
print(f" σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
for caption, coeff in zip(sv.concepts, sv.coefficients):
print(f" [{coeff:.4f}] {caption}")
```
## CLI Usage
```bash
python run_unisith.py \
--model facebook/dinov2-large \
--max-concepts 5000 \
--layers 20 21 22 23 \
--n-sv 5 \
--K 5 \
--lambda-coh 0.3 \
--method comp \
--device cuda \
--output results/dinov2_large_analysis.json
```
## How It Works
### Step 1: Weight Extraction & LN Folding
For each attention head, we extract the Value-Output (VO) weight matrix:
$$W_{VO}^h = W_V^h \cdot W_O^h$$
Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
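A minimal sketch of this step for HF DINOv2, using the attribute names from the table under "Architecture Details"; `layer.norm1` is assumed to be the pre-attention LayerNorm, and the per-token LN scale factor is ignored since it only rescales directions:
```python
import torch

def head_wvo(layer, head_idx: int, n_heads: int, d_model: int) -> torch.Tensor:
    """Per-head W_VO with the pre-attention LN gain folded in (row-vector convention)."""
    d_head = d_model // n_heads
    # nn.Linear stores weights as (out, in); transpose to the x @ W convention.
    w_v = layer.attention.attention.value.weight.detach().T   # (d_model, d_model)
    w_o = layer.attention.output.dense.weight.detach().T      # (d_model, d_model)
    gamma = layer.norm1.weight.detach()                        # (d_model,)

    # Fold the LN gain: channel j of the residual is scaled by gamma[j] before W_V.
    w_v = gamma.unsqueeze(1) * w_v
    # LN centering maps x to x(I - 11^T/d); project the all-ones direction out of W_V.
    w_v = w_v - w_v.mean(dim=0, keepdim=True)

    cols = slice(head_idx * d_head, (head_idx + 1) * d_head)
    w_v_h = w_v[:, cols]        # (d_model, d_head): this head's value directions
    w_o_h = w_o[cols, :]        # (d_head, d_model): this head's output directions
    return w_v_h @ w_o_h        # W_VO^h: residual stream -> residual stream
```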
### Step 2: SVD Decomposition
$$W_{VO} = U \Sigma V^T$$
The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance.
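Continuing the sketch above, the decomposition itself is a single call (`model.encoder.layer` is the assumed path to the transformer blocks in HF DINOv2, and `head_wvo` is the hypothetical helper from Step 1):
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base").eval()
layer = model.encoder.layer[11]                    # last layer of DINOv2-base

w_vo = head_wvo(layer, head_idx=0, n_heads=12, d_model=768)
u, s, vh = torch.linalg.svd(w_vo)                  # W_VO = U diag(s) V^T
writing_dirs = vh[:5]                              # top-5 writing directions v_i
sigmas = s[:5]                                     # their singular values sigma_i
```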
### Step 3: Projection to Feature Space
Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and the output projection matrix, if the model has one, as in CLIP):
$$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$
Both the projected singular vectors and concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized.
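A sketch of this projection, where `final_ln` is the model's final LayerNorm module (e.g. `model.layernorm` for DINOv2, per the table below) and `proj_weight` is the optional output projection (CLIP's `visual_projection.weight`; `None` for unimodal models):
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def project_to_feature_space(v, final_ln, proj_weight=None):
    """Map a residual-stream writing direction into the output feature space."""
    v_hat = final_ln(v)                      # LN(v)
    if proj_weight is not None:              # W_p^T LN(v); nn.Linear stores W_p as (d_out, d_model)
        v_hat = v_hat @ proj_weight.T
    return F.normalize(v_hat, dim=-1)        # unit-normalize

# Before matching, both these projected vectors and the concept embeddings are
# mean-centered across the pool and re-normalized, as described above.
```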
### Step 4: COMP (Coherent Orthogonal Matching Pursuit)
Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:
$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$
COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
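A compact sketch of that greedy selection (not the reference implementation): each step scores concepts by their match to the current residual plus a λ-weighted similarity to the already-selected concepts, then re-fits non-negative coefficients with NNLS; fidelity is taken here as the cosine between the projected vector and its reconstruction:
```python
import numpy as np
from scipy.optimize import nnls

def comp_decompose(v_hat, concepts, K=5, lambda_coh=0.3):
    """Sparse non-negative decomposition of v_hat (d,) over concepts (n, d)."""
    selected, coeffs = [], None
    residual = v_hat.copy()
    for _ in range(K):
        score = concepts @ residual                       # match to the current residual
        if selected:
            # Coherence bonus: reward concepts similar to those already chosen.
            score = score + lambda_coh * (concepts @ concepts[selected].mean(axis=0))
            score[selected] = -np.inf                     # never reselect a concept
        selected.append(int(np.argmax(score)))
        coeffs, _ = nnls(concepts[selected].T, v_hat)     # non-negative refit (c_k >= 0)
        residual = v_hat - concepts[selected].T @ coeffs
    recon = concepts[selected].T @ coeffs
    fidelity = float(v_hat @ recon / (np.linalg.norm(recon) + 1e-8))
    return selected, coeffs, fidelity
```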
### Step 5: Model Editing (Optional)
UniSITH supports interpretable weight-space model edits by scaling singular values:
- **Suppress** concepts by setting $\sigma_i \to 0$
- **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$
```python
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
layer_idx=23, head_idx=5,
sv_indices=[2], scale_factors=[0.0]
)
```
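Under the hood, such an edit amounts to rebuilding the head's map with scaled singular values; a sketch of the weight-space operation (reusing `w_vo` and its SVD from the steps above, not the exact `edit_model` internals):
```python
import torch

u, s, vh = torch.linalg.svd(w_vo)
s_edit = s.clone()
s_edit[2] = 0.0                                  # suppress the 3rd singular vector
w_vo_edited = u @ torch.diag(s_edit) @ vh        # edited head map, same shape as w_vo
```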
## Concept Pool
UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool:
- 30,000 diverse images from COCO val2014
- Each image has a **short caption** (1-sentence COCO annotation) for concept labels
- Each image also has a **detailed recaption** (GPT-4V dense description) for verification
- Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.
For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories).
### Why Images Instead of Text?
| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| **Requires** | CLIP (multimodal model) | Any ViT |
| **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| **Human interpretation** | Text string directly | Caption of the matched image |
| **Encoding** | CLIP text encoder | Same model being analyzed |
| **Modality gap** | Cross-modal (text↔image) | None (same modality) |
| **Diversity** | Lexical diversity | Visual diversity |
## Architecture Details
### Weight Extraction
| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |
### DINOv2-specific: LayerScale
DINOv2 applies a learnable per-channel scaling (LayerScale) to the attention block's output. UniSITH folds this vector $\boldsymbol{\lambda}$ into $W_O$:
$$W_O^{\text{eff}} = \text{diag}(\boldsymbol{\lambda}) \cdot W_O$$
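As a concrete sketch, assuming the HF DINOv2 module layout (where `layer.layer_scale1.lambda1` would hold the per-channel vector; treat that attribute path as an assumption):
```python
# LayerScale multiplies the attention block's output channel-wise, so folding it
# into W_O scales each output row of the nn.Linear weight (shape: out x in).
lam = layer.layer_scale1.lambda1.detach()                       # (d_model,)
w_o_eff = layer.attention.output.dense.weight.detach() * lam.unsqueeze(1)
```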
## Output Format
Results are saved as JSON:
```json
{
"23": [
{
"layer": 23,
"head": 0,
"singular_vectors": [
{
"sv_index": 0,
"singular_value": 3.17,
"concepts": [
{"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
{"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
],
"fidelity": 0.45,
"image_ids": [53120, 196865]
}
]
}
]
}
```
## Tips for Better Results
1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the example after this list).
4. **COMP vs top-k**: COMP gives more coherent explanations; top-k is faster but less complete.
5. **λ tuning**: Higher λ = more coherent but potentially less faithful. Default 0.3 works well.
6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper.
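For instance, tip 3 in practice (a sketch: `cache_path` is the keyword argument referred to above, assumed to take a local file path):
```python
pool = VisualConceptPool.from_dataset(
    dataset=dataset, model=model, processor=processor,
    architecture="dinov2", image_column="image", caption_column="caption",
    max_concepts=5000, device="cuda",
    cache_path="cache/recap_coco30k_dinov2_base.pt",  # embeddings reused on later runs
)
```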
## Citation
If you use UniSITH, please cite the original SITH paper:
```bibtex
@article{vaquero2025sith,
title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
author={Vaquero, Lorenzo and others},
journal={arXiv preprint arXiv:2603.24653},
year={2025}
}
```
## License
MIT License