# UniSITH: Unimodal Semantic Inspection of Transformer Heads

**Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.**

Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder.

## Key Idea

**Original SITH** analyzes CLIP's vision transformer by:
1. Decomposing attention head W_VO matrices via SVD
2. Projecting singular vectors into CLIP's multimodal embedding space
3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

**UniSITH** replaces text concepts with **captioned images**:
1. Performs the same SVD decomposition of W_VO matrices (architecture-agnostic)
2. Projects singular vectors into the model's own feature space
3. Matches them against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K))
4. **Uses captions for human interpretability**: each concept is an image whose caption describes what semantic content the attention head encodes

This makes SITH-style interpretability available for **any ViT**, not just CLIP.

## Supported Models

| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |

## Installation

```bash
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
```

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,   # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],        # last 2 layers
    n_singular_vectors=5,   # top-5 SVs per head
    K=5,                    # 5 concepts per SV
    lambda_coh=0.3,         # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"  [{coeff:.4f}] {caption}")
```

## CLI Usage

```bash
python run_unisith.py \
    --model facebook/dinov2-large \
    --max-concepts 5000 \
    --layers 20 21 22 23 \
    --n-sv 5 \
    --K 5 \
    --lambda-coh 0.3 \
    --method comp \
    --device cuda \
    --output results/dinov2_large_analysis.json
```

## How It Works

### Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$$W_{VO}^h = W_V^h \cdot W_O^h$$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
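
As a concrete illustration, the sketch below extracts one head's VO matrix from a Hugging Face DINOv2 checkpoint and applies both folds. It assumes the `transformers` Dinov2 module layout (`encoder.layer[i].attention.attention.value`, `attention.output.dense`, `norm1`, `layer_scale1.lambda1`); the actual `unimodal_sith` helpers are not shown here:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")
layer = model.encoder.layer[11]                 # last block of DINOv2-base
d_model, n_heads = 768, 12
d_head = d_model // n_heads

# HF Linear weights are (out_features, in_features); transpose so that a
# residual-stream row vector x acts as y = x @ W.
W_V = layer.attention.attention.value.weight.data.T   # (d_model, d_model)
W_O = layer.attention.output.dense.weight.data.T      # (d_model, d_model)

# Fold the pre-attention LN scale into W_V. LN's input-dependent 1/std only
# rescales (it does not change directions) and the bias adds a constant, so
# both can be ignored for direction analysis.
gamma = layer.norm1.weight.data                        # (d_model,)
W_V_eff = gamma[:, None] * W_V                         # == diag(gamma) @ W_V

# Project out the all-ones direction: LN centering removes the component
# along 1/sqrt(d_model).
ones = torch.full((d_model, 1), 1.0 / d_model**0.5)
W_V_eff = W_V_eff - ones @ (ones.T @ W_V_eff)

# Fold DINOv2's LayerScale (per-channel scale after attention) into W_O.
lam = layer.layer_scale1.lambda1.data                  # (d_model,)
W_O_eff = W_O * lam                                    # scales output channels

# Per-head VO matrix: head h owns a d_head slice of W_V's outputs and the
# matching slice of W_O's inputs.
h = 0
sl = slice(h * d_head, (h + 1) * d_head)
W_VO_h = W_V_eff[:, sl] @ W_O_eff[sl, :]               # (d_model, d_model)
```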

### Step 2: SVD Decomposition

$$W_{VO} = U \Sigma V^T$$

The right singular vectors $\mathbf{v}_i$ define the **writing directions**: what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance.
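
Continuing the sketch above, the decomposition is a single call to `torch.linalg.svd`, whose `Vh` rows are the right singular vectors:

```python
# W_VO_h = U @ diag(S) @ Vh
U, S, Vh = torch.linalg.svd(W_VO_h)

sigma_0 = S[0]   # largest singular value: importance of the top direction
v_0 = Vh[0]      # its right singular vector: the head's strongest writing direction
```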

### Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and, for CLIP, the visual projection matrix):

$$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$

Both the projected singular vectors and the concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized.
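
A minimal sketch of this step for DINOv2, where $W_p$ is the identity. `pool_embeddings` (the `(N, d)` matrix of concept image embeddings) is a placeholder name, and the exact centering UniSITH applies may differ:

```python
import torch.nn.functional as F

# Final-LN affine on the direction; the input-dependent 1/std and the bias
# are omitted, since only the direction matters here.
v_hat = model.layernorm.weight.data * (v_0 - v_0.mean())
v_hat = F.normalize(v_hat, dim=0)

# Mean-center both sides against the concept-pool mean, then re-normalize.
pool_mean = pool_embeddings.mean(dim=0)
v_hat = F.normalize(v_hat - pool_mean, dim=0)
concept_mat = F.normalize(pool_embeddings - pool_mean, dim=1)
```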

### Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$

COMP extends standard Orthogonal Matching Pursuit with a **coherence term** (weighted by $\lambda$) that encourages the selected concepts to be semantically related to each other, producing more interpretable explanations.
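
A minimal sketch of that selection rule, under the reading that each candidate's score is its match to the current residual plus $\lambda$ times its mean similarity to the concepts chosen so far; the clamp is a crude stand-in for a proper non-negative least-squares fit, and `comp` here is illustrative, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def comp(v_hat, concept_mat, K=5, lambda_coh=0.3):
    """Sketch of COMP: greedy OMP with a coherence bonus."""
    selected, coeffs = [], None
    residual = v_hat.clone()
    for _ in range(K):
        match = concept_mat @ residual               # correlation with the residual
        if selected:
            # mean similarity of each candidate to the already-selected concepts
            coherence = (concept_mat @ concept_mat[selected].T).mean(dim=1)
        else:
            coherence = torch.zeros_like(match)
        score = match + lambda_coh * coherence
        score[selected] = float("-inf")              # never reselect
        selected.append(int(score.argmax()))
        # re-fit coefficients on the selected set, clamped to non-negativity
        G = concept_mat[selected]                    # (k, d)
        coeffs = torch.linalg.lstsq(G.T, v_hat.unsqueeze(1)).solution.squeeze(1)
        coeffs = coeffs.clamp(min=0)
        residual = v_hat - coeffs @ G
    # fidelity: cosine between the reconstruction and the target direction
    recon = F.normalize(coeffs @ G, dim=0)
    return selected, coeffs, float(recon @ v_hat)
```

The returned `selected` indices point into the concept pool, so the corresponding captions are the human-readable explanation, and the final cosine plays the role of the `fidelity` score reported in the results.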

### Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:
- **Suppress** concepts by setting $\sigma_i \to 0$
- **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

```python
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
```

## Concept Pool

UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool:
- 30,000 diverse images from COCO val2014
- Each image has a **short caption** (one-sentence COCO annotation) used for concept labels
- Each image also has a **detailed recaption** (GPT-4V dense description) for verification
- Covers objects, scenes, animals, food, sports, indoor/outdoor settings, textures, etc.

For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images covering 1,200 LVIS categories).

### Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| **Requires** | CLIP (multimodal model) | Any ViT |
| **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| **Human interpretation** | Text string directly | Caption of the matched image |
| **Encoding** | CLIP text encoder | Same model being analyzed |
| **Modality gap** | Cross-modal (text↔image) | None (same modality) |
| **Diversity** | Lexical diversity | Visual diversity |

## Architecture Details

### Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (features stay at model width, e.g. 1024-d for `dinov2-large`) |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |

### DINOv2-specific: LayerScale

DINOv2 applies a learnable per-channel scale (LayerScale) after attention. UniSITH folds this vector $\boldsymbol{\lambda}_1$ into $W_O$:

$$W_O^{\text{eff}} = \text{diag}(\boldsymbol{\lambda}_1) \cdot W_O$$

## Output Format

Results are saved as JSON:

```json
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
```
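
For example, the file written by the CLI invocation above can be consumed directly, following the schema shown:

```python
import json

with open("results/dinov2_large_analysis.json") as f:
    results = json.load(f)

# Print the strongest concept for every singular vector in layer 23.
for head in results["23"]:
    for sv in head["singular_vectors"]:
        top = max(sv["concepts"], key=lambda c: c["coefficient"])
        print(f"head {head['head']}, sv {sv['sv_index']} "
              f"(fidelity {sv['fidelity']:.2f}): {top['caption']}")
```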

## Tips for Better Results

1. **More concepts = higher fidelity**: with 200 concepts, fidelity is ~0.3; with 5000+, expect ~0.5-0.7.
2. **Use GPU**: encoding 30K images takes ~10 min on GPU vs. hours on CPU.
3. **Cache embeddings**: set `cache_path` to avoid re-encoding on every run (see the sketch after this list).
4. **COMP vs. top-k**: COMP gives more coherent explanations; top-k is faster but less complete.
5. **λ tuning**: higher λ gives more coherent but potentially less faithful explanations; the default of 0.3 works well.
6. **Last layers are most interpretable**: focus on the last 4 layers, as in the original SITH paper.
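
A sketch of the caching from tip 3: the `cache_path` keyword comes from that tip, but its exact signature and the path used here are assumptions:

```python
# Assumed usage of `cache_path` (tip 3): embeddings are encoded once,
# written to disk, and reloaded on subsequent runs.
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,
    cache_path="cache/recap_coco_dinov2_base.pt",  # hypothetical path
    device="cuda",
)
```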

## Citation

If you use UniSITH, please cite the original SITH paper:

```bibtex
@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}
```

## License

MIT License