# UniSITH: Unimodal Semantic Inspection of Transformer Heads

**Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.**

Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder.

## Key Idea

**Original SITH** analyzes CLIP's vision transformer by:
1. Decomposing attention head W_VO matrices via SVD
2. Projecting singular vectors into CLIP's multimodal embedding space
3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

**UniSITH** replaces text concepts with **captioned images**:
1. Same SVD decomposition of W_VO matrices (architecture-agnostic)
2. Projects singular vectors to the model's own feature space
3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K))
4. **Captions provide human interpretability** — each concept is an image whose caption describes what semantic content the attention head encodes

This makes SITH-style interpretability available for **any ViT**, not just CLIP.

## Supported Models

| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |

## Installation

```bash
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
```

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,    # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],      # Last 2 layers
    n_singular_vectors=5, # Top-5 SVs per head
    K=5,                  # 5 concepts per SV
    lambda_coh=0.3,       # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"  [{coeff:.4f}] {caption}")
```

## CLI Usage

```bash
python run_unisith.py \
  --model facebook/dinov2-large \
  --max-concepts 5000 \
  --layers 20 21 22 23 \
  --n-sv 5 \
  --K 5 \
  --lambda-coh 0.3 \
  --method comp \
  --device cuda \
  --output results/dinov2_large_analysis.json
```

## How It Works

### Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$$W_{VO}^h = W_V^h \cdot W_O^h$$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
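
A minimal sketch of this step for a single DINOv2 head, assuming the Hugging Face `Dinov2Model` module layout (`norm1` and `layer_scale1.lambda1` are assumed attribute names; the value/output paths match the table under Architecture Details) and ignoring bias terms:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")
layer = model.encoder.layer[11]
d_model, n_heads = 768, 12
d_head = d_model // n_heads

# Row-vector convention (out = x @ W), so nn.Linear weights are transposed
W_V = layer.attention.attention.value.weight.detach().T   # (d_model, d_model)
W_O = layer.attention.output.dense.weight.detach().T      # (d_model, d_model)
gamma = layer.norm1.weight.detach()                        # pre-attention LN scale
ls = layer.layer_scale1.lambda1.detach()                   # DINOv2 LayerScale

# Fold LN scale into W_V and project out the all-ones (centering) direction
center = torch.eye(d_model) - torch.full((d_model, d_model), 1.0 / d_model)
W_V_eff = center @ torch.diag(gamma) @ W_V
# Fold LayerScale into W_O
W_O_eff = W_O @ torch.diag(ls)

# Per-head VO matrix: W_VO^h = W_V^h @ W_O^h
h = 0
sl = slice(h * d_head, (h + 1) * d_head)
W_VO_h = W_V_eff[:, sl] @ W_O_eff[sl, :]                   # (d_model, d_model)
```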

### Step 2: SVD Decomposition

$$W_{VO} = U \Sigma V^T$$

The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance.
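
Continuing the sketch above, the decomposition itself is a single call:

```python
# Right singular vectors (rows of Vh) are the head's writing directions
U, S, Vh = torch.linalg.svd(W_VO_h)
writing_directions = Vh[:5]   # top-5 directions in the residual stream
top_singular_values = S[:5]   # their importance weights
```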

### Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and projection matrix, if present for CLIP):

$$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$

Both the projected singular vectors and concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized.
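
For DINOv2 there is no output projection ($W_p = I$), so a hedged sketch of this step reduces to the final LayerNorm scale plus normalization (the per-vector variance scaling of LN is absorbed by the final normalization):

```python
import torch.nn.functional as F

gamma_final = model.layernorm.weight.detach()   # final LayerNorm scale
v = writing_directions[0]
v = v - v.mean()                                # LN centering
v_hat = F.normalize(gamma_final * v, dim=0)

# Before matching, v_hat and all concept embeddings are mean-centered
# across the pool and re-normalized (SITH-style gap correction).
```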

### Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$

COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
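
One plausible way to realize this greedy selection, shown here as an illustrative sketch rather than the exact COMP objective; `scipy.optimize.nnls` handles the non-negativity constraint:

```python
import numpy as np
from scipy.optimize import nnls

def comp_sketch(v_hat, concepts, K=5, lambda_coh=0.3):
    """v_hat: (d,) unit-norm vector; concepts: (N, d) unit-norm concept embeddings."""
    selected = []
    residual = v_hat.copy()
    for _ in range(K):
        fit = concepts @ residual                      # match to current residual
        if selected:                                   # coherence with picks so far
            coherence = concepts @ concepts[selected].mean(axis=0)
        else:
            coherence = np.zeros(len(concepts))
        scores = fit + lambda_coh * coherence
        scores[selected] = -np.inf                     # never pick a concept twice
        selected.append(int(np.argmax(scores)))
        # OMP step: re-fit non-negative coefficients over all selected concepts
        coeffs, _ = nnls(concepts[selected].T, v_hat)
        residual = v_hat - concepts[selected].T @ coeffs
    # One plausible fidelity measure: cosine between v_hat and its reconstruction
    recon = concepts[selected].T @ coeffs
    fidelity = float(v_hat @ recon / (np.linalg.norm(recon) + 1e-12))
    return selected, coeffs, fidelity
```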

### Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:
- **Suppress** concepts by setting $\sigma_i \to 0$
- **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

```python
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
```

## Concept Pool

UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool:
- 30,000 diverse images from COCO val2014
- Each image has a **short caption** (1-sentence COCO annotation) for concept labels
- Each image also has a **detailed recaption** (GPT-4V dense description) for verification
- Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.

For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories).

### Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| **Requires** | CLIP (multimodal model) | Any ViT |
| **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| **Human interpretation** | Text string directly | Caption of the matched image |
| **Encoding** | CLIP text encoder | Same model being analyzed |
| **Modality gap** | Cross-modal (text↔image) | None (same modality) |
| **Diversity** | Lexical diversity | Visual diversity |

## Architecture Details

### Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (1024d) |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |

### DINOv2-specific: LayerScale

DINOv2 applies a learnable per-channel scale vector (LayerScale) after the attention block. UniSITH folds this into $W_O$:

$$W_O^{\text{eff}} = \text{diag}(\boldsymbol{\lambda}_1) \cdot W_O$$

## Output Format

Results are saved as JSON:

```json
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
```

## Tips for Better Results

1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the snippet after this list).
4. **COMP vs top-k**: COMP gives more coherent explanations; top-k is faster but less complete.
5. **λ tuning**: Higher λ = more coherent but potentially less faithful. Default 0.3 works well.
6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper.
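
For tip 3, a hypothetical call (the exact `from_dataset` signature may differ; `cache_path` is assumed to name a file where the encoded pool is stored):

```python
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,
    device="cuda",
    cache_path="cache/recap_coco_dinov2_base.pt",  # reused on subsequent runs
)
```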

## Citation

If you use UniSITH, please cite the original SITH paper:

```bibtex
@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}
```

## License

MIT License