# UniSITH: Unimodal Semantic Inspection of Transformer Heads

**Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.**

Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder.

## Key Idea

**Original SITH** analyzes CLIP's vision transformer by:

1. Decomposing attention head W_VO matrices via SVD
2. Projecting singular vectors into CLIP's multimodal embedding space
3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

**UniSITH** replaces text concepts with **captioned images**:

1. Same SVD decomposition of W_VO matrices (architecture-agnostic)
2. Projects singular vectors to the model's own feature space
3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K))
4. **Captions provide human interpretability** — each concept is an image whose caption describes what semantic content the attention head encodes

This makes SITH-style interpretability available for **any ViT**, not just CLIP.

## Supported Models

| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |

## Installation

```bash
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
```

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,  # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],        # Last 2 layers
    n_singular_vectors=5,   # Top-5 SVs per head
    K=5,                    # 5 concepts per SV
    lambda_coh=0.3,         # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"    [{coeff:.4f}] {caption}")
```

## CLI Usage

```bash
python run_unisith.py \
    --model facebook/dinov2-large \
    --max-concepts 5000 \
    --layers 20 21 22 23 \
    --n-sv 5 \
    --K 5 \
    --lambda-coh 0.3 \
    --method comp \
    --device cuda \
    --output results/dinov2_large_analysis.json
```

## How It Works

### Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$$W_{VO}^h = W_V^h \cdot W_O^h$$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
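
This step only touches weight tensors, so it can be sketched without running the model. Below is a minimal illustration, assuming PyTorch `nn.Linear` conventions (`weight` stored as `[out, in]`) and a row-vector reading of $W_{VO} = W_V^h \cdot W_O^h$; the function and variable names are illustrative, not the library's API:

```python
import torch

def head_wvo(w_v, w_o, ln_gamma, head_idx, n_heads):
    """Effective W_VO for one head, with the pre-attention LN scale folded in.

    w_v, w_o : [d_model, d_model] value / output nn.Linear weights
    ln_gamma : [d_model] scale (weight) of the pre-attention LayerNorm
    (the LN bias and per-token std are ignored in this directional sketch)
    """
    d_model = w_v.shape[0]
    d_head = d_model // n_heads
    rows = slice(head_idx * d_head, (head_idx + 1) * d_head)

    # Fold the LN scale into W_V: gamma multiplies each input feature,
    # so it scales the corresponding columns of the value weight.
    w_v_eff = w_v * ln_gamma.unsqueeze(0)

    # Per-head slices: this head's value rows and the matching output columns.
    w_v_h = w_v_eff[rows, :]          # [d_head, d_model]
    w_o_h = w_o[:, rows]              # [d_model, d_head]

    # Row-vector convention (x_out = x_in @ W_VO), matching W_VO = W_V · W_O.
    w_vo = w_v_h.T @ w_o_h.T          # [d_model, d_model]

    # LN centering removes the mean of the input, so the component of W_VO
    # that reads the all-ones direction is never exercised: project it out.
    u = torch.ones(d_model) / d_model ** 0.5
    return w_vo - torch.outer(u, u @ w_vo)
```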

### Step 2: SVD Decomposition

$$W_{VO} = U \Sigma V^T$$

The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance.
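
In PyTorch this is a single call; a small sketch using a stand-in matrix (with the hypothetical `head_wvo` above, `w_vo` would be its output):

```python
import torch

w_vo = torch.randn(768, 768)  # stand-in for one head's effective W_VO

# Reduced SVD: S is sorted in descending order and the rows of Vh are the
# right singular vectors, i.e. the candidate writing directions.
U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)

top_k = 5
writing_directions = Vh[:top_k]   # directions the head writes to the residual stream
importance = S[:top_k]            # singular values used to rank them
```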

### Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and the projection matrix, when one exists, as in CLIP):

$$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$

Both the projected singular vectors and the concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized.
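
A sketch of the projection and centering for a model without an output projection (the DINOv2/ViT case); `final_ln` stands in for the model's last LayerNorm and `concept_embs` for the pooled image features, both illustrative here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
final_ln = nn.LayerNorm(d_model)         # stand-in for e.g. model.layernorm

def project_direction(v, final_ln, w_p=None):
    """LN -> optional projection -> L2 normalization, applied to a direction."""
    v_hat = final_ln(v)
    if w_p is not None:                   # CLIP only: visual_projection weight [proj, d]
        v_hat = v_hat @ w_p.T
    return F.normalize(v_hat, dim=-1)

v = torch.randn(d_model)                                         # one right singular vector
concept_embs = F.normalize(torch.randn(5000, d_model), dim=-1)   # toy concept pool

# Mean-center both sides against the concept pool and re-normalize.
mu = concept_embs.mean(dim=0)
v_hat = F.normalize(project_direction(v, final_ln) - mu, dim=-1)
concepts_hat = F.normalize(concept_embs - mu, dim=-1)
similarities = concepts_hat @ v_hat       # cosine scores against every concept
```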

### Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$

COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
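
The repository implements the exact objective; the greedy idea can be sketched as follows, where each step adds the concept with the best mix of residual correlation and similarity to the already-selected set, then refits the coefficients with a non-negativity clamp. The scoring formula below is an illustrative simplification, not necessarily the library's:

```python
import torch

def comp_select(v_hat, concepts, K=5, lambda_coh=0.3):
    """Greedy sparse non-negative decomposition with a coherence bonus.

    v_hat    : [d]    projected, centered, unit-norm singular vector
    concepts : [N, d] unit-norm concept embeddings
    """
    selected, coeffs = [], None
    residual = v_hat.clone()
    for _ in range(K):
        score = concepts @ residual                          # fit to what is left
        if selected:
            coherence = concepts @ concepts[selected].mean(0)
            score = score + lambda_coh * coherence           # prefer related concepts
        score[selected] = -float("inf")                      # never re-pick an atom
        selected.append(int(score.argmax()))

        # Least-squares refit on the selected atoms, clamped to be non-negative
        # (a simple stand-in for a proper non-negative solver).
        A = concepts[selected].T                             # [d, k]
        gram = A.T @ A + 1e-6 * torch.eye(len(selected))
        coeffs = torch.linalg.solve(gram, A.T @ v_hat).clamp(min=0)
        residual = v_hat - A @ coeffs

    recon = A @ coeffs
    fidelity = torch.dot(v_hat, recon) / recon.norm().clamp(min=1e-8)
    return selected, coeffs, fidelity
```

The selected indices map back to images in the pool, whose captions are what the Quick Start loop prints as explanations.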

### Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:

- **Suppress** concepts by setting $\sigma_i \to 0$
- **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

```python
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
```
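
Under the hood, an edit of this kind amounts to reconstructing the head's matrix with rescaled singular values. A weight-space sketch, independent of the `edit_model` API (writing the edited matrix back into the model's value/output weights is left to the library):

```python
import torch

def rescale_singular_values(w_vo, sv_indices, scale_factors):
    """Return a copy of W_VO with the chosen singular values rescaled."""
    U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)
    S = S.clone()
    for i, alpha in zip(sv_indices, scale_factors):
        S[i] = alpha * S[i]
    return U @ torch.diag(S) @ Vh

# Suppress the 3rd singular direction of a (stand-in) head matrix.
w_vo = torch.randn(768, 768)
w_vo_edited = rescale_singular_values(w_vo, sv_indices=[2], scale_factors=[0.0])
```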

## Concept Pool

UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool:

- 30,000 diverse images from COCO val2014
- Each image has a **short caption** (1-sentence COCO annotation) for concept labels
- Each image also has a **detailed recaption** (GPT-4V dense description) for verification
- Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.

For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories).

### Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| **Requires** | CLIP (multimodal model) | Any ViT |
| **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| **Human interpretation** | Text string directly | Caption of the matched image |
| **Encoding** | CLIP text encoder | Same model being analyzed |
| **Modality gap** | Cross-modal (text↔image) | None (same modality) |
| **Diversity** | Lexical diversity | Visual diversity |

## Architecture Details

### Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (1024d) |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |

### DINOv2-specific: LayerScale

DINOv2 applies a learnable per-channel scalar (LayerScale) after attention. UniSITH folds this into $W_O$:

$$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$$
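
Concretely, using the module paths from the table above (attribute names follow the Hugging Face `Dinov2Model` implementation; worth double-checking against the installed `transformers` version):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")
layer = model.encoder.layer[11]                           # last block of dinov2-base

w_v = layer.attention.attention.value.weight.detach()    # [d_model, d_model]
w_o = layer.attention.output.dense.weight.detach()       # [d_model, d_model]
ln_gamma = layer.norm1.weight.detach()                    # pre-attention LayerNorm scale
lam = layer.layer_scale1.lambda1.detach()                 # LayerScale vector, [d_model]

# Fold LayerScale into W_O: diag(lambda1) · W_O scales each output channel,
# i.e. each row of the nn.Linear weight.
w_o_eff = lam.unsqueeze(1) * w_o
```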

## Output Format

Results are saved as JSON:

```json
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
```

## Tips for Better Results

1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the sketch after this list).
4. **COMP vs. top-k**: COMP gives more coherent explanations; top-k is faster but less complete.
5. **λ tuning**: Higher λ = more coherent but potentially less faithful. The default of 0.3 works well.
6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper.
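
Tip 3 refers to a `cache_path` option for the concept pool; whatever the exact keyword, the point is to encode the 30K images once and reuse the result. A library-agnostic sketch of the same idea:

```python
import os
import torch

CACHE = "cache/recap_coco_30k_dinov2_base.pt"

def load_or_encode(encode_fn, cache_path=CACHE):
    """Encode the concept pool once, then re-load the tensor on later runs."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    embeddings = encode_fn()                      # e.g. the pool-building step
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    torch.save(embeddings, cache_path)
    return embeddings

# Toy stand-in for encoding 30K images with the model under analysis.
embeddings = load_or_encode(lambda: torch.randn(30_000, 768))
```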

## Citation

If you use UniSITH, please cite the original SITH paper:

```bibtex
@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}
```

## License

MIT License