# UniSITH: Unimodal Semantic Inspection of Transformer Heads

**Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.**

Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder.

## Key Idea

**Original SITH** analyzes CLIP's vision transformer by:

1. Decomposing attention head W_VO matrices via SVD
2. Projecting singular vectors into CLIP's multimodal embedding space
3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

**UniSITH** replaces text concepts with **captioned images**:

1. Same SVD decomposition of W_VO matrices (architecture-agnostic)
2. Projects singular vectors to the model's own feature space
3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K))
4. **Captions provide human interpretability** — each concept is an image whose caption describes what semantic content the attention head encodes

This makes SITH-style interpretability available for **any ViT**, not just CLIP.

## Supported Models

| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |

## Installation

```bash
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
```

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,  # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],        # Last 2 layers
    n_singular_vectors=5,   # Top-5 SVs per head
    K=5,                    # 5 concepts per SV
    lambda_coh=0.3,         # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"    [{coeff:.4f}] {caption}")
```

## CLI Usage

```bash
python run_unisith.py \
    --model facebook/dinov2-large \
    --max-concepts 5000 \
    --layers 20 21 22 23 \
    --n-sv 5 \
    --K 5 \
    --lambda-coh 0.3 \
    --method comp \
    --device cuda \
    --output results/dinov2_large_analysis.json
```

## How It Works

### Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$$W_{VO}^h = W_V^h \cdot W_O^h$$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
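
This step only touches weight tensors, so it can be sketched without running the model. Below is a minimal illustration, assuming PyTorch `nn.Linear` conventions (`weight` stored as `[out, in]`) and a row-vector reading of $W_{VO} = W_V^h \cdot W_O^h$; the function and variable names are illustrative, not the library's API:

```python
import torch

def head_wvo(w_v, w_o, ln_gamma, head_idx, n_heads):
    """Effective W_VO for one head, with the pre-attention LN scale folded in.

    w_v, w_o : [d_model, d_model] value / output nn.Linear weights
    ln_gamma : [d_model] scale (weight) of the pre-attention LayerNorm
    (the LN bias and per-token std are ignored in this directional sketch)
    """
    d_model = w_v.shape[0]
    d_head = d_model // n_heads
    rows = slice(head_idx * d_head, (head_idx + 1) * d_head)

    # Fold the LN scale into W_V: gamma multiplies each input feature,
    # so it scales the corresponding columns of the value weight.
    w_v_eff = w_v * ln_gamma.unsqueeze(0)

    # Per-head slices: this head's value rows and the matching output columns.
    w_v_h = w_v_eff[rows, :]          # [d_head, d_model]
    w_o_h = w_o[:, rows]              # [d_model, d_head]

    # Row-vector convention (x_out = x_in @ W_VO), matching W_VO = W_V · W_O.
    w_vo = w_v_h.T @ w_o_h.T          # [d_model, d_model]

    # LN centering removes the mean of the input, so the component of W_VO
    # that reads the all-ones direction is never exercised: project it out.
    u = torch.ones(d_model) / d_model ** 0.5
    return w_vo - torch.outer(u, u @ w_vo)
```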

### Step 2: SVD Decomposition

$$W_{VO} = U \Sigma V^T$$

The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance.
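
In PyTorch this is a single call; a small sketch using a stand-in matrix (with the hypothetical `head_wvo` above, `w_vo` would be its output):

```python
import torch

w_vo = torch.randn(768, 768)  # stand-in for one head's effective W_VO

# Reduced SVD: S is sorted in descending order and the rows of Vh are the
# right singular vectors, i.e. the candidate writing directions.
U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)

top_k = 5
writing_directions = Vh[:top_k]   # directions the head writes to the residual stream
importance = S[:top_k]            # singular values used to rank them
```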

### Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and the projection matrix, when one exists, as in CLIP):

$$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$

Both the projected singular vectors and the concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized.
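
A sketch of the projection and centering for a model without an output projection (the DINOv2/ViT case); `final_ln` stands in for the model's last LayerNorm and `concept_embs` for the pooled image features, both illustrative here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
final_ln = nn.LayerNorm(d_model)         # stand-in for e.g. model.layernorm

def project_direction(v, final_ln, w_p=None):
    """LN -> optional projection -> L2 normalization, applied to a direction."""
    v_hat = final_ln(v)
    if w_p is not None:                   # CLIP only: visual_projection weight [proj, d]
        v_hat = v_hat @ w_p.T
    return F.normalize(v_hat, dim=-1)

v = torch.randn(d_model)                                         # one right singular vector
concept_embs = F.normalize(torch.randn(5000, d_model), dim=-1)   # toy concept pool

# Mean-center both sides against the concept pool and re-normalize.
mu = concept_embs.mean(dim=0)
v_hat = F.normalize(project_direction(v, final_ln) - mu, dim=-1)
concepts_hat = F.normalize(concept_embs - mu, dim=-1)
similarities = concepts_hat @ v_hat       # cosine scores against every concept
```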

### Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$

COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
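
The repository implements the exact objective; the greedy idea can be sketched as follows, where each step adds the concept with the best mix of residual correlation and similarity to the already-selected set, then refits the coefficients with a non-negativity clamp. The scoring formula below is an illustrative simplification, not necessarily the library's:

```python
import torch

def comp_select(v_hat, concepts, K=5, lambda_coh=0.3):
    """Greedy sparse non-negative decomposition with a coherence bonus.

    v_hat    : [d]    projected, centered, unit-norm singular vector
    concepts : [N, d] unit-norm concept embeddings
    """
    selected, coeffs = [], None
    residual = v_hat.clone()
    for _ in range(K):
        score = concepts @ residual                          # fit to what is left
        if selected:
            coherence = concepts @ concepts[selected].mean(0)
            score = score + lambda_coh * coherence           # prefer related concepts
        score[selected] = -float("inf")                      # never re-pick an atom
        selected.append(int(score.argmax()))

        # Least-squares refit on the selected atoms, clamped to be non-negative
        # (a simple stand-in for a proper non-negative solver).
        A = concepts[selected].T                             # [d, k]
        gram = A.T @ A + 1e-6 * torch.eye(len(selected))
        coeffs = torch.linalg.solve(gram, A.T @ v_hat).clamp(min=0)
        residual = v_hat - A @ coeffs

    recon = A @ coeffs
    fidelity = torch.dot(v_hat, recon) / recon.norm().clamp(min=1e-8)
    return selected, coeffs, fidelity
```

The selected indices map back to images in the pool, whose captions are what the Quick Start loop prints as explanations.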

### Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:

- **Suppress** concepts by setting $\sigma_i \to 0$
- **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

```python
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
```
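
Under the hood, an edit of this kind amounts to reconstructing the head's matrix with rescaled singular values. A weight-space sketch, independent of the `edit_model` API (writing the edited matrix back into the model's value/output weights is left to the library):

```python
import torch

def rescale_singular_values(w_vo, sv_indices, scale_factors):
    """Return a copy of W_VO with the chosen singular values rescaled."""
    U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)
    S = S.clone()
    for i, alpha in zip(sv_indices, scale_factors):
        S[i] = alpha * S[i]
    return U @ torch.diag(S) @ Vh

# Suppress the 3rd singular direction of a (stand-in) head matrix.
w_vo = torch.randn(768, 768)
w_vo_edited = rescale_singular_values(w_vo, sv_indices=[2], scale_factors=[0.0])
```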

## Concept Pool

UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool:

- 30,000 diverse images from COCO val2014
- Each image has a **short caption** (1-sentence COCO annotation) for concept labels
- Each image also has a **detailed recaption** (GPT-4V dense description) for verification
- Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.

For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories).

### Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| **Requires** | CLIP (multimodal model) | Any ViT |
| **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| **Human interpretation** | Text string directly | Caption of the matched image |
| **Encoding** | CLIP text encoder | Same model being analyzed |
| **Modality gap** | Cross-modal (text↔image) | None (same modality) |
| **Diversity** | Lexical diversity | Visual diversity |

## Architecture Details

### Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (1024d) |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |

### DINOv2-specific: LayerScale

DINOv2 applies a learnable per-channel scalar (LayerScale) after attention. UniSITH folds this into $W_O$:

$$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$$
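
Concretely, using the module paths from the table above (attribute names follow the Hugging Face `Dinov2Model` implementation; worth double-checking against the installed `transformers` version):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")
layer = model.encoder.layer[11]                           # last block of dinov2-base

w_v = layer.attention.attention.value.weight.detach()    # [d_model, d_model]
w_o = layer.attention.output.dense.weight.detach()       # [d_model, d_model]
ln_gamma = layer.norm1.weight.detach()                    # pre-attention LayerNorm scale
lam = layer.layer_scale1.lambda1.detach()                 # LayerScale vector, [d_model]

# Fold LayerScale into W_O: diag(lambda1) · W_O scales each output channel,
# i.e. each row of the nn.Linear weight.
w_o_eff = lam.unsqueeze(1) * w_o
```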

## Output Format

Results are saved as JSON:

```json
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
```

## Tips for Better Results

1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the sketch after this list).
4. **COMP vs. top-k**: COMP gives more coherent explanations; top-k is faster but less complete.
5. **λ tuning**: Higher λ = more coherent but potentially less faithful. The default of 0.3 works well.
6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper.
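
Tip 3 refers to a `cache_path` option for the concept pool; whatever the exact keyword, the point is to encode the 30K images once and reuse the result. A library-agnostic sketch of the same idea:

```python
import os
import torch

CACHE = "cache/recap_coco_30k_dinov2_base.pt"

def load_or_encode(encode_fn, cache_path=CACHE):
    """Encode the concept pool once, then re-load the tensor on later runs."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    embeddings = encode_fn()                      # e.g. the pool-building step
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    torch.save(embeddings, cache_path)
    return embeddings

# Toy stand-in for encoding 30K images with the model under analysis.
embeddings = load_or_encode(lambda: torch.randn(30_000, 768))
```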

## Citation

If you use UniSITH, please cite the original SITH paper:

```bibtex
@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}
```

## License

MIT License