| # UniSITH: Unimodal Semantic Inspection of Transformer Heads |
|
|
| **Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.** |
|
|
| Adapted from [SITH](https://arxiv.org/abs/2603.24653) *(Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition")*, extended to work with **unimodal vision models** (DINOv2, ViT, etc.) that lack a text encoder. |
|
|
| ## Key Idea |
|
|
| **Original SITH** analyzes CLIP's vision transformer by: |
| 1. Decomposing attention head W_VO matrices via SVD |
| 2. Projecting singular vectors into CLIP's multimodal embedding space |
| 3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder |
| |
| **UniSITH** replaces text concepts with **captioned images**: |
1. Applies the same SVD decomposition to the W_VO matrices (architecture-agnostic)
| 2. Projects singular vectors to the model's own feature space |
| 3. Matches against image embeddings from a captioned dataset (e.g., [Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)) |
4. **Captions provide human interpretability**: each matched concept is an image, and its caption gives a human-readable description of the semantic content the attention head encodes
|
|
| This makes SITH-style interpretability available for **any ViT**, not just CLIP. |
|
|
| ## Supported Models |
|
|
| | Architecture | Example Models | Status | |
| |---|---|---| |
| | DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ | |
| | CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ | |
| | ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ | |
|
|
| ## Installation |
|
|
| ```bash |
| pip install torch transformers datasets scipy tqdm Pillow |
| # For CLIP models: |
| pip install open-clip-torch |
| ``` |
|
|
| ## Quick Start |
|
|
| ```python |
| from transformers import AutoModel, AutoImageProcessor |
| from datasets import load_dataset |
| from unimodal_sith import UniSITH, VisualConceptPool |
| |
| # 1. Load a unimodal vision model |
| model = AutoModel.from_pretrained("facebook/dinov2-base") |
| processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base") |
| model.eval() |
| |
| # 2. Build visual concept pool from captioned images |
| dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train") |
| pool = VisualConceptPool.from_dataset( |
| dataset=dataset, |
| model=model, |
| processor=processor, |
| architecture="dinov2", |
| image_column="image", |
| caption_column="caption", |
| max_concepts=5000, # Use more concepts for better fidelity |
| device="cuda", # GPU recommended for large pools |
| ) |
| |
| # 3. Create analyzer |
| analyzer = UniSITH( |
| model=model, |
| architecture="dinov2", |
| n_heads=12, # DINOv2-base |
| d_model=768, |
| concept_pool=pool, |
| device="cuda", |
| ) |
| |
| # 4. Analyze attention heads |
| results = analyzer.analyze_model( |
| layers=[10, 11], # Last 2 layers |
| n_singular_vectors=5, # Top-5 SVs per head |
| K=5, # 5 concepts per SV |
| lambda_coh=0.3, # COMP coherence weight |
| ) |
| |
| # 5. Inspect results |
| for layer_idx, heads in results.items(): |
| for head in heads: |
| for sv in head.singular_vectors: |
| print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:") |
| print(f" σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}") |
| for caption, coeff in zip(sv.concepts, sv.coefficients): |
| print(f" [{coeff:.4f}] {caption}") |
| ``` |
|
|
| ## CLI Usage |
|
|
| ```bash |
| python run_unisith.py \ |
| --model facebook/dinov2-large \ |
| --max-concepts 5000 \ |
| --layers 20 21 22 23 \ |
| --n-sv 5 \ |
| --K 5 \ |
| --lambda-coh 0.3 \ |
| --method comp \ |
| --device cuda \ |
| --output results/dinov2_large_analysis.json |
| ``` |
|
|
| ## How It Works |
|
|
| ### Step 1: Weight Extraction & LN Folding |
|
|
| For each attention head, we extract the Value-Output (VO) weight matrix: |
|
|
| $$W_{VO}^h = W_V^h \cdot W_O^h$$ |
| |
| Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$. |
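
As a concrete illustration, here is a minimal sketch of this step for a Hugging Face DINOv2 model (row-vector convention, biases omitted). The value/output paths follow the table under "Architecture Details" below; `encoder.layer`, `norm1`, and `layer_scale1.lambda1` are assumed attribute names for the layer list, the pre-attention LayerNorm, and LayerScale.

```python
import torch

def head_vo_matrix(model, layer_idx: int, head_idx: int, n_heads: int = 12) -> torch.Tensor:
    """Sketch: per-head W_VO with the pre-attention LN scale and LayerScale folded in."""
    layer = model.encoder.layer[layer_idx]
    w_v = layer.attention.attention.value.weight    # (d_model, d_model); HF stores (out, in)
    w_o = layer.attention.output.dense.weight       # (d_model, d_model)
    gamma = layer.norm1.weight                      # pre-attention LayerNorm scale
    ls = layer.layer_scale1.lambda1                 # DINOv2 LayerScale (assumed attribute name)

    d_model = w_v.shape[1]
    d_head = d_model // n_heads
    sl = slice(head_idx * d_head, (head_idx + 1) * d_head)

    with torch.no_grad():
        # Row-vector convention: x (1, d_model) @ Wv_h -> head values, then @ Wo_h -> residual write.
        Wv_h = (w_v[sl, :] * gamma).T               # fold LN scale into W_V; (d_model, d_head)
        Wo_h = (ls.view(-1, 1) * w_o[:, sl]).T      # fold LayerScale into W_O; (d_head, d_model)
        W_vo = Wv_h @ Wo_h                          # (d_model, d_model)

        # Project out the all-ones input direction to account for LN mean-centering.
        ones = torch.full((d_model, d_model), 1.0 / d_model)
        W_vo = W_vo - ones @ W_vo
    return W_vo
```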
| |
| ### Step 2: SVD Decomposition |
| |
| $$W_{VO} = U \Sigma V^T$$ |
|
|
| The right singular vectors $\mathbf{v}_i$ define the **writing directions** — what the head writes to the residual stream. The singular values $\sigma_i$ indicate importance. |
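
Continuing the sketch from Step 1 (same convention), the decomposition is a single call:

```python
import torch

W_vo = head_vo_matrix(model, layer_idx=11, head_idx=0)   # last layer of DINOv2-base
U, S, Vh = torch.linalg.svd(W_vo)                         # W_VO = U Σ V^T

writing_dirs = Vh[:5]   # top-5 right singular vectors (rows of V^T), i.e. the writing directions
sigmas = S[:5]          # corresponding singular values, sorted in descending order
```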
|
|
| ### Step 3: Projection to Feature Space |
|
|
Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and, for CLIP, the visual projection matrix):
|
|
| $$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$$ |
| |
| Both the projected singular vectors and concept embeddings are **mean-centered** (analogous to SITH's modality gap correction) and re-normalized. |
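
A rough sketch of this step for DINOv2, which has no separate projection matrix (see "Architecture Details" below). The LayerNorm bias is dropped here because it is a constant offset shared by all directions, and `pool.embeddings` is an assumed attribute holding the (N, d) matrix of concept embeddings:

```python
import torch.nn.functional as F

ln = model.layernorm                          # final LayerNorm of DINOv2
v = writing_dirs[0]                           # one writing direction from Step 2

# Standardize the direction and apply the LN scale, then unit-normalize.
v_hat = (v - v.mean()) / v.std(unbiased=False) * ln.weight
v_hat = F.normalize(v_hat, dim=-1)

# Mean-center both sides against the concept pool (gap correction) and re-normalize.
mu = pool.embeddings.mean(dim=0)
v_hat = F.normalize(v_hat - mu, dim=-1)
concept_embs = F.normalize(pool.embeddings - mu, dim=-1)
```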
| |
| ### Step 4: COMP (Coherent Orthogonal Matching Pursuit) |
| |
| Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings: |
| |
| $$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$ |
| |
| COMP extends standard Orthogonal Matching Pursuit with a **coherence term** ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations. |
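
The package's exact objective may differ; the following is a simplified greedy sketch of the idea (correlation-driven selection plus a coherence bonus, with non-negativity enforced by clipping). It reuses `v_hat` and `concept_embs` from the Step 3 sketch and takes fidelity to be the cosine between the vector and its reconstruction:

```python
import torch

def comp_sketch(v_hat, concept_embs, K=5, lam=0.3):
    """Greedy OMP with a coherence bonus; a simplified stand-in for COMP."""
    selected, coeffs, residual = [], None, v_hat.clone()
    for _ in range(K):
        score = concept_embs @ residual                      # correlation with the current residual
        if selected:
            mean_sel = concept_embs[selected].mean(dim=0)
            score = score + lam * (concept_embs @ mean_sel)  # favor concepts coherent with the chosen set
            score[selected] = -float("inf")                  # never pick the same concept twice
        selected.append(int(score.argmax()))

        # Refit coefficients on the selected set, clipping to enforce non-negativity.
        G = concept_embs[selected]                           # (k, d)
        coeffs = torch.linalg.lstsq(G.T, v_hat.unsqueeze(1)).solution.squeeze(1).clamp_min(0.0)
        residual = v_hat - coeffs @ G

    recon = coeffs @ G
    fidelity = torch.dot(v_hat, recon) / recon.norm().clamp_min(1e-8)   # cosine, since v_hat is unit-norm
    return selected, coeffs, float(fidelity)
```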
| |
| ### Step 5: Model Editing (Optional) |
| |
| UniSITH supports interpretable weight-space model edits by scaling singular values: |
| - **Suppress** concepts by setting $\sigma_i \to 0$ |
| - **Amplify** concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$ |
|
|
| ```python |
| # Suppress the 3rd singular vector in layer 23, head 5 |
| analyzer.edit_model( |
| layer_idx=23, head_idx=5, |
| sv_indices=[2], scale_factors=[0.0] |
| ) |
| ``` |
|
|
| ## Concept Pool |
|
|
| UniSITH uses **[Recap-COCO-30K](https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K)** as the default concept pool: |
| - 30,000 diverse images from COCO val2014 |
| - Each image has a **short caption** (1-sentence COCO annotation) for concept labels |
| - Each image also has a **detailed recaption** (GPT-4V dense description) for verification |
| - Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc. |
|
|
| For larger pools, consider [220k-GPT4Vision-captions-from-LIVIS](https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS) (220K images, 1200 LVIS categories). |
|
|
| ### Why Images Instead of Text? |
|
|
| | Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) | |
| |---|---|---| |
| | **Requires** | CLIP (multimodal model) | Any ViT | |
| | **Concept pool** | ConceptNet text strings (~1.35M) | Captioned images (~30K) | |
| | **Human interpretation** | Text string directly | Caption of the matched image | |
| | **Encoding** | CLIP text encoder | Same model being analyzed | |
| | **Modality gap** | Cross-modal (text↔image) | None (same modality) | |
| | **Diversity** | Lexical diversity | Visual diversity | |
|
|
| ## Architecture Details |
|
|
| ### Weight Extraction |
|
|
| | Model | W_V source | W_O source | Final LN | Projection | |
| |---|---|---|---|---| |
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (features stay at the hidden size) |
| | CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` | |
| | ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None | |
|
|
| ### DINOv2-specific: LayerScale |
|
|
DINOv2 applies a learnable per-channel scale (LayerScale) to the output of each attention block. UniSITH folds this into $W_O$:
| |
| $$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$$ |
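
In code this amounts to a row-wise scaling of the Hugging Face weight, whose layout is (out_features, in_features); as in the Step 1 sketch, `layer_scale1.lambda1` is an assumed attribute name:

```python
layer_idx = 11  # e.g., the last layer of DINOv2-base
layer = model.encoder.layer[layer_idx]
w_o_eff = layer.layer_scale1.lambda1.view(-1, 1) * layer.attention.output.dense.weight
```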
|
|
| ## Output Format |
|
|
| Results are saved as JSON: |
|
|
| ```json |
| { |
| "23": [ |
| { |
| "layer": 23, |
| "head": 0, |
| "singular_vectors": [ |
| { |
| "sv_index": 0, |
| "singular_value": 3.17, |
| "concepts": [ |
| {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42}, |
| {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88} |
| ], |
| "fidelity": 0.45, |
| "image_ids": [53120, 196865] |
| } |
| ] |
| } |
| ] |
| } |
| ``` |
|
|
| ## Tips for Better Results |
|
|
| 1. **More concepts = higher fidelity**: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7. |
| 2. **Use GPU**: Encoding 30K images takes ~10 min on GPU vs. hours on CPU. |
3. **Cache embeddings**: Set `cache_path` to avoid re-encoding on every run (see the sketch after this list).
4. **COMP vs top-k**: COMP gives more coherent explanations; plain top-k concept matching is faster but typically explains less of each singular vector.
| 5. **λ tuning**: Higher λ = more coherent but potentially less faithful. Default 0.3 works well. |
| 6. **Last layers are most interpretable**: Focus on the last 4 layers, as in the original SITH paper. |
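
For tip 3, a cached pool call might look like the following sketch. The signature mirrors the Quick Start; `cache_path` is assumed to name a file of saved embeddings that is written on the first run and reused afterwards:

```python
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,
    cache_path="cache/dinov2_base_recap30k.pt",  # reused on later runs instead of re-encoding 30K images
    device="cuda",
)
```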
|
|
| ## Citation |
|
|
| If you use UniSITH, please cite the original SITH paper: |
|
|
| ```bibtex |
| @article{vaquero2025sith, |
| title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition}, |
| author={Vaquero, Lorenzo and others}, |
| journal={arXiv preprint arXiv:2603.24653}, |
| year={2025} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT License |
|
|