UniSITH: Unimodal Semantic Inspection of Transformer Heads
Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.
Adapted from SITH (Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition"), extended to work with unimodal vision models (DINOv2, ViT, etc.) that lack a text encoder.
Key Idea
Original SITH analyzes CLIP's vision transformer by:
- Decomposing attention head W_VO matrices via SVD
- Projecting singular vectors into CLIP's multimodal embedding space
- Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder
UniSITH replaces text concepts with captioned images:
- Same SVD decomposition of W_VO matrices (architecture-agnostic)
- Projects singular vectors to the model's own feature space
- Matches against image embeddings from a captioned dataset (e.g., Recap-COCO-30K)
- Captions provide human interpretability: each concept is an image whose caption describes the semantic content the attention head encodes
This makes SITH-style interpretability available for any ViT, not just CLIP.
Supported Models
| Architecture | Example Models | Status |
|---|---|---|
| DINOv2 | facebook/dinov2-small, dinov2-base, dinov2-large | Supported |
| CLIP ViT | openai/clip-vit-base-patch16, clip-vit-large-patch14 | Supported |
| ViT | google/vit-base-patch16-224, vit-large-patch16-224 | Supported |
Installation
pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch
Quick Start
from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool
# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()
# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)
# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,  # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)
# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],       # Last 2 layers
    n_singular_vectors=5,  # Top-5 SVs per head
    K=5,                   # 5 concepts per SV
    lambda_coh=0.3,        # COMP coherence weight
)
# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"    [{coeff:.4f}] {caption}")
CLI Usage
python run_unisith.py \
    --model facebook/dinov2-large \
    --max-concepts 5000 \
    --layers 20 21 22 23 \
    --n-sv 5 \
    --K 5 \
    --lambda-coh 0.3 \
    --method comp \
    --device cuda \
    --output results/dinov2_large_analysis.json
How It Works
Step 1: Weight Extraction & LN Folding
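For each attention head $h$, we extract the Value-Output (VO) weight matrix, the product of that head's value and output projections (row-vector convention):
$$W_{VO}^{(h)} = W_V^{(h)} W_O^{(h)}$$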
Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
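The folding step can be sketched in NumPy. This is a minimal illustration, not the UniSITH implementation: the function name `build_head_wvo` is hypothetical, biases are ignored, and weights are assumed to be stored in row-vector form (`y = x @ W`), so Hugging Face `nn.Linear` weights (stored as `(out, in)`) would need to be transposed first.

```python
import numpy as np

def build_head_wvo(W_V, W_O, ln_gamma, n_heads):
    """Return per-head VO matrices with the pre-attention LayerNorm scale folded in.

    Assumed shapes (row-vector convention, y = x @ W):
      W_V, W_O: (d_model, d_model); ln_gamma: (d_model,)
    Output: (n_heads, d_model, d_model)
    """
    d_model = W_V.shape[0]
    d_head = d_model // n_heads

    # Fold the LN scale: (x_hat * gamma) @ W_V == x_hat @ (diag(gamma) @ W_V)
    W_V_folded = ln_gamma[:, None] * W_V

    # LN centering makes x_hat orthogonal to the all-ones vector, so that
    # direction of the input space can be projected out of W_V.
    centering = np.eye(d_model) - np.ones((d_model, d_model)) / d_model
    W_V_folded = centering @ W_V_folded

    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's value/output slice
        heads.append(W_V_folded[:, cols] @ W_O[cols, :])
    return np.stack(heads)
```

Summing the returned matrices over heads recovers the full attention value-output map applied to a normalized input, which is a convenient sanity check when adapting the sketch to a new architecture.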
Step 2: SVD Decomposition
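Each head's VO matrix is factorized by SVD:
$$W_{VO} = U \Sigma V^\top = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$$
The right singular vectors $\mathbf{v}_i$ define the writing directions, that is, what the head writes to the residual stream. The singular values $\sigma_i$ indicate the importance of each direction.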
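As a small illustration of this step, the snippet below decomposes a random stand-in for one head's VO matrix with `numpy.linalg.svd`. The matrix sizes are toy values chosen for the example; the row-vector convention is assumed, so the rows of `Vt` are the writing directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

# Stand-in for one head's W_VO: a product of (d_model, d_head) and
# (d_head, d_model) factors, so its rank is at most d_head.
W_VO = rng.normal(size=(d_model, d_head)) @ rng.normal(size=(d_head, d_model))

U, S, Vt = np.linalg.svd(W_VO, full_matrices=False)

# Only the first d_head singular values can be non-zero, so truncate there.
U, S, Vt = U[:, :d_head], S[:d_head], Vt[:d_head]

# Row i of Vt is the i-th right singular vector; S[i] is its singular value.
reconstruction = (U * S) @ Vt
```

Because each head's matrix has rank at most `d_head`, a head contributes at most `d_head` meaningful singular vectors, regardless of `d_model`.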
Step 3: Projection to Feature Space
Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm and, for CLIP, the visual projection matrix.
Both the projected singular vectors and concept embeddings are mean-centered (analogous to SITH's modality gap correction) and re-normalized.
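A possible reading of this step is sketched below in NumPy. The helper names are hypothetical, and two details are assumptions rather than facts from the source: that the full LayerNorm (scale and bias) is applied to each singular vector, and that each set of vectors is centered on its own mean.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Standard LayerNorm over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def center_and_normalize(X):
    """Subtract the mean over the set (rows), then L2-normalize each row."""
    X = X - X.mean(axis=0, keepdims=True)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def project_singular_vectors(V, ln_gamma, ln_beta, W_proj=None):
    """V: (n, d_model) singular vectors; W_proj: optional (d_model, d_out)."""
    feats = layer_norm(V, ln_gamma, ln_beta)
    if W_proj is not None:  # e.g. a CLIP-style projection; absent for DINOv2 / ViT
        feats = feats @ W_proj
    return center_and_normalize(feats)
```

The concept embeddings would go through `center_and_normalize` as well, so that the dot products used during matching are cosine similarities between centered vectors.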
Step 4: COMP (Coherent Orthogonal Matching Pursuit)
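Each projected singular vector $\hat{\mathbf{v}}$ is expressed as a sparse, non-negative combination of $K$ concept embeddings $\mathbf{e}_{j_k}$:
$$\hat{\mathbf{v}} \approx \sum_{k=1}^{K} c_k \mathbf{e}_{j_k}, \qquad c_k \ge 0$$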
COMP extends standard Orthogonal Matching Pursuit with a coherence term ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
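The greedy structure of such a pursuit can be sketched as follows. This is a simplified illustration, not the COMP algorithm from the SITH paper: the coherence score (mean cosine similarity to already-selected concepts), the fidelity definition (cosine similarity between the reconstruction and the target), and the function names are all assumptions. The projected-gradient solver is a dependency-free stand-in for a proper NNLS routine such as `scipy.optimize.nnls`.

```python
import numpy as np

def nnls_pg(A, v, n_iter=500):
    """Non-negative least squares, min ||A c - v|| s.t. c >= 0, by projected gradient."""
    G = A.T @ A
    b = A.T @ v
    step = 1.0 / np.linalg.eigvalsh(G)[-1]  # 1 / largest eigenvalue of G
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        c = np.maximum(0.0, c - step * (G @ c - b))
    return c

def comp_sketch(v, E, K=5, lambda_coh=0.3):
    """v: (d,) target vector; E: (n, d) unit-norm concept embeddings."""
    selected, coeffs = [], np.zeros(0)
    residual = v.copy()
    for _ in range(K):
        score = E @ residual  # how well each concept explains what is left
        if selected:
            # Assumed coherence term: similarity to the concepts picked so far.
            score = score + lambda_coh * (E @ E[selected].T).mean(axis=1)
            score[selected] = -np.inf  # never pick the same concept twice
        selected.append(int(np.argmax(score)))
        A = E[selected].T                  # (d, k) dictionary of chosen concepts
        coeffs = nnls_pg(A, v)             # refit all coefficients, non-negative
        residual = v - A @ coeffs
    recon = E[selected].T @ coeffs
    fidelity = float(recon @ v / (np.linalg.norm(recon) * np.linalg.norm(v) + 1e-12))
    return selected, coeffs, fidelity
```

Setting `lambda_coh=0.0` reduces the sketch to a non-negative orthogonal matching pursuit, which is a useful baseline when judging how much the coherence term changes the selected concepts.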
Step 5: Model Editing (Optional)
UniSITH supports interpretable weight-space model edits by scaling singular values:
- Suppress concepts by setting $\sigma_i \to 0$
- Amplify concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$
# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
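The effect of such an edit on a single VO matrix can be illustrated in NumPy. The function `scale_singular_values` is a hypothetical helper written for this example; how `edit_model` writes the result back into the model's value and output weights is not shown here.

```python
import numpy as np

def scale_singular_values(W_VO, sv_indices, scale_factors):
    """Rebuild W_VO after multiplying selected singular values by given factors."""
    U, S, Vt = np.linalg.svd(W_VO, full_matrices=False)
    S = S.copy()
    for idx, factor in zip(sv_indices, scale_factors):
        S[idx] *= factor  # 0.0 suppresses the direction, >1.0 amplifies it
    return (U * S) @ Vt
```

With a scale factor of `0.0` the rank of the matrix drops by one and the head no longer writes along the suppressed direction; all other singular directions are left unchanged.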
Concept Pool
UniSITH uses Recap-COCO-30K as the default concept pool:
- 30,000 diverse images from COCO val2014
- Each image has a short caption (1-sentence COCO annotation) for concept labels
- Each image also has a detailed recaption (GPT-4V dense description) for verification
- Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.
For larger pools, consider 220k-GPT4Vision-captions-from-LIVIS (220K images, 1200 LVIS categories).
Why Images Instead of Text?
| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|---|---|---|
| Requires | CLIP (multimodal model) | Any ViT |
| Concept pool | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| Human interpretation | Text string directly | Caption of the matched image |
| Encoding | CLIP text encoder | Same model being analyzed |
| Modality gap | Cross-modal (text↔image) | None (same modality) |
| Diversity | Lexical diversity | Visual diversity |
Architecture Details
Weight Extraction
| Model | W_V source | W_O source | Final LN | Projection |
|---|---|---|---|---|
| DINOv2 | layer.attention.attention.value.weight | layer.attention.output.dense.weight | model.layernorm | None (e.g., 1024-d features for dinov2-large) |
| CLIP ViT | layer.self_attn.v_proj.weight | layer.self_attn.out_proj.weight | vision_model.post_layernorm | visual_projection |
| ViT | layer.attention.attention.value.weight | layer.attention.output.dense.weight | model.layernorm | None |
DINOv2-specific: LayerScale
DINOv2 applies a learnable per-channel scale (LayerScale) after attention. UniSITH folds this into $W_O$ by multiplying each output channel of $W_O$ by the corresponding LayerScale entry.
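The equivalence behind this folding is easy to check numerically. The sketch below assumes the row-vector convention (`y = x @ W_O`) and ignores the output bias, which would be scaled by the same per-channel factor; `fold_layerscale` is a hypothetical helper name.

```python
import numpy as np

def fold_layerscale(W_O, layer_scale):
    """Scale each output channel (column) of W_O by its LayerScale entry."""
    return W_O * layer_scale[None, :]

rng = np.random.default_rng(0)
d_head, d_model = 4, 8
W_O = rng.normal(size=(d_head, d_model))
layer_scale = rng.normal(size=d_model)
x = rng.normal(size=d_head)

scaled_after = (x @ W_O) * layer_scale            # LayerScale applied after attention
scaled_folded = x @ fold_layerscale(W_O, layer_scale)  # same result from folded weights
```

For weights stored in the `(out, in)` layout used by `nn.Linear`, the same fold scales rows instead of columns.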
Output Format
Results are saved as JSON:
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
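A saved file in this format can be read back with the standard `json` module. The snippet below is a small sketch that uses an inline copy of the example above instead of a file on disk; `top_concepts` is a hypothetical helper that picks the highest-coefficient concept for each singular vector.

```python
import json

sample = """
{"23": [{"layer": 23, "head": 0, "singular_vectors": [
  {"sv_index": 0, "singular_value": 3.17,
   "concepts": [
     {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
     {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}],
   "fidelity": 0.45, "image_ids": [53120, 196865]}]}]}
"""

def top_concepts(results):
    """Return (layer, head, sv_index, caption, coefficient) for the strongest concept per SV."""
    rows = []
    for layer, heads in results.items():  # JSON keys are strings, e.g. "23"
        for head in heads:
            for sv in head["singular_vectors"]:
                best = max(sv["concepts"], key=lambda c: c["coefficient"])
                rows.append((int(layer), head["head"], sv["sv_index"],
                             best["caption"], best["coefficient"]))
    return rows

rows = top_concepts(json.loads(sample))
```

For a real run, `json.loads(sample)` would be replaced by loading the file passed to `--output`, for example with `json.load(open(path))`.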
Tips for Better Results
- More concepts = higher fidelity: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
- Use GPU: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
- Cache embeddings: Set cache_path to avoid re-encoding on every run.
- COMP vs top-k: COMP gives more coherent explanations; top-k is faster but less complete.
- λ tuning: Higher λ = more coherent but potentially less faithful. Default 0.3 works well.
- Last layers are most interpretable: Focus on the last 4 layers, as in the original SITH paper.
Citation
If you use UniSITH, please cite the original SITH paper:
@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}
License
MIT License