UniSITH: Unimodal Semantic Inspection of Transformer Heads

Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.

Adapted from SITH (Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition"), extended to work with unimodal vision models (DINOv2, ViT, etc.) that lack a text encoder.

Key Idea

Original SITH analyzes CLIP's vision transformer by:

Decomposing attention head W_VO matrices via SVD
Projecting singular vectors into CLIP's multimodal embedding space
Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

UniSITH replaces text concepts with captioned images:

Applies the same SVD to the W_VO matrices (architecture-agnostic)
Projects singular vectors into the model's own feature space
Matches them against image embeddings from a captioned dataset (e.g., Recap-COCO-30K)
Uses captions for human interpretability: each concept is an image whose caption describes the semantic content the attention head encodes

This makes SITH-style interpretability available for any ViT, not just CLIP.
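
The matching idea can be illustrated with a plain nearest-neighbour lookup. This is a minimal sketch, not the UniSITH implementation: `top_k_captions`, the toy embeddings, and the toy captions are all made up for illustration, and it assumes the direction and the image embeddings already live in the same feature space.

```python
import numpy as np

def top_k_captions(direction, image_embeddings, captions, k=3):
    """Return (caption, cosine similarity) for the k closest image embeddings."""
    direction = direction / np.linalg.norm(direction)
    emb = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = emb @ direction
    order = np.argsort(-sims)[:k]  # indices sorted by decreasing similarity
    return [(captions[i], float(sims[i])) for i in order]

# Toy concept pool: three "images" in a 3-d feature space (hypothetical data).
captions = ["a pizza on a table", "a dog on a beach", "a slice of cake"]
image_embeddings = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.9, 0.0, 0.4]])
matches = top_k_captions(np.array([1.0, 0.0, 0.0]), image_embeddings, captions, k=2)
for caption, sim in matches:
    print(f"[{sim:.3f}] {caption}")
```

Each returned caption serves as the human-readable label for the direction being inspected.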

Supported Models

| Architecture | Example Models | Status |
| --- | --- | --- |
| DINOv2 | `facebook/dinov2-small`, `dinov2-base`, `dinov2-large` | ✅ |
| CLIP ViT | `openai/clip-vit-base-patch16`, `clip-vit-large-patch14` | ✅ |
| ViT | `google/vit-base-patch16-224`, `vit-large-patch16-224` | ✅ |

Installation

pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch

Quick Start

from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,    # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],      # Last 2 layers
    n_singular_vectors=5, # Top-5 SVs per head
    K=5,                  # 5 concepts per SV
    lambda_coh=0.3,       # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  σ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"  [{coeff:.4f}] {caption}")

CLI Usage

python run_unisith.py \
  --model facebook/dinov2-large \
  --max-concepts 5000 \
  --layers 20 21 22 23 \
  --n-sv 5 \
  --K 5 \
  --lambda-coh 0.3 \
  --method comp \
  --device cuda \
  --output results/dinov2_large_analysis.json

How It Works

Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$W_{VO}^h = W_V^h \cdot W_O^h$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
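
A minimal NumPy sketch of this step, under stated assumptions: weights are laid out as (in, out) so that a row vector is transformed as `x @ W_V @ W_O` (PyTorch `nn.Linear` weights are stored as (out, in) and would need a transpose first), the LayerNorm bias is ignored, and `head_vo_matrices` is a hypothetical helper rather than a UniSITH function.

```python
import numpy as np

def head_vo_matrices(w_v, w_o, ln_gamma, n_heads):
    """Per-head VO matrices with the pre-attention LayerNorm folded into W_V.

    w_v, w_o: (d_model, d_model) arrays in (in, out) layout.
    ln_gamma: (d_model,) scale of the pre-attention LayerNorm.
    """
    d_model = w_v.shape[0]
    d_head = d_model // n_heads
    # LN centering is a projection that removes the all-ones direction.
    center = np.eye(d_model) - np.full((d_model, d_model), 1.0 / d_model)
    # Fold the centering and the LN scale into W_V.
    w_v_folded = center @ (ln_gamma[:, None] * w_v)
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        # W_VO^h = W_V^h @ W_O^h, a (d_model, d_model) matrix of rank <= d_head.
        heads.append(w_v_folded[:, cols] @ w_o[cols, :])
    return np.stack(heads)

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
w_vo = head_vo_matrices(
    rng.normal(size=(d_model, d_model)),
    rng.normal(size=(d_model, d_model)),
    rng.normal(size=d_model),
    n_heads,
)
print(w_vo.shape)
```

Because the centering projection is folded in, a constant input (a multiple of the all-ones vector) maps to zero under every head's matrix.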

Step 2: SVD Decomposition

$W_{VO} = U \Sigma V^T$

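The right singular vectors $\mathbf{v}_i$ (the columns of $V$) are the writing directions, i.e. the directions the head adds to the residual stream. Each singular value $\sigma_i$ measures how strongly the head writes along $\mathbf{v}_i$. Since a head's $W_{VO}$ has rank at most the head dimension, at most that many singular values are non-zero.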
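
As a small illustration, the sketch below decomposes a random low-rank stand-in for one head's $W_{VO}$ with `numpy.linalg.svd`. The matrix sizes and variable names are illustrative assumptions, not values taken from UniSITH.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4
# Stand-in for one head's W_VO: a (d_model, d_model) matrix of rank <= d_head.
w_vo = rng.normal(size=(d_model, d_head)) @ rng.normal(size=(d_head, d_model))

u, sigma, vt = np.linalg.svd(w_vo, full_matrices=False)
sigma_top = sigma[:d_head]       # the only non-negligible singular values
writing_dirs = vt[:d_head]       # rows are the right singular vectors v_i
left_dirs = u[:, :d_head]        # columns are the matching left singular vectors

# The top d_head triplets are enough to reconstruct W_VO.
recon = (left_dirs * sigma_top) @ writing_dirs
print(np.allclose(recon, w_vo))
```

NumPy returns the singular values in descending order, so the first rows of `vt` are the directions with the largest $\sigma_i$.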

Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream into the model's output feature space by applying the final LayerNorm and, for models that have one (such as CLIP), the projection matrix $W_p$:

$\hat{\mathbf{v}} = \text{norm}(W_p^T \cdot \text{LN}(\mathbf{v}))$

Both the projected singular vectors and concept embeddings are mean-centered (analogous to SITH's modality gap correction) and re-normalized.
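
The projection and centering can be sketched as follows. The helper names are hypothetical, the LayerNorm parameters and projection matrix are random stand-ins, and the sketch assumes each set (singular vectors, concept embeddings) is centered by its own mean, which the text above does not spell out.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def project(vectors, gamma, beta, w_p=None):
    """Final LN, optional projection (CLIP-style), then L2 normalization."""
    out = layer_norm(vectors, gamma, beta)
    if w_p is not None:
        out = out @ w_p  # w_p has shape (d_model, d_out)
    return unit(out)

def center_and_renormalize(x):
    # Assumption: subtract the set's own mean, then restore unit length.
    return unit(x - x.mean(axis=0, keepdims=True))

rng = np.random.default_rng(0)
d_model, d_out = 16, 8
singular_vectors = rng.normal(size=(5, d_model))
w_p = rng.normal(size=(d_model, d_out))
concept_emb = unit(rng.normal(size=(100, d_out)))  # stand-in image embeddings

v_hat = center_and_renormalize(
    project(singular_vectors, np.ones(d_model), np.zeros(d_model), w_p)
)
c_hat = center_and_renormalize(concept_emb)
print(v_hat.shape, c_hat.shape)
```

For a model without a projection head, passing `w_p=None` leaves the vectors in the hidden-size space.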

Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$

COMP extends standard Orthogonal Matching Pursuit with a coherence term, weighted by $\lambda$ (`lambda_coh` in the API, `--lambda-coh` on the CLI), that encourages the selected concepts to be semantically related to each other, producing more interpretable explanations.
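
The sketch below shows one way such a pursuit could look. It is a simplified stand-in, not the COMP algorithm from the SITH paper: the scoring rule (residual correlation plus $\lambda$ times the mean similarity to already selected concepts) is an assumption, `comp_sketch` is a hypothetical name, and the non-negative refit uses `scipy.optimize.nnls`.

```python
import numpy as np
from scipy.optimize import nnls

def comp_sketch(v, concepts, k=5, lambda_coh=0.3):
    """Greedy non-negative matching pursuit with a coherence bonus.

    v: (d,) unit target vector; concepts: (n, d) unit concept embeddings.
    Returns the selected concept indices and their non-negative coefficients.
    """
    selected, coeffs = [], np.zeros(0)
    residual = v.copy()
    for _ in range(k):
        score = concepts @ residual
        if selected:
            # Coherence bonus: mean similarity to the concepts chosen so far.
            score = score + lambda_coh * (concepts @ concepts[selected].T).mean(axis=1)
            score[selected] = -np.inf  # never pick the same concept twice
        selected.append(int(np.argmax(score)))
        # Refit all coefficients jointly under the constraint c_k >= 0.
        coeffs, _ = nnls(concepts[selected].T, v)
        residual = v - coeffs @ concepts[selected]
    return selected, coeffs

# Toy check: a target built from two known concepts (synthetic data).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(200, 256))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
target = 0.7 * concepts[3] + 0.3 * concepts[10]
target /= np.linalg.norm(target)

selected, coeffs = comp_sketch(target, concepts, k=2)
print(selected, coeffs)
```

Setting `lambda_coh=0.0` reduces the sketch to plain non-negative orthogonal matching pursuit, which makes the effect of the coherence term easy to compare.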

Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:

Suppress concepts by setting $\sigma_i \to 0$
Amplify concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
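
At the matrix level, such an edit amounts to rescaling singular values and multiplying the factors back together. The sketch below shows that operation on a random matrix; `scale_singular_values` is a hypothetical helper, and how `edit_model` writes the result back into the model's weights is not covered here.

```python
import numpy as np

def scale_singular_values(w_vo, sv_indices, scale_factors):
    """Return a copy of w_vo with the chosen singular values rescaled."""
    u, sigma, vt = np.linalg.svd(w_vo, full_matrices=False)
    sigma = sigma.copy()
    for i, alpha in zip(sv_indices, scale_factors):
        sigma[i] *= alpha  # alpha = 0 suppresses, alpha > 1 amplifies
    return (u * sigma) @ vt

rng = np.random.default_rng(0)
w_vo = rng.normal(size=(8, 8))
_, _, vt = np.linalg.svd(w_vo, full_matrices=False)

# Suppress the 3rd singular vector (index 2).
edited = scale_singular_values(w_vo, sv_indices=[2], scale_factors=[0.0])

# The suppressed direction no longer contributes; the others are unchanged.
print(np.allclose(edited @ vt[2], 0.0), np.allclose(edited @ vt[0], w_vo @ vt[0]))
```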

Concept Pool

UniSITH uses Recap-COCO-30K as the default concept pool:

30,000 diverse images from COCO val2014
Each image has a short caption (a one-sentence COCO annotation) used as the concept label
Each image also has a detailed recaption (GPT-4V dense description) for verification
Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.

For larger pools, consider 220k-GPT4Vision-captions-from-LIVIS (220K images, 1200 LVIS categories).

Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
| --- | --- | --- |
| Requires | CLIP (multimodal model) | Any ViT |
| Concept pool | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| Human interpretation | Text string directly | Caption of the matched image |
| Encoding | CLIP text encoder | Same model being analyzed |
| Modality gap | Cross-modal (text↔image) | None (same modality) |
| Diversity | Lexical diversity | Visual diversity |

Architecture Details

Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
| --- | --- | --- | --- | --- |
| DINOv2 | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None (features keep the hidden size, e.g. 1024 for `dinov2-large`) |
| CLIP ViT | `layer.self_attn.v_proj.weight` | `layer.self_attn.out_proj.weight` | `vision_model.post_layernorm` | `visual_projection` |
| ViT | `layer.attention.attention.value.weight` | `layer.attention.output.dense.weight` | `model.layernorm` | None |

DINOv2-specific: LayerScale

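DINOv2 multiplies the attention output by a learnable per-channel scale vector $\lambda_1$ (LayerScale). UniSITH folds this into $W_O$. With $W_O$ stored in PyTorch's (out_features, in_features) layout, this scales each output row (note that $\lambda_1$ is unrelated to the COMP coherence weight $\lambda$):

$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$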
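
The equivalence behind this folding can be checked numerically. The sketch uses random stand-in weights in PyTorch's (out, in) layout, so a row vector is transformed as `x @ w_o.T`. The variable names are illustrative, and the output bias is folded in the same way for completeness.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
w_o = rng.normal(size=(d_model, d_model))  # (out, in) layout
b_o = rng.normal(size=d_model)
lam = rng.normal(size=d_model)             # LayerScale vector, one entry per output channel
x = rng.normal(size=(3, d_model))

# Reference: output projection followed by LayerScale.
reference = (x @ w_o.T + b_o) * lam

# Folded: scale the rows (output channels) of W_O and the bias once.
w_o_eff = lam[:, None] * w_o               # equals np.diag(lam) @ w_o
b_o_eff = lam * b_o
folded = x @ w_o_eff.T + b_o_eff

print(np.allclose(reference, folded))
```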

Output Format

Results are saved as JSON:

{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
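
A saved file in this format can be summarized with the standard library alone. In the sketch below, `top_concepts` is a hypothetical helper and the inline JSON simply repeats the example above. For a real run you would read the file passed to `--output` instead.

```python
import json

def top_concepts(results, min_fidelity=0.0):
    """One row per singular vector: (layer, head, sv_index, best caption)."""
    rows = []
    for layer, heads in results.items():
        for head in heads:
            for sv in head["singular_vectors"]:
                if sv["fidelity"] < min_fidelity:
                    continue
                # The concept with the largest coefficient is the main label.
                best = max(sv["concepts"], key=lambda c: c["coefficient"])
                rows.append((int(layer), head["head"], sv["sv_index"], best["caption"]))
    return rows

example = json.loads("""
{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
""")

rows = top_concepts(example, min_fidelity=0.4)
for row in rows:
    print(row)
```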

Tips for Better Results

More concepts = higher fidelity: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
Use GPU: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
Cache embeddings: Set `cache_path` to avoid re-encoding on every run.
COMP vs top-k: COMP gives more coherent explanations; top-k is faster but less complete.
λ tuning: Higher λ = more coherent but potentially less faithful. Default 0.3 works well.
Last layers are most interpretable: Focus on the last 4 layers, as in the original SITH paper.
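
The caching tip boils down to a load-or-compute pattern. The sketch below is a generic illustration with NumPy files, not UniSITH's own caching code: `load_or_encode` and `fake_encode` are hypothetical, and the on-disk format UniSITH uses for `cache_path` may differ.

```python
import os
import tempfile
import numpy as np

def load_or_encode(cache_path, encode_fn):
    """Load embeddings from cache_path if present, else compute and save them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = encode_fn()
    np.save(cache_path, embeddings)  # cache_path should end in ".npy"
    return embeddings

calls = []

def fake_encode():
    # Stand-in for the expensive image-encoding pass.
    calls.append(1)
    return np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pool.npy")
    first = load_or_encode(path, fake_encode)   # encodes and writes the cache
    second = load_or_encode(path, fake_encode)  # reads the cache instead

print(len(calls))
```

A cache like this is only valid for the model and concept pool that produced it, so the file name should encode both.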

Citation

If you use UniSITH, please cite the original SITH paper:

@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}

License

MIT License