UniSITH: Unimodal Semantic Inspection of Transformer Heads

Data-free interpretability for any Vision Transformer, using captioned images as the concept pool.

Adapted from SITH (Vaquero et al., 2025: "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition"), extended to work with unimodal vision models (DINOv2, ViT, etc.) that lack a text encoder.

Key Idea

Original SITH analyzes CLIP's vision transformer by:

  1. Decomposing attention head W_VO matrices via SVD
  2. Projecting singular vectors into CLIP's multimodal embedding space
  3. Matching them against text concepts (ConceptNet) encoded by CLIP's text encoder

UniSITH replaces text concepts with captioned images:

  1. Same SVD decomposition of W_VO matrices (architecture-agnostic)
  2. Projects singular vectors to the model's own feature space
  3. Matches against image embeddings from a captioned dataset (e.g., Recap-COCO-30K)
  4. Captions provide human interpretability: each concept is an image whose caption describes the semantic content the attention head encodes

This makes SITH-style interpretability available for any ViT, not just CLIP.

Supported Models

| Architecture | Example Models | Status |
|--------------|----------------|--------|
| DINOv2 | facebook/dinov2-small, dinov2-base, dinov2-large | ✅ |
| CLIP ViT | openai/clip-vit-base-patch16, clip-vit-large-patch14 | ✅ |
| ViT | google/vit-base-patch16-224, vit-large-patch16-224 | ✅ |

Installation

pip install torch transformers datasets scipy tqdm Pillow
# For CLIP models:
pip install open-clip-torch

Quick Start

from transformers import AutoModel, AutoImageProcessor
from datasets import load_dataset
from unimodal_sith import UniSITH, VisualConceptPool

# 1. Load a unimodal vision model
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model.eval()

# 2. Build visual concept pool from captioned images
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train")
pool = VisualConceptPool.from_dataset(
    dataset=dataset,
    model=model,
    processor=processor,
    architecture="dinov2",
    image_column="image",
    caption_column="caption",
    max_concepts=5000,  # Use more concepts for better fidelity
    device="cuda",      # GPU recommended for large pools
)

# 3. Create analyzer
analyzer = UniSITH(
    model=model,
    architecture="dinov2",
    n_heads=12,    # DINOv2-base
    d_model=768,
    concept_pool=pool,
    device="cuda",
)

# 4. Analyze attention heads
results = analyzer.analyze_model(
    layers=[10, 11],      # Last 2 layers
    n_singular_vectors=5, # Top-5 SVs per head
    K=5,                  # 5 concepts per SV
    lambda_coh=0.3,       # COMP coherence weight
)

# 5. Inspect results
for layer_idx, heads in results.items():
    for head in heads:
        for sv in head.singular_vectors:
            print(f"Layer {layer_idx}, Head {head.head_idx}, SV {sv.sv_idx}:")
            print(f"  Οƒ={sv.singular_value:.4f}, fidelity={sv.fidelity:.4f}")
            for caption, coeff in zip(sv.concepts, sv.coefficients):
                print(f"  [{coeff:.4f}] {caption}")

CLI Usage

python run_unisith.py \
  --model facebook/dinov2-large \
  --max-concepts 5000 \
  --layers 20 21 22 23 \
  --n-sv 5 \
  --K 5 \
  --lambda-coh 0.3 \
  --method comp \
  --device cuda \
  --output results/dinov2_large_analysis.json

How It Works

Step 1: Weight Extraction & LN Folding

For each attention head, we extract the Value-Output (VO) weight matrix:

$$W_{VO}^h = W_V^h \cdot W_O^h$$

Following SITH, we fold the pre-attention LayerNorm parameters into $W_V$ and project out the all-ones direction to account for LN centering. For DINOv2, we also fold the LayerScale parameter into $W_O$.
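
A hedged sketch of this step for a single DINOv2 head is shown below. The module names match the "Architecture Details" table further down, but the head slicing, LayerScale folding, and centering details are illustrative assumptions, not the exact UniSITH code.

import torch

def head_wvo(model, layer_idx, head_idx, n_heads=12):
    # Sketch only: folded W_VO for one DINOv2 attention head.
    layer = model.encoder.layer[layer_idx]
    d = layer.attention.attention.value.weight.shape[0]
    dh = d // n_heads
    with torch.no_grad():
        # Fold the pre-attention LayerNorm gain into W_V (scales input columns).
        w_v = layer.attention.attention.value.weight * layer.norm1.weight
        # Fold the LayerScale vector into W_O (DINOv2-specific, scales output rows).
        w_o = layer.layer_scale1.lambda1.unsqueeze(1) * layer.attention.output.dense.weight
        # Slice head h: its value-output rows and the matching W_O input columns.
        w_v_h = w_v[head_idx * dh:(head_idx + 1) * dh, :]      # (d_head, d_model)
        w_o_h = w_o[:, head_idx * dh:(head_idx + 1) * dh]      # (d_model, d_head)
        w_vo = (w_o_h @ w_v_h).T                               # (d_model, d_model)
        # Project out the all-ones direction to account for LayerNorm centering.
        u = torch.ones(d, 1) / d ** 0.5
        return (torch.eye(d) - u @ u.T) @ w_vo

For dinov2-base, head_wvo(model, 11, 0) would return a 768x768 matrix mapping the residual stream to the head's residual-stream contribution.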

Step 2: SVD Decomposition

$$W_{VO} = U \Sigma V^T$$

The right singular vectors $\mathbf{v}_i$ define the writing directions: what the head writes to the residual stream. The singular values $\sigma_i$ indicate their importance.
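
In code this is a single call to torch.linalg.svd on the folded matrix (continuing the hypothetical head_wvo sketch from Step 1):

U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)
writing_directions = Vh[:5]   # top-5 right singular vectors v_i (rows of V^T)
importances = S[:5]           # their singular values sigma_i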

Step 3: Projection to Feature Space

Singular vectors are projected from the residual stream to the model's output feature space using the final LayerNorm (and, for CLIP, the visual projection matrix):

$$\hat{\mathbf{v}} = \text{norm}\left(W_p^T \cdot \text{LN}(\mathbf{v})\right)$$

Both the projected singular vectors and concept embeddings are mean-centered (analogous to SITH's modality gap correction) and re-normalized.
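
A minimal sketch of this projection step, assuming a DINOv2-style model (no projection matrix) and a precomputed mean of the concept embeddings; argument names are illustrative:

import torch
import torch.nn.functional as F

def to_feature_space(v, final_ln, proj=None, concept_mean=None):
    # v: residual-stream singular vector; final_ln: e.g. model.layernorm for DINOv2;
    # proj: optional projection matrix (CLIP); concept_mean: mean of the concept pool.
    with torch.no_grad():
        x = final_ln(v)
        if proj is not None:
            x = x @ proj
        x = F.normalize(x, dim=-1)
        if concept_mean is not None:
            # Mean-centering and re-normalization, analogous to SITH's modality-gap correction.
            x = F.normalize(x - concept_mean, dim=-1)
    return x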

Step 4: COMP (Coherent Orthogonal Matching Pursuit)

Each singular vector is expressed as a sparse, non-negative combination of K concept embeddings:

$$\hat{\mathbf{v}} \approx \sum_{k=1}^K c_k \hat{\boldsymbol{\gamma}}_k, \quad c_k \geq 0$$

COMP extends standard Orthogonal Matching Pursuit with a coherence term ($\lambda$) that encourages selected concepts to be semantically related to each other, producing more interpretable explanations.
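
A simplified sketch of this matching step is shown below: greedy selection with a coherence bonus, followed by a non-negative least-squares refit (scipy.optimize.nnls). This is an illustrative reading of COMP, not the exact UniSITH scoring function.

import numpy as np
from scipy.optimize import nnls

def comp_sketch(v_hat, concepts, K=5, lam=0.3):
    # v_hat: (d,) unit target vector; concepts: (N, d) unit concept embeddings.
    selected, residual, coeffs = [], v_hat.copy(), None
    for _ in range(K):
        match = concepts @ residual                               # fit to the current residual
        coherence = concepts @ concepts[selected].mean(axis=0) if selected else 0.0
        score = match + lam * coherence
        score[selected] = -np.inf                                 # never reselect a concept
        selected.append(int(np.argmax(score)))
        coeffs, _ = nnls(concepts[selected].T, v_hat)             # non-negative refit (OMP step)
        residual = v_hat - concepts[selected].T @ coeffs
    return selected, coeffs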

Step 5: Model Editing (Optional)

UniSITH supports interpretable weight-space model edits by scaling singular values:

  • Suppress concepts by setting $\sigma_i \to 0$
  • Amplify concepts by setting $\sigma_i \to \alpha \cdot \sigma_i$

# Suppress the 3rd singular vector in layer 23, head 5
analyzer.edit_model(
    layer_idx=23, head_idx=5,
    sv_indices=[2], scale_factors=[0.0]
)
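
Conceptually, such an edit rebuilds the head's W_VO matrix from its SVD with rescaled singular values; edit_model presumably writes the equivalent change back into the model's weights. A minimal sketch of the matrix-level operation:

# Suppress the 3rd singular vector of a head's folded W_VO matrix.
U, S, Vh = torch.linalg.svd(w_vo, full_matrices=False)
scale = torch.ones_like(S)
scale[2] = 0.0                                # sigma_3 -> 0
w_vo_edited = U @ torch.diag(S * scale) @ Vh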

Concept Pool

UniSITH uses Recap-COCO-30K as the default concept pool:

  • 30,000 diverse images from COCO val2014
  • Each image has a short caption (1-sentence COCO annotation) for concept labels
  • Each image also has a detailed recaption (GPT-4V dense description) for verification
  • Covers: objects, scenes, animals, food, sports, indoor/outdoor, textures, etc.

For larger pools, consider 220k-GPT4Vision-captions-from-LIVIS (220K images, 1200 LVIS categories).
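
VisualConceptPool.from_dataset handles encoding and caching for you; if you prefer to build a pool by hand, a rough sketch (assuming the CLS token of the model being analyzed is used as the image embedding) looks like this:

import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("facebook/dinov2-base").eval().to("cuda")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dataset = load_dataset("UCSC-VLAA/Recap-COCO-30K", split="train").select(range(5000))

embeddings, captions = [], []
with torch.no_grad():
    for row in dataset:
        inputs = processor(images=row["image"], return_tensors="pt").to("cuda")
        cls = model(**inputs).last_hidden_state[:, 0]   # CLS token as the concept embedding
        embeddings.append(F.normalize(cls, dim=-1).cpu())
        captions.append(row["caption"])

# Cache embeddings and captions so the pool is not re-encoded on every run.
torch.save({"embeddings": torch.cat(embeddings), "captions": captions}, "concept_pool.pt")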

Why Images Instead of Text?

| Aspect | SITH (Text Concepts) | UniSITH (Image Concepts) |
|--------|----------------------|--------------------------|
| Requires | CLIP (multimodal model) | Any ViT |
| Concept pool | ConceptNet text strings (~1.35M) | Captioned images (~30K) |
| Human interpretation | Text string directly | Caption of the matched image |
| Encoding | CLIP text encoder | Same model being analyzed |
| Modality gap | Cross-modal (text ↔ image) | None (same modality) |
| Diversity | Lexical diversity | Visual diversity |

Architecture Details

Weight Extraction

| Model | W_V source | W_O source | Final LN | Projection |
|-------|------------|------------|----------|------------|
| DINOv2 | layer.attention.attention.value.weight | layer.attention.output.dense.weight | model.layernorm | None (1024d) |
| CLIP ViT | layer.self_attn.v_proj.weight | layer.self_attn.out_proj.weight | vision_model.post_layernorm | visual_projection |
| ViT | layer.attention.attention.value.weight | layer.attention.output.dense.weight | model.layernorm | None |

DINOv2-specific: LayerScale

DINOv2 applies a learnable per-channel scaling (LayerScale) after each attention block. UniSITH folds this into $W_O$:

$$W_O^{\text{eff}} = \text{diag}(\lambda_1) \cdot W_O$$

Output Format

Results are saved as JSON:

{
  "23": [
    {
      "layer": 23,
      "head": 0,
      "singular_vectors": [
        {
          "sv_index": 0,
          "singular_value": 3.17,
          "concepts": [
            {"caption": "A plate of cheese pizza on a table", "coefficient": 0.31, "concept_idx": 42},
            {"caption": "A chocolate cake with ice cream", "coefficient": 0.15, "concept_idx": 88}
          ],
          "fidelity": 0.45,
          "image_ids": [53120, 196865]
        }
      ]
    }
  ]
}
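
Because the output is plain JSON, it is easy to post-process. For example, a small (hypothetical) scan for heads whose matched captions mention a keyword:

import json

with open("results/dinov2_large_analysis.json") as f:
    results = json.load(f)

for layer, heads in results.items():
    for head in heads:
        for sv in head["singular_vectors"]:
            captions = [c["caption"] for c in sv["concepts"]]
            if any("pizza" in cap.lower() for cap in captions):
                print(layer, head["head"], sv["sv_index"], round(sv["fidelity"], 3), captions)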

Tips for Better Results

  1. More concepts = higher fidelity: With 200 concepts, fidelity is ~0.3. With 5000+, expect ~0.5-0.7.
  2. Use GPU: Encoding 30K images takes ~10 min on GPU vs. hours on CPU.
  3. Cache embeddings: Set cache_path to avoid re-encoding on every run.
  4. COMP vs top-k: COMP gives more coherent explanations; top-k is faster but less complete.
  5. λ tuning: Higher λ = more coherent but potentially less faithful. The default of 0.3 works well.
  6. Last layers are most interpretable: Focus on the last 4 layers, as in the original SITH paper.

Citation

If you use UniSITH, please cite the original SITH paper:

@article{vaquero2025sith,
  title={From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition},
  author={Vaquero, Lorenzo and others},
  journal={arXiv preprint arXiv:2603.24653},
  year={2025}
}

License

MIT License
