NearID — Identity Representation Learning via Near-identity Distractors

Paper: NearID: Identity Representation Learning via Near-identity Distractors (arXiv:2604.01973) · Code: github.com/Aleksandar/NearID

KAUST · Snap Research

NearID produces identity-aware image embeddings that remain stable across background and context changes while correctly rejecting near-identity distractors (visually similar but different instances placed in the same context). It is designed for evaluating identity preservation in personalized image generation.

Architecture

| Property | Value |
|---|---|
| Base model | google/siglip2-so400m-patch14-384 |
| Backbone | SigLIP2 SO400M ViT/14 @ 384 px (frozen) |
| Pooling head | Multi-head Attention Pooling (MAP), initialised from SigLIP2 weights (trained) |
| Embedding dim | 1152 |
| Normalisation | L2 (built-in, `config.normalize_embeddings=True`) |
| Total parameters | ~428 M |
| Trainable parameters | ~15 M (head-only; backbone weights are frozen to preserve pretrained priors) |
| Input resolution | 384 × 384 |
| Format | safetensors (fp16) |

Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image

model = AutoModel.from_pretrained("Aleksandar/nearid-siglip2", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("Aleksandar/nearid-siglip2")

inputs = processor(images=Image.open("photo.jpg"), return_tensors="pt")

# Full output (ModelOutput with image_embeds, last_hidden_state, pooler_output)
outputs = model(**inputs)
embedding = outputs.image_embeds  # [1, 1152], L2-normalised

# Tensor shortcut
embedding = model.get_image_features(**inputs)  # [1, 1152]
```

Note on image processor: The original training used SiglipImageProcessor (slow). The release defaults to SiglipImageProcessorFast for performance. To use the original slow processor, pass use_fast=False to AutoImageProcessor.from_pretrained().

Pairwise Similarity

```python
import torch

# img_a, img_b: PIL images
emb_a = model.get_image_features(**processor(images=img_a, return_tensors="pt"))
emb_b = model.get_image_features(**processor(images=img_b, return_tensors="pt"))

similarity = (emb_a @ emb_b.T).item()  # cosine similarity (embeddings are normalised)
```
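Because the embeddings are unit-norm, the dot product above is a cosine similarity in [-1, 1]. A same-identity decision can then be sketched with a threshold; the 0.8 value below is a hypothetical placeholder, not a calibrated operating point from the paper:

```python
import torch
import torch.nn.functional as F

def same_identity(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  threshold: float = 0.8) -> bool:
    """Decide whether two L2-normalised embeddings depict the same identity.

    The 0.8 threshold is purely illustrative; calibrate it on held-out
    positive/distractor pairs for your domain.
    """
    return (emb_a @ emb_b.T).item() >= threshold

# Toy check with synthetic unit vectors standing in for model embeddings.
a = F.normalize(torch.randn(1, 1152), dim=-1)
b = F.normalize(torch.randn(1, 1152), dim=-1)
assert same_identity(a, a)      # identical vectors: cosine similarity 1.0
assert not same_identity(a, b)  # random high-dim vectors are near-orthogonal
```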

Batch Inference

```python
images = [Image.open(p) for p in image_paths]
inputs = processor(images=images, return_tensors="pt")  # every image is resized to 384 × 384, so no padding is needed
embeddings = model.get_image_features(**inputs)  # [B, 1152]

# Pairwise similarity matrix
sim_matrix = embeddings @ embeddings.T
```
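Building on the similarity matrix, identity-aware retrieval reduces to a top-k lookup against a gallery. A minimal sketch, with random unit vectors standing in for real NearID embeddings:

```python
import torch
import torch.nn.functional as F

def top_k_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Return similarities and indices of the k most similar gallery items.

    Both inputs are assumed L2-normalised, as get_image_features returns,
    so the matrix product is a cosine-similarity matrix.
    """
    sims = query @ gallery.T           # [Q, N] cosine similarities
    values, indices = sims.topk(k, dim=-1)
    return values, indices

# Toy usage: the query is item 0 of the gallery, so it should rank first.
gallery = F.normalize(torch.randn(100, 1152), dim=-1)
vals, idx = top_k_matches(gallery[:1], gallery, k=3)
assert idx[0, 0].item() == 0
```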

Evaluation

Near-Identity Discrimination & Alignment (Table 1)

We evaluate on three complementary benchmarks: NearID (object-level near-identity discrimination), MTG (part-level discrimination + oracle alignment), and DreamBench++ (human-judgment alignment).

| Scoring Model | NearID SSR ↑ | NearID PA ↑ | MTG MO ↑ | MTG MOpair ↑ | MTG SSR ↑ | MTG PA ↑ | DB++ MH ↑ |
|---|---|---|---|---|---|---|---|
| CLIP ViT-L/14 | 10.31 | 20.92 | 0.239 | 0.484 | 0.0 | 0.0 | 0.493 |
| DINOv2 ViT-L/14 | 20.43 | 34.55 | 0.324 | 0.519 | 0.0 | 0.0 | 0.492 |
| SigLIP2 (backbone) | 30.74 | 48.81 | 0.180 | 0.366 | 0.0 | 0.0 | 0.516 |
| VSM | 32.13 | 46.70 | 0.394 | 0.445 | 7.0 | 24.5 | 0.190 |
| NearID (Ours) | 99.17 | 99.71 | 0.465 | 0.486 | 35.0 | 46.5 | 0.545 |

SSR and PA are averaged across seven inpainting settings (three of which were held out from training). MO/MOpair = metric-to-oracle correlation; MH = metric-to-human correlation (Fisher-z averaged).
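Fisher-z averaging, used for the MH numbers above, averages correlation coefficients in atanh space and maps the mean back with tanh:

```python
import math

def fisher_z_average(correlations):
    """Average correlation coefficients via Fisher's z-transform.

    Each r (with |r| < 1) is mapped to z = atanh(r), the z-values are
    averaged, and the mean is mapped back to r-space with tanh.
    """
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))

# Averaging identical correlations returns the same value.
assert abs(fisher_z_average([0.545, 0.545]) - 0.545) < 1e-12
```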

DreamBench++ Per-Method Human Alignment (Table 2)

NearID improves over SigLIP2 on every personalization method tested, with Fisher-z averaged MH of 0.545 vs 0.516.

Training Details

Training Data

NearID was trained on the NearID dataset, which consists of multi-view positives per identity paired with near-identity distractors: different but semantically similar instances inpainted into the exact same background using an ensemble of generation pipelines (Flux, PowerPaint, SDXL, Qwen). Part-level training signal is provided by the MTG dataset.

Training Procedure

| Hyperparameter | Value |
|---|---|
| Tuning strategy | Head-only (backbone frozen) |
| Loss function | NearID loss (InfoNCE + near-identity distractor ranking, α = 0.5, τ = 0.07) |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Cosine with 100 warmup steps |
| Batch size | 128 |
| Epochs | 11 |
| Precision | fp16 mixed precision |
| Hardware | 1 × NVIDIA A100 |
| Training time | ~6.5 hours |
| Framework | PyTorch + HuggingFace Accelerate |
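The loss combines InfoNCE with a near-identity distractor ranking term (α = 0.5, τ = 0.07); the exact ranking formulation is defined in the paper. As an illustrative sketch only, a margin-based version of that combination might look like:

```python
import torch
import torch.nn.functional as F

def nearid_loss_sketch(anchor, positive, distractor, negatives,
                       alpha: float = 0.5, tau: float = 0.07,
                       margin: float = 0.2):
    """Illustrative InfoNCE + distractor-ranking loss.

    α and τ match the model card; the margin-ranking form of the
    distractor term (and margin=0.2) are assumptions, not the paper's
    exact formulation. Embeddings are assumed L2-normalised,
    shapes [1, D] for anchor/positive/distractor and [N, D] for negatives.
    """
    # InfoNCE: the positive competes against in-batch negatives at temperature τ.
    pos_sim = (anchor @ positive.T) / tau            # [1, 1]
    neg_sim = (anchor @ negatives.T) / tau           # [1, N]
    logits = torch.cat([pos_sim, neg_sim], dim=-1)   # [1, 1 + N]
    info_nce = F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

    # Ranking term: the positive should beat the near-identity distractor
    # by at least `margin` in cosine similarity.
    ranking = F.relu(margin - (anchor @ positive.T) + (anchor @ distractor.T)).mean()

    return info_nce + alpha * ranking
```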

Intended Uses

Primary use cases:

  • Evaluating identity preservation in personalized image generation (e.g., scoring outputs of DreamBooth, Textual Inversion, IP-Adapter)
  • Embedding extraction for identity-aware retrieval or clustering
  • Benchmarking and research on near-identity discrimination
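For the scoring use case, identity preservation can be summarised as the mean cosine similarity between a generated image's embedding and the subject's reference embeddings. A sketch on stand-in tensors (in practice, both come from `model.get_image_features`):

```python
import torch
import torch.nn.functional as F

def identity_preservation_score(generated: torch.Tensor,
                                references: torch.Tensor) -> float:
    """Mean cosine similarity between one generated-image embedding
    ([1, D]) and a set of reference embeddings ([R, D]) of the subject.
    Inputs are assumed L2-normalised, as get_image_features returns.
    """
    return (generated @ references.T).mean().item()

# Stand-in tensors; replace with real embeddings in practice.
refs = F.normalize(torch.randn(4, 1152), dim=-1)
score = identity_preservation_score(refs[:1], refs)
assert -1.0 <= score <= 1.0
```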

Out-of-scope uses:

  • This model is not a face recognition or person re-identification system
  • Surveillance or tracking without consent
  • Production biometric authentication (the model has not been audited for that purpose)
  • Demographic classification or profiling

Limitations

  • Domain: NearID was trained on synthetic inpaintings of common objects. Performance on domains not represented in the training set (e.g., highly specialised industrial parts, medical imagery) has not been evaluated.
  • Resolution: The model expects 384 × 384 input. Performance may degrade on images significantly below this resolution or with heavy compression.
  • Single-image scoring: The model scores individual images independently; it does not reason over video or image sequences.
  • Generative models: The near-identity distractors were generated using specific inpainting pipelines. Novel generation artifacts from unseen pipelines may affect discrimination performance.

Citation

```bibtex
@article{cvejic2026nearid,
  title={NearID: Identity Representation Learning via Near-identity Distractors},
  author={Cvejic, Aleksandar and Abdal, Rameen and Eldesokey, Abdelrahman and Ghanem, Bernard and Wonka, Peter},
  journal={arXiv preprint arXiv:2604.01973},
  year={2026}
}
```

Full paper: https://arxiv.org/abs/2604.01973
