# Reduced SigLIP for Person Visual Descriptions

This model belongs to a family of reduced-dimension variants of google/siglip-base-patch16-224 and google/siglip2-base-patch16-224 fine-tuned for person visual description. It projects the original embedding dimension into a smaller space using trainable linear projection layers.
## Model Details for the Reduced Version

- Base model: google/siglip2-base-patch16-224
- Reduced dimension: 64
- Architecture modifications: two added linear layers (one for text, one for images) that project embeddings from the original dimension down to `reduced_dim`.
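As an illustration of the projection step, the following NumPy sketch maps a base-model embedding down to the reduced space and re-normalizes it. The 768-dimensional base size and the initialization scale are assumptions for the example; in the actual model these layers are trainable.

```python
import numpy as np

rng = np.random.default_rng(0)
base_dim, reduced_dim = 768, 64  # assumed base size; reduced dim from this card

# Stand-ins for the trainable projection matrices (one for text, one for images)
W_text = rng.normal(scale=base_dim ** -0.5, size=(base_dim, reduced_dim))
W_image = rng.normal(scale=base_dim ** -0.5, size=(base_dim, reduced_dim))

def project(embedding, W):
    """Project a base-model embedding into the reduced space and L2-normalize."""
    z = embedding @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

text_emb = project(rng.normal(size=(1, base_dim)), W_text)  # shape (1, 64)
```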
## Intended Uses & Limitations

### Example Applications
- Person retrieval based on textual or visual descriptions of the person
- Person re-identification
- Embedding extraction for retrieval systems
### Limitations and Bias
- May inherit biases from the base SigLIP model and training data
- Not suitable for tasks requiring detailed fine-grained recognition without further training
- Trained on surveillance data; suitable for tasks where a substantial portion of the person is visible
## Training

### Loss Function
- Type: Soft contrastive loss with label smoothing; optional image-to-image contrastive loss for same-identity image pairs.
- Description: The model is trained to align text and image embeddings using a modified contrastive objective. Instead of relying on hard one-hot targets, label smoothing allocates a small probability mass to all other samples in the batch. All embeddings are normalized prior to similarity computation, and the loss is applied symmetrically in both the image-to-text and text-to-image directions. To encourage re-identification and emphasize clothing features rather than pose or background, an additional image-to-image contrastive loss can be incorporated.
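The label-smoothed symmetric objective described above can be sketched as follows. The function name, temperature, and smoothing values are illustrative assumptions, not the exact training code:

```python
import numpy as np

def smoothed_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """Symmetric contrastive loss with label smoothing (illustrative sketch).

    img_emb, txt_emb: (N, D) arrays where row i of each forms a matched pair.
    """
    # Normalize embeddings before computing similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) pairwise similarities

    n = logits.shape[0]
    # Smoothed targets: the matched pair gets 1 - smoothing, and the remaining
    # probability mass is spread over the other samples in the batch
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)

    def log_softmax(x):
        m = x.max(axis=1, keepdims=True)
        return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

    def cross_entropy(lg):
        return -(targets * log_softmax(lg)).sum(axis=1).mean()

    # Applied symmetrically: image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The optional image-to-image loss for same-identity pairs follows the same pattern with two image embedding batches in place of the image/text pair.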
### Evaluation Metrics
- Truncated Cumulative Matching Characteristic (CMC) AUC
  - Measures the fraction of queries for which the correct match appears within the top K ranks (e.g., top-10). Unlike MRR or strict top-1 accuracy, this metric rewards consistently retrieving all relevant matches near the top ranks, rather than a few perfect hits with the others ranked very low.
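A minimal sketch of the truncated CMC AUC, assuming it is computed as the mean of CMC@k over ranks 1..K (the function name is hypothetical):

```python
import numpy as np

def truncated_cmc_auc(ranks, max_rank=10):
    """ranks: 1-based rank of the correct match for each query."""
    ranks = np.asarray(ranks)
    # CMC@k = fraction of queries whose correct match appears within the top k
    cmc = np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    # Normalized area under the truncated CMC curve
    return cmc.mean()
```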
### Datasets
- Sources: CUHK-PEDES, ICFG-PEDES, IIITD, ITCPR, PRW-TBPS, PETA, (part of SYNTH and RSTPReid for testing)
- Processing: Text descriptions were processed with the Mistral LLM to remove ambiguous information about pose or context, leaving only clear visual characteristics in a structured format where key features are separated by commas. The original train/val/test splits are respected wherever possible.
The splits are created as follows:
- CUHK-PEDES: test: 1000 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- ICFG-PEDES: test: 946 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
- IIITD: test: 2500 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- ITCPR: Originally designed for zero-shot evaluation, so a new split is created: test: 1000 randomly selected identities; validation: 150 randomly selected identities; train: all remaining identities
- PRW: test: 450 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
- PETA: The original random image-level split can cause identity leakage. Therefore, the identity-based split from PETA-ZS is adopted: test: 1706 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- RSTPReid: Only the 200 test images are used for evaluation due to annotation errors in the rest of the dataset.
- SYNTH: 1000 random images are used for the test set because the generated captions are noisy and sometimes inaccurate.
### Training Setup

- Initially, the pretrained base model was frozen and only the projection layers were trained.
- Fine-tuning: the model head was trained next, followed by the whole model.
- Optimizer: AdamW
- Learning rate: 1e-4 during warm-up, 5e-6 afterwards
- Epochs: 4 for projection-layer warm-up, 4 for model-head fine-tuning, and 50 for full-model fine-tuning
- Visual Augmentations: random operations including small rotations, scaling, hue variations, horizontal flips, and color jitter
- Text Augmentations: random subsets of key features are selected and removed from the comma-separated description strings to create augmented training samples
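The text augmentation step can be sketched as randomly dropping comma-separated features while keeping at least one; the function name and drop probability are illustrative assumptions:

```python
import random

def augment_description(description, drop_prob=0.3, rng=None):
    """Randomly drop comma-separated key features from a structured description."""
    rng = rng or random.Random()
    features = [f.strip() for f in description.split(",")]
    kept = [f for f in features if rng.random() > drop_prob]
    if not kept:  # never return an empty description
        kept = [rng.choice(features)]
    return ", ".join(kept)
```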
- Training Code: available on [GitHub](https://github.com/MarketaJu/ReducedSiglipForPersonDescription)
- Package Versions: torch==2.9.1, transformers==4.57.3, pillow==12.0.0, torchvision==0.24.1
## Results

The model is evaluated on the datasets described above. The following table summarizes the number of identities, images, and queries for each subset.
The final test set is the union of all subsets.
| Dataset | #Identities | #Images | #Queries |
|---|---|---|---|
| CUHK | 1000 | 3074 | 6156 |
| ICFG | 946 | 19848 | 19873 |
| IIITD | 2500 | 2500 | 5000 |
| ITCPR | 1000 | 1620 | 1620 |
| PRW | 450 | 2057 | 4114 |
| PETA | 1706 | 3933 | 4614 |
| RSTPReid | 200 | 1000 | 1932 |
| SYNTH | 1000 | 1000 | 1000 |
| Final (All) | 8802 | 35032 | 44309 |
### Evaluation of the Model per Dataset

Since the task focuses on retrieving the correct person identity rather than the exact matching image, evaluation is performed as follows:
- For each text query, the goal is to retrieve the correct identity, not the exact corresponding image.
- During evaluation, scores are computed over all images belonging to the same identity, and the maximum score is taken to represent that identity.
- These identity-level scores are then ranked, and the following retrieval metrics are calculated:
- Top-k: standard top-1, top-5, top-10 accuracy
- MRR: Mean Reciprocal Rank
- CMC AUC: Truncated CMC AUC evaluated up to rank 20
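The identity-level evaluation above (max-pooling image scores per identity, then ranking identities) can be sketched as follows; the function name and metric dictionary layout are illustrative assumptions:

```python
import numpy as np

def identity_level_metrics(scores, image_ids, query_gt, top_k=(1, 5, 10)):
    """scores: (Q, I) query-image similarities; image_ids: identity of each image;
    query_gt: correct identity per query."""
    image_ids = np.asarray(image_ids)
    identities = np.unique(image_ids)
    # Identity score = maximum over that identity's images
    id_scores = np.stack(
        [scores[:, image_ids == pid].max(axis=1) for pid in identities], axis=1)
    order = np.argsort(-id_scores, axis=1)  # identity ranking per query
    ranked_ids = identities[order]
    # 1-based rank of the correct identity for each query
    ranks = np.array([np.where(ranked_ids[q] == query_gt[q])[0][0] + 1
                      for q in range(len(query_gt))])
    metrics = {f"top-{k}": (ranks <= k).mean() for k in top_k}
    metrics["MRR"] = (1.0 / ranks).mean()
    return metrics
```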
| Dataset | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| CUHK | 62.4 | 86.4 | 91.8 | 71.4 | 90.7 |
| ICFG | 51.1 | 76.1 | 83.6 | 60.1 | 82.6 |
| IIITD | 65.2 | 88.6 | 93.3 | 75.6 | 92.3 |
| ITCPR | 37.7 | 64.1 | 76.2 | 48.8 | 76.0 |
| PRW | 56.6 | 83.3 | 90.7 | 67.0 | 89.2 |
| PETA | 47.6 | 75.8 | 88.2 | 58.1 | 85.6 |
| RSTPReid | 47.4 | 75.5 | 85.2 | 57.8 | 83.5 |
| SYNTH | 42.7 | 69.0 | 78.3 | 54.8 | 77.2 |
| Final (All) | 49.0 | 73.8 | 81.7 | 58.6 | 80.5 |
### Cross-Model Comparison
Model based on google/siglip-base-patch16-224:
| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip-base-patch16-224 | 16.4 | 33.1 | 41.4 | 23.8 | 41.1 |
| siglip-person-description-64 | 49.0 | 74.1 | 81.4 | 58.8 | 80.5 |
Model based on google/siglip2-base-patch16-224:
| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip2-base-patch16-224 | 12.3 | 26.3 | 34.2 | 18.7 | 34.2 |
| finetuned_siglip2 | 53.8 | 77.0 | 83.8 | 62.6 | 82.8 |
| siglip2-person-description-128 | 51.0 | 75.0 | 82.4 | 60.4 | 81.4 |
| siglip2-person-description-64 | 49.0 | 73.8 | 81.7 | 58.6 | 80.5 |
| siglip2-person-description-32 | 43.1 | 70.0 | 78.2 | 53.5 | 77.3 |
## Usage

The usage is identical to SigLIP.

```python
import torch
from transformers import AutoProcessor, AutoModel
from skimage.io import imread

# Import custom model code from the repository
from modeling_resipvd import ReSiPVDModel

# Load the processor and model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
model = AutoModel.from_pretrained("MarketaJu/siglip2-person-description-64")

# Example: get embeddings
image = imread("test.jpg")
text_inputs = processor(text=["random person description"], return_tensors="pt",
                        padding="max_length", max_length=64, truncation=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)
```
---
## Citation
If you use this model, please cite:
```bibtex
@misc{reduced-siglip-visualdescription,
  title={Reduced SigLIP for Visual Descriptions},
  author={Marketa Jurankova},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MarketaJu/reduced-siglip}}
}
```