---
language: en
tags:
- fashion
- clip
- multimodal
- image-search
- text-search
- embeddings
- contrastive-learning
- zero-shot-classification
license: mit
datasets:
- Marqo/KAGL
- zalando-research/fashionmnist
metrics:
- accuracy
- cosine-similarity
- f1
library_name: transformers
pipeline_tag: zero-shot-image-classification
base_model: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
---

# GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.0+](https://img.shields.io/badge/pytorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)

**A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.**

---

## Quick Start

### Installation

```bash
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip
pip install -e .
```

### Load from Hugging Face

```python
import torch
import torch.nn.functional as F

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

processor = models['processor']
main_model = models['main_model']
device = models['device']

# Extract structured embeddings from text
text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

with torch.no_grad():
    text_outputs = main_model.text_model(**text_inputs)
    text_features = main_model.text_projection(text_outputs.pooler_output)
    text_features = F.normalize(text_features, dim=-1)

color_emb = text_features[:, :16]       # dims 0-15  — color
category_emb = text_features[:, 16:80]  # dims 16-79 — category
general_emb = text_features[:, 80:]     # dims 80-511 — general CLIP
```

---

## Architecture

GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:

| Subspace | Dimensions | Aligned with |
|----------|-----------|--------------|
| Color | 0-15 (16 D) | ColorCLIP specialist model |
| Category | 16-79 (64 D) | HierarchyModel specialist model |
| General CLIP | 80-511 (432 D) | Standard CLIP semantic space |

### Specialist Models (v2)

Both specialist models use **frozen CLIP ViT-B/32 encoders** with small trainable projection heads:

- **ColorCLIP**: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
- **HierarchyModel**: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.

Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.

### Main Model Training

The main CLIP model is fine-tuned end-to-end with an **enhanced contrastive loss** that combines:
1. **Triple contrastive loss** (text-image, text-attributes, image-attributes)
2. **Alignment loss** — MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
3. **Reference loss** — optional regularization to stay close to the base CLIP text space

```
total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
```

where alpha = 0.2 (alignment weight) and beta = 0.1 (reference weight).

**Hyperparameters**: lr = 1.5e-5, temperature = 0.09, weight decay = 2.76e-5, batch size = 128, trained for 10 epochs on a 100K-sample subset.

---

## Project Structure

```
.
├── config.py                # Paths, dimensions, device detection
├── example_usage.py         # Load from HuggingFace + demo search
├── setup.py                 # pip install -e .
├── __init__.py
├── README.md                # This file (also the HF model card)
│
├── training/
│   ├── color_model.py       # ColorCLIP: frozen CLIP + Linear(512,16)
│   ├── hierarchy_model.py   # HierarchyModel: frozen CLIP + MLP(512,128,64)
│   └── main_model.py        # GAP-CLIP fine-tuning with enhanced loss
│
├── evaluation/
│   ├── run_all_evaluations.py         # Orchestrator for all paper evaluations
│   ├── sec51_color_model_eval.py      # Table 1 — color accuracy
│   ├── sec52_category_model_eval.py   # Table 2 — category accuracy
│   ├── sec533_clip_nn_accuracy.py     # Table 3 — NN classification
│   ├── sec5354_separation_semantic.py # Separation & zero-shot semantic
│   ├── sec536_embedding_structure.py  # Table 4 — structure tests A/B/C/D
│   ├── annex92_color_heatmaps.py      # Color similarity heatmaps
│   ├── annex93_tsne.py                # t-SNE visualizations
│   ├── annex94_search_demo.py         # Fashion search engine demo
│   └── utils/
│       ├── datasets.py      # Dataset loaders (internal, KAGL, FMNIST)
│       ├── metrics.py       # Separation score, accuracy metrics
│       └── model_loader.py  # Model loading helpers (v2 checkpoint)
│
├── models/                  # Trained weights (git-ignored, on HF Hub)
│   ├── color_model.pt       # ColorCLIP checkpoint (~600 MB)
│   ├── hierarchy_model.pth  # HierarchyModel checkpoint (~600 MB)
│   └── gap_clip.pth         # Main GAP-CLIP checkpoint (~1.7 GB)
│
├── figures/                 # Paper figures & evaluation outputs
│   ├── scheme.png           # Architecture diagram
│   ├── training_curves.png  # Training/validation loss curves
│   ├── heatmap.png          # GAP-CLIP color similarity heatmap
│   ├── heatmap_baseline.jpg # Baseline color similarity heatmap
│   ├── tsne_*.png           # t-SNE visualizations (4 files)
│   ├── *_red_dress.png      # Search demo: "red dress"
│   ├── *_blue_pant.png      # Search demo: "blue pant"
│   └── confusion_matrices/  # Color (8) and hierarchy (12) matrices
│
├── paper/
│   ├── paper.ltx            # LaTeX source
│   └── paper.pdf            # Compiled paper
│
└── data/                    # Training data (git-ignored)
    └── fashion-mnist_test.csv # Fashion-MNIST evaluation set
```

---

## Usage

### Text Search

```python
from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Use specialist models directly
color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
```

### Image Search

```python
import torch
import torch.nn.functional as F
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = models['processor'](images=[image], return_tensors="pt")
image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}

with torch.no_grad():
    vision_outputs = models['main_model'].vision_model(**image_inputs)
    image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
    image_features = F.normalize(image_features, dim=-1)

# Structured subspaces
color_emb = image_features[:, :16]
category_emb = image_features[:, 16:80]
general_emb = image_features[:, 80:]
```

### Alignment Check

```python
import torch.nn.functional as F

# Compare specialist vs main-model subspace
# (text_features comes from the Quick Start example above)
color_from_specialist = models['color_model'].get_text_embeddings(["red"])
color_from_main = text_features[:, :16]
similarity = F.cosine_similarity(color_from_specialist,
                                 color_from_main, dim=1)
print(f"Color alignment: {similarity.item():.4f}")
```

### CLI

```bash
# Load from HuggingFace and run example search
python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"

# With an image
python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
```

---

## Training

### 1. Train the Color Model

```bash
# From the repository root:
python -m training.color_model
```

Trains `ColorCLIP`: frozen CLIP ViT-B/32 + trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon MPS. Saves the checkpoint to `models/color_model.pt`.

### 2. Train the Hierarchy Model

```bash
python -m training.hierarchy_model
```

Trains `HierarchyModel`: frozen CLIP ViT-B/32 + trainable `MLP(512 -> 128 -> 64)` + classifier heads, with a multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves the checkpoint to `models/hierarchy_model.pth`.

Steps 1 and 2 can run in parallel.

### 3. Train the Main GAP-CLIP Model

```bash
python -m training.main_model
```

Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets.
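As a rough illustration of how the loss terms described above fit together, the sketch below combines a symmetric contrastive term, a subspace alignment term (MSE + cosine), and a reference term. This is not the repository's actual training code: the function name, argument names, and the text-side-only alignment are invented for illustration (the paper aligns both text and image sides).

```python
import torch
import torch.nn.functional as F

def enhanced_loss(text_emb, image_emb, color_target, category_target,
                  base_text_emb, alpha=0.2, beta=0.1, temperature=0.09):
    """Hypothetical sketch of the GAP-CLIP loss combination.

    text_emb / image_emb: [B, 512] L2-normalized main-model embeddings.
    color_target / category_target: specialist embeddings for dims 0-15 / 16-79.
    base_text_emb: frozen base-CLIP text embeddings (reference term).
    """
    # Symmetric contrastive loss over the full 512-D embeddings
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

    # Alignment: MSE + (1 - cosine) between a subspace and its specialist target
    def align(sub, target):
        return F.mse_loss(sub, target) + (1 - F.cosine_similarity(sub, target, dim=-1).mean())

    # Only the text side is shown here for brevity
    alignment = (align(text_emb[:, :16], color_target)
                 + align(text_emb[:, 16:80], category_target)) / 2

    # Reference: stay close to the base CLIP text space
    reference = F.mse_loss(text_emb, base_text_emb)

    return (1 - alpha) * contrastive + alpha * alignment + beta * reference
```

The `(1 - alpha)` weighting matches the `total_loss` formula above, so increasing the alignment weight automatically trades off against the contrastive term.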
Training features:

- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`)
- `ReduceLROnPlateau` scheduler (patience=3, factor=0.5)
- Early stopping (patience=7)
- Automatic best-model checkpointing
- Training curves saved to `figures/training_curves.png`

---

## Evaluation

Run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
```

| Step | Paper Section | Description |
|------|--------------|-------------|
| `sec51` | Section 5.1 | Color model accuracy (Table 1) |
| `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
| `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | Sections 5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search engine demo |

All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (HuggingFace), and Fashion-MNIST.
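Because the subspaces sit at fixed, guaranteed positions, attribute-weighted retrieval can be done with plain tensor arithmetic. The sketch below is illustrative only (not part of the repository; the function name and weight values are invented): it re-weights the color, category, and general subspaces of pre-computed embeddings before ranking by cosine similarity.

```python
import torch
import torch.nn.functional as F

def subspace_search(query_emb, gallery_embs,
                    w_color=2.0, w_category=1.0, w_general=1.0):
    """Rank gallery items by subspace-weighted cosine similarity.

    query_emb: [512] structured GAP-CLIP embedding (dims 0-15 color,
    16-79 category, 80-511 general); gallery_embs: [N, 512].
    The weights are hypothetical, chosen here to emphasize color.
    """
    weights = torch.cat([
        torch.full((16,), w_color),     # color subspace (dims 0-15)
        torch.full((64,), w_category),  # category subspace (dims 16-79)
        torch.full((432,), w_general),  # general CLIP subspace (dims 80-511)
    ])
    q = F.normalize(query_emb * weights, dim=-1)
    g = F.normalize(gallery_embs * weights, dim=-1)
    scores = g @ q                      # cosine similarity per gallery item
    return torch.argsort(scores, descending=True)
```

Boosting `w_color` biases retrieval toward color matches without retraining or any extra model, which a monolithic CLIP embedding cannot offer.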
---

## Configuration

All paths and hyperparameters are in `config.py`:

```python
import config

config.device              # Auto-detected: CUDA > MPS > CPU
config.color_emb_dim       # 16
config.hierarchy_emb_dim   # 64
config.main_emb_dim        # 512
config.print_config()      # Pretty-print settings
config.validate_paths()    # Check that model files exist
```

---

## Repository Files on Hugging Face

| File | Description |
|------|-------------|
| `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
| `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
| `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |

---

## Citation

```bibtex
@misc{gap-clip-2025,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2025},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
}
```

## License

MIT License. See `LICENSE` for details.

## Contact

**Author**: Lea Attia Sarfati
**Email**: lea.attia@gmail.com
**Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)