---
language: en
tags:
- fashion
- clip
- multimodal
- image-search
- text-search
- embeddings
- contrastive-learning
- zero-shot-classification
license: mit
datasets:
- Marqo/KAGL
- zalando-research/fashionmnist
metrics:
- accuracy
- cosine-similarity
- f1
library_name: transformers
pipeline_tag: zero-shot-image-classification
base_model: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
---

# GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.0+](https://img.shields.io/badge/pytorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)

**A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.**

---

## Quick Start

### Installation

```bash
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip
pip install -e .
```

### Load from Hugging Face

```python
import torch
import torch.nn.functional as F

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

processor = models['processor']
main_model = models['main_model']
device = models['device']

# Extract structured embeddings from text
text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

with torch.no_grad():
    text_outputs = main_model.text_model(**text_inputs)
    text_features = main_model.text_projection(text_outputs.pooler_output)
    text_features = F.normalize(text_features, dim=-1)

color_emb = text_features[:, :16]       # dims 0-15  — color
category_emb = text_features[:, 16:80]  # dims 16-79 — category
general_emb = text_features[:, 80:]     # dims 80-511 — general CLIP
```

---

## Architecture

GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:

| Subspace | Dimensions | Aligned with |
|----------|-----------|--------------|
| Color | 0-15 (16 D) | ColorCLIP specialist model |
| Category | 16-79 (64 D) | HierarchyModel specialist model |
| General CLIP | 80-511 (432 D) | Standard CLIP semantic space |

### Specialist Models (v2)

Both specialist models use **frozen CLIP ViT-B/32 encoders** with small trainable projection heads:

- **ColorCLIP**: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
- **HierarchyModel**: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.

Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.

### Main Model Training

The main CLIP model is fine-tuned end-to-end with an **enhanced contrastive loss** that combines:
1. **Triple contrastive loss** (text-image, text-attributes, image-attributes)
2. **Alignment loss** — MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
3. **Reference loss** — optional regularization to stay close to the base CLIP text space

```
total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
```

where alpha = 0.2 (alignment weight) and beta = 0.1 (reference weight).

**Hyperparameters**: lr = 1.5e-5, temperature = 0.09, weight decay = 2.76e-5, batch size = 128, trained for 10 epochs on a 100K-sample subset.

---

## Project Structure

```
.
├── config.py                # Paths, dimensions, device detection
├── example_usage.py         # Load from HuggingFace + demo search
├── setup.py                 # pip install -e .
├── __init__.py
├── README.md                # This file (also the HF model card)
│
├── training/
│   ├── color_model.py       # ColorCLIP: frozen CLIP + Linear(512,16)
│   ├── hierarchy_model.py   # HierarchyModel: frozen CLIP + MLP(512,128,64)
│   └── main_model.py        # GAP-CLIP fine-tuning with enhanced loss
│
├── evaluation/
│   ├── run_all_evaluations.py         # Orchestrator for all paper evaluations
│   ├── sec51_color_model_eval.py      # Table 1 — color accuracy
│   ├── sec52_category_model_eval.py   # Table 2 — category accuracy
│   ├── sec533_clip_nn_accuracy.py     # Table 3 — NN classification
│   ├── sec5354_separation_semantic.py # Separation & zero-shot semantic
│   ├── sec536_embedding_structure.py  # Table 4 — structure tests A/B/C/D
│   ├── annex92_color_heatmaps.py      # Color similarity heatmaps
│   ├── annex93_tsne.py                # t-SNE visualizations
│   ├── annex94_search_demo.py         # Fashion search engine demo
│   └── utils/
│       ├── datasets.py      # Dataset loaders (internal, KAGL, FMNIST)
│       ├── metrics.py       # Separation score, accuracy metrics
│       └── model_loader.py  # Model loading helpers (v2 checkpoint)
│
├── models/                  # Trained weights (git-ignored, on HF Hub)
│   ├── color_model.pt       # ColorCLIP checkpoint (~600 MB)
│   ├── hierarchy_model.pth  # HierarchyModel checkpoint (~600 MB)
│   └── gap_clip.pth         # Main GAP-CLIP checkpoint (~1.7 GB)
│
├── figures/                 # Paper figures & evaluation outputs
│   ├── scheme.png           # Architecture diagram
│   ├── training_curves.png  # Training/validation loss curves
│   ├── heatmap.png          # GAP-CLIP color similarity heatmap
│   ├── heatmap_baseline.jpg # Baseline color similarity heatmap
│   ├── tsne_*.png           # t-SNE visualizations (4 files)
│   ├── *_red_dress.png      # Search demo: "red dress"
│   ├── *_blue_pant.png      # Search demo: "blue pant"
│   └── confusion_matrices/  # Color (8) and hierarchy (12) matrices
│
├── paper/
│   ├── paper.ltx            # LaTeX source
│   └── paper.pdf            # Compiled paper
│
└── data/                    # Training data (git-ignored)
    └── fashion-mnist_test.csv # Fashion-MNIST evaluation set
```

---

## Usage

### Text Search

```python
from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Use specialist models directly
color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
```

### Image Search

```python
import torch
import torch.nn.functional as F
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = models['processor'](images=[image], return_tensors="pt")
image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}

with torch.no_grad():
    vision_outputs = models['main_model'].vision_model(**image_inputs)
    image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
    image_features = F.normalize(image_features, dim=-1)

# Structured subspaces
color_emb = image_features[:, :16]
category_emb = image_features[:, 16:80]
general_emb = image_features[:, 80:]
```

### Alignment Check

```python
import torch.nn.functional as F

# Compare specialist vs main-model subspace
# (text_features comes from the Quick Start example above)
color_from_specialist = models['color_model'].get_text_embeddings(["red"])
color_from_main = text_features[:, :16]
similarity = F.cosine_similarity(color_from_specialist,
                                 color_from_main, dim=1)
print(f"Color alignment: {similarity.item():.4f}")
```

### CLI

```bash
# Load from HuggingFace and run example search
python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"

# With an image
python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
```

---

## Training

### 1. Train the Color Model

```bash
# From the repository root:
python -m training.color_model
```

Trains `ColorCLIP`: frozen CLIP ViT-B/32 + trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon MPS. Saves the checkpoint to `models/color_model.pt`.

### 2. Train the Hierarchy Model

```bash
python -m training.hierarchy_model
```

Trains `HierarchyModel`: frozen CLIP ViT-B/32 + trainable `MLP(512 -> 128 -> 64)` + classifier heads, with a multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves the checkpoint to `models/hierarchy_model.pth`.

Steps 1 and 2 can run in parallel.

### 3. Train the Main GAP-CLIP Model

```bash
python -m training.main_model
```

Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets.
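As a rough illustration of how the loss terms described above fit together, the sketch below combines a symmetric contrastive term, a subspace alignment term (MSE + cosine), and a reference term. This is not the repository's actual training code: the function name, argument names, and the text-side-only alignment are invented for illustration (the paper aligns both text and image sides).

```python
import torch
import torch.nn.functional as F

def enhanced_loss(text_emb, image_emb, color_target, category_target,
                  base_text_emb, alpha=0.2, beta=0.1, temperature=0.09):
    """Hypothetical sketch of the GAP-CLIP loss combination.

    text_emb / image_emb: [B, 512] L2-normalized main-model embeddings.
    color_target / category_target: specialist embeddings for dims 0-15 / 16-79.
    base_text_emb: frozen base-CLIP text embeddings (reference term).
    """
    # Symmetric contrastive loss over the full 512-D embeddings
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

    # Alignment: MSE + (1 - cosine) between a subspace and its specialist target
    def align(sub, target):
        return F.mse_loss(sub, target) + (1 - F.cosine_similarity(sub, target, dim=-1).mean())

    # Only the text side is shown here for brevity
    alignment = (align(text_emb[:, :16], color_target)
                 + align(text_emb[:, 16:80], category_target)) / 2

    # Reference: stay close to the base CLIP text space
    reference = F.mse_loss(text_emb, base_text_emb)

    return (1 - alpha) * contrastive + alpha * alignment + beta * reference
```

The `(1 - alpha)` weighting matches the `total_loss` formula above, so increasing the alignment weight automatically trades off against the contrastive term.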
Training features:

- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`)
- `ReduceLROnPlateau` scheduler (patience=3, factor=0.5)
- Early stopping (patience=7)
- Automatic best-model checkpointing
- Training curves saved to `figures/training_curves.png`

---

## Evaluation

Run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
```

| Step | Paper Section | Description |
|------|--------------|-------------|
| `sec51` | Section 5.1 | Color model accuracy (Table 1) |
| `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
| `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | Sections 5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search engine demo |

All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (HuggingFace), and Fashion-MNIST.
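Because the subspaces sit at fixed, guaranteed positions, attribute-weighted retrieval can be done with plain tensor arithmetic. The sketch below is illustrative only (not part of the repository; the function name and weight values are invented): it re-weights the color, category, and general subspaces of pre-computed embeddings before ranking by cosine similarity.

```python
import torch
import torch.nn.functional as F

def subspace_search(query_emb, gallery_embs,
                    w_color=2.0, w_category=1.0, w_general=1.0):
    """Rank gallery items by subspace-weighted cosine similarity.

    query_emb: [512] structured GAP-CLIP embedding (dims 0-15 color,
    16-79 category, 80-511 general); gallery_embs: [N, 512].
    The weights are hypothetical, chosen here to emphasize color.
    """
    weights = torch.cat([
        torch.full((16,), w_color),     # color subspace (dims 0-15)
        torch.full((64,), w_category),  # category subspace (dims 16-79)
        torch.full((432,), w_general),  # general CLIP subspace (dims 80-511)
    ])
    q = F.normalize(query_emb * weights, dim=-1)
    g = F.normalize(gallery_embs * weights, dim=-1)
    scores = g @ q                      # cosine similarity per gallery item
    return torch.argsort(scores, descending=True)
```

Boosting `w_color` biases retrieval toward color matches without retraining or any extra model, which a monolithic CLIP embedding cannot offer.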
---

## Configuration

All paths and hyperparameters are in `config.py`:

```python
import config

config.device              # Auto-detected: CUDA > MPS > CPU
config.color_emb_dim       # 16
config.hierarchy_emb_dim   # 64
config.main_emb_dim        # 512
config.print_config()      # Pretty-print settings
config.validate_paths()    # Check that model files exist
```

---

## Repository Files on Hugging Face

| File | Description |
|------|-------------|
| `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
| `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
| `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |

---

## Citation

```bibtex
@misc{gap-clip-2025,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2025},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
}
```

## License

MIT License. See `LICENSE` for details.

## Contact

**Author**: Lea Attia Sarfati
**Email**: lea.attia@gmail.com
**Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)