Leacb4 committed
Commit 1930636 · verified · 1 Parent(s): 9a67f8b

Upload README.md with huggingface_hub

Files changed (1): README.md (+175 -792)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)

**Advanced multimodal fashion search model combining specialized color embeddings, hierarchical category embeddings, and CLIP for intelligent fashion item retrieval.**

---

## 🚀 Quick Start

### Installation (< 1 minute)

```bash
# Clone the repository
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip

# Install the package with pip
pip install -e .

# Or install only the dependencies
pip install -r requirements.txt
```

### Try It Now (< 2 minutes)
```python
import torch
from example_usage import load_models_from_hf

# Load pre-trained models from Hugging Face
models = load_models_from_hf("Leacb4/gap-clip")

# Search with text
text_query = "red summer dress"
text_inputs = models['processor'](text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(models['device']) for k, v in text_inputs.items()}

with torch.no_grad():
    text_features = models['main_model'](**text_inputs).text_embeds

# Extract the specialized embeddings
color_emb = text_features[:, :16]       # Color (dims 0-15)
category_emb = text_features[:, 16:80]  # Category (dims 16-79)
general_emb = text_features[:, 80:]     # General CLIP (dims 80-511)

print("✅ Successfully extracted embeddings!")
print(f"   Color: {color_emb.shape}, Category: {category_emb.shape}, General: {general_emb.shape}")
```

---
## 📋 Description

This project implements an advanced fashion search system based on CLIP, with three specialized models:

1. **Color Model** (`color_model.pt`): Specialized CLIP model for extracting reduced-size color embeddings from text and images
2. **Hierarchy Model** (`hierarchy_model.pth`): Model for classifying and encoding the reduced-size categorical hierarchy of fashion items
3. **Main CLIP Model** (`gap_clip.pth`): Main CLIP model based on LAION, trained with the color and hierarchy embeddings

### Architecture

The main model's embedding structure:
- **Dimensions 0-15** (16 dims): Color embeddings aligned with the specialized color model
- **Dimensions 16-79** (64 dims): Hierarchy embeddings aligned with the specialized hierarchy model
- **Dimensions 80-511** (432 dims): Standard CLIP embeddings for general visual-semantic understanding

**Total: 512 dimensions** per embedding (text or image)

**Key Innovation**: The first 80 dimensions are explicitly trained to align with the specialized models through direct MSE and cosine-similarity losses, ensuring guaranteed attribute positioning (GAP) while maintaining full CLIP capabilities in the remaining dimensions.
### Loss Functions

**1. Enhanced Contrastive Loss** (`enhanced_contrastive_loss`):

Combines multiple objectives:
- **Original Triple Loss**: Text-image-attributes contrastive learning
- **Color Alignment**: Forces dims 0-15 to match the color model embeddings
- **Hierarchy Alignment**: Forces dims 16-79 to match the hierarchy model embeddings
- **Reference Loss**: Optional regularization to stay close to base CLIP

**2. Alignment Components**:
```python
# Color alignment (text & image)
color_text_mse = F.mse_loss(main_color_dims, color_model_emb)
color_text_cosine = 1 - F.cosine_similarity(main_color_dims, color_model_emb).mean()

# Hierarchy alignment (text & image)
hierarchy_text_mse = F.mse_loss(main_hierarchy_dims, hierarchy_model_emb)
hierarchy_text_cosine = 1 - F.cosine_similarity(main_hierarchy_dims, hierarchy_model_emb).mean()

# Combined alignment
alignment_loss = (color_alignment + hierarchy_alignment) / 2
```

**3. Final Loss**:
```python
total_loss = (1 - α) * contrastive_loss + α * alignment_loss + β * reference_loss
```
Where:
- α (`alignment_weight`) = 0.2: balances the contrastive and alignment objectives
- β (`reference_weight`) = 0.1: keeps the text space close to base CLIP
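
The weighting above can also be written as a plain function (a sketch; the argument names are illustrative, not the repository's actual variables):

```python
def total_loss(contrastive_loss, alignment_loss, reference_loss,
               alignment_weight=0.2, reference_weight=0.1):
    """Final loss as described above: (1 - α)·contrastive + α·alignment + β·reference."""
    a, b = alignment_weight, reference_weight
    return (1 - a) * contrastive_loss + a * alignment_loss + b * reference_loss

# With the default weights and example losses of 1.0, 0.5 and 0.2:
# 0.8 * 1.0 + 0.2 * 0.5 + 0.1 * 0.2 = 0.92
```

The same expression works unchanged on PyTorch scalar tensors.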

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- PyTorch 2.0+ (CUDA for GPU support, optional but recommended)
- 16 GB RAM minimum (32 GB recommended for training)
- ~5 GB disk space for models and data

### Method 1: Install as a Package (Recommended)

```bash
# Clone the repository
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip

# Install in development mode
pip install -e .

# Or install with optional dependencies
pip install -e ".[dev]"     # With development tools
pip install -e ".[optuna]"  # With hyperparameter optimization
pip install -e ".[all]"     # With all extras
```

### Method 2: Install Dependencies Only

```bash
pip install -r requirements.txt
```

### Method 3: From Hugging Face (Models Only)

```python
from example_usage import load_models_from_hf
models = load_models_from_hf("Leacb4/gap-clip")
```

### Main Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `torch` | ≥2.0.0 | Deep learning framework |
| `transformers` | ≥4.30.0 | Hugging Face CLIP models |
| `huggingface-hub` | ≥0.16.0 | Model download/upload |
| `pillow` | ≥9.0.0 | Image processing |
| `pandas` | ≥1.5.0 | Data manipulation |
| `scikit-learn` | ≥1.3.0 | ML metrics & evaluation |
| `tqdm` | ≥4.65.0 | Progress bars |
| `matplotlib` | ≥3.7.0 | Visualization |

### Verify Installation

```python
# Test that everything works
import config
config.print_config()

# Check the selected device
print(f"Using device: {config.device}")
```
## 📁 Project Structure

```
.
├── config.py              # Configuration for paths and parameters
├── example_usage.py       # Usage examples and Hugging Face loading
├── setup.py               # Package installation
├── __init__.py            # Package initialization
├── README.md              # This documentation
├── MODEL_CARD.md          # Hugging Face model card
│
├── paper/                 # Scientific paper
│   ├── latex_paper.ltx    # LaTeX source
│   └── paper.pdf          # Compiled PDF
│
├── figures/               # Paper figures
│   ├── scheme.png                    # Architecture diagram
│   ├── heatmap_baseline.jpg          # Baseline color heatmap
│   ├── heatmap.png                   # GAP-CLIP color heatmap
│   ├── tsne_*.png                    # t-SNE visualizations
│   ├── red_dress.png                 # Search demo example
│   ├── blue_jeans.png                # Search demo example
│   ├── optuna_param_importances.png  # Optuna importance plot
│   └── training_curves.png           # Training loss curves
│
├── training/              # Model training code
│   ├── main_model.py             # Main GAP-CLIP model with enhanced loss
│   ├── hierarchy_model.py        # Hierarchy/category model
│   ├── train_main_model.py       # Training with Optuna-optimized params
│   └── optuna_optimisation.py    # Hyperparameter optimization
│
├── evaluation/            # Paper evaluation scripts
│   ├── run_all_evaluations.py          # Orchestrates all evaluations
│   ├── sec51_color_model_eval.py       # Section 5.1 - Color model
│   ├── sec52_category_model_eval.py    # Section 5.2 - Category model
│   ├── sec533_clip_nn_accuracy.py      # Section 5.3.3 - Classification
│   ├── sec5354_separation_semantic.py  # Sections 5.3.4-5.3.5
│   ├── sec536_embedding_structure.py   # Section 5.3.6 - Structure tests
│   ├── annex92_color_heatmaps.py       # Annex - Color heatmaps
│   ├── annex93_tsne.py                 # Annex - t-SNE visualizations
│   ├── annex94_search_demo.py          # Annex - Search demo
│   └── utils/                          # Shared evaluation utilities
│       ├── datasets.py                 # Dataset loaders
│       ├── metrics.py                  # Metrics (separation, accuracy)
│       └── model_loader.py             # Model loading helpers
│
├── data/                  # Data preparation
│   ├── download_images.py        # Download dataset images
│   └── get_csv_from_chunks.py    # Merge CSV chunks
│
├── models/                # Trained model weights
│   ├── color_model.pt            # Color model checkpoint
│   ├── hierarchy_model.pth       # Hierarchy model checkpoint
│   └── gap_clip.pth              # Main GAP-CLIP checkpoint
│
└── optuna/                # Optuna optimization artifacts
    ├── optuna_results.txt        # Best hyperparameters
    ├── optuna_study.pkl          # Saved study
    ├── optuna_optimization_history.png
    └── optuna_param_importances.png
```

### Key Files Description

**Core Model Files** (in `training/`):
- `main_model.py`: GAP-CLIP implementation with the enhanced contrastive loss
- `hierarchy_model.py`: ResNet18-based hierarchy classification model (64 dims)
- `train_main_model.py`: Training with Optuna-optimized hyperparameters
- `optuna_optimisation.py`: Hyperparameter search with Optuna

**Configuration & Setup**:
- `config.py`: Configuration with type hints, automatic device detection, and validation
- `setup.py`: Package installer with CLI entry points
- `__init__.py`: Package initialization for easy imports

**Evaluation Suite** (in `evaluation/`):
- Scripts prefixed `sec5*` correspond to paper sections 5.1–5.3.6
- Scripts prefixed `annex9*` generate annex figures (heatmaps, t-SNE, search demo)
- `run_all_evaluations.py`: Orchestrates all paper evaluations
- `utils/`: Shared datasets, metrics, and model loading

**CLI Commands**:
After installation with `pip install -e .`, you can use:
```bash
gap-clip-train    # Start training
gap-clip-example  # Run usage examples
```
## 🔧 Configuration

The main parameters are defined in `config.py`:

```python
import config

# Automatic device detection (CUDA > MPS > CPU)
device = config.device  # Selects the best available device

# Embedding dimensions
color_emb_dim = config.color_emb_dim          # 16 dims (0-15)
hierarchy_emb_dim = config.hierarchy_emb_dim  # 64 dims (16-79)
main_emb_dim = config.main_emb_dim            # 512 dims total

# Default training hyperparameters
batch_size = config.DEFAULT_BATCH_SIZE        # 32
learning_rate = config.DEFAULT_LEARNING_RATE  # 1.5e-5
temperature = config.DEFAULT_TEMPERATURE      # 0.09

# Utility functions
config.print_config()    # Print the current configuration
config.validate_paths()  # Check that all required files exist
```

### New Features in config.py ✨

- **Automatic device detection**: Selects CUDA > MPS > CPU automatically
- **Type hints**: Full type annotations for better IDE support
- **Validation**: `validate_paths()` checks that all model files exist
- **Print utility**: `print_config()` shows the current settings
- **Constants**: Pre-defined default hyperparameters
- **Documentation**: Comprehensive docstrings for all settings

### Model Paths

Default paths configured in `config.py`:
- `models/color_model.pt`: Trained color model checkpoint
- `models/hierarchy_model.pth`: Trained hierarchy model checkpoint
- `models/gap_clip.pth`: Main GAP-CLIP model checkpoint
- `tokenizer_vocab.json`: Tokenizer vocabulary for the color model
- `data.csv`: Training/validation dataset

### Dataset Format

The training dataset CSV should contain:
- `text`: Text description of the fashion item
- `color`: Color label (e.g., "red", "blue", "black")
- `hierarchy`: Category label (e.g., "dress", "shirt", "shoes")
- `local_image_path`: Path to the image file

Example:
```csv
text,color,hierarchy,local_image_path
"red summer dress with floral pattern",red,dress,data/images/001.jpg
"blue denim jeans casual style",blue,jeans,data/images/002.jpg
```

## 📦 Usage
### 1. Load Models from Hugging Face

If your models are already uploaded to Hugging Face:

```python
from example_usage import load_models_from_hf

# Load all models
models = load_models_from_hf("your-username/your-model")

color_model = models['color_model']
hierarchy_model = models['hierarchy_model']
main_model = models['main_model']
processor = models['processor']
device = models['device']
```

### 2. Text Search

```python
import torch

# Prepare the text query
text_query = "red dress"
text_inputs = processor(text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Get the main model embeddings
with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds

# Get the specialized embeddings
color_emb = color_model.get_text_embeddings([text_query])
hierarchy_emb = hierarchy_model.get_text_embeddings([text_query])
```
### 3. Image Search

```python
import torch
from PIL import Image

# Load the image
image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = processor(images=[image], return_tensors="pt")
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}

# Get the embeddings
with torch.no_grad():
    outputs = main_model(**image_inputs)
    image_features = outputs.image_embeds
```

### 4. Using the Example Script

`example_usage.py` provides ready-to-use examples for loading and using GAP-CLIP:

```bash
# Load from Hugging Face and search with text
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --text "red summer dress"

# Search with an image
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --image path/to/image.jpg

# Both text and image
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --text "blue denim jeans" \
    --image path/to/image.jpg
```
This script demonstrates:
- Loading models from the Hugging Face Hub
- Extracting text and image embeddings
- Accessing the color and hierarchy subspaces
- Measuring alignment quality against the specialized models

## 🎯 Model Training

### Train the Color Model

```python
from color_model import ColorCLIP, train_color_model

# Configuration
model = ColorCLIP(vocab_size=10000, embedding_dim=16)
# ... dataset configuration ...

# Training
train_color_model(model, train_loader, val_loader, num_epochs=20)
```

### Train the Hierarchy Model

```python
from training.hierarchy_model import Model as HierarchyModel, train_hierarchy_model

# Configuration
model = HierarchyModel(num_hierarchy_classes=10, embed_dim=64)
# ... dataset configuration ...

# Training
train_hierarchy_model(model, train_loader, val_loader, num_epochs=20)
```
### Train the Main CLIP Model

The main model is trained with both specialized models using an enhanced contrastive loss.

**Option 1: Train with optimized hyperparameters (recommended)**:
```bash
python -m training.train_main_model
```
This uses hyperparameters optimized with Optuna (Trial 29, validation loss ~0.1129).

**Option 2: Train with default parameters**:
```bash
python -m training.main_model
```
This runs the main training loop with manually configured parameters.

**Default Training Parameters** (in `training/main_model.py`):
- `num_epochs = 20`: Number of training epochs
- `learning_rate = 1.5e-5`: Learning rate with the AdamW optimizer
- `temperature = 0.09`: Temperature for softer contrastive learning
- `alignment_weight = 0.2`: Weight of the color/hierarchy alignment loss
- `weight_decay = 5e-4`: L2 regularization to prevent overfitting
- `batch_size = 32`: Batch size
- `subset_size = 20000`: Dataset size for better generalization
- `reference_weight = 0.1`: Weight of the base-CLIP regularization

**Enhanced Loss Function**:

Training uses `enhanced_contrastive_loss`, which combines:

1. **Triple Contrastive Loss** (weighted):
   - Text-Image alignment (70%)
   - Text-Attributes alignment (15%)
   - Image-Attributes alignment (15%)

2. **Direct Alignment Loss** (combines color & hierarchy):
   - MSE loss between the main model's color dims (0-15) and the color model embeddings
   - MSE loss between the main model's hierarchy dims (16-79) and the hierarchy model embeddings
   - Cosine-similarity losses for both color and hierarchy
   - Applied to both text and image embeddings

3. **Reference Model Loss** (optional):
   - Keeps text embeddings close to base CLIP
   - Improves cross-domain generalization
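
The 70/15/15 weighting of the triple contrastive term can be written compactly (a sketch; the weights come from the list above, while the function and argument names are assumptions):

```python
def triple_contrastive_loss(text_image, text_attributes, image_attributes):
    """Weighted sum of the three pairwise contrastive terms (70/15/15).
    Each argument is the already-computed contrastive loss for that pair."""
    return 0.70 * text_image + 0.15 * text_attributes + 0.15 * image_attributes
```

Text-image alignment dominates, so attribute supervision nudges the embedding space without overwhelming the primary retrieval objective.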

**Training Features**:
- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`) to prevent exploding gradients
- `ReduceLROnPlateau` scheduler (patience=3, factor=0.5)
- Early stopping (patience=7)
- Automatic saving of the best checkpoint
- Detailed metrics logging (alignment losses, cosine similarities)
- Overfitting detection and warnings
- Training-curve visualization with 3 plots (losses, overfitting gap, comparison)
### Hyperparameter Optimization

The project includes Optuna-based hyperparameter optimization:

```bash
python -m training.optuna_optimisation
```

This optimizes:
- Learning rate
- Temperature for the contrastive loss
- Alignment weight
- Weight decay

Results are saved in `optuna/optuna_study.pkl`, with visualizations in `optuna/optuna_optimization_history.png` and `optuna/optuna_param_importances.png`.

The best hyperparameters found by Optuna are used in `training/train_main_model.py`.
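
The search over these four hyperparameters follows the standard Optuna objective pattern (a sketch; the search ranges and the `train_and_validate` stub below are assumptions for illustration, not the repository's actual settings):

```python
def train_and_validate(lr, temperature, alignment_weight, weight_decay):
    # Stand-in for a real training run; returns a dummy validation loss
    # so the sketch is self-contained.
    return abs(lr - 1.5e-5) * 1e4 + abs(temperature - 0.09)

def objective(trial):
    """Optuna objective: sample hyperparameters, train, return validation loss."""
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    temperature = trial.suggest_float("temperature", 0.03, 0.2)
    alignment_weight = trial.suggest_float("alignment_weight", 0.05, 0.5)
    weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-3, log=True)
    return train_and_validate(lr, temperature, alignment_weight, weight_decay)

# With Optuna installed:
# import optuna
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)
```

Log-scale sampling for the learning rate and weight decay matches how these parameters are usually searched, since plausible values span orders of magnitude.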

## 📊 Models

### Color Model

- **Architecture**: ResNet18 (image encoder) + Embedding (text encoder)
- **Embedding dimension**: 16
- **Trained on**: Fashion data with color annotations
- **Usage**: Extract color embeddings from text or images

### Hierarchy Model

- **Architecture**: ResNet18 (image encoder) + Embedding (hierarchy encoder)
- **Embedding dimension**: 64
- **Hierarchy classes**: shirt, dress, pant, shoe, bag, etc.
- **Usage**: Classify and encode the categorical hierarchy

### Main CLIP Model (GAP-CLIP)

- **Architecture**: CLIP ViT-B/32 (LAION)
- **Base Model**: `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`
- **Training Approach**: Enhanced contrastive loss with direct attribute alignment
- **Embedding Dimensions**: 512 total
  - Color subspace: dims 0-15 (16 dims)
  - Hierarchy subspace: dims 16-79 (64 dims)
  - General CLIP: dims 80-511 (432 dims)
- **Training Dataset**: 20,000 fashion items with color and hierarchy annotations
- **Validation Split**: 80/20 train-validation split
- **Optimizer**: AdamW with weight decay (5e-4)
- **Best Checkpoint**: Automatically saved based on validation loss
- **Features**:
  - Multimodal text-image search
  - Guaranteed attribute positioning (GAP) in specific dimensions
  - Direct alignment with the specialized color and hierarchy models
  - Maintains general CLIP capabilities for cross-domain tasks
  - Reduced overfitting through augmentation and regularization

## 🔍 Advanced Usage Examples

### Search with Combined Embeddings
```python
import torch
import torch.nn.functional as F

# Text query
text_query = "red dress"
text_inputs = processor(text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Main model embeddings
with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds  # Shape: [1, 512]

# Extract the specialized embeddings from the main model
main_color_emb = text_features[:, :16]        # Color dimensions (0-15)
main_hierarchy_emb = text_features[:, 16:80]  # Hierarchy dimensions (16-79)
main_clip_emb = text_features[:, 80:]         # General CLIP dimensions (80-511)

# Compare with the specialized models
color_emb = color_model.get_text_embeddings([text_query])
hierarchy_emb = hierarchy_model.get_text_embeddings([text_query])

# Measure alignment quality
color_similarity = F.cosine_similarity(color_emb, main_color_emb, dim=1)
hierarchy_similarity = F.cosine_similarity(hierarchy_emb, main_hierarchy_emb, dim=1)

print(f"Color alignment: {color_similarity.item():.4f}")
print(f"Hierarchy alignment: {hierarchy_similarity.item():.4f}")

# For search, you can use different strategies:
# 1. Full embeddings for general search
# 2. Color subspace for color-specific search
# 3. Hierarchy subspace for category search
# 4. Weighted combination of subspaces
```
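
Strategy 4 can be realized by scoring each subspace separately and blending the cosine similarities (a sketch; the blend weights and the function name are illustrative assumptions, not part of the released code):

```python
import torch
import torch.nn.functional as F

def weighted_subspace_similarity(text_emb, image_emb,
                                 w_color=0.3, w_hierarchy=0.3, w_general=0.4):
    """Blend per-subspace cosine similarities between [batch, 512] embeddings."""
    total = 0.0
    for dims, w in [(slice(0, 16), w_color),       # color subspace
                    (slice(16, 80), w_hierarchy),  # hierarchy subspace
                    (slice(80, 512), w_general)]:  # general CLIP subspace
        total = total + w * F.cosine_similarity(text_emb[:, dims],
                                                image_emb[:, dims], dim=1)
    return total

# Identical embeddings score 1.0, since the weights sum to 1
x = torch.randn(2, 512)
print(weighted_subspace_similarity(x, x))
```

Raising `w_color` biases retrieval toward color matches without retraining anything.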

### Search in an Image Database

```python
import torch
import torch.nn.functional as F
from PIL import Image
from tqdm import tqdm

# Step 1: Pre-compute image embeddings (do this once)
image_paths = [...]  # List of image paths
image_features_list = []

print("Computing image embeddings...")
for img_path in tqdm(image_paths):
    image = Image.open(img_path).convert("RGB")
    image_inputs = processor(images=[image], return_tensors="pt")
    image_inputs = {k: v.to(device) for k, v in image_inputs.items()}

    with torch.no_grad():
        outputs = main_model(**image_inputs)
        features = outputs.image_embeds  # Shape: [1, 512]
    image_features_list.append(features.cpu())

# Stack all features
image_features = torch.cat(image_features_list, dim=0)  # Shape: [N, 512]

# Step 2: Search with a text query
query = "red dress"
text_inputs = processor(text=[query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds  # Shape: [1, 512]

# Step 3: Compute similarities
# Normalize the embeddings for cosine similarity
text_features_norm = F.normalize(text_features, dim=-1)
image_features_norm = F.normalize(image_features.to(device), dim=-1)

# Cosine similarities
similarities = (text_features_norm @ image_features_norm.T).squeeze(0)  # Shape: [N]

# Step 4: Get the top-k results
top_k = 10
top_scores, top_indices = similarities.topk(top_k, largest=True)

# Display the results
print(f"\nTop {top_k} results for query: '{query}'")
for i, (idx, score) in enumerate(zip(top_indices, top_scores)):
    print(f"{i+1}. {image_paths[idx]} (similarity: {score.item():.4f})")

# Optional: filter by color or hierarchy using the dedicated subspaces
query_color_emb = text_features[:, :16]        # Color embeddings of the query
query_hierarchy_emb = text_features[:, 16:80]  # Hierarchy embeddings of the query
```
## 📝 Evaluation

### Running All Evaluations

Use the orchestrator script to run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or run specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52
```

**Available steps**:

| Step | Paper Section | Description |
|------|--------------|-------------|
| `sec51` | §5.1 | Color model accuracy (Table 1) |
| `sec52` | §5.2 | Category model confusion matrices (Table 2) |
| `sec533` | §5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | §5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | §5.3.6 | Embedding structure Tests A/B/C (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search demo |

**Evaluation Datasets**:
1. **Internal dataset** (~50,000 samples) — Fashion items with color and category annotations
2. **KAGL Marqo** (Hugging Face dataset) — Real-world fashion e-commerce data
3. **Fashion-MNIST** (~10,000 samples) — Standard benchmark with 10 categories

**Evaluation Metrics**:
- Nearest-neighbor classification accuracy
- Centroid-based classification accuracy
- Separation score (intra-class vs inter-class cosine similarity)
- Confusion matrices (text and image modalities)
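
The separation score is only described informally here; one common formulation (an assumption — the exact metric in `evaluation/utils/metrics.py` may differ) is mean intra-class minus mean inter-class cosine similarity:

```python
import numpy as np

def separation_score(embeddings, labels):
    """Mean intra-class minus mean inter-class cosine similarity.
    This exact formulation is an assumption; the repository's metric may differ."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # pairwise cosine similarities
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sim[same & off_diag].mean()   # same class, excluding self-pairs
    inter = sim[~same].mean()             # different classes
    return intra - inter

# Two tight, well-separated clusters yield a score close to 1
X = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(separation_score(X, ["a", "a", "b", "b"]))
```

Higher is better: embeddings of the same class sit close together while different classes stay apart.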

**Baseline Comparison**: All evaluations compare GAP-CLIP against `patrickjohncyh/fashion-clip`.

## 📊 Performance & Results

The evaluation framework tests GAP-CLIP across three datasets, with comparison to the Fashion-CLIP baseline.

### Evaluation Metrics

**Color Classification** (dimensions 0-15):
- Nearest-neighbor accuracy
- Centroid-based accuracy
- Separation score (class separability)

**Hierarchy Classification** (dimensions 16-79):
- Nearest-neighbor accuracy
- Centroid-based accuracy
- Separation score

### Datasets Used for Evaluation

1. **Fashion-MNIST**: 10,000 grayscale fashion item images
   - 10 categories (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot)
   - Mapped to the model's hierarchy classes

2. **KAGL Marqo Dataset**: Real-world fashion images from Hugging Face
   - Diverse fashion items with rich metadata
   - Color and category annotations
   - Realistic product images

3. **Local Validation Set**: Custom validation dataset
   - Fashion items with local image paths
   - Annotated with colors and hierarchies
   - Domain-specific evaluation

### Comparative Analysis

The evaluation includes:
- **Baseline comparison**: GAP-CLIP vs `patrickjohncyh/fashion-clip`
- **Subspace analysis**: Dedicated dimensions (0-79) vs the full space (0-511)
- **Cross-dataset generalization**: Performance consistency across datasets
- **Alignment quality**: How closely the specialized dimensions match the expert models

All visualizations (confusion matrices, t-SNE plots, heatmaps) are saved automatically in the analysis directory.
## 📄 Citation

If you use GAP-CLIP in your research, please cite:

```bibtex
@misc{gap-clip-2024,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2024},
  note={A multi-loss framework combining contrastive learning with direct attribute alignment},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
  abstract={GAP-CLIP introduces a novel training approach that guarantees specific embedding
  dimensions encode color (dims 0-15) and hierarchy (dims 16-79) information through
  direct alignment with specialized models, while maintaining full CLIP capabilities
  in the remaining dimensions (80-511).}
}
```
### Key Contributions

- **Guaranteed Attribute Positioning**: Specific dimensions reliably encode color and hierarchy
- **Multi-Loss Training**: Combines contrastive learning with MSE and cosine alignment losses
- **Specialized Model Alignment**: Direct supervision from expert color and hierarchy models
- **Preserved Generalization**: Maintains base CLIP capabilities for cross-domain tasks
- **Comprehensive Evaluation**: Tested across multiple datasets with baseline comparisons

## ❓ FAQ & Troubleshooting

### Q: What are the minimum hardware requirements?

**A**:
- **GPU**: Recommended for training (CUDA or MPS); CPU training is very slow
- **RAM**: Minimum 16 GB; 32 GB recommended for training
- **Storage**: ~5 GB for models and datasets

### Q: Why are my embeddings not aligned?

**A**: Check that:
1. You're using the correct dimension ranges (0-15 for color, 16-79 for hierarchy)
2. The model was trained with `alignment_weight > 0`
3. The color and hierarchy models were properly loaded during training

### Q: How do I use only the color or hierarchy subspace for search?

**A**:
```python
# Extract and use only the color embeddings
text_color_emb = text_features[:, :16]
image_color_emb = image_features[:, :16]
color_similarity = F.cosine_similarity(text_color_emb, image_color_emb)

# Extract and use only the hierarchy embeddings
text_hierarchy_emb = text_features[:, 16:80]
image_hierarchy_emb = image_features[:, 16:80]
hierarchy_similarity = F.cosine_similarity(text_hierarchy_emb, image_hierarchy_emb)
```

### Q: Can I add more attributes beyond color and hierarchy?

**A**: Yes! The architecture is extensible:
1. Train a new specialized model for your attribute
2. Reserve additional dimensions in the embedding space
3. Add alignment losses for these dimensions in `enhanced_contrastive_loss`
4. Update `config.py` with the new dimension ranges
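
For steps 2 and 4, one convenient pattern is to centralize the dimension layout in a single mapping (a hypothetical sketch; the `fabric` attribute and every name below are illustrative, not part of the released model):

```python
import numpy as np

# Hypothetical extended layout; "fabric" and all names here are
# illustrative, not part of the released GAP-CLIP model.
ATTRIBUTE_DIMS = {
    "color": slice(0, 16),       # existing subspace
    "hierarchy": slice(16, 80),  # existing subspace
    "fabric": slice(80, 96),     # a new attribute would reserve dims 80-95
}
GENERAL_DIMS = slice(96, 512)    # the general CLIP subspace shrinks accordingly

def subspace(features, name):
    """Select one attribute subspace from a [batch, 512] embedding matrix."""
    return features[:, ATTRIBUTE_DIMS[name]]
```

Keeping the layout in one place means the alignment losses and the search code can never disagree about which dimensions belong to which attribute.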

### Q: How do I evaluate on my own dataset?

**A**:
1. Format your dataset as a CSV with the columns `text`, `color`, `hierarchy`, `local_image_path`
2. Update `config.local_dataset_path` in `config.py`
3. Run the evaluation: `python evaluation/run_all_evaluations.py`

### Q: Training loss is decreasing but validation loss is increasing. What should I do?

**A**: This indicates overfitting. Try:
- Increasing `weight_decay` (e.g., from 5e-4 to 1e-3)
- Reducing `alignment_weight` (e.g., from 0.2 to 0.1)
- Increasing the dataset size (`subset_size`)
- Adding more data augmentation in `CustomDataset`
- Increasing the early-stopping patience

### Q: Can I fine-tune GAP-CLIP on a specific domain?

**A**: Yes! Load the checkpoint and continue training:
```python
checkpoint = torch.load('models/gap_clip.pth')
model.load_state_dict(checkpoint['model_state_dict'])
# Continue training with your domain-specific data
```

## 🧪 Testing & Evaluation

### Quick Test

```bash
# Test the configuration
python -c "import config; config.print_config()"

# Test model loading
python example_usage.py --repo-id Leacb4/gap-clip --text "red dress"
```

### Full Evaluation Suite

```bash
# Run all evaluations
cd evaluation
python run_all_evaluations.py --repo-id Leacb4/gap-clip

# Results are saved to evaluation_results/ with:
# - summary.json: Detailed metrics
# - summary_comparison.png: Visual comparison
```

## 🐛 Known Issues & Fixes

### Fixed Issues ✨

1. **Color model image-loading bug** (fixed in `color_model.py`)
   - Before: `Image.open(config.column_local_image_path)`
   - After: `Image.open(img_path)` — the path is now correctly taken from the dataframe

2. **Function naming in training** (fixed in `training/main_model.py` and `training/train_main_model.py`)
   - Before: `train_one_epoch_enhanced`
   - After: `train_one_epoch` — consistent naming

3. **Device compatibility** (improved in `config.py`)
   - Now automatically detects and selects the best device (CUDA > MPS > CPU)

## 🎓 Learning Resources

### Documentation Files

- **README.md** (this file): Complete project documentation
- **paper/latex_paper.ltx**: Scientific paper (LaTeX source)
- **MODEL_CARD.md**: Hugging Face model card

### Code Examples

- **example_usage.py**: Basic usage with the Hugging Face Hub
- **evaluation/annex94_search_demo.py**: Interactive search demo
- **evaluation/annex93_tsne.py**: t-SNE visualization

## 🤝 Contributing

We welcome contributions! Here's how:

1. **Report bugs**: Open an issue with a detailed description
2. **Suggest features**: Describe your idea in an issue
3. **Submit a PR**: Fork, create a branch, commit, and open a pull request
894
- 4. **Improve docs**: Help make documentation clearer
895
-
896
- ### Development Setup
897
-
898
- ```bash
899
- # Install with dev dependencies
900
- pip install -e ".[dev]"
901
-
902
- # Run tests (if available)
903
- pytest
904
-
905
- # Format code
906
- black .
907
- flake8 .
908
- ```
909
-
910
- ## 📊 Project Statistics
911
-
912
- - **Language**: Python 3.8+
913
- - **Framework**: PyTorch 2.0+
914
- - **Models**: 3 specialized models (color, hierarchy, main)
915
- - **Embedding Size**: 512 dimensions
916
- - **Training Data**: 20,000+ fashion items
917
- - **Lines of Code**: 5,000+ (including documentation)
918
- - **Documentation**: Comprehensive docstrings and guides
919
 
920
- ## 🔗 Links
921
 
922
- - **Hugging Face Hub**: [Leacb4/gap-clip](https://huggingface.co/Leacb4/gap-clip)
923
- - **GitHub**: [github.com/Leacb4/gap-clip](https://github.com/Leacb4/gap-clip)
924
- - **Contact**: lea.attia@gmail.com
925
 
926
- ## 📧 Contact & Support
927
-
928
- **Author**: Lea Attia Sarfati
929
- **Email**: lea.attia@gmail.com
930
  **Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)
931
-
932
- For questions, issues, or suggestions:
933
- - 🐛 **Bug reports**: Open an issue on GitHub
934
- - 💡 **Feature requests**: Open an issue with [Feature Request] tag
935
- - 📧 **Direct contact**: lea.attia@gmail.com
936
- - 💬 **Discussions**: Hugging Face Discussions
937
-
938
- ---
939
-
940
- ## 📜 License
941
-
942
- This project is licensed under the MIT License - see the LICENSE file for details.
943
-
944
- ## 🙏 Acknowledgments
945
-
946
- - LAION team for the base CLIP model
947
- - Hugging Face for transformers library and model hosting
948
- - PyTorch team for the deep learning framework
949
- - Fashion-MNIST dataset creators
950
- - All contributors and users of this project
951
-
952
- ---
953
-
954
- **⭐ If you find this project useful, please consider giving it a star on GitHub!**
955
-
956
- **📢 Version**: 1.0.0 | **Status**: Production Ready ✅ | **Last Updated**: December 2024
 
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)
 
+ **A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.**
 
  ---
 
+ ## Quick Start
 
+ ### Installation
 
  ```bash
  git clone https://github.com/Leacb4/gap-clip.git
  cd gap-clip
  pip install -e .
  ```
 
+ ### Load from Hugging Face
 
  ```python
  from example_usage import load_models_from_hf
 
  models = load_models_from_hf("Leacb4/gap-clip")
 
+ # Extract structured embeddings from text
+ import torch
+ import torch.nn.functional as F
 
+ processor = models['processor']
+ main_model = models['main_model']
+ device = models['device']
 
+ text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
+ text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
 
+ with torch.no_grad():
+     text_outputs = main_model.text_model(**text_inputs)
+     text_features = main_model.text_projection(text_outputs.pooler_output)
+     text_features = F.normalize(text_features, dim=-1)
 
+ color_emb = text_features[:, :16]       # dims 0-15  — color
+ category_emb = text_features[:, 16:80]  # dims 16-79 — category
+ general_emb = text_features[:, 80:]     # dims 80-511 — general CLIP
  ```
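With text features from the snippet above (and image features extracted the same way), the three subspaces can be scored independently and blended into one retrieval score. A minimal sketch on random stand-in features; the `subspace_scores`/`weighted_score` helpers and the weights are illustrative, not part of the repository:

```python
import torch
import torch.nn.functional as F

def subspace_scores(text_features, image_features):
    """Cosine similarity per GAP-CLIP subspace: color, category, general."""
    slices = {"color": slice(0, 16), "category": slice(16, 80), "general": slice(80, 512)}
    return {name: F.cosine_similarity(text_features[:, sl], image_features[:, sl], dim=-1)
            for name, sl in slices.items()}

def weighted_score(scores, w_color=0.3, w_category=0.3, w_general=0.4):
    # Illustrative weighting; tune per application.
    return w_color * scores["color"] + w_category * scores["category"] + w_general * scores["general"]

# Random stand-ins for one text query and five candidate images
text_feat = F.normalize(torch.randn(1, 512), dim=-1)
image_feats = F.normalize(torch.randn(5, 512), dim=-1)

scores = subspace_scores(text_feat, image_feats)
ranking = weighted_score(scores).argsort(descending=True)  # best match first
```

Raising the color weight biases retrieval toward color fidelity; setting a weight to zero ignores that attribute entirely.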
 
+ ---
 
+ ## Architecture
 
+ GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:
 
+ | Subspace | Dimensions | Aligned with |
+ |----------|------------|--------------|
+ | Color | 0-15 (16 D) | ColorCLIP specialist model |
+ | Category | 16-79 (64 D) | HierarchyModel specialist model |
+ | General CLIP | 80-511 (432 D) | Standard CLIP semantic space |
 
+ ### Specialist Models (v2)
 
+ Both specialist models use **frozen CLIP ViT-B/32 encoders** with small trainable projection heads:
 
+ - **ColorCLIP**: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
+ - **HierarchyModel**: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.
 
+ Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.
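As a concrete illustration of the ColorCLIP head, here is a minimal sketch of a frozen-backbone projection head; the `ColorHead` class is hypothetical and stands in for the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorHead(nn.Module):
    """Illustrative ColorCLIP-style head: Linear(512, 16) + L2 norm on frozen CLIP features."""
    def __init__(self, clip_dim=512, color_dim=16):
        super().__init__()
        self.proj = nn.Linear(clip_dim, color_dim)  # the only trainable part

    def forward(self, clip_features):
        # clip_features: [batch, 512] from a frozen CLIP image or text encoder
        return F.normalize(self.proj(clip_features), dim=-1)  # unit-norm [batch, 16]

head = ColorHead()
n_params = sum(p.numel() for p in head.parameters() if p.requires_grad)
out = head(torch.randn(4, 512))
```

A single `Linear(512, 16)` layer has 8,208 parameters; separate image- and text-side heads would be consistent with the ~16K trainable-parameter figure above.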
 
 
 
 
 
+ ### Main Model Training
 
+ The main CLIP model is fine-tuned end to end with an **enhanced contrastive loss** that combines:
 
+ 1. **Triple contrastive loss** (text-image, text-attributes, image-attributes)
+ 2. **Alignment loss** — MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
+ 3. **Reference loss** — optional regularization to stay close to the base CLIP text space
 
+ ```
+ total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
+ ```
 
+ where alpha = 0.2 (alignment weight) and beta = 0.1 (reference weight).
 
+ **Hyperparameters**: lr = 1.5e-5, temperature = 0.09, weight decay = 2.76e-5, batch size = 128, trained for 10 epochs on a 100K-sample subset.
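The combination above can be sketched as follows. The `alignment_term` (MSE plus a cosine term) is shown for a single subspace on the text side only, and all tensor values are random stand-ins, not the repository's `enhanced_contrastive_loss` implementation:

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.2, 0.1  # alignment and reference weights

def alignment_term(main_subspace, specialist_target):
    """MSE plus (1 - mean cosine similarity) between a subspace and its specialist target."""
    mse = F.mse_loss(main_subspace, specialist_target)
    cos = 1.0 - F.cosine_similarity(main_subspace, specialist_target, dim=-1).mean()
    return mse + cos

# Random stand-ins for the color subspace (dims 0-15) of a batch of 8 texts
main_color = F.normalize(torch.randn(8, 16), dim=-1)
spec_color = F.normalize(torch.randn(8, 16), dim=-1)
contrastive_loss = torch.tensor(1.5)  # placeholder value
reference_loss = torch.tensor(0.4)    # placeholder value

total = (1 - ALPHA) * contrastive_loss + ALPHA * alignment_term(main_color, spec_color) + BETA * reference_loss
```

The alignment term is zero exactly when the subspace matches its specialist target, so it pulls the reserved dimensions toward the specialist embeddings without dominating the contrastive objective.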
 
+ ---
 
+ ## Project Structure
 
  ```
  .
+ ├── config.py                  # Paths, dimensions, device detection
+ ├── example_usage.py           # Load from HuggingFace + demo search
+ ├── setup.py                   # pip install -e .
+ ├── __init__.py
+ ├── README.md                  # This file (also the HF model card)
+ ├── training/
+ │   ├── color_model.py         # ColorCLIP: frozen CLIP + Linear(512, 16)
+ │   ├── hierarchy_model.py     # HierarchyModel: frozen CLIP + MLP(512, 128, 64)
+ │   └── main_model.py          # GAP-CLIP fine-tuning with enhanced loss
+ ├── evaluation/
+ │   ├── run_all_evaluations.py           # Orchestrator for all paper evaluations
+ │   ├── sec51_color_model_eval.py        # Table 1 — color accuracy
+ │   ├── sec52_category_model_eval.py     # Table 2 — category accuracy
+ │   ├── sec533_clip_nn_accuracy.py       # Table 3 — NN classification
+ │   ├── sec5354_separation_semantic.py   # Separation & zero-shot semantic
+ │   ├── sec536_embedding_structure.py    # Table 4 — structure tests A/B/C/D
+ │   ├── annex92_color_heatmaps.py        # Color similarity heatmaps
+ │   ├── annex93_tsne.py                  # t-SNE visualizations
+ │   ├── annex94_search_demo.py           # Fashion search engine demo
+ │   └── utils/
+ │       ├── datasets.py        # Dataset loaders (internal, KAGL, FMNIST)
+ │       ├── metrics.py         # Separation score, accuracy metrics
+ │       └── model_loader.py    # Model loading helpers (v2 checkpoint)
+ ├── models/                    # Trained weights (git-ignored, on HF Hub)
+ │   ├── color_model.pt         # ColorCLIP checkpoint (~600 MB)
+ │   ├── hierarchy_model.pth    # HierarchyModel checkpoint (~600 MB)
+ │   └── gap_clip.pth           # Main GAP-CLIP checkpoint (~1.7 GB)
+ ├── figures/                   # Paper figures & evaluation outputs
+ │   ├── scheme.png             # Architecture diagram
+ │   ├── training_curves.png    # Training/validation loss curves
+ │   ├── heatmap.png            # GAP-CLIP color similarity heatmap
+ │   ├── heatmap_baseline.jpg   # Baseline color similarity heatmap
+ │   ├── tsne_*.png             # t-SNE visualizations (4 files)
+ │   ├── *_red_dress.png        # Search demo: "red dress"
+ │   ├── *_blue_pant.png        # Search demo: "blue pant"
+ │   └── confusion_matrices/    # Color (8) and hierarchy (12) matrices
+ ├── paper/
+ │   ├── paper.ltx              # LaTeX source
+ │   └── paper.pdf              # Compiled paper
+ └── data/                      # Training data (git-ignored)
+     └── fashion-mnist_test.csv # Fashion-MNIST evaluation set
  ```
 
+ ---
 
+ ## Usage
 
+ ### Text Search
 
  ```python
  from example_usage import load_models_from_hf
 
+ models = load_models_from_hf("Leacb4/gap-clip")
 
+ # Use the specialist models directly
+ color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
+ hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
  ```
 
+ ### Image Search
 
  ```python
  from PIL import Image
 
  image = Image.open("path/to/image.jpg").convert("RGB")
+ image_inputs = models['processor'](images=[image], return_tensors="pt")
+ image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}
 
  with torch.no_grad():
+     vision_outputs = models['main_model'].vision_model(**image_inputs)
+     image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
+     image_features = F.normalize(image_features, dim=-1)
 
+ # Structured subspaces
+ color_emb = image_features[:, :16]
+ category_emb = image_features[:, 16:80]
+ general_emb = image_features[:, 80:]
  ```
 
+ ### Alignment Check
 
  ```python
  import torch.nn.functional as F
 
+ # Compare the specialist embedding with the main model's subspace
+ color_from_specialist = models['color_model'].get_text_embeddings(["red"])
+ color_from_main = text_features[:, :16]
 
+ similarity = F.cosine_similarity(color_from_specialist, color_from_main, dim=1)
+ print(f"Color alignment: {similarity.item():.4f}")
  ```
 
+ ### CLI
 
+ ```bash
+ # Load from HuggingFace and run an example search
+ python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"
 
+ # With an image
+ python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
+ ```
 
+ ---
 
+ ## Training
 
+ ### 1. Train the Color Model
 
+ ```bash
+ # From the repository root:
+ python -m training.color_model
+ ```
 
+ Trains `ColorCLIP`: frozen CLIP ViT-B/32 + a trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon (MPS). Saves the checkpoint to `models/color_model.pt`.
 
+ ### 2. Train the Hierarchy Model
 
+ ```bash
+ python -m training.hierarchy_model
+ ```
 
+ Trains `HierarchyModel`: frozen CLIP ViT-B/32 + a trainable `MLP(512 -> 128 -> 64)` + classifier heads, with a multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves the checkpoint to `models/hierarchy_model.pth`.
 
+ Steps 1 and 2 can run in parallel.
 
+ ### 3. Train the Main GAP-CLIP Model
 
  ```bash
+ python -m training.main_model
  ```
 
+ Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets. Training features:
 
+ - Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
+ - Gradient clipping (max_norm=1.0)
+ - ReduceLROnPlateau scheduler (patience=3, factor=0.5)
+ - Early stopping (patience=7)
+ - Automatic best-model checkpointing
+ - Training curves saved to `figures/training_curves.png`
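The training features listed above fit together as in this minimal sketch; the toy model, data, and epoch count are placeholders, not the repository's actual training loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and dataloader
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=2.76e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

best_val, patience, bad_epochs = float("inf"), 7, 0
x, y = torch.randn(32, 16), torch.randn(32, 1)

for epoch in range(10):
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    val_loss = loss.item()     # stand-in for a real validation pass
    scheduler.step(val_loss)   # ReduceLROnPlateau monitors the validation metric
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        # torch.save({...}, "models/gap_clip.pth")  # best-model checkpointing
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```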
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ ---
 
+ ## Evaluation
 
+ Run all paper evaluations:
 
  ```bash
  python evaluation/run_all_evaluations.py
  ```
 
+ Or run specific sections:
+
  ```bash
+ python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
  ```
 
  | Step | Paper Section | Description |
  |------|---------------|-------------|
+ | `sec51` | Section 5.1 | Color model accuracy (Table 1) |
+ | `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
+ | `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
+ | `sec5354` | Sections 5.3.4-5 | Separation & zero-shot semantic evaluation |
+ | `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
  | `annex92` | Annex 9.2 | Color similarity heatmaps |
  | `annex93` | Annex 9.3 | t-SNE visualizations |
+ | `annex94` | Annex 9.4 | Fashion search engine demo |
 
+ All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (Hugging Face), and Fashion-MNIST.
 
+ ---
 
+ ## Configuration
 
+ All paths and hyperparameters live in `config.py`:
 
+ ```python
+ import config
 
+ config.device            # Auto-detected: CUDA > MPS > CPU
+ config.color_emb_dim     # 16
+ config.hierarchy_emb_dim # 64
+ config.main_emb_dim      # 512
+ config.print_config()    # Pretty-print settings
+ config.validate_paths()  # Check that model files exist
+ ```
+ ```
307
 
308
+ ---
309
 
310
+ ## Repository Files on Hugging Face
 
 
 
 
311
 
312
+ | File | Description |
313
+ |------|-------------|
314
+ | `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
315
+ | `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
316
+ | `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |
317
 
318
+ ---
319
 
320
+ ## Citation
321
 
322
  ```bibtex
323
+ @misc{gap-clip-2025,
324
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
325
  author={Sarfati, Lea Attia},
326
+ year={2025},
 
327
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
 
 
 
 
328
  }
329
  ```
 
+ ## License
 
+ MIT License. See LICENSE for details.
 
+ ## Contact
 
+ **Author**: Lea Attia Sarfati
+ **Email**: lea.attia@gmail.com
  **Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)