Leacb4 committed
Commit 1930636 · verified · 1 Parent(s): 9a67f8b

Upload README.md with huggingface_hub

Files changed (1): README.md (+175 -792)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)

**Advanced multimodal fashion search model combining specialized color embeddings, hierarchical category embeddings, and CLIP for intelligent fashion item retrieval.**

---

## 🚀 Quick Start

### Installation (< 1 minute)

```bash
# Clone the repository
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip

# Install the package with pip
pip install -e .

# Or install only the dependencies
pip install -r requirements.txt
```

### Try It Now (< 2 minutes)
```python
import torch
from example_usage import load_models_from_hf

# Load pre-trained models from Hugging Face
models = load_models_from_hf("Leacb4/gap-clip")

# Search with text
text_query = "red summer dress"
text_inputs = models['processor'](text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(models['device']) for k, v in text_inputs.items()}

with torch.no_grad():
    text_features = models['main_model'](**text_inputs).text_embeds

# Extract the specialized embeddings
color_emb = text_features[:, :16]       # Color (dims 0-15)
category_emb = text_features[:, 16:80]  # Category (dims 16-79)
general_emb = text_features[:, 80:]     # General CLIP (dims 80-511)

print("✅ Successfully extracted embeddings!")
print(f"   Color: {color_emb.shape}, Category: {category_emb.shape}, General: {general_emb.shape}")
```

---
## 📋 Description

This project implements an advanced fashion search system based on CLIP, with three specialized models:

1. **Color Model** (`color_model.pt`): Specialized CLIP model for extracting reduced-size color embeddings from text and images
2. **Hierarchy Model** (`hierarchy_model.pth`): Model for classifying and encoding the reduced-size categorical hierarchy of fashion items
3. **Main CLIP Model** (`gap_clip.pth`): Main CLIP model based on LAION, trained with the color and hierarchy embeddings

### Architecture

The main model's embedding structure:
- **Dimensions 0-15** (16 dims): Color embeddings aligned with the specialized color model
- **Dimensions 16-79** (64 dims): Hierarchy embeddings aligned with the specialized hierarchy model
- **Dimensions 80-511** (432 dims): Standard CLIP embeddings for general visual-semantic understanding

**Total: 512 dimensions** per embedding (text or image)

**Key Innovation**: The first 80 dimensions are explicitly trained to align with the specialized models through direct MSE and cosine-similarity losses, ensuring guaranteed attribute positioning (GAP) while maintaining full CLIP capabilities in the remaining dimensions.
### Loss Functions

**1. Enhanced Contrastive Loss** (`enhanced_contrastive_loss`):

Combines multiple objectives:
- **Original Triple Loss**: Text-image-attributes contrastive learning
- **Color Alignment**: Forces dims 0-15 to match the color model embeddings
- **Hierarchy Alignment**: Forces dims 16-79 to match the hierarchy model embeddings
- **Reference Loss**: Optional regularization to stay close to base CLIP

**2. Alignment Components**:
```python
# Color alignment (text & image)
color_text_mse = F.mse_loss(main_color_dims, color_model_emb)
color_text_cosine = 1 - F.cosine_similarity(main_color_dims, color_model_emb).mean()

# Hierarchy alignment (text & image)
hierarchy_text_mse = F.mse_loss(main_hierarchy_dims, hierarchy_model_emb)
hierarchy_text_cosine = 1 - F.cosine_similarity(main_hierarchy_dims, hierarchy_model_emb).mean()

# Combined alignment
alignment_loss = (color_alignment + hierarchy_alignment) / 2
```

**3. Final Loss**:
```python
total_loss = (1 - α) * contrastive_loss + α * alignment_loss + β * reference_loss
```
Where:
- α (`alignment_weight`) = 0.2: balances the contrastive and alignment objectives
- β (`reference_weight`) = 0.1: keeps the text space close to base CLIP
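
The weighting above can also be written as a plain function (a sketch; the argument names are illustrative, not the repository's actual variables):

```python
def total_loss(contrastive_loss, alignment_loss, reference_loss,
               alignment_weight=0.2, reference_weight=0.1):
    """Final loss as described above: (1 - α)·contrastive + α·alignment + β·reference."""
    a, b = alignment_weight, reference_weight
    return (1 - a) * contrastive_loss + a * alignment_loss + b * reference_loss

# With the default weights and example losses of 1.0, 0.5 and 0.2:
# 0.8 * 1.0 + 0.2 * 0.5 + 0.1 * 0.2 = 0.92
```

The same expression works unchanged on PyTorch scalar tensors.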

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- PyTorch 2.0+ (CUDA for GPU support, optional but recommended)
- 16 GB RAM minimum (32 GB recommended for training)
- ~5 GB disk space for models and data

### Method 1: Install as a Package (Recommended)

```bash
# Clone the repository
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip

# Install in development mode
pip install -e .

# Or install with optional dependencies
pip install -e ".[dev]"     # With development tools
pip install -e ".[optuna]"  # With hyperparameter optimization
pip install -e ".[all]"     # With all extras
```

### Method 2: Install Dependencies Only

```bash
pip install -r requirements.txt
```

### Method 3: From Hugging Face (Models Only)

```python
from example_usage import load_models_from_hf
models = load_models_from_hf("Leacb4/gap-clip")
```

### Main Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `torch` | ≥2.0.0 | Deep learning framework |
| `transformers` | ≥4.30.0 | Hugging Face CLIP models |
| `huggingface-hub` | ≥0.16.0 | Model download/upload |
| `pillow` | ≥9.0.0 | Image processing |
| `pandas` | ≥1.5.0 | Data manipulation |
| `scikit-learn` | ≥1.3.0 | ML metrics & evaluation |
| `tqdm` | ≥4.65.0 | Progress bars |
| `matplotlib` | ≥3.7.0 | Visualization |

### Verify Installation

```python
# Test that everything works
import config
config.print_config()

# Check the selected device
print(f"Using device: {config.device}")
```
## 📁 Project Structure

```
.
├── config.py              # Configuration for paths and parameters
├── example_usage.py       # Usage examples and Hugging Face loading
├── setup.py               # Package installation
├── __init__.py            # Package initialization
├── README.md              # This documentation
├── MODEL_CARD.md          # Hugging Face model card
│
├── paper/                 # Scientific paper
│   ├── latex_paper.ltx    # LaTeX source
│   └── paper.pdf          # Compiled PDF
│
├── figures/               # Paper figures
│   ├── scheme.png                    # Architecture diagram
│   ├── heatmap_baseline.jpg          # Baseline color heatmap
│   ├── heatmap.png                   # GAP-CLIP color heatmap
│   ├── tsne_*.png                    # t-SNE visualizations
│   ├── red_dress.png                 # Search demo example
│   ├── blue_jeans.png                # Search demo example
│   ├── optuna_param_importances.png  # Optuna importance plot
│   └── training_curves.png           # Training loss curves
│
├── training/              # Model training code
│   ├── main_model.py             # Main GAP-CLIP model with enhanced loss
│   ├── hierarchy_model.py        # Hierarchy/category model
│   ├── train_main_model.py       # Training with Optuna-optimized params
│   └── optuna_optimisation.py    # Hyperparameter optimization
│
├── evaluation/            # Paper evaluation scripts
│   ├── run_all_evaluations.py          # Orchestrates all evaluations
│   ├── sec51_color_model_eval.py       # Section 5.1 - Color model
│   ├── sec52_category_model_eval.py    # Section 5.2 - Category model
│   ├── sec533_clip_nn_accuracy.py      # Section 5.3.3 - Classification
│   ├── sec5354_separation_semantic.py  # Sections 5.3.4-5.3.5
│   ├── sec536_embedding_structure.py   # Section 5.3.6 - Structure tests
│   ├── annex92_color_heatmaps.py       # Annex - Color heatmaps
│   ├── annex93_tsne.py                 # Annex - t-SNE visualizations
│   ├── annex94_search_demo.py          # Annex - Search demo
│   └── utils/                          # Shared evaluation utilities
│       ├── datasets.py                 # Dataset loaders
│       ├── metrics.py                  # Metrics (separation, accuracy)
│       └── model_loader.py             # Model loading helpers
│
├── data/                  # Data preparation
│   ├── download_images.py        # Download dataset images
│   └── get_csv_from_chunks.py    # Merge CSV chunks
│
├── models/                # Trained model weights
│   ├── color_model.pt            # Color model checkpoint
│   ├── hierarchy_model.pth       # Hierarchy model checkpoint
│   └── gap_clip.pth              # Main GAP-CLIP checkpoint
│
└── optuna/                # Optuna optimization artifacts
    ├── optuna_results.txt        # Best hyperparameters
    ├── optuna_study.pkl          # Saved study
    ├── optuna_optimization_history.png
    └── optuna_param_importances.png
```

### Key Files Description

**Core Model Files** (in `training/`):
- `main_model.py`: GAP-CLIP implementation with the enhanced contrastive loss
- `hierarchy_model.py`: ResNet18-based hierarchy classification model (64 dims)
- `train_main_model.py`: Training with Optuna-optimized hyperparameters
- `optuna_optimisation.py`: Hyperparameter search with Optuna

**Configuration & Setup**:
- `config.py`: Configuration with type hints, automatic device detection, and validation
- `setup.py`: Package installer with CLI entry points
- `__init__.py`: Package initialization for easy imports

**Evaluation Suite** (in `evaluation/`):
- Scripts prefixed `sec5*` correspond to paper sections 5.1–5.3.6
- Scripts prefixed `annex9*` generate annex figures (heatmaps, t-SNE, search demo)
- `run_all_evaluations.py`: Orchestrates all paper evaluations
- `utils/`: Shared datasets, metrics, and model loading

**CLI Commands**:
After installation with `pip install -e .`, you can use:
```bash
gap-clip-train    # Start training
gap-clip-example  # Run usage examples
```
## 🔧 Configuration

The main parameters are defined in `config.py`:

```python
import config

# Automatic device detection (CUDA > MPS > CPU)
device = config.device  # Selects the best available device

# Embedding dimensions
color_emb_dim = config.color_emb_dim          # 16 dims (0-15)
hierarchy_emb_dim = config.hierarchy_emb_dim  # 64 dims (16-79)
main_emb_dim = config.main_emb_dim            # 512 dims total

# Default training hyperparameters
batch_size = config.DEFAULT_BATCH_SIZE        # 32
learning_rate = config.DEFAULT_LEARNING_RATE  # 1.5e-5
temperature = config.DEFAULT_TEMPERATURE      # 0.09

# Utility functions
config.print_config()    # Print the current configuration
config.validate_paths()  # Check that all required files exist
```

### New Features in config.py ✨

- **Automatic device detection**: Selects CUDA > MPS > CPU automatically
- **Type hints**: Full type annotations for better IDE support
- **Validation**: `validate_paths()` checks that all model files exist
- **Print utility**: `print_config()` shows the current settings
- **Constants**: Pre-defined default hyperparameters
- **Documentation**: Comprehensive docstrings for all settings

### Model Paths

Default paths configured in `config.py`:
- `models/color_model.pt`: Trained color model checkpoint
- `models/hierarchy_model.pth`: Trained hierarchy model checkpoint
- `models/gap_clip.pth`: Main GAP-CLIP model checkpoint
- `tokenizer_vocab.json`: Tokenizer vocabulary for the color model
- `data.csv`: Training/validation dataset

### Dataset Format

The training dataset CSV should contain:
- `text`: Text description of the fashion item
- `color`: Color label (e.g., "red", "blue", "black")
- `hierarchy`: Category label (e.g., "dress", "shirt", "shoes")
- `local_image_path`: Path to the image file

Example:
```csv
text,color,hierarchy,local_image_path
"red summer dress with floral pattern",red,dress,data/images/001.jpg
"blue denim jeans casual style",blue,jeans,data/images/002.jpg
```

## 📦 Usage
### 1. Load Models from Hugging Face

If your models are already uploaded to Hugging Face:

```python
from example_usage import load_models_from_hf

# Load all models
models = load_models_from_hf("your-username/your-model")

color_model = models['color_model']
hierarchy_model = models['hierarchy_model']
main_model = models['main_model']
processor = models['processor']
device = models['device']
```

### 2. Text Search

```python
import torch

# Prepare the text query
text_query = "red dress"
text_inputs = processor(text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Get the main model embeddings
with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds

# Get the specialized embeddings
color_emb = color_model.get_text_embeddings([text_query])
hierarchy_emb = hierarchy_model.get_text_embeddings([text_query])
```
### 3. Image Search

```python
import torch
from PIL import Image

# Load the image
image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = processor(images=[image], return_tensors="pt")
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}

# Get the embeddings
with torch.no_grad():
    outputs = main_model(**image_inputs)
    image_features = outputs.image_embeds
```

### 4. Using the Example Script

`example_usage.py` provides ready-to-use examples for loading and using GAP-CLIP:

```bash
# Load from Hugging Face and search with text
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --text "red summer dress"

# Search with an image
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --image path/to/image.jpg

# Both text and image
python example_usage.py \
    --repo-id Leacb4/gap-clip \
    --text "blue denim jeans" \
    --image path/to/image.jpg
```
This script demonstrates:
- Loading models from the Hugging Face Hub
- Extracting text and image embeddings
- Accessing the color and hierarchy subspaces
- Measuring alignment quality against the specialized models

## 🎯 Model Training

### Train the Color Model

```python
from color_model import ColorCLIP, train_color_model

# Configuration
model = ColorCLIP(vocab_size=10000, embedding_dim=16)
# ... dataset configuration ...

# Training
train_color_model(model, train_loader, val_loader, num_epochs=20)
```

### Train the Hierarchy Model

```python
from training.hierarchy_model import Model as HierarchyModel, train_hierarchy_model

# Configuration
model = HierarchyModel(num_hierarchy_classes=10, embed_dim=64)
# ... dataset configuration ...

# Training
train_hierarchy_model(model, train_loader, val_loader, num_epochs=20)
```
### Train the Main CLIP Model

The main model is trained with both specialized models using an enhanced contrastive loss.

**Option 1: Train with optimized hyperparameters (recommended)**:
```bash
python -m training.train_main_model
```
This uses hyperparameters optimized with Optuna (Trial 29, validation loss ~0.1129).

**Option 2: Train with default parameters**:
```bash
python -m training.main_model
```
This runs the main training loop with manually configured parameters.

**Default Training Parameters** (in `training/main_model.py`):
- `num_epochs = 20`: Number of training epochs
- `learning_rate = 1.5e-5`: Learning rate with the AdamW optimizer
- `temperature = 0.09`: Temperature for softer contrastive learning
- `alignment_weight = 0.2`: Weight of the color/hierarchy alignment loss
- `weight_decay = 5e-4`: L2 regularization to prevent overfitting
- `batch_size = 32`: Batch size
- `subset_size = 20000`: Dataset size for better generalization
- `reference_weight = 0.1`: Weight of the base-CLIP regularization

**Enhanced Loss Function**:

Training uses `enhanced_contrastive_loss`, which combines:

1. **Triple Contrastive Loss** (weighted):
   - Text-Image alignment (70%)
   - Text-Attributes alignment (15%)
   - Image-Attributes alignment (15%)

2. **Direct Alignment Loss** (combines color & hierarchy):
   - MSE loss between the main model's color dims (0-15) and the color model embeddings
   - MSE loss between the main model's hierarchy dims (16-79) and the hierarchy model embeddings
   - Cosine-similarity losses for both color and hierarchy
   - Applied to both text and image embeddings

3. **Reference Model Loss** (optional):
   - Keeps text embeddings close to base CLIP
   - Improves cross-domain generalization
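
The 70/15/15 weighting of the triple contrastive term can be written compactly (a sketch; the weights come from the list above, while the function and argument names are assumptions):

```python
def triple_contrastive_loss(text_image, text_attributes, image_attributes):
    """Weighted sum of the three pairwise contrastive terms (70/15/15).
    Each argument is the already-computed contrastive loss for that pair."""
    return 0.70 * text_image + 0.15 * text_attributes + 0.15 * image_attributes
```

Text-image alignment dominates, so attribute supervision nudges the embedding space without overwhelming the primary retrieval objective.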

**Training Features**:
- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`) to prevent exploding gradients
- `ReduceLROnPlateau` scheduler (patience=3, factor=0.5)
- Early stopping (patience=7)
- Automatic saving of the best checkpoint
- Detailed metrics logging (alignment losses, cosine similarities)
- Overfitting detection and warnings
- Training-curve visualization with 3 plots (losses, overfitting gap, comparison)
### Hyperparameter Optimization

The project includes Optuna-based hyperparameter optimization:

```bash
python -m training.optuna_optimisation
```

This optimizes:
- Learning rate
- Temperature for the contrastive loss
- Alignment weight
- Weight decay

Results are saved in `optuna/optuna_study.pkl`, with visualizations in `optuna/optuna_optimization_history.png` and `optuna/optuna_param_importances.png`.

The best hyperparameters found by Optuna are used in `training/train_main_model.py`.
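
The search over these four hyperparameters follows the standard Optuna objective pattern (a sketch; the search ranges and the `train_and_validate` stub below are assumptions for illustration, not the repository's actual settings):

```python
def train_and_validate(lr, temperature, alignment_weight, weight_decay):
    # Stand-in for a real training run; returns a dummy validation loss
    # so the sketch is self-contained.
    return abs(lr - 1.5e-5) * 1e4 + abs(temperature - 0.09)

def objective(trial):
    """Optuna objective: sample hyperparameters, train, return validation loss."""
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    temperature = trial.suggest_float("temperature", 0.03, 0.2)
    alignment_weight = trial.suggest_float("alignment_weight", 0.05, 0.5)
    weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-3, log=True)
    return train_and_validate(lr, temperature, alignment_weight, weight_decay)

# With Optuna installed:
# import optuna
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)
```

Log-scale sampling for the learning rate and weight decay matches how these parameters are usually searched, since plausible values span orders of magnitude.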

## 📊 Models

### Color Model

- **Architecture**: ResNet18 (image encoder) + Embedding (text encoder)
- **Embedding dimension**: 16
- **Trained on**: Fashion data with color annotations
- **Usage**: Extract color embeddings from text or images

### Hierarchy Model

- **Architecture**: ResNet18 (image encoder) + Embedding (hierarchy encoder)
- **Embedding dimension**: 64
- **Hierarchy classes**: shirt, dress, pant, shoe, bag, etc.
- **Usage**: Classify and encode the categorical hierarchy

### Main CLIP Model (GAP-CLIP)

- **Architecture**: CLIP ViT-B/32 (LAION)
- **Base Model**: `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`
- **Training Approach**: Enhanced contrastive loss with direct attribute alignment
- **Embedding Dimensions**: 512 total
  - Color subspace: dims 0-15 (16 dims)
  - Hierarchy subspace: dims 16-79 (64 dims)
  - General CLIP: dims 80-511 (432 dims)
- **Training Dataset**: 20,000 fashion items with color and hierarchy annotations
- **Validation Split**: 80/20 train-validation split
- **Optimizer**: AdamW with weight decay (5e-4)
- **Best Checkpoint**: Automatically saved based on validation loss
- **Features**:
  - Multimodal text-image search
  - Guaranteed attribute positioning (GAP) in specific dimensions
  - Direct alignment with the specialized color and hierarchy models
  - Maintains general CLIP capabilities for cross-domain tasks
  - Reduced overfitting through augmentation and regularization

## 🔍 Advanced Usage Examples

### Search with Combined Embeddings
```python
import torch
import torch.nn.functional as F

# Text query
text_query = "red dress"
text_inputs = processor(text=[text_query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Main model embeddings
with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds  # Shape: [1, 512]

# Extract the specialized embeddings from the main model
main_color_emb = text_features[:, :16]        # Color dimensions (0-15)
main_hierarchy_emb = text_features[:, 16:80]  # Hierarchy dimensions (16-79)
main_clip_emb = text_features[:, 80:]         # General CLIP dimensions (80-511)

# Compare with the specialized models
color_emb = color_model.get_text_embeddings([text_query])
hierarchy_emb = hierarchy_model.get_text_embeddings([text_query])

# Measure alignment quality
color_similarity = F.cosine_similarity(color_emb, main_color_emb, dim=1)
hierarchy_similarity = F.cosine_similarity(hierarchy_emb, main_hierarchy_emb, dim=1)

print(f"Color alignment: {color_similarity.item():.4f}")
print(f"Hierarchy alignment: {hierarchy_similarity.item():.4f}")

# For search, you can use different strategies:
# 1. Full embeddings for general search
# 2. Color subspace for color-specific search
# 3. Hierarchy subspace for category search
# 4. Weighted combination of subspaces
```
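
Strategy 4 can be realized by scoring each subspace separately and blending the cosine similarities (a sketch; the blend weights and the function name are illustrative assumptions, not part of the released code):

```python
import torch
import torch.nn.functional as F

def weighted_subspace_similarity(text_emb, image_emb,
                                 w_color=0.3, w_hierarchy=0.3, w_general=0.4):
    """Blend per-subspace cosine similarities between [batch, 512] embeddings."""
    total = 0.0
    for dims, w in [(slice(0, 16), w_color),       # color subspace
                    (slice(16, 80), w_hierarchy),  # hierarchy subspace
                    (slice(80, 512), w_general)]:  # general CLIP subspace
        total = total + w * F.cosine_similarity(text_emb[:, dims],
                                                image_emb[:, dims], dim=1)
    return total

# Identical embeddings score 1.0, since the weights sum to 1
x = torch.randn(2, 512)
print(weighted_subspace_similarity(x, x))
```

Raising `w_color` biases retrieval toward color matches without retraining anything.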

### Search in an Image Database

```python
import torch
import torch.nn.functional as F
from PIL import Image
from tqdm import tqdm

# Step 1: Pre-compute image embeddings (do this once)
image_paths = [...]  # List of image paths
image_features_list = []

print("Computing image embeddings...")
for img_path in tqdm(image_paths):
    image = Image.open(img_path).convert("RGB")
    image_inputs = processor(images=[image], return_tensors="pt")
    image_inputs = {k: v.to(device) for k, v in image_inputs.items()}

    with torch.no_grad():
        outputs = main_model(**image_inputs)
        features = outputs.image_embeds  # Shape: [1, 512]
    image_features_list.append(features.cpu())

# Stack all features
image_features = torch.cat(image_features_list, dim=0)  # Shape: [N, 512]

# Step 2: Search with a text query
query = "red dress"
text_inputs = processor(text=[query], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

with torch.no_grad():
    outputs = main_model(**text_inputs)
    text_features = outputs.text_embeds  # Shape: [1, 512]

# Step 3: Compute similarities
# Normalize the embeddings for cosine similarity
text_features_norm = F.normalize(text_features, dim=-1)
image_features_norm = F.normalize(image_features.to(device), dim=-1)

# Cosine similarities
similarities = (text_features_norm @ image_features_norm.T).squeeze(0)  # Shape: [N]

# Step 4: Get the top-k results
top_k = 10
top_scores, top_indices = similarities.topk(top_k, largest=True)

# Display the results
print(f"\nTop {top_k} results for query: '{query}'")
for i, (idx, score) in enumerate(zip(top_indices, top_scores)):
    print(f"{i+1}. {image_paths[idx]} (similarity: {score.item():.4f})")

# Optional: filter by color or hierarchy using the dedicated subspaces
query_color_emb = text_features[:, :16]        # Color embeddings of the query
query_hierarchy_emb = text_features[:, 16:80]  # Hierarchy embeddings of the query
```
## 📝 Evaluation

### Running All Evaluations

Use the orchestrator script to run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or run specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52
```

**Available steps**:

| Step | Paper Section | Description |
|------|--------------|-------------|
| `sec51` | §5.1 | Color model accuracy (Table 1) |
| `sec52` | §5.2 | Category model confusion matrices (Table 2) |
| `sec533` | §5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | §5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | §5.3.6 | Embedding structure Tests A/B/C (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search demo |

**Evaluation Datasets**:
1. **Internal dataset** (~50,000 samples) — Fashion items with color and category annotations
2. **KAGL Marqo** (Hugging Face dataset) — Real-world fashion e-commerce data
3. **Fashion-MNIST** (~10,000 samples) — Standard benchmark with 10 categories

**Evaluation Metrics**:
- Nearest-neighbor classification accuracy
- Centroid-based classification accuracy
- Separation score (intra-class vs inter-class cosine similarity)
- Confusion matrices (text and image modalities)
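
The separation score is only described informally here; one common formulation (an assumption — the exact metric in `evaluation/utils/metrics.py` may differ) is mean intra-class minus mean inter-class cosine similarity:

```python
import numpy as np

def separation_score(embeddings, labels):
    """Mean intra-class minus mean inter-class cosine similarity.
    This exact formulation is an assumption; the repository's metric may differ."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # pairwise cosine similarities
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sim[same & off_diag].mean()   # same class, excluding self-pairs
    inter = sim[~same].mean()             # different classes
    return intra - inter

# Two tight, well-separated clusters yield a score close to 1
X = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(separation_score(X, ["a", "a", "b", "b"]))
```

Higher is better: embeddings of the same class sit close together while different classes stay apart.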

**Baseline Comparison**: All evaluations compare GAP-CLIP against `patrickjohncyh/fashion-clip`.

## 📊 Performance & Results

The evaluation framework tests GAP-CLIP across three datasets, with comparison to the Fashion-CLIP baseline.

### Evaluation Metrics

**Color Classification** (dimensions 0-15):
- Nearest-neighbor accuracy
- Centroid-based accuracy
- Separation score (class separability)

**Hierarchy Classification** (dimensions 16-79):
- Nearest-neighbor accuracy
- Centroid-based accuracy
- Separation score

### Datasets Used for Evaluation

1. **Fashion-MNIST**: 10,000 grayscale fashion item images
   - 10 categories (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot)
   - Mapped to the model's hierarchy classes

2. **KAGL Marqo Dataset**: Real-world fashion images from Hugging Face
   - Diverse fashion items with rich metadata
   - Color and category annotations
   - Realistic product images

3. **Local Validation Set**: Custom validation dataset
   - Fashion items with local image paths
   - Annotated with colors and hierarchies
   - Domain-specific evaluation

### Comparative Analysis

The evaluation includes:
- **Baseline comparison**: GAP-CLIP vs `patrickjohncyh/fashion-clip`
- **Subspace analysis**: Dedicated dimensions (0-79) vs the full space (0-511)
- **Cross-dataset generalization**: Performance consistency across datasets
- **Alignment quality**: How closely the specialized dimensions match the expert models

All visualizations (confusion matrices, t-SNE plots, heatmaps) are saved automatically in the analysis directory.
## 📄 Citation

If you use GAP-CLIP in your research, please cite:

```bibtex
@misc{gap-clip-2024,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2024},
  note={A multi-loss framework combining contrastive learning with direct attribute alignment},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
  abstract={GAP-CLIP introduces a novel training approach that guarantees specific embedding
  dimensions encode color (dims 0-15) and hierarchy (dims 16-79) information through
  direct alignment with specialized models, while maintaining full CLIP capabilities
  in the remaining dimensions (80-511).}
}
```
### Key Contributions

- **Guaranteed Attribute Positioning**: Specific dimensions reliably encode color and hierarchy
- **Multi-Loss Training**: Combines contrastive learning with MSE and cosine alignment losses
- **Specialized Model Alignment**: Direct supervision from expert color and hierarchy models
- **Preserved Generalization**: Maintains base CLIP capabilities for cross-domain tasks
- **Comprehensive Evaluation**: Tested across multiple datasets with baseline comparisons

## ❓ FAQ & Troubleshooting

### Q: What are the minimum hardware requirements?

**A**:
- **GPU**: Recommended for training (CUDA or MPS); CPU training is very slow
- **RAM**: Minimum 16 GB; 32 GB recommended for training
- **Storage**: ~5 GB for models and datasets

### Q: Why are my embeddings not aligned?

**A**: Check that:
1. You're using the correct dimension ranges (0-15 for color, 16-79 for hierarchy)
2. The model was trained with `alignment_weight > 0`
3. The color and hierarchy models were properly loaded during training

### Q: How do I use only the color or hierarchy subspace for search?

**A**:
```python
# Extract and use only the color embeddings
text_color_emb = text_features[:, :16]
image_color_emb = image_features[:, :16]
color_similarity = F.cosine_similarity(text_color_emb, image_color_emb)

# Extract and use only the hierarchy embeddings
text_hierarchy_emb = text_features[:, 16:80]
image_hierarchy_emb = image_features[:, 16:80]
hierarchy_similarity = F.cosine_similarity(text_hierarchy_emb, image_hierarchy_emb)
```

### Q: Can I add more attributes beyond color and hierarchy?

**A**: Yes! The architecture is extensible:
1. Train a new specialized model for your attribute
2. Reserve additional dimensions in the embedding space
3. Add alignment losses for these dimensions in `enhanced_contrastive_loss`
4. Update `config.py` with the new dimension ranges
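
For steps 2 and 4, one convenient pattern is to centralize the dimension layout in a single mapping (a hypothetical sketch; the `fabric` attribute and every name below are illustrative, not part of the released model):

```python
import numpy as np

# Hypothetical extended layout; "fabric" and all names here are
# illustrative, not part of the released GAP-CLIP model.
ATTRIBUTE_DIMS = {
    "color": slice(0, 16),       # existing subspace
    "hierarchy": slice(16, 80),  # existing subspace
    "fabric": slice(80, 96),     # a new attribute would reserve dims 80-95
}
GENERAL_DIMS = slice(96, 512)    # the general CLIP subspace shrinks accordingly

def subspace(features, name):
    """Select one attribute subspace from a [batch, 512] embedding matrix."""
    return features[:, ATTRIBUTE_DIMS[name]]
```

Keeping the layout in one place means the alignment losses and the search code can never disagree about which dimensions belong to which attribute.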

### Q: How do I evaluate on my own dataset?

**A**:
1. Format your dataset as a CSV with the columns `text`, `color`, `hierarchy`, `local_image_path`
2. Update `config.local_dataset_path` in `config.py`
3. Run the evaluation: `python evaluation/run_all_evaluations.py`

### Q: Training loss is decreasing but validation loss is increasing. What should I do?

**A**: This indicates overfitting. Try:
- Increasing `weight_decay` (e.g., from 5e-4 to 1e-3)
- Reducing `alignment_weight` (e.g., from 0.2 to 0.1)
- Increasing the dataset size (`subset_size`)
- Adding more data augmentation in `CustomDataset`
- Increasing the early-stopping patience

### Q: Can I fine-tune GAP-CLIP on a specific domain?

**A**: Yes! Load the checkpoint and continue training:
```python
checkpoint = torch.load('models/gap_clip.pth')
model.load_state_dict(checkpoint['model_state_dict'])
# Continue training with your domain-specific data
```

## 🧪 Testing & Evaluation

### Quick Test

```bash
# Test the configuration
python -c "import config; config.print_config()"

# Test model loading
python example_usage.py --repo-id Leacb4/gap-clip --text "red dress"
```

### Full Evaluation Suite

```bash
# Run all evaluations
cd evaluation
python run_all_evaluations.py --repo-id Leacb4/gap-clip

# Results are saved to evaluation_results/ with:
# - summary.json: Detailed metrics
# - summary_comparison.png: Visual comparison
```

## 🐛 Known Issues & Fixes

### Fixed Issues ✨

1. **Color model image-loading bug** (fixed in `color_model.py`)
   - Before: `Image.open(config.column_local_image_path)`
   - After: `Image.open(img_path)` — the path is now correctly taken from the dataframe

2. **Function naming in training** (fixed in `training/main_model.py` and `training/train_main_model.py`)
   - Before: `train_one_epoch_enhanced`
   - After: `train_one_epoch` — consistent naming

3. **Device compatibility** (improved in `config.py`)
   - Now automatically detects and selects the best device (CUDA > MPS > CPU)

## 🎓 Learning Resources

### Documentation Files

- **README.md** (this file): Complete project documentation
- **paper/latex_paper.ltx**: Scientific paper (LaTeX source)
- **MODEL_CARD.md**: Hugging Face model card

### Code Examples

- **example_usage.py**: Basic usage with the Hugging Face Hub
- **evaluation/annex94_search_demo.py**: Interactive search demo
- **evaluation/annex93_tsne.py**: t-SNE visualization

## 🤝 Contributing

We welcome contributions! Here's how:

1. **Report bugs**: Open an issue with a detailed description
2. **Suggest features**: Describe your idea in an issue
3. **Submit a PR**: Fork, create a branch, commit, and open a pull request
894
- 4. **Improve docs**: Help make documentation clearer
895
-
896
- ### Development Setup
897
-
898
- ```bash
899
- # Install with dev dependencies
900
- pip install -e ".[dev]"
901
-
902
- # Run tests (if available)
903
- pytest
904
-
905
- # Format code
906
- black .
907
- flake8 .
908
- ```
909
-
910
- ## 📊 Project Statistics
911
-
912
- - **Language**: Python 3.8+
913
- - **Framework**: PyTorch 2.0+
914
- - **Models**: 3 specialized models (color, hierarchy, main)
915
- - **Embedding Size**: 512 dimensions
916
- - **Training Data**: 20,000+ fashion items
917
- - **Lines of Code**: 5,000+ (including documentation)
918
- - **Documentation**: Comprehensive docstrings and guides
919
 
920
- ## 🔗 Links
921
 
922
- - **Hugging Face Hub**: [Leacb4/gap-clip](https://huggingface.co/Leacb4/gap-clip)
923
- - **GitHub**: [github.com/Leacb4/gap-clip](https://github.com/Leacb4/gap-clip)
924
- - **Contact**: lea.attia@gmail.com
925
 
926
- ## 📧 Contact & Support
927
-
928
- **Author**: Lea Attia Sarfati
929
- **Email**: lea.attia@gmail.com
930
  **Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)
931
-
932
- For questions, issues, or suggestions:
933
- - 🐛 **Bug reports**: Open an issue on GitHub
934
- - 💡 **Feature requests**: Open an issue with [Feature Request] tag
935
- - 📧 **Direct contact**: lea.attia@gmail.com
936
- - 💬 **Discussions**: Hugging Face Discussions
937
-
938
- ---
939
-
940
- ## 📜 License
941
-
942
- This project is licensed under the MIT License - see the LICENSE file for details.
943
-
944
- ## 🙏 Acknowledgments
945
-
946
- - LAION team for the base CLIP model
947
- - Hugging Face for transformers library and model hosting
948
- - PyTorch team for the deep learning framework
949
- - Fashion-MNIST dataset creators
950
- - All contributors and users of this project
951
-
952
- ---
953
-
954
- **⭐ If you find this project useful, please consider giving it a star on GitHub!**
955
-
956
- **📢 Version**: 1.0.0 | **Status**: Production Ready ✅ | **Last Updated**: December 2024
 
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Leacb4/gap-clip)
 
+ **A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.**
 
  ---
 
+ ## Quick Start
 
+ ### Installation
 
  ```bash
  git clone https://github.com/Leacb4/gap-clip.git
  cd gap-clip
  pip install -e .
  ```
 
+ ### Load from Hugging Face
 
  ```python
  from example_usage import load_models_from_hf
 
  models = load_models_from_hf("Leacb4/gap-clip")
 
+ # Extract structured embeddings from text
+ import torch
+ import torch.nn.functional as F
 
+ processor = models['processor']
+ main_model = models['main_model']
+ device = models['device']
 
+ text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
+ text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
 
+ with torch.no_grad():
+     text_outputs = main_model.text_model(**text_inputs)
+     text_features = main_model.text_projection(text_outputs.pooler_output)
+     text_features = F.normalize(text_features, dim=-1)
 
+ color_emb = text_features[:, :16]       # dims 0-15  — color
+ category_emb = text_features[:, 16:80]  # dims 16-79 — category
+ general_emb = text_features[:, 80:]     # dims 80-511 — general CLIP
  ```
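With text features from the snippet above (and image features extracted the same way), the three subspaces can be scored independently and blended into one retrieval score. A minimal sketch on random stand-in features; the `subspace_scores`/`weighted_score` helpers and the weights are illustrative, not part of the repository:

```python
import torch
import torch.nn.functional as F

def subspace_scores(text_features, image_features):
    """Cosine similarity per GAP-CLIP subspace: color, category, general."""
    slices = {"color": slice(0, 16), "category": slice(16, 80), "general": slice(80, 512)}
    return {name: F.cosine_similarity(text_features[:, sl], image_features[:, sl], dim=-1)
            for name, sl in slices.items()}

def weighted_score(scores, w_color=0.3, w_category=0.3, w_general=0.4):
    # Illustrative weighting; tune per application.
    return w_color * scores["color"] + w_category * scores["category"] + w_general * scores["general"]

# Random stand-ins for one text query and five candidate images
text_feat = F.normalize(torch.randn(1, 512), dim=-1)
image_feats = F.normalize(torch.randn(5, 512), dim=-1)

scores = subspace_scores(text_feat, image_feats)
ranking = weighted_score(scores).argsort(descending=True)  # best match first
```

Raising the color weight biases retrieval toward color fidelity; setting a weight to zero ignores that attribute entirely.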
 
+ ---
 
+ ## Architecture
 
+ GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:
 
+ | Subspace | Dimensions | Aligned with |
+ |----------|------------|--------------|
+ | Color | 0-15 (16 D) | ColorCLIP specialist model |
+ | Category | 16-79 (64 D) | HierarchyModel specialist model |
+ | General CLIP | 80-511 (432 D) | Standard CLIP semantic space |
 
+ ### Specialist Models (v2)
 
+ Both specialist models use **frozen CLIP ViT-B/32 encoders** with small trainable projection heads:
 
+ - **ColorCLIP**: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
+ - **HierarchyModel**: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.
 
+ Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.
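As a concrete illustration of the ColorCLIP head, here is a minimal sketch of a frozen-backbone projection head; the `ColorHead` class is hypothetical and stands in for the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorHead(nn.Module):
    """Illustrative ColorCLIP-style head: Linear(512, 16) + L2 norm on frozen CLIP features."""
    def __init__(self, clip_dim=512, color_dim=16):
        super().__init__()
        self.proj = nn.Linear(clip_dim, color_dim)  # the only trainable part

    def forward(self, clip_features):
        # clip_features: [batch, 512] from a frozen CLIP image or text encoder
        return F.normalize(self.proj(clip_features), dim=-1)  # unit-norm [batch, 16]

head = ColorHead()
n_params = sum(p.numel() for p in head.parameters() if p.requires_grad)
out = head(torch.randn(4, 512))
```

A single `Linear(512, 16)` layer has 8,208 parameters; separate image- and text-side heads would be consistent with the ~16K trainable-parameter figure above.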
 
 
 
 
 
+ ### Main Model Training
 
+ The main CLIP model is fine-tuned end to end with an **enhanced contrastive loss** that combines:
 
+ 1. **Triple contrastive loss** (text-image, text-attributes, image-attributes)
+ 2. **Alignment loss** — MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
+ 3. **Reference loss** — optional regularization to stay close to the base CLIP text space
 
+ ```
+ total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
+ ```
 
+ where alpha = 0.2 (alignment weight) and beta = 0.1 (reference weight).
 
+ **Hyperparameters**: lr = 1.5e-5, temperature = 0.09, weight decay = 2.76e-5, batch size = 128, trained for 10 epochs on a 100K-sample subset.
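The combination above can be sketched as follows. The `alignment_term` (MSE plus a cosine term) is shown for a single subspace on the text side only, and all tensor values are random stand-ins, not the repository's `enhanced_contrastive_loss` implementation:

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.2, 0.1  # alignment and reference weights

def alignment_term(main_subspace, specialist_target):
    """MSE plus (1 - mean cosine similarity) between a subspace and its specialist target."""
    mse = F.mse_loss(main_subspace, specialist_target)
    cos = 1.0 - F.cosine_similarity(main_subspace, specialist_target, dim=-1).mean()
    return mse + cos

# Random stand-ins for the color subspace (dims 0-15) of a batch of 8 texts
main_color = F.normalize(torch.randn(8, 16), dim=-1)
spec_color = F.normalize(torch.randn(8, 16), dim=-1)
contrastive_loss = torch.tensor(1.5)  # placeholder value
reference_loss = torch.tensor(0.4)    # placeholder value

total = (1 - ALPHA) * contrastive_loss + ALPHA * alignment_term(main_color, spec_color) + BETA * reference_loss
```

The alignment term is zero exactly when the subspace matches its specialist target, so it pulls the reserved dimensions toward the specialist embeddings without dominating the contrastive objective.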
 
+ ---
 
+ ## Project Structure
 
  ```
  .
+ ├── config.py                  # Paths, dimensions, device detection
+ ├── example_usage.py           # Load from HuggingFace + demo search
+ ├── setup.py                   # pip install -e .
+ ├── __init__.py
+ ├── README.md                  # This file (also the HF model card)
+ ├── training/
+ │   ├── color_model.py         # ColorCLIP: frozen CLIP + Linear(512, 16)
+ │   ├── hierarchy_model.py     # HierarchyModel: frozen CLIP + MLP(512, 128, 64)
+ │   └── main_model.py          # GAP-CLIP fine-tuning with enhanced loss
+ ├── evaluation/
+ │   ├── run_all_evaluations.py           # Orchestrator for all paper evaluations
+ │   ├── sec51_color_model_eval.py        # Table 1 — color accuracy
+ │   ├── sec52_category_model_eval.py     # Table 2 — category accuracy
+ │   ├── sec533_clip_nn_accuracy.py       # Table 3 — NN classification
+ │   ├── sec5354_separation_semantic.py   # Separation & zero-shot semantic
+ │   ├── sec536_embedding_structure.py    # Table 4 — structure tests A/B/C/D
+ │   ├── annex92_color_heatmaps.py        # Color similarity heatmaps
+ │   ├── annex93_tsne.py                  # t-SNE visualizations
+ │   ├── annex94_search_demo.py           # Fashion search engine demo
+ │   └── utils/
+ │       ├── datasets.py        # Dataset loaders (internal, KAGL, FMNIST)
+ │       ├── metrics.py         # Separation score, accuracy metrics
+ │       └── model_loader.py    # Model loading helpers (v2 checkpoint)
+ ├── models/                    # Trained weights (git-ignored, on HF Hub)
+ │   ├── color_model.pt         # ColorCLIP checkpoint (~600 MB)
+ │   ├── hierarchy_model.pth    # HierarchyModel checkpoint (~600 MB)
+ │   └── gap_clip.pth           # Main GAP-CLIP checkpoint (~1.7 GB)
+ ├── figures/                   # Paper figures & evaluation outputs
+ │   ├── scheme.png             # Architecture diagram
+ │   ├── training_curves.png    # Training/validation loss curves
+ │   ├── heatmap.png            # GAP-CLIP color similarity heatmap
+ │   ├── heatmap_baseline.jpg   # Baseline color similarity heatmap
+ │   ├── tsne_*.png             # t-SNE visualizations (4 files)
+ │   ├── *_red_dress.png        # Search demo: "red dress"
+ │   ├── *_blue_pant.png        # Search demo: "blue pant"
+ │   └── confusion_matrices/    # Color (8) and hierarchy (12) matrices
+ ├── paper/
+ │   ├── paper.ltx              # LaTeX source
+ │   └── paper.pdf              # Compiled paper
+ └── data/                      # Training data (git-ignored)
+     └── fashion-mnist_test.csv # Fashion-MNIST evaluation set
  ```
 
+ ---
 
+ ## Usage
 
+ ### Text Search
 
  ```python
  from example_usage import load_models_from_hf
 
+ models = load_models_from_hf("Leacb4/gap-clip")
 
+ # Use the specialist models directly
+ color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
+ hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
  ```
 
+ ### Image Search
 
  ```python
  from PIL import Image
 
  image = Image.open("path/to/image.jpg").convert("RGB")
+ image_inputs = models['processor'](images=[image], return_tensors="pt")
+ image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}
 
  with torch.no_grad():
+     vision_outputs = models['main_model'].vision_model(**image_inputs)
+     image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
+     image_features = F.normalize(image_features, dim=-1)
 
+ # Structured subspaces
+ color_emb = image_features[:, :16]
+ category_emb = image_features[:, 16:80]
+ general_emb = image_features[:, 80:]
  ```
 
+ ### Alignment Check
 
  ```python
  import torch.nn.functional as F
 
+ # Compare the specialist embedding with the main model's subspace
+ color_from_specialist = models['color_model'].get_text_embeddings(["red"])
+ color_from_main = text_features[:, :16]
 
+ similarity = F.cosine_similarity(color_from_specialist, color_from_main, dim=1)
+ print(f"Color alignment: {similarity.item():.4f}")
  ```
 
+ ### CLI
 
+ ```bash
+ # Load from HuggingFace and run an example search
+ python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"
 
+ # With an image
+ python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
+ ```
 
+ ---
 
+ ## Training
 
+ ### 1. Train the Color Model
 
+ ```bash
+ # From the repository root:
+ python -m training.color_model
+ ```
 
+ Trains `ColorCLIP`: frozen CLIP ViT-B/32 + a trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon (MPS). Saves the checkpoint to `models/color_model.pt`.
 
+ ### 2. Train the Hierarchy Model
 
+ ```bash
+ python -m training.hierarchy_model
+ ```
 
+ Trains `HierarchyModel`: frozen CLIP ViT-B/32 + a trainable `MLP(512 -> 128 -> 64)` + classifier heads, with a multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves the checkpoint to `models/hierarchy_model.pth`.
 
+ Steps 1 and 2 can run in parallel.
 
+ ### 3. Train the Main GAP-CLIP Model
 
  ```bash
+ python -m training.main_model
  ```
 
+ Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets. Training features:
 
+ - Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
+ - Gradient clipping (max_norm=1.0)
+ - ReduceLROnPlateau scheduler (patience=3, factor=0.5)
+ - Early stopping (patience=7)
+ - Automatic best-model checkpointing
+ - Training curves saved to `figures/training_curves.png`
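The training features listed above fit together as in this minimal sketch; the toy model, data, and epoch count are placeholders, not the repository's actual training loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and dataloader
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=2.76e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

best_val, patience, bad_epochs = float("inf"), 7, 0
x, y = torch.randn(32, 16), torch.randn(32, 1)

for epoch in range(10):
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    val_loss = loss.item()     # stand-in for a real validation pass
    scheduler.step(val_loss)   # ReduceLROnPlateau monitors the validation metric
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        # torch.save({...}, "models/gap_clip.pth")  # best-model checkpointing
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```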
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ ---
 
+ ## Evaluation
 
+ Run all paper evaluations:
 
  ```bash
  python evaluation/run_all_evaluations.py
  ```
 
+ Or run specific sections:
+
  ```bash
+ python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
  ```
 
  | Step | Paper Section | Description |
  |------|---------------|-------------|
+ | `sec51` | Section 5.1 | Color model accuracy (Table 1) |
+ | `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
+ | `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
+ | `sec5354` | Sections 5.3.4-5 | Separation & zero-shot semantic evaluation |
+ | `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
  | `annex92` | Annex 9.2 | Color similarity heatmaps |
  | `annex93` | Annex 9.3 | t-SNE visualizations |
+ | `annex94` | Annex 9.4 | Fashion search engine demo |
 
+ All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (Hugging Face), and Fashion-MNIST.
 
+ ---
 
+ ## Configuration
 
+ All paths and hyperparameters live in `config.py`:
 
+ ```python
+ import config
 
+ config.device            # Auto-detected: CUDA > MPS > CPU
+ config.color_emb_dim     # 16
+ config.hierarchy_emb_dim # 64
+ config.main_emb_dim      # 512
+ config.print_config()    # Pretty-print settings
+ config.validate_paths()  # Check that model files exist
+ ```
+ ```
307
 
308
+ ---
309
 
310
+ ## Repository Files on Hugging Face
 
 
 
 
311
 
312
+ | File | Description |
313
+ |------|-------------|
314
+ | `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
315
+ | `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
316
+ | `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |
317
 
318
+ ---
319
 
320
+ ## Citation
321
 
322
  ```bibtex
323
+ @misc{gap-clip-2025,
324
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
325
  author={Sarfati, Lea Attia},
326
+ year={2025},
 
327
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
 
 
 
 
328
  }
329
  ```
 
+ ## License
 
+ MIT License. See LICENSE for details.
 
+ ## Contact
 
+ **Author**: Lea Attia Sarfati
+ **Email**: lea.attia@gmail.com
  **Hugging Face**: [@Leacb4](https://huggingface.co/Leacb4)