# Istanbul StreetCLIP
A LoRA fine-tuned StreetCLIP model for predicting Istanbul district locations from street-level photos.
## Model Description
This model uses LoRA (Low-Rank Adaptation) to fine-tune StreetCLIP for Istanbul district classification. Given a street-level photograph, it predicts which of 10 Istanbul districts the photo was taken in using zero-shot CLIP similarity.
**Districts:** Beyoglu, Kadikoy, Besiktas, Uskudar, Fatih, Sisli, Bakirkoy, Maltepe, Sariyer, Atasehir
## Performance
| Metric | Score |
|---|---|
| Top-1 Accuracy | 63.51% |
| Top-3 Accuracy | 83.16% |
| Test Samples | 570 |
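The top-k metrics above count a prediction as correct when the true district appears among the model's k highest-scoring captions. A minimal, model-free sketch of that computation (the score lists below are made up for illustration, not taken from the actual test set):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    correct = 0
    for sample_scores, label in zip(scores, labels):
        # Indices of this sample's scores, ranked from highest to lowest
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Hypothetical similarity scores for 3 samples over 4 districts
scores = [
    [0.9, 0.05, 0.03, 0.02],  # true label 0: top-1 hit
    [0.1, 0.2, 0.6, 0.1],     # true label 1: only a top-3 hit
    [0.3, 0.3, 0.2, 0.2],     # true label 3: miss at both k
]
labels = [0, 1, 3]
print(top_k_accuracy(scores, labels, 1))  # 1/3
print(top_k_accuracy(scores, labels, 3))  # 2/3
```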
### Per-District Accuracy
| District | Accuracy |
|---|---|
| Sariyer | 93.33% |
| Sisli | 84.44% |
| Beyoglu | 77.33% |
| Besiktas | 73.33% |
| Kadikoy | 69.33% |
| Uskudar | 61.33% |
| Atasehir | 55.56% |
| Bakirkoy | 44.44% |
| Maltepe | 35.56% |
| Fatih | 22.22% |
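The per-district numbers are plain per-class accuracies: for each district, the fraction of its test images that the model assigned to that district. A small illustrative sketch (hypothetical labels, not the actual test data):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}

# Hypothetical labels for 6 samples
y_true = ["Sariyer", "Sariyer", "Fatih", "Fatih", "Fatih", "Kadikoy"]
y_pred = ["Sariyer", "Sariyer", "Fatih", "Beyoglu", "Uskudar", "Kadikoy"]
print(per_class_accuracy(y_true, y_pred))  # Sariyer 2/2, Fatih 1/3, Kadikoy 1/1
```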
## Usage

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load the base model and apply the LoRA adapter
base_model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
model = PeftModel.from_pretrained(base_model, "sibernetik/istanbul-streetclip")
model.eval()

processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Candidate districts and their zero-shot captions
districts = [
    "Beyoglu", "Kadikoy", "Besiktas", "Uskudar", "Fatih",
    "Sisli", "Bakirkoy", "Maltepe", "Sariyer", "Atasehir",
]
captions = [f"A street-level photo of {d}, Istanbul, Turkey" for d in districts]

# Predict: rank districts by image-caption similarity
image = Image.open("your_photo.jpg")
inputs = processor(images=image, text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]

for district, prob in sorted(zip(districts, probs), key=lambda x: x[1], reverse=True):
    print(f"{district}: {prob.item()*100:.2f}%")
```
## Training Details
- Base Model: geolocal/StreetCLIP (~432M params)
- Method: LoRA (r=16, alpha=32, dropout=0.1)
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Trainable Parameters: 4,325,376 (1.0% of total)
- Training Data: 3,800 street-level images from Mapillary across 10 Istanbul districts
- Train/Val/Test Split: 70/15/15
- Epochs: 5
- Batch Size: 4 (effective 32 with gradient accumulation)
- Learning Rate: 5e-6 with cosine schedule
- Hardware: Apple Silicon (MPS)
- Loss: CLIP contrastive loss
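The CLIP contrastive loss treats each batch's matching image-caption pairs as the diagonal of a similarity matrix and applies a symmetric cross-entropy: each image should score highest against its own caption, and each caption against its own image. A dependency-free sketch of that loss over a precomputed similarity matrix (the real training loop uses the model's logits; this is only an illustration):

```python
import math

def cross_entropy_rows(sim, targets):
    """Mean cross-entropy where row i should put its mass on column targets[i]."""
    losses = []
    for row, t in zip(sim, targets):
        log_sum = math.log(sum(math.exp(s) for s in row))
        losses.append(log_sum - row[t])  # -log softmax(row)[t]
    return sum(losses) / len(losses)

def clip_contrastive_loss(sim):
    """Symmetric CLIP loss over an n x n image-text similarity matrix."""
    n = len(sim)
    targets = list(range(n))                   # pair i matches caption i
    sim_t = [list(col) for col in zip(*sim)]   # transpose: text-to-image view
    return 0.5 * (cross_entropy_rows(sim, targets) +
                  cross_entropy_rows(sim_t, targets))

# Uninformative similarities give the maximum-entropy loss, log(n)
print(clip_contrastive_loss([[0.0, 0.0], [0.0, 0.0]]))  # ~0.693 = log(2)
# A strong diagonal (correct pairs scored highest) drives the loss toward 0
print(clip_contrastive_loss([[10.0, 0.0], [0.0, 10.0]]))
```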
### Training Progress
| Epoch | Train Loss | Val Accuracy |
|---|---|---|
| 1 | 1.9099 | 27.37% |
| 2 | 1.3854 | 36.84% |
| 3 | 1.0433 | 47.89% |
| 4 | 0.8700 | 58.25% |
| 5 | 0.7431 | 64.91% |
## Limitations
- Optimized for Istanbul only; won't generalize to other cities
- Performance varies by district (best for Sariyer, weakest for Fatih)
- Trained on Mapillary street-level imagery; may not work well on aerial/satellite photos
- Small dataset (3,800 images) limits generalization
## Framework Versions
- PEFT 0.18.1
- Transformers 4.x
- PyTorch 2.x