Istanbul StreetCLIP

A LoRA fine-tuned StreetCLIP model for predicting Istanbul district locations from street-level photos.

Model Description

This model uses LoRA (Low-Rank Adaptation) to fine-tune StreetCLIP for Istanbul district classification. Given a street-level photograph, it scores a text caption for each of 10 Istanbul districts against the image via zero-shot CLIP image-text similarity and predicts the best-matching district.

Districts: Beyoglu, Kadikoy, Besiktas, Uskudar, Fatih, Sisli, Bakirkoy, Maltepe, Sariyer, Atasehir

Performance

| Metric         | Score  |
|----------------|--------|
| Top-1 Accuracy | 63.51% |
| Top-3 Accuracy | 83.16% |
| Test Samples   | 570    |
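As a reference for how these two metrics are defined, here is a minimal sketch: a prediction counts toward top-1 if the highest-probability district is correct, and toward top-3 if the true district appears anywhere in the three highest-ranked guesses. The data below is hypothetical, not taken from the actual test set.

```python
# Each entry: (districts ranked by descending predicted probability, true district).
# Hypothetical examples, not real test-set outputs.
predictions = [
    (["Sariyer", "Besiktas", "Sisli"], "Sariyer"),   # top-1 hit
    (["Kadikoy", "Uskudar", "Maltepe"], "Uskudar"),  # top-3 hit only
    (["Fatih", "Beyoglu", "Sisli"], "Bakirkoy"),     # miss
]

top1 = sum(ranked[0] == truth for ranked, truth in predictions) / len(predictions)
top3 = sum(truth in ranked[:3] for ranked, truth in predictions) / len(predictions)
print(f"top-1: {top1:.2%}, top-3: {top3:.2%}")
```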

Per-District Accuracy

| District | Accuracy |
|----------|----------|
| Sariyer  | 93.33% |
| Sisli    | 84.44% |
| Beyoglu  | 77.33% |
| Besiktas | 73.33% |
| Kadikoy  | 69.33% |
| Uskudar  | 61.33% |
| Atasehir | 55.56% |
| Bakirkoy | 44.44% |
| Maltepe  | 35.56% |
| Fatih    | 22.22% |

Usage

import torch
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load model
base_model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
model = PeftModel.from_pretrained(base_model, "sibernetik/istanbul-streetclip")
model.eval()

processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Districts
districts = [
    "Beyoglu", "Kadikoy", "Besiktas", "Uskudar", "Fatih",
    "Sisli", "Bakirkoy", "Maltepe", "Sariyer", "Atasehir"
]

captions = [f"A street-level photo of {d}, Istanbul, Turkey" for d in districts]

# Predict
from PIL import Image

image = Image.open("your_photo.jpg")
inputs = processor(images=image, text=captions, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]

for district, prob in sorted(zip(districts, probs), key=lambda x: x[1], reverse=True):
    print(f"{district}: {prob.item()*100:.2f}%")
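For intuition about the final step above: `logits_per_image` holds scaled image-text similarity scores, and the `softmax` over the caption dimension converts them into a probability distribution over the 10 districts. A minimal pure-Python sketch with hypothetical logit values:

```python
import math

# Hypothetical similarity logits for one image against three captions.
logits = {"Beyoglu": 24.1, "Kadikoy": 22.8, "Fatih": 20.5}

# Numerically stable softmax: subtract the max before exponentiating.
m = max(logits.values())
exps = {d: math.exp(v - m) for d, v in logits.items()}
total = sum(exps.values())
probs = {d: e / total for d, e in exps.items()}

for district, p in sorted(probs.items(), key=lambda x: x[1], reverse=True):
    print(f"{district}: {p*100:.2f}%")
```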

Training Details

  • Base Model: geolocal/StreetCLIP (~432M params)
  • Method: LoRA (r=16, alpha=32, dropout=0.1)
  • Target Modules: q_proj, k_proj, v_proj, out_proj
  • Trainable Parameters: 4,325,376 (1.0% of total)
  • Training Data: 3,800 street-level images from Mapillary across 10 Istanbul districts
  • Train/Val/Test Split: 70/15/15
  • Epochs: 5
  • Batch Size: 4 (effective 32 with gradient accumulation)
  • Learning Rate: 5e-6 with cosine schedule
  • Hardware: Apple Silicon (MPS)
  • Loss: CLIP contrastive loss
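The trainable-parameter figure can be cross-checked arithmetically. Assuming StreetCLIP keeps the standard CLIP ViT-L/14 layout (vision tower: 24 layers at hidden size 1024; text tower: 12 layers at hidden size 768; these dimensions are assumptions, not stated in this card), LoRA with r=16 on each square attention projection adds two low-rank factors of 2 * r * hidden parameters:

```python
# Cross-check of the reported trainable-parameter count.
# Assumed backbone dims (CLIP ViT-L/14): vision 24 layers x 1024 hidden,
# text 12 layers x 768 hidden -- not stated in the card itself.
r = 16
modules_per_layer = 4  # q_proj, k_proj, v_proj, out_proj

def lora_params(layers: int, hidden: int) -> int:
    # Each square (hidden x hidden) projection gains A (r x hidden) and
    # B (hidden x r), i.e. 2 * r * hidden extra trainable parameters.
    return layers * modules_per_layer * 2 * r * hidden

total = lora_params(24, 1024) + lora_params(12, 768)
print(f"{total:,}")  # 4,325,376
```

Under those assumed dimensions, the count lands exactly on the 4,325,376 trainable parameters reported above.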

Training Progress

| Epoch | Train Loss | Val Accuracy |
|-------|------------|--------------|
| 1     | 1.9099     | 27.37% |
| 2     | 1.3854     | 36.84% |
| 3     | 1.0433     | 47.89% |
| 4     | 0.8700     | 58.25% |
| 5     | 0.7431     | 64.91% |

Limitations

  • Optimized for Istanbul only; won't generalize to other cities
  • Performance varies by district (best for Sariyer, weakest for Fatih)
  • Trained on Mapillary street-level imagery; may not work well on aerial/satellite photos
  • Small dataset (3,800 images) limits generalization

Framework Versions

  • PEFT 0.18.1
  • Transformers 4.x
  • PyTorch 2.x