# Istanbul StreetCLIP
A LoRA fine-tuned StreetCLIP model for predicting Istanbul district locations from street-level photos.
## Model Description
This model uses LoRA (Low-Rank Adaptation) to fine-tune StreetCLIP for Istanbul district classification. Given a street-level photograph, it predicts which of 10 Istanbul districts the photo was taken in using zero-shot CLIP similarity.
**Districts:** Beyoglu, Kadikoy, Besiktas, Uskudar, Fatih, Sisli, Bakirkoy, Maltepe, Sariyer, Atasehir
## Performance
| Metric | Score |
|---|---|
| Top-1 Accuracy | 63.51% |
| Top-3 Accuracy | 83.16% |
| Test Samples | 570 |
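The top-k metrics above count a prediction as correct when the true district appears among the model's k highest-scoring captions. A minimal, model-free sketch of that computation (the score lists below are made up for illustration, not taken from the actual test set):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    correct = 0
    for sample_scores, label in zip(scores, labels):
        # Indices of this sample's scores, ranked from highest to lowest
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Hypothetical similarity scores for 3 samples over 4 districts
scores = [
    [0.9, 0.05, 0.03, 0.02],  # true label 0: top-1 hit
    [0.1, 0.2, 0.6, 0.1],     # true label 1: only a top-3 hit
    [0.3, 0.3, 0.2, 0.2],     # true label 3: miss at both k
]
labels = [0, 1, 3]
print(top_k_accuracy(scores, labels, 1))  # 1/3
print(top_k_accuracy(scores, labels, 3))  # 2/3
```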
### Per-District Accuracy
| District | Accuracy |
|---|---|
| Sariyer | 93.33% |
| Sisli | 84.44% |
| Beyoglu | 77.33% |
| Besiktas | 73.33% |
| Kadikoy | 69.33% |
| Uskudar | 61.33% |
| Atasehir | 55.56% |
| Bakirkoy | 44.44% |
| Maltepe | 35.56% |
| Fatih | 22.22% |
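The per-district numbers are plain per-class accuracies: for each district, the fraction of its test images that the model assigned to that district. A small illustrative sketch (hypothetical labels, not the actual test data):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}

# Hypothetical labels for 6 samples
y_true = ["Sariyer", "Sariyer", "Fatih", "Fatih", "Fatih", "Kadikoy"]
y_pred = ["Sariyer", "Sariyer", "Fatih", "Beyoglu", "Uskudar", "Kadikoy"]
print(per_class_accuracy(y_true, y_pred))  # Sariyer 2/2, Fatih 1/3, Kadikoy 1/1
```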
## Usage

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load the base model and apply the LoRA adapter
base_model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
model = PeftModel.from_pretrained(base_model, "sibernetik/istanbul-streetclip")
model.eval()

processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Candidate districts and their zero-shot captions
districts = [
    "Beyoglu", "Kadikoy", "Besiktas", "Uskudar", "Fatih",
    "Sisli", "Bakirkoy", "Maltepe", "Sariyer", "Atasehir",
]
captions = [f"A street-level photo of {d}, Istanbul, Turkey" for d in districts]

# Predict: rank districts by image-caption similarity
image = Image.open("your_photo.jpg")
inputs = processor(images=image, text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]

for district, prob in sorted(zip(districts, probs), key=lambda x: x[1], reverse=True):
    print(f"{district}: {prob.item()*100:.2f}%")
```
## Training Details
- Base Model: geolocal/StreetCLIP (~432M params)
- Method: LoRA (r=16, alpha=32, dropout=0.1)
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Trainable Parameters: 4,325,376 (1.0% of total)
- Training Data: 3,800 street-level images from Mapillary across 10 Istanbul districts
- Train/Val/Test Split: 70/15/15
- Epochs: 5
- Batch Size: 4 (effective 32 with gradient accumulation)
- Learning Rate: 5e-6 with cosine schedule
- Hardware: Apple Silicon (MPS)
- Loss: CLIP contrastive loss
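The CLIP contrastive loss treats each batch's matching image-caption pairs as the diagonal of a similarity matrix and applies a symmetric cross-entropy: each image should score highest against its own caption, and each caption against its own image. A dependency-free sketch of that loss over a precomputed similarity matrix (the real training loop uses the model's logits; this is only an illustration):

```python
import math

def cross_entropy_rows(sim, targets):
    """Mean cross-entropy where row i should put its mass on column targets[i]."""
    losses = []
    for row, t in zip(sim, targets):
        log_sum = math.log(sum(math.exp(s) for s in row))
        losses.append(log_sum - row[t])  # -log softmax(row)[t]
    return sum(losses) / len(losses)

def clip_contrastive_loss(sim):
    """Symmetric CLIP loss over an n x n image-text similarity matrix."""
    n = len(sim)
    targets = list(range(n))                   # pair i matches caption i
    sim_t = [list(col) for col in zip(*sim)]   # transpose: text-to-image view
    return 0.5 * (cross_entropy_rows(sim, targets) +
                  cross_entropy_rows(sim_t, targets))

# Uninformative similarities give the maximum-entropy loss, log(n)
print(clip_contrastive_loss([[0.0, 0.0], [0.0, 0.0]]))  # ~0.693 = log(2)
# A strong diagonal (correct pairs scored highest) drives the loss toward 0
print(clip_contrastive_loss([[10.0, 0.0], [0.0, 10.0]]))
```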
### Training Progress
| Epoch | Train Loss | Val Accuracy |
|---|---|---|
| 1 | 1.9099 | 27.37% |
| 2 | 1.3854 | 36.84% |
| 3 | 1.0433 | 47.89% |
| 4 | 0.8700 | 58.25% |
| 5 | 0.7431 | 64.91% |
## Limitations
- Optimized for Istanbul only; won't generalize to other cities
- Performance varies by district (best for Sariyer, weakest for Fatih)
- Trained on Mapillary street-level imagery; may not work well on aerial/satellite photos
- Small dataset (3,800 images) limits generalization
## Framework Versions
- PEFT 0.18.1
- Transformers 4.x
- PyTorch 2.x