---
license: apache-2.0
tags:
- geolocation
- vision
- siglip
- clip
- geoclip
datasets:
- osv5m
pipeline_tag: image-feature-extraction
---

# GeoSpot Base

A geolocation model built on SigLIP2-so400m (512px) that predicts GPS coordinates from images.

## Model Details

- **Backbone**: google/siglip2-so400m-patch16-512 (frozen)
- **Image Resolution**: 512x512
- **Embedding Dim**: 512
- **Training Steps**: 206k
- **Training Data**: ~10.6M street-view images

## Architecture

GeoCLIP-style contrastive learning between:

- **Image Encoder**: SigLIP2 vision tower + MLP projection (1152 → 512)
- **Location Encoder**: Multi-scale RFF (random Fourier feature) encoding with learnable capsules

## Usage

```python
from safetensors.torch import load_file
from geoclip.model.GeoCLIP import GeoCLIP

# Build the model without GeoCLIP's bundled weights and select the SigLIP2 image encoder
model = GeoCLIP(from_pretrained=False, encoder_name="siglip2")

# .safetensors checkpoints are read with safetensors' load_file, not torch.load
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()  # inference mode

# Predict the top-5 GPS candidates and their probabilities for an image
top_gps, top_probs = model.predict("image.jpg", top_k=5)
```
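
To make the "multi-scale RFF encoding" named under Architecture more concrete, here is a minimal pure-Python sketch of random Fourier features applied to a latitude/longitude pair at several frequency scales. It is an illustration only, not this model's code. The function names, the sigma values `(1.0, 16.0, 256.0)`, the feature count, and the simple normalisation to `[-1, 1]` are all assumptions. The learnable capsules (the per-scale MLPs) are omitted, and the per-scale features are simply concatenated.

```python
import math
import random


def make_rff_matrix(sigma, num_features, in_dim=2, seed=0):
    """Sample a fixed (non-learned) Gaussian projection matrix with std `sigma`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
            for _ in range(num_features)]


def rff_encode(lat, lon, matrix):
    """Map (lat, lon) to [cos(2*pi*Bx), sin(2*pi*Bx)] for projection matrix B."""
    # Assumed normalisation: scale degrees into [-1, 1] before projecting.
    x = (lat / 90.0, lon / 180.0)
    proj = [2.0 * math.pi * (row[0] * x[0] + row[1] * x[1]) for row in matrix]
    return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]


def multi_scale_encode(lat, lon, sigmas=(1.0, 16.0, 256.0), num_features=8):
    """Concatenate RFF encodings computed at several frequency scales.

    Small sigmas give smooth, coarse features; large sigmas give rapidly
    varying, fine-grained features. The sigma values here are illustrative.
    """
    features = []
    for i, sigma in enumerate(sigmas):
        matrix = make_rff_matrix(sigma, num_features, seed=i)
        features.extend(rff_encode(lat, lon, matrix))
    return features


embedding = multi_scale_encode(48.8584, 2.2945)
print(len(embedding))  # 3 scales * (8 cos + 8 sin) = 48
```

In a GeoCLIP-style location encoder, each scale's features would then typically pass through a small learned network before being compared with the image embedding. That step is left out here to keep the sketch dependency-free.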