---
license: mit
tags:
- satellite-imagery
- audio
- multimodal
- contrastive-learning
- soundscape
- remote-sensing
---

# Sat2Sound

Trained checkpoints and backbone weights for **Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping**, accepted at EarthVision 2026 (IEEE/ISPRS Workshop on Large Scale Computer Vision for Remote Sensing).

- Paper: [arxiv.org/pdf/2505.13777](https://arxiv.org/pdf/2505.13777)
- Code: [github.com/MVRL/sat2sound](https://github.com/MVRL/sat2sound)

## Files

| Path | Description |
|---|---|
| `sat2sound/bingmap_nometa.ckpt` | GeoSound-Bing, no metadata |
| `sat2sound/bingmap_withmeta.ckpt` | GeoSound-Bing, with metadata |
| `sat2sound/sentinel_nometa.ckpt` | GeoSound-Sentinel, no metadata |
| `sat2sound/sentinel_withmeta.ckpt` | GeoSound-Sentinel, with metadata |
| `sat2sound/SoundingEarth_nometa.ckpt` | SoundingEarth, no metadata |
| `sat2sound/SoundingEarth_withmeta.ckpt` | SoundingEarth, with metadata |
| `sat2text/bingmap_i2t_baseline.ckpt` | Sat2Text image-text baseline |
| `backbones/pretrain-vit-base-e199.pth` | SatMAE ViT-Base backbone |
| `backbones/mga-clap.pt` | MGACLAP audio encoder backbone |
| `demo/GeoSound_gallery_w_bingmap.h5` | Retrieval demo gallery (9,931 samples) |
| `ckpt_cfg.json` | Experiment name → checkpoint path mapping |

Checkpoints and backbones are resolved automatically by the codebase via `resolve_hf_ckpt` in `src/hub.py`; no manual download is needed.
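The `ckpt_cfg.json` mapping can also be read directly if you want to locate a checkpoint yourself. A minimal sketch; the JSON excerpt below is an illustrative guess reconstructed from the table above, not a verbatim copy of the shipped file:

```python
import json

# Hypothetical excerpt of ckpt_cfg.json; the real file maps every
# experiment name to its checkpoint path in this repository.
cfg = json.loads("""{
    "bingmap_withmeta": "sat2sound/bingmap_withmeta.ckpt",
    "sentinel_nometa": "sat2sound/sentinel_nometa.ckpt"
}""")

def resolve(experiment: str) -> str:
    """Look up the checkpoint path for a named experiment."""
    if experiment not in cfg:
        raise KeyError(f"unknown experiment {experiment!r}; choose from {sorted(cfg)}")
    return cfg[experiment]

print(resolve("bingmap_withmeta"))  # sat2sound/bingmap_withmeta.ckpt
```

In the codebase itself this lookup is handled for you by `src/hub.py:resolve_hf_ckpt`.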

## Quick-start: computing embeddings

Clone the [code repo](https://github.com/MVRL/sat2sound), install the environment, then:

```python
import torch
import torchaudio
from src.engine import l2normalize
from utilities.utils import load_sat2sound, encode_text, encode_gps_time, load_audio_mel, prepare_batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
B = 4

model, tokenizer = load_sat2sound("bingmap_withmeta", device)

# audio: swap the next two lines to use a real recording instead of white noise
torchaudio.save("/tmp/demo.wav", torch.randn(1, 320_000), sample_rate=32_000)
mel = load_audio_mel("/tmp/demo.wav", device)  # (1, 1001, 64)
mel = mel.repeat(B, 1, 1)                      # tile the single clip to match the batch size

latlong, time_enc, month_enc = encode_gps_time(37.77, -122.42, hour=13, month=5, B=B, device=device)

batch = prepare_batch(
    sat=torch.randn(B, 3, 224, 224, device=device),  # ImageNet-normalised satellite tiles
    audio_mel=mel,
    audio_caption=encode_text(["Traffic noise and distant birds."] * B, tokenizer, device),
    image_caption=encode_text(["An urban intersection with dense buildings."] * B, tokenizer, device),
    latlong=latlong, time_enc=time_enc, month_enc=month_enc,
)

with torch.no_grad():
    embeds = model.get_embeds(batch)

sat_emb = l2normalize(embeds["sat_embeds_dict"]["ctotal"])  # (B, 1024)
audio_emb = l2normalize(embeds["audio_embeds"])             # (B, 1024)
text_emb = l2normalize(embeds["fdt_txt_embeds"])            # (B, 1024)

print(sat_emb @ audio_emb.T)  # (B, B) satellite ↔ audio cosine similarity
```

> For `*_nometa` checkpoints, omit `latlong`, `time_enc`, and `month_enc` (they default to `None`).
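The similarity matrix at the end of the quick-start is the heart of zero-shot retrieval: `demo/GeoSound_gallery_w_bingmap.h5` stores a pre-computed gallery, and ranking reduces to a matrix multiply plus `topk`. A self-contained sketch with random stand-in embeddings, where `l2n` plays the role of the repo's `l2normalize`:

```python
import torch

torch.manual_seed(0)
dim, n_gallery = 1024, 9_931                # gallery size matches the demo HDF5

def l2n(x: torch.Tensor) -> torch.Tensor:
    return x / x.norm(dim=-1, keepdim=True)

sat_emb = l2n(torch.randn(4, dim))          # query satellite embeddings
gallery = l2n(torch.randn(n_gallery, dim))  # stand-in for gallery audio embeddings

sims = sat_emb @ gallery.T                  # (4, n_gallery) cosine similarities
scores, idx = sims.topk(k=5, dim=-1)        # top-5 audio matches per tile
print(idx.shape)  # torch.Size([4, 5])
```

In the real demo you would replace the random tensors with embeddings from `model.get_embeds` and the gallery file.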

## Citation

```bibtex
@inproceedings{khanal2026sat2sound,
  title     = {{Sat2Sound}: A Unified Framework for Zero-Shot Soundscape Mapping},
  author    = {Khanal, Subash and Sastry, Srikumar and Dhakal, Aayush and
               Ahmad, Adeel and Stylianou, Abby and Jacobs, Nathan},
  booktitle = {IEEE/ISPRS Workshop: Large Scale Computer Vision for
               Remote Sensing (EarthVision)},
  year      = {2026},
}
```