---
license: mit
tags:
  - satellite-imagery
  - audio
  - multimodal
  - contrastive-learning
  - soundscape
  - remote-sensing
---

# Sat2Sound

Trained checkpoints and backbone weights for **Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping**, accepted at EarthVision 2026 (IEEE/ISPRS Workshop on Large Scale Computer Vision for Remote Sensing).

- Paper: [arxiv.org/pdf/2505.13777](https://arxiv.org/pdf/2505.13777)
- Code: [github.com/MVRL/sat2sound](https://github.com/MVRL/sat2sound)

## Files

| Path | Description |
|---|---|
| `sat2sound/bingmap_nometa.ckpt` | GeoSound-Bing, no metadata |
| `sat2sound/bingmap_withmeta.ckpt` | GeoSound-Bing, with metadata |
| `sat2sound/sentinel_nometa.ckpt` | GeoSound-Sentinel, no metadata |
| `sat2sound/sentinel_withmeta.ckpt` | GeoSound-Sentinel, with metadata |
| `sat2sound/SoundingEarth_nometa.ckpt` | SoundingEarth, no metadata |
| `sat2sound/SoundingEarth_withmeta.ckpt` | SoundingEarth, with metadata |
| `sat2text/bingmap_i2t_baseline.ckpt` | Sat2Text image-text baseline |
| `backbones/pretrain-vit-base-e199.pth` | SatMAE ViT-Base backbone |
| `backbones/mga-clap.pt` | MGACLAP audio encoder backbone |
| `demo/GeoSound_gallery_w_bingmap.h5` | Retrieval demo gallery (9,931 samples) |
| `ckpt_cfg.json` | Experiment name → checkpoint path mapping |

Checkpoints and backbones are resolved automatically by the codebase via `src/hub.py:resolve_hf_ckpt` — no manual download needed.
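For reference, `ckpt_cfg.json` is a flat name → path mapping. The exact keys and format are defined by the repo; an illustrative sketch, with keys inferred from the experiment names and file paths in the table above, might look like:

```json
{
  "bingmap_nometa":       "sat2sound/bingmap_nometa.ckpt",
  "bingmap_withmeta":     "sat2sound/bingmap_withmeta.ckpt",
  "sentinel_nometa":      "sat2sound/sentinel_nometa.ckpt",
  "sentinel_withmeta":    "sat2sound/sentinel_withmeta.ckpt",
  "SoundingEarth_nometa": "sat2sound/SoundingEarth_nometa.ckpt"
}
```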

## Quick-start: computing embeddings

Clone the [code repo](https://github.com/MVRL/sat2sound), install the environment, then:

```python
import torch
import torchaudio
from src.engine import l2normalize
from utilities.utils import load_sat2sound, encode_text, encode_gps_time, load_audio_mel, prepare_batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
B = 4

model, tokenizer = load_sat2sound("bingmap_withmeta", device)

# audio — swap the next two lines to use a real recording instead of white noise
torchaudio.save("/tmp/demo.wav", torch.randn(1, 320_000), sample_rate=32_000)
mel = load_audio_mel("/tmp/demo.wav", device).repeat(B, 1, 1)  # (B, 1001, 64), tiled to match the batch

latlong, time_enc, month_enc = encode_gps_time(37.77, -122.42, hour=13, month=5, B=B, device=device)

batch = prepare_batch(
    sat           = torch.randn(B, 3, 224, 224, device=device),  # ImageNet-normalised satellite tile
    audio_mel     = mel,
    audio_caption = encode_text(["Traffic noise and distant birds."] * B, tokenizer, device),
    image_caption = encode_text(["An urban intersection with dense buildings."] * B, tokenizer, device),
    latlong=latlong, time_enc=time_enc, month_enc=month_enc,
)

with torch.no_grad():
    embeds = model.get_embeds(batch)

sat_emb   = l2normalize(embeds["sat_embeds_dict"]["ctotal"])  # (B, 1024)
audio_emb = l2normalize(embeds["audio_embeds"])               # (B, 1024)
text_emb  = l2normalize(embeds["fdt_txt_embeds"])             # (B, 1024)

print(sat_emb @ audio_emb.T)   # (B, B) satellite ↔ audio cosine similarity
```

> For `*_nometa` checkpoints, omit `latlong`, `time_enc`, and `month_enc` (they default to `None`).
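Once embeddings are L2-normalised as above, zero-shot retrieval against a gallery (such as the `demo/GeoSound_gallery_w_bingmap.h5` file) reduces to a dot product and a top-k. A minimal sketch with random stand-in embeddings, assuming unit-norm vectors of dimension 1024 and a gallery of 9,931 samples as in the demo file (the real embeddings would come from `model.get_embeds`):

```python
import torch

torch.manual_seed(0)

# Stand-in embeddings: in practice, load these from the demo gallery
# and from model.get_embeds as shown in the quick-start above.
gallery = torch.nn.functional.normalize(torch.randn(9931, 1024), dim=-1)  # audio gallery
query   = torch.nn.functional.normalize(torch.randn(1, 1024), dim=-1)     # one satellite tile

scores = (query @ gallery.T).squeeze(0)  # (9931,) cosine similarities
topk = torch.topk(scores, k=5)           # indices of the 5 best-matching audio clips
print(topk.indices.tolist())
```

Because both sides are unit-normalised, the dot product equals cosine similarity, so `torch.topk` directly yields the nearest gallery items.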

## Citation

```bibtex
@inproceedings{khanal2026sat2sound,
  title     = {{Sat2Sound}: A Unified Framework for Zero-Shot Soundscape Mapping},
  author    = {Khanal, Subash and Sastry, Srikumar and Dhakal, Aayush and
               Ahmad, Adeel and Stylianou, Abby and Jacobs, Nathan},
  booktitle = {IEEE/ISPRS Workshop: Large Scale Computer Vision for
               Remote Sensing (EarthVision)},
  year      = {2026},
}
```