---
license: mit
tags:
- satellite-imagery
- audio
- multimodal
- contrastive-learning
- soundscape
- remote-sensing
---

# Sat2Sound

Trained checkpoints and backbone weights for **Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping**, accepted at EarthVision 2026 (IEEE/ISPRS Workshop on Large Scale Computer Vision for Remote Sensing).

- Paper: [arxiv.org/pdf/2505.13777](https://arxiv.org/pdf/2505.13777)
- Code: [github.com/MVRL/sat2sound](https://github.com/MVRL/sat2sound)

## Files

| Path | Description |
|---|---|
| `sat2sound/bingmap_nometa.ckpt` | GeoSound-Bing, no metadata |
| `sat2sound/bingmap_withmeta.ckpt` | GeoSound-Bing, with metadata |
| `sat2sound/sentinel_nometa.ckpt` | GeoSound-Sentinel, no metadata |
| `sat2sound/sentinel_withmeta.ckpt` | GeoSound-Sentinel, with metadata |
| `sat2sound/SoundingEarth_nometa.ckpt` | SoundingEarth, no metadata |
| `sat2sound/SoundingEarth_withmeta.ckpt` | SoundingEarth, with metadata |
| `sat2text/bingmap_i2t_baseline.ckpt` | Sat2Text image-text baseline |
| `backbones/pretrain-vit-base-e199.pth` | SatMAE ViT-Base backbone |
| `backbones/mga-clap.pt` | MGACLAP audio encoder backbone |
| `demo/GeoSound_gallery_w_bingmap.h5` | Retrieval demo gallery (9,931 samples) |
| `ckpt_cfg.json` | Experiment name → checkpoint path mapping |

Checkpoints and backbones are resolved automatically by the codebase via `resolve_hf_ckpt` in `src/hub.py`; no manual download is needed.
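The `ckpt_cfg.json` mapping can also be read directly if you want to locate a checkpoint yourself. A minimal sketch; the JSON excerpt below is an illustrative guess reconstructed from the table above, not a verbatim copy of the shipped file:

```python
import json

# Hypothetical excerpt of ckpt_cfg.json; the real file maps every
# experiment name to its checkpoint path in this repository.
cfg = json.loads("""{
    "bingmap_withmeta": "sat2sound/bingmap_withmeta.ckpt",
    "sentinel_nometa": "sat2sound/sentinel_nometa.ckpt"
}""")

def resolve(experiment: str) -> str:
    """Look up the checkpoint path for a named experiment."""
    if experiment not in cfg:
        raise KeyError(f"unknown experiment {experiment!r}; choose from {sorted(cfg)}")
    return cfg[experiment]

print(resolve("bingmap_withmeta"))  # sat2sound/bingmap_withmeta.ckpt
```

In the codebase itself this lookup is handled for you by `src/hub.py:resolve_hf_ckpt`.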

## Quick-start: computing embeddings

Clone the [code repo](https://github.com/MVRL/sat2sound), install the environment, then:

```python
import torch
import torchaudio
from src.engine import l2normalize
from utilities.utils import load_sat2sound, encode_text, encode_gps_time, load_audio_mel, prepare_batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
B = 4

model, tokenizer = load_sat2sound("bingmap_withmeta", device)

# audio: swap the next two lines to use a real recording instead of white noise
torchaudio.save("/tmp/demo.wav", torch.randn(1, 320_000), sample_rate=32_000)
mel = load_audio_mel("/tmp/demo.wav", device)  # (1, 1001, 64)
mel = mel.repeat(B, 1, 1)                      # tile the single clip to match the batch size

latlong, time_enc, month_enc = encode_gps_time(37.77, -122.42, hour=13, month=5, B=B, device=device)

batch = prepare_batch(
    sat=torch.randn(B, 3, 224, 224, device=device),  # ImageNet-normalised satellite tiles
    audio_mel=mel,
    audio_caption=encode_text(["Traffic noise and distant birds."] * B, tokenizer, device),
    image_caption=encode_text(["An urban intersection with dense buildings."] * B, tokenizer, device),
    latlong=latlong, time_enc=time_enc, month_enc=month_enc,
)

with torch.no_grad():
    embeds = model.get_embeds(batch)

sat_emb = l2normalize(embeds["sat_embeds_dict"]["ctotal"])  # (B, 1024)
audio_emb = l2normalize(embeds["audio_embeds"])             # (B, 1024)
text_emb = l2normalize(embeds["fdt_txt_embeds"])            # (B, 1024)

print(sat_emb @ audio_emb.T)  # (B, B) satellite ↔ audio cosine similarity
```

> For `*_nometa` checkpoints, omit `latlong`, `time_enc`, and `month_enc` (they default to `None`).
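The similarity matrix at the end of the quick-start is the heart of zero-shot retrieval: `demo/GeoSound_gallery_w_bingmap.h5` stores a pre-computed gallery, and ranking reduces to a matrix multiply plus `topk`. A self-contained sketch with random stand-in embeddings, where `l2n` plays the role of the repo's `l2normalize`:

```python
import torch

torch.manual_seed(0)
dim, n_gallery = 1024, 9_931                # gallery size matches the demo HDF5

def l2n(x: torch.Tensor) -> torch.Tensor:
    return x / x.norm(dim=-1, keepdim=True)

sat_emb = l2n(torch.randn(4, dim))          # query satellite embeddings
gallery = l2n(torch.randn(n_gallery, dim))  # stand-in for gallery audio embeddings

sims = sat_emb @ gallery.T                  # (4, n_gallery) cosine similarities
scores, idx = sims.topk(k=5, dim=-1)        # top-5 audio matches per tile
print(idx.shape)  # torch.Size([4, 5])
```

In the real demo you would replace the random tensors with embeddings from `model.get_embeds` and the gallery file.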

## Citation

```bibtex
@inproceedings{khanal2026sat2sound,
  title     = {{Sat2Sound}: A Unified Framework for Zero-Shot Soundscape Mapping},
  author    = {Khanal, Subash and Sastry, Srikumar and Dhakal, Aayush and
               Ahmad, Adeel and Stylianou, Abby and Jacobs, Nathan},
  booktitle = {IEEE/ISPRS Workshop: Large Scale Computer Vision for
               Remote Sensing (EarthVision)},
  year      = {2026},
}
```