Model Card for esp-aves2-sl-eat-all-ssl-all
Model Details
Model Description
esp-aves2-sl-eat-all-ssl-all is an audio representation learning model (bioacoustic encoder) trained with a two-stage recipe: self-supervised pretraining of EAT on the All mix (Bio + AudioSet), followed by supervised post-training on the same All mix, as described in What Matters for Bioacoustic Encoding.
- Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
- Funded by: More info at https://www.earthspecies.org/about-us#support
- Shared by: Earth Species Project
- Model type: Transformer; EAT backbone
- License: CC-BY-NC-SA
- Finetuned from model: EAT-all (SSL) (see Parent Models)
Model Sources
- Repository: https://github.com/earthspecies/avex
- Paper: What Matters for Bioacoustic Encoding
- Hugging Face Model: ESP-AVES2 Collection
- Configuration: train_config.yaml
Parent Models
- EAT (Efficient Audio Transformer)
  - Source: http://github.com/cwx-worst-one/EAT
  - Description: Open-source transformer audio encoder; the paper uses it to study modifications to SSL pretraining and subsequent supervised post-training.
  - License: See upstream repository
Uses
Direct Use
esp-aves2-sl-eat-all-ssl-all can be used as an embedding model for bioacoustic tasks such as species classification/detection, retrieval and clustering, individual ID, and repertoire analysis.
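As a rough illustration of the retrieval use case, the sketch below ranks a small gallery of embeddings by cosine similarity to a query. It uses toy 3-dimensional vectors in place of real pooled embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of avex. The choice of cosine similarity is an assumption; the paper's retrieval protocol may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for pooled embeddings (assumption: real ones would be 768-dimensional).
query = [1.0, 0.0, 0.0]
gallery = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]
ranking = rank_by_similarity(query, gallery)
print(ranking)
```

In practice the query and gallery vectors would come from mean-pooled model outputs, converted to lists or kept as tensors with an equivalent vectorized similarity.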
Downstream Use
Use frozen embeddings with linear probes, or fine-tune on your target task/domain.
Out-of-Scope Use
Not a generative model; does not output text.
Bias, Risks, and Limitations
- Bias: Training data biases (taxa, geography, recording conditions) can affect downstream performance.
- Risks: Potential misuse for harmful wildlife exploitation; apply safeguards.
- Limitations: Audio is standardized to 16 kHz in the paper, so content above the 8 kHz Nyquist limit is discarded; the model may not capture higher-frequency information important for some taxa.
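To make the sampling-rate limitation concrete, here is a minimal sketch that checks whether a frequency band survives at 16 kHz. A sampled signal can only represent frequencies below half its sample rate (the Nyquist frequency). The two bands are hypothetical values chosen for illustration, not measurements of any particular species.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


def band_is_representable(low_hz, high_hz, sample_rate_hz):
    """True if the whole band [low_hz, high_hz] lies below the Nyquist frequency."""
    return 0.0 <= low_hz <= high_hz < nyquist_hz(sample_rate_hz)


SAMPLE_RATE_HZ = 16_000  # standardization used in the paper

# Hypothetical call bands, chosen only to illustrate the check.
audible_band = (2_000.0, 6_000.0)
ultrasonic_band = (20_000.0, 60_000.0)

print(nyquist_hz(SAMPLE_RATE_HZ))                              # 8000.0
print(band_is_representable(*audible_band, SAMPLE_RATE_HZ))    # True
print(band_is_representable(*ultrasonic_band, SAMPLE_RATE_HZ)) # False
```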
How to Get Started with the Model
Loading this model requires the AVEX (Animal Vocalization Encoder) library avex to be installed.
Installation
pip install avex
Or with uv:
uv add avex
For more details, see https://github.com/earthspecies/avex.
Loading the Model
from avex import load_model
model = load_model("esp_aves2_sl_eat_all_ssl_all", device="cuda")
Using the Model
import torch

# audio_tensor: a batch of waveforms; shape (batch, num_samples) at 16 kHz is assumed here

# Case 1: embedding extraction (features only)
backbone = load_model("esp_aves2_sl_eat_all_ssl_all", device="cuda", return_features_only=True)
with torch.no_grad():
    embeddings = backbone(audio_tensor)
    # Shape: (batch, time_steps, 768) for EAT
# Pool over time to get a fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)

# Case 2: supervised predictions (logits over label IDs; see label_map.json)
model = load_model("esp_aves2_sl_eat_all_ssl_all", device="cuda")
with torch.no_grad():
    logits = model(audio_tensor)
# .item() only works for a single clip; drop it to keep per-clip predictions for a batch
predicted_class = logits.argmax(dim=-1).item()
Transfer Learning with Probes
from avex import load_model
from avex.models.probes import build_probe_from_config
from avex.configs import ProbeConfig

# Load backbone for feature extraction
base = load_model("esp_aves2_sl_eat_all_ssl_all", return_features_only=True, device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)
probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
Class Label Mapping
The class label mapping for this supervised learning model can be found at label_map.json in the Hugging Face repository.
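The sketch below shows one way to turn a predicted class index into a label name. It assumes, without confirmation from the repository, that label_map.json is a flat JSON object mapping label strings to integer IDs; if the real file maps IDs to labels or nests its entries, the inversion step should be adapted. The toy contents and the helper `build_id_to_label` are hypothetical.

```python
import json

# Hypothetical contents; the real label_map.json may use a different schema.
toy_label_map_json = '{"species_a": 0, "species_b": 1, "species_c": 2}'


def build_id_to_label(label_map):
    """Invert a {label: id} mapping into {id: label}."""
    return {int(idx): label for label, idx in label_map.items()}


label_map = json.loads(toy_label_map_json)
id_to_label = build_id_to_label(label_map)

predicted_class = 1  # e.g. the result of logits.argmax(dim=-1).item()
predicted_label = id_to_label[predicted_class]
print(predicted_label)
```

With the real file, `json.loads(toy_label_map_json)` would be replaced by loading the downloaded label_map.json from disk.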
Training Details
Training Data
esp-aves2-sl-eat-all-ssl-all uses the paper’s two-stage recipe:
- SSL pretraining: EAT on All (Bio + AudioSet) to produce EAT-all
- Supervised post-training: on All to produce sl-EAT-all
Training Data Sources
| Dataset | Description | Source | License | Size |
|---|---|---|---|---|
| AudioSet | general audio | Link | See dataset terms | 5700 hours |
| Xeno-canto | birds | Link | CC (varies) | 10416 hours |
| iNaturalist | diverse taxa | Link | CC (varies) | 1539 hours |
| Watkins | marine mammals | Link | licensing agreement (paper) | 27 hours |
| Animal Sound Archive | diverse taxa | Link | See archive terms | 78 hours |
Training Procedure
As described in the paper:
- SSL objective: a mix of teacher distillation and reconstruction of masked spectrogram patches.
- Augmentations: random additive noise (p=0.5, SNR in [-10, 20] dB); mixup-style within-batch mixing (p=0.5) with union of labels during supervised post-training.
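The two augmentations above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the SNR is assumed to be drawn uniformly from the range, the mixing weight of 0.5 is a placeholder, labels are assumed to be multi-hot vectors, and all function names are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a waveform."""
    return sum(s * s for s in signal) / len(signal)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale noise so that signal power / noise power matches snr_db, then add it."""
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        return list(signal)  # nothing meaningful to scale
    target_noise_power = p_signal / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / p_noise)
    return [s + scale * n for s, n in zip(signal, noise)]


def augment(signal, noise, rng, p_noise=0.5, snr_range_db=(-10.0, 20.0)):
    """With probability p_noise, add noise at an SNR drawn uniformly (assumed) from the range."""
    if rng.random() < p_noise:
        return add_noise_at_snr(signal, noise, rng.uniform(*snr_range_db))
    return list(signal)


def mix_with_label_union(wave_a, labels_a, wave_b, labels_b, weight=0.5):
    """Blend two waveforms and take the union of their multi-hot labels."""
    mixed = [weight * a + (1.0 - weight) * b for a, b in zip(wave_a, wave_b)]
    labels = [max(x, y) for x, y in zip(labels_a, labels_b)]
    return mixed, labels


rng = random.Random(0)
signal = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

noisy = add_noise_at_snr(signal, noise, snr_db=10.0)
mixed, labels = mix_with_label_union(signal, [1, 0, 0], noise, [0, 0, 1])
print(labels)
```

Taking the element-wise maximum of the two label vectors keeps every class present in either clip, which is what "union of labels" means for multi-hot targets.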
Training Hyperparameters
Training hyperparameters are specified in train_config.yaml.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The paper evaluates on:
- BEANS (classification and detection): https://github.com/earthspecies/beans
- BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
- Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
- Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale
Metrics
- Linear probing: accuracy / mAP
- Retrieval: ROC AUC
- Clustering: NMI
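For readers unfamiliar with two of these metrics, the following sketch computes ROC AUC and NMI on toy data. ROC AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as half. NMI is normalized by the arithmetic mean of the two entropies, which is an assumption; the paper may use a different normalization or a library implementation. The function names are hypothetical.

```python
import math
from collections import Counter


def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(true_labels, cluster_ids):
    """Mutual information normalized by the mean of the two entropies (assumed)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    count_true = Counter(true_labels)
    count_cluster = Counter(cluster_ids)
    mutual_info = 0.0
    for (t, c), n_tc in joint.items():
        mutual_info += (n_tc / n) * math.log(n * n_tc / (count_true[t] * count_cluster[c]))
    denom = 0.5 * (entropy(true_labels) + entropy(cluster_ids))
    return mutual_info / denom if denom > 0 else 1.0


auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
nmi_perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed clusters
nmi_independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # clusters carry no label information
print(auc, nmi_perfect, nmi_independent)
```

NMI is invariant to how clusters are numbered, which is why the renamed partition still scores as a perfect match.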
Results
Aggregate results with the frozen base model (linear probing, retrieval, and clustering) for esp-aves2-sl-eat-all-ssl-all:
| Benchmark | Task | Metric | Score |
|---|---|---|---|
| BEANS Classification | Probe | Accuracy | 0.788 |
| BEANS Classification | Retrieval | ROC AUC | 0.791 |
| BEANS Classification | Clustering | NMI | 0.536 |
| BEANS Detection | Probe | mAP | 0.356 |
| BEANS Detection | Retrieval | ROC AUC | 0.704 |
| BirdSet | Probe | mAP | 0.255 |
| BirdSet | Retrieval | ROC AUC | 0.706 |
| Individual ID | Probe | Accuracy | 0.456 |
| Individual ID | Retrieval | ROC AUC | 0.637 |
| Vocal Repertoire | Retrieval | ROC AUC | 0.798 |
| Vocal Repertoire | Clustering | NMI | 0.530 |
Citation
BibTeX:
@inproceedings{miron2025matters,
title={What Matters for Bioacoustic Encoding},
author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
Model Card Contact
Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org