Model Card for esp-aves2-naturelm-audio-v1-beats
Model Details
Model Description
esp-aves2-naturelm-audio-v1-beats is the BEATs audio encoder extracted from NatureLM-audio v1 and evaluated as a bioacoustic encoder baseline in What Matters for Bioacoustic Encoding. It represents an "unorthodox" post-trained encoder, derived from an audio-language model in which the BEATs encoder was unfrozen during large-scale training.
- Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
- Funded by: More info at https://www.earthspecies.org/about-us#support
- Shared by: Earth Species Project
- Model type: Transformer; BEATs encoder extracted from NatureLM-audio
- License: CC-BY-NC-SA
- Finetuned from model: BEATs (see Parent Models)
Model Sources
- Repository: https://github.com/earthspecies/avex
- Paper: What Matters for Bioacoustic Encoding
- Hugging Face Model: ESP-AVES2 Collection
- Configuration: train_config.yaml
Parent Models
- BEATs (pretrained on AudioSet)
  - Source: https://github.com/microsoft/unilm/tree/master/beats
  - Description: Self-supervised transformer audio encoder used within NatureLM-audio.
  - License: See upstream repository
Uses
Direct Use
esp-aves2-naturelm-audio-v1-beats can be used as an embedding model for bioacoustic tasks (species classification/detection, retrieval/clustering, individual ID, repertoire analysis), as in the paper’s evaluation.
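As an illustration of the retrieval use case, pooled embeddings can be ranked by cosine similarity to a query embedding. The sketch below is a minimal example in plain Python, using toy 3-dimensional vectors in place of real 768-dimensional embeddings. The function names cosine_similarity and rank_by_similarity are hypothetical and not part of avex, and the retrieval protocol used in the paper may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for pooled embeddings (real ones would have 768 dimensions).
gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, gallery)
print(ranking)
```

With real data, each gallery vector would be one time-pooled embedding per recording, and the top-ranked indices are the nearest recordings to the query.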
Downstream Use
Use frozen embeddings with linear probes, or fine-tune on target datasets.
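A minimal sketch of the frozen-embedding route, written in plain PyTorch rather than with the avex probe API: a single linear layer is trained on pooled embeddings while the encoder itself is never updated. The random tensors stand in for real embeddings and labels, and the names train_linear_probe, fake_embeddings, and fake_labels are hypothetical. The hyperparameters are illustrative, not those used in the paper.

```python
import torch
from torch import nn


def train_linear_probe(embeddings, labels, num_classes, steps=50, lr=1e-3):
    """Fit a linear classifier on fixed embeddings; returns (probe, loss history)."""
    probe = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return probe, history


torch.manual_seed(0)
# Stand-ins for pooled encoder outputs of shape (num_clips, 768) and class labels.
fake_embeddings = torch.randn(64, 768)
fake_labels = torch.randint(0, 10, (64,))

probe, history = train_linear_probe(fake_embeddings, fake_labels, num_classes=10)
logits = probe(fake_embeddings)  # Shape: (64, 10)
```

In practice the embeddings would be computed once under torch.no_grad() and cached, so that only the probe's parameters receive gradients.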
Out-of-Scope Use
Not a generative model; does not output text.
Bias, Risks, and Limitations
- Bias: Inherits biases from the upstream data mix used in NatureLM-audio training, which is not detailed in this card.
- Risks: Potential misuse for harmful wildlife exploitation.
- Limitations: Exact training recipe and data composition are defined by NatureLM-audio; consult that release for details.
How to Get Started with the Model
Loading this model requires the AVEX (Animal Vocalization Encoder) library, installed as the avex package.
Installation
pip install avex
Or with uv:
uv add avex
For more details, see https://github.com/earthspecies/avex.
Loading the Model
from avex import load_model
model = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")
Embedding Extraction
import torch
from avex import load_model

model = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")

# audio_tensor is your input audio batch (not defined here);
# it must be on the same device as the model
with torch.no_grad():
    embeddings = model(audio_tensor)
    # Shape: (batch, time_steps, 768) for BEATs

# Pool over time to get a fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)
Transfer Learning with Probes
from avex import load_model
from avex.configs import ProbeConfig
from avex.models.probes import build_probe_from_config

# Load backbone for feature extraction
base = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
Training Details
Training Data
esp-aves2-naturelm-audio-v1-beats is derived from NatureLM-audio training. This card does not enumerate the NatureLM-audio training data; see the NatureLM-audio release and the What Matters for Bioacoustic Encoding paper for context.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The paper evaluates on:
- BEANS (classification and detection): https://github.com/earthspecies/beans
- BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
- Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
- Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale
Metrics
- Linear probing: accuracy / mAP
- Retrieval: ROC AUC
- Clustering: NMI
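For readers unfamiliar with these metrics, the sketch below gives small reference implementations in plain Python. It is illustrative only: the paper's evaluation code may differ, for example in how ties are handled or how NMI is normalized. Here ROC AUC counts ties as half a win, average precision is for a single class (mAP is its mean over classes), and NMI is normalized by the arithmetic mean of the two entropies. All function names are hypothetical.

```python
import math
from collections import Counter


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete labels."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(true_labels, cluster_ids):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    count_t, count_c = Counter(true_labels), Counter(cluster_ids)
    mi = sum(
        (c / n) * math.log((c / n) / ((count_t[t] / n) * (count_c[k] / n)))
        for (t, k), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    return mi / denom if denom > 0 else 1.0


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

In the toy example, the clustering that matches the labels up to renaming scores an NMI of 1, while the one that splits every class evenly scores 0.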
Results
Aggregate results with esp-aves2-naturelm-audio-v1-beats as a frozen base model (linear probing, retrieval, and clustering):
| Benchmark | Task | Metric | Score |
|---|---|---|---|
| BEANS Classification | Probe | Accuracy | 0.804 |
| BEANS Classification | Retrieval | ROC AUC | 0.774 |
| BEANS Classification | Clustering | NMI | 0.560 |
| BEANS Detection | Probe | mAP | 0.385 |
| BEANS Detection | Retrieval | ROC AUC | 0.724 |
| BirdSet | Probe | mAP | 0.223 |
| BirdSet | Retrieval | ROC AUC | 0.723 |
| Individual ID | Probe | Accuracy | 0.410 |
| Individual ID | Retrieval | ROC AUC | 0.645 |
| Vocal Repertoire | Retrieval | ROC AUC | 0.811 |
| Vocal Repertoire | Clustering | NMI | 0.552 |
Citation
BibTeX:
@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
Model Card Contact
Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org