Model Card for esp-aves2-naturelm-audio-v1-beats

Model Details

Model Description

esp-aves2-naturelm-audio-v1-beats is the BEATs audio encoder extracted from NatureLM-audio v1 and evaluated as a bioacoustic encoder baseline in What Matters for Bioacoustic Encoding. It represents an "unorthodox" post-trained encoder: it is derived from an audio-language model in which the BEATs encoder was unfrozen during large-scale training.

Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
Funded by: More info at https://www.earthspecies.org/about-us#support
Shared by: Earth Species Project
Model type: Transformer; BEATs encoder extracted from NatureLM-audio
License: CC-BY-NC-SA
Finetuned from model: BEATs (see Parent Models)

Model Sources

Repository: https://github.com/earthspecies/avex
Paper: What Matters for Bioacoustic Encoding
Hugging Face Model: ESP-AVES2 Collection
Configuration: train_config.yaml

Parent Models

BEATs (pretrained on AudioSet)
- Source: https://github.com/microsoft/unilm/tree/master/beats
- Description: Self-supervised transformer audio encoder used within NatureLM-audio.
- License: See upstream repository

Uses

Direct Use

esp-aves2-naturelm-audio-v1-beats can be used as an embedding model for bioacoustic tasks (species classification/detection, retrieval/clustering, individual ID, repertoire analysis), as in the paper’s evaluation.
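
As a rough illustration of the retrieval use case, the sketch below ranks a small gallery of embeddings by cosine similarity to a query. The vectors are made-up 3-dimensional stand-ins for pooled clip embeddings, and `cosine` and `rank_by_similarity` are hypothetical helper names, not part of avex.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)


# Toy stand-ins for pooled clip embeddings (hypothetical values).
gallery = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.2],
    "clip_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query, gallery)
print(ranking)
```

With real embeddings the same ranking step applies unchanged; only the vector length differs.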

Downstream Use

Use frozen embeddings with linear probes, or fine-tune on target datasets.
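
To make the frozen-embedding option concrete, here is a minimal, framework-free sketch of a linear probe: a binary logistic regression trained on fixed feature vectors, with the encoder never updated. The 2-dimensional toy features stand in for real embeddings, and `train_linear_probe` and `predict` are hypothetical names. This is not the probe implementation used by avex or the paper.

```python
import math


def train_linear_probe(features, labels, lr=1.0, epochs=1000):
    """Fit binary logistic regression on fixed (frozen) feature vectors."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            err = prob - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # Full-batch gradient descent step; only the probe is updated.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Toy, linearly separable "embeddings" (hypothetical values).
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.0], [0.9, 0.7]]
train_y = [0, 0, 0, 1, 1, 1]

weights, bias = train_linear_probe(train_x, train_y)
train_acc = sum(predict(weights, bias, x) == y for x, y in zip(train_x, train_y)) / len(train_y)
print(train_acc)
```

Fine-tuning differs in that the encoder weights would also receive gradient updates, which this sketch deliberately omits.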

Out-of-Scope Use

Not a generative model; does not output text.

Bias, Risks, and Limitations

Bias: Inherits biases from the upstream data mix used in NatureLM-audio training (not detailed in this card).
Risks: Potential misuse for harmful wildlife exploitation.
Limitations: Exact training recipe and data composition are defined by NatureLM-audio; consult that release for details.

How to Get Started with the Model

Loading this model requires the avex library (AVEX, Animal Vocalization Encoder).

Installation

pip install avex

Or with uv:

uv add avex

For more details, see https://github.com/earthspecies/avex.

Loading the Model

from avex import load_model

model = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")

Embedding Extraction

import torch
from avex import load_model

model = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")

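# audio_tensor is not defined in this snippet; it is assumed here to be a batch
# of waveforms, shape (batch, num_samples), on the same device as the model.
with torch.no_grad():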
    embeddings = model(audio_tensor)
    # Shape: (batch, time_steps, 768) for BEATs

# Pool to get fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)

Transfer Learning with Probes

from avex import load_model
from avex.configs import ProbeConfig
from avex.models.probes import build_probe_from_config

# Load backbone for feature extraction
base = load_model("esp_aves2_naturelm_audio_v1_beats", device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)

Training Details

Training Data

esp-aves2-naturelm-audio-v1-beats is derived from NatureLM-audio training. This card does not enumerate the NatureLM-audio training data; see the NatureLM-audio release and the What Matters for Bioacoustic Encoding paper for context.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The paper evaluates on:

BEANS (classification and detection): https://github.com/earthspecies/beans
BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale

Metrics

Linear probing: accuracy / mAP
Retrieval: ROC AUC
Clustering: NMI
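
For readers unfamiliar with the retrieval and clustering metrics, the sketch below computes ROC AUC and NMI from scratch on toy inputs. It makes two assumptions that this card does not confirm: ROC AUC is computed as a pairwise rank statistic over similarity scores, and NMI is normalized by the arithmetic mean of the two label entropies. The function names are hypothetical, and this is not the paper's evaluation code.

```python
import math
from collections import Counter


def roc_auc(scores, is_positive):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information normalized by the mean of the two label entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true, count_pred = Counter(labels_true), Counter(labels_pred)
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


# Retrieval: similarity scores for four candidates, two of them relevant.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False])

# Clustering: cluster ids are arbitrary, so a relabeled perfect clustering scores 1.
nmi_perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])
nmi_unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(auc, nmi_perfect, nmi_unrelated)
```

Both metrics are invariant to things a linear probe would otherwise absorb: ROC AUC depends only on the ordering of scores, and NMI depends only on how items are grouped, not on which id each cluster receives.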

Results

Aggregate results for esp-aves2-naturelm-audio-v1-beats with the base model frozen, covering linear probing, retrieval, and clustering:

Benchmark	Task	Metric	Score
BEANS Classification	Probe	Accuracy	0.804
BEANS Classification	Retrieval	ROC AUC	0.774
BEANS Classification	Clustering	NMI	0.560
BEANS Detection	Probe	mAP	0.385
BEANS Detection	Retrieval	ROC AUC	0.724
BirdSet	Probe	mAP	0.223
BirdSet	Retrieval	ROC AUC	0.723
Individual ID	Probe	Accuracy	0.410
Individual ID	Retrieval	ROC AUC	0.645
Vocal Repertoire	Retrieval	ROC AUC	0.811
Vocal Repertoire	Clustering	NMI	0.552

Citation

BibTeX:

@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Model Card Contact

Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org

Paper: What Matters for Bioacoustic Encoding (2508.11845, published Aug 15, 2025)