Audio-Audio Majestrino with Emotion and Quality Experts

A ViT-Base audio encoder trained via contrastive learning on 100 million DACVAE-encoded audio samples, paired with lightweight MLP expert heads for emotion recognition, speaker embedding prediction, and audio quality assessment.

Overview

This model operates in the DACVAE latent space rather than on raw waveforms. Audio is first encoded to DACVAE latents (25 fps × 128-dim float vectors), then the ViT-Base encoder produces a 512-dimensional embedding. Specialized MLP heads predict various audio attributes from this embedding.

Architecture

Raw Audio (48kHz)
    ↓
DACVAE Encoder (facebook/dacvae-watermarked)
    ↓
DACVAE Latents (T × 128, float16, 25fps)
    ↓
ViT-Base Encoder (86M params)
  - 1D patch embedding: Conv1d(128→768, kernel=5, stride=5)
  - 12 transformer layers (768-dim, 12 heads, 3072 MLP dim)
  - CLS token pooling → LayerNorm → Linear(768→512) → L2-normalize
    ↓
512-dim Audio Embedding
    ↓
Expert MLP Heads:
  ├── Emotion Expert (144K params)    → 40 emotion predictions (best)
  ├── Attribute Expert (60K params)   → 13 non-emotion attribute predictions (best)
  ├── All-53 Expert (212K params)     → 53 combined predictions (alternative)
  ├── Attribute Probe (898K params)   → 53 emotion/speaker attributes + duration
  ├── Speaker Probe (950K params)     → 128-dim timbre embedding
  ├── 5× Quality Experts (37K each)   → individual quality scores
  └── Distortion Expert (74K params)  → binary clean vs distorted (BCE)

Models Included

| Model | File | Params | Description |
|---|---|---|---|
| ViT-Base Encoder | encoder.pt | 86M | Core audio encoder, produces 512-dim embeddings |
| Emotion Expert (40) | emotion_expert_emo40.pt | 144K | Best for 40 emotion dimensions (MAE 0.345) |
| Attribute Expert (13) | emotion_expert_attr13.pt | 60K | Best for 13 non-emotion attributes (MAE 0.507) |
| All-53 Expert | emotion_expert_all53.pt | 212K | All 53 attributes in one model (MAE 0.387) |
| Attribute Probe | attribute_probe.pt | 898K | Predicts 53 emotion/speaker attributes + duration |
| Speaker Probe | speaker_probe.pt | 950K | Predicts 128-dim wavelm timbre embedding |
| Quality: CPS | quality_expert_cps.pt | 37K | Characters per second (speech rate) |
| Quality: Background | quality_expert_score_background_quality.pt | 37K | Background noise quality |
| Quality: Content | quality_expert_score_content_enjoyment.pt | 37K | Content enjoyment |
| Quality: Overall | quality_expert_score_overall_quality.pt | 37K | Overall audio quality |
| Quality: Speech | quality_expert_score_speech_quality.pt | 37K | Speech quality |
| Distortion Expert | distortion_expert.pt | 74K | Binary clean vs. distorted classifier (92.5% acc, 0.984 ROC-AUC) |

Training Details

ViT-Base Encoder (Contrastive Pre-training)

The encoder was trained using a symmetric InfoNCE contrastive loss with data augmentation on DACVAE latents.
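
The symmetric InfoNCE objective can be sketched as follows. This is a minimal illustration only; the temperature value and the exact augmentation scheme are assumptions, as the card does not state them:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over two batches of L2-normalized embeddings.

    Row i of emb_a and emb_b are two views of the same clip (positives);
    all other rows in the batch serve as negatives. The temperature 0.07
    is an illustrative choice, not taken from the card.
    """
    logits = emb_a @ emb_b.t() / temperature              # (B, B) cosine sims
    targets = torch.arange(emb_a.shape[0], device=emb_a.device)
    loss_ab = F.cross_entropy(logits, targets)            # A -> B direction
    loss_ba = F.cross_entropy(logits.t(), targets)        # B -> A direction
    return 0.5 * (loss_ab + loss_ba)

torch.manual_seed(0)
view_a = F.normalize(torch.randn(8, 512), dim=1)
view_b = F.normalize(torch.randn(8, 512), dim=1)
loss_matched = symmetric_info_nce(view_a, view_a)   # identical views: near zero
loss_random = symmetric_info_nce(view_a, view_b)    # unrelated views: high
```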

Final validation metrics (audio-to-audio retrieval, 3K samples):

| Metric | Value |
|---|---|
| Drop→Complete Top-1 | 99.1% |
| Drop→Complete Top-5 | 100% |
| Complete→Drop Top-1 | 99.2% |
| Complete→Drop Top-5 | 100% |
| Half1→Half2 Top-1 | 91.7% |
| Half1→Half2 Top-3 | 97.8% |

Emotion Experts v2 (Multi-Output MLPs)

Three multi-output MLP experts trained on 220K merged samples (68K emotion-attribute-conditioning + 159K balanced-audio-snippets, deduplicated). Evaluated on a common held-out 2000-sample validation set.

53 Attributes (scale 0-4):

  • 40 emotions: Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph
  • 10 speaker/voice attributes: Age, Gender, Confident vs. Hesitant, High-Pitched vs. Low-Pitched, Monotone vs. Expressive, Serious vs. Humorous, Soft vs. Harsh, Submissive vs. Dominant, Vulnerable vs. Emotionally Detached, Warm vs. Cold
  • 3 dimensional: Valence, Arousal, Authenticity

| Model | Architecture | Params | Outputs | Mean MAE |
|---|---|---|---|---|
| Emotion-40 (best for emotions) | 512→192→192→40 | 144K | 40 emotions | 0.345 |
| Attribute-13 (best for non-emotions) | 512→96→96→13 | 60K | 13 non-emotion attrs | 0.507 |
| All-53 (combined) | 512→256→256→53 | 212K | All 53 attributes | 0.387 |
  • Training: 200 epochs, Huber loss (delta=1.0), AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR, batch_size=4096
  • Architecture: Linear→LayerNorm→GELU→Dropout(0.1) × 2 hidden layers → Linear→output_dim
  • The Emotion-40 model wins 31/53 attributes, All-53 wins 19/53, Attribute-13 wins 3/53

For best per-attribute performance, use Emotion-40 for the 40 emotion dims and Attribute-13 for the 13 non-emotion dims.
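
Combining the two recommended experts is a simple dictionary merge. The helper below is a sketch; the toy key lists stand in for the `emo40_keys` and `attr13_keys` loaded in the Quick Start section:

```python
def merge_expert_predictions(emo40_keys, emo40_preds, attr13_keys, attr13_preds):
    """Combine the two recommended experts into one attribute dict:
    emotions from Emotion-40, non-emotion attributes from Attribute-13."""
    merged = dict(zip(emo40_keys, emo40_preds))
    merged.update(zip(attr13_keys, attr13_preds))
    return merged

# Toy stand-in keys; real runs use the 40 + 13 names from the checkpoints.
preds = merge_expert_predictions(
    ["Anger", "Fear"], [1.2, 0.3],
    ["Age", "Valence"], [2.0, 1.5],
)
```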

Per-attribute MAE (best model per attribute, approximately sorted best to worst):

| Attribute | Best MAE | Model | Attribute | Best MAE | Model |
|---|---|---|---|---|---|
| Jealousy & Envy | 0.099 | emo40 | Contemplation | 0.429 | emo40 |
| Embarrassment | 0.151 | emo40 | Affection | 0.435 | all53 |
| Infatuation | 0.156 | emo40 | Amusement | 0.284 | emo40 |
| Shame | 0.155 | emo40 | Elation | 0.443 | emo40 |
| Intoxication | 0.165 | emo40 | Pleasure/Ecstasy | 0.447 | all53 |
| Sexual Lust | 0.184 | emo40 | Disappointment | 0.479 | emo40 |
| Sourness | 0.207 | emo40 | Impatience | 0.522 | emo40 |
| Disgust | 0.222 | emo40 | Arousal | 0.522 | all53 |
| Fatigue/Exhaustion | 0.224 | emo40 | Confident vs. Hesitant | 0.521 | all53 |
| Relief | 0.237 | emo40 | Interest | 0.537 | emo40 |
| Fear | 0.257 | emo40 | Concentration | 0.553 | all53 |
| Teasing | 0.255 | emo40 | Hope/Enthusiasm | 0.558 | emo40 |
| Pitch (High vs. Low) | 0.273 | all53 | Sadness | 0.564 | emo40 |
| Pain | 0.274 | emo40 | Distress | 0.559 | emo40 |
| Astonishment/Surprise | 0.299 | emo40 | Gender | 0.597 | all53 |
| Doubt | 0.302 | emo40 | Warm vs. Cold | 0.571 | all53 |
| Malevolence/Malice | 0.301 | all53 | Submissive vs. Dominant | 0.468 | all53 |
| Bitterness | 0.312 | emo40 | Vulnerable vs. Detached | 0.541 | all53 |
| Confusion | 0.314 | emo40 | Monotone vs. Expressive | 0.406 | attr13 |
| Authenticity | 0.331 | all53 | Age | 0.451 | attr13 |
| Contempt | 0.352 | all53 | Valence | 0.985 | attr13 |
| Triumph | 0.373 | emo40 | Soft vs. Harsh | 0.448 | all53 |
| Helplessness | 0.404 | emo40 | Serious vs. Humorous | 0.429 | all53 |
| Contentment | 0.431 | all53 | Longing | 0.408 | emo40 |
| Emotional Numbness | 0.369 | emo40 | Pride | 0.452 | all53 |
| Anger | 0.433 | emo40 | Thankfulness/Gratitude | 0.443 | all53 |

Attribute Probe (Two-Phase Training)

Predicts 53 attributes from the 512-dim embedding using a shared backbone with separate heads for attributes and duration. This is a legacy model; the v2 Emotion Experts above achieve better per-attribute MAE.

Phase 1 (pre-training): 10 epochs on TTS-AGI/emotion-attribute-conditioning-dacvae (68K samples)
Phase 2 (fine-tuning): 50 epochs on TTS-AGI/balanced-audio-snippets-40x3k-DACVAE (132K) + TTS-AGI/emolia-3k-speaker-clusters-DACVAE (27K)

  • Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1), attr_head(704→53), dur_head(704→1)
  • Loss: Huber (delta=1.0)
  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
  • Best val loss: 0.459 (Phase 2, epoch 41)
  • Mean attribute MAE: 0.398

Speaker Probe (Two-Phase Training)

Predicts 128-dim wavelm timbre embedding from the 512-dim audio embedding.

Phase 1: 10 epochs on emotion-attribute-conditioning-dacvae (68K samples)
Phase 2: 30 epochs on TTS-AGI/emolia-3k-speaker-clusters-DACVAE (63K samples, 3000 speaker clusters)

  • Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1)→Linear(704→128)
  • Val split: 1 sample per cluster (3000 val samples)
  • Best val loss: 0.00236
  • Cosine similarity: 0.625
  • MAE: 0.054
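
The cosine-similarity figure above is a mean over validation pairs; the per-pair similarity between a predicted and a reference timbre vector can be computed as below (a sketch, using random tensors in place of real embeddings):

```python
import torch
import torch.nn.functional as F

def timbre_similarity(pred, ref):
    """Cosine similarity between predicted and reference 128-dim
    timbre embeddings, computed row-wise over a batch."""
    return F.cosine_similarity(pred, ref, dim=-1)

torch.manual_seed(0)
pred = torch.randn(4, 128)                 # stand-in for spk_probe(embedding)
sim_self = timbre_similarity(pred, pred)   # identical vectors give 1.0
```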

Quality Experts

Five independent small MLPs trained on TTS-AGI/balanced-audio-score-datasets-DACVAE.

| Expert | Val MAE | Val Loss | Description |
|---|---|---|---|
| CPS | 4.638 | 4.198 | Characters per second (speech rate) |
| Background Quality | 0.292 | 0.087 | Background noise quality (0-4 scale) |
| Content Enjoyment | 0.483 | 0.188 | Content enjoyment rating (0-8.6 scale) |
| Overall Quality | 0.258 | 0.058 | Overall audio quality (0-3.7 scale) |
| Speech Quality | 0.277 | 0.078 | Speech quality (0-3.9 scale) |

  • Architecture: Linear(512→64)→LN→GELU→Dropout(0.1)→Linear(64→64)→LN→GELU→Dropout(0.1)→Linear(64→1)
  • Training: 50 epochs, Huber loss, AdamW (lr=1e-3), CosineAnnealingLR, batch_size=4096

Distortion Expert (Binary Clean vs. Distorted Classifier)

A lightweight binary classifier trained on top of the frozen majestrino encoder to discriminate clean speech from artificially degraded speech. The primary use case is as a fast quality filter for generated TTS output and for cleaning training corpora: it detects signal-level artifacts (clipping, comb-filtering) that the existing DNSMOS-based quality experts may not catch cleanly.

The expert outputs a single logit; apply sigmoid to get P(clean). For absolute quality filtering of natural speech, threshold at 0.5. For relative ranking of TTS-generated samples (which are out-of-distribution for the training set), rank by the raw logit instead: higher is better.
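
Both modes can be sketched concretely, assuming logits already computed by the expert as in the Usage section (the tensor values here are illustrative):

```python
import torch

def rank_by_cleanliness(logits, names):
    """Relative ranking for TTS candidates: higher raw logit = cleaner."""
    order = torch.argsort(logits, descending=True)
    return [names[i] for i in order]

logits = torch.tensor([-1.2, 3.4, 0.7])   # stand-in expert outputs
ranked = rank_by_cleanliness(logits, ["a.wav", "b.wav", "c.wav"])

# Absolute filtering for natural speech: threshold P(clean) at 0.5.
keep = torch.sigmoid(logits) > 0.5
```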

Training data: 100K clean/distorted pairs (200K total samples) built from laion/emolia-hq English standard-HQ tars. Each clean clip (8 seconds, 48 kHz mono) is paired with exactly one distorted twin. Three distortion families are applied in a deterministic 1/3 cycle:

| Distortion | Parameters | Effect |
|---|---|---|
| Overdrive | 15–30 dB gain + hard clip to [-1, 1] | Heavy clipping / digital distortion |
| Comb short | 5–10 ms delayed copy, 0.6 mix | Metallic / phaser-like coloration |
| Comb long | 40–60 ms delayed copy, 0.6 mix | Slap-back echo / discrete reflection |

Pipeline: Raw MP3 → mono 48 kHz float32 → center-crop 8 s → distort → DAC-VAE encode (both clean + distorted) → majestrino ViT encoder (frozen) → 512-d embedding → save. The classifier head is trained only on these pre-computed embeddings.
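
The three distortion families can be sketched in NumPy as below. This illustrates only the stated parameters; the exact parameter sampling and any post-distortion normalization in the actual pipeline are not specified by the card:

```python
import numpy as np

SR = 48_000  # pipeline sample rate

def overdrive(x, gain_db=20.0):
    """Overdrive family: 15-30 dB gain followed by a hard clip to [-1, 1]."""
    return np.clip(x * 10 ** (gain_db / 20.0), -1.0, 1.0)

def comb(x, delay_ms, mix=0.6):
    """Comb families: mix in a delayed copy of the signal.
    5-10 ms delays sound metallic; 40-60 ms delays sound like slap-back."""
    d = int(SR * delay_ms / 1000.0)
    delayed = np.concatenate([np.zeros(d, dtype=x.dtype), x[:-d]])
    return x + mix * delayed

# Toy 1-second 220 Hz sine as the "clean" signal.
t = np.arange(SR, dtype=np.float32) / SR
clean = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
clipped = overdrive(clean)              # heavy clipping
comb_short = comb(clean, delay_ms=7.5)  # metallic coloration
comb_long = comb(clean, delay_ms=50.0)  # slap-back echo
```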

  • Architecture: Linear(512β†’128)β†’GELUβ†’Dropout(0.1)β†’Linear(128β†’64)β†’GELUβ†’Dropout(0.1)β†’Linear(64β†’1)
  • Loss: BCEWithLogitsLoss
  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
  • Training: 50 epochs, batch_size=256, 10% val split (by pair)
  • Test set: 200 clean + 200 distorted pairs, held out before training

Data-scaling results (same 400-sample held-out test set):

| Train pairs | Accuracy | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| 10,000 | 0.8700 | 0.8693 | 0.9422 | 0.9418 |
| 20,000 | 0.8900 | 0.8860 | 0.9582 | 0.9594 |
| 50,000 | 0.8950 | 0.8934 | 0.9732 | 0.9725 |
| 99,800 | 0.9250 | 0.9246 | 0.9840 | 0.9838 |
| Baseline: majestrino speech_quality (median threshold) | 0.6250 | 0.6250 | 0.6401 | n/a |

Per-distortion-family recall (99.8K model):

| Family | n (test) | Recall (distorted) | Mean P(clean) |
|---|---|---|---|
| Overdrive | 69 | 1.000 | 0.000 |
| Comb long | 73 | 0.959 | 0.077 |
| Comb short | 58 | 0.810 | 0.221 |
| Clean | 200 | 0.920 (recall clean) | 0.888 |

Overdrive is trivially separable at every data scale. Comb-short (5–10 ms delays) is the hardest family but still reaches 81% recall, and both ROC-AUC and accuracy continue to improve with more data; the task has not plateaued.

Score Generation

Emotion Annotations

The emotion and speaker attribute annotations were generated using the LAION Emotional Annotation Pipeline, which uses LLM-based analysis of audio transcriptions and acoustic features to produce 55-dimensional emotion/attribute vectors on a 0-4 integer scale.

Quality Scores

The quality scores (background quality, content enjoyment, overall quality, speech quality) are derived from DNSMOS (Deep Noise Suppression Mean Opinion Score) and related audio quality assessment models. CPS (characters per second) measures speech rate from forced alignment.
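
CPS itself is just transcript length over aligned speech duration. A minimal sketch; whether whitespace and punctuation are counted, and how alignment boundaries are chosen, are assumptions not specified by the card:

```python
def chars_per_second(transcript: str, duration_s: float) -> float:
    """Speech rate in characters per second. In the actual pipeline the
    duration comes from forced alignment, not the raw file length."""
    return len(transcript) / duration_s

cps = chars_per_second("hello world", 2.0)   # 11 chars over 2 s
```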

DACVAE Codec

This model uses DACVAE (Discriminator-Augmented Compressed Vector Autoencoder) as the audio codec:

  • Model: facebook/dacvae-watermarked
  • Sample rate: 48,000 Hz
  • Hop length: 1,920 samples
  • Latent dimension: 128
  • Frame rate: 25 fps
  • Max duration: 15 seconds (375 frames)
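
These numbers are mutually consistent and fix the latent geometry; a quick sanity check:

```python
SAMPLE_RATE = 48_000
HOP_LENGTH = 1_920
FRAME_RATE = SAMPLE_RATE // HOP_LENGTH   # 48,000 / 1,920 = 25 fps
MAX_SECONDS = 15
MAX_FRAMES = MAX_SECONDS * FRAME_RATE    # 15 s * 25 fps = 375 frames

def frames_for(duration_s: float) -> int:
    """Number of latent frames for a clip, capped at the model maximum."""
    return min(int(duration_s * FRAME_RATE), MAX_FRAMES)
```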

For optimized inference, we recommend fast-dacvae which removes weight normalization for faster decoding:

pip install fast-dacvae

Usage

Installation

pip install torch fast-dacvae huggingface_hub

Quick Start

import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import numpy as np
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

# === Model Definitions ===

LATENT_DIM = 128
PATCH_SIZE = 5
MAX_FRAMES = 375
EMBED_DIM = 512
PROBE_HIDDEN = 704
TIMBRE_DIM = 128


class LatentAudioEncoder(nn.Module):
    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.patch_embed = nn.Conv1d(LATENT_DIM, hidden_dim, PATCH_SIZE, PATCH_SIZE)
        max_tokens = MAX_FRAMES // PATCH_SIZE + 1
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, max_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=mlp_dim, activation="gelu",
            batch_first=True, norm_first=True,
        )
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, EMBED_DIM)

    def forward(self, x, mask=None):
        B = x.shape[0]
        x = self.patch_embed(x.transpose(1, 2)).transpose(1, 2)
        T_tok = x.shape[1]
        if mask is not None:
            T_fr = mask.shape[1]
            need = T_tok * PATCH_SIZE
            if T_fr < need:
                mask = F.pad(mask.float(), (0, need - T_fr)).bool()
            mask = mask[:, :need].reshape(B, T_tok, PATCH_SIZE).any(dim=2)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, :x.shape[1]]
        pad_mask = None
        if mask is not None:
            cls_valid = torch.ones(B, 1, device=mask.device, dtype=torch.bool)
            pad_mask = ~torch.cat([cls_valid, mask], dim=1)
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=pad_mask)
        out = self.norm(x[:, 0])
        out = self.proj(out)
        return F.normalize(out, p=2, dim=1)


class AttributeProbe(nn.Module):
    def __init__(self, n_attrs=53):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.attr_head = nn.Linear(PROBE_HIDDEN, n_attrs)
        self.duration_head = nn.Linear(PROBE_HIDDEN, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.attr_head(h), self.duration_head(h)


class SpeakerProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.timbre_head = nn.Linear(PROBE_HIDDEN, TIMBRE_DIM)

    def forward(self, x):
        return self.timbre_head(self.backbone(x))


class FlexibleExpert(nn.Module):
    """Multi-output MLP: input_dim β†’ hidden layers β†’ output_dim."""
    def __init__(self, input_dim, hidden_layers, output_dim, dropout=0.1):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_layers:
            layers.extend([
                nn.Linear(prev, h), nn.LayerNorm(h),
                nn.GELU(), nn.Dropout(dropout),
            ])
            prev = h
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class QualityExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)


# === Load Models ===

REPO = "laion/audio-audio-majestrino-with-emotion-and-quality-experts"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load encoder
enc_path = hf_hub_download(REPO, "encoder.pt")
encoder = LatentAudioEncoder()
encoder.load_state_dict(torch.load(enc_path, map_location="cpu", weights_only=True))
encoder.eval().to(device)

# Load attribute probe
attr_path = hf_hub_download(REPO, "attribute_probe.pt")
attr_ckpt = torch.load(attr_path, map_location="cpu", weights_only=False)
attr_probe = AttributeProbe(n_attrs=attr_ckpt["n_attrs"])
attr_probe.load_state_dict(attr_ckpt["model"])
attr_probe.eval().to(device)
canonical_keys = attr_ckpt["canonical_keys"]  # 53 attribute names

# Load emotion experts (v2 multi-output models; recommended over attribute probe)
emo40_path = hf_hub_download(REPO, "emotion_expert_emo40.pt")
emo40_ckpt = torch.load(emo40_path, map_location="cpu", weights_only=False)
emo40_expert = FlexibleExpert(EMBED_DIM, emo40_ckpt["layers"], len(emo40_ckpt["output_names"]))
emo40_expert.load_state_dict(emo40_ckpt["model"])
emo40_expert.eval().to(device)
emo40_keys = emo40_ckpt["output_names"]  # 40 emotion names

attr13_path = hf_hub_download(REPO, "emotion_expert_attr13.pt")
attr13_ckpt = torch.load(attr13_path, map_location="cpu", weights_only=False)
attr13_expert = FlexibleExpert(EMBED_DIM, attr13_ckpt["layers"], len(attr13_ckpt["output_names"]))
attr13_expert.load_state_dict(attr13_ckpt["model"])
attr13_expert.eval().to(device)
attr13_keys = attr13_ckpt["output_names"]  # 13 non-emotion attribute names

# Load speaker probe
spk_path = hf_hub_download(REPO, "speaker_probe.pt")
spk_ckpt = torch.load(spk_path, map_location="cpu", weights_only=False)
spk_probe = SpeakerProbe()
spk_probe.load_state_dict(spk_ckpt["model"])
spk_probe.eval().to(device)

# Load quality experts
quality_experts = {}
for score_type in ["cps", "score_background_quality", "score_content_enjoyment",
                   "score_overall_quality", "score_speech_quality"]:
    q_path = hf_hub_download(REPO, f"quality_expert_{score_type}.pt")
    q_ckpt = torch.load(q_path, map_location="cpu", weights_only=False)
    expert = QualityExpert()
    expert.load_state_dict(q_ckpt["model"])
    expert.eval().to(device)
    quality_experts[score_type] = expert

# Load DACVAE for encoding audio
dacvae = DACVAE.load("facebook/dacvae-watermarked").to(device).eval()
for _, mod in dacvae.named_modules():
    try:
        torch.nn.utils.remove_weight_norm(mod)
    except ValueError:
        pass


# === Encode Audio ===

def encode_audio(audio_path):
    """Encode an audio file to a 512-dim embedding."""
    import torchaudio
    wav, sr = torchaudio.load(audio_path)
    if sr != 48000:
        wav = torchaudio.functional.resample(wav, sr, 48000)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)
    wav = wav.unsqueeze(0).to(device)  # (1, 1, samples)

    # Encode to DACVAE latent
    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
        latent = dacvae.encode(wav)  # (1, 128, T)
    latent = latent.float().permute(0, 2, 1)  # (1, T, 128)

    # Truncate/pad to max frames
    T = latent.shape[1]
    if T > MAX_FRAMES:
        latent = latent[:, :MAX_FRAMES]
        T = MAX_FRAMES
    pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
    padded = torch.zeros(1, pad_len, LATENT_DIM, device=device)
    mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
    padded[0, :T] = latent[0]
    mask[0, :T] = True

    # Encode to embedding
    with torch.no_grad(), torch.amp.autocast("cuda"):
        embedding = encoder(padded, mask)  # (1, 512)
    return embedding


# === Predict ===

embedding = encode_audio("your_audio.wav")

# Emotion/speaker attributes (v2 experts β€” best accuracy)
with torch.no_grad():
    emo40_preds = emo40_expert(embedding).squeeze().cpu().numpy()
    attr13_preds = attr13_expert(embedding).squeeze().cpu().numpy()

print("Top-5 emotions (v2 expert):")
emo_dict = {k: v for k, v in zip(emo40_keys, emo40_preds)}
for k, v in sorted(emo_dict.items(), key=lambda x: -x[1])[:5]:
    print(f"  {k}: {v:.2f}")

print("\nNon-emotion attributes (v2 expert):")
for k, v in zip(attr13_keys, attr13_preds):
    print(f"  {k}: {v:.2f}")

# Alternative: attribute probe (also predicts duration)
with torch.no_grad():
    attrs, duration = attr_probe(embedding)
print(f"\nPredicted duration: {duration.squeeze().cpu().item():.1f}s")

# Speaker embedding
with torch.no_grad():
    timbre = spk_probe(embedding).squeeze().cpu().numpy()  # 128-dim
print(f"Timbre embedding shape: {timbre.shape}")

# Quality scores
print("\nQuality scores:")
for name, expert in quality_experts.items():
    with torch.no_grad():
        score = expert(embedding).squeeze().cpu().item()
    print(f"  {name}: {score:.3f}")

# Distortion expert (binary clean vs. distorted)
dist_path = hf_hub_download(REPO, "distortion_expert.pt")
dist_ckpt = torch.load(dist_path, map_location="cpu", weights_only=False)
dist_expert = nn.Sequential(
    nn.Linear(512, 128), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(128, 64), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(64, 1),
)
dist_expert.load_state_dict(dist_ckpt["model"])
dist_expert.eval().to(device)
with torch.no_grad():
    logit = dist_expert(embedding).squeeze().cpu().item()
    p_clean = torch.sigmoid(torch.tensor(logit)).item()
print(f"\nDistortion expert: logit={logit:+.3f}  P(clean)={p_clean:.4f}")

Batch Processing (from DACVAE latents directly)

# If you already have DACVAE latents (e.g., from a WebDataset):
latent = np.load("sample.npy")  # (T, 128) float16
latent = latent.astype(np.float32)

T = min(latent.shape[0], MAX_FRAMES)
pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
batch = torch.zeros(1, pad_len, LATENT_DIM, device=device)
mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
batch[0, :T] = torch.from_numpy(latent[:T])
mask[0, :T] = True

with torch.no_grad(), torch.amp.autocast("cuda"):
    embedding = encoder(batch, mask)  # (1, 512)
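
For true batches, the same padding logic generalizes to a collate helper. A sketch; `collate_latents` is not part of the released code:

```python
import torch

LATENT_DIM, PATCH_SIZE, MAX_FRAMES = 128, 5, 375

def collate_latents(latents, device="cpu"):
    """Pad a list of (T_i, 128) latent tensors into one (B, pad_len, 128)
    batch plus a per-sample validity mask, mirroring the single-sample
    preparation above."""
    lengths = [min(lat.shape[0], MAX_FRAMES) for lat in latents]
    pad_len = ((max(lengths) + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
    batch = torch.zeros(len(latents), pad_len, LATENT_DIM, device=device)
    mask = torch.zeros(len(latents), pad_len, dtype=torch.bool, device=device)
    for i, (lat, T) in enumerate(zip(latents, lengths)):
        batch[i, :T] = lat[:T]
        mask[i, :T] = True
    return batch, mask

lats = [torch.randn(100, LATENT_DIM), torch.randn(375, LATENT_DIM)]
batch, mask = collate_latents(lats)
# embeddings = encoder(batch, mask)  # (B, 512) with the encoder above
```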

Attribute Key Reference

The 53 attributes predicted by the attribute probe, in canonical order:

Affection, Age, Amusement, Anger, Arousal, Astonishment_Surprise, Authenticity,
Awe, Bitterness, Concentration, Confident_vs._Hesitant, Confusion, Contemplation,
Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation,
Embarrassment, Emotional_Numbness, Fatigue_Exhaustion, Fear, Gender, Helplessness,
High-Pitched_vs._Low-Pitched, Hope_Enthusiasm_Optimism, Impatience_and_Irritability,
Infatuation, Interest, Intoxication_Altered_States_of_Consciousness, Jealousy_&_Envy,
Longing, Malevolence_Malice, Monotone_vs._Expressive, Pain, Pleasure_Ecstasy, Pride,
Relief, Sadness, Serious_vs._Humorous, Sexual_Lust, Shame, Soft_vs._Harsh, Sourness,
Submissive_vs._Dominant, Teasing, Thankfulness_Gratitude, Triumph, Valence,
Vulnerable_vs._Emotionally_Detached, Warm_vs._Cold

License

Apache 2.0

Citation

If you use this model, please cite:

@misc{laion2026majestrino,
  title={Audio-Audio Majestrino with Emotion and Quality Experts},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/audio-audio-majestrino-with-emotion-and-quality-experts}
}