# Audio-Audio Majestrino with Emotion and Quality Experts
A ViT-Base audio encoder trained via contrastive learning on 100 million DACVAE-encoded audio samples, paired with lightweight MLP expert heads for emotion recognition, speaker embedding prediction, and audio quality assessment.
## Overview
This model operates in the DACVAE latent space rather than on raw waveforms. Audio is first encoded to DACVAE latents (25 fps × 128-dim float vectors), then the ViT-Base encoder produces a 512-dimensional embedding. Specialized MLP heads predict various audio attributes from this embedding.
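For a 15-second clip this works out to 375 latent frames, which the stride-5 patch embedding reduces to 75 transformer tokens (plus a CLS token). A minimal shape sketch of the patching step, using the layer sizes quoted in this card:

```python
import torch
import torch.nn as nn

# DACVAE latents: 25 fps x 128 dims; 15 s -> 375 frames
latents = torch.randn(1, 128, 375)  # (batch, latent_dim, frames)

# 1D patch embedding: Conv1d(128 -> 768, kernel=5, stride=5)
patch_embed = nn.Conv1d(128, 768, kernel_size=5, stride=5)
tokens = patch_embed(latents).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 75, 768])
```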
## Architecture

```
Raw Audio (48 kHz)
        ↓
DACVAE Encoder (facebook/dacvae-watermarked)
        ↓
DACVAE Latents (T × 128, float16, 25 fps)
        ↓
ViT-Base Encoder (86M params)
  - 1D patch embedding: Conv1d(128→768, kernel=5, stride=5)
  - 12 transformer layers (768-dim, 12 heads, 3072 MLP dim)
  - CLS token pooling → LayerNorm → Linear(768→512) → L2-normalize
        ↓
512-dim Audio Embedding
        ↓
Expert MLP Heads:
├── Emotion Expert (144K params) → 40 emotion predictions (best)
├── Attribute Expert (60K params) → 13 non-emotion attribute predictions (best)
├── All-53 Expert (212K params) → 53 combined predictions (alternative)
├── Attribute Probe (898K params) → 53 emotion/speaker attributes + duration
├── Speaker Probe (950K params) → 128-dim timbre embedding
├── 5× Quality Experts (37K each) → individual quality scores
└── Distortion Expert (74K params) → binary clean vs. distorted (BCE)
```
## Models Included

| Model | File | Params | Description |
|---|---|---|---|
| ViT-Base Encoder | `encoder.pt` | 86M | Core audio encoder, produces 512-dim embeddings |
| Emotion Expert (40) | `emotion_expert_emo40.pt` | 144K | Best for 40 emotion dimensions (MAE 0.345) |
| Attribute Expert (13) | `emotion_expert_attr13.pt` | 60K | Best for 13 non-emotion attributes (MAE 0.507) |
| All-53 Expert | `emotion_expert_all53.pt` | 212K | All 53 attributes in one model (MAE 0.387) |
| Attribute Probe | `attribute_probe.pt` | 898K | Predicts 53 emotion/speaker attributes + duration |
| Speaker Probe | `speaker_probe.pt` | 950K | Predicts 128-dim wavelm timbre embedding |
| Quality: CPS | `quality_expert_cps.pt` | 37K | Characters per second (speech rate) |
| Quality: Background | `quality_expert_score_background_quality.pt` | 37K | Background noise quality |
| Quality: Content | `quality_expert_score_content_enjoyment.pt` | 37K | Content enjoyment |
| Quality: Overall | `quality_expert_score_overall_quality.pt` | 37K | Overall audio quality |
| Quality: Speech | `quality_expert_score_speech_quality.pt` | 37K | Speech quality |
| Distortion Expert | `distortion_expert.pt` | 74K | Binary clean vs. distorted classifier (92.5% acc, 0.984 ROC-AUC) |
## Training Details

### ViT-Base Encoder (Contrastive Pre-training)
The encoder was trained using symmetric InfoNCE contrastive loss with data augmentation on DACVAE latents:
- Training data: 100M samples from TTS-AGI/maestrino-data-DACVAE, TTS-AGI/enhanced-audiosnippets-DACVAE, and TTS-AGI/emotion-attribute-conditioning-dacvae
- Effective batch size: 8,192 (128 per GPU × 8 GPUs × 8 gradient accumulation steps)
- Optimizer: AdamW (lr=5e-4, betas=(0.9, 0.98), weight_decay=0.05)
- Schedule: Cosine annealing with 5% warmup
- Augmentations: Random split (40%), frame drop (30%), random crop (30%)
- Training steps: 12,207 (100M samples seen)
- Hardware: 8× GPUs, distributed training with bfloat16 mixed precision
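The symmetric InfoNCE objective used here can be sketched as follows (a simplified illustration; the temperature value is an assumption, not taken from the training code):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(za, zb, temperature=0.07):
    """Symmetric InfoNCE between two augmented views of the same batch.

    za, zb: (B, D) embeddings; matching rows are positives,
    all other rows serve as in-batch negatives.
    """
    za = F.normalize(za, dim=1)
    zb = F.normalize(zb, dim=1)
    logits = za @ zb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(za.shape[0], device=za.device)
    # Average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = symmetric_info_nce(torch.randn(8, 512), torch.randn(8, 512))
```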
Final validation metrics (audio-to-audio retrieval, 3K samples):
| Metric | Value |
|---|---|
| Drop→Complete Top-1 | 99.1% |
| Drop→Complete Top-5 | 100% |
| Complete→Drop Top-1 | 99.2% |
| Complete→Drop Top-5 | 100% |
| Half1→Half2 Top-1 | 91.7% |
| Half1→Half2 Top-3 | 97.8% |
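These retrieval numbers come from ranking each query embedding against the full gallery by cosine similarity. A minimal top-k accuracy helper (illustrative, not the actual evaluation script):

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(queries, gallery, k=1):
    """Fraction of queries whose true pair is in the top-k by cosine similarity.

    queries, gallery: (N, D); row i of the gallery is the positive for query i.
    """
    q = F.normalize(queries, dim=1)
    g = F.normalize(gallery, dim=1)
    sims = q @ g.t()                       # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices     # (N, k)
    targets = torch.arange(q.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Sanity check: a set retrieved against itself is perfect at top-1
emb = F.normalize(torch.randn(100, 512), dim=1)
acc = topk_retrieval_accuracy(emb, emb, k=1)
print(acc)  # 1.0
```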
### Emotion Experts v2 (Multi-Output MLPs)
Three multi-output MLP experts trained on 220K merged samples (68K emotion-attribute-conditioning + 159K balanced-audio-snippets, deduplicated). Evaluated on a common held-out 2000-sample validation set.
53 Attributes (scale 0-4):
- 40 emotions: Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph
- 10 speaker/voice attributes: Age, Gender, Confident vs. Hesitant, High-Pitched vs. Low-Pitched, Monotone vs. Expressive, Serious vs. Humorous, Soft vs. Harsh, Submissive vs. Dominant, Vulnerable vs. Emotionally Detached, Warm vs. Cold
- 3 dimensional: Valence, Arousal, Authenticity
| Model | Architecture | Params | Outputs | Mean MAE |
|---|---|---|---|---|
| Emotion-40 (best for emotions) | 512→192→192→40 | 144K | 40 emotions | 0.345 |
| Attribute-13 (best for non-emotions) | 512→96→96→13 | 60K | 13 non-emotion attrs | 0.507 |
| All-53 (combined) | 512→256→256→53 | 212K | All 53 attributes | 0.387 |
- Training: 200 epochs, Huber loss (delta=1.0), AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR, batch_size=4096
- Architecture: [Linear→LayerNorm→GELU→Dropout(0.1)] × 2 hidden layers → Linear→output_dim
- The Emotion-40 model wins 31/53 attributes, All-53 wins 19/53, Attribute-13 wins 3/53
For best per-attribute performance, use Emotion-40 for the 40 emotion dims and Attribute-13 for the 13 non-emotion dims.
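A combined 53-attribute prediction can be assembled by routing each dimension to its best expert (a sketch; the helper and the stand-in experts/keys below are illustrative, while the real models and key lists come from the loading code in Usage):

```python
import torch
import torch.nn as nn

def combine_predictions(emb, emo40_expert, emo40_keys, attr13_expert, attr13_keys):
    """Route each attribute to the expert that predicts it best:
    Emotion-40 for the 40 emotion dims, Attribute-13 for the rest."""
    with torch.no_grad():
        emo = emo40_expert(emb).squeeze(0)
        attr = attr13_expert(emb).squeeze(0)
    preds = {k: v.item() for k, v in zip(emo40_keys, emo)}
    preds.update({k: v.item() for k, v in zip(attr13_keys, attr)})
    return preds  # 53 attribute-name -> score entries

# Demo with stand-in linear experts (use the loaded FlexibleExpert models in practice)
emo40_keys = [f"emotion_{i}" for i in range(40)]
attr13_keys = [f"attr_{i}" for i in range(13)]
preds = combine_predictions(torch.randn(1, 512),
                            nn.Linear(512, 40), emo40_keys,
                            nn.Linear(512, 13), attr13_keys)
print(len(preds))  # 53
```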
Per-attribute MAE (best model per attribute, sorted best to worst):
| Attribute | Best MAE | Model | Attribute | Best MAE | Model |
|---|---|---|---|---|---|
| Jealousy & Envy | 0.099 | emo40 | Contemplation | 0.429 | emo40 |
| Embarrassment | 0.151 | emo40 | Affection | 0.435 | all53 |
| Infatuation | 0.156 | emo40 | Amusement | 0.284 | emo40 |
| Shame | 0.155 | emo40 | Elation | 0.443 | emo40 |
| Intoxication | 0.165 | emo40 | Pleasure/Ecstasy | 0.447 | all53 |
| Sexual Lust | 0.184 | emo40 | Disappointment | 0.479 | emo40 |
| Sourness | 0.207 | emo40 | Impatience | 0.522 | emo40 |
| Disgust | 0.222 | emo40 | Arousal | 0.522 | all53 |
| Fatigue/Exhaustion | 0.224 | emo40 | Confident vs. Hesitant | 0.521 | all53 |
| Relief | 0.237 | emo40 | Interest | 0.537 | emo40 |
| Fear | 0.257 | emo40 | Concentration | 0.553 | all53 |
| Teasing | 0.255 | emo40 | Hope/Enthusiasm | 0.558 | emo40 |
| Pitch (High vs. Low) | 0.273 | all53 | Sadness | 0.564 | emo40 |
| Pain | 0.274 | emo40 | Distress | 0.559 | emo40 |
| Astonishment/Surprise | 0.299 | emo40 | Gender | 0.597 | all53 |
| Doubt | 0.302 | emo40 | Warm vs. Cold | 0.571 | all53 |
| Malevolence/Malice | 0.301 | all53 | Submissive vs. Dominant | 0.468 | all53 |
| Bitterness | 0.312 | emo40 | Vulnerable vs. Detached | 0.541 | all53 |
| Confusion | 0.314 | emo40 | Monotone vs. Expressive | 0.406 | attr13 |
| Authenticity | 0.331 | all53 | Age | 0.451 | attr13 |
| Contempt | 0.352 | all53 | Valence | 0.985 | attr13 |
| Triumph | 0.373 | emo40 | Soft vs. Harsh | 0.448 | all53 |
| Helplessness | 0.404 | emo40 | Serious vs. Humorous | 0.429 | all53 |
| Contentment | 0.431 | all53 | Longing | 0.408 | emo40 |
| Emotional Numbness | 0.369 | emo40 | Pride | 0.452 | all53 |
| Anger | 0.433 | emo40 | Thankfulness/Gratitude | 0.443 | all53 |
### Attribute Probe (Two-Phase Training)
Predicts 53 attributes from the 512-dim embedding using a shared backbone with separate heads for attributes and duration. This is a legacy model; the v2 Emotion Experts above achieve better per-attribute MAE.
- Phase 1 (pre-training): 10 epochs on TTS-AGI/emotion-attribute-conditioning-dacvae (68K samples)
- Phase 2 (fine-tuning): 50 epochs on TTS-AGI/balanced-audio-snippets-40x3k-DACVAE (132K) + TTS-AGI/emolia-3k-speaker-clusters-DACVAE (27K)
- Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1), attr_head(704→53), dur_head(704→1)
- Loss: Huber (delta=1.0)
- Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
- Best val loss: 0.459 (Phase 2, epoch 41)
- Mean attribute MAE: 0.398
### Speaker Probe (Two-Phase Training)
Predicts 128-dim wavelm timbre embedding from the 512-dim audio embedding.
- Phase 1: 10 epochs on emotion-attribute-conditioning-dacvae (68K samples)
- Phase 2: 30 epochs on TTS-AGI/emolia-3k-speaker-clusters-DACVAE (63K samples, 3000 speaker clusters)
- Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1)→Linear(704→128)
- Val split: 1 sample per cluster (3000 val samples)
- Best val loss: 0.00236
- Cosine similarity: 0.625
- MAE: 0.054
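Predicted timbre embeddings can be compared with cosine similarity for a rough same-speaker check (a sketch; any decision threshold would need to be calibrated and is not provided here):

```python
import torch
import torch.nn.functional as F

def timbre_similarity(t1, t2):
    """Cosine similarity between two 128-dim timbre embeddings."""
    return F.cosine_similarity(t1.unsqueeze(0), t2.unsqueeze(0)).item()

a = torch.randn(128)
sim = timbre_similarity(a, a)  # identical embeddings give similarity 1.0
```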
### Quality Experts
Five independent small MLPs trained on TTS-AGI/balanced-audio-score-datasets-DACVAE.
| Expert | Val MAE | Val Loss | Description |
|---|---|---|---|
| CPS | 4.638 | 4.198 | Characters per second (speech rate) |
| Background Quality | 0.292 | 0.087 | Background noise quality (0-4 scale) |
| Content Enjoyment | 0.483 | 0.188 | Content enjoyment rating (0-8.6 scale) |
| Overall Quality | 0.258 | 0.058 | Overall audio quality (0-3.7 scale) |
| Speech Quality | 0.277 | 0.078 | Speech quality (0-3.9 scale) |
- Architecture: Linear(512→64)→LN→GELU→Dropout(0.1)→Linear(64→64)→LN→GELU→Dropout(0.1)→Linear(64→1)
- Training: 50 epochs, Huber loss, AdamW (lr=1e-3), CosineAnnealingLR, batch_size=4096
### Distortion Expert (Binary Clean vs. Distorted Classifier)
A lightweight binary classifier trained on top of the frozen majestrino encoder to discriminate clean speech from artificially degraded speech. The primary use case is as a fast quality filter for generated TTS output and for cleaning training corpora: it detects signal-level artifacts (clipping, comb-filtering) that the existing DNSMOS-based quality experts may not catch cleanly.
The expert outputs a single logit; apply sigmoid to get P(clean). For absolute quality filtering of natural speech, threshold at 0.5. For relative ranking of TTS-generated samples (which are out-of-distribution for the training set), rank by the raw logit; higher is better.
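The two recommended modes can be sketched as follows (function names are illustrative):

```python
import torch

def filter_clean(logits, threshold=0.5):
    """Absolute filtering of natural speech: keep clips with P(clean) > threshold."""
    return torch.sigmoid(logits) > threshold

def rank_by_cleanliness(logits):
    """Relative ranking of TTS samples: higher raw logit means cleaner."""
    return torch.argsort(logits, descending=True)

logits = torch.tensor([2.0, -1.0, 0.5])
keep = filter_clean(logits)          # keeps indices 0 and 2
order = rank_by_cleanliness(logits)  # order: 0, 2, 1
```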
Training data: 100K clean/distorted pairs (200K total samples) built from laion/emolia-hq English standard-HQ tars. Each clean clip (8 seconds, 48 kHz mono) is paired with exactly one distorted twin. Three distortion families are applied in a deterministic 1/3 cycle:
| Distortion | Parameters | Effect |
|---|---|---|
| Overdrive | 15–30 dB gain + hard clip to [-1, 1] | Heavy clipping / digital distortion |
| Comb short | 5–10 ms delayed copy, 0.6 mix | Metallic / phaser-like coloration |
| Comb long | 40–60 ms delayed copy, 0.6 mix | Slap-back echo / discrete reflection |
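The three distortion families can be reproduced approximately from the parameter table above (a sketch; the original pipeline's exact gain sampling and mixing details may differ):

```python
import numpy as np

SR = 48000  # sample rate used throughout this card

def overdrive(x, gain_db=20.0):
    """Apply 15-30 dB of gain, then hard-clip to [-1, 1]."""
    return np.clip(x * 10 ** (gain_db / 20), -1.0, 1.0)

def comb(x, delay_ms=7.0, mix=0.6):
    """Mix in a delayed copy (5-10 ms short / 40-60 ms long, 0.6 mix)."""
    d = int(SR * delay_ms / 1000)
    delayed = np.concatenate([np.zeros(d, dtype=x.dtype), x[:-d]])
    return x + mix * delayed

x = np.random.uniform(-0.5, 0.5, SR).astype(np.float32)  # 1 s of noise
clipped = overdrive(x, gain_db=20.0)   # heavy clipping
combed = comb(x, delay_ms=7.0)         # comb-short coloration
```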
Pipeline: Raw MP3 → mono 48 kHz float32 → center-crop 8 s → distort → DACVAE encode (both clean + distorted) → majestrino ViT encoder (frozen) → 512-d embedding → save. The classifier head is trained only on these pre-computed embeddings.
- Architecture: Linear(512→128)→GELU→Dropout(0.1)→Linear(128→64)→GELU→Dropout(0.1)→Linear(64→1)
- Loss: BCEWithLogitsLoss
- Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
- Training: 50 epochs, batch_size=256, 10% val split (by pair)
- Test set: 200 clean + 200 distorted pairs, held out before training
Data-scaling results (same 400-sample held-out test set):
| Train pairs | Accuracy | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| 10,000 | 0.8700 | 0.8693 | 0.9422 | 0.9418 |
| 20,000 | 0.8900 | 0.8860 | 0.9582 | 0.9594 |
| 50,000 | 0.8950 | 0.8934 | 0.9732 | 0.9725 |
| 99,800 | 0.9250 | 0.9246 | 0.9840 | 0.9838 |
| Baseline: majestrino speech_quality (median threshold) | 0.6250 | 0.6250 | 0.6401 | β |
Per-distortion-family recall (99.8K model):
| Family | n (test) | Recall (distorted) | Mean P(clean) |
|---|---|---|---|
| Overdrive | 69 | 1.000 | 0.000 |
| Comb long | 73 | 0.959 | 0.077 |
| Comb short | 58 | 0.810 | 0.221 |
| Clean | 200 | 0.920 (recall clean) | 0.888 |
Overdrive is trivially separable at every data scale. Comb-short (5–10 ms delays) is the hardest family but still reaches 81% recall, and both ROC-AUC and accuracy continue to improve with more data; the task has not plateaued.
## Score Generation

### Emotion Annotations
The emotion and speaker attribute annotations were generated using the LAION Emotional Annotation Pipeline, which uses LLM-based analysis of audio transcriptions and acoustic features to produce 55-dimensional emotion/attribute vectors on a 0-4 integer scale.
### Quality Scores
The quality scores (background quality, content enjoyment, overall quality, speech quality) are derived from DNSMOS (Deep Noise Suppression Mean Opinion Score) and related audio quality assessment models. CPS (characters per second) measures speech rate from forced alignment.
## DACVAE Codec
This model uses DACVAE (Discriminator-Augmented Compressed Vector Autoencoder) as the audio codec:
- Model: `facebook/dacvae-watermarked`
- Sample rate: 48,000 Hz
- Hop length: 1,920 samples
- Latent dimension: 128
- Frame rate: 25 fps
- Max duration: 15 seconds (375 frames)
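These numbers are mutually consistent: 48,000 Hz divided by the 1,920-sample hop gives 25 latent frames per second, so the 15 s maximum duration corresponds to 375 frames. A quick check:

```python
sample_rate = 48_000
hop_length = 1_920
frame_rate = sample_rate // hop_length   # 25 fps
max_frames = 15 * frame_rate             # 375 frames for 15 s
print(frame_rate, max_frames)  # 25 375
```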
For optimized inference, we recommend fast-dacvae, which removes weight normalization for faster decoding:

```shell
pip install fast-dacvae
```
## Usage

### Installation

```shell
pip install torch fast-dacvae huggingface_hub
```
### Quick Start

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import numpy as np
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

# === Model Definitions ===
LATENT_DIM = 128
PATCH_SIZE = 5
MAX_FRAMES = 375
EMBED_DIM = 512
PROBE_HIDDEN = 704
TIMBRE_DIM = 128

class LatentAudioEncoder(nn.Module):
    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.patch_embed = nn.Conv1d(LATENT_DIM, hidden_dim, PATCH_SIZE, PATCH_SIZE)
        max_tokens = MAX_FRAMES // PATCH_SIZE + 1
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, max_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=mlp_dim, activation="gelu",
            batch_first=True, norm_first=True,
        )
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, EMBED_DIM)

    def forward(self, x, mask=None):
        B = x.shape[0]
        x = self.patch_embed(x.transpose(1, 2)).transpose(1, 2)
        T_tok = x.shape[1]
        if mask is not None:
            T_fr = mask.shape[1]
            need = T_tok * PATCH_SIZE
            if T_fr < need:
                mask = F.pad(mask.float(), (0, need - T_fr)).bool()
            mask = mask[:, :need].reshape(B, T_tok, PATCH_SIZE).any(dim=2)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, :x.shape[1]]
        pad_mask = None
        if mask is not None:
            cls_valid = torch.ones(B, 1, device=mask.device, dtype=torch.bool)
            pad_mask = ~torch.cat([cls_valid, mask], dim=1)
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=pad_mask)
        out = self.norm(x[:, 0])
        out = self.proj(out)
        return F.normalize(out, p=2, dim=1)

class AttributeProbe(nn.Module):
    def __init__(self, n_attrs=53):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.attr_head = nn.Linear(PROBE_HIDDEN, n_attrs)
        self.duration_head = nn.Linear(PROBE_HIDDEN, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.attr_head(h), self.duration_head(h)

class SpeakerProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.timbre_head = nn.Linear(PROBE_HIDDEN, TIMBRE_DIM)

    def forward(self, x):
        return self.timbre_head(self.backbone(x))

class FlexibleExpert(nn.Module):
    """Multi-output MLP: input_dim -> hidden layers -> output_dim."""
    def __init__(self, input_dim, hidden_layers, output_dim, dropout=0.1):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_layers:
            layers.extend([
                nn.Linear(prev, h), nn.LayerNorm(h),
                nn.GELU(), nn.Dropout(dropout),
            ])
            prev = h
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class QualityExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

# === Load Models ===
REPO = "laion/audio-audio-majestrino-with-emotion-and-quality-experts"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load encoder
enc_path = hf_hub_download(REPO, "encoder.pt")
encoder = LatentAudioEncoder()
encoder.load_state_dict(torch.load(enc_path, map_location="cpu", weights_only=True))
encoder.eval().to(device)

# Load attribute probe
attr_path = hf_hub_download(REPO, "attribute_probe.pt")
attr_ckpt = torch.load(attr_path, map_location="cpu", weights_only=False)
attr_probe = AttributeProbe(n_attrs=attr_ckpt["n_attrs"])
attr_probe.load_state_dict(attr_ckpt["model"])
attr_probe.eval().to(device)
canonical_keys = attr_ckpt["canonical_keys"]  # 53 attribute names

# Load emotion experts (v2 multi-output models, recommended over the attribute probe)
emo40_path = hf_hub_download(REPO, "emotion_expert_emo40.pt")
emo40_ckpt = torch.load(emo40_path, map_location="cpu", weights_only=False)
emo40_expert = FlexibleExpert(EMBED_DIM, emo40_ckpt["layers"], len(emo40_ckpt["output_names"]))
emo40_expert.load_state_dict(emo40_ckpt["model"])
emo40_expert.eval().to(device)
emo40_keys = emo40_ckpt["output_names"]  # 40 emotion names

attr13_path = hf_hub_download(REPO, "emotion_expert_attr13.pt")
attr13_ckpt = torch.load(attr13_path, map_location="cpu", weights_only=False)
attr13_expert = FlexibleExpert(EMBED_DIM, attr13_ckpt["layers"], len(attr13_ckpt["output_names"]))
attr13_expert.load_state_dict(attr13_ckpt["model"])
attr13_expert.eval().to(device)
attr13_keys = attr13_ckpt["output_names"]  # 13 non-emotion attribute names

# Load speaker probe
spk_path = hf_hub_download(REPO, "speaker_probe.pt")
spk_ckpt = torch.load(spk_path, map_location="cpu", weights_only=False)
spk_probe = SpeakerProbe()
spk_probe.load_state_dict(spk_ckpt["model"])
spk_probe.eval().to(device)

# Load quality experts
quality_experts = {}
for score_type in ["cps", "score_background_quality", "score_content_enjoyment",
                   "score_overall_quality", "score_speech_quality"]:
    q_path = hf_hub_download(REPO, f"quality_expert_{score_type}.pt")
    q_ckpt = torch.load(q_path, map_location="cpu", weights_only=False)
    expert = QualityExpert()
    expert.load_state_dict(q_ckpt["model"])
    expert.eval().to(device)
    quality_experts[score_type] = expert

# Load DACVAE for encoding audio
dacvae = DACVAE.load("facebook/dacvae-watermarked").to(device).eval()
for _, mod in dacvae.named_modules():
    try:
        torch.nn.utils.remove_weight_norm(mod)
    except ValueError:
        pass

# === Encode Audio ===
def encode_audio(audio_path):
    """Encode an audio file to a 512-dim embedding."""
    import torchaudio
    wav, sr = torchaudio.load(audio_path)
    if sr != 48000:
        wav = torchaudio.functional.resample(wav, sr, 48000)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)
    wav = wav.unsqueeze(0).to(device)  # (1, 1, samples)

    # Encode to DACVAE latent
    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
        latent = dacvae.encode(wav)  # (1, 128, T)
    latent = latent.float().permute(0, 2, 1)  # (1, T, 128)

    # Truncate/pad to max frames
    T = latent.shape[1]
    if T > MAX_FRAMES:
        latent = latent[:, :MAX_FRAMES]
        T = MAX_FRAMES
    pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
    padded = torch.zeros(1, pad_len, LATENT_DIM, device=device)
    mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
    padded[0, :T] = latent[0]
    mask[0, :T] = True

    # Encode to embedding
    with torch.no_grad(), torch.amp.autocast("cuda"):
        embedding = encoder(padded, mask)  # (1, 512)
    return embedding

# === Predict ===
embedding = encode_audio("your_audio.wav")

# Emotion/speaker attributes (v2 experts, best accuracy)
with torch.no_grad():
    emo40_preds = emo40_expert(embedding).squeeze().cpu().numpy()
    attr13_preds = attr13_expert(embedding).squeeze().cpu().numpy()

print("Top-5 emotions (v2 expert):")
emo_dict = {k: v for k, v in zip(emo40_keys, emo40_preds)}
for k, v in sorted(emo_dict.items(), key=lambda x: -x[1])[:5]:
    print(f"  {k}: {v:.2f}")

print("\nNon-emotion attributes (v2 expert):")
for k, v in zip(attr13_keys, attr13_preds):
    print(f"  {k}: {v:.2f}")

# Alternative: attribute probe (also predicts duration)
with torch.no_grad():
    attrs, duration = attr_probe(embedding)
print(f"\nPredicted duration: {duration.squeeze().cpu().item():.1f}s")

# Speaker embedding
with torch.no_grad():
    timbre = spk_probe(embedding).squeeze().cpu().numpy()  # 128-dim
print(f"Timbre embedding shape: {timbre.shape}")

# Quality scores
print("\nQuality scores:")
for name, expert in quality_experts.items():
    with torch.no_grad():
        score = expert(embedding).squeeze().cpu().item()
    print(f"  {name}: {score:.3f}")

# Distortion expert (binary clean vs. distorted)
dist_path = hf_hub_download(REPO, "distortion_expert.pt")
dist_ckpt = torch.load(dist_path, map_location="cpu", weights_only=False)
dist_expert = nn.Sequential(
    nn.Linear(512, 128), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(128, 64), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(64, 1),
)
dist_expert.load_state_dict(dist_ckpt["model"])
dist_expert.eval().to(device)

with torch.no_grad():
    logit = dist_expert(embedding).squeeze().cpu().item()
p_clean = torch.sigmoid(torch.tensor(logit)).item()
print(f"\nDistortion expert: logit={logit:+.3f} P(clean)={p_clean:.4f}")
```
### Batch Processing (from DACVAE latents directly)

```python
# If you already have DACVAE latents (e.g., from a WebDataset):
latent = np.load("sample.npy")  # (T, 128) float16
latent = latent.astype(np.float32)

T = min(latent.shape[0], MAX_FRAMES)
pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
batch = torch.zeros(1, pad_len, LATENT_DIM, device=device)
mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
batch[0, :T] = torch.from_numpy(latent[:T])
mask[0, :T] = True

with torch.no_grad(), torch.amp.autocast("cuda"):
    embedding = encoder(batch, mask)  # (1, 512)
```
## Attribute Key Reference

The 53 attributes predicted by the attribute probe, in canonical order:

```
Affection, Age, Amusement, Anger, Arousal, Astonishment_Surprise, Authenticity,
Awe, Bitterness, Concentration, Confident_vs._Hesitant, Confusion, Contemplation,
Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation,
Embarrassment, Emotional_Numbness, Fatigue_Exhaustion, Fear, Gender, Helplessness,
High-Pitched_vs._Low-Pitched, Hope_Enthusiasm_Optimism, Impatience_and_Irritability,
Infatuation, Interest, Intoxication_Altered_States_of_Consciousness, Jealousy_&_Envy,
Longing, Malevolence_Malice, Monotone_vs._Expressive, Pain, Pleasure_Ecstasy, Pride,
Relief, Sadness, Serious_vs._Humorous, Sexual_Lust, Shame, Soft_vs._Harsh, Sourness,
Submissive_vs._Dominant, Teasing, Thankfulness_Gratitude, Triumph, Valence,
Vulnerable_vs._Emotionally_Detached, Warm_vs._Cold
```
## License

Apache 2.0
## Citation

If you use this model, please cite:

```bibtex
@misc{laion2026majestrino,
  title={Audio-Audio Majestrino with Emotion and Quality Experts},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/audio-audio-majestrino-with-emotion-and-quality-experts}
}
```