# Audio-Audio Majestrino with Emotion and Quality Experts
A ViT-Base audio encoder trained via contrastive learning on 100 million DACVAE-encoded audio samples, paired with lightweight MLP expert heads for emotion recognition, speaker embedding prediction, and audio quality assessment.
## Overview
This model operates in the DACVAE latent space rather than on raw waveforms. Audio is first encoded to DACVAE latents (25 fps × 128-dim float vectors), then the ViT-Base encoder produces a 512-dimensional embedding. Specialized MLP heads predict various audio attributes from this embedding.
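For a 15-second clip this works out to 375 latent frames, which the stride-5 patch embedding reduces to 75 transformer tokens (plus a CLS token). A minimal shape sketch of the patching step, using the layer sizes quoted in this card:

```python
import torch
import torch.nn as nn

# DACVAE latents: 25 fps x 128 dims; 15 s -> 375 frames
latents = torch.randn(1, 128, 375)  # (batch, latent_dim, frames)

# 1D patch embedding: Conv1d(128 -> 768, kernel=5, stride=5)
patch_embed = nn.Conv1d(128, 768, kernel_size=5, stride=5)
tokens = patch_embed(latents).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 75, 768])
```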
## Architecture

```
Raw Audio (48 kHz)
        ↓
DACVAE Encoder (facebook/dacvae-watermarked)
        ↓
DACVAE Latents (T × 128, float16, 25 fps)
        ↓
ViT-Base Encoder (86M params)
  - 1D patch embedding: Conv1d(128→768, kernel=5, stride=5)
  - 12 transformer layers (768-dim, 12 heads, 3072 MLP dim)
  - CLS token pooling → LayerNorm → Linear(768→512) → L2-normalize
        ↓
512-dim Audio Embedding
        ↓
Expert MLP Heads:
├── Emotion Expert (144K params) → 40 emotion predictions (best)
├── Attribute Expert (60K params) → 13 non-emotion attribute predictions (best)
├── All-53 Expert (212K params) → 53 combined predictions (alternative)
├── Attribute Probe (898K params) → 53 emotion/speaker attributes + duration
├── Speaker Probe (950K params) → 128-dim timbre embedding
├── 5× Quality Experts (37K each) → individual quality scores
└── Distortion Expert (74K params) → binary clean vs. distorted (BCE)
```
## Models Included

| Model | File | Params | Description |
|---|---|---|---|
| ViT-Base Encoder | `encoder.pt` | 86M | Core audio encoder, produces 512-dim embeddings |
| Emotion Expert (40) | `emotion_expert_emo40.pt` | 144K | Best for 40 emotion dimensions (MAE 0.345) |
| Attribute Expert (13) | `emotion_expert_attr13.pt` | 60K | Best for 13 non-emotion attributes (MAE 0.507) |
| All-53 Expert | `emotion_expert_all53.pt` | 212K | All 53 attributes in one model (MAE 0.387) |
| Attribute Probe | `attribute_probe.pt` | 898K | Predicts 53 emotion/speaker attributes + duration |
| Speaker Probe | `speaker_probe.pt` | 950K | Predicts 128-dim wavelm timbre embedding |
| Quality: CPS | `quality_expert_cps.pt` | 37K | Characters per second (speech rate) |
| Quality: Background | `quality_expert_score_background_quality.pt` | 37K | Background noise quality |
| Quality: Content | `quality_expert_score_content_enjoyment.pt` | 37K | Content enjoyment |
| Quality: Overall | `quality_expert_score_overall_quality.pt` | 37K | Overall audio quality |
| Quality: Speech | `quality_expert_score_speech_quality.pt` | 37K | Speech quality |
| Distortion Expert | `distortion_expert.pt` | 74K | Binary clean vs. distorted classifier (92.5% acc, 0.984 ROC-AUC) |
## Training Details

### ViT-Base Encoder (Contrastive Pre-training)
The encoder was trained using symmetric InfoNCE contrastive loss with data augmentation on DACVAE latents:
- Training data: 100M samples from TTS-AGI/maestrino-data-DACVAE, TTS-AGI/enhanced-audiosnippets-DACVAE, and TTS-AGI/emotion-attribute-conditioning-dacvae
- Effective batch size: 8,192 (128 per GPU × 8 GPUs × 8 gradient accumulation steps)
- Optimizer: AdamW (lr=5e-4, betas=(0.9, 0.98), weight_decay=0.05)
- Schedule: Cosine annealing with 5% warmup
- Augmentations: Random split (40%), frame drop (30%), random crop (30%)
- Training steps: 12,207 (100M samples seen)
- Hardware: 8× GPUs, distributed training with bfloat16 mixed precision
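The symmetric InfoNCE objective used here can be sketched as follows (a simplified illustration; the temperature value is an assumption, not taken from the training code):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(za, zb, temperature=0.07):
    """Symmetric InfoNCE between two augmented views of the same batch.

    za, zb: (B, D) embeddings; matching rows are positives,
    all other rows serve as in-batch negatives.
    """
    za = F.normalize(za, dim=1)
    zb = F.normalize(zb, dim=1)
    logits = za @ zb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(za.shape[0], device=za.device)
    # Average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = symmetric_info_nce(torch.randn(8, 512), torch.randn(8, 512))
```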
Final validation metrics (audio-to-audio retrieval, 3K samples):
| Metric | Value |
|---|---|
| Drop→Complete Top-1 | 99.1% |
| Drop→Complete Top-5 | 100% |
| Complete→Drop Top-1 | 99.2% |
| Complete→Drop Top-5 | 100% |
| Half1→Half2 Top-1 | 91.7% |
| Half1→Half2 Top-3 | 97.8% |
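These retrieval numbers come from ranking each query embedding against the full gallery by cosine similarity. A minimal top-k accuracy helper (illustrative, not the actual evaluation script):

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(queries, gallery, k=1):
    """Fraction of queries whose true pair is in the top-k by cosine similarity.

    queries, gallery: (N, D); row i of the gallery is the positive for query i.
    """
    q = F.normalize(queries, dim=1)
    g = F.normalize(gallery, dim=1)
    sims = q @ g.t()                       # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices     # (N, k)
    targets = torch.arange(q.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Sanity check: a set retrieved against itself is perfect at top-1
emb = F.normalize(torch.randn(100, 512), dim=1)
acc = topk_retrieval_accuracy(emb, emb, k=1)
print(acc)  # 1.0
```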
### Emotion Experts v2 (Multi-Output MLPs)
Three multi-output MLP experts trained on 220K merged samples (68K emotion-attribute-conditioning + 159K balanced-audio-snippets, deduplicated). Evaluated on a common held-out 2000-sample validation set.
53 Attributes (scale 0-4):
- 40 emotions: Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph
- 10 speaker/voice attributes: Age, Gender, Confident vs. Hesitant, High-Pitched vs. Low-Pitched, Monotone vs. Expressive, Serious vs. Humorous, Soft vs. Harsh, Submissive vs. Dominant, Vulnerable vs. Emotionally Detached, Warm vs. Cold
- 3 dimensional: Valence, Arousal, Authenticity
| Model | Architecture | Params | Outputs | Mean MAE |
|---|---|---|---|---|
| Emotion-40 (best for emotions) | 512→192→192→40 | 144K | 40 emotions | 0.345 |
| Attribute-13 (best for non-emotions) | 512→96→96→13 | 60K | 13 non-emotion attrs | 0.507 |
| All-53 (combined) | 512→256→256→53 | 212K | All 53 attributes | 0.387 |
- Training: 200 epochs, Huber loss (delta=1.0), AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR, batch_size=4096
- Architecture: [Linear→LayerNorm→GELU→Dropout(0.1)] × 2 hidden layers → Linear→output_dim
- The Emotion-40 model wins 31/53 attributes, All-53 wins 19/53, Attribute-13 wins 3/53
For best per-attribute performance, use Emotion-40 for the 40 emotion dims and Attribute-13 for the 13 non-emotion dims.
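A combined 53-attribute prediction can be assembled by routing each dimension to its best expert (a sketch; the helper and the stand-in experts/keys below are illustrative, while the real models and key lists come from the loading code in Usage):

```python
import torch
import torch.nn as nn

def combine_predictions(emb, emo40_expert, emo40_keys, attr13_expert, attr13_keys):
    """Route each attribute to the expert that predicts it best:
    Emotion-40 for the 40 emotion dims, Attribute-13 for the rest."""
    with torch.no_grad():
        emo = emo40_expert(emb).squeeze(0)
        attr = attr13_expert(emb).squeeze(0)
    preds = {k: v.item() for k, v in zip(emo40_keys, emo)}
    preds.update({k: v.item() for k, v in zip(attr13_keys, attr)})
    return preds  # 53 attribute-name -> score entries

# Demo with stand-in linear experts (use the loaded FlexibleExpert models in practice)
emo40_keys = [f"emotion_{i}" for i in range(40)]
attr13_keys = [f"attr_{i}" for i in range(13)]
preds = combine_predictions(torch.randn(1, 512),
                            nn.Linear(512, 40), emo40_keys,
                            nn.Linear(512, 13), attr13_keys)
print(len(preds))  # 53
```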
Per-attribute MAE (best model per attribute, sorted best to worst):
| Attribute | Best MAE | Model | Attribute | Best MAE | Model |
|---|---|---|---|---|---|
| Jealousy & Envy | 0.099 | emo40 | Contemplation | 0.429 | emo40 |
| Embarrassment | 0.151 | emo40 | Affection | 0.435 | all53 |
| Infatuation | 0.156 | emo40 | Amusement | 0.284 | emo40 |
| Shame | 0.155 | emo40 | Elation | 0.443 | emo40 |
| Intoxication | 0.165 | emo40 | Pleasure/Ecstasy | 0.447 | all53 |
| Sexual Lust | 0.184 | emo40 | Disappointment | 0.479 | emo40 |
| Sourness | 0.207 | emo40 | Impatience | 0.522 | emo40 |
| Disgust | 0.222 | emo40 | Arousal | 0.522 | all53 |
| Fatigue/Exhaustion | 0.224 | emo40 | Confident vs. Hesitant | 0.521 | all53 |
| Relief | 0.237 | emo40 | Interest | 0.537 | emo40 |
| Fear | 0.257 | emo40 | Concentration | 0.553 | all53 |
| Teasing | 0.255 | emo40 | Hope/Enthusiasm | 0.558 | emo40 |
| Pitch (High vs. Low) | 0.273 | all53 | Sadness | 0.564 | emo40 |
| Pain | 0.274 | emo40 | Distress | 0.559 | emo40 |
| Astonishment/Surprise | 0.299 | emo40 | Gender | 0.597 | all53 |
| Doubt | 0.302 | emo40 | Warm vs. Cold | 0.571 | all53 |
| Malevolence/Malice | 0.301 | all53 | Submissive vs. Dominant | 0.468 | all53 |
| Bitterness | 0.312 | emo40 | Vulnerable vs. Detached | 0.541 | all53 |
| Confusion | 0.314 | emo40 | Monotone vs. Expressive | 0.406 | attr13 |
| Authenticity | 0.331 | all53 | Age | 0.451 | attr13 |
| Contempt | 0.352 | all53 | Valence | 0.985 | attr13 |
| Triumph | 0.373 | emo40 | Soft vs. Harsh | 0.448 | all53 |
| Helplessness | 0.404 | emo40 | Serious vs. Humorous | 0.429 | all53 |
| Contentment | 0.431 | all53 | Longing | 0.408 | emo40 |
| Emotional Numbness | 0.369 | emo40 | Pride | 0.452 | all53 |
| Anger | 0.433 | emo40 | Thankfulness/Gratitude | 0.443 | all53 |
### Attribute Probe (Two-Phase Training)
Predicts 53 attributes from the 512-dim embedding using a shared backbone with separate heads for attributes and duration. This is a legacy model; the v2 Emotion Experts above achieve better per-attribute MAE.
- Phase 1 (pre-training): 10 epochs on TTS-AGI/emotion-attribute-conditioning-dacvae (68K samples)
- Phase 2 (fine-tuning): 50 epochs on TTS-AGI/balanced-audio-snippets-40x3k-DACVAE (132K) + TTS-AGI/emolia-3k-speaker-clusters-DACVAE (27K)
- Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1), attr_head(704→53), dur_head(704→1)
- Loss: Huber (delta=1.0)
- Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
- Best val loss: 0.459 (Phase 2, epoch 41)
- Mean attribute MAE: 0.398
### Speaker Probe (Two-Phase Training)
Predicts 128-dim wavelm timbre embedding from the 512-dim audio embedding.
- Phase 1: 10 epochs on emotion-attribute-conditioning-dacvae (68K samples)
- Phase 2: 30 epochs on TTS-AGI/emolia-3k-speaker-clusters-DACVAE (63K samples, 3000 speaker clusters)
- Architecture: Linear(512→704)→LN→GELU→Dropout(0.1)→Linear(704→704)→LN→GELU→Dropout(0.1)→Linear(704→128)
- Val split: 1 sample per cluster (3000 val samples)
- Best val loss: 0.00236
- Cosine similarity: 0.625
- MAE: 0.054
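Predicted timbre embeddings can be compared with cosine similarity for a rough same-speaker check (a sketch; any decision threshold would need to be calibrated and is not provided here):

```python
import torch
import torch.nn.functional as F

def timbre_similarity(t1, t2):
    """Cosine similarity between two 128-dim timbre embeddings."""
    return F.cosine_similarity(t1.unsqueeze(0), t2.unsqueeze(0)).item()

a = torch.randn(128)
sim = timbre_similarity(a, a)  # identical embeddings give similarity 1.0
```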
### Quality Experts
Five independent small MLPs trained on TTS-AGI/balanced-audio-score-datasets-DACVAE.
| Expert | Val MAE | Val Loss | Description |
|---|---|---|---|
| CPS | 4.638 | 4.198 | Characters per second (speech rate) |
| Background Quality | 0.292 | 0.087 | Background noise quality (0-4 scale) |
| Content Enjoyment | 0.483 | 0.188 | Content enjoyment rating (0-8.6 scale) |
| Overall Quality | 0.258 | 0.058 | Overall audio quality (0-3.7 scale) |
| Speech Quality | 0.277 | 0.078 | Speech quality (0-3.9 scale) |
- Architecture: Linear(512→64)→LN→GELU→Dropout(0.1)→Linear(64→64)→LN→GELU→Dropout(0.1)→Linear(64→1)
- Training: 50 epochs, Huber loss, AdamW (lr=1e-3), CosineAnnealingLR, batch_size=4096
### Distortion Expert (Binary Clean vs. Distorted Classifier)
A lightweight binary classifier trained on top of the frozen majestrino encoder to discriminate clean speech from artificially degraded speech. The primary use case is as a fast quality filter for generated TTS output and for cleaning training corpora: it detects signal-level artifacts (clipping, comb-filtering) that the existing DNSMOS-based quality experts may not catch cleanly.
The expert outputs a single logit; apply sigmoid to get P(clean). For absolute quality filtering of natural speech, threshold at 0.5. For relative ranking of TTS-generated samples (which are out-of-distribution for the training set), rank by the raw logit; higher is better.
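The two recommended modes can be sketched as follows (function names are illustrative):

```python
import torch

def filter_clean(logits, threshold=0.5):
    """Absolute filtering of natural speech: keep clips with P(clean) > threshold."""
    return torch.sigmoid(logits) > threshold

def rank_by_cleanliness(logits):
    """Relative ranking of TTS samples: higher raw logit means cleaner."""
    return torch.argsort(logits, descending=True)

logits = torch.tensor([2.0, -1.0, 0.5])
keep = filter_clean(logits)          # keeps indices 0 and 2
order = rank_by_cleanliness(logits)  # order: 0, 2, 1
```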
Training data: 100K clean/distorted pairs (200K total samples) built from laion/emolia-hq English standard-HQ tars. Each clean clip (8 seconds, 48 kHz mono) is paired with exactly one distorted twin. Three distortion families are applied in a deterministic 1/3 cycle:
| Distortion | Parameters | Effect |
|---|---|---|
| Overdrive | 15–30 dB gain + hard clip to [-1, 1] | Heavy clipping / digital distortion |
| Comb short | 5–10 ms delayed copy, 0.6 mix | Metallic / phaser-like coloration |
| Comb long | 40–60 ms delayed copy, 0.6 mix | Slap-back echo / discrete reflection |
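The three distortion families can be reproduced approximately from the parameter table above (a sketch; the original pipeline's exact gain sampling and mixing details may differ):

```python
import numpy as np

SR = 48000  # sample rate used throughout this card

def overdrive(x, gain_db=20.0):
    """Apply 15-30 dB of gain, then hard-clip to [-1, 1]."""
    return np.clip(x * 10 ** (gain_db / 20), -1.0, 1.0)

def comb(x, delay_ms=7.0, mix=0.6):
    """Mix in a delayed copy (5-10 ms short / 40-60 ms long, 0.6 mix)."""
    d = int(SR * delay_ms / 1000)
    delayed = np.concatenate([np.zeros(d, dtype=x.dtype), x[:-d]])
    return x + mix * delayed

x = np.random.uniform(-0.5, 0.5, SR).astype(np.float32)  # 1 s of noise
clipped = overdrive(x, gain_db=20.0)   # heavy clipping
combed = comb(x, delay_ms=7.0)         # comb-short coloration
```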
Pipeline: Raw MP3 → mono 48 kHz float32 → center-crop 8 s → distort → DACVAE encode (both clean + distorted) → majestrino ViT encoder (frozen) → 512-d embedding → save. The classifier head is trained only on these pre-computed embeddings.
- Architecture: Linear(512→128)→GELU→Dropout(0.1)→Linear(128→64)→GELU→Dropout(0.1)→Linear(64→1)
- Loss: BCEWithLogitsLoss
- Optimizer: AdamW (lr=1e-3, weight_decay=1e-4), CosineAnnealingLR
- Training: 50 epochs, batch_size=256, 10% val split (by pair)
- Test set: 200 clean + 200 distorted pairs, held out before training
Data-scaling results (same 400-sample held-out test set):
| Train pairs | Accuracy | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| 10,000 | 0.8700 | 0.8693 | 0.9422 | 0.9418 |
| 20,000 | 0.8900 | 0.8860 | 0.9582 | 0.9594 |
| 50,000 | 0.8950 | 0.8934 | 0.9732 | 0.9725 |
| 99,800 | 0.9250 | 0.9246 | 0.9840 | 0.9838 |
| Baseline: majestrino speech_quality (median threshold) | 0.6250 | 0.6250 | 0.6401 | β |
Per-distortion-family recall (99.8K model):
| Family | n (test) | Recall (distorted) | Mean P(clean) |
|---|---|---|---|
| Overdrive | 69 | 1.000 | 0.000 |
| Comb long | 73 | 0.959 | 0.077 |
| Comb short | 58 | 0.810 | 0.221 |
| Clean | 200 | 0.920 (recall clean) | 0.888 |
Overdrive is trivially separable at every data scale. Comb-short (5–10 ms delays) is the hardest family but still reaches 81% recall, and both ROC-AUC and accuracy continue to improve with more data; the task has not plateaued.
## Score Generation

### Emotion Annotations
The emotion and speaker attribute annotations were generated using the LAION Emotional Annotation Pipeline, which uses LLM-based analysis of audio transcriptions and acoustic features to produce 55-dimensional emotion/attribute vectors on a 0-4 integer scale.
### Quality Scores
The quality scores (background quality, content enjoyment, overall quality, speech quality) are derived from DNSMOS (Deep Noise Suppression Mean Opinion Score) and related audio quality assessment models. CPS (characters per second) measures speech rate from forced alignment.
## DACVAE Codec
This model uses DACVAE (Discriminator-Augmented Compressed Vector Autoencoder) as the audio codec:
- Model: `facebook/dacvae-watermarked`
- Sample rate: 48,000 Hz
- Hop length: 1,920 samples
- Latent dimension: 128
- Frame rate: 25 fps
- Max duration: 15 seconds (375 frames)
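These numbers are mutually consistent: 48,000 Hz divided by the 1,920-sample hop gives 25 latent frames per second, so the 15 s maximum duration corresponds to 375 frames. A quick check:

```python
sample_rate = 48_000
hop_length = 1_920
frame_rate = sample_rate // hop_length   # 25 fps
max_frames = 15 * frame_rate             # 375 frames for 15 s
print(frame_rate, max_frames)  # 25 375
```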
For optimized inference, we recommend fast-dacvae, which removes weight normalization for faster decoding:

```shell
pip install fast-dacvae
```
## Usage

### Installation

```shell
pip install torch fast-dacvae huggingface_hub
```
### Quick Start

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import numpy as np
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

# === Model Definitions ===
LATENT_DIM = 128
PATCH_SIZE = 5
MAX_FRAMES = 375
EMBED_DIM = 512
PROBE_HIDDEN = 704
TIMBRE_DIM = 128

class LatentAudioEncoder(nn.Module):
    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.patch_embed = nn.Conv1d(LATENT_DIM, hidden_dim, PATCH_SIZE, PATCH_SIZE)
        max_tokens = MAX_FRAMES // PATCH_SIZE + 1
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, max_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=mlp_dim, activation="gelu",
            batch_first=True, norm_first=True,
        )
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, EMBED_DIM)

    def forward(self, x, mask=None):
        B = x.shape[0]
        x = self.patch_embed(x.transpose(1, 2)).transpose(1, 2)
        T_tok = x.shape[1]
        if mask is not None:
            T_fr = mask.shape[1]
            need = T_tok * PATCH_SIZE
            if T_fr < need:
                mask = F.pad(mask.float(), (0, need - T_fr)).bool()
            mask = mask[:, :need].reshape(B, T_tok, PATCH_SIZE).any(dim=2)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, :x.shape[1]]
        pad_mask = None
        if mask is not None:
            cls_valid = torch.ones(B, 1, device=mask.device, dtype=torch.bool)
            pad_mask = ~torch.cat([cls_valid, mask], dim=1)
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=pad_mask)
        out = self.norm(x[:, 0])
        out = self.proj(out)
        return F.normalize(out, p=2, dim=1)

class AttributeProbe(nn.Module):
    def __init__(self, n_attrs=53):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.attr_head = nn.Linear(PROBE_HIDDEN, n_attrs)
        self.duration_head = nn.Linear(PROBE_HIDDEN, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.attr_head(h), self.duration_head(h)

class SpeakerProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(EMBED_DIM, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(PROBE_HIDDEN, PROBE_HIDDEN), nn.LayerNorm(PROBE_HIDDEN),
            nn.GELU(), nn.Dropout(0.1),
        )
        self.timbre_head = nn.Linear(PROBE_HIDDEN, TIMBRE_DIM)

    def forward(self, x):
        return self.timbre_head(self.backbone(x))

class FlexibleExpert(nn.Module):
    """Multi-output MLP: input_dim -> hidden layers -> output_dim."""
    def __init__(self, input_dim, hidden_layers, output_dim, dropout=0.1):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_layers:
            layers.extend([
                nn.Linear(prev, h), nn.LayerNorm(h),
                nn.GELU(), nn.Dropout(dropout),
            ])
            prev = h
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class QualityExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 64), nn.LayerNorm(64),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

# === Load Models ===
REPO = "laion/audio-audio-majestrino-with-emotion-and-quality-experts"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load encoder
enc_path = hf_hub_download(REPO, "encoder.pt")
encoder = LatentAudioEncoder()
encoder.load_state_dict(torch.load(enc_path, map_location="cpu", weights_only=True))
encoder.eval().to(device)

# Load attribute probe
attr_path = hf_hub_download(REPO, "attribute_probe.pt")
attr_ckpt = torch.load(attr_path, map_location="cpu", weights_only=False)
attr_probe = AttributeProbe(n_attrs=attr_ckpt["n_attrs"])
attr_probe.load_state_dict(attr_ckpt["model"])
attr_probe.eval().to(device)
canonical_keys = attr_ckpt["canonical_keys"]  # 53 attribute names

# Load emotion experts (v2 multi-output models, recommended over the attribute probe)
emo40_path = hf_hub_download(REPO, "emotion_expert_emo40.pt")
emo40_ckpt = torch.load(emo40_path, map_location="cpu", weights_only=False)
emo40_expert = FlexibleExpert(EMBED_DIM, emo40_ckpt["layers"], len(emo40_ckpt["output_names"]))
emo40_expert.load_state_dict(emo40_ckpt["model"])
emo40_expert.eval().to(device)
emo40_keys = emo40_ckpt["output_names"]  # 40 emotion names

attr13_path = hf_hub_download(REPO, "emotion_expert_attr13.pt")
attr13_ckpt = torch.load(attr13_path, map_location="cpu", weights_only=False)
attr13_expert = FlexibleExpert(EMBED_DIM, attr13_ckpt["layers"], len(attr13_ckpt["output_names"]))
attr13_expert.load_state_dict(attr13_ckpt["model"])
attr13_expert.eval().to(device)
attr13_keys = attr13_ckpt["output_names"]  # 13 non-emotion attribute names

# Load speaker probe
spk_path = hf_hub_download(REPO, "speaker_probe.pt")
spk_ckpt = torch.load(spk_path, map_location="cpu", weights_only=False)
spk_probe = SpeakerProbe()
spk_probe.load_state_dict(spk_ckpt["model"])
spk_probe.eval().to(device)

# Load quality experts
quality_experts = {}
for score_type in ["cps", "score_background_quality", "score_content_enjoyment",
                   "score_overall_quality", "score_speech_quality"]:
    q_path = hf_hub_download(REPO, f"quality_expert_{score_type}.pt")
    q_ckpt = torch.load(q_path, map_location="cpu", weights_only=False)
    expert = QualityExpert()
    expert.load_state_dict(q_ckpt["model"])
    expert.eval().to(device)
    quality_experts[score_type] = expert

# Load DACVAE for encoding audio
dacvae = DACVAE.load("facebook/dacvae-watermarked").to(device).eval()
for _, mod in dacvae.named_modules():
    try:
        torch.nn.utils.remove_weight_norm(mod)
    except ValueError:
        pass

# === Encode Audio ===
def encode_audio(audio_path):
    """Encode an audio file to a 512-dim embedding."""
    import torchaudio
    wav, sr = torchaudio.load(audio_path)
    if sr != 48000:
        wav = torchaudio.functional.resample(wav, sr, 48000)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)
    wav = wav.unsqueeze(0).to(device)  # (1, 1, samples)

    # Encode to DACVAE latent
    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
        latent = dacvae.encode(wav)  # (1, 128, T)
    latent = latent.float().permute(0, 2, 1)  # (1, T, 128)

    # Truncate/pad to max frames
    T = latent.shape[1]
    if T > MAX_FRAMES:
        latent = latent[:, :MAX_FRAMES]
        T = MAX_FRAMES
    pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
    padded = torch.zeros(1, pad_len, LATENT_DIM, device=device)
    mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
    padded[0, :T] = latent[0]
    mask[0, :T] = True

    # Encode to embedding
    with torch.no_grad(), torch.amp.autocast("cuda"):
        embedding = encoder(padded, mask)  # (1, 512)
    return embedding

# === Predict ===
embedding = encode_audio("your_audio.wav")

# Emotion/speaker attributes (v2 experts, best accuracy)
with torch.no_grad():
    emo40_preds = emo40_expert(embedding).squeeze().cpu().numpy()
    attr13_preds = attr13_expert(embedding).squeeze().cpu().numpy()

print("Top-5 emotions (v2 expert):")
emo_dict = {k: v for k, v in zip(emo40_keys, emo40_preds)}
for k, v in sorted(emo_dict.items(), key=lambda x: -x[1])[:5]:
    print(f"  {k}: {v:.2f}")

print("\nNon-emotion attributes (v2 expert):")
for k, v in zip(attr13_keys, attr13_preds):
    print(f"  {k}: {v:.2f}")

# Alternative: attribute probe (also predicts duration)
with torch.no_grad():
    attrs, duration = attr_probe(embedding)
print(f"\nPredicted duration: {duration.squeeze().cpu().item():.1f}s")

# Speaker embedding
with torch.no_grad():
    timbre = spk_probe(embedding).squeeze().cpu().numpy()  # 128-dim
print(f"Timbre embedding shape: {timbre.shape}")

# Quality scores
print("\nQuality scores:")
for name, expert in quality_experts.items():
    with torch.no_grad():
        score = expert(embedding).squeeze().cpu().item()
    print(f"  {name}: {score:.3f}")

# Distortion expert (binary clean vs. distorted)
dist_path = hf_hub_download(REPO, "distortion_expert.pt")
dist_ckpt = torch.load(dist_path, map_location="cpu", weights_only=False)
dist_expert = nn.Sequential(
    nn.Linear(512, 128), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(128, 64), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(64, 1),
)
dist_expert.load_state_dict(dist_ckpt["model"])
dist_expert.eval().to(device)

with torch.no_grad():
    logit = dist_expert(embedding).squeeze().cpu().item()
p_clean = torch.sigmoid(torch.tensor(logit)).item()
print(f"\nDistortion expert: logit={logit:+.3f} P(clean)={p_clean:.4f}")
```
### Batch Processing (from DACVAE latents directly)

```python
# If you already have DACVAE latents (e.g., from a WebDataset):
latent = np.load("sample.npy")  # (T, 128) float16
latent = latent.astype(np.float32)

T = min(latent.shape[0], MAX_FRAMES)
pad_len = ((T + PATCH_SIZE - 1) // PATCH_SIZE) * PATCH_SIZE
batch = torch.zeros(1, pad_len, LATENT_DIM, device=device)
mask = torch.zeros(1, pad_len, dtype=torch.bool, device=device)
batch[0, :T] = torch.from_numpy(latent[:T])
mask[0, :T] = True

with torch.no_grad(), torch.amp.autocast("cuda"):
    embedding = encoder(batch, mask)  # (1, 512)
```
## Attribute Key Reference

The 53 attributes predicted by the attribute probe, in canonical order:

```
Affection, Age, Amusement, Anger, Arousal, Astonishment_Surprise, Authenticity,
Awe, Bitterness, Concentration, Confident_vs._Hesitant, Confusion, Contemplation,
Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation,
Embarrassment, Emotional_Numbness, Fatigue_Exhaustion, Fear, Gender, Helplessness,
High-Pitched_vs._Low-Pitched, Hope_Enthusiasm_Optimism, Impatience_and_Irritability,
Infatuation, Interest, Intoxication_Altered_States_of_Consciousness, Jealousy_&_Envy,
Longing, Malevolence_Malice, Monotone_vs._Expressive, Pain, Pleasure_Ecstasy, Pride,
Relief, Sadness, Serious_vs._Humorous, Sexual_Lust, Shame, Soft_vs._Harsh, Sourness,
Submissive_vs._Dominant, Teasing, Thankfulness_Gratitude, Triumph, Valence,
Vulnerable_vs._Emotionally_Detached, Warm_vs._Cold
```
## License

Apache 2.0
## Citation

If you use this model, please cite:

```bibtex
@misc{laion2026majestrino,
  title={Audio-Audio Majestrino with Emotion and Quality Experts},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/audio-audio-majestrino-with-emotion-and-quality-experts}
}
```