Majestrino-1.00 Voice Experts

A collection of 57 voice dimension expert models and one binary speech detector that predict interpretable voice attributes from Majestrino-1.00 embeddings.

Overview

This repository contains expert MLP models trained to predict voice attributes across 57 dimensions (age, gender, emotion, speaking style, vocal quality, etc.) on a 0-6 scale, plus a binary speech detector. All experts operate on 768-dimensional Majestrino-1.00 embeddings extracted from audio using the laion/Majestrino-1.00 encoder (Whisper-small encoder + MLP projection).

Key Features:

  • 57 voice dimension experts with ~197K parameters each
  • Binary speech detector (10K parameters, F1=1.000)
  • Mean balanced adjΒ±1 accuracy: 81.6%
  • Complete inference pipeline with standardization
  • Production-ready with minimal dependencies

Quick Start

import torch
import numpy as np
import torchaudio
from transformers import AutoModel, AutoFeatureExtractor
from huggingface_hub import snapshot_download

# Download all expert files
repo_path = snapshot_download(repo_id="laion/Majestrino-1.00-voice-experts")

# Load Majestrino encoder
encoder = AutoModel.from_pretrained("laion/Majestrino-1.00", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("laion/Majestrino-1.00")
encoder.eval()

# Load audio file (will be resampled to 16kHz)
audio_path = "your_audio.wav"
waveform, sr = torchaudio.load(audio_path)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Extract Majestrino embedding
with torch.no_grad():
    inputs = feature_extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    embedding = encoder(**inputs).last_hidden_state.mean(dim=1)  # [1, 768]
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)

# Load speech detector
speech_stats = np.load(f"{repo_path}/speech_standardization.npz")
speech_mean = torch.tensor(speech_stats['mean'])
speech_std = torch.tensor(speech_stats['std'])
speech_detector = torch.load(f"{repo_path}/speech_detector_best.pt", map_location='cpu')
speech_detector.eval()

# Check if audio contains speech
emb_std = (embedding - speech_mean) / speech_std
with torch.no_grad():
    is_speech = torch.sigmoid(speech_detector(emb_std)).item() > 0.5

print(f"Contains speech: {is_speech}")

if is_speech:
    # Load expert standardization
    expert_stats = np.load(f"{repo_path}/expert_standardization.npz")
    expert_mean = torch.tensor(expert_stats['mean'])
    expert_std_val = torch.tensor(expert_stats['std'])

    # Load a few example experts
    experts = {
        'GEND': torch.load(f"{repo_path}/experts/GEND_ce.pt", map_location='cpu'),
        'AGEV': torch.load(f"{repo_path}/experts/AGEV_huber.pt", map_location='cpu'),
        'VALN': torch.load(f"{repo_path}/experts/VALN_huber.pt", map_location='cpu'),
        'TEMP': torch.load(f"{repo_path}/experts/TEMP_huber.pt", map_location='cpu'),
    }

    # Standardize embedding for experts
    emb_std = (embedding - expert_mean) / expert_std_val

    # Run inference
    results = {}
    with torch.no_grad():
        for dim, model in experts.items():
            model.eval()
            output = model(emb_std)

            if dim == 'GEND':  # CE expert
                pred = torch.argmax(output, dim=-1).item()
            else:  # Huber expert
                pred = torch.clamp(torch.round(output), 0, 6).item()

            results[dim] = int(pred)

    print("\nVoice attributes:")
    print(f"  Gender (0=M, 6=F): {results['GEND']}")
    print(f"  Age (0=child, 6=elderly): {results['AGEV']}")
    print(f"  Valence (0=negative, 6=positive): {results['VALN']}")
    print(f"  Tempo (0=very slow, 6=very fast): {results['TEMP']}")

Model Details

Architecture

Voice Experts (57 models):

  • Input: 768-dim standardized Majestrino embedding
  • Architecture: 768 β†’ 256 β†’ ReLU β†’ Dropout(0.3) β†’ 128 β†’ ReLU β†’ Dropout(0.3) β†’ 64 β†’ ReLU β†’ output
  • Huber experts (43): output=1 neuron (regression), predictions rounded and clipped to 0-6
  • CE experts (14): output=7 neurons (classification), argmax to 0-6
  • Parameters: ~197K per expert
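The layer sizes above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the description, not the exact module layout (or state-dict keys) of the released checkpoints; the class name `VoiceExpert` is an assumption.

```python
import torch
import torch.nn as nn

class VoiceExpert(nn.Module):
    """Sketch of the expert MLP described above: 768 -> 256 -> 128 -> 64 -> out."""
    def __init__(self, out_dim: int = 1):  # out_dim=1 for Huber, 7 for CE
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.net(x)

huber_expert = VoiceExpert(out_dim=1)  # regression head (round + clip to 0-6)
ce_expert = VoiceExpert(out_dim=7)     # 7-class head (argmax to 0-6)

emb = torch.randn(1, 768)              # stands in for a standardized embedding
print(huber_expert(emb).shape)         # torch.Size([1, 1])
print(ce_expert(emb).shape)            # torch.Size([1, 7])
```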

Speech Detector:

  • Input: 768-dim standardized Majestrino embedding
  • Architecture: 768 β†’ 13 β†’ ReLU β†’ Dropout(0.5) β†’ 1
  • Loss: BCEWithLogitsLoss
  • Parameters: 10,011
  • Performance: 100% accuracy, F1=1.000 on validation set
  • Training data: 9K speech + 6K non-speech samples
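The detector head is small enough to verify the stated parameter count directly. A minimal sketch of the architecture above (illustrative; not necessarily the exact module layout of the released checkpoint):

```python
import torch
import torch.nn as nn

# 768 -> 13 -> ReLU -> Dropout(0.5) -> 1, trained with BCEWithLogitsLoss
speech_detector = nn.Sequential(
    nn.Linear(768, 13), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(13, 1),
)

# 768*13 + 13 (first layer) + 13*1 + 1 (second layer) = 10,011
n_params = sum(p.numel() for p in speech_detector.parameters())
print(n_params)  # 10011
```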

Training

  • Optimizer: AdamW with weight decay
  • Loss functions: Huber loss (regression), CrossEntropy loss (classification)
  • Data: Annotated samples from Majestrino dataset with human labels
  • Validation: Balanced holdout sets + Gemini Pro 2.0 validation
  • Selection: Best performing loss type (Huber vs CE) chosen per dimension

Performance

Per-Dimension Results

All 57 voice dimension experts with validation accuracies (adjΒ±1: a prediction counts as correct when it is within Β±1 of the ground-truth value):

| Dimension | Name | Type | Balanced Holdout | Gemini Pro | Samples |
|-----------|------|------|------------------|------------|---------|
| AGEV | Perceived Age | Huber | 90.8% | 92.2% | 1,560 |
| AROU | Arousal | Huber | 90.1% | 89.4% | 2,256 |
| ARSH | Arousal Shift | CE | 88.7% | 91.8% | 150 |
| ATCK | Attack | Huber | 87.8% | 93.4% | 1,344 |
| BKGN | Background Noise | Huber | 86.0% | 72.9% | 270 |
| BRGT | Brightness | CE | 68.2% | 64.3% | 108 |
| CHNK | Chunking | Huber | 81.0% | 89.1% | 456 |
| CLRT | Articulation Clarity | CE | 83.2% | 91.1% | 765 |
| COGL | Cognitive Load | Huber | 57.7% | 71.7% | 912 |
| DARC | Dynamic Arc | Huber | 62.4% | 73.7% | 144 |
| DFLU | Disfluency | Huber | 69.7% | 75.9% | 870 |
| EMPH | Emphasis | Huber | 85.8% | 90.5% | 672 |
| ESTH | Esthetics | Huber | 87.3% | 93.7% | 7,455 |
| EXPL | Explicitness | Huber | 88.7% | 90.8% | 372 |
| FOCS | Focus | CE | 83.3% | 80.3% | 1,446 |
| FULL | Fullness | CE | 73.5% | 69.0% | 138 |
| GEND | Perceived Gender | CE | 70.9% | 82.2% | 1,080 |
| HARM | Harmonicity | Huber | 82.9% | 89.3% | 1,190 |
| METL | Metallic Character | Huber | 88.6% | 93.9% | 565 |
| RANG | Pitch Range | CE | 75.5% | 82.8% | 312 |
| RCQL | Recording Quality | Huber | 91.2% | 88.2% | 3,212 |
| REGS | Register | Huber | 78.7% | 81.5% | 570 |
| RESP | Respiration | Huber | 85.1% | 87.4% | 1,164 |
| ROUG | Roughness | Huber | 81.3% | 84.6% | 678 |
| R_CHST | Chest Resonance | Huber | 84.4% | 88.1% | 2,616 |
| R_HEAD | Head Resonance | Huber | 85.0% | 90.2% | 1,848 |
| R_MASK | Mask Resonance | Huber | 82.6% | 91.9% | 1,236 |
| R_MIXD | Mixed Resonance | CE | 77.9% | 87.0% | 390 |
| R_NASL | Nasal Resonance | Huber | 86.3% | 84.9% | 168 |
| R_ORAL | Oral Resonance | CE | 77.8% | 87.1% | 126 |
| R_THRT | Throat Resonance | Huber | 82.6% | 85.7% | 798 |
| SMTH | Smoothness | Huber | 80.8% | 90.4% | 5,125 |
| STNC | Stance | Huber | 78.6% | 85.8% | 3,300 |
| STRU | Structure | CE | 84.5% | 91.0% | 1,098 |
| S_ASMR | ASMR Style | Huber | 86.8% | 93.6% | 2,130 |
| S_AUTH | Authoritative Style | Huber | 81.8% | 89.1% | 2,712 |
| S_CART | Cartoonish Style | Huber | 82.2% | 86.6% | 1,674 |
| S_CASU | Casual Style | Huber | 78.5% | 85.6% | 768 |
| S_CONV | Conversational Style | Huber | 83.3% | 82.9% | 1,110 |
| S_DRAM | Dramatic Style | Huber | 77.6% | 88.0% | 2,502 |
| S_FORM | Formal Style | Huber | 92.1% | 91.1% | 1,505 |
| S_MONO | Monologue Style | Huber | 68.0% | 70.0% | 18,990 |
| S_NARR | Narrator Style | Huber | 88.1% | 80.2% | 4,695 |
| S_NEWS | Newsreader Style | Huber | 90.9% | 68.6% | 8,270 |
| S_PLAY | Playful Style | Huber | 83.6% | 84.0% | 5,700 |
| S_RANT | Ranting/Angry Style | CE | 82.9% | 86.5% | 7,230 |
| S_STRY | Storytelling Style | Huber | 86.8% | 84.0% | 3,725 |
| S_TECH | Teacher/Didactic Style | Huber | 91.5% | 71.5% | 4,785 |
| S_WHIS | Whisper Style | Huber | 90.0% | 89.2% | 744 |
| TEMP | Tempo | Huber | 78.3% | 75.7% | 246 |
| TENS | Tension | Huber | 86.8% | 87.2% | 1,644 |
| VALN | Valence | Huber | 86.3% | 90.5% | 9,648 |
| VALS | Valence Shift | Huber | 49.7% | 32.6% | 80 |
| VFLX | Velocity Flux | CE | 92.9% | 94.3% | 30 |
| VOLT | Volatility | Huber | 71.0% | 87.6% | 348 |
| VULN | Vulnerability | CE | 78.3% | 82.9% | 1,734 |
| WARM | Warmth | Huber | 82.2% | 88.5% | 1,092 |

Summary Statistics:

  • 43 Huber experts, 14 CE experts
  • Mean balanced adjΒ±1 accuracy: 81.6%
  • Best performers: VFLX (92.9%), S_FORM (92.1%), RCQL (91.2%)
  • Most training data: S_MONO (18,990), VALN (9,648), S_NEWS (8,270)
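The adjΒ±1 metric used throughout the table can be sketched in a few lines of NumPy. The plain version below is exact; the "balanced" variant reflects our reading of the reported numbers as a per-class average, which is an assumption about how the holdout scores were computed.

```python
import numpy as np

def adj1_accuracy(preds, labels):
    """Fraction of predictions within +/-1 of the 0-6 ground-truth value."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(np.abs(preds - labels) <= 1))

def balanced_adj1_accuracy(preds, labels):
    """Per-class adj+/-1 accuracy, averaged over the classes present
    (assumed interpretation of the 'balanced' scores in the table)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    per_class = [np.mean(np.abs(preds[labels == c] - c) <= 1)
                 for c in np.unique(labels)]
    return float(np.mean(per_class))

# Toy data, not real model outputs
print(adj1_accuracy([3, 5, 0, 6], [4, 5, 2, 6]))           # 0.75
print(balanced_adj1_accuracy([1, 3, 6], [0, 0, 6]))        # 0.75
```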

Files

experts/
β”œβ”€β”€ AGEV_huber.pt          # Perceived Age expert
β”œβ”€β”€ AROU_huber.pt          # Arousal expert
β”œβ”€β”€ ARSH_ce.pt             # Arousal Shift expert
β”œβ”€β”€ ... (54 more experts)
β”œβ”€β”€ WARM_huber.pt          # Warmth expert
speech_detector_best.pt    # Binary speech classifier
expert_standardization.npz # Mean/std for expert inputs
speech_standardization.npz # Mean/std for speech detector
inference.py               # Complete inference script

Dimension Descriptions

Core Attributes

  • AGEV: Perceived age (0=child, 6=elderly)
  • GEND: Perceived gender (0=masculine, 6=feminine)
  • TEMP: Speaking tempo (0=very slow, 6=very fast)

Emotional Dimensions

  • VALN: Valence/sentiment (0=negative, 6=positive)
  • AROU: Arousal/energy (0=low, 6=high)
  • VALS: Valence shift over time
  • ARSH: Arousal shift over time
  • VULN: Vulnerability (0=guarded, 6=vulnerable)
  • WARM: Warmth (0=cold, 6=warm)
  • TENS: Tension (0=relaxed, 6=tense)

Vocal Quality

  • HARM: Harmonicity (0=noisy, 6=pure tone)
  • ROUG: Roughness/hoarseness
  • METL: Metallic character
  • BRGT: Brightness
  • FULL: Fullness/richness
  • SMTH: Smoothness
  • ATCK: Attack/onset sharpness

Resonance (R_*)

  • R_CHST: Chest resonance
  • R_HEAD: Head resonance
  • R_MASK: Mask resonance
  • R_NASL: Nasal resonance
  • R_ORAL: Oral resonance
  • R_THRT: Throat resonance
  • R_MIXD: Mixed resonance

Delivery Style

  • CLRT: Articulation clarity
  • EMPH: Emphasis/stress patterns
  • EXPL: Explicitness/directness
  • FOCS: Focus/concentration
  • STNC: Stance/attitude
  • STRU: Structural organization
  • CHNK: Chunking/phrasing
  • DFLU: Disfluency (stutters, fillers)
  • RESP: Respiration audibility

Speaking Styles (S_*)

  • S_MONO: Monologue style
  • S_CONV: Conversational style
  • S_NARR: Narrator style
  • S_STRY: Storytelling style
  • S_NEWS: Newsreader style
  • S_TECH: Teacher/didactic style
  • S_FORM: Formal style
  • S_CASU: Casual style
  • S_DRAM: Dramatic style
  • S_AUTH: Authoritative style
  • S_PLAY: Playful style
  • S_RANT: Ranting/angry style
  • S_WHIS: Whisper style
  • S_CART: Cartoonish style
  • S_ASMR: ASMR style

Technical/Context

  • RCQL: Recording quality
  • BKGN: Background noise
  • COGL: Cognitive load
  • DARC: Dynamic arc (progression)
  • RANG: Pitch range
  • REGS: Register (chest/head voice)
  • VFLX: Velocity flux (rhythm variation)
  • VOLT: Volatility (unpredictability)
  • ESTH: Overall esthetics/pleasantness

Usage Notes

  1. Input Requirements: Audio must be converted to Majestrino-1.00 embeddings first using the base encoder
  2. Standardization: Always apply mean/std normalization before inference
  3. Speech Detection: Run speech detector first to filter non-speech audio
  4. Expert Types: CE experts use argmax, Huber experts use round+clip
  5. Scale: All dimensions output 0-6 integer values
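Notes 2, 4, and 5 can be folded into one small helper. This is a sketch only; the function and argument names are illustrative, and the actual file layout is shown in the Quick Start above.

```python
import torch

def predict_dimension(embedding, expert, mean, std, expert_type):
    """Standardize a [1, 768] embedding and decode one expert's output.

    expert_type: 'ce'    -> argmax over 7 logits
                 'huber' -> round, then clip to the 0-6 scale
    Argument names here are illustrative, not part of the released API.
    """
    x = (embedding - mean) / std          # per-feature standardization
    with torch.no_grad():
        out = expert(x)
    if expert_type == 'ce':
        return int(torch.argmax(out, dim=-1).item())
    return int(torch.clamp(torch.round(out), 0, 6).item())
```

Either way, the result is always an integer on the 0-6 scale, so downstream code can treat CE and Huber experts identically.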

Citation

@misc{majestrino-voice-experts,
  title={Majestrino-1.00 Voice Experts: Interpretable Voice Attribute Prediction},
  author={LAION},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/laion/Majestrino-1.00-voice-experts}}
}

License

MIT License - see LICENSE file for details.
