Majestrino-1.00 Voice Experts

A collection of 57 voice dimension expert models and one binary speech detector that predict interpretable voice attributes from Majestrino-1.00 embeddings.

Overview

This repository contains expert MLP models trained to predict voice attributes across 57 dimensions (age, gender, emotion, speaking style, vocal quality, etc.) on a 0-6 scale, plus a binary speech detector. All experts operate on 768-dimensional Majestrino-1.00 embeddings extracted from audio using the laion/Majestrino-1.00 encoder (Whisper-small encoder + MLP projection).

Key Features:

  • 57 voice dimension experts with ~197K parameters each
  • Binary speech detector (10K parameters, F1=1.000)
  • Mean balanced adjΒ±1 accuracy: 81.6%
  • Complete inference pipeline with standardization
  • Production-ready with minimal dependencies

Quick Start

import torch
import numpy as np
import torchaudio
from transformers import AutoModel, AutoFeatureExtractor
from huggingface_hub import snapshot_download

# Download all expert files
repo_path = snapshot_download(repo_id="laion/Majestrino-1.00-voice-experts")

# Load Majestrino encoder
encoder = AutoModel.from_pretrained("laion/Majestrino-1.00", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("laion/Majestrino-1.00")
encoder.eval()

# Load audio file (will be resampled to 16kHz)
audio_path = "your_audio.wav"
waveform, sr = torchaudio.load(audio_path)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Extract Majestrino embedding
with torch.no_grad():
    inputs = feature_extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    embedding = encoder(**inputs).last_hidden_state.mean(dim=1)  # [1, 768]
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)

# Load speech detector
speech_stats = np.load(f"{repo_path}/speech_standardization.npz")
speech_mean = torch.tensor(speech_stats['mean'])
speech_std = torch.tensor(speech_stats['std'])
speech_detector = torch.load(f"{repo_path}/speech_detector_best.pt", map_location='cpu')
speech_detector.eval()

# Check if audio contains speech
emb_std = (embedding - speech_mean) / speech_std
with torch.no_grad():
    is_speech = torch.sigmoid(speech_detector(emb_std)).item() > 0.5

print(f"Contains speech: {is_speech}")

if is_speech:
    # Load expert standardization
    expert_stats = np.load(f"{repo_path}/expert_standardization.npz")
    expert_mean = torch.tensor(expert_stats['mean'])
    expert_std_val = torch.tensor(expert_stats['std'])

    # Load a few example experts
    experts = {
        'GEND': torch.load(f"{repo_path}/experts/GEND_ce.pt", map_location='cpu'),
        'AGEV': torch.load(f"{repo_path}/experts/AGEV_huber.pt", map_location='cpu'),
        'VALN': torch.load(f"{repo_path}/experts/VALN_huber.pt", map_location='cpu'),
        'TEMP': torch.load(f"{repo_path}/experts/TEMP_huber.pt", map_location='cpu'),
    }

    # Standardize embedding for experts
    emb_std = (embedding - expert_mean) / expert_std_val

    # Run inference
    results = {}
    with torch.no_grad():
        for dim, model in experts.items():
            model.eval()
            output = model(emb_std)

            if dim == 'GEND':  # CE expert
                pred = torch.argmax(output, dim=-1).item()
            else:  # Huber expert
                pred = torch.clamp(torch.round(output), 0, 6).item()

            results[dim] = int(pred)

    print("\nVoice attributes:")
    print(f"  Gender (0=M, 6=F): {results['GEND']}")
    print(f"  Age (0=child, 6=elderly): {results['AGEV']}")
    print(f"  Valence (0=negative, 6=positive): {results['VALN']}")
    print(f"  Tempo (0=very slow, 6=very fast): {results['TEMP']}")

Model Details

Architecture

Voice Experts (57 models):

  • Input: 768-dim standardized Majestrino embedding
  • Architecture: 768 β†’ 256 β†’ ReLU β†’ Dropout(0.3) β†’ 128 β†’ ReLU β†’ Dropout(0.3) β†’ 64 β†’ ReLU β†’ output
  • Huber experts (43): output=1 neuron (regression), predictions rounded and clipped to 0-6
  • CE experts (14): output=7 neurons (classification), argmax to 0-6
  • Parameters: ~197K per expert
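The layer sizes above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the description, not the exact module layout (or state-dict keys) of the released checkpoints; the class name `VoiceExpert` is an assumption.

```python
import torch
import torch.nn as nn

class VoiceExpert(nn.Module):
    """Sketch of the expert MLP described above: 768 -> 256 -> 128 -> 64 -> out."""
    def __init__(self, out_dim: int = 1):  # out_dim=1 for Huber, 7 for CE
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.net(x)

huber_expert = VoiceExpert(out_dim=1)  # regression head (round + clip to 0-6)
ce_expert = VoiceExpert(out_dim=7)     # 7-class head (argmax to 0-6)

emb = torch.randn(1, 768)              # stands in for a standardized embedding
print(huber_expert(emb).shape)         # torch.Size([1, 1])
print(ce_expert(emb).shape)            # torch.Size([1, 7])
```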

Speech Detector:

  • Input: 768-dim standardized Majestrino embedding
  • Architecture: 768 β†’ 13 β†’ ReLU β†’ Dropout(0.5) β†’ 1
  • Loss: BCEWithLogitsLoss
  • Parameters: 10,011
  • Performance: 100% accuracy, F1=1.000 on validation set
  • Training data: 9K speech + 6K non-speech samples
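The detector head is small enough to verify the stated parameter count directly. A minimal sketch of the architecture above (illustrative; not necessarily the exact module layout of the released checkpoint):

```python
import torch
import torch.nn as nn

# 768 -> 13 -> ReLU -> Dropout(0.5) -> 1, trained with BCEWithLogitsLoss
speech_detector = nn.Sequential(
    nn.Linear(768, 13), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(13, 1),
)

# 768*13 + 13 (first layer) + 13*1 + 1 (second layer) = 10,011
n_params = sum(p.numel() for p in speech_detector.parameters())
print(n_params)  # 10011
```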

Training

  • Optimizer: AdamW with weight decay
  • Loss functions: Huber loss (regression), CrossEntropy loss (classification)
  • Data: Annotated samples from Majestrino dataset with human labels
  • Validation: Balanced holdout sets + Gemini Pro 2.0 validation
  • Selection: Best performing loss type (Huber vs CE) chosen per dimension

Performance

Per-Dimension Results

All 57 voice dimension experts with validation accuracies (adjΒ±1: a prediction counts as correct when it is within Β±1 of the ground-truth value):

| Dimension | Name | Type | Balanced Holdout | Gemini Pro | Samples |
|-----------|------|------|------------------|------------|---------|
| AGEV | Perceived Age | Huber | 90.8% | 92.2% | 1,560 |
| AROU | Arousal | Huber | 90.1% | 89.4% | 2,256 |
| ARSH | Arousal Shift | CE | 88.7% | 91.8% | 150 |
| ATCK | Attack | Huber | 87.8% | 93.4% | 1,344 |
| BKGN | Background Noise | Huber | 86.0% | 72.9% | 270 |
| BRGT | Brightness | CE | 68.2% | 64.3% | 108 |
| CHNK | Chunking | Huber | 81.0% | 89.1% | 456 |
| CLRT | Articulation Clarity | CE | 83.2% | 91.1% | 765 |
| COGL | Cognitive Load | Huber | 57.7% | 71.7% | 912 |
| DARC | Dynamic Arc | Huber | 62.4% | 73.7% | 144 |
| DFLU | Disfluency | Huber | 69.7% | 75.9% | 870 |
| EMPH | Emphasis | Huber | 85.8% | 90.5% | 672 |
| ESTH | Esthetics | Huber | 87.3% | 93.7% | 7,455 |
| EXPL | Explicitness | Huber | 88.7% | 90.8% | 372 |
| FOCS | Focus | CE | 83.3% | 80.3% | 1,446 |
| FULL | Fullness | CE | 73.5% | 69.0% | 138 |
| GEND | Perceived Gender | CE | 70.9% | 82.2% | 1,080 |
| HARM | Harmonicity | Huber | 82.9% | 89.3% | 1,190 |
| METL | Metallic Character | Huber | 88.6% | 93.9% | 565 |
| RANG | Pitch Range | CE | 75.5% | 82.8% | 312 |
| RCQL | Recording Quality | Huber | 91.2% | 88.2% | 3,212 |
| REGS | Register | Huber | 78.7% | 81.5% | 570 |
| RESP | Respiration | Huber | 85.1% | 87.4% | 1,164 |
| ROUG | Roughness | Huber | 81.3% | 84.6% | 678 |
| R_CHST | Chest Resonance | Huber | 84.4% | 88.1% | 2,616 |
| R_HEAD | Head Resonance | Huber | 85.0% | 90.2% | 1,848 |
| R_MASK | Mask Resonance | Huber | 82.6% | 91.9% | 1,236 |
| R_MIXD | Mixed Resonance | CE | 77.9% | 87.0% | 390 |
| R_NASL | Nasal Resonance | Huber | 86.3% | 84.9% | 168 |
| R_ORAL | Oral Resonance | CE | 77.8% | 87.1% | 126 |
| R_THRT | Throat Resonance | Huber | 82.6% | 85.7% | 798 |
| SMTH | Smoothness | Huber | 80.8% | 90.4% | 5,125 |
| STNC | Stance | Huber | 78.6% | 85.8% | 3,300 |
| STRU | Structure | CE | 84.5% | 91.0% | 1,098 |
| S_ASMR | ASMR Style | Huber | 86.8% | 93.6% | 2,130 |
| S_AUTH | Authoritative Style | Huber | 81.8% | 89.1% | 2,712 |
| S_CART | Cartoonish Style | Huber | 82.2% | 86.6% | 1,674 |
| S_CASU | Casual Style | Huber | 78.5% | 85.6% | 768 |
| S_CONV | Conversational Style | Huber | 83.3% | 82.9% | 1,110 |
| S_DRAM | Dramatic Style | Huber | 77.6% | 88.0% | 2,502 |
| S_FORM | Formal Style | Huber | 92.1% | 91.1% | 1,505 |
| S_MONO | Monologue Style | Huber | 68.0% | 70.0% | 18,990 |
| S_NARR | Narrator Style | Huber | 88.1% | 80.2% | 4,695 |
| S_NEWS | Newsreader Style | Huber | 90.9% | 68.6% | 8,270 |
| S_PLAY | Playful Style | Huber | 83.6% | 84.0% | 5,700 |
| S_RANT | Ranting/Angry Style | CE | 82.9% | 86.5% | 7,230 |
| S_STRY | Storytelling Style | Huber | 86.8% | 84.0% | 3,725 |
| S_TECH | Teacher/Didactic Style | Huber | 91.5% | 71.5% | 4,785 |
| S_WHIS | Whisper Style | Huber | 90.0% | 89.2% | 744 |
| TEMP | Tempo | Huber | 78.3% | 75.7% | 246 |
| TENS | Tension | Huber | 86.8% | 87.2% | 1,644 |
| VALN | Valence | Huber | 86.3% | 90.5% | 9,648 |
| VALS | Valence Shift | Huber | 49.7% | 32.6% | 80 |
| VFLX | Velocity Flux | CE | 92.9% | 94.3% | 30 |
| VOLT | Volatility | Huber | 71.0% | 87.6% | 348 |
| VULN | Vulnerability | CE | 78.3% | 82.9% | 1,734 |
| WARM | Warmth | Huber | 82.2% | 88.5% | 1,092 |

Summary Statistics:

  • 43 Huber experts, 14 CE experts
  • Mean balanced adjΒ±1 accuracy: 81.6%
  • Best performers: VFLX (92.9%), S_FORM (92.1%), RCQL (91.2%)
  • Most training data: S_MONO (18,990), VALN (9,648), S_NEWS (8,270)
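The adjΒ±1 metric used throughout the table can be sketched in a few lines of NumPy. The plain version below is exact; the "balanced" variant reflects our reading of the reported numbers as a per-class average, which is an assumption about how the holdout scores were computed.

```python
import numpy as np

def adj1_accuracy(preds, labels):
    """Fraction of predictions within +/-1 of the 0-6 ground-truth value."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(np.abs(preds - labels) <= 1))

def balanced_adj1_accuracy(preds, labels):
    """Per-class adj+/-1 accuracy, averaged over the classes present
    (assumed interpretation of the 'balanced' scores in the table)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    per_class = [np.mean(np.abs(preds[labels == c] - c) <= 1)
                 for c in np.unique(labels)]
    return float(np.mean(per_class))

# Toy data, not real model outputs
print(adj1_accuracy([3, 5, 0, 6], [4, 5, 2, 6]))           # 0.75
print(balanced_adj1_accuracy([1, 3, 6], [0, 0, 6]))        # 0.75
```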

Files

experts/
β”œβ”€β”€ AGEV_huber.pt          # Perceived Age expert
β”œβ”€β”€ AROU_huber.pt          # Arousal expert
β”œβ”€β”€ ARSH_ce.pt             # Arousal Shift expert
β”œβ”€β”€ ... (54 more experts)
β”œβ”€β”€ WARM_huber.pt          # Warmth expert
speech_detector_best.pt    # Binary speech classifier
expert_standardization.npz # Mean/std for expert inputs
speech_standardization.npz # Mean/std for speech detector
inference.py               # Complete inference script

Dimension Descriptions

Core Attributes

  • AGEV: Perceived age (0=child, 6=elderly)
  • GEND: Perceived gender (0=masculine, 6=feminine)
  • TEMP: Speaking tempo (0=very slow, 6=very fast)

Emotional Dimensions

  • VALN: Valence/sentiment (0=negative, 6=positive)
  • AROU: Arousal/energy (0=low, 6=high)
  • VALS: Valence shift over time
  • ARSH: Arousal shift over time
  • VULN: Vulnerability (0=guarded, 6=vulnerable)
  • WARM: Warmth (0=cold, 6=warm)
  • TENS: Tension (0=relaxed, 6=tense)

Vocal Quality

  • HARM: Harmonicity (0=noisy, 6=pure tone)
  • ROUG: Roughness/hoarseness
  • METL: Metallic character
  • BRGT: Brightness
  • FULL: Fullness/richness
  • SMTH: Smoothness
  • ATCK: Attack/onset sharpness

Resonance (R_*)

  • R_CHST: Chest resonance
  • R_HEAD: Head resonance
  • R_MASK: Mask resonance
  • R_NASL: Nasal resonance
  • R_ORAL: Oral resonance
  • R_THRT: Throat resonance
  • R_MIXD: Mixed resonance

Delivery Style

  • CLRT: Articulation clarity
  • EMPH: Emphasis/stress patterns
  • EXPL: Explicitness/directness
  • FOCS: Focus/concentration
  • STNC: Stance/attitude
  • STRU: Structural organization
  • CHNK: Chunking/phrasing
  • DFLU: Disfluency (stutters, fillers)
  • RESP: Respiration audibility

Speaking Styles (S_*)

  • S_MONO: Monologue style
  • S_CONV: Conversational style
  • S_NARR: Narrator style
  • S_STRY: Storytelling style
  • S_NEWS: Newsreader style
  • S_TECH: Teacher/didactic style
  • S_FORM: Formal style
  • S_CASU: Casual style
  • S_DRAM: Dramatic style
  • S_AUTH: Authoritative style
  • S_PLAY: Playful style
  • S_RANT: Ranting/angry style
  • S_WHIS: Whisper style
  • S_CART: Cartoonish style
  • S_ASMR: ASMR style

Technical/Context

  • RCQL: Recording quality
  • BKGN: Background noise
  • COGL: Cognitive load
  • DARC: Dynamic arc (progression)
  • RANG: Pitch range
  • REGS: Register (chest/head voice)
  • VFLX: Velocity flux (rhythm variation)
  • VOLT: Volatility (unpredictability)
  • ESTH: Overall esthetics/pleasantness

Usage Notes

  1. Input Requirements: Audio must be converted to Majestrino-1.00 embeddings first using the base encoder
  2. Standardization: Always apply mean/std normalization before inference
  3. Speech Detection: Run speech detector first to filter non-speech audio
  4. Expert Types: CE experts use argmax, Huber experts use round+clip
  5. Scale: All dimensions output 0-6 integer values
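Notes 2, 4, and 5 can be folded into one small helper. This is a sketch only; the function and argument names are illustrative, and the actual file layout is shown in the Quick Start above.

```python
import torch

def predict_dimension(embedding, expert, mean, std, expert_type):
    """Standardize a [1, 768] embedding and decode one expert's output.

    expert_type: 'ce'    -> argmax over 7 logits
                 'huber' -> round, then clip to the 0-6 scale
    Argument names here are illustrative, not part of the released API.
    """
    x = (embedding - mean) / std          # per-feature standardization
    with torch.no_grad():
        out = expert(x)
    if expert_type == 'ce':
        return int(torch.argmax(out, dim=-1).item())
    return int(torch.clamp(torch.round(out), 0, 6).item())
```

Either way, the result is always an integer on the 0-6 scale, so downstream code can treat CE and Huber experts identically.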

Citation

@misc{majestrino-voice-experts,
  title={Majestrino-1.00 Voice Experts: Interpretable Voice Attribute Prediction},
  author={LAION},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/laion/Majestrino-1.00-voice-experts}}
}

License

MIT License - see LICENSE file for details.
