# Majestrino-1.00 Voice Experts

A collection of 57 voice dimension expert models and one speech detector that predict interpretable voice attributes from Majestrino-1.00 embeddings.

## Overview

This repository contains expert MLP models trained to predict voice attributes across 57 dimensions (age, gender, emotion, speaking style, vocal quality, etc.) on a 0-6 scale, plus a binary speech detector. All experts operate on 768-dimensional Majestrino-1.00 embeddings extracted from audio with the laion/Majestrino-1.00 encoder (a Whisper-small encoder followed by an MLP projection).

**Key Features:**
- 57 voice dimension experts, ~197K parameters each
- Binary speech detector (10K parameters, F1=1.000)
- Mean balanced adj±1 accuracy: 81.6%
- Complete inference pipeline with standardization
- Production-ready with minimal dependencies
## Quick Start
```python
import torch
import numpy as np
import torchaudio
from transformers import AutoModel, AutoFeatureExtractor
from huggingface_hub import snapshot_download

# Download all expert files
repo_path = snapshot_download(repo_id="laion/Majestrino-1.00-voice-experts")

# Load the Majestrino encoder
encoder = AutoModel.from_pretrained("laion/Majestrino-1.00", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("laion/Majestrino-1.00")
encoder.eval()

# Load the audio file (mixed down to mono, resampled to 16 kHz)
audio_path = "your_audio.wav"
waveform, sr = torchaudio.load(audio_path)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Extract the Majestrino embedding (mean-pooled, L2-normalized)
with torch.no_grad():
    inputs = feature_extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    embedding = encoder(**inputs).last_hidden_state.mean(dim=1)  # [1, 768]
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)

# Load the speech detector and its standardization statistics
speech_stats = np.load(f"{repo_path}/speech_standardization.npz")
speech_mean = torch.tensor(speech_stats['mean'])
speech_std = torch.tensor(speech_stats['std'])
speech_detector = torch.load(f"{repo_path}/speech_detector_best.pt", map_location='cpu')
speech_detector.eval()

# Check whether the audio contains speech
emb_std = (embedding - speech_mean) / speech_std
with torch.no_grad():
    is_speech = torch.sigmoid(speech_detector(emb_std)).item() > 0.5
print(f"Contains speech: {is_speech}")

if is_speech:
    # Load expert standardization statistics
    expert_stats = np.load(f"{repo_path}/expert_standardization.npz")
    expert_mean = torch.tensor(expert_stats['mean'])
    expert_std = torch.tensor(expert_stats['std'])

    # Load a few example experts
    experts = {
        'GEND': torch.load(f"{repo_path}/experts/GEND_ce.pt", map_location='cpu'),
        'AGEV': torch.load(f"{repo_path}/experts/AGEV_huber.pt", map_location='cpu'),
        'VALN': torch.load(f"{repo_path}/experts/VALN_huber.pt", map_location='cpu'),
        'TEMP': torch.load(f"{repo_path}/experts/TEMP_huber.pt", map_location='cpu'),
    }

    # Standardize the embedding for the experts
    emb_std = (embedding - expert_mean) / expert_std

    # Run inference
    results = {}
    with torch.no_grad():
        for dim, model in experts.items():
            model.eval()
            output = model(emb_std)
            if dim == 'GEND':  # CE expert: argmax over 7 classes
                pred = torch.argmax(output, dim=-1).item()
            else:  # Huber expert: round and clip to the 0-6 scale
                pred = torch.clamp(torch.round(output), 0, 6).item()
            results[dim] = int(pred)

    print("\nVoice attributes:")
    print(f"  Gender (0=M, 6=F): {results['GEND']}")
    print(f"  Age (0=child, 6=elderly): {results['AGEV']}")
    print(f"  Valence (0=negative, 6=positive): {results['VALN']}")
    print(f"  Tempo (0=very slow, 6=very fast): {results['TEMP']}")
```
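Beyond the four examples above, all 57 experts can be loaded generically, since each checkpoint filename encodes its dimension code and loss type (`GEND_ce.pt`, `AGEV_huber.pt`, ...). Here is a minimal sketch, assuming the filename convention shown in the Files section; the directory scan only runs once the snapshot exists locally:

```python
from pathlib import Path

def parse_expert_filename(name: str) -> tuple:
    """Split an expert filename like 'AGEV_huber.pt' into (dimension, loss_type)."""
    dim, loss = Path(name).stem.rsplit("_", 1)  # rsplit keeps codes like 'R_MASK' intact
    return dim, loss

print(parse_expert_filename("GEND_ce.pt"))       # ('GEND', 'ce')
print(parse_expert_filename("R_MASK_huber.pt"))  # ('R_MASK', 'huber')

# Load every expert in one pass (e.g. Path(repo_path) / "experts" after snapshot_download)
expert_dir = Path("experts")
if expert_dir.is_dir():
    import torch
    experts = {}
    for ckpt in sorted(expert_dir.glob("*.pt")):
        dim, loss = parse_expert_filename(ckpt.name)
        experts[dim] = (torch.load(ckpt, map_location="cpu"), loss)
```

The loss-type suffix tells you at load time whether to decode the expert's output with argmax (CE) or round-and-clip (Huber).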
## Model Details

### Architecture

**Voice Experts (57 models):**
- Input: 768-dim standardized Majestrino embedding
- Architecture: 768 → 256 → ReLU → Dropout(0.3) → 128 → ReLU → Dropout(0.3) → 64 → ReLU → output
- Huber experts (43): 1 output neuron (regression); predictions rounded and clipped to 0-6
- CE experts (14): 7 output neurons (classification); argmax gives the 0-6 value
- Parameters: ~197K per expert
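As a minimal PyTorch sketch of this head (layer sizes taken from the list above; the exact module layout inside the released `.pt` files may differ):

```python
import torch
import torch.nn as nn

class VoiceExpert(nn.Module):
    """Sketch of an expert head: 768 -> 256 -> 128 -> 64 -> out_dim."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),  # 1 for Huber (regression), 7 for CE (classification)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ce_expert = VoiceExpert(out_dim=7)
ce_expert.eval()
out = ce_expert(torch.randn(1, 768))
print(tuple(out.shape))  # (1, 7)
```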
**Speech Detector:**
- Input: 768-dim standardized Majestrino embedding
- Architecture: 768 → 13 → ReLU → Dropout(0.5) → 1
- Loss: BCEWithLogitsLoss
- Parameters: 10,011
- Performance: 100% accuracy, F1=1.000 on the validation set
- Training data: 9K speech + 6K non-speech samples
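The detector's stated parameter count can be reproduced from the layer sizes above. This is a sketch assuming plain `nn.Linear` layers, not necessarily the released checkpoint's exact module layout:

```python
import torch
import torch.nn as nn

# Sketch of the speech detector head: 768 -> 13 -> ReLU -> Dropout(0.5) -> 1
detector = nn.Sequential(
    nn.Linear(768, 13), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(13, 1),  # single logit; apply sigmoid at inference time
)

# This layout matches the stated parameter count exactly:
# (768*13 + 13) + (13*1 + 1) = 9997 + 14 = 10011
n_params = sum(p.numel() for p in detector.parameters())
print(n_params)  # 10011
```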
### Training
- Optimizer: AdamW with weight decay
- Loss functions: Huber loss (regression), CrossEntropy loss (classification)
- Data: Annotated samples from Majestrino dataset with human labels
- Validation: Balanced holdout sets + Gemini Pro 2.0 validation
- Selection: Best performing loss type (Huber vs CE) chosen per dimension
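The Huber-vs-CE split can be illustrated with a toy training step on synthetic data. This is only a sketch: the actual hyperparameters and weight-decay value are not published here, so `weight_decay=1e-2` is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

# Synthetic stand-ins for standardized embeddings and 0-6 labels
x = torch.randn(32, 768)
y = torch.randint(0, 7, (32,))

# Huber (regression) variant: 1 output, integer labels treated as float targets
reg_head = nn.Linear(768, 1)
opt = torch.optim.AdamW(reg_head.parameters(), weight_decay=1e-2)
loss_reg = nn.HuberLoss()(reg_head(x).squeeze(-1), y.float())

# CE (classification) variant: 7 outputs, integer class targets
cls_head = nn.Linear(768, 7)
loss_cls = nn.CrossEntropyLoss()(cls_head(x), y)

# One optimizer step on the regression head
opt.zero_grad()
loss_reg.backward()
opt.step()
```

Per dimension, both variants were trained and the better-performing one kept, which is why the checkpoint filenames carry a `_huber` or `_ce` suffix.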
## Performance

### Per-Dimension Results

All 57 voice dimension experts with validation accuracies (adj±1 = prediction correct within ±1 of the true value):
| Dimension | Name | Type | Balanced Holdout | Gemini Pro | Samples |
|---|---|---|---|---|---|
| AGEV | Perceived Age | Huber | 90.8% | 92.2% | 1,560 |
| AROU | Arousal | Huber | 90.1% | 89.4% | 2,256 |
| ARSH | Arousal Shift | CE | 88.7% | 91.8% | 150 |
| ATCK | Attack | Huber | 87.8% | 93.4% | 1,344 |
| BKGN | Background Noise | Huber | 86.0% | 72.9% | 270 |
| BRGT | Brightness | CE | 68.2% | 64.3% | 108 |
| CHNK | Chunking | Huber | 81.0% | 89.1% | 456 |
| CLRT | Articulation Clarity | CE | 83.2% | 91.1% | 765 |
| COGL | Cognitive Load | Huber | 57.7% | 71.7% | 912 |
| DARC | Dynamic Arc | Huber | 62.4% | 73.7% | 144 |
| DFLU | Disfluency | Huber | 69.7% | 75.9% | 870 |
| EMPH | Emphasis | Huber | 85.8% | 90.5% | 672 |
| ESTH | Esthetics | Huber | 87.3% | 93.7% | 7,455 |
| EXPL | Explicitness | Huber | 88.7% | 90.8% | 372 |
| FOCS | Focus | CE | 83.3% | 80.3% | 1,446 |
| FULL | Fullness | CE | 73.5% | 69.0% | 138 |
| GEND | Perceived Gender | CE | 70.9% | 82.2% | 1,080 |
| HARM | Harmonicity | Huber | 82.9% | 89.3% | 1,190 |
| METL | Metallic Character | Huber | 88.6% | 93.9% | 565 |
| RANG | Pitch Range | CE | 75.5% | 82.8% | 312 |
| RCQL | Recording Quality | Huber | 91.2% | 88.2% | 3,212 |
| REGS | Register | Huber | 78.7% | 81.5% | 570 |
| RESP | Respiration | Huber | 85.1% | 87.4% | 1,164 |
| ROUG | Roughness | Huber | 81.3% | 84.6% | 678 |
| R_CHST | Chest Resonance | Huber | 84.4% | 88.1% | 2,616 |
| R_HEAD | Head Resonance | Huber | 85.0% | 90.2% | 1,848 |
| R_MASK | Mask Resonance | Huber | 82.6% | 91.9% | 1,236 |
| R_MIXD | Mixed Resonance | CE | 77.9% | 87.0% | 390 |
| R_NASL | Nasal Resonance | Huber | 86.3% | 84.9% | 168 |
| R_ORAL | Oral Resonance | CE | 77.8% | 87.1% | 126 |
| R_THRT | Throat Resonance | Huber | 82.6% | 85.7% | 798 |
| SMTH | Smoothness | Huber | 80.8% | 90.4% | 5,125 |
| STNC | Stance | Huber | 78.6% | 85.8% | 3,300 |
| STRU | Structure | CE | 84.5% | 91.0% | 1,098 |
| S_ASMR | ASMR Style | Huber | 86.8% | 93.6% | 2,130 |
| S_AUTH | Authoritative Style | Huber | 81.8% | 89.1% | 2,712 |
| S_CART | Cartoonish Style | Huber | 82.2% | 86.6% | 1,674 |
| S_CASU | Casual Style | Huber | 78.5% | 85.6% | 768 |
| S_CONV | Conversational Style | Huber | 83.3% | 82.9% | 1,110 |
| S_DRAM | Dramatic Style | Huber | 77.6% | 88.0% | 2,502 |
| S_FORM | Formal Style | Huber | 92.1% | 91.1% | 1,505 |
| S_MONO | Monologue Style | Huber | 68.0% | 70.0% | 18,990 |
| S_NARR | Narrator Style | Huber | 88.1% | 80.2% | 4,695 |
| S_NEWS | Newsreader Style | Huber | 90.9% | 68.6% | 8,270 |
| S_PLAY | Playful Style | Huber | 83.6% | 84.0% | 5,700 |
| S_RANT | Ranting/Angry Style | CE | 82.9% | 86.5% | 7,230 |
| S_STRY | Storytelling Style | Huber | 86.8% | 84.0% | 3,725 |
| S_TECH | Teacher/Didactic Style | Huber | 91.5% | 71.5% | 4,785 |
| S_WHIS | Whisper Style | Huber | 90.0% | 89.2% | 744 |
| TEMP | Tempo | Huber | 78.3% | 75.7% | 246 |
| TENS | Tension | Huber | 86.8% | 87.2% | 1,644 |
| VALN | Valence | Huber | 86.3% | 90.5% | 9,648 |
| VALS | Valence Shift | Huber | 49.7% | 32.6% | 80 |
| VFLX | Velocity Flux | CE | 92.9% | 94.3% | 30 |
| VOLT | Volatility | Huber | 71.0% | 87.6% | 348 |
| VULN | Vulnerability | CE | 78.3% | 82.9% | 1,734 |
| WARM | Warmth | Huber | 82.2% | 88.5% | 1,092 |
**Summary Statistics:**
- 43 Huber experts, 14 CE experts
- Mean balanced adj±1 accuracy: 81.6%
- Best performers: VFLX (92.9%), S_FORM (92.1%), RCQL (91.2%)
- Most training data: S_MONO (18,990), VALN (9,648), S_NEWS (8,270)
## Files

```
experts/
├── AGEV_huber.pt            # Perceived Age expert
├── AROU_huber.pt            # Arousal expert
├── ARSH_ce.pt               # Arousal Shift expert
├── ...                      # (54 more experts)
└── WARM_huber.pt            # Warmth expert
speech_detector_best.pt      # Binary speech classifier
expert_standardization.npz   # Mean/std for expert inputs
speech_standardization.npz   # Mean/std for speech detector inputs
inference.py                 # Complete inference script
```
## Dimension Descriptions

### Core Attributes
- AGEV: Perceived age (0=child, 6=elderly)
- GEND: Perceived gender (0=masculine, 6=feminine)
- TEMP: Speaking tempo (0=very slow, 6=very fast)
### Emotional Dimensions
- VALN: Valence/sentiment (0=negative, 6=positive)
- AROU: Arousal/energy (0=low, 6=high)
- VALS: Valence shift over time
- ARSH: Arousal shift over time
- VULN: Vulnerability (0=guarded, 6=vulnerable)
- WARM: Warmth (0=cold, 6=warm)
- TENS: Tension (0=relaxed, 6=tense)
### Vocal Quality
- HARM: Harmonicity (0=noisy, 6=pure tone)
- ROUG: Roughness/hoarseness
- METL: Metallic character
- BRGT: Brightness
- FULL: Fullness/richness
- SMTH: Smoothness
- ATCK: Attack/onset sharpness
### Resonance (R_*)
- R_CHST: Chest resonance
- R_HEAD: Head resonance
- R_MASK: Mask resonance
- R_NASL: Nasal resonance
- R_ORAL: Oral resonance
- R_THRT: Throat resonance
- R_MIXD: Mixed resonance
### Delivery Style
- CLRT: Articulation clarity
- EMPH: Emphasis/stress patterns
- EXPL: Explicitness/directness
- FOCS: Focus/concentration
- STNC: Stance/attitude
- STRU: Structural organization
- CHNK: Chunking/phrasing
- DFLU: Disfluency (stutters, fillers)
- RESP: Respiration audibility
### Speaking Styles (S_*)
- S_MONO: Monologue style
- S_CONV: Conversational style
- S_NARR: Narrator style
- S_STRY: Storytelling style
- S_NEWS: Newsreader style
- S_TECH: Teacher/didactic style
- S_FORM: Formal style
- S_CASU: Casual style
- S_DRAM: Dramatic style
- S_AUTH: Authoritative style
- S_PLAY: Playful style
- S_RANT: Ranting/angry style
- S_WHIS: Whisper style
- S_CART: Cartoonish style
- S_ASMR: ASMR style
### Technical/Context
- RCQL: Recording quality
- BKGN: Background noise
- COGL: Cognitive load
- DARC: Dynamic arc (progression)
- RANG: Pitch range
- REGS: Register (chest/head voice)
- VFLX: Velocity flux (rhythm variation)
- VOLT: Volatility (unpredictability)
- ESTH: Overall esthetics/pleasantness
## Usage Notes
- Input Requirements: Audio must be converted to Majestrino-1.00 embeddings first using the base encoder
- Standardization: Always apply mean/std normalization before inference
- Speech Detection: Run speech detector first to filter non-speech audio
- Expert Types: CE experts use argmax, Huber experts use round+clip
- Scale: All dimensions output 0-6 integer values
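The two decoding rules in the notes above can be checked on synthetic outputs. This sketch is self-contained and needs no repository files:

```python
import torch

def decode(output: torch.Tensor, expert_type: str) -> int:
    """Map a raw expert output to the 0-6 integer scale."""
    if expert_type == "ce":   # classification head: 7 logits -> argmax
        return int(torch.argmax(output, dim=-1).item())
    else:                     # regression head: round, then clip to [0, 6]
        return int(torch.clamp(torch.round(output), 0, 6).item())

# CE expert: class 5 has the largest logit
ce_logits = torch.tensor([[0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.4]])
print(decode(ce_logits, "ce"))                  # 5

# Huber expert: 5.7 rounds to 6; -0.3 clips to 0
print(decode(torch.tensor([[5.7]]), "huber"))   # 6
print(decode(torch.tensor([[-0.3]]), "huber"))  # 0
```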
## Citation

```bibtex
@misc{majestrino-voice-experts,
  title={Majestrino-1.00 Voice Experts: Interpretable Voice Attribute Prediction},
  author={LAION},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/laion/Majestrino-1.00-voice-experts}}
}
```
## License
MIT License - see LICENSE file for details.
## Links
- Base encoder: laion/Majestrino-1.00
- Dataset: laion/majestrino-data
- Project: LAION Voice SAE