# Majestrino 1.00
Majestrino 1.00 is a state-of-the-art contrastive audio-text model developed by LAION. It maps audio and text into a shared 768-dimensional latent space, enabling tasks such as audio retrieval, zero-shot classification, and similarity search.
Compared to version 0.11 and other earlier prototypes, Majestrino 1.00 represents a massive scale-up in training data volume, variety, and annotation density.
## Model Details

### High-Level Overview

- Input: raw audio (WAV/MP3/FLAC) or text captions.
- Output: a normalized 768-dimensional vector (embedding).
- Architecture: dual-encoder (CLAP) architecture.
- Model Size: ~914 MB (FP32).
### Architecture

The model uses a dual-encoder setup to align audio and text:

- Audio Encoder: `openai/whisper-small` (Transformer-based).
  - Selected for its high efficiency and strong semantic understanding of speech and non-speech sounds.
  - Weights: initialized from OpenAI, fine-tuned on the contrastive task.
- Text Encoder: `Alibaba-NLP/gte-base-en-v1.5` (frozen/fine-tuned adaptation).
- Projection Head: a non-linear MLP mapping each encoder's output to the shared 768-dim space.
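Because both projection heads emit L2-normalized vectors in the same space, cross-modal similarity reduces to a dot product. A minimal NumPy sketch (the random vectors below are stand-ins for real Majestrino embeddings):

```python
import numpy as np

def cosine_scores(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Pairwise similarity between unit-norm embeddings.

    audio_emb: [N, 768], text_emb: [M, 768]. Because both sides are
    L2-normalized, the dot product equals cosine similarity.
    Returns an [N, M] score matrix with values in [-1, 1].
    """
    return audio_emb @ text_emb.T

# Stand-in unit vectors (a real pipeline would use the two encoders):
rng = np.random.default_rng(0)
a = rng.standard_normal((2, 768))
a /= np.linalg.norm(a, axis=1, keepdims=True)
t = rng.standard_normal((3, 768))
t /= np.linalg.norm(t, axis=1, keepdims=True)

scores = cosine_scores(a, t)  # shape (2, 3)
```

The same score matrix drives retrieval (argmax over rows or columns) and zero-shot classification (texts as class prompts).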
### Training Data & Strategy
- Dataset Size: ~11 Million audio-text pairs (scaled up from 7M in v0.11).
- Batch Size: Global batch size of 4,096, allowing for stable contrastive convergence.
- Annotations: The training data utilizes a rich mixture of synthetic and organic captions, specifically annotated for:
- Emotion & Sentiment (e.g., "angry shouting," "melancholic whisper").
- Timbre & Texture (e.g., "grainy," "reverberant," "metallic").
- Speaking Style (e.g., "fast-paced," "stuttering," "broadcast quality").
- Vocal Bursts (e.g., laughter, sighs, breathing).
- Talking Pace (CPS/WPM alignment).
- Recording Quality (MOS).
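The loss is not spelled out above beyond "contrastive"; the standard symmetric InfoNCE objective used by CLIP/CLAP-style dual encoders looks roughly like the following sketch (the temperature value is illustrative, not the shipped `logit_scale`):

```python
import numpy as np

def symmetric_info_nce(audio: np.ndarray, text: np.ndarray, scale: float = 20.0) -> float:
    """Symmetric InfoNCE over unit-norm paired embeddings [B, D].

    Pair i is (audio[i], text[i]); every other row in the batch is a
    negative. This is the generic CLAP-style objective, not necessarily
    Majestrino's exact training recipe.
    """
    logits = scale * audio @ text.T              # [B, B] similarity logits
    idx = np.arange(len(logits))

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[idx, idx].mean())     # -log p(correct pair)

    # Average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Large batches (here 4,096) matter because every in-batch row serves as a negative, which is why the model card highlights the global batch size.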
## Important: Weight Loading

The `model.safetensors` file in this repo uses the following key naming convention:

| Component | Key prefix in safetensors | Example key |
|---|---|---|
| Audio encoder | `audio_encoder.*` | `audio_encoder.layers.0.self_attn.q_proj.weight` |
| Projection head | `audio_proj.*` | `audio_proj.0.weight` |
| Text encoder | `text_model.*` | `text_model.encoder.layer.0.attention.qkv_proj.weight` |
| Contrastive scale | `logit_scale` | `logit_scale` |
Your model class must use `self.audio_encoder` and `self.audio_proj` as attribute names so that `load_state_dict` can match the keys correctly. If you use different names (e.g. `self.projector`), the trained weights will be silently skipped when using `strict=False`, and the projection head will remain randomly initialized, producing meaningless embeddings.
Always verify that the projection weights were actually loaded:

```python
result = model.load_state_dict(state_dict, strict=False)

# Expected: whisper.* keys in missing_keys (the decoder is unused, and
# whisper.encoder.* aliases audio_encoder.*), plus text_model.* and
# logit_scale in unexpected_keys (not needed for audio inference).
# But audio_proj MUST NOT appear in unexpected_keys:
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "Projection head weights were NOT loaded! Check your attribute names."
```
## Usage

### 1. Installation

Install the required libraries. `torchaudio` and `transformers` are essential.

```shell
pip install torch torchaudio transformers safetensors huggingface_hub
```
### 2. Single Inference Example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# --- Configuration ---
REPO_ID = "laion/Majestrino-1.00"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- Model Definition ---
class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768

        # IMPORTANT: The attribute MUST be named 'audio_proj' to match
        # the key names in model.safetensors (audio_proj.0.weight, etc.)
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        """
        Args:
            features: Mel spectrogram from WhisperFeatureExtractor [batch, 80, 3000]
        Returns:
            L2-normalized embedding [batch, 768]
        """
        out = self.audio_encoder(features).last_hidden_state  # [B, 1500, 768]
        out = out.mean(dim=1)                                 # [B, 768]
        return F.normalize(self.audio_proj(out), p=2, dim=1)  # [B, 768]

# --- Load Model ---
print("Loading Majestrino 1.00...")
model = MajestrinoCLAP()
weights_path = hf_hub_download(REPO_ID, "model.safetensors")
state_dict = load_file(weights_path)
result = model.load_state_dict(state_dict, strict=False)

# Verify: projection head must be loaded
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "audio_proj weights were not loaded!"
print("Model loaded successfully: encoder and projection head OK.")

model.to(DEVICE).eval()

# Audio processor (must match the Whisper backbone)
processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# --- Inference ---
def get_embedding(audio_path):
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(sr, 16000)(wav)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(DEVICE)
    with torch.no_grad():
        embedding = model.encode_audio(input_features)
    return embedding  # shape: [1, 768], unit norm

# Example:
# emb = get_embedding("my_audio.wav")
# print(emb.shape)  # torch.Size([1, 768])
```
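Since `get_embedding` returns unit-norm vectors, ranking a library of clips against a query is a plain dot-product sort. A small helper, assuming a dict of pre-computed `[1, 768]` embeddings (works with torch tensors or NumPy arrays alike):

```python
def rank_by_similarity(query_emb, library):
    """Rank library items by cosine similarity to the query.

    query_emb: [1, 768] unit-norm embedding.
    library:   {path: [1, 768] unit-norm embedding}.
    Returns [(path, score), ...] sorted best-first.
    """
    scores = {path: float((query_emb @ emb.T).item()) for path, emb in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage (paths are placeholders):
# lib = {p: get_embedding(p) for p in ["dog_bark.wav", "rain.wav"]}
# print(rank_by_similarity(get_embedding("query.wav"), lib))
```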
### 3. High-Efficiency Batch Annotation (Multi-GPU)

This script is designed for massive datasets. It automatically detects all available GPUs, spawns a dedicated worker for each, and processes audio files in parallel. It uses atomic JSON writes to prevent data corruption.

Features:

- Auto-Scaling: uses all GPUs (rank 0 to N-1).
- Resume Capability: skips files that already have a `.json` with the Majestrino key.
- Atomic Writes: prevents crashes from corrupting JSON files.
- Memory Management: explicit garbage collection and threaded I/O.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Majestrino 1.00 - Mass Embedding Generator
==========================================
Efficiently generates audio embeddings using all available GPUs.
Scans directories, processes audio via the Whisper-Small based CLAP,
and saves results to .json files.
"""
import gc
import json
import os
import sys
import uuid
import warnings
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tqdm import tqdm
from transformers import WhisperFeatureExtractor, WhisperModel

# Filter warnings for cleaner output
warnings.filterwarnings("ignore")

# =======================
# --- CONFIGURATION ---
# =======================

# Model Identity
REPO_ID = "laion/Majestrino-1.00"
WHISPER_BACKBONE = "openai/whisper-small"

# Directories to Scan (add your paths here)
DATA_ROOTS = [
    "./my_dataset_folder",
    "/mnt/data/audio_collection",
]

# Output Settings
TARGET_JSON_KEY = "majestrino_1_0_clap"
BATCH_SIZE = 64
MAX_AUDIO_SEC = 30.0
TARGET_SR = 16000
NUM_IO_WORKERS = 8

# =======================
# --- MODEL CLASS ---
# =======================

class MajestrinoCLAP(nn.Module):
    """
    Majestrino 1.00 architecture.
    Backbone: OpenAI Whisper-Small encoder
    Head: 2-layer MLP projection (audio_proj)
    """
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained(WHISPER_BACKBONE)
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768

        # IMPORTANT: must be named audio_proj to match the safetensors keys
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.audio_proj(out), p=2, dim=1)

# =======================
# --- FILE HANDLING ---
# =======================

def atomic_json_update(audio_path, embedding):
    """Merge the embedding into the sidecar .json via an atomic rename."""
    json_path = os.path.splitext(audio_path)[0] + ".json"
    dir_name = os.path.dirname(json_path)
    temp_name = f".{os.path.basename(json_path)}.{uuid.uuid4().hex}.tmp"
    temp_path = os.path.join(dir_name, temp_name)

    data = {}
    if os.path.exists(json_path):
        try:
            with open(json_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
        except Exception:
            data = {}

    data[TARGET_JSON_KEY] = embedding.tolist()
    try:
        with open(temp_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)
        os.replace(temp_path, json_path)  # atomic on POSIX
        return True
    except Exception as e:
        print(f"Error writing {json_path}: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        return False

def check_processed(audio_path):
    """True if the sidecar .json already contains the Majestrino key."""
    json_path = os.path.splitext(audio_path)[0] + ".json"
    if not os.path.exists(json_path):
        return False
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            return TARGET_JSON_KEY in json.load(f)
    except Exception:
        return False

def load_audio_tensor(file_path):
    """Load, resample to 16 kHz mono, and pad/trim to MAX_AUDIO_SEC."""
    target_len = int(MAX_AUDIO_SEC * TARGET_SR)
    try:
        wav, sr = torchaudio.load(file_path)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(sr, TARGET_SR)(wav)
        if wav.shape[0] > 1:
            wav = wav.mean(dim=0, keepdim=True)
        wav = wav.squeeze()
        if wav.numel() < target_len:
            wav = F.pad(wav, (0, target_len - wav.numel()))
        elif wav.numel() > target_len:
            wav = wav[:target_len]
        return wav.numpy()
    except Exception:
        return None

# =======================
# --- WORKER LOGIC ---
# =======================

def gpu_worker(rank, file_chunk):
    device_id = f"cuda:{rank}"
    torch.cuda.set_device(device_id)
    device = torch.device(device_id)
    print(f"[GPU {rank}] Initializing Majestrino 1.00 on {device_id}...")

    try:
        weights_path = hf_hub_download(REPO_ID, "model.safetensors")
        model = MajestrinoCLAP()
        state = load_file(weights_path)
        result = model.load_state_dict(state, strict=False)
        # Verify the projection head loaded correctly
        assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
            f"[GPU {rank}] audio_proj weights were not loaded!"
        model.to(device).eval()
        processor = WhisperFeatureExtractor.from_pretrained(WHISPER_BACKBONE)
    except Exception as e:
        print(f"[GPU {rank}] Critical error loading model: {e}")
        return

    total = len(file_chunk)
    batches = [file_chunk[i:i + BATCH_SIZE] for i in range(0, total, BATCH_SIZE)]
    pbar = tqdm(total=total, desc=f"GPU {rank}", position=rank, leave=True)

    with ThreadPoolExecutor(max_workers=NUM_IO_WORKERS) as pool:
        for batch_files in batches:
            already_done = list(pool.map(check_processed, batch_files))
            todo_files = [f for f, done in zip(batch_files, already_done) if not done]
            skipped_count = len(batch_files) - len(todo_files)
            if skipped_count > 0:
                pbar.update(skipped_count)
            if not todo_files:
                continue

            audio_data = list(pool.map(load_audio_tensor, todo_files))
            valid_tensors, valid_paths = [], []
            for path, wav in zip(todo_files, audio_data):
                if wav is not None:
                    valid_tensors.append(wav)
                    valid_paths.append(path)
                else:
                    pbar.update(1)  # count unreadable files as handled
            if not valid_tensors:
                continue

            try:
                inputs = processor(valid_tensors, sampling_rate=TARGET_SR, return_tensors="pt")
                input_features = inputs.input_features.to(device)
                with torch.no_grad():
                    embeddings = model.encode_audio(input_features)
                embeddings_np = embeddings.cpu().numpy()

                write_futures = [
                    pool.submit(atomic_json_update, path, emb)
                    for path, emb in zip(valid_paths, embeddings_np)
                ]
                for f in write_futures:
                    f.result()
                pbar.update(len(valid_paths))
            except RuntimeError as e:
                if "out of memory" in str(e):
                    print(f"[GPU {rank}] OOM error. Clearing cache.")
                    torch.cuda.empty_cache()
                else:
                    print(f"[GPU {rank}] Error: {e}")
            finally:
                # Drop GPU references between batches (assignment is safe
                # even if the names were never bound this iteration)
                input_features = embeddings = None
                gc.collect()

# =======================
# --- MAIN ENTRY ---
# =======================

def scan_worker(root):
    files = []
    valid_exts = ('.wav', '.mp3', '.flac', '.ogg', '.m4a', '.opus')
    try:
        for dirpath, _, filenames in os.walk(root):
            for f in filenames:
                if f.lower().endswith(valid_exts):
                    files.append(os.path.join(dirpath, f))
    except Exception:
        pass
    return files

def main_wrapper(rank, chunks):
    gpu_worker(rank, chunks[rank])

if __name__ == "__main__":
    mp.set_start_method('spawn', force=True)
    if not torch.cuda.is_available():
        print("No GPU detected.")
        sys.exit(1)

    num_gpus = torch.cuda.device_count()
    print("--- Majestrino 1.00 Annotation Tool ---")
    print(f"GPUs Available: {num_gpus}")

    print("Scanning directories...")
    all_files = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for r in pool.map(scan_worker, DATA_ROOTS):
            all_files.extend(r)
    all_files = list(set(all_files))

    total_files = len(all_files)
    print(f"Found {total_files} audio files.")
    if total_files == 0:
        sys.exit(0)

    np.random.shuffle(all_files)
    chunks = [c.tolist() for c in np.array_split(all_files, num_gpus)]

    print("Launching workers...")
    mp.spawn(main_wrapper, args=(chunks,), nprocs=num_gpus, join=True)
    print("All processing finished.")
```
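After a run completes, every processed clip has a sidecar `.json` holding its embedding under `TARGET_JSON_KEY`. A sketch for loading those sidecars back and querying nearest neighbours (assumes the file layout produced by the script above):

```python
import json
import os
import numpy as np

TARGET_JSON_KEY = "majestrino_1_0_clap"  # must match the generator config

def load_embeddings(root):
    """Collect {file stem: [768] embedding} from sidecar .json files."""
    embs = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".json"):
                continue
            with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                data = json.load(fh)
            if TARGET_JSON_KEY in data:
                stem = os.path.splitext(name)[0]
                embs[stem] = np.asarray(data[TARGET_JSON_KEY], dtype=np.float32)
    return embs

def nearest(query_stem, embs, k=5):
    """Top-k most similar items (dot product == cosine for unit norms)."""
    q = embs[query_stem]
    scored = [(s, float(q @ e)) for s, e in embs.items() if s != query_stem]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]
```

For libraries beyond a few hundred thousand clips, an approximate-nearest-neighbour index would replace the linear scan, but the stored JSON format stays the same.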
## Troubleshooting

### "My embeddings don't match / cosine similarity is ~0"

This almost always means the projection head weights were not loaded. The most common cause is naming the projection attribute `self.projector` instead of `self.audio_proj`. Because the code uses `strict=False`, PyTorch silently skips mismatched keys and the projector keeps its random initialization.
How to check:

```python
state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)

# Print unexpected keys (keys in safetensors that didn't match any model parameter)
print("Unexpected:", [k for k in result.unexpected_keys
                      if "text_model" not in k and k != "logit_scale"])
# If you see audio_proj.* here, the projection weights were NOT loaded.

# Print missing keys (model parameters not found in safetensors)
print("Missing:", [k for k in result.missing_keys if "decoder" not in k])
# If you see projector.* here, you used the wrong attribute name.
```
The fix: rename `self.projector` to `self.audio_proj` in your model class.
### Expected load_state_dict behavior

When loading correctly, you should see:

| Category | What appears | Why |
|---|---|---|
| Loaded successfully | `audio_encoder.*`, `audio_proj.*` | Encoder and projection head matched |
| Expected missing | `whisper.encoder.*`, `whisper.decoder.*` | `whisper.encoder` is an alias for `audio_encoder` (same tensors, already loaded); the decoder is unused for audio-only inference |
| Expected unexpected | `text_model.*`, `logit_scale` | Text encoder and contrastive scale are not needed for audio embedding extraction |
### State dict key reference

The `model.safetensors` file contains these parameter groups:

```text
audio_encoder.conv1.weight            [768, 80, 3]
audio_encoder.conv2.weight            [768, 768, 3]
audio_encoder.embed_positions.weight  [1500, 768]
audio_encoder.layers.{0-11}.*         (12 transformer layers)
audio_encoder.layer_norm.*
audio_proj.0.weight                   [2048, 768]   # Linear: 768 -> 2048
audio_proj.0.bias                     [2048]
audio_proj.2.weight                   [768, 2048]   # Linear: 2048 -> 768
audio_proj.2.bias                     [768]
text_model.*                          (text encoder, not needed for audio)
logit_scale                           (contrastive temperature, not needed for inference)
```
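To audit a checkpoint against this reference before writing a model class, you can read the safetensors header with the standard library alone; the format is an 8-byte little-endian header length followed by a JSON table of dtypes, shapes, and offsets:

```python
import json
import struct

def safetensors_keys(path):
    """List (name, dtype, shape) for every tensor in a .safetensors file.

    Reads only the JSON header, so it is cheap even for large checkpoints.
    """
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
    return sorted(
        (name, info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    )

# Usage:
# for name, dtype, shape in safetensors_keys("model.safetensors"):
#     print(f"{name:50s} {dtype:5s} {shape}")
```

If `audio_proj.0.weight` is absent from the output, you have the wrong checkpoint file, not a naming problem in your model class.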