# Majestrino 1.00
Majestrino 1.00 is a state-of-the-art contrastive audio-text model developed by LAION. It maps audio and text into a shared 768-dimensional latent space, enabling tasks such as audio retrieval, zero-shot classification, and similarity search.
Compared to version 0.11 and other earlier prototypes, Majestrino 1.00 represents a massive scale-up in training data volume, variety, and annotation density.
## Model Details

### High-Level Overview

- Input: raw audio (WAV/MP3/FLAC) or text captions.
- Output: a normalized 768-dimensional vector (embedding).
- Architecture: dual-encoder (CLAP) architecture.
- Model Size: ~914 MB (FP32).
### Architecture

The model uses a dual-encoder setup to align audio and text:

- Audio Encoder: `openai/whisper-small` (Transformer-based).
  - Selected for its high efficiency and strong semantic understanding of speech and non-speech sounds.
  - Weights: initialized from OpenAI, fine-tuned on the contrastive task.
- Text Encoder: `Alibaba-NLP/gte-base-en-v1.5` (frozen/fine-tuned adaptation).
- Projection Head: a non-linear MLP mapping each encoder's output to the shared 768-dim space.
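Because both projection heads emit L2-normalized vectors in the same space, cross-modal similarity reduces to a dot product. A minimal NumPy sketch (the random vectors below are stand-ins for real Majestrino embeddings):

```python
import numpy as np

def cosine_scores(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Pairwise similarity between unit-norm embeddings.

    audio_emb: [N, 768], text_emb: [M, 768]. Because both sides are
    L2-normalized, the dot product equals cosine similarity.
    Returns an [N, M] score matrix with values in [-1, 1].
    """
    return audio_emb @ text_emb.T

# Stand-in unit vectors (a real pipeline would use the two encoders):
rng = np.random.default_rng(0)
a = rng.standard_normal((2, 768))
a /= np.linalg.norm(a, axis=1, keepdims=True)
t = rng.standard_normal((3, 768))
t /= np.linalg.norm(t, axis=1, keepdims=True)

scores = cosine_scores(a, t)  # shape (2, 3)
```

The same score matrix drives retrieval (argmax over rows or columns) and zero-shot classification (texts as class prompts).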
### Training Data & Strategy
- Dataset Size: ~11 Million audio-text pairs (scaled up from 7M in v0.11).
- Batch Size: Global batch size of 4,096, allowing for stable contrastive convergence.
- Annotations: The training data utilizes a rich mixture of synthetic and organic captions, specifically annotated for:
- Emotion & Sentiment (e.g., "angry shouting," "melancholic whisper").
- Timbre & Texture (e.g., "grainy," "reverberant," "metallic").
- Speaking Style (e.g., "fast-paced," "stuttering," "broadcast quality").
- Vocal Bursts (e.g., laughter, sighs, breathing).
- Talking Pace (CPS/WPM alignment).
- Recording Quality (MOS).
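The loss is not spelled out above beyond "contrastive"; the standard symmetric InfoNCE objective used by CLIP/CLAP-style dual encoders looks roughly like the following sketch (the temperature value is illustrative, not the shipped `logit_scale`):

```python
import numpy as np

def symmetric_info_nce(audio: np.ndarray, text: np.ndarray, scale: float = 20.0) -> float:
    """Symmetric InfoNCE over unit-norm paired embeddings [B, D].

    Pair i is (audio[i], text[i]); every other row in the batch is a
    negative. This is the generic CLAP-style objective, not necessarily
    Majestrino's exact training recipe.
    """
    logits = scale * audio @ text.T              # [B, B] similarity logits
    idx = np.arange(len(logits))

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[idx, idx].mean())     # -log p(correct pair)

    # Average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Large batches (here 4,096) matter because every in-batch row serves as a negative, which is why the model card highlights the global batch size.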
## Important: Weight Loading

The `model.safetensors` file in this repo uses the following key naming convention:

| Component | Key prefix in safetensors | Example key |
|---|---|---|
| Audio encoder | `audio_encoder.*` | `audio_encoder.layers.0.self_attn.q_proj.weight` |
| Projection head | `audio_proj.*` | `audio_proj.0.weight` |
| Text encoder | `text_model.*` | `text_model.encoder.layer.0.attention.qkv_proj.weight` |
| Contrastive scale | `logit_scale` | `logit_scale` |
Your model class must use `self.audio_encoder` and `self.audio_proj` as attribute names so that `load_state_dict` can match the keys correctly. If you use different names (e.g. `self.projector`), the trained weights will be silently skipped when using `strict=False`, and the projection head will remain randomly initialized, producing meaningless embeddings.
Always verify that the projection weights were actually loaded:

```python
result = model.load_state_dict(state_dict, strict=False)

# Expected: whisper.* keys in missing_keys (the decoder is unused, and
# whisper.encoder.* aliases audio_encoder.*), plus text_model.* and
# logit_scale in unexpected_keys (not needed for audio inference).
# But audio_proj MUST NOT appear in unexpected_keys:
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "Projection head weights were NOT loaded! Check your attribute names."
```
## Usage

### 1. Installation

Install the required libraries. `torchaudio` and `transformers` are essential.

```shell
pip install torch torchaudio transformers safetensors huggingface_hub
```
### 2. Single Inference Example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# --- Configuration ---
REPO_ID = "laion/Majestrino-1.00"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- Model Definition ---
class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768

        # IMPORTANT: The attribute MUST be named 'audio_proj' to match
        # the key names in model.safetensors (audio_proj.0.weight, etc.)
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        """
        Args:
            features: Mel spectrogram from WhisperFeatureExtractor [batch, 80, 3000]
        Returns:
            L2-normalized embedding [batch, 768]
        """
        out = self.audio_encoder(features).last_hidden_state  # [B, 1500, 768]
        out = out.mean(dim=1)                                 # [B, 768]
        return F.normalize(self.audio_proj(out), p=2, dim=1)  # [B, 768]

# --- Load Model ---
print("Loading Majestrino 1.00...")
model = MajestrinoCLAP()
weights_path = hf_hub_download(REPO_ID, "model.safetensors")
state_dict = load_file(weights_path)
result = model.load_state_dict(state_dict, strict=False)

# Verify: projection head must be loaded
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "audio_proj weights were not loaded!"
print("Model loaded successfully: encoder and projection head OK.")

model.to(DEVICE).eval()

# Audio processor (must match the Whisper backbone)
processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# --- Inference ---
def get_embedding(audio_path):
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(sr, 16000)(wav)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(DEVICE)
    with torch.no_grad():
        embedding = model.encode_audio(input_features)
    return embedding  # shape: [1, 768], unit norm

# Example:
# emb = get_embedding("my_audio.wav")
# print(emb.shape)  # torch.Size([1, 768])
```
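Since `get_embedding` returns unit-norm vectors, ranking a library of clips against a query is a plain dot-product sort. A small helper, assuming a dict of pre-computed `[1, 768]` embeddings (works with torch tensors or NumPy arrays alike):

```python
def rank_by_similarity(query_emb, library):
    """Rank library items by cosine similarity to the query.

    query_emb: [1, 768] unit-norm embedding.
    library:   {path: [1, 768] unit-norm embedding}.
    Returns [(path, score), ...] sorted best-first.
    """
    scores = {path: float((query_emb @ emb.T).item()) for path, emb in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage (paths are placeholders):
# lib = {p: get_embedding(p) for p in ["dog_bark.wav", "rain.wav"]}
# print(rank_by_similarity(get_embedding("query.wav"), lib))
```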
### 3. High-Efficiency Batch Annotation (Multi-GPU)

This script is designed for massive datasets. It automatically detects all available GPUs, spawns a dedicated worker for each, and processes audio files in parallel. It uses atomic JSON writes to prevent data corruption.

Features:

- Auto-Scaling: uses all GPUs (rank 0 to N-1).
- Resume Capability: skips files that already have a `.json` with the Majestrino key.
- Atomic Writes: prevents crashes from corrupting JSON files.
- Memory Management: explicit garbage collection and threaded I/O.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Majestrino 1.00 - Mass Embedding Generator
==========================================
Efficiently generates audio embeddings using all available GPUs.
Scans directories, processes audio via the Whisper-Small based CLAP,
and saves results to .json files.
"""
import gc
import json
import os
import sys
import uuid
import warnings
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tqdm import tqdm
from transformers import WhisperFeatureExtractor, WhisperModel

# Filter warnings for cleaner output
warnings.filterwarnings("ignore")

# =======================
# --- CONFIGURATION ---
# =======================

# Model Identity
REPO_ID = "laion/Majestrino-1.00"
WHISPER_BACKBONE = "openai/whisper-small"

# Directories to Scan (add your paths here)
DATA_ROOTS = [
    "./my_dataset_folder",
    "/mnt/data/audio_collection",
]

# Output Settings
TARGET_JSON_KEY = "majestrino_1_0_clap"
BATCH_SIZE = 64
MAX_AUDIO_SEC = 30.0
TARGET_SR = 16000
NUM_IO_WORKERS = 8

# =======================
# --- MODEL CLASS ---
# =======================

class MajestrinoCLAP(nn.Module):
    """
    Majestrino 1.00 architecture.
    Backbone: OpenAI Whisper-Small encoder
    Head: 2-layer MLP projection (audio_proj)
    """
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained(WHISPER_BACKBONE)
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768

        # IMPORTANT: must be named audio_proj to match the safetensors keys
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.audio_proj(out), p=2, dim=1)

# =======================
# --- FILE HANDLING ---
# =======================

def atomic_json_update(audio_path, embedding):
    """Merge the embedding into the sidecar .json via an atomic rename."""
    json_path = os.path.splitext(audio_path)[0] + ".json"
    dir_name = os.path.dirname(json_path)
    temp_name = f".{os.path.basename(json_path)}.{uuid.uuid4().hex}.tmp"
    temp_path = os.path.join(dir_name, temp_name)

    data = {}
    if os.path.exists(json_path):
        try:
            with open(json_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
        except Exception:
            data = {}

    data[TARGET_JSON_KEY] = embedding.tolist()
    try:
        with open(temp_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)
        os.replace(temp_path, json_path)  # atomic on POSIX
        return True
    except Exception as e:
        print(f"Error writing {json_path}: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        return False

def check_processed(audio_path):
    """True if the sidecar .json already contains the Majestrino key."""
    json_path = os.path.splitext(audio_path)[0] + ".json"
    if not os.path.exists(json_path):
        return False
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            return TARGET_JSON_KEY in json.load(f)
    except Exception:
        return False

def load_audio_tensor(file_path):
    """Load, resample to 16 kHz mono, and pad/trim to MAX_AUDIO_SEC."""
    target_len = int(MAX_AUDIO_SEC * TARGET_SR)
    try:
        wav, sr = torchaudio.load(file_path)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(sr, TARGET_SR)(wav)
        if wav.shape[0] > 1:
            wav = wav.mean(dim=0, keepdim=True)
        wav = wav.squeeze()
        if wav.numel() < target_len:
            wav = F.pad(wav, (0, target_len - wav.numel()))
        elif wav.numel() > target_len:
            wav = wav[:target_len]
        return wav.numpy()
    except Exception:
        return None

# =======================
# --- WORKER LOGIC ---
# =======================

def gpu_worker(rank, file_chunk):
    device_id = f"cuda:{rank}"
    torch.cuda.set_device(device_id)
    device = torch.device(device_id)
    print(f"[GPU {rank}] Initializing Majestrino 1.00 on {device_id}...")

    try:
        weights_path = hf_hub_download(REPO_ID, "model.safetensors")
        model = MajestrinoCLAP()
        state = load_file(weights_path)
        result = model.load_state_dict(state, strict=False)
        # Verify the projection head loaded correctly
        assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
            f"[GPU {rank}] audio_proj weights were not loaded!"
        model.to(device).eval()
        processor = WhisperFeatureExtractor.from_pretrained(WHISPER_BACKBONE)
    except Exception as e:
        print(f"[GPU {rank}] Critical error loading model: {e}")
        return

    total = len(file_chunk)
    batches = [file_chunk[i:i + BATCH_SIZE] for i in range(0, total, BATCH_SIZE)]
    pbar = tqdm(total=total, desc=f"GPU {rank}", position=rank, leave=True)

    with ThreadPoolExecutor(max_workers=NUM_IO_WORKERS) as pool:
        for batch_files in batches:
            already_done = list(pool.map(check_processed, batch_files))
            todo_files = [f for f, done in zip(batch_files, already_done) if not done]
            skipped_count = len(batch_files) - len(todo_files)
            if skipped_count > 0:
                pbar.update(skipped_count)
            if not todo_files:
                continue

            audio_data = list(pool.map(load_audio_tensor, todo_files))
            valid_tensors, valid_paths = [], []
            for path, wav in zip(todo_files, audio_data):
                if wav is not None:
                    valid_tensors.append(wav)
                    valid_paths.append(path)
                else:
                    pbar.update(1)  # count unreadable files as handled
            if not valid_tensors:
                continue

            try:
                inputs = processor(valid_tensors, sampling_rate=TARGET_SR, return_tensors="pt")
                input_features = inputs.input_features.to(device)
                with torch.no_grad():
                    embeddings = model.encode_audio(input_features)
                embeddings_np = embeddings.cpu().numpy()

                write_futures = [
                    pool.submit(atomic_json_update, path, emb)
                    for path, emb in zip(valid_paths, embeddings_np)
                ]
                for f in write_futures:
                    f.result()
                pbar.update(len(valid_paths))
            except RuntimeError as e:
                if "out of memory" in str(e):
                    print(f"[GPU {rank}] OOM error. Clearing cache.")
                    torch.cuda.empty_cache()
                else:
                    print(f"[GPU {rank}] Error: {e}")
            finally:
                # Drop GPU references between batches (assignment is safe
                # even if the names were never bound this iteration)
                input_features = embeddings = None
                gc.collect()

# =======================
# --- MAIN ENTRY ---
# =======================

def scan_worker(root):
    files = []
    valid_exts = ('.wav', '.mp3', '.flac', '.ogg', '.m4a', '.opus')
    try:
        for dirpath, _, filenames in os.walk(root):
            for f in filenames:
                if f.lower().endswith(valid_exts):
                    files.append(os.path.join(dirpath, f))
    except Exception:
        pass
    return files

def main_wrapper(rank, chunks):
    gpu_worker(rank, chunks[rank])

if __name__ == "__main__":
    mp.set_start_method('spawn', force=True)
    if not torch.cuda.is_available():
        print("No GPU detected.")
        sys.exit(1)

    num_gpus = torch.cuda.device_count()
    print("--- Majestrino 1.00 Annotation Tool ---")
    print(f"GPUs Available: {num_gpus}")

    print("Scanning directories...")
    all_files = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for r in pool.map(scan_worker, DATA_ROOTS):
            all_files.extend(r)
    all_files = list(set(all_files))

    total_files = len(all_files)
    print(f"Found {total_files} audio files.")
    if total_files == 0:
        sys.exit(0)

    np.random.shuffle(all_files)
    chunks = [c.tolist() for c in np.array_split(all_files, num_gpus)]

    print("Launching workers...")
    mp.spawn(main_wrapper, args=(chunks,), nprocs=num_gpus, join=True)
    print("All processing finished.")
```
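After a run completes, every processed clip has a sidecar `.json` holding its embedding under `TARGET_JSON_KEY`. A sketch for loading those sidecars back and querying nearest neighbours (assumes the file layout produced by the script above):

```python
import json
import os
import numpy as np

TARGET_JSON_KEY = "majestrino_1_0_clap"  # must match the generator config

def load_embeddings(root):
    """Collect {file stem: [768] embedding} from sidecar .json files."""
    embs = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".json"):
                continue
            with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                data = json.load(fh)
            if TARGET_JSON_KEY in data:
                stem = os.path.splitext(name)[0]
                embs[stem] = np.asarray(data[TARGET_JSON_KEY], dtype=np.float32)
    return embs

def nearest(query_stem, embs, k=5):
    """Top-k most similar items (dot product == cosine for unit norms)."""
    q = embs[query_stem]
    scored = [(s, float(q @ e)) for s, e in embs.items() if s != query_stem]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]
```

For libraries beyond a few hundred thousand clips, an approximate-nearest-neighbour index would replace the linear scan, but the stored JSON format stays the same.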
## Troubleshooting

### "My embeddings don't match / cosine similarity is ~0"

This almost always means the projection head weights were not loaded. The most common cause is naming the projection attribute `self.projector` instead of `self.audio_proj`. Because the code uses `strict=False`, PyTorch silently skips mismatched keys and the projector keeps its random initialization.
How to check:

```python
state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)

# Print unexpected keys (keys in safetensors that didn't match any model parameter)
print("Unexpected:", [k for k in result.unexpected_keys
                      if "text_model" not in k and k != "logit_scale"])
# If you see audio_proj.* here, the projection weights were NOT loaded.

# Print missing keys (model parameters not found in safetensors)
print("Missing:", [k for k in result.missing_keys if "decoder" not in k])
# If you see projector.* here, you used the wrong attribute name.
```
The fix: rename `self.projector` to `self.audio_proj` in your model class.
### Expected load_state_dict behavior

When loading correctly, you should see:

| Category | What appears | Why |
|---|---|---|
| Loaded successfully | `audio_encoder.*`, `audio_proj.*` | Encoder and projection head matched |
| Expected missing | `whisper.encoder.*`, `whisper.decoder.*` | `whisper.encoder` is an alias for `audio_encoder` (same tensors, already loaded); the decoder is unused for audio-only inference |
| Expected unexpected | `text_model.*`, `logit_scale` | Text encoder and contrastive scale are not needed for audio embedding extraction |
### State dict key reference

The `model.safetensors` file contains these parameter groups:

```text
audio_encoder.conv1.weight            [768, 80, 3]
audio_encoder.conv2.weight            [768, 768, 3]
audio_encoder.embed_positions.weight  [1500, 768]
audio_encoder.layers.{0-11}.*         (12 transformer layers)
audio_encoder.layer_norm.*
audio_proj.0.weight                   [2048, 768]   # Linear: 768 -> 2048
audio_proj.0.bias                     [2048]
audio_proj.2.weight                   [768, 2048]   # Linear: 2048 -> 768
audio_proj.2.bias                     [768]
text_model.*                          (text encoder, not needed for audio)
logit_scale                           (contrastive temperature, not needed for inference)
```
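To audit a checkpoint against this reference before writing a model class, you can read the safetensors header with the standard library alone; the format is an 8-byte little-endian header length followed by a JSON table of dtypes, shapes, and offsets:

```python
import json
import struct

def safetensors_keys(path):
    """List (name, dtype, shape) for every tensor in a .safetensors file.

    Reads only the JSON header, so it is cheap even for large checkpoints.
    """
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
    return sorted(
        (name, info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    )

# Usage:
# for name, dtype, shape in safetensors_keys("model.safetensors"):
#     print(f"{name:50s} {dtype:5s} {shape}")
```

If `audio_proj.0.weight` is absent from the output, you have the wrong checkpoint file, not a naming problem in your model class.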