Majestrino 1.00

Majestrino 1.00 is a state-of-the-art contrastive audio-text model developed by LAION. It maps audio and text into a shared 768-dimensional latent space, enabling tasks such as audio retrieval, zero-shot classification, and similarity search.

Compared to version 0.11 and other earlier prototypes, Majestrino 1.00 represents a massive scale-up in training data volume, variety, and annotation density.

Model Details

High-Level Overview

  • Input: Raw Audio (WAV/MP3/FLAC) or Text Captions.
  • Output: A normalized 768-dimensional vector (embedding).
  • Architecture: Dual-Encoder (CLAP) architecture.
  • Model Size: ~914 MB (FP32).

Architecture

The model uses a dual-encoder setup to align audio and text:

  1. Audio Encoder: openai/whisper-small (Transformer-based).
    • Selected for its high efficiency and strong semantic understanding of speech and non-speech sounds.
    • Weights: Initialized from OpenAI, fine-tuned on the contrastive task.
  2. Text Encoder: Alibaba-NLP/gte-base-en-v1.5 (Frozen/Fine-tuned adaptation).
  3. Projection Head: A non-linear MLP mapping each encoder's output into the shared 768-dim space.
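
In a dual-encoder setup like this, cross-modal similarity at inference time reduces to a temperature-scaled dot product of the projected, normalized embeddings. A minimal sketch with random stand-in embeddings (the initial temperature value 2.659 is an assumption borrowed from CLIP-style models, not a documented Majestrino value):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings: 4 audio clips and 3 captions, already projected
# into the shared 768-dim space and L2-normalized.
audio_emb = F.normalize(torch.randn(4, 768), dim=-1)
text_emb = F.normalize(torch.randn(3, 768), dim=-1)

# The checkpoint's logit_scale stores a log-temperature; exp() gives the
# multiplier applied to cosine similarities (CLIP/CLAP convention).
logit_scale = torch.tensor(2.659).exp()

logits = logit_scale * audio_emb @ text_emb.T  # [4 audio, 3 text]
probs = logits.softmax(dim=-1)                 # per-clip distribution over captions
print(logits.shape)
```

Zero-shot classification is exactly this: embed one caption per class, then take the argmax over each audio clip's row of `probs`.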

Training Data & Strategy

  • Dataset Size: ~11 Million audio-text pairs (scaled up from 7M in v0.11).
  • Batch Size: Global batch size of 4,096, providing enough in-batch negatives for stable contrastive training.
  • Annotations: The training data utilizes a rich mixture of synthetic and organic captions, specifically annotated for:
    • Emotion & Sentiment (e.g., "angry shouting," "melancholic whisper").
    • Timbre & Texture (e.g., "grainy," "reverberant," "metallic").
    • Speaking Style (e.g., "fast-paced," "stuttering," "broadcast quality").
    • Vocal Bursts (e.g., laughter, sighs, breathing).
    • Talking Pace (CPS/WPM alignment).
    • Recording Quality (MOS).
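
The large global batch matters because each audio clip is contrasted against every other caption in the batch. A sketch of the symmetric InfoNCE objective typical of CLAP-style dual encoders (the exact loss used for Majestrino is not specified here, so treat this as illustrative):

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, logit_scale):
    """Symmetric InfoNCE loss for a CLAP-style dual encoder (sketch).

    audio_emb, text_emb: L2-normalized [batch, dim]; row i of each is a pair.
    """
    logits = logit_scale.exp() * audio_emb @ text_emb.T  # [B, B] similarity matrix
    targets = torch.arange(logits.size(0))               # matching pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)          # audio -> text direction
    loss_t2a = F.cross_entropy(logits.T, targets)        # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

B = 8
a = F.normalize(torch.randn(B, 768), dim=-1)
t = F.normalize(torch.randn(B, 768), dim=-1)
loss = clap_contrastive_loss(a, t, torch.tensor(2.659))
print(float(loss))
```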

Important: Weight Loading

The model.safetensors file in this repo uses the following key naming convention:

| Component         | Key prefix in safetensors | Example key                                          |
|-------------------|---------------------------|------------------------------------------------------|
| Audio encoder     | audio_encoder.*           | audio_encoder.layers.0.self_attn.q_proj.weight       |
| Projection head   | audio_proj.*              | audio_proj.0.weight                                  |
| Text encoder      | text_model.*              | text_model.encoder.layer.0.attention.qkv_proj.weight |
| Contrastive scale | logit_scale               | logit_scale                                          |

Your model class must use self.audio_encoder and self.audio_proj as attribute names so that load_state_dict can match the keys correctly. If you use different names (e.g. self.projector), the trained weights will be silently skipped when loading with strict=False, and the projection head will keep its random initialization, producing meaningless embeddings.

Always verify that the projection weights were actually loaded:

result = model.load_state_dict(state_dict, strict=False)

# Expected missing keys (model parameters absent from the checkpoint):
#   whisper.encoder.* (alias of audio_encoder), whisper.decoder.* (unused)
# Expected unexpected keys (checkpoint tensors unused for audio inference):
#   text_model.*, logit_scale

# But audio_proj MUST NOT appear in unexpected_keys:
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "Projection head weights were NOT loaded! Check your attribute names."

Usage

1. Installation

Install the required libraries. torchaudio and transformers are essential.

pip install torch torchaudio transformers safetensors huggingface_hub

2. Single Inference Example

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# --- Configuration ---
REPO_ID = "laion/Majestrino-1.00"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- Model Definition ---
class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder

        input_dim = self.whisper.config.d_model  # 768

        # IMPORTANT: The attribute MUST be named 'audio_proj' to match
        # the key names in model.safetensors (audio_proj.0.weight, etc.)
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        """
        Args:
            features: Mel spectrogram from WhisperFeatureExtractor [batch, 80, 3000]
        Returns:
            L2-normalized embedding [batch, 768]
        """
        out = self.audio_encoder(features).last_hidden_state  # [B, 1500, 768]
        out = out.mean(dim=1)                                  # [B, 768]
        return F.normalize(self.audio_proj(out), p=2, dim=1)   # [B, 768]

# --- Load Model ---
print("Loading Majestrino 1.00...")
model = MajestrinoCLAP()
weights_path = hf_hub_download(REPO_ID, "model.safetensors")
state_dict = load_file(weights_path)
result = model.load_state_dict(state_dict, strict=False)

# Verify: projection head must be loaded
assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
    "audio_proj weights were not loaded!"
print("Model loaded successfully: encoder and projection head OK.")

model.to(DEVICE).eval()

# Audio processor (must match the Whisper backbone)
processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# --- Inference ---
def get_embedding(audio_path):
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(sr, 16000)(wav)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)

    inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(DEVICE)

    with torch.no_grad():
        embedding = model.encode_audio(input_features)

    return embedding  # shape: [1, 768], unit norm

# Example:
# emb = get_embedding("my_audio.wav")
# print(emb.shape)  # torch.Size([1, 768])
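
Since the embeddings are unit-norm, retrieval over a corpus is a matrix multiply followed by top-k. A sketch with a random stand-in index (in practice, `index` would be the stacked `get_embedding` outputs for your corpus):

```python
import torch
import torch.nn.functional as F

# Stand-in index of 1,000 unit-norm clip embeddings; in practice, stack the
# [1, 768] outputs of get_embedding() into this matrix.
index = F.normalize(torch.randn(1000, 768), dim=-1)
query = F.normalize(torch.randn(1, 768), dim=-1)

scores = (query @ index.T).squeeze(0)  # cosine similarities, shape [1000]
top = torch.topk(scores, k=5)          # the 5 nearest clips, highest first
print(top.indices.tolist())
```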

3. High-Efficiency Batch Annotation (Multi-GPU)

This script is designed for massive datasets. It automatically detects all available GPUs, spawns a dedicated worker for each, and processes audio files in parallel. It handles atomic JSON writes to prevent data corruption.

Features:

  • Auto-Scaling: Uses all GPUs (Rank 0 to N).
  • Resume Capability: Skips files that already have a .json with the Majestrino key.
  • Atomic Writes: Prevents crashes from corrupting JSON files.
  • Memory Management: Explicit garbage collection and threaded I/O.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Majestrino 1.00 - Mass Embedding Generator
==========================================
Efficiently generates audio embeddings using all available GPUs.
Scans directories, processes audio via Whisper-Small based CLAP,
and saves result to .json files.
"""

import os
import sys
import json
import uuid
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
import numpy as np
import warnings
import gc
from concurrent.futures import ThreadPoolExecutor
from transformers import WhisperFeatureExtractor, WhisperModel
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from tqdm import tqdm
import torch.multiprocessing as mp

# Filter warnings for cleaner output
warnings.filterwarnings("ignore")

# =======================
# --- CONFIGURATION ---
# =======================

# Model Identity
REPO_ID = "laion/Majestrino-1.00"
WHISPER_BACKBONE = "openai/whisper-small"

# Directories to Scan (Add your paths here)
DATA_ROOTS = [
    "./my_dataset_folder",
    "/mnt/data/audio_collection",
]

# Output Settings
TARGET_JSON_KEY = "majestrino_1_0_clap"
BATCH_SIZE = 64
MAX_AUDIO_SEC = 30.0
TARGET_SR = 16000
NUM_IO_WORKERS = 8

# =======================
# --- MODEL CLASS ---
# =======================

class MajestrinoCLAP(nn.Module):
    """
    Majestrino 1.00 Architecture.
    Backbone: OpenAI Whisper Small encoder
    Head: 2-Layer MLP Projection (audio_proj)
    """
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained(WHISPER_BACKBONE)
        self.audio_encoder = self.whisper.encoder

        input_dim = self.whisper.config.d_model  # 768
        # IMPORTANT: must be named audio_proj to match safetensors keys
        self.audio_proj = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.audio_proj(out), p=2, dim=1)

# =======================
# --- FILE HANDLING ---
# =======================

def atomic_json_update(audio_path, embedding):
    json_path = os.path.splitext(audio_path)[0] + ".json"
    dir_name = os.path.dirname(json_path)
    temp_name = f".{os.path.basename(json_path)}.{uuid.uuid4().hex}.tmp"
    temp_path = os.path.join(dir_name, temp_name)

    data = {}
    if os.path.exists(json_path):
        try:
            with open(json_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
        except (OSError, json.JSONDecodeError):
            data = {}

    data[TARGET_JSON_KEY] = embedding.tolist()

    try:
        with open(temp_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)
        os.replace(temp_path, json_path)
        return True
    except Exception as e:
        print(f"Error writing {json_path}: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        return False

def check_processed(audio_path):
    json_path = os.path.splitext(audio_path)[0] + ".json"
    if not os.path.exists(json_path): return False
    try:
        with open(json_path, 'r') as f:
            d = json.load(f)
        return TARGET_JSON_KEY in d
    except (OSError, json.JSONDecodeError):
        return False

def load_audio_tensor(file_path):
    target_len = int(MAX_AUDIO_SEC * TARGET_SR)
    try:
        wav, sr = torchaudio.load(file_path)
        if sr != TARGET_SR:
            wav = torchaudio.transforms.Resample(sr, TARGET_SR)(wav)
        if wav.shape[0] > 1:
            wav = wav.mean(dim=0, keepdim=True)
        wav = wav.squeeze()
        if wav.numel() < target_len:
            wav = F.pad(wav, (0, target_len - wav.numel()))
        elif wav.numel() > target_len:
            wav = wav[:target_len]
        return wav.numpy()
    except Exception:
        return None

# =======================
# --- WORKER LOGIC ---
# =======================

def gpu_worker(rank, file_chunk):
    device_id = f"cuda:{rank}"
    torch.cuda.set_device(device_id)
    device = torch.device(device_id)

    print(f"[GPU {rank}] Initializing Majestrino 1.00 on {device_id}...")

    try:
        weights_path = hf_hub_download(REPO_ID, "model.safetensors")
        model = MajestrinoCLAP()
        state = load_file(weights_path)
        result = model.load_state_dict(state, strict=False)

        # Verify projection head loaded correctly
        assert not any(k.startswith("audio_proj") for k in result.unexpected_keys), \
            f"[GPU {rank}] audio_proj weights were not loaded!"

        model.to(device).eval()
        processor = WhisperFeatureExtractor.from_pretrained(WHISPER_BACKBONE)
    except Exception as e:
        print(f"[GPU {rank}] Critical Error loading model: {e}")
        return

    total = len(file_chunk)
    batches = [file_chunk[i:i + BATCH_SIZE] for i in range(0, total, BATCH_SIZE)]
    pbar = tqdm(total=total, desc=f"GPU {rank}", position=rank, leave=True)

    with ThreadPoolExecutor(max_workers=NUM_IO_WORKERS) as pool:
        for batch_files in batches:
            already_done = list(pool.map(check_processed, batch_files))
            todo_files = [f for f, done in zip(batch_files, already_done) if not done]
            skipped_count = len(batch_files) - len(todo_files)
            if skipped_count > 0:
                pbar.update(skipped_count)
            if not todo_files:
                continue

            audio_data = list(pool.map(load_audio_tensor, todo_files))
            valid_tensors = []
            valid_paths = []
            for path, wav in zip(todo_files, audio_data):
                if wav is not None:
                    valid_tensors.append(wav)
                    valid_paths.append(path)
                else:
                    pbar.update(1)
            if not valid_tensors:
                continue

            try:
                inputs = processor(valid_tensors, sampling_rate=TARGET_SR, return_tensors="pt")
                input_features = inputs.input_features.to(device)
                with torch.no_grad():
                    embeddings = model.encode_audio(input_features)
                    embeddings_np = embeddings.cpu().numpy()

                write_futures = [
                    pool.submit(atomic_json_update, path, emb)
                    for path, emb in zip(valid_paths, embeddings_np)
                ]
                for f in write_futures:
                    f.result()
                pbar.update(len(valid_paths))
                # Drop GPU references here, while both names are guaranteed to exist.
                # (After an exception, `embeddings` may be undefined and a bare
                # `del` would raise NameError.)
                del input_features, embeddings
            except RuntimeError as e:
                if "out of memory" in str(e):
                    print(f"[GPU {rank}] OOM Error. Clearing cache.")
                    torch.cuda.empty_cache()
                else:
                    print(f"[GPU {rank}] Error: {e}")

            gc.collect()

# =======================
# --- MAIN ENTRY ---
# =======================

def scan_worker(root):
    files = []
    valid_exts = ('.wav', '.mp3', '.flac', '.ogg', '.m4a', '.opus')
    try:
        for dirpath, _, filenames in os.walk(root):
            for f in filenames:
                if f.lower().endswith(valid_exts):
                    files.append(os.path.join(dirpath, f))
    except OSError:
        pass
    return files

def main_wrapper(rank, chunks):
    gpu_worker(rank, chunks[rank])

if __name__ == "__main__":
    mp.set_start_method('spawn', force=True)

    if not torch.cuda.is_available():
        print("No GPU detected.")
        sys.exit(1)

    num_gpus = torch.cuda.device_count()
    print(f"--- Majestrino 1.00 Annotation Tool ---")
    print(f"GPUs Available: {num_gpus}")

    print("Scanning directories...")
    all_files = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(scan_worker, DATA_ROOTS)
        for r in results:
            all_files.extend(r)

    all_files = list(set(all_files))
    total_files = len(all_files)
    print(f"Found {total_files} audio files.")

    if total_files == 0:
        sys.exit(0)

    np.random.shuffle(all_files)
    chunks = np.array_split(all_files, num_gpus)
    chunks = [c.tolist() for c in chunks]

    print("Launching workers...")
    mp.spawn(main_wrapper, args=(chunks,), nprocs=num_gpus, join=True)
    print("All processing finished.")

Troubleshooting

"My embeddings don't match / cosine similarity is ~0"

This almost always means the projection head weights were not loaded. The most common cause is naming the projection attribute self.projector instead of self.audio_proj. Because the code uses strict=False, PyTorch will silently skip mismatched keys and the projector will keep its random initialization.

How to check:

state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)

# Print unexpected keys (keys in safetensors that didn't match any model parameter)
print("Unexpected:", [k for k in result.unexpected_keys if "text_model" not in k and k != "logit_scale"])
# If you see audio_proj.* here, the projection weights were NOT loaded.

# Print missing keys (model parameters not found in safetensors)
print("Missing:", [k for k in result.missing_keys if "decoder" not in k])
# If you see projector.* here, you used the wrong attribute name.

The fix: rename self.projector to self.audio_proj in your model class.
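
If renaming the attribute is impractical (e.g. the class is shared with other checkpoints), the inverse fix is to remap the checkpoint keys before calling load_state_dict. A toy sketch of the remapping, using placeholder values in place of real tensors:

```python
# Toy stand-in for the real state_dict; same key convention as the checkpoint.
ckpt = {
    "audio_proj.0.weight": None,
    "audio_proj.0.bias": None,
    "audio_encoder.conv1.weight": None,
}

# Rewrite audio_proj.* -> projector.* so the keys match self.projector;
# the count argument 1 ensures only the leading prefix is replaced.
remapped = {k.replace("audio_proj.", "projector.", 1): v for k, v in ckpt.items()}
print(sorted(remapped))
```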

Expected load_state_dict behavior

When loading correctly, you should see:

| Category            | What appears                         | Why                                                                 |
|---------------------|--------------------------------------|---------------------------------------------------------------------|
| Loaded successfully | audio_encoder.*, audio_proj.*        | Encoder and projection head matched                                 |
| Expected missing    | whisper.encoder.*, whisper.decoder.* | whisper.encoder is an alias for audio_encoder (same tensors, already loaded); decoder is unused for audio-only inference |
| Expected unexpected | text_model.*, logit_scale            | Text encoder and contrastive scale are not needed for audio embedding extraction |
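
These expectations can also be checked programmatically rather than by eyeballing the printout. A sketch that flags any key outside the categories above (`LoadResult` stands in for the NamedTuple that load_state_dict returns):

```python
from collections import namedtuple

# Stand-in for torch.nn.modules.module._IncompatibleKeys.
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])

def summarize_load(result):
    """Return (suspicious_missing, suspicious_unexpected) keys (sketch)."""
    expected_missing = ("whisper.",)                      # encoder alias + unused decoder
    expected_unexpected = ("text_model.", "logit_scale")  # unused for audio inference
    bad_missing = [k for k in result.missing_keys
                   if not k.startswith(expected_missing)]
    bad_unexpected = [k for k in result.unexpected_keys
                      if not k.startswith(expected_unexpected)]
    return bad_missing, bad_unexpected

# A healthy load: nothing falls outside the known categories.
ok = LoadResult(
    missing_keys=["whisper.decoder.layers.0.fc1.weight"],
    unexpected_keys=["text_model.embeddings.word_embeddings.weight", "logit_scale"],
)
print(summarize_load(ok))  # ([], [])
```

Anything in either returned list means an attribute-name mismatch in your model class.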

State dict key reference

The model.safetensors contains these parameter groups:

audio_encoder.conv1.weight          [768, 80, 3]
audio_encoder.conv2.weight          [768, 768, 3]
audio_encoder.embed_positions.weight [1500, 768]
audio_encoder.layers.{0-11}.*       (12 transformer layers)
audio_encoder.layer_norm.*

audio_proj.0.weight                 [2048, 768]    # Linear: 768 → 2048
audio_proj.0.bias                   [2048]
audio_proj.2.weight                 [768, 2048]    # Linear: 2048 → 768
audio_proj.2.bias                   [768]

text_model.*                        (text encoder, not needed for audio)
logit_scale                         (contrastive temperature, not needed for inference)