
Multi-Label Audio Expressiveness Classifier

This model is a fine-tuned version of openai/whisper-large-v3 for multi-label audio classification. It is designed to detect 18 distinct expressiveness traits in speech audio.

🏷️ Labels

"enthusiastic", "happy", "angry", "saddened", "awed", "calm", "anxious", "disgusted", "scared", "confused", "bored", "sleepy", "pained", "guilt", "sarcastic", "sympathetic", "admiring", "desirous"


🛠 Usage: Input & Output

1. Input Specifications

The model requires audio formatted specifically for the Whisper architecture:

  • Sampling Rate: Must be 16,000 Hz. Resample your audio if necessary.
  • Duration: Best performance is achieved on clips between 0.5s and 30s.
  • Pre-processing: Use the WhisperProcessor to convert raw waveforms into the Mel-spectrogram format expected by the model.
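If your source audio is not already at 16 kHz, resample it before feature extraction. Below is a minimal sketch using plain NumPy linear interpolation; the 44.1 kHz input rate and the synthetic tone are illustrative stand-ins, and in practice `librosa.load(path, sr=16000)` (as in the Quick Start below) handles resampling with higher quality:

```python
import numpy as np

sr_in, target_sr = 44100, 16000
duration_s = 2.0

# Synthetic 440 Hz tone standing in for a real recording
t = np.linspace(0, duration_s, int(sr_in * duration_s), endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Linear-interpolation resampling down to 16 kHz
new_len = int(len(audio) * target_sr / sr_in)
resampled = np.interp(
    np.linspace(0, len(audio), new_len, endpoint=False),
    np.arange(len(audio)),
    audio,
).astype(np.float32)

print(len(audio), len(resampled))  # 88200 32000
```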

2. Output Format (Multi-Label)

Unlike standard single-label classifiers, this model uses a Multi-Label approach. This means:

  • Independence: Each label is calculated independently. The probabilities do not sum to 1.0.
  • Vector Output: The model outputs a vector of 18 probabilities, one per label (via Sigmoid activation).
  • Thresholding: A label is considered "Active" if its probability is above a certain threshold (default is 0.5).
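The sigmoid-plus-threshold step described above can be sketched in pure Python (the three toy logits are made up for illustration, not real model output):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy logits for three labels (illustrative values only)
logits = [2.0, -1.0, 0.3]
probs = [sigmoid(z) for z in logits]

# Each probability is independent, so they need not sum to 1.0
print(round(sum(probs), 3))  # 1.724

threshold = 0.5
active = [p > threshold for p in probs]
print(active)  # [True, False, True]
```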

🚀 Quick Start (Inference)

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-large-v3-mlb-emotion"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict(audio_path, threshold=0.5):
    # Load audio and resample to the 16 kHz rate Whisper expects
    audio, _ = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Multi-label: apply Sigmoid to each logit independently
    probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    id2label = model.config.id2label
    results = {id2label[i]: float(probs[i]) for i in range(len(probs))}

    # Keep only the labels whose probability clears the threshold
    active_tags = [tag for tag, score in results.items() if score > threshold]

    return active_tags, results

tags, scores = predict("path_to_your_audio.wav")
print(f"Detected Tags: {tags}")
```
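Whisper's feature extractor operates on windows of at most 30 s, so for longer recordings one option is to split the waveform into chunks, run the inference step on each chunk, and aggregate the results (e.g., max-pool the per-label probabilities). A minimal chunking sketch with NumPy, where the 70-second silent buffer is a stand-in for real audio:

```python
import numpy as np

sr = 16000
chunk_len = int(sr * 30.0)  # 30 s of samples at 16 kHz

# Stand-in for a 70-second recording
audio = np.zeros(int(sr * 70.0), dtype=np.float32)

# Non-overlapping 30 s windows; the final chunk may be shorter
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
print([len(c) / sr for c in chunks])  # [30.0, 30.0, 10.0]
```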
Safetensors · Model size: 0.6B params · Tensor type: F32