Audio Pitch-Quality Classifier (Multi-Label with None)

This model is a fine-tuned version of openai/whisper-small for multi-label audio classification. It predicts whether an audio clip contains one or more pitch-quality attributes, and falls back to None when no label's probability passes its decision threshold.

🏷️ Labels

  • shrill: Sharp, piercing, or thin high-frequency vocal quality.
  • nasal: Resonance is concentrated in the nasal tract.
  • deep: Strong low-frequency or deep vocal quality.
  • None: No target pitch-quality label is predicted.

🛠️ Usage: Input & Output

1. Input Specifications

  • Processor: Uses WhisperProcessor for resampling and log-Mel feature extraction.
  • Sampling Rate: 16,000 Hz.
  • Audio Format: Mono raw waveform.
  • Recommended Processor: openai/whisper-small

2. Output (Multi-Label Logic)

Because this is a multi-label task, labels are not mutually exclusive.

  • Activation: Sigmoid. Each label gets an independent probability between 0 and 1.
  • Decision: A label is selected if its probability meets or exceeds that label's own threshold.
  • Fallback: If no label passes its threshold, the model returns None.

Dynamic thresholds saved with this checkpoint:

{
    "shrill": 0.10356783866882324,
    "nasal": 0.29178741574287415,
    "deep": 0.07894472032785416
}
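The per-label decision rule can be sketched in plain NumPy (a minimal illustration with hypothetical logit values; the full pipeline below uses the model's actual logits):

```python
import numpy as np

thresholds = {
    "shrill": 0.10356783866882324,
    "nasal": 0.29178741574287415,
    "deep": 0.07894472032785416,
}
id2label = {0: "shrill", 1: "nasal", 2: "deep"}

def decide(logits):
    # Independent sigmoid per label, compared against that label's threshold.
    probs = 1 / (1 + np.exp(-np.asarray(logits)))
    picked = [id2label[i] for i, p in enumerate(probs)
              if p >= thresholds[id2label[i]]]
    return picked or ["None"]  # fallback when nothing passes

print(decide([-3.0, -2.0, -4.0]))   # -> ['None']
print(decide([0.5, -2.0, -1.0]))    # -> ['shrill', 'deep']
```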

📊 Label Mapping

{
    0: "shrill",
    1: "nasal",
    2: "deep",
}

🚀 Inference Code

import torch
import librosa
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-small-mlb-with-none-pitch-quality-dynamic-threshold"
processor_id = "openai/whisper-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

thresholds = {
    "shrill": 0.10356783866882324,
    "nasal": 0.29178741574287415,
    "deep": 0.07894472032785416,
}

# Load the processor and model
processor = WhisperProcessor.from_pretrained(processor_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict_pitch_quality(audio_path):
    # 1. Load audio and ensure 16kHz mono audio
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)

    # 2. Preprocess
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # 3. Inference
    with torch.no_grad():
        logits = model(input_features=input_features).logits

    # 4. Multi-label logic (Sigmoid)
    probs = torch.sigmoid(logits).squeeze().cpu().numpy()
    id2label = {int(k): v for k, v in model.config.id2label.items()}
    all_scores = {id2label[i]: float(probs[i]) for i in range(len(probs))}

    predicted_labels = [
        label for label, score in all_scores.items()
        if score >= thresholds[label]
    ]

    if not predicted_labels:
        predicted_labels = ["None"]

    return {
        "labels": predicted_labels,
        "all_scores": all_scores,
        "thresholds": thresholds,
    }

# Run example
result = predict_pitch_quality("audio_clip.wav")
print("Detected Labels:", ", ".join(result["labels"]))
print("Scores:", result["all_scores"])

📈 Reported Performance

From the saved evaluation results:

  • Micro F1: 0.62
  • Macro F1: 0.56
  • Weighted F1: 0.61
  • Samples F1: 0.64

Class-wise summary:

  • shrill: precision 0.78, recall 0.45, f1-score 0.57
  • nasal: precision 0.51, recall 0.29, f1-score 0.37
  • deep: precision 0.68, recall 0.56, f1-score 0.62
  • None: precision 0.61, recall 0.80, f1-score 0.69