Audio Pitch-Level Classifier (Single-Label)

This model is a fine-tuned version of openai/whisper-small for single-label classification. It categorizes the perceived pitch level of an audio clip into one of three mutually exclusive classes.

🏷️ Pitch Labels

  • high-pitched: Relatively high perceived vocal pitch.
  • medium-pitched: Moderate or neutral perceived vocal pitch.
  • low-pitched: Relatively low perceived vocal pitch.

🛠 Usage: Input & Output

1. Input Specifications

  • Processor: Uses WhisperProcessor for resampling and log-Mel feature extraction.
  • Sampling Rate: 16,000 Hz.
  • Audio Format: Mono raw waveform.
  • Recommended Processor: openai/whisper-small
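As a quick illustration of the expected raw-audio input, the snippet below builds a one-second synthetic mono waveform at 16 kHz with NumPy (a stand-in for a real recording; the 220 Hz tone frequency is arbitrary):

```python
import numpy as np

SR = 16000  # required sampling rate (Hz)

# One second of a 220 Hz sine tone as a 1-D float32 array,
# matching the mono raw-waveform shape the processor expects.
t = np.linspace(0, 1.0, SR, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

print(audio.shape, audio.dtype)  # (16000,) float32
```

Real files at other sampling rates should be resampled to 16 kHz first (e.g. via `librosa.load(path, sr=16000, mono=True)`, as in the inference code below).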

2. Output (Single-Label Logic)

Because this is a single-label task, the categories are mutually exclusive.

  • Activation: Softmax. The output probabilities sum to 1.0.
  • Decision: The model selects the label with the highest probability score.
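The single-label decision rule can be sketched in a few lines of NumPy (the logits here are invented for illustration):

```python
import numpy as np

# Hypothetical logits for [high-pitched, medium-pitched, low-pitched]
logits = np.array([2.1, 0.3, -1.2])

# Softmax: exponentiate (shifted by the max for numerical stability)
# and normalize so the probabilities sum to 1.0
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Decision: pick the class with the highest probability
predicted_id = int(np.argmax(probs))
print(predicted_id, probs)  # 0, i.e. "high-pitched" under the mapping below
```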

📊 Label Mapping

{
    0: "high-pitched",
    1: "medium-pitched",
    2: "low-pitched",
}

🚀 Inference Code

import torch
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-small-pitch-level"
processor_id = "openai/whisper-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and model
processor = WhisperProcessor.from_pretrained(processor_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict_pitch(audio_path):
    # 1. Load audio and ensure 16kHz mono audio
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)

    # 2. Preprocess
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # 3. Inference
    with torch.no_grad():
        logits = model(input_features=input_features).logits

    # 4. Single-label logic (Softmax)
    probs = torch.softmax(logits, dim=-1).squeeze().cpu().numpy()

    # 5. Get the highest scoring label
    id2label = {int(k): v for k, v in model.config.id2label.items()}
    predicted_id = int(np.argmax(probs))

    return {
        "label": id2label[predicted_id],
        "confidence": float(probs[predicted_id]),
        "all_scores": {id2label[i]: float(probs[i]) for i in range(len(probs))},
    }

# Run example
result = predict_pitch("audio_clip.wav")
print(f"Detected Pitch: {result['label']} ({result['confidence']:.2%})")

📈 Reported Performance

From the saved evaluation results:

  • Accuracy: 0.68
  • Macro F1: 0.69

Class-wise summary:

  • high-pitched: precision 0.85, recall 0.66, f1-score 0.74
  • medium-pitched: precision 0.59, recall 0.69, f1-score 0.64
  • low-pitched: precision 0.66, recall 0.69, f1-score 0.67
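Macro F1 is the unweighted mean of the per-class F1 scores; recomputing it from the rounded class-wise numbers above roughly reproduces the reported value (the small drift from 0.69 comes from rounding the per-class scores to two decimals):

```python
# Per-class F1 scores as reported above (already rounded to 2 decimals)
f1_scores = {"high-pitched": 0.74, "medium-pitched": 0.64, "low-pitched": 0.67}

# Macro F1 = simple average, each class weighted equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 2))  # ≈ 0.68
```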