Audio Pitch-Level Classifier (Single-Label)

This model is a fine-tuned version of openai/whisper-small for single-label classification. It categorizes the perceived pitch level of an audio clip into one of three mutually exclusive classes.

🏷️ Pitch Labels

  • high-pitched: Relatively high perceived vocal pitch.
  • medium-pitched: Moderate or neutral perceived vocal pitch.
  • low-pitched: Relatively low perceived vocal pitch.

🛠 Usage: Input & Output

1. Input Specifications

  • Processor: Uses WhisperProcessor for resampling and log-Mel feature extraction.
  • Sampling Rate: 16,000 Hz.
  • Audio Format: Mono raw waveform.
  • Recommended Processor: openai/whisper-small
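As a quick illustration of the expected raw-audio input, the snippet below builds a one-second synthetic mono waveform at 16 kHz with NumPy (a stand-in for a real recording; the 220 Hz tone frequency is arbitrary):

```python
import numpy as np

SR = 16000  # required sampling rate (Hz)

# One second of a 220 Hz sine tone as a 1-D float32 array,
# matching the mono raw-waveform shape the processor expects.
t = np.linspace(0, 1.0, SR, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

print(audio.shape, audio.dtype)  # (16000,) float32
```

Real files at other sampling rates should be resampled to 16 kHz first (e.g. via `librosa.load(path, sr=16000, mono=True)`, as in the inference code below).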

2. Output (Single-Label Logic)

Because this is a single-label task, the categories are mutually exclusive.

  • Activation: Softmax. The output probabilities sum to 1.0.
  • Decision: The model selects the label with the highest probability score.
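The single-label decision rule can be sketched in a few lines of NumPy (the logits here are invented for illustration):

```python
import numpy as np

# Hypothetical logits for [high-pitched, medium-pitched, low-pitched]
logits = np.array([2.1, 0.3, -1.2])

# Softmax: exponentiate (shifted by the max for numerical stability)
# and normalize so the probabilities sum to 1.0
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Decision: pick the class with the highest probability
predicted_id = int(np.argmax(probs))
print(predicted_id, probs)  # 0, i.e. "high-pitched" under the mapping below
```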

📊 Label Mapping

{
    0: "high-pitched",
    1: "medium-pitched",
    2: "low-pitched",
}

🚀 Inference Code

import torch
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-small-pitch-level"
processor_id = "openai/whisper-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and model
processor = WhisperProcessor.from_pretrained(processor_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict_pitch(audio_path):
    # 1. Load audio and ensure 16kHz mono audio
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)

    # 2. Preprocess
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # 3. Inference
    with torch.no_grad():
        logits = model(input_features=input_features).logits

    # 4. Single-label logic (Softmax)
    probs = torch.softmax(logits, dim=-1).squeeze().cpu().numpy()

    # 5. Get the highest scoring label
    id2label = {int(k): v for k, v in model.config.id2label.items()}
    predicted_id = int(np.argmax(probs))

    return {
        "label": id2label[predicted_id],
        "confidence": float(probs[predicted_id]),
        "all_scores": {id2label[i]: float(probs[i]) for i in range(len(probs))},
    }

# Run example
result = predict_pitch("audio_clip.wav")
print(f"Detected Pitch: {result['label']} ({result['confidence']:.2%})")

📈 Reported Performance

From the saved evaluation results:

  • Accuracy: 0.68
  • Macro F1: 0.69

Class-wise summary:

  • high-pitched: precision 0.85, recall 0.66, f1-score 0.74
  • medium-pitched: precision 0.59, recall 0.69, f1-score 0.64
  • low-pitched: precision 0.66, recall 0.69, f1-score 0.67
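Macro F1 is the unweighted mean of the per-class F1 scores; recomputing it from the rounded class-wise numbers above roughly reproduces the reported value (the small drift from 0.69 comes from rounding the per-class scores to two decimals):

```python
# Per-class F1 scores as reported above (already rounded to 2 decimals)
f1_scores = {"high-pitched": 0.74, "medium-pitched": 0.64, "low-pitched": 0.67}

# Macro F1 = simple average, each class weighted equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 2))  # ≈ 0.68
```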