# Audio Pitch-Level Classifier (Single-Label)
This model is a fine-tuned version of openai/whisper-small for single-label classification. It categorizes the perceived pitch level of an audio clip into one of three mutually exclusive classes.
## Pitch Labels

- **high-pitched**: Relatively high perceived vocal pitch.
- **medium-pitched**: Moderate or neutral perceived vocal pitch.
- **low-pitched**: Relatively low perceived vocal pitch.
## Usage: Input & Output

### 1. Input Specifications
- **Processor**: Uses `WhisperProcessor` for resampling and log-Mel feature extraction.
- **Sampling Rate**: 16,000 Hz.
- **Audio Format**: Mono raw waveform.
- **Recommended Processor**: `openai/whisper-small`
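The inference code below relies on `librosa.load(..., mono=True)` to satisfy the mono requirement; librosa down-mixes multi-channel audio by averaging the channels. A minimal pure-Python sketch of that down-mix, for illustration only (real pipelines should let librosa or torchaudio handle loading and resampling):

```python
def to_mono(stereo_frames):
    # Down-mix by averaging the left/right samples of each frame,
    # mirroring how librosa averages channels when mono=True.
    return [(left + right) / 2.0 for left, right in stereo_frames]

frames = [(0.2, 0.4), (-0.1, 0.1), (1.0, 0.0)]
print(to_mono(frames))
```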
### 2. Output (Single-Label Logic)
Because this is a single-label task, the three categories are mutually exclusive.

- **Activation**: Softmax; the output probabilities sum to 1.0.
- **Decision**: The model selects the label with the highest probability score.
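The two bullets above amount to a softmax over the three logits followed by an argmax. A minimal sketch with hypothetical logit values (for illustration; the real pipeline uses `torch.softmax` as shown in the inference code):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits for (high, medium, low) -- illustration only.
logits = [0.2, 1.6, -0.5]
probs = softmax(logits)
labels = ["high-pitched", "medium-pitched", "low-pitched"]
prediction = labels[probs.index(max(probs))]

print(prediction)  # medium-pitched
```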
## Label Mapping

```python
{
    0: "high-pitched",
    1: "medium-pitched",
    2: "low-pitched",
}
```
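Note that when a Hugging Face config is round-tripped through JSON, the integer keys of `id2label` come back as strings (`"0"`, `"1"`, …), which is why the inference code casts with `int(k)`. A small sketch of working with the mapping (the dict literal mirrors the one above; in practice read it from `model.config.id2label`):

```python
# Mapping as published in this model card.
id2label = {0: "high-pitched", 1: "medium-pitched", 2: "low-pitched"}

# Invert it for lookups by name, e.g. when preparing training labels.
label2id = {v: k for k, v in id2label.items()}

print(label2id["low-pitched"])  # 2
```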
## Inference Code

```python
import torch
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-small-pitch-level"
processor_id = "openai/whisper-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and model
processor = WhisperProcessor.from_pretrained(processor_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict_pitch(audio_path):
    # 1. Load audio and ensure 16 kHz mono
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    # 2. Preprocess into log-Mel input features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device)
    # 3. Inference
    with torch.no_grad():
        logits = model(input_features=input_features).logits
    # 4. Single-label logic (softmax over the three classes)
    probs = torch.softmax(logits, dim=-1).squeeze().cpu().numpy()
    # 5. Get the highest-scoring label
    id2label = {int(k): v for k, v in model.config.id2label.items()}
    predicted_id = int(np.argmax(probs))
    return {
        "label": id2label[predicted_id],
        "confidence": float(probs[predicted_id]),
        "all_scores": {id2label[i]: float(probs[i]) for i in range(len(probs))},
    }

# Run example
result = predict_pitch("audio_clip.wav")
print(f"Detected Pitch: {result['label']} ({result['confidence']:.2%})")
```
## Reported Performance
From the saved evaluation results:
- Accuracy: 0.68
- Macro F1: 0.69
Class-wise summary:
| Class | Precision | Recall | F1 |
|---|---|---|---|
| high-pitched | 0.85 | 0.66 | 0.74 |
| medium-pitched | 0.59 | 0.69 | 0.64 |
| low-pitched | 0.66 | 0.69 | 0.67 |
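Each per-class F1 is the harmonic mean of that class's precision and recall, and macro F1 is the unweighted mean of the per-class F1 scores. A quick sanity check against the per-class figures above (the mean of the rounded per-class values comes out near 0.68, differing from the reported 0.69 only through rounding of the published precision/recall figures):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in this card.
per_class = {
    "high-pitched": (0.85, 0.66),
    "medium-pitched": (0.59, 0.69),
    "low-pitched": (0.66, 0.69),
}
scores = {name: round(f1(p, r), 2) for name, (p, r) in per_class.items()}
print(scores)  # {'high-pitched': 0.74, 'medium-pitched': 0.64, 'low-pitched': 0.67}
```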