# Audio Pitch-Quality Classifier (Multi-Label with None)

This model is a fine-tuned version of openai/whisper-small for multi-label classification. It predicts whether an audio clip contains one or more pitch-quality attributes, and falls back to `None` when no label passes its decision threshold.
## 🏷️ Labels

- **shrill**: Sharp, piercing, or thin high-frequency vocal quality.
- **nasal**: Resonance is concentrated in the nasal tract.
- **deep**: Strong low-frequency or deep vocal quality.
- **None**: No target pitch-quality label is predicted.
## Usage: Input & Output

### 1. Input Specifications

- **Processor**: Uses `WhisperProcessor` for resampling and log-Mel feature extraction.
- **Sampling rate**: 16,000 Hz.
- **Audio format**: Mono raw waveform.
- **Recommended processor**: `openai/whisper-small`.
### 2. Output (Multi-Label Logic)

Because this is a multi-label task, labels are not mutually exclusive.

- **Activation**: Sigmoid. Each label gets an independent probability between 0 and 1.
- **Decision**: A label is selected if its probability is greater than or equal to its own threshold.
- **Fallback**: If no label passes its threshold, the prediction is `None`.
Dynamic thresholds saved with this checkpoint:

```json
{
  "shrill": 0.10356783866882324,
  "nasal": 0.29178741574287415,
  "deep": 0.07894472032785416
}
```
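The decision rule above can be sketched in a few lines: compare each label's sigmoid probability against its own threshold, and fall back to `None` when nothing qualifies. The probability values below are hypothetical examples, not actual model output.

```python
def apply_thresholds(probs, thresholds):
    """Select every label whose probability meets its own threshold;
    fall back to ["None"] when no label qualifies."""
    selected = [label for label, t in thresholds.items() if probs[label] >= t]
    return selected or ["None"]

thresholds = {
    "shrill": 0.10356783866882324,
    "nasal": 0.29178741574287415,
    "deep": 0.07894472032785416,
}

# Only "shrill" clears its threshold here
print(apply_thresholds({"shrill": 0.50, "nasal": 0.10, "deep": 0.02}, thresholds))
# -> ['shrill']

# Nothing clears a threshold, so the fallback fires
print(apply_thresholds({"shrill": 0.05, "nasal": 0.10, "deep": 0.02}, thresholds))
# -> ['None']
```

Note that because each label is compared independently, more than one label can be returned for a single clip.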
## Label Mapping

```python
{
    0: "shrill",
    1: "nasal",
    2: "deep",
}
```
## Inference Code

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForAudioClassification

model_id = "Kang-Chieh/whisper-small-mlb-with-none-pitch-quality-dynamic-threshold"
processor_id = "openai/whisper-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

thresholds = {
    "shrill": 0.10356783866882324,
    "nasal": 0.29178741574287415,
    "deep": 0.07894472032785416,
}

# Load the processor and model
processor = WhisperProcessor.from_pretrained(processor_id)
model = WhisperForAudioClassification.from_pretrained(model_id).to(device)

def predict_pitch_quality(audio_path):
    # 1. Load audio as 16 kHz mono
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)

    # 2. Preprocess into log-Mel input features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # 3. Inference
    with torch.no_grad():
        logits = model(input_features=input_features).logits

    # 4. Multi-label logic (sigmoid + per-label thresholds)
    probs = torch.sigmoid(logits).squeeze().cpu().numpy()
    id2label = {int(k): v for k, v in model.config.id2label.items()}
    all_scores = {id2label[i]: float(probs[i]) for i in range(len(probs))}

    predicted_labels = [
        label for label, score in all_scores.items()
        if score >= thresholds[label]
    ]
    if not predicted_labels:
        predicted_labels = ["None"]

    return {
        "labels": predicted_labels,
        "all_scores": all_scores,
        "thresholds": thresholds,
    }

# Run example
result = predict_pitch_quality("audio_clip.wav")
print("Detected Labels:", ", ".join(result["labels"]))
print("Scores:", result["all_scores"])
```
## Reported Performance
From the saved evaluation results:
- Micro F1: 0.62
- Macro F1: 0.56
- Weighted F1: 0.61
- Samples F1: 0.64
Class-wise summary:

| Label  | Precision | Recall | F1-score |
|--------|-----------|--------|----------|
| shrill | 0.78      | 0.45   | 0.57     |
| nasal  | 0.51      | 0.29   | 0.37     |
| deep   | 0.68      | 0.56   | 0.62     |
| None   | 0.61      | 0.80   | 0.69     |
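Multi-label F1 variants like the ones above can be reproduced with `sklearn.metrics.f1_score` on binary indicator matrices. The matrices below are hypothetical toy data for illustration; the reported numbers come from this checkpoint's own evaluation set.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows = clips, columns = (shrill, nasal, deep); 1 means the label applies.
# An all-zero row corresponds to the "None" fallback.
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 0], [1, 1, 1]])

print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print("Samples F1:", f1_score(y_true, y_pred, average="samples", zero_division=1))
```

Micro F1 pools true/false positives across all labels, macro F1 averages per-label scores equally, and samples F1 averages per-clip scores, which is why the four numbers can differ on imbalanced label sets.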