# Wav2Vec2 Speech Emotion Recognition for English

## Model Overview

**Model name:** `dihuzz/wav2vec2-ser-english-finetuned`

This model is fine-tuned to recognize emotions in English speech using the Wav2Vec2 architecture. It can detect the following emotions:
- Sadness
- Anger
- Disgust
- Fear
- Happiness
- Neutral
The model was created by fine-tuning `r-f/wav2vec-english-speech-emotion-recognition` on several prominent Speech Emotion Recognition datasets containing English emotional speech samples.

## Performance Metrics

- **Accuracy:** 92.42%
- **Loss:** 0.219
## Training Procedure

### Training Details

- **Base Model:** `r-f/wav2vec-english-speech-emotion-recognition`
- **Hardware:** P100 GPU on Kaggle
- **Training Duration:** 10 epochs
- **Learning Rate:** 5e-4
- **Batch Size:** 4
- **Gradient Accumulation Steps:** 8 (effective batch size of 32)
- **Optimizer:** AdamW (β₁ = 0.9, β₂ = 0.999)
- **Loss Function:** Cross-Entropy Loss
- **Learning Rate Scheduler:** None
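As a rough guide, the hyperparameters above can be expressed as a `transformers.TrainingArguments` configuration. This is a hypothetical reconstruction, not the author's actual training script; the `output_dir` is assumed, and "no scheduler" is mapped to a constant learning rate:

```python
from transformers import TrainingArguments

# Sketch of the training configuration listed above (assumptions noted inline)
training_args = TrainingArguments(
    output_dir="./wav2vec2-ser-english-finetuned",  # assumed output path
    num_train_epochs=10,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 4 x 8 = 32
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="constant",    # "None" scheduler taken as a constant LR
)
```

Note that `Wav2Vec2ForSequenceClassification` applies cross-entropy loss internally when labels are provided, so no custom loss function is needed with the `Trainer` API.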
### Training Results
| Epoch | Loss | Accuracy |
|---|---|---|
| 1 | 1.0257 | 61.20% |
| 2 | 0.7025 | 73.88% |
| 3 | 0.5901 | 78.25% |
| 4 | 0.4960 | 81.56% |
| 5 | 0.4105 | 85.04% |
| 6 | 0.3516 | 87.70% |
| 7 | 0.3140 | 88.87% |
| 8 | 0.2649 | 90.45% |
| 9 | 0.2178 | 92.42% |
| 10 | 0.2187 | 92.29% |
## How to Use

### Installation

```bash
pip install transformers torch torchaudio
```
### Example Usage

Here is an example of how to use the model to classify the emotion in an English `.wav` audio file:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load the fine-tuned model and feature extractor
model_name = "dihuzz/wav2vec2-ser-english-finetuned"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()

def predict_emotion(audio_path):
    # Load the audio file (librosa can be used here as well)
    waveform, sample_rate = torchaudio.load(audio_path)

    # Resample to 16 kHz if necessary
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Convert stereo to mono by averaging the channels
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Extract features and move them to the device
    inputs = feature_extractor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Perform inference (batch size of 1 here; batch multiple clips for faster inference)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        predicted_class_id = torch.argmax(logits, dim=-1).item()

    # Map the predicted class ID to its emotion label
    return model.config.id2label[predicted_class_id]

# Example usage
audio_file = "/path/to/your/audio.wav"
predicted_emotion = predict_emotion(audio_file)
print(f"Predicted Emotion: {predicted_emotion}")
```
### Example Output

The model returns a string representing the predicted emotion:

```
Predicted Emotion: <emotion_label>
```
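If you want a confidence score rather than only the argmax label, apply a softmax to the logits (with the tensor returned above, `torch.softmax(logits, dim=-1)` does this directly). Here is a minimal, framework-free sketch using hypothetical logit values for the six classes:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the six classes; the real class order must be
# read from model.config.id2label
logits = [2.1, -0.3, 0.4, -1.2, 3.5, 0.2]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))  # index 4 has the highest probability
```

The probability of the top class gives a rough sense of how confident the prediction is, which can be useful for thresholding uncertain clips.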
## Limitations

This model has several important limitations:

- **Language Specificity:** English-only support
- **Dialect Sensitivity:** Variable performance across accents
- **Audio Quality Needs:** Requires clean, clear recordings
- **Potential Biases:** May reflect cultural biases in the training data
- **Limited Categories:** Only detects 6 basic emotions
- **Context Unaware:** Does not consider the semantic content of the speech