# Spotify Track Popularity Prediction
[For Video Presentation Please Click Here](https://youtu.be/P0bjIb1qknQ)
---
This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.
It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.
## 1. Project Overview
The goal of this project is to determine whether musical and audio characteristics can predict a track’s popularity (0–100).
The task is first approached as a regression problem and later reframed as a binary classification problem.
*Target variable:* `popularity`
*Type:* continuous (0–100)
---
## 2. Dataset Description
Source: Spotify Tracks Dataset, Kaggle
Size: ~114,000 rows, 20 features, 125+ genres
Includes:
- Audio features (danceability, energy, loudness, tempo, valence, etc.)
- Metadata (track name, artist name)
- Genre
- Popularity score
---
## 3. Exploratory Data Analysis (EDA)
### Cleaning
- Removed a small number of rows with missing metadata
- Audio features were fully available – no imputation needed
### Outliers
- Outliers in duration, tempo, and loudness represent real musical variation
- No outlier removal was performed
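The cleaning step above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code; the column names (`track_name`, `artists`, `danceability`, `popularity`) are assumed to match the Kaggle CSV.

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset (values are made up for illustration).
df = pd.DataFrame({
    "track_name": ["A", None, "C", "D"],
    "artists": ["x", "y", None, "z"],
    "danceability": [0.5, 0.7, 0.2, 0.9],
    "popularity": [10, 40, 0, 75],
})

metadata_cols = ["track_name", "artists"]  # assumed metadata columns
audio_cols = ["danceability"]              # assumed audio feature columns

# Drop only the rows whose metadata is missing; outliers are left untouched.
clean = df.dropna(subset=metadata_cols).reset_index(drop=True)

# Confirm that the audio features need no imputation.
missing_audio = int(clean[audio_cols].isna().sum().sum())
print(len(clean), "rows kept,", missing_audio, "missing audio values")
```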
### Target Distribution
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)
Popularity is heavily right-skewed: most songs are minimally popular.
### Correlations
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)
There are no strong linear correlations between audio features and popularity.
Several weak but consistent patterns appear:
- Higher danceability corresponds to higher maximum popularity
- Louder tracks tend to be more popular
- Explicit tracks have slightly higher median popularity
- Genres differ significantly in average popularity
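A correlation check like the one described above might look like the sketch below. The data here is synthetic (generated with a weak, noisy signal on purpose), so the printed numbers are not the project's results; only the pandas calls are the point.

```python
import numpy as np
import pandas as pd

# Synthetic data with a deliberately weak relationship to popularity.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "valence": rng.uniform(0, 1, n),
})
noise = rng.normal(0, 20, n)
df["popularity"] = (30 + 10 * df["danceability"] + 0.3 * df["loudness"] + noise).clip(0, 100)

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    df.corr()["popularity"]
    .drop("popularity")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```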
### Visualizations
1. Danceability vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/mnvWrzuLfFuSB7akeqoT2.png)
There is a clear upward trend: songs with higher danceability tend to be more popular. While the relationship is not perfectly linear, the upper envelope rises consistently with danceability.
2. Energy vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/4dU_Jpd36IFqE6zZvMn4c.png)
Popular songs cluster around medium-to-high energy levels. Very low-energy tracks rarely achieve high popularity, showing a clear preference for energetic music.
3. Loudness vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/M3PyAqpLXlmdSUroN44gs.png)
There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, reflecting modern production and streaming trends.
### **Research Questions**
1. What percentage of songs are explicit vs. clean?
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/aGxa55X2K-5yydyEw7a11.png)
- Most songs in the dataset are clean (91%), while only a small portion (9%) are explicit, showing that explicit content is relatively uncommon on Spotify.
2. Which musical keys are most common in the dataset?
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/cJHx7bC_y3i7ADzQLMqSR.png)
- Certain musical keys (especially 0 = C and 7 = G) appear far more frequently, but the distribution does not suggest any direct relationship with popularity.
3. How does average popularity vary across different music genres?
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)
- Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.
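Questions like these reduce to simple aggregations. The sketch below shows one possible way to compute the explicit share and the per-genre average on a small made-up table; the column names `track_genre` and `explicit` are assumed from the Kaggle CSV.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "track_genre": ["pop", "pop", "jazz", "jazz", "metal"],
    "explicit": [True, False, False, False, True],
    "popularity": [60, 40, 20, 30, 10],
})

# Share of explicit tracks: the mean of a boolean column is the fraction of True values.
explicit_share = df["explicit"].mean()

# Average popularity per genre, highest first.
genre_mean = (
    df.groupby("track_genre")["popularity"]
    .mean()
    .sort_values(ascending=False)
)
print(explicit_share)
print(genre_mean)
```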
### EDA Summary
The data suggests that no single feature determines popularity.
Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.
---
## 4. Baseline Regression Model
### Features
- Numeric audio features
- One-hot encoded genre
### Model
- Linear Regression (scikit-learn)
- 80/20 train-test split
### Performance
- MAE: ~14.08
- RMSE: ~19.14
- R²: ~0.26
Genre features dominate model coefficients.
Audio trends include:
- Positive effect: danceability, explicit
- Negative effect: valence, speechiness
- Minimal effect: key, mode, liveness
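A baseline of this shape can be sketched as follows. The data is synthetic, so the metrics will not match the values reported above; `drop_first=True` and `random_state=42` are assumptions for the example, not settings confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two audio features plus a genre with a strong effect.
rng = np.random.default_rng(42)
n = 1000
genres = rng.choice(["pop", "jazz", "metal"], size=n)
effect = {"pop": 20.0, "jazz": 0.0, "metal": -10.0}
genre_effect = np.array([effect[g] for g in genres])

df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "track_genre": genres,
})
noise = rng.normal(0, 15, n)
df["popularity"] = np.clip(35 + 10 * df["danceability"] + genre_effect + noise, 0, 100)

# One-hot encode genre; the dropped category becomes the reference genre.
X = pd.get_dummies(
    df.drop(columns="popularity"),
    columns=["track_genre"],
    drop_first=True,
    dtype=float,
)
y = df["popularity"]

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
```

With a dropped reference category, each genre coefficient reads as the expected difference in popularity relative to that reference genre.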
### **Feature Importance**
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/8X2jU4oJ8bIXYiuY-jDWi.png)
- Genre features dominate the coefficients.
- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.
Among non-genre features:
- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode
---
## 5. Feature Engineering
### Scaling
StandardScaler applied to all numeric features.
### Polynomial Features
PolynomialFeatures(degree=2) used to capture interactions and non-linear relationships.
### PCA
Performed only for visualization (2 components).
No distinct clusters observed in PCA space.
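A two-component projection for plotting could be produced along these lines. The input is random noise standing in for the scaled audio features, so no structure is expected in the output.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numeric audio features (200 tracks, 6 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of highest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
print(coords.shape, explained)
```

The two columns of `coords` can be used directly as x and y in a scatter plot.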
### Clustering
K-Means (k=5) applied to scaled numeric features.
Added engineered features:
- `cluster_id`
- `cluster_distance`
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)
Clusters differ by energy, danceability, valence, acousticness, and average popularity.
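The two engineered features could be derived as in the sketch below. It assumes that `cluster_distance` means the Euclidean distance from a track to its assigned centroid; the README does not spell out the definition, so treat that as a guess. The data is random.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled numeric features (300 tracks, 4 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with k=5, then assign each track to its nearest centroid.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X_scaled)
cluster_id = kmeans.predict(X_scaled)

# transform() returns the distance from each row to every centroid;
# pick the column that belongs to the assigned cluster.
distances = kmeans.transform(X_scaled)
cluster_distance = distances[np.arange(len(X_scaled)), cluster_id]
print(cluster_id[:5], cluster_distance[:5])
```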
---
## 6. Improved Regression Models
Three models were trained on the engineered dataset:
| Model | MAE | RMSE | R² |
|------------------------------|--------|--------|-------|
| Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
| Random Forest | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting | ~17.02 | ~20.55 | ~0.15 |
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)
Winner: **Enhanced Linear Regression**
Tree-based models underperformed due to:
- High-dimensional sparsity (after one-hot + polynomials)
- Weak signal-to-noise ratio
- Genre dominating feature space
Saved model: `spotify_popularity_enhanced_linear_regression.pkl`
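A comparison loop of this kind might look like the following sketch. It runs on synthetic data with a mostly linear signal, and the hyperparameters (`n_estimators=50`, fixed seeds) are assumptions, so the ranking it prints says nothing about the results in the table above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 35 + 8 * X[:, 0] - 4 * X[:, 1] + rng.normal(0, 15, 600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit every model on the same split and collect the three metrics.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": float(mean_absolute_error(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": float(r2_score(y_test, pred)),
    }

# Pick the model with the lowest RMSE on the held-out set.
best = min(results, key=lambda name: results[name]["RMSE"])
print(best, results[best])
```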
---
## 7. Regression to Classification
A binary label was created using the training-set median popularity (35):
- Class 0: popularity < 35
- Class 1: popularity ≥ 35
Classes were balanced (~50/50), so no resampling was needed.
Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.
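The median-based labeling described above can be sketched as follows. The popularity values are random integers rather than the real data, so the threshold here will not be 35. The key point is that the median is computed on the training split only and then reused for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the popularity column (integers 0 to 100).
rng = np.random.default_rng(0)
popularity = rng.integers(0, 101, size=1000)

# Split first, so the threshold never sees the test data.
pop_train, pop_test = train_test_split(popularity, test_size=0.2, random_state=42)

# Threshold = training-set median; class 1 means popularity >= threshold.
threshold = np.median(pop_train)
y_train = (pop_train >= threshold).astype(int)
y_test = (pop_test >= threshold).astype(int)

# Share of class 1 in the training set; close to 0.5 by construction.
balance = y_train.mean()
print(threshold, balance)
```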
---
## 8. Classification Models
Three models were trained:
| Model | Accuracy |
|-------------------------|----------|
| Logistic Regression | ~0.76 |
| Random Forest | ~0.75 |
| Gradient Boosting | ~0.72 |
Winner: Logistic Regression
It achieved the best precision-recall balance and the lowest misclassification bias.
Saved model: `spotify_popularity_logistic_regression_classifier.pkl`
### **Full Evaluation**
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/tVcYnJul6_zp1GIY8S8Bl.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/HDgdidPPTISme1hMCnZkH.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)
**Logistic Regression** shows the best balance between precision and recall across both classes.
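For reference, precision, recall, and the confusion matrix for a logistic regression classifier can be computed as in this sketch. The features and labels are synthetic, and `max_iter=1000` and the stratified split are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
score = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, 1000)
y = (score > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)  # of tracks predicted popular, share truly popular
recall = recall_score(y_test, pred)        # of truly popular tracks, share that were found

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Since precision is the priority here, the false-positive count `fp` is the number to watch: it counts non-popular tracks that were predicted as popular.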
---
## 9. How to Reproduce
Install dependencies:
```bash
pip install -r requirements.txt
```
Run the notebook:
```
Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
```
The preprocessing pipeline includes:
- Standard scaling
- Polynomial feature generation
- One-hot encoding of genres
- K-Means clustering (k=5)
These steps must be applied to the input data, in the same way as during training, before predicting with any saved model.
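One way to avoid reapplying the steps by hand is to bundle preprocessing and model into a single scikit-learn `Pipeline` and pickle that object. The sketch below is a hypothetical alternative on synthetic data, not a description of how the `.pkl` files in this repository were produced. It leaves out the K-Means features for brevity, and the column names are assumed.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "track_genre": rng.choice(["pop", "jazz", "metal"], size=n),
})
y = rng.uniform(0, 100, n)

numeric_cols = ["danceability", "loudness"]

# Numeric branch: scale, then add degree-2 terms. Genre branch: one-hot encode.
numeric_steps = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"]),
])

pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(df, y)

# Round-trip through pickle: the restored object accepts raw rows directly.
restored = pickle.loads(pickle.dumps(pipeline))
same = bool(np.allclose(pipeline.predict(df), restored.predict(df)))
print(same)
```

Because the fitted scaler and encoder are stored with the model, new raw rows go straight into `restored.predict(...)`.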
---
## 10. Repository Structure
```
project/
│── README.md
│── Leelu_Spotify_Popularity_Assignment_2.ipynb
│── spotify_popularity_enhanced_linear_regression.pkl
│── spotify_popularity_logistic_regression_classifier.pkl
```
---
## 11. Final Summary
This project builds a complete machine learning pipeline for predicting Spotify track popularity.
Through EDA, feature engineering, regression, and classification, the project demonstrates:
- Popularity is difficult to predict linearly
- Feature engineering improves linear regression performance only slightly (R² from ~0.26 to ~0.27)
- Enhanced Linear Regression is best for regression
- Logistic Regression is best for binary popularity classification
All final models require the full preprocessing pipeline to reproduce predictions.