# Spotify Track Popularity Prediction

---

This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.

It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.

## 1. Dataset Overview

The goal of this project is to determine whether musical and audio characteristics can predict a track's popularity (0–100).

The task is first approached as a regression problem and later reframed as a binary classification problem.

*Target variable:* `popularity`

*Type:* continuous (0–100)
---

## 2. Dataset Description

Source: Spotify Tracks Dataset, Kaggle

Size: ~114,000 rows, 20 features, 125+ genres

Includes:

- Audio features (danceability, energy, loudness, tempo, valence, etc.)
- Metadata (track name, artist name)
- Genre
- Popularity score
---

## 3. Exploratory Data Analysis (EDA)

### Cleaning

- Removed a small number of rows with missing metadata
- Audio features were fully available → no imputation needed
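The cleaning step amounts to dropping rows with missing metadata; a minimal pandas sketch (the tiny frame and column names such as `track_name` and `artists` are illustrative stand-ins for the real dataset):

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; column names are illustrative.
df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "artists": ["X", "Y", None],
    "danceability": [0.71, 0.55, 0.63],  # audio features are fully populated
    "popularity": [80, 42, 17],
})

# Drop the small number of rows with missing metadata;
# audio features need no imputation.
clean = df.dropna(subset=["track_name", "artists"])
print(len(clean))  # only rows with complete metadata remain
```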
### Outliers

- Outliers in duration, tempo, and loudness represent real musical variation
- No outlier removal was performed
### Target Distribution

![Popularity distribution](images/popularity_dist.png)

Popularity is heavily right-skewed: most songs are minimally popular.

### Correlations

![Correlation heatmap](images/correlation_heatmap.png)

There are no strong linear correlations between audio features and popularity. Several weak but consistent patterns appear:

- Higher danceability corresponds to higher maximum popularity
- Louder tracks tend to be more popular
- Explicit tracks have slightly higher median popularity
- Genres differ significantly in average popularity
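A minimal sketch of the correlation computation behind these observations, on synthetic stand-in data (the real heatmap uses the full feature set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "loudness": rng.normal(-8.0, 3.0, n),
    "valence": rng.random(n),
})
# Weak linear signal, mimicking what the EDA found on the real data.
df["popularity"] = 30 + 5 * df["danceability"] + rng.normal(0, 10, n)

# Linear association of each audio feature with the target.
corr = df.corr(numeric_only=True)["popularity"].drop("popularity")
print(corr.sort_values(ascending=False))
```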
### Research Questions

1. What percentage of songs are explicit vs. clean?

![Genre popularity](images/genre_popularity.png)

Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.

### EDA Summary

The data suggests that no single feature determines popularity.

Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.
---

## 4. Baseline Regression Model

### Features

- Numeric audio features
- One-hot encoded genre (`track_genre`)

### Model

- Linear Regression (scikit-learn)
- 80/20 train-test split
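The baseline setup can be sketched as a scikit-learn pipeline; the synthetic frame and its columns (`danceability`, `energy`, `track_genre`) are stand-ins for the real dataset, not the project's actual training code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
})
y = 50 + 20 * df["danceability"] + rng.normal(0, 5, n)

# Genre is one-hot encoded; numeric audio features pass through unchanged.
pre = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())])

# 80/20 train-test split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # R² on the held-out split
```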
### Performance

- MAE: ~14.08
- RMSE: ~19.14
- R²: ~0.26

The baseline model explains about **26%** of the variance in popularity. Genre features dominate the model coefficients.

Audio trends include:

- Positive effect: danceability, explicit
- Negative effect: valence, speechiness
- Minimal effect: key, mode, liveness

### Feature Importance

![Feature importance](images/feature_importance.png)

- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.

Among non-genre features:

- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode
---

## 5. Feature Engineering

### Scaling

`StandardScaler` was applied to all numeric features.

### Polynomial Features

`PolynomialFeatures(degree=2)` was used to capture interactions and non-linear relationships.
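The scaling and polynomial-expansion steps can be sketched as a small pipeline (the three-column array is a toy stand-in, e.g. danceability, loudness, tempo):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[0.7, -5.0, 120.0],
              [0.4, -9.0,  90.0],
              [0.9, -3.5, 140.0]])  # toy stand-in for 3 audio features

# Scale first, then expand: degree-2 terms add squares and pairwise interactions.
fe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2, include_bias=False))
X_fe = fe.fit_transform(X)
print(X_fe.shape)  # 3 original + 3 squared + 3 interaction terms = 9 columns
```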
### PCA

PCA was performed only for visualization (2 components).

No distinct clusters were observed in PCA space.
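The visualization step reduces the scaled feature matrix to two coordinates for plotting; a minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((100, 8))  # stand-in for the numeric audio-feature matrix

# Two components are kept purely for a 2-D scatter plot.
X_scaled = StandardScaler().fit_transform(X)
coords = PCA(n_components=2).fit_transform(X_scaled)
print(coords.shape)  # (100, 2)
```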
### Clustering

K-Means (k=5) was applied to the scaled numeric features.

Two engineered features were added:

- `cluster_id`
- `cluster_distance`

![Clusters in PCA space](images/clusters_pca.png)

Clusters differ by energy, danceability, valence, acousticness, and average popularity.
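One plausible way to derive the two cluster features, sketched on random stand-in data (the source does not show the exact code, so treat the `cluster_distance` definition as an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.random((300, 6))  # stand-in for the numeric feature matrix

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

cluster_id = km.labels_                                # hard assignment per track
cluster_distance = km.transform(X_scaled).min(axis=1)  # distance to own centroid
print(len(set(cluster_id)), cluster_distance.shape)
```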
---

## 6. Improved Regression Models

Three models were trained on the engineered dataset:

| Model                      | MAE    | RMSE   | R²    |
|----------------------------|--------|--------|-------|
| Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
| Random Forest              | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting          | ~17.02 | ~20.55 | ~0.15 |

![Model comparison](images/model_comparison.png)
Winner: **Enhanced Linear Regression**

Tree-based models underperformed due to:

- High-dimensional sparsity (after one-hot encoding and polynomial expansion)
- Weak signal-to-noise ratio
- Genre dominating the feature space

Saved model: `spotify_popularity_enhanced_linear_regression.pkl`
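The comparison loop can be sketched on synthetic data (the metrics in the table come from the project's engineered dataset, not from this toy example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((300, 8))                      # stand-in engineered features
y = 40 + 15 * X[:, 0] + rng.normal(0, 8, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: MAE={mean_absolute_error(y_te, pred):.2f} "
          f"RMSE={rmse:.2f} R²={r2_score(y_te, pred):.2f}")
```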
---

## 7. Regression to Classification

A binary label was created using the training-set median popularity (35):

- Class 0: popularity < 35
- Class 1: popularity ≥ 35

Classes were balanced (~50/50), so no resampling was needed.

Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.
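Creating the label is a one-line threshold; a sketch with toy popularity values:

```python
import pandas as pd

popularity = pd.Series([10, 70, 35, 20, 90, 40])

threshold = 35  # the project's training-set median
label = (popularity >= threshold).astype(int)  # Class 1 = "popular"
print(label.tolist())  # [0, 1, 1, 0, 1, 1]
```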
---

## 8. Classification Models

Three models were trained:

| Model               | Accuracy |
|---------------------|----------|
| Logistic Regression | ~0.76    |
| Random Forest       | ~0.75    |
| Gradient Boosting   | ~0.72    |

Winner: **Logistic Regression**

It achieved the best precision-recall balance and the lowest misclassification bias.

Saved model: `spotify_popularity_logistic_regression_classifier.pkl`
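Training and scoring the winning classifier can be sketched as below; the synthetic balanced task stands in for the popularity labels, and precision is computed explicitly since it is the project's headline metric:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary task standing in for the popularity labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Precision: of the tracks predicted "popular", how many actually are.
print(round(precision_score(y_te, clf.predict(X_te)), 2))
```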
### Full Evaluation

![Classification report](images/classification_report.png)

**Logistic Regression** shows the best balance between precision and recall across both classes.
---

## 9. How to Reproduce

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the notebook:

```
Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
```

The preprocessing pipeline includes:

- Standard scaling
- Polynomial feature generation
- One-hot encoding of genres
- K-Means clustering (k=5)

These steps must be applied before loading any saved model.
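Loading a saved model can be sketched with `joblib`; here a toy model is saved to a temporary path and reloaded, standing in for the shipped `.pkl` files, which are loaded the same way after the preprocessing steps above:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the shipped classifier .pkl.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

# Predictions round-trip exactly; real inputs must first pass through
# scaling, polynomial expansion, one-hot encoding, and clustering.
print((loaded.predict(X) == model.predict(X)).all())  # True
```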
---

## 10. Repository Structure

```
project/
├── README.md
├── notebook.ipynb
├── spotify_popularity_enhanced_linear_regression.pkl
├── spotify_popularity_logistic_regression_classifier.pkl
└── data/ (optional)
```

---
## 11. Final Summary

This project builds a complete machine learning pipeline for predicting Spotify track popularity.

Through EDA, feature engineering, regression, and classification, the project demonstrates that:

- Popularity is difficult to predict linearly
- Feature engineering improves model performance
- Enhanced Linear Regression is the best regression model
- Logistic Regression is the best binary popularity classifier

All final models require the full preprocessing pipeline to reproduce predictions.