| # Spotify Track Popularity Prediction |
|
|
| [For Video Presentation Please Click Here](https://youtu.be/P0bjIb1qknQ) |
|
|
| --- |
| This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models. |
|
|
| It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle. |
| ## 1. Project Overview |
|
|
| The goal of this project is to determine whether musical and audio characteristics can predict a track's popularity (0–100). |
| The task is first approached as a regression problem and later reframed as a binary classification problem. |
|
|
| *Target variable:* `popularity` |
|
|
| *Type:* continuous (0–100) |
|
|
| --- |
|
|
| ## 2. Dataset Description |
|
|
| Source: Spotify Tracks Dataset, Kaggle |
| Size: ~114,000 rows, 20 features, 125+ genres |
|
|
| Includes: |
|
|
| - Audio features (danceability, energy, loudness, tempo, valence, etc.) |
| - Metadata (track name, artist name) |
| - Genre |
| - Popularity score |
| |
| --- |
|
|
| ## 3. Exploratory Data Analysis (EDA) |
|
|
| ### Cleaning |
| - Removed a small number of rows with missing metadata |
| - Audio features were fully available, so no imputation was needed |
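
The cleaning step above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the notebook's actual code; the column names (`track_name`, `artists`, `danceability`) and the values are assumptions.

```python
import pandas as pd

# Toy stand-in for the Kaggle table; column names and values are assumptions.
df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C", "Song D"],
    "artists": ["Artist 1", "Artist 2", None, "Artist 4"],
    "danceability": [0.71, 0.42, 0.55, 0.63],
    "popularity": [52, 0, 18, 77],
})

metadata_cols = ["track_name", "artists"]
audio_cols = ["danceability"]

# Count missing values per column before deciding what to drop.
missing_before = df.isna().sum()

# Drop only rows whose metadata is missing; audio features are left untouched.
clean = df.dropna(subset=metadata_cols).reset_index(drop=True)

print(missing_before)
print(f"rows kept: {len(clean)} of {len(df)}")
```

Restricting `dropna` to the metadata columns keeps the rule explicit: a row is removed because its metadata is missing, never because of an audio value.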
|
|
| ### Outliers |
| - Outliers in duration, tempo, and loudness represent real musical variation |
| - No outlier removal was performed |
|
|
| ### Target Distribution |
|  |
| Popularity is heavily right-skewed: most songs are minimally popular. |
|
|
| ### Correlations |
|  |
| There are no strong linear correlations between audio features and popularity. |
| Several weak but consistent patterns appear: |
| - Higher danceability corresponds to higher maximum popularity |
| - Louder tracks tend to be more popular |
| - Explicit tracks have slightly higher median popularity |
| - Genres differ significantly in average popularity |
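
A correlation check of this kind might look like the sketch below. It uses synthetic data with a deliberately weak signal, so the printed numbers are not the project's results; the feature names and coefficients are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; in the notebook this would be the cleaned Kaggle table.
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "valence": rng.uniform(0, 1, n),
})
# Popularity with only a weak dependence on the features plus heavy noise.
df["popularity"] = (
    30 + 8 * df["danceability"] + 0.3 * df["loudness"] + rng.normal(0, 20, n)
).clip(0, 100)

# Pearson correlation of each feature with the target, strongest first.
corr = (
    df.corr()["popularity"]
    .drop("popularity")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr.round(3))
```

Sorting by absolute value puts strong negative correlations next to strong positive ones.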
|
|
| ### Visualizations |
| 1. Danceability vs. Popularity |
|  |
| Songs with higher danceability tend to be more popular. The relationship is weak and not linear, but the upper envelope of popularity rises consistently with danceability. |
|
|
| 2. Energy vs. Popularity |
|  |
| Popular songs cluster around medium-to-high energy levels. Very low-energy tracks rarely achieve high popularity, showing a clear preference for energetic music. |
|
|
| 3. Loudness vs. Popularity |
|  |
| There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, reflecting modern production and streaming trends. |
|
|
| ### **Research Questions** |
| 1. What percentage of songs are explicit vs. clean? |
|  |
| - Most songs in the dataset are clean (91%) and only 9% are explicit, so explicit content is relatively uncommon in this sample. |
|
|
| 2. Which musical keys are most common in the dataset? |
|  |
| - Certain musical keys (especially 0 = C and 7 = G) appear far more frequently, but the distribution does not suggest any direct relationship with popularity. |
|
|
| 3. How does average popularity vary across different music genres? |
|  |
| - Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform. |
|
|
| ### EDA Summary |
| The data suggests that no single feature determines popularity. |
| Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions. |
|
|
| --- |
| ## 4. Baseline Regression Model |
|
|
| ### Features |
| - Numeric audio features |
| - One-hot encoded genre |
|
|
| ### Model |
| - Linear Regression (scikit-learn) |
| - 80/20 train-test split |
|
|
| ### Performance |
| - MAE: ~14.08 |
| - RMSE: ~19.14 |
| - R²: ~0.26 |
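
The baseline setup and its three metrics can be sketched as below. This is an illustration on synthetic data, not the notebook code: the feature matrix, coefficients, and noise level are made up, and genre one-hot encoding is left out for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # stand-in for numeric audio features
y = 35 + X @ np.array([4.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 15, 1000)

# 80/20 split, as in the baseline setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

RMSE is never smaller than MAE. A wide gap between them, such as the ~19.14 vs. ~14.08 reported above, points to a tail of large errors.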
|
|
| The fitted coefficients are dominated by the genre indicators; the audio features have much smaller effects. |
|
|
| ### **Feature Importance** |
|  |
|  |
|
|
| - Genre features dominate the coefficients. |
| - Most genre coefficients are negative, meaning those genres are predicted to be less popular than the reference genre. |
|
|
| Among non-genre features: |
|
|
| - **Positive:** danceability, explicit |
| - **Negative:** valence, speechiness, energy |
| - **Weak influence:** liveness, tempo, key, mode |
| |
| --- |
|
|
| ## 5. Feature Engineering |
|
|
| ### Scaling |
| StandardScaler applied to all numeric features. |
|
|
| ### Polynomial Features |
| PolynomialFeatures(degree=2) used to capture interactions and non-linear relationships. |
|
|
| ### PCA |
| Performed only for visualization (2 components). |
| No distinct clusters observed in PCA space. |
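
A two-component projection for plotting could be produced as in this sketch. It runs on random stand-in data, so it says nothing about the structure of the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))  # stand-in for the numeric audio features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)  # shape (n_samples, 2), ready for a scatter plot

print(coords.shape)
print(pca.explained_variance_ratio_.round(3))
```

`explained_variance_ratio_` shows how much of the variance the 2-D view retains. A low total is one reason a scatter plot may show no visible clusters.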
|
|
| ### Clustering |
| K-Means (k=5) applied to scaled numeric features. |
| Added engineered features: |
| - `cluster_id` |
| - `cluster_distance` |
|
|
|  |
|
|
| Clusters differ in energy, danceability, valence, acousticness, and average popularity. |
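
The two engineered columns could be derived as in the sketch below. It assumes that `cluster_distance` means the Euclidean distance from each track to its nearest centroid, which the text above does not state explicitly; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))  # stand-in for numeric audio features
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

cluster_id = kmeans.labels_
# transform() returns the distance to every centroid; keep the nearest one.
cluster_distance = kmeans.transform(X_scaled).min(axis=1)

# Append both engineered columns to the scaled feature matrix.
X_engineered = np.column_stack([X_scaled, cluster_id, cluster_distance])
print(X_engineered.shape)
```

To keep test data out of the clustering, fit K-Means on the training split and assign test rows with `kmeans.predict` and `kmeans.transform`.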
|
|
| --- |
|
|
| ## 6. Improved Regression Models |
|
|
| Three models were trained on the engineered dataset: |
|
|
| | Model | MAE | RMSE | R² | |
| |------------------------------|--------|--------|-------| |
| | Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 | |
| | Random Forest | ~15.97 | ~19.95 | ~0.20 | |
| | Gradient Boosting | ~17.02 | ~20.55 | ~0.15 | |
|
|
|  |
|
|
|
|
| Winner: **Enhanced Linear Regression** |
|
| Tree-based models underperformed, likely because of: |
| - High-dimensional, sparse inputs (after one-hot encoding and polynomial expansion) |
| - A weak signal-to-noise ratio |
| - Genre indicators dominating the feature space |
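
A comparison like the one in the table can be run with a small loop. The sketch below uses synthetic data and near-default hyperparameters, both assumptions, so its numbers are unrelated to the results reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
# Weak signal, heavy noise.
y = 35 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 18, 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every model on the same split and collect the same three metrics.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, m in results.items():
    print(f"{name:<18} MAE={m['MAE']:.2f} RMSE={m['RMSE']:.2f} R2={m['R2']:.2f}")
```

Using one split and one metric set for all models keeps the comparison fair.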
|
|
| Saved model: `spotify_popularity_enhanced_linear_regression.pkl` |
|
|
| --- |
|
|
| ## 7. Regression to Classification |
|
|
| A binary label was created using the training-set median popularity (35): |
|
|
| - Class 0: popularity < 35 |
| - Class 1: popularity ≥ 35 |
|
|
| Classes were balanced (~50/50), so no resampling was needed. |
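
The labeling rule can be sketched as follows, on uniformly random popularity scores rather than the real target. The point is that the threshold is computed from the training split only and then reused for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
popularity = rng.integers(0, 101, size=1000)  # stand-in for the target column
X = rng.normal(size=(1000, 3))

X_train, X_test, pop_train, pop_test = train_test_split(
    X, popularity, test_size=0.2, random_state=0
)

# Threshold comes from the training split only, so the test set stays unseen.
threshold = np.median(pop_train)

y_train = (pop_train >= threshold).astype(int)  # 1 = popular, 0 = not popular
y_test = (pop_test >= threshold).astype(int)

print(f"threshold={threshold}, share of class 1 in train={y_train.mean():.2f}")
```

A median split yields roughly balanced classes by construction, which is why no resampling is needed.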
|
|
| Precision was prioritized over recall, since a false positive (predicting an unpopular track as popular) is considered more costly than a false negative. |
|
|
| --- |
| ## 8. Classification Models |
|
|
| Three models were trained: |
|
|
| | Model | Accuracy | |
| |-------------------------|----------| |
| | Logistic Regression | ~0.76 | |
| | Random Forest | ~0.75 | |
| | Gradient Boosting | ~0.72 | |
|
|
| Winner: **Logistic Regression** |
|
| It achieved the highest accuracy, the best precision-recall balance, and the least bias toward misclassifying either class. |
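
Accuracy, precision, and recall for a logistic regression classifier can be computed as in this sketch. The data is synthetic, and the scaler-plus-classifier pipeline is an assumption rather than the notebook's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(800, 5))
signal = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (signal + rng.normal(0, 1, 800) > 0).astype(int)  # noisy binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)  # of predicted "popular", share that are right
recall = recall_score(y_test, pred)        # of truly "popular", share that were found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

When precision matters more than recall, raise the decision threshold on `clf.predict_proba(X_test)[:, 1]` above 0.5. This trades missed popular tracks for fewer false positives.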
|
|
| Saved model: `spotify_popularity_logistic_regression_classifier.pkl` |
|
|
| ### **Full Evaluation** |
|
|
|  |
|
|
|  |
|
|
|  |
|
|
| **Logistic Regression** shows the best balance between precision and recall across both classes. |
|
|
| --- |
|
|
| ## 9. How to Reproduce |
|
|
| Install dependencies: |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| Run the notebook: |
| ``` |
| Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb |
| ``` |
|
|
| The preprocessing pipeline includes: |
| - Standard scaling |
| - Polynomial feature generation |
| - One-hot encoding of genres |
| - K-Means clustering (k=5) |
|
|
| These steps must be applied to the input data, exactly as during training, before it is passed to any saved model. |
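
One way to make this requirement harder to miss is to serialize the preprocessing together with the estimator. The sketch below illustrates that idea on synthetic data. It is not how the `.pkl` files in this repository were produced, and it omits the genre encoding and clustering steps.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(5)
X_train = rng.normal(size=(300, 4))
y_train = 35 + 4 * X_train[:, 0] + rng.normal(0, 10, 300)

# Preprocessing and model are fitted together, so both are serialized as one object.
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

blob = pickle.dumps(pipeline)  # in a project this would be written to a .pkl file
restored = pickle.loads(blob)

X_new = rng.normal(size=(3, 4))  # raw, unscaled features
print(restored.predict(X_new))
```

If a saved file contains only the estimator, the fitted scaler, polynomial expander, encoder, and K-Means model must be saved and reloaded separately. Refitting them on new data would change the feature values the model sees.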
|
|
| --- |
| ## 10. Repository Structure |
|
|
| ``` |
| project/ |
| ├── README.md |
| ├── Leelu_Spotify_Popularity_Assignment_2.ipynb |
| ├── spotify_popularity_enhanced_linear_regression.pkl |
| └── spotify_popularity_logistic_regression_classifier.pkl |
| ``` |
|
|
| --- |
| ## 11. Final Summary |
|
|
| This project builds a complete machine learning pipeline for predicting Spotify track popularity. |
| Through EDA, feature engineering, regression, and classification, the project demonstrates: |
|
|
| - Popularity is difficult to predict from audio features and genre alone (best R² ≈ 0.27) |
| - Feature engineering gives a small improvement over the baseline (R² from ~0.26 to ~0.27) |
| - Enhanced Linear Regression is best for regression |
| - Logistic Regression is best for binary popularity classification |
|
|
| All final models require the full preprocessing pipeline to reproduce predictions. |
|
|