# Spotify Track Popularity Prediction

---

This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.

It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.

## 1. Dataset Overview

The goal of this project is to determine whether musical and audio characteristics can predict a track's popularity (0–100).

The task is first approached as a regression problem and later reframed as a binary classification problem.

*Target variable:* `popularity`

*Type:* continuous (0–100)
---

## 2. Dataset Description

Source: Spotify Tracks Dataset, Kaggle

Size: ~114,000 rows, 20 features, 125+ genres

Includes:

- Audio features (danceability, energy, loudness, tempo, valence, etc.)
- Metadata (track name, artist name)
- Genre
- Popularity score
---

## 3. Exploratory Data Analysis (EDA)

### Cleaning

- Removed a small number of rows with missing metadata
- Audio features were fully available → no imputation needed
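The cleaning step amounts to dropping rows with missing metadata; a minimal pandas sketch (the tiny frame and column names such as `track_name` and `artists` are illustrative stand-ins for the real dataset):

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; column names are illustrative.
df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "artists": ["X", "Y", None],
    "danceability": [0.71, 0.55, 0.63],  # audio features are fully populated
    "popularity": [80, 42, 17],
})

# Drop the small number of rows with missing metadata;
# audio features need no imputation.
clean = df.dropna(subset=["track_name", "artists"])
print(len(clean))  # only rows with complete metadata remain
```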
### Outliers

- Outliers in duration, tempo, and loudness represent real musical variation
- No outlier removal was performed
### Target Distribution

![Popularity distribution](images/popularity_dist.png)

Popularity is heavily right-skewed: most songs are minimally popular.

### Correlations

![Correlation heatmap](images/correlation_heatmap.png)

There are no strong linear correlations between audio features and popularity. Several weak but consistent patterns appear:

- Higher danceability corresponds to higher maximum popularity
- Louder tracks tend to be more popular
- Explicit tracks have slightly higher median popularity
- Genres differ significantly in average popularity
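A minimal sketch of the correlation computation behind these observations, on synthetic stand-in data (the real heatmap uses the full feature set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "loudness": rng.normal(-8.0, 3.0, n),
    "valence": rng.random(n),
})
# Weak linear signal, mimicking what the EDA found on the real data.
df["popularity"] = 30 + 5 * df["danceability"] + rng.normal(0, 10, n)

# Linear association of each audio feature with the target.
corr = df.corr(numeric_only=True)["popularity"].drop("popularity")
print(corr.sort_values(ascending=False))
```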
### Research Questions

1. What percentage of songs are explicit vs. clean?

![Genre popularity](images/genre_popularity.png)

Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.

### EDA Summary

The data suggests that no single feature determines popularity.

Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.
---

## 4. Baseline Regression Model

### Features

- Numeric audio features
- One-hot encoded genre (`track_genre`)

### Model

- Linear Regression (scikit-learn)
- 80/20 train-test split
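The baseline setup can be sketched as a scikit-learn pipeline; the synthetic frame and its columns (`danceability`, `energy`, `track_genre`) are stand-ins for the real dataset, not the project's actual training code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
})
y = 50 + 20 * df["danceability"] + rng.normal(0, 5, n)

# Genre is one-hot encoded; numeric audio features pass through unchanged.
pre = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())])

# 80/20 train-test split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # R² on the held-out split
```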
### Performance

- MAE: ~14.08
- RMSE: ~19.14
- R²: ~0.26

The baseline model explains about **26%** of the variance in popularity. Genre features dominate the model coefficients.

Audio trends include:

- Positive effect: danceability, explicit
- Negative effect: valence, speechiness
- Minimal effect: key, mode, liveness

### Feature Importance

![Feature importance](images/feature_importance.png)

- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.

Among non-genre features:

- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode
---

## 5. Feature Engineering

### Scaling

`StandardScaler` was applied to all numeric features.

### Polynomial Features

`PolynomialFeatures(degree=2)` was used to capture interactions and non-linear relationships.
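The scaling and polynomial-expansion steps can be sketched as a small pipeline (the three-column array is a toy stand-in, e.g. danceability, loudness, tempo):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[0.7, -5.0, 120.0],
              [0.4, -9.0,  90.0],
              [0.9, -3.5, 140.0]])  # toy stand-in for 3 audio features

# Scale first, then expand: degree-2 terms add squares and pairwise interactions.
fe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2, include_bias=False))
X_fe = fe.fit_transform(X)
print(X_fe.shape)  # 3 original + 3 squared + 3 interaction terms = 9 columns
```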
### PCA

PCA was performed only for visualization (2 components).

No distinct clusters were observed in PCA space.
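The visualization step reduces the scaled feature matrix to two coordinates for plotting; a minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((100, 8))  # stand-in for the numeric audio-feature matrix

# Two components are kept purely for a 2-D scatter plot.
X_scaled = StandardScaler().fit_transform(X)
coords = PCA(n_components=2).fit_transform(X_scaled)
print(coords.shape)  # (100, 2)
```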
### Clustering

K-Means (k=5) was applied to the scaled numeric features.

Two engineered features were added:

- `cluster_id`
- `cluster_distance`

![Clusters in PCA space](images/clusters_pca.png)

Clusters differ by energy, danceability, valence, acousticness, and average popularity.
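One plausible way to derive the two cluster features, sketched on random stand-in data (the source does not show the exact code, so treat the `cluster_distance` definition as an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.random((300, 6))  # stand-in for the numeric feature matrix

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

cluster_id = km.labels_                                # hard assignment per track
cluster_distance = km.transform(X_scaled).min(axis=1)  # distance to own centroid
print(len(set(cluster_id)), cluster_distance.shape)
```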
---

## 6. Improved Regression Models

Three models were trained on the engineered dataset:

| Model                      | MAE    | RMSE   | R²    |
|----------------------------|--------|--------|-------|
| Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
| Random Forest              | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting          | ~17.02 | ~20.55 | ~0.15 |

![Model comparison](images/model_comparison.png)
Winner: **Enhanced Linear Regression**

Tree-based models underperformed due to:

- High-dimensional sparsity (after one-hot encoding and polynomial expansion)
- Weak signal-to-noise ratio
- Genre dominating the feature space

Saved model: `spotify_popularity_enhanced_linear_regression.pkl`
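The comparison loop can be sketched on synthetic data (the metrics in the table come from the project's engineered dataset, not from this toy example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((300, 8))                      # stand-in engineered features
y = 40 + 15 * X[:, 0] + rng.normal(0, 8, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: MAE={mean_absolute_error(y_te, pred):.2f} "
          f"RMSE={rmse:.2f} R²={r2_score(y_te, pred):.2f}")
```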
---

## 7. Regression to Classification

A binary label was created using the training-set median popularity (35):

- Class 0: popularity < 35
- Class 1: popularity ≥ 35

Classes were balanced (~50/50), so no resampling was needed.

Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.
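Creating the label is a one-line threshold; a sketch with toy popularity values:

```python
import pandas as pd

popularity = pd.Series([10, 70, 35, 20, 90, 40])

threshold = 35  # the project's training-set median
label = (popularity >= threshold).astype(int)  # Class 1 = "popular"
print(label.tolist())  # [0, 1, 1, 0, 1, 1]
```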
---

## 8. Classification Models

Three models were trained:

| Model               | Accuracy |
|---------------------|----------|
| Logistic Regression | ~0.76    |
| Random Forest       | ~0.75    |
| Gradient Boosting   | ~0.72    |

Winner: **Logistic Regression**

It achieved the best precision-recall balance and the lowest misclassification bias.

Saved model: `spotify_popularity_logistic_regression_classifier.pkl`
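Training and scoring the winning classifier can be sketched as below; the synthetic balanced task stands in for the popularity labels, and precision is computed explicitly since it is the project's headline metric:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary task standing in for the popularity labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Precision: of the tracks predicted "popular", how many actually are.
print(round(precision_score(y_te, clf.predict(X_te)), 2))
```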
### Full Evaluation

![Classification report](images/classification_report.png)

**Logistic Regression** shows the best balance between precision and recall across both classes.
---

## 9. How to Reproduce

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the notebook:

```
Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
```

The preprocessing pipeline includes:

- Standard scaling
- Polynomial feature generation
- One-hot encoding of genres
- K-Means clustering (k=5)

These steps must be applied before loading any saved model.
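Loading a saved model can be sketched with `joblib`; here a toy model is saved to a temporary path and reloaded, standing in for the shipped `.pkl` files, which are loaded the same way after the preprocessing steps above:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the shipped classifier .pkl.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

# Predictions round-trip exactly; real inputs must first pass through
# scaling, polynomial expansion, one-hot encoding, and clustering.
print((loaded.predict(X) == model.predict(X)).all())  # True
```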
---

## 10. Repository Structure

```
project/
├── README.md
├── notebook.ipynb
├── spotify_popularity_enhanced_linear_regression.pkl
├── spotify_popularity_logistic_regression_classifier.pkl
└── data/ (optional)
```

---
## 11. Final Summary

This project builds a complete machine learning pipeline for predicting Spotify track popularity.

Through EDA, feature engineering, regression, and classification, the project demonstrates that:

- Popularity is difficult to predict linearly
- Feature engineering improves model performance
- Enhanced Linear Regression is the best regression model
- Logistic Regression is the best binary popularity classifier

All final models require the full preprocessing pipeline to reproduce predictions.