# Spotify Track Popularity Prediction

[Watch the video presentation](https://youtu.be/P0bjIb1qknQ)

---

This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models. It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.

## 1. Project Overview

The goal of this project is to determine whether musical and audio characteristics can predict a track’s popularity (0–100). The task is first approached as a regression problem and later reframed as a binary classification problem.

- *Target variable:* `popularity`
- *Type:* continuous (0–100)

---

## 2. Dataset Description

- Source: Spotify Tracks Dataset, Kaggle
- Size: ~114,000 rows, 20 features, 125+ genres

Includes:

- Audio features (danceability, energy, loudness, tempo, valence, etc.)
- Metadata (track name, artist name)
- Genre
- Popularity score

---

## 3. Exploratory Data Analysis (EDA)

### Cleaning

- Removed a small number of rows with missing metadata
- Audio features were fully available, so no imputation was needed

### Outliers

- Outliers in duration, tempo, and loudness represent real musical variation
- No outlier removal was performed

### Target Distribution

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)

Popularity is heavily right-skewed: most songs have low popularity scores.

### Correlations

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)

There are no strong linear correlations between audio features and popularity. Several weak but consistent patterns appear:

- Higher danceability corresponds to higher maximum popularity
- Louder tracks tend to be more popular
- Explicit tracks have slightly higher median popularity
- Genres differ significantly in average popularity

### Visualizations

1. Danceability vs. Popularity

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/mnvWrzuLfFuSB7akeqoT2.png)

   Songs with higher danceability tend to be more popular. The relationship is not perfectly linear, but the upper envelope of popularity rises consistently with danceability.

2. Energy vs. Popularity

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/4dU_Jpd36IFqE6zZvMn4c.png)

   Popular songs cluster around medium-to-high energy levels. Very low-energy tracks rarely achieve high popularity, which suggests a preference for energetic music.

3. Loudness vs. Popularity

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/M3PyAqpLXlmdSUroN44gs.png)

   There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, reflecting modern production and streaming trends.

### **Research Questions**

1. What percentage of songs are explicit vs. clean?

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/aGxa55X2K-5yydyEw7a11.png)

   - Most songs in the dataset are clean (91%), while only a small portion (9%) are explicit, so explicit content is relatively uncommon in this dataset.

2. Which musical keys are most common in the dataset?

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/cJHx7bC_y3i7ADzQLMqSR.png)

   - Certain musical keys (especially 0 = C and 7 = G) appear far more frequently, but the distribution does not suggest any direct relationship with popularity.

3. How does average popularity vary across different music genres?

   ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)

   - Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.
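The three research questions above each reduce to a short pandas aggregation. The following is a minimal sketch, not the notebook's actual code: the column names `explicit`, `key`, `track_genre`, and `popularity` are assumed to match the Kaggle CSV, and the tiny DataFrame is made-up toy data used only so the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; the values are invented for illustration.
df = pd.DataFrame({
    "explicit":    [False, False, True, False],
    "key":         [0, 7, 0, 2],
    "track_genre": ["k-pop", "k-pop", "chill", "chill"],
    "popularity":  [60, 70, 40, 50],
})

# Q1: share of explicit vs. clean tracks, in percent.
explicit_share = (
    df["explicit"]
    .map({True: "explicit", False: "clean"})
    .value_counts(normalize=True)
    * 100
)

# Q2: how often each musical key appears (0 = C, 7 = G, ...).
key_counts = df["key"].value_counts().sort_index()

# Q3: average popularity per genre, highest first.
genre_popularity = (
    df.groupby("track_genre")["popularity"]
    .mean()
    .sort_values(ascending=False)
)

print(explicit_share, key_counts, genre_popularity, sep="\n\n")
```

On the real dataset, the same three expressions would be applied to the DataFrame loaded from the CSV and then plotted.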
### EDA Summary

The data suggests that no single feature determines popularity. Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.

---

## 4. Baseline Regression Model

### Features

- Numeric audio features
- One-hot encoded genre

### Model

- Linear Regression (scikit-learn)
- 80/20 train-test split

### Performance

- MAE: ~14.08
- RMSE: ~19.14
- R²: ~0.26

### **Feature Importance**

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/8X2jU4oJ8bIXYiuY-jDWi.png)

- Genre features dominate the coefficients.
- Most genre coefficients are negative, showing that niche genres perform worse than the reference genre.

Among non-genre features:

- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode

---

## 5. Feature Engineering

### Scaling

`StandardScaler` applied to all numeric features.

### Polynomial Features

`PolynomialFeatures(degree=2)` used to capture interactions and non-linear relationships.

### PCA

Performed only for visualization (2 components). No distinct clusters were observed in PCA space.

### Clustering

K-Means (k=5) applied to scaled numeric features. Added engineered features:

- `cluster_id`
- `cluster_distance`

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)

Clusters differ by energy, danceability, valence, acousticness, and average popularity.

---

## 6. Improved Regression Models

Three models were trained on the engineered dataset:

| Model                      | MAE    | RMSE   | R²    |
|----------------------------|--------|--------|-------|
| Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
| Random Forest              | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting          | ~17.02 | ~20.55 | ~0.15 |

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)

Winner: **Enhanced Linear Regression**

Tree-based models underperformed, likely due to:

- High-dimensional sparsity (after one-hot encoding and polynomial expansion)
- Weak signal-to-noise ratio
- Genre dominating the feature space

Saved model: `spotify_popularity_enhanced_linear_regression.pkl`

---

## 7. Regression to Classification

A binary label was created using the training-set median popularity (35):

- Class 0: popularity < 35
- Class 1: popularity ≥ 35

Classes were balanced (~50/50), so no resampling was needed. Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.

---

## 8. Classification Models

Three models were trained:

| Model               | Accuracy |
|---------------------|----------|
| Logistic Regression | ~0.76    |
| Random Forest       | ~0.75    |
| Gradient Boosting   | ~0.72    |

Winner: **Logistic Regression**

It achieved the best precision-recall balance and the lowest misclassification bias.

Saved model: `spotify_popularity_logistic_regression_classifier.pkl`

### **Full Evaluation**

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/tVcYnJul6_zp1GIY8S8Bl.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/HDgdidPPTISme1hMCnZkH.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)

**Logistic Regression** shows the best balance between precision and recall across both classes.

---

## 9. How to Reproduce

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the notebook:

```
Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
```

The preprocessing pipeline includes:

- Standard scaling
- Polynomial feature generation
- One-hot encoding of genres
- K-Means clustering (k=5)

These steps must be applied to the input data before it is passed to a saved model.

---

## 10. Repository Structure

```
project/
├── README.md
├── Leelu_Spotify_Popularity_Assignment_2.ipynb
├── spotify_popularity_enhanced_linear_regression.pkl
└── spotify_popularity_logistic_regression_classifier.pkl
```

---

## 11. Final Summary

This project builds a complete machine learning pipeline for predicting Spotify track popularity. Through EDA, feature engineering, regression, and classification, the project demonstrates:

- Popularity is difficult to predict with linear relationships alone
- Feature engineering yields a small gain for linear regression (R² ~0.26 to ~0.27)
- Enhanced Linear Regression is best for regression
- Logistic Regression is best for binary popularity classification

All final models require the full preprocessing pipeline to reproduce predictions.
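One way to keep preprocessing and model together is to wrap both in a single scikit-learn `Pipeline` and pickle that object. The sketch below is an assumption about how the listed steps (scaling, degree-2 polynomial features, one-hot genres, K-Means with k=5, median-based label) could be combined; it is not the notebook's actual code. `ClusterFeatures`, the column names, the toy data, and the file name `pipeline_sketch.pkl` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler


class ClusterFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical helper that emits cluster_id and cluster_distance."""

    def __init__(self, n_clusters=5, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        distances = self.kmeans_.transform(X)  # distance to every centroid
        cluster_id = distances.argmin(axis=1)  # index of the nearest centroid
        cluster_distance = distances.min(axis=1)  # distance to that centroid
        return np.column_stack([cluster_id, cluster_distance])


# Toy data standing in for the real dataset (values are random, not real).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "loudness": rng.uniform(-30, 0, n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
    "popularity": rng.integers(0, 101, n),
})
train, test = df.iloc[:48], df.iloc[48:]

# Binary label from the training-set median, as in the classification section.
threshold = train["popularity"].median()
y_train = (train["popularity"] >= threshold).astype(int)

numeric_cols = ["danceability", "energy", "loudness"]
numeric_branch = Pipeline([
    ("scale", StandardScaler()),
    ("expand", FeatureUnion([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("cluster", ClusterFeatures(n_clusters=5)),
    ])),
])
preprocess = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    ("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train.drop(columns="popularity"), y_train)

# Saving the whole pipeline means the preprocessing travels with the model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline_sketch.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

preds = restored.predict(test.drop(columns="popularity"))
print(preds)
```

If the saved `.pkl` files in this repository contain only the fitted estimator rather than a full pipeline, the fitted scaler, encoder, and K-Means objects from the notebook would need to be reused separately before calling `predict`.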