	# Spotify Track Popularity Prediction

	[For Video Presentation Please Click Here](https://youtu.be/P0bjIb1qknQ)

	---
	This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.

	It is based on the [Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.

	## 1. Dataset Overview

	The goal of this project is to determine whether musical and audio characteristics can predict a track’s popularity (0–100).
	The task is first approached as a regression problem and later reframed as a binary classification problem.

	Target variable: `popularity`

	Type: continuous (0–100)

	---

	## 2. Dataset Description

	Source: Spotify Tracks Dataset, Kaggle
	Size: ~114,000 rows, 20 features, 125+ genres

	Includes:

	- Audio features (danceability, energy, loudness, tempo, valence, etc.)
	- Metadata (track name, artist name)
	- Genre
	- Popularity score

	---

	## 3. Exploratory Data Analysis (EDA)

	### Cleaning
	- Removed a small number of rows with missing metadata
	- Audio features were fully available – no imputation needed
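
A minimal sketch of this cleaning step, assuming the metadata columns are named `track_name` and `artists`. The column names and the toy rows are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Toy rows standing in for the Kaggle data; None marks missing metadata.
df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "artists": ["Artist 1", "Artist 2", None],
    "danceability": [0.61, 0.45, 0.72],
})

# Drop only rows whose metadata is missing; audio features are left as-is.
metadata_cols = ["track_name", "artists"]  # assumed column names
cleaned = df.dropna(subset=metadata_cols).reset_index(drop=True)

# Confirm that the audio feature has no gaps, so no imputation is needed.
missing_audio = int(cleaned["danceability"].isna().sum())
```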

	### Outliers
	- Outliers in duration, tempo, and loudness represent real musical variation
	- No outlier removal was performed

	### Target Distribution
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)
	Popularity is heavily right-skewed: most tracks have low popularity scores, and few reach the top of the range.

	### Correlations
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)
	There are no strong linear correlations between audio features and popularity.
	Several weak but consistent patterns appear:
	- Higher danceability corresponds to higher maximum popularity
	- Louder tracks tend to be more popular
	- Explicit tracks have slightly higher median popularity
	- Genres differ significantly in average popularity
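
The correlation check above can be sketched as follows. The data here is synthetic (mostly noise with a small danceability effect) and the feature subset is an assumption, so the numbers it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "valence": rng.uniform(0, 1, n),
})
# Made-up target: mostly noise with a small danceability effect.
noise = rng.normal(0, 20, n)
df["popularity"] = (30 + 10 * df["danceability"] + noise).clip(0, 100)

features = ["danceability", "loudness", "valence"]
# Pearson correlation of each feature with the target, strongest first.
corr = df[features].corrwith(df["popularity"]).sort_values(key=abs, ascending=False)
```

Pearson correlation only measures linear association, so a weak coefficient does not rule out patterns such as a rising upper envelope.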

	### Visualizations
	1. Danceability vs. Popularity
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/mnvWrzuLfFuSB7akeqoT2.png)
	There is a clear upward trend: songs with higher danceability tend to be more popular. While the relationship is not perfectly linear, the upper envelope rises consistently with danceability.

	2. Energy vs. Popularity
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/4dU_Jpd36IFqE6zZvMn4c.png)
	Popular songs cluster around medium-to-high energy levels. Very low-energy tracks rarely achieve high popularity, showing a clear preference for energetic music.

	3. Loudness vs. Popularity
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/M3PyAqpLXlmdSUroN44gs.png)
	There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, reflecting modern production and streaming trends.

	### Research Questions
	1. What percentage of songs are explicit vs. clean?
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/aGxa55X2K-5yydyEw7a11.png)
	- Most songs in the dataset are clean (91%), while only a small portion (9%) are explicit, showing that explicit content is relatively uncommon on Spotify.

	2. Which musical keys are most common in the dataset?
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/cJHx7bC_y3i7ADzQLMqSR.png)
	- Certain musical keys (especially 0 = C and 7 = G) appear far more frequently, but the distribution does not suggest any direct relationship with popularity.

	3. How does average popularity vary across different music genres?
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)
	- Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.

	### EDA Summary
	The data suggests that no single feature determines popularity.
	Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.

	---
	## 4. Baseline Regression Model

	### Features
	- Numeric audio features
	- One-hot encoded genre

	### Model
	- Linear Regression (scikit-learn)
	- 80/20 train-test split
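
A minimal sketch of this baseline setup, assuming a pandas DataFrame with a few numeric audio columns and a genre column named `track_genre`. The synthetic data, the column subset, and `random_state=42` are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Kaggle data (values are made up).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
    "popularity": rng.integers(0, 101, n),
})

numeric_cols = ["danceability", "energy", "loudness"]
X = df[numeric_cols + ["track_genre"]]
y = df["popularity"]

# 80/20 train-test split, as in the baseline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode genre; pass numeric audio features through unchanged.
preprocess = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)
baseline = Pipeline([("prep", preprocess), ("model", LinearRegression())])
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)
```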

	### Performance
	- MAE: ~14.08
	- RMSE: ~19.14
	- R²: ~0.26
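
For reference, the three metrics can be computed by hand as in the sketch below. The helper name `regression_metrics` and the toy values are hypothetical; the notebook most likely used scikit-learn's metric functions instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)       # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy values, not taken from the dataset.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

Read this way, an R² of about 0.26 means the baseline explains roughly a quarter of the variance in popularity, and an MAE of about 14 means predictions are off by about 14 points on the 0–100 scale on average.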

	Genre features dominate model coefficients.
	Audio trends include:
	- Positive effect: danceability, explicit
	- Negative effect: valence, speechiness
	- Minimal effect: key, mode, liveness

	### Feature Importance
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)
	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/8X2jU4oJ8bIXYiuY-jDWi.png)

	- Genre features dominate the coefficients.
	- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.

	Among non-genre features:

	- Positive: danceability, explicit
	- Negative: valence, speechiness, energy
	- Weak influence: liveness, tempo, key, mode

	---

	## 5. Feature Engineering

	### Scaling
	StandardScaler applied to all numeric features.

	### Polynomial Features
	PolynomialFeatures(degree=2) used to capture interactions and non-linear relationships.

	### PCA
	Performed only for visualization (2 components).
	No distinct clusters observed in PCA space.

	### Clustering
	K-Means (k=5) applied to scaled numeric features.
	Added engineered features:
	- `cluster_id`
	- `cluster_distance`

	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)

	Clusters differ by energy, danceability, valence, acousticness, and average popularity.
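
One plausible way to derive the two engineered features is sketched below. It assumes `cluster_distance` means the distance to the nearest centroid; the README does not define it, and the random data, `n_init=10`, and `random_state=42` are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in for numeric audio features

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

# Cluster label assigned to each row during fitting.
cluster_id = kmeans.labels_
# transform() returns the distance from each row to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
cluster_distance = kmeans.transform(X_scaled).min(axis=1)
```

In a train/test setting, the scaler and K-Means would be fitted on the training split only and then applied to the test split with `transform` and `predict`.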

	---

	## 6. Improved Regression Models

	Three models were trained on the engineered dataset:

	| Model                      | MAE    | RMSE   | R²    |
	|----------------------------|--------|--------|-------|
	| Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
	| Random Forest              | ~15.97 | ~19.95 | ~0.20 |
	| Gradient Boosting          | ~17.02 | ~20.55 | ~0.15 |

	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)


	Winner: Enhanced Linear Regression

	Tree-based models underperformed due to:
	- High-dimensional sparsity (after one-hot + polynomials)
	- Weak signal-to-noise ratio
	- Genre dominating feature space

	Saved model: `spotify_popularity_enhanced_linear_regression.pkl`

	---

	## 7. Regression to Classification

	A binary label was created using the training-set median popularity (35):

	- Class 0: popularity < 35
	- Class 1: popularity ≥ 35

	Classes were balanced (~50/50), so no resampling was needed.
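
The labeling rule can be sketched as follows. The popularity values are made up and chosen so that the training median comes out at 35, matching the threshold above.

```python
import numpy as np

# Made-up popularity scores for illustration.
train_popularity = np.array([0, 12, 20, 35, 35, 48, 60, 77])
test_popularity = np.array([5, 35, 90])

# Compute the threshold on the training split only, to avoid leakage.
threshold = np.median(train_popularity)

# Class 1 if popularity >= threshold, otherwise class 0.
y_train_cls = (train_popularity >= threshold).astype(int)
y_test_cls = (test_popularity >= threshold).astype(int)
```

Because ties at the median all fall into class 1, the split is close to 50/50 but not necessarily exact.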

	Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.
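
To make the trade-off concrete, the sketch below computes both metrics for the positive class from toy labels. The helper name `precision_recall` and the labels are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels: 1 = popular, 0 = not popular.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Precision falls with every non-popular track predicted as popular (a false positive), which is the error treated as more costly here.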

	---
	## 8. Classification Models

	Three models were trained:

	| Model               | Accuracy |
	|---------------------|----------|
	| Logistic Regression | ~0.76    |
	| Random Forest       | ~0.75    |
	| Gradient Boosting   | ~0.72    |

	Winner: Logistic Regression

	It achieved the best precision-recall balance and the lowest misclassification bias.

	Saved model: `spotify_popularity_logistic_regression_classifier.pkl`

	### Full Evaluation

	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/tVcYnJul6_zp1GIY8S8Bl.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/HDgdidPPTISme1hMCnZkH.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)

	Logistic Regression shows the best balance between precision and recall across both classes.

	---

	## 9. How to Reproduce

	Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	Run the notebook:
	```
	Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
	```

	The preprocessing pipeline includes:
	- Standard scaling
	- Polynomial feature generation
	- One-hot encoding of genres
	- K-Means clustering (k=5)

	These steps must be applied to new input data, in the same way as during training, before a saved model is used for prediction.
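
One way to avoid repeating these steps by hand is to bundle preprocessing and estimator into a single scikit-learn `Pipeline` and persist that object. This is an alternative sketch, not a description of what the `.pkl` files in this repository contain. It uses random data and omits the genre encoding and K-Means features for brevity.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # stand-in numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary target

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Round-trip through pickle: the restored object carries the fitted
# scaler and polynomial expansion along with the classifier.
restored = pickle.loads(pickle.dumps(pipeline))
same = np.array_equal(pipeline.predict(X), restored.predict(X))
```

With this layout, raw feature rows can be passed straight to `restored.predict`, since the fitted transformers are applied internally.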

	---
	## 10. Repository Structure

	```
	project/
	│── README.md
	│── Leelu_Spotify_Popularity_Assignment_2.ipynb
	│── spotify_popularity_enhanced_linear_regression.pkl
	│── spotify_popularity_logistic_regression_classifier.pkl
	```

	---
	## 11. Final Summary

	This project builds a complete machine learning pipeline for predicting Spotify track popularity.
	Through EDA, feature engineering, regression, and classification, the project demonstrates:

	- Popularity is difficult to predict linearly
	- Feature engineering gives the linear model only a small gain (R² ~0.26 to ~0.27)
	- Enhanced Linear Regression is best for regression
	- Logistic Regression is best for binary popularity classification

	All final models require the full preprocessing pipeline to reproduce predictions.