# Spotify Track Popularity Prediction

[Watch the video presentation](https://youtu.be/P0bjIb1qknQ)

---
This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.

It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle. 

## 1. Dataset Overview

The goal of this project is to determine whether musical and audio characteristics can predict a track’s popularity (0–100).
The task is first approached as a regression problem and later reframed as a binary classification problem.

*Target variable:* `popularity` 

*Type:* numeric score from 0 to 100, treated as continuous for regression

---

## 2. Dataset Description

Source: Spotify Tracks Dataset, Kaggle  
Size: ~114,000 rows, 20 features, 125+ genres

Includes:  

- Audio features (danceability, energy, loudness, tempo, valence, etc.)  
- Metadata (track name, artist name)  
- Genre  
- Popularity score
  
---

## 3. Exploratory Data Analysis (EDA)

### Cleaning  
- Removed a small number of rows with missing metadata  
- Audio features were fully available – no imputation needed  
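
The cleaning step can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual code; the column names `track_name` and `artists` are assumptions about how the metadata fields are labelled.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; column names are assumptions.
df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C", "Song D"],
    "artists": ["Artist 1", "Artist 2", None, "Artist 4"],
    "danceability": [0.61, 0.45, 0.72, 0.38],
    "popularity": [55, 12, 40, 0],
})

metadata_cols = ["track_name", "artists"]

# Drop rows where any metadata field is missing.
clean = df.dropna(subset=metadata_cols).reset_index(drop=True)

# Confirm the audio feature has no gaps, so no imputation is required.
missing_audio = int(clean["danceability"].isna().sum())
print(len(df) - len(clean), "rows dropped;", missing_audio, "missing audio values")
```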

### Outliers  
- Outliers in duration, tempo, and loudness represent real musical variation  
- No outlier removal was performed  

### Target Distribution  
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)
Popularity is heavily right-skewed: most songs are minimally popular.

### Correlations  
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)
There are no strong linear correlations between audio features and popularity.  
Several weak but consistent patterns appear:
- Higher danceability corresponds to higher maximum popularity  
- Louder tracks tend to be more popular  
- Explicit tracks have slightly higher median popularity  
- Genres differ significantly in average popularity
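
One way to rank features by their linear correlation with the target is sketched below. The data is synthetic and the coefficients used to generate it are arbitrary assumptions, so the printed values do not reproduce the heatmap above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real values come from the Kaggle dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.normal(-8, 4, n),
    "valence": rng.uniform(0, 1, n),
})
# Popularity depends only weakly on two features, plus a lot of noise.
noise = rng.normal(0, 20, n)
df["popularity"] = (30 + 10 * df["danceability"] + 0.5 * df["loudness"] + noise).clip(0, 100)

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    df.corr()["popularity"]
    .drop("popularity")
    .sort_values(ascending=False)
)
print(corr_with_target)
```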

### Visualizations 
1. Danceability vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/mnvWrzuLfFuSB7akeqoT2.png)
The upper envelope of popularity rises consistently with danceability: highly popular songs are rare at low danceability. The relationship is not strictly linear, which is consistent with the weak overall correlation.

2. Energy vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/4dU_Jpd36IFqE6zZvMn4c.png)
Popular songs cluster around medium-to-high energy levels. Very low-energy tracks rarely achieve high popularity, showing a clear preference for energetic music.

3. Loudness vs. Popularity
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/M3PyAqpLXlmdSUroN44gs.png)
There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, which may reflect modern production and streaming trends.

### Research Questions
1. What percentage of songs are explicit vs. clean? 
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/aGxa55X2K-5yydyEw7a11.png)
- Most songs in the dataset are clean (91%), while only a small portion (9%) are explicit, so explicit content is relatively uncommon in this sample.

2. Which musical keys are most common in the dataset?
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/cJHx7bC_y3i7ADzQLMqSR.png)
- Certain musical keys (especially 0 = C and 7 = G) appear far more frequently, but the distribution does not suggest any direct relationship with popularity.

3. How does average popularity vary across different music genres?
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)
- Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform. 

### EDA Summary  
The data suggests that no single feature determines popularity.  
The weak linear relationships suggest that non-linear models, feature engineering, and multi-feature interactions are worth exploring.

---
## 4. Baseline Regression Model

### Features  
- Numeric audio features  
- One-hot encoded genre  

### Model  
- Linear Regression (scikit-learn)  
- 80/20 train-test split  

### Performance  
- MAE: ~14.08  
- RMSE: ~19.14  
- R²: ~0.26  
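
A baseline of this shape could look like the sketch below. It uses synthetic data, so its metrics will not match the numbers above; the column name `track_genre`, the `random_state`, and the use of `drop_first=True` to define a reference genre are assumptions, not details taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a genre effect and noisy audio features.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
})
genre_shift = df["track_genre"].map({"pop": 15.0, "rock": 5.0, "jazz": 0.0})
noise = rng.normal(0, 15, n)
df["popularity"] = (25 + 10 * df["danceability"] + genre_shift + noise).clip(0, 100)

# One-hot encode genre; the dropped category acts as the reference genre.
X = pd.get_dummies(
    df[["danceability", "energy", "track_genre"]],
    columns=["track_genre"], drop_first=True, dtype=float,
)
y = df["popularity"]

# 80/20 train-test split, then fit and evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
```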

Genre features dominate the model coefficients; the audio features contribute much smaller effects, detailed under Feature Importance below.

### Feature Importance
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/8X2jU4oJ8bIXYiuY-jDWi.png)

- Genre features dominate the coefficients.  
- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.

Among non-genre features:

- **Positive:** danceability, explicit  
- **Negative:** valence, speechiness, energy  
- **Weak influence:** liveness, tempo, key, mode
  
---

## 5. Feature Engineering

### Scaling  
StandardScaler applied to all numeric features.

### Polynomial Features  
PolynomialFeatures(degree=2) used to capture interactions and non-linear relationships.

### PCA  
Performed only for visualization (2 components).  
No distinct clusters observed in PCA space.
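
A two-component projection for plotting can be obtained as in this sketch. The input is random noise standing in for the scaled audio features, so it only demonstrates the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numeric audio features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# PCA is scale-sensitive, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)

# Share of total variance captured by the two plotted axes.
explained = pca.explained_variance_ratio_
print(coords.shape, explained.sum())
```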

### Clustering  
K-Means (k=5) applied to scaled numeric features.  
Added engineered features:
- `cluster_id`  
- `cluster_distance`  

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)

Clusters differ by energy, danceability, valence, acousticness, and average popularity.
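
The two engineered columns could be derived along these lines. This sketch assumes `cluster_distance` means the Euclidean distance from each track to its assigned centroid; the data is synthetic and `random_state` and `n_init` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric audio features.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.uniform(0, 1, size=(300, 4)),
    columns=["energy", "danceability", "valence", "acousticness"],
)
X_scaled = StandardScaler().fit_transform(df)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["cluster_id"] = kmeans.fit_predict(X_scaled)

# transform() returns the distance to every centroid;
# the row-wise minimum is the distance to the nearest (assigned) one.
df["cluster_distance"] = kmeans.transform(X_scaled).min(axis=1)

print(df.groupby("cluster_id")["cluster_distance"].mean())
```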

---

## 6. Improved Regression Models

Three models were trained on the engineered dataset:

| Model                         | MAE    | RMSE   | R²    |
|------------------------------|--------|--------|-------|
| Enhanced Linear Regression   | ~14.05 | ~19.03 | ~0.27 |
| Random Forest                | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting            | ~17.02 | ~20.55 | ~0.15 |

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)


Winner: **Enhanced Linear Regression**, with the lowest MAE and RMSE and the highest R².  
Tree-based models underperformed, likely due to:
- High-dimensional sparsity (after one-hot + polynomials)  
- Weak signal-to-noise ratio  
- Genre dominating feature space  
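
A comparison like the table above can be produced with a simple loop over models. The sketch below runs on synthetic data with default-ish hyperparameters (`n_estimators=50` is an arbitrary choice for speed), so its ranking and numbers are not those of the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a weak linear signal buried in noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = 30 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 15, 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the same split and collect the three metrics.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```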

Saved model: `spotify_popularity_enhanced_linear_regression.pkl`

---

## 7. Regression to Classification

A binary label was created using the training-set median popularity (35):

- Class 0: popularity < 35  
- Class 1: popularity ≥ 35  

Classes were balanced (~50/50), so no resampling was needed.
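
The labelling step can be sketched as below. The key point is that the threshold is computed on the training split only and then reused for the test split, which avoids leaking test information. The popularity values here are random integers, so the threshold will not equal 35.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the popularity column.
rng = np.random.default_rng(7)
popularity = pd.Series(rng.integers(0, 101, size=1000), name="popularity")

y_train, y_test = train_test_split(popularity, test_size=0.2, random_state=42)

# Threshold comes from the training split only.
threshold = y_train.median()

# Class 1 if popularity >= threshold, else class 0.
y_train_cls = (y_train >= threshold).astype(int)
y_test_cls = (y_test >= threshold).astype(int)

share = y_train_cls.mean()
print(f"threshold={threshold}, share of class 1 in train={share:.2f}")
```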

Precision was prioritized over recall, since a false positive (predicting a non-popular track as popular) is considered more costly than a false negative.

---
## 8. Classification Models

Three models were trained:

| Model                   | Accuracy |
|-------------------------|----------|
| Logistic Regression     | ~0.76    |
| Random Forest           | ~0.75    |
| Gradient Boosting       | ~0.72    |

Winner: **Logistic Regression**  
It achieved the highest accuracy and the best precision-recall balance, with errors spread most evenly across the two classes.

Saved model: `spotify_popularity_logistic_regression_classifier.pkl`

### Full Evaluation

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/tVcYnJul6_zp1GIY8S8Bl.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/HDgdidPPTISme1hMCnZkH.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)

**Logistic Regression** shows the best balance between precision and recall across both classes.
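
Precision and recall for the positive class can be computed as in this sketch. It trains a logistic regression on synthetic features, so the scores are illustrative only; `max_iter=1000` and the stratified split are assumptions rather than settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem driven by two of five features.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
signal = 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (signal + rng.normal(0, 1, 1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
# Precision: of the tracks predicted popular, how many really are.
precision = precision_score(y_test, pred)
# Recall: of the truly popular tracks, how many were found.
recall = recall_score(y_test, pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```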

---

## 9. How to Reproduce

Install dependencies:
```bash
pip install -r requirements.txt
```

Run the notebook:
```
Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
```

The preprocessing pipeline includes:
- Standard scaling  
- Polynomial feature generation  
- One-hot encoding of genres  
- K-Means clustering (k=5)

The same steps, fitted on the training data, must be applied to any new input before it is passed to a saved model.
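
One way to keep preprocessing and model in sync is to bundle them in a single scikit-learn `Pipeline` and persist that object. The sketch below is an illustration of that idea, not a description of how the `.pkl` files in this repository were created: the file name `demo_pipeline.pkl`, the column names, and the use of `joblib` are all assumptions, and the K-Means step is left out for brevity.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], n),
})
y = 30 + 20 * df["danceability"] + rng.normal(0, 5, n)

# Numeric columns: scale, then expand; genre: one-hot encode.
numeric = ["danceability", "energy"]
numeric_steps = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=2, include_bias=False)
)
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric),
    ("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"]),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(df, y)

# Save and reload the whole pipeline (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# Raw rows go in; preprocessing is applied automatically.
new_tracks = df.head(3)
restored_pred = restored.predict(new_tracks)
print(restored_pred)
```

If a saved file contains only the estimator rather than a full pipeline, the fitted scaler, encoder, and clustering objects have to be recreated or stored separately.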

---
## 10. Repository Structure

```
project/
│── README.md
│── Leelu_Spotify_Popularity_Assignment_2.ipynb
│── spotify_popularity_enhanced_linear_regression.pkl
│── spotify_popularity_logistic_regression_classifier.pkl
```

---
## 11. Final Summary

This project builds a complete machine learning pipeline for predicting Spotify track popularity.  
Through EDA, feature engineering, regression, and classification, the project demonstrates:

- Popularity is difficult to predict from audio features and genre alone (best regression R² of ~0.27)  
- Feature engineering yields a small gain for linear regression (R² from ~0.26 to ~0.27)  
- Enhanced Linear Regression is best for regression  
- Logistic Regression is best for binary popularity classification  

All final models require the full preprocessing pipeline to reproduce predictions.