# Spotify Track Popularity Prediction

---

## **Part 1: Dataset Overview**

This project uses the **Spotify Tracks Dataset** from Kaggle (≈114,000 songs, 125 genres, 20 features).
The central question is whether musical/audio features can predict a track's popularity (0–100).
The task begins as a **regression** problem and is later reframed as **classification**.

---

## **Part 2: Exploratory Data Analysis (EDA)**

### **Cleaning**
- Dropped a small number of rows with missing text metadata.
- No imputation was needed for the audio features.

### **Outliers**
- Outliers in duration, loudness, tempo, etc. are musically valid.
- No outlier removal was performed.

### **Target Distribution**
- Popularity is right-skewed; most songs are minimally streamed.

### **Correlations**
- No strong linear correlations with popularity.
- Audio features show only weak trends with the target.
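
The cleaning and correlation steps above can be sketched with pandas. This is a minimal sketch on a hypothetical four-row stand-in frame; the real Kaggle CSV uses the same column names but ≈114,000 rows:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kaggle CSV (same column names).
df = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "danceability": [0.5, 0.7, 0.6, 0.8],
    "energy": [0.9, 0.4, 0.5, 0.7],
    "popularity": [10, 60, 35, 80],
})

# Drop the few rows with missing text metadata; audio features need no imputation.
df = df.dropna(subset=["track_name"]).reset_index(drop=True)

# Linear correlations of each audio feature with the target.
corr = df[["danceability", "energy", "popularity"]].corr()["popularity"]
print(corr)
```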

### **Visualizations**
_Insert plots here:_
- Boxplots of numeric features
- Popularity histogram
- Correlation heatmap
- Danceability vs popularity
- Energy vs popularity
- Explicit vs popularity
- Loudness vs popularity
- Genre popularity
- Any additional scatter/violin plots

---

## **Part 3: Baseline Regression Model**

### **Features**
- Numerical audio features
- One-hot encoded genre (`track_genre`)

### **Model**
- `LinearRegression` (scikit-learn)
- Train/test split: 80/20

### **Performance**
- **MAE** ≈ 14.08
- **RMSE** ≈ 19.14
- **R²** ≈ 0.26

The baseline model explains about **26%** of the variance in popularity.
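
A minimal sketch of the baseline setup, using synthetic stand-in data (the feature names and the generated target here are hypothetical; the real notebook trains on the full feature list):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], size=n),
})
df["popularity"] = (60 * df["danceability"] + rng.normal(0, 5, n)).clip(0, 100)

X, y = df.drop(columns="popularity"), df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # the 80/20 split described above
)

# One-hot encode genre; pass the numeric audio features through unchanged.
pre = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())]).fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(np.mean((y_test.to_numpy() - pred) ** 2)))
r2 = r2_score(y_test, pred)
```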

### **Feature Importance**

- Genre features dominate the coefficients.
- Most genre coefficients are negative, showing that niche genres perform worse than the reference genre.

Among non-genre features:
- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode

_Insert plots here:_
- Top 20 coefficients
- Non-genre coefficient plot

---

## **Part 4: Feature Engineering**

### **4.1 Scaling**
- Applied **StandardScaler** to all numeric features to normalize their scales.

### **4.2 Polynomial Features**
- Used `PolynomialFeatures(degree=2)` on the numeric features.
- Captures nonlinear relationships and interactions between audio features.

### **4.3 PCA**
- Applied **PCA (2 components)** to the scaled numeric features for visualization only.
- No clear clusters; the structure appears continuous in 2D.

_Insert PCA plot here._

### **4.4 Clustering**
- Applied **K-Means (k = 5)** to the scaled numeric features.
- Added two new engineered features:
  - `cluster_id`
  - `cluster_distance`

Cluster profiles differ in **danceability**, **energy**, **valence**, **acousticness**, and **popularity**.

_Insert K-Means + PCA plot and cluster summary table here._
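
The four engineering steps can be sketched as follows. This is a sketch on a random stand-in matrix (shape and values hypothetical), and it assumes `cluster_distance` means distance to the nearest centroid, which the project may define differently:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Random stand-in for the numeric audio-feature matrix.
rng = np.random.default_rng(0)
X = rng.random((100, 4))

# 4.1: standardize each feature.
X_scaled = StandardScaler().fit_transform(X)

# 4.2: degree-2 terms and pairwise interactions.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

# 4.3: 2-D projection, used for visualization only.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# 4.4: K-Means labels plus distance to the nearest centroid as new features.
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)
cluster_id = km.labels_
cluster_distance = np.min(km.transform(X_scaled), axis=1)
```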

---

## **Part 5: Improved Regression Models**

Three models were trained using the engineered feature set:

- **Enhanced Linear Regression**
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### **Performance Summary**

| Model                        | MAE    | RMSE   | R²    |
|------------------------------|--------|--------|-------|
| Linear Regression (Enhanced) | ~14.05 | ~19.03 | ~0.27 |
| Random Forest                | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting            | ~17.02 | ~20.55 | ~0.15 |
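
A comparison loop like the following produces the table above; this sketch runs on synthetic stand-in data, so its MAE values are not the reported ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and target.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = X @ rng.random(6) * 50 + rng.normal(0, 3, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
# Fit each model and collect its held-out MAE.
maes = {
    name: mean_absolute_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
    for name, m in models.items()
}
```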

### **Result**

- The **Enhanced Linear Regression** model is the **regression winner**.
- Feature engineering modestly improved performance over the baseline (R² ~0.26 → ~0.27).

Saved regression model file:
- `spotify_popularity_enhanced_linear_regression.pkl`

---

## **Part 6: Regression-to-Classification**

### **Creating Classes**

Popularity was converted into a binary label using the **training-set median (35)**:

- **Class 0:** popularity < 35
- **Class 1:** popularity ≥ 35

Class balance is almost exactly **50/50** in train and test.
No rebalancing or resampling was required.
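
The binarization step is a one-liner. A sketch on hypothetical toy scores (the toy median here is 37.5; on the real training set it is ≈35):

```python
import numpy as np

# Hypothetical popularity scores standing in for the training target.
y_train = np.array([10, 20, 35, 40, 60, 80])
threshold = np.median(y_train)

# Class 1 = at or above the median, Class 0 = below it.
y_class = (y_train >= threshold).astype(int)
print(threshold, y_class)
```

Thresholding at the median is what makes the resulting classes close to 50/50 by construction.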

### **Error Considerations**

- **Precision** is more important than recall.
- **False positives** are more costly than **false negatives**, because predicting a non-popular song as popular leads to wasted marketing resources.

---

## **Part 7: Classification Models**

Three classifiers were trained on the engineered feature space:

- **Logistic Regression**
- **Random Forest Classifier**
- **Gradient Boosting Classifier**

### **Accuracy (Test Set)**

- Logistic Regression: ≈ **0.76**
- Random Forest: ≈ **0.75**
- Gradient Boosting: ≈ **0.72**

### **Full Evaluation**

_Insert plots here:_
- Logistic Regression classification report (screenshot or table)
- Logistic Regression confusion matrix
- Random Forest confusion matrix
- Gradient Boosting confusion matrix

Logistic Regression shows the best balance between precision and recall across both classes.
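
The evaluation artifacts above come from the standard scikit-learn metrics. A sketch for the winning classifier on synthetic stand-in features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the binary popularity label.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)           # rows: true class, cols: predicted
print(classification_report(y_te, pred))    # per-class precision/recall/F1
```

With the false-positive cost argument from Part 6, the per-class precision row of the report is the number to watch, not accuracy alone.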

---

## **Part 8: Classification Winner**

**Logistic Regression** is the best-performing classifier:

- Highest accuracy (~0.76)
- Balanced precision and recall
- Fewer systematic classification errors than the tree-based models

Saved classification model file:
- `spotify_popularity_logistic_regression_classifier.pkl`

---

## **Final Notes**

This project demonstrates:

- How audio features, genre, PCA structure, and clustering relate to popularity
- How feature engineering improves regression performance
- How a regression problem can be converted into a balanced classification task
- How classical machine-learning models compare when predicting popularity labels

All saved models require the same preprocessing pipeline (scaling, polynomial features, encoding, clustering features) to reproduce predictions.
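
One way to keep the preprocessing requirement from biting at load time is to persist the whole pipeline rather than the bare estimator. A sketch with placeholder stages and random data (the project's actual `.pkl` files may store the estimator alone, in which case the transformers must be re-fitted identically):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X, y = rng.random((50, 3)), rng.random(50)

# Bundling preprocessing and estimator in one Pipeline means the saved file
# carries the whole transformation chain, not just the fitted model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("reg", LinearRegression()),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline reproduces the original predictions exactly.
same = np.allclose(pipe.predict(X), restored.predict(X))
```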