# Spotify Track Popularity Prediction

---

## **Part 1: Dataset Overview**

This project uses the **Spotify Tracks Dataset** from Kaggle (≈114,000 songs, 125 genres, 20 features).
The central question is whether musical/audio features can predict a track's popularity (0–100).
The task begins as a **regression** problem and is later reframed as **classification**.

---

## **Part 2: Exploratory Data Analysis (EDA)**

### **Cleaning**
- Dropped a small number of rows with missing text metadata.
- No imputation was needed for the audio features.

### **Outliers**
- Outliers in duration, loudness, tempo, etc. are musically valid.
- No outlier removal was performed.

### **Target Distribution**
- Popularity is right-skewed; most songs are minimally streamed.

### **Correlations**
- No strong linear correlations with popularity.
- Audio features show only weak trends with the target.
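
The cleaning and correlation steps above can be sketched with pandas. This is a minimal sketch on a hypothetical four-row stand-in frame; the real Kaggle CSV uses the same column names but ≈114,000 rows:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kaggle CSV (same column names).
df = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "danceability": [0.5, 0.7, 0.6, 0.8],
    "energy": [0.9, 0.4, 0.5, 0.7],
    "popularity": [10, 60, 35, 80],
})

# Drop the few rows with missing text metadata; audio features need no imputation.
df = df.dropna(subset=["track_name"]).reset_index(drop=True)

# Linear correlations of each audio feature with the target.
corr = df[["danceability", "energy", "popularity"]].corr()["popularity"]
print(corr)
```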

### **Visualizations**
_Insert plots here:_
- Boxplots of numeric features
- Popularity histogram
- Correlation heatmap
- Danceability vs popularity
- Energy vs popularity
- Explicit vs popularity
- Loudness vs popularity
- Genre popularity
- Any additional scatter/violin plots

---

## **Part 3: Baseline Regression Model**

### **Features**
- Numerical audio features
- One-hot encoded genre (`track_genre`)

### **Model**
- `LinearRegression` (scikit-learn)
- Train/test split: 80/20

### **Performance**
- **MAE** ≈ 14.08
- **RMSE** ≈ 19.14
- **R²** ≈ 0.26

The baseline model explains about **26%** of the variance in popularity.
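
A minimal sketch of the baseline setup, using synthetic stand-in data (the feature names and the generated target here are hypothetical; the real notebook trains on the full feature list):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "track_genre": rng.choice(["pop", "rock", "jazz"], size=n),
})
df["popularity"] = (60 * df["danceability"] + rng.normal(0, 5, n)).clip(0, 100)

X, y = df.drop(columns="popularity"), df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # the 80/20 split described above
)

# One-hot encode genre; pass the numeric audio features through unchanged.
pre = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())]).fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(np.mean((y_test.to_numpy() - pred) ** 2)))
r2 = r2_score(y_test, pred)
```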

### **Feature Importance**

- Genre features dominate the coefficients.
- Most genre coefficients are negative, showing that niche genres perform worse than the reference genre.

Among non-genre features:
- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode

_Insert plots here:_
- Top 20 coefficients
- Non-genre coefficient plot

---

## **Part 4: Feature Engineering**

### **4.1 Scaling**
- Applied **StandardScaler** to all numeric features to normalize their scales.

### **4.2 Polynomial Features**
- Used `PolynomialFeatures(degree=2)` on the numeric features.
- Captures nonlinear relationships and interactions between audio features.

### **4.3 PCA**
- Applied **PCA (2 components)** to the scaled numeric features for visualization only.
- No clear clusters; the structure appears continuous in 2D.

_Insert PCA plot here._

### **4.4 Clustering**
- Applied **K-Means (k = 5)** to the scaled numeric features.
- Added two new engineered features:
  - `cluster_id`
  - `cluster_distance`

Cluster profiles differ in **danceability**, **energy**, **valence**, **acousticness**, and **popularity**.

_Insert K-Means + PCA plot and cluster summary table here._
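
The four engineering steps can be sketched as follows. This is a sketch on a random stand-in matrix (shape and values hypothetical), and it assumes `cluster_distance` means distance to the nearest centroid, which the project may define differently:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Random stand-in for the numeric audio-feature matrix.
rng = np.random.default_rng(0)
X = rng.random((100, 4))

# 4.1: standardize each feature.
X_scaled = StandardScaler().fit_transform(X)

# 4.2: degree-2 terms and pairwise interactions.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

# 4.3: 2-D projection, used for visualization only.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# 4.4: K-Means labels plus distance to the nearest centroid as new features.
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)
cluster_id = km.labels_
cluster_distance = np.min(km.transform(X_scaled), axis=1)
```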

---

## **Part 5: Improved Regression Models**

Three models were trained using the engineered feature set:

- **Enhanced Linear Regression**
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### **Performance Summary**

| Model                        | MAE    | RMSE   | R²    |
|------------------------------|--------|--------|-------|
| Linear Regression (Enhanced) | ~14.05 | ~19.03 | ~0.27 |
| Random Forest                | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting            | ~17.02 | ~20.55 | ~0.15 |
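
A comparison loop like the following produces the table above; this sketch runs on synthetic stand-in data, so its MAE values are not the reported ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and target.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = X @ rng.random(6) * 50 + rng.normal(0, 3, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
# Fit each model and collect its held-out MAE.
maes = {
    name: mean_absolute_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
    for name, m in models.items()
}
```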

### **Result**

- The **Enhanced Linear Regression** model is the **regression winner**.
- Feature engineering modestly improved performance over the baseline (R² ~0.26 → ~0.27).

Saved regression model file:
- `spotify_popularity_enhanced_linear_regression.pkl`

---

## **Part 6: Regression-to-Classification**

### **Creating Classes**

Popularity was converted into a binary label using the **training-set median (35)**:

- **Class 0:** popularity < 35
- **Class 1:** popularity ≥ 35

Class balance is almost exactly **50/50** in train and test.
No rebalancing or resampling was required.
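
The binarization step is a one-liner. A sketch on hypothetical toy scores (the toy median here is 37.5; on the real training set it is ≈35):

```python
import numpy as np

# Hypothetical popularity scores standing in for the training target.
y_train = np.array([10, 20, 35, 40, 60, 80])
threshold = np.median(y_train)

# Class 1 = at or above the median, Class 0 = below it.
y_class = (y_train >= threshold).astype(int)
print(threshold, y_class)
```

Thresholding at the median is what makes the resulting classes close to 50/50 by construction.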

### **Error Considerations**

- **Precision** is more important than recall.
- **False positives** are more costly than **false negatives**, because predicting a non-popular song as popular leads to wasted marketing resources.

---

## **Part 7: Classification Models**

Three classifiers were trained on the engineered feature space:

- **Logistic Regression**
- **Random Forest Classifier**
- **Gradient Boosting Classifier**

### **Accuracy (Test Set)**

- Logistic Regression: ≈ **0.76**
- Random Forest: ≈ **0.75**
- Gradient Boosting: ≈ **0.72**

### **Full Evaluation**

_Insert plots here:_
- Logistic Regression classification report (screenshot or table)
- Logistic Regression confusion matrix
- Random Forest confusion matrix
- Gradient Boosting confusion matrix

Logistic Regression shows the best balance between precision and recall across both classes.
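
The evaluation artifacts above come from the standard scikit-learn metrics. A sketch for the winning classifier on synthetic stand-in features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the binary popularity label.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)           # rows: true class, cols: predicted
print(classification_report(y_te, pred))    # per-class precision/recall/F1
```

With the false-positive cost argument from Part 6, the per-class precision row of the report is the number to watch, not accuracy alone.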

---

## **Part 8: Classification Winner**

**Logistic Regression** is the best-performing classifier:

- Highest accuracy (~0.76)
- Balanced precision and recall
- Fewer systematic classification errors than the tree-based models

Saved classification model file:
- `spotify_popularity_logistic_regression_classifier.pkl`

---

## **Final Notes**

This project demonstrates:

- How audio features, genre, PCA structure, and clustering relate to popularity
- How feature engineering improves regression performance
- How a regression problem can be converted into a balanced classification task
- How classical machine-learning models compare when predicting popularity labels

All saved models require the same preprocessing pipeline (scaling, polynomial features, encoding, clustering features) to reproduce predictions.
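
One way to keep the preprocessing requirement from biting at load time is to persist the whole pipeline rather than the bare estimator. A sketch with placeholder stages and random data (the project's actual `.pkl` files may store the estimator alone, in which case the transformers must be re-fitted identically):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X, y = rng.random((50, 3)), rng.random(50)

# Bundling preprocessing and estimator in one Pipeline means the saved file
# carries the whole transformation chain, not just the fitted model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("reg", LinearRegression()),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline reproduces the original predictions exactly.
same = np.allclose(pipe.predict(X), restored.predict(X))
```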