Leelu1002 committed on
Commit
5e39aee
·
verified ·
1 Parent(s): b12a05d

Update README.md

Files changed (1)
  1. README.md +133 -122
README.md CHANGED
@@ -1,54 +1,56 @@
1
  # Spotify Track Popularity Prediction
2
 
3
  ---
4
- Predicting Spotify track popularity using audio features, metadata, feature engineering, clustering, regression, and classification models.
5
 
 
6
  ## **Part 1: Dataset Overview**
7
 
8
- This project uses the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle (≈114,000 songs, 125 genres, 20 features).
9
- The central question is whether musical/audio features can predict a track's popularity (0–100).
10
- The task begins as a **regression** problem, and later is transformed into **classification**.
11
 
12
- **Objective:**
13
- Predict song popularity on Spotify using machine learning methods.
14
-
15
- **Target Variable:**
16
- `popularity` (0–100), a continuous score based on Spotify's internal engagement metrics.
17
 
 
18
 
19
  ---
20
 
21
- ## **Part 2: Exploratory Data Analysis (EDA)**
22
 
23
- ### **Cleaning**
24
- - Dropped a small number of rows with missing text metadata.
25
- - No imputation was needed for audio features.
26
 
27
- ### **Outliers**
28
- - Outliers in duration, loudness, tempo, etc. are musically valid.
29
- - No outlier removal was performed.
30
 
31
- ### **Target Distribution**
32
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)
33
- - Popularity is right-skewed; most songs are minimally streamed.
 
 
 
34
 
35
- ### **Correlations**
36
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)
37
- - No strong linear correlations with popularity.
38
- - Audio features show only weak trends with the target.
39
 
40
- ### **Visualizations**
41
- 1. Danceability vs Popularity
42
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/BS5xSp6tadQNAQNHQqiKD.png)
43
- - There is a clear upward trend: songs with higher danceability tend to be more popular. While the relationship is not perfectly linear, the upper envelope rises consistently with danceability.
44
 
45
- 2. Loudness vs Popularity
46
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/V4ATotCdz_UMFs9hG1QfN.png)
47
- - There is a visible positive trend: louder songs (closer to 0 dB) tend to achieve higher popularity. Quiet tracks rarely reach high popularity, reflecting modern production and streaming trends.
48
 
49
- 3. Explicit vs Popularity
50
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/2BvKsZbTzOesLgDQiRXA7.png)
51
- - Explicit tracks show a higher median popularity and a slightly higher upper range. The difference is not dramatic, but explicit songs tend to perform better on average.
 
 
 
 
 
 
 
 
 
52
 
53
  ### **Research Questions**
54
  1. What percentage of songs are explicit vs. clean?
@@ -63,30 +65,31 @@ Predict song popularity on Spotify using machine learning methods.
63
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)
64
  - Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.
65
 
66
- ### **EDA SUMMARY**
67
- The exploratory analysis shows that Spotify popularity is heavily right-skewed, with most tracks receiving very low scores. Audio features exhibit only weak linear correlations with popularity, suggesting that no single musical attribute strongly drives success.
68
- Outliers in duration, loudness, and tempo are musically valid and were kept.
69
- Some patterns emerge: tracks that are louder, more danceable, or explicit tend to be slightly more popular, and genre plays a meaningful role in average popularity levels.
70
- Overall, the EDA indicates that predicting popularity requires multi-feature modeling and more advanced feature engineering to capture subtle, non-linear relationships.
71
 
72
  ---
 
73
 
74
- ## **Part 3: Baseline Regression Model**
75
-
76
- ### **Features**
77
- - Numerical audio features
78
- - One-hot encoded genre (`track_genre`)
79
 
80
- ### **Model**
81
- - `LinearRegression` (scikit-learn)
82
- - Train/test split: 80/20
83
 
84
- ### **Performance**
85
- **MAE** ≈ 14.08
86
- **RMSE** ≈ 19.14
87
- **R²** ≈ 0.26
88
 
89
- The baseline model explains about **26%** of the variance in popularity.
 
 
 
 
90
 
91
  ### **Feature Importance**
92
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)
@@ -96,99 +99,86 @@ The baseline model explains about **26%** of the variance in popularity.
96
  - Most genre coefficients are negative, showing niche genres perform worse than the reference genre.
97
 
98
  Among non-genre features:
 
99
  - **Positive:** danceability, explicit
100
  - **Negative:** valence, speechiness, energy
101
- - **Weak influence:** liveness, tempo, key, mode
102
-
103
  ---
104
 
105
- ## **Part 4: Feature Engineering**
106
-
107
- ### **4.1 Scaling**
108
- - Applied **StandardScaler** to all numeric features to normalize their scales.
109
 
110
- ### **4.2 Polynomial Features**
111
- - Used `PolynomialFeatures(degree=2)` on numeric features.
112
- - Captures nonlinear relationships and interactions between audio features.
113
 
114
- ### **4.3 PCA**
115
- - Applied **PCA (2 components)** on scaled numeric features for visualization only.
116
- - No clear clusters; structure appears continuous in 2D.
117
 
118
- ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/ZsG5Ltqji6f4ulp3in-6x.png)
 
 
119
 
120
- ### **4.4 Clustering**
121
- - Applied **K-Means (k = 5)** to scaled numeric features.
122
- - Added two new engineered features:
123
- - `cluster_id`
124
- - `cluster_distance`
125
-
126
- Cluster profiles differ in **danceability**, **energy**, **valence**, **acousticness**, and **popularity**.
127
 
128
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)
129
 
130
- ---
131
-
132
- ## **Part 5: Improved Regression Models**
133
 
134
- Three models were trained using the engineered feature set:
135
 
136
- - **Enhanced Linear Regression**
137
- - **Random Forest Regressor**
138
- - **Gradient Boosting Regressor**
139
 
140
- ### **Performance Summary**
141
 
142
  | Model | MAE | RMSE | R² |
143
  |------------------------------|--------|--------|-------|
144
- | Linear Regression (Enhanced) | ~14.05 | ~19.03 | ~0.27 |
145
  | Random Forest | ~15.97 | ~19.95 | ~0.20 |
146
  | Gradient Boosting | ~17.02 | ~20.55 | ~0.15 |
147
 
148
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)
149
 
150
- ### **Result**
151
 
152
- - The **Enhanced Linear Regression** model is the **regression winner**.
153
- - Feature engineering improved performance compared to the baseline.
 
 
 
154
 
155
- Saved regression model file:
156
- - `spotify_popularity_enhanced_linear_regression.pkl`
157
 
158
  ---
159
 
160
- ## **Part 6: Regression-to-Classification**
161
-
162
- ### **Creating Classes**
163
 
164
- Popularity was converted into a binary label using the **training-set median (35)**:
165
 
166
- - **Class 0:** popularity < 35
167
- **Class 1:** popularity ≥ 35
168
 
169
- Class balance is almost exactly **50/50** in train and test.
170
- No rebalancing or resampling was required.
171
 
172
- ### **Error Considerations**
173
-
174
- - **Precision** is more important than recall.
175
- - **False Positives** are more costly than **False Negatives**, because predicting a non-popular song as popular leads to wasted marketing resources.
176
 
177
  ---
 
178
 
179
- ## **Part 7: Classification Models**
180
-
181
- Three classifiers were trained on the engineered feature space:
182
 
183
- - **Logistic Regression**
184
- - **Random Forest Classifier**
185
- - **Gradient Boosting Classifier**
 
 
186
 
187
- ### **Accuracy (Test Set)**
 
188
 
189
- Logistic Regression: ≈ **0.76**
190
- Random Forest: ≈ **0.75**
191
- Gradient Boosting: ≈ **0.72**
192
 
193
  ### **Full Evaluation**
194
 
@@ -198,30 +188,51 @@ Three classifiers were trained on the engineered feature space:
198
 
199
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)
200
 
201
- Logistic Regression shows the best balance between precision and recall across both classes.
202
 
203
  ---
204
 
205
- ## **Part 8: Classification Winner**
 
 
 
 
 
206
 
207
- **Logistic Regression** is the best-performing classifier:
 
 
 
208
 
209
- - Highest accuracy (~0.76)
210
- - Balanced precision and recall
211
- - Fewer systematic classification errors than tree-based models
 
 
212
 
213
- Saved classification model file:
214
- - `spotify_popularity_logistic_regression_classifier.pkl`
215
 
216
  ---
 
217
 
218
- ## **Final Notes**
 
 
 
 
 
 
 
 
 
 
219
 
220
- This project demonstrates:
 
221
 
222
- - How audio features, genre, PCA structure, and clustering relate to popularity
223
- - How feature engineering improves regression performance
224
- - How a regression problem can be converted into a balanced classification task
225
- - How classical machine-learning models compare when predicting popularity labels
226
 
227
- All saved models require the same preprocessing pipeline (scaling, polynomial features, encoding, clustering features) to reproduce predictions.
 
1
  # Spotify Track Popularity Prediction
2
 
3
  ---
4
+ This project predicts Spotify track popularity using audio features, metadata, feature engineering, clustering, regression models, and classification models.
5
 
6
+ It is based on the [**Spotify Tracks Dataset**](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data) from Kaggle.
7
  ## **Part 1: Dataset Overview**
8
 
9
+ The goal of this project is to determine whether musical and audio characteristics can predict a track's popularity (0–100).
10
+ The task is first approached as a regression problem and later reframed as a binary classification problem.
 
11
 
12
+ *Target variable:* `popularity`
 
 
 
 
13
 
14
+ *Type:* continuous (0–100)
15
 
16
  ---
17
 
18
+ ## 2. Dataset Description
19
 
20
+ Source: Spotify Tracks Dataset, Kaggle
21
+ Size: ~114,000 rows, 20 features, 125+ genres
 
22
 
23
+ Includes:
 
 
24
 
25
+ - Audio features (danceability, energy, loudness, tempo, valence, etc.)
26
+ - Metadata (track name, artist name)
27
+ - Genre
28
+ - Popularity score
29
+
30
+ ---
31
 
32
+ ## 3. Exploratory Data Analysis (EDA)
 
 
 
33
 
34
+ ### Cleaning
35
+ - Removed a small number of rows with missing metadata
36
+ - Audio features were fully available – no imputation needed
 
37
 
38
+ ### Outliers
39
+ - Outliers in duration, tempo, and loudness represent real musical variation
40
+ - No outlier removal was performed
41
 
42
+ ### Target Distribution
43
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/rlAtjyjSM6EEFux9Iro-B.png)
44
+ Popularity is heavily right-skewed: most songs are minimally popular.
45
+
46
+ ### Correlations
47
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/t86A_BkvsZuqR4csjhFI4.png)
48
+ There are no strong linear correlations between audio features and popularity.
49
+ Several weak but consistent patterns appear:
50
+ - Higher danceability corresponds to higher maximum popularity
51
+ - Louder tracks tend to be more popular
52
+ - Explicit tracks have slightly higher median popularity
53
+ - Genres differ significantly in average popularity
54
 
55
  ### **Research Questions**
56
  1. What percentage of songs are explicit vs. clean?
 
65
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/bc3IO-I2yghDXq-6cIXYn.png)
66
  - Genres like pop-film, k-pop, and chill show the highest average popularity, indicating that genre has a meaningful effect on how well songs perform.
67
 
68
+ ### EDA Summary
69
+ The data suggests that no single feature determines popularity.
70
+ Weak linear relationships indicate the need for non-linear models, feature engineering, and multi-feature interactions.
 
 
71
 
72
  ---
73
+ ## 4. Baseline Regression Model
74
 
75
+ ### Features
76
+ - Numeric audio features
77
+ - One-hot encoded genre
 
 
78
 
79
+ ### Model
80
+ - Linear Regression (scikit-learn)
81
+ - 80/20 train-test split
82
 
83
+ ### Performance
84
+ - MAE: ~14.08
85
+ - RMSE: ~19.14
86
+ - R²: ~0.26
87
 
88
+ Genre features dominate model coefficients.
89
+ Audio trends include:
90
+ - Positive effect: danceability, explicit
91
+ - Negative effect: valence, speechiness
92
+ - Minimal effect: key, mode, liveness
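The baseline setup above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the column names and the exact preprocessing in the notebook are assumptions here, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Spotify dataset (real columns assumed).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "track_genre": rng.choice(["pop", "k-pop", "chill"], n),
})
df["popularity"] = (60 * df["danceability"] + rng.normal(0, 5, n)).clip(0, 100)

X, y = df.drop(columns="popularity"), df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # 80/20 split, as in the project
)

# One-hot encode genre; numeric audio features pass through unchanged.
pre = ColumnTransformer(
    [("genre", OneHotEncoder(handle_unknown="ignore"), ["track_genre"])],
    remainder="passthrough",
)

model = Pipeline([("pre", pre), ("lr", LinearRegression())]).fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
```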
93
 
94
  ### **Feature Importance**
95
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/EqMwjYs1Zy-c27k-BwJTS.png)
 
99
  - Most genre coefficients are negative, showing niche genres perform worse than the reference genre.
100
 
101
  Among non-genre features:
102
+
103
  - **Positive:** danceability, explicit
104
  - **Negative:** valence, speechiness, energy
105
+ - **Weak influence:** liveness, tempo, key, mode
106
+
107
  ---
108
 
109
+ ## 5. Feature Engineering
 
 
 
110
 
111
+ ### Scaling
112
+ StandardScaler applied to all numeric features.
 
113
 
114
+ ### Polynomial Features
115
+ PolynomialFeatures(degree=2) used to capture interactions and non-linear relationships.
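The scaling and polynomial steps can be sketched as follows; the feature matrix is a synthetic stand-in for the scaled audio features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))  # stand-in for three numeric audio features

# StandardScaler centers each column to mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# degree=2 adds squared terms and pairwise interactions:
# 3 linear + 3 squared + 3 interaction = 9 output columns.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)
```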
 
116
 
117
+ ### PCA
118
+ Performed only for visualization (2 components).
119
+ No distinct clusters observed in PCA space.
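A minimal version of the 2-component PCA used for visualization (random data here, in place of the real scaled features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))  # stand-in for the numeric audio features

# Scale, then project onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
# X_2d can then be scatter-plotted, e.g. colored by popularity.
```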
120
 
121
+ ### Clustering
122
+ K-Means (k=5) applied to scaled numeric features.
123
+ Added engineered features:
124
+ - `cluster_id`
125
+ - `cluster_distance`
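The two engineered cluster features can be derived like this (synthetic data; `cluster_distance` is assumed to mean the distance to the nearest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))  # stand-in for the numeric features
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

cluster_id = km.labels_  # engineered feature 1: cluster assignment
# transform() returns the distance to every centroid; keep the nearest one.
cluster_distance = km.transform(X_scaled).min(axis=1)  # engineered feature 2
```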
 
 
126
 
127
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/MpGG5fa2nTedXgqXhtzhM.png)
128
 
129
+ Clusters differ by energy, danceability, valence, acousticness, and average popularity.
 
 
130
 
131
+ ---
132
 
133
+ ## 6. Improved Regression Models
 
 
134
 
135
+ Three models were trained on the engineered dataset:
136
 
137
  | Model | MAE | RMSE | R² |
138
  |------------------------------|--------|--------|-------|
139
+ | Enhanced Linear Regression | ~14.05 | ~19.03 | ~0.27 |
140
  | Random Forest | ~15.97 | ~19.95 | ~0.20 |
141
  | Gradient Boosting | ~17.02 | ~20.55 | ~0.15 |
142
 
143
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/OTgw25hHbvJ3PAXEF9T2_.png)
144
 
 
145
 
146
+ Winner: **Enhanced Linear Regression**
147
+ Tree-based models underperformed due to:
148
+ - High-dimensional sparsity (after one-hot + polynomials)
149
+ - Weak signal-to-noise ratio
150
+ - Genre dominating feature space
151
 
152
+ Saved model: `spotify_popularity_enhanced_linear_regression.pkl`
 
153
 
154
  ---
155
 
156
+ ## 7. Regression to Classification
 
 
157
 
158
+ A binary label was created using the training-set median popularity (35):
159
 
160
+ - Class 0: popularity < 35
161
+ - Class 1: popularity ≥ 35
162
 
163
+ Classes were balanced (~50/50), so no resampling was needed.
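The thresholding step can be sketched as below; note that the median is computed on the training set only, to avoid leaking test-set information.

```python
import numpy as np

rng = np.random.default_rng(3)
popularity_train = rng.integers(0, 101, size=1000)  # stand-in for train scores

# Median of the *training* popularity scores (≈35 on the real data).
threshold = np.median(popularity_train)

# Class 1 = "popular" (popularity >= threshold), Class 0 otherwise.
labels = (popularity_train >= threshold).astype(int)
```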
 
164
 
165
+ Precision was prioritized over recall, since predicting a non-popular track as popular is more costly.
 
 
 
166
 
167
  ---
168
+ ## 8. Classification Models
169
 
170
+ Three models were trained:
 
 
171
 
172
+ | Model | Accuracy |
173
+ |-------------------------|----------|
174
+ | Logistic Regression | ~0.76 |
175
+ | Random Forest | ~0.75 |
176
+ | Gradient Boosting | ~0.72 |
177
 
178
+ Winner: Logistic Regression
179
+ It achieved the best precision-recall balance and the lowest misclassification bias.
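A sketch of the winning classifier, evaluated with both accuracy and precision (precision matters more here, per the error analysis above). The data is synthetic with an injected signal, not the project's engineered feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two informative features plus noise.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# Report precision alongside accuracy, since false positives are costlier.
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
```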
180
 
181
+ Saved model: `spotify_popularity_logistic_regression_classifier.pkl`
 
 
182
 
183
  ### **Full Evaluation**
184
 
 
188
 
189
  ![image](https://cdn-uploads.huggingface.co/production/uploads/691201ec1b511b23f0c29e8b/PdAIUlRsBfw-z8N7g8UyK.png)
190
 
191
+ **Logistic Regression** shows the best balance between precision and recall across both classes.
192
 
193
  ---
194
 
195
+ ## 9. How to Reproduce
196
+
197
+ Install dependencies:
198
+ ```bash
199
+ pip install -r requirements.txt
200
+ ```
201
 
202
+ Run the notebook:
203
+ ```
204
+ Spotify_Popularity_Classification,_Regression,_Clustering_Assignment_2.ipynb
205
+ ```
206
 
207
+ The preprocessing pipeline includes:
208
+ - Standard scaling
209
+ - Polynomial feature generation
210
+ - One-hot encoding of genres
211
+ - K-Means clustering (k=5)
212
 
213
+ These steps must be applied before loading any saved model.
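Loading a saved model can be sketched as a round trip; joblib is an assumption here (the notebook may pickle its models differently), and the demo model stands in for the real `.pkl` files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small stand-in model, persist it, and reload it.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.gettempdir(), "spotify_model_demo.pkl")
joblib.dump(clf, path)
loaded = joblib.load(path)

# At predict time, inputs must have gone through the same preprocessing
# (scaling, polynomial features, genre encoding, cluster features) as at fit time.
```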
 
214
 
215
  ---
216
+ ## 10. Repository Structure
217
 
218
+ ```
219
+ project/
220
+ │── README.md
221
+ │── notebook.ipynb
222
+ │── spotify_popularity_enhanced_linear_regression.pkl
223
+ │── spotify_popularity_logistic_regression_classifier.pkl
224
+ │── data/ (optional)
225
+ ```
226
+
227
+ ---
228
+ ## 11. Final Summary
229
 
230
+ This project builds a complete machine learning pipeline for predicting Spotify track popularity.
231
+ Through EDA, feature engineering, regression, and classification, the project demonstrates:
232
 
233
+ - Popularity is difficult to predict linearly
234
+ - Feature engineering improves model performance
235
+ - Enhanced Linear Regression is best for regression
236
+ - Logistic Regression is best for binary popularity classification
237
 
238
+ All final models require the full preprocessing pipeline to reproduce predictions.