# Spotify Track Popularity Prediction

---

## **Part 1: Dataset Overview**

This project uses the **Spotify Tracks Dataset** from Kaggle (≈114,000 songs, 125 genres, 20 features).
The central question is whether musical/audio features can predict a track's popularity score (0–100).
The task begins as a **regression** problem and is later converted into a **classification** task.

---

## **Part 2: Exploratory Data Analysis (EDA)**

### **Cleaning**
- Dropped a small number of rows with missing text metadata.
- No imputation was needed for the audio features.

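The cleaning step above can be sketched as follows. This is a minimal illustration using a toy frame in place of the Kaggle CSV; the column names mirror the dataset but are assumptions here, not taken from the project notebook.

```python
import pandas as pd

# Toy stand-in for the Spotify tracks frame; in the project this comes
# from the Kaggle CSV. Only the text metadata has gaps.
df = pd.DataFrame({
    "track_name":   ["Song A", None, "Song C"],
    "artists":      ["X", "Y", "Z"],
    "danceability": [0.8, 0.5, 0.3],   # audio features have no missing values
})

# Drop rows with missing text metadata; no imputation of audio features.
clean = df.dropna(subset=["track_name", "artists"])
```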
### **Outliers**
- Outliers in duration, loudness, tempo, etc. are musically valid.
- No outlier removal was performed.

### **Target Distribution**
- Popularity is right-skewed; most songs have low popularity scores.

### **Correlations**
- No feature shows a strong linear correlation with popularity.
- Audio features show only weak trends with the target.

### **Visualizations**
_Insert plot images here:_
- Boxplots of numeric features
- Popularity histogram
- Correlation heatmap
- Danceability vs. popularity
- Energy vs. popularity
- Explicit vs. popularity
- Loudness vs. popularity
- Genre popularity
- Any additional scatter/violin plots

---
## **Part 3: Baseline Regression Model**

### **Features**
- Numerical audio features
- One-hot encoded genre (`track_genre`)

### **Model**
- `LinearRegression` (scikit-learn)
- Train/test split: 80/20

### **Performance**
- **MAE** ≈ 14.08
- **RMSE** ≈ 19.14
- **R²** ≈ 0.26

The baseline model explains about **26%** of the variance in popularity.
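The baseline setup can be sketched as below. Synthetic data stands in for the real features, and the toy target is an assumption made purely so the script runs end to end; the project uses the Kaggle audio features plus one-hot encoded `track_genre`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Numeric audio features plus one-hot encoded genre, mirroring the project.
X = pd.DataFrame({"danceability": rng.random(500), "energy": rng.random(500)})
genre = pd.Series(rng.choice(["pop", "jazz", "rock"], 500), name="track_genre")
X = pd.concat([X, pd.get_dummies(genre)], axis=1)

# Toy popularity target with a linear signal (illustrative only).
y = 50 * X["danceability"] + rng.normal(0, 5, 500)

# 80/20 train/test split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5   # RMSE, version-agnostic
r2 = r2_score(y_test, pred)
```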
### **Feature Importance**

- Genre features dominate the coefficients.
- Most genre coefficients are negative, showing niche genres perform worse than the reference genre.

Among non-genre features:
- **Positive:** danceability, explicit
- **Negative:** valence, speechiness, energy
- **Weak influence:** liveness, tempo, key, mode

_Insert plots here:_
- Top 20 coefficients
- Non-genre coefficient plot

---
## **Part 4: Feature Engineering**

### **4.1 Scaling**
- Applied **StandardScaler** to all numeric features to normalize their scales.

### **4.2 Polynomial Features**
- Used `PolynomialFeatures(degree=2)` on the numeric features.
- Captures nonlinear relationships and interactions between audio features.
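Steps 4.1 and 4.2 chain naturally into one preprocessing pipeline. A minimal sketch, with random data standing in for the numeric audio features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in for 3 numeric audio features (e.g. danceability, energy, valence).
X = np.random.default_rng(1).random((100, 3))

# Scale first, then add degree-2 quadratic and interaction terms.
prep = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2))
X_poly = prep.fit_transform(X)

# Degree-2 expansion of 3 features: 1 bias + 3 linear + 6 quadratic = 10 columns.
```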
### **4.3 PCA**
- Applied **PCA (2 components)** to the scaled numeric features, for visualization only.
- No clear clusters; the structure appears continuous in 2D.
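The 2-component PCA used for visualization can be sketched as below; random data again stands in for the scaled audio features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the scaled numeric audio features.
X = np.random.default_rng(2).random((200, 8))
X_scaled = StandardScaler().fit_transform(X)

# Project to 2 components for a scatter plot; not used for modeling.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# pca.explained_variance_ratio_ shows how much variance the 2-D view keeps.
```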

_Insert PCA plot here._

### **4.4 Clustering**
- Applied **K-Means (k = 5)** to the scaled numeric features.
- Added two new engineered features:
  - `cluster_id`
  - `cluster_distance`

Cluster profiles differ in **danceability**, **energy**, **valence**, **acousticness**, and **popularity**.
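The two engineered cluster features can be derived as sketched below. Interpreting `cluster_distance` as the distance to the assigned cluster's centroid is an assumption here; random data stands in for the scaled features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for the scaled numeric audio features.
X = np.random.default_rng(3).random((300, 6))
X_scaled = StandardScaler().fit_transform(X)

# K-Means with k = 5, as in the project.
km = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_id = km.fit_predict(X_scaled)

# transform() gives each row's distance to all 5 centroids;
# keep only the distance to the assigned centroid.
dists = km.transform(X_scaled)
cluster_distance = dists[np.arange(len(X_scaled)), cluster_id]
```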

_Insert K-Means + PCA plot and cluster summary table here._

---
## **Part 5: Improved Regression Models**

Three models were trained on the engineered feature set:

- **Enhanced Linear Regression**
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### **Performance Summary**

| Model | MAE | RMSE | R² |
|------------------------------|--------|--------|-------|
| Linear Regression (Enhanced) | ~14.05 | ~19.03 | ~0.27 |
| Random Forest | ~15.97 | ~19.95 | ~0.20 |
| Gradient Boosting | ~17.02 | ~20.55 | ~0.15 |

### **Result**

- The **Enhanced Linear Regression** model is the **regression winner**.
- Feature engineering improved performance only modestly over the baseline (R² ≈ 0.26 → 0.27).

Saved regression model file:
- `spotify_popularity_enhanced_linear_regression.pkl`
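Persisting the winning model can be sketched as below; a toy model stands in for the trained pipeline, and the temporary directory is just for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the trained regression winner.
model = LinearRegression().fit(np.arange(10).reshape(-1, 1), np.arange(10))

# Same filename as listed above, written to a temp dir for this sketch.
path = os.path.join(tempfile.mkdtemp(),
                    "spotify_popularity_enhanced_linear_regression.pkl")
joblib.dump(model, path)

# Reload; predictions assume the same preprocessing upstream.
reloaded = joblib.load(path)
```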

---
## **Part 6: Regression-to-Classification**

### **Creating Classes**

Popularity was converted into a binary label using the **training-set median (35)**:

- **Class 0:** popularity < 35
- **Class 1:** popularity ≥ 35

The class balance is almost exactly **50/50** in both train and test sets, so no rebalancing or resampling was required.
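The binarization step can be sketched as follows; the tiny popularity array is invented so the median works out to the project's threshold of 35.

```python
import numpy as np

# Toy training-set popularity scores (illustrative; median is 35 here,
# matching the threshold used in the project).
popularity_train = np.array([10, 20, 35, 40, 70, 35])
threshold = np.median(popularity_train)

# Class 1 = popularity at or above the training-set median.
y_class = (popularity_train >= threshold).astype(int)
```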
### **Error Considerations**

- **Precision** is more important than recall here.
- **False positives** are costlier than **false negatives**: predicting a non-popular song as popular wastes marketing resources.

---
## **Part 7: Classification Models**

Three classifiers were trained on the engineered feature space:

- **Logistic Regression**
- **Random Forest Classifier**
- **Gradient Boosting Classifier**

### **Accuracy (Test Set)**

- Logistic Regression: ≈ **0.76**
- Random Forest: ≈ **0.75**
- Gradient Boosting: ≈ **0.72**

### **Full Evaluation**

_Insert plots here:_
- Logistic Regression classification report (screenshot or table)
- Logistic Regression confusion matrix
- Random Forest confusion matrix
- Gradient Boosting confusion matrix

Logistic Regression shows the best balance between precision and recall across both classes.
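The evaluation above can be sketched as follows, using synthetic balanced data in place of the engineered Spotify feature space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced binary problem (stand-in for the real data).
rng = np.random.default_rng(4)
X = rng.random((400, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)        # rows: true class, cols: predicted
report = classification_report(y_te, pred)
```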

---
## **Part 8: Classification Winner**

**Logistic Regression** is the best-performing classifier:

- Highest accuracy (≈ 0.76)
- Balanced precision and recall
- Fewer systematic classification errors than the tree-based models

Saved classification model file:
- `spotify_popularity_logistic_regression_classifier.pkl`

---
## **Final Notes**

This project demonstrates:

- How audio features, genre, PCA structure, and clustering relate to popularity
- How feature engineering improves regression performance
- How a regression problem can be converted into a balanced classification task
- How classical machine-learning models compare when predicting popularity labels

All saved models require the same preprocessing pipeline (scaling, polynomial features, encoding, clustering features) to reproduce their predictions.