🎶 Spotify Music Analysis
Project Overview
This project explores what makes a song popular using a Spotify dataset.
The dataset was cleaned and analyzed using both Python (Pandas + SQLAlchemy) and PostgreSQL (SQL queries).
Tools & Technologies
- Python: data cleaning and visualization (Pandas, SQLAlchemy, Seaborn, Matplotlib)
- PostgreSQL: data storage & querying
- SQL: data cleaning and analysis
Data Cleaning & Preparation
Dataset Overview
- Total tracks (raw): 81,201
- Key columns: track_id, track_name, artists, album_name, duration_ms, popularity, danceability, energy, valence, track_genre
- Source: Spotify sample dataset (CSV file provided)
Cleaning in SQL
- Removed duplicate rows (deduplicated on track_name & artists)
- Standardized text columns (applied TRIM())
- Handled invalid durations
  - Removed tracks with duration_ms = 0
  - Removed tracks with duration_ms > 900,000 ms (~15 minutes)
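The SQL cleaning steps above can be sketched as follows. This is only an illustration, not the project's actual script: it runs the statements against an in-memory SQLite database (a stand-in for PostgreSQL so the example is self-contained), and the table name tracks_raw and the sample rows are hypothetical. Trimming is applied before deduplication here so that whitespace variants of the same track collapse into one row; the original scripts may order the steps differently.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table name and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks_raw (track_name TEXT, artists TEXT, duration_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO tracks_raw VALUES (?, ?, ?)",
    [
        ("  Song A ", "Artist 1", 210000),
        ("Song A", "Artist 1", 210000),   # duplicate once trimmed
        ("Song B", "Artist 2", 0),        # invalid: zero duration
        ("Song C", "Artist 3", 1200000),  # invalid: longer than 15 minutes
        ("Song D", "Artist 4", 180000),
    ],
)

# 1. Standardize text columns with TRIM() so duplicates compare equal.
conn.execute(
    "UPDATE tracks_raw SET track_name = TRIM(track_name), artists = TRIM(artists)"
)

# 2. Deduplicate on (track_name, artists), keeping the first row of each group.
conn.execute(
    """
    DELETE FROM tracks_raw
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM tracks_raw GROUP BY track_name, artists
    )
    """
)

# 3. Drop invalid durations: zero, or above 900,000 ms (~15 minutes).
conn.execute("DELETE FROM tracks_raw WHERE duration_ms = 0 OR duration_ms > 900000")

remaining = conn.execute(
    "SELECT track_name, artists, duration_ms FROM tracks_raw ORDER BY track_name"
).fetchall()
print(remaining)
```

SQLite's implicit rowid is used to pick which duplicate survives; a PostgreSQL table would need its own row identifier (for example a primary key) in that subquery.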
Cleaning in Python
- Loaded dataset into PostgreSQL using SQLAlchemy
- Standardized missing values:
  - Replaced blanks ('') with NaN
  - Filled missing artists, album_name, track_name with "Unknown Artist", "Unknown Album", "Untitled Track"
- Trimmed whitespace in string columns
- Replaced empty strings with "Unknown" across all text fields
- Filled numeric columns with median values
- Saved cleaned data back into PostgreSQL as tracks_cleaned
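A minimal pandas sketch of the Python cleaning steps listed above, assuming a tiny hypothetical DataFrame in place of the real dataset. Only the three named text columns are handled; the generic "Unknown" fill for other text fields would follow the same pattern. The PostgreSQL write is shown as a comment because the connection URL is a placeholder, not the project's actual configuration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the raw Spotify CSV.
df = pd.DataFrame(
    {
        "track_name": ["  Song A ", "", "Song C"],
        "artists": ["Artist 1", "Artist 2", None],
        "album_name": ["Album X", "  ", "Album Z"],
        "popularity": [40.0, np.nan, 60.0],
    }
)

text_cols = ["track_name", "artists", "album_name"]

# Trim whitespace in string columns (missing values pass through unchanged).
for col in text_cols:
    df[col] = df[col].str.strip()

# Treat blank strings as missing, then fill with descriptive placeholders.
df[text_cols] = df[text_cols].replace("", np.nan)
df = df.fillna(
    {
        "track_name": "Untitled Track",
        "artists": "Unknown Artist",
        "album_name": "Unknown Album",
    }
)

# Fill missing numeric values with each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df)

# Saving back to PostgreSQL would look roughly like this (placeholder URL):
# from sqlalchemy import create_engine
# engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/spotify")
# df.to_sql("tracks_cleaned", engine, if_exists="replace", index=False)
```

Whitespace is stripped first so that whitespace-only values become empty strings and are caught by the blank-to-NaN step.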
Summary Statistics (after cleaning)
| Metric | Value |
| --- | --- |
| Total tracks | 81,201 |
| Avg popularity | 34.7 |
| Popularity stdev | 19.3 |
| Avg danceability | 0.56 |
| Avg energy | 0.64 |
| Avg valence | 0.46 |
Popularity Analysis
Question: What makes a song popular?
We analyzed audio features such as danceability, energy, valence, acousticness, instrumentalness, and tempo to see how they relate to popularity. Correlation analysis was performed in Python (pandas).
Correlation with Popularity
| Feature | Correlation with Popularity | Interpretation |
| --- | --- | --- |
| Danceability | 0.035 | Slight positive effect |
| Tempo | 0.013 | Almost no effect |
| Energy | 0.001 | No meaningful relationship |
| Acousticness | -0.025 | Slight negative effect |
| Valence | -0.041 | Very weak negative effect |
| Instrumentalness | -0.095 | Instrumental tracks tend to be less popular |
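A correlation column like the one in this table could be computed with pandas roughly as follows. This is a sketch under assumptions: the DataFrame is hypothetical toy data with deliberately extreme relationships, so its output does not reproduce the values reported above, and the real analysis would read from the cleaned tracks table instead.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use the cleaned tracks table.
df = pd.DataFrame(
    {
        "popularity": [10, 20, 30, 40, 50],
        "danceability": [0.2, 0.4, 0.6, 0.8, 1.0],
        "instrumentalness": [0.9, 0.7, 0.5, 0.3, 0.1],
        "energy": [0.5, 0.9, 0.1, 0.9, 0.5],
    }
)

features = ["danceability", "instrumentalness", "energy"]

# Pearson correlation of each feature with popularity, sorted high to low.
corr_with_popularity = (
    df[features].corrwith(df["popularity"]).sort_values(ascending=False)
)
print(corr_with_popularity.round(3))
```

`corrwith` uses the Pearson coefficient by default, which only measures linear association; a feature with a strong non-linear relationship to popularity could still score near zero.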
Key Takeaways
- Instrumentalness has the strongest (though still weak) correlation with popularity, at -0.095: more instrumental tracks tend to be slightly less popular.
- Danceability shows a tiny positive correlation with popularity (0.035).
- Audio features alone explain very little of popularity; other factors such as artist, marketing, or cultural trends likely play a larger role.
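To put "very little" in numbers, squaring a Pearson correlation gives the share of variance in popularity that a single feature would account for in a one-variable linear fit. The short sketch below applies that to the coefficients from the correlation table; the per-feature shares should not be summed, since the features are themselves correlated with one another.

```python
# Correlation coefficients copied from the correlation table.
correlations = {
    "danceability": 0.035,
    "tempo": 0.013,
    "energy": 0.001,
    "acousticness": -0.025,
    "valence": -0.041,
    "instrumentalness": -0.095,
}

# r squared: share of popularity variance explained by one feature on its own.
variance_explained = {name: r ** 2 for name, r in correlations.items()}

for name, r2 in sorted(variance_explained.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} r^2 = {r2:.4f} ({r2:.2%})")
```

Even the strongest feature, instrumentalness, comes out below 1% of the variance in popularity.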
Visualizations
All plots are saved in the visualizations/ folder and displayed below.
Correlation Heatmap
Danceability vs Popularity
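Plots of this kind could be generated with Seaborn and Matplotlib along the following lines. This is a hedged sketch, not the project's plotting script: the data is random placeholder data, and the output file names are hypothetical.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random placeholder data; the real plots would use the cleaned tracks table.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "popularity": rng.integers(0, 101, size=200),
        "danceability": rng.random(200),
        "energy": rng.random(200),
        "valence": rng.random(200),
    }
)

out_dir = Path("visualizations")
out_dir.mkdir(exist_ok=True)

# Correlation heatmap across all numeric columns.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap")
heatmap_path = out_dir / "correlation_heatmap.png"  # hypothetical file name
fig.savefig(heatmap_path, bbox_inches="tight")
plt.close(fig)

# Scatter plot of one feature against popularity.
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=df, x="danceability", y="popularity", alpha=0.4, ax=ax)
ax.set_title("Danceability vs Popularity")
scatter_path = out_dir / "danceability_vs_popularity.png"  # hypothetical file name
fig.savefig(scatter_path, bbox_inches="tight")
plt.close(fig)
```

Fixing the heatmap color scale to the range -1 to 1 keeps near-zero correlations visually neutral instead of stretching small differences across the whole palette.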
Findings & Summary
- Scatter Plots: The scatter plots showed a random distribution of points. For every audio feature, songs with low, medium, and high popularity scores were present across the entire range of the feature's values.
- Correlation Heatmap: The heatmap confirmed the visual findings. Correlation coefficients between all audio features and popularity were very close to zero, ranging from -0.10 to 0.04.
  - For example, the correlation between danceability and popularity was only 0.035 (about 0.04), so a song's danceability score is not a reliable predictor of its popularity.
Repository Structure
- README.md: Project overview and explanation
- index.html: This live HTML version
- visualizations/: Scatter plots, heatmaps, and other images
- sql/: SQL scripts for cleaning and preparing the dataset
- scripts/: Python scripts for linking the dataset to SQL, cleaning, and generating visuals from the dataset
- data/: Raw dataset used for analysis