By Amy Hein
Reviews matter to us. Before we choose where to eat or what to buy online, many of us check the reviews first, and those reviews shape how we act, whether on Amazon, Google Maps, or elsewhere. Reviews affect not only how people consume food and other goods, but also how some people consume music. Two major reviewing entities dominate online music discussion circles: Pitchfork, a long-running online publication with over 18 thousand reviews, and a bald-headed YouTube creator by the name of Anthony Fantano. Fantano's influence may seem surprising given that he is just one person talking to a camera from the suburbs of Connecticut, but he has garnered a large online cult following of people who take what he says about music to heart. More than telling you what to listen to or avoid, the reviews from these two sources mainly serve as talking points in online music discussion circles. Reviews of art, after all, should be treated differently than reviews of material goods: people agree and disagree with them, and talk about their differing opinions. People in the online music community also often look to entities like Pitchfork and Fantano when they want recommendations on what to listen to.
There are many questions we can ask about this phenomenon from a data science perspective. Does either source have a bias toward certain time periods? Is one entity "nicer" than the other? How often do they agree, and how often do they disagree? And at the core of it all: are there objective quantities we can measure to predict how much people will like a piece of music, or will art always be subjective? With these questions in mind, I will take you through the data science life cycle: data collection and processing, exploratory data analysis, hypothesis testing, and machine learning.
One thing I will note is that several blocks of code below are commented out. The output these blocks produce is always exported to a CSV file, which I include in the repository. Some of these blocks take a very long time to run, and it's much more convenient not to rerun them every time.
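The pattern behind those commented-out blocks could be sketched like this (a hypothetical load_or_compute helper, not code from the repository): only run the expensive computation when the cached CSV is missing.

```python
import os

import pandas as pd


def load_or_compute(path, compute):
    """Return a cached DataFrame from `path`, or compute and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    result = compute()  # the slow step, e.g. many API calls
    result.to_csv(path, index=False)
    return result
```

Each commented-out block below is essentially a hand-unrolled version of this: compute once, write the CSV, then read the CSV on every later run.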
Our goal is to get Spotify data for all the albums that have been reviewed by both Anthony Fantano and Pitchfork. First we have to find the albums in common, then get the Spotify data for each track, then average the data per album. Ultimately there are three datasets that we have to work with to explore this topic: Pitchfork's review data, Anthony Fantano's review data, and Spotify's music data.
The dataset that we will be using for Pitchfork reviews is available on Kaggle here. The attributes of this dataset are as follows:
Let us download database.sqlite from that page and place it handily in our project directory. Rather than working in SQL, we want our data in a pandas DataFrame for easier processing. We can do that by opening a connection to the database and using pandas' read_sql_query() function.
# First, import libraries
import pandas as pd
import numpy as np
import sqlite3
# Establish a connection to the database
conn = sqlite3.connect("./database.sqlite")
# Read into dataframe pitch, short for pitchfork
pitch = pd.read_sql_query("SELECT * FROM reviews;", conn)
pitch.head(5)
reviewid | title | artist | url | score | best_new_music | author | author_type | pub_date | pub_weekday | pub_day | pub_month | pub_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22703 | mezzanine | massive attack | http://pitchfork.com/reviews/albums/22703-mezz... | 9.3 | 0 | nate patrin | contributor | 2017-01-08 | 6 | 8 | 1 | 2017 |
1 | 22721 | prelapsarian | krallice | http://pitchfork.com/reviews/albums/22721-prel... | 7.9 | 0 | zoe camp | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
2 | 22659 | all of them naturals | uranium club | http://pitchfork.com/reviews/albums/22659-all-... | 7.3 | 0 | david glickman | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
3 | 22661 | first songs | kleenex, liliput | http://pitchfork.com/reviews/albums/22661-firs... | 9.0 | 1 | jenn pelly | associate reviews editor | 2017-01-06 | 4 | 6 | 1 | 2017 |
4 | 22725 | new start | taso | http://pitchfork.com/reviews/albums/22725-new-... | 8.1 | 0 | kevin lozano | tracks coordinator | 2017-01-06 | 4 | 6 | 1 | 2017 |
# drop unnecessary columns
pitch = pitch.drop(columns=["reviewid", "url", "author_type", "pub_weekday", "pub_day", "pub_month", "pub_year"])
pitch.head(5)
title | artist | score | best_new_music | author | pub_date | |
---|---|---|---|---|---|---|
0 | mezzanine | massive attack | 9.3 | 0 | nate patrin | 2017-01-08 |
1 | prelapsarian | krallice | 7.9 | 0 | zoe camp | 2017-01-07 |
2 | all of them naturals | uranium club | 7.3 | 0 | david glickman | 2017-01-07 |
3 | first songs | kleenex, liliput | 9.0 | 1 | jenn pelly | 2017-01-06 |
4 | new start | taso | 8.1 | 0 | kevin lozano | 2017-01-06 |
Some albums have multiple artists, such as the review at index 3 above, "First Songs" by Kleenex, Liliput. I happen to know that those are the same band under two different names, but we're not sure which name the album will be listed under on Spotify. So, if the artist field takes the form "artist 1, artist 2", we want to list the same album twice, once under each artist. If we don't find a match for either of them later, we can drop it.
# empty dataframe to store the new, single artist rows that we will append later
fixed_names = pd.DataFrame()
# array to store indices of dual artist albums, we will drop based on index later
indices = []
for index, row in pitch.iterrows():
    # check for match of format
    if ("," in row['artist']):
        # indicate that we want to remove this row
        indices.append(index)
        # get array of artists
        contributors = row['artist'].split(", ")
        # duplicate row, edit artist, append to dataframe
        for a in contributors:
            r = row.copy()  # copy so each appended row keeps its own artist
            r['artist'] = a
            fixed_names = fixed_names.append(r)
pitch = pitch.drop(indices)
pitch = pitch.append(fixed_names)
# rename cols and print
pitch = pitch.rename(columns={"score": "pf_score", "artist": "pf_artist"})
pitch.head(10)
title | pf_artist | pf_score | best_new_music | author | pub_date | |
---|---|---|---|---|---|---|
0 | mezzanine | massive attack | 9.3 | 0.0 | nate patrin | 2017-01-08 |
1 | prelapsarian | krallice | 7.9 | 0.0 | zoe camp | 2017-01-07 |
2 | all of them naturals | uranium club | 7.3 | 0.0 | david glickman | 2017-01-07 |
4 | new start | taso | 8.1 | 0.0 | kevin lozano | 2017-01-06 |
5 | insecure (music from the hbo original series) | various artists | 7.4 | 0.0 | vanessa okoth-obbo | 2017-01-05 |
6 | stillness in wonderland | little simz | 7.1 | 0.0 | katherine st. asaph | 2017-01-05 |
7 | tehillim | yotam avni | 7.0 | 0.0 | andy beta | 2017-01-05 |
8 | reflection | brian eno | 7.7 | 0.0 | andy beta | 2017-01-04 |
9 | filthy america its beautiful | the lox | 5.3 | 0.0 | ian cohen | 2017-01-04 |
10 | clear sounds/perfetta | harry bertoia | 8.0 | 0.0 | marc masters | 2017-01-04 |
Now for the second dataset, we can start to look at Anthony Fantano's review data. The attributes of the dataset are as follows:
It can be retrieved from Kaggle here and, once again, handily placed in our project directory.
fantano = pd.read_csv('fantano_reviews.csv', encoding = "ISO-8859-1")
fantano.tail(5)
Unnamed: 0 | title | artist | review_date | review_type | score | word_score | best_tracks | worst_track | link | |
---|---|---|---|---|---|---|---|---|---|---|
1729 | 1729 | Tell Me How You Really Feel | Courtney Barnett | 2018-05-22 | Album | 6.0 | NaN | ['NEED A LITTLE TIME ; CITY LOOKS PRETTY ; NAM... | WALKIN ' ON EGGSHELLS | https://www.youtube.com/watch?v=GkeHYp7MASY |
1730 | 1730 | Wide Awake! | Parquet Courts | 2018-05-23 | Album | 9.0 | NaN | ['VIOLENCE ', 'BEFORE THE WATER GETS TOO HIGH ... | BACK TO EARTH | https://www.youtube.com/watch?v=4ZZREmYnygU |
1731 | 1731 | Mark Kozelek | Mark Kozelek | 2018-05-25 | Album | 7.0 | NaN | ['THIS IS MY TOWN ', 'LIVE IN CHICAGO ', 'THE ... | YOUNG RIDDICK BOWE | https://www.youtube.com/watch?v=HMIUSLOR350 |
1732 | 1732 | DAYTONA | Pusha T | 2018-05-28 | Album | 8.0 | NaN | ['IF YOU KNOW YOU KNOW ', 'THE GAMES WE PLAY '... | HARD PIANO | https://www.youtube.com/watch?v=z605Rm7lFTM |
1733 | 1733 | Communion | Park Jiha | 2018-05-29 | Album | 6.0 | NaN | ['THROUGHOUT THE NIGHT ', 'COMMUNION ', "ALL S... | ACCUMULATION OF TIME | https://www.youtube.com/watch?v=icib8b4GYlI |
There are a few more columns here than we need, so we will go ahead and drop those, then merge with the pitchfork data.
fantano = fantano.drop(columns=["Unnamed: 0", "word_score", "best_tracks", "worst_track", "link", "review_type"])
fantano = fantano.dropna()
fantano["title"] = fantano["title"].str.lower()
fantano["artist"] = fantano["artist"].str.lower()
fantano = fantano.rename(columns={"score": "af_score", "artist": "af_artist"})
fantano.tail(5)
title | af_artist | review_date | af_score | |
---|---|---|---|---|
1729 | tell me how you really feel | courtney barnett | 2018-05-22 | 6.0 |
1730 | wide awake! | parquet courts | 2018-05-23 | 9.0 |
1731 | mark kozelek | mark kozelek | 2018-05-25 | 7.0 |
1732 | daytona | pusha t | 2018-05-28 | 8.0 |
1733 | communion | park jiha | 2018-05-29 | 6.0 |
# merge
df = pd.merge(fantano, pitch, on=["title"])
df
title | af_artist | review_date | af_score | pf_artist | pf_score | best_new_music | author | pub_date | |
---|---|---|---|---|---|---|---|---|---|
0 | cosmogramma | flying lotus | 2010-05-05 | 8.0 | flying lotus | 8.8 | 1.0 | joe colly | 2010-05-06 |
1 | throat | little women | 2010-05-09 | 9.0 | adr | 7.1 | 0.0 | thea ballard | 2016-12-17 |
2 | latin | holy fuck | 2010-05-10 | 7.0 | holy fuck | 7.8 | 0.0 | joe tangari | 2010-05-14 |
3 | high violet | the national | 2010-05-11 | 6.0 | the national | 8.7 | 1.0 | andrew gaerig | 2010-05-10 |
4 | at echo lake | woods | 2010-05-12 | 8.0 | woods | 8.0 | 0.0 | rob mitchum | 2010-05-10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1071 | 7 | beach house | 2018-05-16 | 7.0 | supersilent | 8.3 | 0.0 | chris dahlen | 2006-03-13 |
1072 | 7 | beach house | 2018-05-16 | 7.0 | philip jeck | 8.2 | 1.0 | mark richardson | 2004-01-13 |
1073 | communion | park jiha | 2018-05-29 | 6.0 | rabit | 7.9 | 0.0 | philip sherburne | 2015-10-26 |
1074 | communion | park jiha | 2018-05-29 | 6.0 | years & years | 7.4 | 0.0 | tim finney | 2015-07-16 |
1075 | communion | park jiha | 2018-05-29 | 6.0 | soundtrack of our lives | 6.2 | 0.0 | joshua klein | 2009-03-11 |
1076 rows × 9 columns
Okay, here is our merged data, but it's not perfect: there are multiple albums with the same name. You may have noticed that we did not handle dual-artist albums in the Anthony Fantano data. That's because they were separated with the word "and," and it is plausible for an artist name to contain "and," so splitting on it would have thrown out a lot of good data. This is also why we didn't merge on artist. Instead, we will check whether the Pitchfork artist pf_artist is contained in the string for the Fantano artist af_artist, and delete non-matching rows based on that.
# empty list
match = []
# loop through: True/False depending on whether the pitchfork artist name is in the fantano name
for i, row in df.iterrows():
    match.append(row['pf_artist'] in row['af_artist'])
# add column
df['match'] = match
# drop rows with no match
df = df[df.match == True]
# rename and rearrange columns
df = df.drop(columns = ["af_artist", "match"])
df = df.rename(columns={"pf_artist": "artist"})
df = df[['title', 'artist', 'af_score', 'review_date', 'pf_score', 'best_new_music', 'author', 'pub_date']]
df
title | artist | af_score | review_date | pf_score | best_new_music | author | pub_date | |
---|---|---|---|---|---|---|---|---|
0 | cosmogramma | flying lotus | 8.0 | 2010-05-05 | 8.8 | 1.0 | joe colly | 2010-05-06 |
2 | latin | holy fuck | 7.0 | 2010-05-10 | 7.8 | 0.0 | joe tangari | 2010-05-14 |
3 | high violet | the national | 6.0 | 2010-05-11 | 8.7 | 1.0 | andrew gaerig | 2010-05-10 |
4 | at echo lake | woods | 8.0 | 2010-05-12 | 8.0 | 0.0 | rob mitchum | 2010-05-10 |
5 | together | the new pornographers | 7.0 | 2010-05-13 | 7.3 | 0.0 | matthew perpetua | 2010-05-05 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1043 | black america again | common | 5.0 | 2016-12-07 | 7.9 | 0.0 | edwin stats houghton | 2016-11-04 |
1044 | 4 your eyez only | j. cole | 6.0 | 2016-12-12 | 6.7 | 0.0 | paul a. thompson | 2016-12-14 |
1045 | do what thou wilt. | ab-soul | 7.0 | 2016-12-13 | 4.4 | 0.0 | sheldon pearce | 2016-12-22 |
1046 | run the jewels 3 | run the jewels | 8.0 | 2016-12-30 | 8.6 | 1.0 | sheldon pearce | 2017-01-03 |
1047 | stillness in wonderland | little simz | 5.0 | 2017-01-16 | 7.1 | 0.0 | katherine st. asaph | 2017-01-05 |
868 rows × 8 columns
Alright! Now we must fetch the appropriate data from Spotify using their API. This is going to be a little bit more involved. We first have to make a Spotify Developer account. That can be done here. Once you have done that, you'll come to a dashboard that will prompt you to "Create an App." Click on that, and enter some details.
Once you create an app from there, Spotify will take you to an overview of your app. This is where you can find your client id, client secret, as well as a few visual aids that have to do with the data your app is handling.
Now that we have that set up, we can start connecting to Spotify's API and retrieving our relevant data. An important library that will help us here is spotipy, a Python wrapper for the Spotify API. Spotify's API keys its data on IDs rather than names, but our data does not contain Spotify IDs, so we will use spotipy's search to look them up.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# To keep my credentials private, I read the fields in from two local text files.
client = pd.read_csv('id.txt', header=None)[0][0]
secret = pd.read_csv('secret.txt', header=None)[0][0]
# connect to Spotify via spotipy
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client,
                                                           client_secret=secret))
def getArtistID(name):
    results = sp.search(q=name, type='artist')
    items = results['artists']['items']
    # an empty items array means the artist was not found on spotify
    return "" if items == [] else items[0]['id']
# example use
getArtistID('steely dan')
'6P7H3ai06vU1sGvdpBwDmE'
Why do this for artists and not for albums? Two albums sharing a name is more probable than two artists sharing a name, so we are more likely to get correct matches by first fetching data per reviewed artist and then filtering down to reviewed albums. Let's go ahead and get the IDs of all the artists in our merged data. Because this code takes a while to run, I have saved the output into a CSV file and commented out the lines that generate it; they can easily be uncommented and run if needed.
# artistIDs = pd.DataFrame()
# artists = df['artist'].unique()
# for x in artists:
#     if x != "":
#         aID = getArtistID(x)
#         if aID != "":
#             # write x, aID to csv file
#             artistIDs = artistIDs.append(pd.Series([x, aID]), ignore_index=True)
# # write csv file
# artistIDs.to_csv('artistIDs.csv', index=False)
artistIDs = pd.read_csv('artistIDs.csv', sep = ',')
artistIDs = artistIDs.rename(columns={"0": "artist", "1": "id"})
artistIDs.tail(10)
artist | id | |
---|---|---|
602 | green day | 7oPftvlwr6VrsViSDV7fJY |
603 | nxworries | 6PEMFpe3PTOksdV4ZXUpbE |
604 | crying | 4RoIQHT4yJcA0ENhs8WLd4 |
605 | the dillinger escape plan | 7IGcjaMGAtsvKBLQX26W4i |
606 | lady gaga | 1HY2Jd0NmPuamShAr6KMms |
607 | conor oberst | 2Z7gV3uEh1ckIaBzTUCE6R |
608 | d.r.a.m. | 5M0lbkGluOPXLeFjApw8r8 |
609 | a tribe called quest | 09hVIj6vWgoCDtT03h8ZCa |
610 | bruno mars | 0du5cEVh5yTK9QJze8zA0C |
611 | little simz | 6eXZu6O7nAUA5z6vLV8NKI |
Now that we have all the artist IDs, we must get all the album IDs. We will write these to a CSV as well, again to save time.
def getAlbums(aID):
    albIDs = []
    r = sp.artist_albums(aID)
    for e in r['items']:
        albIDs.append(e['id'])
    return albIDs
# ids = artistIDs['id']
# albumIDs = []
# for a in ids:
#     l = getAlbums(a)
#     albumIDs.extend(l)
# albumSeries = pd.Series(albumIDs)
# albumSeries.to_csv('albums.csv', index=False)
Now that we have all the album IDs, let's get some album metadata and match it against our reviewed album titles and artists. Then we will know which album IDs are relevant to us.
# albums = pd.read_csv('albums.csv', sep = '\n')['0']
# albums_IDs = pd.DataFrame()
# for a in albums:
#     r = sp.album(a)
#     albums_IDs = albums_IDs.append(pd.Series([r['artists'][0]['name'], r['name'], a]), ignore_index=True)
# albums_IDs.to_csv('albums_IDs.csv', index=False)
albums_IDs = pd.read_csv('albums_IDs.csv')
albums_IDs = albums_IDs.rename(columns={'0': "artist", '1': "title", '2': "id"})
albums_IDs["title"] = albums_IDs["title"].str.lower()
albums_IDs["artist"] = albums_IDs["artist"].str.lower()
albums_IDs
artist | title | id | |
---|---|---|---|
0 | flying lotus | yasuke | 4duUlv53npBm7EmqxTT1kj |
1 | flying lotus | yasuke | 47qyYFxPC1jv6lUp1FBlSl |
2 | flying lotus | flamagra (deluxe edition) | 2S10mDxQswPB4tBI2fKPfX |
3 | flying lotus | flamagra (deluxe edition) | 2XFI4MPI4b1yPPTomGkGqt |
4 | flying lotus | flamagra | 2oDoWhkGhElQJm6jD8uMOB |
... | ... | ... | ... |
11113 | little simz | i love you, i hate you | 75ncNW4YUYAZ1WsHLpR3sf |
11114 | little simz | rollin stone | 0QYrdzHhm3xWfPLdH4tehT |
11115 | little simz | rollin stone | 1je7BydnQhYBit4W2FGYph |
11116 | little simz | woman | 7M0Tu8Fr3L2K105Ew8qzJ0 |
11117 | little simz | woman | 5xLroFHEvbfDUBqsiYgk5Z |
11118 rows × 3 columns
df2 = pd.merge(df, albums_IDs, on=["title", "artist"])
df2
title | artist | af_score | review_date | pf_score | best_new_music | author | pub_date | id | |
---|---|---|---|---|---|---|---|---|---|
0 | cosmogramma | flying lotus | 8.0 | 2010-05-05 | 8.8 | 1.0 | joe colly | 2010-05-06 | 5c7XChrHxYaqykCZLaGM5f |
1 | cosmogramma | flying lotus | 8.0 | 2010-05-05 | 8.8 | 1.0 | joe colly | 2010-05-06 | 5EnERG2QBlF6Z0BrUjEcF4 |
2 | latin | holy fuck | 7.0 | 2010-05-10 | 7.8 | 0.0 | joe tangari | 2010-05-14 | 45nMlmlIPPDVWSCDwdvCB9 |
3 | latin | holy fuck | 7.0 | 2010-05-10 | 7.8 | 0.0 | joe tangari | 2010-05-14 | 72GuNOfz6dRX7WUVRO4SUS |
4 | latin | holy fuck | 7.0 | 2010-05-10 | 7.8 | 0.0 | joe tangari | 2010-05-14 | 6af8JVMaLdtG2utfvLxJZu |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1320 | run the jewels 3 | run the jewels | 8.0 | 2016-12-30 | 8.6 | 1.0 | sheldon pearce | 2017-01-03 | 2vY03PfKPFUUM1FA2lgmC2 |
1321 | run the jewels 3 | run the jewels | 8.0 | 2016-12-30 | 8.6 | 1.0 | sheldon pearce | 2017-01-03 | 1K6D3iovFpTWunGDdVwacq |
1322 | stillness in wonderland | little simz | 5.0 | 2017-01-16 | 7.1 | 0.0 | katherine st. asaph | 2017-01-05 | 7r96hilTCSMBPqqhnYouRw |
1323 | stillness in wonderland | little simz | 5.0 | 2017-01-16 | 7.1 | 0.0 | katherine st. asaph | 2017-01-05 | 2Ip7sT1J1hv9OqlFDPxLgQ |
1324 | stillness in wonderland | little simz | 5.0 | 2017-01-16 | 7.1 | 0.0 | katherine st. asaph | 2017-01-05 | 4G50FUTTI4fCDyrxP1UEer |
1325 rows × 9 columns
These are all the albums we could successfully match on Spotify. Still not perfect, though: Spotify sometimes has multiple copies of the same album. We'll assume that any album with a matching title and artist is sufficient, and just use the ID of the first match.
df = df2.groupby(['title','artist'], as_index=False).first()
df
title | artist | af_score | review_date | pf_score | best_new_music | author | pub_date | id | |
---|---|---|---|---|---|---|---|---|---|
0 | (iii) | crystal castles | 4.0 | 2012-11-13 | 8.0 | 0.0 | ian cohen | 2012-11-12 | 1NIfkZIYVAO6vnfmFOilHc |
1 | ...and then you shoot your cousin | the roots | 4.0 | 2014-05-23 | 7.2 | 0.0 | jayson greene | 2014-05-23 | 6kYqws8vRcaUKTjFnJRb4X |
2 | 1999 | joey bada$$ | 7.0 | 2012-06-18 | 8.0 | 0.0 | felipe delerme | 2012-06-26 | 5ra51AaWF3iVebyhlZ1aqq |
3 | 2 | mac demarco | 6.0 | 2012-10-30 | 8.2 | 1.0 | sam hockley-smith | 2012-10-31 | 0Skv3s5A99n7dstiJOs0aA |
4 | 2014 forest hills drive | j. cole | 6.0 | 2014-12-08 | 6.9 | 0.0 | craig jenkins | 2014-12-11 | 0UMMIkurRUmkruZ3KGBLtG |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
705 | you're better than this | pile | 6.0 | 2015-03-05 | 7.4 | 0.0 | paul thompson | 2015-03-02 | 1rhRbObNIBv3go6twKKfb5 |
706 | you're nothing | iceage | 8.0 | 2013-02-22 | 8.6 | 1.0 | brandon stosuy | 2013-02-18 | 0aUW20pZCgBPGgxUucUIbE |
707 | zeros | the soft moon | 4.0 | 2012-11-01 | 7.0 | 0.0 | brandon stosuy | 2012-11-02 | 0ftaDgVf1d3Hkod3iLkPrI |
708 | zonoscope | cut copy | 5.0 | 2011-02-08 | 8.6 | 1.0 | tom breihan | 2011-02-07 | 5SFPr07PUPCT4YSaZjYeRR |
709 | {awayland} | villagers | 7.0 | 2013-01-17 | 5.5 | 0.0 | ian cohen | 2013-01-14 | 7Hw5WowWzfIqaurdd4ct7q |
710 rows × 9 columns
# Let's write that to a csv for good measure.
df.to_csv('df_no_track_info.csv', index=False)
Now that we have all those album IDs attached to commonly reviewed albums, we can get the tracks from those albums and average some of their sound features.
def getTrackIDs(album):
    trackIDs = []
    r = sp.album_tracks(album)
    for e in r['items']:
        trackIDs.append(e['id'])
    return trackIDs
# albums = df['id']
# tracks = []
# for a in albums:
#     l = getTrackIDs(a)
#     tracks.extend(l)
# trackSeries = pd.Series(tracks)
# trackSeries.to_csv('trackIDs.csv', index=False)
Almost there! What we have right now is every Spotify track ID from every album that has received a review from both Pitchfork and Anthony Fantano (that we could find a match for). Let's get the track metadata and audio features. To do this, I'm going to use a code snippet from Angelica Dietzel's article on BetterProgramming.pub about extracting artist data using Spotify's API.
trackIDs = pd.read_csv('trackIDs.csv', sep = '\n')['0']
# Angelica Dietzel, 2020 (modified)
def getTrackFeatures(id):
    meta = sp.track(id)
    features = sp.audio_features(id)
    # track not found on spotify
    if meta == None:
        return None
    # meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']
    # features (default to 0 if spotify has no audio features for this track)
    if features[0] != None:
        acousticness = features[0]['acousticness']
        danceability = features[0]['danceability']
        energy = features[0]['energy']
        instrumentalness = features[0]['instrumentalness']
        liveness = features[0]['liveness']
        loudness = features[0]['loudness']
        speechiness = features[0]['speechiness']
        tempo = features[0]['tempo']
        time_signature = features[0]['time_signature']
    else:
        acousticness = 0
        danceability = 0
        energy = 0
        instrumentalness = 0
        liveness = 0
        loudness = 0
        speechiness = 0
        tempo = 0
        time_signature = 0
    # note: danceability appears twice here, which produces the duplicate
    # 'danceability.1' column that we drop later
    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature]
    return track
# Angelica Dietzel, 2020 (modified)
# tracks = []
# for id in trackIDs:
#     print(id)
#     track = getTrackFeatures(id)
#     tracks.append(track)
# # create dataset
# df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature'])
# df.to_csv("spotify.csv", sep = ',')
That dataset took a long time to make, so I do not recommend trying that code out for yourself. Anyway, at last, we can consolidate by album, and merge with our other dataset.
# original code from here down
spotify = pd.read_csv('spotify.csv')
spotify
Unnamed: 0 | name | album | artist | release_date | length | popularity | danceability | acousticness | danceability.1 | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Plague | (III) | Crystal Castles | 2012-01-01 | 295880 | 49 | 0.437 | 0.019500 | 0.437 | 0.6490 | 0.82200 | 0.1040 | -9.473 | 0.0675 | 130.169 | 4 |
1 | 1 | Kerosene | (III) | Crystal Castles | 2012-01-01 | 192000 | 60 | 0.519 | 0.082700 | 0.519 | 0.5460 | 0.00000 | 0.1330 | -9.368 | 0.0763 | 115.799 | 4 |
2 | 2 | Wrath Of God | (III) | Crystal Castles | 2012-01-01 | 186626 | 43 | 0.396 | 0.002630 | 0.396 | 0.6260 | 0.74600 | 0.1060 | -9.410 | 0.0515 | 129.979 | 4 |
3 | 3 | Affection | (III) | Crystal Castles | 2012-01-01 | 156706 | 47 | 0.656 | 0.057500 | 0.656 | 0.8290 | 0.00783 | 0.1370 | -7.107 | 0.0459 | 124.640 | 4 |
4 | 4 | Pale Flesh | (III) | Crystal Castles | 2012-01-01 | 178800 | 48 | 0.596 | 0.000173 | 0.596 | 0.6250 | 0.00726 | 0.1700 | -7.620 | 0.0506 | 139.955 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8475 | 8475 | {Awayland} | {Awayland} | Villagers | 2013-04-09 | 155373 | 12 | 0.245 | 0.917000 | 0.245 | 0.1110 | 0.69900 | 0.1280 | -17.156 | 0.0365 | 96.110 | 4 |
8476 | 8476 | Passing A Message | {Awayland} | Villagers | 2013-04-09 | 191913 | 5 | 0.618 | 0.245000 | 0.618 | 0.6950 | 0.74500 | 0.0819 | -9.500 | 0.0346 | 108.037 | 4 |
8477 | 8477 | Grateful Song | {Awayland} | Villagers | 2013-04-09 | 264573 | 10 | 0.214 | 0.144000 | 0.214 | 0.6720 | 0.28500 | 0.0786 | -7.291 | 0.0497 | 177.688 | 3 |
8478 | 8478 | In A Newfound Land You Are Free | {Awayland} | Villagers | 2013-04-09 | 210893 | 10 | 0.418 | 0.927000 | 0.418 | 0.0543 | 0.16400 | 0.1030 | -21.577 | 0.0428 | 73.763 | 4 |
8479 | 8479 | Rhythm Composer | {Awayland} | Villagers | 2013-04-09 | 306693 | 6 | 0.526 | 0.623000 | 0.526 | 0.5510 | 0.09910 | 0.3470 | -9.275 | 0.0338 | 144.935 | 4 |
8480 rows × 17 columns
# Convert release date to float
import datetime
date_num = []
for e in spotify['release_date']:
    # spotify sometimes gives only a year, so pad to a full date
    if len(e) == 4:
        e = str(e) + "-01-01"
    datetime_object = datetime.datetime.strptime(e, '%Y-%m-%d')
    date_num.append(datetime_object.timestamp())
spotify['date'] = date_num
spotify = spotify.drop(columns=['Unnamed: 0', 'name', 'danceability.1', 'time_signature', 'tempo', 'release_date'])
spotify = spotify.groupby(['album','artist'], as_index=False).mean()
spotify["album"] = spotify["album"].str.lower()
spotify["artist"] = spotify["artist"].str.lower()
spotify = spotify.rename(columns={"album": "title"})
spotify
title | artist | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (iii) | crystal castles | 198021.916667 | 48.666667 | 0.552583 | 0.102302 | 0.650167 | 0.402041 | 0.196933 | -8.141500 | 0.120350 | 1.325376e+09 |
1 | ...and then you shoot your cousin | the roots | 182099.000000 | 19.818182 | 0.526273 | 0.543391 | 0.482600 | 0.028602 | 0.177864 | -12.292273 | 0.148355 | 1.400458e+09 |
2 | 1999 | joey bada$$ | 245861.200000 | 56.533333 | 0.628000 | 0.384260 | 0.762667 | 0.030335 | 0.221333 | -4.572200 | 0.277667 | 1.339459e+09 |
3 | 2 | mac demarco | 162884.923077 | 35.769231 | 0.536769 | 0.197898 | 0.652462 | 0.252438 | 0.209454 | -7.800308 | 0.060246 | 1.376006e+09 |
4 | 2014 forest hills drive | j. cole | 298749.846154 | 15.615385 | 0.639231 | 0.402331 | 0.575308 | 0.005664 | 0.184315 | -8.981385 | 0.267423 | 1.418083e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
705 | another eternity | purity ring | 212780.700000 | 23.800000 | 0.526200 | 0.156091 | 0.591300 | 0.020943 | 0.227230 | -7.636100 | 0.069890 | 1.425341e+09 |
706 | awe naturale | theesatisfaction | 140896.153846 | 9.769231 | 0.675231 | 0.323658 | 0.525123 | 0.288759 | 0.173915 | -12.103231 | 0.084554 | 1.332806e+09 |
707 | channel orange | frank ocean | 196908.882353 | 62.235294 | 0.546882 | 0.399382 | 0.483141 | 0.161221 | 0.272676 | -10.392588 | 0.143476 | 1.341878e+09 |
708 | pom pom | ariel pink | 236672.235294 | 34.941176 | 0.398118 | 0.085540 | 0.800412 | 0.129650 | 0.278994 | -3.682941 | 0.090982 | 1.416182e+09 |
709 | {awayland} | villagers | 235906.727273 | 12.909091 | 0.493818 | 0.477091 | 0.521118 | 0.309210 | 0.152873 | -10.775455 | 0.047055 | 1.365466e+09 |
710 rows × 12 columns
df = pd.merge(df, spotify, on=["title", "artist"])
df.to_csv("main_data.csv", sep = ',')
# Last but not least, convert our date floats back to comprehensible datetime objects.
# Pitchfork pub dates and Fantano review dates should be datetime objects, too.
year = []
for e in df['date']:
    y = datetime.datetime.fromtimestamp(e)
    year.append(y)
def str_dt(x):
    return datetime.datetime.strptime(x, '%Y-%m-%d')
df['pub_date'] = df['pub_date'].apply(str_dt)
df['review_date'] = df['review_date'].apply(str_dt)
df = df.drop(columns=['date'])
df['release_date'] = year
df
title | artist | af_score | review_date | pf_score | best_new_music | author | pub_date | id | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | release_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (iii) | crystal castles | 4.0 | 2012-11-13 | 8.0 | 0.0 | ian cohen | 2012-11-12 | 1NIfkZIYVAO6vnfmFOilHc | 198021.916667 | 48.666667 | 0.552583 | 0.102302 | 0.650167 | 0.402041 | 0.196933 | -8.141500 | 0.120350 | 2012-01-01 |
1 | ...and then you shoot your cousin | the roots | 4.0 | 2014-05-23 | 7.2 | 0.0 | jayson greene | 2014-05-23 | 6kYqws8vRcaUKTjFnJRb4X | 182099.000000 | 19.818182 | 0.526273 | 0.543391 | 0.482600 | 0.028602 | 0.177864 | -12.292273 | 0.148355 | 2014-05-19 |
2 | 1999 | joey bada$$ | 7.0 | 2012-06-18 | 8.0 | 0.0 | felipe delerme | 2012-06-26 | 5ra51AaWF3iVebyhlZ1aqq | 245861.200000 | 56.533333 | 0.628000 | 0.384260 | 0.762667 | 0.030335 | 0.221333 | -4.572200 | 0.277667 | 2012-06-12 |
3 | 2 | mac demarco | 6.0 | 2012-10-30 | 8.2 | 1.0 | sam hockley-smith | 2012-10-31 | 0Skv3s5A99n7dstiJOs0aA | 162884.923077 | 35.769231 | 0.536769 | 0.197898 | 0.652462 | 0.252438 | 0.209454 | -7.800308 | 0.060246 | 2013-08-09 |
4 | 2014 forest hills drive | j. cole | 6.0 | 2014-12-08 | 6.9 | 0.0 | craig jenkins | 2014-12-11 | 0UMMIkurRUmkruZ3KGBLtG | 298749.846154 | 15.615385 | 0.639231 | 0.402331 | 0.575308 | 0.005664 | 0.184315 | -8.981385 | 0.267423 | 2014-12-09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
705 | you're better than this | pile | 6.0 | 2015-03-05 | 7.4 | 0.0 | paul thompson | 2015-03-02 | 1rhRbObNIBv3go6twKKfb5 | 220838.900000 | 14.400000 | 0.302600 | 0.075326 | 0.594100 | 0.415051 | 0.167410 | -11.394100 | 0.048500 | 2015-03-03 |
706 | you're nothing | iceage | 8.0 | 2013-02-22 | 8.6 | 1.0 | brandon stosuy | 2013-02-18 | 0aUW20pZCgBPGgxUucUIbE | 141968.083333 | 3.250000 | 0.197917 | 0.054959 | 0.754842 | 0.478018 | 0.266483 | -4.798083 | 0.074367 | 2013-02-18 |
707 | zeros | the soft moon | 4.0 | 2012-11-01 | 7.0 | 0.0 | brandon stosuy | 2012-11-02 | 0ftaDgVf1d3Hkod3iLkPrI | 205968.400000 | 18.000000 | 0.488100 | 0.039557 | 0.920000 | 0.822500 | 0.415620 | -5.415600 | 0.091150 | 2012-10-30 |
708 | zonoscope | cut copy | 5.0 | 2011-02-08 | 8.6 | 1.0 | tom breihan | 2011-02-07 | 5SFPr07PUPCT4YSaZjYeRR | 335133.000000 | 30.181818 | 0.513091 | 0.067440 | 0.797000 | 0.370252 | 0.311336 | -5.773455 | 0.042382 | 2011-01-01 |
709 | {awayland} | villagers | 7.0 | 2013-01-17 | 5.5 | 0.0 | ian cohen | 2013-01-14 | 7Hw5WowWzfIqaurdd4ct7q | 235906.727273 | 12.909091 | 0.493818 | 0.477091 | 0.521118 | 0.309210 | 0.152873 | -10.775455 | 0.047055 | 2013-04-09 |
710 rows × 19 columns
Folks... we have our dataset! At last, let's do some exploratory data analysis.
Matplotlib will come in handy for graphs and visualizations.
Let's first take a look at how Fantano's reviews compare to Pitchfork reviews. Is he usually more forgiving or less forgiving? We can find out by plotting Fantano's album scores against Pitchfork's album scores. Each dot represents an album that they both rated.
import matplotlib.pyplot as plt
We're going to separate the data into two groups: albums that earned a Best New Music tag from Pitchfork and those that didn't.
groups = df.groupby("best_new_music")
for bnm, group in groups:
    plt.plot(group["pf_score"], group["af_score"], marker="o", linestyle="", label=bnm)
plt.legend(title="best_new_music")
plt.show()
The Best New Music tag didn't give us much new information beyond the fact that the score is above roughly 8.3. One thing that's important to note, given the shape of this graph, is that Fantano usually gives only integer scores, while Pitchfork scores to one decimal place. Let's add a regression line to the plot.
import numpy as np

m, b = np.polyfit(df['pf_score'], df['af_score'], 1)
plt.scatter(df['pf_score'], df['af_score'], c = 'firebrick')
plt.plot(df['pf_score'], m*df['pf_score']+b, c = 'black')
plt.xlabel("Pitchfork Score")
plt.ylabel("Fantano Score")
print ('slope: ' + str(m))
print ('intercept: ' + str(b))
slope: 0.33132081873591773
intercept: 3.8582464547543283
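To make this concrete, we can plug a few Pitchfork scores into the fitted line (the slope and intercept are copied from the output above):

```python
# Sanity-check the fitted line using the slope and intercept printed above
m, b = 0.33132081873591773, 3.8582464547543283
for pf in (1, 5, 10):
    # Predicted Fantano score for a given Pitchfork score
    print(pf, '->', round(m * pf + b, 1))
```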
According to this line, on average, Fantano's score is about a third of the Pitchfork score, plus 4. This means a 1 from Pitchfork would be about a 4 from Fantano, a Pitchfork 5 is a Fantano 5 to 6, and a Pitchfork 10 is about a Fantano 7. We can gather that Fantano is a little more hesitant to give scores at the extreme ends of the 0 to 10 scale. Let's see if we can find more patterns that support this, starting with histograms of each source's scores.
bins = [1,2,3,4,5,6,7,8,9,10]
Pitchfork = df['pf_score']
Fantano = df['af_score']
plt.hist(Fantano, bins, facecolor='black', alpha = 0.75)
plt.hist(Pitchfork, bins, facecolor='red', alpha = 0.75)
plt.show()
The red in this graph represents Pitchfork scores, and the black represents Fantano scores. This tells a slightly different story: Pitchfork is more apt to give scores in the 7-to-8 range, while Fantano's ratings are a little more normally distributed.
We can also extend this sort of graph to every album each entity has reviewed, not just the ones reviewed by both.
Pitchfork = pitch['pf_score']
plt.hist(Pitchfork, bins, facecolor='red', alpha = 0.75)
plt.show()
Fantano = fantano['af_score']
plt.hist(Fantano, bins, facecolor='black', alpha = 0.75)
plt.show()
The shapes of the graphs are similar, but not identical, which means the missing data does shift summary statistics like the mean somewhat. I believe the data is Missing at Random. Because Anthony Fantano is one person and Pitchfork is a long-running publication with a whole team, Fantano may have only reviewed albums that were especially relevant during the years he has been an active content creator, a much shorter window than Pitchfork's existence. There are also albums Fantano has reviewed that Pitchfork hasn't, and it's hard to say why; perhaps he champions certain records that a large publication doesn't have on its radar, among any number of other reasons.
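One way to quantify how much the two catalogs overlap is pandas' merge indicator. A sketch with toy stand-ins for the two review tables (the `album` column name is an assumption, not necessarily what the real dataframes use):

```python
import pandas as pd

# Toy stand-ins for the real `pitch` and `fantano` review tables
pitch = pd.DataFrame({'album': ['a', 'b', 'c'], 'pf_score': [8.0, 6.5, 7.0]})
fantano = pd.DataFrame({'album': ['b', 'c', 'd'], 'af_score': [7.0, 5.0, 8.0]})

# An outer merge with indicator=True labels each album by which source(s) reviewed it:
# 'both', 'left_only' (Pitchfork only), or 'right_only' (Fantano only)
merged = pitch.merge(fantano, on='album', how='outer', indicator=True)
print(merged['_merge'].value_counts())
```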
Back to the data analysis: the way Pitchfork and Fantano score albums lends itself well to a violin plot, given Fantano's integer-only scores. We'll use seaborn for this.
import seaborn as sns
sns.violinplot(x="af_score", y="pf_score", data=df)
Although there appears to be a correlation between how Fantano and Pitchfork score things (we'll explore this more later), the two sources clearly don't always agree.
As mentioned previously, Spotify gives us a number of track features that have been averaged per album. We can look at those and determine whether Fantano or Pitchfork prefers music with these qualities.
# comparePlot makes a scatter plot of both sets of review scores against a
# given audio feature, with a regression line for each source.
# Black is Fantano, red is Pitchfork.
def comparePlot(field):
    plt.scatter(df[field], df['af_score'], c='black', alpha=0.5)
    ma, ba = np.polyfit(df[field], df['af_score'], 1)
    plt.plot(df[field], ma*df[field]+ba, c='black')
    plt.scatter(df[field], df['pf_score'], c='red', alpha=0.5)
    mp, bp = np.polyfit(df[field], df['pf_score'], 1)
    plt.plot(df[field], mp*df[field]+bp, c='red')
    plt.xlabel(field)
    plt.ylabel("Score")
    plt.show()
comparePlot('popularity')
comparePlot('acousticness')
comparePlot('energy')
comparePlot('instrumentalness')
comparePlot('liveness')
comparePlot('loudness')
comparePlot('speechiness')
comparePlot('danceability')
comparePlot('length')
I will be the first to admit that none of these plots are all that interesting. One thing we can notice, though, is that Anthony's black line sits below Pitchfork's red line for every quality Spotify gives us, with the gap varying from quality to quality. We already knew that Anthony is generally harsher on these albums than Pitchfork, so to get the most out of this data, it's best to normalize the review scores. I believe we should normalize over only the shared albums, not the full respective review datasets. We'll standardize each set of scores, subtracting the mean and dividing by the standard deviation, so the two distributions are directly comparable.
df_normalized = df.copy()
df_normalized['pf_score'] = (df_normalized['pf_score']-df_normalized['pf_score'].mean())/(df_normalized['pf_score'].std())
df_normalized['af_score'] = (df_normalized['af_score']-df_normalized['af_score'].mean())/(df_normalized['af_score'].std())
Pitchfork = df_normalized['pf_score']
Fantano = df_normalized['af_score']
plt.hist(Fantano, bins=9, facecolor='black', alpha = 0.75)
plt.hist(Pitchfork, bins=9, facecolor='red', alpha = 0.75)
plt.show()
This looks like data we can compare more accurately. Let's try making the same sorts of plots with the normalized data.
# Same comparison plot as before, but on the normalized scores
def normalizedPlot(field):
    plt.scatter(df_normalized[field], df_normalized['af_score'], c='black', alpha=0.5)
    ma, ba = np.polyfit(df_normalized[field], df_normalized['af_score'], 1)
    plt.plot(df_normalized[field], ma*df_normalized[field]+ba, c='black')
    plt.scatter(df_normalized[field], df_normalized['pf_score'], c='red', alpha=0.5)
    mp, bp = np.polyfit(df_normalized[field], df_normalized['pf_score'], 1)
    plt.plot(df_normalized[field], mp*df_normalized[field]+bp, c='red')
    plt.xlabel(field)
    plt.ylabel("Score")
    plt.show()
normalizedPlot('popularity')
normalizedPlot('danceability')
normalizedPlot('acousticness')
normalizedPlot('energy')
normalizedPlot('instrumentalness')
normalizedPlot('liveness')
normalizedPlot('loudness')
normalizedPlot('speechiness')
Just from eyeballing, these graphs suggest there isn't a strong correlation between any single Spotify audio feature and a high or low rating. Before we move on to hypothesis testing, there are a couple of other things I'd like to plot using our date data. Did either source get nicer over time? And do they prefer certain eras of music?
import matplotlib.dates as mdates
from datetime import date
# Plotting method that takes a date column a (as datetime objects) for the x axis
# and a numerical column b for the y axis
# code borrowed from
# https://stackoverflow.com/questions/29308729/how-to-plot-a-linear-regression-with-datetimes-on-the-x-axis
def dateXplot(a, b, colr, xlab):
    df_normalized['date_ordinal'] = pd.to_datetime(df_normalized[a]).apply(lambda date: date.toordinal())
    ax = sns.regplot(
        data=df_normalized,
        x='date_ordinal',
        y=b,
        scatter_kws={'alpha': 0.15},
        color=colr,
    )
    # Tighten up the axes for prettiness
    # ax.set_xlim(df_normalized['date_ordinal'].min() - 1, df_normalized['date_ordinal'].max() + 1)
    # ax.set_ylim(0, df_normalized[b].max() + 1)
    ax.set_xlabel(xlab)
    ax.set_ylabel('Normalized Score')
    # Pin the tick locations before relabeling, to avoid the FixedFormatter warning
    ax.set_xticks(ax.get_xticks())
    new_labels = [date.fromordinal(int(item)) for item in ax.get_xticks()]
    ax.set_xticklabels(new_labels)
    plt.show()
dateXplot('review_date', 'af_score', 'black', 'Review Date')
dateXplot('pub_date', 'pf_score', 'red', 'Review Date')
dateXplot('release_date', 'af_score', 'black', 'Album Release Date')
dateXplot('release_date', 'pf_score', 'red', 'Album Release Date')
None of these graphs point towards any temporal trend. It appears that neither source has changed its rating criteria over time, and neither shows a bias toward records from older or more recent years.
For this part, we first check whether our score distributions are normal; the result tells us which other tests are appropriate to run. We'll perform a Shapiro-Wilk test on both sets of review scores. The null hypothesis here is that the data is normally distributed. Source: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
import scipy.stats as sp
# Shapiro-Wilk test Fantano's scores
stat, p = sp.shapiro(df['af_score'])
print('Fantano: Statistics=%.3f, p=%.3f' % (stat, p))
# Shapiro-Wilk Pitchfork scores
stat2, p2 = sp.shapiro(df['pf_score'])
print('Pitchfork: Statistics=%.3f, p=%.3f' % (stat2, p2))
Fantano: Statistics=0.949, p=0.000
Pitchfork: Statistics=0.903, p=0.000
The low p-values tell us the data is not normally distributed, so we reject the null hypothesis. Let's try it with the normalized data from earlier.
# Shapiro-Wilk test Fantano's scores
stat, p = sp.shapiro(df_normalized['af_score'])
print('Fantano: Statistics=%.3f, p=%.3f' % (stat, p))
# Shapiro-Wilk Pitchfork scores
stat2, p2 = sp.shapiro(df_normalized['pf_score'])
print('Pitchfork: Statistics=%.3f, p=%.3f' % (stat2, p2))
Fantano: Statistics=0.949, p=0.000
Pitchfork: Statistics=0.903, p=0.000
Even the normalized data is not normally distributed; it has the same shape, just shifted and rescaled so the means and spreads line up, which is why we get exactly the same statistics and p-values. Because the data is not normally distributed, we have to run nonparametric tests. We can run a Spearman correlation analysis to see whether the Pitchfork and Fantano album scores are correlated. The null hypothesis here is that the two sets of scores are not correlated.
sp.spearmanr(df['pf_score'], df['af_score'])
SpearmanrResult(correlation=0.23544186624405386, pvalue=2.125418949083463e-10)
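As an aside, Spearman's rho is simply Pearson correlation applied to the ranks of the data, which is why it needs no normality assumption. A tiny demonstration with made-up scores:

```python
import numpy as np
import scipy.stats as sp

x = np.array([7.2, 8.0, 5.5, 9.1, 6.3])
y = np.array([6.0, 8.0, 5.0, 7.0, 6.0])  # note the tie at 6.0

rho, _ = sp.spearmanr(x, y)
# Spearman is exactly Pearson on the ranks (ties get average ranks)
rho_by_hand = sp.pearsonr(sp.rankdata(x), sp.rankdata(y))[0]
print(rho, rho_by_hand)
```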
The Spearman correlation test gave us a very low p-value, so we can reject the null hypothesis that the two sets of scores are uncorrelated. This suggests there are qualities of an album that help it earn good scores from both entities. Whether those scores can be predicted from the audio features Spotify provides is another question. That is where the machine learning comes in.
# Split into training and test data
# Predict Pitchfork Score without Fantano score
from sklearn.model_selection import train_test_split
cols = ['popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['pf_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
We are going to use this to train a multivariate linear regression model.
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train,y_train)
print('coefficients: ')
for i in range(len(model.coef_)):
print(cols[i] + ": " + str(model.coef_[i]))
print('intercept: ' + str(model.intercept_))
coefficients:
popularity: -0.0011754043408539327
danceability: -0.549871551121753
acousticness: 1.0978970355390232
energy: -0.46894152072665857
instrumentalness: 0.2066451447059251
liveness: 2.225642313554848
loudness: 0.02427779488194215
speechiness: 0.46837771669437006
intercept: 7.311774090368092
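Before reading importance off these numbers, note that the features live on very different scales (popularity runs 0-100, most audio features 0-1), and raw coefficients shrink or grow to compensate. A toy demonstration with synthetic data, not the review dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
x_small = rng.normal(0, 0.01, 100)   # feature on a tiny scale
x_large = rng.normal(0, 100.0, 100)  # feature on a large scale
# Both features contribute comparably to y (their terms have similar spread)...
y = 200.0 * x_small + 0.01 * x_large

X = np.column_stack([x_small, x_large, np.ones(100)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# ...yet the raw coefficients differ by orders of magnitude
print(coef[0], coef[1])
```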
Some coefficients are much larger than others, but because the features are on different scales, magnitude alone doesn't measure importance. Even so, popularity seems to have the least effect on score (its coefficient might as well be 0 even relative to its 0-100 scale), while liveness and acousticness carry the largest coefficients among the 0-to-1 audio features. Let's look at the R-squared value to see how accurate this model is.
model.score(X_test, y_test)
-0.018718747500091615
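A negative R-squared may look strange, but on held-out data it simply means the model predicts worse than always guessing the mean. A small worked example with made-up numbers:

```python
import numpy as np

# R-squared compares the model's squared error to that of always predicting the mean
y_true = np.array([6.0, 7.0, 8.0, 9.0])
y_pred = np.array([9.0, 6.0, 9.0, 6.0])  # a model that's worse than the mean

ss_res = np.sum((y_true - y_pred) ** 2)            # 20.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 5.0
r2 = 1 - ss_res / ss_tot
print(r2)  # negative: worse than the mean baseline
```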
The R-squared is not just low, it's negative: on the held-out test data, this model does worse than simply predicting the mean score. In other words, the independent audio variables explain essentially none of the variation in Pitchfork scores. Let's see if and how much the model improves if we consider Anthony Fantano's score as well.
cols = ['af_score', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['pf_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = LinearRegression().fit(X_train,y_train)
print("r-squared: " + str(model.score(X_test, y_test)))
print("fantano score coeff: " + str(model.coef_[0]))
r-squared: 0.04548824732773071
fantano score coeff: 0.20618821981507965
Let's see what the average difference between the predicted score and the actual score is.
diff = model.predict(X_test) - y_test
diff.mean()
-0.05993859140421221
Close to zero, but since over- and undershoots cancel out here, that number flatters the model. It almost feels like cheating, but let's see what happens when we account for the best_new_music column too.
cols = ['best_new_music', 'af_score', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['pf_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = LinearRegression().fit(X_train,y_train)
print("r-squared: " + str(model.score(X_test, y_test)))
print("best new music coeff: " + str(model.coef_[0]))
r-squared: 0.34216914711616886
best new music coeff: 1.5119784592515624
Definitely a lot better, but how could it not be? Let's see if Fantano's scores are any more predictable.
cols = ['popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['af_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = LinearRegression().fit(X_train,y_train)
print('coefficients: ')
for i in range(len(model.coef_)):
print(cols[i] + ": " + str(model.coef_[i]))
print('intercept: ' + str(model.intercept_))
coefficients:
popularity: -0.0006052529280098929
danceability: -0.7748848724620336
acousticness: 0.2348852560553427
energy: -1.0205828392238079
instrumentalness: 1.0345946650652393
liveness: 1.0247457047011492
loudness: 0.04946362672989524
speechiness: 0.6323560331785891
intercept: 7.04668771102398
model.score(X_test, y_test)
-0.006544698763402579
And with the Pitchfork score?
cols = ['pf_score', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['af_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = LinearRegression().fit(X_train,y_train)
print('coefficients: ')
for i in range(len(model.coef_)):
print(cols[i] + ": " + str(model.coef_[i]))
print('intercept: ' + str(model.intercept_))
print('r-sq: ' + str(model.score(X_test, y_test)))
coefficients:
pf_score: 0.2936286263136484
popularity: -0.009211620703548896
danceability: -0.5171581705695801
acousticness: -0.3533676353068687
energy: -1.1110144877951071
instrumentalness: 0.6877029224309528
liveness: 0.11610868723556808
loudness: 0.023336364373497495
speechiness: 2.0542341339588654
intercept: 5.333777877029917
r-sq: 0.06662352924588022
Only slightly better, but still pretty far off. Let's see what the average difference between the model's prediction and the actual score is.
diff = model.predict(X_test) - y_test
diff.mean()
0.27363564355777925
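A word of caution before interpreting that: positive and negative errors cancel in a plain mean, so a small value can hide large misses. A made-up illustration, alongside mean absolute error, which doesn't let errors cancel:

```python
import numpy as np

# Hypothetical predictions and true scores, chosen so errors cancel exactly
y_true = np.array([7.0, 8.0, 6.0, 9.0])
y_pred = np.array([6.0, 9.0, 6.5, 8.5])

mean_diff = np.mean(y_pred - y_true)    # over- and undershoots cancel
mae = np.mean(np.abs(y_pred - y_true))  # they don't cancel here
print(mean_diff, mae)
```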
That number is not a great measure of how good the model is, since we over- and undershoot a lot; a metric like mean absolute error would be more honest. Last but not least, let's try using the k-nearest-neighbors algorithm to predict whether Pitchfork tags an album Best New Music, based on the sound attributes and the Fantano score.
cols = ['af_score', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness']
X = df[cols]
y = df['best_new_music']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
knn_reg = KNeighborsClassifier()
knn_reg.fit(X_train,y_train)
k_pred = knn_reg.predict(X_test)
print ("accuracy: ", metrics.accuracy_score(y_test, k_pred))
accuracy: 0.7323943661971831
It is accurate about 73% of the time. That's better than a coin flip, but since most albums don't carry the Best New Music tag, the honest comparison is against always predicting "no tag."
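One sanity check worth running is the majority-class baseline: the accuracy of a classifier that always predicts the more common label. A sketch with hypothetical labels standing in for `y_test`:

```python
import numpy as np

# Hypothetical test labels; in the notebook this would be y_test
y = np.array([0, 0, 0, 1])  # most albums are not Best New Music

# Accuracy of always predicting the majority class
baseline = max((y == 0).mean(), (y == 1).mean())
print(baseline)
```

If that baseline comes out near the model's 73%, the classifier isn't adding much over guessing the majority class.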
This has been a journey through the data science pipeline, seen through the lens of music data and criticism. We learned many valuable lessons along the way: how to wrangle a lot of data despite the odds, that API calls can take a very long time, and that critics sometimes agree and sometimes don't. Most importantly, we learned that art is truly subjective, and we mustn't try too hard to quantify personal taste. To each their own, indeed.