Decoding the Language of Music with MusicGen

Mar 23, 2023

As part of the Advanced Corpus Linguistics project, we built a corpus of lyrics from different musical genres and analyzed the themes, patterns of language use, and linguistic features of the most popular (top-charting) songs within those genres over the past decades.

Data Collection
Corpus Description
Data Preprocessing
Annotation
Interannotator Agreement Study
Web Interface Demo
Findings and Data Visualization

Data Collection

We used the Spotify API and Apple Music to scrape charts and songs from the Spotify and Apple Music charts, and Genius.com to search for and collect lyrics. We have three generators, namely song_scrapper, charts_scrapper and lyrics_scrapper, dedicated to scraping the songs, charts and lyrics respectively. We then stored the corpus as a JSON file so that it is easy to work with. The file contains one document per song, holding all of that song's features. There are approximately 200 words per document, and because the amount of lyric data available is nearly limitless, the corpus can easily reach or exceed the size of the Brown Corpus (roughly one million words).
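As a rough illustration of the storage layout described above, the sketch below writes and reads a JSON corpus with one document per song. The helper names (`save_corpus`, `load_corpus`) and most field names are assumptions made for illustration; only `spotify_genres` and `am_genre` appear elsewhere in this post, and the project's actual schema may differ.

```python
import json
import os
import tempfile


def save_corpus(songs, path):
    """Write the corpus as a JSON list with one document (dict) per song."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(songs, f, ensure_ascii=False, indent=2)


def load_corpus(path):
    """Read the JSON corpus back into a list of song documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical song document; the field names and values are placeholders.
example_song = {
    "title": "Example Song",
    "artist": "Example Artist",
    "year": 2020,
    "am_genre": "Pop",
    "spotify_genres": ["pop", "dance pop"],
    "features": {"danceability": 0.7, "acousticness": 0.1},
    "lyrics": ["first line of lyrics", "second line of lyrics"],
}

corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.json")
save_corpus([example_song], corpus_path)
corpus = load_corpus(corpus_path)
print(len(corpus), corpus[0]["title"])
```

Keeping everything about a song in a single document means each record can be annotated or filtered without joining several files.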

Corpus Description

Genre: Pop, Rock, Hip Hop, R&B, Country, Latin Pop, etc.
Years: decades from 1970s to 2020s
Language: English
Music features: artist, duration, danceability, acousticness, etc.
Lyrics: rhyme patterns
Metadata: Chart name and IDs

Data Preprocessing

After data collection, we initially inspected the data. To ensure annotation quality, we removed the rows with empty lists or missing data in the spotify_genres column of the dataframe. About 200 of the 8551 song records were removed. We then extracted the audio features from the feature dictionary, into
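The cleaning step can be sketched roughly as follows, working directly on the JSON documents rather than on a dataframe. This is a minimal illustration and not the project's code: the `features` key and the function names are hypothetical, and the sketch assumes that the audio features end up as top-level fields, which the text above does not spell out.

```python
def has_genres(song):
    """Return True if the song has a non-empty list in spotify_genres."""
    genres = song.get("spotify_genres")
    return isinstance(genres, list) and len(genres) > 0


def flatten_features(song, feature_key="features"):
    """Copy each audio feature from the nested dictionary to a top-level field."""
    flat = {k: v for k, v in song.items() if k != feature_key}
    flat.update(song.get(feature_key) or {})
    return flat


def preprocess(songs):
    """Drop songs without Spotify genres, then flatten the remaining ones."""
    return [flatten_features(s) for s in songs if has_genres(s)]


# Toy records (made up) covering the kept, empty-list and missing cases.
songs = [
    {"title": "A", "spotify_genres": ["pop"], "features": {"danceability": 0.8, "energy": 0.6}},
    {"title": "B", "spotify_genres": [], "features": {"danceability": 0.5, "energy": 0.4}},
    {"title": "C", "spotify_genres": None, "features": {"danceability": 0.3, "energy": 0.9}},
]
cleaned = preprocess(songs)
print(cleaned)
```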

Annotation

Annotation task: genres + rhyme scheme in lyrics
Annotation labels:

Genres: Pop, Rock, Hip Hop, R&B, Country…

Rhyme scheme: AABB, ABAB notation (identify rhyme pairs)

Methods: we conducted a pilot study combining manual annotation with automatic annotation of genres (using code); no external tools were used.
Annotators: our teammates served as annotators; we did not consider Mechanical Turk at this point.

Our annotation consists of two tasks: comparing the genre annotations from Spotify and Apple Music, and recognizing rhyme schemes in the lyrics. In the second task, we try to identify the rhyme schemes in song lyrics from our corpus. A rhyme scheme describes which lines of a lyric rhyme with one another. The chosen pattern creates a flow that feels either stable or unstable. We use a very basic notation system to mark these up:

- If two (or more) lines rhyme together, we give them the same letter (starting with A)

- If a line doesn’t rhyme with any other line, we mark it with an X
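The two notation rules can be sketched in code as follows. This minimal example assumes that an annotator has already decided which lines rhyme and has given each line a rhyme-sound key (or `None` for a line with no rhyme); the function name and the keys are hypothetical, and the sketch does not attempt to detect rhymes automatically.

```python
from collections import Counter
from string import ascii_uppercase


def rhyme_scheme(rhyme_sounds):
    """Turn one rhyme-sound key per line into a scheme string.

    Lines sharing a key with at least one other line get the same letter,
    assigned in order of first appearance starting with A. Lines whose key
    is None or unique are marked X.
    """
    counts = Counter(s for s in rhyme_sounds if s is not None)
    # Skip the letter X so it stays reserved for unrhymed lines.
    letters = (c for c in ascii_uppercase if c != "X")
    assigned = {}
    scheme = []
    for sound in rhyme_sounds:
        if sound is None or counts[sound] < 2:
            scheme.append("X")
            continue
        if sound not in assigned:
            assigned[sound] = next(letters)
        scheme.append(assigned[sound])
    return "".join(scheme)


print(rhyme_scheme(["it", "it", "atter", "atter"]))   # AABB
print(rhyme_scheme(["ocean", "ojan", "ag", "ojan"]))  # XAXA
```

Separating "which lines rhyme" (a human judgment) from "which letter each line gets" (a mechanical step) keeps the notation consistent across annotators.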

With this notation we can easily mark up rhyme schemes and start recognizing them in the most popular songs from the past few decades. There are many rhyme schemes we could explore; in this project, we annotated the following four main types (and their variations) as part of our pilot study:

AABB: in a four-line chorus, lines 1 and 2 rhyme together and lines 3 and 4 rhyme together.

ABAB: similarly, the first line rhymes with the third, and the second rhymes with the fourth.

AAAX: three lines rhyme together and the fourth line is left unrhymed.

XAXA: a verse that only rhymes between lines 2 and 4, with no rhyme between lines 1 and 3.

Here are some examples of rhyme schemes that we annotated in the corpus:

AABB

A “Yeah, life sure can try to put love through it, but”
A “We built this right, so nothing’s ever gonna move it”
B “When the bones are good, the rest don’t matter”
B “Yeah, the paint could peel, the glass could shatter”

ABAB

A “Buckles on the jacket, it’s Alyx shit”
B “Nike crossbody, got a piece in it”
A “Got a dance, but it’s really on some street shit”
B “I’ma show you how to get it”

AAAX

A “Tastes like strawberries on a summer evenin’”
A “And it sounds just like a song”
A “I want your belly and that summer feelin’”
X “I don’t know if I could ever go without”

XAXA

X “This that drip, it’s more like oceans”
A “They can’t fit me in a Trojan”
X “Out of pocket, but I’m always in my bag”
A “Yeah, that’s the slogan”

Interannotator Agreement Study

For interannotator agreement, we chose Cohen’s Kappa as our main evaluation metric, since each instance has two categorical annotations. We first collected the data from both charts and combined them into one dataframe, ready for the next step.

During EDA, we noticed about 200 songs with missing genre values, so we removed them to safeguard the quality of the agreement study.

We then compared the two annotations.

Our intuition is that if any word from Apple Music’s genre annotation can be found in the genres from Spotify, the two annotations agree. We distinguished the following cases:

(A1 refers to annotation from Apple and A2 refers to annotation from Spotify)

Exact match: alternative (A1) vs. alternative (A2)
Partial match: Hard Rock (A1) vs. Classic Rock (A2)
No match: Rock (A1) vs. Pop (A2)

We consider two annotations “agreed” when there is an exact or partial match, and “disagreed” when there is no match.

For “agreed” cases, we extracted the agreed genre from the multiple genres given by Spotify; for “disagreed” cases, we kept the first element of the Spotify genre list, so that the disagreement is retained and the agreement score is computed accurately.
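The matching rule described above can be sketched like this. The function name and the lowercase, whitespace-based word split are assumptions; the post does not say how genre strings were tokenized, or how the selected Spotify genre was normalized before scoring.

```python
def pick_spotify_genre(apple_genre, spotify_genres):
    """Select the Spotify genre to compare with the Apple Music genre.

    Returns (genre, agreed). The first Spotify genre sharing a word with the
    Apple Music genre counts as an exact or partial match. If none does, the
    first Spotify genre is kept so the disagreement is still recorded.
    Assumes spotify_genres is non-empty (empty rows were removed earlier).
    """
    apple_words = set(apple_genre.lower().split())
    for genre in spotify_genres:
        if apple_words & set(genre.lower().split()):
            return genre, True
    return spotify_genres[0], False


print(pick_spotify_genre("Alternative", ["alternative"]))        # exact match
print(pick_spotify_genre("Hard Rock", ["pop", "classic rock"]))  # partial match
print(pick_spotify_genre("Rock", ["pop", "dance pop"]))          # no match
```

A plain whitespace split would miss labels such as "Hip-Hop/Rap", so a real implementation would probably need a more careful tokenizer.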

We achieved a score of 0.25 on average, which falls into the “fair agreement” range. This reflects the different ways in which Apple and Spotify annotate music genres.
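For readers unfamiliar with the metric, Cohen’s Kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). Below is a minimal sketch of the computation on two label lists. The function name and the toy labels are made up for illustration and are not the project's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# Toy example: p_o = 0.75, p_e = 0.5, so kappa = 0.5
apple = ["pop", "pop", "rock", "rock"]
spotify = ["pop", "rock", "rock", "rock"]
print(cohens_kappa(apple, spotify))
```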

Sneak Peek at Web Interface

Functionality: Our web interface has the following functionality:

Search functionality: users will be able to enter a keyword or phrase and receive a list of songs that match their query. The results will include the song title, artist, and year, as well as the relevant lines of lyrics from the corpus.
Interactive visualization: we also incorporated an interactive visualization that displays the distribution of songs across different regions and years. This will give users a better understanding of the corpus and how it is distributed across time and space.
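The search behaviour described above could look roughly like the following sketch, which does a case-insensitive substring match over the lyric lines of each song document. The function name, field names and toy corpus are hypothetical; the actual interface may index and rank results differently.

```python
def search_lyrics(corpus, query):
    """Return title, artist, year and matching lyric lines for each hit."""
    needle = query.casefold()
    results = []
    for song in corpus:
        matching = [line for line in song["lyrics"] if needle in line.casefold()]
        if matching:
            results.append({
                "title": song["title"],
                "artist": song["artist"],
                "year": song["year"],
                "lines": matching,
            })
    return results


# Toy corpus with placeholder lyrics, for illustration only.
toy_corpus = [
    {"title": "Song One", "artist": "Artist A", "year": 1999,
     "lyrics": ["Summer nights are long", "We sing along"]},
    {"title": "Song Two", "artist": "Artist B", "year": 2015,
     "lyrics": ["Winter came early", "Nothing left to say"]},
]
hits = search_lyrics(toy_corpus, "summer")
print(hits)
```

A function like this could sit behind a single search endpoint in the back-end, with the front-end rendering the returned list.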

Front-end: Our front-end will be developed using HTML, CSS, and JavaScript. We will use a responsive design that works well on both desktop and mobile devices. We will also incorporate Bootstrap to make the interface look professional and modern.

Back-end: We will use FastAPI as the back-end framework for our web interface. FastAPI is a modern, fast, and intuitive framework for building APIs with Python 3.7+.

Please refer to this repo for details of the interface demo.

Appendix: Findings and Data Visualization

Let’s look at some of our findings from the corpus!

Genre distribution

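import matplotlib.pyplot as plt
import squarify as sq

# Take sizes and labels from the same value_counts() result so they stay aligned
genre_counts = df["am_genre"].value_counts()

plt.figure(figsize=(14, 8))
sq.plot(sizes=genre_counts.values, label=genre_counts.index, alpha=0.9)
plt.axis('off')
plt.show()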

Top 20 Artists with Highest Count of Records

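# Assumes the 'artist' column of df_artist holds each artist's record count;
# sorting ascending puts the largest bar at the top of the horizontal chart
df_artist_sorted = df_artist.sort_values(by='artist', ascending=True)
df_artist_sorted.plot.barh()
plt.title('Top 20 Most Popular Artists')
plt.xlabel('Count of Song Records')
plt.ylabel('Artist')
plt.show()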

Artist WordClouds

Top 20 Most Popular Genres

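# Assumes the 'am_genre' column of df_genre holds each genre's record count;
# sorting ascending puts the largest bar at the top of the horizontal chart
df_genre_sorted = df_genre.sort_values(by='am_genre', ascending=True)
df_genre_sorted.plot.barh()
plt.title('Most Popular Genres')
plt.xlabel('Count of Song Records')
plt.ylabel('Genres')
plt.show()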

Audio Features

# Select the audio feature columns and plot the mean of each one
feats = b_df[['danceability', 'energy', 'speechiness', 'acousticness', 'liveness', 'valence']]
plt.figure(figsize=(10,4))
feats.mean().plot.barh()
plt.title('Mean Values of Audio Features')
plt.show()