That which unites and divides
National anthems reflect collective identity and history. In this blog, let's examine the similarities and differences between them.
वसुधैव कुटुम्बकम् - The world is one family
What defines a country? At its core, it represents a geographic area where the inhabitants share a collective memory of their history, a cohesive identity in their current lives, and a common aspiration for their future. This collective identity is not only evident in the lifestyle of its people but also in the symbols that represent the nation. One significant symbol is the national anthem. In this blog, we aim to explore the national anthems of various countries to gain insight into which nations might share similar perspectives. The next parts of the blog will focus on the technical aspects of doing this.
Getting the data
There wasn't a single source where I could get this data, so I scraped it from a website. I know that's a grey zone, but I only ran the scraper a few times to compile the dataset. I'll make the data available to everyone (first the lyrics of the national anthems; the audio files will follow soon).
A word of warning: the data needs a fair bit of cleaning, but there are just 190 rows, so even manual cleaning would cut it!
I've also downloaded the audio files, and I'll be using them to create audio features. How to make them publicly accessible is something I still need to work through. In the meantime, email me and we'll work out a way for you to get the audio files.
Feature Engineering
Text Feature Generation
This is the simpler bit. We take the lyrics of the national anthems and do the following operations:
- Remove punctuation and convert to lower case
- Regular expression substitutions
- Tokenize the text data
- Remove stopwords and country-specific references
- Apply stemming or lemmatization based on the option selected
- Restrict the length of the lyrics to the range set by the selected options
Finally, use a TF-IDF vectorizer to convert the cleaned lyrics into the feature matrix.
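To make the text pipeline concrete, here is a minimal sketch in plain Python. It is not the exact code behind this post: the stopword list, the country terms, the two toy "anthems" and the `max_tokens` cap are all made up for illustration, stemming/lemmatization is left out, and the hand-rolled `tfidf` function merely stands in for a library vectorizer (such as scikit-learn's `TfidfVectorizer`), using a smoothed IDF and L2-normalized rows.

```python
import math
import re
import string
from collections import Counter

# Hypothetical word lists, for illustration only.
STOPWORDS = {"the", "of", "and", "our", "to", "in", "we", "is", "a", "for"}
COUNTRY_TERMS = {"freedonia", "sylvania"}
DROP = STOPWORDS | COUNTRY_TERMS

def clean_lyrics(text, max_tokens=200):
    """Lower-case, strip punctuation, tokenize, filter words, cap the length."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)  # example regex substitution: drop digits
    tokens = [t for t in text.split() if t not in DROP]
    return tokens[:max_tokens]  # assumed length restriction: keep the first max_tokens

def tfidf(docs):
    """Return (vocabulary, matrix) with smoothed IDF and L2-normalized rows."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    matrix = []
    for d in docs:
        term_freq = Counter(d)
        row = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        matrix.append([x / norm for x in row])
    return vocab, matrix

# Toy lyrics for two fictional countries (not from the real dataset).
anthems = {
    "Freedonia": "O Freedonia, land of the brave and the free!",
    "Sylvania": "Sylvania, our mountains sing of the free and the glory.",
}
cleaned = [clean_lyrics(text) for text in anthems.values()]
vocab, matrix = tfidf(cleaned)
```

Each row of `matrix` is one anthem and each column one word in `vocab`. Words shared by many anthems (here "free") get a lower weight than words unique to one anthem, which is the whole point of TF-IDF.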
Audio Feature Generation
This is pretty involved. Fortunately, there's a library that does a lot of the algorithmic heavy lifting. We do, however, need some domain-specific knowledge, which isn't the most intuitive, but a couple of walkthroughs will help (along with any signal processing courses you've taken😉). Anyway, the general steps are:
- Read the audio files one at a time, take the mean across channels if there are several, and extract features via short-term windowing
- Create a chromagram by stacking the features
- Get note frequency influences by binning and normalizing the resulting histogram
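The steps above can be sketched without any audio library. The snippet below is only an illustration under several assumptions of mine: the signal is synthetic (two pure tones) rather than a real anthem, the sample rate and window size are arbitrary, a naive DFT stands in for the library's short-term feature extraction, and "binning" is interpreted as counting the dominant pitch class in each window.

```python
import math

SAMPLE_RATE = 4000  # Hz; assumed for this synthetic example
FRAME = 200         # samples per window (50 ms), non-overlapping
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def to_mono(samples):
    """Average the channels of each sample (samples are per-sample tuples)."""
    return [sum(channels) / len(channels) for channels in samples]

def pitch_class(freq_hz):
    """Map a frequency to one of 12 pitch classes (0 = C), taking A4 = 440 Hz."""
    return round(69 + 12 * math.log2(freq_hz / 440.0)) % 12

def frame_chroma(frame):
    """12-bin chroma vector: DFT power of one window folded onto pitch classes."""
    n = len(frame)
    chroma = [0.0] * 12
    for k in range(1, n // 2):
        real = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        imag = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        chroma[pitch_class(k * SAMPLE_RATE / n)] += real * real + imag * imag
    return chroma

def chromagram(mono):
    """Stack the chroma vectors of consecutive windows."""
    starts = range(0, len(mono) - FRAME + 1, FRAME)
    return [frame_chroma(mono[s:s + FRAME]) for s in starts]

def note_histogram(chroma_frames):
    """Count the dominant pitch class per window and normalize to sum to 1."""
    counts = [0] * 12
    for chroma in chroma_frames:
        counts[chroma.index(max(chroma))] += 1
    total = sum(counts)
    return [c / total for c in counts]

def tone(freq_hz, start, length):
    """Pure sine tone sampled at SAMPLE_RATE, starting at sample index start."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(start, start + length)]

# Synthetic stereo clip: three windows of A (440 Hz), then one window near C (260 Hz).
signal = tone(440, 0, 600) + tone(260, 600, 200)
stereo = [(x, 0.5 * x) for x in signal]

chroma_frames = chromagram(to_mono(stereo))
hist = note_histogram(chroma_frames)
```

For this toy clip, `hist` puts three quarters of the weight on A and one quarter on C. With real recordings you would let a dedicated audio library handle file reading, windowing and chroma extraction, and only keep the last step (the normalized histogram) as your feature vector.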
K Means Clustering
In this part we'll group countries into distinct clusters using k-means clustering. I won't do a deep dive into k-means itself; you can find more details here. Instead, we'll discuss how to decide the number of clusters.
- Start with 5 randomly initialized cluster centers and work your way up to 30
- For each number of cluster centers, record the sum of distances (yes, the sum of distances, not the average over clusters, because we observed a more explainable result with this heuristic)
- Find the knee point of the resulting graph. The knee is the point after which the sum of distances decreases very slowly, so adding more clusters adds computational cost without adding value
- Use the number of clusters at the knee point.
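If you'd rather not eyeball the graph, one common heuristic is to pick the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below shows that heuristic; it is an assumption on my part rather than the method used for this post, and the `costs` values are made-up numbers, not the real sums of distances from the anthem data.

```python
import math

def knee_point(ks, costs):
    """Return the k whose (k, cost) point is farthest from the chord that
    joins the first and last points of the curve.

    Assumes ks is increasing and costs[0] > costs[-1].
    """
    k_first, k_last = ks[0], ks[-1]
    c_first, c_last = costs[0], costs[-1]
    # Rescale both axes to [0, 1] so neither one dominates the distance.
    xs = [(k - k_first) / (k_last - k_first) for k in ks]
    ys = [(c - c_last) / (c_first - c_last) for c in costs]
    # After rescaling, the chord runs from (0, 1) to (1, 0): x + y - 1 = 0.
    dists = [abs(x + y - 1) / math.sqrt(2) for x, y in zip(xs, ys)]
    return ks[dists.index(max(dists))]

# Made-up sums of distances for k = 5..12, for illustration only.
ks = list(range(5, 13))
costs = [100, 70, 45, 40, 37, 35, 33, 32]
best_k = knee_point(ks, costs)
```

In practice `costs` would hold the sum of distances you recorded for each number of cluster centers, and `best_k` is the value you'd pass on to the final clustering run.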
We see the knee occurs when k is equal to 7. Once we set this value, we get the following groups.
We see that a few countries are missing. This is because even though Russia and some Central African countries are present in the dataset, they're named slightly differently. We'll leave it to you to manually correct these entries and recreate the clusters.
Conclusion
One look at the cluster map shows that countries on the South American continent mostly share the same cluster assignment (barring a few). Countries in South and Central Asia seem to share a cluster, and those in West Asia are pretty close in their clusters as well. Australia, New Zealand and Canada are assigned to the same cluster, which isn't a big surprise given their shared history under Britain. India, United Kingdom, Ireland. India and Bangladesh, despite their national anthems having the same composer, are divided into different clusters.