Posted on May 25th, 2014
Filed under: Code — Karl Olson @ 6:50 pm
So a while back a rather interesting chart was posted online noting the number of unique words used in the first 35,000 words of many major MCs. Of course, being an MC, musician and computer scientist, I was immediately intrigued yet let down. Obviously, you can’t really put up the exact source lyrics since that would be infringement (sadly), but it would’ve been awesome if some source code had been available, along with at least a listing of the sources used.
So, I just built my own solution that would be a lot more transparent, and that could eventually act as a framework for something more collaborative.
On one hand, it’s a really basic word count. However, it has a lot of nice little tweaks that let the user (probably an insecure rapper like myself) carefully control any automatic reformatting and correction of their text. To further demonstrate the transparency of how it works, it shows not only the unique word count, but also a raw JSON object of the per-word counts (that will probably be wired into some graphing functionality in the future) and the processed version of the input, so that you can see exactly what any replacement has done. Beyond that, there is also a field for inputting excluded words if you want to see what happens to the unique word count when certain common words are left out.
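To make that concrete, here is a minimal Python sketch of the same idea. It is not the actual code behind the page; the function name `count_words` and the `excluded` parameter are made up for illustration, and the only cleanup it does is lowercasing.

```python
import json
from collections import Counter

def count_words(text, excluded=()):
    """Return (processed_text, counts) for a block of lyrics.

    Hypothetical helper: the only "processing" here is lowercasing and
    collapsing whitespace, and excluded words are dropped from the counts
    but left visible in the processed text.
    """
    words = text.lower().split()
    skip = {w.lower() for w in excluded}
    counts = Counter(w for w in words if w not in skip)
    return " ".join(words), counts

processed, counts = count_words("The beat the Rhyme the flow", excluded=["the"])
unique_count = len(counts)          # 3 unique words once "the" is excluded
raw_json = json.dumps(counts)       # raw JSON view of the per-word counts
print(unique_count)
print(raw_json)
print(processed)                    # "the beat the rhyme the flow"
```

Showing the processed text next to the counts is the point: anyone can check what the cleanup did before trusting the number.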
On that note, here are unique word counts for the following artists’ first 35000 words (or as many words as they’ve released to date if they don’t have 35000+ words on their own official, non-best-of albums):
Aesop Rock: 8411/35000 (sourced from lyricswiki**)
mc chris: 6467/35000 (sourced from A-Z Lyrics**)
MC Frontalot*: 5163/21708 (sourced from his own website**)
Whoremoans*: 3682/19779 (forwarded directly to me from a transcription, no edits made by me.)
Ultraklystron: 6492/35000 (from my own archives**)
*=under the 35000 count.
**=manually corrected for variations in spelling, transcription errors and reduced repetition of choruses when feasible.
I want to note these numbers were obtained after removing non-essential punctuation, making everything lowercase and removing apostrophes via the Lyricist page. Additionally, as noted, I corrected for transcription errors and inconsistent spelling, and removed obvious repetition like choruses/hooks, since there is no consistency in how that kind of thing is notated. Also omitted when possible/reasonable were any lyrics from featured MCs on those artists’ releases. In the case of annotations that didn’t clearly break down which MC said what, the entire song was cut from the count.
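For anyone who wants to reproduce the automatic part of that cleanup, here is a rough Python sketch. It is an approximation, not the Lyricist page's code: `normalize` and `unique_in_first` are hypothetical names, and I'm assuming "non-essential punctuation" means everything except apostrophes, replaced with a space. The manual corrections (spelling, choruses, featured verses) are not covered.

```python
import re

def normalize(text, strip_apostrophes=True):
    """Lowercase, unify apostrophes, and drop other punctuation (assumed rules)."""
    text = text.lower().replace("\u2019", "'")   # curly apostrophe -> straight
    if strip_apostrophes:
        text = text.replace("'", "")
    # Anything that is not a word character, whitespace or apostrophe becomes a space.
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

def unique_in_first(text, limit=35000, strip_apostrophes=True):
    """Return (unique_words, words_counted) over the first `limit` words."""
    words = normalize(text, strip_apostrophes).split()[:limit]
    return len(set(words)), len(words)

print(normalize("Don't stop, can't stop!"))                  # dont stop cant stop
print(unique_in_first("Don't stop, can't stop!"))            # (3, 4)
print(unique_in_first("its it's", strip_apostrophes=False))  # (2, 2)
print(unique_in_first("its it's"))                           # (1, 2)
```

The last two lines show why the apostrophe setting matters: the same text yields a different unique count depending on whether "its" and "it's" are merged.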
Lessons Learned so far:
-There is so much variation in the count due to the very issues I’m trying to correct for above that, at best, you can probably say that if two rappers are within 500 words of each other, they’re probably comparable when it comes to vocabulary, even after running corrections/clean up over all of their lyrics. This is reinforced by the fact that MCs will trade places depending on whether apostrophes are removed or not.
-Most MCs seem to hit a roughly logarithmic ceiling as they go on with time: the unique count grows more and more slowly as the total word count goes up. Aesop doesn’t appear to, though. In fact, even if a fan put serious time into producing a very accurate transcription of his first 35000 words, fully corrected for any possible duplicates sneaking by, he’d still probably be smashing it.
-I have a lot of artists I want to gradually add to this list (MC Lars, Megaran and YTCracker to name 3 off the top of my head), but finding a good, preferably single source for their lyrics is going to be critical to the accuracy of the analysis.
-Making this an un-moderated, open source list will probably be a fiasco. To make this work, it almost needs to be integrated into Rap Genius or something similar. I kind of hope this spurs the major lyrics sites into integrating this kind of analysis as a way of engaging people with the words of their favorite musicians and MCs at a lexicographical level.
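One rough way to look at the ceiling mentioned in the second lesson is to track how the unique count grows as more words are read. The Python sketch below is only an illustration; `vocabulary_growth` and its `step` parameter are hypothetical and not part of the current tool, and it assumes the lyrics have already been cleaned up and split into a list of words.

```python
def vocabulary_growth(words, step=5000):
    """Return [(words_read, unique_so_far), ...] sampled every `step` words.

    `words` is a list of already-normalized words in release order.
    The final point is always included, even if it is not on a step boundary.
    """
    seen = set()
    curve = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0 or i == len(words):
            curve.append((i, len(seen)))
    return curve

sample = "a b a c a b d a".split()
print(vocabulary_growth(sample, step=4))   # [(4, 3), (8, 4)]
```

A curve that flattens out would match the ceiling described above, while one that keeps climbing at a steady rate would look more like the Aesop Rock case.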