Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
George Mathis Jr. - Winston-Salem State University
Co-Author(s): Sebastian Cousins, Debzani Deb, Winston-Salem State University, Winston-Salem, NC
The proliferation of online music stores and streaming services has revolutionized the way we listen to music by making a seemingly infinite number of songs available to us and allowing us to discover music that we may like in many different ways. Recent studies in Music Information Retrieval (MIR) have found that music mood is becoming an increasingly important access point to music repositories and collections. In recent years, a number of automatic mood-based classification methods have been explored; these rely on various audio and instrumental features, and some of them also use song lyrics. However, these studies have produced contradictory results and, most importantly, are mostly based on small-scale datasets. In this study, we focus on the analysis of song lyrics for music mood classification using the Apache Spark platform. More specifically, we used Spark’s MLlib library so that we could apply the classification algorithms to sizable datasets efficiently. We used Russell’s psychological model [1] to derive mood categories; in this model, 28 emotion-denoting adjectives are placed in a bipolar space defined by two dimensions, valence (negative-positive) and arousal (inactive-active). Further investigation of this model led us to derive four mood categories: “happy”, “sad”, “calm” and “angry”. Currently, our training dataset contains 1000 songs from the popular Million Song Dataset (MSD) [2] that are available to us with lyrics and with social-tag-based mood categories (“happy” and “sad”). However, we are also working on building a much bigger training dataset by finding a subset of songs in MSD for which lyrics are available and all four proposed mood categories are derivable.
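As a rough illustration of how four mood categories can fall out of a two-dimensional valence/arousal space, the sketch below assigns one category to each quadrant. The quadrant-to-mood assignment, the [-1, 1] score range, and the function name `mood_quadrant` are assumptions made for this example (a common reading of the circumplex model); the abstract does not state the exact rule used in the study.

```python
def mood_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to one of four mood labels.

    Assumed convention (not taken from the abstract):
      positive valence, high arousal -> "happy"
      positive valence, low arousal  -> "calm"
      negative valence, high arousal -> "angry"
      negative valence, low arousal  -> "sad"
    Scores are assumed to lie in [-1, 1]; zero counts as positive/high.
    """
    if valence >= 0:
        return "happy" if arousal >= 0 else "calm"
    return "angry" if arousal >= 0 else "sad"


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only to exercise each quadrant.
    for point in [(0.8, 0.6), (0.7, -0.5), (-0.6, 0.7), (-0.7, -0.4)]:
        print(point, mood_quadrant(*point))
```

With such a rule, any adjective or social tag that has been placed in the valence/arousal plane can be collapsed into one of the four labels before training.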
During the classification process, term-frequency feature vectors are created from the lyrics in the training dataset, and three classification algorithms, Naïve Bayes, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), are then used to train our models. A separate test dataset containing 200 songs is used to test the models. Our preliminary results indicate that the mood of a song correlates directly with the semantics of its lyrics, and we also observed variation in accuracy across the different classification algorithms. By the ERN conference in February 2018, we hope to run our classifiers on the bigger dataset that is being built and to present the results of our comparative study based on it. With the bigger dataset, we also hope to observe significantly faster execution of our classifiers on a cluster of workstations by utilizing the Spark framework. In the future, we would like to explore mood-based recommender systems.
1. J. A. Russell, “A Circumplex Model of Affect,” Journal of Personality and Social Psychology, 39: 1161-1178, 1980.
2. T. Bertin-Mahieux, et al., “The Million Song Dataset,” ISMIR, 2011.
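The classification step described in the abstract (term-frequency vectors built from lyrics, then a classifier trained on them) can be sketched as follows. This is a small pure-Python multinomial Naïve Bayes with add-one smoothing, used here as a stand-in so the example runs without a Spark installation; it is not the Spark MLlib pipeline used in the study. The toy lyrics, the tokenizer, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def term_frequencies(lyrics: str) -> Counter:
    """Count how often each lowercase word occurs in a lyric string."""
    return Counter(re.findall(r"[a-z']+", lyrics.lower()))


def train_naive_bayes(examples):
    """Collect per-label document counts and term counts from (lyrics, label) pairs."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for lyrics, label in examples:
        tf = term_frequencies(lyrics)
        label_counts[label] += 1
        term_counts[label].update(tf)
        vocabulary.update(tf)
    return label_counts, term_counts, vocabulary


def predict(model, lyrics: str) -> str:
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, term_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    tf = term_frequencies(lyrics)
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term, count in tf.items():
            if term not in vocabulary:
                continue  # ignore words never seen during training
            p = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy lyrics, for illustration only (not from the Million Song Dataset).
TRAIN = [
    ("sunshine smile dancing all night love", "happy"),
    ("we laugh and sing in the sunshine", "happy"),
    ("tears fall alone in the cold rain", "sad"),
    ("lonely nights crying over goodbye", "sad"),
]

if __name__ == "__main__":
    model = train_naive_bayes(TRAIN)
    print(predict(model, "dancing and singing in the sunshine"))
    print(predict(model, "crying alone in the rain"))
```

In a Spark setting, the same two stages would typically be expressed as a feature transformer followed by a classifier fitted on a distributed dataset, which is what allows the approach to scale beyond a toy corpus like this one.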
Funder Acknowledgement(s): This study is supported by NSF HBCU-UP grant #1600864, awarded to Debzani Deb, Associate Professor, Winston-Salem State University.
Faculty Advisor: Debzani Deb, debd@wssu.edu
Role: I selected the subject of the study and configured the sample dataset and the Spark code for analyzing the data. I also helped decide which classification algorithms to use for comparison.