Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Olessya Medvedeva - Borough of Manhattan Community College, CUNY
Co-Author(s): Farnaz Abtahi, CUNY Graduate Center, New York, NY; Wei Li, CUNY City College, New York, NY
The importance, benefits, and shortcomings of automated speaker recognition and emotion detection systems have been topics of interest in several studies. Applications of such systems are numerous, ranging from entertainment and human-computer interaction to health diagnostics and improved communication for people with disabilities. Many proposed systems based on uni-modal visual or audio data, or on a multi-modal combination of the two, fail to perform in disadvantaged scenarios, such as in the presence of corrupted data or in poorly lit environments. We propose to use machine learning techniques for speaker recognition and emotion detection based on Multimodal Deep Belief Networks (MDBNs). The focus is on developing a generative, joint representation of different modalities (visual, audio, and electromyography, or EMG) to improve performance under different scenarios. Our study emphasizes the role of a unique sensor modality, EMG, in improving the accuracy of audio-visual systems.
The two main goals of our study are: (1) developing a multi-modal model that can combine different modalities into a shared representation of the entire data and that can also handle missing modalities in scenarios where capturing one or more modalities is impossible, e.g., at runtime, when attaching EMG sensors to the user's face is not practical; and (2) applying this model to the tasks of speaker recognition and facial expression detection. A prerequisite for this part of the study is data collection, for which we need to recruit human subjects and capture visual, audio, and EMG signals from their faces while they speak words or act out emotions. One interesting application of such a system is assisting blind and visually impaired people during social interactions, since knowing what other people are saying, or what emotions they are expressing with their faces, is of great value to visually impaired people during face-to-face conversations.
So far, we have developed a DBN-based multi-modal model for handling multi-modal data and applied it to speaker recognition. Our findings reveal that adding the EMG data improves the overall performance of the multi-modal DBN. Our next step is to extend the model so that it can deal with missing modalities, most importantly the EMG part of the data. The new model will then be tested on speaker recognition and facial expression recognition tasks.
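To make the multi-modal DBN pipeline concrete, the following is a minimal, illustrative sketch in Python/NumPy, not the authors' implementation: each modality (audio, visual, EMG) gets its own Bernoulli RBM trained with one-step contrastive divergence, and a joint RBM over the concatenated hidden codes produces the shared representation that a downstream speaker or expression classifier would use. All layer sizes, learning rates, and the synthetic binary features are placeholder assumptions.

```python
# Minimal multimodal-DBN sketch: per-modality RBMs + a joint RBM on top.
# Sizes, learning rates, and data below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli RBM trained with 1-step contrastive divergence."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible bias
        self.b_h = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # Parameter updates
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

def train_rbm(rbm, data, epochs=20):
    for _ in range(epochs):
        rbm.cd1_step(data)
    return rbm

# Synthetic stand-ins for preprocessed, binarized features of each modality.
n = 200
audio  = (rng.random((n, 60))  < 0.3).astype(float)
visual = (rng.random((n, 100)) < 0.3).astype(float)
emg    = (rng.random((n, 40))  < 0.3).astype(float)

# Modality-specific RBMs learn per-modality hidden codes.
rbm_audio  = train_rbm(RBM(60, 32), audio)
rbm_visual = train_rbm(RBM(100, 32), visual)
rbm_emg    = train_rbm(RBM(40, 16), emg)

# Joint RBM over the concatenated codes yields the shared representation.
codes = np.hstack([rbm_audio.hidden_probs(audio),
                   rbm_visual.hidden_probs(visual),
                   rbm_emg.hidden_probs(emg)])
joint_rbm = train_rbm(RBM(codes.shape[1], 64), codes)
shared = joint_rbm.hidden_probs(codes)   # input to a speaker/expression classifier
print(shared.shape)  # (200, 64)
```

In such a sketch, handling a missing modality at test time would amount to inferring its hidden code from the joint layer given the modalities that are present (e.g., via Gibbs sampling with the observed codes clamped), which corresponds to the extension described above as the next step of the study.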
Funder Acknowledgement(s): This work is supported by the National Science Foundation under Award #EFRI-1137172, and the 2015 NSF EFRI-REM pilot program at the City College of New York. I thank Dr. Zhigang Zhu and his team at the City College Visual Computing Lab and Dr. Hao Tang, CSTEP Program, BMCC.
Faculty Advisor: Zhigang Zhu