Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Session: 2
Room: Exhibit Hall A
Tenecious Underwood - Livingstone College
Co-Author(s): Evan Drake Suggs, University of Tennessee at Chattanooga; Eliza G. Foran, Indiana University Bloomington; Dr. Winona Snapp-Childs, Indiana University Bloomington; Dr. Sherri Sanders, Indiana University Bloomington.
Hypothesis: Recording animal calls and vocalizations is a time-honored data collection method in various fields of biological and environmental science. In the past, the only method available for analyzing such recordings involved extensive training of human experts. Now, however, machine learning techniques have made automatic recognition of such vocalizations possible. Automatic recognition of animal calls and vocalizations is desirable on two fronts: it reduces the burden of (at least initial) data analysis, and it supports non-intrusive environmental monitoring. Here, we outline a proof-of-concept workflow that makes the path from collecting data to understanding it more attainable for researchers. We simulate this data collection process by recording animal (frog) calls using recording devices and Raspberry Pis, then feed these data to a database and virtual machine hosted on XSEDE resources (i.e., Jetstream and Wrangler). We then show how database pulling, machine learning, and visualization work on Jetstream.
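The abstract does not specify how the Raspberry Pi devices push recordings to the Jetstream-hosted database; the following is only a minimal sketch of that step, assuming a clip is captured with ALSA's arecord and uploaded over HTTP. The endpoint URL, field name, and recording parameters are placeholders, not the project's actual API.

# Hypothetical sketch: record a 9-second clip on a Raspberry Pi and POST it
# to a collection endpoint on a Jetstream-hosted VM. URL and field names are
# placeholders; the project's real upload path may differ.
import subprocess
import requests

CLIP_SECONDS = 9
ENDPOINT = "http://jetstream-vm.example.org/upload"  # placeholder address

def record_clip(path="call.wav", seconds=CLIP_SECONDS):
    # 16-bit mono WAV at 44.1 kHz via the ALSA command-line recorder
    subprocess.run(
        ["arecord", "-d", str(seconds), "-f", "S16_LE", "-r", "44100", "-c", "1", path],
        check=True,
    )
    return path

def push_clip(path):
    # Send the recording to the database/VM for later analysis
    with open(path, "rb") as f:
        requests.post(ENDPOINT, files={"audio": f}, timeout=30)

if __name__ == "__main__":
    push_clip(record_clip())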
Methods: All audio samples were collected from Cornell's Macaulay Library archive of wildlife sounds. Unfortunately, many of the sound files contained false positives. To solve this problem, we used the R package warbleR to create 9-second spectrograms of each of the lossless sound samples. In both the CNNs and the RNN, we created visual representations of sound. For the image-based CNN, we created a simple frequency-over-time spectrogram in greyscale, which gives the frequency over time of each call. Audio processing often applies more complex visual representations, such as Fourier transformations. Done correctly, these transformations eliminate empty information that would otherwise hinder a neural network. For inputs with time stamps, popular spectrograms include the short-time Fourier transform and the mel-frequency spectrogram. For the audio-based RNN and CNN, we performed a series of transformations on the raw audio, shown in Figure 1. Instead of using a simple frequency-over-time spectrogram, we wanted to include the power spectra, with both frequency and time domains. To do this, we first used a fast Fourier transform to create a change-in-time-over-frequency spectrum, which eliminated unimportant features of the audio. Additionally, we used the short-time Fourier transform (STFT) with Hamming windows and filter bank coefficient energies (the logarithmic values of 26 filters), processed through the discrete cosine transform (DCT) to eliminate over-correlated coefficients at higher frequencies.
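A minimal sketch of the audio feature pipeline described above (framed STFT with a Hamming window, log energies of 26 filter banks, then a DCT to drop over-correlated higher-order coefficients), using the python_speech_features package as one possible implementation. The abstract does not name the library used, and the window sizes, FFT length, and coefficient counts below are illustrative assumptions rather than the project's exact settings.

# Assumed sketch of the described feature extraction; not the project's code.
import numpy as np
from scipy.io import wavfile
from python_speech_features import logfbank, mfcc

rate, signal = wavfile.read("frog_call.wav")   # one 9-second lossless clip
if signal.ndim > 1:
    signal = signal[:, 0]                      # keep a single channel

# Log filter bank energies: framed STFT -> 26 triangular filters -> log
fbank_feats = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                       nfilt=26, nfft=2048)

# MFCC-style features: the same filter bank energies passed through a DCT,
# keeping the lower-order coefficients and discarding the over-correlated
# higher-frequency ones; a Hamming window is applied to each frame
mfcc_feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                  numcep=13, nfilt=26, nfft=2048, winfunc=np.hamming)

print(fbank_feats.shape, mfcc_feats.shape)     # (frames, 26), (frames, 13)

The resulting frame-by-feature matrices can then be fed to the audio-based CNN and RNN, while the plain greyscale spectrogram images serve as input to the image-based CNN.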
Results: In order to calculate the accuracy of each model, we ran predictions ten times, with model rebuilding between each run (Figure 6). We found that the image-based CNN performed with the greatest accuracy. In a Kruskal-Wallis rank-sum test, we found that the audio-based RNN had the lowest performance accuracy. Conclusion: The accuracy of all three neural networks theoretically exceeds the accuracy of traditional citizen science frog surveys. All three of our models produced over 88% accuracy. When fully integrated, our image-based CNN model can translate a frog calling in a remote location into an automatic identification on a webpage. Previous Jetstream undergraduate students created a custom Raspberry Pi recording device that could push audio directly to a web page for basic bioacoustics visualization (as a spectrogram and a principal component analysis plot). Our research would complete the automated identification process.
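A sketch of the evaluation procedure described above: each model is rebuilt and retrained ten times, per-run accuracies are collected, and the three accuracy samples are compared with a Kruskal-Wallis test via SciPy. The build function, data arrays, and training settings are assumptions standing in for the project's actual Keras-style models and dataset.

# Assumed evaluation loop; build_fn is a placeholder for a model constructor.
import numpy as np
from scipy.stats import kruskal

def repeated_accuracy(build_fn, x_train, y_train, x_test, y_test, runs=10):
    accuracies = []
    for _ in range(runs):
        model = build_fn()                         # fresh weights each run
        model.fit(x_train, y_train, epochs=20, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        accuracies.append(acc)
    return np.array(accuracies)

# Hypothetical usage, one accuracy sample per architecture:
# acc_img_cnn = repeated_accuracy(build_image_cnn, ...)
# acc_aud_cnn = repeated_accuracy(build_audio_cnn, ...)
# acc_aud_rnn = repeated_accuracy(build_audio_rnn, ...)
# stat, p = kruskal(acc_img_cnn, acc_aud_cnn, acc_aud_rnn)
# print(f"Kruskal-Wallis H={stat:.3f}, p={p:.4f}")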
Funder Acknowledgement(s): National Science Foundation (NSF); Jetstream IU; National Center for Genomic Analysis Support (NCGAS)
Faculty Advisor: Dr. Balogun, obalogun@livingstone.edu
Role: All parts, including the code for the RNN and the image-based CNN neural networks.