Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Benjamin Kassman - Lewis and Clark College
Co-Author(s): Catherine Seita, Cornell University, NY; Jason Antal, Gallaudet University, DC; Jaron Rekhop, Gallaudet University, DC
Automatic Speech Recognition (ASR) software provides a service, transcribing spoken content into text, that may benefit people who identify as d/Deaf or Hard of Hearing (DHH). Text-based messaging such as ASR is a mode of communication that is readily accessible to DHH individuals in the workplace. Despite the advances that ASR technology has made in recent years, the transcriptions produced by ASR are not always accurate. Our aim with this study was to work within the limitations imposed by current ASR techniques, rather than to develop new techniques.
We studied the types of errors that occur in ASR transcriptions and tested our hypothesis that transcription errors can be accurately predicted by examining linguistic factors and speaker tendencies. We also hoped to identify which specific factors contributed most to transcription error. We focused on errors that occurred during sessions of an experiment in which hearing participants used an ASR app to interact with DHH individuals. The ASR software assigned a confidence score to each word that it produced. An app that we developed displayed the transcribed words and underlined each word with a score below 75% to indicate that it was more likely to have been transcribed incorrectly.
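The underlining rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the function name `mark_low_confidence`, the `(word, confidence)` input format, and the use of surrounding underscores to stand in for underlining are all hypothetical. Only the 75% threshold comes from the abstract.

```python
from typing import List, Tuple

# Threshold taken from the abstract: words scoring below 75% are underlined.
THRESHOLD = 0.75


def mark_low_confidence(
    words: List[Tuple[str, float]], threshold: float = THRESHOLD
) -> str:
    """Render a transcript, marking low-confidence words.

    `words` is a hypothetical list of (word, confidence) pairs with
    confidence in [0, 1]. Underscores stand in for the app's underlining.
    A word scoring exactly at the threshold is left unmarked; the abstract
    does not say how that boundary case was handled.
    """
    rendered = []
    for word, confidence in words:
        if confidence < threshold:
            rendered.append(f"_{word}_")
        else:
            rendered.append(word)
    return " ".join(rendered)


if __name__ == "__main__":
    sample = [("meet", 0.93), ("at", 0.88), ("noon", 0.41)]
    print(mark_low_confidence(sample))
```

With the sample input, only "noon" falls below the threshold and is marked.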
In 9 sessions of an experiment with 12 hearing participants in total, we collected 3100 ASR-produced words, which we compared with reference words collected by audio transcription. This allowed us to produce a rich dataset with metadata on each word. We found that word pairs with phonetic and lexical stress similarity were more likely to be incorrectly produced by the ASR system. In addition, there was a correlation between the part of speech of an ASR-produced word and the word error rate for that part of speech, and a correlation between a speaker’s native language and the accuracy of transcription.
Using these data, we confirmed that our app’s confidence rating was an accurate indicator of whether a word would be mistranscribed. High confidence was defined as a confidence score above 75%. Out of 3108 spoken words, 74.5% were correct with high confidence, 15.3% were correct yet low confidence, 3.1% were incorrect yet high confidence, and 6.7% were incorrect and low confidence. For future work, it would be useful to also collect prosodic information, such as rate of speech and inflection, for each word in the dataset to better understand how these factors indicate transcription errors.
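A four-way breakdown like the one reported above could be tallied as in the following sketch. It assumes each ASR-produced word has already been aligned one-to-one with its reference word, which the abstract does not describe. The function name `confidence_outcomes`, the record format, and the case-insensitive comparison are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical record: (asr_word, reference_word, confidence in [0, 1]).
Record = Tuple[str, str, float]


def confidence_outcomes(
    records: Iterable[Record], threshold: float = 0.75
) -> Dict[Tuple[str, str], float]:
    """Return the fraction of words in each (correctness, confidence) cell.

    Assumes the ASR output and reference transcript are already aligned
    word by word. Scores below the threshold count as low confidence;
    treatment of a score exactly at the threshold is an assumption.
    """
    counts: Counter = Counter()
    for asr_word, reference_word, confidence in records:
        correctness = (
            "correct" if asr_word.lower() == reference_word.lower() else "incorrect"
        )
        level = "low" if confidence < threshold else "high"
        counts[(correctness, level)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell: n / total for cell, n in counts.items()}


if __name__ == "__main__":
    demo = [
        ("hello", "hello", 0.95),   # correct, high confidence
        ("there", "there", 0.60),   # correct, low confidence
        ("their", "there", 0.90),   # incorrect, high confidence
        ("ice", "eyes", 0.30),      # incorrect, low confidence
    ]
    for cell, share in sorted(confidence_outcomes(demo).items()):
        print(cell, f"{share:.1%}")
```

The demo records are invented to exercise each cell once; they are not drawn from the study's dataset.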
Funder Acknowledgement(s): This work has been generously supported by an NSF REU Site Grant (#1460894) awarded to Dr. Raja Kushalnagar, PI.
Faculty Advisor: Matt Huenerfauth, firstname.lastname@example.org
Role: I helped to design the test sessions for the app and created the scenarios that the participants worked through. I collected the data, setting up video recording sessions to test out the ASR app and managing the API of the app as the experiment ran. I processed the data collected from the video sessions, transcribing the audio into text through a Python program, analyzing the video and captioning it for our Deaf and Hard of Hearing researchers. Finally, I analyzed the data and tested out our hypotheses regarding the relationship between certain metadata and error rate in transcription.