Real Talk: Applying Computer Learning Models to Human Speech Recognition

UConn professor of psychological science James Magnuson has received more than $600K from the National Science Foundation to address a longstanding problem in research on speech perception and spoken word recognition.

“Good morning. Can I get a medium coffee, please?”

Your vocal cords vibrate to create sound waves that the barista perceives as speech sounds (consonants and vowels) which combine to make words and transmit your request.

Sound familiar? We hear and understand thousands of words and phrases every day. While we don’t have to give speech perception a second thought, our brains are busy completing the complex process that allows us to make sense of the bits of sound that communicate something meaningful.

A University of Connecticut professor of psychological science, James Magnuson, has received more than $600,000 from the National Science Foundation to address a longstanding problem with research on speech perception and spoken word recognition.

Since the 1950s, researchers have grappled with the “lack of invariance” problem in speech perception. Even though we easily identify consonants and vowels in speech, there is nothing like an acoustic “alphabet” of cues that uniquely identify speech sounds. The sounds, or acoustic patterns, that correspond to a particular consonant or vowel can change with “phonetic context” (preceding or following speech sounds), speaking rate, and physical characteristics of the person talking. After decades of concerted efforts, scientists do not have a full understanding of how we perceive speech despite this variability.
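The lack of invariance can be seen in miniature in a well-documented phonetic effect: voice onset time (VOT) distinguishes /b/ from /p/, but the category boundary shifts with speaking rate, so the identical acoustic cue value can be heard as different consonants in different contexts. The sketch below is purely illustrative; the boundary values are invented for the example, not measured data.

```python
def classify_stop(vot_ms, speaking_rate):
    """Label a stop consonant as 'b' or 'p' from its voice onset
    time (VOT, in ms). The /b/-/p/ boundary sits at a shorter VOT
    when speech is fast (illustrative numbers, not real data)."""
    boundary = 25.0 if speaking_rate == "fast" else 35.0
    return "b" if vot_ms < boundary else "p"

# The identical 30 ms VOT is categorized differently in the two contexts:
print(classify_stop(30, "fast"))  # p
print(classify_stop(30, "slow"))  # b
```

Because no fixed acoustic value reliably identifies the sound on its own, a listener (or a model) must take context into account, which is exactly what makes an acoustic "alphabet" impossible.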

An important tool in this quest is the use of computational models. Once a theory proposes even a few simple interacting mechanisms to explain speech perception, the precise predictions it makes can be difficult or impossible to derive without conducting simulations.

Magnuson is tackling a limitation of current computational models of speech perception and spoken word recognition: none of them work on the actual speech signal. Because the lack of invariance problem has not been solved, researchers have been using simplified inputs that sidestep the problem.

“It is crucial that we overcome this limitation, because without working on real speech, our theories and explanations are not just incomplete, they may be misguided,” Magnuson says. “And they also cannot address some of the most interesting and puzzling aspects of human speech recognition.”

Magnuson’s project draws inspiration from the resurgence of “deep learning” in artificial intelligence and engineering in the last two decades. Deep learning is a toolkit of techniques for training neural networks with many layers. New deep learning techniques have made systems like Siri or Amazon’s Alexa powerful enough to recognize human speech in real-world contexts. However, these systems are so complex that they are of little help to scientists trying to understand human speech recognition.

Magnuson’s project will borrow techniques from deep learning and apply them to established psychological models of human speech processing, while keeping the models simple enough that the functions they learn to perform can be understood.
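To make the idea concrete, here is a minimal sketch of the general approach, a small recurrent network that maps a sequence of acoustic feature frames (such as spectrogram slices) onto word-identity outputs. This is not the project's actual model; the architecture, dimensions, and random (untrained) weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 64 spectral features per frame,
# 32 hidden units, a 10-word vocabulary.
N_FEATURES, N_HIDDEN, N_WORDS = 64, 32, 10

# Randomly initialized weights stand in for trained parameters.
W_in = rng.normal(0, 0.1, (N_HIDDEN, N_FEATURES))
W_rec = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN))
W_out = rng.normal(0, 0.1, (N_WORDS, N_HIDDEN))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recognize(frames):
    """Run acoustic frames through a simple Elman-style recurrent
    network, returning word probabilities at each time step."""
    h = np.zeros(N_HIDDEN)
    outputs = []
    for frame in frames:
        # The hidden state combines the current frame with the
        # previous state, letting the model carry phonetic context
        # forward in time rather than relying on a fixed acoustic cue.
        h = np.tanh(W_in @ frame + W_rec @ h)
        outputs.append(softmax(W_out @ h))
    return np.array(outputs)

# A fake 20-frame "utterance" standing in for a real spectrogram.
utterance = rng.normal(size=(20, N_FEATURES))
probs = recognize(utterance)
print(probs.shape)  # (20, 10): word probabilities at every frame
```

A network this small is simple enough that researchers can inspect what each layer has learned, which is precisely the trade-off the project aims for: deep-learning training techniques applied to models that remain interpretable.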

In preliminary work, Magnuson and a postdoctoral researcher in his lab, Heejo You, have created a neural network model that is no more complex than other current models, but works on real speech. They hope to publish a paper later this year on their first set of results.

Another goal of the project is to conduct the first comprehensive comparison of current models of spoken word recognition (the process of mapping perceived consonants and vowels onto words in memory). Magnuson and his psychological science colleague Jay Rueckl have developed a comprehensive test with 15 critical benchmarks to compare five models.

This work could help move the research community closer to understanding this essential cognitive function. Deeper understanding has the potential to help clinicians develop more effective interventions for language disorders. It could also help advance speech-to-text technology, as humans handle with ease situations that can “break” computer speech recognition (such as an unusual acoustic context, like a stairwell or noisy airport).

Magnuson received his Ph.D. in brain and cognitive science in 2001 from the University of Rochester. His research interests include the neurobiology of language, especially development and disorders of spoken language understanding.

This project is NSF grant No. 1754284.