Anyone who has tried to dictate a text using their smartphone has likely encountered the frustration of the phone completely misunderstanding them. Yet speak the same message to another person, and they would likely understand it without trouble.
This is because, unlike automatic speech recognition (ASR) systems like Siri or Alexa, humans have an ability known as phonetic constancy. Phonetic constancy allows us to understand a speaker’s message even when speaking rate, talker characteristics, and the content of the message vary. How we accomplish this remains a major scientific mystery.
ASR is used by billions of smartphone users every day. Scientists studying human speech recognition use simplified versions of the “deep learning” neural networks that power ASR. Keeping the networks simple allows scientists to test ideas about the cognitive mechanisms that support human speech recognition.
These models do not address phonetic constancy because they use abstract inputs based on linguistic theories, rather than real speech. Despite our frustrations with Siri or Alexa, those systems are dramatically more advanced than the cognitive models speech scientists use.
Even though ASR technology is impressive, these systems cannot adapt to acoustic variation the way humans can. Nor can scientists use them to study human phonetic constancy, because of their sheer complexity: ASR models are carefully engineered systems that are not constrained by human biology.
Jim Magnuson, a professor in the Department of Psychological Sciences, is working on a collaborative project to develop a model that bridges the gap between cognitive theories and models of human speech recognition on one side and ASR’s impressive deep learning networks on the other.
Magnuson is collaborating with professors Monty Escabí in the Department of Biomedical Engineering and Jay Rueckl and Christian Brodbeck in the Department of Psychological Sciences, as well as professor Kevin Brown, a physicist at Oregon State University. This work is supported by a collaborative research grant from the National Science Foundation to the teams at UConn ($437,000) and Oregon State ($179,000).
Through a previous NSF-funded grant, this group developed the first model of human speech recognition that can process real speech produced by multiple speakers, simulating the conditions needed to understand phonetic constancy. Their EARSHOT system (Emulation of Auditory Recognition of Speech by Humans Over Time) is a neural network modeled after the human brain and powered by special nodes developed for ASR models.
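To make the idea concrete, the sketch below shows what an EARSHOT-style model might look like in code: a recurrent network that takes spectrogram frames of real speech as input and activates word representations over time. This is a minimal illustration only; the framework (PyTorch), layer sizes, vocabulary size, and output scheme are assumptions, not the published EARSHOT configuration.

```python
# Illustrative sketch of an EARSHOT-style network (assumed details, not the
# published model): spectrogram frames in, word activations out over time.
import torch
import torch.nn as nn

class EarshotLikeModel(nn.Module):
    def __init__(self, n_freq_bands=256, n_hidden=512, n_words=1000):
        super().__init__()
        # Recurrent (LSTM) nodes of the kind used in ASR process the
        # spectrogram one time step at a time.
        self.lstm = nn.LSTM(n_freq_bands, n_hidden, batch_first=True)
        # A linear readout maps the hidden state to word activations.
        self.readout = nn.Linear(n_hidden, n_words)

    def forward(self, spectrogram):
        # spectrogram: (batch, time_steps, n_freq_bands)
        hidden_states, _ = self.lstm(spectrogram)
        # Word activations at every time step, so recognition can be
        # tracked as the speech signal unfolds.
        return torch.sigmoid(self.readout(hidden_states))

# Example: a one-second utterance sliced into 100 spectrogram frames.
model = EarshotLikeModel()
speech_input = torch.randn(1, 100, 256)          # placeholder spectrogram
word_activations = model(speech_input)           # shape: (1, 100, 1000)
```

Because the input is an actual acoustic representation rather than an abstract linguistic code, a model of this kind can be tested on speech from different talkers and speaking rates, which is what makes it useful for studying phonetic constancy.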
In this new grant, the team will expand EARSHOT to further bridge human speech recognition research and ASR technologies.
The researchers will run simulations in which the models learn from real human speech. This means the models can be applied to questions of speech development and processing that are out of reach for other current cognitive models.
Over the next three years, the team will incorporate additional techniques from machine learning and artificial intelligence research, new model components based on human brain pathways, and a 20,000-word lexicon.
They will validate the model against behavioral and neural data obtained in previous human subjects research.
This work will advance scientific understanding of the cognitive and neurobiological bases for phonetic constancy. By providing insight into language development and processing, this knowledge can help advance understanding of developmental and acquired speech disorders.
This project may also lead to improved ASR technologies by uncovering ways to give ASR more human-like flexibility to adapt to acoustic variation.
This project was launched with a Research Excellence Program award from UConn’s Office of the Vice President for Research (OVPR) in 2017. This seed funding allowed the group to build their team and secure this NSF funding, paving the way for new discoveries.
“This work simply could not be done without the complementary expertise of our team members, who are psychologists, neuroscientists, engineers, and physicists,” Magnuson says. “UConn provides an amazing environment that enables this kind of innovative and interdisciplinary collaboration.”
Magnuson holds a Ph.D. from the University of Rochester in brain and cognitive sciences. His research interests include neurobiology and psychology of language, computational models as theory-building tools, developing comprehensive understanding of language and learning over the lifespan, and science communication.