
Cornell University researchers have developed a silent speech recognition interface that uses acoustic sensing and artificial intelligence to continuously recognize up to 31 unvocalized commands based on lip and mouth movements.
The low-power, wearable interface — called EchoSpeech — requires just a few minutes of user training data before it recognizes commands and can run on a smartphone.
Ruidong Zhang, a doctoral student in information science, is the lead author of “EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing,” which will be presented at the Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI) this month in Hamburg, Germany.
“For people who cannot vocalize sound, this silent speech technology could be an excellent input for a voice synthesizer,” Zhang said of the technology’s potential use with further development. “It could give patients their voices back.”
In its current form, EchoSpeech can be used to communicate with others via smartphone in places where speech would be uncomfortable or inappropriate, such as a noisy restaurant or a quiet library. The silent speech interface can also be paired with a pen and used with design software such as CAD, all without the need for a keyboard and mouse.
Equipped with a pair of microphones and speakers smaller than pencil erasers, the EchoSpeech glasses become a wearable, AI-powered sonar system, sending and receiving sound waves across the face and sensing mouth movements. A deep learning algorithm then analyzes these echo profiles in real time, with about 95% accuracy.
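As a rough illustration of how such acoustic sensing can work, the sketch below cross-correlates a received microphone frame against a transmitted near-ultrasonic chirp to form an echo profile: echo strength as a function of round-trip delay, i.e., distance to reflecting surfaces around the mouth. This is a minimal sketch of the general technique, not the authors’ implementation, and all parameters (sample rate, chirp band, frame length) are illustrative assumptions.

```python
# Minimal echo-profile sketch (illustrative parameters, not EchoSpeech's).
import numpy as np
from scipy.signal import chirp, correlate

FS = 50_000   # assumed sample rate of the acoustic frontend, Hz
FRAME = 600   # assumed samples per sensing frame (12 ms at 50 kHz)

def make_chirp(f0=16_000.0, f1=21_000.0, n=FRAME, fs=FS):
    """Transmitted probe signal: a linear frequency sweep (assumed band)."""
    t = np.arange(n) / fs
    return chirp(t, f0=f0, t1=t[-1], f1=f1, method="linear")

def echo_profile(rx_frame, tx_chirp):
    """Cross-correlate the received frame with the transmitted chirp.
    Peaks correspond to reflections arriving at specific delays."""
    return np.abs(correlate(rx_frame, tx_chirp, mode="full"))

# Example: synthesize a received frame as an attenuated, delayed copy
# of the chirp (one "reflection") plus sensor noise.
tx = make_chirp()
delay = 40                            # round-trip delay in samples
rx = np.zeros(FRAME)
rx[delay:] = 0.3 * tx[: FRAME - delay]
rx += 0.01 * np.random.randn(FRAME)

profile = echo_profile(rx, tx)
print("strongest echo at lag:", profile.argmax() - (FRAME - 1))  # ≈ 40
```

In a system like the one the article describes, it is the frame-to-frame changes in such profiles, caused by mouth movements, that a deep learning model would decode into commands.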
“We’re moving sonar onto the body,” said Cheng Zhang, assistant professor of information science and director of Cornell’s Smart Computer Interfaces for Future Interactions (SciFi) Lab.
“We’re very excited about this system,” he said, “because it really pushes the field forward in terms of performance and privacy. It’s small, low-power, and privacy-sensitive, all of which are important features for deploying new wearable technologies in the real world.”
Cheng Zhang said that most existing silent speech recognition technology is limited to a select set of predetermined commands and requires the user to face or wear a camera, which is neither practical nor feasible. He said there are also major privacy concerns involving wearable cameras, both for the user and for those with whom the user interacts.
Acoustic sensing technology like EchoSpeech removes the need for wearable video cameras. And because audio data is much smaller than image or video data, it requires less bandwidth to process and can be relayed to a smartphone via Bluetooth in real time, said François Guimbretière, professor of information science.
“And because the data is processed locally on your smartphone rather than being uploaded to the cloud, privacy-sensitive information never leaves your control,” he said.
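A back-of-envelope comparison makes the bandwidth point concrete. The figures below are illustrative assumptions, not measurements from the paper, but they show why raw two-channel audio can stream over Bluetooth in real time while uncompressed video cannot.

```python
# Rough raw data rates: acoustic sensing vs. a wearable camera
# (all figures are illustrative assumptions).
audio_bps = 50_000 * 16 * 2          # 50 kHz, 16-bit samples, 2 microphones
video_bps = 640 * 480 * 3 * 8 * 30   # 640x480 RGB, 8 bits/channel, 30 fps

print(f"audio: {audio_bps / 1e6:6.1f} Mbit/s")   # ~1.6 Mbit/s
print(f"video: {video_bps / 1e6:6.1f} Mbit/s")   # ~221 Mbit/s
print(f"video is ~{video_bps / audio_bps:.0f}x larger")
```

At roughly 1.6 Mbit/s, the assumed audio stream sits within the practical throughput of a Bluetooth link; the uncompressed video stream exceeds it by two orders of magnitude.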