
Only a fraction of the 7,000 to 8,000 languages spoken around the world benefit from modern language technologies such as transcription, automatic captioning, simultaneous translation, and voice recognition. Carnegie Mellon University researchers want to increase the number of languages with automatic speech recognition tools available to them from about 200 to nearly 2,000.
“Many people in this world speak diverse languages, but language technology tools aren’t being developed for all of them,” said Xinjian Li, a Ph.D. student in the School of Computer Science’s Language Technologies Institute (LTI). “Developing technology and a good language model for all people is one of the goals of this research.”
Li is part of a research team that aims to simplify the data requirements languages need to create a speech recognition model. The team, which also includes LTI faculty Shinji Watanabe, Florian Metze, David Mortensen, and Alan Black, presented its latest work, “ASR2K: Speech Recognition for Around 2,000 Languages Without Audio,” at Interspeech 2022 in South Korea.
Most speech recognition models require two data sets: text and audio. Text data exists for thousands of languages; audio data does not. The team hopes to eliminate the need for audio data by focusing on linguistic elements common across many languages.
Speech recognition technologies typically focus on a language’s phonemes, the distinct sounds that distinguish one word from another, such as the “d” that separates “dog” from “log” and “cog.” Phonemes are unique to each language. But languages also have phones, which describe how a word physically sounds. Multiple phones may correspond to a single phoneme. So even though separate languages may have different phonemes, their underlying phones can be the same.
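To make that distinction concrete, the short Python sketch below shows how two languages with different phoneme systems can still draw on a shared pool of physical phones. The inventories and mappings are illustrative assumptions for this example, not data from the ASR2K system.

```python
# Toy illustration of phonemes vs. phones (hypothetical inventories,
# not the actual ASR2K system). Phonemes are language-specific sound
# categories; phones are physical sounds drawn from a universal set.

# Each language maps its phonemes to universal, IPA-like phones.
PHONEME_TO_PHONES = {
    "english": {"/d/": ["d"], "/t/": ["t", "tʰ"], "/r/": ["ɹ"]},
    "spanish": {"/d/": ["d", "ð"], "/t/": ["t"], "/r/": ["r", "ɾ"]},
}

def phone_inventory(language):
    """Collect the set of physical phones a language's phonemes use."""
    phones = set()
    for phone_list in PHONEME_TO_PHONES[language].values():
        phones.update(phone_list)
    return phones

english = phone_inventory("english")
spanish = phone_inventory("spanish")

# Although the phoneme systems differ, the underlying phones overlap,
# which is what lets one phone-level model serve many languages.
print("Shared phones:", english & spanish)  # {'d', 't'}
```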
The LTI team is developing a speech recognition model that moves away from phonemes and instead relies on information about how phones are shared across languages, reducing the effort of building separate models for each one. Specifically, the model is paired with a phylogenetic tree, a diagram that charts the relationships between languages, to help with pronunciation rules. With their model and the tree structure, the team can approximate a speech model for thousands of languages without audio data.
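As a rough sketch of how a phylogenetic tree can stand in for missing pronunciation data, the hypothetical Python below walks up a toy tree to find a language’s closest documented relative and borrows its grapheme-to-phone rules. The tree shape, rule tables, and function names are assumptions made for this example, not the paper’s actual implementation.

```python
# Hypothetical sketch: approximating pronunciation rules for an
# undocumented language by borrowing from its nearest relative in a
# toy phylogenetic tree. Tree and rules are illustrative only.

# child -> parent edges in a tiny language-family tree
PARENT = {
    "portuguese": "western-romance",
    "spanish": "western-romance",
    "western-romance": "romance",
    "italian": "romance",
    "romance": "indo-european",
}

# Grapheme-to-phone rules, available only for documented languages.
G2P_RULES = {
    "spanish": {"j": "x", "ll": "ʝ", "ñ": "ɲ"},
    "italian": {"gn": "ɲ", "gli": "ʎ", "c": "k"},
}

def borrow_rules(language):
    """Climb the tree until a documented relative shares an ancestor."""
    node = language
    while node in PARENT:
        node = PARENT[node]  # move one level up the family tree
        for lang, rules in G2P_RULES.items():
            ancestor = lang
            while ancestor in PARENT:
                ancestor = PARENT[ancestor]
                if ancestor == node:
                    return lang, rules  # closest documented relative
    return None, {}

# Portuguese has no rules here, so we approximate with Spanish's,
# its closest documented relative in this toy tree.
donor, rules = borrow_rules("portuguese")
print(f"Borrowing {donor} rules for portuguese: {rules}")
```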
“We’re trying to remove the requirement for audio data, which helps us move from 100 or 200 languages to 2,000,” Li said. “This is the first research to target this large a number of languages, and we’re the first team aiming to expand language tools to this range.”
Still in its early stages, the research has improved existing language approximation tools by a modest 5%, but the team hopes it will serve as inspiration not only for its future work but also for that of other researchers.
For Li, the work means more than making language technologies available to everyone. It’s about cultural preservation.
“Every language is a very important factor in its culture. Every language has its own story, and if you don’t try to preserve languages, those stories might be lost,” Li said. “Developing this kind of speech recognition system and this tool is a step toward preserving those languages.”