Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) claim to have developed a system that can decipher a lost language without needing knowledge of its relation to other languages. They say it’s a step toward a system that’s able to decipher lost languages using just a few thousand words.
Lost languages are more than an academic curiosity. Without them, humanity risks missing a body of knowledge about the people who historically spoke them. Unfortunately, most lost languages have such minimal records that scientists can’t decipher them using conventional machine-translation algorithms. Some have no well-researched “relative” language to compare against, and their texts often lack traditional dividers like white space and punctuation.
This CSAIL work, which was supported in part by the Intelligence Advanced Research Projects Activity and spearheaded by MIT professor Regina Barzilay, a specialist in natural language processing, leverages several principles grounded in insights from historical linguistics. For instance, while a given language rarely adds or deletes a sound, certain sound substitutions are likely to occur. A word with a “p” in the parent language may change into a “b” in the descendant language, but changing to a “k” is less likely due to the significant pronunciation gap.
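To make the intuition concrete, here is a minimal sketch of a weighted edit distance in which inserting or deleting a sound is expensive and a substitution like “p” to “b” is cheap. The cost values, the `SUB_COST` table, and the function names are assumptions made up for illustration; they are not taken from the CSAIL system, which learns its constraints rather than using a fixed table.

```python
# Hypothetical costs chosen only for illustration (not from the CSAIL system).
INDEL_COST = 3.0  # adding or deleting a sound is rare, so it is expensive
SUB_COST = {
    frozenset("pb"): 0.5,  # p <-> b: small pronunciation gap, cheap
    frozenset("pk"): 2.0,  # p <-> k: large pronunciation gap, costly
}
DEFAULT_SUB_COST = 2.0  # assumed cost for any substitution not listed above


def sub_cost(a: str, b: str) -> float:
    """Cost of replacing sound a with sound b (symmetric)."""
    if a == b:
        return 0.0
    return SUB_COST.get(frozenset((a, b)), DEFAULT_SUB_COST)


def sound_change_cost(parent: str, child: str) -> float:
    """Weighted edit distance between two words, one character per sound."""
    rows, cols = len(parent) + 1, len(child) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * INDEL_COST  # delete every sound of the parent prefix
    for j in range(1, cols):
        dist[0][j] = j * INDEL_COST  # insert every sound of the child prefix
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + INDEL_COST,  # deletion
                dist[i][j - 1] + INDEL_COST,  # insertion
                dist[i - 1][j - 1] + sub_cost(parent[i - 1], child[j - 1]),
            )
    return dist[-1][-1]
```

Under these assumed costs, `sound_change_cost("pat", "bat")` comes out lower than `sound_change_cost("pat", "kat")`, and both are lower than dropping the initial sound entirely, which mirrors the ordering of likelihoods described above.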
By incorporating these and other linguistic constraints, Barzilay and MIT Ph.D. student Jiaming Luo developed a decipherment algorithm that can handle both the vast space of possible transformations and the scarcity of signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between the corresponding vectors. This design enables the system to capture patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.
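As a rough illustration of the embedding idea, the sketch below places a handful of sounds in a two-dimensional space and measures the distance between them. The vectors here are hand-written assumptions (one axis loosely standing for place of articulation, the other for voicing), and `SOUND_VECTORS` and `sound_distance` are hypothetical names; the system described in the article learns its embeddings from data rather than using fixed values like these.

```python
import math

# Hypothetical, hand-written vectors: (place of articulation, voicing).
# These values are assumptions for illustration, not learned embeddings.
SOUND_VECTORS = {
    "p": (0.0, 0.0),  # lips, voiceless
    "b": (0.0, 1.0),  # lips, voiced
    "k": (3.0, 0.0),  # back of the mouth, voiceless
    "g": (3.0, 1.0),  # back of the mouth, voiced
}


def sound_distance(a: str, b: str) -> float:
    """Euclidean distance between the vectors of two sounds."""
    return math.dist(SOUND_VECTORS[a], SOUND_VECTORS[b])


def closest_sound(sound: str) -> str:
    """Return the other sound whose vector lies nearest to the given one."""
    others = (s for s in SOUND_VECTORS if s != sound)
    return min(others, key=lambda s: sound_distance(sound, s))
```

With these assumed vectors, “p” sits closer to “b” than to “k”, so a model that penalizes distance would treat a “p” to “b” shift as the more plausible change.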
With the new system, the algorithm itself infers the relationship between languages by assessing how close two languages are to each other. Moreover, when tested on known languages, it can accurately identify language families.
The team applied their algorithm to Iberian, considering Basque as well as less likely candidates from the Romance, Germanic, Turkic, and Uralic families. The system found that while Basque and Latin were closer to Iberian than the other languages, they were still too different to be considered related.
In future work, the team hopes to expand their efforts beyond connecting texts to related words in a known language, an approach referred to as cognate-based decipherment. The new approach would involve identifying the semantic meaning of words even if researchers don’t know how to read them. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language,” Barzilay said.
Barzilay and coauthors aren’t the only ones to apply AI to recovering long-lost languages. Alphabet’s DeepMind developed a system, Pythia, that learned to recognize patterns in 35,000 relics containing more than 3 million words. It managed to guess missing words or characters from Greek inscriptions on surfaces including stone, ceramic, and metal that were between 1,500 and 2,600 years old.