Dialects of English, Arabic, and French are spoken in parts of the African continent and are used across tribes, ethnic groups, and national borders, but none is native to Africa. Some estimates put the number of living languages on the continent at 2,000 or more. That linguistic diversity can stand in the way of communication as well as commerce, and earlier this year such concerns led to the creation of the Masakhane open source project, an effort by African technologists to translate African languages using neural machine translation.
Kathleen Siminyu is a member of the Luhya tribe in Kenya. Although English is spoken in schools and various parts of the country, tribes speak different languages, which creates a language gap between Siminyu and her neighbors. To bring her community together, she joined Masakhane earlier this year, bringing along her experience as co-organizer of the Women in Machine Learning and Data Science chapter in Nairobi and as a coordinator for AI for Development.
Siminyu believes translating languages using machine learning can be a key to the growth of AI use cases in Africa and enable Africans to apply AI to benefit African lives. Projects like Masakhane are critical to connecting Africa’s community of developers and researchers and constructing a framework for creating sustained, long-term collaboration, Siminyu said.
“Right now, I’m thinking a lot about how research networks can work on this continent,” she said. “I see language as a barrier which, if eliminated, allows a lot of Africans to just be able to engage in the digital economy and eventually in the AI economy. As people who are sitting here building for local languages, I feel like it’s our responsibility to … bring the people who are not in a digital age into the age of AI.”
The Masakhane project works with AI researchers and data scientists across Africa, and the organization aims to create neural machine translation that connects Africa’s many populations. The project was created by Jade Abbott and Laura Martinus from South Africa and came together following lectures and conversations at Deep Learning Indaba and the Sauti Yetu NLP Unconference. The name “Masakhane” means “We build together” in isiZulu.
Masakhane works with groups like Translators Without Borders and academics to find language data sets. In addition to translating native African languages to English, the project will seek to translate dialects like Pidgin English in Nigeria or strands of Arabic in northern and central Africa.
Once it has created machine translation models for African languages, the group envisions a range of open source projects that could benefit Africans.
The group now counts about 60 contributors from across the continent but is most active in South Africa, Kenya, and Nigeria. Each participant is asked to help collect data or train models in their respective mother tongue.
Masakhane isn’t alone in its ambition to spin up more machine translation for Africa by Africans.
This week, Mozilla and a German government ministry launched an open source project to collect voice data from local African languages.
Earlier this month, as part of her work with Artificial Intelligence for Development, Siminyu launched the African Language Dataset Challenge, together with data science challenge website Zindi. In addition to Siminyu and Abbott, advisors assessing data sets come from Google AI and Facebook AI Research. Data sets made by challenge participants may be used to train Masakhane’s neural models in the future.
The wave of projects comes at a time when African countries like Kenya and Nigeria rank among the fastest-growing contributors to open source projects worldwide, according to GitHub’s 2019 Octoverse report. In recent weeks, the growth of the African tech and developer ecosystem has drawn Silicon Valley executives like Twitter CEO Jack Dorsey and GitHub CEO Nat Friedman to visit African cities such as Lagos, Nigeria.
In a group interview, Masakhane volunteers told VentureBeat that the benefits of machine translation for Africa could be substantial.
Translation’s potential for transformation
The interview participants hail from across the continent (Tunisia, Nigeria, South Africa, and the Democratic Republic of the Congo) and said they want to put Africa on the global AI map and see African solutions to African problems.
“We can solve our problems. We have the expertise, we have the intelligence, we have the knowledge — we just need to take some responsibility about it,” said Olabiyi Samuel, a researcher focused on Yoruba in Nigeria.
Widely available and accurate machine translation for African languages could allow more African voices to enter the global conversation online and make it possible to quickly translate educational material from English into African languages. Multiple studies have found that people learn better when they receive instruction in their native tongue.
Siminyu and other project participants want Masakhane to be a starting point for a range of research projects that can apply AI to African challenges and improve lives in other sectors important to the continent.
“We should be thinking about agriculture and how we’re fixing the food problems. We should be thinking about climate change, we should be thinking about health care … I see language as the entry point,” she said. But Siminyu also acknowledged the challenges ahead, saying: “Yeah, I think the road is long.”
Espoir Murhabazi lives in the Democratic Republic of the Congo and focuses on Lingala, a Bantu language. He wants to better understand Bantu languages and how machine learning can infer meaning from words that share a common root. Bantu languages are agglutinative, meaning that a single word can combine a stem with multiple other elements, each contributing its own piece of meaning. It’s an example of the technical challenge Masakhane faces in resolving structural differences between languages.
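To make the idea of agglutination concrete, here is a minimal sketch in Python. It uses Swahili, another Bantu language, rather than Lingala, purely as an illustrative assumption. The tiny morpheme tables and the `segment` helper are hypothetical and hand-written for this example; they are not part of Masakhane's tooling and cover only a handful of verb forms.

```python
from itertools import product

# Hand-picked Swahili morphemes (an assumption for illustration only).
SUBJECT = {"ni": "I", "a": "he/she"}
TENSE = {"na": "present", "li": "past"}
OBJECT = {"ku": "you"}
ROOT = {"pend": "love"}
FINAL_VOWEL = "a"


def segment(word):
    """Split a verb into subject, tense, object, root, and final vowel.

    Returns the list of pieces, or None if the word does not match any
    combination of the toy morpheme tables above.
    """
    for subj, tense, obj, root in product(SUBJECT, TENSE, OBJECT, ROOT):
        if word == subj + tense + obj + root + FINAL_VOWEL:
            return [subj, tense, obj, root, FINAL_VOWEL]
    return None


for verb in ["ninakupenda", "anakupenda", "nilikupenda"]:
    print(verb, "->", "-".join(segment(verb)))
```

All three verbs share the root "pend" (love), while the surrounding pieces change who is acting and when. A model that treats each full word as an unrelated token misses that shared root, which is roughly the problem Murhabazi describes.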
On a more playful level, Murhabazi wants to see projects like Masakhane offer support for translating songs into English so everyone enjoying the music can comprehend the lyrics.
“Last time I was in Kenya, I found people dancing to music in nightclubs and bars without understanding all they are dancing to,” he said.
The Masakhane project plan
Masakhane’s work will roll out in phases, starting with English translation to African languages using publicly available data, like government documents or newspapers. Once that’s complete, the group plans to create individual baseline models for translation. They’ll then submit the work for publication at top NLP conferences around the world.
The project is now in the data-gathering and translation phase, Abbott said, because unlike European languages that make up the backbone of the modern internet, African languages lack benchmarks and large data sets.
Africa, AI, and the world
Beyond creating digital economies and allowing people to learn in their own language, Masakhane participants also hope that the successful creation of an AI project by Africans will loosen restrictions often placed on African AI researchers.
Many AI research conferences are held in Europe, Asia, or North America, and despite global demand from industry and nations for AI talent, governments sometimes deny entry to Africans in the field, even if they’re studying in a Western nation.
For example, as Vancouver, Canada, prepares to welcome NeurIPS, the largest AI research conference in the world, next month, African and Asian researchers, including Masakhane volunteers, have reported being denied visas by the Canadian government.
For Abbott and Martinus, the ability to travel to events outside Africa (like NeurIPS) has paid dividends that can be applied directly to the burgeoning Masakhane project. At such events, Abbott said, other NLP developers share 100 or so tips, perspectives, and lessons learned when attempting to optimize model performance.
“Meeting the community working worldwide on low-resource languages really spurred us in our research,” Abbott said.
For example, shortly after launch, Masakhane turned to JW300, a data set covering more than 300 languages drawn from Jehovah’s Witnesses texts, a resource the group learned about after attending ACL.
“We were looking at data sets that range from … 20,000 parallel sentences, which in [the] machine translation world is very small. The same language in this JW300 data set ended up with 1 million parallel sentences, which is a massive jump in magnitude,” she said.
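For readers unfamiliar with the term, a parallel sentence is a sentence in one language paired with its translation in another, and translation models are trained on many such pairs. The Python sketch below shows the shape of that data and the size of the jump Abbott describes. The `SentencePair` type and the two English-Swahili example pairs are hypothetical illustrations, not entries from JW300; only the two corpus sizes come from her quote.

```python
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One training example: a sentence and its translation."""
    source: str  # English
    target: str  # Swahili (chosen only for illustration)


# A toy parallel corpus; real corpora hold thousands to millions of pairs.
corpus = [
    SentencePair("Thank you very much.", "Asante sana."),
    SentencePair("Welcome.", "Karibu."),
]

# Corpus sizes taken from the quote above.
earlier_size = 20_000
jw300_size = 1_000_000

growth = jw300_size / earlier_size
print(f"{len(corpus)} example pairs; JW300 is {growth:.0f}x larger")
```

By these figures the JW300 data is 50 times the size of the earlier data set for the same language.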
Abbott and Martinus detailed some early findings in applying Transformers, a kind of neural network, to low-resource languages in “Towards Neural Machine Translation for African Languages,” a preprint published on arXiv and shared at the Machine Learning for the Developing World workshop at NeurIPS in 2018. In that work, a range of techniques for low-resource languages achieved state-of-the-art performance for English-to-Setswana (Tswana) translation.
Still in its early stages, the ambitious Masakhane project is looking for volunteers and is currently amassing data for thousands of languages.
Open source projects like MySQL, Python, and TensorFlow built the foundation of the modern internet and growing disciplines like machine learning. Today, developers from places like Europe, Asia, and North America still lead the world in open source project contributions, but if Masakhane and projects like it succeed, that could spark major change for the continent with the youngest population on Earth — and for the rest of the world.