Khaya African Language Translation AI Adds Gurene, Kikuyu, Kimeru & Luo

Jul 5, 2023

The Khaya AI app expands in Northern Ghana with the addition of Gurene (also known as Frafra or Farefare) and grows in Kenya with the addition of Kikuyu, Kimeru & Luo. This release also improves Dagbani speech recognition.

The following describes work done by the NLP Ghana and Algorine teams to democratize access to modern machine learning tools for Ghanaian and other African languages. Specifically, it covers the work leading up to the release of version 1.0.5 of the Khaya AI app, which advances the state of the art in automatic speech recognition (ASR) and machine translation for languages spoken by over 15 million people in Africa, specifically in Ghana, Burkina Faso & Kenya.

Reminder — What Is The Inspiration Behind The Name Khaya AI?

**Fig 1.** Khaya AI is named after the Khaya African Mahogany tree. Just like the tree, it is rooted in Africa. We hope it will similarly become a nourishing, sustaining resource for Africa and Africans in the digital future. It is also a word for “home” in several Southern African languages.

What Could the Previous Version of Khaya AI do?

**Fig 2.** Last year, version 1.0.4 of the Khaya App was released — providing the world with Dagbani, Ga, Twi and Yoruba Automatic Speech Recognition (ASR) capabilities, as well as Ga, Ewe, Twi and Yoruba neural text translators.

The previous version, 1.0.4, of the Khaya AI App could perform Automatic Speech Recognition (ASR) in Twi, Ga, Yoruba & Dagbani, and offered neural text translators for Ga, Ewe, Twi & Yoruba. It also included the crucial ability to gather feedback from the public to improve its quality over time. See our v1.0.4 announcement article for details.

Over the past year, the team has been working to improve the capabilities of these machine learning systems. Today we are happy to release version 1.0.5 of the Khaya AI App, showcasing improvements in quality and expanded language coverage. You can already use the app on Web, Android and iOS, or via API in your own apps, by following the links at https://linktr.ee/nlpghana

We outline the various improvements achieved in this article, as summarized by the following list and subsequent sections.

What Can The New Version of Khaya AI Do?

Highlights
1. Gurene (alternatively known as Farefare or Frafra) text translation has been added, expanding Khaya’s language coverage in Northern Ghana
2. Collaboration with the Harvard African Language School on Kenyan Languages allowed us to add Kikuyu, Kimeru and Luo text translators to the system. This underscores our commitment to providing world-class solutions across Africa — wherever they are needed
3. We improved the Dagbani ASR system through the curation of an open Dagbani speech corpus
4. Text translators have continued to improve

Now let’s dive deeper into the enhancements being released and ongoing work.

Gurene Introduced as Expansion in Northern Ghana Continues

In version 1.0.4 of the app, released last year, we introduced our first Northern Ghanaian language, Dagbani. This time, we are adding Gurene (also known as Frafra) to the app. According to Ethnologue, this language is spoken by over 700,000 people. To the best of our knowledge, this is the first and only app providing machine translation capabilities for this language.

To achieve this, we worked closely with native language experts from the Gurene Wikimedia Group to curate a dataset of over 17,000 Gurene translations. Our usual vanilla transformer models were then trained on the data. They achieved BLEU scores of 29 for Gurene to English and 21 for English to Gurene. We are in the process of publishing the data and details of its collection process. Stay tuned.
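
For readers unfamiliar with BLEU, the following is a minimal sketch of how a corpus-level BLEU score is computed. It is an illustration only: the article does not say which BLEU implementation or tokenization was used for the scores above, so the whitespace tokenization, the absence of smoothing, and the function names `ngrams` and `corpus_bleu` are all assumptions made for this example. The sample sentences are English placeholders, not data from our corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions for this sketch: whitespace tokenization and no smoothing,
    so the score is 0.0 whenever some n-gram order has no matches at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: an n-gram is credited at most as many
            # times as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    references = ["the cat sat on the mat"]
    print(corpus_bleu(["the cat sat on the mat"], references))
    print(corpus_bleu(["the cat sat on the mat today"], references))
```

A perfect match scores 100, and extra or missing words lower the score. Real evaluations typically rely on a standard tool with a fixed tokenization so that scores are comparable across papers.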

Kikuyu, Kimeru & Luo Introduced as Khaya Grows in Kenya

While our goal when starting Ghana NLP was to ensure someone was prioritizing Ghanaian languages in AI research, we have also been committed to providing solutions wherever they are needed in Africa. In this update, we are proud to include Kikuyu, Kimeru & Luo text translators. Note that none of these languages is available in Google Translate.

We worked closely with Professor John Mugane of the Harvard African Language School, who is Kenyan and was able to engage native speakers to curate high-quality datasets for these languages. We made sure to include news, colloquial speech, and data from as many other domains as possible to make the datasets representative. These were then used to train our vanilla transformer text translation models.

Kikuyu is spoken by about 6.5 million people, while Luo is spoken by over 4 million (according to Ethnologue). The only other machine translation model available for these languages is Meta’s NLLB (No Language Left Behind) model. As such, we benchmarked our models against NLLB on Meta’s own benchmark, FLORES-200. To make the comparison fairer, we used the distilled 600M-parameter version of NLLB, as our models are under 80M parameters in size.

We found that while NLLB scored 9 in the English to Kikuyu direction, it scored a poor 3 in the Kikuyu to English direction. Khaya, on the other hand, scored over 11 on English to Kikuyu and over 16 on Kikuyu to English. This performance difference becomes even more dramatic when we use our own higher-quality and more representative data as the benchmark.

For instance, for English to Luo, NLLB scores about 15 while Khaya scores about 20 on our higher-quality benchmark. For Luo to English, Khaya scores 31 while NLLB scores a paltry 2. NLLB is clearly weakest when translating into English, where the gap between the two systems is widest.

These comparisons were confirmed by human evaluators, who found NLLB to be barely usable overall while being impressed by Khaya’s output most of the time.
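
To make the Kikuyu and Luo comparison easier to scan, the scores quoted above can be collected and the gap computed per direction. This is only a small arithmetic sketch: figures described as "over" or "about" a number are entered as that rounded number, so the gaps are approximate. The scores are presumably BLEU, the metric used elsewhere in this article. The names `REPORTED_SCORES` and `score_gaps` are made up for this example.

```python
# Scores as quoted in the text above (NLLB is the distilled 600M model).
# "over 11", "over 16", "about 15" and "about 20" are entered as the
# rounded numbers, so the resulting gaps are approximate.
REPORTED_SCORES = [
    # (direction, benchmark, nllb_score, khaya_score)
    ("English to Kikuyu", "FLORES-200", 9, 11),
    ("Kikuyu to English", "FLORES-200", 3, 16),
    ("English to Luo", "in-house", 15, 20),
    ("Luo to English", "in-house", 2, 31),
]


def score_gaps(rows):
    """Return {(direction, benchmark): khaya_score - nllb_score}."""
    return {
        (direction, benchmark): khaya - nllb
        for direction, benchmark, nllb, khaya in rows
    }


if __name__ == "__main__":
    for (direction, benchmark), gap in score_gaps(REPORTED_SCORES).items():
        print(f"{direction} ({benchmark}): Khaya ahead by about {gap} points")
```

The largest gaps appear in the directions that translate into English, in line with the observation above.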

For Kimeru, spoken by over 2 million people according to Ethnologue, no other translation models are available to the best of our knowledge (NLLB does not presently cover it). On our high-quality benchmarks, we achieved BLEU scores of 23 for Kimeru to English and 10 for English to Kimeru.

We are in the process of publishing the data and details of its collection process. Stay tuned.

Dagbani Speech Recognition Improved

In version 1.0.4 of the app, we added the ability to recognize Dagbani, but it was limited to single words due to the constraints on the training data available. Over the past year, we worked closely with the Dagbani Wikimedia group to curate an audio dataset and trained a speech recognition model that can handle longer utterances. This model is now available in the Khaya App and API. We published the data and data collection procedure details at the AfricaNLP 2023 Kigali Workshop, in case you are interested in building on our work.

Other Text Translator Improvements

Our translators have continued to improve across the board through human feedback submitted through the app, as well as through other innovations. Please stay tuned for more peer-reviewed publications reporting on these in the near future.

What Comes Next?

1. Many more languages in Ghana and across Africa will be added. We have many models in the pipeline for release, but we want to ensure quality: our models have to be the best out there and meet a minimum usability threshold to qualify for release. This takes quite a bit of time and effort, so we thank you for your patience as we work diligently to add more languages!
2. Many folks have asked for text-to-speech capabilities. These should begin to become available for some languages this year!
3. We have already released an API that you can use to integrate text translators into your own apps: https://translation.ghananlp.org/. We will be adding speech recognition to this API within weeks, and text-to-speech will follow.

Acknowledgements

Many thanks to Google for the GCP credits used to train and evaluate the models. Thanks to Abdoulaye Diack for facilitating this process for us at Google. Much gratitude to the Motsepe Presidential Research Accelerator Fund for Africa for the funding that was used to collect the described data. Many thanks to Professor John Mugane and his data collection team, as well as to the Gurene and Dagbani Wikimedia groups for lending their linguistic expertise! Many thanks to Mr. Sadik Shahadu and Mr. Abugre Anyorigya for many insightful discussions.