Google launches WAXAL: A new era for African AI sovereignty


If you speak to an artificial-intelligence bot in an African language, it will most likely not understand you. If it does manage to muster a response, it will be rife with errors. This is an existential problem with AI that everybody in Africa is trying to solve. Now, Google has joined the cause.

On February 3, Google launched WAXAL, a data set for 21 African languages, including Acholi, Hausa, Luganda, and Yoruba.

“Taking its name from the Wolof word for ‘speak,’ this dataset was developed over three years to empower researchers and drive the development of inclusive technology across Africa,” Google said in a blog post.

WAXAL will make it easier to build AI products that understand African languages. It also represents a rare move toward digital sovereignty: The data set is owned by the African partners who worked on the project, not by Google.

“WAXAL is a collaborative achievement, powered by the expertise of leading African organizations who were essential partners in the creation of this dataset,” Google said. “This framework ensures our partners retain ownership of the data they collected, while working with us toward the shared goal of making these resources available to the global research community.”

Google’s African partners for this project include Makerere University in Uganda, the University of Ghana, AI and open data company Digital Umuganda in Rwanda, and the African Institute for Mathematical Sciences, among others.

“The scarcity of high-quality, permissively licensed speech corpora has historically been the primary bottleneck for everyone,” Abdoulaye Diack, research project manager at Google AI, told Rest of World. “Success lies in the local ownership of [this] innovation cycle.”

Data ownership has become one of the most critical points of contention related to the global rise of AI. For years, tech firms from the U.S. and China have controlled vast data sets from across the world. They use this data – at times collected without clear consent or compensation – to train their AI models.

Now, many countries, especially emerging economies, are building frameworks to claim ownership and protect their data by storing it within their borders. With data-driven businesses estimated to generate over $2 trillion annually, it has become critical to identify who owns the data and who ultimately benefits from it.

WAXAL contains over 11,000 hours of speech data from nearly 2 million individual recordings, including about 1,250 hours of transcribed speech for automatic speech recognition and over 20 hours of studio recordings for text-to-speech voice synthesis.

The creators of WAXAL made an intentional choice to release the data under a permissive license to allow commercial deployment, Diack said. Keeping it open-source will help African entrepreneurs innovate without going through Silicon Valley intermediaries.

Several local organizations are already putting WAXAL to work, Diack said.

“We are already seeing incredible use cases,” he said. “The University of Ghana is utilizing the data for maternal health-care research. … These institutions aren’t just collectors — they are now hubs of AI infrastructure.”

The data, controlled by African institutions and made open-source for everybody, is a great foundation to build on, Nigerian linguist and language expert Kola Tubosun told Rest of World.

Google is not alone in this race. Microsoft recently introduced Paza, a new pipeline and benchmarking tool for 39 African languages, signaling a shift toward community-led AI infrastructure.

Building WAXAL was not without hurdles. African languages are linguistically rich, with several layers of context, and that posed major technical challenges for Google and its partners, Diack said.

“Transcription was our steepest mountain. We leaned heavily on university linguistics departments to navigate dialectal nuances and orthographic standards,” he said. “On the hardware side, capturing ‘studio-quality’ audio in varied environments required real African ingenuity — partners designed portable, self-made recording boxes and used noise-canceling technology to ensure the audio was clean enough for high-fidelity TTS models.”

Tubosun fears these issues might persist if the voice data set isn’t captured perfectly. “People have pointed out that the Yoruba data in Google’s release lacks diacritics, which isn’t optimal. Diacritics are a crucial element in Yoruba speech, so the absence will significantly degrade its performance for Text-to-speech,” he said.

While the data set captured a lot, Diack said, the vast dialectal variation across the continent remains a challenge that must be addressed to ensure no community is left behind.

“We currently have an additional six languages in the pipeline, bringing our total to 27. However, our long-term strategy focuses on sustainability through partnership,” he said.


