WAXAL: A Massive Open Speech Dataset for 27 African Languages

If you’ve ever tried to use a voice assistant in a language that isn’t English, Mandarin, or Spanish, you know the pain. The tech just isn’t there. For the hundreds of millions of people across Sub-Saharan Africa, where over 2,000 languages are spoken, this isn’t just an inconvenience — it’s a barrier to essential tools like voice search, transcription, and accessibility features.

Google Research has been quietly working on this problem since 2021. Today, they’re releasing WAXAL, a massive open speech dataset covering 27 Sub-Saharan African languages spoken by more than 100 million people across 26+ countries. The numbers are impressive: about 1,846 hours of transcribed natural speech for automatic speech recognition (ASR), plus over 565 hours of high-fidelity recordings for text-to-speech (TTS). All of it is released under a Creative Commons CC-BY-4.0 license, which means anyone (researchers, startups, even big companies) can use it, including commercially, as long as they credit the source.

What’s particularly smart about WAXAL is how they collected the data. For the ASR portion, they didn’t have people read boring scripts. Instead, participants described visual stimuli — images from Google’s Open Images dataset covering 50+ topics. This prompted spontaneous, natural speech with all the messy real-world stuff: tonal variations, code-switching, hesitations. That’s exactly the kind of data you need if you want models that work outside a recording studio.

The TTS side was even more collaborative. Local community members worked in pairs, drafting scripts of 10,000–20,000 words and then taking turns as reader and recorder. Some participants got creative with the project funding, building custom studio boxes in their homes to get professional-grade acoustics. The result is more than 565 hours of phonetically balanced audio that’s been segmented, aligned with text, and manually reviewed for quality.

This isn’t Google’s first rodeo with low-resource languages — they’ve done similar work with the Masakhane project and others — but WAXAL feels different in scale and intent. The fact that it’s permissively licensed means we might actually see commercial products built on it, not just academic papers. And the plan is to keep expanding, adding more languages over time.

That said, 27 languages out of more than 2,000 is a drop in the ocean. WAXAL covers major languages like Swahili, Yoruba, Hausa, and Zulu, but leaves out the vast majority of smaller ones. The dataset also skews toward certain regions and recording conditions. If you’re building a voice system for a rural dialect in Chad, you’re still on your own.
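
To put the "drop in the ocean" point in numbers, here is a quick back-of-the-envelope sketch in Python that uses only the figures quoted in this article. The per-language averages are a naive assumption: the article does not say how the hours are split across languages, or whether every language has TTS recordings, so treat those two numbers as rough orientation rather than facts about the dataset.

```python
# Figures quoted in this article. The language total is a lower bound ("over 2,000").
LANGUAGES_COVERED = 27
LANGUAGES_TOTAL = 2_000
ASR_HOURS = 1_846
TTS_HOURS = 565


def coverage_percent(covered: int, total: int) -> float:
    """Share of languages covered, expressed as a percentage."""
    return 100 * covered / total


def mean_hours_per_language(hours: float, languages: int) -> float:
    """Naive average that assumes hours are spread evenly across languages.

    This even split is an assumption made for illustration only.
    """
    return hours / languages


if __name__ == "__main__":
    coverage = coverage_percent(LANGUAGES_COVERED, LANGUAGES_TOTAL)
    asr_mean = mean_hours_per_language(ASR_HOURS, LANGUAGES_COVERED)
    tts_mean = mean_hours_per_language(TTS_HOURS, LANGUAGES_COVERED)
    print(f"Language coverage: {coverage:.2f}%")
    print(f"ASR: {asr_mean:.1f} hours per language (naive average)")
    print(f"TTS: {tts_mean:.1f} hours per language (naive average)")
```

With the article's figures, coverage comes out at about 1.35% of the region's languages, and since 2,000 is a lower bound the true share is smaller still.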

Still, this is a solid foundation. The combination of natural speech for ASR and studio-quality audio for TTS, all under an open license, is exactly what the African AI ecosystem needs. Whether it’s startups building voice-enabled banking tools or researchers working on dialect recognition, WAXAL gives them a starting point that didn’t exist before.

The paper and dataset are available now. If you’re working on speech tech for African languages, go grab it. This is the kind of resource that actually moves the needle.
