Google dropped a new text-to-speech model today, and honestly, it’s the first one in a while that made me sit up and pay attention. Gemini 3.1 Flash TTS is rolling out to developers via the Gemini API and Google AI Studio, to enterprises on Vertex AI, and to Workspace users through Google Vids. The headline feature is something called audio tags—natural language commands you embed directly into text to control vocal style, pace, and delivery.
Let me be blunt: listening to most TTS models feels like hearing a very polite robot read a script. You get one or two voices, maybe some speed control, and that’s it. Google is trying to change that by giving you granular control over how the AI speaks. You can tell it to speak slowly, add emphasis, or change tone mid-sentence. The concept isn’t revolutionary (some voice assistants already offer similar controls), but the execution here seems more deliberate.
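To make the idea concrete, here’s a rough sketch of how you might assemble tagged text before sending it to the model. Treat the details as assumptions on my part: the square-bracket syntax, the tag names ("slowly", "excited"), and the `Segment` and `render` helpers are all hypothetical, and nothing here calls the Gemini API. Check Google’s documentation for the actual tag format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One chunk of script text with an optional delivery instruction."""
    text: str
    tag: Optional[str] = None  # hypothetical audio tag, e.g. "slowly"


def render(segments: List[Segment]) -> str:
    """Join segments into one string, prefixing tagged ones with [tag].

    The [tag] bracket syntax is an assumption for illustration only.
    """
    parts = []
    for seg in segments:
        if seg.tag:
            parts.append(f"[{seg.tag}] {seg.text}")
        else:
            parts.append(seg.text)
    return " ".join(parts)


script = render([
    Segment("Welcome back to the show."),
    Segment("Tonight's story is a strange one.", tag="slowly"),
    Segment("And it starts right now!", tag="excited"),
])
print(script)
```

The point of keeping tags as data rather than hand-typing them into strings is that you can swap the delivery of a line without touching the words, which is the kind of control the audio tags feature is pitching.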
On the quality front, Google claims Gemini 3.1 Flash TTS is its most natural and expressive model yet. The company cites an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, which is based on thousands of blind human preference tests. That’s a solid number, and Artificial Analysis has placed the model in its “most attractive quadrant” for balancing high-quality speech with low cost. I’d take benchmark scores with a grain of salt, since they don’t always translate to real-world use, but it’s a good sign.
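If Elo scores feel abstract, the standard Elo formula turns a rating gap into an expected head-to-head preference rate. The sketch below assumes the leaderboard uses the conventional 400-point logistic scale, which I haven’t verified, and the rival rating of 1,111 is a made-up number for illustration, not a real leaderboard entry.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Expected share of blind comparisons won by A under the standard Elo model.

    Assumes the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


gemini_score = 1211          # figure cited by Google
hypothetical_rival = 1111    # invented rating, 100 points lower

share = expected_preference(gemini_score, hypothetical_rival)
print(f"Expected preference rate: {share:.1%}")
```

Under those assumptions, a 100-point lead works out to listeners preferring the higher-rated model in roughly 64% of comparisons. That’s a clear edge, but a long way from a clean sweep.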
The model also supports native multi-speaker dialogue and over 70 languages. Multi-speaker is the kind of feature that sounds niche until you need it for a podcast, audiobook, or interactive app. Then it’s a lifesaver.
Now, one caveat: all audio generated with Gemini 3.1 Flash TTS is watermarked with SynthID, Google’s invisible watermarking tech for AI-generated content. That’s good for combating misinformation, but it also means you can’t use the model for anything where you’d want to hide the AI origin. It’s a trade-off I’m fine with, but developers building certain types of applications might find it limiting.
The model is launching in preview, so it isn’t positioned as production-ready yet. Developers can start experimenting in Google AI Studio today, enterprise users can access it via Vertex AI, and Workspace users will see it in Google Vids. No word on general availability yet.
My take? This is a step in the right direction. The audio tags approach gives creators real control without needing a PhD in machine learning. But I’m curious to see how well it handles edge cases—heavy accents, emotional range, non-standard punctuation. Those are the tests that separate a good TTS model from a great one.
If you’re building voice apps, give it a spin. Just don’t expect it to replace a human voice actor for high-end productions. Not yet.