Introducing Eleven v3 (alpha) - the most expressive Text to Speech model ever. Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers]. Now in public alpha and 80% off in June.
This is a research preview. It requires more prompt engineering than previous models - but the generations are breathtaking. We’ll continue fine-tuning to improve reliability and control.
The new architecture of Eleven v3 deeply understands text, delivering much greater expressiveness. And now you can guide generations more directly using audio tags (see the examples after this list):
- Emotions: [sad] [angry] [happily]
- Delivery direction: [whispers] [shouts]
- Non-verbal reactions: [laughs] [clears throat] [sighs]
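For illustration, a tagged prompt might look like the sketch below. The sentence itself is invented; the tags are drawn from the lists above:

```
[whispers] I have a secret to tell you. [laughs] Actually, never mind. [sighs] You will find out soon enough.
```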
Generate multi-speaker dialogue that sounds like a real conversation. Eleven v3 handles interruptions, shifts in tone, and emotional cues based on conversational context.
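A dialogue prompt might look like the following sketch. The "Speaker 1:" / "Speaker 2:" labels are an assumed convention for illustration, not a confirmed format; see the prompting guide for the exact syntax:

```
Speaker 1: [excited] Did you hear? The launch went live this morning!
Speaker 2: [laughs] No way. [whispers] How did it go?
Speaker 1: [sighs] Honestly, better than we hoped.
```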
The public API for Eleven v3 (alpha) is coming soon; for early access, please contact sales. We are also working on a real-time version of v3. For real-time and conversational use cases, we recommend staying with v2.5 Turbo or Flash for now.
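Once the public API lands, a call could plausibly look like the minimal Python sketch below, assuming v3 is exposed through the existing text-to-speech endpoint. The voice ID and the "eleven_v3" model ID are placeholders, not confirmed identifiers:

```python
# Hypothetical sketch: requesting a v3 (alpha) generation via the
# existing ElevenLabs text-to-speech REST endpoint. The VOICE_ID
# and the "eleven_v3" model ID are assumptions for illustration.
import requests

API_KEY = "YOUR_API_KEY"    # your ElevenLabs API key
VOICE_ID = "YOUR_VOICE_ID"  # placeholder voice ID

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "[excited] It works! [laughs] I can't believe it.",
        "model_id": "eleven_v3",  # assumed model ID for v3 (alpha)
    },
)
response.raise_for_status()

# The endpoint returns raw audio bytes (MP3 by default).
with open("output.mp3", "wb") as f:
    f.write(response.content)
```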
Built for creators and developers. If you’re working on videos, audiobooks, or media tools, v3 unlocks a new level of expressiveness. Learn how to get the most out of it with our prompting guide:
Eleven v3 (alpha) is available now, 80% off during June. Try it out and share your best generations with us.