Fast and accurate. Natural, expressive voices. Simple pricing. Multilingual support.
Today, we are excited to announce two powerful standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support.
These standalone endpoints make it straightforward for developers to integrate high-quality speech features into any application, whether you're creating voice agents, real-time transcription tools, accessibility solutions, podcasts, or interactive audio experiences.
High accuracy, low latency.
We’ve added powerful features like word-level timestamps, speaker diarization, and multichannel support. The API also includes intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more.
We keep pricing straightforward and predictable: Speech to Text is $0.10 per hour for batch and $0.20 per hour for streaming. Full details and current rate limits are available in the xAI API console.
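As a rough sketch of how these per-hour rates translate into a bill, the Python below multiplies audio duration by the rates quoted above. The function name is hypothetical, and the assumption that billing is strictly proportional to duration (no rounding, minimums, or tiers) is illustrative rather than taken from xAI's documentation.

```python
# Per-hour rates quoted in this post (USD).
STT_USD_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> float:
    """Estimate Speech to Text cost in USD, assuming purely linear billing."""
    if mode not in STT_USD_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = audio_seconds / 3600
    return hours * STT_USD_PER_HOUR[mode]


# Example: 500 hours of recorded calls, batch vs. streaming.
print(f"batch:     ${estimate_stt_cost(500 * 3600, 'batch'):.2f}")
print(f"streaming: ${estimate_stt_cost(500 * 3600, 'streaming'):.2f}")
```

Under these assumptions, streaming costs exactly twice as much as batch for the same audio, so batch is the cheaper choice whenever real-time results are not required.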
Grok STT is evaluated against the top commercial models on phone calls, meetings, video/podcasts, and telephony. It excels at entity recognition and at business use cases in the medical, legal, and financial domains.
| Domain (Word Error Rate) | Grok STT | ElevenLabs | Deepgram | AssemblyAI |
|---|---|---|---|---|
| Phone Call Entities | 5.0% | 12.0% | 13.5% | 21.3% |
| Video/Podcasts | 2.4% | 2.4% | 3.0% | 3.2% |
| Meetings | 10.9% | 12.2% | 16.3% | 15.7% |
| Telephone | 9.3% | 9.4% | 11.0% | 11.2% |
| Overall | 6.9% | 9.0% | 11.0% | 12.9% |
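
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic textbook implementation, not the evaluation harness behind the numbers above. The normalization it applies (lowercasing and whitespace splitting) is an assumption, and different choices can shift results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("call me at five pm", "call me at nine pm"))
```

Lower is better: a score of 0.0 means the hypothesis matches the reference word for word, and scores can exceed 1.0 when the hypothesis contains many extra words.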
Most transcription models give you raw spoken words. Grok Speech to Text goes further.
When you enable formatting, the API performs advanced Inverse Text Normalization that intelligently converts spoken language into properly structured output, such as numbers, dates, and currencies.
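To make the idea concrete, here is a deliberately tiny toy that rewrites spelled-out cardinal numbers from zero to ninety-nine as digits. It is only an illustration of what Inverse Text Normalization means, not how Grok Speech to Text implements it. The function name is hypothetical, and the toy ignores punctuation, larger numbers, dates, and currencies.

```python
UNITS = {
    word: value
    for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen".split()
    )
}
TENS = {
    word: 10 * value
    for value, word in enumerate(
        "twenty thirty forty fifty sixty seventy eighty ninety".split(),
        start=2,
    )
}


def normalize_numbers(text: str) -> str:
    """Replace spelled-out numbers (0-99) with digits; toy ITN only."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i].lower()
        if word in TENS:
            value = TENS[word]
            # Absorb a following unit word, e.g. "twenty" + "five" -> 25.
            nxt = words[i + 1].lower() if i + 1 < len(words) else ""
            if 1 <= UNITS.get(nxt, 0) <= 9:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(words[i])
        i += 1
    return " ".join(out)


print(normalize_numbers("I have twenty five apples and three pears"))
```

A production system also has to decide when not to convert (for example, "one of them"), which is part of why the rule-based approach above does not scale.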
The Grok Speech to Text API offers strong multilingual support across 25+ languages. Switch languages seamlessly without missing a beat.
Transcribe multichannel audio files for perfect speaker separation with the same API.
Detect speakers in both pre-recorded and real-time streaming audio, with word-level speaker IDs from diarization.
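Word-level speaker IDs are usually easiest to read once they are merged into speaker turns. The sketch below assumes each word arrives as a dictionary with hypothetical `word` and `speaker` keys. The actual response schema is not described in this post, so adapt the field names to whatever the API returns.

```python
from itertools import groupby


def words_to_turns(words):
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    # groupby only groups adjacent items, which is what we want here:
    # a speaker who talks again later starts a new turn.
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


# Hypothetical diarized output; field names are assumptions.
sample = [
    {"word": "Hi", "speaker": 0},
    {"word": "there.", "speaker": 0},
    {"word": "Hello!", "speaker": 1},
    {"word": "How", "speaker": 0},
    {"word": "are", "speaker": 0},
    {"word": "you?", "speaker": 0},
]

for speaker, text in words_to_turns(sample):
    print(f"Speaker {speaker}: {text}")
```

Because the grouping works on a stream of words in order, the same approach can be applied incrementally to streaming results as well as to a finished batch transcript.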
Fast, natural, and expressive voices with Speech Tags.
Add natural prosody and emotion using simple inline and wrapping speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>, and many more. These controls let you create engaging, lifelike delivery without complex markup.
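As an illustration, the helpers below assemble a tagged script as a plain string before it would be sent for synthesis. The helper names are hypothetical. The bracket form for inline tags comes from the list above, while the XML-style closing tag used for wrapping tags (`</emphasis>`) is an assumption that should be checked against the API documentation.

```python
def inline(tag: str) -> str:
    """Return an inline speech tag such as [laugh]."""
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a speech tag.

    Assumption: wrapping tags close XML-style, e.g. <slow>...</slow>.
    """
    return f"<{tag}>{text}</{tag}>"


script = " ".join(
    [
        "Welcome back!",
        inline("laugh"),
        wrap("emphasis", "This one"),
        "is my favorite.",
    ]
)
print(script)
```

Keeping tag construction in small helpers makes it easier to change the syntax in one place if the real markup differs from the assumption made here.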
Text to Speech is priced at $15.00 per 1 million characters, with straightforward usage-based billing and no hidden fees.
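A quick way to budget for synthesis is to count characters and scale by the rate above. The sketch below assumes every character in the input counts toward billing, including whitespace and any speech tags. That assumption, like the function name, is illustrative and not confirmed by this post.

```python
TTS_USD_PER_MILLION_CHARS = 15.00  # rate quoted in this post


def estimate_tts_cost(text: str) -> float:
    """Estimate Text to Speech cost in USD, assuming every character is billed."""
    return len(text) / 1_000_000 * TTS_USD_PER_MILLION_CHARS


# Example: a 200,000-character manuscript.
manuscript = "a" * 200_000
print(f"${estimate_tts_cost(manuscript):.2f}")
```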