Grok Voice

Available in API

Voice agents that feel human.

Deploy intelligent speech-to-speech voice agents for customer support, sales, and more. Enterprise-grade text-to-speech and speech-to-text APIs.

#1 Tau Voice Leaderboard · Sub-second latency · 25+ languages · $0.05 / min

Voice Agent

Build real-time voice agents with tool use, search, and multi-turn conversation.

  • Full-duplex real-time conversations with sub-second latency
  • Built-in reasoning for complex, multi-step requests
  • Orchestrate dozens of tools in ambiguous real-world workflows
Voices: Ara, Eve, Leo, Rex, Sal
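
The tool-use bullets above describe an agent that calls functions mid-conversation. A minimal sketch of the application side of that loop follows. The message shape (`name`, `arguments`), the `lookup_order` tool, and the registry are hypothetical and chosen for illustration; they are not taken from the Grok Voice API.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry: names and handlers are illustrative only.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under a tool name the agent can request."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return decorator


@tool("lookup_order")
def lookup_order(order_id: str) -> dict:
    # Stand-in for a real backend call.
    return {"order_id": order_id, "status": "shipped"}


def handle_tool_call(message: str) -> str:
    """Dispatch one tool-call message (assumed JSON shape) and return a JSON reply."""
    call = json.loads(message)
    name = call["name"]
    handler = TOOLS.get(name)
    if handler is None:
        # Report unknown tools instead of raising, so the session can continue.
        return json.dumps({"error": f"unknown tool: {name}"})
    result = handler(**call.get("arguments", {}))
    return json.dumps({"name": name, "result": result})
```

Returning an error payload for unknown tools, rather than raising, keeps a live conversation running when the agent requests something the application does not expose.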

Text to Speech

Natural speech from text with multiple voices and audio formats. Built for telephony and web.

  • 80+ natural voices across 25+ languages
  • Speech tags for tone, pauses, whisper, and laughter
  • PCM, MP3, Opus, FLAC, and WAV outputs

Speech to Text

Enterprise-grade transcription for phone calls, meetings, videos, and podcasts.

  • Entity recognition across medicine, law, and finance
  • Inverse text normalization for numbers, currencies, and more
  • Streaming and batch endpoints from one API
  • Speaker diarization for multi-speaker audio

Supported input formats: MP3, WAV, OGG, Opus, FLAC, AAC, MP4, M4A, MKV, MOV, WebM
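
Speaker diarization, listed above, labels each stretch of audio with who is speaking. The sketch below shows one way an application might turn such output into a readable transcript by merging consecutive segments from the same speaker. The `Segment` fields (`speaker`, `start`, `end`, `text`) are an assumed shape for illustration, not the documented response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    # Assumed shape of one diarized segment; field names are hypothetical.
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments from the same speaker into single turns."""
    turns: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the current turn instead of starting a new one.
            turns[-1] = Segment(
                last.speaker, last.start, max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


def format_transcript(segments: List[Segment]) -> str:
    """Render merged turns as one line per turn, prefixed with the start time."""
    return "\n".join(
        f"[{t.start:.1f}s] {t.speaker}: {t.text}" for t in merge_turns(segments)
    )
```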

Custom voices

Clone a voice from a short recording and use it instantly across Grok Text to Speech and Voice Agent APIs.

  • Clone from under a minute of natural speech
  • Two-stage verification: passphrase + speaker embedding match
  • Inherits every TTS capability — speech tags, multilingual, streaming

The full voice stack

Everything you need to build production voice experiences — from realtime agents to batch transcription.

Realtime voice agents

Full-duplex conversations with sub-second latency

Text-to-speech

Natural speech from text across 80+ voices

Speech-to-text

Accurate transcription with speaker diarization

Tool calling

Call APIs and take actions mid-conversation

Custom voices

Clone or create voices for your brand

25+ languages

Multilingual with natural intonation per locale

Sub-second latency

Fast enough for real conversations at scale

Speech tags

Control whisper, laughter, pauses, and tone

Speaker diarization

Identify who said what in multi-speaker audio

Streaming & batch

Realtime WebSocket or async batch processing

Multiple audio formats

PCM, MP3, Opus, FLAC, WAV, and more

Session control

Dynamic instructions, context, and tool updates

Enterprise ready

SOC 2, HIPAA eligible, and GDPR compliant

Text normalization

Proper formatting of numbers, dates, addresses

Interruption handling

Natural turn-taking with barge-in support

80+ voices across 25+ languages

Multilingual voices with natural intonation. Preview any voice instantly.

Pricing

Simple, transparent pricing

Straightforward usage-based pricing with no hidden fees, minimums, or forced upgrades.

Realtime
Real-time voice conversations over WebSocket
$0.05 / min · $3.00 / hr

Text to Speech
Convert text to natural speech
$15.00 / 1M characters

Speech to Text
Transcribe audio files and live streams
$0.10 / hr · $0.20 / hr (streaming)
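
To make the rates above concrete, here is a small estimator. It assumes billing scales linearly with usage, with no rounding of partial minutes or hours, and it reads the $0.10 / hr Speech to Text rate as applying to audio files (batch). Neither assumption is confirmed on this page. The function and parameter names are illustrative.

```python
# Rates taken from the pricing list above (USD).
REALTIME_PER_MIN = 0.05
TTS_PER_MILLION_CHARS = 15.00
STT_BATCH_PER_HOUR = 0.10      # assumed to be the audio-file (batch) rate
STT_STREAMING_PER_HOUR = 0.20


def estimate_cost(
    realtime_minutes: float = 0,
    tts_characters: int = 0,
    stt_batch_hours: float = 0,
    stt_streaming_hours: float = 0,
) -> float:
    """Estimate usage cost in USD, assuming simple linear billing."""
    total = (
        realtime_minutes * REALTIME_PER_MIN
        + tts_characters / 1_000_000 * TTS_PER_MILLION_CHARS
        + stt_batch_hours * STT_BATCH_PER_HOUR
        + stt_streaming_hours * STT_STREAMING_PER_HOUR
    )
    return round(total, 2)
```

For example, 10,000 realtime minutes ($500), 2,000,000 synthesized characters ($30), 100 hours of batch transcription ($10), and 50 hours of streaming transcription ($10) come to $550 under these assumptions.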

Need higher limits or rollout help?

Talk with xAI about onboarding, custom limits, and enterprise deployment.

Contact Sales

Enterprise

Trust, controls, and deployment support

Enterprise-ready controls, compliance, security, and scale.

SOC 2 Type II

Audited controls for security, availability, and confidentiality.

HIPAA eligible

BAA available for healthcare applications handling protected health information.

GDPR and DPA support

Data processing agreements and EU data residency options.

High availability

Multi-region infrastructure for enterprise workloads.

Custom rate limits

Concurrent session and request limits scaled to your traffic.

SSO and audit controls

SAML SSO, role-based access, and audit logging for your team.

Zero Data Retention

Enable zero data retention for your deployments.

Ready to build with voice?

Get an API key and start building in minutes, or talk to our team about enterprise deployment.