Available Models

Speech-to-text and text-to-speech models with transparent pricing. Bring your own API keys.

Synchronous transcription is available via POST /api/v1/transcribe. Models with the Batch badge also support asynchronous jobs via POST /api/v1/jobs.
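As a rough illustration of how a client might choose between the two transcription endpoints, here is a minimal sketch. Only the two paths come from this page; the base URL, the function name, and the boolean flags are hypothetical, and the request body and authentication scheme are not shown because this page does not describe them.

```python
from urllib.parse import urljoin

# Hypothetical base URL; this page lists only the paths.
BASE_URL = "https://api.example.com"

# Paths as listed on this page.
SYNC_TRANSCRIBE_PATH = "/api/v1/transcribe"
BATCH_JOBS_PATH = "/api/v1/jobs"


def transcription_url(supports_batch: bool, want_async: bool) -> str:
    """Return the URL to POST to for a transcription request.

    supports_batch: whether the chosen model carries the Batch badge.
    want_async: whether the caller wants an async job instead of a sync call.
    """
    if want_async and not supports_batch:
        raise ValueError("model has no Batch badge; use the sync endpoint")
    path = BATCH_JOBS_PATH if want_async else SYNC_TRANSCRIBE_PATH
    return urljoin(BASE_URL, path)
```

Every ASR model currently listed carries the Batch badge, so the guard only matters if a sync-only model is added later.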

ASR · BYOK · Batch · AssemblyAI
$0.21/hr
Universal-3 Pro

High-accuracy batch speech-to-text with word-level timestamps and prompting capabilities. Supports 6 languages with automatic language detection.

Word-level timestamps · Auto language detection · 6 languages
ASR · BYOK · Batch · ElevenLabs
$0.40/hr
Scribe v2

High-accuracy batch speech-to-text with word-level timestamps. Supports 99 languages with automatic language detection.

Word-level timestamps · Auto language detection · 99 languages
ASR · BYOK · Batch · Fireworks
$0.09/hr
Whisper V3

OpenAI Whisper Large V3 hosted on Fireworks AI with word-level timestamps. Supports 99+ languages with automatic language detection.

Word-level timestamps · Auto language detection · 99+ languages
ASR · BYOK · Batch · Fireworks
$0.05/hr
Whisper V3 Turbo

OpenAI Whisper V3 Turbo hosted on Fireworks AI. Faster and cheaper than Whisper V3, with word-level timestamps. Supports 99+ languages with automatic language detection.

Word-level timestamps · Auto language detection · 99+ languages
ASR · BYOK · Batch · Speechmatics
$0.75/hr
Ursa 2 Enhanced

Enterprise-grade batch transcription powered by Ursa 2 in enhanced accuracy mode. Best-in-class accuracy across 70 languages.

Word-level timestamps · 70 languages
ASR · BYOK · Batch · Deepgram
$0.55/hr
Nova-3

Deepgram's latest and most accurate speech-to-text model with word-level timestamps. Supports 50+ languages with automatic language detection.

Word-level timestamps · Auto language detection · 50+ languages
ASR · BYOK · Batch · Google
$0.15/hr
Gemini 2.5 Pro

Google's most capable multimodal model, used here for audio transcription. Highest accuracy with support for 50+ languages. No word-level timestamps.

Auto language detection · 50+ languages
ASR · BYOK · Batch · Google
$0.12/hr
Gemini 2.5 Flash

Google's fast, cost-efficient multimodal model, used here for audio transcription. Lower cost than Gemini 2.5 Pro with good accuracy. No word-level timestamps.

Auto language detection · 50+ languages
ASR · BYOK · Batch · Together
$0.09/hr
Whisper Large V3

OpenAI Whisper Large V3 hosted on Together AI with word-level timestamps. Supports 99+ languages with automatic language detection.

Word-level timestamps · Auto language detection · 99+ languages
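To compare the per-hour ASR prices above for a given recording, a small estimator may help. This is a sketch that assumes billing scales linearly with audio duration, with no minimum charge or rounding increment; the page does not state how usage is metered. The dictionary keys and function name are illustrative labels, not API model identifiers.

```python
from decimal import Decimal

# Listed prices in USD per hour of audio, copied from the cards above.
# Keys are display labels, not API model IDs (those are not given here).
ASR_PRICE_PER_HOUR = {
    "Universal-3 Pro (AssemblyAI)": Decimal("0.21"),
    "Scribe v2 (ElevenLabs)": Decimal("0.40"),
    "Whisper V3 (Fireworks)": Decimal("0.09"),
    "Whisper V3 Turbo (Fireworks)": Decimal("0.05"),
    "Ursa 2 Enhanced (Speechmatics)": Decimal("0.75"),
    "Nova-3 (Deepgram)": Decimal("0.55"),
    "Gemini 2.5 Pro (Google)": Decimal("0.15"),
    "Gemini 2.5 Flash (Google)": Decimal("0.12"),
    "Whisper Large V3 (Together)": Decimal("0.09"),
}


def estimate_asr_cost(model: str, audio_seconds: int) -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumes strictly linear billing, which is an assumption of this sketch.
    """
    hours = Decimal(audio_seconds) / Decimal(3600)
    # Round to a hundredth of a cent for display purposes.
    return (ASR_PRICE_PER_HOUR[model] * hours).quantize(Decimal("0.0001"))
```

For example, `estimate_asr_cost("Nova-3 (Deepgram)", 7200)` evaluates two hours at $0.55/hr.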

Synchronous synthesis is available via POST /api/v1/synthesize. Models with the Batch badge also support asynchronous jobs via POST /api/v1/tts-jobs.

TTS · BYOK · Batch · Cartesia
$30/M chars
Sonic 3

Cartesia's flagship TTS model with ultra-low latency and high naturalness. Supports 42 languages. Voices are specified by UUID; use the Cartesia playground to find voice IDs.

42 languages · Ultra-low latency · Voice cloning · Emotion control
TTS · BYOK · Batch · Google
$0.125/M chars
Gemini 2.5 Flash TTS

Google's fast, cost-efficient TTS model via the Gemini API. 30 HD voices across 100+ languages. Detects the language automatically. Always returns 24 kHz mono PCM audio.

100+ languages · Auto language detection · 30 HD voices · Steerable via instructions
TTS · BYOK · Batch · Google
$0.25/M chars
Gemini 2.5 Pro TTS

Google's highest-quality TTS model via the Gemini API. Studio-quality audio with natural prosody. 30 HD voices across 100+ languages. Always returns 24 kHz mono PCM audio.

100+ languages · Auto language detection · 30 HD voices · Studio-quality prosody · Steerable via instructions
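Because the two Gemini TTS models return raw PCM rather than a container format, most players will need a header added before they can open the audio. The sketch below wraps PCM bytes in a WAV container using Python's standard `wave` module. The 24 kHz mono defaults come from the cards above; the 16-bit sample width is an assumption, since this page does not state the bit depth.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 24000,  # 24 kHz, as stated on this page
    channels: int = 1,         # mono, as stated on this page
    sample_width: int = 2,     # bytes per sample; 16-bit is an assumption
) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be written straight to a `.wav` file. If the audio plays back as noise or at the wrong speed, the assumed sample width or rate is the first thing to check.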
TTS · BYOK · Batch · OpenAI
$0.60/M chars
GPT-4o Mini TTS

OpenAI's fast, affordable TTS model. 13 voices, 57+ languages. Supports voice style instructions. Maximum of roughly 8,000 characters per request.

57+ languages · 13 voices · Style instructions · Multiple output formats
TTS · BYOK · Batch · ElevenLabs
$300/M chars
Eleven Multilingual v2

Most lifelike TTS model with rich emotional expression. 10,000 character limit per request (~10 min audio). Supports 29 languages with consistent voice quality.

29 languages · Multiple voices · Voice settings control · High emotional range
TTS · BYOK · Batch · ElevenLabs
$300/M chars
Eleven v3

Latest ElevenLabs model with 70+ languages, audio tags ([laughs], [whispers], etc.), and rich emotional expressiveness. 5,000 character limit per request (~5 min audio).

70+ languages · Multiple voices · Audio tags · High emotional range · Voice settings control
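Several of the TTS models above cap the number of characters per request, so longer texts have to be split before synthesis. The following is one possible approach: a greedy splitter that packs whole words into chunks under a given limit. It assumes characters are counted the way Python's `len` counts them, which may differ from how a provider counts. It also collapses runs of whitespace, so it is not suitable where exact spacing or line breaks matter. The names are illustrative.

```python
# Per-request character limits as listed on this page.
CHAR_LIMITS = {
    "GPT-4o Mini TTS": 8000,          # listed as "~8,000", so approximate
    "Eleven Multilingual v2": 10000,
    "Eleven v3": 5000,
}


def split_for_limit(text: str, limit: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of <= limit chars."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries instead of word boundaries would likely give more natural prosody at chunk joins; the word-level version is shown only because it is the simplest one that respects the limit.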
TTS · BYOK · Batch · Mistral
$16/M chars
Voxtral Mini TTS

Mistral's TTS model with zero-shot voice cloning and ~100 ms latency. Supports 9 languages. The selected voice determines the language. 10 preset voices are available; custom voices can be created via the Mistral Voices API.

9 languages · Zero-shot voice cloning · Low latency (~100 ms) · 10 preset voices
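The TTS prices above span more than three orders of magnitude, so it can be useful to rank the models by cost for a given character count. This sketch assumes the listed per-million-character rate applies linearly to the input length with no minimums; the page does not confirm how characters are metered. The function name and labels are illustrative.

```python
from decimal import Decimal

# Listed prices in USD per million characters, copied from the cards above.
TTS_PRICE_PER_MILLION_CHARS = {
    "Sonic 3 (Cartesia)": Decimal("30"),
    "Gemini 2.5 Flash TTS (Google)": Decimal("0.125"),
    "Gemini 2.5 Pro TTS (Google)": Decimal("0.25"),
    "GPT-4o Mini TTS (OpenAI)": Decimal("0.6"),
    "Eleven Multilingual v2 (ElevenLabs)": Decimal("300"),
    "Eleven v3 (ElevenLabs)": Decimal("300"),
    "Voxtral Mini TTS (Mistral)": Decimal("16"),
}


def tts_costs(char_count: int) -> list[tuple[str, Decimal]]:
    """Return (model, estimated USD cost) pairs, cheapest first.

    Assumes strictly linear billing, which is an assumption of this sketch.
    """
    costs = [
        (model, price * char_count / Decimal(1_000_000))
        for model, price in TTS_PRICE_PER_MILLION_CHARS.items()
    ]
    return sorted(costs, key=lambda pair: pair[1])
```

Price is only one axis: the cards also differ in language coverage, per-request limits, and output format, so the cheapest entry is not necessarily the right choice.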