Available Models
Speech-to-text and text-to-speech models with transparent pricing. Bring your own API keys.
Sync transcription via POST /api/v1/transcribe. Models with the Batch badge also support async jobs via POST /api/v1/jobs.
Universal-3 Pro
High-accuracy batch speech-to-text with word-level timestamps and prompting capabilities. Supports 6 languages with automatic language detection.
Scribe v2
High-accuracy batch speech-to-text with word-level timestamps. Supports 99 languages with automatic language detection.
Whisper V3
OpenAI Whisper Large V3 hosted on Fireworks AI with word-level timestamps. Supports 99+ languages.
Whisper V3 Turbo
OpenAI Whisper Large V3 Turbo hosted on Fireworks AI with word-level timestamps. Faster and cheaper than Whisper V3. Supports 99+ languages.
Ursa 2 Enhanced
Enterprise-grade batch transcription powered by Speechmatics' Ursa 2 in enhanced accuracy mode. Supports 70 languages.
Nova-3
Deepgram's latest and most accurate speech-to-text model with word-level timestamps. Supports 50+ languages with automatic language detection.
Gemini 2.5 Pro
Google's most capable multimodal model used for audio transcription. Highest accuracy with wide language support. No word-level timestamps.
Gemini 2.5 Flash
Google's fast, cost-efficient multimodal model used for audio transcription. Lower cost than Pro with good accuracy. No word-level timestamps.
Whisper Large V3
OpenAI Whisper Large V3 hosted on Together AI with word-level timestamps. Supports 99+ languages with automatic language detection.
Sync synthesis via POST /api/v1/synthesize. Models with the Batch badge also support async jobs via POST /api/v1/tts-jobs.
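The four routes mentioned on this page can be summarized in a small routing helper. This is only a sketch: the function name, the `has_batch_badge` flag, and the example base URL are illustrative assumptions, and only the four paths come from the listing itself. It builds a URL and does not send a request, since request bodies and authentication headers are not described here.

```python
from urllib.parse import urljoin

# Paths as listed on this page; everything else below is illustrative.
ENDPOINTS = {
    ("transcribe", "sync"): "/api/v1/transcribe",
    ("transcribe", "async"): "/api/v1/jobs",
    ("synthesize", "sync"): "/api/v1/synthesize",
    ("synthesize", "async"): "/api/v1/tts-jobs",
}


def endpoint_url(base_url: str, task: str, run_async: bool = False,
                 has_batch_badge: bool = False) -> str:
    """Return the full URL for a task, refusing async mode for non-Batch models."""
    if run_async and not has_batch_badge:
        raise ValueError("async jobs are only listed for models with the Batch badge")
    mode = "async" if run_async else "sync"
    return urljoin(base_url, ENDPOINTS[(task, mode)])


# Hypothetical host, used only for demonstration.
print(endpoint_url("https://api.example.com", "transcribe"))
```

Keeping the routes in one table means a caller only decides the task and the mode, and an unsupported combination fails before any network call is made.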
Sonic 3
Cartesia's flagship TTS model with ultra-low latency and high naturalness. Supports 42 languages. Voices are specified by UUID; use the Cartesia playground to find voice IDs.
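Because Sonic 3 voices are identified by UUID rather than by name, a quick local check can catch a voice name pasted in by mistake before a request is sent. The helper below is a hypothetical sketch using Python's standard `uuid` module; it only checks the format and cannot tell whether the voice actually exists.

```python
import uuid


def is_voice_uuid(value: str) -> bool:
    """Return True if value is a canonically formatted UUID string."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces and un-hyphenated hex; require canonical form.
    return str(parsed) == value.lower()


print(is_voice_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_voice_uuid("alloy"))
```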
Gemini 2.5 Flash TTS
Google's fast, cost-efficient TTS model via Gemini API. 30 HD voices across 100+ languages. Auto-detects language. Always returns 24 kHz mono PCM audio.
Gemini 2.5 Pro TTS
Google's highest-quality TTS model via Gemini API. Studio-quality audio with natural prosody. 30 HD voices across 100+ languages. Always returns 24 kHz mono PCM audio.
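Raw PCM has no header, so most players will not open the Gemini TTS output directly. A minimal sketch of wrapping it in a WAV container with Python's standard `wave` module follows. The 24 kHz rate and mono channel count come from the listing; the 16-bit sample width is an assumption that the listing does not state, so verify it against the actual response before relying on it.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    sample_width=2 (16-bit samples) is an assumption, not stated in the listing.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence stands in for real synthesis output.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
print(len(wav_bytes))
```

The resulting bytes can be written to a `.wav` file or passed to any library that reads WAV data.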
GPT-4o Mini TTS
OpenAI's fast, affordable TTS model. 13 voices, 57+ languages. Supports voice style instructions. Max ~8,000 characters per request.
Eleven Multilingual v2
ElevenLabs' most lifelike TTS model with rich emotional expression. 10,000 character limit per request (~10 min audio). Supports 29 languages with consistent voice quality.
Eleven v3
Latest ElevenLabs model with 70+ languages, audio tags ([laughs], [whispers], etc.), and rich emotional expressiveness. 5,000 character limit per request (~5 min audio).
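Several TTS models above cap the number of characters per request, so longer texts need to be split before synthesis. The sketch below shows one possible approach: split on sentence boundaries and fall back to a hard cut for any single sentence that exceeds the limit. The dictionary keys are hypothetical identifiers, not confirmed model IDs, and the 8,000 figure for GPT-4o Mini TTS is listed as approximate, so a safety margin may be wise.

```python
import re

# Limits as listed on this page; the keys are hypothetical identifiers.
CHAR_LIMITS = {
    "gpt-4o-mini-tts": 8000,  # listed as "~8,000", so treat as approximate
    "eleven-multilingual-v2": 10000,
    "eleven-v3": 5000,
}


def split_for_tts(text: str, limit: int) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_for_tts("First sentence. Second one! A third?", CHAR_LIMITS["eleven-v3"])
print(len(chunks))
```

Each chunk can then be sent as its own request and the audio concatenated afterwards. Note that a hard cut could land inside an Eleven v3 audio tag such as [laughs], so tagged text may need extra care.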
Voxtral Mini TTS
Mistral's TTS model with zero-shot voice cloning and ~100ms latency. 9 languages. Voice determines language. 10 preset voices available; custom voices via Mistral Voices API.