Available Models
Speech-to-text and text-to-speech models with transparent pricing. Bring your own API keys.
Speech-to-text models: sync transcription via POST /api/v1/transcribe. Models with the Batch badge also support async jobs via POST /api/v1/jobs.
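As a rough illustration of how the two endpoints relate, the sketch below builds (but does not send) a POST request for either path. Only the paths /api/v1/transcribe and /api/v1/jobs come from this page. The host, the JSON field names (`model`, `audio_url`), the model identifier, and the bearer-token header are assumptions made for illustration and may not match the real request schema.

```python
import json
import urllib.request

# Hypothetical host; substitute the real API base URL.
BASE_URL = "https://example.invalid"


def build_transcription_request(model, audio_url, api_key, batch=False):
    """Build a POST request for sync transcription or an async batch job.

    The endpoint paths come from the page above. The payload fields and
    the Authorization header format are assumptions, not a documented schema.
    """
    path = "/api/v1/jobs" if batch else "/api/v1/transcribe"
    body = json.dumps({"model": model, "audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Sync request for a short file; pass batch=True for a model with the Batch badge.
request = build_transcription_request("nova-3", "https://example.invalid/a.wav", "KEY")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. That step is left out here because the response format is not described on this page.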
Chorus Pro
Premium multilingual multi-speaker separation model built for 4+ concurrent speakers from a single audio input. Emits one transcript per speaker stream with timestamps. Supports Whisper-style automatic language detection, language hints, and long audio with consistent speaker streams across chunks. Trelis-hosted (no BYOK).
OmniASR 1B
Trelis-hosted OmniASR LLM Unlimited 1B for long-tail multilingual speech recognition across 1,600+ languages. You can use automatic language detection or provide a language hint for better accuracy. No word-level timestamps.
Universal-3 Pro
AssemblyAI's high-accuracy batch speech-to-text model with word-level timestamps and prompting capabilities. Supports 6 languages with automatic language detection.
Scribe v2
ElevenLabs' high-accuracy batch speech-to-text model with word-level timestamps. Supports 99 languages with automatic language detection.
Ursa 2 Enhanced
Speechmatics' enterprise-grade batch transcription powered by Ursa 2 in enhanced accuracy mode. Best-in-class accuracy across 70 languages.
Nova-3
Deepgram's latest and most accurate speech-to-text model with word-level timestamps. Supports 50+ languages with automatic language detection.
Gemini 2.5 Pro
Google's most capable multimodal model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.
Gemini 2.5 Flash
Google's fast, cost-efficient multimodal model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.
Gemini 3.1 Pro (Preview)
Google's most capable Gemini 3 series model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB inline file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files. Preview model: replaces the discontinued gemini-3-pro-preview.
Gemini 3 Flash (Preview)
Google's fast, cost-efficient Gemini 3 series model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB inline file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.
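Because the Gemini models do not report audio duration, the figure has to be derived from the token count at the stated rate of 32 tokens per second. The sketch below shows that arithmetic together with a pre-flight check against the 20 MB limit. The per-minute price is a hypothetical input, and reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption; the limit may be decimal megabytes.

```python
TOKENS_PER_SECOND = 32  # rate stated in the Gemini model descriptions

# Assumption: "20 MB" means 20 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 20 * 1024 * 1024


def estimate_duration_seconds(audio_tokens: int) -> float:
    """Estimate audio duration from the number of audio tokens."""
    return audio_tokens / TOKENS_PER_SECOND


def estimate_cost(audio_tokens: int, price_per_minute: float) -> float:
    """Estimate cost from tokens, given a hypothetical per-minute price."""
    return estimate_duration_seconds(audio_tokens) / 60 * price_per_minute


def fits_size_limit(size_bytes: int) -> bool:
    """Return True if a file is small enough to send to a Gemini model."""
    return size_bytes <= MAX_FILE_BYTES


# 1,920 audio tokens correspond to 60 seconds at 32 tokens/second.
print(estimate_duration_seconds(1920))
```

Running `fits_size_limit` before choosing a model lets a caller route oversized files to one of the other providers instead of failing at upload time.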
Saaras v3
Sarvam AI's flagship multilingual ASR optimized for 22 Indian languages plus Indian English. Uses Saaras v3 REST mode by default (`transcribe`) and supports provider option `mode` for `transcribe`, `translate`, `verbatim`, `translit`, and `codemix`. Sync REST API only — single-request limit is 30 seconds of audio. Use AssemblyAI, ElevenLabs, or Speechmatics for longer files.
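The 30-second limit can be checked locally before sending audio to Saaras v3. The sketch below reads the duration of a WAV file with Python's standard `wave` module. It handles only uncompressed WAV input, and the helper names are illustrative rather than part of any API.

```python
import wave

SAARAS_MAX_SECONDS = 30.0  # single-request limit stated above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def fits_saaras_sync(path: str) -> bool:
    """Return True if the file is short enough for one Saaras v3 request."""
    return wav_duration_seconds(path) <= SAARAS_MAX_SECONDS
```

Files that fail the check would need to be split into shorter segments or sent to one of the longer-form providers named above.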
Whisper Large V3
OpenAI Whisper Large V3 hosted on Together AI with word-level timestamps. Supports 99+ languages with automatic language detection.
Text-to-speech models: sync synthesis via POST /api/v1/synthesize. Models with the Batch badge also support async jobs via POST /api/v1/tts-jobs.
Sonic 3
Cartesia's flagship TTS model with ultra-low latency and high naturalness. 42 languages. Voices specified by UUID — use the Cartesia playground to find voice IDs.
Gemini 2.5 Flash TTS
Google's fast, cost-efficient TTS model via Gemini API. 30 HD voices across 100+ languages. Auto-detects language. Always returns PCM 24kHz mono audio.
Gemini 2.5 Pro TTS
Google's highest-quality TTS model via Gemini API. Studio-quality audio with natural prosody. 30 HD voices across 100+ languages. Always returns PCM 24kHz mono audio.
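Raw PCM has no header, so most players will not open the Gemini TTS output directly. A minimal sketch of wrapping it in a WAV container with the standard `wave` module follows. The 24 kHz rate and mono layout come from the descriptions above; the 16-bit sample width is an assumption and should be confirmed against the actual output.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    sample_width=2 (16-bit samples) is an assumption; the page only
    states PCM, 24 kHz, mono.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


# One second of 16-bit silence at 24 kHz mono.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 24000)
```

Writing `wav_bytes` to a file with a `.wav` extension yields audio that standard players can open.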
GPT-4o Mini TTS
OpenAI's fast, affordable TTS model. 13 voices, 57+ languages. Supports voice style instructions. Max ~8,000 characters per request.
Eleven Multilingual v2
ElevenLabs' most lifelike TTS model with rich emotional expression. 10,000 character limit per request (~10 min audio). Supports 29 languages with consistent voice quality.
Eleven v3
Latest ElevenLabs model with 70+ languages, audio tags ([laughs], [whispers], etc.), and rich emotional expressiveness. 5,000 character limit per request (~5 min audio).
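Text longer than a model's per-request character limit has to be split before synthesis. The sketch below is one simple way to do that, breaking at spaces and falling back to a hard cut when a single run of characters exceeds the limit. The limits are taken from the descriptions above; the dictionary keys are hypothetical identifiers, not confirmed model IDs.

```python
# Limits from the descriptions above; the keys are hypothetical identifiers.
CHAR_LIMITS = {
    "eleven-multilingual-v2": 10_000,
    "eleven-v3": 5_000,
}


def split_text(text: str, limit: int) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring spaces."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Last space at or before the limit, so the chunk never exceeds it.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: hard split
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


chunks = split_text("word " * 3000, CHAR_LIMITS["eleven-v3"])
print(len(chunks), max(len(c) for c in chunks))
```

Splitting at sentence boundaries would usually give more natural prosody across chunk joins. Spaces are used here only to keep the sketch short.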
Voxtral Mini TTS
Mistral's TTS model with zero-shot voice cloning and ~100ms latency. 9 languages. Voice determines language. 10 preset voices available; custom voices via Mistral Voices API.