Available Models

Speech-to-text and text-to-speech models with transparent pricing. Models with the BYOK badge use your own provider API keys; Trelis-hosted models do not.

Sync transcription via POST /api/v1/transcribe. Models with the Batch badge also support async jobs via POST /api/v1/jobs.
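As a minimal sketch of the sync/async split described above, the helper below picks the path to POST to. The two paths come from the text; the base URL and the function name are placeholders invented for illustration, and the request body is left out because its schema is not given here.

```python
from urllib.parse import urljoin

# Paths are quoted from the text above. BASE_URL is a placeholder assumption.
BASE_URL = "https://api.example.invalid"
SYNC_PATH = "/api/v1/transcribe"
ASYNC_PATH = "/api/v1/jobs"


def transcription_url(has_batch_badge: bool, want_async: bool = False) -> str:
    """Return the URL to POST a transcription request to (hypothetical helper)."""
    if want_async and not has_batch_badge:
        # Only models carrying the Batch badge accept async jobs.
        raise ValueError("model has no Batch badge; use the sync endpoint")
    return urljoin(BASE_URL, ASYNC_PATH if want_async else SYNC_PATH)
```

A model without the Batch badge, such as Saaras v3 in the listing below, would always resolve to the sync path.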

ASR Trelis Batch Trelis
$1.20/hr
Chorus Pro

Premium multilingual multi-speaker separation model built for 4+ concurrent speakers from a single audio input. Emits one transcript per speaker stream with timestamps. Supports Whisper-style automatic language detection, language hints, and long audio with consistent speaker streams across chunks. Trelis-hosted (no BYOK).

Native 4+ speaker separation | Automatic language detection | 99+ languages | Word-/segment-level timestamps | VAD-based chunking for long audio | Speaker consistency across chunks
ASR Trelis Trelis
$0.45/hr
OmniASR 1B

Trelis-hosted OmniASR LLM Unlimited 1B for long-tail multilingual speech recognition across 1,600+ languages. You can use automatic language detection or provide a language hint for better accuracy. No word-level timestamps.

1,600+ languages | Automatic language detection | Optional language hints | Optimized managed serving
ASR BYOK Batch AssemblyAI
$0.21/hr
Universal-3 Pro

High-accuracy batch speech-to-text with word-level timestamps and prompting capabilities. Supports 6 languages with automatic language detection.

Word-level timestamps | Auto language detection | 6 languages
ASR BYOK Batch ElevenLabs
$0.40/hr
Scribe v2

High-accuracy batch speech-to-text with word-level timestamps. Supports 99 languages with automatic language detection.

Word-level timestamps | Auto language detection | 99 languages
ASR BYOK Batch Speechmatics
$0.75/hr
Ursa 2 Enhanced

Enterprise-grade batch transcription powered by Ursa 2 in enhanced accuracy mode. Best-in-class accuracy across 70 languages.

Word-level timestamps | 70 languages
ASR BYOK Batch Deepgram
$0.55/hr
Nova-3

Deepgram's latest and most accurate speech-to-text model with word-level timestamps. Supports 50+ languages with automatic language detection.

Word-level timestamps | Auto language detection | 50+ languages
ASR BYOK Batch Google
$0.24/hr
Gemini 2.5 Pro

Google's most capable multimodal model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.

Auto language detection | 50+ languages
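The token-based estimate used for the Gemini models can be worked through in a few lines. This is a sketch of the arithmetic only: the 32 tokens/second rate and the 20 MB cap come from the listing, while the function names and the reading of "20 MB" as decimal megabytes are assumptions.

```python
TOKENS_PER_SECOND = 32                 # estimation rate stated in the listing
SIZE_LIMIT_BYTES = 20 * 1000 * 1000    # assumption: "20 MB" read as decimal megabytes


def estimate_duration_seconds(audio_tokens: int) -> float:
    """Estimate audio length from the audio token count."""
    return audio_tokens / TOKENS_PER_SECOND


def estimate_cost_usd(audio_tokens: int, price_per_hour: float) -> float:
    """Estimate cost from the token-derived duration and an hourly price."""
    hours = estimate_duration_seconds(audio_tokens) / 3600
    return hours * price_per_hour


def fits_size_limit(size_bytes: int) -> bool:
    """Check a file against the 20 MB cap before choosing a Gemini model."""
    return size_bytes <= SIZE_LIMIT_BYTES
```

For instance, 115,200 audio tokens correspond to 3,600 seconds, so one hour of audio, which at the listed $0.24/hr comes to an estimated $0.24.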
ASR BYOK Batch Google
$0.15/hr
Gemini 2.5 Flash

Google's fast, cost-efficient multimodal model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.

Auto language detection | 50+ languages
ASR BYOK Batch Google
$0.37/hr
Gemini 3.1 Pro (Preview)

Google's most capable Gemini 3 series model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB inline file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files. Preview model: replaces the discontinued gemini-3-pro-preview.

Auto language detection | 50+ languages
ASR BYOK Batch Google
$0.15/hr
Gemini 3 Flash (Preview)

Google's fast, cost-efficient Gemini 3 series model used for audio transcription via prompted generation. Supports 50+ languages with automatic language detection. No word-level timestamps. Audio duration and cost are estimated from token count (32 tokens/second), not reported by the API. Hard 20 MB inline file size limit — use AssemblyAI, ElevenLabs, or Deepgram for larger files.

Auto language detection | 50+ languages
ASR BYOK Sarvam
$0.36/hr
Saaras v3

Sarvam AI's flagship multilingual ASR, optimized for 22 Indian languages plus Indian English. Defaults to the Saaras v3 `transcribe` mode; the `mode` provider option also accepts `translate`, `verbatim`, `translit`, and `codemix`. Sync REST API only: each request is limited to 30 seconds of audio. Use AssemblyAI, ElevenLabs, or Speechmatics for longer files.

Auto language detection | 22 Indian languages + English | Saaras modes: transcribe/translate/verbatim/translit/codemix
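The Saaras v3 constraints lend themselves to a client-side check before sending audio. The sketch below validates the `mode` value and the 30-second limit from the listing; the function name and the shape of the returned options dictionary are hypothetical, since the exact way provider options are passed is not specified here.

```python
SAARAS_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")
MAX_SYNC_SECONDS = 30.0  # single-request limit stated in the listing


def validate_saaras_request(duration_seconds: float, mode: str = "transcribe") -> dict:
    """Validate a Saaras v3 request and return its provider options."""
    if mode not in SAARAS_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {SAARAS_MODES}")
    if duration_seconds > MAX_SYNC_SECONDS:
        raise ValueError(
            f"audio is {duration_seconds:.1f} s; Saaras v3 accepts at most "
            f"{MAX_SYNC_SECONDS:.0f} s per request"
        )
    # Hypothetical layout: the listing only says `mode` is a provider option.
    return {"mode": mode}
```

Failing early on overlong audio makes it easier to route the file to one of the providers suggested for longer recordings.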
ASR BYOK Batch Together
$0.09/hr
Whisper Large V3

OpenAI Whisper Large V3 hosted on Together AI with word-level timestamps. Supports 99+ languages with automatic language detection.

Word-level timestamps | Auto language detection | 99+ languages
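Choosing among the ASR models above usually comes down to filtering on features and then comparing price. The following sketch copies the prices and badges from the listing into a small table; the class, field, and function names are illustrative, and Saaras v3 is recorded without word-level timestamps only because the listing does not advertise them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrModel:
    name: str
    usd_per_hour: float
    batch: bool
    word_timestamps: bool


# Values copied from the listing above.
CATALOG = [
    AsrModel("Chorus Pro", 1.20, True, True),
    AsrModel("OmniASR 1B", 0.45, False, False),
    AsrModel("Universal-3 Pro", 0.21, True, True),
    AsrModel("Scribe v2", 0.40, True, True),
    AsrModel("Ursa 2 Enhanced", 0.75, True, True),
    AsrModel("Nova-3", 0.55, True, True),
    AsrModel("Gemini 2.5 Pro", 0.24, True, False),
    AsrModel("Gemini 2.5 Flash", 0.15, True, False),
    AsrModel("Gemini 3.1 Pro (Preview)", 0.37, True, False),
    AsrModel("Gemini 3 Flash (Preview)", 0.15, True, False),
    AsrModel("Saaras v3", 0.36, False, False),  # timestamps not advertised
    AsrModel("Whisper Large V3", 0.09, True, True),
]


def cheapest(models, *, need_batch=False, need_word_timestamps=False):
    """Return the lowest-priced model that satisfies the requested features."""
    candidates = [
        m for m in models
        if (m.batch or not need_batch)
        and (m.word_timestamps or not need_word_timestamps)
    ]
    if not candidates:
        raise LookupError("no model matches the requested features")
    return min(candidates, key=lambda m: m.usd_per_hour)


def cost_usd(model: AsrModel, hours: float) -> float:
    """Price of transcribing `hours` of audio at the listed hourly rate."""
    return model.usd_per_hour * hours
```

Price is only one axis: language coverage, file-size limits, and speaker separation differ across these models and would need their own fields in a fuller version.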

Sync synthesis via POST /api/v1/synthesize. Models with the Batch badge also support async jobs via POST /api/v1/tts-jobs.

TTS BYOK Batch Cartesia
$30/M chars
Sonic 3

Cartesia's flagship TTS model with ultra-low latency and high naturalness. 42 languages. Voices specified by UUID — use the Cartesia playground to find voice IDs.

42 languages | Ultra-low latency | Voice cloning | Emotion control
TTS BYOK Batch Google
$0.125/M chars
Gemini 2.5 Flash TTS

Google's fast, cost-efficient TTS model via the Gemini API. 30 HD voices across 100+ languages. Auto-detects language. Always returns 24 kHz mono PCM audio.

100+ languages | Auto language detection | 30 HD voices | Steerable via instructions
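Raw PCM has no header, so most players need it wrapped in a container first. The sketch below uses Python's standard `wave` module to turn 24 kHz mono PCM bytes into a WAV file in memory. The sample rate and channel count come from the listing; the 16-bit sample width is an assumption that should be checked against the actual response.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # stated in the listing
CHANNELS = 1              # mono, stated in the listing
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit samples


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

The same wrapper would apply to any model in this list that returns PCM in this format.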
TTS BYOK Batch Google
$0.25/M chars
Gemini 2.5 Pro TTS

Google's highest-quality TTS model via the Gemini API. Studio-quality audio with natural prosody. 30 HD voices across 100+ languages. Always returns 24 kHz mono PCM audio.

100+ languages | Auto language detection | 30 HD voices | Studio-quality prosody | Steerable via instructions
TTS BYOK Batch OpenAI
$0.6/M chars
GPT-4o Mini TTS

OpenAI's fast, affordable TTS model. 13 voices, 57+ languages. Supports voice style instructions. Max ~8,000 characters per request.

57+ languages | 13 voices | Style instructions | Multiple output formats
TTS BYOK Batch ElevenLabs
$300/M chars
Eleven Multilingual v2

ElevenLabs' most lifelike TTS model, with rich emotional expression. 10,000 character limit per request (~10 min audio). Supports 29 languages with consistent voice quality.

29 languages | Multiple voices | Voice settings control | High emotional range
TTS BYOK Batch ElevenLabs
$300/M chars
Eleven v3

Latest ElevenLabs model with 70+ languages, audio tags ([laughs], [whispers], etc.), and rich emotional expressiveness. 5,000 character limit per request (~5 min audio).

70+ languages | Multiple voices | Audio tags | High emotional range | Voice settings control
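Several of the TTS models above cap the characters per request, so longer scripts have to be split before synthesis. One possible approach, sketched below, packs whole sentences greedily into chunks under the limit and hard-splits any single sentence that is too long. The limits are the ones quoted in the listing; the function name and the sentence-splitting rule are illustrative choices, not part of any provider's API.

```python
import re

# Per-request character limits quoted in the listing above.
CHAR_LIMITS = {
    "GPT-4o Mini TTS": 8_000,          # listed as approximate ("~8,000")
    "Eleven Multilingual v2": 10_000,
    "Eleven v3": 5_000,
}


def split_text(text: str, limit: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Split after ., ! or ? followed by whitespace; runs of whitespace
    # between sentences are normalized to a single space when rejoined.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the limit is cut into limit-sized pieces.
        pieces = [sentence[i:i + limit] for i in range(0, len(sentence), limit)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `split_text(script, CHAR_LIMITS["Eleven v3"])` yields chunks that each fit one request. Because the GPT-4o Mini TTS limit is listed as approximate, leaving some headroom below 8,000 may be prudent.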
TTS BYOK Batch Mistral
$16/M chars
Voxtral Mini TTS

Mistral's TTS model with zero-shot voice cloning and ~100ms latency. 9 languages. Voice determines language. 10 preset voices available; custom voices via Mistral Voices API.

9 languages | Zero-shot voice cloning | Low latency (~100ms) | 10 preset voices
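Since TTS pricing is quoted per million characters, the cost of a script follows directly from its length. The sketch below copies the listed prices into a table and ranks the models for a given character count; the names are illustrative and the figures are only as current as the listing above.

```python
# Prices in USD per million characters, copied from the listing above.
TTS_USD_PER_MILLION_CHARS = {
    "Sonic 3": 30.0,
    "Gemini 2.5 Flash TTS": 0.125,
    "Gemini 2.5 Pro TTS": 0.25,
    "GPT-4o Mini TTS": 0.6,
    "Eleven Multilingual v2": 300.0,
    "Eleven v3": 300.0,
    "Voxtral Mini TTS": 16.0,
}


def synthesis_cost_usd(model: str, characters: int) -> float:
    """Cost of synthesizing `characters` characters with the given model."""
    return TTS_USD_PER_MILLION_CHARS[model] * characters / 1_000_000


def rank_by_cost(characters: int) -> list[tuple[str, float]]:
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = [(name, synthesis_cost_usd(name, characters))
             for name in TTS_USD_PER_MILLION_CHARS]
    return sorted(costs, key=lambda pair: pair[1])
```

A 50,000-character script works out to about $15 with Eleven v3 and about $0.80 with Voxtral Mini TTS at these rates.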