Voice AI
PolarGrid provides low-latency voice capabilities at the edge.Text-to-Speech (TTS)
Convert text to natural-sounding speech.Basic Usage
Voices
Thekokoro-82m model exposes eight voices across American and British English:
| Voice ID | Accent / gender |
|---|---|
af_bella | American English, female |
af_sarah | American English, female |
am_adam | American English, male |
am_michael | American English, male |
bf_emma | British English, female |
bf_isabella | British English, female |
bm_george | British English, male |
bm_lewis | British English, male |
Speed Control
Adjust playback speed from 0.25x to 4.0x:Audio Format
Audio is generated at 24 kHz, 16-bit, mono. The container is selected viaresponse_format:
response_format | Content-Type | Streaming? | Use when |
|---|---|---|---|
pcm (default) | audio/pcm | Yes | Real-time voice-agent pipelines — lowest first-byte latency. |
wav | audio/wav | No | You need a playable file with no client-side post-processing. |
mp3 | audio/mpeg | No | Bandwidth matters; bytes are encoded server-side via libmp3lame at 128 kbps. |
The default differs from OpenAI’s
/v1/audio/speech (which defaults to mp3). PolarGrid defaults to pcm to keep streaming TTS first-byte latency minimal — the PolarGrid SDKs default to mp3 for OpenAI-style behavior end-to-end, so pass responseFormat / response_format explicitly when calling via the SDK.opus, aac, and flac from the OpenAI spec are not yet supported in batch mode — requesting them returns 400. For streaming, opus is supported (see below); for batch, transcode PCM client-side if you need one of those:Streaming
For real-time playback or voice-agent pipelines, setstream: true and use response_format: 'pcm' (lowest latency) or 'opus' (compressed). See the TTS API reference for the full contract and the formats / models matrix.
wav or mp3 returns 400; transcode client-side from pcm if you need a different container. For tada-3b-ml, streaming does not honor the speed parameter — use speed=1.0 (or omit it).
Raw HTTP Contract
If you’re not using the SDK, here’s the full request/response shape:Batch Request
Streaming Request
Content-Type matches the format (audio/wav, audio/pcm).
Streaming response: Chunked transfer-encoding with raw audio bytes. Headers include X-Polargrid-Stream: 1 and X-Polargrid-Sample-Rate: 24000. PCM is 16-bit signed little-endian mono at 24 kHz.
Speech-to-Text (STT)
Transcribe audio to text. Thefile parameter accepts File | Blob in JavaScript, and Path or any file-like object in Python. Buffers, path strings, and streams are not accepted directly — wrap them in a Blob or File first.
Basic Transcription
Verbose Output with Timestamps
Get word-level timestamps:Subtitle Formats
Generate subtitles directly:Voice Chat (Request/Response Loop)
Transcription in this loop requires a completed audio file — the user must finish speaking before the request is sent. For streaming realtime audio, see PersonaPlex (multi-modal, single model) or the Modular Pipeline Agent (STT → LLM → TTS with streaming events).
Supported Audio Formats
For transcription and translation:- MP3
- WAV
- M4A
- OGG
- FLAC
- WebM
