Voice AI
PolarGrid provides low-latency voice capabilities at the edge.Text-to-Speech (TTS)
Convert text to natural-sounding speech.Basic Usage
Voices
Thekokoro-82m model exposes eight voices across American and British English:
| Voice ID | Accent / gender |
|---|---|
af_bella | American English, female |
af_sarah | American English, female |
am_adam | American English, male |
am_michael | American English, male |
bf_emma | British English, female |
bf_isabella | British English, female |
bm_george | British English, male |
bm_lewis | British English, male |
Speed Control
Adjust playback speed from 0.25x to 4.0x:Audio Format
The TTS endpoint returns raw 24 kHz, 16-bit, mono PCM audio. Theresponse_format / responseFormat parameter is accepted for forward compatibility but is currently ignored — the response is always PCM regardless of what you pass.
To produce MP3, WAV, OGG, or other container formats, transcode the PCM bytes client-side. With ffmpeg:Encoder support for
mp3, opus, aac, flac, and wav in the backend is planned but not yet implemented.Speech-to-Text (STT)
Transcribe audio to text. Thefile parameter accepts File | Blob in JavaScript, and Path or any file-like object in Python. Buffers, path strings, and streams are not accepted directly — wrap them in a Blob or File first.
Basic Transcription
Verbose Output with Timestamps
Get word-level timestamps:Subtitle Formats
Generate subtitles directly:Voice Chat (Request/Response Loop)
Transcription in this loop requires a completed audio file — the user must finish speaking before the request is sent. For streaming realtime audio, see PersonaPlex (multi-modal, single model) or the Modular Pipeline Agent (STT → LLM → TTS with streaming events).
Supported Audio Formats
For transcription and translation:- MP3
- WAV
- M4A
- OGG
- FLAC
- WebM
