Text-to-Speech
Generate audio from text. PolarGrid serves kokoro-82m (preset voice catalog) and tada-3b-ml (voice cloning). The endpoint returns audio in the container format requested via response_format: pcm, wav, or mp3 in batch mode, and pcm or opus when streaming.
Edge endpoints accept your pg_* API key as a bearer token. See Authentication for details. The cURL examples below pin Toronto (yto-01) for concreteness — substitute another region or discover the fastest one via GET https://autorouter.polargrid.ai/v1/route. See API Overview for both patterns.
Create Speech
Generate audio from input text.
Request Body
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | TTS model: kokoro-82m or tada-3b-ml |
| input | string | Yes | - | Text to convert |
| voice | string | Yes | - | Voice to use (see below) |
| voice_transcript | string | No | - | tada-3b-ml only. The exact text spoken in the voice reference clip. Required when voice is a URL or base64 WAV; not needed for voice: "default". Ignored by kokoro-82m. |
| language | string | No | en | tada-3b-ml only. Target synthesis language: en, fr, de, es, it, pt, pl, ja, ar, zh. Ignored by kokoro-82m. |
| response_format | string | No | pcm | Container format: pcm, wav, or mp3. See Audio Format below. |
| stream | boolean | No | false | When true, returns chunked audio. See Streaming. Only pcm and opus are valid response_format values when streaming. |
| speed | number | No | 1.0 | Speed multiplier (0.25-4.0). tada-3b-ml honors this only in batch mode; streaming TADA requires speed: 1.0. |
The gateway returns 400 Bad Request if input is empty or whitespace-only, voice is empty, or response_format is not valid for the request mode: batch requests accept pcm, wav, or mp3, and streaming requests accept pcm or opus. Synthesis failures (invalid voice ID, upstream error) return 502 Bad Gateway, never an empty 200.
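The constraints in the parameter table can be checked on the client before a request is sent. The following is a minimal sketch, not the gateway's own validation logic: the function name check_speech_request is hypothetical, and the page only states a 400 response for the input, voice, and response_format checks. The speed checks simply mirror the documented range and the TADA streaming restriction.

```python
from typing import Any

BATCH_FORMATS = {"pcm", "wav", "mp3"}
STREAM_FORMATS = {"pcm", "opus"}


def check_speech_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    if not str(payload.get("input", "")).strip():
        problems.append("input is empty or whitespace-only")
    if not payload.get("voice"):
        problems.append("voice is empty")

    streaming = bool(payload.get("stream", False))
    fmt = payload.get("response_format", "pcm")  # raw HTTP default is pcm
    allowed = STREAM_FORMATS if streaming else BATCH_FORMATS
    if fmt not in allowed:
        mode = "streaming" if streaming else "batch"
        problems.append(f"response_format {fmt!r} is not valid in {mode} mode")

    speed = payload.get("speed", 1.0)
    if not 0.25 <= speed <= 4.0:
        problems.append("speed must be between 0.25 and 4.0")
    elif streaming and payload.get("model") == "tada-3b-ml" and speed != 1.0:
        problems.append("tada-3b-ml streaming requires speed 1.0")
    return problems
```

Running this before every call avoids a round trip for requests that the documented rules would reject anyway.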
Voices
tada-3b-ml is a voice-cloning model — see the model page for how to provide a reference voice; it does not use the preset voice IDs below.
The kokoro-82m model exposes eight preset voices through the PolarGrid SDKs:
| Voice ID | Accent / gender |
|---|---|
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |
See the Voice AI guide for how these map to Kokoro-82M and a link to the full upstream voice list.
TADA output is deterministic. tada-3b-ml uses diffusion-based synthesis with a fixed random seed, so the same input text + same voice reference always produces byte-identical audio. This is by design — the fixed seed guarantees consistent voice identity and timing across requests, which matters for voice-agent pipelines where prosody shifts between calls would be jarring. There is no server-side cache: every request runs full inference and is billed accordingly, even when the output matches a previous call. A user-controllable seed parameter is planned for a future API version to let callers introduce deliberate prosody variation.
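Because identical tada-3b-ml requests produce identical audio and every call is billed, a client-side cache can avoid paying twice for the same clip. The sketch below is one possible approach, not a PolarGrid SDK feature: speech_cache_key and SpeechCache are hypothetical names, and synthesize stands in for whatever function actually calls the endpoint. The determinism guarantee is stated only for tada-3b-ml, so treat caching of other models as an assumption.

```python
import hashlib
import json


def speech_cache_key(payload: dict) -> str:
    """Stable key: the same request fields always hash to the same hex digest."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SpeechCache:
    """In-memory cache; `synthesize` is a caller-supplied function (hypothetical)."""

    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._store: dict[str, bytes] = {}

    def get(self, payload: dict) -> bytes:
        key = speech_cache_key(payload)
        if key not in self._store:
            # Cache miss: run (and pay for) one real synthesis call.
            self._store[key] = self._synthesize(payload)
        return self._store[key]
```

The key covers every request field, so changing the voice reference, language, speed, or response_format produces a separate cache entry.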
Audio Format
Audio is generated at 24 kHz, 16-bit, mono. The container is chosen by response_format:
| response_format | Content-Type | Body | Streaming? |
|---|---|---|---|
| pcm (default) | audio/pcm | Raw signed 16-bit little-endian samples, no header. | Yes: chunks stream as they’re synthesized. |
| wav | audio/wav | A standard RIFF/WAVE container wrapping the PCM samples. | No: buffered until synthesis completes (a valid WAV header needs the total sample count). |
| mp3 | audio/mpeg | MP3 at 128 kbps CBR (encoded server-side via libmp3lame). | No: buffered, then encoded in one shot. |
PCM is the lowest-latency choice and the recommended format for real-time voice-agent pipelines. Pick wav if you need a playable file with no client-side post-processing, or mp3 if bandwidth matters more than first-byte latency.
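If you request pcm for latency but later need a playable file, the raw samples can be wrapped in a WAV container on the client. This is a minimal sketch using Python's standard wave module; the function name pcm_to_wav is hypothetical, and the constants come from the 24 kHz, 16-bit, mono format described above.

```python
import io
import wave

SAMPLE_RATE = 24_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw s16le mono PCM in a RIFF/WAVE container.

    The wave module expects samples in the host's byte order, so this
    assumes a little-endian host (the common case).
    """
    if len(pcm) % SAMPLE_WIDTH:
        raise ValueError("PCM length is not a whole number of 16-bit samples")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The length check also doubles as a cheap sanity test: an odd byte count means the PCM body was cut off partway through a sample.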
The default differs from OpenAI’s /v1/audio/speech (which defaults to mp3). PolarGrid defaults to pcm to keep streaming TTS first-byte latency minimal. PolarGrid SDK consumers should pass response_format explicitly — the JavaScript and Python SDKs default it to mp3 for OpenAI-style behavior end-to-end.
opus, aac, and flac from the OpenAI spec are not yet supported in batch mode — requesting them returns 400. For streaming, opus is supported (see below); for batch, transcode PCM client-side if you need one of those:
ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.opus
Streaming
Pass stream: true to receive chunked audio over a single HTTP response — first bytes typically arrive in under 300 ms, well before synthesis finishes. The PolarGrid SDKs default streaming requests to response_format: 'opus'; raw HTTP callers get pcm when response_format is omitted (the gateway’s lowest-latency default).
Streamable formats:
| response_format | Content-Type | Streamable? | Use when |
|---|---|---|---|
| pcm | audio/pcm | ✓ chunked | Real-time voice-agent pipelines; lowest first-byte latency. |
| opus | audio/ogg; codecs=opus | ✓ chunked | Bandwidth-constrained clients; encoded server-side at 48 kHz. |
| wav | - | ✗ | RIFF header needs the full sample count up front. |
| mp3 | - | ✗ | Frame alignment incompatible with sub-300 ms TTFB. |
Streaming wav or mp3 returns 400 Bad Request.
Streamable models:
| Model | Streaming? | Notes |
|---|---|---|
| kokoro-82m | ✓ pcm, opus | - |
| tada-3b-ml | ✓ pcm, opus | Per-token via decoupled Triton handler. TADA streaming does not honor the speed parameter; pass speed=1.0 (or omit it) for streaming requests, and use batch mode if you need speed control on TADA. |
Response headers:
- X-Polargrid-Stream: 1 is set on every streaming response.
- X-Polargrid-Sample-Rate: 24000 gives the PCM sample rate; Opus is resampled to 48 kHz inside the Ogg container.
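A client can read these headers to confirm it received a streaming response and to size its playback buffers. The sketch below assumes the headers arrive as a plain mapping of names to string values; stream_info and pcm_seconds are hypothetical helper names, and falling back to 24000 when the sample-rate header is missing is an assumption rather than documented behavior.

```python
def stream_info(headers) -> dict:
    """Extract the streaming flag and PCM sample rate from response headers."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return {
        "streaming": lowered.get("x-polargrid-stream") == "1",
        "sample_rate": int(lowered.get("x-polargrid-sample-rate", 24000)),
    }


def pcm_seconds(num_bytes: int, sample_rate: int) -> float:
    """Duration of 16-bit mono PCM: two bytes per sample."""
    return num_bytes / (sample_rate * 2)
```

For example, 48,000 bytes of PCM at the reported 24,000 Hz rate correspond to one second of audio.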
Detecting truncated streams
A mid-stream Triton or upstream failure closes the connection cleanly — there is no in-band error frame. Clients observe one of:
- A ReadError / IncompleteRead / ChunkedEncodingError raised by the HTTP client.
- For opus, an Ogg stream that never sees the end-of-stream flag on its final page.
Treat any of these as a synthesis failure and retry. The PolarGrid SDKs surface these as exceptions from the async iterator; they do not silently terminate.
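For opus, the end-of-stream check can be done by walking the Ogg page headers in the received bytes. The following is a sketch under simplifying assumptions: it expects a single logical stream, does not verify page CRCs, and the name ogg_ends_cleanly is hypothetical. It relies only on the Ogg page layout (a 27-byte header starting with OggS, flags in byte 5, segment count in byte 26).

```python
OGG_MAGIC = b"OggS"
EOS_FLAG = 0x04  # header_type bit marking the last page of a logical stream


def ogg_ends_cleanly(data: bytes) -> bool:
    """True only if every page is complete and the final page has EOS set."""
    pos = 0
    last_flags = None
    while pos < len(data):
        header = data[pos:pos + 27]
        if len(header) < 27 or header[:4] != OGG_MAGIC:
            return False  # partial or corrupt page header
        n_segments = header[26]
        lacing = data[pos + 27:pos + 27 + n_segments]
        if len(lacing) < n_segments:
            return False  # segment table cut off
        page_end = pos + 27 + n_segments + sum(lacing)
        if page_end > len(data):
            return False  # page body cut off mid-transfer
        last_flags = header[5]
        pos = page_end
    return last_flags is not None and bool(last_flags & EOS_FLAG)
```

A False result after the connection closes is a signal to discard the audio and retry the request.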
Streaming example
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "kokoro-82m",
    "input": "Streaming hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "opus",
    "stream": true
  }' \
  --output stream.ogg
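The same streaming call can be consumed from Python without an SDK. The sketch below shows only the read loop, written against any file-like response object so it works with the object returned by urllib.request.urlopen. The names copy_stream and TruncatedStream are hypothetical, and the loop covers only the IncompleteRead case that Python's http.client raises on a cut-off chunked body.

```python
import http.client
from typing import BinaryIO


class TruncatedStream(Exception):
    """Raised when the audio stream ends before the server finished sending."""


def copy_stream(response: BinaryIO, out: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy audio from `response` to `out` in chunks; return bytes written.

    read() blocks until chunk_size bytes arrive or the stream ends, so a
    smaller chunk_size lowers the delay before the first write.
    """
    written = 0
    try:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                return written
            out.write(chunk)
            written += len(chunk)
    except http.client.IncompleteRead as exc:
        raise TruncatedStream(f"stream cut off after {written} bytes") from exc


# Usage sketch (not executed here); `request` is a urllib.request.Request
# carrying the same JSON body and headers as the cURL example:
#
#   with urllib.request.urlopen(request) as resp, open("stream.ogg", "wb") as f:
#       copy_stream(resp, f)
```

Raising a dedicated exception keeps truncation distinct from a normal end of stream, which matches the guidance to treat a cut-off stream as a synthesis failure and retry.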
Not supported in v1:
- Cartesia-compatible WebSocket TTS (wss://api.cartesia.ai/tts/websocket shape). PolarGrid streaming TTS uses chunked HTTP only. Customers porting from Cartesia must swap the transport layer.
- WebSocket TTS endpoint of any kind. There is no /v1/audio/speech/ws. PolarGrid exposes WebSockets only for completions (/v1/completions/ws) and PersonaPlex’s full-duplex voice pipeline.
- Streaming WAV and MP3. WAV requires the full sample count for its RIFF header; MP3 frame alignment can’t hit sub-300 ms TTFB. Request pcm or opus for streaming.
If your existing pipeline depends on any of the above, file a request in the PolarGrid roadmap.
Example Request
# Edge endpoints accept your pg_* API key as a bearer token; see Authentication
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer pg_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-82m",
    "input": "Hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "wav"
  }' \
  --output speech.wav
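For callers not using an SDK, the cURL request above maps directly onto Python's standard library. This is a sketch rather than an official client: speech_request is a hypothetical helper, and building the URL for other regions by substituting the region ID into the yto-01 hostname is an assumption based on the example above.

```python
import json
import urllib.request


def speech_request(
    api_key: str, payload: dict, region: str = "yto-01"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the edge speech endpoint."""
    # Assumption: other regions follow the same hostname pattern as yto-01.
    url = f"https://api.{region}.edge.polargrid.ai/v1/audio/speech"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage sketch (not executed here):
#
#   req = speech_request("pg_your_api_key", {
#       "model": "kokoro-82m",
#       "input": "Hello from PolarGrid!",
#       "voice": "af_bella",
#       "response_format": "wav",
#   })
#   with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#       f.write(resp.read())
```

Passing response_format explicitly, as here, sidesteps the difference between the gateway default (pcm) and the SDK defaults.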
Response
Returns the requested container as a binary body, with Content-Type set to audio/pcm, audio/wav, or audio/mpeg to match response_format.
See Streaming above for the streaming TTS contract, supported formats and models, and code samples.