Text-to-Speech

Generate audio from text. PolarGrid serves kokoro-82m (preset voice catalog) and tada-3b-ml (voice cloning). The endpoint returns audio in the container format requested via response_format: pcm, wav, or mp3 for batch requests, and pcm or opus for streaming requests.
Edge endpoints accept your pg_* API key as a bearer token. See Authentication for details. The cURL examples below pin Toronto (yto-01) for concreteness — substitute another region or discover the fastest one via GET https://autorouter.polargrid.ai/v1/route. See API Overview for both patterns.

Create Speech

POST /v1/audio/speech
Generate audio from input text.

Request Body

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | | TTS model: kokoro-82m or tada-3b-ml |
| input | string | Yes | | Text to convert |
| voice | string | Yes | | Voice to use (see Voices below) |
| voice_transcript | string | No | | tada-3b-ml only. The exact text spoken in the voice reference clip. Required when voice is a URL or base64 WAV; not needed for voice: "default". Ignored by kokoro-82m. |
| language | string | No | en | tada-3b-ml only. Target synthesis language: en, fr, de, es, it, pt, pl, ja, ar, zh. Ignored by kokoro-82m. |
| response_format | string | No | pcm | Container format: pcm, wav, or mp3 for batch requests; pcm or opus when streaming. See Audio Format below. |
| stream | boolean | No | false | When true, returns chunked audio. See Streaming. Only pcm and opus are valid response_format values when streaming. |
| speed | number | No | 1.0 | Speed multiplier (0.25 to 4.0). tada-3b-ml honors this only in batch mode; streaming TADA requires speed: 1.0. |

The gateway returns 400 Bad Request if input is empty or whitespace-only, voice is empty, or response_format is not valid for the request mode (pcm, wav, or mp3 for batch requests; pcm or opus for streaming requests). Synthesis failures (invalid voice ID, upstream error) return 502 Bad Gateway, never an empty 200.
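The validation rules above can be mirrored on the client so that malformed requests fail before they reach the network. The following is a minimal sketch, not part of any PolarGrid SDK: the function name build_speech_payload is hypothetical, and the speed-range and TADA-streaming checks are local conveniences, since this page does not state which status code the gateway returns for those cases.

```python
# Hypothetical client-side pre-flight check; mirrors the documented 400 rules.
BATCH_FORMATS = {"pcm", "wav", "mp3"}
STREAM_FORMATS = {"pcm", "opus"}


def build_speech_payload(model, text, voice, response_format="pcm",
                         stream=False, speed=1.0):
    """Return a JSON-ready request body, raising ValueError for bad input."""
    if not text or not text.strip():
        raise ValueError("input must not be empty or whitespace-only")
    if not voice:
        raise ValueError("voice must not be empty")
    allowed = STREAM_FORMATS if stream else BATCH_FORMATS
    if response_format not in allowed:
        raise ValueError(
            f"response_format {response_format!r} is not one of {sorted(allowed)}"
        )
    # Assumption: local checks only; gateway behavior for these is not documented here.
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and model == "tada-3b-ml" and speed != 1.0:
        raise ValueError("tada-3b-ml streaming requires speed 1.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
        "speed": speed,
    }
```

The returned dict can be serialized with json.dumps and sent as the POST body.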

Voices

tada-3b-ml is a voice-cloning model; see the model page for how to provide a reference voice. It does not use the preset voice IDs below. The kokoro-82m model exposes eight preset voices through the PolarGrid SDKs:

| Voice ID | Accent / gender |
| --- | --- |
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |

See the Voice AI guide for how these map to Kokoro-82M and a link to the full upstream voice list.
TADA output is deterministic. tada-3b-ml uses diffusion-based synthesis with a fixed random seed, so the same input text + same voice reference always produces byte-identical audio. This is by design — the fixed seed guarantees consistent voice identity and timing across requests, which matters for voice-agent pipelines where prosody shifts between calls would be jarring. There is no server-side cache: every request runs full inference and is billed accordingly, even when the output matches a previous call. A user-controllable seed parameter is planned for a future API version to let callers introduce deliberate prosody variation.
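Because identical requests produce byte-identical audio and nothing is cached server-side, a caller that repeats prompts may want its own cache to avoid paying for the same inference twice. The sketch below derives a stable cache key from the request body. The function name speech_cache_key is hypothetical, and the approach assumes that every field affecting the audio is present in the payload.

```python
import hashlib
import json


def speech_cache_key(payload: dict) -> str:
    """Return a SHA-256 hex digest that is stable across dict key order."""
    # Canonical JSON: sorted keys and fixed separators, so equal payloads
    # always serialize to the same bytes.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If voice is a URL, the key only stays valid while the clip behind that URL is unchanged, so hashing the reference audio itself may be the safer choice in that case.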

Audio Format

Audio is generated at 24 kHz, 16-bit, mono. The container is chosen by response_format:

| response_format | Content-Type | Body | Streaming? |
| --- | --- | --- | --- |
| pcm (default) | audio/pcm | Raw signed 16-bit little-endian samples, no header. | Yes. Chunks stream as they’re synthesized. |
| wav | audio/wav | A standard RIFF/WAVE container wrapping the PCM samples. | No. Buffered until synthesis completes (a valid WAV header needs the total sample count). |
| mp3 | audio/mpeg | MP3 at 128 kbps CBR (encoded server-side via libmp3lame). | No. Buffered, then encoded in one shot. |

PCM is the lowest-latency choice and the recommended format for real-time voice-agent pipelines. Pick wav if you need a playable file with no client-side post-processing, or mp3 if bandwidth matters more than first-byte latency.
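Since the pcm body has no header, the client has to supply the 24 kHz, 16-bit, mono parameters itself. The sketch below uses only the Python standard library to compute the duration of a PCM body and to wrap it in a WAV container after the fact. The function names are illustrative.

```python
import io
import wave

SAMPLE_RATE = 24_000  # Hz, per the format described above
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of a raw PCM body in seconds."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw little-endian 16-bit PCM in a RIFF/WAVE container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)  # the header is finalized when the writer closes
    return buf.getvalue()
```

This keeps the low first-byte latency of pcm on the wire while still producing a playable file once the full body has arrived.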
The default differs from OpenAI’s /v1/audio/speech (which defaults to mp3). PolarGrid defaults to pcm to keep streaming TTS first-byte latency minimal. The JavaScript and Python SDKs override this and default non-streaming requests to mp3 for OpenAI-style behavior end-to-end, so SDK consumers who want a different format should pass response_format explicitly.
opus, aac, and flac from the OpenAI spec are not yet supported in batch mode — requesting them returns 400. For streaming, opus is supported (see below); for batch, transcode PCM client-side if you need one of those:
ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.opus

Streaming

Pass stream: true to receive chunked audio over a single HTTP response — first bytes typically arrive in under 300 ms, well before synthesis finishes. The PolarGrid SDKs default streaming requests to response_format: 'opus'; raw HTTP callers get pcm when response_format is omitted (the gateway’s lowest-latency default). Streamable formats:

| response_format | Content-Type | Streamable? | Use when |
| --- | --- | --- | --- |
| pcm | audio/pcm | ✓ chunked | Real-time voice-agent pipelines. Lowest first-byte latency. |
| opus | audio/ogg; codecs=opus | ✓ chunked | Bandwidth-constrained clients. Server-side encoded at 48 kHz. |
| wav | n/a | No | RIFF header needs the full sample count up front. |
| mp3 | n/a | No | Frame alignment incompatible with sub-300 ms TTFB. |

Streaming wav or mp3 returns 400 Bad Request. Streamable models:

| Model | Streaming? | Notes |
| --- | --- | --- |
| kokoro-82m | pcm, opus | |
| tada-3b-ml | pcm, opus | Per-token via decoupled Triton handler. TADA streaming does not honor the speed parameter: pass speed: 1.0 (or omit it) for streaming requests, and use batch mode if you need speed control on TADA. |

Response headers:
  • X-Polargrid-Stream: 1 — set on every streaming response.
  • X-Polargrid-Sample-Rate: 24000 — PCM sample rate; Opus is resampled to 48 kHz inside the Ogg container.
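A raw HTTP client can consume the chunked response with the Python standard library alone. The sketch below is an illustration rather than SDK code: stream_speech and whole_samples are hypothetical names, and the caller supplies the URL and key. One detail worth handling is that chunk boundaries need not align with 16-bit sample boundaries, so the second helper carries any odd trailing byte over to the next chunk before handing PCM to a player.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def stream_speech(url: str, api_key: str, payload: dict,
                  chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio chunks from a streaming /v1/audio/speech response."""
    body = json.dumps({**payload, "stream": True}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Documented marker header present on every streaming response.
        if resp.headers.get("X-Polargrid-Stream") != "1":
            raise RuntimeError("response is not marked as a stream")
        while True:
            chunk = resp.read1(chunk_size)  # returns as soon as data is available
            if not chunk:
                break
            yield chunk


def whole_samples(chunks: Iterable[bytes], sample_width: int = 2) -> Iterator[bytes]:
    """Re-chunk a byte stream so every yielded block holds whole samples."""
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        usable = len(data) - (len(data) % sample_width)
        leftover = data[usable:]
        if usable:
            yield data[:usable]
```

For pcm, wrap the generator as whole_samples(stream_speech(...)); for opus, pass the chunks straight to an Ogg decoder instead.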

Detecting truncated streams

A mid-stream Triton or upstream failure closes the connection cleanly — there is no in-band error frame. Clients observe one of:
  • A ReadError (httpx), IncompleteRead (http.client), or ChunkedEncodingError (requests) raised by the HTTP client.
  • For opus, an Ogg stream that never sees the end-of-stream flag on its final page.
Treat any of these as a synthesis failure and retry. The PolarGrid SDKs surface these as exceptions from the async iterator; they do not silently terminate.
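For raw HTTP clients, both checks can be approximated without an audio library. The sketch below walks the Ogg pages of a received opus body and reports whether the final page carries the end-of-stream flag (bit 0x04 of the page header type), then retries when the stream is truncated. It assumes a single logical Ogg stream; the function names and the fetch callable are hypothetical.

```python
import http.client

OGG_EOS_FLAG = 0x04      # header_type bit marking the last page of a stream
OGG_HEADER_LEN = 27      # fixed part of an Ogg page header


def ogg_ended_cleanly(data: bytes) -> bool:
    """True if data is a run of complete Ogg pages whose last page has EOS set."""
    pos = 0
    last_flags = None
    while pos < len(data):
        if len(data) - pos < OGG_HEADER_LEN or data[pos:pos + 4] != b"OggS":
            return False  # truncated header or lost sync
        flags = data[pos + 5]
        table_end = pos + OGG_HEADER_LEN + data[pos + 26]
        if table_end > len(data):
            return False  # truncated segment table
        page_end = table_end + sum(data[pos + OGG_HEADER_LEN:table_end])
        if page_end > len(data):
            return False  # truncated page body
        last_flags = flags
        pos = page_end
    return last_flags is not None and bool(last_flags & OGG_EOS_FLAG)


def fetch_opus_with_retry(fetch, attempts: int = 3) -> bytes:
    """Call fetch() until it returns a complete Ogg stream or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            data = fetch()
        except (http.client.IncompleteRead, ConnectionError) as exc:
            last_error = exc
            continue
        if ogg_ended_cleanly(data):
            return data
        last_error = RuntimeError("Ogg stream ended without an end-of-stream page")
    raise RuntimeError("synthesis failed after retries") from last_error
```

Here fetch is any zero-argument callable that performs one streaming request and returns the full body as bytes.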

Streaming example

# TOKEN holds your pg_* API key
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "kokoro-82m",
    "input": "Streaming hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "opus",
    "stream": true
  }' \
  --output stream.ogg
Not supported in v1:
  • Cartesia-compatible WebSocket TTS (wss://api.cartesia.ai/tts/websocket shape). PolarGrid streaming TTS uses chunked HTTP only. Customers porting from Cartesia must swap the transport layer.
  • WebSocket TTS endpoint of any kind. There is no /v1/audio/speech/ws. PolarGrid exposes WebSockets only for completions (/v1/completions/ws) and PersonaPlex’s full-duplex voice pipeline.
  • Streaming WAV and MP3. WAV requires the full sample count for its RIFF header; MP3 frame alignment can’t hit sub-300 ms TTFB. Request pcm or opus for streaming.
If your existing pipeline depends on any of the above, file a request in the PolarGrid roadmap.

Example Request

# Edge endpoints accept your pg_* API key as a bearer token — see Authentication
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer pg_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-82m",
    "input": "Hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "wav"
  }' \
  --output speech.wav
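
The same batch request can be issued from Python without third-party packages. This is a sketch under stated assumptions: the helper names are hypothetical, and edge_base_url generalizes the host pattern from the yto-01 example above, so confirm other region hostnames against the API Overview.

```python
import json
import urllib.request


def edge_base_url(region: str) -> str:
    # Assumption: other regions follow the same host pattern as yto-01.
    return f"https://api.{region}.edge.polargrid.ai"


def speech_request(region: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    return urllib.request.Request(
        edge_base_url(region) + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def save_speech(req: urllib.request.Request, path: str) -> str:
    """Send the request, write the audio body to path, return the Content-Type."""
    # urlopen raises urllib.error.HTTPError for 400 and 502 responses.
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())
        return resp.headers.get("Content-Type", "")
```

A call such as save_speech(speech_request("yto-01", "pg_your_api_key", {"model": "kokoro-82m", "input": "Hello from PolarGrid!", "voice": "af_bella", "response_format": "wav"}), "speech.wav") corresponds to the cURL example.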

Response

Returns the requested container as a binary body, with Content-Type set to audio/pcm, audio/wav, or audio/mpeg to match response_format. See Streaming above for the streaming TTS contract, supported formats and models, and code samples.