Text-to-Speech
Generate audio from text with a catalog of Kokoro-82M voices. The endpoint returns raw PCM audio.
Edge endpoints require a JWT. See Authentication for how to obtain one.
Create Speech
Request Body
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | — | TTS model: kokoro-82m |
| input | string | Yes | — | Text to convert (max 4096 chars) |
| voice | string | Yes | — | Voice to use (see below) |
| response_format | string | No | pcm | Accepted for forward compatibility; currently ignored (see Audio Format below) |
| speed | number | No | 1.0 | Speed multiplier (0.25-4.0) |
Voices
The kokoro-82m model exposes eight voices through the PolarGrid SDKs:
| Voice ID | Accent / gender |
|---|---|
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |
Audio Format
The endpoint returns raw 24 kHz, 16-bit, mono PCM audio. The response_format parameter is accepted by the API for forward compatibility but is currently ignored: the response is always PCM regardless of what you pass.
Backend encoder support for mp3, opus, aac, flac, and wav is planned but not yet implemented. To produce one of those formats today, transcode the PCM bytes client-side.
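As one illustration of client-side transcoding, the sketch below wraps the raw PCM bytes in a WAV container using Python's standard-library wave module. The sample rate, sample width, and channel count come from the format described above. The function names are hypothetical, and the code assumes the samples are signed little-endian, which this page does not state.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24 kHz, per the Audio Format section
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_to_wav(pcm_bytes: bytes) -> bytes:
    """Wrap raw PCM samples in an in-memory WAV (RIFF) container.

    ASSUMPTION: samples are signed 16-bit little-endian, the layout WAV expects.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_bytes: bytes) -> float:
    """Playback length implied by the byte count at 24 kHz, 16-bit, mono."""
    bytes_per_second = SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS
    return len(pcm_bytes) / bytes_per_second
```

Compressed formats such as mp3 or opus need an external encoder. With ffmpeg, for instance, raw input of this shape can be described with the input options `-f s16le -ar 24000 -ac 1`, again assuming little-endian samples.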
Example Request
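A minimal sketch of a request in Python, using only the standard library. The parameter names, limits, and voice IDs come from the tables above. The URL path `/v1/audio/speech`, the Bearer authorization scheme, and the helper names are assumptions made for illustration; check the API reference and the Authentication page for the actual values.

```python
import json
import urllib.request

# Voice IDs listed in the Voices table.
VOICES = {
    "af_bella", "af_sarah", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
}


def build_speech_payload(text: str, voice: str, speed: float = 1.0) -> dict:
    """Build the JSON request body, enforcing the documented limits."""
    if len(text) > 4096:
        raise ValueError("input must be at most 4096 characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    # response_format is omitted because it is currently ignored.
    return {"model": "kokoro-82m", "input": text, "voice": voice, "speed": speed}


def create_speech(base_url: str, jwt: str, payload: dict,
                  path: str = "/v1/audio/speech") -> bytes:
    """POST the payload and return the raw PCM bytes.

    ASSUMPTION: the path and the Bearer scheme are not given on this page.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A call might then look like `create_speech(base_url, token, build_speech_payload("Hello", "af_bella"))`, where `base_url` and `token` are placeholders for your edge host and JWT.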
Response
Returns raw 24 kHz, 16-bit, mono PCM audio bytes.
Streaming TTS
For real-time audio playback, use streaming.
Streaming TTS is available in mock mode. Production streaming is coming soon.
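The streaming interface itself is not specified on this page, so the following is only a sketch of one client-side concern: network chunks can split a 16-bit sample in half, and an audio device generally expects whole samples. The hypothetical helper below takes any iterable of byte chunks, however they were obtained, and re-emits them aligned to 2-byte sample boundaries.

```python
from typing import Iterable, Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def aligned_pcm_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield chunks whose lengths are whole multiples of one sample.

    A leftover byte is carried into the next chunk. A trailing odd byte
    at the end of the stream is dropped, since it is not a full sample.
    """
    pending = b""
    for chunk in chunks:
        data = pending + chunk
        usable = len(data) - (len(data) % BYTES_PER_SAMPLE)
        if usable:
            yield data[:usable]
        pending = data[usable:]
```

Each yielded chunk can be handed directly to a playback buffer that consumes 24 kHz, 16-bit, mono samples.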
