Kokoro 82M

Kokoro 82M (kokoro-82m) is an 82M-parameter text-to-speech model served on PolarGrid edge nodes via Triton’s python backend. It is a preset-voice model — pick one of a fixed catalog of named voices — which makes it the low-latency counterpart to tada-3b-ml’s voice-cloning. It is co-resident on the voice pod with whisper-large-v3-turbo and cohere-transcribe-03-2026 — see backend/edge-production-setup/CLAUDE.md for the pod layout.

HF repo: hexgrad/Kokoro-82M
Modality: Text-to-Speech (streaming)
Backend: Triton python (voice pod)
Parameters: 82M

Available regions: all regions except dfw-02 — see Model availability

Headline benchmark

Kokoro exposes a chunked-HTTP streaming transport for /v1/audio/speech.

Measurement	p50	p95
End-to-end TTFA (with network)	158 ms	196 ms
Server-only TTFA (gateway → first triton byte)	46 ms	63 ms
Network leg of TTFA (e2e − server)	109 ms	134 ms
Full-utterance latency (TTLB, client wall-clock)	696 ms	1025 ms
Real-time factor (RTF)	0.100	0.149
streaming_verdict	streaming (100/100)	—

Streaming /v1/audio/speech (stream: true, pcm, voice bm_george), 100 runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-02 from a Vancouver-area laptop over the public internet. TTFA is the time to the first audio byte arriving at the client — identical to TTFB since the response body is raw PCM. Server-only TTFA comes from the X-Pg-First-Byte-Ms response header (PR #507), the gateway’s measured time to the first PCM chunk returned by triton. Total synthesis time (TTLB) is not exposed via header and stays a client-side wall-clock number. RTF = synthesis wall-clock ÷ audio duration; below 1.0 is faster than real time. All 100/100 runs returned a streaming verdict (54–234 chunks per utterance). Raw runs: benchmarks/yvr-02-2026-06-02/kokoro-82m/. Harness: bench/kokoro-82m/.
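The timing definitions above can be sketched as a small client-side helper. This is a minimal illustration, not the harness in bench/kokoro-82m/: the function name `time_pcm_stream` and the `StreamTiming` tuple are hypothetical, the clock is assumed to start immediately before the request is sent, and RTF is assumed to use the client wall-clock TTLB as the synthesis time.

```python
import time
from typing import Callable, Iterable, NamedTuple

SAMPLE_RATE_HZ = 24_000  # 24 kHz output, per the Capabilities table
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM


class StreamTiming(NamedTuple):
    ttfa_s: float   # time to first audio byte
    ttlb_s: float   # time to last byte (full utterance)
    audio_s: float  # duration of the audio received
    rtf: float      # ttlb_s / audio_s; below 1.0 is faster than real time
    chunks: int     # number of non-empty chunks seen


def time_pcm_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a lazy PCM chunk iterator and derive TTFA, TTLB and RTF.

    Assumption: the iterable sends the request on first iteration, so the
    clock started here also covers the request leg.
    """
    start = clock()
    first_byte_at = None
    total_bytes = 0
    n_chunks = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks; they carry no audio
        if first_byte_at is None:
            first_byte_at = clock()
        total_bytes += len(chunk)
        n_chunks += 1
    end = clock()
    if first_byte_at is None:
        raise ValueError("stream produced no audio bytes")
    audio_s = total_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    ttlb_s = end - start
    return StreamTiming(first_byte_at - start, ttlb_s, audio_s, ttlb_s / audio_s, n_chunks)
```

One way to feed it is the chunk iterator of an HTTP client, for example `requests.post(..., stream=True).iter_content(chunk_size=None)`. The network leg for a run would then be `ttfa_s` minus the value of the X-Pg-First-Byte-Ms header converted to seconds.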

How this compares

Same node, same harness, same /v1/audio/speech streaming surface as the tada-3b-ml bench in benchmarks/yvr-02-2026-05-27/:

Model	server TTFA p50	e2e TTFA p50	e2e TTLB p50	RTF p50
`tada-3b-ml` (3B)	238 ms	352 ms	919 ms	0.164
`kokoro-82m` (82M)	46 ms	158 ms	696 ms	0.100

Kokoro’s 82M model synthesizes the first audio chunk ~5× faster server-side (46 vs 238 ms) and finishes the full utterance sooner at a lower real-time factor. The ~109 ms network leg is essentially the same for both, since both benches ran from the same laptop against the same POP. Pick kokoro for lowest-latency synthesis from a fixed voice catalog; pick tada when you need a specific cloned voice or a non-English language. For external-provider TTS comparisons (Cartesia, ElevenLabs, Hume), see the tada-3b-ml card.
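The ratios quoted above follow directly from the table. The sketch below only redoes that arithmetic; the dictionary layout and the helper names `speedup` and `network_leg_ms` are illustrative. Note that subtracting two p50 values is not the same as taking the p50 of per-run differences, so these network-leg figures need not match the 109 ms in the headline table exactly.

```python
# p50 figures copied from the comparison table, in milliseconds.
P50_MS = {
    "tada-3b-ml": {"server_ttfa": 238, "e2e_ttfa": 352, "e2e_ttlb": 919},
    "kokoro-82m": {"server_ttfa": 46, "e2e_ttfa": 158, "e2e_ttlb": 696},
}


def speedup(metric: str, slow: str = "tada-3b-ml", fast: str = "kokoro-82m") -> float:
    """How many times smaller the fast model's p50 is for a given metric."""
    return P50_MS[slow][metric] / P50_MS[fast][metric]


def network_leg_ms(model: str) -> int:
    """Difference of medians: e2e TTFA p50 minus server TTFA p50."""
    row = P50_MS[model]
    return row["e2e_ttfa"] - row["server_ttfa"]


for metric in ("server_ttfa", "e2e_ttfa", "e2e_ttlb"):
    print(f"{metric}: {speedup(metric):.2f}x")
for model in P50_MS:
    print(f"{model}: network leg ~{network_leg_ms(model)} ms")
```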

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.

curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "kokoro-82m",
    "input": "Hello from PolarGrid.",
    "voice": "bm_george",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm

import { PolarGrid } from "@polargrid/polargrid-sdk";

const client = await PolarGrid.create({ apiKey: process.env.POLARGRID_API_KEY });

for await (const chunk of client.textToSpeechStream({
  model: "kokoro-82m",
  input: "Hello from PolarGrid.",
  voice: "bm_george",
  responseFormat: "opus",
})) {
  audioPlayer.appendChunk(chunk);
}

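import asyncio
import os

from polargrid import PolarGrid


async def main() -> None:
    # Read the key from the environment, as in the curl and TypeScript examples.
    client = await PolarGrid.create(api_key=os.environ["POLARGRID_API_KEY"])

    # audio_player is a placeholder for your own playback sink.
    async for chunk in client.text_to_speech_stream({
        "model": "kokoro-82m",
        "input": "Hello from PolarGrid.",
        "voice": "bm_george",
        "response_format": "opus",
    }):
        audio_player.append_chunk(chunk)


asyncio.run(main())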

Capabilities

Field	Value
Endpoint	`POST /v1/audio/speech`
Audio output	24 kHz, 16-bit, mono
Streaming	Yes — chunked HTTP, `pcm` and `opus` (`stream: true`); audio delivered incrementally as synthesis proceeds
Batch formats	`pcm`, `wav`, `mp3`
Voice model	Fixed catalog of named preset voices (no cloning)
`speed` control	`0.25`–`4.0` multiplier
Max batch size	1

voice_transcript and language are tada-3b-ml-only fields and are ignored by kokoro-82m.
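Because `pcm` output is headerless, most players need the sample format spelled out before they can open a file such as the speech.pcm written by the curl quickstart. A minimal sketch using Python's standard `wave` module follows. The helper name `pcm_to_wav` is hypothetical, and the samples are assumed to be little-endian signed 16-bit, which the table above does not state explicitly.

```python
import wave

SAMPLE_RATE_HZ = 24_000  # from the Capabilities table
SAMPLE_WIDTH_BYTES = 2   # 16-bit
CHANNELS = 1             # mono


def pcm_to_wav(pcm: bytes, wav_path: str) -> float:
    """Wrap raw PCM in a WAV container and return the audio duration in seconds.

    Assumption: samples are little-endian signed 16-bit, as WAV expects.
    """
    frame_bytes = SAMPLE_WIDTH_BYTES * CHANNELS
    if len(pcm) % frame_bytes:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return len(pcm) / (SAMPLE_RATE_HZ * frame_bytes)
```

For example, `pcm_to_wav(open("speech.pcm", "rb").read(), "speech.wav")` would convert the quickstart output. Requesting `wav` in a non-streaming call avoids the conversion entirely.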

Voices — preset catalog

Kokoro uses named preset voices, not reference-clip cloning. The PolarGrid SDKs expose eight presets:

Voice ID	Accent / gender
`af_bella`	American English, female
`af_sarah`	American English, female
`am_adam`	American English, male
`am_michael`	American English, male
`bf_emma`	British English, female
`bf_isabella`	British English, female
`bm_george`	British English, male
`bm_lewis`	British English, male

voice is required on raw HTTP requests; the PolarGrid SDKs and CLI send af_bella when you do not specify one. An invalid voice name returns 502 Bad Gateway (synthesis failure), never an empty 200. See the Text-to-Speech API reference and the Voice AI guide for the full upstream voice list.
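Since an unknown voice only fails at synthesis time with a 502, it can be cheaper to check the name on the client first. The sketch below hard-codes the eight presets from the table. The helper name `describe_voice` is hypothetical, and the reading of the two-letter prefix (first letter accent, second letter gender) is inferred from the table rather than from a documented naming rule.

```python
# The eight presets listed in the table above.
KOKORO_VOICES = frozenset({
    "af_bella", "af_sarah", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
})

# Assumption: prefix meaning inferred from the table's accent/gender column.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> str:
    """Return 'accent, gender' for a known preset, or raise ValueError."""
    if voice_id not in KOKORO_VOICES:
        known = ", ".join(sorted(KOKORO_VOICES))
        raise ValueError(f"unknown kokoro voice {voice_id!r}; expected one of: {known}")
    return f"{ACCENTS[voice_id[0]]}, {GENDERS[voice_id[1]]}"
```

A caller might run `describe_voice(voice)` before building the request and surface the `ValueError` to the user instead of waiting for the gateway error. The upstream voice list is longer than this set, so the check is only as current as the hard-coded catalog.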

Streaming

Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request. Audio is delivered incrementally as kokoro’s pipeline produces it — the bench above observed 54–234 PCM chunks per utterance with a streaming verdict on every run, so the first audio arrives well before synthesis finishes. See the Text-to-Speech API reference for the full streaming contract — response headers, truncated-stream detection, and the per-format table.
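The format rules above can be mirrored in a small request builder so that a bad combination fails before it reaches the network. This is an illustrative sketch only: `build_speech_request` is a hypothetical helper, not part of the PolarGrid SDK, and the allowed sets are copied from the Capabilities table, so the server remains the authority on what it accepts.

```python
from typing import Optional

STREAMING_FORMATS = {"pcm", "opus"}    # allowed with stream: true
BATCH_FORMATS = {"pcm", "wav", "mp3"}  # allowed without streaming
SPEED_RANGE = (0.25, 4.0)              # speed multiplier bounds


def build_speech_request(
    text: str,
    voice: str,
    *,
    response_format: str,
    stream: bool = False,
    speed: Optional[float] = None,
) -> dict:
    """Build a JSON body for POST /v1/audio/speech, rejecting known-bad combinations."""
    allowed = STREAMING_FORMATS if stream else BATCH_FORMATS
    if response_format not in allowed:
        mode = "streaming" if stream else "batch"
        raise ValueError(
            f"{response_format!r} is not a {mode} format; use one of {sorted(allowed)}"
        )
    body = {
        "model": "kokoro-82m",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
    }
    if speed is not None:
        low, high = SPEED_RANGE
        if not low <= speed <= high:
            raise ValueError(f"speed must be between {low} and {high}, got {speed}")
        body["speed"] = speed
    return body
```

Calling `build_speech_request("Hello from PolarGrid.", "bm_george", response_format="pcm", stream=True)` yields the same body as the curl quickstart, while `response_format="wav"` with `stream=True` raises locally instead of returning 400 Bad Request.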

Model identifier

Call this model with the canonical id kokoro-82m at /v1/audio/speech. It has no short alias. The Hugging Face repo id hexgrad/Kokoro-82M is accepted at /v1/models/load for hot-loading but does not resolve at inference time.

Notes

License: Apache 2.0.
For a specific cloned voice or a non-English language, use tada-3b-ml — kokoro is the choice for lowest-latency synthesis from a fixed English voice catalog.
