kokoro-82m is an 82M-parameter text-to-speech model served on PolarGrid edge nodes via Triton’s python backend. It is a preset-voice model: you pick one of a fixed catalog of named voices, which makes it the low-latency counterpart to the voice-cloning tada-3b-ml. It is co-resident on the voice pod with whisper-large-v3-turbo and cohere-transcribe-03-2026; see backend/edge-production-setup/CLAUDE.md for the pod layout.
- HF repo: hexgrad/Kokoro-82M
- Modality: Text-to-Speech (streaming)
- Backend: Triton python (voice pod)
- Parameters: 82M
- Available regions: yvr-02 (Blackwell production)
Headline benchmark
Kokoro exposes a chunked-HTTP streaming transport for /v1/audio/speech.
| Measurement | p50 | p95 |
|---|---|---|
| End-to-end TTFA (with network) | 158 ms | 196 ms |
| Server-only TTFA (gateway → first triton byte) | 46 ms | 63 ms |
| Network leg of TTFA (e2e − server) | 109 ms | 134 ms |
| Full-utterance latency (TTLB, client wall-clock) | 696 ms | 1025 ms |
| Real-time factor (RTF) | 0.100 | 0.149 |
| streaming_verdict | streaming (100/100) | — |
/v1/audio/speech (stream: true, pcm, voice bm_george), 100 runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-02 from a Vancouver-area laptop over the public internet. TTFA is the time to the first audio byte arriving at the client — identical to TTFB since the response body is raw PCM. Server-only TTFA comes from the X-Pg-First-Byte-Ms response header (PR #507), the gateway’s measured time to the first PCM chunk returned by triton. Total synthesis time (TTLB) is not exposed via header and stays a client-side wall-clock number. RTF = synthesis wall-clock ÷ audio duration; below 1.0 is faster than real time. All 100/100 runs returned a streaming verdict (54–234 chunks per utterance). Raw runs: benchmarks/yvr-02-2026-06-02/kokoro-82m/. Harness: bench/kokoro-82m/.
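The RTF definition above, combined with the 24 kHz, 16-bit, mono output format, can be reproduced on the client with a few lines of arithmetic. The following is a minimal sketch, not the bench harness itself: the function names are illustrative, and it assumes the caller has already measured the synthesis wall-clock time and counted the raw PCM bytes received.

```python
SAMPLE_RATE_HZ = 24_000   # kokoro-82m output sample rate
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Playback duration of raw PCM audio, given its size in bytes."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def real_time_factor(synthesis_seconds: float, pcm_bytes: int) -> float:
    """RTF = synthesis wall-clock / audio duration; below 1.0 is faster than real time."""
    duration = pcm_duration_seconds(pcm_bytes)
    if duration <= 0:
        raise ValueError("no audio received")
    return synthesis_seconds / duration
```

For example, 480,000 bytes of PCM is 10 seconds of audio, so producing it in 1.0 second of wall-clock time gives an RTF of 0.1.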
How this compares
Same node, same harness, same /v1/audio/speech streaming surface as the tada-3b-ml bench in benchmarks/yvr-02-2026-05-27/:
| Model | server TTFA p50 | e2e TTFA p50 | e2e TTLB p50 | RTF p50 |
|---|---|---|---|---|
| tada-3b-ml (3B) | 238 ms | 352 ms | 919 ms | 0.164 |
| kokoro-82m (82M) | 46 ms | 158 ms | 696 ms | 0.100 |
See the tada-3b-ml card.
Quickstart
Edge endpoints accept your raw pg_* API key as a bearer token, with no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.
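A request along these lines can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the official client: the host pattern is inferred from the benchmark URL (https://api.yvr-02.edge.polargrid.ai), and the body field names `input` and `response_format` are assumptions not confirmed on this page, so check the Text-to-Speech API reference before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request


def build_speech_request(region: str, api_key: str, text: str,
                         voice: str = "af_bella") -> urllib.request.Request:
    """Build (but do not send) a streaming PCM request for kokoro-82m."""
    # Host pattern inferred from the benchmark URL; confirm it for your region.
    url = f"https://api.{region}.edge.polargrid.ai/v1/audio/speech"
    payload = {
        "model": "kokoro-82m",
        "voice": voice,
        "input": text,               # ASSUMPTION: name of the text field
        "response_format": "pcm",    # ASSUMPTION: name of the format field
        "stream": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # raw pg_* key, no exchange
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(region: str, api_key: str, text: str,
               voice: str = "af_bella") -> bytes:
    """Send the request and return all PCM bytes (24 kHz, 16-bit, mono)."""
    req = build_speech_request(region, api_key, text, voice)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling `synthesize("yvr-02", "pg_...", "Hello from the edge.")` would return the whole utterance at once; reading the response incrementally is what actually takes advantage of streaming.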
Capabilities
| Field | Value |
|---|---|
| Endpoint | POST /v1/audio/speech |
| Audio output | 24 kHz, 16-bit, mono |
| Streaming | Yes — chunked HTTP, pcm and opus (stream: true); audio delivered incrementally as synthesis proceeds |
| Batch formats | pcm, wav, mp3 |
| Voice model | Fixed catalog of named preset voices (no cloning) |
| speed control | 0.25–4.0 multiplier |
| Max batch size | 1 |
voice_transcript and language are tada-3b-ml-only fields and are ignored by kokoro-82m.
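Because streamed pcm output is headerless, most players need to be told the format from the table above. One way to handle this on the client is to wrap the bytes in a WAV container using Python's standard `wave` module. The sketch below assumes the payload is exactly the 24 kHz, 16-bit, mono PCM described here; `pcm_to_wav` is an illustrative helper name, not an SDK function.

```python
import io
import wave


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw 24 kHz, 16-bit, mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples = 2 bytes
        wav.setframerate(24_000)  # 24 kHz
        wav.writeframes(pcm)
    return buf.getvalue()
```

For non-streaming requests this step is unnecessary, since wav is already available as a batch format.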
Voices — preset catalog
Kokoro uses named preset voices, not reference-clip cloning. The PolarGrid SDKs expose eight presets:
| Voice ID | Accent / gender |
|---|---|
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |
voice is required; the PolarGrid SDKs and CLI default to af_bella. An invalid voice name returns 502 Bad Gateway (synthesis failure), never an empty 200. See the Text-to-Speech API reference and the Voice AI guide for the full upstream voice list.
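Since an unknown voice surfaces as a 502 only after a round trip, a client may want to check the name locally first. The sketch below hard-codes the eight SDK presets listed above; the helper names are hypothetical, and because the upstream model has a longer voice list, a name missing from this table is not necessarily invalid on the server.

```python
from typing import Optional

# The eight presets exposed by the PolarGrid SDKs, as listed above.
KOKORO_VOICES = {
    "af_bella": ("American English", "female"),
    "af_sarah": ("American English", "female"),
    "am_adam": ("American English", "male"),
    "am_michael": ("American English", "male"),
    "bf_emma": ("British English", "female"),
    "bf_isabella": ("British English", "female"),
    "bm_george": ("British English", "male"),
    "bm_lewis": ("British English", "male"),
}

DEFAULT_VOICE = "af_bella"  # the SDK and CLI default


def is_sdk_preset(voice: str) -> bool:
    """True if the voice is one of the eight SDK presets."""
    return voice in KOKORO_VOICES


def resolve_voice(voice: Optional[str] = None) -> str:
    """Mirror the SDK behaviour of falling back to af_bella when no voice is given."""
    return voice if voice else DEFAULT_VOICE
```

Raw HTTP callers still have to send `voice` explicitly; the fallback here only imitates what the SDKs and CLI do on the caller's behalf.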
Streaming
Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request.
Audio is delivered incrementally as kokoro’s pipeline produces it — the bench above observed 54–234 PCM chunks per utterance with a streaming verdict on every run, so the first audio arrives well before synthesis finishes. See the Text-to-Speech API reference for the full streaming contract — response headers, truncated-stream detection, and the per-format table.
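To benefit from incremental delivery, the client has to consume the body as it arrives instead of waiting for the full response. The sketch below shows one way to do that with any buffered binary stream, such as the object returned by `urllib.request.urlopen`. The function names are illustrative, the chunk counts it reports reflect client-side reads rather than server chunk boundaries, and it assumes the X-Pg-First-Byte-Ms header carries a plain number of milliseconds.

```python
import io
import time
from typing import Callable, Mapping, Optional, Tuple


def consume_stream(body: io.BufferedIOBase,
                   on_chunk: Callable[[bytes], None],
                   started_at: Optional[float] = None,
                   chunk_size: int = 4096) -> Tuple[Optional[float], int, int]:
    """Read a streaming body; return (seconds_to_first_chunk, reads, total_bytes).

    Pass started_at=time.perf_counter() taken just before sending the request
    to measure end-to-end TTFA; otherwise the clock starts at this call.
    """
    start = time.perf_counter() if started_at is None else started_at
    first: Optional[float] = None
    reads = 0
    total = 0
    while True:
        # read1 returns as soon as some data is available, instead of
        # blocking until chunk_size bytes have accumulated.
        chunk = body.read1(chunk_size)
        if not chunk:
            break
        if first is None:
            first = time.perf_counter() - start
        reads += 1
        total += len(chunk)
        on_chunk(chunk)  # e.g. hand the PCM to an audio output queue
    return first, reads, total


def server_first_byte_ms(headers: Mapping[str, str]) -> Optional[float]:
    """Parse X-Pg-First-Byte-Ms (ASSUMPTION: a numeric string in milliseconds)."""
    value = headers.get("X-Pg-First-Byte-Ms")
    return float(value) if value is not None else None
```

Subtracting the header value from the measured time to first chunk gives a rough per-request estimate of the network leg, in the same spirit as the benchmark table.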
Model identifier
Call this model with the canonical id kokoro-82m at /v1/audio/speech. It has no short alias. The HuggingFace repo id hexgrad/Kokoro-82M is accepted at /v1/models/load for hot-loading but does not resolve at inference time.
Notes
- License: Apache 2.0.
- For a specific cloned voice or a non-English language, use tada-3b-ml; kokoro is the choice for lowest-latency synthesis from a fixed English voice catalog.
See also
- Text-to-Speech API: endpoint reference, formats, streaming contract
- Voice AI guide: building voice agents on PolarGrid
- Authentication: using your pg_* API key
- /v1/models: list all available models
