Kokoro 82M (kokoro-82m) is an 82M-parameter text-to-speech model served on PolarGrid edge nodes via Triton’s Python backend. It is a preset-voice model: you pick one of a fixed catalog of named voices, which makes it the low-latency counterpart to the voice-cloning tada-3b-ml. It shares the voice pod with whisper-large-v3-turbo and cohere-transcribe-03-2026; see backend/edge-production-setup/CLAUDE.md for the pod layout.
  • HF repo: hexgrad/Kokoro-82M
  • Modality: Text-to-Speech (streaming)
  • Backend: Triton python (voice pod)
  • Parameters: 82M
  • Available regions: yvr-02 (Blackwell production)

Headline benchmark

Kokoro exposes a chunked-HTTP streaming transport for /v1/audio/speech.

| Measurement | p50 | p95 |
| --- | --- | --- |
| End-to-end TTFA (with network) | 158 ms | 196 ms |
| Server-only TTFA (gateway → first Triton byte) | 46 ms | 63 ms |
| Network leg of TTFA (e2e − server) | 109 ms | 134 ms |
| Full-utterance latency (TTLB, client wall-clock) | 696 ms | 1025 ms |
| Real-time factor (RTF) | 0.100 | 0.149 |
| streaming_verdict | streaming (100/100) | |

Streaming /v1/audio/speech (stream: true, pcm, voice bm_george), 100 runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-02 from a Vancouver-area laptop over the public internet. TTFA is the time until the first audio byte arrives at the client; because the response body is raw PCM, it is identical to TTFB. Server-only TTFA comes from the X-Pg-First-Byte-Ms response header (PR #507), which reports the gateway’s measured time to the first PCM chunk returned by Triton. Total synthesis time (TTLB) is not exposed via a header, so it remains a client-side wall-clock number. RTF = synthesis wall-clock ÷ audio duration; a value below 1.0 is faster than real time. All 100 runs returned a streaming verdict (54–234 chunks per utterance). Raw runs: benchmarks/yvr-02-2026-06-02/kokoro-82m/. Harness: bench/kokoro-82m/.
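The arithmetic behind these metrics can be sketched in a few lines of Python. This is a minimal illustration, not the harness in bench/kokoro-82m/: the function names are hypothetical, the audio duration is derived from the 24 kHz, 16-bit, mono output format, and the run values are made-up toy numbers rather than the raw runs. It also shows why a percentile of per-run differences need not equal the difference of percentiles, which may be why the network-leg row is not exactly 158 − 46.

```python
from statistics import median

SAMPLE_RATE_HZ = 24_000  # output format stated on this card: 24 kHz
BYTES_PER_SAMPLE = 2     # 16-bit, mono


def audio_seconds(pcm_byte_count: int) -> float:
    """Duration of raw PCM audio, given its size in bytes."""
    return pcm_byte_count / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def real_time_factor(synthesis_seconds: float, pcm_byte_count: int) -> float:
    """RTF = synthesis wall-clock / audio duration; below 1.0 beats real time."""
    return synthesis_seconds / audio_seconds(pcm_byte_count)


# Made-up (e2e_ttfa_ms, server_ttfa_ms) pairs for illustration only.
# These are NOT values from the benchmark's raw runs.
toy_runs = [(150, 40), (158, 60), (170, 46)]

# Network leg computed per run, then summarized.
per_run_network_ms = [e2e - server for e2e, server in toy_runs]
p50_of_differences = median(per_run_network_ms)

# Subtracting the two medians is a different statistic.
difference_of_p50s = (
    median(e2e for e2e, _ in toy_runs) - median(server for _, server in toy_runs)
)
```

With the toy data, the per-run median is 110 ms while the difference of medians is 112 ms, so small mismatches between table rows are expected when each row is summarized independently.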

How this compares

Same node, same harness, same /v1/audio/speech streaming surface as the tada-3b-ml bench in benchmarks/yvr-02-2026-05-27/:

| Model | Server TTFA p50 | e2e TTFA p50 | e2e TTLB p50 | RTF p50 |
| --- | --- | --- | --- | --- |
| tada-3b-ml (3B) | 238 ms | 352 ms | 919 ms | 0.164 |
| kokoro-82m (82M) | 46 ms | 158 ms | 696 ms | 0.100 |

Kokoro’s 82M model synthesizes the first audio chunk about 5× faster server-side (46 ms vs 238 ms), finishes the full utterance sooner (696 ms vs 919 ms), and runs at a lower real-time factor (0.100 vs 0.164). The ~109 ms network leg is effectively the same for both models, since both benches used the same laptop and the same POP. Pick kokoro for lowest-latency synthesis from a fixed voice catalog; pick tada when you need a specific cloned voice or a non-English language. For external-provider TTS comparisons (Cartesia, ElevenLabs, Hume), see the tada-3b-ml card.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.
curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "kokoro-82m",
    "input": "Hello from PolarGrid.",
    "voice": "bm_george",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm
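The speech.pcm file written by the command above is headerless, so most players will not open it directly. A minimal sketch for wrapping it in a WAV container with Python's standard wave module follows. The function name is hypothetical, the 24 kHz, 16-bit, mono parameters come from the Capabilities section of this card, and little-endian signed samples are an assumption (the card does not state endianness).

```python
import wave


def pcm_to_wav(pcm_path: str, wav_path: str, sample_rate: int = 24_000) -> int:
    """Wrap raw 16-bit mono PCM in a WAV header; returns the frame count."""
    with open(pcm_path, "rb") as src:
        pcm = src.read()

    with wave.open(wav_path, "wb") as dst:
        dst.setnchannels(1)           # mono
        dst.setsampwidth(2)           # 16-bit samples = 2 bytes each
        dst.setframerate(sample_rate)
        # Assumption: samples are little-endian signed, as WAV expects.
        dst.writeframes(pcm)

    return len(pcm) // 2
```

Calling pcm_to_wav("speech.pcm", "speech.wav") should produce a file that ordinary audio players can open, provided the endianness assumption holds.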

Capabilities

| Field | Value |
| --- | --- |
| Endpoint | POST /v1/audio/speech |
| Audio output | 24 kHz, 16-bit, mono |
| Streaming | Yes: chunked HTTP, pcm and opus (stream: true); audio delivered incrementally as synthesis proceeds |
| Batch formats | pcm, wav, mp3 |
| Voice model | Fixed catalog of named preset voices (no cloning) |
| speed control | 0.25–4.0 multiplier |
| Max batch size | 1 |

voice_transcript and language are tada-3b-ml-only fields and are ignored by kokoro-82m.

Voices — preset catalog

Kokoro uses named preset voices, not reference-clip cloning. The PolarGrid SDKs expose eight presets:

| Voice ID | Accent / gender |
| --- | --- |
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |

voice is required; the PolarGrid SDKs and CLI default to af_bella. An invalid voice name returns 502 Bad Gateway (synthesis failure), never an empty 200. See the Text-to-Speech API reference and the Voice AI guide for the full upstream voice list.
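Because a misspelled voice surfaces only as a 502 after a round trip, a client may want to check the name locally first. The sketch below is one hypothetical way to do that. The names SDK_PRESET_VOICES and check_voice are not part of any PolarGrid SDK, and the set contains only the eight presets listed above, so voices from the fuller upstream list must be passed in explicitly.

```python
# The eight presets listed on this card. The upstream catalog may contain
# more voices; pass those through extra_voices rather than editing this set.
SDK_PRESET_VOICES = frozenset({
    "af_bella", "af_sarah", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
})


def check_voice(voice: str, extra_voices=()) -> str:
    """Return the voice unchanged if it is known, else raise ValueError."""
    allowed = SDK_PRESET_VOICES | set(extra_voices)
    if voice not in allowed:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(allowed)}"
        )
    return voice
```

For example, check_voice("bm_george") returns the name unchanged, while check_voice("bm_georg") raises before any request is sent.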

Streaming

Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request. Audio is delivered incrementally as kokoro’s pipeline produces it — the bench above observed 54–234 PCM chunks per utterance with a streaming verdict on every run, so the first audio arrives well before synthesis finishes. See the Text-to-Speech API reference for the full streaming contract — response headers, truncated-stream detection, and the per-format table.
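The format rule and the chunk-by-chunk delivery described above can be illustrated with a small client-side sketch. It is transport-agnostic and does not perform any HTTP itself: check_format and consume_stream are hypothetical helper names, the allowed format sets are copied from this card, and the chunk source is any iterable of bytes objects.

```python
import time
from typing import Callable, Iterable, Optional

STREAMING_FORMATS = {"pcm", "opus"}    # valid with stream: true
BATCH_FORMATS = {"pcm", "wav", "mp3"}  # valid without streaming


def check_format(response_format: str, stream: bool) -> None:
    """Reject locally what the card says the server rejects with 400."""
    allowed = STREAMING_FORMATS if stream else BATCH_FORMATS
    if response_format not in allowed:
        raise ValueError(
            f"{response_format!r} is not valid with stream={stream}; "
            f"use one of {sorted(allowed)}"
        )


def consume_stream(
    chunks: Iterable[bytes],
    started_at: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Drain a chunk iterator and record TTFA, TTLB, chunk and byte counts.

    Pass started_at (a clock() reading taken just before the request was
    sent) so that connection setup is included in the timings.
    """
    start = clock() if started_at is None else started_at
    first_chunk_at = None
    n_chunks = 0
    n_bytes = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        if first_chunk_at is None:
            first_chunk_at = clock()
        n_chunks += 1
        n_bytes += len(chunk)
    end = clock()
    return {
        "ttfa_s": None if first_chunk_at is None else first_chunk_at - start,
        "ttlb_s": end - start,
        "chunks": n_chunks,
        "bytes": n_bytes,
    }
```

With the requests library, for instance, the chunk source could be response.iter_content(chunk_size=None) on a response opened with stream=True; that wiring is left out here so the sketch runs without network access.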

Model identifier

Call this model with the canonical id kokoro-82m at /v1/audio/speech. It has no short alias. The HuggingFace repo id hexgrad/Kokoro-82M is accepted at /v1/models/load for hot-loading but does not resolve at inference time.

Notes

  • License: Apache 2.0.
  • For a specific cloned voice or a non-English language, use tada-3b-ml — kokoro is the choice for lowest-latency synthesis from a fixed English voice catalog.

See also