whisper-large-v3-turbo is OpenAI’s 809M-parameter Whisper model, served on PolarGrid edge nodes via Triton’s python backend using the faster-whisper runtime. It is co-resident on the voice pod alongside cohere-transcribe-03-2026 and the kokoro-82m TTS model; see backend/edge-production-setup/CLAUDE.md for the pod split rationale.
- HF repo: openai/whisper-large-v3-turbo
- Modality: Speech-to-Text (streaming + sync)
- Backend: Triton python (voice pod)
- Parameters: 809M
- Available regions: yvr-02 (Blackwell production)
Headline benchmark
POST /v1/audio/transcriptions?stream=true is the live streaming surface. The server emits a text/event-stream of transcript.text.delta events as each decode window completes, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.
| Measurement | p50 | p95 |
|---|---|---|
| TTFT (response headers → first non-empty delta) | 77 ms | 83 ms |
| Time to done event (response headers → done) | 639 ms | 1074 ms |
| Interim work (first delta → done) | 561 ms | 997 ms |
| Delta events per request | 4 | 7 (max) |
| Partial deltas (text != final) | 4 | — |
| e2e total (POST → [DONE]) | 1473 ms | 2053 ms |
| RTF (e2e ÷ audio duration) | 0.25 | 0.31 |
| well_formed | 100 / 100 | — |
Measured against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-02 from a Vancouver-area laptop. Inputs were the same 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) used by the cohere bench, pre-synthesized via tada-3b-ml on the same node. Raw runs: benchmarks/yvr-02-2026-06-02/whisper-large-v3-turbo/.
Delta cadence is chunk-driven. 4.2 – 5.9 s clips emit 4 deltas, 7.2 – 7.7 s clips emit 7. Each delta carries a non-overlapping span of the transcript; concatenating them reconstructs the final string. The first delta arriving at 77 ms after response headers is the meaningful TTFT for live-captioning pipelines. The gateway does not emit X-Pg-Inference-Ms on the stream surface yet, so server-only inference cannot be quoted client-side for streaming today.
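As a rough illustration of the delta contract described above, the sketch below concatenates delta events and compares the result with the done event. The wire shape is an assumption, not something this page specifies: it assumes each event arrives as a `data:` line holding JSON with a `type` field, a `delta` field on delta events, a `text` field on the done event, and a closing `data: [DONE]` sentinel. Check the Speech-to-Text API reference for the actual field names.

```python
import json


def collect_transcript(sse_lines):
    """Return (joined_deltas, final_text) from an iterable of SSE text lines.

    Assumed (hypothetical) payload shape:
      data: {"type": "transcript.text.delta", "delta": "..."}
      data: {"type": "transcript.text.done", "text": "..."}
      data: [DONE]
    """
    parts, final_text = [], None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and event: lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        if event.get("type") == "transcript.text.delta":
            parts.append(event.get("delta", ""))  # non-overlapping span
        elif event.get("type") == "transcript.text.done":
            final_text = event.get("text")
    return "".join(parts), final_text


# Synthetic stream for illustration only; not captured from the API.
sample = [
    'data: {"type": "transcript.text.delta", "delta": "Hello"}',
    "",
    'data: {"type": "transcript.text.delta", "delta": " world"}',
    "",
    'data: {"type": "transcript.text.done", "text": "Hello world"}',
    "",
    "data: [DONE]",
]
joined, final_text = collect_transcript(sample)
print(joined == final_text)
```

A live-captioning client would render each delta as it arrives rather than buffering; the buffered form here only makes the reconstruction property easy to check.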
How this compares
Same node, same inputs, same client as the cohere-transcribe-03-2026 benches:
| Model | Streaming first-partial p50 | ttft_done p50 | e2e total p50 | RTF (e2e) |
|---|---|---|---|---|
| cohere-transcribe-03-2026 | 44 ms | 560 ms | 2510 ms | 0.42 |
| whisper-large-v3-turbo | 77 ms | 639 ms | 1473 ms | 0.25 |
Quickstart
Edge endpoints accept your raw pg_* API key as a bearer token, with no token exchange. See Authentication.
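A minimal sketch of a request built with only the Python standard library, assuming the yvr-02 base URL from the benchmark above, the `file` multipart field, and query-string parameters as described on this page. The helper name `build_transcription_request`, the `pg_example` key, and the `audio/wav` part content type are illustrative assumptions.

```python
import urllib.request
import uuid
from urllib.parse import urlencode

# Region endpoint quoted in the benchmark notes; substitute your own region.
BASE_URL = "https://api.yvr-02.edge.polargrid.ai"


def build_transcription_request(api_key, wav_bytes, filename="clip.wav", **query):
    """Build (but do not send) a POST to /v1/audio/transcriptions.

    The audio goes in the multipart field "file"; everything else
    (model, sync, stream, response_format, ...) goes in the query string.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",  # assumed part content type
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{BASE_URL}/v1/audio/transcriptions?{urlencode(query)}"
    headers = {
        "Authorization": f"Bearer {api_key}",  # raw pg_* key, no token exchange
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical key and placeholder audio bytes, for illustration only.
req = build_transcription_request(
    "pg_example", b"RIFF", model="whisper-large-v3-turbo", sync="true"
)
print(req.full_url)
# To send it: urllib.request.urlopen(req), then read the response body.
```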
Endpoint modes
POST /v1/audio/transcriptions has three modes selected by query params (not multipart fields):
| Mode | Query | Response | Use when |
|---|---|---|---|
| Streaming | ?stream=true | text/event-stream of delta + done events | Live captioning; sub-utterance partial transcripts as the model decodes. This is what the headline bench above measures. |
| Sync | ?sync=true | 200 with the formatted transcript directly | Voice-agent path that needs the answer in one round trip. |
| Async (default) | none | 202 with {job_id, poll_url}; poll via GET ?job_id=... | Background batch transcription; no caller blocking. |
stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}.
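The mode rules above can be mirrored client-side so that an invalid combination fails before any audio is uploaded. This is a sketch only; `mode_params` is a hypothetical helper, not part of any PolarGrid SDK, and the server remains the authority on what it accepts.

```python
STREAMABLE_FORMATS = {"json", "text"}


def mode_params(stream=False, sync=False, response_format=None):
    """Return query params for the chosen mode, rejecting invalid combinations."""
    if stream and sync:
        raise ValueError("stream and sync are mutually exclusive (server returns 400)")
    if stream and response_format is not None and response_format not in STREAMABLE_FORMATS:
        raise ValueError("streaming requires response_format json or text")
    params = {}
    if stream:
        params["stream"] = "true"
    if sync:
        params["sync"] = "true"
    if response_format is not None:
        params["response_format"] = response_format
    return params  # an empty dict selects the async default


print(mode_params(stream=True, response_format="text"))
print(mode_params(sync=True, response_format="srt"))
print(mode_params())
```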
Sync benchmark (?sync=true)
The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-06-02/whisper-large-v3-turbo/.
| Measurement | p50 | p95 |
|---|---|---|
| End-to-end TTFB (with network + upload) | 1015 ms | 1519 ms |
| Server-only inference | 287 ms | 324 ms |
| Network leg (e2e − server) | 728 ms | — |
| Body transfer (JSON response) | 0.7 ms | 2.2 ms |
| RTF (server inference ÷ audio duration) | 0.050 | 0.064 |
Server-only inference is read from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.
The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 728 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
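The split in the table is plain arithmetic, sketched below with the p50 figures. The 5.9 s clip duration in the RTF line is an illustrative value inside the 4.2 – 7.7 s input range, not a measured per-request figure, so its result will not exactly match the table's p50 RTF (which is a median of per-request ratios).

```python
def network_leg_ms(e2e_ms, server_ms):
    """Client wall-clock minus server-only inference time."""
    return e2e_ms - server_ms


def rtf(elapsed_ms, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    return (elapsed_ms / 1000.0) / audio_seconds


print(network_leg_ms(1015, 287))   # 728, the p50 network leg in the table
print(round(rtf(287, 5.9), 3))     # server-only RTF for a hypothetical 5.9 s clip
```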
Capabilities
| Field | Value |
|---|---|
| Endpoint | POST /v1/audio/transcriptions |
| Multipart field | file (required) |
| Query params | model, language, prompt, temperature, response_format, punctuation, stream, sync |
| response_format | json (default), text, srt, vtt, verbose_json |
| Languages | Multilingual — auto-detected, or pin with language (e.g. en, fr, es) |
| Max batch size | 1 |
| Backend pod | inference-backend-triton-voice |
Response timing headers
PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:
| Header | Value |
|---|---|
| X-Pg-Inference-Ms | Integer milliseconds: wall-clock around the transcribe() call inside the gateway’s _handle_sync (i.e., excludes the audio upload + response transfer). |
| Server-Timing | Standard-shape inference;dur=<ms> entry carrying the same number. |
The network leg is (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
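A bench harness might read the two headers roughly as follows. This is a sketch: `inference_ms` is a hypothetical helper, and it assumes the Server-Timing entry has the form `inference;dur=<ms>` as described in the table. It returns None when neither header is present, which is the expected case on the streaming surface today.

```python
import re


def inference_ms(headers):
    """Return server-only inference time in ms from response headers, or None."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        return float(raw)
    # Fall back to the Server-Timing entry carrying the same number.
    match = re.search(r"inference;\s*dur=([0-9.]+)", lowered.get("server-timing", ""))
    return float(match.group(1)) if match else None


# Illustrative header values, using the p50 figures from the sync benchmark.
server = inference_ms({"X-Pg-Inference-Ms": "287"})
client_wall_clock_ms = 1015
print(client_wall_clock_ms - server)  # network leg in ms
```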
Model identifier
Call this model with the canonical id whisper-large-v3-turbo at /v1/audio/transcriptions. It has no short alias. The HuggingFace repo id openai/whisper-large-v3-turbo is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.
Notes
- License: Apache 2.0 (no auth required to pull weights).
- Runtime: served via faster-whisper, not raw transformers, for optimized CTranslate2 inference.
- For multilingual coverage and accuracy on accented speech, cohere-transcribe-03-2026 is the alternative; whisper-turbo’s edge is lowest full-utterance latency.
See also
- Speech-to-Text API — endpoint reference, formats, streaming contract
- Voice AI guide — building voice agents on PolarGrid
- Authentication — using your pg_* API key
- /v1/models — list all available models
