Whisper Large V3 Turbo (whisper-large-v3-turbo) is OpenAI’s 809M-parameter Whisper model, served on PolarGrid edge nodes via Triton’s python backend using the faster-whisper runtime. It is co-resident on the voice pod alongside cohere-transcribe-03-2026 and the kokoro-82m TTS model — see backend/edge-production-setup/CLAUDE.md for the pod split rationale.
  • Available regions: yvr-02 (Blackwell production)

Headline benchmark

POST /v1/audio/transcriptions?stream=true is the live streaming surface. The server emits a text/event-stream of transcript.text.delta events as each decode window completes, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.
| Measurement | p50 | p95 |
| --- | --- | --- |
| TTFT (response headers → first non-empty delta) | 77 ms | 83 ms |
| Time to done event (response headers → done) | 639 ms | 1074 ms |
| Interim work (first delta → done) | 561 ms | 997 ms |
| Delta events per request | 4 | 7 (max) |
| Partial deltas (text != final) | 4 | |
| e2e total (POST → [DONE]) | 1473 ms | 2053 ms |
| RTF (e2e ÷ audio duration) | 0.25 | 0.31 |
| well_formed | 100 / 100 | |
Bench: 100 streaming transcription runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-02 from a Vancouver-area laptop. Inputs were the same 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) used by the cohere bench, pre-synthesized via tada-3b-ml on the same node. Raw runs: benchmarks/yvr-02-2026-06-02/whisper-large-v3-turbo/.
Delta cadence is chunk-driven: clips of 4.2 to 5.9 s emit 4 deltas, and clips of 7.2 to 7.7 s emit 7. Each delta carries a non-overlapping span of the transcript, so concatenating them reconstructs the final string. The first delta arrives a median 77 ms after response headers, which is the meaningful TTFT for live-captioning pipelines. The gateway does not yet emit X-Pg-Inference-Ms on the streaming surface, so server-only inference time cannot be measured client-side for streaming requests today.
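A minimal client-side sketch of consuming this stream, in Python. It assumes each `data:` payload is a JSON object with a `type` field naming the event, a `delta` field on delta events, and a `text` field on the done event, and that the stream ends with the `[DONE]` sentinel mentioned in the table above. Those field names and both function names are illustrative assumptions, not a documented schema. The parser takes any iterable of decoded lines, so it can be fed from whichever HTTP client is in use.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Events are separated by blank lines; several data: lines within
    one event are joined with newlines.
    """
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buf:
                yield "\n".join(buf)
                buf = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buf.append(value[1:] if value.startswith(" ") else value)
    if buf:
        yield "\n".join(buf)


def collect_transcript(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (concatenated deltas, final text from the done event).

    ASSUMPTION: payloads look like
      {"type": "transcript.text.delta", "delta": "..."}
      {"type": "transcript.text.done", "text": "..."}
    """
    deltas = []
    final = None
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("type") == "transcript.text.delta":
            deltas.append(event.get("delta", ""))
        elif event.get("type") == "transcript.text.done":
            final = event.get("text")
    return "".join(deltas), final
```

Because the deltas are non-overlapping, the joined deltas should match the final text. To measure TTFT client-side, record a timestamp when response headers arrive and another when the first non-empty delta is appended.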

How this compares

Same node, same inputs, same client as the cohere-transcribe-03-2026 benches:
| Model | Streaming first-partial p50 | ttft_done p50 | e2e total p50 | RTF (e2e) |
| --- | --- | --- | --- | --- |
| cohere-transcribe-03-2026 | 44 ms | 560 ms | 2510 ms | 0.42 |
| whisper-large-v3-turbo | 77 ms | 639 ms | 1473 ms | 0.25 |

Cohere reaches its first partial sooner (44 ms vs 77 ms p50); whisper finishes the full streamed transcript materially faster end-to-end (1473 ms vs 2510 ms p50) at a lower real-time factor (0.25 vs 0.42). Pick cohere for the earliest possible interim display and for multilingual coverage and accented speech; pick whisper for the fastest full-utterance completion.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication.
curl -X POST "https://api.yvr-02.edge.polargrid.ai/v1/audio/transcriptions?sync=true&model=whisper-large-v3-turbo" \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -F "file=@input.wav"
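For callers who prefer Python, a rough equivalent of the curl call using only the standard library might look like the sketch below. The function name `build_sync_request` is hypothetical, and the `audio/wav` content type on the file part is an assumption. The URL, the bearer header, and the `file` multipart field mirror the command above.

```python
import urllib.request
import uuid

BASE = "https://api.yvr-02.edge.polargrid.ai"


def build_sync_request(wav_bytes: bytes, api_key: str,
                       model: str = "whisper-large-v3-turbo",
                       filename: str = "input.wav") -> urllib.request.Request:
    """Build (but do not send) a sync transcription request."""
    boundary = uuid.uuid4().hex
    # Hand-rolled multipart/form-data body with a single "file" part.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",  # assumed part content type
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    # Mode and model are query params, not multipart fields.
    url = f"{BASE}/v1/audio/transcriptions?sync=true&model={model}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. Reading the API key from the `POLARGRID_API_KEY` environment variable matches the curl example.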

Endpoint modes

POST /v1/audio/transcriptions has three modes selected by query params (not multipart fields):
| Mode | Query | Response | Use when |
| --- | --- | --- | --- |
| Streaming | ?stream=true | text/event-stream of delta + done events | Live captioning; sub-utterance partial transcripts as the model decodes. This is what the headline bench above measures. |
| Sync | ?sync=true | 200 with the formatted transcript directly | Voice-agent path that needs the answer in one round trip. |
| Async (default) | none | 202 with {job_id, poll_url}; poll via GET ?job_id=... | Background batch transcription; no caller blocking. |
stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}.
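The mode rules above can be checked client-side before a request is sent. The sketch below is one way to do that; the helper name `transcription_path` and its keyword arguments are hypothetical, and the checks only mirror the two constraints stated here.

```python
from urllib.parse import urlencode

ENDPOINT = "/v1/audio/transcriptions"
STREAMING_FORMATS = {"json", "text"}


def transcription_path(model, *, stream=False, sync=False, response_format="json"):
    """Build the path and query string for the chosen mode."""
    if stream and sync:
        # The server rejects this combination with a 400.
        raise ValueError("stream and sync are mutually exclusive")
    if stream and response_format not in STREAMING_FORMATS:
        raise ValueError("streaming requires response_format json or text")
    params = {"model": model, "response_format": response_format}
    if stream:
        params["stream"] = "true"
    elif sync:
        params["sync"] = "true"
    # Neither flag set: async (default) mode, which returns 202 and a job id.
    return f"{ENDPOINT}?{urlencode(params)}"
```

Failing fast on the client avoids uploading a multi-second audio file only to receive an error.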

Sync benchmark (?sync=true)

The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-06-02/whisper-large-v3-turbo/.
| Measurement | p50 | p95 |
| --- | --- | --- |
| End-to-end TTFB (with network + upload) | 1015 ms | 1519 ms |
| Server-only inference | 287 ms | 324 ms |
| Network leg (e2e − server) | 728 ms | |
| Body transfer (JSON response) | 0.7 ms | 2.2 ms |
| RTF (server inference ÷ audio duration) | 0.050 | 0.064 |
Server-only timing comes from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.
The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 728 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
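To reproduce an RTF figure for your own clips, divide elapsed time by audio duration. The sketch below reads the duration from a WAV payload with Python's standard `wave` module; the helper names are illustrative, and the 5.74 s clip length in the usage note is a made-up example rather than one of the bench inputs.

```python
import io
import wave


def wav_duration_s(wav_bytes: bytes) -> float:
    """Audio duration in seconds, read from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def real_time_factor(elapsed_ms: float, audio_s: float) -> float:
    """Elapsed processing time divided by audio duration."""
    return (elapsed_ms / 1000.0) / audio_s
```

For instance, 287 ms of server inference on a hypothetical 5.74 s clip gives `real_time_factor(287, 5.74)`, roughly 0.05. Use server-only time for throughput comparisons and end-to-end time for user-perceived latency.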

Capabilities

| Field | Value |
| --- | --- |
| Endpoint | POST /v1/audio/transcriptions |
| Multipart field | file (required) |
| Query params | model, language, prompt, temperature, response_format, punctuation, stream, sync |
| response_format | json (default), text, srt, vtt, verbose_json |
| Languages | Multilingual; auto-detected, or pin with language (e.g. en, fr, es) |
| Max batch size | 1 |
| Backend pod | inference-backend-triton-voice |

Response timing headers

PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:
| Header | Value |
| --- | --- |
| X-Pg-Inference-Ms | Integer milliseconds: wall-clock time around the transcribe() call inside the gateway’s _handle_sync (i.e., excludes the audio upload and response transfer). |
| Server-Timing | Standard-shape inference;dur=<ms> entry carrying the same number. |
These let callers compute the network leg as (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
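A bench harness could compute that split along the following lines. This is a sketch: the function names are hypothetical, and `headers` is assumed to be a plain mapping of response header names to string values, as most HTTP clients expose.

```python
import re


def inference_ms(headers):
    """Server-only inference time in ms from response headers, or None."""
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        return float(raw)
    # Fall back to the Server-Timing entry: inference;dur=<ms>
    match = re.search(r"\binference\s*;[^,]*?\bdur=([0-9.]+)",
                      lowered.get("server-timing", ""))
    return float(match.group(1)) if match else None


def network_leg_ms(client_wall_ms, headers):
    """Client wall-clock minus server inference, or None if no header."""
    server = inference_ms(headers)
    return None if server is None else client_wall_ms - server
```

With the p50 figures from the sync bench, `network_leg_ms(1015, {"X-Pg-Inference-Ms": "287"})` gives 728. On the streaming surface, where the header is not emitted, the function returns None instead of guessing.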

Model identifier

Call this model with the canonical id whisper-large-v3-turbo at /v1/audio/transcriptions. It has no short alias. The HuggingFace repo id openai/whisper-large-v3-turbo is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.

Notes

  • License: Apache 2.0 (no auth required to pull weights).
  • Runtime: served via faster-whisper, not raw transformers, for optimized CTranslate2 inference.
  • For multilingual coverage and accuracy on accented speech, cohere-transcribe-03-2026 is the alternative — whisper-turbo’s edge is lowest full-utterance latency.

See also