Whisper Large V3 Turbo

Whisper Large V3 Turbo (whisper-large-v3-turbo) is OpenAI’s 809M-parameter Whisper model, served on PolarGrid edge nodes via Triton’s python backend using the faster-whisper runtime. It is co-resident on the voice pod alongside cohere-transcribe-03-2026 and the kokoro-82m TTS model — see backend/edge-production-setup/CLAUDE.md for the pod split rationale.

HF repo: openai/whisper-large-v3-turbo
Modality: Speech-to-Text (streaming + sync)
Backend: Triton python (voice pod)
Parameters: 809M

Available regions: all regions except dfw-02 — see Model availability

June 11 update: server-side STT inference halved

Re-benched 2026-06-11 on the same node, harness, and input corpus as the previous run. Server-only inference dropped 287 → 145 ms p50 (RTF 0.050 → 0.025, ~40× real-time) after the in-process audio resampling fix on the STT hot path (PR #612) — the gateway previously paid an out-of-process resample on every non-16 kHz upload. Streaming time-to-done improved 639 → 582 ms p50; TTFT unchanged at ~78 ms. Raw runs: benchmarks/yvr-02-2026-06-11/.

Headline benchmark

POST /v1/audio/transcriptions?stream=true is the live streaming surface. The server emits a text/event-stream of transcript.text.delta events as each decode window completes, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.

Measurement	p50	p95
TTFT (response headers → first non-empty delta)	78 ms	83 ms
Time to `done` event (response headers → done)	582 ms	921 ms
Interim work (first delta → done)	504 ms	841 ms
Delta events per request	4	7 (max)
Partial deltas (text != final)	4	—
e2e total (POST → `[DONE]`)	1304 ms	1994 ms
RTF (e2e ÷ audio duration)	0.23	0.30
well_formed	100 / 100	—

Bench: 100 streaming transcription runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-06-11 from a Vancouver-area laptop. Inputs were the same 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) used by the cohere bench, pre-synthesized via tada-3b-ml on the same node. Raw runs: benchmarks/yvr-02-2026-06-11/whisper-large-v3-turbo/.

Delta cadence is chunk-driven. 4.2 – 5.9 s clips emit 4 deltas, 7.2 – 7.7 s clips emit 7. Deltas are provisional display hints — a delta carries either the new suffix or a full revised hypothesis, and the two are not distinguishable from the event alone, so do not concatenate them; take the final transcript from done.text (see Delta semantics). The first delta arriving at 78 ms after response headers is the meaningful TTFT for live-captioning pipelines. The gateway does not emit X-Pg-Inference-Ms on the stream surface yet, so server-only inference cannot be quoted client-side for streaming today.
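The delta rule above can be sketched as a small consumer. This is a minimal sketch, not the SDK's implementation. It assumes each event arrives as a standard SSE `data:` line carrying a JSON object with a `type` field, that the delta text sits under a `delta` key, and that the stream ends with a `[DONE]` data line; those payload details are assumptions, and only the event names and `done.text` come from this page. The helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a text/event-stream."""
    buf: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event accumulated so far.
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single optional space after the colon.
            buf.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


def reduce_transcript(payloads: Iterable[str]) -> Tuple[List[str], Optional[str]]:
    """Collect provisional hints and return the final text from the done event."""
    hints: List[str] = []
    final: Optional[str] = None
    for payload in payloads:
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "transcript.text.delta":
            # Assumed field name. Hints are kept separate, never concatenated.
            hints.append(event.get("delta", ""))
        elif kind == "transcript.text.done":
            final = event.get("text")
    return hints, final
```

A UI would typically show the most recent hint while the stream is open and swap in the final text once the done event arrives.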

Live WebSocket benchmark

wss://…/v1/audio/transcriptions/ws transcribes while audio is being captured: the client streams 16 kHz mono PCM frames at microphone pace and partial transcripts arrive during the utterance, so transcription does not wait for an upload to finish. Benched 2026-06-11 with the same five clips streamed in 100 ms frames at real-time pace, 100 runs after a 5-run warmup. Because frames are sent at real-time pace, end-to-end time can never be shorter than the audio itself; the numbers that matter are the partial cadence and the end-of-speech lag:

Metric	p50	p95
First partial (first frame sent → first non-empty delta)	1023 ms	1042 ms
Partials while audio still flowing	4	7 (max, 7.2–7.7 s clips)
`stop` → authoritative `done`	159 ms	205 ms
well_formed	100 / 100	—

Raw runs: whisper-large-v3-turbo/whisper-large-v3-turbo_ws_bench.json. Harness: bench/whisper-large-v3-turbo/bench_ws.py. The first partial lands at the ~1 s window boundary (partials are emitted per second of new audio) and every run produced partials while audio was still arriving. The 159 ms p50 stop→done figure is the effective end-of-speech-to-transcript latency — the number to compare against streaming STT providers. The upload surfaces instead pay full-clip transcription after the audio ends (~580 ms to done on the SSE surface). Sessions cap at 120 s of audio; wire protocol and delta semantics in the Speech-to-Text API reference.
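For a sense of what the bench client sends, the framing arithmetic can be sketched as follows. This is an illustrative sketch under the figures stated on this page (16 kHz mono int16 PCM, 100 ms frames, 120 s session cap). The function names are hypothetical, and the actual message envelope and `stop` signal are defined in the Speech-to-Text API reference, not here.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # 16 kHz mono, per this page
BYTES_PER_SAMPLE = 2     # int16
MAX_SESSION_S = 120      # session cap, per this page


def frame_bytes(frame_ms: int = 100) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000


def iter_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration frames; the last frame may be short."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM length is not a whole number of int16 samples")
    if len(pcm) > SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * MAX_SESSION_S:
        raise ValueError("audio exceeds the 120 s session cap")
    size = frame_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At these settings a 100 ms frame is 3200 bytes, so a 4.2 s clip becomes 42 frames. A client emulating a microphone would send one frame, wait `frame_ms`, and repeat.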

How this compares

Same node, same inputs, same client as the cohere-transcribe-03-2026 benches:

Model	Streaming first-partial p50	Time to `done` p50	e2e total p50	RTF (e2e)
`cohere-transcribe-03-2026`	44 ms	560 ms	2510 ms	0.42
`whisper-large-v3-turbo`	77 ms	639 ms	1473 ms	0.25

Cohere reaches its first partial sooner; whisper finishes the full streamed transcript materially faster end-to-end (1473 vs 2510 ms p50) at a lower real-time factor. Pick cohere for the earliest possible interim display and for multilingual coverage and accuracy on accented speech; pick whisper for the fastest full-utterance completion. Both rows are from the pre-PR-#612 runs (2026-05-28 / 2026-06-02), so they remain apples-to-apples. Whisper’s post-fix numbers are lower (see the headline table above); cohere has not yet been re-benched on the fixed hot path.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication.

curl -X POST "https://api.yvr-02.edge.polargrid.ai/v1/audio/transcriptions?sync=true&model=whisper-large-v3-turbo" \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -F "file=@input.wav"

import { PolarGrid } from "@polargrid/polargrid-sdk";
import fs from "node:fs";

const client = await PolarGrid.create({ apiKey: process.env.POLARGRID_API_KEY });

const result = await client.audioTranscriptions({
  model: "whisper-large-v3-turbo",
  file: fs.createReadStream("input.wav"),
  // Default mode is async (returns job_id). Pass sync to block on the result.
  sync: true,
});
console.log(result.text);

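import asyncio
import os

from polargrid import PolarGrid


async def main():
    # The SDK calls are awaited, so they must run inside a coroutine.
    client = await PolarGrid.create(api_key=os.environ["POLARGRID_API_KEY"])

    with open("input.wav", "rb") as f:
        result = await client.audio_transcriptions({
            "model": "whisper-large-v3-turbo",
            "file": f,
            "sync": True,
        })
    print(result["text"])


asyncio.run(main())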

Endpoint modes

POST /v1/audio/transcriptions has three upload modes selected by query params (not multipart fields), plus a live WebSocket surface:

Mode	Surface	Response	Use when
Live (WebSocket)	`wss://…/v1/audio/transcriptions/ws`	`transcript.text.delta` events while audio is still arriving, `done` after `stop`	Voice agents and live captioning from a microphone — partials during speech, final transcript ~160 ms after end-of-speech.
Streaming	`?stream=true`	`text/event-stream` of delta + done events	Rolling display over a complete file upload.
Sync	`?sync=true`	`200` with the formatted transcript directly	Voice-agent path that needs the answer in one round trip.
Async (default)	none	`202` with `{job_id, poll_url}`; poll via `GET ?job_id=...`	Background batch transcription; no caller blocking.

stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}. The WebSocket surface takes 16 kHz mono int16 PCM frames and caps sessions at 120 s — see the Speech-to-Text API reference for the wire protocol and delta semantics.
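The mode rules above can be mirrored client-side so a bad combination fails before the upload starts. This is a minimal sketch with a hypothetical helper name. It encodes only the constraints stated on this page and makes no claim about how the gateway itself validates requests.

```python
from typing import Dict

ALL_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}
STREAM_FORMATS = {"json", "text"}


def build_query(model: str, *, stream: bool = False, sync: bool = False,
                response_format: str = "json") -> Dict[str, str]:
    """Build query params for POST /v1/audio/transcriptions."""
    if stream and sync:
        # The server answers 400 for this combination.
        raise ValueError("stream and sync are mutually exclusive")
    if response_format not in ALL_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    if stream and response_format not in STREAM_FORMATS:
        raise ValueError("streaming requires response_format json or text")
    params = {"model": model, "response_format": response_format}
    if stream:
        params["stream"] = "true"
    if sync:
        params["sync"] = "true"
    # Neither flag set means the default async mode (202 with a job_id).
    return params
```

The result can be passed to `urllib.parse.urlencode` or to an HTTP client's `params` argument; the audio itself still goes in the multipart `file` field.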

Sync benchmark (`?sync=true`)

The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-06-11/whisper-large-v3-turbo/.

Measurement	p50	p95
End-to-end TTFB (with network + upload)	909 ms	1329 ms
Server-only inference	145 ms	176 ms
Network leg (e2e − server)	765 ms	—
Body transfer (JSON response)	1.1 ms	2.4 ms
RTF (server inference ÷ audio duration)	0.025	0.031

Server-only timing comes from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.

The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 765 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
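The RTF figures reduce to simple arithmetic, sketched below. The 5.8 s clip length is an illustrative assumption chosen because it reproduces the published p50 RTF; it is not a per-run figure from the bench, whose clips range from 4.2 to 7.7 s.

```python
def rtf(inference_ms: float, audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return (inference_ms / 1000.0) / audio_s


def realtime_multiple(rtf_value: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf_value


ASSUMED_CLIP_S = 5.8  # illustrative clip length, not a bench figure

post_fix = rtf(145, ASSUMED_CLIP_S)  # about 0.025
pre_fix = rtf(287, ASSUMED_CLIP_S)   # about 0.05
speedup = realtime_multiple(post_fix)  # about 40x real time
```

Because server-only RTF excludes the upload, it is the figure that stays comparable across clients on different networks.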

Capabilities

Field	Value
Endpoint	`POST /v1/audio/transcriptions`
Multipart field	`file` (required)
Query params	`model`, `language`, `prompt`, `temperature`, `response_format`, `punctuation`, `stream`, `sync`
`response_format`	`json` (default), `text`, `srt`, `vtt`, `verbose_json`
Languages	Multilingual — auto-detected, or pin with `language` (e.g. `en`, `fr`, `es`)
Max batch size	1
Backend pod	`inference-backend-triton-voice`

Response timing headers

PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:

Header	Value
`X-Pg-Inference-Ms`	Integer milliseconds — wall-clock around the `transcribe()` call inside the gateway’s `_handle_sync` (i.e., excludes the audio upload + response transfer).
`Server-Timing`	Standard-shape `inference;dur=<ms>` entry carrying the same number.

These let callers compute the network leg as (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
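A bench harness might read the two headers along these lines. This is a sketch with hypothetical function names. It assumes the headers arrive as a plain string mapping, and falls back to the `Server-Timing` entry when `X-Pg-Inference-Ms` is absent (for example on the stream surface, which does not emit it yet).

```python
from typing import Mapping, Optional


def inference_ms(headers: Mapping[str, str]) -> Optional[float]:
    """Return server-only inference time in ms, or None if not reported."""
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        try:
            return float(raw)
        except ValueError:
            pass  # fall through to Server-Timing
    # Server-Timing is a comma-separated list of `name;key=value` entries.
    for entry in lowered.get("server-timing", "").split(","):
        parts = [p.strip() for p in entry.split(";")]
        if parts[0] != "inference":
            continue
        for param in parts[1:]:
            if param.startswith("dur="):
                try:
                    return float(param[len("dur="):])
                except ValueError:
                    return None
    return None


def network_leg_ms(wall_clock_ms: float,
                   headers: Mapping[str, str]) -> Optional[float]:
    """Client wall-clock minus server inference, as described above."""
    server = inference_ms(headers)
    return None if server is None else wall_clock_ms - server
```

For example, a request that took 900 ms on the client and reported `X-Pg-Inference-Ms: 150` has a 750 ms network leg (illustrative numbers, not bench results).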

Model identifier

Call this model with the canonical id whisper-large-v3-turbo at /v1/audio/transcriptions. It has no short alias. The HuggingFace repo id openai/whisper-large-v3-turbo is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.

Notes

License: Apache 2.0 (no auth required to pull weights).
Runtime: served via faster-whisper, not raw transformers, for optimized CTranslate2 inference.
For multilingual coverage and accuracy on accented speech, cohere-transcribe-03-2026 is the alternative — whisper-turbo’s edge is lowest full-utterance latency.
