Cohere Transcribe (cohere-transcribe-03-2026) is a 2-billion-parameter multilingual speech-to-text model served on PolarGrid edge nodes via Triton’s Python backend. The voice pod runs transformers >= 5.4 to support this model; see backend/edge-production-setup/CLAUDE.md for the pod split rationale.
  • HF repo: CohereLabs/cohere-transcribe-03-2026
  • Modality: Speech-to-Text (streaming + sync)
  • Backend: Triton python (voice pod)
  • Model ID: cohere-transcribe-03-2026 (use the full ID; the short alias cohere-transcribe is not routable on the gateway)
  • Available regions: yvr-02 (Blackwell production)

Headline benchmark

POST /v1/audio/transcriptions?stream=true is the live surface. The server emits a text/event-stream of transcript.text.delta events as the cohere handler’s internal decode loop completes each window, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.
| Measurement | p50 | p95 |
| --- | --- | --- |
| TTFT (response headers → first non-empty delta) | 44 ms | 62 ms |
| Time to done event (response headers → done) | 560 ms | 856 ms |
| Interim work (first delta → done) | 517 ms | 816 ms |
| Delta events per request | 4 | 7 (max) |
| Partial deltas (text != final) | 4 | |
| e2e total (POST → [DONE]) | 2510 ms | 3523 ms |
| RTF (e2e ÷ audio duration) | 0.42 | 0.53 |
| well_formed | 100 / 100 | |
Bench: 100 streaming transcription runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-28 from a Vancouver-area laptop. Inputs were 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) pre-synthesized via tada-3b-ml on the same node — see bench/cohere-transcribe-03-2026/synthesize_inputs.py. Raw runs: benchmarks/yvr-02-2026-05-28/cohere-transcribe-03-2026/.
Delta cadence is chunk-driven. 4.2 – 5.9 s clips emit 4 deltas, 7.2 – 7.7 s clips emit 7. Each delta carries a non-overlapping span of the transcript; concatenating them reconstructs the final string. The first delta arriving at 44 ms after response headers is the meaningful TTFT for live-captioning pipelines. The gateway does not emit X-Pg-Inference-Ms on the stream surface yet, so server-only inference cannot be quoted client-side for streaming today.
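A minimal client-side sketch of consuming this stream, in Python with only the standard library. The event names transcript.text.delta and transcript.text.done and the [DONE] terminator come from this card. The JSON payload fields type, delta, and text are assumptions (OpenAI-style naming), as is the framing of each event as a single data: line, so check both against a real response before relying on them.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from text/event-stream lines."""
    event, data = "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # a blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "", []
        elif line.startswith(":"):          # SSE comment / keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
    if data:                                # stream ended without a trailing blank line
        yield event, "\n".join(data)


def assemble_transcript(lines: Iterable[str]) -> Tuple[str, List[str]]:
    """Concatenate delta spans and return (final_text, deltas)."""
    deltas: List[str] = []
    final = None
    for event, data in iter_sse_events(lines):
        if data == "[DONE]":
            break
        payload = json.loads(data)
        kind = event or payload.get("type", "")
        if kind == "transcript.text.delta":
            deltas.append(payload.get("delta", ""))   # ASSUMED field name
        elif kind == "transcript.text.done":
            final = payload.get("text")               # ASSUMED field name
    joined = "".join(deltas)
    return (final if final is not None else joined), deltas


# Hypothetical stream used only to exercise the parser offline.
SAMPLE_STREAM = [
    'data: {"type": "transcript.text.delta", "delta": "Hello "}\n',
    "\n",
    'data: {"type": "transcript.text.delta", "delta": "world."}\n',
    "\n",
    'data: {"type": "transcript.text.done", "text": "Hello world."}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]

text, deltas = assemble_transcript(SAMPLE_STREAM)
```

Because the deltas are non-overlapping, a live-captioning UI can append each one as it arrives and use the done event only as a consistency check against the concatenated string.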

How this compares

| Provider | Streaming first-partial p50 | Sync RTF | Notes | Source |
| --- | --- | --- | --- | --- |
| PolarGrid cohere-transcribe-03-2026 on Blackwell | 44 ms after response headers | 0.041 server-only | 14 languages, sync + streaming surfaces | this card |
| Deepgram Nova-3 | sub-300 ms over WebSocket from end-of-speech | n/a | WebSocket protocol streams audio in continuously | artificialanalysis.ai |
| AssemblyAI Universal-3 | n/a | ~0.008 to 0.05 batch | Batch transcription tier | assemblyai.com |
PolarGrid’s 44 ms streaming TTFT is gated by the multipart audio upload arriving first; total client wall-clock to first partial is closer to 1970 ms p50 (TTFB 1926 ms + first-delta 44 ms) on a 300 KB WAV. Deepgram Nova-3 measures streaming TTFT from end-of-speech because the WebSocket protocol streams audio in continuously, which removes the upload component entirely. For mic-to-screen pipelines that need sub-300 ms first-partial from end-of-speech, an audio-streaming-in protocol (WebSocket or chunked-upload) is the missing piece on PolarGrid today. For batch-upload workloads PolarGrid’s server-only RTF of 0.041 is in the same tier as AssemblyAI Universal-3 batch.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication.
```bash
curl -X POST "https://api.yvr-02.edge.polargrid.ai/v1/audio/transcriptions?sync=true&model=cohere-transcribe-03-2026" \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -F "file=@input.wav"
```
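The same request can be sketched in Python using only the standard library. The URL, query parameters, bearer header, and the multipart field name file are taken from the curl call above. The helper names build_request and transcribe_file, the audio/wav part content type, and the 60-second timeout are illustrative assumptions rather than anything the API mandates.

```python
import urllib.parse
import urllib.request
import uuid

BASE_URL = "https://api.yvr-02.edge.polargrid.ai"
MODEL_ID = "cohere-transcribe-03-2026"


def build_request(wav_bytes: bytes, api_key: str,
                  filename: str = "input.wav") -> urllib.request.Request:
    """Build (but do not send) a sync transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"     # ASSUMED part content type
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    # model and sync travel as query params, not multipart fields.
    query = urllib.parse.urlencode({"sync": "true", "model": MODEL_ID})
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions?{query}",
        data=head + wav_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe_file(path: str, api_key: str) -> str:
    """Send the request and return the raw response body as text."""
    with open(path, "rb") as f:
        req = build_request(f.read(), api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Calling transcribe_file("input.wav", os.environ["POLARGRID_API_KEY"]) would mirror the curl invocation. Splitting request construction from sending keeps the multipart framing inspectable without touching the network.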

Endpoint modes

POST /v1/audio/transcriptions has three modes selected by query params (not multipart fields):
| Mode | Query | Response | Use when |
| --- | --- | --- | --- |
| Streaming | ?stream=true | text/event-stream of delta + done events | Live captioning; sub-utterance partial transcripts as the model decodes. This is what the headline bench above measures. |
| Sync | ?sync=true | 200 with the formatted transcript directly | Voice-agent path that needs the answer in one round trip. |
| Async (default) | none | 202 with {job_id, poll_url}; poll via GET ?job_id=... | Background batch transcription; no caller blocking. |
stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}.
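A client can mirror these rules locally and fail fast before uploading any audio. The sketch below is a hypothetical helper (transcription_query is not part of any PolarGrid SDK) that builds the query string for each mode and rejects the two combinations the gateway refuses. It assumes an omitted response_format falls back to json, as listed under Capabilities.

```python
from typing import Optional
from urllib.parse import urlencode

STREAMABLE_FORMATS = {"json", "text"}


def transcription_query(model: str, stream: bool = False, sync: bool = False,
                        response_format: Optional[str] = None) -> str:
    """Return the query string for /v1/audio/transcriptions.

    Neither flag set selects the default async mode.
    """
    if stream and sync:
        # The gateway answers 400 for this combination; catch it client-side.
        raise ValueError("stream and sync are mutually exclusive")
    effective_format = response_format or "json"
    if stream and effective_format not in STREAMABLE_FORMATS:
        raise ValueError("streaming requires response_format json or text")
    params = {"model": model}
    if stream:
        params["stream"] = "true"
    if sync:
        params["sync"] = "true"
    if response_format:
        params["response_format"] = response_format
    return "?" + urlencode(params)
```

For example, transcription_query("cohere-transcribe-03-2026", sync=True) yields ?model=cohere-transcribe-03-2026&sync=true, while passing stream=True together with response_format="srt" raises before any request is made.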

Sync benchmark (?sync=true)

The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-05-27/cohere-transcribe-03-2026/.
| Measurement | p50 | p95 |
| --- | --- | --- |
| End-to-end TTFB (with network + upload) | 1068 ms | 1535 ms |
| Server-only inference | 238 ms | 305 ms |
| Network leg (e2e − server) | 831 ms | |
| Body transfer (JSON response) | 0.7 ms | 1.6 ms |
| RTF (server inference ÷ audio duration) | 0.041 | 0.056 |
Server-only timing comes from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.
The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 831 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
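The two derived rows of the table are simple arithmetic, sketched here so the split can be reproduced from raw runs. The function names are illustrative. The 5.8 s audio duration is an assumed, representative clip length (the sync bench inputs are not stated on this card), chosen only to show how a 238 ms inference time maps to an RTF of about 0.041.

```python
def network_leg_ms(e2e_ms: float, server_ms: float) -> float:
    """Client wall-clock minus server-reported inference time."""
    return e2e_ms - server_ms


def rtf(elapsed_ms: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return elapsed_ms / (audio_seconds * 1000.0)


# p50 figures from the sync benchmark table.
p50_network = network_leg_ms(1068, 238)      # 830 ms

# ASSUMED 5.8 s clip, for illustration only.
p50_server_rtf = round(rtf(238, 5.8), 3)     # 0.041
```

Subtracting the two p50 values gives 830 ms, 1 ms off the table's 831 ms. That gap is consistent with rounding, or with the table taking the median of per-request differences rather than the difference of medians.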

Capabilities

| Field | Value |
| --- | --- |
| Endpoint | POST /v1/audio/transcriptions |
| Multipart field | file (required) |
| Query params | model, language, prompt, temperature, response_format, punctuation, stream, sync |
| response_format | json (default), text, srt, vtt, verbose_json |
| Languages | English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, Arabic |
| Punctuation toggle | Yes; pass punctuation=true or punctuation=false |
| Max batch size | 1 |
| Backend pod | inference-backend-triton-voice |

Response timing headers

PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:
| Header | Value |
| --- | --- |
| X-Pg-Inference-Ms | Integer milliseconds: wall-clock time around the transcribe() call inside the gateway’s _handle_sync (i.e., excludes the audio upload + response transfer). |
| Server-Timing | Standard-shape inference;dur=<ms> entry carrying the same number. |
These let callers compute the network leg as (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
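A bench harness might read these headers as follows. This is a sketch, not the project's harness: the header names and the inference;dur=<ms> shape come from the table above, while the function names and the fallback order (X-Pg-Inference-Ms first, then Server-Timing) are assumptions.

```python
from typing import Mapping, Optional


def inference_ms(headers: Mapping[str, str]) -> Optional[float]:
    """Return server-only inference time in ms, or None if not reported."""
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        try:
            return float(raw)
        except ValueError:
            pass                              # malformed; try Server-Timing next
    # Server-Timing entries are comma-separated; params are ';'-separated.
    for entry in lowered.get("server-timing", "").split(","):
        parts = [p.strip() for p in entry.split(";")]
        if parts[0] != "inference":
            continue
        for param in parts[1:]:
            if param.startswith("dur="):
                try:
                    return float(param[len("dur="):])
                except ValueError:
                    return None
    return None


def split_latency(wall_clock_ms: float, headers: Mapping[str, str]):
    """Return (server_ms, network_ms); both None when no header is present."""
    server = inference_ms(headers)
    if server is None:
        return None, None
    return server, wall_clock_ms - server
```

On the streaming surface, where the gateway does not yet emit X-Pg-Inference-Ms, split_latency returns (None, None) instead of guessing.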

Model identifier

Call this model with the full id cohere-transcribe-03-2026 at /v1/audio/transcriptions. The short alias cohere-transcribe is not routable on the gateway (returns 404). The HuggingFace repo id CohereLabs/cohere-transcribe-03-2026 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.

Notes

  • License: Apache 2.0 (no auth required to pull weights).
  • Voice pod isolation: this model needs transformers >= 5.4, which conflicts with hume-tada’s < 5 pin and the LLM pod’s vLLM 0.17.x. That’s why PolarGrid splits voice/LLM/TADA into three Triton pods — see backend/edge-production-setup/CLAUDE.md.
  • For lower full-utterance latency, whisper-large-v3-turbo is the alternative — Cohere’s edge is multilingual coverage + accuracy on accented speech.

See also