Cohere Transcribe (03-2026)

Cohere Transcribe (cohere-transcribe-03-2026) is a 2-billion-parameter multilingual speech-to-text model served on PolarGrid edge nodes via Triton’s python backend. The voice pod runs transformers >= 5.4 to support this model — see backend/edge-production-setup/CLAUDE.md for the pod split rationale.

HF repo: CohereLabs/cohere-transcribe-03-2026
Modality: Speech-to-Text (streaming + sync)
Backend: Triton python (voice pod)
Model ID: cohere-transcribe-03-2026 (use the full ID; the short alias cohere-transcribe is not routable on the gateway)

Available regions: all regions except dfw-02 — see Model availability

Headline benchmark

POST /v1/audio/transcriptions?stream=true is the live surface. The server emits a text/event-stream of transcript.text.delta events as the cohere handler’s internal decode loop completes each window, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.

Measurement	p50	p95
TTFT (response headers → first non-empty delta)	44 ms	62 ms
Time to `done` event (response headers → done)	560 ms	856 ms
Interim work (first delta → done)	517 ms	816 ms
Delta events per request	4	7 (max)
Partial deltas (text != final)	4	—
e2e total (POST → `[DONE]`)	2510 ms	3523 ms
RTF (e2e ÷ audio duration)	0.42	0.53
well_formed	100 / 100	—

Bench: 100 streaming transcription runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-28 from a Vancouver-area laptop. Inputs were 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) pre-synthesized via tada-3b-ml on the same node — see bench/cohere-transcribe-03-2026/synthesize_inputs.py. Raw runs: benchmarks/yvr-02-2026-05-28/cohere-transcribe-03-2026/.

Delta cadence is chunk-driven. 4.2 – 5.9 s clips emit 4 deltas, 7.2 – 7.7 s clips emit 7. Each delta carries a non-overlapping span of the transcript; concatenating them reconstructs the final string. The first delta arriving at 44 ms after response headers is the meaningful TTFT for live-captioning pipelines. The gateway does not emit X-Pg-Inference-Ms on the stream surface yet, so server-only inference cannot be quoted client-side for streaming today.
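
The concatenation property above can be checked client-side with a small parser. This is a minimal sketch, not the SDK's implementation. The payload shape is an assumption that this card does not specify: each data: line is taken to carry a JSON object with a type field, plus delta for transcript.text.delta events or text for transcript.text.done, and a literal [DONE] sentinel may close the stream. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from decoded text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the SSE format drops one leading space
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
    if data:
        yield event, "\n".join(data)


def collect_transcript(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (concatenated deltas, text of the done event if one was seen)."""
    deltas, final = [], None
    for event, data in iter_sse_events(lines):
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        payload = json.loads(data)
        kind = payload.get("type", event)  # assumed: type lives in the JSON body
        if kind == "transcript.text.delta":
            deltas.append(payload["delta"])  # assumed field name
        elif kind == "transcript.text.done":
            final = payload["text"]  # assumed field name
    return "".join(deltas), final
```

Feeding it the decoded lines of a streaming response (for example, an HTTP client's line iterator) should return two equal strings whenever the deltas are non-overlapping, as described above.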

How this compares

Provider	Streaming first-partial p50	Sync RTF	Notes	Source
PolarGrid `cohere-transcribe-03-2026` on Blackwell	44 ms after response headers	0.041 server-only	14 languages, sync + streaming surfaces	this card
Deepgram Nova-3	sub-300 ms over WebSocket from end-of-speech	n/a	WebSocket protocol streams audio in continuously	artificialanalysis.ai
AssemblyAI Universal-3	n/a	~0.008 to 0.05 batch	Batch transcription tier	assemblyai.com

PolarGrid’s 44 ms streaming TTFT is gated by the multipart audio upload arriving first; total client wall-clock to first partial is closer to 1970 ms p50 (TTFB 1926 ms + first-delta 44 ms) on a 300 KB WAV. Deepgram Nova-3 measures streaming TTFT from end-of-speech because the WebSocket protocol streams audio in continuously, which removes the upload component entirely. For mic-to-screen pipelines that need sub-300 ms first-partial from end-of-speech, an audio-streaming-in protocol (WebSocket or chunked-upload) is the missing piece on PolarGrid today. For batch-upload workloads PolarGrid’s server-only RTF of 0.041 is in the same tier as AssemblyAI Universal-3 batch.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication.

curl -X POST "https://api.yvr-02.edge.polargrid.ai/v1/audio/transcriptions?sync=true&model=cohere-transcribe-03-2026" \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -F "file=@input.wav"

import { PolarGrid } from "@polargrid/polargrid-sdk";
import fs from "node:fs";

const client = await PolarGrid.create({ apiKey: process.env.POLARGRID_API_KEY });

const result = await client.audioTranscriptions({
  model: "cohere-transcribe-03-2026",
  file: fs.createReadStream("input.wav"),
  // Default mode is async (returns job_id). Pass sync to block on the result.
  sync: true,
});
console.log(result.text);

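import asyncio

from polargrid import PolarGrid


async def main() -> None:
    client = await PolarGrid.create(api_key="pg_...")

    with open("input.wav", "rb") as f:
        result = await client.audio_transcriptions({
            "model": "cohere-transcribe-03-2026",
            "file": f,
            # Default mode is async (returns job_id). Pass sync to block on the result.
            "sync": True,
        })
    print(result["text"])


# `await` is only valid inside a coroutine, so run the example through asyncio.
asyncio.run(main())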

Endpoint modes

POST /v1/audio/transcriptions has three modes selected by query params (not multipart fields):

Mode	Query	Response	Use when
Streaming	`?stream=true`	`text/event-stream` of delta + done events	Live captioning; sub-utterance partial transcripts as the model decodes. This is what the headline bench above measures.
Sync	`?sync=true`	`200` with the formatted transcript directly	Voice-agent path that needs the answer in one round trip.
Async (default)	none	`202` with `{job_id, poll_url}`; poll via `GET ?job_id=...`	Background batch transcription; no caller blocking.

stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}.
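
A client can catch both constraints before the request leaves the machine. The helper below is a hypothetical sketch (transcription_url is not part of the PolarGrid SDK). It only assembles the query string from parameters documented on this page and raises locally instead of waiting for the gateway to reject the request.

```python
from urllib.parse import urlencode

# Response formats the streaming mode accepts, per this page.
STREAMABLE_FORMATS = {"json", "text"}


def transcription_url(
    base_url: str,
    model: str,
    *,
    stream: bool = False,
    sync: bool = False,
    response_format: str = "json",
) -> str:
    """Build the /v1/audio/transcriptions URL for one of the three modes."""
    if stream and sync:
        raise ValueError("stream and sync are mutually exclusive")
    if stream and response_format not in STREAMABLE_FORMATS:
        raise ValueError("streaming requires response_format 'json' or 'text'")

    # Modes are selected by query params, not multipart fields.
    params = {"model": model, "response_format": response_format}
    if stream:
        params["stream"] = "true"
    if sync:
        params["sync"] = "true"
    # Neither flag set: the default async mode.
    return f"{base_url}/v1/audio/transcriptions?{urlencode(params)}"
```

Leaving both flags off produces the default async request, so the same helper covers all three rows of the table.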

Sync benchmark (`?sync=true`)

The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-05-27/cohere-transcribe-03-2026/.

Measurement	p50	p95
End-to-end TTFB (with network + upload)	1068 ms	1535 ms
Server-only inference	238 ms	305 ms
Network leg (e2e − server)	831 ms	—
Body transfer (JSON response)	0.7 ms	1.6 ms
RTF (server inference ÷ audio duration)	0.041	0.056

Server-only timing comes from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.

The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 831 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
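
To make the split concrete, here is the arithmetic for a single request, using the p50 figures from the table. The function name is hypothetical, and the 5.8 s audio duration is an assumed value chosen for illustration (the streaming bench clips run 4.2 to 7.7 s); this card does not state the duration behind any single sync run.

```python
from typing import Tuple


def split_sync_run(e2e_ms: float, server_ms: float, audio_s: float) -> Tuple[float, float]:
    """Return (network_leg_ms, server_only_rtf) for one sync request."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    network_leg_ms = e2e_ms - server_ms       # upload + response transfer + RTT
    rtf = (server_ms / 1000.0) / audio_s      # server inference / audio duration
    return network_leg_ms, rtf


# p50 values from the table; audio_s is an assumed clip length.
network_ms, rtf = split_sync_run(e2e_ms=1068, server_ms=238, audio_s=5.8)
print(f"network leg: {network_ms:.0f} ms, server-only RTF: {rtf:.3f}")
# prints "network leg: 830 ms, server-only RTF: 0.041"
```

Subtracting the two p50 values gives 830 ms, not the 831 ms in the table. Percentiles are not additive, so a p50 taken over per-run differences need not equal the difference of the two p50 values.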

Capabilities

Field	Value
Endpoint	`POST /v1/audio/transcriptions`
Multipart field	`file` (required)
Query params	`model`, `language`, `prompt`, `temperature`, `response_format`, `punctuation`, `stream`, `sync`
`response_format`	`json` (default), `text`, `srt`, `vtt`, `verbose_json`
Languages	English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, Arabic
Punctuation toggle	Yes (pass `punctuation=true|false`)
Max batch size	1
Backend pod	`inference-backend-triton-voice`

Response timing headers

PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:

Header	Value
`X-Pg-Inference-Ms`	Integer milliseconds — wall-clock around the `transcribe()` call inside the gateway’s `_handle_sync` (i.e., excludes the audio upload + response transfer).
`Server-Timing`	Standard-shape `inference;dur=<ms>` entry carrying the same number.

These let callers compute the network leg as (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
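
A bench harness might read the headers along these lines. This is a sketch under assumptions: the helper names are hypothetical, headers is any mapping of response header names to string values, and the Server-Timing fallback expects the inference;dur=<ms> shape described in the table.

```python
import re
from typing import Mapping, Optional

# Matches an "inference" metric inside a (possibly multi-metric) Server-Timing value.
_SERVER_TIMING_RE = re.compile(r"(?:^|,)\s*inference\s*;[^,]*?\bdur=([0-9.]+)")


def inference_ms(headers: Mapping[str, str]) -> Optional[float]:
    """Server-only inference time in ms, or None if neither header is present."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        return float(raw)
    match = _SERVER_TIMING_RE.search(lowered.get("server-timing", ""))
    return float(match.group(1)) if match else None


def network_leg_ms(client_wall_ms: float, headers: Mapping[str, str]) -> Optional[float]:
    """(client wall-clock) - (server inference), or None when the server time is unknown."""
    server = inference_ms(headers)
    return None if server is None else client_wall_ms - server
```

Returning None rather than 0 keeps streaming responses, which do not carry X-Pg-Inference-Ms yet, from being mistaken for zero-cost inference in aggregated results.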

Model identifier

Call this model with the full id cohere-transcribe-03-2026 at /v1/audio/transcriptions. The short alias cohere-transcribe is not routable on the gateway (returns 404). The HuggingFace repo id CohereLabs/cohere-transcribe-03-2026 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.

Notes

License: Apache 2.0 (no auth required to pull weights).
Voice pod isolation: this model needs transformers >= 5.4, which conflicts with hume-tada’s < 5 pin and the LLM pod’s vLLM 0.17.x. That’s why PolarGrid splits voice/LLM/TADA into three Triton pods — see backend/edge-production-setup/CLAUDE.md.
For lower full-utterance latency, whisper-large-v3-turbo is the alternative; Cohere’s advantage is multilingual coverage and accuracy on accented speech.
