cohere-transcribe-03-2026 is a 2-billion-parameter multilingual speech-to-text model served on PolarGrid edge nodes via Triton’s Python backend. The voice pod runs transformers >= 5.4 to support this model; see backend/edge-production-setup/CLAUDE.md for the pod split rationale.
- HF repo: `CohereLabs/cohere-transcribe-03-2026`
- Modality: Speech-to-Text (streaming + sync)
- Backend: Triton `python` (voice pod)
- Model ID: `cohere-transcribe-03-2026` (use the full ID; the short alias `cohere-transcribe` is not routable on the gateway)
- Available regions: `yvr-02` (Blackwell production)
Headline benchmark
POST /v1/audio/transcriptions?stream=true is the live surface. The server emits a text/event-stream of transcript.text.delta events as the cohere handler’s internal decode loop completes each window, then closes with a single transcript.text.done event. Consumers can render the rolling transcript immediately instead of waiting for the final result.
| Measurement | p50 | p95 |
|---|---|---|
| TTFT (response headers → first non-empty delta) | 44 ms | 62 ms |
| Time to done event (response headers → done) | 560 ms | 856 ms |
| Interim work (first delta → done) | 517 ms | 816 ms |
| Delta events per request | 4 | 7 (max) |
| Partial deltas (text != final) | 4 | — |
| e2e total (POST → [DONE]) | 2510 ms | 3523 ms |
| RTF (e2e ÷ audio duration) | 0.42 | 0.53 |
| well_formed | 100 / 100 | — |
Measured against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-28 from a Vancouver-area laptop. Inputs were 5 short utterances (4.2 – 7.7 s, 24 kHz mono WAV, ~200–350 KB each) pre-synthesized via tada-3b-ml on the same node; see bench/cohere-transcribe-03-2026/synthesize_inputs.py. Raw runs: benchmarks/yvr-02-2026-05-28/cohere-transcribe-03-2026/.
Delta cadence is chunk-driven. 4.2 – 5.9 s clips emit 4 deltas, 7.2 – 7.7 s clips emit 7. Each delta carries a non-overlapping span of the transcript; concatenating them reconstructs the final string. The first delta arriving at 44 ms after response headers is the meaningful TTFT for live-captioning pipelines. The gateway does not emit X-Pg-Inference-Ms on the stream surface yet, so server-only inference cannot be quoted client-side for streaming today.
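A minimal sketch of a consumer that rebuilds the rolling transcript from the stream. The event names and the trailing `[DONE]` marker come from this card; the JSON payload shape is an assumption: each `data:` payload is taken to be an object with a `type` field, plus a `delta` field on delta events and a `text` field on the done event. Check the Speech-to-Text API reference for the actual field names before relying on it.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # Per the SSE format, one leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            buf.append(value)
        elif line == "" and buf:
            # A blank line terminates the current event.
            yield "\n".join(buf)
            buf = []
    if buf:
        yield "\n".join(buf)


def assemble_transcript(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (concatenated deltas, text of the done event or None)."""
    parts = []
    final = None
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "transcript.text.delta":
            parts.append(event.get("delta", ""))  # assumed field name
        elif kind == "transcript.text.done":
            final = event.get("text")  # assumed field name
    return "".join(parts), final
```

Because deltas are described as non-overlapping spans, plain concatenation is enough; a live-captioning UI would render the joined string after each delta rather than waiting for the done event.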
How this compares
| Provider | Streaming first-partial p50 | Sync RTF | Notes | Source |
|---|---|---|---|---|
| PolarGrid cohere-transcribe-03-2026 on Blackwell | 44 ms after response headers | 0.041 server-only | 14 languages, sync + streaming surfaces | this card |
| Deepgram Nova-3 | sub-300 ms over WebSocket from end-of-speech | n/a | WebSocket protocol streams audio in continuously | artificialanalysis.ai |
| AssemblyAI Universal-3 | n/a | ~0.008 to 0.05 batch | Batch transcription tier | assemblyai.com |
Quickstart
Edge endpoints accept your raw `pg_*` API key as a bearer token, with no token exchange. See Authentication.
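A minimal sketch of a request against the yvr-02 endpoint, using only the Python standard library. It builds the request without sending it. The `build_transcription_request` helper, the `clip.wav` filename, and the `audio/wav` part content type are illustrative assumptions; the endpoint path, the `file` multipart field, the `model` query param, and the bearer-token scheme come from this card.

```python
import urllib.parse
import urllib.request
import uuid

BASE_URL = "https://api.yvr-02.edge.polargrid.ai"
MODEL_ID = "cohere-transcribe-03-2026"  # full ID; the short alias is not routable


def build_transcription_request(api_key, wav_bytes, filename="clip.wav", **query):
    """Build (but do not send) a multipart POST to /v1/audio/transcriptions."""
    # Mode and options travel as query params, not multipart fields.
    params = {"model": MODEL_ID, **query}
    url = f"{BASE_URL}/v1/audio/transcriptions?{urllib.parse.urlencode(params)}"

    # Hand-rolled multipart body with a single "file" part.
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",  # assumed part content type
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Authorization": f"Bearer {api_key}",  # raw pg_* key, no token exchange
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Usage (performs a real network call, so it is left commented out):
# with open("clip.wav", "rb") as f:
#     req = build_transcription_request("pg_your_key", f.read(), sync="true")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```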
Endpoint modes
POST /v1/audio/transcriptions has three modes selected by query params (not multipart fields):
| Mode | Query | Response | Use when |
|---|---|---|---|
| Streaming | ?stream=true | text/event-stream of delta + done events | Live captioning; sub-utterance partial transcripts as the model decodes. This is what the headline bench above measures. |
| Sync | ?sync=true | 200 with the formatted transcript directly | Voice-agent path that needs the answer in one round trip. |
| Async (default) | none | 202 with {job_id, poll_url}; poll via GET ?job_id=... | Background batch transcription; no caller blocking. |
stream and sync are mutually exclusive — passing both returns 400. Streaming requires response_format in {json, text}.
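The mode rules above can be checked client-side before a request is sent. The sketch below is a hypothetical helper (`mode_query` is not part of any PolarGrid SDK) that mirrors the two documented constraints and returns the query string for the chosen mode.

```python
from urllib.parse import urlencode

# Formats the card lists as valid for the streaming surface.
STREAMABLE_FORMATS = {"json", "text"}


def mode_query(stream=False, sync=False, response_format="json"):
    """Return the query string for the chosen mode, or raise on an invalid combination."""
    if stream and sync:
        # The gateway would answer 400; fail early instead.
        raise ValueError("stream and sync are mutually exclusive")
    if stream and response_format not in STREAMABLE_FORMATS:
        raise ValueError("streaming requires response_format json or text")

    params = {"response_format": response_format}
    if stream:
        params["stream"] = "true"
    elif sync:
        params["sync"] = "true"
    # Neither flag set: async default, the server replies 202 with a job_id.
    return urlencode(params)
```

Calling `mode_query(sync=True, response_format="srt")` yields `response_format=srt&sync=true`, while `mode_query(stream=True, response_format="srt")` raises before any audio is uploaded.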
Sync benchmark (?sync=true)
The blocking surface holds the connection open until inference completes, then ships the whole JSON response. Client-side TTFB ≈ total wall-clock, and the meaningful split is server inference time vs network leg (which for STT is dominated by the audio upload). Raw runs: benchmarks/yvr-02-2026-05-27/cohere-transcribe-03-2026/.
| Measurement | p50 | p95 |
|---|---|---|
| End-to-end TTFB (with network + upload) | 1068 ms | 1535 ms |
| Server-only inference | 238 ms | 305 ms |
| Network leg (e2e − server) | 831 ms | — |
| Body transfer (JSON response) | 0.7 ms | 1.6 ms |
| RTF (server inference ÷ audio duration) | 0.041 | 0.056 |
Server-only inference is read from the X-Pg-Inference-Ms response header (PR #507), available on the sync surface.
The network leg is upload-dominated. Each request ships a multi-second WAV file before inference can begin, so the 831 ms p50 network figure is largely the upload time of a ~300 KB body — not POP-to-client RTT. For shorter clips (sub-2 s) the network leg shrinks proportionally. Quote the server-only RTF when comparing inference throughput against centralized providers.
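The arithmetic behind the table can be sketched as follows. The helper names are illustrative, and the numbers are the p50 values from the table. Percentiles of per-request values need not equal the same operation applied to the p50s, which would explain why subtracting the two p50s gives 830 ms rather than the reported 831 ms.

```python
def rtf(processing_ms: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return (processing_ms / 1000.0) / audio_seconds


def network_leg_ms(e2e_ms: float, server_ms: float) -> float:
    """Client wall-clock minus server-only inference time."""
    return e2e_ms - server_ms


def implied_audio_seconds(processing_ms: float, rtf_value: float) -> float:
    """Audio duration implied by a processing time and an RTF."""
    return (processing_ms / 1000.0) / rtf_value


# p50 figures from the sync table above.
print(network_leg_ms(1068, 238))                    # difference of the two p50s
print(round(implied_audio_seconds(238, 0.041), 1))  # rough clip length in seconds
```

At 238 ms of inference and an RTF of 0.041, the implied clip length is a little under 6 s, consistent with the multi-second WAV inputs described above.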
Capabilities
| Field | Value |
|---|---|
| Endpoint | POST /v1/audio/transcriptions |
| Multipart field | file (required) |
| Query params | model, language, prompt, temperature, response_format, punctuation, stream, sync |
| response_format | json (default), text, srt, vtt, verbose_json |
| Languages | English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, Arabic |
| Punctuation toggle | Yes; pass `punctuation=true` or `punctuation=false` |
| Max batch size | 1 |
| Backend pod | inference-backend-triton-voice |
Response timing headers
PR #507 added two response headers that bench harnesses and observability tooling can read to get a server-only inference time without inferring it from the body:
| Header | Value |
|---|---|
| X-Pg-Inference-Ms | Integer milliseconds: wall-clock around the transcribe() call inside the gateway’s _handle_sync (i.e., excludes the audio upload + response transfer). |
| Server-Timing | Standard-shape inference;dur=<ms> entry carrying the same number. |
The network leg is then (client wall-clock) − (X-Pg-Inference-Ms), the same e2e-vs-server split available for LLM via pg_metadata.
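A minimal sketch of how a bench harness might read these headers and compute the split. The function names are hypothetical. The parser prefers X-Pg-Inference-Ms and falls back to the `inference` entry of Server-Timing; it returns `None` when neither is present, which is what a caller would see on the streaming surface today.

```python
import re
from typing import Mapping, Optional

# Matches an "inference" metric and captures its dur parameter.
_INFERENCE_DUR = re.compile(r"(?:^|,)\s*inference\s*;[^,]*?\bdur=([0-9.]+)")


def inference_ms_from_headers(headers: Mapping[str, str]) -> Optional[float]:
    """Return server-only inference time in ms, or None if no header carries it."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    raw = lowered.get("x-pg-inference-ms")
    if raw is not None:
        return float(raw)
    match = _INFERENCE_DUR.search(lowered.get("server-timing", ""))
    return float(match.group(1)) if match else None


def split_latency(client_wall_ms: float, headers: Mapping[str, str]) -> Optional[dict]:
    """Split client wall-clock into server inference and network leg."""
    server_ms = inference_ms_from_headers(headers)
    if server_ms is None:
        return None
    return {"server_ms": server_ms, "network_ms": client_wall_ms - server_ms}
```

With `urllib`, the mapping can be built as `dict(resp.headers.items())`, and the client wall-clock is measured around the request with `time.perf_counter()`.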
Model identifier
Call this model with the full ID `cohere-transcribe-03-2026` at /v1/audio/transcriptions. The short alias `cohere-transcribe` is not routable on the gateway (returns 404). The HuggingFace repo ID `CohereLabs/cohere-transcribe-03-2026` is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time.
Notes
- License: Apache 2.0 (no auth required to pull weights).
- Voice pod isolation: this model needs `transformers >= 5.4`, which conflicts with hume-tada’s `< 5` pin and the LLM pod’s vLLM 0.17.x. That’s why PolarGrid splits voice/LLM/TADA into three Triton pods; see `backend/edge-production-setup/CLAUDE.md`.
- For lower full-utterance latency, `whisper-large-v3-turbo` is the alternative; Cohere’s edge is multilingual coverage + accuracy on accented speech.
See also
- Speech-to-Text API: endpoint reference, formats, streaming contract
- Voice AI guide: building voice agents on PolarGrid
- Authentication: using your `pg_*` API key
- `/v1/models`: list all available models
