Speech-to-Text
Transcribe audio files to text. A single endpoint serves three modes (default async, opt-in streaming, opt-in sync), controlled by query parameters.

Edge endpoints accept your pg_* API key as a bearer token. See Authentication for details. The cURL examples below pin Toronto (yto-01) for concreteness; substitute another region, or discover the fastest one via GET https://autorouter.polargrid.ai/v1/route.

Transcribe Audio
file. Everything else is a query parameter.
Query Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | STT model (e.g. whisper-large-v3-turbo, cohere-transcribe-03-2026) |
| language | string | No | - | ISO-639-1 language code |
| prompt | string | No | - | Context hint to guide transcription |
| response_format | string | No | json | json, text, srt, vtt, or verbose_json |
| temperature | number | No | 0 | Sampling temperature (0.0-1.0) |
| punctuation | boolean | No | - | Force/forbid punctuation in the output |
| stream | boolean | No | false | If true, return Server-Sent Events |
| sync | boolean | No | false | If true, block until completion and return the result inline |
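
As a rough illustration of how these parameters map onto a request URL, here is a minimal Python sketch. The helper name `build_transcription_url` and the `https://edge.example` base URL are hypothetical placeholders. The path `/v1/audio/transcriptions` is taken from the polling URL on this page and is assumed to be the same path that accepts the upload. Booleans are serialized as lowercase `true`/`false` to match the `?stream=true` form used here.

```python
from urllib.parse import urlencode

# Path taken from the polling URL documented on this page; assumed to be
# the same path that accepts the upload.
TRANSCRIPTIONS_PATH = "/v1/audio/transcriptions"


def build_transcription_url(base_url, model, **params):
    """Build the request URL; `model` is the only required parameter."""
    query = {"model": model}
    for name, value in params.items():
        if value is None:
            continue  # omit unset optional parameters entirely
        if isinstance(value, bool):
            value = "true" if value else "false"  # matches ?stream=true
        query[name] = value
    return base_url.rstrip("/") + TRANSCRIPTIONS_PATH + "?" + urlencode(query)


# "https://edge.example" is a placeholder, not a real regional endpoint.
url = build_transcription_url(
    "https://edge.example",
    model="whisper-large-v3-turbo",
    language="en",
    response_format="srt",
    temperature=0,
)
print(url)
```

Parameters left as `None` are dropped rather than sent empty, so the server-side defaults in the table apply.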
Three Modes
| Query flags | Response | Use when |
|---|---|---|
| (none) | 202 Accepted with { job_id, status, poll_url } | Default. Best for long files; poll with GET /v1/audio/transcriptions?job_id=... |
| ?stream=true | 200 SSE of transcript.text.delta + transcript.text.done | Best UX for real-time display |
| ?sync=true | 200 with formatted body (JSON / text / SRT / VTT / verbose JSON) | Small files when you can wait inline |
stream=true and sync=true are mutually exclusive.
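
The mode selection above can be sketched as a small client-side check. This is only an illustration: `resolve_mode` is a hypothetical helper, and rejecting the stream-plus-sync combination before sending is a client-side choice, not documented server behavior.

```python
def resolve_mode(stream=False, sync=False):
    """Return (mode, expected_status) for a combination of query flags."""
    if stream and sync:
        # The two flags are documented as mutually exclusive, so fail early
        # on the client.
        raise ValueError("stream=true and sync=true are mutually exclusive")
    if stream:
        return ("stream", 200)  # Server-Sent Events
    if sync:
        return ("sync", 200)    # formatted body returned inline
    return ("async", 202)       # job accepted; poll for the result


print(resolve_mode())
print(resolve_mode(stream=True))
```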
Available Models
| Model | Description |
|---|---|
| whisper-large-v3-turbo | OpenAI Whisper, fast multilingual transcription |
| cohere-transcribe-03-2026 | Cohere transcription, 14 languages |
Examples
Default — async job
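
A minimal Python sketch of the default async flow, under several assumptions: the upload is assumed to be a POST, the helper names are hypothetical, and the body encoding is left to the caller because the upload format is not restated here. The sample response values are illustrative, not captured from a real server.

```python
import json
import urllib.request


def build_submit_request(url, api_key, body, content_type):
    """Prepare (but do not send) the upload request for the default async mode."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",  # assumption: uploads use POST
        headers={
            "Authorization": f"Bearer {api_key}",  # pg_* key as bearer token
            "Content-Type": content_type,
        },
    )


def parse_accepted(status, raw_body):
    """Extract the job handle from a 202 Accepted response."""
    if status != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status}")
    job = json.loads(raw_body)
    return job["job_id"], job["status"], job["poll_url"]


# Illustrative values only; "https://edge.example" is a placeholder host.
request = build_submit_request(
    "https://edge.example/v1/audio/transcriptions?model=whisper-large-v3-turbo",
    api_key="pg_example",
    body=b"<audio bytes>",
    content_type="application/octet-stream",
)
sample_body = b'{"job_id": "job_123", "status": "accepted", "poll_url": "/v1/audio/transcriptions?job_id=job_123"}'
job_id, status, poll_url = parse_accepted(202, sample_body)
print(job_id, status, poll_url)
```

Sending the prepared request with `urllib.request.urlopen(request)` would return the status code and body that `parse_accepted` expects.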
Streaming — SSE
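
A hedged Python sketch of reading the event stream returned with `?stream=true`. It assumes each event arrives as a single `data:` line (the general SSE format also allows multi-line data, which this does not handle) and that the stream ends with a `[DONE]` sentinel. `iter_sse_data` is a hypothetical helper; `lines` can be any iterable of byte lines, such as an open HTTP response object.

```python
def iter_sse_data(lines):
    """Yield the payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream
        yield payload


# Illustrative stream contents, not captured from a real server.
sample = [
    b'data: {"type": "transcript.text.delta", "delta": "hello"}\n',
    b"\n",
    b"data: [DONE]\n",
    b"\n",
]
payloads = list(iter_sse_data(sample))
print(payloads)
```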
Sync — blocking request
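
With `?sync=true` the body comes back inline in whichever `response_format` was requested. A minimal Python sketch of decoding it follows. It assumes the text-based formats are UTF-8 encoded; `decode_sync_body` is a hypothetical helper and the sample SRT body is illustrative.

```python
import json

JSON_FORMATS = {"json", "verbose_json"}
TEXT_FORMATS = {"text", "srt", "vtt"}


def decode_sync_body(response_format, raw_body):
    """Decode an inline (?sync=true) response according to response_format."""
    if response_format in JSON_FORMATS:
        return json.loads(raw_body)  # parsed into Python objects
    if response_format in TEXT_FORMATS:
        return raw_body.decode("utf-8")  # assumption: UTF-8 text
    raise ValueError(f"unknown response_format: {response_format!r}")


# Illustrative SRT payload, not captured from a real server.
subtitles = decode_sync_body("srt", b"1\n00:00:00,000 --> 00:00:01,000\nhello\n")
print(subtitles)
```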
Cohere model — same surface, different model id
cohere-transcribe-03-2026 supports the same
sync, stream, and async modes. Swap the model query parameter — everything
else is identical:
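
As a small Python illustration of that swap, the sketch below replaces only the `model` query parameter on an existing request URL and leaves the rest untouched. `with_model` is a hypothetical helper and `https://edge.example` is a placeholder host.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_model(url, model):
    """Return `url` with its `model` query parameter replaced."""
    parts = urlsplit(url)
    # Keep every other parameter as-is and put the new model first.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "model"]
    query.insert(0, ("model", model))
    return urlunsplit(parts._replace(query=urlencode(query)))


whisper_url = "https://edge.example/v1/audio/transcriptions?model=whisper-large-v3-turbo&sync=true"
cohere_url = with_model(whisper_url, "cohere-transcribe-03-2026")
print(cohere_url)
```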
language is omitted.
SSE Event Types
| type | Fields | When |
|---|---|---|
| transcript.text.delta | delta (string) | Zero or more, each carrying the full transcript hypothesis so far |
| transcript.text.done | text, duration (s), language | Exactly once, before the stream closes |
| error | error (string) | On failure; the stream ends afterwards |
The stream ends with data: [DONE]\n\n. Each transcript.text.delta event carries the full transcript hypothesis so far, and later events may revise earlier text. Render by replacing your display with the latest delta, not by appending. The authoritative final transcript is the text field of the transcript.text.done event.
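
The replace-not-append rule can be sketched in Python as follows. This assumes each event payload is a JSON object whose `type` field holds the event name from the table above; that wire shape is an assumption, and `apply_event` is a hypothetical helper. The sample payloads are illustrative.

```python
import json


def apply_event(display, payload):
    """Apply one SSE JSON payload; return (new_display, finished)."""
    event = json.loads(payload)
    kind = event["type"]
    if kind == "transcript.text.delta":
        # Each delta is the full hypothesis so far: replace, never append.
        return event["delta"], False
    if kind == "transcript.text.done":
        return event["text"], True  # authoritative final transcript
    if kind == "error":
        raise RuntimeError(event["error"])
    return display, False  # ignore unknown event types


display = ""
for payload in [
    '{"type": "transcript.text.delta", "delta": "helo"}',
    '{"type": "transcript.text.delta", "delta": "hello wor"}',
    '{"type": "transcript.text.done", "text": "Hello world.", "duration": 1.2, "language": "en"}',
]:
    display, finished = apply_event(display, payload)
print(display)
```

Because every delta overwrites the display, the early misrecognition ("helo") disappears as soon as a revised hypothesis arrives.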
Polling Responses
GET /v1/audio/transcriptions?job_id=... returns:
- 202 while the job is accepted/processing: { job_id, status, poll_interval_ms }
- 200 when completed: formatted body, with job_id and status: "completed"
- 400 on failed/cancelled
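
The status codes above suggest a simple polling loop, sketched here in Python. It assumes a JSON response format so that each body parses to a dict, and it takes a caller-supplied `fetch(job_id)` function returning `(http_status, body_dict)`, so no particular HTTP client is implied. `poll_job`, the `max_polls` cap, and the 1000 ms fallback interval are all hypothetical choices.

```python
import time


def poll_job(fetch, job_id, sleep=time.sleep, max_polls=100):
    """Poll until the job completes. `fetch(job_id)` returns (status, body_dict)."""
    for _ in range(max_polls):
        status, body = fetch(job_id)
        if status == 200:
            return body  # completed
        if status == 400:
            raise RuntimeError(f"job {job_id} ended as {body.get('status')}")
        if status != 202:
            raise RuntimeError(f"unexpected HTTP status {status}")
        # Honor the server-suggested interval; the 1000 ms fallback is an assumption.
        sleep(body.get("poll_interval_ms", 1000) / 1000)
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")


# Fake responses standing in for real HTTP calls; values are illustrative.
responses = iter([
    (202, {"job_id": "job_123", "status": "processing", "poll_interval_ms": 10}),
    (200, {"job_id": "job_123", "status": "completed"}),
])
waits = []
result = poll_job(lambda job_id: next(responses), "job_123", sleep=waits.append)
print(result["status"], waits)
```

Injecting `sleep` keeps the loop testable without real delays.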
