This guide shows how to chain PolarGrid’s three voice endpoints into a complete pipeline: transcribe speech, generate a response, and synthesize it back to audio — all on the same edge network.
All three models (Whisper, Qwen 3.5, Kokoro) run on the same edge node. No cross-provider latency, one auth token, one bill.
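The chaining itself can be sketched independently of any particular client library. The snippet below is a minimal sketch, not PolarGrid's SDK: `run_voice_turn` and the three stage callables are hypothetical names, and each callable is assumed to wrap one endpoint call (speech-to-text, chat completion, text-to-speech) using whatever HTTP client you already have.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical stage signatures: each callable wraps one endpoint call.
Transcribe = Callable[[bytes], str]   # audio bytes -> transcript
Generate = Callable[[str], str]       # transcript -> reply text
Synthesize = Callable[[str], bytes]   # reply text -> audio bytes


def run_voice_turn(
    audio: bytes,
    transcribe: Transcribe,
    generate: Generate,
    synthesize: Synthesize,
) -> Tuple[bytes, Dict[str, float]]:
    """Chain STT -> LLM -> TTS and record wall-clock milliseconds per stage."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = transcribe(audio)
    timings["stt_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply = generate(transcript)
    timings["llm_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    speech = synthesize(reply)
    timings["tts_ms"] = (time.perf_counter() - start) * 1000

    # Sequential total; a streaming setup can start playback earlier than this.
    timings["total_ms"] = timings["stt_ms"] + timings["llm_ms"] + timings["tts_ms"]
    return speech, timings
```

Passing the stages in as callables keeps the sketch free of assumptions about endpoint paths or request bodies, and makes it easy to swap in stubs when testing.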
From a nearby region (e.g., Eastern North America → Toronto):
| Step | Typical Latency |
| --- | --- |
| STT (6s of audio) | ~500ms |
| LLM (TTFT) | ~120-250ms |
| TTS (2 sentences) | ~300-800ms |
| Total perceived | ~900-1200ms |
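To compare your own setup against these figures, it can help to measure the time to the first audio chunk separately from the time to the full clip. The sketch below assumes that "perceived" refers to the moment playback can start, and that your client exposes the streamed response as an iterable of byte chunks. `first_chunk_latency` is a hypothetical helper, not part of any PolarGrid client.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def first_chunk_latency(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ms until first chunk, ms until stream end, total bytes)."""
    start = clock()
    first_ms: Optional[float] = None
    total_bytes = 0

    for chunk in chunks:
        if first_ms is None:
            # Only the first chunk is timed individually.
            first_ms = (clock() - start) * 1000
        total_bytes += len(chunk)

    total_ms = (clock() - start) * 1000
    if first_ms is None:
        # Empty stream: no chunk ever arrived, so report the total instead.
        first_ms = total_ms
    return first_ms, total_ms, total_bytes
```

The `clock` parameter exists only so the measurement can be tested deterministically; in normal use the default `time.perf_counter` is enough.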
For real-time bidirectional voice (phone calls, live agents), see PersonaPlex — it handles the full pipeline over a single WebSocket with streaming in both directions.
The examples above use response_format: "pcm" because real-time voice pipelines benefit from the lowest first-byte latency: PCM streams chunk by chunk, while wav and mp3 are buffered until the full clip is synthesized.
If you’d rather get a playable file back directly, ask the server for a container:
# Request WAV: RIFF/WAVE container, plays in any audio library
curl ... -d '{"...","response_format":"wav"}' -o response.wav
# Request MP3: encoded server-side at 128 kbps CBR
curl ... -d '{"...","response_format":"mp3"}' -o response.mp3
pcm, wav, and mp3 are the supported values. See TTS API → Audio Format for the full content-type / streaming table.
For HTTP-only stacks that can’t pipe raw PCM chunks, set stream: true and response_format: 'opus' for the same TTFB with fewer bytes on the wire. See the TTS API streaming section for the full contract.
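If you have already collected raw PCM from a streamed response and want a playable file without a second request, one option is to wrap the bytes in a WAV header on the client. This is a minimal sketch using Python's standard `wave` module. The defaults (24 kHz, mono, 16-bit little-endian samples) are assumptions, not values confirmed by this guide, so check the TTS API reference for the actual PCM format before relying on them.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 24000,  # assumed; confirm against the TTS API docs
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # assumed 16-bit samples (2 bytes each)
) -> bytes:
    """Wrap raw little-endian PCM samples in a RIFF/WAVE container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    # The header is finalized when the writer closes; the buffer stays open.
    return buffer.getvalue()
```

For example, `open("response.wav", "wb").write(pcm_to_wav(pcm_bytes))` would produce a file that ordinary audio players can open, provided the assumed sample format matches what the server actually sent.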