Voice Pipeline Quickstart

This guide shows how to chain PolarGrid’s three voice endpoints into a complete pipeline: transcribe speech, generate a response, and synthesize it back to audio — all on the same edge network.

All three models (Whisper, Qwen 3.5, Kokoro) run on the same edge node. No cross-provider latency, one auth token, one bill.

Prerequisites

A PolarGrid API key (get one here)
An audio file to transcribe (WAV, MP3, FLAC, M4A, OGG, or WebM)
Node.js 18+ or Python 3.10+
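
Before uploading, it can help to reject files whose extension is not in the list above. The following is a minimal sketch, not part of the PolarGrid SDK: the helper name `is_supported_audio` is hypothetical, and it checks only the file extension, not the actual audio content.

```python
from pathlib import Path

# Extensions taken from the prerequisites list above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is one of the listed formats.

    Hypothetical helper: it inspects the extension only and does not
    verify that the file really contains audio in that format.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Calling `is_supported_audio("input.wav")` before Step 1 lets a script fail early instead of waiting for the API to reject the upload.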

The Pipeline

Audio In → STT (Whisper) → LLM (Qwen 3.5) → TTS (Kokoro) → Audio Out

JavaScript

import { PolarGrid } from '@polargrid/polargrid-sdk';
import { readFile, writeFile } from 'fs/promises';

const client = await PolarGrid.create({
  apiKey: process.env.POLARGRID_API_KEY,
});

console.log(`Connected to: ${client.getRegionName()}`);

// Step 1: Transcribe audio → text
const audioInput = await readFile('input.wav');
const transcription = await client.transcribe({
  file: new Blob([audioInput]),
  model: 'whisper-large-v3-turbo',
});
console.log('User said:', transcription.text);

// Step 2: Generate a response with the LLM
const completion = await client.chatCompletion({
  model: 'qwen-3.5-27b',
  messages: [
    {
      role: 'system',
      content: 'You are a helpful voice assistant. Keep responses under 2 sentences.',
    },
    { role: 'user', content: transcription.text },
  ],
});
const reply = completion.choices[0].message.content;
console.log('Assistant:', reply);

// Step 3: Synthesize the response to audio
const audioOutput = await client.textToSpeech({
  model: 'kokoro-82m',
  input: reply,
  voice: 'af_bella',
  responseFormat: 'pcm',
});

// Save as raw PCM (24 kHz, 16-bit, mono)
await writeFile('response.pcm', Buffer.from(audioOutput));
console.log('Audio saved to response.pcm');

// Convert to playable format:
// ffmpeg -f s16le -ar 24000 -ac 1 -i response.pcm response.wav

Python

import asyncio
import os

from polargrid import PolarGrid

async def voice_pipeline():
    # Read the key from the environment, as the JavaScript example does
    client = await PolarGrid.create(api_key=os.environ["POLARGRID_API_KEY"])
    print(f"Connected to: {client.get_region_name()}")

    # Step 1: Transcribe audio → text
    with open("input.wav", "rb") as f:
        transcription = await client.transcribe(
            file=f,
            request={"model": "whisper-large-v3-turbo"},
        )
    print(f"User said: {transcription.text}")

    # Step 2: Generate a response with the LLM
    completion = await client.chat_completion({
        "model": "qwen-3.5-27b",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses under 2 sentences.",
            },
            {"role": "user", "content": transcription.text},
        ],
    })
    reply = completion.choices[0].message.content
    print(f"Assistant: {reply}")

    # Step 3: Synthesize the response to audio
    audio_output = await client.text_to_speech({
        "model": "kokoro-82m",
        "input": reply,
        "voice": "af_bella",
        "response_format": "pcm",
    })

    with open("response.pcm", "wb") as f:
        f.write(audio_output)
    print("Audio saved to response.pcm")

    # Convert to playable format:
    # ffmpeg -f s16le -ar 24000 -ac 1 -i response.pcm response.wav

asyncio.run(voice_pipeline())
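
If ffmpeg is not available, the raw PCM can also be wrapped in a WAV container with Python's standard `wave` module. This is a minimal sketch that assumes the format stated above (24 kHz, 16-bit, mono, little-endian as implied by the `s16le` ffmpeg flag); the function name `pcm_to_wav` is hypothetical and not part of the PolarGrid SDK.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 24000) -> None:
    """Write raw 16-bit mono PCM samples to a WAV file.

    Assumes the samples are signed 16-bit little-endian, matching the
    ffmpeg command shown in the examples.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 16-bit samples = 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
```

In the pipeline above, this would be called as `pcm_to_wav(audio_output, "response.wav")` in place of writing `response.pcm`, assuming `audio_output` is a bytes object.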

cURL

export API_KEY="pg_your_api_key"
EDGE="https://api.yto-01.edge.polargrid.ai"

# Step 1: Transcribe — single endpoint, query params select mode.
# Use ?sync=true for the lowest-latency blocking response in this voice loop.
TRANSCRIPT=$(curl -s -X POST \
  "$EDGE/v1/audio/transcriptions?sync=true&model=whisper-large-v3-turbo" \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@input.wav | jq -r .text)
echo "User said: $TRANSCRIPT"

# Step 2: LLM response
# Build the JSON with jq so quotes or newlines in the transcript are escaped.
REPLY=$(jq -n --arg text "$TRANSCRIPT" '{
    model: "qwen-3.5-27b",
    messages: [
      {role: "system", content: "You are a helpful voice assistant. Keep responses under 2 sentences."},
      {role: "user", content: $text}
    ]
  }' | curl -s -X POST "$EDGE/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d @- | jq -r '.choices[0].message.content')
echo "Assistant: $REPLY"

# Step 3: Synthesize to audio
# Build the JSON with jq so quotes or newlines in the reply are escaped.
jq -n --arg text "$REPLY" '{
    model: "kokoro-82m",
    input: $text,
    voice: "af_bella",
    response_format: "pcm"
  }' | curl -s -X POST "$EDGE/v1/audio/speech" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d @- -o response.pcm

echo "Audio saved to response.pcm"
# ffmpeg -f s16le -ar 24000 -ac 1 -i response.pcm response.wav

Expected Latency

From a nearby region (e.g., Eastern North America → Toronto):

Step	Typical Latency
STT (6s of audio)	~500ms
LLM (TTFT)	~120-250ms
TTS (2 sentences)	~300-800ms
Total perceived	~900-1200ms
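
To compare these figures against your own region, each stage can be timed separately. The following is a minimal sketch using only the standard library; `stage_timer` is a hypothetical helper, and the `time.sleep` call is a stand-in for a real API call. It measures the full duration of whatever runs inside the `with` block, so it captures TTFT for the LLM step only if the block ends when the first streamed token arrives.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name: str, timings: dict):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


timings = {}
with stage_timer("stt", timings):
    time.sleep(0.01)  # stand-in for the transcription call

for stage, elapsed_ms in timings.items():
    print(f"{stage}: {elapsed_ms:.0f} ms")
```

The same context manager works inside an async function, since an `await` can appear inside a regular `with` block.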

For real-time bidirectional voice (phone calls, live agents), see PersonaPlex — it handles the full pipeline over a single WebSocket with streaming in both directions.

Audio Format Notes

The examples above use response_format: "pcm" because real-time voice pipelines benefit from the lowest first-byte latency — PCM streams chunk-by-chunk, while wav and mp3 are buffered until the full clip is synthesized. If you’d rather get a playable file back directly, ask the server for a container:

# Request WAV — RIFF/WAVE container, plays in any audio library
curl ... -d '{"...","response_format":"wav"}' -o response.wav

# Request MP3 — encoded server-side at 128 kbps CBR
curl ... -d '{"...","response_format":"mp3"}' -o response.mp3

pcm, wav, and mp3 are the supported values. See TTS API → Audio Format for the full content-type and streaming table. For HTTP-only stacks that can’t pipe raw PCM chunks, set stream: true and response_format: 'opus', which gives the same TTFB with fewer bytes on the wire. See the TTS API streaming section for the full contract.
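
Because raw PCM has no header, the clip length has to be derived from the byte count. A minimal sketch, assuming the 24 kHz, 16-bit, mono format used in the examples above; the function name `pcm_duration_seconds` is hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # samples per second, per the examples above
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Return the playback duration of a raw PCM buffer of the given size."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
```

At these settings one second of audio is 48,000 bytes, so `pcm_duration_seconds(len(audio_output))` gives the length of a synthesized reply without decoding it.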

Next Steps

Voice AI Guide: Detailed TTS and STT endpoint reference
PersonaPlex: Real-time bidirectional voice agent over WebSocket
Streaming: Stream LLM tokens as they generate
Models: Available models and specifications
