Voice AI

PolarGrid provides low-latency voice capabilities at the edge.

Text-to-Speech (TTS)

Convert text to natural-sounding speech.

Basic Usage

import { writeFile } from 'fs/promises';

const audioBuffer = await client.textToSpeech({
  model: 'kokoro-82m',
  input: 'Hello from PolarGrid!',
  voice: 'af_bella',
  responseFormat: 'wav',
});

// The response is a fully formed RIFF/WAVE container, playable directly.
await writeFile('speech.wav', Buffer.from(audioBuffer));

Voices

The kokoro-82m model exposes eight voices across American and British English:
| Voice ID | Accent / gender |
| --- | --- |
| af_bella | American English, female |
| af_sarah | American English, female |
| am_adam | American English, male |
| am_michael | American English, male |
| bf_emma | British English, female |
| bf_isabella | British English, female |
| bm_george | British English, male |
| bm_lewis | British English, male |
Kokoro-82M itself ships many more voices (additional English tiers plus Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese) — see the upstream Kokoro-82M VOICES.md. Only the eight above are exposed through the PolarGrid SDKs today.
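Because only these eight IDs are exposed, it can help to check a voice ID locally before sending a request. The following is a minimal sketch; `SUPPORTED_VOICES` and `assertSupportedVoice` are hypothetical names for illustration and are not part of the PolarGrid SDK.

```javascript
// Voice IDs copied by hand from the table above. This list is an assumption
// of the sketch; the SDK is not known to export an equivalent constant.
const SUPPORTED_VOICES = new Set([
  'af_bella', 'af_sarah', 'am_adam', 'am_michael',
  'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
]);

// Returns the voice ID unchanged, or throws before any request is made.
function assertSupportedVoice(voice) {
  if (!SUPPORTED_VOICES.has(voice)) {
    throw new RangeError(`Unsupported voice "${voice}"`);
  }
  return voice;
}
```

A call such as `voice: assertSupportedVoice(userChoice)` would then fail fast on a typo instead of waiting for the server to reject it.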

Speed Control

Adjust playback speed from 0.25x to 4.0x:
const audioBuffer = await client.textToSpeech({
  model: 'kokoro-82m',
  input: 'This will be spoken slowly.',
  voice: 'af_bella',
  speed: 0.75,  // Slower
});
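When the speed comes from user input, a small guard can keep it inside the documented 0.25x to 4.0x range. This is a sketch only: `normalizeSpeed` is a hypothetical helper, and whether the API itself rejects or clamps out-of-range values is not stated here, so clamping client-side is an assumption.

```javascript
// Range taken from the documentation above.
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Clamps a requested speed into [MIN_SPEED, MAX_SPEED]; defaults to 1.0.
function normalizeSpeed(speed = 1.0) {
  if (!Number.isFinite(speed)) {
    throw new TypeError('speed must be a finite number');
  }
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}
```

The result can be passed straight through, for example `speed: normalizeSpeed(sliderValue)`.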

Audio Format

Audio is generated at 24 kHz, 16-bit, mono. The container is selected via response_format:
| response_format | Content-Type | Streaming? | Use when |
| --- | --- | --- | --- |
| pcm (default) | audio/pcm | Yes | Real-time voice-agent pipelines; lowest first-byte latency. |
| wav | audio/wav | No | You need a playable file with no client-side post-processing. |
| mp3 | audio/mpeg | No | Bandwidth matters; bytes are encoded server-side via libmp3lame at 128 kbps. |
The HTTP API default differs from OpenAI's /v1/audio/speech, which defaults to mp3. PolarGrid defaults to pcm to keep first-byte latency minimal for streaming TTS. The PolarGrid SDKs, however, default to mp3 for OpenAI-style behavior end to end, so pass responseFormat / response_format explicitly when calling via an SDK.
opus, aac, and flac from the OpenAI spec are not yet supported in batch mode; requesting them returns 400. For streaming, opus is supported (see below). For batch, transcode PCM client-side if you need one of those formats:
ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.opus
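The format rules on this page (pcm, wav, and mp3 for batch; pcm and opus for streaming, as described in the Streaming section) can be encoded in a small client-side check so that an unsupported combination is caught before it returns 400. This is a sketch; `isFormatSupported` and the two arrays are hypothetical names and would need updating if the server adds formats.

```javascript
// Formats accepted per mode, as documented on this page.
const BATCH_FORMATS = ['pcm', 'wav', 'mp3'];
const STREAM_FORMATS = ['pcm', 'opus'];

// Returns true when the format is documented as valid for the given mode.
function isFormatSupported(responseFormat, { stream = false } = {}) {
  const allowed = stream ? STREAM_FORMATS : BATCH_FORMATS;
  return allowed.includes(responseFormat);
}
```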

Streaming

For real-time playback or voice-agent pipelines, set stream: true and use response_format: 'pcm' (lowest latency) or 'opus' (compressed). See the TTS API reference for the full contract and the formats / models matrix.
for await (const chunk of client.textToSpeechStream({
  model: 'kokoro-82m',
  input: 'Streaming hello',
  voice: 'af_bella',
  responseFormat: 'opus',
})) {
  audioPlayer.appendChunk(chunk);
}
Streaming wav or mp3 returns 400; transcode client-side from pcm if you need a different container. For tada-3b-ml, streaming does not honor the speed parameter — use speed=1.0 (or omit it).
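For wav in particular, converting from streamed pcm needs no encoder: the container is a 44-byte RIFF header followed by the raw samples. The sketch below collects streamed chunks and wraps them using the documented 24 kHz, 16-bit, mono parameters. `collectChunks` and `pcmToWav` are hypothetical helpers, and the assumption that each chunk is a `Uint8Array` or `ArrayBuffer` is not confirmed by this page.

```javascript
// Gathers every chunk from an async iterable into one contiguous Uint8Array.
// Assumes each chunk is a Uint8Array (or an ArrayBuffer, which is wrapped).
async function collectChunks(chunks) {
  const parts = [];
  let total = 0;
  for await (const chunk of chunks) {
    const bytes = chunk instanceof Uint8Array ? chunk : new Uint8Array(chunk);
    parts.push(bytes);
    total += bytes.length;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

// Prepends a canonical 44-byte PCM WAV header to raw little-endian samples.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = channels * (bitsPerSample / 8);
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true);          // RIFF chunk size
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);                      // fmt chunk size
  view.setUint16(20, 1, true);                       // format tag 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // bytes per second
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true);              // data chunk size
  wav.set(pcm, 44);
  return wav;
}
```

With a stream requested as `responseFormat: 'pcm'`, `pcmToWav(await collectChunks(stream))` would yield bytes that can be written to a .wav file. Collecting the whole stream gives up the latency benefit of streaming, so this suits saving a copy rather than live playback.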

Raw HTTP Contract

If you’re not using the SDK, here’s the full request/response shape:
Batch Request
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer pg_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-82m",
    "input": "Hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "wav",
    "speed": 1.0
  }' \
  --output speech.wav
Streaming Request
curl -X POST https://api.yto-01.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer pg_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-82m",
    "input": "Streaming hello from PolarGrid!",
    "voice": "af_bella",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm
Batch response: Binary audio in the requested container format. Content-Type matches the format (audio/pcm, audio/wav, or audio/mpeg).
Streaming response: Chunked transfer-encoding with raw audio bytes. Headers include X-Polargrid-Stream: 1 and X-Polargrid-Sample-Rate: 24000. PCM is 16-bit signed little-endian mono at 24 kHz.
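The same contract can be exercised from JavaScript with the standard `fetch` API (global in browsers and Node.js 18+). This is a sketch under the assumptions shown in the curl examples; `buildSpeechRequest` and `synthesize` are hypothetical helper names, not SDK functions.

```javascript
// Endpoint taken from the curl examples above.
const BASE_URL = 'https://api.yto-01.edge.polargrid.ai';

// Builds the URL and fetch options without touching the network,
// which keeps the request shape easy to inspect and test.
function buildSpeechRequest(apiKey, params) {
  return {
    url: `${BASE_URL}/v1/audio/speech`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(params),
    },
  };
}

// Sends a batch request and returns the audio bytes plus the Content-Type.
async function synthesize(apiKey, params) {
  const { url, init } = buildSpeechRequest(apiKey, params);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`TTS request failed: HTTP ${res.status}`);
  }
  return {
    contentType: res.headers.get('content-type'),
    audio: new Uint8Array(await res.arrayBuffer()),
  };
}
```

Note that `params` uses the raw HTTP field names (`response_format`, not `responseFormat`), matching the JSON bodies in the curl examples.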

Speech-to-Text (STT)

Transcribe audio to text. The file parameter accepts File | Blob in JavaScript, and Path or any file-like object in Python. Buffers, path strings, and streams are not accepted directly — wrap them in a Blob or File first.
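Wrapping raw bytes is a one-liner with the standard `Blob` constructor (global in browsers and Node.js 18+). A minimal sketch, where `toAudioBlob` is a hypothetical helper name:

```javascript
// Wraps a Buffer, Uint8Array, or ArrayBuffer in a Blob with the given MIME type.
function toAudioBlob(bytes, type) {
  return new Blob([bytes], { type });
}
```

In Node.js, the Buffer returned by `readFile` from `fs/promises` can be passed as `bytes`, and the resulting Blob can then be supplied as the `file` parameter.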

Basic Transcription

const file = new File([audioData], 'recording.mp3', { type: 'audio/mpeg' });

const result = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  language: 'en',  // Optional: hint the language
});

console.log(result.text);

Verbose Output with Timestamps

Get segment-level timestamps along with the detected language and duration:
const result = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'verbose_json',
});

console.log(`Duration: ${result.duration}s`);
console.log(`Language: ${result.language}`);

result.segments.forEach(segment => {
  console.log(`[${segment.start.toFixed(2)} - ${segment.end.toFixed(2)}] ${segment.text}`);
});

Subtitle Formats

Generate subtitles directly:
// SRT format
const srt = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'srt',
});

// WebVTT format
const vtt = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'vtt',
});
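If a `verbose_json` result is already in hand, its segments can also be turned into SRT locally rather than issuing a second request. The sketch below assumes only the segment fields used earlier on this page (`start`, `end`, `text`); `formatSrtTime` and `segmentsToSrt` are hypothetical helpers, and the server's own SRT output may differ in line breaking.

```javascript
// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Builds numbered SRT cues separated by blank lines.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```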

Voice Chat (Request/Response Loop)

Transcription in this loop requires a completed audio file — the user must finish speaking before the request is sent. For streaming realtime audio, see PersonaPlex (multi-modal, single model) or the Modular Pipeline Agent (STT → LLM → TTS with streaming events).
Combine TTS and STT for voice conversations. This example records a short utterance from the microphone, transcribes it, passes the text through the chat model, and speaks the response:
// Browser: capture a Blob from the microphone via MediaRecorder,
// then run it through transcribe → chat → TTS.
async function voiceChat(audioBlob) {
  // audioBlob: a Blob produced by MediaRecorder, e.g.
  //   const recorder = new MediaRecorder(stream);
  //   recorder.ondataavailable = (e) => chunks.push(e.data);
  //   const audioBlob = new Blob(chunks, { type: 'audio/webm' });

  // 1. Transcribe user speech
  const transcription = await client.transcribe({
    file: audioBlob,
    model: 'whisper-large-v3-turbo',
  });
  
  // 2. Generate AI response
  const response = await client.chatCompletion({
    model: 'qwen-3.5-27b',
    messages: [
      { role: 'user', content: transcription.text }
    ],
  });
  
  // 3. Convert response to speech
  const audio = await client.textToSpeech({
    model: 'kokoro-82m',
    input: response.choices[0].message.content,
    voice: 'af_bella',
  });
  
  return audio;
}
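The function above handles a single exchange, so the model sees no earlier turns. One way to carry context is to pass the message history in and return the updated history alongside the audio. This is a sketch: `voiceTurn` is a hypothetical helper, and it assumes the chat endpoint accepts OpenAI-style `assistant` messages, which this page does not confirm.

```javascript
// Runs one transcribe -> chat -> TTS turn and returns the extended history.
// `client` is expected to expose the three SDK methods used on this page.
async function voiceTurn(client, history, audioBlob) {
  const { text } = await client.transcribe({
    file: audioBlob,
    model: 'whisper-large-v3-turbo',
  });

  // Append the new user message without mutating the caller's array.
  const messages = [...history, { role: 'user', content: text }];
  const response = await client.chatCompletion({
    model: 'qwen-3.5-27b',
    messages,
  });
  const reply = response.choices[0].message.content;

  const audio = await client.textToSpeech({
    model: 'kokoro-82m',
    input: reply,
    voice: 'af_bella',
    responseFormat: 'wav', // explicit, so the result is directly playable
  });

  // Assumption: 'assistant' is the role name for prior model replies.
  return { audio, history: [...messages, { role: 'assistant', content: reply }] };
}
```

Start with an empty array and feed each returned `history` into the next call, for example `({ audio, history } = await voiceTurn(client, history, blob))`.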

Supported Audio Formats

For transcription and translation:
  • MP3
  • WAV
  • M4A
  • OGG
  • FLAC
  • WebM
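When constructing a `File` or `Blob` for one of these formats, a MIME type can be derived from the file extension. The mapping below uses the commonly registered audio types; whether the API relies on the MIME type or inspects the bytes is not stated here, and `audioMimeType` is a hypothetical helper.

```javascript
// Extension -> MIME type for the formats listed above.
const AUDIO_MIME_TYPES = new Map([
  ['mp3', 'audio/mpeg'],
  ['wav', 'audio/wav'],
  ['m4a', 'audio/mp4'],
  ['ogg', 'audio/ogg'],
  ['flac', 'audio/flac'],
  ['webm', 'audio/webm'],
]);

// Returns the MIME type for a filename, or throws for unlisted extensions.
function audioMimeType(filename) {
  const ext = filename.slice(filename.lastIndexOf('.') + 1).toLowerCase();
  const type = AUDIO_MIME_TYPES.get(ext);
  if (!type) {
    throw new Error(`Unsupported audio extension: .${ext}`);
  }
  return type;
}
```

For example, `new File([audioData], name, { type: audioMimeType(name) })` avoids hard-coding the type for each upload.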