Voice AI

PolarGrid provides low-latency voice capabilities at the edge.

Text-to-Speech (TTS)

Convert text to natural-sounding speech.

Basic Usage

const audioBuffer = await client.textToSpeech({
  model: 'kokoro-82m',
  input: 'Hello from PolarGrid!',
  voice: 'af_bella',
  responseFormat: 'pcm',
});

// The response is raw 24 kHz, 16-bit, mono PCM. Save and transcode
// to MP3/WAV with ffmpeg if you need a container format.
import { writeFile } from 'fs/promises';
await writeFile('speech.pcm', Buffer.from(audioBuffer));
// ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.mp3
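Because the response is raw PCM at a known sample rate, the clip's duration can be computed directly from the byte length. A small helper (our own, not an SDK function), assuming the 24 kHz, 16-bit, mono output described below:

```javascript
// Raw PCM duration = bytes / (sampleRate * channels * bytesPerSample).
// kokoro-82m output is 24 kHz, mono, 16-bit (2 bytes per sample).
function pcmDurationSeconds(byteLength, sampleRate = 24000, channels = 1, bytesPerSample = 2) {
  return byteLength / (sampleRate * channels * bytesPerSample);
}

// A 1-second clip at 24 kHz mono 16-bit is 48,000 bytes.
console.log(pcmDurationSeconds(48000)); // 1
```

This is handy for progress bars or for sanity-checking that a response wasn't truncated.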

Voices

The kokoro-82m model exposes eight voices across American and British English:
Voice ID       Accent / gender
af_bella       American English, female
af_sarah       American English, female
am_adam        American English, male
am_michael     American English, male
bf_emma        British English, female
bf_isabella    British English, female
bm_george      British English, male
bm_lewis       British English, male
Kokoro-82M itself ships many more voices (additional English tiers plus Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese) — see the upstream Kokoro-82M VOICES.md. Only the eight above are exposed through the PolarGrid SDKs today.
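As the table suggests, the voice IDs follow a naming convention: the first letter encodes the accent (a = American, b = British) and the second the gender (f = female, m = male). A small decoder, purely illustrative (not an SDK utility):

```javascript
// Decode a kokoro-82m voice ID like 'bf_emma' into its parts.
// Convention: first letter = accent (a/b), second letter = gender (f/m).
const ACCENTS = { a: 'American English', b: 'British English' };
const GENDERS = { f: 'female', m: 'male' };

function describeVoice(voiceId) {
  const [prefix, name] = voiceId.split('_');
  return {
    name,
    accent: ACCENTS[prefix[0]],
    gender: GENDERS[prefix[1]],
  };
}

console.log(describeVoice('bm_george'));
// { name: 'george', accent: 'British English', gender: 'male' }
```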

Speed Control

Adjust playback speed from 0.25x to 4.0x:
const audioBuffer = await client.textToSpeech({
  model: 'kokoro-82m',
  input: 'This will be spoken slowly.',
  voice: 'af_bella',
  speed: 0.75,  // Slower
});
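If the speed value comes from user input, clamping it to the documented 0.25x–4.0x range before the request keeps it within what the API supports (a defensive helper of our own):

```javascript
// Clamp a requested playback speed to the supported 0.25x-4.0x range.
function clampSpeed(speed) {
  return Math.min(4.0, Math.max(0.25, speed));
}

console.log(clampSpeed(0.1)); // 0.25
console.log(clampSpeed(10));  // 4
console.log(clampSpeed(1.5)); // 1.5
```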

Audio Format

The TTS endpoint returns raw 24 kHz, 16-bit, mono PCM audio. The response_format / responseFormat parameter is accepted for forward compatibility but is currently ignored — the response is always PCM regardless of what you pass.
To produce MP3, WAV, OGG, or other container formats, transcode the PCM bytes client-side. With ffmpeg:
ffmpeg -f s16le -ar 24000 -ac 1 -i speech.pcm speech.mp3
Encoder support for mp3, opus, aac, flac, and wav in the backend is planned but not yet implemented.
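If ffmpeg isn't available, WAV is simple enough to produce by hand: it is just a 44-byte RIFF header followed by the PCM bytes unchanged. A sketch for the 24 kHz, mono, 16-bit output described above (our own helper, not part of the SDK):

```javascript
// Wrap raw 16-bit PCM in a minimal 44-byte RIFF/WAVE header.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = channels * bitsPerSample / 8;
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);          // total size minus 8
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);                      // fmt chunk size
  header.writeUInt16LE(1, 20);                       // audio format: PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

// e.g. await writeFile('speech.wav', pcmToWav(Buffer.from(audioBuffer)));
```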

Speech-to-Text (STT)

Transcribe audio to text. The file parameter accepts File | Blob in JavaScript, and Path or any file-like object in Python. Buffers, path strings, and streams are not accepted directly — wrap them in a Blob or File first.

Basic Transcription

const file = new File([audioData], 'recording.mp3', { type: 'audio/mpeg' });

const result = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  language: 'en',  // Optional: hint the language
});

console.log(result.text);

Verbose Output with Timestamps

Get word-level timestamps:
const result = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'verbose_json',
});

console.log(`Duration: ${result.duration}s`);
console.log(`Language: ${result.language}`);

result.segments.forEach(segment => {
  console.log(`[${segment.start.toFixed(2)} - ${segment.end.toFixed(2)}] ${segment.text}`);
});
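The start and end values are plain seconds. Formatting them as HH:MM:SS.mmm for display takes only a few lines (a helper of our own, not an SDK utility):

```javascript
// Format a time in seconds as HH:MM:SS.mmm for display alongside segments.
function formatTimestamp(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.floor((seconds % 1) * 1000);
  const pad = (n, w = 2) => String(n).padStart(w, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

console.log(formatTimestamp(83.5)); // 00:01:23.500
```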

Subtitle Formats

Generate subtitles directly:
// SRT format
const srt = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'srt',
});

// WebVTT format
const vtt = await client.transcribe({
  file,
  model: 'whisper-large-v3-turbo',
  responseFormat: 'vtt',
});

Voice Chat (Request/Response Loop)

Combine TTS and STT for voice conversations. This example transcribes a recorded utterance, passes the text through the chat model, and speaks the response.
Note that transcription in this loop requires a completed audio file — the user must finish speaking before the request is sent. For streaming realtime audio, see PersonaPlex (multi-modal, single model) or the Modular Pipeline Agent (STT → LLM → TTS with streaming events).
// Browser: capture a Blob from the microphone via MediaRecorder,
// then run it through transcribe → chat → TTS.
async function voiceChat(audioBlob) {
  // audioBlob: a Blob produced by MediaRecorder, e.g.
  //   const recorder = new MediaRecorder(stream);
  //   recorder.ondataavailable = (e) => chunks.push(e.data);
  //   const audioBlob = new Blob(chunks, { type: 'audio/webm' });

  // 1. Transcribe user speech
  const transcription = await client.transcribe({
    file: audioBlob,
    model: 'whisper-large-v3-turbo',
  });
  
  // 2. Generate AI response
  const response = await client.chatCompletion({
    model: 'Meta-Llama-3.1-8B-Instruct',
    messages: [
      { role: 'user', content: transcription.text }
    ],
  });
  
  // 3. Convert response to speech
  const audio = await client.textToSpeech({
    model: 'kokoro-82m',
    input: response.choices[0].message.content,
    voice: 'af_bella',
  });
  
  return audio;
}
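The loop above is single-turn: each call sends only the latest utterance, so the model has no memory of earlier exchanges. For multi-turn conversation, keep the messages array alive across calls and append both sides of each turn. A sketch of how step 2 could be extended (the system prompt and helper names are our own, not from the SDK):

```javascript
// Maintain chat history across turns so the model sees prior context.
const messages = [
  { role: 'system', content: 'You are a helpful voice assistant. Keep replies short.' },
];

function addUserTurn(history, text) {
  history.push({ role: 'user', content: text });
  return history;
}

function addAssistantTurn(history, text) {
  history.push({ role: 'assistant', content: text });
  return history;
}

// In voiceChat(): addUserTurn(messages, transcription.text), pass the full
// `messages` array to chatCompletion, then addAssistantTurn(messages, replyText).
```

For long sessions you would also want to trim or summarize old turns to stay within the model's context window.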

Supported Audio Formats

For transcription and translation:
  • MP3
  • WAV
  • M4A
  • OGG
  • FLAC
  • WebM
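When constructing a File from raw bytes, the MIME type should match the actual container. A lookup covering the formats above (these are the conventional MIME strings; the source docs don't state whether the API validates them):

```javascript
// Conventional MIME types for the supported transcription formats.
const AUDIO_MIME_TYPES = {
  mp3: 'audio/mpeg',
  wav: 'audio/wav',
  m4a: 'audio/mp4',
  ogg: 'audio/ogg',
  flac: 'audio/flac',
  webm: 'audio/webm',
};

function mimeTypeFor(filename) {
  const ext = filename.split('.').pop().toLowerCase();
  return AUDIO_MIME_TYPES[ext] ?? 'application/octet-stream';
}

console.log(mimeTypeFor('recording.MP3')); // audio/mpeg
```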