PersonaPlex

PersonaPlex is PolarGrid’s real-time voice conversation endpoint. It streams Opus audio in both directions over a single WebSocket, with a persona prompt and voice selected at connect time.

PersonaPlex vs. the Modular Pipeline Agent

| Dimension | PersonaPlex (multi-modal) | Modular Pipeline Agent |
| --- | --- | --- |
| Model architecture | Single multi-modal model (audio in / audio out) | Three models: STT → LLM → TTS |
| Latency profile | One model inference per turn | Three model calls per turn; LLM output is streamed into TTS at sentence boundaries rather than after generation completes |
| Persona / prompt | Text persona query parameter at connect time | System prompt passed to the LLM |
| Voice selection | One of the PersonaPlex voice IDs (NATF0…) | Any voice supported by the selected TTS model |
| Model substitution | Not supported (models are bundled) | Any STT / LLM / TTS from GET /v1/models |
| Barge-in | Handled inside the multi-modal model | User speech cancels in-flight LLM and TTS |
| Structured events | Audio frames + transcript text frames | JSON event stream (see Modular Pipeline Agent) |
| Function calling | Not applicable (audio-native model) | Via the LLM step, once native tool-use is available |

PersonaPlex is not listed in GET /v1/models. That endpoint only returns the Triton-served request/response models (qwen-3.5-27b, whisper-large-v3-turbo, cohere-transcribe-03-2026, kokoro-82m, tada-3b-ml). PersonaPlex runs as a separate moshi-backend LiveKit agent pod with its own WebSocket endpoint and wire protocol.

Connecting

Open a WebSocket to the voice endpoint with your pg_* API key as the token query parameter.

Open the WebSocket

wss://api.<region>.edge.polargrid.ai/v1/voice/personaplex
  ?voice=NATF0
  &persona=<url-encoded persona prompt>
  &token=pg_your_api_key
Default region is yto-01 (Toronto) if you omit it. As an alternative to the token query parameter, you can pass the credential via the WebSocket subprotocol header:
Sec-WebSocket-Protocol: bearer.pg_your_api_key
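
As a rough illustration of the two authentication options above, the following Python sketch builds the connect URL and the subprotocol value. The helper names personaplex_url and bearer_subprotocol are hypothetical, not part of any PolarGrid SDK, and percent-encoding spaces as %20 (rather than +) is an assumption chosen to match the quickstart.

```python
from urllib.parse import quote, urlencode


def personaplex_url(api_key=None, voice="NATF0",
                    persona="A helpful voice assistant.", region="yto-01"):
    """Build the connect URL. Leave api_key as None to authenticate via the
    Sec-WebSocket-Protocol header instead of the token query parameter."""
    params = {"voice": voice, "persona": persona}
    if api_key is not None:
        params["token"] = api_key
    # quote_via=quote encodes spaces as %20 instead of '+'
    query = urlencode(params, quote_via=quote)
    return f"wss://api.{region}.edge.polargrid.ai/v1/voice/personaplex?{query}"


def bearer_subprotocol(api_key):
    """Value to offer as the WebSocket subprotocol (hypothetical helper)."""
    return f"bearer.{api_key}"


url = personaplex_url(persona="A friendly tour guide for Vancouver.")
print(url)
print(bearer_subprotocol("pg_your_api_key"))
```

With the websockets library, the second value would be passed as websockets.connect(url, subprotocols=[bearer_subprotocol(key)], ping_interval=None). Whether the server echoes the subprotocol back in its handshake response is not stated on this page.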

Voices

Pass one of the following as the voice query parameter. Do not include the .pt suffix.
| Group | Voices |
| --- | --- |
| NATF | NATF0 to NATF3 |
| NATM | NATM0 to NATM3 |
| VARF | VARF0 to VARF4 |
| VARM | VARM0 to VARM4 |
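
A client can reject a bad voice ID before it opens (and starts paying for) a session. The sketch below assumes each group in the table above is an inclusive range of IDs (for example NATF0 through NATF3); the normalize_voice helper is hypothetical.

```python
# Assumption: each group is an inclusive range ending at the index shown.
VOICE_RANGES = {"NATF": 3, "NATM": 3, "VARF": 4, "VARM": 4}
VALID_VOICES = {
    f"{group}{index}"
    for group, last in VOICE_RANGES.items()
    for index in range(last + 1)
}


def normalize_voice(voice: str) -> str:
    """Strip an accidental .pt suffix and check the ID against the known set."""
    voice = voice.strip()
    if voice.endswith(".pt"):
        voice = voice[: -len(".pt")]
    if voice not in VALID_VOICES:
        raise ValueError(f"unknown PersonaPlex voice: {voice!r}")
    return voice


print(normalize_voice("NATF0.pt"))
```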

Wire protocol

All frames are binary. The first byte is a type tag; the remaining bytes are the payload.

Client → Server

| Tag | Payload | Notes |
| --- | --- | --- |
| 0x01 | Opus audio | Ogg container, mono, 24 kHz |
| 0x02 | UTF-8 text | Inject text into the conversation |
| 0x03 | Control frame (bos or eos) | Stream boundary markers |

Server → Client

| Tag | Payload | Notes |
| --- | --- | --- |
| 0x00 | Handshake | Sent once, immediately after connect |
| 0x01 | Opus audio | Generated speech |
| 0x02 | UTF-8 text | Transcript tokens |

0x03 eos marks end-of-turn, not end-of-reply. Sending eos after your audio tells the agent you are done speaking. The agent completes its full reply after receiving eos. In builds prior to June 2026, eos interrupted the agent’s in-flight reply, causing truncated responses. If you integrated during the alpha and added workarounds for that behavior, retest without them.
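
The framing described above (one tag byte followed by the payload) can be sketched with two small helpers. The function names are hypothetical, and the literal payload b"eos" for the control frame follows the quickstart on this page.

```python
from typing import Tuple

TAG_HANDSHAKE = 0x00  # server -> client only
TAG_AUDIO = 0x01      # both directions
TAG_TEXT = 0x02       # both directions
TAG_CONTROL = 0x03    # client -> server only


def encode_frame(tag: int, payload: bytes = b"") -> bytes:
    """Prefix the payload with its one-byte type tag."""
    return bytes([tag]) + payload


def decode_frame(frame: bytes) -> Tuple[int, bytes]:
    """Split a received binary frame into (tag, payload)."""
    if not frame:
        raise ValueError("empty frame has no type tag")
    return frame[0], frame[1:]


eos = encode_frame(TAG_CONTROL, b"eos")
tag, payload = decode_frame(encode_frame(TAG_TEXT, "hello".encode("utf-8")))
print(eos, tag, payload)
```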

Gotchas

  • Wait for the 0x00 handshake before sending any audio. The first audio bytes you send must be the Opus Ogg BOS (beginning-of-stream) page. If audio arrives before the handshake, or without a valid BOS page, the server closes the connection with code 1000.
  • Disable your WebSocket library’s heartbeat / ping. The upstream moshi runtime does not answer RFC 6455 ping frames with pongs, so a client-side ping timer will time out and tear down an otherwise healthy session. Most libraries expose this as a ping_interval or heartbeat option; set it to 0 or None.
  • Sessions are billed by wall-clock duration. Close the socket as soon as the conversation is idle; an open connection keeps accruing cost even with no audio flowing.

Quickstart

Both examples below do the same thing: connect to PersonaPlex, send a pre-recorded audio file, and save the response audio to disk. Pick the language you prefer.

Prerequisites

  • A PolarGrid API key (get one here)
  • An input audio file (WAV, mono, 24 kHz recommended — other sample rates will be resampled)

Python

pip install websockets soundfile numpy opuslib
#!/usr/bin/env python3
"""Send an audio file to PersonaPlex and save the voice response."""

import argparse
import asyncio
import struct

import numpy as np
import opuslib
import soundfile as sf
import websockets

# ---------------------------------------------------------------------------
# Wire protocol tags
# ---------------------------------------------------------------------------
TAG_HANDSHAKE = 0x00   # Server → Client: session ready
TAG_AUDIO     = 0x01   # Both directions: Opus audio in Ogg container
TAG_TEXT      = 0x02   # Both directions: UTF-8 text
TAG_CONTROL   = 0x03   # Client → Server: stream boundary (eos)


# ---------------------------------------------------------------------------
# Ogg helpers — build a minimal Ogg/Opus bitstream from raw Opus frames
# ---------------------------------------------------------------------------
def _ogg_page(serial: int, granule: int, seq: int, bos: bool, eos: bool,
              *segments: bytes) -> bytes:
    """Build one Ogg page containing the given segments."""
    body = b"".join(segments)
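    # Lacing values: each packet is stored as zero or more 255-byte segments
    # followed by one final segment shorter than 255 bytes (possibly empty).
    # Writing len(packet) directly would fail for packets of 255 bytes or more.
    lacing = []
    for s in segments:
        lacing += [255] * (len(s) // 255) + [len(s) % 255]
    seg_count = len(lacing)
    seg_table = bytes(lacing)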
    header_type = (0x02 if bos else 0x00) | (0x04 if eos else 0x00)
    # Ogg page header (27 bytes + segment table + body), checksum patched below
    header = struct.pack(
        "<4sBBqIIIB",
        b"OggS",       # capture pattern
        0,             # version
        header_type,
        granule,
        serial,
        seq,
        0,             # checksum placeholder
        seg_count,
    )
    page_no_crc = header + seg_table + body
    crc = _ogg_crc(page_no_crc)
    return page_no_crc[:22] + struct.pack("<I", crc) + page_no_crc[26:]


# CRC-32 lookup table used by the Ogg framing spec
_OGG_CRC_TABLE = None

def _ogg_crc(data: bytes) -> int:
    global _OGG_CRC_TABLE
    if _OGG_CRC_TABLE is None:
        _OGG_CRC_TABLE = []
        for i in range(256):
            r = i << 24
            for _ in range(8):
                r = ((r << 1) ^ 0x04C11DB7) & 0xFFFFFFFF if r & 0x80000000 else (r << 1) & 0xFFFFFFFF
            _OGG_CRC_TABLE.append(r)
    crc = 0
    for b in data:
        crc = ((crc << 8) ^ _OGG_CRC_TABLE[((crc >> 24) & 0xFF) ^ b]) & 0xFFFFFFFF
    return crc


def encode_audio_to_ogg_opus(pcm: np.ndarray, sample_rate: int = 24000,
                              frame_ms: int = 20) -> bytes:
    """Encode PCM float32 mono audio into an Ogg/Opus bytestream."""
    # Resample to 24 kHz if needed
    if sample_rate != 24000:
        from fractions import Fraction
        ratio = Fraction(24000, sample_rate)
        n_out = int(len(pcm) * ratio)
        indices = np.linspace(0, len(pcm) - 1, n_out)
        pcm = np.interp(indices, np.arange(len(pcm)), pcm).astype(np.float32)
        sample_rate = 24000

    encoder = opuslib.Encoder(sample_rate, 1, opuslib.APPLICATION_VOIP)
    frame_size = sample_rate * frame_ms // 1000  # samples per frame

    serial = 1
    seq = 0

    # --- BOS page: OpusHead ---
    opus_head = struct.pack("<8sBBHIhB", b"OpusHead", 1, 1, 312, 24000, 0, 0)
    pages = _ogg_page(serial, 0, seq, bos=True, eos=False, opus_head)
    seq += 1

    # --- Comment page: OpusTags ---
    vendor = b"PolarGrid"
    opus_tags = struct.pack("<8sI", b"OpusTags", len(vendor)) + vendor + struct.pack("<I", 0)
    pages += _ogg_page(serial, 0, seq, bos=False, eos=False, opus_tags)
    seq += 1

    # --- Audio pages ---
    granule = 0
    pcm_i16 = (np.clip(pcm, -1.0, 1.0) * 32767).astype(np.int16)
    pos = 0
    while pos < len(pcm_i16):
        chunk = pcm_i16[pos : pos + frame_size]
        if len(chunk) < frame_size:
            chunk = np.pad(chunk, (0, frame_size - len(chunk)))
        opus_frame = encoder.encode(chunk.tobytes(), frame_size)
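        # Ogg/Opus granule positions count samples at 48 kHz (RFC 7845),
        # regardless of the encoder's input rate.
        granule += frame_size * 48000 // sample_rate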
        is_last = pos + frame_size >= len(pcm_i16)
        pages += _ogg_page(serial, granule, seq, bos=False, eos=is_last, opus_frame)
        seq += 1
        pos += frame_size

    return pages


# ---------------------------------------------------------------------------
# Main client
# ---------------------------------------------------------------------------
async def run(api_key: str, audio_path: str, region: str, voice: str,
              persona: str, output_path: str):
    # Load and prepare audio
    pcm, sr = sf.read(audio_path, dtype="float32", always_2d=True)
    pcm = pcm[:, 0]  # mono
    print(f"Loaded {audio_path}: {len(pcm)/sr:.1f}s at {sr} Hz")

    ogg_data = encode_audio_to_ogg_opus(pcm, sample_rate=sr)
    print(f"Encoded to Ogg/Opus: {len(ogg_data)} bytes")

    # Build WebSocket URL
    from urllib.parse import quote
    url = (
        f"wss://api.{region}.edge.polargrid.ai/v1/voice/personaplex"
        f"?voice={voice}&persona={quote(persona)}&token={api_key}"
    )

    response_audio = bytearray()
    transcript_parts = []

    # Connect — disable ping to avoid teardown (moshi does not pong)
    async with websockets.connect(url, ping_interval=None,
                                  max_size=None) as ws:
        # Step 1: Wait for the handshake (0x00)
        print("Waiting for handshake...")
        msg = await ws.recv()
        if isinstance(msg, bytes) and msg[0] == TAG_HANDSHAKE:
            print("Handshake received — session is ready")
        else:
            raise RuntimeError(f"Expected handshake, got: {msg[:20]!r}")

        # Step 2: Send audio frames with 0x01 tag prefix
        CHUNK = 4096
        for i in range(0, len(ogg_data), CHUNK):
            frame = bytes([TAG_AUDIO]) + ogg_data[i : i + CHUNK]
            await ws.send(frame)
        print(f"Sent {len(ogg_data)} bytes of audio")

        # Step 3: Send end-of-stream control frame
        await ws.send(bytes([TAG_CONTROL]) + b"eos")
        print("Sent eos — waiting for response...")

        # Step 4: Receive response audio and transcript
        try:
            async for msg in ws:
                if not isinstance(msg, bytes) or len(msg) < 1:
                    continue
                tag = msg[0]
                payload = msg[1:]
                if tag == TAG_AUDIO:
                    response_audio.extend(payload)
                elif tag == TAG_TEXT:
                    text = payload.decode("utf-8")
                    transcript_parts.append(text)
                    print(f"  transcript: {text}")
        except websockets.exceptions.ConnectionClosed:
            pass

    # Save response audio (raw Opus/Ogg — playable with ffplay, mpv, VLC)
    if response_audio:
        with open(output_path, "wb") as f:
            f.write(response_audio)
        print(f"\nSaved response audio to {output_path} ({len(response_audio)} bytes)")
    if transcript_parts:
        print(f"Full transcript: {''.join(transcript_parts)}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="PersonaPlex client example")
    parser.add_argument("--api-key", required=True, help="Your pg_* API key")
    parser.add_argument("--audio", required=True, help="Path to input audio file (WAV)")
    parser.add_argument("--region", default="yto-01", help="Edge region (default: yto-01)")
    parser.add_argument("--voice", default="NATF0", help="Voice ID (default: NATF0)")
    parser.add_argument("--persona", default="A helpful voice assistant.",
                        help="Persona prompt")
    parser.add_argument("--output", default="response.ogg", help="Output file path")
    args = parser.parse_args()
    asyncio.run(run(args.api_key, args.audio, args.region, args.voice,
                    args.persona, args.output))
Usage:
python personaplex_client.py \
  --api-key pg_your_key_here \
  --audio question.wav \
  --persona "A friendly tour guide for Vancouver."
# => response.ogg (playable with ffplay, mpv, or VLC)
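
To sanity-check a saved reply without opening a player, the Ogg pages can be walked directly. This is a minimal sketch that assumes response.ogg is a single, well-formed Ogg/Opus stream whose granule positions follow RFC 7845 (48 kHz units, with the pre-skip stored in OpusHead). It does not verify page checksums, and both function names are hypothetical.

```python
import struct


def iter_ogg_pages(data: bytes):
    """Yield (header_type, granule_position, body) for each complete Ogg page."""
    pos = 0
    while pos + 27 <= len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no Ogg capture pattern at offset {pos}")
        # Fields after the capture pattern: version, header type, granule,
        # serial, page sequence, CRC, segment count (23 bytes in total).
        _ver, header_type, granule, _serial, _seq, _crc, n_segs = \
            struct.unpack_from("<BBqIIIB", data, pos + 4)
        body_start = pos + 27 + n_segs
        body_len = sum(data[pos + 27:body_start])  # sum of lacing values
        yield header_type, granule, data[body_start:body_start + body_len]
        pos = body_start + body_len


def opus_duration_seconds(data: bytes) -> float:
    """Estimate playable duration from the last granule position."""
    pre_skip = 0
    last_granule = 0
    for _header_type, granule, body in iter_ogg_pages(data):
        if body[:8] == b"OpusHead":
            pre_skip = struct.unpack_from("<H", body, 10)[0]
        elif granule > 0:
            last_granule = granule
    return max(last_granule - pre_skip, 0) / 48000
```

Usage: opus_duration_seconds(open("response.ogg", "rb").read()). A result of 0.0 for a non-empty file suggests the stream is missing its OpusHead page or audio pages.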

Node.js

npm install ws
#!/usr/bin/env node
/**
 * Send a pre-encoded Ogg/Opus file to PersonaPlex and save the response.
 *
 * This example reads a file that is already in Ogg/Opus format (mono, 24 kHz).
 * To convert a WAV to the right format:
 *   ffmpeg -i input.wav -ac 1 -ar 24000 -c:a libopus input.ogg
 */

const { readFileSync, writeFileSync } = require("fs");
const WebSocket = require("ws");

// ---------------------------------------------------------------------------
// Wire protocol tags
// ---------------------------------------------------------------------------
const TAG_HANDSHAKE = 0x00; // Server → Client: session ready
const TAG_AUDIO     = 0x01; // Both directions: Opus audio (Ogg container)
const TAG_TEXT      = 0x02; // Both directions: UTF-8 text
const TAG_CONTROL   = 0x03; // Client → Server: stream boundary (eos)

// ---------------------------------------------------------------------------
// Config — edit these or pass via environment variables
// ---------------------------------------------------------------------------
const API_KEY = process.env.POLARGRID_API_KEY || "pg_your_key_here";
const INPUT   = process.argv[2] || "input.ogg"; // pre-encoded Ogg/Opus file
const REGION  = process.env.POLARGRID_REGION || "yto-01";
const VOICE   = process.env.POLARGRID_VOICE || "NATF0";
const PERSONA = process.env.POLARGRID_PERSONA || "A helpful voice assistant.";
const OUTPUT  = process.argv[3] || "response.ogg";

// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------
const oggData = readFileSync(INPUT);
console.log(`Loaded ${INPUT}: ${oggData.length} bytes`);

const url = new URL(`wss://api.${REGION}.edge.polargrid.ai/v1/voice/personaplex`);
url.searchParams.set("voice", VOICE);
url.searchParams.set("persona", PERSONA);
url.searchParams.set("token", API_KEY);

// The ws client never sends pings on its own, so there is no heartbeat option
// to disable. Do not add a ping timer: moshi does not answer pings with pongs.
const ws = new WebSocket(url.toString());
ws.binaryType = "arraybuffer";

const responseChunks = [];
const transcriptParts = [];

ws.on("open", () => {
  console.log("WebSocket connected, waiting for handshake...");
});

ws.on("message", (data) => {
  const buf = Buffer.from(data);
  if (buf.length < 1) return;

  const tag = buf[0];
  const payload = buf.subarray(1);

  if (tag === TAG_HANDSHAKE) {
    // Step 1: Handshake received — safe to send audio
    console.log("Handshake received — sending audio...");

    // Step 2: Send audio in chunks with 0x01 tag prefix
    const CHUNK = 4096;
    for (let i = 0; i < oggData.length; i += CHUNK) {
      const slice = oggData.subarray(i, i + CHUNK);
      const frame = Buffer.concat([Buffer.from([TAG_AUDIO]), slice]);
      ws.send(frame);
    }
    console.log(`Sent ${oggData.length} bytes of audio`);

    // Step 3: Send end-of-stream control frame
    ws.send(Buffer.concat([Buffer.from([TAG_CONTROL]), Buffer.from("eos")]));
    console.log("Sent eos — waiting for response...");

  } else if (tag === TAG_AUDIO) {
    responseChunks.push(payload);

  } else if (tag === TAG_TEXT) {
    const text = payload.toString("utf-8");
    transcriptParts.push(text);
    console.log(`  transcript: ${text}`);
  }
});

ws.on("close", () => {
  // Step 4: Save response audio
  if (responseChunks.length > 0) {
    const responseAudio = Buffer.concat(responseChunks);
    writeFileSync(OUTPUT, responseAudio);
    console.log(`\nSaved response audio to ${OUTPUT} (${responseAudio.length} bytes)`);
  }
  if (transcriptParts.length > 0) {
    console.log(`Full transcript: ${transcriptParts.join("")}`);
  }
});

ws.on("error", (err) => {
  console.error("WebSocket error:", err.message);
  process.exit(1);
});
Usage:
# First, convert your audio to Ogg/Opus mono 24 kHz:
ffmpeg -i question.wav -ac 1 -ar 24000 -c:a libopus question.ogg

# Run the client:
POLARGRID_API_KEY=pg_your_key_here node personaplex_client.js question.ogg
# => response.ogg
The Node.js example reads a pre-encoded Ogg/Opus file to keep the code short. If you need to encode from WAV at runtime, use ffmpeg as a subprocess or the @discordjs/opus package to encode PCM frames, then wrap them in Ogg pages (see the Python example for the page structure).
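
The ffmpeg subprocess route can be scripted in any language. For consistency with the other sketches on this page it is shown in Python here, using the same flags as the ffmpeg command above. The helper names are hypothetical, and ffmpeg is assumed to be on PATH.

```python
import subprocess


def ffmpeg_opus_args(src: str, dst: str) -> list:
    """Argument list for converting src to mono 24 kHz Ogg/Opus at dst."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "24000",
            "-c:a", "libopus", dst]


def convert_to_ogg_opus(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_opus_args(src, dst), check=True)


print(" ".join(ffmpeg_opus_args("question.wav", "question.ogg")))
```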