Modular Pipeline Agent

Beta. The connect endpoint and auth flow are not yet public. Contact PolarGrid for access.

Overview

The Modular Pipeline Agent orchestrates three models from the PolarGrid catalog on each conversation turn: speech-to-text, an LLM, and text-to-speech. Audio streams in both directions. A parallel event channel carries JSON messages (transcripts, per-token LLM output, latency markers, errors). Any model returned by GET /v1/models with the matching modality can be used.

Default models

  • STT: whisper-large-v3-turbo
  • LLM: meta-llama/Meta-Llama-3.1-8B-Instruct
  • TTS: kokoro-82m with voice bm_george
These defaults can be overridden per session via request parameters, which will be documented when the endpoint is public.
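Because the request parameters are not yet documented, the following is only a sketch of how a client might keep the defaults above and layer per-session overrides on top. The keys `stt`, `llm`, and `tts` mirror the `models` object in the `config` event; the function name and the idea of passing overrides as a dictionary are assumptions, not the published API.

```python
from typing import Dict, Optional

# Defaults as listed in this page.
DEFAULT_MODELS: Dict[str, str] = {
    "stt": "whisper-large-v3-turbo",
    "llm": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "tts": "kokoro-82m",
}
DEFAULT_VOICE = "bm_george"


def resolve_session_models(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge hypothetical per-session overrides over the default models."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_MODELS)
    if unknown:
        # Reject keys that are not one of the three pipeline stages.
        raise ValueError(f"unknown pipeline stage(s): {sorted(unknown)}")
    return {**DEFAULT_MODELS, **overrides}
```

Any override value should be a model ID returned by GET /v1/models with the modality that matches its stage.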

Pipeline behavior

The agent runs on-device voice activity detection (VAD) to detect end-of-speech, transcribes the utterance, then streams LLM generation into TTS synthesis. TTS begins at the first sentence boundary of the LLM output rather than waiting for the full response, which reduces time to first audio. If the user starts speaking again while the agent is mid-response, the in-flight LLM generation and TTS synthesis are cancelled.
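The sentence-boundary handoff described above can be illustrated with a small client-side sketch. This is not the agent's actual implementation: the boundary rule (a `.`, `!`, or `?` followed by whitespace) and the function name are assumptions chosen for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as streamed tokens form one."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence.
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains when the token stream ends.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence could be handed to TTS immediately, so synthesis of the first sentence overlaps with generation of the rest. On barge-in, a caller would stop consuming the generator and discard any queued sentences.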

Event reference

The server emits the following JSON events on the event channel. Some event types are emitted more than once per turn with different payloads. For example, tts_complete is emitted once when the first audio plays and again when synthesis finishes.
Event | When | Payload fields
config | On connect | models: { stt, llm, tts }, vad: { threshold, silence_ms }
speech_end_detected | VAD detects end of user speech | server_timestamp_ms
transcript | STT returns | text, latency_ms, turn_id
llm_start, tts_start | Pipeline markers | turn_id
llm_token | Each streamed LLM token | token, index, ttft_ms (first token only)
llm_complete | LLM generation finishes | full_response, tokens, latency_ms, turn_id
tts_complete (first) | First audio plays (TTFA) | latency_ms (TTFA), total_pipeline_ms, turn_id, server_timestamp_ms
tts_complete (second) | TTS generation ends | latency_ms (TTFA value preserved), duration_ms, turn_summary, turn_id
metrics_stats | Periodic | Aggregated stt_latency, llm_ttft, tts_ttfa, pipeline_latency
error | On error | message
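Since tts_complete arrives twice per turn, a client needs some way to tell the two apart. The sketch below assumes each event is a JSON object with a `type` field naming the event; that field name is hypothetical, as the wire format is not specified here. It distinguishes the second tts_complete by the presence of `duration_ms` or `turn_summary`, which the table lists only for that payload.

```python
import json
from typing import Any, Dict


def tts_complete_phase(event: Dict[str, Any]) -> str:
    """Return 'finished' for the second tts_complete, else 'first_audio'."""
    if "duration_ms" in event or "turn_summary" in event:
        return "finished"
    return "first_audio"


def handle_event(raw: str, turns: Dict[Any, Dict[str, Any]]) -> None:
    """Update per-turn state in `turns` from one raw JSON event."""
    event = json.loads(raw)
    kind = event.get("type")  # assumed discriminator field name
    if kind == "error":
        raise RuntimeError(event.get("message", "unknown error"))
    if "turn_id" not in event:
        return  # e.g. config, llm_token, metrics_stats: ignored in this sketch
    turn = turns.setdefault(event["turn_id"], {})
    if kind == "transcript":
        turn["transcript"] = event["text"]
    elif kind == "llm_complete":
        turn["response"] = event["full_response"]
    elif kind == "tts_complete":
        if tts_complete_phase(event) == "first_audio":
            turn["ttfa_ms"] = event["latency_ms"]
        else:
            turn["tts_duration_ms"] = event["duration_ms"]
            turn["done"] = True
```

Keying state by turn_id keeps events from a cancelled turn from being mixed into the turn that follows a barge-in.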

Choosing between this and PersonaPlex

PersonaPlex uses a single multi-modal model and exposes a persona prompt and fixed voice IDs. The Modular Pipeline Agent runs three models you pick independently and exposes a per-turn event stream with latency markers. See the comparison table on the PersonaPlex page.

Access

Contact PolarGrid for Beta access.