Modular Pipeline Agent

Beta. The connect endpoint and auth flow are not yet public. Contact PolarGrid for access.

Overview

The Modular Pipeline Agent orchestrates three models from the PolarGrid catalog on each conversation turn: speech-to-text, an LLM, and text-to-speech. Audio streams in both directions. A parallel event channel carries JSON messages (transcripts, per-token LLM output, latency markers, errors). Any model returned by GET /v1/models with the matching modality can be used.

Default models

STT: whisper-large-v3-turbo
LLM: qwen-3.5-27b
TTS: kokoro-82m with voice bm_george

Overridable per session via request parameters (documented when the endpoint is public).

Pipeline behavior

The agent runs on-device voice activity detection (VAD) to detect end-of-speech, transcribes the utterance, then streams LLM generation into TTS synthesis: TTS begins at the first sentence boundary of LLM output rather than after the full response. If the user starts speaking again while the agent is mid-response, the in-flight LLM and TTS are cancelled.

Event reference

The server emits the following JSON events on the event channel. Some event types are emitted more than once per turn with different payloads — for example, tts_complete fires once when the first audio plays and again when synthesis finishes.

Event	When	Payload fields
`config`	On connect	`models: { stt, llm, tts }`, `vad: { threshold, silence_ms }`
`speech_end_detected`	VAD detects end of user speech	`server_timestamp_ms`
`transcript`	STT returns	`text`, `latency_ms`, `turn_id`
`llm_start`, `tts_start`	Pipeline markers	`turn_id`
`llm_token`	Each streamed LLM token	`token`, `index`, `ttft_ms` (first token only)
`llm_complete`	LLM generation finishes	`full_response`, `tokens`, `latency_ms`, `turn_id`
`tts_complete` (first)	First audio plays (TTFA)	`latency_ms` (TTFA), `total_pipeline_ms`, `turn_id`, `server_timestamp_ms`
`tts_complete` (second)	TTS generation ends	`latency_ms` (TTFA value preserved), `duration_ms`, `turn_summary`, `turn_id`
`metrics_stats`	Periodic	Aggregated `stt_latency`, `llm_ttft`, `tts_ttfa`, `pipeline_latency`
`error`	On error	`message`

Function calling

Function calling (tool use) is supported at the chat completions API level — the gateway’s /v1/chat/completions endpoint supports tools and tool_choice parameters with models that have native tool-use support (qwen-3.5-27b). The voice agent pipeline does not yet integrate function calling. Tool use in the voice agent is planned but not implemented — the pipeline currently streams text tokens only, with no tool-call detection, TTS pausing, or tool-result injection. PersonaPlex does not support function calling (single audio-native model, no LLM step).

Choosing between this and PersonaPlex

PersonaPlex uses a single multi-modal model and exposes a persona prompt and fixed voice IDs. The Modular Pipeline Agent runs three models you pick independently and exposes a per-turn event stream with latency markers. See the comparison table on the PersonaPlex page.

Access

Contact PolarGrid for Beta access.

​Modular Pipeline Agent

​Overview

​Default models

​Pipeline behavior

​Event reference

​Function calling

​Choosing between this and PersonaPlex

​Access

Modular Pipeline Agent

Overview

Default models

Pipeline behavior

Event reference

Function calling

Choosing between this and PersonaPlex

Access