Modular Pipeline Agent
Beta. The connect endpoint and auth flow are not yet public. Contact PolarGrid for access.
Overview
The Modular Pipeline Agent orchestrates three models from the PolarGrid catalog on each conversation turn: speech-to-text, an LLM, and text-to-speech. Audio streams in both directions. A parallel event channel carries JSON messages (transcripts, per-token LLM output, latency markers, errors). Any model returned by GET /v1/models with the matching modality can be used.
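A minimal sketch of selecting stage candidates from the catalog. The response shape here (a list of entries with an `id` and a `modality` field) is an assumption for illustration; consult the actual GET /v1/models response schema.

```python
# Sketch: pick candidate models for each pipeline stage from the catalog.
# ASSUMPTION: each entry returned by GET /v1/models carries "id" and
# "modality" fields; the real response schema may differ.

def models_by_modality(catalog, modality):
    """Return the ids of catalog entries matching one modality."""
    return [m["id"] for m in catalog if m.get("modality") == modality]

# Example catalog entries using the agent's default models.
catalog = [
    {"id": "whisper-large-v3-turbo", "modality": "speech-to-text"},
    {"id": "meta-llama/Meta-Llama-3.1-8B-Instruct", "modality": "text-generation"},
    {"id": "kokoro-82m", "modality": "text-to-speech"},
]

print(models_by_modality(catalog, "speech-to-text"))  # ['whisper-large-v3-turbo']
```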
Default models
- STT: whisper-large-v3-turbo
- LLM: meta-llama/Meta-Llama-3.1-8B-Instruct
- TTS: kokoro-82m with voice bm_george
All three are overridable per session via request parameters (to be documented when the endpoint is public).
Pipeline behavior
The agent runs on-device voice activity detection (VAD) to detect end-of-speech, transcribes the utterance, then streams LLM generation into TTS synthesis: TTS begins at the first sentence boundary of LLM output rather than after the full response. If the user starts speaking again while the agent is mid-response, the in-flight LLM and TTS are cancelled.
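The "TTS begins at the first sentence boundary" behavior can be sketched as a chunker that buffers streamed LLM tokens and flushes a chunk whenever a sentence ends. The boundary rule below (`.`, `!`, `?`) is a naive stand-in; the agent's actual sentence-boundary detection is not documented.

```python
# Sketch: buffer streamed LLM tokens and emit sentence-sized chunks
# so TTS can start on the first complete sentence rather than waiting
# for the full response. ASSUMPTION: a simple punctuation-based rule
# stands in for the agent's real boundary detection.

SENTENCE_END = (".", "!", "?")

def chunk_sentences(token_stream):
    """Yield sentence-sized chunks from an iterable of LLM tokens."""
    buf = ""
    for tok in token_stream:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_END):
            yield buf.strip()   # hand this chunk to TTS immediately
            buf = ""
    if buf.strip():             # flush any trailing partial sentence
        yield buf.strip()

tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
print(list(chunk_sentences(tokens)))
# ['Hello there.', 'How can I help?']
```

On barge-in, a real client would additionally drop any buffered chunks along with the cancelled LLM and TTS requests.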
Event reference
The server emits the following JSON events on the event channel. Some event types are emitted more than once per turn with different payloads — for example, tts_complete fires once when the first audio plays and again when synthesis finishes.
| Event | When | Payload fields |
|---|---|---|
| config | On connect | models: { stt, llm, tts }, vad: { threshold, silence_ms } |
| speech_end_detected | VAD detects end of user speech | server_timestamp_ms |
| transcript | STT returns | text, latency_ms, turn_id |
| llm_start, tts_start | Pipeline markers | turn_id |
| llm_token | Each streamed LLM token | token, index, ttft_ms (first token only) |
| llm_complete | LLM generation finishes | full_response, tokens, latency_ms, turn_id |
| tts_complete (first) | First audio plays (TTFA) | latency_ms (TTFA), total_pipeline_ms, turn_id, server_timestamp_ms |
| tts_complete (second) | TTS generation ends | latency_ms (TTFA value preserved), duration_ms, turn_summary, turn_id |
| metrics_stats | Periodic | Aggregated stt_latency, llm_ttft, tts_ttfa, pipeline_latency |
| error | On error | message |
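A client-side dispatcher for these events might look like the following sketch. The payload fields match the table above; distinguishing the two tts_complete firings by the presence of duration_ms is our assumption, not documented behavior.

```python
# Sketch: dispatch decoded JSON events from the event channel.
# ASSUMPTION: the first tts_complete lacks "duration_ms" and the
# second carries it, per the payload columns in the table above.

def handle_event(event, state):
    etype = event.get("type")
    if etype == "transcript":
        print(f"user said: {event['text']} ({event['latency_ms']} ms)")
    elif etype == "llm_token":
        # Accumulate streamed tokens into the in-progress response.
        state["response"] = state.get("response", "") + event["token"]
    elif etype == "tts_complete":
        if "duration_ms" in event:   # second firing: synthesis finished
            print(f"synthesis done, {event['duration_ms']} ms of audio")
        else:                        # first firing: time to first audio
            print(f"first audio after {event['latency_ms']} ms")
    elif etype == "error":
        print(f"error: {event['message']}")

state = {}
handle_event({"type": "llm_token", "token": "Hi", "index": 0}, state)
handle_event({"type": "llm_token", "token": "!", "index": 1}, state)
print(state["response"])  # Hi!
```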
Choosing between this and PersonaPlex
PersonaPlex uses a single multi-modal model and exposes a persona prompt and fixed voice IDs. The Modular Pipeline Agent runs three models you pick independently and exposes a per-turn event stream with latency markers. See the comparison table on the PersonaPlex page.
Access
Contact PolarGrid for Beta access.