Skip to main content

Models

PolarGrid serves open-weight models on GPU-accelerated edge infrastructure. All models are available via our OpenAI-compatible API.
PolarGrid runs open-source models optimized for low-latency edge inference. These models are selected for speed and efficiency in real-time applications like voice AI. For workloads that require large cloud-hosted reasoning models (e.g., GPT-4, Gemini, Claude), use those providers directly — PolarGrid does not proxy requests to third-party APIs.

LLM Models

Qwen 3.5 9B

Parameters9B
QuantizationFP8 (runtime)
Max context8,192 tokens
LicenseApache 2.0
Default LLM on every edge node. Lower TTFT and higher throughput than 27B, multilingual + multimodal (image input supported). Use this for latency-sensitive voice paths. Endpoint: POST /v1/chat/completions with "model": "qwen-3.5-9b"

Qwen 3.5 27B

Parameters27B
QuantizationFP8 (native)
Max context8,192 tokens
LicenseApache 2.0
Pricing0.20/1Minputtokens,0.20 / 1M input tokens, 0.75 / 1M output tokens
General-purpose large language model with strong performance across reasoning, coding, and multilingual tasks. Our largest deployed LLM, suitable for complex workloads where reply quality is the priority. Available fleet-wide on every edge node. Endpoint: POST /v1/chat/completions with "model": "qwen-3.5-27b"

Speech-to-Text Models

Whisper Large V3 Turbo

Parameters809M
LicenseApache 2.0
Pricing$0.004 / min
OpenAI’s Whisper model optimized for speed. Supports multilingual transcription with high accuracy. Endpoint: POST /v1/audio/transcriptions with "model": "whisper-large-v3-turbo" Full model card →

Cohere Transcribe

Parameters2B
LicenseApache 2.0
Supported languagesEnglish, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, Arabic
Pricing$0.004 / min
High-accuracy multilingual transcription with support for 14 languages. Supports punctuation toggling. Available on yvr-02 (Blackwell production). Endpoint: POST /v1/audio/transcriptions?sync=true with "model": "cohere-transcribe-03-2026" (also accepts the alias cohere-transcribe) Latency (yvr-02 Blackwell, external bench from a Vancouver-area laptop, 100 sync runs, 2026-05-27): server inference p50 238 ms / p95 305 ms, server-only RTF p50 0.041 (24× faster than real time). End-to-end TTFB p50 1068 ms is upload-dominated — each request ships a ~300 KB WAV body before inference can start; the network leg is the upload time, not POP-to-client RTT. See the Cohere Transcribe model page for the full breakdown.

Text-to-Speech Models

Hume AI TADA

Parameters~4B (Llama 3.2 3B text base + audio components)
LicenseLlama 3.2 Community License
Output24 kHz mono — caller picks the container via response_format (pcm, wav, or mp3)
Supported languagesEnglish, French, German, Spanish, Italian, Portuguese, Polish, Japanese, Arabic, Chinese
CapabilitiesCross-lingual voice cloning, speed control (batch only)
Streaming✓ chunked HTTP, pcm + opus (per-token via decoupled Triton handler; speed not honored in streaming mode)
Pricing$0.009 / min
Expressive text-to-speech with cross-lingual voice cloning. Generates natural-sounding speech across 10 languages from a short reference clip. Endpoint: POST /v1/audio/speech with "model": "tada-3b-ml"

Kokoro 82M

Parameters82M
LicenseApache 2.0
Streaming✓ chunked HTTP, pcm + opus
Pricing$0.006 / min
Lightweight, fast text-to-speech model. Ideal for low-latency voice applications where speed is critical. Endpoint: POST /v1/audio/speech with "model": "kokoro-82m" Full model card →

Voice Pipeline

PersonaPlex

Parameters7B
PipelineSTT + LLM + TTS (end-to-end)
Pricing$0.070 / min
Integrated voice-to-voice pipeline that combines speech recognition, language model reasoning, and speech synthesis into a single low-latency stream. Billed by wall-clock duration. See the PersonaPlex guide for setup details.

Performance

Latency benchmarks per model and region are actively being measured. Performance depends on:
  • Client proximity to the nearest edge region
  • Model size — smaller models have lower TTFT and higher throughput
  • Request complexity — token count, audio length, streaming vs. batch
The autorouter optimizes for the lowest-latency region automatically. For detailed performance data, contact us.

Custom Models

Enterprise customers can deploy custom fine-tuned models on PolarGrid infrastructure. PolarGrid handles provisioning and loading — contact us to discuss your model requirements. For custom model deployments, contact hello@polargrid.ai.

Pricing

See full pricing details and volume discounts

API Reference

Full endpoint documentation