Qwen3.5 9B (qwen-3.5-9b) is a 9-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights are runtime-quantized to FP8 (~9 GB VRAM) on Hopper and Blackwell GPUs; Ada-class GPUs run via vLLM’s Marlin fallback.
  • HF repo: Qwen/Qwen3.5-9B
  • Modality: Text LLM
  • Backend: Triton vllm (LLM pod)
  • Available regions: yvr-01

Headline benchmark

  • TTFT (p50, yvr-01): 124 ms
  • TTFT (p95): 143 ms
  • Throughput (p50): 61.6 tok/s
  • Total time, ~30-token reply (p50): 660 ms
Bench: 100 streaming chat-completion runs against https://api.yvr-01.edge.polargrid.ai, captured 2026-04-30. Reasoning mode off (default).
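As a rough sanity check on these figures, total reply time can be approximated as TTFT plus decode time (tokens divided by throughput). The sketch below is a back-of-envelope estimate only, not how the benchmark was computed; the function name estimate_total_ms is hypothetical.

```python
def estimate_total_ms(ttft_ms: float, n_tokens: int, tok_per_s: float) -> float:
    """Approximate end-to-end latency as TTFT plus steady-state decode time."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    decode_ms = n_tokens / tok_per_s * 1000.0
    return ttft_ms + decode_ms


# Headline p50 figures from the benchmark above: 124 ms TTFT, 61.6 tok/s.
estimate = estimate_total_ms(ttft_ms=124, n_tokens=30, tok_per_s=61.6)
print(f"{estimate:.0f} ms")
```

With the p50 figures this gives roughly 611 ms, a little under the measured 660 ms. Medians of separate metrics do not add exactly, and the benchmark replies are only approximately 30 tokens long.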

Quickstart

curl https://api.yvr-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-9b",
    "messages": [{"role": "user", "content": "Say hi in one short sentence."}],
    "stream": true,
    "max_tokens": 32
  }'
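The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names (build_request, iter_content, stream_reply) are hypothetical, and the parser assumes the stream uses OpenAI-style SSE chunks ("data: {json}" lines, text under choices[0].delta.content, a final "data: [DONE]"), which this page does not confirm.

```python
import json
import urllib.request
from typing import Iterable, Iterator

API_URL = "https://api.yvr-01.edge.polargrid.ai/v1/chat/completions"


def build_request(token: str, prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build the same streaming chat-completion request as the curl example."""
    body = {
        "model": "qwen-3.5-9b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from SSE lines.

    ASSUMPTION: OpenAI-style chunk format, as described in the lead-in.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


def stream_reply(token: str, prompt: str) -> str:
    """Send the request, print fragments as they arrive, return the full reply."""
    pieces = []
    with urllib.request.urlopen(build_request(token, prompt)) as resp:
        for piece in iter_content(resp):  # the response iterates line by line
            print(piece, end="", flush=True)
            pieces.append(piece)
    print()
    return "".join(pieces)
```

A call such as stream_reply(os.environ["TOKEN"], "Say hi in one short sentence.") would mirror the curl command above. Keeping the parser separate from the network call makes it possible to test against recorded lines.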

Capabilities

  • Context window: 8192 tokens
  • Streaming: Yes (SSE via stream: true)
  • Tools / function calling: No
  • Reasoning (“thinking”) mode: Off by default; opt in via "enable_thinking": true in the request body

Reasoning mode

Qwen3.5 ships with a “thinking” mode that emits a <think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency under 150 ms — the right trade-off for voice agents and conversational UIs. To enable thinking on a per-request basis (longer TTFT, higher reasoning quality):
{
  "model": "qwen-3.5-9b",
  "messages": [{"role": "user", "content": "..."}],
  "enable_thinking": true
}
With reasoning on, expect TTFT to land in the 600 ms – 2 s range depending on prompt difficulty, and the reply to begin with <think> tokens that callers may want to filter before display.
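One way to hide the reasoning trace from end users is to strip it client-side once the reply has been received in full. The sketch below assumes the trace is delimited by literal <think> and </think> tags as described above; the helper names are hypothetical.

```python
import re

# Matches a complete <think>...</think> block plus any whitespace after it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def build_thinking_request(prompt: str) -> dict:
    """Request body with reasoning mode switched on for this call only."""
    return {
        "model": "qwen-3.5-9b",
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": True,
    }


def strip_thinking(reply: str) -> str:
    """Remove <think>...</think> traces from a fully received reply."""
    return _THINK_BLOCK.sub("", reply).lstrip()
```

For streamed replies, a caller would need to buffer output until the closing </think> tag arrives before displaying anything; this sketch only covers the non-streaming case.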

Model identifier

Call this model with the canonical id qwen-3.5-9b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-9B is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.
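A client that also hot-loads models may end up holding the HuggingFace repo id, so a small guard can normalize it before inference calls. This is an illustrative sketch; the constant and function names are hypothetical and not part of any PolarGrid SDK.

```python
CANONICAL_MODEL_ID = "qwen-3.5-9b"   # accepted at inference endpoints
HF_REPO_ID = "Qwen/Qwen3.5-9B"       # accepted at /v1/models/load only


def inference_model_id(model: str) -> str:
    """Map the HuggingFace repo id to the canonical id; pass other ids through."""
    if model == HF_REPO_ID:
        return CANONICAL_MODEL_ID
    return model
```

Calling inference_model_id("Qwen/Qwen3.5-9B") returns "qwen-3.5-9b", which can then be used as the "model" field in chat and completions requests.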

Notes

  • License: Apache 2.0 (no auth required to pull weights).
  • Voice agents in the PolarGrid voice pipeline use Qwen3.5 9B as the LLM stage between STT and TTS.
  • Single-GPU edge nodes can host this model alongside the voice pod by sharing GPU 0; dual-GPU edge nodes keep the LLM pinned to GPU 1.
