Qwen3.5 9B - PolarGrid

Qwen3.5 9B (qwen-3.5-9b) is a 9-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights are runtime-quantized to FP8 (~9 GB VRAM) on Hopper and Blackwell GPUs; Ada-class GPUs run via vLLM’s Marlin fallback.

HF repo: Qwen/Qwen3.5-9B
Modality: Text LLM
Backend: Triton vllm (LLM pod)
Available regions: yvr-01

Headline benchmark

TTFT (p50, yvr-01): 124 ms
TTFT (p95): 143 ms
Throughput (p50): 61.6 tok/s
Total time, ~30-token reply (p50): 660 ms

Bench: 100 streaming chat-completion runs against https://api.yvr-01.edge.polargrid.ai, captured 2026-04-30. Reasoning mode off (default).
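To reproduce figures like these against your own traffic, the per-run metrics can be derived from three timestamps and a token count. The sketch below is an assumption about how such numbers are typically computed, not PolarGrid's actual benchmark harness: the `RunTiming` type, the decode-only throughput definition, and the nearest-rank percentile are all illustrative choices.

```python
import math
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps for one streaming run (hypothetical structure)."""
    request_sent: float     # seconds, e.g. from time.perf_counter()
    first_token: float      # arrival of the first content chunk
    last_token: float       # arrival of the final content chunk
    completion_tokens: int  # number of generated tokens


def ttft_ms(run: RunTiming) -> float:
    """Time to first token, in milliseconds."""
    return (run.first_token - run.request_sent) * 1000.0


def tokens_per_second(run: RunTiming) -> float:
    """Decode throughput: tokens after the first, over first-to-last time.

    Assumption: the first token is excluded because its latency is
    already captured by TTFT.
    """
    decode_s = run.last_token - run.first_token
    if decode_s <= 0 or run.completion_tokens < 2:
        return 0.0
    return (run.completion_tokens - 1) / decode_s


def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collect one `RunTiming` per request, then report `percentile([ttft_ms(r) for r in runs], 50)` and `percentile(..., 95)` to get p50 and p95 values comparable in spirit to the ones listed above.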

Quickstart

curl https://api.yvr-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-9b",
    "messages": [{"role": "user", "content": "Say hi in one short sentence."}],
    "stream": true,
    "max_tokens": 32
  }'
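The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the stream follows the OpenAI-style SSE convention (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); the page only states that streaming uses SSE, so verify the chunk shape against a real response. The helper names `iter_content` and `stream_chat` are illustrative.

```python
import json
import urllib.request

API_URL = "https://api.yvr-01.edge.polargrid.ai/v1/chat/completions"


def iter_content(sse_lines):
    """Yield text deltas from an iterable of SSE lines (str or bytes).

    Assumes OpenAI-style chunks; adjust the field access if the
    actual payload differs.
    """
    for raw in sse_lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt, token, max_tokens=32):
    """POST a streaming chat completion and yield text as it arrives."""
    body = json.dumps({
        "model": "qwen-3.5-9b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # An HTTPResponse iterates line by line, yielding bytes.
        yield from iter_content(response)


# Example usage (requires a valid token in the TOKEN environment variable):
#   import os
#   for piece in stream_chat("Say hi in one short sentence.", os.environ["TOKEN"]):
#       print(piece, end="", flush=True)
```

Keeping the parser separate from the network call makes it straightforward to test against recorded responses.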

Capabilities

Field	Value
Context window	8192 tokens
Streaming	Yes (SSE via `stream: true`)
Tools / function calling	No
Reasoning (“thinking”) mode	Off by default; opt in via `"enable_thinking": true` in the request body
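Because the context window is fixed at 8192 tokens, it can help to clamp `max_tokens` on the client before sending a long prompt. The sketch below assumes the window covers prompt and generated tokens combined, which is the usual behavior for vLLM-served models but is not stated on this page. `clamp_max_tokens` is a hypothetical helper, and the prompt token count must come from the caller's own tokenizer.

```python
CONTEXT_WINDOW = 8192  # tokens, from the capabilities table


def clamp_max_tokens(prompt_tokens, requested_max_tokens,
                     context_window=CONTEXT_WINDOW):
    """Return a max_tokens value that fits in the remaining context.

    Assumption: prompt tokens and generated tokens share one window.
    Raises ValueError if the prompt leaves no room for any output.
    """
    if prompt_tokens < 0 or requested_max_tokens < 1:
        raise ValueError("token counts must be positive")
    remaining = context_window - prompt_tokens
    if remaining < 1:
        raise ValueError("prompt fills the context window")
    return min(requested_max_tokens, remaining)
```

For example, a prompt of 8000 tokens with a requested budget of 512 would be clamped to 192.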

Reasoning mode

Qwen3.5 ships with a “thinking” mode that emits a <think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency under 150 ms, which is the right trade-off for voice agents and conversational UIs. To enable thinking for a single request (longer TTFT, higher reasoning quality), set enable_thinking in the request body:

{
  "model": "qwen-3.5-9b",
  "messages": [{"role": "user", "content": "..."}],
  "enable_thinking": true
}

With reasoning on, expect TTFT to land in the 600 ms – 2 s range depending on prompt difficulty, and the reply to begin with <think> tokens that callers may want to filter before display.
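Filtering the reasoning trace from a completed reply can be done with a small post-processing step. This is a minimal sketch, assuming the trace appears inline in the reply text wrapped in literal <think> and </think> tags as described above; `strip_thinking` is an illustrative name, not part of any PolarGrid SDK.

```python
import re

# Non-greedy match so multiple think blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(reply):
    """Remove <think>...</think> traces and return the visible answer."""
    visible = _THINK_BLOCK.sub("", reply)
    # If generation stopped mid-trace (for example at max_tokens), the
    # closing tag is missing; drop everything from the opening tag on.
    open_at = visible.find("<think>")
    if open_at != -1:
        visible = visible[:open_at]
    return visible.strip()
```

When streaming, the same idea applies, but the caller has to buffer chunks until the closing tag arrives before displaying anything, since a tag may be split across chunks.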

Model identifier

Call this model with the canonical id qwen-3.5-9b at all inference endpoints (/v1/chat/completions, /v1/completions). The Hugging Face repo id Qwen/Qwen3.5-9B is accepted at /v1/models/load for hot-loading but does not resolve at inference time, so use the canonical id for chat and completions calls.
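Client code that receives model names from configuration may want to normalize them before calling an inference endpoint. The sketch below is one way to do that; the `inference_model_id` helper and its alias table are hypothetical and only encode the two ids named on this page.

```python
CANONICAL_ID = "qwen-3.5-9b"
HF_REPO_ID = "Qwen/Qwen3.5-9B"

# Map ids that only work for hot-loading to the id inference expects.
_ALIASES = {HF_REPO_ID: CANONICAL_ID}


def inference_model_id(model):
    """Return the id to send to chat or completions endpoints.

    Unknown ids pass through unchanged so other models keep working.
    """
    return _ALIASES.get(model, model)
```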

Notes

License: Apache 2.0 (no auth required to pull weights).
Voice agents in the PolarGrid voice pipeline use Qwen3.5 9B as the LLM stage between STT and TTS.
Single-GPU edge nodes can host this model alongside the voice pod by sharing GPU 0; dual-GPU edge nodes keep the LLM pinned to GPU 1.
