Qwen3.5 9B (Documentation Index
Fetch the complete documentation index at: https://polargrid.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
qwen-3.5-9b) is a 9-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights are runtime-quantized to FP8 (~9 GB VRAM) on Hopper and Blackwell GPUs (Ada-class GPUs run via vLLM’s marlin fallback).
- HF repo:
Qwen/Qwen3.5-9B - Modality: Text LLM
- Backend: Triton
vllm(LLM pod) - Available regions:
yvr-01
Headline benchmark
- TTFT (p50,
yvr-01): 124 ms - TTFT (p95): 143 ms
- Throughput (p50): 61.6 tok/s
- Total time, ~30-token reply (p50): 660 ms
https://api.yvr-01.edge.polargrid.ai, captured 2026-04-30. Reasoning mode off (default).
Quickstart
Capabilities
| Field | Value |
|---|---|
| Context window | 8192 tokens |
| Streaming | Yes (SSE via stream: true) |
| Tools / function calling | No |
| Reasoning (“thinking”) mode | Off by default; opt in via "enable_thinking": true in the request body |
Reasoning mode
Qwen3.5 ships with a “thinking” mode that emits a<think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency under 150 ms — the right trade-off for voice agents and conversational UIs.
To enable thinking on a per-request basis (longer TTFT, higher reasoning quality):
<think> tokens that callers may want to filter before display.
Model identifier
Call this model with the canonical idqwen-3.5-9b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-9B is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.
Notes
- License: Apache 2.0 (no auth required to pull weights).
- Voice agents in the PolarGrid voice pipeline use Qwen3.5 9B as the LLM stage between STT and TTS.
- Single-GPU edge nodes can host this model alongside the voice pod by sharing GPU 0; dual-GPU edges keep LLM pinned to GPU 1.
See also
- Authentication — how to mint a JWT from your API key
/v1/models— list all available models/v1/chat/completions— endpoint reference
