Qwen3.5 27B (Documentation Index
Fetch the complete documentation index at: https://polargrid.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
qwen-3.5-27b) is a 27-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights ship pre-quantized to FP8 (~28 GB VRAM) and load directly on PolarGrid’s Blackwell edge GPUs without runtime requantization.
- HF repo:
Qwen/Qwen3.5-27B-FP8 - Modality: Text LLM
- Backend: Triton
vllm(LLM pod) - Available regions:
yvr-02,yto-01,yul-01
Headline benchmark
We publish two numbers side by side. End-to-end is what your application actually experiences (request → response, network included). Server-only is what the GPU spends on inference (apples-to-apples vs centralized providers’ published “inference-only” figures). The gap is the latency PolarGrid’syto-01 PoP eliminates by being at the edge.
| Measurement | TTFT p50 | TTFT p95 | Throughput p50 |
|---|---|---|---|
| End-to-end (with network) | 300 ms | 330 ms | 12.4 tok/s |
| Server-only (no network) | 211 ms | 240 ms | — |
| Network overhead (e2e − server) | 90 ms | — | — |
https://api.yto-01.edge.polargrid.ai, captured 2026-05-01. End-to-end is client-side wall-clock from a colocated test machine; server-only is read from the gateway’s pg_metadata SSE event (inference_ttft_ms / inference_total_ms). Reasoning mode off (default). Throughput is lower than the 9B sibling because vLLM 0.17.x’s CUDA-graph capture asserts on Qwen3-Next hybrid attention, so the model runs in enforce_eager mode (no graph capture); a future Triton image with the upstream vLLM patch (vllm-project/vllm#27406) will let us re-enable graph capture and re-bench.
Apples-to-apples disclaimer. Other providers usually publish only their server-side number; comparing it to our server-only row is the fair baseline. Our end-to-end row is what you’ll see from a customer-side request because PolarGrid runs at the edge — the network row above shows exactly how much that’s worth in milliseconds.
Quickstart
Capabilities
| Field | Value |
|---|---|
| Context window | 8192 tokens |
| Streaming | Yes (SSE via stream: true) |
| Function calling / tools | Yes (Hermes-style; see “Function calling” below) |
Structured output (response_format) | Yes — json_object and json_schema (vLLM structured_outputs constrained decoding) |
| Logprobs | No (vllm_backend exposes only text_output over Triton; not surfaced) |
| Sampling controls | temperature, top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty, seed, stop |
| Reasoning (“thinking”) mode | Off by default; opt in via "enable_thinking": true in the request body |
Function calling
Pass OpenAI-shapetools and the model returns a tool_calls array on the assistant message (or as a delta.tool_calls chunk when streaming). The gateway speaks Qwen’s Hermes tool-call template under the hood.
tool_choice accepts "auto" (model decides), "none" (force plain text), "required" (force a tool call), or { "type": "function", "function": { "name": "<tool>" } } to force a specific tool.
After invoking the tool yourself, append a role: "tool" message containing the result and re-call the model:
Structured output (JSON mode)
Useresponse_format to force the model to emit valid JSON. Backed server-side by vLLM’s structured_outputs constrained decoding, so the output is guaranteed to parse.
{"type": "json_object"} accepts any valid JSON; json_schema constrains it to your schema.
Reasoning mode
Qwen3.5 ships with a “thinking” mode that emits a<think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency low. The 27B variant runs the same toggle, with deeper reasoning quality at the cost of longer generation time.
To enable thinking on a per-request basis:
Model identifier
Call this model with the canonical idqwen-3.5-27b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-27B-FP8 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.
Notes
- License: Apache 2.0 (no auth required to pull weights).
- Native FP8 — no runtime quantization step at load, lower load time than 9B’s runtime-quantized path.
- VRAM is tight: a single 46 GB L40S can host this model or the voice stack, not both. Multi-GPU edges pin 27B to its own GPU; see
backend/edge-production-setup/CLAUDE.mdfor the layout matrix. - Compared to
qwen-3.5-9b: higher reasoning quality, longer TTFT, higher VRAM. Use 9B for latency-sensitive voice paths and 27B when reply quality is the priority.
See also
- Qwen3.5 9B — sibling model, lighter VRAM footprint
- Authentication — how to mint a JWT from your API key
/v1/models— list all available models/v1/chat/completions— endpoint reference
