qwen-3.5-27b) is a 27-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights ship pre-quantized to FP8 (~28 GB VRAM) and load directly on PolarGrid’s Blackwell edge GPUs without runtime requantization.
- HF repo:
Qwen/Qwen3.5-27B-FP8 - Modality: Text LLM
- Backend: Triton
vllm(LLM pod) - Available regions:
yvr-02(Blackwell),yto-01,yul-01
Headline benchmark
We publish two numbers side by side. End-to-end is what your application actually experiences (request → response, network included). Server-only is what the GPU spends on inference (apples-to-apples vs centralized providers’ published “inference-only” figures). The gap is the latency PolarGrid’syvr-02 PoP eliminates by being at the edge.
| Measurement | TTFT p50 | TTFT p95 | Throughput p50 |
|---|---|---|---|
| End-to-end (with network) | 205 ms | 254 ms | 29.4 tok/s |
| Server-only (no network) | 97 ms | 156 ms | — |
| Network overhead (e2e − server) | 108 ms | — | — |
https://api.yvr-02.edge.polargrid.ai, captured 2026-05-27 from a Vancouver-area laptop over the public internet. End-to-end is client wall-clock; server-only is read from the gateway’s pg_metadata SSE event (inference_ttft_ms / inference_total_ms). Reasoning mode off (default). The yvr-02 node ships on RTX 6000 Pro Blackwell 96 GB — roughly 2.4× the tok/s and less than half the e2e total latency of the earlier yvr-01 (L40S-class) baseline at benchmarks/qwen-2026-05-01/27b/llm_bench.json. Raw runs: benchmarks/yvr-02-2026-05-27/27b/llm_bench.json.
Apples-to-apples disclaimer. Other providers usually publish only their server-side number; comparing it to our server-only row is the fair baseline. Our end-to-end row is what you’ll see from a customer-side request because PolarGrid runs at the edge. The network row above shows exactly how much that’s worth in milliseconds.
How this compares
| Provider | TTFT p50 | Throughput p50 | Source |
|---|---|---|---|
PolarGrid qwen-3.5-27b on Blackwell | 205 ms e2e / 97 ms server | 29.4 tok/s | this card |
| Claude Sonnet 4.5 | ~1600 ms | 47.6 tok/s | artificialanalysis.ai |
| gpt-4o | ~850 ms | 135 tok/s | artificialanalysis.ai |
| Cerebras (specialty silicon) | n/a | ~2100 tok/s | cerebras.ai |
| Groq with speculative decoding | n/a | ~1665 tok/s | artificialanalysis.ai |
enforce_eager=true (a CUDA-graph workaround for the vLLM 0.17.x path on this model) costs an estimated 10 to 20 percent. The pending vLLM patch closes part of the gap.
Quickstart
Capabilities
| Field | Value |
|---|---|
| Context window | 8192 tokens |
| Streaming | Yes (SSE via stream: true) |
| Function calling / tools | Yes (Hermes-style; see “Function calling” below) |
Structured output (response_format) | Yes — json_object and json_schema (vLLM structured_outputs constrained decoding) |
| Logprobs | No (vllm_backend exposes only text_output over Triton; not surfaced) |
| Sampling controls | temperature, top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty, seed, stop |
| Reasoning (“thinking”) mode | Off by default; opt in via "enable_thinking": true in the request body |
Function calling
Pass OpenAI-shapetools and the model returns a tool_calls array on the assistant message (or as a delta.tool_calls chunk when streaming). The gateway speaks Qwen’s Hermes tool-call template under the hood.
tool_choice accepts "auto" (model decides), "none" (force plain text), "required" (force a tool call), or { "type": "function", "function": { "name": "<tool>" } } to force a specific tool.
After invoking the tool yourself, append a role: "tool" message containing the result and re-call the model:
Structured output (JSON mode)
Useresponse_format to force the model to emit valid JSON. Backed server-side by vLLM’s structured_outputs constrained decoding, so the output is guaranteed to parse.
{"type": "json_object"} accepts any valid JSON; json_schema constrains it to your schema.
Reasoning mode
Qwen3.5 ships with a “thinking” mode that emits a<think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency low. The 27B variant runs the same toggle, with deeper reasoning quality at the cost of longer generation time.
To enable thinking on a per-request basis:
Model identifier
Call this model with the canonical idqwen-3.5-27b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-27B-FP8 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.
Notes
- License: Apache 2.0 (no auth required to pull weights).
- Native FP8 — no runtime quantization step at load, lower load time than 9B’s runtime-quantized path.
- VRAM is tight: a single 46 GB L40S can host this model or the voice stack, not both. Multi-GPU edges pin 27B to its own GPU; see
backend/edge-production-setup/CLAUDE.mdfor the layout matrix. - Compared to
qwen-3.5-9b: higher reasoning quality, longer TTFT, higher VRAM. Use 9B for latency-sensitive voice paths and 27B when reply quality is the priority.
See also
- Qwen3.5 9B — sibling model, lighter VRAM footprint
- Authentication — how to mint a JWT from your API key
/v1/models— list all available models/v1/chat/completions— endpoint reference
