Qwen3.5 27B - PolarGrid

Qwen3.5 27B (qwen-3.5-27b) is a 27-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights ship pre-quantized to FP8 (~28 GB VRAM) and load directly on PolarGrid’s Blackwell edge GPUs without runtime requantization.

HF repo: Qwen/Qwen3.5-27B-FP8
Modality: Text LLM
Backend: Triton vllm (LLM pod)
Available regions: yvr-02, yto-01, yul-01

Headline benchmark

We publish two numbers side by side. End-to-end is what your application actually experiences (request → response, network included). Server-only is the time the GPU spends on inference, which is the apples-to-apples comparison against the “inference-only” figures that centralized providers publish. The gap between the two is the network overhead, which serving from PolarGrid’s yto-01 edge PoP keeps small.

Measurement	TTFT p50	TTFT p95	Throughput p50
End-to-end (with network)	300 ms	330 ms	12.4 tok/s
Server-only (no network)	211 ms	240 ms	—
Network overhead (e2e − server)	~90 ms	—	—

Bench: 100 streaming chat-completion runs against https://api.yto-01.edge.polargrid.ai, captured 2026-05-01. End-to-end is client-side wall-clock from a colocated test machine; server-only is read from the gateway’s pg_metadata SSE event (inference_ttft_ms / inference_total_ms). Reasoning mode off (default). Throughput is lower than the 9B sibling because vLLM 0.17.x’s CUDA-graph capture asserts on Qwen3-Next hybrid attention, so the model runs in enforce_eager mode (no graph capture); a future Triton image with the upstream vLLM patch (vllm-project/vllm#27406) will let us re-enable graph capture and re-bench.
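To make the table's arithmetic concrete, here is a minimal sketch of how p50/p95 and the network-overhead row could be derived from per-run TTFT samples. The nearest-rank percentile method, the function names, and the sample values in the usage note are assumptions for illustration; the source does not say which percentile method the bench used.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(e2e_ttft_ms, server_ttft_ms):
    """Summarize client-side (end-to-end) and server-reported TTFT samples, in milliseconds."""
    e2e_p50 = percentile(e2e_ttft_ms, 50)
    server_p50 = percentile(server_ttft_ms, 50)
    return {
        "e2e_p50": e2e_p50,
        "e2e_p95": percentile(e2e_ttft_ms, 95),
        "server_p50": server_p50,
        "server_p95": percentile(server_ttft_ms, 95),
        # Network overhead as defined in the table: e2e p50 minus server-only p50.
        "network_overhead_p50": e2e_p50 - server_p50,
    }
```

With made-up samples such as `summarize([300, 310, 290], [210, 220, 200])`, the overhead entry is the difference of the two medians. A difference of medians is not necessarily the median of per-run differences, so small rounding gaps between rows are expected.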

Apples-to-apples disclaimer. Other providers usually publish only their server-side number; comparing it to our server-only row is the fair baseline. Our end-to-end row is what you’ll see from a customer-side request because PolarGrid runs at the edge — the network row above shows exactly how much that’s worth in milliseconds.

Quickstart

curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Say hi in one short sentence."}],
    "stream": true,
    "max_tokens": 32
  }'
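With `stream: true` the response arrives as server-sent events. The sketch below shows one way a client might pull the text out of those events. It assumes OpenAI-style chunks (`choices[].delta.content`) and a `data: [DONE]` terminator, neither of which this page spells out; the function name is hypothetical. Events without a `choices` array, such as the gateway's pg_metadata event, are skipped because their wire format is not described here.

```python
import json


def iter_content_deltas(sse_lines):
    """Yield text fragments from an iterable of raw SSE lines (str)."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and `event:` lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed OpenAI-style end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of the HTTP response body and joining the results reconstructs the full reply, for example `"".join(iter_content_deltas(lines))`.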

Capabilities

Field	Value
Context window	8192 tokens
Streaming	Yes (SSE via `stream: true`)
Function calling / tools	Yes (Hermes-style; see “Function calling” below)
Structured output (`response_format`)	Yes — `json_object` and `json_schema` (vLLM `structured_outputs` constrained decoding)
Logprobs	No (vllm_backend exposes only `text_output` over Triton; not surfaced)
Sampling controls	`temperature`, `top_p`, `top_k`, `min_p`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`, `seed`, `stop`
Reasoning (“thinking”) mode	Off by default; opt in via `"enable_thinking": true` in the request body
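As an illustration of the table above, a client could guard against unsupported options before sending a request. This is a sketch only: the helper name is hypothetical, the allow-list is copied from the "Sampling controls" row, and it assumes the controls are passed as top-level fields of the request body.

```python
SUPPORTED_SAMPLING = {
    "temperature", "top_p", "top_k", "min_p",
    "frequency_penalty", "presence_penalty", "repetition_penalty",
    "seed", "stop",
}


def build_chat_request(messages, max_tokens=None, stream=False, **sampling):
    """Build a chat-completions body, rejecting options the table does not list."""
    unknown = set(sampling) - SUPPORTED_SAMPLING
    if unknown:
        # For example `logprobs`, which the table marks as not surfaced.
        raise ValueError(f"unsupported sampling controls: {sorted(unknown)}")
    body = {"model": "qwen-3.5-27b", "messages": messages, "stream": stream}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    body.update(sampling)
    return body
```

Failing fast on the client keeps a typo in a parameter name from being silently ignored or rejected by the server.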

Function calling

Pass OpenAI-shape tools and the model returns a tool_calls array on the assistant message (or as a delta.tool_calls chunk when streaming). The gateway speaks Qwen’s Hermes tool-call template under the hood.

curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

tool_choice accepts "auto" (model decides), "none" (force plain text), "required" (force a tool call), or { "type": "function", "function": { "name": "<tool>" } } to force a specific tool. After invoking the tool yourself, append a role: "tool" message containing the result and re-call the model:

{
  "model": "qwen-3.5-27b",
  "messages": [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "tool_calls": [
      {"id": "call_abc", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc",
     "content": "{\"temp_c\": 22, \"sky\": \"sunny\"}"}
  ]
}
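The "invoke the tool yourself" step can be sketched as a small dispatcher that turns an assistant message's tool_calls into the `role: "tool"` messages to append. The `get_weather` implementation and the `TOOLS` registry are hypothetical stand-ins; only the message shapes come from the example above.

```python
import json


def get_weather(city):
    # Hypothetical stand-in for a real weather lookup.
    return {"temp_c": 22, "sky": "sunny"}


# Maps tool names declared in the request to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message, tools=TOOLS):
    """Execute each requested tool and return the tool messages to append."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = tools[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result back to the call
            "content": json.dumps(output),
        })
    return results
```

Appending the assistant message and then these results to `messages` produces the follow-up request body shown above.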

Structured output (JSON mode)

Use response_format to force the model to emit valid JSON. This is backed server-side by vLLM’s structured_outputs constrained decoding, so a response that finishes normally is guaranteed to parse; one cut off by max_tokens can still end mid-object.

curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Give me a JSON object describing a cat."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "cat",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age_years": {"type": "integer"},
            "color": {"type": "string"}
          },
          "required": ["name", "age_years", "color"]
        }
      }
    }
  }'

{"type": "json_object"} accepts any valid JSON; json_schema constrains it to your schema.
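On the client side, the returned message content is still a string and needs to be parsed. A defensive sketch for the cat schema above follows; it assumes the JSON arrives in the assistant message's `content` field, and the helper name and type table are hypothetical. The checks matter mostly for responses that stopped early, for instance when a token limit cuts the object short.

```python
import json

# Required fields and their Python types, mirroring the schema in the request.
CAT_FIELDS = {"name": str, "age_years": int, "color": str}


def parse_cat(content):
    """Parse the message content and verify the required fields and types."""
    data = json.loads(content)  # raises json.JSONDecodeError (a ValueError) if truncated
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in CAT_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing required field: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {key} has the wrong type")
    return data
```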

Reasoning mode

Qwen3.5 ships with a “thinking” mode that emits a <think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables it by default to keep first-token latency low. The 27B variant uses the same toggle as the 9B sibling and trades longer generation time for deeper reasoning. To enable thinking for a single request:

{
  "model": "qwen-3.5-27b",
  "messages": [{"role": "user", "content": "..."}],
  "enable_thinking": true
}
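When thinking is enabled, an application usually wants to show only the final answer. The sketch below separates the two, assuming the trace arrives inline in the message content wrapped in `<think>...</think>` as described above; if the gateway instead returns the trace in a separate field, this step is unnecessary. The function name is hypothetical.

```python
import re

# Non-greedy match so only the first <think> block is captured; DOTALL spans newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (reasoning, answer); reasoning is empty when no trace is present."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer
```

For a streamed response, apply it to the fully assembled text rather than to individual chunks, since a tag can be split across chunk boundaries.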

Model identifier

Call this model with the canonical id qwen-3.5-27b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-27B-FP8 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.

Notes

License: Apache 2.0 (no auth required to pull weights).
Native FP8 — no runtime quantization step at load, lower load time than 9B’s runtime-quantized path.
VRAM is tight: a single 46 GB L40S can host this model or the voice stack, not both. Multi-GPU edges pin 27B to its own GPU; see backend/edge-production-setup/CLAUDE.md for the layout matrix.
Compared to qwen-3.5-9b: higher reasoning quality, longer TTFT, higher VRAM. Use 9B for latency-sensitive voice paths and 27B when reply quality is the priority.
