Qwen3.5 27B (qwen-3.5-27b) is a 27-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights ship pre-quantized to FP8 (~28 GB VRAM) and load directly on PolarGrid’s Blackwell edge GPUs without runtime requantization.
  • HF repo: Qwen/Qwen3.5-27B-FP8
  • Modality: Text LLM
  • Backend: Triton vllm (LLM pod)
  • Available regions: yvr-02 (Blackwell), yto-01, yul-01

Headline benchmark

We publish two numbers side by side. End-to-end is what your application actually experiences (request → response, network included). Server-only is the time the GPU spends on inference, which is the apples-to-apples figure against centralized providers’ published “inference-only” numbers. The gap between the two is the network leg, which PolarGrid’s yvr-02 PoP keeps small by sitting at the edge.
| Measurement | TTFT p50 | TTFT p95 | Throughput p50 |
|---|---|---|---|
| End-to-end (with network) | 205 ms | 254 ms | 29.4 tok/s |
| Server-only (no network) | 97 ms | 156 ms | |
| Network overhead (e2e − server) | 108 ms | | |
Bench: 100 streaming chat-completion runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-27 from a Vancouver-area laptop over the public internet. End-to-end is client wall-clock; server-only is read from the gateway’s pg_metadata SSE event (inference_ttft_ms / inference_total_ms). Reasoning mode off (default). The yvr-02 node ships on RTX 6000 Pro Blackwell 96 GB — roughly 2.4× the tok/s and less than half the e2e total latency of the earlier yvr-01 (L40S-class) baseline at benchmarks/qwen-2026-05-01/27b/llm_bench.json. Raw runs: benchmarks/yvr-02-2026-05-27/27b/llm_bench.json.
Apples-to-apples disclaimer. Other providers usually publish only their server-side number; comparing it to our server-only row is the fair baseline. Our end-to-end row is what a customer-side request actually sees, and it stays close to the server-only figure because PolarGrid runs at the edge. The network overhead row above shows how many milliseconds the network leg adds.
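The arithmetic behind the network overhead row can be sketched as follows. This is a minimal illustration, not the benchmark harness: it assumes per-run TTFT samples are already collected into two lists, and it uses a nearest-rank percentile, which may differ from the method used to produce the published figures. The function names are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 0..100."""
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(e2e_ttft_ms, server_ttft_ms):
    """Summarize client wall-clock TTFT against server-reported TTFT."""
    e2e_p50 = percentile(e2e_ttft_ms, 50)
    server_p50 = percentile(server_ttft_ms, 50)
    return {
        "e2e_p50": e2e_p50,
        "e2e_p95": percentile(e2e_ttft_ms, 95),
        "server_p50": server_p50,
        "server_p95": percentile(server_ttft_ms, 95),
        # Same definition as the table: e2e p50 minus server-only p50.
        "network_overhead_p50": e2e_p50 - server_p50,
    }
```

With the published medians (205 ms end-to-end, 97 ms server-only), the overhead works out to the 108 ms shown in the table.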

How this compares

| Provider | TTFT p50 | Throughput p50 | Source |
|---|---|---|---|
| PolarGrid qwen-3.5-27b on Blackwell | 205 ms e2e / 97 ms server | 29.4 tok/s | this card |
| Claude Sonnet 4.5 | ~1600 ms | 47.6 tok/s | artificialanalysis.ai |
| gpt-4o | ~850 ms | 135 tok/s | artificialanalysis.ai |
| Cerebras (specialty silicon) | n/a | ~2100 tok/s | cerebras.ai |
| Groq with speculative decoding | n/a | ~1665 tok/s | artificialanalysis.ai |
PolarGrid leads on end-to-end TTFT against the frontier providers listed above (205 ms versus roughly 850 ms and 1600 ms), helped by edge proximity: the network leg is 108 ms at p50. On raw throughput PolarGrid trails every listed provider: roughly 1.6× behind Claude Sonnet 4.5, 4.6× behind gpt-4o, and 57× to 71× behind specialty silicon (Groq, Cerebras). Two factors contribute: RTX 6000 Pro Blackwell workstation FLOPS are below H100 and H200 datacenter FLOPS, and enforce_eager=true (a CUDA-graph workaround for the vLLM 0.17.x path on this model) costs an estimated 10 to 20 percent. The pending vLLM patch is expected to close part of the gap.
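The throughput gaps can be recomputed directly from the comparison table. The snippet below is only that arithmetic; the dictionary keys and the function name are illustrative, and the approximate ("~") figures are taken at face value.

```python
POLARGRID_TOK_S = 29.4  # Throughput p50 from this card

# Throughput p50 for the other rows of the comparison table.
OTHER_PROVIDERS_TOK_S = {
    "Claude Sonnet 4.5": 47.6,
    "gpt-4o": 135.0,
    "Groq with speculative decoding": 1665.0,
    "Cerebras": 2100.0,
}


def throughput_ratios(baseline, others):
    """Return how many times faster each provider is than the baseline."""
    return {name: round(tok_s / baseline, 1) for name, tok_s in others.items()}


if __name__ == "__main__":
    for name, ratio in throughput_ratios(POLARGRID_TOK_S, OTHER_PROVIDERS_TOK_S).items():
        print(f"{name}: {ratio}x PolarGrid throughput")
```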

Quickstart

curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Say hi in one short sentence."}],
    "stream": true,
    "max_tokens": 32
  }'
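For clients that consume the stream programmatically, the same request can be sketched in Python with only the standard library. This assumes the gateway emits OpenAI-style SSE lines (`data: {...}` chunks carrying `choices[].delta.content`, terminated by `data: [DONE]`); the sentinel and the helper names `iter_content` and `stream_chat` are assumptions, not part of the documented API.

```python
import json
import urllib.request


def iter_content(sse_lines):
    """Yield text fragments from an iterable of SSE lines (assumed OpenAI-style)."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and "event:" lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        # Events without "choices" (for example metadata events) yield nothing.
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(base_url, token, body):
    """POST a streaming chat completion and yield content fragments as they arrive."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_content(raw.decode("utf-8") for raw in response)
```

A caller would pass the same body as the curl example, with `"stream": true`, and print each fragment as it arrives.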

Capabilities

| Field | Value |
|---|---|
| Context window | 8192 tokens |
| Streaming | Yes (SSE via stream: true) |
| Function calling / tools | Yes (Hermes-style; see “Function calling” below) |
| Structured output (response_format) | Yes: json_object and json_schema (vLLM structured_outputs constrained decoding) |
| Logprobs | No (vllm_backend exposes only text_output over Triton; not surfaced) |
| Sampling controls | temperature, top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty, seed, stop |
| Reasoning (“thinking”) mode | Off by default; opt in via "enable_thinking": true in the request body |
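As a rough illustration of how the sampling controls fit into a request, the sketch below builds a chat-completions body and rejects any sampling field that is not in the list above. It assumes the controls are passed as top-level fields of the request body, as in the OpenAI-style examples elsewhere on this page; `build_body` is a hypothetical helper, and the values in the usage note are arbitrary.

```python
# Sampling controls listed in the capabilities table.
SAMPLING_KEYS = {
    "temperature", "top_p", "top_k", "min_p",
    "frequency_penalty", "presence_penalty", "repetition_penalty",
    "seed", "stop",
}


def build_body(messages, **sampling):
    """Build a request body, rejecting sampling fields the card does not list."""
    unknown = sorted(set(sampling) - SAMPLING_KEYS)
    if unknown:
        raise ValueError(f"unsupported sampling controls: {unknown}")
    body = {"model": "qwen-3.5-27b", "messages": messages}
    body.update(sampling)  # assumed to be top-level fields of the body
    return body
```

For example, `build_body([{"role": "user", "content": "Hi"}], temperature=0.2, seed=7)` returns a body ready to serialize with `json.dumps`, while passing `logprobs=True` raises `ValueError`.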

Function calling

Pass OpenAI-shape tools and the model returns a tool_calls array on the assistant message (or as a delta.tool_calls chunk when streaming). The gateway speaks Qwen’s Hermes tool-call template under the hood.
curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
tool_choice accepts "auto" (model decides), "none" (force plain text), "required" (force a tool call), or { "type": "function", "function": { "name": "<tool>" } } to force a specific tool. After invoking the tool yourself, append a role: "tool" message containing the result and re-call the model:
{
  "model": "qwen-3.5-27b",
  "messages": [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "tool_calls": [
      {"id": "call_abc", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc",
     "content": "{\"temp_c\": 22, \"sky\": \"sunny\"}"}
  ]
}
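The round trip described above (read tool_calls, run the tool, append a role: "tool" message) might look like the following on the client side. It is a sketch under the assumption that the assistant message has the shape shown in the JSON above; `run_tool_calls`, the registry dictionary, and the stubbed `get_weather` are hypothetical names for illustration.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return {"temp_c": 22, "sky": "sunny"}


def run_tool_calls(assistant_message, registry):
    """Run each requested tool locally and return the role: "tool" messages."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, not as an object.
        arguments = json.loads(function["arguments"])
        result = registry[function["name"]](**arguments)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the originating call
            "content": json.dumps(result),
        })
    return tool_messages
```

The follow-up request is then the original messages, plus the assistant message, plus the returned tool messages, sent to the same endpoint.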

Structured output (JSON mode)

Use response_format to force the model to emit valid JSON. It is backed server-side by vLLM’s structured_outputs constrained decoding, so a completed response always parses; a response cut off by max_tokens can still end mid-object.
curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Give me a JSON object describing a cat."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age_years": {"type": "integer"},
            "color": {"type": "string"}
          },
          "required": ["name", "age_years", "color"]
        }
      }
    }
  }'
{"type": "json_object"} accepts any valid JSON; json_schema constrains it to your schema.
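On the client side, the reply content still arrives as a string and has to be parsed. The sketch below builds the response_format object used in the curl example and performs a light check of the parsed reply; it only verifies required keys rather than doing full JSON Schema validation, and the helper names are hypothetical.

```python
import json

# Same schema as the curl example above.
CAT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age_years": {"type": "integer"},
        "color": {"type": "string"},
    },
    "required": ["name", "age_years", "color"],
}


def response_format_for(schema):
    """Wrap a JSON Schema in the response_format shape shown above."""
    return {"type": "json_schema", "json_schema": {"schema": schema}}


def parse_reply(content, schema):
    """Parse the reply and confirm the schema's required keys are present."""
    data = json.loads(content)  # raises json.JSONDecodeError if the text is truncated
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a single `except ValueError` covers both a truncated reply and a missing key.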

Reasoning mode

Qwen3.5 ships with a “thinking” mode that emits a <think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency low. The 27B variant supports the same toggle; enabling it gives deeper reasoning at the cost of longer generation time. To enable thinking on a per-request basis:
{
  "model": "qwen-3.5-27b",
  "messages": [{"role": "user", "content": "..."}],
  "enable_thinking": true
}
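If the reasoning trace arrives inline in the message content, a client may want to separate it from the answer before display. The sketch below assumes the trace is delivered as a single <think>...</think> block inside the content string; whether the gateway returns it that way or in a separate field is not stated here, and `split_thinking` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first <think> block is captured; DOTALL spans newlines.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning_trace, answer); the trace is empty if no block is present."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    trace = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return trace, answer
```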

Model identifier

Call this model with the canonical id qwen-3.5-27b at all inference endpoints (/v1/chat/completions, /v1/completions). The Hugging Face repo id Qwen/Qwen3.5-27B-FP8 is accepted at /v1/models/load for hot-loading but does not resolve at inference time, so use the canonical id for chat and completions calls.

Notes

  • License: Apache 2.0 (no auth required to pull weights).
  • Native FP8: there is no runtime quantization step at load, so load time is lower than with 9B’s runtime-quantized path.
  • VRAM is tight: a single 46 GB L40S can host this model or the voice stack, not both. Multi-GPU edges pin 27B to its own GPU; see backend/edge-production-setup/CLAUDE.md for the layout matrix.
  • Compared to qwen-3.5-9b: higher reasoning quality, longer TTFT, higher VRAM. Use 9B for latency-sensitive voice paths and 27B when reply quality is the priority.

See also