Qwen3.5 27B (qwen-3.5-27b) is a 27-billion-parameter text LLM served on PolarGrid edge nodes via Triton’s vllm_backend. Weights ship pre-quantized to FP8 (~28 GB VRAM) and load directly on PolarGrid’s Blackwell edge GPUs without runtime requantization.
  • HF repo: Qwen/Qwen3.5-27B-FP8
  • Modality: Text LLM
  • Backend: Triton vllm (LLM pod)
  • Available regions: yvr-02, yto-01, yul-01

Headline benchmark

We publish two numbers side by side. End-to-end is what your application actually experiences (request → response, network included). Server-only is the time the GPU spends on inference, which is the apples-to-apples comparison against centralized providers’ published “inference-only” figures. The gap is the latency PolarGrid’s yto-01 PoP eliminates by being at the edge.

| Measurement | TTFT p50 | TTFT p95 | Throughput p50 |
| --- | --- | --- | --- |
| End-to-end (with network) | 300 ms | 330 ms | 12.4 tok/s |
| Server-only (no network) | 211 ms | 240 ms | |
| Network overhead (e2e − server) | 90 ms | | |
Bench: 100 streaming chat-completion runs against https://api.yto-01.edge.polargrid.ai, captured 2026-05-01. End-to-end is client-side wall-clock time from a colocated test machine; server-only is read from the gateway’s pg_metadata SSE event (inference_ttft_ms / inference_total_ms). Reasoning mode was off (the default). Throughput is lower than the 9B sibling’s because vLLM 0.17.x’s CUDA-graph capture asserts on Qwen3-Next hybrid attention, so the model runs in enforce_eager mode (no graph capture). A future Triton image with the upstream vLLM patch (vllm-project/vllm#27406) will let us re-enable graph capture and re-run the benchmark.
Apples-to-apples disclaimer. Other providers usually publish only their server-side number, so the fair comparison is against our server-only row. Our end-to-end row is what you’ll see from a customer-side request because PolarGrid runs at the edge; the network-overhead row above shows how much that is worth in milliseconds.
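
To reproduce the split between end-to-end and server-only TTFT on your own requests, you can read the server-side timing out of the stream and subtract it from your client-side measurement. The sketch below is a minimal illustration, not the gateway's documented wire format: it assumes pg_metadata arrives as a named SSE event (an `event: pg_metadata` line followed by a `data:` line holding JSON), and the sample values are made up for the example.

```python
import json


def parse_pg_metadata(sse_lines):
    """Return the JSON payload of the first pg_metadata event, or None.

    Assumes the event is framed as `event: pg_metadata` followed by a
    `data: {...}` line; this framing is an assumption, not documented here.
    """
    current_event = None
    for raw in sse_lines:
        line = raw.strip()
        if not line:
            current_event = None  # a blank line terminates an SSE event
            continue
        if line.startswith("event:"):
            current_event = line[len("event:"):].strip()
        elif line.startswith("data:") and current_event == "pg_metadata":
            return json.loads(line[len("data:"):].strip())
    return None


def network_overhead_ms(client_ttft_ms, metadata):
    """Client-side TTFT minus the server-reported inference TTFT."""
    return client_ttft_ms - metadata["inference_ttft_ms"]


# Illustrative stream; the numbers are invented for the example.
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    "event: pg_metadata",
    'data: {"inference_ttft_ms": 215.0, "inference_total_ms": 2790.0}',
    "",
    "data: [DONE]",
]

meta = parse_pg_metadata(sample_stream)
overhead = network_overhead_ms(305.0, meta)
```

Measure `client_ttft_ms` yourself as the wall-clock time from sending the request to receiving the first content chunk.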

Quickstart

curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Say hi in one short sentence."}],
    "stream": true,
    "max_tokens": 32
  }'
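
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the streamed chunks are taken to follow the OpenAI-style shape (`choices[].delta.content`, terminated by `data: [DONE]`), and the helper names are ours, not part of any PolarGrid SDK.

```python
import json
import urllib.request

BASE_URL = "https://api.yto-01.edge.polargrid.ai"


def build_chat_request(token, prompt, max_tokens=32):
    """Build the same POST request as the curl quickstart."""
    body = {
        "model": "qwen-3.5-27b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def collect_text(sse_lines):
    """Concatenate delta.content from OpenAI-style SSE chunks (assumed shape)."""
    parts = []
    for raw in sse_lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and event-name lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


def stream_chat(token, prompt):
    """Send the request and return the full streamed text."""
    with urllib.request.urlopen(build_chat_request(token, prompt)) as resp:
        return collect_text(resp)  # the response object iterates line by line
```

Call `stream_chat(os.environ["TOKEN"], "Say hi in one short sentence.")` (after `import os`) to run it against the live endpoint.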

Capabilities

| Field | Value |
| --- | --- |
| Context window | 8192 tokens |
| Streaming | Yes (SSE via stream: true) |
| Function calling / tools | Yes (Hermes-style; see “Function calling” below) |
| Structured output (response_format) | Yes: json_object and json_schema (vLLM structured_outputs constrained decoding) |
| Logprobs | No (vllm_backend exposes only text_output over Triton; not surfaced) |
| Sampling controls | temperature, top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty, seed, stop |
| Reasoning (“thinking”) mode | Off by default; opt in via "enable_thinking": true in the request body |
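
As a rough illustration of the sampling controls listed above, the sketch below builds a request body and rejects any control not in that list on the client side. How the gateway itself treats unsupported fields (such as logprobs) is not stated here, so the early rejection is our own design choice, and `build_payload` is a hypothetical helper.

```python
# Sampling controls named in the capabilities table.
SAMPLING_KEYS = {
    "temperature", "top_p", "top_k", "min_p", "frequency_penalty",
    "presence_penalty", "repetition_penalty", "seed", "stop",
}


def build_payload(messages, max_tokens=None, **sampling):
    """Build a chat-completions body, allowing only the listed sampling controls."""
    unknown = set(sampling) - SAMPLING_KEYS
    if unknown:
        raise ValueError(f"unsupported sampling controls: {sorted(unknown)}")
    payload = {"model": "qwen-3.5-27b", "messages": messages, **sampling}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


example = build_payload(
    [{"role": "user", "content": "Name three Canadian cities."}],
    max_tokens=64,
    temperature=0.2,
    top_p=0.9,
    seed=7,
    stop=["\n\n"],
)
```

Fixing `seed` alongside a low `temperature` is a common way to make test runs more repeatable.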

Function calling

Pass tools in the OpenAI format and the model returns a tool_calls array on the assistant message (or delta.tool_calls chunks when streaming). The gateway applies Qwen’s Hermes tool-call template under the hood.
curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
tool_choice accepts "auto" (model decides), "none" (force plain text), "required" (force a tool call), or { "type": "function", "function": { "name": "<tool>" } } to force a specific tool. After invoking the tool yourself, append a role: "tool" message containing the result and re-call the model:
{
  "model": "qwen-3.5-27b",
  "messages": [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "tool_calls": [
      {"id": "call_abc", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc",
     "content": "{\"temp_c\": 22, \"sky\": \"sunny\"}"}
  ]
}
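
The round trip described above (receive tool_calls, run each tool locally, append one role: "tool" message per call) could be sketched as follows. `get_weather` and the `TOOLS` registry are hypothetical stand-ins for your own functions; the message shapes mirror the JSON example above.

```python
import json


def get_weather(city):
    # Hypothetical local tool; a real one would query a weather service.
    return {"temp_c": 22, "sky": "sunny"}


# Maps tool names (as declared in the request's "tools") to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(messages, assistant_message, tools=TOOLS):
    """Return a new message list with the assistant turn plus one tool result per call."""
    updated = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = tools[fn["name"]](**args)
        updated.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the originating call
            "content": json.dumps(result),
        })
    return updated


history = [{"role": "user", "content": "Weather in Tokyo?"}]
assistant = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"},
    }],
}
next_messages = run_tool_calls(history, assistant)
```

Send `next_messages` as the `messages` field of the follow-up request so the model can turn the tool result into a reply.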

Structured output (JSON mode)

Use response_format to force the model to emit valid JSON. This is backed server-side by vLLM’s structured_outputs constrained decoding, so the output is guaranteed to parse.
curl https://api.yto-01.edge.polargrid.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Give me a JSON object describing a cat."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age_years": {"type": "integer"},
            "color": {"type": "string"}
          },
          "required": ["name", "age_years", "color"]
        }
      }
    }
  }'
{"type": "json_object"} accepts any valid JSON; json_schema constrains it to your schema.
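
On the client, the returned message content is still a string that you parse yourself. The sketch below parses it and performs a light defensive check against the cat schema from the request above. It is a hand-rolled check covering only `required` and a few primitive types, not a full JSON Schema validator, and `parse_structured` is a hypothetical helper name.

```python
import json

CAT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age_years": {"type": "integer"},
        "color": {"type": "string"},
    },
    "required": ["name", "age_years", "color"],
}

# Only the JSON types used by the schema above are mapped here.
PY_TYPES = {"string": str, "integer": int, "object": dict}


def _matches(value, json_type):
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, PY_TYPES[json_type])


def parse_structured(content, schema):
    """Parse the message content and check required keys and primitive types."""
    data = json.loads(content)
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    for key, spec in schema.get("properties", {}).items():
        if key in data and not _matches(data[key], spec["type"]):
            raise TypeError(f"{key!r} is not of type {spec['type']}")
    return data


cat = parse_structured(
    '{"name": "Miso", "age_years": 3, "color": "tabby"}', CAT_SCHEMA
)
```

For anything beyond flat objects, a dedicated JSON Schema validation library is likely the better choice.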

Reasoning mode

Qwen3.5 ships with a “thinking” mode that emits a <think>...</think> reasoning trace before the user-visible answer. PolarGrid’s /v1/chat/completions endpoint disables this by default to keep first-token latency low. The 27B variant uses the same toggle as the 9B sibling and reasons more deeply, at the cost of longer generation time. To enable thinking on a per-request basis:
{
  "model": "qwen-3.5-27b",
  "messages": [{"role": "user", "content": "..."}],
  "enable_thinking": true
}
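
With thinking enabled, you will usually want to separate the reasoning trace from the answer before showing it to a user. The sketch below assumes the trace arrives inline in the message content, wrapped in a single <think>...</think> pair; if the gateway returns it in a separate field instead, this step is unnecessary.

```python
import re

# Non-greedy match so only the first <think>...</think> pair is captured;
# DOTALL lets the trace span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (reasoning, answer); reasoning is "" when no trace is present."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
)
```

Keeping the reasoning in a separate variable makes it easy to log for debugging while displaying only the answer.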

Model identifier

Call this model with the canonical id qwen-3.5-27b at all inference endpoints (/v1/chat/completions, /v1/completions). The HuggingFace repo id Qwen/Qwen3.5-27B-FP8 is accepted at /v1/models/load for hot-loading purposes but does not resolve at inference time — use the canonical id for chat and completions calls.
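
If your code receives model names from configuration that may contain the HuggingFace repo id, a small client-side mapping can guard against sending it to an inference endpoint. This is a hypothetical helper, not a gateway feature.

```python
CANONICAL_ID = "qwen-3.5-27b"
HF_REPO_ID = "Qwen/Qwen3.5-27B-FP8"

# The HF repo id does not resolve at inference time, so map it to the canonical id.
INFERENCE_ALIASES = {HF_REPO_ID: CANONICAL_ID}


def inference_model_id(model):
    """Return the id to send to /v1/chat/completions or /v1/completions."""
    return INFERENCE_ALIASES.get(model, model)


resolved = inference_model_id("Qwen/Qwen3.5-27B-FP8")
```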

Notes

  • License: Apache 2.0 (no auth required to pull weights).
  • Native FP8 — no runtime quantization step at load, lower load time than 9B’s runtime-quantized path.
  • VRAM is tight: a single 46 GB L40S can host this model or the voice stack, not both. Multi-GPU edges pin 27B to its own GPU; see backend/edge-production-setup/CLAUDE.md for the layout matrix.
  • Compared to qwen-3.5-9b: higher reasoning quality, longer TTFT, higher VRAM. Use 9B for latency-sensitive voice paths and 27B when reply quality is the priority.

See also