Rate Limits

PolarGrid enforces rate limits to ensure fair usage and platform stability. These limits apply uniformly across all inference endpoints (LLM, STT, TTS) and all models.

Current Limits

Limit	Value	Scope
Requests per minute	100	Per user, counted independently on each edge node (per client IP if unauthenticated)

Request rate limit: A per-minute window allows up to 100 requests per user. The limit is keyed to the user who owns the API key, so every key created by the same user draws from one shared 100/min budget. Each edge node keeps its own counter and resets it at the minute boundary, so traffic spread across several nodes gets a separate budget on each (see Distribute requests across edges). If no API key is provided, the limit applies per client IP address.
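To make the per-node accounting concrete, here is a minimal sketch of a fixed-window counter keyed by edge node and user. It is an illustrative model under assumptions, not PolarGrid's gateway code: the class name FixedWindowCounter, the allow method, and the node and user identifiers are all hypothetical.

```javascript
// Illustrative model only: FixedWindowCounter is a hypothetical name,
// not part of the PolarGrid gateway or SDK.
class FixedWindowCounter {
  constructor(limit = 100, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.counts = new Map(); // "node|user|window" -> requests seen in that window
  }

  // Returns true if the request fits in the current window,
  // false if it would be rejected (a 429 in the real service).
  allow(node, user, nowMs) {
    const windowIndex = Math.floor(nowMs / this.windowMs); // a new index starts at each minute boundary
    const key = `${node}|${user}|${windowIndex}`;
    const used = this.counts.get(key) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(key, used + 1);
    return true;
  }
}

const counter = new FixedWindowCounter();
let accepted = 0;
for (let i = 0; i < 120; i++) {
  if (counter.allow('edge-a', 'user-1', 5_000)) accepted++;
}
console.log(accepted);                                  // 100: requests 101-120 are rejected
console.log(counter.allow('edge-b', 'user-1', 5_000));  // true: a different node has its own counter
console.log(counter.allow('edge-a', 'user-1', 60_000)); // true: the next minute starts a fresh window
```

The key includes the user rather than the API key, which mirrors the statement above that all keys owned by one user share a single budget.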

This limit applies identically to all endpoints: /v1/chat/completions, /v1/completions, /v1/audio/speech, and /v1/audio/transcriptions.

What Happens When You Hit a Limit

When the limit is exceeded, the gateway returns an HTTP 429 Too Many Requests response:

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

The SDK surfaces this as a RateLimitError. When the response carries a Retry-After header, the SDK exposes its value as retryAfter (in seconds); when the header is absent, retryAfter is undefined, so apply your own backoff (see below). See Error Handling for the full error type reference.
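One way to combine the two cases is a small helper that prefers the server's hint and falls back to exponential backoff with jitter. This is a sketch, not SDK code: retryDelayMs is a hypothetical name, and it assumes only that the caught error carries retryAfter as a number of seconds or undefined, as described above.

```javascript
// Hypothetical helper: picks a wait time after a 429, preferring the server's hint.
// Assumes `error.retryAfter` is a number of seconds, or undefined when the
// Retry-After header was absent.
function retryDelayMs(error, attempt, random = Math.random) {
  if (typeof error.retryAfter === 'number' && error.retryAfter >= 0) {
    return error.retryAfter * 1000; // server-provided wait, converted to milliseconds
  }
  const baseDelay = 2 ** attempt * 1000; // 1s, 2s, 4s, ... for attempt 0, 1, 2, ...
  const jitter = random() * 1000;        // up to 1s of random jitter
  return baseDelay + jitter;
}

console.log(retryDelayMs({ status: 429, retryAfter: 7 }, 0)); // 7000
console.log(retryDelayMs({ status: 429 }, 2, () => 0.5));     // 4500
```

The random parameter exists only so the jitter can be made deterministic in tests; in normal use, leave it at its default.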

Retry Strategy

Use exponential backoff with jitter to avoid thundering-herd problems when retrying after a 429:

async function withBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s (the final attempt rethrows instead of waiting)
      const jitter = Math.random() * 1000;
      await new Promise(r => setTimeout(r, baseDelay + jitter));
    }
  }
}

const response = await withBackoff(() =>
  client.chatCompletion({
    model: 'qwen-3.5-27b',
    messages: [{ role: 'user', content: 'Hello' }],
  })
);

import asyncio
import random

async def with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            if getattr(e, "status", None) != 429 or attempt == max_retries - 1:
                raise

            base_delay = 2 ** attempt  # 1s, 2s, 4s, 8s (the final attempt re-raises instead of waiting)
            jitter = random.uniform(0, 1)
            await asyncio.sleep(base_delay + jitter)

response = await with_backoff(lambda: client.chat_completion({
    "model": "qwen-3.5-27b",
    "messages": [{"role": "user", "content": "Hello"}],
}))

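The PolarGrid SDKs include built-in retry with exponential backoff for transient errors (including 429s). Configure it with the maxRetries option when initializing the client. The examples above are for custom retry logic beyond the defaults; if you wrap SDK calls in your own retry loop, account for the built-in retries so the two do not compound.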

Best Practices for Production

Distribute requests across edges

Use the autorouter to spread traffic across multiple edge nodes. Each edge node maintains its own rate limit counters, so distributing requests reduces the chance of hitting limits on any single node.

const client = new PolarGrid({
  apiKey: 'pg_...',
  // Autorouter selects the nearest available edge
  baseUrl: 'https://autorouter.polargrid.ai',
});

Implement client-side throttling

Rather than relying on server-side 429 responses, proactively throttle requests in your application:

// Simple fixed-window rate limiter
class RateLimiter {
  constructor(maxRequests = 90, windowMs = 60_000) {
    this.tokens = maxRequests;
    this.maxTokens = maxRequests;
    this.windowMs = windowMs;
    this.lastRefill = Date.now();
  }

  async acquire() {
    this.refill();
    // Re-check after every wait: several callers can be waiting on the same
    // window, and a timer may fire slightly before the window has elapsed.
    while (this.tokens <= 0) {
      const waitMs = Math.max(0, this.windowMs - (Date.now() - this.lastRefill));
      await new Promise(r => setTimeout(r, waitMs));
      this.refill();
    }
    this.tokens--;
  }

  refill() {
    const now = Date.now();
    if (now - this.lastRefill >= this.windowMs) {
      this.tokens = this.maxTokens;
      this.lastRefill = now;
    }
  }
}

const limiter = new RateLimiter(90); // Stay under the 100/min limit

async function safeRequest(client, params) {
  await limiter.acquire();
  return client.chatCompletion(params);
}

Use streaming to reduce request count

A single streaming request holds one connection open while tokens are generated, rather than making multiple polling requests. This is especially effective for long responses.

// One request, streamed over a single connection
for await (const chunk of client.chatCompletionStream({
  model: 'qwen-3.5-27b',
  messages: [{ role: 'user', content: 'Write a detailed analysis...' }],
})) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

See the Streaming guide for full details.

Batch where possible

If you have multiple independent prompts, send them as separate requests but pace them to stay within your rate limit. Avoid firing all requests simultaneously.
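As a sketch of what pacing can look like, the helper below spreads request start times evenly instead of launching everything at once. The names paceSchedule and runPaced are hypothetical and not part of the SDK, and the default of 90 requests per minute is an assumed safety margin under the 100/min limit.

```javascript
// Hypothetical helpers; not part of the PolarGrid SDK.

// Start offsets (in ms) that spread `count` requests evenly at `perMinute` requests per minute.
function paceSchedule(count, perMinute = 90) {
  const intervalMs = 60_000 / perMinute;
  return Array.from({ length: count }, (_, i) => Math.round(i * intervalMs));
}

// Starts each task at its scheduled offset and resolves once all of them settle.
// Each task is a function returning a promise, e.g. () => client.chatCompletion(params).
function runPaced(tasks, perMinute = 90) {
  const offsets = paceSchedule(tasks.length, perMinute);
  return Promise.allSettled(
    tasks.map((task, i) =>
      new Promise(resolve => setTimeout(resolve, offsets[i])).then(() => task())
    )
  );
}

console.log(paceSchedule(4, 60)); // [ 0, 1000, 2000, 3000 ]
```

Promise.allSettled is used so that one failed request does not discard the results of the others; each entry in the returned array reports either a value or a rejection reason.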

Custom Limits

Enterprise customers can request custom rate limits tailored to their workload. Contact support@polargrid.ai or reach out to your account representative to discuss higher limits.
