Rate Limits

PolarGrid enforces rate limits to ensure fair usage and platform stability. These limits apply uniformly across all inference endpoints (LLM, STT, TTS) and all models.

Current Limits

Limit | Value | Scope
Requests per minute | 100 | Per API key (or client IP if unauthenticated)
  • Request rate limit: A per-minute window allows up to 100 requests per authenticated identity. The window resets at each minute boundary. If no API key is provided, the limit is applied per client IP address.
This limit applies identically to all endpoints: /v1/chat/completions, /v1/completions, /v1/audio/speech, and /v1/audio/transcriptions.
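
Because the window resets at a minute boundary rather than 60 seconds after the first request, a client that has used up its budget only needs to wait out the remainder of the current minute. The following is a minimal sketch of that arithmetic. `msUntilNextWindow` is a hypothetical helper, not part of the SDK, and it assumes the gateway's boundaries align with wall-clock minutes and that the client clock is reasonably in sync, neither of which this page states.

```javascript
// Assumed window length: one minute, matching the per-minute limit above.
const WINDOW_MS = 60_000;

// Hypothetical helper: milliseconds remaining until the next minute boundary.
// Exactly on a boundary it returns a full window (60000).
function msUntilNextWindow(nowMs = Date.now()) {
  return WINDOW_MS - (nowMs % WINDOW_MS);
}
```

Adding a small safety margin to the result helps absorb clock skew between client and gateway.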

What Happens When You Hit a Limit

When the limit is exceeded, the gateway returns an HTTP 429 Too Many Requests response:
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
The SDK surfaces this as a RateLimitError with a retryAfter property indicating how many seconds to wait. See Error Handling for the full error type reference.
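
When handling the error yourself, the retryAfter value can drive the wait instead of a guessed delay. The following is a minimal sketch, not SDK code. `retryDelayMs` and `callHonoringRetryAfter` are hypothetical helper names, and the sketch assumes the thrown error exposes a `status` of 429 and a numeric `retryAfter` in seconds. The 1-second fallback is an arbitrary choice.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: convert the error's retryAfter (seconds) to milliseconds.
// Falls back to fallbackMs when the property is missing or not a valid number.
function retryDelayMs(error, fallbackMs = 1000) {
  const seconds = error?.retryAfter;
  const valid = typeof seconds === 'number' && Number.isFinite(seconds) && seconds >= 0;
  return valid ? seconds * 1000 : fallbackMs;
}

// Hypothetical wrapper: call fn once and, on a 429, wait for the advertised
// interval before making a single second attempt. Other errors are rethrown.
async function callHonoringRetryAfter(fn, sleep = defaultSleep) {
  try {
    return await fn();
  } catch (error) {
    if (error?.status !== 429) throw error;
    await sleep(retryDelayMs(error));
    return fn();
  }
}
```

The `sleep` parameter is injectable so the wait can be stubbed out in tests.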

Retry Strategy

Use exponential backoff with jitter to avoid thundering-herd problems when retrying after a 429:
async function withBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s (the final attempt rethrows instead of waiting)
      const jitter = Math.random() * 1000;
      await new Promise(r => setTimeout(r, baseDelay + jitter));
    }
  }
}

const response = await withBackoff(() =>
  client.chatCompletion({
    model: 'qwen-3.5-9b',
    messages: [{ role: 'user', content: 'Hello' }],
  })
);
The PolarGrid SDKs include built-in retry with exponential backoff for transient errors (including 429s). Configure it via maxRetries when initializing the client. The custom retry code above is only needed when you want behavior beyond those defaults.

Best Practices for Production

Distribute requests across edges

Use the autorouter to spread traffic across multiple edge nodes. Each edge node maintains its own rate limit counters, so distributing requests reduces the chance of hitting limits on any single node.
const client = new PolarGrid({
  apiKey: 'pg_...',
  // Autorouter selects the nearest available edge
  baseUrl: 'https://autorouter.polargrid.ai',
});

Implement client-side throttling

Rather than relying on server-side 429 responses, proactively throttle requests in your application:
// Simple fixed-window rate limiter: the budget is restored all at once when each window elapses
class RateLimiter {
  constructor(maxRequests = 90, windowMs = 60_000) {
    this.tokens = maxRequests;
    this.maxTokens = maxRequests;
    this.windowMs = windowMs;
    this.lastRefill = Date.now();
  }

  async acquire() {
    this.refill();
    // Loop so that concurrent callers re-check the budget after waking up
    while (this.tokens <= 0) {
      const waitMs = Math.max(0, this.windowMs - (Date.now() - this.lastRefill));
      await new Promise(r => setTimeout(r, waitMs));
      this.refill();
    }
    this.tokens--;
  }

  refill() {
    const now = Date.now();
    if (now - this.lastRefill >= this.windowMs) {
      this.tokens = this.maxTokens;
      this.lastRefill = now;
    }
  }
}

const limiter = new RateLimiter(90); // Stay under the 100/min limit

async function safeRequest(client, params) {
  await limiter.acquire();
  return client.chatCompletion(params);
}

Use streaming to reduce request count

A single streaming request holds one connection open while tokens are generated, rather than making multiple polling requests. This is especially effective for long responses.
// One request, streamed over a single connection
for await (const chunk of client.chatCompletionStream({
  model: 'qwen-3.5-9b',
  messages: [{ role: 'user', content: 'Write a detailed analysis...' }],
})) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
See the Streaming guide for full details.

Batch where possible

If you have multiple independent prompts, send them as separate requests but pace them to stay within your rate limit. Avoid firing all requests simultaneously.
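
One way to pace a set of independent prompts is to start them at a fixed interval rather than all at once. The sketch below is illustrative only. `runPaced` is a hypothetical helper, and the 700 ms default spacing is an assumed value: 60,000 ms / 700 ms is roughly 85 request starts per minute, which stays under the 100/min limit.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: starts one task per intervalMs and lets responses overlap.
// Resolves to Promise.allSettled results, in the same order as the input items,
// so one failed request does not discard the others.
async function runPaced(items, task, intervalMs = 700) {
  const pending = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(intervalMs); // space out request starts
    const started = Promise.resolve().then(() => task(items[i], i));
    // Attach a handler now so an early failure is not reported as unhandled
    // while later requests are still being started.
    started.catch(() => {});
    pending.push(started);
  }
  return Promise.allSettled(pending);
}
```

With the SDK, `task` would typically wrap a call such as `client.chatCompletion(...)`. Each entry in the result has a `status` of `'fulfilled'` (with `value`) or `'rejected'` (with `reason`).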

Custom Limits

Enterprise customers can request custom rate limits tailored to their workload. Contact support@polargrid.ai or reach out to your account representative to discuss higher limits.