Streaming

Stream response tokens in real time for a better user experience.

Why Stream?

Without streaming, users wait for the entire response before seeing anything. With streaming, tokens appear as they’re generated — critical for interactive applications.

Chat Completion Streaming

for await (const chunk of client.chatCompletionStream({
  model: 'llama-3.1-8b',
  messages: [{ role: 'user', content: 'Tell me a story about a robot' }],
  maxTokens: 500,
})) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
  
  // Check for completion
  if (chunk.choices[0]?.finishReason) {
    console.log('\n\nFinished:', chunk.choices[0].finishReason);
  }
}
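A network error mid-stream leaves you with a partial response; it is often worth keeping what arrived rather than discarding it (see also tip 4 below). A minimal sketch, where `collectStream` is an illustrative helper (not part of the SDK) that accepts any async iterable of chunks, such as the stream returned above:

```typescript
// Collect streamed content, keeping whatever arrived if the
// connection drops mid-stream. `stream` stands in for any async
// iterable of chunks, e.g. client.chatCompletionStream(...).
// Illustrative helper, not part of the SDK.
async function collectStream(
  stream: AsyncIterable<{ choices: { delta?: { content?: string } }[] }>,
): Promise<{ content: string; error: Error | null }> {
  let content = '';
  try {
    for await (const chunk of stream) {
      content += chunk.choices[0]?.delta?.content ?? '';
    }
    return { content, error: null };
  } catch (err) {
    // Return the partial content alongside the error instead of losing it.
    return { content, error: err as Error };
  }
}
```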

Text Completion Streaming

for await (const chunk of client.completionStream({
  prompt: 'Once upon a time',
  model: 'llama-3.1-8b',
  maxTokens: 200,
})) {
  process.stdout.write(chunk.choices[0]?.text ?? '');
}

Chunk Format

Each streaming chunk contains a delta (incremental change):
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1234567890,
  "model": "llama-3.1-8b",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",  // Only in first chunk
        "content": "Once"     // Token content
      },
      "finish_reason": null   // null until final chunk
    }
  ]
}
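Because each chunk carries only an incremental delta, the full message is the concatenation of every `delta.content`, with the final chunk supplying the finish reason. A minimal sketch of that reassembly (`accumulateChunks` and the `Chunk` interface are illustrative, not part of the SDK):

```typescript
// Minimal shape of a streaming chunk, mirroring the format above.
interface Chunk {
  choices: {
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }[];
}

// Reassemble the complete assistant message from a chunk sequence.
// Illustrative helper, not part of the SDK.
function accumulateChunks(chunks: Chunk[]): { content: string; finishReason: string | null } {
  let content = '';
  let finishReason: string | null = null;
  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (choice?.delta?.content) content += choice.delta.content;
    if (choice?.finish_reason) finishReason = choice.finish_reason;
  }
  return { content, finishReason };
}
```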

SSE Format (Raw API)

When using the API directly with stream: true, responses use Server-Sent Events:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Once"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" upon"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
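If you consume the raw SSE stream yourself (for example via `fetch`), each event line must be handled as shown above: strip the `data:` prefix, stop at the `[DONE]` sentinel, and JSON-parse the payload. A minimal sketch of that per-line step (`parseSSELine` is an illustrative helper, not part of the SDK):

```typescript
// Parse one line of a raw SSE stream. Returns the decoded chunk
// object, or null for blank lines, comments, and the terminal
// [DONE] sentinel. Illustrative helper, not part of the SDK.
function parseSSELine(line: string): unknown | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith('data:')) return null;  // blank line or SSE comment
  const payload = trimmed.slice('data:'.length).trim();
  if (payload === '[DONE]') return null;          // end-of-stream sentinel
  return JSON.parse(payload);
}
```

Note that a real consumer also has to buffer partial lines, since network reads are not guaranteed to align with line boundaries.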

Browser Example

async function streamChat(prompt) {
  const outputElement = document.getElementById('output');
  outputElement.textContent = '';

  for await (const chunk of client.chatCompletionStream({
    model: 'llama-3.1-8b',
    messages: [{ role: 'user', content: prompt }],
  })) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      outputElement.textContent += content;
    }
  }
}

React Hook Example

import { useState, useCallback } from 'react';

function useChatStream(client: PolarGrid) {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const stream = useCallback(async (messages: Message[]) => {
    setResponse('');
    setIsStreaming(true);

    try {
      for await (const chunk of client.chatCompletionStream({
        model: 'llama-3.1-8b',
        messages,
      })) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
          setResponse(prev => prev + content);
        }
      }
    } finally {
      setIsStreaming(false);
    }
  }, [client]);

  return { response, isStreaming, stream };
}

Finish Reasons

Reason          Description
stop            Natural completion or stop sequence hit
length          Max tokens reached
content_filter  Content was filtered

Tips

  1. Always check for content: Some chunks may have empty content
  2. Handle the finish reason: Know why generation stopped
  3. Buffer if needed: For sentence-by-sentence display, buffer until punctuation
  4. Error handling: Wrap in try/catch for network errors mid-stream
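Tip 3 can be sketched as a small buffer that only releases complete sentences, holding the remainder until more tokens arrive (`flushSentences` is an illustrative helper, not part of the SDK):

```typescript
// Split the buffered text into display-ready sentences plus the
// leftover fragment to keep buffering. A sentence ends at ., !, or ?
// followed by whitespace. Illustrative helper, not part of the SDK.
function flushSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  let rest = buffer;
  let match: RegExpMatchArray | null;
  while ((match = rest.match(/[.!?]\s+/)) !== null && match.index !== undefined) {
    const end = match.index + match[0].length;
    sentences.push(rest.slice(0, end).trimEnd());
    rest = rest.slice(end);
  }
  return { sentences, rest };
}
```

In the streaming loop, append each delta to the buffer, call `flushSentences`, display the returned sentences, and carry `rest` forward as the new buffer.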