Streaming API

Receive chat completion responses in real time

Overview

The streaming API lets you receive chat completion responses in real time as they are generated, rather than waiting for the entire response to finish. This is particularly useful for chat interfaces, where showing a progressively growing response makes for a better user experience.

How to Use Streaming

To use streaming, simply set the stream parameter to true in your request to the chat completions endpoint:

POST /v1/chat/completions HTTP/1.1
Host: localhost:11435
Content-Type: application/json

{
  "model": "your-model-id",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Tell me a story about a robot."
    }
  ],
  "temperature": 0.7,
  "stream": true
}
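
For example, a minimal Python client might send this request with the third-party requests library. This is a sketch: the host and model id simply mirror the example request above, and any HTTP client that supports streamed response bodies would work just as well.

import requests  # third-party: pip install requests

# Host and model id mirror the example request above; substitute your own.
url = "http://localhost:11435/v1/chat/completions"
payload = {
    "model": "your-model-id",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story about a robot."},
    ],
    "temperature": 0.7,
    "stream": True,
}

# stream=True keeps requests from buffering the whole body, so the
# response can be read line by line as the server produces it.
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines(decode_unicode=True):
    if line:  # skip the blank lines that separate SSE events
        print(line)

Parsing the data: lines this prints is covered in the next section.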

Stream Format

When streaming is enabled, the API returns the response as a stream of Server-Sent Events (SSE). Each event carries a JSON object representing a token or a small piece of the full response.

Note: The response is streamed as a series of data chunks, each prefixed with data: and terminated by two newline characters (\n\n). The final chunk is data: [DONE].
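
A sketch of a parser for this format, reusing the response object from the Python example above; the helper name parse_sse_stream is our own, not part of the API:

import json

def parse_sse_stream(response):
    # Yield one parsed chunk per SSE event; events arrive as "data: ..."
    # lines separated by blank lines.
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":  # sentinel marking the end of the stream
                return
            yield json.loads(data)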

Streaming Response Structure

Each chunk in the stream will contain the following structure:

id (string): A unique identifier for the completion. Every chunk in the stream shares the same ID.
object (string): Always "chat.completion.chunk".
created (integer): The Unix timestamp (in seconds) of when the completion was created. Every chunk shares the same timestamp.
model (string): The model used for the completion.
choices (array): Incremental message content, with one choice for each requested completion.
system_fingerprint (string): A fingerprint of the backend configuration the model runs with.
service_tier (string): The service tier used to process the request, if applicable.
usage (object): Token usage information. Included only in the final chunk, and only when stream_options.include_usage is true in the request.
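
For example, to have usage reported in the final chunk, the request would include stream_options (assuming the server accepts the OpenAI-style stream_options parameter referenced in the table above):

{
  "model": "your-model-id",
  "messages": [
    { "role": "user", "content": "Tell me a story about a robot." }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}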

Streaming Choice Object

Each choice in the choices array contains:

index (integer): The index of the choice in the choices array.
delta (object): The incremental content for the message.
finish_reason (string): The reason the choice finished. Present only in the final chunk of a choice that has finished; values include "stop", "length", etc.
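
Because chunks for different choices arrive interleaved, a client that requested several completions needs to separate them by index. A minimal sketch, building on the parse_sse_stream helper above:

# Accumulate each parallel completion separately, keyed by choice index.
completions = {}
finish_reasons = {}
for chunk in parse_sse_stream(response):
    for choice in chunk["choices"]:
        i = choice["index"]
        delta = choice.get("delta", {})
        completions[i] = completions.get(i, "") + (delta.get("content") or "")
        if choice.get("finish_reason") is not None:
            finish_reasons[i] = choice["finish_reason"]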

Delta Object

The delta object can contain:

role (string): The role of the message author. Included only in the first chunk.
content (string): The next fragment of the message content. May be empty when a chunk carries no new content.
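
Putting the pieces together, a single-completion client can print each fragment as it arrives (the progressive display mentioned in the overview) and reassemble the full message at the end. A sketch, again assuming the parse_sse_stream helper from above:

# Rebuild the assistant message from delta fragments as they stream in.
role = None
parts = []
for chunk in parse_sse_stream(response):
    choice = chunk["choices"][0]      # single-completion request
    delta = choice.get("delta", {})
    if "role" in delta:               # the first chunk carries the role
        role = delta["role"]
    fragment = delta.get("content") or ""
    parts.append(fragment)
    print(fragment, end="", flush=True)  # progressive display
    if choice.get("finish_reason") is not None:
        break
message = {"role": role, "content": "".join(parts)}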

Example Streaming Response

Here's how a streaming response might look (the JSON payloads are pretty-printed here for readability; on the wire each data: payload is a single line):

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Once"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": " upon"
      },
      "finish_reason": null
    }
  ]
}

...more chunks...

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": ""
      },
      "finish_reason": "stop"
    }
  ]
}

data: [DONE]

Best Practices

Related Resources