Streaming API

Receive chat completion responses in real time

Overview

The streaming API lets you receive chat completion responses in real time as they are generated, rather than waiting for the entire response to finish. This is particularly useful for chat interfaces, where showing a progressively growing response makes for a better user experience.

How to Use Streaming

To use streaming, simply set the stream parameter to true in your request to the chat completions endpoint:

POST /v1/chat/completions HTTP/1.1
Host: localhost:11435
Content-Type: application/json

{
  "model": "your-model-id",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Tell me a story about a robot."
    }
  ],
  "temperature": 0.7,
  "stream": true
}
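
For example, a minimal Python client might send this request with the third-party requests library. This is a sketch: the host and model id simply mirror the example request above, and any HTTP client that supports streamed response bodies would work just as well.

import requests  # third-party: pip install requests

# Host and model id mirror the example request above; substitute your own.
url = "http://localhost:11435/v1/chat/completions"
payload = {
    "model": "your-model-id",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story about a robot."},
    ],
    "temperature": 0.7,
    "stream": True,
}

# stream=True keeps requests from buffering the whole body, so the
# response can be read line by line as the server produces it.
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines(decode_unicode=True):
    if line:  # skip the blank lines that separate SSE events
        print(line)

Parsing the data: lines this prints is covered in the next section.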

Stream Format

When streaming is enabled, the API returns the response as a stream of Server-Sent Events (SSE). Each event carries a JSON object representing a token or a small piece of the full response.

Note: The response is streamed as a series of data chunks, each prefixed with data: and terminated by two newline characters (\n\n). The final chunk is data: [DONE].
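
A sketch of a parser for this format, reusing the response object from the Python example above; the helper name parse_sse_stream is our own, not part of the API:

import json

def parse_sse_stream(response):
    # Yield one parsed chunk per SSE event; events arrive as "data: ..."
    # lines separated by blank lines.
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":  # sentinel marking the end of the stream
                return
            yield json.loads(data)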

Streaming Response Structure

Each chunk in the stream will contain the following structure:

id (string): A unique identifier for the completion. Every chunk in the stream shares the same ID.
object (string): Always "chat.completion.chunk".
created (integer): The Unix timestamp (in seconds) of when the completion was created. Every chunk shares the same timestamp.
model (string): The model used for the completion.
choices (array): Incremental message content, with one choice for each requested completion.
system_fingerprint (string): A fingerprint of the backend configuration the model runs with.
service_tier (string): The service tier used to process the request, if applicable.
usage (object): Token usage information. Included only in the final chunk, and only when stream_options.include_usage is true in the request.
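
For example, to have usage reported in the final chunk, the request would include stream_options (assuming the server accepts the OpenAI-style stream_options parameter referenced in the table above):

{
  "model": "your-model-id",
  "messages": [
    { "role": "user", "content": "Tell me a story about a robot." }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}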

Streaming Choice Object

Each choice in the choices array contains:

index (integer): The index of the choice in the choices array.
delta (object): The incremental content for the message.
finish_reason (string): The reason the choice finished. Present only in the final chunk of a choice that has finished; values include "stop", "length", etc.
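
Because chunks for different choices arrive interleaved, a client that requested several completions needs to separate them by index. A minimal sketch, building on the parse_sse_stream helper above:

# Accumulate each parallel completion separately, keyed by choice index.
completions = {}
finish_reasons = {}
for chunk in parse_sse_stream(response):
    for choice in chunk["choices"]:
        i = choice["index"]
        delta = choice.get("delta", {})
        completions[i] = completions.get(i, "") + (delta.get("content") or "")
        if choice.get("finish_reason") is not None:
            finish_reasons[i] = choice["finish_reason"]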

Delta Object

The delta object can contain:

role (string): The role of the message author. Included only in the first chunk.
content (string): The next fragment of the message content. May be empty when a chunk carries no new content.
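
Putting the pieces together, a single-completion client can print each fragment as it arrives (the progressive display mentioned in the overview) and reassemble the full message at the end. A sketch, again assuming the parse_sse_stream helper from above:

# Rebuild the assistant message from delta fragments as they stream in.
role = None
parts = []
for chunk in parse_sse_stream(response):
    choice = chunk["choices"][0]      # single-completion request
    delta = choice.get("delta", {})
    if "role" in delta:               # the first chunk carries the role
        role = delta["role"]
    fragment = delta.get("content") or ""
    parts.append(fragment)
    print(fragment, end="", flush=True)  # progressive display
    if choice.get("finish_reason") is not None:
        break
message = {"role": role, "content": "".join(parts)}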

Example Streaming Response

Here's how a streaming response might look (the JSON payloads are pretty-printed here for readability; on the wire each data: payload is a single line):

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Once"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": " upon"
      },
      "finish_reason": null
    }
  ]
}

...more chunks...

data: {
  "id": "chatcmpl-123456789abcdef",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "your-model-id",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": ""
      },
      "finish_reason": "stop"
    }
  ]
}

data: [DONE]

Best Practices

Related Resources