Microsoft Developer Community Blog

The importance of streaming for LLM-powered chat applications

Pamela_Fox, Microsoft
Oct 07, 2025

Thanks to the popularity of chat-based interfaces like ChatGPT and GitHub Copilot, users have grown accustomed to getting answers conversationally. As a result, thousands of developers are now deploying chat applications on Azure for their own specialized domains.

To help developers understand how to build LLM-powered chat apps, we have open-sourced many chat app templates, like a super simple chat app and the very popular and sophisticated RAG chat app. All our templates support an important feature: streaming.

At first glance, streaming might not seem essential. But users have come to expect it from modern chat experiences. Beyond meeting expectations, streaming can dramatically improve the time to first token — letting your frontend display words as soon as they’re generated, instead of making users wait seconds for a complete answer.

Animated GIF of GitHub Copilot answering a question about bash

How to stream from the APIs

Most modern LLM APIs and wrapper libraries now support streaming responses — usually through a simple boolean flag or a dedicated streaming method.

Let’s look at an example using the official OpenAI Python SDK. The openai package makes it easy to stream responses by passing a stream=True argument:

completion_stream = openai_client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What does a product manager do?"},
    ],
    stream=True,
)

When stream is true, the return value is an iterable of ChatCompletionChunk objects, so we can use a for loop to process each chunk:

for chunk in completion_stream:
    content = chunk.choices[0].delta.content
    if content is not None:  # some chunks, like the final one, carry no content
        print(content, end="")
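
If your backend framework is async (as with the Quart and FastAPI samples listed later), you would typically use the SDK's AsyncOpenAI client instead. Here's a minimal sketch of the same loop in async form, assuming the client picks up your credentials from the environment:

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_answer():
    completion_stream = await async_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What does a product manager do?"},
        ],
        stream=True,
    )
    # With the async client, iterate with "async for" instead of "for"
    async for chunk in completion_stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end="", flush=True)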

Sending the stream from backend to frontend

When building a web app, we need a way to stream data from the backend to the browser. A normal HTTP response won’t work here — it sends all the data at once, then closes the connection. Instead, we need a protocol that allows data to arrive progressively.

The most common options are:

  • WebSockets: A bidirectional channel where both client and server can send data at any time.
  • Server-sent events: A one-way channel where the server continuously pushes events to the client over HTTP.
  • Readable streams: An HTTP response with a Transfer-Encoding header of "chunked", allowing the client to process chunks as they arrive.

All of these could potentially be used for a chat app, and I myself have experimented with both server-sent events and readable streams. Behind the scenes, the ChatGPT API actually uses server-sent events, so you'll find code in the openai package for parsing that protocol. However, I now prefer using readable streams for backend-to-frontend communication. It's the simplest code setup on both the frontend and backend, and it supports the POST requests that our apps are already sending.

The key is to send chunks from the backend in NDJSON (newline-delimited JSON) format and parse them incrementally on the frontend. See my blog post on fetching JSON over streaming HTTP for Python and JavaScript example code.
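To make that concrete, here's a minimal sketch of an NDJSON streaming endpoint using Quart (the framework several of the samples use). The route name, request shape, and the openai_client variable are illustrative assumptions, not the exact code from the samples:

import json

from openai import AsyncOpenAI
from quart import Quart, request

app = Quart(__name__)
openai_client = AsyncOpenAI()  # assumed to be configured with your credentials

@app.post("/chat/stream")
async def chat_stream():
    body = await request.get_json()

    async def generate():
        completion_stream = await openai_client.chat.completions.create(
            model="gpt-5-mini",
            messages=body["messages"],
            stream=True,
        )
        async for chunk in completion_stream:
            content = chunk.choices[0].delta.content
            if content is not None:
                # One JSON object per line (NDJSON), so the frontend
                # can parse each line as soon as it arrives.
                yield json.dumps({"content": content}) + "\n"

    return generate(), 200, {"Content-Type": "application/x-ndjson"}

On the frontend, a fetch() call can then read the response body incrementally and split it on newlines; the blog post linked above walks through that JavaScript side in detail.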

Achieving a word-by-word effect

With all of that in place, we now have a frontend that reveals the model’s answer gradually — almost like watching it type in real time.

Animated GIF of answer appearing gradually

But something still feels off! Despite our frontend receiving chunks of just a few tokens at a time, the UI tends to reveal entire sentences at once. Why does that happen?

It turns out the browser is batching repaints. Instead of immediately re-rendering after each DOM update, it waits until it’s more efficient to repaint — a smart optimization in most cases, but not ideal for a streaming text effect.

My colleague Steve Steiner explored several techniques to make the browser repaint more frequently. The most effective approach uses window.setTimeout() with a delay of 33 milliseconds for each chunk. While this adds a small overall delay, it stays well within a natural reading pace and produces a smooth, word-by-word reveal. See his PR for implementation details for a React codebase.

With that change, our frontend now displays responses at the same granularity as the chat completions API itself — chunk by chunk:

Animated GIF of answer appearing word by word

Streaming more of the process

Many of our sample apps use RAG (Retrieval-Augmented Generation) pipelines that chain together multiple operations — querying data stores (like Azure AI Search), generating embeddings, and finally calling the chat completions API. Naturally, that chain takes longer than a single LLM call, so users may wait several seconds before seeing a response.

One way to improve the experience is to stream more of the process itself. Instead of holding back everything until the final answer, the backend can emit progress updates as each step completes — keeping users informed and engaged.

For example, your app might display messages like this sequence:

  • Processing your question: "Can you suggest a pizza recipe that incorporates both mushroom and pineapples?"
  • Generated search query "pineapple mushroom pizza recipes"
  • Found three related results from our cookbooks: 1) Mushroom calzone 2) Pineapple ham pizza 3) Mushroom loaf
  • Generating answer to your question...
  • Sure! Here's a recipe for a mushroom pineapple pizza...

Adding streamed progress like this makes your app feel responsive and alive, even while the backend is doing complex work. Consider experimenting with progress events in your own chat apps — a few simple updates can greatly improve user trust and engagement.
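Building on the NDJSON endpoint sketched earlier, one way to implement this is to yield typed events from the backend generator: progress events while the pipeline runs, then answer chunks at the end. The helpers below (generate_search_query, retrieve_documents, build_messages) are hypothetical placeholders for your own retrieval code:

async def rag_events(question: str):
    yield json.dumps({"event": "progress",
                      "message": f'Processing your question: "{question}"'}) + "\n"

    search_query = await generate_search_query(question)  # hypothetical helper
    yield json.dumps({"event": "progress",
                      "message": f'Generated search query "{search_query}"'}) + "\n"

    results = await retrieve_documents(search_query)  # hypothetical helper returning a list
    yield json.dumps({"event": "progress",
                      "message": f"Found {len(results)} related results"}) + "\n"

    completion_stream = await openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=build_messages(question, results),  # hypothetical helper
        stream=True,
    )
    async for chunk in completion_stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            yield json.dumps({"event": "answer", "content": content}) + "\n"

The frontend can then switch on the event field, rendering progress messages in a status area and appending answer chunks to the chat bubble.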

Making it optional

After all this talk about streaming, here’s one final recommendation: make streaming optional.

Provide a setting in your frontend to disable streaming, and a corresponding non-streaming endpoint in your backend. This flexibility helps both your users and your developers:

  • For users: Some may prefer (or require) a non-streamed experience for accessibility reasons, or simply to receive the full response at once.
  • For developers: There are times when you’ll want to interact with the app programmatically — for example, using curl, requests, or automated tests — and a standard, non-streaming HTTP endpoint makes that much easier.

Designing your app to gracefully support both modes ensures it’s inclusive, debuggable, and production-ready.
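As a sketch of what that counterpart might look like, here's a companion route that reuses the Quart app and client assumed above, with streaming left off so the full answer comes back as ordinary JSON:

@app.post("/chat")
async def chat():
    body = await request.get_json()
    completion = await openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=body["messages"],
    )
    # Without stream=True, the SDK returns the complete response in one object.
    return {"content": completion.choices[0].message.content}

An endpoint like this is also the natural target for automated tests and curl-based debugging.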

Sample applications

We’ve already linked to several of our sample apps that support streaming, but here’s a complete list so you can explore the one that best fits your tech stack:

Repository                  | App purpose                               | Backend          | Frontend
azure-search-openai-demo    | RAG with AI Search                        | Python + Quart   | React
rag-postgres-openai-python  | RAG with PostgreSQL                       | Python + FastAPI | React
openai-chat-app-quickstart  | Simple chat with Azure OpenAI models      | Python + Quart   | plain JS
openai-chat-backend-fastapi | Simple chat with Azure OpenAI models      | Python + FastAPI | plain JS
deepseek-python             | Simple chat with Azure AI Foundry models  | Python + Quart   | plain JS

Each of these repositories includes streaming support out of the box, so you can inspect real implementation details in both the frontend and backend. They’re a great starting point for learning how to structure your own LLM chat application — or for extending one of the samples to match your specific use case.
