Microsoft Developer Community Blog

Why your LLM-powered app needs concurrency

Pamela_Fox
Oct 07, 2025

As part of the Python advocacy team, I help maintain several open-source sample AI applications, like our popular RAG chat demo. Through that work, I’ve learned a lot about what makes LLM-powered apps feel fast, reliable, and responsive.

One of the most important lessons: use an asynchronous backend framework. Concurrency is critical for LLM apps, which often juggle multiple API calls, database queries, and user requests at the same time. Without async, your app may spend most of its time waiting — blocking one user’s request while another sits idle.

The need for concurrency

Why? Let’s imagine we’re using a synchronous framework like Flask. We deploy that to a server with Gunicorn and several workers. One worker receives a POST request to the "/chat" endpoint, which in turn calls the Azure OpenAI Chat Completions API.

That API call can take several seconds to complete — and during that time, the worker is completely tied up, unable to handle any other requests. We could scale out by adding more CPU cores, workers, or threads, but that’s often wasteful and expensive.
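
To make that concrete, here is a minimal sketch of the synchronous setup described above (the route, variable names, and the API-key environment variable are illustrative assumptions, not code from the samples):

import os

import openai
from flask import Flask, jsonify, request

app = Flask(__name__)
openai_client = openai.OpenAI(
    base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1",
    api_key=os.environ["AZURE_OPENAI_KEY"],
)

@app.post("/chat")
def chat_handler():
    request_message = request.get_json()["message"]
    # This call can take several seconds, and the worker is blocked
    # the entire time, so it cannot serve any other request.
    response = openai_client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
        messages=[{"role": "user", "content": request_message}],
    )
    return jsonify({"message": response.choices[0].message.content})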

Without concurrency, each request must be handled serially:

[Diagram: worker handling requests one after the other]

When your app relies on long, blocking I/O operations — like model calls, database queries, or external API lookups — a better approach is to use an asynchronous framework. With async I/O, the Python runtime can pause a coroutine that’s waiting for a slow response and switch to handling another incoming request in the meantime.

With concurrency, your workers stay busy and can handle new requests while others are waiting:

[Diagram: worker handling second request while first request waits for API response]
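
Here is a toy illustration of that behavior using plain asyncio (not taken from the sample apps): two simulated requests each wait two seconds, but because the event loop overlaps the waits, the total time is roughly two seconds rather than four.

import asyncio
import time

async def handle_request(i: int) -> str:
    await asyncio.sleep(2)  # stands in for a slow LLM or database call
    return f"response {i}"

async def main():
    start = time.perf_counter()
    # Both "requests" wait concurrently on the same event loop.
    results = await asyncio.gather(handle_request(1), handle_request(2))
    print(results, f"in {time.perf_counter() - start:.1f}s")

asyncio.run(main())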

Asynchronous Python backends

In the Python ecosystem, there are several asynchronous backend frameworks to choose from:

  • Quart: the asynchronous version of Flask
  • FastAPI: an API-centric, async-first framework (built on Starlette)
  • Litestar: a batteries-included, ASGI-based async framework
  • Django: not async by default, but includes support for asynchronous views

All of these can be good options depending on your project’s needs. I’ve written more about the decision-making process in another blog post.

As an example, let's see what changes when we port a Flask app to a Quart app.

First, our handlers now have async in front, which makes them coroutine functions: when called, they return a Python coroutine that the event loop can pause and resume instead of running straight through:

from quart import request

async def chat_handler():
    request_message = (await request.get_json())["message"]

When deploying these apps, I often still use the Gunicorn production web server, but with the Uvicorn worker, which is designed for Python ASGI applications. Alternatively, you can run Uvicorn or Hypercorn directly as standalone servers.
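
For example, during development you can start the server from Python itself with Uvicorn (a minimal sketch, assuming the Quart app object is named app in a module called app.py):

import uvicorn

if __name__ == "__main__":
    # Serve the ASGI app directly with Uvicorn instead of going through Gunicorn.
    uvicorn.run("app:app", host="127.0.0.1", port=8000)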

Asynchronous API calls

To fully benefit from moving to an asynchronous framework, your app’s API calls also need to be asynchronous. That way, whenever a worker is waiting for an external response, it can pause that coroutine and start handling another incoming request.

Let's see what that looks like when using the official OpenAI Python SDK. First, we initialize the async version of the OpenAI client:

import os
import openai

openai_client = openai.AsyncOpenAI(
    base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1",
    api_key=token_provider,  # an async token callable (see the sketch below)
)
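
The token_provider above is an asynchronous credential callable. One way to construct it, assuming keyless authentication with Microsoft Entra ID (a sketch, not copied verbatim from the samples):

from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider

# An async callable that fetches Entra ID tokens scoped to Azure AI services.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)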

Then, whenever we make API calls with methods on that client, we await their results:

chat_stream = await openai_client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],  # deployment name of the chat model
    messages=[{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": request_message}],
    stream=True,
)
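
Because stream=True is set, the awaited call returns an asynchronous stream of chunks, which we can consume with async for (a sketch; the samples forward these chunks to the browser rather than printing them):

async for chunk in chat_stream:
    # Each chunk arrives asynchronously, so the worker can switch to other
    # requests while waiting for the next one.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")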

For the RAG sample, we also have calls to Azure services like Azure AI Search. To make those asynchronous, we first import the async variants of the credential and client classes from their aio modules:

from azure.identity.aio import DefaultAzureCredential
from azure.search.documents.aio import SearchClient

Then, like with the OpenAI async clients, we must await results from any methods that make network calls:

r = await self.search_client.search(query_text)
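
The returned value is an asynchronous iterator over search results, so consuming it is non-blocking as well (a sketch; the "content" field name is an assumption about the index schema):

sources = []
async for doc in r:
    # Result pages are fetched asynchronously as we iterate.
    sources.append(doc["content"])  # assumed field name in the search index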

By ensuring that every outbound network call is asynchronous, your app can make the most of Python’s event loop — handling multiple user sessions and API requests concurrently, without wasting worker time waiting on slow responses.

Sample applications

We’ve already linked to several of our samples that use async frameworks, but here’s a longer list so you can find the one that best fits your tech stack:

Repository                   | App purpose                               | Backend          | Frontend
azure-search-openai-demo     | RAG with AI Search                        | Python + Quart   | React
rag-postgres-openai-python   | RAG with PostgreSQL                       | Python + FastAPI | React
openai-chat-app-quickstart   | Simple chat with Azure OpenAI models      | Python + Quart   | plain JS
openai-chat-backend-fastapi  | Simple chat with Azure OpenAI models      | Python + FastAPI | plain JS
deepseek-python              | Simple chat with Azure AI Foundry models  | Python + Quart   | plain JS