genai
35 TopicsWhy your LLM-powered app needs concurrency
As part of the Python advocacy team, I help maintain several open-source sample AI applications, like our popular RAG chat demo. Through that work, I’ve learned a lot about what makes LLM-powered apps feel fast, reliable, and responsive. One of the most important lessons: use an asynchronous backend framework. Concurrency is critical for LLM apps, which often juggle multiple API calls, database queries, and user requests at the same time. Without async, your app may spend most of its time waiting — blocking one user’s request while another sits idle. The need for concurrency Why? Let’s imagine we’re using a synchronous framework like Flask. We deploy that to a server with gunicorn and several workers. One worker receives a POST request to the "/chat" endpoint, which in turn calls the Azure OpenAI Chat Completions API. That API call can take several seconds to complete — and during that time, the worker is completely tied up, unable to handle any other requests. We could scale out by adding more CPU cores, workers, or threads, but that’s often wasteful and expensive. Without concurrency, each request must be handled serially: When your app relies on long, blocking I/O operations — like model calls, database queries, or external API lookups — a better approach is to use an asynchronous framework. With async I/O, the Python runtime can pause a coroutine that’s waiting for a slow response and switch to handling another incoming request in the meantime. With concurrency, your workers stay busy and can handle new requests while others are waiting: Asynchronous Python backends In the Python ecosystem, there are several asynchronous backend frameworks to choose from: Quart: the asynchronous version of Flask FastAPI: an API-centric, async-only framework (built on Starlette) Litestar: a batteries-included async framework (also built on Starlette) Django: not async by default, but includes support for asynchronous views All of these can be good options depending on your project’s needs. I’ve written more about the decision-making process in another blog post. As an example, let's see what changes when we port a Flask app to a Quart app. First, our handlers now have async in front, signifying that they return a Python coroutine instead of a normal function: async def chat_handler(): request_message = (await request.get_json())["message"] When deploying these apps, I often still use the Gunicorn production web server—but with the Uvicorn worker, which is designed for Python ASGI applications. Alternatively, you can run Uvicorn or Hypercorn directly as standalone servers. Asynchronous API calls To fully benefit from moving to an asynchronous framework, your app’s API calls also need to be asynchronous. That way, whenever a worker is waiting for an external response, it can pause that coroutine and start handling another incoming request. Let's see what that looks like when using the official OpenAI Python SDK. First, we initialize the async version of the OpenAI client: openai_client = openai.AsyncOpenAI( base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1", api_key=token_provider ) Then, whenever we make API calls with methods on that client, we await their results: chat_coroutine = await openai_client.chat.completions.create( deployment_id=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"], messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": request_message}], stream=True, ) For the RAG sample, we also have calls to Azure services like Azure AI Search. To make those asynchronous, we first import the async variant of the credential and client classes in the aio module: from azure.identity.aio import DefaultAzureCredential from azure.search.documents.aio import SearchClient Then, like with the OpenAI async clients, we must await results from any methods that make network calls: r = await self.search_client.search(query_text) By ensuring that every outbound network call is asynchronous, your app can make the most of Python’s event loop — handling multiple user sessions and API requests concurrently, without wasting worker time waiting on slow responses. Sample applications We’ve already linked to several of our samples that use async frameworks, but here’s a longer list so you can find the one that best fits your tech stack: Repository App purpose Backend Frontend azure-search-openai-demo RAG with AI Search Python + Quart React rag-postgres-openai-python RAG with PostgreSQL Python + FastAPI React openai-chat-app-quickstart Simple chat with Azure OpenAI models Python + Quart plain JS openai-chat-backend-fastapi Simple chat with Azure OpenAI models Python + FastAPI plain JS deepseek-python Simple chat with Azure AI Foundry models Python + Quart plain JSJS AI Build-a-thon Setup in 5 Easy Steps
🔥 TL;DR — You’re 5 Steps Away from an AI Adventure Set up your project repo, follow the quests, build cool stuff, and level up. Everything’s automated, community-backed, and designed to help you actually learn AI — using the skills you already have. Let’s build the future. One quest at a time. 👉 Join the Build-a-thon | Chat on DiscordAI Repo of the Week: Generative AI for Beginners with JavaScript
Introduction Ready to explore the fascinating world of Generative AI using your JavaScript skills? This week’s featured repository, Generative AI for Beginners with JavaScript, is your launchpad into the future of application development. Whether you're just starting out or looking to expand your AI toolbox, this open-source GitHub resource offers a rich, hands-on journey. It includes interactive lessons, quizzes, and even time-travel storytelling featuring historical legends like Leonardo da Vinci and Ada Lovelace. Each chapter combines narrative-driven learning with practical exercises, helping you understand foundational AI concepts and apply them directly in code. It’s immersive, educational, and genuinely fun. What You'll Learn 1. 🧠 Foundations of Generative AI and LLMs Start with the basics: What is generative AI? How do large language models (LLMs) work? This chapter lays the groundwork for how these technologies are transforming JavaScript development. 2. 🚀 Build Your First AI-Powered App Walk through setting up your environment and creating your first AI app. Learn how to configure prompts and unlock the potential of AI in your own projects. 3. 🎯 Prompt Engineering Essentials Get hands-on with prompt engineering techniques that shape how AI models respond. Explore strategies for crafting prompts that are clear, targeted, and effective. 4. 📦 Structured Output with JSON Learn how to guide the model to return structured data formats like JSON—critical for integrating AI into real-world applications. 5. 🔍 Retrieval-Augmented Generation (RAG) Go beyond static prompts by combining LLMs with external data sources. Discover how RAG lets your app pull in live, contextual information for more intelligent results. 6. 🛠️ Function Calling and Tool Use Give your LLM new powers! Learn how to connect your own functions and tools to your app, enabling more dynamic and actionable AI interactions. 7. 📚 Model Context Protocol (MCP) Dive into MCP, a new standard for organizing prompts, tools, and resources. Learn how it simplifies AI app development and fosters consistency across projects. 8. ⚙️ Enhancing MCP Clients with LLMs Build on what you’ve learned by integrating LLMs directly into your MCP clients. See how to make them smarter, faster, and more helpful. ✨ More chapters coming soon—watch the repo for updates! Companion App: Interact with History Experience the power of generative AI in action through the companion web app—where you can chat with historical figures and witness how JavaScript brings AI to life in real time. Conclusion Generative AI for Beginners with JavaScript is more than a course—it’s an adventure into how storytelling, coding, and AI can come together to create something fun and educational. Whether you're here to upskill, experiment, or build the next big thing, this repository is your all-in-one resource to get started with confidence. 🔗 Jump into the future of development—check out the repo and start building with AI today!Orchestrating Multi-Agent Intelligence: MCP-Driven Patterns in Agent Framework
Building reliable AI systems requires modular, stateful coordination and deterministic workflows that enable agents to collaborate seamlessly. The Microsoft Agent Framework provides these foundations, with memory, tracing, and orchestration built in. This implementation demonstrates four multi-agentic patterns — Single Agent, Handoff, Reflection, and Magentic Orchestration — showcasing different interaction models and collaboration strategies. From lightweight domain routing to collaborative planning and self-reflection, these patterns highlight the framework’s flexibility. At the core is Model Context Protocol (MCP), connecting agents, tools, and memory through a shared context interface. Persistent session state, conversation thread history, and checkpoint support are handled via Cosmos DB when configured, with an in-memory dictionary as a default fallback. This setup enables dynamic pattern swapping, performance comparison, and traceable multi-agent interactions — all within a unified, modular runtime. Business Scenario: Contoso Customer Support Chatbot Contoso’s chatbot handles multi-domain customer inquiries like billing anomalies, promotion eligibility, account locks, and data usage questions. These require combining structured data (billing, CRM, security logs, promotions) with unstructured policy documents processed via vector embeddings. Using MCP, the system orchestrates tool calls to fetch real-time structured data and relevant policy content, ensuring policy-aligned, auditable responses without exposing raw databases. This enables the assistant to explain anomalies, recommend actions, confirm eligibility, guide account recovery, and surface risk indicators—reducing handle time and improving first-contact resolution while supporting richer multi-agent reasoning. Architecture & Core Concepts The Contoso chatbot leverages the Microsoft Agent Framework to deliver a modular, stateful, and workflow-driven architecture. At its core, the system consists of: Base Agent: All agent patterns—single agent, reflection, handoff and magentic orchestration—inherit from a common base class, ensuring consistent interfaces for message handling, tool invocation, and state management. Backend: A FastAPI backend manages session routing, agent execution, and workflow orchestration. Frontend: A React-based UI (or Streamlit alternative) streams responses in real-time and visualizes agent reasoning and tool calls. Modular Runtime and Pattern Swapping One of the most powerful aspects of this implementation is its modular runtime design. Each agentic pattern—Single, Reflection, Handoff, and Magnetic—plugs into a shared execution pipeline defined by the base agent and MCP integration. By simply updating the .env configuration (e.g., agent_module=handoff), developers can swap in and out entire coordination strategies without touching the backend, frontend, or memory layers. This makes it easy to compare agent styles side by side, benchmark reasoning behaviors, and experiment with orchestration logic—all while maintaining a consistent, deterministic runtime. The same MCP connectors, FastAPI backend, and Cosmos/in-memory state management work seamlessly across every pattern, enabling rapid iteration and reliable evaluation. # Dynamic agent pattern loading agent_module_path = os.getenv("AGENT_MODULE") agent_module = __import__(agent_module_path, fromlist=["Agent"]) Agent = getattr(agent_module, "Agent") # Common MCP setup across all patterns async def _create_tools(self, headers: Dict[str, str]) -> List[MCPStreamableHTTPTool] | None: if not self.mcp_server_uri: return None return [MCPStreamableHTTPTool( name="mcp-streamable", url=self.mcp_server_uri, headers=headers, timeout=30, request_timeout=30, )] Memory & State Management State management is critical for multi-turn conversations and cross-agent workflows. The system supports two out-of-the-box options: Persistent Storage (Cosmos DB) Acts as the durable, enterprise-ready backend. Stores serialized conversation threads and workflow checkpoints keyed by tenant and session ID. Ensures data durability and auditability across restarts. In-Memory Session Store Default fallback when Cosmos DB credentials are not configured. Maintains ephemeral state per session for fast prototyping or lightweight use cases. All patterns leverage the same thread-based state abstraction, enabling: Session isolation: Each user session maintains its own state and history. Checkpointing: Multi-agent workflows can snapshot shared and executor-local state at any point, supporting pause/resume and fault recovery. Model Context Protocol (MCP): Acts as the connector between agents and tools, standardizing how data is fetched and results are returned to agents, whether querying structured databases or unstructured knowledge sources. Core Principles Across all patterns, the framework emphasizes: Modularity: Components are interchangeable—agents, tools, and state stores can be swapped without disrupting the system. Stateful Coordination: Multi-agent workflows coordinate through shared and local state, enabling complex reasoning without losing context. Deterministic Workflows: While agents operate autonomously, the workflow layer ensures predictable, auditable execution of multi-agent tasks. Unified Execution: From single-agent Q&A to complex Magentic orchestrations, every agent follows the same execution lifecycle and integrates seamlessly with MCP and the state store. Multi-Agent Patterns: Workflow and Coordination With the architecture and core concepts established, we can now explore the agentic patterns implemented in the Contoso chatbot. Each pattern builds on the base agent and MCP integration but differs in how agents orchestrate tasks and communicate with one another to handle multi-domain customer queries. In the sections that follow, we take a deeper dive into each pattern’s workflow and examine the under-the-hood communication flows between agents: Single Agent – A simple, single-domain agent handling straightforward queries. Reflection Agent – Allows agents to introspect and refine their outputs. Handoff Pattern – Routes conversations intelligently to specialized agents across domains. Magentic Orchestration – Coordinates multiple specialist agents for complex, parallel tasks. For each pattern, the focus will be on how agents communicate and coordinate, showing the practical orchestration mechanisms in action. Single Intelligent Agent The Single Agent Pattern represents the simplest orchestration style within the framework. Here, a single autonomous agent handles all reasoning, decision-making, and tool interactions directly — without delegation or multi-agent coordination. When a user submits a request, the single agent processes the query using all tools, memory, and data sources available through the Model Context Protocol (MCP). It performs retrieval, reasoning, and response composition in a single, cohesive loop. Communication Flow: User Input → Agent: The user submits a question or command. Agent → MCP Tools: The agent invokes one or more tools (e.g., vector retrieval, structured queries, or API calls) to gather relevant context and data. Agent → User: The agent synthesizes the tool outputs, applies reasoning, and generates the final response to the user. Session Memory: Throughout the exchange, the agent stores conversation history and extracted entities in the configured memory store (in-memory or Cosmos DB). Key Communication Principles: Single Responsibility: One agent performs both reasoning and action, ensuring fast response times and simpler state management. Direct Tool Invocation: The agent has direct access to all registered tools through MCP, enabling flexible retrieval and action chaining. Stateful Execution: The session memory preserves dialogue context, allowing the agent to maintain continuity across user turns. Deterministic Behavior: The workflow is fully predictable — input, reasoning, tool call, and output occur in a linear sequence. Reflection pattern The Reflection Pattern introduces a lightweight, two-agent communication loop designed to improve the quality and reliability of responses through structured self-review. In this setup, a Primary Agent first generates an initial response to the user’s query. This draft is then passed to a Reviewer Agent, whose role is to critique and refine the response—identifying gaps, inaccuracies, or missed context. Finally, the Primary Agent incorporates this feedback and produces a polished final answer for the user. This process introduces one round of reflection and improvement without adding excessive latency, balancing quality with responsiveness. Communication Flow: User Input → Primary Agent: The user submits a query. Primary Agent → Reviewer Agent: The primary generates an initial draft and passes it to the reviewer. Reviewer Agent → Primary Agent: The reviewer provides feedback or suggested improvements. Primary Agent → User: The primary revises its response and sends the refined version back to the user. Key Communication Principles: Two-Stage Dialogue: Structured interaction between Primary and Reviewer ensures each output undergoes quality assurance. Focused Review: The Reviewer doesn’t recreate answers—it critiques and enhances, reducing redundancy. Stateful Context: Both agents operate over the same shared memory, ensuring consistency between draft and revision. Deterministic Flow: A single reflection round guarantees predictable latency while still improving answer quality. Transparent Traceability: Each step—initial draft, feedback, and final output—is logged, allowing developers to audit reasoning or assess quality improvements over time. In practice, this pattern enables the system to reason about its own output before responding, yielding clearer, more accurate, and policy-aligned answers without requiring multiple independent retries. Handoff Pattern When a user request arrives, the system first routes it through an Intent Classifier (or triage agent) to determine which domain specialist should handle the conversation. Once identified, control is handed off directly to that Specialist Agent, which uses its own tools, domain knowledge, and state context to respond. This specialist continues to handle the user interaction as long as the conversation stays within its domain. If the user’s intent shifts — for example, moving from billing to security — the conversation is routed back to the Intent Classifier, which re-assigns it to the correct specialist agent. This pattern reduces latency and maintains continuity by minimizing unnecessary routing. Each handoff is tracked through the shared state store, ensuring seamless context carry-over and full traceability of decisions. Key Communication Principles: Dynamic Routing: The Intent Classifier routes user input to the right specialist domain. Domain Persistence: The specialist remains active while the user stays within its domain. Context Continuity: Conversation history and entities persist across agents through the shared state store. Traceable Handoffs: Every routing decision is logged for observability and auditability. Low Latency: Responses are faster since domain-appropriate agents handle queries directly. In practice, this means a user could begin a conversation about billing, continue seamlessly, and only be re-routed when switching topics — without losing any conversational context or history. Magentic Pattern The Magentic Pattern is designed for open-ended, multi-faceted tasks that require multiple agents to collaborate. It introduces a Manager (Planner) Agent, which interprets the user’s goal, breaks it into subtasks, and orchestrates multiple Specialist Agents to execute those subtasks. The Manager creates and maintains a Task Ledger, which tracks the status, dependencies, and results of each specialist’s work. As specialists perform their tool calls or reasoning, the Manager monitors their progress, gathers intermediate outputs, and can dynamically re-plan, dispatch additional tasks, or adjust the overall workflow. When all subtasks are complete, the Manager synthesizes the combined results into a coherent final response for the user. Key Communication Principles: Centralized Orchestration: The Manager coordinates all agent interactions and workflow logic. Parallel and Sequential Execution: Specialists can work simultaneously or in sequence based on task dependencies. Task Ledger: Acts as a transparent record of all task assignments, updates, and completions. Dynamic Re-planning: The Manager can modify or extend workflows in real time based on intermediate findings. Shared Memory: All agents access the same state store for consistent context and result sharing. Unified Output: The Manager consolidates results into one response, ensuring coherence across multi-agent reasoning. In practice, Magentic orchestration enables complex reasoning where the system might combine insights from multiple agents — e.g., billing, product, and security — and present a unified recommendation or resolution to the user. Choosing the Right Agent for Your Use Case Selecting the appropriate agent pattern hinges on the complexity of the task and the level of coordination required. As use cases evolve from straightforward queries to intricate, multi-step processes, the need for specialized orchestration increases. Below is a decision matrix to guide your choice: Feature / Requirement Single Agent Reflection Agent Handoff Pattern Magentic Orchestration Handles simple, domain-bound tasks ✔ ✔ ✖ ✖ Supports review / quality assurance ✖ ✔ ✖ ✔ Multi-domain routing ✖ ✖ ✔ ✔ Open-ended / complex workflows ✖ ✖ ✖ ✔ Parallel agent collaboration ✖ ✖ ✖ ✔ Direct tool access ✔ ✔ ✔ ✔ Low latency / fast response ✔ ✔ ✔ ✖ Easy to implement / low orchestration ✔ ✔ ✖ ✖ Dive Deeper: Explore, Build, and Innovate We've explored various agent patterns, from Single Agent to Magentic Orchestration, each tailored to different use cases and complexities. To see these patterns in action, we invite you to explore our Github repo. Clone the repo, experiment with the examples, and adapt them to your own scenarios. Additionally, beyond the patterns discussed here, the repository also features a Human-in-the-Loop (HITL) workflow designed for fraud detection. This workflow integrates human oversight into AI decision-making, ensuring higher accuracy and reliability. For an in-depth look at this approach, we recommend reading our detailed blog post: Building Human-in-the-loop AI Workflows with Microsoft Agent Framework | Microsoft Community Hub Engage with these resources, and start building intelligent, reliable, and scalable AI systems today! This repository and content is developed and maintained by James Nguyen, Nicole Serafino, Kranthi Kumar Manchikanti, Heena Ugale, and Tim Sullivan.From Cloud to Chip: Building Smarter AI at the Edge with Windows AI PCs
As AI engineers, we’ve spent years optimizing models for the cloud, scaling inference, wrangling latency, and chasing compute across clusters. But the frontier is shifting. With the rise of Windows AI PCs and powerful local accelerators, the edge is no longer a constraint it’s now a canvas. Whether you're deploying vision models to industrial cameras, optimizing speech interfaces for offline assistants, or building privacy-preserving apps for healthcare, Edge AI is where real-world intelligence meets real-time performance. Why Edge AI, Why Now? Edge AI isn’t just about running models locally, it’s about rethinking the entire lifecycle: - Latency: Decisions in milliseconds, not round-trips to the cloud. - Privacy: Sensitive data stays on-device, enabling HIPAA/GDPR compliance. - Resilience: Offline-first apps that don’t break when the network does. - Cost: Reduced cloud compute and bandwidth overhead. With Windows AI PCs powered by Intel and Qualcomm NPUs and tools like ONNX Runtime, DirectML, and Olive, developers can now optimize and deploy models with unprecedented efficiency. What You’ll Learn in Edge AI for Beginners The Edge AI for Beginners curriculum is a hands-on, open-source guide designed for engineers ready to move from theory to deployment. Multi-Language Support This content is available in over 48 languages, so you can read and study in your native language. What You'll Master This course takes you from fundamental concepts to production-ready implementations, covering: Small Language Models (SLMs) optimized for edge deployment Hardware-aware optimization across diverse platforms Real-time inference with privacy-preserving capabilities Production deployment strategies for enterprise applications Why EdgeAI Matters Edge AI represents a paradigm shift that addresses critical modern challenges: Privacy & Security: Process sensitive data locally without cloud exposure Real-time Performance: Eliminate network latency for time-critical applications Cost Efficiency: Reduce bandwidth and cloud computing expenses Resilient Operations: Maintain functionality during network outages Regulatory Compliance: Meet data sovereignty requirements Edge AI Edge AI refers to running AI algorithms and language models locally on hardware, close to where data is generated without relying on cloud resources for inference. It reduces latency, enhances privacy, and enables real-time decision-making. Core Principles: On-device inference: AI models run on edge devices (phones, routers, microcontrollers, industrial PCs) Offline capability: Functions without persistent internet connectivity Low latency: Immediate responses suited for real-time systems Data sovereignty: Keeps sensitive data local, improving security and compliance Small Language Models (SLMs) SLMs like Phi-4, Mistral-7B, Qwen and Gemma are optimized versions of larger LLMs, trained or distilled for: Reduced memory footprint: Efficient use of limited edge device memory Lower compute demand: Optimized for CPU and edge GPU performance Faster startup times: Quick initialization for responsive applications They unlock powerful NLP capabilities while meeting the constraints of: Embedded systems: IoT devices and industrial controllers Mobile devices: Smartphones and tablets with offline capabilities IoT Devices: Sensors and smart devices with limited resources Edge servers: Local processing units with limited GPU resources Personal Computers: Desktop and laptop deployment scenarios Course Modules & Navigation Course duration. 10 hours of content Module Topic Focus Area Key Content Level Duration 📖 00 Introduction to EdgeAI Foundation & Context EdgeAI Overview • Industry Applications • SLM Introduction • Learning Objectives Beginner 1-2 hrs 📚 01 EdgeAI Fundamentals Cloud vs Edge AI comparison EdgeAI Fundamentals • Real World Case Studies • Implementation Guide • Edge Deployment Beginner 3-4 hrs 🧠 02 SLM Model Foundations Model families & architecture Phi Family • Qwen Family • Gemma Family • BitNET • μModel • Phi-Silica Beginner 4-5 hrs 🚀 03 SLM Deployment Practice Local & cloud deployment Advanced Learning • Local Environment • Cloud Deployment Intermediate 4-5 hrs ⚙️ 04 Model Optimization Toolkit Cross-platform optimization Introduction • Llama.cpp • Microsoft Olive • OpenVINO • Apple MLX • Workflow Synthesis Intermediate 5-6 hrs 🔧 05 SLMOps Production Production operations SLMOps Introduction • Model Distillation • Fine-tuning • Production Deployment Advanced 5-6 hrs 🤖 06 AI Agents & Function Calling Agent frameworks & MCP Agent Introduction • Function Calling • Model Context Protocol Advanced 4-5 hrs 💻 07 Platform Implementation Cross-platform samples AI Toolkit • Foundry Local • Windows Development Advanced 3-4 hrs 🏭 08 Foundry Local Toolkit Production-ready samples Sample applications (see details below) Expert 8-10 hrs Each module includes Jupyter notebooks, code samples, and deployment walkthroughs, perfect for engineers who learn by doing. Developer Highlights - 🔧 Olive: Microsoft's optimization toolchain for quantization, pruning, and acceleration. - 🧩 ONNX Runtime: Cross-platform inference engine with support for CPU, GPU, and NPU. - 🎮 DirectML: GPU-accelerated ML API for Windows, ideal for gaming and real-time apps. - 🖥️ Windows AI PCs: Devices with built-in NPUs for low-power, high-performance inference. Local AI: Beyond the Edge Local AI isn’t just about inference, it’s about autonomy. Imagine agents that: - Learn from local context - Adapt to user behavior - Respect privacy by design With tools like Agent Framework, Azure AI Foundry and Windows Copilot Studio, and Foundry Local developers can orchestrate local agents that blend LLMs, sensors, and user preferences, all without cloud dependency. Try It Yourself Ready to get started? Clone the Edge AI for Beginners GitHub repo, run the notebooks, and deploy your first model to a Windows AI PC or IoT devices Whether you're building smart kiosks, offline assistants, or industrial monitors, this curriculum gives you the scaffolding to go from prototype to production.The importance of streaming for LLM-powered chat applications
Thanks to the popularity of chat-based interfaces like ChatGPT and GitHub Copilot, users have grown accustomed to getting answers conversationally. As a result, thousands of developers are now deploying chat applications on Azure for their own specialized domains. To help developers understand how to build LLM-powered chat apps, we have open-sourced many chat app templates, like a super simple chat app and the very popular and sophisticated RAG chat app. All our templates support an important feature: streaming. At first glance, streaming might not seem essential. But users have come to expect it from modern chat experiences. Beyond meeting expectations, streaming can dramatically improve the time to first token — letting your frontend display words as soon as they’re generated, instead of making users wait seconds for a complete answer. How to stream from the APIs Most modern LLM APIs and wrapper libraries now support streaming responses — usually through a simple boolean flag or a dedicated streaming method. Let’s look at an example using the official OpenAI Python SDK. The openai package makes it easy to stream responses by passing a stream=True argument: completion_stream = openai_client.chat.completions.create( model="gpt-5-mini", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What does a product manager do?"}, ], stream=True, ) When stream is true, the return type is an iterable, so we can use a for loop to process each of the ChatCompletion chunk objects: for chunk in await completion_stream: content = event.choices[0].delta.content Sending stream from backend to frontend When building a web app, we need a way to stream data from the backend to the browser. A normal HTTP response won’t work here — it sends all the data at once, then closes the connection. Instead, we need a protocol that allows data to arrive progressively. The most common options are: WebSockets: A bidirectional channel where both client and server can send data at any time. Server-sent events: A one-way channel where the server continuously pushes events to the client over HTTP. Readable streams: An HTTP response with a Transfer-encoding header of "chunked", allowing the client to process chunks as they arrive. All of these could potentially be used for a chat app, and I myself have experimented with both server-sent events and readable streams. Behind the scenes, the ChatGPT API actually uses server-sent events, so you'll find code in the openai package for parsing that protocol. However, I now prefer using readable streams for my frontend to backend communication. It's the simplest code setup on both the frontend and backend, and it supports the POST requests that our apps are already sending. The key is to send chunks from the backend in NDJSON (newline-delimited JSON) format and parse them incrementally on the frontend. See my blog post on fetching JSON over streaming HTTP for Python and JavaScript example code. Achieving a word-by-word effect With all of that in place, we now have a frontend that reveals the model’s answer gradually — almost like watching it type in real time. But something still feels off! Despite our frontend receiving chunks of just a few tokens at a time, that UI tends to reveal entire sentences at once. Why does that happen? It turns out the browser is batching repaints. Instead of immediately re-rendering after each DOM update, it waits until it’s more efficient to repaint — a smart optimization in most cases, but not ideal for a streaming text effect. My colleague Steve Steiner explored several techniques to make the browser repaint more frequently. The most effective approach uses window.setTimeout() with a delay of 33 milliseconds for each chunk. While this adds a small overall delay, it stays well within a natural reading pace and produces a smooth, word-by-word reveal. See his PR for implementation details for a React codebase. With that change, our frontend now displays responses at the same granularity as the chat completions API itself — chunk by chunk: Streaming more of the process Many of our sample apps use RAG (Retrieval-Augmented Generation) pipelines that chain together multiple operations — querying data stores (like Azure AI Search), generating embeddings, and finally calling the chat completions API. Naturally, that chain takes longer than a single LLM call, so users may wait several seconds before seeing a response. One way to improve the experience is to stream more of the process itself. Instead of holding back everything until the final answer, the backend can emit progress updates as each step completes — keeping users informed and engaged. For example, your app might display messages like this sequence: Processing your question: "Can you suggest a pizza recipe that incorporates both mushroom and pineapples?" Generated search query "pineapple mushroom pizza recipes" Found three related results from our cookbooks: 1) Mushroom calzone 2) Pineapple ham pizza 3) Mushroom loaf Generating answer to your question... Sure! Here's a recipe for a mushroom pineapple pizza... Adding streamed progress like this makes your app feel responsive and alive, even while the backend is doing complex work. Consider experimenting with progress events in your own chat apps — a few simple updates can greatly improve user trust and engagement. Making it optional After all this talk about streaming, here’s one final recommendation: make streaming optional. Provide a setting in your frontend to disable streaming, and a corresponding non-streaming endpoint in your backend. This flexibility helps both your users and your developers: For users: Some may prefer (or require) a non-streamed experience for accessibility reasons, or simply to receive the full response at once. For developers: There are times when you’ll want to interact with the app programmatically — for example, using curl, requests, or automated tests — and a standard, non-streaming HTTP endpoint makes that much easier. Designing your app to gracefully support both modes ensures it’s inclusive, debuggable, and production-ready. Sample applications We’ve already linked to several of our sample apps that support streaming, but here’s a complete list so you can explore the one that best fits your tech stack: Repository App purpose Backend Frontend azure-search-openai-demo RAG with AI Search Python + Quart React rag-postgres-openai-python RAG with PostgreSQL Python + FastAPI React openai-chat-app-quickstart Simple chat with Azure OpenAI models Python + Quart plain JS openai-chat-backend-fastapi Simple chat with Azure OpenAI models Python + FastAPI plain JS deepseek-python Simple chat with Azure AI Foundry models Python + Quart plain JS Each of these repositories includes streaming support out of the box, so you can inspect real implementation details in both the frontend and backend. They’re a great starting point for learning how to structure your own LLM chat application — or for extending one of the samples to match your specific use case.¡Curso oficial y gratuito de GenAI y Python! 🚀
¿Quieres aprender a usar modelos de IA generativa en tus aplicaciones de Python?Estamos organizando una serie de nueve transmisiones en vivo, en inglés y español, totalmente dedicadas a la IA generativa. Vamos a cubrir modelos de lenguaje (LLMs), modelos de embeddings, modelos de visión, y también técnicas como RAG, function calling y structured outputs. Además, te mostraremos cómo construir Agentes y servidores MCP, y hablaremos sobre seguridad en IA y evaluaciones, para asegurarnos de que tus modelos y aplicaciones generen resultados seguros. 🔗 Regístrate para toda la serie. Además de las transmisiones en vivo, puedes unirte a nuestras office hours semanales en el AI Foundry Discord de para hacer preguntas que no se respondan durante el chat. ¡Nos vemos en los streams! 👋🏻 Here’s your HTML converted into clean, readable text format (perfect for a newsletter, blog post, or social media caption): Modelos de Lenguaje 📅 7 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡Únete a la primera sesión de nuestra serie de Python + IA! En esta sesión, hablaremos sobre los Modelos de Lenguaje (LLMs), los modelos que impulsan ChatGPT y GitHub Copilot. Usaremos Python para interactuar con LLMs utilizando paquetes como el SDK de OpenAI y Langchain. Experimentaremos con prompt engineering y ejemplos few-shot para mejorar los resultados. También construiremos una aplicación full stack impulsada por LLMs y explicaremos la importancia de la concurrencia y el streaming en apps de IA orientadas al usuario. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Embeddings Vectoriales 📅 8 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la segunda sesión de Python + IA, exploraremos los embeddings vectoriales, una forma de codificar texto o imágenes como arrays de números decimales. Estos modelos permiten realizar búsquedas por similitud en distintos tipos de contenido. Usaremos modelos como la serie text-embedding-3 de OpenAI, visualizaremos resultados en Python y compararemos métricas de distancia. También veremos cómo aplicar cuantización y cómo usar modelos multimodales de embedding. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Recuperación-Aumentada Generación (RAG) 📅 9 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la tercera sesión, exploraremos RAG, una técnica que envía contexto al LLM para obtener respuestas más precisas dentro de un dominio específico. Usaremos distintas fuentes de datos —CSVs, páginas web, documentos, bases de datos— y construiremos una app RAG full-stack con Azure AI Search. Modelos de Visión 📅 14 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡La cuarta sesión trata sobre modelos de visión como GPT-4o y 4o-mini! Estos modelos pueden procesar texto e imágenes, generando descripciones, extrayendo datos, respondiendo preguntas o clasificando contenido. Usaremos Python para enviar imágenes a los modelos, crear una app de chat con imágenes e integrarlos en flujos RAG. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Salidas Estructuradas 📅 15 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la quinta sesión aprenderemos a hacer que los LLMs generen respuestas estructuradas según un esquema. Exploraremos el modo structured outputs de OpenAI y cómo aplicarlo para extracción de entidades, clasificación y flujos con agentes. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Calidad y Seguridad 📅 16 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la sexta sesión hablaremos sobre cómo usar IA de manera segura y evaluar la calidad de las salidas. Mostraremos cómo configurar Azure AI Content Safety, manejar errores en código Python y evaluar resultados con el SDK de Evaluación de Azure AI. Tool Calling 📅 21 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la última semana de la serie, nos enfocamos en tool calling (function calling), la base para construir agentes de IA. Aprenderemos a definir herramientas en Python o JSON, manejar respuestas de los modelos y habilitar llamadas paralelas y múltiples iteraciones. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Agentes de IA 📅 22 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡En la penúltima sesión construiremos agentes de IA! Usaremos frameworks como Langgraph, Semantic Kernel, Autogen, y Pydantic AI. Empezaremos con ejemplos simples y avanzaremos a arquitecturas más complejas como round-robin, supervisor, graphs y ReAct. Model Context Protocol (MCP) 📅 23 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor Cerramos la serie con Model Context Protocol (MCP), la tecnología abierta más candente de 2025. Aprenderás a usar FastMCP para crear un servidor MCP local y conectarlo a chatbots como GitHub Copilot. También veremos cómo integrar MCP con frameworks de agentes como Langgraph, Semantic Kernel y Pydantic AI. Y, por supuesto, hablaremos sobre los riesgos de seguridad y las mejores prácticas para desarrolladores. ¿Querés que lo reformatee para publicación en Markdown (para blogs o repos) o en texto plano con emojis y separadores estilo redes sociales?Use Prompty with Foundry Local
Prompty is a powerful tool for managing prompts in AI applications. Not only does it allow you to easily test your prompts during development, but it also provides observability, understandability and portability. Here's how to use Prompty with Foundry Local to support your AI applications with on-device inference. Foundry Local At the Build '25 conference, Microsoft announced Foundry Local, a new tool that allows developers to run AI models locally on their devices. Foundry Local offers developers several benefits, including performance, privacy, and cost savings. Why Prompty? When you build AI applications with Foundry Local, but also other language model hosts, consider using Prompty to manage your prompts. With Prompty, you store your prompts in separate files, making it easy to test and adjust them without changing your code. Prompty also supports templating, allowing you to create dynamic prompts that adapt to different contexts or user inputs. Using Prompty with Foundry Local The most convenient way to use Prompty with Foundry Local is to create a new configuration for Foundry Local. Using a separate configuration allows you to seamlessly test your prompts without having to repeat the configuration for every prompt. It also allows you to easily switch between different configurations, such as Foundry Local and other language model hosts. Install Prompty and Foundry Local To get started, install the Prompty Visual Studio Code extension and Foundry Local. Start Foundry Local from the command line by running foundry service start and note the URL on which it listens for requests, such as http://localhost:5272 or http://localhost:5273. Create a new Prompty configuration for Foundry Local If you don't have a Prompty file yet, create one to easily access Prompty settings. In Visual Studio Code, open Explorer, click right to open the context menu, and select New Prompty. This creates a basic.prompty file in your workspace. Create the Foundry Local configuration From the status bar, select default to open the Prompty configuration picker. When prompted to select the configuration, choose Add or Edit.... In the settings pane, choose Edit in settings.json. In the settings.json file, to the prompty.modelConfigurations collection, add a new configuration for Foundry Local, for example (ignore comments): { // Foundry Local model ID that you want to use "name": "Phi-4-mini-instruct-generic-gpu", // API type; Foundry Local exposes OpenAI-compatible APIs "type": "openai", // API key required for the OpenAI SDK, but not used by Foundry Local "api_key": "local", // The URL where Foundry Local exposes its API "base_url": "http://localhost:5272/v1" } Important: Be sure to check that you use the correct URL for Foundry Local. If you started Foundry Local with a different port, adjust the URL accordingly. Save your changes, and go back to the .prompty file. Once again, select the default configuration from the status bar, and choose Phi-4-mini-instruct-generic-gpu from the list. Since the model and API are configured, you can remove them from the .prompty file. Test your prompts With the newly created Foundry Local configuration selected, in the .prompty file, press F5 to test the prompt. The first time you run the prompt, it may take a few seconds because Foundry Local needs to load the model. Eventually, you should see the response from Foundry Local in the output pane. Summary Using Prompty with Foundry Local allows you to easily manage and test your prompts while running AI models locally. By creating a dedicated Prompty configuration for Foundry Local, you can conveniently test your prompts with Foundry Local models and switch between different model hosts and models if needed.🚨Introducing the JS AI Build-a-thon 🚨
We’re entering a future where AI-first and agentic developer experiences will shape how we build — and you don’t want to be left behind. This isn’t your average hackathon. It’s a hands-on, quest-driven learning experience designed for developers, packed with: Interactive quests that guide you step by step — from your first prototype to production-ready apps Community-powered support via our dedicated Discord and local, community-led study jams Showcase moments to share your journey, get inspired, and celebrate what you build Whether you're just starting your AI journey or looking to sharpen your skills with frameworks like LangChain.js, tools like the Azure AI Foundry and AI Toolkit Extensions, or diving deeper into agentic app design — this is your moment to start building.