genai
40 TopicsWhy your LLM-powered app needs concurrency
As part of the Python advocacy team, I help maintain several open-source sample AI applications, like our popular RAG chat demo. Through that work, I’ve learned a lot about what makes LLM-powered apps feel fast, reliable, and responsive. One of the most important lessons: use an asynchronous backend framework. Concurrency is critical for LLM apps, which often juggle multiple API calls, database queries, and user requests at the same time. Without async, your app may spend most of its time waiting — blocking one user’s request while another sits idle. The need for concurrency Why? Let’s imagine we’re using a synchronous framework like Flask. We deploy that to a server with gunicorn and several workers. One worker receives a POST request to the "/chat" endpoint, which in turn calls the Azure OpenAI Chat Completions API. That API call can take several seconds to complete — and during that time, the worker is completely tied up, unable to handle any other requests. We could scale out by adding more CPU cores, workers, or threads, but that’s often wasteful and expensive. Without concurrency, each request must be handled serially: When your app relies on long, blocking I/O operations — like model calls, database queries, or external API lookups — a better approach is to use an asynchronous framework. With async I/O, the Python runtime can pause a coroutine that’s waiting for a slow response and switch to handling another incoming request in the meantime. With concurrency, your workers stay busy and can handle new requests while others are waiting: Asynchronous Python backends In the Python ecosystem, there are several asynchronous backend frameworks to choose from: Quart: the asynchronous version of Flask FastAPI: an API-centric, async-only framework (built on Starlette) Litestar: a batteries-included async framework (also built on Starlette) Django: not async by default, but includes support for asynchronous views All of these can be good options depending on your project’s needs. I’ve written more about the decision-making process in another blog post. As an example, let's see what changes when we port a Flask app to a Quart app. First, our handlers now have async in front, signifying that they return a Python coroutine instead of a normal function: async def chat_handler(): request_message = (await request.get_json())["message"] When deploying these apps, I often still use the Gunicorn production web server—but with the Uvicorn worker, which is designed for Python ASGI applications. Alternatively, you can run Uvicorn or Hypercorn directly as standalone servers. Asynchronous API calls To fully benefit from moving to an asynchronous framework, your app’s API calls also need to be asynchronous. That way, whenever a worker is waiting for an external response, it can pause that coroutine and start handling another incoming request. Let's see what that looks like when using the official OpenAI Python SDK. First, we initialize the async version of the OpenAI client: openai_client = openai.AsyncOpenAI( base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1", api_key=token_provider ) Then, whenever we make API calls with methods on that client, we await their results: chat_coroutine = await openai_client.chat.completions.create( deployment_id=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"], messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": request_message}], stream=True, ) For the RAG sample, we also have calls to Azure services like Azure AI Search. To make those asynchronous, we first import the async variant of the credential and client classes in the aio module: from azure.identity.aio import DefaultAzureCredential from azure.search.documents.aio import SearchClient Then, like with the OpenAI async clients, we must await results from any methods that make network calls: r = await self.search_client.search(query_text) By ensuring that every outbound network call is asynchronous, your app can make the most of Python’s event loop — handling multiple user sessions and API requests concurrently, without wasting worker time waiting on slow responses. Sample applications We’ve already linked to several of our samples that use async frameworks, but here’s a longer list so you can find the one that best fits your tech stack: Repository App purpose Backend Frontend azure-search-openai-demo RAG with AI Search Python + Quart React rag-postgres-openai-python RAG with PostgreSQL Python + FastAPI React openai-chat-app-quickstart Simple chat with Azure OpenAI models Python + Quart plain JS openai-chat-backend-fastapi Simple chat with Azure OpenAI models Python + FastAPI plain JS deepseek-python Simple chat with Azure AI Foundry models Python + Quart plain JSFrom Cloud to Chip: Building Smarter AI at the Edge with Windows AI PCs
As AI engineers, we’ve spent years optimizing models for the cloud, scaling inference, wrangling latency, and chasing compute across clusters. But the frontier is shifting. With the rise of Windows AI PCs and powerful local accelerators, the edge is no longer a constraint it’s now a canvas. Whether you're deploying vision models to industrial cameras, optimizing speech interfaces for offline assistants, or building privacy-preserving apps for healthcare, Edge AI is where real-world intelligence meets real-time performance. Why Edge AI, Why Now? Edge AI isn’t just about running models locally, it’s about rethinking the entire lifecycle: - Latency: Decisions in milliseconds, not round-trips to the cloud. - Privacy: Sensitive data stays on-device, enabling HIPAA/GDPR compliance. - Resilience: Offline-first apps that don’t break when the network does. - Cost: Reduced cloud compute and bandwidth overhead. With Windows AI PCs powered by Intel and Qualcomm NPUs and tools like ONNX Runtime, DirectML, and Olive, developers can now optimize and deploy models with unprecedented efficiency. What You’ll Learn in Edge AI for Beginners The Edge AI for Beginners curriculum is a hands-on, open-source guide designed for engineers ready to move from theory to deployment. Multi-Language Support This content is available in over 48 languages, so you can read and study in your native language. What You'll Master This course takes you from fundamental concepts to production-ready implementations, covering: Small Language Models (SLMs) optimized for edge deployment Hardware-aware optimization across diverse platforms Real-time inference with privacy-preserving capabilities Production deployment strategies for enterprise applications Why EdgeAI Matters Edge AI represents a paradigm shift that addresses critical modern challenges: Privacy & Security: Process sensitive data locally without cloud exposure Real-time Performance: Eliminate network latency for time-critical applications Cost Efficiency: Reduce bandwidth and cloud computing expenses Resilient Operations: Maintain functionality during network outages Regulatory Compliance: Meet data sovereignty requirements Edge AI Edge AI refers to running AI algorithms and language models locally on hardware, close to where data is generated without relying on cloud resources for inference. It reduces latency, enhances privacy, and enables real-time decision-making. Core Principles: On-device inference: AI models run on edge devices (phones, routers, microcontrollers, industrial PCs) Offline capability: Functions without persistent internet connectivity Low latency: Immediate responses suited for real-time systems Data sovereignty: Keeps sensitive data local, improving security and compliance Small Language Models (SLMs) SLMs like Phi-4, Mistral-7B, Qwen and Gemma are optimized versions of larger LLMs, trained or distilled for: Reduced memory footprint: Efficient use of limited edge device memory Lower compute demand: Optimized for CPU and edge GPU performance Faster startup times: Quick initialization for responsive applications They unlock powerful NLP capabilities while meeting the constraints of: Embedded systems: IoT devices and industrial controllers Mobile devices: Smartphones and tablets with offline capabilities IoT Devices: Sensors and smart devices with limited resources Edge servers: Local processing units with limited GPU resources Personal Computers: Desktop and laptop deployment scenarios Course Modules & Navigation Course duration. 10 hours of content Module Topic Focus Area Key Content Level Duration 📖 00 Introduction to EdgeAI Foundation & Context EdgeAI Overview • Industry Applications • SLM Introduction • Learning Objectives Beginner 1-2 hrs 📚 01 EdgeAI Fundamentals Cloud vs Edge AI comparison EdgeAI Fundamentals • Real World Case Studies • Implementation Guide • Edge Deployment Beginner 3-4 hrs 🧠 02 SLM Model Foundations Model families & architecture Phi Family • Qwen Family • Gemma Family • BitNET • μModel • Phi-Silica Beginner 4-5 hrs 🚀 03 SLM Deployment Practice Local & cloud deployment Advanced Learning • Local Environment • Cloud Deployment Intermediate 4-5 hrs ⚙️ 04 Model Optimization Toolkit Cross-platform optimization Introduction • Llama.cpp • Microsoft Olive • OpenVINO • Apple MLX • Workflow Synthesis Intermediate 5-6 hrs 🔧 05 SLMOps Production Production operations SLMOps Introduction • Model Distillation • Fine-tuning • Production Deployment Advanced 5-6 hrs 🤖 06 AI Agents & Function Calling Agent frameworks & MCP Agent Introduction • Function Calling • Model Context Protocol Advanced 4-5 hrs 💻 07 Platform Implementation Cross-platform samples AI Toolkit • Foundry Local • Windows Development Advanced 3-4 hrs 🏭 08 Foundry Local Toolkit Production-ready samples Sample applications (see details below) Expert 8-10 hrs Each module includes Jupyter notebooks, code samples, and deployment walkthroughs, perfect for engineers who learn by doing. Developer Highlights - 🔧 Olive: Microsoft's optimization toolchain for quantization, pruning, and acceleration. - 🧩 ONNX Runtime: Cross-platform inference engine with support for CPU, GPU, and NPU. - 🎮 DirectML: GPU-accelerated ML API for Windows, ideal for gaming and real-time apps. - 🖥️ Windows AI PCs: Devices with built-in NPUs for low-power, high-performance inference. Local AI: Beyond the Edge Local AI isn’t just about inference, it’s about autonomy. Imagine agents that: - Learn from local context - Adapt to user behavior - Respect privacy by design With tools like Agent Framework, Azure AI Foundry and Windows Copilot Studio, and Foundry Local developers can orchestrate local agents that blend LLMs, sensors, and user preferences, all without cloud dependency. Try It Yourself Ready to get started? Clone the Edge AI for Beginners GitHub repo, run the notebooks, and deploy your first model to a Windows AI PC or IoT devices Whether you're building smart kiosks, offline assistants, or industrial monitors, this curriculum gives you the scaffolding to go from prototype to production.The importance of streaming for LLM-powered chat applications
Thanks to the popularity of chat-based interfaces like ChatGPT and GitHub Copilot, users have grown accustomed to getting answers conversationally. As a result, thousands of developers are now deploying chat applications on Azure for their own specialized domains. To help developers understand how to build LLM-powered chat apps, we have open-sourced many chat app templates, like a super simple chat app and the very popular and sophisticated RAG chat app. All our templates support an important feature: streaming. At first glance, streaming might not seem essential. But users have come to expect it from modern chat experiences. Beyond meeting expectations, streaming can dramatically improve the time to first token — letting your frontend display words as soon as they’re generated, instead of making users wait seconds for a complete answer. How to stream from the APIs Most modern LLM APIs and wrapper libraries now support streaming responses — usually through a simple boolean flag or a dedicated streaming method. Let’s look at an example using the official OpenAI Python SDK. The openai package makes it easy to stream responses by passing a stream=True argument: completion_stream = openai_client.chat.completions.create( model="gpt-5-mini", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What does a product manager do?"}, ], stream=True, ) When stream is true, the return type is an iterable, so we can use a for loop to process each of the ChatCompletion chunk objects: for chunk in await completion_stream: content = event.choices[0].delta.content Sending stream from backend to frontend When building a web app, we need a way to stream data from the backend to the browser. A normal HTTP response won’t work here — it sends all the data at once, then closes the connection. Instead, we need a protocol that allows data to arrive progressively. The most common options are: WebSockets: A bidirectional channel where both client and server can send data at any time. Server-sent events: A one-way channel where the server continuously pushes events to the client over HTTP. Readable streams: An HTTP response with a Transfer-encoding header of "chunked", allowing the client to process chunks as they arrive. All of these could potentially be used for a chat app, and I myself have experimented with both server-sent events and readable streams. Behind the scenes, the ChatGPT API actually uses server-sent events, so you'll find code in the openai package for parsing that protocol. However, I now prefer using readable streams for my frontend to backend communication. It's the simplest code setup on both the frontend and backend, and it supports the POST requests that our apps are already sending. The key is to send chunks from the backend in NDJSON (newline-delimited JSON) format and parse them incrementally on the frontend. See my blog post on fetching JSON over streaming HTTP for Python and JavaScript example code. Achieving a word-by-word effect With all of that in place, we now have a frontend that reveals the model’s answer gradually — almost like watching it type in real time. But something still feels off! Despite our frontend receiving chunks of just a few tokens at a time, that UI tends to reveal entire sentences at once. Why does that happen? It turns out the browser is batching repaints. Instead of immediately re-rendering after each DOM update, it waits until it’s more efficient to repaint — a smart optimization in most cases, but not ideal for a streaming text effect. My colleague Steve Steiner explored several techniques to make the browser repaint more frequently. The most effective approach uses window.setTimeout() with a delay of 33 milliseconds for each chunk. While this adds a small overall delay, it stays well within a natural reading pace and produces a smooth, word-by-word reveal. See his PR for implementation details for a React codebase. With that change, our frontend now displays responses at the same granularity as the chat completions API itself — chunk by chunk: Streaming more of the process Many of our sample apps use RAG (Retrieval-Augmented Generation) pipelines that chain together multiple operations — querying data stores (like Azure AI Search), generating embeddings, and finally calling the chat completions API. Naturally, that chain takes longer than a single LLM call, so users may wait several seconds before seeing a response. One way to improve the experience is to stream more of the process itself. Instead of holding back everything until the final answer, the backend can emit progress updates as each step completes — keeping users informed and engaged. For example, your app might display messages like this sequence: Processing your question: "Can you suggest a pizza recipe that incorporates both mushroom and pineapples?" Generated search query "pineapple mushroom pizza recipes" Found three related results from our cookbooks: 1) Mushroom calzone 2) Pineapple ham pizza 3) Mushroom loaf Generating answer to your question... Sure! Here's a recipe for a mushroom pineapple pizza... Adding streamed progress like this makes your app feel responsive and alive, even while the backend is doing complex work. Consider experimenting with progress events in your own chat apps — a few simple updates can greatly improve user trust and engagement. Making it optional After all this talk about streaming, here’s one final recommendation: make streaming optional. Provide a setting in your frontend to disable streaming, and a corresponding non-streaming endpoint in your backend. This flexibility helps both your users and your developers: For users: Some may prefer (or require) a non-streamed experience for accessibility reasons, or simply to receive the full response at once. For developers: There are times when you’ll want to interact with the app programmatically — for example, using curl, requests, or automated tests — and a standard, non-streaming HTTP endpoint makes that much easier. Designing your app to gracefully support both modes ensures it’s inclusive, debuggable, and production-ready. Sample applications We’ve already linked to several of our sample apps that support streaming, but here’s a complete list so you can explore the one that best fits your tech stack: Repository App purpose Backend Frontend azure-search-openai-demo RAG with AI Search Python + Quart React rag-postgres-openai-python RAG with PostgreSQL Python + FastAPI React openai-chat-app-quickstart Simple chat with Azure OpenAI models Python + Quart plain JS openai-chat-backend-fastapi Simple chat with Azure OpenAI models Python + FastAPI plain JS deepseek-python Simple chat with Azure AI Foundry models Python + Quart plain JS Each of these repositories includes streaming support out of the box, so you can inspect real implementation details in both the frontend and backend. They’re a great starting point for learning how to structure your own LLM chat application — or for extending one of the samples to match your specific use case.¡Curso oficial y gratuito de GenAI y Python! 🚀
¿Quieres aprender a usar modelos de IA generativa en tus aplicaciones de Python?Estamos organizando una serie de nueve transmisiones en vivo, en inglés y español, totalmente dedicadas a la IA generativa. Vamos a cubrir modelos de lenguaje (LLMs), modelos de embeddings, modelos de visión, y también técnicas como RAG, function calling y structured outputs. Además, te mostraremos cómo construir Agentes y servidores MCP, y hablaremos sobre seguridad en IA y evaluaciones, para asegurarnos de que tus modelos y aplicaciones generen resultados seguros. 🔗 Regístrate para toda la serie. Además de las transmisiones en vivo, puedes unirte a nuestras office hours semanales en el AI Foundry Discord de para hacer preguntas que no se respondan durante el chat. ¡Nos vemos en los streams! 👋🏻 Here’s your HTML converted into clean, readable text format (perfect for a newsletter, blog post, or social media caption): Modelos de Lenguaje 📅 7 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡Únete a la primera sesión de nuestra serie de Python + IA! En esta sesión, hablaremos sobre los Modelos de Lenguaje (LLMs), los modelos que impulsan ChatGPT y GitHub Copilot. Usaremos Python para interactuar con LLMs utilizando paquetes como el SDK de OpenAI y Langchain. Experimentaremos con prompt engineering y ejemplos few-shot para mejorar los resultados. También construiremos una aplicación full stack impulsada por LLMs y explicaremos la importancia de la concurrencia y el streaming en apps de IA orientadas al usuario. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Embeddings Vectoriales 📅 8 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la segunda sesión de Python + IA, exploraremos los embeddings vectoriales, una forma de codificar texto o imágenes como arrays de números decimales. Estos modelos permiten realizar búsquedas por similitud en distintos tipos de contenido. Usaremos modelos como la serie text-embedding-3 de OpenAI, visualizaremos resultados en Python y compararemos métricas de distancia. También veremos cómo aplicar cuantización y cómo usar modelos multimodales de embedding. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Recuperación-Aumentada Generación (RAG) 📅 9 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la tercera sesión, exploraremos RAG, una técnica que envía contexto al LLM para obtener respuestas más precisas dentro de un dominio específico. Usaremos distintas fuentes de datos —CSVs, páginas web, documentos, bases de datos— y construiremos una app RAG full-stack con Azure AI Search. Modelos de Visión 📅 14 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡La cuarta sesión trata sobre modelos de visión como GPT-4o y 4o-mini! Estos modelos pueden procesar texto e imágenes, generando descripciones, extrayendo datos, respondiendo preguntas o clasificando contenido. Usaremos Python para enviar imágenes a los modelos, crear una app de chat con imágenes e integrarlos en flujos RAG. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Salidas Estructuradas 📅 15 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la quinta sesión aprenderemos a hacer que los LLMs generen respuestas estructuradas según un esquema. Exploraremos el modo structured outputs de OpenAI y cómo aplicarlo para extracción de entidades, clasificación y flujos con agentes. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Calidad y Seguridad 📅 16 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la sexta sesión hablaremos sobre cómo usar IA de manera segura y evaluar la calidad de las salidas. Mostraremos cómo configurar Azure AI Content Safety, manejar errores en código Python y evaluar resultados con el SDK de Evaluación de Azure AI. Tool Calling 📅 21 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la última semana de la serie, nos enfocamos en tool calling (function calling), la base para construir agentes de IA. Aprenderemos a definir herramientas en Python o JSON, manejar respuestas de los modelos y habilitar llamadas paralelas y múltiples iteraciones. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Agentes de IA 📅 22 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡En la penúltima sesión construiremos agentes de IA! Usaremos frameworks como Langgraph, Semantic Kernel, Autogen, y Pydantic AI. Empezaremos con ejemplos simples y avanzaremos a arquitecturas más complejas como round-robin, supervisor, graphs y ReAct. Model Context Protocol (MCP) 📅 23 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor Cerramos la serie con Model Context Protocol (MCP), la tecnología abierta más candente de 2025. Aprenderás a usar FastMCP para crear un servidor MCP local y conectarlo a chatbots como GitHub Copilot. También veremos cómo integrar MCP con frameworks de agentes como Langgraph, Semantic Kernel y Pydantic AI. Y, por supuesto, hablaremos sobre los riesgos de seguridad y las mejores prácticas para desarrolladores. ¿Querés que lo reformatee para publicación en Markdown (para blogs o repos) o en texto plano con emojis y separadores estilo redes sociales?Level up your Python Gen AI Skills from our free nine-part YouTube series!
Want to learn how to use generative AI models in your Python applications? We're putting on a series of nine live streams, in both English and Spanish, all about generative AI. We'll cover large language models, embedding models, vision models, introduce techniques like RAG, function calling, and structured outputs, and show you how to build Agents and MCP servers. Plus we'll talk about AI safety and evaluations, to make sure all your models and applications are producing safe outputs. 🔗 Register for the entire series. In addition to the live streams, you can also join a weekly office hours in our AI Discord to ask any questions that don't get answered in the chat. You can also scroll down to learn about each live stream and register for individual sessions. See you in the streams! 👋🏻 Large Language Models 7 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Join us for the first session in our Python + AI series! In this session, we'll talk about Large Language Models (LLMs), the models that power ChatGPT and GitHub Copilot. We'll use Python to interact with LLMs using popular packages like the OpenAI SDK and Langchain. We'll experiment with prompt engineering and few-shot examples to improve our outputs. We'll also show how to build a full stack app powered by LLMs, and explain the importance of concurrency and streaming for user-facing AI apps. Vector embeddings 8 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our second session of the Python + AI series, we'll dive into a different kind of model: the vector embedding model. A vector embedding is a way to encode a text or image as an array of floating point numbers. Vector embeddings make it possible to perform similarity search on many kinds of content. In this session, we'll explore different vector embedding models, like the OpenAI text-embedding-3 series, with both visualizations and Python code. We'll compare distance metrics, use quantization to reduce vector size, and try out multimodal embedding models. Retrieval Augmented Generation 9 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our fourth Python + AI session, we'll explore one of the most popular techniques used with LLMs: Retrieval Augmented Generation. RAG is an approach that sends context to the LLM so that it can provide well-grounded answers for a particular domain. The RAG approach can be used with many kinds of data sources like CSVs, webpages, documents, databases. In this session, we'll walk through RAG flows in Python, starting with a simple flow and culminating in a full-stack RAG application based on Azure AI Search. Vision models 14 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Our third stream in the Python + AI series is all about vision models! Vision models are LLMs that can accept both text and images, like GPT 4o and 4o-mini. You can use those models for image captioning, data extraction, question-answering, classification, and more! We'll use Python to send images to vision models, build a basic chat-on-images app, and build a multimodal search engine. Structured outputs 15 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our fifth stream of the Python + AI series, we'll discover how to get LLMs to output structured responses that adhere to a schema. In Python, all we need to do is define a @dataclass or a Pydantic BaseModel, and we get validated output that meets our needs perfectly. We'll focus on the structured outputs mode available in OpenAI models, but you can use similar techniques with other model providers. Our examples will demonstrate the many ways you can use structured responses, like entity extraction, classification, and agentic workflows. Quality and safety 16 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Now that we're more than halfway through our Python + AI series, we're covering a crucial topic: how to use AI safely, and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. Our focus will be on Azure tools that make it easier to put safe AI systems into production. We'll show how to configure the Azure AI Content Safety system when working with Azure AI models, and how to handle those errors in Python code. Then we'll use the Azure AI Evaluation SDK to evaluate the safety and quality of the output from our LLM. Tool calling 21 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Now that we're more than halfway through our Python + AI series, we're covering a crucial topic: how to use AI safely, and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. Our focus will be on Azure tools that make it easier to put safe AI systems into production. We'll show how to configure the Azure AI Content Safety system when working with Azure AI models, and how to handle those errors in Python code. Then we'll use the Azure AI Evaluation SDK to evaluate the safety and quality of the output from our LLM. AI agents 22 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor For the penultimate session of our Python + AI series, we're building AI agents! We'll use many of the most popular Python AI agent frameworks: Langgraph, Semantic Kernel, Autogen, Pydantic AI, and more. Our agents will start simple and then ramp up in complexity, demonstrating different architectures like hand-offs, round-robin, supervisor, graphs, and ReAct. Model Context Protocol 23 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In the final session of our Python + AI series, we're diving into the hottest technology of 2025: MCP, Model Context Protocol. This open protocol makes it easy to extend AI agents and chatbots with custom functionality, to make them more powerful and flexible. We'll show how to use the official Python FastMCP SDK to build an MCP server running locally and consume that server from chatbots like GitHub Copilot. Then we'll build our own MCP client to consume the server. Finally, we'll discover how easy it is to point popular AI agent frameworks like Langgraph, Pydantic AI, and Semantic Kernel at MCP servers. With great power comes great responsibility, so we will briefly discuss the many security risks that come with MCP, both as a user and developer.Use Copilot and MCP to query Microsoft Learn Docs
Are you ready to take your Azure development workflow to the next level? In this post, we’ll walk through how to use GitHub Copilot in Agent Mode—paired with MCP (Model Context Protocol) servers—to get trusted, grounded answers from Microsoft Learn Docs, right inside your coding workspace. Whether you’re tired of switching tabs to search documentation or want to ensure your AI assistant’s answers are always accurate, this guide will show you how to streamline your workflow and boost your productivity.Join Us for a Technical Deep Dive and Q&A on Foundry Local - LLMs on device
Join us for an Ask Me Anything with the Foundry Local team on October 14th, 2025! Discover how Foundry Local is redefining edge AI with powerful features like on-device inference, enabling you to run models directly on your hardware, cutting costs and keeping your data secure. Whether you're customizing models to fit unique use cases or integrating seamlessly via SDKs, APIs, or CLI, Foundry Local offers scalable pathways to Azure AI Foundry as your needs evolve. It's the perfect solution for environments with limited connectivity, sensitive data requirements, low-latency demands, or early-stage experimentation before cloud deployment. If you're building smarter, leaner, and more private AI workflows, this AMA is your chance to dive deep with the team behind it all. What is Foundry Local? Foundry Local is a set of development tools designed to help you build and evaluate LLM applications on your local machine. It provides a curated collection of production-quality tools, including evaluation and prompt engineering capabilities, that are fully compatible with Azure AI. This allows for a seamless transition of your work from your local environment to the cloud. Don't miss this opportunity to connect with our experts and enhance your understanding of local LLM development. Foundry Local is an on-device AI inference solution offering performance, privacy, customization, and cost advantages. It integrates seamlessly into your existing workflows and applications through an intuitive CLI, SDK, and REST API. Key features On-Device Inference: Run models locally on your own hardware, reducing your costs while keeping all your data on your device. Model Customization: Select from preset models or use your own to meet specific requirements and use cases. Cost Efficiency: Eliminate recurring cloud service costs by using your existing hardware, making AI more accessible. Seamless Integration: Connect with your applications through an SDK, API endpoints, or the CLI, with easy scaling to Azure AI Foundry as your needs grow. How to Join: Register to Join the Azure AI Foundry Discord Community Event 14th Oct 2025 9am Pacific Time UTC−08:00 Unlock Accelerated Local LLM Development Discover how Foundry Local can enhance your development process and explore the possibilities for building robust LLM applications. Whether you're a seasoned AI developer or just getting started, this session is your chance to get hands-on insights into the innovative world of Azure AI Foundry. Event Highlights: An in-depth overview of the Foundry Local CLI and SDK. Interactive demo with step-by-step examples. Best practices for local AI Inference and models Transitioning your local development to cloud solutions or vice-versa Why Attend? Gain expert insights into Foundry Local, and ask questions about using Foundry Local Network with fellow AI professionals and developers in the Azure AI Foundry community. Enhance your AI development skills with practical examples. Stay at the forefront of LLM application development. Speakers Product Manager Foundry Local Maanav Dalal Product Manager |Foundry Local Microsoft Maanav Dalal is a PM on the AI Frameworks team. He's super inquisitive about the ways you use AI in daily life, so be encouraged to strike up a conversation with him about that. LinkedIn ProfileChaining and Streaming with Responses API in Azure
Responses API is an enhancement of the existing Chat Completions API. It is stateful and supports agentic capabilities. As a superset of the Chat Completions class, it continues to support functionality of chat completions. In addition, reasoning models, like GPT-5 result in better model intelligence when compared to Chat Completions. It has input flexibility, supporting a range of input types. It is currently available in the following regions on Azure and can be used with all the models available in the region. The API supports response streaming, chaining and also function calling. In the examples below, we use the gpt-5-nano model for a simple response, a chained response and streaming responses. To get started update the installed openai library. pip install --upgrade openai Simple Message 1) Build the client with the following code from openai import OpenAI client = OpenAI( base_url=endpoint, api_key=api_key, ) 2) The response received is an id which can then be used to retrieve the message. # Non-streaming request resp_id = client.responses.create( model=deployment, input=messages, ) 3) Message is retrieved using the response id from previous step response = client.responses.retrieve(resp_id.id) Chaining For a chained message, an extra step is sharing the context. This is done by sending the response id in the subsequent requests. resp_id = client.responses.create( model=deployment, previous_response_id=resp_id.id, input=[{"role": "user", "content": "Explain this at a level that could be understood by a college freshman"}] ) Streaming A different function call is used for streaming queries. client.responses.stream( model=deployment, input=messages, # structured messages ) In addition, the streaming query response has to be handled appropriately till end of event stream for event in s: # Accumulate only text deltas for clean output if event.type == "response.output_text.delta": delta = event.delta or "" text_out.append(delta) # Echo streaming output to console as it arrives print(delta, end="", flush=True) The code is available in the following github link - https://github.com/arunacarunac/ResponsesAPI Additional details are available in the following links - Azure OpenAI Responses API - Azure OpenAI | Microsoft Learn109Views0likes0CommentsUBS unlocks advanced AI techniques with PostgreSQL on Azure
This blog was authored by Jay Yang, Executive Director, and Orhun Oezbek, GenAI Architect, UBS RiskLab UBS Group AG is a multinational investment bank and world-leading asset manager that manages $5.7 trillion in assets across 15 different markets. We continue to evolve our tools to suit the needs of data scientists and to integrate the use of AI. Our UBS RiskLab data science platform helps over 1,200 UBS data scientists expedite development and deployment of their analytics and AI solutions, which support functions such as risk, compliance, and finance, as well as front-office divisions such as investment banking and wealth management. RiskLab and UBS GOTO (Group Operations and Technology Office) have a long-term AI strategy to provide a scalable and easy-to-use AI platform. This strategy aims to remove friction and pain points for users, such as developers and data scientists, by introducing DevOps automation, centralized governance and AI service simplification. These efforts have significantly democratized AI development for our business users. This blog walks through how we created two RiskLab products using Azure services. We also explain how we’re using Azure Database for PostgreSQL to power advanced Retrieval Augmented-Generation (RAG) techniques—such as new vector search algorithms, parameter tuning, hybrid search, semantic ranking, and a graphRAG approach—to further the work of our financial generative AI use cases. The RiskLab AI Common Ecosystem (AICE) provides fully governed and simplified generative AI platform services, including: Governed production data access for AI development Managed large language model (LLM) endpoints access control Tenanted RAG environments Enhanced document insight AI processing Streamlined AI agent standardization, development, registration, and deployment solutions End-to-end machine learning (ML) model continuous integration, training, deployment, and monitoring processes The AICE Vector Embedding Governance Application (VEGA) is a fully governed and multi-tenant vector store built on top of Azure Database for PostgreSQL that provides self-service vector store lifecycle management and advanced indexing and retrieval techniques for financial RAG use cases. A focus on best practices like AIOps and MLOps As generative AI gained traction in 2023, we noticed the need for a platform that simplified the process for our data scientists to build, test, and deploy generative AI applications. In this age of AI, the focus should be on data science best practices—GenAIOps and MLOps. Most of our data scientists aren’t fully trained on MLOps, GenAIOps, and setting up complex pipelines, so AICE was designed to provide automated, self-serve DevOps provisioning of the Azure resources they need, as well as simplified MLOps and AIOps pipelines libraries. This removes operational complexities from their workflows. The second reason for AICE was to make sure our data scientists were working in fully governed environments that comply with data privacy regulations from the multiple countries in which UBS operates. To meet that need, AICE provides a set of generative AI libraries that fully manages data governance and reduces complexity. Overall, AICE greatly simplifies the work for our data scientists. For instance, the platform provides managed Azure LLM endpoints, MLflow for generative AI evaluation, and AI agent deployment pipelines along with their corresponding Python libraries. Without going into the nitty gritty of setting up a new Azure subscription, managing MLFlow instances, and navigating Azure Kubernetes Service (AKS) deployments, data scientists can just write three lines of code to obtain a fully governed and secure generative AI ecosystem to manage their entire application lifecycle. And, as a governed, secure lab environment, they can also develop and prototype ML models and generative AI applications in the production tier. We found that providing production read-only datasets to build these models significantly expedites our AI development. In fact, the process for developing an ML model, building a pipeline for model training, and putting it into production has dropped from six months to just one month. Azure Database for PostgreSQL and pgvector: The best of both worlds for relational and vector databases Once AICE adoption ramped up, our next step was to develop a comprehensive, flexible vector store that would simplify vector store resource provisioning while supporting hundreds of RAG use cases and tenants across both lab and production environments. Essentially, we needed to create RAG as a Service (RaaS) so our data scientists could build custom AI solutions in a self-service manner. When we started building VEGA and this vector store, we anticipated that effective RAG would require a diverse range of search capabilities covering not only vector searches but also more traditional document searches or even relational queries. Therefore, we needed a database that could pivot easily. We were looking for a really flexible relational database and decided on Azure Database for PostgreSQL. For a while, Azure Database for PostgreSQL has been our go-to database at RiskLab for our structured data use cases because it’s like the Swiss Army Knife of databases. It’s very compact and flexible, and we have all the tools we need in a single package. Azure Database for PostgreSQL offers excellent relational queries and JSONB document search. When used in conjunction with the pgvector extension for vector search, we created some very powerful hybrid search and hierarchical search RAG functionalities for our end users. The relational nature of Azure Database for PostgreSQL also allowed us to build a highly regulated authorization and authentication mechanism that makes it easy and secure for data scientists to share their embeddings. This involved meeting very stringent access control policies so that users’ access to vector stores is on a need-to-know basis. Integrations with the Azure Graph API help us manage those identities and ensure that the environment is fully secure. Using VEGA, data scientists can just click a button to add a user or group and provide access to all their embeddings/documents. It’s very easy, but it’s also governed and highly regulated. Speeding vector store initialization from days to seconds With VEGA, the time it takes to provision a vector store has dropped from days to less than 30 seconds. Instead of waiting days on a request for new instances of Azure Database for PostgreSQL, pgvector, and Azure AI Search, data scientists can now simply write five lines of code to stand up virtual, fully governed, and secure collections. And the same is true for agentic deployment frameworks. This speed is critical for lab work that involves fast iterations and experiments. And because we built on Azure Database for PostgreSQL, a single instance of VEGA can support thousands of vector stores. It’s cost-effective and seamlessly scales. Creating a hybrid search to analyze thousands of documents Since launching VEGA, one of the top hybrid search use cases has been Augmented Indexing Search (AIR Search), allowing data scientists to comb through financial documents and pinpoint the correct sections and text. This search uses LLMs as agents that first filter based on metadata stored in JSONB columns of the Azure Database for PostgreSQL, then apply vector similarity retrieval. Our thousands of well-structured financial documents are built with hierarchical headers that act as metadata, providing a filtering mechanism for agents and allowing them to retrieve sections in our documents to find precisely what they’re looking for. Because these agents are autonomous, they can decide on the best tools to use for the situation—either metadata filtering or vector similarity search. As a hybrid search, this approach also minimizes AI hallucinations because it gives the agents more context to work with. To enable this search, we used ChatGPT and Azure OpenAI. But because most of our financial documents are saved as PDFs, the challenge was retaining hierarchical information from headers that were lost when simply dumping in text from PDFs. We also had to determine how to make sure ChatGPT understood the meaning behind aspects like tables and figures. As a solution, we created PNG images of PDF pages and told ChatGPT to semantically chunk documents by titles and headers. And if it came across a table, we asked it to provide a YAML or JSON representation of it. We also asked ChatGPT to interpret figures to extract information, which is an important step because many of our documents contain financial graphs and charts. We’re now using Azure AI Document Intelligence for layout detection and section detection as the first step, which simplified our document ingestion pipelines significantly. Forecasting economic implications with PostgreSQL Graph Extension Since creating AICE and VEGA using Azure services, we’ve significantly enhanced our data science workflows. We’ve made it faster and easier to develop generative AI applications thanks to the speed and flexibility of Azure Database for PostgreSQL. Making advanced AI features accessible to our data scientists has accelerated innovation in RiskLab and ultimately allowed UBS to deliver exceptional value to our customers. Looking ahead, we plan to use the Apache AGE graph extension in Azure Database for PostgreSQL for macroeconomics knowledge retention capabilities. Specifically, we’re considering Azure tooling such as GraphRAG to equip UBS economist and portfolio managers with advanced RAG capabilities. This will allow them to retrieve more coherent RAG search results for use cases such as economics scenario generation and impact analysis, as well as investment forecasting and decision-making. For instance, a UBS business user will be able to ask an AI agent: if a country’s interest rate increases by a certain percentage, what are the implications to my client’s investment portfolio? The agent can perform a graph search to obtain all other connected economic entity nodes that might be affected by the interest rate entity node in the graph. We anticipate the AI-assisted graph knowledge will gain significant traction in the financial industry. Learn more For a deeper dive on how we created AICE and VEGA, check out this on-demand session from Ignite. We talk through our use of Azure Database for PostgreSQL and pgvector, plus we show a demo of our GraphRAG capabilities. About Azure Database for PostgreSQL Azure Database for PostgreSQL is a fully managed, scalable, and secure relational database service that supports open-source PostgreSQL. It enables organizations to build and manage mission-critical applications with high availability, built-in security, and automated maintenance.Leveraging the power of NPU to run Gen AI tasks on Copilot+ PCs
Thanks to their massive scale and impressive technical evolution, large language models (LLMs) have become the public face of Generative AI innovation. However, bigger isn’t always better. While LLMs like the ones behind Microsoft Copilot are incredibly capable at a wide range of tasks, less-discussed small language models (SLMs) expand the utility of Gen AI for real-time and edge applications. SLMs can run efficiently on a local device with low power consumption and fast performance, enabling new scenarios and cost models. SLMs can run on universally available chips like CPUs and GPUs, but their potential really comes alive running on Neural Processing Units (NPUs), such as the ones found in Microsoft Surface Copilot+ PCs. NPUs are specifically designed for processing machine learning workloads, leading to high performance per watt and thermal efficiency compared to CPUs or GPUs [1]. SLMs and NPUs together support running quite powerful Gen AI workloads efficiently on a laptop, even when running on battery power or multitasking. In this blog, we focus on running SLMs on Snapdragon® X Plus processors on the recently launched Surface Laptop 13-inch, using the Qualcomm® AI Hub, leading to efficient local inference, increased hardware utilization and minimal setup complexity. This is only one of many methods available - before diving into this specific use case, let’s first provide an overview of the possibilities for deploying small language models on Copilot+ PC NPUs. Qualcomm AI Engine Direct (QNN) SDK: This process requires converting SLMs into QNN binaries that can be executed through the NPU. The Qualcomm AI Hub provides a convenient way to compile any PyTorch, TensorFlow, or ONNX-converted models into QNN binaries executable by the Qualcomm AI Engine Direct SDK. Various precompiled models are directly available in the Qualcomm AI Hub, their collection of over 175 pre-optimized models, ready for download and integration into your application. ONNX Runtime: ONNX Runtime is an open-source inference engine from Microsoft designed to run models in the ONNX format. The QNN Execution Provider (EP) by Qualcomm Technologies optimizes inference on Snapdragon processors using AI acceleration hardware, mainly for mobile and embedded use. ONNX Runtime Gen AI is a specialized version optimized for generative AI tasks, including transformer-based models, aiming for high-performance inference in applications like large language models. Although ONNX Runtime with QNN EP can run models on Copilot+ PCs, some operator support is missing for Gen AI workloads. ONNX Runtime Gen AI is not yet publicly available for NPU; a private beta is currently out with an unclear ETA on public release at the time of releasing this blog. Here is the link to the Git repo for more info on upcoming releases microsoft/onnxruntime-genai: Generative AI extensions for onnxruntime Windows AI Foundry: Windows AI Foundry provides AI-supported features and APIs for Copilot+ PCs. It includes pre-built models such as Phi-Silica that can be inferred using Windows AI APIs. Additionally, it offers the capability to download models from the cloud for local inference on the device using Foundry Local. This feature is still in preview. You can learn more about Windows AI Foundry here: Windows AI Foundry | Microsoft Developer AI Toolkit for VS Code: The AI Toolkit for Visual Studio Code (VS Code) is a VS Code extension that simplifies generative AI app development by bringing together cutting-edge AI development tools and models from the Azure AI Foundry catalog and other catalogs like Hugging Face. This platform allows users to download multiple models either from the cloud or locally. It currently houses several models optimized to run on CPU, with support for NPU-based models forthcoming, starting with Deepseek R1. Comparison between different approaches Feature Qualcomm AI Hub ONNX Runtime (ORT) Windows AI Foundry AI Toolkit for VS code Availability of Models Wide set of AI models (vision, Gen AI, object detection, and audio). Any models can be integrated. NPU support for Gen AI tasks and ONNX Gen AI Runtime are not yet generally available. Phi Silica model is available through Windows AI APIs, additional AI models from cloud can be downloaded for local inference using Foundry Local Access to models from sources such as Azure AI Foundry and Hugging Face. Currently only supports Deepseek R1 and Phi 4 Mini models for NPU inference. Ease of development The API is user-friendly once the initial setup and end-to-end replication are complete. Simple setup, developer-friendly; however, limited support for custom operators means not all models deploy through ORT. Easiest framework to adopt—developers familiar with Windows App SDK face no learning curve. Intuitive interface for testing models via prompt-response, enabling quick experimentation and performance validation. Is processor or SoC independent No. Supports Qualcomm Technologies processors only. Models must be compiled and optimized for the specific SOC on the device. A list of supported chipsets is provided, and the resulting .bin files are SOC-specific. Limitations exist with QNN EP’s HTP backend: only quantized models and those with static shapes are currently supported. Yes. The tool can operate independently of SoC. It is part of the broader Windows Copilot Runtime framework, now rebranded as the Windows AI Foundry. Model-dependent. Easily deployable on-device; model download and inference are straightforward. As of writing this article and based on our team's research, we found Qualcomm AI Hub to be the most user-friendly and well-supported solution available at this time. In contrast, most other frameworks are still under development and not yet generally available. Before we dive into how to use Qualcomm AI Hub to run Small Language Models (SLMs), let’s first understand what Qualcomm AI Hub is. What is Qualcomm AI Hub? Qualcomm AI Hub is a platform designed to simplify the deployment of AI models for vision, audio, speech, and text applications on edge devices. It allows users to upload, optimize, and validate their models for specific target hardware—such as CPU, GPU, or NPU—within minutes. Models developed in PyTorch or ONNX are automatically converted for efficient on-device execution using frameworks like TensorFlow Lite, ONNX Runtime, or Qualcomm AI Engine Direct. The Qualcomm AI Hub offers access to a collection of over 100 pre-optimized models, with open-source deployment recipes available on GitHub and Hugging Face. Users can also test and profile these models on real devices with Snapdragon and Qualcomm platforms hosted in the cloud. In this blog we will be showing how you can use Qualcomm AI Hub to get a QNN context binary for models and use Qualcomm AI Engine to run those context binaries. The context binary is a SoC-specific deployment mechanism. When compiled for a device, it is expected that the model will be deployed to the same device. The format is operating system agnostic so the same model can be deployed on Android, Linux, or Windows. The context binary is designed only for the NPU. For more details on how to compile models in other formats, please visit the documentation here Overview of Qualcomm AI Hub — qai-hub documentation. The following case study details the efficient execution of the Phi-3.5 model using optimized, hardware-specific binaries on a Surface Laptop 13-inch powered by the Qualcomm Snapdragon X Plus processor, Hexagon™ NPU, and Qualcomm Al Hub. Microsoft Surface Engineering Case Study: Running Phi-3.5 Model Locally on Snapdragon X Plus on Surface Laptop 13-inch This case study details how the Phi-3.5 model was deployed on a Surface Laptop 13-inch powered by the Snapdragon X Plus processor. The study was developed and documented by the Surface DASH team, which specializes in delivering AI/ML solutions to Surface devices and generating data-driven insights through advanced telemetry. Using Qualcomm AI Hub, we obtained precompiled QNN context binaries tailored to the target SoC, enabling efficient local inference. This method maximizes hardware utilization and minimizes setup complexity. We used a Surface Laptop 13-inch with the Snapdragon X Plus processor as our test device. The steps below apply to the Snapdragon X Plus processor; however, the process remains similar for other Snapdragon X Series processors and devices as well. For the other processors, you may need to download different model variants of the desired models from Qualcomm AI Hub. Before you begin to follow along, please check the make and models of your NPU by navigating to Device Manager --> Neural Processors. We also used Visual Studio Code and Python (3.10.3.11, 3.12). We used the 3.11 version to run these steps below and recommend using the same, although there should be no difference in using a higher Python version. Before starting, let's create a new virtual environment in Python as a best practice. Follow the steps to create a new virtual environment here: https://code.visualstudio.com/docs/python/environments?from=20423#_creating-environments Create a folder named ‘genie_bundle’ store config and bin files. Download the QNN context binaries specific to your NPU and place the config files into the genie_bundle folder. Copy the .dll files from QNN SDK into the genie_bundle folder. Finally, execute the test prompt through genie-sdk in the required format for Phi-3.5. Setup steps in details Step 1: Setup local development environment Download QNN SDK: Go to the Qualcomm Software Center Qualcomm Neural Processing SDK | Qualcomm Developer and download the QNN SDK by clicking on Get Software (by default latest version of SDK gets downloaded). For the purpose of this demo, we used latest version available (2.34) . You may need to make an account on the Qualcomm website to access it. Step 2: Download QNN Context Binaries from Qualcomm AI Hub Models Download Binaries: Download the context binaries (.bin files) for the Phi-3.5-mini-instruct model from (Link to Download Phi-3.5 context binaries). Clone AI Hub Apps repo: Use the Genie SDK (Generative Runtime built on top of Qualcomm AI Direct Engine), and leverage the sample provided in https://github.com/quic/ai-hub-apps Setup folder structure to follow along the code: Create a folder named "genie_bundle" outside of the folder where AI Hub Apps repo was cloned. Selectively copy configuration files from AI Hub sample repo to 'genie_bundle' Step 3: Copy config files and edit files Copy config files to genie_bundle folder from ai-hub-apps. You will need two config files. You can use the PowerShell script below to copy the config files from repo to local genie folder created in previous steps. You also need to copy HTP backend config file as well as the genie config file from the repo # Define the source paths $sourceFile1 = "ai-hub-apps/tutorials/llm_on_genie/configs/htp/htp_backend_ext_config.json.template" $sourceFile2 = "ai-hub-apps/tutorials/llm_on_genie/configs/genie/phi_3_5_mini_instruct.json" # Define the local folder path $localFolder = "genie_bundle" # Define the destination file paths using the local folder $destinationFile1 = Join-Path -Path $localFolder -ChildPath "htp_backend_ext_config.json" $destinationFile2 = Join-Path -Path $localFolder -ChildPath "genie_config.json" # Create the local folder if it doesn't exist if (-not (Test-Path -Path $localFolder)) { New-Item -ItemType Directory -Path $localFolder } # Copy the files to the local folder Copy-Item -Path $sourceFile1 -Destination $destinationFile1 -Force Copy-Item -Path $sourceFile2 -Destination $destinationFile2 -Force Write-Host "Files have been successfully copied to the genie_bundle folder with updated names." After copying the files, you will need to make sure to change the default values of the parameters provided with template files copied. Edit HTP backend file in the newly pasted location - Change dsp_arch and soc_model to match with your configuration pdate soc model and dsp arch in HTP backend config files Edit genie_config file to include the downloaded binaries for Phi 3 models in previous steps Step 4: Download the tokenizer file from Hugging Face Visit the Hugging Face Website: Open your web browser and go to https://huggingface.co/microsoft/Phi-3.5-mini-instruct/tree/main Locate the Tokenizer File: On the Hugging Face page, find the tokenizer file for the Phi-3.5-mini-instruct model Download the File: Click on the download button to save the tokenizer file to your computer Save the File: Navigate to your genie_bundle folder and save the downloaded tokenizer file there. Note: There is an issue with the tokenizer.json file for the Phi 3.5 mini instruct model, where the output does not break words using spaces. To resolve this, you need to delete lines #192-197 in the tokenizer.json file. Download tokenizer files from the hugging face repo (Image Source - Hugging Face) Step 5: Copy files from QNN SDK Locate the QNN SDK Folder: Open the folder where you have installed the QNN SDK in step 1 and identify the required files. You need to copy the files from the below mentioned folder. Exact folder naming may change based on SDK version <QNN-SDK ROOT FOLDER>/qairt/2.34.0.250424/lib/hexagon-v75/unsigned <QNN-SDK ROOT FOLDER> /qairt/2.34.0.250424/lib/aarch64-windows-msvc <QNN-SDK ROOT FOLDER> /qairt/2.34.0.250424/bin/aarch64-windows-msvc Navigate to your genie_bundle folder and paste the copied files there. Step 6: Execute the Test Prompt Open Your Terminal: Navigate to your genie_bundle folder using your terminal or command prompt. Run the Command: Copy and paste the following command into your terminal: ./genie-t2t-run.exe -c genie_config.json -p "<|system|>\nYou are an assistant. Provide helpful and brief responses.\n<|user|>What is an NPU? \n<|end|>\n<|assistant|>\n" Check the Output: After running the command, you should see the response from the assistant in your terminal. This case study demonstrates the process of deploying a small language model (SLM) like Phi-3.5 on a Copilot+ PC using the Hexagon NPU and Qualcomm AI Hub. It outlines the setup steps, tooling, and configuration required for local inference using hardware-specific binaries. As deployment methods mature, this approach highlights a viable path toward efficient, scalable Gen AI execution directly on edge devices. Snapdragon® and Qualcomm® branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm, Snapdragon and Hexagon™ are trademarks or registered trademarks of Qualcomm Incorporated.3.2KViews7likes2Comments