ai
1278 TopicsResource Guide: Making Physical AI Practical for Real‑World Industrial Operations
Microsoft’s adaptive cloud approach enables organizations to turn operational technology (OT) data into intelligent actions, autonomously, without requiring everything to live in the cloud by unifying cloud-to-edge management plane, data plane, and intelligence platform. At the center of this approach are key foundational technologies: Key Purpose Offering Direct-to-cloud device management + telemetry ingestion Azure IoT Hub Industrial connectivity + edge data plane Azure IoT Operations Unified analytics + real-time intelligence Microsoft Fabric On-device AI inferencing runtime Foundry Local Microsoft Azure IoT Gartner winner: Microsoft named a Leader in the 2025 Gartner® Magic Quadrant™ for Global Industrial IoT Platforms See it all come together Before diving into each component, watch this end-to-end demo showing how Azure IoT Operations, Azure IoT Hub, Microsoft Fabric, and Foundry Local work as one stack across the edge-to-cloud lifecycle - Making industrial AI practical for real-world operations with adaptive cloud. How these components work together Azure IoT Operations and Azure IoT Hub collect real-time data from operational assets and send semantically-ready, modeled data to Microsoft Fabric, where it's contextualized with enterprise data for downstream analytics. Microsoft Foundry extends to the edge through Foundry Local, so the same tooling used to deploy and manage AI models in the cloud applies to edge use cases. All of it integrates into Azure Resource Manager, bringing OT devices, assets, and edge AI models into the same management and security paradigm as every other Azure-managed resource. This blog walks through where to get started with each product capability: 1. Manage Cloud-Connected Devices and Telemetry with Azure IoT Hub Azure IoT Hub is a fully managed cloud service that enables secure bidirectional communication, device-to-cloud telemetry ingestion, cloud-to-device command execution, per-device authentication, remote management and more. Telemetry from IoT Hub can also be routed downstream into analytics platforms like Microsoft Fabric for visualization or AI modeling. Recommended Usage: Devices that utilize IoT Hub are distributed, stand-alone devices with fixed-functions. These devices typically do not require cloud-managed containerized workloads or cloud-managed proximal industrial protocol connectivity. Examples of appropriate device-to-cloud IoT Hub endpoint devices include water monitoring stations, vehicle telematics, distributed fluid level sensors, etc. Resources Current in-market services overview: IoT Hub: What is Azure IoT Hub? - Azure IoT Hub DPS: Overview of Azure IoT Hub Device Provisioning Service - Azure IoT Hub Device Provisioning Service ADU: Introduction to Device Update for Azure IoT Hub Building scalable solutions with Azure IoT platform: Best practices for large-scale IoT deployments - Azure IoT Hub Device Provisioning Service Scale Out an Azure IoT Hub-based Solution to Support Millions of Devices - Azure Architecture Center Azure IoT Hub scaling Try out our preview of new IoT Hub capabilities (integration with Azure Device Registry and Certificate Management) Learn more about these capabilities on our blog post: Azure IoT Hub + Azure Device Registry (Preview Refresh): Device Trust and Management at Fleet Scale… Integration with Azure Device Registry (preview): Integration with Azure Device Registry (preview) - Azure IoT Hub Microsoft-backed X.509 certificate management (preview): What is Microsoft-backed X.509 Certificate Management (Preview)? - Azure IoT Hub How to start with the preview: Deploy IoT Hub with ADR integration and certificate management (Preview) - Azure IoT Hub 2. Connect Industrial Assets with Azure IoT Operations Azure IoT Operations provides a unified data plane for the edge that runs on Azure Arc–enabled Kubernetes clusters and supports open industrial standards. It allows organizations to connect and capture equipment telemetry, normalize OT data locally, route hot-path signals to real-time analytics, securely manage layered industrial networks, and more. Edge‑processed data can then be sent upstream to Microsoft Fabric for AI‑driven analysis. Recommended Usage: Azure IoT Operations is intended to be the data plane for an adaptive cloud deployment extending the management, data, and AI capabilities of the Microsoft cloud to an on-prem device. This device binds to these cloud planes providing a platform for local data processing and intermittent connectivity. The target for these devices range from a small-gateway-style PC to a full data center. Azure IoT Operations endpoints enable cloud-managed containerized workloads and cloud-managed proximal industrial protocol connectivity. Examples of appropriate adaptive cloud and Azure IoT Operations endpoints include, on-robot computers, industrial machine controllers, retail store sensor/vision processing, and top-of-factory site infrastructure for line of business applications. Resources Azure IoT Operations Overview Azure IoT Operations Documentation Hub Quickstart: explore-iot-operations/quickstart at main · Azure-Samples/explore-iot-operations Open-source framework for scaling robotics from simulation to production on Azure + NVIDIA: microsoft/physical-ai-toolchain Demo video showcasing How we built the demo: explore-iot-operations/quickstart at main · Azure-Samples/explore-iot-operations Edge-AI: microsoft/edge-ai: Production-ready Infrastructure as Code, applications, pluggable components, and… Latest Announcements & Blogs Making Physical AI Practical for Real-World Industrial Operations: Part 1 | Microsoft Community Hub Making Physical AI Practical for Real-World Industrial Operations: Part 2 | Microsoft Community Hub Unlock Industrial Intelligence | Microsoft Hannover Messe 2026 From pilots to production: How Microsoft and partners are accelerating intelligent operations 3. Advanced Analytics with Microsoft Fabric Microsoft Fabric delivers a unified, end‑to‑end analytics platform that transforms streaming OT telemetry into real‑time insights and live dashboards. Fabric Operations Agents monitor industrial signals to recommend targeted actions, while Fabric IQ provides a shared semantic foundation that enables AI agents to reason over enterprise data with business context. Together, Fabric turns live industrial data into AI‑powered operational intelligence. Resources Get Started with Microsoft Fabric Learning Path Fabric Real-Time Intelligence documentation - Microsoft Fabric | Microsoft Learn Create and Configure Operations Agents - Microsoft Fabric | Microsoft Learn Fabric IQ documentation - Microsoft Fabric | Microsoft Learn 4.Run AI Models On‑Device with Foundry Local Foundry Local extends on‑device AI to Arc‑enabled Kubernetes edge clusters, providing a Microsoft‑validated inferencing layer for running AI models in industrial, disconnected or sovereign environments. Resources Foundry Local on Azure Local Documentation Participate in Foundry Local on Azure Local preview form Foundry Local on Azure Local: HELM deployment Demo Customer Stories Chevron: Chevron plans facilities of the future with Azure IoT Operations Husqvarna: Husqvarna Group Boosts Operational Efficiency with Azure Adaptive Cloud Ecopetrol: Azure IoT Operations and Azure IoT for energy help Ecopetrol optimize energy distribution while lowering operational costs P&G: Procter & Gamble cuts model deployment time up to 90% with Azure IoT Operations Toyota: Toyota Industries innovates its paint shop processes with Azure industrial AI and Azure IoT Hub569Views1like0CommentsBuild observability for scalable AI apps and agents selling through Microsoft Marketplace
Discover how to design observability for AI apps and agents selling through Microsoft Marketplace. This Marketplace Community article explains why visibility into execution behavior is essential for operating AI systems confidently at scale—not just keeping them running. As AI apps and agents reason, branch, retry, and exit dynamically at runtime, traditional infrastructure metrics fall short. Behavioral signals such as execution flow, token usage, latency, and failure patterns help explain what systems are doing, why outcomes occur, and how limits and safeguards shape behavior across tenants and environments. Learn how observability turns runtime telemetry into clarity that supports customer trust, usage‑based billing, and scalable operations. Read more: Design observability for AI apps and agents selling through Microsoft Marketplace18Views0likes0CommentsMicrosoft 365 & Power Platform product updates call
💡Microsoft 365 & Power Platform product updates call concentrates on the different use cases and features within the Microsoft 365 and in Power Platform. Call includes topics like Microsoft 365 Copilot, Copilot Studio, Microsoft Teams, Power Platform, Microsoft Graph, Microsoft Viva, Microsoft Search, Microsoft Lists, SharePoint, Power Automate, Power Apps and more. 👏 Weekly Tuesday call is for all community members to see Microsoft PMs, engineering and Cloud Advocates showcasing the art of possible with Microsoft 365 and Power Platform. 📅 On the 12th of May we'll have following agenda: News and updates from Microsoft Together mode group photo Joe Komban– Getting started with SharePoint Skills Shreyas Saravanan – Introducing new administration capabilities for SharePoint Embedded Paolo Pialorsi – Microsoft 365 Work IQ API Integration in Custom Engine Agents 📞 & 📺 Join the Microsoft Teams meeting live at https://aka.ms/community/ms-speakers-call-join 🗓️ Download recurrent invite for this weekly call from https://aka.ms/community/ms-speakers-call-invite 👋 See you in the call! 💡 Building something cool for Microsoft 365 or Power Platform (Copilot, SharePoint, Power Apps, etc)? We are always looking for presenters - Volunteer for a community call demo at https://aka.ms/community/request/demo 📖 Resources: Previous community call recordings and demos from the Microsoft Community Learning YouTube channel at https://aka.ms/community/youtube Microsoft 365 & Power Platform samples from Microsoft and community - https://aka.ms/community/samples Microsoft 365 & Power Platform community details - https://aka.ms/community/home 🧡 Sharing is caring!14Views0likes0CommentsA New Chapter for Realtime AI: Reasoning, Translation, and Real-Time Transcription
Voice can be one of the most direct and productive interfaces for AI — enabling customer support agents that may resolve issues without a single keystroke, live multilingual communication that can take on language barriers as conversations happen, and voice assistants capable of reasoning through complex requests in real time. Developers building these experiences need models that can keep pace with increasingly demanding latency, accuracy, and language coverage requirements. Today, OpenAI’s GPT-realtime-translate, GPT‑realtime‑2 and, GPT-realtime-whisper are rolling out into Microsoft Foundry starting today — together representing a significant step forward for the realtime model lineup available to developers on the platform. GPT-realtime-translate and GPT-realtime-whisper GPT-realtime-translate and GPT-realtime-whisper together extend the realtime stack for live multilingual audio workflows. GPT-realtime-translate is built for continuous, real-time translation, producing translated output as speech unfolds without relying on segmented pipeline processing, while GPT-realtime-whisper provides low-latency streaming transcription of the original audio in parallel. Used together, they help developers support scenarios such as live events, cross-language customer experiences, captions, monitoring, and archival workflows that require both translated output and visibility into the source speech. Continuous stream processing: This new model translates live audio without segmenting or buffering allowing for more natural interactions. New translation and transcription capabilities: Translate between languages in real time and observe faster text to speech. Available via the Realtime API GPT-realtime-2 GPT‑realtime‑2 is a generational upgrade to OpenAI's speech-to-speech model, bringing internal reasoning and an expanded context window to real-time voice applications. Where previous speech to speech models responded immediately, GPT‑realtime‑2 can work through a problem before speaking — making it well suited for voice applications that need to handle complex, multi-step queries entirely in the audio layer without routing to a separate text pipeline. Native reasoning capability: The newest realtime model introduces stronger reasoning capabilities. Now the model thinks internally before responding. Adjustable reasoning effort via {reasoning.effort}: Explicitly request the level of reasoning the model uses -- minimal, low, medium, high – to save on cost and latency. Audio in, audio out: No need for an intermediary text step, conversation stays fluid and natural. Available via the Realtime API Use cases These models work independently, but they're designed to complement each other in real-world pipelines: Live multilingual events. GPT-realtime-translate enables real-time translation of live audio, producing translated speech along with a transcript in the target language. GPT‑realtime‑whisper can be used in parallel to capture a transcription of the original speech for captions, monitoring, or archival purposes. Together, they enable multilingual live streaming with both translated experiences and visibility into the source language. Global customer support. Route inbound calls through GPT-realtime-translate to translate conversations in real time and provide a translated transcript for agents. Use GPT‑realtime‑whisper alongside it to capture the original conversation as text for compliance, quality review, or analytics. Then pass the interaction to an agent built with GPT‑realtime‑2 using {reasoning.effort}: high for complex issue resolution, all within a continuous audio pipeline. International voice assistants. Build once and deploy across languages. GPT-realtime-translate enables multilingual interaction and provides translated output with a target-language transcript, while GPT‑realtime‑whisper can optionally capture the original user input as text. GPT‑realtime‑2 manages reasoning and conversational context, supporting more complex voice interactions. Pricing Model Deployment Modality Pricing per 1M tokens Input Cached Input Output GPT-realtime-2 Global Standard Audio $32.00 $0.40 $64.00 Text $4.00 $0.40 $24.00 Image $5.00 $0.50 -- GPT-realtime-translate Global Standard Audio -- -- $0.034/minute GPT-realtime-whisper Global Standard Audio -- -- $0.017/minute Getting Started Looking for ways to dive in? GPT-realtime-translate, GPT-realtime-whisper, and GPT‑realtime‑2 are rolling out into Microsoft Foundry today. Explore the model catalog and start building: https://ai.azure.com1.5KViews0likes1CommentFrom Test Cases to Trust: Elevating Enterprise Quality with GitHub Copilot
The Traditional QA Bottleneck In complex enterprise systems, QA teams often face familiar challenges: Time‑consuming test case creation from evolving requirements Repetitive automation scripting and refactoring Heavy regression cycles under tight release timelines Limited bandwidth for deeper risk analysis and exploratory testing None of these issues are caused by lack of skill—they’re symptoms of scale and complexity. This is where GitHub Copilot entered our workflow—not as a “magic button,” but as a thinking partner. Where GitHub Copilot Actually Helped Used responsibly, Copilot added value in very specific QA scenarios: Faster Test Design from Requirements Transforming business or technical requirements into structured test scenarios is intellectually demanding but time-intensive. Copilot helped accelerate: Initial test scenario drafting Gherkin-style acceptance criteria Coverage identification for edge and negative cases The result wasn’t “auto-generated tests,” but faster starting points, reviewed and refined by humans. Accelerating Automation Without Losing Control Whether working with UI automation or API tests, a significant portion of effort goes into boilerplate code, assertions, and structuring. Copilot assisted with: Suggesting test skeletons Refactoring repetitive code Improving readability and consistency This freed engineers to focus on test intent, not syntax. Supporting Debugging and Maintenance Automation maintenance is often underestimated. Copilot helped: Identify potential fixes during test failures Suggest improvements during refactoring Reduce turnaround time during regression cycles Again, nothing was auto‑merged. Human review remained non‑negotiable. The Most Important Shift: QA Mindset The real impact of Copilot wasn’t just efficiency—it changed how QA engineers spent their time. Instead of: Writing repetitive scripts Manually expanding similar test cases The team could focus more on: Risk‑based testing Failure pattern analysis Cross‑team quality discussions Improving test strategy and coverage depth In short, AI didn’t remove QA effort—it redirected it to higher‑value work. Responsible AI Was Not Optional In an enterprise setup, responsible AI usage matters. Key principles we followed: No blind acceptance of AI suggestions Strict human validation of all test logic Awareness of data sensitivity and compliance boundaries Treating Copilot as an assistant, not an authority This balance ensured quality and trust were never compromised in the pursuit of speed. What This Means for QA Teams From this experience, one thing became clear: AI won’t replace QA engineers. But QA engineers who use AI effectively will redefine quality. GitHub Copilot helped shift QA from execution to enablement—from writing tests faster to thinking about quality better. For enterprise teams, this is a powerful evolution. Final Thoughts Quality engineering is no longer just about finding defects—it’s about enabling confidence in delivery. Tools like GitHub Copilot, when used responsibly, can become catalysts in that transformation. The future of QA isn’t manual vs automation. It’s human judgment amplified by AI assistance. And that’s where real quality lives. References Create and manage manual test cases - Azure Test Plans | Microsoft Learn What is Azure Test Plans? Manual, exploratory, and automated test tools. - Azure Test Plans | Microsoft Learn Create and manage test plans - Azure Test Plans | Microsoft LearnUpdates to Teamwork Deployment specialization: View new name and requirements
Starting April 29, 2026, Microsoft is renaming the Teamwork Deployment specialization as the Secure AI Productivity specialization to reflect the increased focus on the secure foundations required for successful AI adoption. Additionally, the performance, skilling, and Solutions Partner designation requirements will be updated to better align with foundational workloads from Microsoft 365 E3 and the realities of modern AI adoption. These updates expand the workloads included in the specialization and refresh requirements to reflect the skills partners need to deploy, secure, and scale AI‑powered collaboration. Additionally, partners will be required to hold both the Modern Work designation and the Security designation to earn the Secure AI Productivity specialization. Partners currently enrolled in the specialization will need to meet these updated requirements at their next renewal if it occurs after April 29, 2026. Learn more111Views0likes0CommentsFirm AI for professional services: governed, agentic workflows built on Microsoft Azure
In this installment of our Partner Spotlight series, we’re highlighting partners building industry-focused AI solutions and bringing them to customers through Microsoft Marketplace. I connected with Richard Baskerville from Intapp to learn how the company is delivering Firm AI—governed, agentic capabilities designed specifically for professional and financial services firms—while aligning with Microsoft Azure for security, scale, and enterprise procurement. Intapp’s approach shows what it looks like to pair deep vertical workflow expertise with a trusted cloud platform so customers can adopt AI in a way that is both practical and accountable. About Richard Baskerville Richard Baskerville is a Senior Director at Intapp, where he helps shape strategic AI alliances across the Microsoft ecosystem and beyond. _______________________________________________________________________________________________________________________________________________________________ [JR] Tell us about Intapp. What inspired the founding of the company, and what problems do your solutions help customers solve? [RB] Intapp was founded on a durable observation: professional firms—law firms, accounting practices, private capital firms, and investment banks—operate differently than general enterprises. Their workflows are shaped by client relationships, professional obligations, and regulatory requirements that generic software was never designed to handle. Our founders saw that these firms were either building bespoke systems at enormous cost or forcing themselves into enterprise tools that didn’t fit. Today, Intapp delivers Firm AI—governed AI purpose-built for professional services. Our solutions span the full firm lifecycle: business development through Intapp DealCloud, time capture and billing through Intapp Time, and risk and compliance management across Intapp Conflicts, Intake, Terms, Walls, and Employee Compliance. Underpinning it all is Intapp Celeste, our agentic AI platform, which puts AI to work on the specific workflows that drive firm performance—while keeping humans accountable and in control. For firms where every engagement carries professional liability, that governance layer isn’t a feature; it’s the foundation. [JR] What industries or types of organizations do you primarily serve today? [RB] Intapp serves professional and financial services firms exclusively. That focus is intentional—and it’s what makes us different. Our customers are law firms, accounting and consulting practices, investment banks and advisory firms, private capital managers, and real assets firms. Many are among the largest in their categories globally. These firms share characteristics that set them apart from general enterprises: partnership structures, billable-hour economics, client conflict management, regulatory oversight, and relationship networks spanning decades. Generic CRM or ERP tools aren’t built for these dynamics. Intapp is. That vertical depth—built over more than 20 years—is what Firm AI means in practice: AI that understands the context of a law firm partner’s client obligations or a dealmaker’s fund-level requirements, not just the general shape of business software. [JR] What were your initial expectations for Microsoft Marketplace when you first started your journey? [RB] Our initial expectation was straightforward: Marketplace would give us a cleaner path to transact with customers already operating inside the Microsoft ecosystem. Many of our customers—large law firms and financial services firms—had already committed significant Azure spend through enterprise agreements. Marketplace offered a way to meet them commercially where they already were. What we underestimated was how much Marketplace would shape the broader partnership. We went in expecting a distribution channel. What we found was a framework that connected co-sell, Azure consumption alignment, and joint go-to-market in ways that changed how our teams engaged. The commercial mechanics—particularly MACC drawdown eligibility—became a real conversation-opener with procurement and finance stakeholders, not just a contract path. [JR] What applications do you have available in Microsoft Marketplace, and how do they help customers? [RB] Intapp has 12 SaaS solutions available in Microsoft Marketplace today, transacted via private offers. They span the full professional services firm lifecycle—from relationship intelligence and deal management with Intapp DealCloud, to time capture with Intapp Time, to risk and compliance management across Intapp Conflicts, Intake, Terms, Walls, and Employee Compliance. Because we focus exclusively on professional and financial services, customers aren’t buying horizontal software adapted for their industry; they’re buying solutions designed for their workflows, compliance obligations, and client structures. Looking ahead, Intapp Celeste—our agentic AI platform—will be available in Marketplace via a consumption-based, metered model. That structure matches how agentic AI gets used: variably, tied to real firm activity, and governed end-to-end. [JR] What were the biggest lessons you learned early on when selling through Marketplace? [RB] Three lessons stood out early: Listing is the beginning, not the end. Initial traction required deliberate investment in co-sell enablement—ensuring Microsoft field teams could position Intapp clearly, not just point to a catalog entry. Specificity wins. Our customers are sophisticated buyers. What worked was leading with vertical relevance—speaking directly to the compliance requirements of a law firm or the relationship-data challenges of a private capital manager. Private offers require commercial fluency. Helping customers understand how Intapp maps to Azure commitments—and helping Microsoft sellers tell that story—made a material difference in deal velocity. [JR] How has your business changed with a transactable offer on the Marketplace? [RB] Transactable offers have changed how deals close. Customers with existing Azure commitments can apply Intapp spend against their MACC, removing a procurement obstacle that previously added months to cycles. Finance and procurement can work within familiar Microsoft frameworks rather than running separate vendor onboarding. Marketplace has also expanded our reach. Microsoft’s field organization has relationships we can’t replicate at scale, and co-sell has helped translate that reach into qualified pipeline—especially in segments where we previously had limited coverage. And the signal matters: being transactable in Marketplace reinforces that Intapp is an enterprise-grade partner, not a niche point solution. [JR] How has collaborating with Microsoft sellers impacted your Marketplace growth? [RB] Microsoft sellers are the activation mechanism for our Marketplace offers. Without field alignment, a listing is a catalog entry. With it, it becomes a joint pipeline motion. We’ve invested in enablement—giving sellers the vertical context to position Intapp credibly in front of legal and financial services CIOs, and making it easy to bring us into deals where Azure capabilities are already in the conversation. That alignment shows up in specific segments. In private capital and investment banking, Microsoft enterprise relationships often predate ours—co-sell provides warm introductions backed by a trusted infrastructure partner. In legal, where Microsoft 365 is near-universal, that adjacency and deep interoperability creates natural entry points. Co-sell turns those adjacencies into active pipeline rather than theoretical opportunity. [JR] What has made the co-sell relationship with Microsoft particularly valuable for Intapp? [RB] Co-sell works because it’s structurally aligned, not just commercially convenient. Microsoft’s investment in these verticals—through industry clouds, compliance frameworks, and dedicated field teams—maps directly onto Intapp’s customer base. We’re selling into the same firms, with the same platform expectations underpinning both offerings. What makes it particularly valuable is the mutual trust transfer. Firms hold Microsoft to a high standard for security, data governance, and regulatory compliance. When Microsoft sellers bring Intapp into the conversation, that credibility extends to us and compresses the trust-building phase of an enterprise cycle—especially in regulated industries. [JR] How does Microsoft Marketplace fit into Intapp’s long-term growth strategy? [RB] Marketplace is central to how we scale Firm AI globally. Our ambition is to be the governed AI platform of record for professional firms across legal, private capital, accounting, and advisory. Helping customers decrement MACC through Marketplace purchases is a clear win-win because it aligns platform investment with workflow outcomes. The upcoming Marketplace availability of Intapp Celeste which will offer different commercial models for customers (e.g. consumption based) marks the next phase. As Celeste deepens integration with Microsoft 365, Teams, and Azure AI services, the commercial and technical stories converge—customers can both buy and operate within an architecture where Firm AI and Microsoft’s platform reinforce each other. [JR] What advice would you give other SDCs who are just starting their Microsoft Marketplace journey? [RB] Three things matter most early on: Earn the co-sell relationship before you need it. Invest in enablement early so Microsoft sellers have enough vertical context to represent your value clearly. Get your commercial model right for the channel. Understand how private offers interact with Azure commitments, and plan for consumption-based pricing where it fits AI usage. Lead with a sharp point of view. The partners who gain traction fastest are the obvious choice for a specific industry workflow or problem—know what that is and communicate it consistently. _______________________________________________________________________________________________________________________________________________________________ Closing reflection Intapp’s Marketplace journey shows that industry-specific, governed AI wins when it’s paired with an enterprise platform customers already trust. By making solutions transactable—especially through private offers that align to customer’s existing Azure commitments—Intapp reduces procurement friction and accelerates adoption. And like many successful partners, their growth ultimately comes down to enablement: clear vertical messaging and tight co-sell alignment that turns Marketplace presence into a real, qualified pipeline.80Views0likes0CommentsDesign observability for AI apps and agents selling through Microsoft Marketplace
In the last post, API resilience and reliability patterns for AI apps and agents, we focused on what happens when AI systems encounter failure—and how resilient execution paths keep that failure contained. Timeouts fire with intent. Retries stay bounded. Circuit breakers provide overload protection. When resilience is designed well, your system continues to function even as conditions change. You can always get curated step-by-step guidance through building, publishing and selling apps for Marketplace through App Advisor. This post is part of a series on building and publishing well-architected AI apps and agents in Microsoft Marketplace. The series focuses on AI apps and agents that are architected, hosted, and operated on Azure, with guidance aligned to building and selling solutions through Microsoft Marketplace. Observability for AI systems AI apps and agents are shifting traditional observability, which was designed for systems based on simple assumptions, where requests followed linear paths and workloads behaved predictably. Execution in AI systems consumes tokens at a highly variable rate rather than fixed compute units. Requests unfold across multiple reasoning steps. Agents perform work that spans APIs, models, retrieval layers, and applications. A single interaction may pause, branch, retry, or exit early depending on inferred intent, context, and constraints. Instead of asking whether services are running, observability for AI systems asks: what is the system doing right now—and why? Is an agent spending its time reasoning, waiting on dependencies, retrying tool calls, or exiting early due to enforced limits? Is cost increasing because value is increasing, or because execution paths are expanding without progress? AI observability requirements shift the focus in the following subtle, but critical ways: From resource availability to workflow state From performance metrics to signals From incidents to patterns Core observability dimensions for AI apps and agents Once observability shifts toward understanding behavior, clarity comes from tracking state across the agents in the workflow. For AI apps and agents, observable indicators, such as those detailed below, show how work unfolds and changes during real usage—especially in trials and early adoption: Execution flow shows how a request moves through agents, tools, and workflows. This highlights where execution progresses smoothly, where it slows, and where it concludes early. This makes agent outcomes explainable and keeps behavior consistent across tenants. Cost and token behavior reveals how execution translates into consumption. Token usage per request, per agent step, and per retry shows where value is being delivered and where execution paths expand without proportional benefit. This insight connects runtime behavior directly to Marketplace billing expectations and evaluations. Latency and wait states distinguish active processing from time spent waiting on dependencies. Seeing where time is consumed helps explain slow experiences and guides decisions about optimization, caching, or resilience improvements. Failure classification provides structure when systems degrade. Separating tool failures from planning failures, and transient issues from terminal exits, keeps investigations focused and prevents protective behavior from being misread as instability. Tenant‑level patterns surface how behavior repeats at scale. Uneven load, and recurring degradation often appear first during trials and shape the customer's perception. Together, these dimensions turn telemetry into understanding—supporting clearer conversations, faster triage, and predictable execution as usage grows. Why observability matters By this point in the journey, your AI app or agent has implemented bounded execution paths, cost controls, and quality of service safeguards. As a result, failure degrades gracefully instead of spreading. These resilience techniques determine how your solution behaves under pressure. The data gathered from observability platforms like Application Insights and Azure Monitor explains why it behaves that way. For AI and agentic systems, infrastructure health alone rarely answers the questions that matter. Services can be up, CPUs can be idle, and queues can look healthy while agents loop inefficiently, retries quietly expand cost, or workflows exit early without delivering value. From the customer’s perspective, the experience feels inconsistent even though the platform appears stable. Observability closes this gap by revealing system behavior rather than system status. It shows how requests move, where work concentrates, and how constraints shape outcomes. At Marketplace scale, these patterns repeat across tenants and trials. What appears once during an evaluation often appears again as adoption grows. Observability connects runtime behavior back to the design choices introduced in earlier posts: Usage‑based billing introduced variability in consumption Performance optimization introduced tradeoffs among latency, quality, and cost Resilience patterns introduced controlled failure and bounded execution Observability allows you to explain outcomes during trials, validate assumptions as usage grows, and operate with confidence across customers and environments. Without this visibility, teams react to symptoms. With it, they recognize patterns. From execution paths to behavioral signals Observability begins at the same place resilience begins—API boundaries. These boundaries define where responsibility shifts and where behavior becomes visible. Observability focuses on signals that explain decisions made by the system as it executes instead of relying on raw logs that describe isolated events. Every resilience mechanism emits behavioral signals. Viewed together, these signals provide far more value than logs alone. Logs answer whether something happened. Behavioral signals explain why it happened and how the system responded. Circuit breakers change state as load builds and recedes. Retry loops show whether failures resolve quickly or exhaust their limits. Timeout enforcement reveals where dependencies slow execution. Fallback paths and early terminations show how the system protects itself while preserving outcomes for customers. This perspective matters most for agents. Agent execution unfolds as a series of choices—plan, call a tool, retry, exit early—rather than a single request‑response cycle. Observability that tracks these decisions makes agent behavior understandable, consistent, and defensible as usage grows across customer tenants. Observability at the agent layer As AI systems become more agent‑driven, observability needs to move closer to where decisions are made. Agents introduce variability by design. They plan, adapt, and choose workflow paths dynamically. Without first‑class visibility into that behavior, execution can appear unpredictable even when the underlying system is healthy. Observability at the agent layer acts as the feedback loop that keeps execution safely bounded. It shows how agents use the freedom you give them—and where that freedom begins to stretch into inefficiency. Observability follows how the agent did its job instead of treating the agent’s interaction as a single outcome. Several indicators help make agent behavior understandable. Step count per request reveals how much reasoning effort a prompt requires. Planning iterations show whether an agent converges quickly or cycles through alternatives. Tool invocation frequency highlights when agents rely heavily on external systems. Early exits compared to full completion explain whether limits and fallbacks activate as designed. Taken together, these indicators help distinguish healthy exploration from inefficient reasoning and degraded execution. An agent exploring briefly before converging adds value. An agent looping through tools without progress signals pressure, uncertainty, or dependency issues. This distinction reinforces a core principle of agentic systems: models reason probabilistically, adapting to context as it changes. Your system observes deterministically—measuring execution, enforcing boundaries, and clarifying outcomes. When those roles stay separate and well‑instrumented, agent behavior becomes transparent, predictable, and ready for Marketplace scale. Observability across environments The type of Marketplace offer you choose shapes what observability customers expect and how responsibility is shared. For SaaS offers, publishers typically own end‑to‑end execution. Observability centers on agent behavior, workflow completion, token usage, latency, and dependency impact across tenants. Publishers rely on consistent signals—often surfaced through tools like Azure Monitor, Application Insights, and Microsoft AI Foundry—to explain how requests behave as scale and load increase. For container‑based offers and Azure Managed Applications, observability expectations are more distributed. Publishers expose clear execution outcomes, limits, and failure signals at application boundaries. Customers, in turn, observe infrastructure health, scaling behavior, and downstream systems within their own environments. This separation ensures each party has visibility into what they control without creating ambiguity. Learn more about Choosing your marketplace offer type for AI Apps and agents. Execution behavior differs across environments for predictable reasons. Scale increases, tenant mix broadens, and external dependencies behave differently under real load. What must stay consistent is how behavior is interpreted. Signal definitions, thresholds, and failure classification should mean the same thing in Dev, Stage, and Prod. Learn more about designing a reliable environment strategy for Microsoft Marketplace AI apps and agents. Staging environments are where this consistency is validated. Observing retries, timeouts, and graceful degradation before production prepares you for Marketplace evaluations, which often resemble production conditions. Observability gaps tend to appear first during customer evaluation—when clarity matters most. Publisher and customer visibility boundaries Purpose: Parallel Post #13 responsibility clarity, now for observability As observability matures across environments, clarity around responsibility becomes essential. For Marketplace solutions, trust grows when publishers and customers each see what they own—and understand where that visibility ends. Publishers are responsible for instrumenting execution paths end to end. That means making workflows traceable, limits visible, and failure modes explainable. Observability should surface behavior—how requests progressed, where execution concluded, and why—rather than exposing raw internal errors that require insider knowledge to interpret. Customers focus their observability on what they control. This includes monitoring downstream systems, infrastructure behavior, and environment‑level alerts within their own estate. When visibility aligns with ownership, teams can act quickly and decisively. Exposing too much internal detail can overwhelm customers and blur accountability. Observing too little behavior creates friction, especially when issues cross boundaries and lack context. Clear visibility enables faster triage, sharper ownership boundaries, and fewer escalations rooted in ambiguity. Observability as an enabler for scale, billing, and trust From a customer’s perspective, observability answers two fundamental questions: Can I understand what happened? and Can I trust this at scale? When the answer to both is clear, observability becomes part of the value your Marketplace offering delivers. When system behavior is visible and explainable, customers gain confidence that adoption and growth will remain predictable. Observability directly supports usage‑based billing by tying execution behavior to measured consumption. Clear visibility into token usage, retries, and execution paths helps validate how usage is calculated and supports transparent billing conversations. It also enables ongoing performance tuning and caching strategies by showing where latency accumulates, where work repeats, and where optimization delivers measurable impact. Observability reinforces confidence in resilience mechanisms, confirming that limits, fallbacks, and degradation paths activate as designed under real‑world conditions. Beyond validation, observability creates a continuous feedback loop. Execution data informs pricing adjustments, guides changes to limits, and helps refine default configurations as customer behavior evolves. What’s next in the journey With execution behavior observable and explainable, the focus shifts to how AI systems are operated safely as change accelerates. The upcoming posts will discuss deployment strategies, CI/CD pipelines for agents, and progressive rollouts build on this foundation—ensuring AI apps evolve confidently as usage and expectations grow. Key Resources See curated, step-by-step guidance to help you build, publish, or sell your app or agent (no matter where you start) in App Advisor Quick-Start Development Toolkit can connect you with code templates for AI solution patterns Microsoft AI Envisioning Day Events How to build and publish AI apps and agents for Microsoft Marketplace Get over $126K USD in benefits and technical consultations to help you replicate and publish your app with ISV Success83Views1like0CommentsAPI resilience and reliability patterns for AI apps and agents selling through Microsoft Marketplace
Why API resilience is a Marketplace readiness requirement The previous post Design Predictable AI Performance for Apps Selling Through Microsoft Marketplace showed how to design systems that behave predictably when things go right. This post focuses on what happens when they do not. Imagine an enterprise customer launching a trial of your AI agent from Microsoft Marketplace. The first few interactions work beautifully. Then a more complex request triggers a multi‑step agent workflow: retrieval, enrichment, validation, approval. One downstream API stalls for just long enough to push the workflow beyond its timeout. The agent retries. The retry fans out into additional calls. Tokens burn. Costs rise. Eventually the entire interaction fails ambiguously. From the customer’s perspective, the trial just “didn’t work” with no explanation or architecture diagram. Just a stalled agent and decreased confidence. AI apps and agents treat APIs as their execution backbone. Every model invocation, tool call, retrieval query, and workflow step depends on APIs behaving within expected bounds. Solutions with a single unstable dependency can affect many tenants simultaneously. You can always get curated step-by-step guidance through building, publishing and selling apps for Marketplace through App Advisor. This post is part of a series on building and publishing well-architected AI apps and agents in Microsoft Marketplace. The series focuses on AI apps and agents that are architected, hosted, and operated on Azure, with guidance aligned to building and selling solutions through Microsoft Marketplace. How AI and agentic workloads stress APIs differently Traditional API platforms often assume linear, predictable request patterns. One request in, one response out. AI apps produce bursty, non‑linear traffic shaped by user behavior, token budgets, and inference variability. Agents amplify this further. A single user request may trigger planning, branching logic, parallel tool calls, and dynamic retries—all before returning a result. Single‑turn inference calls tend to be synchronous and bounded. Agent workflows may run for minutes, traverse multiple services, and consume tokens unpredictably depending on intermediate outcomes. Happy‑path assumptions break down quickly. Reliability also compounds mathematically. If you chain five APIs, each with 99.9% availability, the composite reliability drops to roughly 99.5%. Add retries without bounds, and the system can degrade traffic rather than absorb failure. For AI systems, reliability must be defined across multiple dimensions: Availability: Are dependencies reachable? Timeout behavior: How long will the system wait? Error propagation: What information crosses boundaries? Recovery safety: Can operations be retried without harm? Data access and integrity: Is contextual data available, relevant, and trustworthy? Defining reliability for AI systems Reliability becomes the mechanism that preserves trust when uncertainty appears. Reliability in AI systems is more than “the model didn’t fail.” That framing is incomplete. True reliability means providing predictable behavior under partial failure, bounding execution when dependencies degrade, and failing clearly, safely, and consistently instead of unpredictably. For publishers providing AI solutions on Marketplace, this includes protecting customers from ambiguous states—workflows that half‑complete, retries that silently multiply costs, or agents that continue planning after their assumptions are no longer valid. Designing resilient API boundaries The shift toward reliable AI systems starts with how you think about API boundaries. In this context, an API boundary is the line where responsibility changes—between your app and a dependency, between orchestration and execution, or between your system and a customer‑ or partner‑owned service. These boundaries are deliberate points of control. You must decide: how long is a call allowed to run? What happens if it fails? Is a retry safe, and if so, how many times? When agents assume that APIs will be reliable, fast, or always available, failure starts becoming systemic. Well‑designed API boundaries stop execution early when reliability assumptions break. Explicit timeouts keep your system from waiting indefinitely when a dependency slows or an API call hangs. Bounded retries allow brief recovery without inflating cost, load, or complexity. Together, these constraints help your system behave predictably, even under stress. This is where your enforcement layers come into focus. For many Marketplace solutions, Azure API Management is where you turn design intent into predictable behavior. At this boundary, you define how your system responds under pressure—how much traffic is allowed, how tokens are budgeted, and how long requests are permitted to run. These policies give you a steady way to shape execution across tenants, even when the systems behind the boundary behave unpredictably. As workflows grow more complex, orchestration layers such as Azure Durable Functions or Logic Apps carry that intent forward. They give you a way to manage long‑running or multi‑step operations explicitly, with clear execution limits, defined retry behavior, and compensating actions when steps fail so you can keep control over how work progresses and how it concludes. Core API resilience patterns for AI apps and agents Several foundational patterns appear repeatedly in resilient AI solutions published on Marketplace. Timeouts and deadline propagation ensure no call waits indefinitely. For AI workloads, these limits should be token‑aware—longer prompts or higher‑cost models require proportional constraints. Deadlines should propagate across calls so upstream services remain informed. Bounded retries protect against transient failures but with pre-defined limits and quotas. In agent workflows, retries should be explicit, counted, and observable. Retrying API calls that execute actions, attempt and fail authentications, or create updates that exceed quotas can lead to runaway failures. Circuit breakers prevent cascading failure by opening when error rates exceed thresholds. Unlike guardrails—which enforce policy by intent—circuit breakers react to system state by pausing execution paths that are no longer reliable. Azure API Management and resilience libraries such as Polly in .NET provide practical implementations. Bulkheads isolate high‑risk or high‑cost operations. Separate concurrency pools, queues, or compute tiers prevent one tenant or workflow from consuming disproportionate resources. This is especially critical for expensive reasoning paths or third‑party dependencies. Idempotency keeps retries safe by ensuring that repeating the same request produces the same result. Agents that take real‑world actions—creating records, approving workflows, triggering payments—must attach idempotency keys so repeat attempts do not multiply side effects. Together, these patterns do not eliminate failure. They contain it. Agent‑specific reliability risks and mitigations Agent autonomy shifts how reliability behaves in practice. Agents change the shape of failure. Because they plan, reason, and act across multiple steps, a single issue rarely stays isolated. When autonomy increases, failures affect more of the workflow and do so faster. Most agent failures fall into two categories and treating them the same way creates instability. Tool failures occur when an external dependency slows, times out, or becomes unavailable. An API may reject a request, enforce a quota, or fail temporarily. These failures require containment. Your system should pause execution, apply fallback behavior, or exit cleanly once limits are reached. Allowing the agent to keep calling tools under these conditions increases cost and load without improving results. Planning failures occur when the agent’s reasoning breaks down. The plan itself is flawed, incomplete, or loops without converging on an outcome. These failures require correction. Step limits, loop detection, and execution caps keep planning from expanding indefinitely and signal when the system should stop and reassess. Making this distinction explicit is what keeps agent behavior predictable. You define how far execution can go—how many steps are allowed, how long a request may run end‑to‑end, and when the system should pause or conclude. By enforcing these limits outside the model, you give agents room to reason while your system provides the structure that contains failure and keeps execution steady as conditions change. As explored in Designing AI Guardrails for Apps and Agents in Microsoft Marketplace, guardrails define what an agent is allowed to do. Resilience patterns determine how your system holds up when dependencies degrade. Together, they enable agents that feel capable and autonomous while remaining stable, bounded, and ready for Marketplace scale. Reliability across external and third‑party APIs Marketplace AI apps rarely operate in isolation. They depend on customer‑owned systems, partner services, SaaS platforms, and external LLM APIs—each with different SLAs and failure modes. Publishers must absorb this variability rather than pass it directly to customers. That means handling throttling gracefully, surfacing authentication failures clearly, and isolating quota exhaustion. Token‑based rate limiting via Azure API Management is especially important for downstream LLM calls, where cost and availability intersect. Remember the SLA math: your effective reliability is the product of every dependency. Designing for the weakest link protects customer perception—and your own margins. Environment‑aware reliability validation As outlined in Designing a reliable environment strategy for Microsoft Marketplace, environment strategy underpins reliable promotion and confident scaling. Reliability cannot be tested only in production. Before Marketplace submission, failure behavior should be validated in staging. Timeouts should trigger as expected. Retries should stop when designed to stop. Circuit breakers should open—and close—predictably. Equally important is environment consistency. Dev, Stage, and Prod environments should enforce the same resilience policies, even if scale differs. Otherwise, failures will appear only when customers are watching. Azure Chaos Studio provides controlled fault injection to test these scenarios intentionally. The goal is to confirm that systems behave consistently under stress. Reliability, ownership, and Marketplace readiness As a publisher, you are responsible for resilient defaults, protection against cascading failures, predictable failure modes, and documented service expectations. Customers, in turn, remain responsible for the reliability of their downstream systems, environment‑level scaling, and internal monitoring. When this boundary is explicit, teams know where responsibility sits and how to respond when conditions change. When ownership is unclear, support escalations increase, accountability blurs, and confidence drops on both sides. Marketplace customers expect clarity about what your solution controls, what it depends on, and how issues are handled when they arise. That clarity directly shapes Marketplace readiness. Reliable execution paths influence certification reviews, determine whether enterprise pilots progress, and establish long‑term operational confidence. During trials, predictable behavior feels professional. It reduces surprise costs, shortens evaluation cycles, and makes adoption decisions easier. In this way, reliability acts as a trust signal and a sales enabler. When customers see that ownership is well-defined and failure is handled intentionally, AI adoption through Marketplace feels safe, bounded, and ready to scale. What’s next in the journey Once execution paths are resilient, your solution’s behavior becomes visible. Circuit breaker transitions, retry frequency, timeout events, and error propagation turn into operational signals that show how your AI app or agent behaves under real load and across customer tenants. This foundation enables the next layer of operational maturity—observability, safe deployment practices, CI/CD for agents, and ongoing evaluation—so you can understand behavior end‑to‑end and operate confidently as usage grows. Reliability makes AI adoption safe; observability makes it sustainable. Key Resources See curated, step-by-step guidance to help you build, publish, or sell your app or agent (no matter where you start) in App Advisor Quick-Start Development Toolkit can connect you with code templates for AI solution patterns Microsoft AI Envisioning Day Events How to build and publish AI apps and agents for Microsoft Marketplace Get over $126K USD in benefits and technical consultations to help you replicate and publish your app with ISV Success139Views1like0Comments