azure
7772 Topics

Published agent from Foundry doesn't work at all in Teams and M365 Copilot
I've switched to the new version of Azure AI Foundry (New) and created a project there. Within this project, I created an Agent and connected two custom MCP servers to it. The agent works correctly inside Foundry Playground and responds to all test queries as expected.

My goal was to make this agent available for my organization in Microsoft Teams / Microsoft 365 Copilot, so I followed all the steps described in the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/publish-copilot?view=foundry

Issue description

The first problems started at Step 8 (Publishing the agent).

Organization scope publishing

I published the agent using Organization scope. The agent appeared in Microsoft Admin Center in the list of agents. However, when an administrator from my organization attempted to approve it, the approval always failed with a generic error: “Sorry, something went wrong”

No diagnostic information, error codes, or logs were provided. We tried recreating and republishing the agent multiple times, but the result was always the same.

Shared scope publishing

As a workaround, I published the agent using Shared scope. In this case, the agent finally appeared in Microsoft Teams and Microsoft 365 Copilot. I can now see the agent here:
- Microsoft Teams → Copilot
- Microsoft Teams → Applications → Manage applications

However, this revealed the main issue.

Main problem

The published agent cannot complete any query in Teams, despite the fact that:
- The agent works perfectly in Foundry Playground
- The agent responds correctly to the same prompts before publishing

In Teams, every query results in messages such as: “Sorry, something went wrong. Try to complete a query later.”

Simplification test

To exclude MCP or instruction-related issues, I performed the following:
- Disabled all MCP tools
- Removed all complex instructions
- Left only a minimal system prompt: “When the user types 123, return 456”

I then republished the agent. The agent appeared in Teams again, but the behavior did not change: it does not respond at all.

Permissions warning in Teams

When I go to: Teams → Applications → Manage Applications → My agent → View details

I see a red warning label: “Permissions needed. Ask your IT admin to add InfoConnect Agent to this team/chat/meeting.”

This message is confusing because:
- The administrator has already added all required permissions
- All relevant permissions were granted in Microsoft Entra ID
- Admin consent was provided

Because of this warning, I also cannot properly share the agent with my colleagues.

Additional observation

I have a similar agent configured in Copilot Studio:
- It shows the same permissions warning
- However, that agent still responds correctly in Teams
- It can also successfully call some MCP tools

This suggests that the issue is specific to Azure AI Foundry agents, not to Teams or tenant-wide permissions in general.

Steps already taken to resolve the issue

- Configured all required RBAC roles in Azure Portal according to: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/rbac-foundry?view=foundry-classic
- During publishing, an agent-bot application was automatically created. I added my account to this bot with the Azure AI User role
- I also assigned Azure AI User to:
  - The project’s Managed Identity
  - The project resource itself
- Verified all permissions related to AI agents publishing in:
  - Microsoft Admin Center
  - Microsoft Teams Admin Center
- Simplified and republished the agent multiple times
- Deleted the automatically created agent-bot and allowed Foundry to recreate it
- Created a new Foundry project, configured several simple agents, and published them; the same issue occurs
- Tried publishing with different models: gpt-4.1, o4-mini
- Manually configured permissions in: Microsoft Entra ID → App registrations / Enterprise applications → API permissions. Added both Delegated and Application permissions and granted Admin consent
- Added myself and my colleagues as Azure AI User in: Foundry → Project → Project users
- Followed all steps mentioned in this related discussion: https://techcommunity.microsoft.com/discussions/azure-ai-foundry-discussions/unable-to-publish-foundry-agent-to-m365-copilot-or-teams/4481420

Questions

- How can I make a Foundry agent work correctly in Microsoft Teams?
- Why does the agent fail to process requests in Teams while working correctly in Foundry?
- What does the “Permissions needed” warning actually mean for Foundry agents?
- How can I properly share the agent with other users in my organization?

Any guidance, diagnostics, or clarification on the correct publishing and permission model for Foundry agents in Teams would be greatly appreciated.
Simplifying Code Signing for Windows Apps: Artifact Signing (GA)
Trusted Signing is now Artifact Signing, and it’s officially Generally Available! Artifact Signing is a fully managed, end-to-end code signing service that makes it easier than ever for Windows application developers to sign their apps securely and efficiently.

As Artifact Signing rebrands, customers will see changes over the next few weeks. Please refer to our Learn docs for the most up-to-date information.

What is Artifact Signing?

Code signing has traditionally been a complex and manual process. Managing certificates, securing keys, and integrating signing into build pipelines can slow teams down and introduce risk. Artifact Signing changes that by offering a fully managed, end-to-end solution that automates certificate management, enforces strong security controls, and integrates seamlessly with your existing developer tools.

With zero-touch certificate management, verified identity, role-based access control, and support for multiple trust models, Artifact Signing makes it easier than ever to build and distribute secure Windows applications. Whether you're shipping consumer apps or internal tools, Artifact Signing helps you deliver software that’s secure.

Security Made Simple

Zero-Touch Certificate Management

No more manual certificate handling. The service provides “zero-touch” certificate management, meaning it handles the creation, protection, and even automatic rotation of code signing certificates on your behalf. These certificates are short-lived and auto-renewed behind the scenes, giving you tighter control, faster revocation when needed, and eliminating the risks associated with long-lived certs. Your signing reputation isn’t tied to a single certificate. Instead, it’s anchored to your verified identity in Azure, and every signature reflects that verified identity.

Verified Identity

Identity validation with Artifact Signing ensures your app’s digital signature displays accurate and verified publisher information.
Once validated, your identity details, such as your individual or organization name, are included in the certificate. This means your signed apps will show a verified publisher name, not the dreaded “Unknown Publisher” warning. The entire validation process happens in the Azure portal. You simply submit your individual or organization details, and in some cases, upload supporting documents like business registration papers. Most validations are completed within a few business days, and once approved, you’re ready to start signing your apps immediately.

Secure and Controlled Signing (RBAC)

Artifact Signing enforces Azure’s Role-Based Access Control (RBAC) to secure signing activities. You can assign specific Azure roles to accounts or CI agents that use your Artifact Signing resource, ensuring only authorized developers or build pipelines can initiate signing operations. This tight access control helps prevent unauthorized or rogue signatures.

Full Telemetry and Audit Logs

Every signing request is tracked. You can see what was signed, when, and by whom in the Azure portal. This logging not only helps with compliance and auditing needs but also enables fast remediation if an issue arises. For example, if you discover a particular signing certificate was used in error or compromised, you can quickly revoke it directly from the portal. The short-lived nature of certificates in Artifact Signing further limits the window of any potential misuse.

Artifact Signing gives you enterprise-grade security controls out of the box: strong protection of keys, fine-grained access control, and visibility. For developers and companies concerned about supply chain security, this dramatically reduces risk compared to handling signing keys manually.

Built for Developers

Artifact Signing was built to slot directly into developers’ existing workflows.
You don’t need to overhaul how you build or release software; just plug Artifact Signing into your toolchain:

- GitHub Actions & Azure DevOps: The service includes first-class support for modern CI/CD. An official GitHub Action is available for easy integration into your workflow YAML, and Azure DevOps has tasks for pipelines. With these tools, every Windows app build can automatically sign binaries or installers, with no manual steps required. Since signing credentials are managed in Azure, you avoid storing secrets in your repository.
- Visual Studio & MSBuild: Use the Artifact Signing client with SignTool to integrate signing into publish profiles or post-build steps. Once the Artifact Signing client is installed, Visual Studio or MSBuild can invoke SignTool as usual, with signatures routed through the Artifact Signing service.
- SignTool / CLI: Developers using scripts or custom build systems can continue using the familiar signtool.exe command. After a one-time setup, your existing SignTool commands will sign via the cloud service. The actual file signing on your build machine uses a digest signing approach: SignTool computes a hash of your file and sends that to the Artifact Signing service, which returns a signature. The file itself isn’t uploaded, preserving confidentiality and speed. This way, integrating Artifact Signing can be as simple as adding a couple of lines to your build script to point SignTool at Azure.
- PowerShell & SDK: For advanced automation or custom scenarios, Artifact Signing supports PowerShell modules and an SDK. These tools allow you to script signing operations, bulk-sign files, or integrate signing into specialized build systems.

The Right Trust for the Right Audience

Artifact Signing supports multiple trust models to suit different distribution scenarios.
You can choose between Public Trust and Private Trust for your code signing, depending on your app’s audience, with Test Signing available for development:

- Public Trust: This is the standard model for software intended to go to consumers. When you use Public Trust signing, the certificates come from a Microsoft CA that’s part of the Microsoft Trusted Root Program. Apps signed under Public Trust are recognized by Windows as coming from a known publisher, enabling a smooth installation experience when security features such as Smart App Control and SmartScreen are enabled.
- Private Trust: This model is for internal or enterprise apps. These certificates aren’t publicly trusted but are instead meant to work with Windows Defender Application Control (App Control for Business) policies. This is ideal for line-of-business applications, internal tools, or scenarios where you want to tightly control who trusts the app. Artifact Signing’s Private Trust model is the modern, expanded evolution of Microsoft’s older Device Guard Signing Service (DGSS), delivering the same ability to sign internal apps but with ease of access and expanded capabilities.
- Test Signing: Useful for development and testing. These certificates mimic real signatures but aren’t publicly trusted, allowing you to validate your signing setup in non-production environments before releasing your app.

Note on Expanded Scenario Support: Artifact Signing supports additional certificate profiles, including those for VBS enclaves and Private Trust CI Policies. In addition, there is a new preview feature for signing container images using the Notary v2 standard from the CNCF Notary project. This enables developers to sign Docker/OCI container images stored in Azure Container Registry using tools like the notation CLI, backed by Artifact Signing.

Having all trust models in one service means you can manage all your signing needs in one place.
Whether your code is destined for the world or just your organization, Artifact Signing makes it easy to ensure it is signed with an appropriate level of trust.

Misuse and Abuse Management

Artifact Signing is engineered with robust safeguards to counter certificate misuse and abuse. The signing platform employs active threat intelligence monitoring to continuously detect suspicious signing activity in real time. The service also emphasizes prevention: certificates are short-lived (renewed daily and valid for only 72 hours), which means any certificate used maliciously can be swiftly revoked without impacting software signed outside its brief lifetime. When misuse is confirmed, Artifact Signing quickly revokes the certificate and suspends the subscriber’s account, removing trust from the malicious code’s signature and stopping further abuse. These measures adhere to strict industry standards for responsible certificate governance.

By combining real-time threat detection, built-in preventive controls, and rapid response policies, Artifact Signing gives Windows app developers confidence that any attempt to abuse the platform will be quickly identified and contained, helping protect the broader software ecosystem from emerging threats.

Availability and What’s Next

Check out the upcoming “What’s New” section in the Artifact Signing Learn Docs for updates on supported file types, new region availability, and more. Microsoft will continue evolving the service to meet developer needs.

Conclusion: Enhancing Trust and Security for All Windows Apps

Artifact Signing empowers Windows developers to sign their applications with ease and confidence. It integrates effortlessly into your development tools, automates the heavy lifting of certificate management, and ensures every app carries a verified digital signature backed by Microsoft’s Certificate Authorities. For users, it means peace of mind.
For developers and organizations, it means fewer headaches, stronger protection against supply chain threats, and complete control over who signs what and when.

Now that Artifact Signing is generally available, it’s a must-have for building trustworthy Windows software. It reflects Microsoft’s commitment to a secure, inclusive ecosystem and brings modern security features like Smart App Control and App Control for Business within reach, simply by signing your code. Whether you're shipping consumer apps or internal tools, Artifact Signing helps you deliver software that’s both easy to install and tough to compromise.

Trusted Signing Public Preview Update
Nearly a year ago, we announced the Public Preview of Trusted Signing, letting organizations with 3 years or more of verifiable history onboard to the service and get a fully managed code signing experience that simplifies the work of Windows app developers. Over the past year, we’ve announced new features including the Preview support for Individual Developers, and we highlighted how the service contributes to the Windows Security story at Microsoft BUILD 2024 in the Unleash Windows App Security & Reputation with Trusted Signing session.

During the Public Preview, we have obtained valuable insights on the service features from our customers, as well as insights into the developer experience and the experience for Windows users. As we incorporate this feedback and learning into our General Availability (GA) release, we are limiting new customer subscriptions as part of the public preview. This approach will allow us to focus on refining the service based on the feedback and data collected during the preview phase.

The limit on new customer subscriptions for Trusted Signing will take effect Wednesday, April 2, 2025, and will make the service available only to US and Canada-based organizations with 3 years or more of verifiable history. Onboarding for individual developers and all other organizations will not be directly available for the remainder of the preview, and we look forward to expanding the service availability as we approach GA.

Note that this announcement does not impact any existing subscribers of Trusted Signing, and the service will continue to be available for these subscribers as it has been throughout the Public Preview.

For additional information about Trusted Signing, please refer to Trusted Signing documentation | Microsoft Learn and Trusted Signing FAQ | Microsoft Learn.

AI Didn’t Break Your Production — Your Architecture Did
Most AI systems don’t fail in the lab. They fail the moment production touches them.

I’m Hazem Ali, Microsoft AI MVP, Principal AI & ML Engineer / Architect, and Founder & CEO of Skytells. With a strong foundation in AI and deep learning, from low-level fundamentals to production scale, backed by rigorous cybersecurity and software engineering expertise, I design and deliver enterprise AI systems end-to-end.

I often speak about what happens after the pilot goes live: real users arrive, data drifts, security constraints tighten, and incidents force your architecture to prove it can survive. My focus is building production AI with a security-first mindset: identity boundaries, enforceable governance, incident-ready operations, and reliability at scale.

My mission is simple: Architect and engineer secure AI systems that operate safely, predictably, and at scale in production.

And here’s the hard truth:

AI initiatives rarely fail because the model is weak. They fail because the surrounding architecture was never engineered for production reality.
- Hazem Ali

You see this clearly when teams bolt AI onto an existing platform. In Azure-based environments, the foundation can be solid: identity, networking, governance, logging, policy enforcement, and scale primitives. But that doesn’t make the AI layer production-grade by default. It becomes production-grade only when the AI runtime is engineered like a first-class subsystem with explicit boundaries, control points, and designed failure behavior.

A quick moment from the field

I still remember one rollout that looked perfect on paper. Latency was fine. Error rate was low. Dashboards were green. Everyone was relaxed.

Then a single workflow started creating the wrong tickets. It was not failing or crashing. It was confidently doing the wrong thing at scale. It took hours before anyone noticed, because nothing was broken in the traditional sense.

When we finally traced it, the model was not the root cause.
The system had no real gates, no replayable trail, and tool execution was too permissive. The architecture made it easy for a small mistake to become a widespread mess. That is the gap I’m talking about in this article.

Production Failure Taxonomy

This is the part most teams skip because it is not exciting, and it is not easy to measure in a demo. When AI fails in production, the postmortem rarely says the model was bad. It almost always points to missing boundaries, over-privileged execution, or decisions nobody can trace.

So if your AI can take actions, you are no longer shipping a chat feature. You are operating a runtime that can change state across real systems, which means reliability is not just uptime. It is the ability to limit blast radius, reproduce decisions, and stop or degrade safely when uncertainty or risk spikes.

You can usually tell early whether an AI initiative will survive production. Not because the model is weak, but because the failure mode is already baked into the architecture. Here are the ones I see most often.

1. Healthy systems that are confidently wrong
Uptime looks perfect. Latency is fine. And the output is wrong. This is dangerous because nothing alerts until real damage shows up.

2. The agent ends up with more authority than the user
The user asks a question. The agent has tools and credentials. Now it can do things the user never should have been able to do in that moment.

3. Each action is allowed, but the chain is not
Read data, create ticket, send message. All approved individually. Put together, it becomes a capability nobody reviewed.

4. Retrieval becomes the attack path
Most teams worry about prompt injection. Fair. But a poisoned or stale retrieval layer can be worse, because it feeds the model the wrong truth.

5. Tool calls turn mistakes into incidents
The moment AI can change state (config, permissions, emails, payments, or data), a mistake is no longer a bad answer. It is an incident.

6. Retries duplicate side effects
Timeouts happen. Retries happen. If your tool calls are not safe to repeat, you will create duplicate tickets, refunds, emails, or deletes.

Next, let’s talk about what changes when you inject probabilistic behavior into a deterministic platform.

In the Field: Building and Sharing Real-World AI

In December 2025, I had the chance to speak and engage with builders across multiple AI and technology events, sharing what I consider the most valuable part of the journey: the engineering details that show up when AI meets production reality.

This photo captures one of those moments: real conversations with engineers, architects, and decision-makers about what it truly takes to ship production-grade AI.

During my session, “Designing Scalable and Secure Architecture at the Enterprise Scale,” I walked through the ideas in this article live on stage, then went deeper into the engineering reality behind them: from zero-trust boundaries and runtime policy enforcement to observability, traceability, and safe failure design. The goal wasn’t to talk about “AI capability,” but to show how to build AI systems that operate safely and predictably at scale in production.

Deterministic platforms, probabilistic behavior

Most production platforms are built for deterministic behavior: defined contracts, predictable services, stable outputs. AI changes the physics. You introduce probabilistic behavior into deterministic pipelines and your failure modes multiply. An AI system can be confidently wrong while still looking “healthy” through basic uptime dashboards.

That’s why reliability in production AI is rarely about “better prompts” or “higher model accuracy.” It’s about engineering the right control points: identity boundaries, governance enforcement, behavioral observability, and safe degradation. In other words: the model is only one component. The system is the product.

Production AI Control Plane

Here’s the thing.
Once you inject probabilistic behavior into a deterministic platform, you need more than prompts and endpoints. You need a control plane. Not a fancy framework. Just a clear place in the runtime where decisions get bounded, actions get authorized, and behavior becomes explainable when something goes wrong. This is the simplest shape I have seen work in real enterprise systems.

The control plane components

- Orchestrator: Owns the workflow. Decides what happens next, and when the system should stop.
- Retrieval: Brings in context, but only from sources you trust and can explain later.
- Prompt assembly: Builds the final input to the model, including constraints, policy signals, and tool schemas.
- Model call: Generates the plan or the response. It should never be trusted to execute directly.
- Policy Enforcement Point: The gate before any high-impact step. It answers: is this allowed, under these conditions, with these constraints?
- Tool Gateway: The firewall for actions. Scopes every operation, validates inputs, rate-limits, and blocks unsafe calls.
- Audit log and trace store: A replayable chain for every request. If you cannot replay it, you cannot debug it.
- Risk engine: Detects prompt injection signals, anomalous sessions, and uncertainty spikes, and switches the runtime into safer modes.
- Approval flow: For the few actions that should never be automatic. It is the line between assistance and damage.

If you take one idea from this section, let it be this. The model is not where you enforce safety. Safety lives in the control plane.

Next, let’s talk about the most common mistake teams make right after they build the happy-path pipeline: treating AI like a feature.

The common architectural trap: treating AI like a feature

Many teams ship AI like a feature: prompt → model → response. That structure demos well. In production, it collapses the moment AI output influences anything stateful: tickets, approvals, customer messaging, remediation actions, or security decisions.
At that point, you’re not “adding AI.” You’re operating a semi-autonomous runtime. The engineering questions become non-negotiable:

- Can we explain why the system responded this way?
- Can we bound what it’s allowed to do?
- Can we contain impact when it’s wrong?
- Can we recover without human panic?

If those answers aren’t designed into the architecture, production becomes a roulette wheel.

Governance is not a document. It’s a runtime enforcement capability

Most governance programs fail because they’re implemented as late-stage checklists. In production, governance must live inside the execution path as an enforceable mechanism: a Policy Enforcement Point (PEP) that evaluates every high-impact step before it happens.

At the moment of execution, your runtime must answer a strict chain of authorization questions:

1. What tools is this agent attempting to call?
Every tool invocation is a privilege boundary. Your runtime must identify the tool, the operation, and the intended side effect (read vs write, safe vs state-changing).

2. Does the tool have the right permissions to run for this agent?
Even before user context, the tool itself must be runnable by the agent’s workload identity (service principal / managed identity / workload credentials). If the agent identity can’t execute the tool, the call is denied, period.

3. If the tool can run, is the agent permitted to use it for this user?
This is the missing piece in most systems: delegation. The agent might be able to run the tool in general, but not on behalf of this user, in this tenant, in this environment, for this task category. This is where you enforce:
- user role / entitlement
- tenant boundaries
- environment (prod vs staging)
- session risk level (normal vs suspicious)

4. If yes, which tasks/operations are permitted?
Tools are too broad. Permissions must be operation-scoped.
Not “Jira tool allowed.” But “Jira: create ticket only, no delete, no project-admin actions.”
Not “Database tool allowed.” But “DB: read-only, specific schema, specific columns, row-level filters.”
This is ABAC/RBAC + capability-based execution.

5. What data scope is allowed?
Even a permitted tool operation must be constrained by data classification and scope:
- public vs internal vs confidential vs PII
- row/column filters
- time-bounded access
- purpose limitation (“only for incident triage”)
If the system can’t express data scope at runtime, it can’t claim governance.

6. What operations require human approval?
Some actions are inherently high risk:
- payments/refunds
- changing production configs
- emailing customers
- deleting data
- executing scripts
The policy should return “REQUIRE_APPROVAL” with clear obligations (what must be reviewed, what evidence is required, who can approve).

7. What actions are forbidden under certain risk conditions?
Risk-aware policy is the difference between governance and theater. Examples:
- If prompt injection signals are high → disable tool execution
- If session is anomalous → downgrade to read-only mode
- If data is PII + user not entitled → deny and redact
- If environment is prod + request is destructive → block regardless of model confidence

The key engineering takeaway

Governance works only when it’s enforceable, runtime-evaluated, and capability-scoped:
- Agent identity answers: “Can it run at all?”
- Delegation answers: “Can it run for this user?”
- Capabilities answer: “Which operations exactly?”
- Data scope answers: “How much and what kind of data?”
- Risk gates + approvals answer: “When must it stop or escalate?”

If policy can’t be enforced at runtime, it isn’t governance. It’s optimism.

Safe Execution Patterns

Policy answers whether something is allowed. Safe execution answers what happens when things get messy. Because they will. Models time out. Retries happen. Inputs are adversarial. People ask for the wrong thing. Agents misunderstand.
And when tools can change state, small mistakes turn into real incidents. These patterns are what keep the system stable when the world is not.

Two-phase execution
Do not execute directly from a model output. First phase: propose a plan and a dry-run summary of what will change. Second phase: execute only after policy gates pass, and approval is collected if required.

Idempotency for every write
If a tool call can create, refund, email, delete, or deploy, it must be safe to retry. Every write gets an idempotency key, and the gateway rejects duplicates. This one change prevents a huge class of production pain.

Default to read-only when risk rises
When injection signals spike, when the session looks anomalous, when retrieval looks suspicious, the system should not keep acting. It should downgrade. Retrieve, explain, and ask. No tool execution.

Scope permissions to operations, not tools
Tools are too broad. Do not allow Jira. Allow create ticket in these projects, with these fields. Do not allow database access. Allow read-only on this schema, with row and column filters.

Rate limits and blast radius caps
Agents should have a hard ceiling. Max tool calls per request. Max writes per session. Max affected entities. If the cap is hit, stop and escalate.

A kill switch that actually works
You need a way to disable tool execution across the fleet in one move. When an incident happens, you do not want to redeploy code. You want to stop the bleeding.

If you build these in early, you stop relying on luck. You make failure boring, contained, and recoverable.

Think for scale, in the Era of AI for AI

I want to zoom out for a second, because this is the shift most teams still design around. We are not just adding AI to a product. We are entering a phase where parts of the system can maintain and improve themselves. Not in a magical way. In a practical, engineering way.
A self-improving system is one that can watch what is happening in production, spot a class of problems, propose changes, test them, and ship them safely, while leaving a clear trail behind it. It can improve code paths, adjust prompts, refine retrieval rules, update tests, and tighten policies. Over time, the system becomes less dependent on hero debugging at 2 a.m. What makes this real is the loop, not the model. Signals come in from logs, traces, incidents, drift metrics, and quality checks. The system turns those signals into a scoped plan. Then it passes through gates: policy and permissions, safe scope, testing, and controlled rollout. If something looks wrong, it stops, downgrades to read-only, or asks for approval. This is why scale changes. In the old world, scale meant more users and more traffic. In the AI for AI world, scale also means more autonomy. One request can trigger many tool calls. One workflow can spawn sub-agents. One bad signal can cause retries and cascades. So the question is not only can your system handle load. The question is can your system handle multiplication without losing control. If you want self-improving behavior, you need three things to be true: The system is allowed to change only what it can prove is safe to change. Every change is testable and reversible. Every action is traceable, so you can replay why it happened. When those conditions exist, self-improvement becomes an advantage. When they do not, self-improvement becomes automated risk. And this leads straight into governance, because in this era governance is not a document. It is the gate that decides what the system is allowed to improve, and under which conditions. Observability: uptime isn’t enough — you need traceability and causality Traditional observability answers: Is the service up. Is it fast. Is it erroring. That is table stakes. Production AI needs a deeper truth: why did it do that. 
Because the system can look perfectly healthy while still making the wrong decision. Latency is fine. Error rate is fine. Dashboards are green. And the output is still harmful. To debug that kind of failure, you need causality you can replay and audit:

Input → context retrieval → prompt assembly → model response → tool invocation → final outcome

Without this chain, incident response becomes guesswork. People argue about prompts, blame the model, and ship small patches that do not address the real cause. Then the same issue comes back under a different prompt, a different document, or a slightly different user context. The practical goal is simple. Every high-impact action should have a story you can reconstruct later. What did the system see. What did it pull. What did it decide. What did it touch. And which policy allowed it. When you have that, you stop chasing symptoms. You can fix the actual failure point, and you can detect drift before users do.

RAG Governance and Data Provenance

Most teams treat retrieval as a quality feature. In production, retrieval is a security boundary. Because the moment a document enters the context window, it becomes part of the system’s brain for that request. If retrieval pulls the wrong thing, the model can behave perfectly and still lead you to a bad outcome. I learned this the hard way. I have seen systems where the model was not the problem at all. The problem was a single stale runbook that looked official, ranked high, and quietly took over the decision. Everything downstream was clean. The agent followed instructions, called the right tools, and still caused damage because the truth it was given was wrong. I keep repeating one line in reviews, and I mean it every time:

Retrieval is where truth enters the system. If you do not control that, you are not governing anything.
- Hazem Ali

So what makes retrieval safe enough for enterprise use?
Provenance on every chunk Every retrieved snippet needs a label you can defend later: source, owner, timestamp, and classification. If you cannot answer where it came from, you cannot trust it for actions. Staleness budgets Old truth is a real risk. A runbook from last quarter can be more dangerous than no runbook at all. If content is older than a threshold, the system should say it is old, and either confirm or downgrade to read-only. No silent reliance. Allowlisted sources per task Not all sources are valid for all jobs. Incident response might allow internal runbooks. Customer messaging might require approved templates only. Make this explicit. Retrieval should not behave like a free-for-all search engine. Scope and redaction before the model sees it Row and column limits, PII filtering, secret stripping, tenant boundaries. Do it before prompt assembly, not after the model has already seen the data. Citation requirement for high-impact steps If the system is about to take a high-impact action, it should be able to point to the sources that justified it. If it cannot, it should stop and ask. That one rule prevents a lot of confident nonsense. Monitor retrieval like a production dependency Track which sources are being used, which ones cause incidents, and where drift is coming from. Retrieval quality is not static. Content changes. Permissions change. Rankings shift. Behavior follows. When you treat retrieval as governance, the system stops absorbing random truth. It consumes controlled truth, with ownership, freshness, and scope. That is what production needs. Security: API keys aren’t a strategy when agents can act The highest-impact AI incidents are usually not model hacks. They are architectural failures: over-privileged identities, blurred trust boundaries, unbounded tool access, and unsafe retrieval paths. Once an agent can call tools that mutate state, treat it like a privileged service, not a chatbot. 
- Least privilege by default
- Explicit authorization boundaries
- Auditable actions
- Containment-first design
- Clear separation between user intent and system authority
This is how you prevent a prompt injection from turning into a system-level breach. If you want the deeper blueprint and the concrete patterns for securing agents in practice, I wrote a full breakdown here: Zero-Trust Agent Architecture: How to Actually Secure Your Agents

What “production-ready AI” actually means

Production-ready AI is not defined by a benchmark score. It’s defined by survivability under uncertainty. A production-grade AI system can:
- Explain itself with traceability.
- Enforce policy at runtime.
- Contain blast radius when wrong.
- Degrade safely under uncertainty.
- Recover with clear operational playbooks.
If your system can’t answer “how does it fail?” you don’t have production AI yet. You have a prototype with unmanaged risk.

How Azure helps you engineer production-grade AI

Azure doesn’t “solve” production-ready AI by itself. It gives you the primitives to engineer it correctly. The difference between a prototype and a survivable system is whether you translate those primitives into runtime control points: identity, policy enforcement, telemetry, and containment.

1. Identity-first execution (kill credential sprawl, shrink blast radius)
A production AI runtime should not run on shared API keys or long-lived secrets. In Azure environments, the most important mindset shift is: every agent/workflow must have an identity and that identity must be scoped.
Guidance
- Give each agent/orchestrator a dedicated identity (least privilege by default).
- Separate identities by environment (prod vs staging) and by capability (read vs write).
- Treat tool invocation as a privileged service call, never “just a function.”
Why this matters
If an agent is compromised (or tricked via prompt injection), identity boundaries decide whether it can read one table or take down a whole environment.

2. Policy as enforcement (move governance into the execution path)
This article’s core idea, that governance is runtime enforcement, maps directly onto Azure’s broader governance philosophy: policies must be enforceable, not advisory.
Guidance
- Create an explicit Policy Enforcement Point (PEP) in your agent runtime.
- Make the PEP decision mandatory before executing any tool call or data access.
- Use “allow + obligations” patterns: allow only with constraints (redaction, read-only mode, rate limits, approval gates, extra logging).
Why this matters
Governance fails when it’s a document. It works when it’s compiled into runtime decisions.

3. Observability that explains behavior
Azure’s telemetry stack is valuable because it’s designed for distributed systems: correlation, tracing, and unified logs. Production AI needs the same plus decision traceability.
Guidance
- Emit a trace for every request across: retrieval → prompt assembly → model call → tool calls → outcome.
- Log policy decisions (allow/deny/require approval) with policy version + obligations applied.
- Capture “why” signals: risk score, classifier outputs, injection signals, uncertainty indicators.
Why this matters
When incidents happen, you don’t just debug latency; you debug behavior. Without causality, you can’t root-cause drift or containment failures.

4. Zero-trust boundaries for tools and data
Azure environments tend to be strong at network segmentation and access control. That foundation is exactly what AI systems need because AI introduces adversarial inputs by default.
Guidance
- Put a Tool Gateway in front of tools (Jira, email, payments, infra) and enforce scopes there.
- Restrict data access by classification (PII/secret zones) and enforce row/column constraints.
- Degrade safely: if risk is high, drop to read-only, disable tools, or require approval.
Why this matters
Prompt injection doesn’t become catastrophic when your system has hard boundaries and graceful failure modes.

5. Practical “production-ready” checklist (Azure-aligned, engineering-first)
If you want a concrete way to apply this:
- Identity: every runtime has a scoped identity; no shared secrets
- PEP: every tool/data action is gated by policy, with obligations
- Traceability: full chain captured and correlated end-to-end
- Containment: safe degradation + approval gates for high-risk actions
- Auditability: policy versions and decision logs are immutable and replayable
- Environment separation: prod ≠ staging identities, tools, and permissions
Outcome
This is how you turn “we integrated AI” into “we operate AI safely at scale.”

Operating Production AI

A lot of teams build the architecture and still struggle, because production is not a diagram. It is a living system. So here is the operating model I look for when I want to trust an AI runtime in production.

The few SLOs that actually matter

Trace completeness
For high-impact requests, can we reconstruct the full chain every time, without missing steps.

Policy coverage
What percentage of tool calls and sensitive reads pass through the policy gate, with a recorded decision.

Action correctness
Not model accuracy. Real-world correctness. Did the system take the right action, on the right target, with the right scope.

Time to contain
When something goes wrong, how fast can we stop tool execution, downgrade to read-only, or isolate a capability.

Drift detection time
How quickly do we notice behavioral drift before users do.
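As a rough illustration of how the first two SLOs could be measured, the sketch below computes trace completeness and policy coverage from in-memory trace records. The RequestTrace and ToolCall types, the stage names, and both function names are assumptions made up for this example. They are not part of any Azure SDK, and a real system would read these records from its telemetry store instead.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical stage names, mirroring the causality chain described earlier.
REQUIRED_STAGES = (
    "input",
    "retrieval",
    "prompt_assembly",
    "model_response",
    "tool_invocation",
    "outcome",
)


@dataclass
class ToolCall:
    tool: str
    # "allow", "deny", "require_approval", or None if the call bypassed the gate.
    policy_decision: Optional[str] = None


@dataclass
class RequestTrace:
    request_id: str
    stages: Set[str] = field(default_factory=set)
    tool_calls: List[ToolCall] = field(default_factory=list)


def trace_completeness(traces: List[RequestTrace]) -> float:
    """Fraction of traces that contain every required stage."""
    if not traces:
        return 1.0
    complete = sum(
        1 for t in traces if all(stage in t.stages for stage in REQUIRED_STAGES)
    )
    return complete / len(traces)


def policy_coverage(traces: List[RequestTrace]) -> float:
    """Fraction of tool calls that carry a recorded policy decision."""
    calls = [call for t in traces for call in t.tool_calls]
    if not calls:
        return 1.0
    gated = sum(1 for call in calls if call.policy_decision is not None)
    return gated / len(calls)
```

Treating an empty set of traces as fully covered is a choice made for this sketch; a stricter setup might report "no data" as its own alert condition.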
The runbooks you must have If you operate agents, you need simple playbooks for predictable bad days: Injection spike → safe mode, block tool execution, force approvals Retrieval poisoning suspicion → restrict sources, raise freshness requirements, require citations Retry storm → enforce idempotency, rate limits, and circuit breakers Tool gateway instability → fail closed for writes, degrade safely for reads Model outage → fall back to deterministic paths, templates, or human escalation Clear ownership Someone has to own the runtime, not just the prompts. Platform owns the gates, tool gateway, audit, and tracing Product owns workflows and user-facing behavior Security owns policy rules, high-risk approvals, and incident procedures When these pieces are real, production becomes manageable. When they are not, you rely on luck and hero debugging. The 60-second production readiness checklist If you want a fast sanity check, here it is. Every agent has an identity, scoped per environment No shared API keys for privileged actions Every tool call goes through a policy gate with a logged decision Permissions are scoped to operations, not whole tools Writes are idempotent, retries cannot duplicate side effects Tool gateway validates inputs, scopes data, and rate-limits actions There is a safe mode that disables tools under risk There is a kill switch that stops tool execution across the fleet Retrieval is allowlisted, provenance-tagged, and freshness-aware High-impact actions require citations or they stop and ask Audit logs are immutable enough to trust later Traces are replayable end-to-end for any incident If most of these are missing, you do not have production AI yet. You have a prototype with unmanaged risk. 
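Several of the checklist items above (a policy gate with a logged decision, operation-scoped permissions, idempotent writes, a safe mode, a kill switch, and a call cap) can be sketched together in one small gateway. This is a minimal in-memory illustration, not a reference implementation: the ToolGateway class, its fields, and the "tool:operation" naming are all hypothetical, and a production gateway would persist its decision log and idempotency keys outside the process.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple


@dataclass
class ToolGateway:
    # Hypothetical allowlist of (agent_id, "tool:operation") pairs.
    allowed: Set[Tuple[str, str]]
    max_calls_per_request: int = 5
    kill_switch: bool = False  # when True, nothing executes
    safe_mode: bool = False    # when True, writes are denied
    decisions: List[dict] = field(default_factory=list)       # decision log
    _results: Dict[str, Any] = field(default_factory=dict)    # idempotency cache
    _calls: Dict[str, int] = field(default_factory=dict)      # calls per request

    def _decide(self, agent_id: str, request_id: str, operation: str, is_write: bool) -> str:
        if self.kill_switch:
            return "deny:kill_switch"
        if (agent_id, operation) not in self.allowed:
            return "deny:not_permitted"
        if is_write and self.safe_mode:
            return "deny:safe_mode"
        if self._calls.get(request_id, 0) >= self.max_calls_per_request:
            return "deny:call_cap"
        return "allow"

    def execute(
        self,
        agent_id: str,
        request_id: str,
        operation: str,
        is_write: bool,
        idempotency_key: str,
        action: Callable[[], Any],
    ) -> Any:
        decision = self._decide(agent_id, request_id, operation, is_write)
        # Every call is logged, whether it is allowed or denied.
        self.decisions.append(
            {"agent": agent_id, "request": request_id,
             "operation": operation, "decision": decision}
        )
        if decision != "allow":
            raise PermissionError(decision)
        # A retried write with a known key replays the stored result
        # instead of repeating the side effect.
        if is_write and idempotency_key in self._results:
            return self._results[idempotency_key]
        self._calls[request_id] = self._calls.get(request_id, 0) + 1
        result = action()
        if is_write:
            self._results[idempotency_key] = result
        return result
```

One design choice to note: the article describes the gateway rejecting duplicate writes, while this sketch returns the stored result for a repeated key so that a retry looks like a success to the caller. Either behavior prevents the duplicate side effect; which one fits depends on how callers handle retries.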
A quick note

In Azure-based enterprises, you already have strong primitives that mirror the mindset production AI requires: identity-first access control (Microsoft Entra ID), secure workload authentication patterns (managed identities), and deep telemetry foundations (Azure Monitor / Application Insights). The key is translating that discipline into the AI runtime so governance, identity, and observability aren’t external add-ons, but part of how AI executes and acts.

Closing

Models will keep evolving. Tooling will keep improving. But enterprise AI success still comes down to systems engineering. If you’re building production AI today, what has been the hardest part in your environment: governance, observability, security boundaries, or operational reliability? If you’re dealing with deep technical challenges around production AI, agent security, RAG governance, or operational reliability, feel free to connect with me on LinkedIn. I’m open to technical discussions and architecture reviews. Thanks for reading.

Hazem Ali

Introducing native Service Bus message publishing from Azure API Management (Preview)
We’re excited to announce a preview capability in Azure API Management (APIM): you can now send messages directly to Azure Service Bus from your APIs using a built-in policy. This enhancement, currently in public preview, simplifies how you connect your API layer with event-driven and asynchronous systems, helping you build more scalable, resilient, and loosely coupled architectures across your enterprise.

Why this matters

Modern applications increasingly rely on asynchronous communication and event-driven designs. With this new integration:
- Any API hosted in API Management can publish to Service Bus, with no SDKs, custom code, or middleware required.
- Partners, clients, and IoT devices can send data through standard HTTP calls, even if they don’t support AMQP natively.
- You stay in full control with authentication, throttling, and logging managed centrally in API Management.
- Your systems scale more smoothly by decoupling front-end requests from backend processing.

How it works

The new send-service-bus-message policy allows API Management to forward payloads from API calls directly into Service Bus queues or topics.

High-level flow
1. A client sends a standard HTTP request to your API endpoint in API Management.
2. The policy executes and sends the payload as a message to Service Bus.
3. Downstream consumers such as Logic Apps, Azure Functions, or microservices process those messages asynchronously.
All configurations happen in API Management. No code changes or new infrastructure are required.

Getting started

You can try it out in minutes:
1. Set up a Service Bus namespace and create a queue or topic.
2. Enable a managed identity (system-assigned or user-assigned) on your API Management instance.
3. Grant the identity the “Azure Service Bus Data Sender” role in Azure RBAC, scoped to your queue or topic.
4. Add the policy to your API operation:

    <send-service-bus-message queue-name="orders">
        <payload>@(context.Request.Body.As<string>())</payload>
    </send-service-bus-message>

Once saved, each API call publishes its payload to the Service Bus queue or topic. 📖 Learn more.

Common use cases

This capability makes it easy to integrate your APIs into event-driven workflows:
- Order processing: queue incoming orders for fulfillment or billing.
- Event notifications: trigger internal workflows across multiple applications.
- Telemetry ingestion: forward IoT or mobile app data to Service Bus for analytics.
- Partner integrations: offer REST-based endpoints for external systems while maintaining policy-based control.
Each of these scenarios benefits from simplified integration, centralized governance, and improved reliability.

Secure and governed by design

The integration uses managed identities for secure communication between API Management and Service Bus, so no secrets are required. You can further apply enterprise-grade controls:
- Enforce rate limits, quotas, and authorization through APIM policies.
- Gain API-level logging and tracing for each message sent.
- Use Service Bus metrics to monitor downstream processing.
Together, these tools help you maintain a consistent security posture across your APIs and messaging layer.

Build modern, event-driven architectures

With this feature, API Management can serve as a bridge to your event-driven backbone. Start small by queuing a single API’s workload, or extend to enterprise-wide event distribution using topics and subscriptions. You’ll reduce architectural complexity while enabling more flexible, scalable, and decoupled application patterns.

Learn more: Get the full walkthrough and examples in the documentation 👉 here

Error when creating Assistant in Microsoft Foundry using Fabric Data Agent
I am facing an issue when using a Microsoft Fabric Data Agent integrated with the new Microsoft Foundry, and I would like your assistance to investigate it.

Scenario:
1. I created a Data Agent in Microsoft Fabric.
2. I connected this Data Agent as a Tool within a project in the new Microsoft Foundry.
3. I published the agent to Microsoft Teams and Copilot for Microsoft 365.
4. I configured the required Azure permissions, assigning the appropriate roles to the Foundry project Managed Identity (as shown in the attached evidence – Azure AI Developer and Azure AI User roles).

Issue:
When trying to use the published agent, I receive the following error:
Response failed with code tool_user_error: Create assistant failed. If issue persists, please use following identifiers in any support request:
ConversationId = PQbM0hGUvMF0X5EDA62v3-br
activityId = PQbM0hGUvMF0X5EDA62v3-br|0000000

Additional notes:
• Permissions appear to be correctly configured in Azure.
• The error occurs during the assistant creation/execution phase via Foundry after publishing.
• The same behavior occurs both in Teams and in Copilot for Microsoft 365.

Could you please verify:
• Whether there are any additional permissions required when using Fabric Data Agents as Tools in Foundry;
• If there are any known limitations or specific requirements for publishing to Teams/Copilot M365;
• And analyze the error identifiers provided above.

I appreciate your support and look forward to your guidance on how to resolve this issue.

Microsoft Ignite- Meaningful Announcements That Create Real Movement
Now that we are officially in 2026, I want to reflect back on Microsoft Ignite, as it was more energized than ever. There is real momentum right now across Microsoft Marketplace, and the most meaningful shift I am seeing is around the REO (Reseller Enabled Offer) capability. This is opening the door for partners to take Marketplace offers and bring them directly into new regions, new customer segments, and new routes to market without adding operational friction for either side. A true channel selling motion!

For ISVs, REO means you can authorize trusted partners to resell your Marketplace offer in the way that works best for the customer. You no longer need to manage every deal yourself. For partners, it means instant access to in-demand AI and security solutions that customers are already asking for. It removes barriers, it speeds up the process, and it connects the ecosystem in a much more natural way.

If anyone in the community is exploring REO, private offers, multiparty models, or Marketplace strategy in general, please reach out. I would love to discuss. Looking forward to connecting with everyone in the new year.

justinroyal

Microsoft BizTalk Server Product Lifecycle Update
For more than 25 years, Microsoft BizTalk Server has supported mission-critical integration workloads for organizations around the world. From business process automation and B2B messaging to connectivity across industries such as financial services, healthcare, manufacturing, and government, BizTalk Server has played a foundational role in enterprise integration strategies. To help customers plan confidently for the future, Microsoft is sharing an update to the BizTalk Server product lifecycle and long-term support timelines. BizTalk Server 2020 will be the final version of BizTalk Server.

Guidance to support long-term planning for mission-critical workloads

This announcement does not change existing support commitments. Customers can continue to rely on BizTalk Server for many years ahead, with a clear and predictable runway to plan modernization at a pace that aligns with their business and regulatory needs.

Lifecycle Phase | End Date | What’s Included
Mainstream Support | April 11, 2028 | Security and non-security updates, plus Customer Service & Support (CSS) support
Extended Support | April 9, 2030 | CSS support, security updates, and paid support for fixes (*)
End of Support | April 10, 2030 | No further updates or support

(*) Paid Extended Support will be available for BizTalk Server 2020 between April 2028 and April 2030 for customers requiring non-security hotfixes. CSS will continue providing their typical support.

BizTalk Server 2016 is already out of mainstream support, and we recommend those customers evaluate a direct modernization path to Azure Logic Apps.

Continued Commitment to Enterprise Integration

Microsoft remains fully committed to supporting mission-critical integration, including hybrid connectivity, future-ready orchestration, and B2B/EDI modernization.
Azure Logic Apps, part of Azure Integration Services — which includes API Management, Service Bus, and Event Grid — delivers the comprehensive integration platform for the next decade of enterprise connectivity. Host Integration Server: Continued Support for Mainframe Workloads Host Integration Server (HIS) has long provided essential connectivity for organizations with mainframe and midrange systems. To ensure continued support for those workloads, Host Integration Server 2028 will ship as a standalone product with its own lifecycle, decoupled from BizTalk Server. This provides customers with more flexibility and a longer planning horizon. Recognizing Mainframe modernization customers might be looking to integrate with their mainframes from Azure, Microsoft provides Logic Apps connectors for mainframe and midrange systems, and we are keen on adding more connectors in this space. Let us know about your HIS plans, and if you require specific features for Mainframe and midranges integration from Logic Apps at: https://aka.ms/lamainframe Azure Logic Apps: The Successor to BizTalk Server Azure Logic Apps, part of Azure Integration Services, is the modern integration platform that carries forward what customers value in BizTalk while unlocking new innovation, scale, and intelligence. With 1,400+ out-of-box connectors supporting enterprise, SaaS, legacy, and mainframe systems, organizations can reuse existing BizTalk maps, schemas, rules, and custom code to accelerate modernization while preserving prior investments including B2B/EDI and healthcare transactions. Logic Apps delivers elastic scalability, enterprise-grade security and compliance, and built-in cost efficiency without the overhead of managing infrastructure. Modern DevOps tooling, Visual Studio Code support, and infrastructure-as-code (ARM/Bicep) ensure consistent, governed deployments with end-to-end observability using Azure Monitor and OpenTelemetry. 
Modernizing to Logic Apps also unlocks agentic business processes, enabling AI-driven routing, predictive insights, and context-aware automation without redesigning existing integrations. Logic Apps adapts to business and regulatory needs, running fully managed in Azure, hybrid via Arc-enabled Kubernetes, or evaluated for air-gapped environments. Throughout this lifecycle transition, customers can continue to rely on the BizTalk investments they have made while moving toward a platform ready for the next decade of integration and AI-driven business.

Charting Your Modernization Path

Microsoft remains fully committed to supporting customers through this transition. We recognize that BizTalk systems support highly customized and mission-critical business operations. Modernization requires time, planning, and precision. We hope to provide:
- Proven guidance and recommended design patterns
- A growing ecosystem of tooling supporting artifact reuse
- Unified Support engagements for deep migration assistance
- A strong partner ecosystem specializing in BizTalk modernization
- Potential incentive programs to help facilitate migration for eligible customers (details forthcoming)
Customers can take a phased approach, starting with new workloads while incrementally modernizing existing BizTalk deployments.

We’re Here to Help

Migration resources are available today:
- Overview: https://aka.ms/btmig
- Best practices: https://aka.ms/BizTalkServerMigrationResources
- Video series: https://aka.ms/btmigvideo
- Feature request survey: https://aka.ms/logicappsneeds
- Reactor session: Modernizing BizTalk: Accelerate Migration with Logic Apps - YouTube
We encourage customers to engage their Microsoft account team early to assess readiness, identify modernization opportunities, and explore assistance programs.

Your Modernization Journey Starts Now

BizTalk Server has played a foundational role in enterprise integration success for more than two decades.
As you plan ahead, Microsoft is here to partner with you every step of the way, ensuring operational continuity today while unlocking innovation tomorrow. To begin your transition, please contact your Microsoft account team or visit our migration hub. Thank you for your continued trust in Microsoft and BizTalk Server. We look forward to partnering closely with you as you plan the future of your integration platforms.

Frequently Asked Questions

Do I need to migrate now?
No. BizTalk Server 2020 is fully supported through April 11, 2028, with paid Extended Support available through April 9, 2030, for non-security hotfixes. CSS will continue providing their typical support. You have a long and predictable runway to plan your transition.

Will there be a new BizTalk Server version?
No. BizTalk Server 2020 is the final version of the product.

What happens after April 9, 2030?
BizTalk Server will reach End of Support, and security updates or technical assistance will no longer be provided. Workloads will continue running but without Microsoft servicing.

Is paid support available past 2028?
Yes. Paid Extended Support will be available through April 2030 for BizTalk Server 2020 customers looking for non-security hotfixes. CSS will continue to provide the typical support.

What about BizTalk Server 2016 or earlier versions?
Those versions are already out of mainstream support. We strongly encourage moving directly to Logic Apps rather than upgrading to BizTalk Server 2020.

Will Host Integration Server continue?
Yes. Host Integration Server (HIS) 2028 will be released as a standalone product with its own lifecycle and support commitments.

Can I reuse BizTalk Server artifacts in Logic Apps?
Yes. Most BizTalk maps, schemas, rules, assemblies, and custom code can be reused with minimal effort using Microsoft and partner migration tooling. We welcome feature requests here: https://aka.ms/logicappsneeds

Does modernization require moving fully to the cloud?
No.
Logic Apps supports hybrid deployments for scenarios requiring local processing or regulatory compliance, and fully disconnected environments are under evaluation. More information on the hybrid deployment model is available at https://aka.ms/lahybrid.

Does modernization unlock AI capabilities?
Yes. Logic Apps enables AI-driven automations through Agent Loop, improving routing, decisioning, and operational intelligence.

Where do I get planning support?
Your Microsoft account team can assist with assessment and planning. Migration resources are also linked in this announcement to help you get started.

Microsoft Corporation

Azure customer usage attribution (CUA) - report or validate it is working?
Per this article: https://learn.microsoft.com/en-us/partner-center/marketplace-offers/azure-partner-customer-usage-attribution#example-azure-powershell

If we use Azure PowerShell along with the ISV/SDC's tracking GUID (created earlier in Partner Center) to provision Azure VMs and other resources in end-customer tenants directly related to our IP as an ISV/SDC, how do we report or validate that this CUA is working?