responsible ai
18 TopicsFoundry Agent Service at Ignite 2025: Simple to Build. Powerful to Deploy. Trusted to Operate.
The upgraded Foundry Agent Service delivers a unified, simplified platform with managed hosting, built-in memory, tool catalogs, and seamless integration with Microsoft Agent Framework. Developers can now deploy agents faster and more securely, leveraging one-click publishing to Microsoft 365 and advanced governance features for streamlined enterprise AI operations.10KViews3likes1CommentThe Future of AI: Structured Vibe Coding - An Improved Approach to AI Software Development
In this post from The Future of AI series, the author introduces structured vibe coding, a method for managing AI agents like a software team using specs, GitHub issues, and pull requests. By applying this approach with GitHub Copilot, they automated a repetitive task—answering Microsoft Excel-based questionnaires—while demonstrating how AI can enhance developer workflows without replacing human oversight. The result is a scalable, collaborative model for AI-assisted software development.4.1KViews0likes0CommentsGenerally Available: Evaluations, Monitoring, and Tracing in Microsoft Foundry
If you've shipped an AI agent to production, you've likely run into the same uncomfortable realization: the hard part isn't getting the agent to work - it's keeping it working. Models get updated, prompts get tweaked, retrieval pipelines drift, and user traffic surfaces edge cases that never appeared in your eval suite. Quality isn't something you establish once. It's something you have to continuously measure. Today, we're making that continuous measurement a first-class operational capability. Evaluations, Monitoring, and Tracing in Microsoft Foundry are now generally available through Foundry Control Plane. These aren't standalone tools bolted onto the side of the platform - they're deeply integrated with Azure Monitor, which means AI agent observability now lives in the same operational plane as the rest of your infrastructure. The Problem With Point-in-Time Evaluation Most evaluation workflows are designed around a pre-deployment gate. You build a test dataset, run your evals, review the scores, and ship. That approach has real value - but it has a hard ceiling. In production, agent behavior is a function of many things that change independently of your code: Foundation model updates ship continuously and can shift output style, reasoning patterns, and edge case handling in ways that don't always surface on your benchmark set. Prompt changes can have nonlinear effects downstream, especially in multi-step agentic flows. Retrieval pipeline drift changes what context your agent actually sees at inference time. A document index fresh last month may have stale or subtly different content today. Real-world traffic distribution is never exactly what you sampled for your test set. Production surfaces long-tail inputs that feel obvious in hindsight but were invisible during development. The implication is straightforward: evaluation has to be continuous, not episodic. You need quality signals at development time, at every CI/CD commit, and continuously against live production traffic - all using the same evaluator definitions so results are comparable across environments. That's the core design principle behind Foundry Observability. Continuous Evaluation Across the Full AI Lifecycle Built-In Evaluators Foundry's built-in evaluators cover the most critical quality and safety dimensions for production agent systems: Coherence and Relevance measure whether responses are internally consistent and on-topic relative to the input. These are table-stakes signals for any conversational or task-completion agent. Groundedness is particularly important for RAG-based architectures. It measures whether the model's output is actually supported by the retrieved context - as opposed to plausible-sounding content the model generated from its parametric memory. Groundedness failures are a leading indicator of hallucination risk in production, and they're often invisible to human reviewers at scale. Retrieval Quality evaluates the retrieval step independently from generation. Groundedness failures can originate in two places: the model may be ignoring good context, or the retrieval pipeline may not be surfacing relevant context in the first place. Splitting these signals makes it much easier to pinpoint root cause. Safety and Policy Alignment evaluates whether outputs meet your deployment's policy requirements - content safety, topic restrictions, response format compliance, and similar constraints. These evaluators are designed to run at every stage of the AI lifecycle: Local development - run evals inline as you iterate on prompts, retrieval config, or orchestration logic CI/CD pipelines - gate every commit against your quality baselines; catch regressions before they reach production Production traffic monitoring - continuously evaluate sampled live traffic and surface trends over time Because the evaluators are identical across all three contexts, a score in CI means the same thing as a score in production monitoring. See the Practical Guide to Evaluations and the Built-in Evaluators Reference for a deeper walkthrough. Custom Evaluators - Encoding Your Own Definition of Quality Built-in evaluators cover common signals well, but production agents often need to satisfy criteria specific to a domain, regulatory environment, or internal standard. Foundry supports two types of custom evaluators (currently in public preview): LLM-as-a-Judge evaluators let you configure a prompt and grading rubric, then use a language model to apply that rubric to your agent's outputs. This is the right approach for quality dimensions that require reasoning or contextual judgment - whether a response appropriately acknowledges uncertainty, whether a customer-facing message matches your brand tone, or whether a clinical summary meets documentation standards. You write a judge prompt with a scoring scale (e.g., 1–5 with criteria for each level) that evaluates a given {input} / {response} pair. Foundry runs this at scale and aggregates scores into your dashboards alongside built-in results. Code-based evaluators are Python functions that implement any evaluation logic you can express programmatically - regex matching, schema validation, business rule checks, compliance assertions, or calls to external systems. If your organization has documented policies about what a valid agent response looks like, you can encode those policies directly into your evaluation pipeline. Custom and built-in evaluators compose naturally - running against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. Monitoring and Alerting - AI Quality as an Operational Signal All observability data produced by Foundry - evaluation results, traces, latency, token usage, and quality metrics - is published directly to Azure Monitor. This is where the integration pays off for teams already on Azure. What this enables that siloed AI monitoring tools can't: Cross-stack correlation. When your groundedness score drops, is it a model update, a retrieval pipeline issue, or an infrastructure problem affecting latency? With AI quality signals and infrastructure telemetry in the same Azure Monitor Application Insights workspace, you can answer that in minutes rather than hours of manual correlation across disconnected systems. Unified alerting. Configure Azure Monitor alert rules on any evaluation metric - trigger a PagerDuty incident when groundedness drops below threshold, send a Teams notification when safety violations spike, or create automated runbook responses when retrieval quality degrades. These are the same alert mechanisms your SRE team already uses. Enterprise governance by default. Azure Monitor's RBAC, retention policies, diagnostic settings, and audit logging apply automatically to all AI observability data. You inherit the governance framework your organization has already built and approved. Grafana and existing dashboards. If your team uses Azure Managed Grafana, evaluation metrics can flow into existing dashboards alongside your other operational metrics - a single pane of glass for application health, infrastructure performance, and AI agent quality. The Agent Monitoring Dashboard in the Foundry portal provides an AI-native view out of the box - evaluation metric trends, safety threshold status, quality score distributions, and latency breakdowns. Everything in that dashboard is backed by Azure Monitor data, so SRE teams can always drill deeper. End-to-End Tracing: From Quality Signal to Root Cause A groundedness score tells you something is wrong. A trace tells you exactly where the failure occurred and what the agent actually did. Foundry provides OpenTelemetry-based distributed tracing that follows each request through your entire agent system: model calls, tool invocations, retrieval steps, orchestration logic, and cross-agent handoffs. Traces capture the full execution path - inputs, outputs, latency at each step, tool call parameters and responses, and token usage. The key design decision: evaluation results are linked directly to traces. When you see a low groundedness score in your monitoring dashboard, you navigate directly to the specific trace that produced it - no manual timestamp correlation, no separate trace ID lookup. The connection is made automatically. Foundry auto-collects traces across the frameworks your agents are likely already built on: Microsoft Agent Framework Semantic Kernel LangChain and LangGraph OpenAI Agents SDK For custom or less common orchestration frameworks, the Azure Monitor OpenTelemetry Distro provides an instrumentation path. Microsoft is also contributing upstream to the OpenTelemetry project - working with Cisco Outshift, we've contributed semantic conventions for multi-agent trace correlation, standardizing how agent identity, task context, and cross-agent handoffs are represented in OTel spans. Note: Tracing is currently in public preview, with GA shipping by end of March. Prompt Optimizer (Public Preview) One persistent friction point in agent development is the iteration loop between writing prompts and measuring their effect. You make a change, run your evals, look at the delta, try to infer what about the change mattered, and repeat. Prompt Optimizer tightens this loop. It analyzes your existing prompt and applies structured prompt engineering techniques - clarifying ambiguous instructions, improving formatting for model comprehension, restructuring few-shot examples, making implicit constraints explicit - with paragraph-level explanations for every change it makes. The transparency is deliberate. Rather than producing a black-box "optimized" prompt, it shows you exactly what it changed and why. You can add constraints, trigger another optimization pass, and iterate until satisfied. When you're done, apply it with one click. The value compounds alongside continuous evaluation: run your eval suite against the current prompt, optimize, run evals again, see the measured improvement. That feedback loop - optimize, measure, optimize - is the closest thing to a systematic approach to prompt engineering that currently exists. What Makes our Approach to Observability Different There are other evaluation and observability tools in the AI ecosystem. The differentiation in Foundry's approach comes down to specific architectural choices: Unified lifecycle coverage, not just pre-deployment testing. Most existing evaluation tools are designed for offline, pre-deployment use. Foundry's evaluators run in the same form at development time, in CI/CD, and against live production traffic. Your quality metrics are actually comparable across the lifecycle - you can tell whether production quality matches what you saw in testing, rather than operating two separate measurement systems that can't be compared. No separate observability silo. Publishing all observability data to Azure Monitor means you don't operate a separate system for AI quality alongside your existing infrastructure monitoring. AI incidents route through your existing on-call rotations. AI quality data is subject to the same retention and compliance controls as the rest of your telemetry. Framework-agnostic tracing. Auto-instrumentation across Semantic Kernel, LangChain, LangGraph, and the OpenAI Agents SDK means you're not locked into a specific orchestration framework. The OpenTelemetry foundation means trace data is portable to any compatible backend, protecting your investment as the tooling landscape evolves. Composable evaluators. Built-in and custom evaluators run in the same pipeline, against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. You don't choose between generic coverage and domain-specific precision - you get both. Evaluation linked to traces. Most systems treat evaluation and tracing as separate concerns. Foundry treats them as two views of the same event - closing the loop between detecting a quality problem and diagnosing it. Getting Started If you're building agents on Microsoft Foundry, or using Semantic Kernel, LangChain, LangGraph, or the OpenAI Agents SDK and want to add production observability, the entry point is Foundry Control Plane. Try it You'll need a Foundry project with an agent and an Azure OpenAI deployment. Enable observability by navigating to Foundry Control Plane and connecting your Azure Monitor workspace. Then walk through the Practical Guide to Evaluations, explore the Built-in Evaluators Reference, and set up end-to-end tracing for your agents.4KViews1like0CommentsThe Future of AI: How Lovable.dev and Azure OpenAI Accelerate Apps that Change Lives
Discover how Charles Elwood, a Microsoft AI MVP and TEDx Speaker, leverages Lovable.dev and Azure OpenAI to create impactful AI solutions. From automating expense reports to restoring voices, translating gestures to speech, and visualizing public health data, Charles's innovations are transforming lives and democratizing technology. Follow his journey to learn more about AI for good.2.6KViews2likes0CommentsBetter detecting cross prompt injection attacks: Introducing Spotlighting in Azure AI Foundry
Spotlighting now in public preview in Azure AI Foundry as part of Prompt Shields. It helps developers detect malicious instructions hidden inside inputs, documents, or websites before they reach an agent.1.9KViews0likes0CommentsAnnouncing a new Azure AI Translator API (Public Preview)
Microsoft has launched the Azure AI Translator API (Public Preview), offering flexible translation options using either neural machine translation (NMT) or generative AI models like GPT-4o. The API supports tone, gender, and adaptive custom translation, allowing enterprises to tailor output for real-time or human-reviewed workflows. Customers can mix models in a single request and authenticate via resource key or Entra ID. LLM features require deployment in Azure AI Foundry. Pricing is based on characters (NMT) or tokens (LLMs).1.6KViews0likes0CommentsIntroducing Phi-4-Reasoning-Vision to Microsoft Foundry
Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason over, and act on visual information. Instead of treating images, diagrams, documents, or UI screens as unstructured inputs, vision reasoning models enable developers to build applications that can interpret visual structure, connect it with textual context, and perform multi-step reasoning to reach actionable conclusions. Today, we are excited to announce Phi-4-Reasoning-Vision-15B is available in Microsoft Foundry and Hugging Face. This model brings high‑fidelity vision to the reasoning‑focused Phi‑4 family, extending small language models (SLMs) beyond perception into structured, multi‑step visual reasoning for agents, analytical tools, and scientific workflows. What’s new? The Phi model family has advanced toward combining efficient visual understanding with strong reasoning in small language models. Earlier Phi‑4 models demonstrated reliable perception and grounding across images and text, while later iterations introduced structured reasoning to improve performance on complex tasks. Phi‑4‑reasoning-vision-15B brings these threads together, pairing high‑resolution visual perception with selective, task‑aware reasoning. As a result, the model can reason deeply when needed while remaining fast and efficient for perception‑focused scenarios—making it well suited for interactive, real‑world applications. Key capabilities Reasoning behavior is explicitly enabled via prompting: Developers can explicitly enable or disable reasoning to balance latency and accuracy at runtime. Optimized for vision reasoning and can be used for: diagram-based math, document, chart, and table understanding, GUI interpretations and grounding for agent scenarios to interpret screens and actions, Computer-use agent scenarios, and General image chat and answering questions Benchmarks The following results summarize Phi-4-reasoning-vision-15B performance across a set of established multimodal reasoning, mathematics, and computer use benchmarks. The following benchmarks are the result of internal evaluations. Benchmark Phi-4-reasoning-vision-15B Phi-4-reasoning-vision-15B – force no think Phi-4-mm-instruct Kimi-VL-A3B-Instruct gemma-3-12b-it Qwen3-VL-8B-Instruct-4K Qwen3-VL-8B-Instruct-32K Qwen3-VL-32B-Instruct-4K Qwen3-VL-32B-Instruct-32K AI2D _TEST 84.8 84.7 68.6 84.6 80.4 82.7 83 84.8 85 ChartQA _TEST 83.3 76.5 23.5 87 39 83.1 83.2 84.3 84 HallusionBench 64.4 63.1 56 65.2 65.3 73.5 74.1 74.4 74.9 MathVerse _MINI 44.9 43.8 32.4 41.7 29.8 54.5 57.4 64.2 64.2 MathVision _MINI 36.2 34.2 20 28.3 31.9 45.7 50 54.3 60.5 MathVista _MINI 75.2 68.7 50.5 67.1 57.4 77.1 76.4 82.5 81.8 MMMU _VAL 54.3 52 42.3 52 50 60.7 64.6 68.6 70.6 MMStar 64.5 63.3 45.9 60 59.4 68.9 69.9 73.7 74.3 OCRBench 76 75.6 62.6 86.5 75.3 89.2 90 88.5 88.5 ScreenSpot _v2 88.2 88.3 28.5 89.8 3.5 91.5 91.5 93.7 93.9 Table 1: Accuracy comparisons relative to popular open-weight, non-thinking models Benchmark Phi-4-reasoning-vision-15B Phi-4-reasoning-vision-15B - force thinking Kimi-VL-A3B-Thinking gemma-3-12b-it Qwen3-VL-8B-Thinking-4K Qwen3-VL-8B-Thinking-40K Qwen3-VL-32B-Thiking-4K Qwen3-VL-32B-Thinking-40K AI2D_TEST 84.8 79.7 81.2 80.4 83.5 83.9 86.9 87.2 ChartQA _TEST 83.3 82.9 73.3 39 78 78.6 78.5 79.1 HallusionBench 64.4 63.9 70.6 65.3 71.6 73 76.4 76.6 MathVerse _MINI 44.9 53.1 61 29.8 67.3 73.3 78.3 78.2 MathVision _MINI 36.2 36.2 50.3 31.9 43.1 50.7 60.9 58.6 MathVista _MINI 75.2 74.1 78.6 57.4 77.7 79.5 83.9 83.8 MMMU _VAL 54.3 55 60.2 50 59.3 65.3 72 72.2 MMStar 64.5 63.9 69.6 59.4 69.3 72.3 75.5 75.7 OCRBench 76 73.7 79.9 75.3 81.2 82 83.7 85 ScreenSpot _v2 88.2 88.1 81.8 3.5 93.3 92.7 83.1 83.1 Table 2: Accuracy comparisons relative to popular open-weight, thinking models All results were obtained using a consistent evaluation setup and prompts across models; numbers are provided for comparison and analysis rather than as leaderboard claims. For more information regarding benchmarks and evaluations, please read the technical paper on the Microsoft Research hub. Suggested use cases and applications Phi‑4‑Reasoning-Vision-15B supports applications that require both high‑fidelity visual perception and structured inference. Two representative scenarios include scientific and mathematical reasoning over visual inputs, and computer‑using agents (CUAs) that operate directly on graphical user interfaces. In both cases, the model provides grounded visual understanding paired with controllable, low‑latency reasoning suitable for interactive systems. Computer use agents in retail scenarios For computer use agents, Phi‑4‑Reasoning-Vision-15B provides the perception and grounding layer required to understand and act within live ecommerce interfaces. For example, in an online shopping experience, the model interprets screen content—products, prices, filters, promotions, buttons, and cart state—and produces grounded observations that agentic models like Fara-7B can use to select actions. Its compact size and low latency inference make it well suited for CUA workflows and agentic applications. Visual reasoning for education Another practical use of visual reasoning models is education. A developer could build a K‑12 tutoring app with Phi‑4‑Reasoning‑Vision‑15B where students upload photos of worksheets, charts, or diagrams to get guided help—not answers. The model can understand the visual content, identify where the student went wrong, and explain the correct steps clearly. Over time, the app can adapt by serving new examples matched to the student’s learning level, turning visual problem‑solving into a personalized learning experience. Microsoft Responsible AI principles At Microsoft, our mission to empower people and organizations remains constant—especially in the age of AI, where the potential for human achievement is greater than ever. We recognize that trust is foundational to AI adoption, and earning that trust requires a commitment to transparency, safety, and accountability. As with other Phi models, Phi-4-Reasoning-Vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. These safety focused training signals help the model recognize and decline requests that fall outside intended or acceptable use. Additional details on the model’s safety considerations, evaluation approach, and known limitations are provided in the accompanying technical blog and model card. Getting started Start using Phi‑4‑Reasoning-Vision-15B in Microsoft Foundry today. Microsoft Foundry provides a unified environment for model discovery, evaluation, and deployment, making it straightforward to move from initial experimentation to production use while applying appropriate safety and governance practices. Deploy the new model on Microsoft Foundry. Learn more about the Phi family on Foundry Labs and in the Phi Cookbook Connect to the Microsoft Developer Community on Discord Read the technical paper on Microsoft Research Read more use cases on the Educators Developer blog1.5KViews0likes0CommentsEffective AI Governance with Azure
Why is AI Governance needed? As organizations increasingly adopt AI in their cloud environments, effective governance is essential to ensure sustainability, security, and operational excellence. Without proper oversight, AI workloads can escalate costs, expose vulnerabilities, and struggle with resiliency under dynamic conditions. AI Governance provides a structured approach to managing AI investments, securing sensitive data, optimizing performance, and ensuring compliance with evolving regulations. By implementing governance best practices, enterprises can balance innovation with control, enabling AI-driven solutions to scale efficiently and responsibly. This blog explores key areas of AI Governance, including cost management, security, resiliency, operational optimization, and model oversight. Five Pillars of AI Governance Manage AI Costs Choose the right billing model: For unpredictable usage, the Pay-as-you-go model works best, while predictable workloads benefit from Provisioned Throughput Units (PTUs). Mixing PTU endpoints with consumption-based endpoints helps save money, as PTUs take care of the main tasks while consumption-based endpoints deal with any extra demand. Choose the right model: Opting for an AI model should balance performance requirements with cost considerations. Select less expensive models unless the use case demands a higher-cost option. During fine-tuning, ensure maximum utilization of time within each billing period to prevent incurring additional charges. Reservations: By committing to a reservation for Provisioned Throughput Units (PTUs) over a period of one month or one year, you can realize savings. Most OpenAI models offer reservations, with discounts typically ranging from 30% to 60%. Track and control token usage: The Generative AI Gateway helps manage costs by tracking and throttling token usage, applying circuit breakers, and routing requests to multiple AI endpoints. Incorporating a semantic cache can further optimize both performance and expenses when using LLMs. Additionally, setting model-based provisioning quotas ensures better cost control by preventing unnecessary usage. Policies to shut down unused instances: Establish a policy requiring AI resources to enable the automatic shutdown feature on virtual machines and compute instances in Azure AI Foundry and Azure Machine Learning. This requirement applies to nonproduction environments and production workloads that can be taken offline periodically. Secure AI Workloads AI threat protection: Defender for Cloud provides real-time monitoring of Gen AI applications to detect security vulnerabilities. AI threat protection works with Azure AI content safety prompt shields and Microsoft’s threat intelligence to identify risks such as data leakage, data poisoning, jailbreak attempts, and credential threats. Integration with Defender XDR enables security teams to centralize alerts for AI workloads within the Defender XDR portal. Access and identity controls: Grant the minimum necessary user access to centralized AI resources. Leverage managed identities across supported Azure AI services and restrict access to essential AI model endpoints only. Implement just-in-time access to enable temporary elevation of permissions when required. Disable local authentication as needed. Key management: Azure AI services provide two API keys for each resource to facilitate secret rotation, enhancing security by enabling regular key updates. This feature protects service privacy in case of key leakage. It is recommended to store all keys securely in Azure Key Vault. Regulatory compliance: AI regulatory compliance involves utilizing industry-specific initiatives in Azure Policy and applying relevant policies for services like Azure AI Foundry and Azure Machine Learning. Compliance checklists designed for specific industries and locations, along with standards like ISO/IEC 23053:2022, assist in reviewing and confirming that AI workloads meet regulatory requirements. Network security: Azure AI services use a layered security model to restrict access to specific networks. Configuring network rules ensures that only applications from designated networks can access the account. Access can be further filtered by IP addresses, ranges, or Azure Virtual Network subnets. When network rules are in effect, applications must be authorized using Microsoft Entra ID credentials or a valid API key. Data security: Maintain strict data security boundaries by cataloging data to avoid feeding sensitive information to public-facing AI endpoints. Use legally licensed data for AI model grounding or training, and implement tools like Protected Material Detection to prevent copyright infringement. Establish version control for grounding data to track and revert changes, ensuring consistency and compliance across deployments. Regularly review outputs for intellectual property adherence. Tag sensitive information using Azure Information Protection. Risk scenario Risk impact Resiliency mitigation example Cyberattacks Ransomware, distributed denial of service (DDoS), or unauthorized access. To reduce impact, include robust security measures, including an appropriate backup and recovery process, in your adoption strategy and plan. System failures Hardware or software malfunctions. Design for quick recovery and data integrity restoration. Handle transient faults in your applications, and provide redundancy in your infrastructure, such as multiple replicas with automatic failover. Configuration issues Deployment errors or misconfigurations. Treat configuration changes as code changes by using infrastructure as code (IaC). Use continuous integration/continuous deployment (CI/CD) pipelines, canary deployments, and rollback mechanisms to minimize the impact of faulty updates or deployments. Demand spikes or overload Performance degradation during peak usage or spikes in traffic. Use elastic scalability to ensure that systems automatically scale to handle an increased demand without disruption to service. Compliance failures Breaches of regulatory standards. Adopt compliance tools like Microsoft Purview and use Azure Policy to enforce compliance requirements. Natural disasters Datacenter outages caused by earthquakes, floods, or storms. Plan for failover, high availability, and disaster recovery by using availability zones, multiple regions, or even multicloud approaches. Resilience for AI Platforms Deploy AI landing zones: AI landing zones provide pre-designed, scalable environments that provide a structured foundation for deploying AI workloads in Azure. They integrate various Azure services to ensure governance, compliance, security, and operational efficiency. ALZ’s help streamline AI deployments while maintaining best practices for scalability and performance. Reliable scaling strategy: AI applications require effective scaling strategies, such as auto scaling and automatic scaling mechanisms. While auto-scaling operates based on predefined threshold rules, automatic scaling leverages intelligent algorithms to adaptively scale resources by analyzing learned usage patterns. Disaster recovery planning: A critical component of business continuity that requires the development of techniques for High Availability (HA) and Disaster Recovery (DR) for your AI endpoints and AI Data. This involves deploying zonal services within a region to ensure HA and provisioning instances in a secondary region to enable effective DR. Building global resilience: Global deployment optimizes capacity utilization and throughput for generative AI by accessing distributed pools across regions. Intelligent routing prioritizes less busy instances, ensuring processing efficiency and reliability. Azure API Management (APIM) with premium SKU supports resilient global deployments, maintaining a single endpoint for seamless failover and enhanced scalability without burdening applications. Optimizing AI Operations Latency: With generative AI, inferencing time far outweighs network latency, making network time negligible in overall operations. A global deployment, leveraging intelligent routing to identify less busy capacity pools worldwide, ensures faster processing by utilizing idle resources effectively. This approach transforms traditional latency considerations, emphasizing the scalability and efficiency of globally distributed models over proximity. Additionally, seasonal differences across regions further enhance the potential for optimized performance. Capacity and throughput: Global deployments optimize capacity and throughput by accessing larger pools and leveraging intelligent routing to direct requests to less busy instances, ensuring faster processing and quota fulfillment. Data Zones balance broader capacity access with compliance for regions with sovereignty needs, while Provisioned Throughput Units (PTUs) can further improve utilization by dynamically managing token distribution across pools for maximum efficiency. Standard options remain limited and may restrict throughput under heavy demand. AI observability: GenAI observability encompasses monitoring model performance, capacity utilization, token throughput, and compliance across distributed systems. It tracks token utilization to ensure efficient distribution and optimize throughput, supported by tools like PTU for dynamic management. General observability features include latency tracking, resource allocation insights, error rate monitoring, and proactive alerting, enabling seamless operations and adherence to data sovereignty requirements while maximizing performance and efficiency. Azure OpenAI observability metrics Category Metric Unit Dimensions Aggregation Description HTTP Requests Total Request Count Count Endpoint, API Operation, Region Sum Tracks the total number of HTTP requests made to the Azure OpenAI endpoints. Failed Requests Count Status Code, Region, API Operation Sum Monitors the count of requests resulting in errors (e.g., 4xx, 5xx response codes). Request Rate Requests/second Endpoint, Region Average Measures the rate of incoming requests to analyze traffic patterns. Latency Request Latency Milliseconds (ms) Endpoint, Region, API Operation Average, Percentiles (50th, 90th, 99th) Captures the average response time of requests, broken down by endpoint or API call. Response Time Percentiles Milliseconds (ms) Endpoint, Region, API Operation Percentiles (50th, 90th, 99th) Identifies outliers or slow responses in terms of latency across different percentiles. Usage Token Utilization Tokens API Key, Region, Instance Type Sum, Average Tracks the number of tokens processed (prompt and completion) to monitor quota usage. Throttled Requests Count API Key, Region Sum Counts requests delayed or rejected due to throttling or quota limits. Actions Cache Hits/Misses Count Cache Type, Region, Endpoint Ratio (Hits vs Misses), Sum Monitors the efficiency of semantic or prompt caching to optimize token usage. Request Routing Efficiency Percentage (%) Region, Capacity Pool Average Tracks the accuracy of routing requests to the least busy capacity pool for better processing. Throughput Tokens/second Endpoint, Region Sum, Average Measures successfully processed tokens or requests per second to ensure capacity optimization. Govern AI Models Control the models: Azure Policy can be used to control which models teams are permitted to deploy from the Azure AI Foundry catalog. Organizations are advised to start with audit mode, which monitors model usage without restricting deployments. Transitioning to deny mode should only occur after thoroughly understanding workload teams’ development needs to avoid unnecessary disruption. It’s important to note that deny mode does not automatically remove noncompliant models already deployed, and these must be addressed manually. Evaluating models: Evaluation is a critical aspect of the generative AI lifecycle, ensuring models meet accuracy, performance, security, and ethical standards while mitigating biases and validating robustness before deployment. It plays a role at every stage, from selecting the base model to pre-production validation and post-production monitoring. Azure provides several tools to support systematic evaluation, including Azure AI Foundry, which offers built-in metrics for assessing AI model performance. The Evaluation API in Azure OpenAI Service enables automated quality checks by integrating evaluations into CI/CD pipelines. Additionally, organizations can leverage Azure DevOps and GitHub Actions to conduct bulk evaluations, ensuring AI models remain compliant, optimized, and trustworthy throughout their lifecycle. Content filters for models: Organizations are advised to define baseline content filters for generative AI models using Azure AI Content Safety. This system evaluates both prompts and completions through classification models that identify and mitigate harmful content across various categories. Key features include prompt shields, groundedness detection, and protected material text scanning for both images and text. Establishing a process for application teams to communicate governance needs ensures alignment and comprehensive oversight of safety measures. Ground AI models: To effectively manage generative AI output, utilize system messages and the retrieval augmented generation (RAG) pattern to ensure responses are grounded and reliable. Test grounding techniques using tools like prompt flow for structured workflows or the open-source red teaming framework PyRIT to identify potential vulnerabilities. These strategies help refine model behavior and maintain alignment with governance requirements.1.5KViews1like0CommentsKeeping Agents on Track: Introducing Task Adherence in Azure AI Foundry
Task Adherence is coming soon to public preview in both the Azure AI Content Safety API and Azure AI Foundry. It helps developers ensure AI agents stay aligned with their assigned tasks, preventing drift, misuse, or unsafe tool calls.1.4KViews0likes0CommentsBeyond the Model: Empower your AI with Data Grounding and Model Training
Discover how Microsoft Foundry goes beyond foundational models to deliver enterprise-grade AI solutions. Learn how data grounding, model tuning, and agentic orchestration unlock faster time-to-value, improved accuracy, and scalable workflows across industries.999Views6likes4Comments