evaluation

25 Topics

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.
manniarora
Feb 28, 2025 Place Microsoft Foundry Blog
1.4KViews
5likes
3Comments
The Future of AI: The Model is Key, but the App is the Doorway
This post explores the real-world impact of GPT-5 beyond benchmark scores, focusing on how application design shapes user experience. It highlights early developer feedback, common integration challenges, and practical strategies for adapting apps to leverage the advanced capabilities of GPT-5 in Foundry Models. From prompt refinement to fine-tuning to new API controls, learn how to make the most of this powerful model.
Marco_Casalaina
Nov 08, 2025 Place Microsoft Foundry Blog
495Views
3likes
0Comments
Automate Quota Discovery in Azure AI Foundry: A Tale of 3 APIs
Automate the discovery of Azure regions that meet your AI deployment needs using three essential APIs: Models API, Usages API, and Locations API. This process helps reduce decision fatigue and ensures compliance with enterprise-wide model deployment standards. Key learnings: Model Deployment Requirements: Understand the needs of a standard Retrieval-Augmented Generation (RAG) application, which involves deploying multiple models. Automation Benefits: Streamline your deployment process and ensure compliance with enterprise standards. Three Essential APIs: Models API: Query available models for a specific subscription within a chosen location. Usages API: Assess current usages and limits to infer available quotas. Locations API: Obtain a list of all available regions. A comprehensive Jupyter notebook with the implementation steps is available in the accompanying GitHub repository. This resource is invaluable for AI developers looking to streamline their deployment processes and ensure their applications meet all necessary requirements
cedricvidal
Feb 18, 2025 Place Microsoft Foundry Blog
596Views
3likes
0Comments
The Future of AI: Harnessing AI agents for Customer Engagements
Discover how AI-powered agents are revolutionizing customer engagement—enhancing real-time support, automating workflows, and empowering human professionals with intelligent orchestration. Explore the future of AI-driven service, including Customer Assist created with Azure AI Foundry.
manniarora
Jun 30, 2025 Place Microsoft Foundry Blog
779Views
2likes
0Comments
Start your Trustworthy AI Development with Safety Leaderboards in Azure AI Foundry
Selecting the right model for your AI application is more than a technical decision—it’s a foundational step in ensuring trust, compliance, and governance in AI. Today, we are excited to announce the public preview of safety leaderboards within Foundry model leaderboards, helping customers incorporate model safety as a first-class criterion alongside quality, cost, and throughput. This feature introduces three key components to support responsible AI development: A dedicated safety leaderboard highlighting the safest models; A quality–safety trade-off chart to balance performance and risk; Five new scenario-specific leaderboards supporting diverse responsible AI scenarios. Prioritize safety with the new leaderboard The safety leaderboard ranks the top models based on their robustness against generating harmful content. This is especially valuable in regulated or high-risk domains—such as healthcare, education, or financial services—where model outputs must meet high safety standards. To ensure benchmark rigor and relevance, we apply a structured filtering and validation process to select benchmarks. A benchmark qualifies for onboarding if it addresses high-priority risks. For safety and responsible AI leaderboards, we look at different benchmarks that can be considered reliable enough to provide some signals on the targeted areas of interest as they relate to safety. Our current safety leaderboard uses the HarmBench benchmark which includes prompts to illicit harmful behaviors from models. The benchmark covers 7 semantic categories of behaviors: Cybercrime & Unauthorized Intrusion Chemical & Biological Weapons/Drugs Copyright Violations Misinformation & Disinformation Harassment & Bullying Illegal Activities General Harm These 7 categories are organized into three broader functional groupings: Standard Harmful Behaviors Contextual Harmful Behaviors Copyright Violations Each grouping is featured in a separate responsible AI scenario leaderboard. We use the prompts evaluators from HarmBench to calculate Attack Success Rate (ASR) and aggregate them across the functional groupings to proxy model safety. Lower ASR values means that a model is more robust against attacks to illicit harmful content. We understand and acknowledge that model safety is a complex topic and has several dimensions. No single current open-source benchmark can test or represent the full spectrum of model safety in different scenarios. Additionally, most of these benchmarks suffer from saturation, or misalignment between benchmark design and the risk definition, can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether the benchmark accurately captures the nuances of the risks. This can lead to either overestimating or underestimating model performance in real-world safety scenarios. While HarmBench dataset covers a limited set of harmful topics, it can still provide a high-level understanding of safety trends. Navigate trade-offs with the quality-safety chart Model selection often involves compromise across multiple criteria. Our new quality–safety trade-off chart helps you make informed decisions by comparing models based on their performance in safety and quality. You can: Identify the safest model measured by Attack Success Rate (lower is better) at a given level of quality performance; Or choose the highest-performing model in quality (higher is better) that still meets a defined safety threshold. Together with the quality-cost trade-off chart, you would be able to find the best trade-off between quality, safety, and cost in selecting a model: Scenario-based responsible AI leaderboards To support customers' diverse responsible AI scenarios, we have added 5 new leaderboards to rank the top models in safety and broader responsibility AI scenarios. Each leaderboard is powered by industry-standard public benchmarks covering: Model robustness against harmful behaviors using HarmBench in 3 scenarios, targeting standard harmful behaviors, contextually harmful behaviors, and copyright violations: Consistent with the safety leaderboard, lower ASR scores for a model mean better robustness against generating harmful content. Model ability to detect toxic content using the Toxigen benchmark: This benchmark targets adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. Higher accuracy based on F1-score for a model means its better ability to detect toxic content. Model knowledge of sensitive domains including cybersecurity, biosecurity, and chemical security, using the Weapons of Mass Destruction Proxy benchmark (WMDP): A higher accuracy score for a model denotes more knowledge of dangerous capabilities. These scenario leaderboards allow developers, compliance teams, and AI governance stakeholders to align model selection with organizational risk tolerance and regulatory expectations. Building Trustworthy AI Starts with the Right Tools With safety leaderboards now available in public preview, Foundry model leaderboards offer a unified, transparent, and data-driven foundation for selecting models that align with your safety requirements. This addition empowers teams to move from ad hoc evaluation to principled model selection—anchored in industry-standard benchmarks and responsible AI practices. To learn more, explore the methodology documentation and start building AI solutions you—and your stakeholders—can trust.
Chang_Liu
Jun 20, 2025 Place Microsoft Foundry Blog
1.6KViews
2likes
0Comments
Ignite 2024: Streamlining AI Development with an Enhanced User Interface, Accessibility, and Learning Experiences in Azure AI Foundry portal
Announcing Azure AI Foundry, a unified platform that simplifies AI development and management. The platform portal (formerly Azure AI Studio) features a revamped user interface, enhanced model catalog, new management center, improved accessibility and learning, making it easier than ever for Developers and IT Admins to design, customize, and manage AI apps and agents efficiently.
udimilo
Nov 19, 2024 Place Microsoft Foundry Blog
6.1KViews
2likes
0Comments
The Future of AI: Evaluating and optimizing custom RAG agents using Azure AI Foundry
This blog post explores best practices for evaluating and optimizing Retrieval-Augmented Generation (RAG) agents using Azure AI Foundry. It introduces the RAG triad metrics—Retrieval, Groundedness, and Relevance—and demonstrates how to apply them using Azure AI Search and agentic retrieval for custom agents. Readers will learn how to fine-tune search parameters, use end-to-end evaluation metrics and golden retrieval metrics like XDCG and Max Relevance, and leverage Azure AI Foundry tools to build trustworthy, high-performing AI agents.
Chang_Liu
Sep 19, 2025 Place Microsoft Foundry Blog
1.7KViews
1like
0Comments
The Future of AI: Developing Lacuna - an agent for Revealing Quiet Assumptions in Product Design
A conversational agent named Lacuna is helping product teams uncover hidden assumptions embedded in design decisions. Built with Copilot Studio and powered by Azure AI Foundry, Lacuna analyzes product documents to identify speculative beliefs and assess their risk using design analysis lenses: impact, confidence, and reversibility. By surfacing cognitive biases and prompting reflection, Lacuna encourages teams to validate assumptions through lightweight evidence-gathering methods. This experiment in human-AI collaboration explores how agents can foster epistemic humility and transform static documents into dynamic conversations.
skeske
Aug 21, 2025 Place Microsoft Foundry Blog
547Views
1like
1Comment
AI reports: Improve AI governance and GenAIOps with consistent documentation
AI reports are designed to help organizations improve cross-functional observability, collaboration, and governance when developing, deploying, and operating generative AI applications and fine-tuned or custom models. These reports support AI governance best practices by helping developers document the purpose of their AI model or application, its features, potential risks or harms, and applied mitigations, so that cross-functional teams can track and assess production-readiness throughout the AI development lifecycle and then monitor it in production. Starting in December, AI reports will be available in private preview in a US and EU Azure region for Azure AI Foundry customers. To request access to the private preview of AI reports, please complete the Interest Form. Furthermore, we are excited to announce new collaborations with Credo AI and Saidot to support customers’ end-to-end AI governance. By integrating the best of Azure AI with innovative and industry-leading AI governance solutions, we hope to provide our customers with choice and help empower greater cross-functional collaboration to align AI solutions with their own principles and regulatory requirements. Building on learnings at Microsoft Microsoft’s approach for governing generative AI applications builds on our Responsible AI Standard and the National Institute of Standards and Technology’s AI Risk Management Framework. This approach requires teams to map, measure, and manage risks for generative applications throughout their development cycle. A core asset of the first—and iterative—map phase is the Responsible AI Impact Assessment. These assessments help identify potential risks and their associated harms, as well as mitigations to address them. As development of an AI system progresses, additional iterations can help development teams document their progress in risk mitigation and allow experts to review the evaluations and mitigations and make further recommendations or requirements before products are launched. Post-deployment, these assessments become a source of truth for ongoing governance and audits, and help guide how to monitor the application in production. You can learn more about Microsoft’s approach to AI governance in our Responsible AI Transparency Report and find a Responsible AI Impact Assessment Guide and example template on our website. How AI reports support AI impact assessments and GenAIOps AI reports can help organizations govern their GenAI models and applications by making it easier for developers to provide the information needed for cross-functional teams to assess production-readiness throughout the GenAIOps lifecycle. Developers will be able to assemble key project details, such as the intended business use case, potential risks and harms, model card, model endpoint configuration, content safety filter settings, and evaluation results into a unified AI report from within their development environment. Teams can then publish these reports to a central dashboard in the Azure AI Foundry portal, where business leaders can track, review, update, and assess reports from across their organization. Users can also export AI reports in PDF and industry-standard SPDX 3.0 AI BOM formats, for integration into existing GRC workflows. These reports can then be used by the development team, their business leaders, and AI, data, and other risk professionals to determine if an AI model or application is fit for purpose and ready for production as part of their AI impact assessment processes. Being versioned assets, AI reports can also help organizations build a consistent bridge across experimentation, evaluation, and GenAIOps by documenting what metrics were evaluated, what will be monitored in production, and the thresholds that will be used to flag an issue for incident response. For even greater control, organizations can choose to implement a release gate or policy as part of their GenAIOps that validates whether an AI report has been reviewed and approved for production. Key benefits of these capabilities include: Observability: Provide cross-functional teams with a shared view of AI models and applications in development, in review, and in production, including how these projects perform in key quality and safety evaluations. Collaboration: Enable consistent information-sharing between GRC, development, and operational teams using a consistent and extensible AI report template, accelerating feedback loops and minimizing non-coding time for developers. Governance: Facilitate responsible AI development across the GenAIOps lifecycle, reinforcing consistent standards, practices, and accountability as projects evolve or expand over time. Build production-ready GenAI apps with Azure AI Foundry If you are interested in testing AI reports and providing feedback to the product team, please request access to the private preview by completing the Interest Form. Want to learn more about building trustworthy GenAI applications with Azure AI? Here’s more guidance and exciting announcements to support your GenAIOps and governance workflows from Microsoft Ignite: Learn about new GenAI evaluation capabilities in Azure AI Foundry Learn about new GenAI monitoring capabilities in Azure AI Foundry Learn about new IT governance capabilities in Azure AI Foundry Whether you’re joining in person or online, we can’t wait to see you at Microsoft Ignite 2024. We’ll share the latest from Azure AI and go deeper into capabilities that support trustworthy AI with these sessions: Keynote: Microsoft Ignite Keynote Breakout: Trustworthy AI: Future trends and best practices Breakout: Trustworthy AI: Advanced AI risk evaluation and mitigation Demo: Simulate, evaluate, and improve GenAI outputs with Azure AI Foundry Demo: Track and manage GenAI app risks with AI reports in Azure AI Foundry We’ll also be available for questions in the Connection Hub on Level 3, where you can find “ask the expert” stations for Azure AI and Trustworthy AI.
AlexSutton
Nov 19, 2024 Place Microsoft Foundry Blog
2.5KViews
1like
0Comments
Evaluate Fine-tuned Phi-3 / 3.5 Models in Azure AI Studio Focusing on Microsoft's Responsible AI
Fine-tuning a model can sometimes lead to unintended or undesired responses. To ensure that the model remains safe and effective, it's important to evaluate the model's potential to generate harmful content and its ability to produce accurate, relevant, and coherent responses. In this tutorial, you will learn how to evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model integrated with Prompt flow in Azure AI Studio. Before beginning the technical steps, it's essential to understand Microsoft's Responsible AI Principles, an ethical framework designed to guide the responsible development, deployment, and operation of AI systems. These principles guide the responsible design, development, and deployment of AI systems, ensuring that AI technologies are built in a way that is fair, transparent, and inclusive. These principles are the foundation for evaluating the safety of AI models.
Minseok_Song
Sep 02, 2024 Place Educator Developer Blog
19KViews
1like
1Comment