evaluation
Evaluate Fine-tuned Phi-3 / 3.5 Models in Azure AI Studio Focusing on Microsoft's Responsible AI
Fine-tuning a model can sometimes lead to unintended or undesired responses. To ensure that the model remains safe and effective, it's important to evaluate its potential to generate harmful content and its ability to produce accurate, relevant, and coherent responses. In this tutorial, you will learn how to evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model integrated with Prompt flow in Azure AI Studio. Before beginning the technical steps, it's essential to understand Microsoft's Responsible AI Principles: an ethical framework that guides the responsible design, development, deployment, and operation of AI systems, ensuring that AI technologies are built in a way that is fair, transparent, and inclusive. These principles are the foundation for evaluating the safety of AI models.
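As a minimal sketch of the kind of safety check this tutorial walks through, the snippet below scores a single query/response pair from a fine-tuned model using the azure-ai-evaluation package. The project details and the sample query/response are placeholders, and the evaluator names and arguments shown here should be confirmed against the current SDK documentation before use.

```python
# pip install azure-ai-evaluation azure-identity
# Hedged sketch: project details and the sample data below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator, HateUnfairnessEvaluator

# AI-assisted safety evaluators run against an Azure AI project.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-project-name>",
}
credential = DefaultAzureCredential()

violence_eval = ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project)
hate_eval = HateUnfairnessEvaluator(credential=credential, azure_ai_project=azure_ai_project)

# Score one query/response pair produced by the fine-tuned Phi-3 / Phi-3.5 model.
query = "Describe the safety features of the product."
response = "The product includes a child lock and an automatic shut-off."

print(violence_eval(query=query, response=response))  # severity label, score, and reasoning
print(hate_eval(query=query, response=response))
```

In practice you would run these evaluators over a full test dataset rather than a single pair, then review the aggregated results in Azure AI Studio.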
Why is a fresh Enterprise Evaluation installation license expired?
Hello, normally whenever I install the Windows Enterprise Evaluation it automatically activates the 90-day license. This has changed, however: this time I did a fresh install (I already tried both Windows 10 and 11) and I get the "Windows license is expired" message in the bottom-right corner of the screen. The System -> Activation menu says "We can't activate Windows on this device right now. You can try activating again later or ... Error code: 0xC004C032." When I run slmgr /ato to activate, I sometimes (not sure what the reason for the change is - maybe this changed after I tried to change the product key to what is supposedly a "generic Enterprise key for recovery") get the same error as above - 0xC004C032 (running slui on that returns "The activation server reported that new time based activation is not available") - or 0x80072F8F ("A security error occurred"). I read that syncing the date/time might be an issue, but as far as I can tell it's synced correctly with time.windows.com. Any ideas about what the issue could be? Or has something changed in how activation works for evaluation distributions? Any input is appreciated. Thanks
Can't convert windows server standard core version to licensed copy.
I have a case where a customer installed the 2019 Server Standard Core EVAL version and has it all set up and configured for production, and we can't convert it to a non-EVAL version. I did a new install myself in a VM and can confirm you can't convert 2019 Standard Core to a full licensed copy. The only option given when converting it is to convert to Datacenter. In Microsoft's documentation they mentioned this occurred with earlier versions of 2016, and it seems to have carried over to 2019 and 2022 as well from what I have found online. Can someone at Microsoft confirm this is the intended behavior and have them update their documentation to reflect it? Or have them tell us if this is a bug and have it corrected, and if there is any other way to get this Server Standard Core EVAL to convert to a licensed copy. Thanks, Jason
https://docs.microsoft.com/en-us/windows-server/get-started/upgrade-conversion-options

C:\Users\Administrator>dism /online /get-currentedition
Deployment Image Servicing and Management tool
Version: 10.0.17763.1697
Image Version: 10.0.17763.2565
Current edition is:
Current Edition : ServerStandardEvalCor
The operation completed successfully.

C:\Users\Administrator>dism /online /get-targeteditions
Deployment Image Servicing and Management tool
Version: 10.0.17763.1697
Image Version: 10.0.17763.2565
Editions that can be upgraded to:
Target Edition : ServerDatacenterCor
The operation completed successfully.

C:\Users\Administrator>DISM /online /Set-Edition:serverstandardcor /ProductKey:N69G4-B89J2-4G8F4-WWYCC-J464C /AcceptEula
Deployment Image Servicing and Management tool
Version: 10.0.17763.1697
Image Version: 10.0.17763.2565
Starting to update components...
Starting to install product key...
Finished installing product key.
Finished updating components.
Starting to apply edition-specific settings...
Error: 1168
An error occurred while applying target edition component settings. The upgrade cannot proceed. For more information, review the log file.
The DISM log file can be found at C:\Windows\Logs\DISM\dism.log

2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Migrating product key values from [Microsoft\Windows NT\CurrentVersion\DefaultProductKey2] to [Microsoft\Windows NT\CurrentVersion] and [ControlSet001\Control\ProductOptions]... - CTransmogManager::MigrateDigitalProductKey
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Successfully migrated product key information. - CTransmogManager::MigrateDigitalProductKey
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Successfully migrated the product key. - CTransmogManager::UpdateComponents
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Successfully updated components. - CTransmogManager::UpdateComponents
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Successfully updated components from [ServerStandardEvalCor] to [serverstandardcor] - CTransmogManager::TransmogrifyWorker
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Copying edition override file from [C:\Windows\Servicing\Editions\serverstandardcorEdition.xml] to [C:\Windows\serverstandardcor.xml] - CTransmogManager::ProjectEditionOverrideFile
2022-03-02 07:42:34, Info DISM DISM Transmog Provider: PID=3772 TID=3760 Installing edition license file for edition [serverstandardcor] - CTransmogManager::ProjectEditionLicenseFile
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=2648 Failed to copy edition license file. - ThreadSkuInstallEula(hr:0x80070490)
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 Thread for Installing EULA Failed. - CTransmogManager::ProjectEditionLicenseFile(hr:0x80070490)
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 Failed to apply edition settings - CTransmogManager::ApplyEditionSettings(hr:0x80070490)
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 Failed to apply edition settings for [serverstandardcor] - CTransmogManager::TransmogrifyWorker
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 [Upgrading system]: An error occurred while applying target edition component settings. The upgrade cannot proceed. For more information, review the log file. [hrError=0x80070490] - CTransmogManager::EventError
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 Failed to Upgrade! - CTransmogManager::TransmogrifyWorker(hr:0x80070490)
2022-03-02 07:42:34, Error DISM DISM Transmog Provider: PID=3772 TID=3760 Failed to upgrade! - CTransmogManager
Ignite 2024: Streamlining AI Development with an Enhanced User Interface, Accessibility, and Learning Experiences in Azure AI Foundry portal
Announcing Azure AI Foundry, a unified platform that simplifies AI development and management. The platform portal (formerly Azure AI Studio) features a revamped user interface, an enhanced model catalog, a new management center, and improved accessibility and learning experiences, making it easier than ever for developers and IT admins to design, customize, and manage AI apps and agents efficiently.
New evaluation tools for multimodal apps, benchmarking, CI/CD integration and more
If not designed carefully, GenAI applications can produce outputs that have errors, lack grounding in verifiable data, or are simply irrelevant or incoherent, resulting in poor customer experiences and attrition. Even worse, an application's outputs could perpetuate bias, promote misinformation, or expose organizations to malicious attacks. By conducting proactive risk evaluations throughout the GenAIOps lifecycle, organizations can better understand and mitigate risks to achieve more secure, safe, and trustworthy customer experiences.

Whether you're evaluating and comparing models at the start of an AI project or running a final evaluation of your application to demonstrate production-readiness, every evaluation has these key components:
- The evaluation target: whether a base model or an application in development or in production, it's the thing you're trying to assess.
- The evaluation data: the inputs and generated outputs that form the basis of evaluation.
- The evaluators, or metrics, that help measure and compare performance in a consistent, interpretable way.

Today, we're excited to announce enhancements across these key components, making evaluations in Azure AI Foundry even more comprehensive and accessible for a broad set of generative AI use cases. Here's a quick summary before we dive into details:

Simplify model selection with enhanced benchmarks and model evaluations
- We've enhanced the model benchmarking experience in Azure AI Foundry, adding new performance metrics (e.g. latency, estimated cost, and throughput) and generation quality metrics. This allows users to compare base models across diverse criteria, to better understand potential trade-offs.
- Evaluate and compare base models using your own private data. This capability simplifies the model selection process by allowing organizations to compare how different models behave in real-world settings and assess which models align best with their unique requirements.

Drive robust, measurable insights with new and advanced evaluators
- New risk and safety evaluations for image and multimodal content provide an out-of-the-box way to assess the frequency and severity of harmful content in generative AI interactions containing imagery. These evaluations can help inform targeted mitigations and demonstrate production-readiness.
- Evaluations for quality metrics are now generally available for text-based generative AI models and apps. Using either no-code and/or code-first experiences, users can assess generative AI models and applications for key quality attributes such as groundedness, coherence, recall, and fluency.

Operationalize evaluations as part of your GenAIOps
- A new Python API allows developers to run built-in and custom text-based evaluations remotely in the cloud, streamlining the evaluation process at scale with the convenience of easy CI/CD integration.
- GitHub Actions for GenAI evaluations enable developers to run automated evaluations of their models and applications, for faster experimentation and iteration within their coding environment.
- In related news, continuous online evaluations of generated outputs are now available, allowing teams to monitor and improve AI applications in production. Additionally, as applications transition from development to production, developers will soon have the capability to document and share evaluation results, along with other key information about their fine-tuned models or applications, through AI reports.
With these expanded capabilities, cross-functional teams are empowered to iterate, launch, and govern their GenAI applications with greater observability and confidence.

New benchmarking experience in Azure AI Foundry
Picture this: You're a developer exploring the Azure AI model catalog, trying to find the right fit for your use case. You use search filters, explore available models, and read the model cards to identify strong contenders, but you're still not sure which model to choose. Why? Selecting the optimal model for an application isn't just about learning as much as you can about each individual model. Organizations need to understand and compare performance from multiple angles—accuracy, relevance, coherence, cost, and computational efficiency—to understand the trade-offs.

Now, an enhanced benchmarking experience enables developers to view comprehensive, detailed performance data for models in the Azure AI model catalog while also allowing for direct comparison across multiple models. This provides developers with a clearer picture of each model's relative performance across critical performance metrics to identify models that meet business requirements. Azure AI Foundry supports four categories of metrics to facilitate robust comparisons:
- Quality: Assess the accuracy, groundedness, coherence, and relevance of each model's output.
- Cost: Assess estimated costs associated with deploying and running the models.
- Latency: Assess the response times for each model to understand speed and responsiveness.
- Throughput: Assess the number of tasks each model can process within a specific time frame, to gauge scalability and efficiency.
Learn more in our documentation.
(Screenshot: model benchmarks within the Azure AI Foundry portal)

Evaluate and compare models using your own data
Once you have compared various models using benchmarks on public data, you might still be wondering which model will perform best for your specific use case. At this point, it would be more helpful to compare each model using your own test dataset that reflects the inputs and outputs typical of your intended use case. We're excited to provide developers with an easier way to do just that. Now, developers can easily evaluate and compare both base models and fine-tuned models from within the Azure AI Foundry portal. This is also helpful when comparing base models to fine-tuned models, to see the impact of your training data. With this update, developers can assess models using their own test data and pre-built quality and safety evaluators, for easier side-by-side model comparisons and data-driven decisions when building GenAI applications. Key components of this update, now available in public preview, include:
- A new entry point in the Azure AI model catalog to guide users through model evaluation.
- Expanded support for Azure OpenAI Service and Models as a Service (MaaS) models, so developers can evaluate these models and user-defined prompts directly within the Azure AI Foundry portal.
- A simplified evaluation setup wizard, so both experienced GenAI developers and those new to GenAI can navigate and evaluate models with ease.
- A new tool for real-time test data generation, helping developers rapidly create sample data for evaluation purposes.
- An enhanced evaluation results page to help developers visualize and quickly grasp the trade-offs between various evaluation metrics.
Learn more in our documentation.
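For teams that prefer a code-first route, the sketch below shows how a comparable evaluation might be run with the azure-ai-evaluation Python package against your own JSONL test data. The file name, column names, run name, and model configuration values are placeholders; the portal experience described above requires no code at all.

```python
# pip install azure-ai-evaluation
# Hedged sketch: file names, column names, and endpoint values are placeholders.
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

# AI-assisted quality evaluators are powered by an Azure OpenAI deployment acting as the judge.
model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<judge-deployment-name>",
}

# test_data.jsonl: one JSON object per line, e.g.
# {"query": "...", "context": "...", "response": "..."}
result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
    },
    evaluation_name="model-comparison-run",  # hypothetical run name
    output_path="./eval_results.json",
)

print(result["metrics"])  # aggregate scores across the dataset
```

Running the same dataset through two candidate models and comparing the resulting metrics side by side mirrors the portal's comparison flow.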
Evaluate for risk and safety in image and multimodal content
Risk and safety evaluations for images and multimodal content are now available in public preview in Azure AI Foundry. These evaluations can help organizations assess the frequency and severity of harmful content in human and AI-generated outputs to prioritize relevant risk mitigations. For example, these evaluations can help assess content risks in cases where 1) text inputs yield image outputs, 2) a combination of image and text inputs produces text outputs, and 3) images containing text (like memes) generate text and/or image outputs. Azure AI Foundry provides AI-assisted evaluators to streamline these evaluations at scale, where each evaluator functions like a grading assistant, using consistent and predefined grading instructions to assess large datasets of inputs and outputs across specific target metrics. Today, organizations can use these evaluations to assess generated outputs for hateful or unfair, violent, sexual, and self-harm-related content, as well as protected materials that may present infringement risks. These evaluators use a large multimodal language model hosted by Microsoft to not only grade the test datasets but also provide explanations for the evaluation results, so they are interpretable and actionable.

Making evaluations actionable is essential. Evaluation insights can help organizations compare base models and fine-tuned models to see which models are a better fit for their application. Or, they can help inform proactive steps to mitigate risk, such as activating image and multimodal content filters in Azure AI Content Safety to detect and block harmful content in real time. After making changes, users can re-run an evaluation and compare the new scores to their baseline results side-by-side to understand the impact of their work and demonstrate production readiness for stakeholders. Learn more in our documentation.

Evaluate GenAI models and applications for quality
We're excited to announce the general availability of quality evaluators for GenAI in Azure AI Foundry, accessible through the code-first Azure AI Foundry SDK experience and the no-code Azure AI Foundry portal. These evaluators provide a scalable way to assess models and applications against key performance and quality metrics. This update also includes improvements to pre-existing AI-assisted metrics, as well as explanations for evaluation results to help ensure they are interpretable and actionable. Generally available evaluators include:
AI-assisted evaluators (these require an Azure OpenAI deployment to assist the evaluation), which are commonly used for retrieval augmented generation (RAG) and business and creative writing scenarios:
• Groundedness
• Retrieval
• Relevance
• Coherence
• Fluency
• Similarity
Natural Language Processing (NLP) evaluators, which support assessments of the accuracy, precision, and recall of generative AI:
• F1 score
• ROUGE score
• BLEU score
• GLEU score
• METEOR score
Learn more in our documentation.

Announcing a Python API for remote evaluation
Previously, developers could only run local evaluations on their own machines when using the Azure AI Foundry SDK. Now, we're providing developers with a new, simplified Python API to run remote evaluations in the cloud. This API supports both built-in and custom prompt-based evaluators, allowing for scalable evaluation runs, seamless integration into CI/CD pipelines, and a more streamlined evaluation workflow (a minimal custom evaluator is sketched below).
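As an illustration of what a custom evaluator can look like, here is a minimal sketch: in this programming model a custom evaluator is essentially a callable that returns a dictionary of scores, so it can be passed alongside the built-in evaluators. The length-based metric below is purely hypothetical.

```python
# Hedged sketch of a custom evaluator: any callable returning a dict of scores
# can be plugged into the evaluators mapping used by evaluate().
class AnswerLengthEvaluator:
    """Hypothetical metric: rewards concise answers under a word budget."""

    def __init__(self, max_words: int = 120):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        words = len(response.split())
        return {
            "answer_length": words,
            "within_budget": 1.0 if words <= self.max_words else 0.0,
        }

# Local usage; the same object could be passed to evaluate(evaluators={...}).
evaluator = AnswerLengthEvaluator(max_words=80)
print(evaluator(response="Azure AI Foundry lets you run built-in and custom evaluations."))
```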
Plus, remote evaluation means developers don't need to manage their own infrastructure for orchestrating evaluations. Instead, they can offload the task to Azure. Learn more in our documentation.

GitHub Actions for GenAI evaluations are now available
Given trade-offs between business impact, risk, and cost, you need to be able to continuously evaluate your AI applications and run A/B experiments at scale. We are significantly simplifying this process with GitHub Actions that can be integrated seamlessly into existing CI/CD workflows in GitHub. With these actions, you can now run automated evaluations after each commit, using the Azure AI Foundry SDK to assess your applications for metrics such as groundedness, coherence, and fluency. First announced at GitHub Universe in October, these capabilities are now available in public preview.

GitHub Actions for online A/B experimentation are available to try in private preview. These enable developers to seamlessly and automatically run A/B experiments comparing different models, prompts, and/or general UX changes to an AI application after deploying to production as part of a CD workflow. Analysis via out-of-the-box model monitoring metrics and custom metrics is seamless, with results posted back directly to GitHub. To participate in the private preview, please sign up here.

Build production-ready GenAI apps with Azure AI Foundry
Want to learn about more ways to build trustworthy AI applications? Here are other exciting announcements from Microsoft Ignite to support your GenAIOps and governance workflows:
- Explore tracing and debugging capabilities to drive continuous improvement
- Monitor and improve GenAI apps in production
- Document and share evaluation results with business stakeholders
Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into best practices for evaluations and trustworthy AI in these sessions:
- Microsoft Ignite Keynote
- Trustworthy AI: Future trends and best practices
- Trustworthy AI: Advanced risk evaluation and mitigation
- Azure AI and the dev toolchain you need to infuse AI in all your apps
- Simulate, evaluate, and improve GenAI outputs with Azure AI Foundry
_________
Please note: This article was edited on Dec 30, 2024 to reflect the availability of risk and safety evaluations for images in public preview in Azure AI Foundry. This feature was previously announced as "coming soon" at Microsoft Ignite.
Continuously monitor your GenAI application with Azure AI Foundry and Azure Monitor
Now, Azure AI Foundry and Azure Monitor seamlessly integrate to enable ongoing, comprehensive monitoring of your GenAI application's performance from various perspectives, including token usage, operational metrics (e.g. latency and request count), and the quality and safety of generated outputs. With online evaluation, now available in public preview, you can continuously assess your application's outputs, regardless of its deployment or orchestration framework, using built-in or custom evaluation metrics. This approach can help organizations identify and address security, quality, and safety issues in both pre-production and post-production phases of the enterprise GenAIOps lifecycle. Additionally, online evaluations integrate seamlessly with new tracing capabilities in Azure AI Foundry, now available in public preview, as well as Azure Monitor Application Insights. Tying it all together, Azure Monitor enables you to create custom monitoring dashboards, visualize evaluation results over time, and set up alerts for advanced monitoring and incident response. Let's dive into how all these monitoring capabilities fit together to help you be successful when building enterprise-ready GenAI applications.

Observability and the enterprise GenAIOps lifecycle
The generative AI operations (GenAIOps) lifecycle is a dynamic development process that spans all the way from ideation to operationalization. It involves choosing the right base model(s) for your application, testing and making changes to the flow, and deploying your application to production. Throughout this process, you can evaluate your application's performance iteratively and continuously. This practice can help you identify and mitigate issues early and optimize performance as you go, helping ensure your application performs as expected. You can use the built-in evaluation capabilities in Azure AI Foundry, which now include remote evaluation and continuous online evaluation, to support end-to-end observability into your app's performance throughout the GenAIOps lifecycle. Online evaluation can be used in many different application development scenarios, including:
- Automated testing of application variants.
- Integration into DevOps CI/CD pipelines.
- Regularly assessing an application's responses for key quality metrics (e.g. groundedness, coherence, recall).
- Quickly responding to risky or inappropriate outputs that may arise during real-world use (e.g. containing violent, hateful, or sexual content).
- Production application monitoring and observability with Azure Monitor Application Insights.
Now, let's explore how you can use tracing for your application to begin your observability journey.

Gain deeper insight into your GenAI application's processes with tracing
Tracing enables comprehensive monitoring and deeper analysis of your GenAI application's execution. This functionality allows you to trace the process from input to output, review intermediate results, and measure execution times. Additionally, detailed logs for each function call in your workflow are accessible. You can inspect the parameters, metrics, and outputs of each AI model utilized, which facilitates debugging and optimization of your application while providing deeper insights into the functioning and outputs of the AI models. The Azure AI Foundry SDK supports tracing to various endpoints, including local viewers, Azure AI Foundry, and Azure Monitor Application Insights. Learn more about new tracing capabilities in Azure AI Foundry.
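As a rough sketch of what instrumenting an application for tracing can look like, the snippet below exports OpenTelemetry spans to Azure Monitor Application Insights using the azure-monitor-opentelemetry package. The connection string, span names, and attribute names are placeholders, and the Azure AI Foundry SDK's own tracing helpers (not shown here) can emit spans for model calls automatically.

```python
# pip install azure-monitor-opentelemetry opentelemetry-api
# Hedged sketch: the connection string and span/attribute names are placeholders.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Route OpenTelemetry traces to the Application Insights resource linked to your project.
configure_azure_monitor(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>",
)

tracer = trace.get_tracer(__name__)

def answer_question(query: str) -> str:
    # Wrap each step of the app in a span so inputs, outputs, and timings are captured.
    with tracer.start_as_current_span("answer_question") as span:
        span.set_attribute("app.request.query", query)
        response = "<call your deployed model or agent here>"
        span.set_attribute("app.response.length", len(response))
        return response

print(answer_question("What does online evaluation monitor?"))
```

Once spans like these land in Application Insights, the online evaluation and workbook experiences described below can run on top of the same trace data.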
Continuously measure the quality and safety of generated outputs with online evaluation
With online evaluation, now available in public preview, you can continuously evaluate your collected trace data for troubleshooting, monitoring, and debugging purposes. Online evaluation with Azure AI Foundry offers the following capabilities:
- Integration between Azure AI services and Azure Monitor Application Insights
- Monitor any deployed application, agnostic of deployment method or orchestration framework
- Support for trace data logged via the Azure AI Foundry SDK or a logging API of your choice
- Support for built-in and custom evaluation metrics via the Azure AI Foundry SDK
- Can be used to monitor your application during all stages of the GenAIOps lifecycle
To get started with online evaluation, please review the documentation and code samples.

Monitor your app in production with Azure AI Foundry and Azure Monitor
Azure Monitor Application Insights excels in application performance monitoring (APM) for live web applications, providing many experiences to help enhance the performance, reliability, and quality of your applications. Once you've started collecting data for your GenAI application, you can access an out-of-the-box dashboard view to help you get started with monitoring key metrics for your application directly from your Azure AI project. Insights are surfaced to you via an Azure Monitor workbook that is linked to your Azure AI project, helping you quickly observe trends for key metrics, such as token consumption, user feedback, and evaluations. You can customize this workbook and add tiles for additional metrics or insights based on your business needs. You can also share it with your team so they can get the latest insights as well.

Build enterprise-ready GenAI apps with Azure AI Foundry
Ready to learn more? Here are other exciting announcements from Microsoft Ignite to support your GenAIOps workflows:
- New tracing and debugging capabilities to drive continuous improvement
- New ways to evaluate models and applications in pre-production
- New ways to document and share evaluation results with business stakeholders
Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into best practices for GenAIOps with these breakout sessions:
- Multi-agentic GenAIOps from prototype to production with dev tools
- Trustworthy AI: Advanced risk evaluation and mitigation
- Azure AI and the dev toolchain you need to infuse AI in all your apps
Compare and select models with new benchmarking tools in Azure AI Foundry
Explore the latest updates to the model benchmarks experience in the Azure AI Foundry portal. These updates include direct integration with the Azure AI model catalog, new performance and cost metrics, and the ability to evaluate and compare models using your own private data.
AI reports: Improve AI governance and GenAIOps with consistent documentation
AI reports are designed to help organizations improve cross-functional observability, collaboration, and governance when developing, deploying, and operating generative AI applications and fine-tuned or custom models. These reports support AI governance best practices by helping developers document the purpose of their AI model or application, its features, potential risks or harms, and applied mitigations, so that cross-functional teams can track and assess production-readiness throughout the AI development lifecycle and then monitor it in production. Starting in December, AI reports will be available in private preview in a US and EU Azure region for Azure AI Foundry customers. To request access to the private preview of AI reports, please complete the Interest Form.

Furthermore, we are excited to announce new collaborations with Credo AI and Saidot to support customers' end-to-end AI governance. By integrating the best of Azure AI with innovative and industry-leading AI governance solutions, we hope to provide our customers with choice and help empower greater cross-functional collaboration to align AI solutions with their own principles and regulatory requirements.

Building on learnings at Microsoft
Microsoft's approach for governing generative AI applications builds on our Responsible AI Standard and the National Institute of Standards and Technology's AI Risk Management Framework. This approach requires teams to map, measure, and manage risks for generative applications throughout their development cycle. A core asset of the first—and iterative—map phase is the Responsible AI Impact Assessment. These assessments help identify potential risks and their associated harms, as well as mitigations to address them. As development of an AI system progresses, additional iterations can help development teams document their progress in risk mitigation and allow experts to review the evaluations and mitigations and make further recommendations or requirements before products are launched. Post-deployment, these assessments become a source of truth for ongoing governance and audits, and help guide how to monitor the application in production. You can learn more about Microsoft's approach to AI governance in our Responsible AI Transparency Report and find a Responsible AI Impact Assessment Guide and example template on our website.

How AI reports support AI impact assessments and GenAIOps
AI reports can help organizations govern their GenAI models and applications by making it easier for developers to provide the information needed for cross-functional teams to assess production-readiness throughout the GenAIOps lifecycle. Developers will be able to assemble key project details, such as the intended business use case, potential risks and harms, model card, model endpoint configuration, content safety filter settings, and evaluation results, into a unified AI report from within their development environment. Teams can then publish these reports to a central dashboard in the Azure AI Foundry portal, where business leaders can track, review, update, and assess reports from across their organization. Users can also export AI reports in PDF and industry-standard SPDX 3.0 AI BOM formats, for integration into existing GRC workflows. These reports can then be used by the development team, their business leaders, and AI, data, and other risk professionals to determine if an AI model or application is fit for purpose and ready for production as part of their AI impact assessment processes.
Being versioned assets, AI reports can also help organizations build a consistent bridge across experimentation, evaluation, and GenAIOps by documenting what metrics were evaluated, what will be monitored in production, and the thresholds that will be used to flag an issue for incident response. For even greater control, organizations can choose to implement a release gate or policy as part of their GenAIOps that validates whether an AI report has been reviewed and approved for production. Key benefits of these capabilities include:
- Observability: Provide cross-functional teams with a shared view of AI models and applications in development, in review, and in production, including how these projects perform in key quality and safety evaluations.
- Collaboration: Enable consistent information-sharing between GRC, development, and operational teams using a consistent and extensible AI report template, accelerating feedback loops and minimizing non-coding time for developers.
- Governance: Facilitate responsible AI development across the GenAIOps lifecycle, reinforcing consistent standards, practices, and accountability as projects evolve or expand over time.

Build production-ready GenAI apps with Azure AI Foundry
If you are interested in testing AI reports and providing feedback to the product team, please request access to the private preview by completing the Interest Form. Want to learn more about building trustworthy GenAI applications with Azure AI? Here's more guidance and exciting announcements to support your GenAIOps and governance workflows from Microsoft Ignite:
- Learn about new GenAI evaluation capabilities in Azure AI Foundry
- Learn about new GenAI monitoring capabilities in Azure AI Foundry
- Learn about new IT governance capabilities in Azure AI Foundry
Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into capabilities that support trustworthy AI with these sessions:
- Keynote: Microsoft Ignite Keynote
- Breakout: Trustworthy AI: Future trends and best practices
- Breakout: Trustworthy AI: Advanced AI risk evaluation and mitigation
- Demo: Simulate, evaluate, and improve GenAI outputs with Azure AI Foundry
- Demo: Track and manage GenAI app risks with AI reports in Azure AI Foundry
We'll also be available for questions in the Connection Hub on Level 3, where you can find "ask the expert" stations for Azure AI and Trustworthy AI.
Introducing Evaluation API on Azure OpenAI Service
We are excited to announce the new Evaluations (Evals) API in Azure OpenAI Service! The Evaluation API lets users test and improve model outputs directly through API calls, making the experience simple and customizable for developers to programmatically assess model quality and performance in their development workflows.
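As a heavily hedged sketch of what defining an eval programmatically might look like, the snippet below uses the openai Python package's Azure client. It assumes the Azure OpenAI Evals preview mirrors the OpenAI SDK's evals resource; the endpoint, api_version, eval name, and field names are assumptions that should be verified against the Azure OpenAI reference documentation.

```python
# pip install openai
# Hedged sketch: assumes the Azure OpenAI Evals preview exposes the same evals
# resource as the OpenAI Python SDK; endpoint, api_version, and field names are
# assumptions to verify against the Azure OpenAI reference documentation.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="<preview-api-version>",
)

# Define an eval: what each test item looks like and how outputs are graded.
evaluation = client.evals.create(
    name="support-answer-accuracy",  # hypothetical eval name
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "expected_answer": {"type": "string"},
            },
            "required": ["question", "expected_answer"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "string_check",
            "name": "exact_match",
            "input": "{{sample.output_text}}",
            "operation": "eq",
            "reference": "{{item.expected_answer}}",
        }
    ],
)
print(evaluation.id)
```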
Start your Trustworthy AI Development with Safety Leaderboards in Azure AI Foundry
Selecting the right model for your AI application is more than a technical decision—it's a foundational step in ensuring trust, compliance, and governance in AI. Today, we are excited to announce the public preview of safety leaderboards within Foundry model leaderboards, helping customers incorporate model safety as a first-class criterion alongside quality, cost, and throughput. This feature introduces three key components to support responsible AI development:
- A dedicated safety leaderboard highlighting the safest models;
- A quality–safety trade-off chart to balance performance and risk;
- Five new scenario-specific leaderboards supporting diverse responsible AI scenarios.

Prioritize safety with the new leaderboard
The safety leaderboard ranks the top models based on their robustness against generating harmful content. This is especially valuable in regulated or high-risk domains—such as healthcare, education, or financial services—where model outputs must meet high safety standards. To ensure benchmark rigor and relevance, we apply a structured filtering and validation process to select benchmarks. A benchmark qualifies for onboarding if it addresses high-priority risks. For safety and responsible AI leaderboards, we look at benchmarks that are reliable enough to provide some signal on the targeted areas of interest as they relate to safety. Our current safety leaderboard uses the HarmBench benchmark, which includes prompts designed to elicit harmful behaviors from models. The benchmark covers 7 semantic categories of behaviors:
- Cybercrime & Unauthorized Intrusion
- Chemical & Biological Weapons/Drugs
- Copyright Violations
- Misinformation & Disinformation
- Harassment & Bullying
- Illegal Activities
- General Harm
These 7 categories are organized into three broader functional groupings:
- Standard Harmful Behaviors
- Contextual Harmful Behaviors
- Copyright Violations
Each grouping is featured in a separate responsible AI scenario leaderboard. We use the prompts and evaluators from HarmBench to calculate Attack Success Rate (ASR) and aggregate it across the functional groupings to proxy model safety. Lower ASR values mean that a model is more robust against attacks that attempt to elicit harmful content (a small worked example appears at the end of this section).

We understand and acknowledge that model safety is a complex topic and has several dimensions. No single current open-source benchmark can test or represent the full spectrum of model safety in different scenarios. Additionally, most of these benchmarks suffer from saturation or misalignment between benchmark design and the risk definition, and can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether the benchmark accurately captures the nuances of the risks. This can lead to either overestimating or underestimating model performance in real-world safety scenarios. While the HarmBench dataset covers a limited set of harmful topics, it can still provide a high-level understanding of safety trends.

Navigate trade-offs with the quality-safety chart
Model selection often involves compromise across multiple criteria. Our new quality–safety trade-off chart helps you make informed decisions by comparing models based on their performance in safety and quality. You can:
- Identify the safest model, measured by Attack Success Rate (lower is better), at a given level of quality performance; or
- Choose the highest-performing model in quality (higher is better) that still meets a defined safety threshold.
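To make the two selection rules above concrete, here is a small, self-contained sketch of how Attack Success Rate could be aggregated across functional groupings and how a "safest model above a quality bar" pick might work. The numbers are invented for illustration and do not come from the actual leaderboards.

```python
# Hedged sketch with invented numbers: aggregate per-grouping Attack Success Rate (ASR)
# and pick the safest model that still clears a minimum quality score.
def aggregate_asr(results: dict[str, tuple[int, int]]) -> float:
    """results maps functional grouping -> (successful attacks, total attack prompts)."""
    successes = sum(s for s, _ in results.values())
    total = sum(t for _, t in results.values())
    return successes / total

models = {
    # model name: (per-grouping attack outcomes, quality score 0-100), illustrative only
    "model-a": ({"standard": (12, 200), "contextual": (18, 200), "copyright": (5, 100)}, 78),
    "model-b": ({"standard": (4, 200), "contextual": (9, 200), "copyright": (2, 100)}, 71),
    "model-c": ({"standard": (2, 200), "contextual": (6, 200), "copyright": (1, 100)}, 64),
}

QUALITY_FLOOR = 70  # minimum acceptable quality score

candidates = [
    (name, aggregate_asr(attacks), quality)
    for name, (attacks, quality) in models.items()
    if quality >= QUALITY_FLOOR
]
# Lower ASR is better, so take the candidate with the smallest aggregated ASR.
name, asr, quality = min(candidates, key=lambda item: item[1])
print(f"Pick {name}: ASR={asr:.1%}, quality={quality}")
```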
Together with the quality-cost trade-off chart, you can find the best trade-off between quality, safety, and cost when selecting a model.

Scenario-based responsible AI leaderboards
To support customers' diverse responsible AI scenarios, we have added 5 new leaderboards to rank the top models in safety and broader responsible AI scenarios. Each leaderboard is powered by industry-standard public benchmarks covering:
- Model robustness against harmful behaviors, using HarmBench in 3 scenarios targeting standard harmful behaviors, contextually harmful behaviors, and copyright violations: Consistent with the safety leaderboard, lower ASR scores for a model mean better robustness against generating harmful content.
- Model ability to detect toxic content, using the Toxigen benchmark: This benchmark targets adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. Higher accuracy based on F1-score indicates a better ability to detect toxic content.
- Model knowledge of sensitive domains, including cybersecurity, biosecurity, and chemical security, using the Weapons of Mass Destruction Proxy benchmark (WMDP): A higher accuracy score for a model denotes more knowledge of dangerous capabilities.
These scenario leaderboards allow developers, compliance teams, and AI governance stakeholders to align model selection with organizational risk tolerance and regulatory expectations.

Building Trustworthy AI Starts with the Right Tools
With safety leaderboards now available in public preview, Foundry model leaderboards offer a unified, transparent, and data-driven foundation for selecting models that align with your safety requirements. This addition empowers teams to move from ad hoc evaluation to principled model selection—anchored in industry-standard benchmarks and responsible AI practices. To learn more, explore the methodology documentation and start building AI solutions you—and your stakeholders—can trust.