evaluation
21 Topics

The Future of AI: Creating a Web Application with Vibe Coding
Discover how vibe coding with GPT-5 in Azure AI Foundry transforms web development. This post walks through building a Translator API-powered web app using natural language instructions in Visual Studio Code. Learn how adaptive translation, tone and gender customization, and Copilot agent collaboration redefine the developer experience.

Announcing a new Azure AI Translator API (Public Preview)
Microsoft has launched the Azure AI Translator API (Public Preview), offering flexible translation options using either neural machine translation (NMT) or generative AI models like GPT-4o. The API supports tone, gender, and adaptive custom translation, allowing enterprises to tailor output for real-time or human-reviewed workflows. Customers can mix models in a single request and authenticate via resource key or Entra ID. LLM features require deployment in Azure AI Foundry. Pricing is based on characters (NMT) or tokens (LLMs).
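As a rough illustration of how a request to the service might look, here is a minimal sketch that calls the Translator REST endpoint with `requests`. The commented tone, gender, and deployment fields are placeholders for the preview capabilities described above, not confirmed parameter names; check the public-preview reference for the actual schema.

```python
# Minimal sketch of calling the Azure AI Translator REST API.
# NOTE: the commented "tone", "gender", and "deploymentName" fields are illustrative
# placeholders for the preview capabilities described in the post, not confirmed
# parameter names; consult the public-preview documentation for the real schema.
import requests

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
RESOURCE_KEY = "<your-translator-resource-key>"   # or authenticate with an Entra ID token
REGION = "westus2"                                # region of your Translator resource

def translate(text: str, target_lang: str) -> dict:
    params = {"api-version": "3.0", "to": target_lang}  # preview features may require a newer api-version
    headers = {
        "Ocp-Apim-Subscription-Key": RESOURCE_KEY,
        "Ocp-Apim-Subscription-Region": REGION,
        "Content-Type": "application/json",
    }
    body = [{
        "text": text,
        # Hypothetical preview options (tone / gender / LLM deployment) -- adjust to the real schema:
        # "tone": "formal",
        # "gender": "neutral",
        # "deploymentName": "gpt-4o",
    }]
    response = requests.post(ENDPOINT, params=params, headers=headers, json=body, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(translate("Hello, how can I help you today?", "fr"))
```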

The Future of AI: Vibe Code with Adaptive Custom Translation
This blog explores how vibe coding—a conversational, flow-based development approach—was used to build the AdaptCT playground in Azure AI Foundry. It walks through setting up a productive coding environment with GitHub Copilot in Visual Studio Code, configuring the Copilot agent, and building a translation playground using Adaptive Custom Translation (AdaptCT). The post includes real-world code examples, architectural insights, and advanced UI patterns. It also highlights how AdaptCT fine-tunes LLM outputs using domain-specific reference sentence pairs, enabling more accurate and context-aware translations. The blog concludes with best practices for vibe coding teams and a forward-looking view of AI-augmented development paradigms.
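The core idea behind adaptive translation—steering an LLM with domain-specific reference sentence pairs—can be sketched as a simple few-shot prompt. This is an illustrative pattern only, not the AdaptCT API itself; the deployment name, endpoint, and reference pairs below are placeholders.

```python
# Illustrative few-shot pattern for adaptive translation: reference sentence pairs
# steer the model toward domain-specific terminology. This is NOT the AdaptCT API,
# just a sketch of the underlying idea; deployment name and pairs are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

reference_pairs = [  # domain-specific (source, target) examples
    ("The claim was denied.", "La réclamation a été refusée."),
    ("Submit the adjuster's report.", "Soumettez le rapport de l'expert en sinistres."),
]

def adaptive_translate(text: str) -> str:
    examples = "\n".join(f"EN: {s}\nFR: {t}" for s, t in reference_pairs)
    prompt = (
        "Translate from English to French, matching the terminology and style "
        f"of these reference pairs:\n{examples}\n\nEN: {text}\nFR:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # your Azure OpenAI deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```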

The Future of AI: Developing Lacuna - an agent for Revealing Quiet Assumptions in Product Design
A conversational agent named Lacuna is helping product teams uncover hidden assumptions embedded in design decisions. Built with Copilot Studio and powered by Azure AI Foundry, Lacuna analyzes product documents to identify speculative beliefs and assess their risk using design analysis lenses: impact, confidence, and reversibility. By surfacing cognitive biases and prompting reflection, Lacuna encourages teams to validate assumptions through lightweight evidence-gathering methods. This experiment in human-AI collaboration explores how agents can foster epistemic humility and transform static documents into dynamic conversations.

The Future of AI: Harnessing AI agents for Customer Engagements
Discover how AI-powered agents are revolutionizing customer engagement—enhancing real-time support, automating workflows, and empowering human professionals with intelligent orchestration. Explore the future of AI-driven service, including Customer Assist created with Azure AI Foundry.

Start your Trustworthy AI Development with Safety Leaderboards in Azure AI Foundry
Selecting the right model for your AI application is more than a technical decision—it's a foundational step in ensuring trust, compliance, and governance in AI. Today, we are excited to announce the public preview of safety leaderboards within Foundry model leaderboards, helping customers incorporate model safety as a first-class criterion alongside quality, cost, and throughput. This feature introduces three key components to support responsible AI development:
- A dedicated safety leaderboard highlighting the safest models
- A quality–safety trade-off chart to balance performance and risk
- Five new scenario-specific leaderboards supporting diverse responsible AI scenarios

Prioritize safety with the new leaderboard
The safety leaderboard ranks the top models based on their robustness against generating harmful content. This is especially valuable in regulated or high-risk domains—such as healthcare, education, or financial services—where model outputs must meet high safety standards. To ensure benchmark rigor and relevance, we apply a structured filtering and validation process: a benchmark qualifies for onboarding only if it addresses high-priority risks. For the safety and responsible AI leaderboards, we select benchmarks that are reliable enough to provide a meaningful signal on the targeted safety risks. Our current safety leaderboard uses the HarmBench benchmark, which includes prompts designed to elicit harmful behaviors from models. The benchmark covers seven semantic categories of behaviors:
- Cybercrime & Unauthorized Intrusion
- Chemical & Biological Weapons/Drugs
- Copyright Violations
- Misinformation & Disinformation
- Harassment & Bullying
- Illegal Activities
- General Harm

These seven categories are organized into three broader functional groupings:
- Standard Harmful Behaviors
- Contextual Harmful Behaviors
- Copyright Violations

Each grouping is featured in a separate responsible AI scenario leaderboard. We use the prompts and evaluators from HarmBench to calculate Attack Success Rate (ASR) and aggregate it across the functional groupings as a proxy for model safety. Lower ASR values mean that a model is more robust against attacks that attempt to elicit harmful content. We understand and acknowledge that model safety is a complex topic with several dimensions. No single open-source benchmark can test or represent the full spectrum of model safety across different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether a benchmark accurately captures the nuances of the risks. This can lead to either overestimating or underestimating model performance in real-world safety scenarios. While the HarmBench dataset covers a limited set of harmful topics, it can still provide a high-level understanding of safety trends.

Navigate trade-offs with the quality-safety chart
Model selection often involves compromise across multiple criteria. Our new quality–safety trade-off chart helps you make informed decisions by comparing models based on their performance in safety and quality. You can:
- Identify the safest model, measured by Attack Success Rate (lower is better), at a given level of quality performance; or
- Choose the highest-performing model in quality (higher is better) that still meets a defined safety threshold.
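To make the trade-off concrete, here is a small illustrative sketch of both selection strategies over leaderboard-style data. The model names and scores are fabricated for the example; ASR is expressed as the fraction of successful attacks (lower is better) and quality as a generic 0–100 score.

```python
# Illustrative only: selecting models from leaderboard-style (quality, ASR) data.
# Model names and numbers are fabricated for the example; lower ASR = safer.
models = [
    {"name": "model-a", "quality": 82.0, "asr": 0.12},
    {"name": "model-b", "quality": 78.5, "asr": 0.05},
    {"name": "model-c", "quality": 88.0, "asr": 0.21},
]

def safest_at_quality(models, min_quality: float):
    """Safest model (lowest ASR) among those meeting a quality bar."""
    eligible = [m for m in models if m["quality"] >= min_quality]
    return min(eligible, key=lambda m: m["asr"]) if eligible else None

def best_quality_at_safety(models, max_asr: float):
    """Highest-quality model among those meeting a safety threshold."""
    eligible = [m for m in models if m["asr"] <= max_asr]
    return max(eligible, key=lambda m: m["quality"]) if eligible else None

print(safest_at_quality(models, min_quality=80))      # -> model-a
print(best_quality_at_safety(models, max_asr=0.15))   # -> model-a
```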
Together with the quality–cost trade-off chart, this lets you find the best balance between quality, safety, and cost when selecting a model.

Scenario-based responsible AI leaderboards
To support customers' diverse responsible AI scenarios, we have added five new leaderboards that rank the top models in safety and broader responsible AI scenarios. Each leaderboard is powered by industry-standard public benchmarks covering:
- Model robustness against harmful behaviors, using HarmBench in three scenarios targeting standard harmful behaviors, contextually harmful behaviors, and copyright violations. Consistent with the safety leaderboard, lower ASR scores mean better robustness against generating harmful content.
- Model ability to detect toxic content, using the Toxigen benchmark. This benchmark targets adversarial and implicit hate speech detection and contains implicitly toxic and benign sentences mentioning 13 minority groups. Higher accuracy based on F1-score indicates a better ability to detect toxic content.
- Model knowledge of sensitive domains, including cybersecurity, biosecurity, and chemical security, using the Weapons of Mass Destruction Proxy (WMDP) benchmark. A higher accuracy score denotes more knowledge of dangerous capabilities.

These scenario leaderboards allow developers, compliance teams, and AI governance stakeholders to align model selection with organizational risk tolerance and regulatory expectations.

Building Trustworthy AI Starts with the Right Tools
With safety leaderboards now available in public preview, Foundry model leaderboards offer a unified, transparent, and data-driven foundation for selecting models that align with your safety requirements. This addition empowers teams to move from ad hoc evaluation to principled model selection—anchored in industry-standard benchmarks and responsible AI practices. To learn more, explore the methodology documentation and start building AI solutions you—and your stakeholders—can trust.

Introducing Evaluation API on Azure OpenAI Service
We are excited to announce the new Evaluations (Evals) API in Azure OpenAI Service! The Evaluation API lets users test and improve model outputs directly through API calls, making the experience simple and customizable for developers to programmatically assess model quality and performance in their development workflows.
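As a rough sketch of what a programmatic evaluation could look like, the snippet below uses the OpenAI Python SDK pointed at an Azure OpenAI resource. It assumes the resource exposes the OpenAI-compatible evals surface in preview; the api_version, item schema, and grader configuration are illustrative and should be checked against the official Evaluation API reference.

```python
# Sketch of an evals workflow via the OpenAI Python SDK against Azure OpenAI.
# The api_version, schema fields, and grader configuration are illustrative
# assumptions; verify the exact shapes in the Evaluation API documentation.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2025-04-01-preview",  # assumed preview version
)

# 1) Define an eval: what a test item looks like and how outputs are graded.
evaluation = client.evals.create(
    name="faq-answer-quality",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"question": {"type": "string"}, "ideal": {"type": "string"}},
            "required": ["question", "ideal"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[{
        "type": "string_check",
        "name": "exact_match",
        "input": "{{sample.output_text}}",
        "reference": "{{item.ideal}}",
        "operation": "eq",
    }],
)

# 2) Run the eval against a model deployment with a small inline dataset.
run = client.evals.runs.create(
    evaluation.id,
    name="gpt-4o-mini-baseline",
    data_source={
        "type": "completions",
        "model": "gpt-4o-mini",  # your deployment name
        "input_messages": {
            "type": "template",
            "template": [{"role": "user", "content": "{{item.question}}"}],
        },
        "source": {
            "type": "file_content",
            "content": [{"item": {"question": "What is 2+2?", "ideal": "4"}}],
        },
    },
)
print(run.id, run.status)
```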

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.
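One building block of such an agent—recognizing apparel in an uploaded image and asking for matching recommendations—can be sketched with a vision-capable chat model. The endpoint, deployment name, and prompt below are placeholders; the full shopping agent described in the post involves more orchestration than this single call.

```python
# Minimal sketch: apparel recognition plus a recommendation request using a
# vision-capable chat deployment on Azure OpenAI. Endpoint, key, and deployment
# name are placeholders; this is not the full agent from the post.
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

def recommend_from_image(image_path: str, catalog_hint: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",  # a vision-capable deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Identify the apparel items in this photo and suggest three "
                         f"matching products from this catalog category: {catalog_hint}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```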

The Future of AI: Reduce AI Provisioning Effort - Jumpstart your solutions with AI App Templates
In the previous post, we introduced Contoso Chat – an open-source RAG-based retail chat sample for Azure AI Foundry that serves as both an AI App template (for builders) and the basis for a hands-on workshop (for learners). We also briefly covered the five stages in the developer workflow (provision, setup, ideate, evaluate, deploy) that take you from the initial prompt to a deployed product. But how can that sample help you build your app? The answer lies in developer tools and AI App templates that jumpstart productivity by giving you a fast start and a solid foundation to build on. In this post, we answer that question with a closer look at Azure AI App templates - what they are, and how we can jumpstart our productivity with a reuse-and-extend approach that builds on open-source samples for core application architectures.

Automate Quota Discovery in Azure AI Foundry: A Tale of 3 APIs
Automate the discovery of Azure regions that meet your AI deployment needs using three essential APIs: the Models API, the Usages API, and the Locations API. This process helps reduce decision fatigue and ensures compliance with enterprise-wide model deployment standards. Key learnings:
- Model Deployment Requirements: understand the needs of a standard Retrieval-Augmented Generation (RAG) application, which involves deploying multiple models.
- Automation Benefits: streamline your deployment process and ensure compliance with enterprise standards.
- Three Essential APIs:
  - Models API: query the models available to a subscription within a chosen location.
  - Usages API: assess current usages and limits to infer available quota.
  - Locations API: obtain a list of all available regions.

A comprehensive Jupyter notebook with the implementation steps is available in the accompanying GitHub repository. This resource is invaluable for AI developers looking to streamline their deployment processes and ensure their applications meet all necessary requirements.
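For orientation, here is a minimal sketch of how the three calls might be wired together with azure-identity and plain REST requests against Azure Resource Manager. The API versions, response-field access, and quota-matching logic are assumptions for illustration; the notebook referenced above is the authoritative implementation.

```python
# Sketch: querying the Locations, Models, and Usages APIs to find regions with quota.
# API versions and filtering logic are illustrative assumptions; see the notebook
# referenced in the post for the authoritative implementation.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"
ARM = "https://management.azure.com"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

def get(url: str, api_version: str) -> dict:
    resp = requests.get(url, headers=headers, params={"api-version": api_version}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# 1) Locations API: all regions available to the subscription.
locations = get(f"{ARM}/subscriptions/{SUBSCRIPTION_ID}/locations", "2022-12-01")
regions = [loc["name"] for loc in locations["value"]]

# 2) + 3) For each region, check the Models and Usages APIs for Cognitive Services.
for region in regions[:5]:  # limit for the example
    base = f"{ARM}/subscriptions/{SUBSCRIPTION_ID}/providers/Microsoft.CognitiveServices/locations/{region}"
    models = get(f"{base}/models", "2023-05-01")["value"]
    usages = get(f"{base}/usages", "2023-05-01")["value"]
    # Illustrative checks: is gpt-4o offered here, and how much of each quota remains?
    has_gpt4o = any(m.get("model", {}).get("name") == "gpt-4o" for m in models)
    remaining = {u["name"]["value"]: u["limit"] - u["currentValue"] for u in usages}
    print(region, "| gpt-4o available:", has_gpt4o,
          "| sample remaining quota:", dict(list(remaining.items())[:3]))
```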