Deploy large language models responsibly with Azure AI
Published Jul 18 2023 08:45 AM 46.8K Views

Using pre-trained AI models offers significant benefits, including reducing development time and compute costs. Today we announced the availability of Meta’s Llama 2 (Large Language Model Meta AI) in Azure AI, enabling Azure customers to evaluate, customize, and deploy Llama 2 for commercial applications. Embedding Llama 2 and other pre-trained large language models (LLMs) into applications with Azure enables customers to innovate faster, by tapping into Azure’s end-to-end machine learning capabilities, unmatched scalability, and built-in security.


Starting with a great model is just the first step. We’ve learned through our experiences developing GitHub Copilot, Bing Chat, Microsoft 365 Copilot and many other generative AI applications that the model alone isn’t the full story. There is a lot more that developers must do to build a generative AI application responsibly, and it’s difficult for developers to do this work from scratch. In Azure AI, we have been investing in a full suite of responsible AI tools and technologies to enable developers to create safe and trustworthy applications more easily and effectively.


In this blog, we will explore emerging guidance to mitigate risks presented by LLMs, and how customers can use Azure AI to get started implementing best practices using Llama 2, their own models, or any pre-built and customizable models from Microsoft, OpenAI, and the open-source ecosystem.


Deploying LLMs responsibly requires new, purpose-built tools


Microsoft’s AI innovations and partnerships are rooted in our foundational AI principles of Fairness, Reliability and Safety, Privacy and Security, Inclusiveness, Transparency and Accountability. Today, we have nearly 350 people working on responsible AI across Microsoft, helping us implement best practices for building safe, secure, and transparent AI systems designed to benefit society. The technologies we build for internal use, we also make available to customers and partners. Recently, increased adoption of LLMs for commercial applications at Microsoft and by our customers demanded novel methods for identifying, measuring, and mitigating potential harms for generative AI systems. As a company, we also understand that technology alone cannot address the risks and challenges presented by AI, so we are committed to sharing our learnings and internal processes through resources such as our Responsible AI Standard, documentation, and training.


LLMs like Llama 2 and OpenAI GPT-4 are sophisticated transformer-based models for natural language processing. Essentially, they take a string of text as an input (prompt) and then produce a predicted continuation of that text as the output (completion). While these models have demonstrated improvements in advanced capabilities such as content and code generation, summarization, and search, these improvements also come with increased responsible AI challenges. Without proper application design, LLMs are susceptible to perpetuating the bias and toxicity found in training data, providing false or misleading information, enabling new vectors for manipulation, and other harms. In addition to recommending new policies to help society address these new challenges, Microsoft is helping customers put responsible AI into practice by building custom mitigations into our AI products, sharing our learnings and perspective, and providing purpose-built tooling to support customers that want to build their own LLM solutions responsibly.


Mitigate potential LLM harms with an iterative, layered approach


Mitigating potential harms presented by these new models requires an iterative, layered approach that includes experimentation and measurement. We find that most production applications require a mitigation plan with four layers of technical mitigations: (1) the model, (2) safety system, (3) metaprompt and grounding, and (4) user experience layers. The model and safety system layers are typically platform layers, where built-in mitigations would be common across many applications. The next two layers depend on the application’s purpose and design, meaning the implementation of mitigations can vary a lot from one application to the next. Below, we’ll discuss how Azure AI can aid the use and development of each of these mitigation layers.


 Fig 1. Technical mitigation layers to build an LLM application


Discover, fine-tune, and evaluate models with Azure AI's model catalog


At the model layer, it's important to understand the model you'll be using and what fine-tuning steps may have already been taken by the model developers to align the model towards its intended uses and to reduce potential harms. Today, customers can use the model catalog in Azure Machine Learning to explore models from Meta, Hugging Face, and Azure OpenAI Service, organized by collection and task. To better-understand the capabilities and limitations of each model, you can view model cards that provide model descriptions and may have the option to try a sample inference or test the model with your own data. This makes it easy to visualize how well a selected model performs, and you can even compare the performance of multiple models side-by-side to pick the one that works best for your use case. You can also fine-tune models using your own training data to improve model performance for your use case.




Fig 2. Discover, fine-tune, and evaluate models Azure AI's model catalog


Let’s take Meta’s Llama 2 model as an example. The model card provides information about the model’s training data, capabilities, limitations, and mitigations that Meta already built in. With this information, you can better-understand whether the model is a good fit for your application and start identifying the potential harms you need to mitigate and measure during development.



Fig 3. Model card for Meta’s Llama in the model catalog


Deploy your LLM with a built-in safety system using Azure AI Content Safety


For most applications, it’s not enough to rely on the safety fine-tuning built into the model itself.  LLMs can make mistakes and are susceptible to attacks like jailbreaks. In many applications at Microsoft, we use an additional AI-based safety system, Azure AI Content Safety, to provide an independent layer of protection.


When you deploy Llama 2 through the model catalog, you’ll see the default option is to add Azure AI Content Safety to filter harmful inputs and outputs from the model. This can help mitigate intentional misuse by your users and mistakes by the model. When you deploy your model to an endpoint with this option selected, an Azure AI Content Safety resource will be created within your Azure subscription.


This safety system works by running both the prompt and completion for your model through an ensemble of classification models aimed at detecting and preventing the output of harmful content across four categories (hate, sexual, violence, and self-harm) and four severity levels (safe, low, medium, and high). The default configuration is set to filter content at the medium severity threshold for all four content harm categories for both prompts and completions. This system can work for all languages, but quality may vary. It has been specifically trained and tested in the following languages: English, German, Japanese, Spanish, French, Italian, Portuguese, and Chinese. Variations in API configurations and application design may affect completions and thus filtering behavior. In all cases, customers should do their own testing to ensure it works for their application.


Use Azure AI's prompt flow to build effective metaprompts and ground your model


Metaprompt design and proper data grounding are at the heart of every generative AI application. They provide an application’s unique differentiation and are also a key component in reducing errors and mitigating risks. At Microsoft, we find retrieval augmented generation (RAG) to be a particularly effective and flexible architecture. With RAG, you enable your application to retrieve relevant knowledge from selected data and incorporate it into your metaprompt to the model. In this pattern, rather than using the model to store information, which can change over time and based on context, the model functions as a reasoning engine over the data provided to it during the query. This improves the freshness, accuracy, and relevancy of inputs and outputs. In other words, RAG can ground your model in relevant data for more relevant results.


Once you have the right data flowing into your application through grounding, the next step is building a metaprompt. A metaprompt, or system message, is a set of instructions used to guide an AI system’s behavior. Prompt engineering plays a crucial role here by enabling AI developers to develop the most effective metaprompt for their desired outcomes. Ideally, a metaprompt will enable a model to use the grounding data effectively and enforce rules that mitigate harmful content generation or user manipulations like jailbreaks or prompt injections. You can also use a metaprompt to define a model’s tone to provide a more responsible user experience. For more powerful generative AI applications, you should consider using advanced prompt engineering techniques to mitigate harms, such as requiring citations with outputs, limiting the lengths or structure of inputs and outputs, and preparing pre-determined responses for sensitive topics.


Prompt flow in Azure Machine Learning streamlines the development, evaluation, and continuous integration and deployment (CI/CD) of prompt engineering projects. It empowers LLM application developers with an interactive experience that combines natural language prompts, templates, built-in tools, and Python code. We define a flow as an executable workflow that streamlines the development of your LLM-based application. By leveraging connections in prompt flow, users can easily establish and manage connections to external APIs and data sources, facilitating secure and efficient data exchanges and interactions within their AI applications. For example, a pre-built connection for Azure AI Content Safety is available, should you want to customize where and how content filters are used in your application’s workflows.


Apply user-centric design to mitigate potential harms in your LLM application


Applications aren’t complete without their end users. That’s why we call our LLM applications “copilots.” The right user experience can help users better-understand and use AI technology and avoid common mistakes. There are many ways to educate end users who will use or be affected by your system about its capabilities and limitations, whether as part of your application’s metaprompt strategy, design, or positioning. Our open source HAX Toolkit provides helpful guidance for thinking through potential harms and mitigation strategies at this application layer and suggests feedback mechanisms so users can help flag harmful content.


Some user-centered design and user experience (UX) interventions, guidance and best practices include:

  • Reinforce user responsibility: Remind people that they are accountable for the final content when they're reviewing AI-generated content. For example, when offering code suggestions, remind the developer to review and test suggestions before accepting.
  • Highlight potential inaccuracies in the AI-generated outputs: Make clear how well the system can do what it can do, both when users first start using the system and at appropriate times during ongoing use.
  • Disclose AI's role in the interaction: Make people aware that they are interacting with an AI system (as opposed to another human). AI models may output content that could imply that they're human-like, misleading people to think that a system has certain capabilities when it doesn't.


Test your LLM application using red teaming and customizable evaluations in Azure AI


It’s not enough to just adopt the best practice mitigations. In order to know that they are working effectively for your application, you will need to test them before deployment. One approach to doing this is manual stress-testing or red-teaming to identify places where the product may not yet be performing as expected. We find this an important technique to help to us identify gaps that we may have missed in our automated testing. We’ve developed a few best practices that help make red-teaming more effective.


Red teaming is a useful technique for finding gaps, but it’s difficult to optimize the performance of your mitigations with red teaming alone. To do that, we find we need evaluations that measure the effectiveness of our system on a variety of typical and adversarial prompts. Prompt flow in Azure Machine Learning offers evaluation flows to facilitate the ongoing assessment and improvement of LLM applications. You can customize or create your own evaluation flow tailored to your tasks and objectives and then run a bulk test to evaluate your flow outcomes overall.


As a next step, you can compare the evaluation results of different metaprompt variants, which enables more effective prompt engineering. For example, we can experiment with different RAG prompts to ground our model in data sources like instruction manuals, online reviews, and a product website, and then we can use a pre-built groundedness evaluation to measure the percentage of outputs that actually match data in these sources. By comparing results for different variants, we can see which metaprompts enabled the model to use the data more effectively and resulted in the most grounded completions.


Similarly, you can use evaluation flows with your own data and metrics to test your mitigations' effectiveness against additional potential harms such as jailbreaks and harmful content or any application-specific concerns.



Fig 4. Use pre-built and customizable evaluations to assess your LLM application



Fig 5. Run a bulk test to assess each prompt flow variant against your selected evaluations



Fig 6. Compare evaluation metrics across multiple variants


Once we build and thoroughly test an LLM application, we will work to define and execute a deployment and operational readiness plan. This includes completing appropriate reviews of the application and mitigation layers with relevant stakeholders, establishing pipelines to collect telemetry and feedback, and developing an incident response and rollback plan. As with any application, we find it helpful to do a phased delivery, which gives a limited set of people the opportunity to try the application, provide feedback, report issues and concerns, and suggest improvements before we release it more widely.


Innovate confidently with a responsible AI platform


With the Microsoft Cloud, organizations can safeguard their organization, employees, and data with a cloud that runs on trust. Azure AI has built-in governance, security, and compliance controls, spanning identity, data, networking, monitoring, and compliance. Choosing a trustworthy AI platform is especially important when we consider generative AI, which relies on vast quantities of data to create value. When you fine-tune your models with Azure AI, your fine-tuning data is stored encrypted at rest within your Azure subscription and your fine-tuned models are only available to you. Your data is not used to train underlying foundation models in the model catalog without your permission.


Ultimately, we believe every organization that creates or uses advanced AI systems will need to develop and implement its own AI governance systems. To be effective, those systems must go beyond standards and principles and invest in real-world tools and practices to support teams throughout the AI development lifecycle. Providing end-to-end tooling and guidance for the responsible deployment of generative AI is a top priority for Microsoft. The tools and practices we shared today are just the start, and we look forward to working with customers, partners, and the open-source community to continue delivering innovation that supports the responsible development and use of LLMs for enterprise.


Get started with large language models in Azure AI

Version history
Last update:
‎Nov 09 2023 11:09 AM
Updated by: