Introducing Llama 2 on Azure

Microsoft

Jul 24, 2023

In recent times, advances in generative AI have shown us the potential of this technology to revolutionize the way we live and work. At Microsoft, we are constantly looking for new ways to empower our customers to harness the power of transformative technologies, and build on top of these technologies to benefit even more people.

At Microsoft Inspire, Microsoft and Meta expanded their AI partnership and announced support for the Llama 2 family of models on Azure and Windows. Llama 2 is the next generation of the large language model (LLM) developed and released by Meta. It is pretrained on 2 trillion tokens of public data and is designed to enable developers and organizations to build generative AI-powered tools and experiences. With this partnership, Microsoft is excited to be Meta’s preferred partner as Meta releases the new version of Llama 2 to commercial customers for the first time.

Llama 2 is now available in the model catalog in Azure Machine Learning. The model catalog, currently in public preview in Azure Machine Learning, is your hub for foundation models, and empowers users to easily discover, customize and operationalize large foundation models at scale. The native support for Llama 2 within the Azure Machine Learning model catalog enables users to use these models without having to manage any of the infrastructure or environment dependencies. It provides out-of-the-box support for model fine-tuning and evaluation, including a selection of optimizer libraries like DeepSpeed and ORT (ONNX Runtime), which speed up fine-tuning, along with LoRA (Low-Rank Adaptation of Large Language Models), which greatly reduces the memory and compute requirements for fine-tuning. Deployments of Llama 2 models in Azure come standard with Azure AI Content Safety integration, offering a built-in layered approach to safety and following responsible AI best practices.

Fig 1. Discover Llama 2 models in AzureML’s model catalog

Getting started with Llama 2 on Azure: Visit the model catalog to start using Llama 2. Models in the catalog are organized by collections. You can view models linked from the ‘Introducing Llama 2’ tile or filter on the ‘Meta’ collection, to get started with the Llama 2 models. The collection contains pretrained and fine-tuned variants of the 7B, 13B and 70B-parameter Llama 2 generative text models. The fine-tuned variants, called Llama-2-chat, are optimized for dialogue use cases.

You can view the model details as well as sample inputs and outputs for any of these models, by clicking through to the model card. The model card provides information about the model’s training data, capabilities, limitations, and mitigations that Meta already built in. With this information, you can better understand whether the model is a good fit for your application.

The sample inputs section of the model card specifies the inferencing parameters that can be used with the Llama 2 models. These parameters are optional and can control the response generated by the model.

Fig 2. Inference parameters in sample inputs listed on model cards

‘temperature’ controls randomness of the output – lowering the temperature will produce more repetitive and deterministic responses, while higher values will result in more unexpected or creative responses.
‘top_p’ also controls response randomness, but uses a different method. Lowering top_p will narrow the model’s token selection to likelier tokens. Increasing top_p will let the model choose from tokens with both high and low likelihood.
‘max_new_tokens’ sets a limit on the number of tokens per model response. You can think of tokens as pieces of words, each roughly 4 characters of typical English text. Since AzureML inference endpoints have a timeout of 90s, we recommend you set this parameter to no more than 512.
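To make the difference between ‘temperature’ and ‘top_p’ more concrete, here is a minimal sketch of how these two parameters are commonly applied when picking the next token. It is an illustration in plain Python, not the code that runs behind an AzureML endpoint; the function names and the toy logits are assumptions made for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the likeliest tokens until their combined probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index using temperature scaling followed by top_p."""
    candidates = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
print(apply_temperature(toy_logits, 0.5))  # sharper: favors the top token
print(apply_temperature(toy_logits, 1.5))  # flatter: more variety
print(sample_token(toy_logits, temperature=1.0, top_p=0.1))
```

With a low temperature, most of the probability lands on the top-scoring token, and with a small top_p the less likely tokens are dropped before sampling, which is why both settings make responses more predictable.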

A NOTE about compute requirements when using Llama 2 models: Fine-tuning, evaluating and deploying Llama 2 models require GPU compute of V100 / A100 SKUs. You can find the exact SKUs supported for each model in the information tooltip next to the compute selection field in the fine-tune / evaluate / deploy wizards. You can view and request AzureML compute quota here.

Fine-tuning and evaluating Llama 2 models:

When using generative AI models, we hear a strong need from customers to be able to customize these LLMs by fine-tuning them using their own data. In AzureML, we make it very easy to fine-tune these models using either UI-based or code-based methods. The fine-tuning job executes in your own workspace, and you can rest easy knowing that the training data you provide for fine-tuning these models is accessible only in your own workspace.

When fine-tuning models in AzureML, you can customize parameters such as epochs, learning rate, batch size, etc. from the ‘Advance settings’ screen. You can also enable optimizations such as LoRA, DeepSpeed and ORT to speed up fine-tuning and run it with reduced compute / memory requirements.
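To give a feel for why LoRA reduces memory and compute requirements, here is a back-of-the-envelope sketch. LoRA freezes a weight matrix and trains two much smaller low-rank matrices in its place, so the number of trainable parameters per matrix drops from d_out x d_in to rank x (d_out + d_in). The layer size and rank below are illustrative assumptions, not the values AzureML uses.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for the two low-rank matrices that LoRA adds.

    One matrix has shape (d_out, rank) and the other (rank, d_in).
    """
    return rank * (d_out + d_in)


# Illustrative values only: one square projection matrix and a small rank.
d = 4096
rank = 8

full = full_finetune_params(d, d)
lora = lora_params(d, d, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.4%} of the matrix")
```

Fewer trainable parameters means fewer gradients and less optimizer state to hold in GPU memory, which is where most of the savings come from.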

Fig 3. Finetune Llama 2 models in AzureML’s model catalog

You can also evaluate the model with your own test data to see how it would perform in your own use case. The evaluation metrics make it easy to visualize how well the selected model performed on your own test data. Leveraging AzureML’s powerful experimentation and tracking capabilities, you can also compare the performance of multiple models side-by-side to pick the one that would perform best in your scenario.

Fig 4. Evaluate Llama 2 models in AzureML’s model catalog

Deploying Llama 2 with built-in Azure AI Content Safety:

Additionally, we hear about the need to deploy and operationalize these models at scale. AzureML ensures that you can accelerate the deployment and management of these models with industry-leading machine learning operations (MLOps) capabilities. These allow users to deploy and score models faster with fully managed endpoints for batch and real-time predictions.

When deploying these models, it is important to mitigate potential harms presented by the model. We find that production applications require a multi-layered mitigation plan. Since LLMs can hallucinate and are susceptible to attacks like jailbreaks, it is not sufficient to rely solely on the safety fine-tuning built into the model. In many applications at Microsoft, we use an additional AI-based safety system, Azure AI Content Safety, to provide an independent layer of protection.

When you deploy any of the Llama 2 models through the model catalog, the default deployment option is to use Azure AI Content Safety to enable a safer content experience for both inputs and outputs. This integration can filter harmful inputs and outputs from the model to mitigate intentional misuse by your users and mistakes by the model. When you deploy your model to an endpoint with this option selected, an Azure AI Content Safety resource will be created within your Azure subscription.

This safety system works by running both the prompt and completion for your model through an ensemble of classification models aimed at detecting and preventing the output of harmful content across four categories (hate, sexual, violence, and self-harm) and four severity levels (safe, low, medium, and high). The default configuration is set to filter content at the medium severity threshold for all four content harm categories for both prompts and completions. This system can work for all languages, but quality may vary. Variations in API configurations and application design may affect completions and thus filtering behavior. In all cases, customers should do their own testing to ensure it works for their application. You can read more about Deploying LLMs responsibly with Azure AI in this blog post.
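The layered check described above can be pictured with a small sketch. This is not the Azure AI Content Safety API: ‘classify’ and ‘generate’ are hypothetical stand-ins for the classifier ensemble and the deployed model, and the sketch assumes that filtering at the medium threshold means content rated medium or high in any category is blocked.

```python
SEVERITY_ORDER = ["safe", "low", "medium", "high"]
CATEGORIES = ("hate", "sexual", "violence", "self-harm")


def is_blocked(severities, threshold="medium"):
    """Return True if any category is rated at or above the threshold.

    `severities` maps a category name to a severity level; categories
    that are missing are treated as "safe".
    """
    limit = SEVERITY_ORDER.index(threshold)
    return any(
        SEVERITY_ORDER.index(severities.get(category, "safe")) >= limit
        for category in CATEGORIES
    )


def guarded_completion(prompt, generate, classify, threshold="medium"):
    """Check the prompt, call the model, then check the completion.

    `generate` and `classify` are hypothetical callables supplied by the
    caller; None is returned when either check blocks the content.
    """
    if is_blocked(classify(prompt), threshold):
        return None
    completion = generate(prompt)
    if is_blocked(classify(completion), threshold):
        return None
    return completion


# Toy stand-ins so the sketch runs end to end.
def toy_classify(text):
    return {"violence": "high"} if "attack" in text else {}


def toy_generate(prompt):
    return f"Echo: {prompt}"


print(guarded_completion("hello there", toy_generate, toy_classify))
print(guarded_completion("plan an attack", toy_generate, toy_classify))
```

The point of the design is that the prompt and the completion are checked independently of the model, so a harmful output can still be caught when the model’s own safety fine-tuning is bypassed.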

Fig 5. Deploy Llama 2 models in AzureML’s model catalog with Azure Content Safety

Using Llama 2 with prompt flow in Azure: In the new world of generative AI, prompt engineering (the process of choosing the right words, phrases, etc. to guide the model) is critical to model performance. Prompt flow is a feature within Azure Machine Learning that streamlines the development, evaluation, and continuous integration and deployment (CI/CD) of prompt engineering projects. You can add a connection to a Llama 2 endpoint and use it in prompt flow to develop high-quality flows efficiently. This enables you to harness the full potential of the model and deliver impactful AI solutions for your use case. Learn more about prompt flow in Azure Machine Learning in this blog post.

Fig 6. Use AzureML Prompt flow to streamline prompt engineering for Llama 2 models

In conclusion, expanding the Azure Machine Learning model catalog with the addition of Llama 2 models is a big step forward in our commitment to democratizing AI using state-of-the-art LLMs. With these new models, we hope to empower every organization, developer and data scientist on their journey to harness the power of generative AI, regardless of their skill level or organizational size. Visit the Azure Machine Learning model catalog to get started with these new models right away. We are incredibly excited to see what you can build with Llama 2!

Get started with Llama 2 in Azure AI