Announcing Llama 2 Inference APIs and Hosted Fine-Tuning through Models-as-a-Service in Azure AI
Published Nov 15, 2023 · Microsoft

We are excited to announce the upcoming preview of Models as a Service (MaaS), which offers pay-as-you-go (PayGo) inference APIs and hosted fine-tuning for Llama 2 in the Azure AI model catalog. We are expanding our partnership with Meta to offer Llama 2 as the first family of large language models (LLMs) available through MaaS in Azure AI Studio. MaaS makes it easy for generative AI developers to build LLM apps by offering access to Llama 2 as an API. We are dramatically reducing the barrier to getting started with Llama 2 by offering PayGo inference APIs billed by the number of tokens used. It takes just a few seconds to create a Llama 2 PayGo inference API that you can use to explore the model in the playground, or use with your favorite LLM tools like prompt flow, Semantic Kernel, or LangChain to build LLM apps. MaaS also offers the capability to fine-tune Llama 2 with your own data, helping the model better understand your domain or problem space and generate more accurate predictions for your scenario, at a lower price point. The Llama 2 inference APIs in Azure have content moderation built into the service, offering a layered approach to safety and following responsible AI best practices.

Earlier this summer, we announced the availability of Llama 2 on Azure in the model catalog in Azure Machine Learning, with turn-key support for operationalizing Llama 2 without the hassle of managing deployment code or infrastructure in your Azure environment. While our customers loved this experience, we heard that deploying models to dedicated virtual machines (VMs) took several minutes, and some customers had to work through support to get access to the sufficiently powerful GPUs needed to run the Llama 2 models. As LLMs get more advanced, they tend to need machines with high-end GPUs to host or fine-tune them, increasing the cost of working with these models in dedicated hosting environments. We believe that the cost and availability of GPUs shouldn’t stand in the way of generative AI developers working with LLMs.

We aim to make Azure the best platform for developing generative AI applications by providing seamless access to cutting-edge LLMs. Models as a Service, launching soon, enables model providers to offer their LLMs on Azure, kicking off with Meta’s Llama 2 family of models. Traditionally, the VMs with high-end GPUs needed to host frontier LLMs can generate thousands of tokens per second but are prohibitively expensive for dev-test cycles. MaaS eliminates the need to host models in dedicated VMs, especially for developers who don’t need high throughput during the dev-test phase of their projects. With PayGo inference APIs billed based on input and output tokens used, MaaS makes getting started easy and pricing attractive for generative AI projects. Additionally, Llama 2 models can be fine-tuned with your specific data through hosted fine-tuning to enhance prediction accuracy for tailored scenarios, allowing even the smaller 7B and 13B Llama 2 models to deliver superior performance for your needs at a fraction of the cost of the larger Llama 2-70B model.

Getting started with MaaS

If you are new to Azure AI Studio, review this blog to make yourself familiar with it and create your first project. Then:

1. Open the model catalog in AI Studio.
2. Filter by the Meta collection, or click the “View models” button on the MaaS announcement card.
3. Open the Llama-2-70b-chat model.
4. Click “Deploy” and pick the PayGo deployment option.
5. Subscribe to the offer to access the model, then deploy. Within a few seconds you will be taken to the Playground, where you can explore the model and customize the context or inference parameters to tweak the predictions.
6. Click the “View code” button to find the API endpoint, keys, and a code snippet for accessing the model programmatically.

You can use this API in LLM tools such as prompt flow, Semantic Kernel, LangChain, or any other tool that accepts a REST API with key-based authentication for inference. A minimal sketch of such a call appears below.
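
As an illustration, here is a minimal Python sketch of calling the deployed model over REST. The endpoint URL, payload shape, and response format below are placeholders, not the documented API: copy the real values and request schema from your deployment’s “View code” pane.

```python
import os

import requests

# Placeholder endpoint: use the URL shown in the "View code" pane of your deployment.
ENDPOINT_URL = "https://<your-llama2-deployment>.inference.example.azure.com/v1/chat/completions"
API_KEY = os.environ["LLAMA2_API_KEY"]  # the key from the "View code" pane

# Assumed OpenAI-style chat payload; the actual schema is shown in "View code".
payload = {
    "messages": [
        {"role": "user", "content": "Explain Models as a Service in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())
```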

Developing LLM apps using MaaS and prompt flow

Once you deploy the Llama 2 model, you can streamline the development of AI apps that use it via prompt flow. You can now use Llama 2 models in prompt flow through the Open Source LLM Tool. To access it, go to ‘More tools’ and select ‘Open Source LLM Tool’.

Then configure the tool to use your deployed Llama 2 endpoint. The tool supports both completion and chat API types, and you can configure additional parameters, such as temperature and token limits, to match your needs. For more details about the tool, refer to the prompt flow tool documentation.
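
If you prefer writing your own node instead of using the built-in tool, prompt flow also supports custom Python tools. The sketch below is illustrative only: the endpoint URL, request payload, and response parsing are hypothetical placeholders to be replaced with the values from your deployment’s “View code” pane.

```python
import os

import requests
from promptflow import tool


@tool
def llama2_chat(question: str) -> str:
    """Forward a question to a deployed Llama 2 MaaS endpoint and return the reply."""
    endpoint = os.environ["LLAMA2_ENDPOINT_URL"]  # placeholder; from "View code"
    api_key = os.environ["LLAMA2_API_KEY"]        # placeholder; from "View code"
    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"messages": [{"role": "user", "content": question}], "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    # Assumed response shape; adjust the parsing to the schema your endpoint returns.
    return resp.json()["choices"][0]["message"]["content"]
```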

Customize Llama 2 with hosted fine-tuning

When using generative AI models, developers and data scientists often look to fine-tuning to optimize performance for specific tasks. While approaches like Retrieval Augmented Generation (RAG) and prompt engineering work by injecting the right information and instructions into your prompt, fine-tuning customizes the large language model itself. Fine-tuning allows training on more examples than can fit in a prompt, so fine-tuned models can work with shorter prompts. This reduces the number of tokens used, driving down costs and reducing latency. Fine-tuning a smaller model with the right training data can yield performance comparable to that of a larger model for your specific task, driving cost savings without compromising quality.
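
To make that cost argument concrete, here is a back-of-the-envelope comparison. The per-token prices below are invented purely for illustration; actual MaaS prices are listed in the Azure Marketplace.

```python
# Illustrative prices only (USD per 1,000 tokens); real prices are in the Marketplace.
PRICE_PER_1K_TOKENS = {
    "llama-2-70b": 0.0040,
    "llama-2-7b-finetuned": 0.0010,
}


def monthly_cost(model: str, prompt_tokens: int, completion_tokens: int, calls: int) -> float:
    """Estimate monthly spend for a given per-call token budget and call volume."""
    total_tokens = (prompt_tokens + completion_tokens) * calls
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


# Long few-shot prompt on the large base model vs. a short prompt on a fine-tuned small model.
print(monthly_cost("llama-2-70b", prompt_tokens=2000, completion_tokens=300, calls=100_000))
print(monthly_cost("llama-2-7b-finetuned", prompt_tokens=200, completion_tokens=300, calls=100_000))
```

With these made-up numbers, the fine-tuned small model with short prompts comes to a small fraction of the base-model spend, which is exactly the effect fine-tuning aims for.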

The process of fine-tuning models has traditionally been a technical challenge, requiring hands-on management of GPU resources, which can deter adoption. With the hosted fine-tuning capabilities announced in the model catalog, you can easily get started with fine-tuning the Llama 2 models by simply providing your own training data. Hosted fine-tuning is currently supported for the Llama 2-7b, Llama 2-13b, and Llama 2-70b models. The service adopts a pay-as-you-go approach, ensuring you only pay for the actual training time your fine-tuning requires, as outlined in the Marketplace.
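
As a sketch of what “providing your own training data” can look like, the snippet below writes chat-style examples to a JSON Lines file. The field names here are hypothetical; consult the fine-tuning wizard for the exact schema it expects.

```python
import json

# Hypothetical record layout; the fine-tuning wizard documents the exact schema.
examples = [
    {
        "messages": [
            {"role": "user", "content": "What is our return policy?"},
            {"role": "assistant", "content": "Items can be returned within 30 days with a receipt."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Do you ship internationally?"},
            {"role": "assistant", "content": "Yes, we ship to most countries; delivery takes 7-14 days."},
        ]
    },
]

# One JSON object per line is the common format for fine-tuning datasets.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```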

Once your fine-tuning job completes, you have a fine-tuned model that has been customized to work well in your scenario. You can view the metrics of your fine-tuned model to determine whether it performs well enough for your use case.

Deploying your fine-tuned model is streamlined with our pay-as-you-go service, mirroring the flexibility of the MaaS inference APIs. Fine-tuned model deployments are billed based on the number of input and output tokens used, in addition to a small charge for hosting the fine-tuned model. Once deployed, you can integrate your fine-tuned model with leading LLM tools like prompt flow, LangChain, and Semantic Kernel, enhancing your AI capabilities effortlessly.
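
As one example of that integration, the sketch below wraps a deployed endpoint with LangChain’s Azure ML endpoint integration (available in LangChain at the time of writing). The endpoint URL and key are placeholders, and the content-formatter choice should match what your deployment actually expects.

```python
from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, LlamaContentFormatter

# Placeholders: take the real endpoint URL and key from the deployment's "View code" pane.
llm = AzureMLOnlineEndpoint(
    endpoint_url="https://<your-finetuned-deployment>.inference.example.azure.com/score",
    endpoint_api_key="<your-api-key>",
    content_formatter=LlamaContentFormatter(),
    model_kwargs={"temperature": 0.5},
)

print(llm("Summarize our product's refund rules in two sentences."))
```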

Building responsibly with content safety

Content safety is at the heart of Microsoft’s approach to enabling developers to build generative AI applications responsibly. The inference APIs include a content filtering system that works alongside the Llama 2 models, running both the prompt and the completion through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. When the system detects harmful content, you'll receive an error on the inference API call if the prompt was deemed inappropriate, or the response will be partially or completely truncated, with an appropriate message, if the generated output was deemed inappropriate. When building your application or system, you'll want to account for these scenarios where the content returned by the inference API is filtered, which might result in incomplete content.
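
A defensive sketch of handling those two outcomes might look like the following. The status code and the finish_reason field used here are assumptions borrowed from common chat-API conventions, not the documented behavior; check the service documentation for the actual error and response shapes.

```python
import requests


def safe_generate(endpoint: str, api_key: str, payload: dict) -> str:
    """Call the inference API and degrade gracefully when content is filtered."""
    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=60,
    )
    if resp.status_code == 400:
        # Assumed behavior: the prompt itself was rejected by the content filter.
        return "Sorry, that request could not be processed. Please rephrase and try again."
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    # Assumed field: flag completions the filter truncated so users know they're partial.
    if choice.get("finish_reason") == "content_filter":
        return choice["message"]["content"] + "\n[Note: this response was truncated by content filtering.]"
    return choice["message"]["content"]
```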

“With Model-as-a-Service (MaaS) on Azure AI, we're showing our commitment to giving developers the best in AI technology,” said John Montgomery, Corporate Vice President, Azure AI Platform at Microsoft. “We've collaborated with Meta to bring Llama 2 to Azure AI Studio through MaaS, making the best frontier and open language models more accessible and adaptable. It’s another pivotal step in our journey to empower people to innovate with generative AI.”

The upcoming preview of Models as a Service (MaaS) is another step in our journey to democratize AI. The PayGo inference APIs and hosted fine-tuning capabilities for state-of-the-art models like Llama 2 will dramatically reduce the barrier to adoption for these models. We hope these new features will empower every organization, developer, and data scientist in their journey to harness the power of generative AI, regardless of their skill level or organizational size. We are incredibly excited to see what you build with these new capabilities!

Learn more

To learn more, watch the Microsoft Ignite 2023 sessions to get familiar with other Azure AI announcements and start experimenting with Azure AI Studio and Azure Machine Learning.