Welcoming Mistral, Phi, Jais, Code Llama, NVIDIA Nemotron, and more to the Azure AI Model Catalog
Published Nov 15 2023 08:00 AM 65.6K Views
Microsoft

 

We are excited to announce the addition of several new foundation and generative AI models to the Azure AI model catalog. From Hugging Face, we have onboarded a diverse set of stable diffusion models, falcon models, CLIP, Whisper V3, BLIP, and SAM models. In addition to Hugging Face models, we are adding Code Llama and Nemotron models from Meta and NVIDIA respectively. We are also introducing our cutting-edge Phi models from Microsoft research. These exciting additions to the model catalog have resulted in 40 new models and 4 new modalities including text-to-image and image embedding.   


Today, we are also
pleased to announce Models as a Service. Pro developers will soon be able to easily integrate the latest AI models such as Llama 2 from Meta, Command from Cohere, Jais from G42, and premium models from Mistral as an API endpoint to their applications. They can also fine-tune these models with their own data without needing to worry about setting up and managing the GPU infrastructure, helping eliminate the complexity of provisioning resources and managing hosting.  

 
Below is additional information about the incredible new models we are bringing to Models as a Service and to the Azure AI model catalog. 

 

Azure AI studio model catalogAzure AI studio model catalog

 

New Models in Models as a Service (MaaS) 

 

Command (Coming Soon) 

Command is Cohere's premier text generation model, designed to respond effectively to user commands and instantly cater to practical business applications. It offers a range of default functionalities but can be customized for specific company language or advanced use cases. Command's capabilities include writing product descriptions, drafting emails, suggesting press release examples, categorizing documents, extracting information, and answering general queries. We will soon support Command in MaaS. 

Jais (Coming Soon) 

Jais is a 13-billion parameter model developed by G42 and trained on a diverse 395-billion-token dataset, comprising 116 billion Arabic and 279 billion English tokens. Notably, Jais is trained on the Condor Galaxy 1 AI supercomputer, a multi-exaFLOP AI supercomputer co-developed by G42 and Cerebras Systems. This model represents a significant advancement for the Arabic world in AI, offering over 400 million Arabic speakers the opportunity to explore the potential of generative AI. Jais will also be offered in MaaS as inference APIs and hosted fine-tuning. 

Mistral 

Mistral is a Large Language Model with 7.3 billion parameters. It is trained on data that is able to generate coherent text and perform various natural language processing tasks. It is a significant leap from previous models and outperforms many existing AI models on a variety of benchmarks. One of the key features of the Mistral 7B model is its use of grouped query attention and sliding window attention, which allow for faster inference and longer response sequences. Azure AI model catalog will soon offer Mistral’s premium models in Model-as-a-Service (MaaS) through inference APIs and hosted-fine-tuning.  

  • Mistral-7B-V01 
  • Mistral-7B-Instruct-V01 

 

New Models in Azure AI Model Catalog 

 

Phi 

Phi-1-5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as Phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-1.5 demonstrates a nearly state-of-the-art performance among models with fewer than 10 billion parameters. Phi-1.5 can write poems, draft emails, create stories, summarize texts, write Python code (such as downloading a Hugging Face transformer model), etc. 

Phi-2 is a Transformer with 2.7 billion parameters that shows dramatic improvement in reasoning capabilities and safety measures compared to Phi-1-5, however it remains relatively small compared to other transformers in the industry. With the right fine-tuning and customization, these SLMs are incredibly powerful tools for applications both on the cloud and on the edge.  

  • Phi 1.5 
  • Phi 2 

 

Whisper V3 

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo labeled audio collected using Whisper large-v2. The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained in both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions to a different language to the audio. 

  • OpenAI-Whisper-Large-V3 

BLIP 

BLIP (Bootstrapping Language-Image Pre-training) is a model that is able to perform various multi-modal tasks including: Visual Question Answering, Image-Text retrieval (Image-text matching), Image Captioning. Created by Salesforce, the BLIP model is based on the concept of vision-language pre-training (VLP), which combines pre-trained vision models and large language models (LLMs) for vision-language tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. It achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA. The following variants are available in the model catalog: 

  • Salesforce-BLIP-VQA-Base 
  • Salesforce-BLIP-Image-Captioning-Base 
  • Salesforce-BLIP-2-OPT-2-7b-VQA. 
  • Salesforce-BLIP-2-OPT-2-7b-Image-To-Text. 

 

CLIP 

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of image-text pairs and created by OpenAI for efficiently learning visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. CLIP can also be used to extract visual and text embeddings for use in downstream tasks (such as information retrieval). Including this model increases our ever-growing list of other OpenAI models available in the model catalog including GPT and Dall-E. As mentioned earlier, these Azure Machine Learning curated models are thoroughly tested. The available CLIP variants include: 

  • OpenAI-CLIP-Image-Text-Embeddings-ViT-Base-Patch32 
  • OpenAI-CLIP-ViT-Base-Patch32 
  • OpenAI-CLIP-ViT-Large-Patch14 

Code Llama 

As a result of the partnership between Microsoft and Meta, we are delighted to offer the new Code Llama model and its variants in the Azure AI model catalog. Code Llama is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 34 billion parameters.  Code Llama is state-of-the-art for LLMs on code tasks and has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software. The available Code Llama variants include: 

  • CodeLlama-34b-Python 
  • CodeLlama-34b-Instruct 
  • CodeLlama-13b 
  • CodeLlama-13b-Python  
  • CodeLlama-13b-Instruct 
  • CodeLlama-7b 
  • CodeLlama-7b-Python  
  • CodeLlama-7b-Instruct 

 

Falcon models 

The next set of models were created by the Technical Innovation Institute (TII). Falcon-7B is a large language model with 7 billion parameters and Falcon-40B with 40 billion parameters. It is a causal decoder-only model developed by TII and trained on 1,500 billion tokens and 1 trillion tokens of RefinedWeb dataset respectively, which was enhanced with curated corpora. The model is available under the Apache 2.0 license. It outperforms comparable open-source models and features an architecture optimized for inferencing. The Falcon variants include: 

  • Falcon-40b  
  • Falcon-40b-Instruct  
  • Falcon-7b-Instruct  
  • Falcon-7b 

 

NVIDIA Nemotron 

Another additional update is the launch of the new NVIDIA AI collection of models and registry. This partnership is also where NVIDIA is launching a new 8B LLM, titled Nemotron-3, in 3 variants pretrained, chat and Q&A.  Nemotron-3, which is a family of enterprise ready GPT-based decoder-only generative text models compatible with NVIDIA NeMo Framework. 

  • Nemotron-3-8B-Base-4k 
  • Nemotron-3-8B-Chat-4k-SFT 
  • Nemotron-3-8B-Chat-4k-RLHF 
  • Nemotron-3-8B-Chat-4kSteerLM 
  • Nemotron-3-8B-QA-4k 

SAM 

The Segment Anything Model (SAM) is an innovative image segmentation tool capable of creating high-quality object masks from simple input prompts. Trained on a massive dataset comprising 11 million images and 1.1 billion masks, SAM demonstrates strong zero-shot capabilities, effectively adapting to new image segmentation tasks without prior specific training. Created by Meta, the model's impressive performance matches or exceeds prior models that operated under full supervision. 

  • Facebook-Sam-Vit-Large 
  • Facebook-Sam-Vit-Huge 
  • Facebook-Sam-Vit-Base 

 

Stable Diffusion Models 

The latest additions to the model catalog include Stable Diffusion models for text-to-image and inpainting tasks, developed by Stability AI and CompVis. These cutting-edge models offer a remarkable advancement in generative AI, providing greater robustness and consistency in generating images from text descriptions. By incorporating these Stable Diffusion models into our catalog, we are enhancing the diversity of available modalities and models, enabling users to access state-of-the-art capabilities that open new possibilities for creative content generation, design, and problem-solving. The addition of Stable Diffusion models in the Azure AI model catalog reflects our commitment to offering the most advanced and stable AI models to empower data scientists and developers in their machine learning projects, apps, and workflows. The available Stable Diffusion models include: 

  • Stable-Diffusion-V1-4 
  • Stable-Diffusion-2-1 
  • Stable-Diffusion-V1-5 
  • Stable-Diffusion-Inpainting 
  • Stable-Diffusion-2-Inpainting
     

Curated Model Catalog Inference Optimizations 

In addition to the above curated AI models, we also wanted to improve the overall user experience by optimizing the catalog and its features in meaningful ways. Models on the Azure AI model catalog are powered by a custom inferencing container to cater to the growing demand for high performance inference and serving of foundation models. 

The container comes equipped with multiple backend inferencing engines, including vLLM, DeepSpeed-FastGen and Hugging Face, to cover a wide variety of model architectures. Our default choice for serving models is vLLM, which provides high throughput and efficient memory management with continuous batching and Paged Attention. We are also excited to support DeepSpeed-FastGen - the latest offering by DeepSpeed  team, which introduces a Dynamic SplitFuse technique to offer even higher throughput. You can try out the alpha version of DeepSpeed-FastGen with our Llama-2 family of models. Learn more about DeepSpeed-FastGen from here: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen. For models that can’t be served by vLLM or DeepSpeed-MII, the container also comes with Hugging Face engine. 

To further maximize GPU utilization and achieve even higher throughput, the container strategically deploys multiple model replicas based on the available hardware and routes the incoming requests to available replicas. This allows efficient serving of even more concurrent user requests. Additionally, we integrated Azure AI Content Safety to streamline the process of detecting potentially harmful content in AI-generated applications and services. This integration aims to enforce responsible AI practices and safe usage of our advanced AI models. You can start seeing benefits of using our container with Llama-2 family of models. We plan to extend support to even more models, including other modalities like Stable Diffusion and Falcon. 

To get started with model inferencing, view the “Learn more” section below. 

https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/inference/te... 
 

Fine-tuning Optimizations 

Training larger LLMs such as those with 70B parameters and above needs a lot of GPU memory and can run out of memory during fine-tuning, and sometimes even loading them is not possible if the GPU memory is small. This is exacerbated further for most real-life use cases where we need context lengths that are as close to the model’s maximum allowed context length, which pushes memory requirement even further. To solve this problem, we are excited to provide users some of the latest optimizations for fine-tuning – Low Rank Adaptation (LoRA), DeepSpeed ZeRO, and Gradient Checkpointing. 

 

Gradient Checkpointing lowers GPU memory requirement by storing only select activations computed during the forward pass and recomputing them during the backward pass. This is known to reduce GPU memory by a factor of sqrt(n) (where n is the number of layers) while adding a modest additional computational cost from recomputing some activations.  

LoRA freezes most of the model parameters in the pretrained model during fine-tuning, and only modifies a small fraction of weights (LoRA adapters). This reduces GPU memory required, and also reduces fine-tuning time. LoRA reduces the number of trainable parameters by orders of magnitude, without much impact on the quality of the fine-tuned model. 

DeepSpeed’s Zero Redundancy Optimizer (ZeRO) achieves the merits of data and model parallelism, while alleviating the limitations of both. DeepSpeed ZeRO has three stages which partition the model states – parameters, gradients, and optimizer states – across GPUs and uses a dynamic communication schedule to share the necessary model states across GPUs. The GPU memory reduction allows users to fine-tune LLMs like LLAMA-2-70B with a single node of 8xV100s, for typical sequence lengths of the data encountered in many use cases. 

All these optimizations are orthogonal and can be used together in any combination, empowering our customers to train large models on multi-GPU clusters with mixed precision to get the best fine-tuning accuracy. 

 

AI safety and Responsible AI 

 

Responsible AI is at the heart of Microsoft’s approach to AI and how we partner. For years we’ve invested heavily in making Azure the place for responsible, cutting-edge AI innovation, whether customers are building their own models or using pre-built and customizable models from Microsoft, Meta, OpenAI and the open-source ecosystem.  

 

We are thrilled to announce that Stable Diffusion models now support Azure AI Content Safety. Azure AI Content Safety detects harmful user-generated and AI-generated content in applications and services. Content Safety includes text and image APIs that allow you to detect material that is harmful. We also have an interactive Content Safety Studio that allows you to view, explore and try out sample code for detecting harmful content across different modalities. You can learn more from the link below. We cannot wait to witness the incredible applications and solutions our users will create using these state-of-the-art models. 

 

 

Explore SDK and CLI examples for foundation model in azureml-examplesGitHub repo! 

Learn more! 

 

3 Comments
Co-Authors
Version history
Last update:
‎Nov 15 2023 01:49 PM
Updated by: