Introduction:
In recent months, the world of natural language processing (NLP) has witnessed a paradigm shift with the advent of large-scale language models like GPT-4. These models have achieved remarkable performance across a wide variety of NLP tasks, thanks to their ability to capture and understand the intricacies of human language. However, to fully unlock the potential of these pre-trained models, it is essential to streamline their deployment and management for real-world applications.
In this blog post, we will explore the process of operationalizing large language models, including prompt engineering and tuning, fine-tuning, and deployment, as well as the benefits and challenges associated with this new paradigm.
How do LLMs work?
Large language models, like GPT-4, use deep learning techniques to train on massive text datasets, learning grammar, semantics, and context. They employ the Transformer architecture, which excels at understanding relationships within text, to predict the next word in a sentence. Once trained, these models can generate human-like text and perform various tasks based on the input provided. This is very different from classical ML models, which are trained with task-specific statistical algorithms that deliver pre-defined outcomes.
Large language models outperform traditional machine learning models at generating human-like responses, thanks to their ability to learn from human feedback and the flexibility offered by prompt engineering.
Figure: Difference between ML Models and LLMs
What are the risks of LLMs in real-world applications?
LLMs are designed to generate text that appears coherent and contextually appropriate, rather than adhering to factual accuracy. This leads to various risks as highlighted below:
Bias amplification: LLMs could produce biased or discriminatory outputs.
Hallucination: LLMs may inadvertently generate incorrect, misleading, or false information.
Prompt injection: Bad actors could craft malicious inputs that override a model's instructions and exploit LLMs to produce harmful content.
Ethical concerns: The use of LLMs raises ethical questions about accountability and responsibility for the output generated by these models.
How to address the risks of LLMs?
In my opinion, there are two main ways to address the risks of LLMs and make them safe to use in real-world applications.
- Responsible AI Framework: Microsoft has created detailed technical recommendations and resources to help customers design, develop, deploy, and use AI systems that implement the Azure OpenAI models responsibly. I will not delve deeper into this topic in this blog, but please visit the links below to learn more:
Overview of Responsible AI practices for Azure OpenAI models
Responsible AI for LLMs (microsoft.com)
- Leverage MLOps for Large Language Models, i.e., LLMOps: Over the years, MLOps has demonstrated its ability to enhance the development, deployment, and maintenance of ML models, leading to more agile and efficient machine learning systems. The MLOps approach automates repetitive tasks such as model building, testing, deployment, and monitoring, thereby improving efficiency. It also promotes continuous integration and deployment, allowing for faster model iterations and smoother deployments in production. Because LLMs come pre-trained, we can skip the expensive training step, but MLOps can still be leveraged to tune the LLMs and to operationalize and monitor them effectively in production. MLOps for Large Language Models is called LLMOps.
MLOps vs LLMOps:
Let us quickly refresh how MLOps works for classical Machine Learning models. Taking ML models from development to deployment to operations involves multiple teams and roles and a wide range of tasks. Below is the flow of a standard ML lifecycle:
Figure: Classical ML Lifecycle workflow
Data Preparation: Gather necessary data, clean and transform into a format suitable for machine learning algorithms.
Model Build and Training: Select suitable algorithms and feed them the preprocessed data, allowing the model to learn patterns and make predictions. Improve the model's accuracy through iterative hyperparameter tuning and repeatable pipelines.
Model Deployment: Package the model and deploy it as a scalable container for making predictions. Expose the model as APIs to integrate with applications.
Model Management and Monitoring: Monitor performance metrics, detect data and model drift, retrain the model, and communicate the model's performance to stakeholders.
Interestingly enough, the lifecycle for LLMs is very similar to that of classical ML models as outlined above, but we do not have to go through expensive model training because the LLMs are already pre-trained. However, we still have to discover an LLM that fits the use case, tune the prompts (i.e., prompt engineering or prompt tuning), and, if necessary, fine-tune the models for domain-specific grounding. Below is the flow of an LLM lifecycle:
Figure: LLM Lifecycle workflow
Using Azure Machine Learning for LLMOps:
Azure Machine Learning provides advanced capabilities throughout the entire LLM lifecycle. This includes everything from data preparation to the discovery and tuning of foundational models, and their deployment. It also assists in the development and deployment of Prompt flows. Finally, it enables monitoring of the deployed model and Prompt flow endpoints for attributes such as groundedness, relevance, and coherence.
Data Preparation for LLMs:
The first step in the process is to access the data for LLMs, just as with ML models. Azure Machine Learning provides seamless access to Azure Data Lake Storage Gen2, Azure Blob Storage, Azure SQL Database, etc., which can be registered as Datastores. The data inside those Datastores (files, tables, etc.) can be easily accessed using URIs. For example, azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet
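For example, with the azureml-fsspec package installed, such a URI can be read directly with pandas. Below is a minimal sketch; the subscription, workspace, datastore, and file path are placeholders you would replace with your own:

```python
# A minimal sketch: reading a file from a registered Datastore with pandas.
# Assumes the azureml-fsspec package is installed; the subscription, resource
# group, workspace, datastore, and file path below are placeholders.
import pandas as pd

uri = ("azureml://subscriptions/<sub-id>/resourcegroups/<rg>/workspaces/<workspace>"
       "/datastores/<data_store_name>/paths/sales/2023/data.parquet")

df = pd.read_parquet(uri)   # azureml-fsspec resolves the azureml:// scheme
print(df.head())
```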
Azure Machine Learning can also be used together with Microsoft Fabric to enhance collaboration between data professionals and ML professionals. ML-ready data assets prepared in Microsoft Fabric can easily be shared via OneLake and stored in the managed feature store in Azure Machine Learning. For more details and examples, please refer to the documentation here: Data concepts in Azure Machine Learning
Model Discovery and Tuning of LLMs:
One main advantage of LLMs is that we do not have to go through the expensive training process, because models like GPT-4, Llama 2, and Falcon are already available. However, we still have to consider tuning the prompts (i.e., prompt engineering or prompt tuning) and, if necessary, fine-tuning the models for domain-specific grounding.
Foundational Model Catalog:
The model catalog is a hub for discovering various foundation models from Azure OpenAI Service, Llama 2, Falcon, Hugging Face, and a diverse suite of open-source vision models for image classification, object detection, and image segmentation. These models are curated and thoroughly tested so they can be easily deployed and integrated with applications.
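As a quick illustration, the curated models live in Azure Machine Learning system registries and can be fetched with the azure-ai-ml SDK. This is a minimal sketch; the registry name, model name, and label are illustrative, so check the catalog for the exact values:

```python
# A minimal sketch: fetching a foundation model from a system registry with
# the azure-ai-ml SDK. Registry and model names below are illustrative;
# confirm the exact names in the model catalog.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_client = MLClient(credential=DefaultAzureCredential(),
                           registry_name="azureml-meta")  # e.g., registry hosting Llama 2 models

model = registry_client.models.get(name="Llama-2-7b", label="latest")
print(model.name, model.version)
```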
Figure: LLM Foundational Model Catalog in Azure Machine Learning
Please refer to this link for more detailed documentation on foundational models in Azure Machine Learning: How to use Open Source foundation models curated by Azure Machine Learning (preview)
GitHub Repo with example notebooks for deploying and inferencing the foundational models: azureml-examples/sdk/python/foundation-models at main (github.com)
Announcements on introducing Foundational and Vision models in Azure Machine Learning:
Announcing Foundation Models in Azure Machine Learning (microsoft.com)
Introducing Vision Models in Azure Machine Learning Model Catalog - Microsoft Community Hub
LLM Fine-tuning:
Fine-tuning for large language models is a process where a pre-trained model is adapted to generate answers specific to a particular domain. Fine-tuning allows the model to grasp the nuances and context relevant to that domain, thus improving its performance. The following are the steps involved in fine-tuning:
- Select a relevant dataset: Choose a dataset that represents the specific domain or task you want the model to excel in, ensuring it has adequate quality and size for effective fine-tuning.
- Adjust training parameters: Modify parameters like learning rate, batch size, and the number of training epochs to optimize the fine-tuning process and prevent overfitting.
- Evaluate and iterate: Regularly assess the fine-tuned model's performance using validation data and make necessary adjustments to improve its accuracy and effectiveness in the target domain.
Azure Machine Learning supports advanced optimization and distributed computing technologies such as ONNX Runtime Training's ORTModule, DeepSpeed, and LoRA to significantly accelerate the training process.
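To make the fine-tuning steps above concrete, here is a minimal, illustrative LoRA sketch using the Hugging Face transformers and peft libraries. It is not a production recipe: the base model, dataset file, and hyperparameters are placeholders you would replace with your own.

```python
# Illustrative LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The base model, dataset file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # stand-in for a larger foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with low-rank adapters so only a small set of weights is trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Domain-specific text corpus (placeholder file); tokenize it for causal LM training.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```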
Please refer to this link to learn more about fine-tuning and evaluating the foundational models:
How to use Open Source foundation models curated by Azure Machine Learning (preview)
Please refer to this GitHub Repo for a sample code for Fine tuning: advanced-gen-ai/Instructions/04-finetune-model.md at main
Prompt Flow:
As highlighted above, developing effective prompts is crucial to keeping LLM applications safe and reliable. Azure Machine Learning prompt flow provides a comprehensive solution that simplifies prototyping, experimenting with, and tuning prompts. Below are some important features:
- Create executable flows that link LLMs, prompts, and Python tools.
- Debug, share, and iterate your flows with ease through team collaboration.
- Create prompt variants and evaluate their performance through large-scale testing.
- Deploy the prompt flow as a real-time endpoint to integrate it into your workflow.
Figure: The prompt flow designer UI with integrated notebook feature
The Prompt Flow UI offers a visual representation of the steps and their interconnections. This visual guide and the navigation panel interact seamlessly, such that selecting a step in the visual guide automatically highlights the corresponding block in the navigation panel.
Figure: A visual flow with building blocks of prompt flow
Please refer to this link for more detailed documentation on prompt flow:
What is Azure Machine Learning prompt flow (preview)
Prompt flow code-first experience with SDK, CLI and VS Code Extension:
Prompt flow provides benefits that help users transition from ideation to production-ready LLM-infused applications. It addresses common customer queries about managing prompt versions, integrating with CI/CD processes, and exporting and deploying prompt flows. A code-first experience is introduced through our SDK, CLI, and VS Code extension. Developers can export a flow folder from the prompt flow UI for version control. The SDK allows local testing, cloud workspace batch runs, and extensive scenario handling. Seamless integration with Azure DevOps and GitHub Actions is provided for smooth CI/CD pipelines.
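As a quick illustration of that code-first experience, the sketch below assumes a flow folder exported from the prompt flow UI (./my_chat_flow) and a JSONL file of inputs; both names are placeholders.

```python
# A minimal sketch with the promptflow SDK: test a flow locally, then submit
# a batch run over a data file. Flow folder, input names, and data file are
# placeholders exported from the prompt flow UI.
from promptflow import PFClient

pf = PFClient()

# Single interactive test of the flow with one set of inputs.
result = pf.test(flow="./my_chat_flow", inputs={"question": "What is LLMOps?"})
print(result)

# Batch run over a JSONL file of inputs; useful for evaluating prompt variants.
run = pf.run(flow="./my_chat_flow", data="./questions.jsonl")
pf.stream(run)  # stream logs until the run completes
```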
Please refer to the link below for a sample code with Prompt flow SDK/CLI:
promptflow/examples/tutorials/get-started/quickstart.ipynb · microsoft/promptflow (github.com)
VS Code Extension for prompt flow: The suite of development tools provided by prompt flow includes a robust VS Code extension. This extension aids developers in creating, testing, and tuning prompt flows. It offers support for both code-based and visual editing, allowing for comprehensive testing of an entire prompt flow or of individual steps.
Figure: Prompt flow development using the VS Code extension
The prompt flow extension can be installed from the VS Code Extensions marketplace:
Figure: VS Code Extension for prompt flow
Please also check out this demo video to learn how code-first experiences in prompt flow work in practice.
Retrieval Augmented Generation (RAG):
Another way of reducing the risks of LLMs is to ground them with domain-specific data so the model consults that data when generating responses. This is called Retrieval Augmented Generation (RAG). The RAG process works by chunking large data into manageable pieces and then creating vector embeddings that make it easy to understand the relationships between those pieces.
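To make the pattern concrete, here is a minimal, service-agnostic sketch of the chunk-embed-retrieve loop. It uses the Azure OpenAI client for embeddings and chat completions; the endpoint, deployment names, and input file are placeholders, and a production pipeline would use a proper vector database rather than in-memory arrays.

```python
# Illustrative RAG sketch: chunk a document, embed the chunks, retrieve the
# most relevant chunk for a question, and use it as grounding context.
# Endpoint, key, deployment names, and the input file are placeholders.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(api_version="2024-02-01",
                     azure_endpoint="https://<your-resource>.openai.azure.com",
                     api_key="<your-key>")

def chunk(text, size=500):
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = chunk(open("policy_manual.txt").read())   # placeholder domain document
doc_vectors = embed(docs)

question = "What is the refund policy?"
q_vec = embed([question])[0]

# Cosine similarity between the question and every chunk; pick the best match.
scores = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
context = docs[int(scores.argmax())]

answer = client.chat.completions.create(
    model="gpt-4",  # Azure OpenAI deployment name (placeholder)
    messages=[{"role": "system", "content": f"Answer using only this context:\n{context}"},
              {"role": "user", "content": question}])
print(answer.choices[0].message.content)
```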
Figure: Retrieval Augmented Generation (RAG) process flow
Creating a RAG pipeline is easy with prompt flow: connect components that extract data from Datastores, create vector embeddings, and store the vectors in a vector database.
Figure: Q&A Generation with the RAG pipeline
Please refer to the documentation below on RAG capabilities in Azure Machine Learning:
Use Azure Machine Learning pipelines with no code to construct RAG pipelines (preview)
GitHub Repo on RAG: azureml-examples/sdk/python/generative-ai/rag/notebooks at main · Azure/azureml-examples
LLM Model and Prompt Flow Deployment:
The next phase of LLMOps is the deployment of the foundational models and prompt flows as endpoints so they can be easily integrated with applications for production use. Azure Machine Learning offers highly scalable compute, including CPUs and GPUs, for deploying the models as containers and supporting inferencing at scale:
- Real-time Inference: It supports real-time inferencing through low-latency endpoints, enabling faster decision-making in applications.
- Batch Inference: Azure Machine Learning also supports batch inferencing for processing large datasets asynchronously, without the need for real-time responses.
Deploying LLM Models: Once the LLM models (whether pre-trained or fine-tuned) are thoroughly evaluated and produce results that satisfy the business requirements, they can be seamlessly deployed as Endpoints on Azure’s robust, scalable, and secure infrastructure. Azure Machine Learning supports deployment of LLM models using the UI in the Azure Machine Learning Studio or using the SDK.
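For illustration, here is a minimal sketch of deploying a catalog model to a managed online endpoint with the azure-ai-ml SDK. The endpoint name, model reference, and instance SKU are placeholders; adjust them to your workspace, catalog model, and quota.

```python
# A minimal sketch: deploy a model to a managed online endpoint with the
# azure-ai-ml SDK. Names, model asset path, and SKU are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     subscription_id="<sub-id>",
                     resource_group_name="<rg>",
                     workspace_name="<workspace>")

# Create the endpoint (key-based auth).
endpoint = ManagedOnlineEndpoint(name="llm-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Create a deployment pointing at a catalog/registry model asset (illustrative
# path; copy the exact asset ID from the model catalog).
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="llm-endpoint",
    model="azureml://registries/azureml-meta/models/Llama-2-7b/versions/13",
    instance_type="Standard_NC24ads_A100_v4",  # GPU SKU, adjust to your quota
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```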
Figure: Deploying LLM model in Azure Machine Learning Studio
Please refer to the below link for detailed information on how to deploy foundational models:
How to use Open Source foundation models curated by Azure Machine Learning (preview)
Please refer to this GitHub link below for a sample code for LLM model deployment using the SDK:
azureml-examples/sdk/python/foundation-models/system/inference at main · Azure/azureml-examples
Deploying prompt flows:
Once a prompt flow is developed, it can easily be deployed as an endpoint for integration into your workflow.
Figure: Deploying Prompt flow using the UI
Figure: A Prompt flow endpoint API and its associated keys
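Once deployed, the prompt flow endpoint can be called like any REST API. Here is a minimal sketch, assuming a hypothetical scoring URL, key, and input schema taken from the endpoint's Consume tab:

```python
# A minimal sketch: calling a deployed prompt flow endpoint over REST.
# The scoring URL, key, and input schema below are placeholders.
import json
import requests

scoring_url = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
api_key = "<endpoint-key>"

payload = {"question": "How do I reset my password?", "chat_history": []}
headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {api_key}"}

response = requests.post(scoring_url, headers=headers, data=json.dumps(payload))
response.raise_for_status()
print(response.json())
```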
For detailed step by step instructions on building CI/CD pipeline for deploying the prompt flows using the SDK/CLI, please refer to this link: Set up end-to-end LLMOps with Prompt Flow and GitHub (preview) - Microsoft Learn
Model Monitoring and Management:
Finally, once the LLM models are deployed as endpoints and integrated into applications, it is very important to monitor them to make sure they are performing as intended and continue to generate value for users. Azure Machine Learning provides comprehensive model monitoring capabilities, including monitoring for data drift, model performance, groundedness, token consumption, and infrastructure performance.
Data Drift: Data drift occurs when the distribution of input data used for predictions changes over time. This can lead to a decrease in model performance as the model is trained on historical data but used to make predictions on new data. Azure Machine Learning's data drift detection feature allows you to monitor the input data for changes in distribution. This helps you identify when to update your model and ensure that it remains accurate as the data landscape changes.
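Conceptually, drift detection compares the distribution of recent production data against a training baseline. The sketch below illustrates that idea with a simple two-sample Kolmogorov-Smirnov test; it is not Azure Machine Learning's implementation, and the file paths are placeholders.

```python
# Illustrative sketch of the idea behind data drift detection (not Azure ML's
# implementation): compare each numeric feature's production distribution
# against the training baseline with a two-sample Kolmogorov-Smirnov test.
import pandas as pd
from scipy.stats import ks_2samp

baseline = pd.read_parquet("training_data.parquet")   # placeholder paths
production = pd.read_parquet("last_7_days.parquet")

for column in baseline.select_dtypes("number").columns:
    stat, p_value = ks_2samp(baseline[column], production[column])
    if p_value < 0.01:  # distributions differ significantly -> possible drift
        print(f"Drift suspected in '{column}' (KS statistic={stat:.3f}, p={p_value:.4f})")
```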
Figure: A sample Data Drift by Features in Azure Machine Learning
Figure: A sample Data Drift by Time in Azure Machine Learning
More detailed step by step instructions can be found here on monitoring Datastores for data drift: Detect data drift on datasets (preview) - Azure Machine Learning
Model Metrics
Model monitoring in production is important because it ensures consistent performance by detecting and addressing issues like model degradation and biases. It enables early identification of anomalies and helps maintain overall system quality. Compliance with regulatory requirements is also achieved through continuous monitoring. Furthermore, it fosters continuous improvement by identifying areas for optimization, ultimately resulting in better-performing, more reliable models.
Azure Machine Learning provides a Data Collector feature that logs inference data in Azure Blob Storage, allowing data collection for new or existing online endpoint deployments. By using the provided Python SDK, the collected data is automatically registered as a data asset in the Azure Machine Learning workspace, which can be utilized for model monitoring purposes.
Data Collector integrates with AzureML’s pre-built evaluation, annotation, and measurement pipelines to evaluate generation safety and quality. Customers can monitor LLM applications for key metrics such as coherence, fluency, groundedness, relevance, and similarity. Please refer to this documentation for detailed explanation on these metrics: Monitoring evaluation metrics descriptions and use cases (preview) - Microsoft Learn
Azure Machine Learning model monitoring also allows customers to track token consumptions from the chat and completion endpoints using prompt flow's system metrics.
Together, these capabilities can help you better identify and diagnose issues, understand usage patterns, and inform how you optimize your application with prompt engineering. Ultimately, model monitoring for generative AI enables more accurate, responsible, and compliant applications in production.
Figure: View overall performance, and review notifications from the monitoring overview page
Figure: View time-series metrics, histograms, detailed performance, and resolve notifications from the monitoring details page.
Please refer to the Model monitoring for generative AI applications announcement below for more details on Generative AI model monitoring:
Model monitoring for generative AI applications (preview) - Azure Machine Learning | Microsoft Learn
For more detailed documentation on how to monitor the signals and metrics with model monitoring, please refer to this link:
Monitoring models in production (preview) - Azure Machine Learning | Microsoft Learn
Model and Infrastructure Monitoring: With model and infrastructure monitoring, we track model performance in production from both a model and an operational perspective. Azure Machine Learning supports logging and tracking experiments using MLflow Tracking. We can log the models, metrics, parameters, and other artifacts with MLflow. This log information is captured inside Azure App Insights, which can then be accessed using Log Analytics inside Azure Monitor. Since the LLMs come pre-trained, we may not get deep into the model inferencing internals, but we can effectively track LLM hyperparameters, execution times, prompts, and responses.
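For example, prompt-level details can be logged with standard MLflow calls. Here is a minimal sketch; the parameter names, metric values, and the call_llm helper are illustrative:

```python
# A minimal sketch: logging LLM run details (prompt variant, parameters,
# latency, response) with MLflow. Parameter/metric names and the call_llm
# helper are illustrative placeholders.
import time
import mlflow

# Inside an Azure ML job the tracking URI is already set; locally it can be
# fetched from the workspace and set with mlflow.set_tracking_uri(...).
with mlflow.start_run(run_name="prompt-variant-a"):
    mlflow.log_param("model", "gpt-4")
    mlflow.log_param("temperature", 0.2)
    mlflow.log_param("prompt_template", "summarize_v2")

    start = time.time()
    response_text = call_llm("Summarize the incident report.")  # your inference call (placeholder)
    mlflow.log_metric("latency_seconds", time.time() - start)
    mlflow.log_metric("completion_tokens", 187)                 # illustrative value

    mlflow.log_text(response_text, "outputs/response.txt")      # store the raw response as an artifact
```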
Figure: Monitoring endpoint for traffic inside Azure ML Studio
Figure: Monitoring endpoint for traffic and metrics inside Azure Portal.
For more detailed information on logging metrics and monitoring the endpoints in Azure Machine Learning, please refer to this documentation:
Log metrics, parameters and files with MLflow
Monitor online endpoints - Azure Machine Learning
Conclusion:
In conclusion, LLMOps plays a crucial role in streamlining the deployment and management of large language models for real-world applications. Azure Machine Learning offers a comprehensive platform for implementing LLMOps, addressing the risks and challenges associated with LLMs.
Generative AI is a rapidly growing domain and there are new capabilities being added to Azure on a regular basis. Consequently, it is vital to stay informed about the latest updates in Azure Machine Learning and LLMOps by monitoring Microsoft's current documentation, tutorials, and examples. This ensures that you utilize the most cutting-edge tools and strategies for effectively deploying, managing, and monitoring your large language models.
Acknowledgement: I would like to extend my deepest appreciation to Takuto Higuchi, Microsoft's Product Marketing Manager for Azure AI, for his thoughtful review of this blog and for offering invaluable suggestions. His assistance has been instrumental in refining this blog to reflect recent product updates and enhance the overall content quality.