Machine Learning Operationalisation (MLOps) is a set of practices that aims to build, deploy, and monitor machine learning applications quickly and reliably. Many organizations standardize on certain tools to develop a platform that enables these goals.
One such combination of tools uses Databricks to build and manage machine learning models and Kubernetes to deploy them. This article explores how to design this solution on Microsoft Azure, followed by step-by-step instructions for implementing it as a proof-of-concept. The approach uses common open-source technologies and can easily be adapted for other workloads.
This article is targeted towards:
Organizations looking to build and manage machine learning models on Databricks.
Organizations that have experience deploying and managing Kubernetes workloads.
Organizations looking to deploy workloads that require low latency and interactive model predictions (e.g. a product recommendation API or integration with external applications).
A GitHub repository with more details can be found here.
This high-level design uses Azure Databricks and Azure Kubernetes Service to develop an MLOps platform for the two main types of machine learning model deployment patterns — online inference and batch inference. This solution can manage the end-to-end machine learning life cycle and incorporates important MLOps principles when developing, deploying, and monitoring machine learning models at scale.
At a high level, this solution design addresses each stage of the machine learning lifecycle:
Data Preparation: this includes sourcing, cleaning, and transforming the data for processing and analysis. Data can live in a data lake or data warehouse and be stored in a feature store after it’s curated.
Model Development: this includes core components of the model development process such as experiment tracking and model registration using MLflow.
Model Deployment: this includes implementing a CI/CD pipeline to build and deploy solutions for both batch and online inference workloads. For online inference, machine learning models will be containerized as API services and deployed to an Azure Kubernetes Service (AKS) cluster, with Azure API Management exposing them to the outside world. For batch inference, jobs consuming the model will be executed by an orchestration tool (such as Apache Airflow).
Model Monitoring: this includes monitoring the API performance and data drift by analyzing log telemetry with Azure Monitor.
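To make the two inference patterns above concrete, here is a minimal, illustrative Python sketch. The model is a stub, and the function names (`predict_online`, `predict_batch`) are hypothetical, not part of the solution's actual code:

```python
# Illustrative sketch of the two deployment patterns.
# The "model" is a stub standing in for a registered MLflow model.

def model_predict(features):
    # Stub prediction logic; a real deployment would invoke a loaded MLflow model.
    return 1.0 if features.get("overtime_hours", 0) > 10 else 0.0

def predict_online(record):
    """Online inference: score a single record with low latency,
    e.g. behind an API endpoint running on AKS."""
    return {"prediction": model_predict(record)}

def predict_batch(records):
    """Batch inference: score many records in a scheduled job,
    e.g. orchestrated by a tool such as Apache Airflow."""
    return [model_predict(r) for r in records]

print(predict_online({"overtime_hours": 12}))                      # → {'prediction': 1.0}
print(predict_batch([{"overtime_hours": 2}, {"overtime_hours": 15}]))  # → [0.0, 1.0]
```

The distinction matters for infrastructure: the online path optimizes for per-request latency, while the batch path optimizes for throughput over large datasets.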
Keep in mind this high-level diagram does not depict the security features large organizations would require when adopting cloud services (e.g. firewalls, virtual networks, etc.). Moreover, MLOps is an organizational shift that requires changes in people, processes, and technology, which might influence the services, features, or workflows your organization adopts beyond those considered in this design. The Machine Learning DevOps guide from Microsoft provides guidance on best practices to consider.
Next, we will walk through an end-to-end proof of concept illustrating how an MLflow model can be trained on Databricks, packaged as a web service, deployed to Kubernetes via CI/CD, and monitored within Microsoft Azure. This proof of concept covers only the online inference deployment pattern and uses a simplified architecture compared to that shown above.
Detailed step-by-step instructions describing how to implement the solution can be found in the Implementation Guide of the GitHub repository. This article will focus on what actions are being performed and why.
A high-level workflow of this proof-of-concept is shown below:
GitHub is used to store code for the project and to enable automation by building and deploying artifacts.
By default, this proof-of-concept has been implemented by deploying all resources into a single resource group. However, for production scenarios, multiple resource groups across multiple subscriptions would be preferred for security and governance purposes (see Azure Enterprise-Scale Landing Zones), with services deployed using infrastructure as code (IaC).
Some services have been further configured as part of this proof-of-concept:
Azure Kubernetes Service: Container insights has been enabled to collect metrics and logs from containers running on AKS. This will be used to monitor API performance and analyze logs.
Azure Databricks: the Files in Repos feature has been enabled (not enabled by default at the time of developing this proof-of-concept) and a cluster has been created for Data Scientists, Machine Learning Engineers, and Data Analysts to use when developing models.
GitHub: two GitHub Environments have been created for Staging and Production environments along with GitHub Secrets to be used during the CI/CD pipeline.
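A workflow targeting these two GitHub Environments might be structured along the following lines. This is an illustrative sketch only; the job names, secret names, and steps are assumptions, not the repository's actual workflow:

```yaml
name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: Staging            # GitHub Environment for staging
    steps:
      - uses: actions/checkout@v3
      - name: Build and push container image
        run: echo "docker build/push here (placeholder)"
        env:
          AZURE_CREDENTIALS: ${{ secrets.AZURE_CREDENTIALS }}  # GitHub Secret

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: Production         # approval gates can be configured on this Environment
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to AKS
        run: echo "kubectl apply here (placeholder)"
        env:
          AZURE_CREDENTIALS: ${{ secrets.AZURE_CREDENTIALS }}
```

Scoping secrets to each Environment lets the same workflow deploy to staging and production with different credentials, and the Production Environment can require a manual approval before its job runs.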
In practice within an organization, a Cloud Administrator will provision and configure this infrastructure. Data Scientists and Machine Learning Engineers who build, deploy, and monitor machine learning models will not be responsible for these activities.
Once the infrastructure is provisioned and data is sourced, a Data Scientist can begin developing machine learning models. Using Databricks Repos, the Data Scientist can add a Git repository to the Databricks workspace for each project they (or the team) are working on.
For this proof-of-concept, the model development process has been encapsulated in a single notebook called develop_model. This notebook will develop and register an MLflow Model for deployment consisting of:
a machine learning model to predict the likelihood of employee attrition
a statistical model to determine data drift in features
a statistical model to determine outliers in features
Training notebook in Azure Databricks
After executing this notebook, the MLflow model will be registered in the MLflow Model Registry and training metrics will be captured in the Experiments tracker.
In practice, the model development process requires more effort than illustrated in this notebook and will often span multiple notebooks. Note that important aspects of mature MLOps processes, such as explainability, performance profiling, and pipelines, have been omitted from this proof-of-concept implementation, but foundational components such as experiment tracking, model registration, and versioning have been included.
Experiment metrics for hyperparameter tuning in Azure Databricks
Registered models in Azure Databricks
A JSON configuration file is used to define which version of each model from the MLflow model registry should be deployed as part of the API. All three models need to be referenced since they perform different functions (predictions, drift detection, and outlier detection respectively).
Data Scientists can edit this file once models are ready to be deployed and commit it to the Git repository. The configuration file service/configuration.json is structured as follows: