Co-Authors: Lotta Åhag, Gustav Kruse, Erik Rosendahl.
This post will guide you through how we, Lotta Åhag and Gustav Kruse, used Azure AutoML and the 'Enterprise Scale ML (ESML) solution accelerator for Azure' to build an end-2-end machine learning solution in 6 weeks. Once in production, the solution is estimated to cut CO2 emissions from propane by 3.35 tons and electricity usage by 90 MWh per year. The total ecological impact is much higher: fewer quality rejections save a lot of resources, including the coal (and gas) needed to produce the steel. We collaborated with the Epiroc Data Scientist, Erik Rosendahl, who worked with the ESML templates for operation and governance.
This is the story we want to share, about an end-2-end, machine learning solution on Azure.
We wanted to leverage AI for steel manufacturing, in the area of heat treatment quality, with the goal to enhance the process to be able to reduce CO2. We got help from 2 student data scientists who wanted to execute this as their master thesis.
In 6 weeks, they managed to leverage Azure Machine Learning and the ESML AI factory at Epiroc, using AutoML to build a machine learning model with an end-2-end pipeline from data lake to Power BI report.
The team quickly got its own set of Azure resources as an ESML Project, which ensured both enterprise grade scale and security, and out came a new AI innovation with ecological wins: a 3.35 ton CO2 reduction and 30% fewer quality rejections.
//Jonas Jern, Head of Digital Innovation, Epiroc Drilling Tools AB
Today's companies often have a large amount of data used for product follow-up and order management. Also, technical process decisions are usually made through experience or simple statistics. During this project, we had the opportunity to work with process operators and engineers who have hardened steel for over 30 years. Imagine the potential of supporting these professionals and using their experience as input to a machine learning model that optimizes the manufacturing process. It was clear that a key factor to success was to combine experience of the domain with modern technology.
Before we talk about this specific use case and solution, we want to give you, the reader, some context about the solution accelerator used in this project and how it accelerates the work.
Enterprise Scale ML (ESML) is a solution accelerator to get an AI Factory on Azure. It is based on Microsoft best practices, such as Cloud Adoption Framework combined with proven practices. It is open source and has an MIT license.
Innovating with AI and Machine Learning, multiple customers expressed the need to have an Enterprise Scale AI & Machine Learning Platform - with end-2-end turnkey DataOps and MLOps. Other requirements were:
Even if best practices exist, it is time consuming and complex to set up such an AI Factory solution.
A private solution without public internet access is also often desired, since teams work with production data from day one, already in the R&D phase. The challenges:
Also, the full solution should be able to be provisioned via infrastructure-as-code (IaC), to be recreated and scale across multiple Azure subscriptions. And, it should be able to have a project-based approach, to scale up to 250 project teams, where each project team can create one to many models. Each project team has its own set of services glued together, such as its own Azure Machine Learning service, compute clusters, and its own modern data architecture.
To meet the customer requirements in Azure, multiple best practices needed to be married and implemented, such as: Cloud Adoption Framework, AI Factories, Well-Architected Framework (WAF), MLOps, Datalake design with Datamesh, and Modern Analytics Architecture. Instead of everyone working in silos and re-inventing wheels, the idea was to have an open source initiative that helps all at once: the open source accelerator Enterprise Scale ML (ESML).
With ESML, an AI Factory could be set up quickly: 4-40 hours, compared to thousands of hours before.
For us and Epiroc, this meant 8 project teams, each with its own set of Azure services glued together securely. We worked in one of the teams, and all teams had these challenges solved:
ESML marries multiple best practices into one solution accelerator, with infrastructure-as-code
Figure 1: Best practices and accelerator to enable Enterprise Scale Machine Learning.
ESML is a template-based way of working. It uses Bicep for provisioning the AI Factory, a datalake template, multiple Azure Data Factory and Azure Machine Learning pipeline templates, and the MLOps template.
As seen in figure 2 below, it accelerates by requiring only 2 lines of code to generate an Azure Machine Learning pipeline, which could be either a scoring or training pipeline. These pipelines were MLOps artifacts often needed for multiple use cases and projects. ESML templates and ESML Pipeline factory and its 7 pipeline types accelerated this, and in the end - we get easier governance as a bonus.
Figure 2: The templates in the ESML accelerator
Azure as a cloud addresses challenges with PaaS services. The Azure Machine Learning service accelerates model management, pipelines, deployment, and training with Azure AutoML. Multiple best practices address the challenges of DataOps, MLOps, and well architected solutions. The ESML solution accelerator marries multiple best practices together and abstracts enterprise grade security for the data scientist to be even more productive with templates and its control plane.
Figure 3: The ESML AI Factory at Epiroc shows the architectural use case of this project, where data flows through the pipeline from ingestion (1) to reporting (5)
So how did we use the accelerator? Let's get into the exciting stuff!
The project outset was to investigate factors that have an impact on the quality of heat-treated goods. Deeper knowledge of the area would make steel manufacturing more efficient and reduce the environmental impact made by rejections and reprocessed products that don't pass the bar.
We, Gustav & Lotta, created an end-to-end pipeline with ESML accelerator templates and presented the results in Power BI. These pipelines were then handed over to Epiroc's Data Scientist Erik Rosendahl, who filled out the ESML production part of the templates (the DataOps template for Azure Data Factory and the MLOps templates in the accelerator), using the ESMLPipeline factory to write less code (0.1% of the code per Azure Machine Learning pipeline compared to the same pipeline without the accelerator).
We started by identifying data sources of potential importance for output quality. This was done in collaboration with people with domain knowledge, who know the manufacturing process best. The raw processed data was transferred to the data lake folder for our ESML project, from which we could read the data with just a few predefined lines of code in Azure Databricks, thanks to the ESML accelerator PySpark library.
We chose to use Azure Databricks in the project’s R&D stage as we wanted to easily collaborate on notebooks while exploring and cleaning data. In previous projects, keeping track of file paths for different versions of data has been extremely time-consuming. That is something we could avoid in this project, as the ESML data lake design enables stages of refined data states called Bronze, Silver and Gold. This simplifies the reusability and accessibility as no folder paths need to be remembered. Thus, we could easily use Azure Databricks to go from raw processed data to a uniform formatted, cleaned, and aggregated stage.
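To illustrate why named stages remove the need to remember folder paths, here is a minimal sketch of a path resolver. The folder layout, the `LakeDataset` class, and the project, model, and dataset names are all hypothetical; this is not the actual ESML lake design or API, only the general idea.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Refinement stages of a dataset in the lake."""
    BRONZE = "bronze"  # raw data as delivered
    SILVER = "silver"  # cleaned and uniformly formatted
    GOLD = "gold"      # aggregated, ready for training or scoring


@dataclass(frozen=True)
class LakeDataset:
    """Resolves lake paths from names, so notebooks never hard-code them.

    The layout below is hypothetical, not the real ESML folder structure.
    """
    project: str
    model: str
    name: str
    version: int = 1

    def path(self, stage: Stage) -> str:
        return (
            f"projects/{self.project}/{self.model}/"
            f"v{self.version}/{stage.value}/{self.name}.parquet"
        )


# Hypothetical names, for illustration only.
furnace = LakeDataset(project="project001", model="heat_treatment", name="furnace_log")
print(furnace.path(Stage.BRONZE))
print(furnace.path(Stage.GOLD))
```

A notebook then asks for a dataset and a stage instead of a path, which is what makes moving between Bronze, Silver and Gold cheap.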
All machine learning training was conducted with the AutoML feature in the Azure Machine Learning SDK. It provided all code needed for us to go from training to deployment quickly, as AutoML generated the scoring script and provided the model artifacts needed, such as the environment, Docker image, and pickle file.
First, we specified all settings for the training, such as target label, type of algorithm, training time and iterations. In this project, we chose the steel quality results (approved or rejected) to be the target label. We also specified the sizes of our train, test, and validation datasets. During the training, the features, algorithms, and parameters are tuned in different iterations out of which the best performing ones are ranked in a leaderboard in Azure ML Studio. In this project, we trained and compared 9 different machine learning algorithms such as Random Forest and Light Gradient Boosting.
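As a rough illustration of what specifying train, validation, and test sizes means for a two-class target, the sketch below splits rows while keeping the approved/rejected ratio in each part. In the project, AutoML and ESML handled the splitting; the function, the fractions, and the sample data here are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split rows into train/validation/test, keeping the label ratio in each part."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Group rows by label so each class is split separately.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]
    return train, valid, test


# Hypothetical data: 80 approved and 20 rejected batches.
batches = [
    {"batch_id": i, "quality": "rejected" if i % 5 == 0 else "approved"}
    for i in range(100)
]
train, valid, test = stratified_split(batches, label_key="quality")
print(len(train), len(valid), len(test))
```

Splitting per class matters when one outcome is rare, since a purely random split could leave very few rejected batches in the smaller sets.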
Figure 4: Azure AutoML leaderboard concept
When training the algorithms, we used the code snippet below. For example, we chose the model specified for classification from the configuration file and AUC as the primary performance metric. We also used the training dataset that was saved from the Gold stage which we created in Databricks.
This snippet shows the concept of the Enterprise Scale ML accelerator features: the accelerator fetches the performance configuration to train and to serve a model given the AI factory environment, which can be either Development, Test or Production. It also provides a "folder-path-less" experience, since ESML automatically maps data as Azure Machine Learning datasets.
Figure 5: AutoML training configuration with ESML acceleration
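The idea of environment-dependent training settings can be sketched in plain Python. The dictionary keys below mirror parameter names of `AutoMLConfig` in the Azure ML SDK v1, but the per-environment values, the `automl_settings` helper, and the label column name `quality` are hypothetical and do not reflect ESML's actual configuration files.

```python
# Hypothetical per-environment performance settings; values are illustrative only.
ENV_SETTINGS = {
    "dev": {"experiment_timeout_hours": 0.25, "iterations": 5},
    "test": {"experiment_timeout_hours": 1.0, "iterations": 20},
    "prod": {"experiment_timeout_hours": 3.0, "iterations": 50},
}


def automl_settings(env: str, label_column: str) -> dict:
    """Build keyword arguments for an AutoML classification run."""
    if env not in ENV_SETTINGS:
        raise ValueError(f"unknown environment: {env!r}")
    settings = {
        "task": "classification",
        "primary_metric": "AUC_weighted",
        "label_column_name": label_column,
        "enable_early_stopping": True,
    }
    # Cheap and fast in dev, more thorough in test and prod.
    settings.update(ENV_SETTINGS[env])
    return settings


settings = automl_settings("dev", label_column="quality")
print(settings)

# With the Azure ML SDK v1 installed, these would be passed roughly like this,
# where gold_train is the training dataset and cluster is a compute target:
#   from azureml.train.automl import AutoMLConfig
#   config = AutoMLConfig(training_data=gold_train, compute_target=cluster, **settings)
```

Keeping the environment lookup separate from the task settings means the same notebook can run a short experiment in Development and a longer one in Production without code changes.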
Azure Machine Learning Studio (Azure ML Studio) is a graphical user interface which gathers all results from training. It also has other built-in functionality that, for example, warned us about using an imbalanced dataset. The best performing model for each iteration is ranked on a scoreboard, from which we got the overall performance from metrics such as accuracy, precision, and F1 score.
Figure 6: Scoreboard in Azure Machine Learning with AutoML
Figure 7: ESML Model settings, used to control which metrics to use when comparing the ESML generated test-set scoring to decide whether model A or B is to be promoted. Weights can also be set when comparing scoring.
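A weighted comparison of two models can be sketched as follows. This is only an illustration of the concept in Figure 7: the function names, the metric names, the weights, and the scores are all hypothetical, and ESML's own comparison logic may differ.

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted average of the metrics named in weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * w for name, w in weights.items()) / total


def should_promote(candidate: dict, current: dict, weights: dict, margin: float = 0.0) -> bool:
    """Promote the candidate only if it beats the current model by more than margin."""
    return weighted_score(candidate, weights) > weighted_score(current, weights) + margin


# Hypothetical test-set metrics and weights.
weights = {"auc": 0.5, "f1": 0.3, "precision": 0.2}
model_a = {"auc": 0.80, "f1": 0.75, "precision": 0.75}  # currently deployed
model_b = {"auc": 0.90, "f1": 0.75, "precision": 0.75}  # newly trained

print(should_promote(model_b, model_a, weights))
```

The optional `margin` guards against promoting a model whose improvement is within noise.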
As all runs are automatically logged and scored in Azure ML studio, we could quickly iterate the training process with different configurations. In our case, we got early insights on what features to investigate further and how well we could predict the quality output through supervised learning.
During this process, we found out that the outdoor temperature would be interesting to add. After just one afternoon, weather data was added and ready for training, thanks to the agile architecture within the framework.
For the best performing models, we also took a closer look at the feature importance to decide whether to add or reduce any features. For example, we could see that the total batch weight and temperature have a significant importance in predicting output quality. The features were iteratively reduced and the training resulted in increased performance.
Figure 8: Responsible AI: Explainability and feature importance
One way to understand the algorithm in a visual way was to look at each feature's importance. From there, we could, for example, see the intervals for outdoor temperature and its importance. The x axis represents the temperature, and the y axis indicates whether the quality is predicted to be rejected (above the zero line) or approved (below the zero line). In this case, we could clearly see that a temperature below 5 degrees tends to have a negative impact.
When we were satisfied with the model, we tested it on the test dataset by running the cell below. This resulted in a confusion matrix describing how well the model predicts rejected and approved batches.
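As a minimal sketch of what such a confusion matrix contains, the example below counts correct and incorrect predictions for a two-class problem and derives the metrics mentioned earlier. The labels are made up for illustration, and treating "rejected" as the positive class is an assumption.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive="rejected"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true and predicted labels for eight batches.
y_true = ["approved", "approved", "rejected", "approved",
          "rejected", "approved", "rejected", "approved"]
y_pred = ["approved", "rejected", "rejected", "approved",
          "approved", "approved", "rejected", "approved"]

counts = confusion_matrix(y_true, y_pred)
print(dict(counts))
print(summarize(counts))
```

With an imbalanced dataset, precision and recall for the rejected class say more than accuracy alone, since a model that always predicts "approved" can still score a high accuracy.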
Figure 10: Model deployed as an online webservice powered by a private Azure Kubernetes Service cluster
The model was then deployed with just two lines of code, both as an online webservice powered by Azure Kubernetes Service (AKS) and as a batch scoring Azure Machine Learning pipeline where scored data is saved to the data lake. Two lines suffice thanks to the combination of ESML, Azure Machine Learning and AutoML. ESML wraps the full deployment, fetches the correct configuration settings, and creates the AKS cluster as a private cluster in the correct virtual network if it does not already exist. AutoML generates the scoring script automatically, and Azure Machine Learning dockerizes the model and deploys it on the AKS cluster as a webservice. Finally, ESML stores the URL and API secret in the project's Azure Key Vault, secured with a virtual network and private endpoints.
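Once the URL and API secret are available, a client could call the webservice roughly as sketched below. This assumes key-based authentication with a bearer token and a JSON body of the form `{"data": [...]}`; the URL, the key, the feature names, and the `build_scoring_request` helper are placeholders, and the exact payload schema depends on the generated scoring script.

```python
import json
import urllib.request


def build_scoring_request(scoring_url: str, api_key: str, rows: list) -> urllib.request.Request:
    """Prepare an authenticated POST request for a key-protected scoring endpoint."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values; in the project the URL and key would come from the key vault.
request = build_scoring_request(
    "https://example.invalid/score",
    "<api-key>",
    [{"total_batch_weight": 1200.0, "outdoor_temperature": 3.5}],
)
print(request.get_method(), request.full_url)

# Sending requires network access to the private endpoint:
#   with urllib.request.urlopen(request) as response:
#       predictions = json.loads(response.read())
```

Because the cluster is private, such a call only succeeds from inside the virtual network or through a private endpoint.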
The model deployment consisted of two parts, first setting up the data transfer from a shared master lake to the project, and then rebuilding the machine learning pipeline in the production environment.
The machine learning pipeline was easy to set up as most of the work had been done in the R&D stage. The code was rewritten to take in one pandas data frame of the raw data and then return a processed data frame. The code was then copied into template files that are automatically generated by the notebooks provided with ESML.
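The "one data frame in, one data frame out" shape of the rewritten code can be sketched like this. The column names and the cleaning and aggregation steps are hypothetical and only stand in for the project's actual feature engineering logic.

```python
import pandas as pd


def process_raw(raw: pd.DataFrame) -> pd.DataFrame:
    """Take one raw data frame and return one processed data frame."""
    df = raw.copy()  # leave the caller's data untouched
    # Drop rows that cannot be tied to a batch or have no weight.
    df = df.dropna(subset=["batch_id", "weight_kg"])
    # Fill missing temperature readings with the median of the remaining rows.
    df["outdoor_temp_c"] = df["outdoor_temp_c"].fillna(df["outdoor_temp_c"].median())
    # Aggregate piece-level rows to one row per batch.
    return df.groupby("batch_id", as_index=False).agg(
        total_batch_weight=("weight_kg", "sum"),
        outdoor_temp_c=("outdoor_temp_c", "mean"),
    )


# Hypothetical raw data with a few missing values.
raw = pd.DataFrame({
    "batch_id": [1, 1, 2, 2, None],
    "weight_kg": [10.0, 12.0, 8.0, None, 5.0],
    "outdoor_temp_c": [4.0, 4.0, None, 6.0, 7.0],
})
print(process_raw(raw))
```

A pure function of this shape is easy to paste into a generated pipeline step, since the step only has to load a data frame, call the function, and save the result.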
Once the Python scripts were ready, the machine learning part of the pipeline could be created and registered by running a few lines of Python code. Once this was done, the machine learning pipeline could be added in Azure Data Factory and scheduled to run as required. In our case that was once per day, performing inference on the latest data and returning a predicted result.
The predicted results were presented in Power BI as a report through both tables and visualizations, to make it easier for stakeholders to observe the predictions of the batches. We also added processed data to the dashboard.
When the technical delivery proved to be viable, we decided to move the solution from the R&D stage to production. The handover was carried out together with a local MLOps engineer, Erik Rosendahl, who translated the PySpark code into Python code in Visual Studio, a less computationally demanding solution. The advantage of this is that an automated pipeline can be launched at the same time as the code is double-checked and its quality ensured. The knowledge migration between the teams is also a great advantage.
I worked with Gustav and Lotta, who handed over the solution. What was left to do was to use the ESML templates for DataOps (the Azure Data Factory templates) and the MLOps artifacts in the form of the ESML templates (Figure 2 and 11) for Azure Machine Learning pipelines.
DataOps: The data transfer from source to master lake and the subsequent transfer from master lake to project was set up by an ESML core team, focusing on data engineering. The data transfer was then set up to automatically update the data with the desired frequency via the ESML DataOps templates for Azure Data Factory. The template IN_2_GOLD_SCORING was used for this use case, easy to configure for batch scoring scenarios.
Machine Learning Operations: The machine learning pipeline was easy to set up as most of the work had been done in the R&D stage. The code was rewritten to take in one pandas data frame of the raw data and then return a processed data frame. The code was then copied into template files that are automatically generated by the notebooks provided with ESML control plane (see Figure 11) – a Python SDK that orchestrates the workflow and leverages the templates for the data scientist to work in an accelerated way.
ESML generated the baseline Python scripts. Once they had been edited with feature engineering logic and were ready, the machine learning part of the pipeline could be created and registered by running a few lines of Python code. The machine learning pipeline could then be configured in the ESML DataOps template, e.g. in the Azure Data Factory pipeline, and scheduled to run as required. In our case that was once per day, performing inference on the latest data and returning a predicted result.
Figure 2: Model deployed as a batch scoring Azure machine learning pipeline, powered by a private Azure Machine learning CPU-clusters.
The ESML accelerator matched all our requirements for the project: fast implementation, enterprise grade security, customizable settings for machine learning, comprehensive analysis, and quick insights. We could focus on the core of the problem, finding the optimal algorithm for quality improvement, instead of spending time on the repetitive architectural and orchestration tasks required for the analysis in an R&D stage. ESML also created the experiments in an automated way, including mapping the data as Azure ML datasets. We especially appreciated how AutoML automatically logged metrics. In that way, we could easily compare different runs, use visualizations to share our insights with stakeholders, and quickly iterate on selected features, data sources and end-result.
Decades of production data and knowledge processed in seconds and consumed on a Power BI report at the shop floor only meters from the ovens. It is exciting to see how we could get such a deep scientific understanding that was not yet documented by the company's research department. Some insights resulted in direct actions that improved the manufacturing process with value creation right away.
The ESML accelerator is constantly improved with the help of feedback from the community and customers, such as this project. The ESML accelerator itself, with its extension features, has in turn provided feedback to inspire end-2-end machine learning solutions: DataOps to MLOps with both inner and outer loop, less code to write pipelines in Azure Machine Learning, and one-liner code for test set scoring.
This project was executed back in 2021, before Azure Machine Learning v2. Now (at the time of writing, 2022-06-21) Azure Machine Learning has evolved: Azure Machine Learning pipelines (endpoints) are easier to create, and a feature called Test model, which provides test-set scoring, exists in public preview in Azure Machine Learning Studio (also in v1). We have already started looking into this evolution and the new AutoML features. We can't wait to try this and see what Azure Machine Learning and the ESML accelerator will help to achieve in the future, given the great impact they have had on our project results so far.