In this two-part blog series we explore a solution template for creating models for soft sensors, taking advantage of the scalability and automation provided by the Microsoft Azure platform.
In the first part, we explored what a soft sensor is, the use case, dataset, common approaches and major steps usually needed to model soft sensors.
In this second part, we will explore how to use Microsoft's Azure Data and AI platforms to process and model the data at scale, in an automated fashion.
The code for this second part is available in this GitHub repository as a series of Python scripts and a Jupyter notebook that implement the steps needed to prepare the data and model the soft sensors, following the solution architecture presented here.
In a real-world scenario, there may be thousands of sensor-generated time series as input data to prepare and hundreds of critical sensors to model. We therefore need a scalable solution, capable of parallelizing and distributing the data processing and model training tasks.
Proposed Solution Architecture
We rely on Azure ML Pipelines for implementing and executing the steps needed for data preparation, feature selection, and model training. This gives us advantages such as:
- Creating a modular implementation by isolating self-contained processing steps and promoting separation of concerns. For example, different teams can implement and maintain the data preparation and model training steps.
- Utilizing heterogeneous computing platforms, choosing the platform best suited to each task and to the teams' preferences. For example, using Apache Spark on Azure Synapse for data preparation and feature selection, and Python on Azure ML for model training.
- Automating workflows so they run in an unattended fashion. At a higher level, the operationalized solution will likely be triggered and orchestrated by a workflow orchestrator such as Azure Data Factory.
Fig. 1 below gives an overview of the proposed solution architecture, highlighting the implemented execution steps and the data flow:
Fig.1: Proposed Solution Architecture
Pipeline Processing
At a high level, the code running on Azure ML (2) defines and orchestrates the execution of the pipeline steps for data preparation (3), feature selection (6), and model training (8). It also defines Azure ML Datasets referencing the input data (1), the intermediate data (5), (7), and the output data (9), as well as the compute targets on which those pipeline steps execute.
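To make this concrete, below is a condensed sketch of how such a pipeline could be assembled with the Azure ML SDK v1. All dataset, script, environment, and compute names are illustrative rather than taken from the repository, and the feature selection step (6) would follow the same SynapseSparkStep pattern shown here for data preparation.

```python
# A condensed sketch of the pipeline assembly (Azure ML SDK v1); names are illustrative.
from azureml.core import Dataset, Environment, Experiment, Workspace
from azureml.data import HDFSOutputDatasetConfig, OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep, SynapseSparkStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

raw = Dataset.get_by_name(ws, "raw_sensor_data")                 # input data (1), a FileDataset
prepared = HDFSOutputDatasetConfig(
    destination=(datastore, "prepared/{run-id}")
).register_on_complete(name="prepared_sensor_data")              # intermediate data (5)
models_out = OutputFileDatasetConfig(
    destination=(datastore, "models/{run-id}")
)                                                                # output data (9)

prep_step = SynapseSparkStep(                                    # data preparation (3)
    name="data_prep",
    file="prep.py",
    source_directory="scripts",
    compute_target="synapse-spark",                              # attached Synapse Spark pool
    inputs=[raw.as_named_input("raw").as_hdfs()],
    outputs=[prepared],
)

train_config = ParallelRunConfig(                                # model training (8)
    source_directory="scripts",
    entry_script="train.py",
    partition_keys=["target_sensor"],                            # one mini-batch per target sensor
    error_threshold=-1,
    output_action="append_row",
    environment=Environment.from_conda_specification("train-env", "scripts/train_env.yml"),
    compute_target=ws.compute_targets["cpu-cluster"],
    node_count=4,
)
train_step = ParallelRunStep(
    name="model_training",
    parallel_run_config=train_config,
    # Assumes the featurized dataset (7) is registered by the feature selection step.
    inputs=[Dataset.get_by_name(ws, "featurized_sensor_data").as_named_input("featurized")],
    output=models_out,
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
Experiment(ws, "soft-sensor-modeling").submit(pipeline)
```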
The data preparation and feature selection steps run on Azure Synapse but are driven by Azure ML, taking advantage of parallel, distributed computing on Apache Spark, as explained later. Creating an Azure ML compute target that points to an Azure Synapse Spark pool is possible once you link your Azure Synapse workspace to your Azure ML workspace, as sketched below.
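For reference, attaching a linked Synapse Spark pool as an Azure ML compute target looks roughly like this; the linked service name, pool name, and compute alias are illustrative.

```python
# A minimal sketch of attaching a Synapse Spark pool as an Azure ML compute target,
# assuming the Synapse workspace is already linked to the Azure ML workspace.
# The linked service name, pool name, and compute alias below are illustrative.
from azureml.core import LinkedService, Workspace
from azureml.core.compute import ComputeTarget, SynapseCompute

ws = Workspace.from_config()
link = LinkedService.get(ws, "synapse-link")      # existing Azure ML <-> Synapse link

attach_config = SynapseCompute.attach_configuration(
    linked_service=link,
    type="SynapseSpark",
    pool_name="sparkpool01",                      # the Synapse Spark pool to attach
)
synapse_compute = ComputeTarget.attach(ws, "synapse-spark", attach_config)
synapse_compute.wait_for_completion()
```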
Model training runs on an Azure ML Compute Cluster, taking advantage of parallel, distributed computing on Azure ML, as explained later.
Data Preparation Step
Here, an Azure ML pipeline step running on an Azure Synapse Spark pool (3) reads the raw sensor data from the input storage location (1) and processes it for missing value imputation, outlier detection and treatment, and time series resampling. The code in this step runs on PySpark, taking advantage of data manipulation on Spark DataFrames for processing parallelism. By running on Azure Synapse, we also get native access to SynapseML (previously MMLSpark), whose APIs expose several Azure Cognitive Services, among them Azure Anomaly Detector, used here for outlier detection (4). The prepared sensor data is then saved as an intermediate dataset (5) to be consumed by the next step in the pipeline.
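The core of this step could look roughly like the sketch below. It assumes raw data in long format with illustrative timestamp, sensor_id, and value columns, uses SynapseML's SimpleDetectAnomalies transformer for outlier detection, and applies a simple average-based resampling with forward-fill imputation; the repository's implementation may differ in its details.

```python
# A minimal PySpark sketch of the data preparation step. Column names, paths,
# and keys are illustrative; the actual implementation may differ.
from pyspark.sql import SparkSession, Window, functions as F
from synapse.ml.cognitive import SimpleDetectAnomalies  # SynapseML, natively available on Azure Synapse

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/raw/")  # input data (1)

# Outlier detection (4): flag anomalous points per sensor with Azure Anomaly Detector.
with_anomalies = (
    SimpleDetectAnomalies()
    .setSubscriptionKey("<anomaly-detector-key>")
    .setLocation("<region>")
    .setGroupbyCol("sensor_id")
    .setTimestampCol("timestamp")
    .setValueCol("value")
    .setGranularity("minutely")
    .setOutputCol("anomaly")
    .transform(raw)
)

# Outlier treatment: blank out anomalous readings so they are re-imputed below.
cleaned = with_anomalies.withColumn(
    "value", F.when(F.col("anomaly.isAnomaly"), None).otherwise(F.col("value"))
)

# Resample each sensor to a uniform 1-minute grid by averaging within each interval.
resampled = (
    cleaned.withColumn("ts", F.date_trunc("minute", F.col("timestamp")))
           .groupBy("sensor_id", "ts")
           .agg(F.avg("value").alias("value"))
)

# Missing value imputation: forward-fill with the last known value per sensor.
w = Window.partitionBy("sensor_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
prepared = resampled.withColumn("value", F.last("value", ignorenulls=True).over(w))

prepared.write.mode("overwrite").parquet(
    "abfss://<container>@<account>.dfs.core.windows.net/prepared/"  # intermediate data (5)
)
```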
Feature Selection Step
This step also runs on an Azure Synapse Spark pool (6). It begins by reading the intermediate data generated by the data preparation step (5). The feature selection procedure is computed for each target sensor against all candidate features individually, so we again take advantage of data manipulation on Spark DataFrames for processing parallelism. The featurized sensor data is then saved as an intermediate dataset (7) to be consumed by the next step in the pipeline.
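As an illustration only, the sketch below scores every candidate feature against each target sensor, using absolute Pearson correlation as a stand-in for the selection metric discussed in part one; column names, paths, and the top-k cutoff are hypothetical.

```python
# An illustrative sketch of per-target feature scoring on Spark, using absolute
# Pearson correlation as a stand-in selection metric. Column names, paths, and
# the top-k cutoff are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Prepared data, pivoted to wide format: one "ts" column plus one column per sensor.
prepared = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/prepared/")  # (5)

target_sensors = ["sensor_042", "sensor_117"]   # critical sensors to be modeled (hypothetical)
candidates = [c for c in prepared.columns if c != "ts" and c not in target_sensors]
top_k = 20

selected = {}
for target in target_sensors:
    # Score all candidate features against this target in a single distributed aggregation.
    scores = prepared.agg(
        *[F.abs(F.corr(target, c)).alias(c) for c in candidates]
    ).first().asDict()
    scores = {c: s for c, s in scores.items() if s is not None}
    selected[target] = sorted(scores, key=scores.get, reverse=True)[:top_k]

# The featurized dataset (7) would then keep, for each target, only its selected features.
```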
Model Training Step
To model potentially hundreds of sensors in parallel, we take advantage of the ParallelRunStep implementation from Azure ML Batch Inference. Although the name implies a batch inference task, this feature is also well suited to generic, parallel map-style operations. First, the intermediate data generated by the feature selection step (7) is read; when defined as an Azure ML Dataset, it is partitioned by output sensor name. In this way, each data partition is processed independently, in parallel, by a different node in the Azure ML Compute Cluster (8). At the end of the processing, the model data and validation metrics are saved to the output storage location (9).
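The entry script for a ParallelRunStep follows an init()/run() contract, where each call to run() receives one mini-batch, here one target sensor's partition, as a pandas DataFrame. The sketch below is illustrative: the model type, column names, and metric are placeholders rather than the repository's exact setup.

```python
# An illustrative ParallelRunStep entry script: run() is called once per mini-batch,
# i.e. once per target-sensor partition. Model choice, column names, and the output
# location are placeholders, not the repository's exact setup.
import json
import os

import joblib
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

MODEL_DIR = os.path.join("outputs", "models")  # hypothetical location for the model files (9)


def init():
    os.makedirs(MODEL_DIR, exist_ok=True)


def run(mini_batch):
    # mini_batch is a pandas DataFrame holding the partition for one target sensor.
    target = mini_batch["target_sensor"].iloc[0]
    y = mini_batch["target_value"]
    X = mini_batch.drop(columns=["target_sensor", "target_value", "ts"])

    # Time-ordered split: keep the last 20% of samples for validation.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)
    model = Ridge().fit(X_train, y_train)
    r2 = r2_score(y_val, model.predict(X_val))

    # Persist the trained model and return the validation metrics, which ParallelRunStep
    # appends to the step output (9) when output_action="append_row".
    joblib.dump(model, os.path.join(MODEL_DIR, f"{target}.pkl"))
    return [json.dumps({"sensor": target, "r2": r2})]
```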
Conclusion
In this post we explored a scalable and robust solution architecture suitable for real-world soft sensor modeling use cases. The modular nature of this architecture allows us to segregate the processing steps, mapping them to the steps in our development cycle, such as data preparation, feature selection, and model training. By doing so, we can leverage the backend computing resources and processing platforms best suited to each task.
The architecture presented here is just the foundation for an end-to-end implementation: it can be integrated with other services and components to implement an MLOps approach, such as automated data movement and execution triggering, a feature store, and pipelines for model deployment and model consumption.