Data Drift in Azure Machine Learning
Published Mar 09 2020

In this new age of collaboration, with a more refined focus on Data Engineering, Data Science Engineering, and AI Engineering, it's clear that the choppy "swim-lanes" of responsibility have become even more blurred.

In this short post, I wanted to cover a new feature in Azure Machine Learning studio named Data Drift Detection. For this article, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring for data drift helps detect model performance issues.

 

Causes of data drift include:

  • Upstream process changes, such as a sensor being replaced that changes the unit of measure being applied, perhaps from imperial to metric.
  • Data quality issues, such as a failed deployment of a patch for a software component or a broken sensor that always reads 0.
  • Natural drift in the data, such as mean temperature changing with the seasons or deprecated interface features no longer being utilized.
  • Change in relation between features, or covariate shift.

With dataset monitors in Azure Machine Learning studio, your organization can set up alerts that assist in detecting data drift, which helps you maintain healthy and accurate machine learning models in your deployments.
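
Alerting is wired up through the SDK's AlertConfiguration class. Here is a minimal sketch; the email address is just a placeholder, and the configuration object is handed to the monitor when it is created (monitor creation itself is covered later in this post).

from azureml.datadrift import AlertConfiguration

# email addresses to notify when a monitor's drift metric exceeds its threshold
# (the address below is a placeholder)
alert_config = AlertConfiguration(email_addresses=['ops-team@contoso.com'])

# pass this object as the alert_config parameter of
# DataDriftDetector.create_from_datasets, shown further down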

 

There are three primary scenarios for setting up dataset monitors in Azure Machine Learning:

 

  • Monitoring a model's serving data for drift from the model's training data. Results from this scenario can be interpreted as monitoring a proxy for the model's accuracy, given that model accuracy degrades if the serving data drifts from the training data.
  • Monitoring a time series dataset for drift from a previous time period. This scenario is more general, and can be used to monitor datasets involved upstream or downstream of model building. The target dataset must have a timestamp column, while the baseline dataset can be any tabular dataset that has features in common with the target dataset.
  • Performing analysis on past data. This scenario can be used to understand historical data and inform decisions about dataset monitor settings.

 

Your dataset will require a timestamp; once you have a baseline dataset defined, your target dataset (the incoming model input) is compared against it at the intervals you specify, helping your system become more proactive.

 

Using the azureml SDK, you would execute code like the following to get your workspace (your Azure Machine Learning workspace), reference your originating Datastore, and create and register the Dataset you wish to monitor.

from azureml.core import Workspace, Dataset, Datastore

# get workspace object
ws = Workspace.from_config()

# get datastore object 
dstore = Datastore.get(ws, 'your datastore name')

# specify datastore paths
dstore_paths = [(dstore, 'weather/*/*/*/*/data.parquet')]

# specify partition format
partition_format = 'weather/{state}/{date:yyyy/MM/dd}/data.parquet'

# create the Tabular dataset with 'state' and 'date' as virtual columns 
dset = Dataset.Tabular.from_parquet_files(path=dstore_paths, partition_format=partition_format)

# assign the timestamp attribute to a real or virtual column in the dataset
dset = dset.with_timestamp_columns('date')

# register the dataset as the target dataset
dset = dset.register(ws, 'target')
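
As a quick sanity check (a minimal sketch, assuming the dataset was registered under the name 'target' as above), you can pull the registered dataset back and preview a few rows to confirm the 'state' and 'date' virtual columns were picked up:

from azureml.core import Workspace, Dataset

# get workspace object
ws = Workspace.from_config()

# retrieve the registered target dataset and preview the first few rows
target = Dataset.get_by_name(ws, 'target')
print(target.take(5).to_pandas_dataframe())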

 

 

Once this is established, you can build a dataset monitor to detect drift based on the drift threshold you provide.

 

 

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime

# get the workspace object
ws = Workspace.from_config()

# get the target dataset
target = Dataset.get_by_name(ws, 'target')

# set the baseline dataset
baseline = target.time_before(datetime(2019, 2, 1))

# set up feature list
features = ['latitude', 'longitude', 'elevation', 'windAngle', 'windSpeed', 'temperature', 'snowDepth', 'stationName', 'countryOrRegion']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'drift-monitor', baseline, target, 
                                                      compute_target='cpu-cluster', 
                                                      frequency='Week', 
                                                      feature_list=None, 
                                                      drift_threshold=.6, 
                                                      latency=24)

# get data drift detector by name
monitor = DataDriftDetector.get_by_name(ws, 'drift-monitor')

# update data drift detector
monitor = monitor.update(feature_list=features)

# run a backfill for January through May
backfill1 = monitor.backfill(datetime(2019, 1, 1), datetime(2019, 5, 1))

# run a backfill for May through today
backfill2 = monitor.backfill(datetime(2019, 5, 1), datetime.today())

# disable the pipeline schedule for the data drift detector
monitor = monitor.disable_schedule()

# enable the pipeline schedule for the data drift detector
monitor = monitor.enable_schedule()
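
Once the backfills and scheduled runs have produced results, you can query the drift metrics or plot the trend directly from the detector. Here is a minimal sketch, continuing from the monitor, backfill, and datetime objects defined above:

# wait for the most recent backfill run to finish before querying results
backfill2.wait_for_completion(show_output=True)

# retrieve drift results and metrics for a given time window
results, metrics = monitor.get_output(start_time=datetime(2019, 1, 1), end_time=datetime.today())
print(metrics)

# plot the drift trend over the same window
monitor.show(start_time=datetime(2019, 1, 1), end_time=datetime.today())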


Of course, you can do all of this through the Azure ML Studio (ml.azure.com) if you're running the Enterprise edition features.  Oops, there are locks beside them!

 

[Image: AzureML.PNG, Azure ML Studio navigation showing locked features]

Not to worry: if you see a lock beside the features, just click on Automated ML or Designer in the left-hand navigation pane and you'll see an upgrade modal appear; the upgrade literally takes just a few seconds.

[Image: Upgrade.PNG, the Enterprise upgrade modal]

Once you're finished with the upgrade to Enterprise features, you'll be able to configure your data drift monitor within the portal.

 

In the next article, we'll set up a new dataset and also ensure that we are collecting data and monitoring it for drift.
