Overview
We are excited to announce the general availability of Automated ML (AutoML) training code generation. With this feature, users can view the training script behind their AutoML models, giving them full transparency into how each model was trained. Users can also customize or tweak the script as needed for their use case, allowing them to quickly move AutoML models to production.
Why is this important?
AutoML is a powerful tool for data scientists: they provide data and configure basic job parameters, and AutoML iterates over the applicable ML algorithms to train the best model for the data provided and the accuracy metric selected. However, we often hear that AutoML is a black box, which keeps data scientists from productionizing AutoML models.
One reason for this is that enterprises often have compliance regulations that require a data scientist to fully understand how a model was trained. With the generated script, data scientists can view and analyze the model training code and have peace of mind about the models they push to production. Additionally, data scientists working in a given problem space can apply their subject matter expertise to further iterate on the model AutoML produced, for example by modifying the featurization step or adjusting the hyperparameters. The generated script uses open-source frameworks where possible, so data scientists can quickly build on top of AutoML's best model and customize it as needed to improve performance for their use case. Finally, since model training is an iterative process, it is important for data scientists to track the versions of the code used to generate different models; with the generated Python script, they can use the versioning system of their choice to keep the training process trackable and reproducible.
How does this feature work?
By default, training code is generated for all AutoML models. There are two main assets that can be accessed: script.py (the training code) and script_run_notebook.ipynb (a boilerplate Jupyter notebook used to submit script.py as a job in Azure ML).
Script.py
This file contains the core logic needed to train a model. While intended to be executed in the context of an Azure ML command job, with some modifications, the model's training code can also be run as a standalone script in a user's environment of choice.
The script can roughly be broken down into the following parts: data preparation, data featurization and algorithm specification. The above pieces are stitched together with functions that allow the standalone script to be submitted as is for replicating the training of an AutoML model.
Data preparation code
The function prepare_data() cleans the data, splits out the feature and sample weight columns and prepares the data for use in training. This function can vary depending on the type of dataset and the AutoML experiment task type (classification, regression, time-series forecasting).
The following example shows the general pattern: the data frame from the data loading step is passed in, the label column and sample weights (if originally specified) are extracted, and rows containing NaN are dropped from the input data.
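As a hypothetical sketch only (the label column name "y" and the exact cleaning steps are illustrative assumptions; the generated function varies by dataset and task type):

```python
import pandas as pd


def prepare_data(dataframe):
    """Split out the label column and clean the input data.

    Illustrative sketch: the generated function's exact steps depend
    on the dataset and the AutoML task type.
    """
    label_column_name = "y"  # assumed label column name for illustration

    # Drop rows with a missing label; the generated script may also drop
    # rows containing NaN in the feature columns.
    dataframe = dataframe.dropna(subset=[label_column_name])

    # Separate the features from the label. Sample weights, if a weight
    # column was specified, would be split out here in the same way.
    y = dataframe[label_column_name].to_numpy()
    X = dataframe.drop(columns=[label_column_name])
    sample_weights = None
    return X, y, sample_weights
```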
Custom data preparation can be added to the above code as required.
Data featurization code
The featurizers from the original experiment are reproduced here, along with their parameters.
For example, the data transformations in this function can be based on imputers such as SimpleImputer() and CatImputer(), or on transformers such as StringCastTransformer() and LabelEncoderTransformer().
The following example shows a StringCastTransformer() used to transform a set of columns, in this case the set indicated by column_names.
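StringCastTransformer() is part of the AutoML-generated code rather than a standard library class, so the following is only an illustrative stand-in built with scikit-learn: the columns named in column_names are cast to strings before being encoded, mirroring the idea described above. The column names are placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

# Placeholder column names for illustration.
column_names = ["city", "product"]

# Stand-in for AutoML's StringCastTransformer: cast the selected columns
# to str so the downstream encoder sees a uniform dtype.
string_cast = FunctionTransformer(lambda frame: frame.astype(str))

categorical_pipeline = Pipeline(steps=[
    ("string_cast", string_cast),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])

# Apply the pipeline only to column_names; pass all other columns through.
preprocessor = ColumnTransformer(
    transformers=[("categoricals", categorical_pipeline, column_names)],
    remainder="passthrough",
)
```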
Algorithm and hyperparameters specification code
The algorithm and hyperparameters specification code is likely what many data scientists are most interested in.
The following example uses an XGBoostClassifier algorithm with specific hyperparameters. In most cases the generated code uses open-source packages and classes (for example, the XGBoost classifier, or algorithms from other commonly used libraries such as LightGBM and scikit-learn). Data scientists can customize the algorithm's configuration by tweaking its hyperparameters as needed, based on their skills and expertise.
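The generated script instantiates the winning algorithm (for example, an XGBoost classifier) with the hyperparameter values AutoML selected. As a sketch of the general shape only, using scikit-learn's GradientBoostingClassifier as a stand-in and placeholder hyperparameter values rather than anything AutoML produced:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder hyperparameter values; the generated script pins the exact
# values AutoML chose for the winning model (e.g. an XGBoost classifier).
algorithm = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=0,
)
```

Editing this one block, then rerunning the script, is typically all it takes to try a different hyperparameter configuration.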
Script_run_notebook.ipynb
This notebook provides an easy way to execute script.py on Azure ML compute. It is similar to the existing Automated ML sample notebooks and has cells for connecting to a workspace, creating or reusing Azure ML compute, and submitting an Azure ML command job run (as depicted in the image below).
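A minimal sketch of what those cells might contain, using the Azure ML SDK v1; the cluster name, VM size, and experiment name are placeholders, environment setup is omitted, and running it requires a workspace config.json:

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Connect to the workspace (reads config.json downloaded from the portal).
ws = Workspace.from_config()

# Create or reuse a compute cluster; name and VM size are placeholders.
cluster_name = "cpu-cluster"
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
else:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2", max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)

# Submit script.py as a command job and wait for it to finish.
src = ScriptRunConfig(source_directory=".", script="script.py",
                      compute_target=compute_target)
run = Experiment(ws, "automl-codegen").submit(src)
run.wait_for_completion(show_output=True)
```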
Get Started Today!
- Watch our Azure Machine Learning breakout session
- Get started with Microsoft Learn to build skills
- Explore Azure announcements at Microsoft Ignite
- View the Azure Docs page for this feature