How to train a machine learning model to be analyzed for issues with Responsible AI (Part 2)

When we train a machine learning model, we want the model to learn or uncover patterns. We focus on how accurately the model can make predictions and try to reduce its error rate. However, by focusing too much on aggregated performance metrics such as accuracy, we often overlook two important issues:

 

  1. Errors are often not uniformly distributed across our dataset; there often exist sub-cohorts of data that receive more erroneous predictions than others (a.k.a. model blind spots). That makes it important to look at how prediction errors are distributed across features, and at how well the model performs on certain cohorts of the data versus others.
  2. Popular model performance metrics such as accuracy are insufficient on their own; they tell only part of the story. Other important aspects, such as model fairness, reliability, and interpretability, should be taken into consideration to deliver a holistic model assessment.

As AI becomes more common, it’s important to keep people and their goals at the center, working to maximize benefits and minimize risks. As a result, machine learning models need to be scrutinized for errors, fairness, and reliability. These considerations must be apparent not just to data scientists but also to decision makers, end users, and compliance auditors.

 

In the prior tutorial, we covered how to create an Azure Machine Learning workspace with Responsible AI (RAI) components. In this tutorial, we’ll train the model whose issues we will later analyze and resolve using the Azure Machine Learning Responsible AI dashboard. We will be using the Diabetes Hospital Readmission dataset to predict whether or not a diabetic patient will be readmitted to a hospital within 30 days.

 

Prerequisites

 

This is Part 2 of a tutorial series. You’ll need to complete the prior tutorial below:

  • Part 1: How to create an Azure Machine Learning workspace with Responsible AI (RAI) components

Create an Experiment

 

To create the experiment, first we need to initialize the client session and connect to the Azure Machine Learning workspace. You’ll need to copy the config.json file (see the “Download config.json” link highlighted in red in the image below) and put it in the same folder as your Jupyter notebook file.

 

download-config.png

 

Run the next code snippet to initialize the client session. The MLClient uses the data in config.json to authenticate and connect to your workspace. In addition, you need to create a second client session that connects to the “azureml” system registry, where the RAI components are registered. You should get a message stating that your config.json file was found.

 

 

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Environment, BuildContext
from azureml.mlflow import register_model
import mlflow
import pandas as pd

#connect to the workspace
registry_name = "azureml"
credential = DefaultAzureCredential()
ml_client =  MLClient.from_config(credential=credential)

ml_client_registry = MLClient(
    credential=credential,
    subscription_id=ml_client.subscription_id,
    resource_group_name=ml_client.resource_group_name,
    registry_name=registry_name
    )

 

 

Data Preparation

 

The original UCI dataset is large, with ~101,700 records, and it is very imbalanced. So, we reduced the data sample size and used the SMOTE technique to balance out the minority class during the data cleansing process for our training. We dropped columns that had a large number of missing values or were not relevant to a patient returning to the hospital within 30 days. For example, the 20+ columns indicating whether or not a patient took a certain diabetic medication (rosiglitazone, citoglipton, metformin, etc.) had little impact on a patient’s return to the hospital. Weight could impact a patient’s return; however, 30% of the data has missing values for it.

 

The patient’s form of payment likely has no impact on the likelihood of returning to the hospital, so we dropped the Payer_Code field. However, we added Medicare and Medicaid columns to indicate whether or not the hospital bill was paid through government medical assistance for low-income patients. This helps us understand whether there are any socioeconomic gaps in the patient demographics associated with our findings.
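The cleansing and balancing happened before this tutorial, so it isn’t part of this notebook, but for reference, balancing the minority class with SMOTE might look roughly like the sketch below. It uses the imbalanced-learn package, and the variable names (cleansed_features, cleansed_labels) are placeholders; note that categorical columns would need to be numerically encoded first (or SMOTENC used instead).

from imblearn.over_sampling import SMOTE

# 'cleansed_features' and 'cleansed_labels' are hypothetical names for the
# reduced, cleansed dataset before balancing.
smote = SMOTE(random_state=0)
balanced_features, balanced_labels = smote.fit_resample(cleansed_features, cleansed_labels)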

 

dataframe-display.png

 

To work with our dataset, we need to load the preprocessed training and testing data into dataframes. In our case, the files are stored in parquet format; however, you could also split the data yourself with the train_test_split function and store the resulting sets in a file format of your choice for your experiment.

 

 

train_data = pd.read_parquet('data/train_dataset.parquet')
test_data = pd.read_parquet('data/test_dataset.parquet')
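If you were starting from a single cleansed dataframe instead of pre-split files, a minimal sketch of creating and saving your own split might look like this (cleansed_df is a hypothetical name for that dataframe):

from sklearn.model_selection import train_test_split

# Split the cleansed dataframe and persist both parts as parquet files
train_data, test_data = train_test_split(cleansed_df, test_size=0.2, random_state=0)
train_data.to_parquet('data/train_dataset.parquet')
test_data.to_parquet('data/test_dataset.parquet')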

 

 

Since this is a classification problem, we will use the “readmit_status” column as our target column. We also change its values from 0 and 1 to the easier-to-read labels Not Readmitted and Readmitted. The component that trains the model needs the name of the target column as an input, so we define a variable for it.

 

 

target_column = "readmit_status"
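The relabeling step itself isn’t shown in this post; a minimal sketch, assuming the loaded dataframes still hold the 0/1 labels, could look like the following (if you relabel in memory, remember to write the dataframes back out before registering or consuming them elsewhere):

# Map the numeric labels to readable class names (illustrative sketch only)
readmit_map = {0: "Not Readmitted", 1: "Readmitted"}
train_data[target_column] = train_data[target_column].map(readmit_map)
test_data[target_column] = test_data[target_column].map(readmit_map)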

 

 

Next, we will register the training and testing parquet files as Azure Machine Learning data assets. The benefit of storing datasets in the Azure ML workspace is that the data is kept in one centralized location and can be referenced by other workflows; it is also a good way to track which datasets were used for each experiment. And since Azure Machine Learning Studio is a shared workspace, other team members can reuse the cleansed dataset by simply referencing it.
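The registration code isn’t shown in this post; a minimal sketch using the v2 SDK’s Data entity might look like the following (the asset names, descriptions, and versions are placeholders, not values prescribed by the tutorial):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Register the local parquet files as data assets in the workspace (sketch only)
train_data_asset = Data(
    path="data/train_dataset.parquet",
    type=AssetTypes.URI_FILE,
    description="Diabetes hospital readmission training data",
    name="hospital_train_parquet",
    version="1",
)
test_data_asset = Data(
    path="data/test_dataset.parquet",
    type=AssetTypes.URI_FILE,
    description="Diabetes hospital readmission test data",
    name="hospital_test_parquet",
    version="1",
)

ml_client.data.create_or_update(train_data_asset)
ml_client.data.create_or_update(test_data_asset)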

 

 


 

 

Training Components

 

We’ll be using Azure Machine Learning components to divide the experiment into separate tasks. Components are reusable, independent units of machine learning work with inputs and outputs (e.g., cleaning data, training a model, registering a model, deploying a model). For our experiment, we will create a component for training the model. The component consists of Python training code with defined inputs and outputs.

 

The first thing to define in our Python training script is a function that creates an argument parser and adds the names and types of the input and output parameters. For our Diabetes Hospital Readmission use case, we pass in the path to the training dataset and the name of the target column for the classifier; the trained model is the script’s output.

 

 

%%writefile component/hospital_training.py

import argparse
import os
import shutil
import tempfile

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from azureml.core import Run
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def parse_args():
    # set up the argument parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")

    # parse args
    args = parser.parse_args()    

    # return args
    return args

 

 

Next, we will define the main function of the training script, which takes the parsed arguments as its parameter. The function reads in the training dataset and the target column name as inputs.

 

The code uses scikit-learn’s ColumnTransformer to one-hot encode the string columns and scale the numeric columns. Since we are trying to predict whether or not a diabetic patient will return to a hospital within 30 days, we use Logistic Regression to train our model.

 

 

%%writefile -a component/hospital_training.py

def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    # Read in data
    print("Reading data")
    all_training_data = pd.read_parquet(args.training_data)
    target = all_training_data[args.target_column_name]
    features = all_training_data.drop([args.target_column_name], axis = 1)  

    # Transform string data to numeric
    numerical_selector = selector(dtype_include=np.number)
    categorical_selector = selector(dtype_exclude=np.number)

    numerical_columns = numerical_selector(features)
    categorical_columns = categorical_selector(features)

    categorical_encoder = OneHotEncoder(handle_unknown="ignore")
    numerical_encoder = StandardScaler()

    preprocessor = ColumnTransformer([
        ('categorical-encoder', categorical_encoder, categorical_columns),
        ('standard_scaler', numerical_encoder, numerical_columns)])

    clf = make_pipeline(preprocessor, LogisticRegression())
    
    X_train, X_test, y_train, y_test = train_test_split(features, target, 
                                                        test_size=0.3, random_state=1)

    print("Training model...") 
    
    model = clf.fit(X_train, y_train)

 

 

The last important step in the main function is to save the trained model with MLflow to a temporary local directory and then copy it to the component’s output path, so the model ends up in the Azure ML experiment’s output folder.

 

 

%%writefile -a component/hospital_training.py

    # Saving model with MLflow
    model_dir =  "./model_output"
    with tempfile.TemporaryDirectory() as td:
        print("Saving model with MLFlow to temporary directory")
        tmp_output_dir = os.path.join(td, model_dir)
        mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)

        print("Copying MLFlow model to output path")
        for file_name in os.listdir(tmp_output_dir):
            print("  Copying: ", file_name)
            # As of Python 3.8, copytree will acquire dirs_exist_ok as
            # an option, removing the need for listdir
            shutil.copy2(src=os.path.join(tmp_output_dir, file_name), 
                         dst=os.path.join(args.model_output, file_name))

# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")

 

 

To define our training component, we create a YAML file that specifies the component name, the input and output parameters, the location of the Python code that trains the model, the command line to execute that code, and the environment to run it in. You can use an Azure Machine Learning curated environment or a custom one; in our case, we are using the Responsible AI out-of-the-box curated environment: AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu. (NOTE: select the Environments tab in Azure Machine Learning Studio to see the available environments.) Then we use the workspace client session to register the component definition in the workspace.
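The snippets that follow reference a few variables (a component version string, a model base name and suffix, and a compute target) whose definitions aren’t shown in this post. A minimal sketch with placeholder values, which you should adapt to your own workspace, might look like this:

import time

# Placeholder values -- these names are assumed here because their
# definitions are not shown in the post itself.
rai_hospital_classifier_version_string = "1"   # version for the registered component
model_base_name = "rai_hospital_classifier"    # base name for the registered model
model_name_suffix = int(time.time())           # unique suffix per training run
compute_name = "rai-cluster"                   # name of an existing AML compute cluster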

 

 

from azure.ai.ml import load_component

yaml_contents = f"""
$schema: http://azureml/sdk-2-0/CommandComponent.json
name: rai_hospital_training_component
display_name: hospital  classification training component for RAI example
version: {rai_hospital_classifier_version_string}
type: command
inputs:
  training_data:
    type: path
  target_column_name:
    type: string
outputs:
  model_output:
    type: path
code: ./component/
environment: azureml://registries/azureml/environments/AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu/versions/4
""" + r"""
command: >-
  python hospital_training.py
  --training_data ${{{{inputs.training_data}}}}
  --target_column_name ${{{{inputs.target_column_name}}}}
  --model_output ${{{{outputs.model_output}}}}
"""

yaml_filename = "RAIhospitalClassificationTrainingComponent.yaml"

with open(yaml_filename, 'w') as f:
    f.write(yaml_contents.format(yaml_contents))
    
train_component_definition = load_component(
    source=yaml_filename
)

ml_client.components.create_or_update(train_component_definition)

 

 

The second component we need is one that registers the trained model. Similar to the training component above, its YAML definition uses the same Azure Machine Learning component fields; the registration logic itself lives in a model_register.py script under ./register_model_src/ (a sketch of such a script follows the snippet below).

 

 

yaml_contents = f"""
$schema: http://azureml/sdk-2-0/CommandComponent.json
name: register_hospital_model
display_name: Register hospital Model
version: {rai_hospital_classifier_version_string}
type: command
is_deterministic: False
inputs:
  model_input_path:
    type: path
  model_base_name:
    type: string
  model_name_suffix: # Set negative to use epoch_secs
    type: integer
    default: -1
outputs:
  model_info_output_path:
    type: path
code: ./register_model_src/
environment: azureml://registries/azureml/environments/AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu/versions/4
command: >-
  python model_register.py
  --model_input_path ${{{{inputs.model_input_path}}}}
  --model_base_name ${{{{inputs.model_base_name}}}}
  --model_name_suffix ${{{{inputs.model_name_suffix}}}}
  --model_info_output_path ${{{{outputs.model_info_output_path}}}}

"""
yaml_filename = "model_register.yaml"

with open(yaml_filename, 'w') as f:
    f.write(yaml_contents)
    
register_component = load_component(
    source=yaml_filename
)

ml_client.components.create_or_update(register_component)

 

 

Training Pipeline

 

An Azure Machine Learning pipeline packages the components and runs them in sequence at runtime on a specified compute target, using the Docker images or conda environments each job needs. Now that the components above have been created, we need to define the pipeline that packages all the dependencies required to run the training experiment. To do this, you will need the following:

 

  • The experiment name and description
  • An Input object for the training dataset path
  • An Input object for the testing dataset path
  • The component that trains the model
  • The component that registers the model
  • The compute target for running the training job

It will use all of this information to package a pipeline job to run the experiment for training and registering the model.

 

 

from azure.ai.ml import dsl, Input


hospital_train_parquet = Input(
    type="uri_file", path="data/train_dataset.parquet", mode="download"
)

hospital_test_parquet = Input(
    type="uri_file", path="data/test_dataset.parquet", mode="download"
)

@dsl.pipeline(
    compute=compute_name,
    description="Register Model for RAI hospital ",
    experiment_name=f"RAI_hospital_Model_Training_{model_name_suffix}",
)
def my_training_pipeline(target_column_name, training_data):
    trained_model = train_component_definition(
        target_column_name=target_column_name,
        training_data=training_data
    )
    trained_model.set_limits(timeout=120)

    _ = register_component(
        model_input_path=trained_model.outputs.model_output,
        model_base_name=model_base_name,
        model_name_suffix=model_name_suffix,
    )

    return {}

model_registration_pipeline_job = my_training_pipeline(target_column, hospital_train_parquet)

 

 

Once you have defined the pipeline, you can submit it to run as a job. In the Python code below, we submit the pipeline job and poll its status until it finishes.

 

 

import time
import webbrowser

from azure.ai.ml.entities import PipelineJob

def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:
    created_job = ml_client.jobs.create_or_update(pipeline_job)
    assert created_job is not None

    while created_job.status not in ['Completed', 'Failed', 'Canceled', 'NotResponding']:
        time.sleep(30)
        created_job = ml_client.jobs.get(created_job.name)
        print("Latest status : {0}".format(created_job.status))


    # open the pipeline in web browser
    webbrowser.open(created_job.services["Studio"].endpoint)
    
    #assert created_job.status == 'Completed'
    return created_job

# This is the actual submission
training_job = submit_and_wait(ml_client, model_registration_pipeline_job)

 

 

An alternative way of checking the status of all the components in the pipeline job is to click the Jobs icon in Azure Machine Learning Studio.

 

aml-train-job.png

 

From the Jobs list, you can click on the job to view its progress and drill down to pinpoint where an error occurred. You will see the Diabetes Hospital Readmission dataset feeding as an input into our training component and the output model feeding into the register model component, each marked as complete. After the pipeline has finished running successfully, you will have a trained model registered in Azure Machine Learning Studio.

 

training-pipeline-studio.png

 

Fantastic! Now that you have a trained model, let’s move on to generating the Responsible AI dashboard!

 

Stay tuned for Part 3 of the tutorial…
