Responsible AI Dashboard and Scorecard in Azure Machine Learning
Published May 24 2022 08:00 AM
Microsoft

Introduction

 

In December 2021, we introduced the Responsible AI dashboard, a comprehensive experience that brings together several mature Responsible AI tools: data explorer (to proactively identify whether there is sufficient data representation for the variety of data subgroups), fairness assessment (to assess and identify your model's group fairness issues), model interpretability (to understand how features are impacting your model predictions), error analysis (to easily identify error distributions across your data cohorts), and counterfactual and causal inference analysis (to empower you to make responsible model-driven and data-driven decisions). The dashboard aims to address the issues of Responsible AI tool discoverability and fragmentation by enabling:

  • Model Debugging: Evaluate machine learning models by identifying model errors, diagnosing why those errors are happening, and mitigating them. 
  • Responsible Business Decision Making: Boost your data-driven decision-making abilities by addressing questions such as “what is the minimum change the end user could apply to their features to get a different outcome from the model?” and/or “what is the causal effect of reducing red meat consumption on diabetes progression?”


The Responsible AI dashboard is now integrated with the Azure Machine Learning (AzureML) platform, with general availability on November 10, 2022. This enables our cloud customers to use a variety of experiences (CLI, SDK, and a no-code UI wizard) to generate Responsible AI dashboards for their machine learning models, enhancing their model debugging and understanding processes.

 

Also in public preview is the Responsible AI scorecard, a reporting feature that can be generated in AzureML to create and share reports surfacing key data characteristics as well as model performance and fairness insights. The scorecard helps contextualize model and data health insights for both technical and non-technical audiences, bringing stakeholders along and assisting in compliance reviews.

 

Walkthrough of the Responsible AI dashboard 

 

In this article, the scenario we will walk through is a linear regression model used for the hypothetical purpose of determining developer access to a GPT-2 model published for a limited group of users. In the following sections, we will dive deeper into how the Responsible AI dashboard can be used to debug the data and model and inform better decision making. The regression model is trained on a historical dataset of programmers who were scored from 0 to 10 based on characteristics such as age, geographical region, what operating system they use, employer, style of coding, and so on. If the model predicts a score of 7 to 10, then they are allowed access. A sample of the synthetic data is below:

 

First name | Last name | Score (target) | Style | YOE | IDE | Programming language | Location | Number of GitHub repos contributed to | Employer | OS | Job title | Age
Bryan | Ray | 8 | spaces | 16 | Emacs | R | Antarctica | 2 | Snapchat | MacOS | Principal Engineer | 32
Donovan | Lucero | 3 | tabs | 9 | pyCharm | Swift | Antarctica | 2 | Instagram | Linux | Distinguished Engineer | 35
Dean | Hurley | 1 | tabs | 7 | XCode | C# | Antarctica | 0 | Uber | MacOS | Senior Engineer | 32
Nathan | Weaver | 6 | spaces | 15 | Visual Studio | R | Antarctica | 0 | Amazon | Linux | Principal Engineer | 32
Raelyn | Sloan | 5 | tabs | 7 | Eclipse | Java | Antarctica | 0 | Twitter | Windows | SWE 2 | 33.1
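
The access rule described above can be sketched in a few lines of Python. This is only an illustration, not the article's implementation: the function name `is_granted_access` and the constant names are assumptions introduced here, and the only fact taken from the text is that a predicted score of 7 to 10 grants access.

```python
# Hypothetical helper: the name and signature are illustrative only.
ACCESS_MIN_SCORE = 7.0   # lower bound of the accepted range (from the article)
ACCESS_MAX_SCORE = 10.0  # top of the 0-10 scoring scale (from the article)


def is_granted_access(predicted_score: float) -> bool:
    """Return True when a predicted score falls in the accepted 7-10 range."""
    return ACCESS_MIN_SCORE <= predicted_score <= ACCESS_MAX_SCORE


if __name__ == "__main__":
    for score in (8.2, 6.9, 7.0):
        print(score, is_granted_access(score))
```

Because access hinges on a single cut-off, small prediction errors near 7 can flip the decision, which is one reason the error and fairness analysis that follows matters.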

 

Essentially, this model is allocating opportunity across different developers. So, we should take a closer look at this model to identify what kind of errors it’s making, diagnose what is causing those errors, and use those insights to improve the model. After uncovering those evaluation insights on our model, we can share them via the Responsible AI scorecard with other stakeholders who also want to ensure the app’s transparency and robustness and build trust with our end users.

 

The Responsible AI dashboard can be generated via a code-first CLI v2 and SDK v2 experience or a no-code method via Azure Machine Learning’s studio UI.

 

Generating a Responsible AI dashboard 

 

Using Python with the Azure ML SDKv2 

An Azure ML training pipeline job can be configured and executed remotely from a Python notebook using the Azure ML SDKv2. Once you train and register your model, you can create a Responsible AI dashboard by selecting the components you would like to activate in the dashboard, specifying the inputs and outputs of each component, and creating a component job for each of them. The components available are:

  • Fetching the model: this will get the registered model from Azure ML that you are generating the Responsible AI dashboard for
  • An initial constructor: this will hold all the other components such as explanations, error analysis, etc.
  • Adding an explanation: This also provides our data exploration and model overview in the Responsible AI dashboard.
  • Adding causal analysis: we’re interested in using the historic data to uncover the causal effect of the number of GitHub repos programmers contributed to and years of experience on their score.
  • Adding counterfactual analysis: we want to generate 10 counterfactual examples per datapoint, each changing the features so that the predicted score falls in the desired range of 7 to 10.
  • Adding error analysis: we can optionally specify generating a heat map for error distributions across the features of style and employer.
  • Finally, a ‘gather’ component: this will assemble all our Responsible AI insights into the dashboard.

 

 

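# Assumes the component objects (fetch_model_component, rai_constructor_component, and so on)
# and the variables expected_model_id, categorical_columns, treatment_features,
# desired_range, and filter_columns were defined earlier in the notebook.
def rai_programmer_regression_pipeline(
        target_column_name,
        train_data,
        test_data,
    ):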
        # Fetch the model
        fetch_job = fetch_model_component(
            model_id=expected_model_id
        )
        fetch_job.set_limits(timeout=120)
        
        # Initiate the RAIInsights
        create_rai_job = rai_constructor_component(
            title="RAI Dashboard Example",
            task_type="regression",
            model_info_path=fetch_job.outputs.model_info_output_path,
            train_dataset=train_data,
            test_dataset=test_data,
            target_column_name=target_column_name,
            categorical_column_names=categorical_columns
        )
        create_rai_job.set_limits(timeout=120)
        
        # Add an explanation
        explain_job = rai_explanation_component(
            comment="Explanation for the programmers dataset",
            rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        )
        explain_job.set_limits(timeout=120)
        
        # Add causal analysis
        causal_job = rai_causal_component(
            rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
            treatment_features=treatment_features,
        )
        causal_job.set_limits(timeout=180)
        
        # Add counterfactual analysis
        counterfactual_job = rai_counterfactual_component(
            rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
            total_cfs=10,
            desired_range=desired_range
        )
        counterfactual_job.set_limits(timeout=600)
        
        # Add error analysis
        erroranalysis_job = rai_erroranalysis_component(
            rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
            filter_features=filter_columns
        )
        erroranalysis_job.set_limits(timeout=120)
        
        # Combine everything
        rai_gather_job = rai_gather_component(
            constructor=create_rai_job.outputs.rai_insights_dashboard,
            insight_1=explain_job.outputs.explanation,
            insight_2=causal_job.outputs.causal,
            insight_3=counterfactual_job.outputs.counterfactual,
            insight_4=erroranalysis_job.outputs.error_analysis,
        )
        rai_gather_job.set_limits(timeout=120)

        rai_gather_job.outputs.dashboard.mode = "upload"
        rai_gather_job.outputs.ux_json.mode = "upload"

 

 

With our components defined, we can assemble our pipeline job and submit it to Azure ML.

 

Using YAML with the Azure ML CLIv2 

Alternatively, we can define this job in a YAML file to automate creating the Responsible AI dashboard in your MLOps workflow via the Azure ML CLIv2 experience. We can specify all the jobs that we want to kick off (training the model, registering the model, and then creating the Responsible AI dashboard) in a YAML file and then execute the job with a single line from the CLI.

 

 

jobs:
  fetch_model_job:
    type: command
    component: azureml:fetch_registered_model:{version_string}
    limits:
      timeout: 60
    inputs:
      model_id: {expected_model_id}

  create_rai_job:
    type: command
    component: azureml:rai_insights_constructor:{version_string}
    inputs:
      title: RAI Programmer Analysis from YAML
      task_type: regression
      model_info_path: ${{parent.jobs.fetch_model_job.outputs.model_info_output_path}}
      train_dataset: ${{parent.inputs.my_training_data}}
      test_dataset: ${{parent.inputs.my_test_data}}
      target_column_name: ${{parent.inputs.target_column_name}}
      categorical_column_names: '["location", "job title", "OS", "Employer", "IDE", "Programming language", "style"]'
      
  explain_01:
    type: command
    component: azureml:rai_insights_explanation:{version_string}
    inputs:
      comment: Explanation from YAML for RAI Programmer example
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}

  causal_01:
    type: command
    component: azureml:rai_insights_causal:{version_string}
    inputs:
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      treatment_features: '["Number of github repos contributed to", "YOE"]'

  counterfactual_01:
    type: command
    component: azureml:rai_insights_counterfactual:{version_string}
    inputs:
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      total_CFs: 10
      desired_range: '[7, 10]'

  error_analysis_01:
    type: command
    component: azureml:rai_insights_erroranalysis:{version_string}
    inputs:
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      filter_features: '["style", "Employer"]'

  gather_01:
    type: command
    component: azureml:rai_insights_gather:{version_string}
    inputs:
      constructor: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      insight_1: ${{parent.jobs.causal_01.outputs.causal}}
      insight_2: ${{parent.jobs.counterfactual_01.outputs.counterfactual}}
      insight_3: ${{parent.jobs.error_analysis_01.outputs.error_analysis}}
      insight_4: ${{parent.jobs.explain_01.outputs.explanation}}

 

 

Read more about how to create the Responsible AI dashboard with Python and YAML in SDKv2/CLIv2.

 

Using no-code guided UI wizard in AzureML studio 

Finally, we can create this job without leaving the AzureML studio at all with a no-code wizard experience. If we go to our list of registered models, we first select the model we want to generate Responsible AI insights for and click the "Create Responsible AI insights > Create dashboard" button.

 


First, pick the training and test datasets that were used to train and test your model.

 


 

For this scenario we will be choosing regression to match our model.

 


 

For the Responsible AI dashboard components we’re interested in, we can choose either the debugging profile or real-life interventions profile.

 


 

We’ll move forward with model debugging and customize the dashboard to include error analysis, counterfactual analysis, and model explanation. For error analysis, we can choose up to two features to pre-generate an error heat map for. For counterfactual analysis, we’re interested in seeing a diverse set of examples (let’s say 10 examples for each datapoint) in which features are automatically perturbed just enough for the datapoint to receive a score of 7 to 10. We can even control which features are perturbed if we don’t want certain features to be changed.

 


Once that all looks good, we can move on to the final step to configure our experiment. We can name our job that will generate our Responsible AI dashboard, and either select an existing experiment to kick off the job in or create a new one. We’ll create a new one with the necessary resources and hit ‘Create’ and kick off the job. 


With that we can jump into the AzureML studio to see if the job has been successfully completed and we can see the resulting Responsible AI dashboard for our model showing up.

 

Read more about how to create the Responsible AI dashboard with no-code UI wizard in Azure Machine Learning studio.

 

Viewing the Responsible AI dashboard 

The Responsible AI dashboard is a dynamic and interactive interface to investigate your model and data built on a host of open-sourced state-of-the-art technology. You can view your dashboard(s) by navigating to the registered model you have generated a Responsible AI dashboard for. Clicking on the Responsible AI tab will take you to your dashboards.

 


 

 

You can connect the dashboard to a compute resource in your workspace to access all of its features, such as retraining error trees, recalculating probabilities, and generating insights in real time.

 


 

The different components of the Responsible AI dashboard are designed such that they can easily communicate with each other. You can create cohorts of your data to slice and dice your analysis and interactively pass cohorts and insights from one component to another for deep-dive investigations. You can hide the different components you’ve generated for the dashboard in the “dashboard configuration” or add them back by clicking the blue “plus” icon.

 

We first look at our error tree, which tells us where most of our errors lie. It seems that our model made the greatest number of errors for programmers living in Antarctica who don’t program in C, PHP, or Swift and don’t contribute that often to GitHub repos. We can easily save this as a new cohort to investigate later; in the meantime, it will show up as a “Temporary cohort” in the subsequent components.

 


When looking at our model overview, we can get a high-level view of the model prediction distribution to help build intuition for the next steps in model debugging. We can use the data explorer to see if feature distribution in our dataset is skewed. This can cause a model to incorrectly predict datapoints belonging to an underrepresented group or to be optimized along an inappropriate metric. If we bin our x-axis to be the ground truth of different scores a programmer can get (where 7-10 is the accepted range) and look at the style, we see that there is a highly skewed distribution of programmers who use tabs being scored lower and programmers who use spaces being scored higher.
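
The kind of skew described above can also be checked directly on tabular data. The sketch below is only an illustration that reuses the style and score values from the five sample rows shown earlier (far too few to draw conclusions from). The dashboard's data explorer does this interactively, and the helper name `mean_score_by` is an assumption introduced here.

```python
from collections import defaultdict
from statistics import mean

# Style and score values copied from the five sample rows shown earlier.
SAMPLE_ROWS = [
    {"style": "spaces", "score": 8},
    {"style": "tabs", "score": 3},
    {"style": "tabs", "score": 1},
    {"style": "spaces", "score": 6},
    {"style": "tabs", "score": 5},
]


def mean_score_by(rows, column):
    """Group rows by `column` and return the mean ground-truth score per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["score"])
    return {value: mean(scores) for value, scores in groups.items()}


if __name__ == "__main__":
    print(mean_score_by(SAMPLE_ROWS, "style"))
```

A large gap between the group means is a prompt to look closer, not proof of bias on its own; the dashboard charts show the full distribution rather than a single summary number.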

 


Additionally, since we know our model made the most errors for those living in Antarctica, we investigate location and see a highly skewed distribution in which programmers living in Antarctica were scored lower. This means that our model will unfairly favor those who use spaces and do not live in Antarctica when providing access to the application we built.

 


 

Moving down to aggregate feature importance, we can see which features were most important to the model’s predictions overall: style (tabs or spaces) is by far the most influential, followed by operating system and then programming language. If we click into style, we can see that ‘spaces’ has a positive feature importance and ‘tabs’ has a negative one, showing us that using ‘spaces’ is what contributes to a higher score.


 

We can also look at two specific programmers, one with a high score and one with a low score. Row 35 has a high score and uses spaces, and row 2 has a low score and uses tabs. When we look at the individual feature importance of each programmer’s features, we can see that ‘spaces’ contributed positively to Row 35’s high score, while ‘tabs’ contributed negatively, pulling Row 2’s score lower.


 

We can take a deeper look with counterfactual what-if examples. When selecting someone below the 7 to 10 range prediction, we can see what bare minimum changes could happen to their features to lead to much higher predictions. In this programmer’s case, some recommended changes would be switching their style to spaces.
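
To make the idea of a counterfactual concrete, here is a deliberately simplified sketch. It is not the algorithm the dashboard uses: `toy_predict` is a made-up stand-in for the trained regressor, its weights are invented for illustration, and the search only tries one feature change at a time from a hypothetical list of candidate values.

```python
# Toy stand-in for the trained regressor; the weights are made up for illustration.
def toy_predict(programmer: dict) -> float:
    score = 4.0
    score += 3.5 if programmer["style"] == "spaces" else 0.0
    score += 0.1 * programmer["repos"]
    return score


def single_feature_counterfactuals(programmer, candidates, desired_range=(7, 10)):
    """Return every one-feature change that moves the prediction into desired_range."""
    low, high = desired_range
    found = []
    for feature, values in candidates.items():
        for value in values:
            if value == programmer[feature]:
                continue  # same value as before, so not a change
            changed = {**programmer, feature: value}
            prediction = toy_predict(changed)
            if low <= prediction <= high:
                found.append((feature, value, prediction))
    return found


if __name__ == "__main__":
    person = {"style": "tabs", "repos": 2}  # hypothetical datapoint
    options = {"style": ["tabs", "spaces"], "repos": [2, 5, 10]}
    for feature, value, prediction in single_feature_counterfactuals(person, options):
        print(f"set {feature}={value!r} -> predicted score {prediction:.1f}")
```

With these invented weights, only the switch to spaces moves the prediction into the 7 to 10 range, which mirrors the recommendation described above. Restricting which keys appear in `options` corresponds to controlling which features may be perturbed.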

 


 

Finally, if we want to use purely historic data to identify the features that have the most direct effect on our outcome of interest (in this case, the score), we can use causal analysis. In our case, we want to understand the causal effect of years of experience and the number of GitHub repos a programmer has contributed to on the score. The aggregate causal effects show that, on average across the whole dataset, increasing the number of GitHub repos by 1 increases the score by 0.095, whereas increasing the years of experience by 1 barely changes the score.
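
As a worked example of how to read the aggregate causal effect: it is an average per-unit change, so to a first approximation it scales with the size of the intervention. The sketch below assumes the effect stays constant over the range considered, which the article does not state, and the function name `expected_score_change` is hypothetical.

```python
# Average causal effect of one additional GitHub repo on the score (from the article).
EFFECT_PER_REPO = 0.095


def expected_score_change(extra_repos: int, effect_per_unit: float = EFFECT_PER_REPO) -> float:
    """Approximate average score change for `extra_repos` additional repos.

    Assumes the per-unit effect stays constant (a linear approximation).
    """
    return effect_per_unit * extra_repos


if __name__ == "__main__":
    # Ten additional repos shift the score by a bit under one point on average.
    print(round(expected_score_change(10), 3))
```

On a 0 to 10 scale with an access cut-off at 7, an average shift of this size would only matter for programmers already close to the threshold.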

 


 

However, if we look at individual programmers and perturb those values to see the outcome of specific treatments to years of experience, we can see that for some programmers, increasing the years of experience does cause the score to increase a bit.

 


 

Additionally, the treatment policy tab can help us decide what overall treatment policy to take to maximize real-world impact on our score.  We can see the best future interventions to apply to certain segmentations of our programmer population to see the biggest boost in the scores overall.

 


 

And if you can only focus on 10 programmers to reach out to, you can see a ranked list of top k programmers who would gain the most from either increasing or decreasing the number of GitHub repos.

 


Read the UI overview of how to use the different charts and visualizations of the Responsible AI dashboard.

 

Share insights with Responsible AI scorecard 

Azure Machine Learning’s Responsible AI dashboard is designed for machine learning professionals and data scientists to explore and evaluate model insights and inform their data-driven decisions. While it can help you implement Responsible AI practically in your machine learning lifecycle, there often exists a gap between technical and non-technical stakeholders. Across the industry, there are no tools that ensure appropriate alignment between the two and help technical experts get timely feedback and direction from non-technical stakeholders.

 

As part of Azure Machine Learning’s model registry infrastructure, and to accompany machine learning models and their corresponding Responsible AI dashboards, we introduce the Responsible AI scorecard: a customizable report that you can easily configure, download, and share with your technical and non-technical stakeholders to educate them about your data and model health and compliance, and to build trust. This scorecard could also be used in audit reviews to inform the stakeholders about the characteristics of your model.

 

Generating a scorecard with Azure ML SDKv2 or CLIv2

Like other Responsible AI dashboard components configured in the YAML pipeline, you can add a component to generate the scorecard in the YAML pipeline: 

 

 

RAIscorecard: 
   type: command 
   component: azureml:rai_score_card@latest 
   inputs: 
     dashboard: ${{parent.jobs.gather_01.outputs.dashboard}} 
     pdf_generation_config: 
       type: uri_file 
       path: ./pdf_gen.json
       mode: download 
     predefined_cohorts_json: 
       type: uri_file 
       path: ./cohorts.json 
       mode: download 

 

 

Here, pdf_gen.json is the scorecard generation configuration file and cohorts.json is the prebuilt cohort definition file.

 

Sample JSON for cohorts definition:

 

 

[
  {
    "name": "High Yoe",
    "cohort_filter_list": [
      {
        "method": "greater",
        "arg": [
          5
        ],
        "column": "YOE"
      }
    ]
  },
  {
    "name": "Low Yoe",
    "cohort_filter_list": [
      {
        "method": "less",
        "arg": [
          6.5
        ],
        "column": "YOE"
      }
    ]
  }
]
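
To show how cohort definitions of this shape can be read, here is a small sketch that applies them to a few rows. It is an illustration only: it handles just the two methods that appear in the sample ("greater" and "less"), it assumes that all filters in a `cohort_filter_list` must hold at once, and the function names and the example rows are hypothetical.

```python
import json
import operator

# Only the two methods used in the sample are mapped; a real cohort file may
# use other methods, which this sketch does not handle.
METHODS = {"greater": operator.gt, "less": operator.lt}

COHORTS_JSON = """
[
  {"name": "High Yoe",
   "cohort_filter_list": [{"method": "greater", "arg": [5], "column": "YOE"}]},
  {"name": "Low Yoe",
   "cohort_filter_list": [{"method": "less", "arg": [6.5], "column": "YOE"}]}
]
"""


def in_cohort(row: dict, cohort: dict) -> bool:
    """A row belongs to a cohort when it passes every filter in the list (assumed AND)."""
    return all(
        METHODS[f["method"]](row[f["column"]], f["arg"][0])
        for f in cohort["cohort_filter_list"]
    )


def assign_cohorts(rows, cohorts):
    """Map each cohort name to the rows that satisfy its filters."""
    return {c["name"]: [r for r in rows if in_cohort(r, c)] for c in cohorts}


if __name__ == "__main__":
    rows = [{"YOE": 3}, {"YOE": 6}, {"YOE": 16}]  # hypothetical datapoints
    for name, members in assign_cohorts(rows, json.loads(COHORTS_JSON)).items():
        print(name, [r["YOE"] for r in members])
```

Note that with the sample thresholds (greater than 5, less than 6.5) the two cohorts overlap, so a programmer with 6 years of experience lands in both.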

 

 

Sample JSON of Responsible AI scorecard generation configuration: (The configuration stage requires you to use your domain expertise around the problem to set your desired target values on model performance and fairness metrics.)

 

 

{
  "Model": {
    "ModelName": "GPT2 Access",
    "ModelType": "Regression",
    "ModelSummary": "This is a regression model to analyze how likely a programmer is to be given access to GPT-2"
  },
  "Metrics": {
    "mean_absolute_error": {
      "threshold": "<=2"
    },
    "mean_squared_error": {}
  },
  "FeatureImportance": {
    "top_n": 6
  },
  "DataExplorer": {
    "features": [
      "YOE",
      "age"
    ]
  },
  "Cohorts": [
    "High Yoe",
    "Low Yoe"
  ]
}
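
The `"threshold": "<=2"` entry sets a target value for mean absolute error. As a rough illustration of what checking a metric against such a target involves, the sketch below computes the metric in plain Python and compares it with a threshold string. The parsing rules, the function names, and the predictions are assumptions introduced here; the article does not describe how the scorecard evaluates thresholds internally.

```python
import operator
import re

# Comparison operators accepted by this sketch; the full set the scorecard
# supports is not listed in the article.
_OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def mean_absolute_error(y_true, y_pred):
    """Plain-Python MAE: the average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def meets_threshold(value: float, threshold: str) -> bool:
    """Evaluate a threshold string such as "<=2" against a metric value."""
    match = re.fullmatch(r"\s*(<=|>=|<|>)\s*([0-9.]+)\s*", threshold)
    if match is None:
        raise ValueError(f"unsupported threshold: {threshold!r}")
    op, target = match.groups()
    return _OPS[op](value, float(target))


if __name__ == "__main__":
    y_true = [8, 3, 1, 6, 5]            # scores from the sample rows
    y_pred = [7.0, 4.5, 2.0, 6.5, 3.0]  # hypothetical predictions
    mae = mean_absolute_error(y_true, y_pred)
    print(mae, meets_threshold(mae, "<=2"))
```

Choosing the target itself (here, an average error of at most 2 points on a 10-point scale) is the part that requires domain expertise.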

 

 

 

Create your Responsible AI scorecard with UI

In addition to creating the scorecard via SDK or CLI, you can also create a scorecard for any Responsible AI dashboard you generate by clicking on Create Responsible AI insights > Generate new PDF scorecard which will open up a panel for you to walk through the same steps via a UI wizard but without any code.

 


 

Download your Responsible AI scorecard

Let’s look at the one generated for the model we just looked at. Responsible AI scorecards are linked to your Responsible AI dashboards. To view your Responsible AI scorecard, go into your model registry and select the registered model you have generated a Responsible AI dashboard for. Once you click on your model, click on the Responsible AI tab to view a list of generated dashboards. Select which dashboard you’d like to export a Responsible AI scorecard PDF for by clicking Create Responsible AI insights > View all PDF scorecards.

 


 

 

Select which scorecard you’d like to download from the list and click Download to download the PDF to your machine.

 


 

 

How to read your Responsible AI scorecard?

The Responsible AI scorecard is a PDF summary of your key insights from the Responsible AI dashboard. The first summary segment of the scorecard gives you an overview of the machine learning model and the key target values you have set to help all stakeholders determine if your model is ready to be deployed.


The data explorer segment shows you characteristics of your data, as any model story is incomplete without the right understanding of data.


The model performance segment displays your model’s most important metrics and characteristics of your predictions and how well they satisfy your desired target values.


Next, you can also view the top performing and worst performing data cohorts and subgroups that are automatically extracted for you to see the blind spots of your model.


Then you can see the most important factors impacting your model predictions, which is a requirement for building trust in how your model performs its task. You can further see your model fairness insights summarized and inspect how well your model satisfies the fairness target values you set for your desired sensitive groups.


Finally, you can observe your dataset’s causal insights summarized, figuring out whether your identified factors/treatments have any causal effect on the real-world outcome.


Responsible AI dashboard and scorecard in action

With more than six million people in England waiting for treatment on the National Health Service (NHS), a team of medical professionals at one of the largest NHS trusts in the country is exploring how AI could help reduce waiting times, support recommendations from healthcare teams, and provide patients with better information so they can make more informed decisions about their own care. Orthopedic surgeons Justin Green and Mike Reed at Northumbria Healthcare NHS Foundation Trust have developed an AI model that helps consultants give their patients a personalized risk assessment of upcoming hip or knee operations. That is reassuring people at one of the most stressful and worrying times of their lives.

 

Since the AI model is hosted in Microsoft’s Azure cloud, the team has used the Responsible AI dashboard in Azure Machine Learning to ensure medical professionals are given a clearer understanding of why the AI has reached its conclusions. That is critical in the ultra-cautious healthcare sector. Consultants can now see how the model works and have confidence that the advice they give to patients is based on accurate and reliable data.

 

Conclusion

To conclude, we are announcing the general availability on November 10, 2022 of the Responsible AI dashboard in Azure Machine Learning, which can be generated through the CLI, the SDK, or a no-code UI experience. Also in public preview is the Responsible AI scorecard, a PDF report you can extract and share with summaries of key data characteristics and model performance and fairness insights. The dashboard supports ML professionals, empowering them to easily debug and improve their machine learning models, while the scorecard helps technical and non-technical audiences alike understand the impact of applying Responsible AI, so that all stakeholders can participate in compliance reviews.

 

Next steps

Learn more about the RAI dashboard and scorecard in the Azure Machine Learning docs and generate them today to boost justified trust and appropriate reliance in your AI-driven processes.

  • Learn more about Responsible AI in Azure Machine Learning
  • Learn more about how to generate and use the Responsible AI dashboard and scorecard
  • View the sample notebook of this example scenario on creating the Responsible AI dashboard via SDKv2/CLIv2
  • Read the customer story about how the Responsible AI dashboard was used by Northumbria Healthcare NHS Foundation Trust
  • Watch the AI Lab video on how to use the Responsible AI dashboard

 

Acknowledgements

 


In the past year, our teams across the globe have joined forces to release the very first one-stop-shop dashboard for easy implementation of responsible AI in practice, making these efforts available to the community as open source and as part of the Azure Machine Learning ecosystem. We acknowledge their great efforts and are excited to see how you use this tool in your AI lifecycle.

 

Azure Machine Learning:

  • Responsible AI development team: Steve Sweetman, Lan Tang, Ke Xu, Roman Lutz, Richard Edgar, Ilya Matiach, Gaurav Gupta, Kin Chan, Vinutha Karanth, Tong Yu, Ruby Zhu
  • AI marketing team: Thuy Ngyuen, Melinda Hu, Trinh Duong
  • Additional thanks to Seth Juarez, Christian Gero, Manasa Ramalinga, Vijay Aski, Anup Shirgaonkar, Benny Eisman for their contributions to this launch!

Microsoft Research:

Big thanks to everyone who made this possible!

 

Last update:
‎Oct 13 2022 03:29 PM