Startups at Microsoft


Optimizing ML Models in Production in The Cloud or At the Edge Using A/B Testing

Microsoft

Jan 29, 2024

In this second blog post in our series, guest blogger Martin Bald, Senior Manager, Developer Community at Wallaroo.AI (one of our startup partners), dives into A/B testing, a method for validating ML models in production.

Introduction

Testing and experimentation are a critical part of the ML lifecycle. This is because the ability to quickly experiment and test new models in the real world helps data scientists to continually learn, innovate, and improve AI-driven decision processes.

In the first blog post in this series, we stepped through deploying a packaged ML model to an edge device in a retail computer vision (CV) use case. However, in the MLOps lifecycle, just deploying the model to production is not enough. In many situations, it's important to vet a model's performance in the real world before fully activating it. Real-world vetting can surface issues that did not arise during the development stage, when models are only checked against hold-out data.

A/B Testing

An A/B test, also called a controlled experiment or a randomized controlled trial, is a statistical method for determining which of a set of variants performs best. A/B tests allow organizations and policy-makers to make smarter, data-driven decisions that are less dependent on guesswork.

In the simplest version of an A/B test, subjects are randomly assigned to either the control group (group A) or the treatment group (group B). Subjects in the treatment group receive the treatment (such as a new medicine, a special offer, or a new web page design) while the control group proceeds as normal without the treatment. Data is then collected on the outcomes and used to study the effects of the treatment.
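To illustrate how the collected outcomes might be studied, here is a minimal sketch of one common analysis, a two-proportion z-test on a rate such as conversion rate. The function name and the counts are hypothetical and chosen only for illustration; the choice of test is our assumption, and other tests may suit other outcome measures better.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates B - A."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 200 of 2000 conversions in group A, 260 of 2000 in group B.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be due to chance alone, provided the sample size was fixed in advance.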

In data science, A/B tests are often used to choose between two or more candidate models in production, by measuring which model performs best in the real world. In this formulation, the control is often an existing model that is currently in production, sometimes called the champion. The treatment is a new model being considered to replace the old one. This new model is sometimes called the challenger. In our discussion, we'll use the terms champion and challenger, rather than control and treatment.

Keep in mind that in machine learning, the terms experiments and trials also often refer to the process of finding a training configuration that works best for the problem at hand (this is sometimes called hyperparameter optimization). In this post, we will use the term experiment to refer to the use of A/B tests to compare the performance of different models in production.

How to Design an A/B Test

A/B tests are a useful way to rely less on opinions and intuition and to be more data-driven in decision making, but there are a few principles to keep in mind. The experimenter has to decide on a number of things.

First, decide what you are trying to measure. We'll call this the Overall Evaluation Criterion, or OEC. This may be different from, and more business-focused than, the loss function used while training the models, but it must be something you can measure. Common examples are revenue, click-through rate, conversion rate, or process completion rate.

Second, decide how much better is "better". You might want to just say "Success is when the challenger is better than the champion," but that's actually not a testable question, at least not in the statistical sense. You have to decide how much better the challenger has to be.
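One way to make "how much better" concrete is to fix the smallest improvement worth detecting and then estimate how many subjects each group needs. The sketch below uses the standard normal-approximation formula for comparing two rates with equal-sized groups; the function name, the example rates, and the defaults of 5% significance and 80% power are our assumptions, not values from this post. An unequal split such as 2:1 would need an adjusted formula.

```python
import math
from statistics import NormalDist

def samples_per_group(p_champion, p_challenger, alpha=0.05, power=0.8):
    """Approximate subjects needed per group for a two-sided two-proportion z-test."""
    if p_champion == p_challenger:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_champion * (1 - p_champion) + p_challenger * (1 - p_challenger)
    effect = p_challenger - p_champion
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical goal: detect a lift in conversion rate from 10% to 12%.
n_small_lift = samples_per_group(0.10, 0.12)
# A larger lift (10% to 15%) is easier to detect and needs fewer subjects.
n_large_lift = samples_per_group(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

The smaller the improvement you want to detect, the more subjects the test needs, which is why the threshold has to be chosen before the experiment starts.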

Some Practical Considerations for Setting Up an A/B Test

Splitting your subjects: When splitting your subjects up randomly between models, make sure the process is truly random, and think through any interference between the two groups. Do they communicate or influence each other in some way? Does the randomization method cause an unintended bias? Any bias in group assignments can invalidate the results. Also, make sure the assignment is consistent so that each subject always gets the same treatment. For example, a specific customer should not get different prices every time they reload the pricing page.
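A common way to get assignments that are both effectively random and consistent is to hash a stable subject identifier and map the hash onto weighted buckets. The sketch below is a generic illustration of that idea, not a description of how any particular platform implements its split; `assign_model` and the subject IDs are hypothetical names.

```python
import hashlib

def assign_model(subject_id, weights):
    """Deterministically map a subject ID to a model name using integer weights."""
    total = sum(weight for _, weight in weights)
    if total <= 0 or any(weight < 0 for _, weight in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for name, weight in weights:
        if bucket < weight:
            return name
        bucket -= weight

weights = [("champion", 2), ("challenger", 1)]

# The same subject always lands on the same model.
first = assign_model("customer-1234", weights)
second = assign_model("customer-1234", weights)
print(first, first == second)

# Across many subjects, the share sent to the champion approaches 2/3.
assignments = [assign_model(f"customer-{i}", weights) for i in range(30000)]
champion_share = assignments.count("champion") / len(assignments)
print(round(champion_share, 3))
```

Because the assignment depends only on the identifier, a customer who reloads the page sees the same variant each time.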

Don't Peek!: Due to human nature, it's difficult not to peek at the results early and draw conclusions or stop the experiment before the minimum sample size is reached. Resist the temptation. Sometimes the "wrong" model can get lucky for a while. You want to run a test long enough to be confident that the behavior you see is really representative and not just a weird fluke.
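To see why peeking is risky, the following sketch simulates an A/A test, in which both groups have the same true rate, so any "significant" difference is a false positive. It compares checking once at the end with checking at ten evenly spaced points and stopping at the first significant result. All names and parameters are hypothetical and chosen only for illustration.

```python
import math
import random

def z_stat(successes_a, successes_b, n):
    """Two-proportion z statistic for two groups of equal size n."""
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (successes_a - successes_b) / n / se

def false_positive_rates(trials=400, n=1000, looks=10, rate=0.1, seed=0):
    """Return (rate with peeking, rate with one final look) for an A/A test."""
    rng = random.Random(seed)
    checkpoints = {n * k // looks for k in range(1, looks + 1)}
    peeking_hits = fixed_hits = 0
    for _ in range(trials):
        a = b = 0
        any_significant = final_significant = False
        for i in range(1, n + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            if i in checkpoints:
                significant = abs(z_stat(a, b, i)) > 1.96
                any_significant = any_significant or significant
                if i == n:
                    final_significant = significant
        peeking_hits += any_significant
        fixed_hits += final_significant
    return peeking_hits / trials, fixed_hits / trials

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking: {peeking_rate:.2%}")
print(f"false positives with a single final look: {fixed_rate:.2%}")
```

Since the final look is one of the checkpoints, the peeking rate can never be lower than the single-look rate, and in practice it is usually noticeably higher.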

Let's go through an A/B test in action. If Shadow Deployment is your preference, you can skip to the blog post on Shadow Deployment. We will set up the test for a 2:1 weighted split of the data between the champion and the challenger models.

Our first step is to define two helper functions that look up models and pipelines by name in the current workspace:

import wallaroo
from wallaroo.framework import Framework

# Connect to the Wallaroo instance; `wl` is the client used throughout this post.
wl = wallaroo.Client()

# Get the most recent version of a model.
# Assumes that the most recent version is the first in the list of versions.
# The model list is fetched when the function is called, not when it is defined,
# so models uploaded later in the notebook are found as well.
def get_model(mname, modellist=None):
    if modellist is None:
        modellist = wl.get_current_workspace().models()
    model = [m.versions()[0] for m in modellist if m.name() == mname]
    if not model:
        raise KeyError(f"model {mname} not found in this workspace")
    return model[0]

# Get a pipeline by name in the current workspace.
def get_pipeline(pname, plist=None):
    if plist is None:
        plist = wl.get_current_workspace().pipelines()
    pipeline = [p for p in plist if p.name() == pname]
    if not pipeline:
        raise KeyError(f"pipeline {pname} not found in this workspace")
    return pipeline[0]

Next, we retrieve the existing pipeline and upload the challenger model:

pipeline = get_pipeline('tutorialpipeline-jch')

challenger_model = wl.upload_model('challenger-model', './models/rf_model.onnx', framework=Framework.ONNX)

Creating the A/B test deployment then takes two steps. The first step is to retrieve handles to the champion (control) and challenger models:

# retrieve handles to the most recent versions 
# of the champion and challenger models

control_model = get_model('tutorial-model')

challenger_model = get_model('challenger-model')

The second step is to retrieve the pipeline created in the previous notebook, undeploy it, and redeploy it with the A/B testing split step, as seen in the code below.

# get the existing single-step pipeline and undeploy it
pipeline = get_pipeline('tutorialpipeline-jch')
pipeline.undeploy()

# clear the pipeline and add a 2:1 random split between
# the champion (control_model) and the challenger
pipeline.clear()
pipeline.add_random_split([(2, control_model), (1, challenger_model)])
pipeline.deploy()

In our example, a pipeline will be built with a 2:1 weighted ratio between the champion and a single challenger model. The random split will randomly send inference data to one model based on the weighted ratio. As more inferences are performed, the ratio between the champion and challengers will align more and more to the ratio specified.
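The convergence described above can be imitated with a small standalone simulation. This sketch uses Python's `random.choices` to mimic weighted routing; it is not the platform's implementation, and `simulate_split` is a hypothetical helper.

```python
import random
from collections import Counter

def simulate_split(weights, n_requests, seed=0):
    """Route n_requests to model names at random, in proportion to the weights."""
    rng = random.Random(seed)
    names = [name for name, _ in weights]
    model_weights = [weight for _, weight in weights]
    return Counter(rng.choices(names, weights=model_weights, k=n_requests))

weights = [("champion", 2), ("challenger", 1)]
for n in (20, 200, 20000):
    counts = simulate_split(weights, n)
    print(n, round(counts["champion"] / n, 3))

large_counts = simulate_split(weights, 100000)
large_share = large_counts["champion"] / 100000
```

With only a handful of requests the observed share can sit well away from 2/3, while with many requests it settles close to it.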

To apply (or re-apply) the split, clear the pipeline's existing steps, add the 2:1 random split between the champion (control_model) and a single challenger (challenger_model), and deploy:

pipeline.clear()
pipeline.add_random_split([(2, control_model), (1, challenger_model)])
pipeline.deploy()

Note: You can also add more than one challenger model. The example below distributes data among the champion and two challengers in the ratio 2:1:1, that is, half to the champion and a quarter to each challenger. Here champion, challenger01, and challenger02 are placeholders for model handles.

pipeline.add_random_split([(2, champion), (1, challenger01), (1, challenger02)])

The pipeline's steps() method displays the steps currently in the pipeline. Use it to verify that the random split was added as expected.

pipeline.steps()

[Output]

[{'RandomSplit': {'hash_key': None, 'weights': [{'model': {'name': 'tutorial-model', 'version': '44f9e250-7636-4800-be08-da624b51d057', 'sha': 'ed6065a79d841f7e96307bb20d5ef22840f15da0b587efb51425c7ad60589d6a'}, 'weight': 2}, {'model': {'name': 'challenger-model', 'version': 'bd69c37d-8e8d-4cfa-8cf7-6f47a411c893', 'sha': 'e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6'}, 'weight': 1}]}}]

Now we are ready to send some queries to our A/B test deployment. The first step is to test our setup by sending a single datum to the A/B test pipeline we created.

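import pandas as pd

# get_singleton and get_names are helper functions that are not shown in this post:
# get_singleton extracts one row of the data as an inference input, and
# get_names reports which model produced each inference result
df_from_csv = pd.read_csv('./data/test_data.csv')

singleton = get_singleton(df_from_csv, 0)
display(singleton)

single_result = pipeline.infer(singleton)
display(single_result)
display(get_names(single_result))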

[Output]

Fig 1.

The second step is to send a batch of queries to the pipeline, one at a time. We send 20 here to keep the example short; for a real test, send at least 100.

results = []

# get a list of result frames
for i in range(20):
    query = get_singleton(df_from_csv, i)
    results.append(pipeline.infer(query))

# make one data frame of all results    
allresults = pd.concat(results, ignore_index=True)

# add a column to indicate which model made the inference
allresults['modelname'] = get_names(allresults)

# get the counts of how many inferences were made by each model
allresults.modelname.value_counts()

[Output]

tutorial-model      14
challenger-model     6
Name: modelname, dtype: int64
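As a quick sanity check, we can ask whether 14 of 20 inferences going to the champion is consistent with the configured 2:1 split. The sketch below computes an exact two-sided binomial p-value by hand; the helper name is hypothetical, and this check is our addition rather than part of the original tutorial.

```python
from math import comb

def binomial_two_sided_p(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Observed: 14 of 20 inferences went to the champion; expected share is 2/3.
split_p_value = binomial_two_sided_p(14, 20, 2 / 3)
print(f"p-value for 14 of 20 under a 2:1 split: {split_p_value:.3f}")

# For contrast, all 20 inferences going to the champion would be surprising.
extreme_p_value = binomial_two_sided_p(20, 20, 2 / 3)
print(f"p-value for 20 of 20 under a 2:1 split: {extreme_p_value:.5f}")
```

A large p-value here means the observed counts give no reason to doubt that the split is working as configured; a very small one would be a cue to check the routing before trusting any comparison between the models.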

Conclusion

In this post we have seen A/B testing, a method for validating ML models in production, in action. Because testing and experimentation are a critical part of the ML lifecycle, the ability to quickly test new models in production lets teams make fast, informed decisions and replace the production champion model on the fly, without halting production, when a new model shows better performance. This helps data scientists to continually learn, innovate, and improve AI-driven decision processes.

The next blog post in this series will cover the Shadow Deployment method for model validation for edge or multi-cloud production deployments.

If you want to try the steps in these blog posts, you can access the tutorials at this link and use the free inference servers available on the Azure Marketplace. Or you can download a free Wallaroo.AI Community Edition you can use with GitHub Codespaces.

Updated Jan 29, 2024

Version 1.0

Christopher_Tearpak

Microsoft
