DevOps for Data Science - Part 6 – Continuous Integration

Microsoft

Mar 09, 2021

In the previous post in this series on DevOps for Data Science, I covered the first concept in a DevOps “Maturity Model” – a list of things you can do, in order, that will set you on the path to implementing DevOps in Data Science. The first thing you can do in your projects is to implement Infrastructure as Code (IaC), as I explained in the last article.

The next level of maturity is Continuous Integration. This is something developers have done for quite some time. Many Data Science projects have this element as well, but perhaps without using that term. First, let’s define “classical” Continuous Integration, or CI:

Continuous Integration is the process of merging changes from each developer’s code quickly back into the release candidate code, rather than waiting until all changes are made in all developers’ code.

We can make that a bit clearer with an example. Assume three developers (Jane, Bob, and Sandra) are working on a program that accepts a picture input from a user’s cell phone camera, runs an image recognition algorithm against it, and returns a label to the user – something like “That’s not your cat” or the like.

The main part of the code is created, and “checked in” to a code repository system like git. This is called the “main branch” – it’s where all the work is complete as of a given point in time.

If Jane is using git, she uses a git command to copy the code in the main branch to her local system. She can then create a “branch” of her own, perhaps calling it “JaneBranch”. Any changes she makes there are compared against the main branch on her system.

She then begins to alter the code in JaneBranch to have a better User Interface for the program. At this point, the main branch has not changed. Bob and Sandra have done the same on their systems, working away on anything they want to change, in their own branches.

After Jane makes any changes she likes and tests to make sure they work, she can request that her code be integrated back into the main branch, using something called a “Pull Request”. Once the request is accepted, her changes go to the main branch. Bob and Sandra will do the same after making changes on their systems.

For the most part, this works fine – until you have lots of changes, some of which might conflict. For instance, if Jane alters the code in the User Interface that sends the image to the prediction algorithm, Bob might have a dependency on that call. That would cause a break. And then if Sandra’s code depended on things that both Bob and Jane are doing, that would also cause her code to break. If they all waited to merge to the main branch, all these errors (and probably more) would show up at once, like a pile-up on a freeway.
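To make that kind of break concrete, here is a minimal Python sketch. Every name in it (send_image_v1, send_image_v2, bobs_prediction_step, integrates_cleanly) is hypothetical and invented for illustration; it is not code from a real project. It assumes Jane changes her call to require a second argument while Bob's code still calls it the old way.

```python
# Hypothetical names throughout; this only illustrates the dependency break.

def send_image_v1(image_bytes):
    """Jane's original UI call: passes raw image bytes to the predictor."""
    return {"image": image_bytes}


def send_image_v2(image_bytes, content_type):
    """Jane's revised UI call: now also requires a content type."""
    return {"image": image_bytes, "content_type": content_type}


def bobs_prediction_step(send_image):
    """Bob's code, written against the original one-argument call."""
    payload = send_image(b"\x89PNG")
    return "label for %d bytes" % len(payload["image"])


def integrates_cleanly(send_image):
    """Return True if Bob's code still runs against the given UI call."""
    try:
        bobs_prediction_step(send_image)
        return True
    except TypeError:
        # Raised because Bob's call no longer matches Jane's signature.
        return False


print(integrates_cleanly(send_image_v1))  # True
print(integrates_cleanly(send_image_v2))  # False
```

Neither Jane nor Bob sees the problem while working alone in a branch. It only appears when the two pieces of code are run together, which is exactly what merging does.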

To avoid this issue, as soon as Jane’s code tests well, it should be pulled into the main branch, and Bob and Sandra should pull that new main branch back down to their systems, to test their code against.

As soon as Bob makes a change, he should also "push" that back into the main branch, as should Sandra.

The key is that with small changes being merged back into the main branch, failures show up faster. Extend that to almost every functional change made in any part of the code, and you get Continuous Integration.
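One way to see why small merges help is to count how many changes you have to investigate when something breaks. The sketch below is a toy model, not a real CI system: the change names and the function suspects_when_failure_found are made up for illustration, and it assumes we already know which change is the bad one so we can compare batch sizes.

```python
def suspects_when_failure_found(changes, batch_size):
    """Merge changes in batches and return the names in the first failing batch.

    `changes` is a list of (name, breaks_build) pairs in the order written.
    An empty list means no batch failed.
    """
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        if any(breaks for _, breaks in batch):
            return [name for name, _ in batch]
    return []


# Hypothetical change history for Jane, Bob, and Sandra.
changes = [
    ("jane-ui-layout", False),
    ("bob-resize-image", False),
    ("jane-new-send-call", True),   # the change that breaks the build
    ("sandra-label-text", False),
    ("bob-cache-model", False),
    ("sandra-logging", False),
]

# Everyone waits and merges at once: six suspects to sort through.
print(len(suspects_when_failure_found(changes, batch_size=6)))  # 6

# Each change is merged as soon as it is ready: one suspect.
print(suspects_when_failure_found(changes, batch_size=1))  # ['jane-new-send-call']
```

The failure is the same in both cases. What changes is how much code stands between you and the cause.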

Doing that by hand would of course take a lot of coordination – so systems exist that make that a lot easier and more automatic, like Azure DevOps (formerly Visual Studio Team Services) and other packages.

What does this mean for the Data Science team? Actually, quite a lot.

Depending on the type of algorithm, we have a lot of dependencies on the data we get for training or for the trained model. We expect certain parameters as inputs, and we expect to return a certain parameter or parameters, most of the time strongly typed in both directions. If changes “break” our inputs or outputs, we need to know that as soon as possible. In some cases, the fix can be as dramatic as retraining the original model or even creating a new one using a different algorithm.
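As a rough illustration of what “knowing as soon as possible” could look like, here is a small Python sketch that checks a record against the inputs a model expects. The field names, the types, and the function contract_violations are all assumptions made for this example; your own model's contract will look different.

```python
# Hypothetical contract for the image-labeling model. The field names and
# types are assumptions for illustration only.
EXPECTED_INPUT = {"image": bytes, "width": int, "height": int}
EXPECTED_OUTPUT = {"label": str, "confidence": float}


def contract_violations(record, expected):
    """Return a list of readable problems; an empty list means the record fits."""
    problems = []
    for name, expected_type in expected.items():
        if name not in record:
            problems.append("missing field: " + name)
        elif not isinstance(record[name], expected_type):
            problems.append(
                "%s should be %s, got %s"
                % (name, expected_type.__name__, type(record[name]).__name__)
            )
    for name in record:
        if name not in expected:
            problems.append("unexpected field: " + name)
    return problems


good = {"image": b"\x89PNG", "width": 640, "height": 480}
bad = {"image": b"\x89PNG", "width": "640"}  # wrong type, and height is gone

print(contract_violations(good, EXPECTED_INPUT))  # []
print(contract_violations(bad, EXPECTED_INPUT))
# ['width should be int, got str', 'missing field: height']
```

A check like this can run every time someone else's change is merged, so a change that alters what the model receives or returns gets noticed at merge time rather than in production.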

I’ve only intimated the main idea here – testing. It’s quite difficult to have Continuous Integration without a test happening automatically (Automated Testing) – but it can be done. In practice most development shops will put Automated Testing together with Continuous Integration, but in a Data Science algorithm, it’s a bit tricky to create an automated test. I’ll cover that in the next installment of this series, but for now, try to get the Data Science coding process integrated into the organization’s code control system.

Perhaps you’ve already done that – congratulations! If not, take some time, learn the system your organization uses, and get your R, Python or whatever other Data Science code you have checked in along with everyone else’s. Tell the team that runs the code control system that for now, you need a “Manual Test” step inserted after your code integrates. They won’t be too happy about that, but it is better than not testing at all. I’ll explain how to include as much automated testing as possible in our next article.

See you in the next installment of the DevOps for Data Science series.

For Data Science, I find this progression works best: taking these one step at a time, and building on the previous step. The entire series is here:

Infrastructure as Code (IaC)
Continuous Integration (CI) and Automated Testing (This article)
Continuous Delivery (CD)
Release Management (RM)
Application Performance Monitoring
Load Testing and Auto-Scale

In the articles that follow in this series, I’ll help you implement each of these in turn.

If you’d like to implement DevOps, Microsoft has a site to assist. You can even get a free offering for Open-Source and other projects: https://azure.microsoft.com/en-us/pricing/details/devops/azure-devops-services/