Mithun Prasad, PhD, Data Scientist at Microsoft
As data scientists, we are used to developing and training machine learning models in a local Python notebook and handing off the resulting code to an app developer, who must then integrate it into the larger application and deploy it. Often, bugs or performance issues go undiscovered until the code has already been deployed, and the resulting back-and-forth between app developers and data scientists to identify and fix the root cause can be slow and frustrating.
As more business-critical applications are infused with AI, it is increasingly clear that data scientists and application developers need to collaborate more closely to build and deploy AI-powered applications efficiently. A lot goes into building and maintaining an application, and infusing AI into it is just a small part of that work. Even so, it is not practical to burden data scientists with the ins and outs of application lifecycle management.
So, what's needed is a happy medium, which is where Azure Machine Learning and Azure DevOps come into the picture. Together, these platform features facilitate collaboration between data scientists and app developers while letting you use the tools and languages each of you is already familiar and comfortable with. You can now build AI-infused apps faster, together. You can automate unit testing and integration of your AI model with the larger business application, including any periodic retraining and redeployment to compensate for data drift.
In short, your data science process is now part of your enterprise application’s Continuous Integration (CI) and Continuous Delivery (CD) pipeline. No more finger-pointing between data scientists and app developers over unexpected delays in deploying apps or bugs discovered after the app has reached production! Let’s walk through the diagram below to understand how this integration between the data science cycle and the app development cycle is achieved.
A starting assumption is that both the data scientists and app developers in your enterprise use GitHub as the code repository. As a data scientist, any change you make to training code triggers the Azure DevOps CI/CD pipeline, which runs unit tests, starts an Azure Machine Learning pipeline run, and pushes a code deployment. Likewise, any change that you or the app developer make to application or inferencing code triggers integration tests and a code deployment push. You can also set specific triggers on your data lake to execute both the model retraining and code deployment steps.
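To make the trigger rules above concrete, here is a minimal Python sketch of the routing logic. It is not how Azure DevOps is configured in practice (triggers are normally declared in the pipeline definition), and the folder names `training/`, `inference/`, and `app/` as well as the stage names are hypothetical, chosen only for illustration.

```python
# Hypothetical repository layout and stage names; the article specifies neither.
TRIGGER_RULES = [
    ("training/", ["unit_tests", "aml_pipeline_run", "deploy"]),
    ("inference/", ["integration_tests", "deploy"]),
    ("app/", ["integration_tests", "deploy"]),
]

# Fixed order in which stages run when several are triggered at once.
STAGE_ORDER = ["unit_tests", "aml_pipeline_run", "integration_tests", "deploy"]


def stages_for_commit(changed_paths):
    """Return the ordered pipeline stages that a set of changed files triggers."""
    needed = set()
    for path in changed_paths:
        for prefix, stages in TRIGGER_RULES:
            if path.startswith(prefix):
                needed.update(stages)
    return [stage for stage in STAGE_ORDER if stage in needed]


if __name__ == "__main__":
    # A training-code change triggers unit tests, a training run, and a deployment.
    print(stages_for_commit(["training/train.py"]))
    # An application-code change triggers integration tests and a deployment.
    print(stages_for_commit(["app/server.py"]))
```

Under these assumptions, a commit that touches both training and application code would run all four stages, while a change outside the listed folders would trigger nothing.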
With this approach, you as the data scientist retain full control over model training. You can continue to write and train models in your favorite Python environment and rest assured that it is your Python code that is deployed to production, not a Java translation that the app developer took the liberty of creating from the original! You decide when to execute a new ETL/ELT run using Azure Data Factory to refresh the data used to retrain your model. Likewise, you continue to own the Azure Machine Learning pipeline definition, including the specifics of the data wrangling, feature extraction, and experimentation steps, such as the compute target, framework, and algorithm to use. At the same time, your app developer counterpart can sleep comfortably knowing that any changes you commit will pass through the required unit testing, integration testing, and human approval steps for the overall application.
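As one illustration of how you might decide when a data refresh and retraining run is worthwhile, the sketch below flags drift in a single numeric feature. It is a deliberately simple heuristic, not a feature of Azure Machine Learning or Azure Data Factory: the function name `needs_retraining`, the mean-shift test, and the default threshold of 2.0 standard deviations are all assumptions made for this example.

```python
from statistics import mean, stdev


def needs_retraining(baseline, recent, threshold=2.0):
    """Flag drift when the mean of `recent` moves more than `threshold`
    baseline standard deviations away from the mean of `baseline`.

    Both arguments are sequences of numeric values for one feature.
    The threshold is an illustrative default, not a recommended setting.
    """
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline values and one recent value")
    spread = stdev(baseline)
    shift = abs(mean(recent) - mean(baseline))
    if spread == 0:
        # A constant baseline: any movement at all counts as drift.
        return shift > 0
    return shift / spread > threshold


if __name__ == "__main__":
    training_values = [10.0, 11.0, 9.0, 10.5, 9.5]
    print(needs_retraining(training_values, [10.2, 9.9, 10.1]))   # values close to baseline
    print(needs_retraining(training_values, [14.0, 15.0, 13.5]))  # values far from baseline
```

A check like this could run on a schedule or on a data lake trigger, with a positive result starting the retraining steps; a production setup would likely compare full distributions across many features rather than a single mean.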
Recommended background reading:
