The Need for an End-to-End Data Science Lifecycle Process
If you've ever worked on (or with) a data science team, you know that consistently delivering value can be frustrating (to put it nicely). There are so many places where things can go wrong and projects can fail. It has almost become a cliché to talk about the high failure rates of data science projects. However, given the value that AI and data science have demonstrated across industries, it's a problem that needs to be solved. There's just too much value to leave on the table. The gap between successful companies and those that fall behind will be driven largely by the strength of their data science capabilities.
In response to this, it seems almost everyone is jumping on the MLOps train, and with good reason. MLOps has finally given us a way to consistently deploy, monitor, and retrain our models at scale. It's becoming clear that MLOps will be a required component of any successful data science team. So why do I say it’s not enough?
It's my view that MLOps on its own won't deliver. It will be transformational for making your existing models more robust, easier to retrain and monitor, etc. But what about new projects and new models? MLOps starts with a model, which means you've already found a model that works and now you want to enter the MLOps loop. Train, register, deploy, monitor, retrain, repeat.
The reality is that most teams and organizations struggle to get to that point consistently. Beyond deployment difficulties and risks, there are several other key areas where things go wrong:
Solving the wrong problems
Building models that don't map well to business processes
Bad assumptions about the data or a mismatch in population
Converting the results of your experimentation into a production-ready model
Figure 1: An Internet Famous MLOps Diagram (with annotations)
We've seen all of these kill data science projects well before teams got to the stage where they'd even think about deployment. The good news is that while data science is experimental in nature, it's not random, which means we can identify ways to account for these common patterns.
Introducing the Data Science Lifecycle Process
In a previous article, I talked about the need for teams to create processes that cover the end-to-end data science lifecycle. We knew that MLOps would be a critical component, but based on our experience working with many data science teams, we still felt that there was a gap when it came to going from the ideation phase to the point where you had a model you were ready to build and deploy. We dubbed what we came up with the Data Science Lifecycle Process (lovingly referred to as the DSLP).
We’re happy to announce that we’ve open-sourced this process so that every data science team can start improving their processes immediately. We’ve documented the process and created issue templates and repos, and it’s all available on GitHub in the DSLP repo.
The DSLP is designed to break down the silos between data scientists, developers, IT, and the business. Data science projects are cross-functional by nature. This means we need to bridge the gap between the (often ad-hoc) experimental workflows of data scientists and the more systematic approach of engineering teams. We've attempted to do this by creating a branching strategy, issue templates, and workflow patterns that establish clear boundaries and handoff points from the model development process to the implementation and deployment process. With a clear pivot point, it becomes easy to apply all the best parts of MLOps to the implementation and deployment process, while still giving data scientists the flexibility they need in the problem framing, experimentation, and development parts of the process.
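To make the idea of a handoff point a little more concrete, here is a minimal sketch in Python of what tying each experiment branch to an issue could look like. Everything in it is an assumption made for illustration: the branch naming convention, the `WorkItem` fields, and the `handoff_problems` check are hypothetical and are not taken from the DSLP repo.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical naming convention: "<kind>/<issue-number>-<slug>".
# The kinds listed here are illustrative, not the DSLP's actual branch types.
BRANCH_PATTERN = re.compile(r"^(explore|experiment|model)/(\d+)-[a-z0-9-]+$")


@dataclass
class WorkItem:
    """One unit of exploratory or modeling work, tied to a tracked issue."""
    branch: str          # git branch holding the code and artifacts
    issue_number: int    # issue describing what the work is meant to accomplish
    goal: str            # short statement of the question being answered
    result: str = ""     # documented outcome, filled in when the work wraps up


def branch_issue(branch: str) -> Optional[int]:
    """Return the issue number embedded in a branch name, or None if absent."""
    match = BRANCH_PATTERN.match(branch)
    return int(match.group(2)) if match else None


def handoff_problems(items: List[WorkItem]) -> List[str]:
    """List the reasons a set of work items is not ready to hand to engineers."""
    problems = []
    for item in items:
        linked = branch_issue(item.branch)
        if linked is None:
            problems.append(f"{item.branch}: branch name does not reference an issue")
        elif linked != item.issue_number:
            problems.append(
                f"{item.branch}: branch points at issue {linked}, "
                f"but the work item says {item.issue_number}"
            )
        if not item.result.strip():
            problems.append(f"{item.branch}: no documented result")
    return problems


if __name__ == "__main__":
    items = [
        WorkItem("experiment/12-churn-baseline", 12,
                 "Beat the rule-based baseline", "Documented in issue 12"),
        WorkItem("scratch", 13, "Try embeddings"),
    ]
    for problem in handoff_problems(items):
        print(problem)
```

The design choice worth noticing is that the check runs at the pivot point rather than during experimentation: data scientists stay free to branch and abandon work as they see fit, and only what crosses over to engineering has to be traceable to an issue and a documented result.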
Figure 2: The Phases of an ML Project and the Roles Involved
The need for a process like this is likely apparent to anyone who has worked on delivering data science projects in an enterprise environment. The friction between data science teams and just about everyone else is generally pretty high and leads to a lot of throwing things over the wall. It's not good, folks.
Feedback on the DSLP So Far
As we developed the DSLP, we worked with several teams to test how well the process performed on real data science projects. We spoke with Cameron Vetter (an ML Engineer) and Carolyn Olsen (a Data Scientist) from Octavian Technology Group about the challenges they've seen enterprise data science teams face and how the DSLP addresses many of them.
In my experience, there are two places where existing data science processes really break down. The first is that the traditional software development processes don’t fit data science well, because data science is such a non-linear process. Workflows can quickly spread out like a hydra’s heads. After a few weeks of work, data scientists may struggle to replicate exactly what they did along the way, or can get lost down analytical rabbit holes. DSLP makes non-linear data science processes focused and reproducible by linking exploration and modeling experiment artifacts directly with Issues describing exactly what they’re meant to accomplish and their results.
The second struggle many data scientists have is the pain of building a great model, then seeing it “sit on a shelf,” never getting into production. Like agile project management, DSLP helps keep work focused on business goals, increasing the likelihood of stakeholder buy-in. It also facilitates hand-off from data scientists to the engineers getting the model into production, by giving data scientists a structured way to hand off code and documentation.
Data Science projects often have a disconnect between the engineers and data scientists. These two groups work in vastly different ways, and often struggle to sync their efforts. Engineers usually work within SDLC processes, using them to align their teams towards the same goal. Data Scientists tend to be more experimental in their work. A Data Scientist will often go down a path and completely abandon it, starting down a new path many times during a project.
This experimental nature often leads Data Scientists to follow an ad-hoc process, making it difficult to hand off their work to ML engineers. By the time the work is handed off to the engineers, the Data Scientists are unable to explain why certain decisions were made around modeling, data shaping, and data enhancement. This can lead to a lot of throw-over-the-wall deployments where engineers are making decisions without understanding the how or the why behind what they are implementing.
The DSLP adds process to this experimental phase and does it within familiar SDLC tools that the engineers are comfortable with. This allows engineers to use the documented issues combined with the branching strategy to understand the flow of what happened prior to the hand off. This understanding will impact how this model is brought to production. This enables them to collaboratively iterate with the Data Scientists as they productionize the model.
We’re going to keep building out this process as we work on projects with our customers and partners. We’re sure that what we’ve built isn’t perfect, but from what we’ve seen it can have a major positive impact on data science teams. Try it for yourself by implementing the branching strategy and using the issue templates on your next project. Keep an eye out for more content from us as we continue to develop, document, and evangelize this process.
Feedback is welcome, and as an open-source initiative, we hope to create a vibrant community over time. If you want to learn more, test it out, or engage with us on implementing or improving this process, feel free to open an issue on GitHub or email us at email@example.com.