March 15, 2021
Want your data scientists to freak out? Tell them to start following the agile methodology.
Want your product teams to freak out? Tell them that agile doesn’t work for data science projects.
You’re going to run into extreme opinions on this topic. Scrum, Kanban, and the many variants have become so standard that today’s developers and product teams can’t imagine working differently. The “agile is best” zealotry has become so ingrained that proponents will reject any contrary opinions.
Unfortunately, blindly following any agile software methodology won’t work for data science. At the same time, your product teams rely on agile for accountability, and they will correctly assert that perpetual data science experimentation is an unacceptable risk.
In this issue of FeedForward I’ll explore this topic and help you get your data science and product teams working together in harmony. Fortunately it isn’t that hard once you recognize the risks they are trying to mitigate.
Agile methodologies were primarily created to handle requirements uncertainty. If you can perfectly document what a software application is supposed to do—and nothing changes—then you don’t need agile. But in the real world this never happens. Requirements change, new opportunities emerge, and unforeseen technical challenges spring up. Agile methodologies were created to give product teams more flexibility in dealing with this uncertainty.
The root problem is that agile was not developed with data science in mind.
In data science the biggest risk is solution viability—NOT requirements uncertainty.
I’ll illustrate this point with a thought experiment. I would like you to build a model that can automatically generate this article. My requirements are quite clear and unlikely to change. So what’s the problem? This solution isn’t viable because text-generation techniques are not yet good enough. Optimizing methodology won’t get us anywhere. The only way forward is to redefine the problem, such as building a model that can automatically generate summaries of sporting events.
Many data scientists believe that Scrum is awful for data science because it doesn’t address viability risk. Addressing viability requires … well … science!
Data scientists are trained to systematically isolate viability risks through experimentation. Unfortunately data science experiments are highly unpredictable, particularly on machine learning applications that leverage state-of-the-art techniques.
The biggest risks in these projects are (1) insufficient data and (2) unacceptable model performance, and experienced data scientists have techniques for rapidly isolating them, such as exploratory data analysis or deliberate model overfitting. Unfortunately, the tasks and the time they require are nearly impossible to predict at the outset.
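To make deliberate overfitting concrete, here is a minimal sketch. It assumes a small tabular classification dataset, and it uses a lookup table as a stand-in for whatever high-capacity model a team would really train. The function name `memorization_ceiling` and the toy data are hypothetical. The idea is that if even a pure memorizer cannot fit the training data, the features do not carry enough signal, and no amount of tuning will fix that.

```python
from collections import Counter, defaultdict


def memorization_ceiling(rows, labels):
    """Return the best training accuracy a pure memorizer could reach.

    A lookup table keyed on the full feature vector is the most extreme
    overfit possible. If it cannot fit the data, identical inputs carry
    conflicting labels, which points to a data problem, not a model problem.
    """
    if not labels or len(rows) != len(labels):
        raise ValueError("rows and labels must be non-empty and equal in length")

    label_counts = defaultdict(Counter)
    for features, label in zip(rows, labels):
        label_counts[tuple(features)][label] += 1

    # For each distinct feature vector the memorizer predicts the majority label.
    correct = sum(counts.most_common(1)[0][1] for counts in label_counts.values())
    return correct / len(labels)


# Hypothetical toy data: two binary features per customer.
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1), (1, 1)]
labels = ["churn", "stay", "stay", "stay", "churn", "churn"]

ceiling = memorization_ceiling(rows, labels)
print(f"Best possible training accuracy: {ceiling:.2f}")
```

In this toy data the first two rows share the same features but have different labels, so the ceiling falls short of 1.0. A result like that is an early hint that more or better features are needed before any modeling work is scheduled.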
Most software engineers have never confronted this level of technical risk. The biggest risk in software is building something nobody wants, particularly for new software products. I know because I am also guilty of making this error. Prior to Prolego I attempted to launch a machine learning startup for content marketers. I nailed the customer needs but data issues killed our product’s viability.
Unfortunately most product teams don’t realize their own ignorance about data science and machine learning.
Many product leaders view machine learning models like any other software library, and assume the data scientist’s only job is to make this library.
Building a machine learning model isn’t like building a data-driven web application. A machine learning project cannot be easily broken down into a series of steps.
So what happens at the first sprint planning meeting? The Scrum master goes ballistic when the data scientists create ambiguously worded Jira tickets like “data gathering” with undefined story points.
Our hypothetical Scrum master’s reaction is understandable. Many data scientists have no software product experience, and they don’t understand how hard it is to maintain and scale software in a modern data center. The brilliant astrophysicist on your team may have spent 10 years writing experimental Python code that nobody else had to understand or maintain.
They may not appreciate or understand the many execution challenges that processes are designed to prevent. I’ve watched data science teams create “solutions” in thousands of lines of code in Jupyter notebooks. Reality hits when they ask IT to deploy it.
You simply cannot build an application that solves real world problems without having a process.
So here is your dilemma: your team needs a process to successfully build and deploy your ML models, but the agile methodology they have been refining for the past decade won’t work. Here is how I’ve learned to solve this problem.
I start by thinking of any new ML project in distinct phases:
Before ramping up any ML project the data scientist should evaluate whether or not the problem is even solvable. This usually requires (1) assessing the quality and quantity of the underlying data, and (2) evaluating whether a suitable methodology exists. The data scientists will perform exploratory data analysis and possibly a literature review.
Many machine learning projects never make it past this point. The data science team should be working independently. Since the project isn’t yet a software development effort, Agile plays no role.
If the solution is potentially viable the data science team can begin assessing how well it solves the business problem. This step usually requires iteratively training a model and evaluating the results against the minimum acceptable accuracy.
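The loop described above can be sketched as a simple gate: score each prototype iteration on held-out labels and stop at the first one that clears the minimum acceptable accuracy. This is only an illustration. The helper names (`accuracy`, `first_viable`), the iteration names, the predictions, and the 0.85 threshold are all hypothetical, and a real project would pick a metric that fits the business problem.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the held-out targets."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


def first_viable(candidates, targets, minimum_accuracy):
    """Return (name, score) for the first candidate that clears the bar, else None."""
    for name, predictions in candidates:
        score = accuracy(predictions, targets)
        if score >= minimum_accuracy:
            return name, score
    return None


# Hypothetical held-out labels and predictions from three prototype iterations.
targets = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
candidates = [
    ("baseline", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    ("iteration_2", [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]),
    ("iteration_3", [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]),
]

result = first_viable(candidates, targets, minimum_accuracy=0.85)
print(result)
```

Agreeing on the threshold with the business before prototyping starts gives the product team a concrete exit criterion, even though the number of iterations needed to reach it stays unpredictable.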
Prototyping can take several months. The schedule is usually driven by the complexity of the solution, accessibility of the data, and the availability of business customers for feedback. Again, the data scientist should do most of this independently.
When the model is “good enough”—a threshold that obviously varies depending on the problem—other team members may join the project. At this point the data scientist’s tasks are more predictable and you can begin introducing Agile.
Finally the entire team needs to deploy the model and the project starts to look like a software engineering effort. Agile is critical at this juncture because the entire engineering team needs to work collaboratively to get the model into production.
The data scientist will continue running experiments, but the time and scope will be more predictable. These data science tasks, however, will never be as predictable as tasks in a traditional software engineering effort.
Many of your problems will be solved by simply introducing Agile at the right time. However, your entire team will be more effective if everyone appreciates each other’s challenges.
Encourage your product managers and developers to get some basic data science experience. At Prolego we ask our data engineers (the people putting models into production) to take online courses or do Kaggle competitions. Your product team will be much more effective if everyone understands the basics of data cleansing, feature development, and model training.
Data scientists will be more effective team members if they understand how enterprise software development works. Have them participate in activities like sprint planning sessions, code reviews, and customer meetings. They may also benefit from courses in object-oriented programming, Kanban, or Scrum.
If nothing else, cross-functional training will get everyone using the same language.
I have yet to see a single ML project that didn’t struggle with this challenge, and it will be several years until the methodologies catch up. In the meantime you have another opportunity to be the AI translator your company needs.
When you see a team arguing about the ‘best’ way to run an ML project, just remember that most of the disagreement comes from misunderstanding. Data scientists are not deliberately trying to be vague, and Scrum masters are not trying to create busy work with Jira tickets.
Be the AI translator who helps the product team understand why ML is different. Be the process champion who helps the data science team realize the value of following a methodology. By doing so you will further establish the critical role you play in leading the company to the inevitable AI-driven future.
June 6, 2021
Many data science projects die a slow, painful death because the organization isn’t motivated to make them succeed. In this post we address the three primary reasons projects fail and provide suggestions for what you can do to overcome these challenges.
May 2, 2021
Your goal as an AI leader is to get your teams to think like pros. You want them to strategically look for ways in which AI can lift the entire business instead of just solving a narrowly defined problem. Your team should constantly seek ways to advance the bigger vision of becoming an AI-driven company. In this issue of FeedForward, I’ll describe the difference between how pros and amateurs think about AI.
March 31, 2021
In this video Justin Pounders, Director of Machine Learning and AI Research at Prolego, breaks down natural language generation (NLG) into its most basic components and describes how you can begin building out these components in your business. (And, no, it doesn’t depend on GPT-3!) He describes how NLG depends critically on two questions (WHAT you want to say and HOW you say it), the types of data you can feed into NLG systems, and a development path for being able to summarize multiple sources of data in plain English.