The most dangerous antipattern in MLOps: The model is “just a binary”

“An antipattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.” (Wikipedia)

A company’s transition to AI is incredibly hard. Even tech giants like Apple, Facebook, and Google have struggled with the transition. As non-tech companies look for ways to evolve their existing teams and software infrastructure to support machine learning, they often make a common mistake: the “just a binary” machine learning (ML) antipattern. This approach to deploying models considers the ML model as an isolated binary inside the existing infrastructure.

Although seemingly reasonable, the antipattern is fraught with hidden dangers.

The “just a binary” antipattern

In this antipattern paradigm, the data scientist develops a model offline and hands it over to IT for deployment. IT treats the model like any other third-party software library by writing against its application programming interfaces (APIs).

Why this antipattern is appealing

This antipattern is appealing for several reasons:

●     It isolates data science from IT. Most data scientists lack the software engineering background to design their solutions for production. IT lacks the background in ML to understand how models are developed. Because these teams sometimes work in isolated departments, the antipattern allows for a smooth handoff.

●     It simplifies accountability. IT (understandably) doesn’t want to be accountable for the output or accuracy of a model they didn’t develop. A model that’s isolated by interfaces keeps accountability separate.

●     It avoids new infrastructure investment. Viewing an ML model as “just a binary” lets IT deploy the model in any modern data infrastructure. No new tools or cloud platforms are necessary.

Why this antipattern is dangerous

Despite its appeal, the “just a binary” antipattern poses some significant problems.

Problem 1: The model is only part of the solution

“There’s a huge difference between building a Jupyter notebook model in the lab and deploying a production system that generates business value.” (Andrew Ng, technology entrepreneur)

A focus on “the model” usually indicates that an engineering team has limited experience with ML. Yes, building a model can be hard, but most teams underestimate how much additional infrastructure is needed to support the model. Practical solutions require:

●     Ongoing access to fresh training data.

●     Feature scaling and reuse.

●     Input data monitoring.

●     Model output monitoring.

●     Model orchestration and versioning.

●     Pipelines.

End users don’t want an API; they want a solution. Models that are deployed without supporting infrastructure go stale, lose accuracy, and don’t get used.
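
To make the “feature scaling and reuse” item above more concrete, here is a minimal Python sketch of a shared feature registry. It is an illustration under stated assumptions, not a production design: FeatureDefinition, FeatureRegistry, and the customer_tenure_days feature are hypothetical names invented for this example, and a real registry would be persisted and shared across teams rather than held in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FeatureDefinition:
    """A named feature plus the code that computes it (hypothetical structure)."""
    name: str
    description: str
    compute: Callable[[dict], float]


class FeatureRegistry:
    """Hypothetical in-memory registry of feature definitions."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Refuse to silently overwrite a name that another model may rely on.
        if feature.name in self._features:
            raise ValueError(f"feature '{feature.name}' is already registered")
        self._features[feature.name] = feature

    def compute(self, name: str, record: dict) -> float:
        # Every model computes the feature through the same definition.
        return self._features[name].compute(record)


registry = FeatureRegistry()
registry.register(
    FeatureDefinition(
        name="customer_tenure_days",
        description="Days since the customer's first purchase",
        compute=lambda record: float(record["tenure_days"]),
    )
)
```

Because every model looks up a feature by name in one place, a second definition under the same name fails loudly at registration time instead of surfacing later as a production bug.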

Problem 2: Technical debt & slow iteration

“Machine learning offers a fantastically powerful toolkit for building complex systems quickly. . . . [But] it is remarkably easy to incur massive ongoing maintenance costs at the system level.” (D. Sculley, Gary Holt, et al., “Machine learning: The high-interest credit card of technical debt”)

The antipattern can work for a beta deployment or for relatively simple, static problems. But most companies invest in AI to maintain a competitive edge or to automate complex processes. In these situations, the engineering team begins to accumulate technical debt when a deployed model uses the antipattern approach. Progress slows to a crawl, and the team struggles to keep the model running.

Technical debt accumulates in a few predictable scenarios:

●     The model begins to perform worse (that is, drift) because it runs on data that evolved after training. The engineering team recognizes this problem only after end users alert them. They scramble to identify the problem’s source and fix it.

●     The data scientists and engineering team lose track of model versions and the associated training data. Every model update requires the team to collect current training data.

●     Data scientists begin building different models that duplicate the same features or, worse, use the same name for different features. For example, multiple models might each define a feature named customer but compute it differently.

●     Data scientists add new features to a model, but the feature engineering isn’t put into production. After the model fails, the team rushes to put emergency patches into production.

●     Data scientists build models and experiment in notebooks, but the notebook code isn’t migrated into more robust software packages. A new data scientist joins the team and copies the notebook, but changes aren’t synchronized with production.

●     The model is coded and deployed perfectly. But the underlying hardware creates a result that wasn’t expected. IT and the data scientists spend weeks pointing fingers before they discover the problem.
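
The second scenario above, losing track of which model version was trained on which data, can often be avoided with a small record saved next to each model. The sketch below is one possible approach in Python, using only the standard library. The ModelManifest structure, the fingerprint helper, and the churn_model values are assumptions made for illustration; they are not part of any particular MLOps tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact training data."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelManifest:
    """Hypothetical record stored alongside every trained model."""
    model_name: str
    model_version: str
    training_data_sha256: str
    feature_names: tuple

    def to_json(self) -> str:
        # Sorted keys keep the file stable under version control.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Illustrative stand-in for the bytes of a real training file.
training_csv = b"tenure_days,churned\n30,0\n400,1\n"

manifest = ModelManifest(
    model_name="churn_model",
    model_version="1.0.0",
    training_data_sha256=fingerprint(training_csv),
    feature_names=("customer_tenure_days",),
)
print(manifest.to_json())
```

If the training data changes by even one row, the digest changes, so the team can tell whether a deployed model matches the data it claims to have been trained on.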

Problem 3: Reliance on data scientists for operational support

Models can fail for many reasons, but commonly the causes are unrelated to the model itself. Hardware fails. Data feeds suddenly put out null values. IT deploys the wrong model versions. Bugs creep into feature code.

Unfortunately, the deployment approach in this antipattern will fail to catch most errors. The IT personnel know only that “the model doesn’t work,” so they call the data scientists whenever a problem occurs.
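
A thin layer of checks around the model can turn “the model doesn’t work” into a message that IT can act on. The following Python sketch shows the idea under simplifying assumptions: the exception classes, safe_predict, and its parameters are hypothetical names, and the model is assumed to be any callable that takes a record.

```python
import math
from typing import Callable, Mapping, Sequence


class InputDataError(Exception):
    """The incoming record is unusable; the data feed owner should investigate."""


class ModelVersionError(Exception):
    """The deployed model is not the version that was expected."""


def check_record(record: Mapping[str, object], required: Sequence[str]) -> None:
    # Reject missing fields, explicit nulls, and NaN values before scoring.
    for name in required:
        value = record.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            raise InputDataError(f"field '{name}' is missing or null")


def check_version(deployed: str, expected: str) -> None:
    if deployed != expected:
        raise ModelVersionError(f"deployed version {deployed}, expected {expected}")


def safe_predict(
    model: Callable[[Mapping[str, object]], float],
    record: Mapping[str, object],
    required: Sequence[str],
    deployed_version: str,
    expected_version: str,
) -> float:
    """Run operational checks first, then call the model."""
    check_version(deployed_version, expected_version)
    check_record(record, required)
    return model(record)
```

With checks like these, a null value in a data feed raises InputDataError and a bad release raises ModelVersionError, so the first call goes to the team that owns the failing component.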

Data scientists aren’t trained for (and usually don’t enjoy) providing operational support. They’re usually ill equipped to diagnose errors in these complex systems. Constant operational support calls lead to burnout and attrition among data scientists.

Poor planning ends badly

These problems slowly accumulate over months. Because data scientists can’t deploy their own models, they constantly turn to IT to support their changes.

In the best case, data scientists endure this process but are frustrated with the slow updates. Morale falls.

In the worst case, the data scientists drive production changes by using hacks. Prolego has rescued projects where data scientists had been working around infrastructure challenges by updating and deploying models from a laptop. When the data scientists no longer want to deal with these hassles, they quit. Their models and customized cron jobs are lost, and the new team has to start again from scratch.

A pattern that works: begin investing in MLOps

So what’s the alternative to the “just a binary” antipattern? Begin investing in machine learning operations (MLOps). MLOps is a new domain, and its tools and approaches are still immature. But fortunately you don’t have to solve all possible problems when you deploy your first models.

One of Prolego’s core beliefs is that your biggest challenges are cultural, not technical. Although you might not have the engineering infrastructure to replicate Uber’s MLOps solution, you can avoid worst-case scenarios by changing mindsets and working styles. The following sections provide tips to help you evolve your teams’ mindset.

Think big but start small

Google has outlined a good approach to build out MLOps architecture in phases. Although your solution will vary based on your infrastructure, data, and business problem, this phased approach can allow you to systematically harden your infrastructure as your solution is adopted.

You don’t need to throw out your current process and start over. You can start small:

●     Set up a few simple monitoring statistics to detect model degradation.

●     Work with your operations team to develop an escalation plan to respond to issues. The plan should reduce or eliminate calls to the data scientists for help.

●     Begin creating documentation for your ML systems. The documentation should support governance and the transfer of work between individuals.
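
A first monitoring statistic can be very simple. The Python sketch below compares recent model scores against a baseline recorded at validation time and flags a large shift in the mean. It is only one possible starting point: the mean_shift and is_degraded functions and the default threshold of 1.0 are assumptions chosen for illustration, and the right statistic and threshold depend on your model and data.

```python
import statistics
from typing import Sequence


def mean_shift(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Shift of the recent mean, measured in baseline standard deviations."""
    spread = statistics.stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; choose another statistic")
    return abs(statistics.mean(recent) - statistics.mean(baseline)) / spread


def is_degraded(
    baseline: Sequence[float],
    recent: Sequence[float],
    threshold: float = 1.0,  # hypothetical default; tune for your own model
) -> bool:
    """Flag the model for review when recent scores move far from the baseline."""
    return mean_shift(baseline, recent) > threshold


# Illustrative scores only; real values would come from logged predictions.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
recent_scores = [0.7, 0.8, 0.9]

if is_degraded(baseline_scores, recent_scores):
    print("Model scores have shifted; follow the escalation plan.")
```

A check like this can run on a schedule and feed the escalation plan, so that the operations team hears about degradation before end users do.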

Stay focused on the ultimate goal

End users don’t care about the model. They care about how the system makes their work easier or better. Keep this end goal in mind from the start so that both data scientists and IT remain focused on the same outcome.

Assign accountability

The team needs to be jointly accountable for deployment. The data scientists need to develop solutions that are ready for production. Development includes unit-testing the code, writing clear and readable code, and using peer review and pull requests to validate changes. IT needs to build expertise in MLOps and lead the development of scalable infrastructure.
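
As a small example of what production-ready feature code might look like, the Python sketch below pairs a feature function with unit tests that pin down its boundary behavior. The tenure_bucket feature and its cutoffs of 90 and 365 days are hypothetical and exist only to illustrate the practice.

```python
import unittest


def tenure_bucket(tenure_days: int) -> str:
    """Hypothetical feature: bucket customer tenure into coarse categories."""
    if tenure_days < 0:
        raise ValueError("tenure_days must be non-negative")
    if tenure_days < 90:
        return "new"
    if tenure_days < 365:
        return "established"
    return "loyal"


class TenureBucketTest(unittest.TestCase):
    def test_boundaries(self):
        # Boundary values are where notebook code and production code tend to diverge.
        self.assertEqual(tenure_bucket(0), "new")
        self.assertEqual(tenure_bucket(89), "new")
        self.assertEqual(tenure_bucket(90), "established")
        self.assertEqual(tenure_bucket(364), "established")
        self.assertEqual(tenure_bucket(365), "loyal")

    def test_rejects_negative_values(self):
        with self.assertRaises(ValueError):
            tenure_bucket(-1)
```

Saved in a module, these tests can be run with python -m unittest and added to the checks that every pull request must pass.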

Get assistance

Need help getting started with MLOps? Prolego has helped many companies like yours. We can work with you to develop an MLOps production plan based on your unique environment.

Kevin Dewalt
Chief Executive Officer & co-founder
