
DevOps for machines: thinking outside the MLOps spin cycle

AI projects can struggle when they hit production. Machine learning, paired with DevOps, does offer a way around this problem—just beware of the hype around MLOps.
Image: robots overwhelmed by parts (graffiti art), by MMT from Pixabay

It’s a breakout year for Machine Learning Operations (MLOps), according to Forrester, which reckons it’s seen increased maturity among tools providers.

According to Forbes, MLOps is just the "beginning" of enterprise AI, while others report MLOps will be the future of "successful AI implementations."

The excitement, the investment, the predictions: we seem to have reached peak MLOps. It’s tempting to think of MLOps as just AIOps spun differently, but is that true and, more importantly, does it matter? Yes, as Red Hat SVP of Cloud Platforms Ashesh Badani explains:

"AI/ML represents a top emerging workload for Red Hat OpenShift across hybrid cloud and multi-cloud deployments for both our customers and for our partners supporting these global organizations. By applying DevOps to AI/ML on the industry’s most comprehensive enterprise Kubernetes platform, IT organizations want to pair the agility and flexibility of industry best practices with the promise and power of intelligent workloads." —Ashesh Badani, SVP, Cloud Platforms, Red Hat

Underpinning MLOps and AIOps is AI, which, according to McKinsey, more than half of the organizations it speaks to have embedded in at least one capability in a process, product, function, or business unit. AI is being used to improve customer experience and organizational efficiency and to help make organizations more adaptable to change. Cloud, meanwhile, is putting AI into the hands of more "ordinary" users: powerful processing platforms, data analytics on demand, off-the-shelf programming frameworks, and custom hardware are democratizing AI, according to Gartner.

But there’s a problem: AI projects have a huge failure rate. IDC reckons a quarter will flop. And those in IT aren’t much better when it comes to implementing AI: in 2017, Gartner reckoned 25% of enterprises would have "strategically implemented" AIOps by 2019, but in the end, just 5% had succeeded.

This failure rate should be a cause for concern for anybody whose job it is to specify, implement, lead, and champion production-grade IT systems. Why? Because AI rollouts are coming their way.

What’s tripping up AI? To some extent, it’s being haunted by the problems that struck down past enterprise IT projects: poor specification, overly ambitious goals, failure to capture the requirements of the business or user, etc. Then, there are structural issues that come with technological and organizational silos. New to the mix is data. AI is fuelled by data, meaning you must pay close attention to the quality and provenance of the data driving the machine model behind your AI.

Machine meets world

The mark of a failed AI project is not that it runs late or over budget, the hallmark of past enterprise IT failures; it’s that the system starts doing the wrong thing. AI projects are judged to have failed when they deliver incorrect outcomes, a future that Gartner says awaits a worrying 85% of AI projects.

David Talby, Pacific AI Chief Technology Officer, in his piece, Why Machine Models Crash and Burn in Production, describes how and why AI makes mistakes: "In contrast to a calculator, your ML system does interact with the real world. If you are using ML to predict demand and pricing for your grocery store, you’d better consider this week’s weather, the upcoming national holidays, and what your competitor across the street is doing."

In other words, machine models that do not adapt to new conditions as described by changing data will act like there’s been no change. They make the right decision, just in the wrong world.
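Talby's grocery example can be sketched as a simple drift check: compare the distribution of a feature in live traffic against the distribution the model was trained on, and flag when they diverge. This is a minimal, illustrative sketch; the feature, the threshold, and the synthetic data are assumptions, not part of any particular MLOps tool.

```python
import random
import statistics

random.seed(0)

# Baseline: a feature as seen at training time (e.g., weekly demand).
train_demand = [random.gauss(100, 10) for _ in range(1_000)]

# Live data after the world changes (say, a holiday week lifts demand).
live_demand = [random.gauss(180, 25) for _ in range(200)]

def drifted(train, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    training standard deviations away from the training mean."""
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)
    return abs(statistics.fmean(live) - mu) / sigma > threshold

print(drifted(train_demand, live_demand))  # → True: the model's world has moved
```

A production system would use a proper statistical test and watch many features at once, but the principle is the same: detect that the world has changed before the model quietly makes the right decision in the wrong world.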

DevOps: the beginning of a meaningful consensus

If this were purely a software issue, then the answer would be simple: diagnose, test, fix, validate, and deploy through your DevOps pipeline. But this is not simply a software problem. The code is a comparatively small part of a machine learning system. Machine models depend on a range of cloud-based building blocks and, as a result, carry unique hardware and software dependencies.

Also, you have data, schemas, and a model to test and validate. In Machine Learning: The High-Interest Credit Card of Technical Debt, Google’s engineers highlight the problem posed by data: "While code dependencies can be relatively easy to identify via static analysis, linkage graphs, and the like, it is far less common that data dependencies have similar analysis tools. Thus, it can be inappropriately easy to build large data-dependency chains that can be difficult to untangle."
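One practical response to that problem is to make data dependencies explicit and loud: validate every incoming batch against a declared schema before it reaches training, so a broken dependency fails visibly rather than silently. A minimal sketch, with hypothetical field names:

```python
# Declared schema for incoming records; the fields are hypothetical.
SCHEMA = {
    "store_id": int,
    "date": str,
    "units_sold": float,
}

def validate(records):
    """Return a list of human-readable schema violations."""
    errors = []
    for i, row in enumerate(records):
        missing = SCHEMA.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
            continue  # can't type-check fields that aren't there
        for field, expected in SCHEMA.items():
            if not isinstance(row[field], expected):
                errors.append(f"row {i}: {field} should be {expected.__name__}")
    return errors

batch = [
    {"store_id": 7, "date": "2020-11-02", "units_sold": 41.0},
    {"store_id": "7", "date": "2020-11-03"},  # wrong type, missing field
]
print(validate(batch))  # one complaint about the second row
```

Dedicated data-validation tools go much further, checking distributions as well as types, but even this much turns an invisible data dependency into a visible failure.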

This feeds into a people and process problem. Cloud might be democratizing AI, but cloud itself, especially the kind of hybrid cloud being deployed in the enterprise, is a challenge to manage. Deploying, configuring, monitoring, and managing servers, systems, and applications across data centers, domains, and providers requires large-scale automation of processes and tools to overcome this complexity and ensure the uptime and availability of systems and services.

Machine model drift takes hold in environments where teams are not set up to manage this complexity: for example, where the tasks of preparing data, analyzing it, and training the model are done manually using scripts; where there’s no connection between the data scientists building and piloting the model and the engineering team who must maintain it in production; and where there’s a lack of active performance monitoring that could help track model predictions and actions.

"This manual, data scientist-driven process might be sufficient when models are rarely changed or trained. In practice, models often break when they are deployed in the real world. The models fail to adapt to changes in the dynamics of the environment or changes in the data that describe the environment," states Google documentation.
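Active performance monitoring, the missing piece in that manual workflow, can be as simple as tracking the rolling error of predictions against observed actuals and raising an alert when it exceeds a budget. A sketch under stated assumptions: the window size and error budget below are illustrative, not recommendations.

```python
from collections import deque

class PredictionMonitor:
    """Track recent prediction error; flag the model unhealthy
    when mean absolute error drifts past a budget."""

    def __init__(self, window=100, max_mae=5.0):
        self.residuals = deque(maxlen=window)  # keeps only recent errors
        self.max_mae = max_mae

    def record(self, predicted, actual):
        self.residuals.append(abs(predicted - actual))

    @property
    def mae(self):
        return sum(self.residuals) / len(self.residuals)

    def healthy(self):
        return self.mae <= self.max_mae

monitor = PredictionMonitor(window=3, max_mae=5.0)
for predicted, actual in [(100, 102), (100, 104), (100, 130)]:
    monitor.record(predicted, actual)

print(monitor.healthy())  # → False: MAE is (2 + 4 + 30) / 3 = 12
```

Hooked into the same alerting stack that watches the rest of the estate, a check like this gives the operations team the signal that a model has broken, long before the business notices the wrong answers.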

Pipeline and process

If this sounds like a call for the application of a form of DevOps in AI, then it is. Putting machine models into production means providing those teams who build the models and the teams managing them in production with a shared way to identify problems and automate resolution. That should bring the data scientists building the models into the operational flow established to manage the corporate IT lifecycle. The DevOps culture, with its process automation and orchestration, can overcome the complexity of development and operational management in the cloud.

But, as DotScience warns, don’t try to simply apply the standard DevOps iron to the machine-model hide. Establishing a culture of collaboration between data science and operations is one thing; specifying the right tools and defining processes is quite another. As an architect, you must specify and select tools that help test datasets, model parameters, metrics, and outputs of the model once it hits production. Those tools will also need a platform that provides the kind of automation and orchestration that will close the gap between pilot and production.
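In a pipeline, those tests become promotion gates: a candidate model must clear an absolute quality bar and must not regress against the model already serving production before it ships. A minimal sketch, with an assumed accuracy metric standing in for whatever metrics your tooling tracks:

```python
def promote(candidate_metrics, production_metrics, min_accuracy=0.9):
    """Gate a candidate model on an absolute accuracy bar and on
    non-regression against the current production model."""
    if candidate_metrics["accuracy"] < min_accuracy:
        return False, "below absolute accuracy bar"
    if candidate_metrics["accuracy"] < production_metrics["accuracy"]:
        return False, "regression against production model"
    return True, "promoted"

ok, reason = promote({"accuracy": 0.93}, {"accuracy": 0.91})
print(ok, reason)  # → True promoted
```

The gate itself is trivial; the value comes from running it automatically, on every candidate, as part of the same pipeline that deploys the rest of your software.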

Keep it simple

AI isn’t "here," but it is coming; machines have made landfall in mainstream IT and will only increase their presence. The challenge to that expansion, however, lies not in coding machine models but in "operationalizing" them—turning them into production-grade systems that serve your organization’s goals while fitting into a predictable system of IT lifecycle management.

MLOps risks becoming subject to conflation and hype. In viewing MLOps as a way to extend DevOps culture and practices to data science, however, you have at your disposal a way to turn the machine-driven projects coming your way into reliable members of the corporate technology estate.


Gavin Clarke

As a journalist in the US and UK, Gavin has covered the technology, business, and personalities of Silicon Valley and high tech.
