
Using Operate First to host cloud-native AI

Learn why Open Data Hub selected Operate First, which applies open source concepts to operations, to host its project.

Operate First is a concept that says knowledge about operations should be shared and exchanged the same way as source code is—transparently, publicly, and openly. It's the idea that the community can critique operational data, procedures, and management to achieve better maintainability and a more robust and reliable stack for everything from bare-metal infrastructures to user applications.

To achieve this, Operate First believes that you should trust a community with running and managing applications and infrastructure. This approach can help bridge the gap between Ops and Site Reliability Engineers (SREs) on one side and developers and quality assurance (QA) on the other. It challenges the well-established paradigm that once an application is developed, it can be thrown over the fence to the SRE and then refined and matured through bug reports, with no developer involvement in maintenance. Operate First aims to close the feedback loop on software development by giving developers unprecedented visibility into, and participation in, production environments.

At DevConf 2021, we presented a talk on Cloud-native AI using OpenShift, and we've summarized it in this article and our previous piece. Our first article in this series describes Open Data Hub, a blueprint for how you can put data science components together on top of a scalable platform like OpenShift or Kubernetes. In this article, we'll explain why we selected Operate First to deploy Open Data Hub.

Using Operate First for data science

Ideas are important, but what matters even more is whether real results back those ideas up. Any initiative forming strong opinions should lead by example, providing a living demonstration of the ideas it preaches.

For Operate First to be recognized and treated seriously, it's essential to provide a hands-on experience to a broad audience so that others can challenge the idea in the real world. This train of thought brought Operate First's community cloud to life. It provides a community-managed cloud environment that allows users to manage and maintain their applications and provide those applications as services to others.

We determined that the Operate First community cloud is the ideal place to deploy, use, and fully expose Open Data Hub from all possible points of view. Operate First lets you judge the deployment model, its characteristics, and its friction points with minimal investment on your part, while also allowing you to participate in decision-making if you wish.


Deploying and maintaining a full-scope, robust data science platform is no easy task, so allowing users and architects to inspect the model is essential for the project to mature well. Exposing an upstream open source data science platform to users who are willing to share their usage data, metrics, and logs provides enormous value to the Open Data Hub community. It can surface hard-to-reproduce issues and error states that would otherwise be very challenging to debug and that could eventually lead to service downtime for customers. Running Open Data Hub as a first-class citizen of the Operate First community cloud can help prevent many operational issues.

A technology-neutral model

Operate First doesn't differentiate between a website, data science platform, mail server, or a Kubernetes cluster. It recognizes that all require operations and maintenance, and that all should handle all aspects of their lifecycle transparently and share the knowledge.

In that sense, data science applications and deployments are no different. Training and serving a machine learning (ML) model is, operationally, no different from building, packaging, and deploying a web application. Therefore, both should be explored and handled the same way: transparently and openly.

Since our world mainly revolves around Kubernetes-based infrastructure, we also focus on this deployment model for data science applications. The train of thought is simple: you define Kubernetes resources as YAML manifests, store those manifests in Git, and apply them to the cluster using GitOps.
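For instance, a minimal workload manifest might look like the following sketch. The names, namespace, and image here are placeholders for illustration, not part of any actual Operate First repository:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-model-server        # hypothetical workload name
      namespace: data-science           # hypothetical namespace
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-model-server
      template:
        metadata:
          labels:
            app: example-model-server
        spec:
          containers:
            - name: server
              image: quay.io/example/model-server:latest   # placeholder image
              ports:
                - containerPort: 8080

Committing a file like this to Git means the desired cluster state is reviewable, diffable, and revertible just like any other code.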


You can use various continuous integration/continuous delivery (CI/CD) tools for this, but the most straightforward is ArgoCD, also available as OpenShift GitOps. As you would expect from any CD tool, ArgoCD watches for changes in your repositories and applies them to a given cluster. The SRE then only has to monitor the live environment and adjust manifests in the repository to correct any misbehavior. The catch is that the SRE can, and should, be the developer or data scientist: they are the person who fully understands how the application should behave. The role of Operate First, in this case, is to enable the data scientist to do the operations. It provides them with the data and procedures they need, in a consumable way, so even an untrained SRE can understand what to do, how, and why.
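To make this concrete, an ArgoCD Application resource can point at a Git repository and keep a cluster namespace in sync with it. This is a minimal sketch; the repository URL, path, and names are placeholders, not actual Operate First resources:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: example-data-science-app    # hypothetical application name
      namespace: argocd                 # namespace where ArgoCD runs
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/example-manifests.git   # placeholder repository
        targetRevision: main
        path: manifests/overlays/prod   # placeholder path to the manifests
      destination:
        server: https://kubernetes.default.svc
        namespace: data-science
      syncPolicy:
        automated:
          prune: true      # remove resources that were deleted from Git
          selfHeal: true   # revert manual drift back to the Git state

With selfHeal enabled, any manual change made in the cluster is reverted to whatever Git declares, which is what keeps the repository the single source of truth that the SRE (or data scientist) adjusts.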

How can Operate First achieve that? The same way as in any case where you have too many options to choose from—through standardization.

Image: "Standards" comic (XKCD, CC BY-NC 2.5)

It's very important to offer a baseline from which individual projects can grow. Software development has a well-established pattern for these scenarios: repository templates. Data science can be no different. A repository with a predictable structure (like the cookiecutter format, enriched with deployment manifests and operational knowledge in the form of documentation and runbooks) can greatly reduce the cost of maintenance. It also helps data scientists quickly pilot an application that plugs directly into the existing ecosystem of applications.
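As a rough sketch, a project generated from such a template might look like this; the layout is illustrative, not a prescribed Operate First structure:

    example-data-science-project/
    ├── notebooks/               # exploratory and training notebooks
    ├── src/                     # model training and serving code
    ├── manifests/
    │   ├── base/                # shared Kubernetes manifests
    │   └── overlays/prod/       # environment-specific configuration
    └── docs/
        └── runbooks/            # operational procedures and documentation

Because the deployment manifests and runbooks live next to the code, the same pull request that changes the model can also update how it is operated.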

And since many different data science projects can start from the same baseline by using the same template, Operate First can facilitate experience exchange, and users can refine the template to make operations even simpler for the next generation of projects.

Open source operations

Knowledge about operations should be shared and exchanged as transparently as open source code. Operate First provides a community-managed cloud environment for managing and maintaining applications, where you can openly judge the deployment model's friction points and prevent many operational issues, all in a technology-neutral model.

For more, please see our DevConf presentation, Cloud-native AI using OpenShift.


DevConf is a free, Red Hat-sponsored technology conference for community projects and professional contributors to free and open source technologies. Check out the conference schedule to find other presentations that interest you, and access the YouTube playlist to watch them on demand.


Tom Coufal

Tom Coufal is a principal software engineer who has been working in open source his entire career.
