
In a post on the Red Hat Developer's Blog, I wrote about the multiple layers of security available when deploying Red Hat Data Grid on Red Hat OpenShift. Another challenging problem I see for customers is performing a no-downtime upgrade of the Red Hat Data Grid images (published on the Red Hat Container Catalog). That's what we're going to tackle in this post.

If you're new to it, Red Hat Data Grid is an in-memory, distributed, NoSQL datastore. With it, your applications can access, process, and analyze data at in-memory speed, delivering a user experience superior to traditional data stores such as relational databases. In-memory data grids have a variety of use cases in today's environments, such as fast data access for low-latency applications, storing objects (NoSQL) in a datastore, achieving linear scalability with data distribution/partitioning, and keeping data highly available across geographies.

Red Hat Data Grid runs on OpenShift like any other application. However, Data Grid is a stateful application, and the upgrade process for stateful applications can be challenging in the container world. The clustering capabilities of Data Grid add yet another layer of complexity.

Red Hat releases container images for a number of its products on its Container Catalog website. Red Hat provides a container health index for each image and updates that index as new vulnerabilities are found or new versions of the product are released. So an image with an A rating on the container health index today may not keep that rating six months down the line, as new issues surface over time in the field.

Why "Rolling" upgrades are not suitable for Red Hat Data Grid?

For clustered applications, rolling upgrades generally make sense. However, there is an important detail in the templates Red Hat publishes for deploying Data Grid. Here are the templates for Data Grid version 7. If you open any template file, you will see that the deployment strategy is set to "Recreate," not "Rolling": two pods from different major versions are not expected to work together in a single Data Grid cluster. The sections below outline how to upgrade Data Grid versions with no downtime.
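For instance, on a Data Grid deployment created from one of these templates, you could confirm the strategy with a command like the following (datagrid-app is an assumed DeploymentConfig name; it should print "Recreate" for the out-of-the-box templates):

$ oc get dc datagrid-app -o jsonpath='{.spec.strategy.type}'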

The objective here is to upgrade Red Hat Data Grid in place with no downtime. While we have experienced no-downtime upgrades with this method, it may not work in every environment or with every application. We strongly recommend practicing this in a development or test environment before attempting it in production.

Four Steps Designed to Upgrade Data Grid

The Operator Framework, introduced in OpenShift 4, should make the upgrade process easier. OperatorHub lists the operators available today for various products. A Data Grid Operator could provide the capability to install, upgrade, scale, monitor, back up, and restore the application.

However, the Red Hat Data Grid Operator is still in development at this time, so until we have one, we need to find an alternative way to upgrade from version 6 to version 7.

So, let's assume we have deployed our Data Grid, which we'll call version 1 (v1), into our OpenShift cluster. Another application is accessing the data grid via a route (it could just as well be a service URL; it does not matter). We now need to upgrade to version 2 (v2) without losing data.

If you would like to play with the upgrade process, have a look at “Orchestrating JBoss Data Grid Upgrades in Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. I recommend trying this process in your dev/test environment before rolling it out in production.

Step 1 - Deploy version 2 and use "Remote Store"

We need to do the following while deploying v2:

  • Define all caches in v2 as a "remote store." A remote store is a cache store that loads data from, and stores data to, a cache running in another data grid.
  • Provide the details (service address and port) of v1 while configuring the remote store in v2.

We cannot deploy v2 the same way we deployed v1. At this point in time, the templates do not expose a way to declare a remote cache store and some other required details, so we need an alternative way to deploy v2: a custom configuration file. Here is what the process looks like:

  • Define a custom configuration file and name it standalone.xml
    • A sample file is located here.
    • See how the "mycache" cache is defined here:

 

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

    • Define the remote data grid server. Replace the Data Grid service URL with the service IP of the v1 service running in OpenShift (a command to look up this IP is included in the example after this list).

 

<outbound-socket-binding name="remote-store-hotrod-server">
    <remote-destination host="<REPLACE SOURCE Data Grid SERVICE URL>" port="11333"/>
</outbound-socket-binding>

 

  • Define a ConfigMap whose data is the standalone.xml file (using the "oc create configmap --from-file ..." construct; see the example after this list).
  • Create a new template for deploying v2. This template includes instructions to use the ConfigMap and mount it at the /opt/datagrid/standalone/configuration/user location.
    • A sample template file is located here.
    • Search for config-volume and observe that the "datagrid-config" ConfigMap is mounted at the above location.
    • Set the deployment strategy to "Rolling." This is a change from the out-of-the-box templates provided by Red Hat, which set the deployment strategy to "Recreate."
    • Also set the "minReadySeconds" parameter to 60, as shown in the sample template file above.
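Pulling these pieces together, the commands could look roughly like the following sketch. The names datagrid-app (the v1 service), datagrid-config, and datagrid-v2-template.yaml are assumptions; substitute the names used in your environment.

$ oc get svc datagrid-app -o jsonpath='{.spec.clusterIP}'   # service IP of v1, used as the remote-destination host

$ oc create configmap datagrid-config --from-file=standalone.xml

$ oc process -f datagrid-v2-template.yaml | oc create -f -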

Once we have made these changes, we are ready to deploy v2. After a successful deployment, we need to remap the route (or the v1 service selector) so that traffic starts flowing to v2's service IP (or to v2's pods via the selector). Once this change is completed, the next request to the data grid will go to v2, and v2 will load data from v1. The figure below describes this state:
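For example, if traffic is switched at the service level by changing the selector, the change might look like this (the service name datagrid and the label value datagrid-v2 are assumptions; use whatever labels your templates apply):

$ oc patch svc datagrid -p '{"spec":{"selector":{"application":"datagrid-v2"}}}'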

Step 2 - Syncing data

Once we have v2 deployed, we need to open a remote shell into one of v2's pods and use CLI commands to copy/sync all data from v1 to v2. Run these commands to perform the data sync:

$ oc rsh <v2 pod name>

sh-4.2$ /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=mycache:synchronize-data(migrator-name=hotrod)"

{"outcome" => "success"}

Run the above command for every cache defined in the data grid (a loop that does this is sketched below).
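If you have several caches, a small shell loop can run the synchronization for each of them. This is only a sketch: the cache names mycache and anothercache are placeholders, and <v2 pod name> must be replaced just as above.

$ for CACHE in mycache anothercache; do
    oc rsh <v2 pod name> /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 \
      -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=${CACHE}:synchronize-data(migrator-name=hotrod)"
  done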

Step 3 - Rolling Upgrade

Now that the data is synced, it makes sense to delete v1. However, we can't just delete v1 yet, because the cache configuration in v2 still uses a remote store and refers to the service IP of v1. If v1 is deleted and a request arrives for a key that does not exist in v2, v2 will try to load the data from v1, and that request will fail.

We need to remove this dependency. A rolling update of v2 with a new configuration (no version change) can do that. Make the following changes to the existing deployment configuration of v2:

  • Change the cache definition - Edit the ConfigMap (which holds the cache configuration) and change the cache definition from

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

to

<distributed-cache name="mycache" mode="SYNC"/>

  • Change the remote-destination - Edit the ConfigMap and change the remote destination host from

<remote-destination host="172.30.232.114" port="11333"/>

to

<remote-destination host="remote-host" port="11333"/>

Note that 172.30.232.114 above is the service IP of v1.

After completing the above changes, roll out the new configuration for v2's deployment config. Since we defined the deployment strategy as "Rolling," a rolling update will start. Remember, we set a new parameter, "minReadySeconds," in the previous step. We set it because we don't want OpenShift to kill an existing pod the moment a new pod comes up. If we don't wait for that minimum time, it is highly likely that the new pod will not have replicated data (when it joins the cluster) before OpenShift kills the existing pod and continues with the rolling update, which may end up causing data loss.
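As a sketch, assuming the v2 deployment config is named datagrid-v2, triggering and then watching the rollout could look like this (a ConfigMap edit alone does not trigger a new deployment, so we start one explicitly):

$ oc rollout latest dc/datagrid-v2

$ oc rollout status dc/datagrid-v2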

When the rolling update eventually completes, you will have new pods with local cache definitions that no longer refer to v1. We have successfully removed the dependency. The final state looks like the figure below:
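If you want to double-check that the remote store is gone, one option is to read the cache resource back through the same management CLI used earlier. Treat this as an assumption rather than a guaranteed check; the exact resource layout depends on your Data Grid version.

$ oc rsh <v2 pod name>

sh-4.2$ /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=mycache:read-resource(recursive=true)"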

Step 4 - Delete Version 1

This last step is pretty straightforward: just delete version 1. At this point, you should have successfully migrated the data grid from one major version to another. The figure here represents the stages we went through:
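Assuming the v1 objects carry a label such as application=datagrid-app (an assumption; use whatever labels your v1 template applied), the cleanup could be as simple as:

$ oc delete all -l application=datagrid-app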

If you would like to practice these recommendations, see “Orchestrating JBoss Data Grid Upgrades on Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. The examples in that repository use Data Grid version 7.2.

 


