
In a post on the Red Hat Developer's Blog, I wrote about the multiple layers of security available when deploying Red Hat Data Grid on Red Hat OpenShift. Another challenging problem I see for customers is performing a no-downtime upgrade of the Red Hat Data Grid images (published on the Red Hat Container Catalog). That's what we're going to tackle in this post.

If you're new to it, Red Hat Data Grid is an in-memory, distributed, NoSQL datastore. With it, your applications can access, process, and analyze data at in-memory speed, delivering a user experience superior to traditional data stores such as relational databases. In-memory data grids have a variety of use cases in today's environments, such as fast data access for low-latency applications, storing objects (NoSQL) in a datastore, achieving linear scalability with data distribution/partitioning, and keeping data highly available across geographies.

Red Hat Data Grid runs on OpenShift like any other application. However, Data Grid is a stateful application, and the upgrade process for stateful applications can be challenging in the container world. The clustering capabilities of Data Grid add yet another layer of complexity.

Red Hat releases container images for a number of its products on its Container Catalog website. Red Hat provides a container health index for each image and updates that index as new vulnerabilities are found or new versions of the product are released. So an image with an A rating on the container health index today may not keep that rating six months down the line, as new issues surface over time in the field.

Why "Rolling" upgrades are not suitable for Red Hat Data Grid?

For clustered applications, rolling upgrades generally make sense. However, there is an important detail in the templates Red Hat publishes for deploying Data Grid. Here are the templates for Data Grid version 7. If you open any template file, you will see that the deployment strategy is set to "Recreate," not "Rolling": two pods from different major versions are not expected to work together in a single Data Grid cluster. The sections below outline how to upgrade Data Grid versions with no downtime.
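For instance, on a Data Grid deployment created from one of these templates, you could confirm the strategy with a command like the following (datagrid-app is an assumed DeploymentConfig name; it should print "Recreate" for the out-of-the-box templates):

$ oc get dc datagrid-app -o jsonpath='{.spec.strategy.type}'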

The objective here is to upgrade Red Hat Data Grid in place with no downtime. While we have experienced no-downtime upgrades with this method, it may not work in every environment or with every application. We strongly recommend practicing this in a development or test environment before attempting it in production.

Four Steps Designed to Upgrade Data Grid

The Operator Framework, introduced in OpenShift 4, should make the upgrade process easier. OperatorHub lists the operators available today for various products. A Data Grid Operator could provide the capability to install, upgrade, scale, monitor, back up, and restore the application.

However, the Red Hat Data Grid Operator is still in development at this time, so until we have one, we need to find an alternative way to upgrade from version 6 to version 7.

So, let's assume we have deployed our Data Grid, which we'll call version 1 (v1), into our OpenShift cluster. Another application is accessing the data grid via a route (it could just as well be a service URL; it does not matter). We now need to upgrade to version 2 (v2) without losing data.

If you would like to play with the upgrade process, have a look at “Orchestrating JBoss Data Grid Upgrades in Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. I recommend trying this process in your dev/test environment before rolling it out in production.

Step 1 - Deploy version 2 and use "Remote Store"

We need to do the following while deploying v2:

  • Define all caches in v2 as a "remote store." A remote store is a cache store that loads data from, and stores data to, a cache running in another data grid.
  • Provide the details (service address and port) of v1 while configuring the remote store in v2.

We cannot deploy v2 the same way we deployed v1. At this point in time, the templates do not expose a way to declare a remote cache store and some other required details, so we need an alternative way to deploy v2: a custom configuration file. Here is what the process looks like:

  • Define a custom configuration file and name it standalone.xml
    • A sample file is located here.
    • See how the "mycache" cache is defined here:

 

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

    • Define the remote data grid server. Replace the Data Grid service URL with the service IP of the v1 service running in OpenShift (a command to look up this IP is included in the example after this list).

 

<outbound-socket-binding name="remote-store-hotrod-server">
    <remote-destination host="<REPLACE SOURCE Data Grid SERVICE URL>" port="11333"/>
</outbound-socket-binding>

 

  • Define a ConfigMap whose data is the standalone.xml file (using the "oc create configmap --from-file ..." construct; see the example after this list).
  • Create a new template for deploying v2. This template includes instructions to use the ConfigMap and mount it at the /opt/datagrid/standalone/configuration/user location.
    • A sample template file is located here.
    • Search for config-volume and observe that the "datagrid-config" ConfigMap is mounted at the above location.
    • Set the deployment strategy to "Rolling." This is a change from the out-of-the-box templates provided by Red Hat, which set the deployment strategy to "Recreate."
    • Also set the "minReadySeconds" parameter to 60, as shown in the sample template file above.
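Pulling these pieces together, the commands could look roughly like the following sketch. The names datagrid-app (the v1 service), datagrid-config, and datagrid-v2-template.yaml are assumptions; substitute the names used in your environment.

$ oc get svc datagrid-app -o jsonpath='{.spec.clusterIP}'   # service IP of v1, used as the remote-destination host

$ oc create configmap datagrid-config --from-file=standalone.xml

$ oc process -f datagrid-v2-template.yaml | oc create -f -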

Once we have made these changes, we are ready to deploy v2. After a successful deployment, we need to remap the route (or the v1 service selector) so that traffic starts flowing to v2's service IP (or to v2's pods via the selector). Once this change is completed, the next request to the data grid will go to v2, and v2 will load data from v1. The figure below describes this state:
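For example, if traffic is switched at the service level by changing the selector, the change might look like this (the service name datagrid and the label value datagrid-v2 are assumptions; use whatever labels your templates apply):

$ oc patch svc datagrid -p '{"spec":{"selector":{"application":"datagrid-v2"}}}'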

Step 2 - Syncing data

Once we have v2 deployed, we need to open a remote shell into one of v2's pods and use CLI commands to copy/sync all data from v1 to v2. Run these commands to perform the data sync:

$ oc rsh <v2 pod name>

sh-4.2$ /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=mycache:synchronize-data(migrator-name=hotrod)"

{"outcome" => "success"}

Run the above command for every cache defined in the data grid (a loop that does this is sketched below).
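If you have several caches, a small shell loop can run the synchronization for each of them. This is only a sketch: the cache names mycache and anothercache are placeholders, and <v2 pod name> must be replaced just as above.

$ for CACHE in mycache anothercache; do
    oc rsh <v2 pod name> /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 \
      -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=${CACHE}:synchronize-data(migrator-name=hotrod)"
  done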

Step 3 - Rolling Upgrade

Now that the data is synced, it makes sense to delete v1. However, we can't just delete v1 yet, because the cache configuration in v2 still uses a remote store and refers to the service IP of v1. If v1 is deleted and a request arrives for a key that does not exist in v2, v2 will try to load the data from v1, and that request will fail.

We need to remove this dependency. A rolling update of v2 with a new configuration (no version change) can do that. Make the following changes to the existing deployment configuration of v2:

  • Change the cache definition - Edit the ConfigMap (which holds the cache configuration) and change the cache definition from

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

to

<distributed-cache name="mycache" mode="SYNC"/>

  • Change the remote-destination - Edit the ConfigMap and change the remote destination host from

<remote-destination host="172.30.232.114" port="11333"/>

to

<remote-destination host="remote-host" port="11333"/>

Note that 172.30.232.114 above is the service IP of v1.

After completing the above changes, roll out the new configuration for v2's deployment config. Since we defined the deployment strategy as "Rolling," a rolling update will start. Remember, we set a new parameter, "minReadySeconds," in the previous step. We set it because we don't want OpenShift to kill an existing pod the moment a new pod comes up. If we don't wait for that minimum time, it is highly likely that the new pod will not have replicated data (when it joins the cluster) before OpenShift kills the existing pod and continues with the rolling update, which may end up causing data loss.
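As a sketch, assuming the v2 deployment config is named datagrid-v2, triggering and then watching the rollout could look like this (a ConfigMap edit alone does not trigger a new deployment, so we start one explicitly):

$ oc rollout latest dc/datagrid-v2

$ oc rollout status dc/datagrid-v2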

When the rolling update eventually completes, you will have new pods with local cache definitions that no longer refer to v1. We have successfully removed the dependency. The final state looks like the figure below:
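If you want to double-check that the remote store is gone, one option is to read the cache resource back through the same management CLI used earlier. Treat this as an assumption rather than a guaranteed check; the exact resource layout depends on your Data Grid version.

$ oc rsh <v2 pod name>

sh-4.2$ /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=mycache:read-resource(recursive=true)"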

Step 4 - Delete Version 1

This last step is pretty straightforward: just delete version 1. At this point, you should have successfully migrated the data grid from one major version to another. The figure here represents the stages we went through:
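Assuming the v1 objects carry a label such as application=datagrid-app (an assumption; use whatever labels your v1 template applied), the cleanup could be as simple as:

$ oc delete all -l application=datagrid-app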

If you would like to practice these recommendations, see “Orchestrating JBoss Data Grid Upgrades on Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. The examples in that repository use Data Grid version 7.2.

 


