Everyone is into Kubernetes these days, and everyone wants their applications to run in a highly available manner, but what happens when the underlying infrastructure is failing? What if there’s a network outage, hardware failure or even software bugs?
While stateless workloads are usually immune to these kinds of failures (the cluster can always schedule more instances somewhere else), the real problem begins with stateful workloads.
When there’s a node failure we want to achieve two goals:
- Minimize application downtime — we want to reschedule affected applications on other healthy node(s) to minimize their downtime.
- Recover compute capacity — the failing node is malfunctioning, so its compute power can’t be used to run any workloads. We want to restore it as fast as possible to regain compute capacity for the cluster.
Stateful workloads with run-once semantics can’t be immediately rescheduled on other nodes. The cluster doesn’t really know if that workload is still running on the unhealthy node (perhaps it’s just a network outage), and scheduling the workload into other nodes introduces a risk that two instances would run in parallel, which would violate the at-most-one semantics and might result in data corruption or diverging datasets.
Kubernetes won’t run that workload anywhere else until it can make sure the workload is no longer running on the unhealthy node. However, if we can’t communicate with the node, there is no way to know that.
We need to bring the node into a “safe” state (this process is called fencing) and one of the most common forms of fencing is to power the node off using IPMI. Once the host running the node is powered off we can definitely assume that there are no running workloads there. To signal that to the cluster, we delete the node.
To power it off we have two options:
- Using the server's management interface, but this requires BMC credentials, and obviously network connectivity to the management interface.
- Causing the unhealthy node to reboot itself in case of a failure.
The first one was implemented as an opt-in feature in OpenShift and is out of scope for this post.
Taking the poison pill at the right time
To achieve this, each node will run an agent that will constantly monitor its health status.
An external controller, machine-healthcheck-controller (MHC), determines if nodes are unhealthy, after which it applies an annotation to the node/machine object.
The poison-pill agent periodically looks for the unhealthy annotation. Once detected, the node takes the poison pill and reboots itself.
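As a rough sketch (the annotation key, polling interval, and injected callbacks here are assumptions for illustration, not the operator's actual implementation), the agent's loop might look like:

```python
import time

UNHEALTHY_ANNOTATION = "example.com/unhealthy"  # hypothetical key, not the real one

def should_take_poison_pill(annotations):
    """Return True if the external MHC has marked this node unhealthy."""
    return UNHEALTHY_ANNOTATION in annotations

def watch_for_poison_pill(fetch_annotations, reboot, interval=5.0, max_checks=None):
    """Periodically poll this node's annotations; reboot once marked unhealthy.
    fetch_annotations() and reboot() are injected so the logic stays testable."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if should_take_poison_pill(fetch_annotations()):
            reboot()
            return True
        checks += 1
        time.sleep(interval)
    return False
```

Injecting the API call and the reboot action keeps the decision logic separate from the cluster plumbing, which is also what makes the failure cases below easier to reason about.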
But there are some challenges:
- What happens if the unhealthy node can’t reach the api-server and therefore can’t see whether the unhealthy annotation is there?
- How do we delete the node only once no workload is running on it?
- Resource starvation or bugs might prevent the poison pill agent from detecting its own health status.
Peers to the rescue
If the unhealthy node can’t reach the api-server, we could simply assume it’s unhealthy, but that would produce false positives: if the failure is in the api-server itself, every node would draw the same conclusion and we’d get a reboot storm across the cluster.
To distinguish between a real node failure and an api-server failure, the poison-pill agent contacts its peers. If none of the peers can access the api-server either, it is safe to assume the api-server itself has failed, and the node does not reboot itself.
If other peers can reach the API server and report that the node is unhealthy, or if none of the peers respond, the node will reboot itself.
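The decision logic described above can be condensed into a pure function. The response values here are illustrative; they are not the operator's actual wire protocol:

```python
from enum import Enum

class PeerResponse(Enum):
    YOU_ARE_UNHEALTHY = "unhealthy"  # peer reached the api-server; this node is marked unhealthy
    YOU_ARE_HEALTHY = "healthy"      # peer reached the api-server; this node is not marked
    API_ERROR = "api-error"          # peer could not reach the api-server either
    NO_RESPONSE = "no-response"      # peer did not answer at all

def should_reboot(responses):
    """Decide whether a node that lost api-server access should reboot itself."""
    if not responses or all(r is PeerResponse.NO_RESPONSE for r in responses):
        # No peer is reachable at all: assume we are isolated and reboot.
        return True
    if any(r is PeerResponse.YOU_ARE_UNHEALTHY for r in responses):
        # A peer that can see the api-server confirms we are marked unhealthy.
        return True
    if all(r in (PeerResponse.API_ERROR, PeerResponse.NO_RESPONSE) for r in responses):
        # Every reachable peer also lost the api-server: api-server failure, stay up.
        return False
    # At least one peer sees the api-server and we are not marked unhealthy.
    return False
```

Keeping this as a pure function over peer responses makes the false-positive case (an api-server outage) explicit and easy to test.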
Respecting Singleton-like semantics
Some Kubernetes objects such as StatefulSets and exclusive Volume mounts are designed around the idea that they are only ever active in at most one location. In order to respect these semantics, we must only delete the node when no workload is running there.
The existing workloads will stop running when the system is rebooting, so we need to remove the node object only after the reboot has completed its ‘off’ phase.
How can other peers tell if the unhealthy node has been rebooted if it’s unresponsive?
The premise of Kubernetes poison pill fencing is that, within a known and finite period of time after the node is declared unhealthy, either the node will see the unhealthy annotation in etcd (directly or via a peer), or it will terminate.
Having calculated and configured the period (based on heartbeat and retry intervals), peers that have waited for that amount of time can safely assume that either the node was already dead, or is so now.
To minimise this timeout, as well as make sure that it is bounded, we make use of a watchdog device to panic the node rather than perform a normal shutdown or reboot. This avoids the need to factor in every possible service on the machine and their worst case shutdown times, as well as protecting against bugs, hung services, and resource starvation.
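The bound that peers wait for can be sketched as a simple worst-case sum. The exact terms the operator uses are an assumption here; the point is that every term is known and finite:

```python
def safe_to_assume_rebooted_after(api_check_interval, api_check_retries,
                                  peer_dial_timeout, peer_count,
                                  watchdog_timeout, safety_margin=30.0):
    """Worst-case seconds, from the unhealthy annotation appearing, until the
    node is guaranteed to be down (all terms in seconds; illustrative only).

    Worst case: the agent exhausts its api-server retries, then dials every
    peer sequentially, and then the watchdog stops being fed and fires."""
    detect = api_check_interval * (api_check_retries + 1)
    peer_round = peer_dial_timeout * peer_count
    return detect + peer_round + watchdog_timeout + safety_margin
```

Because the watchdog fires via a kernel panic rather than a graceful shutdown, the `watchdog_timeout` term is a hard upper bound and no per-service shutdown time appears in the sum.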
Since the timeout is an upper bound for the reboot to commence, the node could come up by then and take new workloads before the Node object is deleted, which would result in us actively creating the exact conditions we’re trying to prevent - starting multiple copies of particular workloads.
This can happen if the reboot fixed the failure, or if the MHC and the cluster scheduler have different criteria for a “healthy” node (e.g. disk pressure could be defined as unhealthy by the MHC while the scheduler can still assign workloads).
One possibility is for the watchdog to trigger a shutdown rather than a reboot, but then we lose compute capacity and require manual admin intervention to power on the host.
Instead, the poison pill agents (on all nodes) will try to mark the unhealthy node as unschedulable. This is a safety mechanism such that even if the node reboots, it won’t be assigned to run any workloads.
The poison pill agents will only delete the node if it’s marked as unschedulable.
To make this work, we need to handle three cases:
- The unhealthy node can reach the api-server — the node can simply verify the unschedulable taint is there before rebooting.
- The unhealthy node can’t reach the api-server but can contact other peers — the peers will instruct a reboot only if the unschedulable taint exists.
- The unhealthy node can’t reach the api-server or any of its peers — we assume that the healthy nodes will have added the unschedulable taint before the unhealthy machine has finished contacting all of its peers and reboots.
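The three cases above can be collapsed into one decision function. The taint key shown is the standard Kubernetes cordon taint; the actual operator may use its own, so treat this as a sketch:

```python
UNSCHEDULABLE = "node.kubernetes.io/unschedulable"  # standard cordon taint; the operator may differ

def ok_to_reboot(api_reachable, own_taints=None, peer_confirms_taint=None):
    """Combine the three cases: verify the unschedulable taint directly,
    via a peer, or fall back to the timing assumption when fully isolated.
    Returns True when the node may proceed with the reboot."""
    if api_reachable:
        # Case 1: check the taint on our own Node object.
        return UNSCHEDULABLE in (own_taints or [])
    if peer_confirms_taint is not None:
        # Case 2: a peer that can reach the api-server checked for us.
        return peer_confirms_taint
    # Case 3: fully isolated; rely on healthy peers having tainted us in time.
    return True
```

Note that only case 3 is an assumption about timing; cases 1 and 2 positively verify the taint before the reboot proceeds.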
After node deletion, the poison pill agents will re-create the node object without the unschedulable taint, allowing the scheduler to re-assign workloads to that node.
Resource Starvation
Since we’re relying on timeouts to assume that the machine has been rebooted, we need to make sure the reboot takes place even under resource starvation. The poison pill uses a watchdog timer, which must be fed periodically. If there’s resource starvation, the process feeding the watchdog will be starved as well, the watchdog will expire, and the reboot will be triggered anyway.
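A toy model makes the failure mode concrete. This is a simulation of the watchdog semantics, not the real `/dev/watchdog` interface; the clock is injected so the behavior is testable:

```python
class SoftWatchdog:
    """Toy model of a watchdog timer: if feed() is not called within
    `timeout` seconds, the watchdog fires and the machine would reset.
    `now` is an injected clock function, so no real time passes in tests."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.now = now
        self.last_fed = now()
        self.fired = False

    def feed(self):
        # A healthy agent calls this periodically; a starved one cannot.
        self.last_fed = self.now()

    def check(self):
        # Real watchdog hardware performs this check itself and resets the
        # machine; here we just record that it would have fired.
        if self.now() - self.last_fed > self.timeout:
            self.fired = True
        return self.fired
```

The key property is that the reset path does not depend on the starved process at all: once feeding stops, the expiry is enforced externally.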
Wrapping Up
Minimizing downtime is an important goal for everyone. Distributed computing and stateful applications introduce some challenges and difficulties. The poison pill controller is trying to solve some of these.
The implementation is at a PoC level and the source code is available on the poison pill repo.