Everyone is into Kubernetes these days, and everyone wants their applications to run in a highly available manner, but what happens when the underlying infrastructure is failing? What if there’s a network outage, a hardware failure, or even a software bug?
While stateless workloads are usually immune to these kinds of failures (the cluster can always schedule more instances somewhere else), the real problem begins with stateful workloads.
When there’s a node failure we want to achieve two goals:
- Minimize application downtime — we want to reschedule affected applications on other healthy node(s) to minimize their downtime.
- Recover compute capacity — the failing node is malfunctioning, so its compute power can’t be used to run any workloads. We want to restore it as fast as possible to regain compute capacity for the cluster.
Stateful workloads with run-once semantics can’t be immediately rescheduled on other nodes. The cluster doesn’t really know whether that workload is still running on the unhealthy node (perhaps it’s just a network outage), and scheduling the workload onto other nodes introduces the risk that two instances run in parallel, which would violate the at-most-one semantics and might result in data corruption or diverging datasets.
Kubernetes won’t run that workload anywhere else until it can make sure the workload is no longer running on the unhealthy node. However, if we can’t communicate with the node, there is no way to know that.
We need to bring the node into a “safe” state (this process is called fencing) and one of the most common forms of fencing is to power the node off using IPMI. Once the host running the node is powered off we can definitely assume that there are no running workloads there. To signal that to the cluster, we delete the node.
To power it off we’ve got two options:
- Using the server's management interface, but this requires BMC credentials, and obviously network connectivity to the management interface.
- Causing the unhealthy node to reboot itself in case of a failure.
The first one was implemented as an opt-in feature in OpenShift and is out of scope for this post.
Taking the poison pill at the right time
To implement the second option, each node runs an agent that constantly monitors its health status.
An external controller, machine-healthcheck-controller (MHC), determines if nodes are unhealthy, after which it applies an annotation to the node/machine object.
The poison-pill agent periodically looks for the unhealthy annotation. Once detected, the node takes the poison pill and reboots itself.
But there are some challenges:
- What happens if the unhealthy node can’t reach the api-server, and can’t see whether the unhealthy annotation is there?
- How do we delete the node only when no workload is running on it?
- What if resource starvation or bugs prevent the poison pill agent from detecting its health status?
Peers to the rescue
If the unhealthy node can’t reach the api-server, we could simply assume it’s unhealthy, but when the failure exists only on the api-server nodes this assumption is a false positive, and acting on it would cause a reboot storm across the cluster.
To distinguish a real node failure from an api-server failure, the poison-pill agent contacts its peers. If none of the peers can access the api-server either, the agent assumes an api-server failure and does not reboot the node.
If other peers can reach the API server and report that the node is unhealthy, or if none of the peers respond, the node will reboot itself.
Respecting Singleton-like semantics
Some Kubernetes objects such as StatefulSets and exclusive Volume mounts are designed around the idea that they are only ever active in at most one location. In order to respect these semantics, we must only delete the node when no workload is running there.
The existing workloads stop running when the system reboots, so we must remove the node object only after the reboot has completed its ‘off’ phase.
How can other peers tell if the unhealthy node has been rebooted if it’s unresponsive?
The premise of Kubernetes poison pill fencing is that within a known, finite period of time from the node being declared unhealthy, either the node will see the unhealthy annotation in etcd (directly or via a peer), or it will terminate.
Having calculated and configured the period (based on heartbeat and retry intervals), peers that have waited for that amount of time can safely assume that either the node was already dead, or is so now.
To minimise this timeout, as well as make sure that it is bounded, we make use of a watchdog device to panic the node rather than perform a normal shutdown or reboot. This avoids the need to factor in every possible service on the machine and their worst case shutdown times, as well as protecting against bugs, hung services, and resource starvation.
Since the timeout is an upper bound for the reboot to commence, the node could come up by then and take new workloads before the Node object is deleted, which would result in us actively creating the exact conditions we’re trying to prevent - starting multiple copies of particular workloads.
This can happen if the reboot fixed the failure, or if the MHC and the cluster scheduler use different criteria for a “healthy” node (e.g. the MHC might consider disk pressure unhealthy while the scheduler still assigns workloads to the node).
One possibility is for the watchdog to trigger a shutdown rather than a reboot, but then we lose compute capacity and require manual admin intervention to power on the host.
Instead, the poison pill agents (on all nodes) try to mark the unhealthy node as unschedulable. This is a safety mechanism: even if the node reboots and comes back up, it won’t be assigned any workloads.
The poison pill agents will only delete the node if it’s marked as unschedulable.
To make this work, we need to handle three cases:
- The unhealthy node can reach the api-server — it simply verifies that the unschedulable taint is there before rebooting.
- The unhealthy node can’t reach the api-server but can contact other peers — the peers instruct it to reboot only if the unschedulable taint exists.
- The unhealthy node can reach neither the api-server nor any of its peers — we assume the healthy nodes will have added the unschedulable taint before the unhealthy node finishes contacting all of its peers and reboots.
After node deletion, the poison pill agents will re-create the node object without the unschedulable taint, allowing the scheduler to re-assign workloads to that node.
Resource Starvation
Since we’re relying on timeouts to assume that the machine has been rebooted, we need to make sure that the reboot takes place even under resource starvation. The poison pill agent uses a watchdog timer, which must be fed periodically. If the node is starved for resources, the agent’s feeding of the watchdog will be starved as well, and the watchdog will trigger a reboot.
Wrapping Up
Minimizing downtime is an important goal for everyone. Distributed computing and stateful applications introduce some challenges and difficulties. The poison pill controller is trying to solve some of these.
The implementation is at a PoC level and the source code is available on the poison pill repo.