This document summarizes the quay.io incidents that occurred on May 19 and May 28, 2020. It describes the nature of the incidents and the steps being taken to reduce the likelihood of similar outages in the future.
Incidents Summary
Red Hat's quay.io container registry service experienced two periods of degraded performance and availability in May 2020. The first incident occurred from approximately 7 a.m. May 19, 2020, to 1 a.m. May 20, 2020, UTC. The second incident occurred from approximately 11 a.m. to 4 p.m. May 28, 2020, UTC.
During these periods, users experienced a range of outcomes, including slow container image access times and inability to retrieve container images. These issues affected several other Red Hat services, including OpenShift Cluster Manager, which is used to deploy and manage OpenShift clusters.
Red Hat Site Reliability Engineering teams concluded that several factors combined to cause these incidents. These factors include:
- non-optimal quay.io tuning in the areas of process parallelization and database access
- traffic surges from simultaneous OpenShift platform upgrades
Following these two incidents, several actions have been defined to improve quay.io availability, reliability, and continuity. Some of these items are already complete, others are being actively worked on, and some are being researched. They are described in the sections below.
Completed Actions
A number of actions have already been completed to address these two incidents. These actions include:
- Redeploying quay.io on a 4.3.18 OpenShift Dedicated cluster.
- Indefinitely disabling garbage collection to reduce database load.
- Optimizing several aspects of quay.io in the areas of process parallelization and database access. These optimizations will prevent the database from being driven to the point of lockup.
- Doubling the underlying quay.io database size/capacity to handle more traffic.
- Suspending certain OpenShift Dedicated z-stream updates to prevent potential traffic spikes during OpenShift Dedicated upgrades.
In addition to these completed actions, others are pending and are described below.
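To illustrate the kind of tuning involved in "process parallelization and database access," the sketch below caps how many workers may use the database at once, so a traffic surge results in fast rejections instead of an overloaded database. This is a generic pattern, not quay.io's actual change; the class name `BoundedDatabaseGate` and its parameters are hypothetical.

```python
import threading
from contextlib import contextmanager


class BoundedDatabaseGate:
    """Caps how many workers may hold a database slot at the same time.

    Hypothetical illustration only; not taken from the quay.io codebase.
    """

    def __init__(self, max_concurrent: int, acquire_timeout: float = 1.0):
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout = acquire_timeout

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing forever when all slots are taken.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("database is busy, try again later")
        try:
            yield
        finally:
            # Always return the slot, even if the query raised.
            self._slots.release()
```

A worker would wrap each query in `with gate.slot(): ...` and turn a `TimeoutError` into a retryable error for the client. The limit and timeout values would need to come from load testing.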
Short-term Actions
The following actions are in progress and are targeted for completion by June 11, 2020:
- Creating database “read-replicas” that will distribute database load and prevent database lockups
- Appreciably accelerating the quay.io redeployment process for faster incident response
- Creating a quay.io hot standby in a separate region
- Adding pod caching for faster quay.io restart time
- Improving monitoring for more visibility into quay.io health and the ability to detect potential problems
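As a rough illustration of how read replicas distribute database load, the sketch below sends read queries to replicas in round-robin order and everything else to the primary. It is a simplified assumption of how such routing can work, not a description of quay.io's design; `ReplicaRouter` and the host names in the usage note are hypothetical.

```python
import itertools
from typing import Sequence


class ReplicaRouter:
    """Routes reads across replicas and writes to the primary.

    Hypothetical illustration only; not taken from the quay.io codebase.
    """

    def __init__(self, primary: str, replicas: Sequence[str]):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._read_cycle = itertools.cycle(list(replicas) or [primary])

    def target_for(self, sql: str) -> str:
        # Simplification: only statements starting with SELECT count as reads.
        # Locking reads such as SELECT ... FOR UPDATE would need the primary.
        words = sql.lstrip().split(None, 1)
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._read_cycle) if is_read else self.primary
```

For example, `ReplicaRouter("db-primary", ["db-replica-1", "db-replica-2"])` alternates `SELECT` statements between the two replicas and sends `INSERT` or `UPDATE` statements to `db-primary`. A real deployment would also have to account for replication lag, since a replica may briefly return stale data.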
Long-term Actions
The following actions have been proposed and are under consideration:
- Investigating potential exacerbating networking issues
- Conducting substantial load and performance analysis to identify additional parallelization/database optimizations
- Reducing quay.io logging verbosity
- Investigating additional caching methods
- Investigating per-account rate-limiting
- Upgrading/replacing various database components and sub-systems
- Implementing numerous process and documentation improvements
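Per-account rate-limiting, one of the ideas under investigation above, is commonly implemented with a token bucket. The sketch below shows that general technique under stated assumptions; it is not the approach Red Hat has chosen, and `AccountRateLimiter` and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class AccountRateLimiter:
    """Token-bucket limiter that keeps one bucket per account.

    Hypothetical illustration only; not thread-safe as written.
    """

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_second
        self._burst = float(burst)
        self._clock = clock
        # account -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, account: str) -> bool:
        now = self._clock()
        # A new account starts with a full bucket.
        tokens, last = self._buckets.get(account, (self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[account] = (tokens, now)
        return allowed
```

With `AccountRateLimiter(rate_per_second=1.0, burst=2)`, an account can make two requests back to back and is then held to about one request per second, while other accounts are unaffected. A registry would typically answer a rejected request with HTTP 429 so that clients back off.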