Microservices, and the Observability Macroheadache

October 4, 2019 | Alex Handy | 3-minute read

Moving to a microservice architecture, deployed on a cloud platform such as OpenShift, can have significant benefits. However, it makes it harder to understand how your business requests are executed across a potentially large number of microservices.

If we want to locate where problems occurred in the execution of a business request, whether due to performance issues or errors, we may have to dig through the metrics and logs of every service that could have been involved. Metrics can give a general indication of where problems have occurred, but they are not specific to individual requests. Logs may contain errors or warnings, but these cannot necessarily be correlated with the individual requests of interest.

Distributed tracing is a technique that has become indispensable in helping users understand how their business transactions execute across a set of collaborating services. A trace instance documents the flow of a business transaction, including interactions between services, internal work units, relevant metadata, latency details and contextualized logging. This information can be used to perform root cause analysis to locate the problem quickly.
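To make the shape of a trace instance more concrete, here is a minimal sketch of a span record and a helper that prints the call tree. The field names (`trace_id`, `span_id`, `parent_id`, `duration_us` and so on) and the sample services are assumptions chosen for illustration, not the data model of any particular tracer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_us: int             # start offset in microseconds
    duration_us: int          # latency of this unit of work
    tags: dict = field(default_factory=dict)  # metadata such as error flags
    logs: list = field(default_factory=list)  # contextualized log entries


def render(spans, parent_id=None, depth=0):
    """Return the trace as indented lines, children ordered by start time."""
    children = sorted(
        (s for s in spans if s.parent_id == parent_id),
        key=lambda s: s.start_us,
    )
    lines = []
    for span in children:
        lines.append(f"{'  ' * depth}{span.service}:{span.operation} ({span.duration_us} us)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


# Illustrative data only: a checkout request that fans out to two services.
trace = [
    Span("t1", "a", None, "frontend", "GET /checkout", 0, 900),
    Span("t1", "b", "a", "cart", "GET /cart", 100, 200),
    Span("t1", "c", "a", "payment", "POST /charge", 350, 500, tags={"error": True}),
]

print("\n".join(render(trace)))
```

The parent/child links are what let a tracing tool reconstruct the flow of a request across services; the tags and logs carry the metadata used for root cause analysis.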

How Does OpenShift Service Mesh Help?

OpenShift Service Mesh simplifies the implementation of services by moving some capabilities, such as circuit breaking and intelligent routing, into the platform. These capabilities include the ability to report tracing data for the HTTP interactions between services.

This means that the service itself does not need to support distributed tracing directly: the sidecar proxy handles sampling decisions, creates spans (the building blocks of a trace instance) and ensures that consistent metadata is reported.

The only responsibility that OpenShift Service Mesh cannot take over is the propagation of the trace context between inbound and outbound requests within the service itself. This needs to be implemented by the service, either by copying the relevant headers from the inbound request to the outbound request, or by using a suitable library to handle it.
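As a rough sketch of the header-copying approach, the helper below picks the trace context out of an inbound request's headers so it can be attached to any outbound call. The header list is an assumption: it follows the B3-style names commonly documented for Istio-based meshes and should be checked against the mesh version you actually run. The function name `propagation_headers` is hypothetical.

```python
# Assumed header names (B3-style, as commonly documented for Istio-based
# meshes); verify against your own mesh before relying on this list.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def propagation_headers(inbound_headers):
    """Return only the trace-context headers found on the inbound request.

    HTTP header names are case-insensitive, so keys are lowercased first.
    """
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Illustrative inbound request headers, as a plain mapping.
inbound = {
    "X-Request-Id": "req-42",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
    "Accept": "application/json",
}

outbound_headers = propagation_headers(inbound)
print(outbound_headers)
```

The resulting mapping would be merged into the headers of every outbound HTTP call made while handling that request, so the sidecar can attach the new spans to the same trace.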

Jaeger to the Rescue

Instrumenting the service mesh and your business application is only one part of the story. Presenting this data in a way that is easy to consume and understand is the role of a tracing solution. That's why OpenShift Service Mesh bundles a component called Jaeger, which can be used to collect, store, query and visualize the tracing data.

The Jaeger UI allows users to search for trace instances that meet certain criteria, including service name, operation name, tag names and values, a time frame, and the minimum or maximum duration of the spans they contain.
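The search criteria above can be pictured as a filter over the spans of each trace. The following is only a sketch of that idea, assuming spans are plain dictionaries with hypothetical keys (`service`, `operation`, `tags`, `start_us`, `duration_us`); it is not how Jaeger implements its queries.

```python
def trace_matches(spans, service=None, operation=None, tags=None,
                  start_after_us=None, start_before_us=None,
                  min_duration_us=None, max_duration_us=None):
    """Return True if at least one span satisfies every supplied criterion."""
    for span in spans:
        if service is not None and span["service"] != service:
            continue
        if operation is not None and span["operation"] != operation:
            continue
        if tags and any(span.get("tags", {}).get(k) != v for k, v in tags.items()):
            continue
        if start_after_us is not None and span["start_us"] < start_after_us:
            continue
        if start_before_us is not None and span["start_us"] > start_before_us:
            continue
        if min_duration_us is not None and span["duration_us"] < min_duration_us:
            continue
        if max_duration_us is not None and span["duration_us"] > max_duration_us:
            continue
        return True
    return False


# Illustrative trace: two spans from two made-up services.
example_trace = [
    {"service": "frontend", "operation": "GET /checkout",
     "start_us": 0, "duration_us": 900, "tags": {}},
    {"service": "payment", "operation": "POST /charge",
     "start_us": 350, "duration_us": 500, "tags": {"error": True}},
]

slow_payment = trace_matches(example_trace, service="payment", min_duration_us=400)
failed_frontend = trace_matches(example_trace, service="frontend", tags={"error": True})
print(slow_payment, failed_frontend)
```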

The UI shows a scatter plot of the trace instance durations so that users can focus on performance issues. The list of results also highlights trace instances that represent error situations.

Once a trace instance of interest is selected, the UI shows its individual spans in the style of a Gantt chart. Each line represents a unit of work, typically called a 'span' in the distributed tracing world, color-coded by the service it represents, with a length that reflects its duration. This lets a user focus on the services and operations where most of the time is spent for the business transaction.

When a span is selected, it will be expanded to show further details, including tag names/values and log entries. This can provide additional information that may help diagnose issues.

It is also possible to compare the structure of trace instances against each other, by selecting multiple trace instances on the search page and pressing the “Compare Traces” button.

This feature is useful for narrowing down the search space in traces with a large number of spans. The visualization highlights operations that were added or are missing between the two trace instances.
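The idea behind such a comparison can be sketched by counting how often each (service, operation) pair occurs in two traces and reporting the differences. This is a deliberate simplification for illustration, with hypothetical dictionary keys and sample data; the Jaeger UI compares the structure of the traces rather than flat counts.

```python
from collections import Counter


def operation_counts(spans):
    """Count occurrences of each (service, operation) pair in a trace."""
    return Counter((s["service"], s["operation"]) for s in spans)


def compare_traces(trace_a, trace_b):
    """Report operations that appear more often in one trace than the other."""
    counts_a = operation_counts(trace_a)
    counts_b = operation_counts(trace_b)
    return {
        # Counter subtraction keeps only positive differences.
        "only_in_a": sorted((counts_a - counts_b).elements()),
        "only_in_b": sorted((counts_b - counts_a).elements()),
    }


# Illustrative data: the second request skipped the cache and retried a charge.
healthy = [
    {"service": "frontend", "operation": "GET /checkout"},
    {"service": "cache", "operation": "GET price"},
    {"service": "payment", "operation": "POST /charge"},
]
degraded = [
    {"service": "frontend", "operation": "GET /checkout"},
    {"service": "payment", "operation": "POST /charge"},
    {"service": "payment", "operation": "POST /charge"},
]

diff = compare_traces(healthy, degraded)
print(diff)
```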

One Less Headache for your Microservices Journey

While distributed tracing on its own is not the monitoring panacea that DevOps teams require, it is a prerequisite for understanding the root cause of problems that will arise in complex, distributed architectures. When used in conjunction with other observability signals, such as metrics and logging, it can help diagnose problems and provide a more comprehensive view of the health of our business applications.

About the author

Alex Handy

Principal Product Marketing Manager

Red Hatter since 2018, technology historian and founder of The Museum of Art and Digital Entertainment. Two decades of journalism mixed with technology expertise, storytelling and oodles of computing experience from inception to ewaste recycling. I have taught or had my work used in classes at USF, SFSU, AAU, UC Law Hastings and Harvard Law.

I have worked with the EFF, Stanford, MIT, and Archive.org to brief the US Copyright Office and change US copyright law. We won multiple exemptions to the DMCA, accepted and implemented by the Librarian of Congress. My writings have appeared in Wired, Bloomberg, Make Magazine, SD Times, The Austin American Statesman, The Atlanta Journal Constitution and many other outlets.

I have been written about by the Wall Street Journal, The Washington Post, Wired and The Atlantic. I have been called "The Gertrude Stein of Video Games," an honor I accept, as I live less than a mile from her childhood home in Oakland, CA. I was project lead on the first successful institutional preservation and rebooting of the first massively multiplayer game, Habitat, for the C64, from 1986: https://neohabitat.org . I've consulted and collaborated with the NY MOMA, the Oakland Museum of California, Cisco, Semtech, Twilio, Game Developers Conference, NGNX, the Anti-Defamation League, the Library of Congress and the Oakland Public Library System on projects, contracts, and exhibitions.
