
As a system administrator, have you ever found yourself in this situation? Your production systems have been humming along nicely when, out of the blue, the phone rings.  “What just happened!? Everything slowed to a crawl, but now it’s fine again?”

Where to even start? Your systems are complex. There are so many moving parts potentially contributing to the problem that root causes can be anywhere—and worse, they can be transient. There are databases, networked storage, firewalls, applications, JVMs, VMs, containers, kernels, power management, backups, live migrations, database schema changes—and that’s just on your development laptop! It’s even more complex in production datacenters and the cloud, where you’ve got it all operating on large, real-life data sets.

Red Hat Enterprise Linux (RHEL) provides tools that can help. Let’s look at a simple command-line technique and tools that will help you as the first responder to a performance emergency, beyond dashboards, beyond top, iostat, pidstat, netstat, vmstat and the rest, so you can gain deeper understanding and find root causes.

Solving performance problems with Performance Co-Pilot (PCP)

I’ll use the Performance Co-Pilot (PCP) toolkit here, as it’s readily available in RHEL, has good metric coverage out of the box, and makes it easy to add your own metrics. The ideas here can be implemented with other tooling as well, or by combining PCP with other solutions.

Get started

First of all, we need instrumentation across the board: we want visibility into anything that could be contributing to the performance problem. In PCP, this is managed by pmcd(1). It provides a common language (Screencast 1) for that instrumentation, where each metric has metadata: for example, human-readable names, units, semantics and metric type (Screencast 2).

Screencast 1

Screencast 2
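As a concrete illustration of that metadata, on a host with pcp installed and pmcd running you can query any metric with pminfo(1); disk.dev.read shown here is a standard PCP metric name:

```shell
# pminfo queries pmcd for a metric's metadata and values:
#   -d  the descriptor (data type, units, semantics)
#   -t  the one-line help text
#   -f  the current values, one per instance (e.g. per disk)
pminfo -dtf disk.dev.read
```

This only works against a live pmcd, so treat it as a setup check rather than something to script against.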

Second, we need a “flight recorder”: something always-on and lightweight, reliably recording our systems’ activity through the good times and bad. In PCP, this is handled by pmlogger(1).

You can use this command to get started:

yum install pcp-zeroconf
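The pcp-zeroconf package enables the pmcd and pmlogger services automatically. Run on the RHEL host, these commands confirm the flight recorder is active and show where its archives land (one directory per monitored host):

```shell
# Confirm the collector (pmcd) and recorder (pmlogger) services are running
systemctl status pmcd pmlogger --no-pager

# Archives are written below /var/log/pcp/pmlogger/, per hostname
ls /var/log/pcp/pmlogger/$(hostname)
```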

The final piece of the puzzle is tooling that takes your recordings and time windows of interest (“Tuesday 10 a.m. all was well, but in the half hour after midday everything broke loose”), analyzes like-for-like metrics amongst the many thousands of recorded values, and reports back the metrics with the greatest change, automating the task of separating out the performance noise. In PCP, this tool is pmdiff(1):

pmdiff -X ./cull --threshold 10 --start @10:00 --finish @10:30 --begin @12:00 --end @12:30 ./archives/app3/20120510 | less

What we’re seeing here is:

  • A recording from the day of our performance crisis is passed into pmdiff (./archives/app3/20120510) along with two time windows of interest (--start/--finish for the “before” window, --begin/--end for “after”)

  • The tool reports four columns: “before” values, “after” values, how much those average values changed (Ratio), and individual performance metric names (Metric-Instance).

  • The --threshold parameter (10) sets the cutoff at which the Ratio column is culled: metrics are reported only when their average value changed by 10x (or more), or 1/10th (or less), between the two time windows.

  • The first five rows show a Ratio of |+|, which simply indicates that the average value changed from completely zero “before” to non-zero “after.” Interestingly, these are all metrics relating to the Linux virtual memory subsystem: our first insight.

  • The next 15 or so rows contain the value >100 in the Ratio column (i.e., the average values for these metrics during the second time window have increased by more than 100 times!). Again, we have strong indicators that page compaction (a function of the kernel’s virtual memory subsystem) is behaving in radically different ways across the two time windows. We also see that aggregate disk read I/O is way up, and we can identify the specific device (the sda metric instance) behind this change.
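To make the comparison concrete, here is a small sketch (with invented metric names and values) of the culling pmdiff automates: take each metric’s average in the “before” and “after” windows, and keep only those whose ratio crossed the threshold, flagging zero-to-nonzero transitions as |+|:

```shell
# Input lines: metric-name before-average after-average (invented examples).
# Keep metrics whose ratio is >= threshold or <= 1/threshold; report "|+|"
# when the average went from zero to non-zero. Results land in pmdiff_sketch.out.
awk -v threshold=10 '
{
    if ($2 == 0 && $3 > 0)
        print $1, "|+|"                      # zero before, non-zero after
    else if ($2 > 0) {
        ratio = $3 / $2
        if (ratio >= threshold || ratio <= 1 / threshold)
            printf "%s %.1f\n", $1, ratio    # big enough change to report
    }
    # anything else is culled as noise
}' <<'EOF' | tee pmdiff_sketch.out
vm.compact_stall 0 52
disk.dev.read.sda 11 1600
kernel.all.load.1min 1.0 1.2
EOF
```

Here vm.compact_stall reports |+|, disk.dev.read.sda reports a large ratio, and the barely-changed load average is culled, which is exactly the noise-filtering behavior described above.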

There’s plenty more we can glean from this recording as we continue to dig into it. However, we’ve already moved from having no real idea where to look, to being able to hold a coherent conversation with a kernel virtual memory expert about root causes of a transient performance problem! (Or, we might do some research ourselves now to find out what “direct page reclaim” and “memory compaction” involve.)

I’m sure you can now see the value of the general technique here. We could just as easily be having follow-up discussions with our database administrators, network specialists, or any of our other teams about contributing factors in each of their specialized areas. Perhaps this technique helps us better understand the root cause ourselves. With well-instrumented systems, these tools very quickly give us insights, and often in places that we might not have considered looking.

Make it personal

If your development teams are instrumenting the applications they build with detailed metrics, you can quickly achieve performance insights into your company’s applications too. Since RHEL system services, kernels, containers, databases, and other components in your stack have instrumentation available, you can quickly see which parts of an application degrade or improve—along with any changes in access patterns, cache sizes, new RHEL versions, kernel configurations, additional I/O, or any other component variation.
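One way to get those application metrics into PCP (a hedged sketch; the metric names and values here are invented) is to expose them in the Prometheus/OpenMetrics text format, which the pcp-pmda-openmetrics agent can scrape; once registered, they are recorded by pmlogger alongside the system metrics:

```shell
# Write an invented example of the OpenMetrics text exposition format, as an
# application might serve it over HTTP for the openmetrics PMDA to scrape.
cat > app_metrics.txt <<'EOF'
# HELP app_requests_inflight Requests currently being processed
# TYPE app_requests_inflight gauge
app_requests_inflight 17
# HELP app_cache_hit_ratio Fraction of lookups served from cache
# TYPE app_cache_hit_ratio gauge
app_cache_hit_ratio 0.93
EOF
cat app_metrics.txt
```

With that endpoint registered with the openmetrics PMDA, these values show up as regular PCP metrics, so pmdiff can cull them against kernel and database metrics in exactly the same way.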

Conclusion

Using intuition alone, you might miss important performance insights that could be right in front of you. Dashboards are fantastic, but can be limiting in terms of the specific metrics you’ve chosen to focus on.

There are RHEL tools, including PCP, that can help automate performance analysis, taking the guesswork out of “performance crisis” triage and root cause analysis. Want to see more PCP in action? Take a look at our previous posts on visualizing system performance or solving performance mysteries.

