Improve your Information Technology Infrastructure Library with automation: Incident and problem management

January 20, 2025 | Eric Lavarde | 3-minute read

It may seem that, with automation and agility, the Information Technology Infrastructure Library (ITIL) is outdated, but I don't think we've seen the end of this methodology yet. ITIL has served numerous IT organizations as a guideline and blueprint for processes, and it remains a significant tool for IT professionals. You can modernize your approach to ITIL with the automation tools provided by Red Hat Ansible Automation Platform and the principles of infrastructure as code (IaC).

What is incident and problem management?

In a nutshell, problem management is the proactive sibling of reactive incident management, but what are incident and problem management, exactly?

Incident management: Detecting and handling issues that negatively impact the quality, availability or performance of any service. Handling means restoring the service, generally by having a support person follow a written script (a.k.a. documentation). For example, if users can’t access an application, the script describes how to troubleshoot the application and restart it if it has crashed.

Problem management: This is a follow-up to incident management. It consists of analyzing the root cause of recurring or important incidents and deriving action plans to fix them so that they don’t happen again. It is one of my favorite ITIL processes because it is about avoiding issues instead of fixing them (who doesn’t want to avoid problems?), and it is a basis for continuous improvement! Sadly, it is seldom done properly because once the incident is gone, it is difficult to find the time to do the work that prevents it from happening again.

Continuing with our previous example, we’d first find out why the application crashes regularly (e.g. because it runs out of memory), and fix the underlying root causes (e.g. increase the memory on the server, monitor memory consumption and potentially fix a memory leak in the application).

Incident management and automation

It is relatively straightforward to automate manual steps described in the aforementioned scripts using Ansible Playbooks.
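To make the idea concrete, here is a minimal sketch of the "restart the application if it crashed" script from the earlier example, expressed as a check-then-act function. In practice these steps would live in a playbook; I use Python here only to show the shape of the logic. The names `restart_if_down` and `RunbookResult`, and the two callables passed in, are hypothetical and not part of any Ansible API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookResult:
    """Outcome of one runbook execution, suitable for attaching to a ticket."""
    service: str
    restarted: bool
    healthy: bool
    notes: List[str] = field(default_factory=list)


def restart_if_down(
    service: str,
    is_running: Callable[[str], bool],
    restart: Callable[[str], None],
) -> RunbookResult:
    """Check the service first and restart it only when it is not running."""
    notes: List[str] = []
    if is_running(service):
        # Nothing to do: acting only when needed keeps the step safe to repeat.
        notes.append("service already running, no action taken")
        return RunbookResult(service, restarted=False, healthy=True, notes=notes)

    notes.append("service not running, restarting")
    restart(service)

    # Verify the result instead of assuming the restart worked.
    healthy = is_running(service)
    notes.append(
        "service healthy after restart"
        if healthy
        else "service still down after restart, escalate"
    )
    return RunbookResult(service, restarted=True, healthy=healthy, notes=notes)
```

The check and the restart are passed in as functions so the same logic can be exercised with fakes in a test before it touches a real system. The returned notes are the kind of information a support person would otherwise type into the ticket by hand.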

Those playbooks can be made available to your support personnel through the role-based access control (RBAC) system of Red Hat Ansible Automation Platform, either directly in the web UI of Ansible Automation Platform, or through an API integration within your ticketing system (or other portal).
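As an illustration of the API route, the sketch below builds (but does not send) the HTTP request a ticketing system could use to launch a job template. It assumes the automation controller's job template launch endpoint under `/api/v2`; the prefix differs between versions, so it is a parameter here and you should check the API reference for your installation. The host name, template ID, token and variable names are all made up for the example.

```python
import json
import urllib.request
from typing import Any, Dict


def build_launch_request(
    base_url: str,
    template_id: int,
    token: str,
    extra_vars: Dict[str, Any],
    api_prefix: str = "/api/v2",  # assumption: adjust to your platform version
) -> urllib.request.Request:
    """Build the POST request that launches one job template."""
    url = f"{base_url.rstrip('/')}{api_prefix}/job_templates/{template_id}/launch/"
    body = json.dumps({"extra_vars": extra_vars}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Token of a user whose RBAC role allows executing this template.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values, for illustration only.
request = build_launch_request(
    "https://aap.example.com",
    template_id=42,
    token="REPLACE_ME",
    extra_vars={"ticket_id": "INC0001", "service_name": "webapp"},
)
```

Passing the request to `urllib.request.urlopen` would actually launch the job. Note that, as far as I know, the template has to be configured to accept variables at launch (or through a survey) for `extra_vars` to take effect. RBAC still applies on this path: the token only works for templates its owner may execute.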

NOTE: You might have the idea of using Ansible Automation Platform as your monitoring system to detect issues and create incidents accordingly. While this is technically possible, the performance impact of close monitoring would likely be prohibitive, and there are much better solutions. Instead, you should integrate a proper monitoring system with Event-Driven Ansible to trigger automation, as described in this section.

Once you feel confident enough, you can skip the human step completely and use Event-Driven Ansible to automatically trigger the automation put in place.

Figure: Workflow going from monitoring a bug to solving it using automation and Event-Driven Ansible
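Conceptually, the event-driven step boils down to matching incoming monitoring events against conditions and mapping each match to an automation to run. The sketch below models only that idea in plain Python. It is not the Event-Driven Ansible rulebook format or engine, and the alert names and action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]  # simplified monitoring event, e.g. {"alert": "app_down"}


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]
    action: str  # hypothetical name of the automation to trigger


def matching_actions(event: Event, rules: List[Rule]) -> List[str]:
    """Return the actions of every rule whose condition matches the event."""
    return [rule.action for rule in rules if rule.condition(event)]


RULES = [
    Rule(
        name="restart crashed application",
        condition=lambda e: e.get("alert") == "app_down",
        action="restart_application",
    ),
    Rule(
        name="investigate memory pressure",
        condition=lambda e: e.get("alert") == "memory_high",
        action="collect_memory_diagnostics",
    ),
]
```

An event that matches no rule returns an empty list, which is the case where a human still needs to look at the incident.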

As a first step, you can enrich the incident ticket with additional information gathered by Ansible Automation Platform, so that you can watch for negative effects while you build confidence in your own automation.

Even if incidents are resolved automatically, it remains important to keep a record so you can analyze them. You want to know when Ansible Automation Platform has quietly restarted an application 100 times a day, because if that is happening, the application needs to be fixed. This brings us to our next topic.
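Here is a small sketch of the kind of analysis such records make possible: count the automated fixes per application per day and flag the applications that exceed a threshold as candidates for problem management. The record format and the threshold are assumptions for the example; in practice the data would come from your ticketing system or job history.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

Record = Tuple[date, str]  # (day of the automated fix, application name)


def problem_candidates(records: Iterable[Record], max_per_day: int) -> Dict[str, int]:
    """Return applications whose busiest day exceeded max_per_day automated fixes.

    The value is the number of fixes on that busiest day.
    """
    per_day = Counter(records)  # counts identical (day, application) pairs
    worst: Dict[str, int] = {}
    for (_day, app), count in per_day.items():
        if count > max_per_day:
            worst[app] = max(worst.get(app, 0), count)
    return worst


# Hypothetical history: one noisy application and one quiet one.
history = [(date(2025, 1, 20), "webapp")] * 100 + [(date(2025, 1, 20), "database")] * 2
candidates = problem_candidates(history, max_per_day=10)
```

With this made-up history, only `webapp` is flagged, with 100 automated restarts on its busiest day.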

Problem management and automation

The relationship between problem management and automation might not be that obvious, so let's take a moment to clarify it.

As your environment becomes increasingly automated, any incident you might encounter is potentially due to:

An error or a glitch in your existing automation
A manual intervention due to a gap in your automation

Also, as we’ve seen in the previous article of this series, release management encompasses regular testing of your automation in a pipeline.

That means that in addition to searching for the root cause of your problem, you’ll have to think about its impact on your automation and extend the corrective actions along the lines of:

How to fix the automation so that the incident doesn’t happen again
Which automation to add to avoid the manual mishap in the future
And, most importantly, which test case to add to your pipeline to detect the issue before it can happen again in production. A developer would tell you that you’re avoiding regressions, making sure that your automation always improves.

The last point is why I recommend starting with a simple test pipeline (often called "smoke tests") and expanding it step by step with test cases that catch errors that have happened in reality. This avoids accumulating theoretical test cases that never catch any issue, because test cases also need to be maintained and require additional effort. Problem management is the perfect place to find those real test cases.
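A minimal sketch of that growth pattern, reusing the out-of-memory example from earlier: the pipeline starts with a single smoke test, and the problem record adds one regression check. The fact names and the 80% threshold are hypothetical, and a real pipeline would gather these facts from a test environment rather than receive them as a dictionary.

```python
from typing import Callable, Dict, List

Facts = Dict[str, float]  # measurements gathered from the test environment
Check = Callable[[Facts], bool]


def app_responds(facts: Facts) -> bool:
    """Smoke test: the application answers with HTTP 200."""
    return facts.get("http_status") == 200


def memory_headroom(facts: Facts) -> bool:
    """Regression check added after the out-of-memory problem was analyzed."""
    used = facts.get("memory_used_mb")
    total = facts.get("memory_total_mb")
    if used is None or total is None:
        return False  # missing data counts as a failure, not a pass
    return used <= 0.8 * total  # assumption: keep at least 20% free


CHECKS: Dict[str, Check] = {
    "smoke: application responds": app_responds,
    "regression: memory headroom": memory_headroom,
}


def run_pipeline(facts: Facts, checks: Dict[str, Check] = CHECKS) -> List[str]:
    """Return the names of the failed checks; an empty list means success."""
    return [name for name, check in checks.items() if not check(facts)]
```

Each entry in `CHECKS` traces back either to the initial smoke test or to a real problem record, which keeps the test suite small and every test case justified.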

Wrap up

We’ve seen how to improve and optimize incident management with automation and Event-Driven Ansible, working towards a self-healing environment. We've also talked about how problem management and automation can be combined to support continuous improvement and avoid regressions in your automation content.

Automation can be a long but rewarding journey, and Red Hat Services would be happy to help you introduce automation in your enterprise, with or without ITIL.

About the author

Eric Lavarde

Automation & Edge Principal Architect

I have been at Red Hat since 2013. Within Red Hat Consulting EMEA, I am responsible for creating Services Solutions covering Automation and Edge topics. I'm also the Automation Community of Practice Manager, addressing Red Hat automation practitioners around the globe.
You may address me in English, French or German.
