It may seem that automation and agility have made the Information Technology Infrastructure Library (ITIL) outdated, but I don't think we've seen the end of this methodology yet. ITIL has served numerous IT organizations as a guideline and blueprint for processes, and it continues to be a significant tool for the IT professional. You can modernize your approach to ITIL with the automation tools provided by Red Hat Ansible Automation Platform and the principles of infrastructure-as-code (IaC).

What is incident and problem management?

In a nutshell, problem management is the proactive sibling of reactive incident management, but what are incident and problem management, exactly?

Incident management: Detecting and handling issues that negatively impact the quality, availability or performance of any service. Handling includes restoring the service, generally by having a support person follow a written script (a.k.a. documentation). For example, if users can’t access an application, the script describes how to troubleshoot the application and restart it if it has crashed.

Problem management: This is a follow-up to incident management. It consists of analyzing the root cause of recurring or important incidents, and deriving action plans to fix them so that they don’t appear again. It is one of my favorite ITIL processes because it is about avoiding issues instead of fixing them (who doesn’t want to avoid problems?), and is a basis for continuous improvement! Sadly, it is seldom done properly because once the incident is gone, it is difficult to find the time to do the work needed to prevent it from happening again.

Continuing with our previous example, we’d first find out why the application crashes regularly (e.g. because it runs out of memory), and fix the underlying root causes (e.g. increase the memory on the server, monitor memory consumption and potentially fix a memory leak in the application).

Incident management and automation

It is relatively straightforward to automate manual steps described in the aforementioned scripts using Ansible Playbooks.

Those playbooks can be made available to your support personnel through the role-based access control (RBAC) system of Red Hat Ansible Automation Platform, either directly in the web UI of Ansible Automation Platform, or through an API integration within your ticketing system (or other portal).
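To make the API integration more concrete, here is a minimal Python sketch of how a ticketing system might ask the automation controller to launch a job template that wraps a remediation playbook. The controller URL, template ID, token and extra variables are hypothetical placeholders. The `/api/v2` prefix is an assumption that matches AWX-style controller APIs, so check the API reference for your version. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_launch_request(controller_url, template_id, token, extra_vars,
                         api_prefix="/api/v2"):
    """Build (but do not send) a request that launches a job template.

    api_prefix is an assumption; the path can differ between platform versions.
    """
    url = (f"{controller_url.rstrip('/')}{api_prefix}"
           f"/job_templates/{template_id}/launch/")
    body = json.dumps({"extra_vars": extra_vars}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values: replace with your own controller, template and token.
request = build_launch_request(
    "https://controller.example.com",
    42,
    "hypothetical-token",
    {"incident_id": "INC0001", "app_name": "myapp"},
)
# A real integration would send it with urllib.request.urlopen(request).
```

Passing the incident number as an extra variable is one way to let the playbook report back to the originating ticket. Whether the job template accepts extra variables at launch depends on how it is configured.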

NOTE: You might consider using Ansible Automation Platform itself as your monitoring system to detect issues and create incidents accordingly. While this is technically possible, the performance impact of close monitoring would likely be prohibitive, and there are much better solutions. Instead, integrate a proper monitoring system with Event-Driven Ansible to trigger automation, as described below.

Once you feel confident enough, you can skip the human step completely and use Event-Driven Ansible to automatically trigger the automation put in place.

Figure: Workflow going from monitoring a bug to solving it using automation and Event-Driven Ansible

As a first step, you can have Ansible Automation Platform add the information it gathers to the incident ticket, so that you can watch for negative effects while you build confidence in your own automation.

Even if incidents are resolved automatically, it remains important to keep a record so you can analyze them. You want to know if Ansible Automation Platform has silently restarted an application 100 times a day. If that is happening, the application needs to be fixed. This brings us to our next topic.
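As a rough illustration of that kind of analysis, the following Python sketch counts automatically resolved incidents per application and day, and lists those that reach a threshold. The record fields (`application`, `day`, `resolved_by`) and the sample data are hypothetical. A real export from your ticketing system will look different.

```python
from collections import Counter


def problem_candidates(records, threshold):
    """Return (application, day) pairs that automation fixed at least `threshold` times."""
    counts = Counter(
        (record["application"], record["day"])
        for record in records
        if record["resolved_by"] == "automation"
    )
    return sorted(pair for pair, count in counts.items() if count >= threshold)


# Hypothetical incident records, as they might be exported from a ticketing system.
records = (
    [{"application": "billing", "day": "2024-05-01", "resolved_by": "automation"}] * 100
    + [{"application": "portal", "day": "2024-05-01", "resolved_by": "automation"}] * 2
    + [{"application": "portal", "day": "2024-05-01", "resolved_by": "human"}]
)

print(problem_candidates(records, threshold=100))
# prints [('billing', '2024-05-01')]
```

Each pair in the result is a candidate for a problem record, because the automation is hiding a recurring issue rather than removing its cause.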

Problem management and automation

The relationship between problem management and automation might not be that obvious, so let's take a moment to clarify it.

As your environment becomes increasingly automated, any incident you might encounter is potentially due to:

  1. An error or a glitch in your existing automation
  2. A manual intervention due to a gap in your automation

Also, as we’ve seen in the previous article of this series, release management encompasses regular testing of your automation in a pipeline.

That means that in addition to searching for the root cause of your problem, you’ll have to think about its impact on your automation and extend the corrective actions along the lines of:

  1. How to fix the automation to avoid the incident happening again
  2. Which automation to add to avoid the manual mishap in the future
  3. And, most important, which test case to add to your pipeline to detect the issue before it can happen again in production. A developer would tell you that you’re avoiding regressions, making sure that your automation always improves.

The last point is why I recommend creating a simple test pipeline (known as "smoke tests"), and expanding it step-by-step with test cases that catch errors happening in reality. This avoids having too many theoretical test cases which never catch any issue, because test cases also need to be maintained and require additional effort. Problem management is the perfect place to catch those real test cases.
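One way to picture this step-by-step growth is the small Python sketch below. It starts with a single smoke check and adds a regression check after a problem review, reusing the out-of-memory example from earlier. The check names, the `state` fields and the 512 MB limit are hypothetical. In a real pipeline, the state would come from the test environment after the automation has run.

```python
def check_service_running(state):
    """Smoke test: the application service is up after the automation ran."""
    return state["service_state"] == "running"


def check_memory_headroom(state, min_free_mb=512):
    """Regression test added after a (hypothetical) out-of-memory problem record."""
    return state["free_memory_mb"] >= min_free_mb


# The pipeline starts with the smoke test only; problem management appends
# one check per real incident that the existing checks failed to catch.
CHECKS = [check_service_running, check_memory_headroom]


def run_checks(state):
    """Return the names of all failing checks for the observed state."""
    return [check.__name__ for check in CHECKS if not check(state)]


print(run_checks({"service_state": "running", "free_memory_mb": 100}))
# prints ['check_memory_headroom']
```

Keeping one check per real problem record makes it easy to trace why each test exists, and to retire it when the underlying cause is gone.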

Wrap up

We’ve seen how to improve and optimize incident management with automation and Event-Driven Ansible, working towards a self-healing environment. We've also talked about how problem management and automation can be combined to support continuous improvement and avoid regressions in your automation content.

Automation can be a long but rewarding journey, and Red Hat Services would be happy to help you introduce automation in your enterprise, with or without ITIL.

About the author

I have been at Red Hat since 2013. Within Red Hat Consulting EMEA, I am responsible for creating Services Solutions covering automation and edge topics. I am also the Automation Community of Practice Manager, addressing Red Hat automation practitioners around the globe.
You may address me in English, French or German.