This how-to describes how to replace a failing drive in a software RAID array managed by the mdadm utility. To replace a failing RAID 6 drive in mdadm:
- Identify the problem.
- Get details from the RAID array.
- Remove the failing disk from the RAID array.
- Shut down the machine and replace the disk.
- Partition the new disk.
- Add the new disk to the RAID array.
- Verify recovery.
Let us look at this process in more detail by walking through an example.
Identify the problem
To identify which disk is failing within the RAID array, run:
[root@server loc]# cat /proc/mdstat
Or:
[root@server loc]# mdadm --query --detail /dev/md2
The failing disk will appear as faulty or removed. For example:
[root@server loc]# mdadm --query --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Mon Jun 22 08:47:09 2015
Raid Level : raid6
Array Size : 5819252736 (5549.67 GiB 5958.91 GB)
Used Dev Size : 2909626368 (2774.84 GiB 2979.46 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Oct 15 11:55:06 2018
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Rebuild Status : 3% complete
Name : localhost.localdomain:2
UUID : 54404ab5:4450e4f3:aba6c1fb:93a4087e
Events : 1046292
Number Major Minor Raid Device State
0 0 0 0 removed
1 8 36 1 active sync /dev/sdc4
2 8 52 2 active sync /dev/sdd4
3 8 68 3 active sync /dev/sde4
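If you only want the per-device state lines rather than the full report, you can filter the same output with grep (a quick sketch; the pattern is just an example and assumes the same array name as above):
[root@server loc]# mdadm --detail /dev/md2 | grep -E 'faulty|removed'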
Get details from the RAID array
To examine the RAID array's state and identify the state of a disk within the RAID:
[root@server loc]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sdb4[4](F) sdd4[2] sdc4[1] sde4[3]
5819252736 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
[>....................]  recovery = 3.4% (100650992/2909626368) finish=471.5min speed=99278K/sec
bitmap: 2/22 pages [8KB], 65536KB chunk
unused devices: <none>
As we can see, the device /dev/sdb4 (marked with (F) above) has failed in the RAID.
Since we identified that the failed disk is /dev/sdb4 (which was the case on this server), we'd need to get the disk's serial number using smartctl:
[root@server loc]# smartctl --all /dev/sdb | grep -i 'Serial'
The above command is important because you need to know which disk to remove from the server, according to the disk's physical label.
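If you are unsure which serial number belongs to which device, a small loop can print them all at once (a minimal sketch, assuming the member disks are /dev/sdb through /dev/sde as in this example):
[root@server loc]# for disk in /dev/sd{b..e}; do echo "== $disk"; smartctl --all "$disk" | grep -i 'Serial'; done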
Remove the failing disk from the RAID array
It is important to remove the failing disk from the array so that the array retains a consistent state and is aware of every change, like so:
[root@server loc]# mdadm --manage /dev/md2 --remove /dev/sdb4
On a successful removal, a message like the following is returned:
mdadm: hot removed /dev/sdb4 from /dev/md2
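Note that mdadm will usually refuse to remove a device the kernel still considers active. If the failing drive has not already been marked as faulty (for example, when rehearsing this procedure in a lab), fail it first:
[root@server loc]# mdadm --manage /dev/md2 --fail /dev/sdb4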
Check the state of /proc/mdstat once again:
[root@server loc]# cat /proc/mdstat
You can see that /dev/sdb4 is no longer visible.
Shut down the machine and replace the disk
Now it's time to shut down the system and replace the faulty disk with a new one. Before shutting down the system, comment /dev/md2 out of your /etc/fstab file. See the example below:
[root@server loc]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri May 20 13:12:25 2016
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root / xfs defaults 0 0
UUID=1300b86d-2638-4a9f-b366-c5e67e9ffa4e /boot xfs defaults 0 0
#/dev/mapper/centos-home /home xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
#/dev/md2 /var/loc xfs defaults 0 0
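With the entry commented out, unmount the filesystem and power the machine off (a short sketch, assuming the /var/loc mount point shown in the fstab above):
[root@server loc]# umount /var/loc
[root@server loc]# shutdown -h now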
Partition the new disk
Since we have other working disks within the RAID array, it is easy and convenient to copy the partition schema of a working disk onto the new disk. This task is accomplished with the sgdisk utility, which is provided by the gdisk package.
Install gdisk like this (adjust this command for your distribution):
[root@server loc]# yum install gdisk
Using sgdisk, we will first pass the -R option (which stands for Replicate) to copy the partition schema of a working disk onto the new one. Pay attention to the order of the arguments: sgdisk -R takes the target (new) disk first, followed by the source (working) disk. In our situation, the new disk is /dev/sdb and the working disks are /dev/sdc, /dev/sdd, and /dev/sde.
Now, to replicate the partition schema of a working disk (say /dev/sdc) to the new disk /dev/sdb, the following command is needed:
[root@server loc]# sgdisk -R /dev/sdb /dev/sdc
To prevent GUID conflicts with other drives, we’ll need to randomize the GUID of the new drive using:
[root@server loc]# sgdisk -G /dev/sdb
The operation has completed successfully.
Next, verify the new partition table on /dev/sdb using the parted utility:
[root@server loc]# parted /dev/sdb print
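To confirm the layout matches, you can also print the source disk you replicated from and compare the two tables (assuming /dev/sdc as in the command above):
[root@server loc]# sgdisk -p /dev/sdb
[root@server loc]# sgdisk -p /dev/sdc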
Add the new disk to the RAID array
After completing the partition schema replication to the new drive, we can now add the drive to the RAID array:
[root@server loc]# mdadm --manage /dev/md2 --add /dev/sdb4
mdadm: added /dev/sdb4
Verify recovery
To verify the RAID recovery, use the following:
[root@server loc]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sdb4[4] sdd4[2] sdc4[1] sde4[3]
5819252736 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
[==>.................]  recovery = 12.2% (357590568/2909626368) finish=424.1min speed=100283K/sec
bitmap: 0/22 pages [0KB], 65536KB chunk
unused devices: <none>
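Because the rebuild can take several hours, it can be handy to keep this view refreshing automatically with the watch utility rather than re-running the command by hand (a minimal sketch; adjust the interval to taste):
[root@server loc]# watch -n 60 cat /proc/mdstat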
Alternatively, you can query the array details directly:
[root@server loc]# mdadm --query --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Mon Jun 22 08:47:09 2015
Raid Level : raid6
Array Size : 5819252736 (5549.67 GiB 5958.91 GB)
Used Dev Size : 2909626368 (2774.84 GiB 2979.46 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Oct 15 12:37:37 2018
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Rebuild Status : 12% complete
Name : localhost.localdomain:2
UUID : 54404ab5:4450e4f3:aba6c1fb:93a4087e
Events : 1046749
Number Major Minor Raid Device State
4 8 20 0 spare rebuilding /dev/sdb4
1 8 36 1 active sync /dev/sdc4
2 8 52 2 active sync /dev/sdd4
3 8 68 3 active sync /dev/sde4
From the above output, we can see that /dev/sdb4 is rebuilding and that the array again has four working devices, three of which are active while the new one syncs. The rebuilding process might take a while, depending on your total disk size and disk type (i.e., traditional or solid-state).
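If you commented the /dev/md2 entry out of /etc/fstab before the shutdown, remember to uncomment it and remount the filesystem once the array is healthy again (assuming the /var/loc mount point from the earlier example):
[root@server loc]# mount /var/loc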
Celebrate
You have now successfully replaced a failing RAID 6 drive with mdadm. Hopefully, you will never need to do this, but hardware fails. Odds are that if you're using RAID 6, it will happen eventually. If you can, set up a lab, force a RAID 6 array to fail in it, and then recover it. Knowing how to address the problem will make the experience far less stressful when the unthinkable happens.
About the author
Valentin is a system engineer with more than six years of experience in networking, storage, high-performance clusters, and automation.
He is involved in different open source projects like bash, Fedora, Ceph, and FreeBSD, and is a member of Red Hat Accelerators.