I recently had to replace a disk in a software RAID 1 system running on a Debian KVM host. The first disk (sda) in my server failed. Here is what I did to get the system up and running again. It was a fun and challenging task to carry out using only a terminal window.
First, identify which disk is failing.
Run this command for each disk:
# smartctl -a /dev/sdb
If, about 15-20 lines down in the output, you see this line, your disk is healthy:
SMART overall-health self-assessment test result: PASSED
If you see this line instead (only a few lines down, since the rest of the tests fail):
Log Sense failed, IE page [scsi response fails sanity test]
Then your disk is failing.
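If you have more than a couple of disks, you can shorten this check by asking smartctl for just the health summary (the -H flag) of each one. A small loop, assuming your disks are sda and sdb:
# for d in /dev/sda /dev/sdb; do smartctl -H "$d"; done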
Remove your disk from the raid array.
This is your next step. You need to take the drive out of the raid arrays and replace the physical disk, which means removing all of the failing disk's partitions from the arrays. My system had only 3 partitions: swap, root and boot. Here is how to take them out of the arrays:
# mdadm /dev/md0 -r /dev/sda1
# mdadm /dev/md1 -r /dev/sda2
# mdadm /dev/md2 -r /dev/sda3
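If mdadm refuses to remove a partition because the kernel still considers it active, mark it as failed first with the -f flag, for example:
# mdadm /dev/md0 -f /dev/sda1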
Now you can replace the physical disk.
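If the server has several identical drives, note the serial number of the failing disk before pulling it, so you are sure to swap the right one. Provided the drive still responds, smartctl's -i flag prints its identity:
# smartctl -i /dev/sda | grep -i serial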
Identify the partition table type.
Next you need to find out whether you are using a GUID Partition Table (GPT) or a Master Boot Record (MBR).
You can find out by issuing this command:
# parted -l
If you get this result: Partition Table: msdos
You are using MBR (parted calls it msdos, but MBR is the correct name).
Note: your raid array devices will show up with Partition Table: loop. That means raw access without a partition table.
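If parted -l prints a long list of devices, you can also query a single disk and filter out the line you need; this shows the same information:
# parted /dev/sdb print | grep "Partition Table"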
If you are using GPT, proceed to prepare your new disk with GPT. If MBR, skip that part and start at prepare your new disk with MBR.
Once you have replaced the disk and start preparing the new one, installing the boot loader and so on, do not reboot until you are done and the array is working. Otherwise you might run into boot issues, especially if you are not at the same location as the server.
Prepare your new disk with GPT
With the new disk in place we will copy the partition table from sdb onto sda. The partition tables in a raid 1 setup need to match exactly.
# sgdisk -R /dev/sda /dev/sdb
All members of a raid array need their own unique identifiers (GUIDs), and the copy left the new disk with the same ones as sdb. So this is how we assign random new GUIDs to the new disk:
# sgdisk -G /dev/sda
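To double-check that the copy worked, you can print both partition tables and compare them; sgdisk's -p flag prints the table of a disk:
# sgdisk -p /dev/sda
# sgdisk -p /dev/sdb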
Prepare your new disk with MBR
Copy a working MBR from sdb:
# sfdisk -d /dev/sdb | sfdisk /dev/sda
If the new partitions are not detected, you will need to make the kernel reload the partition table. Try this command:
# sfdisk -R /dev/sda
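Newer versions of sfdisk have dropped the -R option; if yours has, partprobe (from the parted package) does the same job:
# partprobe /dev/sda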
Put the raid partitions back.
Now you are ready to add the raid partitions back into the arrays.
# mdadm /dev/md0 -a /dev/sda1
# mdadm /dev/md1 -a /dev/sda2
# mdadm /dev/md2 -a /dev/sda3
The new drive is now back in the array and has started to sync against the working drive. On my 1.5TB drive this process took about 4 hours. You can monitor the progress by looking at this file:
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[2] sdb3[1]
      1451896960 blocks super 1.2 [2/2] [UU]
      [==========>.........]  recovery = 16.5% (240146752/1451896960) finish=158.9min speed=127054K/sec

md1 : active raid1 sda2[2] sdb2[1]
      523968 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[2] sdb1[1]
      12574592 blocks super 1.2 [2/2] [UU]
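Rather than re-running cat by hand, you can let watch refresh the view every few seconds:
# watch -n 5 cat /proc/mdstat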
Install the boot loader.
This is the last thing we need to do before we are finished. However, you should wait until the resync is complete. To install grub2 onto the new hard drive, do this:
# grub-install /dev/sda
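It also does no harm to make sure the boot loader is present on the surviving disk as well; with raid 1 you want either disk to be able to boot the system on its own:
# grub-install /dev/sdb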
Now be brave and try to reboot your system. It should be working fine now with a new drive.
Happy recovery!