SUMMARY
Rebuild Software RAID
ISSUE
When any of the software RAIDs enters a degraded state, an alert is shown in the User Interface and included in the daily status email.
Purpose
Almost all Unitrends appliances use some form of software RAID. It is possible for the software RAID to get into a degraded state. This article will outline some common causes and methods for rebuilding the RAID volumes.
Applies To
DPU appliances utilizing software RAID
RESOLUTION
*** Caution: For advanced users only, requires command-line usage ***
WARNING!!! Running sfdisk, parted, or mdadm commands can be dangerous. Use the utmost care, and use rebuild_disk instead whenever possible.
Our 2U/3U sized rack mount units use software RAID on the two internal OS drives. The 1U rack mount units and desktop units use software RAID on a larger scale.
To find specific information on the RAID status, use this command:
[root@Recovery-713 ~]# cat /proc/mdstat
Personalities : [raid1] [raid0]
md5 : active raid0 md3[0] md4[1]
5747030528 blocks super 1.2 64k chunks
md4 : active raid1 sdc2[0] sdd2[1]
2877837631 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sda4[0](F) sdb4[1]
2869193023 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sdb3[1] sda3[0]
12582848 blocks [2/2] [UU]
md2 : active raid1 sdd1[1] sdc1[0]
52428672 blocks [2/2] [UU]
md0 : active raid1 sdb2[1] sda2[2](F)
48234432 blocks [2/1] [_U]
unused devices: <none>
In this output, a device with (F) beside it is in a failed state, and [_U] shows that a member is missing from that RAID volume.
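For more detail on an individual array and its members, mdadm can report the state directly. A minimal example, assuming the degraded array is /dev/md0:
mdadm --detail /dev/md0
The output lists each member device and whether it is active, faulty, or removed.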
You should use the /usr/bp/bin/rebuild_disk script whenever possible.
Options for rebuild_disk
- rebuild_disk --help #show the help message
- rebuild_disk --speed #set rebuild speed and wait to unset
  The speed function can be used if the rebuild was started without the services stopped, and the services have since been stopped so that the rebuild can use more resources.
- rebuild_disk /dev/sdN #rebuild the replaced disk after rebooting
  Used after powering off the system, removing the old disk, inserting the new disk, and booting up again.
- rebuild_disk --hotswap /dev/sdN #Hotswap warranty replace without rebooting
  Used when you want to replace a failed/failing drive under warranty without rebooting. Run this after the replacement disk is on-site, but BEFORE swapping the disk. The hotswap function will remove the specified device from all associated RAID sets, then ask the user to remove the drive and replace it with the new drive. Afterwards, it will rebuild that drive back into the array.
- rebuild_disk --readd /dev/sdN #Remove/Re-add the disk for rebuild
  Used when you want to initiate a rebuild on a good RAID array, for example if there are pending sectors and you would like to initiate a rebuild to fix them. In this case, no drives have been dropped from the RAID sets. The readd function can now also rebuild a drive back into an array even if it has been removed from some of the RAID arrays.
- rebuild_disk --locate /dev/sdN #flash the disk LED to locate it
  Used to physically locate and verify the drive that needs to be replaced.
If the existing disk is marked as failed (e.g. sda), and you just need to rebuild it, do this:
/usr/bp/bin/rebuild_disk --readd /dev/sda
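While a rebuild is running, you can watch its progress in /proc/mdstat using standard Linux tooling (not specific to rebuild_disk):
watch -n 10 cat /proc/mdstat
The degraded array shows a progress bar and an estimated finish time while it resyncs.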
To replace a disk, the recommended method is to shut down the system, replace the failed disk drive, and then power up the system.
Shut down the Unitrends processes so that the rebuild will go faster (optional, but the rebuild then takes about 1/3 of the time). When the rebuild completes, it will automatically restart the Unitrends processes. Rebuild time on an R813 SW RAID5 with 3TB disks with services stopped should be about 5 ½ hours instead of 16+ hours.
/etc/init.d/bp_rcscript stop
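For reference, the kernel exposes the md rebuild speed limits under /proc/sys; inspecting or raising them is one way to see why a rebuild is throttled. These are standard md tunables, and it is an assumption that the --speed option adjusts them:
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min   #example value in KB/s; adjust with care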
For replaced disk device (e.g. sda), do this:
/usr/bp/bin/rebuild_disk /dev/sda
When the new drive has been initialized, do this:
/usr/bp/bin/rebuild_disk
None of the 1U rack mount units are hot-swappable, but if you cannot shut down the system to replace the disk, use one of these methods:
1) Use the rebuild_disk script to hotswap a disk - it will prompt the user when to replace the disk
/usr/bp/bin/rebuild_disk --hotswap /dev/sda
2) Alternatively, run the SCSI bus rescan command (shown in the NOTES section below) so that you do not have to reboot after replacing one of the drives.
If the 'dpu version' is less than 7.4.0, download the rebuild_disk script. If version 7.4.0 or later, skip this step.
https://sftp.kaseya.com/utilities/rebuild_disk
chmod +x rebuild_disk
cp rebuild_disk /usr/bp/bin
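A minimal sketch of the download and install steps, assuming wget is available on the appliance and you are working from a temporary directory:
cd /tmp
wget https://sftp.kaseya.com/utilities/rebuild_disk
chmod +x rebuild_disk
cp rebuild_disk /usr/bp/bin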
CAUSE
1) One of the disks has failed
2) An unclean shutdown caused the RAID members to fall out of sync, resulting in a degraded state.
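To help distinguish a genuinely failing disk (cause 1) from a member that simply dropped out after an unclean shutdown (cause 2), SMART data can be checked. This assumes smartmontools is installed on the appliance:
smartctl -H /dev/sda   #overall health assessment
smartctl -a /dev/sda | grep -i -e reallocated -e pending   #key defect counters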
NOTES
Note: sfdisk does not work on 3TB drives, so parted or gdisk is used instead.
*** Caution: For advanced users only. ***
*** The rescan may not align the disk order as expected on all systems ***
for HOST in $(ls /sys/class/scsi_host/); do echo '- - -' > /sys/class/scsi_host/${HOST}/scan; done
Even after doing this you may be asked to reboot.
In 1U units there are 4 devices: sda, sdb, sdc, and sdd. After running the above command, the replaced device comes back online under a new name, such as /dev/sdf.
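To confirm which device name the replaced disk received after the rescan, standard Linux tools can be used (not specific to the appliance):
dmesg | tail -20       #shows the newly attached disk
cat /proc/partitions   #lists the block devices the kernel currently sees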
You can now use the rebuild_disk command to rebuild the disk back into the RAID.
/usr/bp/bin/rebuild_disk /dev/sdf
You can also use sfdisk and mdadm to manually prepare the disk and rebuild the RAIDs, as sketched below, but use rebuild_disk if at all possible.
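A minimal sketch of that manual path, assuming the replaced disk is /dev/sda, the healthy mirror is /dev/sdb, and the partition-to-array mapping matches the /proc/mdstat output above (adjust for your system; for 3TB drives use parted or gdisk instead of sfdisk):
mdadm /dev/md0 --remove /dev/sda2          #only if the old member is still listed as failed
sfdisk -d /dev/sdb | sfdisk /dev/sda       #copy the partition table from the healthy disk
mdadm /dev/md0 --add /dev/sda2             #re-add each partition to its RAID set
mdadm /dev/md1 --add /dev/sda3
mdadm /dev/md3 --add /dev/sda4
mdadm begins resyncing automatically; progress can again be checked in /proc/mdstat.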