SUMMARY
Rebuild Software RAID
ISSUE
When any of the software RAIDs enters a degraded state, an alert is shown in the User Interface and included in the daily status email.
Purpose
Almost all Unitrends appliances use some form of software RAID. It is possible for the software RAID to get into a degraded state. This article will outline some common causes and methods for rebuilding the RAID volumes.
Applies To
DPU appliances utilizing software RAID
RESOLUTION
*** Caution: For advanced users only, requires command-line usage ***
WARNING!!! Running sfdisk, parted, or mdadm commands can be dangerous. Use the utmost care, and use rebuild_disk instead whenever possible.
Our 2U/3U sized rack mount units use software RAID on the two internal OS drives. The 1U rack mount units and desktop units use software RAID on a larger scale.
To find specific information on the RAID status, use this command:
[root@Recovery-713 ~]# cat /proc/mdstat
Personalities : [raid1] [raid0]
md5 : active raid0 md3[0] md4[1]
5747030528 blocks super 1.2 64k chunks
md4 : active raid1 sdc2[0] sdd2[1]
2877837631 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sda4[0](F) sdb4[1]
2869193023 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sdb3[1] sda3[0]
12582848 blocks [2/2] [UU]
md2 : active raid1 sdd1[1] sdc1[0]
52428672 blocks [2/2] [UU]
md0 : active raid1 sdb2[1] sda2[2](F)
48234432 blocks [2/1] [_U]
unused devices: <none>
In this output, a device with (F) beside it is in a failed state, and [_U] shows that a member is missing from that RAID volume.
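For more detail on an individual array and its members, mdadm can report the state directly. A minimal example, assuming the degraded array is /dev/md0:
mdadm --detail /dev/md0
The output lists each member device and whether it is active, faulty, or removed.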
You should use the /usr/bp/bin/rebuild_disk script whenever possible.
Options for rebuild_disk
- rebuild_disk --help #show the help message
- rebuild_disk --speed #set rebuild speed and wait to unset
  The speed function can be used if the rebuild was started without the services stopped, and the services have since been stopped so that the rebuild can use more resources.
- rebuild_disk /dev/sdN #rebuild the replaced disk after rebooting
  Used after powering off the system, removing the old disk, inserting the new disk, and booting up again.
- rebuild_disk --hotswap /dev/sdN #Hotswap warranty replace without rebooting
  Used when you want to replace a failed/failing drive under warranty without rebooting. Run this after the replacement disk is on-site, but BEFORE swapping the disk. The hotswap function will remove the specified device from all associated RAID sets, then ask the user to remove the drive and replace it with the new drive. Afterwards, it will rebuild that drive back into the array.
- rebuild_disk --readd /dev/sdN #Remove/Re-add the disk for rebuild
  Used when you want to initiate a rebuild on a good RAID array, for example if there are pending sectors and you would like to initiate a rebuild to fix them. In this case, no drives have been dropped from the RAID sets. The readd function can now also rebuild a drive back into an array even if it has been removed from some of the RAID arrays.
- rebuild_disk --locate /dev/sdN #flash the disk LED to locate it
  Used to physically locate and verify the drive that needs to be replaced.
If the existing disk is marked as failed (e.g. sda), and you just need to rebuild it, do this:
/usr/bp/bin/rebuild_disk --readd /dev/sda
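While a rebuild is running, you can watch its progress in /proc/mdstat using standard Linux tooling (not specific to rebuild_disk):
watch -n 10 cat /proc/mdstat
The degraded array shows a progress bar and an estimated finish time while it resyncs.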
To replace a disk, the recommended method is to shut down the system, replace the failed disk drive, and then power up the system.
Shut down the Unitrends processes so that the rebuild will go faster (optional, but the rebuild then takes about 1/3 of the time). When the rebuild completes, it will automatically restart the Unitrends processes. Rebuild time on an R813 SW RAID5 with 3TB disks with services stopped should be about 5 ½ hours instead of 16+ hours.
/etc/init.d/bp_rcscript stop
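For reference, the kernel exposes the md rebuild speed limits under /proc/sys; inspecting or raising them is one way to see why a rebuild is throttled. These are standard md tunables, and it is an assumption that the --speed option adjusts them:
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min   #example value in KB/s; adjust with care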
For replaced disk device (e.g. sda), do this:
/usr/bp/bin/rebuild_disk /dev/sda
When the new drive has been initialized, do this:
/usr/bp/bin/rebuild_disk
None of the 1U rack mount units are hot-swappable, but if you cannot shut down the system to replace the disk, use one of these methods:
1) Use the rebuild_disk script to hotswap a disk - it will prompt the user when to replace the disk
/usr/bp/bin/rebuild_disk --hotswap /dev/sda
2) Alternatively, run the SCSI bus rescan command (shown in the NOTES section below) so that you do not have to reboot after replacing one of the drives.
If the 'dpu version' is less than 7.4.0, download the rebuild_disk script. If version 7.4.0 or later, skip this step.
https://sftp.kaseya.com/utilities/rebuild_disk
chmod +x rebuild_disk
cp rebuild_disk /usr/bp/bin
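A minimal sketch of the download and install steps, assuming wget is available on the appliance and you are working from a temporary directory:
cd /tmp
wget https://sftp.kaseya.com/utilities/rebuild_disk
chmod +x rebuild_disk
cp rebuild_disk /usr/bp/bin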
CAUSE
1) One of the disks has failed
2) An unclean shutdown caused the RAID members to fall out of sync, resulting in a degraded state.
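To help distinguish a genuinely failing disk (cause 1) from a member that simply dropped out after an unclean shutdown (cause 2), SMART data can be checked. This assumes smartmontools is installed on the appliance:
smartctl -H /dev/sda   #overall health assessment
smartctl -a /dev/sda | grep -i -e reallocated -e pending   #key defect counters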
NOTES
Note: sfdisk does not work on 3TB drives, so parted or gdisk is used instead.
*** Caution: For advanced users only. ***
*** The rescan may not align the disk order as expected on all systems ***
for HOST in $(ls /sys/class/scsi_host/); do echo '- - -' > /sys/class/scsi_host/${HOST}/scan; done
Even after doing this you may be asked to reboot.
In 1U units there are 4 devices: sda, sdb, sdc, and sdd. After running the above command, the replaced device comes back online under a new name, such as /dev/sdf.
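To confirm which device name the replaced disk received after the rescan, standard Linux tools can be used (not specific to the appliance):
dmesg | tail -20       #shows the newly attached disk
cat /proc/partitions   #lists the block devices the kernel currently sees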
You can now use the rebuild_disk command to rebuild the disk back into the RAID.
/usr/bp/bin/rebuild_disk /dev/sdf
You can also use sfdisk and mdadm to manually prepare the disk and rebuild the RAIDs, as sketched below, but use rebuild_disk if at all possible.
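A minimal sketch of that manual path, assuming the replaced disk is /dev/sda, the healthy mirror is /dev/sdb, and the partition-to-array mapping matches the /proc/mdstat output above (adjust for your system; for 3TB drives use parted or gdisk instead of sfdisk):
mdadm /dev/md0 --remove /dev/sda2          #only if the old member is still listed as failed
sfdisk -d /dev/sdb | sfdisk /dev/sda       #copy the partition table from the healthy disk
mdadm /dev/md0 --add /dev/sda2             #re-add each partition to its RAID set
mdadm /dev/md1 --add /dev/sda3
mdadm /dev/md3 --add /dev/sda4
mdadm begins resyncing automatically; progress can again be checked in /proc/mdstat.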