Using RAID to Escape Disaster

Failed hard drives are inevitable. Especially when the drive in question was manufactured on November 27, 2001. You know the time has come to replace it when your log files start filling up with errors like this:

Oct 28 03:53:05 cat kernel:         res 51/40:00:fc:33:4e/00:00:00:00:00/e0 Emask 0x9 (media error)
Oct 29 16:06:46 cat smartd[24427]: Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!

Failure is inescapable. Everything fails eventually: computers, people, electronics. It is the only constant in life; the only question is when. In my case this 40GB drive had served me well in multiple computers and as part of a RAID5 array for my Linux Journal article. In its final installation it was part of a 2 disk RAID1 in cat, my webserver. cat runs Fedora 13 and a minimal set of software for serving up my webpages, including this blog. cat was built from spare parts; its job isn't hard and its space requirements aren't large. Good logging and reporting are important, since they help you anticipate the impending doom. On my systems I run the smartd daemon to monitor drive health, along with epylog to parse my logfiles and email me nightly results.
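
For the curious, that monitoring boils down to something like this (the device name, email address, and config path are just examples; adjust for your system):

# quick one-off health check of a drive
smartctl -H /dev/sda

# /etc/smartd.conf: monitor everything on the drive and email me when something fails
/dev/sda -a -m root@localhost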

Cat was set up running Fedora 13 on 2 drives with 3 partitions: /boot, / and swap. / was a 2 disk RAID1, and /boot was actually /boot and /boot2 because at the time I was unsure whether grub could boot from a RAID (yes, it can, and that's another post entirely). The partitioning looked like this:

[root@cat ~]# parted -l
Model: ATA Maxtor 5T040H4 (scsi)
Disk /dev/sda: 41.0GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system     Name  Flags
 1      1049kB  525MB   524MB   ext4                  boot
 2      525MB   2622MB  2097MB  linux-swap(v1)
 3      2622MB  41.0GB  38.4GB                        raid

When the errors showed up I jumped over to Amazon Prime and found a pretty good deal on a pair of Seagate 500GB drives. I had them the next day, but didn't have time to start the process of swapping them in and expanding the storage. Instead I marked the failing drive as faulty with mdadm --manage --set-faulty /dev/md0 /dev/sdb3 and removed the references to its /boot partition from /etc/fstab. I have good nightly backups of the system, and smartctl reported that the remaining drive was running fine. The system is pretty much read-only, so nightly backups were sufficient to provide a good restore point in case the final drive failed.
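
Spelled out as commands (device names are from my setup):

# mark the failing member as faulty so mdraid stops using it
mdadm --manage /dev/md0 --set-faulty /dev/sdb3
# the array should now show up as degraded
cat /proc/mdstat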

The replacement plan was to hook up the 2 new drives, which use SATA instead of IDE, add them to the existing array, and let mdraid sync the data over from the old drive. At that point I would have 3 drives in the array, with the array still only 40GB in size. I would then remove the old drive and grow the array and filesystem on the new drives to take up the full 500GB. Sometimes plans actually do work. The old drives were EIDE, and I had 2 SATA ports on the motherboard -- confirmed by using dmidecode to grab the motherboard's model number and look it up online. The only glitch there was that I had to enable the SATA controller in the BIOS before the drives were recognized. I used parted to split each new drive into 3 partitions. They look like this when finished:

[root@cat ~]# parted /dev/sda print
Model: ATA ST3500418AS (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size   Type     File system     Flags
 1      1049kB  1000MB  999MB  primary  ext4            boot, raid
 2      1000MB  2000MB  999MB  primary  linux-swap(v1)
 3      2000MB  500GB   498GB  primary                  raid
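
That layout can be created with commands roughly like these (reconstructed from the finished layout; I'm showing /dev/sdb here and repeated the same commands for /dev/sdc):

parted /dev/sdb mklabel msdos
parted /dev/sdb mkpart primary ext4 1MiB 1000MB
parted /dev/sdb mkpart primary linux-swap 1000MB 2000MB
parted /dev/sdb mkpart primary 2000MB 100%
parted /dev/sdb set 1 boot on
parted /dev/sdb set 1 raid on
parted /dev/sdb set 3 raid on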

Don't forget to set the boot flag on the /boot partition on both drives. You never can tell when the BIOS might decide to boot the other one, and if one drive fails you want the other to still be bootable. GRUB can boot from a RAID1 partition as long as it holds a filesystem GRUB supports, like ext2/3/4, and as long as mdraid metadata v1.0 or earlier is used. With v1.0 the metadata is written to the end of the partition, so GRUB never sees it. With v1.1 and later the RAID metadata is written to the start of the partition and GRUB cannot find the filesystem. I set up /boot as a 2 disk RAID1 like this:

mdadm --create --verbose /dev/md1 --level=raid1 --raid-devices=2 --metadata=1.0 /dev/sdb1 /dev/sdc1

I then copied over the /boot partition from the existing system:

mkfs.ext4 /dev/md1
mount /dev/md1 /mnt
rsync -avc /boot/ /mnt/
umount /mnt

The next step is adding the new, larger partitions to the existing array. I physically removed the failed drive so that it couldn't cause any problems, and added the new partitions like so:

[root@cat ~]# mdadm --manage /dev/md0 --add /dev/sdb3
mdadm: added /dev/sdb3
[root@cat ~]# mdadm --manage /dev/md0 --add /dev/sdc3
mdadm: added /dev/sdc3

mdraid immediately begins to sync the data from the 40GB drive over to one of the new drives. Since the array is set up for 2 devices, it leaves the other new partition as a spare. There is no need to create a filesystem on the new partitions because they are being written with the data from the old drive, which includes the filesystem. /proc/mdstat looked like this during the sync:

[root@cat ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdc1[1] sdb1[0]
      975860 blocks super 1.0 [2/2] [UU]

md0 : active raid1 sdc3[3](S) sdb3[2] sda3[0]
      37458876 blocks super 1.1 [2/1] [U_]
      [>....................]  recovery =  0.6% (247296/37458876) finish=22.5min speed=27477K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
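
The (S) next to sdc3 marks it as the spare. If the /proc/mdstat shorthand isn't obvious, mdadm will spell out the role of each device:

mdadm --detail /dev/md0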

When that sync finished I manually failed the old 40GB drive with mdadm --manage /dev/md0 --fail /dev/sda3, waited for the data to be synced over to the other new drive, and then removed the old drive from the array with mdadm --manage /dev/md0 --remove /dev/sda3. At this point only 40GB of each 498GB partition was being used. It would have worked just fine like that, but it seemed like such a waste, so I wanted to resize it. But first I made sure I could boot the system with just the 2 new drives and their RAID1 /boot partition.

That's when I goofed.

I had grabbed the UUID values (unique identifiers that help Linux find the right partition to mount) and updated my /etc/fstab with the new values. I also updated the swap entries with their new UUIDs (printed when you run mkswap). You can always see the UUID of a partition by running blkid /dev/sdX or blkid /dev/md0. We used to refer to drives in /etc/fstab by their device names, like /dev/sda1, but changes in how drives are detected mean they may not always get the same letter assignment. The UUID is unique and tied to the filesystem, so you are guaranteed to always get what you expect. No more nasty surprises when you plug in a USB drive and reboot, only to find the BIOS changed the drive order on you.
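
For illustration, the updated fstab entries take this shape (the placeholders stand in for the actual UUIDs that blkid prints):

UUID=<uuid of /dev/md0>   /      ext4  defaults  1 1
UUID=<uuid of /dev/md1>   /boot  ext4  defaults  1 2
UUID=<uuid of /dev/sdb2>  swap   swap  defaults  0 0
UUID=<uuid of /dev/sdc2>  swap   swap  defaults  0 0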

Oh, back to the goof. Well, in my excitement to see if GRUB really would boot the RAID1 /boot partition, I had neglected to actually write GRUB to the MBR of the new drives. This caused the system to, well, not boot. The fix was simple: plug the old 40GB drive back in, boot from its MBR, and then write GRUB to the new drives from the grub shell:

[root@cat ~]# grub
grub> root (hd0,0)
grub> setup (hd1)
grub> root (hd0,0)
grub> setup (hd2)

The root line should match what is in /etc/grub.conf, and setup (hdX) tells grub which drive's MBR to write to; keep in mind that the drive numbers may be different when booting without the old drive installed.

Next is resizing things. You need to grow the RAID array first and then resize the filesystem. The first time I tried this I ran into the "Bitmap must be removed before size can be changed" error, which sounds a bit ominous when you aren't expecting it. What it means is that the write-intent bitmap the array uses to track what has been synced needs to be removed; it isn't big enough for the new size anyway. To do that you run mdadm --grow /dev/md0 --bitmap none, which then allows you to actually grow the array with mdadm --grow /dev/md0 --size max. This will take a while. How long depends on things like drive controller speed, CPU speed, drive speed and who knows what else. In my case it took about 3 hours. You can monitor the progress by watching /proc/mdstat using watch -n 20 cat /proc/mdstat.
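
Put together, the grow step looks like this (same commands as above, in order):

mdadm --grow /dev/md0 --bitmap none
mdadm --grow /dev/md0 --size max
watch -n 20 cat /proc/mdstat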

When that is finished you want to add the bitmap back to the array, which is done by running mdadm --grow /dev/md0 --bitmap internal. Now we are ready to resize the filesystem. Back in the old days (cough) you had to reboot into a rescue disk and run things like this on an unmounted filesystem. Those days are long gone. We just need to run resize2fs /dev/md0 and sit back and watch it grow. You can monitor with the normal filesystem utilities; df -h shows the new size growing in real time.
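
And the final resize, gathered into one place:

mdadm --grow /dev/md0 --bitmap internal
resize2fs /dev/md0
df -h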

The last step, as it should be with any filesystem change, is to run a filesystem check. Run touch /forcefsck and reboot, and the check will be handled at boot time.

I have to thank the many resources found via Google, but especially this HowtoForge article on replacing disks in a RAID1 array, and the kernel.org wiki entry on Growing a RAID.

(Note: this is what worked for me, in my setup; yours will be different, and this information may or may not work for you. Make sure you have good backups before doing anything with your filesystems.)

UPDATE: I think my original title was dumb. I've changed it.