Of Mirrors and Luck - a Tale of Losing the Coin Flip
Friday, October 07 2011 @ 08:59 AM PDT
Contributed by: Richard Pitt
Unlike Harrison Ford in Indiana Jones and the Last Crusade, I didn't choose wisely. In fact, there was no obvious information on which to base the choice: I simply picked one of the two RAID 1 drives (mirrors) and removed it - and the system immediately started to run faster; at least for about a day.
A lot of analysis had been done up to that point, and finally one screen, updating every 1/10 second, showed me that the process gumming up the system had to do with the RAID 1 mirror holding the root partition.
I'd had no hardware errors reported by smartd, and no RAID errors reported by mdadm.
The only thing I had done recently was update the kernel to the latest version on Fedora Core 14, the OS version on the system. Could this have done something? I found an obscure reference to running RAID and problems with Western Digital Caviar Green drives - this system uses Caviar Blacks, but... maybe.
So I broke the mirror.
mdadm /dev/md0 --fail /dev/sda3
I could have chosen /dev/sdc3, but that 50/50 chance somehow made me type the first drive's name instead. It didn't seem to be a specific hardware problem with one drive, but rather something about using this type of drive in a RAID at all, so there was no obvious reason to choose one over the other.
Then the problem surfaced again, but by this time I'd re-formatted and tested the removed partition (thank goodness I didn't do this to the rest of the drive), and there was simply no way back. Two days later, I'm still recovering files from the other partitions and getting the system back to running all the various things it ran. Looking back, I'm not sure there was anything I could have done differently except to have had the good luck to pick the correct drive - in general I don't gamble, because "if you don't play, you can't lose" - and in this case I lost.
It all started with the Linux server system I host becoming slower and slower. The real load on it at this time of year is trivial, so load should not have been a factor at all. I started looking for other reasons: denial of service attack, huge directories (some of which are growing to contain millions of photos at this point) and a number of other things went through my head and were tested and didn't show as the real cause.
Meanwhile, the all-year core members of Hancock Wildlife Foundation (the major tenant on the system) were starting to really complain - their sessions were disappearing, posts were duplicated, response to simply looking at some of the pages was measured in minutes at times.
The steps along the way to full recovery of the data should prove interesting if you're faced with similar problems, no matter whether caused by choosing wrongly as I did, or by real hardware errors. Sometimes the best data is on the "failed" drive - and getting it off once the other drive is toast is not an easy task.
When the system started to run slowly again, with the same symptoms - running, then slow, then nothing, then a burst of activity - I simply broke the other mirrors (there were 4 left), expecting that one of them was in fact to blame.
I did this without re-running the analysis that had pointed out the problem in the first place - silly me. It would have shown, I'm guessing, that the culprit was in fact the /dev/sdc3 partition on the /dev/sdc drive, the original root partition - slow because it was failing (in the way the obscure reference described, by the way - more on that later).
So I broke the mirrors one by one, and was left with the system running almost 100% on the bad drive. Only half of the major data was on a pair that didn't include this drive. The rest of the system - root, boot, mysql, home and a bunch of image files - was all on this one drive, and it was running at about 1/10 of normal speed or less.
I tried to re-build the /boot array:
mdadm /dev/md0 --remove /dev/sda1
mdadm /dev/md0 --add /dev/sda1
It took almost an hour for this 500 Meg partition to rebuild. Doing the same for the 50 Gig root and mysql partitions and the 1.8 TByte main data partition would take months at that rate - not an option, even if I could guarantee that the failed drive would last that long.
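A quick back-of-envelope check of that estimate (sizes approximate):

```shell
# Scale the observed /boot rebuild rate (~500 MB in ~1 hour) up to the
# 1.8 TB data partition - integer arithmetic, sizes approximate.
size_mb=$((1800 * 1024))      # ~1.8 TB expressed in MB
rate_mb_per_hr=500            # observed rebuild throughput
hours=$((size_mb / rate_mb_per_hr))
days=$((hours / 24))
echo "$days days"             # roughly five months at that rate
```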
So I started to try to figure out how to use the "failed" drive partitions as if they were good. I'm sitting in the machine room in downtown Vancouver at this time, using my laptop as my reference and lookup tool while fans roar around me.
The long and short of it is, there isn't any documented way that I could find to tell a "failed" partition that it is in fact OK and enable its use as "the" master. I can conceive of editing the partition data with a binary editor and flipping the bit/field, but could not find the format definition or incantation to do so.
After removing the system from the rack and replacing the failed drive with a new one (a Seagate this time - I'm changing to using two different manufacturers of drives in all my mirrors now, just in case), I spent quite a bit of time trying to speed up the copy process from the failed drive (now in an external SATA->USB drive bay) enough that I could get the new drive to at least boot up with all the system's major software setups intact. About 6 hours - and it still simply wasn't enough.
By this time the system had been down for almost 4 hours since email had last flowed. Email is one of the critical services on the machine, and 4 hours is a critical period in SMTP: many systems start informing senders that a message can't be delivered after it has sat in the queue that long.
I bit the bullet and simply installed a fresh version of FC14 from DVD, did a kernel update, ensured that I could access it remotely via ssh and that it was otherwise locked down, and left for home. I left the failed drive and another fresh drive in external drive bays in the cage, to be retrieved later once the system was fully functional again.
I got home just before dinner and, even with the fresh install, had email up and running in less than half an hour. That done, I started looking for ways to get at the "failed" drives as if they were simply non-RAID drives. The file system itself sits at some fixed offset from the start of the partition - but offset by how much?
Here is where a bit of sleuthing finally paid off. I knew it could be done since I'd done it before when fixing another machine, but I'd lost the URL in one of my own workstation's infrequent cleanups. That's one reason for my writing this blog entry - to remind myself of the tools and techniques I use just in case the same thing happens again.
Background to the restoration of this system
Over the years since we sold our ISP, I've had as many as 5 systems running various utilities for my own use and that of friends and long-time customers. They have all been Linux systems, mostly based on Red Hat and/or one of their development releases.
The facilities include email hosting and redirection, DNS, Web and some minor specialty systems. The basic software is open source and for the most part is only minimally modified, if at all. Only some of the configurations (such as running the email database on an old OpenLDAP setup) are in any way "strange", although some of the background utilities are getting a bit long in the tooth and are due for some updating. But for the most part, aside from the massive numbers of files that I'll get into shortly, the systems are pretty plain vanilla. This is why getting email running was actually pretty easy and quick.
Today, instead of several machines (or machine instances running under VMWare, as I've had in recent years) I've put everything onto one fairly new (less than 1 year old) 2U high custom-built (by me) server system with an Intel motherboard and dual Xeon 2.4GHz processors with 48 Gigs of RAM and 4 SATA hard drives, each 2 Terabytes.
The operating system is Fedora Core 14 at this time, mostly because I'd been running it on my own workstation at home when I put this system together, and I knew that it was stable and would work well with what I was planning on putting on it. I'm comfortable enough with security and such issues, and the update path with Fedora, that using a development release does not bother me, and there are some performance pluses in using the newer kernels over using CentOS or other "production" versions of Linux that are using slightly older kernel versions.
I also get access to some of the more recent releases of various video tools such as ffmpeg and mencoder - tools that match those I've put on some of the Hancock Wildlife Foundation's video server systems at eagle nest camera sites, where the new facilities are used extensively. Many things are done on this server with files that originate on those remote sites, and having consistent tool sets is an advantage.
Recovery Is Possible
So, how do you get at the underlying file system in a RAID 1 mirror drive?
The key is figuring out what the offset from the beginning of the partition is - and to find that you either have to wade through pages of answers from Google to the question "what is the layout of a Linux RAID 1 drive partition?" - which I did without finding the answer (and yes, I rephrased the question in several different ways) - or you go fishing for it.
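In hindsight there may be a shortcut worth noting: for version-1.x md superblocks, mdadm --examine prints a "Data Offset" field in 512-byte sectors, which gives the mount offset directly. A sketch (the device name and sector count here are examples; older 0.90 superblocks live at the end of the partition, so their data starts at offset 0 and there is no such field):

```shell
# On a live system one would read the field from the superblock, e.g.:
#   sectors=$(mdadm --examine /dev/sdb6 | awk '/Data Offset/ {print $4}')
# Here a typical 1.2-metadata value stands in for the real reading:
sectors=2048                 # example "Data Offset" in 512-byte sectors
offset=$((sectors * 512))    # byte offset to hand to mount -o offset=
echo "$offset"
```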
I chose to go fishing - and found the ideal fishing script. It almost worked "out of the box", but it turned out to need a minor tweak to deal with mounting through a loop device - it kept running out of loop devices.
I found a page that referred to a "scandrive" program for which the source code was provided. It purported to look through a drive for anything that resembled the start of an ext2 partition, but I'd have had to modify it to look for ext4 and compile it. Having only the basic Linux loaded so far, I'd have had to download the compiler and other tools, dig through the meta information that identifies various files (the "file" utility uses information in /usr/share/misc/magic) and/or pull in the Linux header files, including the one with the ext4 definitions.
I also had found a page by Hans-Henry Jakobsen at Pario.NO about mounting a RAID Reconstructor disk image. It simply uses a shell script - much better and faster as a first attempt.
In this case he'd used a tool that had pulled a disk image of a raw hard drive (from a failed RAID array of some type) into a file on his system - and wanted to get at the underlying file system in this image. I was going to skip the "create a file of the disk" part and go directly to the raw drive because, as one of a pair of mirrors, I knew it had all the data. With Linux, the details of what you're reading from rarely matter: disk image, raw drive, stream of data on a wire - all look like bits and bytes to most of the utilities.
His original code:
for ((i = 0; i < 20000; i++)); do
    mount -t ext2 -o loop,offset=$(($i * 512)) diskimage.img /mnt/point && break
done
I was looking for an ext4 file system, so I changed ext2 to ext4. I was also reading from the drive directly, so I changed diskimage.img to /dev/sdb3 (the new name of what had been /dev/sda3 before I'd put in the new boot drive), and changed the mount point to /m/olddata2 after creating that directory.
So I ran:
for ((i = 0; i < 20000; i++)); do
    mount -t ext4 -o loop,offset=$(($i * 512)) /dev/sdb6 /m/olddata2 && break
done
but what I got back was a lot of error messages saying "no free loop devices" after the first couple of iterations.
Again, a bit of sleuthing, and the incantation for releasing the loop devices (rather than simply increasing the available number which, at a maximum of 64, was too few in any case) got added to the script. At first I added only enough to delete the first loop device - /dev/loop0 - but either the script was not actually doing this reliably or there was something else going on - I never did find out - so the final code simply deleted them all, one by one, each time through the loop.
I was not using any loop devices for anything else, so cleaning them all up seemed OK.
for ((i = 0; i < 20000; i++)); do
    mount -t ext4 -o loop,offset=$(($i * 512)) /dev/sdb6 /m/olddata2 && break
    for x in 0 1 2 3 4 5 6 7; do
        losetup -d /dev/loop$x
    done
done
The "break" would stop the loop once a mount succeeded - and one did. The fact that it was on /dev/loop6 suggests that somehow the first ones were still hung up, but that mattered little at this point - I had mounted the old file system without using the RAID facilities, and since it had been a full mirror and was on a good drive, it was in fact in excellent shape - it didn't even need an fsck.
Now, finally (10 hours after I'd initially started the whole recovery process), I could start copying in not only the setup files, but the millions and millions of image files that Hancock Wildlife Foundation has on this server - one per minute or one per 5 minutes off each of the high-definition cameras in the field, plus thumbnails, time-lapse video and all manner of member-submitted photos from the various "ground observers" - well in excess of 10 million the last time I counted them.
But where to begin? If I simply copied in the whole file system (I was NOT going to use the mount point as the actual file store - no redundancy), it would take more than a day before any of the web sites could be brought back online. That simply was not practical. So I set the system to copying everything but the images, using rsync with its "--exclude=" facility to skip the huge media areas scattered through the several sites, along with "--max-size" to ignore things like huge log files, backups, etc. that are also scattered over the machine in various home directories. I'd go back and remove these restrictions after the sites were up, and people would see the images start to show up over time. That's actually what is happening as I write this - one of the areas (the discussion forum media store) shows about 180,000 files left out of 1.4 million - and those are just the thumbnails!
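The staged copy can be sketched roughly like this - a self-contained demonstration with throwaway directories (the real source was the mounted old partition and the real target the live web root; the paths, size cap and exclude pattern here are illustrative, not the exact ones used):

```shell
# First pass: small site files only - skip the media trees and anything
# over the size cap so the sites come up quickly.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/site/media"
echo '<?php ?>' > "$src/site/index.php"               # small: copied
head -c 200000 /dev/zero > "$src/site/media/big.jpg"  # large media: skipped
rsync -a --max-size=10k --exclude='media/' "$src/" "$dst/"
# Later passes drop the restrictions and let the images trickle in:
#   rsync -a "$src/" "$dst/"
```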
The drives are copying at between 1.5 and 100 MB/second, but with this many files it simply takes time.
Along this path to recovery another of the minor things that bit me was some changes in permissions. The current system has directories and such from previous uses with various owners and permissions - many of whom are no longer allowed on the system. In some cases they are previous employees - in others, customers of one type or another. There are valid reasons for me keeping the files - in many cases they are work I've done or paid for - but there's no reason to maintain the user accounts.
I went through the home directories and created accounts for those that needed them - and set permissions and ownership on the rest to my own. The problem is, some of the accounts got different UID numbers, and since I'm copying the original source drive (with the old UID numbers) over top of it several times, I'm having to run a script that goes back over the area and fixes permissions and ownership fairly frequently - about every 30 seconds or so - just so that the now-active web sites don't stop working due to some mismatch.
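The fix-up pass is essentially a find over the affected trees; here is a self-contained sketch using a throwaway directory (on the real system the path was the web tree, the UID was the stale one from the old drive, and the find added "-exec chown newuser: {} +" to repair ownership in place):

```shell
# Locate files still carrying a stale UID left over from the old drive.
work=$(mktemp -d)
touch "$work/page.php"
stale_uid=$(id -u)          # stand-in for the old drive's UID
find "$work" -xdev -type f -uid "$stale_uid"
```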
The other thing that bit me is that in my initial copy to get the sites up and running, I restricted the size of files too tightly - I set it to 10K, figuring that this would be more than enough for the various PHP, HTML and template files and that anything larger was either an image, video, or backup of some type. It turns out that this is true, with the exception of some library files - one of which turned out to be over 300K. Apache's error messages said "can't find file" for these until I got them copied over in another run with looser size parameters.
Now the system is up with web and email running fairly happily. I've posted a note on the HWF site that the images are coming back as fast as the system can copy them (it just finished the home directory where most of them are - now has to do the image archive of the 1 and 5 minute files - another 6+ million).
Next I have to go through and get some of the other facilities running. They are not as critical as I have redundancy on other machines for them - but it still needs to be done.
In the case of DNS, this is a lesson for you all "out there" - don't put all your DNS on one machine, or even on one network, like I've seen so many times in the past. If the machine or network goes down, your website and email aren't just "not able to connect" - they are "not found", which is much more critical.
In my case, I have DNS secondaries at 3 different physical locations around the city, all on different backbone networks and all updated from my master here at my home. In fact, each of the servers thinks it is a master and can fill that role if my own system goes offline for any length of time. The setup was a bit hairy to start with, but it works and has done so for almost 20 years now. No customer domains have gone missing in that time.
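In BIND terms that arrangement is just ordinary zone transfers; a minimal sketch of a secondary's configuration (the zone name and master address are invented examples, and the syntax is the slave/masters form of BIND 9 as shipped in that era):

```
// On a secondary, in named.conf - the master's address is an example.
zone "example.com" {
    type slave;
    masters { 192.0.2.10; };
    file "slaves/example.com.db";
};
```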
My next task for administration is to reintroduce email host redundancy. It is built into the software I use, but because I now have a single very powerful machine that can handle everything else, I no longer have the physical redundancy I used to have for email - and getting something up and running in 4 hours or less is my penalty for this. I have access to another system that I'll set up as a low-priority MX host - after the rest of this restore is done. Priorities, always priorities.
Maybe next time I'll choose wisely and save myself a day or two - but in the meantime I've learned more, as usual. I hope you learned something too.