Unlike Harrison Ford in Indiana Jones and the Last Crusade, I didn't choose wisely. In fact, there was no obvious information on which to base the choice; I simply picked one of the two RAID 1 drives (mirrors) and removed it - and the system immediately started to run faster, at least for about a day.
A lot of analysis had been done up to that point, and finally one screen, updating every 1/10 second, showed me that the process that was holding the system up had to do with the RAID 1 mirror behind the root partition.
I'd had no hardware errors reported by smartd or RAID errors by mdmonitor.
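For anyone who wants to double-check what mdmonitor is (or isn't) reporting, the kernel's own view of each array is in /proc/mdstat. What follows is only a minimal sketch, assuming the usual layout of that file, that prints the name of any array with a missing member. The function names and the sample text are made up for illustration and are not taken from this system.

```shell
#!/bin/sh
# Print the name of every md array whose member-status map (e.g. [UU])
# contains an underscore, which marks a missing or failed member.
# Reads /proc/mdstat-format text on standard input.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }     # remember which array is being described
        /\[[U_]+\]/ {
            status = $NF                # last field is the [UU] / [_U] map
            if (name != "" && index(status, "_") > 0) print name
            name = ""
        }
    '
}

# Illustrative input only; these arrays and sizes are hypothetical.
sample_mdstat() {
    cat <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdc3[1] sda3[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdc2[1]
      2096384 blocks [2/1] [_U]

unused devices: <none>
EOF
}

sample_mdstat | degraded_arrays
# On a live system: degraded_arrays < /proc/mdstat
```

A clean result here only says that both halves of each mirror are present. It says nothing about how fast they are responding, which was the real trouble in this case.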
The only thing I had done recently was update the kernel to the latest version on Fedora 14, the OS version on the system. Could this have done something? I found an obscure reference to running RAID and problems with Western Digital Caviar Green drives - this system uses Caviar Blacks, but... maybe.
So I broke the mirror.
mdadm /dev/md0 --fail /dev/sda3
I could have chosen /dev/sdc3, but that 50/50 chance somehow made me type the first drive's name instead. It didn't seem to be a hardware problem specific to one drive, but rather a problem with using this type of drive in a RAID, so there was no obvious reason to choose one over the other.
Then the problem surfaced again, but by this time I'd re-formatted and tested the removed partition (thank goodness I didn't do this to the rest of the drive), and there was simply no way back. Two days later, I'm still recovering files from the other partitions and getting the system back to running all the various things it ran before. Looking back, I'm not sure that there was anything I could have done differently except have the good luck to pick the correct drive - in general I don't gamble because "if you don't play, you can't lose" - in this case I lost.
It all started with the Linux server system I host becoming slower and slower. The real load on it at this time of year is trivial, so load should not have been a factor at all. I started looking for other reasons: a denial of service attack, huge directories (some of which are growing to contain millions of photos at this point) and a number of other things went through my head. Each was tested, and none showed up as the real cause.
Meanwhile, the all-year core members of Hancock Wildlife Foundation (the major tenant on the system) were starting to really complain - their sessions were disappearing, posts were duplicated, and the response time for simply looking at some of the pages was at times measured in minutes.
The steps along the way to full recovery of the data should prove interesting if you're faced with similar problems, no matter whether caused by choosing wrongly as I did, or by real hardware errors. Sometimes the best data is on the "failed" drive - and getting it off once the other drive is toast is not an easy task.
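As a general precaution, and not part of the recovery described here, a pulled mirror member can be imaged before it is re-formatted, so the "failed" half stays available if it turns out to hold the better data. This is a minimal sketch, assuming there is enough spare space somewhere for the image. The function name and both paths in the usage line are hypothetical.

```shell
#!/bin/sh
# image_partition SRC DEST
# Copy SRC (a partition such as /dev/sda3, or any file) to DEST byte for byte,
# then compare the two. Returns non-zero if the copy or the comparison fails.
image_partition() {
    src=$1
    dest=$2
    # Never overwrite an image that already exists.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    # 1 MiB blocks; dd's transfer statistics on stderr are discarded.
    dd if="$src" of="$dest" bs=1048576 2>/dev/null || return 1
    # Silent comparison: exit status 0 only if the copy is identical.
    cmp -s "$src" "$dest"
}

# Hypothetical usage; both paths are placeholders:
#   image_partition /dev/sda3 /mnt/spare/sda3.img
```

The image is only trustworthy if the partition is not being written to while it is copied, so this belongs after the member has been failed and removed from the array, not before.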




