The Joys of Commodity Hardware

November 30, 2003

As you may or may not have noticed, there was yet another blip in the personal hell that is UFies.org hardware problems for me. However, we're back up and going. We did lose /var, but most of the important files (mail, databases, etc.) were backed up at 3am last night, and they were restored. Not everything was backed up, so if you notice anything strange, throw me an email at arcterex ( - at - ) ufies dawt org.

Thanks for everyone's patience and understanding. Read on for an explanation of what actually happened.

Update - ~~webmail is busted, I'll deal with that in the morning.~~ Fixed.

It was the start of a Sunday that I was hoping to take easy and relax... ah, the best laid plans of mice and men. Fred threw me an email to point me to a RAID failure message. We had had an element in one of the arrays fail not that long ago, but the spare kicked in and all was well, and I was turning a blind eye until I had the mental strength to deal with it. Anyway, the co-lo had a scheduled powerdown from the hydro company for last night. Turns out the power company was early, though. While the RAID5 was reconstructing the array, it must have hit a bad block or something and kicked out the element. So now we were down to 2/3 :(

I moved the spare partition in from the first array to put in as the 3rd and it seemed to go in ok, but as it was reconstructing, something kicked it out and now we had 1/3 going, which, as anyone who knows about RAID5 knows, can't really happen. Things were still running, and "ls" was still showing files, but there were I/O errors when I tried to copy files around (in a desperate attempt to save data in case something happened). So I resigned myself to the fact that I was going to have to go in and fix things from the console. An hour and a half later I was there, armed with a new drive.
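For anyone who doesn't live and breathe RAID5, here's a rough sketch of why 1/3 is a dead end. It assumes the usual XOR parity scheme on a 3-element array (two data blocks plus one parity block per stripe). The block contents and the names (`xor_blocks`, `d0`, `d1`) are made up for illustration; this is a toy, not the actual RAID code.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-element array: two data blocks plus parity.
d0 = b"mail"
d1 = b"logs"
parity = xor_blocks(d0, d1)

# One element lost (2/3 left): the two survivors are enough to rebuild it.
rebuilt_d1 = xor_blocks(d0, parity)

# Two elements lost (1/3 left): only d0 survives. A completely different d1
# with its own matching parity is just as consistent with d0, so there is
# nothing to tell the array which one was really on disk.
other_d1 = b"junk"
other_parity = xor_blocks(d0, other_d1)
```

With any two of the three blocks you can always get the third back; with only one, every possible value of the missing data is equally plausible.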

It's getting late and it's been a long day, so I'm going to keep it short. Hit my blog sometime tomorrow for an even longer and more boring version of this.

Anyway, I figured out which drive was causing the errors and swapped it out. When it came back up all nicely reconstructed and what not, I was quite disheartened ("bugger" I think was one of the terms I used) to see that I couldn't mount /var anymore, as it was not recognized as an ext3 filesystem. So I re-created it, copied over the files I have backed up from /var every night (mail spool, databases, dpkg lib, etc.) and brought the system hobbling back up to life. I managed to recreate the needed files from the system of a friend of mine who also runs Debian unstable (thanks wim!) and have been running various "apt-get install --reinstall <package>" commands since. Everything appears to be running again, other than the loss of all the logs and mail/database stuff from 3am till 10am, when the system went down ("luckily" the mailserver was down most of the day as well, so mail was sent over to the secondary server and saved, and there was only a 7 hour window of nuked mail).

Anyway, that's about it. Like I said, let me know if anyone sees anything that is broken, and thanks to Fred for getting me the spare drive and the guys at data-fortress for putting up with me there and for keeping me company (though they had to be there anyway dealing with the power issues).

Posted by Arcterex at November 30, 2003 10:30 PM