Yuk, the firewall/router machine at home just died horribly.
I use ReiserFS at home. Yes, I know. I liked it back in the days of ext2 (and before ext3 was as obviously stable as it is now). Unfortunately I have noticed it seems to be a little sensitive to disk corruption. From what I can gather from looking at logs etc., a few weeks ago a power cut at home caused the machine to reboot uncleanly and probably caused a little corruption to the partition holding /home. This amounted to a ticking time bomb since, when reiserfsck is run at boot, it replays the journal, assumes all is fine and dandy, and doesn’t actually check the filesystem. At this point everyone is screaming ‘why didn’t you check it?’. Well, mostly because I am a big lazy fool who will now know better in future. OK?
Fast-forward to today and an e-mail arrives at 3:43am, which causes exim to drop a file in my home directory’s Maildir, which mutt then tries to read. Some nasty ju-ju happens and mutt is blocked while ReiserFS starts running around kernel memory like a headless chicken. From this point on, reads from other files in /home start blocking indefinitely. I wake up this morning to find ~400 exim processes sitting there waiting to finish writing files to my Maildir. Furthermore, any other process that tries to write to /home is being blocked, all of them in TASK_UNINTERRUPTIBLE, which means that kill -9 has no effect. By this point 99.9% of the CPU was being used by the kernel as ReiserFS sat there in an infinite loop, brain farting everywhere.
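For the curious, here is roughly how you could count the stuck processes without touching the broken filesystem. This is a minimal sketch, not what I actually ran on the day: it assumes a Linux /proc, where the third field of /proc/<pid>/stat is the process state and ‘D’ means uninterruptible sleep. The function names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process went away or the entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

It only reads procfs, so it should not join the queue of processes blocked on /home, although on a box this sick I would not bet my last SSH session on anything.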
Add to this happy picture the fact that I only see this while sitting at work, 2 miles away, with nothing more than an SSH session into the box in question. I can’t leave this session since I strongly suspect that trying to ssh in again will lead to sshd being blocked when it runs bash (and bash tries to read my profile). Unmounting /home won’t be easy (all these processes are sitting there making the device ‘busy’) and /sbin/reboot won’t work since a) it can’t unmount /home and b) it can’t kill all the processes and will in fact just become another blocked process. Thanks to the wonder of lazy unmounting (umount -l /home) I can pretend sufficiently well to reiserfsck that /home isn’t mounted and that it can try fixing the filesystem. The first error spotted is the troublesome email that triggered the whole mess, which reiserfsck claims to have ‘fixed’ but did in fact just nuke. After fsck thinks it’s done I now need to reboot the box, hope it comes up again, and hope that the code hastily appended to /etc/conf.d/local.sh will bring the Internet connection back up (don’t ask about the ADSL woes...). Suffice it to say that this was the only time I’ve ever had to ‘echo b > /proc/sysrq-trigger’ just to get the damn box to reboot. Thank god I compiled SysRq support into the kernel (remember, I had no physical access to the box).
Happily it came up again and seems to be fine now. As far as I can tell, exim even managed to deliver all the mail that arrived after the naughty one too. I was particularly impressed by that, to tell the truth.
Well, how was your morning? :)