New Jersey’s source for Open Source Consulting since 1998

Troubleshooting a hard drive crash, part deux

In an earlier post, I gave you the data needed to diagnose a hard drive failure, along with several symptoms of the failure. In this post, we’ll go over my observations and fixes.

First problem

I mentioned that partitions 6 and 7 were complaining of errors. Naturally, the way to fix these is to run a file system check on the partitions. The relevant fstab entries tell us that the partitions have ext2 file systems on them, so running fsck is called for.

Oh, but wait, we don’t have the root password. No problem, we boot into KNOPPIX and run fsck from there.

But what partitions do you run fsck on? Note that fdisk -l reports the partitions on /dev/sdb but fstab reports partitions on /dev/sda! They can’t both be right, can they?

Yes, they can. While we only have one hard drive, which I verified by actually opening up the machine (remember, ALWAYS check your assumptions!), we have two different operating systems, each of which does things slightly differently. My guess is that KNOPPIX saw the RAID controller as /dev/sda and the hard drive as /dev/sdb, whereas the original operating system, Mandrake 10.1, was using the RAID drivers and saw the drive as /dev/sda. Either way, both names refer to the same physical disk, so I knew it was safe to run fsck on partitions 6 and 7 of /dev/sdb.
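The renaming can be sketched as a small script. This is only an illustration, assuming the installed system calls the disk /dev/sda and the rescue system calls it /dev/sdb, as described above; the function name to_rescue_name is a placeholder. It prints the fsck commands rather than running them, so they can be reviewed before being run as root on unmounted partitions.

```shell
#!/bin/sh
# Translate a device name from the installed system's fstab (sda) into the
# name the rescue system gave the same disk (sdb). Both names are assumptions
# based on the situation described in this post.
to_rescue_name() {
    printf '%s\n' "$1" | sed 's|^/dev/sda|/dev/sdb|'
}

# Print, rather than run, the checks for partitions 6 and 7.
for part in /dev/sda6 /dev/sda7; do
    echo "fsck -t ext2 $(to_rescue_name "$part")"
done
```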

Second problem

This problem is trickier. According to fdisk -l, we have a FAT file system smaller than 32 MB, but the client tells us he has 45 GB of data stored on this partition. So what gives?

Look very closely at the output of fdisk. You’ll notice three things:

  1. partition 8 is only one block long,
  2. there is no partition 9, and
  3. there is a three-block gap between the end of partition 7 and the beginning of partition 8.

Item one tells you the partition table is corrupt; it doesn’t make any sense to have a partition that’s one block long.

Item two may indicate that someone removed the partition but forgot to take it out of fstab. While that would cause major problems on booting, it wouldn’t stop the machine from booting altogether. Besides, the clients didn’t have the technical know-how to do that.

Item three, while technically allowed (meaning there’s nothing stopping you from doing it), is just a Bad Thing.
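The same sanity checks can be automated. The following is a rough sketch, assuming the partition table has already been reduced to "partition start end" triples sorted by start; the function name check_layout and the sample numbers are made up for illustration and are not the client's real table.

```shell
#!/bin/sh
# Read "partition start end" triples (one per line, sorted by start) and flag
# partitions shorter than a minimum length and gaps between neighbours.
# The optional first argument is the minimum sensible length (default 2).
check_layout() {
    awk -v min="${1:-2}" '
        { len = $3 - $2 + 1 }
        len < min { printf "partition %s is only %d unit(s) long\n", $1, len }
        NR > 1 && $2 > prev_end + 1 {
            printf "gap of %d unit(s) before partition %s\n", $2 - prev_end - 1, $1
        }
        { prev_end = $3 }
    '
}

# Made-up numbers for illustration only.
printf '%s\n' '6 100 199' '7 200 299' '8 303 303' | check_layout
```

With the sample input, the script reports both the one-unit partition 8 and the three-unit gap in front of it.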

So, how to fix it? If we knew how big partition 8 or 9 was, we could calculate how big the other partition was, but we don’t have that info. We know from fstab that partition 9 was the /tmp directory and since the /tmp directory doesn’t have to be on a separate partition (although it is Good Practice), we can ignore it. So I made an executive decision and reset partition 8 to be the rest of the drive.

Now we have a new problem

With the new partition 8 now spanning the old partitions 8 and 9, we are going to have problems with the filesystem on the new partition: the superblock is corrupt, and the files from the old partition 9 won’t “fit” into the new filesystem. Since old partition 9 was originally mounted on /tmp, I don’t care what happens to that data, so we’re left with the superblock problem.

Running fsck complained that no superblock could be found. The canned response to that is “use a backup superblock”. The trick is to find one. The easiest way is to use the testdisk utility that is part of KNOPPIX. testdisk found backup superblocks at 32768, 98304, and 163840. So running fsck -b 32768 -B 4096 fixed the filesystem!

N.B.: These are block numbers, not byte offsets. With a 4096-byte block size, each block group holds 8 × 4096 = 32768 blocks, so 32768, 98304, and 163840 are the first blocks of groups 1, 3, and 5, which is where a filesystem using sparse superblocks keeps its first three backups. Fortunately, I decided to use testdisk instead of guessing the block size and trying each possible backup location in turn until I found one.
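To double-check the locations testdisk reported, the expected positions can be computed by hand. The following is a minimal sketch, assuming a 4096-byte block size, the mke2fs default of 8 × block size blocks per group, and the sparse superblock feature. The function name backup_superblocks and the device name /dev/sdb8 in the final comment are placeholders.

```shell
#!/bin/sh
# List the first block of each block group that holds a backup superblock on
# an ext2 filesystem with sparse superblocks: group 1 and every group whose
# number is a power of 3, 5, or 7.
# Assumes 8 * block_size blocks per group and a block size of 2048 or more,
# so that the first data block is block 0.
backup_superblocks() {
    block_size=$1
    ngroups=$2                      # number of block groups to consider
    per_group=$((block_size * 8))
    g=1
    while [ "$g" -lt "$ngroups" ]; do
        is_backup=0
        if [ "$g" -eq 1 ]; then is_backup=1; fi
        for base in 3 5 7; do
            p=$base
            while [ "$p" -lt "$g" ]; do p=$((p * base)); done
            if [ "$p" -eq "$g" ]; then is_backup=1; fi
        done
        if [ "$is_backup" -eq 1 ]; then echo $((g * per_group)); fi
        g=$((g + 1))
    done
}

backup_superblocks 4096 6

# A repair using the first backup would then look like this (as root, on an
# unmounted partition; the device name is an assumption):
#   e2fsck -b 32768 -B 4096 /dev/sdb8
```

For the first six block groups this prints 32768, 98304, and 163840, matching what testdisk found.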

At this point, I rebooted the machine and…

It boots!

Well, sorta’. The system complained that /var/run didn’t exist. I have no idea why it didn’t exist, but it was easy enough to fix. Reboot and the system came up, except for Webmin. After checking the proper directories and config files which, naturally, had not changed, I did a strace -f -o strace.out /sbin/service webmin start to discover that webmin couldn’t write to /var/run/webmin. Another easy fix and a reboot.
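Picking the failure out of a long strace.out is easier with a filter. This is a sketch only: the function name failed_paths is a placeholder, and the sample log lines below are made up in strace's output format. They are not copied from the real trace, and the file names in them are illustrative.

```shell
#!/bin/sh
# Print the error code and path of file-related system calls that failed in
# a strace log. Matches lines such as:
#   1234 open("/some/path", ...) = -1 ENOENT (No such file or directory)
failed_paths() {
    grep -E '(open|openat|mkdir|creat)\(.*\) = -1 E' "$1" |
        sed -E 's/^[^"]*"([^"]*)".*= -1 ([A-Z]+).*/\2 \1/' |
        sort -u
}

# Made-up sample log (PID, call, result); not the real strace.out.
log=$(mktemp)
cat > "$log" <<'EOF'
2001 open("/etc/webmin/miniserv.conf", O_RDONLY) = 3
2001 open("/var/run/webmin/miniserv.pid", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 ENOENT (No such file or directory)
EOF
failed_paths "$log"
rm -f "$log"
```

Each line of output pairs an error code with the path that triggered it, which narrows down which directory or file to inspect.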

Finally, the system booted and was running normally!
