Mountains

Wednesday, August 3, 2011

Internal Combustion

Or, the death of a Seagate Momentus ST94011A from a combination of old age and heat death.

Given that the NAS was not really a critical system, I've had all kinds of time to attempt to figure out what is going on.

Using mprime to attempt to overheat the laptop (while booted from a Xubuntu live CD) proved that the motherboard itself was fine. Factoring primes for days simply showed that the system could, when pressed, drive up the electric bill and heat the room even further than the 100°C heat days were.

The SMART logs from FreeNAS show that the boot drive has far exceeded its spinup/spindown count, something I feared would happen since the NAS tended to wake the drive up frequently to write log data to it. However, it still seemed to boot and read/write data well enough. Just now and then, it would kernel panic, crash, and reboot.
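For anyone wanting to pull the same counters out of a SMART report, here is a minimal sketch. It assumes the usual `smartctl -A` attribute table (ID, name, flag, value, worst, thresh, type, updated, when-failed, raw value) and the common attribute names Start_Stop_Count, Load_Cycle_Count and Reallocated_Sector_Ct; the function names are made up for illustration, and not every drive reports every attribute.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` style table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # header and banner lines are skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            # Some raw values carry trailing notes; only the first token is used.
            if raw.isdigit():
                attrs[name] = int(raw)
    return attrs


def wear_counters(attrs, names=("Start_Stop_Count",
                                "Load_Cycle_Count",
                                "Reallocated_Sector_Ct")):
    """Pick out the counters most relevant to spin cycles and remapping."""
    return {name: attrs[name] for name in names if name in attrs}
```

Feeding it the saved output of `smartctl -A /dev/sda` should give a small dictionary of raw counts, which can then be compared by hand against whatever cycle rating the drive's datasheet lists.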

Returning to Linux, I've been fiddling with badblocks as a tool to test the drive. Curiously, badblocks would write to the entire disk, but then crash when verifying with a cryptic "%killed" message.

Very odd. I can find little information on what a badblocks crash actually means: the program provides very little debugging information. Some experimenting revealed that badblocks can be made to crash by testing too many blocks at once (% badblocks -b 4096 -c 999999999999999), which exhausts available system memory. My original tests used 256 megs of RAM with the -c flag. When I scaled down to ~32 megs, the crashing stopped. The likely culprit was the limited memory available on a live CD boot with no swap.
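The relationship between -b, -c and memory use can be roughed out with a little arithmetic. This is only a sketch: it assumes badblocks holds about two buffers of (block size x blocks at once) bytes during a write test, one for the pattern and one for the read-back. That buffer count is an assumption, not something taken from the badblocks documentation, and the function names are invented here.

```python
MIB = 1024 * 1024


def badblocks_buffer_bytes(block_size, blocks_at_once, buffers=2):
    """Rough I/O buffer footprint for one batch of blocks.

    buffers=2 is an assumption: one pattern buffer plus one read-back
    buffer for the destructive write test (-w).
    """
    return block_size * blocks_at_once * buffers


def blocks_for_budget(budget_bytes, block_size, buffers=2):
    """Largest -c value whose buffers fit inside budget_bytes."""
    return budget_bytes // (block_size * buffers)


if __name__ == "__main__":
    # Hypothetical budget: keep the buffers under 32 MiB with 4096-byte blocks.
    print(blocks_for_budget(32 * MIB, 4096))
```

On a machine with no swap, leaving generous headroom below physical RAM is probably wise, since the live CD's RAM-backed filesystem competes for the same memory.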

Running successive passes did turn up failures after a few hours
(% badblocks -b 4096 -c 32000 -w -v -p 10 /dev/sda )
leaving interesting messages in /var/log/kern.log and barfing an exhaustive list of bad sectors. I interpret this to mean the controller on the disk is fried.

I picked up another old PATA drive out of the pile and assembled it into the machine. Another day of stress testing, then we should be back in business.

2 comments:

  1. I mainly wanted to say, memtest. Just to cover your bases.

    I also wanted to say smartctl might be helpful--your drive does LBA, I'm sure. That's like virtual memory--logical blocks aren't necessarily sequential on disk. The drive has a spare block pool it remaps over damaged blocks as it detects them. This is why you sometimes end up with "stuck" blocks ("offline uncorrectable" is the lingo)--they won't read successfully and thus can't be remapped till you write a new block to that logical address. Anywho, SMART keeps track of how many blocks have been swapped out and how many replacement blocks remain. On modern drives, that pair being (large:small) might be a better indicator than badblocks.

    .02$

    More than you want to know: http://smartmontools.sourceforge.net/badblockhowto.html

  2. Actually, I did memtest first. Memtest doesn't flog the whole motherboard like running mprime while an OpenGL screensaver runs, so if you're looking for heat failure, you'll have to do better than memtest.

    There was some fishy data in the SMART logs on the NAS. However, based on the way that badblocks was failing, the drive was heating up, then the controller was going to lunch.

