Monday, October 12, 2009

Watching a hard drive die

One thing I get asked all the time is how to distinguish between a hard drive that is physically going bad and one that is just not working right from a software perspective. This week I had a drive fail mysteriously, and I saved the session where I figured out what went wrong to show what I do. It's easy enough to find people suggesting you monitor 'x' on your drive, where 'x' varies a bit depending on who you ask. Writing scripts to do that sort of monitoring is easier if you've seen how a bad drive acts, which (unless you're as lucky as me) you can't just arrange on demand. This is part one of a short series I'm going to run here about hard drive selection, which will ultimately lead to the popular "SATA or SAS for my database server?" question. To really appreciate the answer to that question, you need to start at the bottom, with how errors play out on your computer.

Our horror story begins on a dark and stormy night (seriously!). I'm trying to retire my previous home server, an old Windows box, and migrate the remainder of its large data files (music, video, the usual) to the new Ubuntu server I live on most of the time now. I fire up the 380GB copy on the Windows system a few hours before going to bed, using cygwin's "cp -arv" so I won't get "are you sure?" confirmations stopping things, and expect it to be finished by morning. When I check on it later, the copy isn't even running anymore. Curious what happened, I run "du -skh /cygdrive/u/destination" to figure out how much it did copy before dying. In the middle of that, the drive starts making odd noises, and the whole system reboots without warning. This reminds me why I'm trying to get everything out of Windows 2000.

At this point, what I want to do is look at the SMART data for the drive. The first hurdle is that I can't see that when the disk is connected via USB. A typical USB (and Firewire) enclosure bridge chipset doesn't pass through requests for SMART data to the underlying drive. So when things go wrong, you're operating blind. Luckily, this enclosure also has an eSATA connector, so I can connect it directly to the Linux PC to see what's going on. That connection method won't have the usual external drive limitations. If that weren't available, I'd have to pull the drive out of its enclosure and connect directly to a Linux system (or another OS with the tools I use) to figure out what's going on.
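
Depending on the bridge chipset, newer smartmontools releases can sometimes tunnel SMART commands through a USB bridge if you tell smartctl which translation layer to try. Whether that works with any given enclosure is hit or miss, so treat this as something worth attempting before pulling a drive out of its case; the device name here is just an example:
$ sudo smartctl -d sat -a /dev/sdg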

(If you don't normally run Linux, you can install smartmontools on Windows and other operating systems. Just don't expect any sort of GUI interface. Another option is to boot a Linux live CD; I like Ubuntu's for general purpose Linux work, but I often use SystemRescueCd instead for diagnosing, and sometimes repairing, PC systems that are acting funny.)

I plug the drive into my desktop Linux system, run "tail /var/log/messages" to figure out which device it gets assigned (/dev/sdg), and now I'm ready to start. First I grab the drive's error log to see whether the glitch was at the drive hardware level or not:
$ sudo smartctl -l error /dev/sdg
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
...
Error 2 occurred at disk power-on lifetime: 153 hours (6 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 08 00:00:17.300 IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 154 hours (6 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 01 00 00 a0 Error: UNC 1 sectors at LBA = 0x00000001 = 1

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 30 5e 00 d4 48 04 11:17:05.800 READ DMA EXT
25 03 40 4f 00 d4 40 00 11:16:56.600 READ DMA EXT
35 03 08 4f 00 9c 40 00 11:16:56.600 WRITE DMA EXT



DMA read/write errors can be caused by driver or motherboard issues, but failing to identify the device isn't good. Is the drive still healthy? By a rough test, sure:
$ sudo smartctl -H /dev/sdg
SMART overall-health self-assessment test result: PASSED

This is kind of deceptive though, as we'll see shortly. The next thing we want to know is how old the drive is right now and how many reallocated sectors it has. Those are the usual first warning sign that a drive is losing small amounts of data. We can grab just about everything from the drive like this (bits are clipped from the output in all these examples to focus on the relevant parts):
$ sudo smartctl -a /dev/sdg
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 092 092 016 Pre-fail Always - 2621443
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 111 111 024 Pre-fail Always - 600 (Average 660)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 74
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 7
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 154
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 77
193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 77
194 Temperature_Celsius 0x0002 114 114 000 Old_age Always - 48 (Lifetime Min/Max 18/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 12
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 4
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 4
...

Make sure to scroll this to the right; the last column is the most important. UDMA_CRC_Error_Count matches the errors we're still seeing individually. But the real smoking gun here, and in many other cases you'll see if you watch enough drives die, is Reallocated_Sector_Ct (7) and its brother Reallocated_Event_Count (12). Ignore all the value/worst/thresh nonsense; that data is normalized by a weird method that doesn't make any sense to me. The RAW_VALUE column is what you want. On a healthy drive, there will be zero reallocated sectors. Generally, once you see even a single one, the drive is on its way out. That is always attribute #5. I like to monitor #194 (temperature) too, because it's a good way to detect when a system fan has died: the drive overheating serves as a secondary monitor for that very bad condition. You can even detect server room cooling failures that way; it's fun watching a stack of servers all kick out temperature warnings at the same time after the circuit the AC is on blows.
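
If you want a script to watch just those two attributes instead of eyeballing the full report, a few lines of shell are enough. This is only a minimal sketch; the device name and the temperature threshold are examples you'd adjust for your own hardware:
# Minimal sketch: check reallocated sectors (attribute 5) and temperature (194)
DEV=/dev/sdg
# The raw value is the last column (field 10) of the smartctl -A attribute table
REALLOC=$(sudo smartctl -A $DEV | awk '$1 == 5 {print $10}')
TEMP=$(sudo smartctl -A $DEV | awk '$1 == 194 {print $10}')
[ "$REALLOC" -gt 0 ] && echo "WARNING: $DEV reports $REALLOC reallocated sectors"
[ "$TEMP" -gt 50 ] && echo "WARNING: $DEV is running at ${TEMP}C"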

The other thing to note here is Power_On_Hours. Here the raw value (154 hours) confirms that the recent errors in the logs really did just happen. This is a backup drive I only power on to copy files to and from, and it's disappointing that it's died with so little life. Why that happened and how to prevent it is another topic.

Next thing to do is to run a short self-test, wait a little bit, and check the results. This drive is loud enough that I can hear when the test is running, and it doesn't take long:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -l selftest /dev/sdg
...
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 60% 153 717666530


Here's the thing to realize: the drive reports "healthy" to the simple test, but it isn't; there are reallocated sectors, and even the simplest of self-tests will find them. Enterprise RAID controllers can be configured to do a variety of "scrubbing" activities when the drives aren't being used heavily, and this is why they do that: early errors can get caught by the drive long before you'll notice them any other way. Nowadays drives will reallocate marginal sectors without even reporting an error to you, so unless you look at this data yourself you'll never know when the drive has started to go bad.

At this point I ran a second short test, then an extended one; here's the log afterwards:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -t long /dev/sdg
$ sudo smartctl -l selftest /dev/sdg

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 157 2454905
# 2 Short offline Completed: read failure 40% 156 711374978
# 3 Short offline Completed: read failure 60% 153 717666530

The extended test found an error even earlier on the disk. It seems pretty clear this drive is failing quickly. At this point, there's little hope for it beyond saving any data you can (which I did before even launching into this investigation) and then running the manufacturer's diagnostic software. What I'd expect is that the drive *should* get marked for RMA given that it's already in this condition. It's possible the diagnostic will "fix" it instead. That's a later topic here.

In short, there are a few conclusions you can draw here, and since this is a quite typical failure I can assure you they hold up:

  • Avoid connecting external drives to a Windows server if you can; they're pretty error prone, and Windows isn't great at recovering from this sort of error. That's one reason I get so many requests to help distinguish true hardware errors from Windows problems.

  • Drives set up so that they can't be checked via SMART are much more likely to fail quietly. Had this one been connected directly rather than via USB, I could have picked up this problem when the reallocated sector count was lower and decreased my risk of data loss.

  • If you're taking an offline drive out of storage to use it again, a short SMART self-test and a look at the logs afterward is a good preventive measure. Some drives even support a "conveyance" self-test aimed at checking for damage after shipping; this one didn't, so I went straight from the short test to the long one.

  • When monitoring regularly with smartmontools, you must watch reallocated sectors yourself; the health check will not include them in its calculations. They are an excellent early warning of errors that will eventually lead to data loss.

  • Regularly running the short SMART self-test doesn't introduce much disk load, and it is extremely good at finding errors early. I highly recommend scheduling a periodic SMART self-test on your server outside of peak hours if you can, at least if you're not using some other form of data consistency scrubbing (see the smartd.conf sketch after this list).

  • Running a SMART self-test when you first put a drive into service provides a baseline showing good behavior before things go wrong. I didn't do that in this case, and I wish I had that data for comparison. It's helpful for general drive burn-in too.
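
Most of that regular monitoring and test scheduling can be handled by smartd, the daemon that ships with smartmontools, rather than by hand-rolled scripts. Here's a rough sketch of an /etc/smartd.conf entry; the device name, e-mail address, and test times are all examples you'd adapt to your environment:
# Monitor everything smartd knows about on this drive, e-mail when it finds trouble,
# and run a short self-test daily at 2 AM plus a long one Saturdays at 3 AM
/dev/sdg -a -m admin@example.com -s (S/../.././02|L/../../6/03)
Even with -a enabled, you may still want a separate check on the raw reallocated sector count, for the reason given in the list above.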

That's the basics of what an error looks like when you catch it before the drive stops responding altogether, which I've found is quite often how it plays out. Anecdotally, about 75% of the drive failures I've seen in the last seven years (3 out of 4 since I started doing this) have given warnings like this first. Some of the larger drive studies floating around recently suggest SMART isn't quite that reliable an early warning for most people, but it's certainly much better than not checking for errors at all.

The fact that this drive died somewhat mysteriously, in a way that it even still passed its SMART health check, has some interesting implications for its suitability in a RAID configuration. That's where I'm heading with this eventually.

Whatever OS you're running, you should try to get smartmontools (or a slicker application that does the same thing) running and set up to e-mail you when it hits an error. That regime has saved my data on multiple occasions.
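
If you go the smartd route, it's also worth confirming the mail path actually works before you're depending on it. Adding -M test to the configuration line (the address below is again just an example) makes the daemon send a test message every time it starts:
/dev/sdg -a -m admin@example.com -M test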
