Monday, October 12, 2009

Watching a hard drive die

One thing I get asked all the time is how to distinguish between a hard drive that is physically going bad and one that is just not working right from a software perspective. This week I had a drive fail mysteriously, and I saved the session where I figured out what went wrong to show what I do. It's easy enough to find people suggesting you monitor 'x' on your drive, where 'x' varies a bit depending on who you ask. Writing scripts to do that sort of monitoring is easier if you've seen how a bad drive acts, which (unless you're as lucky as me) you can't just arrange on demand. This is part one of a short series I'm going to run here about hard drive selection, which will ultimately lead to the popular "SATA or SAS for my database server?" question. To really appreciate the answer to that question, you need to start at the bottom, with how errors play out on your computer.

Our horror story begins on a dark and stormy night (seriously!). I'm trying to retire my previous home server, an old Windows box, and migrate the remainder of its large data files (music, video, the usual) to the new Ubuntu server I live on most of the time now. I fire up the 380GB copy on the Windows system a few hours before going to bed, using cygwin's "cp -arv" so I won't get "are you sure?" confirmations stopping things, and expect it to be finished by morning. When I check on it later, the copy isn't even running anymore. Curious what happened, I run "du -skh /cygdrive/u/destination" to figure out how much it did copy before dying. In the middle of that, the drive starts making odd noises, and the whole system reboots without warning. This reminds me why I'm trying to get everything out of Windows 2000.

At this point, what I want to do is look at the SMART data for the drive. The first hurdle is that I can't see that when the disk is connected via USB. A typical USB (and Firewire) enclosure bridge chipset doesn't pass through requests for SMART data to the underlying drive. So when things go wrong, you're operating blind. Luckily, this enclosure also has an eSATA connector, so I can connect it directly to the Linux PC to see what's going on. That connection method won't have the usual external drive limitations. If that weren't available, I'd have to pull the drive out of its enclosure and connect directly to a Linux system (or another OS with the tools I use) to figure out what's going on.
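
Depending on the bridge chipset, newer smartmontools releases can sometimes tunnel SMART commands through a USB bridge if you tell smartctl which translation layer to try. Whether that works with any given enclosure is hit or miss, so treat this as something worth attempting before pulling a drive out of its case; the device name here is just an example:
$ sudo smartctl -d sat -a /dev/sdg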

(If you don't normally run Linux, you can install smartmontools on Windows and other operating systems. Just don't expect any sort of GUI interface. Another option is to boot a Linux live CD; I like Ubuntu's for general purpose Linux work, but I often use SystemRescueCd instead for diagnosing, and sometimes repairing, PC systems that are acting funny.)

I plug the drive into my desktop Linux system, run "tail /var/log/messages" to figure out which device it gets assigned (/dev/sdg), and now I'm ready to start. First I grab the drive's error log to see whether the glitch was at the drive hardware level or not:
$ sudo smartctl -l error /dev/sdg
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
...
Error 2 occurred at disk power-on lifetime: 153 hours (6 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 08 00:00:17.300 IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 154 hours (6 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 01 00 00 a0 Error: UNC 1 sectors at LBA = 0x00000001 = 1

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 30 5e 00 d4 48 04 11:17:05.800 READ DMA EXT
25 03 40 4f 00 d4 40 00 11:16:56.600 READ DMA EXT
35 03 08 4f 00 9c 40 00 11:16:56.600 WRITE DMA EXT



DMA read/write errors can be caused by driver or motherboard issues, but failing to identify the device isn't good. Is the drive still healthy? By a rough test, sure:
$ sudo smartctl -H /dev/sdg
SMART overall-health self-assessment test result: PASSED

This is kind of deceptive though, as we'll see shortly. The next thing we want to know is how old the drive is right now and how many reallocated sectors it has. Those are the usual first warning sign that a drive is losing small amounts of data. We can grab just about everything from the drive like this (bits are clipped from the output in all these examples to focus on the relevant parts):
$ sudo smartctl -a /dev/sdg
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 092 092 016 Pre-fail Always - 2621443
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 111 111 024 Pre-fail Always - 600 (Average 660)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 74
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 7
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 154
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 77
193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 77
194 Temperature_Celsius 0x0002 114 114 000 Old_age Always - 48 (Lifetime Min/Max 18/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 12
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 4
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 4
...

Make sure to scroll this to the right; the last column is the most important. UDMA_CRC_Error_Count matches the errors we're still seeing individually. But the real smoking gun here, and in many other cases you'll see if you watch enough drives die, is Reallocated_Sector_Ct (7) and its brother Reallocated_Event_Count (12). Ignore all the value/worst/thresh nonsense; that data is normalized by a weird method that doesn't make any sense to me. The RAW_VALUE column is what you want. On a healthy drive, there will be zero reallocated sectors. Generally, once you see even a single one, the drive is on its way out. That is always attribute #5. I like to monitor #194 (temperature) too, because it's a good way to detect when a system fan has died: the drive overheating serves as a secondary monitor for that very bad condition. You can even detect server room cooling failures that way; it's fun watching a stack of servers all kick out temperature warnings at the same time after the circuit the AC is on blows.
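
If you want a script to watch just those two attributes instead of eyeballing the full report, a few lines of shell are enough. This is only a minimal sketch; the device name and the temperature threshold are examples you'd adjust for your own hardware:
# Minimal sketch: check reallocated sectors (attribute 5) and temperature (194)
DEV=/dev/sdg
# The raw value is the last column (field 10) of the smartctl -A attribute table
REALLOC=$(sudo smartctl -A $DEV | awk '$1 == 5 {print $10}')
TEMP=$(sudo smartctl -A $DEV | awk '$1 == 194 {print $10}')
[ "$REALLOC" -gt 0 ] && echo "WARNING: $DEV reports $REALLOC reallocated sectors"
[ "$TEMP" -gt 50 ] && echo "WARNING: $DEV is running at ${TEMP}C"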

The other thing to note here is Power_On_Hours. Here the raw value (154 hours) confirms that the recent errors in the logs really did just happen. This is a backup drive I only power on to copy files to and from, and it's disappointing that it's died with so little life. Why that happened and how to prevent it is another topic.

Next thing to do is to run a short self-test, wait a little bit, and check the results. This drive is loud enough that I can hear when the test is running, and it doesn't take long:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -l selftest /dev/sdg
...
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 60% 153 717666530


Here's the thing to realize: the drive reports "healthy" to the simple test, but it isn't; there are reallocated sectors, and even the simplest of self-tests will find them. Enterprise RAID controllers can be configured to do a variety of "scrubbing" activities when the drives aren't being used heavily, and this is why they do that: early errors can get caught by the drive long before you'll notice them any other way. Nowadays drives will reallocate marginal sectors without even reporting an error to you, so unless you look at this data yourself you'll never know when the drive has started to go bad.

At this point I ran a second short test, then an extended one; here's the log afterwards:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -t long /dev/sdg
$ sudo smartctl -l selftest /dev/sdg

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 157 2454905
# 2 Short offline Completed: read failure 40% 156 711374978
# 3 Short offline Completed: read failure 60% 153 717666530

The extended test found an error even earlier on the disk. It seems pretty clear this drive is failing quickly. At this point, there's little hope for it beyond saving any data you can (which I did before even launching into this investigation) and then running the manufacturer's diagnostic software. What I'd expect is that the drive *should* get marked for RMA given that it's already in this condition. It's possible the diagnostic will "fix" it instead. That's a later topic here.

In short, there are a few conclusions you can draw here, and since this is a quite typical failure I can assure you they hold up:

  • Avoid connecting external drives to a Windows server if you can; they're pretty error prone, and Windows isn't great at recovering from this sort of error. That's one reason I get so many requests to help distinguish true hardware errors from Windows problems.

  • Drives set up so that they can't be checked via SMART are much more likely to fail quietly. Had this one been connected directly rather than via USB, I could have picked up this problem when the reallocated sector count was lower and decreased my risk of data loss.

  • If you're taking an offline drive out of storage to use it again, a short SMART self-test and a look at the logs afterward is a good preventive measure. Some drives even support a "conveyance" self-test aimed at checking for damage after shipping; this one didn't, so I went straight from the short test to the long one.

  • When monitoring regularly with smartmontools, you must watch reallocated sectors yourself; the health check will not include them in its calculations. They are an excellent early warning of errors that will eventually lead to data loss.

  • Regularly running the short SMART self-test doesn't introduce much disk load, and it is extremely good at finding errors early. I highly recommend scheduling a periodic SMART self-test on your server outside of peak hours if you can, at least if you're not using some other form of data consistency scrubbing (see the smartd.conf sketch after this list).

  • Running a SMART self-test when you first put a drive into service provides a baseline showing good behavior before things go wrong. I didn't do that in this case, and I wish I had that data for comparison. It's helpful for general drive burn-in too.
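
Most of that regular monitoring and test scheduling can be handled by smartd, the daemon that ships with smartmontools, rather than by hand-rolled scripts. Here's a rough sketch of an /etc/smartd.conf entry; the device name, e-mail address, and test times are all examples you'd adapt to your environment:
# Monitor everything smartd knows about on this drive, e-mail when it finds trouble,
# and run a short self-test daily at 2 AM plus a long one Saturdays at 3 AM
/dev/sdg -a -m admin@example.com -s (S/../.././02|L/../../6/03)
Even with -a enabled, you may still want a separate check on the raw reallocated sector count, for the reason given in the list above.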

That's the basics of what an error looks like when you catch it before the drive stops responding altogether, which I've found is quite often how it plays out. Anecdotally, about 75% of the drive failures I've seen in the last seven years (3 out of 4 since I started doing this) have given warnings like this first. Some of the larger drive studies floating around recently suggest SMART isn't quite that reliable an early warning for most people, but it's certainly much better than not checking for errors at all.

The fact that this drive died somewhat mysteriously, in a way that it even still passed its SMART health check, has some interesting implications for its suitability in a RAID configuration. That's where I'm heading with this eventually.

Whatever OS you're running, you should try to get smartmontools (or a slicker application that does the same thing) running and set up to e-mail you when it hits an error. That regime has saved my data on multiple occasions.
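
If you go the smartd route, it's also worth confirming the mail path actually works before you're depending on it. Adding -M test to the configuration line (the address below is again just an example) makes the daemon send a test message every time it starts:
/dev/sdg -a -m admin@example.com -M test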
