Tuesday, August 26, 2008

Linux disk failures: Areca is not so SMART

One of the most frequently asked questions on the PostgreSQL Performance list is "what disk controller works best with PostgreSQL?" It's a tough question; every product has different things it's good and bad at, and it can be hard to through all that to figure out what makes sense for your application. You need to focus equally on performance and reliability, as disk drives are very prone to failure.

I have a mini-server at home I use mainly for performance testing. The main thing that makes it a mini-server rather than an oversized desktop is that I've got an Areca ARC-1210 as the disk controller, which when I bought it cost more than the whole rest of the computer I installed it in. Areca is a popular recommendation because for a while there, their products were considerably faster per dollar than the competition in the SATA RAID space (mainly 3Ware and LSI Logic), and reports of how the controllers handled disk failure suggested reasonable behavior there.

I knew the performance was good, but with the relatively small number of drives I run at home I hadn't actually run into a SATA disk failure to comment on that part of the controller's firmware yet. Luckily Murphy stalks my life, and during a weekend I really didn't have time to fool around with hardware I got my chance. Snipped from /var/log/messages:

sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855
sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
kernel: Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855
Here the end_request lines show the sector on the drive that went bad--two of them here, there were plenty more. This disk was part of a 3-disk software RAID0 set, which meant the file these sectors were in the middle of was trashed. Obviously I wasn't optimizing for reliability using RAID0 so the file wasn't important. What was important is that I finally had a chance to see exactly how the Areca controller handled this situation.

Normally what you do when a drive looks like it's going bad is to run smartctl. Actually, you should be running smartctl proactively all the time to try and catch these errors as soon as possible, and getting e-mailed the results. Again, this is a development and test system so I hadn't set all that up. And the results when I tried to run a scan on this disk were disappointing:


# smartctl --all /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: WDC WD1600AAJS-00WAA Version: R001
Serial number: WD-WCAS20994885
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Local Time is: Sun Aug 24 02:23:58 2008 EDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 25 C
Manufactured in week 30 of year 2002
Current start stop count: 256 times
Recommended maximum start stop count: 4278190080 times
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 0.000 0
write: 0 0 0 0 0 0.000 0

Non-medium error count: 0
Device does not support Self Test logging


The data returned here is clearly bogus--every time I looked at it after encountering errors, the counters were still all 0. Generally the best way to figure out what's going on is to run a self-test:

# smartctl -t long /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Long (extended) offline self test failed [unsupported scsi opcode]
Oops. Much like how the other counters aren't supported, the Areca controller doesn't pass through SMART test requests. I vaguely recalled that from when I was doing the shopping for this card. Areca supplies a couple of utilities on their FTP site to handle chores like this. The first one, the HTTP server, lets you access all the card configuration you might normally do as part of the BIOS setup from a web browser. You run that utility and then access it via ports 81 and 82 on localhost. That worked OK but didn't tell me too much.

The other thing they provide, the part I was intending to rely on here, is the command line interface program. They provide two versions for both 32-bit and 64-bit Linux. The problem with this utility is that unlike the disk driver, the CLI is not open source.

The Areca driver made its way into the mainstream Linux kernel as of 2.6.19, and has been improved significantly by contributors outside of Areca since. The version that ships with kernel 2.6.22 is the one I recommend, that rev is solid. Areca's CLI they provide in binary form, tested against a couple of supported distributions, and that's all you get.

My server system was recently upgraded to run Ubuntu 8.04.1 in addition to the CentOS 5.2 it's had on it for a while now. The CLI tool didn't work at all on Ubuntu; it didn't find the device and I have no idea why. If you're really committed to having an true open-source stack for your controller card, perhaps because you value being able to understand and possibly resolve issues yourself when they happen, cross Areca off your list. Their essential CLI utility can easily break on you and leave you stuck.

Luckily I still had the CentOS partition setup, and as that's a derivative of the supported RHEL their CLI is aimed at it worked fine there. Here's a little session I had with it, looking for what went wrong here:
Copyright (c) 2004 Areca, Inc. All Rights Reserved.
Areca CLI, Version: 1.80A, Arclib: 260, Date: Oct 12 2007( Linux )

Controllers List
----------------------------------------
Controller#01(PCI): ARC-1210
Current Controller: Controller#01
----------------------------------------

CMD Description
==========================================================
main Show Command Categories.
set General Settings.
rsf RaidSet Functions.
vsf VolumeSet Functions.
disk Physical Drive Functions.
sys System Functions.
net Ethernet Functions.
event Event Functions.
hw Hardware Monitor Functions.
mail Mail Notification Functions.
snmp SNMP Functions.
exit Exit CLI.
==========================================================
Command Format: cmd [Sub-Command] [Parameters].
Note: Use cmd -h or -help to get details.

CLI> event info
Date-Time Device Event Type
===============================================================================
2008-08-24 06:14:51 H/W MONITOR Raid Powered On
2008-08-24 06:01:35 H/W MONITOR Raid Powered On
2008-08-24 05:48:13 Proxy Or Inband HTTP Log In
2008-08-24 05:44:24 IDE Channel #02 Reading Error
2008-08-24 05:44:16 IDE Channel #02 Reading Error
2008-08-24 05:44:08 IDE Channel #02 Reading Error
2008-08-24 05:43:54 IDE Channel #02 Reading Error
2008-08-24 05:43:46 IDE Channel #02 Reading Error
2008-08-24 05:37:12 Proxy Or Inband HTTP Log In
2008-08-24 05:20:18 H/W MONITOR Raid Powered On
That's pretty clear: adjusting for time zone the disk attached to channel #02 (which is in fact /dev/sdb) had a read error at the same time the I/O errors showed up in the kernel logs. So the controller did catch that. Clearly any production system with an Areca controller should be monitoring the output from "cli
event info" for these messages.

If I were running this in a hardware RAID environment, I suspect the Areca controller would have kicked the disk out after that. Sometimes, though, the SMART errors are more subtle. I like to schedule an extended disk test for any production drive at least once a week, trying to find issues before they actually wipe data out. Here's the options there, and what the SMART info for a good disk in the system looks like:
CLI> disk -h
smart Parameter:
Fn: Display Disk S.M.A.R.T Data.
Ex: Display Disk#1 S.M.A.R.T Data.
Command: disk smart drv=1 [Enter]

dctest Parameter: .
Fn: Perform A S.M.A.R.T Off-Line Data Collection.
Ex: Perform A Test On Disk#1~4.
Command: disk dctest drv=1~4 [Enter]

sttest Parameter: [mode=].
Fn: Perform A S.M.A.R.T Off-Line Short or Extended Self-Test.
Ex: Perform A Short Self-Test On Disk#1~4.
Command: disk sttest drv=1~4 mode=short [Enter]
Note: Default Mode=Short.

CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 152 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 102 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 200 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 200 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================
Here's what the bad one reported:
CLI> disk smart drv=2
S.M.A.R.T Information For Drive[#02]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 153 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 104 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 193 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 193 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================
Ah ha--so the old "Uncorrectable Sector Count" quality factor
is going down. That's usually a sign of early failure. But how to
trigger a more comprehensive test?

CLI> disk sttest drv=2 mode=extended
ErrMsg: Drive 2 Is Not Available

CLI> disk dctest drv=2
ErrMsg: Drive 2 Is Not Available
Oops. A bit of research reinforced my fear here: if you've got a device that's part of a RAID group or, like in this case, assigned as a JBOD disk, the Areca controller won't let you run a SMART test on it. This is really unreasonable behavior for a disk controller aimed at enterprise use. I only ran into these bad sectors because I used them and then tried to read the result back. The extended test may have caught the read problem before I did that, saving me from the data loss. That's certainly how it's worked for the last two disk failures I ran into at home.

The next question, then, is how to run a full scan over this disk if I can't do it via smart? First I tried rebuilding the RAID0 /dev/md0 filesystem this disk is part of using the bad block checker you can invoke as part of mkfs:


mkfs -j -c -c /dev/md0
This returned clean, no I/O errors at the kernel level. It took many hours to complete though. What about the Areca logs?


Date-Time Device Event Type
===============================================================================
2008-08-24 12:26:37 IDE Channel #02 Reading Error
2008-08-24 12:26:21 IDE Channel #02 Reading Error
2008-08-24 12:26:13 IDE Channel #02 Reading Error
2008-08-24 12:26:05 IDE Channel #02 Reading Error
2008-08-24 12:25:57 IDE Channel #02 Reading Error
2008-08-24 12:25:43 IDE Channel #02 Reading Error
2008-08-24 12:25:35 IDE Channel #02 Reading Error
2008-08-24 12:25:27 IDE Channel #02 Reading Error
2008-08-24 12:25:06 IDE Channel #02 Reading Error
2008-08-24 12:24:59 IDE Channel #02 Reading Error
...
That goes on for a while. Seeing a bunch of errors in the controller that didn't show up in the kernel is kind of disconcerting. I'm left distrusting the way mkfs does this test; perhaps the writes it does to disk are staying cached by the controller, causing the read to return with that cached data? Not sure what's going on here, but I sure don't like it.

Time to test this out at the filesystem level by forcing a read of everything instead, which is easier to do anyway because it doesn't write:
e2fsck -c /dev/md0

Ah, here come the kernel level errors in /var/log/messages. The end_request lines are the important ones here, here's the first of those (so many showed up they didn't even all appear in the logs):
end_request: I/O error, dev sdb, sector 34426095
end_request: I/O error, dev sdb, sector 34426127
end_request: I/O error, dev sdb, sector 34426151
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34444831
end_request: I/O error, dev sdb, sector 34453535
end_request: I/O error, dev sdb, sector 34453631
It ends saying that it made modifications to the filesystem, presumably to mark the blocks that were bad. But how do you find out where those are? You can get that info with the superblock header option of dumpe2fs:
# dumpe2fs -f /dev/md0
Bad blocks: 12909760, 12909778, 12909779, 12909785, 12909790, 12909791, 12909796, 12909797, 12909802, 12909803, 12909809, 12909814, 12909815, 12909820, 12909821, 12910712, 12910714, 12916792, 12920048, 12920092
What have we learned:
  • The Areca controller has serious problems reporting SMART errors when you're using it in JBOD mode. It seems it's only really designed for handling all that internally
  • The CLI they provide is a closed source mess that you must test for compatibility
  • You must monitor the Areca event log using their CLI if you're using their controller in production
  • The controller doesn't support any user-requested SMART scans
  • On a caching controller, using "e2fsck -c" may give more reliable results than the full "mkfs -c"
Since the disk here is obviously trashed, and I can't do any of the SMART operations to try and make it better when attached to the Areca controller, at this point I moved it to another system and replaced it with the spare I had for this array. I didn't lose any data here, but I did find myself a lot less comfortable recommending Areca controllers.

I just pointed out last month that current 3ware controllers have caught up with Areca's performance. Of course, as far as I can tell, 3ware's tw_cli utility is just as closed source as Areca's. But since you can run smart tests directly with the Linux smartctl on 3ware controllers, all you really need the CLI for is looking inside the hardware RAID to see what the disk status there is.

As for what I went through trying to get the failing drive back to operating order again, that's a whole 'nother story.

8 comments:

Bruce Momjian said...

Good analysis, and something I don't see very often. Of course, I'm a SAS man myself. ;-)

toby said...

If you don't want to be at the mercy of an unreliable error reporting chain, and failing hardware, you need ZFS. This anecdote shows many reasons why...

Greg Smith said...

The ZFS scrubbing regime is a useful replacement for running filesystem level checks manually. But ZFS does nothing to inform you when the hard drive is quietly gobbling up its spare sectors to cover failing sections of the disk. If you want that data, there's no way to get it but via SMART. By the time errors get reported to the kernel such that ZFS (or any filesystem) can find out about them, you've already lost valuable time that could of let you take action before there was an actual data losing failure.

Scott said...

I've got a 1680 running in production with 16 SAS disks, and a day or two after getting the second machine up and running, a drive failed and the RAID controller failed over perfectly to one of the hot spares. Replacing the failed drive and making the replacement a hot spare was very easy through the web interface.

The interesting thing about this newer controller is that it has a dedicated network port for the web interface and for SNMP traps. I'm going to write a plug in for nagios to work with the snmp on the dedicated NIC.

The dedicated NIC and web / snmp interface makes this card pretty OS agnostic in terms of reporting errors.

yarikoptic said...

Nice post indeed... you might guess I've ran into it -- I got problems with our ARC-1160 16-Port. Initially I had an issue (lost/dropped drives) which boiled down to insufficient power supply. Boosting power supply from 650W to 900W (2+1spare) helped to resolve. Not sure why vendor didn't bring my attention initially while suggesting the box. Controller had been functioning flawlessly since then for 3 years, but since few days ago started to experience the same situation -- 1 drive is reported bad a day, rebuild starts; but that drive looks ok on the further tests... who knows WTF...

I have been using email notification system from the controller since I have installed it -- so I never missed any 'lost/bad drive' event, but indeed absence of common open-source client is a big minus. Not sure why they don't release it opensource

KnottyMan said...

Under Ubuntu, you need to be root or use sudo to access the CLI.

And using a raid controller for jbod to later use sw raid seems a bit silly.

All my linux/areca/ubuntu boxes behave themselves nicely. I'll never go back to 3ware.

One other thing to check is the firmware/bios on the card. Very easy to upgrade and might have added the extra smart features you're looking for.

Greg Smith said...

Thanks for giving me the benefit of the doubt here, as if I didn't know I had to be root to access the drives directly in Ubuntu. That was certainly not my issue. I just re-checked with the current Ubuntu 9.04 install on my server and the V1.82 of Areca's utility (1.80 was the latest when I wrote this), and sure enough you're right about the current state of things; Areca's CLI works for me now on Ubuntu too. My point about never being sure what you're going to get from these closed-source CLI utilities stands. I had no idea why it didn't work on the older version, and with no source code available I had limited troubleshooting possibilities.

Using a RAID controller in JBOD mode provides battery-backed write caching, which is necessary for accelerating some common database applications, but doesn't require the card to do anything else. Layering software RAID on that of that can end being significantly faster than doing RAID on the card itself, particular if you're going crazy and using RAID5 where it's easier to overload the parity computations required on the controller's tiny CPU. This is a pretty common technique for high-performance database applications.

As for 3ware, they've stepped backwards in this area too. If you're using one of their controllers that also supports SAS drives, you can't view SMART data on them either. Really disappointing.

Unknown said...

I would like to note that now smartctl can access devices behind the areca controller...

example:

dylan@BigBox ~ $ sudo smartctl -a /dev/sg4 -d areca,1

I can't post the output but it shows the tests that i've had running on it, etc.