Monday, July 6, 2009

Python logging TypeError messages

I've been writing small Python programs for about two years now. There are a few things that slipped my notice until really recently though, and having good application logging instead of using "print" is on that list. Lost some time today figuring out that "global" goes inside functions where that global is used, rather than as a modifier when creating the variable; Understanding 'global' in Python cleared that up for me. While trying to sort that out, I started converting my program's mess of print statements into something that used the logging facility instead. There's an unexpected and quite messy surprise waiting there, a simple to make mistake that the logging facility completely botches handling in a friendly way.

Let's say we start with a simple program that logs the value of a variable at a point:


#!/usr/bin/env python
x=1
print "x =",x


We decide to switch that over to use logging instead, so that we can adjust whether that bit of detail is logged or not better. You might do that (wrongly!) like this:


#!/usr/bin/env python
import logging
x=1
logging.warning("x =",x)


I'd guess this is a pretty common amateur mistake. If you pass more than one argument into a logging statement, the module presumes the first one is a format string and the rest fill in parameters in that string. This gives you the following cryptic error when it uses the "%" facility to format that string with the variable given:


Traceback (most recent call last):
File "/usr/lib/python2.6/logging/__init__.py", line 760, in emit
msg = self.format(record)
File "/usr/lib/python2.6/logging/__init__.py", line 644, in format
return fmt.format(record)
File "/usr/lib/python2.6/logging/__init__.py", line 432, in format
record.message = record.getMessage()
File "/usr/lib/python2.6/logging/__init__.py", line 302, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting


OK, so that's my bad. But notice what's missing from there? There's not a hint of an idea where in my code the error is at! All of the error messages refer to things within the logging __init__.py module. What to do?

This topic came up recently on the Python list, with a couple of messages about how to hack up the logging module to present more information. That works, but isn't particularly elegant. Buried in there was a clever way around this problem though: override the error handler. Here's what that looks like patched into our bad code example:


#!/usr/bin/env python
import logging

def handleError(self, record):
  raise
logging.Handler.handleError = handleError

x=1
logging.warning("x =",x)


Now when this is run, the error traceback you get starts by showing you the bad line in your code:


Traceback (most recent call last):
File "./logtest.py", line 9, in
logging.warning("x =",x)
File "/usr/lib/python2.6/logging/__init__.py", line 1441, in warning
root.warning(*((msg,)+args), **kwargs)
...


Note that just inserting an error handler like this isn't what you want to deploy, because an error in logging should introduce an application error. It's really more for making sure you get all your log messages correct in the first place. Make sure you read and following through understanding the comments about how handleError works later in the message thread before leaving this error dumping code in your app permanently.

I broke this down into a trivial case for demonstration, in my actual problem I had 600 lines of code filled with logging calls and only one of them was bad. Patching the error handler took me right to the bad one, and it was obvious how to fix it once I was staring at it.

Here's a correct snippet of code below, including some initialization bits I borrowed from the excellent Manage concurrent threads tutorial that makes the log messages more appropriate for threads. The reason I needed logging and global variables here were to cope with some program-wide thread locking issues with a single shared resource, and something like this is what's actually going into my app now that the log messages are all formatted properly and I could punt the error handler hack out of the program. But you can bet I'll remember that for the next time I run into the cryptic logging TypeError.

#!/usr/bin/env python
import logging
logging.basicConfig(
  level=logging.DEBUG,
  format='(%(threadName)-10s) %(message)s')
x=1
logging.warning("x = %s",x)


Note that there are all sorts of additional format strings like "%(threadname)" available, including timestamps and line numbers, and making changes to the base format lets you adjust that for the whole program. If you find yourself starting to put function names into your log messages or similarly repetitive work, you should consider using those substitution formatters instead.

Wednesday, May 27, 2009

Bottom-up PostgreSQL benchmarking and PGCon2009

Last week I got a lot of positive feedback from my PGCon presentation in Ottawa about how to benchmark systems at a low-level when the intended application is to run a database. There were three main topics I was trying to cover in that:
  1. Why you should always run your own hardware benchmarks on every piece of hardware you can
  2. Examples of the simplest benchmarks I've found to be accurate
  3. How do organize your tests and your vendor interactions to support performance measurement as a purchasing requirement
There was one slide missing from the set I presented. I've uploaded a version of the slides that fixes that (along with a typo in the sysbench seeks slide) to my home page. For those who missed it, a couple of people have put their notes from the talk as part of PGCon coverage on the PG wiki, and video of many talks from the conference is already available from FOSSLC.

Also available on my web page now is a presentation I did last month at PG East 2009. Titled "Using and Abusing pgbench", that talk also has 3 things it tries to convey:
  1. How does pgbench and its internal scripting language work? (Most people aren't even aware there is such a scripting language available)
  2. What should you do in order to get good results from the built-in pgbench tests?
  3. How can you use pgbench as a test harness for writing your own tests?
The hardware benchmarking presentation ends where the pgbench ones starts, with a bit of overlap. That's intentional--I always consider pgbench tests to be something you should do only after confirming all of your hardware does the right thing, top to bottom. A perfect example just came out recently: even someone who's done as much benchmarking work as Joshua Drake can end up measuring the wrong thing, because he skipped the step I suggest for confirming expected commit rate before moving onto higher-level pgbench tests. Since not many people saw the pgbench talk at PG East I'm hoping to repeat that one in the near future to a larger audience.

As part of putting that presentation together, I did more work on a toolchain I've been using for a couple of years now (since I was working on 8.3 development) I've named pgbench-tools. The current 0.4 release posted to my home page is the first to benefit from having some users, which has gotten me an enormous amount of feedback toward making the program bug-free and more usable. Thanks in particular to Robert Treat and Jignesh Shah for their contributions. I think it's finally mature enough that it might be useful for others who want to automate running large numbers of pgbench tests too.

Documentation is still minimal, but I have written some (and what's there is accurate, both of which put me ahead of a lot of open-source projects I guess). There is an into README in the tar file and the presentation tries to give some examples of usage too. When I get more time I'm putting the source code into the PostgreSQL git repository (the repo is already there, I just haven't pushed to it yet), where it will be easier for other people to work with and on. There's a growing need in the PG community for regression testing of performance results, and at the yearly PGCon Developer Meeting I volunteered to see if an improved version of this pgbench-tools package might be useful in that role. I hope the ideas in my presentations and the suggested practice demonstrated by these tools turns out to be helpful to others.

The approach taken in pgbench-tools, that you should parse results from pgbench, save them to a database, and then graph the lot of them using SQL to summarize as needed, is only partially mine. I stole the first rev of the graphing code and several other ideas from the work Mark Wong and others did on the dbt2 program (here's an intro to using dbt2). Now that I've got something useful for my purposes and am free from conferences for a while, I'm hoping to spend some time investigating how to integrate the unique things I'm doing with some of the tools he's already written. The biggest thing the dbt tests have that I haven't provided for pgbench yet is a framework for measuring I/O and similar statistics during the test run. Given that the PostgreSQL development process already has a heavy requirement on Perl, I really should fall into line and adopt that myself too--despite my strong personal preference for Python in this role.

Monday, December 8, 2008

Copying Virtual Box snapshots

I've really become comfortable nesting all of my Windows installs inside of Virtual Box lately (my main systems run both RedHat and Ubuntu Linux). Just being able to shuttle that image around to wherever I happen to be working is one big help. And the value of working with VMs was just reinforced this week when I learned that my recently installed XP Service Pack 3 introduced an incompatibility with the version of Microsoft SQL Server 2005: Express I needed to install to extract some client data. I just rolled back to the snapshot I took just before I installed that service pack (I am that paranoid), and then it installed fine. While I've been known to do a backup before such an operation even on a real hard disk, that's painful; VM snapshots are so trivial I can take them far more often.

This left me with a dilemma though: my XP install now has this bloated SQL Server mess installed on it that I can't delete until that project is done, but I need to do some real work that I want to keep beyond when I blow that image away. The snapshot tool in the VirtualBox GUI is pretty coarse: the only thing it will do is destructively revert a snapshot. What I want to do is fork the previous snapshot into another machine.

I found some clues for image import/export; it seemed easy. As usual, it wasn't at all.

Background: each disk image in VirtualBox gets a unique UUID. This is why you can't just copy the underlying disk image files to somewhere else--the UUID will still be the same when you import it and it won't work. The export tool "clonevdi" takes care of that for you, but you'll need all the relevant UUIDs for that to do anything useful. I'll show how you can get this info below; when I worked through it the first time I got an unpleasant surprise once I tried using the clone utility I want to talk about first:


$ VBoxManage clonevdi b534a21d-a24a-44b2-35ae-66502938a0b9 image.vdi
VirtualBox Command Line Management Interface Version 1.6.4
(C) 2005-2008 Sun Microsystems, Inc.
All rights reserved.

[!] FAILED calling hardDisk->CloneToImage(Bstr(argv[1]), vdiOut.asOutParam(), progress.asOutParam()) at line 3314!
[!] Primary RC = NS_ERROR_FAILURE (0x80004005) - Operation failed
[!] Full error info present: true , basic error info present: true
[!] Result Code = NS_ERROR_FAILURE (0x80004005) - Operation failed
[!] Text = Cloning differencing VDI images is not yet supported ('/d2/virtualbox/Machines/XP Pro/Snapshots/{b534a21d-a24a-44b2-35ae-66502938a0b9}.vdi')
[!] Component = HardDisk, Interface: IHardDisk, {fd443ec1-000f-4f5b-9282-d72760a66916}
[!] Callee = IHardDisk, {fd443ec1-000f-4f5b-9282-d72760a66916}


How fun is that? Differencing image, no copies for you!

After some poking around with the forum uber-thread covering copies, it sounded to me like anybody who makes any sort of snapshot is just screwed here. If it's not a plain old disk image, too bad.

As it would take far too long to recreate what's in this VM, I was plenty motivated to find a workaround. Here's how I eventually managed to extract those images:


  • With Virtual Box not running, backup the entire .VirtualBox directory.
  • Start the main image tool and navigate to the snapshot list for the relevant image.
  • If the one you want is in the history, rather than the current one you want, start at the bottom and blow away any change sets below that one; right-click on them and choose "Revert to current snapshot" if you just want to get rid of changes since then, or "Discard current snapshot and state" if the one you want is actually below either of them.
  • Once you've gotten to where the image you want is current, select each of the snapshots above it and right-click for "Discard snapshot" to merge their differences into the image below. You'll see the encouraging "Preserving changes to normal hard disk" message here.
  • Only the one image you want left? It's description should read like this: "IDE Primary Master: [Normal, 10.00GB]" Now you're set to use the command line tools! Exit the GUI, create the image as shown below, save that file somewhere else (it's created in the same VDI directory all the other images live in), then you can restore your original config to get everything back.


Here's how the command-line tools worked once I'd done the above to slim down to only the one image I wanted as available. First I take a look at all the UUIDs available:


$ VBoxManage list vms | grep UUID
UUID: aeb41dd3-d9f5-44fb-b1a2-a32e84f79e64
Primary master: /d2/virtualbox/VDI/XP Pro.vdi (UUID: 328bcae8-ddf8-4121-139c-f7d0566526f4)


That first UUID is for the whole VM (including the associated config files), that we can ignore. What I want to do then is copy the current running copy, the one labeled "Primary master", to a new image file. First I confirm I can access the right one via the command line tools:


$ VBoxManage showvdiinfo 328bcae8-ddf8-4121-139c-f7d0566526f4


That shows what I expected, so now I can carefully edit that working line via the old up arrow, changing that to the copy command instead. This time it works:


$ VBoxManage clonevdi 328bcae8-ddf8-4121-139c-f7d0566526f4 imagename.vdi


Now, how to actually use one of these images? You need a configuration XML file in the "Machines" directory that matches the one associated with this image copy. Make sure to save that matching file from the VDI/[Machine]/[Machine].xml directory before you do anything drastic (like restore the original configuration with all the snapshots); we'll need it later.

Once I made all the VDI images and had their matching config files, I put back the original .VirtualBox directory, then copied the new images into its VDI directory. Setting up a new VM to use those copies went like this:


  • Create a new snapshot with the correct type. When you get to "Virtual Hard Disk", select "Existing". Click on "Add".
  • You'll see everything listed in your VDI directory. Select the one associated with the snapshot you made and finish making this entry. Exit VirtualBox.
  • Now what you want to do is copy the long Machine... line from the new Machines/[Machine]/[Machine].xml file you just created to somewhere else (a text editor perhaps), along with the HardDiskAttachment... one that refers to your relocated snapshot. There are more details about this part at Cloning a complete VM.
  • Overwrite the new machine XML file with the original one associated with your VM, replacing just those two lines with the ones you saved. Basically, you need the new machine UUID and hard disk UUID, everything else should be the same as your original configuration to make sure this machine clones the original as closely as possible.


After going through all that, I had everything: the original VM with all its respective snapshots were still there. I had a copy of the install with SQL Server I could tinker with separately, while continuing my regular work in the original VM. Quite an unexpectedly long diversion, but now that I understand what does and doesn't work here I'll make sure to structure my images and snapshots accordingly.

Since I realize snapshots are a lot less useful than I originally thought because of these limitations, it strikes me I might even switch to using the more portable VMDK image files instead of the native VDI format. Snapshot compatibility was the only reason I didn't do that in the first place (can't use them with VMDK). I think I can do that by creating a new virtual disk in VMDK format, attaching that as the secondary master, using a boot CD image to dd the VDI one to the VMDK one, then detaching those disk images and making a new machine based on the VMDK-formatted one. Easy as can be, right? I got the idea from How To Resize a VirtualBox Virtual Disk. But not right now; I've had enough of a VM disk manipulation workout already today.

Sunday, November 30, 2008

Automating initial postgresql.conf tuning

About two years ago I returned to full-time work on PostgreSQL performance tuning for a contract job, after having done some earlier work in that area for a personal project. One of the things that struck me as odd was that no progress had been made in the intervening years toward providing an automatic tool to help with that goal. Since many of the existing guides only covered older versions of PostgreSQL (the 8.1 release changed what you can adjust usefully quite a bit), even the few template suggestions for various machine sizes floating around the web were largely obsolete--not to mention that the modern hardware was a lot more capable. When I tried writing an automatic tool myself, I started to realize just how much of a harder problem it was than it seemed.

One stumbling block is that you need a fair amount of information about the parameters in the postgresql.conf file, and some of that information wasn't all available any way except to read the associated source code. There's enough information embedded only there that it was impractical to expect a potential tool writing author to extract and maintain their own copy of the settings database. For example, I had to submit a patch myself earlier this year to make it possible to easily discover what value a parameter you don't set will get. Now you can look at the pg_settings view and that shows up as "boot_val".

A second problem is that coming up with a robust model for what to set everything to was harder than it seems at first. Earlier this year I worked with some other community members on codifying some of that into tuning guidelines, and there was a somewhat arcane tuning spreadsheet from Josh Berkus floating around. After showing I was close to an initial solution to this long outstanding TODO item, Josh came to my aid with a simplified model that was easy to implement as code.

There have been more than one attempt to write this kind of tool that fizzled after some initial work. The other classic trap people have fallen into here is worrying about the user interface before getting all the boring parts done. It takes a fair amount of code to read a postgresql.conf file usefully, learn the rules for what can you do with it (again, some of which are only documented in the database source code), and publish an updated file. As an example there, while it's possible to just maintain a list of which parameters are specified using "memory units" and what unit they default to (some are multiples of a single kB, others in 8kB blocks), what you really should do is look at the "unit" column that comes out of pg_settings instead to figure that out. And to do that, you need to save a copy of pg_settings into a file (since at the point this tool is being run the server isn't up yet) and load it into memory such that the related code can rely on it. Then it's trivial for the tool to distinguish between a value like "max_connections" (which is a simple integer that accepts no additional units) and "work_mem", which is specified in 1kB units but you can use 'MB' and 'GB' suffixes for. Oh, and don't forget the upper limit for that particular setting differs depending on whether you're on a 32 or 64 bit platform. What fun.

One of the guidelines I try to live by is "don't complain about anything you're not willing to help fix yourself", and I've done my share of complaining on this topic. This past holiday weekend, in between rounds of turkey overdose I reached a new milestone here: my post to the pg hackers mailing list yesterday includes a Python implementation that solves all the traditional problems besetting tuning tools authors in the context of the upcoming PostgreSQL 8.4. You feed it an existing postgresql.conf file and tell it what kind of database application you expect to build (data warehouse, web app, etc.). It tries to figure out how much RAM you have, then produces an updated postgresql.conf file for you with updates to 10 critical parameters.

Since the default configuration file for PostgreSQL is aimed at working with the conservative parameters many operating systems ship with, that makes those defaults only appropriate for a system having approximately 128MB of total RAM. If you have more than that and haven't considering performance tuning improvements for your PostgreSQL database, this tool, the model expressed in its source code, or a future version of it (once 8.4 is done I plan a version backward compatible with versions 8.1-8.3) may eventually be useful to you.

And if you're one of the frustrated potential tool authors who was stopped by one of the problems I mentioned here, consider this code release as an initial resource you can use. Even if it's not your preferred language, Python is pretty easy to read. I think it's easier to start by using this program as a spec to reference and emulate than it is to dig into the source code and resources I did to write it.

Tuesday, August 26, 2008

Linux disk failures: Areca is not so SMART

One of the most frequently asked questions on the PostgreSQL Performance list is "what disk controller works best with PostgreSQL?" It's a tough question; every product has different things it's good and bad at, and it can be hard to through all that to figure out what makes sense for your application. You need to focus equally on performance and reliability, as disk drives are very prone to failure.

I have a mini-server at home I use mainly for performance testing. The main thing that makes it a mini-server rather than an oversized desktop is that I've got an Areca ARC-1210 as the disk controller, which when I bought it cost more than the whole rest of the computer I installed it in. Areca is a popular recommendation because for a while there, their products were considerably faster per dollar than the competition in the SATA RAID space (mainly 3Ware and LSI Logic), and reports of how the controllers handled disk failure suggested reasonable behavior there.

I knew the performance was good, but with the relatively small number of drives I run at home I hadn't actually run into a SATA disk failure to comment on that part of the controller's firmware yet. Luckily Murphy stalks my life, and during a weekend I really didn't have time to fool around with hardware I got my chance. Snipped from /var/log/messages:


sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855
sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
kernel: Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855


Here the end_request lines show the sector on the drive that went bad--two of them here, there were plenty more. This disk was part of a 3-disk software RAID0 set, which meant the file these sectors were in the middle of was trashed. Obviously I wasn't optimizing for reliability using RAID0 so the file wasn't important. What was important is that I finally had a chance to see exactly how the Areca controller handled this situation.

Normally what you do when a drive looks like it's going bad is to run smartctl. Actually, you should be running smartctl proactively all the time to try and catch these errors as soon as possible, and getting e-mailed the results. Again, this is a development and test system so I hadn't set all that up. And the results when I tried to run a scan on this disk were disappointing:


# smartctl --all /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: WDC WD1600AAJS-00WAA Version: R001
Serial number: WD-WCAS20994885
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Local Time is: Sun Aug 24 02:23:58 2008 EDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 25 C
Manufactured in week 30 of year 2002
Current start stop count: 256 times
Recommended maximum start stop count: 4278190080 times
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 0.000 0
write: 0 0 0 0 0 0.000 0

Non-medium error count: 0
Device does not support Self Test logging


The data returned here is clearly bogus--every time I looked at it after encountering errors, the counters were still all 0. Generally the best way to figure out what's going on is to run a self-test:

# smartctl -t long /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Long (extended) offline self test failed [unsupported scsi opcode]
Oops. Much like how the other counters aren't supported, the Areca controller doesn't pass through SMART test requests. I vaguely recalled that from when I was doing the shopping for this card. Areca supplies a couple of utilities on their FTP site to handle chores like this. The first one, the HTTP server, lets you access all the card configuration you might normally do as part of the BIOS setup from a web browser. You run that utility and then access it via ports 81 and 82 on localhost. That worked OK but didn't tell me too much.

The other thing they provide, the part I was intending to rely on here, is the command line interface program. They provide two versions for both 32-bit and 64-bit Linux. The problem with this utility is that unlike the disk driver, the CLI is not open source.

The Areca driver made its way into the mainstream Linux kernel as of 2.6.19, and has been improved significantly by contributors outside of Areca since. The version that ships with kernel 2.6.22 is the one I recommend, that rev is solid. Areca's CLI they provide in binary form, tested against a couple of supported distributions, and that's all you get.

My server system was recently upgraded to run Ubuntu 8.04.1 in addition to the CentOS 5.2 it's had on it for a while now. The CLI tool didn't work at all on Ubuntu; it didn't find the device and I have no idea why. If you're really committed to having an true open-source stack for your controller card, perhaps because you value being able to understand and possibly resolve issues yourself when they happen, cross Areca off your list. Their essential CLI utility can easily break on you and leave you stuck.

Luckily I still had the CentOS partition setup, and as that's a derivative of the supported RHEL their CLI is aimed at it worked fine there. Here's a little session I had with it, looking for what went wrong here:

Copyright (c) 2004 Areca, Inc. All Rights Reserved.
Areca CLI, Version: 1.80A, Arclib: 260, Date: Oct 12 2007( Linux )

Controllers List
----------------------------------------
Controller#01(PCI): ARC-1210
Current Controller: Controller#01
----------------------------------------

CMD Description
==========================================================
main Show Command Categories.
set General Settings.
rsf RaidSet Functions.
vsf VolumeSet Functions.
disk Physical Drive Functions.
sys System Functions.
net Ethernet Functions.
event Event Functions.
hw Hardware Monitor Functions.
mail Mail Notification Functions.
snmp SNMP Functions.
exit Exit CLI.
==========================================================
Command Format: [Sub-Command] [Parameters].
Note: Use -h or -help to get details.

CLI> event info
Date-Time Device Event Type
===============================================================================
2008-08-24 06:14:51 H/W MONITOR Raid Powered On
2008-08-24 06:01:35 H/W MONITOR Raid Powered On
2008-08-24 05:48:13 Proxy Or Inband HTTP Log In
2008-08-24 05:44:24 IDE Channel #02 Reading Error
2008-08-24 05:44:16 IDE Channel #02 Reading Error
2008-08-24 05:44:08 IDE Channel #02 Reading Error
2008-08-24 05:43:54 IDE Channel #02 Reading Error
2008-08-24 05:43:46 IDE Channel #02 Reading Error
2008-08-24 05:37:12 Proxy Or Inband HTTP Log In
2008-08-24 05:20:18 H/W MONITOR Raid Powered On
That's pretty clear: adjusting for time zone the disk attached to channel #02 (which is in fact /dev/sdb) had a read error at the same time the I/O errors showed up in the kernel logs. So the controller did catch that. Clearly any production system with an Areca controller should be monitoring the output from "cli
event info" for these messages.

If I were running this in a hardware RAID environment, I suspect the Areca controller would have kicked the disk out after that. Sometimes, though, the SMART errors are more subtle. I like to schedule an extended disk test for any production drive at least once a week, trying to find issues before they actually wipe data out. Here's the options there, and what the SMART info for a good disk in the system looks like:

CLI> disk -h
smart Parameter:
Fn: Display Disk S.M.A.R.T Data.
Ex: Display Disk#1 S.M.A.R.T Data.
Command: disk smart drv=1 [Enter]

dctest Parameter: .
Fn: Perform A S.M.A.R.T Off-Line Data Collection.
Ex: Perform A Test On Disk#1~4.
Command: disk dctest drv=1~4 [Enter]

sttest Parameter: [mode=].
Fn: Perform A S.M.A.R.T Off-Line Short or Extended Self-Test.
Ex: Perform A Short Self-Test On Disk#1~4.
Command: disk sttest drv=1~4 mode=short [Enter]
Note: Default Mode=Short.

CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 152 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 102 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 200 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 200 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================

Here's what the bad one reported:

CLI> disk smart drv=2
S.M.A.R.T Information For Drive[#02]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 153 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 104 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 193 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 193 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================


Ah ha--so the old "Uncorrectable Sector Count" quality factor
is going down. That's usually a sign of early failure. But how to
trigger a more comprehensive test?


CLI> disk sttest drv=2 mode=extended
ErrMsg: Drive 2 Is Not Available

CLI> disk dctest drv=2
ErrMsg: Drive 2 Is Not Available
Oops. A bit of research reinforced my fear here: if you've got a device that's part of a RAID group or, like in this case, assigned as a JBOD disk, the Areca controller won't let you run a SMART test on it. This is really unreasonable behavior for a disk controller aimed at enterprise use. I only ran into these bad sectors because I used them and then tried to read the result back. The extended test may have caught the read problem before I did that, saving me from the data loss. That's certainly how it's worked for the last two disk failures I ran into at home.

The next question, then, is how to run a full scan over this disk if I can't do it via smart? First I tried rebuilding the RAID0 /dev/md0 filesystem this disk is part of using the bad block checker you can invoke as part of mkfs:


mkfs -j -c -c /dev/md0
This returned clean, no I/O errors at the kernel level. It took many hours to complete though. What about the Areca logs?


Date-Time Device Event Type
===============================================================================
2008-08-24 12:26:37 IDE Channel #02 Reading Error
2008-08-24 12:26:21 IDE Channel #02 Reading Error
2008-08-24 12:26:13 IDE Channel #02 Reading Error
2008-08-24 12:26:05 IDE Channel #02 Reading Error
2008-08-24 12:25:57 IDE Channel #02 Reading Error
2008-08-24 12:25:43 IDE Channel #02 Reading Error
2008-08-24 12:25:35 IDE Channel #02 Reading Error
2008-08-24 12:25:27 IDE Channel #02 Reading Error
2008-08-24 12:25:06 IDE Channel #02 Reading Error
2008-08-24 12:24:59 IDE Channel #02 Reading Error
...


That goes one for a while. Seeing a bunch of errors in the controller that didn't show up in the kernel is kind of disconcerting. I'm left distrusting the way mkfs does this test; perhaps the writes it does to disk are staying cached by the controller, causing the read to return with that cached data? Not sure what's going on here, but I sure don't like it.

Time to test this out at the filesystem level by forcing a read of everything instead, which is easier to do anyway because it doesn't write:


e2fsck -c /dev/md0


Ah, here come the kernel level errors in /var/log/messages. The end_request lines are the important ones here, here's the first of those (so many showed up they didn't even all appear in the logs):
end_request: I/O error, dev sdb, sector 34426095
end_request: I/O error, dev sdb, sector 34426127
end_request: I/O error, dev sdb, sector 34426151
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34444831
end_request: I/O error, dev sdb, sector 34453535
end_request: I/O error, dev sdb, sector 34453631
It ends saying that it made modifications to the filesystem, presumably to mark the blocks that were bad. But how do you find out where those are? You can get that info with the superblock header option of dumpe2fs:

# dumpe2fs -f /dev/md0
Bad blocks: 12909760, 12909778, 12909779, 12909785, 12909790, 12909791, 12909796, 12909797, 12909802, 12909803, 12909809, 12909814, 12909815, 12909820, 12909821, 12910712, 12910714, 12916792, 12920048, 12920092


What have we learned:
  • The Areca controller has serious problems reporting SMART errors when you're using it in JBOD mode. It seems it's only really designed for handling all that internally
  • The CLI they provide is a closed source mess that you must test for compatibility
  • You must monitor the Areca event log using their CLI if you're using their controller in production
  • The controller doesn't support any user-requested SMART scans
  • On a caching controller, using "e2fsck -c" may give more reliable results than the full "mkfs -c"

Since the disk here is obviously trashed, and I can't do any of the SMART operations to try and make it better when attached to the Areca controller, at this point I moved it to another system and replaced it with the spare I had for this array. I didn't lose any data here, but I did find myself a lot less comfortable recommending Areca controllers.

I just pointed out last month that current 3ware controllers have caught up with Areca's performance. Of course, as far as I can tell, 3ware's tw_cli utility is just as closed source as Areca's. But since you can run smart tests directly with the Linux smartctl on 3ware controllers, all you really need the CLI for is looking inside the hardware RAID to see what the disk status there is.

As for what I went through trying to get the failing drive back to operating order again, that's a whole 'nother story.

Friday, August 15, 2008

"The essential postgresql.conf" at the BWPUG

One of the common complaints about how you setup and tune a PostgreSQL server is that the postgresql.conf file is way too big, and there's no good guidance to where you should start. I've been talking and corresponding with many PostgreSQL administrators over the last few months, and getting writing help from Christopher Browne and then Robert Treat, to work on that problem. I'm proud to announce a document to address that need that is now both current and pretty complete, while not being overwhelming: Tuning Your PostgreSQL Server.

Robert and I finished this off just in time to present the results at this week's Baltimore/Washington PostgreSQL User Group meeting, providing edutainment going through the whole list of 18 most important parameters (plus logging). Thanks to Robert, Chris, the regulars on the PostgreSQL mailing lists, and everyone who gave us feedback at the meeting (now applied to the Wiki document) for helping to bring all this information together into one relatively easy to digest bite.

Monday, August 4, 2008

Virtualbox and custom kernels

I've been using Virtualbox successfully for a few months now. On my laptop, with Linux as the host operating system, the virtual Windows XP install runs faster than the real XP install on another partition, even though I've only given it 512MB of RAM to work with. Very interesting program.

But on my main desktop system I run a custom kernel, and the same Virtualbox install didn't work there. The kernel module it loads just failed. You can get that to reinstall using

/etc/init.d/vboxdrv setup

Or at least you're supposed to be able to. The error message it gave me in /var/log/vbox-install.log suggested I needed to set KERN_DIR to point to the root of my kernel source; did that. No help, and now there's nothing useful in that log file. The module was there, it just didn't work. Manually doing "modprobe vboxdrv" said it was missing symbols.

To track this down, I started by reading the source for the vboxdrv init script. Under the hood, it runs /usr/share/virtualbox/src/build_in_tmp to recompile the module. You can run this by hand to see what's really going on. Annoyingly, there's a crummy test that generates this bogus warning:

test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \
echo; \
echo " ERROR: Kernel configuration is invalid."; \
echo " include/linux/autoconf.h or include/config/auto.conf are missing."; \
echo " Run 'make oldconfig && make prepare' on kernel src to fix it."; \

This is a lie, both files are there, and it prints this even in the final working configuration after I fixed the problem. I also saw some warnings about deprecated symbols in the middle and the linker complained about missing things, just like when I try to load the module.

I found some notes about using Virtualbox with a custom kernel in an ubuntu forum which was interesting background but no help. Another nice link that I want to remember for future use shows how to repackage Virtualbox using dpkg to support a customer kernel. But since this was on a RedHat system that wasn't going to help, and ultimately I suspected that the same compilation problem would kick in.

After some more searching, I found the root cause here. Some of the kernel interfaces that existed in the original 2.6.18 kernel I was using last time this worked were changed in the 2.6.25 kernel I was running now, which is why the symbols it was looking for were missing. A Gentoo bug report goes over the ugly details and suggests that Virtualbox versions after 1.6.2 fixes the problem (I had been running 1.5.6). Sure enough, download the latest version, install that one, and the module rebuilds itself just fine now as long as KERN_DIR is set correctly.

This is one of those interesting bits to debate: the reason this problem was introduced was that an API was changed "to fix bugs and add a really important feature". Linux benefits from being free to make such enhancements as necessary. But at the same time, the complete lack of concern for backward compatibility in kernel modules can really make things difficult for regular users of Linux software that must link to the kernel. The obligatory reading here for the developer's side of this argument is Stable API Nonsense.