Friday, May 23, 2008

pgbench suffering with Linux 2.6.23-2.6.26

About a month ago I got Linux installed on some new quad-core hardware at home, intending to use that box primarily for my PostgreSQL work, which includes a lot of benchmarking. The hardware on the system is new enough that I needed a very recent Linux kernel version for it to work well. It was possible to install my standard distributions, CentOS 5 and Ubuntu 7.04, if I set the motherboard to "ATA/IDE Mode=legacy". But that limited me to old ATA modes without DMA. That topped out at 3MB/s of hard drive speed as measured by hdparm -t, an issue well documented in the libATA FAQ. In order to get reasonable speed (close to 70MB/s) I had to use a much newer kernel that allowed me to switch to AHCI mode. The earliest kernel that worked well was 2.6.22-19.

I should have stopped there.

For kicks, I decided to try the latest kernel at the time, 2.6.24.3, to see if it was any faster. The big change in recent kernels I was curious about was the introduction of the controversial Completely Fair Scheduler (CFS). Because that was merged into the kernel quite a bit faster than I find comfortable, I was rather suspicious of its performance, and I launched into a full set of tests using pgbench.

These went very badly, with an extreme drop in performance compared with earlier versions. By the time I quantified that and reported to PostgreSQL land, I was told Nick Piggin had already reported the problem and a fix was due for 2.6.25, which literally came out in the middle of the night the day I was coming up with a repeatable test case for 2.6.24.

2.6.25 gave the worst pgbench results yet. I did confirm that the problem is not in the server, though; if I run the pgbench client on a remote system, the results scale fine to the level I expect.
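For anyone who wants to try the same local-versus-remote comparison, here is a minimal sketch of a client-count sweep. It is not my actual test harness: the select-only mode (-S), the list of client counts, and the 10000 transactions per client are all assumptions picked for illustration, and it expects a database already initialized with pgbench -i.

```shell
#!/bin/sh
# Sketch only: sweep pgbench client counts and print "clients tps" pairs.
# Assumed for illustration (not from the post): select-only mode (-S),
# the client counts below, and 10000 transactions per client.

# Pull the tps figure that excludes connection setup out of pgbench's report.
extract_tps() {
    awk '/^tps = / && /excluding/ { print $3 }'
}

# Usage: sweep DBNAME [extra pgbench options, e.g. -h dbhost]
sweep() {
    db=$1; shift
    for clients in 1 2 4 8 16 32; do
        tps=$(pgbench -S -c "$clients" -t 10000 "$@" "$db" | extract_tps)
        echo "$clients $tps"
    done
}
```

Running "sweep pgbench" on the server itself and "sweep pgbench -h dbhost" from a second machine (dbhost being a placeholder for the server's name) gives two columns of numbers to compare. If only the local run falls apart as the client count climbs, the client side is the suspect.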

It took me some time to organize all my test results and automate my test cases well enough that I thought they could be replicated. This week I submitted a kernel regression report about this problem. Mike Galbraith was easily able to reproduce it, and I've already gotten and tested one patch to improve things.

That's still a ways back from 2.6.22 though, and with 2.6.26 being on -rc3 I have my doubts this will be fully wrapped up before that goes live. So here are my suggestions for those of you using PostgreSQL on recent Linux versions who care about performance:


  • Kernel 2.6.18 has been a fairly solid kernel for me under RHEL5 and the matching CentOS.
  • Kernel 2.6.19 was the first to merge the libata drivers and I avoid it because of that.
  • Kernel 2.6.20 in the form of Ubuntu 7.04 Feisty Fawn seems to have stabilized after the libata merge. I can't comment on the kernel.org distribution.
  • Kernel 2.6.22 seems solid on everything I throw at it; this is the latest Linux kernel I recommend using for PostgreSQL without serious caveats. The first 2.6.22 came out in July of last year, so it's almost a year old now, and it's currently at 2.6.22-19. Ubuntu's 7.10 Gutsy Gibbon includes 2.6.22.
  • Kernels 2.6.23 and 2.6.24, the first to use CFS, have known serious server-side PostgreSQL performance issues. The main distributions I'm aware of that are likely impacted here are Fedora Core 8 and Ubuntu 8.04. I expect Fedora to be running bleeding edge kernels with regressions (you could argue the existence of Fedora has pushed kernel testing toward the users rather than being done as much by the developers). I'm really disappointed Ubuntu adopted such a kernel given the rawness of CFS. I already ranted about this a bunch in my last blog post. I think this is just a timing problem for them. Had they adopted 2.6.22 instead, they'd be facing 5 years of LTS with a kernel using a model abandoned by mainstream development, and would therefore have nowhere to turn when issues popped up. The right answer would be not to do an LTS right now, but I digress.
  • Kernel 2.6.25 seems to have resolved the worst of the server-side issues. This is what Fedora Core 9 is running. But results from pgbench should be considered suspect, particularly under high client loads, and without that working I can't really prove to myself that 2.6.25 is fine in all the situations I like to test.
  • Kernel 2.6.26 may get some fixes in to improve pgbench, depending on how things go. The patch I've already gotten closed much of the performance gap with only a few lines of code changed. You can follow my thread on lkml to see how things are going.


The really important lesson here, one I drill into people whenever I talk about performance and benchmarking issues, is that you've got to measure baseline expectations on any system you put together, continue to check periodically to make sure things haven't eroded, and use that as guidance for any new system. Here I compared new hardware about to go into service against my older, trusted "production" unit to confirm it was faster, discovering both the disk and the pgbench issues. It's really handy to know how to quantify how fast your reference benchmark is on a system from a CPU/memory/disk perspective when you get to where you're looking to upgrade it. Not only will that help guide what should be changed, but it will let you know if the upgrade really worked or not when you're done.
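As a rough illustration of what checking against a baseline can look like, here is a small sketch that compares a fresh measurement with a recorded one. The function names and the 10% default tolerance are made up for this example; the only real numbers are the 70MB/s and 3MB/s disk figures from above.

```shell
#!/bin/sh
# Sketch only: compare a new measurement against a recorded baseline.
# The function names and the 10% default tolerance are illustrative
# assumptions, not part of any existing tool.

# Print the percentage change of NEW relative to BASE, one decimal place.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Succeed only if NEW is no more than TOLERANCE percent below BASE.
check_baseline() {
    base=$1; new=$2; tolerance=${3:-10}
    awk -v b="$base" -v n="$new" -v t="$tolerance" \
        'BEGIN { exit !(n >= b * (1 - t / 100)) }'
}
```

With the disk numbers from this post, "pct_change 70 3" reports the legacy ATA mode as roughly a 96% drop from the AHCI figure, and "check_baseline 70 3" fails. The same pair of functions works for a tps figure out of pgbench or any other number you track over time.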

Monday, May 19, 2008

Linux printing, as "fun" as ever

I've been using Ubuntu 7.04 "Feisty Fawn" as my primary desktop for the last six months. Let me start by saying I'm a big fan of the distribution for desktop use, and I'm looking into deeper use of Ubuntu in the next six months despite the occasional problems with it and its underlying infrastructure. (Debian lets a guy who doesn't know how to read C repackage crypto code? Seriously?)

Anyway, I needed to get that Feisty install talking to my QMS/Minolta (now Konica) Magicolor 2350 laser printer over the network, the only way I ever talk to that printer. Recalling Eric Raymond's Linux usability rant The Luxury of Ignorance, I was curious to see how much things have improved since then.

This printer has all sorts of network interfaces supported and genuine Postscript. The last time I networked it on a Linux host, it was all via the command line, and that was pretty simple as command prompt interfaces go. Under Windows, all I have to do is install the driver and tell it the IP address of the printer. Ubuntu also provides a GUI to handle all this, gnome-cups-manager. I started there. This printer supports the Internet Printing Protocol, which seemed promising for an easy configuration.

I set things up for IPP to that address and just got errors with no useful descriptions in the GUI. Turning to Debugging Ubuntu Printing Problems, I found that on Feisty I had to manually edit the cupsd.conf file to get any useful feedback here. Basically, on this version of Ubuntu, the Gnome tool is useless the minute you have a problem. Supposedly it's better on Gutsy, not that it would have helped me.
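The cupsd.conf edit in question is, as far as I can tell, the LogLevel directive; setting it to debug is what makes the error log useful. Here is a sketch of doing that from a script. The set_loglevel helper is a made-up name for illustration, and it relies on GNU sed's -i option.

```shell
#!/bin/sh
# Sketch only: set the LogLevel directive in a cupsd.conf-style file,
# appending the directive if the file does not already have one.
# set_loglevel is a hypothetical helper name; sed -i assumes GNU sed.
set_loglevel() {
    conf=$1; level=$2
    if grep -q '^LogLevel[[:space:]]' "$conf"; then
        sed -i "s/^LogLevel[[:space:]].*/LogLevel $level/" "$conf"
    else
        echo "LogLevel $level" >> "$conf"
    fi
}
```

Run as root against /etc/cups/cupsd.conf with "debug" as the level, restart cups, and the detailed messages should then show up in /var/log/cups/error_log. Set it back to the original level afterward, since debug logging is noisy.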

I also learned a bit about troubleshooting this type of error from the IPP: Only Raw Printing Works thread on the forums, which wasn't my problem but was in the same area. The other troubleshooting suggestion there for helping with reinstalls was:

sudo foomatic-cleanupdrivers

That was nice for making sure I was properly starting from closer to scratch every time, but it was also no help in actually resolving the issue.

The debugging level logs led me to this bit that seemed to be my real problem:

E No %%BoundingBox: comment in header!
E PID 10539 (/usr/lib/cups/backend/socket) stopped with status 1!
E [Job 63] Unable to write print data: Broken pipe

What the hell? I found a bunch of people with this same error, but few suggestions. Subsequent reading suggested the whole gnome-cups-manager does little but get in the way if it doesn't work right off the bat, so now it was back to using

http://localhost:631/

to manage cups directly.

I tried a few more variations on the IPP protocol before deciding that whole protocol was yet another thing getting in the way. Back to the old standby of talking directly to the printer with lpd. While playing with that, cups went completely insane on multiple occasions, prompting a need for:

/etc/init.d/cupsys restart

Sigh. Did some digging on the default queue info for this printer, and the magic URI that finally worked is:

lpd://192.168.0.6/ps

So basically the old-school setup I used to use. The additional layers of cups and its gnome interface did nothing but get in my way by obfuscating what was going on underneath, and cups remains as buggy as ever. Gosh, maybe I should follow the link on the localhost cups page to purchase Easy Software's "ESP Print Pro"? The only thing worse than spamming my printer setup page with their ad is that when I click on it, the page doesn't even exist. Come on--you put a damn ad on Linux systems all over the world, and you can't even keep a redirect alive to the URL you used? Not exactly a way to get me so confident in your skills that I'd want to give you money.

Now, at this point the obvious flame is "what the hell are you complaining about Ubuntu 7.04 for when 8.04 is available?". That's easy--I only use LTS versions because they're the only ones I'd recommend a business deploy, and disappointingly Hardy is a good beta quality release that was pushed out the door anyway to meet a pre-planned release schedule. It doesn't work anywhere close to well enough for my standards yet. They decided to use a Linux kernel so fresh (2.6.24) it's gone through almost no QA before release, with a crippling bug in the brand-new scheduler model that completely bogs down the application I spend most of my day using, PostgreSQL. Again, seriously? They just introduced a whole new scheduler in 2.6.23 and you expect it will work already? Have you ever actually developed software before? That sucker is many kernel revs away from having all the unexpected corner cases knocked out. Hardy should have shipped with 2.6.22 or earlier if they wanted a stable kernel at launch. Combined with the beta standard Firefox (I must have a reliable Flash plugin for my work as well) and the whole Pulseaudio mess, Hardy isn't even on my radar until service pack, err, update 1 comes out in a couple of months.

A final twist to my story: just after I got printing going, CUPS recognized my printer by itself! It suggested:

ipp://192.168.0.6:631/COM1

Of course, that didn't actually work either--just got IPP errors about not being able to get the status of the printer. Looks like Eric Raymond's Aunt Tillie is still a considerable distance away from easy Linux printing. As for my setup, web pages print fine but lpd output gets truncated at the margins. If only I had a fancy GUI to help set that up...