Friday, April 11, 2008

Conference East 08 and Solaris Notes

I had fun and picked up a lot of information at the East Coast PostgreSQL conference two weeks ago. I held off on writing about it until a few things came online, so that I'd have links to everything.

I was surprised at how good the reaction to my own presentation on the internals of the PostgreSQL buffer cache was. Much of that information you can only pick up from staring at the source code to the database, and apparently that's not so popular. There are theoretical aspects to the whole thing, but I did close with some practical suggestions for tuning with queries that use the pg_buffercache module. Slides and sample query source code are now posted at my PostgreSQL Performance Pitstop. I'm hoping to present that again at the upcoming OSCON PostgreSQL Day in Portland. And this time I'll be prepared for the question Jeff Davis nailed me on about how the 8.3 buffer re-use changes actually work.
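For readers who haven't used pg_buffercache, here is a minimal sketch of the kind of query it enables. This is not one of the queries from my slides; it is adapted from the example in the module's documentation, wrapped in a small Python helper. The helper name top_relations and the choice of a generic DB-API cursor are assumptions for illustration, and older releases may need to join on c.relfilenode instead of calling pg_relation_filenode().

```python
# Counts how many shared buffers each relation in the current database holds.
# Adapted from the pg_buffercache documentation example; treat it as a sketch.
TOP_RELATIONS_SQL = """
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10
"""


def top_relations(cursor):
    """Run the query on an open DB-API cursor; return (relname, buffers) rows.

    `cursor` is assumed to come from whatever PostgreSQL driver you use,
    connected as a user allowed to read the pg_buffercache view.
    """
    cursor.execute(TOP_RELATIONS_SQL)
    return cursor.fetchall()
```

The relations at the top of that list are the ones currently taking up the most space in shared_buffers, which is a reasonable starting point for deciding whether the cache is sized sensibly.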

The first really informative talk I saw was Greg Sabino Mullane's discussion of replication solutions. I just used some of the insight gained there to write an updated intro to the Replication, Clustering, and Connection Pooling page on the PostgreSQL Wiki. That page still needs a lot more attention.

Solaris Adventures

One of the things I've been doing recently is deploying more systems on Solaris 10 instead of my traditional Linux environment. There were a few talks discussing Solaris-specific issues and I picked up some tricks at all of them.

Theo from OmniTI talked about their experiences as well as tools they've made available, like their PostgreSQL-Solaris Tools and Project DTrace. Some things I picked up there:
  • ZFS has a pretty substantial overhead compared to the traditional UFS filesystem. It takes some tuning to pull that down for PostgreSQL use: dropping the default large block/stripe size values, making the ZFS buffer cache less aggressive, and adjusting the direct I/O parameters as appropriate.
  • They've found the shared lock that pg_dump takes on large tables makes it really inappropriate for use on larger databases. What they do instead is take a ZFS snapshot, then run the dump against that snapshot. Then there are no database locks on the snapshot to worry about stomping on.
  • PostgreSQL major version upgrades are hard to pull off because the recommended procedure is a dump/reload, and that doesn't work very well on TB-scale databases. Slony is often recommended here, but using it requires a lot of hardware: you basically need a whole additional server just to handle the version upgrade, in addition to the primary/backup combination the main business is running on. Here again you can play some tricks with ZFS snapshots to try things out on the production boxes while leaving the ability to roll back if there's a problem. This greatly enhances the potential for a tool like pg_migrator to be useful for version upgrades of large databases, instead of a full dump/reload.
Jignesh from Sun did two talks about Solaris deployments. The benchmarking one was interesting for me personally as a performance tuning geek, but its focus on commercial benchmarks (TPC is not cheap) and practical suggestions aimed at gigantic user loads isn't exactly mainstream. His Best Practices with PostgreSQL on Solaris talk, though, is a must-read for people using this combination. The comments about UFS and ZFS tuning cleared up many mysteries I'd had about how performance on those two differs from the Linux ext3 setups I'm used to tuning. Some details from the talk to expand on the slides a bit:
  • Disabling the freebehind algorithm turns off the somewhat unexpected Solaris behavior that avoids caching larger files like those you'll find in a database application. It's worth your time to look into all the parameters he mentions for filesystem tuning. They're pretty obscure, but the defaults are not tuned well for database use, at least not for PostgreSQL, which relies much more heavily on the OS filesystem cache than, say, Oracle.
  • One non-obvious reason that direct I/O really doesn't work well on the main database filesystem is that PostgreSQL expects access to the pg_clog files to be readily cached by the OS. When they're not, that causes some ugly performance issues.
  • It wasn't really anticipated by the original design, but apparently the postgresql.conf commit_delay parameter has a highly useful application in very high throughput situations, because it allows database writes to be much larger on average. That makes them better match the underlying storage/filesystem block sizes while also decreasing the number of I/O operations per second, aka IOPS (which can become a bottleneck if you're doing lots of small writes).
  • There are alternative memory allocation and scheduler methods available (libumem and the FX Scheduler) that seem more appropriate for PostgreSQL use than the defaults.
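To put rough numbers on the commit_delay point above, here is a back-of-the-envelope sketch. It is an idealized model, not a measurement: the workload figures are made up, the function name grouped_write_stats is hypothetical, and how many commits actually end up sharing one flush depends on the workload and the delay you pick.

```python
def grouped_write_stats(commits_per_sec, bytes_per_commit, commits_per_flush):
    """Return (flushes_per_sec, bytes_per_flush) for an idealized workload
    in which every flush to disk carries the same number of commits."""
    flushes_per_sec = commits_per_sec / commits_per_flush
    bytes_per_flush = bytes_per_commit * commits_per_flush
    return flushes_per_sec, bytes_per_flush


# Hypothetical workload: 4000 commits/s, each writing 512 bytes.
for group in (1, 4, 16):
    flushes, size = grouped_write_stats(4000, 512, group)
    print(f"{group:>2} commits/flush: {flushes:7.1f} flushes/s, "
          f"{size:5d} bytes per flush")
```

With these made-up numbers, going from one commit per flush to sixteen drops the write rate from 4000 to 250 operations per second, while each write grows from 512 bytes to 8 kB, which is much closer to a typical filesystem block size.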
Finally, another talk that mentioned Solaris tools like ZFS snapshots was Robert Treat's "Big, Bad, Broken PostgreSQL". I would recommend waiting to see if the video for this talk becomes available; it was so funny in a black comedy vein to see how his disaster unfolded that I'd hate to see you ruin the surprises by just reading the slides first.