Tuesday, October 13, 2009

Triple partitioning and Dual Booting with Mac OS

A few months ago I bought a used Intel MacBook that I'm now switching over to as my primary personal laptop. I'm still using Linux as my preferred OS elsewhere though, so I need to set up dual-boot on its hard drive (and no, a virtualized Linux install will not be fast enough). I also got a new backup hard drive, and wanted to partition that to support three OSes. This is the saga of how that all went.

Starting with a blank hard drive, getting the laptop to dual boot was pretty simple. The easy route is to use the Mac OS X Disk Utility for the initial partitioning. You pick the number of partitions, it starts them all equally sized, and you can drag the vertical lines between them to adjust that. It's not a bad UI, and it will create the type of EFI partition needed to boot OS X properly. Once that works, you just need to install rEFIt to allow booting from the other partition. I used the standard packaged installer, followed the instructions for Getting into the rEFIt menu, ran "Start Partitioning Tool" to sync the partition tables (a lot more about this below), then popped an Ubuntu disc in and installed using the "Boot Linux CD" option at the rEFIt menu. The instructions in How-To Install Ubuntu 8.10 on a White MacBook were basically spot on for making 9.04 work too, and if you partition right from the beginning you avoid the whole Boot Camp step and its potential perils.

The main usability problem I ran into is that the touchpad kept "clicking" when I typed, particularly with keys near it like "." when typing an IP address. I followed the instructions for Disable Touchpad Temporarily When Typing and that made the problem go away. The wireless driver in the 2.6.28 kernel included with Ubuntu 9.04 was still a bit immature on this hardware when connecting to the 802.11n network here. To improve that, I grabbed the 2.6.30.9 PPA kernel, which fixed the worst of it; Gentoo Linux on Apple MacBook Pro Core2Duo looks like a good guide to which kernels tend to be better on Mac hardware in general. The wireless is still a bit unstable when doing large transfers; it just stops transferring sometimes. Annoying, but not a showstopper in most cases; I just plug into the wired network for things like system updates. I'm much more annoyed by not having a right mouse button, much less the third one my old ThinkPad had for opening links in Firefox as new tabs only when I want to.

The tough part came when I tried to get my new external backup drive working (my old one died in the middle of all this). Here I wanted one partition with FAT32 (compatible with any Windows install and for backing up my month-old but already broken PlayStation 3), one for Mac OS using its native HFS+ (to be sure of Time Machine compatibility), and one for Linux using ext3. This turned out to be harder than getting the boot drive working, mainly because rEFIt didn't do the hard part for me.

The background here is that Windows and Linux systems have been using an awful partitioning scheme for years that relies on a Master Boot Record (MBR) to hold the partition information. This has all sorts of limitations, including only holding 4 partition entries. To get more, you have to create an "extended partition" entry that holds the rest of them. Apple rejected this for their Intel Macs, and instead adopted the GUID Partition Table (GPT), an Intel scheme that happens to work better with how Apple's hardware uses EFI to boot instead of the standard PC BIOS.

This means that to really get a disk that's properly partitioned for Mac OS X, you need to put a GPT partition table on it. But then other types of systems won't necessarily be able to read it, because they will look for an MBR one instead. It's possible to create a backwards-compatible MBR partition table from a GPT one (but not the other way around), with the main restrictions being that EFI will want a boot partition and the MBR can't have extended partitions. This means you can only get 3 old-school MBR partitions onto the disk. That's how many I needed in my case, but things would have been more complicated had I tried to triple-boot the internal drive. Then I'd have needed to worry about putting the Linux swap partition into the GPT-only space, because it wouldn't have fit into the MBR section.

I hoped to do the whole thing on my Linux system and then just present the result to the Mac, but that turned out to be hard. The GRUB GPT HOWTO covers what you're supposed to do. I knew I was in trouble when step two didn't work, because the UI for "mkpart" within parted has changed since that was written; here's what worked to get started by creating a GPT partition table:
$ sudo parted /dev/sdb
GNU Parted 1.8.8
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to
continue?
Yes/No? yes
New disk label type? [msdos]? gpt
(parted) mkpart non-fs 0 2
(parted) quit
Information: You may need to update /etc/fstab.
Following the whole procedure would be quite messy though, and I did not have a lot of faith that the result would also be compatible with OS X's requirements here. Most of those are outlined on the rEFIt "Myths" page, but there's a lot to absorb there.

I started over by wiping out what I'd done, zeroing the start of the disk where the partition table lives ("dd if=/dev/zero of=/dev/sdb" and wait a bit before stopping it). Then I used the OS X Disk Utility again from the MacBook to create the 3 partitions I needed. Since it doesn't create Linux partitions, I made both of the non-HFS+ ones FAT32. Then I connected the drive back to the Linux system to convert one of them to ext3. This didn't work out so hot:
$ sudo parted /dev/sdb
GNU Parted 1.8.8
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ST932042 1AS (scsi)
Disk /dev/sdb: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 20.5kB 210MB 210MB fat32 EFI System Partition boot
2 211MB 107GB 107GB fat32 UNTITLED 1
3 107GB 214GB 107GB fat32 UNTITLED 2
4 214GB 320GB 106GB hfs+ Untitled 3

(parted) rm 3
(parted) mkpart
Partition name? []? bckext3
File system type? [ext2]? ext3
Start? 107GB
End? 214GB
(parted) print
Model: ST932042 1AS (scsi)
Disk /dev/sdb: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 20.5kB 210MB 210MB fat32 EFI System Partition boot
2 211MB 107GB 107GB fat32 UNTITLED 1
3 107GB 214GB 107GB fat32 bckext3
4 214GB 320GB 106GB hfs+ Untitled 3

(parted) quit
Information: You may need to update /etc/fstab.
That changed the label...but not the type? I tried a few other approaches here in hopes they would work better. I tried deleting the partition, quitting, and then creating it again. I tried using "ext2" (the default) as the type. The partition still showed up as fat32. In retrospect, that appears to be because mkpart only lays out the partition boundaries and a type hint; it doesn't create a filesystem, and the "File system" column reports whatever parted detects on disk, which was still the old FAT32 data sitting in that same spot.

My reading made it clear that using parted from the command line is really not a well tested procedure anymore. The GUI version, gparted, also knows how to operate on GPT partition tables (even if it's not obvious how to create them rather than MBR ones), and that is the primary UI for this tool now. This worked; if I changed the type of the partition using gparted to ext3 and had it format it, the result was what I wanted:
$ sudo parted /dev/sdb
GNU Parted 1.8.8
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ST932042 1AS (scsi)
Disk /dev/sdb: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 20.5kB 210MB 210MB fat32 EFI System Partition boot
2 211MB 107GB 107GB fat32 UNTITLED 1
3 107GB 214GB 107GB ext3 bckext3
4 214GB 320GB 106GB hfs+ Untitled 3

(parted) quit
Linux will mount all three partitions now (with the HFS+ one as read-only), and OS X will mount the FAT32 and HFS+ ones as expected. I've heard so many bad things about the OS X ext2 driver that I decided not to install it; I can always use the FAT32 volume to transfer things between the two OSes if I have to.
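In retrospect, gparted was presumably just building the filesystem itself; if the GUI isn't handy, the equivalent from the command line should be to skip fighting with parted's type field and simply format the slice directly. A minimal sketch, assuming the backup partition really did end up as /dev/sdb3 like it did here:
$ sudo mkfs.ext3 -L bckext3 /dev/sdb3
$ sudo parted /dev/sdb print
The second command is just to confirm that parted now detects ext3 on partition 3.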

We're not done yet though, because the regular MBR on this drive is junk:
$ sudo sfdisk -l /dev/sdb

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util sfdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 38913 cylinders, 255 heads, 63 sectors/track
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

Device Boot Start End #cyls #blocks Id System
/dev/sdb1 0+ 38913- 38914- 312571223+ ee GPT
start: (c,h,s) expected (0,0,2) found (0,0,1)
/dev/sdb2 0 - 0 0 0 Empty
/dev/sdb3 0 - 0 0 0 Empty
/dev/sdb4 0 - 0 0 0 Empty
One big protective GPT entry, not an MBR entry for each actual partition. This won't mount in Windows or on other MBR-only systems (like my PS3, speaking of junk). OS X didn't care about that detail when it created the partition table in the first place. That's one of the things Boot Camp fixes, but if you've already partitioned the drive manually it's too late to use it. When I did the dual-boot install, rEFIt fixed this for me (even though I didn't understand what it was doing at the time) when I ran its "Start Partitioning Tool" menu option. If you want to make a proper MBR from an existing GPT yourself on a non-boot volume, you need to run the gptsync utility it calls for you by hand.

gptsync is available for Ubuntu. Here's what I did to grab it and let it fix the problem for me:
$ sudo apt-get install gptsync
$ sudo gptsync /dev/sdb

Current GPT partition table:
# Start LBA End LBA Type
1 40 409639 EFI System (FAT)
2 411648 208789503 Basic Data
3 208789504 417171455 Basic Data
4 417171456 624880263 Mac OS X HFS+

Current MBR partition table:
# A Start LBA End LBA Type
1 1 625142447 ee EFI Protective

Status: MBR table must be updated.

Proposed new MBR partition table:
# A Start LBA End LBA Type
1 1 409639 ee EFI Protective
2 * 411648 208789503 0c FAT32 (LBA)
3 208789504 417171455 83 Linux
4 417171456 624880263 af Mac OS X HFS+

May I update the MBR as printed above? [y/N] y
Yes

Writing new MBR...
MBR updated successfully!
Afterwards, the GPT looks fine, and now MBR-based utilities understand it too; the good ones even know they shouldn't manipulate it directly:
$ sudo parted -l

Model: ST932042 1AS (scsi)
Disk /dev/sdb: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 20.5kB 210MB 210MB fat32 EFI System Partition boot
2 211MB 107GB 107GB fat32 UNTITLED 1
3 107GB 214GB 107GB ext3 bckext3
4 214GB 320GB 106GB hfs+ Untitled 3

$ sudo sfdisk -l /dev/sdb

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util sfdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 38913 cylinders, 255 heads, 63 sectors/track
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

Device Boot Start End #cyls #blocks Id System
/dev/sdb1 0+ 25- 26- 204819+ ee GPT
start: (c,h,s) expected (0,0,2) found (1023,254,63)
end: (c,h,s) expected (25,127,14) found (1023,254,63)
/dev/sdb2 * 25+ 12996- 12971- 104188928 c W95 FAT32 (LBA)
start: (c,h,s) expected (25,159,7) found (1023,254,63)
/dev/sdb3 12996+ 25967- 12972- 104190976 83 Linux
/dev/sdb4 25967+ 38896- 12930- 103854404 af Unknown
I probably don't need the 210MB set aside for the "EFI System Partition" here, but I'm glad it's there. The backup drive I bought is the same model I put into the MacBook (standard operating procedure for me: I don't like to have a laptop that I can't recover from a hard drive failure on using parts I already own). If the main drive fails, knowing I can throw this one into the Mac and have a decent shot at using it without having to repartition and lose everything first is worth that bit of wasted space. I expect I should be able to swap drives, run the OS X installer, and hit the ground running if something goes bad. If I'm lucky I won't ever have to find out whether that's true or not.

I'm not really a lucky guy, so expect a report on that one day too.

One final loose end: what if you don't have a computer running Ubuntu around, and want to get this sort of synced GPT and MBR setup using just OS X? The regular rEFIt installer doesn't seem to address this; the binary needed only gets installed into the boot area rather than somewhere you can run it from, and it only runs against the boot volume. You could probably build from source to get a copy instead. There is an old copy of gptsync available as part of a brief Multibooting tutorial that covers some of the same material I have here. There's a more up-to-date version of the utility with a simple installer available at enhanced gptsync tool that worked fine for my friend who tested it. If you run that installer, gptsync is then available as a terminal command. Use "df" to figure out what the names of your devices are, and don't use the partition number. This will probably work if you're using an external drive:
gptsync /dev/disk1
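If you want to double-check which disk you're pointing at before running that (the "disk1" above is just an assumption, yours may differ), OS X's own tools will show you the layout first:
$ df -h
$ diskutil list
Match the mounted volumes from df against the /dev/diskN entries that diskutil lists, and hand gptsync the whole-disk device, not one of its slices.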
Once I used gptsync to make a good MBR, the backup drive talked to the PS3 and a Windows box without issues, while still working fine under Linux and OS X. That should cover the main things you need to know for the most common of the uncommon partitioning requirements I expect people to have here.

Getting started with rsync, for the paranoid

When a computer tool has the potential to be dangerous, my paranoia manifests itself by making sure I understand what the tool is doing in detail before I use it. rsync is a very powerful tool for cloning directory trees. It's also quite capable of wiping out your local files, and exactly what it's going to do can be complicated to figure out. It doesn't help that the rsync manual page is a monster.

The basic tutorials I find in Google all seem a bit off, so let me start with why I wrote this. You don't need to start an rsync server to use it, you really don't need or even want to start by setting up insecure keys, and the tutorials that just focus on the basics leave me not sure what I just did. Quick and dirty guide to rsync is the closest to what I'm going to do here, but it lacks the theory and distrust I find essential to keeping myself out of trouble.

Let's start with local rsync, which is how you should get familiar with the tool. A useful mental model is to think of rsync as a more powerful cp and scp rolled into one at first, then focus on how it differs. The canonical simplest rsync example looks like this:
$ rsync -av source destination

What does this actually do though? To understand that, you first need to unravel the options presented. This takes a while, because they're nested two levels deep! Here's a summary:
-v, --verbose       increase verbosity
-a, --archive       archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive     recurse into directories

-l, --links         copy symlinks as symlinks
-D                  same as --devices --specials
    --devices       preserve device files (super-user only)
    --specials      preserve special files

-t, --times         preserve times
-p, --perms         preserve permissions
-g, --group         preserve group
-o, --owner         preserve owner (super-user only)

I've broken these out into similar groups here. You're going to want verbose on in most cases, outside of automated operations like cron jobs. The first thing to be aware of with this simple recipe is that turning on archive mode means you're going to get recursive directory traversal. The "-l -D" behavior you probably want in most cases, to properly handle special files and symbolic links. You'll almost always want to preserve the times involved too. But whether you want to preserve the user and group information really depends on the situation. If you're copying to a remote system, that might not make any sense at all, which means you can't just use "-a" and will instead need to decompose it into just the pieces you want. In many cases where remote transfer is involved, you'll also want "-z" to compress.
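As an example of that decomposition, here's roughly what I'd run to push a tree to a remote account where preserving ownership doesn't make sense; everything except the options is a made-up placeholder:
$ rsync -rlptDvz source/ user@backuphost:/backups/source/
That's "-a" with the "-o" and "-g" pieces left out, plus verbose and compression.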

How does rsync make its decisions?

What are the problem spots to be concerned about here, the ones that can eat your data if you're not careful? To talk about that, you really need to understand how rsync makes its decisions by default, along with its other major modes. Here are the relevant bits from the man page that describe how it decides which files should be transferred; you have to collect the introduction plus the details for a couple of options to piece together the major modes it might run in:
Rsync finds files that need to be transferred using a “quick check” algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file’s data does not need to be updated.

-I, --ignore-times: Normally rsync will skip any files that are already the same size and have the same modification timestamp. This option turns off this “quick check” behavior, causing all files to be updated.

--size-only: This modifies rsync’s “quick check” algorithm for finding files that need to be transferred, changing it from the default of transferring files with either a changed size or a changed last-modified time to just looking for files that have changed in size. This is useful when starting to use rsync after using another mirroring system which may not preserve timestamps exactly.

-c, --checksum: This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a “quick check” that (by default) checks if each file’s size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit MD4 checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly...Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option’s before the-transfer “Does this file need to be updated?” check.

From this we can assemble the method used on each source file to determine whether to transfer it or not. Once the decision to transfer has been made, the remaining checks for that file are skipped.
  1. Is --ignore-times on? If so, decide to transfer the file.
  2. Do the sizes match? If not, decide to transfer.
  3. Default mode with --size-only off: check the modification times on the file. If they don't match, decide to transfer.
  4. Checksum mode: compute remote and local checksums. If they don't match, decide to transfer.
  5. Transfer the file if we decided to above, computing a checksum along the way.
  6. Confirm the transfer checksum matches against the original
  7. Update any attributes we're supposed to manage whether or not the file was transferred.
Understanding rsync's workflow and decision making process is essential if you want to reach the point where you can safely use the really dangerous options like "--delete".
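For instance, here's the way I'd approach "--delete" the first time against a particular pair of trees (the paths are placeholders): preview exactly what it would remove, and only then run it for real:
$ rsync -avn --delete --itemize-changes source/ destination/
$ rsync -av --delete source/ destination/
The first command is a dry run that changes nothing; more on that mode below.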

Common problem spots

One thing to be concerned about even in simple cases is that if you made a copy of something without preserving the times in the past, the copy will have a later timestamp than the original. This can turn ugly if you're trying to get the local additions to a system back to the original again, as all the copies will look newer and you'll transfer way more data than you'd expect. If you know you've just added files on a remote system and don't want to touch the ones that are already there, you can use this option:
  --ignore-existing skip updating files that exist on receiver

This will also keep you from making many classes of horrible errors by not allowing it to overwrite files, so turning it on can be extremely helpful when learning rsync in the first place.
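A sketch of that situation, with invented paths: you've added some new photos locally and only want to push the ones the server doesn't already have, leaving everything else on the far side completely alone:
$ rsync -av --ignore-existing photos/ user@server:/archive/photos/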

If you're not sure which files have changed but always want to prefer the version on the source node, you can save on network bandwidth by using the checksum option. It can take a while to scan all of the files involved and compute the checksums, but you'll only transfer the ones that actually changed, even if the modification times match. Another useful option to know about here is --modify-window, which lets you add some slack into the timestamp comparison, for example if the system clocks involved are low resolution or a bit out of sync.
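Hedged examples of both, again with made-up hosts and paths:
$ rsync -avc source/ user@server:/mirror/source/
$ rsync -av --modify-window=1 source/ /mnt/fatdisk/source/
The first trusts content rather than timestamps; the second allows a second of timestamp slack, the usual workaround for FAT filesystems and their coarse timestamps.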

Using rsync to compare copies

The sophistication of the options here means that you can get rsync to answer questions like "what files have really changed between these two copies?" without actually doing anything. You just need to use one or both of these options:
-n, --dry-run         perform a trial run with no changes made
-i, --itemize-changes output a change-summary for all updates

When learning how to use rsync in the first place, this should be your standard approach anyway: do a dry run with itemized changes, confirm it's doing what you expected, and then fire it off. You'll learn how the whole thing works that way soon enough. Note that if you're using checksum mode, the checksums will get computed twice this way, but if your files are big enough that this matters you probably should be really paranoid about messing them up too. An rsync dry run with checksums turned on is a great way to get a high-level "diff" between two directory trees, either locally or remotely, without touching either one of them.
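Here's the sort of invocation I mean, comparing a local tree against a remote copy without touching either side (the names are placeholders):
$ rsync -avcni source/ user@server:/backup/source/
That's checksum mode plus "-n" for the dry run and "-i" to itemize each difference it finds.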

Other useful parameters to turn on when getting started with rsync are "--stats" and "--progress".

Remote links

Next up are some notes on how the remote links work. If you put a ":" in the name, rsync defaults to using ssh for remote links; again you can think of this as being like scp. Since no admin in their right mind sets up an rsync server nowadays, this is the standard way you're going to want to operate. If you're not using the default ssh port (22), you need to specify it like this:
$ rsync --rsh='ssh -p 12345' source destination

You can abbreviate this to "-e", but I find the long version makes more sense and is easier to remember. You're specifying how rsync should reach the remote shell, and that's reflected in the long option; the short one just got a spare character that wasn't already used.

Wrap-up

That covers the basic rsync internals I wanted to know before I used the tool, the ones that usually get skipped over. The other tricky bit you should know is how directory handling changes based on whether there's a trailing slash on paths; that's covered quite well elsewhere so I'm not going to get into it here.

You should know enough now to use rsync and really understand what it's going to do, as well as how to be appropriately paranoid about using it. Don't overwrite things unless you know it's safe, always do a dry run for a new candidate rsync command, and break the options down to just the subset you need if big collections like "-a" do more than that.

Where to go from here? In order of increasing knowledge requirements I'd suggest these three links:
  1. rsync Tips & Tricks: This gives more detail about some of the options I skimped on, and covers a lot of odd situations too.
  2. Backups using rsync: A great description of how many of the more obscure parameters actually work. It explains what underdocumented parameters like the deletion ones actually do, and suggests how you might use some of them.
  3. Easy Automated Snapshot-Style Backups with Linux and Rsync: The gold mine of advanced techniques here. Once past the basics, it's easy to justify studying this for as long as it takes to understand how the whole thing works, as you'll learn a ton about how powerful rsync can be along the way.

Using doxypy for Python code documentation

Last time I wrote a long discussion about Python module documentation that led me toward using doxypy feeding into doxygen to produce my docs. Since I don't expect Python programmers in particular to be familiar with doxygen, a simple tutorial for how to get started doing that seemed appropriate. I had to document this all for myself anyway.

Running on Ubuntu, here's what I did to get the basics installed (less interesting bits clipped here):
$ sudo apt-get install doxygen
$ cd $HOME
$ wget http://code.foosel.org/files/doxypy-0.4.1.tar.gz
$ tar xvfz doxypy-0.4.1.tar.gz
$ sudo python setup.py install
running install
running build
running build_scripts
running install_scripts
creating /usr/local/local
creating /usr/local/local/bin
copying build/scripts-2.6/doxypy.py -> /usr/local/local/bin
changing mode of /usr/local/local/bin/doxypy.py to 755
running install_egg_info
Creating /usr/local/local/lib/python2.6/dist-packages/
Writing /usr/local/local/lib/python2.6/dist-packages/doxypy-0.4.1.egg-info

That last part is clearly wrong. The setup.py code that ships with doxypy is putting "/usr/local" where "/usr" should go, which results in everything landing in "/usr/local/local". That needs to get fixed at some point (update: as of doxypy-0.4.2.tar.gz 2009-10-14, this bug is gone); for now I was content to just move things where they were supposed to go and clean up the mess:
$ sudo mv /usr/local/local/bin/doxypy.py /usr/local/bin
$ sudo mv /usr/local/local/lib/python2.6/dist-packages/doxypy-0.4.1.egg-info /usr/local/lib/python2.6/dist-packages/
$ sudo rmdir /usr/local/local/bin
$ sudo rmdir /usr/local/local/lib/python2.6/dist-packages/
$ sudo rmdir /usr/local/local/lib/python2.6/
$ sudo rmdir /usr/local/local/lib/
$ sudo rmdir /usr/local/local
And, yes, I am so paranoid about running "rm -rf" anywhere that I deleted the directories one at a time instead of letting recursive rm loose on /usr/local/local. You laugh, but I've watched a badly written shell script wipe out a terabyte of data by not being careful with rm.

Now we need a sample project to work on. Here's a tired old example I've updated with a first guess at markup that works here:
$ cd <my project>
$ $EDITOR fib.py
#!/usr/bin/env python
"""@package fib
Compute the first ten numbers in the Fibonacci sequence
"""

def fib(n):
    """
    Return a Fibonacci number

    @param n Number in the sequence to return
    @retval The nth Fibonacci number
    """
    if n>1:
        return fib(n-1)+fib(n-2)
    if n==0 or n==1:
        return 1

if __name__ == '__main__':
    for i in range(10):
        print fib(i)

Now we ask doxygen to create a template configuration file for us, edit it to add some lines to the end, and run it:
$ doxygen -g
$ $EDITOR Doxyfile

(add these lines to the end and comment them
out where they appear earlier)

# Customizations for doxypy.py
FILTER_SOURCE_FILES = YES
INPUT_FILTER = "python /usr/local/bin/doxypy.py"
OPTIMIZE_OUTPUT_JAVA = YES
EXTRACT_ALL = YES
FILE_PATTERNS = "*.py"
INPUT = "fib.py"

$ doxygen

The basics of how to get Doxygen going are documented in its starting guide. For a non-trivial program, you'll probably want to make INPUT more general and expand FILE_PATTERNS (which doesn't even do anything the way INPUT is set up here). As hinted above, I'd suggest commenting out all of the lines in the file where the parameters we're touching here originally appear, then adding a block like this to the end with your local changes. That's easier to manage than editing all of the values where they show up in the template.
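As an example, for a real project the block at the end might plausibly look more like this instead (the "src" directory is just a placeholder); RECURSIVE tells doxygen to descend into subdirectories, at which point FILE_PATTERNS starts earning its keep:

# Customizations for doxypy.py
FILTER_SOURCE_FILES = YES
INPUT_FILTER = "python /usr/local/bin/doxypy.py"
OPTIMIZE_OUTPUT_JAVA = YES
EXTRACT_ALL = YES
FILE_PATTERNS = "*.py"
RECURSIVE = YES
INPUT = "src"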

Now fire up a web browser and take a look at what comes out in the html/ directory. You have to drill down into "Namespaces" or "Files" to find things in this simple example.

Function Documentation
def fib::fib ( n )

Return a Fibonacci number.

Parameters:
n Number in the sequence to return

Return values:
The nth Fibonacci number
There are a few more things I need to do here before I'll be happy enough with this to use it everywhere. While there's plenty left to learn and some loose ends, so far I'm happy enough with this simple proof of concept to keep going in this direction.

Monday, October 12, 2009

Watching a hard drive die

One thing I get asked all the time is how to distinguish between a hard drive that is physically going bad and one that is just not working right from a software perspective. This week I had a drive fail mysteriously, and I saved the session where I figured out what went wrong to show what I do. It's easy enough to find people suggesting you "monitor x" for your drive, where x varies a bit depending on who you ask. Writing scripts to do that sort of thing is easier if you've seen how a bad drive acts, which (unless you're as lucky as me) you can't just see on demand. This is part one of a short series I'm going to run here about hard drive selection, which will ultimately lead to the popular "SATA or SAS for my database server?" question. To really appreciate the answer to that question, you need to start at the bottom, with how errors play out on your computer.

Our horror story begins on a dark and stormy night (seriously!). I'm trying to retire my previous home server, an old Windows box, and migrate the remainder of its large data files (music, video, the usual) to the new Ubuntu server I live on most of the time now. I fire up the 380GB copy on the Windows system a few hours before going to bed, using cygwin's "cp -arv" so I won't get "are you sure?" confirmations stopping things, expecting it will be finished by morning. I check up on it later, and the copy isn't even running anymore. Curious what happened, I run "du -skh /cygdrive/u/destination" to figure out how much it copied before dying. In the middle of that, the drive starts making odd noises, and the whole system reboots without warning. This reminds me why I'm trying to get everything out of Windows 2000.

At this point, what I want to do is look at the SMART data for the drive. The first hurdle is that I can't see that while the disk is connected via USB. A typical USB (and Firewire) enclosure bridge chipset doesn't pass requests for SMART data through to the underlying drive. So when things go wrong, you're operating blind. Luckily, this enclosure also has an eSATA connector, so I can connect it directly to the Linux PC, and that connection method doesn't have the usual external drive limitations. If that weren't available, I'd have had to pull the drive out of its enclosure and connect it directly to a Linux system (or another OS with the tools I use) to figure out what's going on.

(If you don't normally run Linux, you can install smartmontools on Windows and other operating systems. Just don't expect any sort of GUI interface. Another option is to boot a Linux live CD; I like Ubuntu's for general purpose Linux work, but often instead use the SystemRescueCd for diagnosing and sometimes repairing PC systems that are acting funny.)

I plugged the drive into my desktop Linux system, ran "tail /var/log/messages" to figure out what device it was assigned (/dev/sdg), and now I'm ready to start. First I grab the drive's error log to see whether the glitch was at the drive hardware level or not:
$ sudo smartctl -l error /dev/sdg
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
...
Error 2 occurred at disk power-on lifetime: 153 hours (6 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 08 00:00:17.300 IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 154 hours (6 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 01 00 00 a0 Error: UNC 1 sectors at LBA = 0x00000001 = 1

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 01 01 00 00 a0 ff 11:17:17.000 READ DMA EXT
25 03 30 5e 00 d4 48 04 11:17:05.800 READ DMA EXT
25 03 40 4f 00 d4 40 00 11:16:56.600 READ DMA EXT
35 03 08 4f 00 9c 40 00 11:16:56.600 WRITE DMA EXT



DMA read/write errors can be caused by driver or motherboard issues, but failing to identify the device isn't good. Is the drive still healthy? By a rough test, sure:
$ sudo smartctl -H /dev/sdg
SMART overall-health self-assessment test result: PASSED

This is kind of deceptive though, as we'll see. The next thing we want to know is how old the drive is and how many reallocated sectors there are. Those are the usual first warning signs that a drive is losing small amounts of data. We can grab just about everything from the drive like this (bits clipped from the output in all these examples to focus on the relevant parts):
$ sudo smartctl -a /dev/sdg
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 092 092 016 Pre-fail Always - 2621443
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 111 111 024 Pre-fail Always - 600 (Average 660)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 74
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 7
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 154
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 77
193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 77
194 Temperature_Celsius 0x0002 114 114 000 Old_age Always - 48 (Lifetime Min/Max 18/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 12
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 4
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 4
...

Make sure to scroll this to the right; the last column is the most important. UDMA_CRC_Error_Count matches the errors we're still seeing individually. But the real smoking gun here, and in many other cases you'll see if you watch enough drives die, is Reallocated_Sector_Ct (7) and its brother Reallocated_Event_Count (12). Ignore all the value/worst/thresh nonsense; that data is normalized by a weird method that doesn't make any sense to me. The raw value is what you want. On a healthy drive, there will be zero reallocated sectors. Generally, once you see even a single one, the drive is on its way out. This is always attribute #5. I like to monitor #194 (temperature) too, because that's a good way to detect when a system fan has died; the drive overheating can be a secondary monitor for that very bad condition. You can even detect server room cooling failures that way; it's fun watching a stack of servers all kick out temperature warnings at the same time after the circuit the AC is on blows.
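If you just want to keep an eye on those two attributes without wading through the whole report, something like this (device name assumed) trims the output down:
$ sudo smartctl -A /dev/sdg | egrep 'Reallocated_Sector_Ct|Temperature_Celsius'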

The other thing to note here is Power_On_Hours. Here the raw value (154 hours) confirms that the recent errors in the logs did just happen. This is a backup drive I only power on to copy files to and from, and it's disappointing that it died with so little life on it. Why that happened and how to prevent it is another topic.

Next thing to do is to run a short self-test, wait a little bit, and check the results. This drive is loud enough that I can hear when the test is running, and it doesn't take long:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -l selftest /dev/sdg
...
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 60% 153 717666530


Here's the thing to realize: the drive reports "healthy" to the simple test. But it isn't; there are reallocated sectors, and even the simplest of self-checks will find problems. Enterprise RAID controllers can be configured to do a variety of "scrubbing" activities when the drives aren't being used heavily, and this is why they do that: early errors can get caught by the drive long before you'll notice them any other way. Nowadays drives will reallocate marginal sectors without even reporting an error to you, so unless you look at this data yourself you'll never know when the drive has started to go bad.

At this point I ran a second short test, then an extended one; here's the log afterwards:
$ sudo smartctl -t short /dev/sdg
$ sudo smartctl -t long /dev/sdg
$ sudo smartctl -l selftest /dev/sdg

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 157 2454905
# 2 Short offline Completed: read failure 40% 156 711374978
# 3 Short offline Completed: read failure 60% 153 717666530

The extended test found an error even earlier on the disk. It seems pretty clear this drive is failing quickly. At this point, there's little hope for it beyond saving any data you can (which I did before even launching into this investigation) and moving on to the manufacturer's diagnostic software. What I'd expect here is that the drive *should* get marked for RMA if it's already in this condition. It's possible the software will "fix" it instead. That's a later topic here.

In short, there are a few conclusions you can reach yourself here, and since this is a quite typical failure I can assure you these work:

  • Try not to connect external drives to a Windows server, as they're pretty error prone and Windows isn't great at recovering from this sort of error. This fact is one reason I get so many requests to help distinguish true hardware errors from Windows problems.

  • Drives that are set up such that they can't be checked via SMART are much more likely to fail quietly. Had this one been connected directly rather than via USB, I could have picked up this problem when the reallocated sector count was lower and decreased my risk of data loss.

  • If you're taking an offline drive out of storage to use it again, a short SMART self-test and a look at the logs afterwards is a good preventative measure. Some drives even support a "conveyance" self-test aimed at checking quality after shipping; this one didn't, so I went right from the short to the long tests.

  • When monitoring regularly with smartmontools, you must watch reallocated sectors yourself; the health check will not include them in its calculations. They are an excellent early warning system for errors that will eventually lead to data loss.

  • Regularly running the short SMART self-test doesn't introduce that high of a disk load, and it is extremely good at finding errors early. If you're not using some other form of data consistency scrubbing, I highly recommend scheduling a periodic SMART self-test on your server outside of peak hours; there's a sketch of one way to do that with smartd right after this list.

  • Running a SMART self-test when you first put a drive into service provides a baseline showing good behavior before things go wrong. I didn't do that in this case and I wish I had that data for comparison. It's helpful for general drive burn-in too.
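Here's the sort of smartd.conf entry I had in mind in the scheduling bullet above. Treat it as a sketch: the device, schedule, and address are all assumptions to adapt, and check "man smartd.conf" for your version. It monitors all attributes, runs a short self-test every night at 2am and a long one Saturdays at 3am, and mails someone when a check trips:
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com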

That's the basics of what an error looks like when you catch it before the drive stops responding altogether, which I've found is quite often the case. Anecdotally, 75% of the drive failures I've seen in the last seven years (3 of the 4 since I started paying attention to this) showed these signs before the drive stopped responding. Some of the larger drive studies floating around recently suggest SMART isn't quite that reliable as an early warning for most people, but it's certainly much better than not checking for errors at all.

The fact that this drive died somewhat mysteriously, in a way that it even still passed its SMART health check, has some interesting implications for its suitability in a RAID configuration. That's where I'm heading with this eventually.

Whatever OS you're running, you should try to get smartmontools (or a slicker application that does the same thing) running and set up to e-mail you when it hits an error. That regime has saved my data on multiple occasions.

Wednesday, October 7, 2009

Writing monitoring threads in Python

A common idiom in programs I write is the monitoring thread. When I have a program doing something interesting, I often want to watch consumption of some resource in the background (memory, CPU, or app internals) while it runs. Rather than burdening the main event loop with those details, I like to fire off a process/thread to handle that job. When the main program is done with its main execution, it asks the thread to end, then grabs a report. If you write a reusable monitoring library like this, you can add a monitoring thread for whatever you want to watch within a program with a couple of lines of code.

Threading is pretty easy in Python, and the Event class is an easy way to handle sending the "main program is exiting, give me a report" message to the monitoring thread. When I sat down to code such a thing, I found myself with a couple of questions about exactly how Python threads die. Some samples:
  • Once a thread's run loop has exited, can you still execute reporting methods against it?
  • If you are sending the exit message to the thread via a regular class method, can that method safely call the inherited thread.join and then report the results itself only after the run() loop has processed everything?
Here's a program that shows the basic outline of a Python monitoring thread implementation, with the only thing it monitors right now being how many times it ran:
#!/usr/bin/env python

from threading import Thread
from threading import Event
from time import sleep

class thread_test(Thread):

    def __init__ (self,nap_time):
        Thread.__init__(self)
        self.exit_event=Event()
        self.nap_time=nap_time
        self.times_ran=0
        self.start()

    def exit(self,wait_for_exit=False):
        print "Thread asked to exit, messaging run"
        self.exit_event.set()
        if wait_for_exit:
            print "Thread exit about to wait for run to finish"
            self.join()
        return self.report()

    def run(self):
        while not self.exit_event.isSet():
            self.times_ran+=1
            print "Thread running iteration",self.times_ran
            sleep(self.nap_time)
        print "Thread run received exit event"

    def report(self):
        if self.is_alive():
            return "Status: I'm still alive"
        else:
            return "Status: I'm dead after running %d times" % self.times_ran

def tester(wait=False):
    print "Starting test; wait for exit:",wait
    t=thread_test(1)
    sleep(3)
    print t.report() # Still alive here
    sleep(2)
    print "Main about to ask thread to exit"
    e=t.exit(wait)
    print "Exit call report:",e
    sleep(2)
    print t.report() # Thread is certainly done by now

if __name__ == '__main__':
    tester(False)
    print
    tester(True)
Calling the thread's "join" method from the method that requests it to end is optional, so the test exercises both behaviors. Here's what the output looks like:
Starting test; wait for exit: False
Thread running iteration 1
Thread running iteration 2
Thread running iteration 3
Status: I'm still alive
Thread running iteration 4
Thread running iteration 5
Main about to ask thread to exit
Thread asked to exit, messaging run
Exit call report: Status: I'm still alive
Thread run received exit event
Status: I'm dead after running 5 times

Starting test; wait for exit: True
Thread running iteration 1
Thread running iteration 2
Thread running iteration 3
Status: I'm still alive
Thread running iteration 4
Thread running iteration 5
Main about to ask thread to exit
Thread asked to exit, messaging run
Thread exit about to wait for run to finish
Thread run received exit event
Exit call report: Status: I'm dead after running 5 times
Status: I'm dead after running 5 times
That confirms things work as I'd hoped. That is usually the case in Python (and why I prefer it to Perl, which I can't seem to get good at predicting). I wanted to see it operate to make sure my mental model matches what actually happens though.

Conclusions:
  1. If you've stashed some state information into a thread, you can still grab it and run other thread methods after the thread's run() loop has exited.
  2. You can call a thread's join method from a method that messages the run() loop and have it block until the run() loop has exited. This means the method that stops things can be set up to return complete output directly to the caller requesting the exit.
With that established, I'll leave you with the shell of a monitoring class that includes a small unit test showing how to use it. It's the same basic program, but without all the speculative coding and print logging in the way, so it's easy for you to copy and build your own monitoring routines from. The idea is that you create one of these, it immediately starts, and it keeps doing whatever you want in the background until you ask it to stop, at which point it returns its results (and you can always grab them later too).
#!/usr/bin/env python

from threading import Thread
from threading import Event
from time import sleep

class monitor(Thread):

    def __init__ (self,interval):
        Thread.__init__(self)
        self.exit_event=Event()
        self.interval=interval
        self.times_ran=0
        self.start()

    def exit(self):
        self.exit_event.set()
        self.join()
        return self.report()

    def run(self):
        while not self.exit_event.isSet():
            self.times_ran+=1
            sleep(self.interval)

    def report(self):
        if self.is_alive():
            return "Still running, report not ready yet"
        else:
            return "Dead after running %d times" % self.times_ran

    def self_test():
        print "Starting monitor thread"
        t=monitor(1)
        print "Sleeping..."
        sleep(3)
        e=t.exit()
        print "Exit call report:",e
    self_test=staticmethod(self_test)

if __name__ == '__main__':
    monitor.self_test()
The main thing you might want to improve on for non-trivial monitoring implementations is that the interval here will drift based on how long the monitoring task takes. If you're doing some intensive processing that takes a variable amount of time at each interval, you might want to adjust the sleep time to aim for a regular target time, rather than just sleeping the same amount every time.
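Here's a sketch of one way to handle that: replace the run() method above with something like this, and add "from time import time" to the imports. The actual monitoring work is left as a placeholder comment:
    def run(self):
        next_run=time()
        while not self.exit_event.isSet():
            self.times_ran+=1
            # ...do the real monitoring work here...
            next_run+=self.interval
            remaining=next_run-time()
            if remaining>0:
                # sleep only for whatever is left of this interval
                sleep(remaining)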

(All the above code is made available under the CC0 "No Rights Reserved" license and can be incorporated in your own work without attribution)

Tuesday, October 6, 2009

Formatting source code and other text for blogger

The biggest nemesis of this blog is that I regularly include everything from source code to log files here, and that sort of content really does not fit well into Blogger without some help. Today I got fed up with this enough to look for better ways than what I had been doing.

My HTML skills are still mired in cutting-edge 1995 design (I lost touch somewhere around CSS), so my earlier blog entries used this bit of HTML to insert text I didn't want Blogger's formatting to touch; it was the quickest hack I found that worked:

<div style="padding: 4px; overflow: auto; width: 400px; height: 100px; font-size: 12px; text-align: left;"><pre>
Some text goes here
</pre></div>


That example looks the way things formatted this way will look, except with only the inner scroll bar, and getting it posted turned quite self-referential.

Two things were painful about this. The first is that I had to include this boilerplate formatting stuff every time, which required lots of cut and paste. The second is that I had to manually adjust the height every time, and the heights didn't match between the preview and the actual post. I think I did that on purpose at one point, so that I could display a long bit of source code without having to show the whole thing. In general, this is a bad idea though, and you instead want to use "width: 100%" and leave out the height altogether.

What are the other options? Well, you could turn that formatting into a proper "pre" style in your template, which cuts down on the work considerably and is much easier to update across the whole blog. Then you just wrap things with the pre/code combo and you're off, which is a bit easier to deal with. There's an example of this at Blogger Source Code Formatter that even includes a GreaseMonkey script to help automate wrapping the text with what you need. Another example of this sort of adjustment is at How to show HTML/java codes in blogger.

You probably want to save a copy of everything before you tinker, and track your changes; the instructions at Can I edit the HTML of my blog's layout? cover this. I put my template into a software version control tool so I can track the changes I make and merge them into future templates; I'm kind of paranoid though, so don't presume you have to do that. I settled on the "Simple II" theme from Jason Sutter as the one most amenable as a base for a programming-oriented blog, as it provides the most horizontal space for writing wide lines. I'd suggest considering a switch to that one before you customize your template, then tweak from there.

The main problem left to consider here, particularly when pasting source code, is that you need to escape HTML characters. I found two "web services" that do that for you, including producing a useful wrapper. I like the UI and output of Format My Source Code For Blogging better than Source Code Formatter for Blogger, but both are completely usable, and the latter includes the notion that you might want to limit the height on long samples. I think in most cases you'd want to combine using one of them with the approach of saving the style information into your template advocated by the GreaseMonkey-based site: use the code and its wrapper from these tools in a typical case rather than a one-off style every time. If you do that, you can just wrap things in simple pre/code entries and possibly use something as simple as Quick Escape to fix the worst characters to be concerned about.
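If you'd rather not paste your code into somebody else's web page at all, escaping the three problem characters locally is a one-liner. A sketch in Python (the script name is whatever you like):
#!/usr/bin/env python
# Escape &, < and > so source can be pasted into Blogger's HTML editor.
# Usage: python blogescape.py < fib.py > escaped.txt
import cgi,sys
sys.stdout.write(cgi.escape(sys.stdin.read()))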

Here's what I got from the simpler tool I mentioned first:
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code>
Some text goes here
</code></pre>

That's a bit more reasonable to work with, looks better (I favor simple over fancy but like something to make the code stand apart), and is easy to dump into my template for easy use (and changes) in the future.

After considering all the samples available, here's the config I ended up dumping into my own Blogger HTML template, after switching themes. This goes right before "]]></b:skin>" in the template:
pre
{
font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace;
background:#efefef;
color: #000000;
font-size:100%;
line-height: 100%;
overflow: auto;
width: 100%;
padding: 5px;
border: 1px solid #999999;
}

code
{
color: #000000;
font-size:100%;
text-align:left;
margin:0;
padding:0;
}
That's a bit better to my eye; the dashed border looked bad. Code is easier to follow too.

Now, what if you want real syntax highlighting for source code? Here the industrial-strength solution is SyntaxHighlighter. There's a decent intro to using that approach at Getting code formatting with syntax highlighting to work on blogger. The one part I'm not comfortable with there is linking the style sheets and Javascript directly to the trunk of the SyntaxHighlighter repo. That's asking for your page to break when the destination moves (which has already happened) or someone checks a bad change into trunk. And that's not even considering the security nightmare if someone hostile takes over that location (less likely when it was on Google Code; I'm not quite as confident in the ability of alexgorbatchev.com to avoid hijacking). You really should try to find a place you have better control over to host known stable copies of that code instead.

I may publish a more polished version of what I end up settling on at some point; for now I wanted to document what I found before I forgot the details.