Writing

Software, technology, sysadmin war stories, and more.
Saturday, February 18, 2012

Running things in a cluster should be simple

I keep hearing strange things about the kind of hoop-jumping which is required to get any sort of reasonable action from certain kinds of multi-machine clusters. It seems like you need to worship at the altar of XML or possibly even need to hook a whole bunch of extra cluster control library goop into your actual programs. This seems like a huge layering violation to me.

Instead, let's look at a simpler model for how to do this sort of thing. Imagine a cluster with quite a nice "API": it's the typical C library and Linux kernel syscall interface you get by virtue of compiling your code on a Linux box. That is, it's no different from running your program on any other system running that particular mix of libraries and kernel. Your program doesn't need to know that it's now part of some fancy cluster.

You could write "hello world", compile it to a binary, and get it to run, as long as you account for any library differences. This is the same thing you face when making a binary for any other system. If your dev and prod environments are that different, then set up a cross-compiler. Alternatively, you could consider static linkage.

Anyway, so now you have your little binary, and you tell the cluster to pick it up and run it somewhere. It preps a spot on a machine somewhere, drops it on, and kicks it off. The program runs, and then when it's done, your job is over. If it wrote some log files, you have a little while to stop by and check in on them before that space is recycled for another job. If you really need output from the job, make it actually write to permanent storage itself.

That's it! You now have a cluster.

So what about fancy parallel processing? That doesn't need to be very complicated, either. You could make a list of all of the things which need to be processed, and determine that you have 700 different items which need to be handled. Maybe you want to split this up so that 70 instances run in parallel and each process 10 items.

First, you build your program so it can handle an offset into the list, and then make it take an argument like "instance". It just has to take that number and multiply it by 10 (the number of items per instance), then jump into the list at that position.

In other words, --instance=0 makes it start at item 0 and end at item 9, --instance=1 makes it start at item 10 and end at item 19, and so on.
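Here's roughly what that could look like on the worker side. This is only a sketch: the --instance flag comes from the text above, but the function names, the item list, and the "processing" (just printing) are stand-ins I made up for illustration.

```python
import argparse

ITEMS_PER_INSTANCE = 10  # 700 items split across 70 instances


def slice_for_instance(instance, per_instance=ITEMS_PER_INSTANCE):
    """Return the (first, last) item indexes, inclusive, for one instance."""
    first = instance * per_instance
    last = first + per_instance - 1
    return first, last


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--instance", type=int, required=True)
    args = parser.parse_args(argv)

    # Hypothetical work list; a real job would load its own 700 items.
    items = [f"item-{n}" for n in range(700)]

    first, last = slice_for_instance(args.instance)
    for item in items[first:last + 1]:
        print(item)  # stand-in for the real processing
    return 0  # exit code 0 means "this instance finished its share"

# A real worker script would end with: sys.exit(main(sys.argv[1:]))
```

Note that the worker knows nothing about the cluster. It only knows its instance number and how big a share is.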

Now you just need a tiny little bit of help from your cluster software. Tell it to run 70 instances of your program and have it do a little twiddling of argv when it does that. It just needs to set "--instance=(instance_number)" when it starts your program. For this to work, your cluster software needs some way to do variable replacement on command line arguments, evaluated separately for each instance it starts. This should not be a big deal.
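To show how little the cluster side has to do, here's a sketch of that argv twiddling. The "{instance}" placeholder syntax and both function names are my own invention for this example, not something any particular cluster software uses.

```python
def expand_argv(template, instance):
    """Replace the hypothetical "{instance}" placeholder in every argument."""
    return [arg.replace("{instance}", str(instance)) for arg in template]


def commands_for_job(template, count):
    """Build one command line per instance, numbered from 0 to count - 1."""
    return [expand_argv(template, i) for i in range(count)]


# The job description only has to be written once.
job = ["./worker", "--instance={instance}"]
commands = commands_for_job(job, 70)
```

After this, commands[0] is ["./worker", "--instance=0"] and commands[69] is ["./worker", "--instance=69"]. Where each one actually runs is the scheduler's problem, not the program's.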

For a lot of tasks, this is a great place to start. It gets a little hairier when you consider that not all of those 70 instances are going to run to completion. Some of them might never start, or start and then crash, or otherwise fail to do the job somehow. Now what?

One way around this would be to signal success with an exit code, and make your cluster software notice this and allow for retries. Maybe it would be allowed to retry any instance up to 10 times if it fails. That way, you wouldn't have to write a separate nanny process to keep tabs on your workers every time you come up with a job like this.

There is one important thing to remember: unlimited retries will eventually bring unlimited pain. Sooner or later, you will create a job which can never succeed at one or more of its tasks, and it will need to be shut down and rewritten. If you allow unlimited retries, it will beat your cluster to death with endless rescheduling work.
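A bounded retry loop might look something like the sketch below. It assumes exit code 0 means success and anything else means failure, and it treats "could not even start" the same as a crash. The function name and its return convention are made up for this example. The limit of 10 retries is the number from above.

```python
import subprocess

MAX_RETRIES = 10  # retries allowed after the first attempt


def run_with_retries(argv, max_retries=MAX_RETRIES):
    """Run argv until it exits 0 or the retry budget is spent.

    Returns the number of attempts used on success, or None if every
    attempt failed.
    """
    for attempt in range(1, max_retries + 2):  # first try plus the retries
        try:
            result = subprocess.run(argv)
        except OSError:
            # The program never started (missing binary, bad permissions...).
            continue
        if result.returncode == 0:
            return attempt
    return None  # give up: a human needs to look at this job
```

The important part is the last line. At some point the loop stops and reports failure instead of rescheduling forever.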

I would also suggest some kind of exponential backoff and retry scheme for failed tasks to make sure they don't get to hog a cluster. This would also give a nice happy bias towards the jobs which are better-written and don't crash nearly as much. That in turn may encourage programmers to write better code before handing it off to be run.
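The backoff part is just a little arithmetic. Here's one way to sketch it, assuming the delay doubles after every failure and is capped so it never grows without bound. The one-second base and five-minute cap are arbitrary numbers I picked for illustration.

```python
def backoff_delay(failures, base_seconds=1.0, cap_seconds=300.0):
    """Seconds to wait before the next attempt, given how many times in a
    row this task has already failed."""
    if failures < 1:
        return 0.0  # nothing has failed yet, so start right away
    return min(cap_seconds, base_seconds * 2 ** (failures - 1))


# Delays before each of the 10 allowed retries.
schedule = [backoff_delay(n) for n in range(1, 11)]
```

With these numbers the schedule starts at 1, 2, 4, 8 seconds and tops out at 300. A task that keeps crashing spends most of its time waiting, which leaves the machines free for jobs that actually finish.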

You might be surprised just how much you can get done with this scheme.