Software, technology, sysadmin war stories, and more.
Tuesday, January 10, 2012

My design of a software agent to help support workers

I once had an idea for something which would reduce the number of dumb tasks a support tech had to do on a customer's system. The idea is that you would ask this 'agent' to run a given task and it would do it for you. Then you'd get a simple result like pass or fail and could go on with life.

The idea here was to reduce or maybe even eliminate the amount of human error which comes from trying to pass things along as oral tradition and having people remember everything. Let's say you are supposed to "QC" the installation before marking it finished and ready for the customer to "move in". You were supposed to log in and make sure it had the right version of Red Hat running, and the right control panel (if any), and all of this.

Actually checking this meant running dumb commands to cat version files. It got old. I figured we could write something which could use a rudimentary form of scripting to spell out what a "QC" run is supposed to look like. If it didn't find /etc/redhat-release, it would error out, otherwise it would display it, and so on.
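A minimal sketch of that one check, written in Python purely for illustration. The function name and the pass/fail tuple are assumptions, not anything from the original tool; only the /etc/redhat-release path comes from the description above.

```python
from pathlib import Path

def qc_release_file(path="/etc/redhat-release"):
    """One QC step: fail if the release file is missing, else report its first line."""
    release = Path(path)
    if not release.is_file():
        return False, f"{path}: not found"
    # Keep only the first line, which is what a tech would eyeball after `cat`.
    text = release.read_text(errors="replace").strip()
    first_line = text.splitlines()[0] if text else ""
    return True, first_line

if __name__ == "__main__":
    passed, detail = qc_release_file()
    print("PASS" if passed else "FAIL", detail)
```

Every further check (control panel version and so on) would be another function of this shape, which is roughly where the pressure to invent a scripting language comes from.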

Fortunately, I stopped short of actually implementing that insanity. Had that continued, I'm sure it would have resembled an actual programming language before long. My initial tests, which matched literal strings like "qc psa" instead of using a real parser, proved this out. I needed another technique.

The problem seemed interesting enough. I wanted to be able to push out new versions of these helpers so the techs would always have the latest tools available. I also didn't want them to worry about things like having to update the helper tool. It should update itself. Finally, it really should stay running the whole time and "phone home" now and then so that we can have the machines do stuff without having to log in and su first.

I decided the answer would be a daemon which just sat on the machine and checked in periodically. It would also have a "wake up port" where it would just listen on a TCP port for connections. It did not read from that port, but if the connection was from one of our trusted internal (RFC 1918) networks, it would interpret it as a "poke". That would make it wake up and phone home sooner so we didn't have to wait for it to poll.

The wake up port was designed this way (accept, then shutdown + close without ever calling read) so that it would not present any sort of opportunity for breaking into the process. That way, you could have it running on a machine with no firewall and even though the rest of the world could connect to it briefly, they couldn't actually accomplish anything.
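The accept-then-close pattern can be sketched like this. It is an illustration, not the original code: the names are made up, and treating all three RFC 1918 ranges as "trusted" is an assumption, since the real list of internal networks isn't given.

```python
import ipaddress
import socket

# Assumption: every RFC 1918 range counts as trusted. A real deployment
# would presumably list only its own internal networks.
TRUSTED_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_trusted(peer_ip):
    """True if the peer address falls inside one of the trusted networks."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in TRUSTED_NETS)

def accept_poke(listener):
    """Accept one connection, never read from it, and say whether it was a poke."""
    conn, (peer_ip, _port) = listener.accept()
    try:
        conn.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # the peer may already have hung up
    finally:
        conn.close()
    return is_trusted(peer_ip)
```

Since nothing ever calls recv(), there is no input for an outsider to craft. The only information that crosses the boundary is the source address of the connection.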

This part was easy enough. Next, I needed to be able to extend what it could do. My program's main core should only interpret a couple of very simple commands: get version, start self-update, and run a module. Those modules were the key. Let's say you asked it to run "interfaces.list". It would first check to see if it had a module on disk which provided that.

At first, it would not have that module, so it would then reach out to my module server via https and would request it. It downloaded the file, verified its signature with GPG, and then started another program to actually handle that request with the newly-fetched module. This is where it gets a little crazy.
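The "check disk, otherwise fetch and verify" step might look something like the following. The `fetch` and `verify` callables are hypothetical stand-ins for the HTTPS download and the GPG signature check, and the cache layout is an assumption.

```python
from pathlib import Path

def ensure_module(name, cache_dir, fetch, verify):
    """Return the on-disk path for `name`, fetching and verifying it if absent.

    fetch(name) -> bytes and verify(data) -> bool are hypothetical hooks for
    the download and the signature check.
    """
    path = Path(cache_dir) / name
    if path.is_file():
        return path                      # already cached, no network needed
    data = fetch(name)
    if not verify(data):
        raise ValueError(f"signature check failed for {name}")
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)                    # rename so a half-written file is never used
    return path
```

A call such as `ensure_module("interfaces.list", "/var/cache/agent", fetch, verify)` (the directory is made up) would hit the server only the first time.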

This second program would then attempt to dlopen() the module which had just been retrieved from the server. If that worked, it would dig around in there with dlsym() to look for certain variables and data structures. If they all existed and had the right magic numbers, it would assume it was sane and would then jump into the module's entry point.

The module, you see, was a .so file.
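Here is a rough sketch of that load-and-sanity-check step. In C this would be dlopen() and dlsym() called directly; the sketch uses Python's ctypes, which wraps the same calls on POSIX systems. The symbol names `module_magic` and `module_entry` and the magic value are invented for illustration, since the real ones aren't given.

```python
import ctypes

MODULE_MAGIC = 0x4D4F4431  # hypothetical magic value

def load_module(path):
    """Load a shared object, check its magic number, and return its entry point.

    Raises ValueError if the file will not load, lacks the expected symbols,
    or carries the wrong magic number.
    """
    try:
        lib = ctypes.CDLL(path)                                    # dlopen()
    except OSError as exc:
        raise ValueError(f"cannot load {path}: {exc}") from exc
    try:
        magic = ctypes.c_uint32.in_dll(lib, "module_magic").value  # dlsym() on a variable
        entry = lib.module_entry                                   # dlsym() on a function
    except (ValueError, AttributeError) as exc:
        raise ValueError(f"{path}: missing expected symbol") from exc
    if magic != MODULE_MAGIC:
        raise ValueError(f"{path}: bad magic {magic:#x}")
    entry.restype = ctypes.c_int
    entry.argtypes = []
    return entry
```

The matching module would only need to export a 32-bit `module_magic` variable and an `int module_entry(void)` function. That is the "arrange for some binary compatibility" part of the scheme.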

If this all worked, then the module would use some helper functions in the second program to generate a response, and then it would return from that entry point. The second program would then exit(0). This in turn woke up the daemon via SIGCHLD and it would deal with the result.

I decided to implement it with two separate programs so I could be isolated from things which might break due to a library load gone bad. If it crashed that second program, I'd get a non-zero exit code in my daemon and perhaps some other metadata, like if it ended due to a signal. It wouldn't crash the daemon since they were totally separate processes.
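The isolation idea is easy to show in a few lines. This sketch blocks until the child finishes rather than reacting to SIGCHLD as the daemon did, and the function name and result strings are made up.

```python
import signal
import subprocess
import sys

def run_helper(argv):
    """Run the helper in its own process and classify how it ended."""
    proc = subprocess.run(argv)
    rc = proc.returncode
    if rc == 0:
        return "ok"
    if rc < 0:
        # On POSIX, subprocess reports death by signal as a negative return code.
        return f"killed by {signal.Signals(-rc).name}"
    return f"failed with exit code {rc}"

if __name__ == "__main__":
    # A child that exits cleanly; a crashing module would surface as a signal instead.
    print(run_helper([sys.executable, "-c", "raise SystemExit(0)"]))
```

Whatever the module does to its own address space, the parent only ever sees an exit status.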

So let's reflect here. I wanted a dynamic system which let us run commands and process data a little, but I didn't want to invent my own scripting language. It also needed to allow new features/modules to be added without replacing the entire core system every time. My answer was to come up with a scheme where the entirety of C was available for handling such tasks. You'd just need to write a function or two and arrange for some binary compatibility, compile it into a .so, and feed it to the system.

That's not all. Once in a while, the actual core programs (daemon + the second one which ran the .so files) would need to be updated, so I wrote a routine which would handle that. It worked by fetching the new binaries into a temp path and doing the usual GPG checks. Then it invoked the new daemon with a flag which meant "test yourself".

This caused the new binary to start up without trying to open a listener port (since it was already in use, remember). It would then immediately phone home and tell the server that it needed a test job. The server would then generate one, and the daemon would do the usual thing: fetch the job and run the second program, which runs a module and generates a reply which is then uploaded. If all of this worked, it would return successfully.

The original daemon was watching for this return and the confirmation from the server that the test job had run as expected. If this all checked out, it concluded that the new version was in fact capable of handling another job. Then it moved the new binaries into place and made the leap of faith by calling exec() on itself.
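The "test first, then swap and exec" sequence could be sketched as below. The `self_test` callable is a hypothetical hook standing in for running the new daemon in "test yourself" mode and confirming the server saw the test job complete. Deleting a failed candidate is also an assumption.

```python
import os

def promote_if_healthy(staged_path, live_path, self_test):
    """Move the staged binary into place only if its self-test passes.

    self_test(staged_path) -> bool is a hypothetical hook for the
    "test yourself" run plus the server's confirmation.
    """
    if not self_test(staged_path):
        os.unlink(staged_path)          # discard the candidate, keep running the old one
        return False
    os.replace(staged_path, live_path)  # atomic when both paths share a filesystem
    return True

# After a True result, the running daemon would take the leap of faith, e.g.:
#     os.execv(live_path, [live_path])
```

The old process image stays in charge right up until the exec, so a candidate that fails its test never gets a chance to take over.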

The new version came up and kept on going from there. This scheme didn't totally establish that the new version would work 100%, but it was better than just blindly dumping in a binary and running it. I figured that having it combined with dev-side testing could catch most of the obvious forehead-slapper bugs which would force us to manually ssh in and reinstall things.

I designed and built this thing, and even I think this was pretty crazy. It had the benefit of being able to write a module and run it right away, but the actual mechanism underneath "... could be used to frighten small children", to steal a line from Linus Torvalds.

I've also seen the alternative, where there is nothing dynamic available, so everything gets baked into the daemon itself. If you want a new command or tool, you have to write it, get it approved, check it in, and wait for the next binary drop. Then you have to wait for it to make it into a release image that gets pushed to the fleet. You might wait three or four months for it to reach a point where you can rely on it actually working on real machines.

It's easy to assume that the scary-fast technique implies lax standards and the slow and methodical one implies quality. However, those are two different axes.

After all, some people can be scary-fast and good, while others somehow manage to be slow and bad. Everyone else is some mix.

To wrap this up, here are some things to make you go "hmm": On a two-dimensional chart with slow-fast on one axis and bad-good on another, where would you put yourself? Where are your teammates? Also, have these positions remained constant with time, or have they moved?