Writing

Software, technology, sysadmin war stories, and more.

Saturday, July 16, 2011

My electronic sysadmin notebook for an entire fleet

I used to run a mixed fleet of Slackware Linux boxes. They didn't have much in the way of package management or config tracking, but I kept tabs on everything. Everything you could possibly need to figure out why a box is doing something or other went into these notes. I ran them for a long time, and the notes spanned all of that time.

From the top level, you could pick the list of systems, which led to an overview of all of them on one page, with a summary of what each one did. There was also a link to programs, for all of those dumb little scripts and other code which get written in the course of running systems.

My systems directory started with a link to a global change log, and then a list of all of my systems in the whole organization, ordered by IP address. Each system's name was a link to its own page. Some things which didn't really have pages but were otherwise important, like switches and routers, showed up on here with no links. This approach also let me visually show where our dialup pool lived in the middle of a particular /24, and also where the boss's home /28 subnet lived in that same block.
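An index ordered by IP address only works if the addresses sort numerically, not as strings. Here is a minimal sketch of that kind of listing using Python's standard ipaddress module. The host names, addresses (taken from the 192.0.2.0/24 documentation range), and the placement of the two named ranges are all made up for illustration; they are not the real network.

```python
import ipaddress

# Hypothetical inventory: names and addresses are illustrative only.
SYSTEMS = {
    "penguin": "192.0.2.10",
    "mailhub": "192.0.2.2",
    "switch-a": "192.0.2.130",
    "boss-gw": "192.0.2.97",
}

# Hypothetical named ranges carved out of the same /24.
RANGES = {
    "dialup pool": ipaddress.ip_network("192.0.2.64/27"),
    "boss's home subnet": ipaddress.ip_network("192.0.2.96/28"),
}


def ordered_listing(systems, ranges):
    """Return (ip, name, range_label) tuples sorted numerically by address."""
    rows = []
    for name, addr in systems.items():
        ip = ipaddress.ip_address(addr)
        # Label the row with the first named range containing it, if any.
        label = next((lbl for lbl, net in ranges.items() if ip in net), "")
        rows.append((ip, name, label))
    # IPv4Address objects compare by numeric value, so .2 sorts before .10.
    return sorted(rows)


for ip, name, label in ordered_listing(SYSTEMS, RANGES):
    print(f"{str(ip):<15} {name:<10} {label}")
```

A plain string sort would put 192.0.2.130 ahead of 192.0.2.2, which is exactly the sort of thing that makes a hand-maintained list hard to scan.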

Systems which were identical and were not necessarily on static IP addresses, like our CD towers and wireless LAN VPN concentrators, had their own generic pages. Finally, there was something rather important: a link to retired systems. Nothing was deleted, only archived.

Screenshot of penguin's page

As mentioned previously, each system had its own page. There, you'd see what operating system was presently installed, and when it had been installed. There were also dates and details for previous versions, and how they were replaced: upgraded, reinstalled completely, or whatever. Some of these systems were not the easiest things to find, so I went through and took pictures of them, and added those to the pages. That way, you'd know if you were looking for a skinny little Dell Optiplex pizza box or a massive HP NetServer LH 4. They were usually labeled, but this helped narrow it down.

Next, every system had a complete dump of its hardware. There, the actual model name, CPU details (type, quantity, cache, stepping), disks, RAM, NICs, video cards, and other such things lived. By the disks, I even listed which partitions mapped to which devices, and if there was any sort of RAID/hotswap tray, the ID numbers, LUNs, and positions in the tray, like "bottom 3 drives". RAM was listed by quantity and type, like 4x64MB and 4x256MB for a total of 1.28GB. NICs were listed by type, along with the MAC address, switch, and port number, and a link to that switch's MRTG activity page. Finally, that section had the most recent dmesg and Linux kernel .config from the last time I built a kernel for it.
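If you wanted to keep that hardware section as structured data instead of free-form HTML, it might look something like the sketch below. The class names, fields, and the sample machine are hypothetical; only the RAM layout (4x64MB plus 4x256MB) comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class RamBank:
    count: int    # number of identical modules
    size_mb: int  # size of each module in MB


@dataclass
class Nic:
    model: str
    mac: str
    switch: str   # which switch it is patched into
    port: int     # port number on that switch


@dataclass
class Hardware:
    model: str
    ram: list = field(default_factory=list)
    nics: list = field(default_factory=list)

    def total_ram_mb(self):
        """Sum module count times module size across all banks."""
        return sum(bank.count * bank.size_mb for bank in self.ram)


# Hypothetical machine; the MAC is from the documentation range.
box = Hardware(
    model="example-server",
    ram=[RamBank(4, 64), RamBank(4, 256)],
    nics=[Nic("example-nic", "00:00:5e:00:53:01", "switch-a", 12)],
)

print(box.total_ram_mb())  # 4*64 + 4*256 = 1280 MB, i.e. 1.28GB
```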

The third major section of these pages was for "official roles". A system might be a web server for projects X, Y, and Z. It might relay mail for this and that. Or maybe it's a caching nameserver, or even a full-blown authoritative nameserver. Perhaps it runs a RADIUS server for our dialup pool. It might have MRTG running on it, so you know you can dig around and find nifty graphs.

Fourth, I listed all of the ports that were supposed to be listening, and all of the daemons which were responsible for each one. This provided a quick way to get some idea of exposure to various problems coming down the pipe ("oh no, another foobar remote vulnerability"), or just do a sanity check on your netstats. You could look for things which aren't supposed to be there, for instance.
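The sanity check described here boils down to comparing two sets. This is a rough sketch of it, assuming the documented ports live in a dict and that you already have the set of ports actually listening (say, pulled out of netstat by hand or by some parser of your own, which is not shown). The port-to-daemon table is an invented example.

```python
# Hypothetical documented state: port number -> responsible daemon.
EXPECTED = {25: "sendmail", 53: "named", 80: "httpd"}


def audit_ports(expected, observed):
    """Compare documented ports against the ports actually listening.

    Returns (unexpected, missing): ports that are open but undocumented,
    and ports that are documented but not open.
    """
    unexpected = sorted(set(observed) - set(expected))
    missing = sorted(set(expected) - set(observed))
    return unexpected, missing


def ports_for(expected, daemon):
    """List the documented ports owned by one daemon, for exposure checks."""
    return sorted(port for port, name in expected.items() if name == daemon)


# Hypothetical observation: 53 is down and something odd is on 6667.
unexpected, missing = audit_ports(EXPECTED, {25, 80, 6667})
print("unexpected:", unexpected)
print("missing:", missing)
print("named listens on:", ports_for(EXPECTED, "named"))
```

The second helper is the "oh no, another foobar remote vulnerability" case: given a daemon name, it tells you which documented ports on that box are exposed by it.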

Any other services which didn't necessarily consume ports came next. These are things above and beyond whatever the OS might have supplied, like majordomo, the previously-mentioned MRTG, various helper cron jobs, or whatever. These tools and programs would usually have a link out to the support page of that project (for something external), or the internal programs directory elsewhere within this tree.

My sixth section was just the random miscellaneous gotchas which would otherwise be missed. Let's say it wasn't running a stock kernel because I was paranoid about something. It would list the Openwall stack nonexec patches, for instance. Maybe in.comsat was misbehaving at some point, and I set up a custom binary. That patch would be here. Perhaps our hardware had some issue where stock kernels would not see the tape drive, and here was my hacked-up patch to make that work, too.

For some systems, I was able to track all of their reboots, and provide a post-mortem for what happened. My most important system had 18 reboots listed for a period of about 5 years. They were things like "enabling ext3 journals", or quite a few kernel upgrades. There were also "server move", "data room power rebuild", and "random twits at console trying to fix with ctrl-alt-del". There was only one unexplained entry, which was an apparent NIC crash that never recurred.

Last, I had my changelog for just that system. New packages or other software like kernels would show up here. Changes shown above in other sections would also show up here in the timeline so they would make sense in context. At the very bottom, it showed "last update by", a date, and a username, which was always me, since I was the lone admin of these machines.

My programs directory was just a tree of directories, one per project. Whatever random things I had created over the years went in there. There was no particular curation of those entries.

My switch and documentation

Besides the above web pages, I even maintained a list of my switch port mappings and which MAC addresses were allowed on each. The switches were also configured to not accept any other address, and unused ports were administratively disabled. I had to do this after another one of those "ctrl-alt-del" people decided to plug something in without authorization and broke quite a few things.
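The port mapping list is really just a lookup table keyed by switch and port. Here is a small sketch of how such a table could be checked in code. The switch names, ports, and MAC addresses are invented, and this only models the documentation side; the actual enforcement happened in the switch configuration, which is not shown here.

```python
# Hypothetical mapping: (switch, port) -> set of MAC addresses allowed there.
ALLOWED = {
    ("switch-a", 1): {"00:00:5e:00:53:01"},
    ("switch-a", 2): {"00:00:5e:00:53:02", "00:00:5e:00:53:03"},
    ("switch-a", 3): set(),  # unused port, administratively disabled
}


def normalize(mac):
    """Lower-case a MAC and use colons so different spellings compare equal."""
    return mac.strip().lower().replace("-", ":")


def is_allowed(switch, port, mac):
    """True only if this MAC is documented for this exact switch port."""
    return normalize(mac) in ALLOWED.get((switch, port), set())


def disabled_ports(switch):
    """Ports on a switch with no allowed addresses at all."""
    return sorted(p for (s, p), macs in ALLOWED.items() if s == switch and not macs)


print(is_allowed("switch-a", 1, "00-00-5E-00-53-01"))  # documented device
print(is_allowed("switch-a", 3, "00:00:5e:00:53:01"))  # plugged into a dead port
print(disabled_ports("switch-a"))
```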

As you might imagine, building and maintaining this was a LOT of work. I still did it, because it kept the bus number of running those systems above 1. Even though there was nobody else around to read these notes, I kept them up to date.

Imagine my surprise when their "network engineers" had the gall to accuse me of purposely not documenting things. I went to ridiculous lengths to make sure there were no mysteries on my network! Everything was locked down tight and listed clearly. Besides these pages, my switch lists were actually printed out and taped to each physical device. When they'd change, I'd update them.

I realize now that what I was doing was so far ahead of them, they just could not comprehend it. It's not that I was particularly special, but rather that they were just so far behind the rest of the world.