Writing

Software, technology, sysadmin war stories, and more.
Saturday, February 9, 2013

Corner cases and shell scripts

Now and then I like to review my old chat logs to look for inspiration. Sometimes I can find old troubleshooting sessions and use them as a reminder of something which was messed up and should have been detected sooner. There are a lot of places where this can happen on a Linux box, leading to many opportunities for small utilities to fill a gap of some kind. Unfortunately, some of them can be derailed without too much effort.

One afternoon, someone in a different office pinged me and asked a question about mail and connectivity:

(16:17:50) T: when two computers are connected via private net and they try to email each other they cant because they are using live IPs not the private IPs
(16:18:04) T: there is a fix for that in the /etc/sysconfig/route file?

"Private net" in this context refers to the practice of putting a second NIC in your server so you can have it talk to your other servers across a separate network. This lets you do things like separating your web servers from your database server(s) while keeping that raw traffic off the public interface. Besides the obvious security implications of having your database's daemon open to the outside world, having traffic go across the public side is bad because it can leak to other hosts when the switches are having a bad day and decide to flood packets to all ports. It also drives up your traffic meter which turns into bandwidth overages.

In this case, there were two boxes owned by the same client and they had a situation where mail needed to go from one to the other. For whatever reason, it wasn't working, so he asked me for info. I said it shouldn't care which IPs are involved and asked him to point me to a current ticket so I could get some context, like which servers were involved and their login details.

He pasted in a chunk of what he was seeing:

(16:18:50) T: [root@server1 root]# telnet (customer_domain) 25
Trying (ip_117_249)...
telnet: connect to address (ip_117_249): No route to host

I jumped on the machine and started poking around. Eventually, I came up with a routing table entry which didn't look quite right.

(16:21:15) R: (ip_117_192) 0.0.0.0 255.255.255.192 U 40 0 0 eth0
(16:21:18) R: uh, what's this?

For some reason, it had a normal-looking network route for one of its additional IP addresses. The problem is that we always assigned those as /32s. The machine should think that each additional IP address was on a "network of one". With that netmask, it thought it was on a network which had 64 addresses. That block would run from .192 through .255, which takes in the .249 address he was trying to reach, so the box was presumably trying to deliver straight out eth0 instead of going through its gateway. (Side note: a last octet of .0 would be all 256, but .192 is two bits set -- 128 and 64 -- so it's half, then half again. That's just how I think about these things.)
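The "half, then half again" arithmetic can be written out as a few lines of shell. This is only a sketch: the function name is made up for illustration, and it assumes a mask of the form 255.255.255.x, so it only looks at the last octet.

```shell
#!/bin/sh
# addresses_for_last_octet (hypothetical helper): given the last octet of a
# 255.255.255.x netmask, report how many addresses that network covers.
# Start with all 256 and halve once for every leading bit that is set.
addresses_for_last_octet() {
    octet=$1
    count=256
    bit=128
    while [ "$bit" -gt 0 ] && [ $(( octet & bit )) -ne 0 ]; do
        count=$(( count / 2 ))
        bit=$(( bit / 2 ))
    done
    echo "$count"
}

addresses_for_last_octet 0     # 256
addresses_for_last_octet 192   # 64
addresses_for_last_octet 255   # 1, the "network of one"
```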

That got me looking at routing entries in general. It didn't make sense and it didn't look like the other box on the same account. Then, somehow, I found it:

(16:23:35) R: oh, wait a second
(16:23:37) R: inet addr:(ip_117_240) Bcast:(ip_65_127) Mask:255.255.255.192
(16:23:41) R: bad interface

It was pretty wild. It had the bad netmask and it also had a crazy broadcast address from another network entirely. I've omitted some of the details and have changed a few others, but you can still see that x.x.117.240 and x.x.65.127 with a netmask of 255.255.255.192 don't have anything to do with each other!
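To see why that broadcast address can't belong to that interface, the expected broadcast can be computed from the address and netmask. The sketch below uses 192.0.2.x documentation addresses as stand-ins, since the real ones are elided here, and the helper names are hypothetical.

```shell
#!/bin/sh
# Dotted quad to 32-bit integer. The body runs in a subshell so the IFS
# change does not leak into the caller.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# 32-bit integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Broadcast = network part of the address, with every host bit set to 1.
broadcast_for() (
    addr=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    int_to_ip $(( ($addr & $mask) | ($mask ^ 4294967295) ))
)

# Stand-in address ending in .240, same netmask as in the chat log.
broadcast_for 192.0.2.240 255.255.255.192   # 192.0.2.255
```

Even before looking at the mismatched third octet, an address ending in .240 with that mask should have a broadcast ending in .255. One ending in .127 goes with the .64 through .127 block.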

My digging continued. I kept updating him as I went.

(16:26:02) R: somewhere on this box, something is configuring that interface with the wrong netmask
(16:26:16) R: but ifcfg-eth0:1 is fine
(16:26:21) R: no wait
(16:26:24) R: MASK= not NETMASK=
(16:26:25) R: bingo
(16:26:31) R: whoever configured this did it by hand and hosed it
(16:26:32) R: one sec
(16:26:53) R: -rw------- 1 root root 12288 Aug 7 07:50 .ifcfg-eth0:1.swp
(16:26:57) R: and they left behind a dropping from vi

So now I knew how it had gotten this way: someone had messed up while adding an interface. If you have a shell script which sources another shell script to get settings, and it's looking for a variable called NETMASK and you provide one called MASK, it's not going to get the right values.
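The failure mode is easy to reproduce in miniature. The following is a sketch, not the actual OS networking scripts: load_netmask is a made-up stand-in for "a script which sources a config file and reads NETMASK", and the address is a documentation placeholder.

```shell
#!/bin/sh
# load_netmask (hypothetical): source a settings file and report what the
# caller would see in NETMASK. Runs in a subshell so nothing leaks out.
load_netmask() (
    unset NETMASK
    . "$1"
    echo "NETMASK='${NETMASK-}'"
)

cfg=$(mktemp)
cat > "$cfg" <<'EOF'
DEVICE=eth0:1
IPADDR=192.0.2.240
MASK=255.255.255.255
EOF

# The file is perfectly valid shell, so sourcing it succeeds without a peep.
load_netmask "$cfg"   # prints NETMASK=''
rm -f "$cfg"
```

The misspelled key is just another assignment as far as the shell is concerned, so whatever consumes NETMASK gets an empty string and has to fall back on something else.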

Fixing this config and forcing the interface to be reconfigured from it as if it had just rebooted was all it took. I could have slapped it into shape with ifconfig, sure, but that wouldn't have proven that things would stay fixed the next time someone restarted the box or did an "ifdown + ifup". By changing the file and letting the OS networking scripts do their job I could be confident it would "stick".

One interesting thing about finding that "vi dropping" was that I had some idea of when this had happened. Someone had clearly added an additional IP address to this box on August 7th around 8 AM and had messed it up. This gave me enough info to dig around in the ticket history for this customer's server. I found the ticket where this specific IP address was added, and now knew who had done it.

I passed these details on to this person so he could give the original tech a good-natured elbowing for messing things up for him. It had been like this for two months and nobody had noticed. I guess the customer had never tried to actually use that new interface in all that time.

Looking at this now, I wonder what could be done to avoid problems like that in the future. Since the "config file" is basically a glorified shell script with a bunch of FOO=bar type assignments, it's not likely to have any sort of grammar checker. If it passes the basic syntax requirements for a shell script, odds are nothing will complain about it. At the same time, if someone turned it into a Real Config File backed by a full-blown program with a parser and everything, it would get far more complicated and probably would break more often. It would also fly in the face of "doing things the Unix way", which matters to a fair number of people even now.

I guess someone could write yet more shell code which would have a list of expressions which were valid for the ifcfg-* files, and then could do something like an inverse grep against it. Anything which was left over after matching all possible valid expressions could be reported as anomalous.
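A rough sketch of that inverse grep idea might look like the following. The list of keys is an assumption made up for illustration and is nowhere near the full set the real scripts understand. The function name is hypothetical too.

```shell
#!/bin/sh
# Hypothetical allow-list: a handful of keys, plus comments and blank lines.
# A real checker would need every key the networking scripts actually read.
VALID='^(DEVICE|IPADDR|NETMASK|GATEWAY|ONBOOT|BOOTPROTO)=|^[[:space:]]*(#|$)'

# check_ifcfg FILE: print any line which matches none of the valid patterns.
# Returns 1 if something was left over, 0 if the file looks clean.
check_ifcfg() {
    # grep -v exits 0 when it selected (printed) at least one line,
    # which here means at least one line failed to match the allow-list.
    if leftovers=$(grep -v -E "$VALID" "$1"); then
        echo "$1: unrecognized lines:"
        echo "$leftovers"
        return 1
    fi
    return 0
}

# Usage: check_ifcfg ifcfg-eth0:1
```

Run against the broken file from this story, the MASK= line would be the leftover.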

Of course, then you'd have the problem of keeping that list of expressions in your checker in sync with whatever the actual code uses. Given the way shell scripts tend to be glued together, it might take some real work to figure out every possible thing which might be valid in such a file.

I mean, what do you expect when everything basically happens as a global variable with very few, if any, functions? It's trivial to add a new blob of code which looks for a new global variable without changing any other part of the code. Also, unless you go to some lengths in your script's code, an undefined variable just expands to "" instead of raising an error.

$ export foo='this exists now'
$ echo ${foo}_hello
this exists now_hello
$ unset foo
$ echo ${foo}_hello
_hello
$ echo $?
0

Global variables and conflating NULL with the empty string. Joy.

I mean, who really switches on the option to make it die on unset variables?

$ set -u
$ echo ${foo}_hello
bash: foo: unbound variable
$ echo $?
1

My guess is that most shell scripts don't have so much as a single "set" command, and they probably don't do their own equivalent sanity checks, either: dealing with the possibility of undefined variables, catching exit codes of failing processes, and so on.
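For a sense of how little code such a sanity check takes, here is a minimal sketch. require_vars is a hypothetical helper, not something the stock networking scripts provide, and the values are placeholders.

```shell
#!/bin/sh
# require_vars NAME...: complain about every named variable which is unset
# or empty, and return non-zero if any of them were missing.
require_vars() {
    rc=0
    for name in "$@"; do
        # Indirect lookup; the "-" default keeps this safe under "set -u".
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "missing required setting: $name" >&2
            rc=1
        fi
    done
    return $rc
}

# Usage, after sourcing a settings file:
#   require_vars DEVICE IPADDR NETMASK || exit 1
```

For a single variable, the standard ${NETMASK:?message} expansion does much the same thing: a non-interactive shell prints the message and exits when the variable is unset or empty.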

It's a wonder any of this works at all.