Thursday, November 17, 2011

More hacks done by a sysadmin with no budget

Working as a sole sysadmin for an organization with a large number of users can be interesting. Operating in that situation with no budget to do things properly makes it doubly interesting. Being forced to make things work with the available resources leads to some really horrible hack jobs.

Here's one I did when we needed more reliability out of our Internet connections but couldn't afford to do it the right way. It involved a bunch of dumb app-level mangling to give the impression that we had far better service than we actually did.

First of all, some context. Our primary Internet connection at the time was a T-1 to a local university. They let us park a small router in their machine room and put it on one of their networks. Our default route pointed at their big router, and their big router had a few static routes for us pointed at our little router.

That university then had a connection with a local firm in town that provided commodity Internet links. They just got that provider to add static routes for our networks pointed down their pipe, and that was that. Well, that was the plan, at least.

The reality is that our uplink wasn't too stable. They'd frequently have all kinds of network woes. Still, it was cheaper than anything else we could find, since all we were really paying for was the cost of hauling a circuit across town to their building. The actual cost we paid them for transit was more symbolic than anything else. This also left us in a situation where we didn't feel able to complain about the level of service. It was a mess.

About a year into this, we got a second link up and running into a local exchange which had been put up as a consortium of local school districts. It wasn't anything amazing. It was a frame relay T-1 with a 50% committed information rate (CIR). In other words, the most we could rely on getting across that pipe was about 768 Kbps, and then, that pipe only took us as far as the exchange.

The entire exchange and all of the school districts connected to it were behind this rickety Solaris box running some god-awful firewall software. That box then connected to the outside world over a load-balanced pair of T-1s. The failure modes were numerous. If the firewall wasn't having a bad day, then the links would be saturated by one or more of the other customers. The fact that this exchange was being run by the telco didn't help.

So there it was: two horrible excuses for Internet connections, and neither of them was any good. Now, to make matters worse, we couldn't use BGP or anything of the sort to advertise our routes. We were too far from the "default-less" Internet by virtue of being buried behind the university on pipe #1 and the consortium on pipe #2.

Besides, we didn't have a router big enough to accept a full view. Such a beast would cost money, and that's what we didn't have. Now you get the idea of just how bad it was.

I was tasked with improving our reliability and somehow managing to use this second pipe even though there was no way to do multi-homing the way it should be done. I actually came up with a solution that worked most of the time, strangely enough. Here's how.

There were a few things working in my favor. First, nobody else had any idea how TCP/IP worked at that place, so they stayed out of my way. Second, I had a couple of random junk boxes which were capable of running Linux, and I knew how to bend them to my will. Third, we had much more IP space than we needed to allocate internally at that point in time, and we owned it outright. This meant I could do some really crazy routing tricks as long as our ISP was willing to play ball.

The first thing I did was to get that frame relay T-1 off our big router, where it had been terminated alongside the point-to-point T-1 to the university. I needed to create a new world in which I had two default routes running simultaneously, and that wasn't going to happen if they were both on the same box.

Instead, I spun off that frame relay link to a little teeny router like the kind we would park at our schools and the one we had at the university. Then I grabbed a little rinky-dink 10 Mbps hub and used it to build a new network. It had the big router, the new little router, and a brand new Linux box I put together which would be in charge of this ridiculous affair.

This Linux box was given an address outside my existing "Unix network" /24. In fact, it was parked on an entirely new /24, and the big and little routers were given addresses in that network as well. I think the Linux box was .1, the big router was .2, and the little router was .3. The important part was that nothing else in the organization was on this network.

Then I got on the horn with our ISP on the frame relay side and told them that what they routed to us would be *just* that single /24, and not the full /19 which we actually owned. They thought I was nuts, so I explained it to them. Traffic bound for the school district would normally see the /19 route and would come in by way of the university. However, if it was going to this specific dinky network and its single Linux box, the more-specific /24 route would win, and it would come in through the frame relay link.
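To make the "more-specific route wins" part concrete, here is a minimal sketch of longest-prefix matching in Python. The prefixes are made-up private placeholders (our real /19 and /24 aren't given here), and `ROUTES` and `pick_route` are hypothetical names for illustration, not anything that ran on a real router.

```python
import ipaddress

# Hypothetical prefixes standing in for the real ones:
# the whole /19 arrives via the university, one /24 via frame relay.
ROUTES = {
    ipaddress.ip_network("10.20.0.0/19"): "university T-1",
    ipaddress.ip_network("10.20.5.0/24"): "frame relay T-1",
}

def pick_route(dst):
    """Return the link for dst using longest-prefix match, or None."""
    addr = ipaddress.ip_address(dst)
    # Collect every route whose prefix contains the destination.
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    # The most specific match (largest prefix length) wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]

print(pick_route("10.20.5.1"))  # the lonely Linux box's network
print(pick_route("10.20.9.7"))  # anything else inside the /19
```

An address inside the /24 matches both routes, but the /24 is longer, so it goes down the frame relay link. Everything else in the /19 only matches the big route and comes in through the university.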

They still thought I was nuts, but they did it anyway. Now I had two "platforms" of a sort inside my network, and this lonely Linux box had its own way out to the world, and the world had a way back to it. Now the problem was making it actually do useful work.

First, I stood up BIND and Sendmail on this box so that it would act as an authoritative name server and a tertiary mail relay for our domains. Then I had to go and add it to all of our domains at the registrar, as well as the NS records on our end, plus the MXs so Sendmail would start seeing traffic. This worked, and now we would actually still resolve and could receive mail when the main pipe was acting up!

But, there's more to life than just receiving mail. Most of our traffic was web stuff. I had already set up a Squid HTTP cache earlier in the year to allow us to keep track of who was going where and reduce the bandwidth load. This turned out to be the solution which let me balance the web traffic using this other link.

On this same lonely Linux box, I added another Squid proxy. I also switched on the option where it would ping a host before attempting to connect to it, and then added it as a parent cache to my existing system. Then I added the "ping-the-target" config to the original cache, and things got interesting.

When someone connected to my proxy and asked it to fetch a given URL, that proxy and the upstream parent on the lonely Linux box would both ping the target host. Whichever one "won" would then be used to serve the traffic. If one of the pipes went down, obviously the ping over that pipe would fail, and then the request would usually wind up going out the other way. This wasn't perfect, since people block ICMP and other fun things like that, but most of the time it did work.
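The decision itself boils down to something like the following sketch. This is not Squid's actual code or configuration, just the idea in Python: `choose_path` is a hypothetical helper, the path names are invented, and the round-trip times are fed in by hand instead of coming from real pings.

```python
from typing import Dict, Optional

def choose_path(rtts_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Pick the path with the lowest ping time.

    rtts_ms maps a path name to its measured round-trip time in
    milliseconds, or None if the ping got no answer on that path.
    """
    # Drop any path whose ping failed (link down, or ICMP blocked).
    reachable = {path: rtt for path, rtt in rtts_ms.items() if rtt is not None}
    if not reachable:
        # Nobody got an answer; the caller needs some fallback here.
        return None
    # The fastest responder "wins" and serves the request.
    return min(reachable, key=reachable.get)

# Both pipes up: the quicker one is used.
print(choose_path({"direct": 40.0, "parent": 95.0}))
# Main pipe down: traffic falls over to the parent on the other link.
print(choose_path({"direct": None, "parent": 95.0}))
```

The case where both pings fail is exactly the "people block ICMP" problem: the target may be perfectly reachable over TCP, so a real setup still has to pick something instead of giving up.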

So that was the situation for a long time. HTTP slipped out both ways, SMTP and DNS came in both ways, and we just queued outgoing mail when the main pipe went stupid. For the times when the primary pipe stayed down too long, I rigged up a hack to make my normal mail server throw everything at this same lonely Linux box, which would then relay to the world.

This little box which did all of this stuff was a Pentium 150 with 128 megabytes of memory and about a gig of disk space. I think my original iPhone from 2007 has more CPU power than that, and it definitely has more storage. It even has more connectivity, and it's just there for little old me! It doesn't have to provide service to 10,000 kids! Oh, and it fits in my purse. Awesome.

Anyway, I left that job eventually. I'm sure whoever came in after me had tons of dailywtf type traffic as they discovered all of my ugly hacks. What can I say? I had a job to do and no money to use on it.

Just like with software, the technical debt catches up sooner or later. It's clear that whoever they got to replace me had no idea what was going on. I noticed they turned down my little machine not long after I left, but left all of the NS and MX records in place, plus the primary "glue" at the registrars.

This meant 33% of initial lookups for their domain would just hang until the resolver finally gave up and tried another nameserver. Brilliant!

Just for the record, I would never put up with this kind of trash these days. Back then, I had no idea that this kind of crap was exceptionally bad. I thought this was just what people did to run networks. Then I escaped.