Software, technology, sysadmin war stories, and more.
Saturday, January 14, 2012

Logins failed when /etc/motd grew too big

My school got into the whole networking thing rather early. It had a bunch of coaxial cable run through the tunnels between buildings. The cable would surface just long enough to make a stop at a tee connector behind a hub, then it would keep on going to its next destination. For the time, this was actually quite good.

Unfortunately, things changed. One summer, a reconfiguration of the school bumped the computer lab from one side to the other, and the new area wasn't accessible yet. Someone got up in the attic and extended the network from its previous terminus in that building to the other side where the lab would now be.

For a while, things were fine. Our use of the network was relatively light. People ran NCSA telnet to connect to our BSD/OS box where they'd use things like Pine or Elm or Gopher. This was before the days of having web browsers all over the place. Besides, the school didn't have anything beefy enough to even run Mosaic back then.

Then, one day, I got a call from the new admins. People were having trouble logging in. They'd telnet in and get the usual banner and "login:" prompt just fine. It would accept the username and ask for a password, and that would go fine too. Then it would say something like "Last login on ttyp0 from a.b.c.d at (date)" and just sit there.

Over at the console, things were fine. It was only happening out on the various client machines. One thing I did notice is that they now had a massive /etc/motd. Whereas it used to be a couple of lines welcoming users with the operating system details and last upgrade date, now it was pushing a full screen of junk.

I made a copy of their motd and shrank the live one back down to its prior size, then walked back out to a distant machine. Logins worked again. I was able to poke around and run ordinary commands. But there was a catch: if I ran something which spat out a big blob of data all at once, like catting the original huge motd, it would hang again, forever.

It's been a long time, so I don't know what made me try this, but I decided to dial down the MTU on the Unix box. Instead of allowing it to shovel 1500-byte packets over its Ethernet, it was now reduced to tiny little packets which looked more like ATM cells. It added tons of overhead, but ... it worked.
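For a rough sense of how much overhead a tiny MTU adds, here is a minimal Python sketch. Everything numeric in it is an assumption rather than something recorded from that box: plain TCP over IPv4 with no options (20 bytes of header each), 38 bytes of per-frame Ethernet cost (preamble, header, FCS, and interframe gap), an illustrative small MTU of 68 (the IPv4 minimum, not necessarily the value that was used), and a hypothetical 2000-byte motd.

```python
import math

IP_HEADER = 20          # assumption: IPv4 header, no options
TCP_HEADER = 20         # assumption: TCP header, no options
ETH_PER_FRAME = 38      # assumption: preamble 8 + header 14 + FCS 4 + gap 12
ETH_MIN_PAYLOAD = 46    # Ethernet pads shorter payloads up to this size


def tcp_payload_per_packet(mtu):
    """Bytes of user data that fit in one packet at the given MTU."""
    payload = mtu - IP_HEADER - TCP_HEADER
    if payload <= 0:
        raise ValueError("MTU too small to carry any TCP payload")
    return payload


def wire_efficiency(mtu):
    """Fraction of on-the-wire byte times that carry user data."""
    on_wire = max(mtu, ETH_MIN_PAYLOAD) + ETH_PER_FRAME
    return tcp_payload_per_packet(mtu) / on_wire


def packets_needed(data_bytes, mtu):
    """How many full-sized packets it takes to send data_bytes."""
    return math.ceil(data_bytes / tcp_payload_per_packet(mtu))


if __name__ == "__main__":
    motd_bytes = 2000   # hypothetical size of the "full screen of junk"
    for mtu in (1500, 576, 68):
        print(mtu, round(wire_efficiency(mtu), 3), packets_needed(motd_bytes, mtu))
```

The point is only the shape of the trade-off: as the MTU shrinks, the fixed per-packet headers eat most of each frame, and the same blob of text turns into many more packets.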

So, here we were, with a network that could pass a small packet just fine but would choke if you sent a long one. I hypothesized that it was some kind of electrical problem somewhere relatively far away in that coax, and that it was probably reflecting the signal. If the packet could both begin and end before any of it got to that point in the cable, it might work. Otherwise, it would collide with itself, and, well, "all die, oh the embarrassment".
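For anyone who wants to put rough numbers on a hypothesis like that, the arithmetic is simple enough to sketch. The values below are assumptions, not measurements from that network: 10 Mbit/s signalling (typical for thin coax Ethernet of that era), a velocity factor of 0.66 (a common figure for RG-58 style cable), and an 8-byte preamble in front of each frame. The function names are made up for illustration.

```python
SPEED_OF_LIGHT_M_S = 299_792_458
BIT_RATE = 10_000_000      # assumption: 10 Mbit/s coax Ethernet
VELOCITY_FACTOR = 0.66     # assumption: typical for RG-58 style coax
PREAMBLE_BYTES = 8         # preamble plus start-of-frame delimiter


def frame_duration_s(frame_bytes):
    """Seconds the transmitter spends putting one frame on the wire."""
    return (frame_bytes + PREAMBLE_BYTES) * 8 / BIT_RATE


def frame_length_on_wire_m(frame_bytes):
    """Distance the leading edge has travelled when the last bit is sent."""
    return frame_duration_s(frame_bytes) * VELOCITY_FACTOR * SPEED_OF_LIGHT_M_S


if __name__ == "__main__":
    # 64 and 1518 bytes are the usual minimum and maximum Ethernet frame sizes.
    for size in (64, 1518):
        microseconds = frame_duration_s(size) * 1e6
        metres = frame_length_on_wire_m(size)
        print(f"{size}-byte frame: {microseconds:.1f} us, ~{metres:.0f} m of cable")
```

Comparing those distances against the actual cable run is one way to sanity-check whether a frame could really finish before reaching a suspect spot, though without real measurements it stays a back-of-the-envelope exercise.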

Given that this was a no-budget operation, there was no chance of getting a time-domain reflectometer, the kind of tool which would emit a pulse and look for reflections. I just had to try things to see what happened. I tried taking a spare terminator to either end of the backbone to see if one of them had gone bad. That didn't help.

After a lot of random trial and error, I capped off the backbone one hop short of the run to the new computer lab, right where it used to end before the move. This took the new run, and the hub at the very end of it, back out of our network.

Everything started working again. Obviously, something in that run was broken, but we didn't have the ability to re-wire it. We couldn't just leave the lab off the network either, so something else had to be done. Luckily, there was a new project in the works to install twisted-pair to each room, and the new patch panels had already been installed in that part of the building. We were able to feed one of the panels from our last hub and then hop through that into a jack in the lab.

With all of that done, I put back the big motd and all was well. We never did figure out what happened to make it suddenly stop working like that, but in the long run, it didn't matter. The network eventually moved to a proper fiber backbone between buildings and all sorts of badness went away.

Strangely, I would see the opposite problem -- too many tiny packets -- years later at another gig. Upon reflection, I suspect my experience with that screwball network at my school gave me a good place to start troubleshooting that one.

Side note: I suspect that network had at least one nasty ground loop in it because all of those machines were sitting right on that coax, and we used to have weird unexplainable NIC deaths. All of those buildings had been built at different times since the 1950s, and I bet at least one of them had a separate ground. I guess I'm lucky it never gave me a good jolt!