
Friday, March 16, 2018

Troubleshooting IPv6 badness to certain hosts in a rack

IPv6 is generally a good thing to have working on your network. However, getting there may expose a bunch of problems in the infrastructure. This story is about one of them in particular. It's one of those things you might encounter when you finally get around to switching on IPv6 and discover that your vendor hasn't quite gotten the bugs out.

Let's say you build racks of 40 machines, and you spot one in a given rack that is being wonky. It's showing up on your monitoring with elevated error counts compared to its neighbors, or with higher latency than the others for whatever it does. You're not quite sure what to make of it at first, so you log in to take a look.

The machine itself seems fine. It has no more CPU load than its friends, it's using about the same amount of memory, the hard drive is loaded about the same, and all of the usual on-host metrics seem good enough. You start noticing the problem by paying attention to the actual ssh session you've opened to poke at the machine.

Instead of a session that's nice and smooth with consistent latency (whatever it may be, based on distance to the box), the latency swings around all over the place. You start picking up on this, and once you start looking for it, it's no longer something you can ignore.

To be sure, you hold down a key and let it repeat. The machine echoes it back at the shell, and that's when it's crystal clear: instead of a steady stream of dots (or whatever), it comes in fits and starts. It'll block for a few seconds, then unleash a burst at you. It's annoying.

Meanwhile, an ssh to another box in the same rack seems super smooth and is great. You can't make it do the lag/burst thing no matter what. It echoes characters back with the same delay every time. There is no jitter.

Around this point, maybe you try to ping the bad box and a good box. They look exactly the same. This seems confusing at first, but then maybe someone points out that the ssh is over IPv6, or maybe you notice it yourself while looking at 'ss' or 'netstat'. Either way, this gets you thinking in terms of v4 vs. v6.

You switch to ping6. The round trip times to the bad box don't seem consistent any more, but for the most part, they aren't too bad. They're just a little strange: somewhat higher at times, much higher at others, and with just a touch of packet loss.

Clearly, this box is failing to do IPv6 properly, right? Well, not so fast. You have to measure from other angles, and so you do. You're already on another box in the same rack, and so you ping from the good one into the bad one. It's fine, as expected. You ping6 from the good one into the bad one, and it's also fine.

You repeat this with other tools to verify, and eventually it shakes down to this: traffic is slow to this one box, but only when it's coming in from outside the rack, and only when it's IPv6. If it's IPv4, it's always okay, and if it's from another host on the same switch (as its rackmates will be), it's always OK, whether v4 or v6.

Hopefully by now you have a little matrix down on a scratch sheet of paper or at least somewhere in your head. The patterns of "good this way" and "bad that way" should start adding up. Let's review.

If there was lagginess and general strangeness going into the box for both v4 and v6 traffic from any source, you might blame something specific to it. It could be something electrical, like a bad patch cable or fiber drop. It could be a bad NIC somehow. (We'll assume the software is identical since the machines are all under version control.)

If only v6 is slow from any source, then you might again blame some aspect of the machine. Maybe the offloading stuff on the NIC is broken. It's not as likely, but hey, what else do you have to work with here?

But no, this is only v6 being slow, and then only from outside the rack. This tells us that it can't reasonably be the NIC or the cable between it and the switch since the traffic manages to transit them just fine in other scenarios. Somehow, it's something about the traffic coming across the routing part of the switch on its way from the rest of the network. How does it "know" to be slow when it's coming from outside?

What could break v6 across the routing fabric?

Finally, you start up a ping6 and just let it keep plinking away at the bad box. It's wiggly, but the latency stays under 10 msec. While it's running, you fire up another ping6, but this one you set for 9000 byte packets and add the "flood ping" option for good measure. You let it rip.
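
The mechanics of this test are simple enough to sketch out. Something like the Python below would do it, assuming the iputils ping6 flags (-i for interval, -f for flood, -s for payload size in bytes) and root for the flood part:

    #!/usr/bin/env python3
    # Rough sketch of the two-ping setup. Flag names assume the iputils
    # ping6 tool: -i (interval) for the steady baseline, -f (flood) and
    # -s (payload size in bytes) for the antagonist. Flood ping needs root.
    import subprocess
    import sys

    target = sys.argv[1]  # hostname or IPv6 address of the suspect box

    # Baseline: a slow, steady ping6 you watch for latency swings.
    baseline = subprocess.Popen(["ping6", "-i", "1", target])

    try:
        # Antagonist: huge packets, sent as fast as the tool will push them.
        subprocess.run(["ping6", "-f", "-s", "9000", target], timeout=30)
    except subprocess.TimeoutExpired:
        pass
    finally:
        baseline.terminate()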

As soon as the second ping6 kicks off, the first one sees its latency shoot way up, and then it starts dropping packets. The second ping6 starts spraying dots at you as it detects packet loss, too. Your ssh over IPv6 to the box gets completely disgusting, with huge delays.

Meanwhile, however, your ssh over IPv4 to the same box is just fine, as is plain old ping. Also, your ssh over IPv6 coming from the neighboring box is just fine, too!

Just what is going on here?

At this point in the story when it happened to me, I concluded the rack switch's routing part was seriously broken. It was somehow unable to route packets to this one IPv6 destination in any real quantity. If you sent it enough traffic, whether with the flood ping or just during normal operations, it would fall over.

In real life, I dragged someone from the network side of things in to look at it, and they proudly concluded that the pings were fine. Even though, on an idle network, a good box had a consistent 0.1 msec and the bad box was yielding jittery results of anywhere from 1.0 to 10 msec, they said "1 msec is fine" (or somesuch). That was their argument, and they completely missed my point.

I was now good and pissed off at having spent all this effort to find a real problem, only to have it dismissed as "no big deal". I decided to run a scan of the entire brand new environment to look for other hosts which were also having these problems.

That night, I "rage coded" something really dirty that would start up a ping6 to get a baseline, then it would start the antagonist (huge flood pings) and would measure again. If it looked substantially different, it flagged the host for manual review. I let it run on the whole environment and found a non-trivial number of single-host-per-rack anomalies all over the place.
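
Something in that spirit might look like the sketch below: get a baseline ping6 average, start the jumbo flood ping as the antagonist, measure again, and flag the host if the numbers diverge too much. The host list file and the ratio threshold here are made up for illustration, and the ping6 flags are the iputils ones again.

    #!/usr/bin/env python3
    # Sketch of the scanner: baseline the v6 latency to a host, hammer it
    # with a jumbo flood ping, measure again, and flag it if the two look
    # very different. The hosts file and threshold are illustrative only.
    import re
    import subprocess

    def avg_rtt_ms(host, count=10):
        """Run a quiet ping6 and pull the average RTT from the summary."""
        out = subprocess.run(
            ["ping6", "-q", "-c", str(count), "-i", "0.2", host],
            capture_output=True, text=True).stdout
        m = re.search(r"= [\d.]+/([\d.]+)/", out)  # min/avg/max/mdev
        return float(m.group(1)) if m else None

    def looks_broken(host, ratio=5.0):
        baseline = avg_rtt_ms(host)
        if baseline is None:
            return True                  # total loss is its own red flag
        # Antagonist: jumbo flood ping in the background (needs root).
        antagonist = subprocess.Popen(
            ["ping6", "-f", "-s", "9000", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        try:
            loaded = avg_rtt_ms(host)
        finally:
            antagonist.terminate()
        return loaded is None or loaded > baseline * ratio

    for host in open("all_hosts.txt"):   # one hostname per line (assumed)
        host = host.strip()
        if host and looks_broken(host):
            print("flag for manual review:", host)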

Given that it was hosed and nobody was taking me seriously, I just pulled the plug on the whole (brand new) environment. It had only just gone online that evening, and could stay offline another day or two until someone actually figured out what was going on.

That is exactly what happened. I sat on that disable from Thursday until Tuesday, when someone finally cracked the problem. It took a lot of escalating and poking of manager types to get a fresh set of eyes on the problem from someone who would not discount me out of hand.

That's when they noticed something I could not see from my vantage point as an end user of the network: the routers, for some reason, did not have entries in their forwarding tables for the IPv6 addresses of the slowpoke hosts. Every other host in a rack would be there as you would expect, but for some reason a couple of entries would get corrupted or otherwise disappear.

As they explained to me later, when this happened, the usual "fast path" forwarding through the router broke down. The switch would have to look at the packet with its general-purpose CPU, which would figure out where to send it, and the packet would go on its way. Trouble is, there's no way that path could handle line-speed routing of these packets. It was supposed to populate the forwarding table, at which point the ASICs would handle things on the fast path, but that never happened.

When we sent enough traffic, the CPU simply couldn't keep up, and the packets would be dropped. This would happen naturally as the machines received normal traffic, and so it introduced our problem.

Short term, the solution was to poke the switches and flush the forwarding tables used by the ASICs. This got them to re-establish everything, including the one for the trouble spots. They all went back to operating at line speed for all traffic, inside and out.

The next step was to set something up to keep tabs on the syslog on these routing switches. Apparently, when the forwarding table had this issue, it would emit some kind of warning in the log. By watching for it, you could automate a flush of the cache to get things going again.
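
A watcher like that doesn't have to be much. Here's a rough sketch of the shape of it; the log path, the warning string, and the flush command are all placeholders, since the real ones were specific to that vendor's gear.

    #!/usr/bin/env python3
    # Sketch of the stopgap watcher: follow the switch syslog and kick off
    # a forwarding-table flush whenever the warning shows up. The warning
    # text and the flush command are placeholders, not the real strings.
    import subprocess

    LOGFILE = "/var/log/switch-syslog.log"        # assumed log location
    WARNING = "FIB entry corrupted"               # placeholder pattern
    FLUSH_CMD = ["/usr/local/bin/flush-v6-fib"]   # placeholder command

    def follow(path):
        """Yield lines as they're appended, tail -f style."""
        proc = subprocess.Popen(["tail", "-F", path],
                                stdout=subprocess.PIPE, text=True)
        yield from proc.stdout

    for line in follow(LOGFILE):
        if WARNING in line:
            print("saw forwarding-table warning, flushing:", line.strip())
            subprocess.run(FLUSH_CMD)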

This was majorly crappy, but it did get the environment back on its feet, and it could be re-enabled after several days of sitting there doing exactly nothing (and wasting tons of money, no doubt).

The long term solution did involve yelling at the vendor and getting them to fix their corruption issue.

Lessons learned:

Pay attention to ssh lag, even if it seems minor, particularly if it's jittery and only shows up on the bad box and not on the others.

Think through the paths involved for the traffic. If some paths are slow and others are not, it's probably something to do with the paths and not the end device.

Find ways to try multiple paths to the same end device to prove out the last item. Otherwise, how can you be sure?

Don't accept the status quo when you report a problem and they blow you off. If you're sure it's there, don't doubt yourself. If you've practiced solid scientific methods and have a decent argument backed by data, stay on that thing until someone has to deal with you (and it).

Finally, when the last item stops working, and you start getting in trouble for it, leave the company behind.