Writing

Software, technology, sysadmin war stories, and more.

Tuesday, March 22, 2016

HTTP/HTTPS not working inside your VM? Wait for it.

Have you ever run into a problem, done some troubleshooting, but decided it only affected yourself, and shelved it for a while? I did that a couple of months ago with a particular problem which forced its way back into my life this morning.

The problem is simple enough: parts of the web are going IPv6-only, and I couldn't get to them from inside my VMware Fusion virtual machine. What was really strange is that ping6 worked, and traceroute6 worked, and even connecting to the port by hand with netcat worked. It was things like Firefox and curl that would fail... or a request piped into netcat.

Yes, now you should be wondering what was going on there. Running "nc -6 rachelbythebay.com 80" and typing out "HEAD / HTTP/1.0 [enter] [enter]" would work fine. But, rigging something to echo that and piping it into the same nc command would fail!

Not only would it fail, but it would fail in a most peculiar manner: it would just hang. Sniffing the traffic on the VM, on the host, and on my web server didn't really help me figure it out, either.

Eventually, I somehow realized that it was the delay. If I typed it in myself, there was a small but nonzero interval between the TCP connection opening and my request going out. If, however, I piped it in, netcat and the kernel would fire it down the pipe as soon as they could.

Firefox, curl, and just about everything else intended to speak over the web also has this situation: open a socket, fire a request down it. That's how HTTP works.
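To make the difference concrete, here's a rough sketch of the two cases as shell functions. The function names and the HOST variable are just placeholders for illustration; the request bytes and the nc invocation are the ones described above.

```shell
#!/bin/sh
# Illustrative sketch only. HOST and the function names are placeholders.
HOST="${HOST:-rachelbythebay.com}"

# The same bytes you would type by hand: a HEAD request and a blank line.
head_request() {
    printf 'HEAD / HTTP/1.0\r\n\r\n'
}

# The request is ready the instant the connection opens.
# This is the shape of the case that hung over IPv6.
send_immediately() {
    head_request | nc -6 "$HOST" 80
}

# The request is held back for $1 seconds first,
# roughly what happens when a human types it in.
send_after_pause() {
    (sleep "$1"; head_request) | nc -6 "$HOST" 80
}
```

Calling send_immediately behaves like curl or Firefox would, while "send_after_pause 1" stands in for the typed-by-hand case.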

Around this time I also realized that ssh over IPv6 was working just fine. I chalked this up to the fact that ssh clients remain silent until the server pushes back a banner, and only then do they start handshaking. This adds a nice delay, and apparently it's usually enough to get past whatever the "danger zone" was.

That's where I left it until this morning. I figured VMware would come along and patch it eventually. They couldn't possibly miss the fact that trying to do real work with TCP over v6 from their VM hangs every single time, could they?

They could, and they did. Someone else in the world reported the problem back in September, and aside from some random person asking a totally useless question, nothing had happened on the thread.

I didn't know any of this until it came back into my life, though. Certain web servers have been going IPv6-only of late, and I'd been kludging around it in my own twisted way, but now it was starting to affect other people who were also using VMs. A friend remembered my mention of "it works with a delay" and pulled me in, and I decided to take this through to a fulfilling conclusion.

The first thing to do was to quantify the exact nature of the situation. I knew it needed some delay, but how much? I decided the easiest way was to pair up sleep and printf in a subshell, piped into netcat. I'd start with a large delay and would dial it back until it stopped working.

(sleep 1; printf 'HEAD / HTTP/1.0\r\n\r\n') | nc -6 rachelbythebay.com 80 | head -1

I ran that. I got "HTTP/1.1 200 OK" back, as expected. I dialed it back to 0.5 -- half a second. That worked, too. Then I went half of that, at 0.25. That also worked. Rather than binary-searching it down, I jumped to a natural value a human might use: .1 -- 100 milliseconds. That actually worked.

I backed it off a little bit to .095 -- 95 milliseconds. This worked, too, so I backed off 5 more milliseconds to .09, and that's when it got stuck. At 90 milliseconds of delay from userspace, it doesn't work. At 95, it will. I figure that's about 100 milliseconds with some slop for round-trip times and whatnot in the TCP session.

Just to be sure, I went further down, to 25 milliseconds and below, and none of them worked. They'd all hang.
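If you'd rather not retype the command for every value, the dial-it-back process could be scripted along these lines. This is only a sketch: the probe and sweep names, the HOST variable, and the 5-second cutoff are my own assumptions, and it leans on GNU sleep (for fractional seconds) and GNU timeout (so a hang doesn't block forever).

```shell
#!/bin/sh
# Sketch of the manual sweep. HOST, the 5 second cutoff, and the
# function names are assumptions made for illustration.
HOST="${HOST:-rachelbythebay.com}"

# probe DELAY FLAG: wait DELAY seconds, send a HEAD request, and print
# the first line of the reply. FLAG is -6 or -4 to pick the IP version.
probe() {
    (sleep "$1"; printf 'HEAD / HTTP/1.0\r\n\r\n') \
        | timeout 5 nc "$2" "$HOST" 80 2>/dev/null \
        | head -1
}

# sweep DELAY...: try each delay over IPv6 and report whether an HTTP
# status line came back before the cutoff.
sweep() {
    for delay in "$@"; do
        if probe "$delay" -6 | grep -q '^HTTP/'; then
            echo "$delay ok"
        else
            echo "$delay hang"
        fi
    done
}
```

Something like "sweep 1 0.5 0.25 0.1 0.095 0.09 0.025" walks through the same values I tried by hand, and "probe 0.025 -4" is the IPv4 comparison.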

Then, just to prove my point, I kept the tiny sleep and flipped it to be "-4" to force an old-school (historic, even) IPv4 connection. It worked just fine, of course.

I grabbed a screenshot of all of this because nobody would believe this otherwise.

[Screenshot: How goofy is this?]

In relaying this story to my friends who had brought me into the discussion, I suddenly had a very bad idea: what if I purposely delay all traffic leaving my machine by 100 milliseconds? Would it work?

I'll let this screenshot answer that question for you.

[Screenshot: Delay 100ms and it works...!]

That's a big yes. A huge yes. I went into Firefox and loaded test-ipv6.com. For the first time ever, it was able to pass all of the checks when accessed from inside my VM.

I wish I was making this up.

For those of you similarly afflicted, here's that command so you don't have to transcribe it from my screenshot:

tc qdisc add dev eth0 root netem delay 100ms

Bonus points if you can figure out enough of the 'tc' syntax to only match outgoing TCP IPv6 connections and not touch everything else that's leaving your VM. I didn't care about that since all I needed was a simple proof of concept, and that did it.
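For anyone chasing those bonus points, one possible shape is a prio qdisc with netem hung off a single band, plus a u32 filter that steers only IPv6 TCP into that band. Treat this as an untested sketch rather than a known fix: the function name, the TC variable (set TC=echo for a dry run), and the handle numbers are all made up for illustration, and the filter only looks at the IPv6 next-header field, so packets carrying extension headers would slip past it.

```shell
#!/bin/sh
# Untested sketch: delay only outgoing IPv6 TCP on one interface.
# TC can be set to "echo" for a dry run; names and handles are illustrative.
TC="${TC:-tc}"

delay_v6_tcp() {
    dev="$1"      # interface, e.g. eth0
    delay="$2"    # netem delay, e.g. 100ms

    # Root prio qdisc. The all-zero priomap sends every unfiltered packet
    # to the first band (1:1), so nothing reaches 1:3 by default.
    $TC qdisc add dev "$dev" root handle 1: prio \
        priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 || return 1

    # Hang the netem delay off the third band only.
    $TC qdisc add dev "$dev" parent 1:3 handle 30: netem delay "$delay" || return 1

    # Steer IPv6 packets whose next-header is TCP (6) into that band.
    $TC filter add dev "$dev" parent 1: protocol ipv6 prio 1 \
        u32 match ip6 protocol 6 0xff flowid 1:3 || return 1
}
```

Run as root, that would be "delay_v6_tcp eth0 100ms". If the blanket netem rule from above is already in place, clear it first with "tc qdisc del dev eth0 root".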

Unbelievable.


March 27, 2016: This post has an update.