
Sunday, March 27, 2016

That VMware IPv6 NAT thing is stranger than it looks

Right, so, a couple of days ago I wrote about VMware Fusion, trying to use their IPv6 NAT "feature", and failing miserably when attempting certain types of TCP connections over it. If you haven't read it yet, go check it out first, so this will make more sense.

Since then, there has been a steady trickle of feedback from the usual sources. One comment said that it sounded like a path MTU discovery problem. I'm going to have to disagree with that, since none of the packets are particularly large, and besides, this will repro easily even with no funky ICMP filtering going on. As for whether I might be able to recognize this in the wild, I present my post on that very topic from May 2015 -- IPv6, even.

Two different people commented that there is no "TCPv6", so I patched the original post. No particular need to get people worked up over some shorthand for something longer (that being "TCP, operating over IPv6").

There were questions about whether I was seeing TCP checksum errors. I am not. It's a lot more evil than that, but you'll have to check out my packet traces to see just how far this rabbit hole goes.

Some people didn't get it working with the 100 msec delay. I'm sorry to report that you may have to go higher. I don't have a solid explanation for why yet. While working towards that during some idle time this weekend, I found something particularly shocking which warranted putting out this post first.

One commenter on HN wondered if connect() is returning "before it's done its work". That's actually kind of what's happening, but not because of a nonblocking connect call. It's connecting because, as far as I can tell, VMware is spoofing the connection. Yeah. Telebit called, and they want their TurboPEP back.

Seriously, check this out. I went and installed Fusion 8 Pro on my underused Mac Mini and then installed Ubuntu LTS on that. Then I started trying to connect outward to my dual-stack machine at SoftLayer (the one that's probably feeding you this page). This is what I saw from inside the VM:

16:04:34.801556 IP6 fd15:4ba5:5a2b:1002:79c2:678f:5757:309b.57550 > 2607:f0d0:1101:6e::2.8080: Flags [S], seq 2318396716, win 28800, options [mss 1440,sackOK,TS val 243279 ecr 0,nop,wscale 7], length 0

16:04:34.801801 IP6 2607:f0d0:1101:6e::2.8080 > fd15:4ba5:5a2b:1002:79c2:678f:5757:309b.57550: Flags [S.], seq 2124082490, ack 2318396717, win 64240, options [mss 1460], length 0

16:04:34.801829 IP6 fd15:4ba5:5a2b:1002:79c2:678f:5757:309b.57550 > 2607:f0d0:1101:6e::2.8080: Flags [.], ack 1, win 28800, length 0

Okay, so, yeah, this is a lot of crazy cruft for people who don't speak this language. I will attempt to boil it down to the key pieces.

That fd15:... address is just whatever is being generated by the Fusion NAT setup. It's not the host Mac's actual (routable!) v6 address. It's what the Linux VM thinks it is, though. The 2607:...:2 address is magpie at SoftLayer, aka rachelbythebay.com at the time I write this.

With that in mind, here we go.

At 16:04:34.801556, the test Ubuntu VM sends out the bare SYN to my server's port 8080.

245 microseconds later, at 16:04:34.801801, magpie allegedly responds with its own SYN and ACKs the SYN it supposedly received.

We don't even need to talk about the rest. It's just this simple. magpie is not here at my house. magpie is in Dallas, somewhere, at a SoftLayer colo facility. I'm using a Mac here in Silicon Valley. There is NO WAY you could get a SYN from here to there, respond to it, and get that response back in 245 microseconds.

You know how far light can travel in 245 usec? Just under 75 km. If you don't believe me, go feed this to your favorite solver:

((.801801 seconds) - (.801556 seconds)) * c

75 km, one way, wouldn't even get my packet out of the state, never mind to Texas.
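If units-aware calculators aren't your thing, the same arithmetic fits in a few lines of C:

#include <stdio.h>

int main(void) {
  double delta = 0.801801 - 0.801556;  /* 245 microseconds between SYN and SYN+ACK */
  double c = 299792.458;               /* speed of light in a vacuum, km/s */
  printf("%.1f km\n", delta * c);      /* prints 73.4 */
  return 0;
}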

Therefore, we have a hypothesis: VMware is spoofing the session establishment. With that in mind, can we test and see if it's really happening? Of course we can.

How? Filter (DROP) packets bound for port 8080 on the server. The server will NEVER respond. If we get anything back, it's a dirty lie. Here's a screen shot. You're going to need to click on it to see it full-size to make any sense of it.

Fake connection

I added a rule with ip6tables to drop TCP traffic to port 8080, and then watched with tcpdump to see what would show up. Sure enough, my Mac Mini's IPv6 address sends a whole bunch of SYNs to port 8080, and my server ignores them.

However, down in the Linux VM, it has actually received a SYN+ACK in response to its SYN, ACKed that in turn, and now thinks it has a connection! Yes!

And, no, it doesn't do this on TCP over IPv4. I tried that, too. When you try to connect to a black hole, it doesn't spoof the damn connection.

Okay, one last thing to prove that this is what's going on. I'm going to connect to an IPv6 host that doesn't even exist, and Fusion is going to let me.

Calling [3:1:3:3:7::0] port 9090

Yep. "ss" (think 'netstat' if you're not familiar with it) shows a nice ESTABLISHED connection. As far as Linux knows, it's connected to this elite host with an address I just made up.

This looks less like them doing NAT and more like them doing some kind of SLiRP-ish thing where they see a connection attempt and then make a new one to the same destination and copy data between them. The problem is that apparently you can "race" it, and thus confuse it by speaking "too quickly".
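If you've never looked inside one of these things, the heart of a user-space relay like that is just a copy loop between two sockets: one facing the guest, one facing the real world. Here's a toy version of that loop -- my own sketch of the general pattern, emphatically not VMware's code:

/* Toy relay loop: copy bytes between two already-connected sockets.
 * This is the general shape of a user-space TCP proxy, nothing more. */
#include <sys/select.h>
#include <unistd.h>

void relay(int guest_fd, int world_fd) {
  char buf[4096];
  for (;;) {
    fd_set rd;
    FD_ZERO(&rd);
    FD_SET(guest_fd, &rd);
    FD_SET(world_fd, &rd);
    int max = (guest_fd > world_fd ? guest_fd : world_fd) + 1;
    if (select(max, &rd, NULL, NULL, NULL) < 0) return;

    /* Whatever shows up on one side gets written to the other. */
    if (FD_ISSET(guest_fd, &rd)) {
      ssize_t n = read(guest_fd, buf, sizeof(buf));
      if (n <= 0 || write(world_fd, buf, n) != n) return;
    }
    if (FD_ISSET(world_fd, &rd)) {
      ssize_t n = read(world_fd, buf, sizeof(buf));
      if (n <= 0 || write(guest_fd, buf, n) != n) return;
    }
  }
}

The guest never exchanges a single TCP segment with the far end. The relay answers the handshake locally, then goes off and makes its own connection to the real destination, and the gap between "guest thinks it's connected" and "real connection actually exists" is exactly where a race like this lives.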

If this is in fact what they are doing, I would expect all sorts of other TCP stuff to misbehave. Do they handle TCP OOB data? You know, the whole URG flag thing, perhaps only known for starring in 1997's WinNuke?
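If you want to poke at that corner yourself: once you have a connected socket, sending a byte of urgent data is basically a one-liner with MSG_OOB. A sketch, assuming fd is a connected TCP socket:

#include <sys/socket.h>

/* Send a single byte of TCP "out of band" (urgent) data.
 * Whether the URG pointer survives the trip through the NAT is the question. */
ssize_t send_oob_byte(int fd) {
  char b = '!';
  return send(fd, &b, 1, MSG_OOB);
}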

I'm not going to do their work for them. Figure it out, VMware.