Writing

Software, technology, sysadmin war stories, and more.

Monday, April 9, 2012

VLANs are not always "just like having separate networks"

I once had a project where we needed to put more bandwidth into someone's house since ISDN was no longer getting the job done. It wasn't a matter of the user's demands suddenly changing, since he was still using his systems the same way. Instead, what happened was that some new Windows machine on his network insisted on exchanging a bunch of traffic with the other machines in its domain. This filled up his 128 kbps pipe and lagged his ssh sessions, and this made him angry. He wanted a fatter pipe, so I was called in to help.

Given this was a school district matter, we decided to just do a wireless shot to a nearby school and call it done. We had a pair of Intel access points which could be set up in WLAP mode. That mode allowed us to set one of the APs as the "root node" while any others would join it. It also worked as a bridge, so packets entering one end would find their way out the other. I figured we could use that to build the low-level link between his house and the nearest school. Then, we'd plug the access points into Linux boxes at either end and run OpenVPN for reasonable crypto. The access points only did WEP, and I knew even then that it was insufficient in terms of security. With the tunnel up, the Linux boxes at either end would act as gateways, performing proxy ARP as appropriate. His home machines would think they were plugged straight into the school's Ethernet.

Obviously, we needed to test all of this before anyone started climbing on roofs, mounting antennas, and pulling wire. I had plenty of Linux boxes handy, and we had two access points available, so I figured I would just set up an end-to-end test with all of them. I had someone unbox them and plug them into specific ports on my network, then just walled them off in a VLAN while I set things up.

My plan was simple enough: I'd put Linux box #1 in a VLAN which contained just it and the first access point. Then it would have a wireless shot into the second access point. That second access point would then be in a VLAN with just Linux box #2. Those VLANs would simulate the crossover wires which would be in use in the real installation... or so I thought.

Instead, I ended up creating a royal mess for myself. Here's why.

An Ethernet packet carrying the OpenVPN magic from Linux box #1 goes out through its interface, bound for access point #1. The switch, as part of its normal operation, says "okay, MAC address X just spoke on this port, so let's remember that". This packet makes it to access point #1 and goes out over the air.

Moments later, access point #2 receives the packet on its wireless interface and transmits it on its wired interface. It's still coming from the Ethernet address of Linux box #1. The switch now says "oh, I see, you're over here now", and updates its table to point at the new port.

The same thing happens on the reply packets, as the switch starts seeing both of the Linux boxes flipping around between their actual physical ports and the ones which were getting copies via the wireless link. My mistake was believing that having separate VLANs would wall off the usual forwarding table stuff, and it would be just like having separate switches on separate networks.

Instead, what I got was a royally confusing mess. Of course, it wasn't obvious at first, so I went through a whole bunch of fiddly config settings before running across a warning in the switch manual about VLANs. Paraphrased, it said that a source address wasn't allowed to appear on more than one port. Long story short, you can't test bridging with VLANs on that hardware.
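The flapping described above can be sketched as a toy model. This is only an illustration, and it assumes the switch learned source addresses into one table shared by every VLAN, which is one plausible reading of that manual warning rather than something confirmed for that hardware. The port numbers, VLAN IDs, MAC addresses, and the `ToySwitch` class are all made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical MAC addresses for the two Linux boxes.
BOX1 = "00:00:00:00:00:01"
BOX2 = "00:00:00:00:00:02"


@dataclass
class ToySwitch:
    """Toy model of MAC learning; not any real switch's firmware."""
    per_vlan_tables: bool
    table: dict = field(default_factory=dict)
    moves: int = 0  # times a known address "jumped" to another port

    def learn(self, vlan: int, port: int, src_mac: str) -> None:
        # Independent learning keys on (vlan, mac); a shared table keys on mac alone.
        key = (vlan, src_mac) if self.per_vlan_tables else src_mac
        old_port = self.table.get(key)
        if old_port is not None and old_port != port:
            self.moves += 1
        self.table[key] = port


def run_test_bench(switch: ToySwitch, round_trips: int) -> int:
    # Assumed layout: VLAN 10 = box 1 (port 1) + AP 1 (port 2),
    #                 VLAN 20 = AP 2 (port 3) + box 2 (port 4).
    for _ in range(round_trips):
        # Request: box 1's address is seen on its own port, then again
        # on AP 2's port after crossing the wireless bridge.
        switch.learn(vlan=10, port=1, src_mac=BOX1)
        switch.learn(vlan=20, port=3, src_mac=BOX1)
        # Reply: box 2's address is seen on its own port, then on AP 1's port.
        switch.learn(vlan=20, port=4, src_mac=BOX2)
        switch.learn(vlan=10, port=2, src_mac=BOX2)
    return switch.moves


if __name__ == "__main__":
    print("shared table:", run_test_bench(ToySwitch(per_vlan_tables=False), 3))
    print("per-VLAN tables:", run_test_bench(ToySwitch(per_vlan_tables=True), 3))
```

With three round trips, the shared-table switch records 10 address moves while the per-VLAN one records none. The second case is what "just like having separate networks" would have looked like.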

I got someone to move one of the access points to a separate network (broadcast domain) on the other side of a router so they wouldn't see each other, and tried it again. This time, it worked.

Later, when everything was installed out in the field, it went well. We had successfully upgraded the network link to deal with the symptoms rather than having the "network engineers" try to understand why their Windows machines were being so chatty.

Given that nothing ever happened with addressing the root cause, I'm sure they promptly ran that link out of bandwidth and had to really get creative. Fortunately, by that point, I was beyond the "blast radius" and it was no longer my problem.

Using the right tech for the wrong reasons? Been there, done that.