What it takes to log network metadata
Let's say it's 2001 and you want to gather some data for the sake of geolocation. Specifically, you want to take the 48-bit MAC addresses of wireless access points and associate them with GPS coordinates. Then, later, you could take reports of "I'm hearing these networks" and work backwards to guess the user's location.
What would you need to do this? Well, first, obviously, you need a GPS receiver with a way to connect to your computer. This is no big deal, since models with serial ports have existed for a few years. Next, you need to decode what it's saying to you. Again, this is no big deal, because it's almost certainly using something called NMEA, and that's documented. You just pay attention to a few "sentences" which include your location and cache the latest value.
So now you know where you are, but which networks are around? Well, for that, you need to listen to the raw packets flying around. You can't just run "tcpdump" on your interface, since that's only going to look at the higher level pseudo-Ethernet world. Instead, you need a Lucent (then) or Orinoco (later) card which can be put into monitor mode. You also need to patch your driver to allow this sort of thing.
Switching this on gives you a firehose of data, assuming you know how to get to it. Fortunately, it's simple enough: you open a raw socket of type ETH_P_80211_RAW and just read() from it. You'll start getting packets.
The frames themselves aren't that difficult. The first byte tells you if it's a beacon, probe request, probe response, or data. A few bytes past that, you start getting MAC addresses: destination, source, and finally the network ID, which is what you actually want. You might throw away the meaningless ones such as those from unassociated clients.
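Here's a sketch of that parsing in Python. It assumes the buffer starts right at the 802.11 MAC header (some drivers stick their own capture header in front, which you'd have to skip first), and it only trusts the destination/source/network ordering for management frames, since data frames shuffle the addresses around depending on the ToDS/FromDS bits. The function names are made up for illustration.

```python
# (type, subtype) pairs from the first byte of the frame control field.
FRAME_KINDS = {
    (0, 4): "probe request",
    (0, 5): "probe response",
    (0, 8): "beacon",
}

BROADCAST = "ff:ff:ff:ff:ff:ff"

def format_mac(raw: bytes) -> str:
    """Render six raw bytes as the usual colon-separated hex string."""
    return ":".join(f"{b:02x}" for b in raw)

def parse_header(frame: bytes):
    """Return (kind, destination, source, bssid), or None if too short.

    The address ordering is only reliable for management frames.
    """
    if len(frame) < 22:  # 2 frame control + 2 duration + 3 * 6 addresses
        return None
    first = frame[0]
    ftype = (first >> 2) & 0x3
    subtype = (first >> 4) & 0xF
    if ftype == 2:
        kind = "data"
    else:
        kind = FRAME_KINDS.get((ftype, subtype), "other")
    destination = format_mac(frame[4:10])
    source = format_mac(frame[10:16])
    bssid = format_mac(frame[16:22])
    return kind, destination, source, bssid

def network_id(frame: bytes):
    """Return the BSSID for frames sent by a network, else None."""
    parsed = parse_header(frame)
    if parsed is None:
        return None
    kind, _, _, bssid = parsed
    if kind not in ("beacon", "probe response"):
        return None  # e.g. probe requests from unassociated clients
    if bssid == BROADCAST:
        return None
    return bssid
```

Only looking at beacons and probe responses is a deliberate simplification here: those come from the access point itself, so the network ID in them is the one you want.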
Assuming you can keep up with both the GPS sentences and wireless packet spew, then you're in business. You can sock away a tuple of (network, lat, long) and go on with life. You might have to make it slightly more interesting down the road if you run into duplicates (very powerful networks? moving transmitters?), but you can "fix that in post".
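The bookkeeping for this can be tiny. Here's a minimal sketch, with hypothetical names, which keeps the latest GPS fix and appends a hit every time a network is heard. Duplicates are kept on purpose, and the "fix that in post" step shown here is just a naive average of the sightings, which is an assumption on my part and ignores things like longitude wraparound.

```python
from collections import defaultdict

class HitLog:
    """Collect (network, lat, long) sightings using the latest GPS fix."""

    def __init__(self):
        self.fix = None                 # most recent (lat, lon), if any
        self.hits = defaultdict(list)   # network ID -> list of (lat, lon)

    def update_fix(self, lat: float, lon: float) -> None:
        self.fix = (lat, lon)

    def saw_network(self, bssid: str) -> bool:
        """Record a sighting; returns False if there is no position yet."""
        if self.fix is None:
            return False
        self.hits[bssid].append(self.fix)
        return True

    def centroid(self, bssid: str):
        """Naive average of every sighting of one network, or None."""
        points = self.hits.get(bssid)
        if not points:
            return None
        lat = sum(p[0] for p in points) / len(points)
        lon = sum(p[1] for p in points) / len(points)
        return lat, lon
```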
About the only thing left is to make sure you change the channel on your card. Not all networks are going to be on channel 6, so you need to cycle through to see what else is out there. Running iwconfig or similar in a loop can do this.
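A channel hopper along those lines might look like this in Python. The interface name, the channel list (1 through 11), and the dwell time are all assumptions for the sake of the example; it just shells out to "iwconfig <interface> channel <n>" over and over.

```python
import itertools
import subprocess
import time

def channel_command(interface: str, channel: int) -> list:
    """Build the iwconfig command line that tunes to one channel."""
    return ["iwconfig", interface, "channel", str(channel)]

def hop_channels(interface, channels=range(1, 12), dwell=0.25,
                 hops=None, run=subprocess.run, sleep=time.sleep):
    """Cycle through the channels, pausing on each one for a while.

    hops=None loops forever; run and sleep can be swapped out for testing.
    """
    sequence = itertools.cycle(channels)
    if hops is not None:
        sequence = itertools.islice(sequence, hops)
    for channel in sequence:
        run(channel_command(interface, channel), check=False)
        sleep(dwell)

# Example (needs root and a real wireless interface, here assumed "eth1"):
# hop_channels("eth1")
```

The dwell time is a tradeoff: too short and you miss beacons on a channel, too long and you miss networks on the other channels as you move past them.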
It's really not that difficult. I did it as a proof of concept thing back in 2001. Back in those days, you could crank it up in lots of different locations (houses, businesses, etc.) and nothing would happen because most people hadn't discovered wireless yet.
That's it. That's all it takes to get the data you need to do this sort of thing. If you go into it saying you want location vs. network identifier data, then you write something which just gives you that and store it somewhere.
Fortunately, the metadata itself is relatively small. If you figure that a network ID is 48 bits, that's just 6 bytes. Then you have latitude and longitude values. I didn't get a whole lot of precision (and didn't really care too much), so my GPS listener would print things like "29 12 34 N 98 45 56 W" - degrees, minutes, and seconds. Even if you chose to store that as text for some reason, it's still only 22 characters at its longest ("89 59 59 N 179 59 59 W"). Packing that down into some kind of binary encoding would shrink it even more.
No matter how you slice it, a single "hit" for a network is still only at most 28 bytes so far. Maybe you add the time and date that you saw it. Okay, we can express that as a nice 32 bit time_t until 2038, so add four more bytes. Now we're at 32 bytes and that seems like a good place to stop.
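For comparison, here's a sketch of one possible binary encoding in Python. The layout is my own assumption for illustration: the 6-byte network ID, latitude and longitude as signed 32-bit counts of arc-seconds (matching the degrees/minutes/seconds precision above), and a 32-bit timestamp. That comes to 18 bytes per hit instead of 32.

```python
import struct

# Text-style record: 6-byte MAC + 22 characters of position + 4-byte time_t.
TEXT_RECORD_BYTES = 6 + 22 + 4

# Packed record: MAC, latitude, longitude (arc-seconds), time_t.
# "<" means little-endian with no padding, so this is 6 + 4 + 4 + 4 bytes.
RECORD = struct.Struct("<6siiI")

def dms_to_arcseconds(degrees: int, minutes: int, seconds: int,
                      hemisphere: str) -> int:
    """Turn degrees/minutes/seconds into signed arc-seconds (S/W negative)."""
    total = degrees * 3600 + minutes * 60 + seconds
    return -total if hemisphere in ("S", "W") else total

def pack_hit(bssid: bytes, lat_arcsec: int, lon_arcsec: int,
             seen_at: int) -> bytes:
    """Pack one sighting into the fixed-size binary record."""
    return RECORD.pack(bssid, lat_arcsec, lon_arcsec, seen_at)

def unpack_hit(blob: bytes):
    """Inverse of pack_hit: returns (bssid, lat_arcsec, lon_arcsec, seen_at)."""
    return RECORD.unpack(blob)

lat = dms_to_arcseconds(29, 12, 34, "N")
lon = dms_to_arcseconds(98, 45, 56, "W")
blob = pack_hit(bytes.fromhex("001122334455"), lat, lon, 1000000000)
print(TEXT_RECORD_BYTES, len(blob))
```

Either way, it's tiny. The point isn't the exact layout, just that there's no way to make this kind of record big.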
Even if you track a whole bunch of networks, or include duplicate hits for the same network in an attempt to figure out how "big" it is, it's still just not that much data.
Now, let's say you didn't want just the metadata from those frames, but actually wanted to see everything which went by. You could just write a few more lines of code to deliberately save it to disk. Or, you could just go get some program which already does this and just hang onto all of the output. There can be a lot of raw data flying around, so you'd need to go to some lengths to store that stuff for any period of time.
Remember that even in 2001, 802.11b was rated for 11 Mbps and could probably get 4-6 Mbps of effective throughput in good conditions. Do the math and you can see that 4000000 bits per second * 3600 seconds is 14.4 billion bits, or about 1.8 GB, so you'd probably aim for 2 GB of space for each hour of "recording" you want to do. These days, you'd want far more.
As you can see, the storage needs would be completely different.
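To put rough numbers on that difference, here's a small Python sketch. The 4 Mbps figure and the 32-byte record come from the text above; the rate of one metadata hit per second is purely an assumption to have something to multiply by.

```python
HOUR = 3600  # seconds

def raw_capture_bytes(bits_per_second: int, seconds: int) -> int:
    """Bytes needed to keep every bit that goes by at a given rate."""
    return bits_per_second * seconds // 8

def metadata_bytes(hits_per_second: float, seconds: int,
                   record_size: int = 32) -> int:
    """Bytes needed to keep one fixed-size record per network hit."""
    return int(hits_per_second * seconds * record_size)

raw = raw_capture_bytes(4_000_000, HOUR)   # everything at 4 Mbps
meta = metadata_bytes(1, HOUR)             # assumed: one hit per second

print(f"raw capture: {raw:,} bytes per hour")
print(f"metadata:    {meta:,} bytes per hour")
print(f"ratio:       {raw // meta:,}x")
```

With those numbers, an hour of raw capture is about 1.8 GB while an hour of metadata is a bit over 100 KB, so the two jobs are more than four orders of magnitude apart.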
Given this, you'd think that if such a system actually existed, you could look at the resources it was given and then work backwards to see what the intent was. Small jobs don't need huge storage arrays.
Put another way, if you pulled over a car and saw a bottle of nitrous oxide in there, would you believe the driver's claims that they never race? If that's the case, what's up with all of the extra plumbing? You know they have to be up to something.