Software, technology, sysadmin war stories, and more.
Friday, June 3, 2011

Web proxies and incriminating data for HR

There was a school district which had a Squid proxy set up to be a better net neighbor, reduce the amount of bandwidth needed to the outside world, and ... to keep tabs on what people were doing. For years, the situation was simple: users connect to Squid, Squid connects to the world, and a big disk keeps track. One day, a new filtering device was introduced and all of that changed.

Officially, it was due to some kind of new legislation mandating "Internet filtering" for organizations using Federal money to pay for things. Whatever the actual reason, there was now a turnkey filtering system called an iPrism sitting between the users and our big Squid box. Everything worked just fine, except for the fact that our Squid box now saw exactly one IP address for accesses: that of the iPrism box.

Imagine the fun the first time they wanted to do one of those "HR looks to see whether someone's been naughty" investigations and all they could say was "yep, someone did it, can't tell who". The iPrism box was in no position to further elaborate, since it didn't have the capacity to log all of that stuff for the length of time required. Our Squid box had been purposely designed to have that kind of room, but it didn't have the data. What a conundrum!

So, that afternoon, we started down the road to having yet another Squid box which would sit in front of the iPrism, just pushing all requests at it. This would give us a three-level proxy scheme just so that we had logging, then filtering, then caching. What a mess. Fortunately, someone managed to figure out how to switch on syslog pushing from that iPrism box, and our approach changed.

It turned out that you could make the filter push every single request over the network in syslog format: UDP on port 514. Then you have syslogd on another box just dump it somewhere. Now we were talking, since that got the all-important source IP address and destination URL off the filter and to a place where we could manipulate it.
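For a feel of what arrives on the wire, here is a minimal Python sketch of a receiver for that kind of traffic. It is not what we ran (a stock syslogd did this job); the function names are made up for illustration. The one solid fact in it is the classic BSD syslog priority encoding: each datagram starts with `<PRI>`, where PRI is facility times 8 plus severity, so local4.info shows up as `<166>`.

```python
import re
import socket

# Classic BSD syslog severities, indexed by number.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]

_PRI_RE = re.compile(rb"^<(\d{1,3})>")


def split_pri(datagram):
    """Split a raw syslog datagram into (facility, severity, rest).

    Returns None if the datagram does not start with a <PRI> header.
    """
    m = _PRI_RE.match(datagram)
    if m is None:
        return None
    facility, severity = divmod(int(m.group(1)), 8)
    return facility, severity, datagram[m.end():]


def facility_name(facility):
    """Name the local0..local7 facilities (16..23); others stay numeric."""
    if 16 <= facility <= 23:
        return "local%d" % (facility - 16)
    return str(facility)


def serve(host="0.0.0.0", port=514):
    """Hypothetical receive loop: print every local4.info message.

    Binding to port 514 normally requires root.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, (peer, _) = sock.recvfrom(65535)
        parsed = split_pri(data)
        if parsed is None:
            continue
        facility, severity, rest = parsed
        if facility_name(facility) == "local4" and SEVERITIES[severity] == "info":
            print(peer, rest.decode("utf-8", "replace"))
```

Note that `peer` here is the address of the box sending syslog, which is the filter itself. The client address we cared about has to come out of the message body.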

My receiving machine had some syslogd.conf magic to take local4.info and throw it at a FIFO so all of that raw gunk would not pollute my real logs. On the other end of that FIFO was a quick C program I had hacked up which would parse the incoming lines and reduce them to just the stuff we cared about. This was then written to disk.
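The shape of that reducer is roughly the following, sketched here in Python rather than the original C. The input format is an assumption for illustration only: I am pretending each line carries `src=` and `url=` tokens, which is not necessarily what an iPrism emits. The output format (epoch seconds, client IP, URL, separated by spaces) and the paths are likewise made up.

```python
import re
import time

# ASSUMED input tokens; the real filter's log format may differ.
SRC_RE = re.compile(r"\bsrc=(\d{1,3}(?:\.\d{1,3}){3})\b")
URL_RE = re.compile(r"\burl=(\S+)")


def reduce_line(line, now):
    """Boil one raw syslog line down to 'epoch ip url', or None if unusable."""
    src = SRC_RE.search(line)
    url = URL_RE.search(line)
    if src is None or url is None:
        return None
    return "%d %s %s" % (int(now), src.group(1), url.group(1))


def pump(fifo_path, out_path):
    """Read lines from the FIFO and append the reduced records to a flat file.

    Opening a FIFO for reading blocks until a writer (syslogd) opens it,
    and the loop ends if that writer closes its end.
    """
    with open(fifo_path) as fifo, open(out_path, "a") as out:
        for line in fifo:
            record = reduce_line(line, time.time())
            if record is not None:
                out.write(record + "\n")
                out.flush()  # keep the file current for anyone searching it
```

Stamping each record with the arrival time keeps the later lookup simple: a time window becomes a plain integer comparison.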

Finally, there was a small CGI program which ran under Apache on that same machine which would accept two arguments: IP address and length of time to check. It would then go scour that file and print anything it found. This was deemed sufficient for whatever HR needed, and it gave them self-service access so they would not need to bug us for log access.
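The core of such a lookup fits in a few lines. This Python sketch assumes a hypothetical flat-file layout of one record per line, "epoch-seconds client-IP URL", and leaves out the CGI plumbing entirely; the function name and arguments are mine, not the original program's.

```python
def find_hits(lines, ip, window_seconds, now):
    """Return (timestamp, url) pairs for `ip` seen in the last `window_seconds`.

    `lines` is any iterable of 'epoch ip url' records, such as an open file.
    """
    cutoff = now - window_seconds
    hits = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed records
        stamp, src, url = parts
        # Compare the whole field so 10.1.2.3 does not also match 10.1.2.30,
        # which a naive substring grep would do.
        if src != ip:
            continue
        try:
            when = int(stamp)
        except ValueError:
            continue
        if when >= cutoff:
            hits.append((when, url.rstrip("\n")))
    return hits
```

A caller would do something like `find_hits(open(path), "10.1.2.3", 7 * 86400, time.time())` for "the last week". It is a linear scan of the whole file every time, which is exactly the glorified-grep tradeoff.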

I'd like to say that I invented some lovely schema to store all of this data efficiently and make the most of the disk, but that was not in the cards that day. All of this stuff was needed basically right then, so it was written in a couple of hours and just used boring flat text files. The analysis was little more than a glorified grep.

Optimizing for what our users actually needed (minimum time to usability) meant some tradeoffs, but in the end everyone was happy with it.