Pattern matching: hey, 2147 looks familiar...
I've learned quite a few magic numbers from working with computers. Some of them jump right off the screen and make themselves known, sometimes because they're associated with things which have stopped working, and so they become the values I look for first.
One thing I wound up sharing with a bunch of my coworkers back in my web hosting support days was the number "2147". It was sufficiently unique that you could use it to go on a fishing expedition for certain problem files. This was in the days when large file support was not universal, and many programs would choke when a file reached (2^31)-1 bytes. Here's one of many such instances saved from my chat logs:
(19:19:04) Tech: [Sun Aug 14 19:13:44 2005] [notice] child pid 7211 exit signal File size limit exceeded (25)
(19:19:09) Me: 2.1 GB log limit
(19:19:18) Me: locate access_log | xargs ls -la | grep 2147
(19:19:57) Tech: [root@www root]# locate access_log | xargs ls -la | grep 2147
-rw-r--r-- 1 root root 2147483647 Aug 4 07:35 /home/site/site/log/httpd/access_log
(19:20:02) Tech: you've been doing this way too long
"2147" was just an easy way to look for that magic number without having to remember it with full precision. It also meant not having to scour the whole filesystem with 'find', since the locate database would certainly have picked up a ~2.1 GB file at least once.
The problem in that case was that this version of Apache could not run if it was pointed at any file which had hit that limit. The short-term fix was to just manually rotate that file out of the way, and then the web server would come back up. A medium-term fix was to get some kind of automated log rotation running. Naturally, the long-term fix was to get proper large file support for the entire system stack, thus postponing the problem for a very long time.
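The medium-term fix is the sort of thing a stock logrotate stanza covers. Something along these lines, using the sanitized path from the chat log above, would have done it; I don't remember exactly what ended up on that box.

/home/site/site/log/httpd/access_log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    postrotate
        /usr/sbin/apachectl graceful > /dev/null 2>&1 || true
    endscript
}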
This sort of thing happened a lot. Here's another example. A different person pinged me about a ticket that was "killing him". I jumped onto the machine and found the site was behaving pretty slowly. It wasn't CPU load or disk I/O or anything like that, and the primary page was just static HTML. That's when I found this (sanitized, obviously):
write(119, "x.x.x.x - - [26/May/2006:12:47:06 -0400] \"GET /XxxxXxxxxxXxx.xx_ HTTP/1.1\" 404 293 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; {__(removed guid)__}; .NET CLR 1.1.4322)\"\n", 206) = -1 EFBIG (File too large)
Apache was failing to write a log entry and the children were dying. It kept respawning them, but all of this churn added a bunch of delays to incoming hits. I traced down fd 119, and sure enough, there's the magic number again:
# ls -la /var/log/httpd/access_log
-rw-r--r-- 1 root root 2147483647 May 26 04:48 /var/log/httpd/access_log
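Chasing a descriptor number back to an actual file is just a matter of looking under /proc, or asking lsof. The pid below is a placeholder, since I no longer have the real one from that box.

ls -l /proc/12345/fd/119      # the symlink target names the file behind fd 119
lsof -a -p 12345 -d 119       # same answer via lsof, if it's installed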
The good news was that only one file had that kind of cruft in it. The interesting thing was that the slow vhost wasn't the one logging to it: this was traffic arriving at the default vhost on the machine, and there was tons of it coming in from all over the place.
I mentioned this to my friend and he said he was talking to the customer, and that they didn't know what that URL (the one I replaced with XxxxXxx...) was supposed to be. It certainly wasn't part of his site. This got me wondering when this all started, so I backed up in the log files and found this:
x.x.x.x - - [18/May/2006:13:53:19 -0400] "GET /(__same_path__) HTTP/1.1" 404 293 "-" "Microsoft Internet Explorer"
On that day at that time, a bunch of hosts just started hitting that same URL out of the blue. I dug around in this customer's ticket history and found that on May 18th, we added an additional IP address to the server. Looking at the time stamps on the ticket, I could see that it was added between 12:51 PM and 1:00 PM central time, or 1:51 PM and 2:00 PM eastern time, which was the time zone the server's clock was set to.
The hits started at 1:53 PM eastern time. That made it a pretty safe bet that all of this was latent traffic which had been trying to hit that IP address all along, and now that it had a real machine behind it, it would start showing up in logs.
I did some more digging with tcpdump on the incoming traffic and found that a certain domain name with "coupon" in it was pointed at that IP address, probably left over from some previous customer. My guess was that whoever it was had been hosting spyware in that same IP space and had left thousands or millions of infections, err, sorry, installations out there, all trying to phone home.
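The tcpdump pass was nothing fancy. Something roughly like this, with the interface and address obviously changed, is enough to watch the Host: headers on the incoming requests.

# eth0 and x.x.x.x are placeholders for the real interface and the tainted address
tcpdump -nn -A -s 0 -i eth0 'dst host x.x.x.x and tcp port 80' | grep -i 'host:'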
We didn't control the DNS for this mispointed domain, so there was no way to deal with that particular problem. My advice was to mark that IP address tainted, give our customer a second one, and apologize for the mess.
Any other route would have been suboptimal. Setting up a vhost to purposely match this traffic and then bitbucket the log data would still mean he was paying for it in terms of CPU time, overall web server load, and, yes, outgoing bandwidth. That would have been a technical answer, but we could do better than that.
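For the record, that technical answer would have looked something like this: a vhost bound to the tainted address which serves nothing and throws the log data away. The domain name and the empty DocumentRoot are placeholders, and it still leaves the box doing work for somebody else's mess.

<VirtualHost x.x.x.x:80>
    # placeholder for the mispointed "coupon" domain
    ServerName coupon-domain.example.com
    # serve an empty directory and log none of it
    DocumentRoot /var/www/empty
    CustomLog /dev/null common
    ErrorLog /dev/null
</VirtualHost>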
A lot of being effective in that line of work is recognizing patterns. Many "new problems" are just variants of other well-known issues.