Writing

Software, technology, sysadmin war stories, and more.

Sunday, April 14, 2013

Broken crawler behavior with my binary protofeed file

I detected a disturbing uptick in the number of 404s coming from a certain big web indexing robot in recent days. It was completely nonsensical stuff like this:

"GET /w/2013/03/31/snark/filesystem.png"><img HTTP/1.1"

Got that? It's actually picking up a quotation mark, a greater-than, and then a less-than, and the beginning of an "img src"!

I finally figured out what was going on this morning. They've been fetching my protofeed file and parsing it as if it were HTML! Yes, my half-baked protobuf-based feed file from last month has been linked a few times, and it started being indexed. Then, for some reason, they decided the blobs of text within that binary protobuf were indexable, and went to it. The result is that mess above.
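To make the failure mode concrete, here is a minimal Python sketch of how a sloppy link extractor could turn binary data into a request like the one above. It is a guess at the general mechanism, not a description of what the crawler actually does. The blob is invented for illustration: the bytes around the markup are made up and are not a valid encoding of my feed, and `sloppy_links` is a hypothetical helper. The one real fact it leans on is that protobuf stores string fields as raw bytes, so any HTML in a post body sits in the file verbatim.

```python
import re

# Invented stand-in for a slice of the binary feed. Protobuf keeps string
# fields as raw bytes, so post HTML appears verbatim between non-text bytes.
# The surrounding bytes here are made up and do not encode a real message.
blob = (
    b"\x0a\x42\x12\x3e"
    b'<a href="/w/2013/03/31/snark/filesystem.png"><img src="x.png"></a>'
    b"\x18\x01"
)


def sloppy_links(data: bytes) -> list[str]:
    """Hypothetical extractor: take whatever follows href=" up to whitespace."""
    # latin-1 maps every byte to a character, so decoding never fails,
    # which is exactly how binary junk can get treated as text.
    text = data.decode("latin-1")
    return re.findall(r'href="(\S+)', text)


print(sloppy_links(blob))
```

Because the pattern stops at whitespace rather than at the closing quote, the extracted "link" keeps the quotation mark, the greater-than, and the start of the img tag, which matches the shape of the bogus path in the log line.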

I will note that I have been serving it as "text/plain" for lack of a better MIME type. It's definitely not going out as "text/html", in other words.

For now, I've "solved" it by blocking this file in robots.txt. Let this be a warning to anyone who links to binary data from their web pages. If you have something resembling HTML in that binary blob, a crawler might start following the "links" it finds in there, and this is probably not what you want.
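For anyone wanting to do the same, here is a small sketch of what such a rule might look like, checked with Python's standard `urllib.robotparser`. The path `/w/feed.pb` and the bot name `ExampleBot` are placeholders for illustration; substitute the real location of your own binary file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. /w/feed.pb is a made-up path standing
# in for wherever the binary file actually lives.
rules = """\
User-agent: *
Disallow: /w/feed.pb
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The binary file should be off limits, while ordinary posts stay crawlable.
feed_allowed = parser.can_fetch("ExampleBot", "/w/feed.pb")
post_allowed = parser.can_fetch("ExampleBot", "/w/2013/03/31/snark/")

print(feed_allowed, post_allowed)
```

This only confirms that the rule says what you meant it to say. Whether a given robot honors robots.txt is up to the robot.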