
Saturday, April 20, 2013

More broken web robots

I see a lot of bad web robots.

If you use "zite", also known as "woriobot" (whatever those are) and don't get images for my posts, it's because their robot is badly broken. Check this out.

"GET /dialup1.jpg HTTP/1.1" 404
"GET /dialup2.jpg HTTP/1.1" 404
"GET /tc.jpg HTTP/1.1" 404

Those three images are part of my terminal server post from Wednesday. When that post is retrieved directly (that is, at /w/2013/04/17/slow/), it refers to those images using relative paths. They are just "dialup1.jpg", with no dots, slashes, colons, hostnames, ports, protocols, or anything of the sort.

A relative path like that resolves against the page's base URL. Since I don't override that setting in my pages, the base is just the path they fetched, so the image lives at that path plus "dialup1.jpg". Easy. You'd think this fundamental tenet of the web would be well understood by now, but apparently it is not.
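To make the resolution rule concrete, here's a minimal sketch using Python's standard library. The hostname is a placeholder, not my actual server; the path is the one from the logs above.

from urllib.parse import urljoin

# The page the crawler just fetched (placeholder hostname).
base = "http://example.com/w/2013/04/17/slow/"

# A bare relative reference resolves against that base. No dots, slashes,
# or hostnames needed: the result is the base path plus the filename.
print(urljoin(base, "dialup1.jpg"))
# http://example.com/w/2013/04/17/slow/dialup1.jpg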

Now, let's say they're actually crawling my Atom feed. That feed purposely spells everything out in long form: protocol, hostname, path, filename. This has been the case since September, when I declared that my "protocol-relative URL experiment" was over.
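For comparison, here's a quick sketch of those other two styles of reference, again with a placeholder hostname: a protocol-relative reference borrows only the scheme from the base, and a fully spelled-out reference ignores the base entirely, so there's nothing left to resolve and nothing to get wrong.

from urllib.parse import urljoin

base = "http://example.com/w/atom.xml"  # placeholder feed URL

# Protocol-relative: inherits just the scheme ("http") from the base.
print(urljoin(base, "//example.com/w/2013/04/17/slow/dialup1.jpg"))
# http://example.com/w/2013/04/17/slow/dialup1.jpg

# Fully qualified: the base doesn't matter at all.
print(urljoin(base, "http://example.com/w/2013/04/17/slow/dialup1.jpg"))
# http://example.com/w/2013/04/17/slow/dialup1.jpg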

They aren't alone in their brokenness, though. There's another one from "Sosospider" which has its own flavor of insanity:

"GET /w/2013/04/02/maps/img_0725.png HTTP/1.1" 404
"GET /w/2013/04/02/maps/img_0730.png HTTP/1.1" 404

These are from my first post this month about bad Apple maps. The problem is that the actual files are *uppercase*: they kept the filenames that came straight out of my old iPhone's screenshot facility. The file isn't called img_0725.png. It's IMG_0725.PNG.

I'm serving up HTML with the proper filenames. Nearly everyone manages to get this right. These guys, however, squash the case and so miss out. Who thought that was a good idea? Wouldn't it take more code to squash case on purpose, let alone do it properly? Getting uppercase and lowercase right is just like date handling: both are hard problems.
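As a rough sketch of the failure mode (not their actual code, and the hostname is a placeholder): the path part of a URL is case-sensitive, so lowercasing it asks for a file that doesn't exist. And doing the lowercasing "properly" is its own rabbit hole, since Unicode case mapping can even change the length of a string.

from urllib.parse import urlsplit

url = "http://example.com/w/2013/04/02/maps/IMG_0725.PNG"  # placeholder hostname

# What's actually published: the path, case preserved.
print(urlsplit(url).path)          # /w/2013/04/02/maps/IMG_0725.PNG

# What a case-squashing crawler requests instead: a file that isn't there.
print(urlsplit(url).path.lower())  # /w/2013/04/02/maps/img_0725.png -> 404

# "Proper" case handling is hard: the dotted capital I lowercases to two
# code points, and in Turkish the plain capital I doesn't even map to "i".
print(len("İ"), len("İ".lower()))  # 1 2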

This is all on top of the well-known Java crawler stupidity, where someone decided that fetching and parsing the URLs targeted by SCRIPT tags is a good idea, even though those files are not even HTML.

Welcome to the web, where any idiot can program for it, and probably does.