Enough with the broken "Java/x.y.z_nn" crawlers
I watch my web logs a lot. It's a good way to get inspired by seemingly random events. Seeing some bit of insanity arrive from the outside world can lead to a concept for a post. This is one such post.
For quite some time now, I've been seeing boneheaded attempts at crawling sites from clients which send a truly generic user-agent string like "Java/1.6.0_33" or similar. The version changes, but the "Java/" part and the stupidity remain the same. They are easiest to spot by how they follow things which aren't even links and get seriously confused by things like JavaScript.
Allow me to demonstrate.
xx.xx.xx.xx - - [12/Aug/2012:12:54:31 -0700] "GET /contact/ HTTP/1.1" 200 1575 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"
Here, this robot asks for my contact page. Okay, big deal, that happens all day long. However, what happens next is a little wacky.
xx.xx.xx.xx - - [12/Aug/2012:12:54:34 -0700] "GET /contact/jquery-1.7.1/jquery.min.js HTTP/1.1" 200 93868 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"
Okay, so now it's gone and requested something which only occurs in a SCRIPT tag in the HEAD part of that page. This is usually the sort of thing a real web browser would do. However, unlike a browser, this thing never pulls my CSS, despite having encountered those declarations earlier in the file.
What comes next, however, reveals a kind of cluelessness I seldom see outside of these idiots:
xx.xx.xx.xx - - [12/Aug/2012:12:54:35 -0700] "GET /contact/jquery-1.7.1/,data:c,complete:function(a,b,c){c=a.responseText,a.isResolved()&&(a.done(function(a){c=a}),i.html(g?f( HTTP/1.1" 404 411 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"
xx.xx.xx.xx - - [12/Aug/2012:12:54:35 -0700] "GET /contact/jquery-1.7.1/]},bh=U(c);bg.optgroup=bg.option,bg.tbody=bg.tfoot= bg.colgroup=bg.caption=bg.thead,bg.th=bg.td,f.support.htmlSerialize||(bg. _default=[1, HTTP/1.1" 404 439 "-" "Java/1.7.0_05" "rachelbythebay.com" "-"
In two nearly simultaneous hits, this robot manages to prove just how much mind-boggling stupidity it can wield. It clearly snarfed a JavaScript file by parsing a SCRIPT tag, but then it somehow turned raw minified JavaScript gunk into URLs and tried to GET them?
What planet are these programmers on? Who is that broken in the head?
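For contrast, here is a minimal sketch of link extraction that would not fall into this hole. It is only an illustration, not a description of how any particular crawler works. It assumes the crawler has the response's Content-Type on hand, and the names LinkExtractor, links_from_response and the example.com URLs are made up for the example. The idea: only HTML bodies get scanned, and only real URL-bearing attributes count as links. Script bodies are never mined for URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects URLs from link-bearing attributes only (hypothetical helper)."""

    # (tag, attribute) pairs that carry URLs. Deliberately short and
    # illustrative; a real crawler would likely want a longer list.
    URL_ATTRS = {("a", "href"), ("link", "href"), ("script", "src"), ("img", "src")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser treats the *contents* of <script> as opaque data,
        # so anything tag-like inside a script body never reaches this method.
        for name, value in attrs:
            if value and (tag, name) in self.URL_ATTRS:
                self.urls.append(urljoin(self.base_url, value))


def links_from_response(url, content_type, body):
    """Return absolute URLs found in an HTML body, or [] for anything else."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ("text/html", "application/xhtml+xml"):
        # A fetched .js (or CSS, or image) is not a page. Don't go
        # looking for links in it.
        return []
    parser = LinkExtractor(url)
    parser.feed(body)
    parser.close()
    return parser.urls
```

Fed a page at a hypothetical http://example.com/contact/ with a SCRIPT tag pointing at jquery-1.7.1/jquery.min.js, this would yield the script's URL as something fetchable. Handed the JavaScript file itself, it would return nothing, and that second step is where the robot in the logs above went off the rails.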
It's almost to the point where I'm thinking about blocking any UA which matches "^Java/" just to catch anyone who thinks they can build a crawler just by gluing some Java examples together. If they can't even get as far as setting a halfway interesting User-Agent, what hope is there for them parsing and crawling things properly?
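A minimal sketch of what that check might look like, assuming the User-Agent header is available as a plain string wherever the decision gets made. The function name is_bare_java_client is hypothetical. The pattern is anchored at the start, so it only catches clients whose entire identity begins with "Java/" (which is what Java's stock URLConnection sends when nobody bothers to set anything), not browsers that merely mention Java somewhere later in the string.

```python
import re

# Matches user-agent strings that *start* with "Java/", such as "Java/1.7.0_05".
BARE_JAVA_UA = re.compile(r"^Java/")


def is_bare_java_client(user_agent):
    """True if the UA begins with 'Java/'. Treats a missing UA as no match."""
    return bool(BARE_JAVA_UA.match(user_agent or ""))


if __name__ == "__main__":
    for ua in ("Java/1.7.0_05", "Java/1.6.0_33", "Mozilla/5.0 (X11; Linux) Java/1.6"):
        print(ua, "->", "block" if is_bare_java_client(ua) else "allow")
```

What to do on a match (403, drop the connection, serve something small and boring) is a separate policy decision this sketch doesn't make.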
Randomly, at least one of these machines has port 139 open to the world and proudly proclaims that it is "Windows (R) Web Server 2008 6001 Service Pack 1", whatever that is. Maybe they're all just owned.