
Saturday, December 17, 2011

Gigabot is mega-clueless

Whatever "Gigabot" is, it's clueless.

Exhibit 1:

64.22.106.82 - - [17/Dec/2011:18:31:18 -0800] "GET /robots.txt HTTP/1.0" 200 26 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"

64.22.106.82 - - [17/Dec/2011:18:31:19 -0800] "GET /wtf.html HTTP/1.0" 200 1200 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"

The fred robots.txt is simple enough:

User-agent: *
Disallow: /

Translation: if it's a URL on this site, you aren't supposed to fetch it with a spider. Reality: they requested something anyway. Duh?
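As a quick sanity check (nothing to do with Gigablast's own code, just an illustrative sketch using Python's standard-library robots.txt parser): with a blanket "Disallow: /" for every user agent, no URL on the fred host is fair game, including the page Gigabot grabbed one second later.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any path at all is off limits, including the one Gigabot fetched anyway.
print(rp.can_fetch("Gigabot/3.0", "http://fred.rachelbythebay.com/wtf.html"))  # False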

Exhibit 2:

64.22.106.82 - - [17/Dec/2011:18:31:01 -0800] "GET /robots.txt HTTP/1.0" 200 65 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"

64.22.106.82 - - [17/Dec/2011:18:31:02 -0800] "GET /main?335095 HTTP/1.0" 200 10234 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"

The scanner robots.txt is a little more complicated, but it should still be unambiguous:

User-agent: *
Disallow: /main
Disallow: /main/
Disallow: /main/*
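The same kind of check (again, just a sketch with the stock Python parser) shows why the request above is out of bounds: "/main?335095" starts with "/main", so the very first Disallow line already covers it, wildcard or not.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /main",
    "Disallow: /main/",
    "Disallow: /main/*",
])

# "/main?335095" falls under the "/main" prefix, so this prints False.
print(rp.can_fetch("Gigabot/3.0", "http://scanner.rachelbythebay.com/main?335095"))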

If this keeps up, I think I'll start seeding pages with links which go nowhere useful but are well-covered by robots.txt. It should show exactly who honors this sort of thing and who just thumbs their nose at it.
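Here's a rough sketch of the detection side of that trap. The trap URL (/trap/do-not-list, say) is purely hypothetical, as is the script; the idea is just to list such a path in robots.txt, link to it from a normal page, and then pull out of the access log whoever fetched it anyway. The log format assumed below matches the combined-style lines quoted earlier.

import re

TRAP_PATH = "/trap/do-not-list"   # hypothetical URL, disallowed in robots.txt

# client IP, request path, and user agent from a combined-style log line
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) [^"]*" \d+ \d+ "[^"]*" "([^"]*)"')

def offenders(logfile):
    """Yield (client IP, user agent) for every hit on the trap URL."""
    with open(logfile) as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m and m.group(2).startswith(TRAP_PATH):
                yield m.group(1), m.group(3)

for ip, agent in offenders("access.log"):
    print(ip, agent)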