Gigabot is mega-clueless
Whatever "Gigabot" is, it's clueless.
Exhibit 1:
64.22.106.82 - - [17/Dec/2011:18:31:18 -0800] "GET /robots.txt HTTP/1.0" 200 26 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:19 -0800] "GET /wtf.html HTTP/1.0" 200 1200 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"
The fred robots.txt is simple enough:
User-agent: *
Disallow: /
Translation: if it's a URL on this site, a spider isn't supposed to fetch it. Any of it. Reality: they went and grabbed /wtf.html anyway. Duh?
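This isn't some ambiguous corner case, either. Here's a quick sketch using nothing but Python's stock robotparser, fed those same two lines (nothing Gigabot-specific assumed), and it reaches the obvious conclusion:

from urllib import robotparser

# Feed the parser the exact rules served for fred.rachelbythebay.com.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every path on the site is disallowed, /wtf.html included.
print(rp.can_fetch("Gigabot/3.0", "http://fred.rachelbythebay.com/wtf.html"))
# -> False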
Exhibit 2:
64.22.106.82 - - [17/Dec/2011:18:31:01 -0800] "GET /robots.txt HTTP/1.0" 200 65 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:02 -0800] "GET /main?335095 HTTP/1.0" 200 10234 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"
The scanner robots.txt is a little more complicated, but it should still be unambiguous:
User-agent: *
Disallow: /main
Disallow: /main/
Disallow: /main/*
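Same deal. Here's a minimal check with the same stdlib parser (again, just my assumption of how a sane crawler would do it, not anything Gigabot claims to run):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /main",
    "Disallow: /main/",
    "Disallow: /main/*",
])

# Disallow rules are prefix matches, so the query string doesn't help:
# /main?335095 still starts with /main.
print(rp.can_fetch("Gigabot/3.0", "http://scanner.rachelbythebay.com/main?335095"))
# -> False

They fetched it anyway.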
If this keeps up, I think I'll start seeding pages with links which go nowhere useful but are well-covered by robots.txt. It should show exactly who honors this sort of thing and who just thumbs their nose at it.
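Something like this, say (the path is made up, purely for illustration):

User-agent: *
Disallow: /trap/

...plus a link to /trap/you-were-warned.html tucked somewhere no human would ever click. Anything that turns up in the logs asking for that URL has, by definition, either not read robots.txt or read it and ignored it.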