A sysadmin's rant about feed readers and crawlers
If a web site makes an RSS or Atom feed available, it's not a bad idea to poll it from time to time. Actually doing that poll like a good netizen (remember that concept?) takes a little attention to detail.
First of all, the feed probably doesn't change that often. Depending on who it is and what you're following, you might get an update once a day, or multiple times a day if it's something really special. Everyone and everything else is going to be somewhere less than that.
Given the reality of the situation, does it make sense to poll every 10 seconds? Probably not, right? Someone set up something like that and left it running for a couple of months until I finally noticed it and dealt with it myself.
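Just to show how low the bar is, here's roughly what "from time to time" could look like. This is a throwaway sketch in Python with a made-up fetch_feed() and a completely arbitrary one-hour interval; the only point is that the number has "hour" in it, not "seconds":

```
import random
import time

POLL_INTERVAL = 3600   # arbitrary choice: once an hour is plenty for most feeds
JITTER = 300           # a little randomness so pollers don't all line up

def fetch_feed():
    """Placeholder for the actual HTTP fetch (see the sketches below)."""

while True:
    fetch_feed()
    # Sleep roughly an hour, give or take a few minutes.
    time.sleep(POLL_INTERVAL + random.uniform(-JITTER, JITTER))
```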
Another thing about URLs is that the well-behaved ones serve the same content until somebody deliberately changes it. They don't change just because you requested a copy. A good feed certainly behaves this way.
Given this, you can keep track of the "Last-Modified" header when you get a copy of the feed. Then, you can turn around and send that same value in an "If-Modified-Since" header the next time you come looking for an update. If nothing's changed, the web server will see that the feed hasn't been modified since that timestamp, and it'll send you an HTTP 304 telling you to use your local (cached) copy. In other words, there is no reason for you to download another ~640 KB of data right now.
Alternatively, many web servers (mine included) support this thing called "ETag": an opaque blob the server hands you with the response, which you send back in an "If-None-Match" header on your next request. If the content hasn't changed, you get a nice small 304. Otherwise, you get the full content as usual. It's effectively another way to do the "IMS" behavior described above.
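To put that in concrete terms, here's a minimal sketch of a conditional fetch using nothing but Python's standard library. The feed URL is made up, and in a real reader the saved Last-Modified and ETag values would live somewhere persistent between runs:

```
import urllib.error
import urllib.request

FEED_URL = "https://example.com/atom.xml"    # hypothetical feed URL

# Saved from the previous successful fetch (persist these between runs).
last_modified = None
etag = None

req = urllib.request.Request(FEED_URL)
if last_modified:
    req.add_header("If-Modified-Since", last_modified)
if etag:
    req.add_header("If-None-Match", etag)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()                           # the feed actually changed
        last_modified = resp.headers.get("Last-Modified")
        etag = resp.headers.get("ETag")
except urllib.error.HTTPError as err:
    if err.code == 304:
        body = None                                  # nothing new; keep using the cached copy
    else:
        raise
```

The server only ships the full body when one of those validators stops matching; the rest of the time you get the tiny 304 and move on.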
Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(*&^$*(&^@#* post that's mentioned in the feed.
*exhale* ... okay, let's continue.
If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
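And just to show how little work that is: here's a sketch with the third-party feedparser module (my choice for the example, not something any of these clients were using; any feed library can do this) reading posts straight out of an already-fetched copy of the feed. The cache filename is made up:

```
import feedparser    # third-party: pip install feedparser

# Hypothetical: "body" is the feed document you already fetched above
# (or a cached copy of it sitting on disk).
with open("cached-feed.xml", "rb") as fh:
    body = fh.read()

parsed = feedparser.parse(body)

for entry in parsed.entries:
    title = entry.get("title", "(untitled)")
    # Full-content feeds put the whole post right here; otherwise there's
    # at least a summary. Either way, no need to go hit the article URL.
    if "content" in entry:
        html = entry.content[0].value
    else:
        html = entry.get("summary", "")
    print(f"{title}: {len(html)} bytes of content, already in the feed")
```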
Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow. I mean, it probably will... some day... but not in the way they wanted. It'll be more like "welcome to the IP filter" and less like "oh here, bypass the paywall that this site never had in the first place".
Let's tally up some of the bad behaviors here:
- Loading the feed way too often, like every 10 seconds, when it never
updates anywhere near that often
- Not using If-Modified-Since (or ETag), and so always downloading a full
copy of the feed even when nothing has changed
- Scraping individual posts after pulling a copy of the feed even
though they have the same damn content
- Doing this while claiming to be "GoogleBot"... guess what, that
actually *draws attention to* your process. It's pretty easy to
remember "66.249" and notice things which are definitely not that.
Also, any hosts that reverse to "googleusercontent.com" in particular
are NOT the actual crawler. (And yes, that Google netblock is a /19,
not a /16. It's a reasonable approximation if you're just eyeballing a
log.) See the sketch after this list for a quick way to check.
- Sending referrers which make no sense is just bad manners.
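By the way, checking that "GoogleBot" claim takes maybe a dozen lines: reverse-resolve the IP, make sure the name ends in googlebot.com or google.com (googleusercontent.com does not count), then resolve that name forward again and confirm it points back at the same address. A rough, IPv4-only sketch; the address at the bottom is just a documentation-range example:

```
import socket

def looks_like_real_googlebot(ip):
    """Reverse-then-forward DNS check for an address claiming to be Googlebot."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)      # real crawlers reverse to *.googlebot.com
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                               # googleusercontent.com lands here, i.e. fake
    try:
        _, _, addrs = socket.gethostbyname_ex(host)
    except OSError:
        return False
    return ip in addrs                             # forward-confirm the PTR record

print(looks_like_real_googlebot("203.0.113.10"))   # example address only: prints False
```

Anything that fails a check like that while sending a Googlebot user-agent is exactly the "welcome to the IP filter" case from earlier.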
Would you believe most of the bad ones manage more than one of these at the same time? One of these things won't just hit the site every 10 seconds, for example; it'll also grab a full copy of the entire feed every single time. 640 KB every 10 seconds for weeks and weeks until someone notices and pulls the plug (or I block it). Really. This is not a good use of resources!
What's really amazing is that when asked to slow it down and actually do the If-Modified-Since thing, it turned out they *couldn't* send the header. Yes! That is the state of the art in feed reader technology these days: download the full thing every time.
If-Modified-Since. It's in RFC 1945, aka the HTTP/1.0 spec... from *May 1996*. It's been over 25 years, and it's time to start using this thing.