feed reader best practices

If you're seeing this, it's probably because you want to adjust things to be a better netizen/online citizen/whatever we're calling ourselves these days. Thanks for that! There are a few things you can do to make things better for everyone.

New hostname effective January 17, 2025

The hostname has changed. Visit the top page for information on what to do next if you would like to continue participating.

Conditional requests: inbound

Whenever possible, your feed reader should issue conditional requests. This is done by storing one or two header values sent by the web server at the time it sends you a copy of the feed.

A typical web server response might look like this:

HTTP/1.1 200 OK
Date: <more or less the current date and time>
Server: <something or other>
Last-Modified: <some date and time -- a black box value>
ETag: "<probably some hex digits -- another black box value>"
Accept-Ranges: bytes
Content-Length: <some number>
Cache-Control: max-age=43200
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml

The two you care about are Last-Modified and ETag. Many servers will send both. Some servers will only send you one of them, and then some won't send either for some reason.

You should store the "Last-Modified" and/or "ETag" values exactly as they were received from the web server so you can send them back on the next attempt.

Be advised that ETag values are supposed to be surrounded by double quotes "like this" (per the RFCs, really!). Those double quotes are actually part of the value, so you need to store them as well, and return them later (see the next section in this document).
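As a sketch of the storage side (Python, with hypothetical cache and field names), storing the validators verbatim might look like this:

```python
def store_validators(response_headers, cache, feed_url):
    """Record Last-Modified and ETag exactly as received.

    The ETag's surrounding double quotes are part of the value, so
    they are kept intact. Cache layout and names are hypothetical.
    """
    entry = cache.setdefault(feed_url, {})
    last_modified = response_headers.get("Last-Modified")
    etag = response_headers.get("ETag")
    if last_modified is not None:
        entry["last_modified"] = last_modified
    if etag is not None:
        entry["etag"] = etag
```

Note that either header may be absent; store whichever ones show up and leave the rest alone.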

Conditional requests: outbound

To issue a conditional request, your feed reader will need to set at least one of the following headers: If-Modified-Since or If-None-Match.

The If-Modified-Since header should be set to whatever you last received from the server in a Last-Modified header. Don't invent your own time values for this header. You and the web server may have very different ideas of the update times for the feed, and you will miss content if you keep bumping it along.

The If-None-Match header should be set to whatever you last received from the server in an ETag header, and if it happens to be "wrapped like this", then you need to be sure to include those "quotes" as well. Believe it or not, this is right there in the RFCs - note "opaque quoted string". HTTP standards are so bizarre.

A conditional HTTP request might look like this:

GET /some/path/to/feed HTTP/1.1
Host: something.or.other.example.net
If-Modified-Since: <value from Last-Modified response>
If-None-Match: <same idea, but from ETag>
User-Agent: <something to identify your software>
... other headers as appropriate ...
<blank line>
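A minimal sketch of building those headers from previously stored values (Python; the cache layout and user-agent string are hypothetical):

```python
def conditional_headers(cache_entry, user_agent="Example Reader/1.0"):
    """Build headers for a conditional GET from stored validators.

    cache_entry is whatever was saved from the last response; an empty
    dict yields a plain unconditional request, which is correct for a
    brand-new feed.
    """
    headers = {"User-Agent": user_agent}
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    if cache_entry.get("etag"):
        # Echoed verbatim: the double quotes are part of the ETag value.
        headers["If-None-Match"] = cache_entry["etag"]
    return headers
```

If the server only ever sent one of the two validators, you'll naturally send only the corresponding request header, which is fine.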

Unconditional requests

It's expected that a feed reader will make a single unconditional request upon initializing a new feed subscription. This is done by not setting If-Modified-Since and also not setting If-None-Match.

This should be the only time that it issues an unconditional request by itself. If the user does the equivalent of a "shift reload" or a "really, go fetch it again *right now*" action, then consider issuing a request without those headers to see if anything useful happens. This is only useful on broken servers (or with broken feed readers).

It's important to minimize these requests since they force the web server to transmit the entire feed even if the client has already seen all of the content contained within. This can be a lot of content, and for very popular feeds with many subscribers, such behavior tends to not scale. It's amplified when the polling intervals are set too short - see the later section on those for more.

Expect to receive errors if you send unconditional requests too quickly. Web servers which have had to stand up protection for their feeds have limits, and you will quickly hit them if you only send unconditional requests. This is why you should keep and return valid headers.

Feed readers which generate unconditional requests beyond some threshold set by the server administrator may be rejected with various levels of error codes or even filtered at the IP layer. Most of these blocks are "soft", meaning if the client just slows down and plays nice, it will clear by itself and then it can go back to fetching things again.

Polling intervals

Recommended values:

Conditional = 1 poll/hour
Unconditional = 1 poll/day

A suggested polling interval on a feed for which you have not made prior arrangements (payments, subscriptions, or other support) is no more than once per hour for conditional requests.

Feed readers which are unable to generate working conditional requests, and thus always send unconditional requests that force transmission of the full feed, should poll at most once per 24-hour period.

This means there is a clear advantage to playing nice and sending valid conditional requests: your feed reader can check roughly hourly instead of roughly daily. This makes for fresher updates and happier users.

Common mistakes made by multiple programs

I've observed a bunch of feed readers which have made the same mistakes. These are real bugs, and I'm not naming names, but they are happening in actual released software right now. You should read through them and make sure you aren't doing any of these things.

Using HEAD requests

There's no reason to do a HEAD when you can do a conditional GET and get the same metadata, and as a bonus, get a fresh copy of the content if it's changed since your last request.

Doing HEADs only makes sense if you never do GETs, meaning you don't care about the content and only want to track the metadata for some reason. In that rather unlikely case, then sure, send HEAD requests. Otherwise, don't.

Not updating the metadata when the body doesn't change

More than a couple of feed readers use a library which hashes the contents of the body and, when the hash hasn't changed, skips updating their cached values for Last-Modified and ETag. This makes them send the old values over and over, turning them into a generator of unconditional requests. This is bad.

Caching systems should be designed around the notion of "this request contains the headers using values from the last request": the previous Last-Modified from the server is the next If-Modified-Since. The previous ETag from the server is the next If-None-Match.

Not updating the cached ETag when Last-Modified is the same

It's possible for the Last-Modified header to stay the same while the ETag changes. Last-Modified on a stock Apache httpd is just seconds-level precision of the mtime value from the filesystem, while ETag on that same httpd is (length)-(mtime in microseconds, in hex).

Imagine putting "foo" in a file, then putting "blah" in the same file within the same second. They'll have the same Last-Modified time, but will have two different ETag values (length changed, fractional part of the mtime changed), and will still be completely legit. A poll could happen between those file writes, and it'd get stuck there.
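To make the failure mode concrete, here's a toy model of that scheme (an illustration of the behavior described above, not Apache's actual code):

```python
def validators(length, mtime_microseconds):
    """Toy model: Last-Modified has only whole-second precision, while
    the ETag folds in the length and the microsecond mtime (in hex)."""
    last_modified = mtime_microseconds // 1_000_000  # seconds only
    etag = '"%x-%x"' % (length, mtime_microseconds)
    return last_modified, etag

# "foo" (3 bytes) then "blah" (4 bytes), written within the same second:
lm1, etag1 = validators(3, 1_700_000_000_123_456)
lm2, etag2 = validators(4, 1_700_000_000_654_321)
# lm1 == lm2, but etag1 != etag2 -- a poll landing between the two
# writes caches an ETag that will never match again.
```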

Reader programs which make this mistake wind up sending what are effectively unconditional requests (since the If-None-Match header they send no longer matches) and will get themselves into trouble... or the dreaded 429.

Solution: every time you get a Last-Modified value and/or ETag value, update your cache for that feed! Don't shortcut out because one of them seems like it didn't change.
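One way to express that rule (Python; the cache layout is hypothetical):

```python
def refresh_validators(cache_entry, response_headers):
    """Unconditionally refresh BOTH cached validators on every response.

    No shortcuts: even if Last-Modified looks unchanged, the ETag may
    have moved, and vice versa.
    """
    for header, key in (("Last-Modified", "last_modified"),
                        ("ETag", "etag")):
        value = response_headers.get(header)
        if value is not None:
            cache_entry[key] = value
    return cache_entry
```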

Not accounting for time spent doing a poll with tight scheduling

Let's say a reader program is set to do a poll every hour, so it starts working at the same offset into an hour, exactly 3600 seconds after the last start time. It will take some amount of time to reach any given URL in the list, connect to it, send a request, get a response, and figure out what to do with it.

For the sake of this thought experiment, imagine there is a variance of up to 60 seconds to do this. Maybe it checks a bunch of feeds serially, and they all have their own foibles. On one check, there are 30 seconds of delay before it gets to a certain feed, so that poll happens at 12:00:30.

An hour later, there's less delay before it gets to that certain feed - maybe this time it's 15 seconds. That poll runs at 13:00:15. Oops.

See the problem yet? Those two polls were 3585 seconds apart, not 3600. If you're aiming to poll "at most once every 3600 seconds", this isn't how you get there.

Assuming you're using some kind of SQL database to store the list of feeds, the easiest way to avoid this is to explicitly select things where the last poll time is at least n seconds ago. If you get to them too quickly, they won't show up this time, and you won't rush things.
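For example, with SQLite (table and column names are made up for illustration):

```python
import sqlite3
import time

def feeds_due(conn, min_interval=3600, now=None):
    """Return only feeds whose last poll is at least min_interval
    seconds in the past; anything polled more recently is skipped
    this round, which prevents the interval from slowly shrinking."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT url FROM feeds WHERE last_poll_at <= ?",
        (now - min_interval,),
    )
    return [url for (url,) in rows]
```

A feed that misses the cutoff by a few seconds simply waits for the next pass, so polls are never closer together than the interval.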

This is why server operators have to ask for one poll frequency and actually allow another that's a bit looser, in case you had noticed that it works anyway. Ahem. Actually exploiting this is NOT recommended.

Using the wrong time for If-Modified-Since headers

Atom feeds have a top-level "updated" value - inside the actual XML of the feed. This is not the same as the Last-Modified value sent by the web server, and so it must not be sent as an If-Modified-Since value.

Solution: only use Last-Modified values as If-Modified-Since values.

Tracking their own If-Modified-Since times

Their logic is something like "I last talked to the web server at (my) time X, so I'll ask for anything new since that same time". That's no good, and you will miss updates. Apache-based systems which are serving file-based feeds are using the Unix filesystem's modification time (mtime) for their Last-Modified and If-Modified-Since operations.

Consider what happens if you poll at time X, then time X+60, and time X+120, but then at time X+130, the server receives an updated feed and the file has a timestamp of X+1. You will not receive that update.

The web server's time scale is entirely different from your own, particularly as it applies to whenever files are updated.

Solution: only use Last-Modified values as If-Modified-Since values. Really.

Launching multiple requests at feed initialization time

There is no justification for making redundant requests to add a feed. A feed reader which behaves this way has made the wrong decisions in how to design its network I/O. There is no reason to load it "once to see if it's valid, a second time to look for the favicon, and a third time to look for content". No.

This manifests as new subscribers who immediately hit the "429 wall" because their (unconditional!) requests show up as "less than 1 second since last", and that sure looks like someone who's started up curl in a loop. Don't do that.

Solution: load it once and use those results for all analysis purposes.

Advanced topics

Unnecessary headers: Referer

Unless you've been explicitly told to send a referrer (the HTTP header is actually spelled "Referer", and yes, it's missing a letter), your feed reader should leave that out. Sending fake referrer data is the mark of web abusers, and you should try to distance yourself from that kind of behavior as much as possible.

Unnecessary headers: Cookie

This one is the same idea as above: don't send it unless you've been told to do it explicitly.

Unnecessary URL parameters: ?foo=blah

URLs can have query arguments tacked onto them, so they end up looking like this:

http://example.com/feed.xml?something_useless=1234

The rule here is the same as the last two: if it's not part of the URL you've been given to poll for a feed, don't add such arguments yourself. At best, they will have no effect, and at worst, they will have unpredictable results when your feed reader makes a request.

User agents

It's great if a user-agent identifies itself. A feed reader that says it's "Foo Reader version 1.23" is a great start. In the case of feed readers which are actual products available to the public that have web sites, it's also nice to put a URL in the string so interested parties can check it out, or contact the authors if something is very wrong.

Adding a version number allows selective filtering in the event that a single bad release goes out which otherwise would sully the reputation of the entire project. Shipping every release with the same hard-coded version number will not earn you any admirers among server admins.

Some projects use a git commit hash in lieu of a version number since they are from the newfangled world of "just sync to HEAD" and seemingly never do formal releases. You get the idea.

One important thing: for the sake of user privacy, your User-Agent header probably should not include any identifying features about who's running it. The IP address of a client is already quite a bit of identifying information.

Foo Reader v1.23 : not bad

Foo Reader v1.23; +https://fooreader.example.org/ : nice.

(something else that identifies it as a feed reader) : it works.

Mozilla/5.0 (compatible; feed thingy abcd) : *sigh*. Will this thing ever die? It works but it's groan-worthy.

(something that's an exact clone of a web browser) : yeah, no. don't do this. see the next entry.

Fake user-agents

Yes, some web servers are terrible and will block feed readers even though they offer feeds which are supposedly open to the public. We're not talking about them. This is about everyone else who's just handing out data to anyone who asks politely, which I would guess is most non-commercial feed providers.

Your feed reader should look like a feed reader, or at least, it shouldn't try to fake being a Chrome, Firefox, Safari, or any other web browser. One really good reason to not do this is that you will probably fail at actually looking like a real instance of whatever thing you're trying to fake. Actual Web Browsers have certain behaviors above and beyond their User-Agent headers, and if you claim to be one and yet aren't, it'll stand out. Don't do this.

There should be no need to fake a User-Agent to get reasonable behavior from a web server if you are sending well-formatted requests at a polite interval.

Non-200 HTTP status codes (the dreaded 429, and more)

429 ("Too Many Requests")

If you have followed the guidelines laid out in this document, you should never see the "dreaded 429" which means "you're sending too many requests in too short a period of time". Make sure you're sending conditional requests whenever possible (that is, on every request after the first to a given feed), and with a decent period of time between them, and it should never happen.

If you do get a 429, the feed reader should slow down. The 429 response might include a "Retry-After" header which is a length of time in seconds. Trying back before that amount of time elapses is a very bad idea because some providers actually reset the timer every time you generate another 429.

If the Retry-After header seems really large, try making conditional requests instead. People who say "it's telling me to try back after 86400" (a day's worth of seconds) are actually sending unconditional requests and thus are hitting the earlier advice: unconditional requests should be at most once per day.
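A minimal way to honor that header (Python; this handles only the delta-seconds form, since Retry-After can also be an HTTP-date):

```python
def retry_after_delay(response_headers, default_backoff=3600):
    """Return the number of seconds to wait before the next attempt.

    Parses the delta-seconds form of Retry-After; if the header is
    missing or uses the HTTP-date form, fall back to a conservative
    default rather than retrying early.
    """
    value = response_headers.get("Retry-After")
    try:
        return max(0, int(value))
    except (TypeError, ValueError):
        return default_backoff
```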

You should warn the user when a feed throws this, since they may have purposely dialed up the polling interval and caused the problem themselves. Even conditional requests have activity limits.

Ignoring a series of 429s will cause certain providers to automatically drop packets from the source host(s) for an unknown amount of time. Again, you should never experience this.

404 ("Not Found")

If you start getting this on an established feed, it probably means someone screwed something up on the web server, and you might want to try again a few more times at reasonable intervals before declaring it dead. However, after some amount of time, you should warn the user and disable it until the user takes some action to turn it back on.

If the feed returns a 404 initially, the user probably made a mistake entering the URL, and you should tell them as much. In this case, you should not add it for polls by default.

410 ("Gone")

This usually means there's some admin who's really unhappy about having random agents hitting a URL on their server years after they killed something off. They're done serving it, and are trying to get you to take it out of your list.

You should never see this. If you do, you've probably done something very wrong by ignoring a great many 404s up to this point.

403 ("Forbidden")

If this shows up out of nowhere and keeps happening, there's a good chance that it's not a mistake on the server and that the client has done something very wrong. This kind of outright ban is usually reserved for user agents which have demonstrated a complete lack of caring in terms of being responsible members of the feed ecosystem.

Example: a user-agent which always sends unconditional requests and which has a hard-coded poll interval of 30 seconds is likely to be banned by server administrators. There is no way for the user to fix the situation, so their choices are limited.

You also should never see this.

Feed URLs usually don't have wacky characters in them

People cut and paste URLs into feed reader input fields, and sometimes they select too much stuff. It's pretty unlikely that an admin has created a feed that actually has a space in the URL, or a carriage-return, or a newline, or even < or > type brackets.

It's worth asking the user "are you sure?" if you spot anything like that in the URL they are attempting to add. However, if they carry on anyway and get a 404, you should handle that correctly and not add the feed by default. See the 404 section above.
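A quick sanity check along those lines (a sketch; tune the character list to taste):

```python
SUSPECT_CHARS = (" ", "\t", "\r", "\n", "<", ">")

def url_looks_suspicious(url):
    """Flag characters that almost never belong in a real feed URL,
    usually the result of over-selecting during a copy/paste."""
    return any(c in url for c in SUSPECT_CHARS)
```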

Paid subscriptions and faster polling rates

If you're paying for access to a feed, you may have earned the privilege of polling far more often, possibly even unconditionally. If it's a free service, you probably haven't.

If you really need to be able to find out when someone updates something down to the minute or whatever, reach out and see if they would be willing to provide a special "paid feed" that has looser limits on accesses.

rachelbythebay note: speaking for myself here, if you wanted to pay $750 a month (prepaid, a year at a time), I would be willing to talk about providing a separate feed source so you could bang away at it quite a bit more than the existing 1/hour or 1/day scheme that my site has now. Contact me if you're serious about this sort of thing.

Expert topics

Cache-Control and "max-age"

A reasonable person might ask "how could anyone possibly know what the feed operator wants us to use as a polling interval?" There's at least one way to do this, and it's built into the HTTP headers:

Cache-Control: max-age=12345

If that's set, the server is suggesting that you can use a cached copy of this response for another 12345 seconds. After that, you should probably call back (conditionally!) and see what's up.
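Pulling max-age out of the header could look like this (a sketch; a full Cache-Control parser handles many more directives):

```python
import re

def max_age_seconds(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control
    header, or None if it isn't present."""
    if not cache_control:
        return None
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None
```

A client could then treat that value as a floor for the next poll time, falling back to the once-per-hour default when the header is absent.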

Cache-Control headers with a max-age value were added to this project on the afternoon (Pacific time) of Thursday, August 22, 2024. The values sent are shown in the report, including the fact that they weren't present previously.

I don't think any readers actually implement this based on my own empirical tests. I'd love to be proven wrong on this.

IPv4 vs. IPv6

Most people don't have any control over whether IPv6 is available to them or not. It's entirely up to their ISP, network administrator, or random other people between them and the rest of the Internet.

Basically, if you have control over whether IPv6 is an option where you get your connectivity, you probably know it already. Otherwise, you're at the mercy of that not-you person figuring it out and turning it on some day.

If you have a cell phone, you are almost certainly using v6 already without realizing it. Cell networks have vast numbers of subscribers and would have hit the wall with routable IPv4 space a long time ago. Wired providers are still getting away with it in a lot of places.

It doesn't hurt to ask a lagging provider when they plan to get with the program. Every inquiry gives the network nerds inside the company that much more ammo to use as a lever against reluctant management types.

TL;DR this is why you don't get a "!" for IPv4. If you're not using v6 as an end-user, chances are it's not your fault.

And finally... (FAQs)

"What was the bug with the bad Last-Modified value?"

I screwed up. More info is available in a post. As of the afternoon of June 26th, the database has been fixed and you should no longer see problems due to my mistake.

Again, sorry about this to anyone who's been chasing ghosts.

"What's a 'bogus If-Modified-Since value'?"

The feed reader scoring system knows what Last-Modified values it's handed out. If it gets back an If-Modified-Since value that isn't in that set, then it knows it was made up. Don't do this - it's not what you want. See above for more.

"What's a 'bogus If-None-Match value'?"

Just like the last item, the scoring system knows what ETag values it's handed out to clients. If it gets back an If-None-Match value which isn't in that set, then it knows that was made up by the client. You really don't want to do this. You will miss out on updates this way.

"What's an 'out of sequence If-Modified-Since value'?"

"What's an 'out of sequence If-None-Match value'?"

This means the client handed the server a value (IMS or INM) that wasn't the same as the one the server served up (LM or ETag) on the last request. However, it was recognized as a value which had been served previously.

This is a strong indication of a caching bug in the client software.

This can also happen if more than one feed reader install accesses the same unique feed URL. This is yet another reason to only load that URL from a single install of a particular feed reader program. If you need more keys for more programs to test, just ask.

"Why am I not getting 304s from the feed tester?"

Okay, you caught me. I wrote this test thing to be quick and dirty, and it does some funny tricks to make sure I can tell a valid If-Modified-Since or If-None-Match value from one that's been made up. It also means I'm not doing the usual conditional stuff on my end, and always ship a full update. Good thing it's small!

Rest assured that the "real" feed I run does behave properly and will give you 304s when appropriate. Besides, think of it as an opportunity to test your stuff against a web server that doesn't know how to honor either of those headers because it wasn't written very well.