RSS Aggregation Software

The most commonly installed software for aggregating RSS feeds seems to be Planet and Venus (two forks of the same code base). The mode of operation is that a cron job runs a Python program which polls a list of RSS feeds and generates a static web page. Of course the problems start if you have many feeds, as polling each of them (even the ones that are typically updated at most once a week) can take a while. My experience with moderate numbers of feeds (such as all the feeds used by Planet Debian [1]) is that it can take as much as 30 minutes to poll them all – which is a problem if you want frequent updates.
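
As a concrete illustration, here is a minimal sketch (in Python, using the feedparser library that Planet itself is built on) of what such a cron job does on each run; the feed list is hypothetical and the structure is illustrative rather than Planet’s actual internals:

    import time
    import feedparser  # the feed parsing library Planet is built on

    # Hypothetical feed list; a real installation has hundreds of entries.
    FEEDS = [
        "http://blog.example.com/rss.xml",
        "http://other.example.org/feed/",
    ]

    def poll_all(feeds):
        entries = []
        for url in feeds:
            # Each parse() is a blocking HTTP fetch; at a few seconds per
            # feed, several hundred sequential fetches take tens of minutes.
            d = feedparser.parse(url)
            entries.extend(d.entries)
        # Newest first, as on a Planet page.
        entries.sort(key=lambda e: e.get("updated_parsed") or time.gmtime(0),
                     reverse=True)
        return entries

    # A real run would now render the entries into a static HTML page.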

Frequent polling is not always desirable, as it means more network load and a greater incidence of transient failures. In a default configuration, any error in updating a feed results in an error message being displayed by Planet, which in turn results in cron sending an email to the sysadmin. Even with an RSS feed being checked every four hours (which is what I do for my personal Planet installations) it can still be annoying to get the email when someone’s feed is offline for a day.

Now while there is usually no benefit in polling every 15 minutes (the most frequent poll interval in common use), there is one good reason for doing so if polling is your only update mechanism. The fact that some people want to click reload on the Planet web page every 10 minutes to look for new posts is not a good reason (it’s like looking in the fridge every few minutes and hoping that something tasty will appear). The good reason for polling frequently is to allow timely retraction of posts. It’s not uncommon for bloggers to fail to adequately consider the privacy implications of their posts (let’s face it – professional journalists have a written code of ethics, formal training, and an editorial board, and they still get it wrong on occasion – it’s not easy). So when a mistake is made about what personal data should be published in a blog post it’s best for everyone if the post can be amended quickly. The design of Planet is that when a post disappears from the RSS feed it also disappears from the Planet web page; I believe this was done deliberately for the purpose of removing such posts.

The correct solution to the problem of amending or removing posts is to use the “Update Services” part of the blog server configuration to have it send an XML-RPC ping to the syndication service. That can deliver an update rapidly (in a matter of seconds) without any polling.
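
As a sketch of what such a ping looks like: the conventional interface is the weblogUpdates.ping XML-RPC call popularised by weblogs.com. The endpoint URL below is hypothetical, and any extension for signalling removed posts would be up to the syndication service:

    import xmlrpc.client

    def notify_aggregator(blog_name, blog_url,
                          endpoint="http://planet.example.org/RPC2"):
        server = xmlrpc.client.ServerProxy(endpoint)
        # The conventional call carries just the blog name and URL; the
        # aggregator then re-fetches that one feed, picking up new posts
        # (and noticing removed ones, if it chooses) within seconds.
        response = server.weblogUpdates.ping(blog_name, blog_url)
        return not response.get("flerror", True)  # True on success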

I believe that a cron job is simply the wrong design for a modern RSS syndication service. This is no criticism of Planet (which has been working well for years for many people) but is due to the more recent requirements of more blogs, more frequent posting, and greater importance attached to blogs.

I believe that the first requirement for a public syndication service is that every blogger gets to specify the URL of their own feed, to save the sysadmin the effort of making routine URL changes. There should also be an option for the server to act on HTTP 301 codes and record the new URL in the database. Then the sysadmin would only have to manage adding new bloggers (approving them after they have created an account through a web-based interface) and removing bloggers.
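
A minimal sketch of acting on a 301, assuming the third-party requests library and a hypothetical update_feed_url() helper standing in for the database layer:

    import requests
    from urllib.parse import urljoin

    def fetch_feed(url, update_feed_url):
        resp = requests.get(url, allow_redirects=False, timeout=30)
        if resp.status_code == 301 and "Location" in resp.headers:
            # 301 is a permanent move, so record the new URL for next time.
            new_url = urljoin(url, resp.headers["Location"])
            update_feed_url(url, new_url)
            resp = requests.get(new_url, timeout=30)
        resp.raise_for_status()
        return resp.content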

The problem of polling frequency can be mostly solved by using RPC pings to inform the server of new posts, provided that the RPC mechanism supports removing posts. If removing posts is not supported by the RPC then every blog which has an active post would have to be polled frequently. Even that would reduce the amount of polling considerably: for example, there are 319 blogs currently syndicated on Planet Debian, there are 60 posts in the feed, and those posts were written by 41 different people. So if the frequent polling to detect article removal was performed only for blogs with active articles (given that you poll the blogger’s feed URL, not the individual article) that would mean only 41 polls instead of 319 – reducing the polling by a factor of more than 7!

Now even with support for RPC pings there is still a need to poll feeds. One issue is that a blog may experience temporary technical difficulty in sending the RPC, as we don’t want to compel the authors of blog software to try to make the ping as reliable a process as sending email (if that were the requirement then a ping via email might be the best solution). The polling frequency could be implemented on a per-blog basis, based on the blogger’s request and the blog’s availability and posting frequency. Someone whose blog has been down for a day (which is not uncommon when considering a population of 300 bloggers) could have their blog polled on a daily basis. Apart from that, the polling frequency could be based on the time since the last post. It seems to be a general pattern that hobby bloggers (who comprise the vast majority of bloggers syndicated in Planet installations) often go for weeks at a time with no posts and then release a series of posts when they feel inspired.
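
A sketch of such a per-blog schedule; the thresholds are illustrative guesses rather than values taken from any existing aggregator:

    from datetime import datetime, timedelta

    def poll_interval(last_post, feed_is_down, now=None):
        # Return how long to wait before polling this feed again.
        now = now or datetime.utcnow()
        if feed_is_down:
            return timedelta(days=1)      # retry a dead blog daily
        idle = now - last_post
        if idle < timedelta(days=2):
            return timedelta(minutes=15)  # active: catch retractions quickly
        if idle < timedelta(days=14):
            return timedelta(hours=4)
        return timedelta(days=1)          # dormant hobby blog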

In terms of software which meets these requirements, the nearest option seems to be the Advogato software, mod_virgule [2]. Advogato [3] supports managing accounts with attached RSS feeds and also supports ranking blogs for a personalised view. A minor modification of that code to limit who gets to have their blog archived, and a fix so that only the latest version of a modified post is stored (not both versions, as Advogato does), would satisfy some of these requirements. One problem is that Advogato’s method of syndicating blogs is to keep an entire copy of each blog (and all revisions). This conflicts with the wishes of many bloggers, who demand that Planet installations not keep copies of their content for a long period and not have any permanent archives. Among other things, if there are two copies of a blog post then Google might get the wrong idea as to which is the original.

Does anyone know of a system which does better than Advogato in meeting these design criteria?

5 comments to RSS Aggregation Software

  • foo

    You might want to take a look at this article for the future of data syndication:

    http://anarchogeek.com/articles/2008/7/23/beyond-rest-building-data-services-with-xmpp-pubsub

  • Joey Hess

    Ikiwiki’s aggregation support embeds the list of feeds to aggregate on a wiki page, so that users can edit them. It allows configuring the poll frequency on a per-feed basis there as well. Users can also create personalised feeds that include only a subset of the aggregated feeds.

    Ikiwiki does not immediately remove posts removed from feeds. I tend to see that as trying to close the barn door after the horse is out. (Or, it’s probably a minor modification to the code that I never considered making before..) It will, however, update posts with no record of the old content. That seems a more common way to retract things anyway, based on the modified versions of posts that I sometimes see pile up after the original post in my rss2email mailbox.

    http 301 support is on my todo list, but since any failure to access a feed shows up on the wiki page and can be fixed by anyone, and since many/most bloggers don’t know about using 301 when moving their blog anyway, it’s not been a priority. Nor has pinging, largely because I don’t feel pinging is a very scalable solution — I can’t expect to get all the blogs who I personally aggregate to ping my aggregator. Not until a lot of blog software supports rss 2.0’s clouds at least, which still seems to be an obscure and little-used thing.

    The pragmatic solution to needing to poll more feeds than your desired poll interval would seem to be threading the polling. 5 threads should be able to handle 300 feeds in well under 10 minutes.

  • If someone wants to develop patches for mod_virgule to make these changes, that would be cool.

    The intent is actually for mod_virgule to keep only the most recent copy of a modified post, and it works in most cases, but some syndication formats make it almost impossible to distinguish a modified post from a new post. Any patch that improves this situation would be helpful.

    It is by design that we archive posts forever but a patch that provided a configuration option in config.xml to set a retention time limit would be fine.

  • More pragmatic responses:

    Planet Venus supports multi-threaded polling. Our planets usually update in < 1 minute with 10 threads (and while they aren’t planet-debian-sized, they aren’t tiny).

    For people who don’t run their own infrastructure, the 301 approach won’t help much.

  • Karellen

    “it’s like looking in the fridge every few minutes and hoping that something tasty will appear”

    My friends and I refer to that, in conjunction with also opening all the cupboards in the kitchen for the same reason, as “the search for Schrödinger’s Sandwich”. We are hoping that, while the cupboard/fridge is closed, the unobserved quantum wavefunction within will enter a state such that the molecules rearrange themselves into a tasty sandwich the next time it is opened.

    It has not worked yet.

    (We thought it did once, but it turned out someone else put the sandwich there while the searcher was elsewhere.)