The most commonly installed software for aggregating RSS feeds seems to be Planet and Venus (two forks of the same code base). The operation is that a cron job runs the Python program which syndicates a list of RSS feeds and generates a static web page. Of course the problems start if you have many feeds as polling each of them (even the ones that typically get updated at most once a week) can take a while. My experience with adding moderate numbers of feeds (such as all the feeds used by Planet Debian [1]) is that it can take as much as 30 minutes to poll them all – which will be a problem if you want frequent updates.
Frequent polling is not always desirable, as it means more network load and a greater incidence of transient failures. In a default configuration any error in updating a feed results in an error message from Planet, which in turn results in cron sending an email to the sysadmin. Even with an RSS feed being checked every four hours (which is what I do for my personal Planet installations) it can still be annoying to get the email when someone’s feed is offline for a day.
Now while there is usually no benefit in polling every 15 minutes (the most frequent poll time that is commonly used) there is one good reason for doing it if you can only poll. The fact that some people want to click reload on the Planet web page every 10 minutes to look for new posts is not a good reason (it’s like looking in the fridge every few minutes and hoping that something tasty will appear). The good reason for polling frequently is to allow timely retraction of posts. It’s not uncommon for bloggers to fail to adequately consider the privacy implications of their posts (let’s face it, professional journalists have a written code of ethics about this, formal training, and an editorial board, and they still get it wrong on occasion; it’s not easy). So when a mistake is made about what personal data should be published in a blog post it’s best for everyone if the post can be amended quickly. The design of Planet is that when a post disappears from the RSS feed it also disappears from the Planet web page. I believe that this was deliberately done for the purpose of removing such posts.
The correct solution to the problem of amending or removing posts is to use the “Update Services” part of the blog server configuration to have it send an XML-RPC ping to the syndication service. That can deliver an update rapidly (in a matter of seconds) without any polling.
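As a rough illustration of the receiving side, here is a minimal Python sketch of a syndication service accepting such pings. It assumes the blog software sends the conventional weblogUpdates.ping call with a blog name and blog URL, and that the reply is a struct with flerror and message fields. The names pending_refresh and registered_blogs and the example URL are hypothetical, and this is not something Planet or Venus provides.

```python
from xmlrpc.server import SimpleXMLRPCDispatcher

# Hypothetical in-memory queue of blogs that should be refetched right away.
pending_refresh = set()

# Hypothetical allow-list: only blogs already syndicated may trigger a refetch.
registered_blogs = {"https://example.org/blog/"}


def weblog_ping(blog_name, blog_url):
    """Handle a ping by queueing the blog for an immediate refetch.

    The reply shape (flerror/message) follows the usual weblogUpdates.ping
    convention; treat it as an assumption about what the blog software expects.
    """
    if blog_url not in registered_blogs:
        return {"flerror": True, "message": "Unknown blog"}
    pending_refresh.add(blog_url)
    return {"flerror": False, "message": "Thanks for the ping"}


# Register the handler under the method name that blog software commonly calls.
dispatcher = SimpleXMLRPCDispatcher(allow_none=False, encoding=None)
dispatcher.register_function(weblog_ping, "weblogUpdates.ping")
```

To actually listen on the network the same register_function call could be made on an xmlrpc.server.SimpleXMLRPCServer instance. A separate worker would then fetch the feeds named in pending_refresh, so that a retracted post is noticed within seconds of the ping.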
I believe that a cron job is simply the wrong design for a modern RSS syndication service. This is no criticism of Planet (which has been working well for years for many people) but is due to the more recent requirements of more blogs, more frequent posting, and greater importance attached to blogs.
I believe that the first requirement for a public syndication service is that every blogger gets to specify the URL of their own feed to save the sysadmin the effort of doing routine URL changes. It should be an option to have the server act on HTTP 301 codes and record the new URL in the database. Then the sysadmin would only have to manage adding new bloggers (approving them after they have created an account through a web-based interface) and removing bloggers.
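The redirect handling could look something like the following Python sketch. It assumes the feed URLs are stored in a simple mapping from blogger to URL (standing in for the database) and that the fetcher passes in the HTTP status code and Location header it received. The function name, the mapping, and the example URLs are all hypothetical.

```python
from urllib.parse import urljoin


def apply_redirect(feeds, blogger, status, location):
    """Record a new feed URL, but only for a permanent redirect (HTTP 301).

    Temporary redirects such as 302 are ignored so that a short-lived
    redirect does not overwrite the URL the blogger registered.
    Returns True when the stored URL was changed.
    """
    if status != 301 or not location:
        return False
    # A Location header may be relative, so resolve it against the old URL.
    new_url = urljoin(feeds[blogger], location)
    if new_url == feeds[blogger]:
        return False
    feeds[blogger] = new_url
    return True


# Hypothetical stand-in for the database table of bloggers and feed URLs.
feeds = {"alice": "http://example.org/blog/feed"}
apply_redirect(feeds, "alice", 301, "https://example.org/blog/feed")
```

Making this an option rather than the default would let a sysadmin review the recorded changes before trusting them.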
The problem of polling frequency can be mostly solved by using RPC pings to inform the server of new posts, provided that the RPC mechanism supports removing posts. If removing posts is not supported by the RPC then every blog which has an active post would have to be polled frequently. This would still reduce the amount of polling considerably. For example, there are 319 blogs currently syndicated on Planet Debian, there are 60 posts in the feed, and those posts were written by 41 different people. Because you poll the blogger’s feed URL and not the individual article, frequent polling to detect removal of active articles would mean only 41 polls instead of 319, reducing the polling by a factor of more than 7!
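The selection of feeds and the arithmetic can be sketched in Python as follows. The post records and their feed_url field are hypothetical; only the figures 319 and 41 come from the Planet Debian example.

```python
def feeds_with_active_posts(displayed_posts):
    """Return the distinct feed URLs that have a post currently displayed.

    Several posts from one blogger collapse into a single feed to poll,
    because it is the feed that gets polled and not the article.
    """
    return {post["feed_url"] for post in displayed_posts}


def polling_reduction(total_feeds, active_feeds):
    """Factor by which frequent polling shrinks when only active feeds are polled."""
    return total_feeds / active_feeds


# Hypothetical sample of the posts currently shown on the aggregated page.
posts = [
    {"feed_url": "https://example.org/alice/feed", "title": "First"},
    {"feed_url": "https://example.org/alice/feed", "title": "Second"},
    {"feed_url": "https://example.net/bob/rss", "title": "Hello"},
]
frequent = feeds_with_active_posts(posts)
print(sorted(frequent))

# 319 syndicated blogs, 41 of them with a post on the page.
print(round(polling_reduction(319, 41), 2))  # 7.78
```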
Now even with support for RPC pings there is still a need to poll feeds. One issue is that blogs may experience temporary technical difficulty in sending the RPC, and we don’t want to compel the authors of blog software to try to make the ping as reliable a process as sending email (if that were the requirement then a ping via email might be the best solution). The polling frequency could be set on a per-blog basis according to the request of the blogger, the blog’s availability, and its posting frequency. Someone whose blog has been down for a day (which is not uncommon when considering a population of 300 bloggers) could have their blog polled on a daily basis. Apart from that the polling frequency could be based on the time since the last post. It seems to be a general pattern that hobby bloggers (who comprise the vast majority of bloggers syndicated in Planet installations) often go for weeks at a time with no posts and then release a series of posts when they feel inspired.
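One way such a per-blog schedule might be expressed is sketched below in Python. The 15 minute and four hour intervals echo figures mentioned earlier, and the daily poll for a blog that has been down for a day follows the suggestion above. The one-day and one-week thresholds on time since the last post, the hourly tier, and the function name are assumptions chosen purely for illustration.

```python
from datetime import timedelta


def poll_interval(time_since_last_post, time_unreachable=timedelta(0)):
    """Pick how long to wait before polling a blog again.

    time_since_last_post: age of the newest post seen in the feed.
    time_unreachable: how long the feed has been failing (zero if it works).
    """
    # A blog that has been down for a day or more is only retried daily.
    if time_unreachable >= timedelta(days=1):
        return timedelta(days=1)
    # A blogger in the middle of a burst of posts gets the fastest polling.
    if time_since_last_post < timedelta(days=1):
        return timedelta(minutes=15)
    # Recently active, but not posting today (assumed middle tier).
    if time_since_last_post < timedelta(days=7):
        return timedelta(hours=1)
    # Quiet for a week or more: fall back to a slow poll.
    return timedelta(hours=4)


print(poll_interval(timedelta(hours=3)))
print(poll_interval(timedelta(days=30)))
print(poll_interval(timedelta(hours=3), time_unreachable=timedelta(days=2)))
```

A request from the blogger for a particular frequency could be layered on top as an override of the computed value.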
In terms of software which meets these requirements, the nearest option seems to be the Advogato software mod_virgule [2]. Advogato [3] supports managing accounts with attached RSS feeds and also supports ranking blogs for a personalised view. A minor modification of that code to limit who gets to have their blog archived, and fixing it so that a modified post only has the latest version stored (not both versions as Advogato does) would satisfy some of these requirements. One problem is that Advogato’s method of syndicating blogs is to keep an entire copy of each blog (and all revisions). This goes against the wishes of many bloggers, who demand that Planet installations not keep copies of their content for a long period and not have any permanent archives. Among other things, if there are two copies of a blog post then Google might get the wrong idea as to which is the original.
Does anyone know of a system which does better than Advogato in meeting these design criteria?