
Citing Wikipedia

A meme that has been going around is that you can’t cite Wikipedia.

You Can’t Cite Wikipedia Academically

Now it’s well known and generally agreed that you can’t cite Wikipedia in a scientific paper or other serious academic work. This makes sense firstly because Wikipedia changes, both in the short term (including vandalism) and in the long term (due to changes in technology, new archaeological discoveries, current events, etc). But you can link to a particular version of a Wikipedia page: click on the history tab at the top of the screen and then click on the date of the version for which you want a direct permanent link.

The real reason for not linking to Wikipedia articles in academic publications is that you want to reference the original research rather than a report on it, which makes sense. Of course the downside is that you might reference some data that is in the middle of a 100 page report, in which case you might have to mention the page number as well. Also the summary of the data you desire often simply isn’t available anywhere else; someone might for example take some facts from 10 different pages of a government document and summarise them neatly in a single paragraph on Wikipedia. This isn’t a huge obstacle, it just takes more time to create your own summary with references.

When Wikipedia is Suitable

The real issue however is how serious the document you are writing is and how much time you are prepared to spend on it. If I’m writing a message to a mailing list or a comment on a blog post then I probably won’t bother reading all the primary sources of Wikipedia pages; it would just waste too much of my time. Wikipedia is adequate for the vast majority of mailing list discussions.

If I’m discussing several choices for software with some colleagues we will probably start by reading the Wikipedia pages. If one option doesn’t appear to have the necessary features (according to Wikipedia) then we may ask the vendor whether those features are really missing and if so whether they will be added in the next version – or we may decide that we don’t really need the features in question and modify our deployment plans. Many business decisions are made with incomplete data; time is money and there often isn’t time to do everything you want to do. Using Wikipedia as a primary source for business decisions trades off a little accuracy for a huge time saving. This is significantly better than the old fashioned approach of comparing products by reading their brochures – companies LIE in their advertising!

When writing blog posts the choice of whether to use Wikipedia as a reference depends on the point that you are trying to make and how serious the post is. If the post isn’t really serious or contentious or if the Wikipedia reference is for some facts that are not likely to be disputed then Wikipedia will probably do. For some posts a reference to a primary source will be better.

A blog post that references data that is behind a pay-wall (such as a significant portion of academic papers and news articles) is practically of less use than a post that cites Wikipedia. In most cases Wikipedia references free primary sources on the Internet (although it does sometimes refer to dead tree products and data that is behind a pay-wall). In the minority of cases where the primary references for a Wikipedia page are not available for free on the Internet there will be people searching for freely available references to replace the non-free ones. So if you refer to a Wikipedia page with non-free references a future reader might find that someone has added free references to it.

The Annoying People

One thing that often happens is that an Internet discussion contains no references for anything – it’s all just unsupported assertions. Then if anyone cites Wikipedia someone jumps in with “you can’t cite Wikipedia”. If you want to criticise Wikipedia references then please start by criticising people who state opinions as fact and people who provide numbers without telling anyone where they came from! The Guinness Book of Records (now known as “Guinness World Records”) was devised as a reference to cite in debates in pubs [1]. It seems that most of the people who dismiss references to Wikipedia on the net would prefer that Internet debates have lower requirements for references than a pub debate.

When Wikipedia is cited in an online discussion it is usually a matter of one mouse click to check the references for the data in question. If Wikipedia happens to be wrong then anyone who cares can correct it. Saying “the Wikipedia page you cited had some transcription errors in copying data from primary sources and some of the other data was not attributed, I’ve corrected the numbers and noted that it contains original research” would be a very effective rebuttal to an argument that relies on data in Wikipedia. Saying “you can’t cite Wikipedia” means little, particularly if you happen to be strongly advocating an opposing position while not providing any references.

If one person cites an academic paper and someone else cites Wikipedia then it seems reasonable to assume that the academic paper is the better reference. But when it’s a choice between Wikipedia and no reference then surely Wikipedia should win! Also references to non-free data are not much good for supporting an argument; as far as most people can determine they are just unverified claims – so the issue becomes how much the person citing the non-free reference can be trusted to correctly understand and summarise the data.

Also it has to be considered that not all primary sources are equal. Opinion pieces should be considered to have fairly low value: while they are authoritative for representing the opinion of the person who wrote them, they often prove little else – unless they happen to cite good references, which brings them to the same level as Wikipedia. The main benefit of linking to opinion pieces is that it saves typing and gives a better product for the readers – it’s sometimes easier to find someone else expressing an opinion well than to express it yourself.

So please, don’t criticise me for citing Wikipedia unless others in the discussion are citing better references. If most people are not citing any references or only citing opinion pieces then a Wikipedia page may be the best reference that is being provided!


Web Site Validation

Over the last few days I’ve got this blog and my documents blog to conform to valid XHTML according to the W3C validation service [1].

One significant change that I made was to use lower-case for HTML tags. For about 15 years I’ve been using capitals for tags to make them stand out from content, and my blogs are the latest in a long line of web sites that used that convention. Naturally I wasn’t going to correct 900 posts manually, so I ran a series of SQL commands such as the following on my database server (where X is the WordPress table prefix):

update X_wp_posts set post_content = replace(post_content,'<PRE>','<pre>');

But make sure you have a good backup of your database before running SQL search and replace commands on your blog data.
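Rather than typing one statement per tag, a short script can generate the whole set. This is only a sketch: the tag list and the `wp_` table prefix are illustrative, and it only covers bare tags (tags with attributes such as `<A HREF=...>` still need manual fixing).

```python
# Generate lowercase-tag SQL statements for a list of tags.
# The tag list and the "wp_" prefix are illustrative; adjust for
# your own installation, and only bare tags are handled here.
TAGS = ["PRE", "P", "B", "I", "UL", "OL", "LI", "BLOCKQUOTE"]
PREFIX = "wp_"

def replace_statements(tags, prefix):
    """Return one UPDATE ... REPLACE() statement per opening and closing tag."""
    stmts = []
    for tag in tags:
        for old, new in ((f"<{tag}>", f"<{tag.lower()}>"),
                         (f"</{tag}>", f"</{tag.lower()}>")):
            stmts.append(
                f"update {prefix}posts set post_content = "
                f"replace(post_content,'{old}','{new}');"
            )
    return stmts

for s in replace_statements(TAGS, PREFIX):
    print(s)
```

The output can be piped straight into the mysql client, but the backup warning above applies doubly when running generated SQL.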

After running such commands about 90% of my blog posts conformed, so I only needed to edit about 90 posts to correct things. This process gave some real benefits. One issue is that an apostrophe in a URL must be percent-encoded, otherwise some browsers will link to the desired URL and some will link to a truncated URL. Fixing a couple of variations of this problem resulted in some broken links being fixed. Another issue is that you can’t have paragraphs (<p> tags) within list items; fixing this made some of my posts align correctly. It was a tricky fix – in some cases I had to use <br/> to break up text in a list item and sometimes I replaced lists with different sections delimited by <h3> headings (which is rumored to give better SEO).
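The apostrophe problem can be avoided by percent-encoding URLs before putting them in href attributes. A minimal sketch using the Python standard library (the example.com URL is a placeholder):

```python
from urllib.parse import quote

def encode_url_path(url):
    """Percent-encode characters that break XHTML attribute values,
    leaving the usual URL structure (/, :, ?, =, &) intact."""
    return quote(url, safe="/:?=&")

# An apostrophe becomes %27, so every browser sees the same link target.
print(encode_url_path("http://example.com/o'brien page"))
```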

It would be a really nice WordPress feature to be able to do W3C validation as part of the publishing process; ideally an attempt to publish or schedule a post would result in a message saying “saved as a draft because it’s not valid XHTML” if the checks failed. The source code of the W3C validation software is significantly larger than WordPress [2], but it seems to me that there are two main types of WordPress installations: small ones for personal use (which tend to be on fairly idle servers) and big ones with so much traffic that the resource usage of validation would be nothing compared to the ongoing load.

As there seems to be no way of validating my posts before publication, my best option is the W3C button I now have on my blog. This allows me to validate the page with a click, so while I can’t entirely avoid the risk of publishing a post with invalid XHTML I can at least fix it rapidly enough that hardly anyone will notice.
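Such a pre-publication hook could be scripted against the W3C checker’s JSON interface (the current “Nu” checker accepts a document POSTed to https://validator.w3.org/nu/?out=json and returns a list of messages). A sketch, assuming the public service is acceptable for your volume; the checker can also be self-hosted:

```python
import json
import urllib.request

def count_problems(messages):
    """Split the checker's message list into (errors, warnings)."""
    errors = sum(1 for m in messages if m.get("type") == "error")
    warnings = sum(1 for m in messages if m.get("subType") == "warning")
    return errors, warnings

def validate(html_bytes):
    """Send a document to the W3C checker and return its message list."""
    req = urllib.request.Request(
        "https://validator.w3.org/nu/?out=json",
        data=html_bytes,
        headers={"Content-Type": "text/html; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["messages"]

# Offline demonstration with a canned response, so no network is needed:
sample = [{"type": "error", "message": "Stray end tag"},
          {"type": "info", "subType": "warning", "message": "Consider ..."}]
print(count_problems(sample))
```

A publishing hook would call `validate()` on the rendered post and refuse to publish when the error count is non-zero.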

It also seems like a useful feature to have aggregators like Venus [3] check for valid HTML and not display posts unless they are valid. It’s not a feature that could be enabled immediately (I’m sure that if you click on this link to the W3C validation service [1] from a Planet feed you will see lots of errors and warnings), but once bloggers have had time to fix their installations it would prevent some of the common annoyances of Planet installations. It’s not uncommon on popular Planets to have unmatched tags in a post, which results in significant amounts of the content being bold, underlined, in italics, or (the greatest annoyance) struck-out. I know that this may be a controversial suggestion, but please consider why you are blogging. If you are blogging for the benefit of your readers (which seems to be the case for everyone other than sploggers) then the readers will benefit more by not having a broken post syndicated than they would benefit from having it syndicated and thus messing up the display of many following posts.

The next thing on my todo list in this regard is some accessibility testing. The work that I did to pass the XHTML validation tests has helped to some degree (if nothing else the images now all have alt= descriptions), but I expect that it will be a lot of work. The WordPress Codex has a page about accessibility; I haven’t read all of it yet [4].

Does anyone have any recommendations for free automated systems that check web sites for accessibility? What would be ideal is a service that allows different levels of warnings, so instead of trying to fix all problems at once I could start by quickly fixing the most serious problems on the most popular posts and finish the job at some later date.

WordPress Plugins

I’ve just added the WordPress Minify [1] plugin to my blog. Its purpose is to combine CSS and Javascript files and to optimise them for size; it’s based on the Minify project [2]. On my documents blog this takes the main page from 313KB uncompressed, 169KB compressed, and a total of 23 HTTP transfers to 306KB uncompressed, 117KB compressed, and 21 HTTP transfers. In each case 10 of the HTTP transfers are from Google for advertising. It seems that a major obstacle to optimising the web page load times is Google adverts – of course Google has faster servers than I do so I guess it’s not that much of a performance problem. The minify plugin caches its data files and I had to really hack at the code to make it use /var/cache/wordpress-minify – a subdirectory of the plugins directory was specified in many places.
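The uncompressed-versus-compressed numbers above are easy to reproduce for any page: compare the raw byte count with the size after gzip compression, which is roughly what a server with mod_deflate sends to a browser that accepts gzip. A sketch (the sample page is synthetic):

```python
import gzip

def transfer_sizes(body: bytes):
    """Return (raw_size, gzipped_size) for a page body, using the
    moderate compression level a web server would typically use."""
    raw = len(body)
    compressed = len(gzip.compress(body, compresslevel=6))
    return raw, compressed

# Synthetic page body with lots of repetition, like real HTML has:
sample = b"<html>" + b"<p>hello world</p>" * 1000 + b"</html>"
raw, comp = transfer_sizes(sample)
print(f"{raw} bytes raw, {comp} bytes gzipped")
```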

deb http://www.coker.com.au lenny wordpress
I’ve added a wordpress-minify package to my repository of WordPress packages for Debian/Lenny with the above APT line. I’ve also got the following packages:
adman
all-in-one-seo-pack
google-sitemap-generator
openid
permalink-redirect
stats
subscribe-to-comments
yubikey

The Super Cache [3] plugin has some nice features. It generates static HTML files that are served to users who aren’t logged in and who haven’t entered a comment, which saves significant amounts of CPU time under high load. The problem is that installing it requires modifying the main .htaccess file, adding a new .htaccess file in the plugins directory, and lots of other hackery. The main reason for this is to avoid running any PHP code in the most common cases, which would be good for really heavy use. Also PHP “safe mode” has to be disabled for some reason, which is something I’d rather not do.

The Cache [4] plugin was used as the base for the Super Cache plugin. It seems less invasive, but requires the ability to edit the config file. Getting it into a shape that would work well in Debian would take more time than I have available at the moment. This combined with the fact that my blog will soon be running on a system with two quad-core CPUs that won’t be very busy means that I won’t be packaging it.

If anyone would like to Debianise the Cache or Super Cache plugin then I would be happy to give them my rough initial efforts as a possible starting point.

I’m not planning to upload any of these packages to Debian, it would just add too much work to the Debian security team without adding enough benefit.


Help A Reporter Out

I recently discovered the Help A Reporter Out [1] service.

Subscribers receive three messages every business day, each of which contains about 40 queries from journalists. People who subscribe can contact the journalist to provide information or offer an interview. Any journalist can send in a query. Peter Shankman runs this; it seems that it helps promote his other business ventures, and there is also a paid advert at the top of every message.

This has to be one of the best services that I have ever unsubscribed from! The vast majority of the questions are about topics that are not relevant to me – there are typically about 6 IT-related questions per day out of 100+.

I would like to see a “Help An IT Reporter Out” service. It could consist of a single email per day which might have 10 questions due to a more focussed market. This would take less time to skim read which would make it more appealing to most people who are doing interesting things with computers. Then of course it could allow targeted messages related to different IT sectors (servers, desktops, PDA/phones), technologies, etc. Exporting the questions to Twitter would be good for people who like that sort of thing.

If anyone wants to start such a service then let me know and I’ll promote it on my blog.


Web Hosting After Death

Steve Kemp writes about his concerns for what happens to his data after death [1]. Basically everything will go away when bills stop being paid. If you have hosting on a monthly basis (e.g. a Xen DomU) then when the bank account used for the bill payments is locked (maybe a week after death) the count-down to hosting expiry starts. As noted in Steve’s post it is possible to pay for things in advance, but everything will run out eventually.

One option is to have relatives keep the data online. With hard drives getting bigger all the time it wouldn’t be difficult to backup the web sites for everyone in your family to a USB flash device and then put it online at a suitable place. Of course that relies on having relatives with the skill and interest necessary.

The difficult part is links, if the domain expires then links will be broken. One way of alleviating this would be to host content with Blogger, Livejournal, or other similar services. But then instead of the risk of a domain being lost you have the risk of a hosting company going bankrupt.

It seems to me that the ideal solution would be to have a hosting company take over the web sites of deceased people and put adverts on them to cover the hosting costs. As the amount of money being spent on Internet advertising will only increase while the costs of hosting steadily go down it seems that collecting a lot of content for advertising purposes would be a good business model. If the web sites of dead people are profitable then they will remain online.

It wouldn’t be technically difficult to extract the data from a blog server such as WordPress (either from a database dump or by crawling the web site), change the intra-site links to point to a different domain name, and then put it online as static content with adverts. If a single company (such as Google) had a large portion of the market of hosting the web sites of dead people then when someone died and had their web site transferred, the links on the other sites maintained by the same company could be automatically adjusted to match. A premium service from such a company could be to manage the domain. If they were in the domain registrar business it would be easy to allow someone to pay for 10 or 20 years after their death, possibly with a portion of the advertising revenue going towards extending the domain registration. I think that this idea has some business potential; I don’t have the time or energy to implement it myself and my clients are busy on other things, so I’m offering it to the world.
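The link-rewriting step can be sketched very simply: point every intra-site link at the archive’s domain. A real implementation would use an HTML parser rather than string replacement, and both domain names here are placeholders:

```python
def rewrite_links(html: str, old_base: str, new_base: str) -> str:
    """Replace every occurrence of the old site's base URL with the
    archive's base URL. Naive on purpose: a production version would
    parse the HTML and only touch href/src attributes."""
    return html.replace(old_base, new_base)

page = '<a href="http://blog.example.org/2008/post/">a post</a>'
print(rewrite_links(page, "http://blog.example.org/",
                    "http://archive.example.net/blog.example.org/"))
```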

Cory Doctorow has written an article for the Guardian about a related issue – how to allow the next of kin to access encrypted data when someone is dead [2]. One obvious point that he missed is the possibility that he might forget his own password, a small injury from a car accident could cause that problem.

It seems strange to me that someone would have a great deal of secret data that needs strong encryption but yet has some value after they are dead. Archives of past correspondence to/from someone who is dead are one category of secret data that is really of little use to anyone unless the deceased was particularly famous. Probably the majority of encrypted data from a dead person would be best wiped.

For the contents of personal computers the best strategy would probably be to start by dividing the data into categories according to the secrecy requirements. Publish the things that aren’t secret, store a lot of data unencrypted (things that are not really secret but you merely don’t want to share them with the world), have a large encrypted partition that will have its contents lost when you die, and have a very small encrypted device that has bank passwords and other data that is actually useful for the executors of the will.

One thing that we really need is to have law firms that have greater technical skills. It would be good if the law firms that help people draw up wills could advise them on such issues and act as a repository for such data. It seems to me that the technical skills that are common within law firms are not adequate for the task of guarding secret electronic data for clients.


How not to write the way dumb people think smart people write

Don Marti has written an amusing and informative little post about the way that ill-educated people use phrases in print [1]. The one example that didn’t fit with the tone of his post was the use of “half mast” to refer to a flag on land; that one is used often enough (both in print and verbally) that intelligent people will make that mistake.

Here are some that he missed:

  1. The expression “bailing them out” can be used to refer to someone who helps someone escape from a difficult situation. That might refer to bailing water from a leaky boat or might refer to posting bail to secure someone’s release from prison while awaiting trial. It certainly doesn’t refer to baling which is the creation of a bundle (such as a bale of hay). I’m sure that a farmer would appreciate some help at baling time, but such help would hardly qualify as rescue from a difficult situation.
  2. The expression “woe is me” is used (usually in a sarcastic manner) to refer to someone who feels that their situation is sorrowful. It is not “whoah is me”; I suspect that the word whoah was invented long after the expression “woe is me” became commonly used.
  3. Made up words such as worserer and worstest. I can understand that people such as Keith Olbermann [2] may find it a challenge to describe some of the bad people in the world, but he is articulate enough to rise to that challenge while using real words.

Can anyone add any more to this list?


Feeds and Banning from Planets

Stewart Smith has written about the removal of a blog from Planet Linux Australia [1] due to publishing a list of URLs that the Australian government wants to censor.

The first point I want to make is that even if you had a list with thousands of entries that are not likely to offend anyone or incur any legal liability, it still wouldn’t be suitable for syndication on most Planet feeds. The correct thing to do is to have a paragraph describing the list and why people would want to read it, and then use the MORE feature of your blog so that the rest isn’t in the RSS feed. If you use WordPress, which seems to have a broken MORE function, that would mean hosting the list somewhere else.

In regard to the specific post: in a comment on Stewart’s post Matt suggests that the Planet software somehow filter out certain blog posts. I am not aware of any way of doing that other than through code changes; Matt could submit some patches to allow that sort of thing.

One thing that would be really good would be an exclusion tag or category in a blog feed. So for example you could have feed URLs such as /feed/lca which would be configured to list all posts without the tag not-lca. Another way for a blogger to do this would be to use Yahoo Pipes [2]. The people who run a Planet should be prepared to take any feed URL, and it would not be difficult for a blogger to create a pipe that excludes all items that have “NSFW” in the title (or any other possible way of flagging them).
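The title-based filtering described above is a few lines of code with any XML library. A sketch for RSS 2.0 using the Python standard library (Atom feeds would need different element names, and the sample feed is invented):

```python
import xml.etree.ElementTree as ET

def strip_nsfw(rss_text: str) -> str:
    """Return the RSS feed with every item whose title contains
    "NSFW" removed. Only handles RSS 2.0 <channel>/<item> layout."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    for item in list(channel.findall("item")):
        title = item.findtext("title") or ""
        if "NSFW" in title:
            channel.remove(item)
    return ET.tostring(root, encoding="unicode")

feed = """<rss version="2.0"><channel><title>demo</title>
<item><title>A normal post</title></item>
<item><title>NSFW: link list</title></item>
</channel></rss>"""
print(strip_nsfw(feed))
```

Either the blogger or the Planet operator could run such a filter; doing it on the Planet side means it works for every syndicated blog at once.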

A final option is to have multiple blogs. I have a blog for documents that I regularly update [3]. Many of those documents had been plain HTML files edited with vi for years before I started blogging, but WordPress is a reasonable CMS and as I use it for blogging it made sense to use it for other documents too. WordPress has no good option for managing two types of documents: date-based ones (regular blog posts with the date in the URL) and non-date-based ones (which change periodically and have different date stamps). There are WordPress pages, but the support for having moderate numbers of pages is not great. Also on my document blog articles will often appear as new because I change the date when updating them. Anyone who is interested in seeing new versions of the documents is welcome to subscribe to the feed for my document blog, but I expect that most people don’t want to.

The Debian WordPress package (as of last time I used it) and my fork of the Debian WordPress package have great support for multiple blogs. There is WordPress-MU for bulk blog hosting, but that is only designed for people who want to run something like LiveJournal or Blogger. If you just want a few blogs for friends and relatives then the regular Debian WordPress package will do the job well.

Some bloggers maintain two blogs, one for public things and another for close friends and relatives (people who ARE interested in what they ate for breakfast). Having one blog for the NSFW material would be a reasonable thing to do for certain bloggers.

Finally, while I doubt that someone who runs a Planet installation faces any legal liability, there is also the issue of PR liability. From a PR perspective I think it’s best for the reputation of Linux users in Australia for certain things not to appear on Planet Linux Australia. That said, it would be good if there was a publicly documented process for removing and reinstating blogs. There will obviously be many differences of opinion as to what is too risky to allow on the Planet, so we should expect that from time to time feeds will be temporarily removed. When that happens, what does a blogger have to do to be syndicated again?

Update:

A comment has revealed a way of filtering RSS feeds via the feed URLs used by WordPress. A URL such as /feed/?cat=-X will give a feed of all articles that are not in category number X. Multiple categories can be specified when separated by commas. So this allows WordPress users to exclude their NSFW category from Planet Linux Australia.


The FAIL Meme

One of the recent poor trends in mailing list discussions is to reply to a message with a comment such as “FAIL” or “EPIC FAIL”.

The FAIL meme has been around for a while and actually does some good in some situations; Slate has a good article about it [1]. The first example cited in that article is that ‘when Ben Bernanke and Henry Paulson testified before the Senate banking committee last month about Paulson’s proposed bailout bill, a demonstrator in the audience held up an 8.5-by-11 piece of paper with one word scrawled on it in block letters: “FAIL.”‘. This is an effective form of political demonstration: short words generally work well on placards (if only because the letters can be larger and therefore read from a greater distance) and anyone can understand the meaning of “FAIL” in that context.

There are some blogs dedicated to publicising supposed failures, failblog.org and ShipmentOfFail.com are two examples. I cite these as supposed failures because some of the pictures that they contain are obviously staged. It’s basically an Internet equivalent of the “Funniest Home Videos” shows that I never watched because they were not particularly funny.

So using the word “FAIL” on its own can be an effective form of political protest and can be used for mildly amusing web sites. But where it falls down is when it’s applied to a discussion that involves people who are from different cultures or have different levels of background knowledge – which covers most mailing list discussions.

Something that might be obviously wrong to some people is often not obvious at all to others. For example being forced to reboot a computer for any reason other than a kernel upgrade seems obviously wrong to me (and to most people who use Linux or other Unix systems) but Windows users seem happy to reboot machines after applying patches or upgrades. So writing a message with “FAIL” as the only word in a discussion with Windows users would not be productive. It could however be reasonable to forward a link to a page on a Microsoft web site to Linux people for their amusement with “FAIL” as the only comment – anyone who would find the link in question amusing would require no more explanation.

Sometimes in a debate someone will write a message that only says “FAIL”; this is a very unconvincing argument that will not convince the opposition or any onlookers.

Generally it seems that using “FAIL” in a discussion with other like-minded people when talking about someone outside your group for the purpose of amusement can be effective. But any other use is going to be a “FAIL”.

As a more general rule single-word messages seem to have little value apart from certain limited situations. I have identified the following seven scenarios where a single word message is useful. Can anyone think of any others?

  1. Code review – someone posts code (or design for code) and people who like it will write “ACK” or something similar.
  2. Arranging a meeting – the question “who wants to meet for lunch tomorrow” has “me” as a valid answer.
  3. Voting – “yes” and “no” are valid answers for a poll, but a mailing list or forum probably isn’t the best place for it.
  4. Citing an example to refute a claim – often a single word won’t be a great response but may be adequate to prove a point.
  5. Answering a request for a recommendation – if asked to recommend a laptop I might say “Thinkpad” or if asked to recommend a server I might say “HP”. Both those answers are poor (I recommend the EeePC for netbooks and Dell for small/cheap servers), so while such an answer would be useful it would be below my usual quality standards for email (I prefer to write at least two paragraphs explaining why I recommend something).
  6. Informing people that something has been done by replying to a request with the word “Done”.
  7. Agreeing to a contract or proposal with “OK” or “Yes”.

Update: I added another two reasonable uses of single word messages.


Planet Flooding

One annoying thing that happens regularly is “Planet Flooding”. This is when one of the many blogs that is syndicated by a public Planet installation changes its time stamps and has 10 or more old posts appear as new. It’s doubly annoying when the blogger in question knows about the problem.

Planet Flooding is easy to solve. If you are changing your blogging software or doing something else that may result in old posts appearing to be new then all you have to do is configure your blog to include a small number of posts (maybe two or three) in the RSS feed. Seeing two old posts re-appear plus a new post explaining it is not going to annoy anyone.

If you run a Planet (or Venus) installation then configure it to have a maximum number of posts per feed. For a Planet that syndicates feeds from a number of individuals and only includes a few days of traffic (which is probably a category that covers most Planets) there is no need for more than four items per feed.
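The per-feed cap is simple to express in code. A sketch of the idea, where items are plain (feed_name, timestamp, title) tuples rather than the full entries a real aggregator carries:

```python
def cap_per_feed(items, max_items=4):
    """Keep only the max_items newest entries from each feed, so one
    misbehaving blog can't flood the aggregated output."""
    newest_first = sorted(items, key=lambda i: i[1], reverse=True)
    kept, counts = [], {}
    for item in newest_first:
        feed = item[0]
        if counts.get(feed, 0) < max_items:
            kept.append(item)
            counts[feed] = counts.get(feed, 0) + 1
    return kept

# A "noisy" feed with 20 re-dated posts and a "quiet" feed with one new post:
items = [("noisy", t, f"old post {t}") for t in range(20)]
items.append(("quiet", 100, "new post"))
for item in cap_per_feed(items):
    print(item)
```

With the cap applied the noisy feed contributes only its four newest items, so the quiet feed’s post is not pushed off the page.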

For a severe case of Planet flooding (e.g. posts which always appear as being the newest and are therefore at the top of the list) the thing to do is to immediately remove the feed until the problem is fixed. Allowing a broken blog configuration to annoy other people is not doing any favours for the blogger in question; it simply drives people to filter the Planet to exclude the articles by that blogger. Yes, it does take some work to adjust the configuration of the Planet, but that is surely no more work than replying to email rejecting requests for the configuration to be adjusted.

The first aim of running a blog or a Planet should be to make it readable; Planet flooding breaks this for the Planet and for the blogger who caused it. It is a technical problem and needs a technical solution (which can be temporarily removing the blog from the Planet syndication list).


OpenID Delegation

I’ve just installed Eran Sandler’s OpenID Delegation Plugin [1]. This means that I can now use my blog URL for OpenID authentication. I’ve also included the plugin in my WordPress repository (which among other things has the latest version of WordPress). One thing that I consider to be a bug in Eran’s plugin is the fact that it only adds the OpenID links to the main URL. This means for example that if I write a blog comment and want to refer to one of my own blog posts on the same topic (which is reasonably common – after more than two years of blogging and almost 700 posts I’ve probably written a post that is related to every topic I might want to comment on) then I can’t put that post’s URL in the URL field. The problem here is that URLs in the body of a blog comment generally increase the spam-score (I use this term loosely to refer to a variety of anti-spam measures – I am not aware of anything like SpamAssassin being used on blog comments), and not having OpenID registration does the same. So with the current functionality of Eran’s plugin I will potentially suffer in some way any time I want to enter a blog comment that refers to a particular post I wrote.
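For reference, what such a plugin emits is the standard OpenID 1.1 delegation markup in the page head: two link elements naming the provider’s server and your identity there. The URLs below are placeholders; use whatever your OpenID provider gives you. (OpenID 2.0 uses the equivalent rel values openid2.provider and openid2.local_id.)

```html
<!-- OpenID 1.1 delegation links; put these in the <head> of the page
     that serves as your identity URL. Both URLs are illustrative. -->
<link rel="openid.server" href="https://openid.example.com/server" />
<link rel="openid.delegate" href="https://yourname.openid.example.com/" />
```

The bug described above amounts to these links being present only on the front page rather than on every post’s page.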

deb http://www.coker.com.au etch wordpress

My WordPress Debian repository is currently available with the above APT repository. While it specifies etch it works with Lenny too (my blog currently runs on Lenny). I will eventually change it to use lenny in the name.

For the OpenID server I am currently using the OpenID service provided by Yubico as part of the support for their Yubikey authentication token [2] (of which I will write more at a later date). I think that running their own OpenID server was a great idea; it doesn’t cost much to run such a service and it gives customers an immediate way of using their key. I expect that there are more than a few people who would be prepared to buy a Yubikey for the sole purpose of OpenID authentication and signing in to a blog server (which can also be via OpenID if you want to do it that way). I plan to use my Yubikey for logging in to my blog, but I still have to figure out the best way of doing it.

One thing that has been discussed periodically over the years has been the topic of using smart-cards (or some similar devices) for accessing Debian servers and securing access to GPG keys used for Debian work by developers who are traveling. Based on recent events I would hazard a guess that such discussions are happening within the Fedora project and within Red Hat right now (if I worked for Red Hat I would be advocating such things). It seems that when such an idea is adopted a logical extension is to support services that users want such as OpenID at the same time, if nothing else it will make people more prone to use such devices.

Disclaimer: Yubico gave me a free Yubikey for the purpose of review.

Update: The OpenIDEnabled.com tool to test OpenID is useful when implementing such things [3].