<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>etbe - Russell Coker &#187; Ha</title>
	<atom:link href="http://etbe.coker.com.au/category/ha/feed/" rel="self" type="application/rss+xml" />
	<link>http://etbe.coker.com.au</link>
	<description>Linux, politics, and other interesting things</description>
	<lastBuildDate>Tue, 07 Feb 2012 07:26:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>5 Principles of Backup Software</title>
		<link>http://etbe.coker.com.au/2012/02/07/5-principles-backup/</link>
		<comments>http://etbe.coker.com.au/2012/02/07/5-principles-backup/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 07:26:04 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=3154</guid>
		<description><![CDATA[Everyone agrees that backups are generally a good thing. But it seems that there is a lot less agreement about how backups should work. Here is a list of 5 principles of backup software that seem to get ignored most of the time: (1/5) Backups should not be Application Specific It&#8217;s quite reasonable for people [...]]]></description>
			<content:encoded><![CDATA[<p>Everyone agrees that backups are generally a good thing. But it seems that there is a lot less agreement about how backups should work. Here is a list of 5 principles of backup software that seem to get ignored most of the time:</p>
<h3>(1/5) Backups should not be Application Specific</h3>
<p>It&#8217;s quite reasonable for people to want to extract data from a backup on a different platform. Maybe someone will want to extract data a few decades after the platform becomes obsolete. I believe that vendors of backup software have an ethical obligation to make it possible for customers to get their data out with minimal effort regardless of the circumstances.</p>
<p>Often when writing a backup application there will be good reasons for not using the existing formats for data storage (tar, cpio, zip, etc). But ideally any data store which involves something conceptually similar to a collection of files in one larger file will use one of those formats. There have been backward compatible extensions to tar and zip for SE Linux contexts and for OS/2 EAs &#8211; the possibility of extending archive file formats with no consequence other than warnings on extraction with an unpatched utility has been demonstrated.</p>
<p>A backup which doesn&#8217;t involve source files (EG the contents of some sort of database) should be in a format that can be easily understood and parsed. Well designed XML is generally a reasonable option. In general the format should be plain text that is readable and easy to understand, optionally compressed with a common compression utility (pkzip is a reasonable choice).</p>
<h3>(2/5) Data Store Formats should be Published</h3>
<p>For every data store there should be public documentation about its format to allow future developers to write support for it. It really isn&#8217;t difficult to release some commented header files so that people can easily determine the data structures. This applies to all data stores, including databases and filesystems. If I suddenly find myself with a 15yo image of an NTFS filesystem containing a proprietary database I should be able to find official header files for the version of NTFS and the database server in question so I can decode the data if it&#8217;s important enough.</p>
<p>When an application vendor hides the data formats it gives the risk of substantial data loss at some future time. Imposing such risk on customers to try and prevent them from migrating to a rival product is unethical.</p>
<h3>(3/5) Backups should be forward and backward compatible</h3>
<p>It is entirely unreasonable for a vendor to demand that all their users install the latest versions of their software. There are lots of good reasons for not upgrading, including hardware not supporting new versions of the OS, lack of Internet access to perform the upgrade, application compatibility, and just liking the way the old version works. Even for the case of a critical security fix it should be possible to restore data without applying the fix.</p>
<p>For any pair of versions of software that are only separated by a few versions it should be possible to backup data from one and restore to the other. Even if the data can&#8217;t be used directly (EG a backup of AMD64 programs that is restored on an i386 system) it should still be accessible. If a new version of the software doesn&#8217;t support the ancient file formats then it should be possible for the users to get a slightly older version which talks to both the old and new versions.</p>
<p>Backups made on 64bit systems running the latest development version of Linux and on 10yo 32bit proprietary Unix systems are interchangeable. Admittedly Unix is really good at preserving file format compatibility, but there is no technical reason why other systems can&#8217;t do the same. Source code to cpio, tar, and gzip is freely available!</p>
<p>Apple TimeMachine fails badly in this regard: even a slightly older version of Mac OS can&#8217;t do a restore. It is however nice that most of the TimeMachine data is a tree of files which could be just copied to another system.</p>
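<p>As a rough sketch of how little is needed for a backup that can be read almost anywhere, the shell functions below write a directory to a gzip-compressed tar archive and unpack it again. The function names <b>backup_dir</b> and <b>restore_dir</b> are made up for this illustration, and it assumes a tar that accepts the common -z option (GNU tar and bsdtar both do).</p>

```shell
# backup_dir SRC_DIR ARCHIVE
# Store SRC_DIR (by its base name) in a gzip-compressed tar archive.
backup_dir() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# restore_dir ARCHIVE DEST_DIR
# Unpack the archive below DEST_DIR, creating DEST_DIR if needed.
restore_dir() {
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}
```

<p>Passing absolute paths for the archive avoids any doubt about which directory it ends up in.</p>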
<h3>(4/5) Backup Software should not be Dropped</h3>
<p>Sony Ericsson has made me hate them even more by putting the following message on their update web site:</p>
<p><b>The Backup and Restore app will be overwritten and cannot be used to restore data. Check out Android Market for alternative apps to back up and restore your data, such as MyBackup.</b></p>
<p>So if you own a Sony Ericsson phone and it is lost, stolen, or completely destroyed and all you have is a backup made by the Sony Ericsson tool then the one thing you absolutely can&#8217;t do is to buy a new Sony Ericsson phone to restore the data.</p>
<p>I believe that anyone who releases backup software has an ethical obligation to support restoring to all equivalent systems. How difficult would it be to put a new free app in the Google Market that has as its sole purpose recovering old Sony Ericsson backups onto newer phones? It really can&#8217;t be that difficult, so even if they don&#8217;t want to waste critical ROM space by putting the feature in all new phones they can make it available to everyone who needs it. When compared to the cost of developing a new Android release for a series of phones the cost of writing such a restore program would be almost nothing.</p>
<p>It is simply mind-boggling that Sony Ericsson go against their own commercial interests in this regard. Surely it would make good business sense to be able to sell replacements for all the lost and broken Sony Ericsson phones, but instead customers who get burned by broken backups are given an incentive to buy a product from any other vendor.</p>
<h3>(5/5) The greater the control over data the greater the obligation for protecting it</h3>
<p>If you have data stored in a simple and standard manner (EG the /DCIM directory containing MP4 and JPEG files that is on the USB accessible storage in every modern phone) then IMHO it&#8217;s quite OK to leave customers to their own devices in terms of backups. Typical users can work out that if they don&#8217;t backup their pictures then they risk losing them, and they can work out how to do it.</p>
<p>My Sony Ericsson phones have data stored under /data (settings for Android applications) which is apparently only accessible as root. Sony Ericsson have denied me root access which prevents me running backup programs such as Titanium Backup, therefore I believe that they have a great obligation to provide a way of making a backup of this data and restoring it on a new phone or a phone that has been updated. To just provide phone upgrade instructions which tell me that my phone will be entirely wiped and that I should search the App Market for backup programs is unacceptable.</p>
<p>I believe that there are two ethical options available to Sony Ericsson at this time, one is to make it easy to root phones so that Titanium Backup and similar programs can be used, and the other option is to release a suitable backup program for older phones. Based on experience I don&#8217;t expect Sony Ericsson to choose either option.</p>
<p>Now it is also a bad thing for the Android application developers to make it difficult or impossible to backup their data. For example the Wiki for one Android game gives instructions for moving the saved game files to a new phone which starts with &#8220;root your phone&#8221;. The developers of that game should have read the Wiki, realised that rooting a phone for the mundane task of transferring saved game files is totally unreasonable, and developed a better alternative.</p>
<p>The best thing for developers to do is to allow the users to access their own data in the most convenient manner. Then it becomes the user&#8217;s responsibility to manage it and they can concentrate on improving their application.</p>
<h3>Why Freedom is Important</h3>
<p>Installing CyanogenMod on my Galaxy S was painful, but having root access so I can do anything I want is a great benefit. If phone vendors would do the right thing then I could recommend that other people use the vendor release, but it seems that vendors can be expected to act unethically. So I can&#8217;t recommend that anyone use an un-modded Android phone at any time. I also can&#8217;t recommend ever buying a Sony Ericsson product, not even when it&#8217;s really cheap.</p>
<p><a href="http://www.dataliberation.org/">Google have done a great thing with their Data Liberation Front [1]</a>. Not only are they providing access to the data they store on our behalf (which is a good thing) but they have a mission statement that demands the same behavior from other companies &#8211; they make it an issue of competitive advantage! So while Sony Ericsson and other companies might not see a benefit in making people like me stop hating them, failing to be as effective in marketing as Google is a real issue. Data Liberation is something that should be discussed at board elections of IT companies.</p>
<p>Keep in mind the fact that ethics are not just about doing nice things, they are about establishing expectations of conduct that will be used by people who deal with you in future. Sony Ericsson has shown that I should expect that they will treat the integrity of my data with contempt and I will keep this in mind every time I decline an opportunity to purchase their products. Google has shown that they consider the protection of my data as an important issue and therefore I can be confident when using and recommending their services that I won&#8217;t get stuck with data that is locked away.</p>
<p>While Google has demonstrated that corporations can do the right thing, the vast majority of evidence suggests that we should never trust a corporation with anything that we might want to retrieve when it&#8217;s not immediately profitable for the corporation. Therefore avoiding commercial services for storing important data is the sensible thing to do.</p>
<ul>
<li>[1]<a href="http://www.dataliberation.org/"> http://www.dataliberation.org/</a></li>
</ul>
<p>Related posts:</p><ol>
<li><a href='http://etbe.coker.com.au/2011/11/21/galaxy-xperia-android-network/' rel='bookmark' title='Galaxy S vs Xperia X10 and Android Network Access'>Galaxy S vs Xperia X10 and Android Network Access</a> <small>Galaxy S Review I&#8217;ve just been given an indefinite loan...</small></li>
<li><a href='http://etbe.coker.com.au/2012/01/04/standardising-android/' rel='bookmark' title='Standardising Android'>Standardising Android</a> <small>Don Marti wrote an amusing post about the lack of...</small></li>
<li><a href='http://etbe.coker.com.au/2011/02/19/on-burning-platforms/' rel='bookmark' title='On Burning Platforms'>On Burning Platforms</a> <small>Nokia is in the news for it&#8217;s CEO announcing that...</small></li>
</ol>]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2012/02/07/5-principles-backup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reliability of RAID</title>
		<link>http://etbe.coker.com.au/2012/02/06/reliability-raid/</link>
		<comments>http://etbe.coker.com.au/2012/02/06/reliability-raid/#comments</comments>
		<pubDate>Sun, 05 Feb 2012 14:46:36 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=3151</guid>
		<description><![CDATA[ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805">ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]</a>. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.</p>
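<p>The arithmetic behind this kind of prediction can be sketched in a few lines of shell. The rate of one unrecoverable read error per 10^14 bits is an assumption here (it is a commonly quoted specification for consumer SATA disks), as is the simplification that errors are independent, so that the chance of at least one error while reading n bits is about 1 - exp(-n * rate). The function name is hypothetical.</p>

```shell
# rebuild_failure_probability BYTES_TO_READ [ERRORS_PER_BIT]
# Approximate chance of hitting at least one unrecoverable read error
# while reading BYTES_TO_READ during a rebuild.
# ERRORS_PER_BIT defaults to 1e-14, which is an assumed spec-sheet value.
rebuild_failure_probability() {
    awk -v bytes="$1" -v rate="${2:-1e-14}" \
        'BEGIN { printf "%.3f\n", 1 - exp(-bytes * 8 * rate) }'
}
```

<p>Under those assumptions, reading 2TB (2000000000000 bytes) from the surviving disks gives a result of about 0.148, so roughly one rebuild in seven would hit a read error.</p>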
<p>Of course most systems in the field aren&#8217;t using even RAID-6. All the most economical hosting options involve just RAID-1, and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct: you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).</p>
<p>For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn&#8217;t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.</p>
<p><a href="http://storagemojo.com/2008/02/18/latent-sector-errors-in-disk-drives/">Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of &#8220;consumer&#8221; disks had such errors during the 32 month study period [2]</a>. This is a concern as I run enough RAID-1 systems with &#8220;consumer&#8221; disks that it is very improbable that I&#8217;m not getting such errors. So the question is, how can I discover such errors and fix them?</p>
<p>In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn&#8217;t send an email to alert the sysadmin! <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658701">I have filed Debian bug #658701 with a patch to make mdadm send email about this</a>. But this really isn&#8217;t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn&#8217;t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won&#8217;t know which disk is failing.</p>
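<p>One way to at least notice mismatches after a check has run is to read the <b>mismatch_cnt</b> file that the Linux md driver exposes in sysfs for each array. The following is a minimal sketch and not the patch from the bug report; the function name and the optional argument for an alternate sysfs root (handy for testing) are hypothetical. As noted above, it can say which array had mismatches but not which disk was wrong.</p>

```shell
# report_mismatches [SYSFS_ROOT]
# Print "mdX: COUNT" for every md array whose last check or repair
# recorded a non-zero mismatch count. SYSFS_ROOT defaults to /sys/block.
report_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue      # no arrays, or file not readable
        count=$(cat "$f")
        if [ "$count" -ne 0 ]; then
            dev=${f#"$root"/}        # strip the sysfs root prefix
            dev=${dev%%/*}           # keep only the array name, EG md0
            echo "$dev: $count"
        fi
    done
}
```

<p>The output could be piped to a mail command from cron so that the sysadmin hears about it.</p>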
<p>Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used; in one case the kernel will check 2.7TB of RAID even when there&#8217;s only 470G in use!</p>
<h3>The BTRFS Filesystem</h3>
<p><a href="http://btrfs.ipv5.de/">The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]</a>. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!</p>
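<p>The idea of using a checksum to decide which mirror copy to trust can be illustrated outside the filesystem. This is only a conceptual sketch: it uses whole-file SHA-256 sums from the sha256sum utility, which is not how BTRFS stores or verifies its checksums, and the function name is hypothetical.</p>

```shell
# pick_good_copy EXPECTED_SHA256 COPY...
# Print the first copy whose checksum matches the recorded one.
# Returns non-zero if every copy fails verification.
pick_good_copy() {
    expected="$1"
    shift
    for copy in "$@"; do
        actual=$(sha256sum "$copy" | cut -d' ' -f1)
        if [ "$actual" = "$expected" ]; then
            echo "$copy"
            return 0
        fi
    done
    return 1
}
```

<p>With plain RAID-1 there is no recorded checksum to compare against, which is why two disagreeing copies can&#8217;t be told apart.</p>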
<p>I&#8217;ve just done a quick test of this. I created a filesystem with the command &#8220;<b>mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?</b>&#8221; and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem &#8211; that worked correctly, although I was disappointed that it didn&#8217;t report any errors. I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command &#8220;<b>btrfs scrub start .</b>&#8221; and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command &#8220;<b>btrfs scrub status .</b>&#8221; just gave me a count of the corrected errors and didn&#8217;t mention which device had the errors.</p>
<p>It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.</p>
<p>BTRFS is rumored to have performance problems. I will test this but don&#8217;t have time to do so right now. Anyway I&#8217;m not always particularly concerned about performance; I have some systems where reliability is important enough to justify a performance loss.</p>
<h3>BTRFS and Xen</h3>
<p>The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.</p>
<p>This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.</p>
<h3>BTRFS and DRBD</h3>
<p>The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.</p>
<p>It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!</p>
<h3>Comparing Risks</h3>
<p>I don&#8217;t want to use BTRFS in production now because of the risk of bugs. While it&#8217;s unlikely to have really serious bugs it&#8217;s theoretically possible that a bug could deny access to data until kernel code is fixed and it&#8217;s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it&#8217;s almost certain that I will lose small amounts of data and it&#8217;s most probable that I have silently lost data on many occasions without realising.</p>
<ul>
<li>[1]<a href="http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805"> http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805</a></li>
<li>[2]<a href="http://storagemojo.com/2008/02/18/latent-sector-errors-in-disk-drives/"> http://storagemojo.com/2008/02/18/latent-sector-errors-in-disk-drives/</a></li>
<li>[3]<a href="http://btrfs.ipv5.de/"> http://btrfs.ipv5.de/</a></li>
</ul>
<p>Related posts:</p><ol>
<li><a href='http://etbe.coker.com.au/2008/10/14/some-raid-issues/' rel='bookmark' title='Some RAID Issues'>Some RAID Issues</a> <small>I just read an interesting paper titled An Analysis of...</small></li>
<li><a href='http://etbe.coker.com.au/2008/06/13/ecc-ram-vs-raid/' rel='bookmark' title='ECC RAM is more useful than RAID'>ECC RAM is more useful than RAID</a> <small>A common myth in the computer industry seems to be...</small></li>
<li><a href='http://etbe.coker.com.au/2007/11/16/software-vs-hardware-raid/' rel='bookmark' title='Software vs Hardware RAID'>Software vs Hardware RAID</a> <small>Should you use software or hardware RAID? Many people claim...</small></li>
</ol>]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2012/02/06/reliability-raid/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Some Notes on DRBD</title>
		<link>http://etbe.coker.com.au/2011/12/17/drbd-notes/</link>
		<comments>http://etbe.coker.com.au/2011/12/17/drbd-notes/#comments</comments>
		<pubDate>Sat, 17 Dec 2011 08:59:30 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=3056</guid>
		<description><![CDATA[DRBD is a system for replicating a block device across multiple systems. It&#8217;s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it&#8217;s committed to disk locally [...]]]></description>
			<content:encoded><![CDATA[<p>DRBD is a system for replicating a block device across multiple systems. It&#8217;s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it&#8217;s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.</p>
<p>I&#8217;m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I&#8217;m not using DRBD because the failure conditions that I&#8217;m planning for don&#8217;t include two disks entirely failing. I&#8217;m planning for a system having an outage for a while so it&#8217;s OK to have some inbound and outbound mail delayed but it&#8217;s not OK for the mail store to be unavailable.</p>
<h3>Global changes I&#8217;ve made in /etc/drbd.d/global_common.conf</h3>
<p>In the <b>common</b> section I changed the <b>protocol</b> from &#8220;<b>C</b>&#8221; to &#8220;<b>B</b>&#8221;. This means that a write() system call returns after data is committed locally and sent to the other node. The consequence is that if the primary node goes permanently offline AND if the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don&#8217;t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete.</p>
<p>In the <b>net</b> section I added the following:</p>
<p><b>sndbuf-size 512k;<br />
data-integrity-alg sha1;</b></p>
<p>This uses a larger network sending buffer (apparently good for fast local networks &#8211; although I&#8217;d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and sha1 hashes on all packets (why does it default to no data integrity?).</p>
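<p>Put together, the changes described above would look roughly like the fragment below. This is a sketch that assumes the DRBD 8.3 configuration syntax shipped with Debian/Squeeze, wrapped in a hypothetical shell function that prints it; the other sections of the stock file (handlers, startup, disk, syncer) are omitted.</p>

```shell
# drbd_common_config
# Print a "common" section with protocol B, a 512k send buffer and
# sha1 integrity checking of replicated data. Review before installing
# it as part of /etc/drbd.d/global_common.conf.
drbd_common_config() {
    cat <<'EOF'
common {
    protocol B;
    net {
        sndbuf-size 512k;
        data-integrity-alg sha1;
    }
}
EOF
}
```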
<h3>Reserved Numbers</h3>
<p>The default port number that is used is 7789. I think it&#8217;s best to use ports below 1024 for system services so I&#8217;ve set up some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I&#8217;ll use different ports even if they aren&#8217;t configured to ever run on the same system. You never know when the cluster assignment will change and DRBD port numbers seem like something that could potentially cause real problems if there was a port conflict.</p>
<p>Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU&#8217;s resources.</p>
<p>It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.</p>
<p>As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn&#8217;t be mounted locally and which mount options are permitted &#8211; it MIGHT be OK to do a read-only mount of a DomU&#8217;s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.</p>
<h3>Initial Synchronisation</h3>
<p>After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don&#8217;t then it will probably refuse to accept one copy of the data as primary as it can&#8217;t seem to realise that both are inconsistent. I can&#8217;t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there </p>
<p>The solution sometimes seems to be to run &#8220;<b>drbdsetup /dev/drbd0 primary -</b>&#8221; (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in <b>Inconsistent/Inconsistent</b> state then the solution seems to involve running &#8220;<b>drbdadm -&#45; -&#45;overwrite-data-of-peer primary db0-mysql</b>&#8221; (for the case of a resource named <b>db0-mysql</b> defined in <b>/etc/drbd.d/db0-mysql.res</b>).</p>
<p>Also it seems that some commands can only be run from one node. So if you have a primary node that&#8217;s in service and another node in <b>Secondary/Unknown</b> state (IE disconnected) with data state <b>Inconsistent/DUnknown</b> then while you would expect to be able to connect from the secondary node, it appears that nothing other than a &#8220;<b>drbdadm connect</b>&#8221; command run from the primary node will get things going.</p>
<p>Related posts:</p><ol>
<li><a href='http://etbe.coker.com.au/2007/01/03/xen-shared-storage/' rel='bookmark' title='Xen shared storage'>Xen shared storage</a> <small>disk = [ 'phy:/dev/vg/xen1,hda,w', 'phy:/dev/vg/xen1-swap,hdb,w', 'phy:/dev/vg/xen1-drbd,hdc,w', 'phy:/dev/vg/san,hdd,w!' ] For some...</small></li>
<li><a href='http://etbe.coker.com.au/2007/05/15/priorities-for-heartbeat-services/' rel='bookmark' title='priorities for heartbeat services'>priorities for heartbeat services</a> <small>Currently I am considering the priority scheme to use for...</small></li>
<li><a href='http://etbe.coker.com.au/2007/04/15/failure-probability-and-clusters/' rel='bookmark' title='failure probability and clusters'>failure probability and clusters</a> <small>When running a high-availability cluster of two nodes it will...</small></li>
</ol>]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2011/12/17/drbd-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hetzner Failover Konfiguration</title>
		<link>http://etbe.coker.com.au/2011/12/15/hetzner-failover-konfiguration/</link>
		<comments>http://etbe.coker.com.au/2011/12/15/hetzner-failover-konfiguration/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 22:44:05 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=3076</guid>
		<description><![CDATA[The Wiki documenting how to configure IP failover for Hetzner servers [1] is closely tied to the Linux HA project [2]. This is OK if you want a Heartbeat cluster, but if you want manual failover or an automatic failover from some other form of script then it&#8217;s not useful. So I&#8217;ll provide the simplest [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://wiki.hetzner.de/index.php/Failover_Konfiguration/en">The Wiki documenting how to configure IP failover for Hetzner servers [1]</a> is closely tied to <a href="http://www.linux-ha.org/">the Linux HA project [2]</a>. This is OK if you want a Heartbeat cluster, but if you want manual failover or an automatic failover from some other form of script then it&#8217;s not useful. So I&#8217;ll provide the simplest possible documentation.</p>
<p>Below is a sample of shell code to get the current failover settings and change them to point the IP address to a different server. In my tests this takes between 19 and 20 seconds to complete. When the command completes the new server will be active and no IP packets will be lost &#8211; but TCP connections will be broken if the servers don&#8217;t support shared TCP state.</p>
<p># username and password for the Hetzner robot<br />
USERPASS=USER:PASS<br />
# public IP<br />
IP=10.1.2.3<br />
# new active server<br />
ACTIVE=10.2.3.4<br />
# get current values<br />
curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP"<br />
# change active server<br />
curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP" -d "active_server_ip=$ACTIVE"</p>
<p>Below is the output of the above commands showing the old state and the new state.</p>
<p>failover:<br />
  ip: 10.1.2.3<br />
  netmask: 255.255.255.255<br />
  server_ip: 10.2.3.3<br />
  active_server_ip: 10.2.3.4<br />
failover:<br />
  ip: 10.1.2.3<br />
  netmask: 255.255.255.255<br />
  server_ip: 10.2.3.4<br />
  active_server_ip: 10.2.3.4</p>
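<p>For a failover script it can be useful to check the current state before changing it. The following is a minimal sketch and not part of the Hetzner documentation: it assumes the simple &#8220;key: value&#8221; layout shown in the output above, and the function names <b>active_server</b> and <b>needs_switch</b> are hypothetical.</p>

```shell
#!/bin/sh
# Print the active_server_ip value from failover.yaml output read on stdin.
# Assumes the simple "key: value" layout shown in the sample output.
active_server() {
  awk '$1 == "active_server_ip:" { print $2; exit }'
}

# Hypothetical helper: succeed only when the address is not already
# pointing at the server given as $1.
# Usage sketch:
#   curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP" | needs_switch "$ACTIVE"
needs_switch() {
  current=$(active_server)
  [ "$current" != "$1" ]
}
```

<p>The second curl command from the sample would then only be run when <b>needs_switch</b> succeeds, which avoids a 20 second wait when there is nothing to change.</p>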
<ul>
<li>[1]<a href="http://wiki.hetzner.de/index.php/Failover_Konfiguration/en"> http://wiki.hetzner.de/index.php/Failover_Konfiguration/en</a></li>
<li>[2]<a href="http://www.linux-ha.org/"> http://www.linux-ha.org/</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2011/12/15/hetzner-failover-konfiguration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why Clusters Usually Don&#8217;t Work</title>
		<link>http://etbe.coker.com.au/2010/08/04/clusters-dont-work/</link>
		<comments>http://etbe.coker.com.au/2010/08/04/clusters-dont-work/#comments</comments>
		<pubDate>Wed, 04 Aug 2010 06:46:52 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>
		<category><![CDATA[Most Popular]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=2207</guid>
		<description><![CDATA[It&#8217;s widely regarded that to solve reliability problems you can just install a cluster. It&#8217;s quite obvious that if instead of having one system of a particular type you have multiple systems of that type and a cluster configured such that broken systems aren&#8217;t used then reliability will increase. Also in the case of routine [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s widely believed that to solve reliability problems you can just install a cluster.  It&#8217;s quite obvious that if instead of having one system of a particular type you have multiple systems of that type and a cluster configured such that broken systems aren&#8217;t used then reliability will increase.  Also in the case of routine maintenance a cluster configuration can allow one system to be maintained in a serious way (EG being rebooted for a kernel or BIOS upgrade) without interrupting service (apart from a very brief interruption that may be needed for resource failover).  But there are some significant obstacles in the path of getting a good cluster going.</p>
<h3>Buying Suitable Hardware</h3>
<p>If you only have a single server that is doing something important and you have some budget for doing things properly then you really must do everything possible to keep it going.  You need RAID storage with hot-swap disks, hot-swap redundant PSUs, and redundant ethernet cables bonded together.  But if you have redundant servers then the requirement for making one server reliable is slightly reduced.</p>
<p>Hardware is getting cheaper all the time.  A Dell R300 1RU server configured with redundant hot-plug PSUs, two 250G hot-plug SATA disks in a RAID-1 array, 2G of RAM, and a dual-core Xeon Pro E3113 3.0GHz CPU apparently costs just under $2,800AU (when using Google Chrome I couldn&#8217;t add some necessary jumper cables to the list so I couldn&#8217;t determine the exact price).  So a cluster of two of them would cost about $5,600 just for the servers.  But a Dell R200 1RU server with no redundant PSUs, a single 250G SATA disk, 2G of RAM, and a Core 2 Duo E7400 2.8GHz CPU costs only $1,048.99AU.  So if a low end server is required then you could buy two R200 servers that have no redundancy built in for less than the cost of a single server that has hardware RAID and redundant PSUs.  Those two server models have different sets of CPU options and probably other differences in the technical specs, but for many applications they will probably both provide more than adequate performance.</p>
<p>Using a server that doesn&#8217;t even have RAID is a bad idea.  A minimal RAID configuration is a software RAID-1 array, which only requires an extra disk per server.  That takes the price of a Dell R200 to $1,203.  So it seems that two low-end 1RU servers from Dell that have minimal redundancy features will be cheaper than a single 1RU server that has the full set of features.  If you want to serve static content then that&#8217;s all you need, and a cluster can save you money on hardware!  Of course we can debate whether any cluster node should be missing redundant hot-plug PSUs and disks.  But that&#8217;s not an issue I want to address in this post.</p>
<p>Also serving static content is the simplest form of cluster.  If you have a cluster for running a database server then you will need a dual-attached RAID array which will make things start to get expensive (or software for replicating the data over the network which is difficult to configure and may be expensive).  So while a trivial cluster may not cost any extra money a real-world cluster deployment is likely to add significant expense.</p>
<p>My observation is that most people who implement clusters tend to have problems getting budget for decent hardware.  When you have redundancy via the cluster you can tolerate slightly less expected uptime from the individual servers.  While we can debate about whether a cluster member should have redundant PSUs and other expensive features it does seem that using a cheap desktop system as a cluster node is a bad idea.  Unfortunately some managers think that a cluster solves the reliability problem and that you can therefore just use recycled desktop systems as cluster nodes.  This doesn&#8217;t give a good result.</p>
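<p>The price comparison can be checked with a few lines of shell arithmetic. This is only a sketch using the figures quoted in this post; the R300 price is the approximate &#8220;just under $2,800&#8221; figure rounded up, and the variable names are made up for the example.</p>

```shell
#!/bin/sh
# Prices in AU dollars as quoted in the post (the R300 figure is approximate).
R300=2800       # one R300 with hardware RAID and redundant PSUs
R200_RAID=1203  # one R200 with a second disk for software RAID-1

pair=$((2 * R200_RAID))
saving=$((R300 - pair))
echo "two R200s: \$$pair, one R300: \$$R300, difference: \$$saving"
```

<p>With these numbers the pair of R200 servers comes out a few hundred dollars below a single R300, so the conclusion holds even after adding the extra disks.</p>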
<p>Even if it is agreed that server class hardware is used for all servers so features such as <a href="http://en.wikipedia.org/wiki/ECC_RAM#Errors_and_error_correction">ECC RAM</a> are used you will still have problems if someone decides to use different hardware specs for each of the cluster nodes.</p>
<h3>Testing a Cluster</h3>
<p>Testing a non-clustered server or some servers that use a load-balancing device at the front-end isn&#8217;t that difficult in concept.  Sure you have lots of use cases and exception conditions to test, but they are all mostly straight-through tests.  With a cluster you need to test node failover at unexpected times.  When a node is regarded as having an inconsistent state (which can mean that one service it runs could not be cleanly shut down when it was due to be migrated) it will need to be rebooted, which is sometimes known as a <a href="http://en.wikipedia.org/wiki/STONITH">STONITH</a>.  A STONITH event usually involves something like <a href="http://en.wikipedia.org/wiki/Ipmi">IPMI</a> to cut the power or a command such as &#8220;<b>reboot -nf</b>&#8221;.  This loses cached data and can cause serious problems for any application which doesn&#8217;t call fsync() as often as it should.  It seems likely that the vast majority of sysadmins run programs which don&#8217;t call fsync() often enough, but the probability of losing data is low and the probability of losing data in a way that you will notice (IE it doesn&#8217;t get automatically regenerated) is even lower.  The low probability of data loss due to race conditions combined with the fact that a server with a UPS and redundant PSUs doesn&#8217;t unexpectedly halt that often means that problems don&#8217;t get found easily.  But when clusters have problems and start calling STONITH the probability starts increasing.</p>
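<p>As a rough illustration of what flushing data at the right time can look like in a shell script, here is a sketch of a write-then-rename helper. It assumes GNU dd (for <b>conv=fsync</b>) and that the temporary file is on the same filesystem as the target; the function name <b>safe_write</b> is hypothetical and this is not taken from any particular program.</p>

```shell
#!/bin/sh
# Write stdin to the file named in $1 so that a sudden power cut should leave
# either the old contents or the new contents, not a partial file.
# Assumes GNU dd (conv=fsync) and a temp file on the same filesystem.
safe_write() {
  target=$1
  tmp="$target.tmp.$$"
  # conv=fsync makes dd flush the output file to disk before it exits
  dd of="$tmp" conv=fsync 2>/dev/null || return 1
  # rename is atomic within a filesystem
  mv -f "$tmp" "$target" || return 1
  # flush the rename as well; a plain sync flushes everything, slow but simple
  sync
}
```

<p>A call such as <b>generate_config | safe_write /etc/myapp.conf</b> (both names invented for the example) would then survive a &#8220;<b>reboot -nf</b>&#8221; at any point with either the old or the new file in place.</p>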
<p>Getting cluster software to work in a correct manner isn&#8217;t easy.  I filed <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=430958">Debian bug #430958 about dpkg (the Debian package manager) not calling fsync() and thus having the potential to leave systems in an inconsistent or unusable state if a STONITH happened at the wrong time</a>.  I was inspired to find this problem after finding the same problem with RPM on a SUSE system.  The <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=578635">result of applying a patch to call fsync() on every file was bug report #578635 about the performance of doing so</a>, the eventual solution was to call sync() after each package is installed.  Next time I do any cluster work on Debian I will have to test whether the sync() code seems to work as desired.</p>
<p>Getting software to work in a cluster requires that not only bugs in system software such as dpkg be fixed, but also bugs in 3rd party applications and in-house code.  Please someone write a comment claiming that their favorite OS has no such bugs and the commercial and in-house software they use is also bug-free &#8211; I could do with a cheap laugh.</p>
<p>For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds &#8211; back when the pound was worth something) I was not allowed to power-cycle the servers.  Apparently the servers were too valuable to be rebooted in that way, so if they did happen to have any defective hardware or buggy software that would do something undesirable after a power problem it would become apparent in production rather than being a basic warranty or patching issue before the system went live.</p>
<p>I have heard many people argue that if you install a reasonably common OS on a server from a reputable company and run reasonably common server software then the combination would have been tested before and therefore almost no testing is required.  I think that some testing is always required (and I always seem to find some bugs when I do such tests), but I seem to be in a minority on this issue as less testing saves money &#8211; unless of course something breaks.  It seems that the need for testing systems before going live is much greater for clusters, but most managers don&#8217;t allocate budget and other resources for this.</p>
<p>Finally there is the issue of testing issues related to custom code and the user experience.  What is the correct thing to do with an interactive application when one of the cluster nodes goes down and how would you implement it at the back-end?</p>
<h3>Running a Cluster</h3>
<p>Systems don&#8217;t just sit there without changing.  You have new versions of the OS and applications and requirements for configuration changes.  This means that the people who run the cluster will ideally have some specialised cluster skills.  If you hire sysadmins without regard to cluster skills then you will probably end up not hiring anyone who has any prior experience with the cluster configuration that you use.  Learning to run a cluster is not like learning to run yet another typical Unix daemon, it requires some differences in the way things are done.  All changes have to be strictly made to all nodes in the cluster.  Having a cluster fail over to a node that wasn&#8217;t upgraded and can&#8217;t understand the new data is not fun at all!</p>
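<p>One simple way to catch a node that missed an upgrade is to compare package lists. The sketch below assumes Debian nodes where each list was produced by <b>dpkg-query -W -f='${Package} ${Version}\n'</b> and copied to one place; how the lists are collected is left out, and the function name <b>nodes_match</b> is hypothetical.</p>

```shell
#!/bin/sh
# Compare two package lists, one per node, and print any differences.
# Each file is assumed to hold the output of
#   dpkg-query -W -f='${Package} ${Version}\n'
# collected on that node.
# Returns 0 when the lists match and non-zero otherwise.
nodes_match() {
  a=$(mktemp) || return 2
  b=$(mktemp) || return 2
  sort "$1" > "$a"
  sort "$2" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return $status
}
```

<p>Running this before a planned failover gives a quick warning that the standby node is not at the same patch level as the active one.</p>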
<p>My observation is that the typical experience of having a team of sysadmins who have no prior cluster experience being hired to run a cluster usually involves &#8220;learning experiences&#8221; for everyone.  It&#8217;s probably best to assume that every member of the team will break the cluster and cause down-time on at least one occasion!  This can be alleviated by only having one or two people ever work on the cluster and having everyone else delegate cluster work to them.  Of course if something goes wrong when the cluster experts aren&#8217;t available then the result is even more downtime than might otherwise be expected.</p>
<p>Hiring sysadmins who have prior experience running a cluster with the software that you use is going to be very difficult.  It seems that any organisation that is planning a cluster deployment should plan a training program for sysadmins.  Have a set of test machines suitable for running a cluster and have every new hire install the cluster software and get it all working correctly.  It&#8217;s expensive to buy extra systems for such testing, but it&#8217;s much more expensive to have people who lack necessary skills try to run your most important servers!</p>
<p>The trend in recent years has been towards sysadmins not being system programmers.  This may be a good thing in other areas but it seems that in the case of clustering it is very useful to have a degree of low level knowledge of the system that you can only gain by having some experience doing system coding in C.</p>
<p>It&#8217;s also a good idea to have a test network which has machines in an almost identical configuration to the production servers.  Being able to deploy patches to test machines before applying them in production is a really good thing.</p>
<h3>Conclusion</h3>
<p>Running a cluster is something that you should either do properly or not at all.  If you do it badly then the result can easily be less uptime than a single well-run system.</p>
<p>I am not suggesting that people avoid running clusters.  You can take this post as a list of suggestions for what to avoid doing if you want a successful cluster deployment.</p>
]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2010/08/04/clusters-dont-work/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>A Basic IPVS Configuration</title>
		<link>http://etbe.coker.com.au/2008/08/07/basic-ipvs-configuration/</link>
		<comments>http://etbe.coker.com.au/2008/08/07/basic-ipvs-configuration/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 13:10:19 +0000</pubDate>
		<dc:creator>etbe</dc:creator>
				<category><![CDATA[Ha]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://etbe.coker.com.au/?p=691</guid>
		<description><![CDATA[I have just configured IPVS on a Xen server for load balancing between multiple virtual hosts. The benefit is not load balancing but management. With two virtual machines providing a service I can gracefully shut one down for maintenance and have the other take the load. When there are two machines providing a service a [...]]]></description>
			<content:encoded><![CDATA[<p>I have just configured IPVS on a Xen server for load balancing between multiple virtual hosts.  The benefit is not load balancing but management.  With two virtual machines providing a service I can gracefully shut one down for maintenance and have the other take the load.  When there are two machines providing a service a load balancing configuration is much better than a hot-spare.  One reason is that there may be application scaling issues that prevent one machine with twice the resources from giving as much performance as two smaller machines.  Another is that if you have a machine configured but never used there will always be some doubt as to whether it would work&#8230;</p>
<p>The first thing to do is to assign the IP address of the service to the front-end machine so that other machines on the segment (IE routers) will be able to send data to it.  If the address for the service is 10.0.0.5 then the command &#8220;<b>ip addr add dev eth0 10.0.0.5/24 broadcast +</b>&#8221; will make it a secondary address on the <b>eth0</b> interface.  On a Debian system you would add the line &#8220;<b>up ip addr add dev eth0 10.0.0.5/24 broadcast + || true</b>&#8221; to the appropriate section of <b>/etc/network/interfaces</b>, for a Red Hat system it seems that <b>/etc/rc.local</b> is the best place for it.  I expect that it would be possible to merely advertise the IP address via ARP without adding it to the interface, but the ability to ping the IPVS server on the service address seems useful and there seems no benefit in not assigning the address.</p>
<p>There are three methods used by IPVS for forwarding packets: gatewaying/routing (the default), IPIP encapsulation (tunneling), and masquerading.  The gatewaying/routing method requires the back-end server to respond to requests on the service address.  That would mean assigning the address to the back-end server without advertising it via ARP (which seems likely to have some issues for managing the system).  The IPIP encapsulation method requires setting up IPIP which seemed like it would be excessively difficult (although maybe not more than required to set up masquerading).  The masquerading option (which I initially chose) rewrites the packets to have the IP address of the real server.  So for example if the service address is 10.0.0.5 and the back-end server has the address 10.0.1.5 then it will see packets addressed to 10.0.1.5.  A benefit of masquerading is that it allows you to use different ports, so for example you could have a non-virtualised mail server listening on port 25 and a back-end server for a virtual service listening on port 26.  While there is no practical limit to the number of private IP addresses that you might use it seems easier to manage servers listening on different ports with the same IP address &#8211; and there is the issue of server programs that are not written to support binding to an IP address.</p>
<p><b>ipvsadm -A -t 10.0.0.5:25 -s lblc -p<br />
ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5 -m</b></p>
<p>The above two commands create an IPVS configuration that listens on port 25 of IP address 10.0.0.5 and then masquerades connections to 10.0.1.5 on port 25 (the default is to use the same port).</p>
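<p>Since the point of this setup is being able to take a back-end out of service gracefully, here is a sketch of how that might be done with <b>ipvsadm</b>. It assumes a second back-end has been added in the same way so that something remains to take the load, and the function names are hypothetical. With <b>DRY_RUN</b> set the commands are printed instead of run, because ipvsadm needs root.</p>

```shell
#!/bin/sh
# With DRY_RUN set, print each command instead of running it.
run() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# Stop new connections going to a masqueraded real server by setting its
# weight to 0; connections that are already established carry on.
# $1 = service address:port, $2 = real server address
drain_server() {
  run ipvsadm -e -t "$1" -r "$2" -m -w 0
}

# Put the real server back into rotation with the default weight of 1.
restore_server() {
  run ipvsadm -e -t "$1" -r "$2" -m -w 1
}

# Show the current table with numeric addresses.
show_table() {
  run ipvsadm -L -n
}
```

<p>For example &#8220;<b>drain_server 10.0.0.5:25 10.0.1.5</b>&#8221; followed by &#8220;<b>show_table</b>&#8221; lets you watch the active connection count fall before shutting the machine down. Note that with the <b>-p</b> option used above, clients that already have a persistence entry may keep being sent to the drained server until that entry times out.</p>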
<p>Now the problem is in getting the packets to return via the IPVS server.  If the IPVS server happens to be your default gateway then it&#8217;s not a problem and it will already be working after the above two commands (if a service is listening on 10.0.1.5 port 25).</p>
<p>If the IPVS server is not the default gateway and you have only one IP address on the back-end server then this will require using netfilter to mark the packets and then route based on the packet matching.  Marking via netfilter also seems to be the only well documented way of doing similar things.  I spent some time working on this and didn&#8217;t get it working.  However having multiple IP addresses per server is a recommended practice anyway (a back-end interface for communication between servers as well as a front-end interface for public data).</p>
<p><b>ip rule add from 10.0.1.5 table 1<br />
ip route add default via 10.0.0.1 table 1</b></p>
<p>I use the above two commands to set up a new routing table for the data for the virtual service.  The first line causes any packets from <b>10.0.1.5</b> to be sent to routing table 1 (I currently have a rough plan to have table numbers match ethernet device numbers, the data in question is going out device eth1).  The second line adds a default router to table 1 which sends all packets to 10.0.0.1 (the private IP address of the IPVS server).</p>
<p>Then it SHOULD all be working, but in the network that I&#8217;m using (RHEL4 DomU and RHEL5 Dom0 and IPVS) it doesn&#8217;t.  For some reason the data packets from the DomU are not seen as part of the same TCP stream (both in netfilter connection tracking and by the TCP code in the kernel).  So I get an established connection (3 way handshake completed) but no data transfer.  The server sends the SMTP greeting repeatedly but nothing is received.  At this stage I&#8217;m not sure whether there is something missing in my configuration or whether there&#8217;s a bug in IPVS.  I would be happy to send tcpdump output to anyone who wants to try and figure it out.</p>
<p>My next attempt at this was via routing.  I removed the &#8220;<b>-m</b>&#8221; option from the <b>ipvsadm</b> command and added the service IP address to the back-end with the command &#8220;<b>ifconfig lo:0 10.0.0.5 netmask 255.255.255.255</b>&#8221; and configured the mail server to bind to port 25 on address 10.0.0.5.  Success at last!</p>
<p>Now I just have to get Piranha working to remove back-end servers from the list when they fail.</p>
<p>Update:  It&#8217;s quite important that when adding a single IP address to device <b>lo:0</b> you use a netmask of <b>255.255.255.255</b>.  If you use the same netmask as the front-end device (which would seem like a reasonable thing to do) then (with RHEL4 kernels at least) you get proxy ARPs by default.  For example if you used netmask 255.255.255.0 to add address 10.0.0.5 to device lo:0 then on device eth0 the machine will start answering ARP requests for 10.0.0.6 etc.  Havoc then ensues.</p>
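<p>To make that mistake harder to repeat, the address can be added through a small wrapper that refuses anything other than a /32. This is only a sketch: it uses the iproute2 form of the command rather than the ifconfig form I used above, and the function name <b>add_service_addr</b> is hypothetical. With <b>DRY_RUN</b> set the command is printed rather than run, as it needs root.</p>

```shell
#!/bin/sh
# Add a service address to the loopback device for IPVS direct routing.
# Only a /32 is accepted, because a wider mask on lo makes the machine
# answer ARP requests for the neighbouring addresses too.
# $1 = address with prefix length, EG 10.0.0.5/32
add_service_addr() {
  case $1 in
    */32) ;;
    *) echo "refusing $1: use a /32 prefix"; return 1 ;;
  esac
  if [ -n "$DRY_RUN" ]; then
    echo "ip addr add $1 dev lo label lo:0"
  else
    ip addr add "$1" dev lo label lo:0
  fi
}
```

<p>Calling it with 10.0.0.5/24 fails with a message, while 10.0.0.5/32 does the equivalent of the ifconfig command with netmask 255.255.255.255.</p>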
]]></content:encoded>
			<wfw:commentRss>http://etbe.coker.com.au/2008/08/07/basic-ipvs-configuration/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

