Linux, politics, and other interesting things


  • BTRFS and ZFS as Layering Violations

    LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. I recommend reading it. It’s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).

    A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

    The Benefits of Layers

    I don’t believe as strongly in the BTRFS/ZFS design as the commentator probably thinks. My servers (and a huge number of other Linux systems) currently use RAID to build a reliable array from a set of cheap disks, often gaining capacity or performance as well, and that is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV. Among other things this means that the same code paths are used and I’m less likely to encounter a bug when I install a new system.

    LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it – without interfering with other filesystems.

    When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

    When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

    Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That’s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn’t support such things and using software RAID underneath ZFS or BTRFS loses some features.

    When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS, it seems that no matter how you implement those filesystems with DRBD you will lose something. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that changed while a device was disconnected) so it’s not possible to use BTRFS or ZFS RAID-1 with an NBD device.
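
    To illustrate what a bitmap buys you, here is a toy model of a two-way mirror that records which chunks were written while the second device was away and copies only those chunks on reconnection. This is a sketch of the idea only: the class name, the chunk size, and the in-memory lists standing in for block devices are all my own assumptions, not how Linux software RAID is implemented.

```python
class BitmapMirror:
    """Toy RAID-1 with a write-intent bitmap (illustrative model only)."""

    def __init__(self, n_blocks, chunk_blocks=4):
        # Two in-memory "devices"; real devices would be block devices.
        self.primary = [b""] * n_blocks
        self.secondary = [b""] * n_blocks
        self.chunk_blocks = chunk_blocks  # blocks covered by one bitmap bit
        self.connected = True
        self.dirty = set()  # chunk numbers written while disconnected

    def write(self, block, data):
        self.primary[block] = data
        if self.connected:
            self.secondary[block] = data
        else:
            # Remember only which chunk changed, not the data itself.
            self.dirty.add(block // self.chunk_blocks)

    def resync(self):
        """Reconnect and copy only the dirty chunks; return blocks copied."""
        copied = 0
        for chunk in sorted(self.dirty):
            start = chunk * self.chunk_blocks
            end = min(start + self.chunk_blocks, len(self.primary))
            self.secondary[start:end] = self.primary[start:end]
            copied += end - start
        self.dirty.clear()
        self.connected = True
        return copied
```

    Without the bitmap the only safe option after a disconnection is to copy every block, which is what makes a mirror over a network device painful.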

    The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.

    The Benefits of Integration

    When RAID and the filesystem are separate things (with some added abstraction from LVM) it’s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

    It would be possible to have a RAID driver that includes checksums for all blocks. It could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round [3]). Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.
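
    The checksumming RAID driver described above could behave roughly like the following sketch. Everything here is hypothetical: no such driver exists as far as I know, the class is my own invention, and CRC32 merely stands in for whatever checksum a real implementation would choose.

```python
import zlib


class ChecksummedMirror:
    """Toy mirror that verifies a per-block checksum on every read."""

    def __init__(self, n_blocks):
        # Two in-memory "devices" and one checksum per block.
        self.devices = [[b""] * n_blocks, [b""] * n_blocks]
        self.checksums = [zlib.crc32(b"")] * n_blocks

    def write(self, block, data):
        for dev in self.devices:
            dev[block] = data
        self.checksums[block] = zlib.crc32(data)

    def read(self, block):
        for dev in self.devices:
            data = dev[block]
            if zlib.crc32(data) == self.checksums[block]:
                # Found a good copy: rewrite every device with it so that
                # a silently corrupted copy gets repaired.
                for other in self.devices:
                    other[block] = data
                return data
        raise IOError(f"block {block}: no copy matches its checksum")
```

    A plain RAID-1 has no way to know which of two differing copies is correct, which is the gap this kind of design would close.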

    Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

    The way that ZFS supports snapshots which inherit encryption keys is also interesting.

    Conclusion

    It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that put checksums on all blocks. But it appears that there isn’t much interest in developing such things. So while people would use it (and people are using ZFS ZVols as block devices for other filesystems as described in a comment on Mark Round’s blog) it’s probably not going to be implemented.

    Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

    ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.


  • The Most Important things for running a Reliable Internet Service

    One of my clients is currently investigating new hosting arrangements. It’s a bit of a complex process because there are lots of architectural issues relating to things such as the storage and backup of some terabytes of data and some serious computation on the data. Among other options we are considering cheap servers in the EX range from Hetzner [1] which provide 3TB of RAID-1 storage per server along with reasonable CPU power and RAM and Amazon EC2 [2]. Hetzner and Amazon aren’t the only companies providing services that can be used to solve my client’s problems, but they both provide good value for what they provide and we have prior experience with them.

    To add an extra complication my client did some web research on hosting companies and found that Hetzner wasn’t even in the list of reliable hosting companies (whichever list that was). This is in some ways not particularly surprising: Hetzner offers servers without a full management interface (you can’t see a serial console or a KVM, you merely get access to reset it) and the best value servers (the only servers to consider for many terabytes of data) have SATA disks which presumably have a lower MTBF than SAS disks.

    But I don’t think that this is a real problem. Even when hardware that’s designed for the desktop is run in a server room the reliability tends to be reasonable. My experience is that a desktop PC with two hard drives in a RAID-1 array will give a level of reliability in practice that compares very well to an expensive server with ECC RAM, redundant fans, redundant PSUs, etc.

    My experience is that the most critical factor for server reliability is management. A server that is designed to be reliable can give very poor uptime if poorly maintained or if there is no rapid way of discovering and fixing problems. But a system that is designed to be cheap can give quite good uptime if it is well maintained and problems can be rapidly discovered and fixed.

    A Brief Overview of Managing Servers

    There are text books about how to manage servers, so obviously I can’t cover the topic in detail in a blog post. But here are some quick points. Note that I’m not claiming that this list includes everything, please add comments about anything particularly noteworthy that you think I’ve missed.

    1. For a server to be well managed it needs to be kept up to date. It’s probably a good idea for management to have this on the list of things to do. A plan to check for necessary updates and apply them at fixed times (at least once a week) would be a good thing. My experience is that usually managers don’t have anything to do with this and sysadmins either apply patches or not at their own whim.
    2. It is really ideal for people to know how all the software works. For every piece of software that’s running it should either have come from a source that provides some degree of support (EG a Linux distribution) or be maintained by someone who knows it well. When you install custom software from people who become unavailable then it puts the reliability of the entire system at risk – if anything breaks then you won’t be able to get it fixed quickly.
    3. It should be possible to rapidly discover problems. Having a client phone you to tell you that your web site is offline is a bad thing. Ideally you will have software like Nagios monitoring the network and reporting problems via a SMS gateway service such as ClickaTell.com. I am not sure that Nagios is the best network monitoring system or that ClickaTell is the best SMS gateway, but they have both worked well in my experience. If you think that there are better options for either of those then please write a comment.
    4. It should be possible to rapidly fix problems. That means that a sysadmin must be available 24*7 to respond to SMS and you must have a backup sysadmin for when the main person takes a holiday, or ideally two backup sysadmins so that if one is on holiday and another has an emergency then problems can still be fixed. Another thing to consider is that an increasing number of hotels, resorts, and cruise ships are providing net access. So you could decrease your need for backup sysadmins if you give a holiday bonus to a sysadmin who uses a hotel, resort, or cruise ship that has good net access. ;)
    5. If it seems likely that there may be some staff changes then it’s a really good idea to hire a potential replacement on a casual basis so that they can learn how things work. There have been a few occasions when I started a sysadmin contract after the old sysadmin ceased being on speaking terms with the company owner. This made it difficult for me to learn what’s going on.
    6. If your network is in any way complex (IE it’s something that needs some skill to manage) then it will probably be impossible to hire someone who has experience in all the areas of technology at a salary you are prepared to pay. So you should assume that whoever you hire will do some learning on the job. This isn’t necessarily a problem but is something that needs to be considered. If you use some unusual hardware or software and want it to run reliably then you should have a spare system for testing so that the types of mistake which are typically made in the learning process are not made on your production network.
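
    As a minimal illustration of point 3, the core of any monitoring system is a loop that probes each service and reports the ones that don’t answer. The sketch below is not a replacement for Nagios and sends no SMS; the function names are my own and the only check performed is whether a TCP connection succeeds.

```python
import socket


def service_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(services, probe=service_is_up):
    """Return the (host, port) pairs that failed the probe.

    The probe is a parameter so that a different check (HTTP, SMTP
    banner, and so on) can be substituted without changing the loop.
    """
    return [(host, port) for host, port in services if not probe(host, port)]
```

    Something like check_all([("www.example.com", 80)]) run from cron on a machine outside your network, with a non-empty result passed to whatever alerting you use, is about the least you can get away with.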

    Conclusion

    If you have a business which depends on running servers on the Internet and you don’t do all the things in the above list then the reliability of a service like Hetzner probably isn’t going to be an issue at all.


  • ZFS vs BTRFS on Cheap Dell Servers

    I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is with the reliability of RAID [2].

    Requirements for a File Server

    Now one of my clients has a need for a new fileserver. They need to reliably store terabytes of data (currently 6TB and growing) which mostly consists of data files in the 10MB – 15MB size range. The data files will almost never be re-written and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn’t be a technical challenge.

    It seems that BTRFS is the way of the future. But it’s still rather new and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today’s technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also the case of two disks entirely failing in a short period of time requires RAID-6 (or RAID-Z2 as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to recover from two disks failing you need to have BTRFS on another RAID-6 (maybe Linux software RAID-6). But for filesystems based on concepts similar to ZFS and BTRFS you want to have the filesystem run the RAID so that if a block has a filesystem hash mismatch then the correct copy can be reconstructed from parity.

    ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin friendly features at this stage) complexity means that I need to learn more and test more before going live.

    But it seems that the built-in RAID-5 and RAID-6 support in ZFS is the killer issue. Servers start becoming a lot more expensive if you want more than 8 disks and even going past 6 disks is a significant price point. As 3TB disks are available an 8 disk RAID-6 gives something like 18TB usable space vs 12TB on a RAID-10 and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (IE BTRFS) my client couldn’t use a 6 disk server such as the Dell PowerEdge T410 for $1500 as 9TB of usable storage isn’t adequate and the Dell PowerEdge T610 which can support 8 disks and costs $2100 would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers such that any of my clients’ needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk) which is probably worth paying to delay any future need for upgrades.
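
    The capacity and price figures above follow from simple arithmetic, which can be checked with a short sketch like this one. The function name is my own, the result is nominal capacity that ignores filesystem overhead, and the $200 disk price is the same assumption used in the text.

```python
def usable_tb(level, disks, disk_tb):
    """Nominal usable capacity of an array, ignoring filesystem overhead."""
    if level == "raid10":
        if disks % 2:
            raise ValueError("RAID-10 needs an even number of disks")
        return disks // 2 * disk_tb      # half the disks hold mirror copies
    if level == "raid5":
        return (disks - 1) * disk_tb     # one disk's worth of parity
    if level == "raid6":
        return (disks - 2) * disk_tb     # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


for disks in (6, 8):
    print(disks, "disks:",
          usable_tb("raid6", disks, 3), "TB RAID-6 vs",
          usable_tb("raid10", disks, 3), "TB RAID-10")

# Server price plus a full set of disks at an assumed $200 each.
t610_total = 2100 + 8 * 200
t410_total = 1500 + 6 * 200
print("price difference:", t610_total - t410_total)
```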

    Buying Disks

    The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410 which supports non-hot-swap disks so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months it seems that the T410 is clearly the better option.
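
    To put the disk prices on a common footing, here is a sketch that works out the cost per usable terabyte for the two options. It covers the disks only, not the servers, and it treats “something less than $200” as exactly $200, which is an assumption that slightly favours the T610.

```python
def disk_cost(disks, price_each, usable_tb):
    """Return (total disk cost, cost per usable TB), rounded to cents."""
    total = round(disks * price_each, 2)
    return total, round(total / usable_tb, 2)


# 8 hot-swap 2TB disks in RAID-6 vs 6 plain 3TB disks in RAID-6,
# both giving 12TB of usable space.
t610 = disk_cost(8, 586.30, 12)
t410 = disk_cost(6, 200.00, 12)
print("T610 disks:", t610)
print("T410 disks:", t410)
```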

    Does anyone know how to get cheap disk trays for Dell servers?

    Implementation

    In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However Solaris and FreeBSD aren’t commonly used systems so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things it’s really convenient to be able to run software for embedded Linux i386 systems on the server.

    The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of ZFS code from the Linux kernel code to avoid license issues but does have some performance problems. I don’t think that I will have any performance issues on this server as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn’t seem to be particularly popular.

    The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel the license conflict of the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this so it will be my preferred option.


  • Cheap NAS Devices Suck

    There are some really good Network Attached Storage (NAS) devices on the market. NetApp is one company that is known for making good products [1]. The advantage of a NAS is that you have a device with NVRAM for write-back caching, a filesystem that supports all the necessary features for best performance (NetApp developed their own filesystem WAFL to provide the features they needed), and a set of quality hardware that has been tested and certified to work together.

    If you want a cheap NAS then you end up with something running Linux with GPL filesystems. This isn’t a bad thing as such, but some of the best performance and data integrity features are available in ZFS (which isn’t GPL) and BTRFS (which isn’t ready for production use). Not to mention WAFL which has been providing ZFS/BTRFS type features for more than a decade.

    A cheap NAS will generally be sold without disks as this is the best way to keep costs down. Selling with disks either means selling lots of different variations (which means it can’t be sold off the shelf) or selling packages that don’t quite suit some customers (thus causing people to buy the device and replace the disks which means extra costs). This means that the vendors can’t provide the guarantees about disk quality and suitability that NetApp can provide.

    One major problem with a NAS is that you typically can’t get shell access. Commands such as “rm -rf” and “cp -rl” which are typically rather quick when performed locally can take ages when run over NFS. Also commands such as “grep -R” which can perform reasonably well over NFS will always perform better when run locally. Also tasks such as compiling big programs which require good disk speed as well as some CPU time can be run locally if you have a file server system that also has local accounts (IE a typical multi-user Unix server configuration), but a dedicated NAS will prevent that.

    I have never used a NetApp device due to my clients deciding (sometimes correctly and sometimes incorrectly IMHO) that they are too expensive. But when considering what my clients do the down-sides of lacking local code execution on a NetApp would be more than compensated in many cases by the significant performance and reliability advantages that they offer (see my post about the reliability issues in standard RAID implementations [2]).

    A server class system (with ECC RAM as a minimum criterion) running RAID-6 can be a fairly decent file server. That requires hardware RAID with NVRAM for the write-back cache for decent write performance, but when write performance isn’t required software RAID does the job quite well. A Dell tower system will typically hold at least 4 disks which means 6TB of RAID-6 storage and is quite cheap – it can be under $2000. Also such a system can be easily expanded with extra Ethernet ports etc. NetApp doesn’t sell products directly and doesn’t list prices, but they do have some adverts for products being “under $7500“. That’s not really cheap but not THAT expensive when you consider the features.

    A hidden cost in running a NAS is having someone perform sysadmin work on it. For a relatively expensive device that offers significant features such as a NetApp Filer this expense probably isn’t too great. But for a device that does what any PC running Linux can do it’s noteworthy that more training or experimenting time is required.

    There are some special cases where small and cheap NAS appliances really make sense, such as the Apple Time Capsule for home network backups. But apart from that I don’t think that cheap NAS appliances make sense. It seems that cheap NAS devices provide the biggest down-sides of expensive NAS devices (in terms of lacking local access and having a different administration interface to servers) while also having the biggest down-sides of PC servers (lacking the advanced features of WAFL and performance of a NetApp).

    Earlier today I started a process of reorganising some backups which included backups to a cheap NAS. I have been very unimpressed by the time taken to copy and rm files over NFS. I’m sure that the job would have been completed hours ago if I had local root access to the NAS.


  • Starting with BTRFS

    Based on my investigation of RAID reliability [1] I have determined that BTRFS [2] is the Linux storage technology that has the best potential to increase data integrity without costing a lot of money. Basically a BTRFS internal RAID-1 should offer equal or greater data protection than RAID-6.

    As BTRFS is so important and so very different to any prior technology for Linux it’s not something that can be easily deployed in the same way as other filesystems. It is possible to easily switch between filesystems such as Ext4 and XFS because they work in much the same way: you have a single block device which the filesystem uses to create a single mount-point. BTRFS, by contrast, supports internal RAID so it may have multiple block devices, and it may offer multiple mountable filesystems and snapshots. Much of the functionality of Linux Software RAID and LVM is covered by BTRFS. So the sensible way to deploy BTRFS is to give it all your storage and not make use of any other RAID or LVM.

    So I decided to do a test installation. I started with a Debian install CD that was made shortly before the release of Squeeze (it was first to hand) and installed with BTRFS for the root filesystem. I then upgraded to Debian/Unstable to get the latest kernel as BTRFS is developing rapidly. The system failed on the first boot after upgrading to Unstable because the /etc/fstab entry for the root filesystem had the FSCK pass number set to 1 – which wasn’t going to work as no FSCK program has been written. I changed that number to 0 and it then worked.
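
    I made the fstab change by hand, but the same fix can be sketched as a small function that sets the sixth field (the FSCK pass number) to 0 for every BTRFS entry. The function name is my own, it works on the text of the file rather than editing /etc/fstab in place, and it rewrites the whitespace of any line it changes.

```python
def zero_fsck_pass(fstab_text, fstype="btrfs"):
    """Set the fsck pass number to 0 for entries of the given filesystem type.

    fstab fields are: device, mount point, type, options, dump, pass.
    Comments, blank lines, and other filesystem types are left untouched.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if not is_comment and len(fields) == 6 and fields[2] == fstype:
            fields[5] = "0"
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```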

    The initial install was on a desktop system that had a single IDE drive and a CD-ROM drive. For /boot I used a degraded RAID-1 and then after completing the installation I removed the CD-ROM drive and installed a second hard drive, after that it was easy to add the other device to the RAID-1. Then I tried to add a new device to the BTRFS group with the command “btrfs device add /dev/sdb2 /dev/sda2” and was informed that it can’t do that to a mounted filesystem! That will decrease the possibilities for using BTRFS on systems with hot-swap drives, I hope that the developers regard it as a bug.

    Then I booted with an ext3 filesystem for root and tried the “btrfs device add /dev/sdb2 /dev/sda2” again but got the error message “btrfs: sending ioctl 5000940a to a partition!” which is not even found by Google.

    The next thing that I wanted to do was to put a swap file on BTRFS, the benefits for having redundancy and checksums on swap space seem obvious – and other BTRFS features such as compression might give a benefit too. So I created a file by using dd to read from /dev/zero, ran mkswap on it and then tried to run swapon. But I was told that the file has holes and can’t be used. Automatically making zero blocks into holes is a useful feature in many situations, but not in this case.
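
    A quick way to see whether a file looks like it has holes is to compare its size with the space actually allocated to it. The sketch below does that from the stat data. It is only a heuristic and an assumption on my part about what to check: it is not the test that swapon performs, and filesystem compression or inline storage of small files can also make the allocated space smaller than the file size.

```python
import os


def looks_sparse(size_bytes, allocated_blocks, block_unit=512):
    """True if fewer bytes are allocated than the file's apparent size.

    st_blocks is counted in 512-byte units on Linux regardless of the
    filesystem block size, hence the default block_unit.
    """
    return allocated_blocks * block_unit < size_bytes


def has_holes(path):
    """Apply the heuristic to a file on disk."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```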

    So far my experience with BTRFS is that all the basic things work (IE storing files, directories, etc). But the advanced functions I wanted from BTRFS (mirroring and making a reliable swap space) failed. This is a bit disappointing, but BTRFS isn’t described as being ready for production yet.


  • More DRBD Performance tests

    I’ve previously written Some Notes on DRBD [1] and a post about DRBD Benchmarking [2].

    Previously I had determined that replication protocol C gives the best performance for DRBD, that the batch-time parameters for Ext4 aren’t worth touching for a single IDE disk, that barrier=0 gives a massive performance boost, and that DRBD gives a significant performance hit even when the secondary is not connected. Below are the results of some more tests of delivering mail from my Postal benchmark to my LMTP server which uses the Dovecot delivery agent to write it to disk, the rates are in messages per minute where each message is an average of 70K in size. The ext4 filesystem is used for all tests and the filesystem features list is “has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize“.

    Configuration                                 p4-2.8 (messages/minute)
    Default Ext4                                  1663
    barrier=0                                     2875
    DRBD no secondary al-extents=7                 645
    DRBD no secondary default                     2409
    DRBD no secondary al-extents=1024             2513
    DRBD no secondary al-extents=3389             2650
    DRBD connected                                1575
    DRBD connected al-extents=1024                1560
    DRBD connected al-extents=1024 Gig-E          1544

    The al-extents option determines the size of the dirty areas that need to be resynced when a failed node rejoins the cluster. The default is 127 extents of 4M each for a block size of 508MB to be synchronised. The maximum is 3389 for a synchronisation block size of just over 13G. Even with fast disks and gigabit Ethernet it’s going to take a while to synchronise things if dirty zones are 13GB in size. In my tests using the maximum size of al-extents gives a 10% performance benefit in disconnected mode while a size of 1024 gives a 4% performance boost. Changing the al-extents size seems to make no significant difference for a connected DRBD device.
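
    The figures in the previous paragraph can be reproduced from the table with a few lines of arithmetic. This is just a check of my numbers: the 4MB extent size is as stated above, the rates are copied from the table, and the function names are made up for the example.

```python
EXTENT_MB = 4  # size of one activity log extent, as stated above


def dirty_area_mb(al_extents):
    """Amount of data covered by the activity log, in MB."""
    return al_extents * EXTENT_MB


def gain_percent(rate, baseline):
    """Relative improvement of rate over baseline, in percent."""
    return (rate / baseline - 1) * 100


DEFAULT_RATE = 2409  # "DRBD no secondary default" from the table

print("default al-extents:", dirty_area_mb(127), "MB")
print("maximum al-extents:", dirty_area_mb(3389), "MB")
print("gain at 3389: %.1f%%" % gain_percent(2650, DEFAULT_RATE))
print("gain at 1024: %.1f%%" % gain_percent(2513, DEFAULT_RATE))
```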

    All the tests on connected DRBD devices were done with 100baseT apart from the last one which was a separate Gigabit Ethernet cable connecting the two systems.

    Conclusions

    For the level of traffic that I’m using it seems that Gigabit Ethernet provides no performance benefit. The fact that it gave a slightly lower result is not relevant as the difference is within the margin of error.

    Increasing the al-extents value helps with disconnected performance, a value of 1024 gives a 4% performance boost. I’m not sure that a value of 3389 is a good idea though.

    The ext4 barriers are disabled by DRBD so a disconnected DRBD device gives performance that is closer to a barrier=0 mount than a regular ext4 mount. With the significant performance difference between connected and disconnected modes it seems possible that for some usage scenarios it could be useful to disable the DRBD secondary at times of peak load – it depends on whether DRBD is used as a really current backup or a strict mirror.

    Future Tests

    I plan to do some tests of DRBD over Linux software RAID-1 and tests to compare RAID-1 with and without bitmap support. I also plan to do some tests with the BTRFS filesystem. I know it’s not ready for production but it would still be nice to know what the performance is like.

    But I won’t use the same systems, they don’t have enough CPU power. In my previous tests I established that a 1.5GHz P4 isn’t capable of driving the 20G IDE disk to its maximum capacity and I’m not sure that the 2.8GHz P4 is capable of running a RAID to its capacity. So I will use a dual-core 64bit system with a pair of SATA disks for future tests. The difference in performance between 20G IDE disks and 160G SATA disks should be a lot less than the performance difference between a 2.8GHz P4 and a dual-core 64bit CPU.


  • 5 Principles of Backup Software

    Everyone agrees that backups are generally a good thing. But it seems that there is a lot less agreement about how backups should work. Here is a list of 5 principles of backup software that seem to get ignored most of the time:

    (1/5) Backups should not be Application Specific

    It’s quite reasonable for people to want to extract data from a backup on a different platform. Maybe someone will want to extract data a few decades after the platform becomes obsolete. I believe that vendors of backup software have an ethical obligation to make it possible for customers to get their data out with minimal effort regardless of the circumstances.

    Often when writing a backup application there will be good reasons for not using the existing formats for data storage (tar, cpio, zip, etc). But ideally any data store which involves something conceptually similar to a collection of files in one larger file will use one of those formats. There have been backward compatible extensions to tar and zip for SE Linux contexts and for OS/2 EAs – the possibility of extending archive file formats with no consequence other than warnings on extraction with an unpatched utility has been demonstrated.
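
    As an illustration of extending an existing format rather than inventing a new one, the sketch below stores an extra attribute in a pax extended header of an ordinary tar archive using Python’s tarfile module. The key “EXAMPLE.context” and the function names are invented for the example; they are not the keys used by the SE Linux patches mentioned above.

```python
import io
import tarfile


def make_archive(name, payload, extra):
    """Create a one-file tar archive with extra pax header records."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        # Extra key/value records travel in the member's extended header.
        info.pax_headers = dict(extra)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_archive(blob):
    """Return (name, data, pax headers) for the first member."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        member = tar.getmembers()[0]
        data = tar.extractfile(member).read()
        return member.name, data, dict(member.pax_headers)
```

    A reader that knows nothing about the extra key still sees a normal tar file with a normal member, which is the whole point of using an extensible standard format.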

    For a backup which doesn’t involve source files (EG the contents of some sort of database) the data should be in a format that can be easily understood and parsed. Well designed XML is generally a reasonable option. Generally the format should involve plain text that is readable and easy to understand which is optionally compressed with a common compression utility (pkzip is a reasonable choice).
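
    A minimal sketch of that approach, assuming the database contents are available as a list of rows: write them out as simple XML and compress the result into a zip file. The element names, the function names, and the idea of one XML file per table are my own choices for the example, not a proposed standard.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def dump_rows(table, rows):
    """Serialise rows (a list of dicts) as a small XML document in bytes."""
    root = ET.Element("table", name=table)
    for row in rows:
        rec = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(rec, "field", name=column).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def backup_to_zip(table, rows):
    """Return the bytes of a zip archive holding <table>.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{table}.xml", dump_rows(table, rows))
    return buf.getvalue()
```

    Anyone with an unzip program and a text editor can read the result, even if the software that wrote it is long gone.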

    (2/5) Data Store Formats should be Published

    For every data store there should be public documentation about its format to allow future developers to write support for it. It really isn’t difficult to release some commented header files so that people can easily determine the data structures. This includes all data stores including databases and filesystems. If I suddenly find myself with a 15 year old image of a NTFS filesystem containing a proprietary database I should be able to find official header files for the version of NTFS and the database server in question so I can decode the data if it’s important enough.

    When an application vendor hides the data formats it gives the risk of substantial data loss at some future time. Imposing such risk on customers to try and prevent them from migrating to a rival product is unethical.

    (3/5) Backups should be forward and backward compatible

    It is entirely unreasonable for a vendor to demand that all their users install the latest versions of their software. There are lots of good reasons for not upgrading, including hardware not supporting new versions of the OS, lack of Internet access to perform the upgrade, application compatibility, and just liking the way the old version works. Even for the case of a critical security fix it should be possible to restore data without applying the fix.

    For any pair of versions of software that are only separated by a few versions it should be possible to backup data from one and restore to the other. Even if the data can’t be used directly (EG a backup of AMD64 programs that is restored on an i386 system) it should still be accessible. If a new version of the software doesn’t support the ancient file formats then it should be possible for the users to get a slightly older version which talks to both the old and new versions.

    Backups made with cpio or tar on 64bit systems running the latest development version of Linux and on 10yo 32bit proprietary Unix systems are interchangeable. Admittedly Unix is really good at preserving file format compatibility, but there is no technical reason why other systems can’t do the same. Source code to cpio, tar, and gzip is freely available!

    Apple TimeMachine fails badly in this regard: even a slightly older version of Mac OS can’t do a restore. It is however nice that most of the TimeMachine data is a tree of files which could be just copied to another system.
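
    As a small illustration of how little is needed for a portable backup, here is a minimal sketch that archives a directory with tar and gzip and restores it elsewhere. The function names backup_dir and restore_dir are made up for this example, and only options that both GNU and BSD tar accept are used.

```shell
#!/bin/sh
# Sketch only: backup_dir and restore_dir are hypothetical helper names.

# backup_dir SRC ARCHIVE: archive the contents of SRC into ARCHIVE.
backup_dir() {
    tar -C "$1" -cf - . | gzip -c > "$2"
}

# restore_dir ARCHIVE DEST: unpack ARCHIVE into DEST (created if missing).
restore_dir() {
    mkdir -p "$2" && gzip -dc "$1" | tar -C "$2" -xf -
}

# Example use (paths are placeholders):
#   backup_dir /home/user/data /backup/data.tar.gz
#   restore_dir /backup/data.tar.gz /tmp/restored
```

    Because the archive is a plain tar stream inside a plain gzip stream, any system with those two tools can read it regardless of word size or vendor.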

    (4/5) Backup Software should not be Dropped

    Sony Ericsson has made me hate them even more by putting the following message on their update web site:

    The Backup and Restore app will be overwritten and cannot be used to restore data. Check out Android Market for alternative apps to back up and restore your data, such as MyBackup.

    So if you own a Sony Ericsson phone and it is lost, stolen, or completely destroyed and all you have is a backup made by the Sony Ericsson tool then the one thing you absolutely can’t do is to buy a new Sony Ericsson phone to restore the data.

    I believe that anyone who releases backup software has an ethical obligation to support restoring to all equivalent systems. How difficult would it be to put a new free app in the Google Market that has as its sole purpose recovering old Sony Ericsson backups onto newer phones? It really can’t be that difficult, so even if they don’t want to waste critical ROM space by putting the feature in all new phones they can make it available to everyone who needs it. When compared to the cost of developing a new Android release for a series of phones the cost of writing such a restore program would be almost nothing.

    It is simply mind-boggling that Sony Ericsson go against their own commercial interests in this regard. Surely it would make good business sense to be able to sell replacements for all the lost and broken Sony Ericsson phones, but instead customers who get burned by broken backups are given an incentive to buy a product from any other vendor.

    (5/5) The greater the control over data the greater the obligation for protecting it

    If you have data stored in a simple and standard manner (EG the /DCIM directory containing MP4 and JPEG files that is on the USB accessible storage in every modern phone) then IMHO it’s quite OK to leave customers to their own devices in terms of backups. Typical users can work out that if they don’t backup their pictures then they risk losing them, and they can work out how to do it.

    My Sony Ericsson phones have data stored under /data (settings for Android applications) which is apparently only accessible as root. Sony Ericsson have denied me root access which prevents me running backup programs such as Titanium Backup, therefore I believe that they have a great obligation to provide a way of making a backup of this data and restoring it on a new phone or a phone that has been updated. To just provide phone upgrade instructions which tell me that my phone will be entirely wiped and that I should search the App Market for backup programs is unacceptable.

    I believe that there are two ethical options available to Sony Ericsson at this time, one is to make it easy to root phones so that Titanium Backup and similar programs can be used, and the other option is to release a suitable backup program for older phones. Based on experience I don’t expect Sony Ericsson to choose either option.

    Now it is also a bad thing for the Android application developers to make it difficult or impossible to backup their data. For example the Wiki for one Android game gives instructions for moving the saved game files to a new phone which start with “root your phone”. The developers of that game should have read the Wiki, realised that rooting a phone for the mundane task of transferring saved game files is totally unreasonable, and developed a better alternative.

    The best thing for developers to do is to allow the users to access their own data in the most convenient manner. Then it becomes the user’s responsibility to manage it and they can concentrate on improving their application.

    Why Freedom is Important

    Installing CyanogenMod on my Galaxy S was painful, but having root access so I can do anything I want is a great benefit. If phone vendors would do the right thing then I could recommend that other people use the vendor release, but it seems that vendors can be expected to act unethically. So I can’t recommend that anyone use an un-modded Android phone at any time. I also can’t recommend ever buying a Sony Ericsson product, not even when it’s really cheap.

    Google have done a great thing with their Data Liberation Front [1]. Not only are they providing access to the data they store on our behalf (which is a good thing) but they have a mission statement that demands the same behavior from other companies – they make it an issue of competitive advantage! So while Sony Ericsson and other companies might not see a benefit in making people like me stop hating them, failing to be as effective in marketing as Google is a real issue. Data Liberation is something that should be discussed at board elections of IT companies.

    Keep in mind the fact that ethics are not just about doing nice things, they are about establishing expectations of conduct that will be used by people who deal with you in future. Sony Ericsson has shown that I should expect that they will treat the integrity of my data with contempt and I will keep this in mind every time I decline an opportunity to purchase their products. Google has shown that they consider the protection of my data as an important issue and therefore I can be confident when using and recommending their services that I won’t get stuck with data that is locked away.

    While Google has demonstrated that corporations can do the right thing, the vast majority of evidence suggests that we should never trust a corporation with anything that we might want to retrieve when it’s not immediately profitable for the corporation. Therefore avoiding commercial services for storing important data is the sensible thing to do.


  • Reliability of RAID

    ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.
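
    The argument can be made concrete with some arithmetic. The sketch below assumes a drive specified at one unrecoverable read error per 10^14 bits read (a figure commonly quoted for consumer drives, not a number taken from this post) and assumes that errors are independent. The function name ure_probability is made up for the example.

```shell
# ure_probability SIZE_TB [RATE]: chance of at least one unrecoverable read
# error when reading SIZE_TB terabytes (10^12 bytes each).
# RATE is errors per bit read; the default of 1e-14 is an assumed value.
ure_probability() {
    awk -v tb="$1" -v rate="${2:-1e-14}" 'BEGIN {
        bits = tb * 1e12 * 8
        # P = 1 - (1 - rate)^bits, using (1 - rate)^bits ~ exp(-rate * bits)
        printf "%.3f\n", 1 - exp(-rate * bits)
    }'
}
```

    With those assumptions “ure_probability 12” (reading 12TB of surviving disks during a rebuild) gives a probability of roughly 0.6, which is why larger drives push towards more redundancy.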

    Of course most systems in the field aren’t using even RAID-6. All the most economical hosting options involve just RAID-1 and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct: you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).

    For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn’t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.

    Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of “consumer” disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with “consumer” disks that it is very improbable that I’m not getting such errors. So the question is, how can I discover such errors and fix them?

    In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn’t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn’t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn’t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won’t know which disk is failing.
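
    Independently of that patch, the kernel exposes a per-array mismatch counter in sysfs after a check has run. The following is a minimal sketch of reading it; the function name report_mismatches is made up, and the optional directory argument exists only so the function can be tried against a test tree instead of /sys/block.

```shell
# report_mismatches [SYSFS_ROOT]: print one line for each md array whose
# mismatch_cnt is non-zero. SYSFS_ROOT defaults to /sys/block.
report_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        # Skips the unexpanded pattern when there are no md devices.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        if [ "$count" -gt 0 ]; then
            dev=${f#"$root"/}
            echo "${dev%%/*}: mismatch_cnt=$count"
        fi
    done
}
```

    Run from cron after the monthly check, any output could be piped to whatever mailer the system uses. This only reports that a mismatch was found; it does not say which disk held the bad data.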

    Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used, in one case the kernel will check 2.7TB of RAID even when there’s only 470G in use!

    The BTRFS Filesystem

    The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!

    I’ve just done a quick test of this. I created a filesystem with the command “mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?” and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem – that worked correctly although I was disappointed that it didn’t report any errors, I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command “btrfs scrub start .” and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command “btrfs scrub status .” just gave me a count of the corrected errors and didn’t mention which device had the errors.
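
    The test above can be sketched as a script. The device names, the mount point, and the dd offset and size are assumptions made for illustration (the post doesn’t record them), and the commands are destructive, so the sketch only prints them unless DRY_RUN is set to 0.

```shell
# Sketch of the corruption test. With DRY_RUN=1 (the default here) the
# commands are printed instead of being executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# btrfs_corruption_test DEV_A DEV_B MOUNTPOINT
btrfs_corruption_test() {
    dev_a="$1" dev_b="$2" mnt="$3"
    run mkfs.btrfs -m raid1 -d raid1 "$dev_a" "$dev_b"
    run mount "$dev_a" "$mnt"
    # ... fill the filesystem with data at this point ...
    # Overwrite part of the second device behind the filesystem's back
    # (offset and length are arbitrary example values).
    run dd if=/dev/urandom of="$dev_b" bs=1M seek=1024 count=100 conv=notrunc
    # Scrub reads everything and verifies checksums; -B waits for completion.
    run btrfs scrub start -B "$mnt"
    run btrfs scrub status "$mnt"
}
```

    Calling “btrfs_corruption_test /dev/vg0/raida /dev/vg0/raidb /mnt/test” in dry-run mode just lists the commands, which makes it easy to review them before running anything against real devices.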

    It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.

    BTRFS is rumored to have performance problems, I will test this but don’t have time to do so right now. Anyway I’m not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss.

    BTRFS and Xen

    The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.

    This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.

    BTRFS and DRBD

    The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.

    It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!

    Comparing Risks

    I don’t want to use BTRFS in production now because of the risk of bugs. While it’s unlikely to have really serious bugs it’s theoretically possible that a bug could deny access to data until kernel code is fixed and it’s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it’s almost certain that I will lose small amounts of data and it’s most probable that I have silently lost data on many occasions without realising.


  • Some Notes on DRBD

    DRBD is a system for replicating a block device across multiple systems. It’s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it’s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.

    I’m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I’m not using DRBD because the failure conditions that I’m planning for don’t include two disks entirely failing. I’m planning for a system having an outage for a while so it’s OK to have some inbound and outbound mail delayed but it’s not OK for the mail store to be unavailable.

    Global changes I’ve made in /etc/drbd.d/global_common.conf

    In the common section I changed the protocol from “C” to “B“, this means that a write() system call returns after data is committed locally and sent to the other node. This means that if the primary node goes permanently offline AND if the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don’t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete.

    In the net section I added the following:

    sndbuf-size 512k;
    data-integrity-alg sha1;

    This uses a larger network sending buffer (apparently good for fast local networks, although I’d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and uses sha1 hashes on all packets (why does it default to no data integrity?).
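
    Put together, the relevant parts of the file would look something like the fragment below. This is a sketch that assumes the DRBD 8.3 style of syntax, in which the protocol statement sits directly in the common section; other sections and options of the shipped file are omitted.

```
common {
    # return from write() once data is on the local disk and sent to the peer
    protocol B;

    net {
        sndbuf-size 512k;
        data-integrity-alg sha1;
    }
}
```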

    Reserved Numbers

    The default port number that is used is 7789. I think it’s best to use ports below 1024 for system services so I’ve set up some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I’ll use different ports even if they aren’t configured to ever run on the same system. You never know when the cluster assignment will change and DRBD port numbers seem like something that could potentially cause real problems if there was a port conflict.

    Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU’s resources.
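
    A numbering policy like this is easy to check mechanically. The sketch below assumes that the .res files from every cluster on the network have been copied into one directory and that each “address” and “device” statement is on a line of its own; the function name find_drbd_duplicates is made up for the example.

```shell
# find_drbd_duplicates DIR: print every port number and device name that is
# used by more than one *.res file in DIR.
find_drbd_duplicates() {
    dir="$1"
    # A resource uses the same port on each of its nodes, so count each
    # port once per file before looking for repeats across files.
    for f in "$dir"/*.res; do
        [ -r "$f" ] || continue
        sed -n 's/^[[:space:]]*address[[:space:]].*:\([0-9][0-9]*\);.*/port \1/p' "$f" | sort -u
    done | sort | uniq -d
    # Same approach for the device nodes.
    for f in "$dir"/*.res; do
        [ -r "$f" ] || continue
        sed -n 's/^[[:space:]]*device[[:space:]][[:space:]]*\([^;[:space:]]*\).*/device \1/p' "$f" | sort -u
    done | sort | uniq -d
}
```

    No output means that every resource has its own port and its own /dev/drbdX, so a line pasted from one system’s /etc/fstab can’t refer to a different resource on another system.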

    It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.

    As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn’t be mounted locally and which mount options are permitted – it MIGHT be OK to do a read-only mount of a DomU’s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.

    Initial Synchronisation

    After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don’t then it will probably refuse to accept one copy of the data as primary as it can’t seem to realise that both are inconsistent. I can’t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there

    The solution sometimes seems to be to run “drbdsetup /dev/drbd0 primary -” (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running “drbdadm -- --overwrite-data-of-peer primary db0-mysql” (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res).

    Also it seems that some commands can only be run from one node. So if you have a primary node that’s in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown then while you would expect to be able to connect from the secondary node it appears that nothing other than a “drbdadm connect” command run from the primary node will get things going.


  • Hetzner Failover Konfiguration

    The Wiki documenting how to configure IP failover for Hetzner servers [1] is closely tied to the Linux HA project [2]. This is OK if you want a Heartbeat cluster, but if you want manual failover or an automatic failover from some other form of script then it’s not useful. So I’ll provide the simplest possible documentation.

    Below is a sample of shell code to get the current failover settings and change them to point the IP address to a different server. In my tests this takes between 19 and 20 seconds to complete. When the command completes the new server will be active and no IP packets will be lost, but TCP connections will be broken if the servers don’t support shared TCP state.

    # username and password for the Hetzner robot
    USERPASS=USER:PASS
    # public (failover) IP
    IP=10.1.2.3
    # new active server
    ACTIVE=10.2.3.4
    # get current values
    curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP"
    # change active server
    curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP" -d "active_server_ip=$ACTIVE"

    Below is the output of the above commands showing the old state and the new state.

    failover:
      ip: 10.1.2.3
      netmask: 255.255.255.255
      server_ip: 10.2.3.3
      active_server_ip: 10.2.3.4

    failover:
      ip: 10.1.2.3
      netmask: 255.255.255.255
      server_ip: 10.2.3.4
      active_server_ip: 10.2.3.4
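
    For a failover script it is useful to confirm that the reply really names the new server. Here is a minimal sketch that pulls the active_server_ip field out of a reply like the ones above; the function names are made up, and it relies only on the field appearing on a line of its own as shown.

```shell
# active_server_ip: read the robot's YAML reply on stdin and print the
# value of its active_server_ip field.
active_server_ip() {
    sed -n 's/^[[:space:]]*active_server_ip:[[:space:]]*//p' | head -n 1
}

# failover_is_active EXPECTED: succeed if the reply on stdin names
# EXPECTED as the active server.
failover_is_active() {
    [ "$(active_server_ip)" = "$1" ]
}

# Example use with the variables from the script above:
#   curl -s -u "$USERPASS" "https://robot-ws.your-server.de/failover.yaml/$IP" \
#       | failover_is_active "$ACTIVE" || echo "failover did not take effect"
```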




©2012 etbe - Russell Coker