Process Monitoring

Since forking the Mon project to etbemon [1] I’ve been spending a lot of time working on the monitor scripts. Actually monitoring something is usually quite easy, deciding what to monitor tends to be the hard part. The process monitoring script ps.monitor is the one I’m about to redesign.

Here are some of my ideas for monitoring processes. Please comment if you have any suggestions for how do do things better.

For people who don’t use mon, the monitor scripts return 0 if everything is OK and 1 if there’s a problem along with using stdout to display an error message. While I’m not aware of anyone hooking mon scripts into a different monitoring system that’s going to be easy to do. One thing I plan to work on in the future is interoperability between mon and other systems such as Nagios.

Basic Monitoring

ps.monitor tor:1-1 master:1-2 auditd:1-1 cron:1-5 rsyslogd:1-1 dbus-daemon:1- sshd:1- watchdog:1-2

I’m currently planning some sort of rewrite of the process monitoring script. The current functionality is to have a list of process names on the command line with minimum and maximum numbers for the instances of the process in question. The above is a sample of the configuration of the monitor. There are some limitations to this, the “master” process in this instance refers to the main process of Postfix, but other daemons use the same process name (it’s one of those names that’s wrong because it’s so obvious). One obvious solution to this is to give the option of specifying the full path so that /usr/lib/postfix/sbin/master can be differentiated from all the other programs named master.

The next issue is processes that may run on behalf of multiple users. With sshd there is a single process to accept new connections running as root and a process running under the UID of each logged in user. So the number of sshd processes running as root will be one greater than the number of root login sessions. This means that if a sysadmin logs in directly as root via ssh (which is controversial and not the topic of this post – merely something that people do which I have to support) and the master process then crashes (or the sysadmin stops it either accidentally or deliberately) there won’t be an alert about the missing process. Of course the correct thing to do is to have a monitor talk to port 22 and look for the string “SSH-2.0-OpenSSH_”. Sometimes there are multiple instances of a daemon running under different UIDs that need to be monitored separately. So obviously we need the ability to monitor processes by UID.

In many cases process monitoring can be replaced by monitoring of service ports. So if something is listening on port 25 then it probably means that the Postfix “master” process is running regardless of what other “master” processes there are. But for my use I find it handy to have multiple monitors, if I get a Jabber message about being unable to send mail to a server immediately followed by a Jabber message from that server saying that “master” isn’t running I don’t need to fully wake up to know where the problem is.

SE Linux

One feature that I want is monitoring SE Linux contexts of processes in the same way as monitoring UIDs. While I’m not interested in writing tests for other security systems I would be happy to include code that other people write. So whatever I do I want to make it flexible enough to work with multiple security systems.

Transient Processes

Most daemons have a second process of the same name running during the startup process. This means if you monitor for exactly 1 instance of a process you may get an alert about 2 processes running when “logrotate” or something similar restarts the daemon. Also you may get an alert about 0 instances if the check happens to run at exactly the wrong time during the restart. My current way of dealing with this on my servers is to not alert until the second failure event with the “alertafter 2” directive. The “failure_interval” directive allows specifying the time between checks when the monitor is in a failed state, setting that to a low value means that waiting for a second failure result doesn’t delay the notification much.

To deal with this I’ve been thinking of making the ps.monitor script automatically check again after a specified delay. I think that solving the problem with a single parameter to the monitor script is better than using 2 configuration directives to mon to work around it.

CPU Use

Mon currently has a loadavg.monitor script that to check the load average. But that won’t catch the case of a single process using too much CPU time but not enough to raise the system load average. Also it won’t catch the case of a CPU hungry process going quiet (EG when the SETI at Home server goes down) while another process goes into an infinite loop. One way of addressing this would be to have the ps.monitor script have yet another configuration option to monitor CPU use, but this might get confusing. Another option would be to have a separate script that alerts on any process that uses more than a specified percentage of CPU time over it’s lifetime or over the last few seconds unless it’s in a whitelist of processes and users who are exempt from such checks. Probably every regular user would be exempt from such checks because you never know when they will run a file compression program. Also there is a short list of daemons that are excluded (like BOINC) and system processes (like gzip which is run from several cron jobs).

Monitoring for Exclusion

A common programming mistake is to call setuid() before setgid() which means that the program doesn’t have permission to call setgid(). If return codes aren’t checked (and people who make such rookie mistakes tend not to check return codes) then the process keeps elevated permissions. Checking for processes running as GID 0 but not UID 0 would be handy. As an aside a quick examination of a Debian/Testing workstation didn’t show any obvious way that a process with GID 0 could gain elevated privileges, but that could change with one chmod 770 command.

On a SE Linux system there should be only one process running with the domain init_t. Currently that doesn’t happen in Stretch systems running daemons such as mysqld and tor due to policy not matching the recent functionality of systemd as requested by daemon service files. Such issues will keep occurring so we need automated tests for them.

Automated tests for configuration errors that might impact system security is a bigger issue, I’ll probably write a separate blog post about it.

Apache Mesos on Debian

I decided to try packaging Mesos for Debian/Stretch. I had a spare system with a i7-930 CPU, 48G of RAM, and SSDs to use for building. The i7-930 isn’t really fast by today’s standards, but 48G of RAM and SSD storage mean that overall it’s a decent build system – faster than most systems I run (for myself and for clients) and probably faster than most systems used by Debian Developers for build purposes.

There’s a github issue about the lack of an upstream package for Debian/Stretch [1]. That upstream issue could probably be worked around by adding Jessie sources to the APT sources.list file, but a package for Stretch is what is needed anyway.

Here is the documentation on building for Debian [2]. The list of packages it gives as build dependencies is incomplete, it also needs zlib1g-dev libapr1-dev libcurl4-nss-dev openjdk-8-jdk maven libsasl2-dev libsvn-dev. So BUILDING this software requires Java + Maven, Ruby, and Python along with autoconf, libtool, and all the usual Unix build tools. It also requires the FPM (Fucking Package Management) tool, I take the choice of name as an indication of the professionalism of the author.

Building the software on my i7 system took 79 minutes which includes 76 minutes of CPU time (I didn’t use the -j option to make). At the end of the build it turned out that I had mistakenly failed to install the Fucking Package Management “gem” and it aborted. At this stage I gave up on Mesos, the pain involved exceeds my interest in trying it out.

How to do it Better

One of the aims of Free Software is that bugs are more likely to get solved if many people look at them. There aren’t many people who will devote 76 minutes of CPU time on a moderately fast system to investigate a single bug. To deal with this software should be prepared as components. An example of this is the SE Linux project which has 13 source modules in the latest release [3]. Of those 13 only 5 are really required. So anyone who wants to start on SE Linux from source (without considering a distribution like Debian or Fedora that has it packaged) can build the 5 most important ones. Also anyone who has an issue with SE Linux on their system can find the one source package that is relevant and study it with a short compile time. As an aside I’ve been working on SE Linux since long before it was split into so many separate source packages and know the code well, but I still find the separation convenient – I rarely need to work on more than a small subset of the code at one time.

The requirement of Java, Ruby, and Python to build Mesos could be partly due to language interfaces to call Mesos interfaces from Ruby and Python. Ohe solution to that is to have the C libraries and header files to call Mesos and have separate packages that depend on those libraries and headers to provide the bindings for other languages. Another solution is to have autoconf detect that some languages aren’t installed and just not try to compile bindings for them (this is one of the purposes of autoconf).

The use of a tool like Fucking Package Management means that you don’t get help from experts in the various distributions in making better packages. When there is a FOSS project with a debian subdirectory that makes barely functional packages then you will be likely to have an experienced Debian Developer offer a patch to improve it (I’ve offered patches for such things on many occasions). When there is a FOSS project that uses a tool that is never used by Debian developers (or developers of Fedora and other distributions) then the only patches you will get will be from inexperienced people.

A software build process should not download anything from the Internet. The source archive should contain everything that is needed and there should be dependencies for external software. Any downloads from the Internet need to be protected from MITM attacks which means that a responsible software developer has to read through the build system and make sure that appropriate PGP signature checks etc are performed. It could be that the files that the Mesos build downloaded from the Apache site had appropriate PGP checks performed – but it would take me extra time and effort to verify this and I can’t distribute software without being sure of this. Also reproducible builds are one of the latest things we aim for in the Debian project, this means we can’t just download files from web sites because the next build might get a different version.

Finally the fpm (Fucking Package Management) tool is a Ruby Gem that has to be installed with the “gem install” command. Any time you specify a gem install command you should include the -v option to ensure that everyone is using the same version of that gem, otherwise there is no guarantee that people who follow your documentation will get the same results. Also a quick Google search didn’t indicate whether gem install checks PGP keys or verifies data integrity in other ways. If I’m going to compile software for other people to use I’m concerned about getting unexpected results with such things. A Google search indicates that Ruby people were worried about such things in 2013 but doesn’t indicate whether they solved the problem properly.

Monitoring of Monitoring

I was recently asked to get data from a computer that controlled security cameras after a crime had been committed. Due to the potential issues I refused to collect the computer and insisted on performing the work at the office of the company in question. Hard drives are vulnerable to damage from vibration and there is always a risk involved in moving hard drives or systems containing them. A hard drive with evidence of a crime provides additional potential complications. So I wanted to stay within view of the man who commissioned the work just so there could be no misunderstanding.

The system had a single IDE disk. The fact that it had an IDE disk is an indication of the age of the system. One of the benefits of SATA over IDE is that swapping disks is much easier, SATA is designed for hot-swap and even systems that don’t support hot-swap will have less risk of mechanical damage when changing disks if SATA is used instead of IDE. For an appliance type system where a disk might be expected to be changed by someone who’s not a sysadmin SATA provides more benefits over IDE than for some other use cases.

I connected the IDE disk to a USB-IDE device so I could read it from my laptop. But the disk just made repeated buzzing sounds while failing to spin up. This is an indication that the drive was probably experiencing “stiction” which is where the heads stick to the platters and the drive motor isn’t strong enough to pull them off. In some cases hitting a drive will get it working again, but I’m certainly not going to hit a drive that might be subject to legal action! I recommended referring the drive to a data recovery company.

The probability of getting useful data from the disk in question seems very low. It could be that the drive had stiction for months or years. If the drive is recovered it might turn out to have data from years ago and not the recent data that is desired. It is possible that the drive only got stiction after being turned off, but I’ll probably never know.

Doing it Properly

Ever since RAID was introduced there was never an excuse for having a single disk on it’s own with important data. Linux Software RAID didn’t support online rebuild when 10G was a large disk. But since the late 90’s it has worked well and there’s no reason not to use it. The probability of a single IDE disk surviving long enough on it’s own to capture useful security data is not particularly good.

Even with 2 disks in a RAID-1 configuration there is a chance of data loss. Many years ago I ran a server at my parents’ house with 2 disks in a RAID-1 and both disks had errors on one hot summer. I wrote a program that’s like ddrescue but which would read from the second disk if the first gave a read error and ended up not losing any important data AFAIK. BTRFS has some potential benefits for recovering from such situations but I don’t recommend deploying BTRFS in embedded systems any time soon.

Monitoring is a requirement for reliable operation. For desktop systems you can get by without specific monitoring, but that is because you are effectively relying on the user monitoring it themself. Since I started using mon (which is very easy to setup) I’ve had it notify me of some problems with my laptop that I wouldn’t have otherwise noticed. I think that ideally for desktop systems you should have monitoring of disk space, temperature, and certain critical daemons that need to be running but which the user wouldn’t immediately notice if they crashed (such as cron and syslogd).

There are some companies that provide 3G SIMs for embedded/IoT applications with rates that are significantly cheaper than any of the usual phone/tablet plans if you use small amounts of data or SMS. For a reliable CCTV system the best thing to do would be to have a monitoring contract and have the monitoring system trigger an event if there’s a problem with the hard drive etc and also if the system fails to send a “I’m OK” message for a certain period of time.

I don’t know if people are selling CCTV systems without monitoring to compete on price or if companies are cancelling monitoring contracts to save money. But whichever is happening it’s significantly reducing the value derived from monitoring.

Basics of Backups

I’ve recently had some discussions about backups with people who aren’t computer experts, so I decided to blog about this for the benefit of everyone. Note that this post will deliberately avoid issues that require great knowledge of computers. I have written other posts that will benefit experts.

Essential Requirements

Everything that matters must be stored in at least 3 places. Every storage device will die eventually. Every backup will die eventually. If you have 2 backups then you are covered for the primary storage failing and the first backup failing. Note that I’m not saying “only have 2 backups” (I have many more) but 2 is the bare minimum.

Backups must be in multiple places. One way of losing data is if your house burns down, if that happens all backup devices stored there will be destroyed. You must have backups off-site. A good option is to have backup devices stored by trusted people (friends and relatives are often good options).

It must not be possible for one event to wipe out all backups. Some people use “cloud” backups, there are many ways of doing this with Dropbox, Google Drive, etc. Some of these even have free options for small amounts of storage, for example Google Drive appears to have 15G of free storage which is more than enough for all your best photos and all your financial records. The downside to cloud backups is that a computer criminal who gets access to your PC can wipe it and the backups. Cloud backup can be a part of a sensible backup strategy but it can’t be relied on (also see the paragraph about having at least 2 backups).

Backup Devices

USB flash “sticks” are cheap and easy to use. The quality of some of those devices isn’t too good, but the low price and small size means that you can buy more of them. It would be quite easy to buy 10 USB sticks for multiple copies of data.

Stores that sell office-supplies sell USB attached hard drives which are quite affordable now. It’s easy to buy a couple of those for backup use.

The cheapest option for backing up moderate amounts of data is to get a USB-SATA device. This connects to the PC by USB and has a cradle to accept a SATA hard drive. That allows you to buy cheap SATA disks for backups and even use older disks as backups.

With choosing backup devices consider the environment that they will be stored in. If you want to store a backup in the glove box of your car (which could be good when travelling) then a SD card or USB flash device would be a good choice because they are resistant to physical damage. Note that if you have no other options for off-site storage then the glove box of your car will probably survive if your house burns down.

Multiple Backups

It’s not uncommon for data corruption or mistakes to be discovered some time after it happens. Also in recent times there is a variety of malware that encrypts files and then demands a ransom payment for the decryption key.

To address these problems you should have older backups stored. It’s not uncommon in a corporate environment to have backups every day stored for a week, backups every week stored for a month, and monthly backups stored for some years.

For a home use scenario it’s more common to make backups every week or so and take backups to store off-site when it’s convenient.

Offsite Backups

One common form of off-site backup is to store backup devices at work. If you work in an office then you will probably have some space in a desk drawer for personal items. If you don’t work in an office but have a locker at work then that’s good for storage too, if there is high humidity then SD cards will survive better than hard drives. Make sure that you encrypt all data you store in such places or make sure that it’s not the secret data!

Banks have a variety of ways of storing items. Bank safe deposit boxes can be used for anything that fits and can fit hard drives. If you have a mortgage your bank might give you free storage of “papers” as part of the service (Commonwealth Bank of Australia used to offer that). A few USB sticks or SD cards in an envelope could fit the “papers” criteria. An accounting firm may also store documents for free for you.

If you put a backup on USB or SD storage in your waller then that can also be a good offsite backup. For most people losing data from disk is more common than losing their wallet.

A modern mobile phone can also be used for backing up data while travelling. For a few years I’ve been doing that. But note that you have to encrypt all data stored on a phone so an attacker who compromises your phone can’t steal it. In a typical phone configuration the mass storage area is much less protected than application data. Also note that customs and border control agents for some countries can compel you to provide the keys for encrypted data.

A friend suggested burying a backup device in a sealed plastic container filled with dessicant. That would survive your house burning down and in theory should work. I don’t know of anyone who’s tried it.

Testing

On occasion you should try to read the data from your backups and compare it to the original data. It sometimes happens that backups are discovered to be useless after years of operation.

Secret Data

Before starting a backup it’s worth considering which of the data is secret and which isn’t. Data that is secret needs to be treated differently and a mixture of secret and less secret data needs to be treated as if it’s all secret.

One category of secret data is financial data. If your accountant provides document storage then they can store that, generally your accountant will have all of your secret financial data anyway.

Passwords need to be kept secret but they are also very small. So making a written or printed copy of the passwords is part of a good backup strategy. There are options for backing up paper that don’t apply to data.

One category of data that is not secret is photos. Photos of holidays, friends, etc are generally not that secret and they can also comprise a large portion of the data volume that needs to be backed up. Apparently some people have a backup strategy for such photos that involves downloading from Facebook to restore, that will help with some problems but it’s not adequate overall. But any data that is on Facebook isn’t that secret and can be stored off-site without encryption.

Backup Corruption

With the amounts of data that are used nowadays the probability of data corruption is increasing. If you use any compression program with the data that is backed up (even data that can’t be compressed such as JPEGs) then errors will be detected when you extract the data. So if you have backup ZIP files on 2 hard drives and one of them gets corrupt you will easily be able to determine which one has the correct data.

Failing Systems – update 2016-08-22

When a system starts to fail it may limp along for years and work reasonably well, or it may totally fail soon. At the first sign of trouble you should immediately make a full backup to separate media. Use different media to your regular backups in case the data is corrupt so you don’t overwrite good backups with bad ones.

One traditional sign of problems has been hard drives that make unusual sounds. Modern drives are fairly quiet so this might not be loud enough to notice. Another sign is hard drives that take unusually large amounts of time to read data. If a drive has some problems it might read a sector hundreds or even thousands of times until it gets the data which dramatically reduces system performance. There are lots of other performance problems that can occur (system overheating, software misconfiguration, and others), most of which are correlated with potential data loss.

A modern SSD storage device (as used in a lot of the recent laptops) doesn’t tend to go slow when it nears the end of it’s life. It is more likely to just randomly fail entirely and then work again after a reboot. There are many causes of systems randomly hanging or crashing (of which overheating is common), but they are all correlated with data loss so a good backup is a good idea.

When in doubt make a backup.

Any Suggestions?

If you have any other ideas for backups by typical home users then please leave a comment. Don’t comment on expert issues though, I have other posts for that.

BTRFS Training

Some years ago Barwon South Water gave LUV 3 old 1RU Sun servers for any use related to free software. We gave one of those servers to the Canberra makerlab and another is used as the server for the LUV mailing lists and web site and the 3rd server was put aside for training. The servers have hot-swap 15,000rpm SAS disks – IE disks that have a replacement cost greater than the budget we have for hardware. As we were given a spare 70G disk (and a 140G disk can replace a 70G disk) the LUV server has 2*70G disks and the 140G disks (which can’t be replaced) are in the server for training.

On Saturday I ran a BTRFS and ZFS training session for the LUV Beginners’ SIG. This was inspired by the amount of discussion of those filesystems on the mailing list and the amount of interest when we have lectures on those topics.

The training went well, the meeting was better attended than most Beginners’ SIG meetings and the people who attended it seemed to enjoy it. One thing that I will do better in future is clearly documenting commands that are expected to fail and documenting how to login to the system. The users all logged in to accounts on a Xen server and then ssh’d to root at their DomU. I think that it would have saved a bit of time if I had aliased commands like “btrfs” to “echo you must login to your virtual server first” or made the shell prompt at the Dom0 include instructions to login to the DomU.

Each user or group had a virtual machine. The server has 32G of RAM and I ran 14 virtual servers that each had 2G of RAM. In retrospect I should have configured fewer servers and asked people to work in groups, that would allow more RAM for each virtual server and also more RAM for the Dom0. The Dom0 was running a BTRFS RAID-1 filesystem and each virtual machine had a snapshot of the block devices from my master image for the training. Performance was quite good initially as the OS image was shared and fit into cache. But when many users were corrupting and scrubbing filesystems performance became very poor. The disks performed well (sustaining over 100 writes per second) but that’s not much when shared between 14 active users.

The ZFS part of the tutorial was based on RAID-Z (I didn’t use RAID-5/6 in BTRFS because it’s not ready to use and didn’t use RAID-1 in ZFS because most people want RAID-Z). Each user had 5*4G virtual disks (2 for the OS and 3 for BTRFS and ZFS testing). By the end of the training session there was about 76G of storage used in the filesystem (including the space used by the OS for the Dom0), so each user had something like 5G of unique data.

We are now considering what other training we can run on that server. I’m thinking of running training on DNS and email. Suggestions for other topics would be appreciated. For training that’s not disk intensive we could run many more than 14 virtual machines, 60 or more should be possible.

Below are the notes from the BTRFS part of the training, anyone could do this on their own if they substitute 2 empty partitions for /dev/xvdd and /dev/xvde. On a Debian/Jessie system all that you need to do to get ready for this is to install the btrfs-tools package. Note that this does have some risk if you make a typo. An advantage of doing this sort of thing in a virtual machine is that there’s no possibility of breaking things that matter.

  1. Making the filesystem
    1. Make the filesystem, this makes a filesystem that spans 2 devices (note you must use the-f option if there was already a filesystem on those devices):
      mkfs.btrfs /dev/xvdd /dev/xvde
    2. Use file(1) to see basic data from the superblocks:
      file -s /dev/xvdd /dev/xvde
    3. Mount the filesystem (can mount either block device, the kernel knows they belong together):
      mount /dev/xvdd /mnt/tmp
    4. See a BTRFS df of the filesystem, shows what type of RAID is used:
      btrfs filesystem df /mnt/tmp
    5. See more information about FS device use:
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem to change it to RAID-1 and verify the change, note that some parts of the filesystem were single and RAID-0 before this change):
      btrfs balance start -dconvert=raid1 -mconvert=raid1 -sconvert=raid1 –force /mnt/tmp
      btrfs filesystem df /mnt/tmp
    7. See if there are any errors, shouldn’t be any (yet):
      btrfs device stats /mnt/tmp
    8. Copy some files to the filesystem:
      cp -r /usr /mnt/tmp
    9. Check the filesystem for basic consistency (only checks checksums):
      btrfs scrub start -B -d /mnt/tmp
  2. Online corruption
    1. Corrupt the filesystem:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    2. Scrub again, should give a warning about errors:
      btrfs scrub start -B /mnt/tmp
    3. Check error count:
      btrfs device stats /mnt/tmp
    4. Corrupt it again:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    5. Unmount it:
      umount /mnt/tmp
    6. In another terminal follow the kernel log:
      tail -f /var/log/kern.log
    7. Mount it again and observe it correcting errors on mount:
      mount /dev/xvdd /mnt/tmp
    8. Run a diff, observe kernel error messages and observe that diff reports no file differences:
      diff -ru /usr /mnt/tmp/usr/
    9. Run another scrub, this will probably correct some errors which weren’t discovered by diff:
      btrfs scrub start -B -d /mnt/tmp
  3. Offline corruption
    1. Umount the filesystem, corrupt the start, then try mounting it again which will fail because the superblocks were wiped:
      umount /mnt/tmp
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=200
      mount /dev/xvdd /mnt/tmp
      mount /dev/xvde /mnt/tmp
    2. Note that the filesystem was not mountable due to a lack of a superblock. It might be possible to recover from this but that’s more advanced so we will restore the RAID.
      Mount the filesystem in a degraded RAID mode, this allows full operation.
      mount /dev/xvde /mnt/tmp -o degraded
    3. Add /dev/xvdd back to the RAID:
      btrfs device add /dev/xvdd /mnt/tmp
    4. Show the filesystem devices, observe that xvdd is listed twice, the missing device and the one that was just added:
      btrfs filesystem show /mnt/tmp
    5. Remove the missing device and observe the change:
      btrfs device delete missing /mnt/tmp
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem, not sure this is necessary but it’s good practice to do it when in doubt:
      btrfs balance start /mnt/tmp
    7. Umount and mount it, note that the degraded option is not needed:
      umount /mnt/tmp
      mount /dev/xvdd /mnt/tmp
  4. Experiment
    1. Experiment with the “btrfs subvolume create” and “btrfs subvolume delete” commands (which act like mkdir and rmdir).
    2. Experiment with “btrfs subvolume snapshot SOURCE DEST” and “btrfs subvolume snapshot -r SOURCE DEST” for creating regular and read-only snapshots of other subvolumes (including the root).

RAID Pain

One of my clients has a NAS device. Last week they tried to do what should have been a routine RAID operation, they added a new larger disk as a hot-spare and told the RAID array to replace one of the active disks with the hot-spare. The aim was to replace the disks one at a time to grow the array. But one of the other disks had an error during the rebuild and things fell apart.

I was called in after the NAS had been rebooted when it was refusing to recognise the RAID. The first thing that occurred to me is that maybe RAID-5 isn’t a good choice for the RAID. While it’s theoretically possible for a RAID rebuild to not fail in such a situation (the data that couldn’t be read from the disk with an error could have been regenerated from the disk that was being replaced) it seems that the RAID implementation in question couldn’t do it. As the NAS is running Linux I presume that at least older versions of Linux have the same problem. Of course if you have a RAID array that has 7 disks running RAID-6 with a hot-spare then you only get the capacity of 4 disks. But RAID-6 with no hot-spare should be at least as reliable as RAID-5 with a hot-spare.

Whenever you recover from disk problems the first thing you want to do is to make a read-only copy of the data. Then you can’t make things worse. This is a problem when you are dealing with 7 disks, fortunately they were only 3TB disks and only each had 2TB in use. So I found some space on a ZFS pool and bought a few 6TB disks which I formatted as BTRFS filesystems. For this task I only wanted filesystems that support snapshots so I could work on snapshots not on the original copy.

I expect that at some future time I will be called in when an array of 6+ disks of the largest available size fails. This will be a more difficult problem to solve as I don’t own any system that can handle so many disks.

I copied a few of the disks to a ZFS filesystem on a Dell PowerEdge T110 running kernel 3.2.68. Unfortunately that system seems to have a problem with USB, when copying from 4 disks at once each disk was reading about 10MB/s and when copying from 3 disks each disk was reading about 13MB/s. It seems that the system has an aggregate USB bandwidth of 40MB/s – slightly greater than USB 2.0 speed. This made the process take longer than expected.

One of the disks had a read error, this was presumably the cause of the original RAID failure. dd has the option conv=noerror to make it continue after a read error. This initially seemed good but the resulting file was smaller than the source partition. It seems that conv=noerror doesn’t seek the output file to maintain input and output alignment. If I had a hard drive filled with plain ASCII that MIGHT even be useful, but for a filesystem image it’s worse than useless. The only option was to repeatedly run dd with matching skip and seek options incrementing by 1K until it had passed the section with errors.

for n in /dev/loop[0-6] ; do echo $n ; mdadm –examine -v -v –scan $n|grep Events ; done

Once I had all the images I had to assemble them. The Linux Software RAID didn’t like the array because not all the devices had the same event count. The way Linux Software RAID (and probably most RAID implementations) work is that each member of the array has an event counter that is incremented when disks are added, removed, and when data is written. If there is an error then after a reboot only disks with matching event counts will be used. The above command shows the Events count for all the disks.

Fortunately different event numbers aren’t going to stop us. After assembling the array (which failed to run) I ran “mdadm -R /dev/md1” which kicked some members out. I then added them back manually and forced the array to run. Unfortunately attempts to write to the array failed (presumably due to mismatched event counts).

Now my next problem is that I can make a 10TB degraded RAID-5 array which is read-only but I can’t mount the XFS filesystem because XFS wants to replay the journal. So my next step is to buy another 2*6TB disks to make a RAID-0 array to contain an image of that XFS filesystem.

Finally backups are a really good thing…

BTRFS Status June 2015

The version of btrfs-tools in Debian/Jessie is incapable of creating a filesystem that can be mounted by the kernel in Debian/Wheezy. If you want to use a BTRFS filesystem on Jessie and Wheezy (which isn’t uncommon with removable devices) the only options are to use the Wheezy version of mkfs.btrfs or to use a Jessie kernel on Wheezy. I recently got bitten by this issue when I created a BTRFS filesystem on a removable device with a lot of important data (which is why I wanted metadata duplication and checksums) and had to read it on a server running Wheezy. Fortunately KVM in Wheezy works really well so I created a virtual machine to read the disk. Setting up a new KVM isn’t that difficult, but it’s not something I want to do while a client is anxiously waiting for their data.

BTRFS has been working well for me apart from the Jessie/Wheezy compatability issue (which was an annoyance but didn’t stop me doing what I wanted). I haven’t written a BTRFS status report for a while because everything has been OK and there has been nothing exciting to report.

I regularly get errors from the cron jobs that run a balance supposedly running out of free space. I have the cron jobs due to past problems with BTRFS running out of metadata space. In spite of the jobs often failing the systems keep working so I’m not too worried at the moment. I think this is a bug, but there are many more important bugs.

Linux kernel version 3.19 was the first version to have working support for RAID-5 recovery. This means version 3.19 was the first version to have usable RAID-5 (I think there is no point even having RAID-5 without recovery). It wouldn’t be prudent to trust your important data to a new feature in a filesystem. So at this stage if I needed a very large scratch space then BTRFS RAID-5 might be a viable option but for anything else I wouldn’t use it. BTRFS still has had little performance optimisation, while this doesn’t matter much for SSD and for single-disk filesystems for a RAID-5 of hard drives that would probably hurt a lot. Maybe BTRFS RAID-5 would be good for a scratch array of SSDs. The reports of problems with RAID-5 don’t surprise me at all.

I have a BTRFS RAID-1 filesystem on 2*4TB disks which is giving poor performance on metadata, simple operations like “ls -l” on a directory with ~200 subdirectories takes many seconds to run. I suspect that part of the problem is due to the filesystem being written by cron jobs with files accumulating over more than a year. The “btrfs filesystem” command (see btrfs-filesystem(8)) allows defragmenting files and directory trees, but unfortunately it doesn’t support recursively defragmenting directories but not files. I really wish there was a way to get BTRFS to put all metadata on SSD and all data on hard drives. Sander suggested the following command to defragment directories on the BTRFS mailing list:

find / -xdev -type d -execdir btrfs filesystem defrag -c {} +

Below is the output of “zfs list -t snapshot” on a server I run, it’s often handy to know how much space is used by snapshots, but unfortunately BTRFS has no support for this.

NAME USED AVAIL REFER MOUNTPOINT
hetz0/be0-mail@2015-03-10 2.88G 387G
hetz0/be0-mail@2015-03-11 1.12G 388G
hetz0/be0-mail@2015-03-12 1.11G 388G
hetz0/be0-mail@2015-03-13 1.19G 388G

Hugo pointed out on the BTRFS mailing list that the following command will give the amount of space used for snapshots. $SNAPSHOT is the name of a snapshot and $LASTGEN is the generation number of the previous snapshot you want to compare with.

btrfs subvolume find-new $SNAPSHOT $LASTGEN | awk '{total = total + $7}END{print total}'

One upside of the BTRFS implementation in this regard is that the above btrfs command without being piped through awk shows you the names of files that are being written and the amounts of data written to them. Through casually examining this output I discovered that the most written files in my home directory were under the “.cache” directory (which wasn’t exactly a surprise).

Now I am configuring workstations with a separate subvolume for ~/.cache for the main user. This means that ~/.cache changes don’t get stored in the hourly snapshots and less disk space is used for snapshots.

Conclusion

My observation is that things are going quite well with BTRFS. It’s more than 6 months since I had a noteworthy problem which is pretty good for a filesystem that’s still under active development. But there are still many systems I run which could benefit from the data integrity features of ZFS and BTRFS that don’t have the resources to run ZFS and need more reliability than I can expect from an unattended BTRFS system.

At this time the only servers I run with BTRFS are located within a reasonable drive from my home (not the servers in Germany and the US) and are easily accessible (not the embedded systems). ZFS is working well for some of the servers in Germany. Eventually I’ll probably run ZFS on all the hosted servers in Germany and the US, I expect that will happen before I’m comfortable running BTRFS on such systems. For the embedded systems I will just take the risk of data loss/corruption for the next few years.

BTRFS Status Dec 2014

My last problem with BTRFS was in August [1]. BTRFS has been running mostly uneventfully for me for the last 4 months, that’s a good improvement but the fact that 4 months of no problems is noteworthy for something as important as a filesystem is a cause for ongoing concern.

A RAID-1 Array

A week ago I had a minor problem with my home file server, one of the 3TB disks in the BTRFS RAID-1 started giving read errors. That’s not a big deal, I bought a new disk and did a “btrfs replace” operation which was quick and easy. The first annoyance was that the output of “btrfs device stats” reported an error count for the new device, it seems that “btrfs device replace” copies everything from the old disk including the error count. The solution is to use “btrfs device stats -z” to reset the count after replacing a device.

I replaced the 3TB disk with a 4TB disk (with current prices it doesn’t make sense to buy a new 3TB disk). As I was running low on disk space I added a 1TB disk to give it 4TB of RAID-1 capacity, one of the nice features of BTRFS is that a RAID-1 filesystem can support any combination of disks and use them to store 2 copies of every block of data. I started running a btrfs balance to get BTRFS to try and use all the space before learning from the mailing list that I should have done “btrfs filesystem resize” to make it use all the space. So my balance operation had configured the filesystem to configure itself for 2*3TB+1*1TB disks which wasn’t the right configuration when the 4TB disk was fully used. To make it even more annoying the “btrfs filesystem resize” command takes a “devid” not a device name.

I think that when BTRFS is more stable it would be good to have the btrfs utility warn the user about such potential mistakes. When a replacement device is larger than the old one it will be very common to want to use that space. The btrfs utility could easily suggest the most likely “btrfs filesystem resize” to make things easier for the user.

In a disturbing coincidence a few days after replacing the first 3TB disk the other 3TB disk started giving read errors. So I replaced the second 3TB disk with a 4TB disk and removed the 1TB disk to give a 4TB RAID-1 array. This is when it would be handy to have the metadata duplication feature and copies= option of ZFS.

Ctree Corruption

2 weeks ago a basic workstation with a 120G SSD owned by a relative stopped booting, the most significant errors it gave were “BTRFS: log replay required on RO media” and “BTRFS: open_ctree failed”. The solution to this is to run the command “btrfs-zero-log”, but that initially didn’t work. I restored the system from a backup (which was 2 months old) and took the SSD home to work on it. A day later “btrfs-zero-log” worked correctly and I recovered all the data. Note that I didn’t even try mounting the filesystem in question read-write, I mounted it read-only to copy all the data off. While in theory the filesystem should have been OK I didn’t have a need to keep using it at that time (having already wiped the original device and restored from backup) and I don’t have confidence in BTRFS working correctly in that situation.

While it was nice to get all the data back it’s a concern when commands don’t operate consistently.

Debian and BTRFS

I was concerned when the Debian kernel team chose 3.16 as the kernel for Jessie (the next Debian release). Judging by the way development has been going I wasn’t confident that 3.16 would turn out to be stable enough for BTRFS. But 3.16 is working reasonably well on a number of systems so it seems that it’s likely to work well in practice.

But I’m still deploying more ZFS servers.

The Value of Anecdotal Evidence

When evaluating software based on reports from reliable sources (IE most readers will trust me to run systems well and only report genuine bugs) bad reports have a much higher weight than good reports. The fact that I’ve seen kernel 3.16 to work reasonably well on ~6 systems is nice but that doesn’t mean it will work well on thousands of other systems – although it does indicate that it will work well on more systems than some earlier Linux kernels which had common BTRFS failures.

But the annoyances I had with the 3TB array are repeatable and will annoy many other people. The ctree coruption problem MIGHT have been initially caused by a memory error (it’s a desktop machine without ECC RAM) but the recovery process was problematic and other users might expect problems in such situations.

Improving Computer Reliability

In a comment on my post about Taxing Inferior Products [1] Ben pointed out that most crashes are due to software bugs. Both Ben and I work on the Debian project and have had significant experience of software causing system crashes for Debian users.

But I still think that the widespread adoption of ECC RAM is a good first step towards improving the reliability of the computing infrastructure.

Currently when software developers receive bug reports they always wonder whether the bug was caused by defective hardware. So when bugs can’t be reproduced (or can’t be reproduced in a way that matches the bug report) they often get put in a list of random crash reports and no further attention is paid to them.

When a system has ECC RAM and a filesystem that uses checksums for all data and metadata we can have greater confidence that random bugs aren’t due to hardware problems. For example if a user reports a file corruption bug they can’t repeat that occurred when using the Ext3 filesystem on a typical desktop PC I’ll wonder about the reliability of storage and RAM in their system. If however the same bug report came from someone who had ECC RAM and used the ZFS filesystem then I would be more likely to consider it a software bug.

The current situation is that every part of a typical PC is unreliable. When a bug can be attributed to one of several pieces of hardware, the OS kernel and even malware (in the case of MS Windows) it’s hard to know where to start in tracking down a bug. Most users have given up and accepted that crashing periodically is just what computers do. Even experienced Linux users sometimes give up on trying to track down bugs properly because it’s sometimes very difficult to file a good bug report. For the typical computer user (who doesn’t have the power that a skilled Linux user has) it’s much worse, filing a bug report seems about as useful as praying.

One of the features of ECC RAM is that the motherboard can inform the user (either at boot time, after a NMI reboot, or through system diagnostics) of the problem so it can be fixed. A feature of filesystems such as ZFS and BTRFS is that they can inform the user of drive corruption problems, sometimes before any data is lost.

My recommendation of BTRFS in regard to system integrity does have a significant caveat, currently the system reliability decrease due to crashes outweighs the reliability increase due to checksums at this time. This isn’t all bad because at least when BTRFS crashes you know what the problem is, and BTRFS is rapidly improving in this regard. When I discuss BTRFS in posts like this one I’m considering the theoretical issues related to the design not the practical issues of software bugs. That said I’ve twice had a BTRFS filesystem seriously corrupted by a faulty DIMM on a system without ECC RAM.

Using BTRFS

I’ve just installed BTRFS on some systems that matter to me. It is still regarded as experimental but Oracle supports it with their kernel so it can’t be too bad – and it’s almost guaranteed that anything other than BTRFS or ZFS will lose data if you run as many systems as I do. Also I run lots of systems that don’t have enough RAM for ZFS (4G isn’t enough for ZFS in my tests). So I have to use BTRFS.

BTRFS and Virtual Machines

I’m running BTRFS for the DomUs on a virtual server which has 4G of RAM (and thus can’t run ZFS). The way I have done this is to use ext4 on Linux software RAID-1 for the root filesystem on the Dom0 and use BTRFS for the rest. For BTRFS and virtual machines there seem to be two good options given that I want BTRFS to use it’s own RAID-1 so that it can correct errors from a corrupted disk. One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all it’s storage. The other option is to have each virtual machine run BTRFS RAID-1.

I’ve created two LVM Volume Groups (VGs) named diska and diskb, each DomU has a Logical Volume (LV) from each VG and runs BTRFS. So if a disk becomes corrupt the DomU will have to figure out what the problem is and fix it.

#!/bin/bash
for n in $(xm list|cut -f1 -d\ |egrep -v ^Name\|^Domain-0) ; do
  echo $n
  ssh $n "btrfs scrub start -B -d /"
done

I use the above script in a cron job from the Dom0 to scrub the BTRFS filesystems in the DomUs. I use the -B option so that I will receive email about any errors and so that there won’t be multiple DomUs scrubbing at the same time (which would be really bad for performance).

BTRFS and Workstations

The first workstation installs of BTRFS that I did were similar to installations of Ext3/4 in that I had multiple filesystems on LVM block devices. This caused all the usual problems of filesystem sizes and also significantly hurt performance (sync seems to perform very badly on a BTRFS filesystem and it gets really bad with lots of BTRFS filesystems). BTRFS allows using subvolumes for snapshots and it’s designed to handle large filesystems so there’s no reason to have more than one filesystem IMHO.

It seems to me that the only benefit in using multiple BTRFS filesystems on a system is if you want to use different RAID options. I presume that eventually the BTRFS developers will support different RAID options on a per-subvolume basis (they seem to want to copy all ZFS features). I would like to be able to configure /home to use 3 copies of all data and metadata on a workstation that only has a single disk.

Currently I have some workstations using BTRFS with a BTRFS RAID-1 configuration for /home and a regular non-RAID configuration for everything else. But now it seems that this is a bad idea, I would be better off just using a single copy of all data on workstations (as I did for everything on workstations for the previous 15 years of running Linux desktop systems) and make backups that are frequent enough to not have a great risk.

BTRFS and Servers

One server that I run is primarily used as an NFS server and as a workstation. I have a pair of 3TB SATA disks in a BTRFS RAID-1 configuration mounted as /big and with subvolumes under /big for the various NFS exports. The system also has a 120G Intel SSD for /boot (Ext4) and the root filesystem which is BTRFS and also includes /home. The SSD gives really good read performance which is largely independent of what is done with the disks so booting and workstation use are very fast even when cron jobs are hitting the file server hard.

The system has used a RAID-1 array of 1TB SATA disks for all it’s storage ever since 1TB disks were big. So moving to a single storage device for /home is a decrease in theoretical reliability (in addition to the fact that a SSD might be less reliable than a traditional disk). The next thing that I am going to do is to install cron jobs that backup the root filesystem to something under /big. The server in question isn’t used for anything that requires high uptime, so if the SSD dies entirely and I need to replace it with another boot device then it will be really annoying but it won’t be a great problem.

Snapshot Backups

One of the most important uses of backups is to recover from basic user mistakes such as deleting the wrong file. To deal with this I wrote some scripts to create backups from a cron job. I put the snapshots of a subvolume under a subvolume named “backup“. A common use is to have everything on the root filesystem, /home as a subvolume, /home/backup as another subvolume, and then subvolumes for backups such as /home/backup/2012-12-17, /home/backup/2012-12-17:00:15, and /home/backup/2012-12-17:00:30. I make /home/backup world readable so every user can access their own backups without involving me, of course this means that if they make a mistake related to security then I would have to help them correct it – but I don’t expect my users to deal with security issues, if they accidentally grant inappropriate access to their files then I will be the one to notice and correct it.

Here is a script I name btrfs-make-snapshot which has an optional first parameter “-d” to cause it do just display the btrfs commands it would run and not actually do anything. The second parameter is either “minutes” or “days” depending on whether you want to create a snapshot on a short interval (I use 15 minutes) or a daily snapshot. All other parameters are paths for subvolumes that are to be backed up:

#!/bin/bash
set -e

# usage:
# btrfs-make-snapshot [-d] minutes|days paths
# example:
# btrfs-make-snapshot minutes /home /mail

if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

if [ "$1" == "minutes" ]; then
  DATE=$(date +%Y-%m-%d:%H:%M)
else
  DATE=$(date +%Y-%m-%d)
fi
shift

for n in $* ; do
  $BTRFS subvol snapshot -r $n $n/backup/$DATE
done

Here is a script I name btrfs-remove-snapshots which removes old snapshots to free space. It has an optional first parameter “-d” to cause it do just display the btrfs commands it would run and not actually do anything. The next parameters are the number of minute based and day based snapshots to keep (I am currently experimenting with 100 100 for /home to keep 15 minute snapshots for 25 hours and daily snapshots for 100 days). After that is a list of filesystems to remove snapshots from. The removal will be from under the backup subvolume of the path in question.

#!/bin/bash
set -e

# usage:
# btrfs-remove-snapshots [-d] MINSNAPS DAYSNAPS paths
# example:
# btrfs-remove-snapshots 100 100 /home /mail

if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

MINSNAPS=$1
shift
DAYSNAPS=$1
shift

for DIR in $* ; do
  BASE=$(echo $DIR | cut -c 2-200)
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/.*:|head -n -$MINSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/|grep -v :|head -n -$MINSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
done

A Warning

The Debian/Wheezy kernel (based on the upstream kernel 3.2.32) doesn’t seem to cope well when you run out of space by making snapshots. I have a filesystem that I am still trying to recover after doing that.

I’ve just been buying larger storage devices for systems while migrating them to BTRFS, so I should be able to avoid running out of disk space again until I can upgrade to a kernel that fixes such bugs.