systemd-nspawn and Private Networking

Currently there’s two things I want to do with my PC at the same time, one is watching streaming services like ABC iView (which won’t run from non-Australian IP addresses) and another is torrenting over a VPN. I had considered doing something ugly with iptables to try and get routing done on a per-UID basis but that seemed to difficult. At the time I wasn’t aware of the ip rule add uidrange [1] option. So setting up a private networking namespace with a systemd-nspawn container seemed like a good idea.

Chroot Setup

For the chroot (which I use as a slang term for a copy of a Linux installation in a subdirectory) I used a btrfs subvol that’s a snapshot of the root subvol. The idea is that when I upgrade the root system I can just recreate the chroot with a new snapshot.

To get this working I created files in the root subvol which are used for the container.

I created a script like the following named /usr/local/sbin/container-sshd to launch the container. It sets up the networking and executes sshd. The systemd-nspawn program is designed to launch init but that’s not required, I prefer to just launch sshd so there’s only one running process in a container that’s not being actively used.

#!/bin/bash

# restorecon commands only needed for SE Linux
/sbin/restorecon -R /dev
/bin/mount none -t tmpfs /run
/bin/mkdir -p /run/sshd
/sbin/restorecon -R /run /tmp
/sbin/ifconfig host0 10.3.0.2 netmask 255.255.0.0
/sbin/route add default gw 10.2.0.1
exec /usr/sbin/sshd -D -f /etc/ssh/sshd_torrent_config

How to Launch It

To setup the container I used a command like “/usr/bin/systemd-nspawn -D /subvols/torrent -M torrent –bind=/home -n /usr/local/sbin/container-sshd“.

First I had tried the --network-ipvlan option which creates a new IP address on the same MAC address. That gave me an interface iv-br0 on the container that I could use normally (br0 being the bridge used in my workstation as it’s primary network interface). The IP address I assigned to that was in the same subnet as br0, but for some reason that’s unknown to me (maybe an interaction between bridging and network namespaces) I couldn’t access it from the host, I could only access it from other hosts on the network. I then tried the --network-macvlan option (to create a new MAC address for virtual networking), but that had the same problem with accessing the IP address from the local host outside the container as well as problems with MAC redirection to the primary MAC of the host (again maybe an interaction with bridging).

Then I tried just the “-n” option which gave it a private network interface. That created an interface named ve-torrent on the host side and one named host0 in the container. Using ifconfig and route to configure the interface in the container before launching sshd is easy. I haven’t yet determined a good way of configuring the host side of the private network interface automatically.

I had to use a bind for /home because /home is a subvol and therefore doesn’t get included in the container by default.

How it Works

Now when it’s running I can just “ssh -X” to the container and then run graphical programs that use the VPN while at the same time running graphical programs on the main host that don’t use the VPN.

Things To Do

Find out why --network-ipvlan and --network-macvlan don’t work with communication from the same host.

Find out why --network-macvlan gives errors about MAC redirection when pinging.

Determine a good way of setting up the host side after the systemd-nspawn program has run.

Find out if there are better ways of solving this problem, this way works but might not be ideal. Comments welcome.

4K Monitors

A couple of years ago a relative who uses a Linux workstation I support bought a 4K (4096*2160 resolution) monitor. That meant that I had to get 4K working, which was 2 years of pain for me and probably not enough benefit for them to justify it. Recently I had the opportunity to buy some 4K monitors at a low enough price that it didn’t make sense to refuse so I got to experience it myself.

The Need for 4K

I’m getting older and my vision is decreasing as expected. I recently got new glasses and got a pair of reading glasses as a reduced ability to change focus is common as you get older. Unfortunately I made a mistake when requesting the focus distance for the reading glasses and they work well for phones, tablets, and books but not for laptops and desktop computers. Now I have the option of either spending a moderate amount of money to buy a new pair of reading glasses or just dealing with the fact that laptop/desktop use isn’t going to be as good until the next time I need new glasses (sometime 2021).

I like having lots of terminal windows on my desktop. For common tasks I might need a few terminals open at a time and if I get interrupted in a task I like to leave the terminal windows for it open so I can easily go back to it. Having more 80*25 terminal windows on screen increases my productivity. My previous monitor was 2560*1440 which for years had allowed me to have a 4*4 array of non-overlapping terminal windows as well as another 8 or 9 overlapping ones if I needed more. 16 terminals allows me to ssh to lots of systems and edit lots of files in vi. Earlier this year I had found it difficult to read the font size that previously worked well for me so I had to use a larger font that meant that only 3*3 terminals would fit on my screen. Going from 16 non-overlapping windows and an optional 8 overlapping to 9 non-overlapping and an optional 6 overlapping is a significant difference. I could get a second monitor, and I won’t rule out doing so at some future time. But it’s not ideal.

When I got a 4K monitor working properly I found that I could go back to a smaller font that allowed 16 non overlapping windows. So I got a real benefit from a 4K monitor!

Video Hardware

Version 1.0 of HDMI released in 2002 only supports 1920*1080 (FullHD) resolution. Version 1.3 released in 2006 supported 2560*1440. Most of my collection of PCIe video cards have a maximum resolution of 1920*1080 in HDMI, so it seems that they only support HDMI 1.2 or earlier. When investigating this I wondered what version of PCIe they were using, the command “dmidecode |grep PCI” gives that information, seems that at least one PCIe video card supports PCIe 2 (released in 2007) but not HDMI 1.3 (released in 2006).

Many video cards in my collection support 2560*1440 with DVI but only 1920*1080 with HDMI. As 4K monitors don’t support DVI input that meant that when initially using a 4K monitor I was running in 1920*1080 instead of 2560*1440 with my old monitor.

I found that one of my old video cards supported 4K resolution, it has a NVidia GT630 chipset (here’s the page with specifications for that chipset [1]). It seems that because I have a video card with 2G of RAM I have the “Keplar” variant which supports 4K resolution. I got the video card in question because it uses PCIe*8 and I had a workstation that only had PCIe*8 slots and I didn’t feel like cutting a card down to size (which is apparently possible but not recommended), it is also fanless (quiet) which is handy if you don’t need a lot of GPU power.

A couple of months ago I checked the cheap video cards at my favourite computer store (MSY) and all the cheap ones didn’t support 4K resolution. Now it seems that all the video cards they sell could support 4K, by “could” I mean that a Google search of the chipset says that it’s possible but of course some surrounding chips could fail to support it.

The GT630 card is great for text, but the combination of it with a i5-2500 CPU (rating 6353 according to cpubenchmark.net [3]) doesn’t allow playing Netflix full-screen and on 1920*1080 videos scaled to full-screen sometimes gets mplayer messages about the CPU being too slow. I don’t know how much of this is due to the CPU and how much is due to the graphics hardware.

When trying the same system with an ATI Radeon R7 260X/360 graphics card (16* PCIe and draws enough power to need a separate connection to the PSU) the Netflix playback appears better but mplayer seems no better.

I guess I need a new PC to play 1920*1080 video scaled to full-screen on a 4K monitor. No idea what hardware will be needed to play actual 4K video. Comments offering suggestions in this regard will be appreciated.

Software Configuration

For GNOME apps (which you will probably run even if like me you use KDE for your desktop) you need to run commands like the following to scale menus etc:

gsettings set org.gnome.settings-daemon.plugins.xsettings overrides "[{'Gdk/WindowScalingFactor', <2>}]"
gsettings set org.gnome.desktop.interface scaling-factor 2

For KDE run the System Settings app, go to Display and Monitor, then go to Displays and Scale Display to scale things.

The Arch Linux Wiki page on HiDPI [2] is good for information on how to make apps work with high DPI (or regular screens for people with poor vision).

Conclusion

4K displays are still rather painful, both in hardware and software configuration. For serious computer use it’s worth the hassle, but it doesn’t seem to be good for general use yet. 2560*1440 is pretty good and works with much more hardware and requires hardly any software configuration.

KMail Crashing and LIBGL

One problem I’ve had recently on two systems with NVideo video cards is KMail crashing (SEGV) while reading mail. Sometimes it goes for months without having problems, and then it gets into a state where reading a few messages (or sometimes reading one particular message) causes a crash. The crash happens somewhere in the Mesa library stack.

In an attempt to investigate this I tried running KMail via ssh (as that precludes a lot of the GL stuff), but that crashed in a different way (I filed an upstream bug report [1]).

I have discovered a workaround for this issue, I set the environment variable LIBGL_ALWAYS_SOFTWARE=1 and then things work. At this stage I can’t be sure exactly where the problems are. As it’s certain KMail operations that trigger it I think that’s evidence of problems originating in KMail, but the end result when it happens often includes a kernel error log so there’s probably a problem in the Nouveau driver. I spent quite a lot of time investigating this, including recompiling most of the library stack with debugging mode and didn’t get much of a positive result. Hopefully putting it out there will help the next person who has such issues.

Here is a list of environment variables that can be set to debug LIBGL issues (strangely I couldn’t find documentation on this when Googling it). If you are stuck with a problem related to LIBGL you can try setting each of these to “1” in turn and see if it makes a difference. That can either be for the purpose of debugging a problem or creating a workaround that allows you to run the programs you need to run. I don’t know why GL is required to read email.

LIBGL_DIAGNOSTIC
LIBGL_ALWAYS_INDIRECT
LIBGL_ALWAYS_SOFTWARE
LIBGL_DRI3_DISABLE
LIBGL_NO_DRAWARRAYS
LIBGL_DEBUG
LIBGL_DRIVERS_PATH
LIBGL_DRIVERS_DIR
LIBGL_SHOW_FPS

WordPress Multisite on Debian

WordPress (a common CMS for blogs) is designed to be copied to a directory that Apache can serve and run by a user with no particular privileges while managing installation of it’s own updates and plugins. Debian is designed around the idea of the package management system controlling everything on behalf of a sysadmin.

When I first started using WordPress there was a version called “WordPress MU” (Multi User) which supported multiple blogs. It was a separate archive to the main WordPress and didn’t support all the plugins and themes. As a main selling point of WordPress is the ability to select from the significant library of plugins and themes this was a serious problem.

Debian WordPress

The people who maintain the Debian package of WordPress have always supported multiple blogs on one system and made it very easy to run in that manner. There’s a /etc/wordpress directory for configuration files for each blog with names such as config-etbe.coker.com.au.php. This allows having multiple separate blogs running from the same tree of PHP source which means only one thing to update when there’s a new version of WordPress (often fixing security issues).

One thing that appears to be lacking with the Debian system is separate directories for “media”. WordPress supports uploading images (which are scaled to several different sizes) as well as sound and apparently video. By default under Debian they are stored in /var/lib/wordpress/wp-content/uploads/YYYY/MM/filename. If you have several blogs on one system they all get to share the same directory tree, that may be OK for one person running multiple blogs but is obviously bad when several bloggers have independent blogs on the same server.

Multisite

If you enable the “multisite” support in WordPress then you have WordPress support for multiple blogs. The administrator of the multisite configuration has the ability to specify media paths etc for all the child blogs.

The first problem with this is that one person has to be the multisite administrator. As I’m the sysadmin of the WordPress servers in question that’s an obvious task for me. But the problem is that the multisite administrator doesn’t just do sysadmin tasks such as specifying storage directories. They also do fairly routine tasks like enabling plugins. Preventing bloggers from installing new plugins is reasonable and is the default Debian configuration. Preventing them from selecting which of the installed plugins are activated is unreasonable in most situations.

The next issue is that some core parts of WordPress functionality on the sub-blogs refer to the administrator blog, recovering a forgotten password is one example. I don’t want users of other blogs on the system to be referred to my blog when they forget their password.

A final problem with multisite is that it makes things more difficult if you want to move a blog to another system. Instead of just sending a dump of the MySQL database and a copy of the Apache configuration for the site you have to configure it for which blog will be it’s master. If going between multisite and non-multisite you have to change some of the data about accounts, this will be annoying on both adding new sites to a server and moving sites from the server to a non-multisite server somewhere else.

I now believe that WordPress multisite has little value for people who use Debian. The Debian way is the better way.

So I had to back out the multisite changes. Fortunately I had a cron job to make snapshots of the BTRFS subvolume that has the database so it was easy to revert to an older version of the MySQL configuration.

Upload Location

update etbe_options set option_value='/var/lib/wordpress/wp-content/uploads/etbe.coker.com.au' where option_name='upload_path';

It turns out that if you don’t have a multisite blog then there’s no way of changing the upload directory without using SQL. The above SQL code is an example of how to do this. Note that it seems that there is special case handling of a value of ‘wp-content/uploads‘ and any other path needs to be fully qualified.

For my own blog however I choose to avoid the WordPress media management and use the following shell script to create suitable HTML code for an image that links to a high resolution version. I use GIMP to create the smaller version of the image which gives me a lot of control over how to crop and compress the image to ensure that enough detail is visible while still being small enough for fast download.

#!/bin/bash
set -e

if [ "$BASE" = "" ]; then
  BASE="http://www.coker.com.au/blogpics/2018"
fi

while [ "$1" != "" ]; do
  BIG=$1
  SMALL=$(echo $1 | sed -s s/-big//)
  RES=$(identify $SMALL|cut -f3 -d\ )
  WIDTH=$(($(echo $RES|cut -f1 -dx)/2))px
  HEIGHT=$(($(echo $RES|cut -f2 -dx)/2))px
  echo "<a href=\"$BASE/$BIG\"><img src=\"$BASE/$SMALL\" width=\"$WIDTH\" height=\"$HEIGHT\" alt=\"\" /></a>"
  shift
done

QEMU for ARM Processes

I’m currently doing some embedded work on ARM systems. Having a virtual ARM environment is of course helpful. For the i586 class embedded systems that I run it’s very easy to setup a virtual environment, I just have a chroot run from systemd-nspawn with the --personality=x86 option. I run it on my laptop for my own development and on a server my client owns so that they can deal with the “hit by a bus” scenario. I also occasionally run KVM virtual machines to test the boot image of i586 embedded systems (they use GRUB etc and are just like any other 32bit Intel system).

ARM systems have a different boot setup, there is a uBoot loader that is fairly tightly coupled with the kernel. ARM systems also tend to have more unusual hardware choices. While the i586 embedded systems I support turned out to work well with standard Debian kernels (even though the reference OS for the hardware has a custom kernel) the ARM systems need a special kernel. I spent a reasonable amount of time playing with QEMU and was unable to make it boot from a uBoot ARM image. The Google searches I performed didn’t turn up anything that helped me. If anyone has good references for getting QEMU to work for an ARM system image on an AMD64 platform then please let me know in the comments. While I am currently surviving without that facility it would be a handy thing to have if it was relatively easy to do (my client isn’t going to pay me to spend a week working on this and I’m not inclined to devote that much of my hobby time to it).

QEMU for Process Emulation

I’ve given up on emulating an entire system and now I’m using a chroot environment with systemd-nspawn.

The package qemu-user-static has staticly linked programs for emulating various CPUs on a per-process basis. You can run this as “/usr/bin/qemu-arm-static ./staticly-linked-arm-program“. The Debian package qemu-user-static uses the binfmt_misc support in the kernel to automatically run /usr/bin/qemu-arm-static when an ARM binary is executed. So if you have copied the image of an ARM system to /chroot/arm you can run the following commands like the following to enter the chroot:

cp /usr/bin/qemu-arm-static /chroot/arm/usr/bin/qemu-arm-static
chroot /chroot/arm bin/bash

Then you can create a full virtual environment with “/usr/bin/systemd-nspawn -D /chroot/arm” if you have systemd-container installed.

Selecting the CPU Type

There is a huge range of ARM CPUs with different capabilities. How this compares to the range of x86 and AMD64 CPUs depends on how you are counting (the i5 system I’m using now has 76 CPU capability flags). The default CPU type for qemu-arm-static is armv7l and I need to emulate a system with a armv5tejl. Setting the environment variable QEMU_CPU=pxa250 gives me armv5tel emulation.

The ARM Architecture Wikipedia page [2] says that in armv5tejl the T stands for Thumb instructions (which I don’t think Debian uses), the E stands for DSP enhancements (which probably isn’t relevant for me as I’m only doing integer maths), the J stands for supporting special Java instructions (which I definitely don’t need) and I’m still trying to work out what L means (comments appreciated).

So it seems clear that the armv5tel emulation provided by QEMU_CPU=pxa250 will do everything I need for building and testing ARM embedded software. The issue is how to enable it. For a user shell I can just put export QEMU_CPU=pxa250 in .login or something, but I want to emulate an entire system (cron jobs, ssh logins, etc).

I’ve filed Debian bug #870329 requesting a configuration file for this [1]. If I put such a configuration file in the chroot everything would work as desired.

To get things working in the meantime I wrote the below wrapper for /usr/bin/qemu-arm-static that calls /usr/bin/qemu-arm-static.orig (the renamed version of the original program). It’s ugly (I would use a config file if I needed to support more than one type of CPU) but it works.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  if(setenv("QEMU_CPU", "pxa250", 1))
  {
    printf("Can't set $QEMU_CPU\n");
    return 1;
  }
  execv("/usr/bin/qemu-arm-static.orig", argv);
  printf("Can't execute \"%s\" because of qemu failure\n", argv[0]);
  return 1;
}

More KVM Modules Configuration

Last year I blogged about blacklisting a video driver so that KVM virtual machines didn’t go into graphics mode [1]. Now I’ve been working on some other things to make virtual machines run better.

I use the same initramfs for the physical hardware as for the virtual machines. So I need to remove modules that are needed for booting the physical hardware from the VMs as well as other modules that get dragged in by systemd and other things. One significant saving from this is that I use BTRFS for the physical machine and the BTRFS driver takes 1M of RAM!

The first thing I did to reduce the number of modules was to edit /etc/initramfs-tools/initramfs.conf and change “MODULES=most” to “MODULES=dep”. This significantly reduced the number of modules loaded and also stopped the initramfs from probing for a non-existant floppy drive which added about 20 seconds to the boot. Note that this will result in your initramfs not supporting different hardware. So if you plan to take a hard drive out of your desktop PC and install it in another PC this could be bad for you, but for servers it’s OK as that sort of upgrade is uncommon for servers and only done with some planning (such as creating an initramfs just for the migration).

I put the following rmmod commands in /etc/rc.local to remove modules that are automatically loaded:
rmmod btrfs
rmmod evdev
rmmod lrw
rmmod glue_helper
rmmod ablk_helper
rmmod aes_x86_64
rmmod ecb
rmmod xor
rmmod raid6_pq
rmmod cryptd
rmmod gf128mul
rmmod ata_generic
rmmod ata_piix
rmmod i2c_piix4
rmmod libata
rmmod scsi_mod

In /etc/modprobe.d/blacklist.conf I have the following lines to stop drivers being loaded. The first line is to stop the video mode being set and the rest are just to save space. One thing that inspired me to do this is that the parallel port driver gave a kernel error when it loaded and tried to access non-existant hardware.
blacklist bochs_drm
blacklist joydev
blacklist ppdev
blacklist sg
blacklist psmouse
blacklist pcspkr
blacklist sr_mod
blacklist acpi_cpufreq
blacklist cdrom
blacklist tpm
blacklist tpm_tis
blacklist floppy
blacklist parport_pc
blacklist serio_raw
blacklist button

On the physical machine I have the following in /etc/modprobe.d/blacklist.conf. Most of this is to prevent loading of filesystem drivers when making an initramfs. I do this because I know there’s never going to be any need for CDs, parallel devices, graphics, or strange block devices in a server room. I wouldn’t do any of this for a desktop workstation or laptop.
blacklist ppdev
blacklist parport_pc
blacklist cdrom
blacklist sr_mod
blacklist nouveau

blacklist ufs
blacklist qnx4
blacklist hfsplus
blacklist hfs
blacklist minix
blacklist ntfs
blacklist jfs
blacklist xfs

Video Mode and KVM

I recently changed my KVM servers to use the kernel command-line parameter nomodeset for the virtual machine kernels so that they don’t try to go into graphics mode. I do this because I don’t have X11 or VNC enabled and I want a text console to use with the -curses option of KVM. Without the nomodeset KVM just says that it’s in 1024*768 graphics mode and doesn’t display the text.

Now my KVM server running Debian/Unstable has had it’s virtual machines start going into graphics mode in spite of nomodeset parameter. It seems that an update to QEMU has added a new virtual display driver which recent kernels from Debian/Unstable support with the bochs_drm driver, and that driver apparently doesn’t respect nomodeset.

The solution is to create a file named /etc/modprobe.d/blacklist.conf with the contents “blacklist bochs_drm” and now my virtual machines have a usable plain-text console again! This blacklist method works for all video drivers, you can blacklist similar modules for the other virtual display hardware. But it would be nice if the one kernel option would cover them all.

Ethernet Interface Naming With Systemd

Systemd has a new way of specifying names for Ethernet interfaces as documented in systemd.link(5). The Debian package should keep working with the old 70-persistent-net.rules file, but I had a problem with this that forced me to learn about systemd.link(5).

Below is a little shell script I wrote to convert a basic 70-persistent-net.rules (that only matches on MAC address) to systemd.link files.

#!/bin/bash

RULES=/etc/udev/rules.d/70-persistent-net.rules

for n in $(grep ^SUB $RULES|sed -e s/^.*NAME..// -e s/.$//) ; do
  NAME=/etc/systemd/network/10-$n.link
  LINE=$(grep $n $RULES)
  MAC=$(echo $LINE|sed -e s/^.*address….// -e s/…ATTR.*$//)
  echo "[Match]" > $NAME
  echo "MACAddress=$MAC" >> $NAME
  echo "[Link]" >> $NAME
  echo "Name=$n" >> $NAME
done

BTRFS Status April 2014

Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn’t seem to be doing any good because most of the space was actually in use so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change a virtual console sometimes worked but nothing else worked – the terminal usually didn’t respond to input.

To try and stop the kernel from entering an infinite loop on every boot that I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance which made the system usable for a little longer, unfortunately the skip_balance mount option doesn’t permanently apply, the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.

When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots, after two good runs I got to 17G free. But after that every time I rebooted it would report another Gig or two free (according to “btrfs filesystem df“) and then hang without committing the changes to disk.

I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a stage where I could use it with kernel 3.13 and then boot the system normally to upgrade it to kernel 3.14.

I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only do the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.

When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to the maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time it had about 50% CPU time used for my Q8400 quad-core CPU. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.

Lessons Learned

I should photograph the screen regularly when doing things that won’t be logged, those kernel error messages might have been useful to me or someone else.

The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)

I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, for most tasks that’s been the case. But for systems for which storage is important I’ll look at getting faster CPUs because of BTRFS. Using faster CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with some significant CPU power used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from file system corner cases isn’t going to be particularly common it’s something that you want completed quickly, for personal systems you want to be doing something else and for work systems you don’t want down-time.

The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.

Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem and in the /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good or when it didn’t do any harm.

Swap Space and SSD

In 2007 I wrote a blog post about swap space [1]. The main point of that article was to debunk the claim that Linux needs a swap space twice as large as main memory (in summary such advice is based on BSD Unix systems and has never applied to Linux and that most storage devices aren’t fast enough for large swap). That post was picked up by Barrapunto (Spanish Slashdot) and became one of the most popular posts I’ve written [2].

In the past 7 years things have changed. Back then 2G of RAM was still a reasonable amount and 4G was a lot for a desktop system or laptop. Now there are even phones with 3G of RAM, 4G is about the minimum for any new desktop or laptop, and desktop/laptop systems with 16G aren’t that uncommon. Another significant development is the use of SSDs which dramatically improve speed for some operations (mainly seeks).

As SATA SSDs for desktop use start at about $110 I think it’s safe to assume that everyone who wants a fast desktop system has one. As a major limiting factor in swap use is the seek performance of the storage the use of SSDs should allow greater swap use. My main desktop system has 4G of RAM (it’s an older Intel 64bit system and doesn’t support more) and has 4G of swap space on an Intel SSD. My work flow involves having dozens of Chromium tabs open at the same time, usually performance starts to drop when I get to about 3.5G of swap in use.

While SSD generally has excellent random IO performance the contiguous IO performance often isn’t much better than hard drives. My Intel SSDSC2CT12 300i 128G can do over 5000 random seeks per second but for sustained contiguous filesystem IO can only do 225M/s for writes and 274M/s for reads. The contiguous IO performance is less than twice as good as a cheap 3TB SATA disk. It also seems that the performance of SSDs aren’t as consistent as that of hard drives, when a hard drive delivers a certain level of performance then it can generally do so 24*7 but a SSD will sometimes reduce performance to move blocks around (the erase block size is usually a lot larger than the filesystem block size).

It’s obvious that SSDs allow significantly better swap performance and therefore make it viable to run a system with more swap in use but that doesn’t allow unlimited swap. Even when using programs like Chromium (which seems to allocate huge amounts of RAM that aren’t used much) it doesn’t seem viable to have swap be much bigger than 4G on a system with 4G of RAM. Now I could buy another SSD and use two swap spaces for double the overall throughput (which would still be cheaper than buying a PC that supports 8G of RAM), but that still wouldn’t solve all problems.

One issue I have been having on occasion is BTRFS failing to allocate kernel memory when managing snapshots. I’m not sure if this would be solved by adding more RAM as it could be an issue of RAM fragmentation – I won’t file a bug report about this until some of the other BTRFS bugs are fixed. Another problem I have had is when running Minecraft the driver for my ATI video card fails to allocate contiguous kernel memory, this is one that almost certainly wouldn’t be solved by just adding more swap – but might be solved if I tweaked the kernel to be more aggressive about swapping out data.

In 2007 when using hard drives for swap I found that the maximum space that could be used with reasonable performance for typical desktop operations was something less than 2G. Now with a SSD the limit for usable swap seems to be something like 4G on a system with 4G of RAM. On a system with only 2G of RAM that might allow the system to be usable with swap being twice as large as RAM, but with the amounts of RAM in modern PCs it seems that even SSD doesn’t allow using a swap space larger than RAM for typical use unless it’s being used for hibernation.

Conclusion

It seems that nothing has significantly changed in the last 7 years. We have more RAM, faster storage, and applications that are more memory hungry. The end result is that swap still isn’t very usable for anything other than hibernation if it’s larger than RAM.

It would be nice if application developers could stop increasing the use of RAM. Currently it seems that the RAM requirements for Linux desktop use are about 3 years behind the RAM requirements for Windows. This is convenient as a PC is fully depreciated according to the tax office after 3 years. This makes it easy to get 3 year old PCs cheaply (or sometimes for free as rubbish) which work really well for Linux. But it would be nice if we could be 4 or 5 years behind Windows in terms of hardware requirements to reduce the hardware requirements for Linux users even further.