BTRFS and SE Linux

I’ve had problems with systems running SE Linux on BTRFS losing the XATTRs used for storing the SE Linux file labels after a power outage.

Here is the link to the patch that fixes this [1]. Thanks to Hans van Kranenburg and Holger Hoffstätte for the information about this patch which was already included in kernel 4.16.11. That was uploaded to Debian on the 27th of May and got into testing about the time that my message about this issue got to the SE Linux list (which was a couple of days before I sent it to the BTRFS developers).

The kernel from Debian/Stable still has the issue. So using a testing kernel might be a good option to deal with this problem at the moment.

Below is the information on reproducing this problem. It may be useful for people who want to reproduce similar problems. Also all sysadmins should know about “reboot -nffd”: if something really goes wrong with your kernel you may need to run it immediately to prevent corrupted data being written to your disks.

The command “reboot -nffd” (kernel reboot without flushing kernel buffers or writing status) when run on a BTRFS system with SE Linux will often result in /var/log/audit/audit.log being unlabeled. It also results in some systemd-journald files like /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/system.journal being unlabeled, but that is rarer. I think that the same problem afflicts both systemd-journald and auditd but it’s a race condition that on my systems (both production and test) is more likely to affect auditd.

root@stretch:/# xattr -l /var/log/audit/audit.log 
security.selinux: 
0000   73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5F    system_u:object_ 
0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73    r:auditd_log_t:s 
0020   30 00                                              0.

SE Linux uses the xattr “security.selinux”. You can inspect it with xattr(1), but generally “ls -Z” is easiest.

If this issue only affected “reboot -nffd” then a solution might be to just not run that command. However it also affects systems after a power outage.

I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security update for Debian/Stretch which is the latest supported release of Debian). I have also reproduced it in an identical manner with kernel 4.16.0-1-amd64 (the latest from Debian/Unstable). For testing I reproduced this with a 4G filesystem in a VM, but in production it has happened on BTRFS RAID-1 arrays, both SSD and HDD.

#!/bin/bash
set -e
# count auditd processes; the [s] stops grep matching its own command line
COUNT=$(ps aux|grep "[s]bin/auditd"|wc -l)
date
if [ "$COUNT" = "1" ]; then
 echo "all good"
else
 echo "failed"
 exit 1
fi

Firstly, the above is the script /usr/local/sbin/testit. I test for auditd running because it aborts if the context on its log file is wrong: when SE Linux is in enforcing mode an incorrect/missing label on the audit.log file causes auditd to abort.

root@stretch:~# ls -liZ /var/log/audit/audit.log
37952 -rw-------. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  1 12:23 /var/log/audit/audit.log

Above is before I do the tests.

while ssh stretch /usr/local/sbin/testit ; do 
 ssh stretch "reboot -nffd" > /dev/null 2>&1 & 
 sleep 20 
done

Above is the shell code I run to do the tests. Note that the VM in question runs on SSD storage which is why it can consistently boot in less than 20 seconds.

Fri  1 Jun 12:26:13 UTC 2018 
all good 
Fri  1 Jun 12:26:33 UTC 2018 
failed

Above is the output from the shell code in question. After the first reboot it fails. The probability of failure on my test system is greater than 50%.

root@stretch:~# ls -liZ /var/log/audit/audit.log  
37952 -rw-------. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 12:26 /var/log/audit/audit.log

Now the result. Note that the inode has not changed. I could understand a newly created file missing an xattr, but this is an existing file which shouldn’t have had its xattr changed; yet somehow it gets corrupted.
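
When this happens the label can be restored and auditd restarted. Below is a minimal recovery sketch, assuming the standard policy file contexts are installed:

# restore the label from the file_contexts database of the loaded policy
restorecon -v /var/log/audit/audit.log
# restart auditd, which will have aborted due to the bad label
systemctl restart auditd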

The first possibility I considered was that SE Linux code might be at fault. I asked on the SE Linux mailing list (I haven’t been involved in SE Linux kernel code for about 15 years) and was informed that this isn’t likely at all. There have been no problems like this reported with other filesystems.

WordPress Multisite on Debian

WordPress (a common CMS for blogs) is designed to be copied to a directory that Apache can serve and to be run by a user with no particular privileges, managing installation of its own updates and plugins. Debian is designed around the idea of the package management system controlling everything on behalf of a sysadmin.

When I first started using WordPress there was a version called “WordPress MU” (Multi User) which supported multiple blogs. It was a separate archive from the main WordPress and didn’t support all the plugins and themes. As a main selling point of WordPress is the ability to select from its significant library of plugins and themes, this was a serious problem.

Debian WordPress

The people who maintain the Debian package of WordPress have always supported multiple blogs on one system and made it very easy to run in that manner. There’s an /etc/wordpress directory with a configuration file for each blog, with names such as config-etbe.coker.com.au.php. This allows having multiple separate blogs running from the same tree of PHP source, which means only one thing to update when there’s a new version of WordPress (often fixing security issues).

One thing that appears to be lacking with the Debian system is separate directories for “media”. WordPress supports uploading images (which are scaled to several different sizes) as well as sound and apparently video. By default under Debian they are stored in /var/lib/wordpress/wp-content/uploads/YYYY/MM/filename. If you have several blogs on one system they all share the same directory tree; that may be OK for one person running multiple blogs but is obviously bad when several bloggers have independent blogs on the same server.

Multisite

If you enable the “multisite” support in WordPress then you have WordPress support for multiple blogs. The administrator of the multisite configuration has the ability to specify media paths etc for all the child blogs.

The first problem with this is that one person has to be the multisite administrator. As I’m the sysadmin of the WordPress servers in question that’s an obvious task for me. But the problem is that the multisite administrator doesn’t just do sysadmin tasks such as specifying storage directories. They also do fairly routine tasks like enabling plugins. Preventing bloggers from installing new plugins is reasonable and is the default Debian configuration. Preventing them from selecting which of the installed plugins are activated is unreasonable in most situations.

The next issue is that some core parts of WordPress functionality on the sub-blogs refer to the administrator blog; recovering a forgotten password is one example. I don’t want users of other blogs on the system to be referred to my blog when they forget their password.

A final problem with multisite is that it makes things more difficult if you want to move a blog to another system. Instead of just sending a dump of the MySQL database and a copy of the Apache configuration for the site you have to configure which blog will be its master. If going between multisite and non-multisite you have to change some of the data about accounts; this will be annoying both when adding new sites to a server and when moving sites from the server to a non-multisite server somewhere else.

I now believe that WordPress multisite has little value for people who use Debian. The Debian way is the better way.

So I had to back out the multisite changes. Fortunately I had a cron job to make snapshots of the BTRFS subvolume that has the database so it was easy to revert to an older version of the MySQL configuration.

Upload Location

update etbe_options set option_value='/var/lib/wordpress/wp-content/uploads/etbe.coker.com.au' where option_name='upload_path';

It turns out that if you don’t have a multisite blog then there’s no way of changing the upload directory without using SQL. The above SQL code is an example of how to do this. Note that it seems that there is special case handling of a value of ‘wp-content/uploads‘ and any other path needs to be fully qualified.
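
The statement can be run with the mysql command line client. Below is an example invocation; the database name “wordpress” is an assumption, substitute whatever database your blog is configured to use:

echo "update etbe_options set option_value='/var/lib/wordpress/wp-content/uploads/etbe.coker.com.au' where option_name='upload_path';" | mysql wordpress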

For my own blog however I choose to avoid the WordPress media management and use the following shell script to create suitable HTML code for an image that links to a high resolution version. I use GIMP to create the smaller version of the image which gives me a lot of control over how to crop and compress the image to ensure that enough detail is visible while still being small enough for fast download.

#!/bin/bash
set -e

# base URL for the images, can be overridden from the environment
if [ "$BASE" = "" ]; then
  BASE="http://www.coker.com.au/blogpics/2018"
fi

while [ "$1" != "" ]; do
  BIG="$1"
  # the small version has the same name minus the "-big" suffix
  SMALL=$(echo "$1" | sed s/-big//)
  # get the resolution (eg 800x600) of the small image via ImageMagick
  RES=$(identify "$SMALL" | cut -f3 -d\ )
  # display at half the pixel size of the small image
  WIDTH=$(($(echo $RES | cut -f1 -dx) / 2))px
  HEIGHT=$(($(echo $RES | cut -f2 -dx) / 2))px
  echo "<a href=\"$BASE/$BIG\"><img src=\"$BASE/$SMALL\" width=\"$WIDTH\" height=\"$HEIGHT\" alt=\"\" /></a>"
  shift
done
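
Below is a hypothetical invocation, assuming the script is saved as “imglink” somewhere in the PATH and that sunset-big.jpg and the smaller sunset.jpg exist in the current directory. It prints the HTML on stdout ready to be pasted into a blog post:

BASE=http://www.coker.com.au/blogpics/2018 imglink sunset-big.jpg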

Apache Mesos on Debian

I decided to try packaging Mesos for Debian/Stretch. I had a spare system with a i7-930 CPU, 48G of RAM, and SSDs to use for building. The i7-930 isn’t really fast by today’s standards, but 48G of RAM and SSD storage mean that overall it’s a decent build system – faster than most systems I run (for myself and for clients) and probably faster than most systems used by Debian Developers for build purposes.

There’s a github issue about the lack of an upstream package for Debian/Stretch [1]. That upstream issue could probably be worked around by adding Jessie sources to the APT sources.list file, but a package for Stretch is what is needed anyway.

Here is the documentation on building for Debian [2]. The list of packages it gives as build dependencies is incomplete; it also needs zlib1g-dev libapr1-dev libcurl4-nss-dev openjdk-8-jdk maven libsasl2-dev libsvn-dev. So BUILDING this software requires Java + Maven, Ruby, and Python along with autoconf, libtool, and all the usual Unix build tools. It also requires the FPM (Fucking Package Management) tool; I take the choice of name as an indication of the professionalism of the author.
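
For reference, the extra build dependencies can be installed with a command like the following:

apt-get install zlib1g-dev libapr1-dev libcurl4-nss-dev openjdk-8-jdk maven libsasl2-dev libsvn-dev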

Building the software on my i7 system took 79 minutes, which includes 76 minutes of CPU time (I didn’t use the -j option to make). At the end of the build it turned out that I had failed to install the Fucking Package Management “gem” and it aborted. At that stage I gave up on Mesos; the pain involved exceeds my interest in trying it out.

How to do it Better

One of the aims of Free Software is that bugs are more likely to get solved if many people look at them. There aren’t many people who will devote 76 minutes of CPU time on a moderately fast system to investigate a single bug. To deal with this, software should be prepared as components. An example of this is the SE Linux project, which has 13 source modules in the latest release [3]. Of those 13 only 5 are really required. So anyone who wants to start on SE Linux from source (without considering a distribution like Debian or Fedora that has it packaged) can build the 5 most important ones. Also anyone who has an issue with SE Linux on their system can find the one source package that is relevant and study it with a short compile time. As an aside, I’ve been working on SE Linux since long before it was split into so many separate source packages and know the code well, but I still find the separation convenient – I rarely need to work on more than a small subset of the code at one time.

The requirement for Java, Ruby, and Python to build Mesos could be partly due to language bindings for calling Mesos from Ruby and Python. One solution to that is to have the C libraries and header files needed to call Mesos in one package and have separate packages that depend on those libraries and headers to provide the bindings for other languages. Another solution is to have autoconf detect that some languages aren’t installed and just not try to compile bindings for them (this is one of the purposes of autoconf).

The use of a tool like Fucking Package Management means that you don’t get help from experts in the various distributions in making better packages. When there is a FOSS project with a debian subdirectory that makes barely functional packages you are likely to have an experienced Debian Developer offer a patch to improve it (I’ve offered patches for such things on many occasions). When there is a FOSS project that uses a tool that is never used by Debian developers (or developers of Fedora and other distributions) then the only patches you will get will be from inexperienced people.

A software build process should not download anything from the Internet. The source archive should contain everything that is needed and there should be dependencies on external software. Any downloads from the Internet need to be protected from MITM attacks, which means that a responsible software developer has to read through the build system and make sure that appropriate PGP signature checks etc are performed. It could be that the files the Mesos build downloaded from the Apache site had appropriate PGP checks performed, but it would take me extra time and effort to verify this and I can’t distribute software without being sure of it. Also reproducible builds are one of the latest things we aim for in the Debian project; this means we can’t just download files from web sites because the next build might get a different version.

Finally the fpm (Fucking Package Management) tool is a Ruby Gem that has to be installed with the “gem install” command. Any time you specify a gem install command you should include the -v option to ensure that everyone is using the same version of that gem, otherwise there is no guarantee that people who follow your documentation will get the same results. Also a quick Google search didn’t indicate whether gem install checks PGP keys or verifies data integrity in other ways. If I’m going to compile software for other people to use I’m concerned about getting unexpected results with such things. A Google search indicates that Ruby people were worried about such things in 2013 but doesn’t indicate whether they solved the problem properly.
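
Below is an example of such a pinned install; the version number here is hypothetical, substitute whichever version you have verified:

gem install -v 1.10.2 fpm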

More KVM Modules Configuration

Last year I blogged about blacklisting a video driver so that KVM virtual machines didn’t go into graphics mode [1]. Now I’ve been working on some other things to make virtual machines run better.

I use the same initramfs for the physical hardware as for the virtual machines. So for the VMs I need to remove the modules that are only needed for booting the physical hardware, as well as other modules that get dragged in by systemd and other things. One significant saving from this is that I use BTRFS for the physical machine and the BTRFS driver takes 1M of RAM!

The first thing I did to reduce the number of modules was to edit /etc/initramfs-tools/initramfs.conf and change “MODULES=most” to “MODULES=dep”. This significantly reduced the number of modules loaded and also stopped the initramfs from probing for a non-existent floppy drive, which added about 20 seconds to the boot. Note that this will result in your initramfs not supporting different hardware. So if you plan to take a hard drive out of your desktop PC and install it in another PC this could be bad for you, but for servers it’s OK as that sort of upgrade is uncommon for servers and only done with some planning (such as creating an initramfs just for the migration).
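
Note that after editing initramfs.conf the initramfs has to be regenerated for the change to take effect:

update-initramfs -u -k all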

I put the following rmmod commands in /etc/rc.local to remove modules that are automatically loaded:
rmmod btrfs
rmmod evdev
rmmod lrw
rmmod glue_helper
rmmod ablk_helper
rmmod aes_x86_64
rmmod ecb
rmmod xor
rmmod raid6_pq
rmmod cryptd
rmmod gf128mul
rmmod ata_generic
rmmod ata_piix
rmmod i2c_piix4
rmmod libata
rmmod scsi_mod

In /etc/modprobe.d/blacklist.conf I have the following lines to stop drivers being loaded. The first line is to stop the video mode being set and the rest are just to save space. One thing that inspired me to do this is that the parallel port driver gave a kernel error when it loaded and tried to access non-existent hardware.
blacklist bochs_drm
blacklist joydev
blacklist ppdev
blacklist sg
blacklist psmouse
blacklist pcspkr
blacklist sr_mod
blacklist acpi_cpufreq
blacklist cdrom
blacklist tpm
blacklist tpm_tis
blacklist floppy
blacklist parport_pc
blacklist serio_raw
blacklist button

On the physical machine I have the following in /etc/modprobe.d/blacklist.conf. Most of this is to prevent loading of filesystem drivers when making an initramfs. I do this because I know there’s never going to be any need for CDs, parallel devices, graphics, or strange block devices in a server room. I wouldn’t do any of this for a desktop workstation or laptop.
blacklist ppdev
blacklist parport_pc
blacklist cdrom
blacklist sr_mod
blacklist nouveau

blacklist ufs
blacklist qnx4
blacklist hfsplus
blacklist hfs
blacklist minix
blacklist ntfs
blacklist jfs
blacklist xfs

SE Linux in Debian/Stretch

Debian/Stretch has been frozen. Before the freeze I got almost all the bugs in policy fixed, both bugs reported in the Debian BTS and bugs that I know about. This is going to be one of the best Debian releases for SE Linux ever.

Systemd with SE Linux is working nicely. The support isn’t as good as I would like; there is still work to be done for systemd-nspawn. But it’s close enough that anyone who needs to use it can use audit2allow to generate the extra rules needed. Systemd-nspawn is not used by default and it’s not something that a new Linux user is going to use; I think that expert users who are capable of using such features are capable of doing the extra work to get them going.

In terms of systemd-nspawn and some other rough edges, the issue is the difference between writing policy for a single system vs writing policy that works for everyone. If you write policy for your own system you can allow access for a corner case without a lot of effort. But if I wrote policy to allow access for every corner case then they might add up to a combination that can be exploited. I don’t recommend blindly adding the output of audit2allow to your local policy (be particularly wary of access to shadow_t and write access to etc_t, lib_t, etc). But OTOH if you have a system that’s running in enforcing mode that happens to have one daemon with more access than is ideal then all the other daemons will still be restricted.
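
For reference, below is a sketch of a typical audit2allow workflow; review the generated local.te before loading the module:

ausearch -m avc -ts recent | audit2allow -M local
# check local.te for overly broad access before running this
semodule -i local.pp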

As for previous releases I plan to keep releasing updates to policy packages in my own apt repository. I’m also considering releasing policy source updates that can be applied on existing Stretch systems. So if you want to run the official Debian packages but need updates that came after Stretch then you can get them. Suggestions on how to distribute such policy source are welcome.

Please enjoy SE Linux on Stretch. It’s too late for most bug reports regarding Stretch as most of them won’t be sufficiently important to justify a Stretch update. The vast majority of SE Linux policy bugs are issues of denying wanted access rather than permitting unwanted access (so not a security issue) and can be easily fixed by local configuration, so it’s really difficult to make a case for an update to Stable. But feel free to send bug reports for Buster (Stretch+1).

802.1x Authentication on Debian

I recently had to set up some Linux workstations with 802.1x authentication (described as “Ethernet authentication”) to connect to a smart switch. The most useful web site I found was the Ubuntu help page about 802.1x Authentication [1]. But it didn’t describe exactly what I needed so I’m writing a more concise explanation.

The first thing to note is that the authentication mechanism works the same way as 802.11 wireless authentication, so it’s a good idea to have the wpasupplicant package installed on all laptops just in case you need to connect to such a network.

The first step is to create a wpa_supplicant config file; I named mine /etc/wpa_supplicant_SITE.conf. The file needs contents like the following:

network={
 key_mgmt=IEEE8021X
 eap=PEAP
 identity="USERNAME"
 anonymous_identity="USERNAME"
 password="PASS"
 phase1="auth=MD5"
 phase2="auth=CHAP password=PASS"
 eapol_flags=0
}

The first difference between what I use and the Ubuntu example is that I’m using “eap=PEAP”; that is determined by the way the network is configured, and whoever runs your switch can tell you the correct settings. The next difference is that I’m using “auth=CHAP” while the Ubuntu example has “auth=PAP”. The difference between those protocols is that CHAP has a challenge-response while PAP just has the password sent (maybe encrypted) over the network. If whoever runs the network says that they “don’t store unhashed passwords” or makes any similar claim then they are almost certainly using CHAP.

Change USERNAME and PASS to your user name and password.

wpa_supplicant -c /etc/wpa_supplicant_SITE.conf -D wired -i eth0

The above command can be used to test the operation of wpa_supplicant.

Successfully initialized wpa_supplicant
eth0: Associated with 00:01:02:03:04:05
eth0: CTRL-EVENT-EAP-STARTED EAP authentication started
eth0: CTRL-EVENT-EAP-PROPOSED-METHOD vendor=0 method=25
TLS: Unsupported Phase2 EAP method 'CHAP'
eth0: CTRL-EVENT-EAP-METHOD EAP vendor 0 method 25 (PEAP) selected
eth0: CTRL-EVENT-EAP-PEER-CERT depth=0 subject=''
eth0: CTRL-EVENT-EAP-PEER-CERT depth=0 subject=''
EAP-MSCHAPV2: Authentication succeeded
EAP-TLV: TLV Result - Success - EAP-TLV/Phase2 Completed
eth0: CTRL-EVENT-EAP-SUCCESS EAP authentication completed successfully
eth0: CTRL-EVENT-CONNECTED - Connection to 00:01:02:03:04:05 completed [id=0 id_str=]

Above is the output of a successful test with wpa_supplicant. I replaced the MAC address of the switch with 00:01:02:03:04:05. Strangely it doesn’t like “CHAP” but automatically selects “MSCHAPV2” and works; maybe anything other than “PAP” would do.

auto eth0
iface eth0 inet dhcp
  wpa-driver wired
  wpa-conf /etc/wpa_supplicant_SITE.conf

Above is a snippet of /etc/network/interfaces that works with this configuration.
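
After that the interface can be brought up in the usual way, assuming ifupdown is managing eth0:

ifup eth0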

LUV Server Upgrade to Jessie

On Sunday night I started the process of upgrading the LUV server to Debian/Jessie from Debian/Wheezy. My initial plan was to just upgrade Apache first but dependencies required upgrading systemd too.

One problem I’ve encountered in the past is that the Wheezy version of systemd will often hang on an upgrade to a newer version. Generally the solution to this is to run “systemctl daemon-reexec” from another terminal. The problem in this case was that not all the libraries needed for systemd had been installed, so systemd could re-exec itself but immediately aborted. The kernel really doesn’t like it when process 1 aborts repeatedly, and apparently an immediate hang is the result. At the time I didn’t know this; all I knew was that my session died and the server stopped responding to pings immediately after I requested a reexec.

The LUV server is hosted at VPAC for free. As their staff have actual work to do they couldn’t spend a lot of time working on the LUV server. They told me that the screen was flickering and suspected a VGA cable. I got to the VPAC server room with the spare LUV server (LUV had been given 3 almost identical Sun servers from Barwon Water) at 16:30. By 17:30 I had fixed the core problem (boot with “init=/bin/bash”, mount the root filesystem rw, finish the upgrade of systemd and its dependencies, and then reboot normally). That got it into a state where the Xen server for Wikimedia Au was working but most LUV functionality wasn’t.

By 23:00 on Monday I had the full list server functionality working for users, which is the main feature that users want when it’s not near a meeting time. I can’t remember whether it was Monday night or Tuesday morning when I got the Drupal site going (the main LUV web site). Last night at midnight I got the last of the Mailman administrative interface going. I admit I could have got it going a bit earlier by putting SE Linux in permissive mode, but I don’t think that the members would have benefited from that (I’ll upload a SE Linux policy package that gets Mailman working on Jessie soon).

Now it’s Wednesday and I’m still fixing some cron jobs. Along the way I noticed some problems with excessive disk space use that I’m fixing now and I’ve also removed some Wikimedia related configuration files that were obsolete and would have prevented anyone from using a wikimedia.org.au address to subscribe to the LUV mailing lists.

Now I believe that everything is working correctly and generally working better than before.

Lessons Learned

While Sunday night wasn’t a bad time to start the upgrade it wasn’t the best. If I had started the upgrade on Monday morning there would have been less down-time. Another possibility might have been to do the upgrade near the VPAC office during business hours: I could have started the upgrade at a nearby cafe and then visited the server room immediately if something went wrong.

Doing an upgrade on a day when there’s no meeting within a week was a good choice. It wasn’t really a conscious choice: near a meeting day I’m usually busy with meeting-related LUV work, which precludes LUV work that doesn’t need to be done soon. But in future it would be best to consciously plan upgrades for a date when users aren’t going to need the service much.

While the Wheezy systemd bug is unlikely to ever be fixed there are work-arounds that shouldn’t result in a broken server. At the moment it seems that the best option would be to kill -9 the systemctl processes that hang until the packages that systemd depends on are installed. The problem is that the upgrade hangs while the new systemctl tries to tell the old systemd to restart daemons. If we can get past that to the stage where the shared objects are installed then it should be ok.
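
That work-around could be as simple as running something like the following from another terminal each time the upgrade hangs (an untested sketch, and it may need to be repeated several times during the upgrade):

pkill -9 systemctl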

The Apache upgrade from 2.2.x to 2.4.x changed the operation of some access control directives and it took me some time to work out how to fix that. Doing a Google search on the differences between those would have led me to the Apache document about upgrading from 2.2 to 2.4 [1]. That wouldn’t have prevented some down-time of the web sites but would have allowed me to prepare for it and to more quickly fix the problems when they became apparent. Also the rather confusing configuration of the LUV server (supporting many web sites that are no longer used) didn’t help things. I think that removing cruft from an installation before an upgrade would be better than waiting until after things break.

Next time I do an upgrade of such a server I’ll write notes about it while I go. That will give a better blog post about it if it becomes newsworthy enough to be blogged about and also more opportunities to learn better ways of doing it.

Sorry for the inconvenience.

BTRFS Status June 2015

The version of btrfs-tools in Debian/Jessie is incapable of creating a filesystem that can be mounted by the kernel in Debian/Wheezy. If you want to use a BTRFS filesystem on Jessie and Wheezy (which isn’t uncommon with removable devices) the only options are to use the Wheezy version of mkfs.btrfs or to use a Jessie kernel on Wheezy. I recently got bitten by this issue when I created a BTRFS filesystem on a removable device with a lot of important data (which is why I wanted metadata duplication and checksums) and had to read it on a server running Wheezy. Fortunately KVM in Wheezy works really well so I created a virtual machine to read the disk. Setting up a new KVM isn’t that difficult, but it’s not something I want to do while a client is anxiously waiting for their data.
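
One possible workaround, which I haven’t tested so treat it as a sketch, is to use the -O option of the Jessie mkfs.btrfs to disable the newer filesystem features that the Wheezy kernel doesn’t support:

# list the feature names known to this version of mkfs.btrfs
mkfs.btrfs -O list-all
# create a filesystem with newer features disabled (the feature names are assumptions)
mkfs.btrfs -O ^extref,^skinny-metadata /dev/sdX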

BTRFS has been working well for me apart from the Jessie/Wheezy compatibility issue (which was an annoyance but didn’t stop me doing what I wanted). I haven’t written a BTRFS status report for a while because everything has been OK and there has been nothing exciting to report.

I regularly get errors from the cron jobs that run a balance, claiming that the filesystem has run out of free space. I have the cron jobs because of past problems with BTRFS running out of metadata space. In spite of the jobs often failing the systems keep working, so I’m not too worried at the moment. I think this is a bug, but there are many more important bugs.

Linux kernel version 3.19 was the first version to have working support for RAID-5 recovery, which makes it the first version to have usable RAID-5 (I think there is no point even having RAID-5 without recovery). It wouldn’t be prudent to trust your important data to a new feature in a filesystem. So at this stage if I needed a very large scratch space then BTRFS RAID-5 might be a viable option, but for anything else I wouldn’t use it. BTRFS has still had little performance optimisation; while this doesn’t matter much for SSDs or single-disk filesystems, it would probably hurt a lot on a RAID-5 of hard drives. Maybe BTRFS RAID-5 would be good for a scratch array of SSDs. The reports of problems with RAID-5 don’t surprise me at all.

I have a BTRFS RAID-1 filesystem on 2*4TB disks which is giving poor performance on metadata; simple operations like “ls -l” on a directory with ~200 subdirectories take many seconds to run. I suspect that part of the problem is due to the filesystem being written by cron jobs with files accumulating over more than a year. The “btrfs filesystem” command (see btrfs-filesystem(8)) allows defragmenting files and directory trees, but unfortunately it doesn’t support recursively defragmenting just the directories while skipping the files. I really wish there was a way to get BTRFS to put all metadata on SSD and all data on hard drives. Sander suggested the following command on the BTRFS mailing list to defragment directories:

find / -xdev -type d -execdir btrfs filesystem defrag -c {} +

Below is the output of “zfs list -t snapshot” on a server I run; it’s often handy to know how much space is used by snapshots, but unfortunately BTRFS has no support for this.

NAME                       USED  AVAIL  REFER  MOUNTPOINT
hetz0/be0-mail@2015-03-10  2.88G     -   387G  -
hetz0/be0-mail@2015-03-11  1.12G     -   388G  -
hetz0/be0-mail@2015-03-12  1.11G     -   388G  -
hetz0/be0-mail@2015-03-13  1.19G     -   388G  -

Hugo pointed out on the BTRFS mailing list that the following command will give the amount of space used for snapshots. $SNAPSHOT is the name of a snapshot and $LASTGEN is the generation number of the previous snapshot you want to compare with.

btrfs subvolume find-new $SNAPSHOT $LASTGEN | awk '{total = total + $7}END{print total}'

One upside of the BTRFS implementation in this regard is that the above btrfs command without being piped through awk shows you the names of files that are being written and the amounts of data written to them. Through casually examining this output I discovered that the most written files in my home directory were under the “.cache” directory (which wasn’t exactly a surprise).

Now I am configuring workstations with a separate subvolume for ~/.cache for the main user. This means that ~/.cache changes don’t get stored in the hourly snapshots and less disk space is used for snapshots.
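
The setup is something like the following sketch, run while the user is logged out (the user name is an example):

mv /home/etbe/.cache /home/etbe/.cache-old
btrfs subvolume create /home/etbe/.cache
chown etbe:etbe /home/etbe/.cache
rm -rf /home/etbe/.cache-old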

Conclusion

My observation is that things are going quite well with BTRFS. It’s more than 6 months since I had a noteworthy problem, which is pretty good for a filesystem that’s still under active development. But there are still many systems I run which could benefit from the data integrity features of ZFS and BTRFS, but don’t have the resources to run ZFS and need more reliability than I can expect from an unattended BTRFS system.

At this time the only servers I run with BTRFS are located within a reasonable driving distance of my home (not the servers in Germany and the US) and are easily accessible (not the embedded systems). ZFS is working well for some of the servers in Germany. Eventually I’ll probably run ZFS on all the hosted servers in Germany and the US; I expect that will happen before I’m comfortable running BTRFS on such systems. For the embedded systems I will just take the risk of data loss/corruption for the next few years.

SE Linux Play Machine Over Tor

I work on SE Linux to improve security for all computer users. I think that my work has gone reasonably well in that regard in terms of directly improving security of computers and helping developers find and fix certain types of security flaws in apps. But a large part of the security problems we have at the moment are related to subversion of Internet infrastructure. The Tor project is a significant step towards addressing such problems. So to achieve my goals in improving computer security I have to support the Tor project. So I decided to put my latest SE Linux Play Machine online as a Tor hidden service. There is no real need for it to be hidden (for the record it’s in my bedroom), but it’s a learning experience for me and for everyone who logs in.

A Play Machine is what I call a system with root as the guest account with only SE Linux to restrict access.

Running a Hidden Service

A Hidden Service in Tor is just a cryptographically protected address that forwards to a regular TCP port. It’s not difficult to set up and the Tor project has good documentation [1]. For Debian the file to edit is /etc/tor/torrc.

I added the following 3 lines to my torrc to create a hidden service for SSH. I forwarded port 80 for test purposes because web browsers are easier to configure for SOCKS proxying than ssh.

HiddenServiceDir /var/lib/tor/hidden_service/
HiddenServicePort 22 192.168.0.2:22
HiddenServicePort 80 192.168.0.2:22
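
After adding those lines Tor needs to reload its configuration, and it will then write the generated onion address to the hostname file under the HiddenServiceDir:

service tor reload
cat /var/lib/tor/hidden_service/hostname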

Generally when setting up a hidden service you want to avoid using an IP address that gives anything away. So it’s a good idea to run a hidden service on a virtual machine that is well isolated from any public network. My Play machine is hidden in that manner not for secrecy but to prevent it being used for attacking other systems.

SSH over Tor

Howtoforge has a good article on setting up SSH with Tor [2]. That has everything you need for setting up Tor for a regular ssh connection, but the tor-resolve program only works for connecting to services on the public Internet. By design the .onion addresses used by Hidden Services have no mapping to anything that resembles an IP address, and tor-resolve fails on them. I believe that the fact that tor-resolve breaks things in this situation is a bug; I have filed Debian bug report #776454 requesting that tor-resolve allow such connections to just work [3].

Host *.onion
ProxyCommand connect -5 -S localhost:9050 %h %p

I use the above ssh configuration (which can go in ~/.ssh/config or /etc/ssh/ssh_config) to tell the ssh client how to deal with .onion addresses. I also had to install the connect-proxy package which provides the connect program.

ssh root@zp7zwyd5t3aju57m.onion
The authenticity of host ‘zp7zwyd5t3aju57m.onion ()’ can’t be established.
ECDSA key fingerprint is 3c:17:2f:7b:e2:f6:c0:c2:66:f5:c9:ab:4e:02:45:74.
Are you sure you want to continue connecting (yes/no)?

I now get the above message when I connect; the ssh developers have dealt with connecting via a proxy that doesn’t have an IP address.

Also see the general information page about my Play Machine, that information page has the root password [4].

Systemd Notes

A few months ago I gave a lecture about systemd for the Linux Users of Victoria. Here are some of my notes reformatted as a blog post:

Scripts in /etc/init.d can still be used, they work the same way as they do under sysvinit for the user. You type the same commands to start and stop daemons.

To get a result similar to changing runlevel use the “systemctl isolate” command. Runlevels were never really supported in Debian (unlike Red Hat where they were used for starting and stopping the X server) so for Debian users there’s no change here.
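
For example, the following is roughly equivalent to the old “telinit 3” on Red Hat systems:

systemctl isolate multi-user.target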

The command systemctl with no params shows a list of loaded services and highlights failed units.

The command “journalctl -u UNIT-PATTERN” shows journal entries for the unit(s) in question. The pattern uses wildcards, not regexes.

The systemd journal includes the stdout and stderr of all daemons. This solves the problem of daemons that don’t log all errors to syslog and leave the sysadmin wondering why they don’t work.

The command “systemctl status UNIT” gives the status and last log entries for the unit in question.

A program can use ioctl(fd, TIOCSTI, …) to push characters into a tty buffer. If the sysadmin runs an untrusted program with the same controlling tty then it can cause the sysadmin shell to run hostile commands. The system call setsid() to create a new terminal session is one solution but managing which daemons can be started with it is difficult. The way that systemd manages start/stop of all daemons solves this. I am glad to be rid of the run_init program we used to use on SE Linux systems to deal with this.

Systemd has a mechanism to ask for passwords for SSL keys and encrypted filesystems etc. There have been problems with that in the past but I think they are all fixed now. While there is some difficulty during development the end result of having one consistent way of managing this will be better than having multiple daemons doing it in different ways.

The commands “systemctl enable” and “systemctl disable” enable/disable daemon start at boot which is easier than the SysVinit alternative of update-rc.d in Debian.

Systemd has built in seat management, which is not more complex than consolekit which it replaces. Consolekit was installed automatically without controversy so I don’t think there should be controversy about systemd replacing consolekit.

Systemd improves performance by parallel start and autofs style fsck.

The command systemd-cgtop shows resource use for cgroups it creates.

The command “systemd-analyze blame” shows what delayed the boot process and “systemd-analyze critical-chain” shows the critical path in boot delays.

Systemd also has security features such as a private /tmp for services and restricting service access to directory trees.
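
For example a unit file (or a drop-in override) can contain directives like the following; a minimal sketch, see systemd.exec(5) for the full list:

[Service]
PrivateTmp=yes
ReadOnlyDirectories=/etc
InaccessibleDirectories=/home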

Conclusion

For basic use things just work, you don’t need to learn anything new to use systemd.

It provides significant benefits for boot speed and potentially security.

It doesn’t seem more complex than other alternative solutions to the same problems.
