Sysadmin Skills and University Degrees

I think that a major deficiency in Computer Science degrees is the lack of sysadmin training.

Version Control

The first thing that needs to be added is the basics of version control. CVS (which is now regarded as obsolete) was initially released when I was in the first year of university. But SCCS and RCS had been in use for some time. I think that the people who designed my course were remiss in not adding any mention of version control (not even strategies for saving old versions of your work), one could say that they taught us about version control by letting us accidentally delete our assignments. :-#

If a course is aimed at just teaching programmers (as most CS degrees are) then version control for group assignments should be a standard part of the course. Having some marks allocated for the quality of comments in the commit log would also be good.

A modern CS degree should cover distributed version control, that means covering Git as it’s the most popular distributed version control system nowadays.

For people who want to work as sysadmins (as opposed to developers who run their own PCs) a course should have an optional subject for version control of an entire system. That includes tools like etckeeper for version control of system configuration and tools like Puppet for automated configuration and system maintenance.


It’s quite reasonable for a CS degree to provide simplified problems for the students to solve so they can concentrate on one task. But in the real world the problems are more complex. One of the more difficult parts of managing real systems is dependencies. You have issues of header files etc at compile time and library versions at deployment. Often you need a program to run on systems with different versions of the OS which means making it compile for both and deal with differences in behaviour.

There are lots of hacky things that people do to deal with dependencies in systems. People link compiled programs statically, install custom versions of interpreters in user home directories or /usr/local for daemons, and do many other things. These things can have bad consequences including data loss, system downtime, and security problems. It’s not always wrong to do such things, but it’s something that should only be done with knowledge of the potential consequences and a plan for mitigating them. A CS degree should teach the potential advantages and disadvantages of these options to allow graduates to make informed decisions.


I’ve met many people who call themselves computer professionals and think that backups aren’t needed. I’ve seen production systems that were designed in a way that backups were impossible. The lack of backups is a serious problem for the entire industry.

Some lectures about backups could be part of a version control subject in a general CS degree. For a degree that majors in Sysadmin at least one subject about backups is appropriate.

For any backup (even backing up your home PC) you should have offsite backups to deal with fire damage, multiple backups of different ages (especially important now that encryption malware is a serious threat), and a plan for how fast you can restore things.

The most common use of backups is to deal with the case of deleting the wrong file. Unfortunately this case seems to be the most rarely mentioned.

Another common situation that should be covered is a configuration error that results in a system that won’t boot correctly. It’s a very common problem and one that can be solved quickly if you are prepared but which can take a long time if you aren’t.

For a Sysadmin course it is important to cover backups of systems in remote datacenters.


A good CS degree should cover the process of selecting suitable hardware. Programmers often get to advise on the hardware used to run their code, especially at smaller companies. Reliability features such as RAID, ECC RAM, and clustering should be covered.

Planning for upgrades is a very important part of this which is usually not taught. Not only do you need to plan for an upgrade without much downtime or cost but you also need to plan for what upgrades are possible. Next year will your system require hardware that is more powerful than you can buy next year? If so you need to plan for a cluster now.

For a Sysadmin course some training about selecting cloud providers and remote datacenter hosting should be provided. There are many complex issues that determine whether it’s most appropriate to use a cloud service, hosted virtual machines, hosted physical servers managed by the ISP, hosted physical servers purchased by the client, or on-site servers. Often a large system will involve 2 or more of those options, even some small companies use 3 or more of those options to try and provide the performance and reliability they need at a price they can afford.

We Need Sysadmin Degrees

Covering the basic coding skills takes a lot of time. I don’t think we can reasonably expect a CS degree to cover all that and also give good coverage to sysadmin work. While some basic sysadmin skills are needed by every programmer I think we need to have separate majors for people who want a career in system administration.

Sysadmins need some programming skills, but that’s mostly scripting and basic debugging. Someone who’s main job is as a sysadmin can probably expect to never make any significant change to a program that’s more than 10,000 lines long. A large amount of the programming in a CS degree can be replaced by “file a bug report” for a sysadmin degree.

This doesn’t mean that sysadmins shouldn’t be doing software development or that they aren’t good at it. One noteworthy fact is that it appears that the most common job among developers of the Debian distribution of Linux is System Administration. Developing an OS involves some of the most intensive and demanding programming. But I think that more than a few people who do such work would have skipped a couple of programming subjects in favour of sysadmin subjects if they were given a choice.


Did I miss anything? What other sysadmin skills should be taught in a CS degree?

Do any universities teach these things now? If so please name them in the comments, it is good to help people find universities that teach them what they want to learn and help them in their career.

I Just Ordered a Nexus 6P

Last year I wrote a long-term review of Android phones [1]. I noted that my Galaxy Note 3 only needed to last another 4 months to be the longest I’ve been happily using a phone.

Last month (just over 7 months after writing that) I fell on my Note 3 and cracked the screen. The Amourdillo case is good for protecting the phone [2] so it would have been fine if I had just dropped it. But I fell with the phone in my hand, the phone landed face down and about half my body weight ended up in the middle of the phone which apparently bent it enough to crack the screen. As a result of this the GPS seems to be less reliable than it used to be so there might be some damage to the antenna too.

I was quoted $149 to repair the screen, I could possibly have found a cheaper quote if I had shopped around but it was a good starting point for comparison. The Note 3 originally cost $550 including postage in 2014. A new Note 4 costs $550 + postage now from Shopping Square and a new Note 3 is on ebay with a buy it now price of $380 with free postage.

It seems like bad value to pay 40% of the price of a new Note 3 or 25% the price of a Note 4 to fix my old phone (which is a little worn and has some other minor issues). So I decided to spend a bit more and have a better phone and give my old phone to one of my relatives who doesn’t mind having a cracked screen.

I really like the S-Pen stylus on the Samsung Galaxy Note series of phones and tablets. I also like having a hardware home button and separate screen space reserved for the settings and back buttons. The downsides to the Note series are that they are getting really expensive nowadays and the support for new OS updates (and presumably security fixes) is lacking. So when Kogan offered a good price on a Nexus 6P [3] with 64G of storage I ordered one. I’m going to give the Note 3 to my father, he wants a phone with a bigger screen and a stylus and isn’t worried about cracks in the screen.

I previously wrote about Android device service life [4]. My main conclusion in that post was that storage space is a major factor limiting service life. I hope that 64G in the Nexus 6P will solve that problem, giving me 3 years of use and making it useful to my relatives afterwards. Currently I have 32G of storage of which about 8G is used by my music video collection and about 3G is free, so 64G should last me for a long time. Having only 3G of RAM might be a problem, but I’m thinking of trying CyanogenMod again so maybe with root access I can reduce the amount of RAM use.

Xen CPU Use per Domain again

8 years ago I wrote a script to summarise Xen CPU use per domain [1]. Since then changes to Xen required changes to the script. I have new versions for Debian/Wheezy (Xen 4.1) and Debian/Jessie (Xen 4.4).

Here’s a new script for Debian/Wheezy:

use strict;

open(LIST, "xm list --long|") or die "Can't get list";

my $name = "Dom0";
my $uptime = 0.0;
my $cpu_time = 0.0;
my $total_percent = 0.0;
my $cur_time = time();

open(UPTIME, "</proc/uptime") or die "Can't open /proc/uptime";
my @arr = split(/ /, <UPTIME>);
$uptime = $arr[0];

my %all_cpu;

  if($_ =~ /^\)/)
    my $cpu = $cpu_time / $uptime * 100.0;
    if($name =~ /Domain-0/)
      printf("%s uses %.2f%% of one CPU\n", $name, $cpu);
      $all_cpu{$name} = $cpu;
    $total_percent += $cpu;
  $_ =~ s/\).*$//;
  if($_ =~ /start_time /)
    $_ =~ s/^.*start_time //;
    $uptime = $cur_time – $_;
  if($_ =~ /cpu_time /)
    $_ =~ s/^.*cpu_time //;
    $cpu_time = $_;
  if($_ =~ /\(name /)
    $_ =~ s/^.*name //;
    $name = $_;

sub hashValueDescendingNum {
  $all_cpu{$b} <=> $all_cpu{$a};

my $key;

foreach $key (sort hashValueDescendingNum (keys(%all_cpu)))
  printf("%s uses %.2f%% of one CPU\n", $key, $all_cpu{$key});

printf("Overall CPU use approximates %.1f%% of one CPU\n", $total_percent);

Here’s the script for Debian/Jessie:


use strict;

open(UPTIME, "xl uptime|") or die "Can't get uptime";
open(LIST, "xl list|") or die "Can't get list";

my %all_uptimes;

  chomp $_;

  next if($_ =~ /^Name/);
  $_ =~ s/ +/ /g;

  my @split1 = split(/ /, $_);
  my $dom = $split1[0];
  my $uptime = 0;
  my $time_ind = 2;
  if($split1[3] eq "days,")
    $uptime = $split1[2] * 24 * 3600;
    $time_ind = 4;
  my @split2 = split(/:/, $split1[$time_ind]);
  $uptime += $split2[0] * 3600 + $split2[1] * 60 + $split2[2];
  $all_uptimes{$dom} = $uptime;

my $total_percent = 0;

  chomp $_;

  my $dom = $_;
  $dom =~ s/ .*$//;

  if ( $_ =~ /(\d+)\.[0-9]$/ )
    my $percent = $1 / $all_uptimes{$dom} * 100.0;
    $total_percent += $percent;
    printf("%s uses %.2f%% of one CPU\n", $dom, $percent);

printf("Overall CPU use approximates  %.1f%% of one CPU\n", $total_percent);

BIND Configuration Files

I’ve recently been setting up more monitoring etc to increase the reliability of servers I run. One ongoing issue with computer reliability is any case where a person enters the same data in multiple locations, often people make mistakes and enter slightly different data which can give bad results.

For DNS you need to have at least 2 authoritative servers for each zone. I’ve written the below Makefile to extract the zone names from the primary server and generate a config file suitable for use on a secondary server. The next step is to automate this further by having the Makefile copy the config file to secondary servers and run “rndc reload”. Note that in a typical Debian configuration any user in group “bind” can write to BIND config files and reload the server configuration so this can be done without granting the script on the primary server root access on the secondary servers.

My blog replaces the TAB character with 8 spaces, you need to fix this up if you want to run the Makefile on your own system and also replace with the IP address of your primary server.

all: other/secondary.conf

other/secondary.conf: named.conf.local Makefile
        for n in $$(grep ^zone named.conf.local | cut -f2 -d\"|sort) ; do echo "zone \"$$n\" {\n  type slave;\n  file \"$$n\";\n  masters {; };\n};\n" ; done > other/secondary.conf

Ethernet Interface Naming With Systemd

Systemd has a new way of specifying names for Ethernet interfaces as documented in The Debian package should keep working with the old 70-persistent-net.rules file, but I had a problem with this that forced me to learn about

Below is a little shell script I wrote to convert a basic 70-persistent-net.rules (that only matches on MAC address) to files.



for n in $(grep ^SUB $RULES|sed -e s/^.*NAME..// -e s/.$//) ; do
  LINE=$(grep $n $RULES)
  MAC=$(echo $LINE|sed -e s/^.*address….// -e s/…ATTR.*$//)
  echo "[Match]" > $NAME
  echo "MACAddress=$MAC" >> $NAME
  echo "[Link]" >> $NAME
  echo "Name=$n" >> $NAME


At LCA I attended a talk about Unikernels. Here are the reasons why I think that they are a bad idea:

Single Address Space

According to the Unikernel Wikipedia page [1] a significant criteria for a Unikernel system is that it has a single address space. This gives performance benefits as there is no need to change CPU memory mappings when making system calls. But the disadvantage is that any code in the application/kernel can access any other code directly.

In a typical modern OS (Linux, BSD, Windows, etc) every application has a separate address space and there are separate memory regions for code and data. While an application can request the ability to modify it’s own executable code in some situations (if the OS is configured to allow that) it won’t happen by default. In MS-DOS and in a Unikernel system all code has read/write/execute access to all memory. MS-DOS was the least reliable OS that I ever used. It was unreliable because it performed tasks that were more complex than CP/M but had no memory protection so any bug in any code was likely to cause a system crash. The crash could be delayed by some time (EG corrupting data structures that are only rarely accessed) which would make it very difficult to fix. It would be possible to have a Unikernel system with non-modifyable executable areas and non-executable data areas and it is conceivable that a virtual machine system like Xen could enforce that. But that still wouldn’t solve the problem of all code being able to write to all data.

On a Linux system when an application writes to the wrong address there is a reasonable probability that it will not have write access and you will immediately get a SEGV which is logged and informs the sysadmin of the address of the crash.

When Linux applications have bugs that are difficult to diagnose (EG buffer overruns that happen in production and can’t be reproduced in a test environment) there are a variety of ways of debugging them. Tools such as Valgrind can analyse memory access and tell the developers which code had a bug and what the bug does. It’s theoretically possible to link something like Valgrind into a Unikernel, but the lack of multiple processes would make it difficult to manage.


A full Unix environment has a rich array of debugging tools, strace, ltrace, gdb, valgrind and more. If there are performance problems then tools like sysstat, sar, iostat, top, iotop, and more. I don’t know which of those tools I might need to debug problems at some future time.

I don’t think that any Internet facing service can be expected to be reliable enough that it will never need any sort of debugging.

Service Complexity

It’s very rare for a server to have only a single process performing the essential tasks. It’s not uncommon to have a web server running CGI-BIN scripts or calling shell scripts from PHP code as part of the essential service. Also many Unix daemons are not written to run as a single process, at least threading is required and many daemons require multiple processes.

It’s also very common for the design of a daemon to rely on a cron job to clean up temporary files etc. It is possible to build the functionality of cron into a Unikernel, but that means more potential bugs and more time spent not actually developing the core application.

One could argue that there are design benefits to writing simple servers that don’t require multiple programs. But most programmers aren’t used to doing that and in many cases it would result in a less efficient result.

One can also argue that a Finite State Machine design is the best way to deal with many problems that are usually solved by multi-threading or multiple processes. But most programmers are better at writing threaded code so forcing programmers to use a FSM design doesn’t seem like a good idea for security.


The typical server programs rely on cron jobs to rotate log files and monitoring software to inspect the state of the system for the purposes of graphing performance and flagging potential problems.

It would be possible to compile the functionality of something like the Nagios NRPE into a Unikernel if you want to have your monitoring code running in the kernel. I’ve seen something very similar implemented in the past, the CA Unicenter monitoring system on Solaris used to have a kernel module for monitoring (I don’t know why). My experience was that Unicenter caused many kernel panics and more downtime than all other problems combined. It would not be difficult to write better code than the typical CA employee, but writing code that is good enough to have a monitoring system running in the kernel on a single-threaded system is asking a lot.

One of the claimed benefits of a Unikernel was that it’s supposedly risky to allow ssh access. The recent ssh security issue was an attack against the ssh client if it connected to a hostile server. If you had a ssh server only accepting connections from management workstations (a reasonably common configuration for running servers) and only allowed the ssh clients to connect to servers related to work (an uncommon configuration that’s not difficult to implement) then there wouldn’t be any problems in this regard.

I think that I’m a good programmer, but I don’t think that I can write server code that’s likely to be more secure than sshd.

On Designing It Yourself

One thing that everyone who has any experience in security has witnessed is that people who design their own encryption inevitably do it badly. The people who are experts in cryptology don’t design their own custom algorithm because they know that encryption algorithms need significant review before they can be trusted. The people who know how to do it well know that they can’t do it well on their own. The people who know little just go ahead and do it.

I think that the same thing applies to operating systems. I’ve contributed a few patches to the Linux kernel and spent a lot of time working on SE Linux (including maintaining out of tree kernel patches) and know how hard it is to do it properly. Even though I’m a good programmer I know better than to think I could just build my own kernel and expect it to be secure.

I think that the Unikernel people haven’t learned this.

Compatibility and a Linux Community Server

Compatibility/interoperability is a good thing. It’s generally good for systems on the Internet to be capable of communicating with as many systems as possible. Unfortunately it’s not always possible as new features sometimes break compatibility with older systems. Sometimes you have systems that are simply broken, for example all the systems with firewalls that block ICMP so that connections hang when the packet size gets too big. Sometimes to take advantage of new features you have to potentially trigger issues with broken systems.

I recently added support for IPv6 to the Linux Users of Victoria server. I think that adding IPv6 support is a good thing due to the lack of IPv4 addresses even though there are hardly any systems that are unable to access IPv4. One of the benefits of this for club members is that it’s a platform they can use for testing IPv6 connectivity with a friendly sysadmin to help them diagnose problems. I recently notified a member by email that the callback that their mail server used as an anti-spam measure didn’t work with IPv6 and was causing mail to be incorrectly rejected. It’s obviously a benefit for that user to have the problem with a small local server than with something like Gmail.

In spite of the fact that at least one user had problems and others potentially had problems I think it’s clear that adding IPv6 support was the correct thing to do.

SSL Issues

Ben wrote a good post about SSL security [1] which links to a test suite for SSL servers [2]. I tested the LUV web site and got A-.

This blog post describes how to setup PFS (Perfect Forward Secrecy) [3], after following it’s advice I got a score of B!

From the comments on this blog post about RC4 etc [4] it seems that the only way to have PFS and not be vulnerable to other issues is to require TLS 1.2.

So the issue is what systems can’t use TLS 1.2.

TLS 1.2 Support in Browsers

This Wikipedia page has information on SSL support in various web browsers [5]. If we require TLS 1.2 we break support of the following browsers:

The default Android browser before Android 5.0. Admittedly that browser always sucked badly and probably has lots of other security issues and there are alternate browsers. One problem is that many people who install better browsers on Android devices (such as Chrome) will still have their OS configured to use the default browser for URLs opened by other programs (EG email and IM).

Chrome versions before 30 didn’t support it. But version 30 was released in 2013 and Google does a good job of forcing upgrades. A Debian/Wheezy system I run is now displaying warnings from the google-chrome package saying that Wheezy is too old and won’t be supported for long!

Firefox before version 27 didn’t support it (the Wikipedia page is unclear about versions 27-31). 27 was released in 2014. Debian/Wheezy has version 38, Debian/Squeeze has Iceweasel 3.5.16 which doesn’t support it. I think it is reasonable to assume that anyone who’s still using Squeeze is using it for a server given it’s age and the fact that LTS is based on packages related to being a server.

IE version 11 supports it and runs on Windows 7+ (all supported versions of Windows). IE 10 doesn’t support it and runs on Windows 7 and Windows 8. Are the free upgrades from Windows 7 to Windows 10 going to solve this problem? Do we want to support Windows 7 systems that haven’t been upgraded to the latest IE? Do we want to support versions of Windows that MS doesn’t support?

Windows mobile doesn’t have enough users to care about.

Opera supports it from version 17. This is noteworthy because Opera used to be good for devices running older versions of Android that aren’t supported by Chrome.

Safari supported it from iOS version 5, I think that’s a solved problem given the way Apple makes it easy for users to upgrade and strongly encourages them to do so.

Log Analysis

For many servers the correct thing to do before even discussing the issue is to look at the logs and see how many people use the various browsers. One problem with that approach on a Linux community site is that the people who visit the site most often will be more likely to use recent Linux browsers but older Windows systems will be more common among people visiting the site for the first time. Another issue is that there isn’t an easy way of determining who is a serious user, unlike for example a shopping site where one could search for log entries about sales.

I did a quick search of the Apache logs and found many entries about browsers that purport to be IE6 and other versions of IE before 11. But most of those log entries were from other countries, while some people from other countries visit the club web site it’s not very common. Most access from outside Australia would be from bots, and the bots probably fake their user agent.

Should We Do It?

Is breaking support for Debian/Squeeze, the built in Android browser on Android <5.0, and Windows 7 and 8 systems that haven’t upgraded IE as a web browsing platform a reasonable trade-off for implementing the best SSL security features?

For the LUV server as a stand-alone issue the answer would be no as the only really secret data there is accessed via ssh. For a general web infrastructure issue it seems that the answer might be yes.

I think that it benefits the community to allow members to test against server configurations that will become more popular in the future. After implementing changes in the server I can advise club members (and general community members) about how to configure their servers for similar results.

Does this outweigh the problems caused by some potential users of ancient systems?

I’m blogging about this because I think that the issues of configuration of community servers have a greater scope than my local LUG. I welcome comments about these issues, as well as about the SSL compatibility issues.

Using LetsEncrypt

Lets Encrypt is a new service to provide free SSL keys [1]. I’ve just set it up on a few servers that I run.


The first thing to note is that the client is designed to manage your keys and treat all keys on a server equally with a single certificate. It shouldn’t be THAT difficult to do things in other ways but it would involve extra effort. The next issue that can make things difficult is that it is designed that the web server will have a module to negotiate new keys automatically. Automatically negotiating new keys will be really great when we get that all going, but as I didn’t feel like installing a slightly experimental Apache module on my servers that meant I had to stop Apache while I got the keys – and I’ll have to do that again every 3 months as the keys have a short expiry time.

There are some other ways of managing keys, but the web servers I’m using Lets Encrypt with at the moment aren’t that important and a couple of minutes of downtime is acceptable.

When you request multiple keys (DNS names) for one server to make it work without needless effort you have to get them all in the one operation. That gives you a single key file for all DNS names which is very convenient for services that don’t support getting the hostname before negotiating SSL. But it could be difficult if you wanted to have one of the less common configurations such as having a mail server and a web server on the same IP addess but using different keys

How To Get Keys

deb testing main

The letsencrypt client is packaged for Debian in Testing but not in Jessie. Adding the above to the /etc/apt/sources.list file for a Jessie system allows installing it and a few dependencies from Testing. Note that there are problems with doing this, you can’t be certain that all the other apps installed will be compatible with the newer versions of libraries that are installed and you won’t get security updates.

letsencrypt certonly --standalone-supported-challenges tls-sni-01

The above command makes the letsencrypt client listen on port 443 to talk to the Lets Encrypt server. It prompts you for server names so if you want to minimise the downtime for your web server you could specify the DNS names on the command-line.

If you run it on a SE Linux system you need to run “setsebool allow_execmem 1” before running it and “setsebool allow_execmem 0” afterwards as it needs execmem access. I don’t think it’s a problem to temporarily allow execmem access for the duration of running this program, if you use KDE then you will be forced to allow such access all the time for the desktop to operate correctly.

How to Install Keys

[ssl:emerg] [pid 9361] AH02564: Failed to configure encrypted (?) private key, check /etc/letsencrypt/live/

The letsencrypt client suggests using the file fullchain.pem which has the key and the full chain of certificates. When I tried doing that I got errors such as the above in my Apache error.log. So I gave up on that and used the separate files. The only benefit of using the fullchain.pem file is to have a single line in a configuration file instead of 3. Trying to debug issues with fullchain.pem took me a lot longer than copy/paste for the 3 lines.

Under /etc/letsencrypt/live/$NAME there are symlinks to the real files. So when you get new keys the old keys will be stored but the same file names can be used.

SSLCertificateFile "/etc/letsencrypt/live/"
SSLCertificateChainFile "/etc/letsencrypt/live/"
SSLCertificateKeyFile "/etc/letsencrypt/live/"

The above commands are an example for configuring Apache 2.

smtpd_tls_cert_file = /etc/letsencrypt/live/
smtpd_tls_key_file = /etc/letsencrypt/live/
smtpd_tls_CAfile = /etc/letsencrypt/live/

Above is an example of Postfix configuration.

ssl_cert = </etc/letsencrypt/live/
ssl_key = </etc/letsencrypt/live/
ssl_ca = </etc/letsencrypt/live/

Above is an example for Dovecot, it goes in /etc/dovecot/conf.d/10-ssl.conf in a recent Debian version.


At this stage using letsencrypt is a little fiddly so for some commercial use (where getting the latest versions of software in production is difficult) it might be a better option to just pay for keys. However some companies I’ve worked for have had issues with getting approval for purchases which would make letsencrypt a good option to avoid red tape.

When Debian/Stretch is released with letsencrypt I think it will work really well for all uses.

Finding Storage Performance Problems

Here are some basic things to do when debugging storage performance problems on Linux. It’s deliberately not an advanced guide, I might write about more advanced things in a later post.

Disk Errors

When a hard drive is failing it often has to read sectors several times to get the right data, this can dramatically reduce performance. As most hard drives aren’t monitored properly (email or SMS alerts on errors) it’s quite common for the first notification about an impending failure to be user complaints about performance.

View your kernel message log with the dmesg command and look in /var/log/kern.log (or wherever your system is configured to store kernel logs) for messages about disk read errors, bus resetting, and anything else unusual related to the drives.

If you use an advanced filesystem like BTRFS or ZFS there are system commands to get filesystem information about errors. For BTRFS you can run “btrfs device stats MOUNTPOINT” and for ZFS you can run “zpool status“.

Most performance problems aren’t caused by failing drives, but it’s a good idea to eliminate that possibility before you continue your investigation.

One other thing to look out for is a RAID array where one disk is noticeably slower than the others. For example if you have a RAID-5 or RAID-6 array every drive should have almost the same number of reads and writes, if one disk in the array is at 99% performance capacity and the other disks are at 5% then it’s an indication of a failing disk. This can happen even if SMART etc don’t report errors.

Monitoring IO

The iostat program in the Debian sysstat package tells you how much IO is going to each disk. If you have physical hard drives sda, sdb, and sdc you could run the command “iostat -x 10 sda sdb sdc” to tell you how much IO is going to each disk over 10 second periods. You can choose various durations but I find that 10 seconds is long enough to give results that are useful.

By default iostat will give stats on all block devices including LVM volumes, but that usually gives too much data to analyse easily.

The most useful things that iostat tells you are the %util (the percentage utilisation – anything over 90% is a serious problem), the reads per second “r/s“, and the writes per second “w/s“.

The parameters to iostat for block devices can be hard drives, partitions, LVM volumes, encrypted devices, or any other type of block device. After you have discovered which block devices are nearing their maximum load you can discover which of the partitions, RAID arrays, or swap devices on that disk are causing the load in question.

The iotop program in Debian (package iotop) gives a display that’s similar to that of top but for disk io. It generally isn’t essential (you can run “ps ax|grep D” to get most of that information), but it is handy. It will tell you which programs are causing IO on a busy filesystem. This can be good when you have a busy system and don’t know why. It isn’t very useful if you have a system that is used for one task, EG a database server that is known to be busy doing database stuff.

It’s generally a good idea to have sysstat and iotop installed on all systems. If a system is experiencing severe performance problems you might not want to wait for new packages to be installed.

In Debian the sysstat package includes the sar utility which can give historical information on system load. One benefit of using sar for diagnosing performance problems is that it shows you the time of day that has the most load which is the easiest time to diagnose performance problems.

Swap Use

Swap use sometimes confuses people. In many cases swap use decreases overall disk use, this is the design of the Linux paging algorithms. So if you have a server that accesses a lot of data it might swap out some unused programs to make more space for cache.

When you have multiple virtual machines on one system sharing the same disks it can be difficult to determine the best allocation for RAM. If one VM has some applications allocating a lot of RAM but not using it much then it might be best to give it less RAM and force those applications into swap so that another VM can cache all the data it accesses a lot.

The important thing is not the amount of swap that is allocated but the amount of IO that goes to the swap partition. Any significant amount of disk IO going to a swap device is a serious problem that can be solved by adding more RAM.

Reads vs Writes

The ratio of reads to writes depends on the applications and the amount of RAM. Some applications can have most of their reads satisfied from cache. For example an ideal configuration of a mail server will have writes significantly outnumber reads (I’ve seen ratios of 5:1 for writes to reads on real mail servers). Ideally a mail server will cache all new mail for at least an hour and as the most prolific users check their mail more frequently than that most mail will be downloaded before it leaves the cache. If you have a mail server with reads outnumbering writes then it needs more RAM. RAM is cheap nowadays so if you don’t want to compete with Gmail it should be cheap to buy enough RAM to cache all recent mail.

The ratio of reads to writes is important because it’s one way of quickly determining if you have enough RAM and adding RAM is often the cheapest way of improving performance.

Unbalanced IO

One common performance problem on systems with multiple disks is having more load going to some disks than to others. This might not be a problem (EG having cron jobs run on disks that are under heavy load while the web server accesses data from lightly loaded disks). But you need to consider whether it’s desirable to have some disks under more load than others.

The simplest solution to this problem is to just have a single RAID array for all data storage. This is also the solution that gives you the maximum available disk space if you use RAID-5 or RAID-6.

A more complex option is to use some SSDs for things that require performance and disks for things that don’t. This can be done with the ZIL and L2ARC features of ZFS or by just creating a filesystem on SSD for the data that is most frequently accessed.

What Did I Miss?

I’m sure that I missed something, please let me know of any other basic things to do – or suggestions for a post on more advanced things.

Sociological Images 2015

3 men 1 women on lift sign

The above sign was at the Melbourne Docks in December 2014 when I was returning from a cruise. I have no idea why there are 3 men and 1 woman on the sign (and a dock worker was also surprised when I explained why I was photographing it). I wonder whether a sign that had 3 women and 1 man would ever have been installed or not noticed if it was installed.

rules for asking questions at LCA2015

At the start of the first day of LCA 2015 the above was displayed at the keynote as a flow-chart for whether someone should ask a question at a lecture. Given that the first real item in the list is that a question should fit in a tweet I think it was inspired by my blog post about the length of conference questions [1].

Astronomy Miniconf suggestions for delegates

At the introduction to the Astronomy Miniconf the above slide was displayed. In addition to referencing the flow-chart for asking questions it recommends dimming laptop screens (among other things).

sign saying men to the left because women are always right

The above sign was at a restaurant in Auckland in January 2015. I thought that sort of sexist “joke” went out of fashion a few decades ago.

gendered nerf weaponary

The above photo is from a Melbourne department store in February 2015. Why gender a nerf gun? That just doesn’t make sense. Also it appeared that the only nerf crossbow was the purple/pink one, is a crossbow considered feminine nowadays?

Picture of Angela appropriating Native American clothing

The above picture is a screen-shot of one of the “Talking Angela” series of Android games from March. Appropriating the traditional clothing of marginalised groups is a bad thing. People of Native American heritage who want to wear their traditional clothing face discrimination when they do so, when white people play dress-up in clothing that is a parody of Native American style it’s really offensive. The site has a tag for articles about appropriation [2].

The above was in a library advertising an Ebook reader. In this case they didn’t even have pointlessly gendered products they just had pointlessly gendered adverts for the same product. They also perpetuate the myth that only girls read vampire books and only boys read about space. Also why is the girl lying down to read while the boy is sitting up?

Above is an Advent calendar on sale in a petrol station. Having end of year holiday presents that have nothing to do with religious festivals makes sense. But Advent is a religious observance. I think this would be a better candidate for “war on Christmas” paranoia than a coffee cup of the wrong colour.

The above photo is of boys and girls pipette suckers. Pointlessly gendered recreational products like Nerf guns is one thing, but I think that doing it to scientific equipment is a bigger problem. Are scientists going to stop work if they can’t find a pipette sucker of the desired gender? Is worrying about this going to distract them from their research (really bad if working with infectious or carcinogenic solutions). The Integra advertising claims to be doing this to promote breast cancer research which is also bogus. Here is a Sociological Images article about the problems of using pink to market breast cancer research [3] and the Sociological Images post about pinkwashing (boobies against breast cancer) is also worth reading [4].

As an aside I made a mistake in putting a pipette sucker over the woman’s chest in that picture. The way that Integra portreyed her chest is relevant to analysis of this advert. But unfortunately I didn’t photograph that.

Here is a link to my sociological images post from 2014 [5].