Load Average

Other Unix systems apparently calculate the load average differently to Linux. According to the Wikipedia page about Load (computing) [1], most Unix systems calculate it based on the average number of processes that are using a CPU or available for scheduling on a CPU, while Linux also includes the count of processes that are blocked on disk IO (uninterruptible sleep).

There are three load average numbers: the first is for the past minute, the second is for the past 5 minutes, and the third is for the past 15 minutes. In most cases you will only be interested in the first number.
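As a small illustration of where the three numbers come from, here is a minimal Python sketch. It assumes the Linux /proc/loadavg layout, in which the first three whitespace-separated fields are the 1, 5 and 15 minute values; the function names are made up for this example.

```python
import os


def parse_loadavg(text):
    """Return the (1 min, 5 min, 15 min) load averages from a /proc/loadavg line."""
    fields = text.split()
    one, five, fifteen = (float(field) for field in fields[:3])
    return one, five, fifteen


def current_loadavg():
    """Ask the operating system for the same three numbers (Unix only)."""
    return os.getloadavg()
```

On a Linux system, parse_loadavg(open("/proc/loadavg").read()) and current_loadavg() should report the same figures that uptime and top display.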

What is a good load average depends on the hardware. For a system with a single CPU core a load average of 1 or greater from CPU use will indicate that some processes may perform badly due to lack of CPU time – although a long-running background process with a high “nice” value can increase the load average without interfering with system performance in most cases. As a general rule if you want snappy performance then the load average component from CPU use should be less than the number of CPU cores (not hyper-threads). For example a system with two dual-core CPUs can be expected to perform really well with a load average of 3.5 from CPU use but might perform badly with a load average of 5.
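The rule of thumb above can be written down as a couple of lines of Python. This is only a sketch of the comparison, and it assumes you have already worked out how much of the load average comes from CPU use and how many real cores the machine has; the function names are invented for illustration.

```python
def cpu_load_per_core(cpu_load, cores):
    """Load from CPU use divided by the number of CPU cores (not hyper-threads)."""
    return cpu_load / cores


def likely_snappy(cpu_load, cores):
    """Rule of thumb: the CPU component of the load should be below the core count."""
    return cpu_load < cores
```

For the system with two dual-core CPUs, likely_snappy(3.5, 4) is true and likely_snappy(5, 4) is false. Note that Python's os.cpu_count() reports logical CPUs, so on a machine with hyper-threading it will overstate the number of cores for this purpose.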

The component of the load average that is due to disk IO is much more difficult to interpret in a sensible manner. A common situation is to have the load average increased by an NFS server with a network problem. A user accesses a file on the NFS server and gets no response (thus giving a load average of 1). They then open another session and use “ls” to inspect the state of the file – ls is blocked and gives a system load average of 2. A single user may launch 5 or more processes before they realise that they are not going to succeed. If there are 20 active users on a multi-user system then a load average of 100 from a single NFS server that has a network problem is not uncommon. While this is happening the system will perform very well for all tasks that don’t involve the NFS server; the processes that are blocked on disk IO can be paged out so they don’t use any RAM or CPU time.

For regular disk IO you can have load average incremented by 1 for each non-RAID disk without any significant performance problems. For example if you have two users who each have a separate disk for their home directory (not uncommon with certain systems where performance is required and cooperation between users is low) then each could have a single process performing disk IO at maximum speed with no performance problems for the entire system. A system which has four CPU cores and two hard drives used for separate tasks could have a load average slightly below 6 and the performance for all operations would be quite good if there were four processes performing CPU intensive tasks and two processes doing disk intensive tasks on different disks. The same system with six CPU intensive programs would under-perform (each process would on average get 2/3 of a CPU), and if it had six disk intensive tasks that all use the same disk then performance would be terrible (especially if one of the six was an interactive task).

The fact that a single load average number can mean that the system is busy but performing well, under a bit of load, or totally overloaded means that the load average number is of limited utility in diagnosing performance problems. It is useful as a quick measure: if your server usually has a load average of 0.5 and it suddenly gets a load average of 10 then you know that something is wrong. The typical procedure for diagnosing it then starts with either running “ps aux|grep D” (to get a list of D state processes – processes that are blocked on disk IO; check the STAT column, as grep will also match any other line that contains a capital D) or running top to see the percentages of CPU time idle and in IO-wait states.
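If you want a list of D state processes without the false matches that grep can produce, one option is to filter on the state column itself. The sketch below is a rough Python equivalent. It assumes a ps that accepts “-eo pid,stat,comm” (the procps version on Linux does), and the function names are my own.

```python
import subprocess


def d_state_processes(ps_output):
    """Return (pid, command) pairs for rows whose state column starts with 'D'."""
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, state, rest of the line
        if len(parts) == 3 and parts[1].startswith("D"):
            blocked.append((int(parts[0]), parts[2].strip()))
    return blocked


def list_blocked():
    """Run ps and return the processes currently in uninterruptible sleep."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return d_state_processes(result.stdout)
```

Calling list_blocked() a few times in a row is more informative than a single snapshot, as a healthy system will often have a process briefly in D state.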

Cpu(s): 15.0%us, 35.1%sy,  0.0%ni, 49.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

7331 rjc      25  0  2868  640  312 R  100  0.0  0:21.57 gzip

Above is a section of the output of top showing a system running gzip -9 < /dev/urandom > /dev/null. Gzip is using one CPU core (100% CPU means 100% of one core – a multi-threaded program can use more than one core and therefore more than 100% CPU) and the overall system statistics indicate 49.9% idle (the other core is almost entirely idle).

Cpu(s):  1.3%us,  3.2%sy,  0.0%ni, 50.7%id, 44.4%wa,  0.0%hi,  0.3%si,  0.0%st

7425 rjc      17  0  4036  872  588 R    4  0.1  0:00.20 find

Above is a section of the output of top showing the same system running find /. The system is registering 44% IO wait and 50.7% idle. The IO wait is the percentage of time that a CPU core is idle while waiting on IO, so 44% of the total system CPU time (or 88% of one CPU core) is idle while the system is waiting for disk IO to complete. A common mistake is to think that if the IO was faster then more CPU time would be used. In this case the find program is using 4% of one CPU core, so if all the IO was instantaneous (e.g. in cache) then the command would complete 25 times faster with 100% CPU use. But if the disk IO performance was doubled (a realistic possibility given that the system has a pair of cheap SATA disks in a RAID-1) then find would probably use 8% of CPU time.
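The arithmetic behind the 25 times and 8% figures can be checked with a short Python sketch. It assumes a strictly serial model in which the process is either using the CPU or waiting for IO and never both, which is a simplification; the function name is made up for this example.

```python
def after_io_speedup(cpu_fraction, io_speedup):
    """Estimate CPU use and overall speedup if IO becomes io_speedup times faster.

    cpu_fraction is the share of wall-clock time currently spent on the CPU
    (0.04 for the find example); the remainder is assumed to be IO wait.
    """
    cpu_time = cpu_fraction
    io_time = (1.0 - cpu_fraction) / io_speedup
    wall_time = cpu_time + io_time
    new_cpu_use = cpu_time / wall_time
    overall_speedup = 1.0 / wall_time
    return new_cpu_use, overall_speedup
```

With cpu_fraction=0.04, doubling the IO speed gives a CPU use of a little under 8% and less than a 2 times overall speedup, while passing float("inf") as the IO speedup gives 100% CPU use and a run that is 25 times faster.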

Really the only use for load average is for getting an instant feel for whether there are any performance problems related to CPU use or disk IO. If you know what the normal number is then a significant change will stand out.

Dr. Neil Gunther has written some interesting documentation on the topic [2], which goes into more technical detail including kernel algorithms used for calculating the load average. My aim in this post is to educate Unix operators as to the basics of the load average.

His book The Practical Performance Analyst gives some useful insights into the field. One thing I learned from his book is the basics of queueing theory. One important aspect of this is that as the rate at which work arrives approaches the rate at which work can be done the queue length starts to increase very steeply, and if work keeps arriving at the same rate when the queue is full and the system can’t perform the work fast enough the queue will grow without end. This means that as the load average approaches the theoretical maximum the probability of the system dramatically increasing its load average increases. A machine that’s bottlenecked on disk IO for a task where there is a huge number of independent clients (such as a large web server) may have its load average jump from 3 to 100 in a matter of one minute. Of course this won’t mean that you actually need to be able to serve 30 times the normal load, merely slightly more than the normal load to keep the queues short. I recommend reading the book; he explains it much better than I do.
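To get a feel for how quickly a queue grows as the arrival rate approaches the service rate, here is a sketch in Python. It uses the textbook single-server (M/M/1) result that the mean number of jobs in the system is u / (1 - u) for utilisation u. The choice of that model is my assumption for illustration and is not taken from the book.

```python
def mean_jobs_in_system(utilisation):
    """Mean number of jobs queued or in service for an M/M/1 queue.

    utilisation is the arrival rate divided by the service rate and must be
    below 1; at 1 or above the queue grows without end.
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in the range [0, 1)")
    return utilisation / (1.0 - utilisation)
```

At 50% utilisation the mean is 1 job, at 90% it is about 9, and at 99% it is about 99, so a small increase in work near the limit produces a very large increase in queue length.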

Update: Jon Oxer really liked this post.

5 comments to Load Average

  • Olaf van der Spek

    > But if the disk IO performance was doubled (a realistic possibility given that the system has a pair of cheap SATA disks in a RAID-1) then find would probably use 8% of CPU time.

    Doubling STR does still not (always) double performance. ;)

  • Olaf van der Spek

    BTW, why is there no simple disk usage metric? Like, this disk is busy 50% of the time.

  • etbe

    The disks in question are a year old and were not particularly fast when I bought them. Doubling the performance is quite possible.

    iostat does display what it considers to be the disk usage percentage, not sure how accurate it is – I know that the 0% and 100% values are right though. ;)

  • Peter Moulder

    Incidentally, the traditional description of “average over the last 1, 5 and 15 minutes” is really a piece of fiction, at least regarding Linux. They’re exponentially smoothed values, and they don’t really correspond to “1 minute” or whatever in any meaningful way that I know of; e.g. the smoothing values chosen don’t minimize the sum of (absolute or squared or cubed or ^1.1) differences between the exponentially smoothed value and the true 1 minute (etc.) moving average. The only meaningfulness that I know of for the expression used to calculate the smoothing value in Linux (viz. 1/exp((update interval)/{1,5,15}min)) is that at least the ratio 1:5:15 is meaningful, even if the absolute time durations aren’t. I’d be interested to hear if someone knows why that expression was chosen, or why it’s useful to use that expression rather than one that gives a value closer to the 1min moving average.

  • etbe

    Peter: Good point. I will add an item to my todo list to write more about this.
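The exponential smoothing that Peter describes can be sketched in Python as follows. This uses floating point rather than the fixed-point arithmetic a kernel would use, and the 5 second update interval is an assumption for illustration; the function names are made up.

```python
import math


def smoothing_factor(update_interval_s, window_s):
    """The 1/exp(update interval / window) value from the comment above."""
    return math.exp(-update_interval_s / window_s)


def update(load, active, factor):
    """Blend the previous load value with the current count of active tasks."""
    return load * factor + active * (1.0 - factor)


def simulate(active_counts, window_s, update_interval_s=5.0):
    """Apply the smoothing to a series of samples, starting from a load of 0."""
    factor = smoothing_factor(update_interval_s, window_s)
    load = 0.0
    for active in active_counts:
        load = update(load, active, factor)
    return load
```

For example, simulate([1] * 12, 60.0) models one busy process for a full minute starting from an idle system. The result is about 0.63 rather than 1, which shows that the “1 minute” figure is not a true one minute moving average.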