
Time Zones and Remote Servers

It's widely regarded as best practice to set the time zone of a server to UTC if people are going to be doing sys-admin work on it from various countries. I'm currently running some RHEL4 servers that are set to Los Angeles time, so I have to convert from Melbourne time to UTC and then from UTC to LA time when tracking down log entries. This isn't really difficult for recent times (within the last few minutes) as my KDE clock applet allows me to select various time zones to display on a pop-up. For other times I can use the GNU date command to convert from another time zone to the local zone of the machine, for example the command date -d "2008-08-06 10:57 +1000" will display the given Melbourne time (Melbourne is currently in the +1000 zone) converted to the local time zone. But it is still painful.
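
For example, to convert a Melbourne timestamp to both the local zone and UTC (the -u option makes date display its result in UTC):

$ date -d "2008-08-06 10:57 +1000"
$ date -u -d "2008-08-06 10:57 +1000"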

In RHEL5, CentOS 5, and apparently all versions of Fedora newer than Fedora Core 4 (including Fedora Core 4 updates) the command system-config-date allows you to select Etc/GMT as the time zone to get GMT. For reference, selecting London is not a good option, particularly at the moment as it's daylight saving time there.

For RHEL4 and CentOS 4 the solution is to edit /etc/sysconfig/clock and change the first line to ZONE="Etc/GMT" (the quotes are important), and then run the command ln -sf /usr/share/zoneinfo/Etc/GMT /etc/localtime. Thanks to the Red Hat support guy who found the solution to this, it took a while but it worked in the end! Hopefully this blog post will allow others to fix this without needing to call Red Hat.
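
To do the same thing non-interactively (a sketch, run as root):

# sed -i 's/^ZONE=.*/ZONE="Etc\/GMT"/' /etc/sysconfig/clock
# ln -sf /usr/share/zoneinfo/Etc/GMT /etc/localtime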

In Debian the command tzconfig allows you to select 12 (none of the above) and then GMT or UTC to set the zone. This works in Etch; I'm not sure about earlier versions (tzconfig always worked but I never tried setting UTC). In Lenny the tzconfig command seems to have disappeared; to configure the time zone you now use the command dpkg-reconfigure tzdata, which has an option of Etc at the end of the list.

Updated to describe how to do this in Lenny, thanks for the comments.


The Date Command and Seconds Since 1970-01-01

The man page for the date command says that the %s option will give "seconds since 1970-01-01 00:00:00 UTC". I had expected that everything that date did would give output in my time zone unless I requested otherwise. But it seems that in this case the result is in UTC, and the same seems to be true for most programs that log dates as a number of seconds.

In a quick Google search for how to use a shell script to convert from the number of seconds since 1970-01-01 00:00:00 UTC to a more human readable format, the only remotely useful result I found was the command date -d "1970-01-01 1212642879 sec", which in my time zone gives a result that is wrong by ten hours (at the moment – it would be eleven hours in summer). The correct method is date -d "1970-01-01 1212642879 sec utc"; you can verify this with the command date -d "1970-01-01 $(date +%s) sec utc" (which should give the same result as date).
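
So to convert an epoch value from a shell script, and to verify that the round trip matches the current time:

$ date -d "1970-01-01 1212642879 sec utc"
$ date -d "1970-01-01 $(date +%s) sec utc"
$ date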


Resizing the Root Filesystem

Uwe Hermann has described how to resize a root filesystem after booting from a live-cd or recovery disk [1]. He makes some good points about resizing an LVM PV (which I hadn’t even realised was possible).

The following paragraph is outdated, see the update at the end:
Incidentally it should be noted that if your root filesystem is an LVM logical volume then it can’t be resized without booting from a different device because the way LVM appears to work is that the LV in question is locked, then files under /etc/lvm/ are written, and then the LV is unlocked. If the LV in question contains /etc/lvm then you deadlock your root filesystem and need to press reset. Of course if your root filesystem is on an LV which has been encrypted via cryptsetup (the LV is encrypted not the PV) then a live resize of the root filesystem can work as locking the LV merely means that write-backs from the encryption layer don’t get committed. I’m not sure if this means that data written to an encrypted device is less stable (testing this is on my todo list).

If your root filesystem is on a partition of a hard drive (such as /dev/hda2) then it is possible to extend it without booting from different media. There is nothing stopping you from running fdisk, deleting the partition of your root filesystem, and recreating it with the same starting point and a larger size. When you exit fdisk it will call an ioctl() to re-read the partition table; the kernel code counts the number of open file handles related to the device, and if the number is greater than 1 (fdisk has one open handle) then it refuses to re-read the table. So you can use fdisk to change the root partition and then reboot to have the change take effect. After that ext2online can be used to take advantage of the extra space (if the filesystem has a recent enough version of the ext3 disk format).
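
As a sketch of the procedure (assuming the root filesystem is /dev/hda2 with free space immediately after it – try this on a test machine before doing it on anything that matters):

# fdisk /dev/hda      <- delete hda2, recreate it with the same start and a larger size
# reboot              <- the kernel reads the new partition table at boot
# ext2online /dev/hda2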

One thing he didn't mention is that if you do need to boot from another device to manipulate your root filesystem (which should be quite rare if you are bold and know the tricks) then you can always use your swap space. To do this you simply run swapoff and then run mkfs on the device that had been used for swap. Incidentally there is nothing really special about the swap space, but it often tends to be used for such recovery operations simply because it has no data that persists across a reboot.
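
For example, if /dev/hda3 is the swap partition then something like the following gives you a scratch filesystem (a sketch – any data in swap is lost, which doesn't matter across a reboot anyway):

# swapoff /dev/hda3
# mke2fs /dev/hda3
# mount /dev/hda3 /mnt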

The minimum set of files that need to be copied to a temporary filesystem is usually /bin, /sbin, /dev, /lib (excluding all directories under /lib/modules apart from the one related to the kernel you are using), and /etc. Also you should make directories /proc, /sys, and /selinux (if the machine in question runs SE Linux). The aim is not to copy enough files for the machine to run in a regular manner, merely enough to allow manipulating all filesystems and logical volumes. Often for such recovery I boot with init=/bin/bash as a kernel parameter to skip the regular init and just start with a shell. Note that when you use init=/bin/bash you end up with a shell that has no job control and ^C is not enabled; if you want to run a command that might not terminate of its own accord then the command "openvt /bin/bash" can be used to start another session with reasonable terminal settings.
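
A sketch of copying that minimal set to a scratch filesystem mounted on /mnt (the mount point is an example):

# cp -a /bin /sbin /dev /etc /mnt/
# mkdir -p /mnt/lib/modules /mnt/proc /mnt/sys /mnt/selinux
# cp -a $(ls -d /lib/* | grep -v '^/lib/modules$') /mnt/lib/
# cp -a /lib/modules/$(uname -r) /mnt/lib/modules/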

I recommend that anyone who wants to work as a sys-admin experiment with such procedures on a test machine. There are lots of interesting things that you can learn and interesting ways that you can break your system when performing such operations.

Update: Wouter points out that the LVM bug of deadlocking the root filesystem has been fixed and that you can also use ext2online to resize the mounted filesystem [2].


Redirecting Output from a Running Process

Someone asked on a mailing list how to redirect output from a running process. They had a program which had been running for a long period of time without having stdout redirected to a file. They wanted to log out (to move the laptop that was used for the ssh session) but not kill the process (or lose output).

Most responses were of the form "you should have used screen or nohup", which is all very well if you had planned to log out and leave it running (or even planned to have it run for a long time).

Fortunately it is quite possible to redirect the output of a running process. I will use cat as a trivial example but the same technique will work for most programs that do simple IO (programs that do terminal IO may be more tricky – but you could always redirect from the tty device of an ssh session to the tty device of a screen session).

Firstly I run the command “cat > foo1” in one session and test that data from stdin is copied to the file. Then in another session I redirect the output:

Start by finding the PID of the process:
$ ps aux|grep cat
rjc 6760 0.0 0.0 1580 376 pts/5 S+ 15:31 0:00 cat

Now check the file handles it has open:
$ ls -l /proc/6760/fd
total 3
lrwx------ 1 rjc rjc 64 Feb 27 15:32 0 -> /dev/pts/5
l-wx------ 1 rjc rjc 64 Feb 27 15:32 1 -> /tmp/foo1
lrwx------ 1 rjc rjc 64 Feb 27 15:32 2 -> /dev/pts/5

Now run GDB:
$ gdb -p 6760 /bin/cat
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc
[lots more license stuff snipped]
Attaching to program: /bin/cat, process 6760
[snip other stuff that’s not interesting now]
(gdb) p close(1)
$1 = 0
(gdb) p creat("/tmp/foo3", 0600)
$2 = 1
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /bin/cat, process 6760

The "p" command in GDB will print the value of an expression, and an expression can include a function call – including a system call. So I execute a close() system call and pass file handle 1, then I execute a creat() system call to open a new file. The result of the creat() was 1, which means that it replaced the previous file handle. If I wanted to use the same file for stdout and stderr, or if I wanted to replace a file handle with some other number, then I would need to call the dup2() system call to achieve that result.

For this example I chose to use creat() instead of open() because it takes fewer parameters. The C macros for the open() flags are not usable from GDB (it doesn't use C headers) so I would have to read header files to discover their values – it's not that hard to do but would take more time. Note that 0600 is the octal permission for the owner having read/write access and the group and others having no access. It would also work to use 0 for that parameter and run chmod on the file later on.

After that I verify the result:
$ ls -l /proc/6760/fd/
total 3
lrwx------ 1 rjc rjc 64 2008-02-27 15:32 0 -> /dev/pts/5
l-wx------ 1 rjc rjc 64 2008-02-27 15:32 1 -> /tmp/foo3 <====
lrwx------ 1 rjc rjc 64 2008-02-27 15:32 2 -> /dev/pts/5

Typing more data into cat results in the file /tmp/foo3 being appended to.
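
As an example of the dup2() case mentioned above, pointing stderr at the same file as stdout from the same GDB session would look like this (a sketch – dup2() closes the old file handle 2 itself):

(gdb) p dup2(1, 2)
$3 = 2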

Update: If you want to close the original session you need to close all file handles for it, open a new device that can be the controlling tty, and then call setsid().


Load Average

Other Unix systems apparently calculate the load average differently to Linux. According to the Wikipedia page about Load (computing) [1] most Unix systems calculate it based on the average number of processes that are using a CPU or available for scheduling on a CPU, while Linux also includes the count of processes that are blocked on disk IO (uninterruptible sleep).

There are three load average numbers, the first is for the past minute, the second is for the past 5 minutes, and the third is for the past 15 minutes. In most cases you will only be interested in the first number.

What is a good load average depends on the hardware. For a system with a single CPU core a load average of 1 or greater from CPU use will indicate that some processes may perform badly due to lack of CPU time – although a long-running background process with a high “nice” value can increase the load average without interfering with system performance in most cases. As a general rule if you want snappy performance then the load average component from CPU use should be less than the number of CPU cores (not hyper-threads). For example a system with two dual-core CPUs can be expected to perform really well with a load average of 3.5 from CPU use but might perform badly with a load average of 5.

The component of the load average that is due to disk IO is much more difficult to interpret in a sensible manner. A common situation is to have the load average increased by a NFS server with a network problem. A user accesses a file on the NFS server and gets no response (thus giving a load average of 1), they then open another session and use “ls” to inspect the state of the file – ls is blocked and gives a system load average of 2. A single user may launch 5 or more processes before they realise that they are not going to succeed. If there are 20 active users on a multi-user system then a load average of 100 from a single NFS server that has a network problem is not uncommon. While this is happening the system will perform very well for all tasks that don’t involve the NFS server, the processes that are blocked on disk IO can be paged out so they don’t use any RAM or CPU time.

For regular disk IO you can have load average incremented by 1 for each non-RAID disk without any significant performance problems. For example if you have two users who each have a separate disk for their home directory (not uncommon with certain systems where performance is required and cooperation between users is low) then each could have a single process performing disk IO at maximum speed with no performance problems for the entire system. A system which has four CPU cores and two hard drives used for separate tasks could have a load average slightly below 6 and the performance for all operations would be quite good if there were four processes performing CPU intensive tasks and two processes doing disk intensive tasks on different disks. The same system with six CPU intensive programs would under-perform (each process would on average get 2/3 of a CPU), and if it had six disk intensive tasks that all use the same disk then performance would be terrible (especially if one of the six was an interactive task).

The fact that a single load average number can either mean that the system is busy but performing well, under a bit of load, or totally overloaded means that the load average number is of limited utility in diagnosing performance problems. It is useful as a quick measure, if your server usually has a load average of 0.5 and it suddenly gets a load average of 10 then you know that something is wrong. Then the typical procedure for diagnosing it starts with either running “ps aux|grep D” (to get a list of D state processes – processes that are blocked on disk IO) or running top to see the percentages of CPU time idle and in IO-wait states.
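
Note that grep D will also match a "D" anywhere in the ps output; a slightly more precise way to list the processes in uninterruptible sleep is:

$ ps -eo pid,stat,comm | awk '$2 ~ /^D/'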

Cpu(s): 15.0%us, 35.1%sy,  0.0%ni, 49.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7331 rjc      25  0  2868  640  312 R  100  0.0  0:21.57 gzip

Above is a section of the output of top showing a system running gzip -9 < /dev/urandom > /dev/null. Gzip is using one CPU core (100% CPU means 100% of one core – a multi-threaded program can use more than one core and therefore more than 100% CPU) and the overall system statistics indicate 49.9% idle (the other core is almost entirely idle).

Cpu(s):  1.3%us,  3.2%sy,  0.0%ni, 50.7%id, 44.4%wa,  0.0%hi,  0.3%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7425 rjc      17  0  4036  872  588 R    4  0.1  0:00.20 find

Above is a section of the output of top showing the same system running find /. The system is registering 44.4% IO wait and 50.7% idle. The IO wait is the percentage of time that a CPU core spends waiting on IO, so 44% of the total system CPU time (or 88% of one CPU core) is idle while the system is waiting for disk IO to complete. A common mistake is to think that if the IO was faster then more CPU time would be used. In this case, with the find program using 4% of one CPU core, if all the IO was instantaneous (e.g. satisfied from cache) then the command would complete 25 times faster at 100% CPU use. But if the disk IO performance was merely doubled (a realistic possibility given that the system has a pair of cheap SATA disks in a RAID-1) then find would probably use 8% of CPU time.

Really the only use for load average is for getting an instant feel for whether there are any performance problems related to CPU use or disk IO. If you know what the normal number is then a significant change will stand out.

Dr. Neil Gunther has written some interesting documentation on the topic [2], which goes into more technical detail including kernel algorithms used for calculating the load average. My aim in this post is to educate Unix operators as to the basics of the load average.

His book The Practical Performance Analyst gives some useful insights into the field. One thing I learned from his book is the basics of queueing theory. One important aspect of this is that as the rate at which work arrives approaches the rate at which work can be done, the queue length starts to increase dramatically, and if work keeps arriving faster than the system can perform it then the queue will grow without end. This means that as the load average approaches the theoretical maximum, the probability of the system dramatically increasing its load average increases. A machine that's bottlenecked on disk IO for a task with a huge number of independent clients (such as a large web server) may have its load average jump from 3 to 100 in a matter of one minute. Of course this won't mean that you actually need to be able to serve 30 times the normal load, merely slightly more than the normal load to keep the queues short. I recommend reading the book, he explains it much better than I do.
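
To give a feel for how fast queues grow near saturation, consider the simplest queueing model (M/M/1), where with utilisation u the average number of jobs in the system is u/(1-u):

u = 0.50 -> 1 job waiting or in service
u = 0.90 -> 9 jobs
u = 0.99 -> 99 jobs

The exact numbers depend on the model, but the shape of that curve is why a queue can blow out so quickly once a system approaches its capacity.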

Update: Jon Oxer really liked this post.


Swap Space

There is a wide-spread myth that swap space should be twice the size of RAM. This might have provided some benefit when 16M of RAM was a lot and disks had average access times of 20ms. Now disks can have average access times less than 10ms but RAM has increased to 1G for small machines and 8G or more for large machines. Multiplying the seek performance of disks by a factor of two to five while increasing the amount of data stored by a factor of close to 1000 is obviously not going to work well for performance.

A Linux machine with 16M of RAM and 32M of swap MIGHT work acceptably for some applications (although when I was running Linux machines with 16M of RAM I found that if swap use exceeded about 16M then the machine became so slow that a reboot was often needed). But a Linux machine with 8G of RAM and 16G of swap is almost certain to be unusable long before the swap space is exhausted. Therefore giving the machine less swap space and having processes be killed (or malloc() calls fail – depending on the configuration and some other factors) is probably going to be a better situation.

There are factors that can alleviate the problems such as RAID controllers that implement write-back caching in hardware, but this only has a small impact on the performance requirements of paging. The 512M of cache RAM that you might find on a RAID controller won’t make that much impact on the IO requirements of 8G or 16G of swap.

I often make the swap space on a Linux machine equal the size of RAM (when RAM is less than 1G) and be half the size of RAM for RAM sizes from 2G to 4G. For machines with more than 4G of RAM I will probably stick to a maximum of 2G of swap. I am not convinced that any mass storage system that I have used can handle the load from more than 2G of swap space in active use.

The myths about swap space size date back to some old versions of Unix that allocated a page of disk space for every page of virtual memory. On such systems having swap space less than or equal to the size of RAM was impossible, and having swap space less than twice the size of RAM was probably a waste of effort (see this reference [1]). However Linux has never worked this way; in Linux the virtual memory size is the size of RAM plus the size of the swap space. So while the "double the size of RAM" rule of thumb gave virtual memory twice the size of physical RAM on some older versions of Unix, it gave three times the size of RAM on Linux! Also swap spaces smaller than RAM have always worked well on Linux (I once ran a Linux machine with 8M of RAM and used a floppy disk as a swap device).

As far as I recall some time ago (I can’t remember how long) the Linux kernel would by default permit overcommitting of memory. For example if a program tried to malloc() 1G of memory on a machine that had 64M of RAM and 128M of swap then the system call would succeed. However if the program actually tried to use that memory then it would end up getting killed.

The current policy is that /proc/sys/vm/overcommit_memory determines what happens when memory is overcommitted, the default value 0 means that the kernel will estimate how much RAM and swap is available and reject memory allocation requests that exceed that value. A value of 1 means that all memory allocation requests will succeed (you could have dozens of processes each malloc 2G of RAM on a machine with 128M of RAM and 128M of swap). A value of 2 means that a different policy will be followed, incidentally my test results don’t match the documentation for value 2.
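
The policy can be inspected and changed at run time, for example (a sketch; use /etc/sysctl.conf to make a change permanent):

# cat /proc/sys/vm/overcommit_memory
0
# echo 1 > /proc/sys/vm/overcommit_memory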

Now if you run a machine with /proc/sys/vm/overcommit_memory set to 0 then you have an incentive to use a moderately large amount of swap, safe in the knowledge that many applications will allocate memory that they don’t use, so the fact that the machine would deliver unacceptably low performance if all the swap was used might not be a problem. In this case the ideal size for swap might be the amount that is usable (based on the storage speed) plus a percentage of the RAM size to cater for programs that allocate memory and never use it. By “moderately large” I mean something significantly less than twice the size of RAM for all machines less than 7 years old.

If you run a machine with /proc/sys/vm/overcommit_memory set to 1 then the requirements for swap space should decrease, but the potential for the kernel to run out of memory and kill some processes is increased (not that it’s impossible to have this happen when /proc/sys/vm/overcommit_memory is set to 0).

The debian-administration.org site has an article about a package to create a swap file at boot [2] with the aim of making it always be twice the size of RAM. I believe that this is a bad idea: the amount of swap which can be used with decent performance is a small fraction of the storage size on modern systems and often less than the size of RAM. Increasing the amount of RAM does nothing to increase swap performance, so scaling the swap space with RAM is not going to do any good.


Out Of Memory Errors and Apache

I’ve been having problems with one of my Xen virtual servers crashing with kernel error messages regarding OOM conditions. One thing I had been meaning to do is to determine how to make a core dump of a Xen domain and then get data such as the process list from it. But tonight I ended up catching the machine after the problem occurred but before the kernel gave OOM messages so I could log in to fix things.

I discovered the following issues:

  1. 10+ instances of spf-policy.pl (a Postfix program to check the Sender Policy Framework data for a message that is being received), most in D state.
  2. All my Planet news feeds being updated simultaneously (four of them taking 20M each takes a bite out of 256M for a virtual machine).
  3. Over 100 Apache processes running in D state.

I think that there is a bug related to having so many instances of spf-policy.pl running; I've also been seeing warning messages from Postfix about timeouts when running it.

For the Planet feeds I changed my cron jobs to space them out. Now unless one job takes 40 minutes to run there will be no chance of having them all run at the same time.
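
The crontab entries now look something like this (the update command name is hypothetical, the point is spreading the start times over 40 minutes):

0 * * * * update-planet feed-a
13 * * * * update-planet feed-b
26 * * * * update-planet feed-c
40 * * * * update-planet feed-d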

For Apache I changed the maximum number of processes from 150 to 40 and changed the maximum number of requests that a child process may service to 100 (it used to be a lot higher). If more than 40 requests come in at the same time then the excess ones will wait in the TCP connection backlog (of 511 entries) until a worker process is ready to service the request. While keeping the connections waiting is not always ideal, it's better than killing the entire machine!
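
For the prefork MPM the directives in question look like this (a sketch using the numbers above; ListenBacklog shows where the 511 comes from, it is the default):

MaxClients 40
MaxRequestsPerChild 100
ListenBacklog 511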

Finally I installed my memlockd program so that next time I have paging problems the process of logging in and fixing them will be a little faster. Memlockd locks a specified set of files into RAM so that they won’t be paged out when memory runs low. This can make a dramatic difference to the time taken to login to a system that is paging heavily. It also can run ldd on executables to discover the shared objects that they need so it can lock them too.
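
The configuration is a list of files to lock, one per line; from memory of the format (check the memlockd documentation for your version) a leading "+" marks an executable whose shared objects should be locked too:

+/bin/bash
+/bin/login
/etc/passwd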

Pointers in Bash Shell Scripts

# foo=bar
# name=foo
# echo ${!name}
bar

The above example shows bash indirect expansion: ${!name} expands to the value of the variable whose name is stored in name, which is how you make one variable reference the value of another.

# echo ${!f*}
foo fs_abc_opts fs_name

The above example shows how to get a list of the names of all variables beginning with "f" (NB the wildcard can only go at the end).
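
As a small sketch of how this can be used in a script, here per-host option variables are selected via indirect expansion (the variable and host names are made up):

# host=web1
# web1_opts="-p 8080"
# var=${host}_opts
# echo "starting with ${!var}"
starting with -p 8080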