I’ve been having problems with one of my Xen virtual servers crashing with kernel error messages about out-of-memory (OOM) conditions. One thing I had been meaning to do is work out how to make a core dump of a Xen domain and then get data such as the process list from it. But tonight I caught the machine after the problem had started but before the kernel gave OOM messages, so I could log in and fix things.
I discovered the following issues:
- 10+ instances of spf-policy.pl (a Postfix program to check the Sender Policy Framework data for a message that is being received), most in D state.
- All my Planet news feeds being updated simultaneously (four of them at 20M each take a bite out of the 256M of a virtual machine).
- Over 100 Apache processes running in D state.
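A count like the ones above can be gathered by reading the state field from `/proc/<pid>/stat` for every process. The following is a minimal sketch of that idea, not the commands I actually ran; the function names `parse_stat_line` and `count_by_state` are made up for illustration.

```python
import os
from collections import Counter


def parse_stat_line(line):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' instead of whitespace.
    left, right = line.rsplit(")", 1)
    command = left.split("(", 1)[1]
    state = right.split()[0]
    return command, state


def count_by_state(proc_root="/proc", wanted="D"):
    """Count processes in the given state, grouped by command name."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                command, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking at it
        if state == wanted:
            counts[command] += 1
    return counts


if __name__ == "__main__":
    # Print the commands with the most processes in D state first.
    for command, number in count_by_state().most_common():
        print(f"{number:5d} {command}")
```

D is the state of uninterruptible sleep, which usually means waiting on disk I/O, so a long list here is a good hint that the machine is paging heavily.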
I think that there is a bug behind having so many instances of spf-policy.pl; I’ve also been seeing warning messages from Postfix about timeouts when running it.
For the Planet feeds I changed my cron jobs to space them out. Now, unless one job takes 40 minutes to run, there is no chance of them all running at the same time.
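The effect of staggering can be checked with a little arithmetic. This is a rough sketch only: the start offsets and run times in the usage note are hypothetical and are not my real crontab, and `max_concurrent` is a name invented for the example. It also ignores jobs that run past the end of the hour into the next cycle.

```python
def max_concurrent(jobs):
    """Return the peak number of overlapping jobs.

    jobs is a list of (start_minute, duration_minutes) pairs.
    """
    events = []
    for start, duration in jobs:
        events.append((start, 1))              # job begins
        events.append((start + duration, -1))  # job ends
    # Sorting tuples puts an end (-1) before a start (+1) at the same
    # minute, so back-to-back jobs are not counted as overlapping.
    events.sort()
    running = 0
    peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak
```

For example, with four jobs assumed to start at minutes 0, 10, 20 and 30, `max_concurrent([(0, 5), (10, 5), (20, 5), (30, 5)])` gives 1, while giving every job a 40 minute run time brings the peak back up to 4.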
For Apache I changed the maximum number of processes from 150 to 40 and changed the maximum number of requests that a child process may serve before it is replaced to 100 (it used to be a lot higher). If more than 40 requests come in at the same time then the excess ones will wait in the TCP connection backlog (of 511 entries) until a worker process is ready to service the request. While keeping the connections waiting is not always ideal, it’s better than killing the entire machine!
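The reasoning behind the lower limit is simple multiplication, sketched below. The 5M per worker is an assumed figure for illustration only (I did not measure it, and shared pages make the real number smaller), and the function names are made up. The backlog model is also simplified: the backlog size is a hint to the kernel, and what happens to connections beyond it depends on the kernel.

```python
def worst_case_mb(max_workers, per_worker_mb):
    """Memory needed if every worker process is running at once."""
    return max_workers * per_worker_mb


def queue_state(concurrent, max_workers, backlog=511):
    """Split concurrent connections into (served, waiting, overflow)."""
    served = min(concurrent, max_workers)
    waiting = min(concurrent - served, backlog)
    overflow = concurrent - served - waiting
    return served, waiting, overflow


if __name__ == "__main__":
    per_worker_mb = 5  # ASSUMPTION: hypothetical size of one Apache worker
    for limit in (150, 40):
        print(limit, "workers need up to",
              worst_case_mb(limit, per_worker_mb), "M")
    print(queue_state(100, 40))
```

Under that assumption 150 workers could want 750M, far more than the 256M the virtual machine has, while 40 workers stay at 200M and the remaining connections simply wait their turn.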
Finally I installed my memlockd program so that next time I have paging problems the process of logging in and fixing them will be a little faster. Memlockd locks a specified set of files into RAM so that they won’t be paged out when memory runs low. This can make a dramatic difference to the time taken to log in to a system that is paging heavily. It can also run ldd on executables to discover the shared objects that they need so that it can lock them too.
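To illustrate the ldd step, here is a small sketch of how the shared object paths can be picked out of ldd output. This is not memlockd’s own code, `shared_objects` is a name invented for the example, and the actual locking of the files into RAM is left out.

```python
def shared_objects(ldd_output):
    """Return the absolute paths of shared objects listed by ldd."""
    paths = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if "=>" in fields:
            # Typical form: "libc.so.6 => /lib/.../libc.so.6 (0x...)"
            index = fields.index("=>")
            target = fields[index + 1] if index + 1 < len(fields) else ""
            if target.startswith("/"):  # skips "not found" entries
                paths.append(target)
        elif fields and fields[0].startswith("/"):
            # The dynamic linker is listed by its full path with no "=>".
            paths.append(fields[0])
    return paths
```

The text to parse could be captured with something like `subprocess.run(["ldd", "/bin/bash"], capture_output=True, text=True).stdout`; entries without a file on disk, such as linux-vdso.so.1, are skipped because there is nothing to lock.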