Is Pre-Forking any Good?

Many Unix daemons use a technique known as “pre-forking”. To save the time taken to fork a child process for each request, they keep a pool of processes waiting for work to come in. When a job arrives one of the existing processes handles it, and the overhead of the fork() system call is saved. I decided to write a little benchmark to see how much overhead a fork() really has. The program below (which is released under the GPL 3.0 license) measures the performance of a fork() operation followed by a waitpid() operation, in fork()s per second, and also the performance of running a trivial program via system(), which uses /bin/sh to execute the given command.
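
For readers who haven’t seen the pattern, here is a minimal sketch of what a pre-forking server looks like (the port, pool size, and trivial handler are all hypothetical, just to show the shape of the technique): the workers are all forked before any request arrives and then block in accept(), so nothing is forked on the request path.

/* minimal pre-forking sketch: the master opens a listening socket,
 * forks a fixed pool of workers, and each worker blocks in accept().
 * Error handling is minimal and the numbers are arbitrary. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4

static void worker(int listen_fd)
{
  while(1)
  {
    int conn = accept(listen_fd, NULL, NULL);
    if(conn == -1)
      continue;
    const char msg[] = "hello\n"; /* trivial stand-in for request handling */
    write(conn, msg, sizeof(msg) - 1);
    close(conn);
  }
}

int main()
{
  int i;
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080); /* arbitrary port for the example */
  if(listen_fd == -1 || bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 || listen(listen_fd, 128) == -1)
  {
    perror("socket/bind/listen");
    return 1;
  }
  for(i = 0; i < NUM_WORKERS; i++) /* fork the whole pool up front */
  {
    if(fork() == 0)
      worker(listen_fd); /* children inherit the listening socket */
  }
  while(wait(NULL) > 0) /* the master just reaps the children */
    ;
  return 0;
}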

On my Thinkpad T61 with an Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz I could get 2429.85 forks per second when running Linux 2.6.32 in 64bit mode. On a Thinkpad T20 with a 500MHz P3 CPU I could get 341.74 forks per second. In both cases it seems that the number of forks per second is significantly greater than the number of real-world requests per second that such a system would handle. If each request on average took one disk seek then fork performance would not be a bottleneck on either system. Also if each request took more than a couple of milliseconds of CPU time on the T7500, or 10ms of CPU time on the 500MHz P3, then the benefits of pre-forking would be very small. Finally it’s worth noting that the overhead of fork() + waitpid() in a loop will not be the same as the overhead of just fork()ing off processes and calling waitpid() when there’s nothing else to do.

I had a brief look at some of my servers to see how many operations they perform. One busy front-end mail server has about 3,000,000 log entries in mail.log per day, which is about 35 per second. These log entries include calling SpamAssassin and ClamAV, which are fairly heavy operations. The system in question averages one Intel(R) Xeon(R) CPU L5420 @ 2.50GHz core being used 24*7. I can’t do a good benchmark run on that system as it’s always busy, but I think it’s reasonable to assume for the sake of discussion that it’s about the same speed as the T7500 (it may be 5× faster, but that won’t change things much). At 2429 forks per second (about 0.4ms per fork/wait), even reducing that time to zero would make no noticeable difference to a system whose average operation takes 1000/35 ≈ 28ms!

Now if a daemon were to use fork() + system() to launch a child process (which is a really slow way of doing it) then the T7500 gets 248.51 fork()+system() operations per second with bash and 305.63 per second with dash. The P3-500 gets 24.48 with bash and 33.06 with dash.

So it seems that if every log entry on my busy mail server involved a fork()+system() operation, and that was replaced with pre-forked daemons, then it might be possible to save almost 10% of the CPU time on the system in question.

Now it is theoretically possible that the setup of a daemon process can take more CPU time than fork()+system(). For example, a daemon could have some really complex data structures to initialise. If the structures in question were initialised in the same way for each request then a viable design would be to have the master process initialise all the data, which would then be inherited by the children. The only way I can imagine for a daemon child process to take any significant amount of time on modern hardware is for it to generate a session encryption key, and there’s really nothing stopping a single master process from generating several such keys in advance and then passing them to child processes as needed.
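
As a sketch of that design (init_table() and the table size here are hypothetical stand-ins for whatever complex setup a real daemon would do), the master does the expensive initialisation once before forking, and the children inherit the result as copy-on-write pages:

/* sketch: expensive setup done once in the master, inherited by children */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define TABLE_SIZE (1024 * 1024)

static double *table; /* initialised once, shared copy-on-write after fork() */

static void init_table(void)
{
  size_t i;
  table = malloc(TABLE_SIZE * sizeof(double));
  if(table == NULL)
    exit(1);
  for(i = 0; i < TABLE_SIZE; i++)
    table[i] = (double)i / 3.0;
}

int main()
{
  int i;
  init_table(); /* done once, before any fork() */
  for(i = 0; i < 4; i++)
  {
    if(fork() == 0)
    {
      /* the child only reads the table, so none of its pages are copied */
      double sum = 0;
      size_t j;
      for(j = 0; j < TABLE_SIZE; j++)
        sum += table[j];
      _exit(sum > 0 ? 0 : 1);
    }
  }
  while(wait(NULL) > 0)
    ;
  return 0;
}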

In conclusion I think that the meme about pre-forking is based on hardware from a time when a 500MHz 32-bit system (like my ancient Thinkpad T20) was unimaginably fast and when operating systems were less efficient than a modern Linux kernel. The only corner case might be daemons which do relatively simple CPU-bound operations – such as serving static files from a web server where the data all fits into the system cache – but even then I expect that the benefit is a lot smaller than most people think and the number of pre-forked processes is probably best kept very low.

One final thing to note is that if you compare fork()+exec() with an operation to instruct a running daemon (via Unix domain sockets perhaps) to provide access to a new child (which may be pre-forked or may be forked on demand) then you have the potential to save a moderate amount of CPU time. The initialisation of a new process has some overhead that is greater than calling fork(). Also when you fork() a new process there are usually lots of data structures which are never written after that time, which means that on Linux they remain as shared memory, reducing system memory use and improving cache efficiency when they are read.
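
For illustration, passing an open file descriptor to an already running daemon is done over a Unix domain socket with an SCM_RIGHTS control message. This sketch shows just the sending side; send_fd() is a name I made up, and error handling is minimal:

/* sketch: hand an open fd to another process over a Unix domain socket.
 * The receiver gets a working duplicate of the descriptor. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
  struct msghdr msg;
  struct iovec iov;
  struct cmsghdr *cmsg;
  char byte = 0;
  char buf[CMSG_SPACE(sizeof(int))];

  memset(&msg, 0, sizeof(msg));
  memset(buf, 0, sizeof(buf));
  iov.iov_base = &byte; /* at least one byte of normal data must be sent */
  iov.iov_len = 1;
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = buf;
  msg.msg_controllen = sizeof(buf);

  cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS; /* tells the kernel to copy the fd across */
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}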

#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>

#define NUM_FORKS 10000  /* iterations for the fork()+waitpid() test */
#define NUM_SHELLS 1000  /* iterations for the fork()+system() test */

int main()
{
  struct timeval start, end;
  if(gettimeofday(&start, NULL) == -1)
  {
    fprintf(stderr, "Can't get time of day\n");
    return 1;
  }

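  /* time fork() with the child exiting immediately, so this measures
     process creation and reaping alone */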
  int i = 0;
  while(i < NUM_FORKS)
  {
    pid_t pid = fork();
    if(pid == 0)
      return 0;
    if(pid > 0)
    {
      int status;
      pid_t rc = waitpid(-1, &status, 0);
      if(rc != pid)
      {
        fprintf(stderr, "waidpid() failed\n");
        return 1;
      }
    }
    else
    {
      fprintf(stderr, "fork() failed\n");
      return 1;
    }
    i++;
  }

  if(gettimeofday(&end, NULL) == -1)
  {
    fprintf(stderr, "Can't get time of day\n");
    return 1;
  }

  printf("%.2f fork()s per second\n", double(NUM_FORKS)/(double(end.tv_sec – start.tv_sec) + double(end.tv_usec – start.tv_usec) / 1000000.0) );

  if(gettimeofday(&start, NULL) == -1)
  {
    fprintf(stderr, "Can't get time of day\n");
    return 1;
  }

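  /* time fork() with the child running a trivial command via system(),
     which adds the cost of starting /bin/sh for each iteration */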
  i = 0;
  while(i < NUM_SHELLS)
  {
    pid_t pid = fork();
    if(pid == 0)
    {
      if(system("id > /dev/null") == -1)
        fprintf(stderr, "system() failed\n");
      return 0;
    }
    if(pid > 0)
    {
      int status;
      pid_t rc = waitpid(-1, &status, 0);
      if(rc != pid)
      {
        fprintf(stderr, "waidpid() failed\n");
        return 1;
      }
    }
    else
    {
      fprintf(stderr, "fork() failed\n");
      return 1;
    }
    i++;
  }

  if(gettimeofday(&end, NULL) == -1)
  {
    fprintf(stderr, "Can't get time of day\n");
    return 1;
  }

  printf("%.2f fork() and system() calls per second\n", double(NUM_SHELLS)/(double(end.tv_sec – start.tv_sec) + double(end.tv_usec – start.tv_usec) / 1000000.0) );
  return 0;
}

6 comments to Is Pre-Forking any Good?

  • Daniel

    Also, if the job being delegated is at all interesting in its structure, preforking requires either marshalling the data over the process boundary, or setting up shared memory segments, both of which can be tricky and error-prone, and neither of which is free. A straight fork() might end up being just as fast in that case, and of course it wins in terms of simplicity.

  • On one hand I think the meme is even older: it dates from when ancient BSD didn’t support copy-on-write, so fork() meant immediately copying all of the process’s memory.

    On the other hand, since Linux does implement copy-on-write, doing fork() and immediately exiting the child will be significantly cheaper than actually doing some work, as then some memory will get copied.

  • Anonymous

    As you found out, pre-forking doesn’t necessarily improve bandwidth, in terms of the number of requests handled per second. However, it *can* make a difference to latency. I can send a request to an existing child process (or better yet, have the child process receive the request directly) a lot faster than I can fork a new one, and that time directly translates into reduced per-request latency. And when you want to push your response times down to milliseconds or less, spending hundreds of microseconds in fork() *hurts*.

  • Matthew W. S. Bell

    There are issues of scheduler latency here too; I’m not sure if they’re significant.

  • etbe

    Daniel: Good point about the IPC overhead. Although the most common case of pre-forking is for web servers which use Unix domain sockets to transfer an open file handle, which one would hope to be rather quick.

    Jan: It’s true that exiting without writing to any memory avoids copying writable pages and changing the memory mapping (which is expensive). But it still won’t compare to anything that ever hits the disk or talks to a database server.

    Anon: True, the latency will be improved, but you have to look at the big picture. In the case of something that runs SpamAssassin (which was the actual example that inspired this post) it won’t make any difference that you will be able to measure, as SA uses a lot of CPU time and does DNS lookups. For web servers, if people want to have millisecond response times then they won’t be able to use much PHP (if any) and they won’t be able to do database lookups. That probably rules out most Apache configurations.

    Matthew: Yes, also when you change such things you change the way the cache works in significant ways and lots of other things. Writing a reasonable representation of such things in a synthetic benchmark would be really hard. That’s why I decided to just simulate one aspect for a simple example that proves the extreme cases.

  • There’s another aspect to fork/exec performance, and that is virtualization. In my tests Xen incurs around a 50% performance hit, higher when multiple VCPUs are assigned to a domain, and KVM possibly (much) more.
    So yes, on a bare-metal machine it might not matter anymore, but on a contended VM it might still be something to watch for.
    regards,
    iustin