Linux, politics, and other interesting things
It’s common practice when hosting email or web space for large numbers of users to group the accounts by first letter. One reason is that some filesystems perform poorly with large directories. Another is that the hard link count is often stored as a 16-bit signed integer, which makes it impossible to have more than 32767 subdirectories in one directory.
I’ve just looked at a system I run (the Bluebottle anti-spam email service), which has about half a million accounts, and counted the incidence of each first letter. It seems that S is the most common at almost 10%, and M and A aren’t far behind. Most of the clients have English as their first language; naturally the distribution of letters would be different for other languages.
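A count like this is easy to reproduce. The following is a minimal Python sketch, not the script I actually used; the function name prefix_shares and the sample names are made up for illustration, and it assumes the account names are available as a plain list of strings.

```python
from collections import Counter

def prefix_shares(names, length=1):
    """Return {prefix: fraction of all accounts}, most common prefix first."""
    # Upper-case the prefix so "john" and "John" are counted together.
    counts = Counter(name[:length].upper() for name in names if name)
    total = sum(counts.values())
    return {prefix: count / total for prefix, count in counts.most_common()}

# Hypothetical sample data, not the real account list.
sample = ["john", "mary", "mark", "sam", "sue", "alice"]
shares = prefix_shares(sample, length=1)
for prefix, share in shares.items():
    print(f"{prefix} {share:.2%}")
```

Calling it with length=2, 3 or 4 gives the share of each longer prefix in the same way.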
Now if you had a server with fewer than 300,000 accounts then you could probably split them based on the first letter. With more than 300,000 accounts you would risk having more than 32767 account names starting with S. See the table below for the incidence of each first letter.
The two-letter prefix MA comprised 3.01% of the accounts. So if you face a limit of 32767 subdirectories and split by two letters, you might expect to have no problems until you approach 1,000,000 accounts. A number of other common two-letter prefixes each had more than 1.5% of the total number of accounts.
Next I looked at the three-character prefixes and found that MAR comprised 1.06% of all accounts. This indicates that splitting on the first three characters will only save you from the 32767 limit if you have about 3,000,000 users or fewer.
Finally I observed that the four-character prefix JOHN (which incidentally is my middle name) comprised 0.44% of the user base. At that rate the 32767 limit is reached at about 7,400,000 users, so with more users than that, splitting them up among four-character prefixes is not necessarily going to avoid the limit.
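The arithmetic behind these thresholds is just the limit divided by the share of the busiest prefix. Here is a small Python sketch of it; the name max_accounts is my own, the share for S is taken as a round 10%, and since all the inputs are the rounded percentages quoted above the results should be treated as approximations.

```python
LINK_LIMIT = 32767  # largest value of a 16-bit signed link count

def max_accounts(share, limit=LINK_LIMIT):
    """Total account count at which the busiest prefix reaches the limit."""
    return int(limit / share)

# Share of all accounts held by the most common prefix of each length.
# "S" is assumed to be 10% ("almost 10%" in the text above).
busiest = {"S": 0.10, "MA": 0.0301, "MAR": 0.0106, "JOHN": 0.0044}

for prefix, share in busiest.items():
    print(f"{prefix:5s} {share:7.2%} -> about {max_accounts(share):,} accounts")
```

This assumes the distribution of names stays the same as the user base grows, which is unlikely to hold exactly.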
It seems to me that the benefits of splitting accounts by the first characters are not nearly as great as you might expect. Having directories for each combination of the first two letters is practical; I’ve seen directory names such as J/O/JOHN or JO/JOHN (or use J/O/HN or JO/HN if you want to save directory space). But it becomes inconvenient to have J/O/H/N, and the form JOH/N can have as many as 17,576 subdirectories (26 × 26 × 26) for the first three letters, which may be bad for performance.
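To make the naming schemes concrete, here is a Python sketch that builds each of the forms mentioned above. The function account_dir and its parameters are hypothetical, and it simply rejects names that are too short for the layout, which a real system would need a proper rule for.

```python
from pathlib import PurePosixPath

def account_dir(name, levels=2, grouped=False, strip_prefix=False):
    """Build a relative path such as J/O/JOHN, JO/JOHN, J/O/HN or JO/HN."""
    name = name.upper()
    if len(name) <= levels:
        # Simplification: a real system needs a rule for short names.
        raise ValueError(f"{name!r} is too short for a {levels}-level split")
    prefix = name[:levels]
    # grouped=True gives one directory per prefix (JO); otherwise one per letter (J/O).
    parts = [prefix] if grouped else list(prefix)
    # strip_prefix=True drops the letters already encoded in the directories.
    leaf = name[levels:] if strip_prefix else name
    return str(PurePosixPath(*parts, leaf))

for grouped in (False, True):
    for strip_prefix in (False, True):
        print(account_dir("john", grouped=grouped, strip_prefix=strip_prefix))
```

With levels=3 and grouped=True the same function produces the JOH/N form.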
This issue is largely academic, as most sys-admins will never touch a system with more than a million users. As for how you would provision so many users, in the past the limits of server hardware were reached long before these issues arose. For example, in 2003 I was running some mail servers on 2RU rack-mounted systems with four disks in a RAID-5 array (plus one hot-spare), and each server had approximately 200,000 mailboxes. The accounts were split based on the first two letters, but even if they had been split on only one letter it would probably have worked. Since then performance has improved in all aspects of hardware. Instead of five 3.5″ disks, a 2RU server will now have eight 2.5″ disks, and as a rule of thumb increasing the number of disks tends to increase performance. CPU performance has also increased dramatically: instead of two single-core 32-bit CPUs, a 2RU server will often have two quad-core 64-bit CPUs, giving more than four times the CPU performance. 4RU machines can have 16 internal disks as well as four CPUs and could therefore probably serve mail for close to 1,000,000 users.
While for reliability it’s not the best idea to have all the data for 1,000,000 users on internal disks in a single server (which could be the topic of an entire series of blog posts), I am noting that it’s conceivable to do so and provide adequate performance. Also, of course, if you use a storage device that supports redundant operation (exporting data over NFS, iSCSI, or Fibre Channel) and things are configured correctly, you can achieve considerably more performance and therefore have a greater incentive to keep the data for a larger number of users in one filesystem.
Hashing directory names is one possible way of alleviating these problems. It would be a little inconvenient for sys-admin tasks, as you would have to hash the account name to discover where it was stored, but I guess you could have a shell script or alias to do this.
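As a rough illustration of the idea, here is a Python sketch of a lookup helper. The function name hashed_dir, the choice of SHA-256 and the layout of two directory levels named by two hex digits each are all assumptions of mine, not a description of any particular mail server.

```python
import hashlib

def hashed_dir(name, levels=2):
    """Map an account name to a path like <xx>/<yy>/<name> using a hash."""
    # Hash the lower-cased name so the location does not depend on case.
    digest = hashlib.sha256(name.lower().encode("utf-8")).hexdigest()
    # Each level uses two hex digits, so at most 256 entries per directory level.
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(parts + [name])

print(hashed_dir("john"))
```

Because a good hash spreads names evenly, two levels give 65,536 buckets of roughly equal size regardless of how popular a name like JOHN is. Wrapping the function in a small command-line script would cover the sys-admin lookup case.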
Here is the list of frequency of first letters in account names: