Letter Frequency in Account Names

It’s a common practice when hosting email or web space for large numbers of users to group the accounts by the first letter. This is done partly because some filesystems perform poorly with large directories, and partly because a 16bit signed integer is often used for the hard link count, which makes it impossible to have more than 32767 subdirectories in one directory.

I’ve just looked at a system I run (Bluebottle anti-spam email service [1]) which has about half a million accounts and counted the incidence of each first letter. It seems that S is the most common at almost 10%, and M and A aren’t far behind. Most of the clients have English as their first language; naturally the distribution of letters would be different for other languages.

Now if you had a server with fewer than 300,000 accounts then you could probably split them based on the first letter. With more than 300,000 accounts you would risk having too many account names starting with S: at almost 10% of all accounts, S alone would reach the 32767 limit at a little over 330,000 accounts. See the table below for the incidences of all the first letters.

The two letter prefix MA comprised 3.01% of the accounts. So if you are faced with a limit of 32767 subdirectories and split by two letters, you might expect to have no problems until you approached 1,000,000 accounts. There were a number of other common two-letter prefixes which each had more than 1.5% of the total number of accounts.

Next I looked at the three character prefixes and found that MAR comprised 1.06% of all accounts. This indicates that splitting on the first three characters will only save you from the 32767 limit if you have 3,000,000 users or fewer.

Finally I observed that the four character prefix JOHN (which incidentally is my middle name) comprised 0.44% of the user base. That indicates that if you have more than roughly 7,400,000 users (32767 divided by 0.44%) then splitting them up among four character prefixes is not necessarily going to avoid the 32767 limit.
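The arithmetic behind these thresholds can be sketched in a few lines of Python. This is only an illustration under the assumptions stated above: the 32767 limit and the percentages are the ones quoted in this post, the names MAX_SUBDIRS, TOP_PREFIX_SHARE and max_accounts are made up for the example, and the results are approximate because the quoted percentages are themselves rounded.

```python
# Subdirectory limit assumed throughout this post.
MAX_SUBDIRS = 32767

# Share (in percent) of the most common prefix of each length, as quoted above.
TOP_PREFIX_SHARE = {"S": 9.85, "MA": 3.01, "MAR": 1.06, "JOHN": 0.44}


def max_accounts(share_percent, limit=MAX_SUBDIRS):
    """Total account count at which the busiest prefix directory hits the limit."""
    return int(limit / (share_percent / 100))


for prefix, share in TOP_PREFIX_SHARE.items():
    print(f"{prefix:<5}{share:>6.2f}%  about {max_accounts(share):,} accounts")
```

The estimate assumes the prefix distribution stays the same as the user base grows, which may not hold for a different population of users.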

It seems to me that the benefits of splitting accounts by the first characters are not nearly as great as you might expect. Having directories for each combination of the first two letters is practical. I’ve seen directory names such as J/O/JOHN or JO/JOHN (or use J/O/HN or JO/HN if you want to save directory space). But it becomes inconvenient to have J/O/H/N, and the form JOH/N will have as many as 17,576 subdirectories (26 × 26 × 26) for the first three letters, which may be bad for performance.
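The layouts mentioned above can be generated by one small helper. This is a minimal sketch, not taken from any particular mail server: the function name account_path and its parameters are hypothetical, and rejecting names that are too short for the prefix is just one possible policy.

```python
def account_path(name, group_sizes=(1, 1), strip_prefix=False):
    """Build a relative directory path for an account name.

    group_sizes lists how many characters each directory level consumes,
    so (1, 1) gives J/O/... and (2,) gives JO/...
    strip_prefix drops the consumed characters from the final component.
    """
    used = sum(group_sizes)
    if len(name) <= used:
        # Assumption: short names are rejected rather than padded.
        raise ValueError(f"{name!r} is too short for a {used}-character prefix")
    parts, pos = [], 0
    for size in group_sizes:
        parts.append(name[pos:pos + size])
        pos += size
    parts.append(name[pos:] if strip_prefix else name)
    return "/".join(parts)


print(account_path("JOHN"))                                       # J/O/JOHN
print(account_path("JOHN", group_sizes=(2,)))                     # JO/JOHN
print(account_path("JOHN", strip_prefix=True))                    # J/O/HN
print(account_path("JOHN", group_sizes=(3,), strip_prefix=True))  # JOH/N
print(26 ** 3)                                                    # 17576
```

A real deployment would also need a rule for account names containing digits or punctuation, which this sketch ignores.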

This issue is largely academic, as most sys-admins won’t ever touch a system with more than a million users. As for how you would provision so many users, in the past the limits of server hardware were reached long before these issues arose. For example in 2003 I was running some mail servers on 2RU rack mounted systems with four disks in a RAID-5 array (plus one hot-spare), and each server had approximately 200,000 mailboxes. The accounts were split based on the first two letters, but even if they had been split on only one letter it would probably have worked. Since then performance has improved in all aspects of hardware. Instead of a 2RU server having five 3.5″ disks it will have eight 2.5″ disks, and as a rule of thumb increasing the number of disks tends to increase performance. The CPU performance of servers has also dramatically increased: instead of having two single-core 32bit CPUs in a 2RU server you will often have two quad-core 64bit CPUs, which is more than four times the CPU performance. 4RU machines can have 16 internal disks as well as four CPUs and therefore could probably serve mail for close to 1,000,000 users.

While for reliability it’s not the best idea to have all the data for 1,000,000 users on internal disks on a single server (which could be the topic of an entire series of blog posts), I am noting that it’s conceivable to do so and provide adequate performance. Also, if you use a storage device that supports redundant operation (exporting data over NFS, iSCSI, or Fibre Channel) and things are configured correctly, you can achieve considerably more performance and therefore have a greater incentive to keep the data for a larger number of users in one filesystem.

Hashing directory names is one possible way of alleviating these problems. This would be a little inconvenient for sys-admin tasks, as you would have to hash the account name to discover where it was stored, but I guess you could have a shell script or alias to do this.
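As a rough illustration of the hashing idea, here is a small Python sketch that could also serve as the lookup helper for sys-admin use. Everything in it is an assumption made for the example rather than a description of an existing system: the function name hashed_dir, the /mail base directory, the choice of SHA-256, and the use of two hex digits (256 possible directories) per level.

```python
import hashlib
import sys


def hashed_dir(account, levels=2, base="/mail"):
    """Map an account name to a directory chosen by a hash of the name.

    Each level uses two hex digits of the digest, so no level can hold
    more than 256 subdirectories however skewed the account names are.
    """
    digest = hashlib.sha256(account.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join([base, *parts, account])


if __name__ == "__main__":
    # Usage as a lookup helper: pass account names on the command line.
    for name in sys.argv[1:]:
        print(hashed_dir(name))
```

The hash is used here only to spread names evenly, not for security, so any well-distributed hash function would do. With two levels there are 65,536 leaf directories, so even several million accounts would average only around a hundred per directory.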

Here is the list of frequency of first letters in account names:

First Letter Percentage
a 7.65
b 5.86
c 5.97
d 5.93
e 2.97
f 2.85
g 3.57
h 3.19
i 2.21
j 6.09
k 3.92
l 3.91
m 8.27
n 3.15
o 1.44
p 4.82
q 0.44
r 5.04
s 9.85
t 5.2
u 0.85
v 1.9
w 2.4
x 0.63
y 0.97
z 0.95
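A table like the one above can be produced from a plain list of account names. The following Python sketch shows one way to do it; the function name prefix_shares and the sample names are invented for the example and are not real data from the service.

```python
from collections import Counter


def prefix_shares(names, length=1):
    """Return {prefix: percentage of accounts}, most common prefix first.

    Names shorter than the requested prefix length are skipped.
    """
    counts = Counter(n[:length].lower() for n in names if len(n) >= length)
    total = sum(counts.values())
    return {prefix: 100 * count / total for prefix, count in counts.most_common()}


# Made-up sample data, purely for illustration.
sample = ["sam", "sarah", "mark", "john"]
for prefix, share in prefix_shares(sample).items():
    print(f"{prefix} {share:.2f}")
```

Calling prefix_shares(names, length=2) or higher gives the longer prefix counts discussed earlier in the post.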

3 comments to Letter Frequency in Account Names

  • Felipe Sateler

    IIRC (because they no longer provide UNIX accounts), my university used a system that went like this for home folders (and possibly for other stuff too):
    This way you can choose a number of allowed subdirectories, and when the number has been reached, you just skip to the next one. This would give you an upper limit of 2^32 = 4.294.967.296.
    OTOH, now finding a user’s directory is harder, but I guess keeping a map of users to numbers shouldn’t be hard.

  • If this kind of system is running on ext3 then it would be wise to enable the dir_index feature. See tune2fs(8) or your local internet for details.

  • etbe

    Felipe: For home directories that sort of thing is manageable. You can just type “cd ~user” to get to a user’s home directory. For a big mail server or web server there needs to be a way of mapping between user-names and directories. You can do that in a database or LDAP server, but that may require an extra database or LDAP query, which increases the system load and the latency of the operation. Of course if you have a large number of Unix accounts then you will have a database or LDAP server to store the /etc/passwd type data.

    But if you can avoid lookups and just know that user foo on the local machine is stored at /mail/f/o/foo then it’s a lot easier.

    Ted: Good point. When I wrote this post I thought that dir_index had been around for ages as a default feature. A quick test however revealed that RHEL4 does not enable it by default. So if running a big server on RHEL4 this is definitely something you should check!