Mobile SSH Client

There has been a lot of fuss recently about the release of the iPhone [1] in Australia. But I have not been impressed.

I read an interesting post Why I don’t want an iPhone [2] which summarises some of the issues with it not being an open platform (and not having SSH client support). Given all the fuss about iPhones (which have just arrived in Australia) I had been thinking of writing my own post about this, but TK covered most of the issues that matter to me. One other thing I have to mention is that I want a more fully powered PC with me. So even if I had a Green Phone (which doesn’t seem to be on general sale) [3] or an OpenMoko [4] I would still want at least a PDA running Familiar and preferably a laptop – I often carry both. A Nokia N8x0 series Internet Tablet [4] would satisfy my PDA needs (and also remove the need to carry an MP3/MP4 player and audio recorder).

When doing serious travelling I carry a laptop, a PDA, and an MP3 player, so all areas of my digital needs are covered better than an iPhone could reasonably manage. Finally, mobile phones tend to not work, or not work well ($1 per minute calls is part of my definition of “not well”), in other countries. While I haven’t been doing a lot of travelling recently I still try to avoid buying things that won’t work in other countries.

I had planned to just mention TK’s post in a links post. But then a client offered to buy me an iPhone. He wants me to be able to carry an ssh client with me most places that I go so that whenever his systems break I can login. Now apart from the lack of ssh client support an iPhone seems ideal. :-#

The cheapest Optus iPhone plan seems to be $19 per month for calls and data (which includes 100M of data) plus $21 per month over 24 months for the iPhone, thus giving a cost of $40 per month for 100M of data transfer (and a nice phone). There is a plan for a $19 per month iPhone, but that has a $19 per month un-capped phone plan and doesn’t sound like a good way of saving $2 per month. The “Three” phone company offers USB 3G modems for $5 per month (on a 24 month contract) and their cheapest plan is $15 per month, which gives you 1GB of data per month and $0.10/M for additional data transfer. So it’s $20 per month for 1G (which requires a laptop) vs $40 per month for 100M.

Three also has a range of phone plans that allow 3G data access over bluetooth to a PC; it seems that a Nokia N8x0 tablet can be used with that, which gives a result of two devices the size of mobile phones. But that costs $20 per month (on top of a regular Three bill) for a plan that offers 500M of data, and still requires two devices while not giving the full PC benefits.

In the past I’ve done a lot of support work with a Nokia Communicator, so I’ve found that anything less than a regular keyboard really slows things down. While an EeePC keyboard is not nearly as good as a full sized keyboard it is significantly better than a touch-screen keyboard on a PDA (IE the Nokia N8x0 or the OpenMoko).

At the moment I’m looking at the option of carrying an EeePC with a USB Internet access device. That will cost $20 per month for net access. The cost of the EeePC is around $300 for a low-end model or about $650 for a 901 series that can run Xen (as noted in my previous post I’m considering the possibilities for having a mobile Xen simulation of a production network [5]). The savings of $20 per month over 24 months will entirely cover the cost of a low-end EeePC (ssh terminal, web browsing, and local storage of documentation) and cover most of the cost of a high-end EeePC. Another possibility to consider is using an old Toshiba Satellite I have hanging around (which I used to use as a mobile SE Linux demonstration machine) for a few months while the price on the EeePC 901 drops (as soon as the 70x series is entirely sold out and the 1000 series is available I expect that the 901 will get a lot cheaper).

The New DNS Mess

The Age has an interesting article about proposed DNS changes [1].

Apparently ICANN is going to sell top level DNS names and a prediction has been made that they will cost more than $100,000 each. A suggestion for a potential use of this would be to have cities as top level names (a .paris TLD was given as an example). The problem with this is that such names are not unique. Countries that were colonised in recent times (such as the US and Australia) have many names copied from Europe. It will be interesting to see how they plan to determine which of the cities gets to register the name; for the .paris example I’m sure that the council of Paris, Illinois [2] would love to register it. Does the oldest city win an international trademark dispute over a TLD?

The current situation is that French law unambiguously determines who gets to register paris.fr and someone who sees the URL will have no confusion as to what it means (providing that they know that fr is the ISO country code for France).

As well as city names there are region names which are used for products. Australian vineyards produce a lot of sparkling wine that they like to call Champagne and a lot of fortified wine that they like to call Port. There are ongoing battles about how these names can be used and it seems likely to me that the Australian wine industry will change to other terms. But in the meantime it would be interesting if .champagne and .port were registered by Australian companies. The fuss that would surely cause would probably give enough free publicity to the Australian wine industry to justify an investment of $200,000 in TLDs.

The concern that is cited by business people (including the client who forwarded me the URL and requested my comments) is that of the expense of protecting a brand. Currently if you have a company named “Example” you can register example.com, example.net, and example.org if you are feeling enthusiastic. Then if you have a significant presence in any country you could register your name in the DNS hierarchy for that country (large companies try to register their name in every country – for a multinational registering ~200 domains is not really difficult or expensive). But if anyone can create a new TLD (and therefore if new ones are liable to be created at any time) it becomes much more difficult. For example if a new TLD was created every day then a multi-national corporation would need to assign an employee to work full-time on investigating the new TLDs and deciding which ones to use. A small company that has an international presence (IE an Internet company) would just lose a significant amount of control over their name.

I don’t believe that this is as much of a concern as some people (such as my client) do. Currently I could register a phone line with a listed name indicating that it belongs to the Melbourne branch of a multi-national corporation. I don’t expect that Telstra would stop me, but the benefit from doing this would be minimal (probably someone who attempted fraud using such means would not gain much and would get shut down quickly). I don’t think that a DNS name registered under a .melbourne TLD would cause much more harm than a phone number listed in the Melbourne phone book. Incidentally for readers from the US, I’m thinking of Melbourne in Australia not a city of the same name in the US – yet another example of a name conflict.

Now I believe that it would be better if small companies didn’t use .com domains. The use of a country specific name relevant to where they work is more appropriate and technically easier to implement. I don’t regret registering coker.com.au instead of some name in another country or in the .com hierarchy. Things would probably be working better right now if a .com domain name had always cost $100,000 and there were only a few dozen companies that had registered them. But we have to go with the flow sometimes, so I have registered RussellCoker.com.

Now when considering the merit of an idea we should consider who benefits and who (if anyone) loses. Ideally we would choose options that provide benefits for many people and losses for few (or none). In this case it seems that the suggested changes would be a loss for corporations that want to protect their brand, a loss for end-users who just want to find something without confusion, and provide more benefits for domain-squatters than anyone else.

Maybe I should register icann.port and icann.champagne if those TLDs are registered in Australia and impersonate ICANN. ;)

Solving Rubik’s Cube and IO Bandwidth

Solving Rubik’s Cube by treating disk as RAM: Gene Cooperman gave an interesting talk at Google about how he proved that Rubik’s Cube can be solved in 26 moves and how treating disk as RAM was essential for this. The Google talk is on Youtube [1]. I recommend that you read the ACM paper he wrote with Daniel Kunkle before watching the talk. Incidentally, due to the resolution of Youtube it would have been good if the notes had fewer than 10 lines per screen.

Here is the main page for the Rubik’s Cube project with source and math [2]; note that I haven’t been interested enough to read the source but I’m including the link for reference.

The main concept is that modern disks can deliver up to 100MB/s for contiguous IO (I presume that’s from the outer tracks, I suspect that the inner tracks wouldn’t deliver that speed). Get 50 disks running at the same speed and you get 5GB/s of contiguous IO, which is a typical speed for RAM. Of course that RAM speed is for a single system, while getting 50 disks running at that speed will require either a well-tuned system from SGI (who apparently achieved such speeds for a single process on a single system many years ago – but I can’t find a reference) or 5+ machines from anyone else. The configuration that Gene describes apparently involves a network of machines with one disk each; he takes advantage of hardware purchased for other tasks (where the disks are mostly wasted).

I believe that SGI sells Altix machines which can have enough RAM to store all that data. It is NUMA RAM, but even the “slow” access to RAM on another NUMA node should be a lot faster than disk in most cases of sequential access, and when there are seeks the benefits of NUMA RAM over disk will be dramatic. Of course the cost of a large NUMA installation is also significant, while a set of 50 quad-core machines with 500G disks is affordable by some home users.

Letter Frequency in Account Names

It’s a common practice when hosting email or web space for large numbers of users to group the accounts by the first letter. This is due to performance problems on some filesystems with large directories, and due to the fact that often a 16-bit signed integer is used for the hard link count, which makes it impossible to have more than 32767 subdirectories.

I’ve just looked at a system I run (Bluebottle anti-spam email service [1]) which has about half a million accounts and counted the incidence of each first letter. It seems that S is the most common at almost 10%, and M and A aren’t far behind. Most of the clients have English as their first language; naturally the distribution of letters would be different for other languages.
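
For anyone who wants to repeat the count, a quick shell pipeline will do it. This is just a sketch, assuming a file (here called accounts.txt, a hypothetical name) with one account name per line:
# count how many account names start with each letter, most common first
cut -c1 accounts.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn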

Now if you were to have a server with less than 300,000 accounts then you could probably split them based on the first letter: at almost 10% for S, 300,000 accounts would give almost 30,000 names in the S directory, which is still under the 32767 limit. If there were more than 300,000 accounts then you would face the risk of having too many account names starting with S. See the table below for the incidence of all the first letters.

The two letter prefix MA comprised 3.01% of the accounts. So if faced with a limit of 32767 sub-directories and splitting by the first two letters, you might expect to have no problems until you approached 1,000,000 accounts. There were a number of other common two-letter prefixes which also had more than 1.5% of the total number of accounts.

Next I looked at the three character prefixes and found that MAR comprised 1.06% of all accounts. This indicates that splitting on the first three characters will only save you from the 32767 limit if you have about 3,000,000 users or fewer.

Finally I observed that the four character prefix JOHN (which incidentally is my middle name) comprised 0.44% of the user base. That indicates that if you have more than 6,400,000 users then splitting them up among four character prefixes is not necessarily going to avoid the 32767 limit.

It seems to me that the benefits of splitting accounts by the first characters are not nearly as great as you might expect. Having directories for each combination of the first two letters is practical; I’ve seen directory names such as J/O/JOHN or JO/JOHN (or J/O/HN or JO/HN if you want to save directory space). But it becomes inconvenient to have J/O/H/N, and the form JOH/N will have as many as 17,576 subdirectories for the first three letters, which may be bad for performance.
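
As a rough illustration of the two-letter scheme, the whole directory tree can be created in advance with a loop like the following (a sketch assuming bash and a hypothetical /mail base directory):
# create a directory for every two-letter prefix (26 * 26 = 676 directories)
for a in {a..z}; do
    for b in {a..z}; do
        mkdir -p /mail/$a/$b
    done
done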

This issue is largely academic, as most sys-admins won’t ever touch a system with more than a million users. But in terms of how you would provision so many users, in the past the limits of server hardware were reached long before these directory limits became a problem. For example in 2003 I was running some mail servers on 2RU rack mounted systems with four disks in a RAID-5 array (plus one hot-spare) – each server had approximately 200,000 mailboxes. The accounts were split based on the first two letters, but even if they had been split on only one letter it would probably have worked. Since then performance has improved in all aspects of hardware. Instead of a 2RU server having five 3.5″ disks it will have eight 2.5″ disks – and as a rule of thumb increasing the number of disks tends to increase performance. Also the CPU performance of servers has dramatically increased; instead of having two single-core 32bit CPUs in a 2RU server you will often have two quad-core 64bit CPUs – more than four times the CPU performance. 4RU machines can have 16 internal disks as well as four CPUs and therefore could probably serve mail for close to 1,000,000 users.

While for reliability it’s not the best idea to have all the data for 1,000,000 users on internal disks in a single server (which could be the topic of an entire series of blog posts), I am noting that it’s conceivable to do so and provide adequate performance. Also, of course, if you use one of the storage devices that supports redundant operation (exporting data over NFS, iSCSI, or Fibre Channel) and things are configured correctly then you can achieve considerably more performance, and therefore have a greater incentive to have the data for a larger number of users in one filesystem.

Hashing account names to generate directory names is one possible way of alleviating these problems. This would be a little inconvenient for sys-admin tasks as you would have to hash the account name to discover where it was stored, but a shell script or alias could do that for you.
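
A minimal sketch of such a helper, assuming the hypothetical /mail base directory again and using the first two hex digits of an MD5 hash to spread accounts over 256 directories:
# print the storage directory for a given account name
maildir_path() {
    prefix=$(printf '%s' "$1" | md5sum | cut -c1-2)
    echo "/mail/$prefix/$1"
}

maildir_path john    # prints something like /mail/XX/john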

Here is the list of frequency of first letters in account names:

First Letter Percentage
a 7.65
b 5.86
c 5.97
d 5.93
e 2.97
f 2.85
g 3.57
h 3.19
i 2.21
j 6.09
k 3.92
l 3.91
m 8.27
n 3.15
o 1.44
p 4.82
q 0.44
r 5.04
s 9.85
t 5.20
u 0.85
v 1.90
w 2.40
x 0.63
y 0.97
z 0.95
BIND Stats

In Debian the BIND server will by default append statistics to the file /var/cache/bind/named.stats when the command rndc stats (which seems to be undocumented) is run. The default for RHEL4 seems to be /var/named/chroot/var/named/data/named_stats.txt.

The output will include the time-stamp of the log in the number of seconds since 1970-01-01 00:00:00 UTC (see my previous post explaining how to convert this to a regular date format [1]).
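
For example, with GNU date an arbitrary timestamp (the value below is just an example) can be converted like this:
date -d @1215500000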

By default this only logs a summary for all zones, which is not particularly useful if you have multiple zones. If you edit the BIND configuration and put zone-statistics 1; in the options section then it will log separate statistics for each zone. Unfortunately if you add this and apply the change via rndc reload I don’t know of any convenient way to determine when the change was made, and therefore the period of time for which the per-zone statistics have been kept. So after applying this to my servers I restarted the named processes so that it would be obvious from the process start time when the statistics started.
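
The relevant part of the configuration ends up looking roughly like this (the directory line is the Debian default mentioned above; other distributions will differ):
options {
        directory "/var/cache/bind";
        zone-statistics 1;
};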

The reason I became interested in this is that a member of a mailing list that I subscribe to was considering the DNSMadeEasy.com service. That company runs primary DNS servers for $US15 per annum, which allows 1,000,000 queries per month, 3 zones, and 120 records (for either a primary or a secondary server). Based on three hours of statistics it seems that my zone coker.com.au is going to get about 360,000 queries a month (between both the primary and the secondary server). So the $15 a year package could accommodate 3 such zones as either primary or secondary (they each got about half the traffic). I’m not considering outsourcing my DNS, but it is interesting to consider how the various offers add up.

Another possibility for people who are considering DNS outsourcing is Xname.org, which provides free DNS (primary and secondary) but requests contributions from business customers (or anyone else).

Updated because I first published it without getting stats from my secondary server.

Moving a Mail Server

Nowadays it seems that most serious mail servers (IE mail servers suitable for running an ISP) use one file per message. In the old days (before about 1996) almost all Internet email was stored in Mbox format [1]. In Mbox you have a large number of messages in a single file; most users would have a single file with all their mail and the advanced users would have multiple files for storing different categories of mail. A significant problem with Mbox is that it was necessary to read the entire file to determine how many messages were stored, and as determining the number of messages was the first thing that was done in a POP connection this caused significant performance problems for POP servers. Even more serious problems occurred when messages were deleted, as the Mbox file needed to be compacted.

Maildir is a mail storage method developed by Dan Bernstein based around the idea of one file per message [2]. It solves the performance problems of Mbox and also solves some reliability issues (file locking is not needed). It was invented in 1996 and has since become widely used in Unix messaging systems.

The Cyrus IMAP server [3] uses a format similar to Maildir. The most significant difference is that the Cyrus data is regarded as being private to the Cyrus system (IE you are not supposed to mess with it) while Maildir is designed to be used by any tools that you wish (EG my Maildir-Bulletin project [4]).

One down-side to such formats that many people don’t realise (except at the worst time) is the difficulty in performing backups. As a test I used an LVM volume stored on a RAID-1 array of two 20G 7200rpm IDE disks, with 343M of data used (according to “df -h”) and 39,358 inodes in use. As there were 5,000 accounts with Maildir storage that means 25,000 directories for the home directories and Maildir directories, so there were 14,358 files. Creating a tar file of that (written to /dev/null via dd to avoid tar’s optimisation of /dev/null) took 230.6 seconds; 105MB of data was transferred, for a transfer rate of 456KB/s. It seems that tar stores the data in a more space efficient manner than the Ext3 filesystem (105MB vs 343MB). For comparison either of the two disks can deliver 40MB/s from the inner tracks. So it seems that unless the amount of used space is less than 1% of the total disk space it will be faster to transfer a filesystem image.
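
The test was essentially the following (a sketch, with /mail standing in for wherever the mail filesystem is mounted):
# read every file into a tar stream; dd is used as a sink because GNU tar
# skips reading the files if told to write the archive directly to /dev/null
time tar cf - /mail | dd of=/dev/null bs=1M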

If you have disks that are faster than your network (EG old IDE disks can sustain 40MB/s transfer rates on machines with 100baseT networking, and RAID arrays can easily sustain hundreds of megabytes a second on machines with gigabit Ethernet networking) then compression has the potential to improve the speed. Of course the fastest way of transferring such data is to connect the disks to the new machine; this is usually possible when using IDE disks but the vast number of combinations of SCSI bus, disk format, and RAID controller makes it almost impossible on systems with hardware RAID.
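
A network transfer of a filesystem image with compression can be as simple as the following sketch (the device names and port number are examples, and the netcat options assume the traditional netcat; check the syntax of the version you have):
# on the destination machine: receive, decompress, and write the image
nc -l -p 2222 | gunzip | dd of=/dev/vg0/mail bs=1M

# on the source machine (with the filesystem unmounted, or using an LVM snapshot):
dd if=/dev/vg0/mail bs=1M | gzip -1 | nc newserver 2222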

The first test I made of compression was on a 1GHz Athlon system which could compress (via gzip -1) 100M of data with four seconds of CPU time. This means that compression has the potential to reduce the overall transfer time (the machine in question has 100baseT networking and no realistic option of adding Gig-E).

The next test I made was on a 3.2GHz Pentium-4 Xeon system. It compressed 1000M of data in 77 seconds (it didn’t have the same data as the Athlon system so it can’t be directly compared); as 1000M would take something like 10 or 12 seconds to transfer at Gig-E speeds, that obviously isn’t a viable option.

The gzip -1 compression however compressed the data to 57% of its original size. The fact that it compresses so well with gzip -1 suggests to me that there might be a compression method that uses less CPU time while still getting a worth-while amount of compression. If anyone can suggest such a compression method then I would be very interested to try it out. The goal would be a program that can compress 1G of data in significantly less than 10 seconds on a 3.2GHz P4.

Without compression the time taken to transfer 500G of data at Gig-E speeds will probably approach two hours. Not a good amount of down-time for a service that runs 24*7. Particularly given that some time would be spent in getting the new machine to actually use the data.

As for how to design a system to not have these problems, I’ll write a future post with some ideas for how to alleviate that.

Mobile Facebook

A few of my clients have asked me to configure their routers to block access to Facebook and Myspace. Apparently some employees spend inappropriate amounts of time using those services while at work. Using iptables to block port 80 and configuring Squid to reject access to those sites is easy to do.
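
A minimal sketch of the sort of configuration involved (the interface name and ACL name are just examples, and the assumption is that clients are only permitted web access via the Squid proxy):
# reject direct web connections from the office LAN so that the proxy must be used
iptables -A FORWARD -i eth1 -p tcp --dport 80 -j REJECT

# in squid.conf, deny the sites in question (before the rule that allows access)
acl timewasters dstdomain .facebook.com .myspace.com
http_access deny timewasters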

So I was interested to see an advertising poster in my local shopping centre promoting the Telstra “Next G Mobile” which apparently offers “Facebook on the go“. I’m not sure whether Telstra has some special service for accessing Facebook (maybe a Facebook client program running on the phone) or whether it’s just net access on the phone which can be used for Facebook (presumably with a version of the site that is optimised for a small screen).

I wonder if I’ll have clients wanting me to firewall the mobile phones of their employees (of course it’s impossible for me to do it – but they don’t know that).

I have previously written about the benefits of a 40 hour working week for productivity and speculated on the possibility that for some employees the optimum working time might be less than 40 hours a week [1]. I wonder whether there are employees who could get more work done by spending 35 hours working and 5 hours using Facebook than they could by working for 40 hours straight.

ARP

One of the lowest level protocols used in IP networking is ARP (the Address Resolution Protocol). ARP is used to request the Ethernet hardware (MAC) address of the host which owns a particular IP address.

# arping 192.168.0.43
ARPING 192.168.0.43
60 bytes from 00:60:b0:3c:62:6b (192.168.0.43): index=0 time=339.031 usec
60 bytes from 00:60:b0:3c:62:6b (192.168.0.43): index=1 time=12.967 msec
60 bytes from 00:60:b0:3c:62:6b (192.168.0.43): index=2 time=168.800 usec
--- 192.168.0.43 statistics ---
3 packets transmitted, 3 packets received, 0% unanswered

One creative use of this is the program arping, which will send regular ARP request packets for an IP address and give statistics on the success of getting responses. The above is the result of an arping command which shows that the machine in question responds within about 13ms. One of the features of arping (when compared to the regular ping which uses an ICMP echo) is that it will operate when the interface has no IP address assigned or when the IP address does not match the netmask for the network in question.

This means that if you have a network which lacks DHCP and you want to find a spare IP address in the range that is used then you can use arping without assigning yourself an IP address first. If you wanted to use ping in that situation then you would have to first assign an IP address in which case you may have already broken the network!

Another useful utility is arpwatch. This program listens to ARP traffic and will notify the sys-admin when new machines appear. The notification message will include the Ethernet hardware address and the name of the manufacturer of the device (if it’s known). When you use arpwatch you can say “who added the device with the Intel Ethernet card to the network at lunch time?” instead of “who did something recently to the network that made it break?”. The more specific question is more likely to get an accurate answer.

Ethernet Bonding and a Xen Bridge

After getting Ethernet Bonding working (see my previous post) I tried to get it going with a bridge for Xen.

I used the following in /etc/network/interfaces to configure the bond0 device and to make the Xen bridge device xenbr0 use the bond device:

iface bond0 inet manual
pre-up modprobe bond0
pre-up ifconfig bond0 up
hwaddress ether 00:02:55:E1:36:32
slaves eth0 eth1

auto xenbr0
iface xenbr0 inet static
pre-up ifup bond0
address 10.0.0.199
netmask 255.255.255.0
gateway 10.0.0.1
bridge_ports bond0

But things didn’t work well. A plain bond device worked correctly in all my tests, but when I had a bridge running over it I had problems every time I tried pulling cables. My test for a bond is to boot the machine with a cable in eth0, then when it’s running switch the cable to eth1. This means there are a few seconds of no connectivity and then the other port becomes connected. In an ideal situation at least one port would work at all times – but redundancy features such as bonding are not for an ideal situation! When doing the cable switching test I found that the bond device would often get into a state where, every two seconds (the configured ARP ping interval for the bond), it would change its mind about the link status and have the link down half the time (according to the logs – according to ping results it was down all the time). This made the network unusable.

Now I have decided that Xen is more important than bonding, so I’ll deploy the machine without bonding.

One thing I am considering for next time I try this is to use bridging instead of bonding. The bridge layer will handle multiple Ethernet devices, and if they are both connected to the same switch then the Spanning Tree Protocol (STP) is designed to work in this way and should handle it. So instead of having a bond of eth0 and eth1 and running a bridge over that I would just bridge eth0, eth1, and the Xen interfaces, with a configuration something like the sketch below.
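
This is untested, but the /etc/network/interfaces stanza would presumably look something like the following (addresses as in the example above):
auto xenbr0
iface xenbr0 inet static
address 10.0.0.199
netmask 255.255.255.0
gateway 10.0.0.1
bridge_ports eth0 eth1
bridge_stp on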

Ethernet Bonding on Debian Etch

I have previously blogged about Ethernet bonding on Red Hat Enterprise Linux. Now I have a need to do the same thing on Debian Etch – to have multiple Ethernet links for redundancy so that if one breaks the system keeps working.

The first thing to do on Debian is to install the package ifenslave-2.6 which provides the utility to manage the bond device. Then create the file /etc/modprobe.d/aliases-bond with the following contents for a network that has 10.0.0.1 as either a reliable host or an important router. Note that this will use ARP to ping that address every 2000ms; you could use a lower value for faster failover or a higher value to reduce the amount of ARP traffic:
alias bond0 bonding
options bond0 mode=1 arp_interval=2000 arp_ip_target=10.0.0.1

If you want to monitor the link status directly then you can use the following options line instead; however I couldn’t test this because MII link monitoring doesn’t seem to work correctly on my hardware (there are many Ethernet devices that don’t work well in this regard):
options bond0 mode=0 miimon=100

Then edit the file /etc/network/interfaces and insert something like the following (as a replacement for the configuration of eth0 that you might currently be using). Note that XX:XX:XX:XX:XX:XX must be replaced by the hardware address of one of the interfaces that are being bonded or by a locally administered address (see this Wikipedia page for details). If you don’t specify the Ethernet address then it will default to the address of the first interface that is enslaved. This might not sound like a problem, however if the machine boots and a hardware failure is experienced which makes the primary Ethernet device not visible to the OS (IE the PCI card is dead but not killing the machine) then the hardware address of the bond would change, and this might cause problems with other parts of your network infrastructure.
auto bond0
iface bond0 inet static
pre-up modprobe bond0
hwaddress ether XX:XX:XX:XX:XX:XX
address 10.0.0.199
netmask 255.255.255.0
gateway 10.0.0.1
up ifenslave bond0 eth0 eth1
down ifenslave -d bond0 eth0 eth1

There is some special support for bonding in the Debian ifup and ifdown utilities. The following will give the same result as the above in /etc/network/interfaces:
auto bond0
iface bond0 inet static
pre-up modprobe bond0
hwaddress ether 00:02:55:E1:36:32
address 10.0.0.199
netmask 255.255.255.0
gateway 10.0.0.1
slaves eth0 eth1

The special file /proc/net/bonding/bond0 can be used to view the current configuration of the bond0 device.

In theory it should be possible to use bonding on a workstation with DHCP, but in my brief attempts I have not got it working – any comments from people who have this working would be appreciated. The first pre-requisite of doing so is to use either MII monitoring or broadcast (mode 3); I experimented with using options bond0 mode=3 in /etc/modprobe.d/aliases-bond but found that it took too long to get the bond working and dhclient timed out. The sort of configuration I was attempting is sketched below.
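
For reference, this is roughly what I was trying (untested and, as noted, not working for me; the hardware address is a placeholder as before):
auto bond0
iface bond0 inet dhcp
pre-up modprobe bond0
hwaddress ether XX:XX:XX:XX:XX:XX
slaves eth0 eth1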

Thanks for the howtoforge.com article and the linuxhorizon.ro article that helped me discover some aspects of this.

Update: Thanks to Guus Sliepen on the debian-devel mailing list for giving an example of the slaves directive as part of an example of bridging and bonding in response to this question.