The Most Important Things for Running a Reliable Internet Service

One of my clients is currently investigating new hosting arrangements. It’s a somewhat complex process because there are many architectural issues, such as the storage and backup of some terabytes of data and some serious computation on that data. Among other options we are considering cheap servers in the EX range from Hetzner [1], which provide 3TB of RAID-1 storage per server along with reasonable CPU power and RAM, and Amazon EC2 [2]. Hetzner and Amazon aren’t the only companies providing services that can be used to solve my client’s problems, but both offer good value and we have prior experience with them.

To add an extra complication, my client did some web research on hosting companies and found that Hetzner wasn’t even in the list of reliable hosting companies (whichever list that was). This is in some ways not particularly surprising: Hetzner offers servers without a full management interface (you can’t see a serial console or a KVM, you merely get access to reset the server), and the best value servers (the only servers to consider for many terabytes of data) have SATA disks, which presumably have a lower MTBF than SAS disks.

But I don’t think that this is a real problem. Even when hardware that’s designed for the desktop is run in a server room the reliability tends to be reasonable. My experience is that a desktop PC with two hard drives in a RAID-1 array will give a level of reliability in practice that compares very well to an expensive server with ECC RAM, redundant fans, redundant PSUs, etc.

My experience is that the most critical factor for server reliability is management. A server that is designed to be reliable can give very poor uptime if it is poorly maintained or if there is no rapid way of discovering and fixing problems. But a system that is designed to be cheap can give quite good uptime if it is well maintained and problems can be rapidly discovered and fixed.

A Brief Overview of Managing Servers

There are textbooks about how to manage servers, so obviously I can’t cover the topic in detail in a blog post. But here are some quick points. Note that I’m not claiming that this list includes everything; please add comments about anything particularly noteworthy that you think I’ve missed.

1. For a server to be well managed it needs to be kept up to date. It’s probably a good idea for management to have this on the list of things to do. A plan to check for necessary updates and apply them at fixed times (at least once a week) would be a good thing. My experience is that managers usually don’t have anything to do with this, and sysadmins either apply patches or not at their own whim.
2. Ideally someone should know how every piece of software works. Every piece of software that’s running should either have come from a source that provides some degree of support (e.g. a Linux distribution) or be maintained by someone who knows it well. When you install custom software from people who later become unavailable, it puts the reliability of the entire system at risk: if anything breaks then you won’t be able to get it fixed quickly.
3. It should be possible to rapidly discover problems; having a client phone you to tell you that your web site is offline is a bad thing. Ideally you will have software like Nagios monitoring the network and reporting problems via an SMS gateway service such as ClickaTell.com. I am not sure that Nagios is the best network monitoring system or that ClickaTell is the best SMS gateway, but they have both worked well in my experience. If you think that there are better options for either of those then please write a comment.
4. It should be possible to rapidly fix problems. That means that a sysadmin must be available 24/7 to respond to SMS, and you must have a backup sysadmin for when the main person takes a holiday, or ideally two backup sysadmins so that if one is on holiday and another has an emergency then problems can still be fixed. Another thing to consider is that an increasing number of hotels, resorts, and cruise ships are providing net access. So you could decrease your need for backup sysadmins if you give a holiday bonus to a sysadmin who uses a hotel, resort, or cruise ship that has good net access. ;)
5. If it seems likely that there may be some staff changes then it’s a really good idea to hire a potential replacement on a casual basis so that they can learn how things work. There have been a few occasions when I started a sysadmin contract after the old sysadmin ceased being on speaking terms with the company owner. This made it difficult for me to learn what was going on.
6. If your network is in any way complex (i.e. it’s something that needs some skill to manage) then it will probably be impossible to hire someone who has experience in all the areas of technology at a salary you are prepared to pay. So you should assume that whoever you hire will do some learning on the job. This isn’t necessarily a problem but it is something that needs to be considered. If you use some unusual hardware or software and want it to run reliably then you should have a spare system for testing, so that the types of mistake which are typically made in the learning process are not made on your production network.
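
To make the monitoring point in the list above a little more concrete, here is a minimal sketch in Python of the idea of checking a service and alerting only after repeated failures. It is not how Nagios works internally and it is not a replacement for it; the names tcp_check, FailureTracker, and run_once are made up for this illustration, and notify stands in for whatever SMS gateway client you actually use.

```python
import socket
from typing import Callable, Dict, List


def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureTracker:
    """Hypothetical helper: alert only after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures: Dict[str, int] = {}

    def record(self, name: str, ok: bool) -> bool:
        """Record one result; return True exactly when an alert should be sent."""
        if ok:
            self.failures[name] = 0
            return False
        self.failures[name] = self.failures.get(name, 0) + 1
        # Alert once, at the moment the threshold is first reached.
        return self.failures[name] == self.threshold


def run_once(checks: Dict[str, Callable[[], bool]],
             tracker: FailureTracker,
             notify: Callable[[str], None]) -> List[str]:
    """Run every check once; notify for services that just crossed the threshold."""
    alerted = []
    for name, check in checks.items():
        if tracker.record(name, check()):
            # `notify` is an assumption: plug in your own SMS gateway call here.
            notify(f"{name} is DOWN")
            alerted.append(name)
    return alerted
```

In use you would call run_once from cron or a loop with something like checks = {"web": lambda: tcp_check("www.example.com", 80)} (the host name is a placeholder). Requiring a few consecutive failures is a design choice to avoid waking the on-call sysadmin for a single dropped packet.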

Conclusion

If you have a business which depends on running servers on the Internet and you don’t do all the things in the above list then the reliability of a service like Hetzner probably isn’t going to be an issue at all.

Helmut Grohne

April 19, 2012 at 03:43

On 3 (failure notification): there is also a tool called pynotifyd (disclaimer: I’m upstream) that tries to deliver failure notifications to your sysadmin in a cheap manner. It can first try jabber. If your sysadmin is offline, it can try some online SMS service (e.g. sipgate or developergarden). If your network has failed, it can try a locally connected phone using yaps. That way you can get reliable notification at low cost.
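
[Editorial illustration, not part of the comment above and not pynotifyd’s actual code or configuration.] The fallback idea described here, trying the cheapest channel first and moving on when it fails, can be sketched in a few lines of Python. The function name notify_with_fallback and the channel callables are hypothetical; each callable is assumed to return True when the message was delivered.

```python
from typing import Callable, Iterable, Optional, Tuple

# A channel is a (name, send) pair; send(message) returns True on delivery.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(message: str,
                         channels: Iterable[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the first that delivers."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # A broken channel (e.g. the network is down) must not stop
            # the chain; fall through to the next, more expensive channel.
            continue
    # Every channel failed; the caller decides what to do next.
    return None
```

The channels would be listed from cheapest to most expensive, for example instant messaging, then an online SMS service, then a locally attached phone.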

Oliver Gorwits

April 26, 2012 at 08:02

Excellent post, thanks.

It sometimes used to be difficult to justify having a spare server simply to allow checking of updates or reconfigurations prior to deployment. But problems in these tasks are where a lot of unplanned downtime comes from.

A positive change in recent years is the free availability of easy to use virtualisation. What I think this has done is allow managers of systems to have a staging server of each production server, at little cost.

Now it’s possible to have one extra server with KVM instances of each type of production server, to have somewhere to first apply patches and check things out. Makes me sleep easier.

etbe

April 26, 2012 at 13:56

Helmut: Thanks for the comment and thanks for writing a useful tool.

Oliver: One thing about virtualisation is that it allows multiple configurations. I’ve worked at places where they have a few non-virtual servers for testing things, but that only allowed testing one configuration. In such a network you have a great incentive to not break the test system as other people want to test stuff. It’s OK as a pre-deployment test platform, but not for anything else.

With KVM or Xen you can have developers and sysadmins run virtual machines for testing any proposed change. If there are several ways of solving a problem then you can have a VM for each!

I haven’t used virtual servers for testing patch application. But that comes back to the amount of resources spent on things. If you have the time/money then testing patches before applying them in production is a good idea.

etbe – Russell Coker
