Redundancy in Network Infrastructure

It’s generally accepted that certain things need redundancy. RAID is generally regarded as essential for every server except in the corner case of compute clusters, where a few nodes can go offline without affecting the results (e.g. the Google servers). Having redundant network cables with some sort of failover system between big switches is regarded as a good idea, and multiple links to the Internet are regarded as essential for every serious data-center and are gaining increasing acceptance in major corporate offices.

Determining whether you need redundancy for a particular part of the infrastructure is done on the basis of the cost of the redundant device (in terms of hardware and staff costs related to installing it), the cost of not having it available, and the extent to which the expected down-time will be reduced by having some redundancy.
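As a rough illustration of that trade-off, here is a minimal sketch in Python. The function name, parameters, and figures are all hypothetical and chosen only for the example; a real estimate would also need to account for things like the probability of failure over the period being considered.

```python
def redundancy_net_benefit(redundancy_cost,
                           downtime_cost_per_hour,
                           expected_downtime_hours,
                           expected_downtime_hours_with_redundancy):
    """Return the expected saving from redundancy minus what it costs.

    All inputs are estimates for the same period (e.g. one year).
    A positive result suggests the redundancy is worth having.
    """
    # Hours of down-time the redundant device is expected to prevent.
    hours_saved = expected_downtime_hours - expected_downtime_hours_with_redundancy
    # Money saved by avoiding those hours of down-time.
    expected_saving = hours_saved * downtime_cost_per_hour
    # Hardware plus staff cost of installing the redundant device.
    return expected_saving - redundancy_cost


# Hypothetical figures: a spare costing 2000 that cuts expected
# down-time from 10 hours to 1 hour per year.
busy_site = redundancy_net_benefit(2000, 500, 10, 1)   # down-time costs 500/hour
quiet_site = redundancy_net_benefit(2000, 50, 10, 1)   # down-time costs 50/hour
print(busy_site > 0, quiet_site > 0)
```

With these made-up numbers the same spare is worth buying where down-time is expensive and not worth it where down-time is cheap, which is the point of weighing all three factors together.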

It’s also regarded as a good idea to have more than one person with the knowledge of how to run the servers. Jokes are often made about what might happen if a critical person “fell under a bus”, but more mundane things such as the desire to take an occasional holiday or a broken mobile phone can require a backup person.

One thing that doesn’t seem to get any attention is redundancy in the machine used for system administration. I’ve been using an EeePC [1] for supporting my clients, and it’s been working really well for me. Unfortunately I have misplaced the power supply, so I need to replace the machine (if only for the time taken to find the PSU). I have some old Toshiba Satellite laptops; they are quite light by laptop standards (but still heavier than the EeePC) and they only have 64M of RAM, but as a mobile SSH client they will do well. So my next task is to set up a Satellite as a backup machine for my network support work.

It seems that this problem is fairly widespread. I’ve worked in a few companies with reasonably large sysadmin teams. The best managed one had a support laptop that was assigned to the person who was on-call outside business hours. That laptop was not backed up (to the best of my knowledge, it was never connected to the corporate LAN so it seems that no-one had an opportunity to do so) and there was no second machine.

One thing I have been wondering is what happens to laptops with broken screens when the repair price exceeds the replacement cost. I wouldn’t mind buying an EeePC with a broken screen if it comes with a functional PSU; I could use it as a portable server.

6 comments to Redundancy in Network Infrastructure

  • I met some blind guys. They were giving a presentation about accessibility. Two out of three laptops had a damaged or broken screen. Lucky for me they didn’t use a slide show.

  • AlphaG

    and that is why many organisations have moved admin consoles and that functionality to terminal served environments, with a local set of tools as a backup. In this way they can control the security of who can get access to the tools (lockout those being fired), their levels of availability (can be available from multiple locations) etc. Therefore the concept of a single device failure is removed to another managed service in the same way a server is managed…my 5cents anyhow

  • Josh Goodall

    Building on the notion of laptop-as-a-server: I know a chap who used to deploy remote content delivery networks by renting a quarter-rack in a DC with transit, shipping a carton full of not-quite-latest-generation laptops running {a free Unix}, and having some local tech plug it all together.

    Result: near-instantly deployed shared-nothing CDN geocluster, with low power consumption, built-in backup power capability, high physical density, and trivially field-replaceable nodes. At least five years ahead of Google’s server design, with many of the headline characteristics.

  • Oi! We run RAID on our clusters! We stripe over all the disks to get good I/O speed.. ;-)

  • etbe

    AlphaG: You still need to be able to access the Internet (non-trivial to set up with 3G broadband) and have suitable software to act as a client for the admin consoles. However easy it may be to set these things up it’s something you don’t want to do in a rush when the network unexpectedly goes down.

    Josh: I’ve had similar ideas in the past, but clients haven’t liked them. Strange really because apart from the lack of ECC RAM laptops can make great servers.

    Chris: Striping (RAID-0) is not redundant, so it’s hardly worthy of the acronym RAID. Maybe you could call it an “AID array”. ;)

  • @etbe: well, it certainly AID’s our users! ;-)

    I’ll get me coat..