SCSI Failures

For a long time SCSI was widely regarded as the interface for all serious drives that were suitable for “Enterprise Use” or for anything else that requires reliable operation. IDE, on the other hand, was for cheap disks that were only suitable for home use. The SCSI vs IDE issue continues to this day, but now we have SAS and SATA filling the same market niches. The main difference between the current debate and the debate a decade ago is that a SATA disk can be connected to a SAS bus.

Both SAS and SATA have a single data cable for each disk which avoids the master/slave configuration on IDE and the issue of bus device ID number (from 0-7 or 0-15) and termination on SCSI.

Termination

When a high speed electrical signal travels through a cable some portion of the signal will be reflected from any cable end point or any point of damage. To prevent the signal reflection from the end of a cable you can have a set of resistors (or some other terminating device) at the end of the cable; see the Terminator (electrical) Wikipedia page [1] for a brief overview. As an aside I think that page could do with some work. If you are an EE with a bit of spare time then improving that page would be a good thing.

SCSI was always designed to have termination while IDE never was. I presume that this was largely due to the cable length (18″ for IDE vs 1.5m to 25m for SCSI) and the number of devices (2 for IDE vs 7 or 15 for SCSI). I also presume that some of the problems that I’ve had with IDE systems have been related to signal problems that could have been avoided with a terminated bus.

My first encounter with SCSI was when working for a small business that focused on WindowsNT software development. Everyone in the office knew a reasonable amount about computers and was happy to adjust the hardware of their own workstation. A room full of people who didn’t understand termination who fiddled with SCSI buses tended to give a bad result. On the up-side I learned that a SCSI bus can work most of the time if you have a terminator in the middle of the cable and a hard drive at the end.

There have been two occasions when I’ve been at ground zero for a large deployment of servers from a company I’ll call Moon Computers. In both cases there were two particularly large and expensive servers in a cluster and one of the cluster servers had data loss from bad SCSI termination. This is particularly annoying as the terminators have different colours, so all that was needed to get the servers working was to change the hardware to make them look the same. As an aside the company with no backups [2] had one of the servers with bad SCSI termination.

Heat

SCSI disks and now SAS disks tend to be designed for higher performance, this usually means greater heat dissipation. A disk that dissipates a lot of heat won’t necessarily work well in a desktop case with small and quiet fans. This can become a big problem if you have workstations running 24*7 in a hot place (such as any Australian city that’s not in Tasmania) and turn the air-conditioner off on the weekends. One of my clients lost a few disks before they determined that IDE disks are the only option for systems that are to survive Australian heat without any proper cooling.

Differences between IDE/SATA and SCSI/SAS

In 2009 I wrote about vibration and SATA performance [3]. Rumor has it that SCSI/SAS disks are designed to operate in environments where there is a lot of vibration (servers with lots of big fans and fast disks) while IDE/SATA disks are designed for desktop and laptop systems in quiet environments. One thing I’d like to do is to test performance of SATA vs SAS disks in a server that vibrates.

SCSI/SAS disks have apparently been designed for operation in a RAID array and therefore will give a faster timeout on a read error (so another disk can return the data), while IDE/SATA disks are designed for a non-RAID situation and will spend longer trying to read the data.

There are also various claims about the error rates from SCSI/SAS disks being better than those of IDE/SATA disks. But I think that in all cases the error rates are small enough not to be a problem if you use a filesystem like ZFS or BTRFS but they are also large enough to be a significant risk with modern data volumes if you have a lesser filesystem.
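To give a feel for why small error rates matter at modern data volumes, here is a rough back-of-the-envelope sketch. The two rates used (one unrecoverable read error per 10^14 bits and per 10^15 bits) and the 4TB volume are illustrative assumptions of the kind of figure quoted on disk datasheets, not measurements of any particular disk.

```python
def expected_read_errors(bytes_read: float, bits_per_error: float) -> float:
    """Expected number of unrecoverable read errors when reading bytes_read bytes.

    bits_per_error is the assumed average number of bits read per error.
    """
    return bytes_read * 8 / bits_per_error


# Assumed, illustrative figures only (not taken from any specific disk).
ASSUMED_DESKTOP_BITS_PER_ERROR = 1e14
ASSUMED_SERVER_BITS_PER_ERROR = 1e15
VOLUME_BYTES = 4e12  # reading a hypothetical 4TB disk from end to end

for label, rate in (
    ("desktop-class", ASSUMED_DESKTOP_BITS_PER_ERROR),
    ("server-class", ASSUMED_SERVER_BITS_PER_ERROR),
):
    errors = expected_read_errors(VOLUME_BYTES, rate)
    print(f"{label}: about {errors:.2f} expected errors per full read")
```

Under those assumptions neither class of disk gets to zero, which is why a filesystem that can detect and correct the occasional bad block matters more than the difference between the two rates.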

Data Loss from Storage Failure

In the data loss that I’ve personally observed from storage failures the loss from SCSI problems (termination and heat) is about equal to all the hardware related data loss I’ve seen on IDE disks. Given that the majority of disks I’ve been responsible for have been IDE and SATA that’s a bad sign for SCSI use in practice.

But all serious data loss that I’ve seen has involved the use of a single disk (no RAID) and inadequate backups. So a basic RAID-1 or RAID-5 installation will solve most hardware related data loss problems.

There was one occasion when heat caused two disks in a RAID-1 to give errors at the same time, but by reading from both disks I managed to get almost all the data back, RAID can save you from some extreme error conditions. That situation would have been ideal for BTRFS or ZFS to recover data.
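The idea of reading from both disks can be sketched roughly as follows. This is not the procedure I actually used, just a minimal illustration that assumes each disk has already been read into a list of blocks, with None marking a block that gave a read error.

```python
from typing import List, Optional, Sequence, Tuple

Block = Optional[bytes]  # None marks a block that could not be read


def merge_mirrors(
    disk_a: Sequence[Block], disk_b: Sequence[Block]
) -> Tuple[List[Block], List[int]]:
    """Combine two partially readable RAID-1 mirrors block by block.

    Returns the merged blocks and the indexes of blocks that were
    unreadable on both disks (and are therefore lost).
    """
    merged: List[Block] = []
    lost: List[int] = []
    for index, (a, b) in enumerate(zip(disk_a, disk_b)):
        block = a if a is not None else b  # prefer disk A, fall back to disk B
        if block is None:
            lost.append(index)
        merged.append(block)
    return merged, lost


merged, lost = merge_mirrors([b"one", None, None], [b"one", b"two", None])
print(merged, lost)
```

Note that this sketch can only tell when a block failed to read. If both disks return data but the data differs, plain RAID-1 has no way of knowing which copy is correct, which is where the checksums in BTRFS or ZFS help.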

Conclusion

SCSI and SAS are designed for servers, using them in non-server systems seems to be a bad idea. Using SATA disks in servers can have problems too, but not typically problems that involve massive data loss.

Using technology that is too complex for the people who install it seems risky. That includes allowing programmers to plug SCSI disks into their workstations and whoever it was from Moon computers or their resellers who apparently couldn’t properly terminate a SCSI bus. It seems that the biggest advantage of SAS over SCSI is that SAS is simple enough for most people to be able to correctly install it.

Making servers similar to the systems that the system administrators use at home seems like a really good idea. I think that one of the biggest benefits of using x86 systems as servers is that skills learned on home PCs can be transferred to administration of servers. Of course it would also be a good idea to have test servers that are identical to servers in production so that the sysadmin team can practice and make mistakes on systems that aren’t mission critical, but companies seem to regard that as a waste of money – apparently the risk of down-time is cheaper.

No Backups WTF

Some years ago I was working on a project that involved a database cluster of two Sun E6500 servers that were fairly well loaded. I believe that the overall price was several million pounds. It’s the type of expensive system where it would make sense to spend adequately to do things properly in all ways.

The first interesting thing was the data center where it was running. The front door had a uniformed security guard and a sign threatening immediate dismissal for anyone who left the security door open. The back door was wide open for the benefit of the electricians who were working there. Presumably anyone who had wanted to steal some servers could have gone to the back door and asked the electricians for assistance in removing them.

The system was poorly tested. My colleagues thought that with big important servers you shouldn’t risk damage by rebooting them. My opinion has always been that rebooting a cluster should be part of standard testing and that it’s especially important with clusters which have more interesting boot sequences. But I lost the vote and there was no testing of rebooting.

Along the way there were a number of WTFs in that project. One of which was when the web developers decided to force all users to install the latest beta release of Internet Explorer, a decision that was only revoked when the IE install process broke MS-Office on the PC of a senior manager. Another was putting systems with a default Solaris installation live on the Internet with all default services running, there’s never a reason for a database server to be directly accessible over the Internet.

No Backups At All

But I think that the most significant failing was the decision not to make any backups. This wasn’t merely forgetting to make backups, when I raised the issue I received a negative reaction from almost everyone. As an aside I find it particularly annoying when someone implies that I want backups because I am likely to stuff things up.

There are many ways of proving that there’s a general lack of competence in the computer industry. But I think that one of the best is the number of projects where the person who wants backups has their competence questioned instead of all the people who don’t want backups.

A decision to make no backups relies on one of two conditions, either the service has to be entirely unimportant or you need to have no bugs in the OS or hardware defects that can corrupt data, no application bugs, and a team of sysadmins who never make mistakes. The former condition raises the question of why the service is being run and the latter is impossible.

As I’m more persistent than most people I kept raising the issue via email and adding more people to the CC list until I got a positive reaction. Eventually I CC’d someone who responded with “What the fuck” which I consider to be a reasonable response to a huge and expensive project with no backups. However the managers on the CC list regarded the use of profanity in email to be a much more serious problem. To the best of my knowledge there were never any backups of that system but the policy on email was strongly enforced.

This is only a partial list of WTF incidents that assisted in my decision to leave the UK and migrate to the Netherlands.

Not Doing Much

About a year after leaving I returned to London for a holiday and had dinner with a former colleague. When I asked what he was working on he said “Not much“. It turned out that proximity to the nearest manager determined the amount of work that was assigned. As his desk was a long way from the nearest manager he had spent about 6 months getting paid to read Usenet. That wasn’t really a surprise given my observations of the company in question.

Woolworths Maths Fail

picture of discount from $3.99 to $3.00 advertised as 20% off

The above is a picture of the chocolate display at Woolworths, an Australian supermarket that was formerly known as Safeway – it had the same logo as the US Safeway so there’s probably a connection. This is actually a 24.81% discount. It’s possible that some people might consider it a legal issue to advertise something as a 25% discount when it’s 1 cent short of that (even though we haven’t had a coin smaller than 5 cents in Australia since 1991). But then if they wanted to advertise a discount percentage that’s a multiple of 5% they could have made the discount price $2.99, presumably whatever factors made them make the original price $3.99 instead of $4.00 would also apply when choosing a discount price.

So the question is, do Woolworths have a strict policy of rounding down discount rates to the nearest 5% or do they just employ people who failed maths in high school?
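For anyone who wants to check the arithmetic, here is a small sketch. The function names are my own, and the "round down to a multiple of 5%" rule is only the hypothetical policy speculated about above.

```python
import math


def discount_percent(original: float, discounted: float) -> float:
    """Percentage taken off the original price."""
    return (original - discounted) / original * 100


def round_down_to_multiple(value: float, step: float = 5) -> float:
    """Round value down to the nearest multiple of step."""
    return math.floor(value / step) * step


actual = discount_percent(3.99, 3.00)
print(f"{actual:.2f}%")  # 24.81%
print(round_down_to_multiple(actual))  # 20
print(round_down_to_multiple(discount_percent(3.99, 2.99)))  # 25
```

So a discount price of $2.99 would have allowed an honest "25% off" sign even under a strict round-down policy.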

Sometimes when discussing education people ask rhetorical questions such as “when would someone use calculus in real life”, I think that the best answer is “people who have studied calculus probably won’t write such stupid signs”. Sure the claimed discount is technically correct as they don’t say “no more than 20% off” and not misleading in a legal sense (it’s OK to claim less than you provide), but it’s annoyingly wrong. Well educated people don’t do that sort of thing.

As an aside, the chocolate in question is Green and Black, that’s a premium chocolate line that is Fair Trade, Organic, and very tasty. If you are in Australia then I recommend buying some because $3.00 is a good price.

Take Off that Stupid Helmet

Recently I was walking through a park and heard a woman call out “Take off that stupid helmet”. Usually I ignore what other people are saying but that seemed noteworthy. It turned out that a young boy (maybe 4yo) was being taught to ride a bike and his parents seemed to think that wearing a helmet was a bad idea. There is ongoing debate about the benefit to an adult in wearing a helmet while riding a bike. But it seems clear that for a young child riding on a concrete path a helmet is a really good thing. When it became apparent that everyone in the park was watching, the parents decided to have him ride on the grass instead.

On a related note I was recently talking to an employee of a roadside assistance company about what happens when a child is locked in a car. Apparently if a child is locked in a car with the keys the emergency services people won’t smash a window as long as the child is kicking and screaming. While the child is obviously in distress they apparently aren’t going to immediately die and that’s OK, but when they go quiet it’s time to damage the car to save them! I can imagine situations when it’s OK for the emergency services people to wait for a car expert to open the car without damage, if the weather is cool and the child seems happy then a delay probably doesn’t matter much. But if the child is in distress then the attitude that anything which doesn’t kill the kid is OK seems wrong.

The Security Benefits of Automation

Some Random WTFs

The Daily WTF is an educational and amusing site that recounts anecdotes about failed computer projects. One of their stories titled “Remotely Incompetent” concerns someone who breaks networking on a server and is then granted administrative access to someone else’s server by the Data Center staff [1]!

In one of the discussions about that I saw people make various claims about Data Center security, such as claiming that having their own locked room helps. My experience indicates that such things don’t do much good, I have often been granted access to server rooms without appropriate checks.

My experience is that security guards on site generally don’t directly do any good. I once had a guard hold a door for me when I was removing a server from a DC without even bothering to ask for ID! On another occasion in the Netherlands I had a security guard who didn’t speak English unlock the wrong server room for me, I used hand gestures to inform him that I needed access to the room with the big computers and he gave me the access I needed! It seems that the benefit of security guards is solely based on scaring people who don’t have the confidence needed to bluff their way in. Preventing children from thieving is a good thing,

On another occasion I showed ID and signed in for access to a DC owned by my employer and I used my security key to go through a locked door with a sign that promised many bad consequences if I failed to lock the door behind me. Then I discovered that the back door was wide open for the benefit of some electricians who were working in the building. Presumably the electricians who had no security training were expected to act as ad-hoc security guards if someone tried to enter through the back door – presumably they would not have been good at it.

When a company uses part of their own office for a server room then many of these problems disappear. But a common issue in such ad-hoc DCs is the lack of planning and procedures. I have lost count of the number of times I’ve seen doors (and even windows) propped open to allow ventilation because there were too many servers for the air-conditioning to cope. The most ironic example of this is the company that had a walk-in safe (think of a small bank vault with concrete walls and thick solid steel door) used for storing servers but with its door propped open to allow cooling. The advantage of a serious hosting company is that they will have procedures for cooling etc and will be very unlikely to do strange and silly things.

Having a locked room in a DC makes some sense, but if security guards have the master keys and are allowed to use them then it might not do much good. The one time I locked my keys in such a room I had a guard let me in without verifying my ID or the claim that there were actually keys locked in the room. Presumably anyone could just claim to have forgotten their keys and get the door unlocked – just like a cheap hotel.

Locking a rack sounds like a good idea, but the racks I’ve seen have had locks which are quite easy to pick. On the one occasion when I had to pick a lock on a rack (due to keys being too difficult to manage for the relevant people) the security guards didn’t investigate, so either the security cameras were not supervised or they just didn’t care about people picking locks in a shared server room. Also if you allow people to do things freely in a shared server room they could install devices to monitor network traffic.

A locked cage in a server room should work well. In the one case where I worked for a company that used such a cage I found it to mostly work well – apart from the few weeks when the lock was broken.

One company that I worked for had scales before the door between a server room and the car-park to prevent people from stealing heavy servers. Of course that wouldn’t stop people stealing hard drives full of data which is worth more than the servers! Also an over-weight colleague had to have the scales disabled for him (as they were based on absolute mass not unexpected changes in an individual’s mass) which presumably means that any skinny employee could steal a 2RU server and still be below the mass threshold.

How to Solve some of these Problems

Computers are subject to all manner of security problems. But they tend not to do arbitrary things for no apparent reason and they will never give in to someone who is charming, attractive, or aggressive – unlike humans.

I have servers running on Hetzner, Linode, and the Rackspace Cloud. I am always concerned about possible security compromises. But I am not worried about someone climbing in a window of a server room or convincing a security guard to let them in through the door. All three of those hosting companies have the vast majority of interactions automated. I can change many aspects of the servers without involving ANY human interaction. Out of the three of those companies I have had some human interaction with Hetzner (who provide managed servers) when a hard drive needed to be replaced – obviously replacing a disk in the wrong server would have been a significant system integrity issue even though everyone would be running RAID-1 and if Hetzner improperly disposed of the broken disk then there could be security issues – but this is an unlikely mistake in the face of a rare occurrence. With Linode and the Rackspace Cloud (and the previous Slicehost hosting that was purchased by Rackspace) the most common interactions I have with employees of those companies are when my clients don’t pay their bills on time – and that’s an administrative not a technical issue. When I do have to contact the support people about a technical issue it’s usually something that’s not immediately connected to the virtual server (EG a loss of routing to the DC).

It seems most likely that there are a fairly small number of people who are allowed in the DCs for companies like Hetzner, Linode, and Rackspace. Those people would probably be recognised by the security guards and their work would be restricted to replacing failing hardware and not involve granting access requests. There are some unusual requests that they can process (EG one of my clients recently transferred a virtual server between business units) but even in those cases the administrative software controls who gets access. This is much better than just handing hardware access to what seems to be the correct physical server to a client.

If you have software running a few computers and operating correctly then you can probably scale it up to run thousands of computers and have it still work correctly. But if you have a team of people controlling access requests and want to scale it up significantly then there are huge problems in hiring skilled people and training them correctly. There is a real risk of security flaws in such administrative software, if someone managed to exploit the automated management system for one of those three companies then they could probably gain access to the private data of any of their customers. But the risk of this seems a lot less than the risk of general incompetence among humans who perform routine and boring tasks which have the potential for great errors.

Sociological Images

I’ve recently been reading the Sociological Images blog [1]. That site has lots of pictures and videos that are relevant to the study of Sociology (most of which have a major WTF factor) and it’s run by people who have Ph.Ds in Sociology so the commentary is insightful. Since reading that I’ve started photographing relevant things.

woman in straitjacket advertising energy prices

I can’t work out the logic behind the above advert for Energy Watch which was on a billboard near Ringwood Station in Melbourne, Australia. The only thing that is clear is that it spreads bad ideas about mental illness and psychiatric treatment. It doesn’t make me want to do business with them.

Antons full display

The above picture is a shop-front for the Antons clothing store (I’m not sure if they are a tailor or if they sell ready to wear). It was taken on Lonsdale St, Melbourne where the store apparently used to be, now they are in Melbourne Central.

Antons left display, African and Southern European

Antons right display, Northern European and Japanese

The above pictures show more detail. Unfortunately the combination of lighting and my camera (Xperia X10 phone camera) wasn’t adequate to show the apparent ethnic differences between the two men. It seems that the most likely Australian interpretation of the ethnic groups that are represented are African (maybe Afro-American), Southern-European or maybe American Hispanic, North-Western European, and Japanese. It’s good to have mannequins representing the fact that not everyone in Australia is white, but different facial expressions for different races seems a strange choice (admittedly it might be a choice made by mannequin manufacturers). Also the Japanese woman with fan idea is rather outdated.

I’ve just started reading You May Ask Yourself: An Introduction to Thinking Like a Sociologist (Second Edition) by Dalton Conley. I’ve only read the first chapter, but that was good enough that the entire book has to be good enough to recommend.

How Not to Park a Mercedes

Why is that Mercedes blocking so much of the road through the station car park?

It’s because it’s the second car in a 1 car parking spot!

Here’s a front view.

These were taken in the Coburg Station car park on Wednesday. The car park was about half empty, so the alternative to blocking part of the road and blocking in someone who was legally parked was to just park about 10 meters further away from the station.

Some people just shouldn’t be driving.

Email Passwords

I was doing some routine sysadmin work for a client when I had to read mail in the system administration mailbox. This mailbox is used for cron job email, communication with ISPs that run servers for the company, and other important things. I noticed that the account was subscribed to some mailing lists related to system administration, the following is from one of the monthly messages from a list server:

Passwords for sysadmin@example.com:

List                          Password // URL
----                          --------
whatever-users@example.org    victoria3

That doesn’t seem terribly exciting, unless you know that the password used for the list server happens to be the same as the one used for POP and IMAP access to the account in question, and that it is available as webmail… Of course I didn’t put the real password in my blog post, I replaced it with something conceptually similar and equally difficult to guess (naturally I’ve changed the password). The fact that the password wasn’t a string of 8 semi-random letters and digits is not a good thing, but not really bad on its own. It’s only when the password gets used for 3rd party servers that you have a real problem.

I wonder how many list servers are run by unethical people who use the passwords to gain access to email accounts, and how many hostile parties use such lists of email addresses and passwords when they compromise servers that run mailing lists.

Now there would be an obvious security benefit to not having the list server store the password in clear-text or at least not send it out every month. Of course the down-side to doing that is that it doesn’t give someone like me the opportunity to discover the problem and change the password.
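As a rough sketch of what not storing the password in clear-text could look like, the following uses a salted PBKDF2 hash from the Python standard library. This is not how any particular list server works, and the function names and the iteration count are my own assumptions.

```python
import hashlib
import hmac
import os
from typing import Tuple

ASSUMED_ITERATIONS = 200_000  # assumed work factor, tune for your hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to store instead of the clear-text password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ASSUMED_ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Check a login attempt against the stored salt and digest."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ASSUMED_ITERATIONS
    )
    return hmac.compare_digest(candidate, digest)


salt, digest = hash_password("victoria3")
print(verify_password("victoria3", salt, digest))
print(verify_password("victoria4", salt, digest))
```

A server that only keeps the salt and digest can still check a password, but it has nothing to put in a monthly reminder message and nothing directly useful to hand to whoever compromises it.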

Bugs and User Practice

Wouter points out a mistake in one of my blog posts which was based on old data [1]. My original post was accurate for older distributions of Linux but since then the bug in question was fixed.

Normally when writing blog posts or email I do a quick test before committing the text to avoid making mistakes (it’s easy to mis-remember things). However in this case the bug would dead-lock machines which made me hesitant to test it (I didn’t have a machine that I wanted to dead-lock).

There are two lessons to be learned from this. The most obvious is to test things thoroughly before writing about them (and have test machines available so that tests which cause service interruption or data loss can be performed).

The next lesson is that when implementing software you should try not to have limitations that will affect user habits in a bad way. In the case of LVM, if the tool lvextend had displayed a message such as “Resizing the root LV would dead-lock the system, until locking is fixed such requests will be rejected – please boot from a rescue disk to perform this operation” then I would have performed a test before writing the blog post (as it would be a harmless test to perform). Also on occasion when I really wanted to resize a root device without a reboot I would have attempted the command in the hope that LVM had been fixed.
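The sort of check I have in mind can be sketched as below. This is purely hypothetical and is not code from LVM. The function name, the exception and the locking_fixed flag are all invented for illustration.

```python
class UnsafeOperation(Exception):
    """Raised when a request is rejected because it is known to be unsafe."""


def check_resize(target_lv: str, root_lv: str, locking_fixed: bool = False) -> None:
    """Reject a resize of the root LV instead of risking a dead-lock.

    All names here are hypothetical; the point is only that the tool
    fails with a message before doing anything dangerous.
    """
    if target_lv == root_lv and not locking_fixed:
        raise UnsafeOperation(
            "Resizing the root LV would dead-lock the system. "
            "Until locking is fixed such requests will be rejected. "
            "Please boot from a rescue disk to perform this operation."
        )


try:
    check_resize("/dev/vg0/root", root_lv="/dev/vg0/root")
except UnsafeOperation as error:
    print(f"Rejected: {error}")
```

With a guard like this the worst outcome of trying the command is an error message, so users can safely test whether the limitation still applies.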

A bug that deadlocks a system is one that will really have an adverse effect on users, both their habits in future use and the probability of them using the software in future. A bug (or missing feature) that displays a warning message will be much less of a problem.

From now on I will still be hesitant in using lvextend on an LV for a root filesystem on any machines other than the very latest for fear that they will break. The fact that lvextend will sometimes work on the root filesystem and sometimes put the machine offline is a serious problem that impacts my use of that feature.

Most people won’t be in a position to have a bug or missing feature that deadlocks a system, but there are an infinite number of ways that software can potentially interrupt service or destroy data. Having software fail in a soft way such that data is not lost is a significant benefit for users and an incentive to use such software.

I’ve put this post in the WTF category, because having a dead-lock bug in a very obvious use-case of commonly used software really makes me say WTF.

Bad Project Management

I have just read a rant by Sean Middleditch about bad project management [1]. He describes his post as “personal, rather angsty, and especially whiny” but I think it’s useful and informative. He makes some interesting technical points about PHP programming (I wasn’t aware that there were so many ways of easily getting things wrong and so much difficulty in getting them right). But of course this isn’t all limited to PHP, the web site WorseThanFailure.com has anecdotes about mistakes of similar calibre being implemented in every language imaginable.

Sean is apparently considering leaving the computer industry after having numerous bad experiences of having highly paid people mess up projects while he gets paid a lot less to try and fix the worst of the bugs and get the systems working in production. I understand what it’s like, I have occasionally idly contemplated leaving the industry after bad projects. However the fun of working on free software combined with the amounts of money that I can earn in the computer industry made me quickly abandon such ideas.

His stories in some ways resemble my experiences in working as a contractor, most of my contracts have been profoundly weird for various reasons (I’ll use the WTF [2] category of this blog to document some of them). I had two theories as to why I ended up in so many strange contracts, one was that I was in some sort of Twilight Zone and the other was that taking contracts based on the amount of money offered puts you at high risk of being employed by people who have no financial pressure to do things in a sensible manner.

My advice to anyone in such a situation is to try and find a contract position paying an unreasonable amount of money. Getting more than $80 an hour (the rate Sean cites as being paid to the idiots who cause problems) is going to be difficult, but getting $50 or $60 an hour is much easier to achieve and should be enough to alleviate the pain of working on doomed projects.