
Some Tips for Shell Code that Won’t Destroy Your OS

When writing a shell script you need to take some care to ensure that it won’t run amok. Extra care is needed for shell scripts that run as root, firstly because of the obvious potential for random destruction, and secondly because of the potential for interaction between accounts that can cause problems.

  • One possible first step towards avoiding random destruction is to start your script with “#!/bin/sh -e” instead of “#!/bin/sh”. This means that the script will exit on an unexpected error, which is generally better than continuing merrily along to destroy vast swathes of data. Of course sometimes you will expect an error, in which case you can use “/usr/local/bin/command-might-fail || true” to stop the script aborting on a command that might fail.
  • Instead of using the “-e” switch to the shell you can put “|| exit 1” after a command that really should succeed. For example, neither of the following two scripts is likely to destroy your system:
    #!/bin/sh -e
    cd /tmp/whatever
    rm -rf *

    #!/bin/sh
    cd /tmp/whatever || exit 1
    rm -rf *

    The following script, however, is very likely to destroy your system, because if the “cd” fails then the “rm” will run in whatever directory the script happened to start in:
    #!/bin/sh
    cd /tmp/whatever
    rm -rf *
  • Also consider using absolute paths. “rm -rf /tmp/whatever/*” is as safe as the above option but also easier to read – and avoiding confusion tends to improve the reliability of the system. Relative paths are most useful for humans typing at an interactive shell; when a program is running there is no real down-side to using long absolute paths.
  • Shell scripts that cross account boundaries are a potential cause of problems. For example, if a script does “cd /home/user1” instead of “cd ~user1”, and someone in the sysadmin team moves the user’s home directory to /home2/user1 (which is not uncommon when disk space runs low), then things can happen that you don’t expect – and we really don’t want unexpected things happening as root! Most shells don’t support “cd ~$1”, but that doesn’t force you to use “cd /home/$1”; instead you can use some shell code such as the following:
    #!/bin/sh
    HOME=`grep "^$1:" /etc/passwd|head -1|cut -f6 -d:`
    if [ "$HOME" = "" ]; then
      echo "no home for $1"
      exit 1
    fi
    cd ~


    I expect that someone can suggest a better way of doing that. My point is not to try to show the best way of solving the problem, merely to show that hard-coding assumptions about paths is not necessary. You don’t need to solve a problem in the ideal way; any way that doesn’t have a significant probability of making a server unavailable and denying many people the ability to do their jobs will do. Also consider using different tools – zsh supports commands such as “cd ~$1”.
  • When using a command such as find make sure that you appropriately limit the results; in the case of find that means using options such as -xdev, -type, and -maxdepth. If you mistakenly believe that permission mode 666 is appropriate for all files in a directory then it won’t do THAT much harm. But if your find command goes wrong and starts applying such permissions to directories and crosses filesystem boundaries then your users are going to be very unhappy.
  • Finally, when multiple scripts use the same data, consider using a configuration file. If you feel compelled to do something grossly ugly such as writing a dozen expect scripts which use the root password then at least make it an entry in a configuration file so that it can be changed in one place. It seems that every time I get a job working on systems that other people have maintained there is at least one database, LDAP directory, or Unix root account whose password can’t be changed because no-one knows how many scripts have it hard-coded. It’s usually the most important server, database, or directory too.
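The home directory lookup above can be done more robustly with getent(1), which also covers accounts stored in LDAP or NIS rather than just /etc/passwd, and the trailing colon stops “bob” from matching “bobby”. This is only a sketch – the “daemon” default user is purely for demonstration:

```shell
#!/bin/sh -e
# Sketch: look up a user's home directory with getent(1) instead of
# grepping /etc/passwd, so accounts from LDAP or NIS also work.
# The "daemon" default is only so the script runs standalone.
user="${1:-daemon}"
home=$(getent passwd "$user" | cut -d: -f6)
if [ -z "$home" ]; then
  echo "no home for $user" >&2
  exit 1
fi
cd "$home"
```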

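As a hedged illustration of limiting find – the directory, depth, file pattern, and permission mode below are hypothetical examples, not a recommendation for any particular system:

```shell
#!/bin/sh -e
# Only act on plain files (-type f), stay on one filesystem (-xdev),
# and limit recursion (-maxdepth) so a mistake can't cascade into
# directories or other mounts.  The demo directory is created here
# so the script is self-contained.
dir="${1:-/tmp/find-demo}"
mkdir -p "$dir/sub"
touch "$dir/index.html" "$dir/sub/page.html"
find "$dir" -xdev -maxdepth 3 -type f -name '*.html' -exec chmod 644 {} +
```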
Please note that nothing in this post is theoretical, it’s all from real observations of real systems that have been broken.
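The configuration file suggested in the last tip can be as simple as a file of variable assignments that every script sources. The file name and variables here are hypothetical, and the file is generated inline only to keep the sketch self-contained:

```shell
#!/bin/sh -e
# Sketch: shared settings live in one file, so a hostname or password
# only ever has to be changed in one place.  In practice the file
# would be maintained by hand in /etc rather than generated like this.
conf=/tmp/backup-scripts.conf
cat > "$conf" <<'EOF'
BACKUP_SERVER=backup.example.com
BACKUP_DIR=/backup-store
EOF

# Every script sources the file instead of hard-coding the values:
. "$conf"
echo "would copy dumps to $BACKUP_SERVER:$BACKUP_DIR"
```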

Also note that this is not an attempt at making an exhaustive list of ways that people may write horrible scripts, merely enough to demonstrate the general problem and encourage people to think about ways to solve the general problems. But please submit your best examples of how scripts have broken systems as comments.

First Dead Disk of Summer

Last night I was in the middle of checking my email when I found that clicking on a URL link wouldn’t work. It turned out that my web browser had become unavailable due to a read error on the partition for my root filesystem (the usual IDE uncorrectable error thing). My main machine is a Thinkpad T41p; it is apparently possible to replace the CD-ROM drive with a second hard drive to allow RAID-1, but I haven’t felt inclined to spend the money on that. So any hard drive error is a big problem.

Fortunately I had made a backup of /home only a few days ago. I use offline IMAP for my email so that my recent email (the most variable data that matters to me) is stored on a server with RAID-1 as well as on my laptop and my netbook. The amount of other stuff I’ve been working on in my home directory is fairly small, and the amount of that which isn’t on other systems is even smaller (I usually build packages on servers and then scp the relevant files to my laptop for Debian uploads, bug reports, etc.).

The first thing I did was to ssh to one of my servers and paste a bunch of text from various open programs into a file there. That was the contents of all open programs, the URLs of web pages I was reading, and the contents of an OpenOffice spreadsheet which I couldn’t save directly (it seems that a read-only /tmp will prevent OpenOffice from saving anything). Then I used scp to copy 600M of ted.com videos that I hadn’t backed up; I don’t usually back up such things, but I don’t want to download them twice if I can avoid it (I only have a quota of 25G per month).

After that I made new backups of all filesystems starting with /home. I then used tar to backup the root filesystem.

The hard drive in the laptop only had a single bad sector, so I could have re-written it so that it would be remapped (as I have done before with that disk), but I think that on a 5yo disk it’s probably best to replace it. I had been thinking of installing a larger disk anyway.

To restore, I started from a month-old backup of the root filesystem and then used “diff -r” to discover what had changed; it took me less than an hour to merge the changes from the corrupted root filesystem into the restored one.

Now I have lots of free disk space and no data loss!

I am now considering making an automated backup system for /home. My backup method is to make an LVM snapshot of the LV which is used and then copy that – this gets the encrypted data so I can safely store it on USB devices while traveling. I could easily write a cron job that uses scp to transfer a backup to one of my servers at some strange time of the night.
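A minimal sketch of such a cron job, assuming a volume group named vg0 containing the encrypted /home LV as home_crypt and a server called backup.example.com – all of those names are assumptions. Snapshotting below the encryption layer means the copy is ciphertext and safe to store remotely:

```shell
#!/bin/sh -e
# Hypothetical nightly backup of an encrypted /home LV.  The snapshot
# is taken of the raw (still encrypted) device, so the image can be
# stored on a remote server or USB device without exposing any data.
lvcreate --snapshot --size 1G --name home_snap /dev/vg0/home_crypt
dd if=/dev/vg0/home_snap bs=1M | \
  ssh backup.example.com "cat > /backup-store/home-$(date +%Y-%m-%d).img"
lvremove -f /dev/vg0/home_snap
```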

The next issue is how many other disks I will lose this summer. I have installed many small mail server and Internet gateway systems running RAID-1, it seems most likely that some of them will have dead disks with the expected record temperatures this summer.


Debian SSH and SE Linux

I have just filed Debian bug report #556644 against the version of openssh-server in Debian/Unstable (Squeeze). It has a patch that moves the code that sets the SE Linux context for the child process to before the chroot call. Without this, a chroot environment on a SE Linux system can only work correctly if /proc and /selinux are mounted in the chroot environment.

deb http://www.coker.com.au squeeze selinux

I’ve created the above APT repository for Squeeze which has a package that fixes this bug. I will continue to use that repository for a variety of SE Linux patches to Squeeze packages; at the moment it contains packages from Unstable but I will also modify released packages as needed.

The bug report #498684 has a fix for a trivial uninitialised variable bug. The fix is also in my build.

Also I filed the bug report #556648 about the internal version of sftp being incompatible with SE Linux (it doesn’t involve an exec so the context doesn’t change). The correct thing to do is for sshd to refuse to run an internal sftpd, at least if the system is in enforcing mode, and probably even in permissive mode.

deb http://www.coker.com.au lenny selinux

Update: I’ve also backported my sshd changes to Lenny at the above APT repository.


Backing up MySQL

I run a number of MySQL databases; the number of mysqld installations that I run is something like 8, but I may have forgotten some. With the number of servers that I run on a “do nothing except when it breaks” basis it’s difficult to remember the details. The number of actual databases that I run would be something like 30; four databases running on a database server (not counting “mysql”) is fairly common. Now I need to maintain some sort of automated backup of these, this fact became obvious to me a couple of days ago when I found myself trying to recreate blog entries and comments from Google’s cache…

There are two types of database that I run. There are ones of significant size (more than 1GB) and tiny ones – I don’t think I run any database which has a MySQL dump file that is more than 20M and less than 2G in size.

For the machines with small databases I have the following script run daily from a cron job (with db1 etc replaced by real database names and /mysql-backup replaced by something more appropriate). The “--skip-extended-insert” option allows the possibility of running diff on the dump files, at the cost of increased file size; when the raw file size is less than 20M this overhead doesn’t matter – and gzip should handle the extra redundancy well.

#!/bin/bash -e
for n in db1 etc ; do
  /usr/bin/mysqldump --skip-extended-insert "$n" | gzip -9 > "/mysql-backup/$n-$(date +%Y-%m-%d).gz"
done

Then I have a backup server running the following script from a cron job to copy all the dump files off the machines.

#!/bin/bash -e
cd /backup-store
for n in server1 server2 ; do
  scp "$n:/mysql-backup/*-$(date +%Y-%m-%d).gz" $n
done

This script relies on being run after the script that generates the dump files, which is a little more tricky than it should be – it’s a pity that cron jobs can’t be set to run at UTC times. I could have run the dumps more frequently and used rsync to transfer the data, but the risk of losing one day’s worth of data seems acceptable. For my blog I can get any posts that I might lose from Planet installations in that time period.
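One way to arrange the ordering is with static cron entries that leave a safety margin between the dump and the fetch – the times and script paths below are hypothetical:

```
# /etc/cron.d/mysql-backup on each database server (hypothetical path)
0 1 * * * root /usr/local/sbin/mysql-dump-backup

# /etc/cron.d/fetch-dumps on the backup server, two hours later so
# the dumps are certain to have finished
0 3 * * * root /usr/local/sbin/fetch-mysql-dumps
```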

For the bigger databases my backup method starts by putting the database and the binary log files on the same filesystem – not /var. This requires some minor hackery of the MySQL configuration. Then I use rsync to copy the contents of an LVM snapshot of the block device. The risks of data consistency problems involved in doing this should be no greater than the risks from an unexpected power fluctuation – and the system should be able to recover from that without any problems.
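A hedged sketch of that procedure – the volume group, LV, mount point, and destination are all assumptions. Because the datadir and binary logs sit on the one LV, the snapshot captures them at a single instant, so restoring it is equivalent to crash recovery after a power failure:

```shell
#!/bin/sh -e
# Hypothetical snapshot backup of a MySQL data filesystem.  The data
# directory and binary logs must live on this one LV for the copy to
# be self-consistent.
lvcreate --snapshot --size 2G --name mysql_snap /dev/vg0/mysql
mkdir -p /mnt/snap
mount -o ro /dev/vg0/mysql_snap /mnt/snap
rsync -a --delete /mnt/snap/ backup.example.com:/backup-store/mysql/
umount /mnt/snap
lvremove -f /dev/vg0/mysql_snap
```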

My experience with MySQL dumps is that they take too long and use too many system resources for large databases, so I only use them for backing up small databases (where a dump can be completed in a matter of seconds, so even without using a transaction it doesn’t hurt).


What to do When You Break a Server

Everyone who does any significant amount of sysadmin work will break a server. Most people who have any significant experience will have broken several. Anyone who has never broken one should be treated with suspicion by other members of the sysadmin team, they probably haven’t learned the caution that most of us learn from stuffing something up badly.

When you break a server there are some things that you can do to greatly mitigate the scale of the disaster. Firstly carefully watch what happens, for example don’t type “rm -rf *” and then walk away, watch the command – if it takes unusually long then press ^C and double-check that you are removing the right files. Fixing a half broken server is often easier than fixing one that has been broken properly and completely!

If you don’t know what to do then do nothing! Doing the wrong thing can make things worse. Seek advice from the most experienced sysadmin who is available. If backups are inadequate then leave the server in a broken state while seeking advice, particularly if you were working at the end of the day and it doesn’t need to be up for a while. There is often time to seek advice by email before doing something.

Do not reboot the system! Even if your backups are perfect there is probably some important data that has been recently modified and can be salvaged. Certain types of corruption (such as filesystem metadata corruption) will leave good data in memory where it can be recovered.

Do not logout! If you do then you may not be able to login again and this may destroy your chances of fixing the system or recovering data.

Do not terminate any programs that you have running. There have been more than a few instances where the first step towards recovering from disaster involved using an open editor session (such as vi or emacs). If the damage prevents an editor from being started (e.g. by removing the editor’s program or a shared object it relies on) then having one already running is very important.

These procedures are particularly important if you are unable to visit the server. For example when using a hosted server at an ISP the more cost effective plans give you no option to ever gain physical access. So plugging the disks into another machine for recovery is not an option.

Does anyone have any other suggestions as to what to do in such a catastrophe?

PS This post is related to the fact that I had to recover the last couple of weeks of blog comments and posts from Google’s cache…

Update: I got my data back; now I have to copy one day of blog content from one database to another.


Links November 2009

Credit Writedowns has a populist interpretation of the latest Boom-Bust cycle [1]. It’s an interesting analysis of the way the US economy is working.

Bono writes for the NY Times about Rebranding America [2]. He praises Barack Obama suggesting a different reason to believe that the peace prize is deserved and describes what he believes to be the world’s hope for the US.

IsMyBlogWorking.com is a useful site that analyses your blog [3]. It gives some advice on how to improve some things as well as links to feed validation sites.

Evgeny Morozov gave an interesting TED talk “How the Net Aids Dictatorships” [4]. I don’t agree with his conclusion, he has some evidence to support his claims but I think that a large part of that is due to people not using the Internet well. I expect things to improve. The one claim that was particularly weak was when he mentioned radio stations in Rwanda as an example of technology being used for bad purposes – the entire point about the first-world discussion about such things is the radio vs the Internet.

Ray Anderson gave an inspiring talk about “The Business Logic of Sustainability” [5]. He transformed his carpet company, decreasing its environmental impact by 82% and its impact per volume of product by more than 90% while also significantly increasing its profitability. He says that corporate managers who don’t protect the environment should be regarded as criminals. Making his company more environmentally friendly reduced expenses (through efficiency), attracted more skillful employees, and attracted environmentally aware customers. Managers who don’t follow Ray’s example are not only doing the wrong thing for the environment, they are doing the wrong thing for their stockholders! Ray’s company Flor takes carpet orders over the web [6]. They won’t ship a catalogue outside the US, so presumably they only sell carpet to the US too.

Marc Koska gave an interesting TED talk about a new syringe design that prevents re-use [7]. His main aim is to prevent the spread of AIDS in the developing world – where even hospital staff knowingly reuse syringes. It will also do some good in developed countries that try to prohibit drug use.

David Logan gave an interesting TED talk about tribal leadership [8]. His use of the word “tribe” seems rather different from most other uses, and I am a bit dubious about some of his points. But it is definitely a talk worth seeing and considering.

Deirdre Walker is a recently retired Assistant Chief of Police who worked for 24 years as a police officer; she describes in detail her analysis of the flaws in the TSA security checks at US airports [9].

Brian Krebs wrote an article for the Washington Post recommending that Linux Live CDs be used for Internet banking [10]. Windows trojans have been used to take over bank accounts that were accessed by security tokens, that could only be accessed by certain IP addresses, and that required two people to login. It seems that nothing less than a Linux system that is solely used for banking is adequate when a lot of money is at stake.

The NY Times has an interesting review of the book “Ayn Rand and the World She Made” [11]. It seems that Ayn was even madder than I thought.

Gary Murphy has written an interesting analysis of the latest stage in the collapse of the US Republican party [12].

The ABC (AU) Law Report has an interesting article about Evony’s (of China and the US) attempting to sue Bruce Everiss (of the UK) in Australia [13].

The Guardian has an insightful article about the IEA making bogus claims about the remaining oil reserves [14]. It seems that the experts who work for the IEA estimate that oil is running out rapidly while the US is forcing them to claim otherwise.

Dean Baker of the Center for Economic and Policy Research has written an interesting article about the economic effects of the war in Iraq [15]. Apparently it caused the loss of over 2,000,000 jobs – considerably more than the job losses that could ever result from efforts to combat global warming.


I’m an Aspie

I’ve recently been diagnosed with Asperger Syndrome (AS) [1]. Among other things this means that I am genetically predisposed to have an interest in solving technical problems and give lectures about how I solved them, but that I tend not to be a “people-person”.

AS is generally regarded as an Autism Spectrum Disorder (ASD), but there is a lot of debate among the experts about the exact relationship. Some people (such as the psychologist who assessed me) believe that AS is a synonym for High Functioning Autism (HFA); however, one theory I’ve heard is that HFA people are more sensory oriented and Aspies differ by being information oriented – in that case I would not be regarded as HFA. I’m not bothered by this issue, I’m sure that in a few years’ time the experts will have some consistent definitions for such things that most people can agree on.

There is no Boolean assessment for AS; the assessment is based on a sliding scale of how many criteria are met. People who are almost Aspies but don’t quite pass the test (or, more commonly, don’t get assessed because they don’t think they would pass) are sometimes referred to as Asperger Cousins (AC) – a slang term that is not formally recognised but is often used in online discussions. I’m sure that a significant portion of the readers of my blog would regard themselves as being at least ACs if they investigated the issue. The test at Glenn Rowe’s web site [2] can give you an idea of how you rate by some criteria – but note that it is not at all conclusive and it’s based on the theories of Professor Simon Baron-Cohen [3], not all of which have general agreement. Leif Ekblad is running a project to analyse long-term changes to the Aspie score of adults [4]. The main quiz for that project seems quite popular for self-diagnosis, but again it’s not conclusive.

I think that diagnosing oneself for an ASD is not nearly as crazy as most things which might fall in the category of being one’s own psychologist, but I still strongly recommend getting a formal assessment if you believe that you are an Aspie. In Australia it costs about $600 and there’s a waiting list of about 3 months. Chaotic Idealism has an insightful post about the pros and cons of self-diagnosis [5], if you suspect that you may be an Aspie then I recommend that you read it before doing the tests.


New Play Machine

Update:
Thanks to Sven Joachim and Andrew Pollock for informing me about /etc/init.d/mountoverflowtmp which exists to mount a tmpfs named overflow if /tmp is full at boot time. It appears that the system was not compromised. But regular reinstalls are always a good thing.

On the 24th of August this year I noticed the following on my SE Linux Play Machine [1]:
root@play:/root# df
Filesystem          1K-blocks      Used Available Use% Mounted on
/dev/hda              1032088    938648    41012  96% /
tmpfs                    51296        0    51296  0% /lib/init/rw
udev                    10240        24    10216  1% /dev
tmpfs                    51296        4    51292  1% /dev/shm
/dev/hdb                516040    17128    472700  4% /root
overflow                  1024        8      1016  1% /tmp

The kernel message log had the following:
[210511.546152] su[769]: segfault at 0 ip b7e324e3 sp bfa4b064 error 4 in libc-2.7.so[b7dbb000+158000]
[210561.527839] su[778]: segfault at 0 ip b7eb14e3 sp bfec84d4 error 4 in libc-2.7.so[b7e3a000+158000]
[210585.270372] su[784]: segfault at 0 ip b7e044e3 sp bff1b534 error 4 in libc-2.7.so[b7d8d000+158000]
[210595.855278] su[789]: segfault at 0 ip b7e014e3 sp bfd18324 error 4 in libc-2.7.so[b7d8a000+158000]
[210639.496847] su[796]: segfault at 0 ip b7e874e3 sp bf99e7b4 error 4 in libc-2.7.so[b7e10000+158000]

Naturally this doesn’t look good; the filesystem known as “overflow” indicates a real problem. At the time it appeared that the machine had been compromised, so I made archival copies of all the data and reinstalled it.

As the weather here is becoming warmer I’ve used new hardware for my new Play Machine. The old system was a 1.8GHz Celeron with 1280M of RAM and two IDE disks in a RAID-1 array. The new system is a P3-800 with 256M of RAM and a single IDE disk. It’s a Compaq Evo which runs from a laptop PSU and is particularly energy efficient and quiet. The down-side is that there is no space for a second disk and only one RAM socket so I’m limited to 256M – that’s just enough to run a Xen server with a single DomU.

I put the new play machine online on Friday the 23rd of October after almost two months of down-time.


Exetel Stupidity

Anand Kumria has an ongoing dispute with Exetel, the latest is that a director of Exetel has libeled him in a blog comment [1].

Having public flame-wars with customers generally isn’t a winning move for a corporation. But doing so in the context of the blog world is a particularly bad idea. The first issue is that almost everyone who regularly reads Anand’s blog will trust him instead of a corporation (Anand is well regarded in the free software community). So it’s not as if accusing Anand of lying will gain anything.

But when a director of the company starts doing this it makes the issue more dramatic and interesting to many people on the net. Now Anand’s side of the story will get even more readers, of course Anand’s side was always going to get more readers than Exetel – I’m sure that Anand’s blog is more popular than that of Steve Waddington. I wouldn’t be surprised if my blog was more popular than Anand’s and now my readers will be following the Exetel saga for the Lulz. I’m sure that I won’t be the last person to comment on this.

The most amazing thing is that Steve Waddington talks about having to pay for the TIO to handle the complaint. So I guess that means I should start complaining whenever I get bad service from an ISP and cost them some money! I should have stayed with Optus and started complaining all the time when they caused me problems!

One thing that Steve and people like him should keep in mind is that members of our community are not only heavy users of the Internet, we generally recommend ISPs to other people, and many of us make money working for ISPs. If you want your ISP to get good reviews and to be able to hire good staff then attacking people like Anand is not the way to go.

WordPress Plugins

I’ve just added the WordPress Minify [1] plugin to my blog. Its purpose is to combine CSS and Javascript files and to optimise them for size, and it’s based on the Minify project [2]. On my documents blog this takes the main page from 313KB uncompressed, 169KB compressed, and a total of 23 HTTP transfers to 306KB uncompressed, 117KB compressed, and 21 HTTP transfers. In each case 10 of the HTTP transfers are from Google for advertising. It seems that a major obstacle to optimising the web page load times is Google adverts – of course Google has faster servers than I do so I guess it’s not that much of a performance problem. The minify plugin caches its data files and I had to really hack at the code to make it use /var/cache/wordpress-minify – a subdirectory of the plugins directory was specified in many places.

deb http://www.coker.com.au lenny wordpress
I’ve added a wordpress-minify package to my repository of WordPress packages for Debian/Lenny with the above APT line. I’ve also got the following packages:
adman
all-in-one-seo-pack
google-sitemap-generator
openid
permalink-redirect
stats
subscribe-to-comments
yubikey

The Super Cache [3] plugin has some nice features. It generates static HTML files that are served to users who aren’t logged in and who haven’t entered a comment. This saves significant amounts of CPU time when there is high load. The problem is that installing this requires modifying the main .htaccess file, adding a new .htaccess file in the plugins directory, and lots of other hackery. The main reason for this is to avoid running any PHP code in the most common cases, it would be good for really heavy use. Also PHP “safe mode” has to be disabled for some reason, which is something I’d rather not do.

The Cache [4] plugin was used as the base for the Super Cache plugin. It seems less invasive, but requires the ability to edit the config file. Getting it into a shape that would work well in Debian would take more time than I have available at the moment. This, combined with the fact that my blog will soon be running on a system with two quad-core CPUs that won’t be very busy, means that I won’t be packaging it.

If anyone would like to Debianise the Cache or Super Cache plugin then I would be happy to give them my rough initial efforts as a possible starting point.

I’m not planning to upload any of these packages to Debian, it would just add too much work to the Debian security team without adding enough benefit.