Some Tips for Shell Code that Won’t Destroy Your OS

When writing a shell script you need to take some care to ensure that it won’t run amok. Extra care is needed for shell scripts that run as root, firstly because of the obvious potential for random destruction, and secondly because of the potential for interaction between accounts that can cause problems.

  • One possible first step towards avoiding random destruction is to start your script with “#!/bin/sh -e” instead of “#!/bin/sh”. This means that the script will exit on an unexpected error, which is generally better than continuing merrily along to destroy vast swathes of data. Of course sometimes you will expect an error, in which case you can use “/usr/local/bin/command-might-fail || true” to make it not abort on a command that might fail.
  • #!/bin/sh -e
    cd /tmp/whatever
    rm -rf *

    cd /tmp/whatever || exit 1
    rm -rf *

    Instead of using the “-e” switch to the shell you can put “|| exit 1” after a command that really should succeed. For example neither of the above scripts is likely to destroy your system, while the following script is very likely to destroy your system, because if the “cd” fails then “rm -rf *” runs in whatever directory the script was started from:
    cd /tmp/whatever
    rm -rf *
  • Also consider using absolute paths. “rm -rf /tmp/whatever/*” is as safe as the above option but also easier to read – avoiding confusion tends to improve the reliability of the system. Relative paths are most useful for humans doing typing; when a program is running there is no real down-side to using long absolute paths.
  • Shell scripts that cross account boundaries are a potential cause of problems. For example, suppose a script does “cd /home/user1” instead of “cd ~user1”. If someone in the sysadmin team moves the user’s home directory to /home2/user1 (which is not uncommon when disk space runs low) then things can happen that you don’t expect – and we really don’t want unexpected things happening as root! Most shells don’t support “cd ~$1”, but that doesn’t force you to use “cd /home/$1”. Instead you can use some shell code such as the following:
    HOME=`grep ^$1 /etc/passwd|head -1|cut -f6 -d:`
    if [ "$HOME" = "" ]; then
      echo "no home for $1"
      exit 1
    fi
    cd ~

    I expect that someone can suggest a better way of doing that. My point is not to try and show the best way of solving the problem, merely to show that hard coding assumptions about paths is not necessary. You don’t need to solve a problem in the ideal way; any way that doesn’t have a significant probability of making a server unavailable and denying many people the ability to do their jobs will do. Also consider using different tools: zsh supports commands such as “cd ~$1”.
  • When using a command such as find make sure that you appropriately limit the results. In the case of find that means using options such as -xdev, -type, and -maxdepth. If you mistakenly believe that permission mode 666 is appropriate for all files in a directory then it won’t do THAT much harm. But if your find command goes wrong and starts applying such permissions to directories and crosses filesystem boundaries then your users are going to be very unhappy.
  • Finally when multiple scripts use the same data consider using a configuration file. If you feel compelled to do something grossly ugly such as writing a dozen expect scripts which use the root password then at least make it an entry in a configuration file so that it can be changed in one place. It seems that every time I get a job working on some systems that other people have maintained there is at least one database, LDAP directory, or Unix root account for which the password can’t be changed because no-one knows how many scripts have it hard-coded. It’s usually the most important server, database, or directory too.
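
The home directory and find tips above can be combined in a small sketch. It is only one possible way of doing it and it makes some assumptions: getent is installed, your find supports -maxdepth (GNU and BSD find do, but it is not in POSIX), and the function names home_of and fix_modes as well as mode 644 are made up for illustration.

```shell
#!/bin/sh
# Sketch only: home_of and fix_modes are hypothetical helper names.
set -e

# Print the home directory of user "$1" on a single line, or fail.
# Assumes getent is available; field 6 of a passwd entry is the home directory.
home_of() {
  home=$(getent passwd "$1" | cut -d: -f6)
  if [ -z "$home" ]; then
    echo "no home for $1" >&2
    return 1
  fi
  printf '%s\n' "$home"
}

# Set mode 644 (an example value) on regular files directly inside
# directory "$1". -type f skips directories, -maxdepth 1 stops find from
# descending, and -xdev keeps it on one filesystem.
fix_modes() {
  [ -d "$1" ] || return 1
  find "$1" -xdev -maxdepth 1 -type f -exec chmod 644 {} +
}
```

A script running as root could then use something like “dir=$(home_of "$1") || exit 1” followed by “cd "$dir" || exit 1” instead of hard coding /home/$1.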

Please note that nothing in this post is theoretical; it’s all from real observations of real systems that have been broken.

Also note that this is not an attempt at making an exhaustive list of ways that people may write horrible scripts, merely enough to demonstrate the general problem and encourage people to think about ways to solve the general problems. But please submit your best examples of how scripts have broken systems as comments.

18 comments to Some Tips for Shell Code that Won’t Destroy Your OS

  • Michael Goetze

    grep ^$1 /etc/passwd

    should surely be

    /usr/bin/getent passwd $1

    After all, another sysadmin might come along and change the PAM configuration of the system to use, say, LDAP.

  • etbe

    Michael: Thanks for that. Now if only the getent tool (or something similar) would allow us to extract the home directory on a single line. It seems that there will be a huge number of shell scripts that need to know the home directories of users. While piping it through cut is not overly difficult it does add an extra possibility for things to go wrong.

  • As usual, awk to the rescue!

    test -z "$1" && exit 1
    awk -F: -vuser=$1 '{if ($6 == "") print user " has no home directory" }' /etc/passwd

    And to properly work in environments that use directory services like nis/ldap/kerberos/etc

    test -z "$1" && exit 1
    getent passwd $1 | awk -F: -vuser=$1 '{if ($6 == "") print user " has no home directory" }'

    And finally:
    test -z "$1" && exit 1
    getent passwd $1 | awk -F: '{if ($6 == "") print $1 " has no home directory" }'

    awk is magic

  • It’s possibly better to use

    set -e

    so that running the script by passing it to the shell as an argument, which ignores the “#!” line, won’t override the -e setting.

  • etbe

    Andrew Pollock also makes the point about set -e but with another example of how it can fail. Andrew also mentions exit handlers.

    Exit handlers are good, but I really don’t expect anyone who puts “cd /tmp/whatever ; rm -rf *” in a shell script to be able to do that in the near future.

    I think it’s best to concentrate on refraining from destroying servers as a first priority. Writing the quality of shell code that Andrew advocates is a good thing to do later on.

    One of the advantages of blogging about such things is learning from experts such as those of you who have commented and Andrew.

  • etbe

    Jon Dowland suggests not using shell scripts at all. That sounds nice, but there are many people who can’t/won’t learn a scripting language. So really we are stuck with shell scripts so let’s try and not do it too badly.

    Then of course there are short scripts which are not particularly demanding. The vast majority of my shell scripts are less than 10 lines long including comments. For such scripts using Perl probably wouldn’t provide much of a benefit.

  • @etbe: Jon Dowland is a bit confused if he thinks shell scripts are the wrong tool for systems administration. I’d be willing to bet he hasn’t done much hard core sysadmin work ever. The shell is the least common denominator on every ‘nix including OS X. From HP-UX to Linux to ultra embedded busybox distros on your nas without perl or python it works on them all. If you can program bourne shell efficiently, you can manage posix hosts.

    I reckon nine times out of ten it’s the right decision. But when you just have to use a shellscript, use set -e (as Andrew Pollock points out, this is safer than putting it on the hashbang line) and set -u too.

    Just because he doesn’t like or know it particularly well doesn’t mean that using awk/sed/bash to solve your problem is wrong. Look at the speed comparisons between awk and perl to search text. If you can do it in awk, it is much faster than perl with a fraction of the memory footprint. His statements are amusing at best.

  • etbe

    Jeff: You make some good technical points.

    This discussion (both here and in Planet Debian and Planet Linux Australia) hasn’t taken the course that I had hoped for.

    While I agree that there is scope for a lot of discussion about finding the best ways of solving such problems, my focus here is on avoiding the worst ways of doing things. In particular bad ways that involve people phoning me early in the morning because their server isn’t working. ;)

    Let’s try not to assume that someone lacks knowledge because they like to do things a different way.

  • The advice about “set -e” or “|| exit 1” is valid only if you don’t do any parallel processing in your shell scripts. With parallel processing, it leads to undesired consequences.

    Consider the following example, where someone wants to run long-running-program in parallel with other two commands and then merge the results:

    long-running-program &
    lpid=$!
    command-that-can-fail
    wait $lpid

    If the command-that-can-fail fails and causes the script to exit, you end up with the long-running-program still in the background, which may be undesirable (e.g., you can’t just fix the problem and restart the script).

    And, by “no parallel processing at all”, I also mean pipelines. E.g., if you don’t want to run “publish” if file.xml is invalid, and want to save reports about invalid XML just in case, the following works:

    xmllint --noout --valid file.xml >report
    publish file.xml

    Then, suppose that reports sometimes become too long, and you want to take only the first few lines. This doesn’t work (i.e., will publish invalid XML):

    xmllint --noout --valid file.xml | head >report
    publish file.xml

    Of course, bash has a good solution for both problems (by handling ERR or setting the pipefail option).

  • Jon

    Etbe, I originally tried to post my blog post as a comment here (via a phone) but had some openid problems so ended up blogging it when I got in :)

    Jeff, I’ve been working as a full-time sysadmin for 5 years and administer 200+ UNIX systems, should experience count for something. I assure you I am well versed in awk/cut/sed/grep etc.

    You are placing far too much emphasis on speed of execution. This is of course important in some contexts (say embedded) but if the differential between awk/sed etc. and perl is big enough to be important (and I wouldn’t be confident even measuring the difference accurately on a modern system) you would almost certainly need to be working in C or similar anyway.

    In terms of least-common denominator, perl is universally available and crucially much more *consistent* across platforms. Even POSIX sh is not a low enough bar for cross platform shell scripts (solaris doesn’t support $(foo) for example)

    Alexander, good point about parallelism, which is increasingly important in modern systems. I’ve attempted parallel subshells / clever use of wait / juggling file descriptors and I really think this is a classic example of a problem space which shell is terrible for.

  • @Jon: I apologize for being incorrect about your experience. My team manages a few short of 2k servers but numbers don’t matter much. If you can successfully manage 200 systems well, you can manage 5000. posix is posix is posix.

    However, I still respectfully disagree. The shell is the lowest common denominator in ‘nix environments from supercomputers to embedded distros like emdebian (with no perl). It is all you need to manage posix.

  • etbe

    A Polish blogger has an interesting post that has the solutions to some common shell scripting mistakes. Above are the links for the blog post in question and for the Google translation into English.

  • In my opinion, the set -e option is just a pain and should NEVER be used in a decent shell script.

    I think the set -e option is just an excuse for not writing proper error handling routines and fail safe code. The || exit 1 statements seem like an ugly solution to me.

    cd /tmp/whatever || exit 1
    rm -rf *

    Could be better written something like:

    if [ -e "$DIR" ] && [ ! -z "$DIR" ]; then
      rm -rf "$DIR"/*
    fi

    If you want to run dangerous commands like ‘rm’ you may want to spend the 2 minutes extra to write a ‘proper’ fail safe mechanism instead of using set -e or other simple solutions.

    But that is just me.

  • Sorry for creating a separate posting, but I want to state that I fully agree with the author, only not on all the solutions. ;)

  • etbe

    Louwrentius: The thing to keep in mind is that I didn’t write this post for the benefit of skillful people who are capable of coding in the style you advocate, or for people who need to write code that will later be maintained by such people.

    Putting “set -e” at the start of a script is easy, simple, and solves many problems. While it’s not ideal, it will save servers from being trashed on occasion and that’s what really matters.

    But your points are good and are useful for anyone who wants to take their scripting to a higher level.

  • Ok, i misunderstood the pov of your article. From your perspective, I agree with the set -e option.

    Also, I understand why people might argue that if you write anything in shell script that would be worth the effort to do it ‘right’, you might as well do it in a ‘proper language’ such as Python or Ruby for example.

    I wrote some bigger stuff in bash, but that’s because I’m just crazy. There is no other reason for it.

  • vk3jed

    Nice article about shell scripting. I didn’t consider -e myself, as I tend to use techniques similar to what Louwrentius advocates, sometimes going as far as ensuring the script has been started by the correct user, has all the data it needs to work, and in some cases, is started from the correct place in the filesystem.

    I use shell scripting quite a lot, because in my spare time I play around a lot with software that is already 90% shell scripts which manage a few binaries, and there’s no guarantee that the boxes have anything else available (though Perl is usually there and has sometimes been used by others). The software is also designed to run unattended for months on end, so the scripts have to be able to deal with common situations without intervention, and if they fail, fail gracefully.

    However, for quick and dirty scripts, the techniques outlined in the blog post are a good way to avoid too many tears.

  • etbe

    Peter Eisentraut points out that in Bash you can run “set -o pipefail” to handle some failure cases for pipelined commands.
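
As a rough illustration of the pipeline problem raised in the comments above, here is a small bash sketch. It assumes bash (pipefail is not available in every /bin/sh), and the validate function is a made-up stand-in for a checker such as xmllint that prints a report and fails.

```shell
#!/bin/bash
# Sketch only; requires bash for "set -o pipefail".
set -o pipefail

# Hypothetical stand-in for a validator such as xmllint:
# it prints a one-line report and returns a failure status.
validate() {
  printf '%s: validation error\n' "$1"
  return 1
}

# Trim the report with head. With pipefail set, the status of the
# pipeline is that of validate (1) rather than that of head (0).
trimmed_report() {
  validate "$1" | head -n 5
}

if trimmed_report file.xml >/dev/null; then
  echo "would publish file.xml"
else
  echo "validation failed, not publishing"
fi
```

Without the “set -o pipefail” line the same script would take the “would publish” branch, because the exit status of a pipeline is normally that of its last command.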