The use of hard drives for backups is outpacing other forms of backup media by a country mile. The largest IDE drive available right now is 200 gigabytes (Western Digital's Drivezilla, which gets my vote for best name). Tape backup has valiantly attempted to keep pace. Tape autoloaders, holding up to 1.4 terabytes, and 100/200 gigabyte capacity tapes, are some of the modern tape storage options.
I have never warmed up to tape storage, though. While I found tape tolerable when it presented a significant cost savings over other data storage media, I rarely use tape at all now that hard drives have become so large and so inexpensive. Hard drives have many advantages over tape, including speed, redundancy, ease of use, and versatility.
Users needing removable storage have several options, such as mobile drive racks and USB hard drives. The one downside is that hard drives probably don't have the longevity that tapes do. No one knows what the future will bring, or if we'll even have the means to read those 30-year tapes in 30 years. But long-term archiving is a topic for another day; today we focus on short-term backup needs and specifically on a tool that helps back up information to one or more hard drives.
There's a new kid on the backup software block: rsync. rsync was originally designed to replace rcp, the venerable old Unix remote copy program. Because of its sophisticated means of synchronizing and transferring file trees, rsync is widely used for mirroring Web sites. rsync transfers only the changes in files, using the devilishly clever rsync algorithm. It calculates diffs without needing both files to be present on the same machine. This little bit of magic is described in the documentation accompanying the program (for those interested in such). rsync can also compress data on the fly (the -z option), making network file transfers very fast and efficient.
rsync has some lovely security features. It supports ssh, which is the recommended protocol for secure network file transfer. On the receiving side, it writes incoming data to a temporary file and moves it into place only once the transfer is complete, minimizing the chances of something bad happening to the existing copy. It also has a useful "dry-run" option for testing new command options safely.
rsync is simple to use, but don't let simplicity fool you -- it is a powerful tool, and bad things can happen just as easily as good things. rsync runs on Linux and Windows (see Resources for downloads and installation instructions).
The following command makes copies on the same machine. The file or directory being copied is named first and the destination directory second:
$ rsync -a sourcedir destinationdir
To copy a directory from the local machine to a remote machine (note that rsync must be present on both machines):
$ rsync -a sourcedir remotehost:destinationdir
To copy a directory from the remote machine to a local machine:
$ rsync -a remotehost:/sourcedir destinationdir
It is best to be explicit and use full filepaths, even when copying from the current directory.
-a, or --archive, is shorthand for -rlptgoD:
- r = Copy directories recursively. Without this switch, directories will not get copied at all.
- l = Recreate symlinks.
- p = Preserve permissions.
- t = Transfer modification times and update them on the remote system. Must have this for accurate synchronization.
- g = Set the group of the destination file to be the same as the source file.
- o = Set the owner of the destination file to be the same as the source file.
- D = Re-create character and block devices on the remote system (only root can do this).
To transfer over an ssh tunnel, use this command:
$ rsync -a -e ssh sourcedir firstname.lastname@example.org:/destinationdir/
-e ssh means "replace the native rsh protocol with ssh." If there is some other secure tunnel you want to use, this is where to name it.
Also, mind your trailing slashes, as they make a difference to rsync in the source arguments. A trailing "/" on a source argument means "copy the contents of this directory." Leaving off a trailing slash means "copy the directory and its contents."
$ rsync -avn -e ssh sourcedir email@example.com:/destinationdir/
The -n switch, or --dry-run, shows what will happen when rsync is run but without actually copying or changing anything. Use with -v, verbose, to see the messages. Verbose has three levels: -v, -vv, and -vvv for maximum verbosity. Always perform dry runs until you're satisfied rsync will work as desired.
Some other useful command options to be aware of include:
- --delete = Use with caution! Always do a dry-run first when using --delete. Don't say I didn't warn you! --delete removes all files at the destination that do not exist on the source.
- --delete-excluded = In addition to what --delete removes, delete any files at the destination that match an --exclude pattern. As you can see, this is powerful stuff to keep archives tidy and uncluttered. Use it wisely.
- -z (or --compress) = Use rsync's compression.
- -S (or --sparse) = Handle sparse files efficiently.
- -H (or --hard-links) = Preserve hard links. -a does not preserve hard links.
- -b (or --backup) = Keep the old version of any destination file that would be overwritten or deleted by renaming it with a ~ appended. You're not stuck with "~", as the --suffix option lets you specify anything you like.
- --backup-dir=DIR = Combine with --backup to tell rsync where to store backups.
rsync can be exclusive as well as inclusive:
- --exclude pattern = Exclude files matching pattern.
- --exclude-from file = Exclude patterns listed in file.
For example, --exclude '*.tmp' will exclude all .tmp files, and --exclude '*.bak' excludes .bak files. Quote the patterns so the shell passes them to rsync unexpanded. You can also name individual files and directories. Each --exclude can take only one argument. For multiple excludes, either string them together on the command line:
--exclude '*.bak' --exclude '*.tmp'
Or better, put them in a file:
rsync can also support regular expressions, as all good Linux programs should, for fine-grained file selection. This requires applying a patch; follow this link for details.
Running an rsync Server
rsync can also be run as a daemon, which is invoked from the command line with the --daemon option. By default the daemon listens on TCP port 873. Setting up a dedicated rsync server lets any number of machines and users connect for mirroring, backups, file retrieval, whatever they need. /etc/rsyncd.conf contains the server's environment and runtime variables, including user and host access control lists.
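To give a feel for the configuration file, this sketch writes a small rsyncd.conf to a scratch location instead of /etc so that nothing on the system is touched. The module name "backups", the user "backupuser", the password, and the 192.168.1.0/24 network are all placeholders, not recommendations.

```shell
#!/bin/sh
# Sketch: generate a minimal rsyncd.conf and secrets file in a scratch
# directory. Module name, user, password and network are placeholders.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/backups"

cat > "$work/rsyncd.conf" <<EOF
# Global settings
port = 873
max connections = 4

# One module; clients refer to it by the name in brackets.
[backups]
    path = $work/backups
    comment = Example backup area
    read only = false
    use chroot = false
    hosts allow = 192.168.1.0/24
    auth users = backupuser
    secrets file = $work/rsyncd.secrets
EOF

# The secrets file holds user:password pairs and must not be readable
# by other users.
printf 'backupuser:changeme\n' > "$work/rsyncd.secrets"
chmod 600 "$work/rsyncd.secrets"

cat "$work/rsyncd.conf"

# To start a daemon with this file (binding port 873 needs root):
#   rsync --daemon --config="$work/rsyncd.conf"
# A client would then address the module with a double colon:
#   rsync -av sourcedir/ backupuser@server::backups/
```

The script only writes the files; the commands that start the daemon and connect to it are left as comments so you can run them deliberately.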
The best and safest way to use rsync is to run it from scripts. Refine your command strings and directory options, then record them for posterity in scripts. Put them in crontabs so you don't have to remember to run them manually. Using (and reusing) scripts greatly reduces the chances of errors. See the rsync web site for examples of backup scripts. The site gives examples for using an ordinary 7-day rotation to a central backup server, mirroring a CVS tree, and backing up to a spare disk.
As always, the more you know about scripting and regular expressions, the better you'll be able to make things work. If you ever get tired of my nagging about scripting and want a nice tutorial instead, drop me a line and I will wheedle my excellent editor into letting me write one.