Because they can often require a complete reinstallation, security compromises are best treated as disasters. A business can be crippled for days while a compromised server is reconstructed and the kinks worked out, but this needn't be the case if you apply the same thoroughness to incident recovery that you would to any physical calamity.
Many sites have mechanisms to deal with the need to suddenly reinstall a machine, but even well-prepared sites are likely to have a few servers that can't be reinstalled quickly. Several approaches to backups and system management can make recovering from a compromise much more pleasant, however; pleasant enough that rounding up the last of those oddball servers and bringing them in line with your recovery plan is well worth your time. The main tenets we'll consider:
- Systems should be automated, using configuration management tools, such that every server can be reinstalled easily and brought back to working order without any manual intervention.
- Backup plans should take into consideration the need for total system recovery, including procedures to make disk images of the most important servers.
- To verify that the procedures and infrastructure actually work, teams should practice recovering from a compromise of the most important server using spare hardware, and then put the rebuilt server into production for a short test period.
Cookie-cutter servers are the name of the game. Any divergence from a standard OS load absolutely must be documented and automated. If you aren't already immersed in the wonderful world of configuration management, take a serious look at Puppet or CFEngine.
Even something as simple as disk failures can be a major disaster if you have servers running unknown and undocumented configurations. If you're in this position, your situation is dire. Hopefully there's some documentation available to aid in converting these servers to some type of automated configuration system. Frequently there won't be; in fact, simply rebooting a server may cause it to stop functioning because services aren't configured to start at boot time. Frankly, if you've ever experienced this, you aren't doing things properly.
Most sites are somewhere in between. Perhaps they have a half-implemented configuration management infrastructure, or just a few machines that are completely divergent. It may be too much work to get them in line with the standard server load. That's OK in small doses, but because the oddball servers aren't completely automated, they need special treatment.
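One way to find out which machines are oddballs is to compare each server's files against a manifest taken from the standard load. The following is a minimal sketch of that idea, not a replacement for a configuration management tool. The function names (`build_manifest`, `find_drift`) and the choice of SHA-256 checksums are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or changed relative to the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(k for k in shared if baseline[k] != current[k]),
    }
```

In practice you might build the baseline from a freshly installed reference machine (for example, `build_manifest(Path("/etc"))`), store it, and run `find_drift` against each server. Anything reported is a divergence that needs to be either documented and automated or removed.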
In an ideal situation, you'll only be backing up attached storage, SAN or otherwise, because the OS data doesn't matter. In this case you can suck data directly off storage gear for full backups, and the OS doesn't even have to be involved (assuming a SAN infrastructure). Very few servers will have local storage that's vital, because all divergent information is stored in your configuration management software or perhaps mounted over NFS.
For the not-so-lucky, or perhaps the host that holds the configuration management data, there needs to be a sane mechanism for backing up the data. Backing it up is the easy part; the hard part is backing it up in a restorable manner. The most common backup methods will spread at least a week's worth of data across many tapes, making it a royal pain to completely restore an entire file system.
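To see why a full restore gets painful, consider how many backup sets have to be replayed to reach a given day. The sketch below assumes a common scheme of periodic full backups with incrementals in between; the article doesn't specify a rotation, so treat the scheme and the `restore_chain` helper as illustrative assumptions.

```python
from datetime import date


def restore_chain(backups: list[tuple[date, str]], target: date) -> list[tuple[date, str]]:
    """Return the backup sets needed, in order, to restore to the target date.

    Each backup is a (date, kind) pair, where kind is "full" or "incremental".
    The chain is the most recent full backup on or before the target,
    followed by every incremental taken after it up to the target.
    """
    eligible = sorted(b for b in backups if b[0] <= target)
    full_positions = [i for i, b in enumerate(eligible) if b[1] == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the target date")
    return eligible[full_positions[-1]:]
```

With a full backup on Sunday and incrementals every weekday, restoring to Thursday means replaying five sets in order, each potentially on a different tape. That is the overhead a single disk image avoids.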
There are virtual tape libraries that make this a bit more tolerable, but the restoration process still isn't quick when you need to completely rebuild a server that requires tons of customizing. Quick recovery of such a server requires disk images.
Any critical server should have two mirrored OS disks; that's a given. What we're talking about here is creating an image of the entire disk and backing that up as well. Storing a week's worth of those is certainly handy when you need to roll back a few days. It works just like VMware snapshots, in fact, but for full servers. Ideally you'd never have to resort to this, but for small shops or disaster situations it sure beats reconstructing a server from memory and 5-20 tapes' worth of backups.
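Keeping "a week's worth" of images implies some pruning policy. Here is a minimal sketch of one, assuming each image is tracked by name and the time it was taken. The `select_expired` function, the seven-day default, and the rule of always keeping the newest image are assumptions for illustration, not a prescribed policy.

```python
from datetime import datetime, timedelta


def select_expired(images: dict[str, datetime], now: datetime, keep_days: int = 7) -> list[str]:
    """Return names of images older than keep_days, sorted by name.

    The newest image is never selected, so a stalled imaging job
    can't lead to every copy being pruned.
    """
    cutoff = now - timedelta(days=keep_days)
    expired = sorted(name for name, taken in images.items() if taken < cutoff)
    if images and len(expired) == len(images):
        newest = max(images, key=lambda name: images[name])
        expired.remove(newest)
    return expired
```

The function only decides what to delete; the actual removal is left to the caller, which makes the policy easy to dry-run before trusting it with real images.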
The Security Aspect
As was discussed in "How Do You Know When You've Been Owned?," you're not always 100 percent certain you've been hacked. In the situations where you need to reinstall the server, e.g. root was compromised, it's usually clearer. But wouldn't it be better if a suspected security incident could be cleaned up with a simple reload and no manual intervention?
Some say yes, some say no. You'll certainly want to know how someone was able to breach your perimeter, which usually leads to a full determination of whether or not you're in danger. But a hacked Web site can pose risks down the road even after it's been cleaned up. The choice is yours.
The absolute best solution is to have every server configuration well documented and automated. The system disk can also be archived for security-related or other disaster recovery needs. Many servers nowadays (HP started it) come with internal USB connections. The idea is to flash a Linux boot image onto the USB drive, so that if the need arises, you can boot off the USB disk and 'dd' the OS image you need over the system disk. Limited only by the speed of your network, this is the fastest method of disaster recovery.
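The restore step itself is just a raw copy, which is what 'dd' does. The sketch below shows the same idea with one addition: it checksums the image as it is written and reads the target back to confirm the copy. The function names and the 4 MiB chunk size are assumptions. On a real system the target would be a block device, the script would need root, and the write destroys whatever is on that device.

```python
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read, comparable to dd bs=4M


def write_image(image_path: str, target_path: str) -> tuple[str, int]:
    """Copy the image onto the target; return (sha256 hex digest, bytes written)."""
    digest = hashlib.sha256()
    written = 0
    with open(image_path, "rb") as src, open(target_path, "wb") as dst:
        while chunk := src.read(CHUNK_SIZE):
            dst.write(chunk)
            digest.update(chunk)
            written += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data has actually reached the disk
    return digest.hexdigest(), written


def verify_target(target_path: str, expected_sha256: str, length: int) -> bool:
    """Read back the first `length` bytes of the target and compare digests.

    Only `length` bytes are read because a disk is usually larger than the image.
    """
    digest = hashlib.sha256()
    remaining = length
    with open(target_path, "rb") as dst:
        while remaining > 0:
            chunk = dst.read(min(CHUNK_SIZE, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return remaining == 0 and digest.hexdigest() == expected_sha256
```

A typical use would be `sha, size = write_image(image, device)` followed by `verify_target(device, sha, size)`, with the image path pointing at network storage mounted from the USB boot environment. Note that the read-back may be served from the operating system's cache rather than the disk itself, so it is a sanity check, not proof of a good write.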
In short, you need to put yourself in a position where every server looks like every other. Divergent cases should be automated so that they resume their prior configuration shortly after being reinstalled, without manual intervention.
In addition to being automated, your most critical servers should be doubly, nay, triply backed up: their configurations (IP addresses, files, everything) documented and automated, the normal tape backup rotations in place, and frequent OS disk images taken to ensure fast recovery in the event of any disaster. Your job will be easier, your company will be happier, and therefore your life will be easier.