A server crash is always a nightmare. Follow these techniques to track down the source of the problem and get back online as soon as possible.
Have you ever come into the office in the morning and discovered that one of your servers crashed in the middle of the night? When this happens, there's only one thing to do: hold your breath, reboot the server, and pray that it comes back up. Often you'll get lucky, and the crash was merely a fluke or the result of a minor change made the day before that can be easily corrected. Sometimes, though, a server crash indicates a very real problem. In this article, I'll discuss some techniques that you can use to recover should your Windows 2000 server fail to boot. (For information on server repairs, see the article "Repairing Windows 2000 Through The Recovery Console.")
Before You Begin
Recovering a server is a very touchy subject for many people. I can't count the number of times when clients have called me to come fix a server that has crashed. During many of these recoveries, the concerned client hovered over me. People get really nervous when a server goes down--they lose productivity while the server is down, but more importantly, they may lose the data stored on the server.
Because of the sensitive nature of the server and of the people who own the server, I tend to take a minimalist approach during recovery. Formatting the hard disk and reloading the operating system from scratch will get a server back online every time (assuming the hardware is functional), but doing so isn't the preferred recovery method--it destroys the settings, user accounts, data, and anything else stored on the server. Instead, even though it can be tedious, I start with the simplest possible causes and work toward the more complex possibilities. Working in this order means that you try the least invasive procedures first, which preserves the integrity of the server's settings and data as much as possible.
When I'm called to look at a server that has crashed, I typically begin by asking the people involved what has changed recently. Almost always, the answer is, "Nothing." However, if you ask more detailed questions--has the client added any new drivers, service packs, hot fixes, or hardware?--you'll often get a more helpful answer. Usually, the most recent change to the server is related to the problem. If you can find out what this change is, you'll be much more likely to fix the problem quickly than if you had to figure out the cause entirely on your own.
What Tools Are Available?
VGA Mode or Boot Logging
Usually when I write an article like this, I receive dozens of email messages pointing out tools that I have forgotten. In this case, such tools might include VGA Mode or boot logging. However, these tools are both automatically implemented through Safe Mode, as are a few others. Debugging mode is also available. However, given that you could write an entire book on debugging mode, and that you need a PhD to understand the output, I won't cover debugging mode in this article.
When a server won't boot, the first trick to getting it to boot is to know which tools are available for correcting the problem. For example, here are a few things you have available during most crashes (later, I'll explain how these tools may come in handy):
- The error itself
- Last Known Good
- Safe Mode
- Recovery Console
- Emergency Repair Disk (ERD)
The Error Itself
Usually, when Windows 2000 won't boot, the OS goes partially through the boot sequence before failing. When the boot process fails, the system will lock up completely, go to a blank screen (or reboot), or generate a blue screen of death. The type of failure can help you to know what to look for. For example, if the system boots to a blank screen, it usually indicates a corrupt or incorrect video driver (or a video driver that's set to the wrong resolution).
If the system continually reboots, it often means that a PCI card has vibrated slightly loose. To fix such a problem, power down the machine, open the case, and reseat all the PCI cards. After you do so, the computer will usually restart correctly. However, when a card vibrates loose, some of the system's Plug and Play information may become corrupted. To correct this problem, boot to Safe Mode and go to the Device Manager. Once in Device Manager, remove any references to the various PCI cards; then reboot the machine and let it redetect the cards.
If the system boots to a blue screen, it's often due to an incorrect driver or a hardware failure. Fortunately, each blue screen contains an error message that points to a specific condition. If you are experiencing a blue screen, your best bet is to check the Windows 2000 Server Resource Kit or to look on the Internet for the meaning of the error you're receiving.
Perhaps the toughest error to correct is one where the system locks up during the boot process. When this happens, the cause usually lies in the actual Windows system files. For example, a DLL file may have been accidentally changed to an incorrect version. In such a situation, you should use the System File Checker to correct the problem. We'll discuss the System File Checker in greater detail later on.
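The four failure types described above can be summarized as a simple lookup table. The following Python sketch is only an illustration: the symptom keys, the `Triage` type, and the `triage` function are hypothetical names invented here, and the causes and first steps are paraphrased from this article rather than taken from any Microsoft tool.

```python
from typing import NamedTuple


class Triage(NamedTuple):
    likely_cause: str
    first_step: str


# Symptom keys are hypothetical labels chosen for this sketch.
TRIAGE_BY_SYMPTOM = {
    "blank_screen": Triage(
        "corrupt or incorrect video driver, or wrong resolution",
        "boot with the standard VGA driver (Safe Mode or VGA Mode)",
    ),
    "continual_reboot": Triage(
        "a PCI card that has worked loose",
        "reseat the PCI cards, then let Device Manager redetect them",
    ),
    "blue_screen": Triage(
        "incorrect driver or hardware failure",
        "look up the error message shown on the blue screen",
    ),
    "lockup_during_boot": Triage(
        "damaged or mismatched Windows system files",
        "run the System File Checker",
    ),
}

# Fallback for anything the table does not cover.
UNKNOWN = Triage(
    "not covered by this table",
    "ask what changed recently on the server",
)


def triage(symptom: str) -> Triage:
    """Return the likely cause and a first step for a boot-failure symptom."""
    return TRIAGE_BY_SYMPTOM.get(symptom, UNKNOWN)


if __name__ == "__main__":
    for symptom in TRIAGE_BY_SYMPTOM:
        result = triage(symptom)
        print(f"{symptom}: {result.likely_cause} -> {result.first_step}")
```

A table like this is only a starting point; each entry says "usually" or "often" in the text above, so treat the result as the first thing to check, not a diagnosis.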
Last Known Good
The Last Known Good option is often handy for getting a damaged system up and running quickly. Every time Windows 2000 boots and a user logs on successfully, Windows saves a snapshot of the configuration it booted with. To use that snapshot, press F8 when the boot process offers the advanced startup options, and then choose Last Known Good Configuration from the menu. (The Press the Spacebar for Last Known Good Configuration prompt belongs to Windows NT 4.0; Windows 2000 moved the option to the F8 menu.) Windows then boots using the same configuration it used during the last successful boot, which discards any configuration changes made since then.
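The snapshot idea can be illustrated with a toy model. The sketch below is a deliberate simplification, not a description of how Windows stores its configuration: the `ServerConfig` class, its method names, and the driver names are all hypothetical. It shows why Last Known Good helps after a bad change, and why it stops helping once a boot with the bad change has been recorded as successful.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServerConfig:
    """Toy model: a current configuration plus one saved snapshot."""

    current: Dict[str, str] = field(default_factory=dict)
    last_known_good: Dict[str, str] = field(default_factory=dict)

    def change(self, setting: str, value: str) -> None:
        # A driver install or similar change touches only the current config.
        self.current[setting] = value

    def record_successful_boot(self) -> None:
        # A successful boot overwrites the snapshot with the current config.
        self.last_known_good = dict(self.current)

    def boot_last_known_good(self) -> None:
        # Booting with Last Known Good discards changes made since the snapshot.
        self.current = dict(self.last_known_good)


if __name__ == "__main__":
    server = ServerConfig()
    server.change("video_driver", "working_driver")
    server.record_successful_boot()

    server.change("video_driver", "bad_driver")  # the next boot fails
    server.boot_last_known_good()
    print(server.current["video_driver"])
```

In this model, calling `record_successful_boot()` after installing the bad driver would overwrite the good snapshot, which is why Last Known Good should be tried before anything else once a boot has failed.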
Safe Mode
When it comes to recovering a server, Safe Mode can be your best friend. If you've ever struggled through fixing a Windows NT 4.0 Server that won't boot, then you can truly appreciate Safe Mode. Safe Mode is an option that loads Windows 2000 using a minimal set of drivers. For example, instead of using your normal video driver, Windows 2000 loads using the standard VGA driver (just like VGA Mode). Safe Mode also disables features such as networking, the CD-ROM drive, and your sound card. You can access Safe Mode by pressing the F8 key as soon as you see the Starting Windows 2000 Server message during startup. When you do, you'll see the Windows 2000 Advanced Options Menu. The menu contains an option for Safe Mode and a few variations, such as Safe Mode with Networking, along with related options such as Enable VGA Mode.
If Windows 2000 boots in Safe Mode, you can relax--the problem is usually not serious. If the system will boot in Safe Mode but not in Normal mode, it almost always indicates a bad device driver or a hardware conflict. The real trick is diagnosing which hardware device is causing the problem. Begin by looking in your event logs for a clue. If the event logs are no help, you can use the process of elimination. To do so, go into the Device Manager and disable every device that would normally be disabled in Safe Mode. Now, reboot the machine in Normal mode. If the machine boots properly, enable one of the devices that you disabled and reboot. Repeat this process, enabling one device at a time, until you find the device that's causing the problem.
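The process of elimination described above is a simple loop. Here is a minimal Python sketch of it, assuming a hypothetical `boots_ok` callback that stands in for the manual step of rebooting in Normal mode and seeing whether the server comes up; the function name and device names are illustrative only.

```python
from typing import Callable, Iterable, Optional, Set


def find_bad_device(
    devices: Iterable[str],
    boots_ok: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Enable devices one at a time and return the first one that breaks boot.

    boots_ok receives the set of currently enabled devices and reports
    whether a Normal-mode boot succeeds. Returns None if the server fails
    to boot even with every device disabled, or if no device breaks it.
    """
    enabled: Set[str] = set()

    # Start with everything disabled, as Safe Mode would have it.
    if not boots_ok(enabled):
        return None  # the problem is not one of these devices

    for device in devices:
        enabled.add(device)
        if not boots_ok(enabled):
            return device  # the device just enabled is the culprit

    return None


if __name__ == "__main__":
    suspects = ["network adapter", "sound card", "CD-ROM controller"]
    # Simulated server on which the sound card driver prevents booting.
    simulated_boot = lambda enabled: "sound card" not in enabled
    print(find_bad_device(suspects, simulated_boot))
```

Enabling one device per reboot is slow, but it identifies the culprit unambiguously. Enabling several at once saves reboots only when none of them turns out to be the problem.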
If checking the devices doesn't cure the problem, you can try to use the System File Checker to test the integrity of your critical system files. To do so, go to a command prompt and enter the command sfc /scannow. Doing so will begin the process of testing your critical system files. You can access other System File Checker options with the sfc /? command.
Recovery Console
I discuss the Recovery Console in the article "Repairing Windows 2000 Through The Recovery Console," so I won't go into exhaustive detail here. Briefly, the Recovery Console is a utility that you can use to access the system's disks from a command prompt without starting Windows. The Recovery Console is useful for tasks like replacing files that are missing or are the incorrect version. You can also use the Recovery Console to repair logical damage to the hard disk. Because the Recovery Console isn't installed by default, you must have previously installed it to access it, or you can use the Windows 2000 boot floppies to access it.
If you suspect hard-disk damage, simply load the Recovery Console and get to the command prompt. Several available commands are specifically designed for correcting hard-disk corruption issues. For example, the chkdsk command is an all-purpose utility that will correct most common hard-disk corruption problems; in the Recovery Console it takes the /p and /r switches rather than the /f switch used by the regular Windows version. You can also use the fixmbr command to repair the master boot record. Another handy command is fixboot, which will repair the hard disk's boot sector. Other, potentially more destructive commands, such as format and diskpart, are also available.
Emergency Repair Disk
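Another option available to you is the ERD. Unlike Windows NT 4.0, Windows 2000 doesn't create the ERD during the Setup process; you create it yourself from the Backup utility. The Windows 2000 ERD holds a small set of files that the repair process uses (Setup.log, Autoexec.nt, and Config.nt) rather than the Registry itself; when you create the disk, Backup offers to save a copy of the Registry to the repair directory on the hard disk. Unfortunately, the ERD and that Registry copy are useless unless you keep them updated, because the system's configuration is constantly updated as you make changes to the system.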
If you have an up-to-date ERD, simply boot the system off the Windows 2000 boot floppies. When you reach the point at which Setup asks if you want to set up Windows 2000 or repair an existing installation, choose the repair option and follow the prompts. Before the repair process begins, you'll be asked if you want to use a fast repair or a manual repair:
- A fast repair automatically updates some key system files, the Registry, the boot sector, and the startup environment.
- A manual repair is most useful when you already know what's wrong with the system and you don't want to overwrite anything you don't have to.
A Full Restore
I always hesitate to use the full-restore option, because you'll lose any data added to the server since the last backup. However, if you have determined that all of the system's hardware is OK, but you still can't get the system to work, restoring from backup may be your last option.
Unfortunately, you can't restore a backup unless Windows 2000 is functional. Therefore, you should format your hard disk and load Windows 2000 from scratch. During this reload, you should load Windows 2000 as a standalone server that isn't a domain controller or a part of a domain. You should also load Windows into a different directory than usual, to avoid interfering with the copy you're about to restore. Once you've loaded this temporary copy of Windows, you can restore your backup. Make sure you restore the System State data along with the usual files. Once the restore has completed, reboot the server. You should now have a functional server. When you're done, you can delete the temporary copy of Windows.
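Taken as a whole, the approach in this article is an escalation ladder: try the least invasive method first, and move to the next only when it fails. The Python sketch below models that ordering. The `RECOVERY_ORDER` list follows the order in which the tools are presented here, and the `attempt` callback is a hypothetical stand-in for a person actually carrying out each method and checking whether the server then boots.

```python
from typing import Callable, Optional, Sequence

# Least invasive first, in the order the tools are presented in this article.
RECOVERY_ORDER = [
    "Last Known Good",
    "Safe Mode",
    "Recovery Console",
    "Emergency Repair Disk",
    "Full restore from backup",
]


def first_successful(
    order: Sequence[str],
    attempt: Callable[[str], bool],
) -> Optional[str]:
    """Try each recovery method in order; return the first that works."""
    for method in order:
        if attempt(method):
            return method  # stop here: nothing more invasive is needed
    return None  # every method failed; suspect the hardware


if __name__ == "__main__":
    # Simulated case in which only the last two methods would fix the server.
    would_work = {"Emergency Repair Disk", "Full restore from backup"}
    print(first_successful(RECOVERY_ORDER, lambda m: m in would_work))
```

Stopping at the first method that works is the point of the ordering: in the simulated case, the full restore would also have fixed the server, but at the cost of any data added since the last backup.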