Beyond RAID 5: Mirroring your way to fault-tolerant storage

by Joel Leider

Standard RAID systems still leave single points of failure, so here are three mirroring methods to protect you from the dangers of component failure.

Most network and system managers prefer RAID disk arrays because they provide a measure of protection against drive failures. However, those who must keep the shop running without interruption require servers and storage with a higher standard of fault tolerance: no single point of failure. Unfortunately, standard RAID systems simply do not provide this measure of protection. Further, these enterprises typically run business-critical database applications that are seldom, if ever, closed--even for backup--and often run 24 hours a day, 7 days a week.

Disks, power supplies, fans, and computers all fail--we just want them to fail gracefully and not take the enterprise or our data with them. It is the network manager's job to anticipate the costs of these predictable failures and compare them to the costs of prevention. How can you add extra layers of protection in network server environments where the costs of data loss and downtime are very high? In these situations, the cost of redundant equipment is low compared to the anticipated cost of downtime or data loss.

Becoming Better than No Single Point of Failure

To reduce downtime due to component failures, consider one of three methods of mirroring your RAID 5 storage systems--each a cost-effective way to protect critical operations:

Level 1: Storage redundancy
Level 2: Server redundancy
Level 3: Clustering

Each method protects your data even if an entire RAID array fails: all three offer no single point of failure at the storage array and, optionally, at the server as well.

Level 1: Storage redundancy

The first method, storage redundancy, provides a storage architecture designed to deliver full fault tolerance through component redundancy. Unlike typical storage arrays, this Level 1 system avoids the single failure points commonly found in them. The design uses a pair of RAID 5 systems, each connected to a server whose operating system supports host-based mirroring. Operating systems that perform mirroring include NetWare, Windows NT, Solaris, HP-UX, AIX, OpenVMS (Volume Shadowing), Digital UNIX, SGI IRIX, and others. Figure 1 shows how simple this design is to set up.

Figure 1: Level 1 Storage redundancy
A RAID system can withstand multiple failures at the same time.
Source: Joel Leider

Unlike a standard RAID 5 array, this method can withstand three simultaneous drive failures and continue to run properly--all transparently to system users. Each RAID 5 array can sustain a drive failure and continue operating from parity information. If a hot spare is present or a replacement drive is inserted, the system continues with one or even both data rebuilds in progress simultaneously. During this critical window, another drive can fail and take down an entire RAID 5 array. That event would normally disable the server, cut off all users, and risk data loss. This Level 1 system, however, keeps running: one RAID 5 array with a failed drive is sufficient to run the server and keep data continuously accessible to users. You have ample time to repair the disabled RAID 5 array and avoid potential data loss. With optional hot-spare drives installed, the Level 1 array can withstand the subsequent failure of up to two additional drives.
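The failure arithmetic above can be checked with a minimal sketch (illustrative only, not vendor code): model each RAID 5 side as surviving at most one drive loss, and the mirrored pair as surviving whenever either side survives.

```python
def raid5_alive(failed_drives: int) -> bool:
    """A RAID 5 array keeps serving data with at most one failed drive."""
    return failed_drives <= 1

def mirror_alive(failed_a: int, failed_b: int) -> bool:
    """The mirrored pair survives as long as either RAID 5 side survives."""
    return raid5_alive(failed_a) or raid5_alive(failed_b)

# One drive fails in each array: both sides are degraded, data is still served.
assert mirror_alive(1, 1)

# A third failure takes down array B entirely; array A alone carries on.
assert mirror_alive(1, 5)

# Only when both sides lose more than one drive is data at risk.
assert not mirror_alive(2, 2)
```

Hot spares extend this further by returning a degraded side to full redundancy before the next failure arrives, which is how the configuration tolerates additional drive losses over time.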

This Level 1 array can also withstand the failure of multiple fans and multiple power supplies. Each RAID 5 array is typically connected to a separate uninterruptible power supply (UPS) and has redundant AC connections. An AC power line to either array can fail, a UPS can fail during a power outage, or an entire RAID 5 array can fail completely--and the mirrored RAID 5 arrays will continue to operate.

Level 2: Server redundancy

Figure 2: Level 2 & 3 Server redundancy/share data
Level 2 adds the ability to withstand a server failure and multiple host bus failures; Level 3's optional cluster links let the hosts coordinate and share data.
Source: Joel Leider

The multi-host Level 2 configuration takes advantage of a RAID array's multi-hosting capabilities. Each server is connected to both RAID 5 arrays. In most cases, the second server is a standby server, ready to take over if the first server fails. This configuration also withstands the failure of a bus, because the two RAID arrays are connected to each server via separate buses. Bus hang-ups do occur, and although rebooting easily clears them, that practice may prove unacceptable in environments where users expect the system to be in full operation. Thus, the environment continues operating despite the loss of a server, a bus, or a storage element.
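The standby-takeover idea can be sketched as follows. This is a hedged illustration of the logic, not a vendor failover product; the `Server` class and `elect_active` function are hypothetical names introduced here.

```python
class Server:
    """Illustrative model of one host in the Level 2 pair."""
    def __init__(self, name: str, active: bool):
        self.name = name
        self.active = active
        self.healthy = True

def elect_active(primary: Server, standby: Server) -> Server:
    """Return the server that should own the mirrored RAID 5 arrays."""
    if primary.healthy:
        primary.active, standby.active = True, False
        return primary
    # Primary failed: the standby takes over. Because both RAID 5 arrays
    # are already cabled to both hosts, no storage recabling is needed.
    primary.active, standby.active = False, True
    return standby

primary = Server("alpha", active=True)
standby = Server("beta", active=False)

assert elect_active(primary, standby) is primary   # normal operation

primary.healthy = False                            # primary server fails
assert elect_active(primary, standby) is standby   # standby takes over
```

In practice the health check would be a heartbeat over the network or cluster link; the point is that storage connectivity never has to change during takeover.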

Level 3: Clustering

Sharing data between multiple servers lets network managers distribute workloads across servers without having to decide arbitrarily how to split up the data. When combined with mirrored RAID 5 storage, this Level 3 capability offers no single point of failure in the environment. It also eliminates the obvious idleness of the standby server. Operating systems that support this environment include OpenVMS, Digital UNIX, HP-UX, AIX, and Solaris. This architecture creates a server environment with true no-single-point-of-failure fault tolerance, maximizes the utility of all installed components, and promotes easy access to shared data. Figure 2 illustrates the Level 2 and Level 3 mirroring methods.
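The workload-distribution benefit can be sketched in a few lines (an illustrative assumption, not a description of any vendor's cluster software): because every server can reach the shared, mirrored storage, requests can be assigned to any live server and reassigned when one fails, with no data-partitioning decisions.

```python
def assign(requests, servers):
    """Spread requests round-robin across the servers currently alive.

    `servers` maps server name -> alive flag; any live server can handle
    any request because all of them see the same shared storage.
    """
    live = [name for name, alive in servers.items() if alive]
    return {req: live[i % len(live)] for i, req in enumerate(requests)}

servers = {"alpha": True, "beta": True}
work = ["q1", "q2", "q3", "q4"]

both = assign(work, servers)
assert set(both.values()) == {"alpha", "beta"}   # load shared, no idle standby

servers["beta"] = False                          # a server fails
alone = assign(work, servers)
assert set(alone.values()) == {"alpha"}          # survivor absorbs the work
```

Contrast this with a partitioned design, where deciding which data lives on which server is itself a planning burden and a rebalancing problem after a failure.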

As chief executive officer of Winchester Systems, Joel Leider oversees marketing and financial operations at the Woburn, Mass.-based company. Over the years, Leider has worked closely with sales to develop an elite group of consultative telesales professionals who walk and talk to the beat of storage. His articles on marketing and telesales appear in course materials used by third-party sales training organizations.

This article was originally published on Monday, June 12, 2000.