So Far Away and Yet So Close: High Availability Meets Remote Management

by Jacqueline Emigh

How about a world where 'bluescreened' servers can be recovered over the network? Intel-based high availability systems got some attention at Microsoft's WinHEC.

Over the past week or so, Microsoft and some of its allies have made a strong push to promote server enhancements geared to Windows crash prevention. In a pitch to network managers, NEC, for example, is positioning its emerging lineup of fault tolerant (FT) servers as an alternative to clustering in any situation where high system uptime is deemed "mission-critical."

About a year from now, NEC will introduce a three-CPU FT server to the US market, according to Mike Mitsch, NEC's director of enterprise computing. Available since last fall in one- and two-processor configurations, the NEC Express5800/ft320La server comes with remote management features that are "unique in the industry," Mitsch contended during a presentation to Windows systems managers last Thursday.

NEC is already selling a three-processor FT server in Japan. The US edition will be a blade server. "We want to use the three-processor architecture to differentiate our blade server from other (future) blade servers," he said.

NEC was also on hand earlier last week at Microsoft's WinHEC show, along with the two other players now populating the small universe of Intel-based FT hardware. Microsoft's announcement of intentions to produce Bluetooth-enabled peripherals drew the lion's share of industry notice at WinHEC, a trade show targeting hardware developers.

During conference sessions at WinHEC, though, Microsoft officials talked about some of the issues that have long plagued network managers, including blue screens and reboots.

Mario Garzia of Microsoft released results of a survey that Microsoft reportedly conducted across 4,000 servers operated by 20 customers. Only 65 percent of all Windows NT reboots recorded in the survey were planned reboots. Also according to the results, however, the proportion of unplanned reboots shrank to 3 percent on Windows XP servers.

"The perception is that software is the cause of most failures of Windows systems. This is changing as the operating system is becoming more reliable. Hardware is a more significant problem," according to Microsoft Program Manager Sandy Arthur, another speaker at WinHEC.

The Microsoft execs also acknowledged issues with the core OS, as well as with application failures, third-party filter and device drivers, hardware reconfigurations, antivirus software, and "operator errors."

In his talk to network managers on Thursday, Mitsch also pointed to problems in some Intel-based hardware, blaming "three-month" design cycles as the big culprit. "There can be progressive hardware degradation," he said.

These days, many network managers think applications like e-mail, Web access, and databases need 24/7 availability, he suggested.

Mitsch added that one of NEC's customers used to try to keep Windows NT from crashing by doing "preventive reboots" every night.

Also according to Mitsch, though, NEC's servers manage to provide high availability by combining elements such as a redundant "lockstep" CPU architecture, "hardened" device drivers from its partner Stratus, and "NEC's own software" for local and remote system management.

"Lockstepped" CPUs are synchronized to a single clock, and their instruction streams are supposed to be the same. The 5800 also features redundant I/O ports, memory, hard disk drives, and power supply components, in pedestal and rackmount form factors.
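The lockstep idea can be illustrated with a toy comparison of two result streams. This is only a sketch in Python, not a description of how the Express5800 hardware works: the function name first_divergence and the use of plain lists to stand in for per-cycle CPU results are assumptions made for illustration.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for "one stream ended early"

def first_divergence(stream_a, stream_b):
    """Return the index of the first step where two result streams differ.

    Hypothetical helper: each stream stands in for the per-cycle results
    of one lockstepped CPU. Returns None if the streams match throughout.
    """
    pairs = zip_longest(stream_a, stream_b, fillvalue=_MISSING)
    for step, (a, b) in enumerate(pairs):
        if a != b:
            return step
    return None

# Two CPUs that agree on every step, then a pair that disagrees at step 1.
print(first_divergence([1, 2, 3], [1, 2, 3]))
print(first_divergence([1, 2, 3], [1, 9, 3]))
```

In a real lockstep design the comparison happens in hardware on every clock cycle; the sketch only shows the principle that identical inputs should yield identical outputs, so any mismatch signals a fault.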

NEC released the one-CPU version of the 5800 in the US last September, and the two-CPU edition in November. Both these machines were also rolled out in the Japanese market first.

"NEC and Stratus co-developed the fault tolerant server. Stratus produced the overall design. NEC manufactures the CPUs," said Mitsch. The 5800 uses 800 MHz Pentium III processors.

Marathon produces a competing FT Intel-based server, the Endurance 6200L, which uses two CPUs in a four-server crossbar configuration.

For its part, Stratus sells the ftServer 3200, a machine almost identical to the NEC Express5800. So far, Stratus has concentrated mainly on existing FT markets such as finance, government, and telecommunications. Outside of its long-time line-up of Unix FT servers, Stratus also offers the ftServer 5200, a larger Intel-based FT server which uses Pentium III Xeon processors.

NEC, though, is eyeing a number of industries that are new to the FT concept, including retail and small business. The chief difference between the Express5800 and Stratus' ftServer 3200 is on the service side.

Stratus is bundling its own remote monitoring and management services with ftServer 3200. The 3200 monitors its own operations, reporting any exceptions to a customer assistance center. Service representatives there use the company's Stratus Service Network (SSN) for remote trouble-shooting and management.

With its FT server, on the other hand, NEC is rolling in system management software designed for use by other management service out-sourcers, as well as by corporate Windows administrators.

For one thing, system management and maintenance can be performed over either one or both of the 5800's dual redundant links, according to Mitsch. From remote locations, administrators can upgrade firmware or switch between CPUs, for example. SNMP-compliant network management software is included, too.

NEC, Stratus and Marathon are all claiming "five nines" (99.999 percent) uptime for their products, in contrast to the 99.9 percent uptime often attributed to clusters. Mitsch maintained that the Express5800 copies memory between processors while the system is running, for virtually uninterrupted service. According to an IDC white paper, server downtime of more than 5 minutes per year (the level associated with 99.999 percent availability) is unacceptable to 80 percent of IT sites currently using clustering.
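To put the two uptime figures side by side, the arithmetic can be sketched in a few lines of Python. The function name annual_downtime_minutes is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for label, pct in [("three nines", 99.9), ("five nines", 99.999)]:
    minutes = annual_downtime_minutes(pct)
    print(f"{label} ({pct}%): about {minutes:.1f} minutes of downtime per year")
```

By this reckoning, 99.9 percent availability allows roughly 525.6 minutes (about 8.8 hours) of downtime a year, while 99.999 percent allows a little over 5 minutes.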

Two NEC customers, though, are actually using the 5800 in conjunction with clustering, Mitsch said.

Meanwhile, many software programs still aren't enabled for clustering, according to Mitsch. "Microsoft SharePoint won't be ready for clustering for quite some time," he told the network managers.

NEC's current FT product runs Microsoft Windows 2000 Advanced Server. "We are not sure yet where we'll stand with the .NET servers, because Microsoft plans to offer five different versions of the OS," he pointed out.

According to Mitsch, the forthcoming three-CPU model will use a "voted" model, in which the least capable CPU in the trio will be voted out. CPUs with lower error rates will be more likely to withstand the vote, as will "older" CPUs. The third CPU will remain available to step in, though, in case one of the other CPUs goes down.
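NEC has not published the details of this voting scheme, so the following Python sketch is purely illustrative. It assumes that "least capable" means the CPU with the most recorded errors, with ties going against the CPU that has spent the least time in service. The Cpu class, its fields, and the vote_out function are hypothetical names, not part of any NEC software.

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    name: str
    error_count: int       # hypothetical: errors observed so far
    hours_in_service: int  # hypothetical: "older" CPUs have more hours

def vote_out(cpus):
    """Pick the CPU to remove from the trio.

    Assumed rule: the highest error count loses; on a tie, the CPU with
    the fewest hours in service loses, so "older" CPUs tend to survive.
    """
    return max(cpus, key=lambda c: (c.error_count, -c.hours_in_service))

trio = [
    Cpu("cpu0", error_count=0, hours_in_service=900),
    Cpu("cpu1", error_count=2, hours_in_service=900),
    Cpu("cpu2", error_count=0, hours_in_service=100),
]
print(vote_out(trio).name)  # cpu1 has the most errors, so it is voted out
```

If all three CPUs had clean error records, the same rule would vote out cpu2, the newest of the three.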

With current pricing that starts at about $17,000, the 5800 is much less costly than FT systems from Unix competitors. Still, some users remain doubtful that any Windows system can provide the uptime they need, based on their own prior experiences with NT.

"Is there any way I can just send up all my crash reports to Microsoft by default?" asked one network manager.

"What's the difference between now and the days of Windows NT? Microsoft kept promising higher system availability all the time back then, too," another administrator observed.

This article was originally published on Tuesday Apr 23rd 2002