Many options exist for setting up clustered and highly available storage, but figuring out what each does will take a bit of research. Your choice of storage architecture as well as file system is crucial, as most have severe limitations that require careful design workarounds.
In this article we will cover a few common physical storage configurations, as well as clustered and distributed file system options. We hope it serves as a good starting point for finding the technology that will work best for your high availability storage needs.
Some readers may wish to configure a cluster of servers that simply have concurrent access to the same file system, while others want to replicate storage and provide both concurrent access and redundancy. There are two ways to give multiple servers access to the same data: let every server see the same disks, or replicate the data between them.
Shared-disk configurations are most common in the Fibre Channel SAN and iSCSI worlds. It is quite simple to configure storage systems such that multiple servers can see the same logical block device, or LUN, but without a clustered file system, chaos will ensue if more than one server tries to use it at the same time. This problem is dealt with by using clustered file systems, which we will cover in a moment.
Generally speaking, shared-disk setups have a single point of failure: the storage system. This is not always true, however, as "shared-disk" is a confusing term with today's technology. SANs, NAS appliances, and commodity hardware running Linux can all replicate the underlying disks in real time to another storage node, which provides a simulated shared-disk environment. Since the underlying block devices are replicated, the nodes have access to the same data and each runs a clustered file system, but this replication breaks the traditional shared-disk definition.
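To make the "simulated shared-disk" idea concrete, here is a minimal Python sketch of synchronous block replication between two storage nodes. It is not modeled on any particular product; the class names StorageNode and ReplicatedDevice are hypothetical, and real replication has to handle network failures and resynchronization, which this sketch ignores.

```python
class StorageNode:
    """A storage node holding its own private copy of every block."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block number -> bytes

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks.get(block_no)


class ReplicatedDevice:
    """Presents two nodes as one logical block device (hypothetical mirror)."""

    def __init__(self, primary, secondary):
        self.nodes = [primary, secondary]

    def write(self, block_no, data):
        # The write is acknowledged only after every replica has stored it.
        for node in self.nodes:
            node.write(block_no, data)
        return True

    def read(self, block_no, from_node=0):
        # Either replica can serve the read because both hold the same data.
        return self.nodes[from_node].read(block_no)


node_a = StorageNode("a")
node_b = StorageNode("b")
lun = ReplicatedDevice(node_a, node_b)
lun.write(7, b"payload")
same = lun.read(7, from_node=0) == lun.read(7, from_node=1)
print(same)
```

Each node has its own disks, yet a server attached to either one sees the same block contents, which is why such setups behave like shared-disk from the file system's point of view.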
Shared-nothing, in contrast, was the original answer to shared-disk single points of failure. Nodes with distinct storage would notify a master server of changes as each block was written. Nowadays, shared-nothing architectures still exist in file systems like Hadoop's HDFS, which purposely creates multiple copies of data across many nodes for both performance and redundancy. Clusters that employ replication between storage devices, or between nodes with their own storage, are said to be shared-nothing as well.
Earlier we said that multiple servers cannot safely use the same block device, without really explaining why. You always hear about file system locking, so it seems strange that normal file systems cannot handle this, right?
At the file system level, the file system locks files to protect the data against conflicting access. But those locks live in one server's memory and are invisible to any other server writing to the same device. At the operating system level, the file system driver has full access to the underlying block device and is free to roam across it. Most file systems assume that the block device they are given is theirs and theirs alone.
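The failure mode is easy to reproduce in a toy model. The following Python sketch is purely illustrative: SharedDisk and NaiveMount are hypothetical names, and the "file system" is reduced to a free-block bitmap that each server caches at mount time, on the assumption that nobody else will change it.

```python
class SharedDisk:
    """One block device visible to several servers."""

    def __init__(self, size):
        self.free = [True] * size   # on-disk free-block bitmap
        self.data = [None] * size   # on-disk block contents


class NaiveMount:
    """A non-clustered file system: caches metadata, assumes sole ownership."""

    def __init__(self, disk, owner):
        self.disk = disk
        self.owner = owner
        # Read once at mount time and never refreshed.
        self.cached_free = list(disk.free)

    def write_file(self, payload):
        # Pick the first block this server believes is free.
        block = self.cached_free.index(True)
        self.cached_free[block] = False
        self.disk.free[block] = False
        self.disk.data[block] = (self.owner, payload)
        return block


disk = SharedDisk(size=4)
server1 = NaiveMount(disk, "server1")
server2 = NaiveMount(disk, "server2")

b1 = server1.write_file("invoice")
b2 = server2.write_file("log entry")  # server2's cache still says block 0 is free

print(b1, b2)        # both servers chose the same block
print(disk.data[0])  # server1's data has been silently overwritten
```

Neither server did anything wrong by its own rules; the corruption comes entirely from each one trusting a private view of shared state.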
To get around this, clustered file systems implement a mechanism for concurrency control. Some clustered file systems store metadata within a partition of the shared device, and some choose to use a centralized metadata server. Both approaches give all nodes in the cluster a consistent view of the state of the file system, which allows safe concurrent access. The model with the central metadata server, however, is sub-optimal if your goal is high availability and eliminating single points of failure.
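As a rough illustration of what concurrency control buys you, the Python sketch below has every node take a cluster-wide lock and consult the current shared metadata before allocating a block. All names are hypothetical, and an in-process threading.Lock stands in for what would really be a distributed lock manager spanning several machines.

```python
import threading


class ClusterLockManager:
    """Stand-in for a cluster-wide lock service; here just an in-process lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def exclusive(self):
        return self._lock


class SharedDevice:
    def __init__(self, size):
        self.free = [True] * size   # authoritative metadata on the shared device
        self.data = [None] * size


class ClusterNode:
    def __init__(self, name, device, locks):
        self.name = name
        self.device = device
        self.locks = locks

    def write_file(self, payload):
        # Take the cluster lock, then read the current metadata, never a stale cache.
        with self.locks.exclusive():
            block = self.device.free.index(True)
            self.device.free[block] = False
            self.device.data[block] = (self.name, payload)
            return block


device = SharedDevice(size=8)
locks = ClusterLockManager()
nodes = [ClusterNode(f"node{i}", device, locks) for i in range(4)]

allocated = []
threads = [
    threading.Thread(target=lambda n=n: allocated.append(n.write_file("x")))
    for n in nodes
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(allocated))  # four distinct block numbers
```

Whether the lock state and metadata live on the shared device or on a central server is exactly the design choice described above; the sketch only shows why some shared arbiter is needed.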
One final note: the clustered file systems model requires swift action when a node does something wrong. If a node writes bad data or stops communicating its metadata changes for some reason, other nodes need to be able to "fence" the offender. Fencing is accomplished in many ways, most often using lights-out management interfaces. Healthy nodes will Shoot The Other Node In The Head (STONITH), or yank its power, at the first sign of inconsistencies to preserve the data.
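The fencing decision itself can be sketched in a few lines of Python. This is an assumption-laden toy: the heartbeat timestamps, the timeout value, and the power_off function are all hypothetical, with power_off standing in for whatever lights-out management call a real cluster would make.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.powered = True
        self.last_heartbeat = 0.0


def power_off(node):
    """Hypothetical stand-in for a lights-out management call that cuts power."""
    node.powered = False
    return True


def fence_stale_nodes(nodes, now, timeout):
    """Fence every powered node whose last heartbeat is older than `timeout` seconds."""
    fenced = []
    for node in nodes:
        if node.powered and now - node.last_heartbeat > timeout:
            # Recovery may only proceed once the power-off is confirmed.
            if power_off(node):
                fenced.append(node.name)
    return fenced


a, b, c = Node("a"), Node("b"), Node("c")
a.last_heartbeat = 98.0
b.last_heartbeat = 99.5
c.last_heartbeat = 60.0   # node c went silent 40 seconds ago

fenced = fence_stale_nodes([a, b, c], now=100.0, timeout=10.0)
print(fenced)
```

The important property is the ordering: the surviving nodes do not touch the silent node's share of the file system until they know it can no longer write.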
Clustered File Systems
- GFS: Global File System.
  - GFS, available in Linux, is one of the most widely used clustered file systems. Now developed by Red Hat, GFS allows concurrent access by all participating cluster nodes. Metadata is generally stored on a partition of the shared (or replicated) storage.
- OCFS: Oracle Cluster File System.
  - OCFS is conceptually very much like GFS, and OCFS2 is now available in Linux.
- VMFS: VMware's Virtual Machine File System.
  - VMFS is the clustered file system that ESX Server uses to give multiple servers access to the same shared storage. This makes virtual machine migration (to different servers) seamless, as the same storage is accessible at the source and destination. Journals are distributed, and there is no single point of failure between the ESX servers.
- Lustre: Sun's clustered, distributed file system.
  - Lustre is a distributed file system designed to work with very large clusters containing thousands of nodes. Lustre is available for Linux, but its applications outside the high performance computing circle are limited.
- Hadoop: a distributed file system, modeled on the one Google uses.
  - This is not a clustered file system, but rather a distributed one. We include Hadoop because of its rising popularity, and because of the wide range of storage architectures that can take advantage of it. By default, you will have three copies of your data on three different nodes. Changes are replicated to each, so in a sense it can be treated as a clustered file system. Hadoop does, however, have a single point of failure: the name node, which keeps track of all file system metadata.
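The Hadoop trade-off, three copies of the data but one keeper of the map, can be shown with a short Python sketch. The NameNode class and node names here are hypothetical, and the round-robin placement is a deliberate simplification rather than Hadoop's actual placement policy.

```python
import itertools


class NameNode:
    """Tracks which data nodes hold each block; the only holder of this map."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_map = {}  # block id -> list of node names
        self._cursor = itertools.cycle(self.data_nodes)

    def place(self, block_id):
        # Simplified round-robin placement, not the real placement policy.
        targets = [next(self._cursor) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

    def readable_from(self, block_id, failed):
        # A block stays readable as long as any replica is on a live node.
        return [n for n in self.block_map[block_id] if n not in failed]


nn = NameNode(["dn1", "dn2", "dn3", "dn4", "dn5"])
first = nn.place("blk_0")
second = nn.place("blk_1")
survivors = nn.readable_from("blk_0", failed={"dn1"})
print(first, second, survivors)
```

Losing a data node leaves two readable replicas, but if the object holding block_map disappears, nobody knows where any block lives. That is the single point of failure in miniature.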
Having too many choices is never a bad thing. Your implementation goals will dictate which clustered or distributed file system and storage architecture you choose. All of the mentioned file systems work very well, assuming they are used as intended.
Charlie Schluting is the author of Network Ninja, a must-read for every network engineer.