Networking

Fault Tolerance

As far as computers are concerned, fault tolerance refers to the capability of the computer system or network to provide continued data availability in the event of hardware failure. Every component within a server, from CPU fan to power supply, has a chance of failure. Some components such as processors rarely fail, whereas hard disk failures are well documented.

Almost every component has fault-tolerant measures. These measures typically require redundant hardware components that can easily or automatically take over when there is a hardware failure.

Of all the components inside computer systems, the one that requires the most redundancy are the hard disks. Not only are hard disk failures more common than any other component but they also maintain the data, without which there would be little need for a network.

Hard Disks Are Half the Problem

In fact, according to recent research, hard disks are responsible for one of every two server hardware failures. This is an interesting statistic to think about.

Disk-level Fault Tolerance

Making the decision to have hard disk fault tolerance on the server is the first step; the second is deciding which fault-tolerant strategy to use.

Hard disk fault tolerance is implemented according to different RAID (redundant array of inexpensive disks) levels. Each RAID level offers differing amounts of data protection and performance.

The RAID level appropriate for a given situation depends on the importance placed on the data, the difficulty of replacing that data, and the associated costs of a respective RAID implementation. Oftentimes, the cost of data loss and replacement outweigh the costs associated with implementing a strong RAID fault-tolerant solution.