Fault Tolerance for VLSI Multicomputers

Yuval Tamir

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-86-256

August 1986

The performance requirements of future high-end computers will only be met by systems that facilitate the exploitation of the parallelism inherent in the algorithms that they execute. One such system is a multicomputer that consists of hundreds or thousands of VLSI computation nodes interconnected by dedicated links. Some important applications of high-end computers, such as weather forecasting, require continuous correct operation for many hours. This requirement can only be met if the system is fault-tolerant, i.e., can continue to operate correctly despite the failure of some of its components. This dissertation investigates the use of fault tolerance techniques to increase the reliability of VLSI multicomputers. Different techniques are evaluated in the context of the entire system, its implementation technology, and intended applications. A proposed fault tolerance scheme combines hardware that performs error detection and system-level protocols for error recovery and fault treatment. Practical design and implementation tradeoffs are discussed.

A fault-tolerant system must identify erroneous information produced by faulty hardware. It is shown that a high probability of error detection can be achieved with self-checking nodes implemented using duplication and comparison. The requirements for detecting errors caused by hardware faults are: (1) the comparator is fault-free, and (2) the functional modules never produce identical incorrect outputs. Requirement (1) is fulfilled with a self-testing comparator that signals its own faults during normal operation. An implementation of such a comparator using MOS PLAs is discussed. Requirement (2) is fulfilled with two modules that are implemented differently so that, although they perform identical functions, they have a low probability of failing simultaneously in exactly the same way. Low-cost techniques for implementing such modules are presented.
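As a rough software analogy of duplication and comparison, the sketch below runs two differently written implementations of the same function and compares their outputs. It is an illustration only, not the hardware design from the dissertation: the names `checksum_a`, `checksum_b`, `self_checking`, `MismatchError`, and the `inject_fault` hook are all hypothetical, and the comparator here is simply assumed to be fault-free rather than self-testing.

```python
from typing import Callable, Optional, Sequence


class MismatchError(Exception):
    """Raised when the two module copies produce different outputs."""


def checksum_a(data: Sequence[int]) -> int:
    # First implementation: accumulate one element at a time.
    total = 0
    for x in data:
        total = (total + x) % 65536
    return total


def checksum_b(data: Sequence[int]) -> int:
    # Second implementation of the same function, written differently
    # so the two copies are less likely to fail in the same way.
    return sum(data) % 65536


def self_checking(
    module_a: Callable[[Sequence[int]], int],
    module_b: Callable[[Sequence[int]], int],
    inject_fault: Optional[Callable[[int], int]] = None,
) -> Callable[[Sequence[int]], int]:
    """Build a node that runs both modules and compares their outputs.

    inject_fault is a hypothetical test hook that corrupts the output
    of module_a to simulate a hardware fault in one copy.
    """

    def node(data: Sequence[int]) -> int:
        out_a = module_a(data)
        out_b = module_b(data)
        if inject_fault is not None:
            out_a = inject_fault(out_a)
        # Comparator: assumed fault-free in this sketch.
        if out_a != out_b:
            raise MismatchError(f"outputs differ: {out_a} != {out_b}")
        return out_a

    return node
```

A fault-free node returns the agreed result, while a node built with `inject_fault=lambda v: v ^ 1` raises `MismatchError`, because a single-bit corruption in one copy makes the outputs differ. A fault that corrupted both copies identically would go undetected, which is why requirement (2) matters.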

The detection of an error implies that the state of the system has been corrupted. In order to recover from the error and resume correct operation, a valid system state must be restored. A low-overhead, application-transparent error recovery scheme for multicomputers is presented. It involves periodic checkpointing of the entire system state, using protocols that ensure that the saved states of all the nodes are consistent, and rolling back to the last checkpoint when an error is detected.
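The checkpoint-and-rollback idea can be sketched in a small single-process simulation. This is a toy model under strong assumptions, not the protocol from the dissertation: all node states are copied in one atomic snapshot, so consistency between nodes is trivial here, whereas the real scheme needs protocols to achieve it across nodes. The `System` class, the `run` function, and the `transient_fault_at` parameter are hypothetical names introduced for illustration.

```python
import copy
from typing import Dict, List, Set


class System:
    """Toy multicomputer: each node holds a small dictionary of state."""

    def __init__(self, num_nodes: int) -> None:
        self.states: List[Dict[str, int]] = [
            {"step": 0, "value": 0} for _ in range(num_nodes)
        ]
        # Take an initial checkpoint so a rollback is always possible.
        self._saved = copy.deepcopy(self.states)

    def checkpoint(self) -> None:
        # Assumption: every node is snapshotted at the same instant,
        # so the saved states are consistent by construction.
        self._saved = copy.deepcopy(self.states)

    def rollback(self) -> None:
        # Restore every node to the last saved state.
        self.states = copy.deepcopy(self._saved)

    def step(self) -> None:
        # One unit of deterministic work on every node.
        for i, state in enumerate(self.states):
            state["step"] += 1
            state["value"] += i + 1


def run(system: System, total_steps: int, interval: int,
        transient_fault_at: Set[int]) -> int:
    """Run to total_steps, checkpointing every `interval` steps.

    An error is "detected" the first time a step listed in
    transient_fault_at completes; the system then rolls back and
    re-executes. Returns the number of rollbacks performed.
    """
    pending = set(transient_fault_at)
    rollbacks = 0
    while system.states[0]["step"] < total_steps:
        system.step()
        current = system.states[0]["step"]
        if current in pending:
            pending.discard(current)  # transient: does not recur
            system.rollback()
            rollbacks += 1
            continue
        if current % interval == 0:
            system.checkpoint()
    return rollbacks
```

With three nodes, ten steps, a checkpoint interval of four, and a transient fault at step six, the run performs one rollback to the step-four checkpoint and ends in the same final state as a fault-free run. The application code in `step` is unaware of checkpoints, which loosely mirrors the application-transparent property described above.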

Advisor: Carlo H. Séquin

BibTeX citation:

@phdthesis{Tamir:CSD-86-256,
Author= {Tamir, Yuval},
Title= {Fault Tolerance for VLSI Multicomputers},
School= {EECS Department, University of California, Berkeley},
Year= {1986},
Month= {Aug},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1986/6080.html},
Number= {UCB/CSD-86-256},
Abstract= {The performance requirements of future high-end computers will only be met by systems that facilitate the exploitation of the parallelism inherent in the algorithms that they execute. One such system is a multicomputer that consists of hundreds or thousands of VLSI computation nodes interconnected by dedicated links. Some important applications of high-end computers, such as weather forecasting, require continuous correct operation for many hours. This requirement can only be met if the system is fault-tolerant, i.e., can continue to operate correctly despite the failure of some of its components. This dissertation investigates the use of fault tolerance techniques to increase the reliability of VLSI multicomputers. Different techniques are evaluated in the context of the entire system, its implementation technology, and intended applications. A proposed fault tolerance scheme combines hardware that performs error detection and system-level protocols for error recovery and fault treatment. Practical design and implementation tradeoffs are discussed. A fault-tolerant system must identify erroneous information produced by faulty hardware. It is shown that a high probability of error detection can be achieved with self-checking nodes implemented using duplication and comparison. The requirements for detecting errors caused by hardware faults are: (1) the comparator is fault-free, and (2) the functional modules never produce identical incorrect outputs. Requirement (1) is fulfilled with a self-testing comparator that signals its own faults during normal operation. An implementation of such a comparator using MOS PLAs is discussed. Requirement (2) is fulfilled with two modules that are implemented differently so that, although they perform identical functions, they have a low probability of failing simultaneously in exactly the same way. Low-cost techniques for implementing such modules are presented. The detection of an error implies that the state of the system has been corrupted. In order to recover from the error and resume correct operation, a valid system state must be restored. A low-overhead, application-transparent error recovery scheme for multicomputers is presented. It involves periodic checkpointing of the entire system state, using protocols that ensure that the saved states of all the nodes are consistent, and rolling back to the last checkpoint when an error is detected.},
}

EndNote citation:

%0 Thesis
%A Tamir, Yuval 
%T Fault Tolerance for VLSI Multicomputers
%I EECS Department, University of California, Berkeley
%D 1986
%@ UCB/CSD-86-256
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1986/6080.html
%F Tamir:CSD-86-256