A Method and Tool for Analyzing Fault-Tolerance in Systems
Abstract:
As computers are integrated into systems that have stringent
fault-tolerance requirements, there is a growing need for techniques to
establish that these systems actually satisfy those requirements.
Informal arguments do not supply the desired level of assurance for
critical systems. This dissertation presents a rigorous, automated
approach to analyzing distributed systems, with a focus on checking
fault-tolerance requirements, and describes a prototype implementation
of the analysis. The analysis is a novel hybrid of ideas from
stream-processing semantics of networks of processes, abstract
interpretation of programs, and symbolic computation. The underlying
principles of the analysis method are general, but specialized
techniques---such as the use of perturbations to represent changes to
normal behavior caused by failures---are developed to deal efficiently
with the types of systems and requirements that arise in establishing
fault-tolerance. The method is illustrated with three examples: the
Oral Messages algorithm for Byzantine Agreement, due to Lamport, Shostak
and Pease, a standard protocol for FIFO reliable broadcast, and a
(subtly) flawed protocol for fault-tolerant moving agents.
Scott Stoller's Home Page