System Support for Software Fault Tolerance in Highly Available Database Management Systems

Mark P. Sullivan

EECS Department, University of California, Berkeley

Technical Report No. UCB/ERL M93/5

1993

Today, software errors are the leading cause of outages in fault-tolerant systems. System availability can be improved despite software errors by fast error detection and recovery techniques that minimize total downtime following an outage. This dissertation analyzes software errors in three commercial systems and describes the implementation and evaluation of several techniques for error detection and fast recovery in a database management system (DBMS). The software error study examines errors reported by customers in three IBM systems programs: the MVS operating system, the IMS DBMS, and the DB2 DBMS. The study classifies errors by the type of coding mistake and the circumstances in the customer's environment that caused the error to arise. It observes a higher availability impact from addressing errors, such as uninitialized pointers, than from software errors as a whole. It also details the frequencies and types of addressing errors and characterizes the damage they do. The error detection work evaluates the use of hardware write protection both to detect addressing-related errors quickly and to limit the damage that can occur after a software error. System calls added to the operating system allow the DBMS to guard (write-protect) some of its internal data structures. Guarding DBMS data provides quick detection of corrupted pointers and similar software errors. Data structures can be guarded as long as correct software is given a means to temporarily unprotect the data structures before updates. The dissertation analyzes the effects of three different update models on performance, software complexity, and error protection. To improve DBMS recovery time, previous work on the POSTGRES DBMS has suggested using a storage system based on no-overwrite techniques instead of write-ahead log processing. The dissertation describes modifications to the storage system that improve its performance in environments with high update rates.
Analysis shows that, with these modifications and some non-volatile RAM, the I/O requirements of POSTGRES running a TP1 benchmark will be the same as those of a conventional system, despite the POSTGRES force-at-commit buffer management policy. The dissertation also presents an extension to POSTGRES to support the fast recovery of communication links between the DBMS and its clients. Finally, the dissertation adds to the fast recovery capabilities of POSTGRES with two techniques for maintaining B-tree index consistency without log processing. One technique is similar to shadow paging, but improves performance by integrating shadow meta-data with index meta-data. The other technique uses a two-phase page reorganization scheme to reduce the space overhead caused by shadow paging. Measurements of a prototype implementation and estimates of the effect of the algorithms on large trees show that they will have limited impact on data manager performance.

Advisor: Michael R. Stonebraker

BibTeX citation:

@phdthesis{Sullivan:M93/5,
Author= {Sullivan, Mark P.},
Title= {System Support for Software Fault Tolerance in Highly Available Database Management Systems},
School= {EECS Department, University of California, Berkeley},
Year= {1993},
Month= {Jan},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1993/2268.html},
Number= {UCB/ERL M93/5},
Abstract= {Today, software errors are the leading cause of outages in fault-tolerant systems. System availability can be improved despite software errors by fast error detection and recovery techniques that minimize total downtime following an outage. This dissertation analyzes software errors in three commercial systems and describes the implementation and evaluation of several techniques for error detection and fast recovery in a database management system (DBMS). The software error study examines errors reported by customers in three IBM systems programs: the MVS operating system, the IMS DBMS, and the DB2 DBMS. The study classifies errors by the type of coding mistake and the circumstances in the customer's environment that caused the error to arise. It observes a higher availability impact from addressing errors, such as uninitialized pointers, than from software errors as a whole. It also details the frequencies and types of addressing errors and characterizes the damage they do. The error detection work evaluates the use of hardware write protection both to detect addressing-related errors quickly and to limit the damage that can occur after a software error. System calls added to the operating system allow the DBMS to guard (write-protect) some of its internal data structures. Guarding DBMS data provides quick detection of corrupted pointers and similar software errors. Data structures can be guarded as long as correct software is given a means to temporarily unprotect the data structures before updates. The dissertation analyzes the effects of three different update models on performance, software complexity, and error protection. To improve DBMS recovery time, previous work on the POSTGRES DBMS has suggested using a storage system based on no-overwrite techniques instead of write-ahead log processing. The dissertation describes modifications to the storage system that improve its performance in environments with high update rates.
Analysis shows that, with these modifications and some non-volatile RAM, the I/O requirements of POSTGRES running a TP1 benchmark will be the same as those of a conventional system, despite the POSTGRES force-at-commit buffer management policy. The dissertation also presents an extension to POSTGRES to support the fast recovery of communication links between the DBMS and its clients. Finally, the dissertation adds to the fast recovery capabilities of POSTGRES with two techniques for maintaining B-tree index consistency without log processing. One technique is similar to shadow paging, but improves performance by integrating shadow meta-data with index meta-data. The other technique uses a two-phase page reorganization scheme to reduce the space overhead caused by shadow paging. Measurements of a prototype implementation and estimates of the effect of the algorithms on large trees show that they will have limited impact on data manager performance.},
}

EndNote citation:

%0 Thesis
%A Sullivan, Mark P. 
%T System Support for Software Fault Tolerance in Highly Available Database Management Systems
%I EECS Department, University of California, Berkeley
%D 1993
%@ UCB/ERL M93/5
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1993/2268.html
%F Sullivan:M93/5