Improving Software Fault Tolerance in Highly Available Database Systems

M. Sullivan and Michael Stonebraker

EECS Department, University of California, Berkeley

Technical Report No. UCB/ERL M90/11

February 1990

Software errors often damage the transient state of a transaction processing system (TPS) without causing the system to fail immediately. We propose several techniques to increase the chance of detecting latent software errors before disaster occurs. The same techniques can improve recovery speed by making non-volatile memory a more practical medium for permanent storage. These techniques include: (1) using hardware write protection to guard data in the database buffer pool from errors, (2) using a shadow-paging scheme to reduce the chance that an erring transaction propagates errors to correct pages, and (3) inserting an artificial delay between the time a transaction completes its work and the time it is considered committed. Because of the delay, errors may remain undetected for a longer time without causing irrecoverable damage. Simulations show these techniques reduce transaction throughput by as little as one to seven percent. An analytic model estimates reliability improvements given several possible models of errors. Our proposal also outlines the software fault tolerance concerns in designing a data manager that writes log records to non-volatile memory on commit instead of disk.
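
The delayed-commit idea in technique (3) can be illustrated with a minimal sketch. It is not the report's implementation: the class name `DelayedCommitQueue`, the use of logical ticks for time, and the policy of aborting every transaction still inside the delay window when an error is detected are all assumptions made for illustration. The sketch also assumes transactions finish in non-decreasing tick order.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DelayedCommitQueue:
    """Hypothetical helper: holds finished transactions for `delay` ticks."""
    delay: int                                        # assumed commit delay, in logical ticks
    pending: deque = field(default_factory=deque)     # (txn_id, finish_tick) pairs
    committed: list = field(default_factory=list)
    aborted: list = field(default_factory=list)

    def finish(self, txn_id, now):
        """Record that a transaction completed its work at tick `now`."""
        self.pending.append((txn_id, now))

    def advance(self, now):
        """Commit each pending transaction whose full delay has elapsed."""
        while self.pending and now - self.pending[0][1] >= self.delay:
            txn_id, _ = self.pending.popleft()
            self.committed.append(txn_id)

    def error_detected(self, now):
        """Assumed policy: abort everything still inside the delay window."""
        self.advance(now)  # transactions already past the window stay committed
        while self.pending:
            txn_id, _ = self.pending.popleft()
            self.aborted.append(txn_id)


q = DelayedCommitQueue(delay=3)
q.finish("t1", now=0)
q.finish("t2", now=2)
q.advance(now=3)         # t1 has waited 3 ticks and commits; t2 is still pending
q.finish("t3", now=4)
q.error_detected(now=4)  # t2 and t3 are still in the window, so both abort
```

In this sketch a longer `delay` widens the window in which a latent error can be caught before its effects become permanent, at the cost of holding each transaction longer before it counts as committed.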

BibTeX citation:

@techreport{Sullivan:M90/11,
    Author= {Sullivan, M. and Stonebraker, Michael},
    Title= {Improving Software Fault Tolerance in Highly Available Database Systems},
    Year= {1990},
    Month= {Feb},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1990/1411.html},
    Number= {UCB/ERL M90/11},
    Abstract= {Software errors often damage the transient state of a transaction processing system (TPS) without causing the system to fail immediately. We propose several techniques to increase the chance of detecting latent software errors before disaster occurs. The same techniques can improve recovery speed by making non-volatile memory a more practical medium for permanent storage. These techniques include: (1) using hardware write protection to guard data in the database buffer pool from errors, (2) using a shadow-paging scheme to reduce the chance that an erring transaction propagates errors to correct pages, and (3) inserting an artificial delay between the time a transaction completes its work and the time it is considered committed. Because of the delay, errors may remain undetected for a longer time without causing irrecoverable damage. Simulations show these techniques reduce transaction throughput by as little as one to seven percent. An analytic model estimates reliability improvements given several possible models of errors. Our proposal also outlines the software fault tolerance concerns in designing a data manager that writes log records to non-volatile memory on commit instead of disk.},
}

EndNote citation:

%0 Report
%A Sullivan, M. 
%A Stonebraker, Michael 
%T Improving Software Fault Tolerance in Highly Available Database Systems
%I EECS Department, University of California, Berkeley
%D 1990
%@ UCB/ERL M90/11
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1990/1411.html
%F Sullivan:M90/11