An Analysis of Error Behavior in a Large Storage System

Nisha Talagala and David Patterson

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-99-1042
February 1999

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1042.pdf

This paper analyzes the error behavior of a 3.2TB disk storage system. We report reliability data for 18 months of the prototype's operation, and analyze 6 months of error logs from nodes in the prototype. We found that the disk drives were among the most reliable components in the system. We were also able to divide errors into eleven categories, comprising disk errors, network errors, and SCSI errors that appeared repeatedly across all nodes. We also gained insight into the types of error messages reported by devices in various conditions, and the effects of these events on the operating system. We also present data from four cases of disk drive failures. These results and insights should be useful to any designer of a fault-tolerant storage system.


BibTeX citation:

@techreport{Talagala:CSD-99-1042,
    Author = {Talagala, Nisha and Patterson, David},
    Title = {An Analysis of Error Behavior in a Large Storage System},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1999},
    Month = {Feb},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5367.html},
    Number = {UCB/CSD-99-1042},
    Abstract = {This paper analyzes the error behavior of a 3.2TB disk storage system. We report reliability data for 18 months of the prototype's operation, and analyze 6 months of error logs from nodes in the prototype. We found that the disk drives were among the most reliable components in the system. We were also able to divide errors into eleven categories, comprising disk errors, network errors, and SCSI errors that appeared repeatedly across all nodes. We also gained insight into the types of error messages reported by devices in various conditions, and the effects of these events on the operating system. We also present data from four cases of disk drive failures. These results and insights should be useful to any designer of a fault-tolerant storage system.}
}

EndNote citation:

%0 Report
%A Talagala, Nisha
%A Patterson, David
%T An Analysis of Error Behavior in a Large Storage System
%I EECS Department, University of California, Berkeley
%D 1999
%@ UCB/CSD-99-1042
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5367.html
%F Talagala:CSD-99-1042